1. Document Processing Scope
• Process PDF documents containing scanned images or embedded text.
• Extract structured and unstructured data such as tables, text blocks, headers, and metadata.
• Handle multi-page documents and ensure accurate parsing across various PDF formats.
2. OCR Implementation
• Use Optical Character Recognition (OCR) to recognize text from scanned PDFs.
• Support multilingual text extraction with high accuracy, including special characters and numbers.
3. Data Structuring and Formatting
• Extract data in a clean, structured format: JSON, CSV, or database-ready format.
• Handle tables, grouped data, and hierarchical content effectively.
• Include advanced post-processing to clean noise, align extracted data, and ensure readability.
4. Accuracy and Error Handling
• Implement preprocessing steps to improve OCR accuracy, including image enhancement, noise reduction, and text alignment.
• Provide error logs for unrecognized data or formatting inconsistencies.
5. Scalability and Performance
• Ensure the system processes large PDF files efficiently without performance degradation.
• Optimize execution time for batch document processing.
6. Output Formats and Storage
• Support export options for parsed data, such as:
  • JSON or CSV files for easy integration.
  • Direct saving into relational databases like MariaDB.
• Automate saving outputs with customizable naming conventions.
7. Integration with Other Tools
• Allow integration with cloud storage platforms like Google Drive or Dropbox for PDF uploads.
• Provide APIs for seamless integration with other systems.
8. User-Friendly Execution
• Develop a command-line or GUI-based tool for initiating parsing tasks.
• Provide progress tracking and reporting for large file batches.
1. Technology Stack
• Python Libraries: Use Tesseract OCR (via pytesseract), OpenCV for preprocessing, and pdf2image for PDF-to-image conversion.
• Data Processing: Pandas for table structuring and cleanup.
• Output Storage: JSON, CSV, and MariaDB for structured data saving.
2. OCR Workflow
• Convert PDFs to high-quality images for OCR processing.
• Apply preprocessing techniques like grayscale conversion, noise removal, and adaptive thresholding to enhance OCR accuracy.
• Extract text and tables separately, organizing them into a structured format.
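For illustration, a minimal sketch of this workflow using the stack named above (pdf2image, OpenCV, pytesseract); the DPI, thresholding parameters, and language code are assumptions to tune per document set:

```python
import cv2
import numpy as np
import pytesseract
from pdf2image import convert_from_path

def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> list[str]:
    """Convert each PDF page to an image, preprocess it, and OCR it."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # requires poppler installed
    texts = []
    for page in pages:
        img = cv2.cvtColor(np.array(page), cv2.COLOR_RGB2GRAY)  # grayscale
        img = cv2.medianBlur(img, 3)                            # noise removal
        img = cv2.adaptiveThreshold(                            # adaptive thresholding
            img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 11
        )
        texts.append(pytesseract.image_to_string(img, lang=lang))
    return texts
```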
3. Data Structuring
• Parse the extracted text to identify tables, headings, and content blocks.
• Structure the data into JSON or CSV formats for easy consumption.
• Validate and clean data to remove noise or misrecognized characters.
4. Batch Processing and Scalability
• Enable batch processing for multiple PDFs with automated naming and progress tracking.
• Optimize memory usage to handle large PDF files efficiently.
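The batch loop could be as simple as the following sketch, which reuses the ocr_pdf helper above; the JSON output naming convention is a placeholder:

```python
import json
from pathlib import Path

def process_batch(input_dir: str, output_dir: str) -> None:
    pdfs = sorted(Path(input_dir).glob("*.pdf"))
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i, pdf in enumerate(pdfs, start=1):
        print(f"[{i}/{len(pdfs)}] processing {pdf.name}")  # progress tracking
        pages = ocr_pdf(str(pdf))  # helper from the OCR workflow sketch
        # automated naming: <original-stem>.json in the output directory
        payload = {"source": pdf.name, "pages": pages}
        (out / f"{pdf.stem}.json").write_text(json.dumps(payload, ensure_ascii=False))
```

For very large files, pdf2image's first_page/last_page arguments allow converting pages in chunks so memory stays bounded.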
5. Error Logging and Monitoring
• Implement error logging for failed OCR attempts or parsing issues.
• Provide reports highlighting extraction accuracy and problematic pages for manual review.
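A possible shape for the logging side, using Python's standard logging module (the log file name is a placeholder):

```python
import logging

logging.basicConfig(
    filename="parser_errors.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

try:
    pages = ocr_pdf("sample.pdf")  # helper from the OCR workflow sketch
except Exception:
    # failed OCR attempts are recorded with a full traceback for manual review
    logging.exception("OCR failed for sample.pdf")
```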
6. Integration and Automation
• Integrate with storage solutions like Google Drive for file uploads and automated triggers.
• Develop a REST API for external applications to send PDFs for processing and receive parsed outputs.
7. Output Delivery
• Automate saving the parsed data in structured formats like JSON/CSV.
• Push data directly into a MariaDB database for further analysis and use.
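Pushing the structured output into MariaDB could look like this sketch; the connection string, table name, and schema are placeholders (assumes SQLAlchemy with the pymysql driver, which speaks MariaDB's MySQL-compatible protocol):

```python
import pandas as pd
from sqlalchemy import create_engine

# placeholder credentials and database name
engine = create_engine("mysql+pymysql://user:password@localhost:3306/parsed_docs")

# a toy frame standing in for the cleaned extraction output
df = pd.DataFrame({"source": ["sample.pdf"], "page": [1], "text": ["example"]})

# append rows into a table, creating it on first run
df.to_sql("extracted_text", engine, if_exists="append", index=False)
```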
8. Testing and Optimization
• Conduct testing with diverse document types to ensure high OCR accuracy.
• Optimize the workflow for speed, accuracy, and robustness.
Project Summary
For this project, I am looking for a freelance developer to create a Python script that downloads data via an API and then manipulates that data through Google Sheets.
Detailed Project Requirements
Step 1 – navigating to the mastersheet
First, please navigate to the following Google Drive location:
In this folder, you will see the file “input-edited-w-dates”. This sheet contains a large dataset with 26 columns.
Create a copy of this file, name the copy “input”, and place it in the following Google Drive location:
Step 2 – downloading and processing bouncebacks
The next step of the script is to create a program that interacts with the MailWizz API (https://www.mailwizz.com/) to download data and automate the creation of CSV files for that data.
API Documentation > https://api-docs.mailwizz.com/#introduction
Understanding how MailWizz works
MailWizz is an email delivery application used to create email campaigns. If you are unfamiliar with MailWizz, please watch the following videos to familiarize yourself with how the application works:
Introduction - https://www.youtube.com/watch?v=jTaYTjevsog
Creating a subscriber list - https://www.youtube.com/watch?v=VSPxmUvdSxI
Creating Campaigns - https://www.youtube.com/watch?v=gol9NaRWqSs&t=488s
Sending Campaigns - https://www.youtube.com/watch?v=aU9_m5M6cmU
You can watch more videos via the official MailWizz YouTube account if you wish: https://www.youtube.com/c/mailwizz/videos.
Please take a look through the API documentation by following this link https://api-docs.mailwizz.com/?python#introduction.
Please find the login details and API key needed to log in to my MailWizz account, and to authenticate against its API, in the following Google Sheet:
For this project, looking at the API documentation, we are most interested in the “Campaigns” and “Campaign Bounces” commands. MailWizz is used to create email campaigns that contact a large number of email addresses; however, some emails cannot be contacted. These are labelled by MailWizz as “Bouncing Emails”, as the emails are delivered but then “bounce” back, so to speak.
For each campaign created in MailWizz, you can use the API to retrieve the bouncing email data. To do this, you must first retrieve the Campaign Unique ID, which is then used in the Campaign Bounces > Get All Bounces API command.
The Python program we are creating needs to do the following:
Step 2: Sub-step 1
· Using the last 7 days from the time the program launches as a date range, get the unique IDs of all campaigns with a start date that falls within that date range.
· For each campaign, use its unique ID to run the “Get All Bounces” API command and download all of its email bounces.
· Store the data from all campaigns in a CSV file named “bounces” in the program’s working directory.
· Insert a column in this CSV file titled “campaignName”, and populate it with the name of the campaign to which each email bounce belongs.
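A rough sketch of this sub-step with the requests library. The base URL and key come from the credentials sheet above; the endpoint paths, the X-API-KEY header, and the response field names (send_at, campaign_uid, and the bounce record fields) are read from the linked API docs and should be verified against the account's MailWizz version. Pagination beyond the first page is omitted for brevity.

```python
import csv
from datetime import datetime, timedelta

import requests

API_URL = "https://your-mailwizz-domain/api"  # placeholder
HEADERS = {"X-API-KEY": "YOUR-API-KEY"}       # auth header; verify per the docs

def get_records(path: str) -> list[dict]:
    resp = requests.get(f"{API_URL}{path}", headers=HEADERS,
                        params={"page": 1, "per_page": 100}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["records"]

def recent_campaigns(days: int = 7) -> list[dict]:
    """Campaigns whose start date falls within the last `days` days."""
    cutoff = datetime.now() - timedelta(days=days)
    return [c for c in get_records("/campaigns")
            if datetime.strptime(c["send_at"], "%Y-%m-%d %H:%M:%S") >= cutoff]

def write_bounces() -> None:
    with open("bounces.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["campaignName", "email", "bounceType", "message", "dateAdded"])
        for c in recent_campaigns():
            for b in get_records(f"/campaigns/{c['campaign_uid']}/bounces"):
                writer.writerow([
                    c["name"],  # the inserted campaignName column
                    b.get("subscriber", {}).get("email", ""),
                    b.get("bounce_type", ""),
                    b.get("message", ""),
                    b.get("date_added", ""),
                ])

write_bounces()
```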
Step 2: Sub-step 2
Once the “bounces” CSV file has been created, it will have 5 columns:
· campaignName
· Email
· Bounce type
· Message
· Date added
We need to manipulate this data, reducing it to 3 columns, as seen in the “bounces-edited.csv” file. The manipulations needed are as follows:
For each row seen in “bounces”:
· Take the value in column C and surround it with square brackets – [ ].
· Concatenate this value with the adjacent value in column D.
· Delete all columns except A and B and place this new concatenated value in column C for all rows. Name column C “Reason”.
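A sketch of these manipulations with pandas, assuming the five columns arrive in the order listed above (the space separating the bracketed type from the message is an assumption):

```python
import pandas as pd

df = pd.read_csv("bounces.csv")
cols = list(df.columns)  # A=campaignName, B=email, C=bounce type, D=message, E=date

# column C wrapped in square brackets, concatenated with column D
reason = "[" + df[cols[2]].astype(str) + "] " + df[cols[3]].astype(str)

# keep only columns A and B, then add the combined value as column C, "Reason"
edited = df[[cols[0], cols[1]]].copy()
edited["Reason"] = reason
edited.to_csv("bounces-edited.csv", index=False)
```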
Step 3 – manipulate the output from step 2
Now, open the “bounces-edited” file. This file has 3 columns. Column B contains emails.
For every email in “bounces-edited” column B, if the email can be found in the “input” file’s column H, copy the corresponding value from “bounces-edited” column C into the adjacent cell in “input” column N.
In “input”, create a new sheet called “DOMAINS&MESSAGES” and copy all values from columns E, G, Y, I and N of the sheet1 tab into columns A, B, C, D and E respectively of the DOMAINS&MESSAGES tab, naming the columns “companyName”, “emailDomain”, “domainFormat”, “status” and “bounceMessage” respectively.
In DOMAINS&MESSAGES, de-duplicate the table rows based on email addresses.
In DOMAINS&MESSAGES, for every row that has a value in column A but no value in the adjacent column E cell, populate the column E cell with the value “delivered”.
In DOMAINS&MESSAGES, create a duplicate of column E and place it to the right in column F, and name it “shortBounceMessage”.
In DOMAINS&MESSAGES, for all cells in column F, delete all characters after the first “]” value in the cell, to trim all messages down to [Hard], [Internal] or [Soft] without any remaining text.
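These sheet edits could be driven with gspread and pandas: pull both datasets into DataFrames, apply the rules in memory, and write the result back. A partial sketch covering the column-N update and the creation of DOMAINS&MESSAGES (the service-account key file is a placeholder; the de-duplication, “delivered” fill, and shortBounceMessage steps follow the same DataFrame pattern):

```python
import gspread
import pandas as pd

gc = gspread.service_account(filename="service-account.json")  # placeholder key

sheet1 = gc.open("input").worksheet("sheet1")
input_df = pd.DataFrame(sheet1.get_all_records())  # column order follows the header row
bounces = pd.read_csv("bounces-edited.csv")

# map each bounced email (bounces column B) to its Reason (column C)...
reason_by_email = dict(zip(bounces.iloc[:, 1], bounces.iloc[:, 2]))

# ...and copy it into input column N wherever input column H matches
email_col, target_col = input_df.columns[7], input_df.columns[13]  # H and N
mask = input_df[email_col].isin(reason_by_email)
input_df.loc[mask, target_col] = input_df.loc[mask, email_col].map(reason_by_email)

# DOMAINS&MESSAGES: columns E, G, Y, I, N of sheet1, renamed
dm = input_df.iloc[:, [4, 6, 24, 8, 13]].copy()
dm.columns = ["companyName", "emailDomain", "domainFormat", "status", "bounceMessage"]

dm_ws = gc.open("input").add_worksheet("DOMAINS&MESSAGES",
                                       rows=len(dm) + 1, cols=len(dm.columns))
dm_ws.update([dm.columns.tolist()] + dm.astype(str).values.tolist())
```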
Now, create a new tab in the “input” file called “SUMMARY”, and build a table that displays data from DOMAINS&MESSAGES in the following columns:
A. Each unique company name.
B. Each unique domain.
C. The domain format.
D. Status
E. Total number of [Hard] messages.
F. Total number of [Internal] messages.
G. Total number of [Soft] messages.
H. Total number of Delivered values.
I. Total number of [Hard], [Internal] and [Soft] messages.
J. Total number of Delivered values, [Hard], [Internal] and [Soft] messages.
K. The total number of Delivered values divided by the sum of all Delivered, [Hard], [Internal] and [Soft] messages (i.e. column H divided by column J).
Name the columns as shown in the mastersheet.xlsx file:
A. CompanyName
B. Domain
C. Domain Format
D. Status
E. [Hard]
F. [Internal]
G. [Soft]
H. Total Delivered
I. Total Bounced
J. Total Attempted
K. EmailDeliverySuccess%
However, any row that has a value of 0 in the “status” column of DOMAINS&MESSAGES must not be displayed in the SUMMARY tab. Additionally, for each such status-0 row, if any row in the SUMMARY tab contains the corresponding “emailDomain” value in its “Domain” column, delete that SUMMARY row as well.
Finally, insert a new column to the right of column A in the SUMMARY tab, and title this column as “Registrations”, and assign a value of 0 to all cells in the table.
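The SUMMARY table is essentially a group-and-count over DOMAINS&MESSAGES. A pandas sketch, continuing from the dm DataFrame above; it assumes status values compare as strings and that empty bounce messages mark delivered rows, per the rules above:

```python
import pandas as pd

# dm: the DOMAINS&MESSAGES DataFrame from the previous sketch
dm["bounceMessage"] = dm["bounceMessage"].replace("", pd.NA).fillna("delivered")
# shortBounceMessage: drop everything after the first "]"
dm["shortBounceMessage"] = dm["bounceMessage"].astype(str).str.replace(
    r"\].*$", "]", regex=True)

summary = (dm.groupby(["companyName", "emailDomain", "domainFormat", "status"])
             ["shortBounceMessage"]
             .agg(hard=lambda s: (s == "[Hard]").sum(),
                  internal=lambda s: (s == "[Internal]").sum(),
                  soft=lambda s: (s == "[Soft]").sum(),
                  delivered=lambda s: (s == "delivered").sum())
             .reset_index())

summary["bounced"] = summary[["hard", "internal", "soft"]].sum(axis=1)
summary["attempted"] = summary["bounced"] + summary["delivered"]
summary["success"] = summary["delivered"] / summary["attempted"]

# exclude status-0 rows, and any row sharing a domain with a status-0 row
bad = summary["status"].astype(str) == "0"
bad_domains = set(summary.loc[bad, "emailDomain"])
summary = summary[~bad & ~summary["emailDomain"].isin(bad_domains)]

summary = summary.rename(columns={
    "companyName": "CompanyName", "emailDomain": "Domain",
    "domainFormat": "Domain Format", "status": "Status",
    "hard": "[Hard]", "internal": "[Internal]", "soft": "[Soft]",
    "delivered": "Total Delivered", "bounced": "Total Bounced",
    "attempted": "Total Attempted", "success": "EmailDeliverySuccess%",
})
summary.insert(1, "Registrations", 0)  # new column B, all zeros
```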
Step 4 – measuring EmailDeliverySuccess% and pausing campaigns accordingly
Once the summary page has been populated with data, the next part of the script will focus on the “EmailDeliverySuccess%”.
For each value in the EmailDeliverySuccess% column that is greater than or equal to 60%, take the “CompanyName” and “Domain Format” values for that row, then find all rows in “sheet1” of the “input” file that have a matching “companyName” value but a “domain format” value that does not equal the one picked up from the SUMMARY tab.
For all rows that meet these criteria, populate the corresponding cells in the “status” column with a 0.
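Continuing with the summary and input_df DataFrames from the sketches above, and the Step 3 column mapping (E = companyName, Y = domain format, I = status), the pause logic might look like:

```python
# SUMMARY rows at or above the 60% delivery-success threshold
good = summary[summary["EmailDeliverySuccess%"] >= 0.60]

company_col = input_df.columns[4]   # column E: companyName
format_col = input_df.columns[24]   # column Y: domain format
status_col = input_df.columns[8]    # column I: status

for _, row in good.iterrows():
    # same company, but a different domain format than the SUMMARY row
    mask = ((input_df[company_col] == row["CompanyName"])
            & (input_df[format_col] != row["Domain Format"]))
    input_df.loc[mask, status_col] = 0
```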
Step 5 – take all emails with “0” status, and unsubscribe
For each row in the “sheet1” tab that has a value of 0 in the status column, the value in the email column must be copied into the “blacklist” file found in the following Google Drive folder:
For every email added to this file, the corresponding value in the “reason” column of the “blacklist” file must be set to “obsolete”.
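A sketch of this step, continuing from input_df above; the blacklist file's column layout (email in one column, “reason” alongside it) is an assumption to confirm against the actual sheet:

```python
# rows whose status column (I) is 0; email lives in column H per Step 3
zero_rows = input_df[input_df[status_col].astype(str) == "0"]
emails = zero_rows[input_df.columns[7]].tolist()

blacklist_ws = gc.open("blacklist").sheet1
# one row per email, with the adjacent "reason" column set to "obsolete"
blacklist_ws.append_rows([[email, "obsolete"] for email in emails])
```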
Step 6 – Adding registrations to the SUMMARY tab
Please navigate to the following Google Drive location:
In this folder, you will see the “registration-log” sheet, which contains a list of domains in column E. Edit the script so that, for each domain value in column E, if the domain is also found in the “Domain” column of the “input” file’s SUMMARY tab, the value in the SUMMARY tab’s “Registrations” column is incremented by 1 for the corresponding row.
If the domain is not found in the “Domain” column of the SUMMARY tab, append the domain to the “Domain” column in a new row of the SUMMARY tab. Then, each time the domain appears in “registration-log”, increment the value in the SUMMARY tab’s “Registrations” column by 1 for the corresponding row.
Also, each time a row is processed in the “registration-log”, the corresponding value in column K for that row must be populated with the integer value 1.
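A sketch of the registration pass, continuing from the summary DataFrame above; the presence of a header row and the exact tab layout of “registration-log” are assumptions:

```python
reg_ws = gc.open("registration-log").sheet1
rows = reg_ws.get_all_values()

for idx, row in enumerate(rows[1:], start=2):  # assume row 1 is a header
    domain = row[4]  # column E
    match = summary["Domain"] == domain
    if match.any():
        summary.loc[match, "Registrations"] += 1
    else:
        # unseen domain: append a fresh SUMMARY row with Registrations = 1
        new_row = {col: 0 for col in summary.columns}
        new_row.update({"Domain": domain, "Registrations": 1})
        summary = pd.concat([summary, pd.DataFrame([new_row])], ignore_index=True)
    reg_ws.update_cell(idx, 11, 1)  # column K: mark the row as processed
```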
Step 7 – cloud-based cycle, once every 24 hours
For this project, we would like you to develop a Python script that runs once every 24 hours, 7 days a week, even when the local device is switched off; this will therefore likely need to be a cloud-based solution.
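One way to satisfy this: deploy the script to an always-on host (a small cloud VM, or a managed scheduler such as Google Cloud Scheduler triggering a cloud function) and run it daily. The stage functions below are hypothetical stand-ins for Steps 1 to 6, and the crontab line shows the assumed trigger:

```python
# crontab entry on the host, running the cycle daily at 06:00:
#   0 6 * * * /usr/bin/python3 /opt/bounce-pipeline/main.py >> /var/log/bounces.log 2>&1

# Hypothetical stage functions; each wraps one of the steps sketched above.
def copy_mastersheet() -> None: ...      # Step 1: duplicate the mastersheet as "input"
def download_bounces() -> None: ...      # Step 2: MailWizz API -> bounces-edited.csv
def build_summary() -> None: ...         # Step 3: DOMAINS&MESSAGES and SUMMARY tabs
def pause_low_performers() -> None: ...  # Step 4: zero the status of mismatched rows
def update_blacklist() -> None: ...      # Step 5: push status-0 emails to the blacklist
def add_registrations() -> None: ...     # Step 6: count registration-log domains

def main() -> None:
    """One full daily cycle: Steps 1 to 6 in order."""
    copy_mastersheet()
    download_bounces()
    build_summary()
    pause_low_performers()
    update_blacklist()
    add_registrations()

if __name__ == "__main__":
    main()
```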
Project complete