Revolutionizing PDF Data Extraction with Microsoft Document Intelligence: A Case Study
By JoeVu, at: Feb. 4, 2024, 10:19 a.m.
Introduction
Extracting structured data from PDF documents, especially tables with empty cells, presents a significant challenge. Many Python libraries and conventional methods fall short in accurately parsing these elements.
ChatGPT does not handle such tables well either.
This blog post explores a case study where traditional approaches, including Python libraries and manual edits, proved ineffective for extracting table data from a complex PDF. The breakthrough came with Microsoft Document Intelligence, an advanced solution that precisely identifies and extracts data, including handling empty table cells efficiently.
The Challenge: Extracting Data from Complex PDFs
The challenge began with a seemingly straightforward task: extract structured data from a PDF document (shown in the image below). However, the document in question was not just any PDF. It contained multiple tables with a mix of text and blank cells, a configuration that mirrors the complexity of real-world data presentation. Blank cells are common in documents intended for human readers, where visual layout and readability are prioritized over machine readability. However, they present a significant hurdle for automated data extraction, because they disrupt the straightforward mapping of table structures into digital formats.
Traditional text extraction tools are designed to parse and extract visible text, operating under the assumption that every piece of data will be explicitly represented as text within the document. This assumption falls apart when faced with tables that include blank cells, as these cells are visually recognizable to human readers but typically ignored or mishandled by automated tools. (If the empty cells were filled with a special number or character, they would be easier to work with.) The challenge, therefore, was not just about extracting text; it was about interpreting the structure of the data within the document, including understanding which cells were intentionally left blank as part of the table's layout.
In my initial attempts to tackle this challenge, I turned to several well-known Python libraries that are commonly used for PDF text extraction:
- PyPDF2: A library that allows for extracting text from PDFs, but often struggles with layout and formatting, especially in complex table structures.
- PyMuPDF (fitz): Known for its speed and ease of extracting text and images, but similarly challenged by tables with blank spaces.
- PDFMiner: Offers more granular control over the extraction process and is better at handling layout, yet still falls short with tables that include empty cells.
- PDFPlumber: Praised for its ability to extract text and tables with more accuracy regarding layout, but still not fully equipped to handle the nuances of blank spaces within tables.
Each of these tools offered a glimmer of hope, suggesting that with enough tweaking and customization, a solution might be within reach. However, I tested all of them on my PDF, and none of them handled the empty cells correctly.
This realization prompted me to explore beyond the boundaries of conventional Python libraries, seeking solutions that could bridge the gap between the visible text and the implied structure of data within the document. The journey was marked by trial and error, with each attempt providing valuable insights into the limitations of existing tools and the complexities of PDF data extraction. It was a journey that would eventually lead me to discover a solution that could not only recognize text but also understand the context and structure within which that text was presented.
Beyond Conventional Tools: Exploring Creative Solutions
The limitations of traditional data extraction methods pushed me into the realm of creative problem-solving. Recognizing that conventional tools were ill-equipped to handle blank cells in tables, I designed and implemented a series of experimental approaches aimed at overcoming this challenge.
Convert PDF to Image and OCR with PyTesseract
My first venture beyond standard libraries involved converting the problematic PDF into a high-resolution image. The rationale was straightforward: if text extraction tools falter due to the format's complexities, perhaps an Optical Character Recognition (OCR) approach could bridge the gap. PyTesseract, a Python wrapper for Google's Tesseract-OCR Engine, seemed like a promising candidate for this task.
The process was simple in concept but complex in execution. Each page of the PDF was converted into an image, maintaining a resolution high enough to ensure that even the finest details were preserved. PyTesseract was then deployed to interpret the text within these images, including the elusive blank table cells I hoped it would recognize as distinct spaces.
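The conversion-and-OCR step described above can be sketched roughly as follows. This is a minimal sketch, not the exact script used for this post. It assumes the third-party packages pdf2image (which needs Poppler installed) and pytesseract (which needs the Tesseract binary). The function names `ocr_pdf_pages` and `nonempty_lines` and the 300 DPI default are illustrative choices, not values from the original experiment.

```python
def nonempty_lines(text):
    """Split OCR text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def ocr_pdf_pages(pdf_path, dpi=300):
    """Render each PDF page to an image, OCR it, and return lines per page.

    Assumption: pdf2image and pytesseract are installed. They are imported
    lazily so that nonempty_lines() can be used without them.
    """
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
    return [nonempty_lines(pytesseract.image_to_string(page)) for page in pages]
```

Calling `ocr_pdf_pages("some-file.pdf")` with a path of your own returns one list of text lines per page. Note that the result is still a flat list of lines: nothing in it records which table cell a value came from.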
However, the results were mixed. While PyTesseract excelled in recognizing text from the images, it struggled with the same challenge that had stumped the PDF libraries: differentiating between empty cells and those filled with content. The OCR process could not reliably interpret the structure of tables, particularly when it came to recognizing and preserving the significance of blank spaces. This approach, while innovative, ultimately fell short of solving the core issue.
❯ pytesseract /Users/joe/Desktop/Glinteco-LLC.png
Glinteco LLC
Store Name
Malon
SI #: 212
4320 Winfield
Road, Suite 200
Warrenville, IL
60555 (630)
862-9552, cell
Order Number
1192
Store Name
Hokkaido
S| #: 304
Higashi 16 Sen-
15 Yamabe,
Furano,
Hokkaido 079-
1581, Japan
Item Description Total
Number Units
MbD9123 Machine 392
dock
BOOK19 Book 900
COMPS55 Computer 345
CUP666 Cup 786
Order Date
15/2/2024
Store Name
ShenZheng
SI #: 978
GWWV+9X
Nanshan,
Shenzhen,
Guangdong
Province, China
Vendor
Japanese
Culture
Store Name
Osaka
SI #: 651
840 Murodocho,
Izumi, Osaka
594-1101, Japan
SI #212 SI # 304 S1#978
Units Units Units
100 92
400 200
145 100
786
Cancel Date
28/2/2024
Store Name
Hanoi
SI #: 992
302 D. Cau Giay,
Dich Vong, Hoan
Kiém, Ha NGi,
Vietnam
SI1#651 SIl#992
Units Units
100 100
300
100
When cell values are missing, the OCR output gives no indication of which column each remaining value belongs to, so it is difficult to parse into a structured format. PyTesseract did not solve the problem.
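To make the ambiguity concrete, here is a small sketch using only the Python standard library. It takes the rows that appear under the "Units Units Units" header in the OCR output above and lists every way each row could be laid out across three columns. The helper name `candidate_placements` is hypothetical, and treating the header as exactly three columns is an assumption made for illustration.

```python
from itertools import combinations


def candidate_placements(values, n_columns):
    """Return every way to place `values`, in order, into `n_columns` slots.

    Slots that receive no value are filled with an empty string.
    """
    placements = []
    for slots in combinations(range(n_columns), len(values)):
        row = [""] * n_columns
        for slot, value in zip(slots, values):
            row[slot] = value
        placements.append(row)
    return placements


# Rows copied from the OCR output above, under the "Units Units Units" header.
ocr_rows = ["100 92", "400 200", "145 100", "786"]

for text in ocr_rows:
    options = candidate_placements(text.split(), n_columns=3)
    print(f"{text!r}: {len(options)} possible layouts")
    for option in options:
        print("   ", option)
```

Each of these rows has three equally plausible layouts, and the text alone offers no way to choose between them. That is the information a structure-aware tool has to recover from the page geometry.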
Manual Modifications and AI Assistance
Undeterred, I turned to a more manual, yet technologically assisted approach. Considering that the extraction tools could not detect empty cells, I experimented with manually adding placeholders—such as "0" or "YYY"—into the PDF to represent these spaces. The hope was that by making the empty cells visibly occupied with these placeholders, extraction tools, including advanced AI solutions like ChatGPT-4, could then recognize and correctly interpret the structure of the tables.
This method required meticulous editing of the PDF to ensure that each placeholder accurately represented an empty cell, a process both time-consuming and prone to human error. Once the placeholders were in place, I revisited the text extraction process, this time using ChatGPT-4 in the hopes that its advanced understanding capabilities would recognize and correctly parse the modified content.
The outcome, however, was not as expected. While ChatGPT-4 successfully extracted the text, including the placeholders, it treated them as part of the document's textual content rather than indicators of structural elements like blank table cells. Consequently, the placeholders were parsed to the end of the extracted text, disrupting the intended structure and rendering the data less meaningful for analysis. This attempt, though innovative, highlighted the limitations of AI in understanding document layouts when not explicitly trained for this purpose.
Here is the link for ChatGPT-4 Result.
I also ran PyTesseract on the edited file to verify this, and it showed the same problem as ChatGPT-4.
❯ pytesseract /Users/joe/Desktop/Glinteco-LLC-Edited.png
Glinteco LLC
Order Number
1192
Order Date
Vendor
Cancel Date
15/2/2024
Store Name
Malon
SI #: 212
4320 Winfield
Road, Suite 200
Warrenville, IL
60555 (630)
862-9552, cell
Store Name
Hokkaido
SI #: 304
Higashi 16 Sen-
15 Yamabe,
Furano,
Hokkaido 079-
1581, Japan
Item Description Total
Number Units
MbD9123 Machine 392
dock
BOOK19 Book 900
COMP55 Computer 345
CUP666 Cup 786
Store Name
ShenZheng
SI #: 978
GWWV+9X
Nanshan,
Shenzhen,
Guangdong
Province, China
Japanese
Culture
Store Name
Osaka
SI #: 651
840 Murodocho,
Izumi, Osaka
594-1101, Japan
SI #212 SI # 304
Units Units Units
100 92
0 400 200
145 0 100
0 786 0
28/2/2024
Store Name
Hanoi
SI #: 992
302 D. Cau Giay,
Dich Vong, Hoan
Kiém, Ha NGi,
Vietnam
SI#978 SI#651 SI#992
Units Units
100 100
0 300
100 0
0
The Breakthrough: Discovering Microsoft Document Intelligence
I almost quit. Instead, I took a break with a cup of coffee and came back 30 minutes later.
After running into the shortcomings of both traditional and unconventional methods, I discovered Microsoft Document Intelligence. This solution promised not only to extract text but also to understand and preserve the document's structure, including the nuanced handling of tables with blank cells.
Microsoft Document Intelligence leverages advanced AI algorithms to analyze documents, recognizing and interpreting various elements with a level of precision and understanding that conventional tools simply could not match. It was designed with complex documents in mind, offering the potential to solve the very challenge that had stumped me: accurately extracting structured data from PDFs, regardless of their complexity or layout intricacies.
Intrigued by its capabilities, I dived into Microsoft Document Intelligence, eager to see if it could live up to its promise and finally provide a solution to the enduring challenge of extracting structured data from PDFs with blank table cells. This section of the journey was marked by a mix of skepticism and hope, as I prepared to test the tool on the very document that had tested the limits of every other solution I had tried.
A likely reason Document Intelligence works well with this kind of PDF is that its layout model returns each table as a grid of cells with row and column indices, so a blank cell is still reported, just with empty content.
My code snippet is:
# Packages: azure-ai-formrecognizer (used below);
# azure-ai-documentintelligence is the newer SDK.
# import libraries
from io import BytesIO

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# set `endpoint` and `key` with the values from the Azure portal
endpoint = 'https://your-end-point.cognitiveservices.azure.com/'
key = 'your-key'


def analyze_layout(file_name):
    # Read the whole PDF into memory so it can be sent to the service
    with open(file_name, "rb") as fh:
        file_content = BytesIO(fh.read())

    document_analysis_client = DocumentAnalysisClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    # Run the prebuilt layout model and wait for the result
    poller = document_analysis_client.begin_analyze_document(
        "prebuilt-layout", file_content)
    result = poller.result()

    for idx, style in enumerate(result.styles):
        print(
            "Document contains {} content".format(
                "handwritten" if style.is_handwritten else "no handwritten"
            )
        )

    for page in result.pages:
        print("----Analyzing layout from page #{}----".format(page.page_number))
        print(
            "Page has width: {} and height: {}, measured with unit: {}".format(
                page.width, page.height, page.unit
            )
        )

        for line_idx, line in enumerate(page.lines):
            words = line.get_words()
            print(
                "...Line # {} has word count {} and text '{}' within bounding polygon '{}'".format(
                    line_idx,
                    len(words),
                    line.content,
                    line.polygon,
                )
            )

            for word in words:
                print(
                    "......Word '{}' has a confidence of {}".format(
                        word.content, word.confidence
                    )
                )

        for selection_mark in page.selection_marks:
            print(
                "...Selection mark is '{}' within bounding polygon '{}' and has a confidence of {}".format(
                    selection_mark.state,
                    selection_mark.polygon,
                    selection_mark.confidence,
                )
            )

    # Tables are reported cell by cell, including cells with empty content
    for table_idx, table in enumerate(result.tables):
        print(
            "Table # {} has {} rows and {} columns".format(
                table_idx, table.row_count, table.column_count
            )
        )
        for region in table.bounding_regions:
            print(
                "Table # {} location on page: {} is {}".format(
                    table_idx,
                    region.page_number,
                    region.polygon,
                )
            )
        for cell in table.cells:
            print(
                "...Cell[{}][{}] has content '{}'".format(
                    cell.row_index,
                    cell.column_index,
                    cell.content,
                )
            )
            for region in cell.bounding_regions:
                print(
                    "...content on page {} is within bounding polygon '{}'".format(
                        region.page_number,
                        region.polygon,
                    )
                )

    print("----------------------------------------")


if __name__ == "__main__":
    file_name = '/Users/joe/Downloads/Glinteco-LLC.pdf'
    analyze_layout(file_name)
A part of the Document Intelligence result is:
...Cell[4][1] has content 'Cup'
...content on page 1 is within bounding polygon '[Point(x=1.8961, y=6.2294), Point(x=2.8631, y=6.2227), Point(x=2.8698, y=6.5851), Point(x=1.8961, y=6.5851)]'
...Cell[4][2] has content '786'
...content on page 1 is within bounding polygon '[Point(x=2.8631, y=6.2227), Point(x=3.6958, y=6.2227), Point(x=3.6958, y=6.5851), Point(x=2.8698, y=6.5851)]'
...Cell[4][3] has content ''
...content on page 1 is within bounding polygon '[Point(x=3.6958, y=6.2227), Point(x=4.5285, y=6.2227), Point(x=4.5285, y=6.5851), Point(x=3.6958, y=6.5851)]'
...Cell[4][4] has content '786'
...content on page 1 is within bounding polygon '[Point(x=4.5285, y=6.2227), Point(x=5.341, y=6.2294), Point(x=5.3477, y=6.5919), Point(x=4.5285, y=6.5851)]'
...Cell[4][5] has content ''
...content on page 1 is within bounding polygon '[Point(x=5.341, y=6.2294), Point(x=6.0662, y=6.2294), Point(x=6.0662, y=6.5919), Point(x=5.3477, y=6.5919)]'
...Cell[4][6] has content ''
...content on page 1 is within bounding polygon '[Point(x=6.0662, y=6.2294), Point(x=6.7781, y=6.2294), Point(x=6.7781, y=6.5919), Point(x=6.0662, y=6.5919)]'
...Cell[4][7] has content ''
...content on page 1 is within bounding polygon '[Point(x=6.7781, y=6.2294), Point(x=7.5033, y=6.2294), Point(x=7.5033, y=6.5919), Point(x=6.7781, y=6.5919)]'
As we can see, Document Intelligence reports the empty cells that were skipped by all of the Python libraries and by ChatGPT-4.
I then converted the Document Intelligence result to CSV and asked ChatGPT-4 to reshape it into a structured format.
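The conversion from cells to CSV can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the exact conversion used for this post: plain `(row_index, column_index, content)` tuples stand in for the SDK's cell objects, the helper names `cells_to_grid` and `grid_to_csv` are hypothetical, and the 5 x 8 grid size is chosen only to fit the excerpt above (with the real result you would use `table.row_count` and `table.column_count`). Column 0 of row 4 is not part of the excerpt, so it stays empty here.

```python
import csv
import io


def cells_to_grid(cells, row_count, column_count):
    """Build a row-major grid from (row_index, column_index, content) tuples."""
    grid = [[""] * column_count for _ in range(row_count)]
    for row_index, column_index, content in cells:
        grid[row_index][column_index] = content
    return grid


def grid_to_csv(grid):
    """Serialize the grid to CSV text, keeping empty cells as empty fields."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(grid)
    return buffer.getvalue()


# Cells copied from the excerpt above (row 4, columns 1 to 7).
sample_cells = [
    (4, 1, "Cup"), (4, 2, "786"), (4, 3, ""), (4, 4, "786"),
    (4, 5, ""), (4, 6, ""), (4, 7, ""),
]

grid = cells_to_grid(sample_cells, row_count=5, column_count=8)
print(grid_to_csv(grid[4:]))
```

Because every cell carries its own row and column index, the empty cells keep their positions in the CSV row, and each value stays in its own column.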
Comparative Insights: Microsoft Document Intelligence vs. Traditional Methods
Reflecting on the journey that led me to Microsoft Document Intelligence, it's clear that this tool offers a significant advancement over traditional PDF data extraction methods. While tools like PyPDF2, PyMuPDF, PDFMiner, and PDFPlumber have their strengths, particularly in straightforward text extraction scenarios, they fall short when faced with the complexity of extracting structured data from tables, especially those with blank cells.
The manual approaches, including the use of OCR with PyTesseract and the insertion of placeholders to denote blank cells, while creative, highlighted the limitations of relying on text recognition alone. These methods required significant manual intervention and often resulted in data that needed further cleaning and structuring, diminishing their effectiveness.
In contrast, Microsoft Document Intelligence represents a paradigm shift in how we approach the challenge of extracting structured data from PDFs. Its AI-driven approach to understanding document structure and content offers a level of precision and efficiency that manual methods and traditional tools cannot match. This capability is not just a technical achievement; it's a practical solution that can save businesses and individuals countless hours of work and open up new possibilities for data analysis and automation.
Lessons Learned and Best Practices
The journey to finding an effective solution for extracting structured data from complex PDFs was fraught with challenges, but it was also immensely instructive. It highlighted the importance of understanding the limitations of the tools at our disposal and the necessity of keeping abreast of technological advancements, such as AI and machine learning, that are continually reshaping the landscape of data extraction.
For those embarking on similar quests to extract data from PDFs, the key lessons learned include:
- Understand the nature of your document and the specific challenges it presents. This understanding is crucial in selecting the right tool for the job.
- Don't hesitate to explore beyond conventional tools. The rapid advancement of AI technologies offers new solutions that may surpass traditional methods.
- Be prepared for a process of trial and error. Finding the right solution often requires testing multiple tools and approaches.
- Consider the scalability of your chosen solution. While manual methods may work for small one-off projects, tools like Microsoft Document Intelligence offer a scalable solution for larger datasets and more complex document structures.
In conclusion, the evolution of document processing technologies, exemplified by Microsoft Document Intelligence, is a testament to the innovative solutions emerging in the field of data extraction. As we continue to navigate the complexities of digital documents, these advancements not only solve immediate challenges but also open the door to new possibilities for leveraging structured data in ways that were previously unimaginable.