PDF Extraction Issues

The Problem

Text extracted from PDFs is garbled, missing formatting, has incorrect character encoding, or includes layout artifacts that break semantic meaning.

Symptoms

❌ Extracted text shows "ﬁ" instead of "fi" (ligatures)
❌ Multi-column layout text interleaved incorrectly
❌ Headers/footers repeated in every chunk
❌ Mathematical equations render as gibberish
❌ Text order scrambled (reads right-to-left when it should read left-to-right)

Real-World Example

Original PDF (two-column academic paper):

Column 1:                    Column 2:
"The algorithm operates      "performance improvements
in O(n log n) time with     of up to 300% compared to
significant                 baseline implementations."

Naive extraction reads left-to-right across the page:
"The algorithm operates performance improvements
in O(n log n) time with of up to 300% compared to  
significant baseline implementations."

Result: Sentences interleaved, incomprehensible

Deep Technical Analysis

PDF Structure Complexity

PDFs are not text documents—they're page layout instructions:

PDF Internal Representation:

Not stored as:
"Hello World"

Stored as:
<positioning command> (x=100, y=200)
<font command> (Arial, 12pt)
<glyph IDs> [H:72 e:101 l:108 l:108 o:111 space W:87 o:111...]

Text extraction must:
→ Decode glyph IDs to Unicode
→ Determine reading order (not stored)
→ Reconstruct words from positioned glyphs
→ Infer paragraphs from layout

The Reading Order Problem:

PDF page with elements:
→ Header (y=50)
→ Main text column 1 (y=100-500, x=50-300)
→ Main text column 2 (y=100-500, x=350-600)
→ Footer (y=550)
→ Sidebar (y=100-300, x=650-750)

Physical order (top to bottom, left to right):
Header → Col1 top → Col2 top → Sidebar → Col1 middle → ...

Logical reading order:
Header → Col1 (all) → Col2 (all) → Sidebar → Footer

Extraction tools must infer logical order from positions
→ Heuristic-based (often wrong)
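
A sketch of one such heuristic, using the page layout listed above. It assumes the column start positions and the vertical extent of the body are already known (in practice these are the hard part to infer); the function and parameter names are hypothetical.

```python
def logical_order(blocks, column_edges, body_top, body_bottom):
    """Sort text blocks into header, column-major body, footer order.

    blocks: list of dicts with "text", "x", "y" (y grows downward).
    column_edges: x positions where each column starts (assumed known).
    body_top / body_bottom: vertical extent of the main content area.
    """
    def column_index(x):
        # Number of column edges at or left of x, minus one.
        return sum(1 for edge in column_edges if x >= edge) - 1

    def sort_key(block):
        if block["y"] < body_top:
            band = 0  # header band
        elif block["y"] >= body_bottom:
            band = 2  # footer band
        else:
            band = 1  # body: order by column, then top to bottom
        col = column_index(block["x"]) if band == 1 else 0
        return (band, col, block["y"], block["x"])

    return sorted(blocks, key=sort_key)


blocks = [
    {"text": "Col2 top", "x": 350, "y": 100},
    {"text": "Footer", "x": 50, "y": 550},
    {"text": "Col1 top", "x": 50, "y": 100},
    {"text": "Sidebar", "x": 650, "y": 100},
    {"text": "Header", "x": 50, "y": 50},
    {"text": "Col1 middle", "x": 50, "y": 300},
]
ordered = logical_order(blocks, [50, 350, 650], body_top=100, body_bottom=550)
print([b["text"] for b in ordered])
```

The sort only works when the column edges are right. If they are misdetected, blocks land in the wrong column and the output is interleaved again.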

Ligatures and Character Encoding

PDF fonts use glyph substitution:

The Ligature Problem:

Text typed: "difficult"
Rendered in PDF: "difﬁcult" (fi → ﬁ ligature)

PDF stores:
→ Glyph ID: 345 (ligature ﬁ)
→ Unicode mapping may be wrong or missing

Extraction tools produce:
→ "difﬁcult" (Unicode U+FB01 "ﬁ")
→ Or: "dicult" (ligature not recognized, "fi" dropped)
→ Or: "difficult" (if tool has ligature expansion)

RAG impact:
→ Search for "difficult": No match (stored as "difﬁcult")
→ Embedding sees different tokens
→ Retrieval fails
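
Ligature expansion can be sketched with an explicit mapping for the Latin ligature code points (U+FB00 to U+FB06). This is a minimal example, not a complete normalizer. Unicode NFKC normalization (unicodedata.normalize) also expands these ligatures, but it rewrites many other characters too, so the narrower table is used here.

```python
# Latin presentation-form ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
}
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text):
    """Replace ligature code points with the letters they represent."""
    return text.translate(_LIGATURE_TABLE)


print(expand_ligatures("dif\ufb01cult"))
```

This only helps when the extractor emitted the ligature code point. If the glyph was dropped entirely ("dicult"), the letters are gone and cannot be restored by normalization.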

Custom Font Encoding:

PDF creator uses custom font:
→ Glyph ID 65 → "A" (standard)
→ Glyph ID 66 → "★" (non-standard!)

Standard Unicode mapping:
→ Glyph 66 → "B"

Extraction result:
→ Text shows "B" where "★" should be
→ Complete character substitution
→ Gibberish output

Common in:
→ Logo fonts
→ Symbol fonts (Wingdings, etc.)
→ Embedded fonts without proper encoding
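
A sketch of the safer behavior: decode with the font's own glyph map and mark anything unmapped instead of guessing. The font_map dict stands in for whatever mapping the embedded font provides and is hypothetical, as are the function names.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character


def decode_glyphs(glyph_ids, font_map):
    """Decode glyph IDs using a per-font glyph-to-character map.

    Unmapped glyphs become U+FFFD rather than falling back to a
    standard encoding that may be wrong for this font.
    """
    return "".join(font_map.get(g, REPLACEMENT) for g in glyph_ids)


def unmapped_ratio(text):
    """Share of characters that could not be decoded."""
    return text.count(REPLACEMENT) / len(text) if text else 0.0


custom_font = {65: "A", 66: "\u2605"}  # glyph 66 is a star in this font
decoded = decode_glyphs([65, 66, 67], custom_font)
print(decoded, unmapped_ratio(decoded))
```

A high unmapped ratio on a page is a reasonable signal to route it to OCR or manual review rather than index gibberish.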

Multi-Column Layout Detection

Column detection is error-prone:

Column Detection Heuristics:

Algorithm attempts:
1. Analyze white space on page
2. Detect vertical gaps
3. Infer column boundaries

Fails when:
→ Columns have uneven height (one ends early)
→ Narrow columns (sidebars)
→ Indented quotes or code blocks (look like new columns)
→ Images between columns
→ Complex layouts (magazine-style)
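
The three steps above can be sketched as a horizontal coverage scan. This is an illustration under assumptions: word boxes arrive as (x0, x1) extents in integer page units, and min_gap is an arbitrary threshold.

```python
def find_column_gaps(word_boxes, page_width, min_gap=20):
    """Return interior (start, end) x-ranges that no word box covers.

    word_boxes: list of (x0, x1) horizontal extents of words on one page.
    min_gap: minimum width for a gap to count as a column separator
    (an assumed threshold). Page margins are ignored.
    """
    covered = [False] * page_width
    for x0, x1 in word_boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] = True

    gaps = []
    start = None
    for x in range(page_width + 1):
        is_free = x < page_width and not covered[x]
        if is_free and start is None:
            start = x
        elif not is_free and start is not None:
            wide_enough = x - start >= min_gap
            interior = start > 0 and x < page_width
            if wide_enough and interior:
                gaps.append((start, x))
            start = None
    return gaps


two_columns = [(50, 300), (60, 290), (350, 600), (360, 580)]
print(find_column_gaps(two_columns, page_width=650))

# A full-width heading covers the gutter and hides the column gap.
with_heading = two_columns + [(50, 600)]
print(find_column_gaps(with_heading, page_width=650))
```

The second call shows one of the failure modes: a single element spanning both columns removes the gap, so the scan has to be run per vertical band rather than per page.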

The Column Transition Problem:

Two-column page:

Column 1 ends mid-sentence:
"The key benefit of this approach is"

Column 2 continues:
"improved performance and reduced latency."

Extraction must:
→ Recognize sentence continuation
→ Append Column 2 after Column 1
→ Maintain sentence flow

Bad extraction:
"The key benefit of this approach is [END]
New paragraph: improved performance and reduced latency."

Treats continuation as new paragraph
→ Loses semantic connection
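
A minimal sketch of a continuation check, assuming a column that does not end in sentence-final punctuation continues in the next column. The punctuation set is a simplification, and the function name is hypothetical.

```python
def join_columns(col1_text, col2_text):
    """Join two column texts, keeping a sentence together across the break."""
    left = col1_text.rstrip()
    right = col2_text.lstrip()
    if not left or not right:
        return left + right
    # Heuristic: sentence-final punctuation suggests a paragraph boundary.
    ends_sentence = left.endswith((".", "!", "?", ":"))
    separator = "\n\n" if ends_sentence else " "
    return left + separator + right


print(join_columns(
    "The key benefit of this approach is",
    "improved performance and reduced latency.",
))
```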

Headers, Footers, and Page Numbers

Repeated elements contaminate text:

Header/Footer Repetition:

50-page PDF, each page has:
Header: "Company Name - Product Guide - 2024"
Footer: "Page N of 50 | Confidential"

Naive extraction: Includes headers/footers as content

Resulting text:
"...authentication methods.
Company Name - Product Guide - 2024
Page 1 of 50 | Confidential
OAuth 2.0 is recommended for..."

Every page: +30 tokens of repeated noise
50 pages × 30 tokens = 1,500 tokens wasted

RAG impact:
→ Chunks filled with repetitive headers
→ Dilutes semantic content
→ "Page N of 50" appears in embeddings
→ Retrieval contaminated

Header/Footer Detection:

Heuristics to detect:
→ Text appears in same position across pages
→ Contains "page", "confidential", etc.
→ Smaller font size
→ Outside main content area

But fails when:
→ Headers vary per section
→ Footer only on odd/even pages
→ Watermarks (diagonal text)
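
The "same text across pages" heuristic can be sketched on plain page text, without position data. Digits are masked so that "Page 1 of 50" and "Page 2 of 50" count as the same line. The min_share threshold is an assumption, and the third sample page is made up for illustration.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on most pages once digits are masked.

    pages: list of page texts.
    min_share: fraction of pages a line must appear on to be treated
    as a header or footer (an assumed threshold).
    """
    def normalize(line):
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(l) for l in page.splitlines() if l.strip()})

    threshold = max(2, min_share * len(pages))
    repeated = {line for line, n in counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if normalize(l) not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


header = "Company Name - Product Guide - 2024"
pages = [
    f"{header}\n...authentication methods.\nPage 1 of 50 | Confidential",
    f"{header}\nOAuth 2.0 is recommended for...\nPage 2 of 50 | Confidential",
    f"{header}\nTokens expire after one hour.\nPage 3 of 50 | Confidential",
]
for page in strip_repeated_lines(pages):
    print(page)
```

Headers that vary per section fall below the threshold and survive, which matches the failure cases listed above.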

Tables and Forms

Structured layouts are challenging:

Table Extraction:

PDF table:
┌─────────┬───────┬────────┐
│ Product │ Price │ Stock  │
├─────────┼───────┼────────┤
│ Widget  │ $10   │ 50     │
│ Gadget  │ $20   │ 30     │
└─────────┴───────┴────────┘

Text extraction sees positioned text elements:
x=100, y=100: "Product"
x=200, y=100: "Price"
x=300, y=100: "Stock"
x=100, y=120: "Widget"
x=200, y=120: "$10"
...

Must infer:
→ These elements form a table
→ Column alignments
→ Row groupings
→ Cell boundaries

Poor extraction:
"Product Price Stock Widget $10 50 Gadget $20 30"
→ All run together, no structure
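
Row and column inference can be sketched by grouping elements on y and ordering them on x. This is a simplified illustration: coordinates beyond those listed above are assumed, y_tolerance is an arbitrary value, and empty cells are not handled (they would shift later cells left).

```python
def rebuild_table(elements, y_tolerance=3):
    """Group positioned text elements into rows, cells ordered left to right.

    elements: list of (x, y, text) tuples.
    y_tolerance: max vertical distance for two elements to share a row.
    """
    rows = []
    for x, y, text in sorted(elements, key=lambda e: (e[1], e[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


elements = [
    (100, 100, "Product"), (200, 100, "Price"), (300, 100, "Stock"),
    (100, 120, "Widget"), (200, 120, "$10"), (300, 120, "50"),
    (100, 140, "Gadget"), (200, 140, "$20"), (300, 140, "30"),  # assumed y
]
table = rebuild_table(elements)
for row in table:
    print(" | ".join(row))
```

Emitting rows with an explicit separator keeps the header-to-value association that the run-together string loses.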

Form Field Extraction:

PDF form with fillable fields:
→ Field labels: "Name:", "Email:", "Phone:"
→ Field values: (user-filled data)

Text extraction:
→ May extract field names but not values
→ Or: Extract values but not labels
→ Or: Extract in wrong order

Result: Incomplete or incomprehensible

Scanned PDFs and OCR

Image-based PDFs require optical character recognition:

The OCR Accuracy Problem:

Scanned document quality (rough, illustrative figures; actual accuracy depends on the OCR engine and source material):
→ High resolution (300+ DPI): ~99% accuracy
→ Medium resolution (150 DPI): ~95% accuracy
→ Low resolution (<100 DPI): ~85% accuracy
→ Skewed/rotated: ~80% accuracy
→ Handwritten: ~70% accuracy

95% character accuracy means:
→ 1 in 20 characters wrong
→ 100-word document (~500 characters): ~25 character errors, which can touch 20+ words
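
Character accuracy and word accuracy differ, because a word is wrong if any one of its characters is wrong. A rough conversion, assuming independent errors and a fixed 5 characters per word (both simplifications):

```python
def expected_word_errors(char_accuracy, num_words, chars_per_word=5):
    """Expected number of words containing at least one wrong character.

    Assumes errors are independent per character and every word has
    chars_per_word characters.
    """
    word_ok = char_accuracy ** chars_per_word
    return num_words * (1 - word_ok)


for accuracy in (0.99, 0.95, 0.85):
    errors = expected_word_errors(accuracy, num_words=100)
    print(f"{accuracy:.0%} character accuracy: about {errors:.0f} of 100 words affected")
```

Under these assumptions, word-level error rates are several times higher than the character-level figure suggests, which matters because search and embedding operate on words and tokens, not characters.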

Errors accumulate:
→ "difficult" → "difficuIt" (l → I)
→ "form" → "forn" (m → n)
→ "0" → "O" (zero → letter O)

OCR Formatting Loss:

Original scanned page:
→ Bold headings
→ Italic emphasis
→ Bullet points
→ Indentation

OCR output:
→ Plain text only
→ All formatting lost
→ Can't distinguish headings from body
→ List structure flattened

The Confidence Score Problem:

OCR engines return confidence per character:

"difﬁcult" with confidences:
d: 0.99
i: 0.98
f: 0.92
ﬁ: 0.65  ← Low confidence (ambiguous)
c: 0.97
u: 0.96
l: 0.94
t: 0.99

Should we:
→ Accept all characters? (includes errors)
→ Reject low confidence? (lose real characters)
→ Flag uncertain chunks for review?
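
The third option (flag for review) can be sketched as below, using the confidences from the listing above. The 0.8 cutoff is an assumption and would need tuning per OCR engine; the function name is hypothetical.

```python
def flag_low_confidence(chars, threshold=0.8):
    """Return (text, needs_review, uncertain) for one OCR'd word or chunk.

    chars: list of (character, confidence) pairs.
    threshold: confidence below which a character counts as uncertain.
    uncertain: list of (index, character, confidence) for flagged characters.
    """
    text = "".join(c for c, _ in chars)
    uncertain = [
        (i, c, conf) for i, (c, conf) in enumerate(chars) if conf < threshold
    ]
    return text, bool(uncertain), uncertain


ocr_chars = [
    ("d", 0.99), ("i", 0.98), ("f", 0.92), ("\ufb01", 0.65),
    ("c", 0.97), ("u", 0.96), ("l", 0.94), ("t", 0.99),
]
text, needs_review, uncertain = flag_low_confidence(ocr_chars)
print(text, needs_review, uncertain)
```

Keeping the text and storing the flag as chunk metadata avoids both extremes: nothing is dropped, and uncertain chunks can be down-weighted or reviewed later.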

Embedded Images and Figures

Visual content in PDFs is lost:

Image Extraction Approaches:

Option 1: Ignore images
→ Fast, simple
→ But: Loses visual information
→ "See Figure 3" references broken

Option 2: Extract image alt text (if present)
→ PDFs rarely have alt text
→ Usually empty

Option 3: Run OCR on images
→ Extracts text from diagrams
→ But: Layout/arrows/connections lost

Option 4: Image captioning AI
→ Generate descriptions with vision models
→ Expensive, slow
→ "Figure shows a flowchart with 5 nodes..."

The Figure Reference Problem:

Document text:
"As illustrated in Figure 3, the authentication flow
involves three steps..."

Figure 3: [complex diagram]

If figure not extracted:
→ Reference to "Figure 3" meaningless
→ LLM can't see what's illustrated
→ Incomplete answer

Current workaround:
→ Hope figure has caption
→ Extract caption as proxy
→ Often insufficient
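
The caption workaround can be sketched as follows: find figure references in a chunk and append whatever captions were extracted, reporting the ones that are missing. The captions dict and the sample caption text are hypothetical.

```python
import re

FIGURE_REF = re.compile(r"\bFigure\s+(\d+)\b")


def attach_captions(chunk_text, captions):
    """Append captions for every figure the chunk refers to.

    captions: dict of figure number -> caption text, assumed to have
    been extracted separately. Returns (enriched_text, missing_numbers).
    """
    referenced = sorted({int(n) for n in FIGURE_REF.findall(chunk_text)})
    missing = [n for n in referenced if n not in captions]
    notes = [
        f"[Figure {n} caption: {captions[n]}]"
        for n in referenced if n in captions
    ]
    enriched = chunk_text if not notes else chunk_text + "\n" + "\n".join(notes)
    return enriched, missing


chunk = ("As illustrated in Figure 3, the authentication flow\n"
         "involves three steps...")
enriched, missing = attach_captions(chunk, {3: "Authentication flow"})
print(enriched)
print(attach_captions(chunk, {})[1])
```

The list of missing figure numbers is useful on its own: it identifies chunks whose answers are likely to be incomplete.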

Encrypted and Protected PDFs

Security features block extraction:

Encryption Types:

User password:
→ Required to open PDF
→ If known, can extract normally

Owner password:
→ Restricts copying/printing
→ May block text extraction APIs

Digital Rights Management (DRM):
→ Vendor-specific encryption
→ Prevents extraction entirely

The Partial Access Problem:

PDF with restrictions:
→ Can view in reader app
→ But: Cannot select/copy text
→ Extraction tools respect restrictions

User uploads protected PDF:
→ Twig extraction fails: "Access denied"
→ No content indexed
→ Document invisible to AI agent
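
One way to avoid a silently invisible document is to triage uploads and report a clear status. The sketch below uses a crude byte scan for the /Encrypt entry that encrypted PDFs carry in their trailer; it is a hint only, and a real pipeline would rely on the PDF library's own encryption check. Function names and status values are hypothetical.

```python
def looks_encrypted(pdf_bytes):
    """Crude check for an /Encrypt entry in a PDF.

    Byte scanning can give false positives (the token may occur in
    content), so treat the result as a hint, not a verdict.
    """
    return pdf_bytes.startswith(b"%PDF-") and b"/Encrypt" in pdf_bytes


def triage_upload(name, pdf_bytes):
    """Return a status record instead of silently indexing nothing."""
    if not pdf_bytes.startswith(b"%PDF-"):
        return {"file": name, "status": "rejected", "reason": "not a PDF"}
    if looks_encrypted(pdf_bytes):
        return {"file": name, "status": "needs_password",
                "reason": "encrypted or restricted"}
    return {"file": name, "status": "ok", "reason": ""}


protected = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"
plain = b"%PDF-1.7\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
print(triage_upload("guide.pdf", protected))
print(triage_upload("notes.pdf", plain))
```

Surfacing "needs_password" to the uploader lets them supply the password or an unprotected copy, instead of discovering later that the document was never indexed.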

How to Solve

Use a layout-aware PDF library (pdfplumber, PyMuPDF), expand ligatures, detect and remove repeated headers/footers, run OCR on scanned PDFs, and apply column detection to multi-column layouts. See PDF Processing.
