# PDF Extraction Issues

## The Problem

Text extracted from PDFs is often garbled: it loses formatting, carries incorrect character encodings, or includes layout artifacts that break semantic meaning.

### Symptoms

* ❌ Extracted text shows "ﬁ" instead of "fi" (ligatures)
* ❌ Multi-column layout text interleaved incorrectly
* ❌ Headers/footers repeated in every chunk
* ❌ Mathematical equations render as gibberish
* ❌ Text order scrambled (reads right-to-left when it should read left-to-right)

### Real-World Example

```
Original PDF (two-column academic paper):

Column 1:                    Column 2:
"The algorithm operates      "performance improvements
in O(n log n) time with     of up to 300% compared to
significant                 baseline implementations."

Naive extraction reads left-to-right across page:
"The algorithm operates performance improvements
in O(n log n) time with of up to 300% compared to  
significant baseline implementations."

Result: Sentences interleaved, incomprehensible
```

***

## Deep Technical Analysis

### PDF Structure Complexity

PDFs are not text documents; they are page layout instructions:

**PDF Internal Representation:**

```
Not stored as:
"Hello World"

Stored as:
<positioning command> (x=100, y=200)
<font command> (Arial, 12pt)
<glyph IDs> [H:72 e:101 l:108 l:108 o:111 space W:87 o:111...]

Text extraction must:
→ Decode glyph IDs to Unicode
→ Determine reading order (not stored)
→ Reconstruct words from positioned glyphs
→ Infer paragraphs from layout
```
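The "reconstruct words from positioned glyphs" step can be sketched as follows. This is a minimal illustration, not how any particular library does it: the `Glyph` type, the `gap_ratio` threshold, and the coordinates are all assumptions made for the example, and the glyphs are assumed to be already decoded to Unicode and to sit on one line.

```python
from typing import NamedTuple

class Glyph(NamedTuple):
    char: str     # already decoded to Unicode
    x: float      # left edge of the glyph on the page
    width: float  # advance width of the glyph

def glyphs_to_text(glyphs: list[Glyph], gap_ratio: float = 0.3) -> str:
    """Join glyphs on one line, inserting a space wherever the horizontal
    gap to the previous glyph exceeds gap_ratio * previous glyph width."""
    parts: list[str] = []
    prev = None
    for g in sorted(glyphs, key=lambda g: g.x):
        if prev is not None:
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.width:
                parts.append(" ")
        parts.append(g.char)
        prev = g
    return "".join(parts)

# Hypothetical glyph positions; note that no space glyph is stored.
line = [
    Glyph("H", 100, 8), Glyph("e", 108, 6), Glyph("l", 114, 3),
    Glyph("l", 117, 3), Glyph("o", 120, 6),
    Glyph("W", 130, 10), Glyph("o", 140, 6), Glyph("r", 146, 4),
    Glyph("l", 150, 3), Glyph("d", 153, 6),
]
print(glyphs_to_text(line))  # Hello World
```

The threshold is the fragile part: tight kerning or justified text can push real word gaps below it, or letter gaps above it.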

**The Reading Order Problem:**

```
PDF page with elements:
→ Header (y=50)
→ Main text column 1 (y=100-500, x=50-300)
→ Main text column 2 (y=100-500, x=350-600)
→ Footer (y=550)
→ Sidebar (y=100-300, x=650-750)

Physical order (top to bottom, left to right):
Header → Col1 top → Col2 top → Sidebar → Col1 middle → ...

Logical reading order:
Header → Col1 (all) → Col2 (all) → Sidebar → Footer

Extraction tools must infer logical order from positions
→ Heuristic-based (often wrong)
```
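One simple heuristic for the logical order above is to treat wide blocks as header/footer bands and read the remaining blocks column by column. The sketch below assumes a page width of 800, header and footer spanning x=50-750, and a `full_width_ratio` cutoff; none of these values come from a real extraction tool.

```python
from typing import NamedTuple

class Block(NamedTuple):
    name: str
    x0: float  # left edge
    x1: float  # right edge
    y0: float  # top edge (y grows downward, as in the example above)

def reading_order(blocks, page_width, full_width_ratio=0.6):
    """Full-width blocks above the body, then columns left to right
    (each top to bottom), then full-width blocks below the body."""
    wide = [b for b in blocks if b.x1 - b.x0 >= full_width_ratio * page_width]
    narrow = [b for b in blocks if b not in wide]
    if not narrow:
        return sorted(wide, key=lambda b: b.y0)
    body_top = min(b.y0 for b in narrow)
    above = sorted((b for b in wide if b.y0 < body_top), key=lambda b: b.y0)
    below = sorted((b for b in wide if b.y0 >= body_top), key=lambda b: b.y0)
    columns = sorted(narrow, key=lambda b: (b.x0, b.y0))
    return above + columns + below

page = [
    Block("Header", 50, 750, 50),
    Block("Col1 top", 50, 300, 100),
    Block("Col2", 350, 600, 100),
    Block("Sidebar", 650, 750, 100),
    Block("Col1 bottom", 50, 300, 300),
    Block("Footer", 50, 750, 550),
]
# Physical order: sort by vertical position, then horizontal position.
naive = [b.name for b in sorted(page, key=lambda b: (b.y0, b.x0))]
logical = [b.name for b in reading_order(page, page_width=800)]
print(naive)    # Col1 bottom comes after Col2 and Sidebar
print(logical)  # Col1 is read in full before Col2
```

Grouping columns by exact `x0` is deliberately naive; an indented block inside a column would be treated as a separate column, which is one of the failure modes this page describes.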

### Ligatures and Character Encoding

PDF fonts use glyph substitution:

**The Ligature Problem:**

```
Text typed: "difficult"
Rendered in PDF: "difﬁcult" (fi → ﬁ ligature)

PDF stores:
→ Glyph ID: 345 (ligature ﬁ)
→ Unicode mapping may be wrong or missing

Extraction tools produce:
→ "difﬁcult" (Unicode U+FB01 "ﬁ")
→ Or: "dicult" (ligature not recognized, "fi" dropped)
→ Or: "difficult" (if tool has ligature expansion)

RAG impact:
→ Search for "difficult": No match (stored as "difﬁcult")
→ Embedding sees different tokens
→ Retrieval fails
```
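Ligature expansion, when the extractor does emit the ligature code point, can be done as a post-processing step. The sketch below uses a small explicit table for the Latin ligature block (U+FB00 to U+FB06); the function name is hypothetical. Python's `unicodedata.normalize("NFKC", ...)` achieves the same for these characters, but it also rewrites other compatibility characters (for example superscripts), which may or may not be wanted.

```python
import unicodedata

# Latin presentation-form ligatures and their plain-letter expansions.
LIGATURES = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",
    "\ufb03": "ffi", "\ufb04": "ffl", "\ufb05": "st", "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)

def expand_ligatures(text: str) -> str:
    """Replace ligature code points with their component letters."""
    return text.translate(_TABLE)

extracted = "dif\ufb01cult"           # "difﬁcult" as stored after extraction
print(expand_ligatures(extracted))    # difficult
print(unicodedata.normalize("NFKC", extracted))  # difficult
```

This only helps for the first outcome listed above. If the extractor dropped the ligature entirely ("dicult"), there is nothing left to expand.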

**Custom Font Encoding:**

```
PDF creator uses custom font:
→ Glyph ID 65 → "A" (standard)
→ Glyph ID 66 → "★" (non-standard!)

Standard Unicode mapping:
→ Glyph 66 → "B"

Extraction result:
→ Text shows "B" where it should show "★"
→ Complete character substitution
→ Gibberish output

Common in:
→ Logo fonts
→ Symbol fonts (Wingdings, etc.)
→ Embedded fonts without proper encoding
```
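A safer alternative to guessing with a standard mapping is to decode only through the font's own glyph-to-Unicode table (in PDF terms, its ToUnicode CMap) and mark everything else as unknown. In this sketch the `to_unicode` dictionary is a hypothetical stand-in for a parsed table, and the 30% review threshold is an arbitrary assumption.

```python
REPLACEMENT = "\ufffd"  # Unicode replacement character

def decode_glyphs(glyph_ids, to_unicode):
    """Decode glyph IDs using the font's own mapping; never guess."""
    return "".join(to_unicode.get(g, REPLACEMENT) for g in glyph_ids)

def unmapped_ratio(text):
    """Share of characters that could not be decoded."""
    return text.count(REPLACEMENT) / len(text) if text else 0.0

# Hypothetical custom font from the example above: glyph 66 is a star.
custom_font = {65: "A", 66: "\u2605"}

text = decode_glyphs([65, 66, 67], custom_font)
print(text)  # A★ followed by one replacement character for glyph 67
if unmapped_ratio(text) > 0.3:
    print("Too many unmapped glyphs; consider OCR for this page")
```

A high unmapped ratio is a cheap signal that a page should be routed to OCR rather than indexed as gibberish.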

### Multi-Column Layout Detection

Column detection is error-prone:

**Column Detection Heuristics:**

```
Algorithm attempts:
1. Analyze white space on page
2. Detect vertical gaps
3. Infer column boundaries

Fails when:
→ Columns have uneven height (one ends early)
→ Narrow columns (sidebars)
→ Indented quotes or code blocks (look like new columns)
→ Images between columns
→ Complex layouts (magazine-style)
```
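The three steps above can be sketched by projecting word boxes onto the x-axis and looking for wide uncovered stretches. This is a minimal version of the whitespace heuristic, with made-up coordinates and an assumed `min_gap` of 20 units.

```python
def find_column_gaps(boxes, min_gap=20.0):
    """boxes: (x0, x1) horizontal extents of words on one page.
    Returns the uncovered x-ranges at least min_gap wide (candidate gutters)."""
    if not boxes:
        return []
    intervals = sorted(boxes)
    gaps = []
    covered_until = intervals[0][1]
    for x0, x1 in intervals[1:]:
        if x0 - covered_until >= min_gap:
            gaps.append((covered_until, x0))
        covered_until = max(covered_until, x1)
    return gaps

# Hypothetical word extents for a two-column page.
words = [(50, 120), (125, 300), (60, 290),
         (350, 420), (425, 600), (360, 590)]
print(find_column_gaps(words))  # [(300, 350)]
```

A single full-width line (a title or a wide table) covers the gutter and makes this version report no columns at all, so practical implementations usually run it per vertical band rather than per page.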

**The Column Transition Problem:**

```
Two-column page:

Column 1 ends mid-sentence:
"The key benefit of this approach is"

Column 2 continues:
"improved performance and reduced latency."

Extraction must:
→ Recognize sentence continuation
→ Append Column 2 after Column 1
→ Maintain sentence flow

Bad extraction:
"The key benefit of this approach is [END]
New paragraph: improved performance and reduced latency."

Treats continuation as new paragraph
→ Loses semantic connection
```
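A common heuristic for the continuation case is to check whether the first column ends without terminal punctuation and the second starts with a lowercase letter. The function name and the punctuation set below are assumptions for illustration.

```python
import re

# Sentence-ending punctuation, optionally followed by closing quotes/brackets.
_TERMINAL = re.compile(r"""[.!?:]["')\]]*$""")

def join_columns(col1: str, col2: str) -> str:
    """Join two column texts, merging them into one paragraph when
    col2 looks like the continuation of col1's last sentence."""
    left, right = col1.rstrip(), col2.lstrip()
    if not left or not right:
        return left or right
    continues = not _TERMINAL.search(left) and right[0].islower()
    return left + (" " if continues else "\n\n") + right

print(join_columns(
    "The key benefit of this approach is",
    "improved performance and reduced latency.",
))
```

The check fails in both directions: a column that ends on a heading has no terminal punctuation either, and a continuation can start with a capitalized proper noun.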

### Headers, Footers, and Page Numbers

Repeated elements contaminate text:

**Header/Footer Repetition:**

```
50-page PDF, each page has:
Header: "Company Name - Product Guide - 2024"
Footer: "Page N of 50 | Confidential"

Naive extraction: Includes headers/footers as content

Resulting text:
"...authentication methods.
Company Name - Product Guide - 2024
Page 1 of 50 | Confidential
OAuth 2.0 is recommended for..."

Every page: +30 tokens of repeated noise
50 pages × 30 tokens = 1,500 tokens wasted

RAG impact:
→ Chunks filled with repetitive headers
→ Dilutes semantic content
→ "Page N of 50" appears in embeddings
→ Retrieval contaminated
```

**Header/Footer Detection:**

```
Heuristics to detect:
→ Text appears in same position across pages
→ Contains "page", "confidential", etc.
→ Smaller font size
→ Outside main content area

But fails when:
→ Headers vary per section
→ Footer only on odd/even pages
→ Watermarks (diagonal text)
```
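The "same text across pages" heuristic can be sketched by normalizing digits (so "Page 1 of 50" and "Page 2 of 50" compare equal) and dropping lines that occur on most pages. The `min_share` threshold and the sample page contents are assumptions; position and font size are ignored here.

```python
import re
from collections import Counter

def _normalize(line: str) -> str:
    """Collapse digit runs so page numbers do not defeat the comparison."""
    return re.sub(r"\d+", "#", line.strip())

def strip_repeated_lines(pages, min_share=0.6):
    """pages: list of pages, each a list of text lines.
    Removes lines whose normalized form appears on >= min_share of pages."""
    counts = Counter()
    for page in pages:
        counts.update({_normalize(l) for l in page if l.strip()})
    threshold = min_share * len(pages)
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[l for l in page if _normalize(l) not in repeated] for page in pages]

pages = [
    ["Company Name - Product Guide - 2024", "...authentication methods.",
     "Page 1 of 50 | Confidential"],
    ["Company Name - Product Guide - 2024", "OAuth 2.0 is recommended for...",
     "Page 2 of 50 | Confidential"],
    ["Company Name - Product Guide - 2024", "Rotate keys regularly.",
     "Page 3 of 50 | Confidential"],
]
print(strip_repeated_lines(pages))
```

Headers that change per section fall below the threshold and survive, which matches the failure cases listed above; lowering `min_share` trades that for a higher risk of deleting legitimately repeated body lines.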

### Tables and Forms

Structured layouts are challenging:

**Table Extraction:**

```
PDF table:
┌─────────┬───────┬────────┐
│ Product │ Price │ Stock  │
├─────────┼───────┼────────┤
│ Widget  │ $10   │ 50     │
│ Gadget  │ $20   │ 30     │
└─────────┴───────┴────────┘

Text extraction sees positioned text elements:
x=100, y=100: "Product"
x=200, y=100: "Price"
x=300, y=100: "Stock"
x=100, y=120: "Widget"
x=200, y=120: "$10"
...

Must infer:
→ These elements form a table
→ Column alignments
→ Row groupings
→ Cell boundaries

Poor extraction:
"Product Price Stock Widget $10 50 Gadget $20 30"
→ All run together, no structure
```
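Row grouping can be approximated by clustering positioned words on their y coordinate and sorting each cluster by x. The sketch below assumes every cell is filled and that the "Gadget" row sits at y=140 (the example above only lists coordinates through the "Widget" row); `y_tolerance` is likewise an assumed value.

```python
def rows_from_words(words, y_tolerance=3.0):
    """words: (x, y, text) tuples. Groups words whose y values are within
    y_tolerance of the row's first word, then orders each row by x."""
    rows = []
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]

def to_markdown(rows):
    """Render rows as a Markdown table, treating the first row as the header."""
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "|".join(" --- " for _ in header) + "|"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

words = [
    (100, 100, "Product"), (200, 100, "Price"), (300, 100, "Stock"),
    (100, 120, "Widget"), (200, 120, "$10"), (300, 120, "50"),
    (100, 140, "Gadget"), (200, 140, "$20"), (300, 140, "30"),
]
print(to_markdown(rows_from_words(words)))
```

Emitting Markdown keeps the row and column relationships visible to the chunker and the LLM. Empty cells break this version, because a missing value shifts later cells left; handling that requires aligning cells to column x positions.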

**Form Field Extraction:**

```
PDF form with fillable fields:
→ Field labels: "Name:", "Email:", "Phone:"
→ Field values: (user-filled data)

Text extraction:
→ May extract field names but not values
→ Or: Extract values but not labels
→ Or: Extract in wrong order

Result: Incomplete or incomprehensible
```

### Scanned PDFs and OCR

Image-based PDFs require optical character recognition:

**The OCR Accuracy Problem:**

```
Scanned document quality (rough, illustrative character accuracy):
→ High resolution (300+ DPI): 99% accuracy
→ Medium resolution (150 DPI): 95% accuracy
→ Low resolution (<100 DPI): 85% accuracy
→ Skewed/rotated: 80% accuracy
→ Handwritten: 70% accuracy

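95% character accuracy means:
→ 1 in 20 characters wrong
→ 100-word document (~500 characters): ~25 character errors
→ Roughly 20-25 words contain an error (assuming ~5-letter words)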

Errors accumulate:
→ "difficult" → "difficuIt" (l → I)
→ "form" → "forn" (m → n)
→ "0" → "O" (zero → letter O)
```
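How character accuracy compounds at the word level can be estimated with a short calculation. It assumes errors are independent and words average five characters; both are simplifications, so treat the output as an order-of-magnitude guide.

```python
def word_error_rate(char_accuracy: float, avg_word_len: int = 5) -> float:
    """Probability that a word contains at least one wrong character,
    assuming independent per-character errors."""
    return 1 - char_accuracy ** avg_word_len

for accuracy in (0.99, 0.95, 0.85):
    print(f"{accuracy:.0%} chars -> {word_error_rate(accuracy):.1%} of words affected")
```

At 95% character accuracy this gives about 22.6% of words, and at 99% about 4.9%. Exact keyword matching depends on whole words, so it degrades much faster than the per-character figure suggests.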

**OCR Formatting Loss:**

```
Original scanned page:
→ Bold headings
→ Italic emphasis
→ Bullet points
→ Indentation

OCR output:
→ Plain text only
→ All formatting lost
→ Can't distinguish headings from body
→ List structure flattened
```

**The Confidence Score Problem:**

```
OCR engines return confidence per character:

"difﬁcult" with confidences:
d: 0.99
i: 0.98
f: 0.92
ﬁ: 0.65  ← Low confidence (ambiguous)
c: 0.97
u: 0.96
l: 0.94
t: 0.99

Should we:
→ Accept all characters? (includes errors)
→ Reject low confidence? (lose real characters)
→ Flag uncertain chunks for review?
```
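The third option (keep the text but flag it) can be sketched as below. The input format, a list of `(character, confidence)` pairs, and the 0.80 threshold are assumptions; real OCR engines expose confidence in different shapes and often per word rather than per character.

```python
def review_ocr_token(chars, threshold=0.80):
    """chars: list of (character, confidence) pairs for one token.
    Keeps the text as-is and reports which characters are uncertain."""
    text = "".join(ch for ch, _ in chars)
    uncertain = [(i, ch, conf) for i, (ch, conf) in enumerate(chars)
                 if conf < threshold]
    return {"text": text, "uncertain": uncertain, "needs_review": bool(uncertain)}

token = [("d", 0.99), ("i", 0.98), ("f", 0.92), ("\ufb01", 0.65),
         ("c", 0.97), ("u", 0.96), ("l", 0.94), ("t", 0.99)]
result = review_ocr_token(token)
print(result["needs_review"], result["uncertain"])
```

Storing the flag as chunk metadata lets retrieval down-rank or exclude uncertain chunks later without discarding content at ingestion time.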

### Embedded Images and Figures

Visual content in PDFs is lost:

**Image Extraction Approaches:**

```
Option 1: Ignore images
→ Fast, simple
→ But: Loses visual information
→ "See Figure 3" references broken

Option 2: Extract image alt text (if present)
→ PDFs rarely have alt text
→ Usually empty

Option 3: Run OCR on images
→ Extracts text from diagrams
→ But: Layout/arrows/connections lost

Option 4: Image captioning AI
→ Generate descriptions with vision models
→ Expensive, slow
→ "Figure shows a flowchart with 5 nodes..."
```

**The Figure Reference Problem:**

```
Document text:
"As illustrated in Figure 3, the authentication flow
involves three steps..."

Figure 3: [complex diagram]

If figure not extracted:
→ Reference to "Figure 3" meaningless
→ LLM can't see what's illustrated
→ Incomplete answer

Current workaround:
→ Hope figure has caption
→ Extract caption as proxy
→ Often insufficient
```
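The caption workaround can be sketched as a lookup that appends the caption of every figure a chunk refers to. The regular expression, the function name, and the caption text are illustrative assumptions, and captions are assumed to have been extracted separately into a dictionary keyed by figure number.

```python
import re

FIG_REF = re.compile(r"\bFig(?:ure|\.)\s*(\d+)", re.IGNORECASE)

def attach_captions(chunk: str, captions: dict) -> str:
    """Append the caption of each referenced figure to the chunk text."""
    seen = []
    for match in FIG_REF.finditer(chunk):
        number = int(match.group(1))
        if number in captions and number not in seen:
            seen.append(number)
    if not seen:
        return chunk
    notes = [f"[Figure {n} caption: {captions[n]}]" for n in seen]
    return chunk + "\n" + "\n".join(notes)

chunk = ("As illustrated in Figure 3, the authentication flow "
         "involves three steps...")
captions = {3: "Authentication flow"}  # hypothetical extracted caption
print(attach_captions(chunk, captions))
```

This keeps the reference resolvable inside the chunk, but it carries only as much information as the caption itself, which is the limitation noted above.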

### Encrypted and Protected PDFs

Security features block extraction:

**Encryption Types:**

```
User password:
→ Required to open PDF
→ If known, can extract normally

Owner password:
→ Restricts copying/printing
→ May block text extraction APIs

Digital Rights Management (DRM):
→ Vendor-specific encryption
→ Prevents extraction entirely
```

**The Partial Access Problem:**

```
PDF with restrictions:
→ Can view in reader app
→ But: Cannot select/copy text
→ Extraction tools respect restrictions

User uploads protected PDF:
→ Twig extraction fails: "Access denied"
→ No content indexed
→ Document invisible to AI agent
```
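To report a clear reason instead of silently indexing nothing, an ingestion pipeline can pre-check uploads for encryption. The sketch below is a crude byte-level heuristic: encrypted PDFs reference an encryption dictionary through an `/Encrypt` entry in the trailer, so the check just looks for that name. It can give false positives (the string may appear in uncompressed content) and does not distinguish user from owner passwords. A PDF library's own encryption check is preferable where available. The byte strings are hypothetical fragments, not valid PDFs.

```python
def looks_encrypted(pdf_bytes: bytes) -> bool:
    """Crude pre-check: encrypted PDFs carry an /Encrypt entry in the trailer."""
    return b"/Encrypt" in pdf_bytes

# Hypothetical trailer fragments for illustration only.
plain = b"%PDF-1.7\n...\ntrailer\n<< /Root 1 0 R >>\n%%EOF"
locked = b"%PDF-1.7\n...\ntrailer\n<< /Root 1 0 R /Encrypt 5 0 R >>\n%%EOF"

for name, data in (("plain", plain), ("locked", locked)):
    status = "encrypted, ask the user for an unlocked copy" if looks_encrypted(data) else "ok"
    print(name, "->", status)
```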

***

## How to Solve

**Use advanced PDF libraries (pdfplumber, PyMuPDF) with layout analysis, expand ligatures, detect and remove headers and footers, run OCR on scanned PDFs, and handle multi-column layouts with column detection.** See [PDF Processing](/rag-scenarios-and-solutions/chunking/pdf-extraction.md).


