# PDF Extraction Issues

## The Problem

Text extracted from PDFs is often garbled, loses formatting, has incorrect character encoding, or includes layout artifacts that break semantic meaning.

### Symptoms

* ❌ Extracted text shows "ﬁ" instead of "fi" (ligatures)
* ❌ Multi-column layout text interleaved incorrectly
* ❌ Headers/footers repeated in every chunk
* ❌ Mathematical equations render as gibberish
* ❌ Text order scrambled (reads right-to-left when it should read left-to-right)

### Real-World Example

```
Original PDF (two-column academic paper):

Column 1:                      Column 2:
"The algorithm operates        "performance improvements
in O(n log n) time with        of up to 300% compared to
significant                    baseline implementations."

Naive extraction reads left-to-right across page:
"The algorithm operates performance improvements
in O(n log n) time with of up to 300% compared to
significant baseline implementations."

Result: Sentences interleaved, incomprehensible
```

***

## Deep Technical Analysis

### PDF Structure Complexity

PDFs are not text documents; they are page layout instructions:

**PDF Internal Representation:**

```
Not stored as: "Hello World"

Stored as:
(x=100, y=200) (Arial, 12pt) [H:72 e:101 l:108 l:108 o:111 space W:87 o:111...]

Text extraction must:
→ Decode glyph IDs to Unicode
→ Determine reading order (not stored)
→ Reconstruct words from positioned glyphs
→ Infer paragraphs from layout
```

**The Reading Order Problem:**

```
PDF page with elements:
→ Header (y=50)
→ Main text column 1 (y=100-500, x=50-300)
→ Main text column 2 (y=100-500, x=350-600)
→ Footer (y=550)
→ Sidebar (y=100-300, x=650-750)

Physical order (top to bottom, left to right):
Header → Col1 top → Col2 top → Sidebar → Col1 middle → ...

Logical reading order:
Header → Col1 (all) → Col2 (all) → Sidebar → Footer

Extraction tools must infer logical order from positions
→ Heuristic-based (often wrong)
```

### Ligatures and Character Encoding

PDF fonts use glyph substitution:

**The Ligature Problem:**

```
Text typed: "difficult"
Rendered in PDF: "difﬁcult" (fi → ﬁ ligature)

PDF stores:
→ Glyph ID: 345 (ligature ﬁ)
→ Unicode mapping may be wrong or missing

Extraction tools produce:
→ "difﬁcult" (Unicode U+FB01 "ﬁ")
→ Or: "difcult" (ligature not recognized, "fi" dropped)
→ Or: "difficult" (if tool has ligature expansion)

RAG impact:
→ Search for "difficult": No match (stored as "difﬁcult")
→ Embedding sees different tokens
→ Retrieval fails
```

**Custom Font Encoding:**

```
PDF creator uses custom font:
→ Glyph ID 65 → "A" (standard)
→ Glyph ID 66 → "★" (non-standard!)

Standard Unicode mapping:
→ Glyph 66 → "B"

Extraction result:
→ Text shows "B" where it should be "★"
→ Complete character substitution
→ Gibberish output

Common in:
→ Logo fonts
→ Symbol fonts (Wingdings, etc.)
→ Embedded fonts without proper encoding
```

### Multi-Column Layout Detection

Column detection is error-prone:

**Column Detection Heuristics:**

```
Algorithm attempts:
1. Analyze white space on page
2. Detect vertical gaps
3. Infer column boundaries

Fails when:
→ Columns have uneven height (one ends early)
→ Narrow columns (sidebars)
→ Indented quotes or code blocks (look like new columns)
→ Images between columns
→ Complex layouts (magazine-style)
```

**The Column Transition Problem:**

```
Two-column page:

Column 1 ends mid-sentence:
"The key benefit of this approach is"

Column 2 continues:
"improved performance and reduced latency."

Extraction must:
→ Recognize sentence continuation
→ Append Column 2 after Column 1
→ Maintain sentence flow

Bad extraction:
"The key benefit of this approach is [END]
New paragraph: improved performance and reduced latency."

Treats continuation as new paragraph
→ Loses semantic connection
```

### Headers, Footers, and Page Numbers

Repeated elements contaminate text:

**Header/Footer Repetition:**

```
50-page PDF, each page has:
Header: "Company Name - Product Guide - 2024"
Footer: "Page N of 50 | Confidential"

Naive extraction:
Includes headers/footers as content

Resulting text:
"...authentication methods.
Company Name - Product Guide - 2024
Page 1 of 50 | Confidential
OAuth 2.0 is recommended for..."

Every page: +30 tokens of repeated noise
50 pages × 30 tokens = 1,500 tokens wasted

RAG impact:
→ Chunks filled with repetitive headers
→ Dilutes semantic content
→ "Page N of 50" appears in embeddings
→ Retrieval contaminated
```

**Header/Footer Detection:**

```
Heuristics to detect:
→ Text appears in same position across pages
→ Contains "page", "confidential", etc.
→ Smaller font size
→ Outside main content area

But fails when:
→ Headers vary per section
→ Footer only on odd/even pages
→ Watermarks (diagonal text)
```

### Tables and Forms

Structured layouts are challenging:

**Table Extraction:**

```
PDF table:
┌─────────┬───────┬────────┐
│ Product │ Price │ Stock  │
├─────────┼───────┼────────┤
│ Widget  │ $10   │ 50     │
│ Gadget  │ $20   │ 30     │
└─────────┴───────┴────────┘

Text extraction sees positioned text elements:
x=100, y=100: "Product"
x=200, y=100: "Price"
x=300, y=100: "Stock"
x=100, y=120: "Widget"
x=200, y=120: "$10"
...

Must infer:
→ These elements form a table
→ Column alignments
→ Row groupings
→ Cell boundaries

Poor extraction:
"Product Price Stock Widget $10 50 Gadget $20 30"
→ All run together, no structure
```

**Form Field Extraction:**

```
PDF form with fillable fields:
→ Field labels: "Name:", "Email:", "Phone:"
→ Field values: (user-filled data)

Text extraction:
→ May extract field names but not values
→ Or: Extract values but not labels
→ Or: Extract in wrong order

Result: Incomplete or incomprehensible
```

### Scanned PDFs and OCR

Image-based PDFs require optical character recognition:

**The OCR Accuracy Problem:**

```
Scanned document quality (approximate character accuracy):
→ High resolution (300+ DPI): 99% accuracy
→ Medium resolution (150 DPI): 95% accuracy
→ Low resolution (<100 DPI): 85% accuracy
→ Skewed/rotated: 80% accuracy
→ Handwritten: 70% accuracy

95% character accuracy means:
→ 1 in 20 characters wrong
→ 100-word document (~500 characters): ~25 wrong characters, spread across up to ~20 words

Errors accumulate:
→ "difficult" → "difficuIt" (l → I)
→ "form" → "forn" (m → n)
→ "0" → "O" (zero → letter O)
```

**OCR Formatting Loss:**

```
Original scanned page:
→ Bold headings
→ Italic emphasis
→ Bullet points
→ Indentation

OCR output:
→ Plain text only
→ All formatting lost
→ Can't distinguish headings from body
→ List structure flattened
```

**The Confidence Score Problem:**

```
OCR engines return confidence per character:

"difﬁcult" with confidences:
d: 0.99
i: 0.98
f: 0.92
ﬁ: 0.65 ← Low confidence (ambiguous)
c: 0.97
u: 0.96
l: 0.94
t: 0.99

Should we:
→ Accept all characters? (includes errors)
→ Reject low confidence? (lose real characters)
→ Flag uncertain chunks for review?
```

### Embedded Images and Figures

Visual content in PDFs is lost during text extraction:

**Image Extraction Approaches:**

```
Option 1: Ignore images
→ Fast, simple
→ But: Loses visual information
→ "See Figure 3" references broken

Option 2: Extract image alt text (if present)
→ PDFs rarely have alt text
→ Usually empty

Option 3: Run OCR on images
→ Extracts text from diagrams
→ But: Layout/arrows/connections lost

Option 4: Image captioning AI
→ Generate descriptions with vision models
→ Expensive, slow
→ "Figure shows a flowchart with 5 nodes..."
```

**The Figure Reference Problem:**

```
Document text:
"As illustrated in Figure 3, the authentication flow
involves three steps..."

Figure 3: [complex diagram]

If figure not extracted:
→ Reference to "Figure 3" meaningless
→ LLM can't see what's illustrated
→ Incomplete answer

Current workaround:
→ Hope figure has caption
→ Extract caption as proxy
→ Often insufficient
```

### Encrypted and Protected PDFs

Security features block extraction:

**Encryption Types:**

```
User password:
→ Required to open PDF
→ If known, can extract normally

Owner password:
→ Restricts copying/printing
→ May block text extraction APIs

Digital Rights Management (DRM):
→ Vendor-specific encryption
→ Prevents extraction entirely
```

**The Partial Access Problem:**

```
PDF with restrictions:
→ Can view in reader app
→ But: Cannot select/copy text
→ Extraction tools respect restrictions

User uploads protected PDF:
→ Twig extraction fails: "Access denied"
→ No content indexed
→ Document invisible to AI agent
```

***

## How to Solve

**Use advanced PDF libraries (pdfplumber, PyMuPDF) with layout analysis + implement ligature expansion + detect and remove headers/footers + run OCR on scanned PDFs + handle multi-column layouts with column detection.**

See [PDF Processing](/rag-scenarios-and-solutions/chunking/pdf-extraction.md).
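As a rough illustration of three of the steps named above (ligature expansion, repeated header/footer removal, and column-aware ordering), here is a minimal standard-library sketch. It assumes an upstream extractor such as pdfplumber or PyMuPDF has already produced per-page lines and positioned text blocks. The function names, the `min_share` threshold, and the `split_x` gutter position are hypothetical choices made for illustration; they are not part of any library or of Twig's pipeline.

```python
import math
import re
from collections import Counter

# Common Latin ligatures from the Unicode "Alphabetic Presentation Forms" block.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
})


def expand_ligatures(text):
    """Replace ligature code points (e.g. U+FB01) with their plain letters."""
    return text.translate(LIGATURES)


def _line_key(line):
    # Mask digits so "Page 1 of 50" and "Page 2 of 50" count as the same line.
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages, min_share=0.6):
    """Drop lines that recur on at least `min_share` of the pages.

    `pages` is a list of pages, each a list of text lines. The threshold is
    an assumed heuristic; section-specific headers may need a lower value.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({_line_key(line) for line in page if line.strip()})
    threshold = max(2, math.ceil(min_share * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if _line_key(line) not in repeated]
            for page in pages]


def reading_order(blocks, split_x):
    """Order (x0, top, text) blocks column by column for a two-column page.

    `split_x` is the assumed x position of the gutter between the columns;
    a real pipeline would have to detect it from the page's white space.
    """
    left = sorted((b for b in blocks if b[0] < split_x), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= split_x), key=lambda b: b[1])
    return [b[2] for b in left + right]


if __name__ == "__main__":
    print(expand_ligatures("dif\ufb01cult"))
    blocks = [
        (50, 100, "The algorithm operates"),
        (350, 100, "performance improvements"),
        (50, 115, "in O(n log n) time with"),
        (350, 115, "of up to 300% compared to"),
        (50, 130, "significant"),
        (350, 130, "baseline implementations."),
    ]
    print(" ".join(reading_order(blocks, split_x=325)))
```

Masking digits before counting is what lets "Page N of 50" footers match across pages, but it can also remove genuine content that differs only by a number (for example "Step 1" and "Step 2" recurring on most pages), so the result is worth spot-checking. The fixed `split_x` sidesteps the column detection failure modes described earlier instead of solving them.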