# Multi-Column Layout Issues

## The Problem

Text from documents with multi-column layouts (newspapers, academic papers, brochures) is often extracted in the wrong order, which mixes columns and destroys readability.

### Symptoms

* ❌ Text reads across columns instead of down
* ❌ Sentences interleaved from different columns
* ❌ Sidebars mixed into main content
* ❌ Reading order completely wrong
* ❌ Incomprehensible extracted text

### Real-World Example

```
Two-column academic paper:

Column 1:              Column 2:
"The algorithm        "performance of 95%
processes data in     accuracy with low
O(n log n) time,     latency. Future work
achieving             will explore..."

Naive left-to-right extraction:
"The algorithm performance of 95%
processes data in accuracy with low
O(n log n) time, latency. Future work
achieving will explore..."

Sentences from different columns interleaved
Completely unreadable
```

***

## Deep Technical Analysis

### Column Detection Algorithms

Identifying columns from layout is heuristic-based:

**White Space Analysis:**

```
Document page scan:
→ Detect vertical white space strips
→ Width > threshold (e.g., 20px)
→ Height spans most of page
→ Infer: Column boundary

Fails when:
→ Columns have justified text (minimal gaps)
→ Images span columns
→ Uneven column lengths
→ Narrow margins between columns
```
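The white space scan above can be sketched as a one-dimensional coverage check along the X axis. This is a minimal sketch, not a production detector: the `(x0, y0, x1, y1)` box format, the `find_gutters` name, and the integer page width are assumptions, and the strip is required to be empty over the full page height rather than "most of" it.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical block format

def find_gutters(boxes: List[Box], page_width: int, min_gap: int = 20) -> List[Tuple[int, int]]:
    """Return (start, end) x-ranges with no text that are at least min_gap wide."""
    # Mark every x position that is covered by at least one text box.
    covered = [False] * page_width
    for x0, _, x1, _ in boxes:
        for x in range(max(0, int(x0)), min(page_width, int(x1))):
            covered[x] = True

    gutters = []
    start = None
    for x in range(page_width):
        if not covered[x] and start is None:
            start = x                      # an empty strip begins
        elif covered[x] and start is not None:
            # start == 0 is the left margin, not a column boundary
            if start > 0 and x - start >= min_gap:
                gutters.append((start, x))
            start = None
    # A strip still open at the end is the right margin and is ignored.
    return gutters
```

A single full-width element (a title, or an image spanning both columns) covers the gutter and makes this sketch return nothing, which is one of the failure cases listed above.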

**Text Block Clustering:**

```
Algorithm:
1. Extract all text blocks with coordinates
2. Cluster by X-position similarity
3. Group into column regions
4. Sort by Y-position within each column

Challenges:
→ Indented paragraphs look like new column
→ Block quotes offset from main text
→ Footnotes at page bottom
→ Headers/footers span all columns
```
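The four clustering steps above can be sketched as follows. The block format (dicts with `x`, `y`, `text`) and the `x_tolerance` value are assumptions for illustration. The tolerance is what keeps an indented paragraph in its own column instead of starting a new one.

```python
from typing import Dict, List

def order_by_columns(blocks: List[Dict], x_tolerance: float = 30.0) -> List[Dict]:
    """Group blocks with similar left edges, then read each group top to bottom."""
    columns: List[List[Dict]] = []
    for block in sorted(blocks, key=lambda b: b["x"]):
        # Compare against the leftmost block of the current cluster.
        if columns and block["x"] - columns[-1][0]["x"] <= x_tolerance:
            columns[-1].append(block)
        else:
            columns.append([block])

    ordered: List[Dict] = []
    for column in columns:                       # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b["y"]))
    return ordered
```

Headers, footers, and footnotes that span all columns still need to be removed or handled before this step, since their left edge matches the first column.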

**The Reading Order Problem:**

```
Three possible reading orders for 2 columns:

Order 1: Down Column 1, then Down Column 2
[A] [D]
[B] [E]
[C] [F]
Reading: A→B→C→D→E→F ✓

Order 2: Across then Down (wrong!)
[A] [B]
[C] [D]
[E] [F]
Reading: A→B→C→D→E→F ✗

Order 3: Snake pattern, reversing direction on each row (very wrong!)
[A] [B]
[C] [D]
Reading: A→B→D→C ✗

Must detect correct pattern
```

### Sidebar and Inset Handling

Additional content boxes complicate layout:

**Sidebar Positioning:**

```
Layout:
┌──────────────┬──────┐
│ Main content │ Side │
│              │ bar  │
│              ├──────┤
│              │ Ad   │
└──────────────┴──────┘

Reading order should be:
1. All main content
2. Sidebar
3. Ad

Not:
1. Main paragraph 1
2. Sidebar (interrupts!)
3. Main paragraph 2
4. Ad
5. Main paragraph 3
```
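One way to get the "main content first" order shown above is to split blocks by region before sorting. This sketch assumes the sidebar's left edge (`sidebar_x`) is already known, for example from a column boundary detected earlier; the block format and function name are hypothetical.

```python
from typing import Dict, List

def main_flow_then_sidebars(blocks: List[Dict], sidebar_x: float) -> List[str]:
    """Emit all main-content blocks top to bottom, then all sidebar blocks."""
    main = [b for b in blocks if b["x"] < sidebar_x]
    side = [b for b in blocks if b["x"] >= sidebar_x]

    def by_y(block: Dict) -> float:
        return block["y"]

    return [b["text"] for b in sorted(main, key=by_y) + sorted(side, key=by_y)]
```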

**Inset Boxes:**

```
Text flow with callout box:

Main text flows around │ ┌─────────┐  │
the callout box that   │ │ CALLOUT │  │
appears to the right   │ │  box    │  │
and continues after it │ └─────────┘  │

Extraction challenge:
→ Callout in middle of paragraph
→ Should it be inline or separate?
→ Does text flow around it or is it independent?

Wrong extraction:
"Main text flows around CALLOUT box the callout..."
→ Callout inserted mid-sentence
```
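If the extractor also reports drawn rectangles (box outlines), a callout can be pulled out of the paragraph before the text is joined. This is a sketch under that assumption; `frames`, the `(rect, text)` block format, and both function names are hypothetical. Whether the callout is then emitted after the paragraph or dropped is a separate decision.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def inside(inner: Rect, outer: Rect) -> bool:
    """True if inner lies completely within outer."""
    return (inner[0] >= outer[0] and inner[1] >= outer[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])

def split_callouts(blocks: List[Tuple[Rect, str]],
                   frames: List[Rect]) -> Tuple[List[str], List[str]]:
    """Separate text that sits inside a drawn frame from the main flow."""
    main: List[str] = []
    callouts: List[str] = []
    for rect, text in blocks:
        target = callouts if any(inside(rect, f) for f in frames) else main
        target.append(text)
    return main, callouts
```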

### Academic Paper Layouts

Scientific papers have complex structures:

**Two-Column Abstract:**

```
┌─────────────────────────┐
│        TITLE            │
│   Authors, Affiliation  │
├────────────┬────────────┤
│  Abstract  │  Abstract  │
│  (spans    │  continued │
│  2 cols)   │  here)     │
├───────┬────┴────┬───────┤
│ Intro │ Methods │Results│
└───────┴─────────┴───────┘
```

**Extraction Issues:**

```
Abstract spans 2 columns:
→ Must read left column fully
→ Then right column
→ Not line-by-line across

Sections (Intro, Methods, Results):
→ May be in single column each
→ Or each spans 2 columns
→ Layout varies by paper

Cannot use simple heuristic
→ Need per-document analysis
```
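A common way to do per-document analysis is to treat full-width blocks (title, spanning captions) as separators and order the columns only within each horizontal band. The sketch below assumes a two-column page with a known split position; `span_ratio`, `column_split`, and the block format are illustrative assumptions.

```python
from typing import Dict, List

def order_with_spanning_blocks(blocks: List[Dict], page_width: float,
                               column_split: float, span_ratio: float = 0.6) -> List[str]:
    """Blocks are dicts with x0, x1, y, text. Full-width blocks split the page into bands."""
    ordered: List[str] = []
    band: List[Dict] = []

    def flush() -> None:
        # Within a band: whole left column first, then whole right column.
        left = [b for b in band if b["x0"] < column_split]
        right = [b for b in band if b["x0"] >= column_split]
        ordered.extend(b["text"] for b in left + right)
        band.clear()

    for block in sorted(blocks, key=lambda b: b["y"]):
        if block["x1"] - block["x0"] >= span_ratio * page_width:
            flush()                       # close the band above the spanning block
            ordered.append(block["text"])
        else:
            band.append(block)
    flush()
    return ordered
```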

**Footnotes and References:**

```
Main text in 2 columns:
[Content... ¹]

Bottom of page (spanning both columns):
────────────────────────
¹ Footnote text here

Extraction must:
→ Detect footnote marker in main text
→ Find matching footnote at bottom
→ Associate reference with marker
→ Not treat footnote as 3rd column
```
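Once footnote lines have been separated from the columns (for example, everything below the rule line), markers can be matched to notes by number. This sketch assumes Unicode superscript digits as markers, as in the example above; documents that use `[1]` or `*` would need a different pattern. The function name is hypothetical.

```python
import re
from typing import Dict, List, Optional

# Superscript digits are not accepted by int(), so map them to ASCII first.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")
MARKER = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")

def link_footnotes(main_text: str, footnote_lines: List[str]) -> Dict[int, Optional[str]]:
    """Map each marker number found in main_text to its footnote text (None if unmatched)."""
    notes: Dict[int, str] = {}
    for line in footnote_lines:
        stripped = line.strip()
        match = MARKER.match(stripped)      # marker must open the footnote line
        if match:
            number = int(match.group().translate(SUPERSCRIPTS))
            notes[number] = stripped[match.end():].strip()

    found = [int(m.translate(SUPERSCRIPTS)) for m in MARKER.findall(main_text)]
    return {n: notes.get(n) for n in found}
```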

### Magazine and Newsletter Layouts

Creative layouts are unpredictable:

**Non-Uniform Columns:**

```
┌─────┬─────────┬───────┐
│     │         │ Side  │
│ Img │ Content │ bar   │
│     │         ├───────┤
├─────┴─────────┤ Ad    │
│  Caption      │       │
└───────────────┴───────┘

Columns of different widths
Image caption spans 2 columns
Sidebar changes mid-page
```

**The Unpredictability:**

```
Each page may have:
→ Different number of columns (1, 2, 3)
→ Variable column widths
→ Images breaking grid
→ Text wrapping around shapes

No consistent pattern
→ Per-page layout detection needed
→ Or: Give up on perfect ordering
→ Accept some errors
```

### PDF Coordinate Systems

PDFs use absolute positioning:

**Text Positioning:**

```
PDF stores:
→ "Hello" at (x=100, y=200)
→ "World" at (x=400, y=200)

No inherent reading order
→ Just x,y coordinates
→ Must infer order from positions
```

**The Sorting Problem:**

```
Sort by Y (top to bottom):
→ Reads across page first
→ Wrong for multi-column

Sort by X, then Y:
→ Roughly groups text by column
→ But: Small X offsets (indents, ragged edges) reorder lines within a column
→ Doesn't handle column boundaries or full-width elements

Hybrid approach:
→ Cluster by X (identify columns)
→ Sort each column by Y
→ Concatenate columns in order
→ But: Clustering non-trivial
```
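The difference between the naive sort and the hybrid approach comes down to the sort key. The sketch below reuses text from the two-column example at the top of this page; the coordinates and the `boundaries` parameter (X positions that separate columns, assumed already detected) are made up for illustration, with Y increasing down the page.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text)

fragments: List[Fragment] = [
    (50, 100, "The algorithm"), (320, 100, "performance of 95%"),
    (50, 120, "processes data in"), (320, 120, "accuracy with low"),
]

def sort_rows_first(frags: List[Fragment]) -> List[str]:
    """Naive order: top to bottom, then left to right. Interleaves columns."""
    return [t for _, _, t in sorted(frags, key=lambda f: (f[1], f[0]))]

def sort_columns_first(frags: List[Fragment], boundaries: List[float]) -> List[str]:
    """Hybrid order: column index first, then Y within the column."""
    def column_index(x: float) -> int:
        # Number of boundaries at or left of x = index of the column containing x.
        return sum(x >= b for b in boundaries)
    return [t for _, _, t in sorted(frags, key=lambda f: (column_index(f[0]), f[1], f[0]))]
```

Using the column index rather than the raw X value in the key is what keeps an indented line from jumping ahead of the lines above it.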

### Text Reflow and Reflowable PDFs

Some PDFs support reflow:

**Reflowable vs Fixed Layout:**

```
Reflowable (tagged) PDF:
→ Contains logical structure (tags)
→ Text can adapt to window width
→ Extraction can follow the tagged logical order

Fixed Layout (untagged) PDF:
→ Absolute positioning only
→ No logical structure
→ Must infer reading order

Many PDFs: Fixed layout, untagged
→ Harder to extract correctly
```
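Before falling back to layout inference, it can be worth checking whether the file declares itself tagged. In the PDF specification, a tagged document has a `MarkInfo` dictionary with `Marked` set to true and a `StructTreeRoot` entry in its document catalog. The sketch below assumes the catalog has already been loaded into a plain dict keyed by PDF names; how to obtain it depends on the PDF library in use and is not shown.

```python
from typing import Any, Mapping

def looks_tagged(catalog: Mapping[str, Any]) -> bool:
    """True if the catalog declares tagged content and has a structure tree."""
    mark_info = catalog.get("/MarkInfo") or {}
    return bool(mark_info.get("/Marked")) and "/StructTreeRoot" in catalog
```

A positive result only says that tags exist. Tags produced by some export tools are incomplete or out of order, so the logical order may still need a sanity check against the layout.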

### Column Break Indicators

Documents may hint at columns:

**Column Break Characters:**

```
Some documents include:
→ Column break marker (rare)
→ Page break indicator

More common:
→ Continuous text, no markers
→ Must infer from layout alone
```

**The Continuation Problem:**

```
Sentence split across columns:

Column 1 ends: "The results indicate that performance"
Column 2 starts: "improvements are statistically significant."

Must recognize:
→ Sentence continues in next column
→ Not a new sentence
→ Join with a single space, without adding a period

Wrong joining:
"The results indicate that performance. Improvements are statistically significant."
→ Added period (wrong!)

Or:
"The results indicate that performanceimprovements are..."
→ Missing space (wrong!)
```
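A joining rule that avoids both mistakes above can be sketched as follows. The hyphen handling is an added assumption: it treats a trailing hyphen followed by a lowercase letter as a line-break hyphenation, which will wrongly merge genuine compounds split at the break (for example "state-of-the-" followed by "art").

```python
def join_columns(end_of_first: str, start_of_second: str) -> str:
    """Join text across a column break without inventing punctuation."""
    left = end_of_first.rstrip()
    right = start_of_second.lstrip()
    if not left or not right:
        return left + right
    if left.endswith("-") and right[:1].islower():
        # Assumed line-break hyphenation: "perfor-" + "mance" -> "performance"
        return left[:-1] + right
    # Default: exactly one space, no added period.
    return f"{left} {right}"
```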

***

## How to Solve

* Implement X-axis clustering to detect columns
* Use white space analysis to find column boundaries
* Sort by column first, then by Y-position within each column
* Handle sidebars separately from the main flow
* Preserve sentence continuity across column breaks

See [Multi-Column Layout](/rag-scenarios-and-solutions/chunking/multi-column.md).

