Tables Breaking Across Chunks

The Problem

When a chunker splits a table mid-row or mid-column, the data becomes unreadable and the link between column headers and their values is lost.

Symptoms

  • ❌ Retrieved chunks show table rows without headers

  • ❌ Table columns split across chunks

  • ❌ AI can't answer questions about tabular data

  • ❌ Pricing tables incomplete in responses

  • ❌ Comparison tables broken and meaningless

Real-World Example

Original table in documentation:

| Plan      | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free      | $0    | 1     | 1GB     | 100/day   |
| Pro       | $49   | 5     | 50GB    | 10K/day   |
| Enterprise| $299  | 50    | 500GB   | 100K/day  |

The chunk boundary falls in the middle of the Free row ↓

Chunk 1 contains:
| Plan      | Price | Users | Storage | API Calls |
|-----------|-------|-------|---------|-----------|
| Free      | $0    | 1     | 1GB     |

Chunk 2 contains:
| 100/day   |
| Pro       | $49   | 5     | 50GB    | 10K/day   |
| Enterprise| $299  | 50    | 500GB   | 100K/day  |

Result: Headers are separated from the data, and the Free row's columns are misaligned.
User query: "What's included in Pro plan?"
The Pro row is retrieved without its header row, so the AI can't tell what each value in it means.

Deep Technical Analysis

Table Structure and Boundaries

Tables have inherent structural units:

Table Components:

The Header-Data Dependency:

Row-Level vs Table-Level Semantics:

Markdown Table Parsing

Markdown tables have specific syntax:

Format Variations:

Detection Challenges:
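One way to approach detection is to look for a header line immediately followed by a delimiter row (dashes with optional alignment colons). The sketch below assumes GitHub-style pipe tables and input that has already been split into lines; `find_markdown_tables` and `DELIMITER_ROW` are illustrative names, not part of any library. It does not skip fenced code blocks, so a pipe table quoted inside a code fence would be reported as a table.

```python
import re

# Delimiter row such as |---|:---:|---| ; the outer pipes are optional.
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def find_markdown_tables(lines):
    """Return (start, end) line-index pairs, end exclusive, for each table."""
    tables = []
    i = 0
    while i + 1 < len(lines):
        header, delimiter = lines[i], lines[i + 1]
        # Requiring a pipe in the delimiter keeps "---" rules from matching.
        if "|" in header and "|" in delimiter and DELIMITER_ROW.match(delimiter):
            end = i + 2
            # Body rows continue until a blank line or a line without a pipe.
            while end < len(lines) and lines[end].strip() and "|" in lines[end]:
                end += 1
            tables.append((i, end))
            i = end
        else:
            i += 1
    return tables
```

The returned ranges can be handed to a chunker as spans that must not be cut.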

The Cell Content Problem:
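A cell may itself contain a pipe character, which in GitHub-style tables is written as `\|`. Splitting a row on every `|` would then produce too many cells. The following is a minimal sketch that splits only on unescaped pipes; `split_row` is a hypothetical helper, and it does not handle a literal backslash placed directly before a real column separator.

```python
import re

# Match "|" only when it is not preceded by a backslash.
UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def split_row(row):
    """Split one Markdown table row into a list of cell strings."""
    row = row.strip()
    # Drop the optional leading and trailing column separators.
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]
    # Unescape pipes only after the row has been split into cells.
    return [cell.strip().replace("\\|", "|") for cell in UNESCAPED_PIPE.split(row)]
```

For example, `split_row("| Pro | $49 | 5 |")` returns `["Pro", "$49", "5"]`.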

HTML Table Complexity

HTML tables add structural depth:

Nested Structure:

Parsing Requirements:

The Rowspan/Colspan Problem:
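A cell with `rowspan` or `colspan` covers several grid positions, so reading `<td>` elements row by row leaves later rows short and shifts their values under the wrong headers. One possible approach is to expand every spanning cell into a rectangular grid before chunking. The sketch below uses Python's standard `html.parser`; the class name `TableGridParser` is illustrative, and it assumes a single, non-nested table with explicit closing tags and numeric span attributes.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect <td>/<th> cells of one table and expand rowspan/colspan."""

    def __init__(self):
        super().__init__()
        self.grid = []      # finished rows, each a list of cell strings
        self._row = None    # {column index: text} for the row being built
        self._carry = {}    # column index -> [text, rows still to fill]
        self._cell = None   # [text parts, rowspan, colspan] of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = {}
            # Copy down cells that span into this row from the rows above.
            for col, (text, left) in list(self._carry.items()):
                self._row[col] = text
                if left == 1:
                    del self._carry[col]
                else:
                    self._carry[col] = [text, left - 1]
        elif tag in ("td", "th") and self._row is not None:
            a = dict(attrs)
            self._cell = [[], int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            text = " ".join("".join(parts).split())  # normalize whitespace
            col = 0
            for _ in range(colspan):
                # Place the cell in the next column not already occupied.
                while col in self._row:
                    col += 1
                self._row[col] = text
                if rowspan > 1:
                    self._carry[col] = [text, rowspan - 1]
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.grid.append([self._row[c] for c in sorted(self._row)])
            self._row = None
```

After `parser.feed(html)`, `parser.grid` holds one list per row with spanned values repeated in every position they cover, so each row can be paired with the headers on its own.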

Table Linearization for Embeddings

Tables must be converted to text:

Flattening Strategies:
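One common flattening strategy is to pair every value with its column header so that each row becomes a self-describing string. The sketch below assumes the table has already been parsed into a header list and row lists; `linearize_rows` and its optional `caption` parameter are illustrative names.

```python
def linearize_rows(headers, rows, caption=None):
    """Turn each table row into one 'Header: value; ...' string."""
    lines = []
    for row in rows:
        # zip stops at the shorter sequence, so ragged rows do not raise.
        text = "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        if caption:
            # Prefix the caption so the row keeps its table-level context.
            text = f"{caption}. {text}"
        lines.append(text)
    return lines
```

For the Pro row of the pricing table above, this yields `Plan: Pro; Price: $49; Users: 5; Storage: 50GB; API Calls: 10K/day`. Each row is linearized independently, so a single string no longer shows how one plan compares with another.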

The Comparison Loss Problem:

Responsive and Complex Tables

Modern tables have dynamic layouts:

Multi-Header Tables:

Pivot Tables and Aggregations:

Large Table Strategies

Tables exceeding chunk size need special handling:

Vertical Splitting (by rows):
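A minimal sketch of row-wise splitting, assuming the table has already been isolated as a list of Markdown lines whose first two entries are the header row and the delimiter row. `split_table_by_rows` and `max_rows` are illustrative names; a production chunker would more likely budget by tokens than by row count.

```python
def split_table_by_rows(table_lines, max_rows):
    """Split a Markdown table into pieces of at most max_rows body rows.

    The header row and delimiter row are repeated at the top of every piece
    so that each piece can be read on its own.
    """
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    header, body = table_lines[:2], table_lines[2:]
    pieces = [header + body[i:i + max_rows] for i in range(0, len(body), max_rows)]
    # A table without body rows still yields its header as a single piece.
    return pieces or [header]
```

Applied to the pricing table with `max_rows=2`, the second piece contains the header, the delimiter row, and the Enterprise row, so the Enterprise values stay attached to their column names.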

Horizontal Splitting (by columns):
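When a table is too wide rather than too long, one option is to split it into column groups and repeat an identifying column in every group so that rows can still be matched up. The sketch below assumes parsed headers and rows; `split_table_by_columns`, `max_cols`, and `key_col` are hypothetical names, and the first column is assumed to be the row identifier.

```python
def split_table_by_columns(headers, rows, max_cols, key_col=0):
    """Split a wide table into column groups, repeating the key column."""
    if max_cols < 2:
        raise ValueError("max_cols must be at least 2 (key column plus one)")
    other = [c for c in range(len(headers)) if c != key_col]
    step = max_cols - 1  # one slot in every group is taken by the key column
    pieces = []
    for i in range(0, len(other), step):
        cols = [key_col] + other[i:i + step]
        pieces.append((
            [headers[c] for c in cols],
            [[row[c] for c in cols] for row in rows],
        ))
    return pieces
```

With the pricing table and `max_cols=3`, this produces two groups: Plan, Price, Users and Plan, Storage, API Calls.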

Semantic Chunking by Table:

Table Context and Captions

Tables need surrounding context:

Caption and Title:

Reference Text:


How to Solve

Implement table-aware chunking that detects table boundaries (Markdown and HTML), repeats headers when splitting large tables, keeps tables with their captions, and linearizes tables to structured text for embeddings. See Table Handling.
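As a rough illustration of the boundary-detection part, the sketch below packs lines into chunks greedily but treats each Markdown table as an indivisible block. It is a simplified example under stated assumptions, not the implementation behind Table Handling: `chunk_with_tables` and `max_chars` are hypothetical names, sizes are measured in characters rather than tokens, HTML tables are not handled, and a table larger than `max_chars` is emitted whole as its own chunk instead of being split with repeated headers.

```python
import re

DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def chunk_with_tables(text, max_chars=500):
    """Split text into chunks without ever cutting inside a Markdown table."""
    lines = text.split("\n")
    # Group lines into blocks: a whole table is one block, other lines stand alone.
    blocks, i = [], 0
    while i < len(lines):
        is_table = (
            i + 1 < len(lines)
            and "|" in lines[i]
            and "|" in lines[i + 1]
            and DELIMITER_ROW.match(lines[i + 1])
        )
        if is_table:
            end = i + 2
            while end < len(lines) and lines[end].strip() and "|" in lines[end]:
                end += 1
            blocks.append("\n".join(lines[i:end]))
            i = end
        else:
            blocks.append(lines[i])
            i += 1
    # Pack blocks greedily; start a new chunk when the next block would overflow.
    chunks, current = [], ""
    for block in blocks:
        candidate = block if not current else current + "\n" + block
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Because the table is a single block, the header row and every data row always land in the same chunk, whatever the surrounding text looks like.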
