# Chunking & Processing

## Overview

Chunking is the process of breaking large documents into smaller, semantically meaningful pieces that can be embedded and retrieved effectively. Poor chunking is one of the most common yet overlooked causes of RAG failures: it directly impacts retrieval quality, context coherence, and answer accuracy.

## Why Chunking Matters

Effective chunking ensures:

* **Semantic coherence** - Each chunk contains complete, meaningful information
* **Optimal retrieval** - Relevant information can be found and returned
* **Context preservation** - Important relationships and structure are maintained
* **Efficient embeddings** - Chunks fit within token limits while maximizing information density

Poor chunking leads to:

* **Fragmented context** - Critical information split across chunks
* **Lost structure** - Formatting, tables, and code blocks broken apart
* **Retrieval failures** - Relevant information exists but can't be found
* **Incomplete answers** - Agents receive partial context and fill gaps with hallucinations

## Common Chunking Challenges

### Size Issues

* **Chunks too small** - Context loss, fragmented information, excessive chunks
* **Chunks too large** - Poor retrieval granularity, token limits exceeded
* **Inconsistent sizing** - Variable quality across your knowledge base

### Structure Preservation

* **Code blocks split incorrectly** - Broken syntax, incomplete examples
* **Tables breaking across chunks** - Lost relationships between rows/columns
* **Nested lists fragmented** - Hierarchy and context destroyed
* **Footnotes separated** - References disconnected from content

### Format Handling

* **Markdown formatting lost** - Headers, emphasis, and structure stripped
* **Mathematical notation corrupted** - LaTeX and formulas broken
* **Multi-column layouts** - Spatial relationships lost in conversion
* **HTML to text problems** - Semantic meaning lost in conversion

## Solutions in This Section

Browse these guides to optimize your chunking strategy:

* [Chunks Too Small](/rag-scenarios-and-solutions/chunking/chunks-too-small.md)
* [Chunks Too Large](/rag-scenarios-and-solutions/chunking/chunks-too-large.md)
* [Code Blocks Split Wrong](/rag-scenarios-and-solutions/chunking/code-splitting.md)
* [Tables Breaking Across Chunks](/rag-scenarios-and-solutions/chunking/table-splitting.md)
* [PDF Extraction Issues](/rag-scenarios-and-solutions/chunking/pdf-extraction.md)
* [Optimizing Chunk Size](/rag-scenarios-and-solutions/chunking/optimize-chunks.md)
* [Markdown Formatting Lost](/rag-scenarios-and-solutions/chunking/markdown-lost.md)
* [Nested Lists Broken](/rag-scenarios-and-solutions/chunking/nested-lists.md)
* [Mathematical Notation Corrupted](/rag-scenarios-and-solutions/chunking/math-notation.md)
* [Multi-Column Layout Issues](/rag-scenarios-and-solutions/chunking/multi-column.md)
* [Footnotes and References Lost](/rag-scenarios-and-solutions/chunking/footnotes-lost.md)
* [HTML to Text Conversion Problems](/rag-scenarios-and-solutions/chunking/html-conversion.md)

## Chunking Strategies

Different content types require different chunking approaches:

### By Content Type

| Content Type         | Recommended Strategy           | Chunk Size      |
| -------------------- | ------------------------------ | --------------- |
| **Documentation**    | Semantic splitting by sections | 500-1000 tokens |
| **Code**             | Function/class boundaries      | 200-800 tokens  |
| **Conversations**    | Message groupings              | 300-600 tokens  |
| **Technical papers** | Paragraph-aware splitting      | 600-1200 tokens |
| **FAQs**             | Question-answer pairs          | 100-400 tokens  |
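One way to operationalize the matrix above is a simple lookup table. This is a minimal sketch, not a library API: the `CHUNK_PARAMS` dict, its keys, and the `params_for` helper are all illustrative names, and the token ranges simply mirror the table.

```python
# Hypothetical lookup table mirroring the strategy matrix above.
CHUNK_PARAMS = {
    "documentation":   {"strategy": "semantic_sections",   "min_tokens": 500, "max_tokens": 1000},
    "code":            {"strategy": "function_boundaries", "min_tokens": 200, "max_tokens": 800},
    "conversation":    {"strategy": "message_groups",      "min_tokens": 300, "max_tokens": 600},
    "technical_paper": {"strategy": "paragraph_aware",     "min_tokens": 600, "max_tokens": 1200},
    "faq":             {"strategy": "qa_pairs",            "min_tokens": 100, "max_tokens": 400},
}

def params_for(content_type: str) -> dict:
    """Return chunking parameters for a content type, defaulting to documentation."""
    return CHUNK_PARAMS.get(content_type, CHUNK_PARAMS["documentation"])
```

Keeping these parameters in one place makes it easy to tune ranges per content type as retrieval metrics come in.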

### By Goal

* **Maximum recall** - Smaller chunks (200-400 tokens) with more overlap
* **Maximum context** - Larger chunks (800-1200 tokens) with less fragmentation
* **Balanced approach** - Medium chunks (400-800 tokens) with 10-20% overlap
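The overlap trade-off above can be sketched as a sliding window over a pre-tokenized sequence. This is an illustrative implementation, assuming tokens are already split out (real systems would use a proper tokenizer); `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

A "balanced approach" with 10-20% overlap corresponds to, say, `chunk_size=600, overlap=90`.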

## Best Practices

1. **Respect document structure** - Use headers, sections, and natural boundaries
2. **Preserve semantic units** - Don't split sentences, code blocks, or tables
3. **Add metadata** - Include source, section, and context in chunk metadata
4. **Test retrieval quality** - Measure how well chunking supports finding answers
5. **Use overlap strategically** - Help bridge context between chunks (10-20%)
6. **Use document-type-specific strategies** - Code needs different handling than prose
7. **Monitor and iterate** - Track chunk-level performance and adjust
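Practice 3 (add metadata) can be sketched as a small wrapper around each chunk. The `Chunk` dataclass and `make_chunk` helper are illustrative names, not part of any particular framework; the metadata fields follow the source/section/context recommendation above.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)

def make_chunk(text: str, source: str, section: str, position: int) -> Chunk:
    """Wrap chunk text with provenance metadata for filtering and citation."""
    return Chunk(text=text, metadata={
        "source": source,       # file path or URL the chunk came from
        "section": section,     # nearest heading, for display and filtered retrieval
        "position": position,   # ordinal within the document, for reassembly
    })
```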

## Impact on RAG Performance

Chunking quality directly affects:

| Metric                  | Impact of Good Chunking         | Impact of Bad Chunking         |
| ----------------------- | ------------------------------- | ------------------------------ |
| **Retrieval Precision** | Relevant chunks returned        | Irrelevant fragments retrieved |
| **Retrieval Recall**    | Complete information found      | Key details missed             |
| **Context Quality**     | Coherent, complete context      | Fragmented, incomplete context |
| **Answer Accuracy**     | Correct, well-supported answers | Hallucinations to fill gaps    |
| **Citation Quality**    | Clean source attribution        | Unclear or broken references   |
| **Embedding Quality**   | Meaningful semantic vectors     | Noisy, incoherent embeddings   |

## Quick Diagnostics

**Signs your chunking needs work:**

* ✗ Agents give partial answers that feel "cut off"
* ✗ Retrieved chunks don't contain complete thoughts
* ✗ Code examples are broken or incomplete
* ✗ Tables appear without headers or context
* ✗ References to "see above" or "as mentioned" without the referenced context
* ✗ High retrieval count but low answer quality

**Signs your chunking is working well:**

* ✓ Retrieved chunks are self-contained and meaningful
* ✓ Agents provide complete, coherent answers
* ✓ Citations point to relevant, complete content
* ✓ Code examples are syntactically complete
* ✓ Tables preserve their structure and relationships

## Advanced Considerations

### Semantic Chunking

Move beyond fixed token limits to:

* Split at semantic boundaries (topic shifts, section changes)
* Use embedding similarity to detect natural breakpoints
* Preserve logical units (arguments, explanations, examples)
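The embedding-similarity idea can be sketched with a toy bag-of-words cosine similarity standing in for a real embedding model (which is what production systems would use). `semantic_breakpoints`, the threshold value, and the bag-of-words vectors are all illustrative assumptions.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_breakpoints(sentences: list[str], threshold: float = 0.2) -> list[int]:
    """Indices where adjacent-sentence similarity drops below the threshold,
    suggesting a topic shift and hence a chunk boundary."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    return [i for i in range(1, len(vectors))
            if cosine(vectors[i - 1], vectors[i]) < threshold]
```

Swapping the toy vectors for real sentence embeddings keeps the same boundary-detection logic.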

### Hierarchical Chunking

Create multi-level chunk hierarchies:

* **Document level** - Overview and metadata
* **Section level** - Major topics and themes
* **Paragraph level** - Specific details and facts

This enables coarse-to-fine retrieval strategies.
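A minimal sketch of the three-level hierarchy, assuming the document has already been parsed into `{section_title: [paragraphs]}`; the `hierarchical_chunks` function and its ID scheme are hypothetical.

```python
def hierarchical_chunks(doc_id: str, sections: dict[str, list[str]]) -> list[dict]:
    """Emit document-, section-, and paragraph-level chunks so retrieval can
    start coarse (document/section) and drill down to paragraphs."""
    chunks = [{"level": "document", "id": doc_id,
               "text": " ".join(p for ps in sections.values() for p in ps)}]
    for title, paragraphs in sections.items():
        chunks.append({"level": "section", "id": f"{doc_id}#{title}",
                       "text": " ".join(paragraphs)})
        for i, para in enumerate(paragraphs):
            chunks.append({"level": "paragraph", "id": f"{doc_id}#{title}/{i}",
                           "text": para})
    return chunks
```

The shared ID prefix lets a retriever that matched a paragraph fetch its parent section for extra context.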

### Dynamic Chunking

Adjust chunk size based on:

* Document type and structure
* Content density and complexity
* Retrieval performance metrics
* User query patterns
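One way the adjustments above might look in code: a heuristic that shrinks the target size for dense content. The signals (code fences, pipes for tables, average word length) and the scaling factors are illustrative assumptions, not measured recommendations.

```python
def dynamic_chunk_size(text: str, base: int = 600) -> int:
    """Heuristically shrink the target chunk size for dense content
    (code blocks, tables) and keep it at the base for flowing prose."""
    if "```" in text or "|" in text:      # code fences or table pipes: keep chunks tight
        return int(base * 0.5)
    words = text.split()
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0
    if avg_word_len > 7:                  # jargon-heavy text: somewhat smaller chunks
        return int(base * 0.75)
    return base
```

Retrieval metrics and query patterns would then feed back into the thresholds over time.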

**Remember**: Perfect chunking is content-specific. What works for technical docs may not work for conversational data. Test, measure, and iterate based on your specific use case.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/rag-scenarios-and-solutions/chunking.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
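The `ask` request above can be issued from Python's standard library. This is a minimal sketch: the `ask_docs` helper is a hypothetical name, and the `fetch=False` default returns only the constructed URL so the function can be exercised without network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

PAGE_URL = "https://help.twig.so/rag-scenarios-and-solutions/chunking.md"

def ask_docs(question: str, fetch: bool = False) -> str:
    """Build (and optionally issue) the documented `ask` query against this page."""
    url = f"{PAGE_URL}?{urlencode({'ask': question})}"
    if fetch:
        # Network call; requires connectivity to help.twig.so.
        with urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8")
    return url
```

`urlencode` handles percent-escaping, so the question can be plain natural language.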
