# Chunking Strategies

Chunking breaks large documents into smaller pieces that your AI agents can process and retrieve effectively. The chunking strategy you choose directly affects the quality and relevance of responses.

## What is Chunking?

Chunking divides long documents into smaller segments (chunks) that:

* Fit within the context window of AI models
* Contain semantically coherent information
* Can be independently retrieved and understood
* Maintain sufficient context for accurate interpretation

## Why Chunking Matters

Proper chunking affects several critical aspects of your AI system:

* **Retrieval Precision**: Smaller, focused chunks help retrieve exactly what's needed
* **Context Preservation**: Well-chunked content maintains meaning without the full document
* **Performance**: Chunks sized for your content are faster to embed, index, and search
* **Cost Efficiency**: Smaller chunks reduce the number of tokens sent with each LLM call

## Chunking Strategies

### Fixed-Size Chunking

Split documents into chunks of a predetermined size.

**When to Use:**

* Uniform content without clear structural boundaries
* Quick processing with minimal overhead
* Content where exact boundaries are less critical

**Parameters:**

* **Chunk Size**: Number of characters or tokens per chunk (e.g., 512, 1000, 2000)
* **Overlap**: Number of characters/tokens to overlap between chunks (e.g., 50-200)

**Pros:**

* Simple and fast to implement
* Predictable chunk sizes
* Low computational overhead

**Cons:**

* May split sentences or paragraphs mid-thought
* Doesn't respect document structure
* Can lose semantic coherence

**Example Configuration:**

```json
{
  "strategy": "fixed-size",
  "chunkSize": 1000,
  "overlap": 100,
  "unit": "characters"
}
```
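The configuration above can be illustrated with a minimal Python sketch. The function name `fixed_size_chunks` is hypothetical and is not part of any product API; it only shows how `chunkSize` and `overlap` interact when the unit is characters.

```python
def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

For example, `fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)` produces three chunks, each starting with the last character of the previous one.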

### Semantic Chunking

Split documents based on semantic meaning and topic boundaries.

**When to Use:**

* Content with clear topic transitions
* Technical documentation with distinct sections
* Articles and blog posts with well-defined structure

**How It Works:**

* Analyzes text for semantic similarity
* Identifies topic boundaries using embeddings or NLP
* Creates chunks around natural transition points

**Pros:**

* Preserves semantic coherence
* Natural, meaningful segments
* Better retrieval accuracy

**Cons:**

* More computationally intensive
* Variable chunk sizes
* May require fine-tuning

**Example Configuration:**

```json
{
  "strategy": "semantic",
  "similarityThreshold": 0.7,
  "minChunkSize": 200,
  "maxChunkSize": 2000
}
```
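As a rough illustration of threshold-based splitting, the sketch below starts a new chunk whenever two adjacent sentences fall below `similarityThreshold`. It makes two loud assumptions: word-count vectors stand in for real embeddings, and `minChunkSize` is left out for brevity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def _vector(sentence: str) -> Counter:
    # Toy stand-in for an embedding: lowercase word counts (assumption).
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(text: str, similarity_threshold: float = 0.7, max_chunk_size: int = 2000) -> list[str]:
    """Group adjacent sentences until similarity drops or the chunk gets too long."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        if current:
            similar = _cosine(_vector(current[-1]), _vector(sentence)) >= similarity_threshold
            fits = len(" ".join(current)) + 1 + len(sentence) <= max_chunk_size
            if not (similar and fits):
                chunks.append(" ".join(current))  # topic boundary or size limit reached
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

With real embeddings the structure stays the same; only `_vector` and `_cosine` would change, and the threshold would likely need retuning.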

### Structural Chunking

Split documents based on their inherent structure (headings, paragraphs, sections).

**When to Use:**

* Well-structured documents (Markdown, HTML)
* Technical manuals with clear hierarchies
* Documentation with consistent formatting

**How It Works:**

* Identifies structural elements (h1, h2, paragraphs)
* Chunks based on hierarchy levels
* Maintains document outline

**Pros:**

* Respects document organization
* Preserves hierarchical context
* Intuitive chunk boundaries

**Cons:**

* Requires structured input
* Variable chunk sizes
* May create very large or very small chunks

**Example Configuration:**

```json
{
  "strategy": "structural",
  "splitLevel": "h2",
  "includeParentHeadings": true,
  "maxChunkSize": 3000
}
```
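One way to picture `splitLevel: "h2"` with `includeParentHeadings` is the following sketch for Markdown input. It is an assumption-laden simplification: it only recognizes `#` and `##` headings at the start of a line, does not skip heading-like lines inside code fences, and does not enforce `maxChunkSize`. The function name is hypothetical.

```python
def structural_chunks(markdown: str, include_parent_headings: bool = True) -> list[dict]:
    """Split Markdown at h2 headings, optionally carrying the parent h1 along."""
    chunks: list[dict] = []
    h1 = None
    heading = None
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading is not None or text:
            parent = h1 if include_parent_headings else None
            headings = [h for h in (parent, heading) if h]
            chunks.append({"headings": headings, "text": text})

    for line in markdown.splitlines():
        if line.startswith("# "):
            flush()
            h1, heading, body = line[2:].strip(), None, []
        elif line.startswith("## "):
            flush()
            heading, body = line[3:].strip(), []
        else:
            body.append(line)  # deeper headings and plain text stay in the current chunk
    flush()
    return chunks
```

Each returned chunk keeps its heading trail, so a retriever can show where in the document the text came from.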

### Recursive Character Splitting

Hierarchically split text using multiple separators in order of priority.

**When to Use:**

* Mixed content types
* When you want to maintain natural boundaries
* General-purpose chunking

**How It Works:**

1. Try splitting by paragraph (`\n\n`)
2. If a chunk is still too large, split it by sentence
3. If it is still too large, split it by word
4. As a last resort, split by character

**Pros:**

* Flexible and adaptive
* Maintains natural boundaries when possible
* Good general-purpose strategy

**Cons:**

* More complex logic
* May still need manual tuning
* Variable performance

**Example Configuration:**

```json
{
  "strategy": "recursive",
  "separators": ["\n\n", "\n", ". ", " "],
  "chunkSize": 1000,
  "overlap": 100
}
```
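The separator cascade can be sketched as a short recursive function. This is a simplified illustration, not the product's implementation: the name `recursive_split` is hypothetical, overlap is omitted, and the separator is dropped wherever a split happens.

```python
def recursive_split(
    text: str,
    chunk_size: int = 1000,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the highest-priority separator, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Last resort: hard split by characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks
```

Small pieces are merged back together up to `chunk_size`, so paragraphs stay whole whenever they fit.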

### Token-Based Chunking

Split documents based on token count rather than characters.

**When to Use:**

* When optimizing for LLM token limits
* Cost-sensitive applications
* Applications that need precise control over API usage

**How It Works:**

* Uses tokenizer to count actual tokens
* Splits to maintain token budget
* Accounts for model-specific tokenization

**Pros:**

* Precise token control
* Optimal for API cost management
* Model-aware chunking

**Cons:**

* Requires tokenizer overhead
* Model-specific implementation
* May not respect semantic boundaries

**Example Configuration:**

```json
{
  "strategy": "token-based",
  "maxTokens": 512,
  "overlap": 50,
  "tokenizer": "gpt-4"
}
```
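The sketch below shows the windowing logic behind `maxTokens` and `overlap`. As an explicit assumption, it defaults to whitespace splitting as a stand-in tokenizer; in practice you would pass the `encode` and `decode` functions of the tokenizer that matches your model. The function name `token_chunks` is hypothetical.

```python
from typing import Callable


def token_chunks(
    text: str,
    max_tokens: int = 512,
    overlap: int = 50,
    encode: Callable[[str], list] = str.split,    # stand-in tokenizer (assumption)
    decode: Callable[[list], str] = " ".join,     # inverse of the stand-in tokenizer
) -> list[str]:
    """Split text into windows of at most `max_tokens` tokens with shared overlap."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be at least 0 and smaller than max_tokens")

    tokens = encode(text)
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(decode(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last window already covers the remaining tokens
    return chunks
```

Because real tokenizers split words into sub-word pieces, token counts from a whitespace tokenizer will usually underestimate the counts your model sees.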

## Choosing the Right Strategy

### Content Type Considerations

| Content Type        | Recommended Strategy | Reasoning                            |
| ------------------- | -------------------- | ------------------------------------ |
| Technical Docs      | Structural           | Respects hierarchies and code blocks |
| Articles/Blogs      | Semantic             | Maintains topic coherence            |
| FAQs                | Structural           | Each Q&A is a natural chunk          |
| Legal Documents     | Recursive            | Preserves clauses and paragraphs     |
| Code Files          | Structural           | Respects functions and classes       |
| Conversational Data | Fixed-Size           | Uniform structure                    |

### Performance Considerations

* **Small Chunks (200-500 tokens)**: Better retrieval precision, more API calls
* **Medium Chunks (500-1000 tokens)**: Balanced approach for most use cases
* **Large Chunks (1000-2000 tokens)**: More context, fewer retrievals, may be less precise

## Advanced Techniques

### Chunk Overlap

Include overlapping content between adjacent chunks to maintain context continuity.

**Benefits:**

* Prevents information loss at boundaries
* Improves retrieval of concepts spanning chunks
* Provides additional context

**Best Practices:**

* Use 10-20% overlap for fixed-size chunks
* Adjust based on content type and chunk size
* Consider computational cost vs. benefit
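To make the cost side of overlap concrete, the following sketch counts how many fixed-size chunks a document produces for a given overlap. The helper `chunk_count` is hypothetical and assumes a simple sliding window that advances by `chunk_size - overlap` each step.

```python
def chunk_count(total_length: int, chunk_size: int, overlap: int) -> int:
    """Number of sliding-window chunks needed to cover `total_length` units."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    if total_length <= 0:
        return 0
    if total_length <= chunk_size:
        return 1
    step = chunk_size - overlap
    # Ceiling of (total_length - overlap) / step, using integer arithmetic.
    return (total_length - overlap + step - 1) // step
```

For a 10,000-character document with 1,000-character chunks, this gives 10 chunks with no overlap, 11 chunks with 10% overlap, and 13 chunks with 20% overlap, so storage and embedding work grow as overlap increases.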

### Metadata Enrichment

Add metadata to chunks for better filtering and context:

```json
{
  "chunk": "...",
  "metadata": {
    "source": "user-manual.pdf",
    "section": "Installation",
    "page": 15,
    "headings": ["Getting Started", "Installation"],
    "created_at": "2024-01-15",
    "doc_type": "manual"
  }
}
```
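Once chunks carry metadata in the shape shown above, filtering becomes a simple lookup. The sketch below is illustrative only: `filter_chunks` is a hypothetical helper that keeps chunks whose metadata matches every given key and value.

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Return chunks whose metadata equals every key/value pair in `criteria`."""
    return [
        chunk
        for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in criteria.items())
    ]


chunks = [
    {"chunk": "Run the installer.", "metadata": {"source": "user-manual.pdf", "section": "Installation", "doc_type": "manual"}},
    {"chunk": "Reset your password.", "metadata": {"source": "faq.md", "section": "Account", "doc_type": "faq"}},
]

# Restrict retrieval candidates to manual content before ranking.
manual_chunks = filter_chunks(chunks, doc_type="manual")
```

Filtering before similarity search narrows the candidate set, which can improve both relevance and speed.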

### Parent-Child Chunking

Create hierarchical chunk relationships:

* **Parent Chunks**: Larger context chunks (e.g., full sections)
* **Child Chunks**: Smaller retrievable chunks
* **Benefit**: Retrieve specific content but have access to broader context
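A minimal sketch of the parent-child relationship might look like the following. It assumes each section is a parent and that children are fixed-size slices of it; the names `ChildChunk`, `build_parent_child`, and `expand_to_parent` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ChildChunk:
    text: str
    parent_id: int  # index of the section this chunk was cut from


def build_parent_child(sections: list[str], child_size: int = 200) -> tuple[dict[int, str], list[ChildChunk]]:
    """Keep full sections as parents and slice each one into small child chunks."""
    parents = dict(enumerate(sections))
    children = [
        ChildChunk(section[i:i + child_size], parent_id)
        for parent_id, section in parents.items()
        for i in range(0, len(section), child_size)
    ]
    return parents, children


def expand_to_parent(child: ChildChunk, parents: dict[int, str]) -> str:
    """After retrieving a child, fetch the broader parent context."""
    return parents[child.parent_id]
```

The retriever searches over the children, and the matching parent is what gets passed to the model when broader context is needed.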

## Implementation Guide

### Step 1: Analyze Your Content

* Review document structure
* Identify natural boundaries
* Consider content density
* Assess variability

### Step 2: Select Initial Strategy

* Start with a recommended strategy for your content type
* Choose conservative chunk sizes
* Enable overlap initially

### Step 3: Test and Measure

* Process sample documents
* Review chunk quality
* Test retrieval accuracy
* Measure performance metrics
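One simple way to test retrieval accuracy is a hit rate over a small set of labeled queries. The sketch below assumes you have a `retrieve(query, k)` function that returns the IDs of the top `k` chunks; that function, and the name `hit_rate_at_k`, are hypothetical placeholders for your own retrieval setup.

```python
from typing import Callable


def hit_rate_at_k(
    labeled_queries: dict[str, str],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    hits = sum(
        1
        for query, expected_id in labeled_queries.items()
        if expected_id in retrieve(query, k)
    )
    return hits / len(labeled_queries)
```

Running the same labeled queries after each chunking change gives you a consistent number to compare strategies against.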

### Step 4: Iterate and Optimize

* Adjust chunk sizes based on results
* Try alternative strategies
* Fine-tune parameters
* Monitor ongoing performance

## Common Pitfalls

### Chunks Too Small

* **Problem**: Lost context, too many retrievals
* **Solution**: Increase chunk size or add overlap

### Chunks Too Large

* **Problem**: Irrelevant information included, slow processing
* **Solution**: Decrease chunk size or use more granular strategy

### Ignoring Structure

* **Problem**: Split mid-sentence or mid-concept
* **Solution**: Use structural or semantic chunking

### No Overlap

* **Problem**: Information loss at boundaries
* **Solution**: Add 10-20% overlap

### One-Size-Fits-All

* **Problem**: Poor performance across different content types
* **Solution**: Use content-specific strategies

## Monitoring Chunk Quality

Track these metrics to ensure optimal chunking:

* **Average Chunk Size**: Should be consistent with target
* **Chunk Size Distribution**: Watch for outliers
* **Retrieval Accuracy**: Measure relevance of retrieved chunks
* **User Satisfaction**: Track feedback on response quality
* **Token Usage**: Monitor API costs
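The size-related metrics above can be computed with a few lines of Python. This is a sketch under stated assumptions: size is measured in characters, and an outlier is any chunk more than `tolerance` (as a fraction) away from the target size. The function name and the default tolerance are hypothetical.

```python
from statistics import mean, median


def chunk_size_report(chunks: list[str], target_size: int, tolerance: float = 0.5) -> dict:
    """Summarize chunk sizes and list the indexes of chunks far from the target."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    sizes = [len(chunk) for chunk in chunks]
    low = target_size * (1 - tolerance)
    high = target_size * (1 + tolerance)
    return {
        "count": len(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "outliers": [i for i, size in enumerate(sizes) if size < low or size > high],
    }
```

A growing outlier list after a content update is often an early sign that the chunking parameters no longer fit the documents.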

## Best Practices

1. **Start Conservative**: Begin with medium-sized chunks and adjust
2. **Respect Boundaries**: Don't split sentences or code blocks mid-way
3. **Add Context**: Include headings or section titles in chunks
4. **Use Metadata**: Tag chunks with source, section, and category
5. **Test Thoroughly**: Validate chunking with real queries
6. **Iterate Regularly**: Refine based on performance data
7. **Document Decisions**: Keep track of why you chose specific strategies

## Next Steps

* [Synthetic Data](/product/data-prep/synthetic-data.md) - Enhance your chunks with generated content
* [Data Manipulations](/product/data-prep/data-manipulation.md) - Transform and enrich your chunks

