Context Window Overflow

The Problem

Retrieved chunks exceed the LLM's context window capacity, forcing truncation that drops critical information needed for accurate answers.

Symptoms

  • ❌ "Context length exceeded" errors

  • ❌ Later chunks cut off mid-sentence

  • ❌ Inconsistent answers (depending on what fits)

  • ❌ Cannot use all relevant retrieved docs

  • ❌ Quality degrades with more context

Real-World Example

LLM context window: 8,000 tokens
System prompt: 500 tokens
User query: 50 tokens
Response generation buffer: 1,000 tokens
Available for retrieval: 6,450 tokens (8,000 − 500 − 50 − 1,000)

Retrieved top-10 chunks (1,000 tokens each):
→ Total: 10,000 tokens
→ Exceeds available 6,450 tokens
→ Chunks 7–10 truncated or dropped (only the first 6 fit in full)

Most relevant chunk was #8 (truncated)
→ AI cannot see it
→ Gives incomplete answer
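
As a rough sketch of this arithmetic, the snippet below computes the remaining budget and walks the ranked chunks until the budget overflows. All numbers mirror the example above and are purely illustrative.

```python
# Token-budget arithmetic for the example above (all numbers illustrative).
CONTEXT_WINDOW = 8_000
SYSTEM_PROMPT = 500
USER_QUERY = 50
RESPONSE_BUFFER = 1_000

available = CONTEXT_WINDOW - SYSTEM_PROMPT - USER_QUERY - RESPONSE_BUFFER
print(f"Available for retrieved context: {available} tokens")  # 6450

chunk_tokens = [1_000] * 10  # top-10 retrieved chunks, ~1,000 tokens each
fitted, used = [], 0
for i, size in enumerate(chunk_tokens, start=1):
    if used + size > available:
        print(f"Chunk #{i} and later do not fit ({used} tokens already used)")
        break
    fitted.append(i)
    used += size

print(f"Chunks included in full: {fitted}")  # only the first 6 chunks fit
```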

Deep Technical Analysis

Context Window Constraints

Models have fixed input limits:

Common Limits:

The Token Budget:
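
In practice the budget should come from real token counts rather than estimates. The sketch below uses the tiktoken library with its cl100k_base encoding as the tokenizer, which is an assumption; any tokenizer matched to the target model works, and the prompt strings are hypothetical.

```python
import tiktoken  # assumption: using tiktoken as the tokenizer; substitute your model's own

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Exact token count under the chosen encoding."""
    return len(enc.encode(text))

def retrieval_budget(context_window: int, system_prompt: str,
                     user_query: str, response_buffer: int) -> int:
    """Tokens left for retrieved chunks after the fixed costs are paid."""
    fixed = count_tokens(system_prompt) + count_tokens(user_query) + response_buffer
    return context_window - fixed

# Hypothetical usage
budget = retrieval_budget(8_000, "You are a helpful assistant...",
                          "What is our refund policy?", 1_000)
print(budget)
```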

Truncation Strategies

When context exceeds limit:

Last-In-First-Out (LIFO):

Sliding Window:

Smart Truncation:
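
One way to realize relevance-aware ("smart") truncation, as a sketch: keep chunks in descending relevance order, trim the first chunk that no longer fits at a sentence boundary rather than mid-sentence, and drop everything ranked below it. The function name, the regex sentence splitter, and the injected count_tokens callable are assumptions.

```python
import re
from typing import Callable

def truncate_by_rank(chunks: list[str], budget: int,
                     count_tokens: Callable[[str], int]) -> list[str]:
    """Keep whole chunks in relevance order; trim the first overflowing chunk
    at a sentence boundary and drop everything ranked below it."""
    kept, used = [], 0
    for chunk in chunks:                     # chunks assumed pre-sorted by relevance
        size = count_tokens(chunk)
        if used + size <= budget:
            kept.append(chunk)
            used += size
            continue
        remaining = budget - used
        partial = []
        for sentence in re.split(r"(?<=[.!?])\s+", chunk):
            s_size = count_tokens(sentence)
            if s_size > remaining:
                break
            partial.append(sentence)
            remaining -= s_size
        if partial:
            kept.append(" ".join(partial))
        break                                # lower-ranked chunks are dropped entirely
    return kept
```

Note that this is the opposite of LIFO dropping: rank, not arrival order, decides what survives.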

Chunk Size vs Retrieval K Trade-off

Competing objectives:

Large Chunks:

Small Chunks:

Dynamic Adjustment:
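
The trade-off can be handled dynamically by deriving k from the token budget and the configured chunk size instead of hard-coding it. A minimal sketch, with illustrative numbers and bounds:

```python
# Dynamic adjustment sketch: pick k from the token budget and the chunk size
# rather than hard-coding it. The bounds and numbers are illustrative.

def dynamic_k(budget_tokens: int, chunk_size_tokens: int,
              k_min: int = 2, k_max: int = 20) -> int:
    k = budget_tokens // max(chunk_size_tokens, 1)
    return max(k_min, min(k, k_max))

print(dynamic_k(6_450, 1_000))  # 6  -> large chunks, few of them
print(dynamic_k(6_450, 250))    # 20 (capped) -> small chunks, broader coverage
```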

Lossy Compression Techniques

Summarize context to fit:

Extractive Summarization:

Abstractive Summarization:
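
As a crude stand-in for an extractive summarizer, the sketch below keeps only the sentences of a chunk that share the most terms with the query, preserving their original order. A production system would use a trained extractive model or a reranker; the term-overlap scoring here is an assumption for illustration.

```python
import re

def extractive_compress(chunk: str, query: str, keep: int = 3) -> str:
    """Keep the `keep` sentences with the highest term overlap with the query,
    in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    q_terms = set(query.lower().split())
    scored = sorted(
        enumerate(sentences),
        key=lambda pair: len(q_terms & set(pair[1].lower().split())),
        reverse=True,
    )
    top = sorted(idx for idx, _ in scored[:keep])   # restore document order
    return " ".join(sentences[i] for i in top)
```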

Hierarchical Context Assembly

Multi-level retrieval:

Coarse-to-Fine:

Section-Level Granularity:
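
A coarse-to-fine pass can be sketched over a toy corpus: score whole sections first, then rank individual chunks only within the winning sections. The term-overlap scoring and the corpus layout are assumptions standing in for embedding search.

```python
# Coarse-to-fine sketch over a toy corpus. Scoring is plain term overlap purely
# for illustration; a real system would use embeddings at both levels.

def overlap(text: str, query: str) -> int:
    return len(set(text.lower().split()) & set(query.lower().split()))

def coarse_to_fine(corpus: dict[str, list[str]], query: str,
                   top_sections: int = 2, top_chunks: int = 3) -> list[str]:
    # Coarse pass: rank sections by the combined score of their chunks.
    sections = sorted(corpus,
                      key=lambda s: sum(overlap(c, query) for c in corpus[s]),
                      reverse=True)[:top_sections]
    # Fine pass: rank chunks only within the selected sections.
    candidates = [c for s in sections for c in corpus[s]]
    return sorted(candidates, key=lambda c: overlap(c, query), reverse=True)[:top_chunks]

# Hypothetical usage
corpus = {
    "Billing": ["Refunds are issued within 14 days.", "Invoices are sent monthly."],
    "Security": ["All data is encrypted at rest.", "Access requires MFA."],
}
print(coarse_to_fine(corpus, "How long do refunds take?"))
```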

Long-Context Model Considerations

Models with larger windows:

Claude 200K Benefits:

The Lost-in-the-Middle Problem:

Optimal Positioning:
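
One common mitigation for the lost-in-the-middle effect, sketched under the assumption that the model attends best to the start and end of the prompt: interleave the ranked chunks so the strongest land at both edges and the weakest end up in the middle. The ordering scheme is one heuristic, not the only one.

```python
# Positioning sketch: push the most relevant chunks to the edges of the context
# and the least relevant toward the middle.

def edge_first_order(ranked_chunks: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):   # index 0 = most relevant
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]                    # best chunks at both edges

print(edge_first_order(["c1", "c2", "c3", "c4", "c5"]))
# ['c1', 'c3', 'c5', 'c4', 'c2']  -> #1 first, #2 last, weakest in the middle
```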


How to Solve

Combine several of the strategies above:

  • Implement dynamic chunk sizing based on query type
  • Use extractive summarization for less critical chunks
  • Prioritize top-K chunks and truncate lower-ranked ones
  • Consider long-context models (e.g., Claude 200K) for comprehensive documents
  • Apply smart truncation that preserves key information

See Context Management.
