# Context Window Utilization

## The Problem

Without monitoring of how efficiently the LLM context window is used, a pipeline wastes tokens, truncates context, or runs with a suboptimal retrieval configuration.

### Symptoms

* ❌ Percentage of the context window in use is unknown
* ❌ Frequent, unexplained context overflows
* ❌ Wasteful token usage
* ❌ No basis for tuning K (the number of retrieved chunks)
* ❌ No visibility into the token budget

### Real-World Example

```
Configuration:
→ LLM: GPT-4 (8K context)
→ Retrieval: K=10 chunks
→ Chunk size: ~500 tokens each

Observed:
→ Context overflow errors: 15% of queries

Investigation:
→ System prompt: 300 tokens
→ User query: 100 tokens average
→ Retrieved context: 10 × 500 = 5,000 tokens
→ Response budget: 1,000 tokens
→ Total: 6,400 tokens (fits in 8K)

Why does it overflow?
→ No monitoring of actual token usage
→ Some chunks are larger than 500 tokens (outliers)
→ Some queries are longer (max: 800 tokens)
→ Total occasionally exceeds 8K
```

***

## Deep Technical Analysis

### Token Accounting

**Component Breakdown:**

```
Total context (8,000 tokens):
1. System prompt: 300 tokens (fixed)
2. Conversation history: 0-2,000 tokens (variable)
3. Retrieved context: 2,000-6,000 tokens (variable)
4. Current query: 50-500 tokens (variable)
5. Response generation: 500-2,000 tokens (reserved)

Monitor each component:
→ Which consumes most?
→ Where to optimize?
```
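A minimal sketch of per-component accounting, assuming token counts are already available from whatever tokenizer the pipeline uses. The `TokenUsage` class and its field names are hypothetical; the sample numbers come from the real-world example earlier on this page.

```python
from dataclasses import dataclass, asdict


@dataclass
class TokenUsage:
    """Token counts for one request, one field per component (hypothetical names)."""
    system_prompt: int
    history: int
    retrieved_context: int
    query: int
    response_reserved: int

    def total(self) -> int:
        # Sum of all components, including the reserved response budget.
        return sum(asdict(self).values())

    def largest_component(self) -> str:
        # Name of the component that consumes the most tokens.
        counts = asdict(self)
        return max(counts, key=counts.get)

    def shares(self) -> dict:
        # Fraction of the total used by each component (empty if total is 0).
        total = self.total()
        if total == 0:
            return {}
        return {name: count / total for name, count in asdict(self).items()}


usage = TokenUsage(
    system_prompt=300,
    history=0,
    retrieved_context=5000,
    query=100,
    response_reserved=1000,
)
print(usage.total())              # 6400
print(usage.largest_component())  # retrieved_context
```

Recording these fields per request makes it possible to answer "which consumes most?" from logs rather than by guesswork.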

**Utilization Percentage:**

```
Metric: Context utilization
= (used_tokens / max_tokens) × 100%

Examples:
→ Query A: 6,400 / 8,000 = 80% (good)
→ Query B: 8,500 / 8,000 = 106% (overflow!)
→ Query C: 3,200 / 8,000 = 40% (underutilized)

Target: 70-85% utilization
→ Below 70%: Retrieving too few chunks
→ Above 85%: Risk of overflow
```
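The utilization metric and the 70-85% target band can be sketched as two small functions. The function names and the status labels are assumptions made for illustration; the thresholds are the ones stated above.

```python
def utilization_pct(used_tokens: int, max_tokens: int) -> float:
    """Context utilization as a percentage of the model's context window."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return 100 * used_tokens / max_tokens


def classify(pct: float, low: float = 70.0, high: float = 85.0) -> str:
    """Map a utilization percentage onto the target band (labels are illustrative)."""
    if pct > 100:
        return "overflow"
    if pct > high:
        return "overflow_risk"
    if pct < low:
        return "underutilized"
    return "on_target"


# The three example queries from the text, against an 8,000-token window.
for name, used in [("A", 6400), ("B", 8500), ("C", 3200)]:
    pct = utilization_pct(used, 8000)
    print(f"Query {name}: {pct:.0f}% -> {classify(pct)}")
```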

### Dynamic K Adjustment

**Token-Based K Selection:**

```
Instead of fixed K=10:
→ Retrieve chunks until token budget nearly full

Algorithm:
1. Calculate available: 8,000 - 300 (system) - 200 (query) - 2,000 (response buffer) = 5,500 tokens
2. Retrieve chunks sequentially:
   - Chunk 1: 450 tokens (total: 450)
   - Chunk 2: 520 tokens (total: 970)
   - ...
   - Chunk 11: 480 tokens (total: 5,520 → exceeds 5,500)
3. Stop at Chunk 10 (total: 5,040 tokens), so K=10 for this query

Adaptive K based on token budget
```
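The stop-when-full loop above can be sketched as follows, assuming chunks arrive ranked by relevance and their token counts are known. `select_chunks` is a hypothetical helper, and the chunk sizes in the example are made-up illustrative values, not the ones elided in the text.

```python
def select_chunks(chunk_token_counts, budget):
    """Count how many leading chunks fit in the token budget.

    chunk_token_counts: token counts in relevance order (best first).
    Returns (number_of_chunks_kept, tokens_used).
    """
    total = 0
    kept = 0
    for count in chunk_token_counts:
        if total + count > budget:
            break  # the next chunk would exceed the budget, so stop here
        total += count
        kept += 1
    return kept, total


# Illustrative chunk sizes only (assumed values).
sizes = [450, 520, 610, 480, 530, 490, 700, 510, 470, 500, 480]
k, used = select_chunks(sizes, budget=5500)
print(k, used)  # 10 5260
```

Stopping at the first chunk that does not fit keeps the ranking intact. Skipping an oversized chunk and trying the next one is a possible variant, but it changes which chunks the model sees.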

**Query Complexity Adaptation:**

```
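Simple query: "What is X?"
→ Short query (30 tokens), short answer expected (~500-token response buffer)
→ Context budget: 8,000 - 300 - 30 - 500 = 7,170 tokens
→ K=14 possible at ~500 tokens per chunk

Complex query: "Explain how X, Y, and Z interact..."
→ Long query (200 tokens), long answer expected (~2,000-token response buffer)
→ Context budget: 8,000 - 300 - 200 - 2,000 = 5,500 tokens
→ K≈11

Dynamic based on query length and expected response length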
```
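A rough way to turn the remaining budget into a K estimate is shown below. `estimate_k` and its defaults (8,000-token window, 300-token system prompt, ~500-token chunks) are assumptions drawn from the examples on this page, and the response buffer values passed in are illustrative. The resulting K depends heavily on those choices.

```python
def estimate_k(query_tokens, response_buffer,
               avg_chunk_tokens=500, max_tokens=8000, system_tokens=300):
    """Estimate how many average-sized chunks fit after fixed costs are subtracted."""
    context_budget = max_tokens - system_tokens - query_tokens - response_buffer
    # Floor division: a partially fitting chunk does not count.
    return max(context_budget // avg_chunk_tokens, 0)


simple_k = estimate_k(query_tokens=30, response_buffer=500)
complex_k = estimate_k(query_tokens=200, response_buffer=2000)
print(simple_k, complex_k)
```

Because this uses an average chunk size, it is only a starting point. Counting actual chunk tokens at retrieval time gives the exact cutoff.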

### Truncation Strategy Monitoring

**Where Truncation Happens:**

```
Monitor truncation point:
→ Truncated at chunk #8 of 10
→ Lost chunks 9-10

Were important chunks lost?
→ Chunk #9 score: 0.72 (relevant)
→ Chunk #10 score: 0.68 (marginal)

Impact: Moderate (lost one relevant chunk)
```
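The impact check above can be sketched as a function that lists the chunks cut off and flags those above a relevance threshold. The 0.70 threshold is an assumption chosen so that 0.72 counts as relevant and 0.68 as marginal, and the scores for chunks 1-8 in the example are placeholders.

```python
def lost_chunks(scores, kept, relevance_threshold=0.70):
    """Describe the chunks dropped by truncation.

    scores: relevance scores in rank order (best first).
    kept: number of leading chunks that made it into the context.
    """
    lost = []
    for rank, score in enumerate(scores[kept:], start=kept + 1):
        lost.append({
            "rank": rank,
            "score": score,
            "relevant": score >= relevance_threshold,
        })
    return lost


# Scores for ranks 1-8 are placeholders; ranks 9 and 10 match the text.
scores = [0.91, 0.88, 0.85, 0.83, 0.80, 0.78, 0.76, 0.74, 0.72, 0.68]
dropped = lost_chunks(scores, kept=8)
relevant_lost = sum(1 for chunk in dropped if chunk["relevant"])
print(relevant_lost)  # 1
```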

**Sliding Window Stats:**

```
Track truncation patterns:
→ 5% of queries: Truncate at chunk 5-7
→ 10% of queries: Truncate at chunk 8-10
→ 85% of queries: No truncation

Optimize:
→ Reduce K for longer queries
→ Increase response buffer
```
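One way to produce these statistics is to record, per query, the chunk index at which truncation occurred (or `None` when nothing was cut) and bucket the results. The function name, the bucket boundaries, and the `"none"`/`"other"` labels are illustrative assumptions.

```python
from collections import Counter


def truncation_distribution(truncation_points, buckets=((5, 7), (8, 10))):
    """Percentage of queries per truncation bucket.

    truncation_points: one entry per query; the chunk index where truncation
    happened, or None if the query was not truncated.
    """
    labels = Counter()
    for point in truncation_points:
        if point is None:
            labels["none"] += 1
            continue
        for low, high in buckets:
            if low <= point <= high:
                labels[f"{low}-{high}"] += 1
                break
        else:
            # Truncation outside every configured bucket.
            labels["other"] += 1
    total = len(truncation_points)
    return {label: 100 * count / total for label, count in labels.items()}


# Synthetic sample shaped like the pattern in the text (100 queries).
points = [6] * 5 + [9] * 10 + [None] * 85
dist = truncation_distribution(points)
print(dist)
```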

### Optimization Opportunities

**Token Waste Detection:**

```
Query uses 40% of context:
→ Only 3,200 / 8,000 tokens
→ Underutilized

Could increase K:
→ From K=5 to K=8
→ More context for better answers
```
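A sketch of how much room an underutilized query leaves, assuming ~500-token chunks and the 85% ceiling from the target band. `extra_chunks_within_target` is a hypothetical helper and returns an upper bound, not a recommendation.

```python
def extra_chunks_within_target(used_tokens, max_tokens=8000,
                               avg_chunk_tokens=500, target_pct=85.0):
    """Upper bound on additional average-sized chunks before hitting the target ceiling."""
    ceiling = int(max_tokens * target_pct / 100)
    headroom = ceiling - used_tokens
    return max(headroom // avg_chunk_tokens, 0)


extra = extra_chunks_within_target(used_tokens=3200)
print(extra)  # 7
```

Under these assumptions, moving from K=5 to K=8 adds three chunks and stays well inside the bound. Whether the extra chunks improve answers still depends on their relevance scores.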

**Compression Opportunities:**

```
System prompt: 300 tokens
→ Can compress to 200 tokens?
→ Saves 100 tokens
→ More for context

Conversation history: 1,500 tokens
→ Summarize to 500 tokens?
→ Saves 1,000 tokens
```
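The savings above translate into retrieval capacity as follows. This is plain arithmetic under the assumption of ~500-token chunks; `chunks_gained` is a hypothetical helper, and it does not perform the compression or summarization itself.

```python
def chunks_gained(savings_by_component, avg_chunk_tokens=500):
    """Number of extra average-sized chunks that fit in the tokens saved."""
    return sum(savings_by_component.values()) // avg_chunk_tokens


savings = {
    "system_prompt": 300 - 200,   # compressed prompt saves 100 tokens
    "history": 1500 - 500,        # summarized history saves 1,000 tokens
}
print(sum(savings.values()))   # 1100
print(chunks_gained(savings))  # 2
```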

***

## How to Solve

* Log token usage per component (system prompt, query, context, response).
* Calculate utilization % (used / max).
* Monitor truncation frequency and position.
* Implement dynamic K based on the available token budget.
* Alert on high utilization (>85%) or overflow.
* Track token distribution across queries.
* Optimize underutilized queries (increase K).
* Compress the system prompt or conversation history if needed.

See [Token Utilization](/rag-scenarios-and-solutions/monitoring/context-utilization.md).
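The logging and alerting steps can be sketched with Python's standard `logging` module. The logger name, the record fields, and the mapping of thresholds to log levels are assumptions; a production setup would route warnings and errors to whatever alerting system is in place.

```python
import json
import logging

logger = logging.getLogger("context_utilization")  # hypothetical logger name


def log_usage(components, max_tokens=8000, alert_pct=85.0):
    """Log one request's token usage and return the log level that was used.

    components: mapping of component name to token count.
    """
    used = sum(components.values())
    pct = 100 * used / max_tokens
    record = {
        "components": components,
        "used_tokens": used,
        "max_tokens": max_tokens,
        "utilization_pct": round(pct, 1),
    }
    if pct > 100:
        level = logging.ERROR      # overflow
    elif pct > alert_pct:
        level = logging.WARNING    # high utilization, overflow risk
    else:
        level = logging.INFO
    logger.log(level, json.dumps(record))
    return level


log_usage({"system_prompt": 300, "query": 100, "context": 5000, "response": 1000})
```

Emitting one JSON record per request keeps the component breakdown queryable, so the token distribution across queries can be computed from the logs later.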


