# Token Limit Exceeded

## The Problem

Responses hit the model's token limit mid-answer, so output is cut off, or generation fails entirely when the input leaves no room for a reply.

### Symptoms

* ❌ Responses end abruptly mid-sentence
* ❌ "Maximum tokens reached" errors
* ❌ Incomplete lists or code examples
* ❌ Users must ask the AI to "continue"
* ❌ Cannot generate long-form content

### Real-World Example

```
User asks: "List all API endpoints"
AI starts response:
"Here are the API endpoints:
1. POST /auth/login - User authentication
2. GET /users - Retrieve user list
3. POST /users - Create new user
4. GET /users/{id} - Get user details
5. PUT /users/{id} - Update user
..." 

[Token limit reached at 1000 tokens]

Response cuts off at endpoint #15 of 50 total
User sees incomplete list
```

***

## Deep Technical Analysis

### Output Token Limits

The output limit is set separately from the input context limit, although both draw on the same context window:

**Max Tokens Parameter:**

```
API calls specify max_tokens:
→ Example (GPT-4): max_tokens=4096
→ Controls response length
→ Prevents runaway generation

Trade-offs:
→ Too low: Truncated responses
→ Too high: Longer latency, higher cost
→ Must balance
```

**Automatic Truncation:**

```
LLM generates token-by-token until:
1. Reaches natural stop (EOS token)
2. Hits max_tokens limit
3. Encounters stop sequence

If generation stops at #2 (max_tokens):
→ Cuts off wherever it stopped
→ No graceful ending
→ Incomplete output
```
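Truncation can usually be detected after the fact, because many chat-completion APIs report why generation stopped. The sketch below assumes an OpenAI-style response shape, where `choices[0].finish_reason` is `"length"` when the limit was hit. Field names differ between providers, and the payloads here are hand-written examples, not real API output.

```python
def is_truncated(response: dict) -> bool:
    """Return True if the first choice stopped because it hit the token limit."""
    choice = response["choices"][0]
    # Assumption: OpenAI-style field name and value; other APIs may differ.
    return choice.get("finish_reason") == "length"


# Hand-written example payloads for illustration only.
cut_off = {"choices": [{"finish_reason": "length",
                        "message": {"content": "15. GET /ord"}}]}
complete = {"choices": [{"finish_reason": "stop",
                         "message": {"content": "Done."}}]}

print(is_truncated(cut_off))   # True
print(is_truncated(complete))  # False
```

A check like this lets the application offer a "Continue" action or retry with a larger limit instead of showing a silently incomplete answer.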

### Estimating Response Length

Predicting token needs:

**Query Type Heuristics:**

```
Factual query: "What is X?"
→ Expected: 50-200 tokens
→ Set max_tokens: 300 (buffer)

List query: "List all..."
→ Unknown length
→ Set high limit or paginate

Explanatory: "How does X work?"
→ Expected: 300-800 tokens
→ Set max_tokens: 1000
```
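The heuristics above could be wired up as a small lookup. This is a minimal sketch: the keyword rules and the `max_tokens_for_query` name are hypothetical, and the 4000-token ceiling for list queries is an assumed stand-in for "high limit".

```python
def max_tokens_for_query(query: str) -> int:
    """Pick a max_tokens value from crude keyword rules (illustrative only)."""
    q = query.strip().lower()
    if q.startswith(("list ", "enumerate ")):
        return 4000   # unknown length: assumed high ceiling
    if q.startswith(("how ", "why ", "explain ")):
        return 1000   # explanatory: 300-800 expected, plus buffer
    return 300        # factual: 50-200 expected, plus buffer


print(max_tokens_for_query("What is X?"))              # 300
print(max_tokens_for_query("List all API endpoints"))  # 4000
print(max_tokens_for_query("How does X work?"))        # 1000
```

Keyword matching is brittle; a production system would more likely use a classifier or the retrieval results to pick the bucket.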

**Dynamic Allocation:**

```
Analyze query:
→ Count items in retrieved context
→ "50 API endpoints found"
→ Estimate: 50 × 30 tokens/item = 1500 tokens
→ Set max_tokens: 2000

Adaptive based on content
```
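The estimate above is simple arithmetic, sketched here under a few assumptions: 30 tokens per item is the figure from the example, the one-third buffer is chosen so that 1500 becomes 2000, and the `ceiling` default is a hypothetical model output cap.

```python
def estimate_max_tokens(item_count: int,
                        tokens_per_item: int = 30,
                        ceiling: int = 4096) -> int:
    """Estimate max_tokens from the number of items found in retrieved context."""
    estimate = item_count * tokens_per_item
    with_buffer = estimate + estimate // 3   # add roughly one third as headroom
    return min(with_buffer, ceiling)         # never exceed the model's output cap


print(estimate_max_tokens(50))   # 2000
print(estimate_max_tokens(200))  # 4096 (capped)
```

When the estimate hits the ceiling, the response will not fit in one message, which is the signal to paginate.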

### Pagination Strategies

Breaking responses into chunks:

**Explicit Pagination:**

```
System prompt: "If response exceeds 800 tokens, end with
[Continued in next message] and stop."

User experience:
→ AI sends first part
→ User clicks "Continue"
→ AI resumes with context

Preserves continuity across messages
```
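On the application side, the marker from the system prompt has to be detected and stripped before parts are joined. The sketch below simulates repeated "Continue" clicks in a loop. The `generate` callable is a hypothetical stand-in for the real model call, and `fake_generate` returns scripted text only so the example runs on its own.

```python
from typing import Callable, List

MARKER = "[Continued in next message]"


def collect_full_answer(generate: Callable[[List[str]], str],
                        max_parts: int = 5) -> str:
    """Request parts until one arrives without the continuation marker."""
    parts: List[str] = []
    for _ in range(max_parts):            # hard stop to avoid endless loops
        part = generate(parts).rstrip()
        if part.endswith(MARKER):
            parts.append(part[: -len(MARKER)].rstrip())
        else:
            parts.append(part)
            break
    return "\n".join(parts)


# Scripted stand-in for the model: one reply per previous part.
scripted = ["1. POST /auth/login\n[Continued in next message]", "2. GET /users"]


def fake_generate(previous_parts: List[str]) -> str:
    return scripted[len(previous_parts)]


print(collect_full_answer(fake_generate))
```

Passing the earlier parts back into `generate` is what lets the model resume with context rather than start over.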

**Automatic Chunking:**

```
Backend splits long responses:
1. Generate full response (internally)
2. Split at natural boundaries (paragraphs)
3. Send as multiple messages
4. Stream to user sequentially

Transparent to user
```
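Step 2 above, splitting at paragraph boundaries, might look like the following sketch. The `rough_token_count` helper is an assumption (about four characters per token for English text); a real system would use the model's own tokenizer.

```python
from typing import List


def rough_token_count(text: str) -> int:
    """Crude estimate: assumes about 4 characters per token."""
    return max(1, len(text) // 4)


def chunk_by_paragraph(text: str, max_tokens: int) -> List[str]:
    """Group whole paragraphs into chunks that stay within max_tokens."""
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        cost = rough_token_count(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_tokens + cost > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(paragraph)
        current_tokens += cost
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a" * 40 + "\n\n" + "b" * 40 + "\n\n" + "c" * 40
print(len(chunk_by_paragraph(sample, max_tokens=20)))  # 2
```

A single paragraph larger than the budget still becomes its own oversized chunk here; splitting inside paragraphs would need an extra sentence-level pass.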

### Summarization vs Detail

Adjusting verbosity:

**Conciseness Prompting:**

```
Add to system prompt:
"Be concise. Provide direct answers without unnecessary
elaboration."

Reduces token usage:
→ Same information
→ Fewer words
→ Fits in token budget
```

**Detail Level Control:**

```
User specifies preference:
→ "Give brief overview" (200 tokens)
→ "Explain in detail" (1000 tokens)

Adjust max_tokens accordingly
```
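One way to connect the user's preference to both the prompt and the limit is a small settings table. This is a sketch: the level names, the `request_settings` helper, and the wording of the detailed instruction are hypothetical; the budgets are the figures from the example above.

```python
DETAIL_BUDGETS = {"brief": 200, "detailed": 1000}

INSTRUCTIONS = {
    "brief": "Be concise. Provide direct answers without unnecessary elaboration.",
    "detailed": "Explain in detail.",
}


def request_settings(detail: str) -> dict:
    """Return max_tokens and a system-prompt hint for the chosen detail level."""
    level = detail if detail in DETAIL_BUDGETS else "brief"  # default to brief
    return {"max_tokens": DETAIL_BUDGETS[level],
            "system_hint": INSTRUCTIONS[level]}


print(request_settings("detailed")["max_tokens"])  # 1000
print(request_settings("unknown")["max_tokens"])   # 200
```

Changing the limit and the instruction together matters: lowering `max_tokens` alone only truncates, while the instruction is what makes the model aim for a shorter answer.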

### Token Accounting

Tracking usage:

**Input + Output Budget:**

```
Total model capacity (context window): 8,000 tokens

Input (6,000 tokens):
→ System prompt: 300
→ Context: 5,500
→ Query: 200

Remaining: 2,000 tokens
→ Maximum possible response length
→ Set max_tokens ≤ 2,000
```
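The budget calculation above can be expressed directly. A minimal sketch, treating the window as exactly 8,000 tokens; the `safety_margin` parameter is an added assumption to absorb tokenizer estimation error.

```python
def response_budget(context_window: int,
                    system_prompt: int,
                    context: int,
                    query: int,
                    safety_margin: int = 0) -> int:
    """Return the largest max_tokens that still fits in the context window."""
    used = system_prompt + context + query
    remaining = context_window - used - safety_margin
    if remaining <= 0:
        raise ValueError("Input leaves no room for a response; trim the context.")
    return remaining


print(response_budget(8000, system_prompt=300, context=5500, query=200))  # 2000
```

Raising an error when nothing remains makes the "preventing generation entirely" case explicit, so the caller can shrink the retrieved context before calling the model.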

**Conversation History:**

```
Multi-turn chat accumulates:
→ Turn 1: 500 tokens (in + out)
→ Turn 2: 600 tokens
→ Turn 3: 700 tokens
→ Total: 1,800 tokens in history

Context window filling up:
→ Less space for future responses
→ Must prune old turns
```
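Pruning old turns can be as simple as keeping the newest turns that fit a history budget. The sketch below works on per-turn token counts only; the `prune_history` name and the budget value are hypothetical, and a real system might summarize dropped turns rather than discard them.

```python
from typing import List


def prune_history(turn_tokens: List[int], budget: int) -> List[int]:
    """Return indices of the most recent turns whose total fits the budget."""
    kept: List[int] = []
    total = 0
    # Walk backwards so the newest turns are kept first.
    for index in range(len(turn_tokens) - 1, -1, -1):
        if total + turn_tokens[index] > budget:
            break
        total += turn_tokens[index]
        kept.append(index)
    return sorted(kept)


# Turns from the example above: 500, 600, 700 tokens.
print(prune_history([500, 600, 700], budget=1500))  # [1, 2]
print(prune_history([500, 600, 700], budget=1800))  # [0, 1, 2]
```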

### Response Compression

Fitting more in less space:

**Structured Formats:**

```
Instead of prose:
"The API rate limit is 1000 requests per hour. If you 
exceed this limit, you will receive a 429 error..."

Use structured:
{
  "rate_limit": "1000/hour",
  "error_code": 429,
  "retry_after": "60 seconds"
}

Same info, usually fewer tokens
```

**Tables Over Lists:**

```
Verbose list (200 tokens):
"Endpoint 1: POST /auth/login - Used for authentication...
Endpoint 2: GET /users - Retrieves user list..."

Table (120 tokens):
| Method | Path | Description |
|--------|------|-------------|
| POST | /auth/login | Authentication |
| GET | /users | User list |
```
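If the endpoint data is already available in structured form, the table can be built in code and handed to the user or the prompt, rather than asking the model to write it. A minimal sketch, assuming endpoints arrive as `(method, path, description)` tuples; the `endpoints_table` helper is hypothetical.

```python
from typing import Iterable, Tuple


def endpoints_table(endpoints: Iterable[Tuple[str, str, str]]) -> str:
    """Render (method, path, description) tuples as a Markdown table."""
    lines = ["| Method | Path | Description |",
             "|--------|------|-------------|"]
    for method, path, description in endpoints:
        lines.append(f"| {method} | {path} | {description} |")
    return "\n".join(lines)


print(endpoints_table([
    ("POST", "/auth/login", "Authentication"),
    ("GET", "/users", "User list"),
]))
```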

***

## How to Solve

**Set max\_tokens dynamically based on query type, paginate long outputs, use conciseness prompts to reduce verbosity, prefer structured formats (tables, JSON) over prose, and track token usage so the context can be adjusted accordingly.** See [Token Management](/rag-scenarios-and-solutions/llm/token-limit.md).

