Code Blocks Split Wrong

## The Problem

Code snippets are split mid-function or mid-block, breaking syntax and making the code incomprehensible in retrieval results.

## Symptoms

❌ Retrieved code missing opening/closing braces
❌ Function definitions split from their bodies
❌ Indentation broken across chunks
❌ AI generates invalid code based on partial snippets
❌ Import statements separated from usage

## Real-World Example

Original documentation:

```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
    if not user or not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid credentials")

    token = generate_jwt(user.id)
    return {"token": token, "expires": 3600}
```

Suppose a 512-token chunk boundary happens to fall inside this function, right after the first `if` line.

Chunk 1 ends with:

```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
```

Chunk 2 starts with:

```python
        raise ValueError("Credentials required")

    user = db.query(User).filter_by(username=username).first()
```

Result: neither chunk is syntactically valid on its own. Chunk 1 ends on an `if` with no body, and chunk 2 opens at an indentation level with no enclosing block.
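One way to confirm that a split has broken the code is to try parsing each chunk. The sketch below uses Python's standard `ast` module; `is_valid_python` is a hypothetical helper name, and the docstring is omitted from the sample chunks for brevity.

```python
import ast


def is_valid_python(snippet: str) -> bool:
    """Return True if the snippet parses as a complete Python module."""
    try:
        ast.parse(snippet)
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


# The two halves of the split function (docstring omitted for brevity).
chunk_1 = (
    "def authenticate_user(username, password):\n"
    "    if not username or not password:\n"
)
chunk_2 = (
    '        raise ValueError("Credentials required")\n'
    "\n"
    "    user = db.query(User).filter_by(username=username).first()\n"
)

print(is_valid_python(chunk_1))            # the `if` has no body
print(is_valid_python(chunk_2))            # starts at an unexpected indent
print(is_valid_python(chunk_1 + chunk_2))  # the rejoined text parses again
```

A check like this only detects the damage; it does not say where a safe boundary would have been.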


---

## Deep Technical Analysis

### AST (Abstract Syntax Tree) Boundaries

Code has natural structural boundaries that must be respected:

**Programming Language Structure:**

- Module level: imports, class definitions, function definitions, global variables
- Class level: methods, properties, inner classes
- Function level: function signature, docstring, function body, return statement
- Block level: `if`/`else` blocks, `try`/`except` blocks, `for`/`while` loops


**Naive Text Chunking:**

A standard RAG chunker works purely on token counts:

1. Count tokens
2. Split at token 512
3. Repeat

It ignores:

- Whether the split is mid-function
- Whether the split is inside a string literal
- Whether braces are balanced
- Whether indentation is preserved

Result: syntactically broken code.
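The count-split-repeat loop can be sketched in a few lines. This is an illustration, not any particular library's chunker: `naive_chunks` is a hypothetical name, and a whitespace-based regex stands in for a real tokenizer.

```python
import re


def naive_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text every max_tokens tokens, with no regard for structure.

    Assumption: a "token" is a run of non-whitespace plus the whitespace
    before it. A real tokenizer would differ, but the failure is the same.
    """
    tokens = re.findall(r"\s*\S+", text)
    return [
        "".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


source = (
    "def process():\n"
    "    if condition:\n"
    "        result = compute()\n"
    "        return result\n"
)

chunks = naive_chunks(source, max_tokens=4)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

With a budget of 4 tokens, the first chunk ends on `if condition:` and the `return` statement is cut in two, because the loop never looks at what the tokens mean.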


**AST-Aware Chunking:**

A better approach:

1. Parse the code into an AST
2. Identify top-level nodes (functions, classes)
3. Chunk at node boundaries

Example (Python): `ast.parse(code)` returns a `Module` whose `body` lists the top-level nodes, for example `FunctionDef(name='func1', ...)`, `FunctionDef(name='func2', ...)`, and `ClassDef(name='MyClass', ...)`. These become chunks 1, 2, and 3 respectively.

Each chunk contains a complete, valid syntax unit.
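A minimal sketch of chunking at top-level node boundaries, using only the standard `ast` module. The function name `ast_chunks` is hypothetical, and the sketch assumes Python 3.8 or later (for `end_lineno`). Comments and blank lines between top-level statements are dropped.

```python
import ast


def ast_chunks(source: str) -> list[str]:
    """Return one chunk per top-level statement, using AST line spans."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above node.lineno, so start at the first decorator.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


SOURCE = '''import os

def func1():
    return 1

def func2():
    return 2

class MyClass:
    def method(self):
        return func1() + func2()
'''

chunks = ast_chunks(SOURCE)
for chunk in chunks:
    ast.parse(chunk)  # every chunk parses on its own
    print(chunk, end="\n\n")
```

A very long function or class still produces one oversized chunk here, so a production chunker would need a second pass that descends into large nodes.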


**The Multi-Language Problem:**

Each language needs its own parser:

- Python: `ast` module
- JavaScript: esprima, acorn
- Java: Eclipse JDT, JavaParser
- C++: Clang, libclang
- Go: `go/parser`
- Rust: `syn`

Maintenance burden:

- 20+ languages to support
- Each with unique syntax
- Version-specific parsing (Python 2 vs 3)
- Dialect support (TypeScript, JSX)
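One way to contain this burden is a registry that maps file extensions to structure-aware chunkers and falls back to a language-agnostic splitter for everything else. The sketch below is an assumption about how such a dispatcher could look: all names (`CHUNKERS`, `chunk_file`, and so on) are hypothetical, and only Python has a real parser wired in.

```python
import ast
from pathlib import Path
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]


def chunk_python(source: str) -> List[str]:
    """Structure-aware chunking for Python (decorators ignored for brevity)."""
    lines = source.splitlines()
    return [
        "\n".join(lines[node.lineno - 1:node.end_lineno])
        for node in ast.parse(source).body
    ]


def chunk_by_blank_lines(source: str) -> List[str]:
    """Language-agnostic fallback: split on blank lines."""
    return [block for block in source.split("\n\n") if block.strip()]


# Hypothetical registry: other languages would register their own parsers.
CHUNKERS: Dict[str, Chunker] = {".py": chunk_python}


def chunk_file(filename: str, source: str) -> List[str]:
    """Pick a chunker by file extension, falling back when parsing fails."""
    chunker = CHUNKERS.get(Path(filename).suffix.lower(), chunk_by_blank_lines)
    try:
        return chunker(source)
    except SyntaxError:
        # For example, Python 2 source fed to a Python 3 parser.
        return chunk_by_blank_lines(source)


print(chunk_file("app.py", "x = 1\n\ndef f():\n    return x\n"))
print(chunk_file("legacy.py", "print 'hi'\n\nprint 'bye'\n"))
```

The fallback keeps the pipeline running on unsupported or unparseable files, at the cost of weaker boundaries for those files.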


### Indentation and Whitespace Preservation

Code semantics depend on formatting:

**Python Indentation:**
```python
# Original (valid):
def process():
    if condition:
        result = compute()
        return result

# Chunked incorrectly:

# Chunk 1:
def process():
    if condition:

# Chunk 2:
        result = compute()
        return result  # ← Indentation looks wrong without its parent block!
```

**The Indentation Context Loss:**

Chunk 2 starts mid-block:

- Missing context: it is inside the `process` function
- Missing context: it is inside the `if` statement
- The indentation appears wrong without the parent context

An LLM that sees only chunk 2:

- Cannot tell why the code is indented 8 spaces
- May normalize it to 4 spaces (breaking the block structure)
- Or may assume it is a standalone block (wrong)

**Language-Specific Rules:**

- Python: indentation is syntax. Mixing tabs and spaces inconsistently raises `TabError`, and inconsistent indentation raises `IndentationError` (a subclass of `SyntaxError`).
- JavaScript: indentation is style. Braces define blocks, so indentation does not affect semantics.
- YAML: indentation is structure. It defines nesting; 2 spaces is a common convention, and tabs are not allowed for indentation.

Each requires different handling.
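For Python, one way to restore the lost context is to record the header line of every block that encloses the first line of a chunk and prepend those headers when the chunk is stored. The sketch below assumes the full source file is still available at chunking time; `enclosing_headers` is a hypothetical helper, and only the first line of a multi-line header is captured.

```python
import ast


def enclosing_headers(source: str, lineno: int) -> list[str]:
    """Return the header line of each block containing lineno (1-based),
    outermost first."""
    lines = source.splitlines()
    headers = []
    for node in ast.walk(ast.parse(source)):
        # Compound statements (def, class, if, for, ...) have a list body.
        has_body = isinstance(getattr(node, "body", None), list)
        if (
            has_body
            and hasattr(node, "lineno")
            and node.lineno < lineno <= node.end_lineno
        ):
            headers.append((node.lineno, lines[node.lineno - 1]))
    return [text for _, text in sorted(headers)]


SOURCE = (
    "def process():\n"
    "    if condition:\n"
    "        result = compute()\n"
    "        return result\n"
)

# Chunk 2 starts at line 3 of the original file.
headers = enclosing_headers(SOURCE, lineno=3)
chunk_2 = "        result = compute()\n        return result"
restored = "\n".join(headers + [chunk_2])
print(restored)
```

With the headers prepended, the 8-space indentation of chunk 2 is explained by the `def` and `if` lines above it, and the restored text parses again.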

### Context and Dependencies

Code chunks need surrounding context:

**Import Statements:**

```python
# Top of file:
import os
import requests
from typing import List, Dict
from .models import User, Session

# ... 500 lines later ...

# Function that uses imports:
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```

**Chunking Problem:**

Chunk 1 (imports):

```python
import os
import requests
from typing import List, Dict
from .models import User, Session
```

Chunk 10 (function, 500 lines later):

```python
def fetch_users() -> List[User]:
    response = requests.get(os.getenv("API_URL"))
    return [User(**u) for u in response.json()]
```

Query: "How to fetch users from API?"

- Retrieves Chunk 10 (the function)
- Misses the imports (Chunk 1)
- The LLM doesn't know:
  - What `requests` is
  - Where `User` comes from
  - What `os.getenv` is

Result: the answer suggests incomplete or wrong imports.
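A function chunk can carry its own imports if the chunker looks up which module-level import lines bind the names the chunk uses. The sketch below is one possible approach with the standard `ast` module; `imports_needed` is a hypothetical name, and the extra `import json` line is added to the sample only to show that unused imports are filtered out.

```python
import ast


def imports_needed(module_source: str, chunk_source: str) -> list[str]:
    """Return the module-level import lines whose bound names the chunk uses."""
    used = {
        node.id
        for node in ast.walk(ast.parse(chunk_source))
        if isinstance(node, ast.Name)
    }
    needed = []
    for node in ast.parse(module_source).body:
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            # An import binds its alias, or the first component of a dotted path.
            bound = {(a.asname or a.name).split(".")[0] for a in node.names}
            if bound & used:
                needed.append(ast.get_source_segment(module_source, node))
    return needed


CHUNK = (
    "def fetch_users() -> List[User]:\n"
    '    response = requests.get(os.getenv("API_URL"))\n'
    "    return [User(**u) for u in response.json()]\n"
)
MODULE = (
    "import json\n"  # hypothetical unused import, not in the original example
    "import os\n"
    "import requests\n"
    "from typing import List, Dict\n"
    "from .models import User, Session\n"
    "\n"
) + CHUNK

needed = imports_needed(MODULE, CHUNK)
print("\n".join(needed + ["", CHUNK]))
```

Whole import lines are kept, so `Dict` and `Session` come along even though the function does not use them. That is usually harmless and avoids rewriting the author's import statements.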

**The Dependency Chain:**

File: `auth.py`

```python
class AuthService:
    def __init__(self, secret_key: str):
        self.secret = secret_key

    def generate_token(self, user_id: int):
        return jwt.encode({"user": user_id}, self.secret)
```

Later in the same file:

```python
def login(username, password):
    service = AuthService(os.getenv("SECRET"))
    # ... auth logic ...
    token = service.generate_token(user.id)
```

Chunking:

- Chunk 1: `AuthService` class
- Chunk 2: `login` function

Query: "How does login work?"

- Retrieves Chunk 2 (the `login` function)
- The chunk references `AuthService`, but its definition is not in the chunk
- The LLM must infer or ask for more context
- It may hallucinate the `AuthService` implementation
### Documentation and Code Separation

Code examples in docs need special handling:

**Markdown Code Blocks:**

````markdown
To authenticate, use the following code:

```python
import requests

response = requests.post(
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely and include it in subsequent requests.
````

**The Boundary Problem:**

Suppose the chunk boundary falls right after `response = requests.post(`.

Chunk 1:

````markdown
To authenticate, use the following code:

```python
import requests

response = requests.post(
````

Chunk 2:

````markdown
    "https://api.example.com/auth",
    json={"username": "user", "password": "pass"}
)
token = response.json()["token"]
```

Store the token securely...
````

Both chunks contain broken code blocks:

- Chunk 1: unclosed code fence
- Chunk 2: starts mid-code-block

**Fenced Code Block Detection:**

The chunker must recognize:

- `` ``` `` or `~~~` (code fence markers)
- Language identifiers (`` ```python ``, `` ```javascript ``)
- Code fence boundaries
- Nested code blocks (rare but possible)

Logic:

1. Detect the opening `` ``` ``
2. Don't chunk until the closing `` ``` ``
3. Keep the entire code block together

But:

- A code block might be 2000 tokens
- That exceeds the chunk size limit
- So splitting within a code block must be allowed
- But only intelligently (at function boundaries)
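The detect-and-hold logic can be sketched as a small state machine that splits Markdown at blank lines but treats everything between an opening and closing fence as one block. `split_markdown` is a hypothetical name. The sketch deliberately stops short of the oversized-block case: packing blocks into a token budget and splitting very long code blocks at function boundaries would be a separate step.

```python
import re

# An opening or closing fence: three or more backticks or tildes.
FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_markdown(text: str) -> list[str]:
    """Split Markdown into blocks at blank lines, never inside a code fence."""
    blocks, current, open_fence = [], [], None
    for line in text.splitlines():
        stripped = line.strip()
        match = FENCE.match(stripped)
        if match:
            marker = match.group(1)
            if open_fence is None:
                open_fence = marker  # opening fence (may carry a language)
            elif (
                marker[0] == open_fence[0]
                and len(marker) >= len(open_fence)
                and stripped == marker  # closing fences have no info string
            ):
                open_fence = None
        if not stripped and open_fence is None:
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


TICKS = "`" * 3  # built at runtime so this example contains no literal fence
DOC = (
    "To authenticate, use the following code:\n\n"
    f"{TICKS}python\nimport requests\n\nresponse = requests.post(url)\n{TICKS}\n\n"
    "Store the token securely."
)

blocks = split_markdown(DOC)
for block in blocks:
    print(block, end="\n\n")
```

The blank line inside the code block does not cause a split, so the fenced example stays in one block between the two prose paragraphs.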


### Multi-File Context

Code often references other files:

**Cross-File Dependencies:**

File: `routes.py`

```python
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)
```

File: `auth.py`

```python
def require_auth(func):
    # decorator implementation
    ...
```

File: `models.py`

```python
class User:
    # model definition
    ...
```

**The Single-File Chunk Limitation:**

Query: "How does user profile route work?"

Retrieved chunk (from `routes.py`):

```python
from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)
```

Missing:

- What `@require_auth` does (in `auth.py`)
- What fields `User` has (in `models.py`)

The LLM must infer or hallucinate these details.

**Graph-Based Chunking (Advanced):**

Ideal approach:

1. Build a dependency graph: `routes.py` → `auth.py`, `routes.py` → `models.py`
2. When chunking `routes.py`, include summaries of `auth.py` and `models.py`, or retrieve the related files automatically
3. Embed with context: "get_user_profile uses @require_auth decorator from auth.py and User model from models.py"

Complexity:

- Must parse imports
- Must resolve relative paths
- Must handle circular dependencies
- Must maintain the graph for the entire codebase
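The first and third steps can be sketched for a single file with the standard `ast` module. The sketch makes a simplifying assumption that is labeled in the code: a relative import such as `from .auth import ...` resolves to `auth.py` in the same directory. Packages, deeper relative levels, and circular imports are not handled, and the function names are hypothetical.

```python
import ast


def local_imports(source: str) -> dict[str, list[str]]:
    """Map each relatively imported module to the names imported from it."""
    deps: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.level > 0 and node.module:
            deps.setdefault(node.module, []).extend(a.name for a in node.names)
    return deps


def dependency_note(symbol: str, source: str) -> str:
    """Build a one-line context string to embed alongside a chunk."""
    # Assumption: module "auth" lives in "auth.py" next to the current file.
    parts = [
        f"{', '.join(names)} from {module}.py"
        for module, names in local_imports(source).items()
    ]
    return f"{symbol} depends on: " + "; ".join(parts)


ROUTES_PY = (
    "from .auth import require_auth\n"
    "from .models import User\n"
    "\n"
    "@require_auth\n"
    "def get_user_profile(user_id):\n"
    "    return User.query.get(user_id)\n"
)

print(local_imports(ROUTES_PY))
print(dependency_note("get_user_profile", ROUTES_PY))
```

The edges returned by `local_imports` are what a codebase-wide graph would be built from; the note is the kind of text that could be embedded with the chunk so a query about the route also surfaces `auth.py` and `models.py`.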


### Inline Comments and Docstrings

Comments provide crucial context:

**Docstring Separation:**
```python
from typing import List

def complex_algorithm(data: List[int]) -> int:
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
    Space complexity: O(m)

    Args:
        data: Input array of integers

    Returns:
        Index of pattern match or -1 if not found
    """
    # Implementation details...
    ...
```

**Chunking Issue:**

The chunk boundary can separate the signature from its docstring and implementation.

Chunk 1:

```python
def complex_algorithm(data: List[int]) -> int:
```

Chunk 2:

```python
    """
    Implements the Knuth-Morris-Pratt algorithm...
    """
    # Implementation...
```

Or worse, the boundary can fall inside the docstring.

Chunk 1:

```python
    """
    Implements the Knuth-Morris-Pratt algorithm for pattern matching.

    Time complexity: O(n + m)
```

Chunk 2:

```python
    Space complexity: O(m)

    Args:
        data: Input array of integers
    """
```

The docstring is split partway through, so neither half is coherent on its own.
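One way to keep a docstring whole is to treat the signature plus docstring as an indivisible header, and to repeat that header on every piece if a long function body has to be split. The sketch below extracts such a header with the standard `ast` module; `function_header` is a hypothetical name and the sample function is a shortened stand-in.

```python
import ast


def function_header(source: str, name: str) -> str:
    """Return the signature and complete docstring of a top-level function."""
    lines = source.splitlines()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            first = node.body[0]
            if ast.get_docstring(node) is not None:
                end = first.end_lineno  # last line of the docstring
            else:
                end = max(first.lineno - 1, node.lineno)  # signature only
            return "\n".join(lines[node.lineno - 1:end])
    raise KeyError(name)


SOURCE = '''def complex_algorithm(data):
    """
    Summary line.

    Time complexity: O(n + m)
    """
    total = 0
    return total
'''

header = function_header(SOURCE, "complex_algorithm")
print(header)
```

Because the header is cut at the AST boundary of the docstring rather than at a token count, the complexity notes and argument descriptions always travel together with the signature.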

## How to Solve

- Implement AST-based chunking for code blocks
- Detect the language with a syntax highlighter
- Keep function and class definitions intact
- Preserve indentation context
- Include parent-scope metadata with each chunk

See Code Chunking.
