Code Blocks Split Wrong

The Problem

Code snippets are split mid-function or mid-block, breaking syntax and making the code incomprehensible in retrieval results.

Symptoms

  • ❌ Retrieved code missing opening/closing braces

  • ❌ Function definitions split from their bodies

  • ❌ Indentation broken across chunks

  • ❌ AI generates invalid code based on partial snippets

  • ❌ Import statements separated from usage

Real-World Example

Original documentation:
```python
def authenticate_user(username, password):
    """
    Authenticates a user with username and password.
    Returns JWT token on success.
    """
    if not username or not password:
        raise ValueError("Credentials required")
    
    user = db.query(User).filter_by(username=username).first()
    if not user or not verify_password(password, user.password_hash):
        raise AuthenticationError("Invalid credentials")
    
    token = generate_jwt(user.id)
    return {"token": token, "expires": 3600}
```

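A chunk boundary at 512 tokens can fall in the middle of this function: chunk 1 ends partway through the body, and chunk 2 starts with the remaining lines and no function signature.

Result: Both chunks have broken, syntactically invalid code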

Module level:
  → Imports
  → Class definitions
  → Function definitions
  → Global variables

Class level:
  → Methods
  → Properties
  → Inner classes

Function level:
  → Function signature
  → Docstring
  → Function body
  → Return statement

Block level:
  → if/else blocks
  → try/except blocks
  → for/while loops

Standard RAG chunker:

  1. Count tokens

  2. Split at token 512

  3. Repeat

Ignores:
  → Is this mid-function?
  → Is this inside a string literal?
  → Are braces balanced?
  → Is indentation preserved?

Result: Syntactically broken code
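
The three steps above can be sketched in a few lines. This is only an illustration, not any particular library's chunker: it assumes space-separated words as a stand-in for real tokenizer tokens, and `naive_chunks` and `is_valid_python` are hypothetical helper names.

```python
import ast


def naive_chunks(text: str, max_tokens: int) -> list[str]:
    """Split text every `max_tokens` tokens, ignoring code structure.

    Assumption: space-separated words stand in for tokenizer tokens.
    """
    tokens = text.split(" ")
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python on its own."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:  # also covers IndentationError
        return False


SOURCE = '''def authenticate_user(username, password):
    if not username or not password:
        raise ValueError("Credentials required")
    return {"token": "abc", "expires": 3600}
'''

chunks = naive_chunks(SOURCE, max_tokens=12)
for chunk in chunks:
    print(is_valid_python(chunk), repr(chunk))
```

The full source parses, but the fixed-size pieces do not: one ends on an `if` line with no body, and the others start with orphaned indentation.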

Better approach:

  1. Parse code into AST

  2. Identify top-level nodes (functions, classes)

  3. Chunk at node boundaries

Example (Python):

    ast.parse(code) → Module(
        body=[
            FunctionDef(name='func1', ...),   ← Chunk 1
            FunctionDef(name='func2', ...),   ← Chunk 2
            ClassDef(name='MyClass', ...),    ← Chunk 3
        ]
    )

Each chunk contains a complete, valid syntactic unit
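
A minimal sketch of this approach using Python's standard `ast` module follows. The function name `chunk_by_top_level_nodes` is hypothetical, and the sketch assumes Python 3.8+ (for `end_lineno`). It drops comments and blank lines that sit between top-level nodes, which a production chunker would likely want to attach to the following node.

```python
import ast


def chunk_by_top_level_nodes(source: str) -> list[str]:
    """Return one chunk per top-level statement (import, function, class, ...)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start from the first one.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        # lineno and end_lineno are 1-based and inclusive.
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


SOURCE = '''import os

def func1(path):
    return os.path.basename(path)

def func2(x):
    if x > 0:
        return x
    return -x

class MyClass:
    def method(self):
        return func2(-1)
'''

chunks = chunk_by_top_level_nodes(SOURCE)
for chunk in chunks:
    ast.parse(chunk)  # every chunk parses on its own
    print(chunk.splitlines()[0])
```

Because boundaries come from the parser rather than a token count, a chunk can never end inside a function body. Chunk sizes become uneven, though, so a size limit still has to be handled separately.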

Need different parsers for each language:
  → Python: ast module
  → JavaScript: esprima, acorn
  → Java: Eclipse JDT, JavaParser
  → C++: Clang, libclang
  → Go: go/parser
  → Rust: syn

Maintenance burden:
  → 20+ languages to support
  → Each with unique syntax
  → Version-specific parsing (Python 2 vs 3)
  → Dialect support (TypeScript, JSX)

The Indentation Context Loss:

Language-Specific Rules:

Context and Dependencies

Code chunks need surrounding context:

Import Statements:

Chunking Problem:

The Dependency Chain:

Documentation and Code Separation

Code examples in docs need special handling:

Markdown Code Blocks:

The Boundary Problem:

Store the token securely...

Both chunks: Broken code blocks
  → Chunk 1: Unclosed triple-backticks
  → Chunk 2: Starts mid-code-block

Chunker must recognize:
  → ``` or ~~~ (code fence markers)
  → Language identifier (```python, ```javascript)
  → Code fence boundaries
  → Nested code blocks (rare but possible)

Logic:

  1. Detect opening ```

  2. Don't chunk until closing ```

  3. Keep entire code block together

But:
  → Code block might be 2000 tokens
  → Exceeds chunk size limit
  → Must allow splitting WITHIN code block
  → But intelligently (at function boundaries)
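
The fence-tracking logic above might look like the following sketch. `split_blocks` and `pack_blocks` are hypothetical names, word counts stand in for token counts, and the fence marker is built at runtime only so the example can itself be shown inside a code fence. An oversized code block is emitted as its own chunk here; splitting it at function boundaries, as noted above, is left out.

```python
import re

FENCE = re.compile(r"^(`{3,}|~{3,})")


def split_blocks(markdown: str) -> list[str]:
    """Split markdown into blocks, keeping each fenced code block whole."""
    blocks, current, open_fence = [], [], None
    for line in markdown.splitlines():
        stripped = line.strip()
        match = FENCE.match(stripped)
        if open_fence is None and match:
            # Opening fence: flush pending prose, then start a code block.
            if current:
                blocks.append("\n".join(current))
            current, open_fence = [line], match.group(1)
        elif open_fence is not None:
            current.append(line)
            # A closing fence repeats the fence character and has no info string.
            if stripped.startswith(open_fence) and set(stripped) == {open_fence[0]}:
                blocks.append("\n".join(current))
                current, open_fence = [], None
        elif not stripped:
            # Outside a fence, a blank line ends the current prose block.
            if current:
                blocks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks


def pack_blocks(blocks: list[str], max_words: int) -> list[str]:
    """Greedily pack blocks into chunks without ever splitting a block."""
    chunks, current, size = [], [], 0
    for block in blocks:
        words = len(block.split())
        if current and size + words > max_words:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(block)
        size += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


TICKS = "`" * 3
DOC = "\n".join([
    "Call the endpoint:",
    "",
    TICKS + "python",
    "def login():",
    "",
    "    return 'token'",
    TICKS,
    "",
    "Store the token securely...",
])

blocks = split_blocks(DOC)
chunks = pack_blocks(blocks, max_words=6)
```

Blank lines inside the fence do not end the block, so the code stays in one piece and every resulting chunk has balanced fence markers.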

File: routes.py

    from .auth import require_auth
    from .models import User

    @require_auth
    def get_user_profile(user_id):
        return User.query.get(user_id)

File: auth.py

    def require_auth(func):
        # decorator implementation
        ...

File: models.py

    class User:
        # model definition
        ...

Query: "How does user profile route work?"

Retrieved chunk (from routes.py):

    from .auth import require_auth
    from .models import User

    @require_auth
    def get_user_profile(user_id):
        return User.query.get(user_id)

Missing:
  → What does @require_auth do? (in auth.py)
  → What fields does User have? (in models.py)

LLM must infer or hallucinate these details

Ideal approach:

  1. Build dependency graph:
       routes.py → auth.py
       routes.py → models.py

  2. When chunking routes.py:
       → Include summaries of auth.py and models.py
       → Or: Retrieve related files automatically

  3. Embed with context: "get_user_profile uses @require_auth decorator from auth.py and User model from models.py"

Complexity:
  → Must parse imports
  → Resolve relative paths
  → Handle circular dependencies
  → Maintain graph for entire codebase
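
Steps 1 and 3 of the ideal approach can be sketched for a single file with the standard `ast` module. `local_imports` and `context_header` are hypothetical names, and the sketch assumes that a relative import such as `from .auth import ...` maps to a sibling file `auth.py`. It does not resolve packages, nested paths, or circular dependencies.

```python
import ast


def local_imports(source: str) -> dict[str, list[str]]:
    """Map each relatively imported module to the names imported from it."""
    deps: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        # `from .auth import require_auth` has level=1 and module="auth".
        if isinstance(node, ast.ImportFrom) and node.level > 0 and node.module:
            deps.setdefault(node.module, []).extend(a.name for a in node.names)
    return deps


def context_header(filename: str, source: str) -> str:
    """Build a one-line dependency summary to embed alongside a chunk."""
    deps = local_imports(source)
    if not deps:
        return f"{filename} has no local imports"
    # Assumption: module "auth" lives in a sibling file "auth.py".
    parts = [f"{', '.join(names)} from {module}.py" for module, names in deps.items()]
    return f"{filename} uses " + " and ".join(parts)


ROUTES = '''from .auth import require_auth
from .models import User

@require_auth
def get_user_profile(user_id):
    return User.query.get(user_id)
'''

print(local_imports(ROUTES))
print(context_header("routes.py", ROUTES))
```

Prepending such a header to the chunk text before embedding gives the retriever a hint about `auth.py` and `models.py` even when only the `routes.py` chunk is returned.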

Chunking Issue:


How to Solve

Implement AST-based chunking for code blocks, detect the language with a syntax highlighter, keep function and class definitions intact, preserve indentation context, and include parent scope metadata. See Code Chunking.
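
As one possible reading of "preserve indentation context" and "include parent scope metadata", the sketch below emits one record per function or method, dedented so it parses on its own, with the enclosing class name and original indentation column stored as metadata. `chunks_with_scope` and the record fields are hypothetical, only the immediately enclosing class is recorded, and decorators are not handled.

```python
import ast
import textwrap


def chunks_with_scope(source: str) -> list[dict]:
    """Return one record per function or method, tagged with its enclosing class."""
    lines = source.splitlines()
    records = []

    def visit(body, parent):
        for node in body:
            if isinstance(node, ast.ClassDef):
                visit(node.body, node.name)
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                text = "\n".join(lines[node.lineno - 1:node.end_lineno])
                records.append({
                    "name": node.name,
                    "parent": parent,            # None for module-level functions
                    "indent": node.col_offset,   # original indentation column
                    "code": textwrap.dedent(text),
                })

    visit(ast.parse(source).body, None)
    return records


SOURCE = '''class UserService:
    def get(self, user_id):
        return user_id

def helper():
    return 1
'''

records = chunks_with_scope(SOURCE)
for record in records:
    print(record["parent"], record["name"], record["indent"])
```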
