Markdown Formatting Lost

The Problem

Markdown formatting (bold, italic, links, code) is stripped during chunking, losing semantic emphasis and making retrieved content less useful.

Symptoms

  • ❌ Code blocks render as plain text

  • ❌ Bold emphasis lost in answers

  • ❌ Links converted to bare URLs or text

  • ❌ > Blockquotes not distinguished from body text

  • ❌ Headers (##) indistinguishable from paragraphs

Real-World Example

Original markdown:
## Authentication Methods

We support **three authentication methods**:

1. **API Key**: Include `X-API-Key` header
2. **OAuth 2.0**: Use `/oauth/token` endpoint  
3. **JWT**: Bearer token in `Authorization` header

See [Authentication Guide](/docs/auth) for details.

After naive text extraction:
Authentication Methods We support three authentication methods: API Key: Include X-API-Key header OAuth 2.0: Use /oauth/token endpoint JWT: Bearer token in Authorization header See Authentication Guide for details.

Lost:
- Header hierarchy (##)
- Bold emphasis (**important**)
- Inline code (`code`)
- List structure (1, 2, 3)
- Link destination ([text](url))
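
The loss above is easy to reproduce. The following is a minimal sketch of a naive extractor, assuming a regex-based stripper followed by whitespace collapsing; the function name `naive_strip` is hypothetical and not taken from any particular pipeline.

```python
import re

def naive_strip(markdown: str) -> str:
    """Strip common markdown syntax and collapse whitespace (lossy on purpose)."""
    text = re.sub(r"^#{1,6}\s*", "", markdown, flags=re.MULTILINE)  # header markers
    text = re.sub(r"^\s*\d+\.\s+", "", text, flags=re.MULTILINE)    # ordered-list numbers
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)            # links -> anchor text only
    text = re.sub(r"\*\*([^*]+)\*\*", r"\1", text)                  # bold markers
    text = re.sub(r"`([^`]+)`", r"\1", text)                        # inline code backticks
    return " ".join(text.split())                                   # line breaks vanish too

sample = """## Authentication Methods

We support **three authentication methods**:

1. **API Key**: Include `X-API-Key` header
2. **OAuth 2.0**: Use `/oauth/token` endpoint
3. **JWT**: Bearer token in `Authorization` header

See [Authentication Guide](/docs/auth) for details.
"""

print(naive_strip(sample))
```

Each substitution discards one of the signals listed under "Lost", and the final whitespace collapse removes the list and paragraph boundaries as well.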

Deep Technical Analysis

Semantic Information in Formatting

Markdown formatting carries meaning:

Format Types and Semantics:

The Loss of Signal:

Conversion Strategies

Different approaches to handling markdown:

Strategy 1: Strip All Formatting (current)

Strategy 2: Convert to Natural Language:

Strategy 3: Keep Markdown Syntax:

Strategy 4: Convert to HTML:
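
To make the four strategies concrete, here is a rough sketch that applies each one to a single line from the example above. It only handles bold and inline code, the function names are hypothetical, and the natural-language wording in Strategy 2 is an assumption chosen for illustration.

```python
import html
import re

BOLD = re.compile(r"\*\*([^*]+)\*\*")
CODE = re.compile(r"`([^`]+)`")

def strip_all(md: str) -> str:
    """Strategy 1: drop the markers, keep only the words."""
    return CODE.sub(r"\1", BOLD.sub(r"\1", md))

def to_natural_language(md: str) -> str:
    """Strategy 2: replace markers with words (phrasing is an assumption)."""
    md = BOLD.sub(r"\1 (emphasized)", md)
    return CODE.sub(r'code "\1"', md)

def keep_markdown(md: str) -> str:
    """Strategy 3: pass the markdown through untouched."""
    return md

def to_html(md: str) -> str:
    """Strategy 4: escape the text, then emit tags for bold and inline code."""
    md = html.escape(md, quote=False)
    md = BOLD.sub(r"<strong>\1</strong>", md)
    return CODE.sub(r"<code>\1</code>", md)

line = "**API Key**: Include `X-API-Key` header"
for strategy in (strip_all, to_natural_language, keep_markdown, to_html):
    print(f"{strategy.__name__}: {strategy(line)}")
```

Comparing the four outputs side by side shows the trade-off: Strategy 1 is the shortest but carries no signal, while Strategies 2 to 4 keep the signal at the cost of extra characters.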

Inline Code vs Code Blocks

Different code representations need different handling:

Inline Code:

Preservation Importance:

Code Blocks:

Extraction Challenge:
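
One way to approach code-block extraction is to separate fenced blocks from prose before chunking, so a chunk boundary never falls inside a fence and the language tag stays attached. The sketch below assumes backtick fences that start at the beginning of a line; `split_code_blocks` is a hypothetical helper, and tilde fences or indented code blocks are not handled.

```python
import re

# Opening fence with optional language tag, lazy body, closing fence on its own line.
FENCE = re.compile(r"^`{3}(\w*)\n(.*?)^`{3}[ \t]*$", re.MULTILINE | re.DOTALL)

def split_code_blocks(markdown: str) -> list[tuple[str, str, str]]:
    """Return ordered (kind, language, content) segments; kind is 'text' or 'code'."""
    segments, pos = [], 0
    for match in FENCE.finditer(markdown):
        if match.start() > pos:
            segments.append(("text", "", markdown[pos:match.start()]))
        segments.append(("code", match.group(1), match.group(2)))
        pos = match.end()
    if pos < len(markdown):
        segments.append(("text", "", markdown[pos:]))
    return segments

fence = "`" * 3  # built at runtime so this example can itself sit inside a fence
doc = f"Run:\n{fence}bash\necho hello\n{fence}\nDone.\n"

for kind, language, content in split_code_blocks(doc):
    print(kind, repr(language), repr(content))
```

Code segments can then be kept whole (or re-wrapped in their fences) while only the text segments are passed to the regular chunker.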

Link Handling

Links have both anchor text and destination:

Link Structure:

Extraction Options:

Reference Link Problem:

Reference Resolution:
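
Reference-style links are fragile under chunking because the `[label]: url` definition can sit far from the `[text][label]` that uses it, possibly in a different chunk. A minimal sketch of resolving them before chunking follows; `resolve_reference_links` is a hypothetical helper, and link titles and other edge cases of the full markdown grammar are deliberately ignored.

```python
import re

DEFINITION = re.compile(r"^\[([^\]]+)\]:[ \t]*(\S+)[ \t]*$", re.MULTILINE)
REFERENCE = re.compile(r"\[([^\]]+)\]\[([^\]]*)\]")

def resolve_reference_links(markdown: str) -> str:
    """Rewrite [text][label] as [text](url) using definitions found in the document."""
    urls = {m.group(1).lower(): m.group(2) for m in DEFINITION.finditer(markdown)}
    body = DEFINITION.sub("", markdown)  # definitions are no longer needed

    def inline(match: re.Match) -> str:
        text = match.group(1)
        label = match.group(2) or text   # [text][] reuses the text as its label
        url = urls.get(label.lower())    # labels are matched case-insensitively
        return f"[{text}]({url})" if url else match.group(0)

    return REFERENCE.sub(inline, body).strip()

doc = "See the [Authentication Guide][auth] for details.\n\n[auth]: /docs/auth\n"
print(resolve_reference_links(doc))
```

References with no matching definition are left unchanged rather than guessed at.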

List Structure Preservation

Lists have hierarchical structure:

Ordered Lists:

Flattening Problem:

Unordered Lists:

Structure Loss:
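
One possible way to keep ordered-list structure through flattening is to turn each number into an explicit label before chunking. The sketch below assumes the input holds a single flat ordered list; the "Item N of M" wording and the `label_ordered_items` name are illustrative choices, not part of any existing tool.

```python
import re

ITEM = re.compile(r"^(\s*)(\d+)[.)]\s+(.*)$")

def label_ordered_items(markdown: str) -> str:
    """Replace '1. text' with 'Item 1 of N: text' so numbering survives flattening."""
    lines = markdown.splitlines()
    total = sum(1 for line in lines if ITEM.match(line))
    labeled = []
    for line in lines:
        match = ITEM.match(line)
        if match:
            indent, number, text = match.groups()
            labeled.append(f"{indent}Item {number} of {total}: {text}")
        else:
            labeled.append(line)  # non-list lines pass through unchanged
    return "\n".join(labeled)

steps = "1. **API Key**: Include `X-API-Key` header\n2. **OAuth 2.0**: Use `/oauth/token` endpoint"
print(label_ordered_items(steps))
```

Because the position and list length are now part of the text, they survive even if later stages collapse line breaks.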

Blockquotes and Callouts

Special blocks carry semantic meaning:

Blockquote:

Semantic Significance:

Callout Boxes (GitHub-flavored):

Semantic Type Loss:
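
A sketch of keeping the semantic type of a callout: GitHub-flavored alerts open a blockquote with a marker such as `[!WARNING]`, which can be converted into a plain-language label. The helper name `label_callouts`, the "Quote" label for ordinary blockquotes, and the sample text are assumptions for illustration.

```python
import re

ALERT = re.compile(r"^>\s*\[!(NOTE|TIP|IMPORTANT|WARNING|CAUTION)\]\s*$", re.IGNORECASE)

def label_callouts(markdown: str) -> str:
    """Collapse each blockquote into one line prefixed with its alert type or 'Quote'."""
    out, block, label = [], [], None

    def flush() -> None:
        nonlocal block, label
        if block:
            out.append(f"{label or 'Quote'}: " + " ".join(block))
        block, label = [], None

    for line in markdown.splitlines():
        if not line.startswith(">"):
            flush()              # a non-quoted line ends the current block
            out.append(line)
            continue
        match = ALERT.match(line)
        if match and not block:
            label = match.group(1).capitalize()
        else:
            text = line[1:].strip()
            if text:
                block.append(text)
    flush()
    return "\n".join(out)

doc = "> [!WARNING]\n> Store the key securely.\n\nRegular text."
print(label_callouts(doc))
```

The label travels with the sentence, so a retrieved chunk still says that the text was a warning rather than ordinary body copy.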

Header Hierarchy

Headers define document structure:

Markdown Headers:

Structural Information:
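
Header hierarchy can be carried into each chunk as a breadcrumb. The following sketch assumes ATX-style headers (`#` to `######`) and does not skip `#` lines inside fenced code blocks, which a production version would need to do; `sections_with_breadcrumbs` is a hypothetical helper.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def sections_with_breadcrumbs(markdown: str) -> list[tuple[str, str]]:
    """Return (breadcrumb, body) pairs, one per non-empty header section."""
    stack: list[tuple[int, str]] = []   # (level, title) of the currently open headers
    body: list[str] = []
    results: list[tuple[str, str]] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            results.append((" > ".join(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()             # close headers at the same or a deeper level
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return results

doc = (
    "# API Docs\n\n## Authentication Methods\n\n"
    "We support three authentication methods.\n\n"
    "### API Key\n\nInclude the X-API-Key header.\n"
)
for breadcrumb, text in sections_with_breadcrumbs(doc):
    print(f"[{breadcrumb}] {text}")
```

Prepending the breadcrumb to each chunk before embedding keeps the parent context that a bare paragraph would otherwise lose.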

Embedding Model Considerations

How models handle formatted text:

Training Data Exposure:

Token Efficiency:


How to Solve

Keep markdown syntax intact for LLM consumption; convert special blocks (warnings, notes) to natural-language labels; preserve inline code backticks; expand links to "text (URL)" format; and maintain list numbering with explicit labels. See Markdown Handling.
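
As one illustration of the link step, the sketch below expands inline links to the "text (URL)" format while leaving all other markdown, including inline code backticks, untouched. The name `expand_links` is hypothetical; images and links that carry a title are left as they are.

```python
import re

# The (?<!!) lookbehind skips image syntax such as ![alt](src).
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\(([^)\s]+)\)")

def expand_links(markdown: str) -> str:
    """Rewrite [text](url) as 'text (url)' and leave everything else unchanged."""
    return INLINE_LINK.sub(r"\1 (\2)", markdown)

line = "See [Authentication Guide](/docs/auth) for details. Use the `X-API-Key` header."
print(expand_links(line))
```

Both the anchor text and the destination remain readable in the chunk, so an answer can cite the URL instead of only naming the page.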
