Character Encoding in Chunks

The Problem

Text containing non-ASCII content (accented characters, emoji, math symbols) gets corrupted during ingestion, embedding, or retrieval, surfacing as garbled text in AI responses.

Symptoms

  • ❌ Foreign language text displays as "???" or boxes

  • ❌ Emojis become broken characters

  • ❌ Math symbols corrupted

  • ❌ Smart quotes (“ ”) turn into mojibake sequences like â€œ

  • ❌ Retrieval fails on non-ASCII queries

Real-World Example

Source document (UTF-8):
"Price: €500 for 10× improvement 🚀"

After ingestion (UTF-8 bytes decoded as Windows-1252):
"Price: â‚¬500 for 10Ã— improvement ðŸš€"

AI response cites corrupted text:
→ User sees garbage characters
→ Cannot understand pricing
→ Trust degraded

Deep Technical Analysis

Encoding Mismatches

UTF-8 vs Latin-1:
Latin-1 (ISO-8859-1) assigns a character to every byte value, so UTF-8 multi-byte sequences decode silently into mojibake instead of raising an error.

Windows-1252 vs UTF-8:
Windows-1252 extends Latin-1 by filling the 0x80–0x9F range with printable characters (smart quotes, the euro sign), which is why its mojibake shows visible symbols like â‚¬ rather than invisible control codes.
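The mismatch can be reproduced in a few lines. A minimal sketch, using the same string as the example above, showing how UTF-8 bytes mis-decoded as Windows-1252 produce the classic â‚¬-style mojibake:

```python
# Encode correct text as UTF-8, then decode it with the wrong codec.
text = "Price: €500 for 10× improvement 🚀"
garbled = text.encode("utf-8").decode("windows-1252")
print(garbled)  # Price: â‚¬500 for 10Ã— improvement ðŸš€

# The damage is reversible only while the mojibake is still intact;
# once it is re-encoded or normalized downstream, the original is lost.
recovered = garbled.encode("windows-1252").decode("utf-8")
assert recovered == text
```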

Embedding Model Limitations

Model Vocabulary:
Tokenizers trained mostly on English text may split rare non-ASCII characters into many byte-level tokens or map them to an unknown token, weakening the resulting embedding.

Normalization:
The same visible string can have multiple Unicode representations (e.g. "é" as one precomposed code point or as "e" plus a combining accent); without normalization, identical-looking queries and chunks may not match.
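A hedged sketch of a pre-embedding normalizer using the stdlib; the `PUNCT_MAP` table and function name are illustrative, not from the original:

```python
import unicodedata

# Illustrative mapping: fold typographic punctuation to ASCII so that
# queries and chunks agree regardless of how the source was authored.
PUNCT_MAP = {
    "\u201c": '"', "\u201d": '"',   # smart double quotes
    "\u2018": "'", "\u2019": "'",   # smart single quotes
    "\u2013": "-", "\u2014": "-",   # en/em dashes
    "\u00a0": " ",                  # non-breaking space
}

def normalize_for_embedding(text: str) -> str:
    # NFKC composes combining accents and folds compatibility forms
    # (ligature "ﬁ" -> "fi", superscript "²" -> "2").
    text = unicodedata.normalize("NFKC", text)
    return text.translate(str.maketrans(PUNCT_MAP))
```

Applying the same function to both chunks at ingestion time and queries at retrieval time keeps the two sides comparable.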

Detection Strategies

Encoding Detection:
Libraries such as chardet or charset-normalizer inspect byte statistics and return a best-guess encoding with a confidence score; run them before decoding, not after.

Validation:
After decoding, scan for the Unicode replacement character U+FFFD (�); its presence means bytes were already lost upstream.
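A minimal try-in-order detector using only the stdlib; in production, a statistical detector such as chardet (`chardet.detect(raw)` returns an encoding guess plus a confidence score) handles more encodings. Function names here are illustrative:

```python
def detect_encoding(raw: bytes) -> str:
    # Try strict decodes in priority order; UTF-8 rarely false-positives
    # on real Windows-1252 text, so it goes first.
    for enc in ("utf-8", "windows-1252"):
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "latin-1"  # maps every byte, so it never fails; explicit fallback

def read_as_utf8(raw: bytes) -> str:
    # Standardize: whatever the source encoding, return a clean str
    # that can be re-encoded as UTF-8 downstream.
    return raw.decode(detect_encoding(raw))
```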
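The validation step can be a simple ratio check; the threshold below is an illustrative assumption, not a standard value:

```python
REPLACEMENT = "\ufffd"  # U+FFFD, emitted by lossy decodes

def validate_extracted_text(text: str, max_bad_ratio: float = 0.001) -> bool:
    # Reject chunks where replacement characters exceed the threshold;
    # they indicate the decode upstream already destroyed bytes.
    if not text:
        return True
    return text.count(REPLACEMENT) / len(text) <= max_bad_ratio
```

Chunks that fail this check should be re-extracted from the source with the correct encoding rather than indexed as-is.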

PDF Extraction Issues

OCR vs Native Text:
OCR output adds its own character-confusion errors (l/1, O/0) on top of any encoding issues; a native text layer, when present, is more reliable.

Font Encoding:
PDFs with custom or subset fonts may map glyphs to private-use code points, so the text looks fine on the page but extracts as garbage Unicode.
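One practical gate, sketched with an illustrative heuristic: if the extracted text layer is empty or dominated by replacement or private-use characters, fall back to OCR for that page:

```python
def looks_like_native_text(extracted: str) -> bool:
    # Empty text layer usually means a scanned page: use OCR instead.
    if not extracted.strip():
        return False
    # Replacement chars and Private Use Area code points (U+E000-U+F8FF)
    # are the typical fingerprints of broken font-to-Unicode maps.
    suspicious = sum(
        1 for ch in extracted
        if ch == "\ufffd" or "\ue000" <= ch <= "\uf8ff"
    )
    return suspicious / len(extracted) < 0.05  # illustrative threshold
```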


How to Solve

  • Detect encoding with chardet (or charset-normalizer) before processing

  • Standardize all content to UTF-8

  • Normalize problematic characters (smart quotes, dashes) to canonical forms

  • Use multilingual embedding models for non-English content

  • Validate extracted text for replacement characters (U+FFFD)

  • Prefer native-text PDF extraction over OCR when possible

  • Test with a multi-language eval set

See Character Encoding.
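These steps compose into one ingestion helper; a minimal sketch with illustrative names, using a stdlib decode cascade in place of chardet:

```python
import unicodedata

def ingest_chunk(raw: bytes) -> str:
    # 1. Detect encoding: try strict UTF-8 first, fall back to Windows-1252.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode("windows-1252", errors="replace")
    # 2. Standardize/normalize: one canonical Unicode form for all content.
    text = unicodedata.normalize("NFKC", text)
    # 3. Validate: refuse to index text that already lost bytes.
    if "\ufffd" in text:
        raise ValueError("lossy decode detected; re-extract the source")
    return text
```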
