# Character Encoding in Chunks

## The Problem

Text with special characters, emoji, or other non-ASCII content gets corrupted during ingestion, embedding, or retrieval, surfacing as garbled text in AI responses.

### Symptoms

* ❌ Non-English text displays as "???" or empty boxes
* ❌ Emoji turn into broken byte sequences
* ❌ Math symbols are corrupted
* ❌ Smart quotes render as mojibake (e.g. â€œ)
* ❌ Retrieval fails on non-ASCII queries

### Real-World Example

```
Source document (UTF-8):
"Price: €500 for 10× improvement 🚀"

After ingestion (wrong encoding):
"Price: â‚¬500 for 10Ã— improvement ?"

AI response cites corrupted text:
→ User sees garbage characters
→ Cannot understand pricing
→ Trust degraded
```
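This corruption is classic mojibake: UTF-8 bytes decoded with a single-byte legacy codepage. It can be reproduced in a few lines of Python using only built-in codecs (cp1252, i.e. Windows-1252, stands in for the "wrong encoding" here; the emoji is omitted because lossy emoji handling varies by pipeline):

```python
# UTF-8 bytes misread as Windows-1252 (cp1252) produce the garbled text above.
source = "Price: €500 for 10× improvement"
raw = source.encode("utf-8")      # correct bytes on disk
garbled = raw.decode("cp1252")    # wrong decoder at ingestion time
print(garbled)                    # Price: â‚¬500 for 10Ã— improvement
```

Each multi-byte UTF-8 sequence (€ is three bytes, × is two) becomes two or three unrelated cp1252 characters, which is exactly the pattern in the example above.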

***

## Deep Technical Analysis

### Encoding Mismatches

**UTF-8 vs Latin-1:**

```
Document encoded as UTF-8:
→ "café" = [0x63, 0x61, 0x66, 0xC3, 0xA9]

Read as Latin-1 (wrong):
→ Interprets UTF-8 bytes as Latin-1
→ Displays: "cafÃ©"

Must detect/declare encoding correctly
```
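The same mismatch in runnable form, stdlib only. Latin-1 maps every byte to exactly one character, so the two-byte UTF-8 sequence for "é" (0xC3 0xA9) comes out as two characters:

```python
text = "café"
utf8_bytes = text.encode("utf-8")
print(list(utf8_bytes))              # [99, 97, 102, 195, 169] = 0x63 0x61 0x66 0xC3 0xA9

# Decoding UTF-8 bytes as Latin-1 turns each byte into one character:
print(utf8_bytes.decode("latin-1"))  # cafÃ©
```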

**Windows-1252 vs UTF-8:**

```
Smart quotes (Word docs):
→ “ ” (curly quotes, U+201C and U+201D)
→ Windows-1252 encoding

If treated as UTF-8:
→ Displays as � or ?
→ Common with Office doc imports
```
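Going the other direction, cp1252 bytes forced through a UTF-8 decoder fail outright rather than producing mojibake, because the curly-quote bytes are not valid UTF-8 start bytes (a minimal sketch):

```python
# Curly quotes as Windows-1252 bytes: 0x93 = “ and 0x94 = ”
word_bytes = "\u201cquoted\u201d".encode("cp1252")    # b'\x93quoted\x94'

# Strict UTF-8 decoding fails loudly:
try:
    word_bytes.decode("utf-8")
except UnicodeDecodeError as e:
    print("strict decode failed:", e.reason)

# Lossy decoding substitutes U+FFFD (�) for each bad byte:
print(word_bytes.decode("utf-8", errors="replace"))   # �quoted�
```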

### Embedding Model Limitations

**Model Vocabulary:**

```
Some embedding models:
→ Trained primarily on English ASCII
→ Limited non-ASCII support
→ May handle poorly:
  - Chinese characters
  - Arabic script
  - Cyrillic
  - Emoji

Result: Poor embeddings for non-English
```

**Normalization:**

```
Pre-process before embedding:
→ Convert smart quotes to straight quotes
→ “ ” → " "
→ – (en-dash) → - (hyphen)

Reduces encoding issues
But: Loses semantic nuance
```
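A normalization pass like this can be sketched with `str.translate`; the character map below is an illustrative subset, not an exhaustive list, and the flattening is lossy by design:

```python
# Map common typographic characters to plain-ASCII equivalents.
ASCII_MAP = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes → "
    "\u2018": "'", "\u2019": "'",   # curly single quotes → '
    "\u2013": "-", "\u2014": "-",   # en dash / em dash  → hyphen
    "\u00a0": " ",                  # non-breaking space → space
})

def normalize_typography(text: str) -> str:
    """Flatten typographic characters before embedding (lossy by design)."""
    return text.translate(ASCII_MAP)

print(normalize_typography("\u201cfast\u201d \u2013 10\u00d7"))  # "fast" - 10×
```

Note that characters outside the map (like × above) pass through unchanged, which is usually what you want: only known-problematic punctuation is flattened.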

### Detection Strategies

**Encoding Detection:**

```
Use chardet library (Python):
→ Detects encoding probabilistically
→ "This looks like UTF-8 with 95% confidence"

Apply detected encoding:
→ Decode file correctly
→ Re-encode as UTF-8 standard

Prevents misinterpretation
```
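A minimal ingestion helper along these lines, assuming the `chardet` package is installed (`pip install chardet`). The latin-1 fallback is a defensive assumption: detection can return `None`, and latin-1 can decode any byte sequence without raising:

```python
import chardet

def decode_to_utf8_text(raw: bytes) -> str:
    """Detect the likely encoding, decode, and return a str (UTF-8 in memory)."""
    guess = chardet.detect(raw)   # e.g. {'encoding': 'utf-8', 'confidence': 0.95, ...}
    encoding = guess["encoding"] or "latin-1"   # None means detection failed
    return raw.decode(encoding, errors="replace")
```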

**Validation:**

```
After ingestion, check:
→ Any � (replacement character)?
→ Excessive non-ASCII ranges?
→ Flag for review

Alerts to encoding problems
```
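The post-ingestion check can be a small pure function. The 30% non-ASCII threshold below is an arbitrary assumption to tune per corpus; legitimate CJK or Arabic documents will exceed it, so treat the warning as a review flag, not a rejection:

```python
REPLACEMENT_CHAR = "\ufffd"   # � is the tell-tale sign of a failed decode

def encoding_warnings(text: str, max_non_ascii_ratio: float = 0.30) -> list[str]:
    """Return human-readable warnings; an empty list means the text looks clean."""
    warnings = []
    if REPLACEMENT_CHAR in text:
        warnings.append(f"{text.count(REPLACEMENT_CHAR)} replacement character(s)")
    if text:
        non_ascii = sum(1 for ch in text if ord(ch) > 127)
        if non_ascii / len(text) > max_non_ascii_ratio:
            warnings.append(f"high non-ASCII ratio ({non_ascii}/{len(text)}); flag for review")
    return warnings

print(encoding_warnings("Price: \ufffd500"))  # ['1 replacement character(s)']
```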

### PDF Extraction Issues

**OCR vs Native Text:**

```
PDFs with scanned images:
→ OCR extracts text
→ OCR errors common:
  - l vs I vs 1 (ambiguous)
  - o vs 0
  - Special chars misread

Native PDFs (better):
→ Embedded text preserved
→ Higher fidelity
```

**Font Encoding:**

```
Custom fonts in PDFs:
→ Character mapping non-standard
→ Extraction gives wrong characters

Example:
→ Displays "А" (Cyrillic A, U+0410)
→ Extracts "A" (Latin A, U+0041)
→ Looks identical, semantically different
```
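Homoglyph confusion like the Cyrillic/Latin A case can be caught with the stdlib's `unicodedata` module, which names every code point:

```python
import unicodedata

latin_a = "A"           # U+0041
cyrillic_a = "\u0410"   # А, visually identical in most fonts

print(unicodedata.name(latin_a))      # LATIN CAPITAL LETTER A
print(unicodedata.name(cyrillic_a))   # CYRILLIC CAPITAL LETTER A
print(latin_a == cyrillic_a)          # False
```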

***

## How to Solve

**Detect encoding with chardet before processing, standardize all content to UTF-8, normalize problematic characters (smart quotes, dashes), use multilingual embedding models for non-English content, validate extracted text for replacement characters, prefer native-text PDFs over OCR when possible, and test with a multi-language eval set.** See [Character Encoding](/rag-scenarios-and-solutions/data-quality/encoding-issues.md).
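Stitched together, these steps form a short ingestion pass. The sketch below uses only the stdlib: a strict UTF-8 attempt with a cp1252 fallback stands in for full probabilistic detection, and NFC normalization gives equivalent character sequences one canonical form before embedding:

```python
import unicodedata

def ingest_text(raw: bytes) -> str:
    """Decode defensively, canonicalize, and return clean text (sketch)."""
    try:
        text = raw.decode("utf-8")        # strict: fails loudly on bad bytes
    except UnicodeDecodeError:
        text = raw.decode("cp1252", errors="replace")  # common legacy fallback
    # NFC collapses e.g. 'e' + combining acute accent into a single 'é'.
    return unicodedata.normalize("NFC", text)

print(ingest_text("café".encode("utf-8")))             # café
print(ingest_text("\u201cok\u201d".encode("cp1252")))  # “ok”
```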


