# Mathematical Notation Corrupted

## The Problem

LaTeX equations, mathematical symbols, and formulas are garbled or lost during extraction, making technical/scientific content unusable.

### Symptoms

* ❌ Equations show as raw LaTeX `\frac{a}{b}` instead of being rendered
* ❌ Greek letters (α, β, γ) replaced with ?
* ❌ Subscripts/superscripts lost (x₂ becomes x2)
* ❌ Matrices and complex formulas unreadable
* ❌ Mathematical meaning destroyed

### Real-World Example

```
Original documentation (LaTeX):

The formula for standard deviation is:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

After text extraction:
"The formula for standard deviation is: sigma = sqrt(1/N sum i=1 N (x_i - mu)^2)"

Summation bounds and grouping are now ambiguous, so the AI cannot recover the mathematical structure
User query: "How to calculate standard deviation?"
AI returns the garbled formula
```

***

## Deep Technical Analysis

### LaTeX Rendering and Extraction

Math documents use LaTeX notation:

**Inline vs Display Math:**

```latex
Inline: The value $x^2 + y^2 = r^2$ represents a circle.
Display: $$E = mc^2$$
```

**Extraction Challenges:**

```
Markdown processors:
→ May strip $ delimiters
→ May render to images (SVG/PNG)
→ May convert to MathML
→ May leave as raw LaTeX

Text extraction sees:
→ "The value x^2 + y^2 = r^2 represents a circle" (best case)
→ "The value [Math object] represents a circle" (if rendered)
→ "The value represents a circle" (worst case - formula gone)
```
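The delimiters described above can be located before any lossy conversion runs, so that math spans can be protected or processed separately. A minimal sketch, assuming only `$...$` and `$$...$$` delimiters are in use (not `\(...\)` or `\[...\]`). The function name `find_math_spans` is hypothetical, and a literal dollar sign such as a price would be misdetected:

```python
import re

# Display math ($$...$$) is tried first so it is not mistaken for inline spans.
MATH_SPAN = re.compile(r"\$\$(.+?)\$\$|\$(.+?)\$", re.DOTALL)

def find_math_spans(text):
    """Return a list of (kind, latex) tuples, one per math span found in text."""
    spans = []
    for match in MATH_SPAN.finditer(text):
        if match.group(1) is not None:
            spans.append(("display", match.group(1).strip()))
        else:
            spans.append(("inline", match.group(2).strip()))
    return spans

sample = "The value $x^2 + y^2 = r^2$ represents a circle. $$E = mc^2$$"
print(find_math_spans(sample))
```

Knowing where the math is lets a pipeline route those spans through a math-aware converter instead of a generic text cleaner.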

**The Symbol Loss:**

```
LaTeX: $\alpha + \beta = \gamma$
Should be: α + β = γ (Unicode)
Often becomes: a + b = g (ASCII substitution)

Or worse:
LaTeX: $\nabla \cdot \vec{E} = \frac{\rho}{\epsilon_0}$
Becomes: "? · vec E = ? / ?_0" (symbols lost)
```
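The ASCII substitution shown above can be avoided by mapping LaTeX symbol commands to their Unicode equivalents explicitly. The sketch below is illustrative only: the `LATEX_TO_UNICODE` table is a tiny hypothetical subset, and unknown commands (such as `\vec` or `\frac`) are deliberately left untouched rather than dropped:

```python
import re

# Illustrative subset; a production table would cover far more commands.
LATEX_TO_UNICODE = {
    r"\alpha": "α",
    r"\beta": "β",
    r"\gamma": "γ",
    r"\rho": "ρ",
    r"\epsilon": "ε",
    r"\nabla": "∇",
    r"\cdot": "·",
}

# A LaTeX command: a backslash followed by one or more letters.
COMMAND = re.compile(r"\\[A-Za-z]+")

def latex_symbols_to_unicode(text):
    """Replace known symbol commands with Unicode; keep unknown commands as-is."""
    return COMMAND.sub(lambda m: LATEX_TO_UNICODE.get(m.group(0), m.group(0)), text)

print(latex_symbols_to_unicode(r"\alpha + \beta = \gamma"))  # α + β = γ
```

Leaving unknown commands in place keeps the failure visible, which is usually preferable to silently replacing a symbol with `?`.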

### Unicode Mathematical Symbols

PDFs and web pages use Unicode math:

**Unicode Ranges:**

```
Greek: α β γ δ (U+03B1 to U+03C9)
Subscripts: ₀ ₁ ₂ (U+2080 to U+2089)
Superscripts: ⁰ ⁴ ⁵ (U+2070, U+2074 to U+2079); ¹ ² ³ are U+00B9, U+00B2, U+00B3
Operators: ∫ ∑ ∏ ∂ (U+2200 to U+22FF)
Arrows: → ⇒ ↔ (U+2190 to U+21FF)
```

**Extraction Problems:**

```
Character encoding issues:
→ UTF-8 document extracted as Latin-1
→ α (U+03B1) becomes Î± (garbled)

Or:
→ Fonts map glyphs into the Unicode Private Use Area
→ Symbol fonts map standard letters to symbols
→ "α" stored as "a" with special font
→ Extraction sees "a" not "α"
```
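The encoding mismatch above is easy to reproduce, and it can be reversed as long as no bytes were dropped along the way. A minimal sketch with hypothetical helper names; it does not help with the symbol-font case, where the stored character really is "a":

```python
def simulate_mojibake(text):
    """Encode as UTF-8, then wrongly decode the bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")

def repair_mojibake(text):
    """Undo a UTF-8-read-as-Latin-1 mistake; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        # Either the text has characters outside Latin-1 or the bytes are not valid UTF-8.
        return text

garbled = simulate_mojibake("α + β")
print(garbled)                   # Î± + Î²
print(repair_mojibake(garbled))  # α + β
```

The safest fix is still upstream: decode the document with its declared encoding rather than repairing the text afterwards.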

### Subscripts and Superscripts

Scientific notation uses sub/superscripts:

**Representation Formats:**

```
Unicode subscript: H₂O (H-two-O)
HTML: H<sub>2</sub>O
LaTeX: $H_2O$ or H$_2$O
Plain text: H2O (ambiguous!)
```

**Semantic Loss:**

```
Chemical formula: CO₂
Extracted as: CO2

Is this:
→ Carbon dioxide (CO₂)?
→ Company/abbreviation "CO2"?
→ Variable "co" multiplied by 2?

Without subscript, meaning ambiguous
```

**Complex Examples:**

```
Mathematical: e^(iπ) + 1 = 0
Should be: e raised to the power of (iπ)

Extracted variants:
→ "e^(iπ) + 1 = 0" (caret notation, acceptable)
→ "e(iπ) + 1 = 0" (superscript lost, wrong!)
→ "eiπ + 1 = 0" (completely wrong)
```
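One common way subscripts and superscripts get flattened is Unicode compatibility normalization (NFKC), which maps `₂` to a plain `2`. The sketch below contrasts that with an explicit conversion to `_` and `^` markup. It assumes digit-only sub/superscripts, and the function name `to_explicit_markup` is hypothetical:

```python
import re
import unicodedata

SUB_TO_ASCII = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")
SUP_TO_ASCII = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")

def to_explicit_markup(text):
    """Rewrite runs of Unicode subscript/superscript digits as _N and ^N."""
    # Subscript digits are contiguous (U+2080 to U+2089).
    text = re.sub("[₀-₉]+", lambda m: "_" + m.group(0).translate(SUB_TO_ASCII), text)
    # Superscript digits are not contiguous: ¹ ² ³ live in the Latin-1 block.
    text = re.sub("[⁰¹²³⁴-⁹]+", lambda m: "^" + m.group(0).translate(SUP_TO_ASCII), text)
    return text

print(unicodedata.normalize("NFKC", "CO₂"))  # CO2 (subscript silently lost)
print(to_explicit_markup("CO₂"))             # CO_2
print(to_explicit_markup("x² + y² = r²"))    # x^2 + y^2 = r^2
```

Running the explicit conversion before any NFKC step keeps the distinction between `CO₂` and `CO2` in the extracted text.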

### Matrices and Vectors

Multi-dimensional math needs structure:

**Matrix Notation:**

```latex
$$\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}$$
```

**Text Extraction:**

```
Best case: "[[a, b], [c, d]]" (array notation)
Common case: "a b c d" (flat, no structure)
Worst case: "a    b c    d" (spacing preserved but meaningless)

Lost information:
→ 2x2 matrix structure
→ Row/column organization
→ Mathematical meaning

Query: "What's the transformation matrix?"
AI sees: "a b c d"
Cannot reconstruct matrix form
```
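The "best case" array notation above can be produced directly from the LaTeX source, before rendering discards the row and column structure. A minimal sketch, assuming a simple `matrix`, `bmatrix`, `pmatrix`, or `vmatrix` environment with no nested environments; `parse_matrix` is a hypothetical name:

```python
import re

# Capture the environment name so \begin and \end must agree.
MATRIX = re.compile(r"\\begin\{([bpv]?matrix)\}(.*?)\\end\{\1\}", re.DOTALL)

def parse_matrix(latex):
    """Return the first matrix in latex as a list of rows, or None if there is none."""
    match = MATRIX.search(latex)
    if match is None:
        return None
    # Rows are separated by a double backslash, cells by an ampersand.
    rows = [row.strip() for row in match.group(2).split(r"\\")]
    return [[cell.strip() for cell in row.split("&")] for row in rows if row]

source = r"""$$\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}$$"""
print(parse_matrix(source))  # [['a', 'b'], ['c', 'd']]
```

The nested list can then be serialized as `[[a, b], [c, d]]` in the chunk text, which keeps the 2x2 shape visible to the model.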

**Vector Notation:**

```
LaTeX: $\vec{v} = \langle x, y, z \rangle$
or: $\vec{v} = (x, y, z)$

Extracted:
→ "vec v = <x, y, z>" (best case)
→ "v = x, y, z" (lost vector arrow)
→ "v = x y z" (no delimiters, ambiguous)
```

### Fractions and Complex Expressions

Hierarchical math expressions:

**Fraction Rendering:**

```
LaTeX: \frac{numerator}{denominator}

Display rendering:
  numerator
  ---------
 denominator

Text extraction:
"numerator / denominator" (linear, acceptable)
or "numerator denominator" (lost division!)
```

**Nested Fractions:**

```
LaTeX: \frac{1}{\frac{a}{b} + c}

Should be: 1 / ((a/b) + c)

Often becomes: "1 a/b + c"
→ Lost nested structure
→ Ambiguous: Is it (1/(a/b)) + c or 1/((a/b) + c)?
→ Different meanings!
```
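The nesting ambiguity above disappears if every fraction is parenthesized when it is linearized. The sketch below walks the LaTeX string, matches braces by depth, and recurses into numerator and denominator. It handles only `\frac{...}{...}` and passes everything else through unchanged; the function names are hypothetical:

```python
def _read_group(s, i):
    """Return (content, index_after) for the brace group that opens at s[i]."""
    if i >= len(s) or s[i] != "{":
        raise ValueError("expected '{'")
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")

def _wrap(expr):
    """Parenthesize anything that is not a single alphanumeric token."""
    return expr if expr.isalnum() else f"({expr})"

def linearize_fractions(s):
    """Rewrite every \\frac{n}{d} as a fully parenthesized (n / d)."""
    out, i = [], 0
    while i < len(s):
        if s.startswith(r"\frac{", i):
            num, i = _read_group(s, i + len(r"\frac"))
            den, i = _read_group(s, i)
            out.append(f"({_wrap(linearize_fractions(num))} / {_wrap(linearize_fractions(den))})")
        else:
            out.append(s[i])
            i += 1
    return "".join(out)

print(linearize_fractions(r"\frac{1}{\frac{a}{b} + c}"))  # (1 / ((a / b) + c))
```

The extra parentheses are redundant for a human reader but remove any dependence on operator precedence in the extracted text.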

### Sum and Product Notation

Series notation with bounds:

**Summation:**

```
LaTeX: \sum_{i=1}^{n} x_i

Should convey:
→ Sum over variable i
→ From i=1 to i=n
→ of expression x_i

Extracted variants:
→ "sum(i=1 to n) x_i" (good)
→ "sum i=1 n x_i" (acceptable)
→ "sum x_i" (bounds lost!)
→ "x_i" (sum operator lost!)
```

**Integral:**

```
LaTeX: \int_{0}^{\infty} e^{-x} dx

Components:
→ Integral operator (∫)
→ Lower bound: 0
→ Upper bound: ∞
→ Integrand: e^{-x}
→ Differential: dx (integration variable x)

Text extraction must preserve all components
→ "integral from 0 to infinity of e^(-x) dx"
→ Or: "∫₀^∞ e^(-x) dx"

Common failures:
→ "e^(-x) dx" (bounds lost)
→ "integral e^(-x)" (bounds and variable lost)
```
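The "from ... to ... of" phrasing above can be generated from the LaTeX source so that bounds are never dropped. A minimal sketch, assuming bounds are written as `_{lower}^{upper}` in that order and contain no nested braces; `verbalize_bounds` and the lookup tables are hypothetical:

```python
import re

# Operator, lower bound, upper bound; bounds may not contain nested braces here.
BOUNDED_OP = re.compile(r"\\(sum|prod|int)_\{([^{}]*)\}\^\{([^{}]*)\}\s*")
OP_NAMES = {"sum": "sum", "prod": "product", "int": "integral"}
BOUND_WORDS = {r"\infty": "infinity"}

def _spell_out(bound):
    """Replace known LaTeX commands inside a bound with plain words."""
    for command, word in BOUND_WORDS.items():
        bound = bound.replace(command, word)
    return bound

def verbalize_bounds(latex):
    """Rewrite bounded sum/product/integral operators as explicit English phrases."""
    def replace(match):
        op, lower, upper = match.groups()
        return f"{OP_NAMES[op]} from {_spell_out(lower)} to {_spell_out(upper)} of "
    return BOUNDED_OP.sub(replace, latex)

print(verbalize_bounds(r"\sum_{i=1}^{n} x_i"))           # sum from i=1 to n of x_i
print(verbalize_bounds(r"\int_{0}^{\infty} e^{-x} dx"))  # integral from 0 to infinity of e^{-x} dx
```

Everything after the operator, including the `dx`, is passed through untouched, so the integration variable survives as well.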

### Equation Numbering and References

Documents reference equations by number:

**Equation Labels:**

```latex
The relationship is shown in Equation (3):
$$E = mc^2 \tag{3}$$

As we saw in (3), energy and mass are related.
```

**Cross-Reference Problem:**

```
Chunking:
Chunk 1: "The relationship is shown in Equation (3):"
Chunk 2: "E = mc^2 (equation 3)"
Chunk 3: "As we saw in (3), energy and mass are related."

Chunks 1 and 3 reference equation in Chunk 2
→ Separated by chunking
→ References broken
→ Cannot follow "see equation (3)"
```
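One way to keep references resolvable after chunking is to copy the referenced equation into every chunk that cites it. The sketch below assumes equations carry a `\tag{N}` label and that references appear as `(N)` in the prose. The function name is hypothetical, and a bare `(3)` that is not an equation reference would also match:

```python
import re

TAG = re.compile(r"\\tag\{(\d+)\}")   # equation label inside a chunk
REF = re.compile(r"\((\d+)\)")        # prose reference such as "(3)"

def attach_referenced_equations(chunks):
    """Append the text of each referenced, tagged equation to the chunks citing it."""
    equations = {}
    for chunk in chunks:
        for match in TAG.finditer(chunk):
            equations[match.group(1)] = chunk.strip()

    enriched = []
    for chunk in chunks:
        if TAG.search(chunk):
            enriched.append(chunk)  # the equation chunk itself needs no copy
            continue
        # dict.fromkeys removes duplicate references while keeping their order.
        numbers = [n for n in dict.fromkeys(REF.findall(chunk)) if n in equations]
        extra = "".join(f"\n[Equation ({n}): {equations[n]}]" for n in numbers)
        enriched.append(chunk + extra)
    return enriched

chunks = [
    "The relationship is shown in Equation (3):",
    r"$$E = mc^2 \tag{3}$$",
    "As we saw in (3), energy and mass are related.",
]
for chunk in attach_referenced_equations(chunks):
    print(chunk, end="\n\n")
```

This trades some duplication in the index for chunks that can be understood on their own when retrieved.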

### Embedding Mathematical Text

Math in vector embeddings:

**Token Representation:**

```
Text: "The formula x² + y² = r²"

Tokenization:
→ "The" "formula" "x" "²" "+" "y" "²" "=" "r" "²"

Or:
→ "The" "formula" "x" "^2" "+" "y" "^2" "=" "r" "^2"

Different tokenizations:
→ Different embeddings
→ Inconsistent retrieval

Query: "formula for circle"
Embedding may not match "x² + y² = r²"
→ Semantic gap between prose and formula
```

**Semantic Understanding:**

```
Embedding models trained on:
→ Natural language primarily
→ Some code (if code-aware)
→ Limited mathematical notation

Query: "standard deviation formula"
Doc contains: "σ = √(1/N Σ(x_i - μ)²)"

Embedding similarity:
→ Low, because model doesn't "understand" math symbols
→ Treats symbols as arbitrary characters
→ Misses semantic connection

Better if doc also has:
"σ = standard deviation
μ = mean
Formula calculates population standard deviation"

Natural language augments math symbols
```
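The augmentation described above can be applied automatically at indexing time for a known set of symbols. A minimal sketch: the `SYMBOL_NAMES` glossary is hypothetical and domain-specific (σ means standard deviation in statistics but something else in other fields), so it should be treated as an assumption to adapt per corpus:

```python
# Hypothetical, statistics-flavored glossary; meanings depend on the domain.
SYMBOL_NAMES = {
    "σ": "standard deviation",
    "μ": "mean",
    "Σ": "summation",
    "∑": "summation",
    "√": "square root",
}

def augment_formula(formula, description=""):
    """Return the formula followed by a symbol glossary and an optional description."""
    glossary = [f"{symbol} = {name}" for symbol, name in SYMBOL_NAMES.items() if symbol in formula]
    parts = [formula]
    if glossary:
        parts.append("Symbols: " + "; ".join(glossary))
    if description:
        parts.append(description)
    return "\n".join(parts)

print(augment_formula(
    "σ = √(1/N Σ(x_i - μ)²)",
    "Formula calculates population standard deviation",
))
```

The formula itself is kept verbatim, and the added lines give the embedding model natural-language terms that a query such as "standard deviation formula" can match.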

### MathML and Structured Math

Alternative to LaTeX:

**MathML Format:**

```xml
<math>
  <mfrac>
    <mi>a</mi>
    <mi>b</mi>
  </mfrac>
</math>
```

**Extraction Advantage:**

```
MathML is structured XML:
→ Can parse with XML parser
→ Extract semantic meaning
→ Convert to text format

<mfrac>: "fraction"
<mi>a</mi>: "variable a"
<mi>b</mi>: "variable b"

Can generate: "a/b" or "a divided by b"

Better than LaTeX string parsing
```
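The parsing advantage above can be sketched with Python's standard `xml.etree.ElementTree`. This example handles only a handful of presentation MathML elements (`mi`, `mn`, `mo`, `mfrac`, `msup`, `msub`) and joins anything else with spaces; `mathml_to_text` is a hypothetical name, not part of any library:

```python
import xml.etree.ElementTree as ET

def mathml_to_text(node):
    """Convert a small subset of presentation MathML into linear text."""
    tag = node.tag.split("}")[-1]  # drop the XML namespace prefix if present
    if tag in ("mi", "mn", "mo"):
        return (node.text or "").strip()
    children = [mathml_to_text(child) for child in node]
    if tag == "mfrac" and len(children) == 2:
        # Parentheses keep numerator and denominator grouping explicit.
        return f"({children[0]})/({children[1]})"
    if tag == "msup" and len(children) == 2:
        return f"{children[0]}^({children[1]})"
    if tag == "msub" and len(children) == 2:
        return f"{children[0]}_({children[1]})"
    return " ".join(child for child in children if child)

fraction = ET.fromstring("<math><mfrac><mi>a</mi><mi>b</mi></mfrac></math>")
print(mathml_to_text(fraction))  # (a)/(b)
```

Because the structure comes from the XML tree rather than from string matching, nested fractions and exponents keep their grouping without any brace counting.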

**But Rare:**

```
Most documents use:
→ LaTeX (academic, technical docs)
→ Unicode symbols (web pages)
→ Images (PowerPoint exports)

MathML rare in:
→ PDFs
→ Markdown
→ Plain text

Limited applicability
```

***

## How to Solve

**Convert LaTeX to Unicode math symbols where possible, use textual descriptions for complex formulas ("sum from i=1 to n of x\_i"), preserve equation structure with parentheses, keep equation labels and numbers, and augment formulas with natural language explanations.** See [Mathematical Content](/rag-scenarios-and-solutions/chunking/math-notation.md).


