# Footnotes and References Lost

## The Problem

During chunking, footnotes, endnotes, and citations are often separated from the markers that reference them. The resulting chunks lose critical context, and academic or legal citations can no longer be traced to their sources.

### Symptoms

* ❌ Reference markers (\[1], \*, †) appear without footnotes
* ❌ Footnotes separated from main text
* ❌ "See note 5" but note 5 not in chunk
* ❌ Academic citations incomplete
* ❌ Legal references missing source

### Real-World Example

```
Original document:

The API rate limit¹ is 1000 requests per hour for free users².

────────────────────────
¹ Rate limits reset at midnight UTC
² Enterprise plans have higher limits

Chunk boundary falls between text and footnotes ↓

Chunk 1:
"The API rate limit¹ is 1000 requests per hour for free users²."

Chunk 2:
"¹ Rate limits reset at midnight UTC
² Enterprise plans have higher limits"

User sees Chunk 1: "What does ¹ mean?"
AI cannot resolve reference (footnote in different chunk)
```

***

## Deep Technical Analysis

### Footnote Types and Formats

Different notation systems:

**Numbering Systems:**

```
Numeric: [1], [2], [3] or ¹, ², ³
Alphabetic: [a], [b], [c]
Symbolic: *, †, ‡, §, ‖, ¶
Roman: [i], [ii], [iii]
```
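The notation families above can be located with a single regular expression. The following is a minimal sketch, not a complete detector: `MARKER_PATTERN` and `find_markers` are hypothetical names, and the pattern only covers the marker shapes listed in this section. Single letters such as `[i]` are ambiguous between alphabetic and Roman numbering, so both are reported under one `bracket` kind.

```python
import re

SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"

# One alternative per notation family; the named group records which one matched.
MARKER_PATTERN = re.compile(
    r"(?P<superscript>[" + SUPERSCRIPT_DIGITS + r"]+)"
    r"|\[(?P<bracket>\d+|[a-z]|[ivxlc]+)\]"
    r"|(?P<symbol>[*†‡§‖¶])"
)


def find_markers(text: str) -> list:
    """Return (kind, marker) pairs in order of appearance."""
    found = []
    for match in MARKER_PATTERN.finditer(text):
        kind = match.lastgroup  # name of the group that matched
        found.append((kind, match.group(kind)))
    return found


markers = find_markers("limit¹ for users² [a] and [iii] also*")
```

A pattern like this only finds candidates. Whether a candidate really is a footnote marker still depends on its context.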

**Placement Variations:**

```
Bottom of page (traditional):
Main text with marker¹
────────────────────
¹ Footnote content

End of section:
Main text with marker¹
[Section continues...]
Notes:
¹ Footnote content

End of document (endnotes):
Main text with marker¹
[Many pages later...]
Endnotes:
¹ Footnote content
```

**Detection Challenges:**

```
Superscript numbers:
→ Could be footnote: "value¹"
→ Or exponent: "x²"
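→ Or chemical notation: "Ca²⁺" (subscripts such as "CO₂" are separate characters, but extraction can flatten both to plain digits)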

Context needed to distinguish

Square brackets:
→ Could be footnote: "research [1] shows..."
→ Or citation: "see [Smith, 2020]"
→ Or array notation: "array[1]"
→ Or just brackets: "[optional] parameter"
```
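One way to approach these ambiguities is to look at the characters immediately around a candidate marker. The sketch below assumes two simple heuristics that are not taken from any particular library: a superscript after a single letter or a number is treated as an exponent, and a bracket glued to an identifier is treated as an index. The function names are hypothetical, and real documents will need more rules than these.

```python
import re


def classify_superscript(text: str, index: int) -> str:
    """Guess whether the superscript at text[index] is a footnote or an exponent."""
    # Run of ASCII word characters immediately before the superscript.
    preceding = re.search(r"[A-Za-z0-9_]*$", text[:index]).group(0)
    if preceding.isdigit() or len(preceding) == 1:
        return "exponent"  # e.g. "10²" or "x²"
    return "footnote"      # e.g. "value¹", or a marker after punctuation


def classify_bracket(text: str, start: int) -> str:
    """Guess what the bracket opening at text[start] represents."""
    end = text.index("]", start)
    inner = text[start + 1:end]
    if start > 0 and (text[start - 1].isalnum() or text[start - 1] == "_"):
        return "index"        # "array[1]": no space before the bracket
    if inner.isdigit():
        return "numeric-ref"  # "research [1] shows"
    if re.fullmatch(r"[A-Z][\w.\- ]+,? \d{4}", inner):
        return "author-year"  # "[Smith, 2020]"
    return "plain"            # "[optional] parameter"
```

These rules will misclassify some cases (for example a one-letter word followed by a footnote marker), so the result is best treated as a hint rather than a decision.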

### Marker-to-Note Matching

Associating references with definitions:

**Matching Algorithm:**

```
1. Scan document for markers: ¹, ², ³
2. Scan document for footnote definitions:
   → Look for "¹ " at start of line
   → Or "──" separator followed by notes
3. Match markers to definitions by number/symbol
4. Store associations

Challenges:
→ Multiple numbering systems in same doc
→ Nested footnotes (footnotes referencing footnotes)
→ Reused numbers across chapters
→ Non-contiguous numbering (1, 2, 5, 7 - some missing)
```
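The four steps above can be sketched as follows. This is a minimal illustration that assumes superscript markers and definitions that start a line with the marker followed by whitespace. `match_footnotes` is a hypothetical name, and the sketch deliberately ignores chapter scoping and nested notes. A marker that appears several times in the text links to the same single definition.

```python
import re

SUPERSCRIPTS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
DEFINITION = re.compile(rf"^([{SUPERSCRIPTS}]+)\s+(.*)$")
MARKER = re.compile(rf"[{SUPERSCRIPTS}]+")


def match_footnotes(document: str) -> dict:
    """Separate definitions from body text and link each marker to its definition."""
    body_lines, definitions = [], {}
    for line in document.splitlines():
        found = DEFINITION.match(line)
        if found:
            definitions[found.group(1)] = found.group(2).strip()
        else:
            body_lines.append(line)

    markers = MARKER.findall("\n".join(body_lines))
    return {
        # Repeated markers collapse onto the same definition.
        "links": {m: definitions[m] for m in markers if m in definitions},
        # Markers with no definition (for example, gaps in the numbering).
        "unresolved": sorted({m for m in markers if m not in definitions}),
        "marker_count": {m: markers.count(m) for m in set(markers)},
    }
```

Keeping `unresolved` separate makes it easy to log markers whose definitions were lost during extraction instead of silently dropping them.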

**The Multiple Reference Problem:**

```
Main text:
"This concept¹ was explored by several researchers¹."

Same footnote referenced twice
→ Footnote 1 appears once at bottom
→ Two markers in text
→ Must link both to same footnote
```

**Cross-Chapter Footnotes:**

```
Chapter 1 footnotes: 1-15
Chapter 2 footnotes: 1-12 (numbering restarts!)

Marker "2" in Chapter 2 ≠ Marker "2" in Chapter 1
→ Must track context (chapter/section)
→ Avoid mixing footnotes across chapters
```
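A simple way to track that context is to key every footnote by its chapter as well as its marker. The sketch below assumes the footnotes have already been grouped per chapter. The function names and the sample data are illustrative only.

```python
from typing import Optional


def build_scoped_index(chapters: dict) -> dict:
    """Flatten {chapter: {marker: text}} into a table keyed by (chapter, marker)."""
    scoped = {}
    for chapter, notes in chapters.items():
        for marker, text in notes.items():
            scoped[(chapter, marker)] = text
    return scoped


def resolve_marker(scoped: dict, chapter: str, marker: str) -> Optional[str]:
    """Look up a marker inside its own chapter only."""
    return scoped.get((chapter, marker))


# Hypothetical data: marker "2" exists in both chapters with different content.
index = build_scoped_index({
    "chapter-1": {"1": "First note of chapter 1", "2": "Second note of chapter 1"},
    "chapter-2": {"1": "First note of chapter 2", "2": "Second note of chapter 2"},
})
note = resolve_marker(index, "chapter-2", "2")
```

For this to work, each chunk needs to carry its chapter or section identifier in metadata so the lookup can be scoped at retrieval time.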

### Academic Citations

Scholarly documents use formal citations:

**Citation Formats:**

```
IEEE: [1], [2], [3] (numeric)
APA: (Smith, 2020) (author-year)
MLA: (Smith 24) (author-page)
Chicago: Superscript¹ (note-based)
Harvard: (Smith 2020, p.15) (author-year-page)
```
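The formats above differ mainly in shape, which a few regular expressions can distinguish. The sketch below classifies a citation string by shape rather than by style guide, because several styles share the same shape (numeric brackets are not unique to IEEE). The pattern names and `citation_shape` are hypothetical, and the patterns only cover single-author examples like the ones listed here.

```python
import re

# Order matters: the more specific author-year-page shape is tried before author-year.
CITATION_SHAPES = [
    ("numeric", re.compile(r"\[\d+\]")),
    ("author-year-page", re.compile(r"\([A-Z][A-Za-z]+ \d{4}, p\.\s?\d+\)")),
    ("author-year", re.compile(r"\([A-Z][A-Za-z]+, \d{4}\)")),
    ("author-page", re.compile(r"\([A-Z][A-Za-z]+ \d{1,3}\)")),
]


def citation_shape(citation: str) -> str:
    """Return the name of the first shape that matches the whole citation."""
    for name, pattern in CITATION_SHAPES:
        if pattern.fullmatch(citation):
            return name
    return "unknown"
```

Limiting the author-page shape to page numbers of up to three digits is an arbitrary choice made here to keep it from colliding with four-digit years.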

**Inline vs Bibliography:**

```
Inline citation:
"Previous research [1] demonstrated..."

Bibliography (separate section):
[1] Smith, J. (2020). Title of Paper. Journal Name, 15(3), 123-145.

Chunking challenge:
→ Citation [1] in main text
→ Full reference in bibliography (different location)
→ May be in different chunk entirely

LLM sees [1] but can't resolve to full citation
```
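One way to close this gap is to parse the bibliography once and attach the relevant entries to each chunk as metadata. The following sketch assumes numeric `[n]` citations and one bibliography entry per line. `parse_bibliography` and `attach_references` are hypothetical helpers, not part of any specific chunking library.

```python
import re


def parse_bibliography(section: str) -> dict:
    """Map citation numbers to their full bibliography entries."""
    entries = {}
    for line in section.splitlines():
        found = re.match(r"\[(\d+)\]\s+(.+)", line.strip())
        if found:
            entries[int(found.group(1))] = found.group(2)
    return entries


def attach_references(chunk: str, bibliography: dict) -> dict:
    """Return the chunk with the entries it cites stored alongside the text."""
    cited = [int(n) for n in re.findall(r"\[(\d+)\]", chunk)]
    return {
        "text": chunk,
        "references": {n: bibliography[n] for n in cited if n in bibliography},
        "unresolved": [n for n in cited if n not in bibliography],
    }


bibliography = parse_bibliography(
    "[1] Smith, J. (2020). Title of Paper. Journal Name, 15(3), 123-145."
)
chunk = attach_references("Previous research [1] demonstrated...", bibliography)
```

Storing the entries in metadata keeps the chunk text unchanged while still letting the LLM (or a post-processing step) resolve `[1]` to its source.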

**Citation Clustering:**

```
Multiple citations together:
"This is well documented [1,2,3,5-8,12]."

Represents:
→ 8 separate citations (1,2,3,5,6,7,8,12)
→ Must expand ranges (5-8)
→ Link all to bibliography

If chunk contains this line:
→ Need ALL 8 bibliography entries
→ But they may span multiple pages
→ Impractical to include all
```
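Expanding a cluster into individual citation numbers is mechanical. The sketch below assumes comma-separated numbers and hyphen or en-dash ranges inside a single pair of brackets. `expand_citation_cluster` is a hypothetical name.

```python
import re


def expand_citation_cluster(cluster: str) -> list:
    """Turn "[1,2,3,5-8,12]" into the individual citation numbers it represents."""
    numbers = []
    for part in cluster.strip("[]").split(","):
        part = part.strip()
        bounds = re.fullmatch(r"(\d+)\s*[-–]\s*(\d+)", part)
        if bounds:
            start, end = int(bounds.group(1)), int(bounds.group(2))
            numbers.extend(range(start, end + 1))  # ranges are inclusive
        elif part.isdigit():
            numbers.append(int(part))
    return numbers


expanded = expand_citation_cluster("[1,2,3,5-8,12]")
# expanded == [1, 2, 3, 5, 6, 7, 8, 12]
```

Once expanded, the numbers can be stored as chunk metadata and resolved to bibliography entries on demand, which avoids copying every full reference into the chunk text.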

### Legal Citations

Legal documents have specific citation requirements:

**Legal Citation Format:**

```
Case law: Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)
Statute: 42 U.S.C. § 1983
Regulation: 17 C.F.R. § 240.10b-5

Components:
→ Case name
→ Reporter volume & page
→ Court
→ Year
```

**The String Citation:**

```
Legal writing uses "string citations":
"This principle is established. See Smith v. Jones, 123 F.3d 456, 460 (9th Cir. 2020); Doe v. Roe, 789 F.2d 123, 125 (2d Cir. 2019); Johnson v. Williams, 456 F.Supp. 789 (S.D.N.Y. 2018)."

Single sentence with 3 citations
→ Must keep together
→ Splitting loses context
→ "See Smith v. Jones" alone is incomplete
```

**Abbreviated Citations:**

```
First reference (full):
"Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)"

Later references (short):
"Smith, 123 F.3d at 460"
or just: "Id. at 461" (same case as previous)

"Id." depends on previous citation
→ Must track citation history
→ If chunk starts mid-document: "Id." unresolvable
```
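Tracking citation history can be sketched as a single pass that remembers the most recent full citation and annotates each "Id." with it. The regular expression below is an assumption tuned to the case-law shape shown in this section and will not cover statutes, regulations, or every reporter format. `resolve_id_citations` and its `carried_over` parameter are hypothetical names.

```python
import re
from typing import Optional

# A full case citation such as "Smith v. Jones, 123 F.3d 456 (9th Cir. 2020)",
# or an "Id." short form with an optional pin cite.
CITATION = re.compile(
    r"(?P<full>[A-Z]\w+ v\. [A-Z]\w+, \d+ [\w.]+ \d+(?:, \d+)? \([^)]*\d{4}\))"
    r"|(?P<id>\bId\.(?: at \d+)?)"
)


def resolve_id_citations(text: str, carried_over: Optional[str] = None) -> str:
    """Annotate each "Id." with the most recent full citation seen so far."""
    last_full = carried_over  # citation inherited from the previous chunk, if any

    def annotate(match):
        nonlocal last_full
        if match.group("full"):
            last_full = match.group("full")
            return match.group(0)
        if last_full is None:
            return match.group(0)  # unresolvable: leave the text untouched
        return f"{match.group(0)} [= {last_full}]"

    return CITATION.sub(annotate, text)
```

Passing the last full citation of the preceding chunk as `carried_over` lets a chunk that starts mid-document still resolve its first "Id.".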

### Footnote Content Length

Footnotes vary from brief to extensive:

**Short Footnotes:**

```
Main text: "The API¹ supports JSON."
Footnote: ¹ Application Programming Interface

Brief definition: 5-10 words
→ Easy to include with main text
```

**Long Footnotes:**

```
Main text: "The algorithm¹ is efficient."
Footnote: ¹ The algorithm is based on dynamic programming principles first introduced by Bellman (1957) and later refined by Dijkstra (1959). Modern implementations typically use a priority queue for efficiency. For a comprehensive treatment of the theoretical foundations, see Cormen et al. (2009), Chapter 24. Note that the worst-case time complexity is O(n²) for dense graphs but can be reduced to O(n log n) with appropriate data structures. [200 words...]

Footnote longer than main text!
→ Including footnote inflates chunk size
→ Excluding footnote loses critical detail
```

**The Inclusion Decision:**

```
Options:
1. Always include footnotes in chunk
   → Pros: Complete context
   → Cons: Very large chunks, repetition

2. Never include footnotes
   → Pros: Smaller chunks
   → Cons: Incomplete information

3. Include short footnotes (<50 words)
   → Pros: Balance
   → Cons: Arbitrary threshold, inconsistent

4. Include footnotes as separate chunks with back-references
   → Pros: Modular
   → Cons: Complex linking required
```
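Options 3 and 4 can be combined: inline short footnotes and link long ones through metadata. The sketch below uses the 50-word threshold from option 3, which is an arbitrary cut-off and is exposed as a parameter for that reason. `place_footnote` is a hypothetical helper.

```python
def place_footnote(chunk_text: str, marker: str, footnote: str,
                   max_inline_words: int = 50) -> dict:
    """Inline a short footnote in place of its marker, or link a long one as metadata."""
    if len(footnote.split()) < max_inline_words:
        # Short note: replace the marker with a parenthetical.
        inlined = chunk_text.replace(marker, f" ({footnote})")
        return {"text": inlined, "linked_footnotes": {}}
    # Long note: keep the chunk text as-is and store the note separately.
    return {"text": chunk_text, "linked_footnotes": {marker: footnote}}


short = place_footnote(
    "The API¹ supports JSON.", "¹", "Application Programming Interface"
)
# short["text"] == "The API (Application Programming Interface) supports JSON."
```

Counting words is a rough proxy for size. A token-based limit tied to the embedding model's context window may be a better fit in practice.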

### Inline Notes vs Margin Notes

Different annotation styles:

**Inline Parenthetical:**

```
"The result (as shown in Figure 3) demonstrates..."

Not a footnote, but similar
→ Interruptive aside
→ Could be moved to footnote
→ But author chose inline

Should chunk preserve parenthetical positioning?
Or normalize: "The result demonstrates... See Figure 3."
```

**Margin Notes:**

```
Document layout:
┌────────────────┬──────────┐
│ Main text here │ Note: Im-│
│ continues with │ portant  │
│ more content.  │ detail!  │
└────────────────┴──────────┘

Margin note parallel to main text
→ Not clearly "after" any specific paragraph
→ Extraction order ambiguous
→ Associate with nearby text? Which paragraph?
```
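If the extraction tool reports bounding boxes, one possible answer to "which paragraph?" is to pick the paragraph with the largest vertical overlap. The sketch below assumes each block carries `top` and `bottom` coordinates with y growing downward. The `Block` type and both functions are hypothetical and not tied to any particular PDF library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    text: str
    top: float     # y coordinate of the upper edge (assumed to grow downward)
    bottom: float  # y coordinate of the lower edge


def vertical_overlap(a: Block, b: Block) -> float:
    """Length of the vertical span shared by two blocks (0 if they do not overlap)."""
    return max(0.0, min(a.bottom, b.bottom) - max(a.top, b.top))


def attach_margin_note(note: Block, paragraphs: list) -> Optional[Block]:
    """Return the paragraph that overlaps the note the most, or None if none does."""
    best = max(paragraphs, key=lambda p: vertical_overlap(p, note), default=None)
    if best is None or vertical_overlap(best, note) == 0:
        return None
    return best
```

When no paragraph overlaps, returning `None` leaves the decision to the caller, who might fall back to the nearest paragraph or keep the note as its own chunk.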

### Reference Loops and Nested Notes

Complex referencing structures:

**Footnote Referencing Footnote:**

```
Main text: "The concept¹ is fundamental."
Footnote 1: "See also related work²"
Footnote 2: "Smith (2020) provides comprehensive review."

Multi-level reference chain
→ Resolving ¹ requires ²
→ Deep linking required
```

**Circular References:**

```
Footnote A: "See Footnote B for details"
Footnote B: "As mentioned in Footnote A..."

Circular dependency
→ Cannot resolve independently
→ Must include both together
```
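Both the multi-level chain and the circular case can be handled by collecting every footnote reachable from a marker while remembering which ones were already visited. The sketch below assumes footnotes reference each other with superscript markers. `collect_footnote_closure` is a hypothetical name.

```python
import re

REFERENCE = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")


def collect_footnote_closure(marker: str, footnotes: dict) -> list:
    """Return every footnote reachable from marker, visiting each at most once."""
    ordered, seen, pending = [], set(), [marker]
    while pending:
        current = pending.pop()
        if current in seen or current not in footnotes:
            continue  # already handled, or a dangling reference
        seen.add(current)
        ordered.append(current)
        # Queue any markers mentioned inside this footnote.
        pending.extend(REFERENCE.findall(footnotes[current]))
    return ordered


chain = collect_footnote_closure("¹", {
    "¹": "See also related work²",
    "²": "Smith (2020) provides comprehensive review.",
})
# chain == ["¹", "²"]
```

The `seen` set is what guarantees termination on circular references: each footnote is expanded once, so two notes that point at each other are simply returned together.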

### Embedding and Retrieval Impact

Footnotes affect semantic search:

**Footnote Content in Embeddings:**

```
Query: "API rate limit reset time"

Option 1: Embed main text only
"The API rate limit is 1000 requests per hour."
→ Doesn't match query (no "reset" mentioned)

Option 2: Embed main text + footnotes
"The API rate limit is 1000 requests per hour. Rate limits reset at midnight UTC."
→ Matches query! Footnote has answer

Conclusion: Must include footnotes for complete semantic coverage
```
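Option 2 can be sketched as a small preprocessing step that removes the markers and appends the referenced footnote text before the chunk is embedded. This is an illustration under the assumption of superscript markers and an already-built marker-to-footnote mapping. `build_embedding_text` is a hypothetical helper, and no specific embedding API is implied.

```python
import re

MARKER = re.compile(r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+")


def build_embedding_text(chunk: str, footnotes: dict) -> str:
    """Drop the markers and append each referenced footnote once."""
    referenced = []
    for marker in MARKER.findall(chunk):
        note = footnotes.get(marker)
        if note and note not in referenced:
            referenced.append(note)
    main_text = MARKER.sub("", chunk)
    # Normalize each note so it ends with exactly one period.
    sentences = [main_text] + [note.rstrip(".") + "." for note in referenced]
    return " ".join(sentences)


text_to_embed = build_embedding_text(
    "The API rate limit¹ is 1000 requests per hour for free users².",
    {"¹": "Rate limits reset at midnight UTC",
     "²": "Enterprise plans have higher limits"},
)
```

The original chunk text, with its markers intact, can still be stored for display. Only the string passed to the embedding model needs this expanded form.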

**Citation Noise:**

```
Academic paper full of citations:
"This approach [1,2,3] outperforms baseline [4,5] with significant improvements [6,7,8,9]."

Embedding includes: [1,2,3,4,5,6,7,8,9]
→ Citation numbers are noise
→ No semantic value
→ Dilute actual content signal

Should strip [1-9] before embedding?
→ But then lose traceability
→ Cannot cite sources
```
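Both goals can be met by stripping the brackets from the text that is embedded while keeping the removed citations in metadata. The sketch below assumes numeric bracket citations, including lists and ranges. `split_citations` is a hypothetical name.

```python
import re

# A bracketed list of numbers and ranges, plus any whitespace directly before it.
CITATION = re.compile(r"\s*\[(\d+(?:\s*[-–,]\s*\d+)*)\]")


def split_citations(text: str) -> tuple:
    """Return (text without citation brackets, list of removed citation groups)."""
    citations = CITATION.findall(text)
    cleaned = CITATION.sub("", text)
    return cleaned, citations


cleaned, citations = split_citations(
    "This approach [1,2,3] outperforms baseline [4,5] "
    "with significant improvements [6,7,8,9]."
)
# cleaned   == "This approach outperforms baseline with significant improvements."
# citations == ["1,2,3", "4,5", "6,7,8,9"]
```

The cleaned string goes to the embedding model, and the `citations` list is stored with the chunk so that answers can still cite their sources.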

***

## How to Solve

Combine the following steps:

* **Detect footnote markers** (superscripts, brackets).
* **Match markers to footnote definitions** at the page or section end.
* **Inline short footnotes** (<50 words) directly.
* **Link long footnotes** as metadata.
* **Resolve "Id." and abbreviated citations.**
* **Strip citation brackets from embeddings**, but store them separately.

See [Footnote Handling](/rag-scenarios-and-solutions/chunking/footnotes-lost.md).


