Broken Cross-References

The Problem

Links and references between documents break during ingestion, so the AI cites pages that do not exist or cannot follow related content.

Symptoms

  • ❌ "See section 3.2" appears, but the section is not linked

  • ❌ Hyperlinks become plain text

  • ❌ "Click here" with no actual link

  • ❌ Cross-document references lost

  • ❌ Cannot navigate related content

Real-World Example

Source HTML documentation:
"For authentication details, see <a href="/docs/auth">Authentication Guide</a>"

After ingestion:
"For authentication details, see Authentication Guide"
→ Link lost, just plain text

AI response:
"See Authentication Guide for details"
→ User: "Where is Authentication Guide?"
→ No way to navigate
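
The loss shown above can be avoided by rewriting anchors as Markdown links during extraction instead of discarding the tag. The following is a minimal sketch that uses only Python's standard-library `html.parser`; the names `LinkPreservingExtractor` and `html_to_markdown_text` are hypothetical and not part of any particular ingestion tool, and a real pipeline would also need to handle block structure, images, and malformed markup.

```python
from html.parser import HTMLParser


class LinkPreservingExtractor(HTMLParser):
    """Extract text from HTML, rewriting <a href="..."> as Markdown links."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []     # output fragments in document order
        self._href = None   # href of the <a> currently open, if any
        self._anchor = []   # text collected inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # Anchors without an href are treated as plain text.
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        if self._href is not None:
            self._anchor.append(data)
        else:
            self.parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._anchor).strip()
            self.parts.append(f"[{text}]({self._href})")
            self._href = None


def html_to_markdown_text(html):
    """Return the text of `html` with hyperlinks kept as [text](href)."""
    parser = LinkPreservingExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


source = 'For authentication details, see <a href="/docs/auth">Authentication Guide</a>'
print(html_to_markdown_text(source))
# For authentication details, see [Authentication Guide](/docs/auth)
```

With the link target kept inline, the chunk still carries `/docs/auth`, so a citation can point the user somewhere. The URL is still relative at this stage and needs to be resolved against the source page before it is useful outside the original site.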

Deep Technical Analysis

HTML to Text Conversion:

Relative vs Absolute URLs:

Internal Reference Resolution

Section References:

Anchor Links:

Citation Accuracy

"See Also" Links:

Page Numbers:

Markdown Format:

Hyperlink Metadata:


How to Solve

  • Preserve links during extraction (convert them to Markdown or store them as metadata)

  • Resolve relative URLs to absolute URLs

  • Extract hyperlink metadata and store it with each chunk

  • Implement a document graph of cross-references

  • Map PDF page numbers to chunk IDs

  • Include source URLs in AI citations

  • Test link integrity after ingestion

See Link Management.
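
Three of the steps above (resolving relative URLs, storing hyperlink metadata with chunks, and testing link integrity after ingestion) can be sketched together. This is an illustrative sketch under stated assumptions, not a reference implementation: the `Chunk` record, `build_chunk`, and `find_broken_links` are hypothetical names, each page is treated as a single chunk, and the `example.com` URLs are placeholders. Only the standard library is used (`html.parser`, `urllib.parse.urljoin`, `urllib.parse.urldefrag`).

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus the links found in it."""
    source_url: str
    text: str
    links: list = field(default_factory=list)  # dicts with "text" and "url"


class _LinkCollector(HTMLParser):
    """Collect plain text and absolute link targets from one HTML page."""

    def __init__(self, base_url):
        super().__init__(convert_charrefs=True)
        self.base_url = base_url
        self.text_parts = []
        self.links = []
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative hrefs against the page they came from.
                self._href = urljoin(self.base_url, href)
                self._anchor = []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            anchor_text = "".join(self._anchor).strip()
            self.links.append({"text": anchor_text, "url": self._href})
            self._href = None


def build_chunk(source_url, html):
    """Build one chunk per page, keeping link metadata next to the text."""
    collector = _LinkCollector(source_url)
    collector.feed(html)
    collector.close()
    return Chunk(source_url, "".join(collector.text_parts), collector.links)


def find_broken_links(chunks):
    """Return (source_url, target_url) pairs whose target was never ingested.

    Assumption: every link is internal. External URLs would also be
    reported here and need separate handling in a real pipeline.
    """
    known = {c.source_url for c in chunks}
    broken = []
    for c in chunks:
        for link in c.links:
            target, _fragment = urldefrag(link["url"])  # ignore #anchors
            if target not in known:
                broken.append((c.source_url, link["url"]))
    return broken


chunks = [
    build_chunk(
        "https://example.com/docs/intro",
        'See <a href="/docs/auth">Authentication Guide</a> and '
        '<a href="setup#install">Setup</a>.',
    ),
    build_chunk("https://example.com/docs/auth", "<p>Authentication Guide</p>"),
]
print(chunks[0].links)
print(find_broken_links(chunks))
```

In this example the link to `/docs/auth` resolves to an ingested page, while `setup#install` resolves to `https://example.com/docs/setup#install`, which was never ingested and is reported. The same `links` list is the edge data a document graph would be built from, and the absolute URLs are what a citation would surface to the user.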
