Entity Resolution Errors

The Problem

The same real-world entity is referenced inconsistently (different names, IDs, or spellings), which fragments information about it and breaks connections across documents.

Symptoms

  • ❌ "John Smith" and "J. Smith" treated as different people

  • ❌ Cannot find all documents mentioning the same entity

  • ❌ Relationships broken by naming variations

  • ❌ Duplicate entity entries

  • ❌ Cross-reference failures

Real-World Example

Knowledge base references:
→ Doc A: "Project Phoenix led by Alice Johnson"
→ Doc B: "A. Johnson manages infrastructure"
→ Doc C: "Contact [email protected] for access"
→ Doc D: "User ID 12345 owns this repository"

All refer to SAME person (Alice Johnson, user_12345)
But treated as 4 different entities:
→ "Alice Johnson"
→ "A. Johnson"
→ "[email protected]"
→ "User ID 12345"

Query: "What projects does Alice Johnson lead?"
→ Only finds Doc A (exact match "Alice Johnson")
→ Misses B, C, D (different representations)
→ Incomplete answer

Deep Technical Analysis

Entity Variation Types

Name Variations:
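
As an illustration of how name variants such as "Alice Johnson" and "A. Johnson" can be brought to a comparable form, here is a minimal sketch. The function name `name_key` and the "first initial + last name" rule are assumptions made for this example, not something this page prescribes. The key is deliberately coarse: it is meant to propose candidate matches, not to prove that two names are the same person.

```python
import re
import unicodedata


def name_key(name: str) -> str:
    """Reduce a personal name to 'first initial + last name', lowercased."""
    # Strip accents so that "José" and "Jose" produce the same key.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    )
    # Reorder "Last, First" into "First Last".
    if "," in ascii_name:
        last, first = [part.strip() for part in ascii_name.split(",", 1)]
        ascii_name = f"{first} {last}"
    # Keep alphabetic tokens only; this drops periods after initials.
    tokens = re.findall(r"[a-z]+", ascii_name.lower())
    if not tokens:
        return ""
    if len(tokens) == 1:
        return tokens[0]
    return f"{tokens[0][0]} {tokens[-1]}"


print(name_key("Alice Johnson"))   # a johnson
print(name_key("A. Johnson"))      # a johnson
print(name_key("Johnson, Alice"))  # a johnson
```

Because "Adam Johnson" would produce the same key, records that share a key should still be confirmed by other evidence (email, user ID, context) before being merged.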

Organizational Variations:

Canonical Entity Mapping

Entity ID Assignment:

Entity Linking:
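
A canonical mapping can be as simple as a lookup table from every known variation to one ID. The sketch below uses the mentions from the example above; the `ALIASES` table and the `resolve` helper are illustrative names, and a real system would likely keep this table in a database rather than in code. The email mention from Doc C would be added to the table in the same way.

```python
from typing import Optional

# Every known surface form maps to one canonical entity ID.
# Keys are stored lowercased with single spaces.
ALIASES = {
    "alice johnson": "user_12345",
    "a. johnson": "user_12345",
    "user id 12345": "user_12345",
}


def resolve(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it is unknown."""
    normalized = " ".join(mention.lower().split())
    return ALIASES.get(normalized)


print(resolve("Alice Johnson"))  # user_12345
print(resolve("A.  Johnson"))    # user_12345
print(resolve("Bob Lee"))        # None
```

Unknown mentions return None instead of being guessed, so they can be queued for fuzzy matching or manual review.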

Fuzzy Matching

String Similarity:
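
One way to score string similarity without extra dependencies is `difflib.SequenceMatcher` from the Python standard library. This is a sketch under assumptions: the 0.85 threshold is an arbitrary illustrative value that would need tuning on real data, and dedicated libraries may use other measures (edit distance, Jaro-Winkler).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_same(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag two strings as a likely match; the threshold is an assumption."""
    return similarity(a, b) >= threshold


print(likely_same("John Smith", "Jon Smith"))      # True
print(likely_same("John Smith", "Alice Johnson"))  # False
```

Character-level similarity catches typos and spelling variants well, but an abbreviated form such as "J. Smith" can score below a threshold like this one, which is why it is usually combined with name normalization or an alias table.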

Probabilistic Matching:
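
Probabilistic matching combines evidence from several fields instead of relying on one string comparison. The sketch below follows the general shape of Fellegi-Sunter style scoring, where each field adds a log-likelihood weight depending on whether it agrees. The field names and the m/u probabilities are made-up values for illustration; in practice they would be estimated from labeled or sampled data.

```python
import math

# Assumed per-field probabilities (illustrative, not estimated from data):
#   m = P(field agrees | records refer to the same entity)
#   u = P(field agrees | records refer to different entities)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "first_initial": (0.90, 0.05),
    "department": (0.80, 0.10),
}


def match_weight(rec_a: dict, rec_b: dict) -> float:
    """Sum agreement/disagreement weights over the fields both records have."""
    total = 0.0
    for field, (m, u) in FIELD_PROBS.items():
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement raises the score
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement lowers it
    return total


alice = {"last_name": "johnson", "first_initial": "a", "department": "infra"}
a_johnson = {"last_name": "johnson", "first_initial": "a"}
j_smith = {"last_name": "smith", "first_initial": "j", "department": "sales"}

print(match_weight(alice, a_johnson) > 0)  # True
print(match_weight(alice, j_smith) < 0)    # True
```

Pairs scoring above an upper cutoff can be merged automatically, pairs below a lower cutoff rejected, and the band in between sent for review; where to put those cutoffs is a tuning decision.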

Named Entity Recognition (NER)

Entity Extraction:

Entity Disambiguation:
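
When a mention such as "A. Johnson" could refer to more than one known entity, the surrounding text can help choose between them. The following is a deliberately simple sketch that counts keyword overlap between the context and each candidate's profile. The candidate IDs other than user_12345, the keyword sets, and the function name are hypothetical; production systems typically use embeddings or a trained entity linker instead.

```python
import re
from typing import Dict, Optional, Set

# Hypothetical keyword profiles for two people who both appear as "A. Johnson".
CANDIDATES: Dict[str, Set[str]] = {
    "user_12345": {"infrastructure", "phoenix", "repository", "access"},
    "user_67890": {"marketing", "campaign", "brand"},
}


def disambiguate(context: str, candidates: Dict[str, Set[str]]) -> Optional[str]:
    """Pick the candidate whose keywords overlap most with the context."""
    if not candidates:
        return None
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {eid: len(words & keywords) for eid, keywords in candidates.items()}
    best = max(scores, key=scores.get)
    # With no overlapping keyword at all, decline to choose.
    return best if scores[best] > 0 else None


print(disambiguate("A. Johnson manages infrastructure", CANDIDATES))  # user_12345
print(disambiguate("A. Johnson approved the brand campaign", CANDIDATES))  # user_67890
print(disambiguate("A. Johnson is out today", CANDIDATES))  # None
```

Returning None when there is no supporting evidence avoids attaching a chunk to the wrong entity, which is usually worse than leaving it unlinked.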

Co-Reference Resolution

Pronoun Resolution:
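
To show what pronoun resolution has to produce, here is a toy heuristic that links each pronoun to the most recently mentioned known name. It ignores gender, number, and syntax, so it is only a sketch of the idea; real co-reference resolution relies on trained models. The function name and the pronoun list are assumptions for this example.

```python
import re
from typing import Iterable, List, Tuple

PRONOUNS = {"he", "she", "they", "him", "her", "them"}


def resolve_pronouns(
    sentences: List[str], known_names: Iterable[str]
) -> List[Tuple[int, str, str]]:
    """Return (sentence_index, pronoun, antecedent) links, nearest name first."""
    # Longer names go first so "Alice Johnson" wins over a shorter alias.
    alternatives = [
        re.escape(n) for n in sorted(known_names, key=len, reverse=True)
    ]
    alternatives += sorted(PRONOUNS)
    pattern = re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)

    last_seen = None
    links = []
    for idx, sentence in enumerate(sentences):
        for match in pattern.finditer(sentence):
            token = match.group(1)
            if token.lower() in PRONOUNS:
                if last_seen is not None:
                    links.append((idx, token, last_seen))
            else:
                last_seen = token  # remember the most recent named mention
    return links


text = [
    "Alice Johnson leads Project Phoenix.",
    "She also manages infrastructure.",
]
print(resolve_pronouns(text, ["Alice Johnson"]))
# [(1, 'She', 'Alice Johnson')]
```

Once a pronoun is linked to a name, the name can be resolved to its canonical ID so that the second sentence is also retrievable for that entity.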


How to Solve

Combine the following steps:

  • Implement NER (spaCy, Stanford NER) to extract entities

  • Create canonical entity IDs (user_id, product_id)

  • Build an entity mapping table (all variations → canonical)

  • Use fuzzy string matching (Dedupe library) for likely matches

  • Tag chunks with canonical entity IDs in metadata

  • Enable entity-based retrieval (find all chunks mentioning user_12345)

  • Implement co-reference resolution for pronouns

See Entity Resolution.
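
The tagging and retrieval steps can be sketched as follows, using the documents from the example above. The chunk layout, the `entity_ids` metadata key, and the helper names are assumptions for illustration, and the plain substring lookup stands in for whatever NER and matching pipeline actually produces the mentions.

```python
from collections import defaultdict
from typing import Dict, List

# Mapping table: every known variation points to one canonical ID.
ALIASES = {
    "alice johnson": "user_12345",
    "a. johnson": "user_12345",
    "user id 12345": "user_12345",
}

chunks = [
    {"id": "doc_a", "text": "Project Phoenix led by Alice Johnson"},
    {"id": "doc_b", "text": "A. Johnson manages infrastructure"},
    {"id": "doc_d", "text": "User ID 12345 owns this repository"},
]


def tag_chunk(chunk: dict, aliases: Dict[str, str]) -> dict:
    """Attach the canonical IDs of all aliases found in the chunk text."""
    text = chunk["text"].lower()
    ids = sorted({eid for alias, eid in aliases.items() if alias in text})
    return {**chunk, "metadata": {"entity_ids": ids}}


def build_index(tagged: List[dict]) -> Dict[str, List[str]]:
    """Build an inverted index from canonical entity ID to chunk IDs."""
    index = defaultdict(list)
    for chunk in tagged:
        for eid in chunk["metadata"]["entity_ids"]:
            index[eid].append(chunk["id"])
    return index


tagged = [tag_chunk(c, ALIASES) for c in chunks]
index = build_index(tagged)
print(index["user_12345"])  # ['doc_a', 'doc_b', 'doc_d']
```

A query for "Alice Johnson" is first resolved to user_12345 and then answered from the index, so it reaches the documents that an exact-match search would miss.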
