Entity Resolution Errors

The Problem

The same real-world entity is referenced inconsistently (different names, IDs, spellings), which fragments information and breaks connections across documents.

Symptoms

❌ "John Smith" and "J. Smith" treated as different people
❌ Cannot find all docs mentioning same entity
❌ Relationships broken by naming variations
❌ Duplicate entity entries
❌ Cross-reference failures

Real-World Example

Knowledge base references:
→ Doc A: "Project Phoenix led by Alice Johnson"
→ Doc B: "A. Johnson manages infrastructure"
→ Doc C: "Contact [email protected] for access"
→ Doc D: "User ID 12345 owns this repository"

All refer to SAME person (Alice Johnson, user_12345)
But treated as 4 different entities:
→ "Alice Johnson"
→ "A. Johnson"
→ "[email protected]"
→ "User ID 12345"

Query: "What projects does Alice Johnson lead?"
→ Only finds Doc A (exact match "Alice Johnson")
→ Misses B, C, D (different representations)
→ Incomplete answer

Deep Technical Analysis

Entity Variation Types

Name Variations:

Same person:
→ Full: "Robert James Smith"
→ Common: "Bob Smith"
→ Formal: "R. J. Smith"
→ Email: "robert.smith"
→ Nickname: "Bobby"

Without resolution:
→ 5 separate entities in knowledge base
→ Information fragmented

Organizational Variations:

Same company:
→ "International Business Machines"
→ "IBM"
→ "IBM Corporation"
→ "IBM Corp."

Same product:
→ "Microsoft Office 365"
→ "Office 365"
→ "O365"
→ "M365"

Canonical Entity Mapping

Entity ID Assignment:

Create canonical identifiers:
→ Person: user_id (from auth system)
→ Company: domain or LEI
→ Product: SKU or product_id

Map all variations:
{
  "Alice Johnson": "user_12345",
  "A. Johnson": "user_12345",
  "[email protected]": "user_12345"
}

All references point to same canonical ID
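The mapping above can be sketched as a small lookup table in Python. This is a minimal illustration, not a prescribed implementation: the `normalize` and `resolve` helpers and the normalization rules (lowercasing, dropping periods, collapsing whitespace) are assumptions added here, and the "User ID 12345" alias is taken from the earlier example.

```python
import re
from typing import Optional


def normalize(mention: str) -> str:
    """Lowercase, drop periods, and collapse whitespace.

    Assumed normalization rules; dropping periods would be wrong for
    email-style aliases, which need their own handling.
    """
    cleaned = mention.lower().replace(".", "")
    return re.sub(r"\s+", " ", cleaned).strip()


# Alias table keyed by normalized surface form -> canonical ID.
ALIASES = {
    normalize(alias): canonical_id
    for alias, canonical_id in {
        "Alice Johnson": "user_12345",
        "A. Johnson": "user_12345",
        "User ID 12345": "user_12345",
    }.items()
}


def resolve(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))
```

Normalizing both the stored aliases and the incoming mention means small differences in case or punctuation still hit the same key; anything not in the table returns `None` and can be routed to fuzzy matching or manual review.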

Entity Linking:

During ingestion:
1. Extract entities from text (NER - Named Entity Recognition)
2. Resolve to canonical ID
3. Tag chunk with canonical entities

Chunk metadata:
{
  "text": "Project Phoenix led by Alice Johnson",
  "entities": [
    {"name": "Alice Johnson", "canonical_id": "user_12345", "type": "person"},
    {"name": "Project Phoenix", "canonical_id": "project_789", "type": "project"}
  ]
}

Retrieval by entity:
WHERE entities CONTAINS "user_12345"
→ Finds all chunks mentioning Alice (any variation)
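The ingestion and retrieval steps above might look roughly like this in Python. It is a sketch under stated assumptions: plain substring search over a known alias table stands in for a real NER step, an in-memory list stands in for the vector store's metadata filter, and the names `Entity`, `Chunk`, `tag_chunk`, and `find_by_entity` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str          # surface form as it appears in the text
    canonical_id: str  # resolved canonical identifier
    type: str


@dataclass
class Chunk:
    text: str
    entities: list = field(default_factory=list)


# Hypothetical alias table: surface form -> (canonical_id, type).
ALIASES = {
    "Alice Johnson": ("user_12345", "person"),
    "A. Johnson": ("user_12345", "person"),
    "User ID 12345": ("user_12345", "person"),
    "Project Phoenix": ("project_789", "project"),
}


def tag_chunk(text: str) -> Chunk:
    """Attach canonical entities to a chunk (substring search stands in for NER)."""
    entities = [
        Entity(alias, canonical_id, entity_type)
        for alias, (canonical_id, entity_type) in ALIASES.items()
        if alias in text
    ]
    return Chunk(text, entities)


def find_by_entity(chunks, canonical_id: str):
    """Return every chunk tagged with the given canonical ID."""
    return [
        chunk for chunk in chunks
        if any(e.canonical_id == canonical_id for e in chunk.entities)
    ]


docs = [
    "Project Phoenix led by Alice Johnson",
    "A. Johnson manages infrastructure",
    "User ID 12345 owns this repository",
]
index = [tag_chunk(doc) for doc in docs]
hits = find_by_entity(index, "user_12345")  # matches all three documents
```

Because the filter runs on canonical IDs rather than surface strings, the query no longer depends on which spelling each document happened to use.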

Fuzzy Matching

String Similarity:

Detect likely matches:
→ "Alice Johnson" vs "Alyce Jonson" (typo)
→ Levenshtein distance: 2 edits
→ Likely same entity (needs verification)

→ "Bob Smith" vs "Robert Smith"
→ Little string similarity between the first names (only "Smith" matches)
→ But: Bob = common nickname for Robert
→ Requires nickname dictionary
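Both cases above can be illustrated with a short Python sketch: a standard dynamic-programming Levenshtein distance for the typo case, and a nickname lookup for the "Bob"/"Robert" case. The `NICKNAMES` table and the `canonical_name` helper are hypothetical; a real nickname dictionary would be far larger.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute (or keep if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical nickname dictionary: nickname -> formal first name.
NICKNAMES = {"bob": "robert", "bobby": "robert", "rob": "robert"}


def canonical_name(name: str) -> str:
    """Lowercase the name and expand a known nickname in the first token."""
    parts = name.lower().split()
    if parts:
        parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


typo_distance = levenshtein("Alice Johnson", "Alyce Jonson")  # 2 edits
same_person = canonical_name("Bob Smith") == canonical_name("Robert Smith")
```

Expanding nicknames before comparing strings lets the two techniques work together: the dictionary handles variations that share no spelling, and edit distance handles typos in what remains.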

Probabilistic Matching:

Dedupe library (Python):
→ Learns how to weigh field similarities from labeled example pairs
→ Assigns a match probability to each candidate pair, e.g., 85% same entity
→ Threshold, e.g., >80% = match

Automates entity resolution
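To make the score-and-threshold idea concrete, here is a minimal Python sketch. It does not use the Dedupe library's API; it swaps in the standard library's `difflib.SequenceMatcher` with hand-picked field weights, so the result is a similarity score rather than a learned, calibrated probability. The record fields, the weights, and the function names are assumptions for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.80  # mirrors the ">80% = match" rule above

# Hypothetical field weights; a trained model would learn these from labels.
WEIGHTS = {"name": 0.7, "team": 0.3}


def field_similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities."""
    total = sum(WEIGHTS.values())
    weighted = sum(
        weight * field_similarity(rec_a[name], rec_b[name])
        for name, weight in WEIGHTS.items()
    )
    return weighted / total


def is_match(rec_a: dict, rec_b: dict) -> bool:
    return match_score(rec_a, rec_b) > THRESHOLD


alice = {"name": "Alice Johnson", "team": "Infrastructure"}
alyce = {"name": "Alyce Jonson", "team": "Infrastructure"}
bob = {"name": "Bob Smith", "team": "Finance"}
```

Here `is_match(alice, alyce)` clears the threshold while `is_match(alice, bob)` does not. In practice the threshold trades precision against recall and should be tuned on labeled pairs.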

Named Entity Recognition (NER)

Entity Extraction:

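spaCy, Stanford NER:
→ Identify standard types such as PERSON, ORG, PRODUCT, LOCATION (exact label names vary by model)
→ "Alice Johnson manages Project Phoenix"
  - PERSON: Alice Johnson
  - PROJECT: Project Phoenix (a custom label; stock models need added rules or training to emit it)

Extracted mentions are the input for linking entities across documents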
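The extraction step can be sketched without any NER dependency by using a gazetteer (a list of known names per label). This is a deliberately simple stand-in for a trained model such as spaCy's, not a reproduction of its API; the `GAZETTEER` contents and the `extract_entities` function are hypothetical.

```python
import re

# Hypothetical gazetteer: label -> known surface forms.
GAZETTEER = {
    "PERSON": ["Alice Johnson", "Bob Smith"],
    "PROJECT": ["Project Phoenix"],
}


def extract_entities(text: str):
    """Return (mention, label) pairs in order of appearance."""
    found = []
    for label, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text):
                found.append((match.start(), match.group(), label))
    # Sort by character offset so output follows reading order.
    return [(mention, label) for _, mention, label in sorted(found)]


entities = extract_entities("Alice Johnson manages Project Phoenix")
```

A gazetteer only finds names it already knows, which is why trained models are preferred for open-ended text; it is still a common way to add domain-specific labels such as PROJECT.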

Entity Disambiguation:

Challenge: Same name, different entities
→ "Apple" (company) vs "apple" (fruit)
→ "Paris" (city) vs "Paris Hilton" (person)

Context-based disambiguation:
→ "Apple released iPhone" → ORG
→ "I ate an apple" → FOOD

Requires context analysis
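One very simple form of context analysis is counting cue words that appear near the ambiguous mention. The sketch below assumes a hand-written cue list per sense (`SENSES`) and a hypothetical `disambiguate` function; production systems typically rely on a trained model or embeddings instead.

```python
import re
from typing import Optional

# Hypothetical cue words for each sense of an ambiguous mention.
SENSES = {
    "apple": {
        "ORG": {"released", "iphone", "shares", "ceo", "announced"},
        "FOOD": {"ate", "eat", "pie", "juice", "fruit"},
    },
}


def disambiguate(mention: str, sentence: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the sentence."""
    senses = SENSES.get(mention.lower())
    if not senses:
        return None  # mention is not in the ambiguity table
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    scores = {label: len(cues & tokens) for label, cues in senses.items()}
    best = max(scores, key=scores.get)  # ties go to the first-listed sense
    return best if scores[best] > 0 else None
```

With these cue lists, `disambiguate("Apple", "Apple released iPhone")` returns `"ORG"` and `disambiguate("apple", "I ate an apple")` returns `"FOOD"`; a sentence with no cue words returns `None` rather than guessing.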

Co-Reference Resolution

Pronoun Resolution:

Text: "Alice manages the project. She reports to Bob."
→ "She" = "Alice"

Co-reference resolution:
→ Replace pronouns with entities
→ "Alice manages the project. Alice reports to Bob."

Clearer entity relationships
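The pronoun replacement above can be approximated with a naive heuristic: substitute each subject pronoun with the most recently mentioned person who uses that pronoun. This is only a sketch; the `PEOPLE` table and `resolve_pronouns` function are hypothetical, and it handles only "she" and "he", not object or possessive forms. Real co-reference resolution uses trained models.

```python
import re

# Hypothetical table: known first name -> subject pronoun.
PEOPLE = {"Alice": "she", "Bob": "he"}


def resolve_pronouns(text: str) -> str:
    """Replace 'she'/'he' with the latest preceding name using that pronoun."""
    last_seen = {}  # pronoun -> most recent matching name
    output = []
    # Split into alternating word and non-word runs so joining restores the text.
    for token in re.findall(r"\w+|\W+", text):
        if token in PEOPLE:
            last_seen[PEOPLE[token]] = token
            output.append(token)
        elif token.lower() in last_seen:
            output.append(last_seen[token.lower()])
        else:
            output.append(token)
    return "".join(output)


resolved = resolve_pronouns("Alice manages the project. She reports to Bob.")
# resolved == "Alice manages the project. Alice reports to Bob."
```

Running resolution before chunking matters because a chunk boundary can separate a pronoun from the name it refers to, leaving the chunk with no explicit entity to tag.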

How to Solve

1. Implement NER (spaCy, Stanford NER) to extract entities
2. Create canonical entity IDs (user_id, product_id)
3. Build an entity mapping table (all variations → canonical)
4. Use fuzzy string matching (Dedupe library) for likely matches
5. Tag chunks with canonical entity IDs in metadata
6. Enable entity-based retrieval (find all chunks mentioning user_12345)
7. Implement co-reference resolution for pronouns

See Entity Resolution.
