# Entity Resolution Errors

## The Problem

The same real-world entity is referenced inconsistently (different names, IDs, or spellings), which fragments information about it and breaks connections across documents.

### Symptoms

* ❌ "John Smith" and "J. Smith" treated as different people
* ❌ Cannot find all docs mentioning same entity
* ❌ Relationships broken by naming variations
* ❌ Duplicate entity entries
* ❌ Cross-reference failures

### Real-World Example

```
Knowledge base references:
→ Doc A: "Project Phoenix led by Alice Johnson"
→ Doc B: "A. Johnson manages infrastructure"
→ Doc C: "Contact alice.johnson@company.com for access"
→ Doc D: "User ID 12345 owns this repository"

All refer to SAME person (Alice Johnson, user_12345)
But treated as 4 different entities:
→ "Alice Johnson"
→ "A. Johnson"
→ "alice.johnson@company.com"
→ "User ID 12345"

Query: "What projects does Alice Johnson lead?"
→ Only finds Doc A (exact match "Alice Johnson")
→ Misses B, C, D (different representations)
→ Incomplete answer
```

***

## Deep Technical Analysis

### Entity Variation Types

**Name Variations:**

```
Same person:
→ Full: "Robert James Smith"
→ Common: "Bob Smith"
→ Formal: "R. J. Smith"
→ Email: "robert.smith"
→ Nickname: "Bobby"

Without resolution:
→ 5 separate entities in knowledge base
→ Information fragmented
```

**Organizational Variations:**

```
Same company:
→ "International Business Machines"
→ "IBM"
→ "IBM Corporation"
→ "IBM Corp."

Same product:
→ "Microsoft Office 365"
→ "Office 365"
→ "O365"
→ "M365"
```

### Canonical Entity Mapping

**Entity ID Assignment:**

```
Create canonical identifiers:
→ Person: user_id (from auth system)
→ Company: domain or LEI
→ Product: SKU or product_id

Map all variations:
{
  "Alice Johnson": "user_12345",
  "A. Johnson": "user_12345",
  "alice.johnson@company.com": "user_12345"
}

All references point to same canonical ID
```
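The mapping above can be sketched as a normalized lookup table. This is a minimal illustration, not a prescribed implementation: the `ALIASES` table, `normalize`, and `resolve` are hypothetical names, and the normalization (lowercasing, collapsing whitespace) is an assumption about which trivial variants should share a key.

```python
from typing import Optional

# Hypothetical alias table: every known surface form maps to one canonical ID.
# Keys are stored already normalized (lowercase, single spaces).
ALIASES = {
    "alice johnson": "user_12345",
    "a. johnson": "user_12345",
    "alice.johnson@company.com": "user_12345",
}


def normalize(mention: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share one key."""
    return " ".join(mention.lower().split())


def resolve(mention: str) -> Optional[str]:
    """Return the canonical ID for a mention, or None if it is unknown."""
    return ALIASES.get(normalize(mention))
```

Returning `None` for unknown mentions, rather than guessing, leaves room for a separate fuzzy-matching or review step to handle them.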

**Entity Linking:**

```
During ingestion:
1. Extract entities from text (NER - Named Entity Recognition)
2. Resolve to canonical ID
3. Tag chunk with canonical entities

Chunk metadata:
{
  text: "Project Phoenix led by Alice Johnson",
  entities: [
    {name: "Alice Johnson", canonical_id: "user_12345", type: "person"},
    {name: "Project Phoenix", canonical_id: "project_789", type: "project"}
  ]
}

Retrieval by entity:
WHERE entities CONTAINS "user_12345"
→ Finds all chunks mentioning Alice (any variation)
```
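One way to support the `CONTAINS` lookup above is an inverted index from canonical ID to chunk IDs, built at ingestion time. The sketch below assumes chunks are plain dictionaries shaped like the metadata shown; the chunk IDs and the `build_entity_index` helper are hypothetical, and a real vector store would typically expose its own metadata filter instead.

```python
from collections import defaultdict

# Hypothetical chunks, already tagged with canonical entities during ingestion.
chunks = [
    {
        "id": "doc_a_1",
        "text": "Project Phoenix led by Alice Johnson",
        "entities": [
            {"name": "Alice Johnson", "canonical_id": "user_12345", "type": "person"},
            {"name": "Project Phoenix", "canonical_id": "project_789", "type": "project"},
        ],
    },
    {
        "id": "doc_b_1",
        "text": "A. Johnson manages infrastructure",
        "entities": [
            {"name": "A. Johnson", "canonical_id": "user_12345", "type": "person"},
        ],
    },
]


def build_entity_index(chunks):
    """Map each canonical ID to the IDs of the chunks that mention it."""
    index = defaultdict(list)
    for chunk in chunks:
        # A set avoids listing a chunk twice when it mentions an entity twice.
        for canonical_id in {e["canonical_id"] for e in chunk["entities"]}:
            index[canonical_id].append(chunk["id"])
    return index


index = build_entity_index(chunks)
```

Here `index["user_12345"]` lists both chunks even though one says "Alice Johnson" and the other "A. Johnson".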

### Fuzzy Matching

**String Similarity:**

```
Detect likely matches:
→ "Alice Johnson" vs "Alyce Jonson" (typo)
→ Levenshtein distance: 2 edits
→ Likely same entity (needs verification)

→ "Bob Smith" vs "Robert Smith"
→ Little string similarity between the first names
→ But: Bob = common nickname for Robert
→ Requires nickname dictionary
```
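Both cases above can be sketched with the standard library alone. The edit-distance function is the textbook dynamic-programming algorithm; the `NICKNAMES` table and `expand_nickname` helper are hypothetical and deliberately tiny, standing in for a real nickname dictionary.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep when equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical nickname dictionary; a real one would be far larger.
NICKNAMES = {"bob": "robert", "bobby": "robert"}


def expand_nickname(full_name: str) -> str:
    """Lowercase the name and replace a known nickname in the first token."""
    first, _, rest = full_name.lower().partition(" ")
    return f"{NICKNAMES.get(first, first)} {rest}".strip()


print(levenshtein("Alice Johnson", "Alyce Jonson"))  # 2
```

Expanding nicknames before comparing means "Bob Smith" and "Robert Smith" reduce to the same string, so no fuzzy threshold is needed for that pair.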

**Probabilistic Matching:**

```
Dedupe library (Python):
→ Fuzzy matching trained on labeled example pairs
→ Assigns a match probability to each candidate pair (e.g. 85% same entity)
→ Threshold: >80% = match

Automated entity resolution
```
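The thresholding step can be illustrated without Dedupe itself. The sketch below swaps in `difflib.SequenceMatcher` from the standard library as a stand-in scorer: its ratio is a string-similarity score, not a learned or calibrated probability, and the `0.80` cutoff simply mirrors the threshold above. The function names are hypothetical.

```python
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.80  # mirrors the ">80% = match" rule; tune on real data


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; a stand-in for a learned match probability."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(a: str, b: str, threshold: float = MATCH_THRESHOLD) -> bool:
    """Treat a pair as the same entity when the score exceeds the threshold."""
    return similarity(a, b) > threshold


candidates = [
    ("Alice Johnson", "Alyce Jonson"),
    ("Alice Johnson", "Robert Smith"),
]
for left, right in candidates:
    print(left, "|", right, "|", is_match(left, right))
```

Pairs that score near the threshold are good candidates for manual review rather than automatic merging.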

### Named Entity Recognition (NER)

**Entity Extraction:**

```
spaCy, Stanford NER:
→ Identifies: PERSON, ORG, PRODUCT, LOCATION
→ "Alice Johnson manages Project Phoenix"
  - PERSON: Alice Johnson
  - PROJECT: Project Phoenix (custom label; needs a trained or rule-based component)

Extracted mentions feed entity linking across documents
```

**Entity Disambiguation:**

```
Challenge: Same name, different entities
→ "Apple" (company) vs "apple" (fruit)
→ "Paris" (city) vs "Paris Hilton" (person)

Context-based disambiguation:
→ "Apple released iPhone" → ORG
→ "I ate an apple" → FOOD

Requires context analysis
```
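Context-based disambiguation can be approximated by counting cue words near the mention. This is a deliberately simple sketch: the `SENSES` table, its cue words, and the `disambiguate` function are all hypothetical, and production systems usually rely on a trained model rather than hand-written keyword lists.

```python
import re
from typing import Optional

# Hypothetical cue words per sense of an ambiguous mention.
SENSES = {
    "apple": {
        "ORG": {"iphone", "released", "shares", "ceo"},
        "FOOD": {"ate", "eat", "fruit", "pie"},
    },
}


def disambiguate(mention: str, sentence: str) -> Optional[str]:
    """Pick the sense whose cue words overlap most with the sentence."""
    senses = SENSES.get(mention.lower())
    if not senses:
        return None  # mention is not in the ambiguity table
    tokens = set(re.findall(r"[a-z0-9]+", sentence.lower()))
    scores = {label: len(cues & tokens) for label, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no supporting context, decline to guess.
    return best if scores[best] > 0 else None


print(disambiguate("Apple", "Apple released iPhone"))  # ORG
print(disambiguate("apple", "I ate an apple"))         # FOOD
```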

### Co-Reference Resolution

**Pronoun Resolution:**

```
Text: "Alice manages the project. She reports to Bob."
→ "She" = "Alice"

Co-reference resolution:
→ Replace pronouns with entities
→ "Alice manages the project. Alice reports to Bob."

Clearer entity relationships
```

***

## How to Solve

1. **Implement NER** (spaCy, Stanford NER) to extract entities.
2. **Create canonical entity IDs** (user\_id, product\_id).
3. **Build an entity mapping table** (all variations → canonical).
4. **Use fuzzy string matching** (Dedupe library) for likely matches.
5. **Tag chunks with canonical entity IDs** in metadata.
6. **Enable entity-based retrieval** (find all chunks mentioning user\_12345).
7. **Implement co-reference resolution** for pronouns.

See [Entity Resolution](/rag-scenarios-and-solutions/data-quality/entity-resolution.md).

