# Duplicate Content in Vector DB

## The Problem

The same or highly similar content is embedded multiple times, which wastes storage, increases costs, and causes repetitive or confused AI responses.

### Symptoms

* ❌ Same answer repeated multiple times
* ❌ Storage costs higher than expected
* ❌ Retrieval returns near-identical chunks
* ❌ AI cites same source 3+ times
* ❌ Multiple versions of same doc embedded

### Real-World Example

```
Knowledge base contains:
→ "FAQ v1.pdf" (ingested January)
→ "FAQ v2.pdf" (ingested March, 90% overlap with v1)
→ "FAQ-copy.pdf" (duplicate file, different name)

Query: "How to reset password?"

Retrieved chunks:
→ Chunk A (FAQ v1): "Reset password: Click forgot password..."
→ Chunk B (FAQ v2): "Reset password: Click forgot password..." (identical)
→ Chunk C (FAQ-copy): "Reset password: Click forgot password..." (identical)

AI response cites all three (redundant)
Storage: 3x cost for same content
```

***

## Deep Technical Analysis

### Sources of Duplication

**Document Re-Ingestion:**

```
Common scenario:
→ Ingest doc_v1.pdf
→ Update to doc_v2.pdf
→ Re-ingest without deleting v1
→ Both versions coexist

Result: Duplicate + outdated data
```

**Cross-Source Duplication:**

```
Same content in multiple places:
→ Help Center article
→ Internal wiki (copy/paste of article)
→ PDF export of article

All ingested → 3x duplicate
```

**Chunking Overlap:**

```
Sliding window chunking:
→ Chunk 1: Tokens 0-500 (with 10% overlap)
→ Chunk 2: Tokens 450-950
→ Overlap: Tokens 450-500 duplicated

Some overlap intentional (context preservation)
Too much overlap = duplication
```
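The sliding-window example above can be sketched as follows. This is a minimal illustration, not a specific chunker's implementation; the function names and the default `chunk_size` and `overlap` values are assumptions chosen to match the numbers in the example.

```python
def sliding_window_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows that share `overlap` tokens with the next one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


def duplicated_fraction(n_tokens, chunk_size=500, overlap=50):
    """Fraction of stored tokens that also appear in another chunk."""
    chunks = sliding_window_chunks(list(range(n_tokens)), chunk_size, overlap)
    stored = sum(len(chunk) for chunk in chunks)
    return (stored - n_tokens) / stored


chunks = sliding_window_chunks(list(range(950)))
shared = set(chunks[0]) & set(chunks[1])  # tokens stored twice
```

With 950 input tokens this produces two chunks, and the second starts at token 450. Raising `overlap` increases `duplicated_fraction`, which is one way to quantify when overlap stops being context preservation and becomes duplication.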

### Detection Strategies

**Exact Duplicate Detection:**

```
Hash-based:
→ Hash each chunk text (MD5, SHA-256)
→ Store hash
→ Before inserting, check if hash exists

Fast, catches exact duplicates
Misses: Paraphrases, minor edits
```
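A minimal sketch of the hash-based check, using Python's standard `hashlib`. The `ExactDuplicateFilter` class and the whitespace/case normalization step are assumptions added for illustration; in practice the hash set would live in a database or in vector metadata, not in memory.

```python
import hashlib
import re


def chunk_hash(text):
    """SHA-256 of the chunk text after light normalization (an assumption)."""
    # Collapse whitespace and lowercase so trivial formatting differences
    # do not produce different hashes.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ExactDuplicateFilter:
    """Hypothetical in-memory store of hashes seen so far."""

    def __init__(self):
        self._seen = set()

    def is_new(self, text):
        """Return True and remember the hash if this chunk has not been seen."""
        digest = chunk_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


dedup = ExactDuplicateFilter()
first = dedup.is_new("Reset password: Click forgot password...")
second = dedup.is_new("Reset password:  click forgot password...\n")
```

Here `first` is `True` and `second` is `False`, because the two texts differ only in case and whitespace. A chunk with any other wording change gets a different hash and passes the check.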

**Semantic Duplicate Detection:**

```
Cosine similarity between embeddings:
→ If similarity > 0.95 (very high)
→ Likely duplicate/near-duplicate

Example:
→ "Reset your password" vs "Reset password"
→ Different text, same meaning
→ Embeddings very similar
→ Flag as duplicate
```
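The similarity check can be sketched in plain Python as below. The three-dimensional vectors are made-up toy values, not real embeddings, and the function names are assumptions; real embeddings would come from whatever embedding model the pipeline uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def is_semantic_duplicate(a, b, threshold=0.95):
    """Flag a pair whose similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold


# Toy vectors standing in for embeddings (illustrative values only).
reset_your_password = [0.90, 0.10, 0.40]
reset_password = [0.88, 0.12, 0.41]
billing_question = [0.10, 0.90, 0.20]

near_duplicate = is_semantic_duplicate(reset_your_password, reset_password)
unrelated = is_semantic_duplicate(reset_your_password, billing_question)
```

The 0.95 threshold is the value used in this page; the right cut-off depends on the embedding model and should be tuned on a sample of known duplicates.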

**Fuzzy Matching:**

```
Levenshtein distance:
→ Edit distance between texts
→ If distance < 5% of length
→ Near-duplicate

Catches typos and small wording edits
Misses: Paraphrases, reordered sentences
```
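A self-contained sketch of the edit-distance check. The function names and the `max_ratio` parameter are assumptions for illustration; dedicated libraries compute the same distance faster for long texts.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_near_duplicate(a, b, max_ratio=0.05):
    """True if the edit distance is below max_ratio of the longer text's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest < max_ratio


typo = is_near_duplicate(
    "Reset password: Click forgot password on the login page.",
    "Reset password: Click forgot pasword on the login page.",
)
rephrased = is_near_duplicate("Reset your password", "Reset password")
```

The typo pair is flagged, while "Reset your password" vs "Reset password" is not: dropping one word is 5 edits out of 19 characters, well above the 5% limit. Short texts with small wording changes are better handled by the semantic check.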

### Deduplication Strategies

**Pre-Ingestion Dedup:**

```
Before embedding:
1. Hash new chunks
2. Check against existing hashes
3. Skip if duplicate

Prevents ingestion entirely
Most efficient
```
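The three steps above can be sketched as an ingestion function that calls the embedding model only for unseen chunks. The `ingest` function, the `known_hashes` set, and the `embed` callable are hypothetical names; `fake_embed` is a stub that stands in for a real embedding API call.

```python
import hashlib


def ingest(chunks, known_hashes, embed):
    """Embed only chunks whose hash is not already known.

    `known_hashes` is updated in place; `embed` is any callable mapping
    text to a vector (an assumption standing in for a real model).
    """
    records = []
    for text in chunks:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in known_hashes:
            continue  # duplicate: skip before paying for an embedding
        known_hashes.add(digest)
        records.append({"hash": digest, "text": text, "vector": embed(text)})
    return records


embed_calls = []


def fake_embed(text):
    """Stub embedding that records each call."""
    embed_calls.append(text)
    return [float(len(text))]


known = set()
batch = [
    "Reset password: Click forgot password...",
    "Reset password: Click forgot password...",
    "Contact support via the help widget.",
]
stored = ingest(batch, known, fake_embed)
```

After this run, `stored` holds two records and `fake_embed` was called twice, so the duplicate costs neither an embedding call nor a stored vector.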

**Post-Ingestion Dedup:**

```
After ingestion:
1. Compute pairwise similarities
2. Identify duplicates (similarity > 0.95)
3. Delete lower-priority duplicates

Use when:
→ Cleanup needed
→ Legacy data has duplicates
```
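A rough sketch of the cleanup pass, assuming each stored record carries an `id`, a `vector`, and a numeric `priority` (all hypothetical field names). It compares every record against the ones already kept, which is quadratic in the worst case; for large collections, the vector database's own nearest-neighbor search is a more practical way to find candidate pairs.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def find_duplicates_to_delete(records, threshold=0.95):
    """Return ids of records that duplicate a higher-priority record."""
    # Visit higher-priority records first so they are the copies that survive.
    ordered = sorted(records, key=lambda r: r["priority"], reverse=True)
    kept, to_delete = [], []
    for record in ordered:
        if any(_cosine(record["vector"], k["vector"]) > threshold for k in kept):
            to_delete.append(record["id"])
        else:
            kept.append(record)
    return to_delete


# Toy vectors and priorities (illustrative values only).
records = [
    {"id": "faq_v1_chunk", "priority": 1, "vector": [0.90, 0.10, 0.40]},
    {"id": "faq_v2_chunk", "priority": 2, "vector": [0.88, 0.12, 0.41]},
    {"id": "billing_chunk", "priority": 1, "vector": [0.10, 0.90, 0.20]},
]
delete_ids = find_duplicates_to_delete(records)
```

How `priority` is assigned is a design choice: document version, ingestion date, or source trust level are all plausible options.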

**Version-Aware Ingestion:**

```
Track document versions:
{
  document_id: "FAQ",
  version: 2,
  chunks: [...]
}

On re-ingest:
→ Delete chunks where document_id="FAQ" AND version < 2
→ Add new version

Automatic cleanup
```
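The re-ingest rule above can be sketched with a small in-memory store. `InMemoryChunkStore` and `ingest_version` are hypothetical names; a real vector database would express the delete step as a metadata filter on `document_id` and `version`.

```python
class InMemoryChunkStore:
    """Toy stand-in for a vector store that keeps version metadata per chunk."""

    def __init__(self):
        self.chunks = []  # each item: {"document_id", "version", "text"}

    def ingest_version(self, document_id, version, texts):
        """Delete older versions of the document, then add the new chunks."""
        self.chunks = [
            c for c in self.chunks
            if not (c["document_id"] == document_id and c["version"] < version)
        ]
        self.chunks.extend(
            {"document_id": document_id, "version": version, "text": t}
            for t in texts
        )

    def versions_of(self, document_id):
        """Set of versions currently stored for a document."""
        return {c["version"] for c in self.chunks if c["document_id"] == document_id}


store = InMemoryChunkStore()
store.ingest_version("FAQ", 1, ["Reset password v1", "Billing v1"])
store.ingest_version("Guide", 1, ["Getting started"])
store.ingest_version("FAQ", 2, ["Reset password v2", "Billing v2", "SSO v2"])
```

After the last call, only version 2 of "FAQ" remains and "Guide" is untouched. Note that this sketch follows the `version < 2` rule literally, so re-ingesting the same version twice would still create duplicates unless the hash check is also applied.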

### Storage Impact

**Cost Calculation:**

```
10,000 unique chunks
20% duplication rate → 2,000 duplicates

Storage:
→ 12,000 vectors vs 10,000 (20% extra cost)

Embedding cost:
→ 2,000 duplicate embeddings generated
→ Wasted API calls

Retrieval:
→ More data to search → slightly slower
```
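The arithmetic above, written out as a small helper. It assumes the duplication rate is measured relative to the number of unique chunks, which is how the example's numbers work out; `duplication_overhead` is a hypothetical name.

```python
def duplication_overhead(unique_chunks, duplication_rate):
    """Extra vectors stored when duplicates are ingested alongside unique chunks.

    Assumes `duplication_rate` is expressed relative to `unique_chunks`.
    """
    duplicates = round(unique_chunks * duplication_rate)
    return {
        "duplicates": duplicates,  # also the number of wasted embedding calls
        "total_vectors": unique_chunks + duplicates,
        "extra_storage_fraction": duplicates / unique_chunks,
    }


overhead = duplication_overhead(10_000, 0.20)
```

For the example, `overhead["duplicates"]` is 2,000 and `overhead["total_vectors"]` is 12,000. Multiplying `duplicates` by the per-embedding price and `total_vectors` by the per-vector storage price gives the cost impact for a specific provider.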

***

## How to Solve

Combine the following measures:

* Implement hash-based exact duplicate detection at ingestion.
* Run semantic similarity deduplication (cosine > 0.95) periodically.
* Use version-aware ingestion (delete old versions on update).
* Track `document_id` and `version` metadata.
* Prefer a single source of truth (don't ingest the same content from multiple sources).
* Monitor a duplicate-rate metric.

See [Duplicate Management](/rag-scenarios-and-solutions/data-quality/duplicate-content.md).


