Duplicate Content in Vector DB

The Problem

The same or highly similar content is embedded multiple times, which wastes storage, increases embedding costs, and causes repetitive or confused AI responses.

Symptoms

❌ Same answer repeated multiple times
❌ Storage costs higher than expected
❌ Retrieval returns near-identical chunks
❌ AI cites same source 3+ times
❌ Multiple versions of same doc embedded

Real-World Example

Knowledge base contains:
→ "FAQ v1.pdf" (ingested January)
→ "FAQ v2.pdf" (ingested March, 90% overlap with v1)
→ "FAQ-copy.pdf" (duplicate file, different name)

Query: "How to reset password?"

Retrieved chunks:
→ Chunk A (FAQ v1): "Reset password: Click forgot password..."
→ Chunk B (FAQ v2): "Reset password: Click forgot password..." (identical)
→ Chunk C (FAQ-copy): "Reset password: Click forgot password..." (identical)

AI response cites all three (redundant)
Storage: 3x cost for same content

Deep Technical Analysis

Sources of Duplication

Document Re-Ingestion:

Common scenario:
→ Ingest doc_v1.pdf
→ Update to doc_v2.pdf
→ Re-ingest without deleting v1
→ Both versions coexist

Result: Duplicate + outdated data

Cross-Source Duplication:

Same content in multiple places:
→ Help Center article
→ Internal wiki (copy/paste of article)
→ PDF export of article

All ingested → 3x duplicate

Chunking Overlap:

Sliding window chunking (500-token chunks, 10% overlap):
→ Chunk 1: Tokens 0-500
→ Chunk 2: Tokens 450-950
→ Overlap: Tokens 450-500 stored in both chunks

Some overlap intentional (context preservation)
Too much overlap = duplication
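
The overlap numbers above can be checked with a small sketch. The function names `sliding_window_chunks` and `duplicated_fraction` are hypothetical, and tokens are represented as plain integers rather than output from a real tokenizer.

```python
def sliding_window_chunks(tokens, chunk_size=500, overlap=50):
    """Split a token list into windows that share `overlap` tokens with the next one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append((start, tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


def duplicated_fraction(chunks, total_tokens):
    """Share of stored tokens that also appear in another chunk."""
    stored = sum(len(chunk) for _, chunk in chunks)
    if stored == 0:
        return 0.0
    return (stored - total_tokens) / stored


tokens = list(range(950))  # stand-in for a tokenized document
chunks = sliding_window_chunks(tokens, chunk_size=500, overlap=50)
print([start for start, _ in chunks])         # [0, 450]
print(duplicated_fraction(chunks, len(tokens)))  # 0.05
```

Raising `overlap` toward `chunk_size` pushes the duplicated fraction toward 1, which is the point where overlap stops preserving context and starts behaving like duplication.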

Detection Strategies

Exact Duplicate Detection:

Hash-based:
→ Hash each chunk text (MD5, SHA-256)
→ Store hash
→ Before inserting, check if hash exists

Fast, catches exact duplicates
Misses: Paraphrases, minor edits
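
A minimal sketch of the hash check, using only the standard library. `HashIndex` is a hypothetical in-memory stand-in for wherever the hashes are stored, and the whitespace and case normalization is an added assumption (the steps above hash the chunk text as-is).

```python
import hashlib


def normalize(text):
    """Assumption: collapse whitespace and lowercase so formatting noise hashes the same."""
    return " ".join(text.split()).lower()


def chunk_hash(text):
    """SHA-256 hex digest of the normalized chunk text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


class HashIndex:
    """Hypothetical in-memory hash store; a real system would persist this."""

    def __init__(self):
        self._seen = set()

    def add_if_new(self, text):
        """Record the chunk's hash and return True only if it was not seen before."""
        digest = chunk_hash(text)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


index = HashIndex()
chunks = [
    "Reset password: Click forgot password...",
    "Reset password:  Click forgot password...",  # differs only by an extra space
    "Reset your password: click the forgot password link...",  # paraphrase
]
to_embed = [c for c in chunks if index.add_if_new(c)]
print(len(to_embed))  # 2: the paraphrase is not caught by hashing
```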

Semantic Duplicate Detection:

Cosine similarity between embeddings:
→ If similarity > 0.95 (very high; the right threshold depends on the embedding model)
→ Likely duplicate/near-duplicate

Example:
→ "Reset your password" vs "Reset password"
→ Different text, same meaning
→ Embeddings very similar
→ Flag as duplicate
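
The comparison can be sketched as follows. The three vectors are made-up toy values, not output from a real embedding model, and the 0.95 default is the threshold quoted above, which would likely need tuning per model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(a, b, threshold=0.95):
    """Flag a pair whose similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold


# Toy vectors for illustration only.
reset_your_password = [0.90, 0.10, 0.40]
reset_password = [0.88, 0.12, 0.41]
billing_question = [0.10, 0.90, 0.20]

print(is_near_duplicate(reset_your_password, reset_password))    # True
print(is_near_duplicate(reset_your_password, billing_question))  # False
```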

Fuzzy Matching:

Levenshtein distance:
→ Edit distance between texts
→ If distance < 5% of length
→ Near-duplicate

Catches typos and small edits
Misses: Rephrasing that changes many characters (use semantic detection for that)
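
A sketch of the edit-distance check in pure Python. `is_fuzzy_duplicate` is a hypothetical helper, and measuring the 5% against the longer of the two texts is an assumption, since the rule above does not say which length to use.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def is_fuzzy_duplicate(a, b, max_ratio=0.05):
    """True when the edit distance is below max_ratio of the longer text's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest < max_ratio


original = "Reset password: Click forgot password on the login page"
with_typo = "Reset password: Click forgot pasword on the login page"
print(levenshtein(original, with_typo))                               # 1
print(is_fuzzy_duplicate(original, with_typo))                        # True
print(is_fuzzy_duplicate("Reset your password", "Reset password"))    # False
```

The last call shows the limit of this approach: dropping one short word from a short text already exceeds 5%, so pairs like that are better handled by the embedding comparison.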

Deduplication Strategies

Pre-Ingestion Dedup:

Before embedding:
1. Hash new chunks
2. Check against existing hashes
3. Skip if duplicate

Duplicates are never embedded or stored
Most efficient: no embedding cost for skipped chunks

Post-Ingestion Dedup:

After ingestion:
1. Compute pairwise similarities (or query each vector's nearest neighbors to avoid comparing every pair)
2. Identify duplicates (similarity > 0.95)
3. Delete lower-priority duplicates

Use when:
→ Cleanup needed
→ Legacy data has duplicates
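
One way to sketch the cleanup pass is a greedy scan that keeps the highest-priority record in each group of near-duplicates. The record layout (`id`, `vector`, `priority`) and the function name `find_duplicates` are assumptions for illustration, and the vectors are toy values. The scan compares every record against every kept record, so it only suits small collections; a real vector store would more likely be queried for nearest neighbors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicates(records, threshold=0.95):
    """Return ids of records that duplicate a higher-priority record."""
    kept, to_delete = [], []
    # Visit high-priority records first so they are the ones that survive.
    for record in sorted(records, key=lambda r: r["priority"], reverse=True):
        is_dup = any(
            cosine_similarity(record["vector"], k["vector"]) > threshold
            for k in kept
        )
        if is_dup:
            to_delete.append(record["id"])
        else:
            kept.append(record)
    return to_delete


records = [
    {"id": "faq_v1", "vector": [0.9, 0.1, 0.4], "priority": 1},
    {"id": "faq_v2", "vector": [0.9, 0.1, 0.4], "priority": 2},   # newest version
    {"id": "faq_copy", "vector": [0.9, 0.1, 0.4], "priority": 0},
    {"id": "billing", "vector": [0.1, 0.9, 0.2], "priority": 1},
]
print(sorted(find_duplicates(records)))  # ['faq_copy', 'faq_v1']
```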

Version-Aware Ingestion:

Track document versions:
{
  document_id: "FAQ",
  version: 2,
  chunks: [...]
}

On re-ingest:
→ Delete chunks where document_id="FAQ" AND version < 2
→ Add new version

Automatic cleanup
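
The re-ingest rule can be sketched against a hypothetical in-memory store. `InMemoryStore`, `delete_where`, and `ingest_version` are illustrative names, not the API of any particular vector database; a real store would typically express the same condition as a metadata filter.

```python
class InMemoryStore:
    """Hypothetical stand-in for a vector DB that supports metadata filtering."""

    def __init__(self):
        self.records = []

    def delete_where(self, predicate):
        """Remove every record matching the predicate; return how many were removed."""
        before = len(self.records)
        self.records = [r for r in self.records if not predicate(r)]
        return before - len(self.records)

    def add(self, record):
        self.records.append(record)


def ingest_version(store, document_id, version, chunks):
    """Delete chunks of older versions of the document, then add the new chunks."""
    removed = store.delete_where(
        lambda r: r["document_id"] == document_id and r["version"] < version
    )
    for i, text in enumerate(chunks):
        store.add({
            "document_id": document_id,
            "version": version,
            "chunk_index": i,
            "text": text,  # embedding omitted to keep the sketch short
        })
    return removed


store = InMemoryStore()
ingest_version(store, "FAQ", 1, ["Reset password: ...", "Change email: ..."])
removed = ingest_version(store, "FAQ", 2, ["Reset password: ...", "Change email: ...", "Enable 2FA: ..."])
print(removed)                                # 2
print({r["version"] for r in store.records})  # {2}
```

Because the filter is `version < 2`, ingesting the same version number twice would still create duplicates, so this pairs well with a hash check at ingestion.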

Storage Impact

Cost Calculation:

10,000 unique chunks
20% duplication rate (relative to unique chunks) → 2,000 duplicates

Storage:
→ 12,000 vectors vs 10,000 (20% extra cost)

Embedding cost:
→ 2,000 duplicate embeddings generated
→ Wasted API calls

Retrieval:
→ More data to search → slightly slower
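
The same arithmetic as a small sketch. The 1,536 dimensions and 4 bytes per value are assumptions chosen only to turn vector counts into bytes; they do not come from the figures above, and index overhead and metadata are ignored.

```python
def duplication_overhead(unique_chunks, duplication_rate, dims=1536, bytes_per_value=4):
    """Estimate extra vectors and raw vector bytes caused by duplicates.

    duplication_rate is relative to the number of unique chunks.
    dims and bytes_per_value are illustrative assumptions.
    """
    duplicates = round(unique_chunks * duplication_rate)
    total = unique_chunks + duplicates
    return {
        "duplicates": duplicates,
        "total_vectors": total,
        "extra_fraction": duplicates / unique_chunks if unique_chunks else 0.0,
        "wasted_bytes": duplicates * dims * bytes_per_value,
    }


report = duplication_overhead(unique_chunks=10_000, duplication_rate=0.20)
print(report["duplicates"])     # 2000
print(report["total_vectors"])  # 12000
print(report["wasted_bytes"])   # 12288000
```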

How to Solve

Combine the following:
→ Implement hash-based exact duplicate detection at ingestion
→ Run semantic similarity deduplication (cosine > 0.95) periodically
→ Use version-aware ingestion (delete old versions on update)
→ Track document_id and version metadata
→ Prefer a single source of truth (don't ingest the same content from multiple sources)
→ Monitor a duplicate-rate metric

See Duplicate Management.


