# Knowledge Base Drift

## The Problem

Over time, accumulated updates, duplicates, and inconsistencies cause knowledge base quality to degrade, reducing answer accuracy.

### Symptoms

* ❌ Duplicate content with slight variations
* ❌ Inconsistent terminology across docs
* ❌ Orphaned chunks from deleted docs
* ❌ Degrading retrieval quality over time
* ❌ Conflicting information proliferating

### Real-World Example

```
Month 1: Clean knowledge base
→ 1,000 docs, well-organized

Month 12: Drifted knowledge base
→ 5,000 docs (5x growth)
→ 200 duplicates (re-ingested without dedup)
→ 500 orphaned chunks (source docs deleted)
→ Inconsistent terms ("log in" vs "sign in" vs "authenticate")

Retrieval quality:
→ Month 1: 90% accuracy
→ Month 12: 65% accuracy (degraded)
```

***

## Deep Technical Analysis

### Incremental Degradation

**Duplicate Accumulation:**

```
Document updated multiple times:
→ v1 ingested → chunks A, B, C
→ v2 updated → chunks D, E, F added
→ v1 chunks NOT removed

Result:
→ A, B, C (old) + D, E, F (new) coexist
→ Retrieval can return old info alongside new
→ Answers conflict
```
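One way to avoid this is to make ingestion version-aware, so that writing a new version replaces the old chunks instead of adding to them. The sketch below uses a toy in-memory store; `InMemoryChunkStore` and `upsert_document` are hypothetical names for illustration, not the API of any particular vector database.

```python
from collections import defaultdict


class InMemoryChunkStore:
    """Toy stand-in for a vector DB, keyed by document ID (hypothetical)."""

    def __init__(self):
        self._chunks = defaultdict(list)  # doc_id -> list of (version, text)

    def upsert_document(self, doc_id, version, chunks):
        # Replace rather than append: chunks from earlier versions are
        # dropped so that old and new content never coexist.
        self._chunks[doc_id] = [(version, text) for text in chunks]

    def all_chunks(self):
        return [text for items in self._chunks.values() for _, text in items]


store = InMemoryChunkStore()
store.upsert_document("auth-guide", 1, ["A", "B", "C"])
store.upsert_document("auth-guide", 2, ["D", "E", "F"])
print(store.all_chunks())  # ['D', 'E', 'F']
```

In a real vector database the same idea usually means deleting by a document-ID metadata filter before inserting the new chunks.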

**Orphaned Data:**

```
Source doc deleted from CMS:
→ Chunks remain in vector DB
→ No automatic cleanup
→ Stale data persists

Result:
→ Answers cite deleted/non-existent sources
```
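A periodic cleanup job can catch these orphans by comparing the source document IDs stored with each chunk against the IDs that still exist in the CMS. This is a minimal sketch; the chunk schema (`chunk_id`, `doc_id`) and the `find_orphaned_chunks` helper are assumptions for illustration.

```python
def find_orphaned_chunks(chunks, live_doc_ids):
    """Return IDs of chunks whose source document no longer exists.

    chunks: iterable of dicts with "chunk_id" and "doc_id" keys (assumed schema).
    live_doc_ids: set of document IDs currently present in the CMS.
    """
    return [c["chunk_id"] for c in chunks if c["doc_id"] not in live_doc_ids]


chunks = [
    {"chunk_id": "c1", "doc_id": "doc-1"},
    {"chunk_id": "c2", "doc_id": "doc-2"},
    {"chunk_id": "c3", "doc_id": "doc-2"},
]
live_doc_ids = {"doc-1"}  # doc-2 was deleted from the CMS

orphans = find_orphaned_chunks(chunks, live_doc_ids)
print(orphans)  # ['c2', 'c3']
```

The returned IDs can then be passed to whatever delete operation the vector database provides.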

### Terminology Drift

**Inconsistent Naming:**

```
Early docs: "User authentication"
Later docs: "Login system"
Recent docs: "Identity management"

Same concept, different terms:
→ Retrieval fragmented
→ Misses related docs
→ Incomplete answers
```

**Canonical Terms:**

```
Solution: Maintain glossary
→ Standardize on "Authentication"
→ Map aliases: "login", "sign in", "auth"
→ Normalize at ingestion or query time
```
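The alias mapping can be applied with a small normalization function that runs on documents at ingestion, on queries at search time, or both. The sketch below is one possible approach using only the standard library; the `ALIASES` table and `normalize_terms` function are hypothetical, and a real glossary would be larger and maintained alongside the docs.

```python
import re

# Hypothetical glossary: alias -> canonical term.
ALIASES = {
    "login": "authentication",
    "log in": "authentication",
    "sign in": "authentication",
    "auth": "authentication",
}

# Longest aliases first so multi-word aliases are tried before shorter ones.
# \b keeps "auth" from matching inside "authentication" or "author".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def normalize_terms(text):
    """Replace known aliases with their canonical term."""
    return _PATTERN.sub(lambda m: ALIASES[m.group(1).lower()], text)


print(normalize_terms("Login page troubleshooting"))
print(normalize_terms("auth token expired"))
```

Plain string replacement like this suits keyword or hybrid search; for purely embedding-based retrieval, expanding the query with the canonical term may be a gentler alternative to rewriting it.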

### Index Fragmentation

**Vector DB Performance:**

```
After many updates/deletes:
→ Index structure fragmented
→ Searches slow down
→ Recall can degrade (approximate indexes)

Requires:
→ Periodic reindexing
→ Compaction
→ Optimization
```

### Data Quality Metrics

**Staleness Detection:**

```
Monitor per-chunk age:
→ Chunks not updated in 6+ months
→ Flag for review
→ Possibly obsolete

Automated alerts:
→ "Document X not updated since 2022"
→ Review/remove
```
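A staleness check along these lines can be a simple scan over chunk metadata. The sketch below assumes each chunk carries a timezone-aware `updated_at` timestamp; the schema, the `STALE_AFTER` constant, and `find_stale_chunks` are illustrative names, not part of any specific product.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=183)  # roughly six months


def find_stale_chunks(chunks, now=None):
    """Return chunks whose last update is older than STALE_AFTER.

    chunks: iterable of dicts with "chunk_id" and a timezone-aware
    "updated_at" datetime (assumed schema).
    """
    now = now or datetime.now(timezone.utc)
    return [c for c in chunks if now - c["updated_at"] > STALE_AFTER]


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
chunks = [
    {"chunk_id": "c1", "updated_at": datetime(2022, 3, 15, tzinfo=timezone.utc)},
    {"chunk_id": "c2", "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
]
for chunk in find_stale_chunks(chunks, now=now):
    print(f"Chunk {chunk['chunk_id']} not updated since {chunk['updated_at']:%Y-%m-%d}")
```

Age alone does not prove a chunk is obsolete, so the output is best treated as a review queue rather than a delete list.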

**Duplicate Detection:**

```
Semantic similarity between chunks:
→ If cosine similarity > 0.95
→ Likely duplicate
→ Consolidate or remove
```
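The threshold check can be sketched with a pairwise comparison over chunk embeddings. This version uses only the standard library and tiny hand-made vectors; `find_near_duplicates` is a hypothetical helper, and the all-pairs loop is quadratic, so a large store would need the vector database's own nearest-neighbor search instead.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_near_duplicates(embeddings, threshold=0.95):
    """Return (id_a, id_b, similarity) for pairs above the threshold.

    embeddings: dict mapping chunk ID -> embedding vector.
    Compares all pairs, so it is O(n^2): fine for a sketch, not for large stores.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine_similarity(vec_a, vec_b)
        if sim > threshold:
            pairs.append((id_a, id_b, sim))
    return pairs


embeddings = {
    "c1": [1.0, 0.0, 0.0],
    "c2": [0.99, 0.05, 0.0],  # near-copy of c1
    "c3": [0.0, 1.0, 0.0],
}
for id_a, id_b, sim in find_near_duplicates(embeddings):
    print(id_a, id_b, round(sim, 3))
```

The 0.95 cutoff comes from the rule above; the right value depends on the embedding model, so it is worth checking a sample of flagged pairs by hand before removing anything.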

***

## How to Solve

**Combine these practices:**

* Implement version-aware ingestion (delete old chunks on update)
* Run periodic deduplication (detect semantic duplicates)
* Track chunk age and flag stale content
* Use canonical terminology (glossary + normalization)
* Schedule index compaction quarterly
* Monitor retrieval quality metrics (accuracy trend)
* Perform an annual knowledge base audit/cleanup

See [Knowledge Drift](/rag-scenarios-and-solutions/accuracy/factual-drift.md).
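As a rough illustration of monitoring the accuracy trend, the sketch below compares the average of the most recent measurements with the first recorded value. The `accuracy_dropped` helper, its `window` and `max_drop` defaults, and the intermediate monthly scores are assumptions made up for this example; only the 90% and 65% endpoints echo the scenario described earlier.

```python
def accuracy_dropped(history, window=3, max_drop=0.05):
    """Return True if the recent average accuracy fell more than max_drop
    below the first recorded value (treated here as the baseline).

    history: list of accuracy scores in [0, 1], oldest first.
    """
    if len(history) <= window:
        return False  # not enough data to compare against the baseline
    baseline = history[0]
    recent = sum(history[-window:]) / window
    return baseline - recent > max_drop


# Illustrative monthly scores from a fixed evaluation set (hypothetical data).
monthly_accuracy = [0.90, 0.88, 0.85, 0.80, 0.74, 0.70, 0.65]
if accuracy_dropped(monthly_accuracy):
    print("Retrieval accuracy is trending down; schedule an audit")
```

The scores only mean something if they come from the same evaluation questions each time, so the test set should stay fixed between runs.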


