# Outdated Knowledge Base

## The Problem

The knowledge base contains stale information that has since been updated in the source systems, causing the AI to give outdated answers.

### Symptoms

* ❌ AI cites old pricing/features
* ❌ Refers to deprecated APIs
* ❌ Misses recent updates
* ❌ Inconsistent with current docs
* ❌ Users correct AI with newer info

### Real-World Example

```
January: Document states "API v1 is current"
→ Ingested into knowledge base

March: API v2 released, docs updated
→ Knowledge base NOT updated

User query (April): "What's the current API version?"
AI: "The current API version is v1"

Wrong: the knowledge base was last synced 3 months ago and missed the March update
```

***

## Deep Technical Analysis

### Sync Frequency Issues

**One-Time Ingestion:**

```
Common mistake:
→ Ingest docs once at setup
→ Never refresh
→ Knowledge base frozen in time

Source docs evolve:
→ New features added
→ Bugs fixed
→ Policies updated
→ RAG oblivious to changes
```

**Manual Re-Ingestion:**

```
Admin must remember to re-sync periodically:
→ Error-prone (easily forgotten)
→ Unpredictable staleness
```

### Incremental Sync Challenges

**Delta Detection:**

```
Which documents changed?
→ Check last_modified timestamp
→ Only re-ingest modified docs

But:
→ Some sources don't expose last_modified
→ Metadata can be unreliable
→ Without a trustworthy change signal, must re-ingest everything (slow)
```
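One way to soften the "re-ingest everything" fallback is to compare a content hash whenever the timestamp is missing or unchanged. The following is a minimal sketch under stated assumptions: `SourceDoc`, `needs_reingest`, and the shape of `index_state` are hypothetical names for illustration, not part of any particular product API.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDoc:
    doc_id: str
    content: str
    # ISO 8601 string, or None when the source does not expose it (assumption).
    last_modified: Optional[str] = None


def content_hash(text: str) -> str:
    """Stable fingerprint of the document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def needs_reingest(doc: SourceDoc, index_state: dict) -> bool:
    """Decide whether a document must be re-embedded.

    index_state is a hypothetical mapping:
    doc_id -> {"last_modified": str or None, "hash": str}
    """
    seen = index_state.get(doc.doc_id)
    if seen is None:
        return True  # never ingested before
    # Cheap check first: a differing timestamp is treated as a change.
    if doc.last_modified and seen.get("last_modified"):
        if doc.last_modified != seen["last_modified"]:
            return True
    # Timestamp missing or unchanged: fall back to comparing content hashes.
    return content_hash(doc.content) != seen["hash"]
```

The hash comparison still requires fetching each document's content, so it saves embedding cost rather than download cost.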

**Version Conflicts:**

```
Doc version 1: Embedded as chunks A, B, C
Doc version 2: Updated

Options:
A) Delete old chunks, add new → Clean but complex
B) Add new chunks, keep old → Duplicate/conflicting info

Need versioning strategy
```
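Option A can be sketched as "write the new version, then delete every other version of the same document". The example below uses a plain dictionary as a stand-in for a vector store; `replace_document` and the chunk ID scheme are assumptions made for illustration.

```python
from typing import Dict, List


def replace_document(store: Dict[str, dict], document_id: str,
                     new_texts: List[str], version: int) -> None:
    """Replace all chunks of a document with the chunks of `version`.

    `store` is an in-memory stand-in for a vector store (assumption):
    chunk_id -> {"document_id": ..., "version": ..., "text": ...}
    """
    # Write the new chunks first so the document never disappears from the index.
    for i, text in enumerate(new_texts):
        chunk_id = f"{document_id}:v{version}:{i}"
        store[chunk_id] = {"document_id": document_id, "version": version, "text": text}

    # Then drop every chunk of this document that belongs to another version.
    stale = [cid for cid, chunk in store.items()
             if chunk["document_id"] == document_id and chunk["version"] != version]
    for cid in stale:
        del store[cid]
```

Writing before deleting means old and new chunks briefly coexist. Filtering queries on the latest version number would hide that window, at the cost of tracking the current version per document.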

### Real-Time vs Batch

**Batch Sync (Daily):**

```
Pros:
+ Simple
+ Predictable load

Cons:
- Up to 24h staleness
- Critical updates delayed
```

**Real-Time Sync (Webhooks):**

```
Source system sends webhook: "Doc X updated"
→ Immediately re-embed and update

Pros:
+ Near real-time freshness
+ Staleness limited to processing delay

Cons:
- Complex infrastructure
- Webhook reliability
- Burst load handling
```
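Burst load is often handled by coalescing events, so that a document edited many times in a short window is re-embedded only once. Here is a minimal sketch, assuming a hypothetical webhook payload of the form `{"doc_id": ...}` and a caller-supplied `reingest` function:

```python
from collections import OrderedDict
from typing import Callable, List


class ReingestQueue:
    """Coalesces webhook events: each document is queued at most once."""

    def __init__(self) -> None:
        self._pending: "OrderedDict[str, None]" = OrderedDict()

    def on_webhook(self, payload: dict) -> None:
        # Payload shape {"doc_id": "..."} is an assumption for this sketch.
        doc_id = payload["doc_id"]
        self._pending[doc_id] = None
        # A repeated event moves the document to the back of the queue.
        self._pending.move_to_end(doc_id)

    def drain(self, reingest: Callable[[str], None], max_batch: int = 100) -> List[str]:
        """Process up to max_batch pending documents, oldest first."""
        processed: List[str] = []
        while self._pending and len(processed) < max_batch:
            doc_id, _ = self._pending.popitem(last=False)
            # If reingest raises, the exception propagates and the popped ID
            # is not re-queued; retry handling is left out of this sketch.
            reingest(doc_id)
            processed.append(doc_id)
        return processed
```

Because webhooks can be dropped, a periodic batch sync is a reasonable backstop alongside a queue like this.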

### Document Deletion Handling

**Deleted Docs:**

```
Source: Document removed
Knowledge base: Still has old chunks

AI cites deleted/non-existent document:
→ User: "That link is dead"
→ Trust eroded

Must detect deletions and remove chunks
```
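Deletion detection can be sketched as a set difference between the document IDs the source currently lists and the IDs present in the index. The function name and the dictionary-based store below are illustrative assumptions.

```python
from typing import Dict, Iterable, Set


def remove_deleted(store: Dict[str, dict], source_ids: Iterable[str]) -> Set[str]:
    """Remove chunks whose document no longer exists in the source.

    `store` is an in-memory stand-in for a vector store (assumption):
    chunk_id -> {"document_id": ...}
    Returns the set of document IDs that were purged.
    """
    live = set(source_ids)
    indexed = {chunk["document_id"] for chunk in store.values()}
    deleted = indexed - live

    # Collect IDs first; deleting while iterating over the dict would fail.
    doomed = [cid for cid, chunk in store.items() if chunk["document_id"] in deleted]
    for cid in doomed:
        del store[cid]
    return deleted
```

This depends on a complete listing from the source. If the listing call fails or returns a partial page, the difference would wrongly purge live documents, so a real job should verify the listing before deleting.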

**Soft Deletes:**

```
Some systems soft-delete (mark as deleted):
→ Doc still exists but hidden
→ API may still return it
→ RAG may ingest deleted content

Filter at ingestion: WHERE deleted_at IS NULL
```
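When the source is an API rather than a SQL table, the same filter can be applied in code before embedding. This sketch assumes each record is a dictionary with an optional `deleted_at` field; the field name mirrors the filter above and may differ per system.

```python
from typing import Iterable, Iterator


def live_records(records: Iterable[dict]) -> Iterator[dict]:
    """Yield only records that are not soft-deleted.

    A record counts as live when `deleted_at` is missing or None (assumption).
    """
    for record in records:
        if record.get("deleted_at") is None:
            yield record
```

Documents that were ingested earlier and soft-deleted afterwards still need their existing chunks removed; this filter only prevents new ingestion.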

### Timestamp Metadata

**Last Updated Tracking:**

```
Store with each chunk:
{
  vector: [...],
  metadata: {
    document_id: "doc_123",
    last_updated: "2024-01-15T10:00:00Z",
    source_url: "https://..."
  }
}

Benefits:
→ Know freshness of each chunk
→ Surface last_updated in AI response
→ User can assess relevance
```
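With that metadata in place, a response can carry a freshness note. The sketch below assumes the chunk shape shown above; the function names and the 90-day threshold are arbitrary choices for illustration.

```python
from datetime import datetime, timezone


def chunk_age_days(chunk: dict, now: datetime) -> int:
    """Whole days between the chunk's last_updated and `now` (timezone-aware)."""
    # fromisoformat on older Python versions does not accept a trailing "Z".
    stamp = chunk["metadata"]["last_updated"].replace("Z", "+00:00")
    return (now - datetime.fromisoformat(stamp)).days


def freshness_note(chunk: dict, now: datetime, stale_after_days: int = 90) -> str:
    """Build a citation line that surfaces last_updated to the user."""
    meta = chunk["metadata"]
    note = f"Source: {meta['source_url']} (last updated {meta['last_updated'][:10]})"
    if chunk_age_days(chunk, now) > stale_after_days:
        note += " - may be outdated"
    return note


# Typical call site: freshness_note(chunk, datetime.now(timezone.utc))
```

Appending the note to each cited source lets the user judge relevance without the model having to reason about dates.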

***

## How to Solve

**Implement automated incremental sync (daily batch or real-time webhooks), track each document's last\_modified timestamp, delete outdated chunks when the source is updated, use webhook-triggered re-ingestion for critical docs, display a "last updated" timestamp in AI responses, and monitor sync-lag metrics.** See [Data Freshness](/rag-scenarios-and-solutions/accuracy/stale-data.md).
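Sync lag can be measured as the time a newer source version has gone without being ingested. The sketch below assumes two hypothetical mappings of document ID to ISO 8601 timestamp, one from the source and one from the index:

```python
from datetime import datetime, timedelta
from typing import Dict


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing "Z" for UTC."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def sync_lag(source: Dict[str, str], indexed: Dict[str, str],
             now: datetime) -> Dict[str, timedelta]:
    """Per-document lag for documents whose source copy is newer than the index.

    source:  doc_id -> last_modified reported by the source system
    indexed: doc_id -> last_updated stored with the ingested chunks
    Documents that are up to date are omitted from the result.
    """
    lag: Dict[str, timedelta] = {}
    for doc_id, modified in source.items():
        src_time = parse_utc(modified)
        idx_stamp = indexed.get(doc_id)
        # Missing from the index, or indexed before the latest source change.
        if idx_stamp is None or parse_utc(idx_stamp) < src_time:
            lag[doc_id] = now - src_time
    return lag
```

Alerting on `max(lag.values())` or on the number of lagging documents gives a simple signal that the sync job has stalled.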


