# Stale Data After Deletion

## The Problem

Documents deleted from your data source still appear in your AI agent's responses, so users receive outdated or incorrect answers.

### Symptoms

* ❌ AI cites deleted Confluence pages
* ❌ Old pricing info appears despite doc deletion
* ❌ Deleted product docs still in knowledge base
* ❌ "This page was archived" but AI still references it
* ❌ Sync shows "completed" but deleted content remains

### Real-World Example

```
Day 1: Product V1 documentation in Confluence (50 pages)
Day 30: Product V2 released, V1 docs archived/deleted
Day 31: Twig sync runs successfully

User asks: "How do I configure V1?"
AI responds: "To configure V1, follow these steps..."
   → Cites deleted V1 documentation
   → V1 is deprecated, answer is wrong

Vector DB still contains: 50 embedded chunks from V1 docs
```

***

## Deep Technical Analysis

### Soft Delete vs Hard Delete

Data sources implement deletion differently:

**Hard Delete:**

```
User clicks "Delete" in Confluence
→ Page immediately removed from database
→ Page ID no longer exists
→ API queries return 404 Not Found

Clean from API perspective:
→ Document truly gone
→ No API endpoint to query it
→ But: How does Twig know it existed and should be removed?
```

**Soft Delete (Archive):**

```
User clicks "Archive" in Confluence
→ Page marked as archived in database
→ Page ID still exists
→ API can still query it

API response:
{
  "id": "12345",
  "title": "Old Guide",
  "status": "archived"  ← Key indicator
}

Still accessible but shouldn't be in knowledge base
```

**The Discovery Problem:**

```
Twig incremental sync:
→ Query: WHERE modified_at > last_sync
→ Returns: new and modified documents
→ Missing: deleted/archived documents

Deleted doc:
→ Not in query results (no modified_at update on deletion)
→ Twig unaware it was deleted
→ Old embedding stays in vector DB
→ AI continues citing it
```
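The gap can be reproduced with a small in-memory simulation. The `Doc` record and `incremental_sync` helper below are hypothetical stand-ins for a source API that filters on `modified_at`; this is a sketch of the failure mode, not Twig's actual sync code.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    id: str
    modified_at: int  # simplified timestamp (day number)


def incremental_sync(source: list[Doc], last_sync: int) -> list[str]:
    """Return IDs of documents modified after last_sync.

    Mirrors a `WHERE modified_at > last_sync` query: a deleted document
    is simply absent from `source`, so it can never appear in the result.
    """
    return [d.id for d in source if d.modified_at > last_sync]


# Day 1: both documents are indexed.
indexed_ids = {"v1-guide", "v2-guide"}

# Day 30: v1-guide is deleted at the source; v2-guide is edited.
source_now = [Doc("v2-guide", modified_at=30)]

changed = incremental_sync(source_now, last_sync=1)

# The sync reports the edit but says nothing about v1-guide,
# so the index keeps its stale chunks unless something else detects them.
stale = indexed_ids - {d.id for d in source_now}
```

Here `changed` contains only the edited document, while `stale` holds the deleted one that the incremental query never mentions.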

### Deletion Detection Strategies

RAG systems must actively detect deletions:

**1. Full Reconciliation (expensive):**

```
Every sync:
1. Query ALL document IDs from source
2. Compare with document IDs in vector DB
3. Missing in source → deleted → remove from vector DB

For 10,000 documents:
→ Fetch 10,000 IDs from source API
→ Query 10,000 IDs from vector DB
→ Set diff: O(n) comparison (cheap once both ID lists are in memory)
→ The two full listings make this slow and API-intensive
```
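The reconciliation steps above reduce to a set difference. A minimal sketch, assuming both ID lists have already been fetched; `delete_doc` is a hypothetical callback that removes all chunks for one document from the vector store.

```python
def find_deleted(source_ids: set[str], indexed_ids: set[str]) -> set[str]:
    """IDs present in the vector index but no longer listed by the source."""
    return indexed_ids - source_ids


def reconcile(source_ids, indexed_ids, delete_doc) -> set[str]:
    """Remove every indexed document that the source no longer lists.

    `delete_doc` is a caller-supplied (hypothetical) callback that deletes
    all chunks belonging to a single document ID.
    """
    deleted = find_deleted(set(source_ids), set(indexed_ids))
    for doc_id in sorted(deleted):  # sorted only for deterministic ordering
        delete_doc(doc_id)
    return deleted
```

A failed or partial source listing looks identical to a mass deletion here, so a production version would likely want to abort the reconciliation whenever the listing call did not complete cleanly.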

**2. Change Log / Delta API (ideal but rare):**

```
Google Drive changes API:
GET /drive/v3/changes?pageToken=xyz

Response:
{
  "changes": [
    {"fileId": "123", "type": "file", "removed": true},
    {"fileId": "456", "type": "file", "removed": false, ...}
  ]
}

Explicitly reports deletions
Efficient: Only queries changes

But:
→ Few APIs support this (Google Drive, Dropbox)
→ Most APIs (Confluence, Zendesk) don't
```
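Consuming a change feed might look like the sketch below. The `Change` record is a hypothetical normalized form (whatever the source API calls its deletion flag would be mapped onto `removed`), and `reembed` is an assumed callback that re-chunks a document and returns its new chunk IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    doc_id: str
    removed: bool  # True when the source reports the document as gone


def apply_changes(changes, index: dict[str, list[str]], reembed):
    """Apply one page of a change feed to an in-memory index.

    `index` maps doc_id -> chunk IDs. `reembed` is a hypothetical callback
    that re-chunks a document and returns the new chunk IDs.
    """
    for change in changes:
        if change.removed:
            # pop with a default tolerates documents that were never indexed
            index.pop(change.doc_id, None)
        else:
            index[change.doc_id] = reembed(change.doc_id)
    return index
```

Because deletions arrive as explicit events, no comparison against the full document list is needed.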

**3. Status Field Polling (if available):**

```
If API supports status queries:
GET /articles?status=archived

Returns all archived articles
Twig can query and remove these

But:
→ Requires separate API call
→ Doesn't catch hard deletes
→ "Archived" ≠ "deleted" in some systems
```

**4. Document Metadata Tracking:**

```
Twig stores metadata for each chunk:
{
  "chunk_id": "chunk_abc",
  "source_doc_id": "123",
  "last_seen": "2024-01-20T10:00:00Z"
}

Every sync:
→ Update last_seen for documents encountered
→ Periodically: Query chunks whose last_seen is more than 30 days old
→ Attempt to fetch source document
→ If 404: Assume deleted, remove chunk

Pros:
+ Catches hard deletes
+ No need for full reconciliation

Cons:
- False positives (temporary API errors)
- 30-day lag (deleted content stays for a month)
```
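A sketch of the `last_seen` sweep, with one possible guard against the false-positive risk: a document is treated as deleted only if repeated probes all return 404. `fetch_status` is a hypothetical callable returning an HTTP status code, and the retry count is an assumption rather than a documented Twig setting.

```python
from datetime import datetime, timedelta, timezone


def stale_doc_ids(last_seen: dict[str, datetime], now: datetime,
                  max_age: timedelta = timedelta(days=30)) -> list[str]:
    """Documents that no sync has encountered within `max_age`."""
    cutoff = now - max_age
    return sorted(d for d, seen in last_seen.items() if seen < cutoff)


def confirm_deleted(doc_id: str, fetch_status, attempts: int = 3) -> bool:
    """Treat a document as deleted only if every probe returns 404.

    `fetch_status` is a hypothetical callable returning an HTTP status code.
    Any other code (for example a 503) means "unknown", not "deleted";
    `all` stops probing at the first such response.
    """
    return all(fetch_status(doc_id) == 404 for _ in range(attempts))
```

Shortening `max_age` reduces the lag at the cost of more probe requests against the source API.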

### Cascade Deletion and Relationships

Deleting parent documents should delete children:

**Hierarchical Content:**

```
Confluence space:
→ Parent page: "Product Guide"
  → Child: "Installation"
  → Child: "Configuration"
    → Grandchild: "Advanced Config"
```

**Deletion Cascade:**

```
User deletes "Product Guide" parent page

Expected:
→ All children and grandchildren deleted

API behavior varies:
1. Confluence: Children deleted only if the user opts in; otherwise they move up a level
2. Notion: Children moved to trash (still accessible)
3. Zendesk: Articles independent (no cascade)

Twig must:
→ Detect parent deletion
→ Query if children still exist
→ Remove children if inaccessible
→ Complex logic per data source
```
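The "detect parent deletion, then check the children" logic can be sketched as a tree walk. This assumes the sync has stored a child-to-parent mapping with no cycles; `exists` is a hypothetical per-source callback that reports whether a page is still retrievable.

```python
from collections import defaultdict
from typing import Callable, Optional


def descendants(parent_of: dict[str, Optional[str]], root: str) -> list[str]:
    """All pages below `root`, given a child -> parent mapping."""
    children = defaultdict(list)
    for child, parent in parent_of.items():
        if parent is not None:
            children[parent].append(child)
    found, stack = [], [root]
    while stack:  # iterative depth-first walk; assumes no cycles
        for child in children[stack.pop()]:
            found.append(child)
            stack.append(child)
    return found


def pages_to_remove(parent_of, deleted_root: str,
                    exists: Callable[[str], bool]) -> list[str]:
    """Descendants of a deleted page that the source no longer serves."""
    return [p for p in descendants(parent_of, deleted_root) if not exists(p)]
```

Keeping the per-source behavior inside `exists` lets the walk itself stay identical across Confluence, Notion, and Zendesk.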

**The Orphan Problem:**

```
Scenario:
1. Twig syncs parent + children
2. User deletes parent (but not children in UI)
3. Children become orphans (no parent path)
4. Twig's next sync:
   → Detects parent deleted
   → Removes parent from vector DB
   → Children still referenced in chunks
   → Chunks contain "Parent > Child" breadcrumb
   → Breadcrumb now wrong

RAG retrieval:
→ User query matches orphan child chunk
→ AI cites: "See Product Guide > Installation"
→ But "Product Guide" no longer exists
→ Broken reference
```
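One possible mitigation is to repair stored breadcrumbs when an ancestor disappears. The sketch below assumes breadcrumbs are stored as `" > "`-separated titles and that a set of live page titles is available; both are assumptions about the chunk metadata, not a description of Twig's schema.

```python
def repair_breadcrumb(breadcrumb: str, live_titles: set[str],
                      sep: str = " > ") -> str:
    """Drop ancestors that no longer exist from a stored breadcrumb."""
    parts = breadcrumb.split(sep)
    # Always keep the last element: it is the chunk's own page.
    kept = [p for p in parts[:-1] if p in live_titles] + parts[-1:]
    return sep.join(kept)
```

Matching on titles is fragile when pages share a name, so matching on page IDs would be safer where the metadata includes them.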

### Vector Database Deletion Complexity

Removing embeddings isn't straightforward:

**Chunk-Level Deletion:**

```
Document "Guide.md" chunked into 50 chunks:
→ chunk_1, chunk_2, ..., chunk_50

Each chunk embedded separately in vector DB:
→ 50 separate vectors with metadata: doc_id="Guide.md"

Document deleted, Twig must:
1. Query vector DB: WHERE doc_id="Guide.md"
2. Get all 50 chunk IDs
3. Delete each chunk: DELETE FROM vectors WHERE id IN (...)
4. Rebuild index (some vector DBs require this)

For 100 deleted documents (5,000 chunks):
→ 100 metadata queries
→ 5,000 deletions
→ Index rebuild
→ Slow operation
```
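The per-document metadata query can be avoided if a secondary index from document ID to chunk IDs is kept alongside the vectors. `ChunkStore` below is a toy in-memory stand-in used only to illustrate the bookkeeping; real vector databases expose their own delete-by-ID or delete-by-filter operations.

```python
from collections import defaultdict


class ChunkStore:
    """Toy stand-in for a vector DB: chunk_id -> (doc_id, vector)."""

    def __init__(self):
        self.chunks: dict[str, tuple[str, list[float]]] = {}
        # Secondary index so a document's chunks can be found without a scan.
        self.by_doc: dict[str, set[str]] = defaultdict(set)

    def add(self, chunk_id: str, doc_id: str, vector: list[float]) -> None:
        self.chunks[chunk_id] = (doc_id, vector)
        self.by_doc[doc_id].add(chunk_id)

    def delete_docs(self, doc_ids) -> int:
        """Delete all chunks of the given documents; return chunks removed."""
        removed = 0
        for doc_id in doc_ids:
            for chunk_id in self.by_doc.pop(doc_id, set()):
                del self.chunks[chunk_id]
                removed += 1
        return removed
```

Calling `delete_docs` twice for the same document is harmless: the second call finds no chunks and returns 0.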

**Soft Delete in Vector DB:**

```
Instead of deleting:
→ Mark as deleted with metadata: {"deleted": true}
→ Filter at query time: WHERE deleted != true

Pros:
+ Fast "deletion" (just metadata update)
+ Reversible (if mistake)
+ No index rebuild

Cons:
- Storage still used (deleted chunks remain)
- Query overhead (filtering)
- Need periodic cleanup job
```
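A minimal sketch of the tombstone approach, using a plain dictionary of chunk metadata in place of a real vector store. The function names are illustrative; most vector databases would express `visible` as a metadata filter on the query itself.

```python
def soft_delete(metadata: dict[str, dict], doc_id: str) -> int:
    """Flag every chunk of `doc_id` as deleted; return how many were flagged."""
    flagged = 0
    for meta in metadata.values():
        if meta["doc_id"] == doc_id and not meta.get("deleted"):
            meta["deleted"] = True
            flagged += 1
    return flagged


def visible(ranked_chunk_ids: list[str], metadata: dict[str, dict]) -> list[str]:
    """Drop soft-deleted chunks from ranked results, preserving order."""
    return [c for c in ranked_chunk_ids if not metadata[c].get("deleted")]


def purge(metadata: dict[str, dict]) -> None:
    """Periodic cleanup job: physically remove flagged chunks."""
    for chunk_id in [c for c, m in metadata.items() if m.get("deleted")]:
        del metadata[chunk_id]
```

Filtering after retrieval can return fewer results than requested, which is one reason a filter applied inside the similarity search is usually preferable.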

**The Deletion Propagation Delay:**

```
Timeline:
10:00 AM: User deletes Confluence page
10:05 AM: Twig incremental sync detects deletion
10:06 AM: Twig queries vector DB for chunks
10:07 AM: Deletion submitted to vector DB
10:15 AM: Vector DB index rebuild completes

During the 10:00-10:15 window:
→ User asks AI question
→ Query retrieves deleted chunks (still in index)
→ AI cites deleted content
→ Eventual consistency problem
```

### Multi-Source Content Deduplication

Same content from multiple sources complicates deletion:

**Cross-Source Duplication:**

```
Content: "API Authentication Guide"
Sources:
→ Confluence page ID: conf_123
→ Notion page ID: notion_456
→ Website URL: /docs/auth

All three embedded separately:
→ 3 sets of chunks for same content
→ Different source_ids but similar embeddings

User deletes Confluence page:
→ Twig removes conf_123 chunks
→ But notion_456 and /docs/auth chunks remain
→ AI still cites auth guide (from other sources)

Is this correct behavior?
→ Content still exists in Notion/Website
→ Maybe deletion should only affect Confluence references
→ Or: Deduplicate and delete all copies?
```
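Whichever policy is chosen, it first requires knowing which source IDs hold the same content. A sketch using a hash of normalized text, which only catches exact duplicates after whitespace and case normalization; near-duplicates would need a similarity measure instead.

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """SHA-256 of whitespace- and case-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(docs: dict[str, str]) -> list[set[str]]:
    """Group source IDs whose normalized content is identical."""
    groups = defaultdict(set)
    for source_id, text in docs.items():
        groups[fingerprint(text)].add(source_id)
    # Only groups with more than one member are duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

With the groups known, a deletion in one source can either remove only that source's chunks or be escalated for review when other copies remain.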

### Deletion Audit Trail and Recovery

Tracking deletions for compliance and recovery:

**Audit Requirements:**

```
GDPR/compliance often requires:
→ Who deleted what and when
→ Proof of deletion
→ Ability to show that the "right to be forgotten" was enforced

Twig must log:
{
  "action": "delete",
  "doc_id": "conf_123",
  "source": "confluence",
  "deleted_at": "2024-01-20T10:05:00Z",
  "deleted_by": "integration_sync",
  "chunks_removed": 50
}

Retention: Keep logs for X years
```
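An entry in that shape could be produced as one JSON line per deletion. The helper below is a sketch: the function name and the choice of JSON Lines are assumptions, and `now` is expected to be a timezone-aware datetime.

```python
import json
from datetime import datetime, timezone


def audit_record(doc_id: str, source: str, chunks_removed: int,
                 deleted_by: str = "integration_sync", now=None) -> str:
    """Build one deletion audit entry as a JSON string."""
    now = now or datetime.now(timezone.utc)
    return json.dumps({
        "action": "delete",
        "doc_id": doc_id,
        "source": source,
        # Convert to UTC before formatting so the trailing "Z" is accurate.
        "deleted_at": now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "deleted_by": deleted_by,
        "chunks_removed": chunks_removed,
    })
```

Appending these lines to write-once storage keeps the proof of deletion separate from the data that was deleted.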

**Accidental Deletion Recovery:**

```
User scenario:
→ Accidentally deletes important guide
→ Twig syncs, removes from vector DB
→ User realizes mistake 2 hours later
→ Restores guide in Confluence
→ But: Knowledge base still missing it until next sync

Time gap:
→ 2 hours of "document not found" responses
→ User expects immediate recovery

Solution:
→ Maintain soft-delete window (24 hours)
→ Don't hard delete for 24 hours
→ If document reappears: unmark deleted
→ But: Complexity and storage overhead
```
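The grace-period idea can be sketched as a small tracker of pending deletions. `GraceTracker` and its method names are hypothetical; the 24-hour window comes from the solution above.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(hours=24)


class GraceTracker:
    """Tracks soft-deleted documents until the grace period expires."""

    def __init__(self):
        self.pending: dict[str, datetime] = {}  # doc_id -> time marked deleted

    def mark_deleted(self, doc_id: str, now: datetime) -> None:
        # setdefault keeps the earliest mark if a sync reports it again
        self.pending.setdefault(doc_id, now)

    def restore(self, doc_id: str) -> bool:
        """Document reappeared at the source: cancel the pending delete."""
        return self.pending.pop(doc_id, None) is not None

    def due_for_hard_delete(self, now: datetime) -> list[str]:
        """Pop and return documents whose grace period has elapsed."""
        due = sorted(d for d, t in self.pending.items() if now - t >= GRACE)
        for d in due:
            del self.pending[d]
        return due
```

While a document is pending, its chunks would be hidden from retrieval rather than removed, so `restore` only needs to clear the flag instead of re-embedding the document.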

### Deletion Performance at Scale

Large-scale deletions are expensive:

**The Bulk Delete Problem:**

```
User archives entire Confluence space:
→ 1,000 pages
→ 50,000 chunks total

Deletion operations:
→ Query 1,000 doc IDs from vector DB
→ Retrieve 50,000 chunk IDs
→ Delete 50,000 vectors
→ Rebuild index

Time:
→ Query: 2 minutes
→ Delete: 10 minutes
→ Index rebuild: 30 minutes
→ Total: 42 minutes

During this time:
→ Vector DB under load
→ Query performance degraded
→ Other customers affected (shared infrastructure)
```

**Batch Delete Strategy:**

```
Instead of deleting immediately:
1. Mark chunks as pending_deletion
2. Queue deletion job
3. Background worker processes in batches:
   → Delete 100 chunks at a time
   → Sleep 1 second between batches
   → Spread load over time

Tradeoff:
→ Deleted content may appear for 15-30 minutes
→ But: No performance spike
→ Gradual cleanup
```
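The background worker can be sketched as a loop over fixed-size batches. `delete_batch` is a hypothetical callback that removes a list of chunk IDs from the vector store, and `sleep` is injectable so the pacing can be tested without waiting.

```python
import time
from typing import Iterator


def batched(ids: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def delete_in_batches(chunk_ids, delete_batch, batch_size: int = 100,
                      pause_s: float = 1.0, sleep=time.sleep) -> int:
    """Delete chunks in fixed-size batches, pausing between batches.

    `delete_batch` is a hypothetical callback that removes one list of
    chunk IDs from the vector store. Returns the number of batches run.
    """
    batches = 0
    for batch in batched(list(chunk_ids), batch_size):
        if batches:
            sleep(pause_s)  # pause between batches, not before the first
        delete_batch(batch)
        batches += 1
    return batches
```

With these defaults, 50,000 chunks become 500 batches with 499 one-second pauses, so the pauses alone add a little over eight minutes before any deletion time is counted.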

***

## How to Solve

* Track all document IDs and reconcile them against the source on each sync.
* Use change log APIs where available.
* Store `last_seen` metadata and run a periodic cleanup.
* Soft-delete with a 24-hour grace period.
* Batch large deletions.

See [Data Lifecycle Management](https://github.com/thrivapp/twig-help-docs/blob/staging/data/lifecycle.md).


