# Erasure from Vector Index

## The Problem

Deleting embeddings from a vector index can be slow or incomplete, and it can degrade index performance, which makes GDPR-compliant erasure technically challenging.

### Symptoms

* ❌ Deletion takes hours for large indexes
* ❌ Index performance degrades after deletions
* ❌ Cannot verify complete erasure
* ❌ "Soft delete" leaves data present
* ❌ Rebuild required for clean deletion

### Real-World Example

```
User requests deletion:
→ 50,000 vectors contain user's data
→ Vector DB: Pinecone (10M total vectors)

Deletion process:
→ Delete by metadata filter: 2 hours
→ Index fragmentation: Performance drop 30%
→ Recommendation: Rebuild index
→ Rebuild time: 8 hours
→ Total impact: 10 hours

Cannot meet "immediate deletion" expectation
```

***

## Deep Technical Analysis

### Vector Index Structures

**HNSW (Hierarchical Navigable Small World):**

```
Index structure:
→ Graph of vectors
→ Links between nearby vectors
→ Optimized for search, not deletion

Deletion impact:
→ Remove node from graph
→ Must update neighbor links
→ Graph becomes fragmented
→ Search quality degrades over time
```
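The relinking cost described above can be illustrated with a toy, single-layer neighbor graph. This is only a sketch under simplifying assumptions: `remove_node` is a hypothetical helper, and the all-pairs repair shown here is not what any particular HNSW library does (real implementations use several layers and more selective neighbor heuristics).

```python
from typing import Dict, Set


def remove_node(graph: Dict[str, Set[str]], node: str) -> None:
    """Remove `node` from an undirected neighbor graph and patch the hole.

    Simplified repair (an assumption for illustration): every former
    neighbor of the removed node is linked to the other former neighbors,
    so the paths that ran through the node are not lost.
    """
    neighbors = graph.pop(node, set())
    # Drop the dangling links that still point at the removed node.
    for n in neighbors:
        graph[n].discard(node)
    # Reconnect the former neighbors to each other.
    for a in neighbors:
        for b in neighbors:
            if a != b:
                graph[a].add(b)


# Toy graph: a - b - c. Removing "b" would disconnect "a" from "c"
# unless the neighbor links are repaired.
graph = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b"},
}
remove_node(graph, "b")
print(graph)
```

Even in this toy version, one deletion touches every neighbor of the removed node, which is why bulk deletions on a graph index are expensive.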

**IVF (Inverted File Index):**

```
Index structure:
→ Vectors partitioned into clusters
→ Search within relevant clusters

Deletion:
→ Remove from cluster
→ Cluster imbalance over time
→ Some clusters empty, others too full
→ Re-clustering needed
```
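A rough way to detect the cluster imbalance described above is to recount cluster sizes after deletions. The sketch below assumes a plain mapping from vector ID to cluster number; the function names and the `max_ratio` threshold are hypothetical and would need tuning for a real index.

```python
from collections import Counter
from typing import Dict, Set


def cluster_sizes(assignments: Dict[str, int], deleted: Set[str]) -> Counter:
    """Count the live vectors in each cluster after removing `deleted` IDs."""
    return Counter(c for vid, c in assignments.items() if vid not in deleted)


def needs_reclustering(sizes: Counter, n_clusters: int, max_ratio: float = 4.0) -> bool:
    """Flag re-clustering when a cluster is empty or sizes are very uneven.

    `max_ratio` is an illustrative threshold, not a recommended value.
    """
    counts = [sizes[c] for c in range(n_clusters)]
    if min(counts) == 0:
        return True
    return max(counts) / min(counts) > max_ratio


# Toy assignment of five vectors to three clusters.
assignments = {"v1": 0, "v2": 0, "v3": 1, "v4": 2, "v5": 2}
sizes = cluster_sizes(assignments, deleted={"v3"})
recluster = needs_reclustering(sizes, n_clusters=3)
print(dict(sizes), recluster)
```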

### Deletion Strategies

**Lazy Deletion:**

```
Mark as deleted, don't physically remove:
→ Add "deleted=true" flag in metadata
→ Filter out at query time

Pros:
+ Fast "deletion"
+ No index rebuild

Cons:
- Data still physically present (GDPR violation?)
- Storage still used
- Query overhead (filtering)
```
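The trade-off of lazy deletion can be made concrete with a toy in-memory store. This is a sketch only: `LazyDeleteStore` and its methods are hypothetical names, not the API of any vector database.

```python
from typing import Dict, List


class LazyDeleteStore:
    """Toy in-memory store; `deleted` is a metadata flag, not a real API."""

    def __init__(self) -> None:
        self.records: Dict[str, dict] = {}

    def upsert(self, vec_id: str, vector: List[float], user_id: str) -> None:
        self.records[vec_id] = {"vector": vector, "user_id": user_id, "deleted": False}

    def soft_delete(self, user_id: str) -> int:
        """Flag every record of `user_id`; nothing is physically removed."""
        flagged = 0
        for record in self.records.values():
            if record["user_id"] == user_id and not record["deleted"]:
                record["deleted"] = True
                flagged += 1
        return flagged

    def visible_ids(self) -> List[str]:
        """IDs a query may return (the filter is applied at query time)."""
        return [vid for vid, rec in self.records.items() if not rec["deleted"]]


store = LazyDeleteStore()
store.upsert("v1", [0.1, 0.2], user_id="X")
store.upsert("v2", [0.3, 0.4], user_id="X")
store.upsert("v3", [0.5, 0.6], user_id="Y")

flagged = store.soft_delete("X")
print(flagged, store.visible_ids(), len(store.records))
```

Queries no longer see user X's vectors, yet all three records remain in `store.records`, which is exactly the compliance concern listed under the cons.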

**Immediate Deletion:**

```
Physically remove from index:
→ Update index structure
→ Rebalance neighbors

Pros:
+ True deletion
+ Storage reclaimed

Cons:
- Slow (especially bulk deletes)
- Index fragmentation
- Performance impact
```

**Batch Deletion with Rebuild:**

```
1. Queue deletions
2. Daily: Rebuild index excluding deleted
3. Swap new index for old

Pros:
+ Clean index (no fragmentation)
+ Efficient (rebuild once)

Cons:
- Deletion not immediate (up to 24h delay)
- GDPR: "Without undue delay" (Art. 17) sets no hour-level limit; Art. 12(3) allows up to one month to act on the request
```
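The queue, rebuild, and swap steps above might look like the following in miniature. `RebuildingIndex` is a hypothetical class that uses a plain dictionary in place of a real index; the point is only the order of operations.

```python
from typing import Dict, Iterable, List, Set


class RebuildingIndex:
    """Toy index: deletions are queued, then applied by a full rebuild."""

    def __init__(self, vectors: Dict[str, List[float]]) -> None:
        self._live: Dict[str, List[float]] = dict(vectors)
        self._pending: Set[str] = set()

    def queue_delete(self, ids: Iterable[str]) -> None:
        """Step 1: record the IDs to drop; the serving index is untouched."""
        self._pending.update(ids)

    def rebuild_and_swap(self) -> int:
        """Steps 2 and 3: build a clean copy, then swap it in."""
        rebuilt = {vid: vec for vid, vec in self._live.items()
                   if vid not in self._pending}
        removed = len(self._live) - len(rebuilt)
        self._live = rebuilt  # the swap is a single reference assignment
        self._pending.clear()
        return removed

    def ids(self) -> List[str]:
        return list(self._live)


index = RebuildingIndex({"v1": [0.1], "v2": [0.2], "v3": [0.3]})
index.queue_delete(["v1", "v3"])
ids_before_rebuild = index.ids()   # queued vectors are still served
removed = index.rebuild_and_swap()
print(ids_before_rebuild, removed, index.ids())
```

Until `rebuild_and_swap` runs, the queued vectors are still served, which is the delay listed under the cons.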

### Performance Considerations

**Large-Scale Deletion:**

```
Delete 100K vectors from 10M index:
→ 1% of index

Options:
A) Delete one-by-one: 100K API calls = slow
B) Batch delete (WHERE user_id='X'): Single call, but long execution
C) Export, filter, re-import: 2-4 hours

Trade-off: Speed vs operational complexity
```
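A middle ground between options A and B is to send IDs in fixed-size batches. The helper below is a minimal sketch; the batch size of 1,000 is an assumption, since the per-request limit depends on the vector database in use.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# 100K vector IDs, as in the scenario above (the IDs themselves are made up).
ids = [f"vec-{i}" for i in range(100_000)]
batches = list(chunked(ids, 1_000))
print(len(batches), len(batches[0]))
```

With this batch size, the 100K one-by-one calls of option A become 100 delete requests.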

**Index Compaction:**

```
After many deletions:
→ Index sparse, fragmented
→ Search slower
→ Storage not reclaimed

Compaction:
→ Rebuild index (densify)
→ Restore performance
→ Reclaim storage
→ But: Requires downtime or dual indexes
```
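One simple way to decide when to compact is to track the fraction of index slots that no longer hold a live vector. The sketch below is illustrative: the function names and the 20% threshold are assumptions, not values recommended by any vendor.

```python
def deleted_fraction(total_slots: int, live_vectors: int) -> float:
    """Share of index slots that no longer hold a live vector."""
    if total_slots <= 0:
        return 0.0
    return (total_slots - live_vectors) / total_slots


def should_compact(total_slots: int, live_vectors: int, threshold: float = 0.2) -> bool:
    """Trigger compaction once the deleted share reaches `threshold` (assumed 20%)."""
    return deleted_fraction(total_slots, live_vectors) >= threshold


# 10M-slot index after 100K deletions, then after 2.5M deletions.
light = should_compact(10_000_000, 9_900_000)
heavy = should_compact(10_000_000, 7_500_000)
print(light, heavy)
```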

### Verification of Deletion

**Audit Trail:**

```
Prove deletion occurred:
1. Before: Count vectors with user_id='X' → 50,000
2. Execute deletion
3. After: Count vectors with user_id='X' → 0

Log:
{
  "deletion_request": "2024-01-15",
  "user_id": "X",
  "vectors_deleted": 50000,
  "verified_at": "2024-01-15 14:30:00"
}
```
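The count, delete, count sequence above can be wrapped in one function that returns the log entry. This is a sketch against a toy dictionary store; `erase_with_audit` and the extra fields `vectors_before` and `vectors_remaining` are hypothetical additions, not part of any product's log format.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def erase_with_audit(store: Dict[str, dict], user_id: str,
                     now: Optional[datetime] = None) -> dict:
    """Delete all vectors of `user_id` and return an audit record."""
    def count() -> int:
        return sum(1 for meta in store.values() if meta.get("user_id") == user_id)

    timestamp = now or datetime.now(timezone.utc)
    before = count()
    # Collect the IDs first so the dict is not mutated while iterating.
    for vec_id in [vid for vid, meta in store.items() if meta.get("user_id") == user_id]:
        del store[vec_id]
    after = count()
    return {
        "deletion_request": timestamp.date().isoformat(),
        "user_id": user_id,
        "vectors_before": before,
        "vectors_deleted": before - after,
        "vectors_remaining": after,
        "verified_at": timestamp.isoformat(),
    }


store = {"v1": {"user_id": "X"}, "v2": {"user_id": "X"}, "v3": {"user_id": "Y"}}
entry = erase_with_audit(store, "X",
                         now=datetime(2024, 1, 15, 14, 30, tzinfo=timezone.utc))
print(json.dumps(entry, indent=2))
```

Recording `vectors_remaining` alongside `vectors_deleted` makes a partial deletion visible in the log instead of silently passing.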

**Residual Data Check:**

```
Semantic search for deleted data:
→ Query: "John Smith's address"
→ Should return: No results containing the deleted user's data
→ If such results are found: Deletion incomplete

Test queries post-deletion to verify erasure
```
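Because a nearest-neighbor search usually returns the top matches regardless of relevance, a residual check needs a similarity cutoff. The sketch below assumes that `probe` is the embedding of content known to have belonged to the deleted user (computed elsewhere), and that a cosine similarity of 0.9 counts as a hit; both the helper names and the threshold are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def residual_hits(index: Dict[str, List[float]], probe: Sequence[float],
                  threshold: float = 0.9) -> List[str]:
    """IDs of remaining vectors that are suspiciously close to the probe."""
    return [vid for vid, vec in index.items() if cosine(vec, probe) >= threshold]


# Toy 2-D index: v1 and v3 hold the user's data, v2 is unrelated.
index = {"v1": [1.0, 0.0], "v2": [0.0, 1.0], "v3": [0.99, 0.05]}
probe = [1.0, 0.0]

hits_before = residual_hits(index, probe)
del index["v1"], index["v3"]
hits_after = residual_hits(index, probe)
print(hits_before, hits_after)
```

An empty list after deletion is evidence, not proof, of erasure: it only covers the probes that were actually tested.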

***

## How to Solve

**Implement metadata-based deletion (DELETE WHERE user\_id=X) for immediate removal, schedule periodic index rebuilds for compaction, use batch deletion for efficiency, maintain deletion audit logs with verification counts, consider a dual-index strategy (rebuild while serving from the old index), and document a deletion SLA (e.g., complete within 72 hours).** See [Vector Erasure](/rag-scenarios-and-solutions/privacy/right-to-erasure.md).
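A documented SLA is easier to enforce when the deadline is computed and checked in code. The sketch below assumes the 72-hour example figure from the summary; `sla_deadline` and `within_sla` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone

# Example SLA from the summary above; adjust to the documented commitment.
DELETION_SLA = timedelta(hours=72)


def sla_deadline(requested_at: datetime) -> datetime:
    """Latest time by which the erasure must be complete and verified."""
    return requested_at + DELETION_SLA


def within_sla(requested_at: datetime, completed_at: datetime) -> bool:
    """True if the verified completion time is on or before the deadline."""
    return completed_at <= sla_deadline(requested_at)


requested = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
deadline = sla_deadline(requested)
on_time = within_sla(requested, datetime(2024, 1, 17, 9, 0, tzinfo=timezone.utc))
late = within_sla(requested, datetime(2024, 1, 19, 9, 0, tzinfo=timezone.utc))
print(deadline.isoformat(), on_time, late)
```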

