# GDPR Right to Be Forgotten in Vector DBs

## The Problem

When a user requests erasure under GDPR Article 17, removing their data from a vector database is technically complex and often incomplete, because embeddings carry no built-in link back to the person they describe.

### Symptoms

* ❌ Cannot locate user's vectors
* ❌ Text deleted but embeddings remain
* ❌ No mapping: user data → vectors
* ❌ Partial deletion leaves traces
* ❌ Cannot prove complete erasure

### Real-World Example

```
User requests deletion:
"Delete all my data per GDPR Article 17"

Company deletes:
→ Source documents from document DB ✓
→ User account from auth DB ✓

But vector DB still contains:
→ Embeddings of user's emails
→ Chunks mentioning user's name
→ Context where user participated

How to find and delete these vectors?
→ No direct identifier linking vectors to user
→ Cannot execute complete erasure
→ Potential GDPR violation
```

***

## Deep Technical Analysis

### Vector-to-Source Mapping Problem

**Embedding Anonymity:**

```
Text: "Email from john.smith@example.com regarding..."
→ Embedded as: [0.234, -0.567, 0.891, ...]
→ Vector has no inherent link to "john.smith"

Deletion request:
→ Search vectors for john.smith?
→ Semantic search might miss variations
→ Cannot guarantee finding all vectors
```

**Metadata Dependency:**

```
Solution: Store metadata with vectors:
{
  vector: [0.234, ...],
  metadata: {
    user_id: "12345",
    document_id: "doc789",
    source: "email"
  }
}

Enables:
→ Query: "Find all vectors where user_id=12345"
→ Delete matching vectors
→ But: Metadata must be comprehensive
```
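The metadata approach can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `VectorRecord`, `toy_embed`, `ingest`, and `find_by_user` are hypothetical names, and `toy_embed` only stands in for a real embedding model.

```python
from dataclasses import dataclass, field


@dataclass
class VectorRecord:
    vector_id: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model (hypothetical, for illustration only)."""
    return [len(text) / 100.0, text.count(" ") / 10.0]


def ingest(chunks: list[str], user_id: str, document_id: str, source: str) -> list[VectorRecord]:
    """Embed each chunk and attach the identifiers needed for later erasure."""
    return [
        VectorRecord(
            vector_id=f"{document_id}#chunk{i}",
            vector=toy_embed(chunk),
            metadata={"user_id": user_id, "document_id": document_id, "source": source},
        )
        for i, chunk in enumerate(chunks)
    ]


def find_by_user(store: list[VectorRecord], user_id: str) -> list[VectorRecord]:
    """Return every record tagged with the given user_id."""
    return [r for r in store if r.metadata.get("user_id") == user_id]


store = ingest(["Email from john regarding the invoice", "Follow-up note"], "12345", "doc789", "email")
store += ingest(["Unrelated memo"], "67890", "doc790", "wiki")
matches = find_by_user(store, "12345")
```

Because the tags are written at ingestion, the later lookup does not depend on the embedding values at all. Any chunk ingested without a `user_id` would be invisible to this query, which is why the tagging has to be comprehensive.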

**Secondary References:**

```
User's data appears in others' content:
→ Email TO john.smith (from someone else)
→ Comments mentioning john.smith
→ Collaborative docs with john.smith's edits

Delete these too?
→ GDPR: Generally yes, if the content identifies the user (Article 17(3) exceptions and other people's rights may apply)
→ But: Hard to detect all references
```
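One partial way to surface secondary references is a lexical scan over the stored chunk text. The sketch below assumes the chunk text is kept alongside each vector, and the helper names (`build_patterns`, `find_residual_chunks`) and the chosen name variants are illustrative assumptions. It only catches literal matches, so paraphrases and nicknames would still slip through.

```python
import re


def build_patterns(email: str, full_name: str) -> list[re.Pattern]:
    """Compile case-insensitive patterns for an email address and simple name variants."""
    first, last = full_name.split(" ", 1)
    variants = [email, full_name, f"{first[0]}. {last}", f"{last}, {first}"]
    return [re.compile(re.escape(v), re.IGNORECASE) for v in variants]


def find_residual_chunks(chunks: dict[str, str], patterns: list[re.Pattern]) -> list[str]:
    """Return IDs of chunks whose stored text matches any identifier pattern."""
    return sorted(cid for cid, text in chunks.items() if any(p.search(text) for p in patterns))


chunks = {
    "c1": "Forwarding this to john.smith@example.com for review.",
    "c2": "Quarterly numbers look fine.",
    "c3": "Per J. Smith, the deadline moved.",
}
flagged = find_residual_chunks(chunks, build_patterns("john.smith@example.com", "John Smith"))
```

Flagged chunks are candidates for review rather than automatic deletion, since a match in someone else's content may need redaction instead of removal.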

### Deletion Strategies

**Metadata Filtering:**

```
1. Tag all chunks with user identifiers
2. On deletion request:
   DELETE FROM vector_db
   WHERE metadata->>'user_id' = '12345'
3. Verify: Count remaining matches

Requires:
→ Comprehensive tagging at ingestion
→ Query capability by metadata
→ Not all vector DBs support this efficiently
```
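The delete-then-verify steps can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under simplifying assumptions: the `vectors` table, its columns, and the `erase_user` helper are hypothetical, and `user_id` is a plain column rather than a JSON metadata field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vectors (id TEXT PRIMARY KEY, user_id TEXT, embedding BLOB)")
conn.executemany(
    "INSERT INTO vectors VALUES (?, ?, ?)",
    [("v1", "12345", b"\x00"), ("v2", "12345", b"\x01"), ("v3", "67890", b"\x02")],
)


def erase_user(conn: sqlite3.Connection, user_id: str) -> tuple[int, int]:
    """Delete all rows for user_id; return (rows deleted, rows still matching)."""
    with conn:  # commits on success, rolls back on error
        deleted = conn.execute("DELETE FROM vectors WHERE user_id = ?", (user_id,)).rowcount
    # Verification step: count whatever still matches after the delete.
    remaining = conn.execute(
        "SELECT COUNT(*) FROM vectors WHERE user_id = ?", (user_id,)
    ).fetchone()[0]
    return deleted, remaining


deleted, remaining = erase_user(conn, "12345")
```

A non-zero `remaining` would signal that the erasure is incomplete and should not be reported as finished.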

**Re-Embedding After Deletion:**

```
1. Delete source documents with user data
2. Trigger re-ingestion of entire knowledge base
3. Rebuild vector index from scratch

Pros:
+ Guaranteed complete removal
+ No orphaned vectors

Cons:
- Expensive (re-embed everything)
- Downtime during rebuild
- Not scalable for frequent deletions
```
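A rebuild can be illustrated with a toy in-memory index. In this sketch `toy_embed` and `rebuild_index` are hypothetical helpers, and building the replacement index alongside the live one before swapping is an added assumption aimed at reducing the downtime listed above, not something the strategy itself requires.

```python
def toy_embed(text: str) -> list[float]:
    """Stand-in for a real embedding model (hypothetical)."""
    return [float(len(text)), float(text.count(" "))]


def rebuild_index(documents: dict[str, dict], erased_user: str) -> dict[str, list[float]]:
    """Build a fresh index from documents that do not belong to the erased user."""
    return {
        doc_id: toy_embed(doc["text"])
        for doc_id, doc in documents.items()
        if doc["user_id"] != erased_user
    }


documents = {
    "doc1": {"user_id": "12345", "text": "Email from John"},
    "doc2": {"user_id": "67890", "text": "Team meeting notes"},
}
live_index = {doc_id: toy_embed(doc["text"]) for doc_id, doc in documents.items()}

# Build the replacement off to the side, then swap the reference in one step
# so queries keep hitting the old index until the new one is ready.
new_index = rebuild_index(documents, erased_user="12345")
live_index = new_index
```

The cost concern remains: every surviving document is embedded again, so this only makes sense when deletions are rare or can be grouped.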

**Soft Deletion:**

```
Don't actually delete vectors:
→ Mark as deleted in metadata
→ Filter out at query time

Pros:
+ Reversible (backup/recovery)
+ Fast

Cons:
- Data still exists, so on its own this likely does not satisfy erasure
- Requires filtering layer
- Storage still used
```
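Soft deletion with a query-time filter might look like the sketch below. The record layout and the function names are hypothetical, and the `purge` step with a grace period is an added assumption showing one way to keep tombstoned data from lingering indefinitely. Whether any grace period is acceptable is a legal question this code does not answer.

```python
from datetime import datetime, timedelta, timezone

store = {
    "v1": {"user_id": "12345", "deleted_at": None},
    "v2": {"user_id": "67890", "deleted_at": None},
}


def soft_delete(store: dict, user_id: str, now: datetime) -> None:
    """Mark a user's vectors as deleted without removing them."""
    for record in store.values():
        if record["user_id"] == user_id and record["deleted_at"] is None:
            record["deleted_at"] = now


def visible_ids(store: dict) -> list[str]:
    """IDs that queries are allowed to return (tombstoned records filtered out)."""
    return [vid for vid, r in store.items() if r["deleted_at"] is None]


def purge(store: dict, now: datetime, grace: timedelta) -> int:
    """Hard-delete records whose tombstone is at least as old as the grace period."""
    expired = [vid for vid, r in store.items() if r["deleted_at"] and now - r["deleted_at"] >= grace]
    for vid in expired:
        del store[vid]
    return len(expired)


t0 = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
soft_delete(store, "12345", t0)
visible_after_soft_delete = visible_ids(store)
purged = purge(store, t0 + timedelta(days=7), grace=timedelta(days=7))
```

Until `purge` runs, the vector for user `12345` is hidden from queries but still physically present, which is exactly the weakness noted in the cons.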

### Vector DB Capabilities

**Deletion Support by Platform:**

```
Pinecone:
→ Delete by ID: Yes
→ Delete by metadata filter: Yes on pod-based indexes; support on serverless indexes has varied, so check the current docs
→ Batch deletion: Yes

Weaviate:
→ Delete by filter: Yes
→ Cascade deletion: Not guaranteed; confirm how cross-referenced objects are handled

Chroma:
→ Delete by ID: Yes
→ Filter-based: Yes (`where` metadata filter on `collection.delete`)

PostgreSQL + pgvector:
→ Standard SQL DELETE
→ Full filtering support
```

**Performance Concerns:**

```
Large-scale deletion:
→ Delete 100,000 vectors for one user
→ May lock index
→ Impact query performance
→ Require maintenance window
```
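A common way to soften the impact of a large deletion is to split it into smaller batches. The sketch below uses a plain dictionary as a stand-in for the vector index; `batched`, `delete_in_batches`, and the batch size of 1000 are illustrative assumptions, and a real system would issue one bulk-delete call per batch instead of the inner loop.

```python
from collections.abc import Iterator


def batched(ids: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def delete_in_batches(index: dict[str, list[float]], ids: list[str], batch_size: int = 1000) -> int:
    """Delete IDs in small batches; returns the number of batches issued."""
    batches = 0
    for batch in batched(ids, batch_size):
        for vector_id in batch:
            index.pop(vector_id, None)  # stand-in for one bulk-delete call to the vector DB
        batches += 1
        # A real job might sleep or check load here before the next batch.
    return batches


index = {f"v{i}": [0.0] for i in range(2500)}
user_ids = [f"v{i}" for i in range(2000)]
batches = delete_in_batches(index, user_ids, batch_size=1000)
```

Smaller batches trade total run time for shorter individual operations, which may avoid the need for a maintenance window depending on the platform.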

### Audit Trail

**Proving Deletion:**

```
GDPR's accountability principle (Article 5(2)) expects demonstrable compliance:
→ Log deletion timestamp
→ Count vectors before/after
→ Store deletion certificate

Example log:
{
  user_id: "12345",
  deletion_requested: "2024-01-15T10:00:00Z",
  vectors_deleted: 15234,
  documents_deleted: 89,
  completed: "2024-01-15T10:15:23Z",
  verified_by: "admin@company.com"
}
```
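An audit entry like the one above can be assembled from before and after counts. In this sketch the `build_deletion_record` helper and its field names are hypothetical. Storing a SHA-256 digest instead of the raw user ID is an added assumption so the log does not retain the identifier in clear text; a hash of a short ID is pseudonymous rather than anonymous, so the log still deserves access controls.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_deletion_record(user_id: str, count_before: int, count_after: int,
                          requested: datetime, completed: datetime, verified_by: str) -> dict:
    """Assemble an audit entry; the user ID is stored as a SHA-256 digest."""
    return {
        "user_ref": hashlib.sha256(user_id.encode()).hexdigest(),
        "deletion_requested": requested.isoformat(),
        "completed": completed.isoformat(),
        "vectors_before": count_before,
        "vectors_after": count_after,
        "vectors_deleted": count_before - count_after,
        "verified": count_after == 0,  # True only if nothing matched after deletion
        "verified_by": verified_by,
    }


record = build_deletion_record(
    "12345", 15234, 0,
    datetime(2024, 1, 15, 10, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 15, 10, 15, 23, tzinfo=timezone.utc),
    "admin@company.com",
)
print(json.dumps(record, indent=2))
```

Deriving `vectors_deleted` from the two counts, rather than trusting the delete call's own return value, ties the log entry to the verification query.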

***

## How to Solve

**Combine the following steps:**

* Tag all vectors with user and document IDs at ingestion.
* Implement metadata-based deletion (`DELETE WHERE user_id = X`).
* Run a semantic search for residual references.
* Maintain an audit log of deletions.
* Consider re-indexing when erasure must be guaranteed.
* Verify deletion with count queries.

See [GDPR Compliance](/rag-scenarios-and-solutions/privacy/gdpr-compliance.md).

