# PII Leaking in Retrieved Context

## The Problem

Personally Identifiable Information (PII) from ingested documents appears in retrieved chunks and gets included in AI responses, creating privacy violations.

### Symptoms

* ❌ Names, emails, phone numbers in responses
* ❌ SSNs or IDs visible in context
* ❌ Private customer data exposed
* ❌ GDPR/CCPA violations
* ❌ Sensitive info in citations

### Real-World Example

```
Knowledge base ingests:
→ Customer support tickets
→ Internal emails
→ CRM exports

Query: "How to handle refunds?"

Retrieved chunk includes:
"Customer John Smith (john.smith@email.com, SSN: 123-45-6789)
requested refund for order #5678..."

AI response inadvertently exposes this PII to a different user
```

***

## Deep Technical Analysis

### Ingestion-Time PII Exposure

**Document Scraping:**

```
Source systems contain mixed content:
→ Policy docs (safe)
→ Example tickets with real names (unsafe)
→ Email threads with signatures (unsafe)

Bulk ingestion:
→ No PII filtering
→ Everything embedded into vector DB
→ PII becomes retrievable
```
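A minimal sketch of a redaction pass that could run before embedding, assuming regex detection only. The pattern set and the `redact_pii` helper are illustrative assumptions, not part of any specific product; names such as "John Smith" are not caught here and would typically need an NER model.

```python
import re

# Illustrative patterns only; real pipelines usually pair these with an NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact_pii(text):
    """Replace each match with a typed placeholder and count hits per type."""
    counts = {}
    for label, pattern in PII_PATTERNS.items():
        text, hits = pattern.subn(f"[{label}]", text)
        if hits:
            counts[label] = hits
    return text, counts

chunk = ("Customer John Smith (john.smith@email.com, SSN: 123-45-6789) "
         "requested refund for order #5678")
clean, found = redact_pii(chunk)
print(clean)  # Customer John Smith ([EMAIL], SSN: [SSN]) requested refund for order #5678
print(found)  # {'EMAIL': 1, 'SSN': 1}
```

The hit counts can also drive an exclude-instead-of-redact policy, for example skipping any chunk where `found` is non-empty.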

**PII in Metadata:**

```
Document metadata carries PII:
→ author: "john.smith@company.com"
→ last_modified_by: "Jane Doe"
→ file_path: "/users/john.smith/documents/"

Metadata included in chunks:
→ PII propagates to responses
```
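One way to keep metadata from carrying PII into chunks is an allowlist applied at ingestion. The sketch below assumes a hypothetical set of safe keys (`doc_type`, `title`, `created_at`); the right set depends on your own schema.

```python
# Hypothetical allowlist: only these metadata keys travel with a chunk.
SAFE_METADATA_KEYS = {"doc_type", "title", "created_at"}

def scrub_metadata(metadata):
    """Keep allowlisted keys only; author, editor, and path fields are dropped."""
    return {key: value for key, value in metadata.items() if key in SAFE_METADATA_KEYS}

raw = {
    "author": "john.smith@company.com",
    "last_modified_by": "Jane Doe",
    "file_path": "/users/john.smith/documents/",
    "doc_type": "policy",
}
print(scrub_metadata(raw))  # {'doc_type': 'policy'}
```

An allowlist fails closed: a new metadata field added upstream stays out of chunks until someone explicitly approves it.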

### Retrieval-Time Leakage

**Cross-User Data Bleed:**

```
User A's query retrieves:
→ Chunks containing User B's data
→ No filtering by data ownership
→ PII from one user visible to another

Multi-tenant knowledge base:
→ Insufficient isolation
→ Privacy violation
```
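Ownership filtering can be sketched as a check on each retrieved chunk. This assumes every chunk was stamped with a `tenant_id` at ingestion; the `Chunk` class and `filter_by_tenant` helper are hypothetical names for illustration.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    tenant_id: str  # assumed field, stamped on each chunk at ingestion

def filter_by_tenant(chunks, requesting_tenant):
    """Drop any retrieved chunk that the requesting tenant does not own."""
    return [c for c in chunks if c.tenant_id == requesting_tenant]

retrieved = [
    Chunk("Refund policy: 30 days from delivery.", tenant_id="tenant-a"),
    Chunk("Order #5678 refund approved.", tenant_id="tenant-b"),
]
visible = filter_by_tenant(retrieved, "tenant-a")
print([c.text for c in visible])  # ['Refund policy: 30 days from delivery.']
```

Where the vector store supports metadata filters, applying the same condition inside the query is usually preferable, since post-filtering can leave fewer results than requested.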

**Citation Exposure:**

```
AI cites sources:
"According to ticket #12345 from john.smith@email.com..."

Citation itself contains PII:
→ Even if response text safe
→ Source attribution leaks data
```
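Citations can be built from non-identifying fields rather than from raw source records. The field names below (`doc_type`, `ref_id`, `submitted_by`) are assumptions for illustration.

```python
def safe_citation(source):
    """Build a citation from non-identifying fields only; other fields are ignored."""
    doc_type = source.get("doc_type", "document")
    ref_id = source.get("ref_id", "unknown")
    return f"{doc_type} #{ref_id}"

source = {
    "doc_type": "ticket",
    "ref_id": "12345",
    "submitted_by": "john.smith@email.com",  # never copied into the citation
}
print(safe_citation(source))  # ticket #12345
```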

### Vector DB PII Persistence

**Embeddings Encode PII:**

```
Text: "John Smith's email is john@example.com"
→ Embedded as vector [0.23, -0.45, 0.67, ...]
→ PII encoded in embedding space

Semantic search can retrieve:
→ "Find contact for John Smith"
→ Returns chunks with PII
→ Stored chunk text returns the PII verbatim; the vectors themselves may also be partially invertible
```

**Deletion Challenges:**

```
GDPR right to erasure:
→ Must delete John Smith's data
→ But that person's chunks are spread across the vector DB
→ No mapping kept at ingestion: source record → vector IDs
→ Without that mapping, cannot locate and purge every vector
```
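Erasure becomes tractable if a mapping from data subject to vector IDs is recorded at ingestion. The sketch below uses an in-memory stand-in, `ErasableIndex`, which is a hypothetical class and not a real vector database API; a production system would issue the equivalent delete-by-ID calls against its own store.

```python
from collections import defaultdict

class ErasableIndex:
    """In-memory stand-in for a vector store with a subject -> vector ID map."""

    def __init__(self):
        self.vectors = {}                   # vector_id -> embedding
        self.by_subject = defaultdict(set)  # subject_id -> {vector_id, ...}

    def add(self, vector_id, embedding, subject_ids):
        self.vectors[vector_id] = embedding
        for subject_id in subject_ids:
            self.by_subject[subject_id].add(vector_id)

    def erase_subject(self, subject_id):
        """Delete every vector that mentions the subject; return how many were removed."""
        doomed = self.by_subject.pop(subject_id, set())
        for vector_id in doomed:
            self.vectors.pop(vector_id, None)
        # Remove dangling references held by other subjects.
        for ids in self.by_subject.values():
            ids.difference_update(doomed)
        return len(doomed)

index = ErasableIndex()
index.add("v1", [0.23, -0.45], ["cust-1"])
index.add("v2", [0.10, 0.67], ["cust-1", "cust-2"])
index.add("v3", [0.05, 0.31], ["cust-2"])
print(index.erase_subject("cust-1"))  # 2
print(sorted(index.vectors))          # ['v3']
```

Note that a chunk mentioning two people is deleted whole here; re-ingesting a redacted version of that chunk is a separate design decision.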

### PII Detection Complexity

**False Negatives:**

```
Regex patterns miss:
→ Nicknames: "Johnny" (instead of John Smith)
→ Obfuscated: "j.smith [at] example [dot] com"
→ International formats: "+44 20 1234 5678"
→ Contextual PII: "the CEO" (reveals identity in context)
```
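Obfuscated addresses are one class of false negative that a normalization pass can partly recover. The sketch below assumes the bracketed `[at]` / `(dot)` convention and a hypothetical `deobfuscate` helper; nicknames and contextual PII are out of reach for this approach and generally need NER or human review.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def deobfuscate(text):
    """Rewrite bracketed 'at' / 'dot' tokens into '@' and '.' before scanning."""
    text = re.sub(r"\s*[\[(]\s*at\s*[\])]\s*", "@", text, flags=re.IGNORECASE)
    text = re.sub(r"\s*[\[(]\s*dot\s*[\])]\s*", ".", text, flags=re.IGNORECASE)
    return text

sample = "Reach me at j.smith [at] example [dot] com"
print(EMAIL.findall(sample))               # []
print(EMAIL.findall(deobfuscate(sample)))  # ['j.smith@example.com']
```

The normalized text is best used for detection only; redaction should still be applied to the original span so the stored chunk is not silently rewritten.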

**False Positives:**

```
Over-aggressive filtering:
→ "John Doe" (example name, not real)
→ "555-1234" (example phone)
→ "admin@example.com" (generic)

Breaks legitimate content
```
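False positives on placeholder values can be reduced with a small allowlist checked before redacting. This sketch assumes emails only and allowlists the domains reserved for documentation use (`example.com`, `example.org`, `example.net`); the `redact_emails` helper is a hypothetical name.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Domains reserved for documentation and examples; extend with your own placeholders.
RESERVED_DOMAINS = {"example.com", "example.org", "example.net"}

def redact_emails(text):
    """Redact emails unless the domain is a reserved example domain."""
    def replace(match):
        address = match.group(0)
        domain = address.rsplit("@", 1)[1].lower()
        return address if domain in RESERVED_DOMAINS else "[EMAIL]"
    return EMAIL.sub(replace, text)

print(redact_emails("Write to admin@example.com or jane.roe@acme-corp.io"))
# Write to admin@example.com or [EMAIL]
```

The trade-off is that real addresses hosted on an allowlisted domain would slip through, so the allowlist should stay short and reviewed.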

***

## How to Solve

**Combine the following controls:**

* Implement PII detection at ingestion (regex + NER models)
* Redact or exclude PII-containing chunks
* Apply user-level access control on retrieved chunks
* Use synthetic data for examples
* Audit retrieved context pre-generation
* Implement vector-level deletion for GDPR

See [PII Protection](/rag-scenarios-and-solutions/privacy/pii-detection.md).
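Auditing retrieved context before generation can be sketched as a last-line gate that holds back any chunk still matching a PII pattern. The patterns and the `audit_context` helper are illustrative assumptions; this gate complements ingestion-time redaction rather than replacing it.

```python
import re

# Residual-PII patterns checked just before prompt assembly (illustrative only).
RESIDUAL_PII = [
    re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),  # email
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                           # US SSN format
]

def audit_context(chunks):
    """Split chunks into (safe, blocked); blocked chunks never reach the prompt."""
    safe, blocked = [], []
    for chunk in chunks:
        has_pii = any(pattern.search(chunk) for pattern in RESIDUAL_PII)
        (blocked if has_pii else safe).append(chunk)
    return safe, blocked

safe, blocked = audit_context([
    "Refunds are issued within 30 days of delivery.",
    "Customer (john.smith@email.com, SSN: 123-45-6789) requested refund",
])
print(len(safe), len(blocked))  # 1 1
```

Logging the blocked chunks (by ID, not by content) gives a signal for tightening ingestion-time detection.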


