PII Leaking in Retrieved Context

The Problem

Personally Identifiable Information (PII) from ingested documents appears in retrieved chunks and gets included in AI responses, creating privacy violations.

Symptoms

  • ❌ Names, emails, phone numbers in responses

  • ❌ SSNs or IDs visible in context

  • ❌ Private customer data exposed

  • ❌ GDPR/CCPA violations

  • ❌ Sensitive info in citations

Real-World Example

Knowledge base ingests:
→ Customer support tickets
→ Internal emails
→ CRM exports

Query: "How to handle refunds?"

Retrieved chunk includes:
"Customer John Smith ([email protected], SSN: 123-45-6789)
requested refund for order #5678..."

The AI response inadvertently exposes this PII to a different user.
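A leak like this one can be caught by scanning retrieved chunks before they are placed in the prompt. The following is a minimal sketch, not a complete detector: the names `PII_PATTERNS`, `find_pii`, and `audit_chunks` are hypothetical, and the regexes are assumptions that cover only emails, US-style SSNs, and simple phone formats. Names such as "John Smith" are not caught by regex at all, which is why an NER model would be needed alongside it.

```python
import re

# Illustrative patterns only (assumption); real deployments need broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_pii(text):
    """Return a sorted list of PII category names detected in text."""
    return sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text))


def audit_chunks(chunks):
    """Split retrieved chunks into (safe, flagged) before prompt assembly."""
    safe, flagged = [], []
    for chunk in chunks:
        hits = find_pii(chunk)
        if hits:
            # Keep the categories so the caller can log or redact them.
            flagged.append((chunk, hits))
        else:
            safe.append(chunk)
    return safe, flagged
```

Only the `safe` list would be passed on to generation; what to do with `flagged` chunks (drop, redact, or route to review) is a policy decision this sketch leaves open.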

Deep Technical Analysis

Ingestion-Time PII Exposure

Document Scraping:

PII in Metadata:

Retrieval-Time Leakage

Cross-User Data Bleed:

Citation Exposure:

Vector DB PII Persistence

Embeddings Encode PII:

Deletion Challenges:

PII Detection Complexity

False Negatives:

False Positives:


How to Solve

Combine the following measures:

  • Implement PII detection at ingestion (regex + NER models)

  • Redact or exclude PII-containing chunks

  • Apply user-level access control on retrieved chunks

  • Use synthetic data for examples

  • Audit retrieved context pre-generation

  • Implement vector-level deletion for GDPR

See PII Protection.
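Two of these measures, redaction at ingestion and user-level access control at retrieval, can be sketched together. This is an illustration under stated assumptions rather than a reference implementation: `Chunk`, `ingest`, `redact`, and `filter_for_user` are hypothetical names, the regexes cover only emails and US-style SSNs, and treating an empty `allowed_users` set as "visible to everyone" is an assumed policy. An NER pass for names and addresses would sit alongside `redact` and is not shown.

```python
import re
from dataclasses import dataclass

# Illustrative patterns only (assumption); extend or replace with an NER model.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact(text):
    """Replace matched PII with typed placeholders before embedding."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


@dataclass
class Chunk:
    text: str
    # Assumed policy: an empty set means the chunk is visible to every user.
    allowed_users: frozenset = frozenset()


def ingest(raw_text, allowed_users=()):
    """Redact first, so raw PII is never embedded or stored."""
    return Chunk(text=redact(raw_text), allowed_users=frozenset(allowed_users))


def filter_for_user(chunks, user_id):
    """Drop retrieved chunks the requesting user is not allowed to see."""
    return [c for c in chunks if not c.allowed_users or user_id in c.allowed_users]
```

Redacting before embedding matters because the stored vector and text are both derived from the cleaned string; filtering at retrieval then limits which of the remaining chunks a given user can ever see.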
