Embedding Service Privacy

The Problem

Third-party embedding APIs (OpenAI, Cohere) must receive sensitive text in plaintext to generate embeddings, exposing that text to the provider during the embedding generation phase.

Symptoms

  • ❌ Sensitive text sent to external API

  • ❌ No control over embedding provider's data handling

  • ❌ Cannot verify data deletion by provider

  • ❌ Compliance risks with third-party processors

  • ❌ Unclear data retention policies

Real-World Example

Company ingests confidential documents:
→ "Q4 Revenue: $500M (confidential)"
→ Sends to OpenAI Embeddings API
→ OpenAI processes text, returns vector

Privacy concerns:
→ OpenAI sees: "Q4 Revenue: $500M (confidential)"
→ Does OpenAI log this? (Enterprise: No, but trust required)
→ Does it train on it? (Enterprise: No per ToS)
→ Can we verify? (No direct audit capability)
→ If breached at OpenAI? (Data exposed)

Deep Technical Analysis

Third-Party Data Processing

API Request Flow:
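As a vendor-neutral sketch of the request flow, the payload below is shaped like a request to the OpenAI embeddings endpoint (`POST https://api.openai.com/v1/embeddings`); it is only constructed, never sent. The point it illustrates: the raw document text travels in the request body (in cleartext inside TLS), so the provider sees it in full.

```python
import json

# Hypothetical confidential document text from the example above.
text = "Q4 Revenue: $500M (confidential)"

# Payload as it would be serialized for the embeddings endpoint.
# Constructed only, not sent, to show exactly what leaves the network.
payload = {
    "model": "text-embedding-3-small",
    "input": text,
}

body = json.dumps(payload)

# The privacy exposure is precisely this field: the provider receives
# the full plaintext regardless of how the response is handled.
print("confidential" in body)  # → True
```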

Data Retention Policies:

Compliance Implications

GDPR Data Processors:

BAA for HIPAA:

Industry-Specific:

Self-Hosted Alternatives

Open Source Embedding Models:

Quality Trade-offs:

Data Minimization

Pre-Processing:


How to Solve

For sensitive data:

  • Self-host embedding models (sentence-transformers) to avoid third-party exposure

  • If using APIs: execute a DPA/BAA with the provider

  • Verify zero-retention policies

  • Implement PII redaction before embedding

  • Use enterprise API tiers with contractual protections

  • Monitor vendor security posture

See Embedding Privacy.
