# Embedding Service Privacy

## The Problem

Third-party embedding APIs (such as OpenAI or Cohere) must receive your text in plaintext to generate embeddings, so any sensitive content is exposed to the provider at embedding time.

### Symptoms

* ❌ Sensitive text sent to external API
* ❌ No control over embedding provider's data handling
* ❌ Cannot verify data deletion by provider
* ❌ Compliance risks with third-party processors
* ❌ Unclear data retention policies

### Real-World Example

```
Company ingests confidential documents:
→ "Q4 Revenue: $500M (confidential)"
→ Sends to OpenAI Embeddings API
→ OpenAI processes text, returns vector

Privacy concerns:
→ OpenAI sees: "Q4 Revenue: $500M (confidential)"
→ Does OpenAI log this? (Enterprise: no, but you must trust the vendor)
→ Does it train on it? (Enterprise: no, per ToS)
→ Can we verify? (No direct audit capability)
→ What if OpenAI is breached? (Data may be exposed)
```

***

## Deep Technical Analysis

### Third-Party Data Processing

**API Request Flow:**

```
Your system:
→ Raw text: "Confidential merger with AcmeCorp..."

HTTPS POST to api.openai.com/v1/embeddings:
{
  "input": "Confidential merger with AcmeCorp...",
  "model": "text-embedding-ada-002"
}

OpenAI servers:
→ Receive plaintext
→ Process through embedding model
→ Return vector

Risk: OpenAI sees raw sensitive text
```
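To make the exposure concrete, the sketch below builds (but does not send) the request shown above using only Python's standard library. The helper name `build_embedding_request` and the placeholder API key are assumptions for illustration.

```python
import json
import urllib.request


def build_embedding_request(text: str, api_key: str) -> urllib.request.Request:
    """Build, without sending, an embeddings request like the one in the flow above."""
    body = json.dumps({"input": text, "model": "text-embedding-ada-002"}).encode("utf-8")
    return urllib.request.Request(
        "https://api.openai.com/v1/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",  # placeholder key, never sent here
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_embedding_request("Confidential merger with AcmeCorp...", "sk-PLACEHOLDER")

# TLS protects this body in transit, but the provider decrypts it on arrival,
# so the raw text is readable on their side.
payload = json.loads(req.data)
print(payload["input"])
```

HTTPS only protects the text between your system and the provider. It does nothing to hide the text from the provider itself.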

**Data Retention Policies:**

```
OpenAI (Enterprise):
→ Zero data retention (per the vendor's stated policy)
→ Not used for training
→ Inputs deleted after processing

Cohere:
→ Similar policies (confirm against the current terms)

But:
→ Must trust vendor
→ Cannot independently verify
→ Compliance auditors may not accept vendor attestations alone
```

### Compliance Implications

**GDPR Data Processors:**

```
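Embedding API = Data Processor:
→ Requires a Data Processing Agreement (DPA) under GDPR Art. 28
→ Processor may act only on your documented instructions
→ Processor must implement appropriate security measures (Art. 32)

Check vendor DPA:
→ OpenAI: Provides DPA
→ Cohere: Provides DPA
→ Verify that it covers the embeddings endpoint and your data categories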
```

**BAA for HIPAA:**

```
If embedding PHI:
→ Must have Business Associate Agreement
→ Not all vendors offer BAA
→ OpenAI: Enterprise only
→ Alternatives: Self-host
```

**Industry-Specific:**

```
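Financial (PCI-DSS):
→ Is cardholder data being sent to a third party?
→ May violate PCI requirements unless the provider is in scope and compliant

Defense (ITAR):
→ Controlled technical data must not be exported or disclosed to foreign persons
→ General-purpose cloud embedding APIs typically cannot guarantee this
→ Self-host instead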
```

### Self-Hosted Alternatives

**Open Source Embedding Models:**

```
sentence-transformers:
→ all-MiniLM-L6-v2
→ all-mpnet-base-v2
→ Runs locally, no API call

Deployment:
→ Docker container
→ GPU optional (faster with GPU)
→ No data leaves infrastructure
```
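A minimal sketch of the self-hosted path, assuming the `sentence-transformers` package is installed. The helper names `load_local_model` and `embed_locally` are hypothetical. Note that the first load of a model name typically downloads its weights from the Hugging Face Hub; that traffic is inbound only, but air-gapped deployments should pre-download the weights and pass a local path instead.

```python
from typing import Iterable, List


def load_local_model(name: str = "all-MiniLM-L6-v2"):
    """Load a sentence-transformers model by name or local path."""
    # Imported lazily so this module can be imported without the package installed.
    from sentence_transformers import SentenceTransformer

    return SentenceTransformer(name)


def embed_locally(texts: Iterable[str], model) -> List[List[float]]:
    """Embed texts in-process. `model` is any object with an `encode(list_of_str)` method."""
    vectors = model.encode(list(texts))
    # Convert to plain Python floats so results are easy to serialize.
    return [[float(x) for x in vec] for vec in vectors]


# Usage (requires `pip install sentence-transformers`):
#   model = load_local_model()
#   vectors = embed_locally(["Q4 Revenue: $500M (confidential)"], model)
```

Because encoding happens inside your own process, the document text never appears in an outbound request.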

**Quality Trade-offs:**

```
OpenAI text-embedding-ada-002:
→ 1536 dimensions
→ Very high quality
→ But: Cloud API

sentence-transformers/all-mpnet-base-v2:
→ 768 dimensions
→ Good quality (~90% of OpenAI as a rough estimate; benchmark on your own data)
→ Self-hostable

For sensitive data, the quality trade-off is usually acceptable
```
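The dimension counts above also affect storage. As a rough worked example, assuming vectors are stored as 32-bit floats and ignoring any index overhead (both assumptions, not figures from a specific vector database):

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for the vectors alone, with no index or metadata overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


for name, dims in [("text-embedding-ada-002", 1536), ("all-mpnet-base-v2", 768)]:
    size_gb = raw_vector_bytes(1_000_000, dims) / 1e9
    print(f"{name}: {size_gb:.2f} GB per million vectors")
# text-embedding-ada-002: 6.14 GB per million vectors
# all-mpnet-base-v2: 3.07 GB per million vectors
```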

### Data Minimization

**Pre-Processing:**

```
Before sending to API:
→ Remove explicit PII (names, IDs)
→ Replace with tokens: "[NAME]", "[ID]"
→ Embed redacted text

Trade-off:
→ Reduced privacy risk
→ But: Semantic search on redacted entities stops working
→ "Find John Smith's email" fails because the name was replaced with "[NAME]"
```
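The redaction step above can be sketched with regular expressions. The patterns and the `redact` helper below are illustrative assumptions only: they catch email addresses, US SSN-shaped IDs, and names you already know about. Real deployments usually need broader, locale-aware detection.

```python
import re
from typing import Iterable

# Illustrative patterns only; they are far from exhaustive.
PATTERNS = [
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("[ID]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),  # US SSN-shaped identifiers
]


def redact(text: str, known_names: Iterable[str] = ()) -> str:
    """Replace matched PII with placeholder tokens before the text is embedded."""
    for token, pattern in PATTERNS:
        text = pattern.sub(token, text)
    # Names are matched literally (case-insensitive) from a caller-supplied list.
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


original = "Email John Smith at john.smith@example.com, SSN 123-45-6789"
print(redact(original, ["John Smith"]))
# Email [NAME] at [EMAIL], SSN [ID]
```

The output shows the trade-off directly: the embedded text no longer contains "John Smith", so a query for that name has nothing to match against.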

***

## How to Solve

**For sensitive data, self-host embedding models (sentence-transformers) to avoid third-party exposure. If you use APIs:**

* Execute a DPA/BAA with the provider
* Verify zero-retention policies
* Implement PII redaction before embedding
* Use enterprise API tiers with contractual protections
* Monitor vendor security posture

See [Embedding Privacy](/rag-scenarios-and-solutions/privacy/processor-compliance.md).
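One way to enforce this guidance in an ingestion pipeline is to route each document by sensitivity. The sketch below is a hypothetical policy, not a prescribed one: the `Sensitivity` levels, the `VendorStatus` fields, and the backend labels are all assumptions you would replace with your own classification scheme.

```python
from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    RESTRICTED = 3  # e.g. PHI, ITAR-controlled data, unreleased financials


@dataclass(frozen=True)
class VendorStatus:
    dpa_signed: bool
    zero_retention_confirmed: bool


def choose_backend(level: Sensitivity, vendor: VendorStatus) -> str:
    """Pick an embedding backend label for a document of the given sensitivity."""
    if level is Sensitivity.RESTRICTED:
        # Restricted data never leaves your infrastructure.
        return "self_hosted"
    if level is Sensitivity.INTERNAL:
        # Internal data may use the API only with contractual protections in
        # place, and only after PII redaction.
        if vendor.dpa_signed and vendor.zero_retention_confirmed:
            return "api_with_redaction"
        return "self_hosted"
    return "api"


vendor = VendorStatus(dpa_signed=True, zero_retention_confirmed=False)
print(choose_backend(Sensitivity.INTERNAL, vendor))  # self_hosted
```

Defaulting to the self-hosted path whenever a contractual check is missing keeps the failure mode conservative.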

