# Embedding Quality Metrics

## The Problem

Without a way to measure embedding quality, you cannot tell whether embeddings accurately capture document semantics, and you cannot compare embedding models objectively.

### Symptoms

* ❌ Don't know if embeddings are good
* ❌ Cannot compare embedding models
* ❌ No metrics for domain adaptation quality
* ❌ Embedding drift undetected
* ❌ Cannot validate fine-tuning improvements

### Real-World Example

```
Company considers switching:
→ Current: OpenAI text-embedding-ada-002
→ Candidate: Cohere embed-english-v3.0

Questions:
→ Which performs better for our domain?
→ How much better (quantify)?
→ Worth the migration cost?

No metrics:
→ Cannot answer objectively
→ Guessing based on vibes
```

***

## Deep Technical Analysis

### Intrinsic Metrics

**Embedding Dimensionality:**

```
Model A: 768 dimensions
Model B: 1536 dimensions

Higher dimensions:
→ More expressive (can capture nuance)
→ But: Slower search, more storage
→ Not always better

Need: Quality per dimension
```
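The storage side of this trade-off is simple arithmetic. The sketch below assumes vectors are stored as uncompressed float32 (4 bytes per value) and uses an illustrative corpus of 1 million vectors; both figures are assumptions, and real indexes add overhead on top of the raw vectors.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage


def index_size_gb(num_vectors, dimensions, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw vector storage in gigabytes (10**9 bytes), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9


# Illustrative corpus size (assumption): 1 million vectors
size_a = index_size_gb(1_000_000, 768)    # Model A
size_b = index_size_gb(1_000_000, 1536)   # Model B
print(f"Model A: {size_a:.3f} GB, Model B: {size_b:.3f} GB")
```

Doubling the dimensionality doubles raw storage, so the larger model has to earn that cost with measurably better retrieval metrics.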

**Intra-Cluster vs Inter-Cluster Distance:**

```
Good embeddings:
→ Similar docs close together (high intra-cluster similarity)
→ Dissimilar docs far apart (low inter-cluster similarity)

Measure:
→ Silhouette score (range -1 to 1, higher = better)
→ Davies-Bouldin index (lower = better)

Tight, well-separated clusters = better embedding quality
```
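As a rough illustration of the silhouette score, here is a minimal pure-Python sketch using cosine distance. It assumes you already have embedding vectors and a cluster or topic label for each document; the function names and the toy vectors are illustrative, and a library implementation would be preferable at scale.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient over all points (-1 to 1, higher is better)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(cosine_distance(vectors[i], vectors[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(cosine_distance(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D vectors standing in for real embeddings (assumption)
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["auth", "auth", "billing", "billing"]
print(round(silhouette_score(vectors, labels), 3))
```

A score near 1 means documents sit close to their own topic and far from others; a score near 0 or below suggests the embedding does not separate the topics.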

### Extrinsic Metrics (Task-Based)

**Retrieval Accuracy:**

```
Gold standard test set:
→ Query: "API authentication"
→ Relevant docs: [doc_5, doc_12, doc_23]

Measure:
→ Precision@K: Are top-K results relevant?
→ Recall@K: Did we find all relevant docs?
→ NDCG: Ranking quality

Compare models on same test set
```
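The three metrics above can be sketched in a few lines. This version assumes binary relevance (a doc is either relevant or not) and a ranked list of document IDs per query; the ranked list in the example is made up for illustration.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k slots filled with relevant docs."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Gold standard for the query "API authentication"
relevant = {"doc_5", "doc_12", "doc_23"}
# Hypothetical ranking returned by a retriever
ranked = ["doc_5", "doc_7", "doc_12", "doc_40", "doc_23"]

print(precision_at_k(ranked, relevant, 5))
print(recall_at_k(ranked, relevant, 3))
print(round(ndcg_at_k(ranked, relevant, 5), 4))
```

Averaging each metric over every query in the test set gives one number per model, which is what makes the comparison fair.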

**Example Evaluation:**

```
Model A (OpenAI):
→ Precision@5: 0.85
→ Recall@10: 0.78
→ NDCG@10: 0.82

Model B (Cohere):
→ Precision@5: 0.88 (+3pp)
→ Recall@10: 0.82 (+4pp)
→ NDCG@10: 0.86 (+4pp)

Model B scores higher on all three metrics (on this test set)
```
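A comparison like this can be produced from per-query scores. The sketch below assumes both models were scored on the same queries in the same order; the `compare_models` helper and the sample scores are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def compare_models(scores_a, scores_b):
    """Compare two models metric by metric.

    scores_a / scores_b map a metric name to a list of per-query scores,
    with queries in the same order for both models.
    """
    report = {}
    for metric in scores_a:
        a = mean(scores_a[metric])
        b = mean(scores_b[metric])
        b_wins = sum(1 for x, y in zip(scores_a[metric], scores_b[metric]) if y > x)
        report[metric] = {
            "model_a": a,
            "model_b": b,
            "delta_pp": (b - a) * 100,  # difference in percentage points
            "queries_where_b_wins": b_wins,
        }
    return report


# Made-up per-query NDCG@10 scores for three queries
scores_a = {"ndcg@10": [0.8, 0.9, 0.7]}
scores_b = {"ndcg@10": [0.9, 0.9, 0.8]}
for metric, row in compare_models(scores_a, scores_b).items():
    print(metric, row)
```

Reporting how many queries each model wins alongside the mean helps show whether a gain is broad or driven by a handful of queries.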

### Domain Adaptation Testing

**Out-of-Domain Detection:**

```
Test on domain-specific terms:
→ Your jargon: "Redwood Strategy", "Cedar mode"

Generic model:
→ May not understand these terms
→ Low similarity to related concepts

Domain-adapted model:
→ Should recognize: "Redwood" ~ "RAG approach"
→ Higher semantic accuracy
```

**Semantic Similarity Tests:**

```
Pairs that SHOULD be similar:
→ ("login", "authentication") = High similarity expected
→ ("API key", "access token") = High similarity expected

Pairs that SHOULD be dissimilar:
→ ("login", "pricing") = Low similarity expected

Measure:
→ % of pairs classified correctly at a chosen similarity threshold
→ Higher = better embedding quality
```
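One way to run these pair tests is to threshold cosine similarity. In this sketch the `embed` argument is any function that maps text to a vector; the toy lookup table below stands in for a real model, and the 0.5 threshold is an assumption you would tune for your own embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_accuracy(embed, similar_pairs, dissimilar_pairs, threshold=0.5):
    """Share of pairs whose similarity falls on the expected side of the threshold."""
    correct = 0
    for left, right in similar_pairs:
        if cosine_similarity(embed(left), embed(right)) >= threshold:
            correct += 1
    for left, right in dissimilar_pairs:
        if cosine_similarity(embed(left), embed(right)) < threshold:
            correct += 1
    return correct / (len(similar_pairs) + len(dissimilar_pairs))


# Toy vectors standing in for real model output (assumption)
TOY_VECTORS = {
    "login": [0.9, 0.1, 0.0],
    "authentication": [0.8, 0.2, 0.1],
    "API key": [0.7, 0.3, 0.0],
    "access token": [0.75, 0.25, 0.05],
    "pricing": [0.0, 0.1, 0.95],
}

similar = [("login", "authentication"), ("API key", "access token")]
dissimilar = [("login", "pricing")]
accuracy = pair_accuracy(TOY_VECTORS.__getitem__, similar, dissimilar)
print(f"Pair accuracy: {accuracy:.0%}")
```

Swapping in each candidate model's embedding function and reusing the same pair lists gives a quick domain-specific sanity check before a full retrieval evaluation.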

### Embedding Drift Detection

**Temporal Consistency:**

```
Track embedding quality over time:
→ Month 1: NDCG = 0.85
→ Month 6: NDCG = 0.78 (dropped)

Possible causes:
→ Knowledge base changed (new jargon)
→ User queries evolved
→ Embedding model updated

Alert on: quality degradation beyond a set threshold
```
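A drift check can be as simple as comparing each new score against a baseline. The sketch below treats the first recorded score as the baseline and flags any drop larger than `max_drop`; the 0.05 default is an illustrative assumption, not a recommended value.

```python
def detect_drift(history, max_drop=0.05):
    """Return alert messages for scores that fell too far below the baseline.

    history: chronological list of (label, ndcg) tuples; the first entry
    is used as the baseline.
    """
    baseline_label, baseline = history[0]
    alerts = []
    for label, score in history[1:]:
        drop = baseline - score
        if drop > max_drop:
            alerts.append(
                f"{label}: NDCG {score:.2f} is {drop * 100:.0f}pp below "
                f"{baseline_label} ({baseline:.2f})"
            )
    return alerts


history = [("Month 1", 0.85), ("Month 6", 0.78)]
for alert in detect_drift(history):
    print(alert)
```

Running the same golden test set on a schedule and feeding the results into a check like this turns silent degradation into an explicit signal.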

**Model Version Comparison:**

```
Provider updates a model behind the same name (hypothetical):
→ text-embedding-ada-002 (weights at indexing time)
→ text-embedding-ada-002 (weights changed without notice)

Re-evaluate:
→ Did metrics change?
→ Need re-embedding?

Track model versions explicitly
```
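Tracking versions explicitly can mean storing a small metadata record next to the index and keeping a stored "probe" vector for a fixed piece of text. The sketch below is one possible shape; the `EmbeddingModelInfo` fields, the helper names, and the tolerance are all hypothetical, and the probe check assumes the model returns near-identical vectors for identical input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelInfo:
    provider: str
    name: str
    version: str      # whatever version or snapshot label you are able to record
    dimensions: int


def needs_reembedding(indexed_with, current):
    """True when the index was built with a different model record."""
    return indexed_with != current


def probe_changed(stored_vector, fresh_vector, tolerance=1e-3):
    """Detect a silent model change by re-embedding a fixed probe text.

    Compares the stored probe vector with a freshly computed one.
    """
    if len(stored_vector) != len(fresh_vector):
        return True
    max_diff = max(
        (abs(s - f) for s, f in zip(stored_vector, fresh_vector)), default=0.0
    )
    return max_diff > tolerance


indexed = EmbeddingModelInfo("openai", "text-embedding-ada-002", "2024-01", 1536)
current = EmbeddingModelInfo("openai", "text-embedding-ada-002", "2024-06", 1536)
print(needs_reembedding(indexed, current))
```

If either check fires, rerun the evaluation set before deciding whether a full re-embedding is needed.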

### Fine-Tuning Validation

**Before vs After:**

```
Base model:
→ NDCG: 0.75

After fine-tuning on your docs:
→ NDCG: 0.82 (+7pp)

Validates fine-tuning effort:
→ Quantifies improvement
→ Justifies cost
```

***

## How to Solve

* Create a golden evaluation dataset (queries plus their relevant docs).
* Measure Precision\@K, Recall\@K, and NDCG\@K on the eval set.
* Compare models using the same metrics on the same test set.
* Track metrics over time to detect drift.
* Test semantic similarity on domain-specific term pairs.
* Evaluate fine-tuned models against the baseline.
* Monitor embedding model version changes.
* Automate eval runs on each knowledge base update.

See [Embedding Metrics](/rag-scenarios-and-solutions/monitoring/embedding-metrics.md).

