Similarity Score Calibration

The Problem

Raw similarity scores (0.0 to 1.0) don't translate directly into relevance: a score of 0.75 might be excellent for one query and poor for another, which makes setting a single absolute threshold impossible.

Symptoms

  • ❌ Can't set universal relevance threshold

  • ❌ Same score means different relevance per query

  • ❌ "0.80 similarity" - is this good or bad?

  • ❌ Generic docs always score high

  • ❌ Specific docs score low despite being perfect matches

Real-World Example

Query A: "API authentication"
Top result: "API Guide" (score: 0.92) ← Excellent match

Query B: "Configure TPS-2000 subsystem"  
Top result: "System Configuration" (score: 0.68) ← Also excellent match!

Same threshold (0.75) would:
→ Accept Query A result ✓
→ Reject Query B result ✗ (below 0.75)

But Query B's 0.68 is actually the best possible match
→ Specific technical query
→ Limited vocabulary overlap
→ Lower scores expected

Threshold needs calibration per query type

Deep Technical Analysis

Cosine Similarity Range Compression

Embedding similarity doesn't use the full [0, 1] range:

Typical Score Distribution:

The Dead Zone:
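
The exact band depends on the embedding model and corpus, but it is easy to measure. Below is a minimal sketch of such a measurement, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; the sample documents are made up for illustration.

```python
# Measure the empirical cosine-similarity range for a small corpus.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to authenticate against the REST API using OAuth2 tokens.",
    "Quarterly financial report for fiscal year 2023.",
    "Configuring the TPS-2000 subsystem over the serial console.",
    "Recipe for sourdough bread with a 24-hour cold ferment.",
    "Troubleshooting TLS handshake failures behind a corporate proxy.",
]

emb = model.encode(docs, normalize_embeddings=True)  # unit vectors
sims = emb @ emb.T                                   # cosine similarity matrix

# Look only at off-diagonal pairs (distinct documents).
pairs = sims[~np.eye(len(docs), dtype=bool)]
print(f"min={pairs.min():.2f}  max={pairs.max():.2f}  mean={pairs.mean():.2f}")
# Even unrelated documents rarely score near 0.0, and near-duplicates
# rarely reach 1.0 -- most scores land in a narrow model-specific band,
# leaving "dead zones" at both ends of the nominal [0, 1] range.
```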

Query-Dependent Score Distributions

Different queries have different score patterns:

Broad vs Narrow Queries:

Vocabulary Overlap Effect:
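
A rough way to see this is to compare per-query summary statistics over the same corpus. The sketch below again assumes sentence-transformers and all-MiniLM-L6-v2; the corpus and queries are illustrative.

```python
# Compare score distributions for a broad vs. a narrow query.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "API Guide: authentication, rate limits, and pagination.",
    "System Configuration reference for all subsystems.",
    "Configuring the TPS-2000 subsystem over the serial console.",
    "Release notes for version 4.2.",
    "Troubleshooting network connectivity issues.",
]
queries = {
    "broad":  "API authentication",
    "narrow": "Configure TPS-2000 subsystem",
}

doc_emb = model.encode(corpus, normalize_embeddings=True)
for label, query in queries.items():
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = doc_emb @ q_emb
    print(f"{label:6s} max={scores.max():.2f} "
          f"mean={scores.mean():.2f} std={scores.std():.2f}")
# Broad queries with heavy vocabulary overlap tend to score higher across
# the board; narrow technical queries top out lower even when the best
# hit is a perfect match -- so one absolute threshold cannot serve both.
```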

Document-Specific Baseline Scores

Some documents score high regardless of query:

Generic "Hub" Documents:

Length and Density Bias:
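
One mitigation is to estimate each document's baseline (its average score over a sample of diverse queries) and subtract it before ranking. A sketch of that adjustment follows; the score matrix is made up for illustration.

```python
# Remove per-document baseline scores: generic "hub" documents score
# high for almost any query, so subtract each document's average score.
import numpy as np

# rows = queries, cols = documents; document 0 is a generic hub page
scores = np.array([
    [0.78, 0.52, 0.31],
    [0.80, 0.49, 0.74],
    [0.77, 0.68, 0.35],
    [0.79, 0.50, 0.33],
])

baseline = scores.mean(axis=0)   # per-document average over the query sample
adjusted = scores - baseline     # how much a doc beats its own baseline

print("baseline      :", np.round(baseline, 2))
print("adjusted row 1:", np.round(adjusted[1], 2))
# For query 1 the raw top hit is the hub document (0.80), but after
# baseline subtraction document 2 (+0.31) clearly beats the hub (+0.01).
```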

Model-Specific Score Ranges

Different embedding models have different scales:

Model Comparison:

Training Data Effects:
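
A quick check is to score the same query/document pairs with two models and compare the resulting bands. The sketch below assumes sentence-transformers with the all-MiniLM-L6-v2 and all-mpnet-base-v2 checkpoints; the pairs are illustrative.

```python
# The same pairs land in different score bands under different embedding
# models, so thresholds are not transferable between models.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

pairs = [
    ("API authentication", "API Guide: auth, rate limits, pagination"),
    ("Configure TPS-2000 subsystem", "System Configuration reference"),
    ("reset my password", "Quarterly financial report 2023"),
]

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    sims = []
    for a, b in pairs:
        ea, eb = model.encode([a, b], normalize_embeddings=True)
        sims.append(float(ea @ eb))
    print(f"{name:20s} {np.round(sims, 2)}")
# A threshold tuned for one model's band (e.g. 0.75) may sit in the
# middle of another model's band, or above it entirely.
```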

Calibration Techniques

Methods to normalize scores:

Z-Score Normalization:
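
A minimal sketch of per-query z-score normalization; the score lists extend the Query A / Query B example above with made-up tail scores.

```python
# Per-query z-score normalization: judge each hit relative to the score
# distribution of its own query, not on an absolute scale.
import numpy as np

def zscore_normalize(scores: np.ndarray) -> np.ndarray:
    """Convert raw similarity scores for ONE query into z-scores."""
    std = scores.std()
    if std < 1e-9:               # degenerate case: all scores identical
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

query_a = np.array([0.92, 0.61, 0.58, 0.55, 0.52])   # broad query
query_b = np.array([0.68, 0.41, 0.39, 0.37, 0.35])   # narrow query

print(np.round(zscore_normalize(query_a), 2))  # top hit ~ +1.96
print(np.round(zscore_normalize(query_b), 2))  # top hit ~ +1.97
# Both 0.92 and 0.68 sit roughly two standard deviations above their own
# query's mean, so a single z-score threshold (e.g. z > 1.5) accepts both.
```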

Percentile-Based Cutoff:
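
A sketch of a percentile-based cutoff that keeps the top 20% of candidates per query; the candidate scores are randomly generated for illustration.

```python
# Percentile-based cutoff: keep the top N% of candidates per query
# instead of applying one absolute score threshold to every query.
import numpy as np

def percentile_cutoff(scores: np.ndarray, keep_top_pct: float = 20.0) -> np.ndarray:
    """Return indices of hits whose score is in the top `keep_top_pct` percent."""
    threshold = np.percentile(scores, 100.0 - keep_top_pct)
    return np.flatnonzero(scores >= threshold)

# 20 candidate scores for a narrow query -- none reach 0.75
scores = np.random.default_rng(0).uniform(0.30, 0.70, size=20)
kept = percentile_cutoff(scores, keep_top_pct=20.0)
print("kept indices:", kept)
print("kept scores :", np.round(scores[kept], 2))
# An absolute 0.75 cutoff would return nothing here; the percentile
# cutoff still surfaces the best 20% of what the query actually produced.
```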

BM25 Hybrid Scoring:
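
A sketch of one common hybrid-scoring recipe: min-max normalize the BM25 and vector scores within each query, then take a weighted blend. The raw score arrays and the 0.5 weight are illustrative; a library such as rank_bm25 or a search engine's built-in BM25 would supply the lexical scores.

```python
# Hybrid scoring: put BM25 and vector scores on a comparable per-query
# scale, then blend them.
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    span = x.max() - x.min()
    return np.zeros_like(x) if span < 1e-9 else (x - x.min()) / span

def hybrid_score(bm25: np.ndarray, vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """alpha weights the vector signal; (1 - alpha) weights BM25."""
    return alpha * minmax(vec) + (1.0 - alpha) * minmax(bm25)

bm25_scores = np.array([12.4, 0.5, 2.0, 0.0])    # lexical, unbounded scale
vec_scores  = np.array([0.52, 0.70, 0.66, 0.45])  # cosine, compressed band

print(np.round(hybrid_score(bm25_scores, vec_scores), 2))
# Document 0 ranks only third on vector score but has a strong exact-term
# (BM25) match, so it tops the blended ranking instead of being discarded.
```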

Learning-to-Rank Approaches

ML-based score calibration:

Supervised Calibration:

Implicit Feedback Calibration:
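
Below is a minimal supervised-calibration sketch in the spirit of Platt scaling: fit a logistic model that maps raw similarity scores to a probability of relevance, using labeled or click-derived judgments. The training data is synthetic, and a production version would add query-level features (query length, score spread, query type) rather than the raw score alone; assumes scikit-learn.

```python
# Supervised calibration: learn raw score -> P(relevant) from feedback.
# Assumes: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
raw_scores = rng.uniform(0.3, 0.95, size=500).reshape(-1, 1)
# Synthetic "was this result clicked / judged relevant" labels:
# relevance rises with score over roughly the 0.55-0.85 band.
relevant = rng.random(500) < np.clip((raw_scores[:, 0] - 0.55) / 0.30, 0, 1)

calibrator = LogisticRegression()
calibrator.fit(raw_scores, relevant)

for s in (0.60, 0.70, 0.80, 0.90):
    p = calibrator.predict_proba([[s]])[0, 1]
    print(f"raw {s:.2f} -> P(relevant) = {p:.2f}")
# The calibrated output is a probability, so a single cutoff such as
# P(relevant) > 0.5 is meaningful across queries and models.
```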

Context-Dependent Thresholding

Adaptive thresholds based on context:

Query Type Detection:

User Intent Inference:

Temporal Adjustment:
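
A sketch of context-dependent thresholding via query-type detection follows; the detection heuristics and the threshold values are placeholder assumptions, not recommended settings.

```python
# Context-dependent thresholding: detect a rough query type, then apply
# a threshold calibrated for that type.
import re

THRESHOLDS = {"narrow_technical": 0.55, "broad": 0.75}  # illustrative values

def detect_query_type(query: str) -> str:
    tokens = query.split()
    # Part numbers / identifiers such as "TPS-2000" suggest a narrow query.
    has_identifier = any(re.search(r"[A-Z]{2,}-?\d+", t) for t in tokens)
    return "narrow_technical" if has_identifier or len(tokens) >= 5 else "broad"

def is_relevant(query: str, score: float) -> bool:
    return score >= THRESHOLDS[detect_query_type(query)]

print(is_relevant("API authentication", 0.68))            # False: broad, needs 0.75
print(is_relevant("Configure TPS-2000 subsystem", 0.68))  # True: narrow, needs 0.55
```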

Multi-Modal Score Fusion

Combining different signals:

Signal Types:

Fusion Strategies:
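
One simple fusion strategy is reciprocal rank fusion (RRF), which combines rankings from several signals without needing their raw scores on a common scale. The document ids and rankings below are illustrative.

```python
# Reciprocal rank fusion: each signal contributes 1 / (k + rank) per doc.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Each ranking is a list of document ids, best first."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

vector_rank  = ["doc_cfg", "doc_hub", "doc_api"]   # embedding similarity
bm25_rank    = ["doc_cfg", "doc_api", "doc_hub"]   # lexical match
recency_rank = ["doc_api", "doc_cfg", "doc_hub"]   # freshness signal

for doc_id, score in rrf([vector_rank, bm25_rank, recency_rank]):
    print(f"{doc_id:8s} {score:.4f}")
# doc_cfg wins because it ranks near the top under most signals, even
# though the raw scores of the signals were never compared directly.
```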


How to Solve

Combine several of the calibration techniques above:

  • Use per-query z-score normalization.

  • Set percentile-based thresholds (e.g., keep the top 20%) instead of absolute score cutoffs.

  • Track score distributions per query type.

  • Implement learning-to-rank calibration with click data.

  • Normalize out document-specific baseline scores.

See Score Calibration.
