Similarity Score Calibration

The Problem

Raw similarity scores (0.0 to 1.0) don't translate directly into relevance: a score of 0.75 might be excellent for one query and poor for another, which makes setting a single absolute threshold impossible.

Symptoms

  • ❌ Can't set universal relevance threshold

  • ❌ Same score means different relevance per query

  • ❌ "0.80 similarity" - is this good or bad?

  • ❌ Generic docs always score high

  • ❌ Specific docs score low despite being perfect matches

Real-World Example

Query A: "API authentication"
Top result: "API Guide" (score: 0.92) ← Excellent match

Query B: "Configure TPS-2000 subsystem"  
Top result: "System Configuration" (score: 0.68) ← Also excellent match!

Same threshold (0.75) would:
→ Accept Query A result ✓
→ Reject Query B result ✗ (below 0.75)

But Query B's 0.68 is actually the best possible match
→ Specific technical query
→ Limited vocabulary overlap
→ Lower scores expected

Threshold needs calibration per query type

Deep Technical Analysis

Cosine Similarity Range Compression

Embedding similarity doesn't use the full [0, 1] range:

Typical Score Distribution:

The Dead Zone:
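
The exact band depends on the embedding model and corpus, but it is easy to measure. Below is a minimal sketch of such a measurement, assuming the sentence-transformers package and the all-MiniLM-L6-v2 model; the sample documents are made up for illustration.

```python
# Measure the empirical cosine-similarity range for a small corpus.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "How to authenticate against the REST API using OAuth2 tokens.",
    "Quarterly financial report for fiscal year 2023.",
    "Configuring the TPS-2000 subsystem over the serial console.",
    "Recipe for sourdough bread with a 24-hour cold ferment.",
    "Troubleshooting TLS handshake failures behind a corporate proxy.",
]

emb = model.encode(docs, normalize_embeddings=True)  # unit vectors
sims = emb @ emb.T                                   # cosine similarity matrix

# Look only at off-diagonal pairs (distinct documents).
pairs = sims[~np.eye(len(docs), dtype=bool)]
print(f"min={pairs.min():.2f}  max={pairs.max():.2f}  mean={pairs.mean():.2f}")
# Even unrelated documents rarely score near 0.0, and near-duplicates
# rarely reach 1.0 -- most scores land in a narrow model-specific band,
# leaving "dead zones" at both ends of the nominal [0, 1] range.
```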

Query-Dependent Score Distributions

Different queries have different score patterns:

Broad vs Narrow Queries:

Vocabulary Overlap Effect:
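
A rough way to see this is to compare per-query summary statistics over the same corpus. The sketch below again assumes sentence-transformers and all-MiniLM-L6-v2; the corpus and queries are illustrative.

```python
# Compare score distributions for a broad vs. a narrow query.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = [
    "API Guide: authentication, rate limits, and pagination.",
    "System Configuration reference for all subsystems.",
    "Configuring the TPS-2000 subsystem over the serial console.",
    "Release notes for version 4.2.",
    "Troubleshooting network connectivity issues.",
]
queries = {
    "broad":  "API authentication",
    "narrow": "Configure TPS-2000 subsystem",
}

doc_emb = model.encode(corpus, normalize_embeddings=True)
for label, query in queries.items():
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = doc_emb @ q_emb
    print(f"{label:6s} max={scores.max():.2f} "
          f"mean={scores.mean():.2f} std={scores.std():.2f}")
# Broad queries with heavy vocabulary overlap tend to score higher across
# the board; narrow technical queries top out lower even when the best
# hit is a perfect match -- so one absolute threshold cannot serve both.
```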

Document-Specific Baseline Scores

Some documents score high regardless of query:

Generic "Hub" Documents:

Length and Density Bias:
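
One mitigation is to estimate each document's baseline (its average score over a sample of diverse queries) and subtract it before ranking. A sketch of that adjustment follows; the score matrix is made up for illustration.

```python
# Remove per-document baseline scores: generic "hub" documents score
# high for almost any query, so subtract each document's average score.
import numpy as np

# rows = queries, cols = documents; document 0 is a generic hub page
scores = np.array([
    [0.78, 0.52, 0.31],
    [0.80, 0.49, 0.74],
    [0.77, 0.68, 0.35],
    [0.79, 0.50, 0.33],
])

baseline = scores.mean(axis=0)   # per-document average over the query sample
adjusted = scores - baseline     # how much a doc beats its own baseline

print("baseline      :", np.round(baseline, 2))
print("adjusted row 1:", np.round(adjusted[1], 2))
# For query 1 the raw top hit is the hub document (0.80), but after
# baseline subtraction document 2 (+0.31) clearly beats the hub (+0.01).
```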

Model-Specific Score Ranges

Different embedding models have different scales:

Model Comparison:

Training Data Effects:
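
A quick check is to score the same query/document pairs with two models and compare the resulting bands. The sketch below assumes sentence-transformers with the all-MiniLM-L6-v2 and all-mpnet-base-v2 checkpoints; the pairs are illustrative.

```python
# The same pairs land in different score bands under different embedding
# models, so thresholds are not transferable between models.
# Assumes: pip install sentence-transformers numpy
import numpy as np
from sentence_transformers import SentenceTransformer

pairs = [
    ("API authentication", "API Guide: auth, rate limits, pagination"),
    ("Configure TPS-2000 subsystem", "System Configuration reference"),
    ("reset my password", "Quarterly financial report 2023"),
]

for name in ("all-MiniLM-L6-v2", "all-mpnet-base-v2"):
    model = SentenceTransformer(name)
    sims = []
    for a, b in pairs:
        ea, eb = model.encode([a, b], normalize_embeddings=True)
        sims.append(float(ea @ eb))
    print(f"{name:20s} {np.round(sims, 2)}")
# A threshold tuned for one model's band (e.g. 0.75) may sit in the
# middle of another model's band, or above it entirely.
```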

Calibration Techniques

Methods to normalize scores:

Z-Score Normalization:
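
A minimal sketch of per-query z-score normalization; the score lists extend the Query A / Query B example above with made-up tail scores.

```python
# Per-query z-score normalization: judge each hit relative to the score
# distribution of its own query, not on an absolute scale.
import numpy as np

def zscore_normalize(scores: np.ndarray) -> np.ndarray:
    """Convert raw similarity scores for ONE query into z-scores."""
    std = scores.std()
    if std < 1e-9:               # degenerate case: all scores identical
        return np.zeros_like(scores)
    return (scores - scores.mean()) / std

query_a = np.array([0.92, 0.61, 0.58, 0.55, 0.52])   # broad query
query_b = np.array([0.68, 0.41, 0.39, 0.37, 0.35])   # narrow query

print(np.round(zscore_normalize(query_a), 2))  # top hit ~ +1.96
print(np.round(zscore_normalize(query_b), 2))  # top hit ~ +1.97
# Both 0.92 and 0.68 sit roughly two standard deviations above their own
# query's mean, so a single z-score threshold (e.g. z > 1.5) accepts both.
```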

Percentile-Based Cutoff:
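
A sketch of a percentile-based cutoff that keeps the top 20% of candidates per query; the candidate scores are randomly generated for illustration.

```python
# Percentile-based cutoff: keep the top N% of candidates per query
# instead of applying one absolute score threshold to every query.
import numpy as np

def percentile_cutoff(scores: np.ndarray, keep_top_pct: float = 20.0) -> np.ndarray:
    """Return indices of hits whose score is in the top `keep_top_pct` percent."""
    threshold = np.percentile(scores, 100.0 - keep_top_pct)
    return np.flatnonzero(scores >= threshold)

# 20 candidate scores for a narrow query -- none reach 0.75
scores = np.random.default_rng(0).uniform(0.30, 0.70, size=20)
kept = percentile_cutoff(scores, keep_top_pct=20.0)
print("kept indices:", kept)
print("kept scores :", np.round(scores[kept], 2))
# An absolute 0.75 cutoff would return nothing here; the percentile
# cutoff still surfaces the best 20% of what the query actually produced.
```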

BM25 Hybrid Scoring:
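
A sketch of one common hybrid-scoring recipe: min-max normalize the BM25 and vector scores within each query, then take a weighted blend. The raw score arrays and the 0.5 weight are illustrative; a library such as rank_bm25 or a search engine's built-in BM25 would supply the lexical scores.

```python
# Hybrid scoring: put BM25 and vector scores on a comparable per-query
# scale, then blend them.
import numpy as np

def minmax(x: np.ndarray) -> np.ndarray:
    span = x.max() - x.min()
    return np.zeros_like(x) if span < 1e-9 else (x - x.min()) / span

def hybrid_score(bm25: np.ndarray, vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """alpha weights the vector signal; (1 - alpha) weights BM25."""
    return alpha * minmax(vec) + (1.0 - alpha) * minmax(bm25)

bm25_scores = np.array([12.4, 0.5, 2.0, 0.0])    # lexical, unbounded scale
vec_scores  = np.array([0.52, 0.70, 0.66, 0.45])  # cosine, compressed band

print(np.round(hybrid_score(bm25_scores, vec_scores), 2))
# Document 0 ranks only third on vector score but has a strong exact-term
# (BM25) match, so it tops the blended ranking instead of being discarded.
```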

Learning-to-Rank Approaches

ML-based score calibration:

Supervised Calibration:

Implicit Feedback Calibration:
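
Below is a minimal supervised-calibration sketch in the spirit of Platt scaling: fit a logistic model that maps raw similarity scores to a probability of relevance, using labeled or click-derived judgments. The training data is synthetic, and a production version would add query-level features (query length, score spread, query type) rather than the raw score alone; assumes scikit-learn.

```python
# Supervised calibration: learn raw score -> P(relevant) from feedback.
# Assumes: pip install scikit-learn numpy
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
raw_scores = rng.uniform(0.3, 0.95, size=500).reshape(-1, 1)
# Synthetic "was this result clicked / judged relevant" labels:
# relevance rises with score over roughly the 0.55-0.85 band.
relevant = rng.random(500) < np.clip((raw_scores[:, 0] - 0.55) / 0.30, 0, 1)

calibrator = LogisticRegression()
calibrator.fit(raw_scores, relevant)

for s in (0.60, 0.70, 0.80, 0.90):
    p = calibrator.predict_proba([[s]])[0, 1]
    print(f"raw {s:.2f} -> P(relevant) = {p:.2f}")
# The calibrated output is a probability, so a single cutoff such as
# P(relevant) > 0.5 is meaningful across queries and models.
```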

Context-Dependent Thresholding

Adaptive thresholds based on context:

Query Type Detection:

User Intent Inference:

Temporal Adjustment:
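
A sketch of context-dependent thresholding via query-type detection follows; the detection heuristics and the threshold values are placeholder assumptions, not recommended settings.

```python
# Context-dependent thresholding: detect a rough query type, then apply
# a threshold calibrated for that type.
import re

THRESHOLDS = {"narrow_technical": 0.55, "broad": 0.75}  # illustrative values

def detect_query_type(query: str) -> str:
    tokens = query.split()
    # Part numbers / identifiers such as "TPS-2000" suggest a narrow query.
    has_identifier = any(re.search(r"[A-Z]{2,}-?\d+", t) for t in tokens)
    return "narrow_technical" if has_identifier or len(tokens) >= 5 else "broad"

def is_relevant(query: str, score: float) -> bool:
    return score >= THRESHOLDS[detect_query_type(query)]

print(is_relevant("API authentication", 0.68))            # False: broad, needs 0.75
print(is_relevant("Configure TPS-2000 subsystem", 0.68))  # True: narrow, needs 0.55
```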

Multi-Modal Score Fusion

Combining different signals:

Signal Types:

Fusion Strategies:
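
One simple fusion strategy is reciprocal rank fusion (RRF), which combines rankings from several signals without needing their raw scores on a common scale. The document ids and rankings below are illustrative.

```python
# Reciprocal rank fusion: each signal contributes 1 / (k + rank) per doc.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Each ranking is a list of document ids, best first."""
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

vector_rank  = ["doc_cfg", "doc_hub", "doc_api"]   # embedding similarity
bm25_rank    = ["doc_cfg", "doc_api", "doc_hub"]   # lexical match
recency_rank = ["doc_api", "doc_cfg", "doc_hub"]   # freshness signal

for doc_id, score in rrf([vector_rank, bm25_rank, recency_rank]):
    print(f"{doc_id:8s} {score:.4f}")
# doc_cfg wins because it ranks near the top under most signals, even
# though the raw scores of the signals were never compared directly.
```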


How to Solve

Combine several of the calibration techniques above:

  • Use per-query z-score normalization.

  • Set percentile-based thresholds (e.g., keep the top 20%) instead of absolute score cutoffs.

  • Track score distributions per query type.

  • Implement learning-to-rank calibration with click data.

  • Normalize out document-specific baseline scores.

See Score Calibration.
