Reranking Score Analysis

The Problem

Without visibility into reranker scores, you cannot evaluate whether reranking is effective or debug why the reranker changes the initial retrieval order, which makes optimization impossible.

Symptoms

❌ Don't know if reranking helps
❌ Cannot see score changes (retrieval → reranking)
❌ Unexpected rank reversals
❌ No metrics for reranker quality
❌ Cannot compare reranking models

Real-World Example

Initial retrieval (vector search):
→ #1: Chunk A (score: 0.85)
→ #2: Chunk B (score: 0.83)
→ #3: Chunk C (score: 0.80)

After reranking (Cohere Rerank):
→ #1: Chunk C (score: 0.92) ← promoted
→ #2: Chunk A (score: 0.88) ← demoted
→ #3: Chunk B (score: 0.75) ← demoted

Why did Chunk C jump from #3 to #1?
→ No visibility into reranker reasoning
→ Cannot validate if correct

Deep Technical Analysis

Reranking Purpose

Query-Document Interaction:

Vector search (bi-encoder):
→ Query embedded separately
→ Document embedded separately
→ Cosine similarity

Limitations:
→ No interaction between query and doc
→ "Authentication" matches "Login" generally
→ But: Misses query-specific nuances
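As a rough illustration of the bi-encoder scoring described above, the sketch below computes cosine similarity between a query vector and a document vector that were produced independently. The three-dimensional vectors and the name `cosine_similarity` are hypothetical; real embedding models produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, computed separately for query and document.
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 1.0, 0.0]
score = cosine_similarity(query_vec, doc_vec)
```

Because each vector is fixed before the two ever meet, the score cannot reflect how specific query terms interact with specific passages of the document.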

Reranker (cross-encoder):
→ Processes query + document together
→ "How to authenticate?" + Document
→ Models interaction
→ More accurate relevance

Example Improvement:

Query: "How to reset password for admin users?"

Chunk A: "Password reset procedure (general)"
→ Vector score: 0.85 (close match on "password reset")
→ Rerank score: 0.70 (not admin-specific)

Chunk B: "Admin user management (including password reset)"
→ Vector score: 0.78 (lower: less direct overlap with the query)
→ Rerank score: 0.95 (admin + password reset = perfect match)

Reranker promotes Chunk B (more relevant)

Reranking Metrics

Precision Improvement:

Measure: Precision@5 (are top-5 relevant?)

Before reranking:
→ Top-5 from vector search: 3/5 relevant = 0.60

After reranking:
→ Top-5 after rerank: 5/5 relevant = 1.00

Improvement: +40pp
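The Precision@5 comparison above can be sketched as follows. The chunk IDs and the relevance labels are made up for illustration, and `precision_at_k` is a hypothetical helper name; it assumes you already have a labeled set of relevant chunks per query.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked IDs that appear in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divide by k (not len(top_k)) so short result lists are penalized.
    return hits / k


# Hypothetical labels: five chunks are relevant to the query.
relevant = {"a", "b", "c", "d", "e"}
before = ["a", "x", "b", "y", "c", "d", "e"]  # vector search order
after = ["a", "b", "c", "d", "e", "x", "y"]   # order after reranking

p_before = precision_at_k(before, relevant, k=5)
p_after = precision_at_k(after, relevant, k=5)
improvement_pp = round((p_after - p_before) * 100, 1)
```

Averaging this value over an evaluation set of queries, rather than reading it off a single query, gives a more trustworthy picture.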

Rank Correlation:

Compare: Retrieval rank vs Ground truth rank

Vector search:
→ Spearman correlation: 0.65

With reranking:
→ Spearman correlation: 0.85 (+0.20)

Better alignment with ideal ranking
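A minimal sketch of the rank-correlation check, assuming both rankings contain exactly the same items and have no ties. It uses the standard formula rho = 1 - 6·Σd² / (n(n² - 1)); the function name and the five-item rankings are hypothetical.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman correlation between two orderings of the same items (no ties)."""
    if len(ranking_a) != len(ranking_b) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items")
    if len(set(ranking_a)) != len(ranking_a):
        raise ValueError("rankings must not contain duplicates")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two items")
    position_b = {item: i for i, item in enumerate(ranking_b)}
    # d is the difference in position of each item between the two orderings.
    d_squared = sum((i - position_b[item]) ** 2 for i, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical ground-truth order and two system orders.
ideal = ["A", "B", "C", "D", "E"]
vector_order = ["B", "A", "C", "E", "D"]
rho_vector = spearman_rho(ideal, vector_order)
rho_identical = spearman_rho(ideal, ideal)
rho_reversed = spearman_rho(ideal, list(reversed(ideal)))
```

If your rankings can contain ties, a library routine such as `scipy.stats.spearmanr` is likely a better fit than this simplified formula.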

Score Distribution Analysis

Score Spread:

Vector search scores: 0.75-0.85 (tight)
→ Hard to discriminate
→ All seem equally relevant

Reranking scores: 0.55-0.95 (wide)
→ Clear differentiation
→ Top results confidently relevant

Wider spread = easier to separate strong results from weak ones (meaningful only if the scores track actual relevance)
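One way to monitor spread is to summarize each score list with its range and standard deviation. This is a sketch under assumptions: the score lists are invented to match the ranges quoted above, and `score_spread` is a hypothetical helper.

```python
import statistics


def score_spread(scores):
    """Summarize how widely a list of relevance scores is spread."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    return {
        "min": min(scores),
        "max": max(scores),
        "range": max(scores) - min(scores),
        "stdev": statistics.pstdev(scores),  # population standard deviation
    }


# Hypothetical scores for the same five candidates.
vector_scores = [0.85, 0.83, 0.80, 0.78, 0.75]
rerank_scores = [0.95, 0.88, 0.75, 0.62, 0.55]

vector_spread = score_spread(vector_scores)
rerank_spread = score_spread(rerank_scores)
```

Vector and rerank scores usually live on different scales, so compare the spread within each stage over time rather than treating the raw numbers as interchangeable.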

Confidence Calibration:

Reranker score vs Actual relevance:
→ Score 0.9-1.0: 95% actually relevant
→ Score 0.7-0.9: 75% actually relevant
→ Score 0.5-0.7: 40% actually relevant

Well-calibrated reranker
Enables confidence-based filtering
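A calibration table like the one above can be built by bucketing labeled results by score and computing the share that is actually relevant in each bucket. The sample data, the bin edges, and the name `calibration_table` are assumptions for illustration.

```python
def calibration_table(scored, bin_edges):
    """Return (low, high, count, relevant_rate) for each score bin.

    scored: list of (score, is_relevant) pairs.
    bin_edges: ascending edges, e.g. [0.5, 0.7, 0.9, 1.0].
    Bins are [low, high), except the last bin, which also includes its upper edge.
    """
    rows = []
    for low, high in zip(bin_edges, bin_edges[1:]):
        is_last = high == bin_edges[-1]
        in_bin = [
            relevant
            for score, relevant in scored
            if low <= score < high or (is_last and score == high)
        ]
        rate = sum(in_bin) / len(in_bin) if in_bin else None
        rows.append((low, high, len(in_bin), rate))
    return rows


# Hypothetical labeled reranker outputs: (score, human relevance judgment).
samples = [
    (0.95, True), (0.92, True),
    (0.85, True), (0.80, False), (0.75, True), (0.72, True),
    (0.65, False), (0.60, True), (0.55, False), (0.52, False),
]
rows = calibration_table(samples, [0.5, 0.7, 0.9, 1.0])
```

With real traffic you would want many more labeled samples per bin before using the rates to set a filtering threshold.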

Debugging Rank Changes

Promotion/Demotion Tracking:

Log rank changes:
{
  "chunk_id": "chunk_C",
  "vector_rank": 3,
  "vector_score": 0.80,
  "rerank_rank": 1,
  "rerank_score": 0.92,
  "change": 2,
  "direction": "promoted"
}

Investigate large changes:
→ ±5 positions or more: Why the big jump?
→ Validate: Is it actually more relevant?
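The promotion/demotion log can be produced with a small helper like the one below. It is a sketch, not a prescribed implementation: `rank_changes` and `large_changes` are hypothetical names, and the inputs are assumed to be lists of `(chunk_id, score)` pairs ordered best first. The data reuses the Chunk A/B/C example from earlier on this page.

```python
def rank_changes(vector_results, rerank_results):
    """Build one log record per chunk that appears in both result lists."""
    reranked = {
        chunk_id: (rank, score)
        for rank, (chunk_id, score) in enumerate(rerank_results, start=1)
    }
    records = []
    for v_rank, (chunk_id, v_score) in enumerate(vector_results, start=1):
        if chunk_id not in reranked:
            continue  # chunk was cut by the reranker's top-n limit
        r_rank, r_score = reranked[chunk_id]
        change = v_rank - r_rank  # positive means the chunk moved up
        direction = "promoted" if change > 0 else "demoted" if change < 0 else "unchanged"
        records.append({
            "chunk_id": chunk_id,
            "vector_rank": v_rank,
            "vector_score": v_score,
            "rerank_rank": r_rank,
            "rerank_score": r_score,
            "change": change,
            "direction": direction,
        })
    return records


def large_changes(records, threshold=5):
    """Records whose rank moved by at least `threshold` positions either way."""
    return [r for r in records if abs(r["change"]) >= threshold]


vector_results = [("chunk_A", 0.85), ("chunk_B", 0.83), ("chunk_C", 0.80)]
rerank_results = [("chunk_C", 0.92), ("chunk_A", 0.88), ("chunk_B", 0.75)]
records = rank_changes(vector_results, rerank_results)
flagged = large_changes(records, threshold=2)
```

Here `flagged` uses a threshold of 2 only because the example has three chunks; with a top-20 candidate list the ±5 threshold mentioned above is a more natural starting point.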

Reranker Explanation:

Some rerankers provide attribution:
→ Which query terms matched?
→ Which doc sections influenced score?

Example (illustrative; each weight is that match's relative share of the score):
Query: "admin password reset"
→ Matched: "admin" (weight: 0.4)
→ Matched: "password reset" (weight: 0.6)
→ Total score: 0.92

This explains why the score is high

Cost-Benefit Analysis

Reranking Cost:

Cohere Rerank pricing:
→ $2 per 1000 rerank calls

Usage:
→ 10,000 queries/day
→ Rerank top-20 → top-5
→ Cost: 10,000 × $0.002 = $20/day = $600/month

Quality improvement:
→ Precision@5: +30pp
→ User satisfaction: +25%

Worth it?
→ Measure quality gain
→ Justify cost
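The cost arithmetic above is easy to parameterize so you can rerun it when traffic or pricing changes. This sketch assumes one rerank call per query and a 30-day month; `rerank_cost` and `cost_per_point` are hypothetical helpers, and the price is the figure quoted on this page, not a verified current rate.

```python
def rerank_cost(queries_per_day, price_per_1000_calls, days_per_month=30):
    """Return (daily_cost, monthly_cost), assuming one rerank call per query."""
    daily = queries_per_day * price_per_1000_calls / 1000
    return daily, daily * days_per_month


def cost_per_point(monthly_cost, precision_gain_pp):
    """Monthly spend per percentage point of Precision@K gained."""
    if precision_gain_pp <= 0:
        raise ValueError("precision gain must be positive")
    return monthly_cost / precision_gain_pp


daily_cost, monthly_cost = rerank_cost(10_000, price_per_1000_calls=2.0)
dollars_per_pp = cost_per_point(monthly_cost, precision_gain_pp=30)
```

Dollars per percentage point is only one possible way to frame the trade-off; tie it to whichever quality metric your team actually tracks.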

How to Solve

→ Log both vector scores and rerank scores for comparison
→ Track rank changes (promoted/demoted chunks)
→ Measure Precision@K before and after reranking
→ Calculate rank correlation (Spearman) improvement
→ Monitor score distribution spread
→ Test reranker models on an eval set
→ Analyze the cost vs quality trade-off
→ Investigate large rank changes (±5 positions) for validation

See Reranking Analysis.
