Retrieval Stage Debugging

The Problem

Cannot trace why specific chunks were or weren't retrieved, making it impossible to debug poor retrieval quality or unexpected results.

Symptoms

  • ❌ No visibility into retrieval scores

  • ❌ Cannot see why one chunk ranked #1 and another #10

  • ❌ Missing chunks unexplained

  • ❌ Cannot replay retrieval for debugging

  • ❌ Retrieval feels like "black box"

Real-World Example

User reports: "AI said rate limit is 100/hour but docs say 1000/hour"

Investigation needed:
→ What chunks were retrieved?
→ What were their scores?
→ Why was "100/hour" chunk ranked higher than "1000/hour"?
→ Was "1000/hour" chunk retrieved at all?

Without retrieval logs:
→ Cannot answer these questions
→ Cannot debug
→ Cannot improve

Deep Technical Analysis

Retrieval Opacity

Missing Instrumentation:
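
A typical un-instrumented retrieval path looks something like the sketch below; `vector_store.search` and `embedder.embed` are stand-ins for whatever client your stack uses. Every signal needed for debugging is discarded before it can be inspected.

```python
# Hypothetical un-instrumented retrieval path. Only the chunk text survives;
# scores, ranks, chunk IDs, and the query embedding are all thrown away.
def retrieve(query: str, vector_store, embedder, top_k: int = 5) -> list[str]:
    query_embedding = embedder.embed(query)                  # never persisted
    results = vector_store.search(query_embedding, top_k=top_k)
    return [r.text for r in results]                         # nothing left to debug
```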

Debug Requirements:
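
At minimum, every request should capture enough to answer the questions in the example above: what was retrieved, at what rank, with what scores, and under what retriever configuration. One possible trace shape is sketched below; the field names are illustrative, not a standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class RetrievedChunk:
    chunk_id: str
    rank: int
    similarity_score: float
    rerank_score: float | None    # present only if a reranker ran
    text_preview: str             # first ~200 chars, enough to eyeball relevance

@dataclass
class RetrievalTrace:
    request_id: str
    query: str
    query_embedding: list[float]  # stored so the query can be replayed later
    retriever_config: dict        # index name, top_k, filters, model version
    chunks: list[RetrievedChunk] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
```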

Retrieval Trace Format

Structured Logging:
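
A sketch of emitting each trace as one structured JSON record per request, using only the standard library; the field names follow the hypothetical `RetrievalTrace` above.

```python
import json
import logging

logger = logging.getLogger("retrieval")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_retrieval_trace(trace) -> None:
    # One JSON object per request keeps traces greppable and easy to load
    # into an analytics store later.
    logger.info(json.dumps({
        "event": "retrieval_trace",
        "request_id": trace.request_id,
        "query": trace.query,
        "retriever_config": trace.retriever_config,
        "chunks": [
            {
                "chunk_id": c.chunk_id,
                "rank": c.rank,
                "similarity_score": round(c.similarity_score, 4),
                "rerank_score": c.rerank_score,
            }
            for c in trace.chunks
        ],
        "timestamp": trace.timestamp,
    }))
```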

Score Analysis

Similarity Distribution:
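
A quick distribution summary over a query's scores shows whether results are tightly clustered near the top or spread thin; a minimal sketch using the standard library (the example scores are made up):

```python
import statistics

def score_distribution(scores: list[float]) -> dict:
    # A high, tight cluster suggests confident retrieval; a wide or low
    # spread suggests the query matched nothing particularly well.
    return {
        "min": min(scores),
        "max": max(scores),
        "mean": round(statistics.mean(scores), 4),
        "stdev": round(statistics.pstdev(scores), 4),
    }

print(score_distribution([0.89, 0.87, 0.71, 0.64, 0.52]))
```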

Score Gap Detection:
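
A sharp drop between consecutive ranks often marks where relevant results end. A minimal gap detector is sketched below; the 0.10 gap threshold is an assumption to tune against your embedding model.

```python
def find_score_gap(scores: list[float], min_gap: float = 0.10) -> int | None:
    # Scores are assumed sorted in descending rank order.
    # Returns the rank index where the first large drop occurs, or None
    # if scores decay smoothly.
    for i in range(len(scores) - 1):
        if scores[i] - scores[i + 1] >= min_gap:
            return i + 1   # chunks from this rank onward are suspect
    return None

print(find_score_gap([0.89, 0.87, 0.71, 0.64, 0.52]))  # -> 2 (drop from 0.87 to 0.71)
```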

Replay Capability

Store Query Context:
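
Replay only works if the exact query context is persisted at request time. The sketch below writes the trace, including the query embedding and retriever configuration, to disk keyed by request ID; the storage location and format are illustrative.

```python
import json
from pathlib import Path

TRACE_DIR = Path("retrieval_traces")   # illustrative storage location
TRACE_DIR.mkdir(exist_ok=True)

def store_query_context(trace) -> Path:
    # Persist everything needed to re-run this exact retrieval later.
    path = TRACE_DIR / f"{trace.request_id}.json"
    path.write_text(json.dumps({
        "query": trace.query,
        "query_embedding": trace.query_embedding,
        "retriever_config": trace.retriever_config,
        "retrieved_chunk_ids": [c.chunk_id for c in trace.chunks],
    }))
    return path
```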

A/B Testing:
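
With the query context stored, the same embedding can be replayed against two retriever configurations and the result sets diffed. The `search_a` and `search_b` callables below are stand-ins for whichever backends or settings are being compared.

```python
def compare_retrievers(query_embedding, search_a, search_b, top_k: int = 5) -> dict:
    # Replay one stored query against two configurations (e.g. old vs. new
    # index, different top_k or filters) and report how the results differ.
    ids_a = [r["chunk_id"] for r in search_a(query_embedding, top_k)]
    ids_b = [r["chunk_id"] for r in search_b(query_embedding, top_k)]
    return {
        "only_in_a": [c for c in ids_a if c not in ids_b],
        "only_in_b": [c for c in ids_b if c not in ids_a],
        "overlap": [c for c in ids_a if c in ids_b],
    }
```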

Failure Mode Detection

No Results:
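
An empty result set should be flagged the moment it happens rather than discovered in a user report; a minimal check (the logger name is arbitrary):

```python
import json
import logging

failure_logger = logging.getLogger("retrieval.failures")

def flag_no_results(request_id: str, query: str, chunks: list) -> bool:
    # Empty result sets usually point at filter misconfiguration, an empty
    # index partition, or an out-of-domain query.
    if chunks:
        return False
    failure_logger.warning(json.dumps({
        "event": "retrieval_failure",
        "mode": "no_results",
        "request_id": request_id,
        "query": query,
    }))
    return True
```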

Low-Quality Results:
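
Results can also come back non-empty but weak. A simple heuristic flags queries whose best score falls below a threshold; the 0.70 value matches the monitoring threshold suggested under "How to Solve" and should be calibrated per embedding model.

```python
def is_low_quality(scores: list[float], min_top_score: float = 0.70) -> bool:
    # If even the best-ranked chunk scores below the threshold, the answer
    # is likely being generated from marginally relevant context.
    return not scores or max(scores) < min_top_score

print(is_low_quality([0.62, 0.58, 0.41]))  # -> True: best chunk only scored 0.62
```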


How to Solve

  • Log retrieval traces (query, chunks, scores) for every request

  • Include reranking scores if applicable

  • Store query embedding for replay

  • Visualize score distributions

  • Implement retrieval replay capability for debugging

  • Monitor low-score queries (< 0.70 threshold)

  • Build retrieval analytics dashboard (score trends, failure modes)

See Retrieval Debugging.
