# RAG Pipeline Observability

## Overview

You can't improve what you can't measure. RAG systems are complex, multi-stage pipelines where failures can occur at any point: query processing, retrieval, reranking, context assembly, or generation. Without proper observability, you're flying blind, unable to diagnose failures, optimize performance, or understand user behavior. This section covers essential monitoring and debugging practices for production RAG systems.

## Why Observability Matters

Proper observability enables:

* **Rapid debugging** - Quickly identify where and why failures occur
* **Performance optimization** - Data-driven improvements to retrieval and generation
* **Quality assurance** - Detect degradation before users complain
* **Usage insights** - Understand how users interact with your agents
* **Cost management** - Track and optimize embedding, vector search, and LLM costs

Without observability, you face:

* **Mystery failures** - Agents break and you don't know why
* **Slow iteration** - Can't measure impact of changes
* **Cost overruns** - Unexpected bills from LLM APIs
* **Quality drift** - Performance degrades slowly without notice
* **User frustration** - Issues persist because you can't find them

## Common Observability Challenges

### Visibility Gaps

* **Retrieval stage debugging** - Can't see what documents were retrieved
* **Query understanding** - Don't know how queries were interpreted
* **Context window utilization** - No visibility into what's sent to LLM
* **Agent decision tracing** - Can't trace agent reasoning steps

### Metrics & Scoring

* **Embedding quality metrics** - No measure of embedding effectiveness
* **Reranking score analysis** - Can't assess reranker performance
* **Source attribution tracking** - Don't know which sources influenced answers

### Alerting & Response

* **No alerts for failures** - Don't know when system breaks
* **Missing SLA tracking** - Can't measure uptime and performance
* **Incident investigation** - Lack of historical data to diagnose issues

## Solutions in This Section

Browse these guides to improve RAG observability:

* [Retrieval Stage Debugging](https://help.twig.so/rag-scenarios-and-solutions/monitoring/retrieval-debugging)
* [Embedding Quality Metrics](https://help.twig.so/rag-scenarios-and-solutions/monitoring/embedding-metrics)
* [Reranking Score Analysis](https://help.twig.so/rag-scenarios-and-solutions/monitoring/reranking-analysis)
* [Context Window Utilization](https://help.twig.so/rag-scenarios-and-solutions/monitoring/context-utilization)
* [Agent Decision Tracing](https://help.twig.so/rag-scenarios-and-solutions/monitoring/agent-tracing)
* [Query Understanding Logs](https://help.twig.so/rag-scenarios-and-solutions/monitoring/query-logs)
* [Source Attribution Tracking](https://help.twig.so/rag-scenarios-and-solutions/monitoring/source-tracking)

## Observability Layers

Monitor your RAG system at multiple levels:

### 1. Request-Level Tracing

Track every user query end-to-end:

```
Request ID: req_abc123
User: user@example.com
Query: "How do I reset my password?"
Timestamp: 2024-01-15T10:30:00Z

Pipeline Stages:
├─ Query Enhancement (15ms)
│  ├─ Original: "How do I reset my password?"
│  └─ Enhanced: "password reset procedure steps"
│
├─ Embedding (120ms)
│  ├─ Model: text-embedding-3-small
│  ├─ Dimensions: 1536
│  └─ Cost: $0.0001
│
├─ Vector Search (45ms)
│  ├─ Top 20 candidates retrieved
│  ├─ Similarity range: 0.72 - 0.89
│  └─ Cost: $0.0001
│
├─ Reranking (200ms)
│  ├─ Model: cross-encoder
│  ├─ Top 5 after reranking
│  └─ Score range: 0.81 - 0.94
│
├─ Context Assembly (10ms)
│  ├─ Chunks: 5
│  ├─ Total tokens: 1,200
│  └─ Sources: 3 documents
│
└─ LLM Generation (2,300ms)
   ├─ Model: gpt-4-turbo
   ├─ Input tokens: 1,250
   ├─ Output tokens: 180
   ├─ Cost: $0.018
   └─ Citations: 2

Total Latency: 2,690ms
Total Cost: $0.0182
Result: Success
User Feedback: 👍
```

**Key benefits:**

* Full visibility into every stage
* Performance bottleneck identification
* Cost attribution per request
* Debugging with complete context
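
A minimal sketch of how such a trace can be accumulated in application code. The `PipelineTrace` class and stage names are illustrative, not a specific library's API:

```python
import time
import uuid
from contextlib import contextmanager

class PipelineTrace:
    """Accumulates per-stage timing and metadata for one request (illustrative)."""

    def __init__(self, query: str):
        self.request_id = f"req_{uuid.uuid4().hex[:8]}"
        self.query = query
        self.stages: list[dict] = []

    @contextmanager
    def stage(self, name: str, **metadata):
        start = time.perf_counter()
        try:
            yield metadata  # stage code can attach results, e.g. metadata["cost"] = 0.0001
        finally:
            duration_ms = round((time.perf_counter() - start) * 1000, 1)
            self.stages.append({"stage": name, "duration_ms": duration_ms, **metadata})

# Usage inside the pipeline:
trace = PipelineTrace("How do I reset my password?")
with trace.stage("embedding", model="text-embedding-3-small") as meta:
    meta["dimensions"] = 1536  # recorded alongside the stage timing
```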

### 2. Component-Level Metrics

Track performance of each pipeline stage:

| Component            | Metrics to Track                                                          |
| -------------------- | ------------------------------------------------------------------------- |
| **Query Processing** | Parse success rate, enhancement frequency, spell-correction rate          |
| **Embeddings**       | Latency P50/P95/P99, batch size, cost per query, model version            |
| **Vector Search**    | Query latency, candidate count, similarity score distribution, index size |
| **Filtering**        | Documents filtered out, permission check latency, filter effectiveness    |
| **Reranking**        | Latency, score change vs initial ranking, reranker model version          |
| **Context Assembly** | Token count, chunk count, truncation rate, assembly latency               |
| **LLM**              | Generation latency, input/output tokens, cost per query, model version    |
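
One way to record these is as labeled Prometheus metrics via `prometheus_client`; the metric and label names below are illustrative:

```python
from prometheus_client import Counter, Histogram

# Per-stage latency histogram, labeled by pipeline stage (names are illustrative)
STAGE_LATENCY = Histogram(
    "rag_stage_latency_seconds",
    "Latency of each RAG pipeline stage",
    ["stage"],
)

# Token counters for cost tracking, labeled by direction ("input" / "output")
LLM_TOKENS = Counter(
    "rag_llm_tokens_total",
    "Tokens consumed by the LLM",
    ["direction"],
)

# Recording observations from inside the pipeline:
STAGE_LATENCY.labels(stage="vector_search").observe(0.045)
LLM_TOKENS.labels(direction="input").inc(1250)
LLM_TOKENS.labels(direction="output").inc(180)
```

Histograms let your monitoring backend compute P50/P95/P99 with its own quantile functions, so you don't calculate percentiles in application code.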

### 3. System-Level Health

Monitor overall system performance:

* **Throughput**: Queries per second, per minute, per hour
* **Latency**: P50, P95, P99 end-to-end response times
* **Error rate**: % of requests that fail, by error type
* **Availability**: Uptime, SLA compliance
* **Cost**: Total spend, cost per query, trend over time

### 4. Quality Metrics

Track answer and retrieval quality:

* **User feedback**: Thumbs up/down, ratings, explicit feedback
* **Retrieval relevance**: Manual review of retrieved docs
* **Citation accuracy**: Are citations correct and helpful?
* **Groundedness**: Are answers supported by retrieved context?
* **Consistency**: Do similar queries get similar answers?

## Best Practices

### Structured Logging

Use consistent, parseable log formats:

```json
{
  "request_id": "req_abc123",
  "timestamp": "2024-01-15T10:30:00Z",
  "user_id": "user@example.com",
  "agent_id": "support_agent_v2",
  "stage": "vector_search",
  "duration_ms": 45,
  "status": "success",
  "metadata": {
    "candidates_count": 20,
    "top_similarity": 0.89,
    "bottom_similarity": 0.72,
    "index_version": "v2024-01-10"
  }
}
```

**Benefits:**

* Easy to parse and analyze
* Consistent across all stages
* Rich context for debugging
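
A minimal sketch of a helper that emits one JSON line per stage in the shape above; the logger setup is illustrative, and most teams would route this through their existing logging stack:

```python
import json
import logging
import sys
from datetime import datetime, timezone

logger = logging.getLogger("rag_pipeline")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

def log_stage(request_id: str, stage: str, duration_ms: int, status: str, **metadata):
    """Emit one JSON line per pipeline stage so logs stay machine-parseable."""
    logger.info(json.dumps({
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "stage": stage,
        "duration_ms": duration_ms,
        "status": status,
        "metadata": metadata,
    }))

log_stage("req_abc123", "vector_search", 45, "success",
          candidates_count=20, top_similarity=0.89)
```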

### Distributed Tracing

Use trace IDs to follow requests across services:

```
Trace ID: trace_xyz789

Service: API Gateway → trace_xyz789
  ↓
Service: Query Processor → trace_xyz789
  ↓
Service: Vector Search → trace_xyz789
  ↓
Service: LLM Service → trace_xyz789
```

Tools: OpenTelemetry, Jaeger, Zipkin, Datadog APM
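
A minimal OpenTelemetry sketch: nested spans share one trace ID automatically, which yields the end-to-end view above. The span names and attributes are illustrative:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# One-time setup: print spans to stdout (swap in an OTLP exporter for production)
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag.pipeline")

# Each nested stage becomes a child span under one shared trace ID
with tracer.start_as_current_span("rag_request") as root:
    root.set_attribute("query", "How do I reset my password?")
    with tracer.start_as_current_span("vector_search") as span:
        span.set_attribute("candidates_count", 20)
```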

### Real-Time Dashboards

Build dashboards for instant visibility:

**Operations Dashboard:**

* Queries per minute (last hour)
* P95 latency (last hour)
* Error rate (last hour)
* Current cost rate ($/hour)
* Top queries and agents

**Quality Dashboard:**

* User satisfaction score (last 24h)
* Retrieval zero-result rate
* Hallucination detection rate
* Citation accuracy score

**Cost Dashboard:**

* Total spend today/week/month
* Cost by component (embedding, vector, LLM)
* Cost per query trend
* Top cost-driving queries

### Alerting Strategy

Set up proactive alerts:

| Alert                | Threshold                         | Action                           |
| -------------------- | --------------------------------- | -------------------------------- |
| **High error rate**  | >5% for 5 min                     | Page on-call engineer            |
| **Slow queries**     | P95 >10s for 5 min                | Investigate performance          |
| **Cost spike**       | >2x normal rate for 30 min        | Check for abuse, runaway queries |
| **Low satisfaction** | <60% positive feedback for 1 hour | Review recent failures           |
| **Zero results**     | >20% queries with no retrieval    | Check index health               |
| **LLM failures**     | >10% generation failures          | Check LLM API status             |
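
Alert rules usually live in your monitoring system (e.g., Prometheus Alertmanager) rather than application code, but the underlying logic is a rate over a sliding window; a sketch of the first rule:

```python
import time
from collections import deque

class ErrorRateAlert:
    """Fires when the error rate over a sliding window exceeds a threshold (illustrative)."""

    def __init__(self, threshold: float = 0.05, window_seconds: int = 300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events: deque = deque()  # (timestamp, is_error) pairs

    def record(self, is_error: bool) -> bool:
        now = time.time()
        self.events.append((now, is_error))
        while self.events and self.events[0][0] < now - self.window_seconds:
            self.events.popleft()  # drop events outside the 5-minute window
        error_rate = sum(1 for _, e in self.events if e) / len(self.events)
        return error_rate > self.threshold  # True means: page the on-call engineer
```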

### Data Retention

Balance storage costs with debugging needs:

* **Full traces**: 7 days (for detailed debugging)
* **Aggregated metrics**: 90 days (for trend analysis)
* **User feedback**: 1 year (for quality tracking)
* **Error logs**: 30 days (for incident investigation)
* **Cost data**: Forever (for financial reporting)

## Key Metrics to Track

### Retrieval Metrics

**Candidate Retrieval:**

* Candidates retrieved per query (avg, P95)
* Similarity score distribution (min, median, max)
* Zero-result query rate (%)
* Retrieval latency (P50, P95, P99)
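
Given latency and candidate-count samples from your logs, these reduce to simple aggregations; a sketch with illustrative values:

```python
import numpy as np

# Per-query samples pulled from retrieval-stage logs (values are illustrative)
latencies_ms = np.array([38, 41, 45, 52, 48, 120, 44, 47, 300, 43])
candidate_counts = np.array([20, 20, 18, 20, 0, 20, 20, 15, 20, 20])

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
zero_result_rate = np.mean(candidate_counts == 0)

print(f"latency P50={p50:.0f}ms P95={p95:.1f}ms P99={p99:.1f}ms")
print(f"zero-result rate: {zero_result_rate:.0%}")
```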

**Reranking:**

* Reranking latency (P50, P95, P99)
* Score change after reranking (avg improvement)
* Rank change after reranking (avg position change)
* Reranker agreement with embedding similarity

**Final Context:**

* Chunks included in context (avg, P95)
* Tokens sent to LLM (avg, P95, max)
* Context truncation rate (%)
* Sources per query (avg)

### LLM Metrics

**Performance:**

* Time to first token (TTFT)
* Tokens per second (generation speed)
* Total generation latency
* Input tokens, output tokens, total tokens

**Cost:**

* Cost per query (avg, P95)
* Total cost per hour/day
* Cost by model
* Cost trend over time
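
Cost per query falls out of the token counts you already log; a sketch, using per-1K-token prices that are illustrative rather than current published rates:

```python
# Illustrative per-1K-token prices; substitute your provider's current rates
PRICE_PER_1K_INPUT = 0.01
PRICE_PER_1K_OUTPUT = 0.03

def query_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

print(f"${query_cost(1250, 180):.4f}")  # ~$0.018, as in the trace example above
```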

**Quality:**

* User satisfaction (thumbs up rate)
* Citation accuracy (% correct)
* Groundedness score (% claims supported)
* Hallucination detection rate

### System Metrics

**Availability:**

* Uptime percentage (target: 99.9%)
* Error rate by type
* Successful query rate

**Performance:**

* End-to-end latency (P50, P95, P99)
* Throughput (queries per second)
* Concurrent query capacity

**Reliability:**

* Retry rate (%)
* Timeout rate (%)
* Fallback activation rate (%)

## Debugging Workflows

### Debugging a Failed Query

1. **Identify the failure** - Error message, user report, alert
2. **Find the request** - Look up by request ID, user ID, or timestamp
3. **Review full trace** - Examine each pipeline stage
4. **Isolate the failure point** - Which stage failed or returned poor results?
5. **Inspect inputs/outputs** - What went into and out of that stage?
6. **Reproduce locally** - Try to recreate the failure
7. **Fix and validate** - Implement fix, test with same query
8. **Monitor for recurrence** - Watch for similar failures
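
With structured logs in place (see above), steps 2 and 3 reduce to filtering your log store by request ID; a minimal sketch against newline-delimited JSON logs, with an illustrative file path:

```python
import json

def load_trace(request_id: str, log_path: str = "pipeline.log") -> list[dict]:
    """Reassemble a request's full trace from newline-delimited JSON logs."""
    with open(log_path) as f:
        events = [json.loads(line) for line in f if line.strip()]
    matching = [e for e in events if e.get("request_id") == request_id]
    return sorted(matching, key=lambda e: e["timestamp"])

for event in load_trace("req_abc123"):
    print(f'{event["stage"]:<20} {event["duration_ms"]:>6}ms  {event["status"]}')
```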

### Debugging Poor Answer Quality

1. **Review retrieved context** - Were relevant documents found?
2. **Check similarity scores** - Were scores reasonable?
3. **Examine reranking** - Did reranking improve or hurt results?
4. **Inspect final context** - What did the LLM actually see?
5. **Compare to ground truth** - What should have been retrieved?
6. **Identify root cause** - Query issue? Retrieval issue? LLM issue?
7. **Implement fix** - Adjust chunking, embeddings, prompts, etc.
8. **Validate improvement** - Test with similar queries

### Debugging Performance Issues

1. **Identify bottleneck** - Which stage is slowest?
2. **Check resource utilization** - CPU, memory, network
3. **Review query patterns** - Any unusual or expensive queries?
4. **Test component in isolation** - Validate latency outside full pipeline
5. **Optimize or scale** - Caching, batching, more replicas
6. **Measure improvement** - Confirm latency reduction
7. **Monitor under load** - Ensure fix holds at scale
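
For step 4, a small harness that times a single component outside the full pipeline is often enough; a sketch (`embed` in the usage comment stands in for whatever stage you are isolating):

```python
import time
import numpy as np

def benchmark(fn, *args, runs: int = 50) -> dict:
    """Time one pipeline component in isolation and report percentiles."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1000)
    p50, p95 = np.percentile(samples, [50, 95])
    return {"p50_ms": round(p50, 1), "p95_ms": round(p95, 1)}

# e.g. benchmark(embed, "How do I reset my password?") for the embedding stage
```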

## Advanced Observability

### Semantic Monitoring

Track retrieval quality automatically:

1. **Generate test queries** - Representative questions with known answers
2. **Run queries regularly** - Hourly or daily
3. **Evaluate results** - Are correct documents retrieved?
4. **Alert on degradation** - Notify when quality drops
5. **Investigate root cause** - What changed? Embeddings? Index? Content?
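
A minimal sketch of such a check; the test set, the `search` callable, and the baseline are stand-ins for your own retrieval function and data:

```python
# Representative queries with known-correct documents (illustrative)
TEST_SET = [
    {"query": "How do I reset my password?", "expected_doc": "kb/password-reset"},
    {"query": "How do I cancel my plan?", "expected_doc": "kb/cancellation"},
]

def recall_at_k(search, k: int = 5) -> float:
    """Fraction of test queries whose expected doc appears in the top-k results."""
    hits = sum(
        case["expected_doc"] in search(case["query"], top_k=k)  # search returns doc IDs
        for case in TEST_SET
    )
    return hits / len(TEST_SET)

def run_check(search, baseline: float = 0.9) -> None:
    score = recall_at_k(search)
    if score < baseline:  # step 4: alert on degradation
        raise RuntimeError(f"Retrieval quality degraded: recall@5 = {score:.0%}")
```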

### User Behavior Analytics

Understand how users interact:

* **Query patterns**: Most common queries, query length distribution
* **Session analysis**: Queries per session, follow-up patterns
* **User segments**: Power users vs casual users, by department/role
* **Drop-off points**: Where do users abandon?
* **Feature usage**: Which agents, query types used most

### Cost Attribution

Track costs by dimension:

* **By user**: Who are the most expensive users?
* **By agent**: Which agents cost most to run?
* **By time**: When are costs highest? (Time of day, day of week)
* **By component**: Embedding vs vector search vs LLM
* **By query type**: Which types of queries cost most?
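
If each request trace records these dimensions alongside its cost, attribution is a group-by; a sketch using pandas with illustrative records:

```python
import pandas as pd

# Per-request cost records exported from the trace store (values are illustrative)
records = pd.DataFrame([
    {"user": "a@example.com", "agent": "support_agent_v2", "component": "llm", "cost": 0.0180},
    {"user": "a@example.com", "agent": "support_agent_v2", "component": "embedding", "cost": 0.0001},
    {"user": "b@example.com", "agent": "sales_agent_v1", "component": "llm", "cost": 0.0250},
])

# Slice total spend along any dimension you log
print(records.groupby("user")["cost"].sum().sort_values(ascending=False))
print(records.groupby("component")["cost"].sum())
```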

### A/B Testing Framework

Compare variants scientifically:

```
Variant A (50% traffic): Current embedding model
Variant B (50% traffic): New embedding model

Metrics:
├─ Retrieval quality: A=0.78, B=0.82 (B wins)
├─ Latency: A=120ms, B=95ms (B wins)
├─ Cost: A=$0.0001, B=$0.00015 (A wins)
└─ User satisfaction: A=75%, B=82% (B wins)

Decision: Deploy Variant B (quality and latency gains outweigh cost)
```
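
For a stable split, assign each user to a variant deterministically so repeat queries land in the same bucket; a minimal sketch:

```python
import hashlib

def assign_variant(user_id: str, split: float = 0.5) -> str:
    """Deterministically bucket a user so they always see the same variant."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # maps the hash to [0, 1]
    return "A" if bucket < split else "B"

print(assign_variant("user@example.com"))  # stable across sessions, no state to store
```

Hashing the user ID avoids storing assignments and keeps the split consistent as traffic grows.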

## Quick Diagnostics

**Signs your observability needs improvement:**

* ✗ Can't explain why a query failed
* ✗ Don't know which component is slow
* ✗ Discover issues only from user complaints
* ✗ Can't reproduce failures
* ✗ Unclear what changed when performance degraded
* ✗ Surprise cost overruns
* ✗ No visibility into what LLM sees

**Signs your observability is working:**

* ✓ Full request traces for every query
* ✓ Real-time dashboards show system health
* ✓ Alerts notify before users complain
* ✓ Easy to debug any failure with logs
* ✓ Track quality trends over time
* ✓ Cost is predictable and understood
* ✓ Can measure impact of every change

**Bottom line**: Observability is not optional for production RAG systems. Build it in from day one. The time spent instrumenting your pipeline will pay for itself many times over in faster debugging, better performance, and higher quality.
