Evaluation Framework

Measure and improve AI agent performance with automated evaluation tests and continuous quality monitoring.

What are Evals?

Evals (evaluations) are automated tests that measure how well your AI agents perform. They help you:

  • Track accuracy over time

  • Compare agent configurations

  • Identify knowledge gaps

  • Validate changes before deployment

  • Ensure consistent quality

Think of evals as unit tests for AI responses.

Why Evals Matter

Traditional Testing vs AI Evals

Traditional Software:

def test_add():
    assert add(2, 3) == 5  # Deterministic

AI Systems:
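
The same style of assertion breaks down for AI output, because the wording changes between runs. A minimal sketch, assuming a hypothetical agent_respond() function:

def test_refund_policy():
    response = agent_respond("What is your refund policy?")
    # Exact-match assertions are brittle: the agent may answer
    # "Refunds are available within 30 days" on one run and
    # "You can request a refund up to 30 days after purchase" on the next.
    assert response == "Refunds are available within 30 days."  # Fails intermittently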

Challenges:

  • Responses are non-deterministic

  • Multiple valid answers exist

  • Quality is subjective

  • Context-dependent correctness

Evals solve this by using multiple evaluation criteria and techniques.
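
For example, instead of exact matching, a test case can score a response on several criteria and pass only if each score clears a threshold. A minimal sketch, assuming hypothetical scoring helpers:

def evaluate(response: str, test_case: dict) -> bool:
    # Each scorer returns a value in [0.0, 1.0]; the helper functions are hypothetical.
    scores = {
        "relevance": score_relevance(test_case["query"], response),
        "accuracy": score_accuracy(response, test_case["ground_truth"]),
        "completeness": score_completeness(response, test_case["key_points"]),
    }
    # Pass only if every metric clears its threshold.
    return all(scores[name] >= threshold
               for name, threshold in test_case["thresholds"].items())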

Evaluation Metrics

1. Relevance

Measures: How relevant is the response to the query?

Scale: 0.0 (irrelevant) to 1.0 (perfectly relevant)

Example:

How it's measured:

  • Semantic similarity between query and response

  • Keyword matching

  • Topic alignment

  • User feedback (optional)
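
A minimal sketch of the keyword-matching approach, using TF-IDF cosine similarity from scikit-learn as a rough stand-in for semantic similarity (production systems typically use embedding models):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(query: str, response: str) -> float:
    # Vectorize query and response together so they share a vocabulary.
    vectors = TfidfVectorizer().fit_transform([query, response])
    # Cosine similarity in [0.0, 1.0]: higher means more lexical overlap.
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

print(relevance_score("How do I reset my password?",
                      "To reset your password, open Settings > Security."))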

2. Factual Accuracy

Measures: Are the facts in the response correct?

Scale: 0.0 (incorrect) to 1.0 (completely accurate)

Example:

How it's measured:

  • Comparison with ground truth data

  • Fact extraction and verification

  • Citation checking

  • Entity recognition
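
A simplified sketch of the ground-truth comparison, assuming each test case lists the facts a response must state correctly (real implementations extract and verify facts with NLP models rather than substring checks):

def accuracy_score(response: str, ground_truth: dict) -> float:
    # ground_truth maps a fact label to the value expected in the response,
    # e.g. {"refund window": "30 days", "support email": "help@example.com"}.
    if not ground_truth:
        return 1.0
    correct = sum(1 for value in ground_truth.values() if value in response)
    return correct / len(ground_truth)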

3. Completeness

Measures: Does the response cover all aspects of the query?

Scale: 0.0 (incomplete) to 1.0 (comprehensive)

Example:

How it's measured:

  • Coverage of expected key points

  • Presence of required information

  • Depth of explanation
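
A minimal sketch of key-point coverage, assuming each test case defines the points a complete answer should mention:

def completeness_score(response: str, key_points: list[str]) -> float:
    # Fraction of expected key points mentioned in the response.
    response_lower = response.lower()
    covered = [point for point in key_points if point.lower() in response_lower]
    return len(covered) / len(key_points) if key_points else 1.0

# e.g. key_points = ["pricing tiers", "free trial", "cancellation"]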

4. Citation Quality

Measures: Are sources properly cited and relevant?

Scale: 0.0 (poor/missing) to 1.0 (excellent)

Example:

How it's measured:

  • Presence of citations

  • Relevance of cited sources

  • Accuracy of source references

  • Citation formatting
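
A rough sketch of citation checking, assuming citations appear as bracketed source IDs such as [doc-12] and each test case lists which sources are actually relevant:

import re

def citation_score(response: str, relevant_sources: set[str]) -> float:
    # Citations are assumed to look like [doc-12]; adjust the pattern
    # to whatever citation format your agent emits.
    cited = set(re.findall(r"\[([\w-]+)\]", response))
    if not cited:
        return 0.0  # No citations at all.
    # Fraction of citations that point at genuinely relevant sources.
    return len(cited & relevant_sources) / len(cited)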

5. Response Time

Measures: How fast is the response generated?

Example:
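
A minimal sketch of latency measurement, assuming a hypothetical agent.query() call and reporting the mean and 95th percentile over a set of queries:

import statistics
import time

def measure_latency(agent, queries: list[str]) -> dict:
    timings = []
    for query in queries:
        start = time.perf_counter()
        agent.query(query)                       # Hypothetical agent call.
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "p95_s": statistics.quantiles(timings, n=20)[18],  # 95th percentile
    }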

6. Tone Appropriateness

Measures: Does the response match the desired tone?

Example:
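
Tone is usually scored by an LLM judge; the much cruder sketch below only flags words that break a desired formal tone, to show the shape of the check:

CASUAL_MARKERS = {"hey", "gonna", "wanna", "lol", "btw"}  # Illustrative list.

def tone_score(response: str, banned_words: set[str] = CASUAL_MARKERS) -> float:
    words = {word.strip(".,!?").lower() for word in response.split()}
    violations = words & banned_words
    # Each violation costs 0.2, floored at 0.0.
    return max(0.0, 1.0 - 0.2 * len(violations))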

Creating Eval Sets

Eval Set Structure

An eval set consists of test cases:
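
One possible structure, sketched with Python dataclasses (the field names are assumptions, not a fixed schema):

from dataclasses import dataclass, field

@dataclass
class TestCase:
    query: str                      # The input sent to the agent.
    ground_truth: str               # Reference answer or key facts.
    key_points: list[str]           # Topics a complete answer must cover.
    min_scores: dict[str, float]    # Per-metric acceptance thresholds.

@dataclass
class EvalSet:
    name: str
    test_cases: list[TestCase] = field(default_factory=list)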

Best Practices for Test Cases

1. Cover Key Scenarios

2. Include Ground Truth

3. Define Acceptance Criteria
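
Tying these together, a single hypothetical test case built on the TestCase sketch above might look like this:

refund_case = TestCase(
    query="What is your refund policy?",
    ground_truth="Refunds are available within 30 days of purchase.",
    key_points=["30 days", "original payment method", "how to request"],
    min_scores={"relevance": 0.8, "accuracy": 0.9, "completeness": 0.7},
)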

Creating Eval Sets from Real Data

Method 1: From Inbox

Method 2: From Analytics

Method 3: Manual Creation

Running Evaluations

Manual Evaluation

Via UI:

Via API:

Automated Evaluation

Scheduled Evals:

CI/CD Integration:

Regression Testing

Before Deployment:

Example:
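
A sketch of a regression gate that could also run as a CI step: score the candidate configuration on the eval set and fail the build if any metric drops more than a small tolerance below the stored baseline (run_evals() and the baseline file are assumptions):

import json
import sys

TOLERANCE = 0.02  # Allowed drop per metric before the gate fails.

def regression_gate(baseline_path: str = "eval_baseline.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)          # e.g. {"relevance": 0.86, "accuracy": 0.91}
    current = run_evals()                # Hypothetical: returns the same metric dict.
    regressions = {
        metric: (baseline[metric], score)
        for metric, score in current.items()
        if score < baseline.get(metric, 0.0) - TOLERANCE
    }
    if regressions:
        print(f"Eval regression detected: {regressions}")
        sys.exit(1)                      # Non-zero exit blocks the deployment.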

Evaluation Results

Results Dashboard

Key Metrics Display:

Per-Test-Case Results

Failure Analysis

Failed Test Example:

Advanced Evaluation Techniques

Human-in-the-Loop Evals

Combine automated + human review:

Example:
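
A sketch of the routing logic: automated scores settle the clear cases, and borderline ones are queued for a human reviewer (the queue and the automated score are assumptions):

def triage(test_case, response, auto_score: float, review_queue: list) -> str:
    # Clear passes and clear failures are settled automatically;
    # borderline scores go to a human reviewer.
    if auto_score >= 0.85:
        return "pass"
    if auto_score < 0.60:
        return "fail"
    review_queue.append({"case": test_case, "response": response,
                         "auto_score": auto_score})
    return "needs_human_review"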

A/B Testing with Evals

Compare two agent versions:
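
A minimal sketch: run the same eval set through both versions and compare mean scores (evaluate_agent() is an assumption standing in for your scoring pipeline; a significance test is advisable before declaring a winner):

from statistics import mean

def compare_versions(eval_set, agent_a, agent_b) -> dict:
    results = {}
    for name, agent in (("A", agent_a), ("B", agent_b)):
        # evaluate_agent() is assumed to return one score per test case.
        scores = [evaluate_agent(agent, case) for case in eval_set.test_cases]
        results[name] = mean(scores)
    results["winner"] = max(("A", "B"), key=lambda v: results[v])
    return results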

Continuous Evaluation

Monitor production in real time:
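
A sketch of the idea: sample a fraction of live traffic, score each sampled response with any automated metric, and emit the score to whatever monitoring system you already use (the sampling rate and emit_metric() hook are assumptions):

import random

SAMPLE_RATE = 0.05  # Score roughly 5% of production traffic.

def maybe_evaluate(query: str, response: str) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = relevance_score(query, response)   # Reuse any automated metric, e.g. the relevance sketch above.
    emit_metric("agent.relevance", score)      # Hypothetical monitoring hook.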

Comparative Evaluation

Compare against competitors or baselines:

Evaluation Best Practices

1. Representative Test Sets

✅ Cover common query types (80% coverage)
✅ Include edge cases (15%)
✅ Test failure modes (5%)
✅ Update regularly with new patterns
❌ Don't test only happy path

2. Multiple Metrics

✅ Use 3-5 complementary metrics
✅ Weight metrics by importance
✅ Track trends over time
❌ Don't rely on single metric

3. Regular Cadence

✅ Daily automated evals
✅ Weekly human review
✅ Before/after every change
❌ Don't eval only at launch

4. Actionable Results

✅ Identify specific failures
✅ Suggest improvements
✅ Track progress on fixes
❌ Don't just collect scores

5. Version Control

✅ Track eval set versions
✅ Document changes
✅ Maintain baselines
❌ Don't modify tests without tracking

Integration with Development Workflow

Development Cycle

Pre-Commit Hook

Metrics Tracking Over Time

Trend Analysis

Degradation Alerts
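
A sketch of a degradation check over the daily score history: compare the most recent window against the prior window and alert when the average drops by more than a set margin (the history format and the alert() call are assumptions):

from statistics import mean

def check_degradation(daily_scores: list[float], window: int = 7,
                      max_drop: float = 0.05) -> bool:
    if len(daily_scores) < 2 * window:
        return False                      # Not enough history yet.
    recent = mean(daily_scores[-window:])
    previous = mean(daily_scores[-2 * window:-window])
    if previous - recent > max_drop:
        alert(f"Eval score dropped from {previous:.2f} to {recent:.2f}")  # Hypothetical alert hook.
        return True
    return False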

Cost-Benefit Analysis

Evaluation Costs

Example Eval Sets

Basic Product Knowledge

Technical Documentation

Customer Support

Next Steps
