Evaluation Framework

Measure and improve AI agent performance with automated evaluation tests and continuous quality monitoring.

What are Evals?

Evals (evaluations) are automated tests that measure how well your AI agents perform. They help you:

  • Track accuracy over time

  • Compare agent configurations

  • Identify knowledge gaps

  • Validate changes before deployment

  • Ensure consistent quality

Think of evals as unit tests for AI responses.

Why Evals Matter

Traditional Testing vs AI Evals

Traditional Software:

def test_add():
    assert add(2, 3) == 5  # Deterministic

AI Systems:
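
The same style of assertion breaks down for AI output, because the wording changes between runs. A minimal sketch, assuming a hypothetical agent_respond() function:

def test_refund_policy():
    response = agent_respond("What is your refund policy?")
    # Exact-match assertions are brittle: the agent may answer
    # "Refunds are available within 30 days" on one run and
    # "You can request a refund up to 30 days after purchase" on the next.
    assert response == "Refunds are available within 30 days."  # Fails intermittently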

Challenges:

  • Responses are non-deterministic

  • Multiple valid answers exist

  • Quality is subjective

  • Context-dependent correctness

Evals solve this by using multiple evaluation criteria and techniques.
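
For example, instead of exact matching, a test case can score a response on several criteria and pass only if each score clears a threshold. A minimal sketch, assuming hypothetical scoring helpers:

def evaluate(response: str, test_case: dict) -> bool:
    # Each scorer returns a value in [0.0, 1.0]; the helper functions are hypothetical.
    scores = {
        "relevance": score_relevance(test_case["query"], response),
        "accuracy": score_accuracy(response, test_case["ground_truth"]),
        "completeness": score_completeness(response, test_case["key_points"]),
    }
    # Pass only if every metric clears its threshold.
    return all(scores[name] >= threshold
               for name, threshold in test_case["thresholds"].items())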

Evaluation Metrics

1. Relevance

Measures: How relevant is the response to the query?

Scale: 0.0 (irrelevant) to 1.0 (perfectly relevant)

Example:

How it's measured:

  • Semantic similarity between query and response

  • Keyword matching

  • Topic alignment

  • User feedback (optional)
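
A minimal sketch of the keyword-matching approach, using TF-IDF cosine similarity from scikit-learn as a rough stand-in for semantic similarity (production systems typically use embedding models):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def relevance_score(query: str, response: str) -> float:
    # Vectorize query and response together so they share a vocabulary.
    vectors = TfidfVectorizer().fit_transform([query, response])
    # Cosine similarity in [0.0, 1.0]: higher means more lexical overlap.
    return float(cosine_similarity(vectors[0], vectors[1])[0][0])

print(relevance_score("How do I reset my password?",
                      "To reset your password, open Settings > Security."))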

2. Factual Accuracy

Measures: Are the facts in the response correct?

Scale: 0.0 (incorrect) to 1.0 (completely accurate)

Example:

How it's measured:

  • Comparison with ground truth data

  • Fact extraction and verification

  • Citation checking

  • Entity recognition
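
A simplified sketch of the ground-truth comparison, assuming each test case lists the facts a response must state correctly (real implementations extract and verify facts with NLP models rather than substring checks):

def accuracy_score(response: str, ground_truth: dict) -> float:
    # ground_truth maps a fact label to the value expected in the response,
    # e.g. {"refund window": "30 days", "support email": "help@example.com"}.
    if not ground_truth:
        return 1.0
    correct = sum(1 for value in ground_truth.values() if value in response)
    return correct / len(ground_truth)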

3. Completeness

Measures: Does the response cover all aspects of the query?

Scale: 0.0 (incomplete) to 1.0 (comprehensive)

Example:

How it's measured:

  • Coverage of expected key points

  • Presence of required information

  • Depth of explanation
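
A minimal sketch of key-point coverage, assuming each test case defines the points a complete answer should mention:

def completeness_score(response: str, key_points: list[str]) -> float:
    # Fraction of expected key points mentioned in the response.
    response_lower = response.lower()
    covered = [point for point in key_points if point.lower() in response_lower]
    return len(covered) / len(key_points) if key_points else 1.0

# e.g. key_points = ["pricing tiers", "free trial", "cancellation"]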

4. Citation Quality

Measures: Are sources properly cited and relevant?

Scale: 0.0 (poor/missing) to 1.0 (excellent)

Example:

How it's measured:

  • Presence of citations

  • Relevance of cited sources

  • Accuracy of source references

  • Citation formatting
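
A rough sketch of citation checking, assuming citations appear as bracketed source IDs such as [doc-12] and each test case lists which sources are actually relevant:

import re

def citation_score(response: str, relevant_sources: set[str]) -> float:
    # Citations are assumed to look like [doc-12]; adjust the pattern
    # to whatever citation format your agent emits.
    cited = set(re.findall(r"\[([\w-]+)\]", response))
    if not cited:
        return 0.0  # No citations at all.
    # Fraction of citations that point at genuinely relevant sources.
    return len(cited & relevant_sources) / len(cited)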

5. Response Time

Measures: How fast is the response generated?

Example:
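
A minimal sketch of latency measurement, assuming a hypothetical agent.query() call and reporting the mean and 95th percentile over a set of queries:

import statistics
import time

def measure_latency(agent, queries: list[str]) -> dict:
    timings = []
    for query in queries:
        start = time.perf_counter()
        agent.query(query)                       # Hypothetical agent call.
        timings.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(timings),
        "p95_s": statistics.quantiles(timings, n=20)[18],  # 95th percentile
    }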

6. Tone Appropriateness

Measures: Does the response match the desired tone?

Example:
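
Tone is usually scored by an LLM judge; the much cruder sketch below only flags words that break a desired formal tone, to show the shape of the check:

CASUAL_MARKERS = {"hey", "gonna", "wanna", "lol", "btw"}  # Illustrative list.

def tone_score(response: str, banned_words: set[str] = CASUAL_MARKERS) -> float:
    words = {word.strip(".,!?").lower() for word in response.split()}
    violations = words & banned_words
    # Each violation costs 0.2, floored at 0.0.
    return max(0.0, 1.0 - 0.2 * len(violations))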

Creating Eval Sets

Eval Set Structure

An eval set consists of test cases:
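
One possible structure, sketched with Python dataclasses (the field names are assumptions, not a fixed schema):

from dataclasses import dataclass, field

@dataclass
class TestCase:
    query: str                      # The input sent to the agent.
    ground_truth: str               # Reference answer or key facts.
    key_points: list[str]           # Topics a complete answer must cover.
    min_scores: dict[str, float]    # Per-metric acceptance thresholds.

@dataclass
class EvalSet:
    name: str
    test_cases: list[TestCase] = field(default_factory=list)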

Best Practices for Test Cases

1. Cover Key Scenarios

2. Include Ground Truth

3. Define Acceptance Criteria
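
Tying these together, a single hypothetical test case built on the TestCase sketch above might look like this:

refund_case = TestCase(
    query="What is your refund policy?",
    ground_truth="Refunds are available within 30 days of purchase.",
    key_points=["30 days", "original payment method", "how to request"],
    min_scores={"relevance": 0.8, "accuracy": 0.9, "completeness": 0.7},
)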

Creating Eval Sets from Real Data

Method 1: From Inbox

Method 2: From Analytics

Method 3: Manual Creation

Running Evaluations

Manual Evaluation

Via UI:

Via API:

Automated Evaluation

Scheduled Evals:

CI/CD Integration:

Regression Testing

Before Deployment:

Example:
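
A sketch of a regression gate that could also run as a CI step: score the candidate configuration on the eval set and fail the build if any metric drops more than a small tolerance below the stored baseline (run_evals() and the baseline file are assumptions):

import json
import sys

TOLERANCE = 0.02  # Allowed drop per metric before the gate fails.

def regression_gate(baseline_path: str = "eval_baseline.json") -> None:
    with open(baseline_path) as f:
        baseline = json.load(f)          # e.g. {"relevance": 0.86, "accuracy": 0.91}
    current = run_evals()                # Hypothetical: returns the same metric dict.
    regressions = {
        metric: (baseline[metric], score)
        for metric, score in current.items()
        if score < baseline.get(metric, 0.0) - TOLERANCE
    }
    if regressions:
        print(f"Eval regression detected: {regressions}")
        sys.exit(1)                      # Non-zero exit blocks the deployment.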

Evaluation Results

Results Dashboard

Key Metrics Display:

Per-Test-Case Results

Failure Analysis

Failed Test Example:

Advanced Evaluation Techniques

Human-in-the-Loop Evals

Combine automated + human review:

Example:
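
A sketch of the routing logic: automated scores settle the clear cases, and borderline ones are queued for a human reviewer (the queue and the automated score are assumptions):

def triage(test_case, response, auto_score: float, review_queue: list) -> str:
    # Clear passes and clear failures are settled automatically;
    # borderline scores go to a human reviewer.
    if auto_score >= 0.85:
        return "pass"
    if auto_score < 0.60:
        return "fail"
    review_queue.append({"case": test_case, "response": response,
                         "auto_score": auto_score})
    return "needs_human_review"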

A/B Testing with Evals

Compare two agent versions:
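
A minimal sketch: run the same eval set through both versions and compare mean scores (evaluate_agent() is an assumption standing in for your scoring pipeline; a significance test is advisable before declaring a winner):

from statistics import mean

def compare_versions(eval_set, agent_a, agent_b) -> dict:
    results = {}
    for name, agent in (("A", agent_a), ("B", agent_b)):
        # evaluate_agent() is assumed to return one score per test case.
        scores = [evaluate_agent(agent, case) for case in eval_set.test_cases]
        results[name] = mean(scores)
    results["winner"] = max(("A", "B"), key=lambda v: results[v])
    return results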

Continuous Evaluation

Monitor production in real time:
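
A sketch of the idea: sample a fraction of live traffic, score each sampled response with any automated metric, and emit the score to whatever monitoring system you already use (the sampling rate and emit_metric() hook are assumptions):

import random

SAMPLE_RATE = 0.05  # Score roughly 5% of production traffic.

def maybe_evaluate(query: str, response: str) -> None:
    if random.random() > SAMPLE_RATE:
        return
    score = relevance_score(query, response)   # Reuse any automated metric, e.g. the relevance sketch above.
    emit_metric("agent.relevance", score)      # Hypothetical monitoring hook.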

Comparative Evaluation

Compare against competitors or baselines:

Evaluation Best Practices

1. Representative Test Sets

✅ Cover common query types (80% coverage)
✅ Include edge cases (15%)
✅ Test failure modes (5%)
✅ Update regularly with new patterns
❌ Don't test only happy path

2. Multiple Metrics

✅ Use 3-5 complementary metrics
✅ Weight metrics by importance
✅ Track trends over time
❌ Don't rely on single metric

3. Regular Cadence

✅ Daily automated evals
✅ Weekly human review
✅ Before/after every change
❌ Don't eval only at launch

4. Actionable Results

✅ Identify specific failures
✅ Suggest improvements
✅ Track progress on fixes
❌ Don't just collect scores

5. Version Control

✅ Track eval set versions
✅ Document changes
✅ Maintain baselines
❌ Don't modify tests without tracking

Integration with Development Workflow

Development Cycle

Pre-Commit Hook

Metrics Tracking Over Time

Trend Analysis

Degradation Alerts
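
A sketch of a degradation check over the daily score history: compare the most recent window against the prior window and alert when the average drops by more than a set margin (the history format and the alert() call are assumptions):

from statistics import mean

def check_degradation(daily_scores: list[float], window: int = 7,
                      max_drop: float = 0.05) -> bool:
    if len(daily_scores) < 2 * window:
        return False                      # Not enough history yet.
    recent = mean(daily_scores[-window:])
    previous = mean(daily_scores[-2 * window:-window])
    if previous - recent > max_drop:
        alert(f"Eval score dropped from {previous:.2f} to {recent:.2f}")  # Hypothetical alert hook.
        return True
    return False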

Cost-Benefit Analysis

Evaluation Costs

Example Eval Sets

Basic Product Knowledge

Technical Documentation

Customer Support

Next Steps
