# Evaluation Framework

Measure and improve AI agent performance with automated evaluation tests and continuous quality monitoring.

## What are Evals?

**Evals** (evaluations) are automated tests that measure how well your AI agents perform. They help you:

* Track accuracy over time
* Compare agent configurations
* Identify knowledge gaps
* Validate changes before deployment
* Ensure consistent quality

Think of evals as unit tests for AI responses.

## Why Evals Matter

### Traditional Testing vs AI Evals

**Traditional Software:**

```python
def test_add():
    assert add(2, 3) == 5  # Deterministic
```

**AI Systems:**

```python
def test_agent():
    response = agent.query("What is pricing?")
    # Response varies, need nuanced evaluation
    # Multiple correct answers possible
    # Context and tone matter
```

**Challenges:**

* Responses are non-deterministic
* Multiple valid answers exist
* Quality is subjective
* Context-dependent correctness

**Evals solve this** by using multiple evaluation criteria and techniques.

## Evaluation Metrics

### 1. Relevance

**Measures**: How relevant is the response to the query?

**Scale**: 0.0 (irrelevant) to 1.0 (perfectly relevant)

**Example:**

```
Query: "What is the refund policy?"
Response: "We offer 30-day money-back guarantee..."
Relevance: 0.95 ✅

Query: "What is the refund policy?"
Response: "Our support hours are 9am-5pm..."
Relevance: 0.20 ❌
```

**How it's measured:**

* Semantic similarity between query and response
* Keyword matching
* Topic alignment
* User feedback (optional)

### 2. Factual Accuracy

**Measures**: Are the facts in the response correct?

**Scale**: 0.0 (incorrect) to 1.0 (completely accurate)

**Example:**

```
Query: "How many users does the Pro plan include?"
Ground Truth: "Up to 50 users"

Response: "The Pro plan includes up to 50 users."
Accuracy: 1.0 ✅

Response: "The Pro plan includes unlimited users."
Accuracy: 0.0 ❌
```

**How it's measured:**

* Comparison with ground truth data
* Fact extraction and verification
* Citation checking
* Entity recognition

### 3. Completeness

**Measures**: Does the response cover all aspects of the query?

**Scale**: 0.0 (incomplete) to 1.0 (comprehensive)

**Example:**

```
Query: "How do I integrate with Slack?"
Expected Coverage:
- Prerequisites (API key, Slack workspace admin)
- Installation steps
- Configuration
- Testing

Response covering all: 1.0 ✅
Response covering 2 of 4: 0.5 ⚠️
```

**How it's measured:**

* Coverage of expected key points
* Presence of required information
* Depth of explanation

### 4. Citation Quality

**Measures**: Are sources properly cited and relevant?

**Scale**: 0.0 (poor/missing) to 1.0 (excellent)

**Example:**

```
Response with citation:
"Our API supports OAuth 2.0 authentication.
[Source: API Documentation > Authentication]"
Citation Quality: 0.95 ✅

Response without citation:
"Our API supports OAuth 2.0 authentication."
Citation Quality: 0.0 ❌
```

**How it's measured:**

* Presence of citations
* Relevance of cited sources
* Accuracy of source references
* Citation formatting

### 5. Response Time

**Measures**: How fast is the response generated?

**Example:**

```
Target: < 2 seconds (p95)
Actual: 1.8 seconds ✅

Target: < 2 seconds
Actual: 3.5 seconds ❌
```

### 6. Tone Appropriateness

**Measures**: Does the response match the desired tone?

**Example:**

```
For Technical Agent:
"Utilize the authentication endpoint..." ✅
"Just hit the auth thingy..." ❌

For Friendly Agent:
"I'd be happy to help!" ✅
"Authenticate using bearer token." ⚠️
```

## Creating Eval Sets

### Eval Set Structure

An eval set consists of test cases:

```json
{
  "evalSet": {
    "name": "Product Knowledge - Pricing",
    "description": "Test agent knowledge of pricing",
    "createdAt": "2024-01-15",
    "testCases": [
      {
        "id": "tc-1",
        "query": "What does the Pro plan cost?",
        "expectedAnswer": "The Pro plan costs $99/month for up to 50 users",
        "acceptableAnswers": [
          "$99 per month",
          "$99/mo for 50 users",
          "99 dollars monthly"
        ],
        "requiredEntities": ["price", "user_limit"],
        "requiredCitations": ["pricing_page"],
        "maxResponseTime": 2000
      }
    ]
  }
}
```

### Best Practices for Test Cases

**1. Cover Key Scenarios**

```
✅ Common questions (80% of queries)
✅ Edge cases
✅ Complex multi-part queries
✅ Ambiguous questions
✅ Follow-up questions
```

**2. Include Ground Truth**

```json
{
  "query": "What's included in Enterprise?",
  "groundTruth": {
    "features": ["SSO", "API access", "Priority support"],
    "userLimit": "Unlimited",
    "price": "$299/month"
  }
}
```

**3. Define Acceptance Criteria**

```yaml
Required Elements:
  - Must mention price
  - Must mention user limit
  - Must cite pricing page

Acceptable Variations:
  - "$99" or "$99/month" or "99 dollars"
  - "50 users" or "up to 50 users"

Unacceptable:
  - Wrong price
  - Missing user limit
  - No citation
```

### Creating Eval Sets from Real Data

**Method 1: From Inbox**

```
1. Navigate to Inbox
2. Filter: Highly rated interactions
3. Select representative samples
4. Click "Add to Eval Set"
5. Review and confirm
```

**Method 2: From Analytics**

```
1. Analytics → Common Queries
2. Select top queries
3. Create test cases
4. Add ground truth
5. Save as eval set
```

**Method 3: Manual Creation**

```
1. Evaluations → Create New Eval Set
2. Add test cases one by one
3. Define expected answers
4. Set acceptance criteria
5. Save and run
```

## Running Evaluations

### Manual Evaluation

**Via UI:**

```
1. Navigate to Evaluations
2. Select eval set
3. Click "Run Evaluation"
4. Wait for completion
5. Review results
```

**Via API:**

```bash
curl -X POST https://api.twig.so/api/evaluations/run \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evalSetId": "eval-123",
    "agentId": "agent-456",
    "parallel": true
  }'
```

### Automated Evaluation

**Scheduled Evals:**

```typescript
{
  "schedule": {
    "frequency": "daily",
    "time": "03:00",
    "timezone": "UTC"
  },
  "evalSets": ["eval-123", "eval-124"],
  "agents": ["agent-456"],
  "notifyOn": ["degradation", "failure"]
}
```

**CI/CD Integration:**

```yaml
# .github/workflows/eval.yml
name: Run Evals on PR
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - name: Run Evaluations
        run: |
          curl -X POST https://api.twig.so/api/evaluations/run \
            -H "Authorization: Bearer ${{ secrets.TWIG_API_KEY }}" \
            -d '{"evalSetId": "eval-123", "agentId": "agent-456"}'
      
      - name: Check Results
        run: |
          # Fail if accuracy < 85%
          if [ $ACCURACY -lt 85 ]; then
            exit 1
          fi
```

### Regression Testing

**Before Deployment:**

```
1. Run eval set on current agent (baseline)
2. Make configuration changes
3. Run same eval set on modified agent
4. Compare results
5. Deploy only if metrics maintain/improve
```

**Example:**

```
Baseline (Current):
- Relevance: 0.89
- Accuracy: 0.92
- Response Time: 1.8s

After Config Change:
- Relevance: 0.91 ✅ (+2%)
- Accuracy: 0.88 ❌ (-4%)
- Response Time: 1.5s ✅ (+17% faster)

Decision: Don't deploy (accuracy regression)
```

## Evaluation Results

### Results Dashboard

**Key Metrics Display:**

```
Overall Score: 87/100

Breakdown:
├─ Relevance:     0.91 (91%)  ✅
├─ Accuracy:      0.85 (85%)  ⚠️
├─ Completeness:  0.89 (89%)  ✅
├─ Citations:     0.82 (82%)  ⚠️
└─ Avg Response:  2.1s        ✅

Pass Rate: 42/50 (84%)
```

### Per-Test-Case Results

```json
{
  "testCase": {
    "id": "tc-1",
    "query": "What does the Pro plan cost?",
    "response": "The Pro plan costs $99/month for up to 50 users.",
    "metrics": {
      "relevance": 0.95,
      "accuracy": 1.0,
      "completeness": 0.90,
      "citationQuality": 0.85,
      "responseTime": 1.8
    },
    "citations": [
      {
        "source": "Pricing Page",
        "relevance": 0.95
      }
    ],
    "status": "PASS"
  }
}
```

### Failure Analysis

**Failed Test Example:**

```json
{
  "testCase": "tc-15",
  "query": "How do I export data?",
  "expected": "Go to Settings > Data > Export",
  "actual": "You can manage your data in the settings panel.",
  "failureReason": "INCOMPLETE",
  "details": {
    "missingInfo": ["specific location", "export button"],
    "relevance": 0.65,
    "accuracy": 0.70,
    "completeness": 0.40
  },
  "suggestedFix": "Add data export documentation or improve retrieval"
}
```

## Advanced Evaluation Techniques

### Human-in-the-Loop Evals

**Combine automated + human review:**

```
1. Automated eval runs first (fast, cheap)
2. Flag uncertain cases (confidence < 0.8)
3. Human reviewers assess flagged cases
4. Human labels improve future automation
```

**Example:**

```
Automated: 80% confidence it's correct
→ Flag for human review
Human: Confirms it's actually incorrect
→ Update ground truth
→ Retrain evaluation model
```

### A/B Testing with Evals

**Compare two agent versions:**

```
Agent A (Redwood):
- Avg Relevance: 0.88
- Avg Speed: 1.2s
- Cost per 1k: $0.36

Agent B (Cedar):
- Avg Relevance: 0.92 (+5%)
- Avg Speed: 2.1s (-43%)
- Cost per 1k: $0.56 (+56%)

Decision: Deploy Cedar (relevance improvement worth cost)
```

### Continuous Evaluation

**Monitor production in real-time:**

```typescript
{
  "continuousEval": {
    "enabled": true,
    "sampleRate": 0.1,        // Eval 10% of queries
    "metrics": ["relevance", "accuracy"],
    "alertThresholds": {
      "relevance": 0.80,      // Alert if < 0.80
      "accuracy": 0.85
    },
    "notificationChannel": "slack"
  }
}
```

### Comparative Evaluation

**Compare against competitors or baselines:**

```
Query: "What is OAuth?"

Your Agent:
- Response quality: 0.90
- Response time: 1.8s

ChatGPT:
- Response quality: 0.88
- Response time: 2.3s

Benchmark:
Your agent outperforms ✅
```

## Evaluation Best Practices

### 1. Representative Test Sets

✅ Cover common query types (80% coverage) ✅ Include edge cases (15%) ✅ Test failure modes (5%) ✅ Update regularly with new patterns ❌ Don't test only happy path

### 2. Multiple Metrics

✅ Use 3-5 complementary metrics ✅ Weight metrics by importance ✅ Track trends over time ❌ Don't rely on single metric

### 3. Regular Cadence

✅ Daily automated evals ✅ Weekly human review ✅ Before/after every change ❌ Don't eval only at launch

### 4. Actionable Results

✅ Identify specific failures ✅ Suggest improvements ✅ Track progress on fixes ❌ Don't just collect scores

### 5. Version Control

✅ Track eval set versions ✅ Document changes ✅ Maintain baselines ❌ Don't modify tests without tracking

## Integration with Development Workflow

### Development Cycle

```
1. Develop agent config
     ↓
2. Run evals locally
     ↓
3. Fix issues
     ↓
4. Create PR
     ↓
5. Automated evals in CI
     ↓
6. Review results
     ↓
7. Deploy to staging
     ↓
8. Run full eval suite
     ↓
9. Deploy to production
     ↓
10. Monitor continuous evals
```

### Pre-Commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit

# Run quick eval set
echo "Running evaluations..."
result=$(curl -s https://api.twig.so/api/evaluations/run \
  -H "Authorization: Bearer $TWIG_API_KEY" \
  -d '{"evalSetId": "quick-eval", "agentId": "agent-123"}')

score=$(echo $result | jq '.overallScore')

if [ $score -lt 85 ]; then
  echo "❌ Eval score too low: $score"
  echo "Fix issues before committing"
  exit 1
fi

echo "✅ Evals passed: $score"
```

## Metrics Tracking Over Time

### Trend Analysis

```
Week 1:  Relevance: 0.85 ━━━━━━━━━━━━━━━━━━━━━━
Week 2:  Relevance: 0.87 ━━━━━━━━━━━━━━━━━━━━━━━━
Week 3:  Relevance: 0.89 ━━━━━━━━━━━━━━━━━━━━━━━━━━
Week 4:  Relevance: 0.91 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Trend: +6% over 4 weeks ✅
```

### Degradation Alerts

```
Current Score: 0.82
Baseline: 0.90
Threshold: 0.85

Alert: Score dropped 8% below baseline!
Possible causes:
- New data source with poor quality
- Model change
- Configuration drift
- Data source sync failure

Action: Investigate and rollback if needed
```

## Cost-Benefit Analysis

### Evaluation Costs

```
Time Investment:
- Initial setup: 4-8 hours
- Test case creation: 1 hour per 10 cases
- Maintenance: 2 hours/week

Monetary Costs:
- API calls for evals: $0.01 per test case
- 100 test cases × daily = $30/month

ROI:
- Catch issues before users: Priceless
- Prevent quality regressions: High value
- Enable confident deployments: Essential
```

## Example Eval Sets

### Basic Product Knowledge

```yaml
name: "Product Knowledge - Basic"
testCases:
  - query: "What is [Product]?"
    expected_topics: ["description", "key_features"]
  
  - query: "How much does it cost?"
    required_entities: ["price", "plan_name"]
    required_citation: true
```

### Technical Documentation

```yaml
name: "Technical - API"
testCases:
  - query: "How do I authenticate?"
    required_elements:
      - "API key"
      - "Authorization header"
      - "Bearer token"
    must_include_code_example: true
```

### Customer Support

```yaml
name: "Support - Common Issues"
testCases:
  - query: "I can't log in"
    response_must_include:
      - troubleshooting_steps
      - escalation_path
    tone: "empathetic"
```

## Next Steps

* [Performance Tuning](/product/monitoring/performance-tuning.md) - Optimize based on eval results
* [Analytics Dashboard](/product/monitoring/view-analytics.md) - Track production metrics
* [Inbox & Training](/product/monitoring/inbox-training.md) - Improve from real interactions
* [Agent Configuration](/product/overview/configuration.md) - Adjust based on findings


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/product/monitoring/evals.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.