Performance Tuning

Optimize your AI agents for speed, accuracy, and cost-effectiveness.

Performance Metrics

Track these key metrics:

Response Time: Latency from query to response
Token Usage: Input + output tokens per request
Accuracy: Response quality score
Cost per Query: Total cost including API calls
Cache Hit Rate: Percentage of requests served from cache
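
As a rough illustration, most of these metrics can be derived from per-request log records. The sketch below assumes a hypothetical record shape (`latency_s`, `input_tokens`, `output_tokens`, `cost_usd`, `cache_hit`). None of these field names come from the Twig API, and accuracy is left out because it needs a separate evaluation step.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    """One logged request. Field names are illustrative assumptions."""
    latency_s: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    cache_hit: bool


def summarize(records: list[RequestRecord]) -> dict[str, float]:
    """Aggregate per-request records into the headline metrics."""
    if not records:
        raise ValueError("need at least one record")
    return {
        "avg_response_time_s": mean(r.latency_s for r in records),
        "avg_tokens": mean(r.input_tokens + r.output_tokens for r in records),
        "cost_per_query_usd": mean(r.cost_usd for r in records),
        "cache_hit_rate": sum(r.cache_hit for r in records) / len(records),
    }


# Two sample requests: one uncached, one served from cache.
records = [
    RequestRecord(1.0, 1200, 300, 0.004, False),
    RequestRecord(0.1, 1200, 300, 0.0, True),
]
summary = summarize(records)
```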

Optimization Dimensions

You can optimize for:

Speed (lower latency)
Accuracy (higher quality)
Cost (lower expense)

⚠️ Note: These often trade off against each other.

Speed Optimization

1. Choose Faster RAG Strategy

Strategy   Avg Latency   Best For
Redwood    ~1.2s         Maximum speed
Cedar      ~2.0s         Balanced
Cypress    ~3.5s         Maximum accuracy

Switch to Redwood when:

Questions are clear and direct
Speed is critical
Query volume is high

2. Reduce topK

// Higher topK = slower
topK: 20  → Response time: 2.5s
topK: 10  → Response time: 1.8s
topK: 5   → Response time: 1.2s

Recommendation: Start with 5-7, increase only if accuracy suffers.

3. Use Faster Model

Model           Speed    Quality     Cost
GPT-3.5-turbo   Fast     Good        Low
GPT-4o-mini     Fast     Better      Low
GPT-4o          Medium   Excellent   High
GPT-4           Slow     Excellent   High

For speed: Use GPT-3.5-turbo or GPT-4o-mini for simple queries, and GPT-4o for complex ones.

4. Enable Caching

{
  "cache": {
    "enabled": true,
    "ttl": 300,              // 5 minutes
    "keyBy": ["prompt", "agentId"]
  }
}

Impact: 50-100ms for cached responses vs 1-3s for uncached.
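
To make the caching behavior concrete, here is a minimal in-memory sketch of a TTL cache keyed by prompt and agent ID, mirroring the `ttl` and `keyBy` settings above. The class, the `answer` helper, and the `call_agent` callback are hypothetical names for illustration; the platform's actual cache is not described here.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """In-memory cache keyed by (prompt, agent_id) with a time-to-live."""

    def __init__(self, ttl_seconds: float = 300,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._store: dict[tuple[str, str], tuple[float, str]] = {}

    def get(self, prompt: str, agent_id: str) -> Optional[str]:
        entry = self._store.get((prompt, agent_id))
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at >= self.ttl:
            # Entry is older than the TTL: drop it and report a miss.
            del self._store[(prompt, agent_id)]
            return None
        return response

    def put(self, prompt: str, agent_id: str, response: str) -> None:
        self._store[(prompt, agent_id)] = (self.clock(), response)


def answer(prompt: str, agent_id: str, cache: TTLCache,
           call_agent: Callable[[str, str], str]) -> str:
    """Return a fresh cached response if one exists, otherwise call the agent."""
    cached = cache.get(prompt, agent_id)
    if cached is not None:
        return cached
    response = call_agent(prompt, agent_id)
    cache.put(prompt, agent_id, response)
    return response
```

Keying on both the prompt and the agent ID keeps one agent's answers from being served to another agent that received the same prompt.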

5. Optimize Context

// Reduce context size
{
  "topK": 5,               // Fewer documents
  "maxContextTokens": 2000, // Limit context size
  "chunkSize": 300         // Smaller chunks
}
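
One way to picture how `topK` and `maxContextTokens` interact is the sketch below. It assumes chunks arrive sorted by relevance and estimates tokens by counting whitespace-separated words, which is only a crude stand-in for a real tokenizer. `build_context` and `count_tokens` are hypothetical helpers, not platform functions.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def build_context(chunks: list[str], top_k: int = 5,
                  max_context_tokens: int = 2000) -> list[str]:
    """Keep the highest-ranked chunks that fit within the token budget.

    `chunks` is assumed to be sorted by relevance, best first.
    """
    selected: list[str] = []
    used = 0
    for chunk in chunks[:top_k]:
        cost = count_tokens(chunk)
        if used + cost > max_context_tokens:
            break  # stop at the first chunk that would exceed the budget
        selected.append(chunk)
        used += cost
    return selected
```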

6. Use Streaming

// User sees response immediately
stream: true

// First token: ~500ms
// Complete response: ~2s
// Perceived latency: Much faster
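
To check the gap between first-token and full-response latency in your own setup, you could time a stream as it is consumed. The sketch below works on any iterable of text chunks; `measure_stream` is a hypothetical helper, and how you obtain the chunk iterator from your client is left open.

```python
import time
from typing import Callable, Iterable


def measure_stream(chunks: Iterable[str],
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Consume a stream and record time to first chunk and total time."""
    start = clock()
    first_chunk_at = None
    parts = []
    for chunk in chunks:
        if first_chunk_at is None:
            # Time until the user would see the first piece of output.
            first_chunk_at = clock() - start
        parts.append(chunk)
    return {
        "text": "".join(parts),
        "time_to_first_chunk_s": first_chunk_at,
        "total_s": clock() - start,
    }
```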

Accuracy Optimization

1. Choose Better RAG Strategy

Cypress > Cedar > Redwood for accuracy.

2. Increase topK

topK: 5   → Accuracy: 85%
topK: 10  → Accuracy: 89%
topK: 20  → Accuracy: 91%

Diminishing returns after topK ~15.
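
The diminishing returns are easier to see as accuracy gained per additional retrieved document. This short calculation uses only the three data points listed above; `marginal_gains` is an illustrative helper.

```python
# (topK, accuracy %) pairs taken from the figures above.
measurements = [(5, 85.0), (10, 89.0), (20, 91.0)]


def marginal_gains(points: list[tuple[int, float]]) -> list[tuple[int, int, float]]:
    """Accuracy points gained per extra document between adjacent measurements."""
    gains = []
    for (k_lo, acc_lo), (k_hi, acc_hi) in zip(points, points[1:]):
        gains.append((k_lo, k_hi, (acc_hi - acc_lo) / (k_hi - k_lo)))
    return gains


for k_lo, k_hi, gain in marginal_gains(measurements):
    print(f"topK {k_lo} -> {k_hi}: {gain:.1f} points per extra document")
# topK 5 -> 10: 0.8 points per extra document
# topK 10 -> 20: 0.2 points per extra document
```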

3. Use Better Model

GPT-4o or GPT-4 for highest quality.

4. Improve Instructions

// Detailed system prompt
instructions: `
You are an expert [domain] assistant.

When answering:
1. Always cite specific sources
2. Provide step-by-step explanations
3. Include code examples when relevant
4. Verify facts against documentation
5. Admit uncertainty when appropriate
`

5. Add High-Quality Data Sources

✅ Official documentation
✅ Verified knowledge base
✅ Recent, updated content
❌ Low-quality, outdated content

6. Enable Reranking (Cypress)

Reranking improves precision by 20-30%.

7. Use Private Data Only

configAIUseOnlyPrivateData: true

Restricts answers to your own data sources, which reduces hallucinations drawn from the model's general knowledge.

Cost Optimization

1. Choose Cost-Effective Model

Model           Cost per 1M Tokens
GPT-3.5-turbo   $0.50
GPT-4o-mini     $0.15
GPT-4o          $5.00
GPT-4           $30.00

Recommendation: GPT-4o-mini for most use cases.
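
To see what these prices mean at volume, the sketch below estimates daily spend from the table. It assumes a single blended price per token (real pricing usually differs for input and output tokens), 1,500 tokens per query, and 10,000 queries per day. The workload numbers are illustrative assumptions, not measurements.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION = {
    "gpt-3.5-turbo": 0.50,
    "gpt-4o-mini": 0.15,
    "gpt-4o": 5.00,
    "gpt-4": 30.00,
}


def cost_per_query(model: str, tokens: int) -> float:
    """Estimated USD cost of one query, using a single blended token price."""
    return tokens / 1_000_000 * PRICE_PER_MILLION[model]


def daily_cost(model: str, tokens_per_query: int, queries_per_day: int) -> float:
    """Estimated USD cost per day for a steady workload."""
    return cost_per_query(model, tokens_per_query) * queries_per_day


for model in PRICE_PER_MILLION:
    print(f"{model}: ${daily_cost(model, 1500, 10_000):.2f} per day")
```

Under these assumptions the workload is 15M tokens per day, which comes to $2.25 on GPT-4o-mini versus $75.00 on GPT-4o.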

2. Reduce Token Usage

{
  "maxTokens": 300,        // Limit response length
  "topK": 5,               // Fewer documents
  "memoryTurns": 3,        // Less conversation history
  "temperature": 0.3       // More focused output (length is capped by maxTokens, not temperature)
}

3. Aggressive Caching

{
  "cache": {
    "enabled": true,
    "ttl": 3600,           // 1 hour (longer cache)
    "fuzzyMatching": true   // Match similar queries
  }
}
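
How `fuzzyMatching` decides that two queries are similar is not specified here. One simple interpretation, shown purely as an assumption, is to normalize queries before using them as cache keys so that differences in case, punctuation, and spacing still produce a hit. `normalize_query` is a hypothetical helper.

```python
import re
import string


def normalize_query(query: str) -> str:
    """Map near-identical queries to one cache key.

    Lowercases, strips ASCII punctuation, and collapses whitespace.
    """
    lowered = query.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", no_punct).strip()
```

Normalization only catches surface variation. Matching paraphrases would need something stronger, such as embedding similarity, with a higher risk of returning a cached answer to a different question.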

4. Use Redwood Strategy

Redwood is cheapest (single LLM call, no reranking).

5. Batch Operations

Process multiple queries together to reduce overhead.
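
Whether the API accepts several queries in one request is not stated here, so the sketch below only covers the client-side half: grouping pending queries into fixed-size batches. `batched` is a hypothetical helper.

```python
from typing import Iterator, Sequence


def batched(queries: Sequence[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most batch_size queries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(queries), batch_size):
        yield list(queries[start:start + batch_size])
```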

6. Smart Routing

def route_query(query):
    if is_simple(query):
        return "REDWOOD"      # Cheapest
    elif is_complex(query):
        return "CYPRESS"      # Worth the cost
    else:
        return "CEDAR"        # Balanced
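
The router above relies on `is_simple` and `is_complex`, which are not defined here. A minimal runnable sketch follows, assuming word-count and question-count heuristics. The thresholds are arbitrary placeholders; a production router might use a classifier instead.

```python
def is_simple(query: str) -> bool:
    """Assumed heuristic: one short question."""
    return len(query.split()) <= 12 and query.count("?") <= 1


def is_complex(query: str) -> bool:
    """Assumed heuristic: long or multi-part questions."""
    return len(query.split()) > 40 or query.count("?") > 1


def route_query(query: str) -> str:
    """Pick a RAG strategy code based on estimated query difficulty."""
    if is_simple(query):
        return "REDWOOD"      # Cheapest
    elif is_complex(query):
        return "CYPRESS"      # Worth the cost
    else:
        return "CEDAR"        # Balanced
```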

Balanced Optimization

The Performance Triangle

        Speed
         /\
        /  \
       /    \
      /      \
     /________\
  Cost      Accuracy

You can typically optimize for two of the three at once:

Speed + Cost: Use Redwood with GPT-3.5-turbo
Speed + Accuracy: Use Cedar with GPT-4o and caching
Cost + Accuracy: Use Cypress with efficient models and batching

Recommended Configurations

High-Volume Support Bot:

Strategy: REDWOOD
Model: gpt-3.5-turbo
topK: 5
Cache: enabled (1 hour)
Goal: Handle 10k+ queries/day cheaply

Technical Documentation:

Strategy: CEDAR
Model: gpt-4o
topK: 10
Cache: enabled (30 min)
Goal: Balance speed and accuracy

Compliance Assistant:

Strategy: CYPRESS
Model: gpt-4o
topK: 10
Reranking: enabled
Goal: Maximum accuracy, cost secondary

Performance Monitoring

Key Metrics Dashboard

Agent: Customer Support
├─ Avg Response Time: 1.8s (target: <2s) ✅
├─ P95 Response Time: 2.4s (target: <3s) ✅
├─ P99 Response Time: 3.1s ⚠️
├─ Cache Hit Rate: 42%
├─ Avg Tokens: 1,523
├─ Cost per Query: $0.0045
└─ Accuracy Score: 89%

Set Performance Targets

{
  "targets": {
    "responseTime": {
      "p50": 1.5,
      "p95": 2.5,
      "p99": 4.0
    },
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40
  }
}
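
A targets object like this is only useful if something compares it to observed values. The sketch below shows one way to do that, assuming response times are in seconds and that observed metrics arrive in the same shape as the targets. `check_targets` is a hypothetical helper.

```python
TARGETS = {
    "responseTime": {"p50": 1.5, "p95": 2.5, "p99": 4.0},  # assumed to be seconds
    "accuracy": 0.85,
    "costPerQuery": 0.01,
    "cacheHitRate": 0.40,
}


def check_targets(observed: dict, targets: dict = TARGETS) -> list[str]:
    """Return the names of metrics that miss their target."""
    misses = []
    for pct, limit in targets["responseTime"].items():
        if observed["responseTime"][pct] > limit:            # lower is better
            misses.append(f"responseTime.{pct}")
    if observed["accuracy"] < targets["accuracy"]:           # higher is better
        misses.append("accuracy")
    if observed["costPerQuery"] > targets["costPerQuery"]:   # lower is better
        misses.append("costPerQuery")
    if observed["cacheHitRate"] < targets["cacheHitRate"]:   # higher is better
        misses.append("cacheHitRate")
    return misses
```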

Alerting

{
  "alerts": {
    "responseTimeSlow": {
      "threshold": 3.0,
      "duration": "5m",
      "notify": "[email protected]"
    },
    "accuracyDrop": {
      "threshold": 0.80,
      "compare": "baseline",
      "notify": "[email protected]"
    }
  }
}
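
The exact semantics of `duration` are not spelled out here. A common reading, assumed in the sketch below, is that the alert fires only when every sample in the trailing window breaches the threshold, which filters out brief spikes. `parse_duration` and `should_alert` are hypothetical helpers.

```python
def parse_duration(text: str) -> int:
    """Convert strings like '30s', '5m', or '1h' to seconds."""
    units = {"s": 1, "m": 60, "h": 3600}
    return int(text[:-1]) * units[text[-1]]


def should_alert(samples: list[tuple[float, float]], threshold: float,
                 duration: str, now: float) -> bool:
    """True if every sample in the trailing window exceeds the threshold.

    `samples` holds (timestamp_seconds, value) pairs. An empty window never fires.
    """
    window_start = now - parse_duration(duration)
    recent = [value for ts, value in samples if window_start <= ts <= now]
    return bool(recent) and all(value > threshold for value in recent)
```

For an alert such as `accuracyDrop`, where lower values are the problem, the comparison would be reversed.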

A/B Testing

Compare configurations to find optimal settings:

// Test A: Baseline
const configA = {
  strategyCode: 'CEDAR',
  topK: 10,
  temperature: 0.7
};

// Test B: Optimized for speed
const configB = {
  strategyCode: 'REDWOOD',
  topK: 5,
  temperature: 0.7
};

// Route 50% traffic to each
// Measure: speed, accuracy, cost
// Deploy winner after 1 week
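
For the 50/50 split, it helps if each user always sees the same variant. One common approach, sketched here as an assumption rather than the platform's mechanism, is to hash a stable user ID. The experiment name and helper functions are hypothetical.

```python
import hashlib

CONFIG_A = {"strategyCode": "CEDAR", "topK": 10, "temperature": 0.7}
CONFIG_B = {"strategyCode": "REDWOOD", "topK": 5, "temperature": 0.7}


def assign_variant(user_id: str, experiment: str = "speed-test") -> str:
    """Deterministically assign a user to 'A' or 'B' with roughly equal odds."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return "A" if digest[0] % 2 == 0 else "B"


def config_for(user_id: str) -> dict:
    """Return the configuration for the variant this user belongs to."""
    return CONFIG_A if assign_variant(user_id) == "A" else CONFIG_B
```

Including the experiment name in the hash means a later experiment reshuffles users instead of reusing the same split.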

Continuous Optimization

Weekly Review

Check performance metrics
Identify bottlenecks
Test optimizations
Deploy improvements
Measure impact

Monthly Audit

Review all configurations
Benchmark against baselines
Update targets
Plan next optimizations

Tools & Techniques

Performance Profiling

const startTime = Date.now();

const response = await twig.chat.create({
  prompt,
  agentId,
  profile: true  // Enable profiling
});

console.log('Breakdown:', {
  embedding: response.profile.embeddingTime,
  retrieval: response.profile.retrievalTime,
  llm: response.profile.llmTime,
  total: Date.now() - startTime
});

Load Testing

# Using Apache Bench (-T sets the Content-Type of the POST body in query.json)
ab -n 1000 -c 10 -H "Authorization: Bearer KEY" \
  -T application/json -p query.json https://api.twig.so/api/chat

# Results show:
# - Requests per second
# - Average latency
# - P50, P95, P99
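
If you collect raw latencies yourself, for example from your own request logs, the same percentiles can be computed directly. The sketch below uses the nearest-rank method; other tools may interpolate and report slightly different values.

```python
import math


def percentile(latencies: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(latencies)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```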

Cache Analysis

const cacheStats = await twig.cache.stats();

console.log('Hit rate:', cacheStats.hitRate);
console.log('Avg savings:', cacheStats.avgTimeSaved);
console.log('Most cached:', cacheStats.topQueries);

Next Steps

Cost Optimization - Reduce expenses
Analytics Dashboard - Monitor metrics
RAG Strategies - Choose optimal strategy
Evaluation Framework - Measure quality


Last updated 7 hours ago
