Performance Tuning

Optimize your AI agents for speed, accuracy, and cost-effectiveness.

Performance Metrics

Track these key metrics:

  • Response Time: Latency from query to response

  • Token Usage: Input + output tokens per request

  • Accuracy: Response quality score

  • Cost per Query: Total cost including API calls

  • Cache Hit Rate: Percentage of requests served from cache
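
As a rough illustration, the five metrics listed here can be aggregated from per-request logs. This is a minimal sketch: the `RequestLog` fields and the `summarize` helper are assumptions made for the example, not part of any documented API.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestLog:
    """One logged request; the field names are illustrative assumptions."""
    latency_s: float      # seconds from query to full response
    input_tokens: int
    output_tokens: int
    cost_usd: float       # total cost, including all API calls
    cache_hit: bool
    quality: float        # response quality score in [0, 1]


def summarize(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate the five metrics over a non-empty list of request logs."""
    return {
        "avg_response_time_s": mean(r.latency_s for r in logs),
        "avg_tokens": mean(r.input_tokens + r.output_tokens for r in logs),
        "avg_accuracy": mean(r.quality for r in logs),
        "cost_per_query_usd": mean(r.cost_usd for r in logs),
        "cache_hit_rate_pct": 100 * sum(r.cache_hit for r in logs) / len(logs),
    }


# Two made-up requests: one uncached, one served from cache.
logs = [
    RequestLog(2.0, 800, 200, 0.004, False, 0.9),
    RequestLog(0.1, 800, 200, 0.0, True, 0.9),
]
summary = summarize(logs)
```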

Optimization Dimensions

You can optimize for:

  1. Speed (lower latency)

  2. Accuracy (higher quality)

  3. Cost (lower expense)

⚠️ Note: These often trade off against each other.

Speed Optimization

1. Choose Faster RAG Strategy

Strategy   Avg Latency   Best For
Redwood    ~1.2s         Maximum speed
Cedar      ~2.0s         Balanced
Cypress    ~3.5s         Maximum accuracy

Switch to Redwood when:

  • Questions are clear and direct

  • Speed is critical

  • Query volume is high

2. Reduce topK

Recommendation: Start with 5-7, increase only if accuracy suffers.
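
One way to apply this recommendation is to try a short list of topK values in increasing order and stop at the first one that meets your accuracy target. The sketch below assumes a hypothetical `evaluate` callback that runs your own evaluation set at a given topK; the candidate values and scores are made up for illustration.

```python
from typing import Callable


def smallest_sufficient_top_k(
    evaluate: Callable[[int], float],
    target_accuracy: float,
    candidates: tuple[int, ...] = (5, 7, 10, 15),
) -> int:
    """Return the first topK whose measured accuracy meets the target.

    `evaluate` is a hypothetical callback that runs your evaluation set
    at a given topK and returns an accuracy score in [0, 1].
    """
    for k in candidates:
        if evaluate(k) >= target_accuracy:
            return k
    return candidates[-1]  # target never met: fall back to the largest value


# Stand-in evaluator with made-up scores, for illustration only.
fake_scores = {5: 0.78, 7: 0.84, 10: 0.86, 15: 0.87}
best_k = smallest_sufficient_top_k(fake_scores.__getitem__, target_accuracy=0.80)
```

Smaller topK values mean less retrieved text per request, so stopping at the first sufficient value keeps latency and token usage down.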

3. Use Faster Model

Model           Speed    Quality     Cost
GPT-3.5-turbo   Fast     Good        Low
GPT-4o-mini     Fast     Better      Low
GPT-4o          Medium   Excellent   High
GPT-4           Slow     Excellent   High

For speed: use GPT-3.5-turbo for simple queries and GPT-4o for complex ones.

4. Enable Caching

Impact: 50-100ms for cached responses vs 1-3s for uncached.
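
To make the idea concrete, here is a minimal in-memory cache sketch. It is not the platform's caching feature: the `ResponseCache` class, the normalization rule, and the `slow_agent` stand-in are all assumptions for illustration.

```python
import hashlib
import time
from typing import Callable


class ResponseCache:
    """In-memory cache keyed by a hash of the normalized query text."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._store: dict[str, tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Collapse case and whitespace so trivially different queries share a key.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        key = self._key(query)
        entry = self._store.get(key)
        if entry is not None and time.monotonic() - entry[0] < self.ttl_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1
        answer = compute(query)  # the slow path: retrieval plus LLM call
        self._store[key] = (time.monotonic(), answer)
        return answer


cache = ResponseCache(ttl_seconds=60)
calls: list[str] = []


def slow_agent(query: str) -> str:
    """Hypothetical stand-in for the real agent call."""
    calls.append(query)
    return f"answer to: {query}"


first = cache.get_or_compute("How do I reset my password?", slow_agent)
second = cache.get_or_compute("  how do I reset my   PASSWORD? ", slow_agent)
```

The second call differs only in case and spacing, so it is served from the cache and the agent runs once.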

5. Optimize Context
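
One plausible way to optimize context is to cap the retrieved text at a token budget and keep only the most relevant chunks. The sketch below assumes chunks arrive as (relevance score, text) pairs and estimates tokens at about 4 characters each, which is only a rough heuristic for English text.

```python
def fit_context(chunks: list[tuple[float, str]], max_tokens: int) -> list[str]:
    """Keep the highest-scoring chunks that fit within a token budget.

    Token counts are estimated at roughly 4 characters per token;
    use your model's tokenizer when you need exact counts.
    """
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        cost = max(1, len(text) // 4)
        if used + cost > max_tokens:
            continue  # skip chunks that would exceed the budget
        selected.append(text)
        used += cost
    return selected


# Made-up chunks: (relevance score, text).
chunks = [(0.9, "a" * 400), (0.5, "b" * 400), (0.7, "c" * 800), (0.2, "d" * 40)]
context = fit_context(chunks, max_tokens=210)
```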

6. Use Streaming
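
Streaming does not shorten total generation time, but it lets users see the first tokens much sooner. The sketch below simulates this with a hypothetical `fake_token_stream` generator and measures time to first token against total time; a real integration would consume your provider's streaming response instead.

```python
import time
from typing import Iterator, Optional


def fake_token_stream(text: str, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response: yields word by word."""
    for word in text.split():
        time.sleep(delay_s)  # simulated generation time per token
        yield word + " "


def consume(stream: Iterator[str]) -> tuple[str, float, float]:
    """Return the full text, the time to first token, and the total time."""
    start = time.perf_counter()
    first_token_at: Optional[float] = None
    parts: list[str] = []
    for token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        parts.append(token)  # a real UI would render each token here
    total = time.perf_counter() - start
    return "".join(parts), total if first_token_at is None else first_token_at, total


text, ttft, total = consume(fake_token_stream("Streaming shows partial output early"))
```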

Accuracy Optimization

1. Choose Better RAG Strategy

Cypress > Cedar > Redwood for accuracy.

2. Increase topK

Diminishing returns after topK ~15.

3. Use Better Model

GPT-4o or GPT-4 for highest quality.

4. Improve Instructions

5. Add High-Quality Data Sources

  ✅ Official documentation
  ✅ Verified knowledge base
  ✅ Recent, updated content
  ❌ Low-quality, outdated content

6. Enable Reranking (Cypress)

Reranking improves precision by 20-30%.
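
Reranking in general means re-scoring the first-stage retrieval candidates against the query and keeping only the best. How Cypress implements it is not described here; the sketch below uses a toy term-overlap score purely for illustration, where a production system would typically use a learned relevance model.

```python
def overlap_score(query: str, passage: str) -> float:
    """Toy relevance score: fraction of query terms that appear in the passage."""
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    return len(query_terms & passage_terms) / len(query_terms) if query_terms else 0.0


def rerank(query: str, candidates: list[str], keep: int) -> list[str]:
    """Re-score first-stage candidates against the query and keep the best."""
    ranked = sorted(candidates, key=lambda p: overlap_score(query, p), reverse=True)
    return ranked[:keep]


# Made-up first-stage retrieval results.
candidates = [
    "billing cycles and invoices",
    "how to reset a forgotten password",
    "password policy for admins",
]
top = rerank("reset password", candidates, keep=2)
```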

7. Use Private Data Only

Restricting answers to your private data reduces hallucinations drawn from the model's general knowledge.

Cost Optimization

1. Choose Cost-Effective Model

Model           Cost per 1M Tokens
GPT-3.5-turbo   $0.50
GPT-4o-mini     $0.15
GPT-4o          $5.00
GPT-4           $30.00

Recommendation: GPT-4o-mini for most use cases.
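
To see what these prices mean at volume, the following sketch estimates per-query and monthly cost from the table in this section. It assumes a single flat rate for input and output tokens, which is a simplification (providers usually price them separately), and the token and query volumes are illustrative.

```python
# Prices copied from the cost table in this section, in USD per 1M tokens.
PRICE_PER_1M_TOKENS = {
    "GPT-3.5-turbo": 0.50,
    "GPT-4o-mini": 0.15,
    "GPT-4o": 5.00,
    "GPT-4": 30.00,
}


def query_cost_usd(model: str, total_tokens: int) -> float:
    """Estimate cost assuming one flat rate for input and output tokens."""
    return total_tokens / 1_000_000 * PRICE_PER_1M_TOKENS[model]


def monthly_cost_usd(model: str, tokens_per_query: int, queries_per_month: int) -> float:
    """Scale the per-query estimate to a monthly query volume."""
    return query_cost_usd(model, tokens_per_query) * queries_per_month


# 2,000 tokens per query at 100,000 queries per month (illustrative volumes).
for model in PRICE_PER_1M_TOKENS:
    print(f"{model:14s} ${monthly_cost_usd(model, 2_000, 100_000):,.2f}")
```

At these volumes (200M tokens per month), GPT-4o-mini comes to about $30 per month and GPT-4 to about $6,000.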

2. Reduce Token Usage

3. Aggressive Caching

4. Use Redwood Strategy

Redwood is cheapest (single LLM call, no reranking).

5. Batch Operations

Process multiple queries together to reduce overhead.
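
A minimal batching sketch follows. It assumes a hypothetical batch-capable call (`fake_answer_batch` here) that accepts a list of queries and returns one answer per query; whether your platform offers such an endpoint is not stated in this guide.

```python
from typing import Callable, Iterator


def batched(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def answer_all(
    queries: list[str],
    answer_batch: Callable[[list[str]], list[str]],
    batch_size: int = 8,
) -> list[str]:
    """Send queries in groups so per-call overhead is paid once per batch."""
    answers: list[str] = []
    for group in batched(queries, batch_size):
        answers.extend(answer_batch(group))
    return answers


batch_sizes: list[int] = []


def fake_answer_batch(group: list[str]) -> list[str]:
    """Hypothetical batch endpoint: records the batch size and echoes answers."""
    batch_sizes.append(len(group))
    return [q.upper() for q in group]


results = answer_all([f"q{i}" for i in range(20)], fake_answer_batch, batch_size=8)
```

Twenty queries with a batch size of 8 result in three calls instead of twenty.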

6. Smart Routing
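
Smart routing can be sketched as a function that sends simple queries to a cheaper model and complex ones to a stronger model. The heuristic below (word count plus a few keywords) and its thresholds are illustrative assumptions; a real router might use a classifier or past quality scores instead.

```python
def pick_model(query: str) -> str:
    """Route short, simple queries to the cheaper model.

    The word-count threshold and keyword list are illustrative assumptions;
    tune them against your own traffic.
    """
    complex_markers = ("compare", "explain why", "step by step", "trade-off")
    text = query.lower()
    is_long = len(text.split()) > 25
    looks_complex = any(marker in text for marker in complex_markers)
    return "GPT-4o" if is_long or looks_complex else "GPT-4o-mini"


simple = pick_model("What are your support hours?")
hard = pick_model("Compare the Cedar and Cypress strategies step by step.")
```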

Balanced Optimization

The Performance Triangle

You can usually optimize for two of the three dimensions at once:

  • Speed + Cost: Use Redwood, GPT-3.5-turbo

  • Speed + Accuracy: Use Cedar, GPT-4o, caching

  • Cost + Accuracy: Use Cypress, efficient models, batch
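
The three pairings can be captured as named presets. The dictionary shape and key names below are hypothetical rather than a documented configuration schema, and GPT-4o-mini is used as one example of an "efficient model".

```python
# Hypothetical configuration shape; the keys are not a documented schema.
PRESETS = {
    frozenset({"speed", "cost"}): {"strategy": "Redwood", "model": "GPT-3.5-turbo"},
    frozenset({"speed", "accuracy"}): {"strategy": "Cedar", "model": "GPT-4o", "cache": True},
    # "Efficient models" is read here as GPT-4o-mini, which is an assumption.
    frozenset({"cost", "accuracy"}): {"strategy": "Cypress", "model": "GPT-4o-mini", "batch": True},
}


def preset_for(first: str, second: str) -> dict:
    """Look up the preset for a pair of priorities, in either order."""
    return PRESETS[frozenset((first, second))]


support_bot = preset_for("cost", "speed")
```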

High-Volume Support Bot:

Technical Documentation:

Compliance Assistant:

Performance Monitoring

Key Metrics Dashboard

Set Performance Targets

Alerting
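
A simple form of alerting compares each reported metric with its target and emits a message on every violation. The target values and metric names in this sketch are illustrative assumptions; substitute your own.

```python
# Illustrative targets; pick values that match your own requirements.
TARGETS = {
    "avg_response_time_s": ("max", 2.0),
    "avg_accuracy": ("min", 0.85),
    "cost_per_query_usd": ("max", 0.01),
    "cache_hit_rate_pct": ("min", 30.0),
}


def check_targets(metrics: dict[str, float]) -> list[str]:
    """Return one alert message per metric that violates its target."""
    alerts = []
    for name, (kind, limit) in TARGETS.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported this period
        breached = value > limit if kind == "max" else value < limit
        if breached:
            alerts.append(f"{name}={value} violates {kind} target {limit}")
    return alerts


alerts = check_targets(
    {"avg_response_time_s": 3.1, "avg_accuracy": 0.9, "cache_hit_rate_pct": 12.0}
)
```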

A/B Testing

Compare configurations to find optimal settings:
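
A minimal A/B sketch, under stated assumptions: users are assigned to a variant deterministically by hashing a user ID, and the variants are compared on the mean of one metric. The sample latencies are made up, and a real comparison should also check statistical significance before declaring a winner.

```python
import hashlib
from statistics import mean


def assign_variant(user_id: str, variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically map a user to a variant so repeat visits stay consistent."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return variants[digest[0] % len(variants)]


def compare(results: dict[str, list[float]]) -> dict[str, float]:
    """Mean of a metric (for example latency in seconds) per variant."""
    return {variant: mean(values) for variant, values in results.items()}


# Made-up latency samples for two configurations.
latencies = {"A": [1.9, 2.1, 2.0], "B": [1.2, 1.3, 1.1]}
means = compare(latencies)
winner = min(means, key=means.get)  # lower latency is better
```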

Continuous Optimization

Weekly Review

  1. Check performance metrics

  2. Identify bottlenecks

  3. Test optimizations

  4. Deploy improvements

  5. Measure impact

Monthly Audit

  1. Review all configurations

  2. Benchmark against baselines

  3. Update targets

  4. Plan next optimizations

Tools & Techniques

Performance Profiling
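
For a first pass at profiling, timing each pipeline stage is often enough to find the bottleneck. The sketch below uses a small context manager; the stage names are hypothetical and the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_totals: dict[str, float] = defaultdict(float)


@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent inside the `with` block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start


# Hypothetical pipeline stages; sleeps stand in for real work.
with timed("retrieval"):
    time.sleep(0.02)
with timed("generation"):
    time.sleep(0.05)

slowest = max(stage_totals, key=stage_totals.get)
```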

Load Testing
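
A small load test can be sketched with a thread pool that fires concurrent requests and reports latency percentiles. `fake_agent` is a hypothetical stand-in that sleeps for 10 ms; replace it with a call to your own agent, and prefer a dedicated load-testing tool for serious benchmarks.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles


def fake_agent(query: str) -> str:
    """Hypothetical stand-in for a real agent call."""
    time.sleep(0.01)
    return f"ok: {query}"


def timed_call(query: str) -> float:
    """Return the wall-clock latency of one agent call, in seconds."""
    start = time.perf_counter()
    fake_agent(query)
    return time.perf_counter() - start


def load_test(num_requests: int, concurrency: int) -> dict[str, float]:
    """Fire requests from a thread pool and report latency percentiles."""
    queries = [f"query {i}" for i in range(num_requests)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(timed_call, queries))
    cuts = quantiles(latencies, n=100)  # 99 cut points; index 49 is p50, 94 is p95
    return {"p50_s": cuts[49], "p95_s": cuts[94], "max_s": max(latencies)}


report = load_test(num_requests=40, concurrency=8)
```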

Cache Analysis

Next Steps
