Performance Tuning

Optimize your AI agents for speed, accuracy, and cost-effectiveness.

Performance Metrics

Track these metrics in the Analytics dashboard:

| Metric | Target | How It Is Measured |
|---|---|---|
| p50 Latency | <2s | 50th percentile response time |
| p95 Latency | <4s | 95th percentile (worst-case normal) |
| p99 Latency | <6s | 99th percentile (outliers) |
| Accuracy Rate | >85% | % responses marked accurate |
| Cost/Query | <$0.01 | Total API costs / query count |
| Cache Hit Rate | >40% | Cached / total queries |

Location: Dashboard → Analytics → Performance tab
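
To make the definitions in the table concrete, here is a minimal sketch that computes the same metrics from a list of per-query records. The `QueryRecord` fields and the nearest-rank percentile method are assumptions for illustration; the dashboard may calculate its percentiles differently.

```python
import math
from dataclasses import dataclass


@dataclass
class QueryRecord:
    # Hypothetical record shape, not a documented export format.
    latency_s: float   # end-to-end response time in seconds
    cost_usd: float    # total API cost attributed to this query
    cached: bool       # True if the response was served from cache
    accurate: bool     # True if the response was marked accurate


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(records):
    """Compute the dashboard-style metrics for a non-empty list of records."""
    n = len(records)
    latencies = [r.latency_s for r in records]
    return {
        "p50_latency_s": percentile(latencies, 50),
        "p95_latency_s": percentile(latencies, 95),
        "p99_latency_s": percentile(latencies, 99),
        "accuracy_rate": sum(r.accurate for r in records) / n,
        "cost_per_query": sum(r.cost_usd for r in records) / n,
        "cache_hit_rate": sum(r.cached for r in records) / n,
    }
```

Comparing p95 and p99 against p50 shows whether slow responses are isolated outliers or part of a broader slowdown.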

Optimization Tradeoffs

You cannot maximize speed, accuracy, and cost-effectiveness at the same time. Pick two:

  • Speed + Cost → Redwood + GPT-3.5-turbo (accuracy: ~72%)

  • Speed + Accuracy → Cedar + GPT-4o + cache (cost: medium)

  • Accuracy + Cost → Cypress + GPT-4o-mini (speed: ~3-5s)

Speed Optimization

1. Choose Faster RAG Strategy

| Strategy | Avg Latency | Best For |
|---|---|---|
| Redwood | ~1.2s | Maximum speed |
| Cedar | ~2.0s | Balanced |
| Cypress | ~3.5s | Maximum accuracy |

Switch to Redwood when:

  • Questions are clear and direct

  • Speed is critical

  • Query volume is high

2. Reduce topK

Recommendation: Start with a topK of 5-7 and increase it only if accuracy suffers.
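
As an illustration of what topK controls, the sketch below keeps only the highest-scoring retrieved chunks before they are added to the prompt. The `(score, text)` pair format and the function name are assumptions, not the product's retrieval API.

```python
import heapq


def select_top_k(scored_chunks, top_k=5):
    """Return the top_k highest-scoring (score, text) pairs, best first."""
    return heapq.nlargest(top_k, scored_chunks, key=lambda pair: pair[0])
```

A smaller topK means fewer chunks in the prompt, so the model has less text to read before it can answer.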

3. Use Faster Model

| Model | Speed | Quality | Cost |
|---|---|---|---|
| GPT-3.5-turbo | Fast | Good | Low |
| GPT-4o-mini | Fast | Better | Low |
| GPT-4o | Medium | Excellent | High |
| GPT-4 | Slow | Excellent | High |

For speed: Use GPT-3.5-turbo for simple queries and GPT-4o for complex ones.

4. Enable Caching

Impact: Cached responses return in 50-100ms, compared with 1-3s for uncached responses.
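
The sketch below shows one way a response cache could work, assuming an in-memory store with a time-to-live and a simple normalized query key. The class name, the TTL default, and the normalization rule are all hypothetical and say nothing about how the platform's built-in cache is implemented.

```python
import time


class ResponseCache:
    """In-memory TTL cache keyed by a normalized query string (illustrative only)."""

    def __init__(self, ttl_seconds=3600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # normalized query -> (expiry time, response)
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query):
        # Collapse case and whitespace so trivially different phrasings share an entry.
        return " ".join(query.lower().split())

    def get_or_compute(self, query, compute):
        """Return a cached response if it is still fresh; otherwise call compute(query) and store it."""
        key = self._key(query)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            self.hits += 1
            return entry[1]
        self.misses += 1
        response = compute(query)
        self._entries[key] = (now + self._ttl, response)
        return response

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

A longer TTL raises the hit rate but increases the chance of serving an answer that no longer matches the underlying data.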

5. Optimize Context
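
One common way to optimize context is to cap how much retrieved text is sent to the model. The sketch below is an assumption about what that could look like: it keeps ranked chunks until a token budget is reached, using a rough 4-characters-per-token estimate rather than a real tokenizer.

```python
def estimate_tokens(text):
    # Rough heuristic (about 4 characters per token for English text).
    # This is an assumption for illustration, not a real tokenizer.
    return max(1, len(text) // 4)


def fit_to_budget(chunks, max_tokens):
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used
```

Because the chunks are assumed to arrive best first, stopping at the budget drops the least relevant material.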

6. Use Streaming
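
Streaming does not shorten total generation time, but it lets users see the first tokens sooner. The sketch below illustrates this by measuring time to first token separately from total time. `fake_token_stream` is a hypothetical stand-in for a streaming model response, and no real API is called.

```python
import time


def fake_token_stream(tokens, delay_s=0.0):
    """Stand-in for a streaming model response (hypothetical, no real API)."""
    for token in tokens:
        time.sleep(delay_s)
        yield token


def consume_stream(stream, on_token):
    """Forward tokens as they arrive; return (full_text, time_to_first_token_s, total_s)."""
    start = time.perf_counter()
    first_token_s = None
    parts = []
    for token in stream:
        if first_token_s is None:
            first_token_s = time.perf_counter() - start
        parts.append(token)
        on_token(token)  # e.g. push the token to the UI immediately
    total_s = time.perf_counter() - start
    return "".join(parts), first_token_s, total_s
```

When evaluating streaming, track time to first token alongside the latency percentiles, since that is the delay users actually perceive.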

Accuracy Optimization

1. Choose Better RAG Strategy

Cypress > Cedar > Redwood for accuracy.

2. Increase topK

Expect diminishing returns beyond a topK of about 15.

3. Use Better Model

Use GPT-4o or GPT-4 for the highest quality.

4. Improve Instructions

5. Add High-Quality Data Sources

  ✅ Official documentation

  ✅ Verified knowledge base

  ✅ Recent, updated content

  ❌ Low-quality, outdated content

6. Enable Reranking (Cypress)

Reranking improves precision by 20-30%.

7. Use Private Data Only

Restricting answers to your private data prevents hallucinations drawn from the model's general knowledge.

Cost Optimization

1. Choose Cost-Effective Model

| Model | Cost per 1M Tokens |
|---|---|
| GPT-3.5-turbo | $0.50 |
| GPT-4o-mini | $0.15 |
| GPT-4o | $5.00 |
| GPT-4 | $30.00 |

Recommendation: GPT-4o-mini for most use cases.
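
To relate these prices to the <$0.01 Cost/Query target, the sketch below estimates cost per query from a token count. It applies the table's single per-1M-token price to every token, which is a simplification, and the function names are illustrative.

```python
import math

# Prices copied from the table above (USD per 1M tokens).
PRICE_PER_1M_TOKENS_USD = {
    "gpt-3.5-turbo": 0.50,
    "gpt-4o-mini": 0.15,
    "gpt-4o": 5.00,
    "gpt-4": 30.00,
}


def cost_per_query(model, tokens_per_query):
    """Estimated cost in USD of one query that uses tokens_per_query tokens in total."""
    return PRICE_PER_1M_TOKENS_USD[model] * tokens_per_query / 1_000_000


def max_tokens_for_target(model, target_usd=0.01):
    """Largest whole token count per query that stays within target_usd."""
    return math.floor(target_usd * 1_000_000 / PRICE_PER_1M_TOKENS_USD[model])
```

At these prices, a 2,000-token query on GPT-4o costs about $0.01, while GPT-4 reaches the same cost after roughly 333 tokens.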

2. Reduce Token Usage

3. Aggressive Caching

4. Use Redwood Strategy

Redwood is cheapest (single LLM call, no reranking).

5. Batch Operations

Process multiple queries together to reduce overhead.
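
A minimal sketch of batching, assuming a hypothetical `answer_batch` callable that accepts a list of queries and returns one answer per query in the same order:

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def answer_in_batches(queries, answer_batch, batch_size=8):
    """Send queries in groups so per-request overhead is paid once per batch."""
    answers = []
    for batch in batched(queries, batch_size):
        answers.extend(answer_batch(batch))
    return answers
```

Batching suits offline or background workloads. For interactive chat, waiting for a batch to fill adds latency.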

6. Smart Routing
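
Smart routing can be sketched as sending simple queries to a cheaper model and reserving a stronger model for complex ones. The complexity heuristic below (word count plus a few keyword hints) and the default model names are assumptions for illustration, not the platform's actual routing logic.

```python
# Hypothetical hints that a query needs deeper reasoning.
COMPLEX_HINTS = ("compare", "why", "explain", "step by step", "troubleshoot")


def looks_complex(query, max_simple_words=12):
    """Crude heuristic: long queries or ones containing a hint phrase count as complex."""
    text = query.lower()
    return len(text.split()) > max_simple_words or any(hint in text for hint in COMPLEX_HINTS)


def route_model(query, simple_model="gpt-4o-mini", complex_model="gpt-4o"):
    """Pick the cheaper model unless the query looks complex."""
    return complex_model if looks_complex(query) else simple_model
```

A keyword heuristic like this will misroute some queries, so check accuracy for each route before relying on it.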

Balanced Optimization

The Performance Triangle

You can optimize two of the three:

  • Speed + Cost: Use Redwood and GPT-3.5-turbo

  • Speed + Accuracy: Use Cedar, GPT-4o, and caching

  • Cost + Accuracy: Use Cypress, efficient models, and batching

High-Volume Support Bot:

Technical Documentation:

Compliance Assistant:

Performance Monitoring

Key Metrics Dashboard

Set Performance Targets

Alerting
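
As a sketch of how targets and alerts fit together, the code below checks a metrics snapshot against the targets listed under Performance Metrics and returns a message for each missed target. The metric key names and message format are assumptions, and delivery (email, chat, pager) is left out.

```python
# Targets taken from the Performance Metrics table; key names are hypothetical.
TARGETS = {
    "p50_latency_s": ("<", 2.0),
    "p95_latency_s": ("<", 4.0),
    "p99_latency_s": ("<", 6.0),
    "accuracy_rate": (">", 0.85),
    "cost_per_query": ("<", 0.01),
    "cache_hit_rate": (">", 0.40),
}


def meets(value, op, threshold):
    """True if value satisfies the target comparison."""
    return value < threshold if op == "<" else value > threshold


def check_targets(metrics, targets=TARGETS):
    """Return one alert message per metric that is present and misses its target."""
    alerts = []
    for name, (op, threshold) in targets.items():
        value = metrics.get(name)
        if value is None:
            continue  # metric not reported in this snapshot
        if not meets(value, op, threshold):
            alerts.append(f"{name}={value} misses target {op} {threshold}")
    return alerts
```

Running a check like this on a schedule and alerting only after several consecutive misses helps avoid noise from short spikes.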

A/B Testing

Compare configurations to find the optimal settings.
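
A minimal sketch of an A/B comparison, assuming users are assigned to a configuration by hashing a user ID so that each user always sees the same variant. The function names and the observation format are hypothetical, and the comparison includes no statistical significance test.

```python
import hashlib
from statistics import mean


def assign_variant(user_id, variants=("A", "B")):
    """Deterministically map a user ID to one variant."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return variants[bucket % len(variants)]


def compare(results):
    """Summarize results given as variant -> non-empty list of (latency_s, accurate) pairs."""
    return {
        variant: {
            "n": len(observations),
            "mean_latency_s": mean(latency for latency, _ in observations),
            "accuracy_rate": mean(1.0 if ok else 0.0 for _, ok in observations),
        }
        for variant, observations in results.items()
    }
```

Collect enough queries per variant before drawing conclusions, since small samples can differ by chance alone.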

Continuous Optimization

Weekly Review

  1. Check performance metrics

  2. Identify bottlenecks

  3. Test optimizations

  4. Deploy improvements

  5. Measure impact

Monthly Audit

  1. Review all configurations

  2. Benchmark against baselines

  3. Update targets

  4. Plan next optimizations

Tools & Techniques

Performance Profiling
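
One simple way to profile is to time each stage of a request separately, so you can see whether retrieval, reranking, or generation dominates. The stage names in the sketch below are examples only, and `StageTimer` is a hypothetical helper, not part of any SDK.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def slowest(self):
        """Name of the stage with the largest accumulated time, or None if nothing was timed."""
        if not self.totals:
            return None
        return max(self.totals, key=self.totals.get)
```

Usage: wrap each step in `with timer.stage("retrieval"):` or `with timer.stage("generation"):`, then inspect `timer.totals` after a representative set of queries.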

Load Testing
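
A minimal load-test sketch, assuming a hypothetical `send_query` callable that blocks until the agent has responded. It issues queries from a thread pool and reports throughput and latency. A dedicated load-testing tool is more appropriate for production-scale tests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_call(send_query, query):
    """Return how long one call to send_query takes, in seconds."""
    start = time.perf_counter()
    send_query(query)
    return time.perf_counter() - start


def run_load_test(send_query, queries, concurrency=8):
    """Run all queries with up to `concurrency` in flight; queries must be non-empty."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda q: timed_call(send_query, q), queries))
    elapsed = time.perf_counter() - started
    return {
        "requests": len(latencies),
        "throughput_qps": len(latencies) / elapsed if elapsed > 0 else 0.0,
        "mean_latency_s": sum(latencies) / len(latencies),
        "max_latency_s": max(latencies),
    }
```

Increase `concurrency` step by step and watch where latency starts to climb faster than throughput, since that point marks the practical capacity limit.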

Cache Analysis

