Cost Optimization

Reduce AI operational costs while maintaining quality through smart configuration and usage patterns.

Cost Components

Understanding where costs come from:

Component          Cost Driver                   Optimization Lever
─────────────────────────────────────────────────────────────────────────────
LLM API Calls      Input + output tokens         Model choice, response length
Embeddings         Number of queries             Caching, deduplication
Vector Search      Query volume                  Caching, topK
Reranking          Results reranked (Cypress)    Strategy choice
Data Processing    Documents processed           Incremental updates

Cost Breakdown Example

Per 1,000 Queries:

Redwood (Cheapest)

Embeddings:        $0.01
Vector Search:     $0.05
LLM (GPT-4o-mini): $0.20
─────────────────────────
Total:             $0.26

Cedar (Medium)

Cypress (Highest)
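For reference, the Redwood numbers above reduce to a per-query figure by dividing the component sum by the batch size:

```python
# Per-1,000-query component costs for Redwood, taken from the breakdown above.
REDWOOD_PER_1000 = {
    "embeddings": 0.01,
    "vector_search": 0.05,
    "llm_gpt_4o_mini": 0.20,
}

def cost_per_query(per_1000_costs: dict) -> float:
    """Per-query cost in USD, rounded to six decimal places."""
    return round(sum(per_1000_costs.values()) / 1000, 6)

print(f"${sum(REDWOOD_PER_1000.values()):.2f} per 1,000 queries")  # → $0.26 per 1,000 queries
print(f"${cost_per_query(REDWOOD_PER_1000):.5f} per query")        # → $0.00026 per query
```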

Optimization Strategies

1. Model Selection

Cost-Quality Matrix:

Recommendation:

  • High-volume, simple: GPT-3.5-turbo or GPT-4o-mini

  • Complex, critical: GPT-4o

  • Research/analysis: GPT-4
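These recommendations reduce to a small lookup table. A sketch (an illustrative helper, not a product API; revisit it as model pricing changes):

```python
# Default model per workload type, following the recommendations above.
MODEL_BY_WORKLOAD = {
    "high_volume_simple": "gpt-4o-mini",  # or gpt-3.5-turbo
    "complex_critical":   "gpt-4o",
    "research_analysis":  "gpt-4",
}

def pick_model(workload: str) -> str:
    # Unknown workloads default to the cheapest recommended model.
    return MODEL_BY_WORKLOAD.get(workload, "gpt-4o-mini")
```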

2. Strategy Selection

3. Aggressive Caching

Impact:
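As a sketch of what aggressive caching looks like in practice (a hypothetical in-process cache; class and method names are not the product's API):

```python
import hashlib
import time

class ResponseCache:
    """TTL-bounded query→response cache with light fuzzy matching."""

    def __init__(self, ttl_seconds: float = 3600):
        self.ttl = ttl_seconds
        self._store = {}  # normalized key -> (expires_at, response)

    @staticmethod
    def _key(query: str) -> str:
        # "Fuzzy" matching here is just normalization: case, whitespace,
        # and trailing punctuation, so near-identical queries share a hit.
        normalized = " ".join(query.lower().split()).rstrip("?!. ")
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        entry = self._store.get(self._key(query))
        if entry and entry[0] > time.monotonic():
            return entry[1]  # hit: skip embedding, search, and LLM costs
        return None

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = (time.monotonic() + self.ttl, response)
```

Every hit avoids the full embedding + search + LLM spend, so even modest hit rates compound; keep the TTL short for time-sensitive content.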

4. Reduce Token Usage

Limit Response Length:

Reduce Context:

Shorter Memory:

Impact:
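The three levers above (shorter responses, trimmed context, shorter memory) can be sketched as plain helpers (budgets are illustrative, not product defaults):

```python
MAX_RESPONSE_TOKENS = 256  # cap output length; output tokens typically cost more

def trim_context(ranked_chunks: list, max_chars: int = 4000) -> list:
    """Keep best-ranked chunks until the character budget is spent."""
    kept, used = [], 0
    for chunk in ranked_chunks:  # assumes best-first ordering
        if used + len(chunk) > max_chars:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += len(chunk)
    return kept

def trim_memory(turns: list, keep_last: int = 6) -> list:
    """Send only the most recent conversation turns as memory."""
    return turns[-keep_last:]
```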

5. Smart Routing

Route each query to the cheapest strategy that can answer it acceptably:
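A heuristic router might look like this (thresholds and keywords are assumptions to tune against your own traffic; the strategy names are the ones used in this guide):

```python
def route_strategy(query: str) -> str:
    """Send each query to the cheapest strategy likely to answer it well."""
    words = query.split()
    # Short, direct questions: the cheap Redwood pipeline is usually enough.
    if len(words) <= 8 and query.rstrip().endswith("?"):
        return "redwood"
    # Analytical queries benefit from Cypress reranking despite the cost.
    if any(k in query.lower() for k in ("compare", "why", "explain", "analyze")):
        return "cypress"
    # Everything else: the mid-cost default.
    return "cedar"
```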

6. Batch Processing
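Batching is mostly about amortizing per-request overhead, e.g. embedding documents in groups instead of one call each. A generic sketch:

```python
def batches(items: list, size: int = 100):
    """Yield fixed-size batches; one API call per batch instead of per item."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# e.g.: for batch in batches(documents, 100): embed(batch)
```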

Cost Monitoring

Usage Dashboard

Set Budgets
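A budget is easiest to reason about as a soft alert threshold plus a hard cutoff. A minimal tracker (real enforcement belongs in your API gateway or billing layer):

```python
class BudgetGuard:
    """Tracks spend against a monthly limit with soft and hard thresholds."""

    def __init__(self, monthly_limit: float, alert_fraction: float = 0.8):
        self.limit = monthly_limit
        self.alert_fraction = alert_fraction
        self.spent = 0.0

    def record(self, cost: float) -> str:
        """Record a request's cost and return the resulting budget state."""
        self.spent += cost
        if self.spent >= self.limit:
            return "block"  # hard limit: reject or queue non-urgent requests
        if self.spent >= self.limit * self.alert_fraction:
            return "alert"  # soft limit: notify, consider degrading quality
        return "ok"
```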

Cost by Component

Cost-Saving Tactics

Tactic 1: Tiered Response Quality

Impact: 40-60% cost reduction versus using the highest-quality tier for every query.
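One way to implement tiering is a static map from user plan to pipeline configuration (every name and number below is illustrative, not a product setting):

```python
TIER_CONFIG = {
    "basic":    {"strategy": "redwood", "model": "gpt-4o-mini", "max_tokens": 256},
    "standard": {"strategy": "cedar",   "model": "gpt-4o-mini", "max_tokens": 512},
    "premium":  {"strategy": "cypress", "model": "gpt-4o",      "max_tokens": 1024},
}

def config_for(plan: str) -> dict:
    # Unknown plans fall back to the cheapest tier.
    return TIER_CONFIG.get(plan, TIER_CONFIG["basic"])
```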

Tactic 2: Query Deduplication
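Deduplication collapses repeated questions within a short window so each distinct query hits the pipeline only once. A sketch using simple text normalization:

```python
from collections import Counter

def collapse_window(queries: list) -> Counter:
    """Count queries by normalized text: each distinct key needs only one
    upstream call, and its count is how many callers share that answer."""
    return Counter(" ".join(q.lower().split()) for q in queries)
```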

Tactic 3: Peak/Off-Peak Pricing

Tactic 4: Lazy Loading

Real-World Examples

Case Study: SaaS Company

Before Optimization:

After Optimization:

Case Study: E-commerce

Optimization:

  • Cached product queries (60% hit rate)

  • Redwood for FAQ (70% of queries)

  • Cedar for complex questions (30%)

  • GPT-3.5-turbo for product info

  • GPT-4o for technical support

Results:

  • Cost: $0.008/query (from $0.025)

  • 68% cost reduction

  • Response time: 1.3s (from 2.1s)

  • Accuracy: 86% (from 89%, acceptable trade-off)

Cost Analysis Tools

Cost Calculator
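A back-of-the-envelope monthly estimate, treating cache hits as (approximately) free and using the kind of per-query figures shown in this guide:

```python
def monthly_cost(queries_per_day: int, cost_per_query: float,
                 cache_hit_rate: float = 0.0, days: int = 30) -> float:
    """Estimated monthly spend in USD; only cache misses are billable."""
    billable = queries_per_day * days * (1 - cache_hit_rate)
    return round(billable * cost_per_query, 2)

# E.g. 1,000 queries/day at the e-commerce case study's $0.008/query
# with its 60% cache hit rate:
print(monthly_cost(1000, 0.008, cache_hit_rate=0.6))  # → 96.0
```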

ROI Calculator
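ROI for an optimization effort is just payback time on the engineering cost. A minimal helper:

```python
def payback_months(monthly_cost_before: float, monthly_cost_after: float,
                   engineering_cost: float):
    """Months until monthly savings cover the one-time optimization work.
    Returns None when there are no savings to recover the cost."""
    savings = monthly_cost_before - monthly_cost_after
    if savings <= 0:
        return None
    return round(engineering_cost / savings, 1)
```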

Best Practices

1. Start Cheap, Scale Up

✅ Begin with Redwood + GPT-3.5-turbo
✅ Monitor accuracy
✅ Upgrade only if needed
❌ Don't start with the most expensive setup

2. Cache Aggressively

✅ Enable caching
✅ Long TTL for stable content
✅ Fuzzy matching
❌ Don't cache time-sensitive data

3. Monitor Continuously

✅ Track cost trends
✅ Set budget alerts
✅ Review monthly
❌ Don't ignore cost creep

4. Optimize Data Processing

✅ Incremental syncs
✅ Process only changes
✅ Schedule during off-peak
❌ Don't reprocess everything

Troubleshooting

Unexpected High Costs

Investigate:

  1. Check query volume (unexpected spike?)

  2. Review token usage (responses too long?)

  3. Check cache hit rate (caching working?)

  4. Verify strategy mix (using expensive strategies?)

  5. Audit model usage (using GPT-4 too much?)

Budget Exceeded

Immediate Actions:

  1. Enable hard budget limit

  2. Switch to cheaper strategies

  3. Reduce max tokens

  4. Increase cache TTL

  5. Queue non-urgent requests

Next Steps
