Vector Database Performance

The Problem

Vector similarity search becomes slow as database grows—queries take seconds instead of milliseconds, index builds timeout, and memory usage spirals out of control.

Symptoms

❌ Query latency >1 second (was <100ms)
❌ Index build takes hours
❌ Out-of-memory errors
❌ CPU usage spikes during queries
❌ Throughput drops as DB grows

Real-World Example

Vector DB performance degradation:

10K vectors: 50ms query time ✓
100K vectors: 150ms query time ✓
1M vectors: 800ms query time ⚠️
10M vectors: 3+ seconds query time ✗

User experience degrades:
→ Page loads feel slow
→ Real-time chat delayed
→ Users frustrated

Cause: O(n) brute-force search doesn't scale
Need approximate search algorithms

Deep Technical Analysis

Brute-Force vs Approximate Search

Exact vs approximate nearest neighbors:

Brute-Force (Exact):

Algorithm:
For each vector in database:
  → Compute cosine similarity with query
  → Track top-K

Complexity: O(n × d)
→ n = number of vectors
→ d = dimensions

1M vectors × 1536 dims:
→ 1.5 billion operations per query
→ ~500ms on modern CPU

Doesn't scale beyond 1M vectors

Approximate Nearest Neighbor (ANN):

Algorithms: HNSW, IVF, NSG, etc.

Trade accuracy for speed:
→ May miss some true nearest neighbors
→ But: Returns "good enough" results
→ 10-100x faster

Complexity: O(log n) (HNSW)
→ Logarithmic growth
→ Scales to billions of vectors

10M vectors: ~100ms query time
→ Acceptable for user-facing apps

HNSW Index Structure

Hierarchical Navigable Small World graphs:

Graph Construction:

Build multi-layer graph:
→ Layer 0: All vectors (dense)
→ Layer 1: 50% of vectors
→ Layer 2: 25% of vectors
→ Top layer: Few vectors

Search starts at top:
→ Quickly narrow region
→ Descend to lower layers
→ Refine search

Log(n) hops to find neighbors

The Memory Trade-off:

HNSW stores:
→ Vectors themselves (n × d × 4 bytes)
→ Graph edges (n × M × 4 bytes)
  → M = connections per node (16-64)

10M vectors × 1536 dims × 4 bytes = 58 GB (vectors)
10M vectors × 32 edges × 4 bytes = 1.2 GB (graph)
Total: ~60 GB RAM

High memory requirement
→ Expensive infrastructure
→ But: Fast queries

Build Time Challenge:

Inserting 10M vectors:
→ Each insert: O(log n) + edge updates
→ Total: O(n log n)
→ ~2-4 hours on standard hardware

During build:
→ Index not queryable
→ Or: Degraded performance
→ Requires maintenance windows

Index-Vector-Flat (IVF) Approach

Clustering-based search:

Clustering Strategy:

1. Cluster vectors into N groups (k-means)
2. Store cluster centroids
3. Assign each vector to nearest centroid

Query:
1. Find nearest K centroids to query (fast)
2. Search only vectors in those K clusters
3. Return top results

Reduces search space:
→ Search K/N of database
→ If K=10, N=1000: Search 1% of vectors
→ 100x speedup

The Re-Ranking Pattern:

Two-stage search:
1. IVF approximate search → 100 candidates
2. Exact brute-force on 100 → top-10

Fast first stage (approximate)
+ Accurate second stage (exact)
= Good balance

Total time: 20ms + 5ms = 25ms
→ vs 500ms brute-force
→ 20x faster

Quantization for Memory Reduction

Compress vectors to use less RAM:

Product Quantization:

Original vector: 1536 floats × 4 bytes = 6 KB

Quantized: 1536 × 1 byte = 1.5 KB
→ 75% memory savings

How:
→ Divide 1536 dims into 8 subspaces (192 dims each)
→ Cluster each subspace (256 clusters)
→ Store cluster ID (1 byte) instead of 192 floats

Retrieval:
→ Approximate similarity from cluster IDs
→ Re-rank top candidates with exact vectors

Accuracy-Speed Trade-off:

Float32 (full precision):
→ Exact similarity
→ 6 KB per vector
→ Slowest

Float16 (half precision):
→ Slight accuracy loss (~0.1%)
→ 3 KB per vector
→ 2x memory savings

Int8 (product quantization):
→ Moderate accuracy loss (~2-5%)
→ 1.5 KB per vector
→ 4x memory savings

Binary (extreme):
→ Significant accuracy loss (~10-15%)
→ 192 bytes per vector
→ 32x memory savings

Choose based on requirements

Sharding and Distribution

Scale horizontally:

Database Sharding:

10M vectors split across 10 shards:
→ Shard 1: 1M vectors
→ Shard 2: 1M vectors
→ ...
→ Shard 10: 1M vectors

Query:
→ Parallel search across all shards
→ Each returns top-10
→ Merge results → final top-10

Latency: Max(shard query times)
→ Not sum
→ Parallelization benefit

The Hot Shard Problem:

Uneven data distribution:
→ Shard 1: Popular documents (high QPS)
→ Shard 10: Rarely queried documents

Shard 1 becomes bottleneck:
→ Overloaded
→ Slower responses
→ Drags down overall latency

Solution:
→ Rebalance shards periodically
→ Or: Hash-based distribution
→ Or: Replicate hot shards

Write Amplification

Index updates are expensive:

Single Vector Insert:

HNSW insert:
1. Find insertion point (log n hops)
2. Add vector to layer 0
3. Create M edges to neighbors
4. Update neighbor edges (reciprocal)
5. Potentially propagate to upper layers

Operations: 50-100 per insert
→ vs 1 operation for raw storage

Write amplification: 50-100x

Bulk vs Incremental:

Incremental updates:
→ Add 1 vector at a time
→ Rebuild local graph structure
→ Fast per-vector
→ But: Index quality degrades over time

Bulk rebuild:
→ Build entire index from scratch
→ Optimal structure
→ Slow (hours)
→ Downtime or dual-index strategy

Query Optimization

Improve search performance:

Batch Queries:

Process queries in batches:
→ 10 queries at once
→ Amortize index traversal overhead
→ Better CPU cache utilization

Single query: 100ms
Batch of 10: 400ms (40ms each)
→ 2.5x throughput improvement

Trade-off: Adds latency (wait for batch)
→ Good for background processing
→ Bad for user-facing queries

Prefiltering vs Postfiltering:

Query with metadata filter:
"Find similar docs WHERE category='API'"

Prefilter approach:
1. Filter vectors by metadata
2. Build temporary index on filtered set
3. Search filtered index

Postfilter approach:
1. Search entire index
2. Get top-100 candidates
3. Filter by metadata
4. Return top-10 after filtering

Prefilter: Better if filter is selective (10% match)
Postfilter: Better if filter is broad (90% match)

Monitoring and Diagnosis

Track performance metrics:

Key Metrics:

Latency percentiles:
→ p50: 50ms (median)
→ p90: 120ms (90th percentile)
→ p99: 500ms (99th percentile)

Why percentiles?
→ Average hides outliers
→ p99 affects user experience

Tail latency:
→ 1% of queries take 500ms
→ 1 in 100 users frustrated
→ p99 latency critical

Degradation Signals:

Warning signs:
→ Latency creeping up over time
→ Memory usage increasing
→ Query throughput decreasing

Causes:
→ Index fragmentation (needs rebuild)
→ Hot spots (rebalance needed)
→ Memory pressure (swap, OOM)

Proactive monitoring prevents outages

How to Solve

Use approximate search (HNSW, IVF) instead of brute-force + implement product quantization for memory savings + shard database horizontally + batch queries where possible + rebuild indexes periodically + monitor p99 latency + use SSD for disk-based indexes. See Vector DB Performance.

PreviousEmbedding Cost Optimization NextHallucination in Responses

Last updated 1 minute ago

hashtagThe Problem

hashtagSymptoms

hashtagReal-World Example

hashtagDeep Technical Analysis

hashtagBrute-Force vs Approximate Search

hashtagHNSW Index Structure

hashtagIndex-Vector-Flat (IVF) Approach

hashtagQuantization for Memory Reduction

hashtagSharding and Distribution

hashtagWrite Amplification

hashtagQuery Optimization

hashtagMonitoring and Diagnosis

hashtagHow to Solve