# Vector Database Performance

## The Problem

Vector similarity search becomes slow as database grows—queries take seconds instead of milliseconds, index builds timeout, and memory usage spirals out of control.

### Symptoms

* ❌ Query latency >1 second (was <100ms)
* ❌ Index build takes hours
* ❌ Out-of-memory errors
* ❌ CPU usage spikes during queries
* ❌ Throughput drops as DB grows

### Real-World Example

```
Vector DB performance degradation:

10K vectors: 50ms query time ✓
100K vectors: 150ms query time ✓
1M vectors: 800ms query time ⚠️
10M vectors: 3+ seconds query time ✗

User experience degrades:
→ Page loads feel slow
→ Real-time chat delayed
→ Users frustrated

Cause: O(n) brute-force search doesn't scale
Need approximate search algorithms
```

***

## Deep Technical Analysis

### Brute-Force vs Approximate Search

Exact vs approximate nearest neighbors:

**Brute-Force (Exact):**

```
Algorithm:
For each vector in database:
  → Compute cosine similarity with query
  → Track top-K

Complexity: O(n × d)
→ n = number of vectors
→ d = dimensions

1M vectors × 1536 dims:
→ 1.5 billion operations per query
→ ~500ms on modern CPU

Doesn't scale beyond 1M vectors
```

**Approximate Nearest Neighbor (ANN):**

```
Algorithms: HNSW, IVF, NSG, etc.

Trade accuracy for speed:
→ May miss some true nearest neighbors
→ But: Returns "good enough" results
→ 10-100x faster

Complexity: O(log n) (HNSW)
→ Logarithmic growth
→ Scales to billions of vectors

10M vectors: ~100ms query time
→ Acceptable for user-facing apps
```

### HNSW Index Structure

Hierarchical Navigable Small World graphs:

**Graph Construction:**

```
Build multi-layer graph:
→ Layer 0: All vectors (dense)
→ Layer 1: 50% of vectors
→ Layer 2: 25% of vectors
→ Top layer: Few vectors

Search starts at top:
→ Quickly narrow region
→ Descend to lower layers
→ Refine search

Log(n) hops to find neighbors
```

**The Memory Trade-off:**

```
HNSW stores:
→ Vectors themselves (n × d × 4 bytes)
→ Graph edges (n × M × 4 bytes)
  → M = connections per node (16-64)

10M vectors × 1536 dims × 4 bytes = 58 GB (vectors)
10M vectors × 32 edges × 4 bytes = 1.2 GB (graph)
Total: ~60 GB RAM

High memory requirement
→ Expensive infrastructure
→ But: Fast queries
```

**Build Time Challenge:**

```
Inserting 10M vectors:
→ Each insert: O(log n) + edge updates
→ Total: O(n log n)
→ ~2-4 hours on standard hardware

During build:
→ Index not queryable
→ Or: Degraded performance
→ Requires maintenance windows
```

### Index-Vector-Flat (IVF) Approach

Clustering-based search:

**Clustering Strategy:**

```
1. Cluster vectors into N groups (k-means)
2. Store cluster centroids
3. Assign each vector to nearest centroid

Query:
1. Find nearest K centroids to query (fast)
2. Search only vectors in those K clusters
3. Return top results

Reduces search space:
→ Search K/N of database
→ If K=10, N=1000: Search 1% of vectors
→ 100x speedup
```

**The Re-Ranking Pattern:**

```
Two-stage search:
1. IVF approximate search → 100 candidates
2. Exact brute-force on 100 → top-10

Fast first stage (approximate)
+ Accurate second stage (exact)
= Good balance

Total time: 20ms + 5ms = 25ms
→ vs 500ms brute-force
→ 20x faster
```

### Quantization for Memory Reduction

Compress vectors to use less RAM:

**Product Quantization:**

```
Original vector: 1536 floats × 4 bytes = 6 KB

Quantized: 1536 × 1 byte = 1.5 KB
→ 75% memory savings

How:
→ Divide 1536 dims into 8 subspaces (192 dims each)
→ Cluster each subspace (256 clusters)
→ Store cluster ID (1 byte) instead of 192 floats

Retrieval:
→ Approximate similarity from cluster IDs
→ Re-rank top candidates with exact vectors
```

**Accuracy-Speed Trade-off:**

```
Float32 (full precision):
→ Exact similarity
→ 6 KB per vector
→ Slowest

Float16 (half precision):
→ Slight accuracy loss (~0.1%)
→ 3 KB per vector
→ 2x memory savings

Int8 (product quantization):
→ Moderate accuracy loss (~2-5%)
→ 1.5 KB per vector
→ 4x memory savings

Binary (extreme):
→ Significant accuracy loss (~10-15%)
→ 192 bytes per vector
→ 32x memory savings

Choose based on requirements
```

### Sharding and Distribution

Scale horizontally:

**Database Sharding:**

```
10M vectors split across 10 shards:
→ Shard 1: 1M vectors
→ Shard 2: 1M vectors
→ ...
→ Shard 10: 1M vectors

Query:
→ Parallel search across all shards
→ Each returns top-10
→ Merge results → final top-10

Latency: Max(shard query times)
→ Not sum
→ Parallelization benefit
```

**The Hot Shard Problem:**

```
Uneven data distribution:
→ Shard 1: Popular documents (high QPS)
→ Shard 10: Rarely queried documents

Shard 1 becomes bottleneck:
→ Overloaded
→ Slower responses
→ Drags down overall latency

Solution:
→ Rebalance shards periodically
→ Or: Hash-based distribution
→ Or: Replicate hot shards
```

### Write Amplification

Index updates are expensive:

**Single Vector Insert:**

```
HNSW insert:
1. Find insertion point (log n hops)
2. Add vector to layer 0
3. Create M edges to neighbors
4. Update neighbor edges (reciprocal)
5. Potentially propagate to upper layers

Operations: 50-100 per insert
→ vs 1 operation for raw storage

Write amplification: 50-100x
```

**Bulk vs Incremental:**

```
Incremental updates:
→ Add 1 vector at a time
→ Rebuild local graph structure
→ Fast per-vector
→ But: Index quality degrades over time

Bulk rebuild:
→ Build entire index from scratch
→ Optimal structure
→ Slow (hours)
→ Downtime or dual-index strategy
```

### Query Optimization

Improve search performance:

**Batch Queries:**

```
Process queries in batches:
→ 10 queries at once
→ Amortize index traversal overhead
→ Better CPU cache utilization

Single query: 100ms
Batch of 10: 400ms (40ms each)
→ 2.5x throughput improvement

Trade-off: Adds latency (wait for batch)
→ Good for background processing
→ Bad for user-facing queries
```

**Prefiltering vs Postfiltering:**

```
Query with metadata filter:
"Find similar docs WHERE category='API'"

Prefilter approach:
1. Filter vectors by metadata
2. Build temporary index on filtered set
3. Search filtered index

Postfilter approach:
1. Search entire index
2. Get top-100 candidates
3. Filter by metadata
4. Return top-10 after filtering

Prefilter: Better if filter is selective (10% match)
Postfilter: Better if filter is broad (90% match)
```

### Monitoring and Diagnosis

Track performance metrics:

**Key Metrics:**

```
Latency percentiles:
→ p50: 50ms (median)
→ p90: 120ms (90th percentile)
→ p99: 500ms (99th percentile)

Why percentiles?
→ Average hides outliers
→ p99 affects user experience

Tail latency:
→ 1% of queries take 500ms
→ 1 in 100 users frustrated
→ p99 latency critical
```

**Degradation Signals:**

```
Warning signs:
→ Latency creeping up over time
→ Memory usage increasing
→ Query throughput decreasing

Causes:
→ Index fragmentation (needs rebuild)
→ Hot spots (rebalance needed)
→ Memory pressure (swap, OOM)

Proactive monitoring prevents outages
```

***

## How to Solve

**Use approximate search (HNSW, IVF) instead of brute-force + implement product quantization for memory savings + shard database horizontally + batch queries where possible + rebuild indexes periodically + monitor p99 latency + use SSD for disk-based indexes.** See [Vector DB Performance](/rag-scenarios-and-solutions/vectors/vector-db-performance.md).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/rag-scenarios-and-solutions/vectors/vector-db-performance.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.