Incremental Sync Not Working

The Problem

Your data source performs full re-sync every time instead of only syncing changes, causing slow sync times and wasted resources.

Symptoms

❌ Every sync takes 30+ minutes even when no changes
❌ "Processing 5,000 documents" but only 5 changed
❌ High API usage and rate limiting
❌ Vector database grows unnecessarily
❌ Old deleted content still appears in knowledge base

Real-World Example

Confluence space: 500 pages
Last sync: 2 days ago
Changes since: 3 new pages, 2 updated, 1 deleted

Expected sync: Process 6 pages (5 min)
Actual sync: Process all 500 pages (45 min)

API calls: 1,500 (should be 18)
Status: "Full sync completed" (incremental failed)

Deep Technical Analysis

Change Detection Methods

Different data sources use different mechanisms to track changes:

1. Timestamp-Based (modified_at):

SQL-style approach:
SELECT * FROM documents WHERE modified_at > '2024-01-20T10:00:00Z'

Pros:
+ Simple to implement
+ Standard field on most systems
+ Efficient query

Cons:
- Clock skew problems (server time vs API time)
- Timezone ambiguity (UTC? Local?)
- Doesn't detect deletions
- Can miss documents if modified_at not updated

2. Version/ETag:

HTTP-style:
Request document with If-None-Match: "abc123"
If version unchanged: 304 Not Modified
If changed: 200 OK with new content + new ETag

Pros:
+ Accurate change detection
+ No clock skew issues
+ Server-authoritative

Cons:
- Requires per-document API call (slow)
- Not all systems support ETags
- Must store ETag for every document

3. Change Token/Cursor (event log):

Google Drive approach:
1. Initial sync returns: startPageToken: "xyz123"
2. Store token
3. Next sync: changes.list(pageToken="xyz123")
4. Returns: new/modified/deleted files since token
5. New token: "xyz456"

Pros:
+ Efficient (one API call)
+ Captures all change types (create/update/delete)
+ No clock skew
+ Server maintains change log

Cons:
- Token expires after ~1 week
- If expired, must do full sync
- Complex error handling
- Not all APIs support this

4. Webhook/Event-Based:

Real-time push:
→ Data source sends webhook on every change
→ Immediate processing
→ No polling needed

Pros:
+ Real-time updates
+ Zero polling overhead

Cons:
- Requires webhook endpoint setup
- Delivery not guaranteed (need fallback)
- Must handle high event volume
- Security: validate webhook signature

The Deletion Detection Problem

Most change detection methods don't report deletions:

The Invisible Deletion:

Scenario:
Day 1: Document "Product V1" exists in Confluence
Twig syncs: Embedded in vector DB

Day 2: User deletes "Product V1" page

Incremental sync using modified_at:
→ Query: WHERE modified_at > yesterday
→ Returns: 0 results (deleted page not in response)
→ Twig thinks: "No changes"
→ Old embedding stays in vector DB

User asks AI: "What's Product V1?"
AI responds: "Product V1 is..." (from deleted doc)

Deletion Detection Strategies:

1. Track document IDs:
   - Store all known document IDs
   - On sync, compare: known_ids vs returned_ids
   - Missing IDs = deleted
   - Complexity: O(n) comparison for large datasets

2. Use change log APIs:
   - APIs that report deletions explicitly
   - Example: Google Drive changes.list() includes "deleted: true"
   - Not all APIs support this

3. Full sync periodically:
   - Incremental sync daily
   - Full sync weekly
   - Catch any missed deletions
   - But: defeats purpose of incremental sync

4. Deletion webhooks:
   - Listen for "document.deleted" events
   - Immediate removal from vector DB
   - Requires webhook infrastructure

Timestamp Precision and Clock Skew

Timestamp-based sync has subtle timing issues:

Clock Skew Problem:

Scenario:
Twig server time: 2024-01-20 10:00:00 UTC
Data source server time: 2024-01-20 09:59:00 UTC (1 min behind)

Sync at 10:00:00 UTC:
→ Query: WHERE modified_at > '2024-01-20 09:55:00'
→ Store last_sync_ts = '2024-01-20 10:00:00'

Document updated at 09:59:30 (source server time)
→ Source server records: modified_at = '2024-01-20 09:59:30'

Next sync at 10:30:00 UTC:
→ Query: WHERE modified_at > '2024-01-20 10:00:00'
→ Doesn't match document (09:59:30 < 10:00:00)
→ Change missed permanently

Timezone Ambiguity:

API returns: "modified_at": "2024-01-20 10:00:00"

Is this:
→ UTC?
→ Server local time?
→ User's timezone?

If ambiguous:
→ Parser may interpret incorrectly
→ Off by hours (or even a day)
→ Changes missed or duplicated

Precision Loss:

Database stores: 2024-01-20 10:00:00.123456 (microseconds)
API returns: 2024-01-20 10:00:00 (seconds only)

Two documents updated in same second:
1. 10:00:00.100
2. 10:00:00.900

Next sync query: modified_at > '2024-01-20 10:00:00'
→ Might miss both (depending on > vs >= semantics)
→ Or include both

Safer approach:
→ Subtract buffer (e.g., 1 minute) from last_sync_ts
→ Query: modified_at > (last_sync_ts - 1 min)
→ May process some documents twice, but no misses

Concurrency and Mid-Sync Changes

Documents can change during sync:

The Moving Target Problem:

Sync starts at 10:00:00:
1. Query: modified_at > yesterday
2. Returns: 100 documents
3. Twig begins processing document 1
4. (Processing takes 10 minutes)
5. At 10:05:00: User updates document 50
6. At 10:10:00: Twig processes document 50
   → Processes OLD version (from 10:00:00 query)
7. Sync completes at 10:10:00
8. last_sync_ts = 10:10:00

Document 50's update at 10:05:00:
→ Timestamp: 10:05:00
→ Last sync: 10:10:00
→ Next sync: modified_at > 10:10:00
→ Update missed (10:05:00 < 10:10:00)

Solutions:

1. Snapshot timestamp at start:
   snapshot_ts = now()
   Query: modified_at > last_sync_ts AND modified_at < snapshot_ts
   Store: last_sync_ts = snapshot_ts
   Next sync won't miss concurrent changes

2. Re-query changed documents:
   After processing all documents
   Re-query: modified_at > start_ts AND modified_at < now()
   Process additional changes
   Repeat until no new changes
   (Eventually consistent)

3. Version-based sync:
   Use ETags or version numbers
   Immune to timing issues
   But requires per-document checks

Batch Processing and Pagination State

Incremental sync must handle large change sets:

Pagination Interruption:

Incremental sync finds 5,000 changed documents:

Page 1: 100 docs (cursor_1)
Page 2: 100 docs (cursor_2)
...
Page 25: 100 docs (cursor_25)
→ Twig server crashes or times out

On restart:
→ Cursors expired
→ Must restart from beginning
→ Re-process 2,500 documents already done
→ Inefficient

Checkpoint Strategy:

Better approach:
1. Fetch page 1 (100 docs)
2. Process all 100
3. Persist checkpoint: cursor_1, last_processed_id
4. Fetch page 2
5. If crash: resume from checkpoint

Requires:
→ Durable state storage (database, not memory)
→ Idempotent processing (safe to re-process docs)
→ Cursor TTL management (refresh if needed)

Some changes have cascading effects:

The Parent-Child Problem:

Confluence:
Parent page: "Product Guide"
Child pages: "Installation", "Configuration", "Troubleshooting"

Scenario:
1. User renames parent: "Product Guide" → "Product Documentation"
2. Child pages' breadcrumbs now outdated
3. Incremental sync detects: 1 change (parent)
4. Child pages unmodified, not re-synced
5. Knowledge base has broken references

AI agent sees:
→ "Product Guide > Installation" (old breadcrumb)
→ "Product Documentation" (new parent name)
→ Inconsistent context

Link Resolution:

Document A links to Document B:
"See more details in [[Document B]]"

Scenario:
1. Document B renamed to "Document B v2"
2. Incremental sync: only Document B re-processed
3. Document A not updated (link still points to old name)

RAG retrieval:
→ Query matches Document A
→ LLM reads: "See Document B"
→ But Document B no longer exists (renamed)
→ Broken reference

State Management and Sync Metadata

Incremental sync requires persistent state:

Metadata to Store:

For each data source:
→ last_sync_timestamp
→ change_token / cursor
→ processed_document_ids (for deletion detection)
→ sync_status (in_progress, completed, failed)
→ error_count, last_error

Storage requirements:
→ Durable (survives server restart)
→ Consistent (no race conditions)
→ Per-data-source (not global)
→ Queryable (need status reports)

Race Condition:

Two sync jobs run concurrently (bug or manual trigger):

Job A:
1. Read: last_sync_ts = T1
2. Sync changes since T1
3. Write: last_sync_ts = T2

Job B (concurrent):
1. Read: last_sync_ts = T1 (same)
2. Sync changes since T1 (duplicate work)
3. Write: last_sync_ts = T3
4. Overwrites Job A's timestamp

Result:
→ Changes between T2 and T3 missed
→ Or processed twice

Solution:
→ Distributed lock (only one sync at a time)
→ Optimistic concurrency control (version numbers)

How to Solve

Use change tokens/cursors where available + implement deletion tracking with ID comparison + add timestamp buffer for clock skew + checkpoint pagination state + acquire lock before sync. See Data Source Configuration.

PreviousCSV Upload Failures NextWebhook Delivery Failures

Last updated 0 minutes ago

hashtagThe Problem

hashtagSymptoms

hashtagReal-World Example

hashtagDeep Technical Analysis

hashtagChange Detection Methods

hashtagThe Deletion Detection Problem

hashtagTimestamp Precision and Clock Skew

hashtagConcurrency and Mid-Sync Changes

hashtagBatch Processing and Pagination State

hashtagDependency Tracking and Related Changes

hashtagState Management and Sync Metadata

hashtagHow to Solve