Incremental Sync Not Working

The Problem

Your data source performs a full re-sync on every run instead of syncing only what changed, which slows syncs down and wastes resources.

Symptoms

  • ❌ Every sync takes 30+ minutes even when nothing has changed

  • ❌ "Processing 5,000 documents" but only 5 changed

  • ❌ High API usage and rate limiting

  • ❌ Vector database grows unnecessarily

  • ❌ Deleted content still appears in the knowledge base

Real-World Example

Confluence space: 500 pages
Last sync: 2 days ago
Changes since: 3 new pages, 2 updated, 1 deleted

Expected sync: Process 6 pages (5 min)
Actual sync: Process all 500 pages (45 min)

API calls: 1,500 (should be 18)
Status: "Full sync completed" (incremental failed)

Deep Technical Analysis

Change Detection Methods

Different data sources use different mechanisms to track changes:

1. Timestamp-Based (modified_at):
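
A minimal sketch of timestamp-based detection, assuming each document exposes a timezone-aware `modified_at` value and the time of the last successful sync is stored. `Doc` and `changed_since` are hypothetical names for illustration, not part of any connector API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Doc:
    id: str
    modified_at: datetime  # assumed to be timezone-aware UTC


def changed_since(docs, last_sync):
    """Return documents modified strictly after the last successful sync."""
    return [d for d in docs if d.modified_at > last_sync]


docs = [
    Doc("a", datetime(2024, 1, 1, tzinfo=timezone.utc)),
    Doc("b", datetime(2024, 1, 3, tzinfo=timezone.utc)),
]
last_sync = datetime(2024, 1, 2, tzinfo=timezone.utc)
changed = changed_since(docs, last_sync)  # only "b" was modified after last_sync
```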

2. Version/ETag:
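
Version or ETag comparison can be sketched as a lookup against the values stored at the previous sync. The dictionaries below are illustrative stand-ins for whatever the source API and the local index actually return.

```python
def needs_reindex(stored_etags, doc_id, remote_etag):
    """A document needs reprocessing when it is new or its ETag has changed."""
    return stored_etags.get(doc_id) != remote_etag


# ETags recorded at the previous sync versus ETags reported by the source now.
stored = {"page-1": "v3", "page-2": "v7"}
remote = {"page-1": "v3", "page-2": "v8", "page-3": "v1"}

to_process = sorted(
    doc_id for doc_id, etag in remote.items()
    if needs_reindex(stored, doc_id, etag)
)
```

Here `page-2` is picked up because its ETag changed and `page-3` because it has no stored ETag; `page-1` is skipped.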

3. Change Token/Cursor (event log):
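
A sketch of cursor-based sync, with an in-memory list standing in for the source's event log. `fetch_changes` is a hypothetical stand-in for a change-feed endpoint; real APIs usually return an opaque token rather than an integer offset.

```python
def fetch_changes(event_log, cursor, page_size=2):
    """Stand-in for a change-feed call: returns (events, next_cursor)."""
    page = event_log[cursor:cursor + page_size]
    return page, cursor + len(page)


def sync_from_cursor(event_log, cursor):
    """Drain the feed from the saved cursor; return the events and the new cursor."""
    seen = []
    while True:
        page, cursor = fetch_changes(event_log, cursor)
        if not page:
            return seen, cursor
        seen.extend(page)


log = [("create", "a"), ("update", "b"), ("delete", "c"), ("update", "a")]
events, new_cursor = sync_from_cursor(log, cursor=2)
```

Because the log records deletes as events, this method reports deletions directly, which timestamp queries generally cannot do.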

4. Webhook/Event-Based:

The Deletion Detection Problem

Most change detection methods don't report deletions:

The Invisible Deletion:

Deletion Detection Strategies:
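
One strategy, ID comparison, can be sketched as a set difference between what the index holds and what the source currently lists. This assumes the source can list all current IDs cheaply, without fetching document bodies.

```python
def find_deleted(indexed_ids, source_ids):
    """IDs present in the index but no longer listed by the source."""
    return set(indexed_ids) - set(source_ids)


indexed = {"p1", "p2", "p3"}   # what the knowledge base currently holds
source = {"p1", "p3", "p4"}    # what the source lists right now

deleted = find_deleted(indexed, source)   # candidates to remove from the index
created = set(source) - set(indexed)      # the same comparison also reveals new IDs
```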

Timestamp Precision and Clock Skew

Timestamp-based sync has subtle timing issues:

Clock Skew Problem:
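
A common mitigation is to start the query window slightly before the stored sync time. The five-minute buffer below is an assumed value, not a recommendation from any particular source; documents caught twice need an idempotent update path (for example, an ETag check).

```python
from datetime import datetime, timedelta, timezone

SKEW_BUFFER = timedelta(minutes=5)  # assumed value; tune per data source


def query_window_start(last_sync):
    """Look back past the stored sync time to absorb clock differences."""
    return last_sync - SKEW_BUFFER


last_sync = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window_start = query_window_start(last_sync)
```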

Timezone Ambiguity:
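
Normalizing every timestamp to UTC before comparing removes the ambiguity. This sketch treats timestamps without an offset as UTC, which is an assumption; check what the source actually means by an offset-free value.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Parse an ISO 8601 string and return a timezone-aware UTC datetime."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        # Assumption: offset-free timestamps are already in UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


with_offset = to_utc("2024-01-15T10:00:00-05:00")
without_offset = to_utc("2024-01-15T15:00:00")
same_instant = with_offset == without_offset
```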

Precision Loss:
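
The sketch below illustrates one way precision loss can drop a change, assuming the source reports `modified_at` truncated to whole seconds while the stored sync time keeps sub-second precision. Flooring the stored time to the source's precision and comparing with `>=` catches the document, at the cost of occasionally reprocessing one.

```python
from datetime import datetime, timezone


def floor_to_second(dt):
    """Drop sub-second precision to match a source that reports whole seconds."""
    return dt.replace(microsecond=0)


# Stored sync time with millisecond precision.
last_sync = datetime(2024, 1, 1, 10, 0, 0, 900000, tzinfo=timezone.utc)
# Document actually changed at 10:00:00.950, but the source reports 10:00:00.
reported = datetime(2024, 1, 1, 10, 0, 0, tzinfo=timezone.utc)

missed = not (reported > last_sync)               # naive comparison skips it
caught = reported >= floor_to_second(last_sync)   # floored comparison includes it
```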

Concurrency and Mid-Sync Changes

Documents can change during sync:

The Moving Target Problem:

Solutions:
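
One solution is to record the sync start time before listing changes and use that, rather than the finish time, as the next watermark. Anything modified while the run is in progress then falls inside the next run's window. The function and the fake clock below are illustrative only.

```python
def run_sync(list_changes, process, now, last_sync):
    """Process changes since last_sync; return the watermark for the next run."""
    started_at = now()  # captured BEFORE listing, so mid-sync edits are not lost
    for doc in list_changes(last_sync):
        process(doc)
    return started_at


# Fake clock: each processed document advances time by 10 units.
clock = {"t": 100}


def now():
    return clock["t"]


def list_changes(since):
    return ["doc-1", "doc-2"]


def process(doc):
    clock["t"] += 10


next_watermark = run_sync(list_changes, process, now, last_sync=50)
```

The run finishes at time 120 but stores 100, so a document edited at time 110 is picked up by the next sync.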

Batch Processing and Pagination State

Incremental sync must handle large change sets:

Pagination Interruption:

Checkpoint Strategy:
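
A minimal checkpointing sketch, assuming pagination state can be serialized as JSON. The file layout and the integer page token are assumptions for illustration; the checkpoint is written to a temporary file and renamed so a crash never leaves a half-written file.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically: write a temp file, then rename over the target."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"page_token": None, "processed": 0}


def sync_pages(pages, path, fail_at=None):
    """Process pages from the last checkpoint, saving state after each page."""
    state = load_checkpoint(path)
    start = state["page_token"] or 0
    for i in range(start, len(pages)):
        if i == fail_at:
            raise RuntimeError("simulated interruption")
        state = {"page_token": i + 1,
                 "processed": state["processed"] + len(pages[i])}
        save_checkpoint(path, state)
    return state


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
pages = [["a", "b"], ["c", "d"], ["e"]]
try:
    sync_pages(pages, path, fail_at=2)   # interrupted before the third page
except RuntimeError:
    pass
resumed = sync_pages(pages, path)        # resumes at the third page only
```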

Some changes have cascading effects:

The Parent-Child Problem:
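
One way to handle cascading changes is to expand the changed set to include descendants, on the assumption that a parent change (such as a rename or move) alters data derived for its children, like paths or breadcrumbs. The `children` mapping is a hypothetical structure built from the source's page tree.

```python
def expand_with_descendants(changed_ids, children):
    """Return the changed IDs plus every descendant reachable through `children`.

    `children` maps a parent ID to a list of its direct child IDs.
    """
    result, stack = set(), list(changed_ids)
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(children.get(node, []))
    return result


children = {"root": ["a", "b"], "a": ["a1", "a2"], "b": []}
affected = expand_with_descendants(["a"], children)
```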

Link Resolution:

State Management and Sync Metadata

Incremental sync requires persistent state:

Metadata to Store:
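
As an illustration, the per-source state could be modeled as below. The field names are assumptions drawn from the techniques discussed on this page, not a documented schema.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SyncState:
    """Hypothetical per-source sync state; persist it after each successful step."""
    source_id: str
    last_successful_sync: Optional[str] = None  # ISO 8601 timestamp in UTC
    cursor: Optional[str] = None                # change token, if the source has one
    page_token: Optional[str] = None            # mid-run pagination checkpoint
    known_ids: list = field(default_factory=list)   # for deletion detection
    etags: dict = field(default_factory=dict)       # doc ID -> last seen ETag


state = SyncState(source_id="confluence-space-1")
state.etags["page-1"] = "v3"
serialized = asdict(state)               # plain dict, ready for JSON storage
restored = SyncState(**serialized)
```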

Race Condition:
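
Two overlapping syncs can overwrite each other's state. A simple guard is an exclusive lock file, sketched below for a single host. This is an assumed approach: a crashed process can leave a stale lock behind, and multi-host deployments need a shared lock (for example, in a database) instead.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def sync_lock(path):
    """Hold an exclusive lock file; raise if another sync already holds it."""
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError("another sync is already running")
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)


lock_path = os.path.join(tempfile.mkdtemp(), "sync.lock")
second_blocked = False
with sync_lock(lock_path):
    try:
        with sync_lock(lock_path):   # a second sync attempts to start
            pass
    except RuntimeError:
        second_blocked = True
lock_released = not os.path.exists(lock_path)
```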


How to Solve

Use change tokens or cursors where the source provides them, track deletions by comparing source IDs against indexed IDs, subtract a buffer from the last-sync timestamp to absorb clock skew, checkpoint pagination state so interrupted runs can resume, and acquire a lock before each sync so runs cannot overlap. See Data Source Configuration.
