# Incremental Sync Not Working

## The Problem

Your data source performs full re-sync every time instead of only syncing changes, causing slow sync times and wasted resources.

### Symptoms

* ❌ Every sync takes 30+ minutes even when no changes
* ❌ "Processing 5,000 documents" but only 5 changed
* ❌ High API usage and rate limiting
* ❌ Vector database grows unnecessarily
* ❌ Old deleted content still appears in knowledge base

### Real-World Example

```
Confluence space: 500 pages
Last sync: 2 days ago
Changes since: 3 new pages, 2 updated, 1 deleted

Expected sync: Process 6 pages (5 min)
Actual sync: Process all 500 pages (45 min)

API calls: 1,500 (should be 18)
Status: "Full sync completed" (incremental failed)
```

***

## Deep Technical Analysis

### Change Detection Methods

Different data sources use different mechanisms to track changes:

**1. Timestamp-Based (modified\_at):**

```
SQL-style approach:
SELECT * FROM documents WHERE modified_at > '2024-01-20T10:00:00Z'

Pros:
+ Simple to implement
+ Standard field on most systems
+ Efficient query

Cons:
- Clock skew problems (server time vs API time)
- Timezone ambiguity (UTC? Local?)
- Doesn't detect deletions
- Can miss documents if modified_at not updated
```

**2. Version/ETag:**

```
HTTP-style:
Request document with If-None-Match: "abc123"
If version unchanged: 304 Not Modified
If changed: 200 OK with new content + new ETag

Pros:
+ Accurate change detection
+ No clock skew issues
+ Server-authoritative

Cons:
- Requires per-document API call (slow)
- Not all systems support ETags
- Must store ETag for every document
```

**3. Change Token/Cursor (event log):**

```
Google Drive approach:
1. Initial sync returns: startPageToken: "xyz123"
2. Store token
3. Next sync: changes.list(pageToken="xyz123")
4. Returns: new/modified/deleted files since token
5. New token: "xyz456"

Pros:
+ Efficient (one API call)
+ Captures all change types (create/update/delete)
+ No clock skew
+ Server maintains change log

Cons:
- Token expires after ~1 week
- If expired, must do full sync
- Complex error handling
- Not all APIs support this
```

**4. Webhook/Event-Based:**

```
Real-time push:
→ Data source sends webhook on every change
→ Immediate processing
→ No polling needed

Pros:
+ Real-time updates
+ Zero polling overhead

Cons:
- Requires webhook endpoint setup
- Delivery not guaranteed (need fallback)
- Must handle high event volume
- Security: validate webhook signature
```

### The Deletion Detection Problem

Most change detection methods don't report deletions:

**The Invisible Deletion:**

```
Scenario:
Day 1: Document "Product V1" exists in Confluence
Twig syncs: Embedded in vector DB

Day 2: User deletes "Product V1" page

Incremental sync using modified_at:
→ Query: WHERE modified_at > yesterday
→ Returns: 0 results (deleted page not in response)
→ Twig thinks: "No changes"
→ Old embedding stays in vector DB

User asks AI: "What's Product V1?"
AI responds: "Product V1 is..." (from deleted doc)
```

**Deletion Detection Strategies:**

```
1. Track document IDs:
   - Store all known document IDs
   - On sync, compare: known_ids vs returned_ids
   - Missing IDs = deleted
   - Complexity: O(n) comparison for large datasets

2. Use change log APIs:
   - APIs that report deletions explicitly
   - Example: Google Drive changes.list() includes "deleted: true"
   - Not all APIs support this

3. Full sync periodically:
   - Incremental sync daily
   - Full sync weekly
   - Catch any missed deletions
   - But: defeats purpose of incremental sync

4. Deletion webhooks:
   - Listen for "document.deleted" events
   - Immediate removal from vector DB
   - Requires webhook infrastructure
```

### Timestamp Precision and Clock Skew

Timestamp-based sync has subtle timing issues:

**Clock Skew Problem:**

```
Scenario:
Twig server time: 2024-01-20 10:00:00 UTC
Data source server time: 2024-01-20 09:59:00 UTC (1 min behind)

Sync at 10:00:00 UTC:
→ Query: WHERE modified_at > '2024-01-20 09:55:00'
→ Store last_sync_ts = '2024-01-20 10:00:00'

Document updated at 09:59:30 (source server time)
→ Source server records: modified_at = '2024-01-20 09:59:30'

Next sync at 10:30:00 UTC:
→ Query: WHERE modified_at > '2024-01-20 10:00:00'
→ Doesn't match document (09:59:30 < 10:00:00)
→ Change missed permanently
```

**Timezone Ambiguity:**

```
API returns: "modified_at": "2024-01-20 10:00:00"

Is this:
→ UTC?
→ Server local time?
→ User's timezone?

If ambiguous:
→ Parser may interpret incorrectly
→ Off by hours (or even a day)
→ Changes missed or duplicated
```

**Precision Loss:**

```
Database stores: 2024-01-20 10:00:00.123456 (microseconds)
API returns: 2024-01-20 10:00:00 (seconds only)

Two documents updated in same second:
1. 10:00:00.100
2. 10:00:00.900

Next sync query: modified_at > '2024-01-20 10:00:00'
→ Might miss both (depending on > vs >= semantics)
→ Or include both

Safer approach:
→ Subtract buffer (e.g., 1 minute) from last_sync_ts
→ Query: modified_at > (last_sync_ts - 1 min)
→ May process some documents twice, but no misses
```

### Concurrency and Mid-Sync Changes

Documents can change during sync:

**The Moving Target Problem:**

```
Sync starts at 10:00:00:
1. Query: modified_at > yesterday
2. Returns: 100 documents
3. Twig begins processing document 1
4. (Processing takes 10 minutes)
5. At 10:05:00: User updates document 50
6. At 10:10:00: Twig processes document 50
   → Processes OLD version (from 10:00:00 query)
7. Sync completes at 10:10:00
8. last_sync_ts = 10:10:00

Document 50's update at 10:05:00:
→ Timestamp: 10:05:00
→ Last sync: 10:10:00
→ Next sync: modified_at > 10:10:00
→ Update missed (10:05:00 < 10:10:00)
```

**Solutions:**

```
1. Snapshot timestamp at start:
   snapshot_ts = now()
   Query: modified_at > last_sync_ts AND modified_at < snapshot_ts
   Store: last_sync_ts = snapshot_ts
   Next sync won't miss concurrent changes

2. Re-query changed documents:
   After processing all documents
   Re-query: modified_at > start_ts AND modified_at < now()
   Process additional changes
   Repeat until no new changes
   (Eventually consistent)

3. Version-based sync:
   Use ETags or version numbers
   Immune to timing issues
   But requires per-document checks
```

### Batch Processing and Pagination State

Incremental sync must handle large change sets:

**Pagination Interruption:**

```
Incremental sync finds 5,000 changed documents:

Page 1: 100 docs (cursor_1)
Page 2: 100 docs (cursor_2)
...
Page 25: 100 docs (cursor_25)
→ Twig server crashes or times out

On restart:
→ Cursors expired
→ Must restart from beginning
→ Re-process 2,500 documents already done
→ Inefficient
```

**Checkpoint Strategy:**

```
Better approach:
1. Fetch page 1 (100 docs)
2. Process all 100
3. Persist checkpoint: cursor_1, last_processed_id
4. Fetch page 2
5. If crash: resume from checkpoint

Requires:
→ Durable state storage (database, not memory)
→ Idempotent processing (safe to re-process docs)
→ Cursor TTL management (refresh if needed)
```

### Dependency Tracking and Related Changes

Some changes have cascading effects:

**The Parent-Child Problem:**

```
Confluence:
Parent page: "Product Guide"
Child pages: "Installation", "Configuration", "Troubleshooting"

Scenario:
1. User renames parent: "Product Guide" → "Product Documentation"
2. Child pages' breadcrumbs now outdated
3. Incremental sync detects: 1 change (parent)
4. Child pages unmodified, not re-synced
5. Knowledge base has broken references

AI agent sees:
→ "Product Guide > Installation" (old breadcrumb)
→ "Product Documentation" (new parent name)
→ Inconsistent context
```

**Link Resolution:**

```
Document A links to Document B:
"See more details in [[Document B]]"

Scenario:
1. Document B renamed to "Document B v2"
2. Incremental sync: only Document B re-processed
3. Document A not updated (link still points to old name)

RAG retrieval:
→ Query matches Document A
→ LLM reads: "See Document B"
→ But Document B no longer exists (renamed)
→ Broken reference
```

### State Management and Sync Metadata

Incremental sync requires persistent state:

**Metadata to Store:**

```
For each data source:
→ last_sync_timestamp
→ change_token / cursor
→ processed_document_ids (for deletion detection)
→ sync_status (in_progress, completed, failed)
→ error_count, last_error

Storage requirements:
→ Durable (survives server restart)
→ Consistent (no race conditions)
→ Per-data-source (not global)
→ Queryable (need status reports)
```

**Race Condition:**

```
Two sync jobs run concurrently (bug or manual trigger):

Job A:
1. Read: last_sync_ts = T1
2. Sync changes since T1
3. Write: last_sync_ts = T2

Job B (concurrent):
1. Read: last_sync_ts = T1 (same)
2. Sync changes since T1 (duplicate work)
3. Write: last_sync_ts = T3
4. Overwrites Job A's timestamp

Result:
→ Changes between T2 and T3 missed
→ Or processed twice

Solution:
→ Distributed lock (only one sync at a time)
→ Optimistic concurrency control (version numbers)
```

***

## How to Solve

**Use change tokens/cursors where available + implement deletion tracking with ID comparison + add timestamp buffer for clock skew + checkpoint pagination state + acquire lock before sync.** See [Data Source Configuration](https://github.com/thrivapp/twig-help-docs/blob/staging/data/overview.md).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/rag-scenarios-and-solutions/data-integration/incremental-sync.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
