# Webhook Delivery Failures

## The Problem

Webhooks from your data sources fail to reach Twig, arrive late, or get delivered multiple times, causing inconsistent knowledge base updates.

### Symptoms

* ❌ Real-time updates not working despite webhook setup
* ❌ "Webhook endpoint unreachable" errors
* ❌ Same document processed 5 times
* ❌ Updates arrive 20 minutes late
* ❌ Webhooks only work sometimes

### Real-World Example

```
Confluence configured to send webhooks on page updates

User updates page at 10:00 AM
Expected: Immediate knowledge base update
Actual: No update

Webhook delivery attempts:
10:00:00 → 502 Bad Gateway
10:00:05 → Timeout (30s)
10:00:35 → Connection refused
10:05:00 → Retries exhausted, webhook dropped

Result: Update never synced, AI agent has stale data
```

***

## Deep Technical Analysis

### Webhook Delivery Guarantees

Webhooks are fundamentally unreliable:

**HTTP Push Model:**

```
Traditional API (pull):
→ Twig controls timing
→ Twig retries on failure
→ Twig can re-fetch anything it missed

Webhook (push):
→ Data source controls timing
→ Limited retries (5-10 attempts)
→ No guarantee of delivery
```

**Failure Modes:**

```
1. Network failure:
   → DNS resolution fails
   → Connection timeout
   → Packet loss
   → Result: Webhook never arrives

2. Server unavailable:
   → Twig server restarting
   → Load balancer down
   → 502/503 errors
   → Result: Source retries, then gives up

3. Request timeout:
   → Webhook delivered
   → Twig processing slow (>30s)
   → Source times out before response
   → Source thinks: delivery failed, retries
   → Result: Duplicate processing

4. Invalid response:
   → Twig returns 400 (bug in handler)
   → Source interprets as permanent failure
   → No retries
   → Result: Event lost
```

**At-Least-Once vs Exactly-Once:**

```
Most webhook providers guarantee:
→ At-least-once delivery
→ May deliver same event multiple times
→ Twig must handle idempotency

Alternative (rare):
→ Exactly-once delivery
→ Requires distributed transaction protocol
→ Most providers don't support this
```

### Webhook Endpoint Requirements

Receiving webhooks requires public infrastructure:

**Public Accessibility:**

```
Webhook sender (e.g., Confluence):
→ Must reach Twig's webhook endpoint
→ Requires public IP/domain
→ HTTPS required (most sources mandate TLS)
→ Valid SSL certificate

Development challenges:
→ Local development: localhost not reachable
→ Need ngrok/localtunnel for testing
→ Firewall rules must allow inbound HTTPS
→ DNS must resolve correctly
```

**Load Balancer Complications:**

```
Architecture:
Internet → Load Balancer → Twig Servers (3 instances)

Webhook delivery to: https://api.twig.com/webhooks/confluence

Load balancer distributes requests:
→ 33% to server 1
→ 33% to server 2
→ 34% to server 3

If server 2 is down:
→ Load balancer may route webhook there
→ Connection fails
→ Source retries
→ May hit different server (server 1)
→ Successful delivery

But:
→ Source logs show: 1 failure, 1 success
→ Source cannot tell whether the failed attempt was partially processed
→ If it was, the retry is a duplicate delivery
→ Idempotency needed
```

### Signature Verification and Security

Twig must verify that each webhook is authentic to prevent spoofed events:

**Unsigned Webhooks (insecure):**

```
POST /webhooks/confluence
Body: { "page_id": 123, "action": "updated" }

Problem:
→ Anyone can POST this endpoint
→ Attacker sends fake webhook
→ Twig processes it as real
→ Malicious data injected into knowledge base
```

**HMAC Signature Verification:**

```
Confluence webhook:
Header: X-Hub-Signature: sha256=abc123...
Body: { "page_id": 123, "action": "updated" }

Verification:
1. Twig looks up secret for this data source
2. Computes: HMAC-SHA256(secret, request_body)
3. Compares with X-Hub-Signature
4. If match: authentic
5. If mismatch: reject (403 Forbidden)

Challenges:
→ Different sources use different header names
→ Different HMAC algorithms (SHA1, SHA256, SHA512)
→ Some use URL encoding, some don't
→ Secret rotation: old webhooks use old secret
→ Clock skew: timestamp-based signatures expire
```
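The verification steps above can be sketched in Python with the standard library. This is a minimal sketch, assuming the header value has the form `sha256=<hex digest>` as in the example; the function name `verify_signature` and the secret are made up for illustration, and real sources may differ in header name, algorithm, and encoding.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, header_value: str) -> bool:
    """Return True if header_value matches HMAC-SHA256(secret, body)."""
    scheme, _, received_hex = header_value.partition("=")
    if scheme != "sha256" or not received_hex:
        return False  # unsupported algorithm or malformed header
    expected_hex = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest takes time independent of where the inputs differ,
    # so an attacker cannot learn the digest one character at a time.
    return hmac.compare_digest(expected_hex.encode(), received_hex.encode())


# Usage with made-up values (the secret and body are illustrative only).
secret = b"example-shared-secret"
body = b'{"page_id": 123, "action": "updated"}'
good_header = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()

print(verify_signature(secret, body, good_header))         # True
print(verify_signature(secret, body + b" ", good_header))  # False
```

The signature must be computed over the raw request bytes. Re-serializing parsed JSON can change whitespace or key order and cause false mismatches.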

**Timestamp Validation:**

```
Webhook header:
X-Slack-Request-Timestamp: 1642204800

Verification:
1. Extract timestamp
2. Current time: now()
3. Difference: abs(now - timestamp)
4. If difference > 5 minutes: reject (replay attack)

But:
→ Twig server clock off by 10 minutes
→ All webhooks rejected as "too old"
→ Need NTP sync
```
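As a rough illustration of the timestamp check, the sketch below applies the 5-minute window to a Unix timestamp header. The name `is_fresh` and the injectable `now` parameter are assumptions made for this example so that the clock-skew case can be shown deterministically.

```python
import time
from typing import Optional

MAX_SKEW_SECONDS = 5 * 60  # the 5-minute window described above


def is_fresh(timestamp_header: str, now: Optional[float] = None) -> bool:
    """Return True if the header timestamp is within the allowed window."""
    try:
        sent_at = int(timestamp_header)
    except ValueError:
        return False  # header is not a Unix timestamp
    current = time.time() if now is None else now
    return abs(current - sent_at) <= MAX_SKEW_SECONDS


# A webhook received one minute after it was sent is accepted.
print(is_fresh("1642204800", now=1642204800 + 60))   # True
# A receiver clock running 10 minutes fast rejects the same webhook.
print(is_fresh("1642204800", now=1642204800 + 600))  # False
```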

### Idempotency and Duplicate Handling

Webhooks may be delivered multiple times:

**The Duplicate Problem:**

```
Scenario:
10:00:00 → Webhook delivered, Twig processing
10:00:25 → Processing not complete yet
10:00:30 → Source times out, assumes failure
10:00:35 → Source retries, sends same webhook again
10:00:40 → Twig finishes first processing, returns 200
10:00:42 → Twig processes second (duplicate) webhook

Result:
→ Same document processed twice
→ Duplicate embeddings in vector DB
→ Wasted compute and storage
```

**Idempotency Key:**

```
Best practice: webhook includes unique ID
{
  "event_id": "evt_abc123",  ← Idempotency key
  "page_id": 456,
  "action": "updated"
}

Twig handler:
1. Check if event_id already processed
2. If yes: return 200 (idempotent)
3. If no: process and record event_id

Requires:
→ Persistent storage of processed event IDs
→ TTL/expiry (can't store forever)
→ Typically keep for 24-48 hours
```
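A minimal in-memory sketch of the handler steps above follows. `ProcessedEvents`, `claim`, `release`, and `handle` are hypothetical names; a real deployment would need shared persistent storage instead of a dictionary. The sketch records the event ID before processing (and releases it on failure) so that a retry arriving while the first delivery is still in progress is also treated as a duplicate.

```python
import time
from typing import Callable, Dict, Optional


class ProcessedEvents:
    """In-memory record of event IDs with a TTL (stand-in for a shared store)."""

    def __init__(self, ttl_seconds: float = 24 * 3600) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def claim(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True the first time an ID is seen within the TTL."""
        current = time.time() if now is None else now
        # Drop expired entries so the record does not grow forever.
        self._expires_at = {k: v for k, v in self._expires_at.items() if v > current}
        if event_id in self._expires_at:
            return False
        self._expires_at[event_id] = current + self.ttl_seconds
        return True

    def release(self, event_id: str) -> None:
        """Forget an ID so that a later retry can be processed."""
        self._expires_at.pop(event_id, None)


def handle(event: dict, seen: ProcessedEvents, process: Callable[[dict], None]) -> int:
    event_id = event["event_id"]
    if not seen.claim(event_id):
        return 200  # duplicate: acknowledge without reprocessing
    try:
        process(event)
    except Exception:
        seen.release(event_id)  # processing failed, allow the retry through
        raise
    return 200
```

In this sketch, calling `handle` twice with the same `event_id` runs `process` only once while the ID is within its TTL.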

**Content-Hash Idempotency:**

```
Alternative: use content hash
1. Hash webhook body: hash(JSON.stringify(body))
2. Check if hash processed recently
3. If yes: skip (probable duplicate)
4. If no: process

Pros:
+ No need for explicit event_id field
+ Works with any webhook source

Cons:
- False positives (two distinct updates with identical payloads are skipped as duplicates)
- Hash collisions (rare but possible)
- Doesn't work if the payload includes a field that changes on every delivery (e.g., a timestamp)
```
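One way to compute such a hash is sketched below. It serializes the payload with sorted keys so that key order does not change the result, and it drops volatile top-level fields before hashing. The function name `payload_fingerprint` and the `timestamp` field name are assumptions for illustration, not part of any particular source's payload.

```python
import hashlib
import json
from typing import Iterable


def payload_fingerprint(body: dict, ignore_keys: Iterable[str] = ("timestamp",)) -> str:
    """Hash a payload in a canonical form, skipping volatile top-level keys."""
    ignored = set(ignore_keys)
    stable = {k: v for k, v in body.items() if k not in ignored}
    # sort_keys and fixed separators give the same bytes for equal content.
    canonical = json.dumps(stable, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


first = {"page_id": 456, "action": "updated", "timestamp": 1642204800}
retry = {"action": "updated", "timestamp": 1642204835, "page_id": 456}
other = {"page_id": 456, "action": "deleted", "timestamp": 1642204800}

print(payload_fingerprint(first) == payload_fingerprint(retry))  # True
print(payload_fingerprint(first) == payload_fingerprint(other))  # False
```

Ignoring the volatile field makes the false-positive risk listed above more likely: two genuine updates to the same page produce the same fingerprint, so the lookup window should stay short.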

### Ordering and Sequencing

Webhooks may arrive out of order:

**The Race Condition:**

```
User actions:
10:00:00 → Create page "Guide"
10:00:05 → Update page "Guide" (add content)
10:00:10 → Update page "Guide" (fix typo)

Webhook delivery:
10:00:01 → Created webhook sent
10:00:06 → Updated webhook sent
10:00:11 → Updated webhook sent

But network delays:
10:00:15 → "Updated" arrives first
10:00:18 → "Created" arrives second
10:00:19 → "Updated" arrives third

Twig processing:
→ Update page (but it doesn't exist yet!)
→ Create page (replaces update with empty version)
→ Update page (now correct, but lost first update)

Final state: Incomplete content
```

**Sequence Number Solution:**

```
Webhook payload:
{
  "event_id": "evt_abc123",
  "sequence": 456,  ← Per-page counter
  "page_id": 789,
  "action": "updated"
}

Twig handler:
1. Check last_processed_sequence for page 789
2. If incoming sequence <= last_processed: skip (old event)
3. If incoming sequence > last_processed + 1: gap detected
   → Store for later
   → Wait for missing sequences
4. If sequence == last_processed + 1: process normally

Complexity:
→ Must track sequence per document
→ Gap handling (what if missing event never arrives?)
→ Not all webhook sources provide sequence numbers
```
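The handler logic above can be sketched as a small buffer that holds out-of-order events until the gap is filled. This sketch assumes the sequence counter is per page and starts at 1; `PageSequencer` and its attributes are hypothetical names. It does not address the open question of a missing event that never arrives, which would need a timeout or a full re-sync.

```python
from typing import Callable, Dict


class PageSequencer:
    """Apply events per page in sequence order, buffering across gaps."""

    def __init__(self, apply: Callable[[dict], None]) -> None:
        self.apply = apply
        self.last_processed: Dict[int, int] = {}     # page_id -> last applied sequence
        self.pending: Dict[int, Dict[int, dict]] = {}  # page_id -> {sequence: event}

    def receive(self, event: dict) -> None:
        page_id, seq = event["page_id"], event["sequence"]
        last = self.last_processed.get(page_id, 0)
        if seq <= last:
            return  # old or duplicate event: skip
        buffered = self.pending.setdefault(page_id, {})
        buffered[seq] = event
        # Apply every event that is now contiguous with the last applied one.
        while last + 1 in buffered:
            last += 1
            self.apply(buffered.pop(last))
        self.last_processed[page_id] = last


applied = []
sequencer = PageSequencer(lambda e: applied.append((e["sequence"], e["action"])))
sequencer.receive({"page_id": 789, "sequence": 2, "action": "updated"})  # gap: held
sequencer.receive({"page_id": 789, "sequence": 1, "action": "created"})  # fills gap
sequencer.receive({"page_id": 789, "sequence": 3, "action": "updated"})
print(applied)  # [(1, 'created'), (2, 'updated'), (3, 'updated')]
```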

### Retry Backoff and Thundering Herd

Transient failures trigger retries:

**Exponential Backoff:**

```
Webhook delivery attempts:
1. Immediate: POST /webhook
2. 5s later: POST /webhook (retry 1)
3. 25s later: POST /webhook (retry 2)
4. 125s later: POST /webhook (retry 3)
5. Give up

But:
→ If Twig is down for 10 minutes
→ Webhooks sent during that window are all waiting on retries
→ When Twig comes back: up to 600 webhooks (at one event per second) hit at once
→ Thundering herd problem
→ Twig overloaded, fails again
```
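For reference, the retry schedule above corresponds to a base delay of 5 seconds multiplied by 5 on each retry. The sketch below reproduces it and also shows "full jitter", a common sender-side mitigation for synchronized retries. Jitter is not something the schedule above includes, and Twig cannot impose it on a third-party source; it is shown only to illustrate how retries can be spread out. The function names are made up for this example.

```python
import random


def backoff_delay(retry: int, base: float = 5.0, factor: float = 5.0) -> float:
    """Delay in seconds before the given retry (1-based): 5, 25, 125."""
    return base * factor ** (retry - 1)


def jittered_delay(retry: int, rng: random.Random) -> float:
    """Full jitter: pick uniformly between 0 and the exponential delay."""
    return rng.uniform(0, backoff_delay(retry))


print([backoff_delay(n) for n in (1, 2, 3)])  # [5.0, 25.0, 125.0]

# With jitter, senders that failed at the same moment no longer retry in lockstep.
rng = random.Random()
print([round(jittered_delay(3, rng), 1) for _ in range(3)])
```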

**Rate Limiting on Receiver:**

```
Twig must limit incoming webhook rate:
→ 100 webhooks/second max
→ If exceeded: return 429 (Too Many Requests)
→ Retry-After header: 60 seconds

But source may not respect Retry-After:
→ Retry immediately anyway
→ More 429 errors
→ Eventually give up
→ Events lost
```
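A token bucket is one common way to enforce a limit like this. The sketch below is an assumption about how such a limiter could look, not a description of Twig's implementation; `TokenBucket` and `admit` are hypothetical names, and the 100 per second rate and 60 second `Retry-After` value come from the example above.

```python
import time
from typing import Dict, Optional, Tuple


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated_at = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        current = time.monotonic() if now is None else now
        elapsed = max(0.0, current - self.updated_at)
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = current
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def admit(bucket: TokenBucket, now: Optional[float] = None) -> Tuple[int, Dict[str, str]]:
    """Return (status, headers) for an incoming webhook."""
    if bucket.allow(now):
        return 202, {}
    return 429, {"Retry-After": "60"}


bucket = TokenBucket(rate=100, capacity=100, now=0.0)
statuses = [admit(bucket, now=0.0)[0] for _ in range(101)]
print(statuses.count(202), statuses.count(429))  # 100 1
```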

### Long Processing and Timeout

Webhook processing must be fast:

**The Timeout Problem:**

```
Webhook arrives: page_updated event
Twig handler:
1. Verify signature (50ms)
2. Validate payload (10ms)
3. Fetch full page content from Confluence API (2s)
4. Chunk content (500ms)
5. Generate embeddings (5s)
6. Store in vector DB (1s)
Total: 8.56 seconds

But:
→ Webhook source timeout: 5 seconds
→ Source sees no response after 5s
→ Source marks as failed, retries
→ Twig finishes at 8.56s, returns 200 (to closed connection)
→ Result: Duplicate processing on retry
```

**Async Processing Pattern:**

```
Better approach:
1. Webhook arrives
2. Verify signature (50ms)
3. Validate payload (10ms)
4. Add to processing queue (Redis, SQS)
5. Return 202 Accepted immediately (total: 60ms)

Background worker:
→ Dequeue event
→ Process fully (fetch, chunk, embed)
→ No timeout risk

Pros:
+ Fast webhook response
+ Avoids duplicates caused by timeout-triggered retries
+ Can handle bursts

Cons:
- More complex architecture
- Need queue infrastructure
- Harder debugging (async)
```
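The pattern above can be sketched with Python's standard `queue` and `threading` modules standing in for Redis or SQS. The handler only validates and enqueues, and a background worker does the slow work. The names `receive_webhook` and `worker` and the `signature_ok` flag are illustrative assumptions; a production queue would also need persistence so that events survive a restart.

```python
import queue
import threading
from typing import Callable

events = queue.Queue()  # in-memory stand-in for Redis/SQS
processed = []


def receive_webhook(payload: dict, signature_ok: bool) -> int:
    """Fast path: verify, validate, enqueue, and respond immediately."""
    if not signature_ok:
        return 403
    if "page_id" not in payload:
        return 400
    events.put(payload)
    return 202  # accepted for later processing


def worker(process: Callable[[dict], None]) -> None:
    """Slow path: fetch, chunk, and embed outside the request cycle."""
    while True:
        event = events.get()
        try:
            if event is None:  # sentinel used to stop the worker
                return
            process(event)
        finally:
            events.task_done()


thread = threading.Thread(target=worker, args=(processed.append,), daemon=True)
thread.start()

status = receive_webhook({"page_id": 789, "action": "updated"}, signature_ok=True)
events.put(None)  # stop the worker once the queue drains
events.join()     # wait until everything queued has been handled

print(status, processed)  # 202 [{'page_id': 789, 'action': 'updated'}]
```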

### Failed Webhook Recovery

When webhooks are lost, a fallback is needed:

**Hybrid Sync Strategy:**

```
Primary: Webhooks for real-time updates
Fallback: Periodic polling for missed events

Example:
→ Webhooks deliver changes as they happen (real-time)
→ Full sync every 6 hours (catch missed events)

But:
→ Polling finds changes already processed by webhooks
→ Duplicate detection needed
→ Wasted API calls
→ Complex coordination logic
```
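One way to keep the periodic sync from reprocessing everything is to compare versions before fetching content. The sketch below assumes the source exposes a monotonically increasing version number per page and that Twig records the version it last indexed. Both are assumptions for illustration, and `find_missed_updates` is a hypothetical helper.

```python
from typing import Dict, List


def find_missed_updates(
    source_versions: Dict[str, int],
    indexed_versions: Dict[str, int],
) -> List[str]:
    """Return page IDs whose source version is newer than the indexed one."""
    return sorted(
        page_id
        for page_id, version in source_versions.items()
        if indexed_versions.get(page_id, -1) < version
    )


# Page "a" was already handled by a webhook; "b" missed an update; "c" was never seen.
source = {"a": 3, "b": 2, "c": 1}
indexed = {"a": 3, "b": 1}
print(find_missed_updates(source, indexed))  # ['b', 'c']
```

Only the pages returned here need a full fetch, which limits the wasted API calls to the version listing itself.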

**Dead Letter Queue:**

```
Failed webhook handling:
1. Webhook processing fails (bug, bad data)
2. Retry 3 times
3. Still failing? Move to DLQ (Dead Letter Queue)
4. Alert engineers
5. Manual investigation

Prevents:
→ Blocking queue with poison messages
→ Infinite retry loops

Trade-off:
→ Requires manual intervention
```
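The flow above can be sketched as a bounded retry loop that parks the event once retries are exhausted. This is a minimal sketch in which a plain list stands in for the dead letter queue; `process_with_dlq` and `MAX_RETRIES` are names chosen for this example, and alerting is left out.

```python
from typing import Callable, List, Tuple

MAX_RETRIES = 3  # retries after the first failed attempt, as in the steps above


def process_with_dlq(
    event: dict,
    handler: Callable[[dict], None],
    dead_letters: List[Tuple[dict, str]],
) -> bool:
    """Try the handler, retry on failure, and park the event if it keeps failing."""
    last_error = None
    for _attempt in range(1 + MAX_RETRIES):
        try:
            handler(event)
            return True
        except Exception as exc:  # a real worker might only retry transient errors
            last_error = exc
    # Keep the event and the last error so an engineer can investigate later.
    dead_letters.append((event, repr(last_error)))
    return False


def always_fails(event: dict) -> None:
    raise ValueError("bad data")


dlq: List[Tuple[dict, str]] = []
ok = process_with_dlq({"event_id": "evt_abc123"}, always_fails, dlq)
print(ok, len(dlq))  # False 1
```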

***

## How to Solve

**Implement HMAC signature verification, process webhooks asynchronously through a queue, store idempotency keys, add a periodic polling fallback, and rate-limit incoming requests to absorb retry storms.** See [Webhook Configuration](https://github.com/thrivapp/twig-help-docs/blob/staging/integrations/webhooks.md).

