Getting Started

Add and configure data sources for ingestion

What is a Data Source

A data source is content that Twig indexes for retrieval. Supported types:

  • Documentation sites and help centers

  • File uploads (PDF, DOCX, TXT)

  • Confluence spaces

  • Slack channels

  • Google Drive, SharePoint, OneDrive folders

  • Zendesk articles

Processing flow: Fetch documents → Parse text → Chunk (512 tokens) → Embed (OpenAI ada-002) → Index (Pinecone)
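The Chunk step in the flow above can be illustrated with a minimal sketch. It assumes whitespace-separated words as a stand-in for tokens and non-overlapping chunks; the tokenizer Twig uses and any overlap between chunks are not stated in this guide, and `chunk_tokens` is a hypothetical name.

```python
from typing import Iterator, List

CHUNK_SIZE = 512  # chunk size taken from the processing flow above


def chunk_tokens(tokens: List[str], size: int = CHUNK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping slices of at most `size` tokens."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


# Assumption: whitespace splitting approximates tokenization, for illustration only.
text = "word " * 1300
chunks = list(chunk_tokens(text.split()))
print([len(c) for c in chunks])  # [512, 512, 276]
```

A 1,300-token document yields three chunks under these assumptions, which is why the Chunks Indexed count is normally much larger than the number of documents or pages.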

Navigating to Data Sources

  1. Click Data in left navigation

  2. Or: Home → Data Sources → Manage Data Sources

Expected view: List of existing data sources with status (Active, Processing, Failed)

Data Sources Screen

Columns displayed:

  • Name: Data source name

  • Type: WEBSITE, FILE, CONFLUENCE, SLACK, etc.

  • Status: Active (green), Processing (yellow), Failed (red)

  • Chunks Indexed: Count (e.g., "1,234 chunks")

  • Last Sync: Timestamp (e.g., "2 hours ago")

  • Actions: Process, Edit, Delete buttons

Status meanings:

  • Active: Ingestion complete, available for retrieval

  • Processing: Currently chunking/embedding/indexing

  • Failed: Error during processing (click for error log)

Add a New Data Source

Click Add Data Source button (top right)

Supported Types

Type               Input                 Max Size         Processing Time
Website Sitemap    Sitemap.xml URL       10,000 pages     5-30 min
Website Crawler    Base URL              10,000 pages     10-60 min
File Upload        PDF, DOCX, TXT        50MB per file    1-5 min
Zip Upload         .zip with documents   200MB            5-20 min
Confluence Space   OAuth connection      Unlimited        10-60 min
Slack Workspace    OAuth connection      Last 90 days     10-30 min
Google Drive       OAuth connection      Unlimited        10-60 min

Website Sitemap

  1. Select Website Sitemap from modal

  2. Enter sitemap URL: https://example.com/sitemap.xml

  3. Click Add

  4. Status changes: "Pending" → "Processing" → "Active"

Expected result: Pages crawled count displayed (e.g., "250 pages → 1,200 chunks")

Common errors:

  • "Sitemap not found (404)" → Verify URL is accessible

  • "Rate limit exceeded" → Wait 1 hour, crawler resumes automatically
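Both sitemap problems are easier to avoid if the sitemap is checked before it is added. The sketch below is one possible pre-check using only the Python standard library. The function names are hypothetical, the namespace is the standard sitemaps.org one, and sitemap index files (which list other sitemaps) are not handled.

```python
import urllib.request
import xml.etree.ElementTree as ET
from typing import List

MAX_PAGES = 10_000  # page limit listed for Website Sitemap sources
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_sitemap(url: str, timeout: float = 10.0) -> bytes:
    """Download the sitemap; urllib raises HTTPError if the server returns 404."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def extract_page_urls(sitemap_xml: bytes) -> List[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(f"{NS}url/{NS}loc")
        if loc.text
    ]


def check_sitemap(sitemap_xml: bytes) -> str:
    """Summarize whether the sitemap fits within the documented page limit."""
    urls = extract_page_urls(sitemap_xml)
    if not urls:
        return "no <url> entries found"
    if len(urls) > MAX_PAGES:
        return f"{len(urls)} pages exceeds the {MAX_PAGES} page limit"
    return f"ok: {len(urls)} pages"
```

Usage: `check_sitemap(fetch_sitemap("https://example.com/sitemap.xml"))`. A 404 surfaces as `urllib.error.HTTPError` before anything is submitted to Twig.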

File Upload

  1. Select File Upload

  2. Click Choose Files or drag-and-drop

  3. Select files: PDF, DOCX, TXT (max 50MB each)

  4. Click Upload

Expected result: Each file shows progress bar → "Processing" → "Active"

Supported formats:

  • PDF: Text-based (not scanned images)

  • DOCX: Microsoft Word 2007+

  • TXT: UTF-8 encoding
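The format rules above can be checked locally before uploading. This is a minimal sketch, not part of Twig: `preflight` is a hypothetical helper, 50MB is assumed to mean 50 × 1024 × 1024 bytes, and the sketch cannot tell a text-based PDF from a scanned one.

```python
from pathlib import Path
from typing import List

MAX_FILE_BYTES = 50 * 1024 * 1024  # assumption: 1MB = 1024 * 1024 bytes
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def preflight(path: Path) -> List[str]:
    """Return the problems found for one file; an empty list means none detected."""
    problems: List[str] = []
    suffix = path.suffix.lower()
    if suffix not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {suffix or '(none)'}")
    if path.stat().st_size > MAX_FILE_BYTES:
        problems.append("file is larger than 50MB")
    if suffix == ".txt":
        try:
            # Decoding the whole file is the simplest way to confirm UTF-8.
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("TXT file is not valid UTF-8")
    return problems
```

Usage: `preflight(Path("handbook.txt"))` returns `[]` for a UTF-8 text file under the size limit.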

Confluence Space

  1. Select Confluence

  2. Click Connect to Confluence

  3. Authorize in Confluence OAuth screen

  4. Select spaces to index (checkboxes)

  5. Click Import

Expected result: Space count and page count displayed during processing

Permissions required: Confluence read access for selected spaces

Zip Upload

  1. Select Zip Upload

  2. Upload .zip file (max 200MB)

  3. Twig extracts and processes each file

Expected result: Shows file count (e.g., "50 files extracted → 200 chunks indexed")

Constraints:

  • Zip must contain only supported file types (PDF, DOCX, TXT)

  • Nested folders supported (files flattened during extraction)
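The constraints above can be checked before uploading with Python's `zipfile` module. This sketch is an illustration under assumptions: `check_zip` is a hypothetical helper, the 200MB limit is assumed to apply to the compressed archive, and the guide does not say how Twig handles files that end up with the same name after flattening, so the sketch only reports them.

```python
import os
import zipfile
from collections import Counter
from pathlib import PurePosixPath
from typing import List

MAX_ZIP_BYTES = 200 * 1024 * 1024  # assumption: limit applies to the .zip itself
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def check_zip(zip_path: str) -> List[str]:
    """Return the problems found in the archive; an empty list means none detected."""
    problems: List[str] = []
    if os.path.getsize(zip_path) > MAX_ZIP_BYTES:
        problems.append("archive is larger than 200MB")
    with zipfile.ZipFile(zip_path) as archive:
        # Directory entries carry no content, so only files are inspected.
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
    for name in names:
        if PurePosixPath(name).suffix.lower() not in SUPPORTED_EXTENSIONS:
            problems.append(f"unsupported file type: {name}")
    # Flattening drops folder paths, so equal base names may collide.
    base_names = Counter(PurePosixPath(name).name for name in names)
    for base, count in sorted(base_names.items()):
        if count > 1:
            problems.append(f"{count} files would share the name {base} after flattening")
    return problems
```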

How to Verify

  1. Data Sources list shows status "Active" (green)

  2. Chunks count > 0 (e.g., "450 chunks")

  3. Last sync timestamp recent (e.g., "5 minutes ago")

  4. Playground → Query agent → Check "Sources Used" panel shows chunks from this data source

Common Mistakes

Symptom: Status stuck at "Processing" for >30 minutes

Cause: Processing worker stalled or large dataset

Fix: Refresh page. If still processing after 1 hour, contact support with data source ID.
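The thresholds in this fix can be written down as a small decision helper, for example in a monitoring script. The function name and return strings are hypothetical; only the 30-minute and 1-hour thresholds come from this guide.

```python
from datetime import timedelta


def next_step(status: str, elapsed: timedelta) -> str:
    """Map a status and the time spent in it to the action this guide recommends."""
    if status != "Processing":
        return "no action: source is not processing"
    if elapsed > timedelta(hours=1):
        return "contact support with the data source ID"
    if elapsed > timedelta(minutes=30):
        return "refresh the page"
    return "wait"


print(next_step("Processing", timedelta(minutes=45)))  # refresh the page
print(next_step("Processing", timedelta(minutes=90)))  # contact support with the data source ID
```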


Symptom: Status "Failed" with error message

Cause: Invalid URL, authentication failure, or unsupported file format

Fix: Click data source name → Logs tab → check error message. Common fixes:

  • "401 Unauthorized" → Reconnect OAuth (Edit → Reconnect)

  • "Unsupported format" → Convert file to PDF/DOCX

  • "URL not accessible" → Verify URL works in browser
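The error-to-fix pairs listed in this guide can be collected into a lookup, which is handy when triaging many failed sources at once. This is a sketch: `suggest_fix` is a hypothetical helper, and matching by case-insensitive substring assumes the log text contains the quoted messages.

```python
from typing import List, Tuple

# Error text and fixes copied from this guide; the matching rule is an assumption.
KNOWN_FIXES: List[Tuple[str, str]] = [
    ("401 Unauthorized", "Reconnect OAuth (Edit → Reconnect)"),
    ("Unsupported format", "Convert file to PDF/DOCX"),
    ("URL not accessible", "Verify URL works in browser"),
    ("Sitemap not found (404)", "Verify URL is accessible"),
    ("Rate limit exceeded", "Wait 1 hour, crawler resumes automatically"),
]


def suggest_fix(log_message: str) -> str:
    """Return the documented fix for the first known error found in the message."""
    lowered = log_message.lower()
    for needle, fix in KNOWN_FIXES:
        if needle.lower() in lowered:
            return fix
    return "No documented fix in this guide"


print(suggest_fix("Failed: 401 Unauthorized while reading space"))
```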

When This Doesn't Apply

This guide covers standard data source types. For custom integrations (APIs, databases), contact Twig support.
