# Getting Started

## What is a Data Source

A data source is content that Twig indexes for retrieval. Supported types:

* Documentation sites and help centers
* File uploads (PDF, DOCX, TXT)
* Confluence spaces
* Slack channels
* Google Drive, SharePoint, OneDrive folders
* Zendesk articles

**Processing flow**: Fetch documents → Parse text → Chunk (512 tokens) → Embed (OpenAI ada-002) → Index (Pinecone)
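The chunking step in the flow above can be pictured with a minimal sketch. It assumes whitespace-separated words stand in for tokens and that chunks do not overlap; Twig's actual tokenizer and overlap policy are not described on this page, and `chunk_tokens` and `chunk_text` are hypothetical names.

```python
from typing import List

CHUNK_SIZE = 512  # tokens per chunk, taken from the processing flow above


def chunk_tokens(tokens: List[str], size: int = CHUNK_SIZE) -> List[List[str]]:
    """Split a token list into consecutive, non-overlapping chunks."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_text(text: str, size: int = CHUNK_SIZE) -> List[str]:
    """Assumption: whitespace-separated words approximate tokens."""
    tokens = text.split()
    return [" ".join(chunk) for chunk in chunk_tokens(tokens, size)]
```

Under these assumptions, a 1,200-word document yields three chunks, with the last one shorter than the others.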

## Navigate to Data Sources

Use either path:

* Click **Data** in the left navigation
* Or go to Home → Data Sources → Manage Data Sources

**Expected view**: List of existing data sources with status (Active, Processing, Failed)

<figure><img src="/files/muDZCT7d645CVebfbrQX" alt=""><figcaption><p>Navigating to Data Sources</p></figcaption></figure>

## Data Sources Screen

**Columns displayed**:

* **Name**: Data source name
* **Type**: WEBSITE, FILE, CONFLUENCE, SLACK, etc.
* **Status**: Active (green), Processing (yellow), Failed (red)
* **Chunks Indexed**: Count (e.g., "1,234 chunks")
* **Last Sync**: Timestamp (e.g., "2 hours ago")
* **Actions**: Process, Edit, Delete buttons

**Status meanings**:

* **Active**: Ingestion complete, available for retrieval
* **Processing**: Currently chunking/embedding/indexing
* **Failed**: Error during processing (click for error log)

<figure><img src="/files/01jyyYBVAbgGTB5sp6RD" alt=""><figcaption><p>Data Sources Screen</p></figcaption></figure>

## Add a New Data Source

Click the **Add Data Source** button (top right).

### Supported Types

| Type                 | Input               | Max Size      | Processing Time |
| -------------------- | ------------------- | ------------- | --------------- |
| **Website Sitemap**  | Sitemap.xml URL     | 10,000 pages  | 5-30 min        |
| **Website Crawler**  | Base URL            | 10,000 pages  | 10-60 min       |
| **File Upload**      | PDF, DOCX, TXT      | 50MB per file | 1-5 min         |
| **Zip Upload**       | .zip with documents | 200MB         | 5-20 min        |
| **Confluence Space** | OAuth connection    | Unlimited     | 10-60 min       |
| **Slack Workspace**  | OAuth connection    | Last 90 days  | 10-30 min       |
| **Google Drive**     | OAuth connection    | Unlimited     | 10-60 min       |

### Website Sitemap

1. Select **Website Sitemap** from the modal
2. Enter sitemap URL: `https://example.com/sitemap.xml`
3. Click **Add**
4. Status changes: "Pending" → "Processing" → "Active"

**Expected result**: Pages crawled count displayed (e.g., "250 pages → 1,200 chunks")

**Common errors**:

* "Sitemap not found (404)" → Verify URL is accessible
* "Rate limit exceeded" → Wait 1 hour; the crawler resumes automatically
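Before submitting a sitemap, it can help to count its URLs locally against the 10,000-page limit from the table above. The following is a rough sketch, not Twig's crawler: it assumes the sitemap uses the standard sitemaps.org namespace and that you have already downloaded the XML as a string. `sitemap_urls` and `within_page_limit` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from typing import List

# Standard sitemap protocol namespace (sitemaps.org)
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
MAX_PAGES = 10_000  # page limit from the Supported Types table


def sitemap_urls(xml_text: str) -> List[str]:
    """Return every <loc> value found in the sitemap XML."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


def within_page_limit(xml_text: str) -> bool:
    """True if the sitemap lists no more URLs than the documented limit."""
    return len(sitemap_urls(xml_text)) <= MAX_PAGES
```

Note that a sitemap index file lists other sitemaps in its `<loc>` entries, so this count only reflects page URLs for a plain `<urlset>` sitemap.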

### File Upload

1. Select **File Upload**
2. Click **Choose Files** or drag-and-drop
3. Select files: PDF, DOCX, TXT (max 50MB each)
4. Click **Upload**

**Expected result**: Each file shows progress bar → "Processing" → "Active"

**Supported formats**:

* PDF: Text-based (not scanned images)
* DOCX: Microsoft Word 2007+
* TXT: UTF-8 encoding
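The format and size rules above can be checked locally before uploading. This is a minimal sketch under stated assumptions: "50MB" is read as 50 × 1024 × 1024 bytes, and `upload_problems` is a hypothetical helper, not part of Twig. It cannot tell a text-based PDF from a scanned one.

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}
MAX_BYTES = 50 * 1024 * 1024  # assumption: 50MB means 50 MiB


def upload_problems(path: Path) -> List[str]:
    """Return a list of reasons the file may be rejected (empty if none found)."""
    problems = []
    ext = path.suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {ext or '(none)'}")
    if path.stat().st_size > MAX_BYTES:
        problems.append("file is larger than 50MB")
    if ext == ".txt":
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            problems.append("TXT file is not valid UTF-8")
    return problems
```

For example, `upload_problems(Path("handbook.pdf"))` returns an empty list when the file has a supported extension and is within the size limit.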

### Confluence Space

1. Select **Confluence**
2. Click **Connect to Confluence**
3. Authorize in Confluence OAuth screen
4. Select spaces to index (checkboxes)
5. Click **Import**

**Expected result**: Space count and page count displayed during processing

**Permissions required**: Confluence read access for selected spaces

### Zip Upload

1. Select **Zip Upload**
2. Upload .zip file (max 200MB)
3. Twig extracts and processes each file

**Expected result**: Shows file count (e.g., "50 files extracted → 200 chunks indexed")

**Constraints**:

* Zip must contain only supported file types (PDF, DOCX, TXT)
* Nested folders supported (files flattened during extraction)
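A zip archive can be inspected against these constraints before upload. The sketch below is illustrative only: `zip_problems` is a hypothetical helper, and the duplicate-name check rests on the assumption that flattening could make two files from different folders share a name; this page does not say how Twig handles that case.

```python
import zipfile
from pathlib import PurePosixPath
from typing import List

ALLOWED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def zip_problems(source) -> List[str]:
    """Check archive members; `source` is a path or a binary file-like object."""
    problems = []
    seen_names = set()
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # folder entries carry no content
            name = PurePosixPath(info.filename).name  # name after flattening
            if PurePosixPath(name).suffix.lower() not in ALLOWED_EXTENSIONS:
                problems.append(f"unsupported file type: {info.filename}")
            if name in seen_names:
                problems.append(f"duplicate name after flattening: {name}")
            seen_names.add(name)
    return problems
```

An empty result means every file in the archive has a supported extension and a unique file name. The 200MB archive limit is not checked here.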

<figure><img src="/files/ptVappgg2JRVsvCOuE7W" alt=""><figcaption></figcaption></figure>

## How to Verify

1. Data Sources list shows status "Active" (green)
2. Chunks count > 0 (e.g., "450 chunks")
3. Last sync timestamp recent (e.g., "5 minutes ago")
4. In Playground, query the agent and check that the "Sources Used" panel shows chunks from this data source
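The first three checks can be written down as a small function for anyone tracking data sources in their own tooling. This is a sketch with invented names: `DataSourceRow` simply mirrors the columns on the Data Sources screen and is not a Twig API object, and the one-hour threshold for "recent" is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class DataSourceRow:
    """Hypothetical record mirroring the Data Sources list columns."""
    status: str
    chunks_indexed: int
    last_sync: datetime


def verification_failures(
    row: DataSourceRow,
    now: Optional[datetime] = None,
    max_age: timedelta = timedelta(hours=1),  # assumption: "recent" = within 1 hour
) -> List[str]:
    """Return the checks that fail; an empty list means all three pass."""
    now = now or datetime.now(timezone.utc)
    failures = []
    if row.status != "Active":
        failures.append(f"status is {row.status}, expected Active")
    if row.chunks_indexed <= 0:
        failures.append("no chunks indexed")
    if now - row.last_sync > max_age:
        failures.append("last sync is not recent")
    return failures
```

The fourth check, confirming that the Playground actually retrieves chunks from the source, still has to be done by hand.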

## Common Mistakes

**Symptom**: Status stuck at "Processing" for >30 minutes

**Cause**: Processing worker stalled, or a large dataset that is still within its expected processing time (up to 60 minutes for some types)

**Fix**: Refresh page. If still processing after 1 hour, contact support with data source ID.
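The "wait up to 1 hour, then escalate" rule can be expressed as a polling loop. This is a sketch only: `get_status` is a hypothetical callable that you would supply (for example, one that reads the status shown in the Data Sources list); no Twig API is assumed, and the 30-second interval is an arbitrary choice.

```python
import time
from typing import Callable


def wait_for_active(
    get_status: Callable[[], str],
    timeout_s: float = 3600,   # 1 hour, the escalation threshold above
    interval_s: float = 30,    # assumption: arbitrary polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is Active or Failed, or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status in ("Active", "Failed"):
            return status
        if waited >= timeout_s:
            return "Timed out"  # time to contact support with the data source ID
        sleep(interval_s)
        waited += interval_s
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.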

***

**Symptom**: Status "Failed" with error message

**Cause**: Invalid URL, authentication failure, or unsupported file format

**Fix**: Click data source name → Logs tab → check error message. Common fixes:

* "401 Unauthorized" → Reconnect OAuth (Edit → Reconnect)
* "Unsupported format" → Convert the file to a supported format (PDF, DOCX, or TXT)
* "URL not accessible" → Verify URL works in browser

## When This Doesn't Apply

This guide covers standard data source types. For custom integrations (APIs, databases), contact <support@twig.so>.


