# Website Crawling

Crawl public websites and index HTML content.

## Overview

| Property           | Value                            |
| ------------------ | -------------------------------- |
| **Type**           | Dynamic (scheduled crawls)       |
| **Sync Schedule**  | Hourly, Daily, Weekly, Manual    |
| **Plan**           | All plans                        |
| **Authentication** | Public sites only (no login)     |
| **Max Pages**      | 10,000 per data source           |
| **Crawl Depth**    | Configurable (default: 3 levels) |

## Use Cases

* **Public documentation**: API docs, user guides
* **Help centers**: Support articles, FAQs
* **Blog archives**: Technical posts, announcements
* **Product pages**: Features, pricing, comparisons

## How Crawling Works

**Process**:

1. Fetch start URL HTML
2. Parse HTML, extract text and links
3. Follow links within same domain (respects `robots.txt`)
4. Repeat steps 1-3 up to `maxDepth` levels
5. Chunk pages (512 tokens/chunk)
6. Embed chunks (OpenAI ada-002)
7. Index to Pinecone
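
The fetch-and-follow steps above amount to a breadth-first traversal with a depth limit. The sketch below is an illustration only, not Twig's implementation: `crawl_order` and the in-memory `links` map are hypothetical stand-ins for real HTTP fetching and HTML parsing, and it assumes the start URL counts as depth 0.

```python
from collections import deque
from urllib.parse import urlparse


def crawl_order(start_url, links, max_depth=3, max_pages=100):
    """Return URLs in the order a breadth-first crawl would visit them.

    `links` is a hypothetical map of URL -> outgoing links that replaces
    real fetching and parsing (steps 1 and 2).
    """
    start_host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in links.get(url, []):
            # Step 3: stay on the same host and skip URLs already queued.
            if urlparse(link).netloc == start_host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


site = {
    "https://docs.example.com/": [
        "https://docs.example.com/a",
        "https://api.example.com/x",  # different host, never queued
    ],
    "https://docs.example.com/a": ["https://docs.example.com/b"],
    "https://docs.example.com/b": ["https://docs.example.com/c"],
}
print(crawl_order("https://docs.example.com/", site, max_depth=2))
```

With `max_depth=2`, the page `/c` is not visited because it sits three links away from the start URL.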

**Scope rules**:

* Same domain only (e.g., a crawl that starts at docs.example.com won't follow links to api.example.com)
* HTML pages only (skips PDFs, images, videos)
* Respects `robots.txt` disallow rules
* Deduplicates by URL (ignores query params by default)
* Rate limited: 1 req/second to target site
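
To see which URLs collapse into one entry under the deduplication rule, a key that drops the query string can be computed locally. This is a minimal sketch: `dedup_key` is a hypothetical helper, and dropping the fragment and lower-casing the host are assumptions that go beyond what the rule above states.

```python
from urllib.parse import urlsplit, urlunsplit


def dedup_key(url):
    """Return the URL without its query string and fragment.

    Assumption: the crawler compares URLs roughly like this; the exact
    normalization it applies is not documented here.
    """
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, "", ""))


a = dedup_key("https://docs.example.com/guide?utm_source=mail")
b = dedup_key("https://docs.example.com/guide?page=2#top")
print(a == b)  # both reduce to https://docs.example.com/guide
```

One consequence of this rule is that pages distinguished only by a query parameter, such as `?page=2`, are treated as the same page.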

## How to Add a Website

### Step 1: Navigate to Data Sources

1. Log in to your Twig AI account
2. Click **Data** in the main navigation menu
3. Click **Add Data Source** or the **+** button

### Step 2: Select Website Connector

1. Choose **Website** from the list of connectors
2. The connector shows: "Reads a publicly accessible websites"

### Step 3: Configure the Data Source

#### Basic Information

* **Name** (required): Descriptive name for the website
  * Example: "Product Documentation", "Support Knowledge Base", "Company Blog"
* **Description** (optional): Additional context
  * Example: "Official product documentation site with API reference and user guides"

#### URL Configuration

* **URL** (required): The starting URL to crawl
  * Must be a valid URL starting with `http://` or `https://`
  * Example: `https://docs.example.com`
  * Example: `https://help.example.com/en/`

**URL Selection Tips:**

* Start at the most relevant section (e.g., `/docs/` instead of homepage)
* Use URLs with clear structure and navigation
* Avoid URLs with query parameters if possible
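
A start URL can be checked against these rules before it is pasted into the form. The helper below is a hypothetical local pre-check, not part of the product; it only covers the scheme requirement and the query-parameter tip.

```python
from urllib.parse import urlsplit


def check_start_url(url):
    """Return a list of warnings for a candidate start URL (empty if none)."""
    parts = urlsplit(url)
    warnings = []
    if parts.scheme not in ("http", "https") or not parts.netloc:
        warnings.append("must be a valid URL starting with http:// or https://")
    if parts.query:
        warnings.append("contains query parameters; prefer a clean path")
    return warnings


for candidate in (
    "https://help.example.com/en/",
    "docs.example.com",
    "https://example.com/search?q=setup",
):
    print(candidate, check_start_url(candidate))
```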

#### Advanced Parameters (JSON)

You can configure advanced crawling options using JSON in the Parameters field:

```json
{
  "maxDepth": 3,
  "maxPages": 100,
  "includePatterns": ["/docs/", "/help/"],
  "excludePatterns": ["/blog/", "/news/"],
  "followExternalLinks": false
}
```

**Available Parameters:**

| Parameter             | Type    | Description                       | Default  |
| --------------------- | ------- | --------------------------------- | -------- |
| `maxDepth`            | Number  | Maximum link depth from start URL | 3        |
| `maxPages`            | Number  | Maximum pages to crawl            | 100      |
| `includePatterns`     | Array   | URL patterns to include           | All      |
| `excludePatterns`     | Array   | URL patterns to exclude           | None     |
| `followExternalLinks` | Boolean | Crawl external domains            | false    |
| `respectRobotsTxt`    | Boolean | Follow robots.txt rules           | true     |
| `userAgent`           | String  | Custom user agent string          | Twig Bot |
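
Before saving, the JSON can be validated locally against the table above. The sketch below is hypothetical: `load_parameters` is not a Twig API, the empty lists standing for "All" and "None" are an assumption, and whether the product itself rejects unknown keys is not stated here.

```python
import json

# Defaults copied from the parameter table; [] for "All"/"None" is an assumption.
DEFAULTS = {
    "maxDepth": 3,
    "maxPages": 100,
    "includePatterns": [],
    "excludePatterns": [],
    "followExternalLinks": False,
    "respectRobotsTxt": True,
    "userAgent": "Twig Bot",
}


def load_parameters(raw_json):
    """Parse the Parameters field, reject typos, and fill in defaults."""
    params = json.loads(raw_json) if raw_json.strip() else {}
    if not isinstance(params, dict):
        raise ValueError("parameters must be a JSON object")
    unknown = set(params) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    merged = {**DEFAULTS, **params}
    for key, default in DEFAULTS.items():
        # Exact type match, so true/false is not accepted where a number is expected.
        if type(merged[key]) is not type(default):
            raise TypeError(f"{key} should be of type {type(default).__name__}")
    return merged


print(load_parameters('{"maxDepth": 5, "includePatterns": ["/docs/"]}'))
```

A check like this catches misspelled keys such as `maxdepth`, which would otherwise be easy to overlook in a free-form JSON field.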

#### Refresh Frequency

Choose how often to re-crawl the website:

* **Never** - Manual refresh only (static)
* **Daily** - Refresh every day
* **Weekly** - Refresh every week
* **Monthly** - Refresh every month

**Recommendation:**

* Daily: For frequently updated sites (news, blogs)
* Weekly: For moderately updated sites (documentation)
* Monthly: For stable content (marketing pages)

#### Tags (Optional)

Add tags for organization:

* Examples: "documentation", "external", "support", "public"

### Step 4: Save and Crawl

1. Click **Save** or **Create**
2. Initial crawl begins automatically
3. Monitor status in the data sources list

### Step 5: Verify Crawl

1. Check record count (number of pages crawled)
2. Verify the status shows `END_PROCESS`
3. Review process logs for any errors
4. Test knowledge with relevant questions

## Examples

### Example 1: Documentation Site

```
Name: Product Documentation
Description: Official API and user guide documentation
URL: https://docs.example.com
Parameters:
{
  "maxDepth": 5,
  "maxPages": 500,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/blog/", "/changelog/"]
}
Refresh: Weekly
Tags: documentation, public, api
```

### Example 2: Help Center

```
Name: Customer Support Articles
Description: Complete help center with FAQs and troubleshooting guides
URL: https://help.example.com/en/
Parameters:
{
  "maxDepth": 3,
  "maxPages": 200,
  "includePatterns": ["/en/articles/"],
  "excludePatterns": ["/community/"]
}
Refresh: Daily
Tags: support, help-center, customer-facing
```

### Example 3: Company Blog

```
Name: Company Blog
Description: Technical blog posts and product announcements
URL: https://blog.example.com
Parameters:
{
  "maxDepth": 2,
  "maxPages": 100,
  "includePatterns": ["/technical/", "/products/"],
  "excludePatterns": ["/authors/", "/tags/"]
}
Refresh: Weekly
Tags: blog, marketing, technical
```

## Best Practices

### 1. Choose the Right Starting URL

**Good Starting Points:**

* `/docs/` - Documentation root
* `/help/en/` - Help center in specific language
* `/api/reference/` - API documentation section
* `/kb/` - Knowledge base root

**Avoid Starting From:**

* Homepage (too broad, many irrelevant links)
* Login pages (can't be crawled)
* Dynamic search results
* Paginated archives without limit

### 2. Use Include/Exclude Patterns

**Include Patterns** - Only crawl these sections:

```json
{
  "includePatterns": [
    "/docs/",
    "/api-reference/",
    "/getting-started/"
  ]
}
```

**Exclude Patterns** - Skip these sections:

```json
{
  "excludePatterns": [
    "/changelog/",
    "/blog/",
    "/about/",
    "/careers/",
    "/login",
    "/signup"
  ]
}
```
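
To preview which URLs a pattern set would keep, the filter can be approximated locally. This page does not specify the matching semantics, so the sketch assumes plain substring matching against the full URL and assumes that exclude patterns take precedence; `is_in_scope` is a hypothetical helper.

```python
def is_in_scope(url, include_patterns=(), exclude_patterns=()):
    """Approximate the include/exclude filter with substring matching.

    Assumptions: patterns are substrings of the URL, an empty include
    list means "include everything", and an exclude match always wins.
    """
    if any(pattern in url for pattern in exclude_patterns):
        return False
    if not include_patterns:
        return True
    return any(pattern in url for pattern in include_patterns)


include = ["/docs/", "/api-reference/"]
exclude = ["/changelog/", "/login"]
for url in (
    "https://example.com/docs/setup",
    "https://example.com/docs/changelog/v2",
    "https://example.com/careers/",
):
    print(url, is_in_scope(url, include, exclude))
```

If a test crawl keeps or drops pages differently from this preview, the product's matching rules differ from the assumptions made here.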

### 3. Optimize Crawl Limits

**For Large Sites:**

```json
{
  "maxDepth": 4,
  "maxPages": 500
}
```

**For Small Sites:**

```json
{
  "maxDepth": 10,
  "maxPages": 100
}
```

**Balance:** A higher `maxDepth` gives more comprehensive coverage but a longer crawl.
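
The rate limit listed under the scope rules (1 request per second to the target site) gives a rough lower bound on crawl duration for a given `maxPages`. The estimate below is a back-of-the-envelope sketch that ignores page load time, chunking, and embedding, so real crawls will take longer; `min_crawl_seconds` is a hypothetical helper.

```python
def min_crawl_seconds(max_pages, requests_per_second=1.0):
    """Lower bound on crawl time if every page costs exactly one request."""
    return max_pages / requests_per_second


for pages in (100, 500, 10_000):
    minutes = min_crawl_seconds(pages) / 60
    print(f"{pages} pages: at least {minutes:.1f} minutes")
```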

### 4. Set Appropriate Refresh Schedules

* **Daily:** News sites, rapidly changing content
* **Weekly:** Documentation with regular updates
* **Monthly:** Stable marketing content
* **Manual (Never):** One-time imports, archived content

### 5. Monitor Crawl Health

Regularly check:

* Number of pages crawled (is it increasing/decreasing?)
* Last successful crawl date
* Error logs for failed pages
* Response time and crawl duration

## Advanced Configuration

### Handling Authentication

The Website connector only supports **public** websites. For authenticated content, use alternative connectors:

* **Confluence** - For Atlassian Confluence
* **SharePoint** - For Microsoft SharePoint
* **Google Drive** - For Google Docs

### Crawling Subdomains

To crawl multiple subdomains:

**Option 1:** Create separate data sources for each subdomain

```
Source 1: https://docs.example.com
Source 2: https://api.example.com
Source 3: https://help.example.com
```

**Option 2:** Enable external link following (use cautiously)

```json
{
  "followExternalLinks": true,
  "includePatterns": [
    "docs.example.com",
    "api.example.com"
  ]
}
```

### Handling Dynamic Content

**JavaScript-Rendered Content:** The crawler can handle basic JavaScript rendering but may miss:

* Complex single-page applications (SPAs)
* Content loaded after user interaction
* Infinite scroll content

**Solutions:**

* Check whether the site has a static HTML fallback
* Look for a `sitemap.xml` and use the [Sitemap connector](/product/data-integrations/sitemap.md)
* Ask the site owner about a crawler-friendly version

### Rate Limiting

The crawler automatically:

* Waits between requests (polite crawling)
* Respects server `Retry-After` headers
* Backs off on errors
* Distributes load over time
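
The behaviors listed above can be pictured as a delay calculation between requests. The sketch below is an assumption about how such a policy might look, not the crawler's actual algorithm: the `next_delay` helper, the doubling backoff, and the 60-second cap are all hypothetical, and only the numeric (seconds) form of `Retry-After` is handled.

```python
def next_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before the next request to the same site.

    attempt     -- number of consecutive failures so far (0 = no failure)
    retry_after -- raw Retry-After header value, if the server sent one
    """
    if retry_after is not None:
        try:
            # Honor the server's request, but never go below the base delay.
            return max(base, float(retry_after))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch
    # Exponential backoff on errors, capped so waits stay bounded.
    return min(cap, base * 2 ** attempt)


print(next_delay(0))        # normal pacing
print(next_delay(3))        # after three consecutive errors
print(next_delay(1, "30"))  # server asked for a 30 second pause
```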

## Troubleshooting

### Few Pages Crawled

**Problem:** Only 1-2 pages crawled from a large site

**Solutions:**

* Check `maxDepth` is sufficient
* Verify `maxPages` limit isn't too low
* Review `includePatterns` aren't too restrictive
* Ensure site has proper internal linking
* Check robots.txt isn't blocking crawler
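
To check the last point, a site's `robots.txt` can be tested locally with Python's standard `urllib.robotparser`. The rules and URLs below are made up for illustration, and the user agent is the documented default `Twig Bot`; how the crawler itself matches its user agent against `robots.txt` groups is not specified here.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules; paste the real file from https://<site>/robots.txt instead.
robots = """User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(robots, "Twig Bot", "https://docs.example.com/docs/intro"))
print(allowed_by_robots(robots, "Twig Bot", "https://docs.example.com/private/notes"))
```

If the start URL itself is disallowed, the crawl has no page from which to discover links, which is one way to end up with a very small page count.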

### Missing Content

**Problem:** Important pages not included

**Solutions:**

* Verify pages are linked from start URL
* Check pages aren't excluded by patterns
* Ensure pages are within maxDepth limit
* Look for pages in robots.txt disallow list
* Check if pages require authentication

### Crawl Timeout

**Problem:** Crawl stops before completing

**Solutions:**

* Reduce `maxPages` limit
* Decrease `maxDepth`
* Add more specific `includePatterns`
* Check if site is responding slowly
* Try crawling specific sections separately

### Duplicate Content

**Problem:** Same content appears multiple times

**Solutions:**

* The crawler deduplicates by URL automatically, so remaining duplicates usually come from URL variations
* Check for URL variations (with/without trailing slash)
* Review query parameters in URLs
* Add exclude patterns for redundant paths
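
Given a list of indexed URLs (for example, copied from the process logs), variations that differ only by a trailing slash or by query parameters can be grouped for review. This is a hypothetical diagnostic, not a product feature, and the grouping rule is an assumption about what counts as "the same page".

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_url_variants(urls):
    """Group URLs that share a host and path once the trailing slash
    and query string are ignored; return only groups with duplicates."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        path = parts.path.rstrip("/") or "/"
        groups[(parts.netloc.lower(), path)].append(url)
    return [group for group in groups.values() if len(group) > 1]


crawled = [
    "https://docs.example.com/guide",
    "https://docs.example.com/guide/",
    "https://docs.example.com/guide?ref=nav",
    "https://docs.example.com/api",
]
for group in find_url_variants(crawled):
    print(group)
```

Each reported group is a candidate for an exclude pattern or for a cleaner start URL.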

### Refresh Not Working

**Problem:** Scheduled refresh isn't updating content

**Solutions:**

* Verify refresh frequency is not set to "NEVER"
* Check last processed date
* Review process logs for errors
* Ensure website is accessible
* Check for crawler blocking (robots.txt, firewall)

## Performance Tips

### 1. Start Specific, Expand Later

Begin with a narrow scope:

```json
{
  "includePatterns": ["/docs/getting-started/"],
  "maxPages": 50
}
```

Then expand as needed:

```json
{
  "includePatterns": ["/docs/"],
  "maxPages": 200
}
```

### 2. Use Multiple Focused Sources

Instead of one broad crawl:

```
❌ Start URL: https://example.com (crawls entire site)
```

Create targeted sources:

```
✅ Source 1: https://docs.example.com/api/ (API docs only)
✅ Source 2: https://docs.example.com/guides/ (User guides only)
✅ Source 3: https://help.example.com/ (Support articles only)
```

### 3. Exclude Non-Essential Content

Common exclusions:

```json
{
  "excludePatterns": [
    "/search",
    "/tags/",
    "/categories/",
    "/authors/",
    "/archive/",
    "/print/",
    "/download/",
    "/comments"
  ]
}
```

## Monitoring & Maintenance

### Regular Checks

**Weekly:**

* Review page count trends
* Check for crawl errors
* Verify refresh is working

**Monthly:**

* Audit included/excluded pages
* Optimize crawl parameters
* Remove outdated sources

**Quarterly:**

* Review AI answer quality from website data
* Update include/exclude patterns
* Adjust refresh frequency based on site update patterns

### Metrics to Track

* **Pages Crawled:** Total number of indexed pages
* **Last Sync:** Time of the last successful crawl
* **Error Rate:** Percentage of failed page fetches
* **Crawl Duration:** Time taken for full crawl
* **Usage:** How often AI references this content

## Next Steps

After setting up website crawling:

1. [Test your AI agent](/getting-started/ask-a-question.md) with website content questions
2. [Create specialized agents](/product/overview/add-an-ai-agent-persona.md) for different site sections
3. [Monitor analytics](/product/monitoring/view-analytics.md) to see which pages are most useful
4. Optimize crawl configuration based on usage patterns

## Related Connectors

* [Sitemap](/product/data-integrations/sitemap.md) - Import from sitemap.xml files
* [Confluence](/product/data-integrations/confluence.md) - For Confluence-hosted documentation
* [Files](/product/data-integrations/files.md) - Upload exported HTML or PDF documentation
* [Google Drive](/product/data-integrations/google-drive.md) - For Google Docs-based documentation

