Website Crawling

Crawl and index publicly accessible websites to build a knowledge base from online documentation, blogs, help centers, and other web content.

Overview

Property            Details
Type                Dynamic
Refresh             Scheduled (Daily, Weekly, Monthly)
Tier                1 (All Plans)
Crawl Depth         Configurable
Authentication      Public sites only

When to Use Website Connector

The Website connector is ideal for:

  • Public Documentation - Developer docs, user guides, API references

  • Help Centers - Support articles and knowledge bases

  • Blog Content - Company blogs, technical articles

  • Marketing Pages - Product information, feature pages

  • Resource Centers - Tutorials, guides, whitepapers

  • News Sites - Company news, press releases

How It Works

Crawling Process

  1. Start URL: You provide a starting URL

  2. Discovery: Crawler finds all links on the page

  3. Follow Links: Crawler visits each discovered link (within domain)

  4. Extract Content: Text content is extracted from each page

  5. Index: Content is processed and indexed for AI search

Crawl Scope

By default, the crawler:

  • Stays within the same domain as the start URL

  • Follows all internal links

  • Respects robots.txt rules

  • Skips non-HTML content (images, PDFs, etc.)

  • Avoids duplicate pages

How to Add a Website

Step 1: Navigate to Data Sources

  1. Log in to your Twig AI account

  2. Click Data in the main navigation menu

  3. Click Add Data Source or the + button

Step 2: Select Website Connector

  1. Choose Website from the list of connectors

  2. The connector shows: "Reads publicly accessible websites"

Step 3: Configure the Data Source

Basic Information

  • Name (required): Descriptive name for the website

    • Example: "Product Documentation", "Support Knowledge Base", "Company Blog"

  • Description (optional): Additional context

    • Example: "Official product documentation site with API reference and user guides"

URL Configuration

  • URL (required): The starting URL to crawl

    • Must be a valid URL starting with http:// or https://

    • Example: https://docs.example.com

    • Example: https://help.example.com/en/

URL Selection Tips:

  • Start at the most relevant section (e.g., /docs/ instead of the homepage)

  • Use URLs with clear structure and navigation

  • Avoid URLs with query parameters if possible

Advanced Parameters (JSON)

You can configure advanced crawling options using JSON in the Parameters field:

Available Parameters:

Parameter             Type      Description                          Default
maxDepth              Number    Maximum link depth from start URL    3
maxPages              Number    Maximum pages to crawl               100
includePatterns       Array     URL patterns to include              All
excludePatterns       Array     URL patterns to exclude              None
followExternalLinks   Boolean   Crawl external domains               false
respectRobotsTxt      Boolean   Follow robots.txt rules              true
userAgent             String    Custom user agent string             Twig Bot
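
For instance, a complete Parameters value combining several of these options might look like the following (the values are illustrative; adjust them to your site):

  {
    "maxDepth": 3,
    "maxPages": 200,
    "includePatterns": ["/docs/"],
    "excludePatterns": ["/docs/archive/"],
    "followExternalLinks": false,
    "respectRobotsTxt": true,
    "userAgent": "Twig Bot"
  }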

Refresh Frequency

Choose how often to re-crawl the website:

  • Never - Manual refresh only (static)

  • Daily - Refresh every day

  • Weekly - Refresh every week

  • Monthly - Refresh every month

Recommendation:

  • Daily: For frequently updated sites (news, blogs)

  • Weekly: For moderately updated sites (documentation)

  • Monthly: For stable content (marketing pages)

Tags (Optional)

Add tags for organization:

  • Examples: "documentation", "external", "support", "public"

Step 4: Save and Crawl

  1. Click Save or Create

  2. Initial crawl begins automatically

  3. Monitor status in the data sources list

Step 5: Verify Crawl

  1. Check record count (number of pages crawled)

  2. Verify status shows "END_PROCESS"

  3. Review process logs for any errors

  4. Test knowledge with relevant questions

Examples

Example 1: Documentation Site
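
A possible configuration for a public documentation site (the URL and parameter values are illustrative):

  Name: Product Documentation
  URL: https://docs.example.com
  Refresh: Weekly
  Parameters:
  {
    "maxDepth": 4,
    "maxPages": 500,
    "excludePatterns": ["/changelog/", "/archive/"]
  }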

Example 2: Help Center
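
A possible help center configuration, assuming an English-language help center served under /en/ (illustrative values):

  Name: Support Knowledge Base
  URL: https://help.example.com/en/
  Refresh: Weekly
  Parameters:
  {
    "maxDepth": 3,
    "includePatterns": ["/en/"]
  }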

Example 3: Company Blog
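
A possible blog configuration; blog.example.com is a hypothetical URL and the exclude patterns are illustrative:

  Name: Company Blog
  URL: https://blog.example.com
  Refresh: Daily
  Parameters:
  {
    "maxDepth": 2,
    "maxPages": 200,
    "excludePatterns": ["/tag/", "/author/"]
  }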

Best Practices

1. Choose the Right Starting URL

Good Starting Points:

  • /docs/ - Documentation root

  • /help/en/ - Help center in specific language

  • /api/reference/ - API documentation section

  • /kb/ - Knowledge base root

Avoid Starting From:

  • Homepage (too broad, many irrelevant links)

  • Login pages (can't be crawled)

  • Dynamic search results

  • Paginated archives without limit

2. Use Include/Exclude Patterns

Include Patterns - Only crawl these sections:
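
For example, to restrict the crawl to documentation and API reference paths (illustrative patterns):

  {
    "includePatterns": ["/docs/", "/api/reference/"]
  }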

Exclude Patterns - Skip these sections:
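
For example, to skip paths that rarely add useful knowledge (illustrative patterns):

  {
    "excludePatterns": ["/blog/tag/", "/search/", "/login/", "/print/"]
  }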

3. Optimize Crawl Limits

For Large Sites:
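
A reasonable starting point for a large site is moderate depth, a higher page cap, and a narrowing include pattern (illustrative values):

  {
    "maxDepth": 3,
    "maxPages": 500,
    "includePatterns": ["/docs/"]
  }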

For Small Sites:
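
For a small site, a deeper crawl with the default page cap is usually enough (illustrative values):

  {
    "maxDepth": 5,
    "maxPages": 100
  }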

Balance: Higher depth means more comprehensive coverage but a longer crawl time

4. Set Appropriate Refresh Schedules

  • Daily: News sites, rapidly changing content

  • Weekly: Documentation with regular updates

  • Monthly: Stable marketing content

  • Manual (Never): One-time imports, archived content

5. Monitor Crawl Health

Regularly check:

  • Number of pages crawled (is it increasing/decreasing?)

  • Last successful crawl date

  • Error logs for failed pages

  • Response time and crawl duration

Advanced Configuration

Handling Authentication

The Website connector only supports public websites. For authenticated content, use alternative connectors:

  • Confluence - For Atlassian Confluence

  • SharePoint - For Microsoft SharePoint

  • Google Drive - For Google Docs

Crawling Subdomains

To crawl multiple subdomains:

Option 1: Create separate data sources for each subdomain

Option 2: Enable external link following (use cautiously)
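
A sketch of Option 2, assuming include patterns are matched against the full URL so the crawl stays within your own subdomains; verify the result on a small crawl first:

  {
    "followExternalLinks": true,
    "includePatterns": ["example.com"]
  }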

Handling Dynamic Content

JavaScript-Rendered Content: The crawler can handle basic JavaScript rendering but may miss:

  • Complex single-page applications (SPAs)

  • Content loaded after user interaction

  • Infinite scroll content

Solutions:

  • Check if site has a static HTML fallback

  • Look for sitemap.xml (use Sitemap connector)

  • Contact site owner about crawler-friendly version

Rate Limiting

The crawler automatically:

  • Waits between requests (polite crawling)

  • Respects server Retry-After headers

  • Backs off on errors

  • Distributes load over time

Troubleshooting

Few Pages Crawled

Problem: Only 1-2 pages crawled from a large site

Solutions:

  • Check maxDepth is sufficient

  • Verify maxPages limit isn't too low (see the sketch after this list)

  • Review includePatterns aren't too restrictive

  • Ensure site has proper internal linking

  • Check robots.txt isn't blocking crawler
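
If maxDepth or maxPages is the limiting factor, raising them is a quick test (illustrative values):

  {
    "maxDepth": 5,
    "maxPages": 500
  }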

Missing Content

Problem: Important pages not included

Solutions:

  • Verify pages are linked from start URL

  • Check pages aren't excluded by patterns

  • Ensure pages are within maxDepth limit

  • Look for pages in robots.txt disallow list

  • Check if pages require authentication

Crawl Timeout

Problem: Crawl stops before completing

Solutions:

  • Reduce maxPages limit

  • Decrease maxDepth

  • Add more specific includePatterns

  • Check if site is responding slowly

  • Try crawling specific sections separately

Duplicate Content

Problem: Same content appears multiple times

Solutions:

  • The crawler should handle duplicates automatically

  • Check for URL variations (with/without trailing slash)

  • Review query parameters in URLs

  • Add exclude patterns for redundant paths
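
If redundant paths such as printable or AMP variants cause the duplicates, exclude patterns like these may help (illustrative):

  {
    "excludePatterns": ["/print/", "/amp/"]
  }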

Refresh Not Working

Problem: Scheduled refresh isn't updating content

Solutions:

  • Verify refresh frequency is not set to "Never"

  • Check last processed date

  • Review process logs for errors

  • Ensure website is accessible

  • Check for crawler blocking (robots.txt, firewall)

Performance Tips

1. Start Specific, Expand Later

Begin with a narrow scope:
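
For instance, start from a single section with tight limits (illustrative values):

  URL: https://docs.example.com/api/reference/
  Parameters:
  {
    "maxDepth": 2,
    "maxPages": 50
  }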

Then expand as needed:
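
Once results look good, widen the start URL and raise the limits (illustrative values):

  URL: https://docs.example.com
  Parameters:
  {
    "maxDepth": 4,
    "maxPages": 500
  }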

2. Use Multiple Focused Sources

Instead of one broad crawl:
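
For example, a single source that starts at the homepage and tries to cover everything (www.example.com is a hypothetical URL):

  Name: Entire Website
  URL: https://www.example.com
  Parameters:
  {
    "maxDepth": 5,
    "maxPages": 1000
  }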

Create targeted sources:
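
For example, split the same content into focused sources (illustrative names and URLs):

  Name: Product Documentation
  URL: https://docs.example.com

  Name: Support Knowledge Base
  URL: https://help.example.com/en/

  Name: Company Blog
  URL: https://blog.example.com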

3. Exclude Non-Essential Content

Common exclusions:
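
Patterns like these are typical candidates (illustrative; adjust them to your site's URL structure):

  {
    "excludePatterns": ["/login", "/signup", "/search", "/tag/", "/author/", "/print/"]
  }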

Monitoring & Maintenance

Regular Checks

Weekly:

  • Review page count trends

  • Check for crawl errors

  • Verify refresh is working

Monthly:

  • Audit included/excluded pages

  • Optimize crawl parameters

  • Remove outdated sources

Quarterly:

  • Review AI answer quality from website data

  • Update include/exclude patterns

  • Adjust refresh frequency based on site update patterns

Metrics to Track

  • Pages Crawled: Total number of indexed pages

  • Last Sync: When the last successful crawl completed

  • Error Rate: Percentage of failed page fetches

  • Crawl Duration: Time taken for full crawl

  • Usage: How often AI references this content

Next Steps

After setting up website crawling:

  1. Test your AI agent with website content questions

  2. Create specialized agents for different site sections

  3. Monitor analytics to see which pages are most useful

  4. Optimize crawl configuration based on usage patterns

Related connectors:

  • Sitemap - Import from sitemap.xml files

  • Confluence - For Confluence-hosted documentation

  • Files - Upload exported HTML or PDF documentation

  • Google Drive - For Google Docs-based documentation
