Website Crawling

Crawl and index publicly accessible websites to build a knowledge base from online documentation, blogs, help centers, and other web content.

Overview

Property            Details
Type                Dynamic
Refresh             Scheduled (Daily, Weekly, Monthly)
Tier                1 (All Plans)
Crawl Depth         Configurable
Authentication      Public sites only

When to Use Website Connector

The Website connector is ideal for:

  • Public Documentation - Developer docs, user guides, API references

  • Help Centers - Support articles and knowledge bases

  • Blog Content - Company blogs, technical articles

  • Marketing Pages - Product information, feature pages

  • Resource Centers - Tutorials, guides, whitepapers

  • News Sites - Company news, press releases

How It Works

Crawling Process

  1. Start URL: You provide a starting URL

  2. Discovery: Crawler finds all links on the page

  3. Follow Links: Crawler visits each discovered link (within domain)

  4. Extract Content: Text content is extracted from each page

  5. Index: Content is processed and indexed for AI search

Crawl Scope

By default, the crawler:

  • Stays within the same domain as the start URL

  • Follows all internal links

  • Respects robots.txt rules

  • Skips non-HTML content (images, PDFs, etc.)

  • Avoids duplicate pages

How to Add a Website

Step 1: Navigate to Data Sources

  1. Log in to your Twig AI account

  2. Click Data in the main navigation menu

  3. Click Add Data Source or the + button

Step 2: Select Website Connector

  1. Choose Website from the list of connectors

  2. The connector shows: "Reads publicly accessible websites"

Step 3: Configure the Data Source

Basic Information

  • Name (required): Descriptive name for the website

    • Example: "Product Documentation", "Support Knowledge Base", "Company Blog"

  • Description (optional): Additional context

    • Example: "Official product documentation site with API reference and user guides"

URL Configuration

  • URL (required): The starting URL to crawl

    • Must be a valid URL starting with http:// or https://

    • Example: https://docs.example.com

    • Example: https://help.example.com/en/

URL Selection Tips:

  • Start at the most relevant section (e.g., /docs/ instead of the homepage)

  • Use URLs with clear structure and navigation

  • Avoid URLs with query parameters if possible

Advanced Parameters (JSON)

You can configure advanced crawling options using JSON in the Parameters field:

Available Parameters:

Parameter             Type      Description                          Default
maxDepth              Number    Maximum link depth from start URL    3
maxPages              Number    Maximum pages to crawl               100
includePatterns       Array     URL patterns to include              All
excludePatterns       Array     URL patterns to exclude              None
followExternalLinks   Boolean   Crawl external domains               false
respectRobotsTxt      Boolean   Follow robots.txt rules              true
userAgent             String    Custom user agent string             Twig Bot
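
For instance, a complete Parameters value combining several of these options might look like the following (the values are illustrative; adjust them to your site):

  {
    "maxDepth": 3,
    "maxPages": 200,
    "includePatterns": ["/docs/"],
    "excludePatterns": ["/docs/archive/"],
    "followExternalLinks": false,
    "respectRobotsTxt": true,
    "userAgent": "Twig Bot"
  }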

Refresh Frequency

Choose how often to re-crawl the website:

  • Never - Manual refresh only (static)

  • Daily - Refresh every day

  • Weekly - Refresh every week

  • Monthly - Refresh every month

Recommendation:

  • Daily: For frequently updated sites (news, blogs)

  • Weekly: For moderately updated sites (documentation)

  • Monthly: For stable content (marketing pages)

Tags (Optional)

Add tags for organization:

  • Examples: "documentation", "external", "support", "public"

Step 4: Save and Crawl

  1. Click Save or Create

  2. Initial crawl begins automatically

  3. Monitor status in the data sources list

Step 5: Verify Crawl

  1. Check record count (number of pages crawled)

  2. Verify status shows "END_PROCESS"

  3. Review process logs for any errors

  4. Test knowledge with relevant questions

Examples

Example 1: Documentation Site
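
A possible configuration for a public documentation site (the URL and parameter values are illustrative):

  Name: Product Documentation
  URL: https://docs.example.com
  Refresh: Weekly
  Parameters:
  {
    "maxDepth": 4,
    "maxPages": 500,
    "excludePatterns": ["/changelog/", "/archive/"]
  }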

Example 2: Help Center
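
A possible help center configuration, assuming an English-language help center served under /en/ (illustrative values):

  Name: Support Knowledge Base
  URL: https://help.example.com/en/
  Refresh: Weekly
  Parameters:
  {
    "maxDepth": 3,
    "includePatterns": ["/en/"]
  }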

Example 3: Company Blog
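
A possible blog configuration; blog.example.com is a hypothetical URL and the exclude patterns are illustrative:

  Name: Company Blog
  URL: https://blog.example.com
  Refresh: Daily
  Parameters:
  {
    "maxDepth": 2,
    "maxPages": 200,
    "excludePatterns": ["/tag/", "/author/"]
  }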

Best Practices

1. Choose the Right Starting URL

Good Starting Points:

  • /docs/ - Documentation root

  • /help/en/ - Help center in specific language

  • /api/reference/ - API documentation section

  • /kb/ - Knowledge base root

Avoid Starting From:

  • Homepage (too broad, many irrelevant links)

  • Login pages (can't be crawled)

  • Dynamic search results

  • Paginated archives without limit

2. Use Include/Exclude Patterns

Include Patterns - Only crawl these sections:
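
For example, to restrict the crawl to documentation and API reference paths (illustrative patterns):

  {
    "includePatterns": ["/docs/", "/api/reference/"]
  }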

Exclude Patterns - Skip these sections:
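
For example, to skip paths that rarely add useful knowledge (illustrative patterns):

  {
    "excludePatterns": ["/blog/tag/", "/search/", "/login/", "/print/"]
  }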

3. Optimize Crawl Limits

For Large Sites:
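
A reasonable starting point for a large site is moderate depth, a higher page cap, and a narrowing include pattern (illustrative values):

  {
    "maxDepth": 3,
    "maxPages": 500,
    "includePatterns": ["/docs/"]
  }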

For Small Sites:
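
For a small site, a deeper crawl with the default page cap is usually enough (illustrative values):

  {
    "maxDepth": 5,
    "maxPages": 100
  }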

Balance: Higher depth means more comprehensive coverage but a longer crawl time

4. Set Appropriate Refresh Schedules

  • Daily: News sites, rapidly changing content

  • Weekly: Documentation with regular updates

  • Monthly: Stable marketing content

  • Manual (Never): One-time imports, archived content

5. Monitor Crawl Health

Regularly check:

  • Number of pages crawled (is it increasing/decreasing?)

  • Last successful crawl date

  • Error logs for failed pages

  • Response time and crawl duration

Advanced Configuration

Handling Authentication

The Website connector only supports public websites. For authenticated content, use alternative connectors:

  • Confluence - For Atlassian Confluence

  • SharePoint - For Microsoft SharePoint

  • Google Drive - For Google Docs

Crawling Subdomains

To crawl multiple subdomains:

Option 1: Create separate data sources for each subdomain

Option 2: Enable external link following (use cautiously)
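
A sketch of Option 2, assuming include patterns are matched against the full URL so the crawl stays within your own subdomains; verify the result on a small crawl first:

  {
    "followExternalLinks": true,
    "includePatterns": ["example.com"]
  }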

Handling Dynamic Content

JavaScript-Rendered Content: The crawler can handle basic JavaScript rendering but may miss:

  • Complex single-page applications (SPAs)

  • Content loaded after user interaction

  • Infinite scroll content

Solutions:

  • Check if site has a static HTML fallback

  • Look for sitemap.xml (use Sitemap connector)

  • Contact site owner about crawler-friendly version

Rate Limiting

The crawler automatically:

  • Waits between requests (polite crawling)

  • Respects server Retry-After headers

  • Backs off on errors

  • Distributes load over time

Troubleshooting

Few Pages Crawled

Problem: Only 1-2 pages crawled from a large site

Solutions:

  • Check maxDepth is sufficient

  • Verify maxPages limit isn't too low (see the sketch after this list)

  • Review includePatterns aren't too restrictive

  • Ensure site has proper internal linking

  • Check robots.txt isn't blocking crawler
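
If maxDepth or maxPages is the limiting factor, raising them is a quick test (illustrative values):

  {
    "maxDepth": 5,
    "maxPages": 500
  }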

Missing Content

Problem: Important pages not included

Solutions:

  • Verify pages are linked from start URL

  • Check pages aren't excluded by patterns

  • Ensure pages are within maxDepth limit

  • Look for pages in robots.txt disallow list

  • Check if pages require authentication

Crawl Timeout

Problem: Crawl stops before completing

Solutions:

  • Reduce maxPages limit

  • Decrease maxDepth

  • Add more specific includePatterns

  • Check if site is responding slowly

  • Try crawling specific sections separately

Duplicate Content

Problem: Same content appears multiple times

Solutions:

  • The crawler should handle duplicates automatically

  • Check for URL variations (with/without trailing slash)

  • Review query parameters in URLs

  • Add exclude patterns for redundant paths
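
If redundant paths such as printable or AMP variants cause the duplicates, exclude patterns like these may help (illustrative):

  {
    "excludePatterns": ["/print/", "/amp/"]
  }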

Refresh Not Working

Problem: Scheduled refresh isn't updating content

Solutions:

  • Verify refresh frequency is not set to "Never"

  • Check last processed date

  • Review process logs for errors

  • Ensure website is accessible

  • Check for crawler blocking (robots.txt, firewall)

Performance Tips

1. Start Specific, Expand Later

Begin with a narrow scope:
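
For instance, start from a single section with tight limits (illustrative values):

  URL: https://docs.example.com/api/reference/
  Parameters:
  {
    "maxDepth": 2,
    "maxPages": 50
  }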

Then expand as needed:
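
Once results look good, widen the start URL and raise the limits (illustrative values):

  URL: https://docs.example.com
  Parameters:
  {
    "maxDepth": 4,
    "maxPages": 500
  }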

2. Use Multiple Focused Sources

Instead of one broad crawl:
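
For example, a single source that starts at the homepage and tries to cover everything (www.example.com is a hypothetical URL):

  Name: Entire Website
  URL: https://www.example.com
  Parameters:
  {
    "maxDepth": 5,
    "maxPages": 1000
  }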

Create targeted sources:
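
For example, split the same content into focused sources (illustrative names and URLs):

  Name: Product Documentation
  URL: https://docs.example.com

  Name: Support Knowledge Base
  URL: https://help.example.com/en/

  Name: Company Blog
  URL: https://blog.example.com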

3. Exclude Non-Essential Content

Common exclusions:
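
Patterns like these are typical candidates (illustrative; adjust them to your site's URL structure):

  {
    "excludePatterns": ["/login", "/signup", "/search", "/tag/", "/author/", "/print/"]
  }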

Monitoring & Maintenance

Regular Checks

Weekly:

  • Review page count trends

  • Check for crawl errors

  • Verify refresh is working

Monthly:

  • Audit included/excluded pages

  • Optimize crawl parameters

  • Remove outdated sources

Quarterly:

  • Review AI answer quality from website data

  • Update include/exclude patterns

  • Adjust refresh frequency based on site update patterns

Metrics to Track

  • Pages Crawled: Total number of indexed pages

  • Last Sync: When the last successful crawl completed

  • Error Rate: Percentage of failed page fetches

  • Crawl Duration: Time taken for full crawl

  • Usage: How often AI references this content

Next Steps

After setting up website crawling:

  1. Test your AI agent with website content questions

  2. Create specialized agents for different site sections

  3. Monitor analytics to see which pages are most useful

  4. Optimize crawl configuration based on usage patterns

Related connectors:

  • Sitemap - Import from sitemap.xml files

  • Confluence - For Confluence-hosted documentation

  • Files - Upload exported HTML or PDF documentation

  • Google Drive - For Google Docs-based documentation
