Website Crawling
Crawl and index publicly accessible websites to build a knowledge base from online documentation, blogs, help centers, and other web content.
Overview
Type: Dynamic
Refresh: Scheduled (Daily, Weekly, Monthly)
Tier: 1 (All Plans)
Crawl Depth: Configurable
Authentication: Public sites only
When to Use Website Connector
The Website connector is ideal for:
Public Documentation - Developer docs, user guides, API references
Help Centers - Support articles and knowledge bases
Blog Content - Company blogs, technical articles
Marketing Pages - Product information, feature pages
Resource Centers - Tutorials, guides, whitepapers
News Sites - Company news, press releases
How It Works
Crawling Process
Start URL: You provide a starting URL
Discovery: Crawler finds all links on the page
Follow Links: Crawler visits each discovered link (within domain)
Extract Content: Text content is extracted from each page
Index: Content is processed and indexed for AI search
Crawl Scope
By default, the crawler:
Stays within the same domain as the start URL
Follows all internal links
Respects robots.txt rules
Skips non-HTML content (images, PDFs, etc.)
Avoids duplicate pages
How to Add a Website
Step 1: Navigate to Data Sources
Log in to your Twig AI account
Click Data in the main navigation menu
Click Add Data Source or the + button
Step 2: Select Website Connector
Choose Website from the list of connectors
The connector shows: "Reads a publicly accessible website"
Step 3: Configure the Data Source
Basic Information
Name (required): Descriptive name for the website
Example: "Product Documentation", "Support Knowledge Base", "Company Blog"
Description (optional): Additional context
Example: "Official product documentation site with API reference and user guides"
URL Configuration
URL (required): The starting URL to crawl
Must be a valid URL starting with http:// or https://
Example: https://docs.example.com
Example: https://help.example.com/en/
URL Selection Tips:
Start at the most relevant section (e.g., /docs/ instead of the homepage)
Use URLs with clear structure and navigation
Avoid URLs with query parameters if possible
Advanced Parameters (JSON)
You can configure advanced crawling options using JSON in the Parameters field:
Available Parameters:
| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| maxDepth | Number | Maximum link depth from start URL | 3 |
| maxPages | Number | Maximum pages to crawl | 100 |
| includePatterns | Array | URL patterns to include | All |
| excludePatterns | Array | URL patterns to exclude | None |
| followExternalLinks | Boolean | Crawl external domains | false |
| respectRobotsTxt | Boolean | Follow robots.txt rules | true |
| userAgent | String | Custom user agent string | Twig Bot |
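For instance, a Parameters value that combines several of these options might look like the following (the values shown are illustrative, not recommendations):

```json
{
  "maxDepth": 3,
  "maxPages": 200,
  "includePatterns": ["/docs/"],
  "excludePatterns": ["/search"],
  "followExternalLinks": false,
  "respectRobotsTxt": true,
  "userAgent": "Twig Bot"
}
```

Any keys you omit fall back to the defaults listed in the table above.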
Refresh Frequency
Choose how often to re-crawl the website:
Never - Manual refresh only (static)
Daily - Refresh every day
Weekly - Refresh every week
Monthly - Refresh every month
Recommendation:
Daily: For frequently updated sites (news, blogs)
Weekly: For moderately updated sites (documentation)
Monthly: For stable content (marketing pages)
Tags (Optional)
Add tags for organization:
Examples: "documentation", "external", "support", "public"
Step 4: Save and Crawl
Click Save or Create
Initial crawl begins automatically
Monitor status in the data sources list
Step 5: Verify Crawl
Check record count (number of pages crawled)
Verify status shows "END_PROCESS"
Review process logs for any errors
Test knowledge with relevant questions
Examples
Example 1: Documentation Site
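A possible setup for a documentation site such as https://docs.example.com (the schedule, limits, and patterns are illustrative):

Name: Product Documentation
URL: https://docs.example.com
Refresh: Weekly
Tags: documentation, public
Parameters:

```json
{
  "maxDepth": 3,
  "maxPages": 300,
  "excludePatterns": ["/search", "/login"]
}
```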
Example 2: Help Center
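A possible setup for a help center such as https://help.example.com/en/ (values are illustrative; the /en/ include pattern assumes you only want English-language articles):

Name: Support Knowledge Base
URL: https://help.example.com/en/
Refresh: Weekly
Tags: support, public
Parameters:

```json
{
  "maxDepth": 4,
  "maxPages": 500,
  "includePatterns": ["/en/"]
}
```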
Example 3: Company Blog
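A possible setup for a company blog (the URL and exclusions are illustrative):

Name: Company Blog
URL: https://example.com/blog/
Refresh: Daily
Tags: external, public
Parameters:

```json
{
  "maxDepth": 2,
  "maxPages": 200,
  "excludePatterns": ["/tag/", "/archive/"]
}
```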
Best Practices
1. Choose the Right Starting URL
Good Starting Points:
/docs/ - Documentation root
/help/en/ - Help center in a specific language
/api/reference/ - API documentation section
/kb/ - Knowledge base root
Avoid Starting From:
Homepage (too broad, many irrelevant links)
Login pages (can't be crawled)
Dynamic search results
Paginated archives with no page limit
2. Use Include/Exclude Patterns
Include Patterns - Only crawl these sections:
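For example, to limit the crawl to the documentation and API sections (patterns are illustrative):

```json
{
  "includePatterns": ["/docs/", "/api/"]
}
```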
Exclude Patterns - Skip these sections:
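For example, to skip login pages, search results, and archives (patterns are illustrative):

```json
{
  "excludePatterns": ["/login", "/search", "/archive/"]
}
```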
3. Optimize Crawl Limits
For Large Sites:
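One plausible configuration raises the page budget while keeping the crawl focused (values are illustrative):

```json
{
  "maxDepth": 4,
  "maxPages": 1000,
  "includePatterns": ["/docs/"]
}
```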
For Small Sites:
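The defaults are usually enough; you can also tighten the limits (values are illustrative):

```json
{
  "maxDepth": 2,
  "maxPages": 50
}
```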
Balance: higher depth gives more comprehensive coverage but a longer crawl time
4. Set Appropriate Refresh Schedules
Daily: News sites, rapidly changing content
Weekly: Documentation with regular updates
Monthly: Stable marketing content
Manual (Never): One-time imports, archived content
5. Monitor Crawl Health
Regularly check:
Number of pages crawled (is it increasing/decreasing?)
Last successful crawl date
Error logs for failed pages
Response time and crawl duration
Advanced Configuration
Handling Authentication
The Website connector only supports public websites. For authenticated content, use alternative connectors:
Confluence - For Atlassian Confluence
SharePoint - For Microsoft SharePoint
Google Drive - For Google Docs
Crawling Subdomains
To crawl multiple subdomains:
Option 1: Create separate data sources for each subdomain
Option 2: Enable external link following (use cautiously)
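A sketch of Option 2, assuming includePatterns are matched against the full URL so the crawl stays on your own subdomains (hostnames are illustrative):

```json
{
  "followExternalLinks": true,
  "includePatterns": ["docs.example.com", "help.example.com"]
}
```

If pattern matching behaves differently in your account, Option 1 (separate data sources per subdomain) is the safer route.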
Handling Dynamic Content
JavaScript-Rendered Content: The crawler can handle basic JavaScript rendering but may miss:
Complex single-page applications (SPAs)
Content loaded after user interaction
Infinite scroll content
Solutions:
Check if site has a static HTML fallback
Look for sitemap.xml (use Sitemap connector)
Contact site owner about crawler-friendly version
Rate Limiting
The crawler automatically:
Waits between requests (polite crawling)
Respects server Retry-After headers
Backs off on errors
Distributes load over time
Troubleshooting
Few Pages Crawled
Problem: Only 1-2 pages crawled from a large site
Solutions:
Check maxDepth is sufficient
Verify the maxPages limit isn't too low
Review that includePatterns aren't too restrictive
Ensure the site has proper internal linking
Check robots.txt isn't blocking crawler
Missing Content
Problem: Important pages not included
Solutions:
Verify pages are linked from start URL
Check pages aren't excluded by patterns
Ensure pages are within maxDepth limit
Look for pages in robots.txt disallow list
Check if pages require authentication
Crawl Timeout
Problem: Crawl stops before completing
Solutions:
Reduce the maxPages limit
Decrease maxDepth
Add more specific includePatterns
Check if the site is responding slowly
Try crawling specific sections separately
Duplicate Content
Problem: Same content appears multiple times
Solutions:
The crawler should handle duplicates automatically
Check for URL variations (with/without trailing slash)
Review query parameters in URLs
Add exclude patterns for redundant paths
Refresh Not Working
Problem: Scheduled refresh isn't updating content
Solutions:
Verify refresh frequency is not set to "NEVER"
Check last processed date
Review process logs for errors
Ensure website is accessible
Check for crawler blocking (robots.txt, firewall)
Performance Tips
1. Start Specific, Expand Later
Begin with a narrow scope:
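For example, start from a focused section (URL is illustrative):

```
https://docs.example.com/api/reference/
```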
Then expand as needed:
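For example, move the start URL up to the documentation root (URL is illustrative):

```
https://docs.example.com/
```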
2. Use Multiple Focused Sources
Instead of one broad crawl:
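That is, rather than a single source starting at the site root (URL is illustrative):

```
https://example.com/
```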
Create targeted sources:
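Set up one source per section, for example (URLs are illustrative):

```
https://docs.example.com
https://help.example.com/en/
https://example.com/blog/
```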
3. Exclude Non-Essential Content
Common exclusions:
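Typical candidates are login pages, search results, and paginated archives (patterns are illustrative):

```json
{
  "excludePatterns": ["/login", "/search", "/archive/", "/tag/"]
}
```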
Monitoring & Maintenance
Regular Checks
Weekly:
Review page count trends
Check for crawl errors
Verify refresh is working
Monthly:
Audit included/excluded pages
Optimize crawl parameters
Remove outdated sources
Quarterly:
Review AI answer quality from website data
Update include/exclude patterns
Adjust refresh frequency based on site update patterns
Metrics to Track
Pages Crawled: Total number of indexed pages
Last Sync: When was the last successful crawl
Error Rate: Percentage of failed page fetches
Crawl Duration: Time taken for full crawl
Usage: How often AI references this content
Next Steps
After setting up website crawling:
Test your AI agent with website content questions
Create specialized agents for different site sections
Monitor analytics to see which pages are most useful
Optimize crawl configuration based on usage patterns
Related Connectors
Sitemap - Import from sitemap.xml files
Confluence - For Confluence-hosted documentation
Files - Upload exported HTML or PDF documentation
Google Drive - For Google Docs-based documentation