Sitemap Integration

Import website content using a sitemap.xml file to efficiently index large documentation sites, blogs, and other structured web content.

Overview

Property    Details
Type        Static
Refresh     Manual
Tier        1 (All Plans)
Format      sitemap.xml file
Max URLs    Varies by plan

When to Use the Sitemap Connector

The Sitemap connector is ideal for:

  • Large Documentation Sites - Efficiently import hundreds of pages

  • Structured Content - Sites with well-organized sitemaps

  • Static Site Generators - Jekyll, Hugo, Docusaurus, etc.

  • Archived Content - One-time import of website snapshots

  • Selective Imports - When you want specific URLs from a site

What is a Sitemap.xml?

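A sitemap.xml file is an XML list of the URLs on a website, typically used to help search engines discover and index pages. Each page appears as a <loc> entry inside a <url> element, and all entries sit inside a root <urlset> element.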
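As an illustration, here is a minimal sketch of a sitemap in the standard sitemaps.org format, read with Python's standard library. The example.com URLs and the `list_urls` helper are placeholders chosen for this example, not part of Twig.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Placeholder sitemap content; the URLs are illustrative only.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/docs/getting-started</loc>
  </url>
  <url>
    <loc>https://example.com/docs/api-reference</loc>
  </url>
</urlset>
"""


def list_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value listed in a sitemap document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in list_urls(SAMPLE_SITEMAP):
        print(url)
```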

Finding a Website's Sitemap

Common Sitemap Locations

Most websites place their sitemap at:

  • https://example.com/sitemap.xml

  • https://example.com/sitemap_index.xml

  • https://example.com/sitemap1.xml

  • https://docs.example.com/sitemap.xml

Check robots.txt

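Many sites list their sitemap in robots.txt, which is served from the site root (for example, https://example.com/robots.txt).

Look for a line that begins with "Sitemap:" followed by the sitemap URL.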
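If you already have the robots.txt text, a short script can pull out the sitemap entries. This is a sketch only: `sitemap_urls_from_robots` and the sample robots.txt content are hypothetical and written for this example.

```python
def sitemap_urls_from_robots(robots_txt: str) -> list[str]:
    """Collect the URLs from every 'Sitemap:' line in a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons survive.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Placeholder robots.txt content for illustration.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
"""

if __name__ == "__main__":
    print(sitemap_urls_from_robots(SAMPLE_ROBOTS))
```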

Use Browser Tools

  1. Open the website in your browser

  2. Right-click → View Page Source

  3. Search (Ctrl+F / Cmd+F) for "sitemap"

  4. Look for <link rel="sitemap"> tags

Ask the Website Owner

If you can't find the sitemap, contact the site administrator. They can:

  • Provide the sitemap URL

  • Generate a sitemap if one doesn't exist

  • Create a custom sitemap with specific pages

How to Add a Sitemap

Step 1: Download the Sitemap

  1. Navigate to the sitemap URL in your browser

  2. Right-click on the page

  3. Select "Save As" or "Save Page As"

  4. Save with filename: sitemap.xml

Alternative: download the file from the command line with a tool such as curl or wget.
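The download can also be scripted. The sketch below uses only Python's standard library; the `download_sitemap` name, the 30-second timeout, and the example URL are assumptions made for illustration.

```python
import shutil
import urllib.request
from pathlib import Path


def download_sitemap(url: str, destination: str = "sitemap.xml") -> Path:
    """Stream the resource at `url` into a local file and return its path."""
    target = Path(destination)
    with urllib.request.urlopen(url, timeout=30) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    # Placeholder URL; replace it with the sitemap you located earlier.
    saved = download_sitemap("https://example.com/sitemap.xml")
    print(f"Saved {saved} ({saved.stat().st_size} bytes)")
```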

Step 2: Navigate to Data Sources

  1. Log in to your Twig AI account

  2. Click Data in the main navigation menu

  3. Click Add Data Source or the + button

Step 3: Select Sitemap Connector

  1. Choose Sitemap.xml from the list

  2. The connector shows: "Publicly accessible websites from a sitemap.xml file"

Step 4: Configure the Data Source

Basic Information

  • Name (required): Descriptive name

    • Example: "Documentation Sitemap", "Blog Sitemap", "Help Center Pages"

  • Description (optional): Additional context

    • Example: "Complete documentation site from sitemap dated 2024-01-15"

File Upload

  1. Click Choose File or drag-and-drop

  2. Select your downloaded sitemap.xml file

  3. Wait for upload to complete

Tags (Optional)

  • Add organizational tags

  • Examples: "documentation", "external", "sitemap"

Step 5: Save and Process

  1. Click Save or Create

  2. The system will:

    • Parse the sitemap file

    • Fetch each URL listed

    • Extract and index content

  3. Monitor processing status

Step 6: Verify Import

  1. Check record count (number of URLs processed)

  2. Verify status shows "END_PROCESS"

  3. Review process logs for any failed URLs

  4. Test with relevant questions

Creating Custom Sitemaps

If you need a sitemap for a specific subset of pages, you can create one manually.

Basic Sitemap Structure
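A basic sitemap needs only a <urlset> root in the sitemaps.org namespace and one <url>/<loc> pair per page. The sketch below generates such a file from a list of URLs; `build_sitemap` and the example.com URLs are hypothetical, and `ET.indent` assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

SITEMAP_NAMESPACE = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> bytes:
    """Build a minimal sitemap document containing one <url> per input URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NAMESPACE)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    ET.indent(urlset)  # pretty-print; available from Python 3.9
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # Placeholder pages for illustration.
    pages = [
        "https://example.com/docs/",
        "https://example.com/docs/install",
    ]
    with open("sitemap.xml", "wb") as handle:
        handle.write(build_sitemap(pages))
```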

With Optional Metadata

Optional Tags:

  • <lastmod> - Last modified date (YYYY-MM-DD)

  • <priority> - Importance (0.0 to 1.0)

  • <changefreq> - Expected update frequency (for example daily, weekly, monthly)
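The optional tags can be sketched as a helper that builds one <url> entry and checks the values before writing them. The `url_entry` function is hypothetical; the allowed <changefreq> values are the ones listed in the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import Optional

# Values permitted for <changefreq> by the sitemaps.org protocol.
ALLOWED_CHANGEFREQ = {"always", "hourly", "daily", "weekly", "monthly", "yearly", "never"}


def url_entry(
    loc: str,
    lastmod: Optional[date] = None,
    changefreq: Optional[str] = None,
    priority: Optional[float] = None,
) -> ET.Element:
    """Build a <url> element with whichever optional tags are supplied."""
    entry = ET.Element("url")
    ET.SubElement(entry, "loc").text = loc
    if lastmod is not None:
        # date.isoformat() yields the YYYY-MM-DD form.
        ET.SubElement(entry, "lastmod").text = lastmod.isoformat()
    if changefreq is not None:
        if changefreq not in ALLOWED_CHANGEFREQ:
            raise ValueError(f"unsupported changefreq: {changefreq!r}")
        ET.SubElement(entry, "changefreq").text = changefreq
    if priority is not None:
        if not 0.0 <= priority <= 1.0:
            raise ValueError("priority must be between 0.0 and 1.0")
        ET.SubElement(entry, "priority").text = f"{priority:.1f}"
    return entry


if __name__ == "__main__":
    element = url_entry("https://example.com/docs/", date(2024, 1, 15), "weekly", 0.8)
    print(ET.tostring(element, encoding="unicode"))
```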

Using Online Sitemap Generators

Several tools can generate sitemaps:

  • Screaming Frog SEO Spider - Desktop app

  • XML-Sitemaps.com - Online generator

  • Sitemap Writer Pro - Desktop app

  • Custom scripts - Python, Node.js, etc.

Examples

Example 1: Documentation Site

Example 2: Blog Posts

Example 3: Help Center

Best Practices

1. Filter Sitemap Content

Before uploading, edit the sitemap to include only relevant pages:

Good:

Remove:

2. Keep Sitemaps Organized

Create separate data sources for different content types:

  • docs-sitemap.xml - Documentation pages

  • help-sitemap.xml - Support articles

  • blog-sitemap.xml - Blog posts

3. Version Your Sitemaps

When re-importing, keep a dated copy of each sitemap file you upload so that earlier versions remain available.
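One possible convention is to append the download date to the filename. The sketch below assumes that naming scheme; `versioned_name`, `archive_copy`, and the `sitemap-archive` directory are hypothetical.

```python
import shutil
from datetime import date
from pathlib import Path


def versioned_name(filename: str, on: date) -> str:
    """Insert an ISO date before the file extension."""
    path = Path(filename)
    return f"{path.stem}-{on.isoformat()}{path.suffix}"


def archive_copy(source: str, archive_dir: str, on: date) -> Path:
    """Copy a sitemap into an archive directory under its dated name."""
    target_dir = Path(archive_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / versioned_name(Path(source).name, on)
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Assumes sitemap.xml exists in the current directory.
    print(archive_copy("sitemap.xml", "sitemap-archive", date.today()))
```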

4. Validate Before Upload

Use sitemap validators:

  • https://www.xml-sitemaps.com/validate-xml-sitemap.html

  • https://www.websiteplanet.com/webtools/sitemap-validator/

5. Check URL Accessibility

Ensure every URL in the sitemap:

  • Is publicly accessible (no authentication required)

  • Returns a 200 status code (not a 404 or a redirect)

  • Contains actual content (not an empty page)
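The status-code check can be scripted before you upload. This is a sketch under a few assumptions: `fetch_status` and `find_problem_urls` are hypothetical helpers, redirects are deliberately not followed so that they show up as 3xx codes, and HEAD requests are used even though some servers reject them (switch to GET if that happens).

```python
import urllib.error
import urllib.request
from typing import Callable


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so 3xx responses surface as HTTPError."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for `url`, or 0 if no response was received."""
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx, and 5xx responses arrive here
    except OSError:
        return 0  # DNS failure, refused connection, timeout, etc.


def find_problem_urls(
    urls: list[str], fetch: Callable[[str], int] = fetch_status
) -> dict[str, int]:
    """Map each URL that did not return 200 to the status it returned."""
    problems = {}
    for url in urls:
        status = fetch(url)
        if status != 200:
            problems[url] = status
    return problems
```

The `fetch` parameter lets you substitute a stub when testing the logic without network access.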

Advantages Over the Website Connector

Feature      Sitemap                    Website Crawler
Speed        Fast (only listed URLs)    Slower (discovers links)
Precision    Exact pages you want       May miss or include extra pages
Control      Full control over URLs     Limited by crawler settings
Resources    Less server load           More server requests
Freshness    Manual update needed       Can auto-refresh

Use Sitemap when:

  • You know exactly which pages to import

  • Site has a complete, up-to-date sitemap

  • You want a one-time import

  • You need to minimize server load

Use Website Crawler when:

  • You want automatic discovery

  • Site structure changes frequently

  • You want automatic updates

  • You're not sure which pages exist

Handling Large Sitemaps

Sitemap Index Files

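Large sites may split their URLs across several sitemaps and list those files in a sitemap index. An index file has a <sitemapindex> root, and each <sitemap> entry inside it points to one child sitemap.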
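A sketch of reading such an index with Python's standard library follows. The sample index content and the `child_sitemaps` helper are placeholders for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Placeholder index content; the URLs are illustrative only.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-docs.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>
"""


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the child sitemap URLs, or [] if the file is not an index."""
    root = ET.fromstring(index_xml.encode("utf-8"))
    if root.tag != f"{{{NS}}}sitemapindex":
        return []
    return [
        loc.text.strip()
        for loc in root.findall(f"{{{NS}}}sitemap/{{{NS}}}loc")
        if loc.text
    ]


if __name__ == "__main__":
    for url in child_sitemaps(SAMPLE_INDEX):
        print(url)
```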

To import:

  1. Download each individual sitemap

  2. Either create a separate data source for each sitemap, or merge the sitemaps into one file before uploading

Merging Multiple Sitemaps

Combine multiple sitemaps into one by copying the <url> entries from each file into a single <urlset>.
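The merge can be sketched as follows. `merge_sitemaps` is a hypothetical helper; it keeps only the <loc> values (optional metadata is dropped) and skips duplicate URLs.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def merge_sitemaps(sitemap_docs: list[str]) -> bytes:
    """Merge several sitemap documents into one, keeping the first copy of each URL."""
    seen = set()
    merged = ET.Element("urlset", xmlns=NS)
    for doc in sitemap_docs:
        root = ET.fromstring(doc.encode("utf-8"))
        for loc in root.iter(f"{{{NS}}}loc"):
            url = (loc.text or "").strip()
            if url and url not in seen:
                seen.add(url)
                entry = ET.SubElement(merged, "url")
                ET.SubElement(entry, "loc").text = url
    return ET.tostring(merged, encoding="UTF-8", xml_declaration=True)


if __name__ == "__main__":
    # Assumes these two files were downloaded beforehand.
    docs = []
    for name in ("sitemap-docs.xml", "sitemap-blog.xml"):
        with open(name, encoding="utf-8") as handle:
            docs.append(handle.read())
    with open("sitemap.xml", "wb") as handle:
        handle.write(merge_sitemaps(docs))
```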

Updating Content

Because Sitemap is a static connector, content is refreshed only when you re-import the sitemap.

To Update:

  1. Download updated sitemap from website

  2. Edit your data source in Twig

  3. Upload the new sitemap file

  4. Save to reprocess all URLs

Automation Options:

  • Schedule periodic manual updates

  • Use Website connector for automatic updates

  • Set up external scripts to notify you of sitemap changes
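A change-detection script could compare the sitemap you last uploaded with a freshly downloaded copy. The sketch below shows only the comparison step; `sitemap_changes` is a hypothetical helper, and how you deliver the notification is left to you.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def _urls(sitemap_xml: str) -> set[str]:
    """Collect the set of <loc> values in a sitemap document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return {loc.text.strip() for loc in root.iter(f"{{{NS}}}loc") if loc.text}


def sitemap_changes(old_xml: str, new_xml: str) -> tuple[set[str], set[str]]:
    """Return (added, removed) URLs between two versions of a sitemap."""
    old_urls, new_urls = _urls(old_xml), _urls(new_xml)
    return new_urls - old_urls, old_urls - new_urls
```

If either set is non-empty, the sitemap has changed and a re-import may be due.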

Troubleshooting

URLs Not Accessible

Problem: Some URLs fail to process

Solutions:

  • Verify URLs are publicly accessible

  • Check for authentication requirements

  • Test URLs in an incognito browser window

  • Review process logs for specific error codes

Invalid Sitemap Format

Problem: Sitemap upload fails

Solutions:

  • Validate XML syntax using an online validator

  • Check for proper XML declaration

  • Ensure proper namespace declaration

  • Verify file encoding is UTF-8
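These checks can also be run locally before uploading. The sketch below is an assumption about what is worth checking, not a description of Twig's own validation; `sitemap_problems` is a hypothetical helper and `bytes.removeprefix` assumes Python 3.9 or later.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
UTF8_BOM = b"\xef\xbb\xbf"


def sitemap_problems(data: bytes) -> list[str]:
    """Return a list of human-readable problems found in a sitemap file."""
    problems = []
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    # Ignore a byte-order mark and leading whitespace when looking for the declaration.
    if not data.removeprefix(UTF8_BOM).lstrip().startswith(b"<?xml"):
        problems.append("missing XML declaration")
    try:
        root = ET.fromstring(data)
    except ET.ParseError as err:
        problems.append(f"XML syntax error: {err}")
        return problems
    if root.tag != f"{{{NS}}}urlset":
        problems.append("root element is not <urlset> in the sitemap namespace")
    elif not any(loc.text and loc.text.strip() for loc in root.iter(f"{{{NS}}}loc")):
        problems.append("no <loc> entries found")
    return problems


if __name__ == "__main__":
    with open("sitemap.xml", "rb") as handle:
        for problem in sitemap_problems(handle.read()) or ["no problems found"]:
            print(problem)
```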

Empty Pages Imported

Problem: URLs processed but no content extracted

Solutions:

  • Check if pages contain actual text content

  • Verify pages aren't JavaScript-heavy SPAs (single-page applications) that render their content in the browser

  • Look for content behind login walls

  • Test URL manually in browser
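To see roughly what text a page exposes without running JavaScript, you can strip the markup from its raw HTML. This sketch assumes the importer reads server-rendered HTML, which the source does not state explicitly; `visible_text` is a hypothetical helper.

```python
from html.parser import HTMLParser

# Elements whose text is not part of the readable page body.
_SKIPPED_TAGS = ("script", "style", "noscript", "title")


class _TextExtractor(HTMLParser):
    """Collect text nodes that appear outside the skipped elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in _SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in _SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Return the page's text content with markup and scripts removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

An empty or very short result for a page that looks full in the browser suggests the content is rendered client-side.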

Partial Import

Problem: Only some URLs processed

Solutions:

  • Check plan limits on number of URLs

  • Review process logs for errors

  • Verify failed URLs are accessible

  • Split large sitemaps into multiple sources
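Splitting can be sketched as chunking the URL list, with each chunk then saved as its own sitemap file and data source. `split_urls` is a hypothetical helper, and the limit of 2 in the example is a placeholder rather than an actual plan limit.

```python
def split_urls(urls: list[str], max_per_file: int) -> list[list[str]]:
    """Split a URL list into consecutive chunks of at most `max_per_file` items."""
    if max_per_file < 1:
        raise ValueError("max_per_file must be at least 1")
    return [urls[i:i + max_per_file] for i in range(0, len(urls), max_per_file)]


if __name__ == "__main__":
    # Placeholder URLs and limit for illustration.
    pages = [f"https://example.com/docs/page-{n}" for n in range(1, 6)]
    for number, chunk in enumerate(split_urls(pages, 2), start=1):
        print(f"sitemap-part-{number}.xml: {len(chunk)} URLs")
```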

Advanced Tips

1. Filtering with Text Editor

Use find-and-replace in a text editor to quickly filter sitemaps:

Remove URLs containing "blog":

Keep only "/docs/" URLs:

  • Copy entire sitemap

  • Delete all content

  • Paste back only lines containing "/docs/"
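The same two filters can be scripted instead of edited by hand. In this sketch `filter_sitemap` is a hypothetical helper that removes whole <url> entries by substring match on <loc>; registering the empty prefix keeps the output free of `ns0:` prefixes.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ET.register_namespace("", NS)  # serialize with a default xmlns, not ns0:


def filter_sitemap(
    sitemap_xml: str,
    keep_substring: Optional[str] = None,
    drop_substring: Optional[str] = None,
) -> str:
    """Return the sitemap with non-matching <url> entries removed."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    for entry in list(root.findall(f"{{{NS}}}url")):
        loc = entry.findtext(f"{{{NS}}}loc", default="").strip()
        unwanted = (keep_substring is not None and keep_substring not in loc) or (
            drop_substring is not None and drop_substring in loc
        )
        if unwanted:
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")
```

Substring matching is blunt: dropping "blog" also removes a page such as /docs/blog-migration, so a path segment like "/blog/" is usually the safer pattern.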

2. Combining Sitemaps from Different Sites

Create a consolidated sitemap that lists the URLs from several sites in a single <urlset>.

3. Priority-Based Import

Create multiple data sources based on priority:

High Priority (daily refresh needed):

Low Priority (rarely changes):

Next Steps

After importing from sitemap:

  1. Create AI agents for specific content areas

  2. Monitor usage to see which pages are most referenced

  3. Plan periodic sitemap updates
