# Sitemap Integration

Import website content using a sitemap.xml file to efficiently index large documentation sites, blogs, and other structured web content.

## Overview

| Property     | Details          |
| ------------ | ---------------- |
| **Type**     | Static           |
| **Refresh**  | Manual           |
| **Tier**     | 1 (All Plans)    |
| **Format**   | sitemap.xml file |
| **Max URLs** | Varies by plan   |

## When to Use Sitemap Connector

The Sitemap connector is ideal for:

* **Large Documentation Sites** - Efficiently import hundreds of pages
* **Structured Content** - Sites with well-organized sitemaps
* **Static Site Generators** - Jekyll, Hugo, Docusaurus, etc.
* **Archived Content** - One-time import of website snapshots
* **Selective Imports** - When you want specific URLs from a site

## What is a Sitemap.xml?

A sitemap.xml file is a list of URLs on a website, typically used to help search engines discover and index pages. It looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/api-reference</loc>
    <lastmod>2024-01-14</lastmod>
  </url>
  <url>
    <loc>https://docs.example.com/tutorials</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```

## Finding a Website's Sitemap

### Common Sitemap Locations

Most websites place their sitemap at:

* `https://example.com/sitemap.xml`
* `https://example.com/sitemap_index.xml`
* `https://example.com/sitemap1.xml`
* `https://docs.example.com/sitemap.xml`

### Check robots.txt

Many sites list their sitemap in `robots.txt`:

```
https://example.com/robots.txt
```

Look for:

```
Sitemap: https://example.com/sitemap.xml
```
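If you have already saved a copy of `robots.txt`, the lookup above can be scripted. The following is a minimal sketch; the function name `sitemap_urls_from_robots` and the sample file contents are illustrative assumptions, not part of Twig.

```python
def sitemap_urls_from_robots(robots_txt):
    """Return every URL declared in a `Sitemap:` line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        # Split on the first colon only, so the URL keeps its own colons.
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt contents used only for illustration.
sample = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap_index.xml
"""
print(sitemap_urls_from_robots(sample))
```

A site may list more than one sitemap, so the function returns a list rather than a single URL.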

### Use Browser Tools

1. Open the website in your browser
2. Right-click → View Page Source
3. Search (Ctrl+F / Cmd+F) for "sitemap"
4. Look for `<link rel="sitemap"` tags

### Ask the Website Owner

If you can't find the sitemap, contact the site administrator. They can:

* Provide the sitemap URL
* Generate a sitemap if one doesn't exist
* Create a custom sitemap with specific pages

## How to Add a Sitemap

### Step 1: Download the Sitemap

1. Navigate to the sitemap URL in your browser
2. Right-click on the page
3. Select "Save As" or "Save Page As"
4. Save with filename: `sitemap.xml`

**Alternative:** Use the command line:

```bash
curl https://docs.example.com/sitemap.xml -o sitemap.xml
```

or

```bash
wget https://docs.example.com/sitemap.xml
```

### Step 2: Navigate to Data Sources

1. Log in to your Twig AI account
2. Click **Data** in the main navigation menu
3. Click **Add Data Source** or the **+** button

### Step 3: Select Sitemap Connector

1. Choose **Sitemap.xml** from the list
2. The connector shows: "Publicly accessible websites from a sitemap.xml file"

### Step 4: Configure the Data Source

#### Basic Information

* **Name** (required): Descriptive name
  * Example: "Documentation Sitemap", "Blog Sitemap", "Help Center Pages"
* **Description** (optional): Additional context
  * Example: "Complete documentation site from sitemap dated 2024-01-15"

#### File Upload

1. Click **Choose File** or drag-and-drop
2. Select your downloaded `sitemap.xml` file
3. Wait for upload to complete

#### Tags (Optional)

* Add organizational tags
* Examples: "documentation", "external", "sitemap"

### Step 5: Save and Process

1. Click **Save** or **Create**
2. The system will:
   * Parse the sitemap file
   * Fetch each URL listed
   * Extract and index content
3. Monitor processing status

### Step 6: Verify Import

1. Check record count (number of URLs processed)
2. Verify status shows `END_PROCESS`
3. Review process logs for any failed URLs
4. Test with relevant questions

## Creating Custom Sitemaps

If you need a sitemap for a specific subset of pages, you can create one manually.

### Basic Sitemap Structure

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/page1</loc>
  </url>
  <url>
    <loc>https://example.com/page2</loc>
  </url>
  <url>
    <loc>https://example.com/page3</loc>
  </url>
</urlset>
```

### With Optional Metadata

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/important-page</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
    <changefreq>daily</changefreq>
  </url>
  <url>
    <loc>https://example.com/other-page</loc>
    <lastmod>2024-01-10</lastmod>
    <priority>0.8</priority>
    <changefreq>weekly</changefreq>
  </url>
</urlset>
```

**Optional Tags:**

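* `<lastmod>` - Last modified date in W3C Datetime format (for example, YYYY-MM-DD)
* `<priority>` - Importance relative to other URLs on the same site (0.0 to 1.0)
* `<changefreq>` - Expected update frequency (always, hourly, daily, weekly, monthly, yearly, or never)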

### Using Online Sitemap Generators

Several tools can generate sitemaps:

* **Screaming Frog SEO Spider** - Desktop app
* **XML-Sitemaps.com** - Online generator
* **Sitemap Writer Pro** - Desktop app
* **Custom scripts** - Python, Node.js, etc.
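As an illustration of the custom-script option, here is a minimal Python sketch that builds a sitemap from a list of URLs using only the standard library. The function name `build_sitemap` and the example URLs are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML as bytes, with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" when serializing.
        ET.SubElement(entry, "loc").text = url
    ET.indent(urlset)  # pretty-print; requires Python 3.9+
    return ET.tostring(urlset, encoding="UTF-8", xml_declaration=True)


# Hypothetical page list used only for illustration.
pages = [
    "https://example.com/page1",
    "https://example.com/page2",
]
print(build_sitemap(pages).decode("utf-8"))
```

To produce a file for upload, the returned bytes can be written out in binary mode, for example to `custom-sitemap.xml`.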

## Examples

### Example 1: Documentation Site

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.example.com/</loc>
  </url>
  <url>
    <loc>https://docs.example.com/getting-started</loc>
  </url>
  <url>
    <loc>https://docs.example.com/api-reference</loc>
  </url>
  <url>
    <loc>https://docs.example.com/tutorials</loc>
  </url>
  <url>
    <loc>https://docs.example.com/faq</loc>
  </url>
</urlset>
```

```
Name: Product Documentation
Description: Complete product documentation from sitemap
File: docs-sitemap.xml (5 URLs)
Tags: documentation, product, public
```

### Example 2: Blog Posts

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://blog.example.com/2024/how-to-get-started</loc>
  </url>
  <url>
    <loc>https://blog.example.com/2024/advanced-tips</loc>
  </url>
  <url>
    <loc>https://blog.example.com/2023/year-in-review</loc>
  </url>
</urlset>
```

```
Name: Technical Blog Posts
Description: Selected technical blog posts
File: blog-sitemap.xml (3 URLs)
Tags: blog, technical, public
```

### Example 3: Help Center

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://help.example.com/en/articles/account-setup</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/billing-faq</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/troubleshooting</loc>
  </url>
  <url>
    <loc>https://help.example.com/en/articles/api-integration</loc>
  </url>
</urlset>
```

```
Name: Help Center Articles
Description: Customer support articles in English
File: help-sitemap.xml (4 URLs)
Tags: support, help, customer-facing
```

## Best Practices

### 1. Filter Sitemap Content

**Before uploading**, edit the sitemap to include only relevant pages:

**Good:**

```xml
<url><loc>https://example.com/docs/getting-started</loc></url>
<url><loc>https://example.com/docs/api-reference</loc></url>
<url><loc>https://example.com/docs/tutorials</loc></url>
```

**Remove:**

```xml
<url><loc>https://example.com/login</loc></url>
<url><loc>https://example.com/signup</loc></url>
<url><loc>https://example.com/checkout</loc></url>
<url><loc>https://example.com/privacy-policy</loc></url>
```
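For larger sitemaps, the same filtering can be scripted rather than done by hand. This is a sketch under stated assumptions: `drop_urls` is a hypothetical helper, and the substrings passed to it are just the examples from the list above.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
# Serialize the sitemap namespace as the default one (no "ns0:" prefixes).
# Note: this registration applies process-wide.
ET.register_namespace("", NS["sm"])


def drop_urls(sitemap_xml, unwanted):
    """Return sitemap XML without entries whose <loc> contains an unwanted substring."""
    root = ET.fromstring(sitemap_xml)
    # findall returns a list, so removing entries while looping is safe.
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", default="", namespaces=NS)
        if any(part in loc for part in unwanted):
            root.remove(entry)
    return ET.tostring(root, encoding="unicode")


# Hypothetical sitemap used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/getting-started</loc></url>
  <url><loc>https://example.com/login</loc></url>
  <url><loc>https://example.com/checkout</loc></url>
</urlset>"""
print(drop_urls(sample, ["/login", "/signup", "/checkout", "/privacy-policy"]))
```

Substring matching is deliberately simple here; a path such as `/docs/login-guide` would also be dropped, so review the result before uploading.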

### 2. Keep Sitemaps Organized

Create separate data sources for different content types:

* `docs-sitemap.xml` - Documentation pages
* `help-sitemap.xml` - Support articles
* `blog-sitemap.xml` - Blog posts

### 3. Version Your Sitemaps

When re-importing, keep versions:

```
docs-sitemap-2024-01.xml
docs-sitemap-2024-02.xml
docs-sitemap-2024-03.xml
```

### 4. Validate Before Upload

Use sitemap validators:

* <https://www.xml-sitemaps.com/validate-xml-sitemap.html>
* <https://www.websiteplanet.com/webtools/sitemap-validator/>

### 5. Check URL Accessibility

Ensure all URLs in the sitemap are:

* Publicly accessible (no authentication required)
* Returning 200 status code (not 404 or redirects)
* Containing actual content (not empty pages)
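The three checks above can be approximated with a short script before uploading. This is a hedged sketch, not a description of how Twig fetches pages: `check_url` is a hypothetical helper, the `opener` parameter exists only so the function can be exercised without network access, and a redirect is inferred by comparing the final URL with the requested one.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return (ok, reason) for a single URL without logging in anywhere."""
    try:
        with opener(url, timeout=timeout) as resp:
            # urlopen follows redirects, so a different final URL means a redirect.
            if resp.geturl() != url:
                return False, f"redirected to {resp.geturl()}"
            # Read one byte to confirm the response has a body at all.
            if not resp.read(1):
                return False, "empty body"
            return True, "ok"
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx responses, e.g. 401, 403, 404.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {err}"
```

Calling `check_url("https://docs.example.com/getting-started")` returns a pair such as `(True, "ok")` or `(False, "HTTP 404")`; running it over every `<loc>` in the file gives a quick list of entries to fix or remove.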

## Advantages Over Website Connector

| Feature       | Sitemap                 | Website Crawler                 |
| ------------- | ----------------------- | ------------------------------- |
| **Speed**     | Fast (only listed URLs) | Slower (discovers links)        |
| **Precision** | Exact pages you want    | May miss or include extra pages |
| **Control**   | Full control over URLs  | Limited by crawler settings     |
| **Resources** | Less server load        | More server requests            |
| **Freshness** | Manual update needed    | Can auto-refresh                |

**Use Sitemap when:**

* You know exactly which pages to import
* Site has a complete, up-to-date sitemap
* You want a one-time import
* You need to minimize server load

**Use Website Crawler when:**

* You want automatic discovery
* Site structure changes frequently
* You want automatic updates
* You're not sure which pages exist

## Handling Large Sitemaps

### Sitemap Index Files

Large sites may use sitemap index files:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap2.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap3.xml</loc>
  </sitemap>
</sitemapindex>
```

**To import:**

1. Download each individual sitemap listed in the index
2. Then either:
   * Create a separate data source for each sitemap, or
   * Merge the sitemaps into one file before uploading
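To see which files an index points to, the `<loc>` values can be pulled out with a few lines of Python. This is a minimal sketch; `child_sitemaps` is a hypothetical helper name and the sample index mirrors the example above.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml):
    """Return the sitemap URLs listed in a <sitemapindex> document."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:sitemap/sm:loc", NS)
        if loc.text
    ]


# Hypothetical index used only for illustration.
index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap2.xml</loc></sitemap>
</sitemapindex>"""
print(child_sitemaps(index))
```

Each returned URL can then be downloaded the same way as a regular sitemap.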

### Merging Multiple Sitemaps

Combine multiple sitemaps into one:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- URLs from sitemap1.xml -->
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
  
  <!-- URLs from sitemap2.xml -->
  <url><loc>https://example.com/page3</loc></url>
  <url><loc>https://example.com/page4</loc></url>
</urlset>
```
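Merging by hand becomes tedious beyond a few files, so here is a sketch of the same operation in Python. It assumes the helper name `merge_sitemaps`, keeps only the `<loc>` of each entry (optional tags such as `<lastmod>` are dropped), and skips URLs that appear more than once.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def merge_sitemaps(sitemaps):
    """Combine several sitemap XML strings into one, skipping duplicate URLs."""
    seen = set()
    merged = ET.Element("urlset", xmlns=SITEMAP_NS)
    for xml_text in sitemaps:
        for loc in ET.fromstring(xml_text).findall("sm:url/sm:loc", NS):
            url = (loc.text or "").strip()
            if url and url not in seen:
                seen.add(url)
                entry = ET.SubElement(merged, "url")
                ET.SubElement(entry, "loc").text = url
    ET.indent(merged)  # pretty-print; requires Python 3.9+
    return ET.tostring(merged, encoding="unicode")


# Hypothetical inputs used only for illustration.
first = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>"""
second = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page2</loc></url>
  <url><loc>https://example.com/page3</loc></url>
</urlset>"""
print(merge_sitemaps([first, second]))
```

URLs keep the order in which they are first seen, which makes the merged file easy to compare against its sources.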

## Updating Content

Since Sitemap is a **static** connector, updates require re-import:

### To Update:

1. Download updated sitemap from website
2. Edit your data source in Twig
3. Upload the new sitemap file
4. Save to reprocess all URLs

### Automation Options:

* Schedule periodic manual updates
* Use [Website connector](/product/data-integrations/website.md) for automatic updates
* Set up external scripts to notify you of sitemap changes
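One way an external script might detect sitemap changes is to compare the URL sets of two downloaded versions. The sketch below is illustrative only: `sitemap_locs` and `sitemap_diff` are hypothetical helpers, and how you schedule the script or get notified is left to your own tooling.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(sitemap_xml):
    """Return the set of URLs listed in a sitemap."""
    root = ET.fromstring(sitemap_xml)
    return {loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text}


def sitemap_diff(old_xml, new_xml):
    """Return (added, removed) URL lists between two sitemap versions."""
    old, new = sitemap_locs(old_xml), sitemap_locs(new_xml)
    return sorted(new - old), sorted(old - new)


# Hypothetical versions used only for illustration.
january = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/getting-started</loc></url>
  <url><loc>https://docs.example.com/faq</loc></url>
</urlset>"""
february = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs.example.com/getting-started</loc></url>
  <url><loc>https://docs.example.com/tutorials</loc></url>
</urlset>"""
added, removed = sitemap_diff(january, february)
print("added:", added)
print("removed:", removed)
```

This only detects added or removed URLs. Pages whose content changed at the same URL would need a comparison of `<lastmod>` values, where the site provides them.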

## Troubleshooting

### URLs Not Accessible

**Problem:** Some URLs fail to process

**Solutions:**

* Verify URLs are publicly accessible
* Check for authentication requirements
* Test URLs in incognito browser window
* Review process logs for specific error codes

### Invalid Sitemap Format

**Problem:** Sitemap upload fails

**Solutions:**

* Validate XML syntax using online validator
* Check for proper XML declaration
* Ensure proper namespace declaration
* Verify file encoding is UTF-8
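A quick local check can catch most of these format problems before you upload. The sketch below is an assumption about what a reasonable pre-check looks like, based on the sitemap protocol; it is not Twig's own validation logic, and `sitemap_problems` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def sitemap_problems(data):
    """Return a list of human-readable problems found in sitemap XML."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError as err:
        return [f"not well-formed XML: {err}"]
    # A namespaced root parses as "{namespace}urlset".
    if root.tag != f"{{{SITEMAP_NS}}}urlset":
        return [f"root element is {root.tag!r}, expected <urlset> in the sitemap namespace"]
    problems = []
    locs = root.findall(f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc")
    if not locs:
        problems.append("no <loc> entries found")
    for loc in locs:
        url = (loc.text or "").strip()
        if not url.startswith(("http://", "https://")):
            problems.append(f"<loc> is not an absolute http(s) URL: {url!r}")
    return problems


# Hypothetical input: the namespace declaration is missing.
print(sitemap_problems("<urlset><url><loc>https://example.com/</loc></url></urlset>"))
```

An empty list means none of the checked problems were found; it does not guarantee the upload will succeed.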

### Empty Pages Imported

**Problem:** URLs processed but no content extracted

**Solutions:**

* Check if pages contain actual text content
* Verify pages aren't JavaScript-heavy SPAs
* Look for content behind login walls
* Test URL manually in browser

### Partial Import

**Problem:** Only some URLs processed

**Solutions:**

* Check plan limits on number of URLs
* Review process logs for errors
* Verify failed URLs are accessible
* Split large sitemaps into multiple sources
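Splitting a large sitemap into several sources can also be scripted. The following sketch assumes a hypothetical `split_sitemap` helper and a placeholder limit of two URLs per file; substitute whatever limit applies to your plan.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
NS = {"sm": SITEMAP_NS}


def split_sitemap(sitemap_xml, max_urls):
    """Split one sitemap into several XML strings of at most max_urls entries each."""
    if max_urls < 1:
        raise ValueError("max_urls must be at least 1")
    root = ET.fromstring(sitemap_xml)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
    parts = []
    for start in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[start:start + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
        parts.append(ET.tostring(urlset, encoding="unicode"))
    return parts


# Hypothetical five-page sitemap used only for illustration.
big = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    + "".join(f"<url><loc>https://example.com/page{i}</loc></url>" for i in range(1, 6))
    + "</urlset>"
)
for number, part in enumerate(split_sitemap(big, max_urls=2), start=1):
    print(f"part {number}: {part.count('<loc>')} URLs")
```

Each returned string can be saved as its own file and uploaded as a separate data source.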

## Advanced Tips

### 1. Filtering with Text Editor

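Use find-and-replace in a text editor with regular-expression mode enabled to quickly filter sitemaps:

**Remove URLs containing "blog":**

```regex
Find: .*<url>.*blog.*</url>.*\n
Replace: (empty)
```

This pattern assumes each `<url>` entry sits on its own single line. It will not match entries that span several lines, and it also matches "blog" anywhere in the entry, including the domain name.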

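**Keep only "/docs/" URLs:**

* Copy the entire sitemap into a scratch file
* In the original, delete everything between the opening and closing `<urlset>` tags, keeping the XML declaration and the tags themselves
* Paste back only the `<url>` entries whose `<loc>` contains "/docs/"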

### 2. Combining Sitemaps from Different Sites

Create a consolidated sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Site 1 docs -->
  <url><loc>https://docs.site1.com/guide</loc></url>
  
  <!-- Site 2 docs -->
  <url><loc>https://docs.site2.com/guide</loc></url>
  
  <!-- Site 3 docs -->
  <url><loc>https://docs.site3.com/guide</loc></url>
</urlset>
```

### 3. Priority-Based Import

Create multiple data sources based on priority:

**High Priority (daily refresh needed):**

```xml
<url><loc>https://example.com/getting-started</loc></url>
<url><loc>https://example.com/pricing</loc></url>
```

**Low Priority (rarely changes):**

```xml
<url><loc>https://example.com/company-history</loc></url>
<url><loc>https://example.com/team</loc></url>
```

## Next Steps

After importing from sitemap:

1. [Test knowledge coverage](/getting-started/ask-a-question.md)
2. [Create AI agents](/product/overview/add-an-ai-agent-persona.md) for specific content areas
3. [Monitor usage](/product/monitoring/view-analytics.md) to see which pages are most referenced
4. Plan periodic sitemap updates

## Related Connectors

* [Website](/product/data-integrations/website.md) - Automated web crawling with refresh
* [Files](/product/data-integrations/files.md) - Upload HTML or PDF exports
* [Confluence](/product/data-integrations/confluence.md) - For wiki-based documentation
* [Google Drive](/product/data-integrations/google-drive.md) - For cloud-hosted documentation


