, or role="main" → Ignore

→ Score elements by text density
→ Extract only high-density content areas

But:
→ Not all sites use semantic HTML
→ Heuristics fail on unusual layouts
→ May miss legitimate content in sidebars
```

**The Code Block Problem:**

```html


npm install @company/sdk
npm start

```

**Extraction challenge:**

```
Should crawler:
→ Include code blocks as-is?
→ Preserve formatting (indentation, newlines)?
→ Add metadata like "language: bash"?
→ Treat code differently than prose in embeddings?

Code in RAG:
→ User asks: "How do I install the SDK?"
→ Retrieved chunk: "npm install @company/sdk"
→ LLM needs to recognize this is a command, not prose
→ Formatting matters for code comprehension
```

### Sitemap vs Crawl Strategy

Sites may provide sitemaps for easier indexing:

**Sitemap.xml:**

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://docs.company.com/getting-started</loc>
    <lastmod>2024-01-15</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://docs.company.com/api-reference</loc>
    <lastmod>2024-01-20</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>
```

**Benefits:**

```
Sitemap-based crawl:
→ Explicit list of all pages
→ Includes lastmod → skip unchanged pages
→ Faster than link crawling
→ No infinite loop risk
```

**Limitations:**

```
Not all sites have sitemaps

Sitemaps may be:
→ Outdated (missing new pages)
→ Incomplete (manually maintained, forgotten pages)
→ Too large (split across multiple files)
→ Missing authenticated pages
```

**Hybrid Strategy:**

```
1. Check for sitemap.xml
2. If found: Use as primary source
3. Also crawl from homepage
4. Compare: sitemap URLs vs discovered URLs
5. Union of both sets = complete coverage

But:
→ More complexity
→ Longer crawl time
→ Higher risk of duplicates
```

### Rate Limiting and Politeness

Aggressive crawling can overload servers:

**The Server Load Problem:**

```
Naive crawler:
→ 10 concurrent requests
→ 500 pages total
→ Completes in 30 seconds

Server perspective:
→ 10 simultaneous connections
→ High CPU/memory usage
→ Looks like a DDoS attack
→ May trigger rate limiting or IP ban
```

**Polite Crawling:**

```
Best practices:
→ 1 request at a time (or max 2-3)
→ 1-2 second delay between requests
→ Respect Crawl-delay in robots.txt
→ Use consistent User-Agent
→ Handle 429 (rate limit) with exponential backoff

But:
→ 500 pages × 2 seconds ≈ 17 minutes
→ User sees "crawl taking forever"
→ Impatient user cancels
→ Incomplete knowledge base
```

***

## How to Solve

**Use a headless browser for JavaScript sites + respect robots.txt + implement URL deduplication + extract only main content area + rate limit requests.**

See [Website Data Sources](/product/data-integrations/website.md) for configuration.

---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/rag-scenarios-and-solutions/data-integration/website-scraping.md?ask=
```

The question should be specific, self-contained, and written in natural language. The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
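
The polite crawling practices listed under "Rate Limiting and Politeness" (one request at a time, a fixed delay between requests, exponential backoff on 429) can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: `polite_crawl`, `backoff_delay`, and the injected `fetch` callable are hypothetical names, and the default delay and retry values are assumptions chosen to match the ranges mentioned in that section.

```python
from __future__ import annotations

import time
from typing import Callable, Iterable


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (0-based), doubling each time up to `cap`."""
    return min(cap, base * (2 ** attempt))


def polite_crawl(
    urls: Iterable[str],
    fetch: Callable[[str], tuple[int, str]],  # hypothetical: returns (status_code, body)
    delay: float = 1.0,                       # assumed pause between pages, in seconds
    max_retries: int = 3,                     # assumed retry budget per page
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, str]:
    """Fetch URLs sequentially, pausing between pages and backing off on HTTP 429."""
    pages: dict[str, str] = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay)  # fixed politeness delay between consecutive pages
        for attempt in range(max_retries + 1):
            status, body = fetch(url)
            if status == 429 and attempt < max_retries:
                # Server asked us to slow down: wait longer on each retry.
                sleep(backoff_delay(attempt))
                continue
            if status == 200:
                pages[url] = body
            break  # success, non-retryable status, or retries exhausted
    return pages
```

Passing `fetch` and `sleep` in as parameters keeps the sketch independent of any particular HTTP client and lets the pacing logic be exercised without network access. A real crawler would also want to read `Crawl-delay` from robots.txt and send a consistent User-Agent, which this sketch leaves out.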
