# HTML to Text Conversion Problems

## The Problem

Converting HTML documents to plain text loses structure and formatting, lets navigation elements contaminate the content, and misses JavaScript-rendered content entirely.

### Symptoms

* ❌ Navigation menus mixed into article text
* ❌ "Click here" buttons appear as plain text
* ❌ CSS-hidden content extracted (e.g., mobile menus)
* ❌ `<div>` soup with no semantic structure
* ❌ Ads and tracking scripts in extracted text

### Real-World Example

```html
<html>
<header>
  <nav>Home | About | Products | Contact</nav>
</header>
<main>
  <article>
    <h1>Getting Started Guide</h1>
    <p>Welcome to our platform...</p>
  </article>
</main>
<footer>© 2024 Company | Privacy | Terms</footer>
</html>

Naive text extraction:
"Home About Products Contact Getting Started Guide Welcome to our platform... © 2024 Company Privacy Terms"

All elements flattened, navigation mixed with content
```

***

## Deep Technical Analysis

### Semantic HTML vs Div Soup

Modern HTML uses semantic tags:

**Semantic HTML5:**

```html
<article>: Main content
<nav>: Navigation
<header>: Page/section header
<footer>: Page/section footer
<aside>: Sidebar/tangential content
<main>: Primary content
```

**Best Case Extraction:**

```html
<main>
  <article>
    <p>Content to extract</p>
  </article>
</main>
<nav>Menu items (ignore)</nav>

Algorithm:
1. Find <article> or <main>
2. Extract only from these tags
3. Ignore <nav>, <header>, <footer>

Works well for semantic HTML
```
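The algorithm above can be sketched with Python's standard-library `html.parser`. This is a minimal illustration, not a production extractor: the class and function names are hypothetical, and it assumes reasonably well-formed markup where chrome tags are properly closed.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"main", "article"}
CHROME_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class MainContentExtractor(HTMLParser):
    """Collect text inside <main>/<article>, skipping page chrome."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # number of open content tags
        self.chrome_depth = 0   # number of open chrome tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1
        elif tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside content and outside any chrome tag.
        if text and self.content_depth and not self.chrome_depth:
            self.parts.append(text)


def extract_main_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Note that this sketch also drops a `<header>` nested inside an `<article>`, which may hold the article title on some sites; whether to keep it is a per-site decision.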

**Worst Case (Div Soup):**

```html
<div class="content">
  <div class="header">Menu</div>
  <div class="main">
    <div class="article">Actual content</div>
  </div>
  <div class="sidebar">Ads</div>
</div>

No semantic tags, only <div> with classes
→ Must infer content from class names
→ "content", "main", "article" are hints
→ But no standards, site-specific
```
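One way to act on class-name hints is to keep a per-element state that children inherit. The sketch below is a heuristic under stated assumptions: the hint sets are illustrative guesses rather than a standard, the names are hypothetical, and it assumes every non-void tag is explicitly closed.

```python
from html.parser import HTMLParser

# Illustrative hint lists; real sites need their own tuning.
KEEP_HINTS = {"content", "main", "article", "post", "entry"}
DROP_HINTS = {"header", "footer", "nav", "menu", "sidebar", "ads", "banner"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassHintExtractor(HTMLParser):
    """Keep text under content-like classes, drop text under chrome-like ones."""

    def __init__(self):
        super().__init__()
        # None = no hint seen yet, True = keep, False = drop (sticky).
        self.state_stack = [None]
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so never push them
        classes = set((dict(attrs).get("class") or "").lower().split())
        parent = self.state_stack[-1]
        if parent is False or classes & DROP_HINTS:
            state = False
        elif classes & KEEP_HINTS:
            state = True
        else:
            state = parent
        self.state_stack.append(state)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.state_stack) > 1:
            self.state_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.state_stack[-1] is True:
            self.parts.append(text)


def extract_by_class_hints(html):
    parser = ClassHintExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

Making the drop state sticky means a sidebar that happens to contain a `content` class stays excluded, which seems the safer default for noisy pages.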

### CSS Display and Visibility

HTML content may be visually hidden:

**Display: None:**

```html
<div style="display:none">
  Mobile menu (hidden on desktop)
</div>
```

**Extraction Issue:**

```
Text extraction sees:
→ "Mobile menu (hidden on desktop)"
→ Even though visually hidden

Should the extracted text:
→ Include hidden content? (it exists in DOM)
→ Exclude hidden content? (user doesn't see it)

Use case dependent:
→ Mobile menu: Exclude (duplicate of main nav)
→ Spoiler/accordion: Include (real content, just collapsed)
```

**Visibility: Hidden vs Opacity:0:**

```html
<div style="visibility:hidden">Content A</div>
<div style="opacity:0">Content B</div>
<div class="sr-only">Screen reader only text</div>

All invisible to sighted users
→ visibility:hidden: Layout space reserved
→ opacity:0: Transparent but present
→ sr-only: For accessibility

Should extract:
→ sr-only: Yes (descriptive text written for screen readers)
→ others: Debatable
```
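A static extractor can only see hiding that is declared in the markup itself. The sketch below, with hypothetical names, skips elements hidden through an inline `display:none` or `visibility:hidden` style or the `hidden` attribute, and keeps `sr-only` and `opacity:0` text. It cannot detect hiding applied by external stylesheets or scripts, and it assumes non-void tags are explicitly closed.

```python
import re
from html.parser import HTMLParser

HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


def is_hidden(attrs):
    """True if the element is hidden via inline style or the hidden attribute."""
    attr_map = dict(attrs)
    if "hidden" in attr_map:
        return True
    return bool(HIDDEN_STYLE.search(attr_map.get("style") or ""))


class VisibleTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hidden_stack = [False]  # children inherit the hidden state
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.hidden_stack.append(self.hidden_stack[-1] or is_hidden(attrs))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.hidden_stack) > 1:
            self.hidden_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and not self.hidden_stack[-1]:
            self.parts.append(text)


def extract_visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

For collapsed accordions or spoilers, where hidden text is real content, the `is_hidden` check would need an allowlist; that policy is use-case dependent.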

### Navigation and UI Elements

Page chrome contamination:

**Navigation Extraction:**

```html
<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/about">About</a></li>
    <li><a href="/products">Products</a></li>
  </ul>
</nav>

Text extraction: "Home About Products"

Appears on every page:
→ 50 pages on site
→ All have "Home About Products" in extracted text
→ Repetitive noise
→ Dilutes unique content signal
```
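When navigation cannot be identified from tags alone, repetition across pages is itself a signal. The following sketch drops any text line that appears on more than a chosen share of pages; the function name and the `max_share` threshold are assumptions for illustration, and it expects each page to be already split into text lines.

```python
from collections import Counter


def strip_repeated_lines(pages, max_share=0.5):
    """Remove lines that occur on more than max_share of the pages.

    pages: list of pages, each a list of extracted text lines.
    """
    page_count = len(pages)
    seen_on = Counter()
    for lines in pages:
        seen_on.update(set(lines))  # count each line once per page
    repeated = {
        line for line, count in seen_on.items()
        if count / page_count > max_share
    }
    return [[line for line in lines if line not in repeated] for lines in pages]
```

This needs a reasonable sample of pages from the same site to work; with only one or two pages, legitimate content can be mistaken for boilerplate.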

**Button and Link Text:**

```html
<button>Click here</button>
<a href="/signup">Learn more</a>
<a href="/docs">Read documentation →</a>

Extracted: "Click here Learn more Read documentation →"

Out of context:
→ "Click here" meaningless (click where?)
→ "Learn more" vague (learn about what?)
→ Arrow "→" is decorative (visual only)

Better to:
→ Extract link destination as context
→ "Sign up (Learn more)"
→ "Documentation (Read documentation)"
```
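A simple way to give vague link text some context is to attach the link destination. The sketch below appends the `href` to anchors whose text is on a small list of vague phrases and trims decorative arrows. The phrase list, the arrow characters, and the output format are illustrative assumptions, and the names are hypothetical.

```python
from html.parser import HTMLParser

VAGUE_LINK_TEXT = {"click here", "learn more", "read more", "here", "more"}
DECORATIVE_CHARS = "→»›▶ "


class LinkContextExtractor(HTMLParser):
    """Collect link texts, annotating vague ones with their destination."""

    def __init__(self):
        super().__init__()
        self.href = None   # href of the anchor being read, if any
        self.buffer = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.buffer = []

    def handle_data(self, data):
        if self.href is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            text = "".join(self.buffer).strip().strip(DECORATIVE_CHARS)
            if text.lower() in VAGUE_LINK_TEXT:
                text = f"{text} ({self.href})"
            self.links.append(text)
            self.href = None


def extract_links(html):
    parser = LinkContextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Buttons such as `<button>Click here</button>` carry no destination, so this sketch ignores them rather than guessing at their meaning.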

### Forms and Input Fields

Form elements have special extraction needs:

**Form HTML:**

```html
<form>
  <label for="email">Email:</label>
  <input type="email" id="email" placeholder="you@example.com">
  <button>Submit</button>
</form>
```

**Extraction Variants:**

```
Option 1: Extract labels only
"Email: Submit"

Option 2: Extract labels + placeholders
"Email: you@example.com Submit"
→ Placeholder looks like content (wrong)

Option 3: Skip forms entirely
(no text extracted)

Best practice:
→ Extract labels (field names)
→ Skip inputs, buttons, placeholders
→ "Email:" is content
→ "you@example.com" is just hint
```
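The label-only practice described above can be sketched as a parser that records text only while inside a `<label>`. Names here are hypothetical, and the sketch assumes labels are not nested.

```python
from html.parser import HTMLParser


class FormLabelExtractor(HTMLParser):
    """Keep <label> text; ignore inputs, placeholders, and buttons."""

    def __init__(self):
        super().__init__()
        self.in_label = False
        self.buffer = []
        self.labels = []

    def handle_starttag(self, tag, attrs):
        if tag == "label":
            self.in_label = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_label:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "label" and self.in_label:
            # Collapse internal whitespace before storing the label.
            text = " ".join("".join(self.buffer).split())
            if text:
                self.labels.append(text)
            self.in_label = False


def extract_form_labels(html):
    parser = FormLabelExtractor()
    parser.feed(html)
    parser.close()
    return parser.labels
```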

### Script Tags and Style Blocks

Non-content elements:

**JavaScript Inline:**

```html
<script>
function trackEvent() {
  analytics.send('pageview');
}
</script>
```

**Text Extraction:**

```
Naive extraction includes:
"function trackEvent() { analytics.send('pageview'); }"

This is code, not content!
→ Should be excluded
→ But: How to distinguish from <code> blocks (which should be included)?

Solution:
→ Strip <script> tags and their contents (read JSON-LD metadata first if you need it)
→ Keep <code> and <pre> tags
```
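A low-level parser such as Python's `html.parser` reports script and style bodies as ordinary text, so they have to be skipped explicitly. The sketch below does that while preserving whitespace inside `<pre>`; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}


class ContentTextExtractor(HTMLParser):
    """Drop <script>/<style> bodies, keep <code> and <pre> text."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.pre_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "pre" and self.pre_depth:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.pre_depth:
            self.parts.append(data)  # keep code whitespace intact
        elif data.strip():
            self.parts.append(" ".join(data.split()))


def extract_text(html):
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)
```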

**CSS Inline:**

```html
<style>
.header { color: blue; font-size: 24px; }
</style>

Also not content
→ Should exclude
→ Many extraction libraries skip it automatically, but low-level parsers may return it as text, so verify
```

### Generated Content (CSS ::before/::after)

CSS can inject text:

**Pseudo-Elements:**

```css
.warning::before {
  content: "⚠️ Warning: ";
}
```

```html
<div class="warning">System maintenance tonight</div>
```

**Visual Rendering:**

```
⚠️ Warning: System maintenance tonight
```

**Text Extraction:**

```
HTML DOM only contains:
"System maintenance tonight"

Missing: "⚠️ Warning: "
→ Generated by CSS, not in DOM
→ Text extraction sees incomplete sentence

Headless browser rendering needed:
→ Render page with CSS
→ Extract computed text (including ::before/::after)
→ More accurate but much slower
```

### Table Extraction from HTML

HTML tables need structure preservation:

**Table HTML:**

```html
<table>
  <thead>
    <tr><th>Product</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget</td><td>$10</td></tr>
  </tbody>
</table>
```

**Extraction Formats:**

```
Option 1: Markdown table
| Product | Price |
|---------|-------|
| Widget  | $10   |

Option 2: CSV-style
Product, Price
Widget, $10

Option 3: Linearized prose
"Product: Widget, Price: $10"

Option 4: Just text (structure lost)
"Product Price Widget $10"

Best: Markdown (preserves structure, readable)
```
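The Markdown option can be sketched with the standard library alone. This illustration uses hypothetical names and makes simplifying assumptions: the first row is treated as the header, and `colspan`, `rowspan`, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None  # buffer for the cell being read, else None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None:
            text = " ".join("".join(self.cell).split())
            # Escape pipes so cell text cannot break the table layout.
            self.rows[-1].append(text.replace("|", "\\|"))
            self.cell = None


def table_to_markdown(html):
    parser = TableParser()
    parser.feed(html)
    parser.close()
    rows = [row for row in parser.rows if row]
    if not rows:
        return ""
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)
```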

### Image Alt Text and Captions

Images carry semantic information:

**Alt Text:**

```html
<img src="diagram.png" alt="System architecture diagram showing 3-tier design">
```

**Extraction Importance:**

```
Without alt text:
→ Image invisible to text extraction
→ "See diagram below" references nothing
→ Incomplete information

With alt text:
→ "System architecture diagram showing 3-tier design"
→ LLM has description of visual
→ Can partially answer questions about diagram

Alt text is critical content
→ Must include in extraction
```

**Figure Captions:**

```html
<figure>
  <img src="chart.png" alt="Performance chart">
  <figcaption>Figure 1: Query performance over time</figcaption>
</figure>

Should extract:
"Figure 1: Query performance over time. [Image: Performance chart]"

Both caption and alt text provide context
```
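Alt text lives in an attribute, so a text-node walker never sees it unless the extractor emits it deliberately. The sketch below inserts an `[Image: ...]` marker for each image with non-empty alt text; the names and the marker format are assumptions. Output follows document order, so the marker appears before the caption here.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    """Emit text nodes plus an [Image: ...] marker for each described image."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = (dict(attrs).get("alt") or "").strip()
            if alt:  # an empty alt marks a decorative image; skip it
                self.parts.append(f"[Image: {alt}]")

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.parts.append(text)


def extract_with_images(html):
    parser = ImageTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```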

### Microdata and Structured Data

Schema.org and other structured markup:

**JSON-LD:**

```html
<script type="application/ld+json">
{
  "@type": "Article",
  "headline": "Getting Started Guide",
  "author": "John Smith",
  "datePublished": "2024-01-15"
}
</script>
```

**Extraction Opportunity:**

```
Structured data provides:
→ Article title
→ Author
→ Date
→ Other metadata

Can augment extracted text:
"Getting Started Guide (by John Smith, published 2024-01-15)"

Adds context beyond visible text
```
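Reading JSON-LD is mostly a matter of finding the right `<script>` blocks and parsing their bodies. The sketch below uses hypothetical function names and assumes the simple flat shape shown above; real pages often nest `author` as an object or wrap items in a list or `@graph`, which is only partly handled here.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Parse the body of each <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.items.append(json.loads("".join(self.buffer)))
            except json.JSONDecodeError:
                pass  # malformed metadata is ignored, not fatal


def extract_jsonld(html):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


def describe_article(item):
    """Format headline, author, and date as one context line."""
    author = item.get("author")
    if isinstance(author, dict):  # author may be a nested Person object
        author = author.get("name")
    return f'{item.get("headline")} (by {author}, published {item.get("datePublished")})'
```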

### Single Page Applications (SPAs)

JavaScript-rendered content:

**Initial HTML (before JS):**

```html
<div id="root"></div>
<script src="app.js"></script>
```

**After JavaScript Executes:**

```html
<div id="root">
  <h1>Welcome</h1>
  <p>Actual content rendered by React...</p>
</div>
```

**The Empty Shell Problem:**

```
HTTP request fetches:
→ Empty <div id="root"></div>
→ No content visible

Text extraction:
→ Nothing to extract!

Solution required:
→ Headless browser (Puppeteer, Playwright)
→ Execute JavaScript
→ Wait for content to load
→ Then extract

Often 10-100x slower than static HTML extraction (varies with page weight)
→ But necessary for SPAs
```
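Because headless rendering is expensive, it can help to detect the empty-shell case first and render only those pages. The heuristic below is a sketch with hypothetical names: it flags a page when the static HTML contains scripts but very little visible text. The 200-character threshold is an arbitrary assumption that should be tuned per corpus.

```python
from html.parser import HTMLParser

NON_TEXT_TAGS = {"script", "style", "noscript"}


class ShellProbe(HTMLParser):
    """Count scripts and visible text characters in static HTML."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.script_count = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in NON_TEXT_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_TEXT_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text_chars += len(data.strip())


def needs_js_rendering(html, min_text_chars=200):
    """Guess whether a page is a JavaScript shell with no static content."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.text_chars < min_text_chars
```

Pages flagged this way could then be passed to a headless browser such as Puppeteer or Playwright, while the rest stay on the faster static path.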

***

## How to Solve

To reduce these problems:

* Use semantic HTML tags (`<article>`, `<main>`) to identify content areas
* Strip navigation, headers, and footers
* Exclude `display:none` elements
* Extract alt text from images
* Use a headless browser for JavaScript-rendered content
* Convert tables to Markdown format

See [HTML Extraction](/rag-scenarios-and-solutions/chunking/html-conversion.md).


