HTML to Text Conversion Problems

The Problem

Converting HTML documents to plain text loses structure and formatting, lets navigation elements contaminate the content, and misses JavaScript-rendered content entirely.

Symptoms

❌ Navigation menus mixed into article text
❌ "Click here" buttons appear as plain text
❌ CSS-hidden content extracted (e.g., mobile menus)
❌ <div> soup with no semantic structure
❌ Ads and tracking scripts in extracted text

Real-World Example

<html>
<header>
  <nav>Home | About | Products | Contact</nav>
</header>
<main>
  <article>
    <h1>Getting Started Guide</h1>
    <p>Welcome to our platform...</p>
  </article>
</main>
<footer>© 2024 Company | Privacy | Terms</footer>
</html>

Naive text extraction:
"Home About Products Contact Getting Started Guide Welcome to our platform... © 2024 Company Privacy Terms"

All elements flattened, navigation mixed with content

Deep Technical Analysis

Semantic HTML vs Div Soup

Modern HTML uses semantic tags:

Semantic HTML5:

<article>: Main content
<nav>: Navigation
<header>: Page/section header
<footer>: Page/section footer
<aside>: Sidebar/tangential content
<main>: Primary content

Best Case Extraction:

<main>
  <article>
    <p>Content to extract</p>
  </article>
</main>
<nav>Menu items (ignore)</nav>

Algorithm:
1. Find <article> or <main>
2. Extract only from these tags
3. Ignore <nav>, <header>, <footer>

Works well for semantic HTML
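
The algorithm above can be sketched with Python's standard-library html.parser. This is a minimal illustration, not a production extractor: the names SemanticTextExtractor and extract_main_text are hypothetical, <aside> is added to the ignore list as an assumption, and well-formed, properly nested markup is assumed.

```python
from html.parser import HTMLParser

CONTENT_TAGS = {"main", "article"}
CHROME_TAGS = {"nav", "header", "footer", "aside"}  # <aside> is an assumption


class SemanticTextExtractor(HTMLParser):
    """Collect text inside <main>/<article>, skipping page chrome."""

    def __init__(self):
        super().__init__()
        self.content_depth = 0  # number of open content tags
        self.chrome_depth = 0   # number of open chrome tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in CONTENT_TAGS:
            self.content_depth += 1
        elif tag in CHROME_TAGS:
            self.chrome_depth += 1

    def handle_endtag(self, tag):
        if tag in CONTENT_TAGS and self.content_depth:
            self.content_depth -= 1
        elif tag in CHROME_TAGS and self.chrome_depth:
            self.chrome_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only when inside content and not inside chrome.
        if text and self.content_depth and not self.chrome_depth:
            self.chunks.append(text)


def extract_main_text(html: str) -> str:
    parser = SemanticTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Run against the page from the Real-World Example, this keeps the heading and paragraph and drops the navigation and footer text.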

Worst Case (Div Soup):

<div class="content">
  <div class="header">Menu</div>
  <div class="main">
    <div class="article">Actual content</div>
  </div>
  <div class="sidebar">Ads</div>
</div>

No semantic tags, only <div> with classes
→ Must infer content from class names
→ "content", "main", "article" are hints
→ But no standards, site-specific
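
One way to act on class-name hints is to let the innermost hinted element decide whether text is kept. The sketch below is only a heuristic: the hint word lists are assumptions that would need tuning per site, ClassHintExtractor and extract_by_class_hints are hypothetical names, and properly closed tags are assumed.

```python
import re
from html.parser import HTMLParser

# Assumed hint vocabularies; real sites use many other names.
CONTENT_HINT = re.compile(r"\b(content|main|article|post|entry)\b", re.I)
CHROME_HINT = re.compile(r"\b(header|footer|nav|menu|sidebar|ads?)\b", re.I)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassHintExtractor(HTMLParser):
    """Label each element from its class attribute; unlabeled ones inherit."""

    def __init__(self):
        super().__init__()
        self.labels = ["unknown"]  # label stack, innermost element last
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements never get an end tag, so do not push
        classes = dict(attrs).get("class") or ""
        if CHROME_HINT.search(classes):
            label = "chrome"
        elif CONTENT_HINT.search(classes):
            label = "content"
        else:
            label = self.labels[-1]  # inherit from parent
        self.labels.append(label)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.labels) > 1:
            self.labels.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self.labels[-1] == "content":
            self.chunks.append(text)


def extract_by_class_hints(html: str) -> str:
    parser = ClassHintExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

On the div soup example above, only the text under class="article" survives; "Menu" and "Ads" are dropped because their own classes match the chrome hints.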

CSS Display and Visibility

HTML content may be visually hidden:

Display: None:

<div style="display:none">
  Mobile menu (hidden on desktop)
</div>

Extraction Issue:

Text extraction sees:
→ "Mobile menu (hidden on desktop)"
→ Even though visually hidden

Should the extracted text:
→ Include hidden content? (it exists in DOM)
→ Exclude hidden content? (user doesn't see it)

Use case dependent:
→ Mobile menu: Exclude (duplicate of main nav)
→ Spoiler/accordion: Include (real content, just collapsed)

Visibility: Hidden vs Opacity:0:

<div style="visibility:hidden">Content A</div>
<div style="opacity:0">Content B</div>
<div class="sr-only">Screen reader only text</div>

All invisible to sighted users
→ visibility:hidden: Layout space reserved
→ opacity:0: Transparent but present
→ sr-only: For accessibility

Should extract:
→ sr-only: Yes (descriptive text written for assistive technology)
→ others: Debatable
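
Because the right answer is use-case dependent, a hidden-content filter is best exposed as a switch. The sketch below only inspects inline style attributes; rules that live in stylesheets (including the usual sr-only class) cannot be resolved without rendering. Treating opacity:0 as visible is an assumption, and VisibleTextExtractor and visible_text are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Inline styles treated as "hidden". opacity:0 is deliberately left out
# because the text above calls that case debatable.
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    def __init__(self, keep_hidden=False):
        super().__init__()
        self.keep_hidden = keep_hidden
        self.hidden = [False]  # one flag per open element; children inherit
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        self.hidden.append(self.hidden[-1] or bool(HIDDEN_STYLE.search(style)))

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self.hidden) > 1:
            self.hidden.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and (self.keep_hidden or not self.hidden[-1]):
            self.chunks.append(text)


def visible_text(html: str, keep_hidden: bool = False) -> str:
    parser = VisibleTextExtractor(keep_hidden)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Passing keep_hidden=True covers the spoiler/accordion case, where collapsed text is real content.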

Navigation and UI Elements

Page chrome contaminates extracted content:

Navigation Extraction:

<nav>
  <ul>
    <li><a href="/">Home</a></li>
    <li><a href="/about">About</a></li>
    <li><a href="/products">Products</a></li>
  </ul>
</nav>

Text extraction: "Home About Products"

Appears in every page:
→ 50 pages on site
→ All have "Home About Products" in extracted text
→ Repetitive noise
→ Dilutes unique content signal
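
When navigation cannot be removed by tag, the repetition itself can be used as a signal: a block that appears on most pages is probably chrome. A minimal sketch, assuming each page has already been split into text blocks; the function name strip_repeated_blocks and the 50% threshold are illustrative assumptions.

```python
from collections import Counter


def strip_repeated_blocks(
    pages: dict[str, list[str]], max_share: float = 0.5
) -> dict[str, list[str]]:
    """Drop text blocks that occur on more than max_share of the pages."""
    doc_freq = Counter()
    for blocks in pages.values():
        doc_freq.update(set(blocks))  # count each block once per page

    limit = max_share * len(pages)
    return {
        url: [block for block in blocks if doc_freq[block] <= limit]
        for url, blocks in pages.items()
    }
```

With 50 pages that all contain "Home About Products", that block has a document frequency of 50 and is removed everywhere, while text unique to a page is kept.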

Button and Link Text:

<button>Click here</button>
<a href="/signup">Learn more</a>
<a href="/docs">Read documentation →</a>

Extracted: "Click here Learn more Read documentation →"

Out of context:
→ "Click here" meaningless (click where?)
→ "Learn more" vague (learn about what?)
→ Arrow "→" is decorative (visual only)

Better to:
→ Extract link destination as context
→ "Sign up (Learn more)"
→ "Documentation (Read documentation)"
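
A simple way to keep link context is to emit the destination next to the link text and drop decorative arrows. The sketch below appends the raw href rather than a human-readable title such as "Sign up", since that title is not present in the markup. LinkContextExtractor, extract_links and the DECORATIVE character set are hypothetical.

```python
from html.parser import HTMLParser

DECORATIVE = "→←»«›‹"  # assumed set of purely visual characters


class LinkContextExtractor(HTMLParser):
    """Collect '<link text> (<href>)' for every <a> that has an href."""

    def __init__(self):
        super().__init__()
        self.href = None  # href of the <a> currently open, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            label = "".join(self.text).strip(" \t\n" + DECORATIVE)
            self.links.append(f"{label} ({self.href})")
            self.href = None


def extract_links(html: str) -> list[str]:
    parser = LinkContextExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Button text is ignored here because a <button> carries no destination to use as context.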

Forms and Input Fields

Form elements have special extraction needs:

Form HTML:

<form>
  <label for="email">Email:</label>
  <input type="email" id="email" placeholder="[email protected]">
  <button>Submit</button>
</form>

Extraction Variants:

Option 1: Extract labels only
"Email: Submit"

Option 2: Extract labels + placeholders
"Email: [email protected] Submit"
→ Placeholder looks like content (wrong)

Option 3: Skip forms entirely
(no text extracted)

Best practice:
→ Extract labels (field names)
→ Skip inputs, buttons, placeholders
→ "Email:" is content
→ "[email protected]" is just hint
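
The best-practice rule above can be sketched as a filter that, inside a <form>, keeps only <label> text. Placeholders are attributes, so they never reach the text stream; button text is dropped because it is not inside a label. FormAwareExtractor and extract_text_with_labels are hypothetical names.

```python
from html.parser import HTMLParser


class FormAwareExtractor(HTMLParser):
    """Keep all text outside forms, but only <label> text inside them."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0
        self.label_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_depth += 1
        elif tag == "label":
            self.label_depth += 1

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth:
            self.form_depth -= 1
        elif tag == "label" and self.label_depth:
            self.label_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.form_depth == 0 or self.label_depth > 0:
            self.chunks.append(text)


def extract_text_with_labels(html: str) -> str:
    parser = FormAwareExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```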

Script Tags and Style Blocks

Non-content elements:

JavaScript Inline:

<script>
function trackEvent() {
  analytics.send('pageview');
}
</script>

Text Extraction:

Naive extraction includes:
"function trackEvent() { analytics.send('pageview'); }"

This is code, not content!
→ Should be excluded
→ But: How to distinguish from <code> blocks (which should be included)?

Solution:
→ Strip <script> tags entirely
→ Keep <code> and <pre> tags

CSS Inline:

<style>
.header { color: blue; font-size: 24px; }
</style>

Also not content
→ Should exclude
→ Many extraction libraries drop it automatically, but low-level parsers may still return it as text
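
Python's standard-library html.parser is one example of a parser that hands <script> and <style> bodies to the text callback, so they have to be skipped explicitly. The sketch below does that and marks <code>/<pre> text with backticks so it stays distinguishable; the backtick convention and the names ContentTextExtractor and extract_content are assumptions for illustration.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style"}  # never content
CODE_TAGS = {"code", "pre"}      # content, but worth marking as code


class ContentTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.code_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in CODE_TAGS:
            self.code_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in CODE_TAGS and self.code_depth:
            self.code_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return  # inside <script> or <style>
        text = data.strip()
        if text:
            self.chunks.append(f"`{text}`" if self.code_depth else text)


def extract_content(html: str) -> str:
    parser = ContentTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```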

Generated Content (CSS ::before/::after)

CSS can inject text:

Pseudo-Elements:

.warning::before {
  content: "⚠️ Warning: ";
}

<div class="warning">System maintenance tonight</div>

Visual Rendering:

⚠️ Warning: System maintenance tonight

Text Extraction:

HTML DOM only contains:
"System maintenance tonight"

Missing: "⚠️ Warning: "
→ Generated by CSS, not in DOM
→ Text extraction sees incomplete sentence

Headless browser rendering needed:
→ Render page with CSS
→ Extract computed text (including ::before/::after)
→ More accurate but much slower

Table Extraction from HTML

HTML tables need structure preservation:

Table HTML:

<table>
  <thead>
    <tr><th>Product</th><th>Price</th></tr>
  </thead>
  <tbody>
    <tr><td>Widget</td><td>$10</td></tr>
  </tbody>
</table>

Extraction Formats:

Option 1: Markdown table
| Product | Price |
|---------|-------|
| Widget  | $10   |

Option 2: CSV-style
Product, Price
Widget, $10

Option 3: Linearized prose
"Product: Widget, Price: $10"

Option 4: Just text (structure lost)
"Product Price Widget $10"

Best: Markdown (preserves structure, readable)
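
A minimal sketch of the Markdown option, using only the standard library. It assumes a single, simple table whose first row is the header, with no colspan, rowspan, or nested tables; TableParser and table_to_markdown are hypothetical names.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect table rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None   # cells of the <tr> currently open
        self.cell = None  # text pieces of the <th>/<td> currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("th", "td"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None and self.row is not None:
            # Collapse internal whitespace so each cell stays on one line.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None


def _format_row(cells):
    # Escape pipes so cell text cannot break the Markdown table.
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def table_to_markdown(html: str) -> str:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return ""
    header, *body = parser.rows
    lines = [_format_row(header), _format_row(["---"] * len(header))]
    lines += [_format_row(row) for row in body]
    return "\n".join(lines)
```

The output is not column-aligned like the hand-written table above, but it is equivalent Markdown.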

Image Alt Text and Captions

Images carry semantic information:

Alt Text:

<img src="diagram.png" alt="System architecture diagram showing 3-tier design">

Extraction Importance:

Without alt text:
→ Image invisible to text extraction
→ "See diagram below" references nothing
→ Incomplete information

With alt text:
→ "System architecture diagram showing 3-tier design"
→ LLM has description of visual
→ Can partially answer questions about diagram

Alt text is critical content
→ Must include in extraction

Figure Captions:

<figure>
  <img src="chart.png" alt="Performance chart">
  <figcaption>Figure 1: Query performance over time</figcaption>
</figure>

Should extract:
"Figure 1: Query performance over time. [Image: Performance chart]"

Both caption and alt text provide context
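
The caption-plus-alt format shown above can be produced as follows. This is a sketch under simple assumptions: figures are not nested, and images outside a <figure> are emitted inline as "[Image: ...]". ImageTextExtractor and extract_with_images are hypothetical names.

```python
from html.parser import HTMLParser


class ImageTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.figure = None       # dict while inside a <figure>
        self.in_caption = False

    def handle_starttag(self, tag, attrs):
        if tag == "figure":
            self.figure = {"caption": [], "images": []}
        elif tag == "figcaption":
            self.in_caption = True
        elif tag == "img":
            alt = (dict(attrs).get("alt") or "").strip()
            if alt:
                marker = f"[Image: {alt}]"
                if self.figure is not None:
                    self.figure["images"].append(marker)
                else:
                    self.chunks.append(marker)

    def handle_endtag(self, tag):
        if tag == "figcaption":
            self.in_caption = False
        elif tag == "figure" and self.figure is not None:
            caption = " ".join(self.figure["caption"]).rstrip(".")
            parts = ([caption + "."] if caption else []) + self.figure["images"]
            if parts:
                self.chunks.append(" ".join(parts))  # caption first, then images
            self.figure = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_caption and self.figure is not None:
            self.figure["caption"].append(text)
        else:
            self.chunks.append(text)


def extract_with_images(html: str) -> str:
    parser = ImageTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```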

Microdata and Structured Data

Schema.org and other structured markup:

JSON-LD:

<script type="application/ld+json">
{
  "@type": "Article",
  "headline": "Getting Started Guide",
  "author": "John Smith",
  "datePublished": "2024-01-15"
}
</script>

Extraction Opportunity:

Structured data provides:
→ Article title
→ Author
→ Date
→ Other metadata

Can augment extracted text:
"Getting Started Guide (by John Smith, published 2024-01-15)"

Adds context beyond visible text
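
A sketch of reading JSON-LD with the standard library and building the summary line shown above. It handles only the flat Article shape from the example, plus an author given as an object with a "name" field; JsonLdExtractor and article_summary are hypothetical names.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Parse every <script type="application/ld+json"> block."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            try:
                self.items.append(json.loads("".join(self.buffer)))
            except json.JSONDecodeError:
                pass  # malformed block: skip it rather than fail the page


def article_summary(html: str):
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    for item in parser.items:
        if isinstance(item, dict) and item.get("@type") == "Article":
            author = item.get("author")
            if isinstance(author, dict):  # author may be an object
                author = author.get("name")
            return (f'{item.get("headline")} '
                    f'(by {author}, published {item.get("datePublished")})')
    return None  # no Article metadata found
```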

Single Page Applications (SPAs)

JavaScript-rendered content:

Initial HTML (before JS):

<div id="root"></div>
<script src="app.js"></script>

After JavaScript Executes:

<div id="root">
  <h1>Welcome</h1>
  <p>Actual content rendered by React...</p>
</div>

The Empty Shell Problem:

HTTP request fetches:
→ Empty <div id="root"></div>
→ No content visible

Text extraction:
→ Nothing to extract!

Solution required:
→ Headless browser (Puppeteer, Playwright)
→ Execute JavaScript
→ Wait for content to load
→ Then extract

10-100x slower than static HTML extraction
→ But necessary for SPAs
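
Given the cost of rendering, it can help to check whether the static HTML is an empty shell before launching a headless browser. The sketch below is a rough heuristic, not a reliable detector: it flags pages that load scripts but contain almost no text. The 50-character threshold and the names ShellProbe and looks_like_empty_shell are assumptions, and the rendering step itself is not shown.

```python
from html.parser import HTMLParser


class ShellProbe(HTMLParser):
    """Count <script> tags and text characters outside script/style."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.script_count = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_count += 1
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.text_chars += len(data.strip())


def looks_like_empty_shell(html: str, min_text_chars: int = 50) -> bool:
    """True when the page has scripts but (almost) no static text."""
    probe = ShellProbe()
    probe.feed(html)
    probe.close()
    return probe.script_count > 0 and probe.text_chars < min_text_chars
```

Pages that pass this check can go through fast static extraction; only the flagged ones need the slower browser path.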

How to Solve

Combine the techniques above:

→ Use semantic HTML tags (<article>, <main>) to identify content areas
→ Strip navigation, headers, and footers
→ Exclude display:none elements
→ Extract alt text from images
→ Use a headless browser for JavaScript-rendered content
→ Convert tables to Markdown format

See HTML Extraction.
