Missing Context in Images

The Problem

Images, diagrams, and screenshots in documents lack alt text or descriptions, making visual information inaccessible to RAG retrieval.

Symptoms

❌ "See diagram below" - AI can't see diagram
❌ Charts and graphs not described
❌ Architecture diagrams lost
❌ Screenshots provide no text
❌ Visual instructions unusable

Real-World Example

Documentation: "Follow these steps: [Screenshot of UI showing 5 buttons]"

Extracted text: "Follow these steps: [Image]"

Query: "How do I configure settings?"
→ Retrieved chunk mentions "follow steps"
→ But steps are in image (not extracted)

AI response: "The documentation mentions configuration steps but
doesn't provide details."

Visual info lost

Deep Technical Analysis

Image Extraction Challenges

Text Extraction from PDFs:

PDF contains:
→ Text (extractable)
→ Images (not automatically extracted)

Standard extraction:
→ Gets text: "See figure 3:"
→ Skips image content

Figure 3 has critical info:
→ Diagram of architecture
→ Flow chart of process
→ Lost in extraction

HTML Image Alt Text:

Good HTML:
<img src="diagram.png" alt="System architecture showing frontend, API, and database">
→ Alt text provides context

Bad HTML:
<img src="diagram.png" alt="image">
→ No useful context

Missing HTML:
<img src="diagram.png">
→ No alt text at all

Depends on source quality

OCR for Image Text

Embedded Text in Images:

Screenshot with text:
→ Button labels
→ Menu items
→ Error messages

Without OCR:
→ Text lost

With OCR (Tesseract, Cloud Vision API):
→ Extract text from image
→ Include in chunk content

Enables retrieval of visual text

OCR Limitations:

Works well for:
→ High-resolution screenshots
→ Clear typography
→ Good contrast

Fails for:
→ Handwriting
→ Low resolution
→ Complex backgrounds
→ Stylized fonts

Accuracy ~80-95% (varies)

Vision Language Models

Image Understanding:

Modern approach: Use vision-language models
→ GPT-4 Vision
→ CLIP
→ LLaVA (open source)

Process:
1. Extract image from document
2. Send to vision model
3. Prompt: "Describe this diagram in detail"
4. Model output: Text description
5. Embed description with document text

Makes visual content searchable

Cost Considerations:

GPT-4 Vision pricing:
→ $0.01-0.03 per image (depending on resolution)

Large knowledge base:
→ 10,000 images
→ Cost: $100-300

One-time cost at ingestion
Worth it for image-heavy docs

Multimodal Embeddings

CLIP Embeddings:

CLIP (OpenAI):
→ Embeds images and text in same space
→ "Cat photo" and actual cat photo = similar vectors

Use case:
→ Query: "Show me the authentication flow diagram"
→ Retrieves: Actual diagram image (embedded)
→ Can display image to user

Beyond just text retrieval

How to Solve

Extract alt text from images where available + implement OCR (Tesseract, Cloud Vision) for text in images + use vision-language models (GPT-4 Vision) to describe diagrams/charts + generate descriptive captions for images at ingestion + embed image descriptions as text chunks + consider multimodal embeddings (CLIP) for image-text search + tag chunks with "has_image" metadata for context. See Image Handling.

PreviousInconsistent Document Metadata NextDocument Version Conflicts

Last updated 18 minutes ago

hashtagThe Problem

hashtagSymptoms

hashtagReal-World Example

hashtagDeep Technical Analysis

hashtagImage Extraction Challenges

hashtagOCR for Image Text

hashtagVision Language Models

hashtagMultimodal Embeddings

hashtagHow to Solve