Missing Context in Images

The Problem

Images, diagrams, and screenshots in documents often lack alt text or descriptions, so the information they carry never reaches the text index and is invisible to RAG retrieval.

Symptoms

  • ❌ "See diagram below" - AI can't see diagram

  • ❌ Charts and graphs not described

  • ❌ Architecture diagrams lost

  • ❌ Screenshots provide no text

  • ❌ Visual instructions unusable

Real-World Example

Documentation: "Follow these steps: [Screenshot of UI showing 5 buttons]"

Extracted text: "Follow these steps: [Image]"

Query: "How do I configure settings?"
→ Retrieved chunk mentions "follow steps"
→ But steps are in image (not extracted)

AI response: "The documentation mentions configuration steps but
doesn't provide details."

Visual info lost
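One way to find affected content is to scan extracted chunks for text that points at a visual which did not survive extraction. The sketch below is a minimal heuristic, not a definitive detector: the phrase list, the placeholder pattern, and the function name `has_lost_visual` are assumptions chosen to match the examples above, and they would need tuning for a real corpus.

```python
import re

# Assumed patterns: phrases that point at a visual, and the kind of
# placeholder an extractor may leave where the image used to be.
VISUAL_REFERENCE = re.compile(
    r"\b(see|shown in|as in)\s+(the\s+)?(diagram|figure|chart|screenshot|image)\b",
    re.IGNORECASE,
)
IMAGE_PLACEHOLDER = re.compile(r"\[(image|figure|img)\]", re.IGNORECASE)


def has_lost_visual(chunk: str) -> bool:
    """Return True if the chunk refers to a visual whose content is not in the text."""
    return bool(VISUAL_REFERENCE.search(chunk) or IMAGE_PLACEHOLDER.search(chunk))


if __name__ == "__main__":
    for text in ("Follow these steps: [Image]", "Restart the service."):
        print(f"{text!r}: {has_lost_visual(text)}")
```

Chunks flagged this way are candidates for re-ingestion with OCR or a generated image description.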

Deep Technical Analysis

Image Extraction Challenges

Text Extraction from PDFs:

HTML Image Alt Text:
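
For HTML sources, alt text can be read directly from the `<img>` tags, and images without it can be flagged for OCR or captioning. The following is a minimal sketch using only Python's standard `html.parser` module; the class name `AltTextCollector` and the function `extract_alt_text` are illustrative names, not part of any existing pipeline.

```python
from html.parser import HTMLParser


class AltTextCollector(HTMLParser):
    """Collects a (src, alt) pair for every <img> tag; alt is "" when missing."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        # Attribute values can be None (for example a bare `alt`), so fall back to "".
        src = attributes.get("src") or ""
        alt = (attributes.get("alt") or "").strip()
        self.images.append((src, alt))


def extract_alt_text(html):
    """Return a list of (src, alt) tuples in document order."""
    collector = AltTextCollector()
    collector.feed(html)
    collector.close()
    return collector.images


if __name__ == "__main__":
    page = '<p>Follow these steps:</p><img src="ui.png" alt="Settings page"><img src="arch.png">'
    for src, alt in extract_alt_text(page):
        print(src, "->", alt or "(no alt text)")
```

An empty `alt` value in the result marks an image that needs another source of description.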

OCR for Image Text

Embedded Text in Images:

OCR Limitations:

Vision Language Models

Image Understanding:

Cost Considerations:

Multimodal Embeddings

CLIP Embeddings:


How to Solve

  • Extract alt text from images where available.

  • Implement OCR (Tesseract, Cloud Vision) for text embedded in images.

  • Use vision-language models (GPT-4 Vision) to describe diagrams and charts.

  • Generate descriptive captions for images at ingestion.

  • Embed image descriptions as text chunks.

  • Consider multimodal embeddings (CLIP) for image-text search.

  • Tag chunks with "has_image" metadata for context.

See Image Handling.
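
The last steps of the approach above (turning image descriptions into text chunks and tagging them) can be sketched as follows. This is an illustration under stated assumptions: the descriptions are taken as already produced by alt text, OCR, or a vision-language model, and the `Chunk` class, the `build_chunks` function, and the `kind` and `image_index` metadata keys are hypothetical. Only the `has_image` tag comes from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def build_chunks(section_text, image_descriptions):
    """Return the section's text chunk plus one text chunk per image description."""
    has_image = bool(image_descriptions)
    chunks = [Chunk(section_text, {"has_image": has_image, "kind": "text"})]
    for index, description in enumerate(image_descriptions):
        chunks.append(
            Chunk(
                text=f"Image description: {description}",
                # "kind" and "image_index" are assumed keys for illustration.
                metadata={"has_image": True, "kind": "image_description", "image_index": index},
            )
        )
    return chunks


if __name__ == "__main__":
    for chunk in build_chunks(
        "Follow these steps:",
        ["Settings page with five buttons: Save, Reset, Export, Import, Help"],
    ):
        print(chunk.metadata, "|", chunk.text)
```

Because each description is stored as ordinary text, a standard text embedding model can index it alongside the surrounding prose, and the `has_image` flag lets the retriever or the prompt signal that a visual accompanies the chunk.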
