# Missing Context in Images

## The Problem

Images, diagrams, and screenshots in documents often lack alt text or descriptions, making the visual information they carry inaccessible to RAG retrieval.

### Symptoms

* ❌ "See diagram below" - AI can't see diagram
* ❌ Charts and graphs not described
* ❌ Architecture diagrams lost
* ❌ Screenshots provide no text
* ❌ Visual instructions unusable

### Real-World Example

```
Documentation: "Follow these steps: [Screenshot of UI showing 5 buttons]"

Extracted text: "Follow these steps: [Image]"

Query: "How do I configure settings?"
→ Retrieved chunk mentions "follow steps"
→ But steps are in image (not extracted)

AI response: "The documentation mentions configuration steps but
doesn't provide details."

Visual info lost
```

***

## Deep Technical Analysis

### Image Extraction Challenges

**Text Extraction from PDFs:**

```
PDF contains:
→ Text (extractable)
→ Images (not automatically extracted)

Standard extraction:
→ Gets text: "See figure 3:"
→ Skips image content

Figure 3 has critical info:
→ Diagram of architecture
→ Flow chart of process
→ Lost in extraction
```
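One way to catch this loss at ingestion is to scan the extracted text for figure references that never get a description. A minimal, library-free sketch; the `Figure N (...)` description convention is an assumption for illustration, not a PDF standard:

```python
import re

def find_lost_figures(extracted_text: str) -> list[str]:
    """Flag figure references whose visual content was not extracted.

    A reference like "See figure 3:" with no accompanying figure text
    suggests the image body was dropped during PDF extraction.
    """
    referenced = set(re.findall(r"figure\s+(\d+)", extracted_text, re.IGNORECASE))
    # Assumed convention: described figures carry "Figure N (...)" inline
    described = set(re.findall(r"figure\s+(\d+)\s*\(", extracted_text, re.IGNORECASE))
    return sorted(referenced - described)

text = "Architecture overview. See figure 3: Figure 1 (diagram): frontend to API."
print(find_lost_figures(text))  # → ['3']
```

Flagged figures are candidates for OCR or vision-model captioning at ingestion time.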

**HTML Image Alt Text:**

```
Good HTML:
<img src="diagram.png" alt="System architecture showing frontend, API, and database">
→ Alt text provides context

Bad HTML:
<img src="diagram.png" alt="image">
→ No useful context

Missing HTML:
<img src="diagram.png">
→ No alt text at all

Depends on source quality
```
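The three cases above can be audited with Python's standard-library HTML parser. `GENERIC_ALTS` is an illustrative blocklist of useless alt values, not an established standard:

```python
from html.parser import HTMLParser

GENERIC_ALTS = {"", "image", "img", "photo", "picture"}

class AltTextAuditor(HTMLParser):
    """Collect <img> tags, separating useful alt text from missing/generic."""
    def __init__(self):
        super().__init__()
        self.useful, self.flagged = [], []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        alt = (attrs.get("alt") or "").strip().lower()
        if alt in GENERIC_ALTS:
            self.flagged.append(attrs.get("src", "?"))  # needs a caption
        else:
            self.useful.append(alt)                     # keep as chunk text

auditor = AltTextAuditor()
auditor.feed(
    '<img src="diagram.png" alt="System architecture showing frontend, API, and database">'
    '<img src="chart.png" alt="image">'
    '<img src="shot.png">'
)
print(auditor.flagged)  # ['chart.png', 'shot.png']
```

Images landing in `flagged` are the ones worth routing to OCR or a captioning model.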

### OCR for Image Text

**Embedded Text in Images:**

```
Screenshot with text:
→ Button labels
→ Menu items
→ Error messages

Without OCR:
→ Text lost

With OCR (Tesseract, Cloud Vision API):
→ Extract text from image
→ Include in chunk content

Enables retrieval of visual text
```

**OCR Limitations:**

```
Works well for:
→ High-resolution screenshots
→ Clear typography
→ Good contrast

Fails for:
→ Handwriting
→ Low resolution
→ Complex backgrounds
→ Stylized fonts

Accuracy ~80-95% (varies)
```
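OCR tools such as Tesseract report a per-word confidence score, which makes the failure cases above filterable rather than fatal. A sketch of the filtering step over mock output; the threshold and sample data are illustrative:

```python
def filter_ocr_words(words, confidences, min_conf=60):
    """Keep OCR words whose confidence clears the threshold.

    Mirrors the per-word confidence columns Tesseract emits; words
    with low confidence are usually noise from complex backgrounds
    or stylized fonts and are better dropped than indexed.
    """
    return [w for w, c in zip(words, confidences) if c >= min_conf and w.strip()]

# Mock output shaped like Tesseract's word/confidence data
words = ["Save", "Settings", "~#!", "Cancel"]
confs = [96, 91, 23, 88]
print(" ".join(filter_ocr_words(words, confs)))  # Save Settings Cancel
```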

### Vision Language Models

**Image Understanding:**

```
Modern approach: Use vision-language models
→ GPT-4 Vision
→ LLaVA (open source)
→ (not CLIP - it embeds images but cannot generate descriptions)

Process:
1. Extract image from document
2. Send to vision model
3. Prompt: "Describe this diagram in detail"
4. Model output: Text description
5. Embed description with document text

Makes visual content searchable
```
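The five-step process might look like this as pipeline code. The vision-model call is stubbed with a lambda, since the real request depends on your provider; the `[Image: ...]` chunk format is an assumption for illustration:

```python
def describe_images(images, caption_fn):
    """Turn extracted images into searchable text chunks.

    `caption_fn` stands in for a real vision-language model call
    (e.g. prompting "Describe this diagram in detail"); it is stubbed
    here so the pipeline shape stays runnable.
    """
    chunks = []
    for img in images:
        description = caption_fn(img["bytes"])
        chunks.append({
            "text": f"[Image: {img['name']}] {description}",
            "metadata": {"has_image": True, "source_image": img["name"]},
        })
    return chunks

stub_model = lambda _bytes: "Flow chart: request passes auth, then hits the API."
chunks = describe_images([{"name": "auth-flow.png", "bytes": b""}], stub_model)
print(chunks[0]["text"])
```

The resulting chunks embed like any other text, so "authentication flow" now matches the diagram's description.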

**Cost Considerations:**

```
GPT-4 Vision pricing:
→ $0.01-0.03 per image (depending on resolution)

Large knowledge base:
→ 10,000 images
→ Cost: $100-300

One-time cost at ingestion
Worth it for image-heavy docs
```
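A back-of-the-envelope estimator for that one-time ingestion cost, using the assumed per-image price range above:

```python
def vision_ingest_cost(n_images, low=0.01, high=0.03):
    """Estimated one-time captioning cost range for an image corpus."""
    return n_images * low, n_images * high

lo, hi = vision_ingest_cost(10_000)
print(f"${lo:,.0f} - ${hi:,.0f}")  # $100 - $300
```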

### Multimodal Embeddings

**CLIP Embeddings:**

```
CLIP (OpenAI):
→ Embeds images and text in same space
→ "Cat photo" and actual cat photo = similar vectors

Use case:
→ Query: "Show me the authentication flow diagram"
→ Retrieves: Actual diagram image (embedded)
→ Can display image to user

Beyond just text retrieval
```
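In a shared image-text space, retrieval is just nearest-neighbor search over mixed vectors. A sketch with hand-made stand-in vectors; a real system would call a CLIP encoder for both the query and the indexed images:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Mock embeddings standing in for CLIP output: text queries and images
# live in one vector space, so a text query can retrieve an image directly.
index = {
    "auth-flow-diagram.png": [0.9, 0.1, 0.0],
    "cat-photo.jpg":         [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend: embed("authentication flow diagram")

best = max(index, key=lambda name: cosine(query_vec, index[name]))
print(best)  # auth-flow-diagram.png
```

Because the top hit is an image file, the application can render it to the user instead of (or alongside) text.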

***

## How to Solve

* Extract alt text from images where available
* Implement OCR (Tesseract, Cloud Vision) for text embedded in images
* Use vision-language models (GPT-4 Vision) to describe diagrams and charts
* Generate descriptive captions for images at ingestion
* Embed image descriptions as text chunks
* Consider multimodal embeddings (CLIP) for image-text search
* Tag chunks with `has_image` metadata for context

See [Image Handling](/rag-scenarios-and-solutions/data-quality/alt-text.md).
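Tying these steps together, a minimal sketch of the fallback order, alt text, then OCR, then a vision-model caption, with the `has_image` tag applied. All model calls are stubbed, and `min_alt_len` is an arbitrary illustrative threshold:

```python
def image_to_text(img, ocr_fn, vlm_fn, min_alt_len=15):
    """Best-available text for an image: descriptive alt text wins,
    then OCR of embedded text, then a vision-model caption."""
    alt = (img.get("alt") or "").strip()
    if len(alt) >= min_alt_len:   # alt text looks descriptive enough
        return alt
    ocr = ocr_fn(img)             # text embedded in the pixels, if any
    if ocr:
        return ocr
    return vlm_fn(img)            # describe diagrams/charts as a last resort

def ingest_image(img, ocr_fn, vlm_fn):
    return {"text": image_to_text(img, ocr_fn, vlm_fn),
            "metadata": {"has_image": True}}

# Generic alt text, no OCR-able text -> falls through to the caption stub
chunk = ingest_image({"alt": "image"},
                     lambda i: "",
                     lambda i: "Bar chart of monthly revenue.")
print(chunk["text"])  # Bar chart of monthly revenue.
```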

