# Inconsistent Document Metadata

## The Problem

Missing, incorrect, or inconsistent metadata across documents prevents effective filtering, search, and access control in RAG retrieval.

### Symptoms

* ❌ Some docs have metadata, others don't
* ❌ Same field under different names ("author" vs "created\_by")
* ❌ Dates in multiple formats
* ❌ Cannot filter by category/department
* ❌ Access control metadata missing

### Real-World Example

```
Document A metadata:
{
  "author": "john.smith@company.com",
  "created": "2024-01-15",
  "department": "Engineering",
  "sensitivity": "internal"
}

Document B metadata:
{
  "created_by": "Jane Doe",
  "date": "Jan 15, 2024",
  "dept": "Eng"
}

Document C metadata:
{
  // No metadata at all
}

Query with filter: WHERE department = "Engineering"
→ Matches Doc A only
→ Doc B uses "dept" (different field)
→ Doc C has no metadata
→ Incomplete results despite relevant content in B and C
```
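
To make the failure concrete, here is a minimal sketch that reproduces the filter over the three example documents. The `filter_by` helper and the in-memory list are hypothetical stand-ins for a vector store's metadata filter, not any particular product's API.

```python
# Hypothetical in-memory stand-in for a vector store's metadata filter.
docs = [
    {"id": "A", "metadata": {"author": "john.smith@company.com",
                             "created": "2024-01-15",
                             "department": "Engineering",
                             "sensitivity": "internal"}},
    {"id": "B", "metadata": {"created_by": "Jane Doe",
                             "date": "Jan 15, 2024",
                             "dept": "Eng"}},
    {"id": "C", "metadata": {}},
]


def filter_by(documents, field, value):
    """Return the ids of documents whose metadata matches field == value exactly."""
    return [d["id"] for d in documents if d["metadata"].get(field) == value]


matched = filter_by(docs, "department", "Engineering")
print(matched)  # ['A']
```

An exact-match filter has no way to know that `dept: "Eng"` expresses the same fact, so Doc B is dropped before relevance scoring ever sees it.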

***

## Deep Technical Analysis

### Schema Inconsistency

**Field Name Variations:**

```
Same concept, different names:
→ "author", "created_by", "owner", "contributor"
→ "date", "created", "timestamp", "published_date"
→ "category", "type", "classification", "tag"

Queries break:
→ Filter by "author" misses "created_by"
→ Manual mapping required
```
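
Before a mapping can be written, the variants have to be found. One possible approach, sketched below with made-up sample records, is a simple census of which metadata keys appear across the corpus; `field_census` is a hypothetical helper name.

```python
from collections import Counter


def field_census(metadata_records):
    """Count how many documents use each metadata key."""
    counts = Counter()
    for record in metadata_records:
        counts.update(record.keys())
    return counts


# Illustrative records only; a real census would iterate over the whole index.
records = [
    {"author": "john.smith@company.com", "created": "2024-01-15"},
    {"created_by": "Jane Doe", "date": "Jan 15, 2024"},
    {"owner": "Jane Doe", "timestamp": 1705276800},
    {},
]

census = field_census(records)
for key, count in sorted(census.items()):
    print(f"{key}: {count}/{len(records)} documents")
```

Keys that each cover only a fraction of the corpus but look semantically related are candidates for the same canonical field.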

**Data Type Mismatches:**

```
Date field:
→ Doc A: "2024-01-15" (ISO 8601)
→ Doc B: "January 15, 2024" (text)
→ Doc C: 1705276800 (Unix timestamp)

Cannot compare:
→ WHERE date > "2024-01-01"
→ Only Doc A (ISO format) compares correctly as a string
→ Doc B sorts as arbitrary text; Doc C is a number, not a string
```
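
A minimal sketch of normalizing these three representations to ISO 8601 dates, using only the standard library. The list of text formats and the assumption that numeric values are Unix timestamps in seconds (UTC) are illustrative; month-name parsing with `%B`/`%b` also assumes an English locale.

```python
from datetime import datetime, timezone

# Assumed set of text formats seen in the corpus; extend as needed.
TEXT_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%b %d, %Y")


def to_iso_date(value):
    """Convert an ISO string, a month-name date, or a Unix timestamp to YYYY-MM-DD.

    Returns None when the value cannot be interpreted.
    """
    if isinstance(value, bool):
        return None  # bool is a subclass of int; never treat it as a timestamp
    if isinstance(value, (int, float)):
        # Assumption: numeric values are Unix timestamps in seconds, UTC.
        return datetime.fromtimestamp(value, tz=timezone.utc).date().isoformat()
    if isinstance(value, str):
        for fmt in TEXT_FORMATS:
            try:
                return datetime.strptime(value.strip(), fmt).date().isoformat()
            except ValueError:
                continue
    return None


for raw in ("2024-01-15", "January 15, 2024", 1705276800):
    print(repr(raw), "->", to_iso_date(raw))
```

Once every value is a `YYYY-MM-DD` string, lexicographic comparison agrees with chronological order, so a filter such as `date > "2024-01-01"` behaves the same for all three documents.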

### Missing Metadata

**Incomplete Extraction:**

```
Source: Confluence page
→ Has author, date, labels (tags)

Extraction:
→ Captures title, body
→ Misses labels (not in the API response unless explicitly requested)

Result: Metadata incomplete
```

**Legacy Documents:**

```
Old docs imported:
→ Created before metadata standards
→ Missing required fields
→ Cannot retroactively add without manual review

Metadata gaps persist
```
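
Manual review is easier to scope once the gaps are listed. The sketch below reports which required fields each document lacks; the `REQUIRED_FIELDS` tuple and the document ids are assumptions for illustration.

```python
# Assumed list of required fields; adjust to the schema actually in use.
REQUIRED_FIELDS = ("author", "created_at", "department", "sensitivity")


def audit_missing_fields(documents):
    """Map each document id to the required fields it lacks.

    Empty strings and None count as missing.
    """
    gaps = {}
    for doc_id, metadata in documents.items():
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            gaps[doc_id] = missing
    return gaps


# Hypothetical sample: two legacy documents and one that meets the standard.
documents = {
    "legacy-001": {"author": "Jane Doe"},
    "legacy-002": {},
    "new-042": {"author": "john.smith@company.com", "created_at": "2024-01-15",
                "department": "Engineering", "sensitivity": "internal"},
}

for doc_id, missing in audit_missing_fields(documents).items():
    print(f"{doc_id}: missing {', '.join(missing)}")
```

The resulting report can be sorted by number of missing fields or by document traffic to decide which legacy documents to remediate first.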

### Normalization Strategies

**Schema Standardization:**

```
Define canonical schema:
{
  "author": "string",
  "created_at": "ISO 8601 datetime",
  "department": "string (controlled vocabulary)",
  "sensitivity": "enum: public|internal|confidential",
  "document_type": "enum: policy|guide|api_doc"
}

Map all inputs to this schema
```
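
One way to make the canonical schema enforceable is to express it as a typed record. The following is a sketch using standard-library dataclasses and enums; the class names are hypothetical, and `department` is left as a plain string here because its vocabulary is handled separately.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Sensitivity(str, Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"


class DocumentType(str, Enum):
    POLICY = "policy"
    GUIDE = "guide"
    API_DOC = "api_doc"


@dataclass(frozen=True)
class CanonicalMetadata:
    author: str
    created_at: datetime
    department: str
    sensitivity: Sensitivity
    document_type: DocumentType

    @classmethod
    def from_dict(cls, raw):
        """Build a record from already-mapped keys.

        Raises KeyError for a missing field and ValueError for a bad
        date or an enum value outside the schema.
        """
        return cls(
            author=raw["author"],
            created_at=datetime.fromisoformat(raw["created_at"]),
            department=raw["department"],
            sensitivity=Sensitivity(raw["sensitivity"]),
            document_type=DocumentType(raw["document_type"]),
        )


record = CanonicalMetadata.from_dict({
    "author": "john.smith@company.com",
    "created_at": "2024-01-15",
    "department": "Engineering",
    "sensitivity": "internal",
    "document_type": "guide",
})
print(record.sensitivity.value, record.created_at.isoformat())
```

Failing loudly at construction time keeps malformed records out of the index instead of letting them surface later as silently missing search results.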

**Field Mapping:**

```
Ingestion pipeline:
→ Detect source schema
→ Map to canonical:
  - "created_by" → "author"
  - "dept" → "department"
  - Normalize: "Eng" → "Engineering"

Ensures consistency
```
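
The mapping step can be sketched as two lookup tables, one for field names and one for values. The tables below contain only the aliases mentioned above; `map_to_canonical` is a hypothetical helper, and keeping the first value on a key collision is an arbitrary choice for this sketch.

```python
# Assumed alias tables; real ones would be built from a survey of the sources.
FIELD_ALIASES = {"created_by": "author", "dept": "department"}
VALUE_ALIASES = {"department": {"Eng": "Engineering"}}


def map_to_canonical(raw):
    """Rename aliased keys and normalize known value variants.

    If two source keys map to the same canonical key, the first one wins.
    """
    mapped = {}
    for key, value in raw.items():
        canonical_key = FIELD_ALIASES.get(key, key)
        if canonical_key in mapped:
            continue
        if isinstance(value, str):
            value = VALUE_ALIASES.get(canonical_key, {}).get(value, value)
        mapped[canonical_key] = value
    return mapped


doc_b = {"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"}
print(map_to_canonical(doc_b))
```

Keys without an alias pass through unchanged, so unmapped fields remain visible for later review rather than being silently discarded.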

**Default Values:**

```
Required fields:
→ If missing, use default
→ "author": "unknown"
→ "sensitivity": "internal" (safe default)

Prevents null/missing values breaking queries
```
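
A minimal sketch of applying these defaults. Returning the list of defaulted fields alongside the metadata is a design choice assumed here so that placeholder values can be flagged for review rather than mistaken for real data.

```python
# Defaults taken from the example above.
DEFAULTS = {"author": "unknown", "sensitivity": "internal"}


def apply_defaults(metadata):
    """Return (filled_copy, defaulted_fields) without mutating the input."""
    filled = dict(metadata)
    defaulted = []
    for field, default in DEFAULTS.items():
        if filled.get(field) in (None, ""):
            filled[field] = default
            defaulted.append(field)
    return filled, defaulted


filled, defaulted = apply_defaults({"department": "Engineering", "author": ""})
print(filled)
print("needs review:", defaulted)
```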

### Controlled Vocabularies

**Department Field:**

```
Problem: Free text
→ "Engineering", "Eng", "engineering", "ENGINEERING", "R&D"

Solution: Enum
→ Valid values: ["Engineering", "Sales", "Support", "HR"]
→ Reject or map invalid values

Enables reliable filtering
```
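
The "reject or map" rule might look like the sketch below. The synonym table is an assumption: whether "R&D" should map to "Engineering" is an organizational decision, not something the code can infer.

```python
VALID_DEPARTMENTS = ("Engineering", "Sales", "Support", "HR")

# Assumed synonym table, keyed by lowercase input.
DEPARTMENT_SYNONYMS = {"eng": "Engineering", "r&d": "Engineering"}

# Case-insensitive lookup of the valid values themselves.
_BY_LOWERCASE = {name.lower(): name for name in VALID_DEPARTMENTS}


def normalize_department(value):
    """Map a free-text department to the controlled vocabulary.

    Raises ValueError for values that are neither valid nor known synonyms.
    """
    key = value.strip().lower()
    if key in _BY_LOWERCASE:
        return _BY_LOWERCASE[key]
    if key in DEPARTMENT_SYNONYMS:
        return DEPARTMENT_SYNONYMS[key]
    raise ValueError(f"unknown department: {value!r}")


for raw in ("Engineering", "Eng", "engineering", "ENGINEERING", "R&D"):
    print(repr(raw), "->", normalize_department(raw))
```

Raising on unknown values forces a decision at ingestion: either extend the synonym table or route the document to a review queue.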

**Tag Standardization:**

```
Tags: ["api", "API", "rest-api", "REST API", "restapi"]
→ All mean same thing

Normalize:
→ Lowercase: "API" → "api"
→ Map variants to canonical form: "api", "restapi", "rest api" → "rest-api"

Consistent tagging
```
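
As a sketch, tag normalization can lowercase the tag, collapse separators, and then look the result up in a table of canonical forms. The table below is assumed and covers only the variants from the example; treating a bare "api" as "rest-api" follows the example's premise that all five tags mean the same thing.

```python
import re

# Assumed canonical table, keyed by the tag with case and separators removed.
CANONICAL_TAGS = {"api": "rest-api", "restapi": "rest-api"}


def normalize_tag(tag):
    """Lowercase, collapse whitespace/underscores/hyphens, then map to canonical."""
    slug = re.sub(r"[\s_-]+", "-", tag.strip().lower())
    key = slug.replace("-", "")
    # Unknown tags fall back to the hyphenated lowercase slug.
    return CANONICAL_TAGS.get(key, slug)


def normalize_tags(tags):
    """Normalize every tag and drop duplicates, preserving first-seen order."""
    return list(dict.fromkeys(normalize_tag(t) for t in tags))


print(normalize_tags(["api", "API", "rest-api", "REST API", "restapi"]))  # ['rest-api']
```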

***

## How to Solve

**Combine the following steps:**

* Define a canonical metadata schema upfront
* Implement field mapping during ingestion (source schema → canonical)
* Normalize data types (all dates to ISO 8601)
* Use controlled vocabularies for categories and departments
* Set safe defaults for missing required fields
* Validate metadata at ingestion
* Audit and remediate legacy docs

See [Metadata Standards](/rag-scenarios-and-solutions/data-quality/metadata-inconsistent.md).

