Inconsistent Document Metadata

The Problem

Missing, incorrect, or inconsistent metadata across documents prevents effective filtering, search, and access control in RAG retrieval.

Symptoms

  • ❌ Some docs have metadata, others don't

  • ❌ Same field under different names ("author" vs "created_by")

  • ❌ Dates in multiple formats

  • ❌ Cannot filter by category/department

  • ❌ Access control metadata missing

Real-World Example

Document A metadata:
{
  "author": "[email protected]",
  "created": "2024-01-15",
  "department": "Engineering",
  "sensitivity": "internal"
}

Document B metadata:
{
  "created_by": "Jane Doe",
  "date": "Jan 15, 2024",
  "dept": "Eng"
}

Document C metadata:
{
  // No metadata at all
}

Query with filter: WHERE department = "Engineering"
→ Matches Doc A only
→ Doc B uses "dept" (different field)
→ Doc C has no metadata
→ Incomplete results despite relevant content in B and C
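
The mismatch above can be reproduced in a few lines of Python. This is only a minimal sketch: it uses plain dictionaries and a hypothetical `matches_filter` helper rather than any particular vector store's filter API, and it leaves out Document A's author field because that field plays no part in the filter.

```python
# Metadata for the three documents in the example, as plain dicts.
docs = {
    "A": {"created": "2024-01-15", "department": "Engineering", "sensitivity": "internal"},
    "B": {"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"},
    "C": {},  # no metadata at all
}


def matches_filter(metadata: dict, field: str, value: str) -> bool:
    """Exact-match filter (hypothetical helper): a missing field never matches."""
    return metadata.get(field) == value


# Equivalent of: WHERE department = "Engineering"
matched = [
    doc_id
    for doc_id, meta in docs.items()
    if matches_filter(meta, "department", "Engineering")
]
print(matched)  # ['A']
```

Documents B and C are dropped before relevance is ever considered, which is why the problem tends to go unnoticed: the query succeeds, it just returns fewer results than it should.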

Deep Technical Analysis

Schema Inconsistency

Field Name Variations:

Data Type Mismatches:

Missing Metadata

Incomplete Extraction:

Legacy Documents:

Normalization Strategies

Schema Standardization:

Field Mapping:

Default Values:

Controlled Vocabularies

Department Field:

Tag Standardization:


How to Solve

  • Define a canonical metadata schema upfront

  • Implement field mapping during ingestion (source schema → canonical)

  • Normalize data types (all dates to ISO 8601)

  • Use controlled vocabularies for categories/departments

  • Set safe defaults for missing required fields

  • Validate metadata at ingestion

  • Audit and remediate legacy docs

See Metadata Standards.
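
The ingestion-time steps above might be sketched as follows. Everything named here is an assumption made for illustration: the alias table, the department vocabulary, the list of accepted date formats, and the default values (including `"restricted"` as the safe sensitivity default) are hypothetical and would come from your own canonical schema.

```python
from datetime import datetime

# Hypothetical mapping from source field names to canonical names.
FIELD_ALIASES = {"created_by": "author", "date": "created", "dept": "department"}

# Hypothetical controlled vocabulary: lowercase variant -> canonical value.
DEPARTMENTS = {"eng": "Engineering", "engineering": "Engineering"}

# Date formats this sketch can parse (assumed; %b relies on English month names).
DATE_FORMATS = ("%Y-%m-%d", "%b %d, %Y")

# Assumed safe defaults: unlabeled documents get the most restrictive sensitivity.
DEFAULTS = {
    "author": "unknown",
    "created": None,
    "department": "Unknown",
    "sensitivity": "restricted",
}


def to_iso_date(value: str) -> str:
    """Parse a date in any known format and return it as ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {value!r}")


def normalize(raw: dict) -> dict:
    """Map source metadata onto the canonical schema, rejecting unknown values."""
    # 1. Field mapping: rename source fields to their canonical names.
    meta = {FIELD_ALIASES.get(key, key): value for key, value in raw.items()}
    # 2. Type normalization: all dates become ISO 8601 strings.
    if meta.get("created"):
        meta["created"] = to_iso_date(meta["created"])
    # 3. Controlled vocabulary: validate and canonicalize the department.
    if "department" in meta:
        key = meta["department"].strip().lower()
        if key not in DEPARTMENTS:
            raise ValueError(f"unknown department: {meta['department']!r}")
        meta["department"] = DEPARTMENTS[key]
    # 4. Defaults: fill any required field that is still missing.
    return {**DEFAULTS, **meta}


doc_b = normalize({"created_by": "Jane Doe", "date": "Jan 15, 2024", "dept": "Eng"})
print(doc_b["department"], doc_b["created"])  # Engineering 2024-01-15
```

Raising on an unknown department or date format, rather than storing the value as-is, is a deliberate choice in this sketch: it surfaces bad metadata at ingestion instead of letting it silently fall out of filtered queries later.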
