Data Manipulations

Data manipulation encompasses the various transformations, enrichments, and processing techniques applied to your data to improve its quality, structure, and usability for AI agents. This includes cleaning, formatting, enriching with metadata, and optimizing for retrieval.

Overview

Raw data rarely comes in the perfect format for AI consumption. Data manipulation techniques transform your content into an optimized form that enables:

Better retrieval accuracy
Improved response quality
Enhanced filtering and categorization
More efficient processing

Core Manipulation Techniques

Data Cleaning

Remove noise and inconsistencies from your data.

Common Cleaning Operations:

Whitespace Normalization

Remove extra spaces, tabs, and newlines
Standardize line endings
Clean up formatting artifacts

Before: "This  is   a    sentence.\n\n\n\nNext paragraph."
After:  "This is a sentence.\n\nNext paragraph."
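
The normalization above can be sketched in Python with the standard library alone. The function name `normalize_whitespace` and the choice to keep at most one blank line between paragraphs are assumptions for illustration, not a prescribed implementation.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and limit consecutive blank lines."""
    # Standardize line endings to "\n"
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into one space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop the single space left at the end or start of a line
    text = re.sub(r" ?\n ?", "\n", text)
    # Keep at most one blank line between paragraphs
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(normalize_whitespace("This  is   a    sentence.\n\n\n\nNext paragraph."))
```

Collapsing spaces inside code samples or tables may destroy meaningful indentation, so a real pipeline would likely exempt those regions.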

Character Encoding

Fix encoding issues (UTF-8, ASCII)
Handle special characters
Normalize unicode variations

Before: "caf\u00e9" or "cafÃ©"
After:  "café"
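
A minimal sketch of one common repair, assuming the damage came from UTF-8 bytes being decoded as Latin-1 (which is what produces "cafÃ©"). The helper name `fix_mojibake` is hypothetical, and other encoding mix-ups need different handling; a dedicated library such as ftfy covers many more cases.

```python
import unicodedata

def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was wrongly decoded as Latin-1, then normalize."""
    try:
        # "cafÃ©" is the UTF-8 bytes of "café" read as Latin-1; reverse that
        repaired = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not Latin-1 mojibake, so leave it unchanged
        repaired = text
    # NFC merges "e" + combining accent into the single character "é"
    return unicodedata.normalize("NFC", repaired)
```

The round trip is a heuristic: it only changes text whose Latin-1 bytes happen to form valid UTF-8, so correctly encoded strings pass through untouched.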

HTML/XML Cleanup

Strip HTML tags
Decode HTML entities
Remove CSS and JavaScript
Extract meaningful text

Before: "<p>Hello &nbsp; <strong>World</strong>!</p>"
After:  "Hello World!"
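
As a rough illustration, the cleanup above can be done with Python's built-in `html.parser`. The class and function names here (`TextExtractor`, `html_to_text`) are assumptions, and a production pipeline might prefer a more forgiving parser such as Beautiful Soup.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes entities
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # \s also matches the non-breaking space produced by &nbsp;
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()

print(html_to_text("<p>Hello &nbsp; <strong>World</strong>!</p>"))
```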

Noise Removal

Remove boilerplate text (headers, footers)
Strip navigation elements
Delete advertising content
Remove redundant copyright notices

Data Formatting

Standardize structure and format across your content.

Markdown Normalization

Standardize heading styles
Use consistent list formatting
Format code blocks properly
Standardize tables

Before:
# Heading
** Bold text **
- item 1
* item 2

After:
# Heading
**Bold text**
- item 1
- item 2

Date and Time Standardization

Convert to ISO 8601 format
Handle time zones consistently
Parse various date formats

Before: "Jan 15, 2024", "15/01/2024", "2024-1-15"
After:  "2024-01-15T00:00:00Z"
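
One way to sketch this conversion is to try a fixed list of formats in order. The format list, the day-first reading of slash dates, and the decision to treat dates without a time as midnight UTC are all assumptions that a real pipeline would need to confirm per source.

```python
from datetime import datetime, timezone

# Candidate formats, tried in order; slash dates are assumed to be day-first
KNOWN_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y")

def to_iso8601(raw: str) -> str:
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
        # No time zone in the input, so treat the date as midnight UTC
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"Unrecognized date format: {raw!r}")

for value in ("Jan 15, 2024", "15/01/2024", "2024-1-15"):
    print(to_iso8601(value))
```

Month names parsed with `%b` depend on the process locale, so this sketch assumes English month abbreviations.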

URL Normalization

Standardize URL formats
Remove tracking parameters
Handle relative URLs
Extract meaningful link text

Before: "https://example.com/page?utm_source=email&sessionid=123"
After:  "https://example.com/page"
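
A small sketch of tracking-parameter removal with `urllib.parse`. The blocklist (any `utm_` prefix plus `sessionid`, taken from the example) is an assumption; which parameters count as tracking depends on your sources.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed blocklist: "utm_*" plus the session parameter from the example
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"sessionid"}

def strip_tracking(url: str) -> str:
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    # Lowercase the host and rebuild the URL with the remaining parameters
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), parts.fragment)
    )

print(strip_tracking("https://example.com/page?utm_source=email&sessionid=123"))
```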

Text Transformations

Modify text content to improve processing.

Case Normalization

Lowercase for case-insensitive matching
Title case for headings
Proper case for names

Punctuation Handling

Standardize quotation marks
Handle apostrophes consistently
Remove or standardize special punctuation

Language Processing

Stemming: Reduce words to root form
Lemmatization: Convert to dictionary form
Tokenization: Split into words/tokens

Before: "running", "ran", "runs"
After (stemmed): "run", "ran", "run" (rule-based stemmers usually leave irregular forms such as "ran" unchanged)
After (lemmatized): "run", "run", "run"

Abbreviation Expansion

Expand common abbreviations
Handle acronyms consistently
Add full forms as metadata

Before: "API", "e.g.", "i.e."
After:  "API (Application Programming Interface)", "for example", "that is"
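
A dictionary-driven sketch of the expansion above. The mapping is copied from the example and the function name `expand_abbreviations` is hypothetical; a real glossary would be domain-specific.

```python
import re

# Hypothetical mapping taken from the example above
EXPANSIONS = {
    "API": "API (Application Programming Interface)",
    "e.g.": "for example",
    "i.e.": "that is",
}

# Match whole abbreviations only, so "APIs" or "RAPID" are left alone
_PATTERN = re.compile(
    r"(?<!\w)(" + "|".join(re.escape(k) for k in EXPANSIONS) + r")(?!\w)"
)

def expand_abbreviations(text: str) -> str:
    return _PATTERN.sub(lambda m: EXPANSIONS[m.group(1)], text)

print(expand_abbreviations("Use a credential, e.g. a key, to call the API."))
```

Expanding every occurrence can bloat a chunk, which is one reason the list above also suggests storing full forms as metadata instead.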

Metadata Enrichment

Add contextual information to improve retrieval and filtering.

Source Metadata

Track where content originated:

{
  "source_type": "confluence",
  "source_url": "https://wiki.company.com/page/123",
  "source_title": "API Documentation",
  "author": "John Doe",
  "last_modified": "2024-01-15T10:30:00Z",
  "version": "2.1"
}

Content Classification

Automatically categorize content:

{
  "category": "technical-documentation",
  "subcategory": "api-reference",
  "topics": ["authentication", "REST API", "OAuth"],
  "complexity": "intermediate",
  "content_type": "how-to"
}

Semantic Metadata

Add meaning and context:

{
  "key_concepts": ["rate limiting", "API keys", "authentication"],
  "related_topics": ["security", "developer-tools"],
  "prerequisites": ["account setup", "API key generation"],
  "target_audience": "developers"
}

Structural Metadata

Capture document structure:

{
  "heading_hierarchy": ["Getting Started", "Authentication", "API Keys"],
  "section_type": "setup-guide",
  "reading_time_minutes": 5,
  "code_blocks": 3,
  "external_links": 2
}
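
Some of these fields can be derived directly from the text. The sketch below assumes Markdown input with `#`-style headings and fenced code blocks, and an average reading speed of 200 words per minute; both are assumptions, and the function name is hypothetical.

```python
import math
import re

FENCE = "`" * 3  # Markdown code fence marker
WORDS_PER_MINUTE = 200  # assumed average reading speed

def structural_metadata(markdown: str) -> dict:
    headings = re.findall(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown, flags=re.M)
    words = len(markdown.split())
    return {
        "heading_hierarchy": headings,
        "reading_time_minutes": max(1, math.ceil(words / WORDS_PER_MINUTE)),
        # Each fenced block has an opening and a closing fence
        "code_blocks": markdown.count(FENCE) // 2,
        "external_links": len(re.findall(r"\]\(https?://", markdown)),
    }
```

The heading pattern would also match `#` comment lines inside code blocks, so a more careful version would strip fenced regions first.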

Temporal Metadata

Track time-related information:

{
  "created_at": "2023-06-01T00:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "valid_from": "2024-01-01T00:00:00Z",
  "expires_at": "2025-01-01T00:00:00Z",
  "freshness_score": 0.95
}

Advanced Manipulations

Entity Extraction

Identify and extract key entities:

Types of Entities:

People: Names, roles, contacts
Organizations: Companies, departments, teams
Products: Software, services, tools
Locations: Offices, regions, data centers
Technical Terms: APIs, protocols, technologies

Example:

{
  "text": "Contact Jane Smith at [email protected] for API access to our DataSync service.",
  "entities": {
    "people": [{"name": "Jane Smith", "email": "[email protected]"}],
    "products": ["DataSync"],
    "topics": ["API access"]
  }
}

Relationship Mapping

Identify connections between content pieces:

{
  "document_id": "doc_123",
  "relationships": [
    {
      "type": "prerequisite",
      "target": "doc_045",
      "description": "Setup guide required first"
    },
    {
      "type": "related",
      "target": "doc_234",
      "description": "Advanced configuration options"
    }
  ]
}

Intent Classification

Determine the purpose of content:

{
  "primary_intent": "instructional",
  "secondary_intents": ["troubleshooting", "reference"],
  "action_items": ["setup", "configure", "test"],
  "question_types_addressed": ["how-to", "what-is"]
}

Sentiment and Tone

Analyze content characteristics:

{
  "tone": "formal",
  "sentiment": "neutral",
  "reading_level": "college",
  "technical_density": "high"
}

Language Detection and Translation

Handle multilingual content:

{
  "detected_language": "en",
  "confidence": 0.99,
  "has_translations": true,
  "available_languages": ["en", "es", "fr"],
  "translation_status": "complete"
}

Content Enhancement

Summary Generation

Create concise summaries for quick understanding:

{
  "content": "... [full content] ...",
  "summary": "This guide explains how to authenticate with the API using OAuth 2.0. It covers setup, token generation, and refresh workflows.",
  "key_points": [
    "OAuth 2.0 is the primary authentication method",
    "Tokens expire after 1 hour",
    "Refresh tokens are valid for 30 days"
  ]
}

Title and Heading Extraction

Identify and standardize titles:

{
  "original_title": "api-auth-guide.md",
  "extracted_title": "API Authentication Guide",
  "main_heading": "Authenticating with the API",
  "subheadings": [
    "OAuth 2.0 Setup",
    "Token Management",
    "Best Practices"
  ]
}

Code Extraction and Annotation

Handle code snippets specially:

{
  "code_blocks": [
    {
      "language": "python",
      "code": "import requests\n...",
      "purpose": "Example API authentication",
      "line_numbers": [45, 52]
    }
  ]
}

Link Processing

Extract and enrich hyperlinks:

{
  "links": [
    {
      "url": "https://api.example.com/docs",
      "text": "API Documentation",
      "type": "external",
      "status": "active",
      "description": "Official API reference"
    }
  ]
}

Deduplication

Remove duplicate or highly similar content.

Exact Duplicates

Remove identical content:

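import hashlib
existing_hashes = set()  # hashes of chunks already indexed

def add_chunk_if_new(chunk: str) -> bool:
    # Example logic: hash the content and skip it if the hash was seen before
    digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
    if digest in existing_hashes:
        return False  # exact duplicate: skip
    existing_hashes.add(digest)
    return True  # new content: keep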

Near Duplicates

Identify and merge similar content:

Techniques:

Cosine similarity on embeddings
Fuzzy string matching
MinHash/LSH algorithms

similarity = cosine_similarity(embedding1, embedding2)
if similarity > 0.95:
    merge_or_skip()
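
The embedding-based check can be sketched without external libraries. Here chunks are assumed to arrive as `(text, embedding)` pairs, the first occurrence wins, and 0.95 is simply the threshold from the snippet above; all of these are illustrative choices.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def drop_near_duplicates(chunks, threshold=0.95):
    """Keep a chunk only if it is not too similar to one already kept."""
    kept = []
    for text, embedding in chunks:
        if all(cosine_similarity(embedding, e) <= threshold for _, e in kept):
            kept.append((text, embedding))
    return kept
```

This compares every chunk against every kept chunk, which is quadratic; for large corpora, approximate methods such as the MinHash/LSH approach listed above are the usual alternative.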

Version Consolidation

Handle multiple versions of the same document:

{
  "consolidation_strategy": "latest",
  "versions": [
    {"id": "v1", "date": "2023-01-01", "action": "archive"},
    {"id": "v2", "date": "2024-01-01", "action": "keep"}
  ]
}
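
A possible implementation of the "latest" strategy, assuming each version carries an ISO 8601 `date` string as in the example. The function name `consolidate_latest` is hypothetical.

```python
def consolidate_latest(versions):
    """Mark the newest version 'keep' and all others 'archive'."""
    # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings
    latest = max(versions, key=lambda v: v["date"])
    return [
        {**v, "action": "keep" if v is latest else "archive"}
        for v in versions
    ]

versions = [
    {"id": "v1", "date": "2023-01-01"},
    {"id": "v2", "date": "2024-01-01"},
]
print(consolidate_latest(versions))
```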

Data Validation

Ensure data quality through validation.

Schema Validation

Verify data structure:

{
  "required_fields": ["content", "source", "timestamp"],
  "optional_fields": ["metadata", "tags"],
  "validation_rules": {
    "content": {"min_length": 10, "max_length": 10000},
    "timestamp": {"format": "ISO8601"}
  }
}
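
The rules above could be enforced with a small validator like the following sketch. It hard-codes the example's required fields and length limits, and it uses `datetime.fromisoformat`, which accepts only a subset of ISO 8601; treat both as simplifying assumptions.

```python
from datetime import datetime

RULES = {
    "required_fields": ["content", "source", "timestamp"],
    "content": {"min_length": 10, "max_length": 10000},
}

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    errors = [
        f"missing field: {name}"
        for name in RULES["required_fields"]
        if name not in record
    ]
    limits = RULES["content"]
    if "content" in record:
        length = len(record["content"])
        if not limits["min_length"] <= length <= limits["max_length"]:
            errors.append("content length out of range")
    if "timestamp" in record:
        try:
            # fromisoformat() accepts a trailing "Z" only on Python 3.11+
            datetime.fromisoformat(record["timestamp"].replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp is not ISO 8601")
    return errors
```

Returning all problems at once, rather than failing on the first, makes it easier to report validation results per document.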

Content Validation

Check content quality:

Minimum Length: Ensure chunks aren't too short
Maximum Length: Prevent oversized chunks
Language Check: Verify expected language
Encoding Validation: Ensure proper encoding

Metadata Validation

Verify metadata completeness:

{
  "metadata_completeness": 0.85,
  "missing_fields": ["author", "category"],
  "validation_status": "warning"
}

Filtering and Exclusion

Remove unwanted content systematically.

Content-Based Filtering

Exclude based on content characteristics:

{
  "exclusion_rules": [
    {"type": "length", "min": 50, "max": 5000},
    {"type": "language", "allowed": ["en", "es"]},
    {"type": "contains", "patterns": ["deprecated", "obsolete"]}
  ]
}
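
One reading of these rules, sketched in Python: a chunk is excluded when its length falls outside the bounds, its language is not in the allowed list, or it contains a listed pattern. The `language` field on the chunk and the case-insensitive substring match are assumptions.

```python
EXCLUSION_RULES = [
    {"type": "length", "min": 50, "max": 5000},
    {"type": "language", "allowed": ["en", "es"]},
    {"type": "contains", "patterns": ["deprecated", "obsolete"]},
]

def is_excluded(chunk: dict, rules=EXCLUSION_RULES) -> bool:
    """Return True if any rule rejects the chunk."""
    text = chunk["content"]
    for rule in rules:
        if rule["type"] == "length":
            if not rule["min"] <= len(text) <= rule["max"]:
                return True
        elif rule["type"] == "language":
            if chunk.get("language") not in rule["allowed"]:
                return True
        elif rule["type"] == "contains":
            if any(p in text.lower() for p in rule["patterns"]):
                return True
    return False
```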

Source-Based Filtering

Filter by origin:

{
  "excluded_sources": [
    "internal-only-wiki",
    "draft-documents"
  ],
  "included_sources": [
    "public-documentation",
    "kb-articles"
  ]
}

Time-Based Filtering

Filter by freshness:

{
  "age_limit_days": 365,
  "exclude_before": "2023-01-01",
  "only_updated_after": "2024-01-01"
}

Optimization Techniques

Embedding Optimization

Prepare content for optimal embeddings:

Chunk Size: Size chunks to suit the embedding model's input limit
Context Addition: Add titles/headings to chunks
Metadata Inclusion: Include key metadata in embedded text

Original chunk: "Click the Save button to save your changes."

Optimized for embedding:
"[Configuration Settings > Saving Changes] 
Click the Save button to save your changes to your account configuration."
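
The heading prefix in this example is straightforward to generate. A minimal sketch, assuming each chunk knows its heading path (the `heading_path` parameter is hypothetical):

```python
def add_heading_context(chunk: str, heading_path: list) -> str:
    """Prefix a chunk with its heading breadcrumb before embedding."""
    if not heading_path:
        return chunk
    breadcrumb = " > ".join(heading_path)
    return f"[{breadcrumb}]\n{chunk}"

print(add_heading_context(
    "Click the Save button to save your changes.",
    ["Configuration Settings", "Saving Changes"],
))
```

Only the embedded text needs the prefix; the original chunk can still be stored and shown to users unchanged.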

Query Matching Optimization

Enhance content for better query matching:

Question Format: Add question-style text
Keyword Enrichment: Include relevant keywords
Synonym Addition: Add alternative terms

Original: "Authentication requires an API key."

Optimized: "Authentication requires an API key. How to authenticate: 
You need an API key to authenticate with the service. 
Related terms: login, authorization, access token."

Hierarchical Structuring

Create parent-child relationships:

{
  "parent_chunk": {
    "id": "chunk_parent_123",
    "content": "... [entire section] ...",
    "type": "context"
  },
  "child_chunks": [
    {
      "id": "chunk_child_124",
      "content": "... [specific subsection] ...",
      "type": "retrievable"
    }
  ]
}

Implementation Workflow

1. Data Ingestion

Raw Data → Parse → Extract Text → Initial Validation

2. Cleaning Pipeline

Raw Text → Remove Noise → Normalize → Fix Encoding → Clean HTML

3. Enrichment Pipeline

Clean Text → Extract Entities → Classify → Add Metadata → Generate Embeddings

4. Optimization Pipeline

Enriched Data → Deduplicate → Validate → Optimize → Index

5. Quality Assurance

Indexed Data → Sample Testing → Quality Metrics → Manual Review → Deployment
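
Each arrow chain above is a sequence of functions applied in order. The sketch below shows that shape with two toy stages; the stage functions and the copyright-line rule are hypothetical placeholders, not the actual stages of any product.

```python
import re
from functools import reduce

def remove_noise(text: str) -> str:
    # Hypothetical noise rule: drop lines that start with a copyright notice
    lines = [l for l in text.splitlines() if not l.lower().startswith("copyright")]
    return "\n".join(lines)

def normalize(text: str) -> str:
    # Collapse runs of spaces and tabs
    return re.sub(r"[ \t]+", " ", text).strip()

CLEANING_STAGES = [remove_noise, normalize]

def run_pipeline(text: str, stages) -> str:
    """Apply each stage in order, feeding its output to the next."""
    return reduce(lambda current, stage: stage(current), stages, text)

print(run_pipeline("Hello   world\nCopyright 2024 Example", CLEANING_STAGES))
```

Keeping stages as plain functions in a list makes the order explicit and lets individual stages be tested or swapped independently.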

Best Practices

Processing Order

Clean First: Remove noise before analysis
Extract Then Enrich: Get base data before adding metadata
Validate Throughout: Check quality at each stage
Optimize Last: Final tuning after core processing

Idempotency

Ensure operations can be safely repeated:

Same input → Same output
Track processing versions
Enable reprocessing
Maintain audit trails
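
One way to get repeatable processing is to key results by a hash of the input plus a pipeline version, so reruns reuse earlier output and a version bump forces reprocessing. The names `PIPELINE_VERSION`, `processing_key`, and `process_once` are assumptions for this sketch, and an in-memory dict stands in for a real store.

```python
import hashlib

PIPELINE_VERSION = "1"  # bump when processing rules change

def processing_key(raw_text: str, version: str = PIPELINE_VERSION) -> str:
    """Stable key derived from the input and the pipeline version."""
    payload = f"{version}\n{raw_text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def process_once(raw_text: str, processed: dict, transform):
    """Run transform only if this input/version pair has not been seen."""
    key = processing_key(raw_text)
    if key not in processed:
        processed[key] = transform(raw_text)
    return processed[key]
```

This only yields the same output for the same input if `transform` itself is deterministic.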

Scalability

Design for large-scale processing:

Batch processing for efficiency
Parallel processing where possible
Incremental updates
Efficient storage formats

Monitoring

Track manipulation effectiveness:

{
  "processing_metrics": {
    "documents_processed": 1000,
    "success_rate": 0.98,
    "avg_processing_time_ms": 150,
    "errors": 20,
    "warnings": 45
  }
}
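
Metrics like these can be aggregated from per-document results. The sketch assumes each result is a dict with `status`, `duration_ms`, and an optional `warnings` list; that record shape is hypothetical.

```python
def summarize_run(results: list) -> dict:
    """Aggregate per-document results into the metrics shown above."""
    total = len(results)
    errors = sum(1 for r in results if r["status"] == "error")
    warnings = sum(len(r.get("warnings", [])) for r in results)
    total_ms = sum(r["duration_ms"] for r in results)
    return {
        "processing_metrics": {
            "documents_processed": total,
            "success_rate": (total - errors) / total if total else 0.0,
            "avg_processing_time_ms": total_ms / total if total else 0.0,
            "errors": errors,
            "warnings": warnings,
        }
    }
```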

Common Pitfalls

Over-Processing

Problem: Too many transformations can distort or lose the original meaning
Solution: Keep transformations minimal and reversible

Metadata Bloat

Problem: Excessive metadata slows retrieval
Solution: Focus on useful, frequently-filtered metadata

Loss of Context

Problem: Aggressive cleaning removes important information
Solution: Preserve key structural and contextual elements

Inconsistent Processing

Problem: Different rules for different sources
Solution: Standardize processing pipelines

Tools and Libraries

Text Processing

NLTK: Natural language processing
spaCy: Industrial-strength NLP
Beautiful Soup: HTML parsing
Pandas: Data manipulation

Data Cleaning

ftfy: Fix text encoding
unidecode: ASCII transliteration
langdetect: Language detection

Metadata Extraction

pdfplumber: PDF extraction
python-docx: Word document parsing (imported as docx)
python-magic: File type detection

Next Steps

Chunking Strategies - Optimize how data is split
Synthetic Data - Enhance with generated content
Data Sources - Learn about data ingestion


