# Data Manipulations

Data manipulation covers the transformations, enrichment, and processing steps applied to your data to improve its quality, structure, and usability for AI agents. It includes cleaning, formatting, metadata enrichment, and retrieval optimization.

## Overview

Raw data rarely comes in the perfect format for AI consumption. Data manipulation techniques transform your content into an optimized form that enables:

* Better retrieval accuracy
* Improved response quality
* Enhanced filtering and categorization
* More efficient processing

## Core Manipulation Techniques

### Data Cleaning

Remove noise and inconsistencies from your data.

**Common Cleaning Operations:**

#### Whitespace Normalization

* Remove extra spaces, tabs, and newlines
* Standardize line endings
* Clean up formatting artifacts

```
Before: "This  is   a    sentence.\n\n\n\nNext paragraph."
After:  "This is a sentence.\n\nNext paragraph."
```
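The whitespace rules above can be sketched with Python's standard `re` module. The function name `normalize_whitespace` and the choice to keep at most one blank line between paragraphs are assumptions made for illustration, not a prescribed implementation.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse extra spaces and blank lines while keeping paragraph breaks."""
    # Standardize line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a space that sits directly before or after a newline.
    text = re.sub(r" ?\n ?", "\n", text)
    # Reduce three or more newlines to a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


print(repr(normalize_whitespace("This  is   a    sentence.\n\n\n\nNext paragraph.")))
# 'This is a sentence.\n\nNext paragraph.'
```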

#### Character Encoding

* Fix encoding issues (UTF-8, ASCII)
* Handle special characters
* Normalize unicode variations

```
Before: "caf\u00e9" or "cafÃ©"
After:  "café"
```
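As a rough sketch, the mis-decoded `"cafÃ©"` case can be repaired with the standard library by reversing a Latin-1 decode, and decomposed characters can be composed with `unicodedata.normalize`. The helper names below are hypothetical, and the sketch only covers UTF-8 text that was read as Latin-1; a dedicated library such as ftfy handles many more cases.

```python
import unicodedata


def fix_mojibake(text: str) -> str:
    """Undo UTF-8 text that was mistakenly decoded as Latin-1."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text was not mojibake of this kind; leave it unchanged.
        return text


def normalize_unicode(text: str) -> str:
    """Repair mojibake, then compose characters into NFC form."""
    return unicodedata.normalize("NFC", fix_mojibake(text))


print(normalize_unicode("caf\u00c3\u00a9"))  # mis-decoded input "cafÃ©"
print(normalize_unicode("cafe\u0301"))       # "e" followed by a combining accent
```

Both calls print `café`. Text that is already correct passes through unchanged because the round trip fails and the original string is returned.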

#### HTML/XML Cleanup

* Strip HTML tags
* Decode HTML entities
* Remove CSS and JavaScript
* Extract meaningful text

```
Before: "<p>Hello &nbsp; <strong>World</strong>!</p>"
After:  "Hello World!"
```
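One possible way to implement this cleanup uses only the standard library's `html.parser`. The class and function names are illustrative assumptions; Beautiful Soup, listed under Tools and Libraries, is a common alternative for messier markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decodes entities such as &nbsp;
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # split() with no arguments also splits on non-breaking spaces.
        return " ".join("".join(self._parts).split())


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


print(html_to_text("<p>Hello &nbsp; <strong>World</strong>!</p>"))  # Hello World!
```

This sketch joins text nodes without inserting separators, so adjacent block elements such as `<p>a</p><p>b</p>` run together; a production version would likely add a space or newline at block boundaries.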

#### Noise Removal

* Remove boilerplate text (headers, footers)
* Strip navigation elements
* Delete advertising content
* Remove redundant copyright notices

### Data Formatting

Standardize structure and format across your content.

#### Markdown Normalization

* Standardize heading styles
* Consistent list formatting
* Proper code block formatting
* Table standardization

```markdown
Before:
# Heading
** Bold text **
- item 1
* item 2

After:
# Heading
**Bold text**
- item 1
- item 2
```

#### Date and Time Standardization

* Convert to ISO 8601 format
* Handle time zones consistently
* Parse various date formats

```
Before: "Jan 15, 2024", "15/01/2024", "2024-1-15"
After:  "2024-01-15T00:00:00Z"
```
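A minimal sketch of the conversion, assuming a fixed list of accepted input formats and that dates without a time are treated as midnight UTC. The format list and the day-first reading of `15/01/2024` are assumptions; `%b` also depends on an English locale.

```python
from datetime import datetime, timezone

# Candidate input formats, tried in order. This list is an assumption;
# "15/01/2024" is read as day/month/year.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"]


def to_iso8601(raw: str) -> str:
    """Parse a date string and return it as UTC midnight in ISO 8601."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue  # try the next format
        return parsed.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    raise ValueError(f"Unrecognized date format: {raw!r}")


for raw in ["Jan 15, 2024", "15/01/2024", "2024-1-15"]:
    print(to_iso8601(raw))  # 2024-01-15T00:00:00Z each time
```

Raising on unknown formats, rather than guessing, keeps ambiguous dates from being silently misread.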

#### URL Normalization

* Standardize URL formats
* Remove tracking parameters
* Handle relative URLs
* Extract meaningful link text

```
Before: "https://example.com/page?utm_source=email&sessionid=123"
After:  "https://example.com/page"
```
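Tracking-parameter removal can be sketched with `urllib.parse`. The set of parameter names treated as tracking noise is an assumption and would need to match your own sources.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter names treated as tracking noise; this list is an assumption.
TRACKING_PARAMS = {"sessionid", "gclid", "fbclid"}
TRACKING_PREFIXES = ("utm_",)


def normalize_url(url: str) -> str:
    """Lowercase the scheme and host and drop tracking parameters."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path,
         urlencode(kept), parts.fragment)
    )


print(normalize_url("https://example.com/page?utm_source=email&sessionid=123"))
# https://example.com/page
```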

### Text Transformations

Modify text content to improve processing.

#### Case Normalization

* Lowercase for case-insensitive matching
* Title case for headings
* Proper case for names

#### Punctuation Handling

* Standardize quotation marks
* Handle apostrophes consistently
* Remove or standardize special punctuation

#### Language Processing

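* Stemming: Strip suffixes to reach a crude root form; irregular forms such as "ran" are left unchanged by suffix-stripping stemmers like Porter
* Lemmatization: Map each word to its dictionary form using vocabulary and part of speech
* Tokenization: Split into words/tokens

```
Before: "running", "ran", "runs"
After (stemmed): "run", "ran", "run"
After (lemmatized): "run", "run", "run"
```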

#### Abbreviation Expansion

* Expand common abbreviations
* Handle acronyms consistently
* Add full forms as metadata

```
Before: "API", "e.g.", "i.e."
After:  "API (Application Programming Interface)", "for example", "that is"
```
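A small sketch of dictionary-based expansion follows. The `EXPANSIONS` mapping is illustrative only; a real pipeline would presumably load a curated glossary. The lookarounds keep longer words such as `APIs` from being rewritten.

```python
import re

# Illustrative mapping; a real pipeline would load a curated glossary.
EXPANSIONS = {
    "e.g.": "for example",
    "i.e.": "that is",
    "API": "API (Application Programming Interface)",
}

# Longest keys first so that longer abbreviations win over shorter ones.
_PATTERN = re.compile(
    "|".join(
        rf"(?<!\w){re.escape(abbr)}(?!\w)"
        for abbr in sorted(EXPANSIONS, key=len, reverse=True)
    )
)


def expand_abbreviations(text: str) -> str:
    """Replace each known abbreviation with its expanded form."""
    return _PATTERN.sub(lambda match: EXPANSIONS[match.group(0)], text)


print(expand_abbreviations("Use the API, e.g. for login."))
# Use the API (Application Programming Interface), for example for login.
```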

## Metadata Enrichment

Add contextual information to improve retrieval and filtering.

### Source Metadata

Track where content originated:

```json
{
  "source_type": "confluence",
  "source_url": "https://wiki.company.com/page/123",
  "source_title": "API Documentation",
  "author": "John Doe",
  "last_modified": "2024-01-15T10:30:00Z",
  "version": "2.1"
}
```

### Content Classification

Automatically categorize content:

```json
{
  "category": "technical-documentation",
  "subcategory": "api-reference",
  "topics": ["authentication", "REST API", "OAuth"],
  "complexity": "intermediate",
  "content_type": "how-to"
}
```

### Semantic Metadata

Add meaning and context:

```json
{
  "key_concepts": ["rate limiting", "API keys", "authentication"],
  "related_topics": ["security", "developer-tools"],
  "prerequisites": ["account setup", "API key generation"],
  "target_audience": "developers"
}
```

### Structural Metadata

Capture document structure:

```json
{
  "heading_hierarchy": ["Getting Started", "Authentication", "API Keys"],
  "section_type": "setup-guide",
  "reading_time_minutes": 5,
  "code_blocks": 3,
  "external_links": 2
}
```

### Temporal Metadata

Track time-related information:

```json
{
  "created_at": "2023-06-01T00:00:00Z",
  "updated_at": "2024-01-15T10:30:00Z",
  "valid_from": "2024-01-01T00:00:00Z",
  "expires_at": "2025-01-01T00:00:00Z",
  "freshness_score": 0.95
}
```

## Advanced Manipulations

### Entity Extraction

Identify and extract key entities:

**Types of Entities:**

* **People**: Names, roles, contacts
* **Organizations**: Companies, departments, teams
* **Products**: Software, services, tools
* **Locations**: Offices, regions, data centers
* **Technical Terms**: APIs, protocols, technologies

**Example:**

```json
{
  "text": "Contact Jane Smith at jane@company.com for API access to our DataSync service.",
  "entities": {
    "people": [{"name": "Jane Smith", "email": "jane@company.com"}],
    "products": ["DataSync"],
    "topics": ["API access"]
  }
}
```

### Relationship Mapping

Identify connections between content pieces:

```json
{
  "document_id": "doc_123",
  "relationships": [
    {
      "type": "prerequisite",
      "target": "doc_045",
      "description": "Setup guide required first"
    },
    {
      "type": "related",
      "target": "doc_234",
      "description": "Advanced configuration options"
    }
  ]
}
```

### Intent Classification

Determine the purpose of content:

```json
{
  "primary_intent": "instructional",
  "secondary_intents": ["troubleshooting", "reference"],
  "action_items": ["setup", "configure", "test"],
  "question_types_addressed": ["how-to", "what-is"]
}
```

### Sentiment and Tone

Analyze content characteristics:

```json
{
  "tone": "formal",
  "sentiment": "neutral",
  "reading_level": "college",
  "technical_density": "high"
}
```

### Language Detection and Translation

Handle multilingual content:

```json
{
  "detected_language": "en",
  "confidence": 0.99,
  "has_translations": true,
  "available_languages": ["en", "es", "fr"],
  "translation_status": "complete"
}
```

## Content Enhancement

### Summary Generation

Create concise summaries for quick understanding:

```json
{
  "content": "... [full content] ...",
  "summary": "This guide explains how to authenticate with the API using OAuth 2.0. It covers setup, token generation, and refresh workflows.",
  "key_points": [
    "OAuth 2.0 is the primary authentication method",
    "Tokens expire after 1 hour",
    "Refresh tokens are valid for 30 days"
  ]
}
```

### Title and Heading Extraction

Identify and standardize titles:

```json
{
  "original_title": "api-auth-guide.md",
  "extracted_title": "API Authentication Guide",
  "main_heading": "Authenticating with the API",
  "subheadings": [
    "OAuth 2.0 Setup",
    "Token Management",
    "Best Practices"
  ]
}
```

### Code Extraction and Annotation

Handle code snippets specially:

```json
{
  "code_blocks": [
    {
      "language": "python",
      "code": "import requests\n...",
      "purpose": "Example API authentication",
      "line_numbers": [45, 52]
    }
  ]
}
```

### Link Processing

Extract and enrich hyperlinks:

```json
{
  "links": [
    {
      "url": "https://api.example.com/docs",
      "text": "API Documentation",
      "type": "external",
      "status": "active",
      "description": "Official API reference"
    }
  ]
}
```

## Deduplication

Remove duplicate or highly similar content.

### Exact Duplicates

Remove identical content:

```python
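import hashlib

existing_hashes = set()  # digests of chunks already kept

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_new_chunk(chunk):
    digest = content_hash(chunk)
    seen = digest in existing_hashes
    existing_hashes.add(digest)  # record so later copies are skipped
    return not seen  # False means an exact duplicate: skip it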
```

### Near Duplicates

Identify and merge similar content:

**Techniques:**

* Cosine similarity on embeddings
* Fuzzy string matching
* MinHash/LSH algorithms

```python
similarity = cosine_similarity(embedding1, embedding2)
if similarity > 0.95:
    merge_or_skip()
```
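The threshold check above leaves `cosine_similarity` undefined. A dependency-free sketch might look like the following; `drop_near_duplicates` is a hypothetical helper, and the pairwise comparison is only practical for small collections (MinHash/LSH would be the usual choice at scale).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def drop_near_duplicates(embeddings, threshold=0.95):
    """Return indices of embeddings kept after removing near duplicates.

    Each vector is compared with those already kept, which is O(n^2).
    """
    kept = []
    for index, vector in enumerate(embeddings):
        if all(cosine_similarity(vector, embeddings[k]) <= threshold for k in kept):
            kept.append(index)
    return kept


vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(drop_near_duplicates(vectors))  # [0, 2]
```

The 0.95 threshold is the value from the example above; the right cutoff depends on the embedding model and should be tuned on real data.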

### Version Consolidation

Handle multiple versions of the same document:

```json
{
  "consolidation_strategy": "latest",
  "versions": [
    {"id": "v1", "date": "2023-01-01", "action": "archive"},
    {"id": "v2", "date": "2024-01-01", "action": "keep"}
  ]
}
```
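Assuming the `"latest"` strategy means keeping the most recently dated version and archiving the rest, the consolidation could be sketched as follows. The function name is hypothetical.

```python
from datetime import date


def consolidate_latest(versions):
    """Mark the most recent version "keep" and every other one "archive"."""
    if not versions:
        return []
    latest = max(versions, key=lambda v: date.fromisoformat(v["date"]))
    return [
        {**v, "action": "keep" if v is latest else "archive"}
        for v in versions
    ]


versions = [
    {"id": "v1", "date": "2023-01-01"},
    {"id": "v2", "date": "2024-01-01"},
]
for entry in consolidate_latest(versions):
    print(entry["id"], entry["action"])
# v1 archive
# v2 keep
```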

## Data Validation

Ensure data quality through validation.

### Schema Validation

Verify data structure:

```json
{
  "required_fields": ["content", "source", "timestamp"],
  "optional_fields": ["metadata", "tags"],
  "validation_rules": {
    "content": {"min_length": 10, "max_length": 10000},
    "timestamp": {"format": "ISO8601"}
  }
}
```
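A minimal validator for a schema shaped like the one above might look like this. It is a sketch under the assumption that rules apply to string fields only and that `datetime.fromisoformat` is an acceptable ISO 8601 check; the function names are illustrative.

```python
from datetime import datetime

SCHEMA = {
    "required_fields": ["content", "source", "timestamp"],
    "validation_rules": {
        "content": {"min_length": 10, "max_length": 10000},
        "timestamp": {"format": "ISO8601"},
    },
}


def is_iso8601(value: str) -> bool:
    """Accept timestamps that datetime.fromisoformat can parse."""
    try:
        # Older Python versions reject a trailing "Z", so map it to an offset.
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


def validate(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of readable errors; an empty list means valid."""
    errors = [
        f"missing field: {name}"
        for name in schema["required_fields"]
        if name not in record
    ]
    for name, rule in schema["validation_rules"].items():
        value = record.get(name)
        if not isinstance(value, str):
            continue  # missing fields are reported above; other types are not checked here
        if "min_length" in rule and len(value) < rule["min_length"]:
            errors.append(f"{name} is shorter than {rule['min_length']} characters")
        if "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name} is longer than {rule['max_length']} characters")
        if rule.get("format") == "ISO8601" and not is_iso8601(value):
            errors.append(f"{name} is not an ISO 8601 timestamp")
    return errors


print(validate({"content": "short", "timestamp": "15/01/2024"}))
```

Returning every error at once, instead of stopping at the first, makes it easier to report all problems for a record in a single pass.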

### Content Validation

Check content quality:

* **Minimum Length**: Ensure chunks aren't too short
* **Maximum Length**: Prevent oversized chunks
* **Language Check**: Verify expected language
* **Encoding Validation**: Ensure proper encoding

### Metadata Validation

Verify metadata completeness:

```json
{
  "metadata_completeness": 0.85,
  "missing_fields": ["author", "category"],
  "validation_status": "warning"
}
```

## Filtering and Exclusion

Remove unwanted content systematically.

### Content-Based Filtering

Exclude based on content characteristics:

```json
{
  "exclusion_rules": [
    {"type": "length", "min": 50, "max": 5000},
    {"type": "language", "allowed": ["en", "es"]},
    {"type": "contains", "patterns": ["deprecated", "obsolete"]}
  ]
}
```
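One way to read these rules is that a chunk is excluded when its length falls outside the range, its language is not in the allowed list, or its text contains a listed pattern. The sketch below follows that interpretation, which is an assumption, and expects each chunk to carry a precomputed `language` code under hypothetical key names.

```python
RULES = [
    {"type": "length", "min": 50, "max": 5000},
    {"type": "language", "allowed": ["en", "es"]},
    {"type": "contains", "patterns": ["deprecated", "obsolete"]},
]


def is_excluded(chunk: dict, rules: list = RULES) -> bool:
    """Return True if any rule excludes the chunk.

    Assumes each chunk is a dict with "content" and a precomputed
    "language" code; both key names are illustrative.
    """
    text = chunk["content"]
    for rule in rules:
        if rule["type"] == "length":
            if not rule["min"] <= len(text) <= rule["max"]:
                return True
        elif rule["type"] == "language":
            if chunk.get("language") not in rule["allowed"]:
                return True
        elif rule["type"] == "contains":
            lowered = text.lower()
            if any(pattern in lowered for pattern in rule["patterns"]):
                return True
    return False


chunks = [
    {"content": "x" * 60, "language": "en"},
    {"content": "too short", "language": "en"},
]
print([is_excluded(c) for c in chunks])  # [False, True]
```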

### Source-Based Filtering

Filter by origin:

```json
{
  "excluded_sources": [
    "internal-only-wiki",
    "draft-documents"
  ],
  "included_sources": [
    "public-documentation",
    "kb-articles"
  ]
}
```

### Time-Based Filtering

Filter by freshness:

```json
{
  "age_limit_days": 365,
  "exclude_before": "2023-01-01",
  "only_updated_after": "2024-01-01"
}
```
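The three settings could be combined roughly as follows. How each key maps to a document field is an assumption here: `age_limit_days` and `only_updated_after` are applied to the last update date, and `exclude_before` to the creation date.

```python
from datetime import date, timedelta

CONFIG = {
    "age_limit_days": 365,
    "exclude_before": "2023-01-01",
    "only_updated_after": "2024-01-01",
}


def passes_time_filter(created: date, updated: date, today: date,
                       config: dict = CONFIG) -> bool:
    """Return True if the document satisfies every time-based rule."""
    if today - updated > timedelta(days=config["age_limit_days"]):
        return False  # last update is older than the age limit
    if created < date.fromisoformat(config["exclude_before"]):
        return False  # created before the cutoff
    return updated > date.fromisoformat(config["only_updated_after"])


print(passes_time_filter(date(2023, 6, 1), date(2024, 3, 1), today=date(2024, 12, 1)))
# True
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the filter deterministic and easy to test.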

## Optimization Techniques

### Embedding Optimization

Prepare content for optimal embeddings:

* **Chunk Size**: Size chunks to fit the embedding model's input limit
* **Context Addition**: Add titles/headings to chunks
* **Metadata Inclusion**: Include key metadata in embedded text

```
Original chunk: "Click the Save button to save your changes."

Optimized for embedding:
"[Configuration Settings > Saving Changes] 
Click the Save button to save your changes to your account configuration."
```
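Context addition can be as simple as prefixing each chunk with its heading path before it is embedded. The sketch below assumes the heading hierarchy is already available as a list; `add_context` is a hypothetical helper name.

```python
def add_context(chunk: str, heading_path: list) -> str:
    """Prefix a chunk with its heading breadcrumb before embedding."""
    if not heading_path:
        return chunk
    breadcrumb = " > ".join(heading_path)
    return f"[{breadcrumb}]\n{chunk}"


print(add_context(
    "Click the Save button to save your changes.",
    ["Configuration Settings", "Saving Changes"],
))
# [Configuration Settings > Saving Changes]
# Click the Save button to save your changes.
```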

### Query Matching Optimization

Enhance content for better query matching:

* **Question Format**: Add question-style text
* **Keyword Enrichment**: Include relevant keywords
* **Synonym Addition**: Add alternative terms

```
Original: "Authentication requires an API key."

Optimized: "Authentication requires an API key. How to authenticate: 
You need an API key to authenticate with the service. 
Also known as: login, authorization, access token."
```

### Hierarchical Structuring

Create parent-child relationships:

```json
{
  "parent_chunk": {
    "id": "chunk_parent_123",
    "content": "... [entire section] ...",
    "type": "context"
  },
  "child_chunks": [
    {
      "id": "chunk_child_124",
      "content": "... [specific subsection] ...",
      "type": "retrievable"
    }
  ]
}
```
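One common use of this structure, assumed here rather than stated by the schema, is to match queries against the small child chunks and then hand the parent's fuller content to the agent. A sketch with in-memory lookups, using the IDs from the example above:

```python
# Child chunks are what the retriever matches; each points to a parent that
# supplies the wider context. IDs mirror the JSON example above.
PARENTS = {"chunk_parent_123": "... [entire section] ..."}
CHILD_TO_PARENT = {"chunk_child_124": "chunk_parent_123"}


def expand_to_parents(matched_child_ids):
    """Map retrieved child IDs to their parents' content, without repeats."""
    seen = set()
    contexts = []
    for child_id in matched_child_ids:
        parent_id = CHILD_TO_PARENT.get(child_id)
        if parent_id is None or parent_id in seen:
            continue  # unknown child, or parent already included
        seen.add(parent_id)
        contexts.append(PARENTS[parent_id])
    return contexts


print(expand_to_parents(["chunk_child_124", "chunk_child_124"]))
# ['... [entire section] ...']
```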

## Implementation Workflow

### 1. Data Ingestion

```mermaid
flowchart LR
    A[Raw Data] --> B[Parse] --> C[Extract Text] --> D[Initial Validation]
```

### 2. Cleaning Pipeline

```mermaid
flowchart LR
    A[Raw Text] --> B[Remove Noise] --> C[Normalize] --> D[Fix Encoding] --> E[Clean HTML]
```

### 3. Enrichment Pipeline

```mermaid
flowchart LR
    A[Clean Text] --> B[Extract Entities] --> C[Classify] --> D[Add Metadata] --> E[Generate Embeddings]
```

### 4. Optimization Pipeline

```mermaid
flowchart LR
    A[Enriched Data] --> B[Deduplicate] --> C[Validate] --> D[Optimize] --> E[Index]
```

### 5. Quality Assurance

```mermaid
flowchart LR
    A[Indexed Data] --> B[Sample Testing] --> C[Quality Metrics] --> D[Manual Review] --> E[Deployment]
```

## Best Practices

### Processing Order

1. **Clean First**: Remove noise before analysis
2. **Extract Then Enrich**: Get base data before adding metadata
3. **Validate Throughout**: Check quality at each stage
4. **Optimize Last**: Final tuning after core processing

### Idempotency

Ensure operations can be safely repeated:

* Same input → Same output
* Track processing versions
* Enable reprocessing
* Maintain audit trails

### Scalability

Design for large-scale processing:

* Batch processing for efficiency
* Parallel processing where possible
* Incremental updates
* Efficient storage formats

### Monitoring

Track manipulation effectiveness:

```json
{
  "processing_metrics": {
    "documents_processed": 1000,
    "success_rate": 0.98,
    "avg_processing_time_ms": 150,
    "errors": 20,
    "warnings": 45
  }
}
```

## Common Pitfalls

### Over-Processing

* **Problem**: Too many transformations lose original meaning
* **Solution**: Keep transformations minimal and reversible

### Metadata Bloat

* **Problem**: Excessive metadata slows retrieval
* **Solution**: Focus on useful, frequently-filtered metadata

### Loss of Context

* **Problem**: Aggressive cleaning removes important information
* **Solution**: Preserve key structural and contextual elements

### Inconsistent Processing

* **Problem**: Different rules for different sources
* **Solution**: Standardize processing pipelines

## Tools and Libraries

### Text Processing

* **NLTK**: Natural language processing
* **spaCy**: Industrial-strength NLP
* **Beautiful Soup**: HTML parsing
* **Pandas**: Data manipulation

### Data Cleaning

* **ftfy**: Fix text encoding
* **unidecode**: ASCII transliteration
* **langdetect**: Language detection

### Metadata Extraction

* **pdfplumber**: PDF extraction
* **python-docx**: Word document parsing (imported as `docx`)
* **python-magic**: File type detection

## Next Steps

* [Chunking Strategies](/product/data-prep/chunking-strategies.md) - Optimize how data is split
* [Synthetic Data](/product/data-prep/synthetic-data.md) - Enhance with generated content
* [Data Sources](https://github.com/thrivapp/twig-help-docs/blob/staging/data/overview.md) - Learn about data ingestion


