# Synthetic Data

Synthetic data generation enhances your knowledge base by creating additional training examples, questions, and variations of existing content. This improves your AI agent's ability to understand and respond to a wider range of user queries.

## What is Synthetic Data?

Synthetic data refers to artificially generated content that supplements your original data. In the context of AI agents, this typically includes:

* **Question-Answer Pairs**: Questions and matching answers generated from your documents
* **Paraphrased Content**: Alternative phrasings of existing information
* **Edge Cases**: Variations covering uncommon query patterns
* **Expanded Examples**: Additional context and use cases

## Why Use Synthetic Data?

### Coverage Enhancement

Original documentation often doesn't cover all possible ways users might ask questions. Synthetic data fills these gaps by:

* Generating multiple question variations for each concept
* Creating questions for different expertise levels
* Covering different phrasings and terminology
* Addressing implicit questions not explicitly stated in docs

### Improved Retrieval

More diverse data improves semantic search:

* Broader embedding coverage of the semantic space
* Higher likelihood of matching user queries
* Reduced dependency on exact keyword matches
* Improved ranking of relevant results

### Training and Testing

Synthetic data enables better evaluation:

* Create test sets for measuring accuracy
* Generate scenarios for stress testing
* Validate coverage of key topics
* Benchmark performance improvements

## Types of Synthetic Data

### Question Generation

Automatically generate questions from your existing content.

**How It Works:**

1. Analyze document chunks
2. Identify key concepts and facts
3. Generate relevant questions using LLMs
4. Associate questions with source content

**Example:**

Original content:

```
Twig AI supports integration with Slack, Microsoft Teams, 
and Zendesk. Each integration requires OAuth authentication 
and can be configured from the Integrations page.
```

Generated questions:

* "What integrations does Twig AI support?"
* "How do I connect Twig AI to Slack?"
* "What authentication method is used for integrations?"
* "Where can I configure integrations in Twig AI?"

**Benefits:**

* Improves retrieval for question-style queries
* Covers different ways of asking the same thing
* Explicitly links questions to answers
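
The four steps under "How It Works" can be sketched in Python as follows. This is a minimal illustration rather than Twig AI's implementation: `GeneratedQuestion`, `generate_questions`, and the `llm` callable are hypothetical names, and the model call is replaced by a stub so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class GeneratedQuestion:
    question: str
    source_chunk_id: str  # links the question back to its source content


def generate_questions(
    chunks: Dict[str, str],
    llm: Callable[[str], List[str]],
) -> List[GeneratedQuestion]:
    """Run every chunk through the LLM and keep the chunk ID with each question."""
    results = []
    for chunk_id, text in chunks.items():
        for question in llm(text):
            question = question.strip()
            if question:  # drop empty lines returned by the model
                results.append(GeneratedQuestion(question, chunk_id))
    return results


def stub_llm(text: str) -> List[str]:
    # Hypothetical stand-in for a real model call; returns one fixed question.
    return ["What integrations does Twig AI support?"]


chunks = {
    "integrations-1": "Twig AI supports integration with Slack, Microsoft Teams, and Zendesk."
}
questions = generate_questions(chunks, stub_llm)
```

Storing the chunk ID on every question is what makes the final step, associating questions with source content, possible at retrieval time.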

### Answer Generation

Create complete Q&A pairs from documentation.

**How It Works:**

1. Generate questions as above
2. Extract or generate concise answers
3. Include source references
4. Validate accuracy

**Example Structure:**

```json
{
  "question": "How do I connect Twig AI to Slack?",
  "answer": "To connect Twig AI to Slack, go to the Integrations page and select Slack. You'll be prompted to authenticate using OAuth. Once authenticated, you can configure which channels to monitor.",
  "source": "integrations-guide.md",
  "section": "Slack Integration"
}
```

**Benefits:**

* Provides ready-to-use Q&A format
* Optimized for direct answering
* Easier to validate and edit
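
One rough way to automate part of the "Validate accuracy" step is to measure how much of a generated answer is actually backed by words in the source. The sketch below is a crude lexical heuristic, not a real fact checker, and `grounding_score` and the 0.8 threshold are assumptions chosen for illustration.

```python
import re


def tokens(text: str) -> set:
    """Lowercase word tokens, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def grounding_score(answer: str, source: str) -> float:
    """Fraction of answer tokens that also appear in the source (0.0 to 1.0)."""
    answer_tokens = tokens(answer)
    if not answer_tokens:
        return 0.0
    return len(answer_tokens & tokens(source)) / len(answer_tokens)


source = (
    "Twig AI supports integration with Slack, Microsoft Teams, and Zendesk. "
    "Each integration requires OAuth authentication and can be configured "
    "from the Integrations page."
)
grounded = "Each integration requires OAuth authentication."
ungrounded = "Integrations need a paid enterprise license key."

# Answers scoring below the (assumed) threshold are queued for human review.
needs_review = [a for a in (grounded, ungrounded) if grounding_score(a, source) < 0.8]
```

A low score only signals that an answer introduces wording absent from the source; a reviewer or a validation prompt still has to decide whether it is wrong.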

### Paraphrasing and Variation

Generate alternative phrasings of existing content.

**Use Cases:**

* Different terminology (technical vs. layman)
* Various language styles (formal vs. casual)
* Different expertise levels (beginner vs. advanced)
* Regional variations (US vs. UK English)

**Example:**

Original:

```
"Initialize the SDK by providing your API key and organization ID."
```

Variations:

* "Start using the SDK by entering your API key and org ID."
* "Set up the SDK with your credentials: API key and organization ID."
* "To begin, configure the SDK using your API key and organization identifier."

### Scenario Expansion

Create examples and use cases that illustrate concepts.

**How It Works:**

1. Identify abstract concepts
2. Generate concrete examples
3. Create step-by-step scenarios
4. Include expected outcomes

**Example:**

Original concept:

```
"You can filter data sources by category and date range."
```

Expanded scenario:

```
Example: Filtering Customer Support Tickets

1. Navigate to Data Sources
2. Select "Category" filter
3. Choose "Customer Support"
4. Set date range: Last 30 days
5. Click "Apply Filters"

Result: Only customer support tickets from the past month will be displayed, 
making it easier to train your agent on recent support interactions.
```

### Edge Case Generation

Create examples covering unusual or complex scenarios.

**Types of Edge Cases:**

* Error conditions
* Unusual input formats
* Boundary conditions
* Complex multi-step workflows
* Integration failure scenarios

**Example:**

Standard case:

```
"How do I upload a document?"
```

Edge cases:

* "What happens if my document upload fails?"
* "Can I upload documents larger than 10MB?"
* "What if my PDF is password-protected?"
* "How do I handle upload timeouts?"

## Implementation Strategies

### Automated Generation

Use LLMs to automatically generate synthetic data.

**Process:**

1. Configure generation rules and templates
2. Process document chunks through LLM
3. Generate questions, answers, or variations
4. Review and validate output
5. Add to knowledge base

**Pros:**

* Fast and scalable
* Consistent format
* Covers large volumes

**Cons:**

* May require validation
* Can generate incorrect information
* Needs quality control

### Semi-Automated Generation

Combine AI generation with human review.

**Process:**

1. Auto-generate candidates
2. Human review and editing
3. Approval workflow
4. Integration into knowledge base

**Pros:**

* Better quality control
* Maintains accuracy
* Allows expert refinement

**Cons:**

* More time-intensive
* Requires human resources
* Slower to scale

### Manual Curation

Manually create synthetic examples based on real usage.

**Process:**

1. Analyze user queries
2. Identify gaps in coverage
3. Manually create Q&A pairs
4. Add examples and scenarios

**Pros:**

* Highest quality
* Addresses real user needs
* Expert-validated

**Cons:**

* Time-consuming
* Limited scalability
* Requires domain expertise

## Best Practices

### Quality Over Quantity

* Focus on high-quality, accurate synthetic data
* Validate generated content before adding to knowledge base
* Remove or fix incorrect synthetic data
* Run regular quality audits

### Maintain Source Attribution

* Link synthetic data to original sources
* Track generation method and date
* Enable easy updating when source changes
* Allow filtering by data type (original vs. synthetic)
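
A minimal sketch of what such attribution metadata might look like is shown below. The `KnowledgeItem` class, its field names, and the sample date are hypothetical and only illustrate the idea of tagging every item with its type, source, and generation details.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class KnowledgeItem:
    text: str
    data_type: str                           # "original" or "synthetic"
    source: str                              # document the item came from
    generation_method: Optional[str] = None  # e.g. "question_generation"
    generated_on: Optional[date] = None      # when the synthetic item was created


def filter_by_type(items: List[KnowledgeItem], data_type: str) -> List[KnowledgeItem]:
    """Return only the items of the requested type."""
    return [item for item in items if item.data_type == data_type]


items = [
    KnowledgeItem(
        text="Each integration requires OAuth authentication.",
        data_type="original",
        source="integrations-guide.md",
    ),
    KnowledgeItem(
        text="What authentication method is used for integrations?",
        data_type="synthetic",
        source="integrations-guide.md",
        generation_method="question_generation",
        generated_on=date(2024, 1, 15),  # illustrative date
    ),
]
synthetic_only = filter_by_type(items, "synthetic")
```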

### Balance Original and Synthetic

* Don't let synthetic data overwhelm original content
* Maintain a ratio of roughly 60-70% original to 30-40% synthetic content
* Prioritize original content in retrieval
* Use synthetic data to enhance, not replace
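
The ratio can be turned into a concrete cap. If original content must make up at least p percent of the knowledge base and there are N original items, the synthetic count S has to satisfy N / (N + S) >= p / 100, which gives S <= N * (100 - p) / p. The helper below is a hypothetical sketch of that arithmetic, using integer math to avoid rounding surprises.

```python
def max_synthetic_items(original_count: int, min_original_percent: int) -> int:
    """Largest synthetic count that keeps original content at or above the given share."""
    if not 0 < min_original_percent <= 100:
        raise ValueError("min_original_percent must be between 1 and 100")
    # Integer floor division: S <= N * (100 - p) / p
    return original_count * (100 - min_original_percent) // min_original_percent


cap_at_60 = max_synthetic_items(300, 60)  # 200 synthetic items
cap_at_70 = max_synthetic_items(300, 70)  # 128 synthetic items
```

With 300 original items, staying inside the 60-70% band therefore means keeping between 128 and 200 synthetic items at most.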

### Version Control

* Track versions of synthetic data
* Link to source document versions
* Update when sources change
* Archive outdated synthetic data
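
One possible way to link synthetic data to source versions is to store a hash of the source text at generation time and compare it with the current text later. The sketch below assumes a simple dictionary layout (`id`, `source`, `source_hash`) that is invented for illustration.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(synthetic_items, current_sources):
    """Return IDs of synthetic items whose source text changed or no longer exists."""
    stale = []
    for item in synthetic_items:
        current = current_sources.get(item["source"])
        if current is None or content_hash(current) != item["source_hash"]:
            stale.append(item["id"])
    return stale


old_text = "Initialize the SDK by providing your API key."
new_text = "Initialize the SDK by providing your API key and organization ID."

# Hypothetical records: each stores the hash of the source it was generated from.
synthetic_items = [
    {"id": "q1", "source": "sdk.md", "source_hash": content_hash(old_text)},
    {"id": "q2", "source": "faq.md", "source_hash": content_hash("FAQ text")},
    {"id": "q3", "source": "removed.md", "source_hash": content_hash("gone")},
]
current_sources = {"sdk.md": new_text, "faq.md": "FAQ text"}

stale_ids = find_stale(synthetic_items, current_sources)
```

Items flagged this way are candidates for regeneration or archiving.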

### Continuous Improvement

* Monitor which synthetic data gets used
* Remove unused synthetic examples
* Generate new data based on gaps
* A/B test with and without synthetic data

## Configuration Examples

### Question Generation Config

```json
{
  "syntheticData": {
    "questionGeneration": {
      "enabled": true,
      "questionsPerChunk": 3,
      "questionTypes": ["what", "how", "why", "when"],
      "difficultyLevels": ["basic", "intermediate"],
      "includeContext": true
    }
  }
}
```

### Paraphrasing Config

```json
{
  "syntheticData": {
    "paraphrasing": {
      "enabled": true,
      "variationsPerChunk": 2,
      "styles": ["formal", "casual"],
      "preserveTechnicalTerms": true
    }
  }
}
```
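
To get a feel for how these settings affect volume, the sketch below loads a trimmed copy of the two configs and estimates how many synthetic items a run would produce. It assumes that `questionsPerChunk` and `variationsPerChunk` are per-chunk counts, as their names suggest; `estimate_synthetic_items` is a hypothetical helper, not part of any Twig AI API.

```python
import json

config_json = """
{
  "syntheticData": {
    "questionGeneration": {"enabled": true, "questionsPerChunk": 3},
    "paraphrasing": {"enabled": true, "variationsPerChunk": 2}
  }
}
"""


def estimate_synthetic_items(config: dict, chunk_count: int) -> int:
    """Estimate total synthetic items, assuming the settings are per-chunk counts."""
    synthetic = config.get("syntheticData", {})
    per_chunk = 0
    questions = synthetic.get("questionGeneration", {})
    if questions.get("enabled"):
        per_chunk += questions.get("questionsPerChunk", 0)
    paraphrasing = synthetic.get("paraphrasing", {})
    if paraphrasing.get("enabled"):
        per_chunk += paraphrasing.get("variationsPerChunk", 0)
    return per_chunk * chunk_count


config = json.loads(config_json)
estimated = estimate_synthetic_items(config, chunk_count=100)  # (3 + 2) * 100
```

An estimate like this is useful for checking a planned run against whatever limit you set on synthetic content.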

## Measuring Impact

### Metrics to Track

* **Coverage**: Percentage of queries that retrieve relevant synthetic data
* **Retrieval Improvement**: Accuracy increase with synthetic data
* **User Satisfaction**: Feedback on responses using synthetic data
* **Usage Rate**: How often synthetic vs. original data is retrieved
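
Coverage and usage rate can be computed from a retrieval log. The sketch below assumes a hypothetical log format in which each query is recorded as the list of data types ("original" or "synthetic") of the items it retrieved.

```python
def usage_metrics(retrieval_log):
    """Compute coverage and synthetic usage rate from per-query retrieval results."""
    total_queries = len(retrieval_log)
    total_hits = sum(len(hits) for hits in retrieval_log)
    synthetic_hits = sum(hits.count("synthetic") for hits in retrieval_log)
    queries_with_synthetic = sum(1 for hits in retrieval_log if "synthetic" in hits)
    return {
        # Share of queries that retrieved at least one synthetic item
        "coverage": queries_with_synthetic / total_queries if total_queries else 0.0,
        # Share of all retrieved items that were synthetic
        "synthetic_usage_rate": synthetic_hits / total_hits if total_hits else 0.0,
    }


log = [
    ["original", "synthetic"],
    ["original", "original"],
    ["synthetic", "synthetic"],
    ["original"],
]
metrics = usage_metrics(log)
```

Here two of the four queries retrieved synthetic data, and three of the seven retrieved items were synthetic.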

### A/B Testing

Run experiments to validate synthetic data value:

1. **Control Group**: Users served by a knowledge base without synthetic data
2. **Test Group**: Users served by a knowledge base with synthetic data
3. **Compare**: Response quality, user satisfaction, retrieval accuracy
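
A minimal sketch of such an experiment is shown below. It assigns users to groups deterministically by hashing their ID, so a user always sees the same variant, and then compares mean ratings. The function names and the rating data are made up for illustration.

```python
import hashlib
from statistics import mean


def assign_group(user_id: str) -> str:
    """Deterministically place a user in 'control' or 'test' based on a hash of the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return "test" if digest[0] % 2 else "control"


def rating_lift(ratings_by_group: dict) -> float:
    """Difference in mean satisfaction rating, test minus control."""
    return mean(ratings_by_group["test"]) - mean(ratings_by_group["control"])


# Illustrative ratings on a 1-5 scale, not real results.
ratings = {"control": [3, 4], "test": [4, 5]}
lift = rating_lift(ratings)
```

A raw difference in means is only a starting point; with real traffic you would also want enough samples and a significance test before drawing conclusions.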

### Quality Metrics

* **Accuracy Rate**: Percentage of synthetic items verified as accurate against their source
* **Relevance Score**: How relevant synthetic data is to queries
* **Freshness**: Age of synthetic data vs. source documents

## Common Pitfalls

### Over-Generation

* **Problem**: Too much synthetic data dilutes quality
* **Solution**: Set limits and focus on high-value additions

### Inaccuracy

* **Problem**: Generated content contradicts source material
* **Solution**: Implement validation and review processes

### Staleness

* **Problem**: Synthetic data becomes outdated
* **Solution**: Regular regeneration tied to source updates

### Loss of Context

* **Problem**: Generated content lacks necessary context
* **Solution**: Include surrounding information and metadata

### Hallucination

* **Problem**: LLMs generate plausible but false information
* **Solution**: Strict validation against source material

## Tools and Techniques

### LLM Prompts for Question Generation

```
Given the following document excerpt, generate 3 relevant questions 
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational

Document excerpt:
[Your content here]

Format your response as:
1. [Question 1]
2. [Question 2]
3. [Question 3]
```
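
In code, a prompt like this is usually filled from a template and the numbered reply parsed back into a list. The sketch below shows one way to do that; `build_prompt` and `parse_numbered_questions` are hypothetical helpers, and the model call itself is left out.

```python
import re

PROMPT_TEMPLATE = """Given the following document excerpt, generate {n} relevant questions
that users might ask about this content. Ensure questions are:
- Specific and answerable from the excerpt
- Varied in type (what, how, why)
- Natural and conversational

Document excerpt:
{excerpt}

Format your response as a numbered list."""


def build_prompt(excerpt: str, n: int = 3) -> str:
    """Fill the template with the excerpt and the number of questions wanted."""
    return PROMPT_TEMPLATE.format(excerpt=excerpt, n=n)


def parse_numbered_questions(response: str) -> list:
    """Extract the text of lines shaped like '1. question' from a model response."""
    questions = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            questions.append(match.group(1))
    return questions


prompt = build_prompt("Twig AI supports integration with Slack.")
sample_response = (
    "1. What integrations does Twig AI support?\n"
    "Here is another one:\n"
    "2. How do I connect Twig AI to Slack?"
)
parsed = parse_numbered_questions(sample_response)
```

Parsing defensively matters because models sometimes add commentary around the list; lines that do not match the numbered pattern are ignored.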

### Validation Prompts

```
Review the following generated question and answer pair. 
Check if the answer is:
- Accurate according to the source
- Complete and helpful
- Free from hallucinations

Source: [Original content]
Question: [Generated question]
Answer: [Generated answer]

Is this Q&A pair accurate? (Yes/No)
If No, explain the issue:
```
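
The reply to a validation prompt still has to be turned into a decision. The sketch below parses a Yes/No verdict and treats anything unexpected as unverified so it can be routed to a human reviewer. `parse_verdict` is a hypothetical helper, and the reply strings are invented examples.

```python
import re


def parse_verdict(response: str):
    """Return (is_accurate, explanation); is_accurate is None if no Yes/No is found."""
    stripped = response.strip()
    match = re.match(
        r"(yes|no)\b[\s.:,-]*(.*)", stripped, flags=re.IGNORECASE | re.DOTALL
    )
    if not match:
        # Unrecognized reply: keep the raw text and leave the verdict open.
        return None, stripped
    is_accurate = match.group(1).lower() == "yes"
    return is_accurate, match.group(2).strip()


accepted = parse_verdict("Yes")
rejected = parse_verdict("No. The answer mentions channels, which the source does not.")
unclear = parse_verdict("Not sure")
```

The word-boundary check keeps replies such as "Not sure" from being misread as "No".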

## Advanced Techniques

### Multi-Document Synthesis

Generate synthetic data that combines information from multiple sources:

* Cross-reference related concepts
* Create comprehensive Q&A from scattered info
* Build workflows combining multiple docs

### Adaptive Generation

Automatically generate synthetic data based on query patterns:

* Monitor failed queries
* Identify coverage gaps
* Generate targeted synthetic content
* Close knowledge base gaps
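
The first two steps, monitoring failed queries and identifying gaps, can be sketched as a simple count over a query log. The log format below (a `topic` label and an `answered` flag per query) is an assumption made for illustration.

```python
from collections import Counter


def coverage_gaps(query_log, min_failures=2):
    """Return topics with at least min_failures failed queries, most frequent first."""
    failures = Counter(
        entry["topic"] for entry in query_log if not entry["answered"]
    )
    return [topic for topic, count in failures.most_common() if count >= min_failures]


# Hypothetical log entries; real logs would come from your agent's analytics.
query_log = [
    {"topic": "billing", "answered": False},
    {"topic": "billing", "answered": False},
    {"topic": "sso", "answered": False},
    {"topic": "billing", "answered": False},
    {"topic": "uploads", "answered": False},
    {"topic": "sso", "answered": False},
    {"topic": "uploads", "answered": True},
]
gaps = coverage_gaps(query_log)
```

Topics returned here are the ones to target first when generating new synthetic content.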

### Persona-Based Generation

Create variations for different user types:

```json
{
  "personas": [
    {
      "type": "technical",
      "tone": "formal",
      "detail": "high",
      "terminology": "technical"
    },
    {
      "type": "business",
      "tone": "professional",
      "detail": "medium",
      "terminology": "layman"
    }
  ]
}
```
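
A persona definition like this could be turned into a rewriting instruction for the LLM. The sketch below shows one hypothetical way to do it; the wording of the instruction is an assumption, and only the persona fields come from the example above.

```python
personas = [
    {"type": "technical", "tone": "formal", "detail": "high", "terminology": "technical"},
    {"type": "business", "tone": "professional", "detail": "medium", "terminology": "layman"},
]


def persona_instruction(persona: dict) -> str:
    """Build a one-paragraph rewriting instruction from a persona definition."""
    return (
        f"Rewrite the content for a {persona['type']} audience. "
        f"Use a {persona['tone']} tone, {persona['detail']} level of detail, "
        f"and {persona['terminology']} terminology."
    )


instructions = [persona_instruction(p) for p in personas]
```

Each instruction would be prepended to the source content before it is sent to the model, producing one variation per persona.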

## Next Steps

* [Chunking Strategies](/product/data-prep/chunking-strategies.md) - Optimize how your source data is split
* [Data Manipulations](/product/data-prep/data-manipulation.md) - Transform and enrich your data further


