Synthetic Data

Synthetic data generation enhances your knowledge base by creating additional training examples, questions, and variations of existing content. This improves your AI agent's ability to understand and respond to a wider range of user queries.

What is Synthetic Data?

Synthetic data refers to artificially generated content that supplements your original data. In the context of AI agents, this typically includes:

  • Question-Answer Pairs: Generated questions based on your documents

  • Paraphrased Content: Alternative phrasings of existing information

  • Edge Cases: Variations covering uncommon query patterns

  • Expanded Examples: Additional context and use cases

Why Use Synthetic Data?

Coverage Enhancement

Original documentation often doesn't cover all possible ways users might ask questions. Synthetic data fills these gaps by:

  • Generating multiple question variations for each concept

  • Creating questions for different expertise levels

  • Covering different phrasings and terminology

  • Addressing implicit questions not explicitly stated in docs

Improved Retrieval

More diverse data improves semantic search:

  • Better embedding coverage of semantic space

  • Higher likelihood of matching user queries

  • Reduced dependency on exact keyword matches

  • Improved ranking of relevant results

Training and Testing

Synthetic data enables better evaluation:

  • Create test sets for measuring accuracy

  • Generate scenarios for stress testing

  • Validate coverage of key topics

  • Benchmark performance improvements

Types of Synthetic Data

Question Generation

Automatically generate questions from your existing content.

How It Works:

  1. Analyze document chunks

  2. Identify key concepts and facts

  3. Generate relevant questions using LLMs

  4. Associate questions with source content
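
The four steps above can be sketched as follows. This is a minimal illustration and not the product's implementation: `Chunk`, `GeneratedQuestion`, `generate_questions`, and the `llm` callable are hypothetical names, and the model is assumed to return one question per line.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    chunk_id: str  # hypothetical identifier of the source chunk
    text: str


@dataclass
class GeneratedQuestion:
    question: str
    source_chunk_id: str  # step 4: link back to the source content


def generate_questions(
    chunks: List[Chunk],
    llm: Callable[[str], str],
    per_chunk: int = 3,
) -> List[GeneratedQuestion]:
    """Ask an LLM (any callable: prompt in, text out) for questions per chunk."""
    results: List[GeneratedQuestion] = []
    for chunk in chunks:
        prompt = (
            f"Write {per_chunk} questions that the following text answers, "
            f"one per line:\n\n{chunk.text}"
        )
        raw = llm(prompt)
        for line in raw.splitlines():
            question = line.strip()
            if question:  # skip blank lines in the model output
                results.append(GeneratedQuestion(question, chunk.chunk_id))
    return results
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client and makes it easy to test with a stub.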

Example:

Original content:

Generated questions:

  • "What integrations does Twig AI support?"

  • "How do I connect Twig AI to Slack?"

  • "What authentication method is used for integrations?"

  • "Where can I configure integrations in Twig AI?"

Benefits:

  • Improves retrieval for question-style queries

  • Covers different ways of asking the same thing

  • Explicitly links questions to answers

Answer Generation

Create complete Q&A pairs from documentation.

How It Works:

  1. Generate questions as above

  2. Extract or generate concise answers

  3. Include source references

  4. Validate accuracy

Example Structure:
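
One possible shape for such a pair is sketched below. The field names (`question`, `answer`, `source_ids`, `validated`) are assumptions chosen for illustration, not a documented schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class QAPair:
    question: str
    answer: str
    source_ids: List[str] = field(default_factory=list)  # references to source documents
    validated: bool = False  # set to True once accuracy has been checked


def to_json(pair: QAPair) -> str:
    """Serialize a pair for storage or review."""
    return json.dumps(asdict(pair), sort_keys=True)
```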

Benefits:

  • Provides ready-to-use Q&A format

  • Optimized for direct answering

  • Easier to validate and edit

Paraphrasing and Variation

Generate alternative phrasings of existing content.

Use Cases:

  • Different terminology (technical vs. layman)

  • Various language styles (formal vs. casual)

  • Different expertise levels (beginner vs. advanced)

  • Regional variations (US vs. UK English)

Example:

Original:

Variations:

  • "Start using the SDK by entering your API key and org ID."

  • "Set up the SDK with your credentials: API key and organization ID."

  • "To begin, configure the SDK using your API key and organization identifier."

Scenario Expansion

Create examples and use cases that illustrate concepts.

How It Works:

  1. Identify abstract concepts

  2. Generate concrete examples

  3. Create step-by-step scenarios

  4. Include expected outcomes

Example:

Original concept:

Expanded scenario:

Edge Case Generation

Create examples covering unusual or complex scenarios.

Types of Edge Cases:

  • Error conditions

  • Unusual input formats

  • Boundary conditions

  • Complex multi-step workflows

  • Integration failure scenarios

Example:

Standard case:

Edge cases:

  • "What happens if my document upload fails?"

  • "Can I upload documents larger than 10MB?"

  • "What if my PDF is password-protected?"

  • "How do I handle upload timeouts?"

Implementation Strategies

Automated Generation

Use LLMs to automatically generate synthetic data.

Process:

  1. Configure generation rules and templates

  2. Process document chunks through LLM

  3. Generate questions, answers, or variations

  4. Review and validate output

  5. Add to knowledge base
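
As a rough sketch of this process, the generation and validation steps can be passed in as functions so that the pipeline itself stays small. `run_pipeline`, `generate`, and `validate` are hypothetical names, and storing the accepted items is left to the caller.

```python
from typing import Callable, Dict, Iterable, List


def run_pipeline(
    chunks: Iterable[str],
    generate: Callable[[str], List[str]],
    validate: Callable[[str, str], bool],
) -> List[Dict[str, str]]:
    """Generate candidates per chunk and keep only those that pass validation."""
    accepted: List[Dict[str, str]] = []
    for chunk in chunks:
        for candidate in generate(chunk):      # steps 2-3: LLM generation
            if validate(chunk, candidate):     # step 4: review and validate
                accepted.append({"source": chunk, "text": candidate})
    return accepted                            # step 5: caller adds these to the knowledge base
```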

Pros:

  • Fast and scalable

  • Consistent format

  • Covers large volumes

Cons:

  • May require validation

  • Can generate incorrect information

  • Needs quality control

Semi-Automated Generation

Combine AI generation with human review.

Process:

  1. Auto-generate candidates

  2. Human review and editing

  3. Approval workflow

  4. Integration into knowledge base
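
A review step like this can be modeled as a small state machine. The sketch below assumes three states and a single reviewer per candidate; `Status`, `Candidate`, `review`, and `approved_only` are illustrative names only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Candidate:
    text: str
    status: Status = Status.PENDING
    reviewer: Optional[str] = None


def review(
    candidate: Candidate,
    reviewer: str,
    approve: bool,
    edited_text: Optional[str] = None,
) -> Candidate:
    """Record a human decision, optionally replacing the text with an edited version."""
    if candidate.status is not Status.PENDING:
        raise ValueError("candidate has already been reviewed")
    if edited_text is not None:
        candidate.text = edited_text
    candidate.status = Status.APPROVED if approve else Status.REJECTED
    candidate.reviewer = reviewer
    return candidate


def approved_only(candidates: List[Candidate]) -> List[Candidate]:
    """Only approved candidates should reach the knowledge base."""
    return [c for c in candidates if c.status is Status.APPROVED]
```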

Pros:

  • Better quality control

  • Maintains accuracy

  • Allows expert refinement

Cons:

  • More time-intensive

  • Requires human resources

  • Slower to scale

Manual Curation

Manually create synthetic examples based on real usage.

Process:

  1. Analyze user queries

  2. Identify gaps in coverage

  3. Manually create Q&A pairs

  4. Add examples and scenarios

Pros:

  • Highest quality

  • Addresses real user needs

  • Expert-validated

Cons:

  • Time-consuming

  • Limited scalability

  • Requires domain expertise

Best Practices

Quality Over Quantity

  • Focus on high-quality, accurate synthetic data

  • Validate generated content before adding to knowledge base

  • Remove or fix incorrect synthetic data

  • Run regular quality audits

Maintain Source Attribution

  • Link synthetic data to original sources

  • Track generation method and date

  • Enable easy updating when source changes

  • Allow filtering by data type (original vs. synthetic)
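
These attribution points map naturally onto per-entry metadata. The sketch below is one way to do it; the `Entry` fields and the helper names are assumptions, not a schema defined by this document.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Entry:
    text: str
    kind: str                             # "original" or "synthetic"
    source_id: Optional[str] = None       # for synthetic entries: the original they derive from
    method: Optional[str] = None          # how the entry was generated
    generated_on: Optional[date] = None   # when the entry was generated


def filter_by_kind(entries: List[Entry], kind: str) -> List[Entry]:
    """Select only original or only synthetic entries."""
    return [e for e in entries if e.kind == kind]


def derived_from(entries: List[Entry], source_id: str) -> List[Entry]:
    """Find the synthetic entries to revisit when a given source changes."""
    return [e for e in entries if e.kind == "synthetic" and e.source_id == source_id]
```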

Balance Original and Synthetic

  • Don't let synthetic data overwhelm original content

  • Aim for a ratio of roughly 60-70% original to 30-40% synthetic content

  • Prioritize original content in retrieval

  • Use synthetic data to enhance, not replace
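
The balance can be checked with simple arithmetic. The sketch below computes the synthetic share of a knowledge base and the largest synthetic count that stays within a cap; the function names and the 40% default are illustrative, and integer arithmetic is used to avoid rounding surprises.

```python
def synthetic_share(original_count: int, synthetic_count: int) -> float:
    """Fraction of all entries that are synthetic."""
    total = original_count + synthetic_count
    if total == 0:
        return 0.0
    return synthetic_count / total


def max_synthetic_allowed(original_count: int, max_share_percent: int = 40) -> int:
    """Largest s such that s / (original_count + s) <= max_share_percent / 100."""
    if not 0 <= max_share_percent < 100:
        raise ValueError("max_share_percent must be in the range 0-99")
    # s / (o + s) <= p / 100  rearranges to  s <= o * p / (100 - p)
    return (original_count * max_share_percent) // (100 - max_share_percent)
```

For example, with 600 original entries and a 40% cap, up to 400 synthetic entries fit, since 400 of 1,000 total entries is exactly 40%.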

Version Control

  • Track versions of synthetic data

  • Link to source document versions

  • Update when sources change

  • Archive outdated synthetic data
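
One way to link synthetic data to source versions is to store a hash of the source text at generation time and compare it later. This is a sketch under that assumption; the dictionary keys `source_id` and `source_hash` and the function names are hypothetical.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of a source document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def find_stale(
    synthetic_entries: List[Dict[str, str]],
    current_sources: Dict[str, str],
) -> List[Dict[str, str]]:
    """Return entries whose source has changed or no longer exists."""
    stale = []
    for entry in synthetic_entries:
        current_text = current_sources.get(entry["source_id"])
        if current_text is None or content_hash(current_text) != entry["source_hash"]:
            stale.append(entry)
    return stale
```

Entries returned by `find_stale` are candidates for regeneration or archiving.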

Continuous Improvement

  • Monitor which synthetic data gets used

  • Remove unused synthetic examples

  • Generate new data based on gaps

  • A/B test with and without synthetic data

Configuration Examples

Question Generation Config
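
As an illustration of the kinds of settings a question generation step might expose, here is a sketch expressed as a Python dataclass. Every field name and default below is an assumption made for this example; none is a documented product setting.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QuestionGenerationConfig:
    questions_per_chunk: int = 3
    question_types: Tuple[str, ...] = ("factual", "how-to")
    min_chunk_chars: int = 200   # skip chunks too short to yield useful questions
    require_review: bool = True  # hold output for human review before indexing

    def __post_init__(self) -> None:
        # Fail early on settings that would produce no output.
        if self.questions_per_chunk < 1:
            raise ValueError("questions_per_chunk must be at least 1")
        if not self.question_types:
            raise ValueError("question_types must not be empty")
```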

Paraphrasing Config
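
A paraphrasing setup could follow the variation axes listed under Paraphrasing and Variation (style, expertise level, regional English). The sketch below is hypothetical: the field names and defaults are assumptions, not documented settings.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Tuple


@dataclass(frozen=True)
class ParaphraseConfig:
    variations_per_passage: int = 3
    styles: Tuple[str, ...] = ("formal", "casual")
    expertise_levels: Tuple[str, ...] = ("beginner", "advanced")
    locale: str = "en-US"  # regional variant to target

    def combinations(self) -> List[Tuple[str, str]]:
        """Every (style, expertise level) pair a generator could be asked for."""
        return list(product(self.styles, self.expertise_levels))
```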

Measuring Impact

Metrics to Track

  • Coverage: Percentage of queries for which relevant synthetic data is retrieved

  • Retrieval Improvement: Accuracy increase with synthetic data

  • User Satisfaction: Feedback on responses using synthetic data

  • Usage Rate: How often synthetic vs. original data is retrieved
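
Coverage and usage rate can be computed from retrieval logs. The sketch below assumes each log record is simply the list of data types ("original" or "synthetic") retrieved for one query, and it treats any retrieved synthetic item as relevant, which is a simplification. The function names are illustrative.

```python
from typing import List


def coverage(logs: List[List[str]]) -> float:
    """Fraction of queries for which at least one synthetic item was retrieved."""
    if not logs:
        return 0.0
    hits = sum(1 for kinds in logs if "synthetic" in kinds)
    return hits / len(logs)


def synthetic_usage_rate(logs: List[List[str]]) -> float:
    """Fraction of all retrieved items that are synthetic."""
    total = sum(len(kinds) for kinds in logs)
    if total == 0:
        return 0.0
    synthetic = sum(kinds.count("synthetic") for kinds in logs)
    return synthetic / total
```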

A/B Testing

Run experiments to validate synthetic data value:

  1. Control Group: Users served from a knowledge base without synthetic data

  2. Test Group: Users served from a knowledge base that includes synthetic data

  3. Compare: Response quality, user satisfaction, retrieval accuracy
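
A minimal sketch of such an experiment follows. It assumes users are identified by a string ID and that each response receives a numeric quality score; `assign_group` and `compare` are hypothetical helpers. Hashing the ID keeps each user in the same group across sessions.

```python
import hashlib
from statistics import mean
from typing import Dict, List


def assign_group(user_id: str, experiment: str = "synthetic-data") -> str:
    """Deterministically place a user in the control or test group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    return "test" if digest[0] % 2 else "control"


def compare(scores_by_group: Dict[str, List[float]]) -> Dict[str, float]:
    """Mean score per group; groups with no scores are omitted."""
    return {group: mean(scores) for group, scores in scores_by_group.items() if scores}
```

Comparing means is only a first look. A real experiment would also need enough users per group and a significance test before drawing conclusions.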

Quality Metrics

  • Accuracy Rate: Percentage of accurate synthetic data

  • Relevance Score: How relevant synthetic data is to queries

  • Freshness: Age of synthetic data vs. source documents

Common Pitfalls

Over-Generation

  • Problem: Too much synthetic data dilutes quality

  • Solution: Set limits and focus on high-value additions

Inaccuracy

  • Problem: Generated content contradicts source material

  • Solution: Implement validation and review processes

Staleness

  • Problem: Synthetic data becomes outdated

  • Solution: Regular regeneration tied to source updates

Loss of Context

  • Problem: Generated content lacks necessary context

  • Solution: Include surrounding information and metadata

Hallucination

  • Problem: LLMs generate plausible but false information

  • Solution: Strict validation against source material

Tools and Techniques

LLM Prompts for Question Generation
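
A question generation prompt might look like the sketch below. The wording is an illustrative assumption, not a prompt taken from the product, and `QUESTION_PROMPT` and `build_question_prompt` are hypothetical names.

```python
QUESTION_PROMPT = """You are generating questions for a knowledge base.
Read the passage and write {n} distinct questions that it answers.
Use only information stated in the passage.
Return one question per line with no numbering.

Passage:
{passage}
"""


def build_question_prompt(passage: str, n: int = 3) -> str:
    """Fill the template for one document chunk."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return QUESTION_PROMPT.format(n=n, passage=passage.strip())
```

Restricting the model to information stated in the passage is intended to reduce hallucinated questions, though it does not remove the need for validation.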

Validation Prompts
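
Validation can reuse an LLM as a checker that compares a generated answer with its source. The sketch below is illustrative only: the prompt wording, the one-word verdict format, and the function names are assumptions.

```python
VALIDATION_PROMPT = """You are checking a generated answer against its source.

Source:
{source}

Question: {question}
Answer: {answer}

Is every claim in the answer supported by the source?
Reply with exactly one word: SUPPORTED or UNSUPPORTED.
"""


def build_validation_prompt(source: str, question: str, answer: str) -> str:
    """Fill the template for one Q&A pair."""
    return VALIDATION_PROMPT.format(source=source, question=question, answer=answer)


def parse_verdict(response: str) -> bool:
    """True only for an exact SUPPORTED verdict; anything else fails closed."""
    return response.strip().upper() == "SUPPORTED"
```

The verdict is compared for equality rather than by substring because "UNSUPPORTED" contains "SUPPORTED".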

Advanced Techniques

Multi-Document Synthesis

Generate synthetic data that combines information from multiple sources:

  • Cross-reference related concepts

  • Create comprehensive Q&A from scattered info

  • Build workflows combining multiple docs

Adaptive Generation

Automatically generate synthetic data based on query patterns:

  • Monitor failed queries

  • Identify coverage gaps

  • Generate targeted synthetic content

  • Close knowledge base gaps
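
The monitoring and gap-identification steps can be sketched as a frequency count over failed queries. This assumes each failed query has already been labeled with a topic upstream; `coverage_gaps` and the `min_count` threshold are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def coverage_gaps(
    failed_queries: List[Tuple[str, str]],
    min_count: int = 2,
) -> List[str]:
    """Topics with at least min_count failed queries, most frequent first.

    failed_queries holds (query_text, topic) pairs.
    """
    counts = Counter(topic for _, topic in failed_queries)
    return [topic for topic, count in counts.most_common() if count >= min_count]
```

The returned topics can then be fed to a generation step to produce targeted synthetic content.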

Persona-Based Generation

Create variations for different user types:
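
As a sketch, each persona can be expressed as a style instruction that is combined with the passage to form a rewrite prompt. The personas and wording below are illustrative assumptions, not a list defined by the product.

```python
from typing import Dict

# Hypothetical personas mapped to style instructions.
PERSONAS: Dict[str, str] = {
    "beginner": "Use plain language and avoid jargon.",
    "developer": "Assume familiarity with APIs and include technical terms.",
    "administrator": "Focus on configuration, permissions, and rollout.",
}


def persona_prompts(passage: str) -> Dict[str, str]:
    """Build one rewrite prompt per persona for the same passage."""
    return {
        name: f"Rewrite the passage for this reader. {style}\n\nPassage:\n{passage}"
        for name, style in PERSONAS.items()
    }
```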
