Synthetic Data
Synthetic data generation enhances your knowledge base by creating additional training examples, questions, and variations of existing content. This improves your AI agent's ability to understand and respond to a wider range of user queries.
What is Synthetic Data?
Synthetic data refers to artificially generated content that supplements your original data. In the context of AI agents, this typically includes:
Question-Answer Pairs: Generated questions based on your documents
Paraphrased Content: Alternative phrasings of existing information
Edge Cases: Variations covering uncommon query patterns
Expanded Examples: Additional context and use cases
Why Use Synthetic Data?
Coverage Enhancement
Original documentation often doesn't cover all possible ways users might ask questions. Synthetic data fills these gaps by:
Generating multiple question variations for each concept
Creating questions for different expertise levels
Covering different phrasings and terminology
Addressing implicit questions not explicitly stated in docs
Improved Retrieval
More diverse data improves semantic search:
Better embedding coverage of semantic space
Higher likelihood of matching user queries
Reduced dependency on exact keyword matches
Improved ranking of relevant results
Training and Testing
Synthetic data enables better evaluation:
Create test sets for measuring accuracy
Generate scenarios for stress testing
Validate coverage of key topics
Benchmark performance improvements
Types of Synthetic Data
Question Generation
Automatically generate questions from your existing content.
How It Works:
Analyze document chunks
Identify key concepts and facts
Generate relevant questions using LLMs
Associate questions with source content
Example:
Given a documentation passage describing Twig AI's integrations (for example, connecting to Slack and how integration authentication is configured), generated questions might include:
"What integrations does Twig AI support?"
"How do I connect Twig AI to Slack?"
"What authentication method is used for integrations?"
"Where can I configure integrations in Twig AI?"
Benefits:
Improves retrieval for question-style queries
Covers different ways of asking the same thing
Explicitly links questions to answers
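As a rough sketch of the steps above: prompt an LLM with each chunk and collect the questions it returns. The helper below is illustrative only, and the llm callable is a stand-in for whichever model client you use; it is not Twig AI's implementation.

from typing import Callable, List

def generate_questions(chunk: str, llm: Callable[[str], str], n: int = 4) -> List[str]:
    """Ask the model for n questions that the chunk alone can answer."""
    prompt = (
        f"Read the documentation excerpt below and write {n} questions that a user "
        "could answer using only this text. Return one question per line.\n\n"
        f"Excerpt:\n{chunk}"
    )
    raw = llm(prompt)  # stand-in for your model client's completion call
    return [line.strip() for line in raw.splitlines() if line.strip()]

Each returned question can then be stored alongside the chunk it came from, which is what links questions to their answers.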
Answer Generation
Create complete Q&A pairs from documentation.
How It Works:
Generate questions as above
Extract or generate concise answers
Include source references
Validate accuracy
Example Structure:
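For example, a stored pair might look like the record below. The field names and values are illustrative assumptions, not Twig AI's internal schema.

qa_pair = {
    "question": "How do I connect Twig AI to Slack?",  # generated question
    "answer": "...",                                   # concise answer extracted or generated from the source
    "source_document": "integrations.md",              # hypothetical source reference
    "source_chunk_id": "integrations.md#chunk-3",
    "generation_method": "llm",
    "generated_at": "2024-03-02",
    "validated": False,                                # set to True after review
}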
Benefits:
Provides ready-to-use Q&A format
Optimized for direct answering
Easier to validate and edit
Paraphrasing and Variation
Generate alternative phrasings of existing content.
Use Cases:
Different terminology (technical vs. plain language)
Various language styles (formal vs. casual)
Different expertise levels (beginner vs. advanced)
Regional variations (US vs. UK English)
Example:
Given an original sentence explaining that you configure the SDK with your API key and organization ID, generated variations might include:
"Start using the SDK by entering your API key and org ID."
"Set up the SDK with your credentials: API key and organization ID."
"To begin, configure the SDK using your API key and organization identifier."
Scenario Expansion
Create examples and use cases that illustrate concepts.
How It Works:
Identify abstract concepts
Generate concrete examples
Create step-by-step scenarios
Include expected outcomes
Example: an abstract concept from the documentation (such as a short feature description) is expanded into a concrete, step-by-step scenario that ends with an expected outcome.
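One lightweight way to implement this is another prompt template. The sketch below is illustrative, and the concept placeholder is filled with whichever documentation statement you want to expand.

SCENARIO_PROMPT = (
    "Take the concept below from our documentation and turn it into a concrete, "
    "step-by-step scenario: describe a realistic user, the steps they take, and "
    "the expected outcome.\n\n"
    "Concept: {concept}"
)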
Edge Case Generation
Create examples covering unusual or complex scenarios.
Types of Edge Cases:
Error conditions
Unusual input formats
Boundary conditions
Complex multi-step workflows
Integration failure scenarios
Example:
Starting from a standard question about uploading documents, generated edge cases might include:
"What happens if my document upload fails?"
"Can I upload documents larger than 10MB?"
"What if my PDF is password-protected?"
"How do I handle upload timeouts?"
Implementation Strategies
Automated Generation
Use LLMs to automatically generate synthetic data.
Process:
Configure generation rules and templates
Process document chunks through LLM
Generate questions, answers, or variations
Review and validate output
Add to knowledge base
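Put together, the loop might look like the sketch below. It is illustrative: it reuses the generate_questions helper from the question-generation sketch above, and the status field stands in for whatever review workflow you use.

def run_generation(chunks, llm):
    """Process each chunk, generate candidate questions, and queue them for review."""
    review_queue = []
    for chunk in chunks:
        # generate_questions is the helper from the question-generation sketch above
        for question in generate_questions(chunk, llm):
            review_queue.append({
                "question": question,
                "source_chunk": chunk,
                "status": "pending_review",  # approved items are then added to the knowledge base
            })
    return review_queue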
Pros:
Fast and scalable
Consistent format
Covers large volumes
Cons:
May require validation
Can generate incorrect information
Needs quality control
Semi-Automated Generation
Combine AI generation with human review.
Process:
Auto-generate candidates
Human review and editing
Approval workflow
Integration into knowledge base
Pros:
Better quality control
Maintains accuracy
Allows expert refinement
Cons:
More time-intensive
Requires human resources
Slower to scale
Manual Curation
Manually create synthetic examples based on real usage.
Process:
Analyze user queries
Identify gaps in coverage
Manually create Q&A pairs
Add examples and scenarios
Pros:
Highest quality
Addresses real user needs
Expert-validated
Cons:
Time-consuming
Limited scalability
Requires domain expertise
Best Practices
Quality Over Quantity
Focus on high-quality, accurate synthetic data
Validate generated content before adding to knowledge base
Remove or fix incorrect synthetic data
Run regular quality audits
Maintain Source Attribution
Link synthetic data to original sources
Track generation method and date
Enable easy updating when source changes
Allow filtering by data type (original vs. synthetic)
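In practice, this means storing a small metadata record with every synthetic item. The fields below are illustrative rather than a required schema.

synthetic_item_metadata = {
    "data_type": "synthetic",                  # enables filtering original vs. synthetic
    "source_document": "getting-started.md",   # hypothetical link back to the source
    "source_version": "2024-03-01",            # source document version the item was generated from
    "generation_method": "llm_question_generation",
    "generated_at": "2024-03-02",
    "review_status": "approved",
}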
Balance Original and Synthetic
Don't let synthetic data overwhelm original content
Aim for roughly 60-70% original content and 30-40% synthetic
Prioritize original content in retrieval
Use synthetic data to enhance, not replace
Version Control
Track versions of synthetic data
Link to source document versions
Update when sources change
Archive outdated synthetic data
Continuous Improvement
Monitor which synthetic data gets used
Remove unused synthetic examples
Generate new data based on gaps
A/B test with and without synthetic data
Configuration Examples
Question Generation Config
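An illustrative question-generation configuration (the option names and values are assumptions, not Twig AI's actual settings):

question_generation_config = {
    "questions_per_chunk": 3,
    "max_total_questions": 500,
    "expertise_levels": ["beginner", "advanced"],
    "link_to_source": True,
    "require_human_review": True,
}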
Paraphrasing Config
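Likewise, an illustrative paraphrasing configuration (again, assumed option names):

paraphrasing_config = {
    "variations_per_item": 2,
    "styles": ["formal", "casual"],
    "locales": ["en-US", "en-GB"],
    "preserve_product_terminology": True,
}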
Measuring Impact
Metrics to Track
Coverage: Percentage of queries finding relevant synthetic data
Retrieval Improvement: Accuracy increase with synthetic data
User Satisfaction: Feedback on responses using synthetic data
Usage Rate: How often synthetic vs. original data is retrieved
A/B Testing
Run experiments to validate synthetic data value:
Control Group: Users without synthetic data
Test Group: Users with synthetic data
Compare: Response quality, user satisfaction, retrieval accuracy
Quality Metrics
Accuracy Rate: Percentage of accurate synthetic data
Relevance Score: How relevant synthetic data is to queries
Freshness: Age of synthetic data vs. source documents
Common Pitfalls
Over-Generation
Problem: Too much synthetic data dilutes quality
Solution: Set limits and focus on high-value additions
Inaccuracy
Problem: Generated content contradicts source material
Solution: Implement validation and review processes
Staleness
Problem: Synthetic data becomes outdated
Solution: Regular regeneration tied to source updates
Loss of Context
Problem: Generated content lacks necessary context
Solution: Include surrounding information and metadata
Hallucination
Problem: LLMs generate plausible but false information
Solution: Strict validation against source material
Tools and Techniques
LLM Prompts for Question Generation
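One workable prompt, shown here as a Python template string (illustrative, not a built-in prompt; adjust the constraints to your domain):

QUESTION_GENERATION_PROMPT = """\
You are generating training questions for a support knowledge base.
Write 3-5 questions that can be answered using only the excerpt below.
Vary the phrasing and expertise level, do not invent facts that are not
in the excerpt, and return one question per line.

Excerpt:
{chunk}
"""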
Validation Prompts
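Validation can be prompt-driven as well, for example by asking a second model pass to check each pair against its source (illustrative):

VALIDATION_PROMPT = """\
You are reviewing a generated question-answer pair against its source text.
Reply with exactly one word: SUPPORTED if the answer is fully supported by
the source, CONTRADICTED if it conflicts with the source, or UNSUPPORTED if
the source does not contain enough information to verify it.

Source:
{source_chunk}

Question: {question}
Answer: {answer}
"""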
Advanced Techniques
Multi-Document Synthesis
Generate synthetic data that combines information from multiple sources:
Cross-reference related concepts
Create comprehensive Q&A pairs from information scattered across documents
Build workflows combining multiple docs
Adaptive Generation
Automatically generate synthetic data based on query patterns:
Monitor failed queries
Identify coverage gaps
Generate targeted synthetic content
Close knowledge base gaps
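A minimal sketch of this loop, assuming you already log failed queries and have a retrieve helper that pulls candidate source text (both names are placeholders):

def close_gaps(failed_queries, llm, retrieve):
    """Generate targeted synthetic Q&A for queries that found no good answer."""
    new_items = []
    for query in failed_queries:
        context = retrieve(query)  # best-effort source context, if any
        if not context:
            continue               # a genuine documentation gap; flag it for writers instead
        answer = llm(f"Answer the question using only this text:\n{context}\n\nQuestion: {query}")
        new_items.append({"question": query, "answer": answer, "status": "pending_review"})
    return new_items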
Persona-Based Generation
Create variations for different user types:
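For example, personas and the kind of questions generated for each could be defined as follows (hypothetical personas, not a fixed product list):

personas = {
    "end_user": "plain-language, task-oriented questions with no jargon",
    "developer": "API- and SDK-focused questions using technical terminology",
    "administrator": "questions about configuration, permissions, and user management",
}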
Next Steps
Chunking Strategies - Optimize how your source data is split
Data Manipulations - Transform and enrich your data further