# Domain-Specific Vocabulary

## The Problem

General-purpose embedding models don't understand industry-specific terms, internal jargon, product names, or technical acronyms unique to your organization.

### Symptoms

* ❌ Company acronyms treated as random strings
* ❌ Product names don't match descriptions
* ❌ Industry jargon retrieves generic results
* ❌ Internal tools/systems not recognized
* ❌ Technical terms embedded poorly

### Real-World Example

```
Company uses "GTM" internally (Go-To-Market strategy)

User query: "GTM timeline"
General embedding model knows "GTM" as:
→ Google Tag Manager (web analytics)
→ Not go-to-market

Retrieves: Web analytics docs
Should retrieve: Marketing strategy docs

Internal meaning != general meaning
Model trained on public data doesn't know company context
```

***

## Deep Technical Analysis

### Out-of-Vocabulary Terms

Company-specific terms not in training data:

**The Unknown Token Problem:**

```
Company product: "ThroughputMax-3000"

General tokenizer:
→ Breaks into: ["Through", "##put", "##Max", "-", "3000"]
→ 5 generic subword tokens
→ No semantic unity

vs.

Company term: "OAuth" (in training data)
→ Single token: ["OAuth"]
→ Proper semantic representation

Product name fragmented
→ Embedding doesn't capture it as single concept
→ Weak retrieval
```

**Acronym Ambiguity:**

```
Internal acronym: "CRM" = Customer Risk Model
Industry standard: "CRM" = Customer Relationship Management

Embedding trained on public data:
→ Strongly associates "CRM" with Salesforce, relationships
→ Not with risk modeling

Query: "CRM scoring thresholds"
→ User means: Risk Model scoring
→ Model retrieves: Salesforce configuration docs

Context-specific meaning lost
```

### Industry Jargon and Terminology

Specialized fields have unique vocabularies:

**Medical Domain:**

```
Term: "hypertension"
General model: Moderate understanding (common medical term)

Term: "essential hypertension"
General model: Weaker (more specific)

Term: "refractory essential hypertension with target organ damage"
General model: Poor (highly specialized)

Tokenization: ["refract", "##ory", "essential", "hyper", "##tension", ...]
→ 10+ subword pieces
→ Semantic meaning fragmented
```

**Legal Domain:**

```
Phrase: "res judicata"
General model: Rare Latin legal term
→ May not be in vocabulary
→ Or poorly represented

vs.

Phrase: "dismissed with prejudice"
General model: Knows "dismissed" and "prejudice" separately
→ But not the legal meaning of the phrase
→ Literal interpretation != legal interpretation
```

**Engineering/Technical:**

```
Company uses: "fsync durability guarantees"

General model:
→ "fsync" not in vocabulary (Linux system call)
→ "durability" understood (general word)
→ Together: Weak association

Domain model would know:
→ fsync → data persistence
→ durability → ACID properties
→ Together: Database consistency guarantees
```

### Named Entity Recognition Gaps

Proper nouns and brand names:

**Product Names:**

```
Company products:
→ "CloudSync Pro"
→ "DataVault Enterprise"
→ "Quantum Analytics"

General model:
→ "CloudSync" = cloud + sync (generic concepts)
→ Not recognized as specific product

User query: "CloudSync pricing"
→ May retrieve: General cloud sync comparison docs
→ Should retrieve: CloudSync specific pricing page
```

**Internal Tools:**

```
Company tools:
→ "Jenkins-Internal" (CI/CD)
→ "DocuTrack" (document management)
→ "TeamPulse" (collaboration)

General model:
→ Doesn't know these exist
→ Cannot match tool-specific questions to right docs

Query: "How to configure TeamPulse notifications?"
→ No semantic understanding of "TeamPulse"
→ Matches on "configure" and "notifications" only
→ Generic results
```

### Semantic Relationships Missing

Domain-specific concept relationships:

**Concept Hierarchies:**

```
Company taxonomy:
→ "Widget" → "Premium Widget" → "Enterprise Widget"

General model doesn't know:
→ These are hierarchical
→ Premium Widget is a type of Widget
→ Queries about "Widget" should match all three

User query: "Widget installation"
→ Should match:
  - "Widget installation guide"
  - "Premium Widget setup"
  - "Enterprise Widget deployment"

General model: Exact keyword match only
```

**Synonyms and Abbreviations:**

```
Internal usage:
→ "Knowledge Base" = "KB" = "Docs Repository"
→ All mean same thing

General model:
→ "KB" might mean "kilobyte"
→ Doesn't link to "Knowledge Base"

Query: "How to update KB?"
→ Doesn't match: "Knowledge Base update procedure"
```

### Fine-Tuning Challenges

Adapting models to domain:

**Data Requirements:**

```
Fine-tuning needs:
→ 1000+ domain-specific examples
→ Query-document pairs
→ Labeled relevance

Small companies:
→ Don't have 1000 examples
→ Cold start problem

Manual labeling:
→ Expensive
→ Time-consuming
→ Requires domain expertise
```

**Catastrophic Forgetting:**

```
Fine-tune on medical data:
→ Model learns medical terms well
→ But: Forgets general English

Query: "How to reset password" (general)
→ Fine-tuned medical model performs worse
→ Lost general capabilities

Balancing act:
→ Learn domain specifics
→ Retain general knowledge
```

**The Vocabulary Expansion Problem:**

```
Add new tokens to vocabulary:
→ "ThroughputMax" → single token

But:
→ Must train embeddings for new tokens
→ Requires substantial training data
→ Or: Random initialization (poor quality)

Trade-off:
→ Expand vocabulary (complex)
→ Or: Use subword tokens (loses semantics)
```

### Contextual Usage Patterns

Same word, different meaning by context:

**Polysemy in Domain:**

```
General English: "Model" = fashion model, 3D model, role model
Company context: "Model" = ML model (99% of usage)

General embedding:
→ Averages all meanings
→ Weak specific signal

Domain embedding:
→ Strongly associates with ML
→ Better retrieval for "model training"
```

**Register and Formality:**

```
Customer-facing docs: Formal, clear
Internal docs: Casual, jargon-heavy

Example:
External: "To initiate the authentication process..."
Internal: "Fire up the auth flow..."

General model:
→ Treats differently (different vocabulary)
→ May not match semantic equivalence

Domain model:
→ Learns these are equivalent in company context
```

### Mitigation Strategies

Practical approaches without full fine-tuning:

**1. Keyword Boosting:**

```
Hybrid search:
→ Semantic embedding (general model)
→ + BM25 keyword match
→ = Combined score

For domain terms:
→ Keyword match captures exact term
→ Even if embedding is weak

Query: "ThroughputMax configuration"
→ Semantic: Weak match
→ Keyword: Strong match on "ThroughputMax"
→ Combined: Good result
```

**2. Domain Lexicon:**

```
Maintain glossary:
→ "GTM" → "Go-To-Market strategy"
→ "CRM" → "Customer Risk Model"

At query time:
→ Expand acronyms
→ Query: "GTM timeline"
→ Expanded: "Go-To-Market strategy timeline"

Embedding of expanded query:
→ Better semantic match
```

**3. Metadata Enrichment:**

```
Store metadata with embeddings:
→ product: "ThroughputMax"
→ category: "Performance Tools"
→ tags: ["optimization", "throughput"]

Retrieval:
→ Semantic match on content
→ + Exact match on metadata
→ = Higher relevance

User can also filter by product
```

**4. Synthetic Training Data:**

```
Generate query-document pairs:
→ Document title: "ThroughputMax Configuration Guide"
→ Synthetic queries:
  - "How to configure ThroughputMax?"
  - "ThroughputMax setup instructions"
  - "ThroughputMax settings"

Fine-tune with synthetic data:
→ Cheap to generate
→ Covers domain vocabulary
→ Bootstraps domain adaptation
```

***

## How to Solve

**Implement hybrid search (semantic + keyword BM25) to catch exact term matches + maintain domain lexicon for query expansion + use metadata tags for product/category filtering + generate synthetic training data for key domain terms + consider domain-adapted models (BioBERT for medical, etc.) if available.** See [Domain Vocabulary](/rag-scenarios-and-solutions/vectors/domain-vocabulary.md).


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://help.twig.so/rag-scenarios-and-solutions/vectors/domain-vocabulary.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
