# Query Audit Trail Gaps

## The Problem

Insufficient logging of RAG queries and retrieved context makes it impossible to audit data access, investigate security incidents, or prove compliance.

### Symptoms

* ❌ Cannot track who queried what
* ❌ No record of retrieved sensitive data
* ❌ Missing timestamps for access
* ❌ Cannot investigate data breaches
* ❌ Compliance audit failures

### Real-World Example

```
Security incident:
→ Confidential document leaked
→ Need to find: Who accessed it?

Check logs:
→ Application logs: Generic "query processed"
→ Vector DB logs: No query content logged
→ LLM API logs: Retained 30 days (incident predates the retention window)

Cannot determine:
→ Which user queried the document
→ When it was accessed
→ What context was retrieved
→ If data was exfiltrated

Forensic investigation impossible
```

***

## Deep Technical Analysis

### Logging Gaps

**Application-Level Logging:**

```
Typical logs:
"User 123 submitted query" ✓
"Retrieved 5 chunks" ✓

Missing:
- Query text content ✗
- Retrieved chunk IDs ✗
- Document sources ✗
- Sensitivity labels ✗
- User IP address ✗
```

**Vector DB Logging:**

```
Pinecone/Weaviate:
→ Operational metrics (latency, errors)
→ But: Query content typically not logged by default
→ Privacy by design (good for user privacy)
→ Bad for audit trail (cannot reconstruct access)
```

**LLM API Logging:**

```
OpenAI/Anthropic:
→ Short provider-side retention (commonly around 30 days by default; verify current policy)
→ Then deleted
→ Insufficient for compliance (HIPAA documentation retention: 6 years)

Must log locally:
→ Before sending to API
→ Full request/response
→ Long-term retention
```
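The "log locally" steps above could be sketched as a thin wrapper around the model call. This is a minimal illustration, not a provider SDK: `call_with_audit`, the `llm_call` callable, the `request_id` field, and the `llm_audit.jsonl` path are all hypothetical names. The request is written before the call so that a record exists even if the call never returns.

```python
import json
import time
import uuid
from datetime import datetime, timezone
from typing import Callable


def _append(path: str, record: dict) -> None:
    """Append one JSON object per line (JSONL) to a local file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def _now() -> str:
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def call_with_audit(
    user_id: str,
    prompt: str,
    llm_call: Callable[[str], str],  # hypothetical: any function that calls the LLM API
    log_path: str = "llm_audit.jsonl",  # hypothetical local log location
) -> str:
    """Log the full request before the API call and the full response after it."""
    request_id = str(uuid.uuid4())
    _append(log_path, {
        "event": "request",
        "request_id": request_id,
        "timestamp": _now(),
        "user_id": user_id,
        "prompt": prompt,
    })

    started = time.monotonic()
    try:
        response = llm_call(prompt)
    except Exception as exc:
        # Failed calls are still part of the audit trail.
        _append(log_path, {
            "event": "error",
            "request_id": request_id,
            "timestamp": _now(),
            "error": repr(exc),
        })
        raise

    _append(log_path, {
        "event": "response",
        "request_id": request_id,
        "timestamp": _now(),
        "response": response,
        "response_time_ms": int((time.monotonic() - started) * 1000),
    })
    return response
```

In a real deployment the local file would be replaced by whatever long-term store the retention policy requires; the wrapper shape stays the same.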

### Comprehensive Audit Log

**Required Fields:**

```json
{
  "timestamp": "2024-01-15T14:32:18Z",
  "user_id": "user_12345",
  "session_id": "sess_abc123",
  "ip_address": "192.168.1.100",
  "query": "What is the CEO's compensation?",
  "agent_id": "hr_agent",
  "retrieved_chunks": [
    {
      "chunk_id": "doc_789_chunk_12",
      "document": "Executive Compensation 2023",
      "sensitivity": "confidential",
      "score": 0.87
    }
  ],
  "response": "According to...",
  "response_time_ms": 1234,
  "model": "gpt-4",
  "tokens_used": 567
}
```
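One way to keep every entry consistent with the fields above is to build it from a typed structure rather than an ad hoc dictionary. The sketch below assumes Python 3.9+ and uses the field names from the JSON example; the class names `AuditRecord` and `RetrievedChunk` are illustrative, not part of any library.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _utc_timestamp() -> str:
    """UTC timestamp in the same format as the example entry."""
    return datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


@dataclass
class RetrievedChunk:
    chunk_id: str
    document: str
    sensitivity: str
    score: float


@dataclass
class AuditRecord:
    user_id: str
    session_id: str
    ip_address: str
    query: str
    agent_id: str
    retrieved_chunks: list[RetrievedChunk]
    response: str
    response_time_ms: int
    model: str
    tokens_used: int
    timestamp: str = field(default_factory=_utc_timestamp)

    def to_json(self) -> str:
        """Serialize to a single JSON line; nested chunks are converted too."""
        return json.dumps(asdict(self), ensure_ascii=False)


record = AuditRecord(
    user_id="user_12345",
    session_id="sess_abc123",
    ip_address="192.168.1.100",
    query="What is the CEO's compensation?",
    agent_id="hr_agent",
    retrieved_chunks=[
        RetrievedChunk("doc_789_chunk_12", "Executive Compensation 2023", "confidential", 0.87)
    ],
    response="According to...",
    response_time_ms=1234,
    model="gpt-4",
    tokens_used=567,
)
print(record.to_json())
```

Because every field without a default is required by the constructor, a missing `user_id` or `retrieved_chunks` fails at write time instead of surfacing later as a gap in the audit trail.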

**Storage Requirements:**

```
For compliance:
→ Immutable storage (append-only)
→ Encrypted at rest
→ Retention: 6+ years (HIPAA)
→ Searchable for investigations
→ Access-controlled (who can view logs?)
```
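True immutability usually comes from the storage layer (for example, write-once object storage). As a complementary, application-level illustration, the sketch below hash-chains entries so that any later modification becomes detectable. Note that this provides tamper evidence rather than immutability, and that `append_entry`, `verify_chain`, and the entry layout are assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(chain: list[dict], record: dict) -> dict:
    """Append a record, linking it to the hash of the previous entry."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    }
    chain.append(entry)
    return entry


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict] = []
append_entry(log, {"user_id": "user_12345", "query": "What is the CEO's compensation?"})
append_entry(log, {"user_id": "user_67890", "query": "Show the Q3 roadmap"})
print(verify_chain(log))
```

An in-memory list stands in for the real store here; the same chaining can be applied to rows in a database or lines in a file, with encryption at rest and access control handled by the storage layer.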

### Performance Impact

**Logging Overhead:**

```
Synchronous logging:
→ Write to DB before response
→ Adds latency (50-200ms)
→ User waits for log write

Asynchronous logging:
→ Queue log event
→ Write in background
→ Minimal latency impact
→ Risk: Log loss if crash before flush
```
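The asynchronous option above can be sketched with a queue and a background writer thread. This is a minimal illustration under stated assumptions: `AsyncAuditLogger` is a hypothetical class, the log goes to a local JSONL file, and `close()` must be called on shutdown to drain the queue. Entries still in the queue are lost if the process is killed before that, which is exactly the risk noted above.

```python
import json
import queue
import threading


class AsyncAuditLogger:
    """Queue audit records and write them from a background thread."""

    _STOP = object()  # sentinel that tells the worker to exit

    def __init__(self, path: str, max_pending: int = 10_000):
        self._path = path
        self._queue: queue.Queue = queue.Queue(maxsize=max_pending)
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def log(self, record: dict) -> None:
        """Enqueue a record; blocks only when the queue is full (back-pressure)."""
        self._queue.put(record)

    def _run(self) -> None:
        with open(self._path, "a", encoding="utf-8") as fh:
            while True:
                item = self._queue.get()
                if item is self._STOP:
                    break
                fh.write(json.dumps(item) + "\n")
                # Hand the line to the OS right away; this is not an fsync.
                fh.flush()

    def close(self) -> None:
        """Drain everything queued so far, then stop the worker."""
        self._queue.put(self._STOP)
        self._worker.join()
```

A request handler would call `logger.log(record)` and return immediately, then the application would call `logger.close()` during shutdown. Blocking on a full queue is a deliberate choice in this sketch: for audit data, slowing requests down is usually preferable to silently dropping entries.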

**Storage Costs:**

```
High-volume system:
→ 10,000 queries/day
→ 5 KB per log entry
→ = 50 MB/day ≈ 18 GB/year
→ × 6 years retention ≈ 110 GB

Plus retrieved chunks:
→ 10 chunks × 500 tokens each = 5,000 tokens/query
→ = 50M tokens/day ≈ 200 MB/day just for chunk content (assuming ~4 bytes per token)
→ Substantial storage
```
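The arithmetic above is easy to rerun for other volumes with a small helper. The function below is a sketch: `storage_estimate` is a hypothetical name, sizes are treated as decimal units (1 KB = 1,000 bytes), and a 365-day year is assumed.

```python
def storage_estimate(
    queries_per_day: int,
    bytes_per_entry: int,
    retention_years: int,
    days_per_year: int = 365,
) -> dict:
    """Estimate log storage using decimal units (1 MB = 1e6 bytes, 1 GB = 1e9 bytes)."""
    per_day = queries_per_day * bytes_per_entry
    per_year = per_day * days_per_year
    return {
        "per_day_mb": per_day / 1e6,
        "per_year_gb": per_year / 1e9,
        "total_gb": per_year * retention_years / 1e9,
    }


# Figures from the example above: 10,000 queries/day, 5 KB per entry, 6 years.
estimate = storage_estimate(
    queries_per_day=10_000,
    bytes_per_entry=5_000,
    retention_years=6,
)
print(estimate)
```

Chunk content can be estimated the same way by passing tokens per query multiplied by an assumed bytes-per-token figure as `bytes_per_entry`; that figure depends on the tokenizer and language, so treat it as a rough input rather than a constant.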

### Audit Query Interface

**Investigations:**

```
Security team needs:
→ "Show all queries accessing document X"
→ "Who accessed salary data last month?"
→ "Find queries from IP 1.2.3.4"

Requires:
→ Indexed logs (Elasticsearch, Splunk)
→ Query interface
→ Role-based access (only security team)
```
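The investigation questions above map to simple filters over structured log entries. The sketch below runs them as a linear scan over JSONL lines, which is only a stand-in for an indexed store; the function names are hypothetical and the field names follow the audit entry format shown earlier on this page.

```python
import json
from datetime import datetime
from typing import Iterable


def load_records(lines: Iterable[str]) -> list[dict]:
    """Parse JSONL audit lines, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is mapped to UTC explicitly."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def accessed_document(records: list[dict], document: str) -> list[dict]:
    """'Show all queries accessing document X'."""
    return [
        r for r in records
        if any(c.get("document") == document for c in r.get("retrieved_chunks", []))
    ]


def from_ip(records: list[dict], ip_address: str) -> list[dict]:
    """'Find queries from IP 1.2.3.4'."""
    return [r for r in records if r.get("ip_address") == ip_address]


def in_window(records: list[dict], start: datetime, end: datetime) -> list[dict]:
    """Keep entries whose timestamp falls in [start, end)."""
    return [r for r in records if start <= parse_ts(r["timestamp"]) < end]


sample_lines = [
    '{"timestamp": "2024-01-15T14:32:18Z", "user_id": "user_12345", '
    '"ip_address": "192.168.1.100", "retrieved_chunks": '
    '[{"document": "Executive Compensation 2023", "sensitivity": "confidential"}]}',
    '{"timestamp": "2024-02-02T09:00:00Z", "user_id": "user_67890", '
    '"ip_address": "1.2.3.4", "retrieved_chunks": []}',
]
records = load_records(sample_lines)
for hit in accessed_document(records, "Executive Compensation 2023"):
    print(hit["user_id"], hit["timestamp"])
```

At production volumes the same questions would be expressed as queries against the log index, with the role-based access check applied before any of these filters run.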

***

## How to Solve

**Log query, user, timestamp, retrieved chunks, and response for every request + use structured logging (JSON) with all required fields + implement async logging to minimize latency + store in immutable append-only storage + retain 6+ years for compliance + index logs for searchable audit trail.** See [Audit Logging](/rag-scenarios-and-solutions/privacy/audit-gaps.md).

