RAG in Production [P8]: Monitoring & Optimization - Keeping an Eye on Your AI
In this post we build a monitoring system for RAG, using Prometheus, Grafana, and RAGAS to track both infrastructure performance and AI response quality.
"If you don't measure it, you can't improve it. If you don't monitor it, you can't trust it." In traditional software, monitoring is about uptime. In AI, monitoring is about truth, quality, and cost.
Table of Contents
- The 3 Tiers of RAG Monitoring
- Infrastructure Metrics (The Foundation)
- LLM Observability (The Traces)
- RAG-Specific Quality Metrics (The Gold Standard)
- Evaluating with RAGAS & LLM-as-a-Judge
- Designing the 'NOC for AI' Dashboard
- Alerting Strategy: When to Wake Up at 3 AM?
- Conclusion & Next Post
The 3 Tiers of RAG Monitoring
You can't treat a RAG system like a simple CRUD app. We need a three-layered approach:
| Tier | Focus | Tools |
|---|---|---|
| 1. Infrastructure | Uptime, Latency, Error Rate | Prometheus, Grafana |
| 2. Observability | Reasoning Path, Prompt Tracing | LangSmith, Phoenix |
| 3. Quality (AI) | Accuracy, Hallucination, Relevance | RAGAS, DeepEval |
Infrastructure Metrics
These are the "vital signs" of your system. We collect them using prometheus-fastapi-instrumentator.
Key Metrics to Track:
- Request Latency (P95/P99): LLM inference is inherently slow, so you need to know exactly how slow, especially at the tail.
- Token Usage/Cost: Track how many tokens are being burned per user/day.
- Cache Hit Rate: Are we hitting our semantic cache often enough?
- Vector DB Search Time: Is Qdrant slowing down as the collection grows?
LLM Observability
A user says, "This answer is wrong." Why?
- Was the Retrieval bad? (Relevant docs not found)
- Was the Prompt bad? (Instructions were confusing)
- Did the LLM hallucinate? (Docs were correct, but AI ignored them)
We use LangSmith to visualize the execution graph. Every step is recorded:
Input -> Rewrite Query -> Retrieval -> Rerank -> Final Prompt -> LLM Output.
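To make the tracing idea concrete without the LangSmith SDK, here is a minimal sketch of what a trace records: each pipeline step with its input, output, and timing. The step names and toy string transforms are invented stand-ins for the real stages:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceSpan:
    step: str          # e.g. "rewrite_query", "retrieval", "llm_output"
    input: str
    output: str
    duration_ms: float

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def record(self, step, fn, payload):
        # Run one pipeline step and record what went in, what came out,
        # and how long it took
        start = time.perf_counter()
        result = fn(payload)
        self.spans.append(
            TraceSpan(step, payload, result,
                      (time.perf_counter() - start) * 1000)
        )
        return result

# Toy pipeline: each stage is a string transform standing in for real work
trace = Trace()
q = trace.record("rewrite_query", lambda s: s.lower(), "What Is Our Refund Policy?")
docs = trace.record("retrieval", lambda s: f"docs for: {s}", q)
answer = trace.record("llm_output", lambda s: f"answer based on {s}", docs)

for span in trace.spans:
    print(f"{span.step}: {span.duration_ms:.2f} ms")
```

With spans like these stored per request, answering "was it retrieval, the prompt, or the model?" becomes a lookup rather than guesswork, which is precisely what LangSmith's execution graph gives you at scale.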
RAG-Specific Quality Metrics
How do you turn "quality" into a number? We follow the RAG Triad:
- Faithfulness (Groundedness): Is the answer derived only from the retrieved documents? (Anti-hallucination).
- Answer Relevance: Does the answer actually address the user's question?
- Context Precision: Are the retrieved documents truly relevant to the query?
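As intuition for faithfulness, here is a deliberately crude lexical proxy: score the fraction of answer sentences whose content words all appear in the retrieved context. Real frameworks like RAGAS use an LLM judge to verify individual claims instead; this sketch (with invented example strings) only illustrates the "is every statement grounded?" idea:

```python
def naive_faithfulness(answer: str, context: str) -> float:
    """Crude lexical proxy for groundedness: fraction of answer sentences
    whose words all appear in the retrieved context. (RAGAS uses an LLM
    judge to check each claim semantically instead.)"""
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(
        1 for s in sentences
        if set(s.lower().split()) <= ctx_words
    )
    return grounded / len(sentences)

context = "refunds are issued within 30 days of purchase"
good = "refunds are issued within 30 days"
bad = "refunds are issued within 30 days. shipping is always free"

print(naive_faithfulness(good, context))  # → 1.0
print(naive_faithfulness(bad, context))   # → 0.5
```

The second answer scores 0.5 because its shipping claim appears nowhere in the context, which is exactly the kind of hallucination faithfulness is designed to catch.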
Evaluating with RAGAS
RAGAS is a framework that uses an LLM (the "Judge") to evaluate your RAG system's answers.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Run the LLM-as-a-Judge evaluation over a prepared dataset
# (each row needs the question, the generated answer, and the retrieved contexts)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)

print(f"Faithfulness Score: {results['faithfulness']}")
# Example result: 0.85 (85% of answer claims are grounded in the retrieved context)
```
Designing the 'NOC for AI' Dashboard
Our Grafana dashboard has 4 main panels:
- Panel 1: User Sentiment. Histogram of Thumbs Up/Down feedback.
- Panel 2: Latency Waterfall. Where is time spent? (Retrieval vs. Inference).
- Panel 3: Cost Heatmap. Real-time dollar spend per department.
- Panel 4: Drift Detection. Is the average answer quality dropping over time?
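Panel 4's drift check can be approximated with a rolling mean over recent quality scores. This is a minimal sketch; the window size, threshold, and score series below are all illustrative, not a production drift detector:

```python
from collections import deque

def detect_drift(scores, window=5, threshold=0.7):
    """Return the index where the rolling mean of quality scores first
    drops below the threshold, or None if quality holds up.
    Window size and threshold are illustrative choices."""
    buf = deque(maxlen=window)
    for i, score in enumerate(scores):
        buf.append(score)
        if len(buf) == window and sum(buf) / window < threshold:
            return i
    return None

# Invented daily average faithfulness scores that slowly degrade
scores = [0.9, 0.88, 0.85, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5]
print(detect_drift(scores))  # → 7 (rolling mean first dips below 0.7 here)
```

The rolling window matters: a single bad day should not page anyone, but a sustained slide in quality should show up on the dashboard before users complain.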
Alerting Strategy
Don't alert for every minor error. Use Service Level Objectives (SLOs):
- CRITICAL Alert: Success Rate < 95% over a 5-minute window.
- WARNING Alert: P99 Latency > 15 seconds.
- QUALITY Alert: Average Faithfulness score drops below 0.7.
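The SLOs above could be expressed as Prometheus alerting rules roughly like the following sketch. The metric names (rag_requests_total, rag_request_duration_seconds, rag_faithfulness_avg) are assumptions; adapt them to whatever your instrumentation actually exports:

```yaml
# Sketch of Prometheus alerting rules matching the SLOs above.
# Metric names are assumed, not from our actual instrumentation.
groups:
  - name: rag-slos
    rules:
      - alert: RAGSuccessRateLow
        expr: |
          sum(rate(rag_requests_total{status="success"}[5m]))
            / sum(rate(rag_requests_total[5m])) < 0.95
        labels:
          severity: critical
      - alert: RAGLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(rag_request_duration_seconds_bucket[5m])) by (le)) > 15
        labels:
          severity: warning
      - alert: RAGFaithfulnessLow
        expr: avg_over_time(rag_faithfulness_avg[1h]) < 0.7
        labels:
          severity: warning
```

Note that the quality alert averages over a longer window than the infrastructure alerts: faithfulness scores arrive slowly (per evaluation batch), so a 5-minute window would be mostly noise.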
Conclusion & Next Post
Monitoring a RAG system is a continuous process. By combining technical metrics with AI-driven quality scores, we can turn a "black box" into a predictable enterprise product.
3 Key Takeaways:
- Traceability is the only way to debug poor AI responses.
- User feedback is a goldmine for improving your dataset.
- Automated quality scores (RAGAS) sharply reduce the need for expensive manual human review.
👉 Next Post: [Post 09] Security & Privacy - Protecting Your Enterprise Data
AI systems introduce new security risks: Prompt Injection, PII leakage, and data breaches. In the next post, we will build a Security Shield for our RAG system. How do you stop a user from tricking the AI into revealing the CEO's salary?
📬 How do you evaluate the quality of your AI? Manual review or automated scoring?
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Monitoring LLM Observability SRE RAGAS Grafana
Series • Part 8 of 11