RAG in Production [P8]: Monitoring & Optimization - Keeping an Eye on Your AI
In this post we build a monitoring system for RAG, using Prometheus, Grafana, and RAGAS to track both infrastructure performance and AI response quality.
"If you don't measure it, you can't improve it. If you don't monitor it, you can't trust it." In traditional software, monitoring is about uptime. In AI, monitoring is about truth, quality, and cost.
Table of Contents
- The 3 Tiers of RAG Monitoring
- Infrastructure Metrics (The Foundation)
- LLM Observability (The Traces)
- RAG-Specific Quality Metrics (The Gold Standard)
- Evaluating with RAGAS & LLM-as-a-Judge
- Designing the 'NOC for AI' Dashboard
- Alerting Strategy: When to Wake Up at 3 AM?
- Conclusion & Next Post
The 3 Tiers of RAG Monitoring
You can't treat a RAG system like a simple CRUD app. We need a three-layered approach:
| Tier | Focus | Tools |
|---|---|---|
| 1. Infrastructure | Uptime, Latency, Error Rate | Prometheus, Grafana |
| 2. Observability | Reasoning Path, Prompt Tracing | LangSmith, Phoenix |
| 3. Quality (AI) | Accuracy, Hallucination, Relevance | RAGAS, DeepEval |
Infrastructure Metrics
These are the "vital signs" of your system. We collect them using prometheus-fastapi-instrumentator.
Key Metrics to Track:
- Request Latency (P95/P99): LLM inference is inherently slow, so you need to know exactly how slow, especially at the tail.
- Token Usage/Cost: Track how many tokens are being burned per user/day.
- Cache Hit Rate: Are we hitting our semantic cache often enough?
- Vector DB Search Time: Is Qdrant slowing down as the collection grows?
LLM Observability
A user says, "This answer is wrong." Why?
- Was the Retrieval bad? (Relevant docs not found)
- Was the Prompt bad? (Instructions were confusing)
- Did the LLM hallucinate? (Docs were correct, but AI ignored them)
We use LangSmith to visualize the execution graph. Every step is recorded:
Input -> Rewrite Query -> Retrieval -> Rerank -> Final Prompt -> LLM Output.
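To make the tracing idea concrete without the LangSmith SDK, here is a minimal sketch of what a trace records: each pipeline step with its input, output, and timing. The step names and toy string transforms are invented stand-ins for the real stages:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TraceSpan:
    step: str          # e.g. "rewrite_query", "retrieval", "llm_output"
    input: str
    output: str
    duration_ms: float

@dataclass
class Trace:
    spans: list = field(default_factory=list)

    def record(self, step, fn, payload):
        # Run one pipeline step and record what went in, what came out,
        # and how long it took
        start = time.perf_counter()
        result = fn(payload)
        self.spans.append(
            TraceSpan(step, payload, result,
                      (time.perf_counter() - start) * 1000)
        )
        return result

# Toy pipeline: each stage is a string transform standing in for real work
trace = Trace()
q = trace.record("rewrite_query", lambda s: s.lower(), "What Is Our Refund Policy?")
docs = trace.record("retrieval", lambda s: f"docs for: {s}", q)
answer = trace.record("llm_output", lambda s: f"answer based on {s}", docs)

for span in trace.spans:
    print(f"{span.step}: {span.duration_ms:.2f} ms")
```

With spans like these stored per request, answering "was it retrieval, the prompt, or the model?" becomes a lookup rather than guesswork, which is precisely what LangSmith's execution graph gives you at scale.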
RAG-Specific Quality Metrics
How do you turn "quality" into a number? We follow the RAG Triad:
- Faithfulness (Groundedness): Is the answer derived only from the retrieved documents? (Anti-hallucination).
- Answer Relevance: Does the answer actually address the user's question?
- Context Precision: Are the retrieved documents truly relevant to the query?
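As intuition for faithfulness, here is a deliberately crude lexical proxy: score the fraction of answer sentences whose content words all appear in the retrieved context. Real frameworks like RAGAS use an LLM judge to verify individual claims instead; this sketch (with invented example strings) only illustrates the "is every statement grounded?" idea:

```python
def naive_faithfulness(answer: str, context: str) -> float:
    """Crude lexical proxy for groundedness: fraction of answer sentences
    whose words all appear in the retrieved context. (RAGAS uses an LLM
    judge to check each claim semantically instead.)"""
    ctx_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = sum(
        1 for s in sentences
        if set(s.lower().split()) <= ctx_words
    )
    return grounded / len(sentences)

context = "refunds are issued within 30 days of purchase"
good = "refunds are issued within 30 days"
bad = "refunds are issued within 30 days. shipping is always free"

print(naive_faithfulness(good, context))  # → 1.0
print(naive_faithfulness(bad, context))   # → 0.5
```

The second answer scores 0.5 because its shipping claim appears nowhere in the context, which is exactly the kind of hallucination faithfulness is designed to catch.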
Evaluating with RAGAS
RAGAS is a framework that uses an LLM (the "Judge") to evaluate your RAG system's answers.
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Run the LLM-as-a-Judge evaluation over a prepared dataset
# (each row needs the question, the generated answer, and the retrieved contexts)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)

print(f"Faithfulness Score: {results['faithfulness']}")
# Example result: 0.85 (85% of answer claims are grounded in the retrieved context)
```
Designing the 'NOC for AI' Dashboard
Our Grafana dashboard has 4 main panels:
- Panel 1: User Sentiment. Histogram of Thumbs Up/Down feedback.
- Panel 2: Latency Waterfall. Where is time spent? (Retrieval vs. Inference).
- Panel 3: Cost Heatmap. Real-time dollar spend per department.
- Panel 4: Drift Detection. Is the average answer quality dropping over time?
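Panel 4's drift check can be approximated with a rolling mean over recent quality scores. This is a minimal sketch; the window size, threshold, and score series below are all illustrative, not a production drift detector:

```python
from collections import deque

def detect_drift(scores, window=5, threshold=0.7):
    """Return the index where the rolling mean of quality scores first
    drops below the threshold, or None if quality holds up.
    Window size and threshold are illustrative choices."""
    buf = deque(maxlen=window)
    for i, score in enumerate(scores):
        buf.append(score)
        if len(buf) == window and sum(buf) / window < threshold:
            return i
    return None

# Invented daily average faithfulness scores that slowly degrade
scores = [0.9, 0.88, 0.85, 0.8, 0.75, 0.7, 0.6, 0.55, 0.5]
print(detect_drift(scores))  # → 7 (rolling mean first dips below 0.7 here)
```

The rolling window matters: a single bad day should not page anyone, but a sustained slide in quality should show up on the dashboard before users complain.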
Alerting Strategy
Don't alert for every minor error. Use Service Level Objectives (SLOs):
- CRITICAL Alert: Success Rate < 95% over a 5-minute window.
- WARNING Alert: P99 Latency > 15 seconds.
- QUALITY Alert: Average Faithfulness score drops below 0.7.
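The SLOs above could be expressed as Prometheus alerting rules roughly like the following sketch. The metric names (rag_requests_total, rag_request_duration_seconds, rag_faithfulness_avg) are assumptions; adapt them to whatever your instrumentation actually exports:

```yaml
# Sketch of Prometheus alerting rules matching the SLOs above.
# Metric names are assumed, not from our actual instrumentation.
groups:
  - name: rag-slos
    rules:
      - alert: RAGSuccessRateLow
        expr: |
          sum(rate(rag_requests_total{status="success"}[5m]))
            / sum(rate(rag_requests_total[5m])) < 0.95
        labels:
          severity: critical
      - alert: RAGLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(rag_request_duration_seconds_bucket[5m])) by (le)) > 15
        labels:
          severity: warning
      - alert: RAGFaithfulnessLow
        expr: avg_over_time(rag_faithfulness_avg[1h]) < 0.7
        labels:
          severity: warning
```

Note that the quality alert averages over a longer window than the infrastructure alerts: faithfulness scores arrive slowly (per evaluation batch), so a 5-minute window would be mostly noise.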
Conclusion & Next Post
Monitoring a RAG system is a continuous process. By combining technical metrics with AI-driven quality scores, we can turn a "black box" into a predictable enterprise product.
3 Key Takeaways:
- Traceability is the only way to debug poor AI responses.
- User feedback is a goldmine for improving your dataset.
- Automated quality scores (RAGAS) sharply reduce the need for expensive manual human review.
👉 Next Post: [Post 09] Security & Privacy - Protecting Your Enterprise Data
AI systems introduce new security risks: Prompt Injection, PII leakage, and data breaches. In the next post, we will build a Security Shield for our RAG system. How do you stop a user from tricking the AI into revealing the CEO's salary?
📬 How do you evaluate the quality of your AI? Manual review or automated scoring?
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Monitoring LLM Observability SRE RAGAS Grafana
Series • Part 8 of 11