RAG in Production [P8]: Monitoring & Optimization - Keeping an Eye on Your AI
RAG in Production — The Journey of Building a Real-world AI System


Building a monitoring system for RAG. We will use Prometheus, Grafana, and RAGAS to monitor both infrastructure performance and AI response quality.

Truong Pham · Software Engineer
Published: April 12, 2024
Stack: Monitoring · Grafana · Prometheus · RAGAS

"If you don't measure it, you can't improve it. If you don't monitor it, you can't trust it." In traditional software, monitoring is about uptime. In AI, monitoring is about truth, quality, and cost.


Table of Contents

  1. The 3 Tiers of RAG Monitoring
  2. Infrastructure Metrics (The Foundation)
  3. LLM Observability (The Traces)
  4. RAG-Specific Quality Metrics (The Gold Standard)
  5. Evaluating with RAGAS & LLM-as-a-Judge
  6. Designing the 'NOC for AI' Dashboard
  7. Alerting Strategy: When to Wake Up at 3 AM?
  8. Conclusion & Next Post

The 3 Tiers of RAG Monitoring

You can't treat a RAG system like a simple CRUD app. We need a three-layered approach:

Tier              | Focus                              | Tools
1. Infrastructure | Uptime, Latency, Error Rate        | Prometheus, Grafana
2. Observability  | Reasoning Path, Prompt Tracing     | LangSmith, Phoenix
3. Quality (AI)   | Accuracy, Hallucination, Relevance | RAGAS, DeepEval

Infrastructure Metrics

These are the "vital signs" of your system. We collect them using prometheus-fastapi-instrumentator.

Key Metrics to Track:

  • Request Latency (P95/P99): AI is slow, so we need to know exactly how slow.
  • Token Usage/Cost: Track how many tokens are being burned per user/day.
  • Cache Hit Rate: Are we hitting our semantic cache often enough?
  • Vector DB Search Time: Is Qdrant slowing down as the collection grows?
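To make the P95/P99 idea concrete, here is a minimal sketch of how tail percentiles can be derived from raw latency samples using only Python's standard library. The helper name and the sample values are illustrative, not taken from our codebase:

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute P95/P99 cut points from raw request latencies (milliseconds)."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p95": cuts[94], "p99": cuts[98]}

# 100 requests: most are fast, a small tail is very slow
samples = [120] * 90 + [800] * 8 + [4000] * 2
print(latency_percentiles(samples))  # {'p95': 800.0, 'p99': 4000.0}
```

Averages would hide that tail entirely (the mean here is under 250 ms), which is exactly why we track P95/P99 instead.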

LLM Observability

A user says, "This answer is wrong." Why?

  • Was the Retrieval bad? (Relevant docs not found)
  • Was the Prompt bad? (Instructions were confusing)
  • Did the LLM hallucinate? (Docs were correct, but AI ignored them)

We use LangSmith to visualize the execution graph. Every step is recorded: Input -> Rewrite Query -> Retrieval -> Rerank -> Final Prompt -> LLM Output.
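LangSmith records this execution graph for you, but the underlying idea is simple enough to sketch by hand. The toy tracer below (all names and the pipeline steps' outputs are hypothetical, not the LangSmith API) records one span per step, so a bad answer can be traced back to the exact hop that produced it:

```python
import time
from contextlib import contextmanager

trace = []  # ordered list of recorded spans for one request

@contextmanager
def span(step, **attrs):
    """Record one pipeline step: its name, duration, and inputs/outputs."""
    start = time.perf_counter()
    record = {"step": step, **attrs}
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        trace.append(record)

# Toy pipeline: every hop is captured for later inspection
with span("rewrite_query", input="vacation policy?") as s:
    s["output"] = "What is the company vacation policy?"
with span("retrieval", query=s["output"]) as s:
    s["output"] = ["doc_17", "doc_42"]  # retrieved document ids
with span("llm", docs=s["output"]) as s:
    s["output"] = "Employees get 20 days of paid vacation."

for rec in trace:
    print(rec["step"], "->", rec["output"])
```

With this record, the three failure modes above become distinguishable: bad retrieval shows up as wrong document ids, a bad prompt shows up in the final prompt span, and hallucination shows up as an LLM output that contradicts correct retrieved docs.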


RAG-Specific Quality Metrics

How do you turn "quality" into a number? We follow the RAG Triad:

  1. Faithfulness (Groundedness): Is the answer derived only from the retrieved documents? (Anti-hallucination).
  2. Answer Relevance: Does the answer actually address the user's question?
  3. Context Precision: Are the retrieved documents truly relevant to the query?
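In production these scores come from an LLM judge (see the next section), but a crude lexical approximation makes the Faithfulness definition concrete. The sketch below is a toy, not how RAGAS actually computes the metric: it counts a sentence as grounded when at least half of its words appear in the retrieved context.

```python
def faithfulness_score(answer_sentences, context):
    """Toy lexical groundedness: a sentence counts as grounded when at
    least half of its words also appear in the retrieved context."""
    ctx_words = set(context.lower().split())

    def grounded(sentence):
        words = sentence.lower().split()
        return sum(w in ctx_words for w in words) / len(words) >= 0.5

    return sum(grounded(s) for s in answer_sentences) / len(answer_sentences)

context = "the vacation policy grants 20 paid days per year"
answer = [
    "employees get 20 paid days per year",  # grounded in the context
    "unused days convert to a cash bonus",  # invented by the model
]
print(faithfulness_score(answer, context))  # 1 of 2 sentences grounded -> 0.5
```

A real judge uses semantic understanding rather than word overlap, but the shape is the same: decompose the answer into claims, check each claim against the context, report the grounded fraction.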

Evaluating with RAGAS

RAGAS is a framework that uses an LLM (the "Judge") to evaluate your RAG system's answers.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# Example evaluation: dataset holds question / answer / contexts columns
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy],
)

print(f"Faithfulness Score: {results['faithfulness']}")
# Result: 0.85 (85% of answers are grounded in the retrieved context)
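Under the hood, "LLM-as-a-Judge" means sending the context and answer to a strong model with grading instructions. The prompt below is a hedged sketch of what such a judge prompt can look like; the actual call to a judge model (OpenAI, a local vLLM endpoint, etc.) is deliberately left out, since any chat-completion client can consume this string:

```python
JUDGE_PROMPT = """\
You are a strict evaluator. Given a context and an answer, rate from 0 to 1:
does every claim in the answer follow from the context alone?
Reply with only a number.

Context:
{context}

Answer:
{answer}
"""

def build_judge_prompt(context, answer):
    """Fill the judge template for one (context, answer) pair."""
    return JUDGE_PROMPT.format(context=context, answer=answer)

prompt = build_judge_prompt("Qdrant stores vectors.", "Qdrant is a vector database.")
print(prompt)
```

Forcing the judge to "reply with only a number" keeps the output machine-parseable, which is what lets frameworks like RAGAS aggregate thousands of judgments into a single score.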

Designing the 'NOC for AI' Dashboard

Our Grafana dashboard has 4 main panels:

  • Panel 1: User Sentiment. Histogram of Thumbs Up/Down feedback.
  • Panel 2: Latency Waterfall. Where is time spent? (Retrieval vs. Inference).
  • Panel 3: Cost Heatmap. Real-time dollar spent per department.
  • Panel 4: Drift Detection. Is the average answer quality dropping over time?
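Panel 4 can be backed by something as simple as a rolling mean over recent quality scores compared against a launch-time baseline. The class below is an illustrative sketch (the name, window size, and tolerance are assumptions, not values from our dashboard):

```python
from collections import deque

class DriftDetector:
    """Flag drift when the recent mean quality drops well below a baseline."""

    def __init__(self, baseline, window=50, tolerance=0.1):
        self.baseline = baseline            # quality measured at launch
        self.scores = deque(maxlen=window)  # rolling window of recent scores
        self.tolerance = tolerance

    def observe(self, score):
        """Record one quality score; return True when drift is detected."""
        self.scores.append(score)
        recent = sum(self.scores) / len(self.scores)
        return recent < self.baseline - self.tolerance

det = DriftDetector(baseline=0.85, window=10)
for s in [0.9, 0.85, 0.8]:
    det.observe(s)          # healthy scores, no drift
print(det.observe(0.4))     # a quality collapse drags the rolling mean down
```

Grafana can then alert on the boolean (or on the rolling mean itself), turning a slow quality decline into a visible signal instead of a surprise.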

Alerting Strategy

Don't alert for every minor error. Use Service Level Objectives (SLOs):

  • CRITICAL Alert: Success Rate < 95% over a 5-minute window.
  • WARNING Alert: P99 Latency > 15 seconds.
  • QUALITY Alert: Average Faithfulness score drops below 0.7.
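The three rules above translate directly into a small evaluation function. This is a sketch using the thresholds from the post; in practice these rules would live in Prometheus/Grafana alerting config rather than application code:

```python
def evaluate_alerts(success_rate, p99_latency_s, faithfulness):
    """Map the current 5-minute window stats to the three alert levels."""
    alerts = []
    if success_rate < 0.95:
        alerts.append("CRITICAL: success rate below 95%")
    if p99_latency_s > 15:
        alerts.append("WARNING: P99 latency above 15s")
    if faithfulness < 0.7:
        alerts.append("QUALITY: avg faithfulness below 0.7")
    return alerts

# Fast and mostly successful, but the answers are drifting off the docs
print(evaluate_alerts(success_rate=0.99, p99_latency_s=18.2, faithfulness=0.65))
```

Note that the QUALITY alert fires even when the infrastructure looks perfectly healthy: a RAG system can be up, fast, and wrong at the same time, which is why Tier 3 monitoring exists at all.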

Conclusion & Next Post

Monitoring a RAG system is a continuous process. By combining technical metrics with AI-driven quality scores, we can turn a "black box" into a predictable enterprise product.

3 Key Takeaways:

  1. Traceability is the only way to debug poor AI responses.
  2. User feedback is a goldmine for improving your dataset.
  3. Automated quality scores (RAGAS) replace expensive manual human reviews.

👉 Next Post: [Post 09] Security & Privacy - Protecting Your Enterprise Data

AI systems introduce new security risks: Prompt Injection, PII leakage, and data breaches. In the next post, we will build a Security Shield for our RAG system. How do you stop a user from tricking the AI into revealing the CEO's salary?


📬 How do you evaluate the quality of your AI? Manual review or automated scoring?


Author: Truong Pham · Series: RAG in Production — The Journey of Building a Real-world AI System · Tags: Monitoring, LLM Observability, SRE, RAGAS, Grafana

Series • Part 8 of 11

RAG in Production — The Journey of Building a Real-world AI System
