RAG in Production — The Journey of Building a Real-world AI System

RAG in Production [P11]: Lessons Learned - 15 Hard Truths About RAG in Production

The series finale. Summarizing 15 key lessons and 'expensive' mistakes learned from building and operating RAG systems in real-world enterprise environments.

Truong Pham · Software Engineer
Published: April 20, 2024
Tags: RAG · Lessons Learned · Best Practices · AI Engineering

"Experience is what you get when you didn't get what you wanted." After 11 posts, we've covered a lot of ground. In this finale, I've distilled our journey into 15 hard truths that will save you months of trial and error.


Table of Contents

  1. The "Data is King" Truths
  2. The Retrieval Engineering Truths
  3. The LLM & Generation Truths
  4. The Operations & Security Truths
  5. The Human & Business Truths
  6. Series Summary Checklist
  7. Conclusion: The Journey Continues

1. The "Data is King" Truths

Lesson 01: Garbage In, Garbage Out

No matter how advanced your LLM is (even GPT-5), if your source documents are messy, duplicated, or outdated, your AI will be useless. Spending 80% of your time cleaning data is not a mistake; it's the requirement.

Lesson 02: Chunking is an Art, Not a Setting

Don't just use chunk_size=500. Your chunks should be Semantic Units. A chunk that cuts a table in half or loses the context of its parent header is a failed retrieval waiting to happen.
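To make this concrete, here's a minimal sketch of heading-aware chunking: each chunk carries its parent heading so the retriever never sees an orphaned paragraph. The markdown input and the simple blank-line split are illustrative assumptions, not the exact pipeline from earlier parts.

```python
# Heading-aware chunking sketch: keep the parent heading attached to
# every chunk so each one remains a self-contained semantic unit.
def chunk_by_heading(markdown: str) -> list[str]:
    chunks, heading = [], ""
    for block in markdown.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):       # a new section starts here
            heading = block
        else:
            # prepend the current heading so the chunk keeps its context
            chunks.append(f"{heading}\n{block}" if heading else block)
    return chunks

doc = "# Refund Policy\n\nRefunds take 5 days.\n\n# Shipping\n\nWe ship worldwide."
print(chunk_by_heading(doc))
# → ['# Refund Policy\nRefunds take 5 days.', '# Shipping\nWe ship worldwide.']
```

A real splitter would also respect tables and size limits, but the principle is the same: split on meaning, not on character counts.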

Lesson 03: Metadata is the Real Secret Sauce

Vectors are for searching meaning, but Metadata is for searching facts. Without proper metadata (author, date, department, category), you cannot implement security, handle versioning, or perform efficient hybrid search.
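As a sketch, a metadata filter runs before any vector math ever happens; the field names (`department`, `valid_until`) are hypothetical examples, not a prescribed schema:

```python
# Metadata-first filtering sketch: hard facts (ownership, freshness)
# are exact-match filters, not similarity searches.
from datetime import date

chunks = [
    {"text": "Q1 travel policy", "department": "HR",  "valid_until": date(2023, 12, 31)},
    {"text": "Q2 travel policy", "department": "HR",  "valid_until": date(2025, 12, 31)},
    {"text": "API rate limits",  "department": "Eng", "valid_until": date(2025, 12, 31)},
]

def filter_chunks(chunks, department, today):
    """Keep only chunks the vector search is even allowed to consider."""
    return [c for c in chunks
            if c["department"] == department and c["valid_until"] >= today]

current_hr = filter_chunks(chunks, "HR", date(2024, 4, 20))
print([c["text"] for c in current_hr])   # → ['Q2 travel policy']
```

In a real vector database this filter is pushed into the search query itself, so stale or out-of-scope chunks never reach the similarity stage.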


2. The Retrieval Engineering Truths

Lesson 04: Pure Vector Search is Often Not Enough

Semantic search is "vague". To build a production system, you almost always need Hybrid Search (Vector + BM25). Keywords still matter for product names, IDs, and domain-specific terminology.
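One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF). This is a generic sketch over document IDs; `k=60` is the constant from the original RRF paper, not something this series prescribes:

```python
# Reciprocal Rank Fusion: combine multiple rankings without needing
# their raw scores to be comparable.
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # earlier rank in any list ⇒ bigger contribution
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic matches
bm25_hits   = ["doc_c", "doc_a", "doc_d"]   # keyword matches (IDs, product names)
print(rrf([vector_hits, bm25_hits]))
# → ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear high in both lists float to the top, which is exactly the behavior you want when a query mixes meaning with exact terminology.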

Lesson 05: Reranking is Your Most Effective Lever

If you want to move the needle on accuracy, add a Reranker. It’s much cheaper than fine-tuning a model and significantly more effective at filtering out the noise from the initial retrieval.
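A reranker simply rescores the top retrieved candidates with a stronger (slower) relevance model. In this sketch the scorer is a stand-in callable; in production it would be a cross-encoder model, which this toy word-overlap function only imitates:

```python
# Reranking sketch: rescore the retriever's candidates with a better
# relevance signal, then keep only the best few for the LLM context.
def rerank(query, candidates, scorer, top_k=3):
    scored = [(scorer(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Toy scorer: shared-word count. A real scorer returns model logits.
def overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))

docs = ["refund policy for orders", "shipping rates", "how to request a refund"]
print(rerank("refund request", docs, overlap, top_k=2))
# → ['how to request a refund', 'refund policy for orders']
```

The pattern matters more than the scorer: retrieve wide (say, top 50), rerank narrow (top 3–5), and the LLM sees far less noise.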

Lesson 06: Embedding Models are Domain-Dependent

The model that works for English Wikipedia might fail for Vietnamese Fintech documentation. Always benchmark a few models (OpenAI vs. BGE vs. Cohere) against your specific dataset.


3. The LLM & Generation Truths

Lesson 07: Hallucinations are a Feature, Not a Bug

LLMs are designed to predict the next token. They don't have a built-in "Fact Checker" module. You must strictly constrain them with an "Only answer from context" instruction and provide an "I don't know" fallback.
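A grounded prompt makes both constraints explicit. A minimal sketch (the exact wording is an assumption, not a prescribed template):

```python
# Grounded-prompt builder: pin the model to the retrieved context and
# give it a safe, exact fallback string instead of letting it improvise.
FALLBACK = "I don't know based on the provided documents."

def build_prompt(context: str, question: str) -> str:
    return (
        "Answer ONLY from the context below. If the answer is not in the "
        f'context, reply exactly: "{FALLBACK}"\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("Refunds take 5 business days.", "How long do refunds take?")
print(prompt)
```

An exact fallback string is also easy to detect downstream, so you can log "I don't know" rates as a retrieval-quality metric.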

Lesson 08: Context Window Does Not Solve Everything

Just because a model has a 128K context window doesn't mean you should use it all. "Lost in the Middle" is real. Information density matters more than information volume.
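One widely used mitigation is to interleave the ranked chunks so the strongest evidence sits at both edges of the context window, where models attend best. A sketch of that reordering heuristic (a common trick, not a universal fix):

```python
# "Lost in the Middle" mitigation sketch: take chunks ranked best-first
# and place them alternately at the front and the (reversed) back, so
# the weakest evidence ends up buried in the middle.
def reorder_for_context(ranked_chunks: list[str]) -> list[str]:
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]   # best chunks at both ends

print(reorder_for_context(["r1", "r2", "r3", "r4", "r5"]))
# → ['r1', 'r3', 'r5', 'r4', 'r2']
```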

Lesson 09: Streaming is a UX Requirement

Wait times longer than 2 seconds feel like an eternity in a chat interface. Implementing SSE (Server-Sent Events) for streaming responses is non-negotiable for user satisfaction.
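The SSE wire format itself is trivial: each event is a `data: <payload>` line followed by a blank line. A sketch of the framing only (the `[DONE]` sentinel is a common convention borrowed from the OpenAI API, not part of the SSE spec):

```python
# SSE framing sketch: a generator like this would be handed to a
# streaming response (e.g. FastAPI's StreamingResponse with
# media_type="text/event-stream") so tokens reach the UI as they arrive.
def sse_frames(tokens):
    for token in tokens:
        yield f"data: {token}\n\n"   # one SSE event per token
    yield "data: [DONE]\n\n"         # sentinel so the client knows to close

frames = list(sse_frames(["Hel", "lo"]))
print(frames)
# → ['data: Hel\n\n', 'data: lo\n\n', 'data: [DONE]\n\n']
```

The double newline is what terminates each event; forget it and the browser's EventSource will silently buffer forever.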


4. The Operations & Security Truths

Lesson 10: Monitoring Quality is Harder than Monitoring Uptime

Your API might be 200 OK, but your answer could be 100% wrong. You need AI-driven evaluation (RAGAS / DeepEval) to monitor the "Truthfulness" of your system at scale.

Lesson 11: Security Filter at the Database, Not the Prompt

Never ask the LLM: "Only answer if the user can see this". It's too easy to bypass with Prompt Injection. Implement Row-level Security (RLS) in your Vector Database.
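As a sketch, the access check belongs in the retrieval query itself. The payload below mirrors a Qdrant-style `must` / `match any` filter but is built as a plain dict for illustration; the `allowed_groups` field name is an assumption:

```python
# Row-level security sketch: build the ACL condition into the vector
# search filter, so forbidden chunks are never retrieved and the LLM
# never sees them - no prompt can leak what was never in the context.
def build_acl_filter(user_groups: list[str]) -> dict:
    return {
        "must": [
            {"key": "allowed_groups", "match": {"any": user_groups}}
        ]
    }

print(build_acl_filter(["sales", "emea"]))
```

The same filter dict is passed alongside the query vector on every search, so security is enforced per request, not per prompt.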

Lesson 12: GPUs are the New Oil (and They're Expensive)

Optimizing your inference stack with vLLM and Quantization isn't just a "nice-to-have" tech optimization; it's a financial necessity as you scale to thousands of users.


5. The Human & Business Truths

Lesson 13: Users are the Best Testers

Professional AI Engineers can't predict how a Customer Support agent or a Sales rep will ask a question. Add a Feedback Loop (Thumbs Up/Down) on day one and use that data to improve your retrieval.

Lesson 14: Manage Stakeholder Expectations

People think AI is magic. You must educate them that RAG is a probabilistic system, not a deterministic one. There will be errors, and that’s why "human-in-the-loop" is still necessary for high-stakes decisions.

Lesson 15: RAG is a Pipeline, Not a Product

You don't "finish" a RAG system. It’s a living pipeline that needs to be tuned as your documentation grows, your models evolve, and your users' needs change.


Series Summary Checklist

If you've followed the entire series, you should now have a system that checks all these boxes:

  • P1: Solves a real, quantified business problem.
  • P2: Uses RAG instead of just "Prompt Stuffing".
  • P3: Has a clean, decoupled architecture.
  • P4: Built with a scalable backend (FastAPI).
  • P5: Uses a performant Vector DB (Qdrant) with Hybrid Search.
  • P6: Optimized for inference (vLLM/OpenAI Gateway).
  • P7: Automated with CI/CD and Kubernetes.
  • P8: Monitored for both performance and quality (RAGAS).
  • P9: Secured against PII leaks and injections.
  • P10: Ready for future patterns (Agents/GraphRAG).

Conclusion: The Journey Continues

Building RAG in production is one of the most challenging but rewarding engineering tasks of this decade. It combines classic Software Engineering, DevOps, and Data Science into a single unified discipline.

I hope this 11-part series has provided you with a clear roadmap, practical code, and the confidence to build your own production AI systems.

The world of AI is moving at lightning speed, and RAG is the foundation of the next generation of software.


🚀 What's Next for You?

  1. Build a POC: Don't just read—code.
  2. Measure: Get your baseline metrics.
  3. Iterate: Use the feedback loop to improve.

Thank you for being part of this journey. If you have any questions or want to share your project, feel free to connect with me!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: RAG · AI Engineering · Production · Software Architecture · Final Thoughts

Series • Part 11 of 11: RAG in Production — The Journey of Building a Real-world AI System

  1. RAG in Production [P1]: Real-world Problem - When Does a Business Actually Need AI?
  2. RAG in Production [P2]: What is RAG? Why not Fine-tuning or Prompt Engineering?
  3. RAG in Production [P3]: Architecture Design - Blueprint for an Enterprise RAG System
  4. RAG in Production [P4]: Backend Implementation - Building the Engine with FastAPI & LangChain
  5. RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for Scale
  6. RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes
  7. RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem
  8. RAG in Production [P8]: Monitoring & Optimization - Keeping an Eye on Your AI
  9. RAG in Production [P9]: Security & Privacy - Protecting Your Enterprise Data
  10. RAG in Production [P10]: Future Improvements - Agentic RAG, GraphRAG & Beyond
  11. RAG in Production [P11]: Lessons Learned - 15 Hard Truths About RAG in Production (this post)

Written by Truong Pham

Software Engineer passionate about building high-performance systems and meaningful experiences.
