RAG in Production [P3]: Architecture Design - Blueprint for an Enterprise RAG System
RAG in Production — The Journey of Building a Real-world AI System


Design a production-ready RAG architecture with key components: Ingestion Pipeline, Retriever, and Generator, while ensuring high scalability and observability.

Truong Pham · Software Engineer
Published: March 30, 2024
Tags: RAG · System Design · Architecture · Engineering

"Architecture is about the core decisions—the ones that are expensive to change later." In RAG, if you choose the wrong chunking strategy or vector database schema, you might have to re-index millions of documents. This post is about how to avoid that.


Table of Contents

  1. Recap of Post 02
  2. Core Design Principles
  3. High-Level Architecture (HLD)
  4. Component 1: Ingestion Pipeline (The Data Factory)
  5. Component 2: Embedding Service
  6. Component 3: Vector Database Layer
  7. Component 4: Retrieval & Query Service
  8. Component 5: LLM Gateway & Generation
  9. Component 6: Observability & Feedback Loop
  10. Handling Security & Access Control
  11. Conclusion & Next Post

Recap of Post 02

In the previous post, we analyzed:

  • Why RAG is superior to Fine-tuning for knowledge-intensive enterprise tasks.
  • How RAG works at a high level (Retrieve → Augment → Generate).
  • The fundamental importance of semantic search using embeddings.

Now, we will move from "what" and "why" to "how" by designing a professional system architecture.


Core Design Principles

When designing for an enterprise environment, we followed 5 "commandments":

  1. Separation of Concerns (SoC): The service that processes documents (Ingestion) must be separate from the service that answers questions (Query).
  2. Replaceability: You should be able to swap OpenAI for Llama 3 or Qdrant for Pinecone without rewriting the entire core logic.
  3. Observability: Every step of the pipeline must be logged and monitored. If an answer is wrong, you must know if it's because of bad retrieval or bad generation.
  4. Fail Gracefully: If the vector database is down, the system should return a friendly error or a fallback answer, not crash.
  5. Security by Design: Access control must be handled at the database layer (pre-filtering), not just filtered in the UI.
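Principle 2 (Replaceability) is easiest to enforce with a thin interface between the core logic and each provider. A minimal sketch of the idea (the class and function names here are illustrative, not from our codebase):

```python
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    """Thin interface so core logic never imports a concrete vendor SDK."""
    @abstractmethod
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class FakeLocalEmbedder(EmbeddingProvider):
    """Stand-in for a self-hosted model (e.g. bge-m3); returns toy vectors."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

def index_documents(docs: list[str], embedder: EmbeddingProvider) -> int:
    """Core logic depends only on the interface, so providers stay swappable."""
    vectors = embedder.embed(docs)
    return len(vectors)
```

Swapping OpenAI for Llama 3 then means writing one new subclass, not touching the pipeline.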

High-Level Architecture (HLD)

Our system is divided into two main pipelines that operate independently:

graph TD
    subgraph "Ingestion Pipeline (Offline/Async)"
        D[Data Sources: Confluence, PDF, Slack] --> P[Document Processor]
        P --> C[Chunking Engine]
        C --> E1[Embedding Service]
        E1 --> VDB[(Vector Database)]
    end

    subgraph "Query Pipeline (Real-time)"
        U[User Interface] --> Q[Query Service]
        Q --> E2[Embedding Service]
        E2 --> R[Retriever]
        R --> VDB
        VDB --> R
        R --> G[LLM Generator]
        G --> U
    end

    subgraph "Infrastructure & Observability"
        OBS[Prometheus / Grafana / LangSmith]
        Q -.-> OBS
        P -.-> OBS
    end

Component 1: Ingestion Pipeline (The Data Factory)

This pipeline handles "digesting" documents and transforming them into a format the AI can understand.

Document Connectors

A system must support various formats:

  • API-based: Crawling Confluence, Jira, Slack.
  • File-based: Uploading PDF, DOCX, Markdown.
  • Database-based: Syncing from SQL/NoSQL.

Document Processor (Cleaning)

Raw documents are often "noisy". Before chunking, we need to:

  • Remove HTML tags, unnecessary scripts.
  • Normalize encoding (UTF-8).
  • Extract Metadata (author, creation date, department, access hierarchy).
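The cleaning step can be sketched with the standard library alone (a rough illustration; a production pipeline would use a proper HTML parser and richer metadata extraction):

```python
import re
import unicodedata

def clean_document(raw_html: str) -> str:
    """Strip scripts/tags and normalize text before chunking (simplified sketch)."""
    # Drop <script> and <style> blocks entirely, including their contents
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", raw_html,
                  flags=re.S | re.I)
    # Replace remaining HTML tags with whitespace
    text = re.sub(r"<[^>]+>", " ", text)
    # Normalize to a canonical Unicode form (consistent UTF-8 representation)
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace
    return re.sub(r"\s+", " ", text).strip()
```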

Chunking Engine

This is where you decide how to "slice" documents. We used a Recursive Character Text Splitter with an overlap:

# Optimal configuration for general documentation
chunk_size = 800       # tokens
chunk_overlap = 150    # tokens; avoids cutting information at a chunk boundary

Why Overlap? To ensure that if a sentence is cut in the middle, the context is preserved in the next chunk.
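A character-based sliding window shows the mechanics of overlap (a simplified stand-in for the Recursive Character Text Splitter, which also respects paragraph and sentence boundaries; real chunking counts tokens, not characters):

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into windows of chunk_size, each sharing `overlap`
    characters with the previous chunk so boundary context is preserved."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```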


Component 2: Embedding Service

The Embedding Service transforms text into numerical vectors.

  • Model Choice: We chose text-embedding-3-small (OpenAI) for the MVP due to its low cost and high performance. Later, we added a fallback to bge-m3 (Open-source) for self-hosting.
  • Batching: When indexing 3,000 documents, you must use batch embedding to optimize network latency and API costs.
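Batching can be sketched as a thin wrapper around any embedding call (`embed_fn` stands in for the provider's batch endpoint; the default batch size is illustrative, not a tuned value):

```python
from typing import Callable

def embed_in_batches(texts: list[str],
                     embed_fn: Callable[[list[str]], list[list[float]]],
                     batch_size: int = 100) -> list[list[float]]:
    """Embed texts in batches rather than one API call per chunk,
    cutting network round trips and per-request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors
```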

Component 3: Vector Database Layer

We chose Qdrant for our production environment.

Why Qdrant?

  • Performance: Written in Rust, extremely fast.
  • Filtering: Supports complex payload filtering (e.g., "Search only in HR department documents").
  • Hybrid Search: Built-in support for combining Vector Search and BM25.
  • Self-hosted: Can run on Kubernetes (K8s).

Schema Design

Each point (unit) in the vector database has:

  • ID: UUID.
  • Vector: [0.12, -0.05, ...] (1536 dimensions).
  • Payload (Metadata):
    • content: Original text.
    • source_url: Link to the doc.
    • department_id: For access control.
    • last_updated: For checking staleness.
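Assembling one point that follows this schema might look like the following (a plain-dict sketch; with qdrant-client you would pass these fields to a PointStruct before upserting):

```python
from uuid import uuid4

def build_point(content: str, source_url: str, department_id: str,
                last_updated: str, vector: list[float]) -> dict:
    """Build one vector-database point matching the schema above."""
    return {
        "id": str(uuid4()),                  # UUID
        "vector": vector,                    # 1536 floats for text-embedding-3-small
        "payload": {
            "content": content,              # original text
            "source_url": source_url,        # link back to the doc
            "department_id": department_id,  # for access control
            "last_updated": last_updated,    # for staleness checks
        },
    }
```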

Component 4: Retrieval & Query Service

This is the "brain" of the real-time system.

Query Rewriting

Users often ask vague questions like: "How about the refund?". The Query Service uses a small LLM (like GPT-3.5) to rewrite this into: "What is the company's refund policy and process?" before searching.
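A minimal sketch of that rewrite step, with the LLM injected as a plain callable so the model stays swappable (the prompt wording is an illustrative guess, not our exact production prompt):

```python
from typing import Callable

REWRITE_PROMPT = (
    "Rewrite the user's question as a standalone, specific search query.\n"
    "Question: {question}\n"
    "Rewritten query:"
)

def rewrite_query(question: str, llm: Callable[[str], str]) -> str:
    """`llm` is any completion function, e.g. a thin GPT-3.5 wrapper."""
    return llm(REWRITE_PROMPT.format(question=question)).strip()
```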

Hybrid Retrieval

We combine two search methods:

  1. Vector Search: Finds semantic meaning.
  2. Keyword Search (BM25): Finds exact matches (important for product codes like SKU-9902).

Reranking

After getting Top-20 results, we use a Cross-Encoder Reranker (like BGE-Reranker) to select the Top-5 best snippets. This significantly increases accuracy.
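The reranking step reduces to scoring every (query, candidate) pair and keeping the best. Here `score_fn` stands in for the cross-encoder, e.g. a BGE-Reranker loaded via sentence-transformers:

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Sort candidates by cross-encoder relevance score and keep the top_k."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_k]
```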


Component 5: LLM Gateway & Generation

Instead of calling OpenAI directly, we built an LLM Gateway.

Responsibilities:

  • Load Balancing: Distribute requests between multiple API keys or models.
  • Retries: Automatically retry if the API times out.
  • Rate Limiting: Protect the system from being overwhelmed or incurring high costs.
  • Fallback: If GPT-4 is down, switch to Claude 3 or a self-hosted Llama 3.
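The retry and fallback responsibilities can be sketched as one loop over an ordered provider list (`providers` are plain callables such as hypothetical call_gpt4, call_claude3 wrappers; production code would catch only timeout and rate-limit errors and add jitter):

```python
import time

def call_with_fallback(prompt: str, providers, max_retries: int = 2) -> str:
    """Try each provider in order; retry transient failures with
    exponential backoff before falling back to the next provider."""
    last_err: Exception | None = None
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(prompt)
            except Exception as err:  # narrow this in production
                last_err = err
                time.sleep(0.1 * (2 ** attempt))  # exponential backoff
    raise RuntimeError("all providers failed") from last_err
```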

Component 6: Observability & Feedback Loop

A production system cannot be a "black box".

  • Tracing: Use LangSmith or Arize Phoenix to trace the "path" of a request: Query → Retrieval → Prompt → Answer.
  • Metrics: Use Prometheus/Grafana to monitor:
    • Average Response Time (Latency).
    • Tokens used (Cost).
    • Success/Error rate.
  • Feedback: Add "Thumbs Up/Down" buttons in the UI. If a user gives a "Thumbs Down", the system automatically creates a ticket for us to review that specific retrieval.

Handling Security & Access Control

This is the most common reason AI projects fail in enterprises: you can't let a junior developer see the salary documentation.

Our Approach: Pre-filtering at Database Level

  1. User logs in → We get their user_groups from SSO (Okta/Keycloak).
  2. When querying, the search request includes a mandatory group filter:

# The search query includes a filter on the user's groups
from qdrant_client.models import FieldCondition, Filter, MatchAny

results = qdrant.search(
    collection_name="docs",
    query_vector=vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="groups", match=MatchAny(any=user_groups))
        ]
    ),
)

  3. This ensures that the AI never even sees documents the user isn't allowed to access.

Conclusion & Next Post

A solid architecture is the foundation for a scalable and reliable RAG system. By separating Ingestion from Query and implementing a robust observability layer, we can continuously improve the system without rebuilding it from scratch.

3 Key Takeaways:

  1. Pipeline separation is mandatory for maintainability.
  2. Metadata is just as important as the vectors themselves for filtering and security.
  3. Observability is the only way to debug AI systems.

👉 Next Post: [Post 04] Practical Implementation - Coding the Backend with FastAPI & LangChain

Enough talk about diagrams! In the next post, we will open up the editor and start coding. I will share the project structure, how to implement the Retriever class, and how to stream responses from the LLM to the UI using SSE (Server-Sent Events).


📬 What is the most challenging part of your RAG architecture? Share your thoughts below!


Author: Truong Pham · Series: RAG in Production — The Journey of Building a Real-world AI System · Tags: RAG, System Design, Architecture, Software Engineering, Enterprise AI

Series • Part 3 of 11

RAG in Production — The Journey of Building a Real-world AI System
