RAG in Production [P3]: Architecture Design - Blueprint for an Enterprise RAG System
Design a production-ready RAG architecture with key components: Ingestion Pipeline, Retriever, and Generator, while ensuring high scalability and observability.
"Architecture is about the core decisions—the ones that are expensive to change later." In RAG, if you choose the wrong chunking strategy or vector database schema, you might have to re-index millions of documents. This post is about how to avoid that.*
Table of Contents
- Recap of Post 02
- Core Design Principles
- High-Level Architecture (HLD)
- Component 1: Ingestion Pipeline (The Data Factory)
- Component 2: Embedding Service
- Component 3: Vector Database Layer
- Component 4: Retrieval & Query Service
- Component 5: LLM Gateway & Generation
- Component 6: Observability & Feedback Loop
- Handling Security & Access Control
- Conclusion & Next Post
Recap of Post 02
In the previous post, we analyzed:
- Why RAG is superior to Fine-tuning for knowledge-intensive enterprise tasks.
- How RAG works at a high level (Retrieve → Augment → Generate).
- The fundamental importance of semantic search using embeddings.
Now, we will move from "what" and "why" to "how" by designing a professional system architecture.
Core Design Principles
When designing for an enterprise environment, we followed 5 "commandments":
- Separation of Concerns (SoC): The service that processes documents (Ingestion) must be separate from the service that answers questions (Query).
- Replaceability: You should be able to swap OpenAI for Llama 3 or Qdrant for Pinecone without rewriting the entire core logic.
- Observability: Every step of the pipeline must be logged and monitored. If an answer is wrong, you must know if it's because of bad retrieval or bad generation.
- Fail Gracefully: If the vector database is down, the system should return a friendly error or a fallback answer, not crash.
- Security by Design: Access control must be handled at the database layer (pre-filtering), not just filtered in the UI.
High-Level Architecture (HLD)
Our system is divided into two main pipelines that operate independently:
```mermaid
graph TD
    subgraph "Ingestion Pipeline (Offline/Async)"
        D[Data Sources: Confluence, PDF, Slack] --> P[Document Processor]
        P --> C[Chunking Engine]
        C --> E1[Embedding Service]
        E1 --> VDB[(Vector Database)]
    end

    subgraph "Query Pipeline (Real-time)"
        U[User Interface] --> Q[Query Service]
        Q --> E2[Embedding Service]
        E2 --> R[Retriever]
        R --> VDB
        VDB --> R
        R --> G[LLM Generator]
        G --> U
    end

    subgraph "Infrastructure & Observability"
        OBS[Prometheus / Grafana / LangSmith]
        Q -.-> OBS
        P -.-> OBS
    end
```
Component 1: Ingestion Pipeline (The Data Factory)
This pipeline handles "digesting" documents and transforming them into a format the AI can understand.
Document Connectors
The system must support various source types:
- API-based: Crawling Confluence, Jira, Slack.
- File-based: Uploading PDF, DOCX, Markdown.
- Database-based: Syncing from SQL/NoSQL.
Document Processor (Cleaning)
Raw documents are often "noisy". Before chunking, we need to:
- Remove HTML tags, unnecessary scripts.
- Normalize encoding (UTF-8).
- Extract Metadata (author, creation date, department, access hierarchy).
Chunking Engine
This is where you decide how to "slice" documents. We used a Recursive Character Text Splitter with an overlap:
```python
# Optimal configuration for general documentation
chunk_size = 800       # tokens
chunk_overlap = 150    # tokens; avoids cutting information at a chunk boundary
```
Why Overlap? To ensure that if a sentence is cut in the middle, the context is preserved in the next chunk.
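The idea can be sketched in a few lines of Python. This is a character-based sliding window for brevity; a production splitter like LangChain's `RecursiveCharacterTextSplitter` additionally respects paragraph and sentence boundaries and can count tokens instead of characters, but the overlap mechanics are the same:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks.

    Each chunk starts `chunk_size - overlap` characters after the previous
    one, so the last `overlap` characters of chunk N reappear at the start
    of chunk N+1 and no sentence is lost at a boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```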
Component 2: Embedding Service
The Embedding Service transforms text into numerical vectors.
- Model Choice: We chose `text-embedding-3-small` (OpenAI) for the MVP due to its low cost and high performance. Later, we added a fallback to `bge-m3` (open-source) for self-hosting.
- Batching: When indexing 3,000 documents, you must use batch embedding to optimize network latency and API costs.
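As a sketch, batching can be as simple as slicing the corpus and making one embedding call per slice. Here `embed_fn` is a stand-in for whatever client call you use (e.g. one request to OpenAI's embeddings endpoint), injected so the batching logic stays provider-agnostic:

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in batches to cut network round-trips.

    embed_fn maps a list of strings to a list of vectors in one call
    (one API request per batch instead of one per document).
    """
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors
```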
Component 3: Vector Database Layer
We chose Qdrant for our production environment.
Why Qdrant?
- Performance: Written in Rust, extremely fast.
- Filtering: Supports complex payload filtering (e.g., "Search only in HR department documents").
- Hybrid Search: Built-in support for combining Vector Search and BM25.
- Self-hosted: Can run on Kubernetes (K8s).
Schema Design
Each point (unit) in the vector database has:
- ID: UUID.
- Vector: [0.12, -0.05, ...] (1536 dimensions).
- Payload (Metadata):
  - `content`: Original text.
  - `source_url`: Link to the doc.
  - `department_id`: For access control.
  - `last_updated`: For checking staleness.
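For illustration, here is the same schema as a typed record. The `DocChunk` name is ours, not part of the Qdrant client; in practice you would upsert these fields as a point's payload:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class DocChunk:
    """One point in the vector database; field names mirror the payload above."""
    content: str           # original text
    source_url: str        # link to the doc
    department_id: str     # for access control
    last_updated: str      # ISO-8601 date, for checking staleness
    vector: list[float] = field(default_factory=list)          # 1536-dim embedding
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # point ID
```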
Component 4: Retrieval & Query Service
This is the "brain" of the real-time system.
Query Rewriting
Users often ask vague questions like: "How about the refund?". The Query Service uses a small LLM (like GPT-3.5) to rewrite this into: "What is the company's refund policy and process?" before searching.
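A sketch of the rewrite step. The prompt wording and the `build_rewrite_messages` helper are ours for illustration; the actual chat-completion call to the small model is omitted:

```python
REWRITE_PROMPT = (
    "Rewrite the user's question into a complete, self-contained search "
    "query for an internal knowledge base. Question: {question}"
)

def build_rewrite_messages(question: str) -> list[dict]:
    """Build the chat messages for a cheap rewrite model (the post uses GPT-3.5)."""
    return [
        {"role": "system",
         "content": "You expand vague questions into explicit search queries."},
        {"role": "user", "content": REWRITE_PROMPT.format(question=question)},
    ]
```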
Hybrid Retrieval
We combine two search methods:
- Vector Search: Finds semantic meaning.
- Keyword Search (BM25): Finds exact matches (important for product codes like `SKU-9902`).
Reranking
After getting Top-20 results, we use a Cross-Encoder Reranker (like BGE-Reranker) to select the Top-5 best snippets. This significantly increases accuracy.
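One common way to fuse the two ranked lists before reranking (not necessarily what Qdrant does internally) is Reciprocal Rank Fusion, which rewards documents that appear near the top of either list:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of document IDs (e.g. from vector search and BM25).

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused Top-20 then goes to the cross-encoder, which scores each (query, snippet) pair directly to pick the final Top-5.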
Component 5: LLM Gateway & Generation
Instead of calling OpenAI directly, we built an LLM Gateway.
Responsibilities:
- Load Balancing: Distribute requests between multiple API keys or models.
- Retries: Automatically retry if the API times out.
- Rate Limiting: Protect the system from being overwhelmed or incurring high costs.
- Fallback: If GPT-4 is down, switch to Claude 3 or a self-hosted Llama 3.
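A minimal sketch of the retry-with-fallback logic. The provider callables are placeholders for real client calls; production code would catch only timeouts and 5xx errors and add jitter to the backoff:

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.5):
    """Try each provider in order (e.g. GPT-4, then Claude 3, then Llama 3).

    Each provider is a callable prompt -> answer; raising an exception
    signals failure. Transient failures are retried with exponential backoff.
    """
    last_error = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)
            except Exception as exc:  # in production: catch timeout/5xx only
                last_error = exc
                time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```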
Component 6: Observability & Feedback Loop
A production system cannot be a "black box".
- Tracing: Use LangSmith or Arize Phoenix to trace the "path" of a request: Query → Retrieval → Prompt → Answer.
- Metrics: Use Prometheus/Grafana to monitor:
- Average Response Time (Latency).
- Tokens used (Cost).
- Success/Error rate.
- Feedback: Add "Thumbs Up/Down" buttons in the UI. If a user gives a "Thumbs Down", the system automatically creates a ticket for us to review that specific retrieval.
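As a toy illustration of per-stage tracing (a stand-in for what LangSmith or Phoenix records automatically), each request can carry a trace that times every stage, so a bad answer can be blamed on retrieval versus generation:

```python
import time
from contextlib import contextmanager

class RequestTrace:
    """Records the latency of each pipeline stage for one request."""

    def __init__(self, query: str):
        self.query = query
        self.spans: dict[str, float] = {}  # stage name -> seconds

    @contextmanager
    def span(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[stage] = time.perf_counter() - start
```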
Handling Security & Access Control
This is the most common reason AI projects fail in enterprises: you can't let a junior developer see the salary documentation.
Our Approach: Pre-filtering at Database Level
- User logs in → We get their `user_groups` from SSO (Okta/Keycloak).
- When querying, the search includes a filter:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny

qdrant = QdrantClient(url="http://localhost:6333")

# The search query includes a pre-filter on the user's groups
results = qdrant.search(
    collection_name="docs",
    query_vector=vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="groups", match=MatchAny(any=user_groups))
        ]
    ),
)
```
- This ensures that the AI never even sees documents the user isn't allowed to access.
Conclusion & Next Post
A solid architecture is the foundation for a scalable and reliable RAG system. By separating Ingestion from Query and implementing a robust observability layer, we can continuously improve the system without rebuilding it from scratch.
3 Key Takeaways:
- Pipeline separation is mandatory for maintainability.
- Metadata is just as important as the vectors themselves for filtering and security.
- Observability is the only way to debug AI systems.
👉 Next Post: [Post 04] Practical Implementation - Coding the Backend with FastAPI & LangChain
Enough talk about diagrams! In the next post, we will open up the editor and start coding. I will share the project structure, how to implement the Retriever class, and how to stream responses from the LLM to the UI using SSE (Server-Sent Events).
📬 What is the most challenging part of your RAG architecture? Share your thoughts below!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: RAG System Design Architecture Software Engineering Enterprise AI
Series • Part 3 of 11