RAG in Production [P3]: Architecture Design - Blueprint for an Enterprise RAG System
Design a production-ready RAG architecture with key components: Ingestion Pipeline, Retriever, and Generator, while ensuring high scalability and observability.
"Architecture is about the core decisions—the ones that are expensive to change later." In RAG, if you choose the wrong chunking strategy or vector database schema, you might have to re-index millions of documents. This post is about how to avoid that.*
Table of Contents
- Recap of Post 02
- Core Design Principles
- High-Level Architecture (HLD)
- Component 1: Ingestion Pipeline (The Data Factory)
- Component 2: Embedding Service
- Component 3: Vector Database Layer
- Component 4: Retrieval & Query Service
- Component 5: LLM Gateway & Generation
- Component 6: Observability & Feedback Loop
- Handling Security & Access Control
- Conclusion & Next Post
Recap of Post 02
In the previous post, we analyzed:
- Why RAG is superior to Fine-tuning for knowledge-intensive enterprise tasks.
- How RAG works at a high level (Retrieve → Augment → Generate).
- The fundamental importance of semantic search using embeddings.
Now, we will move from "what" and "why" to "how" by designing a professional system architecture.
Core Design Principles
When designing for an enterprise environment, we followed 5 "commandments":
- Separation of Concerns (SoC): The service that processes documents (Ingestion) must be separate from the service that answers questions (Query).
- Replaceability: You should be able to swap OpenAI for Llama 3 or Qdrant for Pinecone without rewriting the entire core logic.
- Observability: Every step of the pipeline must be logged and monitored. If an answer is wrong, you must know if it's because of bad retrieval or bad generation.
- Fail Gracefully: If the vector database is down, the system should return a friendly error or a fallback answer, not crash.
- Security by Design: Access control must be handled at the database layer (pre-filtering), not just filtered in the UI.
High-Level Architecture (HLD)
Our system is divided into two main pipelines that operate independently:
```mermaid
graph TD
    subgraph "Ingestion Pipeline (Offline/Async)"
        D[Data Sources: Confluence, PDF, Slack] --> P[Document Processor]
        P --> C[Chunking Engine]
        C --> E1[Embedding Service]
        E1 --> VDB[(Vector Database)]
    end

    subgraph "Query Pipeline (Real-time)"
        U[User Interface] --> Q[Query Service]
        Q --> E2[Embedding Service]
        E2 --> R[Retriever]
        R --> VDB
        VDB --> R
        R --> G[LLM Generator]
        G --> U
    end

    subgraph "Infrastructure & Observability"
        OBS[Prometheus / Grafana / LangSmith]
        Q -.-> OBS
        P -.-> OBS
    end
```
Component 1: Ingestion Pipeline (The Data Factory)
This pipeline handles "digesting" documents and transforming them into a format the AI can understand.
Document Connectors
The system must support various source types:
- API-based: Crawling Confluence, Jira, Slack.
- File-based: Uploading PDF, DOCX, Markdown.
- Database-based: Syncing from SQL/NoSQL.
Document Processor (Cleaning)
Raw documents are often "noisy". Before chunking, we need to:
- Remove HTML tags, unnecessary scripts.
- Normalize encoding (UTF-8).
- Extract Metadata (author, creation date, department, access hierarchy).
Chunking Engine
This is where you decide how to "slice" documents. We used a Recursive Character Text Splitter with an overlap:
```python
# Optimal configuration for general documentation
chunk_size = 800       # tokens
chunk_overlap = 150    # tokens; avoids cutting information at a chunk boundary
```
Why Overlap? To ensure that if a sentence is cut in the middle, the context is preserved in the next chunk.
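The idea can be sketched in a few lines of Python. This is a character-based sliding window for brevity; a production splitter like LangChain's `RecursiveCharacterTextSplitter` additionally respects paragraph and sentence boundaries and can count tokens instead of characters, but the overlap mechanics are the same:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 150) -> list[str]:
    """Split text into overlapping chunks.

    Each chunk starts `chunk_size - overlap` characters after the previous
    one, so the last `overlap` characters of chunk N reappear at the start
    of chunk N+1 and no sentence is lost at a boundary.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```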
Component 2: Embedding Service
The Embedding Service transforms text into numerical vectors.
- Model Choice: We chose `text-embedding-3-small` (OpenAI) for the MVP due to its low cost and high performance. Later, we added a fallback to `bge-m3` (open-source) for self-hosting.
- Batching: When indexing 3,000 documents, you must use batch embedding to optimize network latency and API costs.
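As a sketch, batching can be as simple as slicing the corpus and making one embedding call per slice. Here `embed_fn` is a stand-in for whatever client call you use (e.g. one request to OpenAI's embeddings endpoint), injected so the batching logic stays provider-agnostic:

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Embed texts in batches to cut network round-trips.

    embed_fn maps a list of strings to a list of vectors in one call
    (one API request per batch instead of one per document).
    """
    vectors = []
    for i in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[i:i + batch_size]))
    return vectors
```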
Component 3: Vector Database Layer
We chose Qdrant for our production environment.
Why Qdrant?
- Performance: Written in Rust, extremely fast.
- Filtering: Supports complex payload filtering (e.g., "Search only in HR department documents").
- Hybrid Search: Built-in support for combining Vector Search and BM25.
- Self-hosted: Can run on Kubernetes (K8s).
Schema Design
Each point (unit) in the vector database has:
- ID: UUID.
- Vector: [0.12, -0.05, ...] (1536 dimensions).
- Payload (Metadata):
  - `content`: Original text.
  - `source_url`: Link to the doc.
  - `department_id`: For access control.
  - `last_updated`: For checking staleness.
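For illustration, here is the same schema as a typed record. The `DocChunk` name is ours, not part of the Qdrant client; in practice you would upsert these fields as a point's payload:

```python
from dataclasses import dataclass, field
import uuid

@dataclass
class DocChunk:
    """One point in the vector database; field names mirror the payload above."""
    content: str           # original text
    source_url: str        # link to the doc
    department_id: str     # for access control
    last_updated: str      # ISO-8601 date, for checking staleness
    vector: list[float] = field(default_factory=list)          # 1536-dim embedding
    id: str = field(default_factory=lambda: str(uuid.uuid4()))  # point ID
```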
Component 4: Retrieval & Query Service
This is the "brain" of the real-time system.
Query Rewriting
Users often ask vague questions like: "How about the refund?". The Query Service uses a small LLM (like GPT-3.5) to rewrite this into: "What is the company's refund policy and process?" before searching.
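A sketch of the rewrite step. The prompt wording and the `build_rewrite_messages` helper are ours for illustration; the actual chat-completion call to the small model is omitted:

```python
REWRITE_PROMPT = (
    "Rewrite the user's question into a complete, self-contained search "
    "query for an internal knowledge base. Question: {question}"
)

def build_rewrite_messages(question: str) -> list[dict]:
    """Build the chat messages for a cheap rewrite model (the post uses GPT-3.5)."""
    return [
        {"role": "system",
         "content": "You expand vague questions into explicit search queries."},
        {"role": "user", "content": REWRITE_PROMPT.format(question=question)},
    ]
```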
Hybrid Retrieval
We combine two search methods:
- Vector Search: Finds semantic meaning.
- Keyword Search (BM25): Finds exact matches (important for product codes like `SKU-9902`).
Reranking
After getting Top-20 results, we use a Cross-Encoder Reranker (like BGE-Reranker) to select the Top-5 best snippets. This significantly increases accuracy.
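One common way to fuse the two ranked lists before reranking (not necessarily what Qdrant does internally) is Reciprocal Rank Fusion, which rewards documents that appear near the top of either list:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of document IDs (e.g. from vector search and BM25).

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    k=60 is the constant from the original RRF paper.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The fused Top-20 then goes to the cross-encoder, which scores each (query, snippet) pair directly to pick the final Top-5.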
Component 5: LLM Gateway & Generation
Instead of calling OpenAI directly, we built an LLM Gateway.
Responsibilities:
- Load Balancing: Distribute requests between multiple API keys or models.
- Retries: Automatically retry if the API times out.
- Rate Limiting: Protect the system from being overwhelmed or incurring high costs.
- Fallback: If GPT-4 is down, switch to Claude 3 or a self-hosted Llama 3.
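A minimal sketch of the retry-with-fallback logic. The provider callables are placeholders for real client calls; production code would catch only timeouts and 5xx errors and add jitter to the backoff:

```python
import time

def call_with_fallback(providers, prompt, retries=2, backoff=0.5):
    """Try each provider in order (e.g. GPT-4, then Claude 3, then Llama 3).

    Each provider is a callable prompt -> answer; raising an exception
    signals failure. Transient failures are retried with exponential backoff.
    """
    last_error = None
    for provider in providers:
        for attempt in range(retries + 1):
            try:
                return provider(prompt)
            except Exception as exc:  # in production: catch timeout/5xx only
                last_error = exc
                time.sleep(backoff * (2 ** attempt))
    raise RuntimeError("all providers failed") from last_error
```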
Component 6: Observability & Feedback Loop
A production system cannot be a "black box".
- Tracing: Use LangSmith or Arize Phoenix to trace the "path" of a request: Query → Retrieval → Prompt → Answer.
- Metrics: Use Prometheus/Grafana to monitor:
- Average Response Time (Latency).
- Tokens used (Cost).
- Success/Error rate.
- Feedback: Add "Thumbs Up/Down" buttons in the UI. If a user gives a "Thumbs Down", the system automatically creates a ticket for us to review that specific retrieval.
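As a toy illustration of per-stage tracing (a stand-in for what LangSmith or Phoenix records automatically), each request can carry a trace that times every stage, so a bad answer can be blamed on retrieval versus generation:

```python
import time
from contextlib import contextmanager

class RequestTrace:
    """Records the latency of each pipeline stage for one request."""

    def __init__(self, query: str):
        self.query = query
        self.spans: dict[str, float] = {}  # stage name -> seconds

    @contextmanager
    def span(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.spans[stage] = time.perf_counter() - start
```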
Handling Security & Access Control
This is the most common reason AI projects fail in enterprises: you can't let a junior developer see the salary documentation.
Our Approach: Pre-filtering at Database Level
- User logs in → We get their `user_groups` from SSO (Okta/Keycloak).
- When querying, the search includes a filter:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchAny

qdrant = QdrantClient(url="http://localhost:6333")

# The search query includes a pre-filter on the user's groups
results = qdrant.search(
    collection_name="docs",
    query_vector=vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="groups", match=MatchAny(any=user_groups))
        ]
    ),
)
```
- This ensures that the AI never even sees documents the user isn't allowed to access.
Conclusion & Next Post
A solid architecture is the foundation for a scalable and reliable RAG system. By separating Ingestion from Query and implementing a robust observability layer, we can continuously improve the system without rebuilding it from scratch.
3 Key Takeaways:
- Pipeline separation is mandatory for maintainability.
- Metadata is just as important as the vectors themselves for filtering and security.
- Observability is the only way to debug AI systems.
👉 Next Post: [Post 04] Practical Implementation - Coding the Backend with FastAPI & LangChain
Enough talk about diagrams! In the next post, we will open up the editor and start coding. I will share the project structure, how to implement the Retriever class, and how to stream responses from the LLM to the UI using SSE (Server-Sent Events).
📬 What is the most challenging part of your RAG architecture? Share your thoughts below!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: RAG System Design Architecture Software Engineering Enterprise AI
Series • Part 3 of 11