RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for Scale
RAG in Production — The Journey of Building a Real-world AI System


Deep dive into Vector Databases, specifically Qdrant. Learn about HNSW indexing, distance metrics, and how to design a schema for millions of vectors.

Truong Pham, Software Engineer
Published: April 5, 2024
Stack: Qdrant · Vector Database · HNSW · Search

"A vector database is like a library where books aren't sorted by title, but by the smell of their stories." In the world of RAG, the Vector Database (VDB) is the memory of your AI. If its memory is disorganized, the AI will be confused.


Table of Contents

  1. Why Specifically a Vector Database?
  2. Why Did We Choose Qdrant?
  3. The Heart of Indexing: HNSW Algorithm
  4. Choosing the Right Distance Metric
  5. Schema Design for Production
  6. Hybrid Search: Dense + Sparse Vectors
  7. Resource Optimization: Quantization
  8. Conclusion & Next Post

Why Specifically a Vector Database?

Traditional databases (SQL/NoSQL) are built for exact matching. You search for name = 'Ivan' and it finds exactly that.

In AI, users ask semantic questions: "How can I get my money back?". A traditional DB would look for the words "money" and "back", but it doesn't understand that the question is really about the "Refund Policy".

Vector Databases search by meaning: they represent text as numerical vectors in a high-dimensional space, where semantically similar texts end up close together.
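In miniature, with toy 3-dimensional vectors standing in for real embeddings (a sketch; real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: imagine an embedding model produced these.
query   = [0.9, 0.1, 0.0]   # "How can I get my money back?"
refund  = [0.8, 0.2, 0.1]   # "Refund Policy"
weather = [0.0, 0.1, 0.9]   # "Weather forecast"

print(cosine_similarity(query, refund))   # high: semantically close
print(cosine_similarity(query, weather))  # low: unrelated
```

The query never mentions the word "refund", yet its vector sits much closer to the refund document than to the unrelated one.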


Why Did We Choose Qdrant?

Among many players like Pinecone, Weaviate, Milvus, and pgvector, we chose Qdrant for 3 main reasons:

  1. Performance & Reliability: Built in Rust. It's extremely memory-efficient and fast.
  2. Advanced Filtering: Unlike some VDBs that filter after searching (slow), Qdrant filters during search (fast).
  3. Rich Feature Set: Built-in Hybrid Search, support for Sparse Vectors, and easy-to-use API.

The Heart of Indexing: HNSW Algorithm

Searching through 10 million vectors linearly is impossible in real-time. We need an index. Qdrant uses HNSW (Hierarchical Navigable Small World).

How HNSW Works:

Imagine HNSW as a "highway and local roads" system:

  • Layer 0 (Bottom): Contains every point, densely linked to its neighbors. Searching here alone is slow but precise.
  • Layers 1, 2, ... (Top): Higher layers contain fewer points. They are "highways" that let you jump large distances across the vector space.
graph TD
    subgraph "Layer 2 (Express)"
        A2[Point A] --- B2[Point B]
    end
    subgraph "Layer 1 (Suburban)"
        A1[Point A] --- C1[Point C] --- B1[Point B]
    end
    subgraph "Layer 0 (City Streets)"
        A0[Point A] --- D0[Point D] --- C0[Point C] --- E0[Point E] --- B0[Point B]
    end

Trade-off (the two main HNSW build parameters):

  • Higher m: more links per node, higher accuracy, but more RAM.
  • Higher ef_construct: larger candidate list while building, better index quality, but slower indexing time.

Choosing the Right Distance Metric

How do we define "closeness" between two vectors?

  1. Cosine Similarity (Most Common): Measures the angle between vectors. Perfect for text because it ignores the length of the document and focuses on the direction of meaning.
  2. Dot Product: Fast but sensitive to vector magnitude. Use it when your embedding model outputs normalized (unit-length) vectors, in which case it is equivalent to Cosine Similarity.
  3. Euclidean (L2) Distance: Measures the direct distance between points. Common in image search.

For RAG with OpenAI/BGE embeddings, always start with Cosine Similarity.
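A tiny numeric illustration of why the metrics disagree: two vectors pointing in the same direction but with different lengths.

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cosine(a, b): return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
def euclidean(a, b): return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine(a, b))     # 1.0 — identical direction of meaning
print(dot(a, b))        # 10.0 — inflated by magnitude
print(euclidean(a, b))  # ~2.24 — "far apart" despite the same direction
```

Cosine is the only one of the three that ignores length entirely, which is why it is the default for text.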


Schema Design for Production

In Qdrant, a "collection" is like an SQL table. Here is how we structured ours:

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="enterprise_docs",
    vectors_config=models.VectorParams(
        size=1536, # OpenAI embedding size
        distance=models.Distance.COSINE
    ),
    # Optimized for fast searching with Payload pre-filtering
    hnsw_config=models.HnswConfigDiff(
        payload_m=16, 
        m=16
    )
)

Strategic Metadata (Payload)

Don't just store the text. Store these for efficient filtering:

  • user_id / group_id: Essential for security.
  • doc_type: pdf, confluence, slack.
  • version: To avoid retrieving outdated info.
  • language: If you are building a multi-lingual system.

Hybrid Search: Dense + Sparse Vectors

Vector search is great for meaning but terrible for exact keywords (e.g., searching for a specific error code like ERR-404).

Qdrant's Hybrid Solution:

  1. Dense Vector (Embeddings): Understands "I'm having a connection issue".
  2. Sparse Vector (BM25/SPLADE): Understands the exact keyword "connection".

Qdrant fuses these results using Reciprocal Rank Fusion (RRF) to give the best of both worlds.
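RRF itself is simple enough to sketch in plain Python (assuming the conventional constant k = 60; the document IDs are invented):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["doc_refund", "doc_billing", "doc_login"]   # semantic ranking
sparse_hits = ["doc_err404", "doc_refund", "doc_network"]  # keyword ranking

print(rrf([dense_hits, sparse_hits]))
# doc_refund comes first: it ranks near the top of *both* lists
```

Because RRF only looks at ranks, not raw scores, it fuses dense and sparse results without having to normalize two incomparable scoring scales.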


Resource Optimization: Quantization

Storing 1,536 dimensions as floating-point numbers (float32) for millions of documents uses a massive amount of RAM.

Solution: Product Quantization (PQ) or Scalar Quantization. Qdrant can compress vectors (e.g., from float32 to int8). This can reduce RAM usage by 4x with a minimal loss in accuracy (~1%).

# Enabling Scalar Quantization in Qdrant
client.update_collection(
    collection_name="enterprise_docs",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # compress float32 -> int8
            quantile=0.99,                # clip outliers beyond the 99th percentile
            always_ram=True               # keep quantized vectors in RAM for speed
        )
    )
)
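A back-of-envelope check of the 4x figure for the raw vector data (ignoring index overhead):

```python
dims = 1536
num_vectors = 1_000_000

float32_bytes = dims * 4 * num_vectors  # 4 bytes per dimension
int8_bytes    = dims * 1 * num_vectors  # 1 byte per dimension after scalar quantization

print(float32_bytes / 2**30)  # ~5.7 GiB of raw vectors at float32
print(int8_bytes / 2**30)     # ~1.4 GiB at int8
```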

Conclusion & Next Post

Designing a vector database layer is about balancing Speed, Accuracy, and Cost. HNSW gives us the speed, Cosine Similarity gives us the accuracy, and Quantization keeps the costs down.

3 Key Takeaways:

  1. HNSW is the engine that makes large-scale RAG possible.
  2. Payload filtering is the secret to enterprise security.
  3. Hybrid Search is mandatory for systems involving specific technical terms.

👉 Next Post: [Post 06] LLM Inference - Deploying Models with vLLM & Kubernetes

Now that we have the "Memory" (Vector DB) and the "Body" (Backend), we need the "Voice" — the LLM. We will discuss how to deploy self-hosted models like Llama 3 using vLLM to achieve 10x throughput compared to standard deployment.


📬 Are you using Qdrant or another Vector DB? I'd love to hear your experience!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Qdrant · Vector Database · Machine Learning · Search · Engineering

Series • Part 5 of 11

RAG in Production — The Journey of Building a Real-world AI System
