RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for Scale
Deep dive into Vector Databases, specifically Qdrant. Learn about HNSW indexing, distance metrics, and how to design a schema for millions of vectors.
"A vector database is like a library where books aren't sorted by title, but by the smell of their stories." In the world of RAG, the Vector Database (VDB) is the memory of your AI. If its memory is disorganized, the AI will be confused.*
Table of Contents
- Why Specifically a Vector Database?
- Why Did We Choose Qdrant?
- The Heart of Indexing: HNSW Algorithm
- Choosing the Right Distance Metric
- Schema Design for Production
- Hybrid Search: Dense + Sparse Vectors
- Resource Optimization: Quantization & Indexing
- Conclusion & Next Post
Why Specifically a Vector Database?
Traditional databases (SQL/NoSQL) are built for exact matching. You search for `name = 'Ivan'` and it finds exactly that.
In AI, users ask semantic questions: "How can I get my money back?" A traditional DB would match the words "money" and "back", but it doesn't understand that the question is really about the "Refund Policy".
Vector databases search by meaning, representing each text as a numerical vector in a high-dimensional space where semantically similar texts sit close together.
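To make this concrete, here is a toy sketch in plain Python. The 3-dimensional "embeddings" are hand-made for illustration (a real model outputs hundreds or thousands of dimensions), but the mechanics are the same: rank documents by how close their vectors are to the query vector.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hand-crafted toy "embeddings" (hypothetical values, not from a real model)
docs = {
    "Refund Policy":  [0.9, 0.1, 0.0],
    "Shipping Times": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of "How can I get my money back?"

best = max(docs, key=lambda name: cosine_similarity(query, docs[name]))
print(best)  # → Refund Policy
```

No word in the query matches "Refund Policy", yet it ranks first, because the vectors point in similar directions. That is the core trick a vector database industrializes.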
Why Did We Choose Qdrant?
Among many players like Pinecone, Weaviate, Milvus, and pgvector, we chose Qdrant for 3 main reasons:
- Performance & Reliability: Built in Rust. It's extremely memory-efficient and fast.
- Advanced Filtering: Unlike some VDBs that filter after searching (slow), Qdrant filters during search (fast).
- Rich Feature Set: Built-in Hybrid Search, support for Sparse Vectors, and easy-to-use API.
The Heart of Indexing: HNSW Algorithm
Scanning 10 million vectors linearly is far too slow for real-time search. We need an index. Qdrant uses HNSW (Hierarchical Navigable Small World).
How HNSW Works:
Imagine HNSW as a "highway and local roads" system:
- Layer 0 (Bottom): All points are connected. Searching here is slow but precise.
- Layer 1, 2, ... (Top): Higher layers contain fewer points. They are "highways" that allow you to jump large distances across the vector space.
```mermaid
graph TD
    subgraph "Layer 2 (Express)"
        A2[Point A] --- B2[Point B]
    end
    subgraph "Layer 1 (Suburban)"
        A1[Point A] --- C1[Point C] --- B1[Point B]
    end
    subgraph "Layer 0 (City Streets)"
        A0[Point A] --- D0[Point D] --- C0[Point C] --- E0[Point E] --- B0[Point B]
    end
```
Trade-offs:
- Higher `m`: more connections per node, higher accuracy, but more RAM usage.
- Higher `ef_construct`: better index quality, but slower indexing time.
Choosing the Right Distance Metric
How do we define "closeness" between two vectors?
- Cosine Similarity (Most Common): Measures the angle between vectors. Perfect for text because it ignores the length of the document and focuses on the direction of meaning.
- Dot Product: Fast but sensitive to vector magnitude. Use it when your embedding model outputs normalized vectors; in that case it is equivalent to Cosine Similarity but cheaper to compute.
- Euclidean (L2) Distance: Measures the direct distance between points. Common in image search.
For RAG with OpenAI/BGE embeddings, always start with Cosine Similarity.
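A quick numeric sanity check of these claims, in plain Python with toy 2-dimensional vectors:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a, b = [3.0, 4.0], [1.0, 0.0]
a_unit = [x / norm(a) for x in a]  # normalize a to length 1

# Cosine ignores magnitude: scaling a vector 10x doesn't change the score
assert abs(cosine(a, b) - cosine([30.0, 40.0], b)) < 1e-9

# But Euclidean distance does change under scaling
assert euclidean(a, b) != euclidean([30.0, 40.0], b)

# For normalized vectors, dot product equals cosine similarity
assert abs(dot(a_unit, b) - cosine(a, b)) < 1e-9
```

This is why "dot product on normalized embeddings" and "cosine" give identical rankings, and why cosine is the safe default for text, where document length shouldn't dominate the score.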
Schema Design for Production
In Qdrant, a "collection" is like an SQL table. Here is how we structured ours:
```python
from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="enterprise_docs",
    vectors_config=models.VectorParams(
        size=1536,  # OpenAI embedding size
        distance=models.Distance.COSINE,
    ),
    # Optimized for fast searching with payload pre-filtering
    hnsw_config=models.HnswConfigDiff(
        payload_m=16,
        m=16,
    ),
)
```
Strategic Metadata (Payload)
Don't just store the text. Store these for efficient filtering:
- `user_id` / `group_id`: essential for security.
- `doc_type`: pdf, confluence, slack.
- `version`: to avoid retrieving outdated info.
- `language`: if you are building a multi-lingual system.
Hybrid Search: Dense + Sparse Vectors
Vector search is great for meaning but terrible for exact keywords (e.g., searching for a specific error code like ERR-404).
Qdrant's Hybrid Solution:
- Dense Vector (Embeddings): Understands "I'm having a connection issue".
- Sparse Vector (BM25/SPLADE): Understands the exact keyword "connection".
Qdrant fuses these results using Reciprocal Rank Fusion (RRF) to give the best of both worlds.
Resource Optimization: Quantization
Storing 1,536 dimensions as floating-point numbers (float32) for millions of documents uses a massive amount of RAM: at 1,536 × 4 bytes per vector, 10 million documents already need roughly 61 GB for the raw vectors alone.
Solution: Product Quantization (PQ) or Scalar Quantization.
Qdrant can compress vectors (e.g., from float32 to int8). This can reduce RAM usage by 4x with a minimal loss in accuracy (~1%).
```python
# Enabling Scalar Quantization in Qdrant
client.update_collection(
    collection_name="enterprise_docs",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # compress float32 -> int8
            quantile=0.99,                # clip outliers before quantizing
            always_ram=True,              # keep quantized vectors in RAM
        )
    ),
)
```
Conclusion & Next Post
Designing a vector database layer is about balancing Speed, Accuracy, and Cost. HNSW gives us the speed, Cosine Similarity gives us the accuracy, and Quantization keeps the costs down.
3 Key Takeaways:
- HNSW is the engine that makes large-scale RAG possible.
- Payload filtering is the secret to enterprise security.
- Hybrid Search is mandatory for systems involving specific technical terms.
👉 Next Post: [Post 06] LLM Inference - Deploying Models with vLLM & Kubernetes
Now that we have the "Memory" (Vector DB) and the "Body" (Backend), we need the "Voice" — the LLM. We will discuss how to deploy self-hosted models like Llama 3 using vLLM to achieve 10x throughput compared to standard deployment.
📬 Are you using Qdrant or another Vector DB? I'd love to hear your experience!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Qdrant, Vector Database, Machine Learning, Search Engineering
Series • Part 5 of 11