RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for Scale
RAG in Production — The Journey of Building a Real-world AI System


Deep dive into Vector Databases, specifically Qdrant. Learn about HNSW indexing, distance metrics, and how to design a schema for millions of vectors.

Truong Pham, Software Engineer
Published: April 5, 2024
Stack: Qdrant · Vector Database · HNSW · Search

"A vector database is like a library where books aren't sorted by title, but by the smell of their stories." In the world of RAG, the Vector Database (VDB) is the memory of your AI. If its memory is disorganized, the AI will be confused.


Table of Contents

  1. Why Specifically a Vector Database?
  2. Why Did We Choose Qdrant?
  3. The Heart of Indexing: HNSW Algorithm
  4. Choosing the Right Distance Metric
  5. Schema Design for Production
  6. Hybrid Search: Dense + Sparse Vectors
  7. Resource Optimization: Quantization
  8. Conclusion & Next Post

Why Specifically a Vector Database?

Traditional databases (SQL/NoSQL) are built for exact matching. You search for name = 'Ivan' and it finds exactly that.

In AI, users ask semantic questions: "How can I get my money back?". A traditional DB would look for the words "money" and "back", but it doesn't understand that the question is really about the "Refund Policy".

Vector Databases search by meaning: they represent text as numerical vectors in a high-dimensional space, where semantically similar texts end up close together.
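In miniature, with toy 3-dimensional vectors standing in for real embeddings (a sketch; real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy embeddings: imagine an embedding model produced these.
query   = [0.9, 0.1, 0.0]   # "How can I get my money back?"
refund  = [0.8, 0.2, 0.1]   # "Refund Policy"
weather = [0.0, 0.1, 0.9]   # "Weather forecast"

print(cosine_similarity(query, refund))   # high: semantically close
print(cosine_similarity(query, weather))  # low: unrelated
```

The query never mentions the word "refund", yet its vector sits much closer to the refund document than to the unrelated one.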


Why Did We Choose Qdrant?

Among many players like Pinecone, Weaviate, Milvus, and pgvector, we chose Qdrant for 3 main reasons:

  1. Performance & Reliability: Built in Rust. It's extremely memory-efficient and fast.
  2. Advanced Filtering: Unlike some VDBs that filter after searching (slow), Qdrant filters during search (fast).
  3. Rich Feature Set: Built-in Hybrid Search, support for Sparse Vectors, and easy-to-use API.

The Heart of Indexing: HNSW Algorithm

Searching through 10 million vectors linearly is impossible in real-time. We need an index. Qdrant uses HNSW (Hierarchical Navigable Small World).

How HNSW Works:

Imagine HNSW as a "highway and local roads" system:

  • Layer 0 (Bottom): Contains every point, densely linked to its neighbors. Searching here alone is slow but precise.
  • Layers 1, 2, ... (Top): Higher layers contain fewer points. They are "highways" that let you jump large distances across the vector space.
graph TD
    subgraph "Layer 2 (Express)"
        A2[Point A] --- B2[Point B]
    end
    subgraph "Layer 1 (Suburban)"
        A1[Point A] --- C1[Point C] --- B1[Point B]
    end
    subgraph "Layer 0 (City Streets)"
        A0[Point A] --- D0[Point D] --- C0[Point C] --- E0[Point E] --- B0[Point B]
    end

Trade-off (the two main HNSW build parameters):

  • Higher m: more links per node, higher accuracy, but more RAM.
  • Higher ef_construct: larger candidate list while building, better index quality, but slower indexing time.

Choosing the Right Distance Metric

How do we define "closeness" between two vectors?

  1. Cosine Similarity (Most Common): Measures the angle between vectors. Perfect for text because it ignores the length of the document and focuses on the direction of meaning.
  2. Dot Product: Fast but sensitive to vector magnitude. Use it when your embedding model outputs normalized (unit-length) vectors, in which case it is equivalent to Cosine Similarity.
  3. Euclidean (L2) Distance: Measures the direct distance between points. Common in image search.

For RAG with OpenAI/BGE embeddings, always start with Cosine Similarity.
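A tiny numeric illustration of why the metrics disagree: two vectors pointing in the same direction but with different lengths.

```python
import math

def dot(a, b): return sum(x * y for x, y in zip(a, b))
def cosine(a, b): return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))
def euclidean(a, b): return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the magnitude

print(cosine(a, b))     # 1.0 — identical direction of meaning
print(dot(a, b))        # 10.0 — inflated by magnitude
print(euclidean(a, b))  # ~2.24 — "far apart" despite the same direction
```

Cosine is the only one of the three that ignores length entirely, which is why it is the default for text.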


Schema Design for Production

In Qdrant, a "collection" is like an SQL table. Here is how we structured ours:

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="enterprise_docs",
    vectors_config=models.VectorParams(
        size=1536, # OpenAI embedding size
        distance=models.Distance.COSINE
    ),
    # Optimized for fast searching with Payload pre-filtering
    hnsw_config=models.HnswConfigDiff(
        payload_m=16, 
        m=16
    )
)

Strategic Metadata (Payload)

Don't just store the text. Store these for efficient filtering:

  • user_id / group_id: Essential for security.
  • doc_type: pdf, confluence, slack.
  • version: To avoid retrieving outdated info.
  • language: If you are building a multi-lingual system.

Hybrid Search: Dense + Sparse Vectors

Vector search is great for meaning but terrible for exact keywords (e.g., searching for a specific error code like ERR-404).

Qdrant's Hybrid Solution:

  1. Dense Vector (Embeddings): Understands "I'm having a connection issue".
  2. Sparse Vector (BM25/SPLADE): Understands the exact keyword "connection".

Qdrant fuses these results using Reciprocal Rank Fusion (RRF) to give the best of both worlds.
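RRF itself is simple enough to sketch in plain Python (assuming the conventional constant k = 60; the document IDs are invented):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked result lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits  = ["doc_refund", "doc_billing", "doc_login"]   # semantic ranking
sparse_hits = ["doc_err404", "doc_refund", "doc_network"]  # keyword ranking

print(rrf([dense_hits, sparse_hits]))
# doc_refund comes first: it ranks near the top of *both* lists
```

Because RRF only looks at ranks, not raw scores, it fuses dense and sparse results without having to normalize two incomparable scoring scales.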


Resource Optimization: Quantization

Storing 1,536 dimensions as floating-point numbers (float32) for millions of documents uses a massive amount of RAM.

Solution: Product Quantization (PQ) or Scalar Quantization. Qdrant can compress vectors (e.g., from float32 to int8). This can reduce RAM usage by 4x with a minimal loss in accuracy (~1%).

# Enabling Scalar Quantization in Qdrant
client.update_collection(
    collection_name="enterprise_docs",
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # compress float32 -> int8
            quantile=0.99,                # clip outliers beyond the 99th percentile
            always_ram=True               # keep quantized vectors in RAM for speed
        )
    )
)
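A back-of-envelope check of the 4x figure for the raw vector data (ignoring index overhead):

```python
dims = 1536
num_vectors = 1_000_000

float32_bytes = dims * 4 * num_vectors  # 4 bytes per dimension
int8_bytes    = dims * 1 * num_vectors  # 1 byte per dimension after scalar quantization

print(float32_bytes / 2**30)  # ~5.7 GiB of raw vectors at float32
print(int8_bytes / 2**30)     # ~1.4 GiB at int8
```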

Conclusion & Next Post

Designing a vector database layer is about balancing Speed, Accuracy, and Cost. HNSW gives us the speed, Cosine Similarity gives us the accuracy, and Quantization keeps the costs down.

3 Key Takeaways:

  1. HNSW is the engine that makes large-scale RAG possible.
  2. Payload filtering is the secret to enterprise security.
  3. Hybrid Search is mandatory for systems involving specific technical terms.

👉 Next Post: [Post 06] LLM Inference - Deploying Models with vLLM & Kubernetes

Now that we have the "Memory" (Vector DB) and the "Body" (Backend), we need the "Voice" — the LLM. We will discuss how to deploy self-hosted models like Llama 3 using vLLM to achieve 10x throughput compared to standard deployment.


📬 Are you using Qdrant or another Vector DB? I'd love to hear your experience!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Qdrant · Vector Database · Machine Learning · Search · Engineering

Series • Part 5 of 11

RAG in Production — The Journey of Building a Real-world AI System
