RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes
RAG in Production — The Journey of Building a Real-world AI System


Learn how to deploy LLMs at scale. Compare the OpenAI API with self-hosted vLLM, and explore optimization techniques like PagedAttention.

Truong Pham, Software Engineer
Published: April 8, 2024
Stack: LLM · vLLM · Kubernetes · Inference

"The LLM is the engine of your AI. Whether you rent a supercar (API) or build your own racer (Self-hosted), you need to know how to drive it at 200 mph." In this post, we'll discuss the most expensive and compute-intensive part of any RAG system: Inference.*


Table of Contents

  1. OpenAI API vs. Self-hosted Models
  2. Choosing the Right Model (SOTA vs. Efficiency)
  3. Introducing vLLM: The Game Changer
  4. The Magic of PagedAttention
  5. Inference Optimization: Quantization (AWQ/FP8)
  6. Deploying LLM on Kubernetes (GPU Nodes)
  7. The LLM Gateway Pattern
  8. Conclusion & Next Post

OpenAI API vs. Self-hosted Models

Every project starts with this dilemma. Here's our decision-making rubric:

Option A: Managed APIs (OpenAI, Claude, Gemini)

  • Pros: Zero maintenance, state-of-the-art performance, pay-as-you-go.
  • Cons: High long-term cost, data privacy concerns (for some industries), rate limits.
  • Best for: Prototyping, small-scale startups, non-sensitive data.

Option B: Self-hosting (Llama 3, Mistral, Qwen)

  • Pros: Full data control, fixed infrastructure cost, no rate limits, customizable.
  • Cons: High operational complexity, requires expensive GPUs (A100/H100), latency issues if not optimized.
  • Best for: Enterprise security, high-volume production, specialized domains.

Our Path: We used OpenAI for the pilot and migrated to Llama 3 hosted on vLLM for production to save costs and ensure data privacy.
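The cost trade-off behind that decision can be sketched with a back-of-the-envelope break-even calculation. All prices and throughput figures below are illustrative assumptions, not current list prices:

```python
# Break-even sketch: managed API vs. an always-on self-hosted GPU node.
# Every number here is an illustrative assumption, not a quoted price.

API_COST_PER_1K_TOKENS = 0.01            # assumed blended USD price
GPU_NODE_COST_PER_HOUR = 4.00            # assumed cost of one A100 node
SELF_HOSTED_TOKENS_PER_HOUR = 2_000_000  # assumed vLLM throughput

def monthly_cost_api(tokens_per_month: float) -> float:
    # Pure pay-as-you-go: cost scales linearly with usage.
    return tokens_per_month / 1000 * API_COST_PER_1K_TOKENS

def monthly_cost_self_hosted(tokens_per_month: float) -> float:
    # Fixed cost: at least one node runs 24/7 regardless of traffic.
    hours_needed = tokens_per_month / SELF_HOSTED_TOKENS_PER_HOUR
    hours_billed = max(hours_needed, 24 * 30)
    return hours_billed * GPU_NODE_COST_PER_HOUR

for tokens in (10e6, 100e6, 1e9):
    api, hosted = monthly_cost_api(tokens), monthly_cost_self_hosted(tokens)
    print(f"{tokens / 1e6:>6.0f}M tokens/mo: API ${api:>8.0f}  self-hosted ${hosted:>8.0f}")
```

Under these assumed numbers the API wins at low volume and self-hosting wins somewhere around a few hundred million tokens per month, which matches the pilot-then-migrate path we took.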


Choosing the Right Model

Size matters, but bigger isn't always better for RAG.

  • 7B - 8B Models (Llama 3 8B, Mistral): Extremely fast, cheap to host. Good for simple RAG tasks.
  • 70B+ Models (Llama 3 70B): Much more "intelligent", better at reasoning and following complex instructions.
  • MoE Models (Mixtral 8x7B): A great middle ground—high intelligence with relatively fast inference.

Introducing vLLM: The Game Changer

If you deploy an LLM with a naive Hugging Face Transformers backend, you will struggle to serve more than 1–2 users simultaneously. vLLM lets you serve 10–20x more requests on the same hardware.

Why is vLLM so fast?

  • Continuous Batching: Unlike traditional batching that waits for all requests to finish, vLLM injects new requests as soon as a slot becomes free.
  • PagedAttention: This is the "secret sauce".
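A toy simulation makes the continuous-batching gain concrete. The assumptions here are simplifications (4 decode slots, one token generated per step per active request, illustrative request lengths):

```python
# Toy comparison of static vs. continuous batching on one GPU.
# Assumptions: 4 slots, 1 token/step per active request (illustrative only).
import heapq

def static_batching(lengths, slots=4):
    # Classic batching: take `slots` requests, wait until the longest one
    # finishes, then admit the next batch. Short requests idle their slot.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots=4):
    # vLLM-style: admit the next waiting request as soon as any slot frees.
    free_at = [0] * slots  # step at which each slot next becomes free
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
    return max(free_at)

requests = [100, 10, 10, 10, 100, 10, 10, 10]  # output lengths in tokens
print("static:    ", static_batching(requests))      # 200 steps
print("continuous:", continuous_batching(requests))  # 110 steps
```

Even in this tiny example, letting short requests release their slots early nearly halves the total time; with real, highly variable request lengths the gap is what produces the 10–20x figure.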

The Magic of PagedAttention

In standard inference, the KV cache (the per-token key and value tensors kept for attention) consumes a massive amount of VRAM, and naive allocators leave it heavily fragmented.

PagedAttention works like virtual memory in an operating system: it breaks the KV cache into small blocks (pages) that can be stored in non-contiguous memory.

  • Result: Near-zero memory waste.
  • Impact: You can increase your batch size from 4 to 128 on the same GPU.
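The numbers behind that impact are easy to check. Here is a rough sketch using the published Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) and vLLM's default 16-token block size:

```python
# KV-cache memory arithmetic for Llama 3 8B (FP16).
# Model shape per the published config: 32 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2  # 2 bytes per FP16 value

def kv_bytes_per_token():
    # Key + Value tensors across every layer for one token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def naive_reservation(max_ctx=8192):
    # A naive server pre-allocates the full context window per sequence,
    # whether or not those slots are ever used.
    return kv_bytes_per_token() * max_ctx

def paged_usage(actual_tokens, block_size=16):
    # PagedAttention allocates fixed 16-token blocks on demand.
    blocks = -(-actual_tokens // block_size)  # ceiling division
    return kv_bytes_per_token() * blocks * block_size

print(f"KV per token:        {kv_bytes_per_token() / 1024:.0f} KiB")  # 128 KiB
print(f"Naive, per sequence: {naive_reservation() / 2**30:.1f} GiB")  # 1.0 GiB
print(f"Paged, 500 tokens:   {paged_usage(500) / 2**20:.0f} MiB")     # 64 MiB
```

A naive reservation burns roughly 1 GiB of VRAM per sequence regardless of its actual length; paging only what is used is exactly where the batch-size headroom comes from.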

Inference Optimization: Quantization

An 8B model usually requires ~16GB of VRAM (FP16). By using Quantization, we can fit it into 8GB or even 5GB with minimal accuracy loss.

  • AWQ (Activation-aware Weight Quantization): Great for 4-bit precision.
  • FP8: The new standard for H100 GPUs, offering speed gains with almost zero accuracy loss.
# Running vLLM with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-8b-instruct-awq \
    --quantization awq \
    --dtype float16 \
    --port 8000

Deploying LLM on Kubernetes

Serving LLMs in production requires a robust orchestrator.

GPU Scheduling

You need to tell Kubernetes which nodes have GPUs:

resources:
  limits:
    nvidia.com/gpu: 1 # Schedules the pod onto a node with 1 free GPU
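For context, here is a sketch of how that fragment fits into a full Deployment; the resource names and image tag are placeholders to adapt, not a production manifest:

```yaml
# Illustrative Deployment fragment (names and image tag are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama3 }
  template:
    metadata:
      labels: { app: vllm-llama3 }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "casperhansen/llama-3-8b-instruct-awq",
                 "--quantization", "awq", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```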

Health Checks

Starting an LLM takes time (loading 15 GB+ of weights into VRAM). Your readinessProbe must wait for the model to be fully loaded:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60

The LLM Gateway Pattern

In a real-world RAG system, never let your backend talk directly to a single LLM instance. Use a Gateway.

graph LR
    BE[Backend] --> GW[LLM Gateway]
    GW -- 80% --> vLLM[vLLM / Llama 3]
    GW -- 20% --> OpenAI[OpenAI API Fallback]
    GW -- Monitoring --> Metrics[Tokens/Cost/Latency]

Benefits of a Gateway:

  • Load Balancing: If one GPU node crashes, traffic shifts to another.
  • Semantic Routing: Send simple questions to Llama 3 8B (cheap) and complex ones to GPT-4 (expensive).
  • Audit Logs: Track exactly how much each department spends on AI.
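As a rough sketch, the routing and fallback decisions above can be reduced to a few lines; the model names and the length-based "complexity" heuristic are illustrative placeholders, not a production policy:

```python
# Minimal sketch of a gateway's routing and fallback logic. A real gateway
# would also weigh node load, token budgets, and per-tenant quotas.

PRIMARY = "vllm/llama-3-8b"   # self-hosted, cheap (illustrative name)
FALLBACK = "openai/gpt-4"     # managed API, expensive (illustrative name)

def pick_model(prompt: str, primary_healthy: bool = True) -> str:
    if not primary_healthy:
        return FALLBACK                # failover: the GPU pool is down
    if len(prompt.split()) > 200:      # crude proxy for a hard question
        return FALLBACK                # semantic routing to the big model
    return PRIMARY                     # cheap path for the common case

print(pick_model("What is our refund policy?"))        # vllm/llama-3-8b
print(pick_model("anything", primary_healthy=False))   # openai/gpt-4
```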

Conclusion & Next Post

LLM Inference is no longer just about "running a model". It's about optimizing memory with PagedAttention, choosing the right quantization, and orchestrating nodes with Kubernetes.

3 Key Takeaways:

  1. vLLM is mandatory for any serious self-hosted LLM deployment.
  2. Quantization is the most practical way to scale without breaking the bank on GPUs.
  3. LLM Gateway provides the reliability needed for enterprise apps.

👉 Next Post: [Post 07] DevOps & GitOps - Orchestrating the RAG Ecosystem

We have the code, the database, and the model. Now, how do we automate everything? In the next post, we will build a CI/CD pipeline with GitHub Actions and manage our Kubernetes cluster using ArgoCD.


📬 Are you self-hosting or using APIs? Share your reasoning in the comments below!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: LLM · vLLM · Kubernetes · DevOps · GPU · AI Infrastructure

Series • Part 6 of 11
