RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes
Learn how to deploy LLM models at scale. Compare OpenAI API with self-hosted vLLM, and explore optimization techniques like PagedAttention.
*"The LLM is the engine of your AI. Whether you rent a supercar (API) or build your own racer (self-hosted), you need to know how to drive it at 200 mph."* In this post, we'll cover the most expensive and compute-intensive part of any RAG system: inference.
Table of Contents
- OpenAI API vs. Self-hosted Models
- Choosing the Right Model (SOTA vs. Efficiency)
- Introducing vLLM: The Game Changer
- The Magic of PagedAttention
- Inference Optimization: Quantization (AWQ/FP8)
- Deploying LLM on Kubernetes (GPU Nodes)
- The LLM Gateway Pattern
- Conclusion & Next Post
OpenAI API vs. Self-hosted Models
Every project starts with this dilemma. Here's our rubric for decision making:
Option A: Managed APIs (OpenAI, Claude, Gemini)
- Pros: Zero maintenance, state-of-the-art performance, pay-as-you-go.
- Cons: High long-term cost, data privacy concerns (for some industries), rate limits.
- Best for: Prototyping, small-scale startups, non-sensitive data.
Option B: Self-hosting (Llama 3, Mistral, Qwen)
- Pros: Full data control, fixed infrastructure cost, no rate limits, customizable.
- Cons: High operational complexity, requires expensive GPUs (A100/H100), latency issues if not optimized.
- Best for: Enterprise security, high-volume production, specialized domains.
Our Path: We used OpenAI for the pilot and migrated to Llama 3 hosted on vLLM for production to save costs and ensure data privacy.
Choosing the Right Model
Size matters, but bigger isn't always better for RAG.
- 7B - 8B Models (Llama 3 8B, Mistral): Extremely fast, cheap to host. Good for simple RAG tasks.
- 70B+ Models (Llama 3 70B): Much more "intelligent", better at reasoning and following complex instructions.
- MoE Models (Mixtral 8x7B): A great middle ground—high intelligence with relatively fast inference.
Introducing vLLM: The Game Changer
If you deploy an LLM with a naive Hugging Face Transformers backend, you will struggle to serve more than 1–2 users simultaneously. vLLM lets you serve 10x–20x more requests on the same hardware.
Why is vLLM so fast?
- Continuous Batching: Unlike traditional batching that waits for all requests to finish, vLLM injects new requests as soon as a slot becomes free.
- PagedAttention: This is the "secret sauce".
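The continuous-batching idea can be sketched with a toy scheduler (a simplified illustration, not vLLM's actual scheduler): static batching runs each batch until its longest request finishes, while continuous batching refills a freed slot on the very next step, so short requests never wait behind long ones.

```python
# Toy comparison of static vs. continuous batching (illustrative only).

def static_batching(request_lengths, batch_size):
    """Total steps when each batch runs until its longest request ends."""
    steps = 0
    for i in range(0, len(request_lengths), batch_size):
        steps += max(request_lengths[i:i + batch_size])
    return steps

def continuous_batching(request_lengths, batch_size):
    """Total steps when a finished request is replaced on the next step."""
    pending = list(request_lengths)
    active = [pending.pop(0) for _ in range(min(batch_size, len(pending)))]
    steps = 0
    while active:
        steps += 1
        active = [r - 1 for r in active if r > 1]   # finished requests leave
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))           # new requests join mid-flight
    return steps

# Mixed workload: a few long generations among many short ones.
reqs = [100, 5, 5, 5, 100, 5, 5, 5]
print(static_batching(reqs, batch_size=4))       # 200 steps
print(continuous_batching(reqs, batch_size=4))   # 105 steps
```

With the same hardware budget (4 slots), the short requests slip through the gaps instead of idling behind the 100-token generations; that is where the 10x–20x throughput gain comes from.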
The Magic of PagedAttention
In standard inference, the KV cache (key–value cache) consumes a massive amount of VRAM, and naive allocators leave it heavily fragmented.
PagedAttention works like virtual memory in an operating system. It breaks the KV cache into small fixed-size blocks (pages) that can be stored in non-contiguous memory.
- Result: Near-zero memory waste.
- Impact: You can increase your batch size from 4 to 128 on the same GPU.
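The core bookkeeping can be sketched in a few lines (a simplified model, not vLLM's internals): each sequence keeps a "block table" mapping its logical token positions to physical blocks that need not be contiguous, and a new physical block is allocated only when a logical block fills up.

```python
# Toy PagedAttention-style block allocator (illustrative only).

BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default)

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # physical block ids
        self.tables = {}                      # seq_id -> list of physical blocks

    def append_token(self, seq_id, position):
        """Allocate a new physical block only when a logical block fills up."""
        table = self.tables.setdefault(seq_id, [])
        if position % BLOCK_SIZE == 0:        # crossed a block boundary
            table.append(self.free.pop())     # any free block will do
        return table[position // BLOCK_SIZE]

alloc = BlockAllocator(num_blocks=64)
for pos in range(40):                         # 40-token sequence
    alloc.append_token("seq-A", pos)
for pos in range(20):                         # 20-token sequence
    alloc.append_token("seq-B", pos)

print(alloc.tables["seq-A"])  # 3 physical blocks (40 tokens / 16 per block)
print(alloc.tables["seq-B"])  # 2 physical blocks
```

Because blocks are handed out on demand, memory waste is bounded by at most one partially filled block per sequence, which is why the batch size can grow so dramatically on the same GPU.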
Inference Optimization: Quantization
An 8B model usually requires ~16GB of VRAM (FP16). By using Quantization, we can fit it into 8GB or even 5GB with minimal accuracy loss.
- AWQ (Activation-aware Weight Quantization): Great for 4-bit precision.
- FP8: The new standard for H100 GPUs, offering speed gains with almost zero accuracy loss.
```bash
# Running vLLM with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-8b-instruct-awq \
  --quantization awq \
  --dtype float16 \
  --port 8000
```
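The VRAM numbers above follow from simple arithmetic. This back-of-envelope estimator covers the model *weights* only (assumption: it ignores the KV cache, activations, and runtime overhead, which add several GB on top in practice):

```python
# Rough VRAM estimate for model weights only (GB = 10^9 bytes).

def weight_vram_gb(num_params_billion, bits_per_weight):
    bytes_total = num_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"Llama 3 8B @ {name}: ~{weight_vram_gb(8, bits):.1f} GB")
```

Real AWQ checkpoints land a bit above the 4-bit figure because some layers stay in higher precision, which is how an 8B model ends up around 5 GB rather than exactly 4.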
Deploying LLM on Kubernetes
Serving LLMs in production requires a robust orchestrator.
GPU Scheduling
You need to tell Kubernetes which nodes have GPUs:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1  # requests 1 GPU
```
Health Checks
Starting an LLM takes time (loading 15 GB+ of weights into VRAM). Your readinessProbe must wait until the model is fully loaded:
```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
```
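Putting the pieces together, a minimal Deployment might look like the sketch below (assumptions: the image tag, model name, and probe timings are illustrative and should be tuned for your model size and cluster):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3
  template:
    metadata:
      labels:
        app: vllm-llama3
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # official vLLM server image
          args:
            - --model=casperhansen/llama-3-8b-instruct-awq
            - --quantization=awq
            - --port=8000
          resources:
            limits:
              nvidia.com/gpu: 1            # schedules onto a GPU node
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60        # give the weights time to load
            periodSeconds: 10
```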
The LLM Gateway Pattern
In a real-world RAG system, never let your backend talk directly to a single LLM instance. Use a Gateway.
```mermaid
graph LR
    BE[Backend] --> GW[LLM Gateway]
    GW -- 80% --> vLLM[vLLM / Llama 3]
    GW -- 20% --> OpenAI[OpenAI API Fallback]
    GW -- Monitoring --> Metrics[Tokens/Cost/Latency]
```
Benefits of a Gateway:
- Load Balancing: If one GPU node crashes, traffic shifts to another.
- Semantic Routing: Send simple questions to Llama-8B (cheap) and complex ones to GPT-4 (expensive).
- Audit Logs: Track exactly how much each department spends on AI.
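The routing and failover logic can be sketched in a few lines (assumptions: the endpoint URLs and the length-based "complexity" heuristic are illustrative; production gateways typically use classifiers or model-based routers for semantic routing):

```python
# Minimal gateway routing sketch (illustrative endpoints and heuristic).

PRIMARY = "http://vllm.internal:8000/v1"   # self-hosted vLLM pool (cheap)
FALLBACK = "https://api.openai.com/v1"     # managed API fallback (expensive)

def route(question: str, primary_healthy: bool = True) -> str:
    """Pick a backend: cheap local model first, fall back on failure or complexity."""
    if not primary_healthy:
        return FALLBACK                    # failover when GPU nodes are down
    if len(question.split()) > 50:         # crude proxy for "complex" queries
        return FALLBACK                    # send hard questions to the big model
    return PRIMARY

print(route("What is RAG?"))               # routes to the local vLLM pool
```

A real gateway would also record tokens, cost, and latency per request at this choke point, which is exactly what makes the audit logs above possible.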
Conclusion & Next Post
LLM Inference is no longer just about "running a model". It's about optimizing memory with PagedAttention, choosing the right quantization, and orchestrating nodes with Kubernetes.
3 Key Takeaways:
- vLLM is mandatory for any serious self-hosted LLM deployment.
- Quantization is the most practical way to scale without breaking the bank on GPUs.
- LLM Gateway provides the reliability needed for enterprise apps.
👉 Next Post: [Post 07] DevOps & GitOps - Orchestrating the RAG Ecosystem
We have the code, the database, and the model. Now, how do we automate everything? In the next post, we will build a CI/CD pipeline with GitHub Actions and manage our Kubernetes cluster using ArgoCD.
📬 Are you self-hosting or using APIs? Share your reasoning in the comments below!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: LLM vLLM Kubernetes DevOps GPU AI Infrastructure
Series • Part 6 of 11