LogoTRUONG PHAM
Home
Projects
Portfolio
Blogs
YouTube
Contact

Newsletter

Stay updated with technical artifacts and engineering insights.

LogoTRUONG PHAM

Building scalable software and sharing insights on technology & life.

Sitemap

  • Home
  • Projects
  • Portfolio
  • Blogs
  • YouTube
  • Contact

Connect

  • GitHub
  • LinkedIn
  • Email
  • YouTube

© 2024 TRUONG PHAM. © All rights reserved.

Privacy PolicyTerms of Service
Back/RAG in Production — The Journey of Building a Real-world AI System
RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes

RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes

Learn how to deploy LLM models at scale. Compare OpenAI API with self-hosted vLLM, and explore optimization techniques like PagedAttention.

LLMvLLMKubernetesInference

April 8, 2024·5 min read

"The LLM is the engine of your AI. Whether you rent a supercar (API) or build your own racer (Self-hosted), you need to know how to drive it at 200 mph." In this post, we'll discuss the most expensive and compute-intensive part of any RAG system: Inference.*


Table of Contents

  1. OpenAI API vs. Self-hosted Models
  2. Choosing the Right Model (SOTA vs. Efficiency)
  3. Introducing vLLM: The Game Changer
  4. The Magic of PagedAttention
  5. Inference Optimization: Quantization (AWQ/FP8)
  6. Deploying LLM on Kubernetes (GPU Nodes)
  7. The LLM Gateway Pattern
  8. Conclusion & Next Post

OpenAI API vs. Self-hosted Models

Every project starts with this dilemma. Here's our rubric for decision making:

Option A: Managed APIs (OpenAI, Claude, Gemini)

  • Pros: Zero maintenance, state-of-the-art performance, pay-as-you-go.
  • Cons: High long-term cost, data privacy concerns (for some industries), rate limits.
  • Best for: Prototyping, small-scale startups, non-sensitive data.

Option B: Self-hosting (Llama 3, Mistral, Qwen)

  • Pros: Full data control, fixed infrastructure cost, no rate limits, customizable.
  • Cons: High operational complexity, requires expensive GPUs (A100/H100), latency issues if not optimized.
  • Best for: Enterprise security, high-volume production, specialized domains.

Our Path: We used OpenAI for the pilot and migrated to Llama 3 hosted on vLLM for production to save costs and ensure data privacy.


Choosing the Right Model

Size matters, but bigger isn't always better for RAG.

  • 7B - 8B Models (Llama 3 8B, Mistral): Extremely fast, cheap to host. Good for simple RAG tasks.
  • 70B+ Models (Llama 3 70B): Much more "intelligent", better at reasoning and following complex instructions.
  • MoE Models (Mixtral 8x7B): A great middle ground—high intelligence with relatively fast inference.

Introducing vLLM: The Game Changer

If you deploy an LLM with a naive HuggingFace backend, you will struggle to handle more than 1–2 users simultaneously. vLLM allows you to serve 10x-20x more requests on the same hardware.

Why is vLLM so fast?

  • Continuous Batching: Unlike traditional batching that waits for all requests to finish, vLLM injects new requests as soon as a slot becomes free.
  • PagedAttention: This is the "secret sauce".

The Magic of PagedAttention

In standard inference, the KV Cache (Key-Value Cache) consumes a massive amount of VRAM and is very fragmented.

PagedAttention works like virtual memory in OS. It breaks the KV Cache into small blocks (pages) that can be stored in non-contiguous memory.

  • Result: Near-zero memory waste.
  • Impact: You can increase your batch size from 4 to 128 on the same GPU.

Inference Optimization: Quantization

An 8B model usually requires ~16GB of VRAM (FP16). By using Quantization, we can fit it into 8GB or even 5GB with minimal accuracy loss.

  • AWQ (Activation-aware Weight Quantization): Great for 4-bit precision.
  • FP8: The new standard for H100 GPUs, offering speed gains with almost zero accuracy loss.
# Running vLLM with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-8b-instruct-awq \
    --quantization awq \
    --dtype float16 \
    --port 8000

Deploying LLM on Kubernetes

Serving LLMs in production requires a robust orchestrator.

GPU Scheduling

You need to tell Kubernetes which nodes have GPUs:

resources:
  limits:
    nvidia.com/gpu: 1 # Requests 1 GPU

Health Checks

Starting an LLM takes time (reading 15GB+ into VRAM). Your readinessProbe must wait for the model to be fully loaded:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60

The LLM Gateway Pattern

In a real-world RAG system, never let your backend talk directly to a single LLM instance. Use a Gateway.

graph LR
    BE[Backend] --> GW[LLM Gateway]
    GW -- 80% --> vLLM[vLLM / Llama 3]
    GW -- 20% --> OpenAI[OpenAI API Fallback]
    GW -- Monitoring --> Metrics[Tokens/Cost/Latency]

Benefits of a Gateway:

  • Load Balancing: If one GPU node crashes, traffic shifts to another.
  • Semantic Routing: Send simple questions to Llama-8B (cheap) and complex ones to GPT-4 (expensive).
  • Audit Logs: Track exactly how much each department spends on AI.

Conclusion & Next Post

LLM Inference is no longer just about "running a model". It's about optimizing memory with PagedAttention, choosing the right quantization, and orchestrating nodes with Kubernetes.

3 Key Takeaways:

  1. vLLM is mandatory for any serious self-hosted LLM deployment.
  2. Quantization is the only way to scale without breaking the bank on GPUs.
  3. LLM Gateway provides the reliability needed for enterprise apps.

👉 Next Post: [Post 07] DevOps & GitOps - Orchestrating the RAG Ecosystem

We have the code, the database, and the model. Now, how do we automate everything? In the next post, we will build a CI/CD pipeline with GitHub Actions and manage our Kubernetes cluster using ArgoCD.


📬 Are you self-hosting or using APIs? Share your reasoning in the comments below!


Author: [Your Name] Series: RAG in Production — The Journey of Building a Real-world AI System Tags: LLM vLLM Kubernetes DevOps GPU AI Infrastructure

Previous: RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for ScaleAll posts in this seriesNext: RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem