RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes
RAG in Production — The Journey of Building a Real-world AI System


Learn how to deploy LLMs at scale. Compare the OpenAI API with self-hosted vLLM, and explore optimization techniques like PagedAttention.

Truong Pham, Software Engineer
Published: April 8, 2024
Stack: LLM · vLLM · Kubernetes · Inference

"The LLM is the engine of your AI. Whether you rent a supercar (API) or build your own racer (Self-hosted), you need to know how to drive it at 200 mph." In this post, we'll discuss the most expensive and compute-intensive part of any RAG system: Inference.*


Table of Contents

  1. OpenAI API vs. Self-hosted Models
  2. Choosing the Right Model (SOTA vs. Efficiency)
  3. Introducing vLLM: The Game Changer
  4. The Magic of PagedAttention
  5. Inference Optimization: Quantization (AWQ/FP8)
  6. Deploying LLM on Kubernetes (GPU Nodes)
  7. The LLM Gateway Pattern
  8. Conclusion & Next Post

OpenAI API vs. Self-hosted Models

Every project starts with this dilemma. Here's our decision-making rubric:

Option A: Managed APIs (OpenAI, Claude, Gemini)

  • Pros: Zero maintenance, state-of-the-art performance, pay-as-you-go.
  • Cons: High long-term cost, data privacy concerns (for some industries), rate limits.
  • Best for: Prototyping, small-scale startups, non-sensitive data.

Option B: Self-hosting (Llama 3, Mistral, Qwen)

  • Pros: Full data control, fixed infrastructure cost, no rate limits, customizable.
  • Cons: High operational complexity, requires expensive GPUs (A100/H100), latency issues if not optimized.
  • Best for: Enterprise security, high-volume production, specialized domains.

Our Path: We used OpenAI for the pilot and migrated to Llama 3 hosted on vLLM for production to save costs and ensure data privacy.
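The cost trade-off behind that decision can be sketched with a back-of-the-envelope break-even calculation. All prices and throughput figures below are illustrative assumptions, not current list prices:

```python
# Break-even sketch: managed API vs. an always-on self-hosted GPU node.
# Every number here is an illustrative assumption, not a quoted price.

API_COST_PER_1K_TOKENS = 0.01            # assumed blended USD price
GPU_NODE_COST_PER_HOUR = 4.00            # assumed cost of one A100 node
SELF_HOSTED_TOKENS_PER_HOUR = 2_000_000  # assumed vLLM throughput

def monthly_cost_api(tokens_per_month: float) -> float:
    # Pure pay-as-you-go: cost scales linearly with usage.
    return tokens_per_month / 1000 * API_COST_PER_1K_TOKENS

def monthly_cost_self_hosted(tokens_per_month: float) -> float:
    # Fixed cost: at least one node runs 24/7 regardless of traffic.
    hours_needed = tokens_per_month / SELF_HOSTED_TOKENS_PER_HOUR
    hours_billed = max(hours_needed, 24 * 30)
    return hours_billed * GPU_NODE_COST_PER_HOUR

for tokens in (10e6, 100e6, 1e9):
    api, hosted = monthly_cost_api(tokens), monthly_cost_self_hosted(tokens)
    print(f"{tokens / 1e6:>6.0f}M tokens/mo: API ${api:>8.0f}  self-hosted ${hosted:>8.0f}")
```

Under these assumed numbers the API wins at low volume and self-hosting wins somewhere around a few hundred million tokens per month, which matches the pilot-then-migrate path we took.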


Choosing the Right Model

Size matters, but bigger isn't always better for RAG.

  • 7B - 8B Models (Llama 3 8B, Mistral): Extremely fast, cheap to host. Good for simple RAG tasks.
  • 70B+ Models (Llama 3 70B): Much more "intelligent", better at reasoning and following complex instructions.
  • MoE Models (Mixtral 8x7B): A great middle ground—high intelligence with relatively fast inference.

Introducing vLLM: The Game Changer

If you deploy an LLM with a naive Hugging Face Transformers backend, you will struggle to serve more than 1–2 users simultaneously. vLLM lets you serve 10–20x more requests on the same hardware.

Why is vLLM so fast?

  • Continuous Batching: Unlike traditional batching that waits for all requests to finish, vLLM injects new requests as soon as a slot becomes free.
  • PagedAttention: This is the "secret sauce".
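A toy simulation makes the continuous-batching gain concrete. The assumptions here are simplifications (4 decode slots, one token generated per step per active request, illustrative request lengths):

```python
# Toy comparison of static vs. continuous batching on one GPU.
# Assumptions: 4 slots, 1 token/step per active request (illustrative only).
import heapq

def static_batching(lengths, slots=4):
    # Classic batching: take `slots` requests, wait until the longest one
    # finishes, then admit the next batch. Short requests idle their slot.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching(lengths, slots=4):
    # vLLM-style: admit the next waiting request as soon as any slot frees.
    free_at = [0] * slots  # step at which each slot next becomes free
    for n in lengths:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
    return max(free_at)

requests = [100, 10, 10, 10, 100, 10, 10, 10]  # output lengths in tokens
print("static:    ", static_batching(requests))      # 200 steps
print("continuous:", continuous_batching(requests))  # 110 steps
```

Even in this tiny example, letting short requests release their slots early nearly halves the total time; with real, highly variable request lengths the gap is what produces the 10–20x figure.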

The Magic of PagedAttention

In standard inference, the KV cache (the per-token key and value tensors kept for attention) consumes a massive amount of VRAM, and naive allocators leave it heavily fragmented.

PagedAttention works like virtual memory in an operating system: it breaks the KV cache into small blocks (pages) that can be stored in non-contiguous memory.

  • Result: Near-zero memory waste.
  • Impact: You can increase your batch size from 4 to 128 on the same GPU.
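The numbers behind that impact are easy to check. Here is a rough sketch using the published Llama 3 8B configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16) and vLLM's default 16-token block size:

```python
# KV-cache memory arithmetic for Llama 3 8B (FP16).
# Model shape per the published config: 32 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM, BYTES = 32, 8, 128, 2  # 2 bytes per FP16 value

def kv_bytes_per_token():
    # Key + Value tensors across every layer for one token.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES

def naive_reservation(max_ctx=8192):
    # A naive server pre-allocates the full context window per sequence,
    # whether or not those slots are ever used.
    return kv_bytes_per_token() * max_ctx

def paged_usage(actual_tokens, block_size=16):
    # PagedAttention allocates fixed 16-token blocks on demand.
    blocks = -(-actual_tokens // block_size)  # ceiling division
    return kv_bytes_per_token() * blocks * block_size

print(f"KV per token:        {kv_bytes_per_token() / 1024:.0f} KiB")  # 128 KiB
print(f"Naive, per sequence: {naive_reservation() / 2**30:.1f} GiB")  # 1.0 GiB
print(f"Paged, 500 tokens:   {paged_usage(500) / 2**20:.0f} MiB")     # 64 MiB
```

A naive reservation burns roughly 1 GiB of VRAM per sequence regardless of its actual length; paging only what is used is exactly where the batch-size headroom comes from.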

Inference Optimization: Quantization

An 8B model usually requires ~16GB of VRAM (FP16). By using Quantization, we can fit it into 8GB or even 5GB with minimal accuracy loss.

  • AWQ (Activation-aware Weight Quantization): Great for 4-bit precision.
  • FP8: The new standard for H100 GPUs, offering speed gains with almost zero accuracy loss.
# Running vLLM with AWQ quantization
python -m vllm.entrypoints.openai.api_server \
    --model casperhansen/llama-3-8b-instruct-awq \
    --quantization awq \
    --dtype float16 \
    --port 8000

Deploying LLM on Kubernetes

Serving LLMs in production requires a robust orchestrator.

GPU Scheduling

You need to tell Kubernetes which nodes have GPUs:

resources:
  limits:
    nvidia.com/gpu: 1 # Schedules the pod onto a node with 1 free GPU
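For context, here is a sketch of how that fragment fits into a full Deployment; the resource names and image tag are placeholders to adapt, not a production manifest:

```yaml
# Illustrative Deployment fragment (names and image tag are placeholders).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 1
  selector:
    matchLabels: { app: vllm-llama3 }
  template:
    metadata:
      labels: { app: vllm-llama3 }
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "casperhansen/llama-3-8b-instruct-awq",
                 "--quantization", "awq", "--port", "8000"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
```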

Health Checks

Starting an LLM takes time (loading 15 GB+ of weights into VRAM). Your readinessProbe must wait for the model to be fully loaded:

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60

The LLM Gateway Pattern

In a real-world RAG system, never let your backend talk directly to a single LLM instance. Use a Gateway.

graph LR
    BE[Backend] --> GW[LLM Gateway]
    GW -- 80% --> vLLM[vLLM / Llama 3]
    GW -- 20% --> OpenAI[OpenAI API Fallback]
    GW -- Monitoring --> Metrics[Tokens/Cost/Latency]

Benefits of a Gateway:

  • Load Balancing: If one GPU node crashes, traffic shifts to another.
  • Semantic Routing: Send simple questions to Llama 3 8B (cheap) and complex ones to GPT-4 (expensive).
  • Audit Logs: Track exactly how much each department spends on AI.
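As a rough sketch, the routing and fallback decisions above can be reduced to a few lines; the model names and the length-based "complexity" heuristic are illustrative placeholders, not a production policy:

```python
# Minimal sketch of a gateway's routing and fallback logic. A real gateway
# would also weigh node load, token budgets, and per-tenant quotas.

PRIMARY = "vllm/llama-3-8b"   # self-hosted, cheap (illustrative name)
FALLBACK = "openai/gpt-4"     # managed API, expensive (illustrative name)

def pick_model(prompt: str, primary_healthy: bool = True) -> str:
    if not primary_healthy:
        return FALLBACK                # failover: the GPU pool is down
    if len(prompt.split()) > 200:      # crude proxy for a hard question
        return FALLBACK                # semantic routing to the big model
    return PRIMARY                     # cheap path for the common case

print(pick_model("What is our refund policy?"))        # vllm/llama-3-8b
print(pick_model("anything", primary_healthy=False))   # openai/gpt-4
```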

Conclusion & Next Post

LLM Inference is no longer just about "running a model". It's about optimizing memory with PagedAttention, choosing the right quantization, and orchestrating nodes with Kubernetes.

3 Key Takeaways:

  1. vLLM is mandatory for any serious self-hosted LLM deployment.
  2. Quantization is the most practical way to scale without breaking the bank on GPUs.
  3. LLM Gateway provides the reliability needed for enterprise apps.

👉 Next Post: [Post 07] DevOps & GitOps - Orchestrating the RAG Ecosystem

We have the code, the database, and the model. Now, how do we automate everything? In the next post, we will build a CI/CD pipeline with GitHub Actions and manage our Kubernetes cluster using ArgoCD.


📬 Are you self-hosting or using APIs? Share your reasoning in the comments below!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: LLM · vLLM · Kubernetes · DevOps · GPU · AI Infrastructure

Series • Part 6 of 11
