RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem
RAG in Production — The Journey of Building a Real-world AI System

Automate the deployment of your RAG system using Docker, Kubernetes, and Helm. Implement a robust CI/CD pipeline with GitHub Actions and ArgoCD.

Truong Pham · Software Engineer
Published: April 10, 2024
Stack: Docker · Kubernetes · DevOps · CI/CD

"Code without automation is like a car without a steering wheel—it might be fast, but it's bound to crash eventually." In this post, we'll turn our RAG system into a collection of orchestrated services that can be deployed, scaled, and updated with a single git commit.


Table of Contents

  1. The Containerization Strategy
  2. Multi-stage Dockerfiles for Python
  3. Structuring the Kubernetes Cluster
  4. Mastering Helm Charts
  5. CI/CD Pipeline with GitHub Actions
  6. GitOps Workflow with ArgoCD
  7. Resource Management & Cost Control
  8. Conclusion & Next Post

The Containerization Strategy

Our RAG system consists of multiple moving parts. We need to containerize them separately to allow independent scaling:

  1. API Service: FastAPI backend.
  2. Worker Service: Celery workers for heavy ingestion.
  3. Inference Service: vLLM for the LLM model.
  4. Monitoring Stack: Prometheus, Grafana, and Loki.
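For local development, the same separation can be sketched with Docker Compose before moving to Kubernetes. This is an illustrative fragment, not the production config — the service names, build paths, and the choice of Redis as the Celery broker are assumptions:

```yaml
# docker-compose.yml — local dev sketch of the separated services
services:
  api:
    build: ./app          # FastAPI backend
    ports:
      - "8000:8000"
    depends_on:
      - redis
  worker:
    build: ./app          # same image, different entrypoint
    command: celery -A app.worker worker --loglevel=info
    depends_on:
      - redis
  redis:                  # broker/result backend for Celery
    image: redis:7-alpine
```

Each service gets its own container, so in Kubernetes each one can later become its own Deployment with its own replica count.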

Multi-stage Dockerfiles for Python

Python images can easily exceed 1 GB. We use multi-stage builds to keep our production images slim and secure.

# Stage 1: Build — resolve dependencies with Poetry, install into /install
FROM python:3.11-slim AS builder

WORKDIR /build
# poetry-plugin-export is needed for `poetry export` on Poetry >= 2.0
RUN pip install --no-cache-dir poetry poetry-plugin-export
COPY pyproject.toml poetry.lock ./
RUN poetry export -f requirements.txt -o requirements.txt
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Final — only the installed packages and app code, no build tooling
FROM python:3.11-slim

WORKDIR /app
COPY --from=builder /install /usr/local
COPY ./app ./app

# Run as non-root for security
RUN useradd -m myuser
USER myuser

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Structuring the Kubernetes Cluster

We organized our cluster using Namespaces to separate environments:

  • rag-prod: Production workloads.
  • rag-staging: Testing new features.
  • rag-infra: Shared services like Qdrant and Redis.
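The three namespaces above are plain manifests that can live in the GitOps repo alongside everything else — a minimal sketch:

```yaml
# namespaces.yaml — one Namespace per environment boundary
apiVersion: v1
kind: Namespace
metadata:
  name: rag-prod
---
apiVersion: v1
kind: Namespace
metadata:
  name: rag-staging
---
apiVersion: v1
kind: Namespace
metadata:
  name: rag-infra
```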

Node Pools

  • General Pool: For API and small services (CPU-optimized).
  • GPU Pool: For vLLM (NVIDIA A100/L4).
  • Storage Pool: For the Vector Database (High IOPS SSD).
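To keep the expensive GPU pool reserved for inference, the usual pattern is to taint the GPU nodes and add a matching toleration plus a node selector on the vLLM pods. The label key `pool: gpu` below is an assumption — match it to however your cluster labels its node pools:

```yaml
# Pod spec fragment pinning the inference workload to the GPU pool
spec:
  nodeSelector:
    pool: gpu                      # assumed node-pool label
  tolerations:
    - key: "nvidia.com/gpu"       # taint applied to GPU nodes
      operator: "Exists"
      effect: "NoSchedule"
  containers:
    - name: vllm
      resources:
        limits:
          nvidia.com/gpu: 1       # request exactly one GPU
```

Without the taint, small CPU services would happily schedule onto (and waste) A100 nodes.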

Mastering Helm Charts

Instead of managing 50 YAML files, we use Helm to template our configuration.

# charts/rag-app/values.yaml
replicaCount: 3
image:
  repository: my-rag-api
  tag: "stable"

env:
  QDRANT_URL: "http://qdrant.rag-infra.svc.cluster.local:6333"
  LLM_MODEL: "gpt-4-turbo"

resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
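For context, here is how a template consumes those values — an abridged, illustrative Deployment, with labels and names invented for the sketch:

```yaml
# charts/rag-app/templates/deployment.yaml (abridged)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-api
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: rag-api
  template:
    metadata:
      labels:
        app: rag-api
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            - name: QDRANT_URL
              value: {{ .Values.env.QDRANT_URL | quote }}
          resources:
            requests:
              cpu: {{ .Values.resources.requests.cpu | quote }}
              memory: {{ .Values.resources.requests.memory | quote }}
```

One `helm upgrade --install` with a different values file then gives you staging vs. production from the same chart.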

CI/CD Pipeline with GitHub Actions

Every time we push code, the pipeline:

  1. Lints & Tests: Runs ruff, mypy, and pytest.
  2. Builds Image: Creates a new Docker container.
  3. Publishes Image: Pushes to AWS ECR or GitHub Packages.
  4. Updates Config: Updates the version tag in our GitOps repository.
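The four stages above can be condensed into a workflow roughly like this — a sketch, with the registry, secrets, and the final GitOps-repo bump left as placeholders:

```yaml
# .github/workflows/ci.yml — condensed sketch of the pipeline
name: ci
on:
  push:
    branches: [main]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install ruff mypy pytest
      - run: ruff check . && mypy app && pytest
  build-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
      # step 4 (committing the new tag to the GitOps repo) omitted here
```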

GitOps Workflow with ArgoCD

We follow the GitOps principle: "The state of the system is whatever is in the Git repository."

How it works:

  1. Our CI pipeline updates a YAML file in the infrastructure repo.
  2. ArgoCD notices the change in Git.
  3. ArgoCD automatically synchronizes the Kubernetes cluster to match the new Git state.
  4. Result: No manual kubectl apply. Ever.
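On the ArgoCD side, one Application manifest wires the Git repo to the cluster. The `repoURL` and `path` below are placeholders for your infrastructure repo:

```yaml
# argocd/rag-app.yaml — ArgoCD watches this repo path and syncs it
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rag-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/rag-infra.git   # placeholder
    targetRevision: main
    path: charts/rag-app
  destination:
    server: https://kubernetes.default.svc
    namespace: rag-prod
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert manual drift back to the Git state
```

With `selfHeal` on, even a stray manual `kubectl edit` gets reverted to whatever Git says.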

Resource Management & Cost Control

Running GPUs in the cloud is expensive. We implemented three strategies:

  1. Vertical Pod Autoscaler (VPA): Automatically adjusts CPU/RAM requests based on observed usage.
  2. KEDA (Kubernetes Event-driven Autoscaling): Scales the number of inference pods based on the number of pending requests in the queue.
  3. Node Autoprovisioning: Shuts down GPU nodes completely overnight when no requests are coming in.
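The KEDA piece, for instance, can be expressed as a ScaledObject watching queue depth. The Redis list name, thresholds, and target Deployment name here are illustrative:

```yaml
# keda/inference-scaler.yaml — scale vLLM pods on Redis queue depth
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: rag-prod
spec:
  scaleTargetRef:
    name: vllm-inference          # assumed Deployment name
  minReplicaCount: 0              # scale to zero when the queue is empty
  maxReplicaCount: 4
  triggers:
    - type: redis
      metadata:
        address: redis.rag-infra.svc.cluster.local:6379
        listName: inference-queue # assumed queue key
        listLength: "50"          # add a replica per ~50 pending requests
```

`minReplicaCount: 0` is what makes the overnight shutdown free: no pending requests, no GPU pods, and the node autoprovisioner reclaims the empty GPU nodes.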

Conclusion & Next Post

DevOps for RAG is more complex than traditional web apps because of GPU dependencies and large data volumes. But with Kubernetes and GitOps, we can manage this complexity with confidence.

3 Key Takeaways:

  1. Multi-stage builds are mandatory for secure, slim Python containers.
  2. Helm reduces YAML fatigue by templating configurations.
  3. GitOps ensures that your production environment is reproducible and traceable.

👉 Next Post: [Post 08] Monitoring & Optimization - Keeping an Eye on Your AI

Your system is live, but is it working well? In the next post, we'll discuss LLM Observability. How do we monitor quality, hallucination rates, and latency? We will set up dashboards in Grafana that make every AI engineer jealous.


📬 What's your biggest pain point in deploying AI? Docker, GPUs, or Kubernetes? Let's chat!


Author: Truong Pham
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Docker · Kubernetes · DevOps · CI/CD · ArgoCD · Cloud Engineering

Series • Part 7 of 11

RAG in Production — The Journey of Building a Real-world AI System

  1. RAG in Production [P1]: Real-world Problem - When Does a Business Actually Need AI?
  2. RAG in Production [P2]: What is RAG? Why not Fine-tuning or Prompt Engineering?
  3. RAG in Production [P3]: Architecture Design - Blueprint for an Enterprise RAG System
  4. RAG in Production [P4]: Backend Implementation - Building the Engine with FastAPI & LangChain
  5. RAG in Production [P5]: Vector Database Design - Optimizing Qdrant for Scale
  6. RAG in Production [P6]: LLM Inference Deployment - Scalability with vLLM & Kubernetes
  7. RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem (this post)
  8. RAG in Production [P8]: Monitoring & Optimization - Keeping an Eye on Your AI
  9. RAG in Production [P9]: Security & Privacy - Protecting Your Enterprise Data
  10. RAG in Production [P10]: Future Improvements - Agentic RAG, GraphRAG & Beyond
  11. RAG in Production [P11]: Lessons Learned - 15 Hard Truths About RAG in Production