RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem
Automate the deployment of your RAG system using Docker, Kubernetes, and Helm. Implement a robust CI/CD pipeline with GitHub Actions and ArgoCD.
"Code without automation is like a car without a steering wheel—it might be fast, but it's bound to crash eventually." In this post, we'll turn our RAG system into a collection of orchestrated services that can be deployed, scaled, and updated with a single git commit.
Table of Contents
- The Containerization Strategy
- Multi-stage Dockerfiles for Python
- Structuring the Kubernetes Cluster
- Mastering Helm Charts
- CI/CD Pipeline with GitHub Actions
- GitOps Workflow with ArgoCD
- Resource Management & Cost Control
- Conclusion & Next Post
The Containerization Strategy
Our RAG system consists of multiple moving parts. We need to containerize them separately to allow independent scaling:
- API Service: FastAPI backend.
- Worker Service: Celery workers for heavy ingestion.
- Inference Service: vLLM for the LLM model.
- Monitoring Stack: Prometheus, Grafana, and Loki.
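For local development, the same four-way split can be sketched with Docker Compose. This is an illustrative layout only — the service names, images, and commands are placeholders, not our exact production config:

```yaml
# docker-compose.yml -- illustrative local setup; names and images are placeholders
services:
  api:
    build: .
    ports: ["8000:8000"]
    depends_on: [redis]
  worker:
    build: .
    command: celery -A app.worker worker --loglevel=info
    depends_on: [redis]
  inference:
    image: vllm/vllm-openai:latest
    # GPU access requires the NVIDIA container toolkit on the host
  redis:
    image: redis:7-alpine
```

Each service gets its own container, so the API can scale horizontally while the GPU-bound inference service stays pinned to one replica locally.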
Multi-stage Dockerfiles for Python
Python images can be huge (often over 1 GB). We use multi-stage builds to keep our production images slim and secure.
# Stage 1: Build
FROM python:3.11-slim AS builder
WORKDIR /build
# poetry-plugin-export is needed for `poetry export` on newer Poetry versions
RUN pip install --no-cache-dir poetry poetry-plugin-export
COPY pyproject.toml poetry.lock ./
RUN poetry export -f requirements.txt --output requirements.txt
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Final
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY ./app ./app
# Run as non-root for security
RUN useradd -m myuser
USER myuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
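A `.dockerignore` complements the multi-stage build by keeping the build context small, so `COPY ./app ./app` doesn't drag in caches, virtualenvs, or git history. The entries below are typical examples — adjust them to your repository layout:

```
# .dockerignore -- illustrative; trim to match your repo
.git
.venv/
__pycache__/
*.pyc
tests/
.pytest_cache/
```

You can verify the slimmer result with `docker build -t my-rag-api:latest .` and then `docker images my-rag-api`.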
Structuring the Kubernetes Cluster
We organized our cluster using Namespaces to separate environments:
- rag-prod: Production workloads.
- rag-staging: Testing new features.
- rag-infra: Shared services like Qdrant and Redis.
Node Pools
- General Pool: For API and small services (CPU-optimized).
- GPU Pool: For vLLM (NVIDIA A100/L4).
- Storage Pool: For the Vector Database (High IOPS SSD).
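To make sure inference pods land on the GPU pool and nothing else does, we combine a nodeSelector with a toleration in the pod spec. The label key `pool: gpu` below is an assumption — cloud providers each have their own labeling conventions — while `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name:

```yaml
# Pod spec fragment -- the "pool" label is illustrative
nodeSelector:
  pool: gpu
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
resources:
  limits:
    nvidia.com/gpu: 1
```

Tainting the GPU nodes with `nvidia.com/gpu=present:NoSchedule` keeps CPU-only pods from accidentally occupying the expensive hardware.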
Mastering Helm Charts
Instead of managing 50 YAML files, we use Helm to template our configuration.
# charts/rag-app/values.yaml
replicaCount: 3
image:
  repository: my-rag-api
  tag: "stable"
env:
  QDRANT_URL: "http://qdrant.rag-infra.svc.cluster.local:6333"
  LLM_MODEL: "gpt-4-turbo"
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
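Inside the chart, a deployment template then consumes these values. A minimal, abridged sketch (selectors and labels omitted for brevity; the file path follows standard Helm layout):

```yaml
# charts/rag-app/templates/deployment.yaml (abridged sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-api
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            - name: QDRANT_URL
              value: {{ .Values.env.QDRANT_URL | quote }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

One chart, many environments: `helm install rag-app ./charts/rag-app -f values-prod.yaml` swaps in per-environment overrides without touching the templates.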
CI/CD Pipeline with GitHub Actions
Every time we push code, the pipeline:
- Lints & Tests: Runs ruff, mypy, and pytest.
- Builds Image: Creates a new Docker image.
- Publishes Image: Pushes it to AWS ECR or GitHub Packages.
- Updates Config: Bumps the version tag in our GitOps repository.
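The four stages above map onto a workflow like the following. This is an abridged sketch — the registry, org, and repo names are placeholders, and the registry login step is omitted:

```yaml
# .github/workflows/ci.yml -- abridged sketch; names are placeholders
name: ci
on:
  push:
    branches: [main]
jobs:
  test-build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: |
          pip install ruff mypy pytest
          ruff check .
          mypy app
          pytest
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/my-org/my-rag-api:${{ github.sha }}
      # Final step: bump the image tag in the GitOps repo (e.g. open a PR)
```

Tagging images with the commit SHA rather than `latest` keeps every deployment traceable back to an exact commit.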
GitOps Workflow with ArgoCD
We follow the GitOps principle: "The state of the system is whatever is in the Git repository."
How it works:
- Our CI pipeline updates a YAML file in the infrastructure repo.
- ArgoCD notices the change in Git.
- ArgoCD automatically synchronizes the Kubernetes cluster to match the new Git state.
- Result: No manual kubectl apply. Ever.
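The glue is an ArgoCD Application manifest that points the cluster at the Git repo. The repo URL and paths below are placeholders:

```yaml
# argocd/rag-app.yaml -- repo URL and paths are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rag-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/infrastructure
    targetRevision: main
    path: charts/rag-app
  destination:
    server: https://kubernetes.default.svc
    namespace: rag-prod
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual drift back to the Git state
```

With `selfHeal` on, even a rogue `kubectl edit` gets reverted — Git stays the single source of truth.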
Resource Management & Cost Control
Running GPUs in the cloud is expensive. We implemented three strategies:
- Vertical Pod Autoscaler (VPA): Automatically adjusts CPU/RAM requests based on actual usage.
- KEDA (Kubernetes Event-driven Autoscaling): Scales the number of inference pods based on the number of pending requests in the queue.
- Node Autoprovisioning: Shuts down GPU nodes completely overnight when no requests are coming in.
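The KEDA piece can be sketched as a ScaledObject watching a Redis list. The deployment name, queue name, and threshold here are assumptions — substitute your own:

```yaml
# keda/inference-scaler.yaml -- names and threshold are illustrative
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: rag-prod
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 0   # scale to zero when the queue is empty
  maxReplicaCount: 4
  triggers:
    - type: redis
      metadata:
        address: redis.rag-infra.svc.cluster.local:6379
        listName: ingestion-queue
        listLength: "10"  # target ~10 pending requests per replica
```

Scaling to zero is what actually saves money: when the queue drains overnight, the inference pods disappear and the node autoprovisioner can release the GPU nodes.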
Conclusion & Next Post
DevOps for RAG is more complex than traditional web apps because of GPU dependencies and large data volumes. But with Kubernetes and GitOps, we can manage this complexity with confidence.
3 Key Takeaways:
- Multi-stage builds are mandatory for secure, slim Python containers.
- Helm reduces YAML fatigue by templating configurations.
- GitOps ensures that your production environment is reproducible and traceable.
👉 Next Post: [Post 08] Monitoring & Optimization - Keeping an Eye on Your AI
Your system is live, but is it working well? In the next post, we'll discuss LLM Observability. How do we monitor quality, hallucination rates, and latency? We will set up dashboards in Grafana that make every AI engineer jealous.
📬 What's your biggest pain point in deploying AI? Docker, GPUs, or Kubernetes? Let's chat!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Docker Kubernetes DevOps CI/CD ArgoCD Cloud Engineering
Series • Part 7 of 11