RAG in Production [P7]: DevOps & GitOps - Orchestrating the RAG Ecosystem
Automate the deployment of your RAG system using Docker, Kubernetes, and Helm. Implement a robust CI/CD pipeline with GitHub Actions and ArgoCD.
"Code without automation is like a car without a steering wheel—it might be fast, but it's bound to crash eventually." In this post, we'll turn our RAG system into a collection of orchestrated services that can be deployed, scaled, and updated with a single git commit.
Table of Contents
- The Containerization Strategy
- Multi-stage Dockerfiles for Python
- Structuring the Kubernetes Cluster
- Mastering Helm Charts
- CI/CD Pipeline with GitHub Actions
- GitOps Workflow with ArgoCD
- Resource Management & Cost Control
- Conclusion & Next Post
The Containerization Strategy
Our RAG system consists of multiple moving parts. We need to containerize them separately to allow independent scaling:
- API Service: FastAPI backend.
- Worker Service: Celery workers for heavy ingestion.
- Inference Service: vLLM for the LLM model.
- Monitoring Stack: Prometheus, Grafana, and Loki.
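For local development, the same four-way split can be sketched with Docker Compose. This is an illustrative layout only — the service names, images, and commands are placeholders, not our exact production config:

```yaml
# docker-compose.yml -- illustrative local setup; names and images are placeholders
services:
  api:
    build: .
    ports: ["8000:8000"]
    depends_on: [redis]
  worker:
    build: .
    command: celery -A app.worker worker --loglevel=info
    depends_on: [redis]
  inference:
    image: vllm/vllm-openai:latest
    # GPU access requires the NVIDIA container toolkit on the host
  redis:
    image: redis:7-alpine
```

Each service gets its own container, so the API can scale horizontally while the GPU-bound inference service stays pinned to one replica locally.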
Multi-stage Dockerfiles for Python
Python images can be huge (often over 1 GB). We use multi-stage builds to keep our production images slim and secure.
# Stage 1: Build
FROM python:3.11-slim AS builder
WORKDIR /build
# poetry-plugin-export is needed for `poetry export` on newer Poetry versions
RUN pip install --no-cache-dir poetry poetry-plugin-export
COPY pyproject.toml poetry.lock ./
RUN poetry export -f requirements.txt --output requirements.txt
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Final
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY ./app ./app
# Run as non-root for security
RUN useradd -m myuser
USER myuser
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
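A `.dockerignore` complements the multi-stage build by keeping the build context small, so `COPY ./app ./app` doesn't drag in caches, virtualenvs, or git history. The entries below are typical examples — adjust them to your repository layout:

```
# .dockerignore -- illustrative; trim to match your repo
.git
.venv/
__pycache__/
*.pyc
tests/
.pytest_cache/
```

You can verify the slimmer result with `docker build -t my-rag-api:latest .` and then `docker images my-rag-api`.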
Structuring the Kubernetes Cluster
We organized our cluster using Namespaces to separate environments:
- rag-prod: Production workloads.
- rag-staging: Testing new features.
- rag-infra: Shared services like Qdrant and Redis.
Node Pools
- General Pool: For API and small services (CPU-optimized).
- GPU Pool: For vLLM (NVIDIA A100/L4).
- Storage Pool: For the Vector Database (High IOPS SSD).
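To make sure inference pods land on the GPU pool and nothing else does, we combine a nodeSelector with a toleration in the pod spec. The label key `pool: gpu` below is an assumption — cloud providers each have their own labeling conventions — while `nvidia.com/gpu` is the standard NVIDIA device-plugin resource name:

```yaml
# Pod spec fragment -- the "pool" label is illustrative
nodeSelector:
  pool: gpu
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
resources:
  limits:
    nvidia.com/gpu: 1
```

Tainting the GPU nodes with `nvidia.com/gpu=present:NoSchedule` keeps CPU-only pods from accidentally occupying the expensive hardware.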
Mastering Helm Charts
Instead of managing 50 YAML files, we use Helm to template our configuration.
# charts/rag-app/values.yaml
replicaCount: 3
image:
  repository: my-rag-api
  tag: "stable"
env:
  QDRANT_URL: "http://qdrant.rag-infra.svc.cluster.local:6333"
  LLM_MODEL: "gpt-4-turbo"
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
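Inside the chart, a deployment template then consumes these values. A minimal, abridged sketch (selectors and labels omitted for brevity; the file path follows standard Helm layout):

```yaml
# charts/rag-app/templates/deployment.yaml (abridged sketch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-api
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - name: api
          image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
          env:
            - name: QDRANT_URL
              value: {{ .Values.env.QDRANT_URL | quote }}
          resources:
            {{- toYaml .Values.resources | nindent 12 }}
```

One chart, many environments: `helm install rag-app ./charts/rag-app -f values-prod.yaml` swaps in per-environment overrides without touching the templates.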
CI/CD Pipeline with GitHub Actions
Every time we push code, the pipeline:
- Lints & Tests: Runs ruff, mypy, and pytest.
- Builds Image: Creates a new Docker image.
- Publishes Image: Pushes it to AWS ECR or GitHub Packages.
- Updates Config: Bumps the version tag in our GitOps repository.
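The four stages above map onto a workflow like the following. This is an abridged sketch — the registry, org, and repo names are placeholders, and the registry login step is omitted:

```yaml
# .github/workflows/ci.yml -- abridged sketch; names are placeholders
name: ci
on:
  push:
    branches: [main]
jobs:
  test-build-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: |
          pip install ruff mypy pytest
          ruff check .
          mypy app
          pytest
      - uses: docker/build-push-action@v5
        with:
          push: true
          tags: ghcr.io/my-org/my-rag-api:${{ github.sha }}
      # Final step: bump the image tag in the GitOps repo (e.g. open a PR)
```

Tagging images with the commit SHA rather than `latest` keeps every deployment traceable back to an exact commit.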
GitOps Workflow with ArgoCD
We follow the GitOps principle: "The state of the system is whatever is in the Git repository."
How it works:
- Our CI pipeline updates a YAML file in the infrastructure repo.
- ArgoCD notices the change in Git.
- ArgoCD automatically synchronizes the Kubernetes cluster to match the new Git state.
- Result: No manual kubectl apply. Ever.
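The glue is an ArgoCD Application manifest that points the cluster at the Git repo. The repo URL and paths below are placeholders:

```yaml
# argocd/rag-app.yaml -- repo URL and paths are placeholders
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: rag-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/my-org/infrastructure
    targetRevision: main
    path: charts/rag-app
  destination:
    server: https://kubernetes.default.svc
    namespace: rag-prod
  syncPolicy:
    automated:
      prune: true     # delete resources removed from Git
      selfHeal: true  # revert manual drift back to the Git state
```

With `selfHeal` on, even a rogue `kubectl edit` gets reverted — Git stays the single source of truth.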
Resource Management & Cost Control
Running GPUs in the cloud is expensive. We implemented three strategies:
- Vertical Pod Autoscaler (VPA): Automatically adjusts CPU/RAM requests based on actual usage.
- KEDA (Kubernetes Event-driven Autoscaling): Scales the number of inference pods based on the number of pending requests in the queue.
- Node Autoprovisioning: Shuts down GPU nodes completely overnight when no requests are coming in.
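The KEDA piece can be sketched as a ScaledObject watching a Redis list. The deployment name, queue name, and threshold here are assumptions — substitute your own:

```yaml
# keda/inference-scaler.yaml -- names and threshold are illustrative
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: inference-scaler
  namespace: rag-prod
spec:
  scaleTargetRef:
    name: inference-deployment
  minReplicaCount: 0   # scale to zero when the queue is empty
  maxReplicaCount: 4
  triggers:
    - type: redis
      metadata:
        address: redis.rag-infra.svc.cluster.local:6379
        listName: ingestion-queue
        listLength: "10"  # target ~10 pending requests per replica
```

Scaling to zero is what actually saves money: when the queue drains overnight, the inference pods disappear and the node autoprovisioner can release the GPU nodes.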
Conclusion & Next Post
DevOps for RAG is more complex than traditional web apps because of GPU dependencies and large data volumes. But with Kubernetes and GitOps, we can manage this complexity with confidence.
3 Key Takeaways:
- Multi-stage builds are mandatory for secure, slim Python containers.
- Helm reduces YAML fatigue by templating configurations.
- GitOps ensures that your production environment is reproducible and traceable.
👉 Next Post: [Post 08] Monitoring & Optimization - Keeping an Eye on Your AI
Your system is live, but is it working well? In the next post, we'll discuss LLM Observability. How do we monitor quality, hallucination rates, and latency? We will set up dashboards in Grafana that make every AI engineer jealous.
📬 What's your biggest pain point in deploying AI? Docker, GPUs, or Kubernetes? Let's chat!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: Docker Kubernetes DevOps CI/CD ArgoCD Cloud Engineering
Series • Part 7 of 11