
AI-Service Integration - Bringing LLM/RAG into Microservice Architecture

AI is not like regular CRUD services. How to integrate LLM and RAG without bottlenecking your system.

Truong Pham · Software Engineer
Published April 1, 2024
Stack: microservice · AI · RAG · LLM · architecture

Integrating AI into microservices isn't just about calling an OpenAI API. It's a matter of resource management, latency handling, and designing asynchronous data flows.


When AI Joins the System

Recently, we had a new requirement to build an Internal RAG AI Assistant (a virtual assistant for looking up internal documents). Once we started, we realized that AI services behave very "differently" from regular services:

  • Extremely High Latency: An AI request can take 5-10 seconds to complete.
  • Resource Intensive: Running local LLMs will eat up all your GPU/RAM.
  • Streaming: Users want to see words pop up one by one (streaming) rather than waiting 10 seconds for a block of text.

1. Designing Separate AI Services

Never cram AI processing logic into existing business services. We split out a dedicated AI Service that does only two things: querying the Vector Database (RAG) and calling the LLM.

Why? Because you can scale this resource-hungry "beast" independently without affecting your user's payment or login flows.
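To make the boundary concrete, here is a minimal sketch of that idea: one entry point, two responsibilities. The retrieval and LLM backends are stubs I made up for illustration; in a real deployment they would be clients for something like Qdrant and an OpenAI-compatible API.

```python
def search_vector_db(question: str, top_k: int = 3) -> list[str]:
    """Stub for vector-similarity search (stands in for Qdrant/pgvector)."""
    corpus = {
        "vacation": "Employees get 15 paid vacation days per year.",
        "vpn": "Install the corporate VPN client before working remotely.",
    }
    return [text for key, text in corpus.items() if key in question.lower()][:top_k]

def call_llm(prompt: str) -> str:
    """Stub for an LLM completion call."""
    return f"Answer based on: {prompt[:60]}"

def answer(question: str) -> str:
    """The entire AI Service surface: RAG retrieval + LLM call, nothing else."""
    context = "\n".join(search_vector_db(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```

Because the service owns nothing but these two calls, its deployment can be scaled (or GPU-pinned) on its own schedule.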

2. Streaming and WebSockets

Since AI responds slowly, regular request/response HTTP will hold connections open and deliver a very poor UX. The solution:

  • Use Server-Sent Events (SSE) or WebSockets to push parts of the result back to the Frontend.
  • As soon as the LLM generates a sentence, it's sent to the user immediately. This makes the system feel much faster than it actually is.
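The streaming idea can be sketched with plain SSE framing. This generator wraps an LLM's token stream; a real service would serve it with `Content-Type: text/event-stream` (the `[DONE]` sentinel is a common convention, not a requirement of SSE).

```python
from collections.abc import Iterator

def sse_stream(token_source: Iterator[str]) -> Iterator[str]:
    """Wrap each LLM token in SSE 'data:' framing so the frontend can
    render words as they arrive instead of waiting for the full answer."""
    for token in token_source:
        yield f"data: {token}\n\n"   # one SSE event per token
    yield "data: [DONE]\n\n"         # sentinel so the client knows to close

# The frontend receives words one by one as they are generated.
events = list(sse_stream(iter(["AI ", "feels ", "faster."])))
```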

3. RAG Pipeline in Microservices

In our RAG project, we had to handle data synchronization:

  1. When there is a new document in the Document Service.
  2. An event is fired via a Message Queue (NATS).
  3. The AI Service receives the event and triggers the "Embedding" process, saving it to a Vector Database (like Qdrant or pgvector).

Trade-off: Data in the AI Assistant will be slightly behind reality (Eventual Consistency), but in return, the Document Service won't hang while waiting for the AI to process thousands of document pages.
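The consumer side of that pipeline looks roughly like the handler below. The queue, embedding model, and vector store are all stand-ins (the chunk size and toy vectors are made up); the point is that the Document Service fired the event and moved on, while the AI Service does the heavy lifting asynchronously.

```python
def chunk(text: str, size: int = 40) -> list[str]:
    """Split a document into fixed-size chunks for embedding."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(chunk_text: str) -> list[float]:
    """Stub embedding model -- returns a toy vector."""
    return [float(len(chunk_text)), float(sum(map(ord, chunk_text)) % 97)]

# Stand-in for Qdrant/pgvector: doc_id -> [(vector, chunk_text), ...]
vector_db: dict[str, list[tuple[list[float], str]]] = {}

def on_document_created(event: dict) -> int:
    """Handler for the 'document created' event from the message queue.
    Embeds every chunk and upserts it; the Document Service has long
    since returned to its caller (eventual consistency)."""
    doc_id, text = event["doc_id"], event["text"]
    vector_db[doc_id] = [(embed(c), c) for c in chunk(text)]
    return len(vector_db[doc_id])
```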


Lessons Learned

  1. Caching is Vital: Calling LLMs is expensive (in both money and time). Use Redis to cache answers to common questions.
  2. Strict Token Limits: Don't let an overly long request take down the service. Always enforce input/output token limits.
  3. Observability for AI: Monitor not just CPU/RAM but also Token Usage and Model Latency.
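Lessons 1 and 2 combine naturally into one guard in front of the LLM call. This is an illustrative sketch: the dict stands in for Redis, `MAX_INPUT_TOKENS` is an assumed budget, and the word-count tokenizer is a crude approximation (real services would use the model's tokenizer).

```python
import hashlib

cache: dict[str, str] = {}      # stand-in for Redis
MAX_INPUT_TOKENS = 512          # assumed budget; tune per model

def count_tokens(text: str) -> int:
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())

def cached_llm_call(question: str, llm) -> str:
    """Reject over-budget prompts, and only pay for the LLM on a cache miss."""
    if count_tokens(question) > MAX_INPUT_TOKENS:
        raise ValueError("prompt exceeds input token budget")
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in cache:
        cache[key] = llm(question)
    return cache[key]
```

Keying the cache on the normalized question means two users asking "What is our VPN policy?" pay for one LLM call.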

Conclusion

AI is a promising but challenging component for microservice architecture. By treating AI Services as asynchronous components and prioritizing the Streaming experience, you can bring the power of LLMs into your system while maintaining the necessary stability.

Series • Part 15 of 20

Microservice Journey: Lessons & Trade-offs
