AI-Service Integration - Bringing LLM/RAG into Microservice Architecture

Integrating AI into microservices isn't just about calling an OpenAI API. It's a matter of resource management, latency handling, and designing asynchronous data flows.

When AI Joins the System

Recently, we had a new requirement to build an Internal RAG AI Assistant (a virtual assistant for looking up internal documents). When we started, we realized that AI services behave very differently from regular services:

Extremely High Latency: An AI request can take 5-10 seconds to complete.
Resource Intensive: Running local LLMs can consume most of a machine's GPU memory and RAM.
Streaming: Users want to see words pop up one by one (streaming) rather than waiting 10 seconds for a block of text.

1. Designing Separate AI Services

Never cram AI processing logic into existing business services. We split it out into a dedicated AI Service. This service only does two things: querying the Vector Database (RAG) and calling the LLM.

Why? Because you can scale this resource-hungry "beast" independently without affecting your users' payment or login flows.
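
To make the "only two things" boundary concrete, here is a minimal sketch of such a service, not our production code. The names `AIService`, `Chunk`, and `cosine` are made up for illustration, the in-memory list stands in for the Vector Database, and `embed` and `llm` are placeholders for whatever embedding model and LLM client you actually use.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass
class Chunk:
    text: str
    vector: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class AIService:
    """Hypothetical AI Service: retrieval plus one LLM call, nothing else."""

    def __init__(
        self,
        index: list[Chunk],                    # stand-in for the Vector Database
        embed: Callable[[str], list[float]],   # placeholder embedding function
        llm: Callable[[str], str],             # placeholder LLM client
    ):
        self.index = index
        self.embed = embed
        self.llm = llm

    def retrieve(self, question: str, k: int = 2) -> list[str]:
        # Responsibility 1: find the k chunks closest to the question.
        q = self.embed(question)
        ranked = sorted(self.index, key=lambda c: cosine(q, c.vector), reverse=True)
        return [c.text for c in ranked[:k]]

    def answer(self, question: str) -> str:
        # Responsibility 2: build a prompt from the retrieved context and call the LLM.
        context = "\n".join(self.retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return self.llm(prompt)
```

Because payment and login code never imports any of this, the service can be deployed and scaled on its own hardware.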

2. Streaming and WebSockets

Since the AI responds slowly, a plain request/response HTTP call holds the connection open with nothing to show until the full answer is ready, which makes for a very poor UX. Solution:

Use Server-Sent Events (SSE) or WebSockets to push parts of the result back to the Frontend.
As soon as the LLM generates a piece of text (a few tokens or a sentence), it's sent to the user immediately. This makes the system feel much faster than it actually is.
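
As a rough illustration of the SSE option, the sketch below turns a stream of text pieces into SSE frames. `fake_llm_tokens` is a stand-in for a real streaming LLM client, and the `done` event name is an assumption of this example, not part of the SSE format.

```python
from typing import Iterable, Iterator


def fake_llm_tokens(prompt: str) -> Iterator[str]:
    """Stand-in for a streaming LLM client; yields text pieces one by one."""
    for piece in ["Vacation ", "policy: ", "12 days\nper year."]:
        yield piece


def sse_events(pieces: Iterable[str]) -> Iterator[str]:
    """Wrap each piece in a Server-Sent Events frame."""
    for piece in pieces:
        # A payload containing newlines must be split into several "data:" lines;
        # the blank line at the end terminates the event.
        lines = piece.split("\n")
        yield "".join(f"data: {line}\n" for line in lines) + "\n"
    # Hypothetical end-of-stream marker so the client knows when to close.
    yield "event: done\ndata: end\n\n"


if __name__ == "__main__":
    for frame in sse_events(fake_llm_tokens("How many vacation days do I get?")):
        print(frame, end="")
```

In a real endpoint these frames would be written to a response with `Content-Type: text/event-stream`, flushing after each one so the browser receives it right away.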

3. RAG Pipeline in Microservices

In our RAG project, we had to handle data synchronization:

A new document is created in the Document Service.
The Document Service fires an event via a Message Queue (NATS).
The AI Service receives the event and runs the "Embedding" process, saving the resulting vectors to a Vector Database (like Qdrant or pgvector).

Trade-off: Data in the AI Assistant will be slightly behind reality (Eventual Consistency), but in return, the Document Service won't hang while waiting for the AI to process thousands of document pages.
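
The flow and its trade-off can be sketched in a few lines. This is an illustration under loud assumptions: an `asyncio.Queue` stands in for NATS, a plain dict stands in for the Vector Database, `toy_embed` is a deterministic placeholder rather than a real embedding model, and the event shape (`document.created`, `id`, `text`) is invented for the example.

```python
import asyncio
import hashlib


def toy_embed(text: str, dims: int = 4) -> list[float]:
    """Deterministic placeholder for a real embedding model."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:dims]]


async def document_service(bus: asyncio.Queue, doc_id: str, text: str) -> None:
    # Publish and return immediately; the publisher never waits for embedding.
    await bus.put({"type": "document.created", "id": doc_id, "text": text})


async def ai_service_worker(bus: asyncio.Queue, vector_store: dict) -> None:
    while True:
        event = await bus.get()
        try:
            if event["type"] == "document.created":
                # Upsert by document id so a redelivered event is harmless.
                vector_store[event["id"]] = toy_embed(event["text"])
        finally:
            bus.task_done()


async def main() -> dict:
    bus: asyncio.Queue = asyncio.Queue()          # stand-in for NATS
    vector_store: dict[str, list[float]] = {}     # stand-in for Qdrant/pgvector
    worker = asyncio.create_task(ai_service_worker(bus, vector_store))
    await document_service(bus, "doc-1", "Vacation policy: 12 days per year")
    await document_service(bus, "doc-2", "VPN setup guide")
    await bus.join()   # demo only: wait until the worker has caught up
    worker.cancel()
    return vector_store


if __name__ == "__main__":
    print(sorted(asyncio.run(main())))
```

Between `document_service` returning and the worker finishing, the document exists but is not yet searchable. That window is the eventual consistency described above.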

Lessons Learned

Caching is Vital: Calling LLMs is expensive (both in money and time). Use Redis to cache answers to common questions.
Strict Token Limits: Don't let an overly long request crash the service. Always enforce input and output token limits.
Observability for AI: You need to monitor not just CPU/RAM but also Token Usage and Model Latency.
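
The three lessons fit into one small wrapper around the LLM call. The sketch below is hypothetical: `GuardedLLM` is an invented name, a dict stands in for Redis, and splitting on whitespace is only a crude approximation of a real tokenizer. Output limits are not shown; they are usually passed to the model API as a maximum-length parameter.

```python
import hashlib
import time


class GuardedLLM:
    """Hypothetical wrapper adding a cache, an input limit, and usage counters."""

    def __init__(self, llm, max_input_tokens: int = 50):
        self.llm = llm                          # any callable: prompt -> answer
        self.max_input_tokens = max_input_tokens
        self.cache: dict[str, str] = {}         # stand-in for Redis
        self.metrics = {"calls": 0, "cache_hits": 0, "input_tokens": 0, "llm_seconds": 0.0}

    @staticmethod
    def count_tokens(text: str) -> int:
        # Crude approximation: real tokenizers count differently.
        return len(text.split())

    @staticmethod
    def cache_key(question: str) -> str:
        # Normalize case and whitespace so trivially different spellings share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def ask(self, question: str) -> str:
        tokens = self.count_tokens(question)
        if tokens > self.max_input_tokens:
            # Reject early instead of letting an oversized prompt reach the model.
            raise ValueError(f"input has {tokens} tokens, limit is {self.max_input_tokens}")
        key = self.cache_key(question)
        if key in self.cache:
            self.metrics["cache_hits"] += 1
            return self.cache[key]
        start = time.perf_counter()
        answer = self.llm(question)
        # Record model latency and token usage alongside the usual CPU/RAM metrics.
        self.metrics["llm_seconds"] += time.perf_counter() - start
        self.metrics["calls"] += 1
        self.metrics["input_tokens"] += tokens
        self.cache[key] = answer
        return answer
```

With Redis in place of the dict you would also set an expiry on each key, so cached answers do not outlive the documents they were based on.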

Conclusion

AI is a promising but challenging component for microservice architecture. By treating AI Services as asynchronous components and prioritizing the Streaming experience, you can bring the power of LLMs into your system while maintaining the necessary stability.
