RAG in Production [P2]: What is RAG? Why not Fine-tuning or Prompt Engineering?
Deep analysis of RAG, comparing its pros and cons with Fine-tuning and Prompt Engineering to understand why RAG is the optimal choice for enterprise data.
"Just add the document to the system prompt and we're done, right?" This innocent question from a PM took me 20 minutes to explain — and that was when I realized I needed to write this post.*
Table of Contents
- Recap of Post 01
- What Does an LLM Know, and What Does It Not Know?
- 3 Ways to Bring Knowledge into an LLM
- Prompt Engineering — Simple but Limited
- Fine-tuning — Powerful but Expensive
- RAG — Retrieval-Augmented Generation
- Direct Comparison: Prompt vs Fine-tune vs RAG
- How Does RAG Work?
- Advanced RAG Variants
- When to Use What? Decision Tree
- Limits of RAG That Few People Talk About
- Conclusion & Next Post
Recap of Post 01
In the previous post, we defined:
- Business problem: Customer Support agents lose 10.2 minutes/ticket due to manual search through 3,000+ scattered documents.
- Conclusion: Need a system that understands the semantics of questions and retrieves the right information in < 5 seconds.
- Decision: RAG is the most suitable approach.
But why RAG? This post will answer that question thoroughly.
What Does an LLM Know, and What Does It Not Know?
Before comparing approaches, it's necessary to understand the nature of an LLM.
LLM is a "Compressed Book"
An LLM like GPT-4, Claude, or Llama is trained on terabytes of text from the internet — books, Wikipedia, code, forums, scientific reports... The training process compresses all that information into billions of parameters of the model.
Training Data (terabytes)
↓
[Training Process]
↓
Model Weights (gigabytes) ← "Compressed Knowledge"
Result: The model "knows" a lot about the world in a general way. But that knowledge has 3 fundamental limits:
Limit 1: Knowledge Cutoff
The model only knows information up to the moment training ended. Any event after that is "blind" to it.
GPT-4 Turbo training cutoff: ~April 2023
→ Knows nothing about events from May 2023 onwards
→ Knows nothing about your company's internal content
Limit 2: Private Knowledge
An LLM knows nothing about your company unless that information was public on the internet before the cutoff date.
LLM doesn't know:
❌ Your company's specific refund policy
❌ Internal onboarding processes
❌ Ticket conversation history
❌ Current product prices
❌ VIP customer names and their history
Limit 3: Hallucination
When it doesn't know, an LLM rarely says "I don't know" — it makes up an answer that sounds plausible. This is an intrinsic characteristic of autoregressive text generation, not a bug that can be patched away completely.
# Example of dangerous hallucination in an enterprise environment
User: "What is our refund policy?"
GPT-4 (without context):
"The standard refund policy in the industry is 30 days.
Customers need to send a request via email to support@company.com
and will receive a refund within 5-7 business days."
→ The answer sounds COMPLETELY REASONABLE
→ But it's 100% made up — because the model doesn't have this information
→ Agent copy-pastes → Customer receives wrong information → Churn
Conclusion: LLMs need to be provided with context to answer correctly about domain-specific knowledge. The question is: how to provide it?
3 Ways to Bring Knowledge into an LLM
There are 3 main approaches, each with its own trade-offs:
┌─────────────────────────────────────────────────────────────┐
│ 3 APPROACHES TO BRING KNOWLEDGE TO LLM │
│ │
│ 1. PROMPT ENGINEERING │
│ Stuff information directly into the prompt for each call │
│ │
│ 2. FINE-TUNING │
│ Retrain the model with your data │
│ │
│ 3. RAG │
│ Search for relevant information → put it in the prompt │
└─────────────────────────────────────────────────────────────┘
Prompt Engineering
How It Works
The simplest way: put the information you want the model to know directly into the prompt.
# Naive approach: Stuff all policies into the system prompt
SYSTEM_PROMPT = """
You are a Customer Support assistant for [Company].
Below are all the company policies:
=== REFUND POLICY ===
[3 pages of refund policy]
=== SHIPPING POLICY ===
[2 pages of shipping policy]
=== PRODUCT CATALOG ===
[500 products with descriptions]
=== FAQ ===
[200 frequently asked questions]
... [and much more]
"""
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_question},
    ],
)
Pros
- ✅ Extremely simple to implement
- ✅ No complex infrastructure required
- ✅ Easy to update — just edit the prompt
- ✅ No latency caused by retrieval
Cons and Practical Limits
Context Window is limited:
GPT-4 Turbo context window: 128,000 tokens ≈ ~100,000 words ≈ ~400 A4 pages
Sounds like a lot? Let's calculate:
- 3,000 Confluence pages × 500 words/page = 1,500,000 words
- Exceeds context window ~15 times
→ IMPOSSIBLE to stuff it all into the prompt
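The back-of-the-envelope math above can be turned into a quick feasibility check. The per-page word count and tokens-per-word ratio below are rough assumptions, not measured values:

```python
# Rough feasibility check: does the knowledge base fit in one prompt?
# ~500 words/page and ~1.33 tokens/word are illustrative estimates.

def fits_in_context(num_pages: int, words_per_page: int = 500,
                    tokens_per_word: float = 1.33,
                    context_window: int = 128_000) -> tuple[bool, float]:
    """Return (fits, how many times the KB fills the context window)."""
    total_tokens = num_pages * words_per_page * tokens_per_word
    return total_tokens <= context_window, total_tokens / context_window

fits, ratio = fits_in_context(3_000)
# 3,000 pages -> ~2M tokens, roughly 15x the 128K window
```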
Cost increases linearly with context:
# Each API call with full context = very expensive
# Example with GPT-4 Turbo (price ~$0.01/1K tokens)
tokens_per_call = 100_000 # 100K token context
cost_per_call = (100_000 / 1_000) * 0.01 # = $1 per call
# With 300 tickets/day:
daily_cost = 300 * 1 # = $300/day
monthly_cost = 300 * 30 # = $9,000/month
# Comparison with RAG approach:
# RAG only puts 3-5 relevant snippets in context (~2,000 tokens)
cost_per_call_rag = (2_000 / 1_000) * 0.01 # = $0.02
monthly_cost_rag = 300 * 30 * 0.02 # = $180/month
# → RAG is ~50 times cheaper
"Lost in the Middle" Problem:
Research has shown ("Lost in the Middle", Liu et al., 2023) that LLMs tend to pay attention to information at the beginning and end of the context while missing information in the middle. If you stuff 100 pages of documents into a prompt, an important answer on page 50 might be ignored.
[SYSTEM PROMPT - 100 pages of documents]
│
├── Page 1-10: Model pays high attention ✅
├── Page 11-90: Model might ignore ⚠️ ← "Lost in the Middle"
└── Page 91-100: Model pays high attention ✅
No semantic search:
When you stuff everything into the prompt, the model has to process the whole thing — even though 95% of the information is irrelevant to the question. This is a waste of both computation and accuracy.
When is Prompt Engineering Enough?
✅ Suitable when:
- Knowledge base < 50 pages
- Use case is simple, well-defined
- Prototyping/POC quickly
- Budget is not an issue
- Context is stable, changes little
Fine-tuning
How It Works
Fine-tuning is the process of further training the model on your dataset, so the model "memorizes" domain-specific knowledge into its weights.
Base Model (GPT-3.5 / Llama 3 / Mistral)
+
Your Training Data (Q&A pairs, documents)
↓
[Fine-tuning Process]
↓
Fine-tuned Model (knows about your domain)
Example training data for fine-tuning:
{"prompt": "What is our company's refund policy?",
"completion": "Customers can get a refund within 30 days..."}
{"prompt": "How to exchange a product?",
"completion": "To exchange a product, customers need to..."}
{"prompt": "What is the domestic shipping fee?",
"completion": "Domestic shipping fees are calculated based on..."}
Pros
- ✅ No retrieval step required → lower latency
- ✅ Model "deeply understands" the domain — not just research but internalize
- ✅ Consistent style and tone — can train the model to speak with the brand's voice
- ✅ Does not depend on context window for knowledge already trained
- ✅ Good for format/behavior — train model to output JSON, follow templates...
Serious Cons
Catastrophic Forgetting:
When fine-tuning, the model might "forget" some general knowledge it has already learned. This is an especially serious problem if the fine-tuning dataset is small or not diverse.
Knowledge cannot be updated in real-time:
Situation: Refund policy changes tomorrow.
With Fine-tuning:
1. Prepare new training data ← 1-2 days
2. Fine-tune model (~$100-$1,000 depending on model size) ← several hours
3. Evaluate new model ← 1 day
4. Deploy new model ← several hours
Total: 3-5 days + $100-$1,000
With RAG:
1. Update document in knowledge base ← 5 minutes
2. Re-index document ← several seconds
Total: 5 minutes + $0
High Cost:
Fine-tuning cost estimate:
GPT-3.5 Fine-tuning (OpenAI):
- Training: $0.008 / 1K tokens
- 1M-token dataset × 3 epochs: ~$24 in training compute — data preparation, evaluation, and deployment dominate the real cost
- Each policy update: repeat the whole process
Self-hosted (Llama 3 8B with LoRA):
- Need GPU: A100 40GB ≈ $2-3/hour
- Training 1M tokens: ~10-20 hours = $20-60
- But needs an engineer who knows how to do it
- Needs infrastructure to deploy
Need high-quality data:
Fine-tuning requires labeled training data — exact (input, output) pairs. Creating this dataset is labor-intensive and prone to bias.
Hallucination still occurs:
Fine-tuning doesn't eliminate hallucinations. The model might still make up information not in the training data, especially with edge case questions.
When is Fine-tuning Actually Suitable?
✅ Suitable when the goal is to change BEHAVIOR, not KNOWLEDGE:
- Want model to output the correct format (JSON, XML, specific templates)
- Want model to speak with a specific tone/style
- Want model to follow more complex instructions than the base model
- Highly specialized domain (medical, law, finance) with lots of data
- Latency is critical and cannot accept retrieval delay
❌ NOT suitable when:
- Knowledge changes frequently
- Do not have enough high-quality training data
- Team lacks ML expertise
- Budget is limited
- Need traceability (know where the answer comes from)
RAG — Retrieval-Augmented Generation
Core Idea
RAG was introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" (Lewis et al., 2020, Facebook AI Research). The idea is extremely elegant:
"Instead of stuffing all knowledge into the model, let the model search for knowledge when needed — like a person looking up reference books before answering."
Perfect Analogy:
❌ Prompt Engineering = Stuffing the entire library into the answerer's head
❌ Fine-tuning = Forcing the answerer to memorize the entire library
✅ RAG = The answerer can look up the exact book needed before answering
the question
High-level RAG Process
[User asks]: "What is the refund policy for defective products?"
↓
Step 1: EMBED the question into a vector
"refund policy for defective products"
→ [0.23, -0.51, 0.87, ..., 0.12] (1536 dimensions)
↓
Step 2: SEARCH in a Vector Database
Find text snippets with the meanings closest to the question
→ [Snippet 1: "Defective products will be refunded..."]
→ [Snippet 2: "Defective product complaint process..."]
→ [Snippet 3: "Refund processing time is..."]
↓
Step 3: AUGMENT the prompt with retrieved context
System: "Answer based on the following documents: [3 snippets above]"
User: "What is the refund policy for defective products?"
↓
Step 4: GENERATE the answer
LLM aggregates the 3 snippets → precise answer, with citations
How Does RAG Work?
RAG has 2 phases: Indexing (offline) and Querying (online/real-time).
Phase 1: Indexing (Offline)
┌─────────────────────────────────────────────────────────────┐
│ INDEXING PIPELINE │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Sources │──▶│ Chunk │──▶│ Embed │──▶│ Store │ │
│ │ │ │ │ │ │ │ │ │
│ │Confluence│ │Split docs│ │text→vec │ │Vector │ │
│ │SharePoint│ │into │ │[0.2,-0.5,│ │ DB │ │
│ │PDF files │ │~500 token│ │0.8,...] │ │ │ │
│ │Slack │ │chunks │ │ │ │Qdrant/ │ │
│ └──────────┘ └──────────┘ └──────────┘ │Weaviate│ │
│ │Pinecone│ │
│ └────────┘ │
└─────────────────────────────────────────────────────────────┘
Chunking — Dividing documents:
# Example: 10-page document → many small chunks
document = """
Refund Policy (10 pages)
...
[page 1: refund conditions]
[page 2: implementation process]
[page 3: processing time]
...
"""
# After chunking:
chunks = [
    "Chunk 1: Refund conditions — Customers can request a refund within 30 days...",
    "Chunk 2: Implementation process — Step 1: Contact support via email...",
    "Chunk 3: Processing time — After confirmation, refund within 5-7 days...",
    # ...
]
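A minimal fixed-size chunker with overlap illustrates the idea. `chunk_text` and its sizes (120-word chunks, 20-word overlap) are illustrative; production chunkers are usually token-based and respect sentence and heading boundaries:

```python
# Split a document into overlapping fixed-size chunks (word-based sketch).
# Overlap keeps sentences near chunk borders retrievable from both sides.

def chunk_text(text: str, chunk_size: int = 120, overlap: int = 20) -> list[str]:
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```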
Embedding — Converting text to vector:
from openai import OpenAI

client = OpenAI()

def embed_text(text: str) -> list[float]:
    response = client.embeddings.create(
        input=text,
        model="text-embedding-3-small"  # 1536 dimensions
    )
    return response.data[0].embedding

# Each chunk → 1 vector in 1536-dimensional space
chunk_vector = embed_text("Chunk 1: Refund conditions...")
# → [0.023, -0.512, 0.871, ..., 0.124] (1536 numbers)
Why is Vector useful?
Embedding models are trained so that text snippets with similar meanings will have vectors close to each other in high-dimensional space.
Vector space (simplified to 2D for illustration):
"refund" ←→ "hoàn tiền"
· ·
"exchange" · · "return policy"
· ·
"shipping"
·
"giao hàng"
→ "refund" and "hoàn tiền" have vectors close to each other
→ "refund" and "shipping" have vectors far from each other
→ Semantic similarity = Geometric proximity
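"Semantic similarity = geometric proximity" is usually measured with cosine similarity. A sketch with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions):

```python
import math

# Cosine similarity: 1.0 for identical directions, near 0 for unrelated ones.

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

refund = [0.9, 0.1, 0.0]       # hypothetical embedding of "refund"
hoan_tien = [0.85, 0.2, 0.05]  # hypothetical embedding of "hoàn tiền"
shipping = [0.1, 0.1, 0.95]    # hypothetical embedding of "shipping"

# Close in meaning -> similarity near 1; unrelated -> much lower
assert cosine_similarity(refund, hoan_tien) > cosine_similarity(refund, shipping)
```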
Phase 2: Querying (Real-time)
┌─────────────────────────────────────────────────────────────┐
│ QUERYING PIPELINE │
│ │
│ User Query │
│ │ │
│ ▼ │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────────┐ │
│ │ Embed │───▶│ Search │───▶│Rerank │───▶│ Generate │ │
│ │ Query │ │Vector │ │(opt.) │ │ │ │
│ │ │ │ DB │ │ │ │LLM + ctx │ │
│ │text→ │ │Top-K │ │Filter │ │→ Answer │ │
│ │vector │ │snippets│ │better │ │+ citations │ │
│ └────────┘ └────────┘ └────────┘ └────────────┘ │
└─────────────────────────────────────────────────────────────┘
# Simplified querying flow
def rag_query(user_question: str) -> dict:
    # Step 1: Embed question
    query_vector = embed_text(user_question)

    # Step 2: Search for top-5 most relevant snippets
    relevant_chunks = vector_db.search(
        vector=query_vector,
        top_k=5,
        score_threshold=0.7  # only retrieve if similarity > 70%
    )

    # Step 3: Build prompt with context
    context = "\n\n".join(chunk.text for chunk in relevant_chunks)
    sources = [chunk.metadata["source"] for chunk in relevant_chunks]

    prompt = f"""
Answer the question based ONLY on the provided documents.
If the documents do not contain the information, say "I didn't find this information".
DO NOT invent information that is not in the documents.

Reference documents:
{context}

Question: {user_question}
"""

    # Step 4: Generate answer
    response = llm.generate(prompt)
    return {
        "answer": response,
        "sources": sources,  # Traceability!
    }
Pros of RAG
- ✅ Knowledge is always up-to-date — just update documents, no need to retrain model
- ✅ Traceability — know which document the answer comes from
- ✅ Reduced hallucinations — model is "forced" to base on provided context
- ✅ Cost-effective — only put a few relevant snippets in context (not everything)
- ✅ Scales well — vector databases can handle millions of documents
- ✅ No ML expertise needed to maintain
- ✅ Security — can implement access control at retrieval layer
Direct Comparison
╔═══════════════════╦══════════════════╦══════════════════╦══════════════════╗
║ Criteria ║ Prompt Eng. ║ Fine-tuning ║ RAG ║
╠═══════════════════╬══════════════════╬══════════════════╬══════════════════╣
║ Complexity ║ Low ✅ ║ High ❌ ║ Medium ⚠️ ║
║ Infrastructure ║ Not needed ✅ ║ GPU cluster ❌ ║ Vector DB ⚠️ ║
║ Knowledge update ║ Easy (edit prompt)║ Very hard ❌ ║ Easy (update doc)✅║
║ Scalability ║ Poor ❌ ║ Good ✅ ║ Good ✅ ║
║ Knowledge size ║ < 100 pages ❌ ║ Unlimited ✅ ║ Unlimited ✅ ║
║ Cost / query ║ High ❌ ║ Low ✅ ║ Medium ⚠️ ║
║ Setup cost ║ Very low ✅ ║ Very high ❌ ║ Medium ⚠️ ║
║ Hallucinations ║ High ❌ ║ Medium ⚠️ ║ Low ✅ ║
║ Traceability ║ No ❌ ║ No ❌ ║ Yes ✅ ║
║ Latency ║ Low ✅ ║ Low ✅ ║ Higher ⚠️ ║
║ Time to market ║ Days ✅ ║ Weeks-months ❌ ║ Weeks ⚠️ ║
║ Suitable use case ║ POC, small ║ Behavior change ║ Knowledge-heavy ║
╚═══════════════════╩══════════════════╩══════════════════╩══════════════════╝
Advanced RAG Variants
Vanilla RAG is not the only solution. Depending on the problem, you might need these variants:
1. Naive RAG (Vanilla RAG)
This is the baseline — what we described above. Suitable for most use cases.
Query → Embed → Search → Generate
Limitations: relies only on semantic similarity and struggles with questions that require multi-hop reasoning.
2. Advanced RAG
Improvements in both pre-retrieval and post-retrieval:
Query Rewriting → Embed → Search → Reranking → Generate
(improve query) (filter results)
Query Rewriting: LLM rewrites the user's question to optimize it for search:
User: "I bought a product 2 months ago, now it's defective, what happens?"
↓ Rewrite
Search query: "warranty return policy defective product"
Reranking: After retrieving top-K snippets from vector search, use another model (cross-encoder) to re-sort them based on more precise relevance.
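To make the re-sorting step concrete, here is a sketch in which a simple word-overlap score stands in for the cross-encoder; a real reranker would score each (query, chunk) pair with a trained model instead of `overlap_score`:

```python
# Reranking re-sorts the top-K retrieved chunks with a more precise (but
# slower) relevance score. The word-overlap score below is a runnable
# stand-in for a cross-encoder so the flow is visible end to end.

def overlap_score(query: str, chunk: str) -> float:
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

def rerank(query: str, chunks: list[str], top_n: int = 3) -> list[str]:
    return sorted(chunks, key=lambda ch: overlap_score(query, ch), reverse=True)[:top_n]
```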
3. Modular RAG
Flexible architecture — can plug-and-play different components:
┌──────────────────────────────────────────────────────┐
│ MODULAR RAG PIPELINE │
│ │
│ Query → [Query Transform] │
│ ↓ │
│ [Router] → select suitable retriever │
│ ↓ │
│ ┌─────────────────────────────────┐ │
│ │ Vector Search │ BM25 │ SQL │ API│ ← Multi-source│
│ └─────────────────────────────────┘ │
│ ↓ │
│ [Fusion] → combine results │
│ ↓ │
│ [Rerank] → re-sort │
│ ↓ │
│ [Generate] → answer │
└──────────────────────────────────────────────────────┘
4. Agentic RAG
LLM has the capability to decide for itself when and what to search:
# Instead of a rigid pipeline, LLM decides action
Agent reasoning:
"This question needs policy information. I will search."
→ search("refund policy")
→ Receive results
→ "Results are not enough, need more info on timeline"
→ search("refund processing time")
→ Receive results
→ "Enough info, aggregate answer"
→ Generate answer
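The loop above can be sketched with stand-ins: `search` hits a toy dictionary and the "enough info" judgment is a simple rule, whereas in production the LLM makes both decisions via tool calls:

```python
# Minimal agentic-RAG loop with stubbed components. KNOWLEDGE, search,
# and the sufficiency rule are all illustrative stand-ins.

KNOWLEDGE = {
    "refund policy": "Refunds are accepted within 30 days.",
    "refund processing time": "Refunds are processed in 5-7 business days.",
}

def search(query: str) -> str:
    return KNOWLEDGE.get(query, "")

def agentic_answer(queries: list[str], max_steps: int = 3) -> str:
    context: list[str] = []
    for query in queries[:max_steps]:  # the agent issues searches step by step
        result = search(query)
        if result:
            context.append(result)
        if len(context) >= 2:          # stand-in for the "enough info" judgment
            break
    return " ".join(context)

answer = agentic_answer(["refund policy", "refund processing time"])
```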
5. GraphRAG
Instead of just storing text chunks, build a knowledge graph to capture relationships between entities:
[Product A] ──[has policy]──▶ [30-day refund]
│ │
[belongs to] [applies when]
│ │
[Category X] [defective product]
Suitable when multi-hop reasoning is needed: "Which category does Product A belong to and what policy does that category have?"
6. Hybrid RAG (Used in our production)
Combines vector search (semantic) with BM25 (keyword/lexical):
# BM25: precise keyword search
bm25_results = bm25_search("refund policy")
# Vector search: semantic search
semantic_results = vector_search(embed("refund policy"))
# Reciprocal Rank Fusion: combine 2 results
final_results = reciprocal_rank_fusion(bm25_results, semantic_results)
Hybrid RAG solves the weakness of pure vector search when the question contains important exact keywords (product names, serial numbers, people's names...).
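Reciprocal Rank Fusion itself is only a few lines. A sketch with hypothetical document IDs; k=60 is the constant from the original RRF paper and a common default:

```python
# RRF: each document's fused score is the sum of 1 / (k + rank) over
# every result list it appears in. Documents ranked well in both lists
# beat documents that top only one list.

def reciprocal_rank_fusion(*result_lists: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_results = ["doc_refund", "doc_shipping", "doc_faq"]        # hypothetical IDs
semantic_results = ["doc_policy", "doc_refund", "doc_warranty"]
fused = reciprocal_rank_fusion(bm25_results, semantic_results)
# doc_refund ranks high in both lists, so it wins the fusion
```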
When to Use What?
Decision Tree
Do you need to bring custom knowledge to the LLM?
│
▼
Knowledge base < 50 pages AND budget is no issue?
YES → Prompt Engineering (simple, fast)
NO → continue
│
▼
Goal is to change BEHAVIOR (format, style, instruction-following)?
YES → Fine-tuning (or Prompt Engineering + Fine-tuning)
NO → continue
│
▼
Knowledge changes frequently (weekly/monthly)?
YES → RAG (mandatory)
NO → continue
│
▼
Need traceability (know which source the answer comes from)?
YES → RAG (mandatory)
NO → continue
│
▼
Knowledge base > 100 pages?
YES → RAG
NO → Prompt Engineering or Fine-tuning depending on budget
What did we choose and why?
Problem:
✅ Knowledge base: 3,000+ Confluence pages
✅ Update: Weekly
✅ Need traceability: Mandatory (compliance)
✅ Team: No deep ML expertise
✅ Budget: Moderate
Decision: Hybrid RAG (Vector Search + BM25)
Why not Fine-tuning:
- Knowledge changes too frequently
- No high-quality training data
- Team lacks ML expertise
Why not Prompt Engineering:
- 3,000 pages >> context window
- Cost is too high if stuffing everything
Why Hybrid instead of Naive RAG:
- Many questions contain exact terms (product IDs, policy names)
- BM25 handles exact match better than vector search
Limits of RAG That Few People Talk About
RAG is not a silver bullet. After many months in production, these are the real limitations:
1. Retrieval Failures
If the retrieval is wrong, the generation will also be wrong. "Garbage in, garbage out."
Retrieval failure cases:
- Overly vague questions: "How does it work?" (no context)
- Serious typos: "rifund policy" (BM25 misses, vector search degrades)
- Domain-specific terminology: the query uses different abbreviations than the docs
- Multi-hop questions: need to combine multiple seemingly unrelated docs
2. Chunking Quality Matters Enormously
Bad chunking = bad retrieval = bad answer. This is the most underestimated issue.
❌ Bad chunk:
"...and according to clause number 3, in this case
the customer can..." ← lacks context, chunk is cut in the middle
✅ Good chunk:
"Refund policy for defective products (Clause 3):
Customers can request a refund within 30 days
from the date of purchase if the product is defective by the manufacturer..."
3. Latency Overhead
RAG adds at least 2 network calls compared to pure prompt engineering:
Prompt Engineering: ~1-2 seconds (1 LLM call)
RAG:
- Embed query: ~100ms
- Vector search: ~50-200ms
- LLM call: ~1-3 seconds
- Total: ~1.5-4 seconds
→ For real-time applications, this is an important trade-off
4. Hallucinations Still Occur
RAG reduces but does not completely eliminate hallucinations:
# LLM can still make up things when:
# 1. Context is not clear enough
# 2. Question relates to many contradicting contexts
# 3. LLM "extrapolates" from context to non-existent info
# Mitigation:
system_prompt = """
Only answer based on the provided documents.
If there is no information, say precisely:
"I didn't find this information in the documents.
Please contact [contact] for support."
DO NOT invent additional info outside the documents.
"""
Conclusion & Next Post
Choosing the right approach is the first step towards success. For most enterprises with large knowledge bases, RAG is the most balanced and efficient choice.
3 Key Takeaways:
- LLMs are "clueless" about your private data unless you provide it.
- RAG is more scalable, flexible, and cheaper than Fine-tuning for knowledge-heavy tasks.
- Retrieval quality is the bottleneck of a RAG system.
👉 Next Post: [Post 03] Architecture Design - Blueprint for an Enterprise RAG System
In the next post, we will draw the full map of a RAG system in production. From Ingestion Pipeline, Vector Database, to Query Service. How do these components "talk" to each other? How to handle access control?
📬 Any questions about RAG vs Fine-tuning? Drop a comment below!
Author: [Your Name]
Series: RAG in Production — The Journey of Building a Real-world AI System
Tags: RAG LLM Fine-tuning Prompt Engineering AI Strategy
Series • Part 2 of 11