© 2024 TRUONG PHAM. All rights reserved.


Case Study: Building an Internal RAG AI Assistant & Optimizing Local LLM Inference

Published: 2026
Stack
Next.js · FastAPI · LangChain · vLLM · Qdrant · Voyage AI · Docker · Prometheus

[!TIP] Open Source Reference: The source code is available on GitHub at IvanPham03/text-to-sql-for-legacy-db. It forms the foundation for several components of the system, which I am continuing to expand. Naturally, security-sensitive elements and proprietary configurations have been removed to comply with the project's regulations.

Security Disclaimer: To comply with security principles and protect internal information, all diagrams and images in this article have been redesigned for technical illustration. Detailed parameters, models, and structures are for reference only and do not reflect the original system in actual operation.


1. Executive Summary

This project focuses on building an Internal Text-to-SQL RAG Assistant to automate data querying from internal management systems using natural language. As a solo project completed independently over 10 weeks, I directly handled every role from Data Engineer and AI Engineer to DevOps/SRE. The system implements a flexible Hybrid Routing mechanism: prioritizing local processing on Private GPU infrastructure to ensure absolute data privacy, while automatically routing complex reasoning tasks to top-tier commercial models (such as Claude Opus or GPT-5) via the OpenRouter API. The result is a system with sub-2-second latency, a SQL query accuracy of 85/100 on a real-world Golden Dataset, and fully secure on-premise operational capabilities.


2. Problem & Context

In the context of internal organizational data fragmented across legacy relational database management systems (RDBMS), extracting information requires specialized SQL skills, creating a bottleneck for business departments. While public LLMs (like GPT-5) are powerful, they face two major hurdles:

  1. Sensitive Data: Business records and database structures cannot be sent outside the internal network.
  2. Complex Legacy Schemas: Older databases often lack documentation and use obscure column names (e.g., COL_12 instead of customer_name), causing LLMs to generate incorrect queries without full metadata context.

Real-world Context: The system serves a team of 100-200 analysts and managers querying a "dirty" legacy database system with hundreds of tables and millions of records. The hardware infrastructure includes 96GB VRAM GPUs, allowing for the simultaneous hosting of multiple specialized LLMs and maximized parallel inference in a private environment.


3. System Architecture

The system architecture is built on the philosophy of strictly separating the application layer, AI processing layer, and data layer to ensure maintainability.

Figure: Overall System Architecture

Technology Stack:

  • Frontend: Next.js (App Router), React Server Components. Uses streaming to display results as the first tokens arrive.
  • Backend: Python (FastAPI, LangChain). LangChain manages complex logic (Chains) and flexibly integrates RAG support tools.
  • Inference Engine: vLLM. Utilizes PagedAttention to optimize throughput. With 96GB VRAM, the system hosts the Distil-Qwen-4B-Text2SQL model (specialized for SQL) and auxiliary models like Llama-3 (8B) or Mistral for Intent Detection.
  • Vector Database: Qdrant. Stores metadata for table/column schemas and few-shot SQL samples.
  • Embedding Model: Uses Voyage AI (voyage-lite-02-instruct) to understand the semantics of data fields and map them to user requests.
  • Data Flow: User Request → Intent Detection (Llama-3) → Schema Retrieval (Qdrant) → SQL Generation (Distil-Qwen-4B) → Self-Repair Loop → SQL Execution → Stream Response.
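The data flow above can be sketched as a thin async orchestrator. This is a minimal illustration, not the actual internal API: the function names (`detect_intent`, `retrieve_schema`, etc.) are placeholders for the real calls to Llama-3, Qdrant, Distil-Qwen-4B, and the read-only database.

```python
import asyncio

# Illustrative stubs -- in the real system these would call Llama-3,
# Qdrant, Distil-Qwen-4B, and the read-only database respectively.
async def detect_intent(q): return "data_query"
async def retrieve_schema(q): return "orders(id, COL_12, total)"
async def generate_sql(q, schema): return "SELECT COL_12 FROM orders"
async def execute_sql(sql): return [("Alice",), ("Bob",)]

async def answer(user_query: str):
    """User Request -> Intent Detection -> Schema Retrieval -> SQL Gen -> Execute."""
    intent = await detect_intent(user_query)
    if intent != "data_query":
        return "I can only answer data questions."
    schema = await retrieve_schema(user_query)
    sql = await generate_sql(user_query, schema)
    return await execute_sql(sql)

rows = asyncio.run(answer("list customer names"))
print(rows)  # [('Alice',), ('Bob',)]
```

Each stage awaits the previous one, so the Self-Repair Loop and streaming described later slot in between SQL generation and execution without changing the overall shape.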

Figure: Sequence Flow


4. Technical Challenges

During implementation, I faced three core technical challenges:

  1. Inference Latency vs. Throughput: Running local LLMs initially had high latency, affecting real-time data query workflows.
  2. Schema Linking Accuracy: Mapping natural language questions to exact table/column names in a "dirty" legacy database without clear foreign keys.
  3. Infrastructure Complexity: Managing GPU drivers and CUDA environments within containers for high-intensity inference tasks.

5. Solution & Trade-offs

Schema Indexing & Handling "Dirty" Legacy Data

Unlike standard text RAG, SQL RAG requires absolute structural precision. I implemented a Self-Correcting Schema Indexing strategy:

  • Indexing & Schema Linking: To map obscure columns (e.g., COL_12), I embedded not just the column names but also their semantic descriptions and sample values. When a user asks for "customer names," the system uses Cosine Similarity to find the column with the closest description or similar data patterns.
  • Legacy Database Handling: For undocumented databases, I implemented a Schema Interpretation layer (generated by LLMs based on data samples) with human-in-the-loop verification. The resulting "clean" metadata is stored in Qdrant for Schema Linking.
  • Read-only DB Security: To ensure absolute safety, the AI connects via a Dedicated DB User with SELECT-only privileges restricted to a Read-only Cluster. A separate connection pool isolates AI traffic from core business operations.
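The Schema Linking idea above can be illustrated in a few lines. This is a toy sketch: a bag-of-words stand-in replaces the real Voyage AI embeddings, and the column names and descriptions are hypothetical, but the mechanism (cosine similarity between the user phrase and indexed descriptions plus sample values) is the same.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; the real system uses Voyage AI vectors.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each obscure column is indexed with a semantic description + sample values.
columns = {
    "COL_12": "customer name full name of the customer sample: Alice Bob",
    "COL_07": "order total amount in dollars sample: 19.99 250.00",
}

def link_column(user_phrase: str) -> str:
    scores = {c: cosine(embed(user_phrase), embed(d)) for c, d in columns.items()}
    return max(scores, key=scores.get)

print(link_column("customer names"))  # COL_12
```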

"Edge Cases" from "Dirty" Legacy Databases

Working with undocumented legacy databases provided invaluable lessons on real-world data:

  • Join Logic in App Code: Many legacy databases don't declare Foreign Keys (FK) at the DB level; relationships are buried in backend code. I used LLMs to scan old code to "discover" these hidden relationships and added them to Qdrant as Virtual Relationships.
  • Data Type Mismatch: A column named PRICE might be a VARCHAR containing currency symbols. AI initially used SUM() directly, causing SQL errors. The solution was automatic Type Casting in the prompt context: "Always CAST value columns to DECIMAL before calculation."
  • Column Overloading: A REMARK column might contain critical filtering data (e.g., 'VIP' for priority customers). I implemented Value-Level Indexing—indexing a portion of unique values so the LLM knows to use LIKE %...%.
  • Ambiguous Join Paths: When there are multiple ways to JOIN tables, AI often picks the shortest but incorrect path. I defined Golden Join Paths in metadata and instructed the model to prioritize them.

[!NOTE] The edge case solutions above are derived from specialized PoC testing and research. Due to time and operational cost constraints at scale, these techniques are currently at the "Expert Experience" level and are being incrementally integrated into the main system.
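The edge-case fixes above all reduce to one pattern: curated metadata rendered into the prompt as guardrails. A hedged sketch of that rendering step follows; the metadata records are hypothetical examples, and in the real system they would live in Qdrant payloads rather than a Python dict.

```python
# Hypothetical metadata capturing the edge-case fixes above.
EDGE_CASE_METADATA = {
    "virtual_relationships": [
        "orders.cust_id -> customers.id (FK enforced only in app code)",
    ],
    "casting_rules": [
        "PRICE is VARCHAR with currency symbols: always CAST to DECIMAL before SUM/AVG.",
    ],
    "golden_join_paths": [
        "orders -> order_items -> products (prefer over orders -> products)",
    ],
}

def build_guardrail_context(meta: dict) -> str:
    """Render edge-case metadata as prompt context for the SQL model."""
    lines = ["## Schema Guardrails"]
    for section, rules in meta.items():
        lines.append(f"### {section}")
        lines.extend(f"- {r}" for r in rules)
    return "\n".join(lines)

ctx = build_guardrail_context(EDGE_CASE_METADATA)
print(ctx.splitlines()[0])  # ## Schema Guardrails
```

The rendered block is appended to the schema context, so the model sees the Golden Join Paths and casting rules alongside the table definitions.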

Model Choice: Why Distil-Qwen-4B-Text2SQL?

Instead of a general-purpose model, I used Distil-Qwen-4B (fine-tuned from Qwen2.5-4B) specifically for SQL:

  • Pros: Its small size (4B parameters) allows for extremely fast inference (high tokens/s), yet it outperforms general-purpose 7B or 13B models at SQL generation because it was trained exclusively on SQL datasets, handling JOIN structures and complex aggregation logic particularly well.
  • Cons: Poor performance in general conversation or summarization. I solved this with a Multi-Model architecture: using Llama-3 (8B) for summarization and Qwen-4B solely for SQL generation.

Inference Optimization & Cost Analysis (Local vs. API)

With 96GB VRAM, hosting the entire pipeline locally brings monthly operational costs nearly to zero:

  • Performance: Generation speeds exceed 80 tokens/sec, meeting real-time query needs.
  • Token Analysis: An average query with schema context consumes about 2,000 tokens. On public APIs (like GPT-5), this would be prohibitively expensive at scale. Running on private 96GB GPUs significantly optimizes operational costs while ensuring data privacy.
  • Hybrid Routing: Only when local models fail (identified via the Self-Repair Loop) or encounter extreme reasoning cases does the system automatically switch to top-tier commercial models (like Claude Opus or GPT-5) via OpenRouter. This approach ensures total accuracy while controlling sensitive data flow.

Hybrid Search & Reranking Strategy

I implemented a two-stage retrieval logic (Retrieve & Re-rank).

Figure: Retrieval Scoring Logic

  • Solution: Uses Qdrant for hybrid search (Vector + Keyword), followed by a small Cross-Encoder model to re-score the Top-10 results before they enter the prompt.
  • Trade-off: Adds about 200ms to total processing time but ensures the AI isn't distracted by irrelevant "noise."
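The two-stage logic can be sketched as follows. The vector scores, documents, and blending weight `alpha` are made-up placeholders, and a trivial phrase-overlap check stands in for the real Cross-Encoder; only the shape of the pipeline (hybrid retrieve, then re-rank the top candidates) mirrors the system.

```python
# Stage 1: hybrid search blends vector and keyword scores.
def hybrid_retrieve(query, docs, vec_scores, top_k=10, alpha=0.7):
    def keyword_score(doc):
        terms = query.lower().split()
        return sum(t in doc.lower() for t in terms) / len(terms)
    scored = [(alpha * vec_scores[d] + (1 - alpha) * keyword_score(d), d)
              for d in docs]
    return [d for _, d in sorted(scored, reverse=True)[:top_k]]

# Stage 2: a stand-in for the Cross-Encoder re-scores the candidates.
def cross_encoder_rerank(query, candidates):
    return sorted(candidates, key=lambda d: query.lower() in d.lower(), reverse=True)

docs = ["orders table: order totals", "customers table: customer names", "logs table"]
vec = {docs[0]: 0.9, docs[1]: 0.6, docs[2]: 0.1}
top = cross_encoder_rerank("customer names", hybrid_retrieve("customer names", docs, vec))
print(top[0])  # customers table: customer names
```

Note how the re-ranker overrides the raw vector score: the orders table had the highest vector similarity, but the customers table wins after re-scoring.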

Why Qdrant over ChromaDB or FAISS?

During the PoC, I tested ChromaDB and FAISS but ultimately chose Qdrant for three strategic reasons:

  1. Advanced Filtering: Text-to-SQL RAG requires heavy metadata filtering (table_name, schema_owner). Qdrant supports Scalar Filtering during the ANN search, which is more precise than ChromaDB's offering at that time.
  2. Production-Ready (Rust-based): Unlike early Python-based versions of ChromaDB, Qdrant's Rust core provides stable performance, low RAM usage, and better load handling in on-prem environments.
  3. Payload Management: Qdrant's management of payloads (metadata attached to vectors) is highly intuitive for storing complex schema metadata.
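Qdrant expresses these payload filters directly inside the search request. A hedged example of the REST search body follows; the `table_name` and `schema_owner` keys mirror the payload fields mentioned above, and the vector values are placeholders.

```json
{
  "vector": [0.12, 0.34, 0.56],
  "limit": 10,
  "filter": {
    "must": [
      { "key": "schema_owner", "match": { "value": "SALES" } },
      { "key": "table_name", "match": { "value": "ORDERS" } }
    ]
  }
}
```

The `must` conditions are applied during the ANN search itself, which is what makes the filtering precise rather than a post-hoc cut of the top-k list.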

6. Code / Implementation Highlight

Below is a simplified snippet of the Self-Repair Loop and the FastAPI endpoint that drives it. The handlers are async end-to-end, so long-running inference and database calls never block the server's event loop.

# Self-Repair Loop: Automatic SQL Error Correction
async def execute_with_retry(user_query, schema, max_retries=3):
    current_schema = schema
    for attempt in range(max_retries):
        sql = await generate_sql(user_query, current_schema)
        try:
            return await db.execute(sql)
        except Exception as e:
            # Feed the DB error back to the LLM for correction on the next pass
            error_msg = f"SQL Error: {e}. Please fix the query."
            current_schema += f"\nPrevious Error: {error_msg}"
    raise RuntimeError("Max retries reached")

# API Endpoint
@app.post("/query-data")
async def query_data(request: QueryRequest):
    # 1. Schema Linking (Mapping obscure columns)
    schema_metadata = await qdrant_store.search_schema(request.query)
    
    # 2. SQL Generation & Self-Repair
    try:
        data_results = await execute_with_retry(request.query, schema_metadata)
        return {"data": data_results}
    except Exception as e:
        return {"error": "Query failed", "details": str(e)}

7. The "Failure Story" - Lessons from Mistakes

A real-world AI project is never a smooth, straight line. To reach 85% accuracy, I went through several painful trial-and-error moments:

  • The 13B Trap & Insights: Initially, I tried running large models (13B+) hoping "bigger is better." The result was extremely slow speeds (TTFT > 10s). However, switching to a specialized 4B SQL model revealed another trap: while small models write SQL syntax fast, they lack reasoning capabilities. For "dirty" legacy databases requiring an understanding of obscure column descriptions, larger models still had the edge in "guessing" intent. This led to my Multi-model architecture: using a large model for reasoning/linking and a fine-tuned small model for the final SQL output.
  • Embedding Dimension Error: I once updated the embedding model from OpenAI to a local one but forgot to update the dimension mapping in Qdrant. The system didn't crash immediately but returned completely random retrieval results (noise). Lesson: Always have strict schema validation for vector databases and unit tests for retrieval.
  • "Vector Drift": When the legacy database structure changed (renamed columns, deleted tables) but the vector database wasn't re-indexed, the system started "hallucinating" non-existent columns. I had to build an automated pipeline to trigger metadata re-syncs on schema changes.
  • Infinite Loop in Self-Repair: In the first version, the retry logic had a bug where the LLM would replace one syntax error with another logic error, creating an infinite loop that consumed thousands of tokens in seconds. I applied "Zero Temperature" for retries and a hard max_retries=3 limit to break the cycle.
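The infinite-loop fix above can be sketched as a loop breaker that combines the hard retry cap with detection of repeated attempts. The hashing-based repeat check is my illustration of the idea, not necessarily the exact production logic.

```python
import hashlib

def repair_loop(generate, execute, max_retries=3):
    """Break the Self-Repair cycle on a hard retry cap OR a repeated attempt."""
    seen = set()
    error = None
    for _ in range(max_retries):
        sql = generate(error)  # retries run with temperature=0 in the real system
        digest = hashlib.sha256(sql.encode()).hexdigest()
        if digest in seen:  # model regenerated an identical failing query
            break
        seen.add(digest)
        try:
            return execute(sql)
        except Exception as e:
            error = str(e)
    raise RuntimeError("Self-Repair Loop exhausted")

# Simulate a model stuck regenerating the same broken query.
calls = []
def gen(err): calls.append(err); return "SELECT * FROM missing"
def run(sql): raise ValueError("no such table: missing")

try:
    repair_loop(gen, run)
except RuntimeError:
    print(len(calls))  # 2: the loop stopped at the first repeated attempt
```

Without the repeat check, a deterministic model that keeps producing the same broken SQL would burn through every retry (and its tokens) before failing.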

8. AI-Driven Development Workflow

Instead of just using AI as a coding assistant, I integrated Cursor AI and Antigravity deeply into the development workflow as "partner engineers":

  • Schema & Metadata Engineering: Used Antigravity to analyze complex legacy table structures, automatically generate metadata descriptions for hundreds of columns, and write scripts for Qdrant data preparation—reducing context preparation time by 40%.
  • Prompt Engineering Lab: Used Cursor to iterate on specialized Text-to-SQL System Prompt versions (like CoT or Few-shot) and simulate "tricky" queries to refine the Self-Repair Loop logic.
  • Mindset: The most important skill here is knowing how to "prompt" AI to generate code that fits the system architecture (such as Async Streams and Multi-model orchestrators), rather than just copy-pasting disjointed snippets.

9. Results & Impact

The project brought tangible changes to internal operational workflows:

  • Performance: TTFT (Time To First Token) dropped from 4.5 seconds to 1.2 seconds.
  • Accuracy: Reached 85/100. This was validated by running 100 sample questions from a Golden Dataset and checking the returned data results (Execution Accuracy) rather than just SQL syntax. If the AI generated valid SQL that returned incorrect data, it scored 0.
  • Monitoring & CI/CD: Integrated Prometheus/Grafana for real-time GPU utilization tracking. The system is deployed via an automated CI/CD pipeline, ensuring smooth and secure updates for bug fixes and new models.
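The Execution Accuracy protocol described above (compare returned data, not SQL syntax) can be sketched with a sandbox database. SQLite stands in here for the real read-only cluster; the table and queries are invented for illustration.

```python
import sqlite3

def execution_accuracy(generated_sql: str, golden_sql: str, conn) -> bool:
    """Score 1 only if the *data* matches the golden query, order-insensitive."""
    got = sorted(conn.execute(generated_sql).fetchall())
    want = sorted(conn.execute(golden_sql).fetchall())
    return got == want

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 20.0)])

# Valid SQL that returns the wrong data scores 0, per the Golden Dataset rules.
print(execution_accuracy("SELECT SUM(total) FROM orders",
                         "SELECT SUM(total) FROM orders", conn))  # True
print(execution_accuracy("SELECT MAX(total) FROM orders",
                         "SELECT SUM(total) FROM orders", conn))  # False
```

Sorting both result sets makes the comparison robust to legitimate ordering differences between semantically equivalent queries.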

Figure: Deployment Infrastructure & Monitoring


10. Future Work & Scalability

  • Docker Compose (Current Deployment): Currently, the system is stably deployed on bare-metal via Docker Compose, optimizing direct GPU driver access while ensuring encapsulation.
  • Scalability with Kubernetes: I have successfully designed and tested a Proof of Concept (PoC) on Kubernetes for centralized GPU orchestration, ready for future scaling as user numbers grow.
  • Automated Eval: Integrating RAGAS to automate quality evaluation instead of manual human-eval.

Figure: Kubernetes PoC Architecture


11. Appendix: Top Text-to-SQL Strategies

I tested over 50 different techniques from the latest research, but here are the 7 most valuable strategies I refined and implemented for the legacy database system:

  1. Schema Linking (Critical): Combining Cosine Similarity on column descriptions with Fuzzy Matching on sample data to precisely map user intent to obscure schemas.
  2. Few-shot RAG: Retrieving similar (Question - SQL) pairs from a "Golden" library verified by DBAs to guide the LLM.
  3. Self-Repair Loop: Automatically catching DB errors and feeding them back to the LLM to fix the SQL command (up to 3 retries).
  4. CoT (Chain-of-thought): Forcing the model to analyze relationship logic between tables before writing the first line of SELECT.
  5. Schema Pruning (Dynamic): Only including truly necessary tables and columns in the context based on preliminary intent analysis, increasing model "Attention."
  6. Execution Sandbox: Running SQL in an isolated environment, checking data types, and record limits (LIMIT) before returning results to the user.
  7. Hybrid Routing: Automatically directing complex tasks to more powerful models (Claude Opus) based on the complexity of the involved schemas.
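Strategy 6 (Execution Sandbox) is concrete enough to sketch. This is a minimal guard under assumptions of my own: a regex gate for single SELECT statements plus an injected LIMIT, with SQLite standing in for the read-only cluster; the real system would layer this on top of the SELECT-only DB user described earlier.

```python
import re
import sqlite3

def sandbox_execute(sql: str, conn, max_rows: int = 100):
    """Allow only single SELECT statements and enforce a row cap (strategy 6)."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?is)^\s*select\b", stripped) or ";" in stripped:
        raise PermissionError("Only single SELECT statements are allowed")
    if not re.search(r"(?is)\blimit\s+\d+\s*$", stripped):
        stripped += f" LIMIT {max_rows}"  # cap unbounded result sets
    return conn.execute(stripped).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(500)])

print(len(sandbox_execute("SELECT x FROM t", conn)))  # 100 (LIMIT injected)
try:
    sandbox_execute("DROP TABLE t", conn)
except PermissionError:
    print("blocked")
```

Defense in depth matters here: even if the regex gate were bypassed, the SELECT-only database user would still reject any mutating statement.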

12. Reflection

The most important lesson: Text-to-SQL RAG is not a problem of choosing the most powerful AI model, but a problem of Structured Context Quality. When Schema Metadata is clean and transparent, even small models (7B, 8B parameters) are capable of writing complex SQL accurately.

If I were to build it again: I would invest in an Evaluation Pipeline (SQL Execution Accuracy) from day one—automatically comparing AI-generated SQL results with "Golden Queries" in a sandbox environment.

This project has been a valuable opportunity for me to directly execute the entire AI product lifecycle (End-to-end). Taking on the role of the sole engineer—from requirement analysis to infrastructure operation—has helped me deeply understand that the combination of product thinking and infrastructure engineering is the key to successfully deploying AI in a complex enterprise environment.

Through this, I have gained a comprehensive overview of every link in a real-world AI product. My experience from architectural design to direct infrastructure management reinforces my belief that to successfully deploy AI in complex environments, an engineer must balance both technical challenges and problem-solving mindsets for the business.
