One of my recent projects, the 1337-handbook-assistant, uses AI to answer questions about the 42 School/1337 handbook. Rather than just asking ChatGPT (which doesn't have the private handbook context), I built a Retrieval-Augmented Generation (RAG) system using LangChain, FastAPI, and ChromaDB.
What is RAG?
Large Language Models (LLMs) are great at generating text, but their knowledge is frozen at training time and they tend to hallucinate when they don't know the answer. RAG mitigates both problems by retrieving relevant, specific context and handing it to the LLM before it answers the question.
The Architecture
Here is how the pipeline works:
- Ingestion: We parse the PDF handbook, split the text into smaller chunks, convert those chunks into mathematical vectors (embeddings) using OpenAI's embedding model, and store them in a vector database (ChromaDB).
- Retrieval: When a user asks a question via our FastAPI endpoint, we convert the question into an embedding, search the vector database for the top 3 most similar chunks of context, and retrieve the raw text.
- Generation: We send the user's question + the retrieved context to the LLM (gpt-4o-mini) and instruct it: "Answer the user's question using ONLY the provided context."
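To make the retrieval step concrete, here is a minimal sketch of top-k similarity search in plain Python. It is not the project's code: the `embed` function is a toy bag-of-words stand-in for OpenAI's embedding model, the in-memory list stands in for ChromaDB, and the sample chunks are invented placeholder text rather than real handbook content.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    query_vector = embed(question)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vector, embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Invented placeholder chunks, not real handbook text.
    sample_chunks = [
        "The selection period lasts several weeks.",
        "Students review each other's work.",
        "The cafeteria opens at noon.",
        "Projects are validated by peer evaluation and an automated grader.",
    ]
    for chunk in top_k("How are projects evaluated?", sample_chunks, k=3):
        print(chunk)
```

A real embedding model captures meaning rather than word overlap, so it would also match "evaluated" against "review"; the ranking logic around it stays the same.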
The Implementation (FastAPI + LangChain)
LangChain makes tying these pieces together incredibly simple:
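from fastapi import FastAPI
from pydantic import BaseModel
# Legacy import paths from older LangChain releases; newer versions move these
# classes into the langchain_community and langchain_openai packages.
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

app = FastAPI()

# Load the vector store that the ingestion step persisted to disk
embeddings = OpenAIEmbeddings()
db = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)

# Create the QA chain: gpt-4o-mini as the LLM, top 3 chunks from the retriever
llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 3}),
)


class Question(BaseModel):
    query: str


# Plain `def` (not `async def`): qa_chain.run blocks, so FastAPI runs this
# handler in its threadpool instead of stalling the event loop.
@app.post("/ask")
def ask_question(question: Question):
    response = qa_chain.run(question.query)
    return {"answer": response}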
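The "stuff" chain type simply places all retrieved chunks into a single prompt. The sketch below shows that idea in plain Python. The `stuff_prompt` helper and its template are hypothetical: the instruction wording is borrowed from the Generation step described earlier, and this is not LangChain's built-in prompt template.

```python
def stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate every retrieved chunk into one prompt (the 'stuff' idea)."""
    # Blank lines between chunks keep their boundaries visible to the model.
    context = "\n\n".join(chunks)
    return (
        "Answer the user's question using ONLY the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    # Invented placeholder chunks, not real handbook text.
    retrieved = ["Chunk one text.", "Chunk two text.", "Chunk three text."]
    print(stuff_prompt("What does the handbook say about deadlines?", retrieved))
```

Stuffing works well when the retrieved chunks fit comfortably in the model's context window, which is one reason to keep the number of retrieved chunks small.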
Lessons Learned
The hardest part of RAG isn't the LLM; it's the chunking strategy. If you split your documents poorly (e.g., cutting a sentence in half), the retriever returns fragments that lack the information needed to answer, and the LLM either hallucinates or fails to answer at all. A solid understanding of data engineering is crucial for building robust AI systems.
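One way to avoid cutting sentences in half is to split on sentence boundaries and carry a little overlap between chunks. The following is a minimal sketch of that idea, not the splitter used in the project: the function names, the `max_chars` soft limit, and the naive sentence regex are all assumptions made for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_by_sentence(
    text: str, max_chars: int = 200, overlap_sentences: int = 1
) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters.

    max_chars is a soft limit: a single long sentence, or the overlap carried
    into a new chunk, can push a chunk past it. Sentences are never cut.
    """
    chunks: list[str] = []
    current: list[str] = []
    for sentence in split_sentences(text):
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few sentences forward so context spans chunks.
            current = current[-overlap_sentences:] if overlap_sentences > 0 else []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = "First rule. Second rule is longer. Third rule! Fourth?"
    for chunk in chunk_by_sentence(sample, max_chars=30, overlap_sentences=0):
        print(repr(chunk))
```

The regex will mis-split abbreviations such as "e.g.", so a real pipeline would likely use a proper sentence tokenizer or a structure-aware splitter; the point here is only that chunk boundaries should follow the text's own boundaries.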
