Notes: M7, RAG I: give the AI your own knowledge

A model knows only what it read during training: a huge slice of the public internet up to some cutoff date. It has never seen your meeting notes, your company's policy doc, last week's news, or the PDF on your desktop. Ask about them and it does the worst possible thing: it answers confidently anyway, sometimes making facts up (a "hallucination"). This module is the fix, and it's the single most useful pattern in applied AI: Retrieval-Augmented Generation (RAG). You fetch the right passages from your documents and hand them to the model as context, so its answer is grounded in your data.

Why not just "train it on my data"?

The instinct is to teach the model your documents (fine-tuning). For almost everyone, that's the wrong first move: it's slow, costs money and expertise, has to be redone every time a document changes, and still tends to blur facts rather than quote them. RAG is better for knowledge: no training, updates instantly when you change a file, and the model answers from text you can point to. Reach for RAG before fine-tuning: it's faster, cheaper, and more truthful, because every answer can be checked against the retrieved text. (Recall M5: prompt first, then add data via RAG, then add tools via agents; fine-tuning is a last resort.)

The core idea: retrieve, then answer

You can't paste a whole 200-page manual into every prompt: it's too big and mostly irrelevant to any one question. So RAG does something smarter: for each question, it finds just the few most relevant passages and pastes those in. Three steps:

  1. Retrieve: search your document for the chunks most relevant to the question.
  2. Augment: put those chunks into the prompt as context.
  3. Generate: ask the model to answer using only that context.

```mermaid
flowchart LR
  subgraph Once["Set up once"]
    D["your document"] --> C["split into chunks"] --> E["embed each chunk → vectors"] --> V["vector store"]
  end
  subgraph PerQ["Per question"]
    Q["question"] --> EQ["embed the question"] --> S["find nearest chunks in the store"]
    V --> S
    S --> P["augment prompt with chunks"] --> M["model generates the answer"]
  end
```
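
The three steps can be sketched as one small function. This is a minimal sketch under stated assumptions, not the course's example app: `rag_answer`, `keyword_retrieve`, and `fake_model` are hypothetical names, the sample chunks are made up, retrieval is naive word overlap standing in for the embedding search introduced next, and the model is a stub so the flow runs without an API key.

```python
import re
from typing import Callable


def words(text: str) -> set[str]:
    """Lowercase word set, used only by the stand-in retriever below."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def keyword_retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1 stand-in: rank chunks by how many words they share with the question."""
    q = words(question)
    ranked = sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)
    return ranked[:k]


def rag_answer(question: str, chunks: list[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    context = keyword_retrieve(question, chunks, k)            # 1. retrieve
    prompt = (                                                 # 2. augment
        "Answer using only this context.\n\nContext:\n"
        + "\n---\n".join(context)
        + f"\n\nQuestion: {question}"
    )
    return generate(prompt)                                    # 3. generate


def fake_model(prompt: str) -> str:
    """Stub standing in for a real model call; it only echoes the context it was given."""
    context = prompt.split("Context:\n", 1)[1].split("\n\nQuestion:", 1)[0]
    return f"(stub) I would answer from: {context}"


# Made-up sample chunks, for illustration only.
chunks = [
    "Refunds are available within 30 days of purchase.",
    "Our shop is open from 9am to 5pm on weekdays.",
    "Shipping takes three to five business days.",
]
print(rag_answer("When is the shop open?", chunks, fake_model, k=1))
# prints: (stub) I would answer from: Our shop is open from 9am to 5pm on weekdays.
```

Passing `generate` in as a parameter keeps the flow independent of any particular model provider; in a real app you would swap the stub for an actual model call and the keyword retriever for a vector-store query.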

The clever part is the "find the most relevant chunks" step. That's where embeddings come in.

Embeddings: meaning as numbers

An embedding is a list of numbers (a vector) that captures the meaning of a piece of text. The key property: texts with similar meaning get similar vectors, even if they share no words. "How do I get a refund?" and "Can I get my money back?" land close together; "what are your opening hours?" lands far away. A model called an embedding model produces these vectors. (Chroma ships a small one and runs it locally for you, which is why the search step needs no extra API key.)

Because meaning becomes geometry, "find the most relevant chunk" becomes "find the nearest vector", a fast, well-understood computation. This is semantic search: matching on meaning, not keywords. It's why your app finds the "Refunds" passage when you ask about "getting your money back" without ever typing "refund".
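
To see "nearest vector" in action, here is a minimal sketch using cosine similarity. The three-number vectors are hand-made for illustration only; a real embedding model would produce hundreds of numbers per text, and `toy_vectors` and `cosine_similarity` are names chosen for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Hand-made "embeddings" (NOT from a real model): the two refund-like
# sentences point in a similar direction, the opening-hours one does not.
toy_vectors = {
    "How do I get a refund?": [0.9, 0.1, 0.0],
    "Can I get my money back?": [0.8, 0.2, 0.1],
    "What are your opening hours?": [0.0, 0.1, 0.9],
}

query = "Can I get my money back?"
scores = {
    text: cosine_similarity(toy_vectors[query], vec)
    for text, vec in toy_vectors.items()
    if text != query
}
nearest = max(scores, key=scores.get)
print(nearest)  # prints: How do I get a refund?
```

The two sentences share almost no words, yet the query's nearest neighbor is the refund question because their vectors point the same way. That is the whole trick behind semantic search.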

The vector store: a database for meaning

A vector store (we use Chroma) holds your chunks and their embeddings and answers one question very fast: "given this query vector, which stored chunks are nearest?" You add your chunks once; then every question is a query that returns the top-k closest chunks. Chroma handles the embedding and the math; you write a few lines:

```python
collection.add(documents=chunks, ids=[...])          # embeds + stores (set up once)
results = collection.query(query_texts=[question], n_results=3)   # nearest 3 chunks
```

(FAISS is a popular lower-level alternative where you manage embeddings yourself; Chroma bundles it all, which is why we start there.)
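
To make the add-once, query-per-question pattern concrete without installing anything, here is a minimal in-memory sketch. `TinyVectorStore` and `toy_embed` are hypothetical names, not Chroma's API or internals, and the sample documents are made up. `toy_embed` only counts words, so this toy matches on shared words rather than meaning, and it scans every stored chunk, whereas real vector stores typically use an index to avoid that.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: a sparse word-count vector (keyword-based, NOT semantic)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Holds (id, document, vector) rows and returns the top-k most similar."""

    def __init__(self) -> None:
        self._rows: list[tuple[str, str, Counter]] = []

    def add(self, documents: list[str], ids: list[str]) -> None:
        # Embed each document once, at insert time.
        for doc_id, doc in zip(ids, documents):
            self._rows.append((doc_id, doc, toy_embed(doc)))

    def query(self, query_text: str, n_results: int = 3) -> list[tuple[str, str]]:
        # Embed the query, score every row, and keep the best n_results.
        q = toy_embed(query_text)
        ranked = sorted(self._rows, key=lambda row: cosine(q, row[2]), reverse=True)
        return [(doc_id, doc) for doc_id, doc, _ in ranked[:n_results]]


store = TinyVectorStore()
store.add(
    documents=[
        "Refunds are available within 30 days.",
        "We are open 9am to 5pm on weekdays.",
        "Shipping takes three to five days.",
    ],
    ids=["c0", "c1", "c2"],
)
top = store.query("How many days does shipping take?", n_results=2)
print([doc_id for doc_id, _ in top])  # prints: ['c2', 'c0']
```

Swapping `toy_embed` for a real embedding model is what turns this keyword matcher into semantic search; the store's add and query shape stays the same.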

Chunking: how you split the document matters

You don't store the whole document as one blob; you split it into chunks (here, one per paragraph). Why chunk at all? Two reasons: retrieval can return a focused passage instead of the whole document, and you only spend prompt space (and tokens) on the relevant bits. Chunking has trade-offs you'll feel in M8:

- Too big (whole pages) → each chunk mixes many topics, so retrieval is fuzzy and you waste tokens.
- Too small (single sentences) → a chunk may lose the context needed to make sense.
- A common sweet spot is a paragraph or a few hundred words, sometimes with a little overlap so an idea split across a boundary isn't lost. (M8 tunes this.)
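
Both splitting styles can be sketched in a few lines. This is an illustrative sketch, not the course's solution code: `split_paragraphs` and `window_chunks` are hypothetical helper names, and the default `size` and `overlap` values are arbitrary placeholders rather than recommendations.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """One chunk per paragraph: split on blank lines and drop empty pieces."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def window_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size chunks of `size` words; each shares `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


doc = "Refund policy.\n\nOpening hours.\n\n\nShipping times."
print(split_paragraphs(doc))
# prints: ['Refund policy.', 'Opening hours.', 'Shipping times.']

print(window_chunks("a b c d e f g h i j", size=4, overlap=1))
# prints: ['a b c d', 'd e f g', 'g h i j']
```

In the second output, `d` and `g` each appear in two chunks. That repeated word is the overlap: an idea that straddles a boundary still shows up whole in at least one chunk.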

Grounding: answer from the context, or say you don't know

Retrieval gets the right text in front of the model; the prompt makes it behave. The instruction "Answer using ONLY the context below; if it's not there, say 'I don't know based on the document'" does two jobs: it keeps answers grounded in your data, and it gives the model permission to decline instead of inventing. That's the difference between a demo and something you'd trust, and a first taste of the guardrails in M10.
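
One way to assemble such a prompt is sketched below. `build_grounded_prompt` is a hypothetical helper, the exact wording and layout are assumptions you should adapt, and numbering the chunks (`[1]`, `[2]`) is an optional extra added here to make it easier to see which passage an answer came from.

```python
FALLBACK = "I don't know based on the document"


def build_grounded_prompt(question: str, chunks: list[str]) -> str:
    """Combine the grounding instruction, the retrieved chunks, and the question."""
    # Number each chunk so the context is easy to scan and to cite.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer using ONLY the context below. "
        f'If the answer is not there, say "{FALLBACK}".\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_grounded_prompt(
    "Can I get my money back?",
    ["Refunds are available within 30 days.", "Shipping takes three to five days."],
)
print(prompt)
```

Keeping the fallback sentence in one constant means the app can also detect it in the model's reply, for example to show a "not found in the document" message instead of the raw text.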

What RAG is good for (and its limits)

RAG shines whenever the answer lives in text you have: docs Q&A, support over a knowledge base, "chat with this PDF/codebase/wiki". Its quality is only as good as its retrieval: if the right chunk isn't fetched, the model can't use it, and you get a wrong or "I don't know" answer. That's not a flaw so much as the thing to measure and improve, which is exactly M8 (RAG II: make it good). For now, the win is the whole pipeline working end-to-end over your own document.

Go deeper (optional, not needed for today's win)

- **How big are these vectors?** Often a few hundred numbers each (e.g. 384 for the small default model). "Nearest" is usually measured by **cosine similarity**: the cosine of the angle between two vectors, so direction matters and length does not.
- **Persisting the index:** `chromadb.Client()` is in-memory (re-indexes each run, fine for one doc). `chromadb.PersistentClient(path=...)` saves the index to disk so you index once and reuse it.
- **Other file types:** real apps extract text from PDFs/Word/HTML first (libraries like `pypdf`), then chunk the extracted text. We use `.txt` to keep the focus on RAG itself.
- **Metadata:** you can store fields alongside each chunk (source, page, date) and filter on them, which is handy for "search only this folder" or for showing citations.
- **Embedding models / providers:** Chroma embeds locally for free; for higher quality many apps use a hosted embedding model from a provider such as **OpenAI**, **Cohere**, **Google Gemini**, **Voyage**, or **Jina**. Same idea (text → vector), better quality, at a per-token cost and a network call.
- **Other vector stores:** beyond Chroma and **FAISS**, you'll hear **Pinecone** and **Weaviate** (hosted/managed), **Qdrant** and **LanceDB** (open-source, run locally or hosted), and **pgvector** (add vector search to a Postgres database you already have). All answer the same "nearest vectors" question; pick by scale, hosting, and whether you want managed vs self-run.

Check yourself

Lock in today's win: answer each in your head, then reveal.

1. Why doesn't a model already know your documents, and why is RAG better than fine-tuning for this?

Show answer

It only saw its training data (public text up to a cutoff), never your private/recent files. RAG beats fine-tuning for knowledge because it needs no training, updates the moment you change a file, and answers from text you can point to (more truthful). Fine-tuning is slow, costly, and must be redone on every change.

2. What are the three steps of RAG?

Show answer

Retrieve the chunks most relevant to the question, augment the prompt by pasting those chunks in as context, and generate an answer from them. Retrieve → augment → generate.

3. What is an embedding, and what makes vector search "semantic"?

Show answer

An embedding is a vector (list of numbers) that captures a text's meaning, so similar-meaning texts get nearby vectors. Searching by nearest vector therefore matches on meaning, not keywords: "money back" finds the "refund" passage. That's semantic search.

4. What does a vector store (Chroma) do for you?

Show answer

It embeds and stores your chunks, then for each query returns the top-k nearest chunks fast. You call `add(documents=...)` once and `query(query_texts=..., n_results=k)` per question; Chroma does the embedding and the similarity math (locally, no extra key).

5. Your RAG app gives a wrong answer. Where do you look first, and why?

Show answer

At what got retrieved. RAG answers are only as good as retrieval: if the right chunk wasn't fetched, the model never saw the facts. Bad answer → usually bad retrieval (wrong chunks, too few, poor chunking). Improving that is M8.


New words (also in resources/glossary.md): RAG (retrieval-augmented generation), hallucination, embedding, vector, semantic search, vector store, Chroma, chunk/chunking, top-k, retrieval, grounding, cosine similarity (go-deeper).

Source: original, written for this course. RAG concepts reflect standard practice; the Chroma API (Client, get_or_create_collection, add, query) follows Chroma's official documentation. The example app's chunking and retrieve→augment→generate flow were verified to run (chunking on real text, with the vector-store and model calls mocked; see the solution README). No third-party text or figures; diagrams are original.