Lab: M15: build a fine-tuning dataset, then fine-tune
You'll need: your venv. Part A needs no key and no GPU. Part B (submitting a real fine-tune) needs an OpenAI account/key or a GPU for the local path, optional in class. Time: ~55 min • Work in your breakout pair.
Heads up: a fine-tune is only as good as its dataset: that's the part you build today, and it's the part that matters. Submitting the job is the easy bit. Nothing here can harm your computer.
This lab has two parts: - Part A: build & validate a fine-tuning dataset (everyone, no key). - Part B: submit a fine-tune job + decide when fine-tuning is right.
flowchart LR
Ex["examples<br/>(input → ideal output)"] --> JSONL["train.jsonl<br/>(validated)"]
JSONL --> Job["fine-tune job"]
Job --> Model["your model<br/>(style baked in, no big prompt)"]
Part A: build the dataset (the part that matters)
Step 1: Set up
Put prepare_dataset.py, finetune.py (from solution/) and dataset_starter.py
(from starters/) in a folder. Activate your venv.
You should now see: (.venv) and those files.
Step 2: Build & validate a dataset
python prepare_dataset.py
Wrote 6 examples to train.jsonl and validated 6 lines. Open
train.jsonl, each line is one little conversation ending in the ideal on-brand reply. That JSONL
is exactly what a fine-tuning API wants.
Step 3: See what the format teaches
Open prepare_dataset.py. Notice every example shares the same system role and voice, and ends
with the assistant's ideal answer. The model learns that pattern.
You should now see / say: "fine-tuning learns from consistent (input → ideal output) examples, so the dataset's consistency is the quality." Inconsistent examples teach inconsistency.
Step 4: Break it on purpose (validation matters)
In prepare_dataset.py, temporarily make one example's reply empty or delete its system line, run
again, and watch the validator complain (or add a line that ends with a user turn).
You should now see: a clear ValueError naming the bad line. Validating before you spend money
on training is a habit worth keeping. (Undo your change.)
Step 5: Build YOUR dataset
Open dataset_starter.py. Set SYSTEM to your assistant's voice and add 6+ of your own
(input → ideal output) examples, an on-brand replier, a strict classifier, a fixed report format.
Run it.
You should now see: my_train.jsonl with your examples (and a nudge if you have fewer than 6).
You've built a real fine-tuning dataset for a task you care about.
Part B: fine-tune, and decide when to
Step 6: Read the fine-tune workflow
Open finetune.py. Trace it: upload the JSONL (files.create, purpose fine-tune) → start a
job (fine_tuning.jobs.create) → wait for succeeded → use job.fine_tuned_model like any
model. Note the bottom comment: the local LoRA path for open models.
You should now see / say: the four steps, upload → train → wait → use, and that the fine-tuned model needs no long system prompt (the style is baked in).
Step 7: (Optional, needs an OpenAI key) actually run it
Put your OPENAI_API_KEY in .env, then:
python finetune.py train.jsonl # prints a job id
python finetune.py status ftjob-... # repeat until 'succeeded'
status: succeeded and
a fine_tuned_model id you can call. (No key/GPU in class? That's fine, Part A is the real skill.)
Step 8: The decision: fine-tune, prompt, or RAG?
For each, say which you'd use and why: - a) answer questions about your company's ever-changing handbook, - b) always reply in your brand's exact voice and format, - c) get a slightly better one-off answer to a tricky question.
You should now see / say: (a) RAG (changing facts → retrieve, don't fine-tune), (b) fine-tune (consistent behaviour at scale), (c) a better prompt (one-off → prompt). That ordering, prompt → RAG → fine-tune, is the whole module.
Stuck? Working examples are in
../solution/.
Your win
You built and validated a real fine-tuning dataset, you know the full fine-tune workflow (hosted and local), and you can say when fine-tuning beats prompting or RAG.
Post it to the chat wins board: "Built a 10-example dataset to bake my bot's voice in, and I know to use RAG for facts, fine-tune for style. "
Take-home (optional)
Grow your dataset to 20-30 consistent examples and hold 5 aside as an eval set (M8). If you fine-tune for real, run those 5 through the base model and your fine-tune and compare, that's how you prove the fine-tune actually helped.