Lab: M15: build a fine-tuning dataset, then fine-tune

You'll need: your venv. Part A needs no key and no GPU. Part B (submitting a real fine-tune) needs an OpenAI account/key or a GPU for the local path, optional in class. Time: ~55 min • Work in your breakout pair.

Heads up: a fine-tune is only as good as its dataset: that's the part you build today, and it's the part that matters. Submitting the job is the easy bit. Nothing here can harm your computer.

This lab has two parts: - Part A: build & validate a fine-tuning dataset (everyone, no key). - Part B: submit a fine-tune job + decide when fine-tuning is right.

flowchart LR
  Ex["examples<br/>(input → ideal output)"] --> JSONL["train.jsonl<br/>(validated)"]
  JSONL --> Job["fine-tune job"]
  Job --> Model["your model<br/>(style baked in, no big prompt)"]

Part A: build the dataset (the part that matters)

Step 1: Set up

Put prepare_dataset.py, finetune.py (from solution/) and dataset_starter.py (from starters/) in a folder. Activate your venv.

You should now see: (.venv) and those files.

Step 2: Build & validate a dataset

python prepare_dataset.py

You should now see: Wrote 6 examples to train.jsonl and validated 6 lines. Open train.jsonl, each line is one little conversation ending in the ideal on-brand reply. That JSONL is exactly what a fine-tuning API wants.

Step 3: See what the format teaches

Open prepare_dataset.py. Notice every example shares the same system role and voice, and ends with the assistant's ideal answer. The model learns that pattern.

You should now see / say: "fine-tuning learns from consistent (input → ideal output) examples, so the dataset's consistency is the quality." Inconsistent examples teach inconsistency.

Step 4: Break it on purpose (validation matters)

In prepare_dataset.py, temporarily make one example's reply empty or delete its system line, run again, and watch the validator complain (or add a line that ends with a user turn).

You should now see: a clear ValueError naming the bad line. Validating before you spend money on training is a habit worth keeping. (Undo your change.)

Step 5: Build YOUR dataset

Open dataset_starter.py. Set SYSTEM to your assistant's voice and add 6+ of your own (input → ideal output) examples, an on-brand replier, a strict classifier, a fixed report format. Run it.

You should now see: my_train.jsonl with your examples (and a nudge if you have fewer than 6). You've built a real fine-tuning dataset for a task you care about.

Part B: fine-tune, and decide when to

Step 6: Read the fine-tune workflow

Open finetune.py. Trace it: upload the JSONL (files.create, purpose fine-tune) → start a job (fine_tuning.jobs.create) → wait for succeeded → use job.fine_tuned_model like any model. Note the bottom comment: the local LoRA path for open models.

You should now see / say: the four steps, upload → train → wait → use, and that the fine-tuned model needs no long system prompt (the style is baked in).

Step 7: (Optional, needs an OpenAI key) actually run it

Put your OPENAI_API_KEY in .env, then:

python finetune.py train.jsonl                 # prints a job id
python finetune.py status ftjob-...            # repeat until 'succeeded'

You should now see: a job id, then (after minutes-hours, a few dollars) status: succeeded and a fine_tuned_model id you can call. (No key/GPU in class? That's fine, Part A is the real skill.)

Step 8: The decision: fine-tune, prompt, or RAG?

For each, say which you'd use and why: - a) answer questions about your company's ever-changing handbook, - b) always reply in your brand's exact voice and format, - c) get a slightly better one-off answer to a tricky question.

You should now see / say: (a) RAG (changing facts → retrieve, don't fine-tune), (b) fine-tune (consistent behaviour at scale), (c) a better prompt (one-off → prompt). That ordering, prompt → RAG → fine-tune, is the whole module.

Stuck? Working examples are in ../solution/.

Your win

You built and validated a real fine-tuning dataset, you know the full fine-tune workflow (hosted and local), and you can say when fine-tuning beats prompting or RAG.

Post it to the chat wins board: "Built a 10-example dataset to bake my bot's voice in, and I know to use RAG for facts, fine-tune for style. "

Take-home (optional)

Grow your dataset to 20-30 consistent examples and hold 5 aside as an eval set (M8). If you fine-tune for real, run those 5 through the base model and your fine-tune and compare, that's how you prove the fine-tune actually helped.