Lab M30: turn production feedback into evals and training data

You'll need: your venv. The whole lab runs offline, free, no key (logging and curation are plain Python over JSONL). Time: about 40 minutes. Work in your breakout pair.

Heads up: this connects evals (M20/M26), fine-tuning (M15), and privacy (M14). The data is synthetic. Nothing here can harm your computer. The one rule we never break: redact PII before storing anything.

This lab has two parts:

- Part A: log interactions with feedback, and redact PII on the way in.
- Part B: curate the logs into eval cases and fine-tuning examples.

flowchart LR
  USE["users + feedback"] --> LOG["log (PII redacted)"]
  LOG --> CUR["curate"]
  CUR --> EVAL["eval cases (M26)"]
  CUR --> FT["fine-tune data (M15)"]
  EVAL --> SHIP["improved agent"]
  FT --> SHIP
  SHIP --> USE

Part A: capture feedback (safely)

Step 1: Set up

Copy the solution/ files into a folder. Activate your venv. No key, no installs.

python -c "print('ready')"

You should now see: ready.

Step 2: Run the flywheel demo

python demo.py

You should now see six interactions logged, with PII redacted at write time:

==== 1. LOG interactions (PII redacted on the way in) ====
  logged 6 interactions
  PII check, interaction 6 stored as: Email me at [email] about order [phone].

You should now see: the email and phone number were replaced with [email] and [phone] BEFORE storage. Open feedback_log.py and read redact_pii and log_interaction. Feedback data is data you keep, so privacy (M14) comes first.
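To make "redact on write" concrete, here is a minimal sketch of the idea. It is not the code in feedback_log.py: the function names, the two regexes, and the JSONL field layout are all assumptions for illustration, and the patterns are deliberately simple.

```python
import json
import re
import tempfile
from pathlib import Path

# Assumed patterns for illustration; the lab's redact_pii may use different ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii_sketch(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)


def log_interaction_sketch(path, question, answer, feedback, correction=None):
    """Redact every free-text field first, then append one JSON line."""
    record = {
        "question": redact_pii_sketch(question),
        "answer": redact_pii_sketch(answer),
        "feedback": feedback,
        "correction": redact_pii_sketch(correction) if correction else None,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record


# Write to a throwaway file so the sketch does not touch the lab's own logs.
log_path = Path(tempfile.mkdtemp()) / "interactions.jsonl"
stored = log_interaction_sketch(
    log_path,
    "Email me at jo@example.com or call 555-123-4567.",
    "Sure, will do.",
    "up",
)
print(stored["question"])
```

The point of the design is that the raw text never reaches the file: redaction happens inside the only function that writes, so no caller can forget it.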

Step 3: Read the three signals

Open demo.py and look at the INTERACTIONS: thumbs up (#1, #2, #5, #6), thumbs down + a correction (#3 refunds), and thumbs down with no correction (#4 reset password).

You should now see: these mean different things. Up = the answer was good. Down + correction = the answer was wrong, and here is the right one. Down with no correction = the answer was bad, but we do not know what the right answer is. Treating them the same would poison your data.
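The three signals can be told apart with a few lines of Python. This is a sketch under assumptions: the record keys follow the shape used in Step 7, the value 'up' for a thumbs up is assumed, the label names are made up here, and the "(agent answer)" strings are placeholders rather than the demo's real answers.

```python
from collections import Counter


def classify_signal(record: dict) -> str:
    """Map one feedback record to a triage label (label names are hypothetical)."""
    feedback = record.get("feedback")
    correction = (record.get("correction") or "").strip()
    if feedback == "up":
        return "good"            # answer can be trusted as-is
    if feedback == "down" and correction:
        return "corrected"       # wrong answer, but the right one is known
    if feedback == "down":
        return "needs_review"    # wrong answer, right one unknown
    return "unlabeled"           # no feedback at all


sample = [
    {"question": "What are your hours?", "answer": "(agent answer)", "feedback": "up"},
    {"question": "Do you offer refunds?", "answer": "(agent answer)", "feedback": "down",
     "correction": "Yes, we offer refunds within 30 days of purchase."},
    {"question": "How do I reset my password?", "answer": "(agent answer)", "feedback": "down"},
]

counts = Counter(classify_signal(r) for r in sample)
print(dict(counts))
```

Note that a correction containing only whitespace is treated as missing, so it lands in needs_review instead of becoming a label.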

Part B: curate into datasets

Step 4: Curate eval cases (M26)

In the demo output:

==== 2. CURATE into EVAL cases (M26) ====
  [golden] q='What are your hours?' ...
  [regression] q='Do you offer refunds?' expect='Yes, we offer refunds within 30 days ...'
  4 eval cases; 1 down-voted need human review: ['How do I reset my password?']

Open curate.py and read to_eval_cases and needs_review.

You should now see: up-votes became golden cases (must keep working), the down-vote-with-fix became a regression case (the agent currently fails it), the duplicate "hours" was deduped, and the down-vote that could not be labeled was routed to a human. These feed straight into the M26 CI gate.
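If you want to see the curation logic in isolation, the sketch below shows one way it could work. It is not the code in curate.py: the function names carry a _sketch suffix, the output keys (kind, question, expected) are assumptions, and the "(agent answer)" strings are placeholders.

```python
def to_eval_cases_sketch(records):
    """Build golden/regression cases, deduped by (question, expected)."""
    cases, seen = [], set()
    for r in records:
        if r.get("feedback") == "up":
            kind, expected = "golden", r["answer"]
        elif r.get("feedback") == "down" and r.get("correction"):
            kind, expected = "regression", r["correction"]
        else:
            continue  # cannot be labeled automatically
        key = (r["question"], expected)
        if key in seen:
            continue
        seen.add(key)
        cases.append({"kind": kind, "question": r["question"], "expected": expected})
    return cases


def needs_review_sketch(records):
    """Questions that were down-voted without a correction."""
    return [
        r["question"]
        for r in records
        if r.get("feedback") == "down" and not r.get("correction")
    ]


records = [
    {"question": "What are your hours?", "answer": "(agent answer)", "feedback": "up"},
    {"question": "What are your hours?", "answer": "(agent answer)", "feedback": "up"},
    {"question": "Do you offer refunds?", "answer": "(agent answer)", "feedback": "down",
     "correction": "Yes, we offer refunds within 30 days of purchase."},
    {"question": "How do I reset my password?", "answer": "(agent answer)", "feedback": "down"},
]

cases = to_eval_cases_sketch(records)
review = needs_review_sketch(records)
for case in cases:
    print(f"[{case['kind']}] q={case['question']!r}")
print(f"{len(cases)} eval cases; {len(review)} need human review: {review}")
```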

Step 5: Curate fine-tuning examples (M15)

In the demo output:

==== 3. CURATE into FINE-TUNING examples (M15) ====
  4 training examples (deduped by question):
    [{"role": "user", "content": "Do you offer refunds?"}, {"role": "assistant", "content": "Yes, we offer refunds within 30 days of purchase."}]

Read to_finetune_examples in curate.py.

You should now see: chat-format {"messages": [...]} examples (exactly M15's dataset shape). Crucially, the refunds example trains on the corrected answer, not the bad one the agent originally gave. The same down-vote fed both the eval gate AND the training fix.
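A minimal sketch of how records could be turned into chat-format examples follows. It is an illustration, not the code in curate.py: the function name is hypothetical, the "(bad agent answer)" and "(agent answer)" strings are placeholders, and printing one JSON object per line is an assumed output style.

```python
import json


def to_finetune_examples_sketch(records):
    """Build {"messages": [...]} examples, deduped by question."""
    examples, seen = [], set()
    for r in records:
        if r.get("feedback") == "up":
            target = r["answer"]
        elif r.get("feedback") == "down" and r.get("correction"):
            target = r["correction"]  # train on the fix, never the bad answer
        else:
            continue  # no trustworthy target, so skip it
        if r["question"] in seen:
            continue
        seen.add(r["question"])
        examples.append({
            "messages": [
                {"role": "user", "content": r["question"]},
                {"role": "assistant", "content": target},
            ]
        })
    return examples


records = [
    {"question": "Do you offer refunds?", "answer": "(bad agent answer)", "feedback": "down",
     "correction": "Yes, we offer refunds within 30 days of purchase."},
    {"question": "How do I reset my password?", "answer": "(agent answer)", "feedback": "down"},
]

examples = to_finetune_examples_sketch(records)
for ex in examples:
    print(json.dumps(ex))  # one JSON object per line, JSONL style
```

The uncorrected down-vote produces no example at all, which is the behavior you want: a missing example costs little, a wrong one teaches the model the mistake.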

Step 6: See the dual use (the flywheel)

Find the refunds interaction (#3) in the demo. It produced a regression eval case (so CI catches the bug) and a corrected training example (so a fine-tune fixes it).

You should now see the loop: one piece of real feedback both protects against the bug and teaches the fix. Multiply that over thousands of interactions and the agent improves with use.

Step 7: Add a record yourself

python -c "import feedback_log as fb, curate; r=[{'question':'Do you ship internationally?','answer':'No.','feedback':'down','correction':'Yes, we ship to 40 countries.'}]; print(curate.to_eval_cases(r)); print(curate.to_finetune_examples(r))"

You should now see: your new down-vote-with-correction become one regression eval case and one chat-format training example with the corrected answer. You just fed the flywheel.

Step 8: Show it

Post the refunds example from the demo: the regression eval case next to the corrected training example, and one sentence on the privacy rule you would enforce before storing real user data.

If you get stuck

ModuleNotFoundError -> run from inside the folder with the solution .py files.
PII not redacted -> redaction runs in log_interaction (on write); redact_pii only covers emails and phones here, extend it for your domain.
A down-vote did not become an eval case -> it had no correction, so it cannot be auto-labeled; check needs_review. That is intentional.
Duplicates in the dataset -> to_eval_cases dedupes by (question, expected) and to_finetune_examples by question; check that the records you expect to be kept actually differ on those fields.
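Extending redaction for your domain can be as simple as keeping the patterns in one ordered table. The sketch below is an assumption about how you might structure it, not how feedback_log.py does it. The ORD-123456 order ID format is invented for the example, and the regexes are intentionally simple.

```python
import re

# Ordered (placeholder, pattern) pairs. More specific patterns go first so a
# broad pattern such as the phone one cannot swallow part of them.
PATTERNS = [
    ("[email]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[order_id]", re.compile(r"\bORD-\d{6}\b")),  # hypothetical domain-specific ID
    ("[phone]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact_sketch(text: str) -> str:
    """Apply every pattern in order and return the redacted text."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


redacted = redact_sketch("Order ORD-123456 for sam@example.org, phone +1 (555) 010-9999")
print(redacted)
```

Adding a new identifier type then means adding one line to the table and one test sentence, rather than editing the function body.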

Check yourself

Why is a thumbs-down WITH a correction the most valuable record?

Because it gives you both halves: proof the agent was wrong (a regression eval case to guard against it) and the right answer (a corrected training example to fix it). One record improves both evals and the model.

Why can't you auto-label a thumbs-down with no correction?

You know the answer was bad but not what "good" looks like. Guessing the expected answer would put wrong labels into your data and poison both evals and training. Route it to a human instead.

What must happen before any interaction is stored?

PII redaction (M14): strip emails, phone numbers, and other identifiers at write time, minimize what you keep, get consent, and never store secrets. Feedback data is data you are now responsible for.

Why is curation (not just collection) the real work?

Raw logs are noisy and biased. You must dedupe, filter out low-quality records, balance common vs rare cases, and review ambiguous signals. Garbage in, garbage out, especially for fine-tuning, where a sloppy dataset makes a worse model.
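One simple way to balance common vs rare cases is to cap how many examples any single group may contribute. The sketch below illustrates that idea only. The helper name, the choice to group by question text, and the cap of 2 are all assumptions, and the lab's curate.py does not necessarily do this.

```python
from collections import defaultdict


def cap_per_group(items, key, max_per_group):
    """Keep at most max_per_group items per group, preserving input order."""
    counts = defaultdict(int)
    kept = []
    for item in items:
        group = key(item)
        if counts[group] < max_per_group:
            counts[group] += 1
            kept.append(item)
    return kept


# Five near-identical "hours" records would otherwise drown out the rare one.
raw = [{"question": "What are your hours?"}] * 5 + [
    {"question": "Do you ship internationally?"}
]

balanced = cap_per_group(raw, key=lambda r: r["question"], max_per_group=2)
print(len(raw), "->", len(balanced))
```

In practice the grouping key would more likely be a topic or intent label than the raw question text, but the capping logic stays the same.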