Lab M30: turn production feedback into evals and training data
You'll need: your venv. The whole lab runs offline, free, no key (logging and curation are plain Python over JSONL). Time: about 40 minutes. Work in your breakout pair.
Heads up: this connects evals (M20/M26), fine-tuning (M15), and privacy (M14). The data is synthetic. Nothing here can harm your computer. The one rule we never break: redact PII before storing anything.
This lab has two parts:
- Part A: log interactions with feedback, and redact PII on the way in.
- Part B: curate the logs into eval cases and fine-tuning examples.
flowchart LR
USE["users + feedback"] --> LOG["log (PII redacted)"]
LOG --> CUR["curate"]
CUR --> EVAL["eval cases (M26)"]
CUR --> FT["fine-tune data (M15)"]
EVAL --> SHIP["improved agent"]
FT --> SHIP
SHIP --> USE
Part A: capture feedback (safely)
Step 1: Set up
Copy the solution/ files into a folder. Activate your venv. No key, no installs.
python -c "print('ready')"
You should now see: ready.
Step 2: Run the flywheel demo
python demo.py
==== 1. LOG interactions (PII redacted on the way in) ====
logged 6 interactions
PII check, interaction 6 stored as: Email me at [email] about order [phone].
You should now see: the email address and phone number were replaced with [email] and [phone] BEFORE storage. Open feedback_log.py and read redact_pii and log_interaction. Feedback data is data you keep, so privacy (M14) comes first.
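If you want a feel for redact-on-write before opening the file, here is a minimal sketch. It is not the lab's feedback_log.py: the two regexes, the make_record helper, and the log_interaction signature are assumptions for illustration. Only the [email] and [phone] placeholders and the record field names come from the lab.

```python
import json
import re

# Assumed patterns for illustration; the lab's redact_pii may use different ones.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[email]", text)
    return PHONE_RE.sub("[phone]", text)


def make_record(question, answer, feedback, correction=None):
    """Build the record to store; every free-text field is redacted first."""
    return {
        "question": redact_pii(question),
        "answer": redact_pii(answer),
        "feedback": feedback,
        "correction": redact_pii(correction) if correction else None,
    }


def log_interaction(path, question, answer, feedback, correction=None):
    """Append one redacted record to a JSONL file and return it."""
    record = make_record(question, answer, feedback, correction)
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

The design point is that redaction happens inside the write path, so raw text never reaches disk. A loose phone pattern like this one also catches other long digit runs (an order number, for example), which for stored feedback is usually the safer failure.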
Step 3: Read the three signals
Open demo.py and look at the INTERACTIONS: thumbs up (#1, #2, #5, #6), thumbs down + a
correction (#3 refunds), and thumbs down with no correction (#4 reset password).
You should now see: these mean different things. Up = good. Down + correction = wrong, and here is the right answer. Down with no fix = bad, but we do not know the right answer. Treating them the same would poison your data.
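The three signals can be sketched as a small routing function. The classify name and its return labels are hypothetical; the record keys (feedback, correction) match the ones used in Step 7.

```python
def classify(record):
    """Route one feedback record by what it can be trusted to tell us."""
    feedback = record.get("feedback")
    correction = (record.get("correction") or "").strip()
    if feedback == "up":
        return "golden"        # the answer was good: keep it working
    if feedback == "down" and correction:
        return "regression"    # wrong, and we know the right answer
    if feedback == "down":
        return "needs_review"  # wrong, but no label: a human must decide
    return "ignore"            # no usable signal


print(classify({"feedback": "down", "correction": "Yes, we ship to 40 countries."}))
```

Keeping the unlabeled down-votes in their own bucket is what stops a bad answer from being stored as if it were a correct one.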
Part B: curate into datasets
Step 4: Curate eval cases (M26)
In the demo output:
==== 2. CURATE into EVAL cases (M26) ====
[golden] q='What are your hours?' ...
[regression] q='Do you offer refunds?' expect='Yes, we offer refunds within 30 days ...'
4 eval cases; 1 down-voted need human review: ['How do I reset my password?']
Open curate.py and read to_eval_cases and needs_review.
You should now see: up-votes became golden cases (must keep working), the down-vote-with-fix became a regression case (the agent currently fails it), the duplicate "hours" was deduped, and the unlabel-able down-vote was routed to a human. These feed straight into the M26 CI gate.
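The curation rules just described can be sketched as follows. This is an illustration under assumptions, not the contents of curate.py: the function names mirror the lab's, but the bodies and the output keys (kind, question, expected) are guesses, and the sample records are made up.

```python
def to_eval_cases(records):
    """Up-votes become golden cases; down-votes with a correction become regressions."""
    cases, seen = [], set()
    for r in records:
        correction = (r.get("correction") or "").strip()
        if r.get("feedback") == "up":
            kind, expected = "golden", r["answer"]
        elif r.get("feedback") == "down" and correction:
            kind, expected = "regression", correction
        else:
            continue  # no trustworthy expected answer
        key = (r["question"], expected)  # dedupe by (question, expected)
        if key in seen:
            continue
        seen.add(key)
        cases.append({"kind": kind, "question": r["question"], "expected": expected})
    return cases


def needs_review(records):
    """Questions that were down-voted without a correction."""
    return [
        r["question"]
        for r in records
        if r.get("feedback") == "down" and not (r.get("correction") or "").strip()
    ]


# Illustrative records only; the demo's INTERACTIONS differ.
sample = [
    {"question": "Q1", "answer": "A1", "feedback": "up"},
    {"question": "Q1", "answer": "A1", "feedback": "up"},  # duplicate
    {"question": "Q2", "answer": "bad", "feedback": "down", "correction": "good"},
    {"question": "Q3", "answer": "bad", "feedback": "down"},
]
print(to_eval_cases(sample))
print(needs_review(sample))
```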
Step 5: Curate fine-tuning examples (M15)
In the demo output:
==== 3. CURATE into FINE-TUNING examples (M15) ====
4 training examples (deduped by question):
[{"role": "user", "content": "Do you offer refunds?"}, {"role": "assistant", "content": "Yes, we offer refunds within 30 days of purchase."}]
Open curate.py and read to_finetune_examples.
You should now see: chat-format {"messages": [...]} examples (exactly M15's dataset shape), and
crucially the refunds example trains the corrected answer, not the bad one the agent originally gave.
The same down-vote fed both the eval gate AND the training fix.
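A minimal sketch of the same idea, assuming the record keys from Step 7. The body is a guess at the logic, not the lab's to_finetune_examples; the one thing it must get right is that a corrected record trains on the correction and never on the original bad answer.

```python
import json


def to_finetune_examples(records):
    """Build chat-format examples, deduped by question."""
    examples, seen = [], set()
    for r in records:
        correction = (r.get("correction") or "").strip()
        if r.get("feedback") == "up":
            target = r["answer"]
        elif r.get("feedback") == "down" and correction:
            target = correction  # train the fix, not the mistake
        else:
            continue  # unlabeled down-vote: nothing safe to train on
        if r["question"] in seen:
            continue
        seen.add(r["question"])
        examples.append({
            "messages": [
                {"role": "user", "content": r["question"]},
                {"role": "assistant", "content": target},
            ]
        })
    return examples


record = {
    "question": "Do you ship internationally?",
    "answer": "No.",
    "feedback": "down",
    "correction": "Yes, we ship to 40 countries.",
}
for example in to_finetune_examples([record]):
    print(json.dumps(example))  # one JSON object per line (JSONL)
```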
Step 6: See the dual use (the flywheel)
Find the refunds interaction (#3) in the demo. It produced a regression eval case (so CI catches the bug) and a corrected training example (so a fine-tune fixes it).
You should now see the loop: one piece of real feedback both protects against the bug and teaches the fix. Multiply that over thousands of interactions and the agent improves with use.
Step 7: Add a record yourself
python -c "import feedback_log as fb, curate; r=[{'question':'Do you ship internationally?','answer':'No.','feedback':'down','correction':'Yes, we ship to 40 countries.'}]; print(curate.to_eval_cases(r)); print(curate.to_finetune_examples(r))"
You should now see: one regression eval case and one chat-format training example with the corrected answer. You just fed the flywheel.
Step 8: Show it
Post the refunds example from the demo: the regression eval case next to the corrected training example, and one sentence on the privacy rule you would enforce before storing real user data.
If you get stuck
- ModuleNotFoundError -> run from inside the folder with the solution .py files.
- PII not redacted -> redaction runs in log_interaction (on write); redact_pii only covers emails and phones here, extend it for your domain.
- A down-vote did not become an eval case -> it had no correction, so it cannot be auto-labeled; check needs_review. That is intentional.
- Duplicates in the dataset -> to_eval_cases dedupes by (question, expected) and to_finetune_examples by question; check your records differ.
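If you do extend redaction for your domain, one option is an ordered list of (placeholder, pattern) pairs. Everything in this sketch is hypothetical: the redact_pii_extended name, the [card] placeholder, and all three regexes are illustrations, not part of the lab's feedback_log.py.

```python
import re

# Order matters: the card pattern runs before the looser phone pattern,
# otherwise a card number would be redacted as [phone].
REDACTIONS = [
    ("[email]", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("[card]", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),  # 13 to 16 digits
    ("[phone]", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact_pii_extended(text):
    """Apply each redaction pattern in order and return the cleaned text."""
    for placeholder, pattern in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text


print(redact_pii_extended("Card 4111 1111 1111 1111, call 555-123-4567"))
```

Regexes like these are a floor, not a guarantee. Names, addresses, and free-form identifiers need their own handling before you store real user data.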