M30 notes: Agent data and feedback loops (the one idea)

The one idea: a deployed agent generates the best training and evaluation data you will ever have, real questions and real signals about which answers were good, and most teams throw it away. The data flywheel captures that exhaust, curates it into eval cases (M26) and fine-tuning examples (M15), and feeds it back so the agent improves with use. The more it is used, the better it gets, as long as you respect privacy along the way.

1. Your agent is a data source

You spent earlier modules creating data by hand: golden eval cases (M20/M26), a fine-tuning dataset (M15). Production gives it to you for free, and better, because it is real: the actual questions your users ask, phrased the way they really phrase them, with the agent's actual answers. Add a way for users to signal quality (a thumbs up/down, an edit, a "try again") and you have LABELED data. That is the raw material of every improvement loop.

Analogy. A new shop guesses what to stock. A shop that has been open a year knows exactly what people ask for, because it watched. Your logs are that year of watching, if you keep them.

2. Capture, with privacy first

feedback_log.py logs each interaction as a JSONL record: the question, the answer, the sources, and a feedback field. The non-negotiable part: redact PII before you store anything. Feedback data is data you are now keeping, so the M14 rules apply, redact emails, phone numbers, and other identifiers at write time, get consent, and never log secrets. redact_pii does the obvious cases here; a real system extends it to its domain (names, account numbers, addresses) and minimizes what it keeps at all.

3. The three signals

Not all feedback means the same thing, and treating it bluntly is a mistake:

Thumbs up: the answer was good. A confirmed-good example: lock it in as a golden eval case (it must keep working) and use it as a training example.
Thumbs down + correction: the answer was wrong and a human told you the right one. The single most valuable record you can get: it becomes a regression eval case (one the agent currently fails) AND a corrected training example that teaches the fix.
Thumbs down, no correction: you know it was bad but not what "good" looks like. You cannot auto-label this; route it to a human for review. Guessing the expected answer would poison your data.

curate.py encodes exactly these rules. The judgement, what each signal means, is the real work; the code is small.

4. Curate into two datasets

Eval cases (M26): to_eval_cases turns up-votes into golden cases and down+correction into regression cases, deduped. These flow straight into your CI eval gate: now the suite is protecting behaviour your real users care about, not just cases you imagined.
Fine-tuning examples (M15): to_finetune_examples produces chat-format {"messages": [...]} records from good answers and corrected answers, deduped by question and filtered for length. This is a real SFT dataset, grown from production instead of written from scratch.

The same down-vote-with-correction feeds BOTH: it guards against the bug (eval) and teaches the fix (training). That dual use is the heart of the flywheel.

5. Curation is judgement, not just plumbing

Raw logs are noisy. Good curation: dedupe (the same question asked a hundred times should not swamp the dataset), filter low-quality or too-short records, balance so common cases do not drown rare ones, and review the ambiguous signals by hand. Garbage in, garbage out applies doubly to fine-tuning (M15): a sloppy dataset makes a worse model. Spend effort here, not just on volume.

6. Close the loop (and the cadence)

The full cycle: deploy (M29) -> log interactions + feedback -> curate -> add eval cases to CI (M26) and fine-tuning examples to the next training run (M15) -> ship the improved agent -> repeat. Run it on a cadence (weekly, monthly), not continuously, and always gate the result on evals so a "better" model trained on new data cannot quietly regress something old. This is how an agent that launches mediocre becomes excellent over a few months: not one big leap, but a steady flywheel.

7. Cautions

Privacy and consent (M14): redact, minimize, get consent, and let users delete their data. Do not train on sensitive data.
Feedback bias: people leave feedback more when annoyed; up-votes are rarer than the true positive rate. Do not read raw thumbs as ground truth.
Feedback loops can amplify mistakes: if the agent's own outputs become its training data unchecked, errors can compound. Keep humans in the loop and gate on evals.

Words you will hear

Data flywheel, feedback signal (explicit vs implicit), thumbs up/down, correction, golden vs regression case, curation (dedupe / filter / balance / review), PII redaction (M14), SFT dataset (M15), feedback loop / amplification. Full definitions in the glossary.