M30 solution: agent data and feedback loops
The data flywheel: capture production interactions and feedback (PII redacted), then curate them into new eval cases (M26) and fine-tuning examples (M15). Pure Python over JSONL, fully offline, no key.
Files
| File | Role |
|---|---|
feedback_log.py |
log_interaction (redacts PII, appends a JSONL record), load, and redact_pii (emails and phone numbers, extend for your domain). |
curate.py |
to_eval_cases (up -> golden, down+correction -> regression, deduped), needs_review (down with no fix -> human), to_finetune_examples (chat-format M15 records from good and corrected answers, deduped). |
demo.py |
Logs a batch of synthetic interactions, then curates them into eval cases and training examples. Start here. |
../starters/add_signal.py |
Add an implicit feedback signal (edits, regenerate, resolved). |
Run it
python demo.py # offline: log interactions, redact PII, curate into both datasets
The three signals (and what each becomes)
| Feedback | Eval case | Fine-tuning example |
|---|---|---|
| thumbs up | golden (must keep working) | the good answer |
| thumbs down + correction | regression (currently fails) | the corrected answer |
| thumbs down, no correction | (none) routed to human review | (none) |
The same down-vote-with-correction feeds both datasets: it guards against the bug and teaches the fix. That dual use is the heart of the flywheel.
How it works
- Privacy first.
log_interactionrunsredact_piion the question, answer, and correction before writing, because feedback data is data you now keep (M14). Real systems extend redaction and minimize what they store. - Signals mean different things. Up is a confirmed-good example; down+correction is a wrong answer with the fix; down without a fix cannot be auto-labeled and must be triaged by a human (guessing the expected answer would poison the data).
- Curation is judgement. Dedupe (by question, or question+expected), filter short/empty, and review ambiguous records. Garbage in, garbage out, doubly so for fine-tuning.
- Closes the loop. Eval cases flow into the M26 CI gate; fine-tuning examples flow into the M15 training set; ship the improved agent and repeat on a cadence, always gating on evals.
Verified (offline)
redact_piireplaces emails and phone numbers;log_interactionredacts on write andloadreads back.to_eval_cases: up -> golden, down+correction -> regression, duplicates removed; the regression case carries the corrected expected text.needs_review: a down-vote with no correction is routed for human triage (not auto-labeled).to_finetune_examples: chat-format{"messages":[user, assistant]}, deduped by question, using the corrected answer for down-voted-with-correction records.- All files compile;
demo.pyruns end to end offline. No key needed; using the data downstream is M15/M26.