M30 solution: agent data and feedback loops

The data flywheel: capture production interactions and feedback (PII redacted), then curate them into new eval cases (M26) and fine-tuning examples (M15). Pure Python over JSONL, fully offline, no key.

Files

File	Role
`feedback_log.py`	`log_interaction` (redacts PII, appends a JSONL record), `load`, and `redact_pii` (emails and phone numbers, extend for your domain).
`curate.py`	`to_eval_cases` (up -> golden, down+correction -> regression, deduped), `needs_review` (down with no fix -> human), `to_finetune_examples` (chat-format M15 records from good and corrected answers, deduped).
`demo.py`	Logs a batch of synthetic interactions, then curates them into eval cases and training examples. Start here.
`../starters/add_signal.py`	Add an implicit feedback signal (edits, regenerate, resolved).

Run it

python demo.py          # offline: log interactions, redact PII, curate into both datasets

The three signals (and what each becomes)

Feedback	Eval case	Fine-tuning example
thumbs up	golden (must keep working)	the good answer
thumbs down + correction	regression (currently fails)	the corrected answer
thumbs down, no correction	(none) routed to human review	(none)

The same down-vote-with-correction feeds both datasets: it guards against the bug and teaches the fix. That dual use is the heart of the flywheel.

How it works

Privacy first. log_interaction runs redact_pii on the question, answer, and correction before writing, because feedback data is data you now keep (M14). Real systems extend redaction and minimize what they store.
Signals mean different things. Up is a confirmed-good example; down+correction is a wrong answer with the fix; down without a fix cannot be auto-labeled and must be triaged by a human (guessing the expected answer would poison the data).
Curation is judgement. Dedupe (by question, or question+expected), filter short/empty, and review ambiguous records. Garbage in, garbage out, doubly so for fine-tuning.
Closes the loop. Eval cases flow into the M26 CI gate; fine-tuning examples flow into the M15 training set; ship the improved agent and repeat on a cadence, always gating on evals.

Verified (offline)

redact_pii replaces emails and phone numbers; log_interaction redacts on write and load reads back.
to_eval_cases: up -> golden, down+correction -> regression, duplicates removed; the regression case carries the corrected expected text.
needs_review: a down-vote with no correction is routed for human triage (not auto-labeled).
to_finetune_examples: chat-format {"messages":[user, assistant]}, deduped by question, using the corrected answer for down-voted-with-correction records.
All files compile; demo.py runs end to end offline. No key needed; using the data downstream is M15/M26.