Skip to content

M30 solution: agent data and feedback loops

The data flywheel: capture production interactions and feedback (PII redacted), then curate them into new eval cases (M26) and fine-tuning examples (M15). Pure Python over JSONL, fully offline, no key.

Files

File Role
feedback_log.py log_interaction (redacts PII, appends a JSONL record), load, and redact_pii (emails and phone numbers, extend for your domain).
curate.py to_eval_cases (up -> golden, down+correction -> regression, deduped), needs_review (down with no fix -> human), to_finetune_examples (chat-format M15 records from good and corrected answers, deduped).
demo.py Logs a batch of synthetic interactions, then curates them into eval cases and training examples. Start here.
../starters/add_signal.py Add an implicit feedback signal (edits, regenerate, resolved).

Run it

python demo.py          # offline: log interactions, redact PII, curate into both datasets

The three signals (and what each becomes)

Feedback Eval case Fine-tuning example
thumbs up golden (must keep working) the good answer
thumbs down + correction regression (currently fails) the corrected answer
thumbs down, no correction (none) routed to human review (none)

The same down-vote-with-correction feeds both datasets: it guards against the bug and teaches the fix. That dual use is the heart of the flywheel.

How it works

  • Privacy first. log_interaction runs redact_pii on the question, answer, and correction before writing, because feedback data is data you now keep (M14). Real systems extend redaction and minimize what they store.
  • Signals mean different things. Up is a confirmed-good example; down+correction is a wrong answer with the fix; down without a fix cannot be auto-labeled and must be triaged by a human (guessing the expected answer would poison the data).
  • Curation is judgement. Dedupe (by question, or question+expected), filter short/empty, and review ambiguous records. Garbage in, garbage out, doubly so for fine-tuning.
  • Closes the loop. Eval cases flow into the M26 CI gate; fine-tuning examples flow into the M15 training set; ship the improved agent and repeat on a cadence, always gating on evals.

Verified (offline)

  • redact_pii replaces emails and phone numbers; log_interaction redacts on write and load reads back.
  • to_eval_cases: up -> golden, down+correction -> regression, duplicates removed; the regression case carries the corrected expected text.
  • needs_review: a down-vote with no correction is routed for human triage (not auto-labeled).
  • to_finetune_examples: chat-format {"messages":[user, assistant]}, deduped by question, using the corrected answer for down-voted-with-correction records.
  • All files compile; demo.py runs end to end offline. No key needed; using the data downstream is M15/M26.