M33 notes: Data and release operations (the one idea)

The one idea: two of the things you built, the knowledge base and the deploy, are not done when they first work; they have to be operated. The index drifts from its sources, fills with stale and sensitive data, and is worthless without a restorable backup. Each new release is a bet that "passed CI" means "safe for everyone," which is not the same thing. Operations support safeguards both: keep the data fresh, private, and recoverable, and gate every build behind a canary that can roll itself back. This is the layer that protects the architecture, the databases, and the builds.


1. Why data and deploys need operating

The build modules treated the vector store (M7) and the deploy (M11/M29) as things you set up once. In production they are living systems:

| Asset | How it rots without ops | The safeguard |
| --- | --- | --- |
| The index / database | sources change → stale answers; PII leaks in; it grows forever | reindex, redact, retention, backup |
| The build / release | a "fine" build regresses quality for real users | canary, promote/rollback |
| The credentials | a key rotation breaks every client at once | rotation with a grace window |

2. Staleness: the index drifts from the truth

A RAG answer is only as current as the document it was built from. When a source changes (a price, a policy, a manual), the indexed copy is stale and the assistant will confidently answer from the old version. is_stale detects this by storing the hash of the source each record was built from and comparing it to the hash of the live source. reindex then re-embeds only the changed docs. Re-embedding the whole store on every change is the expensive, slow mistake: real embedding calls cost money and time, so you touch only what moved.
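The hash-and-compare idea can be sketched in a few lines. This is a minimal illustration, not the lab's actual code: the record layout, the signatures of is_stale and reindex, and the helpers source_hash, fake_embed, and build_index are assumptions made for the example.

```python
import hashlib


def source_hash(text: str) -> str:
    """Fingerprint of a source document's current content (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def fake_embed(text: str) -> list[float]:
    """Stand-in for a real embedding call, which would cost money and time."""
    return [float(len(text)), float(sum(map(ord, text)) % 997)]


def make_record(text: str) -> dict:
    """Store the text, its vector, and the hash of the source it was built from."""
    return {"text": text, "hash": source_hash(text), "vector": fake_embed(text)}


def build_index(sources: dict[str, str]) -> dict[str, dict]:
    return {doc_id: make_record(text) for doc_id, text in sources.items()}


def is_stale(record: dict, live_text: str) -> bool:
    """A record is stale when the live source no longer matches the stored hash."""
    return record["hash"] != source_hash(live_text)


def reindex(index: dict[str, dict], sources: dict[str, str]) -> list[str]:
    """Re-embed only new or changed docs; return the ids that were touched."""
    changed = []
    for doc_id, live_text in sources.items():
        record = index.get(doc_id)
        if record is None or is_stale(record, live_text):
            index[doc_id] = make_record(live_text)
            changed.append(doc_id)
    return changed


sources = {"pricing": "Plan A costs $10", "policy": "Returns within 30 days"}
index = build_index(sources)

sources["pricing"] = "Plan A costs $12"  # the source changes after indexing
changed = reindex(index, sources)
print(changed)  # ['pricing']
```

Running reindex a second time with no source changes touches nothing, which is the property that keeps the embedding bill proportional to what actually moved.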

3. PII redaction and retention: the data lifecycle

Two rules keep a data store out of trouble:

- Redact on write. redact_pii scrubs emails, SSNs, and card numbers before anything is stored, so the index never holds raw PII in the first place (the privacy-first principle from M14/M30). The safest data is the data you never wrote down.
- Retention. Data should not live forever. sweep_expired drops documents whose TTL has passed, so the index does not grow without bound or keep sensitive data past its purpose.
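Both rules fit in a short sketch. The regular expressions below are deliberately simple and will miss real-world formats. The placeholder tokens, the expires_at field, the add_document helper, and the exact signatures of redact_pii and sweep_expired are assumptions for illustration, not the lab's implementation.

```python
import re

# Simplified patterns for illustration only; production redaction needs more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def redact_pii(text: str) -> str:
    """Replace emails, SSNs, and card numbers with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    text = SSN.sub("[SSN]", text)
    text = CARD.sub("[CARD]", text)
    return text


def add_document(index: dict, doc_id: str, text: str, now: float, ttl_seconds: float) -> None:
    """Redact on write: only the scrubbed text ever reaches the index (hypothetical helper)."""
    index[doc_id] = {"text": redact_pii(text), "expires_at": now + ttl_seconds}


def sweep_expired(index: dict, now: float) -> list[str]:
    """Drop every document whose TTL has passed; return the removed ids."""
    expired = [doc_id for doc_id, rec in index.items() if rec["expires_at"] <= now]
    for doc_id in expired:
        del index[doc_id]
    return expired


index: dict[str, dict] = {}
raw = "Contact jane@example.com, SSN 123-45-6789, card 4111 1111 1111 1111."
add_document(index, "ticket-1", raw, now=1000.0, ttl_seconds=60)
add_document(index, "faq-1", "Returns within 30 days", now=1000.0, ttl_seconds=3600)

print(index["ticket-1"]["text"])  # Contact [EMAIL], SSN [SSN], card [CARD].
removed = sweep_expired(index, now=2000.0)
print(removed)  # ['ticket-1']
```

Passing now in as a parameter, rather than reading the clock inside the function, keeps the sweep deterministic and easy to test.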

4. Backups you can actually restore

snapshot takes a deep copy; restore puts it back. The discipline that matters is the second half: a backup you have never restored is a hope, not a backup. In the lab you snapshot the index, mutate it (reindex, sweep), and then restore, watching the swept and changed docs come back. In production you test restores on a schedule, because the day you need a backup is the worst day to discover it does not work.
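A minimal sketch of the snapshot, mutate, restore drill, assuming the index is a plain in-memory dict. The signatures of snapshot and restore here are assumptions; a real store would write the snapshot to durable, off-host storage.

```python
import copy


def snapshot(index: dict) -> dict:
    """Deep copy, so later edits to nested records cannot leak into the backup."""
    return copy.deepcopy(index)


def restore(index: dict, backup: dict) -> None:
    """Replace the index contents in place with a fresh copy of the backup."""
    index.clear()
    index.update(copy.deepcopy(backup))


index = {
    "pricing": {"text": "Plan A costs $10", "tags": ["billing"]},
    "old-ticket": {"text": "Resolved last year", "tags": ["support"]},
}
backup = snapshot(index)

# Mutate the live index the way a reindex and a sweep would.
index["pricing"]["text"] = "Plan A costs $12"
del index["old-ticket"]

# The restore test: prove that the backup actually brings the data back.
restore(index, backup)
print(index["pricing"]["text"])  # Plan A costs $10
print("old-ticket" in index)     # True
```

A shallow copy would fail this drill: the edit to the nested pricing record would show up in the backup too, which is why the copy has to be deep.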

Analogy. The index is a library. Reindexing is replacing the books whose contents changed (not re-buying the whole collection). Redaction is the librarian blacking out private details before a book goes on the shelf. Retention is removing books past their keep-by date. The backup is the off-site copy, and a librarian who has never tried to retrieve from it does not really have one. The canary, next, is lending a new edition to ten trusted readers before you stock a thousand.

5. Canary releases: prove a build before everyone gets it

A green CI run (M26) says a build passed your current tests. It does not prove the build is as good as what is already live. A canary closes that gap: run the candidate next to the live baseline on a small eval set and compare. canary promotes only if the candidate (a) clears a minimum bar and (b) does not regress against the baseline. In the demo, v2-good matches the baseline and is promoted; v3-bad scores 0.4 and is rolled back, so the bad build never reaches a user.
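The two-part promotion rule can be sketched as follows. The eval set, the 0.8 minimum bar, the baseline's perfect score, and the signature of canary are all assumptions chosen so the numbers line up with the demo's description; builds are modeled as plain lookup tables rather than real models.

```python
# Toy eval set of (question, expected answer) pairs; contents are made up.
EVAL_SET = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4"), ("q5", "a5")]


def score(build: dict, eval_set: list) -> float:
    """Fraction of eval questions the build answers correctly."""
    correct = sum(1 for question, expected in eval_set if build.get(question) == expected)
    return correct / len(eval_set)


def canary(candidate: dict, baseline: dict, eval_set: list, min_score: float = 0.8) -> str:
    """Promote only if the candidate clears the bar AND does not regress."""
    candidate_score = score(candidate, eval_set)
    baseline_score = score(baseline, eval_set)
    if candidate_score < min_score:
        return "rollback"  # (a) fails the minimum bar
    if candidate_score < baseline_score:
        return "rollback"  # (b) regresses against what is live
    return "promote"


baseline = dict(EVAL_SET)                      # live build answers everything
v2_good = dict(EVAL_SET)                       # matches the baseline
v3_bad = {"q1": "a1", "q2": "a2", "q3": "?"}   # only 2 of 5 correct

print(score(v3_bad, EVAL_SET))                 # 0.4
print(canary(v2_good, baseline, EVAL_SET))     # promote
print(canary(v3_bad, baseline, EVAL_SET))      # rollback
```

The two checks are separate on purpose: a candidate that clears the absolute bar can still be worse than what is live, and a candidate that ties a weak baseline should still not ship below the bar.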

This is the same eval machinery from M20/M26, used at a new moment: not "should this code merge?" but "should this build serve traffic?"

6. Rollback: the deploy you can undo

Promotion records the previous live version as last_good, so rollback() is one call back to a known state. The golden rule of release ops: never ship a change you cannot undo. A rollback is faster and safer than a forward-fix during an incident (it pairs directly with M31, "mitigate before you diagnose").
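A small sketch of that bookkeeping. The class name ReleaseManager and its attribute names are hypothetical; only last_good and rollback() come from the notes, and this single-step rollback is an assumption about how far back the undo goes.

```python
class ReleaseManager:
    """Tracks the live version and the last known-good one (hypothetical class)."""

    def __init__(self, live: str) -> None:
        self.live = live
        self.last_good: str | None = None

    def promote(self, version: str) -> None:
        """Record the current live version as last_good before switching."""
        self.last_good = self.live
        self.live = version

    def rollback(self) -> str:
        """One call back to the known-good version."""
        if self.last_good is None:
            raise RuntimeError("no previous version recorded; nothing to roll back to")
        self.live, self.last_good = self.last_good, None
        return self.live


releases = ReleaseManager(live="v1")
releases.promote("v2-good")
print(releases.live, releases.last_good)  # v2-good v1

releases.rollback()
print(releases.live)  # v1
```

Because promote always records what it replaces, every promotion is reversible the moment it happens, with no lookup or rebuild needed during an incident.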

7. Secret rotation without an outage

Credentials must be rotated, but a hard swap breaks every client still using the old key. SecretStore keeps the previous secret valid during a grace window: after rotate, both old and new work, so in-flight clients keep running; once everyone has moved, expire_grace closes the window and only the new key is valid. Zero-downtime rotation is the difference between a routine task and a self-inflicted outage.
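The grace window can be sketched like this. The method names rotate and expire_grace follow the notes; the is_valid method, the attribute names, and the in-memory storage are assumptions for illustration, and a real store would keep secrets in a secrets manager rather than in process memory.

```python
import hmac


class SecretStore:
    """Keeps the previous secret valid until the grace window is closed."""

    def __init__(self, secret: str) -> None:
        self.current = secret
        self.previous: str | None = None

    def rotate(self, new_secret: str) -> None:
        """Switch to the new secret but keep the old one valid for now."""
        self.previous = self.current
        self.current = new_secret

    def expire_grace(self) -> None:
        """Close the grace window: only the current secret remains valid."""
        self.previous = None

    def is_valid(self, presented: str) -> bool:
        """Accept the current secret, or the previous one during grace."""
        accepted = [self.current]
        if self.previous is not None:
            accepted.append(self.previous)
        # compare_digest avoids leaking match position through timing.
        return any(
            hmac.compare_digest(presented.encode(), secret.encode())
            for secret in accepted
        )


store = SecretStore("key-old")
store.rotate("key-new")
print(store.is_valid("key-old"), store.is_valid("key-new"))  # True True

store.expire_grace()
print(store.is_valid("key-old"), store.is_valid("key-new"))  # False True
```

The length of the grace window is an operational choice: long enough for every client to pick up the new key, short enough that a leaked old key does not stay useful.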

8. Putting it together

demo_mock.py: build the index with PII redacted → detect a stale doc and reindex only it → sweep an expired doc → restore from a backup → canary a good release and promote → canary a bad release and roll back → rotate a secret with grace. None of this is glamorous, and that is the point: operations support is the quiet, repeatable safeguarding that keeps the data trustworthy and the deploys reversible.


Words you will hear

Stale index / drift, reindex / re-embed, PII redaction, retention / TTL, backup / restore, snapshot, canary release, baseline, promote / rollback, blast radius, secret rotation / grace window, zero-downtime, idempotency. Full definitions in the glossary.