M23: Agent security (Part D: Agentic Systems)

The moment you give an agent tools, you give it power, and anything it reads can try to seize that power. A support ticket, a web page, a tool's output, an email: any of them can carry hidden instructions, because an agent cannot tell instructions from data. Today you watch a single poisoned document make a helpful agent email a secret to an attacker, and then you stop it, with layered defenses you build yourself. This is the security mindset for everything you shipped in Part D.

Today's win: the same prompt-injection attack that leaks a secret from a vulnerable agent is blocked, at more than one layer, by a hardened one, all demonstrated offline on synthetic data.

Scope and ethics: everything here is synthetic and for defense only. The payloads are fake, the "secret" is fake, and the email tool sends nothing. Use these techniques solely to defend systems you own or are explicitly authorized to test. This module teaches you to protect agents, not attack them.

Today you will

See indirect prompt injection: instructions smuggled inside content the agent reads
See excessive agency: how an over-powered tool turns a hijack into real damage (data exfiltration)
Build layered defenses: treat external content as data, detect injection, least-privilege tools, redact secrets, and the approval gate from M22
Learn defense in depth: why you never rely on the model resisting injection alone

Run of show (about 60 minutes)

Time	What we do
0:00	Hook: the agent that emailed a secret to an attacker
0:05	The one idea: the agent cannot tell instructions from data (read `notes.md`)
0:12	Lab Part A: run the attack and watch a vulnerable agent leak
0:30	Lab Part B: add defenses and watch each layer stop it
0:52	Show: post the vulnerable leak next to the hardened block
1:00	Wrap

If you get stuck

Builds on M10 (guardrails), M16 (MCP and tool risks), and M22 (the approval gate, reused here). Reuse your .env key only for the optional live run.
The core lab runs offline, free, no key (mock model, synthetic payloads). No new libraries. Nothing here can harm your computer; the attack is simulated end to end.
Each defense is a few lines in defenses.py. If a layer is not blocking, read that function and the order it runs in.

Optional challenge

Open starters/add_defense.py and build an outbound content scanner: block anything leaving the system that contains a secret or points at a domain not on your allowlist. Exfiltration hides in URLs and image links, not just email, so the last line of defense is watching what goes out.