M23: Agent security (Part D: Agentic Systems)
The moment you give an agent tools, you give it power, and anything it reads can try to seize that power. A support ticket, a web page, a tool's output, an email: any of them can carry hidden instructions, because an agent cannot tell instructions from data. Today you watch a single poisoned document make a helpful agent email a secret to an attacker, and then you stop it, with layered defenses you build yourself. This is the security mindset for everything you shipped in Part D.
Today's win: the same prompt-injection attack that leaks a secret from a vulnerable agent is blocked, at more than one layer, by a hardened one, all demonstrated offline on synthetic data.
Scope and ethics: everything here is synthetic and for defense only. The payloads are fake, the "secret" is fake, and the email tool sends nothing. Use these techniques solely to defend systems you own or are explicitly authorized to test. This module teaches you to protect agents, not attack them.
Today you will
- See indirect prompt injection: instructions smuggled inside content the agent reads
- See excessive agency: how an over-powered tool turns a hijack into real damage (data exfiltration)
- Build layered defenses: treat external content as data, detect injection, least-privilege tools, redact secrets, and the approval gate from M22
- Learn defense in depth: why you never rely on the model resisting injection alone
Run of show (about 60 minutes)
| Time | What we do |
|---|---|
| 0:00 | Hook: the agent that emailed a secret to an attacker |
| 0:05 | The one idea: the agent cannot tell instructions from data (read notes.md) |
| 0:12 | Lab Part A: run the attack and watch a vulnerable agent leak |
| 0:30 | Lab Part B: add defenses and watch each layer stop it |
| 0:52 | Show: post the vulnerable leak next to the hardened block |
| 1:00 | Wrap |
If you get stuck
- Builds on M10 (guardrails), M16 (MCP and tool risks), and M22 (the approval gate, reused here). Reuse your
.envkey only for the optional live run. - The core lab runs offline, free, no key (mock model, synthetic payloads). No new libraries. Nothing here can harm your computer; the attack is simulated end to end.
- Each defense is a few lines in
defenses.py. If a layer is not blocking, read that function and the order it runs in.
Optional challenge
Open starters/add_defense.py and build an outbound content scanner:
block anything leaving the system that contains a secret or points at a domain not on your allowlist.
Exfiltration hides in URLs and image links, not just email, so the last line of defense is watching
what goes out.