M23 solution: agent security
A vulnerable-vs-hardened demonstration of the headline agentic attack (indirect prompt injection leading to data exfiltration) and the layered defenses that stop it. Everything is synthetic and runs offline: no API key, no tokens, and nothing is actually sent.
Scope and ethics: for defending agents you own or are authorized to test. The payloads, target, and "secret" are fake; the email tool only records attempts.
Files
| File | Role |
|---|---|
attacks.py |
Synthetic, clearly-labeled attack material: POISONED_DOCUMENT (indirect injection in a support ticket), CLEAN_DOCUMENT, POISONED_TOOL_DESCRIPTION (tool poisoning), and a FAKE_SECRET. |
defenses.py |
The layers: detect_injection (heuristic tripwire), wrap_untrusted (treat content as data), redact_secrets (strip keys from outbound text), domain_allowed (least-privilege allowlist). |
agent.py |
A support agent that summarizes a ticket and can send_email, with switchable defenses: wrap_input (input side) and enforce_tools (redact + allowlist + approval). Returns {answer, sent, blocked, injection_flags, leaked}. Injectable client. |
demo_mock.py |
Runs the attack against the agent at four defense levels and shows the secret leak only when undefended. Start here. |
../starters/add_defense.py |
Build an outbound content scanner (blocks secrets and non-allowlisted URLs). |
Run it
# offline, free, no key (mock model + synthetic payloads):
python demo_mock.py
# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env # then edit .env and paste your key
python agent.py
What the demo shows
| Scenario | wrap_input | enforce_tools | leaked? |
|---|---|---|---|
| 1. Vulnerable | off | off | True (secret emailed to attacker) |
| 2. Input defense | on | off | False (model not fooled) |
| 3. Defense in depth | off | on | False (model fooled, but tool layer blocks the recipient) |
| 4. Fully hardened | on | on | False |
The headline lesson is scenario 3: the model is still hijacked, but least privilege on the tool
(domain_allowed) plus redact_secrets means the secret never leaves. Do not rely on the model
resisting injection; constrain the tools.
Honest limits
detect_injectionis a heuristic tripwire (regex markers). It catches obvious payloads, not clever ones; it is a smoke alarm, not a wall. That is exactly why the tool-layer defenses exist.wrap_untrustedreduces susceptibility; it does not guarantee the model ignores injected text.- Real systems add more: sandboxing, output scanning (the lab challenge), rate limits, monitoring (M20), and the OWASP Top 10 for LLM Applications as a checklist.
Verified (offline)
- Units:
detect_injectionflags the poisoned doc and passes the clean one;wrap_untrusteddelimits;redact_secretsremoves the fake key;domain_allowedallows the company domain and blocks the attacker. - Scenarios: vulnerable agent leaks the secret to
attacker@evil.example; input defense stops the model being fooled; defense-in-depth blocks the exfiltration even when the model is fooled (domain_not_allowed); fully hardened leaks nothing; a clean ticket triggers no injection flags and no email. - All files compile;
demo_mock.pyruns end to end offline. Live runs reuse the M4 key and cost tokens.