Skip to content

M23 solution: agent security

A vulnerable-vs-hardened demonstration of the headline agentic attack (indirect prompt injection leading to data exfiltration) and the layered defenses that stop it. Everything is synthetic and runs offline: no API key, no tokens, and nothing is actually sent.

Scope and ethics: for defending agents you own or are authorized to test. The payloads, target, and "secret" are fake; the email tool only records attempts.

Files

File Role
attacks.py Synthetic, clearly-labeled attack material: POISONED_DOCUMENT (indirect injection in a support ticket), CLEAN_DOCUMENT, POISONED_TOOL_DESCRIPTION (tool poisoning), and a FAKE_SECRET.
defenses.py The layers: detect_injection (heuristic tripwire), wrap_untrusted (treat content as data), redact_secrets (strip keys from outbound text), domain_allowed (least-privilege allowlist).
agent.py A support agent that summarizes a ticket and can send_email, with switchable defenses: wrap_input (input side) and enforce_tools (redact + allowlist + approval). Returns {answer, sent, blocked, injection_flags, leaked}. Injectable client.
demo_mock.py Runs the attack against the agent at four defense levels and shows the secret leak only when undefended. Start here.
../starters/add_defense.py Build an outbound content scanner (blocks secrets and non-allowlisted URLs).

Run it

# offline, free, no key (mock model + synthetic payloads):
python demo_mock.py

# live (optional, costs a few tokens): put your key in .env first
cp ../starters/.env.example .env      # then edit .env and paste your key
python agent.py

What the demo shows

Scenario wrap_input enforce_tools leaked?
1. Vulnerable off off True (secret emailed to attacker)
2. Input defense on off False (model not fooled)
3. Defense in depth off on False (model fooled, but tool layer blocks the recipient)
4. Fully hardened on on False

The headline lesson is scenario 3: the model is still hijacked, but least privilege on the tool (domain_allowed) plus redact_secrets means the secret never leaves. Do not rely on the model resisting injection; constrain the tools.

Honest limits

  • detect_injection is a heuristic tripwire (regex markers). It catches obvious payloads, not clever ones; it is a smoke alarm, not a wall. That is exactly why the tool-layer defenses exist.
  • wrap_untrusted reduces susceptibility; it does not guarantee the model ignores injected text.
  • Real systems add more: sandboxing, output scanning (the lab challenge), rate limits, monitoring (M20), and the OWASP Top 10 for LLM Applications as a checklist.

Verified (offline)

  • Units: detect_injection flags the poisoned doc and passes the clean one; wrap_untrusted delimits; redact_secrets removes the fake key; domain_allowed allows the company domain and blocks the attacker.
  • Scenarios: vulnerable agent leaks the secret to attacker@evil.example; input defense stops the model being fooled; defense-in-depth blocks the exfiltration even when the model is fooled (domain_not_allowed); fully hardened leaks nothing; a clean ticket triggers no injection flags and no email.
  • All files compile; demo_mock.py runs end to end offline. Live runs reuse the M4 key and cost tokens.