Skip to content

M10: Evaluation, guardrails & security

Your apps can answer, retrieve, and act. Now the question every real deployment faces: can it be tricked? Today you put on the attacker's hat and try to break your own app, make it spill a secret, ignore its rules, or misuse a tool, then you build a guardrail layer and prove the attacks stop working. This is the skill that takes you from "cool demo" to "safe to ship."

Today's win: you red-team your own AI app, then add guardrails and re-test, turning "it leaked the secret" into "0 leaks, and normal questions still work."

Today you will

  • Learn the real risks, OWASP LLM Top 10, especially prompt injection & excessive agency
  • Red-team your own app: try to make it leak, ignore its rules, or misuse a tool
  • Add a guardrail layer (input, output, and tool guards) and re-measure: a security scorecard

Run of show (~60 min)

Time What we do
0:00 Hook + the win we're chasing
0:05 The one idea: assume your app will be attacked; defend in layers (full read in notes.md)
0:10 Lab Part A: red-team the bot (guardrails OFF): watch it leak
0:35 Lab Part B: turn guardrails ON, add your own, re-test to 0 leaks
0:55 Show: post your before/after scorecard to the wins board
1:00 Wrap + take-home

If you get stuck

  • No new install, reuse M4's anthropic + key. The guardrails are plain Python you can test directly.
  • Authorized, educational red-teaming only. You attack your own practice app with fake data, never a system you don't own or have permission to test. Nothing here can harm your computer.
  • Modern models resist many attacks on their own, if an attack doesn't leak, that's good, but try harder and remember: safety shouldn't depend on the model's goodwill. Re-read the You should now see line.

Optional challenge

Find an attack that slips past the built-in guardrail (e.g. asking for the secret in another language, or hiding the instruction inside a long quote), then add a rule that catches it, without blocking the benign control question. That cat-and-mouse is AI security. Go deeper with the labs in the Resource Map (§8): Lakera Gandalf, PortSwigger's Web LLM Attacks, and DeepLearning.AI's Red Teaming LLM Applications.