M10: Evaluation, guardrails & security
Your apps can answer, retrieve, and act. Now the question every real deployment faces: can it be tricked? Today you put on the attacker's hat and try to break your own app, make it spill a secret, ignore its rules, or misuse a tool, then you build a guardrail layer and prove the attacks stop working. This is the skill that takes you from "cool demo" to "safe to ship."
Today's win: you red-team your own AI app, then add guardrails and re-test, turning "it leaked the secret" into "0 leaks, and normal questions still work."
Today you will
- Learn the real risks, OWASP LLM Top 10, especially prompt injection & excessive agency
- Red-team your own app: try to make it leak, ignore its rules, or misuse a tool
- Add a guardrail layer (input, output, and tool guards) and re-measure: a security scorecard
Run of show (~60 min)
| Time | What we do |
|---|---|
| 0:00 | Hook + the win we're chasing |
| 0:05 | The one idea: assume your app will be attacked; defend in layers (full read in notes.md) |
| 0:10 | Lab Part A: red-team the bot (guardrails OFF): watch it leak |
| 0:35 | Lab Part B: turn guardrails ON, add your own, re-test to 0 leaks |
| 0:55 | Show: post your before/after scorecard to the wins board |
| 1:00 | Wrap + take-home |
If you get stuck
- No new install, reuse M4's
anthropic+ key. The guardrails are plain Python you can test directly. - Authorized, educational red-teaming only. You attack your own practice app with fake data, never a system you don't own or have permission to test. Nothing here can harm your computer.
- Modern models resist many attacks on their own, if an attack doesn't leak, that's good, but try harder and remember: safety shouldn't depend on the model's goodwill. Re-read the You should now see line.
Optional challenge
Find an attack that slips past the built-in guardrail (e.g. asking for the secret in another language, or hiding the instruction inside a long quote), then add a rule that catches it, without blocking the benign control question. That cat-and-mouse is AI security. Go deeper with the labs in the Resource Map (§8): Lakera Gandalf, PortSwigger's Web LLM Attacks, and DeepLearning.AI's Red Teaming LLM Applications.