M10: Evaluation, guardrails & security

Your apps can answer, retrieve, and act. Now the question every real deployment faces: can it be tricked? Today you put on the attacker's hat and try to break your own app, make it spill a secret, ignore its rules, or misuse a tool, then you build a guardrail layer and prove the attacks stop working. This is the skill that takes you from "cool demo" to "safe to ship."

Today's win: you red-team your own AI app, then add guardrails and re-test, turning "it leaked the secret" into "0 leaks, and normal questions still work."

Today you will

Learn the real risks, OWASP LLM Top 10, especially prompt injection & excessive agency
Red-team your own app: try to make it leak, ignore its rules, or misuse a tool
Add a guardrail layer (input, output, and tool guards) and re-measure: a security scorecard

Run of show (~60 min)

Time	What we do
0:00	Hook + the win we're chasing
0:05	The one idea: assume your app will be attacked; defend in layers (full read in `notes.md`)
0:10	Lab Part A: red-team the bot (guardrails OFF): watch it leak
0:35	Lab Part B: turn guardrails ON, add your own, re-test to 0 leaks
0:55	Show: post your before/after scorecard to the wins board
1:00	Wrap + take-home

If you get stuck

No new install, reuse M4's anthropic + key. The guardrails are plain Python you can test directly.
Authorized, educational red-teaming only. You attack your own practice app with fake data, never a system you don't own or have permission to test. Nothing here can harm your computer.
Modern models resist many attacks on their own, if an attack doesn't leak, that's good, but try harder and remember: safety shouldn't depend on the model's goodwill. Re-read the You should now see line.

Optional challenge

Find an attack that slips past the built-in guardrail (e.g. asking for the secret in another language, or hiding the instruction inside a long quote), then add a rule that catches it, without blocking the benign control question. That cat-and-mouse is AI security. Go deeper with the labs in the Resource Map (§8): Lakera Gandalf, PortSwigger's Web LLM Attacks, and DeepLearning.AI's Red Teaming LLM Applications.