Notes: M9: Agents, tools, function calling & frameworks

Everything so far has the model produce text. An agent lets it do things: call a calculator, search logs, look up a record, hit an API, then look at the result and decide the next step. That loop is the whole idea, and it's simpler than it sounds. This module builds it twice: by hand (so nothing is magic) and with a framework (so you see how real systems are built), then surveys the wider landscape so the buzzwords stop being scary.

What an agent actually is

An agent = a model + tools + a loop. A tool is just a normal function you let the model call. The model can't run your code, so the dance is:

1. You send the question and a list of tools (each with a name, description, and input shape).
2. The model either answers, or replies "please run lookup_ioc('185.220.101.45')" (a tool call).
3. Your code runs the real function and sends the result back.
4. The model reads the result and either answers or asks for another tool. Repeat.

This reason → act → observe cycle is called the ReAct loop. "Multi-step" just means the loop runs more than once (look up an indicator, then search logs, then summarize).

flowchart LR
  Q["question + tool list"] --> R["model reasons"]
  R -->|"tool call"| X["your code runs the tool"]
  X -->|"result (observation)"| R
  R -->|"no more tools"| A["final answer"]

Tool calling from first principles (9a)

A tool definition is just a description the model reads. It has a name, a statement of what the tool does, and a JSON schema for its inputs (your M6 schemas again):

{"name": "calculate",
 "description": "Evaluate a basic arithmetic expression like '3 * (4 + 5)'.",
 "input_schema": {"type": "object",
                  "properties": {"expression": {"type": "string"}},
                  "required": ["expression"]}}
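The definition only describes the tool; the function behind it is ordinary Python. As a minimal sketch (not necessarily how the course's own calculate is written), here is one way to implement it without calling eval on model-supplied text. It parses the expression with the standard ast module and allows only numbers and the four basic operators. The helper name _evaluate and the "error: ..." return convention are assumptions made for this example.

```python
import ast
import operator

# Whitelisted operators: anything else in the expression is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    """Recursively evaluate one node of a parsed arithmetic expression."""
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def calculate(expression: str) -> str:
    """Evaluate basic arithmetic and return the result as text for the model."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_evaluate(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError) as exc:
        # Errors go back as text so the model can read them and try again.
        return f"error: {exc}"


print(calculate("3 * (4 + 5)"))  # 27
```

The definition and the function are linked only by the name "calculate": the model sees the definition, and your code maps that name to the function.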

You send the tool definitions with the message. When the model wants a tool, the response has stop_reason == "tool_use" and a tool_use block with the chosen tool and its arguments. You run the matching Python function, append a tool_result (carrying the same tool_use_id), and call the model again. You loop until stop_reason isn't tool_use. That's the entire agent in agent_manual.py, about fifteen lines, no framework.
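The loop described above can be sketched as follows. This is an illustration, not the course's agent_manual.py. The function takes a client whose messages.create(...) returns an object with stop_reason and content blocks, which is the shape the Anthropic SDK uses. So that the sketch runs without an API key, it is exercised here with ScriptedClient, a hypothetical stand-in that replays canned responses. The max_turns cap and the stand-in calculate function are also assumptions of this example.

```python
from types import SimpleNamespace


def run_agent(client, model, tools, tool_functions, question, max_turns=10):
    """Call the model, run any requested tools, and repeat until it answers."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        response = client.messages.create(
            model=model, max_tokens=1024, tools=tools, messages=messages
        )
        # Keep the assistant turn (including tool_use blocks) in the history.
        messages.append({"role": "assistant", "content": response.content})
        if response.stop_reason != "tool_use":
            # No more tools requested: join the text blocks into the answer.
            return "".join(b.text for b in response.content if b.type == "text")
        results = []
        for block in response.content:
            if block.type == "tool_use":
                output = tool_functions[block.name](**block.input)
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,  # must match the request's id
                    "content": str(output),
                })
        # All tool results go back together in one user turn.
        messages.append({"role": "user", "content": results})
    raise RuntimeError("agent did not finish within max_turns")


class ScriptedClient:
    """Hypothetical stand-in for the real client: replays canned responses."""

    def __init__(self, responses):
        self._responses = iter(responses)
        self.messages = SimpleNamespace(create=self._create)

    def _create(self, **kwargs):
        return next(self._responses)


calls = []


def calculate(expression):
    calls.append(expression)  # record what the "model" asked for
    return "27"               # stand-in result; a real tool would compute it


scripted = [
    SimpleNamespace(stop_reason="tool_use", content=[
        SimpleNamespace(type="tool_use", id="toolu_demo_1", name="calculate",
                        input={"expression": "3 * (4 + 5)"}),
    ]),
    SimpleNamespace(stop_reason="end_turn", content=[
        SimpleNamespace(type="text", text="The answer is 27."),
    ]),
]

answer = run_agent(ScriptedClient(scripted), "demo-model", tools=[],
                   tool_functions={"calculate": calculate},
                   question="What is 3 * (4 + 5)?")
print(answer)  # The answer is 27.
```

With the real SDK you would pass anthropic.Anthropic() as the client, a real model name, and the list of tool definitions; the loop itself stays the same.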

The tool description is the steering wheel. The model decides which tool to call almost entirely from the description. Vague descriptions → wrong tool or no tool. Say when to use it ("Use this to enrich an IP/domain/hash found in an alert"), not just what it is.
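To make the difference concrete, here are two definitions of the same tool, written as Python dicts in the shape shown earlier. Only the description changes. The parameter name "indicator" and the sentence about what the tool returns are assumptions for this example.

```python
# Shared input shape for both versions ("indicator" is a hypothetical name).
ioc_schema = {
    "type": "object",
    "properties": {"indicator": {"type": "string"}},
    "required": ["indicator"],
}

# Vague: says what the tool is, but not when the model should reach for it.
vague_tool = {
    "name": "lookup_ioc",
    "description": "Looks up an IOC.",
    "input_schema": ioc_schema,
}

# Specific: states the trigger (an indicator found in an alert) and the payoff.
specific_tool = {
    "name": "lookup_ioc",
    "description": (
        "Use this to enrich an IP/domain/hash found in an alert. "
        "Returns what is known about that indicator."
    ),
    "input_schema": ioc_schema,
}

print(specific_tool["description"])
```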

Why use a framework? (9b)

The manual loop is great for understanding, and fine for small things. But real agents need more: running the loop, parsing many tool calls, memory across turns, retries, streaming, multiple agents working together, tracing what happened. Writing all that by hand gets old. A framework packages it so you focus on the tools and the task.

We use LangGraph. Its big idea is modelling an agent as a graph: nodes (steps) connected by edges, with a shared state flowing through. That graph model is what makes complex, branching, long-running agents manageable. For the common case, LangGraph ships a ready-made ReAct agent so you don't build the graph by hand:

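from langchain_anthropic import ChatAnthropic
from langgraph.prebuilt import create_react_agent
from langgraph.checkpoint.memory import MemorySaver

# lookup_ioc and search_logs (the @tool functions) and SYSTEM (the system
# prompt) are assumed to be defined elsewhere in the lab code.
model = ChatAnthropic(model="claude-opus-4-8", max_tokens=1024)
agent = create_react_agent(model, tools=[lookup_ioc, search_logs],
                           prompt=SYSTEM, checkpointer=MemorySaver())

# thread_id names the conversation, so the checkpointer can recall earlier turns.
config = {"configurable": {"thread_id": "case-001"}}
result = agent.invoke({"messages": [{"role": "user", "content": "Triage this alert..."}]}, config)
print(result["messages"][-1].content)  # the last message is the agent's final answer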

Compare that to 9a: the loop is gone, because LangGraph runs it. Tools are the same idea, just decorated with @tool so their schema comes from the function signature and docstring automatically.
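To see what "schema comes from the function signature and docstring" means, here is a simplified, standard-library sketch of the idea. It is not LangChain's implementation (there, the decorator is imported with from langchain_core.tools import tool); describe_tool and the type mapping are hypothetical names made up for this illustration, and the search_logs signature is assumed.

```python
import inspect
from typing import get_type_hints

# Minimal mapping from Python types to JSON schema types (assumption: enough
# for simple tools; unknown or missing hints fall back to "string").
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(func):
    """Build a tool definition from a function's name, docstring, and hints."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the model must supply it
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


def search_logs(query: str, limit: int = 20) -> str:
    """Use this to find log lines related to an indicator from an alert."""
    return ""


definition = describe_tool(search_logs)
print(definition["input_schema"])
```

The practical consequence is that the docstring becomes the tool description, so it deserves the same care as a hand-written description.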

Memory

Agents need to remember. Two kinds show up in the lab:

- Conversation memory: the agent recalls earlier turns. checkpointer=MemorySaver() plus a thread_id does this: each thread is one ongoing conversation it can refer back to.
- Tool memory: a tool that stores things (the helper's save_note/list_notes). The agent writes and reads its own notes.

(Bigger systems add long-term memory in a database or a vector store; RAG and agents combine well.)
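Both kinds of memory can be sketched in plain Python. This shows the idea only: it is not how MemorySaver works internally, and the bodies of save_note/list_notes here are assumptions rather than the helper's actual code. ThreadMemory is a hypothetical class name.

```python
from collections import defaultdict


class ThreadMemory:
    """Conversation memory: one message history per thread_id."""

    def __init__(self):
        self._threads = defaultdict(list)

    def append(self, thread_id, role, content):
        self._threads[thread_id].append({"role": role, "content": content})

    def history(self, thread_id):
        # Return a copy so callers cannot mutate the stored history.
        return list(self._threads[thread_id])


# Tool memory: ordinary tools whose job is to store and retrieve text.
_notes = []


def save_note(text: str) -> str:
    """Use this to record a finding you will need later in the investigation."""
    _notes.append(text)
    return f"saved note {len(_notes)}"


def list_notes() -> str:
    """Use this to review the notes saved so far."""
    return "\n".join(f"{i}. {n}" for i, n in enumerate(_notes, 1)) or "no notes yet"


memory = ThreadMemory()
memory.append("case-001", "user", "Triage this alert...")
memory.append("case-002", "user", "A different case")
print(len(memory.history("case-001")))  # 1: threads do not see each other

print(save_note("checked 185.220.101.45"))  # saved note 1
print(list_notes())                         # 1. checked 185.220.101.45
```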

The headline project: a SOC assistant (synthetic data)

The security build is an agent that does what a SOC (Security Operations Center) analyst does at Level 1 (enrich indicators, summarize) and Level 2 (correlate, assist an investigation): given an alert, it enriches indicators (lookup_ioc), finds related activity (search_logs), correlates them, and summarizes with a suggested next step. It's a genuinely useful shape, and a perfect agent task because it's multi-step and tool-driven.
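A rough sketch of what the two tools might look like over synthetic data follows. The table contents, log lines, hostnames, and the limit parameter are all invented for this example; the course's tools.py may differ.

```python
# Synthetic data invented for this sketch. Not real intel or real logs.
_IOC_TABLE = {
    "185.220.101.45": {"type": "ip", "verdict": "suspicious",
                       "note": "synthetic example entry"},
}
_LOG_LINES = [
    "2024-01-01T10:00:00Z host=ws-01 dst=185.220.101.45 action=allow",
    "2024-01-01T10:00:05Z host=ws-02 dst=192.0.2.10 action=allow",
    "2024-01-01T10:01:00Z host=ws-01 dst=185.220.101.45 action=deny",
]


def lookup_ioc(indicator: str) -> str:
    """Use this to enrich an IP/domain/hash found in an alert."""
    entry = _IOC_TABLE.get(indicator.strip())
    if entry is None:
        return f"{indicator}: no record"
    return (f"{indicator}: type={entry['type']} "
            f"verdict={entry['verdict']} ({entry['note']})")


def search_logs(query: str, limit: int = 20) -> str:
    """Use this to find log lines that mention an indicator or hostname."""
    matches = [line for line in _LOG_LINES if query in line]
    return "\n".join(matches[:limit]) if matches else "no matching log lines"


# The two observations an agent would correlate for this alert.
print(lookup_ioc("185.220.101.45"))
print(search_logs("185.220.101.45"))
```

Both tools return plain text because their output goes straight back to the model as an observation; the correlation and the summary are the model's job.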

Responsible use. All data here is synthetic (made up in tools.py). This teaches investigation and summarization only; the agent takes no real action. Never connect an agent like this to real systems, logs, or intel feeds without authorization. Giving an AI the power to act raises real safety questions, which is the whole of M10.

The wider landscape (survey: reference only, don't install them all)

You learned one framework deeply; here's the map so the names make sense. Pick by the job, not the hype.

Tool	What it is / when you'd reach for it
LangGraph (what we used)	Graph-based, stateful agents; fine control over complex/branching flows.
CrewAI	"Crews" of role-playing agents that collaborate (researcher + writer + reviewer).
AutoGen / AG2	Multi-agent conversations: agents talk to each other to solve a task.
OpenAI Agents SDK	OpenAI's first-party framework for building agents on their models.
Claude Agent SDK	Anthropic's SDK for building agents like Claude Code (tools, files, long tasks).
LlamaIndex	Data/RAG-first; agents that reason over your indexed documents.
smolagents	Minimal (Hugging Face); agents that write small bits of code to act.
Hermes	Open-weight models (NousResearch) tuned for function-calling/agentic use. A model choice, not a framework.
MCP (Model Context Protocol)	Not a framework, a standard. A common "plug" so any model/app can connect to tools and data sources (databases, GitHub, your files) without custom glue. Think "USB-C for AI tools."

The throughline: they all implement the same reason→act→observe loop you built in 9a. Once you understand the loop, every framework is a different convenience layer over it.

Agents are powerful and risky

An agent that can act can act wrongly: it can call the wrong tool, be tricked by a malicious input into doing something harmful (prompt injection), or be given more power than the task needs (excessive agency). The safeguards: give an agent the fewest tools it needs, make hard-to-reverse actions (delete, send, pay) require human approval, validate tool inputs, and log everything. That's M10 (evaluation, guardrails, and security), and it matters most precisely because agents can do things.
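The four safeguards can be sketched as a wrapper that every tool call passes through. This is a minimal illustration under assumptions, not M10's material: the tool names delete_record and send_email, the approve callback, and the deliberately crude argument check are all hypothetical.

```python
import logging

logger = logging.getLogger("agent.tools")

ALLOWED_TOOLS = {"lookup_ioc", "search_logs"}      # fewest tools the task needs
NEEDS_APPROVAL = {"delete_record", "send_email"}   # hard-to-reverse actions


def deny_all(name, arguments):
    """Default approval policy: nobody approved, so refuse."""
    return False


def guarded_call(name, arguments, tool_functions, approve=deny_all):
    """Run a tool only if it passes the allow-list, validation, and approval."""
    logger.info("tool requested: %s %r", name, arguments)  # log everything
    if name not in (ALLOWED_TOOLS | NEEDS_APPROVAL) or name not in tool_functions:
        return f"error: tool '{name}' is not available"
    # Crude input validation: a dict of simple scalar values only.
    if not isinstance(arguments, dict) or not all(
        isinstance(value, (str, int, float, bool)) for value in arguments.values()
    ):
        return "error: invalid arguments"
    if name in NEEDS_APPROVAL and not approve(name, arguments):
        return f"error: '{name}' requires human approval, which was not given"
    try:
        result = tool_functions[name](**arguments)
    except TypeError as exc:
        # Wrong or missing argument names end up here.
        return f"error: {exc}"
    logger.info("tool finished: %s", name)
    return result


tools = {
    "lookup_ioc": lambda indicator: f"{indicator}: no record",
    "delete_record": lambda record_id: f"deleted {record_id}",
}

print(guarded_call("lookup_ioc", {"indicator": "198.51.100.7"}, tools))
print(guarded_call("delete_record", {"record_id": "42"}, tools))
print(guarded_call("run_shell", {"cmd": "ls"}, tools))
```

Refusals are returned as text rather than raised, so the model reads them as an observation and can explain to the user why it could not proceed.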

Go deeper (optional, not needed for today's win)

- **Server-side tools:** Anthropic offers tools that run on *their* side (e.g. **web search**, code execution); you just declare them and skip writing the function. Great for search; the client-side tools in 9a are better for learning because you see the loop.
- **The SDK has its own loop runner:** `anthropic`'s `tool_runner` (with `@beta_tool`) runs the manual loop for you without a separate framework. It is a lighter middle ground between 9a and LangGraph.
- **Parallel tool calls:** a model can request several tools at once; you run them all and return all results in one turn (our loop already handles a list).
- **Tracing:** frameworks integrate with tools like LangSmith to *see* each step an agent took. This is vital for debugging agents, which fail in more interesting ways than plain calls.
- **Coding agents** are agents specialized for software work, with file/bash/edit tools: **Claude Code**, **Cursor**, **OpenAI Codex**, and SDKs to build your own (**Claude Agent SDK**, **Google ADK**). Same ReAct loop you built, with a curated tool surface for editing code.
- **Common agent tools** are just functions: besides the calculator/SOC tools here, real agents get a **web_search** tool (Anthropic offers a server-side one), a **SQL/database** query tool, an HTTP call, a calendar, or anything else you can wrap in a function with a good description.
- **Long-run context:** as an agent runs many steps its context fills up. **Context compaction** (summarize older turns to free space) and **context isolation** (give a sub-agent only what it needs) keep long agents working. Frameworks and the API offer these; reach for them when a long agent starts forgetting or hitting limits.
- **Building MCP yourself:** beyond *using* MCP servers, you can **build an MCP server** (wrap your tools/data so any MCP-aware app can use them) or an **MCP client** (let your app consume MCP servers). The official MCP SDKs make both a small amount of code, which makes this a great next step after this module.
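As one illustration of the long-run context point, context compaction can be sketched as "replace the oldest messages with a single summary". This is a toy version under assumptions: the function name compact, the keep_last parameter, and the placeholder summary are made up here, and it handles plain text messages only. A real implementation would ask a model to write the summary and would take care not to separate a tool_use request from its tool_result.

```python
def compact(messages, keep_last=4, summarize=None):
    """Replace all but the newest keep_last messages with one summary message."""
    if len(messages) <= keep_last:
        return list(messages)  # nothing to compact yet
    split = len(messages) - keep_last
    older, recent = messages[:split], messages[split:]
    if summarize is None:
        # Placeholder summary; a real system would have a model write this.
        def summarize(msgs):
            return f"[{len(msgs)} earlier messages omitted]"
    summary = {"role": "user", "content": summarize(older)}
    return [summary] + recent


history = [{"role": "user", "content": f"step {i}"} for i in range(10)]
shorter = compact(history, keep_last=4)
print(len(shorter))           # 5: one summary plus the four newest messages
print(shorter[0]["content"])  # [6 earlier messages omitted]
```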

Check yourself

Lock in today's win: answer each in your head, then check the answer.

1. What is an agent, in one line, and what's the ReAct loop?

An agent is a model + tools + a loop. The ReAct loop is reason → act (call a tool) → observe (read the result) → repeat until done. "Multi-step" just means the loop runs more than once.

2. In tool calling from first principles, who actually runs the tool?

You do: your code. The model can't run your functions; it returns a tool_use request, your code executes the real function and sends back a tool_result (with the matching tool_use_id), and the model continues. The model decides which tool; your code does the work.

3. Why does the tool description matter so much?

The model chooses which tool to call almost entirely from its description. Vague descriptions cause wrong-tool or no-tool behavior. Say when to use it, not just what it does: it's the steering wheel.

4. What does a framework like LangGraph give you over the hand-written loop?

It runs the loop for you and adds the production pieces (memory via checkpointer + thread_id, multi-tool handling, retries, streaming, multi-agent flows, tracing), so you focus on the tools and the task. Same ReAct loop underneath; less boilerplate. (LangGraph models it as a stateful graph.)

5. Why is M10 (guardrails/security) especially important for agents?

Because agents can take actions, a mistake or a malicious input can cause real harm (wrong tool, prompt injection, excessive agency). Safeguards (fewest-needed tools, human approval for risky actions, input validation, logging) matter far more once the AI can do things, not just talk.

New words (also in resources/glossary.md): agent, tool, tool calling / function calling, tool_use / tool_result, ReAct loop, multi-step, LangGraph, create_react_agent, checkpointer / conversation memory, MCP, SOC (L1/L2), prompt injection (preview), excessive agency (preview).

Source: original, written for this course. The manual tool loop follows Anthropic's documented tool-use API; the LangGraph code (ChatAnthropic + create_react_agent + MemorySaver) follows LangGraph's current API and was verified to build against langgraph 1.x + langchain-anthropic (graphs compile; the tools run for real, see the solution README). The framework survey is a neutral reference. All security data is synthetic. No third-party text or figures; diagrams are original.