M21 notes: Agent memory and state (the one idea)

The one idea: a model has no memory of its own. Each API call is a blank slate; the only thing it "knows" is what you put in the prompt for that call. So agent memory is not magic storage inside the model. It is your code deciding what to put back into the next prompt. Everything in this module is a strategy for choosing that.

1. Why agents forget

When you call the model, you send messages and get a reply. The model keeps nothing. Call it again and it has no idea what you said before, unless you send the earlier messages again. A chatbot that "holds a conversation" is really just resending the growing message list every turn. Memory is an illusion you create by feeding context back in. Once you see that, the three kinds of memory are just three answers to one question: what is worth feeding back, and how do we store it?

Analogy. The model is a brilliant consultant with no memory who is handed a fresh briefing folder at the start of every meeting. Short-term memory is the notes from this meeting that you keep in the folder. Long-term memory is the filing cabinet you pull relevant past notes from to add to the folder. Checkpointing is photocopying the whole folder so you can pick up exactly where you left off tomorrow.

2. Short-term memory: the conversation, on a budget

Short-term memory is the running conversation within one session. The naive version is "keep every turn and resend them all". That breaks for two reasons: the model has a context limit, and every token you send costs money and time (M20). So short-term memory needs a budget.

In memory.py, ShortTermMemory keeps a list of turns and window() returns only the most recent turns that fit a token budget, dropping the oldest. We estimate tokens as roughly four characters each, which is close enough for budgeting. Drop-oldest is the simplest policy; the lab challenge upgrades it to summarize-oldest, which keeps the gist of old turns for far fewer tokens. Real frameworks (LangGraph, the others in M19) offer both.

Key detail in agent.py: we build the window from memory, send it, get the reply, and only THEN add the new user message and reply to memory. That ordering keeps the current turn out of its own window.

3. Long-term memory: facts that outlive the session

Short-term memory dies when the program ends. Long-term memory is for facts you want next week: the user's name, their preferences, decisions made. Two operations:

remember(fact): store a durable fact.
recall(query, k): return the k facts most relevant to what the user is asking now.

recall is the important one. You do not dump every stored fact into the prompt (that wastes tokens and confuses the model); you retrieve only the relevant ones and inject those. This is exactly the retrieval idea from M7 (RAG), now pointed at the agent's own memory instead of your documents.

How recall works here, and how it works in production

To keep this module offline and free, recall scores facts by shared content words (a small bag-of-words cosine, with common glue words filtered out). That matches "what is my name" against "the user's name is Sam", because they share the word "name". But it CANNOT match "what do you know about me" against "the user's name is Sam", because they share no words, even though a human sees they are related.

Real systems fix this with embeddings: each fact and each query becomes a vector that captures meaning, so "me" and "Sam" land near each other even with no shared words. That is precisely the vector store you built in M7 (Chroma, FAISS, and the hosted options). The pattern is identical: remember = add to the store, recall = similarity search. Swap our toy similarity for embeddings and the same MemoryAgent gains true semantic recall. Use the toy version to learn the shape; use a vector store in production.

What to store (and not)

Store durable, useful facts (name, preferences, long-lived decisions). Do not store secrets or sensitive personal data without consent; memory is data you are now keeping, so the privacy rules from M14 apply. Let users see and delete what an agent remembers about them.

4. Checkpointing: pause and resume

Sometimes you need the agent's WHOLE state, not just facts: the conversation so far plus the long-term store, written somewhere so a different process (a server restart, a job that runs tomorrow) can resume exactly where it stopped. agent.py does this with save_state and load_state, which serialize both memories to JSON. Frameworks call this a checkpointer (LangGraph's MemorySaver and its database checkpointers are the same idea), and it is what lets long-running and multi-step agents survive interruptions.

5. The whole picture

Each turn, the memory-augmented agent: 1. recalls relevant long-term facts and puts them in the system prompt, 2. adds the recent short-term window (trimmed to budget), 3. appends the new user message and calls the model, 4. saves the new turn to short-term memory, and optionally stores a new long-term fact.

That is it. No hidden state inside the model, just disciplined choices about what goes into each prompt.

Words you will hear

Short-term vs long-term memory, context window / token budget, recall (retrieval), embeddings vs keyword match, vector store (M7), checkpointing / checkpointer, summarize vs drop-oldest, statefulness. Full definitions in the glossary.