M17: Build a language model from scratch (deep-dive, optional)

You've used LLMs all course. Just once, let's look all the way under the hood and build one. You'll train a tiny language model in ~50 lines, watch its error drop as it learns, then watch it generate text, and then meet the real transformer architecture in miniature. You'll never wonder "but what is it actually doing?" again.

Today's win: you train a language model from scratch (loss goes down, it generates text), and you can explain the transformer's pieces, so an LLM is no longer magic.

Today you will

  • Train a tiny next-character model from scratch (numpy): predict → measure error → nudge weights → repeat
  • Watch the loss drop (that's learning) and generate text by sampling
  • Meet a real transformer in miniature (PyTorch): token + position embeddings, self-attention, blocks
  • See exactly how this scales up to ChatGPT (same idea + data + compute)
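
To make the predict → measure error → nudge weights → repeat loop concrete, here is a minimal numpy sketch. It is not the course's tiny_lm.py: the names train_bigram and generate, the toy text, and the choice of a single weight matrix (a "bigram" model that looks only at the current character) are all assumptions made for illustration.

```python
import numpy as np

def train_bigram(text, steps=200, lr=1.0, seed=0):
    """Hypothetical helper: train a one-matrix next-character model.

    Returns (chars, W, losses), where losses[i] is the error at step i.
    """
    rng = np.random.default_rng(seed)
    chars = sorted(set(text))
    idx = {c: i for i, c in enumerate(chars)}
    ids = np.array([idx[c] for c in text])
    xs, ys = ids[:-1], ids[1:]            # each character predicts the one after it
    n, vocab = len(xs), len(chars)
    # Weights: row = current character, column = score for the next character.
    W = rng.normal(0.0, 0.01, size=(vocab, vocab))
    losses = []
    for _ in range(steps):
        # 1) Predict: scores -> probabilities (softmax over each row).
        logits = W[xs]
        logits = logits - logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        # 2) Measure error: average negative log-probability of the true next char.
        losses.append(float(-np.log(probs[np.arange(n), ys]).mean()))
        # 3) Nudge weights: the gradient of that loss w.r.t. the scores is
        #    (probs - one_hot(target)) / n.
        grad = probs.copy()
        grad[np.arange(n), ys] -= 1.0
        grad /= n
        dW = np.zeros_like(W)
        np.add.at(dW, xs, grad)           # sum gradients for rows that repeat
        W -= lr * dW
    return chars, W, losses

def generate(chars, W, start, length=40, seed=0):
    """Hypothetical helper: sample text one character at a time."""
    rng = np.random.default_rng(seed)
    cur = chars.index(start)
    out = [start]
    for _ in range(length):
        logits = W[cur] - W[cur].max()
        p = np.exp(logits)
        p /= p.sum()
        cur = int(rng.choice(len(chars), p=p))   # sample, don't just take the max
        out.append(chars[cur])
    return "".join(out)

text = "hello world, hello numpy. " * 4   # toy training text (an assumption)
chars, W, losses = train_bigram(text)
print(f"loss before: {losses[0]:.3f}  after: {losses[-1]:.3f}")
print(generate(chars, W, start="h"))
```

The first printed line is the "loss before/after" pair you post in the Show step; the second is the sampled text. Sampling from the probabilities, rather than always taking the most likely character, is what stops the output from getting stuck in a short loop.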

Reality check (and why this is optional): as an AI engineer you almost never build models; you build with them (M0-M16). This module is for understanding, not your day job. It's the only "researcher" lab in the course, included so nothing about LLMs stays mysterious.

Run of show (~50 min)

Time  What we do
0:00  Hook + the win we're chasing
0:05  The one idea: a model is weights you train to predict the next token (full read in notes.md)
0:10  Lab Part A: train the tiny model (numpy); watch loss drop; generate
0:30  Lab Part B: read/run the miniature transformer; find attention & blocks
0:45  Show: post your loss-before/after + a generated snippet
0:50  Wrap: how this becomes an LLM
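
When you go looking for attention in Part B, it may help to have seen the core computation on its own first. The sketch below uses numpy rather than PyTorch and is not the lab's transformer file: it is a single attention head with random, untrained weights, and every name and size in it (causal_self_attention, tok_emb, pos_emb, d_model, and so on) is an assumption chosen for illustration. A real block adds more around this step, such as several heads, a small feed-forward network, and normalisation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head attention sketch. x has shape (T, d): one vector per position.

    Returns (out, weights); weights[i, j] is how much position i looks at position j.
    """
    T = x.shape[0]
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # queries, keys, values
    scores = q @ k.T / np.sqrt(k.shape[-1])           # similarity of each query to each key
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)        # a position may not look ahead
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ v, weights                       # weighted mix of the values

rng = np.random.default_rng(0)
vocab_size, block_size, d_model = 5, 8, 16            # toy sizes (assumptions)
tok_emb = rng.normal(size=(vocab_size, d_model))      # one vector per token id
pos_emb = rng.normal(size=(block_size, d_model))      # one vector per position
ids = np.array([0, 3, 1, 4])                          # a made-up 4-token input
x = tok_emb[ids] + pos_emb[:len(ids)]                 # token + position embeddings

Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
out, weights = causal_self_attention(x, Wq, Wk, Wv)
print(np.round(weights, 2))                           # lower-triangular: no peeking ahead
```

The printed matrix has zeros above the diagonal because of the causal mask: when predicting the next token, each position can only use what came before it. That mask is what lets the same network be trained on every position of a text at once.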

If you get stuck

  • Part A needs only pip install numpy and runs in about a second: no GPU, no key. Part B needs pip install torch (use Python 3.10-3.12; torch lags the newest Python). No key anywhere.
  • The generated text is gibberish-ish on purpose: it's a tiny model on tiny data. The point is the loss dropping and the mechanics, not quality.
  • Maths-shy? You don't need to follow the gradient maths. Just watch what happens (error shrinks, text improves) and read the comments. Nothing here can harm your computer.

Optional challenge

Train tiny_lm.py on a bigger, real text (paste in a paragraph from a book) and bump the training steps. Does the generated text get more word-like? Then skim Karpathy's nanoGPT / "Zero to Hero" (linked in notes) to go from this miniature to a real one.