Lab: M17: train a language model from scratch

You'll need: your venv. Part A: pip install numpy. Part B (optional): pip install torch (use Python 3.10-3.12). No API key anywhere. Time: ~45 min • Work in your breakout pair.

Heads up: the generated text will look like gibberish with a flavour. That's expected from a tiny model on tiny data. Watch the loss drop (that's learning) and read the comments. Quality isn't the point; understanding is. Maths-shy? Just watch what happens. Nothing here can harm your computer.

This lab has two parts:
- Part A: train a tiny model from scratch (numpy) and generate text.
- Part B: read and run a real transformer in miniature (PyTorch).

flowchart LR
  Pred["predict next token"] --> Loss["measure error (loss)"] --> Nudge["nudge weights"] --> Pred
  Pred -->|trained| Gen["generate text"]

Part A: train a model from scratch

Step 1: Set up

pip install numpy

Activate your venv before running the install. Then put tiny_lm.py (from solution/) and train_your_text.py (from starters/) in one folder.

You should now see: Successfully installed numpy-… and the files in your folder.

Step 2: Train it and watch it learn

python tiny_lm.py

You should now see: loss before training ≈ 2.2, then loss after training ≈ 0.5, then a generated string like 'held. worlorlo worlohellld...'. The loss dropping is the model learning to predict the next character; the generated text echoes the training data's patterns. You trained a language model from nothing.
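Why does the loss start near 2.2? A minimal sketch, assuming the vocabulary has 9 distinct characters (which would fit the 81 = 9 × 9 weights mentioned in Step 3) and assuming the untrained model scores every character equally. Check tiny_lm.py for its actual vocabulary and initialization; the names below are illustrative only.

```python
import numpy as np

vocab_size = 9                  # assumption: 9 distinct characters (81 = 9 * 9 weights)
logits = np.zeros(vocab_size)   # assumption: untrained model scores every character equally

# Softmax turns equal scores into a uniform distribution: 1/9 for each character.
probs = np.exp(logits) / np.exp(logits).sum()

# Cross-entropy for one prediction is -log(probability given to the true character).
target = 0                      # any target gives the same loss under a uniform guess
initial_loss = -np.log(probs[target])

print(f"loss of a uniform guess: {initial_loss:.3f}")   # this equals ln(9)
```

Under these assumptions, a loss well below ln(9) means the model is doing better than guessing uniformly at random.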

Step 3: Read the loop

Open tiny_lm.py. Find the four steps: predict (softmax(W[x_ids])), error (cross-entropy), nudge the weights (the gradient + W -= lr*grad), repeated in the loop; then generate (sample the next char, append, repeat).

You should now see / say: "predict → measure error → nudge weights → repeat, then sample to generate." That loop is how every LLM is trained; this one just has 81 weights instead of billions.
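The four steps can be sketched in a few lines of numpy. This is not the lab's tiny_lm.py. It is an illustrative version, assuming a character-level model with one row of next-character scores per current character; the training text, learning rate, step count, and helper names are hypothetical.

```python
import numpy as np

text = "hello world. " * 4               # hypothetical training text
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = np.array([stoi[c] for c in text])
x_ids, y_ids = ids[:-1], ids[1:]         # each character predicts the one after it
V = len(chars)

W = np.zeros((V, V))                     # row i holds the next-char scores after char i
lr = 1.0                                 # hypothetical learning rate

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss_and_grad(W):
    n = len(x_ids)
    probs = softmax(W[x_ids])                              # 1. predict
    loss = -np.log(probs[np.arange(n), y_ids]).mean()      # 2. measure error (cross-entropy)
    dlogits = probs.copy()
    dlogits[np.arange(n), y_ids] -= 1.0                    # gradient of softmax + cross-entropy
    grad = np.zeros_like(W)
    np.add.at(grad, x_ids, dlogits / n)                    # sum gradients for repeated rows
    return loss, grad

loss_before, _ = loss_and_grad(W)
for step in range(500):
    loss, grad = loss_and_grad(W)
    W -= lr * grad                                         # 3. nudge the weights
loss_after, _ = loss_and_grad(W)

# 4. generate: sample the next char from the predicted distribution, append, repeat
rng = np.random.default_rng(0)
idx = stoi["h"]
out = "h"
for _ in range(30):
    p = softmax(W[[idx]])[0]
    idx = rng.choice(V, p=p)
    out += chars[idx]

print(f"weights: {W.size}, loss {loss_before:.2f} -> {loss_after:.2f}")
print(repr(out))
```

With this text the vocabulary has 9 characters, so W is a 9 × 9 table. The model only ever looks at the current character, which is why its output echoes local patterns without forming real words.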

Step 4: Train it on YOUR text

Open train_your_text.py, set TEXT to your own sentences (longer + more repetitive = clearer patterns), and run it.

You should now see: a final loss and a generated snippet that echoes your text's patterns. The model learned from the data you gave it: change the data, change what it learns.

Part B: a real transformer, in miniature

Step 5: (Optional) run the miniature GPT

pip install torch          # Python 3.10-3.12
python nanogpt_mini.py

You should now see: the loss printed every ~100 steps, going down, then a generated chunk of text. This one has real self-attention and transformer blocks: the actual LLM architecture, shrunk. (No torch, or on Python 3.13+? Skip running it; reading it (Step 6) is the real goal.)

Step 6: Find the transformer's pieces

Open nanogpt_mini.py. Locate: token + position embeddings (self.tok, self.pos), self-attention with the causal mask (in Block), the MLP + residuals + layer-norm, and the head that predicts the next token.

You should now see / say: the pieces of a real transformer, embeddings → attention → blocks → predict, and that the training loop is identical to Part A (predict → loss → nudge → repeat).
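To see what the causal mask does without installing torch, here is a single-head self-attention sketch in numpy. It is not taken from nanogpt_mini.py: the sizes, the random weights, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 4, 8                          # hypothetical: 4 tokens, 8-dimensional embeddings
x = rng.normal(size=(T, C))          # stand-in for token + position embeddings

# Random projection matrices stand in for learned query/key/value weights.
Wq, Wk, Wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.T / np.sqrt(C)        # (T, T): how strongly each token attends to each other token

# Causal mask: position t may only look at positions <= t, never at the future.
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)

# Row-wise softmax; masked entries become exactly 0.
scores = scores - scores.max(axis=1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=1, keepdims=True)

attn_out = weights @ v               # each output row mixes values from current and earlier tokens

print(np.round(weights, 2))          # lower-triangular: zeros above the diagonal
```

The first token can only attend to itself, so its row is all weight on position 0. That restriction is what lets the model be trained to predict the next token without seeing it.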

Step 7: Scale it up in your head

Compare: this model has thousands of weights and was trained on one sentence. A frontier LLM has billions of weights and is trained on much of the internet for GPU-months, followed by fine-tuning and RLHF (M15).

You should now see / say: "same loop + same architecture + vastly more scale, data, and compute = an LLM." The black box is now a glass box.
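To put rough numbers on "thousands versus billions", the sketch below counts the weights of a GPT-style model from four settings. The formula assumes a common layout (biases on every linear layer, a 4× wider MLP, two layer-norms per block, an untied output head), and both configurations are hypothetical; neither is the real configuration of nanogpt_mini.py or of any named model.

```python
def gpt_param_count(vocab_size, block_size, d_model, n_layer):
    """Rough weight count for a GPT-style model under the assumed layout."""
    embeddings = vocab_size * d_model + block_size * d_model   # token + position tables
    attention = 4 * (d_model * d_model + d_model)              # q, k, v, output projections (+ biases)
    mlp = (d_model * 4 * d_model + 4 * d_model) \
        + (4 * d_model * d_model + d_model)                    # expand to 4x width, then project back
    layer_norms = 2 * 2 * d_model                              # two layer-norms, scale + shift each
    per_block = attention + mlp + layer_norms
    final_norm = 2 * d_model
    head = d_model * vocab_size + vocab_size                   # next-token prediction layer
    return embeddings + n_layer * per_block + final_norm + head

# Hypothetical tiny config, roughly lab-sized.
tiny = gpt_param_count(vocab_size=9, block_size=16, d_model=16, n_layer=1)

# Hypothetical large config, only to show how the same formula scales.
large = gpt_param_count(vocab_size=50_000, block_size=2048, d_model=4096, n_layer=32)

print(f"tiny:  {tiny:,} weights")
print(f"large: {large:,} weights")
```

Most of the growth comes from the d_model × d_model terms inside each block, multiplied by the number of blocks. The formula is the same in both cases; only the settings change.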

Stuck? Working code is in ../solution/ (tiny_lm.py + nanogpt_mini.py).

Your win

You trained a language model from scratch (loss dropped, it generated text), and you can name the transformer's pieces, so LLMs aren't magic anymore.

Post it to the chat wins board: "Trained a model from scratch, loss 2.2 → 0.5, and it babbles in the style of my text. I see what's under the hood now."

Take-home (optional)

Watch the first hour of Andrej Karpathy's "Neural Networks: Zero to Hero" (or skim nanoGPT). He builds from exactly this kind of tiny model up to a real GPT, line by line, and you'll recognize every piece. Then come back to building with models (the other 16 modules); that's the engineer's job.