Notes: M17: Build a language model from scratch

M0 said an LLM "predicts the next token," trained by adjusting weights. M15 said fine-tuning continues that training. This module makes both concrete in two steps: first the smallest real version, a model you train from nothing, watch learn, and sample from; then the actual transformer in miniature. This is the one deliberately "researcher" lab in the course. You won't do this on the job (you build with models), but after it, nothing about LLMs is a black box.

The whole idea in one loop

A language model does one thing: given some text, predict the next token. Training is a loop:

flowchart LR
  D["data: text → (input, next-token) pairs"] --> P["predict next token"]
  P --> L["measure error (loss)"]
  L --> G["nudge the weights to reduce it (gradient descent)"]
  G --> P
  P -->|"after training"| Gen["generate: sample next token, append, repeat"]

Everything else (embeddings, attention, billions of parameters) is machinery that makes that prediction better. The loop is the soul of it.
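To make the "data" box concrete, here is a minimal sketch of turning text into (input, next-token) pairs with characters as tokens. The toy string and the names `stoi`, `itos`, and `pairs` are illustrative assumptions, not taken from the lab files.

```python
# Hypothetical toy corpus; any string works.
text = "hello world"

# Character-level tokens: every distinct character gets an integer id.
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
itos = {i: ch for ch, i in stoi.items()}       # id -> char
ids = [stoi[ch] for ch in text]

# Each training example is (current token, the token that actually came next).
pairs = list(zip(ids[:-1], ids[1:]))

# Same pairs, shown as characters so they are easy to read.
readable = [(itos[x], itos[y]) for x, y in pairs]
print(readable[:4])  # [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```

A text of N characters yields N - 1 pairs, and the model never sees anything except these pairs.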

Part A: a tiny model from scratch (tiny_lm.py, numpy)

We use characters as tokens (simplest possible). The model is a single weight matrix W (a "bigram neural net"): given the current character, W scores every possible next character; a softmax turns scores into probabilities.

  • Predict: probs = softmax(W[current_char]), the model's guess for the next char.
  • Loss: cross-entropy, which measures how surprised the model was by the true next char. High loss = bad guesses; we want it low.
  • Train (gradient descent): compute how to change W to make the true next char more likely (the gradient), and nudge W that way. Repeat hundreds of times. The loss falls: that is learning. (You'll see it drop from ~2.2 to ~0.5 in the lab.)
  • Generate: start with a character, sample the next from the model's probabilities, append, and repeat. Out comes text in the style of the training data.
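The four steps above can be sketched in numpy as follows. This is a sketch under assumptions, not the course's tiny_lm.py: the toy corpus, learning rate, step count, and helper names (`softmax`, `loss_and_grad`, `generate`) are all made up for illustration, so the loss values you see will differ from the lab's.

```python
import numpy as np

rng = np.random.default_rng(0)

text = "hello hello hello world "            # hypothetical toy corpus
chars = sorted(set(text))
V = len(chars)                                # vocabulary size
stoi = {ch: i for i, ch in enumerate(chars)}
ids = np.array([stoi[ch] for ch in text])
xs, ys = ids[:-1], ids[1:]                    # (current char, next char) pairs

# One weight matrix: row = current char, column = score for each next char.
W = rng.normal(0.0, 0.01, size=(V, V))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss_and_grad(W):
    n = len(xs)
    probs = softmax(W[xs])                    # predict: (n, V) next-char probabilities
    loss = -np.log(probs[np.arange(n), ys]).mean()   # cross-entropy
    # Gradient of cross-entropy w.r.t. the scores is (probs - one_hot(true)) / n.
    dscores = probs.copy()
    dscores[np.arange(n), ys] -= 1.0
    dscores /= n
    dW = np.zeros_like(W)
    np.add.at(dW, xs, dscores)                # accumulate into the rows that were used
    return loss, dW

losses = []
for step in range(300):                       # step count is an arbitrary choice
    loss, dW = loss_and_grad(W)
    losses.append(loss)
    W -= 1.0 * dW                             # gradient descent; learning rate 1.0 is an assumption

def generate(W, start, n_chars, rng):
    out, cur = [start], stoi[start]
    for _ in range(n_chars):
        cur = rng.choice(V, p=softmax(W[cur]))   # sample the next char id
        out.append(chars[cur])
    return "".join(out)

print(f"loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
print(generate(W, "h", 20, rng))
```

With near-zero initial weights the first loss sits close to ln(V), the loss of a uniform guess, and falls from there as the rows of W come to reflect which characters follow which.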

That's a complete, trained-from-scratch language model in ~50 lines, no framework, no GPU. It's not smart (tiny model, tiny data), but the mechanism is exactly an LLM's, just minus the transformer and the scale.

Part B: a real transformer, in miniature (nanogpt_mini.py, PyTorch)

The tiny model only looks at the current character. Real LLMs use the transformer so they can look at all the recent text and decide what matters. The miniature GPT has the real pieces:

  • Token embeddings: turn each token id into a learned vector (its "meaning"; recall embeddings, M7). Position embeddings: add where each token is, since order matters.
  • Self-attention (the key move): for each token, the model looks at the earlier tokens and weighs which ones are relevant to predicting the next one ("attention"). Causal masking stops it peeking at future tokens.
  • Blocks: stack attention + a small MLP, each wrapped with a residual connection and layer-norm (tricks that make deep networks trainable). Stack a few blocks (real LLMs stack dozens).
  • Head: project back to a score for every possible next token → softmax → predict. Same loop as Part A (predict → loss → nudge → repeat), just with a far more capable model.
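To see how those pieces fit together, here is a forward pass through one block, sketched in numpy with random, untrained weights. It is not nanogpt_mini.py. The sizes, the single attention head, the ReLU MLP, layer-norm applied before each sub-layer, and the omission of layer-norm's learned scale and shift are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, T, d = 10, 6, 8   # vocab size, max context length, embedding width (toy values)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's vector to zero mean and unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def init(*shape):
    return rng.normal(0.0, 0.1, size=shape)

params = {
    "tok_emb": init(V, d), "pos_emb": init(T, d),
    "Wq": init(d, d), "Wk": init(d, d), "Wv": init(d, d), "Wo": init(d, d),
    "W1": init(d, 4 * d), "W2": init(4 * d, d),
    "head": init(d, V),
}

def causal_self_attention(x, p):
    t = x.shape[0]
    q, k, v = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])          # relevance of every token to every other
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # causal mask: hide later positions
    weights = softmax(scores)                        # each row sums to 1
    return weights @ v @ p["Wo"], weights

def forward(ids, p):
    x = p["tok_emb"][ids] + p["pos_emb"][: len(ids)]  # token + position embeddings
    attn_out, weights = causal_self_attention(layer_norm(x), p)
    x = x + attn_out                                  # residual connection
    hidden = np.maximum(0.0, layer_norm(x) @ p["W1"]) # small MLP with ReLU
    x = x + hidden @ p["W2"]                          # residual connection
    logits = layer_norm(x) @ p["head"]                # score for every possible next token
    return softmax(logits), weights

ids = np.array([1, 4, 2, 7, 3])                       # hypothetical token ids
probs, weights = forward(ids, params)
print(probs.shape)            # (5, 10): one next-token distribution per position
print(np.round(weights, 2))   # lower-triangular: no weight on future tokens
```

Because of the mask, changing the last token leaves the predictions at every earlier position unchanged, which is exactly what "no peeking at future tokens" means.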

Train it on a tiny text for a minute and the loss drops and it produces more text-like output. This is the architecture behind GPT/Claude/Llama: the main differences are scale (billions of parameters vs our thousands), data (much of the internet vs one sentence), and compute (data-center GPUs vs your laptop).
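A rough way to feel the "thousands vs billions" gap is to count weights. The sketch below counts only the large matrices of a GPT-style model (embeddings, attention projections, MLP, output head) and ignores biases and layer-norm parameters. Both configurations are made-up illustrations, not the settings of nanogpt_mini.py or of any named model.

```python
def count_params(vocab_size, context_len, d_model, n_layers):
    """Approximate weight count for a GPT-style model (large matrices only)."""
    embeddings = vocab_size * d_model + context_len * d_model  # token + position
    attention = 4 * d_model * d_model          # query, key, value, output projections
    mlp = 2 * d_model * (4 * d_model)          # expand to 4x width, then project back
    head = d_model * vocab_size                # scores for every possible next token
    return embeddings + n_layers * (attention + mlp) + head

# Hypothetical laptop-sized configuration.
tiny = count_params(vocab_size=30, context_len=16, d_model=16, n_layers=2)

# Hypothetical large configuration, chosen only to show the order of magnitude.
big = count_params(vocab_size=50_000, context_len=2_048, d_model=4_096, n_layers=32)

print(f"tiny: {tiny:,} parameters")   # tiny: 7,360 parameters
print(f"big:  {big:,} parameters")    # big:  6,860,439,552 parameters
```

The formula is the same in both calls; only the numbers fed into it change, which is the point of this section.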

How this becomes an LLM

flowchart LR
  Mini["nanogpt_mini.py<br/>~thousands of params · 1 sentence"] --> Scale["× billions of params<br/>× the internet<br/>× GPU-months"] --> LLM["a frontier LLM<br/>(GPT / Claude / Llama)"]
  LLM --> Post["+ fine-tuning (M15)<br/>+ RLHF → helpful & safe"]

Pre-training is this loop at massive scale; then fine-tuning and RLHF (M15) turn the raw next-token predictor into a helpful assistant. Same loop you just built, all the way up.

Why an AI engineer rarely does this

Training a real model takes millions of dollars, huge datasets, and GPU clusters, and you don't need to: capable models already exist, and you build with them (prompting, RAG, agents, fine-tuning). This module is for understanding, so you can reason about context windows, tokens, hallucination, and fine-tuning from a place of knowing what's underneath, not so you'll train one.

Go deeper (optional)

  • Andrej Karpathy's "Neural Networks: Zero to Hero" and nanoGPT are the classic next step: he builds up from exactly this kind of tiny model to a real GPT, explaining every line.
  • Tokenization in real models uses subword tokens (BPE), not single characters, which is more efficient (recall tokens, M0/M6).
  • Why attention won: earlier models (RNNs) processed text one step at a time; the transformer attends to all positions in parallel, which trains far better on modern GPUs (recall M1 of Course 01: GPUs do parallel math).
  • Loss & perplexity: the cross-entropy loss here, exponentiated, is "perplexity", a standard measure of how well a language model predicts text.
  • This is deep learning, not AI engineering. If you love this part, the ML-engineer / researcher path (PyTorch, math, training) is a whole field, separate from building apps with models.
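The perplexity note can be checked with a few lines of arithmetic. This sketch assumes the loss is measured with the natural logarithm (nats), which is the usual convention but is an assumption about the lab code; the helper name `perplexity` is illustrative.

```python
import math

def perplexity(true_token_probs):
    """exp(mean cross-entropy), given the probability assigned to each true next token."""
    cross_entropy = sum(-math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(cross_entropy)

# A model that guesses uniformly over 8 characters is "8 ways confused".
uniform = perplexity([1 / 8] * 5)
print(round(uniform, 3))        # 8.0

# The loss values quoted for the lab, read as perplexities (assuming nats).
print(round(math.exp(2.2), 2))  # 9.03
print(round(math.exp(0.5), 2))  # 1.65
```

Read this way, a drop in loss from about 2.2 to about 0.5 means the model goes from hesitating between roughly nine equally likely characters to fewer than two.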

Check yourself

Lock in today's win: answer each in your head, then check the answer below it.

1. What is the core training loop of a language model?

Predict the next token → measure the error (loss) → nudge the weights to reduce it (gradient descent) → repeat. After training, generate by sampling the next token and appending, over and over. Everything else is machinery to make the prediction better.

2. In tiny_lm.py, what does "the loss dropping" mean?

The model is learning: its predicted probabilities for the true next character are getting higher (cross-entropy loss measures surprise; lower = better guesses). It falls from ~2.2 to ~0.5 as gradient descent tunes the weights.

3. What does self-attention do in a transformer?

For each token, it looks at the earlier tokens and weighs which ones matter for predicting the next one, so the model uses context, not just the current token. Causal masking prevents it from peeking at future tokens.

4. Name the main pieces of the miniature transformer.

Token + position embeddings, self-attention (causal), blocks of attention + an MLP with residuals and layer-norm, and a head that projects to next-token scores → softmax. Stack blocks for depth.

5. What turns this miniature into a real LLM, and why don't you build one on the job?

Scale (billions of parameters), data (much of the internet), and compute (GPU-months), then fine-tuning + RLHF (M15) to make it helpful/safe. You don't build one because it costs millions and capable models already exist; as an AI engineer you build with them.


New words (also in resources/glossary.md): language model (recap), loss / cross-entropy, gradient descent (recap), self-attention, causal masking, token/position embedding, residual connection, layer-norm, MLP, nanoGPT, perplexity.

Source: original, written for this course; inspired by (not copied from) Andrej Karpathy's nanoGPT / "Zero to Hero" approach, which is credited as the go-deeper resource. tiny_lm.py (numpy) was trained and run for real (loss 2.2 → 0.5, generates text); nanogpt_mini.py (PyTorch) follows standard transformer structure and is syntax-verified (torch needed to train, pilot on Python 3.10-3.12). Diagrams are original.