Training and Results
Previously Defined
- `gpt()`, `params` — model and parameters from Step 4
- Adam optimizer: momentum (`m`) + adaptive rate (`v`) + bias correction
- Per-parameter buffers tracking gradient statistics
The training loop
The full training loop, with every piece we’ve built across all five steps:
```python
for step in range(num_steps):
    # Tokenize a training example
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # Backward pass
    loss.backward()

    # Adam update
    lr_t = learning_rate * (1 - step / num_steps)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0
```
Everything here is familiar. The forward pass calls gpt() from Step 4, the backward pass uses the autograd from Step 2, and the Adam update replaces the single SGD line from Step 1. Linear learning rate decay still applies — Adam’s adaptive rates work on top of the global schedule, not instead of it.
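As a quick sanity check on the bias correction (an illustrative calculation, not from the text): because the `m` buffer starts at zero, the very first update stores only one-tenth of the gradient, and dividing by `1 - beta1 ** 1` restores its true scale.

```python
# Illustrative numbers (beta1 = 0.9 is the standard Adam default, assumed here)
beta1 = 0.9
grad = 0.5                             # pretend first gradient for one parameter

m = beta1 * 0.0 + (1 - beta1) * grad   # buffer starts at 0, so m = 0.05
m_hat = m / (1 - beta1 ** 1)           # correction at step 0: divide by 0.1
print(m, m_hat)                        # m is biased toward 0; m_hat recovers 0.5
```

Without the correction, early steps would be artificially small; as `step` grows, `1 - beta1 ** (step + 1)` approaches 1 and the correction fades away.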
Parameter count
Still 4,192 parameters — Adam doesn’t change the model, only how it’s trained. The optimizer adds two buffers of 4,192 floats each (m and v), but these aren’t model parameters — they’re optimizer state that isn’t used during inference.
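A minimal sketch of that optimizer state, assuming `params` is the flat list of scalar parameters from Step 4 (toy values stand in for the real 4,192):

```python
params = [0.1, -0.2, 0.3]     # stand-in for the 4,192 real parameters
m = [0.0] * len(params)       # first moment: running mean of gradients
v = [0.0] * len(params)       # second moment: running mean of squared gradients

# Two extra floats per parameter -- optimizer state, not model weights,
# so inference never touches m or v.
print(len(m) + len(v))        # twice the parameter count
```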
Try it
Hover to compare the loss curves across all steps. Toggle each series on or off:
What the loss curve shows
Adam’s loss curve drops faster and reaches a lower final loss than SGD. The adaptive per-parameter learning rates let the optimizer navigate the loss landscape more efficiently — parameters that need larger steps get them, and parameters that need smaller steps are protected from overshooting.
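That normalization can be sketched numerically (an illustrative toy using standard Adam defaults, not values quoted in the text): starting from zero buffers, a single Adam step has magnitude close to `lr` no matter how large or small the gradient is, because `m_hat / sqrt(v_hat)` cancels the gradient's scale.

```python
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8   # assumed hyperparameters

def adam_step_size(grad, step=1, m=0.0, v=0.0):
    # One Adam update for a single parameter, starting from zero buffers
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    return lr * m_hat / (v_hat ** 0.5 + eps)

# A gradient of 100 and a gradient of 0.001 produce nearly the same step (~lr),
# which is what protects small-gradient parameters from being starved and
# large-gradient parameters from overshooting.
print(adam_step_size(100.0), adam_step_size(0.001))
```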
Generate names
The trained model is running in your browser right now — all 4,192 parameters. Click Generate to sample new names:
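A sampling loop in the same spirit could look like the sketch below. The toy `gpt()` and `softmax()` stand-ins make it self-contained; in the article the real functions from the earlier steps would be used instead, and `probs` would be `Value` objects (so the sampling weights would be `[p.data for p in probs]`).

```python
import math
import random

# Toy stand-ins so the sketch runs on its own; in the article these come
# from the earlier steps (gpt, softmax, BOS, uchars, n_layer).
n_layer, BOS = 1, 0
uchars = ["<BOS>"] + list("abcdefghijklmnopqrstuvwxyz")

def gpt(token_id, pos_id, keys, values):
    return [1.0] * len(uchars)             # uniform logits for the demo

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_name(max_len=16):
    keys = [[] for _ in range(n_layer)]    # fresh KV cache per sample
    values = [[] for _ in range(n_layer)]
    token_id, out = BOS, []
    for pos_id in range(max_len):
        probs = softmax(gpt(token_id, pos_id, keys, values))
        token_id = random.choices(range(len(probs)), weights=probs)[0]
        if token_id == BOS:                # BOS doubles as the end-of-name marker
            break
        out.append(uchars[token_id])
    return "".join(out)

print(generate_name())
```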
The full comparison
| | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|
| Architecture | MLP only | Attention + MLP | Multi-head attention + MLP | Multi-head attention + MLP |
| Attention | None | Single-head (16-dim) | 4 heads (4-dim each) | 4 heads (4-dim each) |
| Layers | 1 (implicit) | 1 (implicit) | Configurable (n_layer) | Configurable (n_layer) |
| Optimizer | SGD (lr=0.1) | SGD (lr=0.1) | SGD (lr=0.1) | Adam (lr=0.01) |
| Parameters | 3,184 | 4,192 | 4,192 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache | Multi-head, layer loop | Adam optimizer |
| Model function | mlp(token_id) | gpt(token_id, pos_id, keys, values) | gpt(token_id, pos_id, keys, values) | gpt(token_id, pos_id, keys, values) |
What we built
This is the complete microGPT. The same code — with no changes other than the numbers — could scale to a much larger model. Wider embeddings, more heads, more layers, a bigger vocabulary, more training data. The architecture, the autograd, and the optimizer are all here.
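A hypothetical config sketch makes the point concrete (names and the microGPT `n_layer`/`block_size` values are illustrative assumptions; the small column otherwise uses figures quoted in the article — 27 tokens, 4 heads of 4 dims each for a 16-dim embedding — and the large column uses the well-known GPT-2-small sizes for comparison):

```python
# Scaling changes only the numbers, never the code.
micro = dict(vocab_size=27,    n_embd=16,  n_head=4,  n_layer=1)   # n_layer assumed
gpt2  = dict(vocab_size=50257, n_embd=768, n_head=12, n_layer=12)  # GPT-2 small

for cfg in (micro, gpt2):
    assert cfg["n_embd"] % cfg["n_head"] == 0   # head_dim must divide evenly
```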
From Step 0’s bigram counting table to Step 5’s Adam-trained GPT transformer, we built every piece by hand:
- Step 0: Counted bigrams. No learning.
- Step 1: An MLP with hand-derived gradients. First learning.
- Step 2: Automatic differentiation. The `Value` class.
- Step 3: Attention. The model can look back.
- Step 4: Multi-head attention and layers. A complete transformer.
- Step 5: Adam. Each parameter learns at its own pace.
The model generates plausible English names from a 27-token vocabulary and 4,192 parameters. The same ideas, at scale, generate language.