Training and Results
Previously Defined
- `gpt()`, `params` — model and parameters from Step 4
- Adam optimizer: momentum (`m`) + adaptive rate (`v`) + bias correction
- Per-parameter buffers tracking gradient statistics
The training loop
The full training loop, with every piece we’ve built across all five steps:
```python
for step in range(num_steps):
    # Tokenize a training example
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # Backward pass
    loss.backward()

    # Adam update
    lr_t = learning_rate * (1 - step / num_steps)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0
```
Everything here is familiar. The forward pass calls gpt() from Step 4, the backward pass uses the autograd from Step 2, and the Adam update replaces the single SGD line from Step 1. Linear learning rate decay still applies — Adam’s adaptive rates work on top of the global schedule, not instead of it.
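As a quick sanity check on the bias correction (an illustrative calculation, not from the text): because the `m` buffer starts at zero, the very first update stores only one-tenth of the gradient, and dividing by `1 - beta1 ** 1` restores its true scale.

```python
# Illustrative numbers (beta1 = 0.9 is the standard Adam default, assumed here)
beta1 = 0.9
grad = 0.5                             # pretend first gradient for one parameter

m = beta1 * 0.0 + (1 - beta1) * grad   # buffer starts at 0, so m = 0.05
m_hat = m / (1 - beta1 ** 1)           # correction at step 0: divide by 0.1
print(m, m_hat)                        # m is biased toward 0; m_hat recovers 0.5
```

Without the correction, early steps would be artificially small; as `step` grows, `1 - beta1 ** (step + 1)` approaches 1 and the correction fades away.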
Parameter count
Still 4,192 parameters — Adam doesn’t change the model, only how it’s trained. The optimizer adds two buffers of 4,192 floats each (m and v), but these aren’t model parameters — they’re optimizer state that isn’t used during inference.
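A minimal sketch of that optimizer state, assuming `params` is the flat list of scalar parameters from Step 4 (toy values stand in for the real 4,192):

```python
params = [0.1, -0.2, 0.3]     # stand-in for the 4,192 real parameters
m = [0.0] * len(params)       # first moment: running mean of gradients
v = [0.0] * len(params)       # second moment: running mean of squared gradients

# Two extra floats per parameter -- optimizer state, not model weights,
# so inference never touches m or v.
print(len(m) + len(v))        # twice the parameter count
```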
Try it
Hover to compare the loss curves across all steps. Toggle each series on or off:
What the loss curve shows
Adam’s loss curve drops faster and reaches a lower final loss than SGD. The adaptive per-parameter learning rates let the optimizer navigate the loss landscape more efficiently — parameters that need larger steps get them, and parameters that need smaller steps are protected from overshooting.
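That normalization can be sketched numerically (an illustrative toy using standard Adam defaults, not values quoted in the text): starting from zero buffers, a single Adam step has magnitude close to `lr` no matter how large or small the gradient is, because `m_hat / sqrt(v_hat)` cancels the gradient's scale.

```python
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8   # assumed hyperparameters

def adam_step_size(grad, step=1, m=0.0, v=0.0):
    # One Adam update for a single parameter, starting from zero buffers
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** step)
    v_hat = v / (1 - beta2 ** step)
    return lr * m_hat / (v_hat ** 0.5 + eps)

# A gradient of 100 and a gradient of 0.001 produce nearly the same step (~lr),
# which is what protects small-gradient parameters from being starved and
# large-gradient parameters from overshooting.
print(adam_step_size(100.0), adam_step_size(0.001))
```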
Generate names
The trained model is running in your browser right now — all 4,192 parameters. Click Generate to sample new names:
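A sampling loop in the same spirit could look like the sketch below. The toy `gpt()` and `softmax()` stand-ins make it self-contained; in the article the real functions from the earlier steps would be used instead, and `probs` would be `Value` objects (so the sampling weights would be `[p.data for p in probs]`).

```python
import math
import random

# Toy stand-ins so the sketch runs on its own; in the article these come
# from the earlier steps (gpt, softmax, BOS, uchars, n_layer).
n_layer, BOS = 1, 0
uchars = ["<BOS>"] + list("abcdefghijklmnopqrstuvwxyz")

def gpt(token_id, pos_id, keys, values):
    return [1.0] * len(uchars)             # uniform logits for the demo

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate_name(max_len=16):
    keys = [[] for _ in range(n_layer)]    # fresh KV cache per sample
    values = [[] for _ in range(n_layer)]
    token_id, out = BOS, []
    for pos_id in range(max_len):
        probs = softmax(gpt(token_id, pos_id, keys, values))
        token_id = random.choices(range(len(probs)), weights=probs)[0]
        if token_id == BOS:                # BOS doubles as the end-of-name marker
            break
        out.append(uchars[token_id])
    return "".join(out)

print(generate_name())
```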
The full comparison
| | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|
| Architecture | MLP only | Attention + MLP | Multi-head attention + MLP | Multi-head attention + MLP |
| Attention | None | Single-head (16-dim) | 4 heads (4-dim each) | 4 heads (4-dim each) |
| Layers | 1 (implicit) | 1 (implicit) | Configurable (n_layer) | Configurable (n_layer) |
| Optimizer | SGD (lr=0.1) | SGD (lr=0.1) | SGD (lr=0.1) | Adam (lr=0.01) |
| Parameters | 3,184 | 4,192 | 4,192 | 4,192 |
| New concepts | Autograd | Attention, position embeddings, RMSNorm, residuals, KV cache | Multi-head, layer loop | Adam optimizer |
| Model function | mlp(token_id) | gpt(token_id, pos_id, keys, values) | gpt(token_id, pos_id, keys, values) | gpt(token_id, pos_id, keys, values) |
What we built
This is the complete microGPT. The same code — with no changes other than the numbers — could scale to a much larger model. Wider embeddings, more heads, more layers, a bigger vocabulary, more training data. The architecture, the autograd, and the optimizer are all here.
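A hypothetical config sketch makes the point concrete (names and the microGPT `n_layer`/`block_size` values are illustrative assumptions; the small column otherwise uses figures quoted in the article — 27 tokens, 4 heads of 4 dims each for a 16-dim embedding — and the large column uses the well-known GPT-2-small sizes for comparison):

```python
# Scaling changes only the numbers, never the code.
micro = dict(vocab_size=27,    n_embd=16,  n_head=4,  n_layer=1)   # n_layer assumed
gpt2  = dict(vocab_size=50257, n_embd=768, n_head=12, n_layer=12)  # GPT-2 small

for cfg in (micro, gpt2):
    assert cfg["n_embd"] % cfg["n_head"] == 0   # head_dim must divide evenly
```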
From Step 0’s bigram counting table to Step 5’s Adam-trained GPT transformer, we built every piece by hand:
- Step 0: Counted bigrams. No learning.
- Step 1: An MLP with hand-derived gradients. First learning.
- Step 2: Automatic differentiation. The `Value` class.
- Step 3: Attention. The model can look back.
- Step 4: Multi-head attention and layers. A complete transformer.
- Step 5: Adam. Each parameter learns at its own pace.
The model generates plausible English names from a 27-token vocabulary and 4,192 parameters. The same ideas, at scale, generate language.