MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 5: Adam › 5.4

Training and Results

Previously Defined

  • gpt(), params — model and parameters from Step 4
  • Adam optimizer: momentum (m) + adaptive rate (v) + bias correction
  • Per-parameter buffers tracking gradient statistics

The training loop

The full training loop, with every piece we’ve built across all five steps:

for step in range(num_steps):

    # Tokenize a training example
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = min(block_size, len(tokens) - 1)

    # Forward pass
    keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = gpt(token_id, pos_id, keys, values)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)

    # Backward pass
    loss.backward()

    # Adam update
    lr_t = learning_rate * (1 - step / num_steps)
    for i, p in enumerate(params):
        m[i] = beta1 * m[i] + (1 - beta1) * p.grad
        v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2
        m_hat = m[i] / (1 - beta1 ** (step + 1))
        v_hat = v[i] / (1 - beta2 ** (step + 1))
        p.data -= lr_t * m_hat / (v_hat ** 0.5 + eps_adam)
        p.grad = 0

Everything here is familiar. The forward pass calls gpt() from Step 4, the backward pass uses the autograd from Step 2, and the Adam update replaces the single SGD line from Step 1. Linear learning rate decay still applies — Adam’s adaptive rates work on top of the global schedule, not instead of it.
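To see the shape of that global schedule on its own: a minimal sketch of the linear decay term, using illustrative values for learning_rate and num_steps rather than the guide's actual hyperparameters.

```python
# Linear learning-rate decay: lr_t falls from learning_rate toward 0
# as training progresses. Values below are illustrative only.
learning_rate, num_steps = 0.01, 1000

for step in (0, 250, 500, 750, 999):
    lr_t = learning_rate * (1 - step / num_steps)
    print(f"step {step:3d}: lr_t = {lr_t:.6f}")
# step   0: lr_t = 0.010000
# step 500: lr_t = 0.005000
# step 999: lr_t = 0.000010
```

Adam then rescales this global lr_t per parameter via m_hat and v_hat, which is why the decay and the adaptive rates compose rather than conflict.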

Parameter count

Still 4,192 parameters — Adam doesn’t change the model, only how it’s trained. The optimizer adds two buffers of 4,192 floats each (m and v), but these aren’t model parameters — they’re optimizer state that isn’t used during inference.
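A sketch of that optimizer state, assuming the usual zero-initialization before the first step (the 4,192 count is the model's; everything else here is illustrative):

```python
# Adam keeps two extra floats per model parameter, both starting at zero.
n_params = 4192                 # model parameters (unchanged by Adam)
m = [0.0] * n_params            # first moment: running mean of gradients
v = [0.0] * n_params            # second moment: running mean of squared gradients

# Training holds params + m + v in memory; inference reads only params.
print(len(m) + len(v))          # 8384 extra floats of optimizer state
```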

Try it

Hover to compare the loss curves across all steps. Toggle each series on or off:

What the loss curve shows

Adam’s loss curve drops faster and reaches a lower final loss than SGD. The adaptive per-parameter learning rates let the optimizer navigate the loss landscape more efficiently — parameters that need larger steps get them, and parameters that need smaller steps are protected from overshooting.
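One way to see the "larger steps where needed, smaller where not" effect: for a constant gradient g, Adam's bias-corrected update direction is m_hat / sqrt(v_hat) = g / |g|, so the step size is roughly lr_t regardless of the gradient's magnitude. A minimal sketch, with hypothetical beta and eps values:

```python
# With a constant gradient, Adam's normalized step m_hat / (sqrt(v_hat) + eps)
# approaches 1.0 whether the gradient is large or tiny -- SGD's step, by
# contrast, would differ by a factor of 1000 between these two parameters.
beta1, beta2, eps = 0.9, 0.95, 1e-8   # hypothetical hyperparameters

def adam_step(grad, steps=10):
    m = v = 0.0
    for t in range(1, steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)      # bias-corrected momentum
        v_hat = v / (1 - beta2 ** t)      # bias-corrected variance
    return m_hat / (v_hat ** 0.5 + eps)

print(adam_step(10.0))    # large-gradient parameter -> ~1.0
print(adam_step(0.01))    # small-gradient parameter -> ~1.0
```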

Generate names

The trained model is running in your browser right now — all 4,192 parameters. Click Generate to sample new names:
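Sampling reuses the same forward pass as training, feeding each sampled token back in. A minimal sketch of that loop — the gpt() and softmax() stubs below are hypothetical stand-ins for the real definitions from earlier steps, included only to make the sketch runnable:

```python
import math
import random

uchars = "abcdefghijklmnopqrstuvwxyz"   # stand-in character vocabulary
BOS = 26                                # BOS doubles as the end-of-name token
n_layer = 1                             # stand-in layer count

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gpt(token_id, pos_id, keys, values):
    # Stub: the real model returns 27 learned logits; uniform here.
    return [0.0] * 27

def generate(max_len=16):
    keys = [[] for _ in range(n_layer)]
    values = [[] for _ in range(n_layer)]
    token_id, out = BOS, []
    for pos_id in range(max_len):
        probs = softmax(gpt(token_id, pos_id, keys, values))
        token_id = random.choices(range(27), weights=probs)[0]
        if token_id == BOS:     # sampled the end-of-name token: stop
            break
        out.append(uchars[token_id])
    return "".join(out)

print(generate())
```

Because the KV cache is threaded through keys and values, each new character costs one incremental forward pass rather than re-running the whole prefix.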

The full comparison

|                | Step 2          | Step 3                                                        | Step 4                     | Step 5                     |
|----------------|-----------------|---------------------------------------------------------------|----------------------------|----------------------------|
| Architecture   | MLP only        | Attention + MLP                                               | Multi-head attention + MLP | Multi-head attention + MLP |
| Attention      | None            | Single-head (16-dim)                                          | 4 heads (4-dim each)       | 4 heads (4-dim each)       |
| Layers         | 1 (implicit)    | 1 (implicit)                                                  | Configurable (n_layer)     | Configurable (n_layer)     |
| Optimizer      | SGD (lr=0.1)    | SGD (lr=0.1)                                                  | SGD (lr=0.1)               | Adam (lr=0.01)             |
| Parameters     | 3,184           | 4,192                                                         | 4,192                      | 4,192                      |
| New concepts   | Autograd        | Attention, position embeddings, RMSNorm, residuals, KV cache  | Multi-head, layer loop     | Adam optimizer             |
| Model function | mlp(token_id)   | gpt(token_id, pos_id, keys, values)                           | gpt(token_id, pos_id, keys, values) | gpt(token_id, pos_id, keys, values) |

What we built

This is the complete microGPT. The same code — with no changes other than the numbers — could scale to a much larger model. Wider embeddings, more heads, more layers, a bigger vocabulary, more training data. The architecture, the autograd, and the optimizer are all here.
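As a hypothetical illustration of those knobs (the micro values match this guide's 16-dim, 4-head, 27-token setup, with n_layer assumed to be 1; the larger column uses GPT-2-small's published dimensions):

```python
# The same training loop, different numbers. Only the config changes.
micro  = dict(n_embd=16,  n_head=4,  n_layer=1,  vocab_size=27)
larger = dict(n_embd=768, n_head=12, n_layer=12, vocab_size=50257)  # GPT-2-small scale

# Head dimension = n_embd / n_head in both cases.
print(micro["n_embd"] // micro["n_head"], larger["n_embd"] // larger["n_head"])
```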

From Step 0’s bigram counting table to Step 5’s Adam-trained GPT transformer, we built every piece by hand.

The model generates plausible English names from a 27-token vocabulary and 4,192 parameters. The same ideas, at scale, generate language.
