MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 2: Autograd › 2.8

Same Results, Less Code

Previously Defined

  • Value class — automatic differentiation
  • loss.backward() replaces hand-written gradients
  • Training loop: forward → backward → update → zero
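The Value class was built in the previous section; as a reminder of the idea, here is a compressed, self-contained sketch (not the document's actual implementation — a toy version with only `+` and `*`) running the full forward → backward → update → zero cycle on a one-parameter toy loss:

```python
class Value:
    """Toy scalar autograd node (a sketch, far smaller than the real class)."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # Values this node was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

# forward -> backward -> update -> zero, minimizing (w*x - y)^2
w, x, y = Value(2.0), Value(3.0), 12.0
for _ in range(50):
    diff = w * x + (-y)      # forward
    loss = diff * diff
    loss.backward()          # backward: fills w.grad automatically
    w.data -= 0.05 * w.grad  # update
    w.grad = 0.0             # zero
print(round(w.data, 3))      # w converges toward y/x = 4.0
```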

The inference code is identical to Step 1, with one small change — we need to peek inside the Values:

temperature = 0.5  # temperatures below 1 sharpen the distribution
for sample_idx in range(20):
    token_id = BOS
    sample = []
    for pos_id in range(16):  # cap sample length at 16 tokens
        logits = mlp(token_id)
        probs = softmax([l / temperature for l in logits])
        # random.choices needs plain floats, so unwrap each Value
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:  # model signals end of sample
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")

weights=probs becomes weights=[p.data for p in probs] — we extract the actual numbers from the Values for sampling.
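To see what the temperature division in the sampling loop does, here is a small stand-alone demo with plain floats (this `softmax` is a stand-in for the document's Value-based version):

```python
import math

def softmax(xs):
    # numerically stable softmax on plain floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
p_t1 = softmax([l / 1.0 for l in logits])  # temperature 1.0
p_t05 = softmax([l / 0.5 for l in logits]) # temperature 0.5

print([round(p, 3) for p in p_t1])   # → [0.659, 0.242, 0.099]
print([round(p, 3) for p in p_t05])  # → [0.864, 0.117, 0.019]
```

Dividing the logits by a temperature below 1 scales them up before the softmax, which concentrates probability on the most likely token — samples become more conservative and less random.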

The Big Picture

[Interactive chart: loss curves for Step 1 and Step 2, overlaid for comparison]

Step 1 and Step 2 produce identical results (same random seed, same architecture, same optimizer). The only difference is how gradients are computed:

|                        | Step 1                                | Step 2                       |
|------------------------|---------------------------------------|------------------------------|
| Gradient method        | Hand-derived chain rule               | Automatic (Value class)      |
| Lines of gradient code | ~40 (analytic_gradient)               | 1 (loss.backward())          |
| Adding a new layer     | Re-derive the backward pass           | Just write the forward pass  |
| Speed                  | Faster (plain Python floats)          | Slower (Value object overhead) |
| Correctness            | Must verify by hand or gradient check | Correct by construction      |
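The "gradient check" in the last row refers to comparing a hand-derived gradient against a numerical estimate from finite differences. A minimal sketch on a toy loss (the `f` and `analytic_grad` here are illustrative, not the document's actual model):

```python
def f(w):
    # toy loss: (3w - 12)^2
    return (w * 3.0 - 12.0) ** 2

def analytic_grad(w):
    # hand-derived via the chain rule: 2 * (3w - 12) * 3
    return 2.0 * (w * 3.0 - 12.0) * 3.0

w, eps = 2.0, 1e-6
# centered finite difference approximates df/dw without any derivation
numeric = (f(w + eps) - f(w - eps)) / (2 * eps)
print(numeric, analytic_grad(w))  # the two should agree closely
```

This is exactly the kind of manual verification Step 1 requires for every layer, and Step 2 makes unnecessary.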

The Value class is essentially a minimal autograd engine — the same idea that powers PyTorch’s torch.autograd. We’ll keep using our hand-written Value throughout microGPT, but production frameworks like PyTorch do the same thing in optimized C++/CUDA with tensor operations instead of individual scalars.

What We Gained

The ability to experiment. Want to add another layer? A different activation function? A skip connection? In Step 1, each change requires re-deriving the backward pass. In Step 2, you just write the forward pass and call loss.backward(). The gradients take care of themselves.
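Concretely, extending an autograd engine means writing one forward computation plus its local derivative; the backward pass for any network that uses the new operation comes for free. A toy sketch adding tanh to a minimal node class (same shape of idea as, but not identical to, the document's Value):

```python
import math

class Node:
    """Toy autograd node, just enough to demonstrate adding an op."""
    def __init__(self, data, children=(), local_grads=()):
        self.data, self.grad = data, 0.0
        self.children, self.local_grads = children, local_grads

    def tanh(self):
        # new op: forward value + local derivative, nothing else needed
        t = math.tanh(self.data)
        return Node(t, (self,), (1.0 - t * t,))  # d tanh(x)/dx = 1 - tanh^2(x)

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if id(v) not in seen:
                seen.add(id(v))
                for c in v.children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for c, lg in zip(v.children, v.local_grads):
                c.grad += lg * v.grad

x = Node(0.5)
y = x.tanh()
y.backward()
# sanity-check the autograd result against a finite difference
eps = 1e-6
numeric = (math.tanh(0.5 + eps) - math.tanh(0.5 - eps)) / (2 * eps)
print(x.grad, numeric)  # the two should agree closely
```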

This is what makes modern deep learning research possible. Nobody hand-derives gradients for a billion-parameter model. Autograd does it.
