Same Results, Less Code
Previously Defined
- `Value` class — automatic differentiation
- `loss.backward()` replaces hand-written gradients
- Training loop: forward → backward → update → zero
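As a refresher, the core idea can be sketched in a few dozen lines. This is a simplified stand-in for the `Value` class, not the exact one from the previous step: it supports only `+` and `*` (assume the real class also covers the other operations the MLP needs), but the mechanics of `backward()` are the same.

```python
class Value:
    """A scalar that records the operations applied to it, so gradients
    can be propagated backward through the resulting graph."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # fills in the children's grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topological order guarantees each node's grad is complete
        # before it is propagated to its children
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# loss = a*b + a  ->  dloss/da = b + 1 = -1, dloss/db = a = 2
a, b = Value(2.0), Value(-2.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)  # -1.0 2.0
```

Note that `a` appears twice in the expression; the `+=` in each `_backward` is what accumulates both contributions correctly.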
The inference code is identical to Step 1, with one small change — we need to peek inside the Values:
```python
temperature = 0.5
for sample_idx in range(20):
    token_id = BOS
    sample = []
    for pos_id in range(16):
        logits = mlp(token_id)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
```
`weights=probs` becomes `weights=[p.data for p in probs]`: we extract the plain numbers from the `Value` objects, since `random.choices` expects numeric weights, not autograd objects.
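The temperature division before the softmax is worth a quick look on its own. Dividing the logits by a temperature below 1 sharpens the distribution (sampling becomes more greedy); a temperature above 1 flattens it. A plain-float sketch, independent of the `Value` class:

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 3) for p in probs])
```

With `temperature = 0.5` the top logit takes most of the probability mass; at `2.0` the three options are much closer to uniform.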
The Big Picture
Try it

*(Interactive chart: the Step 1 and Step 2 loss curves, overlaid for comparison, with each series toggleable.)*
Step 1 and Step 2 produce identical results (same random seed, same architecture, same optimizer). The only difference is how gradients are computed:
| | Step 1 | Step 2 |
|---|---|---|
| Gradient method | Hand-derived chain rule | Automatic (`Value` class) |
| Lines of gradient code | ~40 (`analytic_gradient`) | 1 (`loss.backward()`) |
| Adding a new layer | Re-derive the backward pass | Just write the forward pass |
| Speed | Faster (plain Python floats) | Slower (`Value` object overhead) |
| Correctness | Must verify by hand or gradient check | Correct by construction |
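The "gradient check" in the last row is the standard way to validate hand-derived gradients: nudge a parameter by a small ε, measure how the loss changes, and compare against the analytic derivative. A minimal sketch with a hypothetical one-parameter loss (not the microGPT loss):

```python
def loss_fn(w):
    # hypothetical scalar loss in a single parameter w
    return (w * 3.0 - 2.0) ** 2

def numerical_grad(f, w, eps=1e-5):
    # central difference: (f(w+eps) - f(w-eps)) / (2*eps)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 1.5
analytic = 2 * (w * 3.0 - 2.0) * 3.0  # hand-derived chain rule: 15.0
numeric = numerical_grad(loss_fn, w)
print(analytic, numeric)  # the two should agree to high precision
```

In Step 1 this check is the only evidence the ~40 lines of gradient code are right; in Step 2 the same comparison can be run against `.grad` after `loss.backward()`, and it passes by construction.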
The `Value` class is essentially a minimal autograd engine — the same idea that powers PyTorch's `torch.autograd`. We'll keep using our hand-written `Value` throughout microGPT, but production frameworks like PyTorch do the same thing in optimized C++/CUDA, with tensor operations instead of individual scalars.
What We Gained
The ability to experiment. Want to add another layer? A different activation function? A skip connection? In Step 1, each change requires re-deriving the backward pass. In Step 2, you just write the forward pass and call `loss.backward()`. The gradients take care of themselves.
This is what makes modern deep learning research possible. Nobody hand-derives gradients for a billion-parameter model. Autograd does it.