Same Results, Less Code
Previously Defined
- `Value` class — automatic differentiation
- `loss.backward()` replaces hand-written gradients
- Training loop: forward → backward → update → zero
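As a refresher, the core idea can be sketched in a few dozen lines. This is a simplified stand-in for the `Value` class, not the exact one from the previous step: it supports only `+` and `*` (assume the real class also covers the other operations the MLP needs), but the mechanics of `backward()` are the same.

```python
class Value:
    """A scalar that records the operations applied to it, so gradients
    can be propagated backward through the resulting graph."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # fills in the children's grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad    # d(a+b)/da = 1
            other.grad += out.grad   # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topological order guarantees each node's grad is complete
        # before it is propagated to its children
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# loss = a*b + a  ->  dloss/da = b + 1 = -1, dloss/db = a = 2
a, b = Value(2.0), Value(-2.0)
loss = a * b + a
loss.backward()
print(a.grad, b.grad)  # -1.0 2.0
```

Note that `a` appears twice in the expression; the `+=` in each `_backward` is what accumulates both contributions correctly.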
The inference code is identical to Step 1, with one small change — we need to peek inside the Values:
```python
temperature = 0.5
for sample_idx in range(20):
    token_id = BOS
    sample = []
    for pos_id in range(16):
        logits = mlp(token_id)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=[p.data for p in probs])[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
```
`weights=probs` becomes `weights=[p.data for p in probs]`: we extract the plain numbers from the `Value` objects, since `random.choices` expects numeric weights, not autograd objects.
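The temperature division before the softmax is worth a quick look on its own. Dividing the logits by a temperature below 1 sharpens the distribution (sampling becomes more greedy); a temperature above 1 flattens it. A plain-float sketch, independent of the `Value` class:

```python
import math

def softmax(logits):
    # subtract the max for numerical stability before exponentiating
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.0]
for temperature in (0.5, 1.0, 2.0):
    probs = softmax([l / temperature for l in logits])
    print(temperature, [round(p, 3) for p in probs])
```

With `temperature = 0.5` the top logit takes most of the probability mass; at `2.0` the three options are much closer to uniform.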
The Big Picture
Try it

*(Interactive chart: the Step 1 and Step 2 loss curves, overlaid for comparison, with each series toggleable.)*
Step 1 and Step 2 produce identical results (same random seed, same architecture, same optimizer). The only difference is how gradients are computed:
| | Step 1 | Step 2 |
|---|---|---|
| Gradient method | Hand-derived chain rule | Automatic (`Value` class) |
| Lines of gradient code | ~40 (`analytic_gradient`) | 1 (`loss.backward()`) |
| Adding a new layer | Re-derive the backward pass | Just write the forward pass |
| Speed | Faster (plain Python floats) | Slower (`Value` object overhead) |
| Correctness | Must verify by hand or gradient check | Correct by construction |
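The "gradient check" in the last row is the standard way to validate hand-derived gradients: nudge a parameter by a small ε, measure how the loss changes, and compare against the analytic derivative. A minimal sketch with a hypothetical one-parameter loss (not the microGPT loss):

```python
def loss_fn(w):
    # hypothetical scalar loss in a single parameter w
    return (w * 3.0 - 2.0) ** 2

def numerical_grad(f, w, eps=1e-5):
    # central difference: (f(w+eps) - f(w-eps)) / (2*eps)
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 1.5
analytic = 2 * (w * 3.0 - 2.0) * 3.0  # hand-derived chain rule: 15.0
numeric = numerical_grad(loss_fn, w)
print(analytic, numeric)  # the two should agree to high precision
```

In Step 1 this check is the only evidence the ~40 lines of gradient code are right; in Step 2 the same comparison can be run against `.grad` after `loss.backward()`, and it passes by construction.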
The `Value` class is essentially a minimal autograd engine — the same idea that powers PyTorch's `torch.autograd`. We'll keep using our hand-written `Value` throughout microGPT, but production frameworks like PyTorch do the same thing in optimized C++/CUDA, with tensor operations instead of individual scalars.
What We Gained
The ability to experiment. Want to add another layer? A different activation function? A skip connection? In Step 1, each change requires re-deriving the backward pass. In Step 2, you just write the forward pass and call `loss.backward()`. The gradients take care of themselves.
This is what makes modern deep learning research possible. Nobody hand-derives gradients for a billion-parameter model. Autograd does it.