What Changes
The dataset, tokenizer, vocabulary, model architecture, optimizer, and inference code are all identical to Step 1. We’re still training the same 2,480-parameter MLP (multi-layer perceptron) with SGD (stochastic gradient descent).
```python
import random

# load documents: one per line, skipping empty lines
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
uchars = sorted(set(''.join(docs)))  # unique characters = the token vocabulary
BOS = len(uchars)                    # extra token id marking a document boundary
vocab_size = len(uchars) + 1
```
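To make the vocabulary concrete, here is a hedged sketch of how these ids might be used to encode a document. The `encode` helper and the `'hello world'` stand-in charset are illustrative, not from the tutorial:

```python
# Stand-in for the dataset's character set (illustrative only).
uchars = sorted(set('hello world'))
BOS = len(uchars)                      # one extra id for the boundary token
vocab_size = len(uchars) + 1

# Map each character to its integer id.
ctoi = {c: i for i, c in enumerate(uchars)}

def encode(doc):
    # Prepend BOS so the model knows where a document starts.
    return [BOS] + [ctoi[c] for c in doc]
```

With 8 unique characters in the stand-in charset, `BOS` is 8 and `vocab_size` is 9.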
What changes is how we compute gradients.
In Step 1, we wrote the backward pass by hand — deriving the chain rule for each layer and implementing it as 40+ lines of careful index arithmetic. It worked, but it was tedious and error-prone. Every time you change the model, you need to re-derive and re-implement the backward pass.
Step 2 introduces autograd — automatic differentiation. The idea: wrap every number in a Value object that records how it was computed. Then loss.backward() applies the chain rule automatically by walking the computation graph in reverse.
Here’s what we’re removing:
- `forward()` (separate function, 10 lines)
- `numerical_gradient()` (19 lines)
- `analytic_gradient()` (40 lines)
- The gradient check (6 lines)
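For reference, the removed gradient check boils down to comparing an analytic derivative against a centered finite difference. A small illustrative sketch (the function and names here are hypothetical, not the tutorial's code):

```python
def numerical_grad(f, x, eps=1e-6):
    # Centered finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x * x + 3 * x        # example function
analytic = lambda x: 2 * x + 3     # its hand-derived derivative

x = 1.5
assert abs(numerical_grad(f, x) - analytic(x)) < 1e-4
```

Autograd makes this check largely unnecessary: the chain rule is applied mechanically, so there is no hand-derived formula to get wrong.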
And what we’re adding:
- The `Value` class (~45 lines)
The net result: the training loop becomes forward pass → loss.backward() → SGD update. Three steps instead of a carefully hand-crafted backward pass.
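Here is a minimal sketch of what such a `Value` class and the resulting three-step loop can look like. This is an illustrative micrograd-style implementation (supporting only `+` and `*`), not the tutorial's exact code; the one-weight fitting problem at the bottom is a made-up example:

```python
class Value:
    """A scalar that records how it was computed, so gradients can flow back."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # fills in the children's grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so children come before parents,
        # then apply the chain rule in reverse order.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# The three-step loop: fit y = 3x with one weight and squared-error loss.
w, lr = Value(0.0), 0.1
for step in range(50):
    x, y = 2.0, 6.0
    err = w * x + (-y)        # 1. forward pass builds the Value graph
    loss = err * err
    w.grad = 0.0
    loss.backward()           # 2. autograd fills w.grad via the chain rule
    w.data -= lr * w.grad     # 3. plain SGD update
```

After 50 steps `w.data` converges to 3.0. The same three lines work unchanged no matter how the forward pass is restructured, which is exactly what the hand-written backward pass could not offer.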
This is essentially a minimal version of what production frameworks like PyTorch do under the hood.