What Changes
The dataset, tokenizer, vocabulary, model architecture, optimizer, and inference code are all identical to Step 1. We’re still training the same 2,480-parameter MLP (multi-layer perceptron) with SGD (stochastic gradient descent).
```python
import random

# load documents: one per line, skipping empty lines
docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
uchars = sorted(set(''.join(docs)))  # unique characters = the token vocabulary
BOS = len(uchars)                    # extra token id marking a document boundary
vocab_size = len(uchars) + 1
```
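To make the vocabulary concrete, here is a hedged sketch of how these ids might be used to encode a document. The `encode` helper and the `'hello world'` stand-in charset are illustrative, not from the tutorial:

```python
# Stand-in for the dataset's character set (illustrative only).
uchars = sorted(set('hello world'))
BOS = len(uchars)                      # one extra id for the boundary token
vocab_size = len(uchars) + 1

# Map each character to its integer id.
ctoi = {c: i for i, c in enumerate(uchars)}

def encode(doc):
    # Prepend BOS so the model knows where a document starts.
    return [BOS] + [ctoi[c] for c in doc]
```

With 8 unique characters in the stand-in charset, `BOS` is 8 and `vocab_size` is 9.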
What changes is how we compute gradients.
In Step 1, we wrote the backward pass by hand — deriving the chain rule for each layer and implementing it as 40+ lines of careful index arithmetic. It worked, but it was tedious and error-prone. Every time you change the model, you need to re-derive and re-implement the backward pass.
Step 2 introduces autograd — automatic differentiation. The idea: wrap every number in a Value object that records how it was computed. Then loss.backward() applies the chain rule automatically by walking the computation graph in reverse.
Here’s what we’re removing:
- `forward()` (separate function, 10 lines)
- `numerical_gradient()` (19 lines)
- `analytic_gradient()` (40 lines)
- The gradient check (6 lines)
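For reference, the removed gradient check boils down to comparing an analytic derivative against a centered finite difference. A small illustrative sketch (the function and names here are hypothetical, not the tutorial's code):

```python
def numerical_grad(f, x, eps=1e-6):
    # Centered finite difference: (f(x+eps) - f(x-eps)) / (2*eps)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

f = lambda x: x * x + 3 * x        # example function
analytic = lambda x: 2 * x + 3     # its hand-derived derivative

x = 1.5
assert abs(numerical_grad(f, x) - analytic(x)) < 1e-4
```

Autograd makes this check largely unnecessary: the chain rule is applied mechanically, so there is no hand-derived formula to get wrong.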
And what we’re adding:
- The `Value` class (~45 lines)
The net result: the training loop becomes forward pass → loss.backward() → SGD update. Three steps instead of a carefully hand-crafted backward pass.
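Here is a minimal sketch of what such a `Value` class and the resulting three-step loop can look like. This is an illustrative micrograd-style implementation (supporting only `+` and `*`), not the tutorial's exact code; the one-weight fitting problem at the bottom is a made-up example:

```python
class Value:
    """A scalar that records how it was computed, so gradients can flow back."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None  # fills in the children's grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad       # d(a+b)/da = 1
            other.grad += out.grad      # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad  # d(a*b)/da = b
            other.grad += self.data * out.grad  # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph so children come before parents,
        # then apply the chain rule in reverse order.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# The three-step loop: fit y = 3x with one weight and squared-error loss.
w, lr = Value(0.0), 0.1
for step in range(50):
    x, y = 2.0, 6.0
    err = w * x + (-y)        # 1. forward pass builds the Value graph
    loss = err * err
    w.grad = 0.0
    loss.backward()           # 2. autograd fills w.grad via the chain rule
    w.data -= lr * w.grad     # 3. plain SGD update
```

After 50 steps `w.data` converges to 3.0. The same three lines work unchanged no matter how the forward pass is restructured, which is exactly what the hand-written backward pass could not offer.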
This is essentially a minimal version of what production frameworks like PyTorch do under the hood.