The New Training Loop
Previously Defined
- `Value` class records operations and computes gradients
- Model functions use `Value`s — three small changes
The training loop has four parts: setup, forward pass, backward pass, and SGD update.
Setup
```python
num_steps = 1000
learning_rate = 1.0

for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1
```
Identical to Step 1 — pick a name, tokenize it.
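As a concrete sketch of the tokenization (the toy `docs` list here is an assumption, as is reserving the last vocabulary id for `BOS` — Step 1 may define it differently):

```python
docs = ["emma", "ava"]                 # toy dataset of names (assumption)
uchars = sorted(set("".join(docs)))    # character vocabulary: ['a', 'e', 'm', 'v']
BOS = len(uchars)                      # reserve one extra id for BOS (assumption)

doc = docs[0]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
n = len(tokens) - 1                    # number of (input, target) pairs
```

For "emma" this yields `tokens = [4, 1, 2, 2, 0, 4]`: the name's character ids wrapped in `BOS` markers, giving `n = 5` prediction positions.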
Forward pass
```python
    # Forward pass
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = mlp(token_id)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)
```
In Step 1, this was hidden inside `analytic_gradient()`. Now it's inline — and each operation on `Value` objects silently builds the computation graph. The loss uses `-probs[target_id].log()` instead of `-math.log(probs[target_id])` so the operation is recorded.
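A minimal sketch of that recording, assuming a micrograd-style `Value` like the one defined earlier (only the two operations the loss needs are shown): each operation returns a new `Value` that remembers its inputs, so the graph is built as a side effect of computing the number.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = _prev          # the input nodes that produced this one

    def log(self):
        return Value(math.log(self.data), (self,))

    def __neg__(self):
        return Value(-self.data, (self,))

p = Value(0.25)                     # stand-in for probs[target_id]
loss_t = -p.log()                   # builds: p -> log node -> neg node
```

By contrast, `math.log(0.25)` would return a bare float and record nothing; here `loss_t._prev` chains all the way back to `p`, which is exactly what `backward()` will walk.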
Backward pass
```python
    loss.backward()
```
One line. This walks the entire computation graph, applying the chain rule at every node, computing the gradient for all 2,480 parameters at once. It replaces the 40-line `analytic_gradient()` function from Step 1.
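A sketch of what `backward()` does, assuming a micrograd-style `Value` like the one defined earlier: topologically sort the recorded graph, then run each node's local chain-rule step from the loss back to every leaf.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._prev = _prev
        self._backward = lambda: None

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _bw():
            self.grad += (1.0 / self.data) * out.grad   # d/dx log(x) = 1/x
        out._backward = _bw
        return out

    def __neg__(self):
        out = Value(-self.data, (self,))
        def _bw():
            self.grad += -out.grad
        out._backward = _bw
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):               # depth-first topological order
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0             # d(loss)/d(loss) = 1
        for v in reversed(topo):
            v._backward()           # chain rule at every node

p = Value(0.25)
loss = -p.log()                     # loss = -log(p)
loss.backward()
# backward() recovers the analytic derivative: d/dp[-log(p)] = -1/p = -4.0
```

The same mechanism scales from this one-parameter toy to all 2,480 parameters: every leaf reachable from `loss` gets its gradient in a single sweep.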
SGD update
```python
    lr_t = learning_rate * (1 - step / num_steps)
    for p in params:
        p.data -= lr_t * p.grad
        p.grad = 0
```
Almost the same as Step 1, but now we access the number inside each Value:
| Line | What it does |
| --- | --- |
| `p.data -= lr_t * p.grad` | Update the actual number inside the `Value` |
| `p.grad = 0` | Reset the gradient for the next step (gradients accumulate, so we must zero them) |
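One detail worth noting in the code above: `lr_t` is not constant — it decays linearly from `learning_rate` at step 0 toward 0 at the final step. A quick check of the schedule:

```python
num_steps = 1000
learning_rate = 1.0
lrs = [learning_rate * (1 - step / num_steps) for step in (0, 500, 999)]
# starts at 1.0, is 0.5 halfway through, and is nearly 0 by the last step
```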
That `p.grad = 0` is important. Since `backward()` uses `+=` to accumulate gradients, forgetting to zero them would mean gradients from step 1 leak into step 2, and so on. This is the standard pattern: forward → backward → update → zero gradients.
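The leak is easy to demonstrate with a sketch (again assuming a micrograd-style `Value` like the one defined earlier): a parameter reused across steps keeps collecting gradient until it is reset.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._prev = _prev
        self._backward = lambda: None

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _bw():
            self.grad += (1.0 / self.data) * out.grad   # note the +=
        out._backward = _bw
        return out

    def __neg__(self):
        out = Value(-self.data, (self,))
        def _bw():
            self.grad += -out.grad
        out._backward = _bw
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

p = Value(0.25)                 # a "parameter" shared across steps
(-p.log()).backward()           # step 1: p.grad is -4.0
(-p.log()).backward()           # step 2 without zeroing: stale gradient leaks in, p.grad is -8.0
leaked = p.grad
p.grad = 0                      # the fix from the update loop
(-p.log()).backward()           # a clean step: p.grad is -4.0 again
```

The doubled gradient after the second `backward()` is exactly the bug the zeroing line prevents.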