The New Training Loop
Previously Defined
- `Value` class records operations and computes gradients
- Model functions use `Value`s — three small changes
The training loop has four parts: setup, forward pass, backward pass, and SGD update.
Setup
```python
num_steps = 1000
learning_rate = 1.0

for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1
```
Identical to Step 1 — pick a name, tokenize it.
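As a concrete sketch of the tokenization (the toy `docs` list here is an assumption, as is reserving the last vocabulary id for `BOS` — Step 1 may define it differently):

```python
docs = ["emma", "ava"]                 # toy dataset of names (assumption)
uchars = sorted(set("".join(docs)))    # character vocabulary: ['a', 'e', 'm', 'v']
BOS = len(uchars)                      # reserve one extra id for BOS (assumption)

doc = docs[0]
tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
n = len(tokens) - 1                    # number of (input, target) pairs
```

For "emma" this yields `tokens = [4, 1, 2, 2, 0, 4]`: the name's character ids wrapped in `BOS` markers, giving `n = 5` prediction positions.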
Forward pass
```python
    # Forward pass
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = mlp(token_id)
        probs = softmax(logits)
        loss_t = -probs[target_id].log()
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)
```
In Step 1, this was hidden inside `analytic_gradient()`. Now it's inline — and each operation on `Value` objects silently builds the computation graph. The loss uses `-probs[target_id].log()` instead of `-math.log(probs[target_id])` so the operation is recorded.
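A minimal sketch of that recording, assuming a micrograd-style `Value` like the one defined earlier (only the two operations the loss needs are shown): each operation returns a new `Value` that remembers its inputs, so the graph is built as a side effect of computing the number.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data = data
        self.grad = 0.0
        self._prev = _prev          # the input nodes that produced this one

    def log(self):
        return Value(math.log(self.data), (self,))

    def __neg__(self):
        return Value(-self.data, (self,))

p = Value(0.25)                     # stand-in for probs[target_id]
loss_t = -p.log()                   # builds: p -> log node -> neg node
```

By contrast, `math.log(0.25)` would return a bare float and record nothing; here `loss_t._prev` chains all the way back to `p`, which is exactly what `backward()` will walk.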
Backward pass
```python
    loss.backward()
```
One line. This walks the entire computation graph, applying the chain rule at every node, computing the gradient for all 2,480 parameters at once. It replaces the 40-line `analytic_gradient()` function from Step 1.
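A sketch of what `backward()` does, assuming a micrograd-style `Value` like the one defined earlier: topologically sort the recorded graph, then run each node's local chain-rule step from the loss back to every leaf.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._prev = _prev
        self._backward = lambda: None

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _bw():
            self.grad += (1.0 / self.data) * out.grad   # d/dx log(x) = 1/x
        out._backward = _bw
        return out

    def __neg__(self):
        out = Value(-self.data, (self,))
        def _bw():
            self.grad += -out.grad
        out._backward = _bw
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):               # depth-first topological order
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0             # d(loss)/d(loss) = 1
        for v in reversed(topo):
            v._backward()           # chain rule at every node

p = Value(0.25)
loss = -p.log()                     # loss = -log(p)
loss.backward()
# backward() recovers the analytic derivative: d/dp[-log(p)] = -1/p = -4.0
```

The same mechanism scales from this one-parameter toy to all 2,480 parameters: every leaf reachable from `loss` gets its gradient in a single sweep.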
SGD update
```python
    lr_t = learning_rate * (1 - step / num_steps)
    for p in params:
        p.data -= lr_t * p.grad
        p.grad = 0
```
Almost the same as Step 1, but now we access the number inside each Value:
| Line | What it does |
| --- | --- |
| `p.data -= lr_t * p.grad` | Update the actual number inside the `Value` |
| `p.grad = 0` | Reset the gradient for the next step (gradients accumulate, so we must zero them) |
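One detail worth noting in the code above: `lr_t` is not constant — it decays linearly from `learning_rate` at step 0 toward 0 at the final step. A quick check of the schedule:

```python
num_steps = 1000
learning_rate = 1.0
lrs = [learning_rate * (1 - step / num_steps) for step in (0, 500, 999)]
# starts at 1.0, is 0.5 halfway through, and is nearly 0 by the last step
```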
That `p.grad = 0` is important. Since `backward()` uses `+=` to accumulate gradients, forgetting to zero them would mean gradients from step 1 leak into step 2, and so on. This is the standard pattern: forward → backward → update → zero gradients.
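The leak is easy to demonstrate with a sketch (again assuming a micrograd-style `Value` like the one defined earlier): a parameter reused across steps keeps collecting gradient until it is reset.

```python
import math

class Value:
    def __init__(self, data, _prev=()):
        self.data, self.grad = data, 0.0
        self._prev = _prev
        self._backward = lambda: None

    def log(self):
        out = Value(math.log(self.data), (self,))
        def _bw():
            self.grad += (1.0 / self.data) * out.grad   # note the +=
        out._backward = _bw
        return out

    def __neg__(self):
        out = Value(-self.data, (self,))
        def _bw():
            self.grad += -out.grad
        out._backward = _bw
        return out

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

p = Value(0.25)                 # a "parameter" shared across steps
(-p.log()).backward()           # step 1: p.grad is -4.0
(-p.log()).backward()           # step 2 without zeroing: stale gradient leaks in, p.grad is -8.0
leaked = p.grad
p.grad = 0                      # the fix from the update loop
(-p.log()).backward()           # a clean step: p.grad is -4.0 again
```

The doubled gradient after the second `backward()` is exactly the bug the zeroing line prevents.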