Training & Inference
So far
- `analytic_gradient()` — fast gradient computation
- `params` — flat list of all 2,480 parameters
- `mlp(token_id)` — token → 27 logits
- `softmax(logits)` — logits → probabilities
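As a reminder of the last piece of that interface: softmax turns raw logits into a probability distribution. A minimal, numerically stable sketch (one common way to write it; the tutorial's own `softmax` may differ in detail):

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: this avoids overflow
    # and leaves the resulting probabilities unchanged
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1; the largest logit gets the largest share
```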
Training
Now we can train. Each step: pick a name, compute the gradient, nudge every parameter in the direction that reduces loss:
```python
num_steps = 1000
learning_rate = 1.0

for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1
    loss, grad = analytic_gradient(tokens, n)
    lr_t = learning_rate * (1 - step / num_steps)
    for i, (row, j) in enumerate(params):
        row[j] -= lr_t * grad[i]
```
| analytic_gradient(tokens, n) | → | Forward + backward pass: get the loss and all 2,480 gradients |
| lr_t = learning_rate * (1 - step / num_steps) | → | Learning rate decay: start big (1.0), shrink linearly to 0 |
| row[j] -= lr_t * grad[i] | → | SGD update: nudge each parameter opposite to its gradient |
Try it
Watch the model train in real time — the loss curve builds as it learns from one name at a time:
This is stochastic gradient descent (SGD). “Stochastic” because we use one name at a time (not the whole dataset). The gradient points uphill; we go downhill by subtracting it.
The learning rate decay prevents the model from overshooting as it gets close to a good solution — large steps early, fine-tuning later.
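The update rule and decay schedule are easy to watch in isolation. Here is a toy sketch: the same SGD loop with the same linear decay, but minimizing a made-up quadratic f(w) = (w − 3)² instead of the model's loss:

```python
num_steps = 100
learning_rate = 1.0
w = 0.0  # start far from the minimum at w = 3

for step in range(num_steps):
    grad = 2 * (w - 3)  # gradient of (w - 3)^2 points uphill
    lr_t = learning_rate * (1 - step / num_steps)  # linear decay, as above
    w -= lr_t * grad  # subtract the gradient: step downhill

# w ends very close to the minimum at 3
```

With the full learning rate of 1.0, early steps overshoot and oscillate around the minimum; the decaying `lr_t` damps the oscillation until `w` settles, which is exactly the "large steps early, fine-tuning later" behavior described above.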
Inference
Sampling is almost identical to Step 0, with one addition — temperature:
```python
temperature = 0.5

for sample_idx in range(20):
    token_id = BOS
    sample = []
    for pos_id in range(16):  # cap generated names at 16 characters
        logits = mlp(token_id)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=probs)[0]
        if token_id == BOS:  # the model signals "end of name"
            break
        sample.append(uchars[token_id])
    print(''.join(sample))
```
Temperature controls how “sharp” the probability distribution is:
| temperature = 1.0 | → | Use the model’s probabilities as-is |
| temperature < 1.0 | → | More confident: high-probability tokens dominate |
| temperature > 1.0 | → | More random: probabilities flatten out |
Try it
Generated names from the trained MLP. Drag the temperature slider:
Try it
Pick a token and drag the temperature slider to see how the probability distribution changes:
Dividing logits by a small temperature before softmax amplifies the differences. At temperature 0.5, the model is more decisive; at temperature 2.0, it’s more creative (and more likely to produce nonsense).
What’s Next
We now have a neural network that learns by gradient descent. But writing the backward pass by hand was tedious and error-prone — and it gets much worse with deeper networks.
In Step 2, we replace the manual chain rule with automatic differentiation: same math, computed automatically.