Training & Inference
So far
- `analytic_gradient()` — fast gradient computation
- `params` — flat list of all 2,480 parameters
- `mlp(token_id)` — token → 27 logits
- `softmax(logits)` — logits → probabilities
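As a reminder of the last piece of that interface: softmax turns raw logits into a probability distribution. A minimal, numerically stable sketch (one common way to write it; the tutorial's own `softmax` may differ in detail):

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: this avoids overflow
    # and leaves the resulting probabilities unchanged
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# probs sums to 1; the largest logit gets the largest share
```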
Training
Now we can train. Each step: pick a name, compute the gradient, nudge every parameter in the direction that reduces loss:
```python
num_steps = 1000
learning_rate = 1.0

for step in range(num_steps):
    doc = docs[step % len(docs)]
    tokens = [BOS] + [uchars.index(ch) for ch in doc] + [BOS]
    n = len(tokens) - 1
    loss, grad = analytic_gradient(tokens, n)
    lr_t = learning_rate * (1 - step / num_steps)
    for i, (row, j) in enumerate(params):
        row[j] -= lr_t * grad[i]
```
| analytic_gradient(tokens, n) | → | Forward + backward pass: get the loss and all 2,480 gradients |
| lr_t = learning_rate * (1 - step / num_steps) | → | Learning rate decay: start big (1.0), shrink linearly to 0 |
| row[j] -= lr_t * grad[i] | → | SGD update: nudge each parameter opposite to its gradient |
Try it
Watch the model train in real time — the loss curve builds as it learns from one name at a time:
This is stochastic gradient descent (SGD). “Stochastic” because we use one name at a time (not the whole dataset). The gradient points uphill; we go downhill by subtracting it.
The learning rate decay prevents the model from overshooting as it gets close to a good solution — large steps early, fine-tuning later.
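The update rule and decay schedule are easy to watch in isolation. Here is a toy sketch: the same SGD loop with the same linear decay, but minimizing a made-up quadratic f(w) = (w − 3)² instead of the model's loss:

```python
num_steps = 100
learning_rate = 1.0
w = 0.0  # start far from the minimum at w = 3

for step in range(num_steps):
    grad = 2 * (w - 3)  # gradient of (w - 3)^2 points uphill
    lr_t = learning_rate * (1 - step / num_steps)  # linear decay, as above
    w -= lr_t * grad  # subtract the gradient: step downhill

# w ends very close to the minimum at 3
```

With the full learning rate of 1.0, early steps overshoot and oscillate around the minimum; the decaying `lr_t` damps the oscillation until `w` settles, which is exactly the "large steps early, fine-tuning later" behavior described above.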
Inference
Sampling is almost identical to Step 0, with one addition — temperature:
```python
temperature = 0.5

for sample_idx in range(20):
    token_id = BOS
    sample = []
    for pos_id in range(16):  # cap generated names at 16 characters
        logits = mlp(token_id)
        probs = softmax([l / temperature for l in logits])
        token_id = random.choices(range(vocab_size), weights=probs)[0]
        if token_id == BOS:  # the model signals "end of name"
            break
        sample.append(uchars[token_id])
    print(''.join(sample))
```
Temperature controls how “sharp” the probability distribution is:
| temperature = 1.0 | → | Use the model’s probabilities as-is |
| temperature < 1.0 | → | More confident: high-probability tokens dominate |
| temperature > 1.0 | → | More random: probabilities flatten out |
Try it
Generated names from the trained MLP. Drag the temperature slider:
Try it
Pick a token and drag the temperature slider to see how the probability distribution changes:
Dividing logits by a small temperature before softmax amplifies the differences. At temperature 0.5, the model is more decisive; at temperature 2.0, it’s more creative (and more likely to produce nonsense).
What’s Next
We now have a neural network that learns by gradient descent. But writing the backward pass by hand was tedious and error-prone — and it gets much worse with deeper networks.
In Step 2, we replace the manual chain rule with automatic differentiation: same math, computed automatically.