Loss
So far, our bigram model gives us:
probs = bigram(token)  # probability of each possible next token
How do we measure how well the model is doing? We use negative log probability — also called cross-entropy loss.
For each position, the model produces a probability distribution over all 27 possible next tokens. We look at the probability it assigned to the actual next token, take the log, and negate it:
loss_t = -math.log(probs[target_id])
In mathematical notation, the loss for a single name of $n$ token pairs is:
$$\text{loss} = -\frac{1}{n} \sum_{i=1}^{n} \log P(t_i \mid t_{i-1})$$
where $P(t_i \mid t_{i-1})$ is the probability the model assigns to the actual next token $t_i$ given the current token $t_{i-1}$.
If the model assigned high probability to the correct token, the loss is low. If the model was surprised — the correct token had low probability — the loss is high.
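A quick way to see this relationship is to plug a few probabilities into the formula and watch the loss grow as the model's confidence in the correct token shrinks:

```python
import math

# Loss at a single position depends only on the probability the model
# assigned to the actual next token: high probability means low loss,
# and a "surprised" model (low probability) pays a large loss.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"prob={p:<5} loss={-math.log(p):.2f}")
```

Note the asymmetry: going from 0.5 to 0.9 saves a little loss, while going from 0.1 to 0.01 costs a lot.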
A Concrete Example
After 100 training steps, let’s compute the loss for “ava”:
Try it
Step through each token pair to see its probability and loss:
| Pair | Probability | Loss |
| --- | --- | --- |
| BOS → a | 0.134 | −log(0.134) = 2.01 |
| a → v | 0.041 | −log(0.041) = 3.19 |
| v → a | 0.118 | −log(0.118) = 2.14 |
| a → BOS | 0.148 | −log(0.148) = 1.91 |
The a → v pair has the highest loss because “v” is relatively rare after “a”. The model is least surprised by a → BOS because many names end with “a”.
We average the four losses:
loss = (1 / n) * sum(losses)
$$\text{loss} = \frac{2.01 + 3.19 + 2.14 + 1.91}{4} = 2.31$$
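The arithmetic can be reproduced directly; the four probabilities below are hard-coded from the table above:

```python
import math

# Probabilities the model assigned to each pair in "ava":
# BOS → a, a → v, v → a, a → BOS
pair_probs = [0.134, 0.041, 0.118, 0.148]

# Per-pair loss: negative log probability.
losses = [-math.log(p) for p in pair_probs]

# Average over the n = 4 pairs.
avg_loss = sum(losses) / len(losses)

print([round(l, 2) for l in losses])  # [2.01, 3.19, 2.14, 1.91]
print(round(avg_loss, 2))             # 2.31
```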
Why logarithms? Because probabilities multiply (the chance of a whole sequence is the product of each step), and logs turn products into sums — which are easier to work with and more numerically stable.
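This equivalence is easy to check numerically. Using the four probabilities from the "ava" example, the log of the product equals the sum of the logs:

```python
import math

probs = [0.134, 0.041, 0.118, 0.148]

# Multiplying many small probabilities drives the result toward zero
# (and eventually underflows for long sequences)...
product = 1.0
for p in probs:
    product *= p

# ...while summing logs keeps every intermediate value in a
# comfortable numeric range.
log_sum = sum(math.log(p) for p in probs)

print(math.log(product))  # same value,
print(log_sum)            # computed the stable way
```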
The Loss Curve
As training progresses, the model has seen more character pairs and makes better predictions. The loss drops:
Try it
Hover to see the loss at each training step:
The steep drop in the first ~50 steps is the model learning the most common patterns (names often start with certain letters, “th” is common, “a” is often followed by “n”). After that, improvement slows — the easy patterns are learned, and what’s left is rarer combinations.
A loss of 0 would mean the model predicts every next character with 100% confidence. That’s impossible — language has genuine uncertainty. Many names share prefixes (“ma…” could be “mary”, “mark”, “madison”), so the best the model can do is assign reasonable probabilities across the options.