Loss
So far, our bigram model gives us:
probs = bigram(token)  # probability of each possible next token
How do we measure how well the model is doing? We use negative log probability — also called cross-entropy loss.
For each position, the model produces a probability distribution over all 27 possible next tokens. We look at the probability it assigned to the actual next token, take the log, and negate it:
loss_t = -math.log(probs[target_id])
In mathematical notation, the loss for a single name of $n$ token pairs is:
$$\text{loss} = -\frac{1}{n} \sum_{i=1}^{n} \log P(t_i \mid t_{i-1})$$
where $P(t_i \mid t_{i-1})$ is the probability the model assigns to the actual next token $t_i$ given the current token $t_{i-1}$.
If the model assigned high probability to the correct token, the loss is low. If the model was surprised — the correct token had low probability — the loss is high.
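A quick way to see this relationship is to plug a few probabilities into the formula and watch the loss grow as the model's confidence in the correct token shrinks:

```python
import math

# Loss at a single position depends only on the probability the model
# assigned to the actual next token: high probability means low loss,
# and a "surprised" model (low probability) pays a large loss.
for p in [0.9, 0.5, 0.1, 0.01]:
    print(f"prob={p:<5} loss={-math.log(p):.2f}")
```

Note the asymmetry: going from 0.5 to 0.9 saves a little loss, while going from 0.1 to 0.01 costs a lot.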
A Concrete Example
After 100 training steps, let’s compute the loss for “ava”:
Try it
Step through each token pair to see its probability and loss:
| Pair | Probability | Loss |
| --- | --- | --- |
| BOS → a | 0.134 | −log(0.134) = 2.01 |
| a → v | 0.041 | −log(0.041) = 3.19 |
| v → a | 0.118 | −log(0.118) = 2.14 |
| a → BOS | 0.148 | −log(0.148) = 1.91 |
The a → v pair has the highest loss because “v” is relatively rare after “a”. The model is least surprised by a → BOS because many names end with “a”.
We average the four losses:
loss = (1 / n) * sum(losses)
$$\text{loss} = \frac{2.01 + 3.19 + 2.14 + 1.91}{4} = 2.31$$
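The arithmetic can be reproduced directly; the four probabilities below are hard-coded from the table above:

```python
import math

# Probabilities the model assigned to each pair in "ava":
# BOS → a, a → v, v → a, a → BOS
pair_probs = [0.134, 0.041, 0.118, 0.148]

# Per-pair loss: negative log probability.
losses = [-math.log(p) for p in pair_probs]

# Average over the n = 4 pairs.
avg_loss = sum(losses) / len(losses)

print([round(l, 2) for l in losses])  # [2.01, 3.19, 2.14, 1.91]
print(round(avg_loss, 2))             # 2.31
```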
Why logarithms? Because probabilities multiply (the chance of a whole sequence is the product of each step), and logs turn products into sums — which are easier to work with and more numerically stable.
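This equivalence is easy to check numerically. Using the four probabilities from the "ava" example, the log of the product equals the sum of the logs:

```python
import math

probs = [0.134, 0.041, 0.118, 0.148]

# Multiplying many small probabilities drives the result toward zero
# (and eventually underflows for long sequences)...
product = 1.0
for p in probs:
    product *= p

# ...while summing logs keeps every intermediate value in a
# comfortable numeric range.
log_sum = sum(math.log(p) for p in probs)

print(math.log(product))  # same value,
print(log_sum)            # computed the stable way
```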
The Loss Curve
As training progresses, the model has seen more character pairs and makes better predictions. The loss drops:
Try it
Hover to see the loss at each training step:
The steep drop in the first ~50 steps is the model learning the most common patterns (names often start with certain letters, “th” is common, “a” is often followed by “n”). After that, improvement slows — the easy patterns are learned, and what’s left is rarer combinations.
A loss of 0 would mean the model predicts every next character with 100% confidence. That’s impossible — language has genuine uncertainty. Many names share prefixes (“ma…” could be “mary”, “mark”, “madison”), so the best the model can do is assign reasonable probabilities across the options.