MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 0: Counting › 0.6

Loss

So far

  • probs = bigram(token) — probability distribution over the next token

How do we measure how well the model is doing? We use negative log probability — also called cross-entropy loss.

For each position, the model produces a probability distribution over all 27 possible next tokens. We look at the probability it assigned to the actual next token, take the log, and negate it:

loss_t = -math.log(probs[target_id])

In mathematical notation, the average loss over a single name's $n$ token pairs is:

$$\text{loss} = -\frac{1}{n} \sum_{i=1}^{n} \log P(t_i \mid t_{i-1})$$

where $P(t_i \mid t_{i-1})$ is the probability the model assigns to the actual next token $t_i$ given the current token $t_{i-1}$.

If the model assigned high probability to the correct token, the loss is low. If the model was surprised — the correct token had low probability — the loss is high.
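To see how sharply the loss punishes surprise, compare a confident correct prediction against a near-miss (a minimal sketch using Python's `math` module):

```python
import math

# A high probability on the correct token gives a small loss...
print(-math.log(0.9))   # ≈ 0.105

# ...while a token the model thought was nearly impossible costs far more.
print(-math.log(0.01))  # ≈ 4.605
```

The loss grows without bound as the probability approaches zero, which is exactly the pressure that pushes the model away from ruling out tokens that actually occur.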

A Concrete Example

After 100 training steps, let’s compute the loss for “ava”:

Try it

Step through each token pair to see its probability and loss:

BOS (26) → a (0) → v (21) → a (0) → BOS (26)
For each pair, we ask the model: what probability did you assign to the correct next token?
BOS→a   prob = 0.134   loss = −log(0.134) = 2.01
a→v     prob = 0.041   loss = −log(0.041) = 3.19
v→a     prob = 0.118   loss = −log(0.118) = 2.14
a→BOS   prob = 0.148   loss = −log(0.148) = 1.91

The a→v pair has the highest loss because “v” is relatively rare after “a”. The model is least surprised by a→BOS because many names end with “a”.
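The token ids and pairs above can be reproduced in a few lines. This sketch assumes the usual encoding of a=0 through z=25 with BOS as id 26, which matches the ids shown in the widget:

```python
# Vocabulary: a=0 ... z=25, plus a BOS token with id 26
# (the exact encoding here is an assumption matching the ids shown above).
stoi = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
BOS = 26

name = "ava"
ids = [BOS] + [stoi[ch] for ch in name] + [BOS]
print(ids)    # [26, 0, 21, 0, 26]

# The four (current, next) pairs the loss is computed over:
pairs = list(zip(ids, ids[1:]))
print(pairs)  # [(26, 0), (0, 21), (21, 0), (0, 26)]
```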

We average the four losses:

loss = (1 / n) * sum(losses)

$$\text{loss} = \frac{2.01 + 3.19 + 2.14 + 1.91}{4} = 2.31$$
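The whole calculation fits in a few lines, using the four probabilities from the table above:

```python
import math

# Probabilities the model assigned to each correct next token
# (from the "ava" example above)
probs = [0.134, 0.041, 0.118, 0.148]

losses = [-math.log(p) for p in probs]
loss = sum(losses) / len(losses)
print(round(loss, 2))  # 2.31
```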

Why logarithms? Because probabilities multiply (the chance of a whole sequence is the product of each step), and logs turn products into sums — which are easier to work with and more numerically stable.
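A quick numeric check of that identity, again with the “ava” probabilities:

```python
import math

p = [0.134, 0.041, 0.118, 0.148]

# Probability of the whole sequence: a product of small numbers
# that shrinks toward underflow as sequences get longer...
seq_prob = math.prod(p)  # ≈ 9.6e-05

# ...but in log space the product becomes a sum, and the two agree:
log_prob = sum(math.log(x) for x in p)
print(abs(math.log(seq_prob) - log_prob))  # effectively 0
```

For a five-character name the product is already down near 10⁻⁵; for longer sequences working in log space is the only practical option.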

The Loss Curve

As training progresses, the model has seen more character pairs and makes better predictions. The loss drops:

Try it

Hover to see the loss at each training step:

The steep drop in the first ~50 steps is the model learning the most common patterns (names often start with certain letters, “th” is common, “a” is often followed by “n”). After that, improvement slows — the easy patterns are learned, and what’s left is rarer combinations.

A loss of 0 would mean the model predicts every next character with 100% confidence. That’s impossible — language has genuine uncertainty. Many names share prefixes (“ma…” could be “mary”, “mark”, “madison”), so the best the model can do is assign reasonable probabilities across the options.
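That floor can be made concrete with a toy case. Suppose (hypothetically) that after some prefix the true next character is one of three options, each occurring 1/3 of the time. Even a perfect model, one that matches this distribution exactly, averages a loss of log 3, not 0:

```python
import math

# Hypothetical: three equally likely next characters, each with p = 1/3.
# A perfect model assigns exactly 1/3 to each, so its average loss is:
best_loss = -sum((1 / 3) * math.log(1 / 3) for _ in range(3))
print(round(best_loss, 3))  # 1.099, i.e. log(3) — the floor is not 0
```

Any model that assigns probabilities other than the true 1/3s does strictly worse on average, so this value is the lowest loss achievable on such data.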
