MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 1.4 The MLP Model 1.6 Numerical Gradient →
Step 1: Gradient Descent › 1.5

The Forward Pass

So far

  • mlp(token_id) — token → 27 logits
  • softmax(logits) — logits → probabilities
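As a quick refresher, here is a minimal softmax sketch consistent with how it is used below — the tutorial's own implementation may differ in detail, but the result is the same: 27 logits in, 27 probabilities out, summing to 1:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability;
    # this shifts every exponent but leaves the ratios unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With all-equal logits (e.g. an untrained model), every token gets probability 1/27.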

The forward pass runs the model on a tokenized name and computes the average loss — exactly as in Step 0, but now using the MLP:

import math

def forward(tokens, n):
    losses = []
    for pos_id in range(n):
        # Each consecutive pair: the current token predicts the next one.
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        logits = mlp(token_id)                 # 27 raw scores
        probs = softmax(logits)                # scores -> probabilities
        loss_t = -math.log(probs[target_id])   # surprise at the true next token
        losses.append(loss_t)
    loss = (1 / n) * sum(losses)               # average over all pairs
    return loss
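To see the whole pipeline run end to end, here is a self-contained sketch. The `mlp` below is a stand-in (it returns uniform logits, as an untrained network roughly would), and the tokenization of “ava” assumes the common scheme where 0 is the boundary token and letters map a=1 … z=26 — both are assumptions for illustration, not the tutorial's exact code:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mlp(token_id):
    # Stand-in for the real MLP: uniform logits over all 27 tokens,
    # mimicking a network with random (uninformative) weights.
    return [0.0] * 27

def forward(tokens, n):
    losses = []
    for pos_id in range(n):
        token_id, target_id = tokens[pos_id], tokens[pos_id + 1]
        probs = softmax(mlp(token_id))
        losses.append(-math.log(probs[target_id]))
    return sum(losses) / n

# "ava" with boundary tokens, assuming 0 = boundary, a=1, v=22:
tokens = [0, 1, 22, 1, 0]
loss = forward(tokens, len(tokens) - 1)
```

With uniform logits every target gets probability 1/27, so the loss comes out to log(27) ≈ 3.30 — exactly the "untrained" baseline discussed below.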

Try it

Step through the forward pass on “ava” using the trained MLP:

For each consecutive pair, we:

  • mlp(token_id) — run the neural network to get 27 logits (one raw score per possible next token)
  • softmax(logits) — convert the logits to probabilities
  • -math.log(probs[target_id]) — measure how surprised the model was by the actual next token

The loss formula is the same as Step 0:

$$\text{loss} = -\frac{1}{n} \sum_{i=1}^{n} \log P(t_i \mid t_{i-1})$$

At the start of training, with random weights, the model is essentially guessing. It assigns roughly equal probability to all 27 tokens, so the initial loss is about $-\log(1/27) \approx 3.30$ — the same as an untrained Step 0 model.
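The untrained baseline is easy to check directly — with uniform probabilities, every per-token loss is the same:

```python
import math

# Each of the 27 tokens gets probability 1/27, so the per-token loss
# is -log(1/27) = log(27), and so is their average.
initial_loss = -math.log(1 / 27)
```

Evaluating this gives roughly 3.2958, which rounds to the 3.30 quoted above.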

Try it

Compare the loss curves — Step 0 (counting) vs Step 1 (gradient descent):

The difference is in what happens next. In Step 0, we updated counts directly. Here, we need to figure out how to change the weights to reduce the loss. That’s the job of the gradient.
