MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 1.2 The Parameters 1.4 The MLP Model →
Step 1: Gradient Descent › 1.3

Linear & Softmax

So far

  • vocab_size — 27
  • n_embd — 16
  • state_dict — three weight matrices

Before we can build the model, we need two fundamental operations.

Linear

A linear layer multiplies a vector by a weight matrix. It’s the basic building block of neural networks:

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

For each row wo in the weight matrix w, we compute the dot product with the input x.


If x has 16 elements and w has 64 rows, the output has 64 elements. Each output is a weighted sum of the inputs.
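As a quick sanity check, here is the same `linear` function applied to a tiny 3-element input and a 2-row weight matrix (made-up numbers, not the real 16-and-64 shapes):

```python
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

x = [1.0, 2.0, 3.0]
w = [
    [1.0, 0.0, 0.0],   # this row just picks out x[0]
    [0.5, 0.5, 0.5],   # this row sums half of each input
]
print(linear(x, w))    # → [1.0, 3.0]
```

Two rows in, two outputs out — each one a dot product of a weight row with `x`.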

Softmax

Softmax turns a list of raw numbers (called logits) into probabilities that sum to 1:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

In code:

import math

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
  • max_val = max(logits) — find the largest value (for numerical stability)
  • math.exp(v - max_val) — exponentiate each value (subtracting the max prevents overflow)
  • e / total — normalize so the results sum to 1

Softmax has a “winner-take-all” tendency: the largest logit gets the most probability, and the gap gets amplified. If the logits are [2, 1, 0], softmax gives roughly [0.67, 0.24, 0.09] — the largest value dominates.
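The [2, 1, 0] example can be checked directly with the `softmax` function above:

```python
import math

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2, 1, 0])
print([round(p, 2) for p in probs])   # → [0.67, 0.24, 0.09]
```

Note that the gaps between logits are equal (2→1 and 1→0 both differ by 1), but the gaps between probabilities are not — exponentiation amplifies the lead of the largest value.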


This is the same role that normalization played in Step 0’s bigram() function: turn raw numbers into a probability distribution. But now the raw numbers are learned logits instead of counts.
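Chained together, the two operations already sketch the shape of next-character prediction. Here is a minimal illustration using a made-up 2-dimensional embedding and a 3-character vocabulary (the real model uses n_embd = 16 and vocab_size = 27):

```python
import math

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.5, -0.2]            # hypothetical embedding of the current character
w_out = [                  # one row per character in the (tiny) vocab
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]

logits = linear(x, w_out)  # one raw score per character
probs = softmax(logits)    # one probability per character, summing to 1
```

The weight rows here are placeholders; in the real model they are learned parameters from the state_dict.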
