Linear & Softmax
So far
- `vocab_size`: 27
- `n_embd`: 16
- `state_dict`: three weight matrices
Before we can build the model, we need two fundamental operations.
Linear
A linear layer multiplies a vector by a weight matrix. It’s the basic building block of neural networks:
def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]
For each row wo in the weight matrix w, we compute the dot product with the input x.
Try it
Change any value to see the dot products update:
If x has 16 elements and w has 64 rows, the output has 64 elements. Each output is a weighted sum of the inputs.
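As a quick sanity check, here is `linear` applied to a tiny hand-made example (the 2-element input and 3×2 weight matrix are made up for illustration):

```python
def linear(x, w):
    # Each output element is the dot product of one weight row with x.
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

x = [1.0, 2.0]            # input vector, 2 elements
w = [[1.0, 0.0],          # 3 rows -> 3 outputs
     [0.0, 1.0],
     [1.0, 1.0]]
print(linear(x, w))       # -> [1.0, 2.0, 3.0]
```

The third output, 3.0, is the weighted sum 1.0·1.0 + 1.0·2.0: each output mixes all the inputs according to one row of weights.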
Softmax
Softmax turns a list of raw numbers (called logits) into probabilities that sum to 1:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
In code:
import math

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
| Code | What it does |
| --- | --- |
| `max_val = max(logits)` | Find the largest value (for numerical stability) |
| `math.exp(v - max_val)` | Exponentiate each value (subtracting the max prevents overflow) |
| `e / total` | Normalize so the results sum to 1 |
Softmax has a “winner-take-all” tendency: the largest logit gets the most probability, and the gap gets amplified. If the logits are [2, 1, 0], softmax gives roughly [0.67, 0.24, 0.09] — the largest value dominates.
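We can confirm the [2, 1, 0] example numerically, using the `softmax` defined above:

```python
import math

def softmax(logits):
    max_val = max(logits)                          # subtract max for stability
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2, 1, 0])
print([round(p, 2) for p in probs])  # -> [0.67, 0.24, 0.09]
```

Note that only the gaps between logits matter: softmax([12, 11, 10]) gives the same answer, because the shared offset cancels in the division.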
Try it
Change the logits and see how softmax responds:
This is the same role that normalization played in Step 0’s bigram() function: turn raw numbers into a probability distribution. But now the raw numbers are learned logits instead of counts.
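Putting the two operations together: a weight matrix with `vocab_size` rows maps a 16-dimensional vector to 27 logits, and softmax turns those logits into a next-character distribution. A minimal sketch, where the random weights and input are stand-ins for learned values:

```python
import math
import random

def linear(x, w):
    return [sum(wi * xi for wi, xi in zip(wo, x)) for wo in w]

def softmax(logits):
    max_val = max(logits)
    exps = [math.exp(v - max_val) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab_size, n_embd = 27, 16
random.seed(0)
# Stand-in for a learned output matrix: vocab_size rows of n_embd weights.
w_out = [[random.gauss(0, 0.1) for _ in range(n_embd)] for _ in range(vocab_size)]
x = [random.gauss(0, 1) for _ in range(n_embd)]   # stand-in input vector

probs = softmax(linear(x, w_out))
print(len(probs))            # 27: one probability per character
print(round(sum(probs), 6))  # 1.0
```

This `linear`-then-`softmax` pair is exactly the shape of the model's output head: logits in, probability distribution over the vocabulary out.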