The Parameters
So far, from Step 0, we carry over one quantity: vocab_size = 27, the number of distinct tokens.
In Step 0, the model was a single 27×27 count table. Now we replace it with three matrices of random numbers:
```python
import random

vocab_size = 27
n_embd = 16

matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]

state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}
```
n_embd = 16 is the embedding dimension — the size of the vector that represents each token. Each token gets a 16-number “fingerprint” that the model will learn to make useful.
The three matrices:
| wte | → | Word Token Embeddings: 27 × 16. Each row is one token’s embedding vector. |
| mlp_fc1 | → | Multi-Layer Perceptron, Fully Connected layer 1: 64 × 16. Expands the 16-dimensional embedding to 64 dimensions. |
| mlp_fc2 | → | MLP, Fully Connected layer 2: 27 × 64. Projects back down to 27 raw scores (one per token). These raw scores are called **logits** — they’ll be converted to probabilities later. |
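To see how the three shapes fit together, here is a sketch (not from the original code, and skipping whatever nonlinearity sits between the two MLP layers) of a single token id flowing through the matrices:

```python
import random

vocab_size, n_embd = 27, 16
matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}

def matvec(mat, vec):
    # multiply a matrix (a list of rows) by a vector
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

token = 3                                       # any token id in 0..26
emb = state_dict['wte'][token]                  # row lookup: 16 numbers
hidden = matvec(state_dict['mlp_fc1'], emb)     # 16 -> 64
logits = matvec(state_dict['mlp_fc2'], hidden)  # 64 -> 27 raw scores
print(len(emb), len(hidden), len(logits))       # 16 64 27
```

Note that wte is used by indexing a row, not by matrix multiplication: embedding lookup is just "grab row number `token`".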
The three matrices, initialized with random Gaussian values (blue = negative, red = positive).
Every entry starts as a small random number (Gaussian with standard deviation 0.08). This randomness is the starting point — the model knows nothing yet. Training will shape these random numbers into useful patterns.
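As a quick sanity check (this snippet is not part of the model; the seed is arbitrary, chosen only to make the check repeatable), you can sample from the same Gaussian and confirm the values cluster tightly around zero with the expected spread:

```python
import random

random.seed(0)  # arbitrary seed for repeatability
vals = [random.gauss(0, 0.08) for _ in range(10_000)]
mean = sum(vals) / len(vals)
std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
print(round(mean, 3), round(std, 3))  # mean near 0, std near 0.08
```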
We also need a flat list of every individual parameter, so we can update them during training:
```python
params = [(row, j) for mat in state_dict.values() for row in mat for j in range(len(row))]
```
This is a dense line. To see what it does, imagine a tiny state_dict with two 2×2 matrices:
```python
state_dict = {
    'a': [[0.1, 0.2],
          [0.3, 0.4]],
    'b': [[0.5, 0.6],
          [0.7, 0.8]],
}
```
The comprehension walks through each matrix, then each row, then each column index, producing a flat list:
```python
params = [
    ([0.1, 0.2], 0), ([0.1, 0.2], 1),  # a, row 0
    ([0.3, 0.4], 0), ([0.3, 0.4], 1),  # a, row 1
    ([0.5, 0.6], 0), ([0.5, 0.6], 1),  # b, row 0
    ([0.7, 0.8], 0), ([0.7, 0.8], 1),  # b, row 1
]
```
Each entry is a (row, column index) pair — so params[5] is ([0.5, 0.6], 1), pointing to the value 0.6. Later, during training, we’ll loop through this flat list and nudge each value by its gradient.
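The key detail is that these pairs hold live references, not copies: each row in params is the very same list object that sits inside state_dict, so writing through row[j] updates the matrix in place. A small demonstration with the toy state_dict above:

```python
state_dict = {
    'a': [[0.1, 0.2], [0.3, 0.4]],
    'b': [[0.5, 0.6], [0.7, 0.8]],
}
params = [(row, j) for mat in state_dict.values() for row in mat for j in range(len(row))]

row, j = params[5]
print(row[j])  # 0.6

# the tuple holds the actual row list, so this mutates state_dict too
row[j] -= 0.1
print(state_dict['b'][0][1])  # now ~0.5 (up to float rounding)
```

This is exactly what makes the training loop's "nudge each value" step possible without ever copying data back into the matrices.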
For the real model, we’re flattening three matrices:
| wte | → | 27 × 16 = 432 parameters |
| mlp_fc1 | → | 64 × 16 = 1,024 parameters |
| mlp_fc2 | → | 27 × 64 = 1,728 parameters |
That’s 3,184 parameters total — compared to Step 0’s 729 counts. More capacity, but the model has to learn the right values instead of just counting.
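The total is easy to recompute from the shapes alone; a minimal check:

```python
vocab_size, n_embd = 27, 16
shapes = {
    'wte': (vocab_size, n_embd),          # 27 x 16 = 432
    'mlp_fc1': (4 * n_embd, n_embd),      # 64 x 16 = 1,024
    'mlp_fc2': (vocab_size, 4 * n_embd),  # 27 x 64 = 1,728
}
total = sum(nout * nin for nout, nin in shapes.values())
print(total)  # 3184
```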