The Parameters
So far, from Step 0, we carry over one quantity: vocab_size = 27, the number of distinct tokens.
In Step 0, the model was a single 27×27 count table. Now we replace it with three matrices of random numbers:
```python
import random

vocab_size = 27
n_embd = 16

matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]

state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}
```
n_embd = 16 is the embedding dimension — the size of the vector that represents each token. Each token gets a 16-number “fingerprint” that the model will learn to make useful.
The three matrices:
| wte | → | Word Token Embeddings: 27 × 16. Each row is one token’s embedding vector. |
| mlp_fc1 | → | Multi-Layer Perceptron, Fully Connected layer 1: 64 × 16. Expands the 16-dimensional embedding to 64 dimensions. |
| mlp_fc2 | → | MLP, Fully Connected layer 2: 27 × 64. Projects back down to 27 raw scores (one per token). These raw scores are called **logits** — they’ll be converted to probabilities later. |
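To see how the three shapes fit together, here is a sketch (not from the original code, and skipping whatever nonlinearity sits between the two MLP layers) of a single token id flowing through the matrices:

```python
import random

vocab_size, n_embd = 27, 16
matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}

def matvec(mat, vec):
    # multiply a matrix (a list of rows) by a vector
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]

token = 3                                       # any token id in 0..26
emb = state_dict['wte'][token]                  # row lookup: 16 numbers
hidden = matvec(state_dict['mlp_fc1'], emb)     # 16 -> 64
logits = matvec(state_dict['mlp_fc2'], hidden)  # 64 -> 27 raw scores
print(len(emb), len(hidden), len(logits))       # 16 64 27
```

Note that wte is used by indexing a row, not by matrix multiplication: embedding lookup is just "grab row number `token`".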
The three matrices, initialized with random Gaussian values (blue = negative, red = positive).
Every entry starts as a small random number (Gaussian with standard deviation 0.08). This randomness is the starting point — the model knows nothing yet. Training will shape these random numbers into useful patterns.
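As a quick sanity check (this snippet is not part of the model; the seed is arbitrary, chosen only to make the check repeatable), you can sample from the same Gaussian and confirm the values cluster tightly around zero with the expected spread:

```python
import random

random.seed(0)  # arbitrary seed for repeatability
vals = [random.gauss(0, 0.08) for _ in range(10_000)]
mean = sum(vals) / len(vals)
std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
print(round(mean, 3), round(std, 3))  # mean near 0, std near 0.08
```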
We also need a flat list of every individual parameter, so we can update them during training:
```python
params = [(row, j) for mat in state_dict.values() for row in mat for j in range(len(row))]
```
This is a dense line. To see what it does, imagine a tiny state_dict with two 2×2 matrices:
```python
state_dict = {
    'a': [[0.1, 0.2],
          [0.3, 0.4]],
    'b': [[0.5, 0.6],
          [0.7, 0.8]],
}
```
The comprehension walks through each matrix, then each row, then each column index, producing a flat list:
```python
params = [
    ([0.1, 0.2], 0), ([0.1, 0.2], 1),  # a, row 0
    ([0.3, 0.4], 0), ([0.3, 0.4], 1),  # a, row 1
    ([0.5, 0.6], 0), ([0.5, 0.6], 1),  # b, row 0
    ([0.7, 0.8], 0), ([0.7, 0.8], 1),  # b, row 1
]
```
Each entry is a (row, column index) pair — so params[5] is ([0.5, 0.6], 1), pointing to the value 0.6. Later, during training, we’ll loop through this flat list and nudge each value by its gradient.
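The key detail is that these pairs hold live references, not copies: each row in params is the very same list object that sits inside state_dict, so writing through row[j] updates the matrix in place. A small demonstration with the toy state_dict above:

```python
state_dict = {
    'a': [[0.1, 0.2], [0.3, 0.4]],
    'b': [[0.5, 0.6], [0.7, 0.8]],
}
params = [(row, j) for mat in state_dict.values() for row in mat for j in range(len(row))]

row, j = params[5]
print(row[j])  # 0.6

# the tuple holds the actual row list, so this mutates state_dict too
row[j] -= 0.1
print(state_dict['b'][0][1])  # now ~0.5 (up to float rounding)
```

This is exactly what makes the training loop's "nudge each value" step possible without ever copying data back into the matrices.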
For the real model, we’re flattening three matrices:
| wte | → | 27 × 16 = 432 parameters |
| mlp_fc1 | → | 64 × 16 = 1,024 parameters |
| mlp_fc2 | → | 27 × 64 = 1,728 parameters |
That’s 3,184 parameters total — compared to Step 0’s 729 counts. More capacity, but the model has to learn the right values instead of just counting.
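The total is easy to recompute from the shapes alone; a minimal check:

```python
vocab_size, n_embd = 27, 16
shapes = {
    'wte': (vocab_size, n_embd),          # 27 x 16 = 432
    'mlp_fc1': (4 * n_embd, n_embd),      # 64 x 16 = 1,024
    'mlp_fc2': (vocab_size, 4 * n_embd),  # 27 x 64 = 1,728
}
total = sum(nout * nin for nout, nin in shapes.values())
print(total)  # 3184
```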