Position Embeddings
Previously Defined
- Step 2’s model processed each token independently
- Step 3 adds context — the model can look back
Since Step 1.2 (wrapped in Value in Step 2.6), the model has had one embedding table: wte (word token embedding). Each token looked up a 16-dimensional vector representing what it is. But “m” at position 0 and “m” at position 3 got exactly the same vector — the model had no sense of order.
Step 3 adds a second embedding table: wpe (word position embedding). Each position gets its own learned vector representing where it is in the sequence.
n_embd = 16
block_size = 16  # maximum sequence length

matrix = lambda nout, nin: [[Value(random.gauss(0, 0.08)) for _ in range(nin)] for _ in range(nout)]

state_dict = {
    'wte': matrix(vocab_size, n_embd),   # one row per token: what it is
    'wpe': matrix(block_size, n_embd),   # one row per position: where it is
    ...
block_size = 16 sets the maximum sequence length — the model can handle names up to 16 characters. The wpe table has one row per position (16 rows of 16 values each).
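A quick sanity check on the shapes, using plain floats instead of Value objects so it runs standalone (the vocab_size of 27 is an assumption for illustration; this section doesn't restate it):

```python
import random

n_embd = 16
block_size = 16
vocab_size = 27  # assumption: not given in this section

matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]
wpe = matrix(block_size, n_embd)

# One row per position, 16 values each...
assert len(wpe) == block_size and len(wpe[0]) == n_embd
# ...so wpe adds 16 * 16 = 256 learned parameters to the model.
assert block_size * n_embd == 256
```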
Combining the embeddings
def gpt(token_id, pos_id, keys, values):
    tok_emb = state_dict['wte'][token_id]  # what the token is
    pos_emb = state_dict['wpe'][pos_id]    # where it is
    x = [t + p for t, p in zip(tok_emb, pos_emb)]
    ...
The token embedding and position embedding are added element-wise. The result is a vector that encodes both what the token is and where it appears. This is the standard approach used in GPT-2 — simple addition rather than concatenation.
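A minimal sketch of that addition, again with plain floats rather than Value objects so it is self-contained (the token id 13 standing in for “m” is an assumption):

```python
import random

random.seed(42)
n_embd, block_size, vocab_size = 16, 16, 27

matrix = lambda nout, nin: [[random.gauss(0, 0.08) for _ in range(nin)] for _ in range(nout)]
wte = matrix(vocab_size, n_embd)
wpe = matrix(block_size, n_embd)

def embed(token_id, pos_id):
    # Element-wise sum: "what it is" + "where it is"
    return [t + p for t, p in zip(wte[token_id], wpe[pos_id])]

# Same token (say id 13, standing in for "m") at positions 0 and 3:
x0 = embed(13, 0)
x3 = embed(13, 3)
assert x0 != x3  # same token embedding, different position embeddings
```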
Try it
Click any token to see how its combined embedding is built. Notice that the two “m”s get the same token embedding but different position embeddings — so the model sees different vectors.
Both embedding tables start as random numbers, just like the weight matrices in Steps 1 and 2. The model learns what each position vector should be during training — there’s nothing hand-coded about which positions matter or how they relate to each other.
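To make “learned, not hand-coded” concrete, here is a toy update step (the gradient and learning rate are made up for the demo): an embedding row is just a list of parameters, and backprop nudges it exactly like any weight.

```python
import random

random.seed(0)
n_embd = 4  # tiny row for the demo
wpe_row = [random.gauss(0, 0.08) for _ in range(n_embd)]
before = wpe_row[:]

# Pretend backprop produced a gradient for this position's row
# (values are illustrative, not from any real training run).
grad = [0.1] * n_embd
lr = 0.5
wpe_row = [w - lr * g for w, g in zip(wpe_row, grad)]

# Every entry moved: the position vectors are shaped entirely by training.
assert all(b != a for b, a in zip(before, wpe_row))
```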