MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 0.3 The Count Table 0.5 Training →
Step 0: Counting › 0.4

The Model

So far

  • vocab_size — 27
  • state_dict — 27×27 count table

Given a token, the model looks up its row in the count table and converts the raw counts into probabilities.

def bigram(token_id):
    row = state_dict[token_id]             # counts of tokens that followed token_id
    total = sum(row) + vocab_size          # +vocab_size pairs with the +1 smoothing below
    return [(c + 1) / total for c in row]  # smoothed counts -> probabilities

Let’s trace through what this does for a specific token — say a (token id 0):

state_dict[0] Look up row 0 — the counts of every token that has followed a
[232, 87, 143, 197, 35, 12, 41, 50, 112, 16, 43, 302, 172, 622, 0, 31, 5, 303, 189, 195, 47, 55, 22, 11, 72, 29, 92]
sum(row) + vocab_size Total the counts, adding 27 for smoothing (see below)
3307 + 27 = 3334
(c + 1) / total Convert each count to a probability by dividing by the total
[0.070, 0.026, 0.043, 0.059, 0.011, ...]

The result is a list of 27 probabilities that sum to 1. This is the model’s prediction: given a, what’s the probability of each possible next token?
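The same lookup can be run end to end. Here is a minimal, self-contained sketch of it — using a made-up 4-token vocabulary and invented counts rather than the real 27×27 table:

```python
# Toy version of the lookup: 4 tokens, invented counts (for illustration only).
vocab_size = 4
state_dict = [
    [10, 0, 5, 3],  # counts of tokens that followed token 0
    [2, 2, 2, 2],
    [0, 7, 0, 1],
    [4, 4, 0, 0],
]

def bigram(token_id):
    row = state_dict[token_id]             # counts of tokens that followed token_id
    total = sum(row) + vocab_size          # 18 + 4 = 22 for token 0
    return [(c + 1) / total for c in row]  # smoothed counts -> probabilities

probs = bigram(0)
print(probs)       # [11/22, 1/22, 6/22, 4/22]
print(sum(probs))  # 1.0
```

Note that token 1 in row 0 has a count of 0 but still gets probability 1/22 — that’s the smoothing at work.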

This is a bigram model: it predicts the next token based only on the current token. It has no memory of anything further back. The name “bigram” comes from looking at pairs (bi-) of tokens.

Try it

Click a token to see the probability distribution over what comes next:

Laplace Smoothing

Notice the + vocab_size in the denominator and + 1 in the numerator. This is add-one (Laplace) smoothing. Without it, any character pair the model hasn’t seen would get probability zero — and log(0) is negative infinity, which would break our loss calculation later.

The + 1 means every count is treated as if it’s at least 1, even for pairs the model has never seen. The + vocab_size in the denominator keeps the probabilities summing to 1.
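This is easy to check numerically. A small sketch with invented counts, comparing the unsmoothed and smoothed probabilities for a pair the model has never seen:

```python
import math

# Invented counts for illustration: token 1 never followed this token.
vocab_size = 4
row = [12, 0, 3, 5]

# Without smoothing: the unseen pair gets probability exactly 0,
# and math.log(0.0) raises a ValueError.
raw_total = sum(row)
raw_prob = row[1] / raw_total
print(raw_prob)  # 0.0

# With add-one smoothing: every pair gets at least 1/total.
total = raw_total + vocab_size  # 20 + 4 = 24
smoothed = [(c + 1) / total for c in row]
print(smoothed[1])              # 1/24
print(math.log(smoothed[1]))    # finite, so the loss stays well-defined
print(sum(smoothed))            # 1.0
```

Adding 1 to each of the 4 counts adds 4 to the numerators in total, which is exactly what the `+ vocab_size` in the denominator compensates for — so the probabilities still sum to 1.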
