Inference
So far:

- `uchars`: the characters `[a…z]`
- `BOS`: token 26
- `vocab_size`: 27
- `bigram(token)`: probability distribution over the next token
Now the fun part: generating new names.
We start with BOS (token 26) and ask the model what comes next. It gives us a probability distribution over all 27 tokens. We sample from that distribution (picking a random token weighted by its probability), then feed that token back in and repeat.
```python
import random

for sample_idx in range(20):
    token_id = BOS
    sample = []
    for _ in range(16):
        token_id = random.choices(range(vocab_size), weights=bigram(token_id))[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
```
The outer loop generates 20 names. For each one:
| Code | Meaning |
| --- | --- |
| `token_id = BOS` | Start with BOS (token 26) |
| `bigram(token_id)` | Get the probability distribution for what comes next |
| `random.choices(...)` | Pick a token, weighted by those probabilities |
| `if token_id == BOS: break` | If the model produces BOS, the name is done |
| `uchars[token_id]` | Convert the token ID back to a character |
The maximum length is 16 tokens, but most names end earlier when the model produces BOS.
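The weighted draw at the heart of the loop is plain `random.choices`. A standalone illustration with toy weights (not the trained model) shows that the empirical frequencies track the weights:

```python
import random
from collections import Counter

# Toy distribution over three tokens; the weights need not sum to 1,
# random.choices normalizes them internally.
weights = [0.7, 0.2, 0.1]
random.seed(0)
draws = Counter(random.choices(range(3), weights=weights, k=10_000))
# Over 10,000 draws, token 0 lands near 70%, token 1 near 20%, token 2 near 10%.
print({t: round(n / 10_000, 2) for t, n in sorted(draws.items())})
```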
Try it
Names generated from the trained bigram model:
They capture the statistical flavour of the training data — common letter pairs, typical name lengths, plausible beginnings and endings — all from nothing more than counting pairs.
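The whole pipeline really is just counting pairs. Here is a minimal self-contained sketch; a tiny hard-coded name list stands in for the real dataset, and `uchars`, `BOS`, and `bigram` mirror the definitions above:

```python
import random

names = ["emma", "olivia", "ava", "mia", "amelia"]  # toy stand-in for the real dataset
uchars = [chr(c) for c in range(ord("a"), ord("z") + 1)]
BOS = 26          # one id past the 26 letters
vocab_size = 27

def tok(ch):
    return uchars.index(ch)

# Count pairs: BOS -> first char, char -> char, last char -> BOS.
counts = [[0] * vocab_size for _ in range(vocab_size)]
for name in names:
    ids = [BOS] + [tok(c) for c in name] + [BOS]
    for a, b in zip(ids, ids[1:]):
        counts[a][b] += 1

def bigram(token_id):
    """Normalize one row of counts into a probability distribution."""
    row = counts[token_id]
    total = sum(row)
    return [c / total for c in row]

# Generate a few names exactly as in the loop above.
random.seed(42)
for sample_idx in range(5):
    token_id = BOS
    sample = []
    for _ in range(16):
        token_id = random.choices(range(vocab_size), weights=bigram(token_id))[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")
```

With only five training names the samples are repetitive, but the mechanics are identical to the full model.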
What’s Next
This bigram model is as simple as it gets: it looks at exactly one token of context. When predicting what follows "h", it has no idea whether a "t" or an "s" came before; every prediction depends on the current token alone.
In Step 1, we replace the count table with a neural network. The model becomes more expressive, but we can no longer solve it exactly — we’ll need gradient descent.