MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 0.6 Loss Step 1: Gradient Descent →
Step 0: Counting › 0.7

Inference

So far

  • uchars — [a…z]
  • BOS — token 26
  • vocab_size — 27
  • bigram(token) — prob of next token
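For reference, the pieces in this list can be stubbed out in a few lines. This is a minimal sketch, not the trained model: the `counts` table here is filled with dummy uniform counts standing in for the real pair counts from the earlier steps, just enough to make `bigram` runnable.

```python
import string

uchars = list(string.ascii_lowercase)   # [a..z], token IDs 0..25
BOS = 26                                # beginning/end-of-name token
vocab_size = 27

# counts[i][j] = how often token j followed token i in the training data.
# Dummy uniform counts here; the real table comes from counting name pairs.
counts = [[1] * vocab_size for _ in range(vocab_size)]

def bigram(token_id):
    """Probability distribution over the next token, given the current one."""
    row = counts[token_id]
    total = sum(row)
    return [c / total for c in row]
```

Each row of the table is normalized on the fly, so `bigram(token_id)` always sums to 1.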

Now the fun part: generating new names.

We start with BOS (token 26) and ask the model what comes next. It gives us a probability distribution over all 27 tokens. We sample from that distribution — picking a random token weighted by its probability — then feed that token back in and repeat.

for sample_idx in range(20):
    token_id = BOS
    sample = []
    for _ in range(16):
        token_id = random.choices(range(vocab_size), weights=bigram(token_id))[0]
        if token_id == BOS:
            break
        sample.append(uchars[token_id])
    print(f"sample {sample_idx+1:2d}: {''.join(sample)}")

The outer loop generates 20 names. For each one:

  • token_id = BOS — start with BOS (token 26)
  • bigram(token_id) — get the probability distribution for what comes next
  • random.choices(...) — pick a token, weighted by those probabilities
  • if token_id == BOS: break — if the model produces BOS, the name is done
  • uchars[token_id] — convert the token ID back to a character

The maximum length is 16 tokens, but most names end earlier when the model produces BOS.
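The weighted-sampling step is worth a closer look. Here is a tiny self-contained illustration of how random.choices picks tokens in proportion to their weights, using a made-up three-token distribution rather than a real bigram row:

```python
import random
from collections import Counter

# A made-up distribution: token 0 gets 70% of the mass, token 1 gets 20%,
# token 2 gets 10% — not a real bigram(token_id) output.
weights = [0.7, 0.2, 0.1]

random.seed(0)
draws = Counter(random.choices(range(3), weights=weights, k=10_000))

# Token 0 should appear in roughly 70% of the draws, token 1 in ~20%,
# token 2 in ~10%.
print(draws)
```

The same mechanism drives the name generator: a token with high probability in the bigram row is picked often, but low-probability tokens still appear occasionally, which is what keeps the samples varied.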

Try it

The names generated from the trained bigram model capture the statistical flavour of the training data — common letter pairs, typical name lengths, plausible beginnings and endings — all from nothing more than counting pairs.

What’s Next

This bigram model is as simple as it gets. When predicting what follows “h”, it doesn’t know whether a “t” came before it — one token of context is all it can ever use.

In Step 1, we replace the count table with a neural network. The model becomes more expressive, but we can no longer solve it exactly — we’ll need gradient descent.
