MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 1: Gradient Descent › 1.1

Same Starting Point

The dataset, tokenizer, and vocabulary are identical to Step 0. We're still working with the same 32,033 names, the same 26 characters plus BOS, and the same vocabulary size of 27.

import random

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]  # one name per line
random.shuffle(docs)
uchars = sorted(set(''.join(docs)))  # the 26 unique characters
BOS = len(uchars)                    # beginning-of-sequence token id
vocab_size = len(uchars) + 1         # 26 letters + BOS = 27

If any of this is unfamiliar, see Step 0 for the full explanation.

What changes in Step 1 is everything after this point. In Step 0, the “model” was a count table and “training” was incrementing counts. That worked because the bigram model was simple enough to have a closed-form solution.

Now we replace the count table with a neural network — a model that’s too expressive to solve exactly. Instead of counting, we’ll train it with gradient descent: nudging the parameters a little at a time, guided by the slope of the loss.
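That update rule is simple enough to show on its own. As an illustration only (a one-parameter toy, not the model we'll build), here is gradient descent minimizing f(w) = (w − 3)², whose slope at any point tells us which way to nudge w:

```python
# Toy illustration of gradient descent, not the actual model:
# minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
w = 0.0    # initial parameter
lr = 0.1   # learning rate: how big each nudge is

for step in range(100):
    grad = 2 * (w - 3)  # slope of the loss at the current w
    w -= lr * grad      # nudge w downhill, against the slope

print(w)  # converges toward 3.0, the minimum
```

Each step moves w a fraction of the slope in the downhill direction; repeat enough times and w settles at the minimum. The neural network works the same way, just with many parameters at once.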

The input and output are the same: give the model a token, get back a probability distribution over the next token. But the internals are completely different.
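To make that interface concrete, here is a minimal sketch, with a hypothetical stand-in for the network: a single random weight matrix whose rows are turned into probabilities with a softmax. The real model's parameters come later; this only shows the token-in, distribution-out contract.

```python
import math
import random

vocab_size = 27
random.seed(0)
# Hypothetical stand-in for the network's parameters: one row of
# logits (raw scores) per input token. Not the model from this guide.
W = [[random.gauss(0, 0.1) for _ in range(vocab_size)] for _ in range(vocab_size)]

def next_token_probs(token):
    """Given a token id, return a probability distribution over the next token."""
    logits = W[token]                 # one score per possible next token
    m = max(logits)                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]  # softmax: nonnegative, sums to 1

probs = next_token_probs(BOS if 'BOS' in dir() else 0)
print(len(probs), sum(probs))  # 27 probabilities summing to 1
```

The count table in Step 0 satisfied the same contract; only how the distribution is produced changes.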
