MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 1: Gradient Descent › 1.1

Same Starting Point

The dataset, tokenizer, and vocabulary are identical to Step 0. We're still working with the same 32,033 names, the same 26 characters plus BOS, and the same vocabulary size of 27.

import random

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]  # one name per line
random.shuffle(docs)
uchars = sorted(set(''.join(docs)))  # the 26 unique characters
BOS = len(uchars)                    # beginning-of-sequence token id
vocab_size = len(uchars) + 1         # 26 letters + BOS = 27

If any of this is unfamiliar, see Step 0 for the full explanation.

What changes in Step 1 is everything after this point. In Step 0, the “model” was a count table and “training” was incrementing counts. That worked because the bigram model was simple enough to have a closed-form solution.

Now we replace the count table with a neural network — a model that’s too expressive to solve exactly. Instead of counting, we’ll train it with gradient descent: nudging the parameters a little at a time, guided by the slope of the loss.
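That update rule is simple enough to show on its own. As an illustration only (a one-parameter toy, not the model we'll build), here is gradient descent minimizing f(w) = (w − 3)², whose slope at any point tells us which way to nudge w:

```python
# Toy illustration of gradient descent, not the actual model:
# minimize f(w) = (w - 3)^2, whose gradient is f'(w) = 2 * (w - 3).
w = 0.0    # initial parameter
lr = 0.1   # learning rate: how big each nudge is

for step in range(100):
    grad = 2 * (w - 3)  # slope of the loss at the current w
    w -= lr * grad      # nudge w downhill, against the slope

print(w)  # converges toward 3.0, the minimum
```

Each step moves w a fraction of the slope in the downhill direction; repeat enough times and w settles at the minimum. The neural network works the same way, just with many parameters at once.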

The input and output are the same: give the model a token, get back a probability distribution over the next token. But the internals are completely different.
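To make that interface concrete, here is a minimal sketch, with a hypothetical stand-in for the network: a single random weight matrix whose rows are turned into probabilities with a softmax. The real model's parameters come later; this only shows the token-in, distribution-out contract.

```python
import math
import random

vocab_size = 27
random.seed(0)
# Hypothetical stand-in for the network's parameters: one row of
# logits (raw scores) per input token. Not the model from this guide.
W = [[random.gauss(0, 0.1) for _ in range(vocab_size)] for _ in range(vocab_size)]

def next_token_probs(token):
    """Given a token id, return a probability distribution over the next token."""
    logits = W[token]                 # one score per possible next token
    m = max(logits)                   # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]  # softmax: nonnegative, sums to 1

probs = next_token_probs(BOS if 'BOS' in dir() else 0)
print(len(probs), sum(probs))  # 27 probabilities summing to 1
```

The count table in Step 0 satisfied the same contract; only how the distribution is produced changes.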
