MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide


The Dataset

We start with a list of names — 32,033 of them. Each name is a short document. We’ll train a model that learns to generate new names that sound like real names.

The names come from a US Social Security dataset. The file looks like this:

emma
olivia
ava
isabella
sophia
mia
charlotte
amelia
...

Each name is all lowercase, no spaces, no punctuation — just letters. This simplicity is deliberate: it lets us focus on the model, not on preprocessing.
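The "letters only, all lowercase" claim is easy to check in Python. A minimal sketch, using a few sample names inline rather than the real file:

```python
# Stand-in for a few lines read from input.txt (hypothetical sample data).
sample = ["emma", "olivia", "ava", "isabella"]

for name in sample:
    assert name.isalpha()  # letters only: no spaces, digits, or punctuation
    assert name.islower()  # entirely lowercase
```

Running the same two assertions over the full file is a cheap way to confirm that no preprocessing is needed.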

We load and shuffle them:

import random

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)

That first line is doing a lot of work. Reading from the inside out:

- open('input.txt').read() reads the entire file as one big string: "emma\nolivia\nava\nisabella\n..."
- .strip() removes leading/trailing whitespace (including the final newline)
- .split('\n') splits it into a list of lines: ["emma", "olivia", "ava", "isabella", ...]
- l.strip() strips whitespace from each individual line
- if l.strip() filters out any blank lines (empty strings are falsy in Python)

The result is a plain list of strings:

['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', ...]
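The one-liner can also be written step by step. A sketch using an inline string in place of a real input.txt (the blank line in the sample is there to show the filtering):

```python
# Hypothetical file contents, including a stray blank line and a final newline.
raw = "emma\nolivia\n\nava\nisabella\n"

text = raw.strip()           # drop leading/trailing whitespace, incl. the final newline
lines = text.split('\n')     # ['emma', 'olivia', '', 'ava', 'isabella']
docs = [l.strip() for l in lines if l.strip()]  # strip each line, drop blanks

print(docs)  # ['emma', 'olivia', 'ava', 'isabella']
```

Both forms produce the same list; the one-liner just fuses the intermediate steps.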

Then random.shuffle reorders them in place:

['kyla', 'priest', 'trystan', 'tennille', 'maren', 'lakeshia', ...]

The shuffling matters because we’ll train on one name at a time, and we don’t want the model to learn anything about the order of names in the file.
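Note that random.shuffle mutates the list in place and returns None, so you shuffle first and then read the list. If you want the shuffle to be reproducible across runs (handy for debugging), you can seed the generator first; the seed value below is arbitrary:

```python
import random

names = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

random.seed(42)        # arbitrary seed; makes the permutation repeatable
random.shuffle(names)  # reorders names in place; the return value is None

# Same elements, new order.
print(names)
```

With the same seed, every run produces the same permutation; without seeding, the order differs on each run.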
