The Dataset
We start with a list of names — 32,033 of them. Each name is a short document. We’ll train a model that learns to generate new names that sound like real names.
The names come from a US Social Security dataset. The file looks like this:
emma
olivia
ava
isabella
sophia
mia
charlotte
amelia
...
Each name is all lowercase, no spaces, no punctuation — just letters. This simplicity is deliberate: it lets us focus on the model, not on preprocessing.
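If you want to sanity-check that claim in code, a quick pass over a few sample names (an inline stand-in for the real file) does it:

```python
# A few sample names standing in for the real dataset.
# The invariant: every name is nonempty, all lowercase, letters only.
sample = ["emma", "olivia", "ava", "isabella"]

assert all(n.isalpha() and n.islower() for n in sample)
```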
We load and shuffle them:
import random

docs = [l.strip() for l in open('input.txt').read().strip().split('\n') if l.strip()]
random.shuffle(docs)
The list comprehension is doing a lot of work. Reading it from the inside out:
open('input.txt').read()  →  Read the entire file as one big string: "emma\nolivia\nava\nisabella\n..."
.strip()                  →  Remove leading/trailing whitespace (including the final newline)
.split('\n')              →  Split into a list of lines: ["emma", "olivia", "ava", "isabella", ...]
l.strip()                 →  Strip whitespace from each individual line
if l.strip()              →  Keep only non-blank lines (empty strings are falsy in Python)
The result is a plain list of strings:
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'mia', ...]
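To watch each step in isolation, here is a small sketch that runs the same pipeline on an inline string — a made-up stand-in for the file contents, with a trailing newline and a blank line thrown in to exercise the filtering:

```python
# Hypothetical raw file contents: note the blank line and trailing newline.
raw = "emma\nolivia\n\nava\n"

# Same one-liner as above, just reading from a string instead of a file.
docs = [l.strip() for l in raw.strip().split('\n') if l.strip()]

print(docs)  # ['emma', 'olivia', 'ava'] -- the blank line is filtered out
```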
Then random.shuffle reorders them in place:
['kyla', 'priest', 'trystan', 'tennille', 'maren', 'lakeshia', ...]
The shuffling matters because we’ll train on one name at a time, and we don’t want the model to learn anything about the order of names in the file.
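One detail worth noting: `random.shuffle` mutates the list in place and returns `None`, so writing `docs = random.shuffle(docs)` would silently throw the data away. A minimal sketch, with an optional fixed seed for a reproducible ordering:

```python
import random

names = ['emma', 'olivia', 'ava', 'isabella', 'sophia']

# shuffle() reorders in place and returns None --
# don't write names = random.shuffle(names).
random.seed(0)  # optional: fix the seed so the ordering is reproducible
random.shuffle(names)

print(names)  # same five names, new order
```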