The Tokenizer
So far, we have our dataset ready: docs, a shuffled list of names.
Before a model can work with text, it needs to convert characters into numbers. This is tokenization.
We use the simplest possible tokenizer: each unique character gets its own ID.
uchars = sorted(set(''.join(docs)))
Reading from the inside out:
| Step | Effect |
| --- | --- |
| `''.join(docs)` | Concatenate all names into one long string: `"emmaoliviaavaisabella..."` |
| `set(...)` | Extract the unique characters: `{'e', 'm', 'a', 'o', 'l', 'b', 'r', 'i', 'v', ...}` |
| `sorted(...)` | Sort them alphabetically so the mapping is deterministic: `['a', 'b', 'c', 'd', 'e', ..., 'x', 'y', 'z']` |
The result is a list of 26 characters. Each character’s position in this list becomes its token ID: a → 0, b → 1, c → 2, ..., z → 25.
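The whole pipeline can be sketched in a few lines. Here `docs` is a small stand-in list of names (the real dataset has enough names to cover all 26 letters):

```python
# Build the character vocabulary from a toy stand-in for `docs`.
docs = ["emma", "olivia", "ava", "isabella"]

# Concatenate, deduplicate, sort — exactly the expression from the text.
uchars = sorted(set(''.join(docs)))

# Each character's token ID is simply its index in the sorted list.
char_to_id = {ch: i for i, ch in enumerate(uchars)}
print(uchars)
print(char_to_id['a'])  # → 0, since 'a' sorts first
```

With the full dataset, `uchars` would contain all 26 lowercase letters and `char_to_id` would map `a → 0` through `z → 25`.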
Try it
Type some words and see which unique characters are extracted:
We also need one special token: BOS (Beginning of Sequence). It gets the next available ID, bringing our total vocabulary size to 27.
BOS = len(uchars)
vocab_size = len(uchars) + 1
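Concretely, with the full 26-letter alphabet (shown here as a stand-in list, since we don't rebuild `docs`), the two lines above give:

```python
# Stand-in for the real `uchars` built from the full dataset.
uchars = [chr(c) for c in range(ord('a'), ord('z') + 1)]  # ['a', ..., 'z']

BOS = len(uchars)              # 26 — the next available ID after z = 25
vocab_size = len(uchars) + 1   # 27 — the 26 letters plus BOS
print(BOS, vocab_size)         # → 26 27
```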
Our complete vocabulary:
Every name gets wrapped in BOS tokens on both sides. So the name “ava” becomes BOS a v a BOS, or as token IDs: [26, 0, 21, 0, 26].
Why BOS on both sides? Because the model needs to learn two things: what letters a name tends to start with (BOS → first letter), and when a name should end (last letter → BOS).
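Putting it together, the wrapping step might look like this sketch (`tokenize` is a hypothetical helper name, not necessarily what the codebase calls it):

```python
# Rebuild the pieces from the text: the 26-letter vocab plus BOS.
uchars = [chr(c) for c in range(ord('a'), ord('z') + 1)]
char_to_id = {ch: i for i, ch in enumerate(uchars)}
BOS = len(uchars)  # 26

def tokenize(name):
    # Wrap the name in BOS on both sides, so the model can learn both
    # BOS → first letter (how names start) and last letter → BOS (how they end).
    return [BOS] + [char_to_id[ch] for ch in name] + [BOS]

print(tokenize("ava"))  # → [26, 0, 21, 0, 26]
```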
Try it
Type any name to see how the tokenizer converts it: