The Tokenizer
So far, we have our dataset ready: docs, a shuffled list of names.
Before a model can work with text, it needs to convert characters into numbers. This is tokenization.
We use the simplest possible tokenizer: each unique character gets its own ID.
uchars = sorted(set(''.join(docs)))
Reading from the inside out:
| Step | Effect |
| --- | --- |
| `''.join(docs)` | Concatenate all names into one long string: `"emmaoliviaavaisabella..."` |
| `set(...)` | Extract the unique characters: `{'e', 'm', 'a', 'o', 'l', 'b', 'r', 'i', 'v', ...}` |
| `sorted(...)` | Sort them alphabetically so the mapping is deterministic: `['a', 'b', 'c', 'd', 'e', ..., 'x', 'y', 'z']` |
The result is a list of 26 characters. Each character’s position in this list becomes its token ID: a → 0, b → 1, c → 2, ..., z → 25.
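The whole pipeline can be sketched in a few lines. Here `docs` is a small stand-in list of names (the real dataset has enough names to cover all 26 letters):

```python
# Build the character vocabulary from a toy stand-in for `docs`.
docs = ["emma", "olivia", "ava", "isabella"]

# Concatenate, deduplicate, sort — exactly the expression from the text.
uchars = sorted(set(''.join(docs)))

# Each character's token ID is simply its index in the sorted list.
char_to_id = {ch: i for i, ch in enumerate(uchars)}
print(uchars)
print(char_to_id['a'])  # → 0, since 'a' sorts first
```

With the full dataset, `uchars` would contain all 26 lowercase letters and `char_to_id` would map `a → 0` through `z → 25`.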
Try it
Type some words and see which unique characters are extracted:
We also need one special token: BOS (Beginning of Sequence). It gets the next available ID, bringing our total vocabulary size to 27.
BOS = len(uchars)
vocab_size = len(uchars) + 1
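Concretely, with the full 26-letter alphabet (shown here as a stand-in list, since we don't rebuild `docs`), the two lines above give:

```python
# Stand-in for the real `uchars` built from the full dataset.
uchars = [chr(c) for c in range(ord('a'), ord('z') + 1)]  # ['a', ..., 'z']

BOS = len(uchars)              # 26 — the next available ID after z = 25
vocab_size = len(uchars) + 1   # 27 — the 26 letters plus BOS
print(BOS, vocab_size)         # → 26 27
```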
Our complete vocabulary:
Every name gets wrapped in BOS tokens on both sides. So the name “ava” becomes BOS a v a BOS, or as token IDs: [26, 0, 21, 0, 26].
Why BOS on both sides? Because the model needs to learn two things: what letters a name tends to start with (BOS → first letter), and when a name should end (last letter → BOS).
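Putting it together, the wrapping step might look like this sketch (`tokenize` is a hypothetical helper name, not necessarily what the codebase calls it):

```python
# Rebuild the pieces from the text: the 26-letter vocab plus BOS.
uchars = [chr(c) for c in range(ord('a'), ord('z') + 1)]
char_to_id = {ch: i for i, ch in enumerate(uchars)}
BOS = len(uchars)  # 26

def tokenize(name):
    # Wrap the name in BOS on both sides, so the model can learn both
    # BOS → first letter (how names start) and last letter → BOS (how they end).
    return [BOS] + [char_to_id[ch] for ch in name] + [BOS]

print(tokenize("ava"))  # → [26, 0, 21, 0, 26]
```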
Try it
Type any name to see how the tokenizer converts it: