MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 0.1 The Dataset 0.3 The Count Table →
Step 0: Counting › 0.2

The Tokenizer

So far

  • docs — shuffled names

Before a model can work with text, it needs to convert characters into numbers. This is tokenization.

We use the simplest possible tokenizer: each unique character gets its own ID.

uchars = sorted(set(''.join(docs)))

Reading the one-liner from the inside out:

  • ''.join(docs): concatenate all names into one long string
    "emmaolaboriviaavaisabella..."
  • set(...): extract the unique characters
    {'e', 'm', 'a', 'o', 'l', 'b', 'r', 'i', 'v', ...}
  • sorted(...): sort them alphabetically so the mapping is deterministic
    ['a', 'b', 'c', 'd', 'e', ..., 'x', 'y', 'z']
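The three steps above can be traced one at a time. A tiny stand-in `docs` list is assumed here for illustration; the real dataset is the full shuffled names list:

```python
# A small illustrative stand-in for the shuffled names list.
docs = ["emma", "olivia", "ava", "isabella"]

joined = ''.join(docs)   # one long string: "emmaoliviaavaisabella"
unique = set(joined)     # unordered set of unique characters
uchars = sorted(unique)  # deterministic, alphabetical list

print(uchars)  # → ['a', 'b', 'e', 'i', 'l', 'm', 'o', 's', 'v']
```

With the full dataset, `uchars` contains all 26 lowercase letters instead of just the nine that appear in these four names.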

The result is a list of 26 characters. Each character’s position in this list becomes its token ID: a→0 b→1 c→2 ... z→25
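One common way to make the position-to-ID mapping explicit is a pair of lookup dictionaries; the names `stoi` and `itos` here are illustrative, not necessarily what the guide's code uses:

```python
import string

# With the full dataset, uchars is the 26 lowercase letters.
uchars = list(string.ascii_lowercase)

stoi = {ch: i for i, ch in enumerate(uchars)}  # character -> token ID
itos = {i: ch for i, ch in enumerate(uchars)}  # token ID -> character

print(stoi['a'], stoi['z'])  # → 0 25
print(itos[4])               # → e
```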


We also need one special token: BOS (Beginning of Sequence). It gets the next available ID, bringing our total vocabulary size to 27.

BOS = len(uchars)
vocab_size = len(uchars) + 1
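As a quick sanity check (assuming the alphabet-only `uchars` list), these assignments work out to:

```python
import string

uchars = list(string.ascii_lowercase)  # the 26 letters

BOS = len(uchars)             # 26: the next unused ID
vocab_size = len(uchars) + 1  # 27: the 26 letters plus BOS

print(BOS, vocab_size)  # → 26 27
```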

Our complete vocabulary:

a→0 b→1 c→2 d→3 e→4 f→5 g→6 h→7 i→8 j→9 k→10 l→11 m→12 n→13 o→14 p→15 q→16 r→17 s→18 t→19 u→20 v→21 w→22 x→23 y→24 z→25 BOS→26

Every name gets wrapped in BOS tokens on both sides. So the name “ava” becomes:

BOS(26) a(0) v(21) a(0) BOS(26)

Why BOS on both sides? Because the model needs to learn two things: what letters a name tends to start with (BOS → first letter), and when a name should end (last letter → BOS).
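Putting the pieces together, the whole tokenizer fits in a few lines. The `encode` helper below is a hypothetical name introduced for this sketch:

```python
import string

uchars = list(string.ascii_lowercase)          # the 26 letters
stoi = {ch: i for i, ch in enumerate(uchars)}  # character -> token ID
BOS = len(uchars)                              # 26

def encode(name):
    # Hypothetical helper: map each character to its ID and
    # wrap the sequence in BOS tokens on both sides.
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

print(encode("ava"))  # → [26, 0, 21, 0, 26]
```

The model then learns from pairs of adjacent tokens in this sequence, including BOS→a (how names start) and a→BOS (how names end).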

