What is this?
Andrej Karpathy wrote a single Python file that builds up a GPT transformer from scratch — no libraries, no frameworks, just pure Python and basic maths. He did it in six incremental steps, each adding one “layer of the onion.”
This tutorial goes one step further: we break each of Karpathy’s steps into substeps, and explain each one with diagrams, animations, and interactive visualizations. By the end, you’ll understand every line — not just what it does, but why.
The Six Steps
Gradient Descent (coming soon)
A single-layer MLP bigram model, trained by gradient descent.

Autograd (coming soon)
The same MLP, but now trained with automatic differentiation.

Attention (coming soon)
Single-head attention with position embeddings.

Transformer (coming soon)
A full GPT transformer, trained with SGD.

Adam (coming soon)
The full GPT transformer, trained with Adam. This is the final model.
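To give a flavor of step 1, here is a pure-Python sketch of a bigram model trained by gradient descent. Everything in it is made up for illustration — the three-token vocabulary, the training pairs, the learning rate — and it is not Karpathy's actual code, just the same idea in miniature: one weight row per previous token, softmax over next-token logits, and a hand-derived cross-entropy gradient.

```python
import math, random

# Hypothetical toy setup (not from the tutorial): a 3-token vocabulary
# and the bigram pairs found in the training string "ab."
vocab = ["a", "b", "."]
V = len(vocab)
stoi = {ch: i for i, ch in enumerate(vocab)}
pairs = [(stoi["a"], stoi["b"]), (stoi["b"], stoi["."])]

random.seed(0)
# one weight row per previous token; row i holds the logits for the next token
W = [[random.gauss(0, 0.1) for _ in range(V)] for _ in range(V)]

def softmax(logits):
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

lr = 1.0
for step in range(100):
    loss = 0.0
    grad = [[0.0] * V for _ in range(V)]
    for prev, nxt in pairs:
        probs = softmax(W[prev])
        loss += -math.log(probs[nxt])    # cross-entropy on this pair
        for j in range(V):               # d(loss)/d(logit_j) = prob_j - 1{j == nxt}
            grad[prev][j] += probs[j] - (1.0 if j == nxt else 0.0)
    for i in range(V):                   # plain gradient-descent step
        for j in range(V):
            W[i][j] -= lr * grad[i][j] / len(pairs)

print(round(loss / len(pairs), 4))       # average loss shrinks toward 0
```

After training, the row for "a" puts most of its probability on "b" — exactly the bigram statistic the model was asked to learn.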
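Step 2 replaces those hand-derived gradients with automatic differentiation. A minimal scalar autograd engine (in the style of Karpathy's micrograd, but written here only as an illustration — the class name and its fields are not the tutorial's) can fit in a few dozen lines: each operation records its inputs and local derivatives, and `backward` applies the chain rule in reverse topological order.

```python
class Value:
    """Minimal scalar autograd node: stores data, grad, and how it was made."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # the Values this node was computed from
        self._local_grads = local_grads  # d(self)/d(child) for each child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # topological sort, then chain rule from the output backwards
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, lg in zip(v._children, v._local_grads):
                child.grad += lg * v.grad

x = Value(3.0)
y = Value(2.0)
z = x * y + x       # dz/dx = y + 1 = 3, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)  # → 3.0 3.0
```

Note that `x` appears twice in the graph, so its gradient correctly accumulates contributions from both paths.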
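Step 3's single-head attention with position embeddings can also be sketched in pure Python. The numbers below are invented for illustration (three positions, model dimension 2, and identity Q/K/V projections to keep it readable — real models learn all of these); the mechanics are the standard ones: add position embeddings to token embeddings, score each query against the keys at or before it, softmax, and take a weighted sum of values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

# Hypothetical tiny inputs: T = 3 positions, model dimension d = 2.
tok_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # token embeddings
pos_emb = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1]]   # position embeddings
x = [[t + p for t, p in zip(te, pe)] for te, pe in zip(tok_emb, pos_emb)]

# identity projections keep the sketch readable; a real model learns these
Wq = Wk = Wv = [[1.0, 0.0], [0.0, 1.0]]
d = 2

q = [matvec(Wq, xi) for xi in x]
k = [matvec(Wk, xi) for xi in x]
v = [matvec(Wv, xi) for xi in x]

out = []
for t in range(len(x)):   # causal mask: position t only sees positions 0..t
    scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
              for s in range(t + 1)]
    w = softmax(scores)
    out.append([sum(w[s] * v[s][j] for s in range(t + 1)) for j in range(d)])

print([[round(c, 2) for c in row] for row in out])
```

The causal mask means position 0 can only attend to itself, so its output is exactly its own value vector — a handy sanity check.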
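Finally, the Adam update from step 5, shown on a one-dimensional quadratic rather than a transformer — the function, learning rate, and step count are all chosen for illustration. The update keeps exponential moving averages of the gradient and its square, corrects their startup bias, and scales each step by the ratio.

```python
import math

# Minimize the toy function f(w) = (w - 5)^2, whose gradient is 2*(w - 5).
w, lr = 0.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8     # standard Adam hyperparameters
m = v = 0.0
for t in range(1, 1001):
    g = 2 * (w - 5)
    m = beta1 * m + (1 - beta1) * g      # first moment: running mean of grads
    v = beta2 * v + (1 - beta2) * g * g  # second moment: running mean of grad^2
    m_hat = m / (1 - beta1 ** t)         # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 3))                       # approaches the minimum at w = 5
```

The per-parameter scaling by the second moment is what lets Adam use one learning rate across parameters whose gradients have very different magnitudes — the main reason it replaces plain SGD in the final step.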