MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Step 1

Gradient Descent

The same bigram model, but now it's a neural network trained by gradient descent instead of counting.
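Before walking through the steps below, the core idea can be sketched in a few lines: a bigram model as a neural network is just one weight matrix, where the row for the current character is passed through softmax to give a distribution over the next character. This is a minimal sketch, not the guide's actual code; the 27-token vocabulary (26 letters plus a boundary token) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 27  # assumption: 26 letters plus a boundary token
W = rng.normal(0, 0.01, (vocab_size, vocab_size))  # the model's only parameters

def forward(ix):
    """Given the index of the current character, return a probability
    distribution over the next character: one-hot -> linear -> softmax."""
    logits = W[ix]                        # selecting row ix == one_hot(ix) @ W
    exps = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exps / exps.sum()

probs = forward(0)  # distribution over the 27 possible next characters
```

Unlike the counting model of Step 0, these probabilities start out near-uniform and only become useful once gradient descent adjusts `W`.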

  1.1 Same Starting Point
  1.2 The Parameters
  1.3 Linear & Softmax
  1.4 The MLP Model
  1.5 The Forward Pass
  1.6 Numerical Gradient
  1.7 Analytic Gradient
  1.8 Gradient Check
  1.9 Training & Inference
Key insight: The model is now too expressive for exact solutions. Gradient descent finds good parameters by nudging them downhill — a little at a time, guided by the slope of the loss.
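That nudging loop can be sketched concretely. The snippet below is a toy illustration, not the guide's implementation: it trains a tiny bigram weight matrix on a handful of made-up (current, next) pairs, using the analytic gradient of softmax cross-entropy (`probs - one_hot(target)`) and a fixed learning rate. The vocabulary size, data, and learning rate are all assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
V = 5                                    # toy vocab size (assumption)
W = rng.normal(0, 0.01, (V, V))          # parameters to learn
pairs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]  # toy (current, next) bigrams
lr = 1.0                                 # learning rate: the size of each nudge

def loss_and_grad(W):
    """Mean cross-entropy loss over the bigram pairs, plus its gradient in W."""
    dW = np.zeros_like(W)
    loss = 0.0
    for ix, iy in pairs:
        logits = W[ix]
        exps = np.exp(logits - logits.max())
        probs = exps / exps.sum()
        loss -= np.log(probs[iy])
        probs[iy] -= 1.0                 # d(loss)/d(logits) = probs - one_hot
        dW[ix] += probs
    return loss / len(pairs), dW / len(pairs)

losses = []
for step in range(200):
    loss, dW = loss_and_grad(W)
    W -= lr * dW                         # nudge parameters downhill
    losses.append(loss)
```

Each iteration moves `W` a small step against the gradient, so the loss shrinks over time; that repeated downhill nudge is the whole of gradient descent.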
← Step 0: Counting