Gradient Descent
The same bigram model, but now it's a neural network trained by gradient descent instead of counting.
- 1.1 Same Starting Point
- 1.2 The Parameters
- 1.3 Linear & Softmax
- 1.4 The MLP Model
- 1.5 The Forward Pass
- 1.6 Numerical Gradient
- 1.7 Analytic Gradient
- 1.8 Gradient Check
- 1.9 Training & Inference
Key insight: The model is now too expressive to solve exactly by counting. Gradient descent finds good parameters by nudging them downhill, a little at a time, guided by the slope of the loss.
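The nudging-downhill loop can be sketched on a toy loss. This is a minimal illustration, not the chapter's bigram model: the single parameter `w` and the loss `(w - 3)**2` are hypothetical stand-ins for the network's weights and cross-entropy loss, but the update rule and the numerical-vs-analytic gradient check are the same ideas covered in 1.6 through 1.9.

```python
# Gradient descent on a toy loss L(w) = (w - 3)**2.
# Hypothetical stand-in for the bigram model's cross-entropy loss.

def loss(w):
    return (w - 3.0) ** 2

def analytic_grad(w):
    # dL/dw derived by hand: 2 * (w - 3)
    return 2.0 * (w - 3.0)

def numerical_grad(w, h=1e-5):
    # Centered finite difference: the "slope of the loss" at w.
    return (loss(w + h) - loss(w - h)) / (2 * h)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate: the size of each nudge
for step in range(100):
    g = analytic_grad(w)
    # Gradient check: analytic and numerical slopes should agree.
    assert abs(g - numerical_grad(w)) < 1e-6
    w -= lr * g  # nudge downhill, against the slope

print(round(w, 4))  # w converges toward the minimum at 3.0
```

Each step moves `w` a fraction (`lr`) of the slope in the downhill direction; after enough steps it settles near the minimum, the same way the network's parameters settle into a low-loss configuration.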