Gradient Descent
The same bigram model, but now it's a neural network trained by gradient descent instead of counting.
- 1.1 Same Starting Point
- 1.2 The Parameters
- 1.3 Linear & Softmax
- 1.4 The MLP Model
- 1.5 The Forward Pass
- 1.6 Numerical Gradient
- 1.7 Analytic Gradient
- 1.8 Gradient Check
- 1.9 Training & Inference
Key insight: The model is now too expressive to solve exactly by counting. Gradient descent finds good parameters by nudging them downhill, a little at a time, guided by the slope of the loss.
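The nudging-downhill loop can be sketched on a toy loss. This is a minimal illustration, not the chapter's bigram model: the single parameter `w` and the loss `(w - 3)**2` are hypothetical stand-ins for the network's weights and cross-entropy loss, but the update rule and the numerical-vs-analytic gradient check are the same ideas covered in 1.6 through 1.9.

```python
# Gradient descent on a toy loss L(w) = (w - 3)**2.
# Hypothetical stand-in for the bigram model's cross-entropy loss.

def loss(w):
    return (w - 3.0) ** 2

def analytic_grad(w):
    # dL/dw derived by hand: 2 * (w - 3)
    return 2.0 * (w - 3.0)

def numerical_grad(w, h=1e-5):
    # Centered finite difference: the "slope of the loss" at w.
    return (loss(w + h) - loss(w - h)) / (2 * h)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate: the size of each nudge
for step in range(100):
    g = analytic_grad(w)
    # Gradient check: analytic and numerical slopes should agree.
    assert abs(g - numerical_grad(w)) < 1e-6
    w -= lr * g  # nudge downhill, against the slope

print(round(w, 4))  # w converges toward the minimum at 3.0
```

Each step moves `w` a fraction (`lr`) of the slope in the downhill direction; after enough steps it settles near the minimum, the same way the network's parameters settle into a low-loss configuration.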