MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

A Brief History of the Ideas in microGPT

Every concept in this tutorial has a history. Some date back over a century; others were published just a few years before the code we’re studying. This timeline traces each idea from its origin to its role in microGPT.


1906 — Markov Chains

A.A. Markov, “Extension of the Law of Large Numbers to Dependent Events”

Markov developed the mathematics of chains of dependent events in this 1906 paper; in a famous 1913 follow-up study he applied it to sequences of Russian vowels and consonants in Pushkin’s Eugene Onegin, showing that the probability of the next letter depends on the current one. This is exactly the bigram model we build in Step 0 — the probability of the next token depends only on the previous token. Markov chains remain the mathematical foundation for all n-gram language models.
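The whole idea fits in a few lines. Here is a minimal sketch of a bigram character model (the `train_bigram` name is ours, not the tutorial's):

```python
from collections import defaultdict

def train_bigram(text):
    # Count how often each character follows each other character.
    counts = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    # Turn counts into conditional probabilities P(next | prev).
    return {prev: {ch: n / sum(nexts.values()) for ch, n in nexts.items()}
            for prev, nexts in counts.items()}

probs = train_bigram("abracadabra")
```

In "abracadabra", the letter after "a" is "b" twice, "c" once, and "d" once, so `probs["a"]["b"]` comes out to 0.5 — the Markov property in action.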


1948 — Information Theory, Entropy, and Cross-Entropy

Claude Shannon, “A Mathematical Theory of Communication,” Bell System Technical Journal

Shannon formalized how to measure information. His entropy measure H(p) quantifies the uncertainty in a distribution, and cross-entropy H(p, q) measures how well a model q approximates the true distribution p. Minimizing cross-entropy — the loss function used in every step of this tutorial — is equivalent to maximum likelihood estimation. Shannon also formalized n-gram language models, showing that letter sequences can be characterized by their statistical dependencies.
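As a concrete sketch: when the true distribution p is a one-hot target, H(p, q) collapses to the negative log-probability the model assigns to the correct answer — exactly the per-example loss used throughout the tutorial.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * log q(x), measured in nats.
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

p = [1.0, 0.0, 0.0]  # true distribution: a one-hot target
q = [0.7, 0.2, 0.1]  # model's predicted distribution
loss = cross_entropy(p, q)  # reduces to -log 0.7
```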


1951 — Stochastic Gradient Descent

Herbert Robbins and Sutton Monro, “A Stochastic Approximation Method,” Annals of Mathematical Statistics

Rather than computing gradients over the entire dataset, Robbins and Monro proved you can update parameters using noisy estimates from individual examples — and still converge. This is the SGD we use from Step 1 through Step 4: pick one training example, compute the gradient, take a step. Their convergence proof also established that learning rates must decay over time — the schedule we use throughout.
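The loop is almost too simple to write down, but here it is as a sketch (the `sgd` helper and toy objective are ours, not the tutorial's code):

```python
import random

def sgd(params, grad_fn, data, base_lr=0.1, steps=100):
    # Robbins-Monro: update with noisy per-example gradients,
    # using a learning rate that decays over time.
    for t in range(steps):
        example = random.choice(data)     # one training example
        grads = grad_fn(params, example)  # noisy gradient estimate
        lr = base_lr / (1 + t)            # decaying schedule
        params = [p - lr * g for p, g in zip(params, grads)]
    return params

# Toy objective: minimize (w - x)^2 at x = 3; the gradient is 2(w - x).
w = sgd([0.0], lambda p, x: [2 * (p[0] - x)], data=[3.0])
```

Starting from w = 0, the parameter drifts toward the optimum at 3 in ever-smaller steps.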


1958 — The Perceptron

Frank Rosenblatt, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain,” Psychological Review

The first learnable artificial neuron. A perceptron takes weighted inputs, sums them, and applies a threshold. It’s a single-layer network — one linear transformation. Our MLP in Step 1 is its direct descendant: stack multiple layers of these, replace the threshold with smooth activations, and you get a multi-layer perceptron.
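Rosenblatt's unit can be sketched in three lines — a weighted sum and a hard threshold:

```python
def perceptron(inputs, weights, bias):
    # Weighted sum of the inputs, then a hard 0/1 threshold.
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

# With weights [1, 1] and bias -1.5, this single unit computes logical AND.
```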


1970–1986 — Backpropagation

Seppo Linnainmaa (1970), Paul Werbos (1974), David Rumelhart, Geoffrey Hinton, and Ronald Williams (1986)

Three milestones for one idea. Linnainmaa invented reverse-mode automatic differentiation in his 1970 master’s thesis at the University of Helsinki — the algorithm itself. Werbos first applied it to neural networks in his 1974 Harvard PhD thesis. But it was Rumelhart, Hinton, and Williams who demonstrated in their 1986 Nature paper (“Learning Representations by Back-Propagating Errors”) that backpropagation through hidden layers lets networks learn useful internal representations. This made modern deep learning possible.

In Step 1, we compute gradients by hand. In Step 2, the Value class implements exactly what Linnainmaa described: record operations in a computational graph, then walk it backward applying the chain rule.
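Here is a toy sketch of that idea — much smaller than the tutorial's Value class, but the same mechanism: every operation records its inputs and local derivatives, and backward() walks the graph applying the chain rule.

```python
class Value:
    # A scalar that remembers how it was computed, so gradients can
    # flow backward through the recorded graph (reverse-mode autodiff).
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, grad=1.0):
        # Chain rule: accumulate this path's gradient, then recurse.
        # (Real implementations topologically sort the graph instead
        # of recursing per path, but the result is the same here.)
        self.grad += grad
        for child, local in zip(self._children, self._local_grads):
            child.backward(grad * local)

a, b = Value(2.0), Value(3.0)
c = a * b + a        # c = 8; dc/da = b + 1 = 4, dc/db = a = 2
c.backward()
```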


1989 — Softmax

John Bridle, “Training Stochastic Model Recognition Algorithms as Networks Can Lead to Maximum Mutual Information Estimation of Parameters,” NeurIPS 1989

The mathematical function itself, a normalized exponential, had existed in statistical mechanics since Boltzmann (1868). Bridle’s contribution was recognizing that applying it to a neural network’s output turns raw logits into proper probability distributions — posterior class probabilities. We use softmax in every step of the tutorial: logits → probabilities → sampling or loss computation.
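In code, that logits → probabilities step is one small function (subtracting the max before exponentiating is a standard numerical-stability trick, not something Bridle's paper dwells on):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability, exponentiate, normalize.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.0])  # positive, sums to 1, order preserved
```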


2003 — Word Embeddings

Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin, “A Neural Probabilistic Language Model,” JMLR

Before this paper, words were typically represented as sparse one-hot vectors — no notion of similarity. Bengio et al. showed that learning dense, low-dimensional vectors (embeddings) alongside the language model captures semantic structure: similar words end up with similar vectors. Our wte (token embedding) matrix in Step 1 is exactly this idea — each token gets a learned 16-dimensional vector.

This approach was later popularized by Mikolov et al.’s Word2Vec (2013), but Bengio’s paper established the paradigm.
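Mechanically, an embedding table is nothing more than a matrix whose rows are looked up by token id. A sketch, with the vocabulary size assumed and the 16 dimensions taken from the text:

```python
import random

vocab_size, n_embd = 27, 16   # assumed sizes; the text uses 16-dim vectors
random.seed(0)
# One learned vector per token -- the role of the tutorial's wte matrix.
wte = [[random.gauss(0.0, 0.02) for _ in range(n_embd)]
       for _ in range(vocab_size)]

def embed(token_id):
    # "Embedding" a token is just looking up its row.
    return wte[token_id]
```

During training, gradients flow into the looked-up row, so similar tokens gradually acquire similar vectors.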


2010 — ReLU

Vinod Nair and Geoffrey Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” ICML 2010

A deceptively simple activation function: max(0, x). By replacing sigmoid and tanh activations with ReLU, networks train faster (no vanishing gradient for positive values), learn sparse representations, and preserve magnitude information. The MLP block in Steps 3–5 uses ReLU between its two linear layers: expand → ReLU → compress.
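The expand → ReLU → compress pattern can be sketched with plain lists (the `mlp_block` name and explicit matrix shapes are ours):

```python
def relu(x):
    # max(0, v), applied elementwise.
    return [max(0.0, v) for v in x]

def linear(x, W, b):
    # y = W x + b, written out explicitly.
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(W, b)]

def mlp_block(x, W1, b1, W2, b2):
    # expand -> ReLU -> compress, as in the Transformer's MLP sublayer.
    return linear(relu(linear(x, W1, b1)), W2, b2)
```

Note that any hidden unit whose pre-activation is negative contributes exactly zero — the sparsity the paragraph mentions.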


2014 — The Attention Mechanism

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural Machine Translation by Jointly Learning to Align and Translate,” ICLR 2015 (arXiv September 2014)

The breakthrough insight: instead of compressing an entire input sequence into a single fixed-size vector, let the model look back at all positions and learn which ones are relevant. Bahdanau’s attention computes a soft alignment over the encoder’s hidden states at each decoding step. This is the ancestor of the attention mechanism we build in Step 3 — though the Transformer’s self-attention (2017) reformulates it with Q, K, V projections.


2014 — Adam Optimizer

Diederik Kingma and Jimmy Ba, “Adam: A Method for Stochastic Optimization,” ICLR 2015 (arXiv December 2014)

Adam combines two ideas: momentum (tracking a running average of the gradient) and adaptive learning rates (tracking a running average of the squared gradient, à la RMSProp). Each parameter gets its own effective step size. Bias correction handles the cold-start problem. This is what Step 5 implements — five lines of code that replaced SGD as the default optimizer for nearly all of deep learning.
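A single Adam update for one scalar parameter, as a sketch (real code vectorizes this over all parameters and carries m, v as state):

```python
import math

def adam_step(p, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Momentum: running average of the gradient.
    m = b1 * m + (1 - b1) * grad
    # Adaptive scale: running average of the squared gradient (RMSProp).
    v = b2 * v + (1 - b2) * grad * grad
    # Bias correction: the averages start at zero, so early estimates
    # are too small; dividing by (1 - beta^t) compensates.
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return p - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

p, m, v = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
```

On the very first step the bias-corrected update is roughly lr · sign(gradient), which is why Adam makes steady progress even before the running averages warm up.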


2015 — Residual Connections

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep Residual Learning for Image Recognition,” CVPR 2016 (arXiv December 2015)

A simple idea that enabled very deep networks: add the layer’s input directly to its output. Instead of learning the full transformation, the layer only needs to learn the residual — what to add. This provides a gradient highway that prevents signal degradation across many layers. In Step 3 onward, both the attention block and the MLP block have residual connections: x = [a + b for a, b in zip(x, x_residual)].


2016 — Layer Normalization

Jimmy Ba, Jamie Ryan Kiros, and Geoffrey Hinton, “Layer Normalization,” arXiv July 2016

Batch normalization (Ioffe and Szegedy, 2015) normalized across examples in a batch. Layer normalization normalizes across features within a single example — making it independent of batch size and applicable to sequences of any length. This was the normalization used in the original Transformer.
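The core computation, sketched without the learnable gain and bias that the full version adds:

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize across the features of a single example:
    # subtract the mean, divide by the standard deviation.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

out = layer_norm([1.0, 2.0, 3.0, 4.0])  # zero mean, unit variance
```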


2017 — The Transformer

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention Is All You Need,” NeurIPS 2017

The paper that changed everything. The Transformer replaced recurrence entirely with self-attention: each position computes Query, Key, and Value projections, scores every other position via Q·Kᵀ/√d, and produces a weighted sum of the values. Stacked with feed-forward (MLP) sublayers, residual connections, and layer norm, it processes entire sequences in parallel.

This single paper introduced or combined: self-attention with Q/K/V, multi-head attention (running h heads in parallel on different subspaces), positional encodings (sinusoidal functions to inject order information), and the encoder-decoder architecture that became the starting point for GPT’s decoder-only variant. Virtually everything in Steps 3 and 4 traces back to this paper.
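Scaled dot-product attention for a single query position, as a sketch with plain lists (a single head, no projections or masking):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(q, keys, values):
    # scores = q . k / sqrt(d) for each key,
    # weights = softmax(scores),
    # output  = weighted sum of the value vectors.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
              for k in keys]
    w = softmax(scores)
    return [sum(wi * v[j] for wi, v in zip(w, values))
            for j in range(len(values[0]))]

# A query that matches the first key pulls the output toward the first value.
out = attention([1.0, 0.0],
                [[1.0, 0.0], [0.0, 1.0]],
                [[1.0, 0.0], [0.0, 1.0]])
```

Multi-head attention runs h copies of this on different learned projections of the same input and concatenates the results.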


2018 — GPT

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever, “Improving Language Understanding by Generative Pre-Training,” OpenAI, June 2018

GPT-1 took the decoder half of the Transformer and trained it with a simple objective: predict the next token. Pre-train on a large corpus (BooksCorpus), then fine-tune on downstream tasks. This established the “generative pre-training” paradigm. The model we build in this tutorial — a decoder-only Transformer trained to predict the next character — is the same architecture, just smaller.


2019 — RMSNorm

Biao Zhang and Rico Sennrich, “Root Mean Square Layer Normalization,” NeurIPS 2019

A simplification of Layer Normalization: instead of both centering (subtracting the mean) and scaling (dividing by standard deviation), RMSNorm only scales by the root mean square. Faster to compute, and empirically just as effective. RMSNorm is now the default in most modern LLMs (LLaMA, Mistral, Gemma) and is the normalization we use from Step 3 onward — Karpathy chose the modern variant.
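Side by side with layer norm, the simplification is clear — no mean to subtract, just one scale (the learnable gain is again omitted from this sketch):

```python
import math

def rms_norm(x, eps=1e-5):
    # Scale by the root mean square only; no centering step.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

out = rms_norm([3.0, 4.0])  # output has unit root mean square
```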


~2019 — KV Cache

Not a single invention but an engineering technique that became standard practice with GPT-2-era autoregressive models. During generation, each new token only needs its own Q, K, V — the K and V from all previous tokens can be cached and reused. This reduces per-step computation from O(n²) to O(n). We build the KV cache in Step 3 (keys.append(k)) and restructure it for multiple layers in Step 4 (keys[li].append(k)).
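A sketch of one generation step with a cache (single head, no projections; the `generate_step` name is ours). The new token contributes one key and one value, and attention then runs over the whole cache in O(n):

```python
import math

def generate_step(q, k, v, keys, values):
    # Cache the new token's key and value, then attend over all
    # cached positions -- O(n) work per generated token.
    keys.append(k)
    values.append(v)
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    w = [e / total for e in exps]
    return [sum(wi * val[j] for wi, val in zip(w, values))
            for j in range(len(v))]

keys, values = [], []  # the cache persists across generation steps
out1 = generate_step([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], keys, values)
out2 = generate_step([0.0, 1.0], [0.0, 1.0], [0.0, 1.0], keys, values)
```

With only one cached position, the first step's output is exactly its value vector; the second step attends over both cached positions and leans toward the key it matches.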


The Lineage

Reading this timeline, patterns emerge. Shannon’s 1948 information theory gives us both the n-gram model (Step 0) and the loss function (all steps). Rosenblatt’s 1958 perceptron evolves into the MLP (Steps 1–2). Linnainmaa’s 1970 autodiff algorithm becomes our Value class (Step 2). Bahdanau’s 2014 attention becomes self-attention (Step 3). Kaiming He’s 2015 residual connections enable deep networks (Steps 3–5). Vaswani’s 2017 Transformer assembles all of these into a coherent architecture (Steps 3–4). And Kingma’s 2014 Adam optimizer trains it efficiently (Step 5).

microGPT is small, but it stands on decades of ideas from information theory, neuroscience, optimization, and machine learning. Every line of code has a history.