MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

Glossary

Key terms used in this tutorial, with links to where they’re introduced or explained most fully.


A

Activation function — A non-linear function applied after a linear layer. Introduces the non-linearity that lets neural networks learn complex patterns. microGPT uses ReLU. → Step 1.4

Adam optimizer — Adaptive Moment Estimation. An optimizer that tracks per-parameter momentum and gradient magnitude, giving each parameter its own effective learning rate. → Step 5.1
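A minimal single-parameter sketch of the Adam update in plain Python (illustrative, not the tutorial's actual code; the β and ε defaults shown are the common textbook values):

```python
def adam_step(p, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (illustrative)."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: EMA of the gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # EMA of the squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    p = p - lr * m_hat / (v_hat ** 0.5 + eps)   # per-parameter effective step
    return p, m, v
```

Because the step is divided by √v̂, parameters with consistently large gradients take smaller effective steps than the raw learning rate suggests.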

Attention — A mechanism that lets the model look back at all previous positions and learn which ones are relevant for predicting the next token. Computes a weighted sum of values based on query-key similarity. → Step 3.1

Attention logits — The raw scores from the dot product of queries and keys, before softmax. Divided by √d so the softmax doesn't become too peaked. → Step 3.5

Attention weights — The probability distribution produced by applying softmax to the attention logits. Each weight says how much to attend to a given position. → Step 3.5

Autoregressive — A model that generates one token at a time, using its own previous outputs as input. Each prediction depends only on the tokens that came before it — the model can never look ahead. microGPT is autoregressive: it predicts token t+1 from tokens 0 through t. → Step 0.7

Autograd — Short for automatic differentiation. A system that records operations in a computational graph and computes gradients automatically via the chain rule. → Step 2.1


B

Backpropagation — The algorithm for computing gradients by walking the computational graph backward from the loss, applying the chain rule at each node. → Step 2.5

Backward pass — The phase of training where gradients are computed by propagating the loss backward through the network. Follows the forward pass. → Step 2.5

Bias correction — A correction applied to Adam’s momentum and adaptive rate estimates to compensate for their zero initialization in early training steps. → Step 5.2

Bigram — A pair of consecutive tokens. A bigram model predicts the next token based only on the current one. → Step 0.1

Block size — The maximum sequence length the model can process. Determines the size of the position embedding table and the KV cache. → Step 3.2

BOS — Beginning of sequence. A special token (index 0) used to mark the start and end of each training example. → Step 0.2


C

Chain rule — The calculus rule for computing the derivative of a composition of functions. The mathematical foundation of backpropagation: multiply the local gradient by the upstream gradient. → Step 2.5
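A tiny numeric sketch of "local gradient × upstream gradient" (illustrative; the function and names are made up for this example):

```python
def grad_composed(x):
    """dy/dx for y = (3x)^2, computed the way backprop would: as a product
    of local gradients along the chain (illustrative)."""
    u = 3 * x        # inner function u = g(x)
    du_dx = 3        # local gradient of the inner function
    dy_du = 2 * u    # local gradient of the outer square, evaluated at u
    return dy_du * du_dx
```

Analytically y = 9x², so dy/dx = 18x; the chained product gives the same result.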

Computational graph — A directed graph where nodes are values and edges are operations. Built during the forward pass and walked backward during backpropagation. → Step 2.3

Cross-entropy — The loss function used throughout this tutorial. Measures how well the model’s predicted probability distribution matches the true next token. Equivalent to the negative log probability of the correct token. → Step 0.5
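For a single target token, cross-entropy reduces to the negative log probability of the correct token. A one-line sketch (illustrative names):

```python
import math

def cross_entropy(probs, target_index):
    """Negative log probability assigned to the correct token (illustrative)."""
    return -math.log(probs[target_index])
```

A uniform distribution over 4 tokens gives loss ln 4 ≈ 1.386; a confident correct prediction drives the loss toward 0.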


D

Decoder-only — A transformer architecture that uses only the decoder half of the original Transformer design: causal self-attention (each position can only attend to previous positions) and MLPs, but no encoder and no cross-attention. GPT and microGPT are decoder-only. → Step 4.1

Dot product — The sum of element-wise products of two vectors. Used in attention to measure the similarity between a query and a key. → Step 3.5


E

Embedding — A learned dense vector representing a token or position. The model looks up an embedding for each input token (wte) and each position (wpe). → Step 1.2

Exponential moving average (EMA) — A running average that blends the new value with the old: ema = β·ema + (1−β)·x. Used in Adam for both momentum and adaptive rate tracking. → Step 5.2
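The update above in code, plus its key property: feeding a constant stream drives the average toward that constant (illustrative sketch):

```python
def ema_update(ema, x, beta=0.9):
    """Blend a new observation x into the running average (illustrative)."""
    return beta * ema + (1 - beta) * x

# after many updates with the same value, the EMA converges to it
e = 0.0
for _ in range(100):
    e = ema_update(e, 1.0)
```

The higher β is, the slower the average responds to new values, and the smoother it is.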


F

Forward pass — The phase of computation where inputs flow through the network to produce outputs (logits). Precedes the backward pass during training. → Step 1.5


G

Gradient — The derivative of the loss with respect to a parameter. Points in the direction that would increase the loss; training subtracts it to decrease the loss. → Step 1.3

Gradient descent — An optimization algorithm that repeatedly adjusts parameters in the direction opposite to the gradient. The foundation of neural network training. → Step 1.1


H

Head — One parallel attention computation within multi-head attention. Each head operates on a different subspace of the embedding, with its own Q, K, V projections. → Step 4.2

Head dimension — The dimensionality of each attention head. Equal to n_embd / n_head. In microGPT: 16 / 4 = 4. → Step 4.2

Hidden layer — An intermediate layer between input and output in a neural network. The MLP has one hidden layer with 64 dimensions. → Step 1.4

Hyperparameter — A value set before training that controls the learning process (learning rate, number of layers, etc.), as opposed to parameters which are learned during training. → Step 1.1


I

Inference — Using a trained model to generate predictions. No gradients are computed — only the forward pass runs. → Step 0.7


K

Key — In attention, a projection of each position’s embedding that represents what information that position has. Compared against queries via dot product. → Step 3.4

KV cache — A list that stores the keys and values from all previous positions during inference, so they don’t need to be recomputed at each step. → Step 3.7


L

Layer — One complete attention + MLP block in the transformer. microGPT uses 1 layer (n_layer = 1), but the code supports any number. → Step 4.3

Learning rate — A scalar that controls how large each parameter update is. Too high = instability. Too low = slow convergence. → Step 1.1

Learning rate decay — A schedule that reduces the learning rate over the course of training. microGPT uses linear decay: lr × (1 − step/total_steps). → Step 1.7
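The linear schedule from the entry above, as a sketch (illustrative function name):

```python
def decayed_lr(base_lr, step, total_steps):
    """Linear decay: full rate at step 0, zero at the final step (illustrative)."""
    return base_lr * (1 - step / total_steps)
```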

Linear layer — A matrix multiplication that projects a vector from one dimensionality to another. The fundamental building block of neural networks. → Step 1.4

Logits — The raw, unnormalized output of the model before softmax. Each logit corresponds to one token in the vocabulary. → Step 1.5

Loss — A single number measuring how wrong the model’s prediction is. Training minimizes the loss. In this tutorial, always cross-entropy loss. → Step 0.5


M

MLP — Multi-layer perceptron. A feed-forward network: linear → activation → linear. In the transformer, it processes each position independently after attention. → Step 1.4

Momentum — In Adam, an exponential moving average of the gradient. Smooths out noisy gradients and builds speed in consistent directions. → Step 5.2

Multi-head attention — Running multiple attention heads in parallel, each on a different subspace of the embedding, then concatenating and projecting. → Step 4.2


N

n-gram — A sequence of n consecutive tokens. A bigram model (n=2) predicts the next token from the current one. → Step 0.1

Normalization — Scaling activations to stabilize training. microGPT uses RMSNorm (scaling by root mean square) before attention and MLP blocks. → Step 3.3


O

Optimizer — The algorithm that updates model parameters using gradients. SGD (Steps 1–4) and Adam (Step 5) are the two optimizers used in this tutorial. → Step 1.1

Output projection — The linear layer (attn_wo) that projects the concatenated attention head outputs back to the embedding dimension. → Step 3.6


P

Parameter — A learnable value in the model (weights and biases). microGPT has 4,192 parameters. Adjusted during training to minimize loss. → Step 1.2

Position embedding — A learned vector for each position in the sequence (wpe). Added to the token embedding so the model knows token order. → Step 3.2

Probability distribution — A set of non-negative values that sum to 1. The model’s output after softmax — one probability per token in the vocabulary. → Step 0.4


Q

Query — In attention, a projection of the current position’s embedding that represents what information it’s looking for. Compared against keys. → Step 3.4


R

ReLU — Rectified Linear Unit. The activation function max(0, x). Zero for negative inputs, identity for positive. Used in the MLP block. → Step 1.4
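The definition in code is a one-liner (illustrative):

```python
def relu(x):
    """max(0, x): zero for negative inputs, identity for positive (illustrative)."""
    return max(0.0, x)
```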

Residual connection — Adding a layer’s input directly to its output (x = x + f(x)). Provides a gradient highway and lets layers learn incremental changes. → Step 3.6

RMSNorm — Root Mean Square Normalization. Divides each element by the root mean square of the vector. Simpler and faster than Layer Normalization. → Step 3.3
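A sketch of the core computation in plain Python (illustrative; the small ε added for numerical safety is a common convention, not necessarily microGPT's exact value, and the learned gain that normally follows is omitted):

```python
import math

def rmsnorm(x, eps=1e-5):
    """Divide each element by the vector's root mean square (illustrative)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

After normalization the vector's root mean square is (approximately) 1, regardless of the input's scale.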


S

Sampling — Generating tokens by drawing from the model’s predicted probability distribution, rather than always picking the most likely token. → Step 0.7

Scaled dot-product attention — The attention formula: softmax(Q·Kᵀ / √d) · V. Dividing by √d keeps the dot products from growing too large as the head dimension increases. → Step 3.5
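A sketch of the formula for a single query attending over a list of keys and values, in plain Python (illustrative; real implementations batch this over all positions and heads):

```python
import math

def attend(q, keys, values):
    """softmax(q·K / sqrt(d)) · V for one query vector (illustrative)."""
    d = len(q)
    # attention logits: scaled dot product of the query with each key
    logits = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # softmax over the logits (subtract max for numerical stability)
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # output: weighted sum of the value vectors
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
```

With two identical keys the weights split 50/50, so the output is the mean of the two values.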

Self-attention — Attention where queries, keys, and values all come from the same sequence (unlike cross-attention where keys/values come from a different sequence). → Step 3.4

SGD — Stochastic gradient descent. The simplest optimizer: subtract lr × gradient from each parameter. Used in Steps 1–4. → Step 1.1
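The whole update in one line of plain Python (illustrative names):

```python
def sgd_step(params, grads, lr=0.01):
    """Vanilla SGD: move each parameter against its gradient (illustrative)."""
    return [p - lr * g for p, g in zip(params, grads)]
```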

Softmax — A function that converts a vector of logits into a probability distribution. Each output is positive and they sum to 1. → Step 0.4
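A sketch in plain Python (illustrative; subtracting the max before exponentiating is a standard numerical-stability trick and doesn't change the result):

```python
import math

def softmax(logits):
    """Exponentiate and normalize: outputs are positive and sum to 1 (illustrative)."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]
```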


T

Temperature — A scaling factor applied to logits before softmax during sampling. Lower temperature → more deterministic (picks likely tokens). Higher → more random. → Step 0.7
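A sketch of temperature scaling (illustrative; dividing the logits by the temperature before softmax):

```python
import math

def sample_probs(logits, temperature=1.0):
    """Divide logits by the temperature, then softmax (illustrative)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                        # numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At low temperature the distribution concentrates on the largest logit; at high temperature it flattens toward uniform.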

Token — The atomic unit the model operates on. In microGPT, each character is one token. The vocabulary has 27 tokens (26 letters + BOS). → Step 0.1

Token embedding — The learned vector for each token in the vocabulary (wte). The model’s representation of what a token means. → Step 1.2

Topological sort — An ordering of a directed graph such that every node comes after all nodes it depends on. Used to determine the order of operations during backpropagation. → Step 2.5
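A depth-first post-order sketch of the idea (illustrative; `children(n)` here is a hypothetical callback returning the nodes n was computed from):

```python
def topo_sort(node, children):
    """Return nodes so that every node appears after everything it depends on.
    `children(n)` yields the nodes n was computed from (illustrative)."""
    order, visited = [], set()

    def visit(n):
        if n in visited:
            return
        visited.add(n)
        for c in children(n):   # make sure dependencies are placed first
            visit(c)
        order.append(n)         # then the node itself

    visit(node)
    return order
```

Backpropagation walks this order in reverse, so each node's gradient is complete before it is propagated to its inputs.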

Training loop — The repeated cycle of: pick a training example → forward pass → compute loss → backward pass → update parameters. → Step 1.7

Transformer — The architecture that combines self-attention, MLPs, residual connections, and normalization into a stack of layers. microGPT is a decoder-only transformer. → Step 4.1


V

Value (attention) — In attention, a projection of each position’s embedding that carries the actual information to be aggregated. Weighted by the attention weights. → Step 3.4

Value (autograd) — The Value class that wraps numbers to enable automatic differentiation. Records operations and supports .backward(). → Step 2.2
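A stripped-down sketch of such a class, supporting only + and × (illustrative; the tutorial's real Value class has more operations):

```python
class Value:
    """Minimal autograd scalar: records children and a local backward rule."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad            # d(a+b)/da = 1
            other.grad += out.grad           # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # topological sort, then apply each node's local rule in reverse
        order, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    visit(c)
                order.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

For c = a·b + a with a = 2 and b = 3, calling c.backward() yields ∂c/∂a = b + 1 = 4 and ∂c/∂b = a = 2.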

Vocabulary — The set of all tokens the model can process. microGPT’s vocabulary: the 26 lowercase letters plus BOS (27 tokens total). → Step 0.1