
Parameters as Values

Previously Defined

Value class with automatic gradient computation
backward() walks the graph applying the chain rule

With the Value class in place, the model code barely changes. Here’s every difference:

Initialization

Parameters are now Value objects instead of plain floats:

n_embd = 16
# Wrap every weight in Value so the forward pass records it in the computation graph
matrix = lambda nout, nin: [[Value(random.gauss(0, 0.08)) for _ in range(nin)] for _ in range(nout)]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}
params = [p for mat in state_dict.values() for row in mat for p in row]

The only line that changed is matrix, which now wraps each random number in Value(...). As a result, params is a flat list of Value objects, so we can read and update .data and .grad directly instead of tracking (row, j) index tuples.
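To make the flat-list point concrete, here is a minimal sketch. The Value class below is a hypothetical stand-in with only the operations this example needs, not the class built in the earlier chapters, and the toy sizes, the sum-of-squares loss, and learning_rate are assumptions chosen purely for illustration.

```python
import random

class Value:
    """Hypothetical minimal stand-in for the tutorial's Value class."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data            # the raw float
        self.grad = 0.0             # filled in by backward()
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__              # lets sum() start from the int 0

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Visit nodes in reverse topological order, applying the chain rule.
        topo, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    visit(c)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, g in zip(v._children, v._local_grads):
                child.grad += g * v.grad

random.seed(0)
vocab_size, n_embd = 3, 2           # toy sizes, chosen only for this sketch
matrix = lambda nout, nin: [[Value(random.gauss(0, 0.08)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(vocab_size, n_embd)}
params = [p for mat in state_dict.values() for row in mat for p in row]

# The flat list holds the same objects as the matrices, not copies.
same_object = params[0] is state_dict['wte'][0][0]

# Toy loss (sum of squares) just to produce gradients: d(loss)/dp = 2 * p.
before = [p.data for p in params]
loss = sum(p * p for p in params)
loss.backward()

learning_rate = 0.1                 # hypothetical value
for p in params:
    p.data -= learning_rate * p.grad    # no (row, j) bookkeeping needed
    p.grad = 0.0                        # reset before the next backward pass
```

Because params and state_dict share the same Value objects, updating p.data through the flat list also updates the weight that the model reads from state_dict.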

Softmax

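def softmax(logits):
    # Read raw floats via .data so finding the max adds no graph node
    max_val = max(val.data for val in logits)
    # Shifting by the max keeps exp() from overflowing; Value.exp() records the op
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs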

Two changes: max(logits) becomes max(val.data for val in logits), which reads the raw floats through .data so that finding the max does not add a node to the graph, and math.exp(...) becomes .exp(), the Value’s own method, so the operation gets recorded.
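Treating max_val as a plain constant is safe because softmax is unchanged when the same number is subtracted from every logit, so the gradients with respect to the logits come out the same. The sketch below checks this numerically. Its Value class is a hypothetical minimal stand-in, not the tutorial's own, and the logits are made-up numbers large enough that a plain math.exp(1000.0) would raise OverflowError.

```python
import math

class Value:
    """Hypothetical minimal stand-in for the tutorial's Value class."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__              # lets sum() start from the int 0

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))
    __rmul__ = __mul__

    def __sub__(self, other):
        return self + (-1.0) * other

    def __truediv__(self, other):
        # d(a/b)/da = 1/b and d(a/b)/db = -a/b^2
        return Value(self.data / other.data, (self, other),
                     (1.0 / other.data, -self.data / other.data ** 2))

    def exp(self):
        e = math.exp(self.data)
        return Value(e, (self,), (e,))

    def backward(self):
        topo, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    visit(c)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, g in zip(v._children, v._local_grads):
                child.grad += g * v.grad

def softmax(logits):
    max_val = max(val.data for val in logits)    # plain float, not a graph node
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits; the shift turns them into -2, -1, 0 before exp().
logits = [Value(1000.0), Value(1001.0), Value(1002.0)]
probs = softmax(logits)
probs[0].backward()
p = [q.data for q in probs]

# Analytic softmax gradient: d p0 / d z_j = p0 * ((1 if j == 0 else 0) - p_j)
expected = [p[0] * (1.0 - p[0]), -p[0] * p[1], -p[0] * p[2]]
actual = [z.grad for z in logits]
```

Here actual matches expected up to floating-point error, even though max_val never entered the graph.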

ReLU

def mlp(token_id):
    x = state_dict['wte'][token_id]
    x = linear(x, state_dict['mlp_fc1'])
    x = [xi.relu() for xi in x]
    logits = linear(x, state_dict['mlp_fc2'])
    return logits

max(0, xi) becomes xi.relu(). The linear() function doesn’t change at all — it only uses * and +, which Value already handles through __mul__ and __add__.
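The sketch below shows why linear() can stay as it is. The linear() definition here is an assumption (one dot product per output row, written with sum()), since the original is defined in an earlier chapter, and the Value class is a hypothetical minimal stand-in. One detail under that assumption: sum() starts from the int 0, so the first addition is 0 + Value, which goes through __radd__ rather than __add__.

```python
class Value:
    """Hypothetical minimal stand-in for the tutorial's Value class."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__              # handles the 0 + Value that sum() starts with

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        # Gradient passes through only where the input was positive.
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        topo, seen = [], set()
        def visit(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    visit(c)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            for child, g in zip(v._children, v._local_grads):
                child.grad += g * v.grad

def linear(x, w):
    # Assumed form of the tutorial's linear(): one dot product per output row.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# The same function works on plain floats...
plain = linear([1.0, 2.0], [[3.0, 4.0], [-5.0, -6.0]])

# ...and on Value objects, where it also records the graph.
x = [Value(1.0), Value(2.0)]
w = [[Value(3.0), Value(4.0)], [Value(-5.0), Value(-6.0)]]
hidden = [h.relu() for h in linear(x, w)]
out = hidden[0] + hidden[1]
out.backward()

hidden_data = [h.data for h in hidden]   # second unit is clipped to 0 by relu
x_grads = [xi.grad for xi in x]          # only the first row contributes
```

The second hidden unit is negative before relu(), so its row of weights receives zero gradient and x's gradient comes entirely from the first row.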

That’s it. Three small changes, and the forward pass now automatically builds a computation graph as it runs.
