Parameters as Values
Previously Defined
Value class with automatic gradient computation
backward() walks the graph applying the chain rule
With the Value class in place, the model code barely changes. Here’s every difference:
Initialization
Parameters are now Value objects instead of plain floats:
n_embd = 16
matrix = lambda nout, nin: [[Value(random.gauss(0, 0.08)) for _ in range(nin)] for _ in range(nout)]
state_dict = {
    'wte': matrix(vocab_size, n_embd),
    'mlp_fc1': matrix(4 * n_embd, n_embd),
    'mlp_fc2': matrix(vocab_size, 4 * n_embd),
}
params = [p for mat in state_dict.values() for row in mat for p in row]
The only line that changed is matrix: each random number is now wrapped in Value(...). As a consequence, params is a flat list of Value objects, so we'll read and update .data and .grad on each parameter directly instead of tracking (row, j) index tuples.
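To make the role of the flat params list concrete, here is a minimal sketch. It assumes a stripped-down stand-in for Value that only stores .data and .grad. The matrix sizes, the made-up gradient of 1.0, and learning_rate are hypothetical values chosen for illustration, not taken from the tutorial.

```python
import random

class Value:
    """Stand-in for the tutorial's Value class (assumed interface: .data and .grad)."""
    def __init__(self, data):
        self.data = data   # the scalar itself
        self.grad = 0.0    # d(loss)/d(this value), normally filled in by backward()

random.seed(0)
matrix = lambda nout, nin: [[Value(random.gauss(0, 0.08)) for _ in range(nin)] for _ in range(nout)]
state_dict = {'wte': matrix(3, 2), 'mlp_fc1': matrix(8, 2)}  # tiny, illustrative shapes
params = [p for mat in state_dict.values() for row in mat for p in row]

# The flat list holds the very same objects as the matrices, not copies.
assert params[0] is state_dict['wte'][0][0]

# Pretend backward() has run and left a gradient of 1.0 everywhere (made-up value).
for p in params:
    p.grad = 1.0

learning_rate = 0.1  # hypothetical
before = state_dict['wte'][0][0].data
for p in params:
    p.data -= learning_rate * p.grad  # update through the object, no (row, j) lookup
    p.grad = 0.0                      # reset before the next step
```

Because params and state_dict share the same Value objects, an update made through the flat list is immediately visible to the model code that indexes the matrices.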
Softmax
def softmax(logits):
    max_val = max(val.data for val in logits)
    exps = [(val - max_val).exp() for val in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return probs
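Two changes: max(logits) becomes max(val.data for val in logits), which peeks inside with .data to find the max without building a graph node, and math.exp(...) becomes .exp(), the Value's own method, so the operation gets recorded. Treating max_val as a plain constant is safe because softmax gives the same result when the same number is subtracted from every logit.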
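The sketch below checks that claim numerically. The Value class here is a hypothetical minimal version written for this example (the tutorial's own class may be organized differently); it supports only the operations softmax needs. The check compares the gradient from backward() against the known closed form for softmax, where the derivative of p0 is p0 * (1 - p0) with respect to its own logit and -p0 * p1 with respect to another.

```python
import math

class Value:
    """Hypothetical minimal Value; only the ops softmax needs."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children        # inputs to the op that produced this node
        self._local_grads = local_grads  # d(self)/d(child), one per child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__  # lets sum() start from the int 0

    def __sub__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __truediv__(self, other):
        return Value(self.data / other.data, (self, other),
                     (1.0 / other.data, -self.data / other.data ** 2))

    def exp(self):
        e = math.exp(self.data)
        return Value(e, (self,), (e,))

    def backward(self):
        # Topological order, then apply the chain rule from the output back.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for c, g in zip(v._children, v._local_grads):
                c.grad += g * v.grad

def softmax(logits):
    max_val = max(val.data for val in logits)         # plain float, not a graph node
    exps = [(val - max_val).exp() for val in logits]  # recorded in the graph
    total = sum(exps)
    return [e / total for e in exps]

logits = [Value(1.0), Value(2.0), Value(3.0)]
probs = softmax(logits)
probs[0].backward()

p0, p1 = probs[0].data, probs[1].data
assert abs(sum(p.data for p in probs) - 1.0) < 1e-12
assert abs(logits[0].grad - p0 * (1 - p0)) < 1e-12  # matches the closed form
assert abs(logits[1].grad - (-p0 * p1)) < 1e-12
```

Even though no gradient flows through max_val, the gradients on the logits agree with the closed form, which is why reading .data for the max costs nothing in correctness.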
ReLU
def mlp(token_id):
    x = state_dict['wte'][token_id]
    x = linear(x, state_dict['mlp_fc1'])
    x = [xi.relu() for xi in x]
    logits = linear(x, state_dict['mlp_fc2'])
    return logits
max(0, xi) becomes xi.relu(). The linear() function doesn’t change at all — it only uses * and +, which Value already handles through __mul__ and __add__.
That’s it. Three small changes, and the forward pass now automatically builds a computation graph as it runs.
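As a small illustration of the graph being built during the forward pass, the sketch below runs one linear layer followed by relu on hand-picked numbers and then calls backward(). Both the Value class and the body of linear() are assumptions made for this example: the tutorial defines its own versions elsewhere, and this linear() is only one plausible form (one dot product per weight row). Here sum() starts from the int 0, so this version also relies on __radd__.

```python
class Value:
    """Hypothetical minimal Value with +, *, relu and backward()."""
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._local_grads = local_grads  # d(self)/d(child), one per child

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))
    __radd__ = __add__  # needed because sum() starts from the int 0

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        # Gradient passes through only where the input was positive.
        return Value(max(0.0, self.data), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for c, g in zip(v._children, v._local_grads):
                c.grad += g * v.grad

def linear(x, w):
    # Assumed form: plain Python arithmetic, nothing Value-specific.
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

x = [Value(1.0), Value(-2.0)]
w = [[Value(0.5), Value(1.0)],   # 0.5*1 + 1.0*(-2)  = -1.5, relu gives 0.0
     [Value(2.0), Value(0.25)]]  # 2.0*1 + 0.25*(-2) =  1.5, relu gives 1.5
hidden = [h.relu() for h in linear(x, w)]
out = hidden[0] + hidden[1]
out.backward()

print([h.data for h in hidden])      # [0.0, 1.5]
print([wi.grad for wi in w[0]])      # [0.0, 0.0]  (blocked by relu)
print([wi.grad for wi in w[1]])      # [1.0, -2.0] (equal to the inputs x)
```

The first row's weights receive zero gradient because relu clipped that unit to zero, while the second row's gradients equal the inputs, as expected for a dot product.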