MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 2.3 Recording Operations: Add and Multiply 2.5 Backward: Walking the Graph →
Step 2: Autograd › 2.4

More Operations and Syntactic Sugar

Previously Defined

  • Value wraps numbers and records operations
  • Add and multiply store local gradients

The same pattern extends to every operation our model needs. Each one has a single input and stores a single local gradient:

Power

    def __pow__(self, other): return Value(self.data**other, (self,), (other * self.data**(other-1),))

If c = a ** n, then ∂c/∂a = n * a^(n-1) — the standard power rule. Note that other here is a plain number, not a Value (we only need to differentiate with respect to the base).
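To see what actually gets recorded, here is a minimal sketch. It assumes the `Value` constructor from 2.3 takes `(data, children, local_grads)` and exposes them as attributes; the attribute names are assumptions for illustration:

```python
class Value:
    # Minimal sketch of the Value class from 2.3: wraps a number and
    # records the inputs and local gradients of the op that produced it.
    # Attribute names are assumed for illustration.
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.children = children
        self.local_grads = local_grads

    def __pow__(self, other):
        return Value(self.data**other, (self,), (other * self.data**(other-1),))

a = Value(3.0)
c = a ** 2
print(c.data)            # 9.0
print(c.local_grads[0])  # power rule: 2 * 3.0**1 = 6.0
```

The exponent `2` never becomes a node in the graph; only the base `a` is recorded as a child.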

Log

    def log(self): return Value(math.log(self.data), (self,), (1/self.data,))

If c = log(a), then ∂c/∂a = 1/a. This is the operation that turns probabilities into the loss (via -log(prob)).
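The recorded local gradient is just the reciprocal of the input. A quick sketch with the same assumed minimal `Value` class (attribute names are assumptions for illustration):

```python
import math

class Value:
    # Minimal sketch of the Value class from 2.3 (attribute names assumed).
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.children = children
        self.local_grads = local_grads

    def log(self):
        return Value(math.log(self.data), (self,), (1/self.data,))

p = Value(0.25)          # e.g. a predicted probability
c = p.log()
print(c.data)            # log(0.25), about -1.386
print(c.local_grads[0])  # 1 / 0.25 = 4.0
```

Note how a small probability produces a large local gradient: the closer the prediction is to zero, the harder the loss pushes back on it.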

Exp

    def exp(self): return Value(math.exp(self.data), (self,), (math.exp(self.data),))

If c = exp(a), then ∂c/∂a = exp(a) — the exponential is its own derivative. Used inside softmax.
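Because the exponential is its own derivative, the output value and the stored local gradient are the same number. A sketch with the same assumed minimal `Value` class:

```python
import math

class Value:
    # Minimal sketch of the Value class from 2.3 (attribute names assumed).
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.children = children
        self.local_grads = local_grads

    def exp(self):
        return Value(math.exp(self.data), (self,), (math.exp(self.data),))

a = Value(1.0)
c = a.exp()
print(c.data)                      # e, about 2.718
print(c.local_grads[0] == c.data)  # True: exp is its own derivative
```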

ReLU

    def relu(self): return Value(max(0, self.data), (self,), (float(self.data > 0),))

If c = relu(a), then ∂c/∂a is 1 when a > 0, and 0 otherwise — the same binary gradient we hand-coded in Step 1.
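The two cases are easy to see side by side, again using the assumed minimal `Value` class:

```python
class Value:
    # Minimal sketch of the Value class from 2.3 (attribute names assumed).
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.children = children
        self.local_grads = local_grads

    def relu(self):
        return Value(max(0, self.data), (self,), (float(self.data > 0),))

pos = Value(3.0).relu()
neg = Value(-2.0).relu()
print(pos.data, pos.local_grads[0])  # 3.0 1.0 -- gradient passes through
print(neg.data, neg.local_grads[0])  # 0 0.0  -- gradient is blocked
```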

The rest

    def __neg__(self): return self * -1
    def __radd__(self, other): return self + other
    def __sub__(self, other): return self + (-other)
    def __rsub__(self, other): return other + (-self)
    def __rmul__(self, other): return self * other
    def __truediv__(self, other): return self * other**-1
    def __rtruediv__(self, other): return other * self**-1

Subtraction, division, negation, and the reflected variants (which make expressions like 2 * a or 1 / a work when the plain number comes first) are all defined in terms of add, multiply, and power. They don’t need their own gradient logic — the chain rule handles them automatically through the operations they’re built from.
