MicroGPT Visualized

Building a GPT from scratch — an interactive visual guide

← 4.4 The Layer Loop 4.6 Training and Results →
Step 4: Transformer › 4.5

KV Cache: Per-Layer

One more structural change. In Step 3, keys and values were flat lists — one shared cache for the single attention layer:

keys, values = [], []

With multiple layers, each layer needs its own KV cache. Layer 0’s keys have nothing to do with layer 1’s keys — they’re computed by different weight matrices and represent different information.

Now keys is a list of lists: keys[li] is the key cache for layer li. Inside the model, every append and every read indexes into the right layer:

keys[li].append(k)
values[li].append(v)

The initialisation creates one empty list per layer:

keys, values = [[] for _ in range(n_layer)], [[] for _ in range(n_layer)]

This applies in both the training loop and the inference loop.

With n_layer = 1, keys is [[]] — a list containing one empty list. Functionally the same as Step 3’s [], but the structure is ready for stacking.
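The per-layer structure can be sketched end to end in a few lines of plain Python. This is a minimal sketch, not the tutorial's actual model code: n_layer, head_dim, and the toy per-position key/value vectors are assumptions for illustration. The point is only that each layer's cache grows independently as positions are processed.

```python
# Hypothetical sizes for illustration (not the tutorial's real config).
n_layer, head_dim = 2, 4

# One empty key cache and one empty value cache per layer.
keys = [[] for _ in range(n_layer)]
values = [[] for _ in range(n_layer)]

def cache_kv(li, k, v):
    """Append this position's key and value to layer li's own cache."""
    keys[li].append(k)
    values[li].append(v)

# Process 3 token positions; every position passes through every layer.
for pos in range(3):
    for li in range(n_layer):
        # Stand-in vectors; the real model computes these from weight matrices.
        k = [float(pos + li)] * head_dim
        v = [float(pos - li)] * head_dim
        cache_kv(li, k, v)

# Each layer ends up with one cached key/value per position seen so far,
# and layer 0's cache never mixes with layer 1's.
print([len(c) for c in keys])  # one count per layer
```

With n_layer = 1 this collapses to the Step 3 behaviour: a single cache that every append lands in.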
