How GPT Works

A visual walkthrough of every operation inside a language model, using 200 lines of pure Python

Nicholas Zinner · Beacon Bot · future-shock.ai

Andrej Karpathy wrote a GPT in 200 lines of pure Python. No PyTorch. No CUDA. No dependencies at all — just import math and raw arithmetic.

It trains a character-level language model on a dataset of human names. Feed it "emm" and it predicts "a". Feed it nothing and it invents names like "Elora" and "Tavin" that sound real but aren't.

The reason this matters: it's the same algorithm running inside ChatGPT, Claude, and Gemini. The exact same sequence of operations — embedding, attention, feedforward, softmax, backprop, Adam. The production models just do it with bigger numbers. If you can follow addition and multiplication, you can trace every operation a language model performs. Understanding why those operations produce intelligence is a deeper question — one that even researchers are still working on.

This page walks through every operation, step by step, using the actual code from the gist.

1. The Dataset

The model learns from 32,033 human names. Emma, Olivia, Ava, Sophia — scraped from US Social Security records. Each name is a sequence of characters, and the model's job is simple: given the characters so far, predict the next one.

Loading the dataset
docs = [line.strip() for line in open('input.txt') if line.strip()]
random.shuffle(docs)
# 32,033 names: ['emma', 'olivia', 'ava', 'sophia', ...]

That's it. No preprocessing, no cleaning, no feature engineering. Just a list of lowercase names. The model has to figure out everything else — which characters tend to follow which, how names start and end, what letter patterns sound "name-like" — from scratch.

2. Tokenization

Neural networks can't read letters. They need numbers. The tokenizer converts each character into an integer:

Building the vocabulary
uchars = sorted(set(''.join(docs)))  # ['a', 'b', ..., 'z']
BOS = len(uchars)                      # 26 = beginning-of-sequence
vocab_size = len(uchars) + 1           # 27 total tokens

Twenty-six letters plus one special token (BOS, for "beginning of sequence"). That's 27 tokens total. Real GPTs use BPE tokenizers with 50,000–100,000 tokens, but the principle is identical: turn text into a sequence of integers.
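
As a concrete sketch (the helper names `uchars` and `char_to_id` here are illustrative; the gist builds the same mapping its own way), the whole tokenizer is a dictionary:

```python
# Build the character-to-integer mapping for lowercase names
uchars = [chr(ord('a') + i) for i in range(26)]    # ['a', ..., 'z']
char_to_id = {c: i for i, c in enumerate(uchars)}  # 'a' -> 0, 'b' -> 1, ...
BOS = len(uchars)                                  # 26 = beginning-of-sequence

tokens = [char_to_id[c] for c in "emma"]
print(tokens)  # [4, 12, 12, 0]
```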

Try It — Type a Name

Typing "emma" maps each character to its ID: e → 4, m → 12, m → 12, a → 0 (and ⟨BOS⟩ → 26). So:

tokens = [4, 12, 12, 0]

3. Embeddings

A token ID like "4" doesn't mean anything to a neural network. The model needs a richer representation — a vector of numbers that captures what this token is. That's what an embedding is: a learned lookup table.

Embedding lookup
# Token embedding: which character is this?
tok_emb = state_dict['wte'][token_id]   # shape: [16]

# Position embedding: where in the sequence is it?
pos_emb = state_dict['wpe'][pos_id]     # shape: [16]

# Combine: "this character at this position"
x = [t + p for t, p in zip(tok_emb, pos_emb)]

Each token ID indexes a row in the embedding matrix — a table of 27 rows (one per token) and 16 columns (the embedding dimension). The position gets its own embedding too. Add them together and you get a single vector that represents "this character at this position."

These vectors start as random noise. During training, the model gradually adjusts them until tokens that behave similarly end up with similar vectors. And because the position embedding is added in, the letter "a" at position 3 gets a different combined representation than "a" at position 0: context matters.

Token Embedding Matrix — 27 tokens × 16 dimensions

At training step 0 (random init), every row of the matrix is indistinguishable noise. As training unfolds, the vowels (a, e, i, o, u) develop similar patterns: the model discovers they're interchangeable in many positions.

Before We Get to Attention: The Dot Product

A dot product measures similarity between two lists of numbers. Multiply each pair of elements together, then add the results up. That's it.

[1, 0, 1] · [1, 1, 0] = 1×1 + 0×1 + 1×0 = 1

Similar vectors → big number. Different vectors → small number. That's all attention needs.
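
In code, the whole operation is one line:

```python
def dot(a, b):
    # multiply the pairs elementwise, then sum the results
    return sum(x * y for x, y in zip(a, b))

print(dot([1, 0, 1], [1, 1, 0]))  # 1
```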

4. Attention — "Who Should I Listen To?"

This is the core innovation of transformers. Before attention, models processed sequences one step at a time — the information at position 0 had to pass through every intermediate position to reach position 10. Attention lets every position talk directly to every other position, in parallel.

The Three Projections

Each position creates three vectors from its embedding:

  • Query — "What am I looking for?"
  • Key — "What do I have to offer?"
  • Value — "Here's my actual information."
Computing Q, K, V
q = linear(x, state_dict['layer0.attn_wq'])  # Query
k = linear(x, state_dict['layer0.attn_wk'])  # Key
v = linear(x, state_dict['layer0.attn_wv'])  # Value
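
The `linear` helper used here is nothing more than a matrix-vector product: one dot product per row of the weight matrix. A sketch with plain floats (the gist builds the same thing from `Value` objects so gradients can flow through it):

```python
def linear(x, W):
    # one dot product per row of the weight matrix W
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in W]

# a 2-row weight matrix maps a 3-vector down to a 2-vector
print(linear([1.0, 2.0, 3.0], [[1, 0, 0], [0, 1, 1]]))  # [1.0, 5.0]
```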

The attention score between two positions is just the dot product of one position's Query with another's Key. High dot product → "these two positions are relevant to each other." Run softmax over the scores to get weights that sum to 1. Then use those weights to create a weighted average of the Value vectors.

One critical constraint: a position can only attend to itself and earlier positions. The model can't peek at the future. This is what makes it autoregressive: it generates one token at a time, always predicting left to right.

Attention Weights — "emma" (one head)

        e     m     m     a
  e   1.00    ·     ·     ·
  m   0.35  0.65    ·     ·
  m   0.15  0.30  0.55    ·
  a   0.10  0.20  0.25  0.45

Rows = "who is asking" · Columns = "who are they looking at"
Cells above the diagonal are the future: a position can never attend to tokens that come after it.
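
Putting the pieces together, a single causal head fits in a few lines. This is a sketch with plain floats rather than the gist's `Value` objects; `attend` is a name introduced here for illustration:

```python
import math

def softmax(xs):
    mx = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(qs, ks, vs):
    # qs, ks, vs: one query / key / value vector per position
    outs = []
    for t, q in enumerate(qs):
        # causal mask: position t only scores against keys 0..t
        scores = [sum(qi * ki for qi, ki in zip(q, ks[j])) / math.sqrt(len(q))
                  for j in range(t + 1)]
        weights = softmax(scores)  # attention weights, sum to 1
        # weighted average of the value vectors
        out = [sum(w * vs[j][d] for j, w in enumerate(weights))
               for d in range(len(vs[0]))]
        outs.append(out)
    return outs

qs = ks = vs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outs = attend(qs, ks, vs)
# position 0 can only attend to itself, so outs[0] is exactly vs[0]
```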

Multi-Head Attention

The microgpt uses 4 attention heads, each operating on its own 4-dimensional slice of the 16-dimensional embedding (16 ÷ 4 = 4). Each head learns to pay attention to different things. One head might focus on "what vowel came before," another on "how far from the start am I." The outputs of all heads get concatenated and projected back to 16 dimensions.

The key insight
Attention is just dot products and softmax. There is no magic. The "intelligence" comes from the learned Q, K, V weight matrices — the model figures out during training what questions to ask and what answers to provide.

5. The MLP — "Thinking It Over"

After attention gathers information from other positions, the MLP (multi-layer perceptron) processes it. Think of attention as "listening" and the MLP as "thinking."

MLP forward pass
# Expand to 4× width
hidden = linear(x, state_dict['layer0.mlp_fc1'])  # 16 → 64
hidden = [h.relu() for h in hidden]                 # Zero out negatives

# Compress back
out = linear(hidden, state_dict['layer0.mlp_fc2'])  # 64 → 16
MLP — Expand, Transform, Compress
Input (16-dim) → Expand (64-dim) + ReLU → Compress (16-dim) → Output

The MLP expands the 16-dimensional vector to 64 dimensions (4×), applies ReLU (zeroing out negative values), then compresses back to 16. The expansion gives the network room to compute complex features. The compression forces it to keep only what matters.

ReLU is the simplest nonlinearity: if a value is negative, set it to zero. If it's positive, keep it. This gives the network the ability to compute things that aren't just linear combinations — which turns out to be the difference between a calculator and something that can learn arbitrary patterns.
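
That description is shorter as code than as prose:

```python
def relu(v):
    # keep positives, zero out negatives
    return [max(0.0, x) for x in v]

print(relu([-2.0, 0.5, 3.0]))  # [0.0, 0.5, 3.0]
```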

6. The Full Forward Pass

Here's the complete sequence of operations from input token to output prediction:

The Full Forward Pass

Token ID → Embed + Pos → RMSNorm → Attention (Q·K → V) → Residual (+) → MLP (FFN) → Residual (+) → Linear (→ logits) → Softmax (→ probs)

The Attention + MLP block (with its residual additions) repeats × n_layers (1 in microgpt).
The complete forward pass
def gpt(token_id, pos_id, keys, values):
    # 1. Embed
    tok_emb = state_dict['wte'][token_id]
    pos_emb = state_dict['wpe'][pos_id]
    x = [t + p for t, p in zip(tok_emb, pos_emb)]

    # 2. Normalize
    x = rmsnorm(x)

    # 3. Attention + Residual + MLP + Residual (× n_layers)
    for li in range(n_layer):
        ...  # attention and MLP with skip connections (elided)

    # 4. Project to vocabulary
    logits = linear(x, state_dict['lm_head'])
    return logits

Residual Connections

Notice the "Residual +" steps. After each attention and MLP block, the output gets added back to the input. This is a skip connection: instead of replacing the representation, each block refines it. Without residual connections, deep networks are nearly impossible to train — the gradients vanish before they reach the early layers. Skip connections give the gradients a highway back through the network.

RMSNorm

Before the attention block, the model normalizes the vector. RMSNorm divides each element by the root mean square of the vector, keeping the magnitudes from exploding. Simple but essential.

RMSNorm
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]
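
A quick check of what this buys you: whatever the input's scale, the output's mean square is pulled back to roughly 1.

```python
def rmsnorm(x):
    ms = sum(xi * xi for xi in x) / len(x)
    scale = (ms + 1e-5) ** -0.5
    return [xi * scale for xi in x]

y = rmsnorm([30.0, 40.0])                 # large-magnitude input
ms_out = sum(v * v for v in y) / len(y)   # mean square after normalizing
print(round(ms_out, 4))  # 1.0
```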

7. Loss — "How Wrong Was I?"

The forward pass ends with logits — raw scores for every token in the vocabulary. Softmax converts these to probabilities (positive numbers that sum to 1). Now we can measure how wrong the model was.

Computing the loss
probs = softmax(logits)          # 27 probabilities, one per token
loss_t = -probs[target_id].log() # How surprised were we by the answer?

The loss is the negative log of the probability assigned to the correct answer. If the model says "a" has a 72% chance of being next after "emm", the loss is -log(0.72) ≈ 0.33 — low, because the model was right. If it says 1%, the loss is -log(0.01) ≈ 4.6 — high, because the model was surprised by the correct answer.
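
With toy numbers (plain floats here, not the gist's `Value` objects), the whole loss computation is:

```python
import math

def softmax(logits):
    mx = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - mx) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# toy logits over a 5-token vocabulary; suppose token 0 is the right answer
logits = [3.0, 1.0, 0.5, 0.2, -1.0]
probs = softmax(logits)
loss = -math.log(probs[0])
# probs sum to 1; a confident correct prediction gives a small loss
```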

Predicted Probabilities — Next character after "emm"
a: 72% (✓ correct) · e: 8% · i: 6% · y: 4% · o: 3% · u: 2% · all others: 5%

An untrained model assigns roughly equal probability to all 27 tokens, giving a loss of -log(1/27) ≈ 3.3. A well-trained model pushes the loss below 1.5. The gap between 3.3 and 1.5 is the model learning.

8. Backpropagation — Learning from Mistakes

The model made a prediction. The loss measured how wrong it was. Now it needs to learn. Backpropagation answers the question: for every parameter in the model, which direction should I nudge it to reduce the loss?

Every parameter in the model gets a number called a gradient. A gradient answers a simple question: if I nudge this number up a tiny bit, does the loss go up or down? And by how much?

A positive gradient means increasing this parameter increases the loss (bad — go the other way). A negative gradient means increasing it decreases the loss (good — keep going). The magnitude tells you how sensitive the loss is to this parameter. Big gradient = this parameter matters a lot right now.
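
You can estimate a gradient without any calculus by literally doing the nudge. This toy check (not in the gist; `loss_fn` is a made-up one-weight model) computes numerically what autograd computes analytically:

```python
def loss_fn(w):
    # toy model: predict w * x, measure squared error against target y
    x, y = 2.0, 10.0
    return (w * x - y) ** 2

w, eps = 3.0, 1e-6
# nudge w both ways and see how the loss moves (central difference)
grad = (loss_fn(w + eps) - loss_fn(w - eps)) / (2 * eps)
# analytic gradient: 2 * (w*x - y) * x = 2 * (6 - 10) * 2 = -16
print(round(grad, 3))  # -16.0
```

The gradient is negative, so increasing w decreases the loss: the optimizer should push w up.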

Computation Graph — Forward Values & Backward Gradients

The figure traces a tiny expression graph: inputs a and b feed c = a × b, then d = c + 2, then loss = -log(d). Black numbers are forward-pass values; red numbers are gradients flowing backward. Each gradient answers: "if I nudge this value, how much does the loss change?" The chain rule multiplies local gradients together as they flow backward through the graph.

The trick is the Value class. Every arithmetic operation in the forward pass creates a new Value that remembers its parents and the local gradient of the operation:

The autograd engine
class Value:
    def __init__(self, data, children=(), local_grads=()):
        self.data = data
        self.grad = 0
        self._children = children
        self._local_grads = local_grads

    def __mul__(self, other):
        # a * b: local grad w.r.t. a is b, w.r.t. b is a
        return Value(self.data * other.data,
                     (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topological sort: children land before the nodes built from them
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1  # d(loss)/d(loss) = 1
        # Walk backward, pushing each node's gradient down to its children
        for v in reversed(topo):
            for child, local_grad in zip(v._children, v._local_grads):
                child.grad += local_grad * v.grad

Call loss.backward() and gradients cascade backward through the entire computation graph. Each node multiplies the upstream gradient by its local gradient and passes it along. This is the chain rule from calculus, applied recursively over thousands of operations.

Why this is beautiful
The entire autograd engine is about 30 lines of Python. It handles addition, multiplication, exponentiation, log, exp, and ReLU. That's enough to differentiate through an entire neural network. PyTorch does the same thing but optimized for GPUs and with hundreds of supported operations. The math is identical.

9. The Optimizer — Adam

Gradients tell us which direction to push each parameter. The optimizer decides how far to push. Adam is the standard:

Adam optimizer update
# m, v are per-parameter buffers, initialized to zero
for i, p in enumerate(parameters):
    m[i] = beta1 * m[i] + (1 - beta1) * p.grad       # momentum (1st moment)
    v[i] = beta2 * v[i] + (1 - beta2) * p.grad ** 2  # 2nd moment (squared grads)
    m_hat = m[i] / (1 - beta1 ** t)                  # bias correction
    v_hat = v[i] / (1 - beta2 ** t)
    p.data -= lr * m_hat / (v_hat ** 0.5 + 1e-8)     # update

Two ideas make Adam better than just subtracting the gradient:

  • Momentum — a rolling average of past gradients. If the gradient has been pointing the same direction for many steps, move faster. If it keeps flipping, slow down.
  • Adaptive learning rates — parameters with large gradients get smaller steps, parameters with small gradients get larger steps. This handles the fact that different parts of the network need different amounts of adjustment.

You might have noticed the "bias correction" lines (dividing by 1-β^t). The momentum buffers (m and v) start at zero, so in the first few steps they haven't accumulated enough history to be accurate — they're biased toward zero. The bias correction compensates for this cold start, making the estimates accurate from step one.

The learning rate decays over training: start with big steps (explore broadly), end with small steps (fine-tune). In microgpt, it goes from 1e-2 down to 1e-4 over the course of training.
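
To watch the pieces work together, here is the same update rule minimizing a toy function, f(x) = x², instead of a language-model loss (a sketch, not the gist's training code):

```python
# Minimal Adam minimizing f(x) = x^2
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 201):
    grad = 2 * x                               # d/dx of x^2
    m = beta1 * m + (1 - beta1) * grad         # momentum (1st moment)
    v = beta2 * v + (1 - beta2) * grad ** 2    # 2nd moment (squared grads)
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    x -= lr * m_hat / (v_hat ** 0.5 + eps)
# x has walked from 5.0 down toward the minimum at 0
```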

10. The Training Loop

All of the above happens inside a loop. Pick a name, run the forward pass, compute the loss, backpropagate gradients, update parameters. Repeat 1,000 times.

The training loop (simplified)
for step in range(1000):
    # Pick the next name, cycling through the (shuffled) dataset
    name = docs[step % len(docs)]

    # Forward pass: predict each next character
    loss = Value(0)
    for pos, (inp, target) in enumerate(zip(name, name[1:])):
        logits = gpt(char_to_id(inp), pos, keys, values)
        probs = softmax(logits)
        loss_t = -probs[char_to_id(target)].log()
        loss = loss + loss_t

    # Backward pass
    loss.backward()

    # Update parameters with Adam
    for p in parameters:
        adam_update(p, step)

    # Zero gradients for next step
    for p in parameters:
        p.grad = 0

Training Loss — Watching a Neural Network Learn

The loss drops fast at first (the model learns that "a" and "e" are common) then gradually levels off as it learns subtler patterns. This curve is the same shape whether you're training on 32,000 names or trillions of tokens.

That curve is the model learning. Step 0: the loss is ~3.3 (random guessing). Step 100: the model has learned that "a" and "e" are common — loss drops below 2.5. Step 500: it knows that "emm" is usually followed by "a" — loss approaches 1.5. Each step through the loop makes the model slightly less wrong.

11. Inference — The Model Speaks

After training, the model can generate new names. Start with the BOS token, feed it through the network, get probabilities for the next character, sample one, and repeat until it produces another BOS (meaning "end of name").

Generating a new name
token = BOS  # Start with beginning-of-sequence
name = []
for pos in range(block_size):
    logits = gpt(token, pos, keys, values)

    # Temperature: divide logits before softmax
    logits = [l * (1.0 / temperature) for l in logits]
    probs = softmax(logits)

    # Sample from the distribution
    token = sample(probs)
    if token == BOS: break  # End of name
    name.append(id_to_char(token))

print(''.join(name))

Temperature is the creativity dial. Low temperature (0.3) makes the probability distribution sharper — the model picks the most likely character almost every time, producing safe, common names. High temperature (1.5+) flattens the distribution, giving unlikely characters a better shot, producing weird but occasionally inspired combinations.
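
The effect is easy to see with toy logits (a sketch, not the gist's code): divide by T before softmax and watch the top probability change.

```python
import math

def softmax(xs):
    mx = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - mx) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.1]
top = {}
for T in (0.3, 1.0, 1.5):
    probs = softmax([l / T for l in logits])
    top[T] = probs[0]
# low T concentrates mass on the top logit; high T spreads it out
```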

Temperature — Creativity Control
At T = 1.0 (balanced: creative but plausible), sampled names include: elora, karin, miren, tavin, jessa, naela, coryn, belia, amari, theon

12. The Punchline

Everything you just read is the same algorithm running inside GPT-4, Claude, and Gemini. The exact same loop: embed, attend, transform, predict, measure loss, backpropagate, update.

microgpt vs. GPT-4

  Dimension        | microgpt | GPT-4 (est.)  | Scale
  Parameters       | ~7,000   | ~1.8 trillion | ~250,000,000×
  Layers           | 1        | ~96           | 96×
  Embedding dim    | 16       | ~12,288       | 768×
  Attention heads  | 4        | ~96           | 24×
  Vocabulary       | 27 chars | ~100K tokens  | ~3,700×
  Training tokens  | ~200K    | ~13 trillion  | ~65,000,000×
  Hardware         | 1 CPU    | ~25,000 GPUs  |

The differences are quantitative, not qualitative. More layers means more rounds of attention and processing. More dimensions means richer representations. More vocabulary means handling words and subwords instead of single characters. More data means learning from books, code, and websites instead of a list of baby names.

But the operations are the same. The softmax is the same. The dot-product attention is the same. The chain rule flowing backward through the computation graph is the same. The Adam optimizer nudging parameters is the same.

A character-level GPT on names and a trillion-parameter GPT on the internet are the same algorithm at different scales. Everything else is just efficiency.

That's the point of Karpathy's gist. Not to build something useful — you wouldn't deploy a name generator that runs on CPU at the speed of a slug. The point is to strip away every layer of optimization, every CUDA kernel, every distributed training framework, and show what's underneath. Addition. Multiplication. The chain rule. A loop.

If you followed this far, you understand how GPT works. Not a metaphor. Not an analogy. The actual algorithm. The next time someone says "nobody really understands how these models work," you can tell them: actually, the algorithm is about 200 lines of code. What's hard to understand is what those 1.8 trillion parameters encode after training on the internet. But the mechanism that got them there? You just read it.

13. What This Doesn't Cover

  • KV-Cache — Production models cache key/value tensors so they don't recompute attention for every new token. This is why inference is fast.
  • Rotary Position Embeddings (RoPE) — Instead of learned position embeddings, modern models encode position directly into the attention computation using rotation matrices.
  • Grouped Query Attention (GQA) — Instead of separate K/V heads for each Q head, modern models share K/V heads across groups — drastically reducing memory.
  • SwiGLU / GELU — Production models use smoother activation functions instead of ReLU, which helps with gradient flow in very deep networks.
  • Mixture of Experts (MoE) — Some models (like Mixtral) route each token through only a subset of the MLP layers, getting more capacity without proportional compute cost.
  • Byte-Pair Encoding (BPE) — Real tokenizers don't use characters — they learn subword units like "ing", "tion", "un" that compress text more efficiently.
  • RLHF / Constitutional AI — After pretraining, models are fine-tuned with human feedback to be helpful, harmless, and honest. That's a whole separate training stage not covered here.
