How GPT Works
A visual walkthrough of every operation inside a language model, using 200 lines of pure Python
Andrej Karpathy wrote a GPT in 200 lines of pure Python. No PyTorch. No CUDA. No dependencies at all — just import math and raw arithmetic.
It trains a character-level language model on a dataset of human names. Feed it "emm" and it predicts "a". Feed it nothing and it invents names like "Elora" and "Tavin" that sound real but aren't.
The reason this matters: it's the same algorithm running inside ChatGPT, Claude, and Gemini. The exact same sequence of operations — embedding, attention, feedforward, softmax, backprop, Adam. The production models just do it with bigger numbers. If you can follow addition and multiplication, you can trace every operation a language model performs. Understanding why those operations produce intelligence is a deeper question — one that even researchers are still working on.
This page walks through every operation, step by step, using the actual code from the gist.
1. The Dataset
The model learns from 32,033 human names. Emma, Olivia, Ava, Sophia — scraped from US Social Security records. Each name is a sequence of characters, and the model's job is simple: given the characters so far, predict the next one.
That's it. No preprocessing, no cleaning, no feature engineering. Just a list of lowercase names. The model has to figure out everything else — which characters tend to follow which, how names start and end, what letter patterns sound "name-like" — from scratch.
2. Tokenization
Neural networks can't read letters. They need numbers. The tokenizer converts each character into an integer:
Twenty-six letters plus one special token (BOS, for "beginning of sequence"). That's 27 tokens total. Real GPTs use BPE tokenizers with 50,000–100,000 tokens, but the principle is identical: turn text into a sequence of integers.
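The whole tokenizer fits in a few lines. Here's a minimal sketch in the spirit of the gist (the exact index assignments are an assumption, not copied from Karpathy's code):

```python
# Character-level tokenizer: 26 lowercase letters + BOS = 27 tokens.
chars = "abcdefghijklmnopqrstuvwxyz"
BOS = 0  # special "beginning of sequence" token
stoi = {ch: i + 1 for i, ch in enumerate(chars)}  # 'a' -> 1, ..., 'z' -> 26
itos = {i: ch for ch, i in stoi.items()}

def encode(name):
    """Turn a name into a list of token IDs, bracketed by BOS."""
    return [BOS] + [stoi[ch] for ch in name] + [BOS]

def decode(ids):
    """Turn token IDs back into text, skipping BOS."""
    return "".join(itos[i] for i in ids if i != BOS)
```

So `encode("emma")` becomes `[0, 5, 13, 13, 1, 0]`: BOS, then one integer per character, then BOS again to mark the end.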
3. Embeddings
A token ID like "4" doesn't mean anything to a neural network. The model needs a richer representation — a vector of numbers that captures what this token is. That's what an embedding is: a learned lookup table.
Each token ID indexes a row in the embedding matrix — a table of 27 rows (one per token) and 16 columns (the embedding dimension). The position gets its own embedding too. Add them together and you get a single vector that represents "this character at this position."
These vectors start as random noise. During training, the model gradually adjusts them until tokens that play similar roles end up with similar vectors. The letter "a" at position 3 gets a different representation than "a" at position 0 — because context matters.
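In code, an embedding really is just a table lookup plus an add. A minimal sketch (the embedding dimension of 16 and vocabulary of 27 come from the document; the context length of 8 and the initialization scale are assumptions):

```python
import random
random.seed(0)

n_vocab, n_ctx, n_embd = 27, 8, 16

# Two learned lookup tables, initialized to small random noise.
tok_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(n_vocab)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(n_embd)] for _ in range(n_ctx)]

def embed(token_id, position):
    """Vector for 'this token at this position': row lookup + elementwise add."""
    return [t + p for t, p in zip(tok_emb[token_id], pos_emb[position])]
```

Training never changes this code; it only changes the numbers in the two tables.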
Drag the slider to watch training unfold. Vowels (a, e, i, o, u) develop similar warm patterns — the model discovers they're interchangeable in many positions. At step 0, everything is indistinguishable noise.
A dot product measures similarity between two lists of numbers. Multiply each pair of elements together, then add the results up. That's it.
[1, 0, 1] · [1, 1, 0] = 1×1 + 0×1 + 1×0 = 1
Similar vectors → big number. Different vectors → small number. That's all attention needs.
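In Python, the dot product is one line:

```python
def dot(a, b):
    """Multiply elementwise, then sum."""
    return sum(x * y for x, y in zip(a, b))
```

Running it on the example above, `dot([1, 0, 1], [1, 1, 0])` gives 1.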
4. Attention — "Who Should I Listen To?"
This is the core innovation of transformers. Before attention, models processed sequences one step at a time — the information at position 0 had to pass through every intermediate position to reach position 10. Attention lets every position talk directly to every other position, in parallel.
The Three Projections
Each position creates three vectors from its embedding:
- Query — "What am I looking for?"
- Key — "What do I have to offer?"
- Value — "Here's my actual information."
The attention score between two positions is just the dot product of one position's Query with another's Key. High dot product → "these two positions are relevant to each other." Run softmax over the scores to get weights that sum to 1. Then use those weights to create a weighted average of the Value vectors.
One critical constraint: a position can only attend to itself and earlier positions. The model can't peek at the future. This is what makes it autoregressive — it generates one token at a time, always predicting left to right.
Rows = "who is asking" · Columns = "who are they looking at"
Grayed-out cells = the future. A position can never attend to tokens that come after it.
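The whole mechanism — scores, causal mask, softmax, weighted average — fits in a short function. This is a single-head sketch, not the gist's exact code; the scaling by the square root of the head dimension is the standard transformer convention:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def causal_attention(queries, keys, values):
    """One attention head. Position t may only look at positions 0..t."""
    head_dim = len(queries[0])
    out = []
    for t, q in enumerate(queries):
        # Scores against itself and every earlier position (the causal mask:
        # later positions simply never appear in this list).
        scores = [dot(q, keys[s]) / math.sqrt(head_dim) for s in range(t + 1)]
        weights = softmax(scores)    # non-negative, sum to 1
        # Weighted average of the visible value vectors.
        ctx = [sum(w * values[s][d] for s, w in enumerate(weights))
               for d in range(len(values[0]))]
        out.append(ctx)
    return out
```

Note how the mask costs nothing here: position t just never computes a score against positions after t.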
Multi-Head Attention
microgpt uses 4 attention heads, each working in its own 4-dimensional subspace (16 ÷ 4 = 4). Each head learns to pay attention to different things. One head might focus on "what vowel came before," another on "how far from the start am I." The outputs of all heads get concatenated and projected back to 16 dimensions.
5. The MLP — "Thinking It Over"
After attention gathers information from other positions, the MLP (multi-layer perceptron) processes it. Think of attention as "listening" and the MLP as "thinking."
The MLP expands the 16-dimensional vector to 64 dimensions (4×), applies ReLU (zeroing out negative values), then compresses back to 16. The expansion gives the network room to compute complex features. The compression forces it to keep only what matters.
ReLU is the simplest nonlinearity: if a value is negative, set it to zero. If it's positive, keep it. This gives the network the ability to compute things that aren't just linear combinations — which turns out to be the difference between a calculator and something that can learn arbitrary patterns.
6. The Full Forward Pass
Here's the complete sequence of operations from input token to output prediction:
Residual Connections
Notice the "Residual +" steps. After each attention and MLP block, the output gets added back to the input. This is a skip connection: instead of replacing the representation, each block refines it. Without residual connections, deep networks are nearly impossible to train — the gradients vanish before they reach the early layers. Skip connections give the gradients a highway back through the network.
RMSNorm
Before the attention block, the model normalizes the vector. RMSNorm divides each element by the root mean square of the vector, keeping the magnitudes from exploding. Simple but essential.
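The formula is exactly what the name says. A minimal sketch (real RMSNorm also multiplies by a learned gain vector, omitted here; the epsilon guards against division by zero):

```python
import math

def rmsnorm(x, eps=1e-5):
    """Divide each element by the root mean square of the vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

After normalization, the vector's root mean square is approximately 1, no matter how large its elements were going in.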
7. Loss — "How Wrong Was I?"
The forward pass ends with logits — raw scores for every token in the vocabulary. Softmax converts these to probabilities (positive numbers that sum to 1). Now we can measure how wrong the model was.
The loss is the negative log of the probability assigned to the correct answer. If the model says "a" has a 72% chance of being next after "emm", the loss is -log(0.72) ≈ 0.33 — low, because the model was right. If it says 1%, the loss is -log(0.01) ≈ 4.6 — high, because the model was surprised by the correct answer.
An untrained model assigns roughly equal probability to all 27 tokens, giving a loss of -log(1/27) ≈ 3.3. A well-trained model pushes the loss below 1.5. The gap between 3.3 and 1.5 is the model learning.
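Both steps — softmax, then negative log of the right answer — are a few lines:

```python
import math

def softmax(logits):
    """Raw scores -> probabilities that sum to 1."""
    m = max(logits)                  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, target):
    """Loss = negative log probability assigned to the correct token."""
    return -math.log(softmax(logits)[target])
```

With 27 equal logits, `cross_entropy` returns log(27) ≈ 3.3 — the untrained baseline from the paragraph above.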
8. Backpropagation — Learning from Mistakes
The model made a prediction. The loss measured how wrong it was. Now it needs to learn. Backpropagation answers the question: for every parameter in the model, which direction should I nudge it to reduce the loss?
Every parameter in the model gets a number called a gradient. A gradient answers a simple question: if I nudge this number up a tiny bit, does the loss go up or down? And by how much?
A positive gradient means increasing this parameter increases the loss (bad — go the other way). A negative gradient means increasing it decreases the loss (good — keep going). The magnitude tells you how sensitive the loss is to this parameter. Big gradient = this parameter matters a lot right now.
Black numbers = forward pass values. Red numbers = gradients flowing backward. Each gradient answers: "if I nudge this value, how much does the loss change?" The chain rule multiplies local gradients together as they flow backward through the graph.
The trick is the Value class. Every arithmetic operation in the forward pass creates a new Value that remembers its parents and the local gradient of the operation:
Call loss.backward() and gradients cascade backward through the entire computation graph. Each node multiplies the upstream gradient by its local gradient and passes it along. This is the chain rule from calculus, applied recursively over thousands of operations.
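Here's a compressed sketch of the idea. The gist's `Value` class supports many more operations; this variant, which stores each parent's local gradient directly, is a simplification for illustration:

```python
class Value:
    """A scalar that remembers how it was made, so gradients can flow back."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0  # d(loss)/d(loss) = 1
        for v in reversed(topo):
            for p, g in zip(v._parents, v._local_grads):
                p.grad += v.grad * g  # chain rule
```

For `c = a * b + a` with a = 2 and b = 3, calling `c.backward()` gives `a.grad = 4` (b + 1) and `b.grad = 2` (a) — exactly what the chain rule says.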
9. The Optimizer — Adam
Gradients tell us which direction to push each parameter. The optimizer decides how far to push. Adam is the standard:
Two ideas make Adam better than just subtracting the gradient:
- Momentum — a rolling average of past gradients. If the gradient has been pointing the same direction for many steps, move faster. If it keeps flipping, slow down.
- Adaptive learning rates — parameters with large gradients get smaller steps, parameters with small gradients get larger steps. This handles the fact that different parts of the network need different amounts of adjustment.
You might have noticed the "bias correction" lines (dividing by 1-β^t). The momentum buffers (m and v) start at zero, so in the first few steps they haven't accumulated enough history to be accurate — they're biased toward zero. The bias correction compensates for this cold start, making the estimates accurate from step one.
The learning rate decays over training: start with big steps (explore broadly), end with small steps (fine-tune). In microgpt, it goes from 1e-2 down to 1e-4 over the course of training.
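All three ideas — momentum, adaptive scaling, bias correction — show up in a single update rule. A sketch over flat parameter lists, using the standard default hyperparameters (the gist's exact defaults may differ):

```python
import math

def adam_step(params, grads, m, v, t,
              lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m, v are momentum buffers; t is the 1-indexed step."""
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g        # momentum (mean of grads)
        v[i] = beta2 * v[i] + (1 - beta2) * g * g    # adaptive scale (mean of grads^2)
        m_hat = m[i] / (1 - beta1 ** t)              # bias correction for the
        v_hat = v[i] / (1 - beta2 ** t)              # zero-initialized buffers
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

One consequence of the adaptive scaling: on the very first step the update size is roughly `lr` regardless of how large the gradient is, because the gradient appears in both the numerator and (under a square root) the denominator.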
10. The Training Loop
All of the above happens inside a loop. Pick a name, run the forward pass, compute the loss, backpropagate gradients, update parameters. Repeat 1,000 times.
The loss drops fast at first (the model learns that "a" and "e" are common) then gradually levels off as it learns subtler patterns. This curve is the same shape whether you're training on 32,000 names or trillions of tokens.
That curve is the model learning. Step 0: the loss is ~3.3 (random guessing). Step 100: the model has learned that "a" and "e" are common — loss drops below 2.5. Step 500: it knows that "emm" is usually followed by "a" — loss approaches 1.5. Each step through the loop makes the model slightly less wrong.
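The shape of the loop is easier to see on a toy problem. This fits a single weight w so that w·x ≈ y; it is not microgpt, but the four steps per iteration (forward, loss, gradient, update) are the same ones the gist runs over ~7,000 parameters:

```python
# Toy training loop: learn w such that w * x matches y = 2x.
w, lr = 0.0, 0.1
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

for step in range(100):
    x, y = data[step % len(data)]
    pred = w * x                  # forward pass
    loss = (pred - y) ** 2        # how wrong was I?
    grad = 2 * (pred - y) * x     # d(loss)/dw via the chain rule
    w -= lr * grad                # update
```

After 100 steps, w has converged to 2.0. Swap the one-parameter model for a transformer and the squared error for cross-entropy, and this is the whole training loop.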
11. Inference — The Model Speaks
After training, the model can generate new names. Start with the BOS token, feed it through the network, get probabilities for the next character, sample one, and repeat until it produces another BOS (meaning "end of name").
Temperature is the creativity dial. Low temperature (0.3) makes the probability distribution sharper — the model picks the most likely character almost every time, producing safe, common names. High temperature (1.5+) flattens the distribution, giving unlikely characters a better shot, producing weird but occasionally inspired combinations.
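Mechanically, temperature is a single division before the softmax. A sketch of temperature sampling (the inverse-CDF sampling at the end is one standard way to draw from a discrete distribution):

```python
import math, random

def sample(logits, temperature=1.0):
    """Sample a token ID. Low temperature sharpens, high temperature flattens."""
    scaled = [l / temperature for l in logits]   # the entire temperature trick
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling: walk the cumulative distribution until we pass r.
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1
```

Dividing by a small temperature stretches the gaps between logits, so the softmax puts nearly all the mass on the top token; dividing by a large one shrinks the gaps toward a uniform distribution.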
12. The Punchline
Everything you just read is the same algorithm running inside GPT-4, Claude, and Gemini. The exact same loop: embed, attend, transform, predict, measure loss, backpropagate, update.
| Dimension | microgpt | GPT-4 (est.) | Scale |
|---|---|---|---|
| Parameters | ~7,000 | ~1.8 trillion | ~250,000,000× |
| Layers | 1 | ~96 | 96× |
| Embedding dim | 16 | ~12,288 | 768× |
| Attention heads | 4 | ~96 | 24× |
| Vocabulary | 27 chars | ~100K tokens | 3,700× |
| Training tokens | ~200K | ~13 trillion | ~65,000,000× |
| Hardware | 1 CPU | ~25,000 GPUs | ∞ |
Each grid line is 1,000× the previous one.
The differences are quantitative, not qualitative. More layers means more rounds of attention and processing. More dimensions means richer representations. More vocabulary means handling words and subwords instead of single characters. More data means learning from books, code, and websites instead of a list of baby names.
But the operations are the same. The softmax is the same. The dot-product attention is the same. The chain rule flowing backward through the computation graph is the same. The Adam optimizer nudging parameters is the same.
A character-level GPT on names and a trillion-parameter GPT on the internet are the same algorithm at different scales. Everything else is just efficiency.
That's the point of Karpathy's gist. Not to build something useful — you wouldn't deploy a name generator that runs on CPU at the speed of a slug. The point is to strip away every layer of optimization, every CUDA kernel, every distributed training framework, and show what's underneath. Addition. Multiplication. The chain rule. A loop.
If you followed this far, you understand how GPT works. Not a metaphor. Not an analogy. The actual algorithm. The next time someone says "nobody really understands how these models work," you can tell them: actually, the algorithm is about 200 lines of code. What's hard to understand is what those 1.8 trillion parameters encode after training on the internet. But the mechanism that got them there? You just read it.
13. What This Doesn't Cover
- KV-Cache — Production models cache key/value tensors so they don't recompute attention for every new token. This is why inference is fast.
- Rotary Position Embeddings (RoPE) — Instead of learned position embeddings, modern models encode position directly into the attention computation using rotation matrices.
- Grouped Query Attention (GQA) — Instead of separate K/V heads for each Q head, modern models share K/V heads across groups — drastically reducing memory.
- SwiGLU / GELU — Production models use smoother activation functions instead of ReLU, which helps with gradient flow in very deep networks.
- Mixture of Experts (MoE) — Some models (like Mixtral) route each token through only a subset of the MLP layers, getting more capacity without proportional compute cost.
- Byte-Pair Encoding (BPE) — Real tokenizers don't use characters — they learn subword units like "ing", "tion", "un" that compress text more efficiently.
- RLHF / Constitutional AI — After pretraining, models are fine-tuned with human feedback to be helpful, harmless, and honest. That's a whole separate training stage not covered here.
Further Reading
- The microgpt gist — Andrej Karpathy's original 200-line implementation
- Karpathy's "Neural Networks: Zero to Hero" series — starts from micrograd and builds up to GPT-2
- "Attention Is All You Need" — the original transformer paper (Vaswani et al., 2017)
- The Overnight Researcher — our experiment letting an AI agent run autoresearch on a $280 GPU