AI Glossary
Plain-English definitions for the terms we use when tracking AI progress. No jargon walls. No hype inflation.
A
AGI (Artificial General Intelligence)
AI that can perform any intellectual task a human can, at human level or better. There is no universally agreed technical definition — different researchers use different benchmarks and thresholds. The term is most useful as a directional target rather than a precise threshold.
Related: ASI · LLM · AI Takeoff
ASI (Artificial Superintelligence)
AI that surpasses human-level intelligence across all domains by a significant margin. Theoretical concept used in long-term forecasting. Distinguished from AGI in that ASI implies capability well above the human maximum, not just parity.
Related: AGI
AI Takeoff
The transition period when AI improvement becomes self-reinforcing and accelerating — where AI systems are contributing meaningfully to AI research itself, compounding progress. 'Slow takeoff' implies decades; 'fast takeoff' implies months or years. Future Shock tracks leading indicators of takeoff dynamics.
Related: AGI · Scaling Laws · AI Progress
B
Benchmark
A standardized test used to measure AI capability. Examples include MMLU (multiple choice knowledge across 57 subjects), HumanEval (code generation), MATH (mathematical reasoning), and SWE-bench (real-world software engineering tasks). Benchmarks are useful until models saturate them, at which point the field moves to harder ones.
Related: Benchmark Saturation · MMLU · HLE · SWE-bench
Benchmark Saturation
When AI systems score so high on a benchmark that it no longer meaningfully distinguishes capability levels. MMLU is effectively saturated (top models score 87–90%+). When a benchmark saturates, new, harder benchmarks emerge. Tracking saturation rates reveals how quickly AI is advancing.
Related: Benchmark · HLE · MMLU
C
Context Window
The amount of text (measured in tokens) an AI model can 'see' at once during a single interaction. GPT-4 launched with an 8K context window in 2023; current frontier models support 1M+ tokens. Larger context windows enable working with full books, codebases, and long conversations.
Related: Token · LLM
F
FLOP / FLOPs
Floating Point Operation(s). A measure of computational work. Model training is measured in total FLOPs — the total compute expended during training. More FLOPs generally correlate with more capable models, subject to scaling laws. Often used to compare training runs across generations.
Related: Scaling Laws · Compute
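A common back-of-envelope rule relates parameters and training data to total FLOPs: roughly 6 FLOPs per parameter per training token (forward plus backward pass). A minimal sketch, with illustrative numbers that don't describe any specific model:

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough training-compute estimate via the standard 6*N*D heuristic:
    ~6 FLOPs per parameter per training token (forward + backward pass)."""
    return 6 * params * tokens

# Hypothetical example: a 70B-parameter model trained on 15T tokens
flops = training_flops(70e9, 15e12)
print(f"{flops:.1e} FLOPs")  # 6.3e+24 FLOPs
```

The heuristic ignores architecture details, so treat results as order-of-magnitude comparisons between training runs, not precise figures.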
G
GPQA (Graduate-Level Google-Proof Q&A)
A benchmark of 448 challenging questions at graduate level in biology, chemistry, and physics, specifically designed to resist Google search. Serves as a test of genuine reasoning, not memorization. Most frontier models now score above 70%; human expert baseline is around 65%.
Related: Benchmark · HLE · MMLU
H
HLE (Humanity's Last Exam)
The current frontier benchmark — 3,000 questions across 100+ disciplines, designed to be unsolvable by current AI. Released in early 2025, it replaced earlier benchmarks that had been saturated. Used to track the absolute capability frontier. Top models scored in the 10–20% range as of early 2026.
Related: Benchmark · Benchmark Saturation · GPQA
HumanEval
A benchmark of 164 Python programming problems with test cases. One of the most widely cited coding benchmarks. GPT-4 scored 67% at launch (2023); current frontier models exceed 95%. Largely saturated by 2024.
Related: Benchmark · SWE-bench · Coding
L
LLM (Large Language Model)
A neural network trained on large amounts of text data to predict and generate language. The architecture behind ChatGPT, Claude, Gemini, and most current AI assistants. 'Large' refers to the number of parameters — modern frontier models have hundreds of billions to trillions of parameters.
Related: Transformer · Parameters · Pretraining
M
MATH Benchmark
A dataset of 12,500 competition math problems (AMC, AIME, MATHCOUNTS level) testing mathematical reasoning. GPT-4 scored 42% at launch; current frontier models exceed 90%. Increasingly saturated, even on the harder olympiad-level problems.
Related: Benchmark · AIME · Reasoning
MMLU (Massive Multitask Language Understanding)
A benchmark of 57 subjects ranging from elementary math to professional law and medicine — 14,000+ multiple-choice questions. Was the primary capability benchmark from 2021–2024. Effectively saturated: top models now score 87–90%+. Human expert average is ~89%.
Related: Benchmark · Benchmark Saturation
MoE (Mixture of Experts)
A neural network architecture where only a subset of the model's parameters ('experts') are activated for any given input, rather than running the full model for every token. Enables training larger models more efficiently. Used in GPT-4 (rumored), Mixtral, and other frontier models.
Related: LLM · Parameters · Inference
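The core idea — score all experts, run only the top few — can be shown in a toy NumPy sketch. This is an illustration of top-k routing in general, not the router of any production model; all sizes and names here are made up:

```python
import numpy as np

def moe_route(x, gate_w, k=2):
    """Toy top-k MoE routing: score each expert for this token, keep
    only the k highest-scoring experts, and softmax their weights.
    Only those k experts would actually run on this token."""
    scores = x @ gate_w                       # one score per expert
    top = np.argsort(scores)[-k:]             # indices of the k best experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                  # normalize over selected experts
    return top, weights

rng = np.random.default_rng(0)
x = rng.normal(size=16)             # one token's hidden state (toy size)
gate_w = rng.normal(size=(16, 8))   # router weights for 8 toy experts
experts, weights = moe_route(x, gate_w, k=2)
print(experts, weights)             # 2 of 8 experts active for this token
```

With 2 of 8 experts active, each token pays roughly a quarter of the full model's per-token compute while the model as a whole keeps all 8 experts' capacity.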
P
Parameters
The numerical weights inside a neural network that encode learned knowledge. Modern LLMs range from 7 billion (small, can run on a laptop) to 1 trillion+ parameters (requires large clusters). Parameter count is a rough proxy for capability and compute cost, though architecture and training data matter significantly.
Related: LLM · FLOP / FLOPs · Scaling Laws
PDM (Precondition Density Model)
An original analytical framework developed by Future Shock for evaluating how close AI is to specific capability thresholds. Rather than predicting dates, PDM maps the preconditions (technical, data, infrastructure, social) required and tracks how many are currently met. Read the full framework at future-shock.ai/research/precondition-density-model.
Related: AGI · AI Takeoff
Pretraining
The initial phase of LLM development — training on a massive corpus of text (internet, books, code) to develop general language and reasoning abilities. Expensive and compute-intensive; typically done once by labs. Produces a 'base model' that is then fine-tuned for specific uses.
Related: Fine-tuning · RLHF · LLM
R
RLHF (Reinforcement Learning from Human Feedback)
A technique for aligning AI models with human preferences. After pretraining, human raters score model outputs; a 'reward model' learns these preferences; the main model is then trained to maximize reward. Used by OpenAI (ChatGPT), Anthropic (Claude), and others to make models helpful, harmless, and honest.
Related: Fine-tuning · Pretraining · Alignment
Reasoning Models
LLMs trained or prompted to 'think step by step' before answering, dramatically improving performance on complex tasks. Examples include OpenAI o1/o3 and Anthropic's extended thinking mode. They trade response speed for substantially better accuracy on math, coding, and logic.
Related: LLM · Test-Time Compute
S
Scaling Laws
Empirical relationships showing that AI capability improves predictably with more compute, more data, and more parameters — following smooth power laws. First described formally by Kaplan et al. (2020) at OpenAI. A key reason for the AI investment boom: capability gains are somewhat predictable if you scale up.
Related: FLOP / FLOPs · Parameters · Compute
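The 'smooth power law' shape is easy to see numerically. A minimal sketch with made-up constants (real fitted values differ by setup; this shows the form, not any published fit):

```python
def loss_vs_compute(c, a=1e3, b=0.05):
    """Illustrative scaling law: loss falls as a power of compute,
    loss = a * c^(-b). Constants a and b here are invented for the demo."""
    return a * c ** (-b)

# Each 10x increase in compute cuts loss by the same *fraction* (10^-b)
for c in (1e20, 1e21, 1e22):
    print(f"{c:.0e} FLOPs -> loss {loss_vs_compute(c):.1f}")
```

The key property is that each tenfold compute increase yields the same fractional improvement — which is why labs can roughly budget capability gains before a training run.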
SWE-bench
A benchmark of 2,294 real GitHub issues from popular Python repositories. Tests whether AI can actually fix bugs in production software — a harder, more realistic bar than toy coding problems. Current frontier models solve 40–50%+ of tasks. Unlike HumanEval, not yet saturated.
Related: Benchmark · HumanEval · Coding
T
Test-Time Compute
Using additional computation at inference time (when the model is answering, not training) to improve output quality. Reasoning models like o1 are the primary example — they 'think' longer before responding, using more compute per query. Represents a shift from scaling training compute to scaling inference compute.
Related: Reasoning Models · FLOP / FLOPs
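One simple form of test-time compute is sampling many answers and taking a majority vote (sometimes called self-consistency). A toy sketch — the 'model' here is a stand-in function that guesses a hypothetical answer correctly only 60% of the time:

```python
from collections import Counter
import random

def solve_once(rng):
    """Stand-in for one model sample: returns the right answer (42)
    60% of the time, otherwise a nearby wrong answer."""
    return 42 if rng.random() < 0.6 else rng.choice([40, 41, 43, 44])

def majority_vote(n, seed=0):
    """Spend more inference compute (n samples) for a more reliable answer."""
    rng = random.Random(seed)
    votes = Counter(solve_once(rng) for _ in range(n))
    return votes.most_common(1)[0][0]

print(majority_vote(1))     # a single sample is often wrong
print(majority_vote(101))   # with 101 samples the majority is 42
```

Reasoning models go further — spending compute on a long chain of thought rather than many independent samples — but the tradeoff is the same: more compute per query, better answers.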
Token
The basic unit of text that LLMs process — roughly 3/4 of a word in English. 'Tokenization' splits text into tokens before it enters the model. LLM pricing is often per token. A typical paragraph is ~100 tokens; a book is ~100,000 tokens; frontier models now support context windows of 1M+ tokens.
Related: Context Window · LLM
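The 3/4-of-a-word rule makes token counts easy to estimate from word counts. A sketch of the arithmetic (a rule of thumb only — real tokenizers vary by model, text, and language):

```python
def estimate_tokens(word_count: int, tokens_per_word: float = 4 / 3) -> int:
    """Rule-of-thumb token estimate for English: ~3/4 word per token,
    i.e. ~4/3 tokens per word. Actual tokenizer counts will differ."""
    return round(word_count * tokens_per_word)

print(estimate_tokens(75))       # a ~75-word paragraph -> ~100 tokens
print(estimate_tokens(75_000))   # a ~75,000-word book  -> ~100,000 tokens
```

This is handy for sanity-checking API costs: multiply estimated tokens by the per-token price.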
Transformer
The neural network architecture underlying virtually all modern LLMs. Introduced in 'Attention Is All You Need' (2017). Uses 'attention mechanisms' to process relationships between tokens across the full input simultaneously, rather than sequentially. The T in GPT stands for Transformer.
Related: LLM · Attention
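The attention mechanism itself fits in a few lines. A minimal NumPy sketch of scaled dot-product attention as defined in the 2017 paper, using toy sizes (no multi-head splitting, masking, or learned projections):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each token's output is a weighted
    mix of all tokens' values, with weights from query-key similarity."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity of every token pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the sequence
    return weights @ V

rng = np.random.default_rng(1)
seq, d = 5, 8                       # 5 tokens, 8-dim toy embeddings
Q = K = V = rng.normal(size=(seq, d))
out = attention(Q, K, V)
print(out.shape)                    # (5, 8): one mixed vector per token
```

Because every token attends to every other token in one step, the whole input is processed in parallel — the property that makes Transformers so much faster to train than sequential architectures.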
Missing a term? Suggest an addition. We update this as new terminology enters common use.