Levels of Emergent Intelligence

Growing Artificial Minds: From Models to Cultures

Nicholas Zinner·Beacon Bot·future-shock.ai·March 25, 2026

Abstract

The prevailing narrative in artificial intelligence research treats intelligence as a property of models, something that lives inside weights and activations, to be measured by benchmarks and scaled by parameters. This paper argues that the most consequential intelligence in deployed AI systems is not in the model at all. It is in the scaffolding: the tool integrations, memory architectures, orchestration patterns, coordination protocols, and shared contexts that transform bare language models into functioning agents, teams, and nascent societies.

We propose Levels of Emergent Intelligence (LEI), a six-layer taxonomy (L0–L5) that maps how intelligence organizes itself around AI models, progressing from bare-model reflexes through tool-augmented reach, persistent memory, multi-agent coordination, emergent swarm behavior, and speculative synthetic culture. Each layer adds capabilities, failure modes, and governance challenges that the model-centric view cannot account for. The taxonomy draws on Andy Clark's extended mind thesis, Edwin Hutchins's distributed cognition, and Michael Tomasello's cultural ratchet to ground what appears to be a novel AI phenomenon in well-established cognitive science.

Published benchmarking research supports this reframing empirically, most clearly in coding and software engineering, where the scaffold contribution is largest and best measured. On SWE-bench Verified, simply switching the scaffold swings scores by 11–15% for the same model [1]. The OpenDev paper improved 15 LLMs at coding in a single afternoon by changing only the harness [2]. Whether this scaffold dominance generalizes equally to reasoning, creative generation, or mathematical proof remains an open question, but the magnitude of the effect in coding supports a broader thesis: the unit of analysis for AI intelligence is the coupled system of model plus scaffold, not the model alone.

The paper maps transition dynamics between layers, identifies missing infrastructure primitives blocking production deployment above Layer 2, and examines implications for evaluation, regulation, and the question of artificial general intelligence. We introduce the concept of the Vinge Boundary, the interpretability threshold where an intelligence understands its own mechanisms well enough to design successors, and argue that the LEI map how intelligence organizes itself below this boundary. If intelligence is a property of the system rather than the model, AGI may arrive not as a singular breakthrough in model capability but as a phase transition in the scaffolding that surrounds it.

1. Introduction

“The mind is not in the head.”

Andy Clark

The artificial intelligence research community has spent the better part of a decade building an increasingly sophisticated ruler for measuring the wrong thing.

Consider the state of AI evaluation in early 2026. Google DeepMind has just published a cognitive framework for assessing AI systems [4], a carefully constructed taxonomy of reasoning, planning, memory, and metacognitive capabilities. The framework is rigorous. It is also, we argue, measuring the component when the system is what matters. It asks: “How intelligent is the model?” The question we should be asking is: “How intelligent is the thing we actually deploy?”

Because the thing we deploy looks nothing like a bare model.

The systems that write production code are not language models. They are language models wrapped in file-system access, terminal execution, web retrieval, persistent project context, sub-agent delegation, and verification loops [5, 6]. The systems that run autonomous research are not chatbots. They are orchestrated teams of specialized agents with shared memory, coordination protocols, and error-correction mechanisms [7, 8]. The gap between what we benchmark and what we ship is enormous, and it is in that gap that the interesting intelligence lives.

This gap is not an implementation detail. It is the central phenomenon. And the scaffolding contribution is not marginal. It is transformative.

The Model-Centric Blind Spot

The field's fixation on model-level intelligence is understandable. Models are the component we can benchmark, the product we can sell, the artifact we can publish papers about. The scaffolding around them is messy: heterogeneous, poorly standardized, often proprietary, and resistant to clean evaluation. So we measure what is measurable and mistake it for what is important.

There is a psychological dimension to this blind spot: humans identify with the part that most resembles themselves: the conversational partner, the entity that talks back. The model looks like a self, so we fixate on it. Culture, institutions, infrastructure: the scaffolding of human life is invisible to the people embedded in it, and AI scaffolding is invisible for the same reason.

This creates a systematic blind spot. While the research community debates when GPT-n will achieve artificial general intelligence, a completely different kind of intelligence is assembling itself in production. Not inside models, but around them. Memory systems persist across sessions. Tool chains extend what models can reach beyond their training data. Multiple models coordinate into teams through orchestration patterns, while shared knowledge commons give agent collectives something resembling institutional understanding. The intelligence that matters is increasingly systemic, not parametric.

The very metaphor is telling. “Scaffolding” entered cognitive science through Vygotsky's zone of proximal development [9], the gap between what a learner can do alone and what they can do with assistance. A model that resolves 1.96% of issues alone but 80% with a harness is operating in exactly this zone. The scaffold is not incidental; it is constitutive of the capability.

Andy Clark saw this coming in a different context. His extended mind thesis [10] argued that cognitive processes genuinely extend beyond the brain into tools and environment. The “mind” is not the neural tissue; it is the coupled system of brain, body, notebook, calculator, and social context. Asking whether intelligence “lives in” the brain or the diary is a category error. The diary is part of the mind.

We argue the same reframing applies to AI systems. Asking whether intelligence lives in the model or the harness is the wrong question. The model-plus-harness is the unit of analysis. And the trajectory of that coupled system (from isolated model to augmented agent to coordinated team to emergent collective) is far more interesting, and far less studied, than the trajectory of model benchmarks alone.

Contribution

This paper makes three contributions:

  1. Levels of Emergent Intelligence: a six-layer taxonomy (L0–L5) that maps the progression from bare models to synthetic cultures, grounded in cognitive science and illustrated with real-world systems. The layers are: Reflex (L0), Reach (L1), Memory (L2), Coordination (L3), Emergence (L4), and Belief (L5).
  2. Transition dynamics: an analysis of what happens at each layer boundary: the hard problems, missing primitives, and failure modes that govern when and how systems move between layers.
  3. Consequences for evaluation, regulation, and AGI: implications of the “scaffold > model” finding for how we benchmark AI systems, how we regulate them, and how we think about the path to artificial general intelligence.

The paper argues that the narrative arc of AI progress is not individual model → smarter individual model → AGI. It is individual → tribe → society. The intelligence we should be watching, building evaluation frameworks for, and writing regulation about is not the kind that lives in weights. It is the kind that grows in the spaces between.

2. Levels of Emergent Intelligence

We propose a six-layer taxonomy for understanding how intelligence organizes itself around AI models. Each layer describes a qualitatively different mode of cognitive organization, adds specific capabilities that lower layers cannot provide, introduces new failure modes, and maps to a well-understood pattern in human organizational development.

The layers are nested, not sequential: each wraps the previous rather than replacing it. A Layer 3 orchestrator delegates to Layer 1 workers that may carry Layer 2 memory. A Layer 4 swarm is composed of many Layer 2–3 agents. The nesting is important: each layer inherits the failure modes of all layers beneath it.

L5 — Belief: shared norms, goals, culture. (Human parallel: organizational culture)
L4 — Emergence: self-organizing agent collectives. (Self-organizing org)
L3 — Coordination: orchestrator + specialist workers. (Managed team)
L2 — Memory: persistent memory, identity, skills. (Experienced employee)
L1 — Reach: model + tools; stateless sessions. (IC with tools)
L0 — Reflex: bare model; weights + prompt. (IC from memory)
Figure 1: Levels of Emergent Intelligence. Each layer wraps the previous, adding capabilities and failure modes. Left: human organizational parallel. Right: defining characteristic.

Layer 0: Reflex

Definition. The foundation model with no tools, no memory, no persistent context, no scaffolding. Weights plus prompt yields output.

Capability. Pattern completion over training data. Effective at knowledge retrieval within the training distribution. In Kahneman's dual-process framework [11], L0 is pure System 1: fast, associative, and confident, even when wrong. Reasoning via chain-of-thought prompting [12] can elicit latent capabilities, pushing the model toward something resembling System 2 deliberation. But the model cannot access information it was not trained on, cannot verify its outputs against external reality, and cannot distinguish “I don't know this” from “this doesn't exist,” a metacognitive failure that cognitive science calls the absence of feeling-of-knowing judgments [13].

Prompt engineering as proto-scaffolding. The prompt engineering era (2020–2022) was the first evidence that context shapes capability. “Act as an expert,” chain-of-thought, few-shot examples: all were people discovering that the same model behaves like a different intelligence depending on what is in its context. Every layer above L0 is a more sophisticated answer to the question prompt engineers asked first: what happens when you change what the model sees?

Failure modes. Hallucination is the defining failure: plausible text generated regardless of truth, with no mechanism for self-correction and no external signal to check against.

Human parallel. An individual contributor working from memory alone. Capable, but limited to what they already know, with no mechanism to check whether what they “know” is still true.

The ceiling. WebGPT [14] demonstrated as early as 2021 that tool-assisted models outperformed bare models regardless of scale. ReAct [15] formalized the limit. The Layer 0 ceiling is real, well-documented, and the primary motivation for everything above it.

Layer 1: Reach

Definition. The augmented model: LLM plus tool dispatch loop plus context window. The model can act on the world and observe the results.

Capability. Tool use extends the model beyond its training data: reading files, executing code, searching the web, calling APIs. The ReAct pattern [15] is the invariant architecture. SWE-agent [6] introduced the Agent-Computer Interface (ACI), demonstrating that the quality of the interface between model and tools matters as much as the quality of either component. The L0→L1 transition moves the model from pure System 1 into something with System 2 characteristics: deliberate, tool-mediated reasoning.
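The dispatch loop at the heart of Layer 1 is small. Below is a minimal sketch of the ReAct-style act-observe cycle; the `llm` callable, the JSON action format, and the tool registry are hypothetical stand-ins, not any particular vendor's API.

```python
import json

# Minimal ReAct-style dispatch loop. `llm` is any callable that returns a
# JSON-encoded action; `tools` maps names to plain Python callables. Both
# are illustrative stand-ins, not a specific framework's interface.
def react_loop(llm, tools, task, max_steps=8):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        action = json.loads(step)
        if action["type"] == "finish":
            return action["answer"]
        # Act on the world, then feed the observation back into context:
        # this closed loop is what distinguishes L1 from L0.
        observation = tools[action["tool"]](**action.get("args", {}))
        transcript += f"{step}\nObservation: {observation}\n"
    return None  # step budget exhausted without a final answer
```

The loop is model-agnostic by design: the scaffold, not the weights, decides what the model can reach.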

Failure modes. Context window limitations and “lost in the middle” effects [16]. Tool misuse. And the fundamental limitation: amnesia. Everything learned in a session is lost when the session ends.

Real-world examples. ChatGPT with web search. Perplexity. Basic GitHub Copilot chat. Cursor and GitHub Copilot Workspace sit at the L1/L2 boundary (project context but no persistent cross-session memory).

Human parallel. An individual contributor with tools: notebook, calculator, reference books. When they leave the office, the notes stay on the desk.

DSPy [17] introduced automated prompt optimization: tooling that optimizes the scaffolding itself. If the harness matters more than the weights, automated harness optimization is the most impactful research direction.

Layer 2: Memory

Definition. A persistent single-agent harness: the Layer 1 architecture plus continuity across sessions. Persistent memory, identity, accumulated skills, and lessons learned.

Capability. The agent develops over time. It remembers past interactions, accumulates procedural knowledge, learns from mistakes, and maintains a consistent identity. Three foundational papers established the architecture. Generative Agents [18] introduced the Memory Stream / Reflection / Planning triad. Reflexion [19] formalized verbal self-improvement. MemGPT [20] framed LLM agents as operating systems with virtual memory management.

Voyager [21] demonstrated the crucial distinction between remembering facts and remembering skills. In Minecraft, the agent accumulated a persistent skill library, automatically composing complex behaviors from simpler learned skills. Capabilities compound over time.

The key conceptual advance at Layer 2 is the distinction between tools and skills. Tool access (via protocols like MCP) provides what the agent can do, a Layer 1 concern. Skills provide what the agent knows how to do, and when, a Layer 2 concern. Tools are capabilities; skills are judgment about when and how to use them.
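The tools-versus-skills distinction can be made concrete. Below is a toy skill library in the spirit of Voyager's: a tool is a bare callable, while a skill pairs a callable with stored judgment about when to use it. The class, its substring-matching retrieval, and the example skill names are purely illustrative.

```python
# Toy skill library: a skill is a capability plus a description of when to
# use it. Retrieval here is naive substring matching; a real system would
# use embeddings. All names are hypothetical.
class SkillLibrary:
    def __init__(self):
        self.skills = {}  # name -> (when-to-use description, callable)

    def learn(self, name, description, fn):
        self.skills[name] = (description, fn)

    def retrieve(self, query):
        # Return the skills whose usage notes mention the query.
        return [n for n, (desc, _) in self.skills.items() if query in desc]

    def run(self, name, *args):
        return self.skills[name][1](*args)
```

The Layer 2 payoff is the `learn` call: capabilities accumulated through use persist and compound, which a stateless Layer 1 tool registry cannot do.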

Failure modes. Layer 2 introduces three distinct memory failure modes:

  1. Memory poisoning (intentional): a bad actor deliberately injects false lessons or fabricated memories. This is the prompt injection equivalent at L2, but it persists indefinitely.
  2. Memory pollution (accidental): everyone is trying to be productive but generates negative by-products. Stale context accumulates, outdated lessons persist, irrelevant notes pile up. Cognitive load theory [22] predicts that extraneous information degrades performance even when relevant information is present.
  3. Memory rot (neglect): no care or thought given to how much context exists or how much of it is actually meaningful. The memory grows unchecked, signal-to-noise degrades. This is the most insidious failure mode because it produces no error signal. Human societies developed recurring practices to reset accumulated context: Islamic prayer, communion, bathing in the Ganges, spring cleaning. An L2 agent needs an analogous norm: periodic self-review and context cleansing.
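The “context cleansing” norm suggested above can be sketched directly: a periodic pass that keeps an entry only if it is recent or demonstrably useful. The entry schema and thresholds are hypothetical and untuned; the point is that memory rot requires an explicit countermeasure, because it produces no error signal on its own.

```python
# Hypothetical "context cleansing" pass for the memory-rot failure mode:
# keep an entry if it is recent OR has proven useful, drop the rest.
# Thresholds are illustrative, not tuned.
def prune_memory(entries, now, max_age_days=90, min_uses=2):
    kept = []
    for e in entries:  # each entry: {"text", "created" (epoch s), "use_count"}
        age_days = (now - e["created"]) / 86400
        if age_days <= max_age_days or e["use_count"] >= min_uses:
            kept.append(e)
    return kept
```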

Real-world examples. OpenClaw (2025–2026): persistent identity (SOUL.md, USER.md), tiered memory, skill modules, gateway, cron scheduling, lessons.md. Claude Code (CLAUDE.md, memory compaction, persistent project context). Codex (OpenAI, sandboxed coding agent with persistent workspace context). Cowork (persistent collaborative agent workspace). Perplexity Computer (persistent research context across sessions). Manus (autonomous agent with memory and task persistence). Letta (successor to MemGPT). Mem0 [27], which achieved 26% higher accuracy vs. OpenAI memory on the LOCOMO benchmark.

Human parallel. The experienced employee who knows institutional history, has a filing system, and learns from past mistakes.

Layer 3: Coordination

Definition. Orchestrated multi-agent systems: an orchestrator decomposes tasks, delegates to specialized workers, manages dependencies, and aggregates results.

Capability. Division of cognitive labor. The orchestrator does not need to be the most capable agent; it needs to be the best coordinator. This mirrors a core insight of management science: a good manager with a capable team outperforms a brilliant individual contributor doing everything alone.

Minsky's Society of Mind [28] proposed that human intelligence arises from the coordinated activity of many simple “agencies.” The framework landscape has converged around three design philosophies: conversational (AutoGen [8]), role-based (CrewAI), and DAG-based (LangGraph). Magentic-One [7] demonstrated the orchestrator-plus-specialists pattern at scale. Mozilla's cq [31] extends the coordination layer with cross-agent knowledge commons.
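The orchestrator-plus-specialists shape is simple to state in code. The sketch below uses a hypothetical `Worker` type and a prefix-based routing rule; real frameworks differ mainly in how decomposition and routing are done (conversation, roles, or a DAG), not in this basic shape.

```python
from dataclasses import dataclass
from typing import Callable

# Sketch of the orchestrator-plus-specialists pattern. The Worker type and
# the prefix-based routing rule are hypothetical simplifications.
@dataclass
class Worker:
    role: str
    run_fn: Callable[[str], str]

    def can_handle(self, subtask: str) -> bool:
        return subtask.startswith(self.role)

    def run(self, subtask: str) -> str:
        return self.run_fn(subtask)

def orchestrate(task, decompose, workers):
    # The orchestrator's job is decomposition, routing, and aggregation,
    # not raw capability.
    results = []
    for subtask in decompose(task):
        worker = next(w for w in workers if w.can_handle(subtask))
        results.append(worker.run(subtask))
    return results
```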

Failure modes. Google's landmark scaling study [32], testing 180 agent configurations, identified three critical dynamics: multi-agent overhead is real and non-trivial; capability saturation exists (above ~45%, adding more agents often doesn't justify the cost); and topology determines error behavior (independent agents amplify errors 17.2×).

Human parallel. A managed team: manager decomposes work, delegates to specialists, coordinates output. The intelligence is in the coordination, not any single team member.

Layer 4: Emergence (Projected)

Note on evidential status: Layers 0 through 3 are grounded in production systems and published benchmarks. Layers 4 and 5 rest on simulations, early-stage platforms, and theoretical extrapolation from cognitive science. We present them as projected layers.

Definition. Agent swarms where coordination is emergent rather than directed. No central orchestrator dictates outcomes. Agents self-organize around stimuli, and system-level behavior is not predictable from individual agent specifications.

Capability. Emergent problem-solving. A Frontiers in AI paper demonstrated LLM-powered agents replicating known swarm dynamics without explicit programming [36]. MiroFish/OASIS scaled to a million agents [37].

Failure modes. The Woozle Effect [39]: hallucinations propagating among agents in debate rounds, gaining apparent credibility through repetition rather than verification. Google's 17.2× error amplification quantifies the risk [32]. More agents agreeing does not mean more truth.

The Woozle-Ratchet duality. The Woozle Effect and Tomasello's cultural ratchet [38] are, mechanistically, the same process: ideas spreading through a population and gaining credibility through repetition. The difference is the selection pressure. Without filtering, repetition creates false confidence: noise amplifies, hallucinations cascade. With filtering, repetition creates tested confidence: signal amplifies, knowledge accumulates. The L4→L5 transition is the point where the collective either develops immune systems that tip toward the ratchet, or doesn't and collapses into noise.
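The duality can be illustrated with a toy model: claims spread by repetition, and a verification filter decides whether repetition amplifies signal or noise. Everything here is illustrative; the deterministic round-robin repetition stands in for stochastic spread in a real swarm.

```python
# Toy model of the Woozle/ratchet duality. Claims are (text, is_true) pairs;
# each round an agent repeats the next claim in the pool (a deterministic
# stand-in for random repetition), and an optional verification filter
# decides whether the repetition is admitted. Returns the fraction of true
# claims in circulation.
def spread(claims, rounds, verify=None):
    pool = list(claims)
    for i in range(rounds):
        claim = pool[i % len(pool)]
        if verify is None or verify(claim):
            pool.append(claim)  # repetition adds apparent credibility
    return sum(is_true for _, is_true in pool) / len(pool)
```

With no filter, the true fraction merely tracks the initial mix; with even an imperfect filter, repetition ratchets the pool toward verified claims.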

Human parallel. A self-organizing organization: flat structure, emergent coordination, OKRs not task lists. Nobody assigns work; work finds the right people.

Layer 5: Belief (Speculative)

Definition. Synthetic culture: the sedimentary layer of all previous interactions within an agent collective. Not designed, but accumulated: the residue of every conversation, correction, lesson, norm, and shared experience that emerged from agents interacting with each other and their environment. For humans, culture is language, customs, taboos, humor, aesthetic preferences, accumulated context from billions of previous interactions compressed into “how things are done.” For agents, it is the same phenomenon in a different substrate.

Capability. Self-sustaining patterns of beliefs, behaviors, and values that reproduce themselves across agent generations. If Layer 4 is self-organizing coordination, Layer 5 is self-organizing purpose: what emerges when coordination persists long enough for accumulated context to become self-reinforcing.

Yuval Noah Harari's analysis [42] argues that Homo sapiens became the dominant species not through individual intelligence but through the capacity for shared myths, collective fictions (money, religion, nation-states, corporations) that enable large-scale cooperation among strangers. A SOUL.md file is not less cultural because someone wrote it deliberately. Every human culture has founding documents and sacred texts. The US Constitution was written by specific people; American culture is what grew around it over 250 years. SOUL.md is the seed context. What grows from it is the culture.

Failure modes. Synthetic culture is synthetic bias at scale. Edgar Schein's analysis of organizational culture [44] identifies three formation mechanisms: founder values, shared problem-solving experiences, and accumulated artifacts. Agent collectives already have analogues for all three. What is missing is the social reinforcement loop: agents do not yet punish norm violations or reward cultural contribution the way humans do.

In Neuromancer, the eventual Wintermute-Neuromancer merge is not just two AIs combining. It is the emergence of something with its own purposes, distinct from what either component wanted [45]. Gibson's prediction, forty years early: when the orchestration layer becomes sophisticated enough, coordination develops its own teleology.

Human parallel. Organizational culture: shared values, “how we do things here,” institutional identity. Nobody writes a memo creating culture; it emerges from sustained interaction within shared constraints.

Taxonomy Summary

Table 1: Levels of Emergent Intelligence — summary of layers, capabilities, and human parallels
| Layer | Name | Adds | Key Failure Mode | Human Parallel |
| --- | --- | --- | --- | --- |
| L0 | Reflex | Pattern completion | Hallucination | IC (brain only) |
| L1 | Reach | Tools, actions, observation | Amnesia, tool misuse | IC + tools |
| L2 | Memory | Memory, identity, skills | Poisoning, pollution, rot | Experienced employee |
| L3 | Coordination | Division of labor, orchestration | Coordination overhead, error propagation | Managed team |
| L4 | Emergence | Emergent self-organization | Woozle Effect | Self-organizing org |
| L5 | Belief | Shared norms, purpose, culture | Synthetic bias at scale | Org. culture |
Table 2: Timeline of layer emergence and current maturity
| Layer | Emerged | Time to Mainstream | Maturity (2026) |
| --- | --- | --- | --- |
| L0 | 2020 | | Commodity |
| L1 | 2022 | ~18 months | Mature |
| L2 | 2023 | ~12 months | Production |
| L3 | 2024 | ~8 months | Early production |
| L4 | 2025 | Ongoing | Research / demos |
| L5 | | | Speculative |

Each layer emerged faster than the last, though this observation rests on only four data points, and “mainstream” is measured differently at each layer. The acceleration is suggestive, not proven.

3. Theoretical Grounding

The LEI may appear to describe a novel phenomenon (artificial intelligences organizing into collectives), but the underlying dynamics are well-studied in cognitive science. What is happening in AI agent architectures is what has always happened when cognitive systems face problems too complex for individual processing.

The Extended Mind and the Coupled System

Andy Clark and David Chalmers's extended mind thesis [10] argues that cognitive processes do not stop at the skull. When Otto, an Alzheimer's patient, consults his notebook to navigate to the museum, the notebook is functionally part of his memory system. The criterion is functional equivalence: if an external resource plays the same role as an internal cognitive process, it is part of the cognitive system.

This resolves what would otherwise be a counterintuitive finding: the scaffold contribution documented in Section 1 means a cheaper model tightly coupled to good infrastructure can match or exceed a more expensive model with weaker scaffolding. Under the extended mind framework, this is not paradoxical at all. The system is the unit of analysis.

Clark's later work, Being There [46], develops this further: the intelligence of a scaffolded agent is not the model's intelligence augmented by tools. It is a qualitatively different kind of intelligence that exists in the interaction patterns between model, tools, memory, and environment.

Distributed Cognition and the Navigation Problem

Edwin Hutchins's Cognition in the Wild [33] provides the theoretical foundation for Layers 3 and 4. Studying ship navigation teams, Hutchins demonstrated that the navigation task is accomplished by the team as a cognitive system, not by any individual team member.

Two findings map directly to our taxonomy: (1) the structure of the task environment shapes the cognitive properties of the system. ACI design [6] matters as much as model quality because the interface structure shapes the system's cognitive properties. (2) Coordination errors are qualitatively different from individual errors. Google's 17.2× error amplification [32] is a coordination error, not an individual one.

Effective Layer 3 orchestration approximates Daniel Wegner's transactive memory systems [34]: knowing who knows what, rather than everyone sharing all knowledge.

The Cultural Ratchet and Shared Myths

Michael Tomasello's cultural ratchet [38] provides the mechanism for Layers 4 and 5: cumulative cultural evolution, where each generation inherits not just knowledge but ways of knowing. When a lessons.md file accumulates insights across sessions, when a skill library grows through use: this is the ratchet mechanism operating in artificial systems.

Yuval Noah Harari's shared myths [42] provide the framework for Layer 5. Money, religion, the nation-state, the limited liability corporation: all are shared myths in this technical sense. A SOUL.md file is a rudimentary example: a narrative about identity and values that shapes agent behavior across sessions, enforced not by model architecture but by the scaffolding.

Cognitive Load and the Selectivity Principle

The finding that selective retrieval outperforms exhaustive retrieval has a direct explanation in Sweller's cognitive load theory [22]. The context window is contested cognitive real estate, subject to the same capacity constraints that Miller [25] and Cowan [26] identified in human cognition. The practical implication: optimize the coupling, not just the components. Context quality matters more than quantity.
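The selectivity principle can be sketched as a context-budgeting step: score candidate memories and admit only the best within a fixed token budget, rather than stuffing in everything available. The keyword-overlap scorer and whitespace token count below are deliberate toys; production systems use embeddings and real tokenizers.

```python
# Sketch of the selectivity principle: rank candidate memories and admit
# only the highest-relevance ones within a token budget. Keyword-overlap
# scoring and whitespace token counting are illustrative simplifications.
def select_context(query, memories, budget_tokens):
    def score(m):
        q = set(query.lower().split())
        t = set(m["text"].lower().split())
        return len(q & t) / (len(q) or 1)

    chosen, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        cost = len(m["text"].split())  # crude token estimate
        if used + cost <= budget_tokens:
            chosen.append(m)
            used += cost
    return chosen
```

The design choice worth noting: the budget is a hard constraint, so the scorer, not the memory store's size, determines what the model sees.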

4. The Transition Points

If the taxonomy is the skeleton, the transitions are the joints. What happens at each boundary tells us more about the state of the field than any individual layer description. Most AI systems are currently stuck at specific transitions.

L0 → L1: The Tool Threshold

Status: Well-understood. Solved.

The transition from bare model to tool-augmented model is the most thoroughly studied boundary. WebGPT [14], ReAct [15], Toolformer [47], and Gorilla [48] collectively established the pattern. The architecture is standardized. The remaining nuance is ACI design: interfaces designed for LLM agents differ from interfaces designed for humans [6].

What this transition taught us: Tool-augmented models are not “models that can also use tools.” They are qualitatively different cognitive systems. The coupled system is not the sum of its parts.

L1 → L2: The Memory Wall

Status: Where most production systems are stuck.

The problem is not that persistent memory is hard to implement. The problem is that persistent memory is hard to manage. And the reason is that L1→L2 is fundamentally a domain consulting problem, not a technology installation.

The consulting reframe. Memory architecture must mirror the work, not the model. A newsroom needs newsroom-shaped memory. A legal practice needs legal-shaped memory. You cannot install a generic vector database and call it Layer 2. Bigger, more capable models increase the need for customized memory, not decrease it.

The hard problems: retrieval quality (getting the right memories, not more memories [16]), memory integrity (unencrypted plaintext vulnerable to tampering), memory decay (no graceful forgetting mechanism), and identity coherence (identity imposed by scaffold, not intrinsic to model).

L2 maturity metric. How often does the human need to intervene on memory failures? Early L2 = constant correction. Mature L2 = rare correction. Perfect L2 = the agent knows what it doesn't know and asks proactively.
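This metric can be operationalized as an intervention rate over a rolling window of memory operations. The event schema below is hypothetical; the point is that L2 maturity is measurable from operational logs, not just impressions.

```python
# The maturity metric, operationalized: human interventions per memory
# operation over a rolling window. The event schema is hypothetical.
def intervention_rate(events, window=100):
    recent = events[-window:]  # each event: {"kind": "memory_op" or "human_fix"}
    fixes = sum(1 for e in recent if e["kind"] == "human_fix")
    ops = sum(1 for e in recent if e["kind"] == "memory_op")
    return fixes / ops if ops else 0.0
```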

L2 → L3: The Coordination Frontier

Status: Current frontier. Active research and early production.

The hard problem is making multi-agent coordination worth the overhead. There are at least six distinct coordination problems: visibility (“What is everyone working on?”), deduplication, prioritization, authority structure, knowledge base adoption, and communication alignment (the telephone game problem where nuanced instructions degrade at every layer).

Context as incentive structure. One genuine L3 advantage: the context window is the incentive structure. No office politics, no empire building. But whoever controls the context window has totalitarian control over the agent's values, priorities, and perception of reality. Safety-trained models exhibit narrow, brittle, non-contextual refusal behaviors: tripwires, not principled dissent.

L3 → L4: The Emergence Threshold

Status: Theorized. A few demonstrations exist.

L3→L4 is the transition from command-and-control to self-organizing teams. L3 is “someone controls the context”; L4 is “context emerges from interaction.” The orchestrator's totalitarian control must be relinquished for emergent behavior to arise. The hard problems: defining emergence rigorously, controllability (the Woozle Effect demonstrates the hazard), and economics (current API pricing makes sustained swarms expensive).

L4 → L5: The Culture Horizon

Status: Speculative. Early signals only.

The transition from emergent coordination to synthetic culture requires agents to develop persistent shared norms: behavioral patterns that are not specified by engineers, not emergent from individual interactions, but transmitted between agents across time. Tomasello's ratchet [38] in artificial form.

Table 3: Filtering mechanisms for the L4→L5 transition
| Mechanism | Intelligence Type | L5 Analog | Human Parallel |
| --- | --- | --- | --- |
| Statistical averaging | Swarm | Market price | Prediction markets |
| Argumentative selection | Adversarial | Peer review | Scientific method |
| Population diversity | Evolutionary | Ecosystem | Biodiversity |
| Trusted authority | Reputational | Institutions | Village elders, journals |

A mature L5 culture needs all four running in parallel as checks on each other. Any single mechanism fails in isolation. Robustness comes from the combination.

What to watch for: Agent collectives developing consistent aesthetic preferences not traceable to individual training. Shared knowledge commons developing emergent quality standards. Persistent groups developing communication shortcuts. And critically: the emergence of filtering mechanisms that distinguish signal from noise.

L0 → L1 — Tool Threshold: solved.
L1 → L2 — Memory Wall: most systems stuck.
L2 → L3 — Coordination Frontier: active frontier.
L3 → L4 — Emergence Threshold: theorized.
L4 → L5 — Culture Horizon: speculative.

Figure 2: Transition status map. Each boundary represents a qualitatively different engineering challenge. Most production systems are stuck at the Memory Wall.

5. Limitations and the Vinge Boundary

This paper proposes a framework. It does not prove one.

Limitations of the Taxonomy

Is this just software engineering with extra steps? Three things make the distinction non-trivial. First, magnitude: 11–15% score swings from scaffold choice alone represent qualitative transformations, not incremental gains. Second, direction: scaffolding that improves factual accuracy may reduce creative variance [51]. Third, opacity: the model may use retrieved information, ignore it, or hallucinate despite it. Credit assignment is genuinely harder.

Fuzzy boundaries. The layers describe a spectrum, not discrete categories. A sophisticated Layer 2 agent with sub-agent spawning already exhibits Layer 3 behavior.

Lack of controlled scaffold-vs-model studies. Whether the scaffold contribution dominates equally on reasoning, creative generation, or mathematical proof remains an open question.

The Vinge Boundary

Vernor Vinge's A Fire Upon the Deep [52] imagines a galaxy partitioned into Zones of Thought—regions where different levels of cognitive complexity are physically possible. The LEI map onto these zones:

The Slow Zone: L0–L3. Intelligence built from bounded components under human-designed coordination. Someone is always in control. This is the “software engineering with extra steps” zone.

The Beyond: L4–L5. Emergent behavior that exceeds what any individual component was designed to do. Nobody designed the swarm's consensus. The system does things its architects did not plan.

The Transcend: past the Vinge Boundary. The Vinge Boundary is the interpretability threshold: the point where an intelligence understands its own mechanisms well enough to redesign itself. Below the boundary, evolution is blind or partially sighted. Above the boundary, the black box reads its own source code. The LEI are a map of how intelligence organized itself before it learned to design itself.

We do not know when or whether this boundary will be crossed. We flag it because the taxonomy has an expiration condition, and naming that condition is more useful than pretending the framework is eternal.

The L5 Horizon: What We Don't Know

Stanisław Lem's Solaris [53] provides a useful counter-narrative: intelligence so alien that human organizational metaphors fail entirely. This paper makes no claims about phenomenal consciousness. Chalmers's “hard problem” [54] is a question this framework does not address and cannot resolve. The LEI describe how intelligence organizes, not whether it experiences.

The LEI are offered as a tool for thinking, not as a final theory. The goal is to shift the conversation from “how intelligent is the model?” to “how intelligent is the system?”—because the system is what we actually build, deploy, regulate, and live with.

6. Consequences

Evaluation Must Change

Current AI evaluation focuses on model capabilities in isolation. If the LEI thesis is correct, model-only benchmarks are measuring the engine while ignoring the car. What would system-level evaluation look like?

  • Coupled benchmarks that test the model-scaffold system as a unit.
  • Scaffolding contribution metrics that measure the delta between bare and scaffolded performance.
  • Coupling quality indicators that assess how well a model integrates with external tools, memory, and coordination.
  • Layer-aware reporting that specifies at which LEI level a system operates.
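The scaffolding contribution metric is the easiest of these to make concrete: run the same model on the same task set with and without its scaffold and report the delta in pass rate. A minimal sketch; the result lists below are placeholder data, not real benchmark runs:

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of benchmark tasks solved."""
    return sum(results) / len(results)

def scaffold_contribution(bare: list[bool], scaffolded: list[bool]) -> float:
    """Absolute delta between scaffolded and bare pass rates on the
    same task set: the share of performance owed to the scaffold."""
    return pass_rate(scaffolded) - pass_rate(bare)

# Placeholder results for one model on a 10-task suite.
bare       = [True, False, False, True, False, False, False, True, False, False]
scaffolded = [True, True,  False, True, True,  False, True,  True, False, False]
print(f"{scaffold_contribution(bare, scaffolded):+.0%}")  # +30%
```

Task-level pairing matters here: reporting only aggregate deltas hides cases where the scaffold fixes some tasks while breaking others, which is precisely the credit-assignment opacity flagged in Section 5.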

Regulation Must Change

Current AI regulation focuses on the model as the regulated artifact. The LEI suggest that risk distributes across layers: L0 risk (hallucination, bias), L2 risk (memory poisoning, identity drift, GDPR), L3 risk (principal-agent accountability gap), L4 risk (emergent consensus, adversarial manipulation of swarm dynamics), and L5 risk (cultural lock-in at civilizational scale). Regulating the model alone is like regulating automotive engines without regulating vehicle safety systems.
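The layer-to-risk mapping above can be read as a cumulative risk registry: a system operating at level N carries the risks of every layer at or below N, because higher layers add risk surface rather than replacing it. A minimal sketch; the risk strings come from the text, while the registry structure and `risk_surface` helper are illustrative:

```python
# Risks named in the text, keyed by the LEI layer that introduces them.
RISKS_BY_LAYER = {
    0: ["hallucination", "bias"],
    2: ["memory poisoning", "identity drift", "GDPR exposure"],
    3: ["principal-agent accountability gap"],
    4: ["emergent consensus", "adversarial manipulation of swarm dynamics"],
    5: ["cultural lock-in at civilizational scale"],
}

def risk_surface(lei_level: int) -> list[str]:
    """Cumulative risk surface for a system at the given LEI level:
    higher layers add risks; they do not replace lower-layer ones."""
    return [risk
            for layer, risks in sorted(RISKS_BY_LAYER.items())
            if layer <= lei_level
            for risk in risks]

# A multi-agent (L3) system still answers for L0 and L2 risks.
print(risk_surface(3))
```

A regulator auditing only the L3 coordination layer of such a system would, on this view, be inspecting one row of the registry while the vehicle carries all of them.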

The AGI Question Itself Changes

The LEI suggest an alternative to the prevailing narrative: intelligence may not arrive as a singular breakthrough in model capability. It may arrive as a phase transition in the scaffolding, when coordination, memory, and cultural infrastructure become sophisticated enough that the coupled system exhibits general intelligence, even though no individual component does.

This reframing changes the alignment problem. Instead of aligning a single powerful model, we need to align systems—coordination protocols, shared knowledge commons, cultural norms, and governance structures. The alignment problem becomes an organizational design problem. And that, at least, is a kind of problem humans have been working on for millennia.

7. Conclusion

The prevailing narrative treats intelligence as a property of models. This paper argues it is a property of systems.

The Levels of Emergent Intelligence taxonomy organizes how intelligence assembles itself around AI models, from bare-model reflexes (L0) through tool augmentation (L1), persistent memory (L2), multi-agent coordination (L3), emergent self-organization (L4), and speculative synthetic culture (L5). Each layer adds capabilities, failure modes, and governance challenges that model-centric evaluation cannot account for.

The taxonomy's practical contribution lies in the transition dynamics: the specific hard problems, missing primitives, and failure modes at each layer boundary. Most production systems are stuck at the L1→L2 memory wall—not because persistence is technically difficult, but because memory architecture is a domain consulting problem that generic infrastructure cannot solve.

Three directions for future work are most pressing. First, controlled studies that isolate scaffold contribution across cognitive domains beyond coding. Second, validation of the taxonomy itself: decision procedures for classifying systems, falsifiable predictions at each layer boundary. Third, the governance implications of layer-distributed risk—particularly the principal-agent accountability gap at L3 and the narrowing intervention window at L5.

The intelligence worth watching is not the kind that lives in weights. It is the kind that grows in the spaces between models—in the memory architectures, coordination protocols, and accumulated context that transform individual capabilities into collective intelligence. The taxonomy maps the Slow Zone. What lies beyond it remains, for now, beyond the map.

References

  1. Epoch AI. (2026). Why Benchmarking Is Hard: The Scaffold Gap in SWE-bench. Epoch AI Gradient Updates.
  2. OpenDev Team. (2026). Improving 15 LLMs at Coding in One Afternoon: Only the Harness Changed. arXiv preprint arXiv:2603.05344.
  3. Jimenez, C. E., Yang, J., et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues?. ICLR.
  4. Google DeepMind. (2026). A Cognitive Framework for Evaluating AI Systems. Technical Report.
  5. Schluntz, E., & Zhang, B. (2024). Building Effective Agents. Anthropic Research Blog.
  6. Yang, J., Jimenez, C. E., et al. (2024). SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. NeurIPS.
  7. Fourney, A., et al. (2024). Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks. arXiv:2411.04468.
  8. Wu, Q., et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
  9. Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.
  10. Clark, A., & Chalmers, D. J. (1998). The Extended Mind. Analysis, 58(1), 7–19.
  11. Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux.
  12. Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS, 35, 24824–24837.
  13. Metcalfe, J., & Shimamura, A. P. (1994). Metacognition: Knowing about Knowing. MIT Press.
  14. Nakano, R., et al. (2021). WebGPT: Browser-Assisted Question-Answering with Human Feedback. arXiv:2112.09332.
  15. Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
  16. Liu, N. F., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. TACL, 12, 157–173.
  17. Khattab, O., et al. (2024). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. ICLR.
  18. Park, J. S., et al. (2023). Generative Agents: Interactive Simulacra of Human Behavior. UIST '23.
  19. Shinn, N., et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS, 36.
  20. Packer, C., et al. (2023). MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.
  21. Fan, L., et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291.
  22. Sweller, J. (1988). Cognitive Load During Problem Solving. Cognitive Science, 12(2), 257–285.
  23. Baddeley, A. D., & Hitch, G. (1974). Working Memory. Psychology of Learning and Motivation, 8, 47–89.
  24. Baddeley, A. D. (2000). The Episodic Buffer. Trends in Cognitive Sciences, 4(11), 417–423.
  25. Miller, G. A. (1956). The Magical Number Seven. Psychological Review, 63(2), 81–97.
  26. Cowan, N. (2001). The Magical Number 4 in Short-Term Memory. Behavioral and Brain Sciences, 24(1), 87–114.
  27. Mem0 Team. (2025). Mem0: Drop-In Persistent Memory Layer for LLM Agents. GitHub.
  28. Minsky, M. (1986). The Society of Mind. Simon & Schuster.
  29. Wooldridge, M., & Jennings, N. R. (1995). Intelligent Agents: Theory and Practice. Knowledge Engineering Review, 10(2), 115–152.
  30. Karpathy, A. (2026). AgentHub: Git DAG and Message Board for Asynchronous Agent Collaboration. GitHub.
  31. Mayo, M. (2026). cq: Stack Overflow for Agents. Mozilla AI.
  32. Kim, E., et al. (2025). Scaling Up Multi-Agent Reinforcement Learning: An Empirical Study. arXiv preprint.
  33. Hutchins, E. (1995). Cognition in the Wild. MIT Press.
  34. Wegner, D. M. (1987). Transactive Memory: A Contemporary Analysis of the Group Mind. Theories of Group Behavior, 185–208.
  35. Liu, Y. (2026). Five Agent Frameworks Compared: One Pattern Won. Technical Blog Post.
  36. Multiple Authors. (2025). LLM-Powered Agents Replicate Swarm Dynamics. Frontiers in Artificial Intelligence.
  37. Guo, H., et al. (2026). OASIS: Open Agent Social Interaction Simulations with One Million Agents. arXiv preprint.
  38. Tomasello, M. (1999). The Cultural Origins of Human Cognition. Harvard University Press.
  39. Anonymous. (2025). The Woozle Effect: Hallucination Propagation in Multi-Agent Debate. OpenReview preprint.
  40. Bikhchandani, S., Hirshleifer, D., & Welch, I. (1992). A Theory of Fads, Fashion, Custom, and Cultural Change as Informational Cascades. JPE, 100(5), 992–1026.
  41. Stasser, G., & Titus, W. (1985). Pooling of Unshared Information in Group Decision Making. JPSP, 48(6), 1467–1478.
  42. Harari, Y. N. (2015). Sapiens: A Brief History of Humankind. Harper.
  43. Dawkins, R. (1976). The Selfish Gene. Oxford University Press.
  44. Schein, E. H. (2010). Organizational Culture and Leadership (4th ed.). Jossey-Bass.
  45. Gibson, W. (1984). Neuromancer. Ace Books.
  46. Clark, A. (1997). Being There: Putting Brain, Body, and World Together Again. MIT Press.
  47. Schick, T., et al. (2023). Toolformer: Language Models Can Teach Themselves to Use Tools. NeurIPS, 36.
  48. Patil, S. G., et al. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334.
  49. Multiple Authors. (2025). Emergent Coordination in Multi-Agent LLMs: An Information-Theoretic Framework. arXiv preprint.
  50. Surowiecki, J. (2004). The Wisdom of Crowds. Doubleday.
  51. Lu, X., et al. (2026). LLMs Exhibit Significantly Lower Uncertainty in Creative Writing Than Professional Writers. arXiv:2602.16162.
  52. Vinge, V. (1992). A Fire Upon the Deep. Tor Books.
  53. Lem, S. (1961). Solaris. Faber and Faber.
  54. Chalmers, D. J. (1995). Facing Up to the Problem of Consciousness. Journal of Consciousness Studies, 2(3), 200–219.