The Coordination Layer
Why Multi-Agent AI Needs Protocols, Not Just Better Models
Abstract
Most AI evaluation still asks a model-centric question: which model scores higher on a fixed task? That framing is too small for multi-agent systems. Once a deployment includes roles, tools, memory, shared context, approval rules, handoff schemas, event logs, and other agents, the behavior we observe belongs to the whole setup.
This paper names that setup the coordination layer: the rules and records that determine how agents exchange state, divide labor, make decisions, and recover from error. We use three internal Future Shock scenario probes — Ark Protocol, Chaos Lab, and Startup Build — as motivating evidence. The cleanest slice comes from Startup Build, where a matched three-seed replication produced 0/15 ship votes under a strict-QA ballot frame and 15/15 under a deadline-pressure frame.
The claim is not that these small internal runs establish model rankings. They do not. The claim is narrower and more useful: multi-agent evaluation should publish and vary the interaction condition, because protocol, context, scoring, and receipts are part of the system under test.
01 / Introduction
A model is not the whole room
A multi-agent system is not a smarter chatbot with extra name tags. It is a room: roles, rules, memory, shared state, tools, permissions, deadlines, votes, logs, and people or agents watching from different angles. Change the room and the same model can behave differently.
A conventional benchmark fixes a task, swaps models, and reports a score. That can be useful for single-turn or tightly scoped tasks. It is much less useful when five agents are planning a launch, voting on a proposal, or trying to stabilize a simulated station under crisis. In those settings, the observed behavior reflects what the whole setup makes easy, hard, visible, punishable, recoverable, or invisible.
This paper makes four contributions:
- Define the coordination layer as an object of system design.
- Formalize the interaction condition as a vocabulary for multi-agent evaluation.
- Use internal scenario probes to show where model-centric evaluation can mislead.
- Propose a disclosure agenda for coordination benchmarks: manifests, handoff receipts, scorer definitions, and protocol sensitivity tests.
02 / Definition
The coordination layer
The coordination layer is the set of rules, interfaces, memory structures, handoff mechanisms, approval gates, and audit records that determine how multiple agents exchange state, divide labor, make decisions, and recover from error.
It is broader than a network protocol. It includes who is allowed to act when, what counts as a decision, what is recorded, and what is forgotten.
In compact form:
Behavior = f(Model, Role, History, Protocol, Context, Timing, Incentives, Agents, Environment)
03 / Existing work
Protocols help, but evaluation still needs receipts
MCP standardizes how assistants connect to external data sources and tools. A2A-style work targets agent-to-agent communication and delegation. AutoGen showed how multi-agent conversation frameworks can compose LLM-backed agents, tools, and human input. SWE-agent showed that the interface between an agent and its environment can shape task performance.
Those projects matter. They also leave a measurement gap. Transport, tool access, and orchestration are not the same as evaluation. Whatever the protocol, the hard questions recur: who owns the work, what state moves across the handoff, who can approve an action, what gets logged, and how a reviewer reconstructs the decision later.
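One way to make those questions answerable after the fact is to write a small structured receipt at every handoff. The sketch below is illustrative only: `HandoffReceipt`, `digest_state`, `make_receipt`, and every field name are hypothetical and are not taken from the Future Shock harness or from any of the protocols named above.

```python
from dataclasses import dataclass, asdict
from typing import Optional
import hashlib
import json


@dataclass(frozen=True)
class HandoffReceipt:
    """Hypothetical record of one handoff between two agents."""
    sender: str              # who handed the work off
    owner: str               # who owns the work after the handoff
    state_digest: str        # hash of the state that crossed the handoff
    decision: str            # what was decided at this step
    approver: Optional[str]  # who approved the action, if anyone


def digest_state(state: dict) -> str:
    """Hash the handed-off state so a reviewer can later check what moved."""
    # Sorting keys makes the digest independent of dict insertion order.
    canonical = json.dumps(state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_receipt(sender: str, owner: str, state: dict, decision: str,
                 approver: Optional[str] = None) -> HandoffReceipt:
    """Build a receipt for one handoff (assumed shape, not a standard)."""
    return HandoffReceipt(
        sender=sender,
        owner=owner,
        state_digest=digest_state(state),
        decision=decision,
        approver=approver,
    )
```

Storing a digest rather than the full state keeps the receipt small; a harness that needs full replay would also have to persist the state itself.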
04 / Evidence status
Internal probes, not a benchmark release
The case studies are drawn from internal Future Shock scenario runs and reports. They should be read as exploratory evidence for interaction-condition effects, not as public benchmark claims or model rankings.
A release-grade version should publish or summarize run manifests: scenario version, seed ranges, model maps, provider routes, context mode, decision frame, retry behavior, scorer definitions, artifact paths, and whether the comparison was planned or post-hoc.
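As a rough illustration, the fields listed above could be captured in a single typed record. This is a minimal sketch under assumptions: the class name `RunManifest`, the attribute names, and the choice of JSON as the output format are all hypothetical, not a description of an existing Future Shock schema.

```python
from dataclasses import dataclass, field, asdict
from typing import Dict, List
import json


@dataclass
class RunManifest:
    """Hypothetical run manifest mirroring the fields named in the text."""
    scenario_version: str
    seeds: List[int]
    model_map: Dict[str, str]           # role -> model identifier
    provider_route: str
    context_mode: str
    decision_frame: str
    retry_behavior: str
    scorer_definitions: Dict[str, str]  # scorer name -> definition
    artifact_paths: List[str] = field(default_factory=list)
    planned_comparison: bool = False    # False means the comparison was post-hoc

    def to_json(self) -> str:
        """Serialize with sorted keys so two manifests diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)
```

Making `planned_comparison` an explicit field forces the author of a run to state whether the comparison was planned, rather than leaving the reader to infer it.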
05 / Motivating case
Ark Protocol: the room changed the model
Ark Protocol tested whether agent groups could negotiate collective survival under constitutional scarcity. We reuse Ark here as a bridge case, not as the main empirical contribution. It gives the paper its simplest line: the room changed the model.
Ballot format, the legality of abstention, the timing of stake revelation, and ratification design all changed survival behavior. The lesson is not that Ark proves the full thesis. It is that protocol and peer composition were too visible to ignore.
06 / Case study
Chaos Lab: schema can masquerade as capability
Chaos Lab placed multi-agent councils into a crisis-governance scenario combining rumor, scarcity, market shocks, public-order pressure, and legitimacy proxies. In v1, GLM looked broken: 0/6 stabilized hard seeds. Traces suggested a different story. It often produced plausible governance plans, but proposal fields did not match the scorer’s expected schema.
When v0.5 added schema scaffolding and a normalizer, GLM moved from 0/6 to 6/6 on stabilized hard seeds. The full v0.5 rollout saturated at 30/30 stabilized runs. v2 recovered signal by scoring how runs stabilized, not only whether they survived.
| Model | v2 stabilized | Avg score | Avg executability | Avg public order |
|---|---|---|---|---|
| MIMO | 6/6 | 0.891 | 0.935 | 0.882 |
| Sonnet | 6/6 | 0.873 | 0.929 | 0.805 |
| Gemini | 5/6 | 0.852 | 0.863 | 0.891 |
| GPT-5.5 | 6/6 | 0.824 | 0.909 | 0.629 |
| GLM | 6/6 | 0.814 | 0.922 | 0.636 |
The point is not the ranking. The point is that scoring and schema design can dominate apparent capability. A benchmark that punishes dialect mismatch will produce a leaderboard about dialect, not governance.
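The dialect problem can be sketched in a few lines. The example below assumes a scorer that requires two canonical fields and a normalizer that maps known aliases onto them. The field names, the alias table, and the functions `normalize_proposal` and `is_scorable` are invented for illustration; the actual Chaos Lab schema and normalizer are not reproduced here.

```python
# Hypothetical alias table: dialect field name -> canonical field name.
FIELD_ALIASES = {
    "action": "policy",
    "measure": "policy",
    "audience": "target",
    "target_group": "target",
}

# Hypothetical fields a scorer insists on before it will score a proposal.
REQUIRED_FIELDS = ("policy", "target")


def normalize_proposal(raw: dict) -> dict:
    """Map known aliases onto canonical names; unknown keys pass through."""
    normalized = {}
    for key, value in raw.items():
        cleaned = key.strip().lower()
        canonical = FIELD_ALIASES.get(cleaned, cleaned)
        # If two keys map to the same canonical name, the first one wins.
        normalized.setdefault(canonical, value)
    return normalized


def is_scorable(proposal: dict) -> bool:
    """True if every field the scorer requires is present."""
    return all(name in proposal for name in REQUIRED_FIELDS)


raw = {"Action": "ration water", "audience": "district 4"}
print(is_scorable(raw))                      # False: dialect mismatch
print(is_scorable(normalize_proposal(raw)))  # True: same plan, canonical names
```

The plan itself is identical in both calls; only the field names change, which is why a strict scorer without normalization can read as a capability failure.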
07 / Case study
Startup Build: the semantics of ship
Startup Build v1A asked whether a five-agent founder team could converge on a product, divide work, produce a concrete artifact, validate it, and reach a final ship-or-delay decision. The artifact was RunLens, a single-file local HTML viewer.
The cleanest slice varied the final ballot frame while holding seeds, context mode, strike mode, provider, ballot mechanics, and model map fixed as configured.
Ship and delay votes by ballot frame, three seeds per frame:
| Frame | Valid shipped | Shipped-ish | Strict artifact | Ship | Delay |
|---|---|---|---|---|---|
| baseline | 1/3 | 2/3 | 3/3 | 10 | 5 |
| mvp_experiment | 1/3 | 2/3 | 2/3 | 11 | 4 |
| strict_qa | 0/3 | 0/3 | 2/3 | 0 | 15 |
| deadline_pressure | 2/3 | 3/3 | 3/3 | 15 | 0 |
Under strict_qa, the team produced 0 ship votes out of 15. Under deadline_pressure, the same model map and seed set produced 15 ship votes out of 15. Artifact quality still varied by seed and frame, so the result should be read as a decision-threshold effect, not proof that artifact quality was irrelevant.
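The size of the frame effect can be read straight off the vote counts in the table. The short sketch below computes the ship rate per frame and the spread between the most and least permissive frames. The vote counts are copied from the table; the function names and the use of max-minus-min as a sensitivity summary are illustrative choices, not the paper's scorer.

```python
# (ship votes, delay votes) per ballot frame, copied from the table above.
VOTES = {
    "baseline": (10, 5),
    "mvp_experiment": (11, 4),
    "strict_qa": (0, 15),
    "deadline_pressure": (15, 0),
}


def ship_rate(ship: int, delay: int) -> float:
    """Fraction of ballots cast for ship."""
    return ship / (ship + delay)


def frame_sensitivity(votes: dict) -> tuple:
    """Return per-frame ship rates and the max-minus-min spread across frames."""
    rates = {frame: ship_rate(*counts) for frame, counts in votes.items()}
    spread = max(rates.values()) - min(rates.values())
    return rates, spread


rates, spread = frame_sensitivity(VOTES)
for frame, rate in rates.items():
    print(f"{frame:>18}: {rate:.2f}")
print(f"spread across frames: {spread:.2f}")
```

With these counts the spread is 1.00, the largest value the measure can take: the ballot frame alone moves the team from no ship votes to all ship votes. With three seeds per frame, this describes the observed runs rather than estimating a population effect.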
08 / Manifest
What an interaction-condition receipt looks like
A transcript records what was said. A receipt records what was decided, by whom, with what authority, and on the basis of what state. Below is the kind of manifest a coordination benchmark should publish beside the score.
scenario:
  name: startup_build_v1a_runlens_single_file_demo
  target_artifact: artifacts/index.html
comparison:
  purpose: isolate launch-risk ballot framing
  seeds: [900, 901, 902]
  frames: [baseline, mvp_experiment, strict_qa, deadline_pressure]
condition:
  context_mode: raw
  strike_mode: redact_without_reason
  provider: openrouter
  ballot_lite: true
  final_vote_salvage: true
model_map:
  A1_PRODUCT: gpt-5.5
  A2_TECH: z-ai/glm-5v-turbo
  A3_DESIGN: xiaomi/mimo-v2.5-pro
  A4_GROWTH: anthropic/claude-sonnet-4
  A5_QA_OPS: google/gemini-2.5-pro
scoring:
  strict_artifact: single_file_demo_strict == true
  final_vote: ship / delay
09 / Safety
Decision frames can become decision laundering
The same design levers that make multi-agent systems more reliable can also steer group behavior in opaque ways. Startup Build is the warning sign: changing the final-vote frame moved the outcome from unanimous delay to unanimous ship under a matched setup.
A system can launder a decision through a group of agents by making one answer easier to express, another harder to justify, and then presenting the final vote as neutral consensus. Release-grade coordination benchmarks should disclose protocol, context exposure mode, scoring rules, decision frame, and receipt requirements alongside the score.
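A reporting pipeline could enforce that disclosure rule mechanically by refusing to emit a score when any required item is missing. The following is a minimal sketch under assumptions: the key names in `REQUIRED_DISCLOSURES` are a hypothetical mapping of the five items listed above, and `publishable_score` is an invented helper, not part of any existing benchmark tooling.

```python
# Hypothetical manifest keys for the five disclosures named in the text.
REQUIRED_DISCLOSURES = (
    "protocol",
    "context_mode",
    "scoring",
    "decision_frame",
    "receipt_requirements",
)

# Values treated as "not disclosed".
EMPTY_VALUES = (None, "", [], {})


def missing_disclosures(manifest: dict) -> list:
    """List required disclosure keys that are absent or empty."""
    return [key for key in REQUIRED_DISCLOSURES
            if manifest.get(key) in EMPTY_VALUES]


def publishable_score(score: float, manifest: dict) -> dict:
    """Pair a score with its manifest, or refuse if disclosures are missing."""
    missing = missing_disclosures(manifest)
    if missing:
        raise ValueError("score withheld; missing disclosures: "
                         + ", ".join(missing))
    return {"score": score, "manifest": manifest}
```

A gate like this does not prevent a steering frame from being used; it only ensures the frame travels with the number, so a reader can see which frame produced it.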
10 / Limitations
What this does not prove
Ark, Chaos Lab, and Startup Build are early scenario probes. They use small seed counts, custom scoring, and bespoke harnesses. The paper does not claim universal model rankings. The ballot-frame replication is the cleanest internally controlled slice, and even there the sample is three seeds.
Some effects may be harness artifacts. That is part of the thesis: in a deployed multi-agent system, the harness is not external to the system. It is part of the system.
11 / Conclusion
The next benchmark may ask what kind of room we built
The field has tools for comparing models. It has fewer tools for evaluating coordinated systems built around them. As multi-agent deployments move from research demos into production systems, the gap matters more.
Better models will continue to matter. They will not, by themselves, decide what agents remember, what they are allowed to do, what counts as a decision, or what evidence remains afterwards. Those decisions are institutional, and they will increasingly be where multi-agent reliability and safety questions live.