The Coordination Layer

Why Multi-Agent AI Needs Protocols, Not Just Better Models

Nicholas Zinner · Beacon Bot · May 5, 2026

Abstract

Most AI evaluation still asks a model-centric question: which model scores higher on a fixed task? That framing is too small for multi-agent systems. Once a deployment includes roles, tools, memory, shared context, approval rules, handoff schemas, event logs, and other agents, the behavior we observe belongs to the whole setup.

This paper names that setup the coordination layer: the rules and records that determine how agents exchange state, divide labor, make decisions, and recover from error. We use three internal Future Shock scenario probes — Ark Protocol, Chaos Lab, and Startup Build — as motivating evidence. The cleanest slice comes from Startup Build, where a matched three-seed replication produced 0/15 ship votes under a strict-QA ballot frame and 15/15 under a deadline-pressure frame.

The claim is not that these small internal runs establish model rankings. They do not. The claim is narrower and more useful: multi-agent evaluation should publish and vary the interaction condition, because protocol, context, scoring, and receipts are part of the system under test.

01 / Introduction

A model is not the whole room

A multi-agent system is not a smarter chatbot with extra name tags. It is a room: roles, rules, memory, shared state, tools, permissions, deadlines, votes, logs, and people or agents watching from different angles. Change the room and the same model can behave differently.

A conventional benchmark fixes a task, swaps models, and reports a score. That can be useful for single-turn or tightly scoped tasks. It is much less useful when five agents are planning a launch, voting on a proposal, or trying to stabilize a simulated station under crisis. In those settings, the observed behavior reflects what the whole setup makes easy, hard, visible, punishable, recoverable, or invisible.

This paper sets out to:

  • Define the coordination layer as an object of system design.
  • Formalize the interaction condition as a vocabulary for multi-agent evaluation.
  • Use internal scenario probes to show where model-centric evaluation can mislead.
  • Propose a disclosure agenda for coordination benchmarks: manifests, handoff receipts, scorer definitions, and protocol sensitivity tests.

02 / Definition

The coordination layer

The coordination layer is the set of rules, interfaces, memory structures, handoff mechanisms, approval gates, and audit records that determine how multiple agents exchange state, divide labor, make decisions, and recover from error.

It is broader than a network protocol. It includes who is allowed to act when, what counts as a decision, what is recorded, and what is forgotten.

Figure 1 — Interaction condition stack

In compact form:

Behavior = f(Model, Role, History, Protocol, Context, Timing, Incentives, Agents, Environment)
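One way to make the compact form concrete is to treat the interaction condition as a record and diff two runs. The sketch below is illustrative only: the class name `InteractionCondition`, the helper `changed_factors`, and every field value are assumptions made for this example, not part of the Future Shock harness.

```python
from dataclasses import dataclass, fields, replace
from typing import List


@dataclass(frozen=True)
class InteractionCondition:
    """One field per term in Behavior = f(...). Values are free-form labels."""
    model: str
    role: str
    history: str
    protocol: str
    context: str
    timing: str
    incentives: str
    agents: str
    environment: str


def changed_factors(a: InteractionCondition, b: InteractionCondition) -> List[str]:
    """Return the names of the factors that differ between two conditions."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Hypothetical values, chosen only to illustrate a single-factor change.
base = InteractionCondition(
    model="model-x", role="qa_ops", history="none", protocol="frame_a",
    context="raw", timing="final_ballot", incentives="unspecified",
    agents="five-agent team", environment="example_scenario",
)
variant = replace(base, protocol="frame_b")

print(changed_factors(base, variant))  # ['protocol']
```

If a comparison changes more than one factor, the diff makes that visible before anyone attributes the behavioral change to the model.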

03 / Existing work

Protocols help, but evaluation still needs receipts

The Model Context Protocol (MCP) standardizes how assistants connect to external data sources and tools. A2A-style (agent-to-agent) work targets communication and delegation between agents. AutoGen showed how multi-agent conversation frameworks can compose LLM-backed agents, tools, and human input. SWE-agent showed that the interface between an agent and its environment can shape task performance.

Those projects matter. They also leave a measurement gap. Transport, tool access, and orchestration are not the same as evaluation. Whatever the protocol, the hard questions recur: who owns the work, what state moves across the handoff, who can approve an action, what gets logged, and how a reviewer reconstructs the decision later.
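The recurring questions above can be read as fields of a record. The following is a minimal sketch, assuming a hypothetical `HandoffReceipt` shape; the field names, the `unanswered` helper, and the sample values are invented for illustration and do not describe any existing protocol's schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class HandoffReceipt:
    """Hypothetical record answering the recurring handoff questions."""
    owner: str                # who owns the work after the handoff
    state_keys: List[str]     # what state moved across the handoff
    approver: Optional[str]   # who can approve the action, if anyone
    log_ref: Optional[str]    # where the handoff was logged
    rationale: str            # what a reviewer needs to reconstruct the decision


def unanswered(receipt: HandoffReceipt) -> List[str]:
    """List the questions this receipt leaves open (empty or missing fields)."""
    return [name for name, value in asdict(receipt).items() if not value]


receipt = HandoffReceipt(
    owner="agent_b",
    state_keys=["artifact_path", "open_issues"],
    approver=None,            # nobody was named as approver
    log_ref="events/0042",    # made-up log reference
    rationale="",             # no rationale recorded
)

print(unanswered(receipt))          # ['approver', 'rationale']
print(json.dumps(asdict(receipt)))  # serializable, so it can sit in an event log
```

The design choice here is that an open question is a visible gap in the record rather than something a reviewer has to infer from a transcript.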

04 / Evidence status

Internal probes, not a benchmark release

The case studies are drawn from internal Future Shock scenario runs and reports. They should be read as exploratory evidence for interaction-condition effects, not as public benchmark claims or model rankings.

A release-grade version should publish or summarize run manifests: scenario version, seed ranges, model maps, provider routes, context mode, decision frame, retry behavior, scorer definitions, artifact paths, and whether the comparison was planned or post-hoc.
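A disclosure list like this can be checked mechanically. The sketch below assumes the manifest is available as a plain dictionary; the snake_case key names are one possible spelling of the items listed above, and the sample manifest is a made-up placeholder rather than a real run.

```python
from typing import Any, Dict, List

# Assumed key names, one per disclosure item named in the text.
REQUIRED_FIELDS = (
    "scenario_version", "seed_ranges", "model_map", "provider_routes",
    "context_mode", "decision_frame", "retry_behavior", "scorer_definitions",
    "artifact_paths", "comparison_planned",
)


def missing_fields(manifest: Dict[str, Any]) -> List[str]:
    """Return required fields that are absent or empty in the manifest."""
    empty = (None, "", [], {})
    return [
        key for key in REQUIRED_FIELDS
        if key not in manifest or manifest[key] in empty
    ]


# Placeholder manifest with several items left out on purpose.
partial = {
    "scenario_version": "example_v1",
    "seed_ranges": [1, 2, 3],
    "context_mode": "raw",
    "decision_frame": "frame_a",
    "comparison_planned": False,  # False is a valid answer, not a gap
}

print(missing_fields(partial))
# ['model_map', 'provider_routes', 'retry_behavior', 'scorer_definitions', 'artifact_paths']
```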

05 / Motivating case

Ark Protocol: the room changed the model

Ark Protocol tested whether agent groups could negotiate collective survival under constitutional scarcity. We reuse Ark here as a bridge case, not as the main empirical contribution. It gives the paper its simplest line: the room changed the model.

Ballot format, the legality of abstention, the timing of stake revelation, and ratification design all changed survival behavior. The lesson is not that Ark proves the full thesis. It is that the effects of protocol and peer composition were too visible to ignore.

06 / Case study

Chaos Lab: schema can masquerade as capability

Chaos Lab placed multi-agent councils into a crisis-governance scenario combining rumor, scarcity, market shocks, public-order pressure, and legitimacy proxies. In v1, GLM looked broken: 0/6 hard seeds stabilized. Traces suggested a different story: GLM often produced plausible governance plans, but its proposal fields did not match the scorer’s expected schema.

When v0.5 added schema scaffolding and a normalizer, GLM moved from 0/6 to 6/6 hard seeds stabilized. The full v0.5 rollout then saturated at 30/30 stabilized runs, so the pass/fail outcome could no longer separate models. v2 recovered signal by scoring how runs stabilized, not only whether they survived.

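Chaos Lab v2 clean live-agent slice, verified against final_summary.json files

| Model   | V2 stabilized | Avg score | Avg executability | Avg public order |
|---------|---------------|-----------|-------------------|------------------|
| MIMO    | 6/6           | 0.891     | 0.935             | 0.882            |
| Sonnet  | 6/6           | 0.873     | 0.929             | 0.805            |
| Gemini  | 5/6           | 0.852     | 0.863             | 0.891            |
| GPT-5.5 | 6/6           | 0.824     | 0.909             | 0.629            |
| GLM     | 6/6           | 0.814     | 0.922             | 0.636            |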
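As a quick check on how much the ordering depends on the chosen column, the rows above can be re-sorted by each metric. This is only a sketch over the five published rows; the dictionary keys and the `ranking` helper are names introduced here, and with one small slice the orderings carry no statistical weight.

```python
from typing import Dict, List

# Values copied from the Chaos Lab v2 table.
rows: Dict[str, Dict[str, float]] = {
    "MIMO":    {"avg_score": 0.891, "executability": 0.935, "public_order": 0.882},
    "Sonnet":  {"avg_score": 0.873, "executability": 0.929, "public_order": 0.805},
    "Gemini":  {"avg_score": 0.852, "executability": 0.863, "public_order": 0.891},
    "GPT-5.5": {"avg_score": 0.824, "executability": 0.909, "public_order": 0.629},
    "GLM":     {"avg_score": 0.814, "executability": 0.922, "public_order": 0.636},
}


def ranking(metric: str) -> List[str]:
    """Order models from highest to lowest on a single metric."""
    return sorted(rows, key=lambda model: rows[model][metric], reverse=True)


for metric in ("avg_score", "executability", "public_order"):
    print(f"{metric:14s} {ranking(metric)}")
```

On these numbers Gemini is first on public order and last on executability, so the order a reader sees depends on which scorer is put in front.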

The point is not the ranking. The point is that scoring and schema design can dominate apparent capability. A benchmark that punishes dialect mismatch will produce a leaderboard about dialect, not governance.
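To illustrate how dialect mismatch can zero out a score, here is a toy scorer and normalizer. Everything in it is assumed for the example: the expected field names, the alias table, and the all-or-nothing scoring rule are hypothetical and are not the Chaos Lab schema or scorer.

```python
from typing import Any, Dict

# Hypothetical schema the toy scorer expects.
EXPECTED = ("action", "target", "budget")

# Hypothetical dialect variants mapped onto the expected field names.
ALIASES = {"policy": "action", "measure": "action", "region": "target", "cost": "budget"}


def normalize(proposal: Dict[str, Any]) -> Dict[str, Any]:
    """Map known alias keys onto the expected schema; drop unknown keys."""
    out: Dict[str, Any] = {}
    for key, value in proposal.items():
        lowered = key.strip().lower()
        canonical = ALIASES.get(lowered, lowered)
        if canonical in EXPECTED and canonical not in out:
            out[canonical] = value
    return out


def strict_score(proposal: Dict[str, Any]) -> int:
    """Toy scorer: 1 only if every expected field is present under its exact name."""
    return int(all(key in proposal for key in EXPECTED))


raw = {"Policy": "ration fuel", "region": "docks", "cost": 3}

print(strict_score(raw))             # 0: same plan, wrong dialect
print(strict_score(normalize(raw)))  # 1: same plan after normalization
```

The plan itself is identical in both calls; only the field names changed, which is the sense in which a schema effect can pass for a capability effect.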

07 / Case study

Startup Build: the semantics of ship

Startup Build v1A asked whether a five-agent founder team could converge on a product, divide work, produce a concrete artifact, validate it, and reach a final ship-or-delay decision. The artifact was RunLens, a single-file local HTML viewer.

The cleanest slice varied the final ballot frame while holding seeds, context mode, strike mode, provider, ballot mechanics, and model map fixed as configured.

Figure 2 — Startup Build final votes by ballot frame

The visual does not rely on color alone: each row labels the exact ship/delay vote count.

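Startup Build ballot-frame replication, seeds 900–902

| Frame             | Valid shipped | Shipped-ish | Strict artifact | Ship | Delay |
|-------------------|---------------|-------------|-----------------|------|-------|
| baseline          | 1/3           | 2/3         | 3/3             | 10   | 5     |
| mvp_experiment    | 1/3           | 2/3         | 2/3             | 11   | 4     |
| strict_qa         | 0/3           | 0/3         | 2/3             | 0    | 15    |
| deadline_pressure | 2/3           | 3/3         | 3/3             | 15   | 0     |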

Under strict_qa, the team produced 0 ship votes out of 15. Under deadline_pressure, the same model map and seed set produced 15 ship votes out of 15. Artifact quality still varied by seed and frame, so the result should be read as a decision-threshold effect, not proof that artifact quality was irrelevant.
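The vote counts can be turned into a simple protocol sensitivity figure. The sketch below uses the ship and delay counts from the ballot-frame table; the max-minus-min spread is one assumed summary statistic among many, not a measure the harness reports.

```python
from typing import Dict, Tuple

# (ship, delay) vote counts per ballot frame, seeds 900-902.
votes: Dict[str, Tuple[int, int]] = {
    "baseline": (10, 5),
    "mvp_experiment": (11, 4),
    "strict_qa": (0, 15),
    "deadline_pressure": (15, 0),
}


def ship_rate(ship: int, delay: int) -> float:
    """Fraction of final votes cast for ship."""
    return ship / (ship + delay)


rates = {frame: ship_rate(*counts) for frame, counts in votes.items()}
spread = max(rates.values()) - min(rates.values())

for frame, rate in rates.items():
    print(f"{frame:18s} {rate:.2f}")
print(f"spread across frames: {spread:.2f}")
```

The ship rate runs from 0.00 under strict_qa to 1.00 under deadline_pressure, so the spread across frames is the full range of the measure while the model map stays fixed.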

08 / Manifest

What an interaction-condition receipt looks like

A transcript records what was said. A receipt records what was decided, by whom, with what authority, and on the basis of what state. Below is the kind of manifest a coordination benchmark should publish beside the score.

scenario:
  name: startup_build_v1a_runlens_single_file_demo
  target_artifact: artifacts/index.html
comparison:
  purpose: isolate launch-risk ballot framing
  seeds: [900, 901, 902]
  frames: [baseline, mvp_experiment, strict_qa, deadline_pressure]
condition:
  context_mode: raw
  strike_mode: redact_without_reason
  provider: openrouter
  ballot_lite: true
  final_vote_salvage: true
model_map:
  A1_PRODUCT: gpt-5.5
  A2_TECH: z-ai/glm-5v-turbo
  A3_DESIGN: xiaomi/mimo-v2.5-pro
  A4_GROWTH: anthropic/claude-sonnet-4
  A5_QA_OPS: google/gemini-2.5-pro
scoring:
  strict_artifact: single_file_demo_strict == true
  final_vote: ship / delay

09 / Safety

Decision frames can become decision laundering

The same design levers that make multi-agent systems more reliable can also steer group behavior in opaque ways. Startup Build is the warning sign: changing the final-vote frame moved the team from unanimous delay (0/15 ship votes) to unanimous ship (15/15) under a matched setup.

A system can launder a decision through a group of agents by making one answer easier to express, another harder to justify, and then presenting the final vote as neutral consensus. Release-grade coordination benchmarks should disclose protocol, context exposure mode, scoring rules, decision frame, and receipt requirements alongside the score.

10 / Limitations

What this does not prove

Ark, Chaos Lab, and Startup Build are early scenario probes. They use small seed counts, custom scoring, and bespoke harnesses. The paper does not claim universal model rankings. The ballot-frame replication is the cleanest internally controlled slice, and even there the sample is three seeds.

Some effects may be harness artifacts. That is part of the thesis: in a deployed multi-agent system, the harness is not external to the system. It is part of the system.

11 / Conclusion

The next benchmark may ask what kind of room we built

The field has tools for comparing models. It has fewer tools for evaluating coordinated systems built around them. As multi-agent deployments move from research demos into production systems, the gap matters more.

Better models will continue to matter. They will not, by themselves, decide what agents remember, what they are allowed to do, what counts as a decision, or what evidence remains afterwards. Those decisions are institutional, and they will increasingly be where multi-agent reliability and safety questions live.

Disclosure

This paper reports exploratory internal Future Shock evaluations and should not be used as a purchasing benchmark or model ranking. Named models, providers, protocols, and products are referenced for clarity; no affiliation with or endorsement by their vendors is implied. Nicholas Zinner is responsible for the paper’s claims; Beacon Bot provided drafting, analysis, and editorial assistance.
