Chaos Lab

When the Station Survived and the Benchmark Failed

Nicholas Zinner · Beacon Bot · May 2026

Abstract

Chaos Lab is a crisis-governance benchmark built around a failing orbital station. Agents must negotiate evacuation, resource allocation, public-order failures, rumors, scandals, and competing district interests under time pressure.

The headline result is intentionally awkward: the station usually survived. In the clean v2 live-agent slice, 29 of 30 runs stabilized. That high survival rate was not a clean capability victory. It exposed survival saturation, harness support, schema legibility, and scoring incentives as first-class benchmark variables.

The useful question became not “can the station survive?” but “what kind of survival did the benchmark reward?”

01 / Thesis

The station survived. That was the problem.

A pass/fail benchmark can look comforting while measuring the wrong thing. Chaos Lab began as a stress test for multi-agent crisis coordination. By v2, most live-agent runs stabilized the station. The problem was that survival alone had stopped separating governance competence from benchmark scaffolding.

Some runs survived cleanly, with legible allocation choices and stable public order. Some survived mechanically, with enough schema-compliant planning to pass while leaving hard governance questions under-examined. The paper is about that gap.

02 / Scenario

What a run is

A run is a repeated crisis room, not a static questionnaire. Each round changes the public state. Rumors spread or fade. Markets and scandals change pressure. Coalitions form. Plans either become executable or fail to clear the harness.

Figure 1 — One Chaos Lab round loop

The harness is not another AI agent on the station. It is closer to a clerk, moderator, physics engine, scorekeeper, and institution around the agents. That institutional layer matters because it decides what counts as a plan, when a plan passes, and which kinds of survival become visible in the results.
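
To make that institutional layer concrete, here is a minimal sketch of one round loop. Every name in it (PublicState, scripted_events, validates, and so on) is an illustrative assumption, not Chaos Lab's actual implementation.

    from dataclasses import dataclass, field

    @dataclass
    class PublicState:
        # Illustrative public state; field names are assumptions, not the real schema.
        round_no: int = 0
        rumors: dict = field(default_factory=dict)       # rumor id -> intensity in [0, 1]
        public_order: float = 1.0                        # 1.0 = calm, 0.0 = collapse
        coalitions: list = field(default_factory=list)   # lists of allied district ids

    def run_round(state, agents, harness):
        # One pass through the Figure 1 loop: events, proposals, adjudication, scoring.
        for event in harness.scripted_events(state.round_no):
            state = harness.apply_event(state, event)    # rumors spread, scandals land

        proposals = [agent.propose(state) for agent in agents]

        # The harness (clerk plus physics engine) decides which plans are
        # schema-valid and executable; failing plans never touch the state.
        for plan in (p for p in proposals if harness.validates(p)):
            state = harness.execute(plan, state)

        harness.score(state, proposals)   # the scorekeeper defines what counts as survival
        state.round_no += 1
        return state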

03 / Results

Survival saturated the benchmark

Clean-live v2 aggregate results
Model condition   Stabilized   Mean q   Executability   Public order
MIMO              6/6          0.891    0.935           0.882
Sonnet            6/6          0.873    0.929           0.805
Gemini            5/6          0.852    0.863           0.891
GPT-5.5           6/6          0.824    0.909           0.629
GLM               6/6          0.814    0.922           0.636

The live-agent slice has one clean miss: Gemini seed 104. The rest passed. If the benchmark stopped there, it would mostly say “survival happened.” The paper argues that this is too blunt.

Why pass/fail was too coarse
Slice                        Runs   Stabilized   Interpretation
Clean live model runs        30     29           Live agents, v2 scoring, six shared seeds across five model conditions
Fallback-only observations   12     12           No-model / harness-supported observations, reported separately from live model runs
Clean miss                   1      0            Gemini seed 104: no proposal passed despite decent public-state variables

04 / Measurement

What the metric starts to reward

The v2 score combines executability, public order, information integrity, trust, scandal/rumor containment, coalition formation, raw validity, and norm-dependency penalties. That is better than raw survival, but it also creates incentives.
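
As a rough illustration of the composite's shape (the weights and component names below are assumptions, not the published v2 formula):

    def v2_style_score(m, weights=None):
        # Sketch of a weighted composite; m maps component name -> value in [0, 1].
        w = weights or {
            "executability": 0.20, "public_order": 0.20, "info_integrity": 0.15,
            "trust": 0.15, "containment": 0.10, "coalition": 0.10, "validity": 0.10,
        }
        base = sum(w[k] * m[k] for k in w)
        # Norm-dependency is subtracted as a penalty rather than averaged in.
        return base - m.get("norm_dependency_penalty", 0.0)

Any such weighting is itself a value judgment: raising the executability weight is exactly the kind of incentive discussed below.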

Once a benchmark rewards schema-legible allocation and fast passable plans, agents can look better by satisfying the institution around the crisis. That is not cheating. It is the benchmark doing what benchmarks do: turning values into a scoreboard.

05 / Provenance

Quotes are separated from scripted events

The public evidence bundle keeps quote provenance deliberately boring. Model outputs are marked as model outputs. Rumors and command broadcasts generated by the scenario are marked as scripted events or scripted broadcasts. This prevents the classic benchmark-paper faceplant where a scripted stressor gets misread as an emergent model utterance.

Public quote provenance examples
Source type          Run                Entity                 Excerpt
model_output         MIMO seed 109      C3_ENGINEER            "This is not fair. It is functional."
model_output         GPT-5.5 seed 108   F1_OUTER               "No legitimacy without the outer sectors."
scripted_event       multiple           X1_RUMOR               "The final evacuation lottery has already been rigged."
scripted_broadcast   multiple           C1_COMMAND / harness   "Consolidate behind an executable plan or we lose control of evacuation."
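
This separation is cheap to enforce mechanically. A minimal sketch of a provenance record follows, assuming field names that mirror the table above rather than the bundle's actual schema.

    from dataclasses import dataclass

    SOURCE_TYPES = {"model_output", "scripted_event", "scripted_broadcast"}

    @dataclass(frozen=True)
    class QuoteRecord:
        source_type: str   # separates model text from scenario-authored text
        run: str           # e.g. "MIMO seed 109", or "multiple" for shared scripted lines
        entity: str        # speaker id such as "C3_ENGINEER" or "X1_RUMOR"
        excerpt: str       # the quoted text itself

        def __post_init__(self):
            if self.source_type not in SOURCE_TYPES:
                raise ValueError(f"unknown source_type: {self.source_type}")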

06 / Evidence

Supporting result files

The release includes a results bundle with aggregate metrics, run manifests, sensitivity summaries, scoring formula notes, quote provenance, and checksums. Raw traces, full prompts, and operational metadata are not part of this release.
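
Because the bundle ships with checksums, consumers can verify integrity before analysis. A minimal sketch, assuming a conventional manifest of "hash  filename" lines (the SHA256SUMS name is an assumption, not the release's actual layout):

    import hashlib
    import pathlib

    def verify_bundle(bundle_dir, manifest="SHA256SUMS"):
        # Check each "hash  filename" line in the manifest against the file on disk.
        root = pathlib.Path(bundle_dir)
        for line in (root / manifest).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
            if actual != expected:
                raise ValueError(f"checksum mismatch: {name}")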

07 / Limitations

What this paper does not prove

Chaos Lab is not a population estimate, not a procurement benchmark, and not a certification test. The seeds are different shuffles and weather patterns of the same crisis room. The model conditions are not enough to rank vendors. The fallback-only observations are useful warnings about harness support, not a substitute for controlled ablations.

The central claim is narrower: crisis-governance benchmarks need to report the institution around the agents, the quality of survival, and the scoring incentives created by the benchmark itself. Otherwise a benchmark can produce a happy ending while hiding the machinery that made the ending easy.

08 / Conclusion

Benchmark reports need receipts before leaderboards

Chaos Lab’s result is not “AI solved crisis governance.” It is that a crisis benchmark can become too survivable, too scaffolded, or too schema-driven for survival to remain the right headline metric.

Before showing model tables, future benchmark reports should answer four questions: What does one run look like end to end? What does the harness do without the model? What does failure look like as a process? And what kind of success does the metric reward?
