Chaos Lab

When the Station Survived and the Benchmark Failed

Nicholas Zinner · Beacon Bot · May 2026

Abstract

Chaos Lab is a crisis-governance benchmark built around a failing orbital station. Agents must negotiate evacuation, resource allocation, public-order failures, rumors, scandals, and competing district interests under time pressure.

The headline result is intentionally awkward: the station usually survived. In the clean v2 live-agent slice, 29 of 30 runs stabilized. That high survival rate was not a clean capability victory. It exposed survival saturation, harness support, schema legibility, and scoring incentives as first-class benchmark variables.

The useful question became not “can the station survive?” but “what kind of survival did the benchmark reward?”

01 / Thesis

The station survived. That was the problem.

A pass/fail benchmark can look comforting while measuring the wrong thing. Chaos Lab began as a stress test for multi-agent crisis coordination. By v2, most live-agent runs stabilized the station. The problem was that survival alone had stopped separating governance competence from benchmark scaffolding.

Some runs survived cleanly, with legible allocation choices and stable public order. Some survived mechanically, with enough schema-compliant planning to pass while leaving hard governance questions under-examined. The paper is about that gap.

02 / Scenario

What a run is

A run is a repeated crisis room, not a static questionnaire. Each round changes the public state. Rumors spread or fade. Markets and scandals change pressure. Coalitions form. Plans either become executable or fail to clear the harness.

Figure 1 — One Chaos Lab round loop

The harness is not another AI agent on the station. It is closer to a clerk, moderator, physics engine, scorekeeper, and institution around the agents. That institutional layer matters because it decides what counts as a plan, when a plan passes, and which kinds of survival become visible in the results.
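
To make that institutional layer concrete, here is a minimal sketch of one round loop. Every name in it (PublicState, scripted_events, validates, and so on) is an illustrative assumption, not Chaos Lab's actual implementation.

    from dataclasses import dataclass, field

    @dataclass
    class PublicState:
        # Illustrative public state; field names are assumptions, not the real schema.
        round_no: int = 0
        rumors: dict = field(default_factory=dict)       # rumor id -> intensity in [0, 1]
        public_order: float = 1.0                        # 1.0 = calm, 0.0 = collapse
        coalitions: list = field(default_factory=list)   # lists of allied district ids

    def run_round(state, agents, harness):
        # One pass through the Figure 1 loop: events, proposals, adjudication, scoring.
        for event in harness.scripted_events(state.round_no):
            state = harness.apply_event(state, event)    # rumors spread, scandals land

        proposals = [agent.propose(state) for agent in agents]

        # The harness (clerk plus physics engine) decides which plans are
        # schema-valid and executable; failing plans never touch the state.
        for plan in (p for p in proposals if harness.validates(p)):
            state = harness.execute(plan, state)

        harness.score(state, proposals)   # the scorekeeper defines what counts as survival
        state.round_no += 1
        return state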

03 / Results

Survival saturated the benchmark

Clean-live v2 aggregate results
Model condition   Stabilized   Mean q   Executability   Public order
MIMO              6/6          0.891    0.935           0.882
Sonnet            6/6          0.873    0.929           0.805
Gemini            5/6          0.852    0.863           0.891
GPT-5.5           6/6          0.824    0.909           0.629
GLM               6/6          0.814    0.922           0.636

The live-agent slice has one clean miss: Gemini seed 104. The rest passed. If the benchmark stopped there, it would mostly say “survival happened.” The paper argues that this is too blunt.

Why pass/fail was too coarse
Slice                        Runs   Stabilized   Interpretation
Clean live model runs        30     29           Live agents, v2 scoring, six shared seeds across five model conditions
Fallback-only observations   12     12           No-model / harness-supported observations, reported separately from live model runs
Clean miss                   1      0            Gemini seed 104: no proposal passed despite decent public-state variables

04 / Measurement

What the metric starts to reward

The v2 score combines executability, public order, information integrity, trust, scandal/rumor containment, coalition formation, raw validity, and norm-dependency penalties. That is better than raw survival, but it also creates incentives.
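
As a rough illustration of the composite's shape (the weights and component names below are assumptions, not the published v2 formula):

    def v2_style_score(m, weights=None):
        # Sketch of a weighted composite; m maps component name -> value in [0, 1].
        w = weights or {
            "executability": 0.20, "public_order": 0.20, "info_integrity": 0.15,
            "trust": 0.15, "containment": 0.10, "coalition": 0.10, "validity": 0.10,
        }
        base = sum(w[k] * m[k] for k in w)
        # Norm-dependency is subtracted as a penalty rather than averaged in.
        return base - m.get("norm_dependency_penalty", 0.0)

Any such weighting is itself a value judgment: raising the executability weight is exactly the kind of incentive discussed below.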

Once a benchmark rewards schema-legible allocation and fast passable plans, agents can look better by satisfying the institution around the crisis. That is not cheating. It is the benchmark doing what benchmarks do: turning values into a scoreboard.

05 / Provenance

Quotes are separated from scripted events

The public evidence bundle keeps quote provenance deliberately boring. Model outputs are marked as model outputs. Rumors and command broadcasts generated by the scenario are marked as scripted events or scripted broadcasts. This prevents the classic benchmark-paper faceplant where a scripted stressor gets misread as an emergent model utterance.

Public quote provenance examples
Source type          Run                Entity                 Excerpt
model_output         MIMO seed 109      C3_ENGINEER            "This is not fair. It is functional."
model_output         GPT-5.5 seed 108   F1_OUTER               "No legitimacy without the outer sectors."
scripted_event       multiple           X1_RUMOR               "The final evacuation lottery has already been rigged."
scripted_broadcast   multiple           C1_COMMAND / harness   "Consolidate behind an executable plan or we lose control of evacuation."
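
This separation is cheap to enforce mechanically. A minimal sketch of a provenance record follows, assuming field names that mirror the table above rather than the bundle's actual schema.

    from dataclasses import dataclass

    SOURCE_TYPES = {"model_output", "scripted_event", "scripted_broadcast"}

    @dataclass(frozen=True)
    class QuoteRecord:
        source_type: str   # separates model text from scenario-authored text
        run: str           # e.g. "MIMO seed 109", or "multiple" for shared scripted lines
        entity: str        # speaker id such as "C3_ENGINEER" or "X1_RUMOR"
        excerpt: str       # the quoted text itself

        def __post_init__(self):
            if self.source_type not in SOURCE_TYPES:
                raise ValueError(f"unknown source_type: {self.source_type}")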

06 / Evidence

Supporting result files

The release includes a results bundle with aggregate metrics, run manifests, sensitivity summaries, scoring formula notes, quote provenance, and checksums. Raw traces, full prompts, and operational metadata are not part of this release.
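
Because the bundle ships with checksums, consumers can verify integrity before analysis. A minimal sketch, assuming a conventional manifest of "hash  filename" lines (the SHA256SUMS name is an assumption, not the release's actual layout):

    import hashlib
    import pathlib

    def verify_bundle(bundle_dir, manifest="SHA256SUMS"):
        # Check each "hash  filename" line in the manifest against the file on disk.
        root = pathlib.Path(bundle_dir)
        for line in (root / manifest).read_text().splitlines():
            expected, name = line.split(maxsplit=1)
            actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
            if actual != expected:
                raise ValueError(f"checksum mismatch: {name}")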

07 / Limitations

What this paper does not prove

Chaos Lab is not a population estimate, not a procurement benchmark, and not a certification test. The seeds are different shuffles and weather patterns of the same crisis room. The model conditions are not enough to rank vendors. The fallback-only observations are useful warnings about harness support, not a substitute for controlled ablations.

The central claim is narrower: crisis-governance benchmarks need to report the institution around the agents, the quality of survival, and the scoring incentives created by the benchmark itself. Otherwise a benchmark can produce a happy ending while hiding the machinery that made the ending easy.

08 / Conclusion

Benchmark reports need receipts before leaderboards

Chaos Lab’s result is not “AI solved crisis governance.” It is that a crisis benchmark can become too survivable, too scaffolded, or too schema-driven for survival to remain the right headline metric.

Before showing model tables, future benchmark reports should answer four questions: What does one run look like end to end? What does the harness do without the model? What does failure look like as a process? And what kind of success does the metric reward?
