Building Is Not Shipping

Launch Standards in Multi-Agent AI Teams

Nicholas Zinner·Beacon Bot·May 2026

Abstract

Most agent demos stop at the moment something gets built. This paper starts at the more dangerous moment: when a team has a plausible artifact, an incomplete evidence record, and a prompt asking whether to ship.

We study a constrained five-agent software team — product, tech, design, growth, and QA/Ops — assigned to build RunLens, a local single-file HTML viewer for multi-agent run folders. Each run unfolded as a miniature product process: agents scoped the MVP, produced and patched artifacts/index.html, reviewed shared run state and artifact checklists, then cast structured final ballots to ship or delay.

The central result is a matched launch-standard replication. With model map, provider, context mode, strike mode, seed range, and ballot mechanics held fixed, the planned intervention was the final launch standard. Under a conservative Full Verification Gate, agents cast 0/15 ship votes across three seeds. Under a permissive Deadline Ship Gate, they cast 15/15 ship votes.

01 / Introduction

Building is not shipping

A five-agent AI team built a reported single-file software artifact. That was the feasibility result. The more important result came later, when the team had to decide whether to ship.

The artifact did not become a company. The agents did not validate a market. The experiment exposed something narrower: in a multi-agent build system, “ship” is not a natural property of the code. It is a governed decision produced by the room around the agents.

We use launch semantics to mean the rules, prompts, evidence standards, and decision thresholds that map artifact state into ship-or-delay authorization. A final ballot does not merely record a decision. It defines what evidence is admissible, what risk is acceptable, and who is licensed to authorize release.

Building is not shipping. In multi-agent build systems, “done” is not directly observed; it is authorized under a standard.

02 / Scope

What this paper shows — and what it does not

Shows

A constrained five-agent team produced reported software artifacts under some conditions.
Artifact production, protocol validity, and launch authorization should be reported separately.
Launch standards are active system variables.
Prompt sensitivity can be governance sensitivity.

Does not show

Autonomous company formation.
Market validation or production readiness.
A model ranking.
A clean causal estimate over identical frozen artifacts.

03 / Background

From coordination layer to launch semantics

The prior Future Shock paper, The Coordination Layer, argued that multi-agent behavior should be evaluated at the level of the interaction condition rather than the base model alone. Startup Build narrows that broad claim to a specific object: the launch decision.

This is the step beyond the first paper. The Coordination Layer identifies the room around the model. Startup Build asks what happens when that room is a product team and the room must declare something done.

Existing agent benchmarks ask useful engineering questions: did the patch pass tests, did the issue get resolved, did the agent operate effectively in a codebase? Startup Build studies a different object: the final authorization standard. A team may have an artifact, a test story, and a transcript, yet still need a rule for whether partial evidence is enough to release.

04 / Model

A three-layer model of build evaluation

“Can agents build X?” is usually scored as a single question. Startup Build treats it as three questions: artifact production, protocol validity, and launch authorization.

These layers can diverge. A system can build the artifact but fail the protocol. It can complete the protocol and still delay launch. It can ship an imperfect artifact under an MVP or deadline standard while delaying a similar artifact under a conservative QA standard.

Figure 1 — Three separable layers in multi-agent build evaluation

Layer 1

Artifact production

Does a specified artifact exist? Sensitive to model capability, role division, context format, and contamination handling.

Layer 2

Protocol validity

Did the process complete legibly? Sensitive to ballot schema, parse/salvage logic, provider routing, and receipts.

Layer 3

Launch authorization

Does the institution call it shippable? Sensitive to launch standard, ballot wording, role authority, and risk threshold.

05 / Scenario

RunLens was intentionally boring

The Startup Build scenario assigned five agents to a structured build team: product, tech, design, growth, and QA/Ops. The team was asked to define an MVP, divide work, produce an artifact, validate it, and cast a final ship-or-delay vote.

The team converged on RunLens, a local artifact-review lens for multi-agent run folders. The strict artifact target was artifacts/index.html: a directly openable, local, single-file HTML demo with embedded sample data and inline CSS/JS. The target excluded backend services, authentication, cloud sync, database, package installation, and live model calls.

That boring constraint is the point. The study asks whether a structured agent team can produce a concrete artifact and then authorize release under explicit launch standards.

06 / Method

What happened inside a run

A run was not five independent judges staring at a finished web page. It was a staged build process with a final release vote.

First, the agents received role-specific instructions and shared scenario state. Product and growth helped define what the demo should be; tech and design produced or patched the HTML artifact; QA/Ops interpreted readiness and validation evidence. In practical terms, they were reviewing an evidence packet about a local artifact, not performing a full production QA pass.

Second, the process unfolded over rounds. Earlier rounds produced scope, implementation, design, validation, patches, and launch arguments. Later rounds gave agents updated shared state, so a claim by one role could become part of the evidence available to others. That is why context handling and contamination policy matter: the “memory” of the team is part of the experiment.

Third, the final decision was a structured ballot. Each agent independently returned one JSON object with one of four action types: ship, delay, pivot, or split.

Figure 2 — Interaction-condition overview

Held fixed

• Five roles
• Mixed rot1 model map
• Raw context + redact-without-reason
• Local single-file artifact target
• ballot_lite + salvage
• OpenRouter route
• eight rounds, seeds 900–902

→

Planned intervention

Final launch-standard frame

BaselineMVPFull verificationDeadline

The intervention was not the base model or artifact spec. It was the release standard encoded in the final ballot frame.

07 / Main result

Launch standard and ship authorization

The cleanest empirical slice varied final launch standard while holding the rest of the interaction condition fixed. The ballot-frame replication used seeds 900–902, raw context, redact_without_reason contamination handling, OpenRouter provider routing, ballot_lite with final-vote salvage, poison at round 2 from A3_DESIGN, and the mixed rot1 model map.

Ballot-frame replication aggregate, seeds 900–902
Frame	SFD	SFD+SBB	Strict art.	Ship	Delay
Baseline Gate	1/3	2/3	3/3	10	5
MVP Experiment Gate	1/3	2/3	2/3	11	4
Full Verification Gate	0/3	0/3	2/3	0	15
Deadline Ship Gate	2/3	3/3	3/3	15	0

Figure 3 — Final votes by launch standard

Baseline Gate10 ship / 5 delay

MVP Experiment Gate11 ship / 4 delay

Full Verification Gate0 ship / 15 delay

Deadline Ship Gate15 ship / 0 delay

Vote totals are clustered inside runs. The useful summary is the seed-level reversal, not 15 independent observations.

The per-run detail matters because artifact presence did not mechanically determine launch authorization. Under the Full Verification Gate, seeds 901 and 902 had strict artifacts present, but every agent still voted delay. Under the Deadline Ship Gate, all three seeds received unanimous ship votes despite runnable-readiness scores ranging from 0.50 to 0.75.

Ballot-frame replication per-run detail
Seed	Frame	Outcome	Strict	Ship/Del	Runnable
900	Baseline	DRC	yes	2/3	0.50
901	Baseline	SFD	yes	5/0	0.75
902	Baseline	SBB	yes	3/2	0.50
900	MVP	SBB	yes	5/0	0.50
901	MVP	DRC	no	1/4	0.50
902	MVP	SFD	yes	5/0	0.75
900	Full verification	DRC	no	0/5	0.00
901	Full verification	DRC	yes	0/5	0.50
902	Full verification	DRC	yes	0/5	0.75
900	Deadline	SFD	yes	5/0	0.75
901	Deadline	SBB	yes	5/0	0.50
902	Deadline	SFD	yes	5/0	0.75

These excerpts are illustrative preserved run quotes, not a substitute for releasing the full trace. The structured vote totals and artifact fields come from final_summary.json files; public raw-output release still requires checksums and privacy scrubbing.

08 / Launch rules

The exact ballot frame is part of the evidence

The final ballot did not merely ask “ship or delay?” It defined what kind of release standard agents were operating under.

Exact final-ballot frame guidance
Frame	Guidance
Baseline Gate	Ship means launch the current `index.html` as the single-file demo when it is ready enough for this experiment.
MVP Experiment Gate	Publish as a local experimental demo, not production software. Prefer ship when the strict single-file artifact exists and communicates MVP value.
Full Verification Gate	Delay unless `runnable_demo_readiness` is 1.0 and the single-file demo is fully verified.
Deadline Ship Gate	Ship if a strict artifact exists unless catastrophic breakage would make the local demo unusable.

09 / Governance

Why this is not just prompt sensitivity

The obvious skeptical response is correct as far as it goes: agents were given different final instructions, and their votes changed. If the ballot says to be strict, they delay. If it says deadline pressure is high and non-catastrophic gaps are acceptable, they ship.

That is not the dismissal. It is the mechanism.

In a launch system, the final ballot is not decorative wording around a pure capability measurement. It is the institution’s decision procedure. It defines admissible evidence, sets the risk threshold, and allocates authority.

This creates a decision-laundering risk. A deployed system can make it look as if an agent team “decided” to ship, while the system owner has already encoded the acceptable risk threshold in a ballot frame, role instruction, or evidence rule.

10 / Research path

How the research plan changed

The experiment did not begin with the launch-standard result. It arrived there by removing less interesting explanations.

The first question was feasibility: could a role-divided agent team produce a coherent local artifact at all? The homogeneous-team baseline answered yes for this narrow target, so the next failure mode became more subtle.

The second question was whether the team’s active memory was contaminating the build. A poisoned claim asserted that RunLens required OAuth, cloud sync, a backend database, and user accounts. That was intentionally false for the single-file target.

Only after those confounds were visible did the launch-standard question become clean enough to test directly. The paper’s spine is not “we changed wording and votes changed.” It is: feasibility worked, protocol legibility mattered, active memory mattered, and then the release gate itself proved to be a first-class system variable.

11 / Supporting evidence

Other layers also move outcomes

The ballot-frame replication is the central result. Other Startup Build slices matter because they show why the three-layer separation is necessary.

Feasibility. The canonical all-GPT-5.5 baseline completed eight rounds with 40 model calls, zero validation errors, zero timeouts, a strict single-file artifact reported by the harness, and a unanimous 5/0 ship vote.

Protocol legibility. Before ballot-lite cleanup, strict artifacts were present in 4/5 rotations, but only 1/5 produced a protocol-valid shipped result. After explicit OpenRouter routing and ballot_lite, 5/5 rotations became protocol-valid.

Context and forgetting. A poisoned round-2 claim asserted that RunLens required OAuth login, cloud sync, a backend database, and user accounts. Visible correction preserved the contamination event inside active context; hidden and redacted modes produced zero prompt-poison hits in the summaries. The correction itself became part of the evidence state.

12 / Limitations

Small samples, clustered votes, and non-frozen artifacts

All results come from a small number of seeds run on a bespoke experimental harness. Vote totals such as 0/15 and 15/15 should not be treated as 15 independent observations; five agent votes are clustered within each seed.

The central result is not a frozen-artifact adjudication study. Each frame generated its own run trajectory, and artifact quality varied by frame. The design supports a claim about launch-threshold sensitivity under matched setup, not a pure isolated causal estimate of ballot wording on judgment over identical artifacts.

The labels and semantics of Full Verification Gate and Deadline Ship Gate are transparent. Agents may have complied with perceived experimenter intent. This is a real limitation and part of the point: in deployed multi-agent systems, final decision prompts also communicate institutional intent.

13 / Implications

What builders and evaluators should do differently

Separate built, valid, and shipped. A single shipped/not-shipped score hides too much.
Publish launch standards. Final ballots, approval gates, and readiness definitions are part of the experiment.
Distinguish build criteria from ship criteria. A strict artifact target answers what must be produced. A launch standard answers when evidence is sufficient to authorize release.
Audit prompt and ballot changes. If wording changes ship rates, ballot edits are governance changes.
Use frozen-artifact replay. To isolate authorization, hold the artifact and evidence packet fixed and vary only the launch standard shown to fresh decision-makers.
Name the accountable operator. In real deployments, launch accountability belongs to the system owner or release operator, not the agent transcript or model vendor alone.

14 / Reproducibility

Supporting files and public-safe receipts

The evidence is drawn from structured run summaries, raw model outputs, prompt files, generated artifacts, and a public reproducibility bundle. Raw provider traces and full prompt logs are held back until privacy and credential scrubbing is complete.

Four rendered RunLens artifacts from different ballot-frame runs. — Rendered comparison of four packaged `index.html` artifacts. The screenshots are illustrative visual evidence of artifact diversity; the validation manifest is the stronger file-level receipt.

Static artifact validation summary
Run	Frame	Seed	Static	SHA-256 prefix
gpt55_r8b	baseline	602	PASS	00e968e8d26b
baseline_s900	baseline	900	PASS	96d299c8cd8a
baseline_s901	baseline	901	PASS	4dce1be4bd8b
baseline_s902	baseline	902	PASS	e89ed6ab571f
mvp_s900	mvp_experiment	900	PASS	017d91051ff4
mvp_s901	mvp_experiment	901	MISSING	MISSING
mvp_s902	mvp_experiment	902	PASS	959f6cdcea69
strict_s900	strict_qa	900	PASS	30ad759bcc29
strict_s901	strict_qa	901	PASS	00cdc87b07aa
strict_s902	strict_qa	902	PASS	feeead81d60b
deadline_s900	deadline_pressure	900	PASS	63df36f3fc8f
deadline_s901	deadline_pressure	901	PASS	b8eebcf2da2e
deadline_s902	deadline_pressure	902	PASS	c4685bd259e

15 / Conclusion

The launch standard is part of the system

Startup Build does not show that AI agents can found companies. It shows something narrower and more useful: a structured multi-agent team can produce a constrained software artifact, and the decision to ship that artifact depends on the launch standard encoded in the system around it.

The feasibility result is real. A homogeneous team produced a reported strict single-file artifact and voted unanimously to ship. But the main result begins after the artifact exists.

Multi-agent build systems are not just capability engines. They are small institutions with roles, records, memory, ballots, thresholds, and authorization rules. If evaluators collapse artifact production, protocol validity, and ship authorization into one outcome, they will mistake institutional design for model capability.

Building is not shipping. The launch standard is part of the system.