Prerequisite Density as a Predictor of Scientific Breakthrough: Expanded Validation and the Limits of Temporal Compression
Follow-up paper — web version
Abstract
We report an expanded validation of the Precondition Density Model, growing the dataset from 1,699 to 3,179 events across 600 years of scientific and technological history. The core holdout prediction remains robust: the ensemble method achieves Cohen's d = 9.80 (p < 0.001) across six successive dataset versions, demonstrating that the signal is not an artifact of dataset composition.
We test the H3 temporal compression hypothesis and reject it. A permutation test over 10,000 shuffles yields p = 1.0; the observed compression is fully explained by increasing event density. We further characterize the model's predictive boundaries: it accurately predicts science-adjacent breakthroughs but fails on commercial products and policy decisions, consistent with the theoretical claim that prerequisite density governs the adjacent possible but not market timing.
Finally, we contribute a verification pipeline for AI-generated research datasets, documenting a 67.2% pass rate and a taxonomy of the specific hallucination types encountered.
Headline Finding
The H3 temporal compression hypothesis is rejected (p = 1.0).
The apparent shrinking of gaps between parallel discoveries is fully explained by increasing event density. Random shuffling produces stronger compression than the actual data.
1. Introduction
In the original Precondition Density Model paper, we demonstrated that the locations of scientific breakthroughs in semantic embedding space can be predicted from the accumulation of prerequisite knowledge. A blind holdout experiment across 25 documented cases of multiple discovery achieved a mean prediction rank of 3.9 out of 25 (Cohen's d = 9.80, p < 0.001).
That paper left several questions open. First, was the result specific to the original dataset of 1,699 events, or would it survive expansion? Second, if prerequisite density predicts where breakthroughs appear, does it also predict when? Third, what are the model's predictive boundaries?
This paper addresses all three questions. We expanded the dataset through six successive versions to 3,179 events. We tested the temporal compression hypothesis (H3) and rejected it. And we characterized the model's boundary conditions, finding that it predicts scientific breakthroughs where prerequisites accumulate but not commercial or policy events where market timing and political will dominate.
2. Dataset Expansion
The original dataset drew from four source types: Wikipedia timelines, patent records, seminal papers, and a curated convergence catalog. Over six versions, we expanded to seven source types and six additional sectors.
| Version | Events | Added | Description |
|---|---|---|---|
| V1 | 1,699 | --- | Original dataset |
| V2 | 1,784 | +85 | Multi-discovery cases, era backfill |
| V3 | 1,847 | +63 | AI events 2023-2025 |
| V4 | 2,202 | +355 | Curated space, Nobels, patents |
| V5 | 2,950 | +754 | KG-verified 2000-2026, URL-checked |
| V6 | 3,179 | +229 | Cross-sector impact (creative, law, finance, education, healthcare) |
Dataset Growth Across Versions
V5 represented the largest single expansion: 754 events generated by an AI knowledge-graph system (Manus), subjected to the verification pipeline described in Section 5. V6 extended coverage to sectors underrepresented in STEM-focused datasets: creative industries, law, finance, education, and healthcare.
3. The H3 Compression Hypothesis
Initial Observation
The original paper reported a suggestive trend: the median time gap between parallel discoveries declined from 2 years in the 1700s to 0 years after 1950. We hypothesized that this temporal compression exceeds what event density alone would predict --- that improving communication and collaboration infrastructure compresses discovery gaps beyond the baseline rate.
The Duplicate Contamination Problem
Our first attempt to test H3 rigorously revealed a data quality issue. When we clustered events by semantic similarity to identify parallel discoveries, 84% of the resulting pairs were not independent discoveries at all. They were the same event described differently --- the same discovery appearing once as a Wikipedia timeline entry and again as a patent record, or the same breakthrough recorded under different discoverers' names.
This contamination would have been invisible in a less careful analysis. The "parallel" pairs showed apparent temporal compression because duplicates have a gap of zero by definition.
Noise Diagnostic: Duplicates vs. True Parallels
Of 1,000 initial cluster pairs, 840 were duplicates of the same event under different descriptions. Only 160 represented genuine independent discoveries.
Cleaned Analysis
After removing duplicate pairs and retaining only cases where independent evidence confirmed distinct discoverers working without knowledge of each other, the cleaned dataset showed an apparent compression pattern:
- 1500s: ~8 years median gap
- 1700s: ~4 years
- 1900s: ~2 years
- 2000s: months
This looked promising. The question was whether the compression was real or an artifact of having more events in later periods.
Permutation Test
We constructed a permutation test to distinguish genuine compression from a density artifact. The null hypothesis: temporal compression is fully explained by the increasing density of events over time.
Procedure: shuffle timestamps within each century (preserving per-century event counts), recompute the compression slope, and repeat 10,000 times.
Result: The observed slope was -0.023 (slight compression). The mean slope across 10,000 shuffles was -1.39. Random shuffling produced stronger compression than the actual data. The p-value was 1.0: every one of the 10,000 permuted datasets showed compression at least as strong as the observed data.
H3 Permutation Test: Null Distribution vs. Observed
The null distribution of compression slopes (10,000 shuffles) is centered at -1.39. The observed slope of -0.023 shows less compression than random --- the opposite of what the hypothesis predicted.
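The shuffle procedure can be sketched as follows. This is an illustrative reconstruction, not the paper's exact implementation: the slope definition (a least-squares fit of median pair gap against century), the function names, and the synthetic pairing of dates are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def median_gap_slope(dates_a, dates_b):
    """Illustrative slope: least-squares fit of the median absolute gap
    between paired discovery dates against century. A negative slope
    means gaps shrink over time (compression)."""
    dates_a = np.asarray(dates_a, float)
    dates_b = np.asarray(dates_b, float)
    gaps = np.abs(dates_a - dates_b)
    centuries = (np.minimum(dates_a, dates_b) // 100) * 100
    uniq = np.unique(centuries)
    medians = np.array([np.median(gaps[centuries == c]) for c in uniq])
    return float(np.polyfit(uniq / 100.0, medians, 1)[0])

def permutation_test(dates_a, dates_b, n_shuffles=10_000):
    """Null hypothesis: compression is a density artifact. Shuffle all
    timestamps within their century (preserving per-century event
    counts), recompute the slope, and repeat."""
    observed = median_gap_slope(dates_a, dates_b)
    a = np.asarray(dates_a, float)
    b = np.asarray(dates_b, float)
    all_dates = np.concatenate([a, b])
    centuries = (all_dates // 100) * 100
    null_slopes = np.empty(n_shuffles)
    for i in range(n_shuffles):
        shuffled = all_dates.copy()
        for c in np.unique(centuries):
            idx = np.where(centuries == c)[0]
            shuffled[idx] = rng.permutation(shuffled[idx])
        null_slopes[i] = median_gap_slope(shuffled[:len(a)], shuffled[len(a):])
    # p = fraction of shuffles showing compression at least as strong
    # (i.e. a slope at least as negative) as the observed data
    p_value = float(np.mean(null_slopes <= observed))
    return observed, float(null_slopes.mean()), p_value
```

Under this setup, a p-value near 1.0 means nearly every within-century shuffle compresses at least as much as the real data, which is exactly the pattern reported above.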
Discussion
The result is unambiguous. The apparent compression of parallel discovery gaps over time is fully explained by the increasing density of recorded events. When more events occur per decade, randomly selected pairs within the same era will naturally have smaller gaps.
This does not prove that communication infrastructure has no effect on discovery timing. It shows that any such effect is too small to detect above the density baseline --- and therefore cannot be established with this methodology and dataset.
4. Holdout Validation Across Expansions
The most important question for an expanded dataset is whether the original signal survives. We re-ran the holdout tests at each major version boundary; the core results are unchanged.
| Test | V1 | V4 | V5 | V6 |
|---|---|---|---|---|
| 25-holdout (d) | 9.80 | 9.80 | 9.80 | 9.80 |
| 50-holdout (d) | 10.66 | 10.66 | 10.66 | 10.66 |
| Post-2000 (d) | 5.34 | 4.77 | 5.28 | ~5.2 |
| AI-era (d) | --- | --- | 1.60* | --- |
| Impact-only (d) | --- | --- | --- | 1.63* |
* p = 0.06 (suggestive but not significant at the conventional 0.05 threshold)
Effect Sizes Across Holdout Test Types (V6)
Even the weakest tests (AI-era and impact-only) exceed the conventional "large effect" threshold of d = 0.8. The dashed line marks this threshold.
The core holdout effect sizes are perfectly stable: d = 9.80 for the 25-event holdout and d = 10.66 for the 50-event holdout, unchanged across all six versions. The post-2000 holdout tests a harder question: can the model predict modern innovations using only pre-2000 prerequisite events? The effect sizes range from d = 4.77 to d = 5.34, all highly significant (p < 0.0001).
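For reference, the effect sizes above follow the standard Cohen's d with pooled standard deviation (Cohen, 1988). The sketch below is a generic implementation; comparing model prediction ranks against a baseline rank distribution is one plausible setup, and the function name and inputs are assumptions rather than the paper's actual code.

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d between two samples, using the pooled standard
    deviation (Cohen, 1988). Positive d means x has the larger mean."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))
```

For example, `cohens_d(baseline_ranks, model_ranks)` would be large and positive when the model consistently ranks true breakthroughs far better than a random baseline does.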
5. Predictive Boundaries
What the Model Predicts
The model's strongest predictions cluster around scientific and technological breakthroughs where prerequisites accumulate visibly in the historical record:
- Higgs boson discovery (rank 0): Decades of theoretical physics and accelerator development created an unmistakable prerequisite cluster.
- CRISPR gene editing (rank 1): The convergence of microbiology, genetics, and biochemistry techniques formed a dense prerequisite region.
- COVID-19 mRNA vaccine (rank 2): mRNA research, lipid nanoparticle delivery, and prior coronavirus work accumulated over years.
What the Model Does Not Predict
- Cursor IDE: A commercial product whose emergence depends on market positioning, not prerequisite accumulation.
- Biden AI Executive Order: A political decision reflecting electoral dynamics and policy advocacy, not scientific readiness.
- GitHub Copilot Chat: A product feature release timed by corporate strategy.
Prediction Ranks by Event Type
Theoretical Explanation
This boundary is not a failure of the model --- it is a validation of its theoretical scope. The Precondition Density Model operationalizes Kauffman's adjacent possible: the set of innovations reachable from current knowledge in one step. The model predicts what becomes possible, not what becomes actual.
Commercial products, policy decisions, and social events are not governed primarily by prerequisite density. They depend on market timing, political will, organizational capacity, and individual initiative --- factors that leave no consistent trace in a semantic embedding of historical knowledge events.
6. Verification Pipeline for AI-Generated Data
Version V5 introduced a methodological challenge: how to incorporate AI-generated events without contaminating the dataset with hallucinations. We developed a four-stage verification pipeline:
1. URL Verification
Each event's source URL was checked for accessibility and relevance. Events citing non-existent or unrelated URLs were flagged.
2. Date-Year Matching
The claimed date was cross-referenced against the source material. Discrepancies of more than one year triggered rejection.
3. Attribution Checking
Named individuals and institutions were verified against independent sources. Fabricated researchers or misattributed discoveries were rejected.
4. Quality Filtering
Events that passed the first three stages were evaluated for specificity and significance. Generic or trivial events were classified as low-value and excluded.
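The four stages can be sketched as a single decision function. Everything below is a hypothetical reconstruction: the event schema, field names, rejection labels, and the 0.5 significance threshold are assumptions, and the external checks (URL fetching, source parsing, attribution lookup, significance scoring) are assumed to be precomputed and passed in.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    # Hypothetical schema for a candidate AI-generated event
    title: str
    year: int
    source_url: str
    people: list = field(default_factory=list)

def verify(event, url_ok, source_year, known_people, significance_score):
    """Run one candidate event through the four verification stages,
    returning the first failure or 'passed'. Inputs beyond the event
    itself are the precomputed results of external checks."""
    if not url_ok:                                        # stage 1: URL verification
        return "rejected: url"
    if abs(event.year - source_year) > 1:                 # stage 2: date-year matching
        return "rejected: date mismatch"
    if any(p not in known_people for p in event.people):  # stage 3: attribution checking
        return "rejected: hallucinated attribution"
    if significance_score < 0.5:                          # stage 4: quality filter (threshold illustrative)
        return "rejected: low value"
    return "passed"
```

Ordering the stages cheapest-first (URL check before attribution lookup) keeps the expensive checks off events that fail early, which matters when screening over a thousand candidates.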
Verification Pipeline Results (V5)
| Outcome | Count | Percentage |
|---|---|---|
| Passed | 754 | 67.2% |
| Duplicate of existing event | 276 | 24.6% |
| Hallucinated (fabricated facts) | 54 | 4.8% |
| Low-value (insufficient significance) | 37 | 3.3% |
| Date mismatch | 1 | 0.1% |
| Total submitted | 1,122 | 100% |
The 4.8% hallucination rate is notable. Hallucinated events included fabricated researchers, invented conference proceedings, and plausible-sounding but non-existent technologies. In every case, the hallucination was internally consistent --- the AI generated coherent descriptions of events that never occurred. This underscores the necessity of external verification for any AI-generated research data.
The 24.6% duplicate rate was higher than expected, reflecting the AI system's tendency to rephrase existing events rather than identify genuinely new ones. Duplicate detection used a combination of cosine similarity (> 0.92) against existing embeddings and manual review of flagged pairs.
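A minimal version of the similarity-flagging step might look like the following. The 0.92 threshold comes from the text; the function name and the unit-normalization details are assumptions, and flagged pairs would still go to manual review rather than being merged automatically.

```python
import numpy as np

def flag_duplicates(embeddings, threshold=0.92):
    """Return index pairs whose cosine similarity exceeds the
    threshold. Candidates for manual duplicate review, not auto-merge."""
    E = np.asarray(embeddings, float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # pairwise cosine similarity
    i, j = np.triu_indices(len(E), k=1)               # upper triangle: each pair once
    mask = sims[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))
```

Note that a high cosine threshold alone cannot separate "same event, two descriptions" from "two independent events in the same area" --- which is exactly why the manual review stage remains necessary.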
7. Methodological Contributions
This work introduces several methodological tools applicable beyond the Precondition Density Model.
AI-Generated Dataset Verification Pipeline
The four-stage pipeline (URL verification, date-year matching, attribution checking, quality filtering) provides a replicable method for incorporating AI-generated data into research datasets. The 67.2% pass rate, with 4.8% outright hallucinations and 24.6% duplicates, offers baseline expectations for similar efforts. The key insight is that AI-generated research data requires external verification at every stage; internal consistency is not a reliable indicator of accuracy (Ji et al., 2023).
Algorithmic Parallel Invention Detection
The discovery that 84% of semantically clustered event pairs were duplicates rather than true parallel inventions highlights a general problem in computational history of science. Semantic similarity alone cannot distinguish between "two descriptions of the same event" and "two independent events in the same area." Our deduplication procedure --- combining cosine similarity thresholds with manual verification of discoverer independence --- provides a template for addressing this.
Permutation Framework for Temporal Claims
The H3 permutation test demonstrates how to rigorously evaluate temporal trend claims in historical data. Many observed trends in the history of science (acceleration, compression, convergence) may be density artifacts. The permutation approach --- shuffling timestamps within eras while preserving event counts --- provides a proper null hypothesis for any temporal trend analysis.
Value of Honest Negative Results
The H3 null result (p = 1.0) eliminated an attractive but unsupported claim. Had we reported the raw compression trend without the null test, we would have published a spurious finding. The research community benefits more from one clean negative than from a dozen suggestive positives that do not survive scrutiny (Kuhn, 1962).
8. Limitations and Future Work
Several limitations remain despite the expanded dataset.
STEM bias. Although V6 added cross-sector events, the dataset remains heavily weighted toward science and engineering. Events in humanities, social sciences, and arts are underrepresented. Future work should test whether prerequisite density operates similarly in these domains.
Independence criterion. For modern parallel discoveries, establishing true independence is difficult. Researchers in the same field read the same preprints, attend the same conferences, and may influence each other subtly. The independence criterion for "parallel" invention becomes increasingly blurred in an era of rapid communication.
Denominator data. The current analysis lacks denominator information: we know which breakthroughs occurred, but not how many research programs attempted similar breakthroughs and failed. Rate analysis --- the proportion of attempts that succeed as a function of prerequisite density --- would strengthen the causal interpretation.
Marginal significance results. The AI-era (d = 1.60, p = 0.06) and impact-only (d = 1.63, p = 0.06) holdout tests are suggestive but not significant at the conventional α = 0.05 threshold. Larger event pools in these categories would clarify whether the model genuinely predicts these event types or whether the effect sizes reflect noise.
Embedding model dependency. All results depend on the specific embedding model used (Google's gemini-embedding-001). While the original paper demonstrated robustness to embedding choice, the expanded dataset has not been tested with alternative embedding models.
9. Conclusion
This follow-up study yields three findings:
- Robust signal: The Precondition Density Model's predictive power survives dataset expansion. The core holdout test maintains d = 9.80 (p < 0.001) across six dataset versions spanning 1,699 to 3,179 events. The signal is not an artifact of the original dataset's composition.
- H3 rejected: The temporal compression hypothesis does not survive a proper null test. The apparent shrinking of gaps between parallel discoveries is fully explained by increasing event density (p = 1.0). This corrects a suggestive claim in the original paper.
- Clear boundaries: The model predicts scientific breakthroughs where prerequisites accumulate but not commercial products, policy decisions, or social events. This boundary aligns with the model's theoretical foundation in the adjacent possible: prerequisite density determines what can happen, not what does happen.
Taken together, these results narrow and strengthen the model's claims. The Precondition Density Model is not a general theory of innovation --- it is a specific, testable, and now more precisely bounded framework for understanding why scientific breakthroughs appear where and when they do.
References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Good, P. (1994). Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. Springer.
Ji, Z., Lee, N., Frieske, R., et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38.
Johnson, S. (2010). Where Good Ideas Come From: The Natural History of Innovation. Riverhead Books.
Kauffman, S. A. (1995). At Home in the Universe: The Search for the Laws of Self-Organization and Complexity. Oxford University Press.
Kuhn, T. S. (1959). Energy Conservation as an Example of Simultaneous Discovery. Critical Problems in the History of Science, 321–356.
Kuhn, T. S. (1962). The Structure of Scientific Revolutions. University of Chicago Press.
Lamb, D. & Easton, S. M. (1984). Multiple Discovery: The Pattern of Scientific Progress. Avebury.
Lee, J., Dai, Z., Ren, X., et al. (2024). Gecko: Versatile Text Embeddings Distilled from Large Language Models. arXiv:2403.20327.
Lemley, M. A. (2012). The Myth of the Sole Inventor. Michigan Law Review, 110(5), 709–760.
Merton, R. K. (1961). Singletons and Multiples in Scientific Discovery. Proceedings of the American Philosophical Society, 105(5), 470–486.
Merton, R. K. (1973). The Sociology of Science: Theoretical and Empirical Investigations. University of Chicago Press.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781.
Ogburn, W. F. & Thomas, D. (1922). Are Inventions Inevitable? A Note on Social Evolution. Political Science Quarterly, 37(1), 83–98.
Simonton, D. K. (2004). Creativity in Science: Chance, Logic, Genius, and Zeitgeist. Cambridge University Press.
Zinner, N. & Beacon Bot. (2026). Prerequisite Density Predicts Innovation Emergence: A Blind Holdout Experiment. Future Shock (future-shock.ai).