The Research

Nine models. Forty fields. One verdict.

The first calibrated, blinded, AI-panel trial of the great theories of reality — with every result below, and the complete data released alongside the book.

The Method

How do you judge reality without bias?

You take the author out of the judging seat.

Nine competing models of reality — from materialism to dualism to idealism to the simulation hypothesis — were each written up as a fair, steel-manned description. An independent AI panel then audited those descriptions for accuracy before any scoring began.

Then the trial: each model was scored against 40 fields of scientific and experiential evidence, covering 205 documented findings — from quantum physics and cosmology to the placebo effect, near-death experiences, and laboratory psi research.

The judges

The scoring panel: frontier AI models from six independent, competing providers — spanning the US, France, and China — Grok 4.20, Gemini 3.1 Pro, GPT-5.4, Claude Opus 4.8, plus two fast-tier panelists — with three more models completing scoring this week for the extended tier.

Every model was blinded: presented only as a letter, A to I, randomised on every call. Three block rotations. Three independent runs per score.

The scale, measured directly from the archive: 12,033 API calls producing 12.78 million words of written reasoning — about 160 books' worth. Every word archived in machine-readable files. Zero corrupt. Zero missing cells.

The thermometer test

Before touching reality, the instrument was calibrated on artificial worlds whose correct answer was known by construction — like calibrating a thermometer on boiling and freezing water before trusting it on a patient.

It was given debunked anomalies as traps: it refused to credit them to the wrong models, all three times. It was tested on telling idealism apart from a computer simulation: it separated them decisively, in both directions. It was pushed by an investigator framed as a skeptic, then as a believer: the scores held under both framings.

Only after it passed was it pointed at the real evidence.

Result One

The scoreboard

Each model received a fit score from 0 to 3 on every field — an explanatory-fit score under a calibrated inference-to-the-best-explanation procedure, not a "probability that the model is true." The means across all 40 fields:

Rank	Model	Mean fit (0–3)	Band
1	VR in Consciousness	2.67	Winner band
2	Analytic idealism	2.29	Accommodation (upper)
3	Interactionist dualism	2.20	Accommodation
4	Emergentism	1.58	Contradicted / poor fit
4	Panpsychism	1.58	Contradicted / poor fit
6	Simulation hypothesis	1.49	Contradicted / poor fit
7	Orch-OR	1.40	Contradicted / poor fit
8	Materialism	1.32	Contradicted / poor fit
9	Matrix hypothesis	1.29	Contradicted / poor fit

VR in Consciousness won 29 of the 40 fields outright. Its lead over materialism is 1.35 points — and in the calibration's known-answer worlds, true models beat their best rivals by 0.5 to 1.5 points. This margin sits at the top of the instrument-validated range.

What the scores mean on the real scale

The calibration measured how the instrument reads models that are known to be true, and known to be false. On that measured scale, VR in Consciousness reads at 91% — the way ground-truth-true models read. Materialism reads at 15% — just above the instrument's empirical floor, the way ground-truth-false models read.

And here is the fair-instrument signature: on the 12 mainstream-science fields alone — quantum physics, cosmology, biology, information theory — materialism legitimately rises into a near-tied chasing pack scoring 1.9–2.15. VR in Consciousness still leads on its rivals' home turf: 2.53, winning 9 of the 12 fields, ranked first in 99.99% of 10,000 bootstrap resamples, and reading 83% on the calibrated scale against 44–61% for the pack. The verdict does not depend on the controversial evidence.

How solid is the lead?

The result was stress-tested by bootstrap resampling — rebuilding the study 10,000 times from random re-draws of its own fields. VR in Consciousness ranked first in all 10,000 resamples (overall fit 2.67, 95% CI 2.60–2.74). Its lead measures 1.5 standard deviations over its nearest rival, and 4.3–5.4 standard deviations over the materialist and computational models.

Field by field, it never trails analytic idealism or dualism on a single field, and it leads materialism on 36 of 40 (three ties, one trail). The podium is identical under every aggregation method tested.

"So reality is a simulation, then?" — No. The data is explicit.

A dimensional analysis asked which property of a model predicts its fit to the evidence. Being a simulation — a physical, external computer running the world — predicts nothing: correlation +0.03, indistinguishable from zero. What predicts fit is consciousness being fundamental: correlation +0.80.

The universe does not behave like someone else's machine. It behaves like a virtual reality occurring within consciousness — and the instrument can tell the difference.

Result Two

Does the brain produce consciousness — or receive it?

A second, independent arm of the study asked the sharpest question in science head-on. On nine fields of evidence that bear logically on the brain–consciousness relationship, the panel allocated probability across four hypotheses:

H1 — Consciousness-prior: the brain filters or receives consciousness; it does not create it
H2 — Generator: the brain produces consciousness
H3 — Dual-aspect: mind and matter are two faces of one underlying reality
H4 — Underdetermined: this evidence cannot decide

Here is the panel's full allocation, field by field — including everything it refused to decide:

Field	H1 %	H2 %	H3 %	H4 %
Quantum physics	18	11	20	52
Information theory	20	19	35	26
Cosmology	20	12	18	49
Biology	8	23	18	50
Quantum field theory	8	10	11	72
Retrocausality	8	12	12	68
Block universe	15	16	17	53
Information-field	21	12	48	19
Survival evidence	72	5	12	10

Notice the honesty first: on pure physics fields like quantum field theory, the panel parked most of its probability on "underdetermined" rather than manufacturing support. The calibration's negative controls prove that restraint is a feature of the instrument, not a flaw.

Now the derived view — conditional on the question being decidable: set H4 aside and renormalise the remaining probability to 100%. Across the nine fields:

73 / 27

The panel allocated 73% to "the brain does not generate consciousness" (consciousness-prior plus dual-aspect) versus 27% to "the brain generates it." Brain-as-generator was the minority position on 8 of the 9 fields.

On the survival-of-consciousness evidence — the field where the question matters most — the split was 95 / 5.

How sturdy is the 73? Bootstrap resampling, 10,000 iterations: 73.0, with a 95% confidence interval of 65.5–80.8 — and in all 10,000 resamples, the "brain does not generate consciousness" side never fell below 50%. Nor is the headline carried by outlier fields: every one of the nine fields individually favours it, from 53/47 on biology to 95/5 on survival evidence.

The skeptic's cut

Drop the two frontier fields entirely — the information-field and survival evidence — and keep only the seven mainstream-science fields. The result, doubly conditional now (H4 excluded and mainstream-restricted, stated as such): 68.2 / 31.8, with a 95% confidence interval of 62.0–73.8. Still better than two to one against "the brain generates consciousness" — on mainstream evidence alone.

One honest note: with the strongest fields removed, the panel's "underdetermined" share rises from 45.1% to 53.8%. It finds the question harder to decide — exactly as it should. What remains decidable still points the same way.

It replicated

An earlier version of this study found roughly 74/26. Then everything was rebuilt: the evidence fields re-audited, a hypothesis amended, a panel member upgraded, every score re-run from scratch. The result came back 73/27.

The finding replicated to within one point.

Result Three · Preliminary

The whole case, seen at once

Scoring evidence one field at a time can miss the larger pattern. So a final layer put the complete case in front of the panel: all nine models and all 40 fields of scores together, with rankings hidden. One question: which model unifies the evidence — and which merely patches over it, field by field?

This layer was calibrated first, like everything else — on synthetic known-truth profiles, 6 of 6 passed, including a trap built to catch judges that reward a high-scoring patchwork over genuine unification.

VR in Consciousness returned a perfect consilience score — 3/3/3 on unification, prediction-versus-patchwork, and scope — in 18 out of 18 independent reads.

Asked to spread 100 points of total-evidence weight across the nine models, the panel allocated 33.0 points to VR in Consciousness — three times an even split, 1.8× analytic idealism, and roughly 8–10× the materialist cluster. By family: consciousness-first models took 66.2 points against 20.1 for the materialist-computational family — an independent route landing right beside the deductive arm's conditional 73/27.

Materialism's weakest criterion told its own story: prediction-versus-patchwork, 0.78 out of 3. Seen all at once, its field-by-field explanations read as accumulated rescue devices. Epicycles — made quantitative.

The skeptic-facing arm

Run again admitting only the 12 mainstream-science fields, the picture holds: VR in Consciousness scores 31.1 of 100 — double analytic idealism, 2.6× materialism — and the consciousness family leads 53.3 to 36.5.

And one number proves the instrument's fairness: with the frontier evidence removed, materialism's prediction score jumps from 0.78 to 1.78. The rescue-device penalty tracks the evidence, not the instrument.

One more control: when the investigator was framed as a hostile skeptic, VR in Consciousness's allocation went up by 2 points. Under believer framing, it didn't move at all.

Status: conference-preliminary — produced under a lean audited protocol for SAI 2026. The exhaustive iteration is pre-committed and will be published with the full dataset.

The Controls

How tightly was it controlled?

Every control below ran with a pre-set pass criterion, and every outcome — including the misses — is recorded in the archive. None were deferred silently.

Control	Outcome
Blinding — models shown as random letters A–I, re-randomised on every call	In effect across both tracks, both phases
Replication — three independent runs per score	100% complete; zero missing cells in all four datasets
False-positive traps — three seductive-but-wrong fits planted in calibration	3 of 3 unsprung
Negative control — a field built to fit no consciousness model	The thesis-model analog was correctly suppressed (1.5) — the panel does not inflate the home team
Idealism-vs-simulation guard — can the panel tell the two apart?	Decisive in both directions (2.9 vs 1.5; 3.0 vs 1.3)
Two-way bias test — investigator framed as a skeptic AND as a believer; 198 dedicated audit calls	Scores hold under both framings
Length control — materialism's description padded to the winner's length	+0.17 shift, inside the ≤ +0.30 criterion — length does not buy points (the second-longest description ranks last; a near-shortest ranks second)
Locked instruments — model cards, evidence fields, and the expected-score key locked with integrity hashes before any run	The key was never revised after data arrived; misses are disclosed in an honest-miss register, not patched
Method tournament — four scoring frameworks raced on known ground truths	Full inference-to-the-best-explanation won (16/17 ranking accuracy, best trap resistance) and was pre-registered for the main run
Adjudication — every high-variance cell re-examined	Full trigger coverage; headlines unchanged

Pre-registered, calibrated on known ground truth, blinded, triplicated, bias-tested both ways, length-controlled, trap-tested, fully adjudicated — and every one of 12,391 output files is archived and machine-readable.

The Standard

Check the work yourself

A result you cannot inspect is an opinion. So this project commits to full transparency: when the book is released, this site will publish the complete methodology and every AI input and output — every prompt, every response, every score, every reconciliation.

Replicate it with your own AI panel. Change the prompts. Attack the weak points. If the verdict is wrong, the fastest way to find out is to let everyone try to break it.

The full dataset and methodology arrive with the book. Subscribers are notified the day it goes live.

Notify me when the data is published

Nine models. Forty fields. One verdict.

How do you judge reality without bias?

The judges

The thermometer test

The scoreboard

What the scores mean on the real scale

How solid is the lead?

"So reality is a simulation, then?" — No. The data is explicit.

Does the brain produce consciousness — or receive it?

The skeptic's cut

It replicated

The whole case, seen at once

The skeptic-facing arm

How tightly was it controlled?

Check the work yourself

Be there when it goes live