Simulatte uses a single, transparent accuracy metric: distribution accuracy. It measures how well a simulated population's opinion distributions match a real survey population's distributions — answer by answer, question by question.
The formula is not proprietary, and its use here is intentional: adopting the same metric used by independent academic benchmarks allows direct comparison without methodological translation. There is no scoring inflation, no cherry-picked metric, and no adjusted baseline.
Distribution accuracy measures the overlap between two probability distributions: what the real survey population answered, and what the simulated population answered.
Distribution Accuracy = 1 − ( Σ|real_i − sim_i| / 2 )
Distribution accuracy captures how well the simulated population's aggregate opinion matches the real population's aggregate opinion. A score of 100% means every response option was chosen at exactly the right frequency; a score of 0% means the two distributions share no probability mass at all. (There is no fixed "chance" baseline: the score of a naive uniform guess depends on how skewed the real distribution is.)
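The formula above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the Simulatte repository, and the example distributions are hypothetical:

```python
# Distribution accuracy as defined above: 1 minus half the sum of
# absolute differences between two discrete probability distributions.

def distribution_accuracy(real, sim):
    """Overlap between two distributions over the same response options, in [0, 1]."""
    assert abs(sum(real) - 1.0) < 1e-9 and abs(sum(sim) - 1.0) < 1e-9
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

# Identical distributions score 1.0; fully disjoint ones score 0.0.
print(distribution_accuracy([0.4, 0.3, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]))  # 1.0
print(distribution_accuracy([1.0, 0.0], [0.0, 1.0]))                       # 0.0
```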
The metric operates on population distributions — not individual-level predictions. Simulatte does not predict how any specific person will answer; it predicts how the population will distribute across options. This is the correct level of analysis for synthetic population research.
Most ML accuracy metrics measure individual-level classification. Survey simulation is a distribution-matching problem, not a classification problem. Distribution accuracy is the natural measure: it directly quantifies how much probability mass is misallocated across response options.
The choice of this specific formula enables direct comparison with the UC Berkeley Jan 2026 synthetic population benchmark — the leading independent academic study in this field — without any methodological translation or conversion.
**Bounded:** scores always fall in [0, 1] (equivalently 0%–100%). **Symmetric:** overestimating and underestimating an option's frequency are penalised equally. **Interpretable:** a score of 85% means 85% of the probability mass in the real distribution is correctly captured. **Option-count agnostic:** the same formula applies regardless of how many response options a question has (2, 4, or more).
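The symmetry property can be verified directly. A minimal check with made-up two-option distributions, assuming the formula as stated above:

```python
# Symmetry check: overestimating an option by x penalises the score
# exactly as much as underestimating it by x.
def da(real, sim):
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

real  = [0.5, 0.5]
over  = da(real, [0.6, 0.4])  # first option overestimated by 0.1
under = da(real, [0.4, 0.6])  # first option underestimated by 0.1
print(over, under)  # 0.9 0.9
```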
Consider a concrete calculation on a hypothetical 4-option survey question about religious practice frequency, comparing a real Pew distribution against Simulatte simulation output.
Note: values are illustrative. Real question scores range from 79% to 97% across Study 1A; 71% to 96% across Study 1B. The 88.7% and 85.3% study scores are the mean distribution accuracy averaged across all 15 questions.
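A worked version of such a calculation, using invented distributions (these numbers are illustrative only, not real Pew data or real Simulatte output):

```python
# Hypothetical 4-option question on religious practice frequency.
# Both distributions are illustrative, not real survey data.
real = {"Weekly": 0.31, "Monthly": 0.18, "Rarely": 0.27, "Never": 0.24}
sim  = {"Weekly": 0.28, "Monthly": 0.21, "Rarely": 0.29, "Never": 0.22}

total_abs_diff = sum(abs(real[k] - sim[k]) for k in real)  # 0.03+0.03+0.02+0.02 = 0.10
accuracy = 1.0 - total_abs_diff / 2.0                      # 1 - 0.05 = 0.95
print(f"{accuracy:.1%}")  # 95.0%
```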
Mean Study Accuracy = (1/n) × Σ_q [ 1 − ( Σ_i |real_qi − sim_qi| / 2 ) ]
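The averaging step can be sketched as follows, again with illustrative distributions rather than real study data:

```python
# Mean study accuracy: the per-question distribution accuracy,
# averaged over all questions in the study.
def da(real, sim):
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

questions = [  # (real distribution, simulated distribution), illustrative
    ([0.5, 0.5],           [0.45, 0.55]),
    ([0.3, 0.4, 0.3],      [0.35, 0.35, 0.30]),
    ([0.6, 0.2, 0.1, 0.1], [0.55, 0.25, 0.10, 0.10]),
]
mean_accuracy = sum(da(r, s) for r, s in questions) / len(questions)
print(f"{mean_accuracy:.1%}")
```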
A score of 100% is unattainable in practice for any simulation system, including one run by humans. This is not a limitation of Simulatte's design; it is an irreducible property of human opinion measurement.
Iyengar et al. (Stanford) established the human ceiling through repeated-response studies: when the same individual is asked the same survey question twice (with a time gap), they give inconsistent answers approximately 9% of the time. This is not measurement error — it is the natural cognitive variability in how humans form and express opinions. Any simulation of human opinion is bounded by this ceiling, not by 100%.
(Chart: bar lengths are proportional to the 91.0% human ceiling. Study 1A reaches 97.5% of the ceiling; Study 1B reaches 93.7%.)
When evaluating a result of 88.7%, the correct comparison is not to 100% but to 91.0%. Simulatte's Study 1A result of 88.7% is 2.3 pp from the human ceiling — meaning 97.5% of achievable accuracy is captured. The remaining 2.3 pp gap includes both Simulatte error and irreducible human self-inconsistency in the ground truth data itself.
| Study | Simulatte score | Gap to 91.0% ceiling | % of ceiling captured | Interpretation |
|---|---|---|---|---|
| Study 1A — US Pew | 88.7% | 2.3 pp | 97.5% | Within statistical noise of ceiling |
| Study 1B — India Pew | 85.3% | 5.7 pp | 93.7% | Harder cultural context, 22-sprint result |
| Best LLM (GPT-4o) | 75.6% | 15.4 pp | 83.1% | Demographic context alone, no calibration |
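The "gap" and "% of ceiling captured" columns in the table above follow directly from the quoted scores. A quick reproduction:

```python
# Reproducing the ceiling comparison from the table above.
CEILING = 0.910  # Iyengar et al. human self-consistency ceiling
for name, score in [("Study 1A", 0.887), ("Study 1B", 0.853), ("Best LLM", 0.756)]:
    gap = CEILING - score          # percentage-point gap to the ceiling
    captured = score / CEILING     # fraction of the ceiling captured
    print(f"{name}: gap {gap * 100:.1f} pp, {captured:.1%} of ceiling")
```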
The UC Berkeley Jan 2026 synthetic population study (Park et al.) is the leading independent academic benchmark for AI-generated population opinion simulation. Simulatte deliberately uses the identical formula to enable direct comparison without any methodological translation.
Choosing the same formula as UC Berkeley was deliberate. Any proprietary metric can be tuned to look impressive — identical metrics cannot. By publishing results using the same formula as an independent academic benchmark, Simulatte results can be directly compared, challenged, and replicated by third parties who are already familiar with the Berkeley methodology.
The UC Berkeley paper defines accuracy as total variation distance subtracted from 1. This is algebraically identical to Simulatte's formula:
Simulatte: DA = 1 − ( Σ|real_i − sim_i| / 2 )
Berkeley: DA = 1 − TV(P_real, P_sim)
Both reduce to: 1 minus one-half the sum of absolute differences across all response options. They are the same formula written in two notations.
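The equivalence can be checked numerically. For discrete distributions, total variation distance is exactly half the L1 distance, so the two notations compute the same number (distributions below are randomly generated for illustration):

```python
# Numerical check that 1 − TV(P, Q) equals 1 − (Σ|p_i − q_i|)/2.
import random

random.seed(0)

def random_dist(n):
    """A random discrete distribution over n options (illustrative)."""
    w = [random.random() for _ in range(n)]
    t = sum(w)
    return [x / t for x in w]

p, q = random_dist(4), random_dist(4)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))            # total variation distance
da_berkeley  = 1.0 - tv                                      # Berkeley notation
da_simulatte = 1.0 - sum(abs(a - b) for a, b in zip(p, q)) / 2.0  # Simulatte notation
print(da_simulatte == da_berkeley)  # True
```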
All study results are computed using the formula above, averaged across 15 survey questions per study. Every number below is reproducible from the public audit files in the GitHub repository.
| Study | Population | Questions | Sprints | Score | Gap to ceiling |
|---|---|---|---|---|---|
| Study 1A | US Pew Opinion Survey | 15 | 12 | 88.7% | 2.3 pp |
| Study 1B | India Pew Opinion Survey | 15 | 22 | 85.3% | 5.7 pp |
| LLM Comparison (best) | India Pew — GPT-4o naive | 15 | 0 | 75.6% | 15.4 pp |
| LLM Comparison (avg) | India Pew — 10-model average | 15 | 0 | 63.3% | 27.7 pp |
| Human ceiling | Iyengar et al. (Stanford) | — | — | 91.0% | — |
The LLM comparison average is the mean of the individual model scores listed on the leaderboard: (75.6 + 74.3 + 73.8 + 72.4 + 71.9 + 70.2 + 44.3 + 43.9 + 43.5) ÷ 9 = 63.3%. The Simulatte-to-average-LLM gap is 85.3 − 63.3 = 22.0 pp on the raw score; equivalently, Simulatte's residual error against the 91.0% human ceiling (5.7 pp) is 4.9× smaller than the average LLM's (27.7 pp).
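These averages and gaps reproduce directly from the listed scores:

```python
# Reproducing the average-LLM score and the residual-error ratio.
leaderboard = [75.6, 74.3, 73.8, 72.4, 71.9, 70.2, 44.3, 43.9, 43.5]
avg_llm = sum(leaderboard) / len(leaderboard)
print(round(avg_llm, 1))                  # 63.3

ceiling, simulatte_1b = 91.0, 85.3
print(round(simulatte_1b - avg_llm, 1))   # raw-score gap: 22.0 pp
ratio = (ceiling - avg_llm) / (ceiling - simulatte_1b)
print(round(ratio, 1))                    # 4.9x closer to the ceiling
```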
The public GitHub repository contains all audit files, SHA-256 hashes, response logs, and the verifier script. A researcher who understands the formula above can verify every published number from first principles using only the files in the repository.
NDA-gated materials are available to qualified researchers. The SHA-256 hashes allow verification that published scores correspond to the real prompts, even without reading those prompts.
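The hash check itself is standard. A minimal sketch of how a researcher might verify a file against a published digest; the file path and manifest value here are placeholders, not real repository contents:

```python
# Hash-based verification sketch: compare a local file's SHA-256 digest
# against a published hash. Path and expected hash are hypothetical.
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# published = "..."  # digest from the repository's audit manifest
# print(sha256_of("prompts/study_1a.txt") == published)  # hypothetical path
```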