Simulatte uses a single, transparent accuracy metric: distribution accuracy. It measures how well a simulated population's opinion distributions match a real survey population's distributions — answer by answer, question by question.
The formula is not proprietary, and its use here is intentional: adopting the same metric used by independent academic benchmarks allows direct comparison without methodological translation. There is no scoring inflation, no cherry-picked metric, and no adjusted baseline.
Distribution accuracy measures the overlap between two probability distributions: what the real survey population answered, and what the simulated population answered.
Distribution Accuracy = 1 − ( Σ|real_i − sim_i| / 2 )
Distribution accuracy captures how well the simulated population's aggregate opinion matches the real population's aggregate opinion. A score of 100% means every response option was chosen at exactly the right frequency; a score of 0% means the two distributions share no probability mass at all. (There is no fixed "chance" baseline: the score of a naive uniform guess depends on how skewed the real distribution is.)
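The formula above can be sketched in a few lines of Python. This is an illustrative implementation, not code from the Simulatte repository, and the example distributions are hypothetical:

```python
# Distribution accuracy as defined above: 1 minus half the sum of
# absolute differences between two discrete probability distributions.

def distribution_accuracy(real, sim):
    """Overlap between two distributions over the same response options, in [0, 1]."""
    assert abs(sum(real) - 1.0) < 1e-9 and abs(sum(sim) - 1.0) < 1e-9
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

# Identical distributions score 1.0; fully disjoint ones score 0.0.
print(distribution_accuracy([0.4, 0.3, 0.2, 0.1], [0.4, 0.3, 0.2, 0.1]))  # 1.0
print(distribution_accuracy([1.0, 0.0], [0.0, 1.0]))                       # 0.0
```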
The metric operates on population distributions — not individual-level predictions. Simulatte does not predict how any specific person will answer; it predicts how the population will distribute across options. This is the correct level of analysis for synthetic population research.
Most ML accuracy metrics measure individual-level classification. Survey simulation is a distribution-matching problem, not a classification problem. Distribution accuracy is the natural measure: it directly quantifies how much probability mass is misallocated across response options.
The choice of this specific formula enables direct comparison with the UC Berkeley Jan 2026 synthetic population benchmark — the leading independent academic study in this field — without any methodological translation or conversion.
**Bounded:** scores always fall in [0, 1] (equivalently 0%–100%). **Symmetric:** overestimating and underestimating an option's frequency are penalised equally. **Interpretable:** a score of 85% means 85% of the probability mass in the real distribution is correctly captured. **Option-count agnostic:** the same formula applies regardless of how many response options a question has (2, 4, or more).
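The symmetry property can be verified directly. A minimal check with made-up two-option distributions, assuming the formula as stated above:

```python
# Symmetry check: overestimating an option by x penalises the score
# exactly as much as underestimating it by x.
def da(real, sim):
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

real  = [0.5, 0.5]
over  = da(real, [0.6, 0.4])  # first option overestimated by 0.1
under = da(real, [0.4, 0.6])  # first option underestimated by 0.1
print(over, under)  # 0.9 0.9
```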
Consider a concrete calculation on a hypothetical 4-option survey question about religious practice frequency, comparing a real Pew distribution against Simulatte simulation output.
Note: values are illustrative. Real question scores range from 79% to 97% across Study 1A; 71% to 96% across Study 1B. The 88.7% and 85.3% study scores are the mean distribution accuracy averaged across all 15 questions.
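A worked version of such a calculation, using invented distributions (these numbers are illustrative only, not real Pew data or real Simulatte output):

```python
# Hypothetical 4-option question on religious practice frequency.
# Both distributions are illustrative, not real survey data.
real = {"Weekly": 0.31, "Monthly": 0.18, "Rarely": 0.27, "Never": 0.24}
sim  = {"Weekly": 0.28, "Monthly": 0.21, "Rarely": 0.29, "Never": 0.22}

total_abs_diff = sum(abs(real[k] - sim[k]) for k in real)  # 0.03+0.03+0.02+0.02 = 0.10
accuracy = 1.0 - total_abs_diff / 2.0                      # 1 - 0.05 = 0.95
print(f"{accuracy:.1%}")  # 95.0%
```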
Mean Study Accuracy = (1/n) × Σ_q [ 1 − ( Σ_i |real_qi − sim_qi| / 2 ) ]
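The averaging step can be sketched as follows, again with illustrative distributions rather than real study data:

```python
# Mean study accuracy: the per-question distribution accuracy,
# averaged over all questions in the study.
def da(real, sim):
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

questions = [  # (real distribution, simulated distribution), illustrative
    ([0.5, 0.5],           [0.45, 0.55]),
    ([0.3, 0.4, 0.3],      [0.35, 0.35, 0.30]),
    ([0.6, 0.2, 0.1, 0.1], [0.55, 0.25, 0.10, 0.10]),
]
mean_accuracy = sum(da(r, s) for r, s in questions) / len(questions)
print(f"{mean_accuracy:.1%}")
```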
A score of 100% is unattainable in practice for any simulation system, including one run by humans. This is not a limitation of Simulatte's design; it is an irreducible property of human opinion measurement.
Iyengar et al. (Stanford) established the human ceiling through repeated-response studies: when the same individual is asked the same survey question twice (with a time gap), they give inconsistent answers approximately 9% of the time. This is not measurement error — it is the natural cognitive variability in how humans form and express opinions. Any simulation of human opinion is bounded by this ceiling, not by 100%.
(Chart: bar lengths are proportional to the 91.0% human ceiling. Study 1A reaches 97.5% of the ceiling; Study 1B reaches 93.7%.)
When evaluating a result of 88.7%, the correct comparison is not to 100% but to 91.0%. Simulatte's Study 1A result of 88.7% is 2.3 pp from the human ceiling — meaning 97.5% of achievable accuracy is captured. The remaining 2.3 pp gap includes both Simulatte error and irreducible human self-inconsistency in the ground truth data itself.
| Study | Simulatte score | Gap to 91.0% ceiling | % of ceiling captured | Interpretation |
|---|---|---|---|---|
| Study 1A — US Pew | 88.7% | 2.3 pp | 97.5% | Within statistical noise of ceiling |
| Study 1B — India Pew | 85.3% | 5.7 pp | 93.7% | Harder cultural context, 22-sprint result |
| Best LLM (GPT-4o) | 75.6% | 15.4 pp | 83.1% | Demographic context alone, no calibration |
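The "gap" and "% of ceiling captured" columns in the table above follow directly from the quoted scores. A quick reproduction:

```python
# Reproducing the ceiling comparison from the table above.
CEILING = 0.910  # Iyengar et al. human self-consistency ceiling
for name, score in [("Study 1A", 0.887), ("Study 1B", 0.853), ("Best LLM", 0.756)]:
    gap = CEILING - score          # percentage-point gap to the ceiling
    captured = score / CEILING     # fraction of the ceiling captured
    print(f"{name}: gap {gap * 100:.1f} pp, {captured:.1%} of ceiling")
```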
The UC Berkeley Jan 2026 synthetic population study (Park et al.) is the leading independent academic benchmark for AI-generated population opinion simulation. Simulatte deliberately uses the identical formula to enable direct comparison without any methodological translation.
Choosing the same formula as UC Berkeley was deliberate. Any proprietary metric can be tuned to look impressive — identical metrics cannot. By publishing results using the same formula as an independent academic benchmark, Simulatte results can be directly compared, challenged, and replicated by third parties who are already familiar with the Berkeley methodology.
The UC Berkeley paper defines accuracy as total variation distance subtracted from 1. This is algebraically identical to Simulatte's formula:
Simulatte: DA = 1 − ( Σ|real_i − sim_i| / 2 )
Berkeley: DA = 1 − TV(P_real, P_sim)
Both reduce to: 1 minus one-half the sum of absolute differences across all response options. They are the same formula written in two notations.
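The equivalence can be checked numerically. For discrete distributions, total variation distance is exactly half the L1 distance, so the two notations compute the same number (distributions below are randomly generated for illustration):

```python
# Numerical check that 1 − TV(P, Q) equals 1 − (Σ|p_i − q_i|)/2.
import random

random.seed(0)

def random_dist(n):
    """A random discrete distribution over n options (illustrative)."""
    w = [random.random() for _ in range(n)]
    t = sum(w)
    return [x / t for x in w]

p, q = random_dist(4), random_dist(4)
tv = 0.5 * sum(abs(a - b) for a, b in zip(p, q))            # total variation distance
da_berkeley  = 1.0 - tv                                      # Berkeley notation
da_simulatte = 1.0 - sum(abs(a - b) for a, b in zip(p, q)) / 2.0  # Simulatte notation
print(da_simulatte == da_berkeley)  # True
```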
All study results are computed using the formula above, averaged across 15 survey questions per study. Every number below is reproducible from the public audit files in the GitHub repository.
| Study | Population | Questions | Sprints | Score | Gap to ceiling |
|---|---|---|---|---|---|
| Study 1A | US Pew Opinion Survey | 15 | 12 | 88.7% | 2.3 pp |
| Study 1B | India Pew Opinion Survey | 15 | 22 | 85.3% | 5.7 pp |
| LLM Comparison (best) | India Pew — GPT-4o naive | 15 | 0 | 75.6% | 15.4 pp |
| LLM Comparison (avg) | India Pew — 10-model average | 15 | 0 | 63.3% | 27.7 pp |
| Human ceiling | Iyengar et al. (Stanford) | — | — | 91.0% | — |
The LLM comparison average is the mean of the individual model scores listed on the leaderboard: (75.6 + 74.3 + 73.8 + 72.4 + 71.9 + 70.2 + 44.3 + 43.9 + 43.5) ÷ 9 = 63.3%. The Simulatte-to-average-LLM gap is 85.3 − 63.3 = 22.0 pp on the raw score; equivalently, Simulatte's residual error against the 91.0% human ceiling (5.7 pp) is 4.9× smaller than the average LLM's (27.7 pp).
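These averages and gaps reproduce directly from the listed scores:

```python
# Reproducing the average-LLM score and the residual-error ratio.
leaderboard = [75.6, 74.3, 73.8, 72.4, 71.9, 70.2, 44.3, 43.9, 43.5]
avg_llm = sum(leaderboard) / len(leaderboard)
print(round(avg_llm, 1))                  # 63.3

ceiling, simulatte_1b = 91.0, 85.3
print(round(simulatte_1b - avg_llm, 1))   # raw-score gap: 22.0 pp
ratio = (ceiling - avg_llm) / (ceiling - simulatte_1b)
print(round(ratio, 1))                    # 4.9x closer to the ceiling
```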
The public GitHub repository contains all audit files, SHA-256 hashes, response logs, and the verifier script. A researcher who understands the formula above can verify every published number from first principles using only the files in the repository.
NDA-gated materials are available to qualified researchers. The SHA-256 hashes allow verification that published scores correspond to the real prompts, even without reading those prompts.
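The hash check itself is standard. A minimal sketch of how a researcher might verify a file against a published digest; the file path and manifest value here are placeholders, not real repository contents:

```python
# Hash-based verification sketch: compare a local file's SHA-256 digest
# against a published hash. Path and expected hash are hypothetical.
import hashlib

def sha256_of(path):
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

# published = "..."  # digest from the repository's audit manifest
# print(sha256_of("prompts/study_1a.txt") == published)  # hypothetical path
```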