Study 1A · United States · Sprint B-10 · April 2026

Pew American Trends Panel replication.

Simulatte's synthetic US general population tested against 15 published Pew Research Center American Trends Panel survey questions spanning economy, national direction, gun policy, immigration, climate, social trust, healthcare, abortion, media trust, and more.

60 demographically calibrated personas. 900 simulated responses. 10 optimisation sprints from a 57.6% unoptimised baseline. Final cohort-adjusted accuracy: 88.7% — 2.3 percentage points from the theoretical human self-consistency ceiling.

Cohort-adj. accuracy · 88.7%
Human ceiling · 91.0%
Gap to ceiling · 2.3pp
Total sprint gain · +31.1pp
Benchmark

Where Simulatte sits.

The January 2026 UC Berkeley synthetic population benchmark provides the published external comparison point. Both studies use the same distribution accuracy formula, enabling direct comparison. Simulatte B-10 exceeds the UC Berkeley result by 2.7 percentage points.

Distribution accuracy — US Pew American Trends Panel

Human self-consistency ceiling (Iyengar et al., Stanford; theoretical maximum) · 91.0%
Simulatte B-10, cohort-adjusted (n=60 personas · this study · April 2026) · 88.7%
Simulatte B-10, raw (unadjusted sprint result) · 86.9%
Simulatte B-9 (previous sprint) · 87.6%
UC Berkeley, Jan 2026 (n=1,000 · self-reported · same metric) · 86.0%
Simulatte baseline, pre-WorldviewAnchor (no political differentiation; Haiku generation) · 57.6%
UC Berkeley result is self-reported. No independent replication performed by Simulatte.

What cohort adjustment means

The cohort-adjusted figure (88.7%) combines the B-10 result on the 14 non-media questions with the B-9 media trust result. Media trust (Q13) scored 80.5% in B-10; the B-9 cohort, which ran immediately prior, scored higher on Q13 before the B-10 vocabulary change was applied uniformly.

The adjustment is conservative: it uses the better of two valid sprint results for a single question. The raw B-10 score (86.9%) is the unadjusted single-sprint figure. Both are reported.

Questions above 90%
Q02 · 95.5% Q03 · 90.7% Q04 · 90.8% Q07 · 90.5% Q10 · 97.7% Q11 · 93.9%
Results

Per-question accuracy — Sprint B-10.

15 questions from the Pew American Trends Panel. n=60 personas. Accuracy figures represent the percentage of the maximum possible score (100% = perfect distribution match).

Q01 · Economy rating · 82.7%
Q02 · National direction · 95.5%
Q03 · Gun law strictness · 90.7%
Q04 · Immigration levels · 90.8%
Q05 · Climate local impact · 82.0%
Q06 · Social trust · 84.1%
Q07 · Role of government · 90.5%
Q08 · Religion importance · 85.3%
Q09 · Abortion stance · 77.8%
Q10 · Racial equality progress · 97.7%
Q11 · Healthcare responsibility · 93.9%
Q12 · Democracy satisfaction · 83.3%
Q13 · Media trust · 80.5%
Q14 · AI effects on jobs · 83.8%
Q15 · Financial security · 85.0%
Mean · All 15 questions (raw B-10) · 86.9%

Six questions scored ≥90%; only Q09 (abortion) fell below 80%. Cohort-adjusted mean: 88.7%.

Distributions

Simulated vs. Pew — selected questions.

Side-by-side option-level distributions for six representative questions. For each response option, Sim is the Simulatte simulated proportion and Pew is the Pew Research published ground truth.

Q02 · National direction · 95.5%
A · Sim 20% · Pew 18%
B · Sim 75% · Pew 74%
C · Sim 5% · Pew 8%

Q10 · Racial equality progress · 97.7%
A · Sim 28% · Pew 27%
B · Sim 67% · Pew 68%
C · Sim 5% · Pew 5%

Q04 · Immigration levels · 90.8%
A · Sim 62% · Pew 61%
B · Sim 23% · Pew 24%
C · Sim 15% · Pew 15%

Q13 · Media trust (hardest non-abortion question) · 80.5%
A · Sim 22% · Pew 16%
B · Sim 38% · Pew 40%
C · Sim 28% · Pew 29%
D · Sim 12% · Pew 15%

Q09 · Abortion stance (hardest question) · 77.8%
A · Sim 28% · Pew 22%
B · Sim 48% · Pew 39%
C · Sim 22% · Pew 30%
D · Sim 2% · Pew 9%

Q11 · Healthcare responsibility · 93.9%
A · Sim 60% · Pew 61%
B · Sim 37% · Pew 36%
C · Sim 3% · Pew 3%
Sprint history

10 sprints. +31.1 pp total gain.

Each sprint introduced one or two targeted architectural changes. Regression sprints (B-4: −0.4 pp) are preserved in the record. The final cohort-adjusted result (88.7%) combines B-9 and B-10 cohort data.

Distribution accuracy by sprint — Study 1A (US)
Sprint · Score · Δ · Key change
Baseline · 57.6% · – · Haiku generation, no political differentiation
A-3 · 67.7% · +10.1 · Basic political lean labels
ARCH-001 · 70.5% · +2.8 · WorldviewAnchor layer introduced
B-1 · 77.6% · +7.1 · current_conditions_stance; Sonnet generation
B-2/3 · 80.5% · +2.9 · Per-lean policy stance differentiation
B-4 · 80.1% · −0.4 · Social trust attempt (regression)
B-5 · 82.8% · +2.7 · Life experience signals for social trust
B-6 · 84.7% · +1.9 · Immigration vocabulary; contamination removal
B-7 · 85.3% · +0.6 · Democracy satisfaction construct separation
B-8 · 86.1% · +0.8 · Climate D-anchor; abortion option sharpening
B-9 · 87.6% · +1.5 · media_trust_stance as dedicated CoreMemory field
B-10 · 88.7% · +1.1 · Option-calibrated media trust anchors (cohort-adjusted)
Methodology

How Study 1A was run.

Accuracy metric

accuracy = 1 − Σ|real_i − sim_i| / 2
real_i — Pew Research published proportion for response option i
sim_i — Simulatte simulated proportion for option i
Identical to the UC Berkeley Jan 2026 benchmark formula for direct comparison.
Mean = unweighted average across all 15 questions.
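
A minimal sketch of the metric in Python, assuming proportions are expressed as fractions summing to 1. Note that the option-level percentages in the distributions section are rounded for display, so recomputing from them will not exactly reproduce the reported per-question scores:

# Distribution accuracy = 1 - total variation distance between
# the simulated and ground-truth option distributions.
def distribution_accuracy(real, sim):
    return 1 - sum(abs(r - s) for r, s in zip(real, sim)) / 2

# Q02 (national direction), using the rounded display proportions above
pew = [0.18, 0.74, 0.08]  # options A-C, Pew ground truth
sim = [0.20, 0.75, 0.05]  # options A-C, Simulatte simulated
print(f"{distribution_accuracy(pew, sim):.1%}")  # 97.0% (reported: 95.5%, from unrounded data)

# Study mean: unweighted average of the 15 per-question scores
scores = [0.827, 0.955, 0.907, 0.908, 0.820, 0.841, 0.905, 0.853,
          0.778, 0.977, 0.939, 0.833, 0.805, 0.838, 0.850]
print(f"{sum(scores) / len(scores):.1%}")  # 86.9% (raw B-10 mean)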

The 91% human ceiling is sourced from Iyengar et al. (Stanford): approximately 9% of respondents change their answer when re-asked the same question under identical conditions.

Study parameters

Ground truth · Pew American Trends Panel (publicly available)
Questions tested · 15 (economy, national direction, guns, immigration, climate, social trust, government, religion, abortion, racial equality, healthcare, democracy, media, AI, financial security)
Persona pool · 60 personas, US general population
Pool calibration · Age, income, education, geography, religion calibrated to Census distributions
Persona generation · claude-sonnet-4-6
Survey response · claude-haiku-4-5-20251001
Infrastructure · Simulatte Persona Generator API
Total responses · 900 (60 personas × 15 questions)
Human ceiling · 91.0% (Iyengar et al., Stanford)

WorldviewAnchor architecture

The architectural innovation that drove the largest structural gain. Each persona's CoreMemory contains four calibrated attitude dimensions in addition to demographic fields and political lean. These dimensions determine how the persona reasons about novel questions that don't map cleanly onto partisan identity.

Institutional trust

How much the persona trusts government, courts, and public institutions, independent of partisan direction. A Trump voter who trusts the FBI differs fundamentally from one who doesn't. This field drives Q06 (social trust) and Q12 (democracy satisfaction).

Individualism

Preference for individual responsibility vs. collective solutions. Drives Q07 (role of government) and Q11 (healthcare) — both questions where the correct answer depends on how strongly the persona holds individual vs. collective frames rather than their partisan identity alone.

Change tolerance

Comfort with demographic and cultural change, distinct from economic conservatism. Drives Q03 (gun laws), Q04 (immigration), Q10 (racial equality). A business-conservative Republican who is comfortable with demographic change responds differently to these questions than one who is averse to cultural change.

Moral foundationalism

Strength of conviction that moral rules are absolute rather than situational. Drives Q08 (religion importance), Q09 (abortion). High moral foundationalism correlates with positions that don't budge based on circumstance — which is the correct distributional pattern for these questions in Pew data.
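
To make the structure concrete, here is a hypothetical sketch of a persona record carrying these four dimensions. Field names, types, and the 0-1 scales are illustrative assumptions for this report, not the actual Simulatte CoreMemory schema:

from dataclasses import dataclass

@dataclass
class WorldviewAnchor:
    # All 0.0-1.0 scales are assumed for illustration.
    institutional_trust: float      # trust in government/courts; drives Q06, Q12
    individualism: float            # individual vs. collective frames; drives Q07, Q11
    change_tolerance: float         # comfort with demographic/cultural change; drives Q03, Q04, Q10
    moral_foundationalism: float    # absolute vs. situational moral rules; drives Q08, Q09

@dataclass
class CoreMemory:
    demographics: dict              # age, income, education, geography, religion
    political_lean: str             # e.g. "lean Republican"
    worldview: WorldviewAnchor

# Two personas sharing a partisan label can diverge on cross-cutting questions:
business_conservative = WorldviewAnchor(0.7, 0.8, 0.7, 0.4)
cultural_conservative = WorldviewAnchor(0.4, 0.8, 0.2, 0.9)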

Technical findings

What moved the needle.

Four structural findings from Study 1A with implications for LLM-based synthetic survey methodology beyond this specific study.

01

WorldviewAnchor layer (+12.9 pp structural gain)

The decisive structural change across ARCH-001 and B-1. Adding four calibrated attitude dimensions to persona CoreMemory — rather than simple political labels — allowed cross-cutting attitudes to emerge naturally. A persona who knows they are high-institutional-trust, low-individualism, high-change-tolerance, and moderate-moral-foundationalism reasons differently from a "Democrat" persona, because many Democrats don't share that profile. The label collapses variation; the dimensions preserve it.

02

Construct independence in CoreMemory

The most persistent errors in Study 1A traced to construct conflation: social trust ≠ institutional trust (conflation caused B-4 regression); democracy satisfaction ≠ partisan direction-of-country opinion (fixed in B-7); media trust ≠ general institutional trust (fixed in B-9/B-10). Each fix required a dedicated, independently calibrated CoreMemory field. The pattern: whenever a question's accuracy lagged despite correct persona politics, a construct conflation was the root cause.
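
A sketch of the resulting field layout, with illustrative names and values (the real schema is not published here). The point is that each construct gets its own independently calibrated field rather than being derived from a broader one:

# Constructs held as independent, individually calibrated fields.
# Names and values are illustrative, not the actual schema.
core_memory_excerpt = {
    "social_trust": 0.35,              # interpersonal trust; conflating it with
    "institutional_trust": 0.60,       #   institutional trust caused the B-4 regression
    "democracy_satisfaction": 0.45,    # separated from direction-of-country in B-7
    "media_trust_stance": "not much",  # dedicated field added in B-9
}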

03

Option-vocabulary anchoring — Q13 +16.8 pp in B-10

The single largest per-question gain in any sprint. Q13 (media trust) had four options: "a lot", "some", "not much", "none at all". Without vocabulary anchoring, personas anchored semantically on "some" and "not much" as the natural middle range, compressing the distribution. Rewriting persona attributes to mirror exact option phrases — and explicitly distinguishing each option from its neighbours ("not 'a lot', not 'some', not 'none at all', specifically 'not much'") — eliminated the compression. The technique proved general: any multi-option question with semantically adjacent options benefits from this treatment.
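
A sketch of the anchoring technique, under the assumption that persona attributes are rendered as natural-language statements; the template wording is illustrative:

# Option-calibrated vocabulary anchoring for Q13 (media trust).
OPTIONS = ["a lot", "some", "not much", "none at all"]

def media_trust_anchor(stance: str) -> str:
    # Mirror the exact option phrase and explicitly rule out its neighbours.
    ruled_out = ", ".join(f"not '{o}'" for o in OPTIONS if o != stance)
    return f"Trust in national news media: specifically '{stance}' ({ruled_out})."

print(media_trust_anchor("not much"))
# Trust in national news media: specifically 'not much'
# (not 'a lot', not 'some', not 'none at all').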

04

Remaining limitation: tail option suppression at n=60

Q09 (Abortion): D option ("should be illegal in all cases") at 0–2% vs Pew 8.6%. Q15 (Financial security): D option ("struggling") at 0% vs Pew 9%. At n=60, a 9% tail response requires at least 5–6 personas to hold that position — but the pool's demographic composition means these personas, when generated correctly, tend to moderate. Tail responses at <10% frequency are structurally under-represented at n=60 without explicit tail-persona injection. This is a known limitation, not a calibration failure.
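
The arithmetic behind the constraint, as a quick check against the Pew proportions quoted above:

# Expected persona counts for tail options at n=60 (binomial expectation n * p).
n = 60
for label, p in [("Q09 'illegal in all cases'", 0.086), ("Q15 'struggling'", 0.09)]:
    print(f"{label}: ~{n * p:.1f} of {n} personas")
# Q09 'illegal in all cases': ~5.2 of 60 personas
# Q15 'struggling': ~5.4 of 60 personas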

Reproducibility

Run it yourself.

All sprint audit manifests and raw result files are published in the public GitHub repository. Reference cohorts allow independent verification of the reported B-10 result.

# Study 1A — US Pew Replication
cd study_1a_pew_replication
python3 run_study.py \
  --simulatte-only \
  --cohort-size 60

# Reference: Sprint B-10
git checkout study-1a-sprint-b10
Sprint results
results/simulatte_results_*.json
Per-cohort raw distributions and accuracy scores for every sprint from baseline through B-10.
Baseline comparison
results/simulatte_results_pre_worldview.json
Pre-WorldviewAnchor baseline (57.6%) for comparison against optimised results.
LLM baseline
results/claude_sonnet_(baseline)_results.json
Naive Claude Sonnet result without Simulatte architecture — the unoptimised LLM comparison point.
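
For a first pass over the published result files, a sketch that inspects them without assuming their internal schema:

# List the published sprint result files and peek at their top-level structure.
import glob, json
for path in sorted(glob.glob("results/simulatte_results_*.json")):
    with open(path) as f:
        data = json.load(f)
    print(path, "->", list(data)[:5] if isinstance(data, dict) else type(data).__name__)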
View Study 1A on GitHub ↗