Simulatte's synthetic US general population tested against 15 published Pew Research Center American Trends Panel survey questions spanning economy, national direction, gun policy, immigration, climate, social trust, healthcare, abortion, media trust, and more.
60 demographically calibrated personas. 900 simulated responses. 10 optimisation sprints from a 57.6% unoptimised baseline. Final cohort-adjusted accuracy: 88.7% — 2.3 percentage points from the theoretical human self-consistency ceiling.
The January 2026 UC Berkeley synthetic population benchmark provides the published external comparison point. Both studies use the same distribution accuracy formula, enabling direct comparison. Simulatte B-10 exceeds the UC Berkeley result by 2.7 percentage points.
The cohort-adjusted figure (88.7%) combines the B-10 result on 14 non-media questions with the B-9 media trust result. Media trust (Q13) scored 80.5% in B-10, but the B-9 cohort — which ran immediately prior — achieved a higher per-question accuracy on Q13 before the B-10 vocabulary change was applied uniformly.
The adjustment is narrow in scope: it substitutes the better of two valid sprint results for a single question. The raw B-10 score (86.9%) is the unadjusted single-sprint figure. Both are reported.
15 questions from the Pew American Trends Panel. n=60 personas. Accuracy bars represent percentage of maximum possible score (100% = perfect distribution match).
Green bar = ≥90% · Standard bar = 80–90% · Grey bar = <80% · Cohort-adjusted mean: 88.7%
Side-by-side option-level distributions for six representative questions. Green bars = Simulatte simulated. Light bars = Pew Research published ground truth.
Each sprint introduced one or two targeted architectural changes. Regression sprints (B-4: −0.4 pp) are preserved in the record. The final cohort-adjusted result (88.7%) combines B-9 and B-10 cohort data.
| Sprint | Score | Δ | Key change |
|---|---|---|---|
| Baseline | 57.6% | — | Haiku generation, no political differentiation |
| A-3 | 67.7% | +10.1 | Basic political lean labels |
| ARCH-001 | 70.5% | +2.8 | WorldviewAnchor layer introduced |
| B-1 | 77.6% | +7.1 | current_conditions_stance; Sonnet generation |
| B-2/3 | 80.5% | +2.9 | Per-lean policy stance differentiation |
| B-4 | 80.1% | −0.4 | Social trust attempt (regression) |
| B-5 | 82.8% | +2.7 | Life experience signals for social trust |
| B-6 | 84.7% | +1.9 | Immigration vocabulary; contamination removal |
| B-7 | 85.3% | +0.6 | Democracy satisfaction construct separation |
| B-8 | 86.1% | +0.8 | Climate D-anchor; abortion option sharpening |
| B-9 | 87.6% | +1.5 | media_trust_stance as dedicated CoreMemory field |
| B-10 | 88.7% | +1.1 | Option-calibrated media trust anchors (adj.) |
accuracy = 1 − (Σ_i |real_i − sim_i|) / 2, where real_i and sim_i are the published and simulated response shares for option i. Both distributions sum to 1, so accuracy ranges from 0 (disjoint distributions) to 1 (perfect match).
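The metric above is one minus the total variation distance between the two response distributions. A minimal implementation (the option shares below are illustrative, not from the study):

```python
def distribution_accuracy(real, sim):
    """1 minus half the summed absolute per-option differences
    between the real and simulated response distributions."""
    assert abs(sum(real) - 1.0) < 1e-6 and abs(sum(sim) - 1.0) < 1e-6
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

# Hypothetical four-option question
real = [0.30, 0.40, 0.20, 0.10]   # published shares
sim  = [0.25, 0.45, 0.22, 0.08]   # simulated shares
print(round(distribution_accuracy(real, sim), 3))  # → 0.93
```

Because the per-option errors are halved, over- and under-prediction on paired options are not double-counted: shifting 5% of mass from one option to another costs 5 points of accuracy, not 10.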
The 91% human ceiling is sourced from Iyengar et al. (Stanford): approximately 9% of respondents change their answer when re-asked the same question under identical conditions.
| Ground truth | Pew American Trends Panel (publicly available) |
| Questions tested | 15 (economy, national direction, guns, immigration, climate, social trust, government, religion, abortion, racial equality, healthcare, democracy, media, AI, financial security) |
| Persona pool | 60 personas — US general population |
| Pool calibration | Age, income, education, geography, religion calibrated to Census distributions |
| Persona generation | claude-sonnet-4-6 |
| Survey response | claude-haiku-4-5-20251001 |
| Infrastructure | Simulatte Persona Generator API |
| Total responses | 900 (60 personas × 15 questions) |
| Human ceiling | 91.0% (Iyengar et al., Stanford) |
The architectural innovation that drove the largest structural gain. Each persona's CoreMemory contains four calibrated attitude dimensions in addition to demographic fields and political lean. These dimensions determine how the persona reasons about novel questions that don't map cleanly onto partisan identity.
How much the persona trusts government, courts, and public institutions — independently from partisan direction. A Trump voter who trusts the FBI differs fundamentally from one who doesn't. This field drives Q06 (social trust) and Q12 (democracy satisfaction) independently.
Preference for individual responsibility vs. collective solutions. Drives Q07 (role of government) and Q11 (healthcare) — both questions where the correct answer depends on how strongly the persona holds individual vs. collective frames rather than their partisan identity alone.
Comfort with demographic and cultural change — distinct from economic conservatism. Drives Q03 (gun laws), Q04 (immigration), Q10 (racial equality). A business-conservative Republican who is demographically comfortable responds differently to these questions than a cultural-change-averse Republican.
Strength of conviction that moral rules are absolute rather than situational. Drives Q08 (religion importance), Q09 (abortion). High moral foundationalism correlates with positions that don't budge based on circumstance — which is the correct distributional pattern for these questions in Pew data.
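The four dimensions can be pictured as a small typed record attached to each persona. This sketch is illustrative: the field names, 0–1 scales, and demographic subset are assumptions, not the actual Simulatte CoreMemory schema.

```python
from dataclasses import dataclass

@dataclass
class WorldviewAnchor:
    institutional_trust: float     # 0-1: trust in government, courts, institutions
    individualism: float           # 0-1: individual responsibility vs. collective solutions
    change_tolerance: float        # 0-1: comfort with demographic and cultural change
    moral_foundationalism: float   # 0-1: moral rules absolute vs. situational

@dataclass
class CoreMemory:
    # Subset of demographic fields for illustration
    age: int
    education: str
    political_lean: str
    worldview: WorldviewAnchor

# A cross-cutting profile: leans Republican, but high institutional
# trust and low individualism push it off the party-median answer
# on Q06/Q07-style questions.
example = CoreMemory(
    age=46,
    education="some college",
    political_lean="lean Republican",
    worldview=WorldviewAnchor(
        institutional_trust=0.8,
        individualism=0.3,
        change_tolerance=0.7,
        moral_foundationalism=0.5,
    ),
)
```

The point of the structure is that the four floats vary independently within a partisan label, so two personas with the same `political_lean` can diverge on any question driven by a single dimension.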
Four structural findings from Study 1A with implications for LLM-based synthetic survey methodology beyond this specific study.
The decisive structural change across ARCH-001 and B-1. Adding four calibrated attitude dimensions to persona CoreMemory — rather than simple political labels — allowed cross-cutting attitudes to emerge naturally. A persona who knows they are high-institutional-trust, low-individualism, high-change-tolerance, and moderate-moral-foundationalism reasons differently from a "Democrat" persona, because many Democrats don't share that profile. The label collapses variation; the dimensions preserve it.
The most persistent errors in Study 1A traced to construct conflation: social trust ≠ institutional trust (conflation caused B-4 regression); democracy satisfaction ≠ partisan direction-of-country opinion (fixed in B-7); media trust ≠ general institutional trust (fixed in B-9/B-10). Each fix required a dedicated, independently calibrated CoreMemory field. The pattern: whenever a question's accuracy lagged despite correct persona politics, a construct conflation was the root cause.
The single largest per-question gain in any sprint. Q13 (media trust) had four options: "a lot", "some", "not much", "none at all". Without vocabulary anchoring, personas anchored semantically on "some" and "not much" as the natural middle range, compressing the distribution. Rewriting persona attributes to mirror exact option phrases — and explicitly distinguishing each option from its neighbours ("not 'a lot', not 'some', not 'none at all', specifically 'not much'") — eliminated the compression. The technique proved general: any multi-option question with semantically adjacent options benefits from this treatment.
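The anchoring pattern described above can be generated mechanically for any option list. The helper below is a hypothetical sketch of that pattern, not the actual Simulatte prompt code; the stem wording is illustrative.

```python
def anchor_attribute(stem: str, target: str, options: list[str]) -> str:
    """Build a persona attribute that mirrors the exact option phrase
    and explicitly rules out every neighbouring option."""
    ruled_out = ", ".join(f"not '{o}'" for o in options if o != target)
    return f"{stem}: {ruled_out}, specifically '{target}'."

options = ["a lot", "some", "not much", "none at all"]
print(anchor_attribute("Trust in national news media", "not much", options))
# → Trust in national news media: not 'a lot', not 'some',
#   not 'none at all', specifically 'not much'.
```

Ruling out each adjacent option by name is what breaks the semantic compression toward the middle of the scale: the persona cannot drift to "some" without contradicting its own attribute text.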
Q09 (Abortion): D option ("should be illegal in all cases") at 0–2% vs Pew 8.6%. Q15 (Financial security): D option ("struggling") at 0% vs Pew 9%. At n=60, a 9% tail response requires at least 5–6 personas to hold that position — but the pool's demographic composition means these personas, when generated correctly, tend to moderate. Tail responses at <10% frequency are structurally under-represented at n=60 without explicit tail-persona injection. This is a known limitation, not a calibration failure.
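The arithmetic behind the limitation is worth making explicit. At n=60, a tail held by ~9% of the real population corresponds to roughly 5–6 personas in expectation, and under pure independent sampling the chance of drawing zero such personas is well under 1%; the observed 0% rates therefore reflect a structural moderation effect in generation, not sampling variance. A quick binomial check:

```python
def expected_tail_count(n: int, p: float) -> float:
    # Expected number of personas holding a tail position of frequency p
    return n * p

def p_all_miss(n: int, p: float) -> float:
    # Probability that independent sampling yields zero tail personas
    return (1 - p) ** n

n = 60
for label, p in [("Q09 'illegal in all cases'", 0.086),
                 ("Q15 'struggling'", 0.09)]:
    print(f"{label}: expected {expected_tail_count(n, p):.1f} personas, "
          f"P(zero under sampling) = {p_all_miss(n, p):.4f}")
```

Running this shows expected counts of about 5.2 and 5.4 personas and near-zero miss probabilities, which is why the text attributes the gap to persona moderation rather than to n=60 sampling noise alone.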
All sprint audit manifests and raw result files are published in the public GitHub repository. Reference cohorts allow independent verification of the reported B-10 result.
```shell
# Study 1A — US Pew Replication
cd study_1a_pew_replication
python3 run_study.py \
  --simulatte-only \
  --cohort-size 60

# Reference: Sprint B-10
git checkout study-1a-sprint-b10
```