Simulatte's synthetic US general population tested against 15 published Pew Research Center American Trends Panel survey questions spanning economy, national direction, gun policy, immigration, climate, social trust, healthcare, abortion, media trust, and more.
60 demographically calibrated personas. 900 simulated responses. 10 optimisation sprints from a 57.6% unoptimised baseline. Final cohort-adjusted accuracy: 88.7% — 2.3 percentage points from the theoretical human self-consistency ceiling.
The January 2026 UC Berkeley synthetic population benchmark provides the published external comparison point. Both studies use the same distribution accuracy formula, enabling direct comparison. Simulatte B-10 exceeds the UC Berkeley result by 2.7 percentage points.
The cohort-adjusted figure (88.7%) combines the B-10 result on 14 non-media questions with the B-9 media trust result. Media trust (Q13) scored 80.5% in B-10, but the B-9 cohort — which ran immediately prior — achieved a higher per-question accuracy on Q13 before the B-10 vocabulary change was applied uniformly.
The adjustment is narrow in scope: it substitutes the better of two valid sprint results for a single question. The raw B-10 score (86.9%) is the unadjusted single-sprint figure. Both are reported.
15 questions from the Pew American Trends Panel. n=60 personas. Accuracy bars represent percentage of maximum possible score (100% = perfect distribution match).
Green bar = ≥90% · Standard bar = 80–90% · Grey bar = <80% · Cohort-adjusted mean: 88.7%
Side-by-side option-level distributions for six representative questions. Green bars = Simulatte simulated. Light bars = Pew Research published ground truth.
Each sprint introduced one or two targeted architectural changes. Regression sprints (B-4: −0.4 pp) are preserved in the record. The final cohort-adjusted result (88.7%) combines B-9 and B-10 cohort data.
| Sprint | Score | Δ | Key change |
|---|---|---|---|
| Baseline | 57.6% | — | Haiku generation, no political differentiation |
| A-3 | 67.7% | +10.1 | Basic political lean labels |
| ARCH-001 | 70.5% | +2.8 | WorldviewAnchor layer introduced |
| B-1 | 77.6% | +7.1 | current_conditions_stance; Sonnet generation |
| B-2/3 | 80.5% | +2.9 | Per-lean policy stance differentiation |
| B-4 | 80.1% | −0.4 | Social trust attempt (regression) |
| B-5 | 82.8% | +2.7 | Life experience signals for social trust |
| B-6 | 84.7% | +1.9 | Immigration vocabulary; contamination removal |
| B-7 | 85.3% | +0.6 | Democracy satisfaction construct separation |
| B-8 | 86.1% | +0.8 | Climate D-anchor; abortion option sharpening |
| B-9 | 87.6% | +1.5 | media_trust_stance as dedicated CoreMemory field |
| B-10 | 88.7% | +1.1 | Option-calibrated media trust anchors (adj.) |
accuracy = 1 − (Σ_i |real_i − sim_i|) / 2, where real_i and sim_i are the published and simulated response shares for option i. Both distributions sum to 1, so accuracy ranges from 0 (disjoint distributions) to 1 (perfect match).
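The metric above is one minus the total variation distance between the two response distributions. A minimal implementation (the option shares below are illustrative, not from the study):

```python
def distribution_accuracy(real, sim):
    """1 minus half the summed absolute per-option differences
    between the real and simulated response distributions."""
    assert abs(sum(real) - 1.0) < 1e-6 and abs(sum(sim) - 1.0) < 1e-6
    return 1.0 - sum(abs(r - s) for r, s in zip(real, sim)) / 2.0

# Hypothetical four-option question
real = [0.30, 0.40, 0.20, 0.10]   # published shares
sim  = [0.25, 0.45, 0.22, 0.08]   # simulated shares
print(round(distribution_accuracy(real, sim), 3))  # → 0.93
```

Because the per-option errors are halved, over- and under-prediction on paired options are not double-counted: shifting 5% of mass from one option to another costs 5 points of accuracy, not 10.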
The 91% human ceiling is sourced from Iyengar et al. (Stanford): approximately 9% of respondents change their answer when re-asked the same question under identical conditions.
| Ground truth | Pew American Trends Panel (publicly available) |
| Questions tested | 15 (economy, national direction, guns, immigration, climate, social trust, government, religion, abortion, racial equality, healthcare, democracy, media, AI, financial security) |
| Persona pool | 60 personas — US general population |
| Pool calibration | Age, income, education, geography, religion calibrated to Census distributions |
| Persona generation | claude-sonnet-4-6 |
| Survey response | claude-haiku-4-5-20251001 |
| Infrastructure | Simulatte Persona Generator API |
| Total responses | 900 (60 personas × 15 questions) |
| Human ceiling | 91.0% (Iyengar et al., Stanford) |
The architectural innovation that drove the largest structural gain. Each persona's CoreMemory contains four calibrated attitude dimensions in addition to demographic fields and political lean. These dimensions determine how the persona reasons about novel questions that don't map cleanly onto partisan identity.
How much the persona trusts government, courts, and public institutions — independently from partisan direction. A Trump voter who trusts the FBI differs fundamentally from one who doesn't. This field drives Q06 (social trust) and Q12 (democracy satisfaction) independently.
Preference for individual responsibility vs. collective solutions. Drives Q07 (role of government) and Q11 (healthcare) — both questions where the correct answer depends on how strongly the persona holds individual vs. collective frames rather than their partisan identity alone.
Comfort with demographic and cultural change — distinct from economic conservatism. Drives Q03 (gun laws), Q04 (immigration), Q10 (racial equality). A business-conservative Republican who is demographically comfortable responds differently to these questions than a cultural-change-averse Republican.
Strength of conviction that moral rules are absolute rather than situational. Drives Q08 (religion importance), Q09 (abortion). High moral foundationalism correlates with positions that don't budge based on circumstance — which is the correct distributional pattern for these questions in Pew data.
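The four dimensions can be pictured as a small typed record attached to each persona. This sketch is illustrative: the field names, 0–1 scales, and demographic subset are assumptions, not the actual Simulatte CoreMemory schema.

```python
from dataclasses import dataclass

@dataclass
class WorldviewAnchor:
    institutional_trust: float     # 0-1: trust in government, courts, institutions
    individualism: float           # 0-1: individual responsibility vs. collective solutions
    change_tolerance: float        # 0-1: comfort with demographic and cultural change
    moral_foundationalism: float   # 0-1: moral rules absolute vs. situational

@dataclass
class CoreMemory:
    # Subset of demographic fields for illustration
    age: int
    education: str
    political_lean: str
    worldview: WorldviewAnchor

# A cross-cutting profile: leans Republican, but high institutional
# trust and low individualism push it off the party-median answer
# on Q06/Q07-style questions.
example = CoreMemory(
    age=46,
    education="some college",
    political_lean="lean Republican",
    worldview=WorldviewAnchor(
        institutional_trust=0.8,
        individualism=0.3,
        change_tolerance=0.7,
        moral_foundationalism=0.5,
    ),
)
```

The point of the structure is that the four floats vary independently within a partisan label, so two personas with the same `political_lean` can diverge on any question driven by a single dimension.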
Four structural findings from Study 1A with implications for LLM-based synthetic survey methodology beyond this specific study.
The decisive structural change across ARCH-001 and B-1. Adding four calibrated attitude dimensions to persona CoreMemory — rather than simple political labels — allowed cross-cutting attitudes to emerge naturally. A persona who knows they are high-institutional-trust, low-individualism, high-change-tolerance, and moderate-moral-foundationalism reasons differently from a "Democrat" persona, because many Democrats don't share that profile. The label collapses variation; the dimensions preserve it.
The most persistent errors in Study 1A traced to construct conflation: social trust ≠ institutional trust (conflation caused B-4 regression); democracy satisfaction ≠ partisan direction-of-country opinion (fixed in B-7); media trust ≠ general institutional trust (fixed in B-9/B-10). Each fix required a dedicated, independently calibrated CoreMemory field. The pattern: whenever a question's accuracy lagged despite correct persona politics, a construct conflation was the root cause.
The single largest per-question gain in any sprint. Q13 (media trust) had four options: "a lot", "some", "not much", "none at all". Without vocabulary anchoring, personas anchored semantically on "some" and "not much" as the natural middle range, compressing the distribution. Rewriting persona attributes to mirror exact option phrases — and explicitly distinguishing each option from its neighbours ("not 'a lot', not 'some', not 'none at all', specifically 'not much'") — eliminated the compression. The technique proved general: any multi-option question with semantically adjacent options benefits from this treatment.
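The anchoring pattern described above can be generated mechanically for any option list. The helper below is a hypothetical sketch of that pattern, not the actual Simulatte prompt code; the stem wording is illustrative.

```python
def anchor_attribute(stem: str, target: str, options: list[str]) -> str:
    """Build a persona attribute that mirrors the exact option phrase
    and explicitly rules out every neighbouring option."""
    ruled_out = ", ".join(f"not '{o}'" for o in options if o != target)
    return f"{stem}: {ruled_out}, specifically '{target}'."

options = ["a lot", "some", "not much", "none at all"]
print(anchor_attribute("Trust in national news media", "not much", options))
# → Trust in national news media: not 'a lot', not 'some',
#   not 'none at all', specifically 'not much'.
```

Ruling out each adjacent option by name is what breaks the semantic compression toward the middle of the scale: the persona cannot drift to "some" without contradicting its own attribute text.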
Q09 (Abortion): D option ("should be illegal in all cases") at 0–2% vs Pew 8.6%. Q15 (Financial security): D option ("struggling") at 0% vs Pew 9%. At n=60, a 9% tail response requires at least 5–6 personas to hold that position — but the pool's demographic composition means these personas, when generated correctly, tend to moderate. Tail responses at <10% frequency are structurally under-represented at n=60 without explicit tail-persona injection. This is a known limitation, not a calibration failure.
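The arithmetic behind the limitation is worth making explicit. At n=60, a tail held by ~9% of the real population corresponds to roughly 5–6 personas in expectation, and under pure independent sampling the chance of drawing zero such personas is well under 1%; the observed 0% rates therefore reflect a structural moderation effect in generation, not sampling variance. A quick binomial check:

```python
def expected_tail_count(n: int, p: float) -> float:
    # Expected number of personas holding a tail position of frequency p
    return n * p

def p_all_miss(n: int, p: float) -> float:
    # Probability that independent sampling yields zero tail personas
    return (1 - p) ** n

n = 60
for label, p in [("Q09 'illegal in all cases'", 0.086),
                 ("Q15 'struggling'", 0.09)]:
    print(f"{label}: expected {expected_tail_count(n, p):.1f} personas, "
          f"P(zero under sampling) = {p_all_miss(n, p):.4f}")
```

Running this shows expected counts of about 5.2 and 5.4 personas and near-zero miss probabilities, which is why the text attributes the gap to persona moderation rather than to n=60 sampling noise alone.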
All sprint audit manifests and raw result files are published in the public GitHub repository. Reference cohorts allow independent verification of the reported B-10 result.
```shell
# Study 1A — US Pew Replication
cd study_1a_pew_replication
python3 run_study.py \
  --simulatte-only \
  --cohort-size 60

# Reference: Sprint B-10
git checkout study-1a-sprint-b10
```