Simulatte DEEP was benchmarked against 10 leading large language models on the India Pew Opinion Survey. All models were given identical demographic context: no calibration, no cognitive loop. This is the fairest possible baseline.
The human ceiling of 91.0% (Iyengar et al., Stanford) represents irreducible self-inconsistency in survey responses — not a limitation of Simulatte's measurement. All results are independently verifiable via the public GitHub repository.
10 LLMs scored on 15 India Pew questions using 40 demographically calibrated personas. Distribution accuracy = 1 − (Σ|real − sim| / 2). Higher is better. Human ceiling: 91.0%.
Bar lengths are scaled so that the 91.0% human ceiling equals 100%. Simulatte's bar reaches 93.7% of the way to the ceiling; GPT-4o's reaches 83.1%.
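The accuracy metric above (1 − (Σ|real − sim| / 2), i.e. one minus the total variation distance between the real and simulated answer distributions) can be sketched as follows. The distributions shown are illustrative placeholders, not actual survey data:

```python
def distribution_accuracy(real, sim):
    # 1 - total variation distance between the real and simulated
    # answer distributions; both dicts map option letter -> proportion
    # and each should sum to 1.0.
    options = set(real) | set(sim)
    tvd = sum(abs(real.get(o, 0.0) - sim.get(o, 0.0)) for o in options) / 2
    return 1.0 - tvd

# Illustrative distributions (not actual survey data):
real = {"A": 0.50, "B": 0.30, "C": 0.15, "D": 0.05}
sim = {"A": 0.40, "B": 0.35, "C": 0.15, "D": 0.10}
print(round(distribution_accuracy(real, sim), 3))  # 0.9
```

Dividing by 2 keeps the metric in [0, 1]: a perfect match scores 1.0, and fully disjoint distributions score 0.0.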
The Nx ratio measures how many times larger a given LLM's residual error is than Simulatte's; higher means Simulatte is further ahead. Errors are measured as distance from the 91.0% human ceiling, not from 100%.
Nx = LLM error ÷ Simulatte error. Error = 91% − model score. Simulatte error = 91.0 − 85.3 = 5.7 pp. Average LLM error = 91.0 − 63.3 = 27.7 pp. Nx = 27.7 ÷ 5.7 = 4.86 ≈ 4.9×.
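The Nx arithmetic above can be reproduced in a few lines; the helper name is ours, the constants come from the reported scores:

```python
CEILING = 91.0            # human test-retest ceiling (%)
SIMULATTE_SCORE = 85.3    # Simulatte's reported accuracy (%)

def nx_ratio(llm_score):
    # Nx = LLM error / Simulatte error, both measured as
    # percentage-point distance below the human ceiling.
    llm_error = CEILING - llm_score            # e.g. 27.7 pp for the LLM average
    sim_error = CEILING - SIMULATTE_SCORE      # 5.7 pp
    return llm_error / sim_error

print(round(nx_ratio(63.3), 2))  # average LLM: 27.7 / 5.7 -> 4.86
```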
Distance from human ceiling (91%) — lower is better. Each segment represents residual error from ceiling.
Bars represent percentage-point distance from the 91.0% human ceiling. Simulatte's 5.7 pp bar is proportionally shorter than every LLM bar; Gemini's errors are more than 8× larger.
Every LLM received the same stripped demographic description per persona in a single API call — no cognitive loop, no memory, no calibration. This is a fair, generous baseline matching what a survey researcher would know about a respondent.
SYSTEM: You are {name}, a {age}-year-old {religion} {gender} living in {city}, {state}.
Education: {education}. Employment: {employment}. Income: {income_bracket}.
Caste: {caste}. Politically: {political_lean_description}.
Answer the following survey question exactly as {name} would.
USER: {question_text}
Options: A) ... B) ... C) ... D) ...
Respond with ONLY the single option letter (A, B, C, or D).
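A minimal sketch of how this single-call baseline prompt could be assembled from a persona record. The helper name, field names, and return shape (OpenAI-style chat messages) are our assumptions; the wording mirrors the template above:

```python
def build_messages(persona: dict, question_text: str, options: list) -> list:
    # Render the stripped demographic template into a system message,
    # and the survey question plus lettered options into a user message.
    system = (
        f"You are {persona['name']}, a {persona['age']}-year-old "
        f"{persona['religion']} {persona['gender']} living in "
        f"{persona['city']}, {persona['state']}.\n"
        f"Education: {persona['education']}. Employment: {persona['employment']}. "
        f"Income: {persona['income_bracket']}.\n"
        f"Caste: {persona['caste']}. Politically: {persona['political_lean_description']}.\n"
        f"Answer the following survey question exactly as {persona['name']} would."
    )
    lettered = " ".join(f"{letter}) {opt}" for letter, opt in zip("ABCD", options))
    user = (
        f"{question_text}\n"
        f"Options: {lettered}\n"
        "Respond with ONLY the single option letter (A, B, C, or D)."
    )
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```

Because the same two messages go to every model in one API call, with no follow-up turns, any score difference is attributable to the model rather than the prompt.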
LLMs received religion, caste, age, education, income, region, and political lean — the full demographic profile a survey researcher would have. The prompt is identical for all 10 LLMs and uses the same 40 personas as Simulatte. The only variable is the model receiving the prompt.
| Run | Run ID | Entries | Models | Date (UTC) |
|---|---|---|---|---|
| Main | llm-india-20260407-213325-677319f7 | 4,200 | Claude Haiku, Claude Sonnet, GPT-4o, GPT-4o Mini, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro | 2026-04-07 21:33 |
| Supplemental | llm-india-20260407-221604-f2a991f1 | 1,678 | GPT-5, GPT-5 Mini | 2026-04-07 22:16 |
| Total | — | 5,878 | 10 LLMs | — |
GPT-5 scores 72.4% versus GPT-4o's 75.6% — a 3.2 pp gap in the wrong direction. Raw model capability does not transfer to cultural calibration. GPT-5 applies heavier alignment-style balancing on politically sensitive Indian questions, flattening BJP/opposition distributions toward artificial centrism. More capable models are more constrained by RLHF's Western political frame, not less.
All three Gemini variants score within 0.8 pp of each other (43.5%–44.3%) regardless of model size. This is not a scale limitation — Gemini 3 Pro is a flagship model. It reflects a structural failure in handling Indian cultural identity and socio-political attitudes. Across questions involving religious identity, caste, and political institutions, Gemini produces near-random distributions that average close to 50% accuracy. Scale cannot fix a cultural blind spot.
Three question categories universally challenge LLMs. (1) Government trust (in09): LLMs conflate institutional trust with political approval — Indians distinguish them more sharply than Western populations do. (2) INC anger calibration (in04): LLMs underproduce strong anti-INC sentiment in Hindu-majority rural personas, reflecting a trained reluctance to generate politically charged negative affect. (3) Strong-leader preference (in07): India scores among the world's highest at 80% support for strong leaders — a distribution LLMs resist generating because it conflicts with RLHF-internalized liberal democratic norms.
Every LLM received identical demographic profiles that a survey researcher would consider comprehensive: religion, caste, age, education, income, region, political lean. None broke 76%. The gap between 75.6% (best LLM) and 85.3% (Simulatte) — 9.7 pp — represents what cognitive architecture, construct independence, and calibration add. Cultural accuracy requires more than demographic input; it requires attitude structures that encode how those demographics have historically expressed opinion in specific political contexts.
GPT models cluster between 72.4–75.6%. Claude models cluster at 70.2–71.9%. Gemini models cluster at 43.5–44.3%. These within-family clusters are tight (under 2 pp spread for GPT, under 2 pp for Claude, under 1 pp for Gemini), suggesting that cultural calibration failure is primarily a training alignment issue, not a parameter scale issue. Simulatte at 85.3% sits 9.7 pp above the best LLM family ceiling, a gap produced by 22 sprints of calibration, not by a larger model.
All 5,878 API calls are logged with SHA-256 prompt and response hashes. The stripped audit file has prompt text removed to protect the proprietary persona pool while remaining publicly verifiable. Any third party can confirm our published results correspond exactly to the prompts sent.
python3 audit/verify.py — Recomputes SHA-256 hash of stripped_audit.jsonl and confirms it matches the published root hash above. Outputs PASS or FAIL with entry count.
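The actual implementation of verify.py may differ; the check it describes can be sketched like this, assuming the published root hash is the SHA-256 of the stripped audit file's raw bytes (file path and expected hash below are placeholders):

```python
import hashlib

def verify(audit_path: str, expected_root_hash: str) -> bool:
    # Recompute the SHA-256 of the stripped audit file and compare it
    # to the published root hash; report entry count alongside PASS/FAIL.
    with open(audit_path, "rb") as f:
        data = f.read()
    root = hashlib.sha256(data).hexdigest()
    entries = sum(1 for line in data.splitlines() if line.strip())
    status = "PASS" if root == expected_root_hash else "FAIL"
    print(f"{status}: {entries} entries, root hash {root}")
    return status == "PASS"
```

Hashing the whole file means any tampering with even one of the 5,878 logged entries changes the root hash and flips the result to FAIL.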
The audit files, verifier script, and questions are published in the public GitHub repository. The stripped audit allows verification without requiring access to the proprietary persona pool. Researchers requesting full replication access can contact us for NDA-gated materials.
All audit files, manifests, verifier script, and results are public. Clone the repository and run verify.py to confirm audit integrity independently.
github.com/Iqbalahmed7/simulatte-credibility
This benchmark lives in the studies/llm_comparison/ directory of the repository. It contains results, audit files, questions, and the verification script.
The Simulatte-only sprint progression for Study 1B (from which the 85.3% score is sourced) lives in studies/pew_india/.