LLM Comparison — India Pew Opinion Survey

LLMs compared.

Simulatte DEEP was benchmarked against 10 leading large language models on the India Pew Opinion Survey. All models were given identical demographic context — no calibration, no cognitive loop. This is the fairest possible baseline.

85.3%  Simulatte DEEP (Sprint A-22 result)
75.6%  Best LLM (GPT-4o), 9.7 pp behind Simulatte
4.9×   Nx vs. avg LLM (closer to human ceiling)
5,878  Total API calls (SHA-256 verified)

The human ceiling of 91.0% (Iyengar et al., Stanford) represents irreducible self-inconsistency in survey responses — not a limitation of Simulatte's measurement. All results are independently verifiable via the public GitHub repository.

Full Results

The leaderboard.

10 LLMs were scored on 15 India Pew questions using 40 demographically calibrated personas. Distribution accuracy = 1 − (Σ|real − sim| / 2). Higher is better. Human ceiling: 91.0%.
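
As a concrete reference, here is a minimal Python sketch of this scoring formula; the option distributions in the example are illustrative placeholders, not values from the study.

def distribution_accuracy(real, sim):
    # 1 - (sum over options of |real - sim|) / 2, with both distributions given as fractions summing to 1
    total_deviation = sum(abs(real[opt] - sim.get(opt, 0.0)) for opt in real)
    return 1.0 - total_deviation / 2

# Illustrative 4-option question (placeholder numbers, not study data)
real = {"A": 0.42, "B": 0.31, "C": 0.18, "D": 0.09}
sim  = {"A": 0.38, "B": 0.34, "C": 0.17, "D": 0.11}
print(f"{distribution_accuracy(real, sim):.1%}")   # 95.0%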

#    Model                      Score    Gap to 91%
     Human ceiling (Stanford)   91.0%
1    Simulatte DEEP (A-22)      85.3%     5.7 pp
2    GPT-4o                     75.6%    15.4 pp
3    GPT-5 Mini                 74.3%    16.7 pp
4    GPT-4o Mini                73.8%    17.2 pp
5    GPT-5                      72.4%    18.6 pp
6    Claude Haiku 4.5           71.9%    19.1 pp
7    Claude Sonnet 4.6          70.2%    20.8 pp
8    Gemini 3 Pro               44.3%    46.7 pp
9    Gemini 3 Flash             43.9%    47.1 pp
10   Gemini 2.5 Flash           43.5%    47.5 pp

Bar lengths are scaled so that the 91.0% human ceiling equals 100%. Simulatte's bar reaches 93.7% of the way to the ceiling; GPT-4o's reaches 83.1%.

Performance Gap

The Nx ratio.

The Nx ratio measures how many times larger a given LLM's error is than Simulatte's, where error is the distance from the human ceiling. Higher means Simulatte is further ahead. The ceiling is 91.0% — not 100% — so errors are measured relative to that benchmark.

Nx vs. average LLM
4.9×
Simulatte error: 5.7 pp. Average LLM error: 27.7 pp. The average LLM is 4.9× further from the human ceiling than Simulatte.
Nx vs. best LLM (GPT-4o)
2.7×
GPT-4o error: 15.4 pp. Simulatte error: 5.7 pp. Simulatte is 2.7× closer to the ceiling than the best-performing LLM.
Nx vs. Gemini 2.5 Flash
8.3×
Gemini 2.5 Flash error: 47.5 pp. The most cost-efficient Gemini model is 8.3× further from the ceiling than Simulatte.
How Nx is calculated

Nx = LLM error ÷ Simulatte error. Error = 91% − model score. Simulatte error = 91.0 − 85.3 = 5.7 pp. Average LLM error = 91.0 − 63.3 = 27.7 pp. Nx = 27.7 ÷ 5.7 = 4.86 ≈ 4.9×.
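
The same arithmetic in a few lines of Python, using the scores published in the leaderboard above:

CEILING = 91.0    # human ceiling, in percent
SIMULATTE = 85.3  # Simulatte DEEP score (Sprint A-22)

def nx(llm_score):
    # Nx = LLM error / Simulatte error, where error = ceiling - score (percentage points)
    return (CEILING - llm_score) / (CEILING - SIMULATTE)

print(round(nx(63.3), 2))  # average LLM score  -> 4.86, reported as 4.9x
print(round(nx(75.6), 2))  # GPT-4o (best LLM)  -> 2.7, reported as 2.7x
print(round(nx(43.5), 2))  # Gemini 2.5 Flash   -> 8.33, reported as 8.3x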

Error decomposition

Distance from human ceiling (91%) — lower is better. Each segment represents residual error from ceiling.

Simulatte DEEP        5.7 pp
GPT-4o               15.4 pp
GPT-5 Mini           16.7 pp
GPT-4o Mini          17.2 pp
GPT-5                18.6 pp
Claude Haiku 4.5     19.1 pp
Claude Sonnet 4.6    20.8 pp
Gemini 3 Pro         46.7 pp
Gemini 3 Flash       47.1 pp
Gemini 2.5 Flash     47.5 pp

Bars represent percentage-point distance from the 91.0% human ceiling. Simulatte's 5.7 pp bar is proportionally shorter than all LLM bars. Gemini errors are more than 8× larger.

Protocol

LLM baseline protocol.

Every LLM received the same stripped-down demographic description of each persona in a single API call — no cognitive loop, no memory, no calibration. This is a fair, generous baseline matching what a survey researcher would know about a respondent.

SYSTEM: You are {name}, a {age}-year-old {religion} {gender} living in {city}, {state}.
        Education: {education}. Employment: {employment}. Income: {income_bracket}.
        Caste: {caste}. Politically: {political_lean_description}.
        Answer the following survey question exactly as {name} would.

USER:   {question_text}
        Options: A) ... B) ... C) ... D) ...
        Respond with ONLY the single option letter (A, B, C, or D).
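
To make the protocol concrete, here is a sketch of one baseline call per persona-question pair. The persona and question field names and the call_llm helper are illustrative placeholders standing in for whichever provider SDK was used; this is not the study's actual code.

SYSTEM_TEMPLATE = (
    "You are {name}, a {age}-year-old {religion} {gender} living in {city}, {state}. "
    "Education: {education}. Employment: {employment}. Income: {income_bracket}. "
    "Caste: {caste}. Politically: {political_lean_description}. "
    "Answer the following survey question exactly as {name} would."
)

def ask_baseline(call_llm, model, persona, question):
    # One stateless API call per (persona, question): no memory, no calibration, no cognitive loop.
    system_prompt = SYSTEM_TEMPLATE.format(**persona)
    user_prompt = (
        question["text"] + "\n"
        "Options: " + question["options"] + "\n"
        "Respond with ONLY the single option letter (A, B, C, or D)."
    )
    reply = call_llm(model=model, system=system_prompt, user=user_prompt)
    return reply.strip()[:1]   # keep only the option letter

Aggregating the returned letters across the 40 personas gives the simulated answer distribution that is then scored against the real Pew distribution.
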
Why this is a fair baseline

LLMs received religion, caste, age, education, income, region, and political lean — the full demographic profile a survey researcher would have. The prompt is identical for all 10 LLMs and uses the same 40 personas as Simulatte. The only variable is the model receiving the prompt.

What Simulatte does differently

LLM baseline
  • Single API call per question
  • No memory between questions
  • Demographic prompt only
  • No calibration against real data
  • Generic alignment training
Simulatte DEEP
  • Perceive → Reflect → Decide cognitive loop
  • Persistent CoreMemory across questions
  • WorldviewAnchor attitude dimensions
  • 22 sprints of calibration on real population data
  • Option-vocabulary anchoring to survey language

Run details

Main run
  Run ID: llm-india-20260407-213325-677319f7
  Entries: 4,200
  Models: Claude Haiku, Claude Sonnet, GPT-4o, GPT-4o Mini, Gemini 2.5 Flash, Gemini 3 Flash, Gemini 3 Pro
  Date (UTC): 2026-04-07 21:33

Supplemental run
  Run ID: llm-india-20260407-221604-f2a991f1
  Entries: 1,678
  Models: GPT-5, GPT-5 Mini
  Date (UTC): 2026-04-07 22:16

Total: 5,878 entries across 10 LLMs
Analysis

Key findings.

01

GPT-5 underperforms GPT-4o

GPT-5 scores 72.4% versus GPT-4o's 75.6% — a 3.2 pp gap in the wrong direction. Raw model capability does not transfer to cultural calibration. GPT-5 applies heavier alignment-style balancing on politically sensitive Indian questions, flattening BJP/opposition distributions toward artificial centrism. More capable models are more constrained by RLHF's Western political frame, not less.

02

Gemini clusters at 43–44% — a structural failure

All three Gemini variants score within 0.8 pp of each other (43.5%–44.3%) regardless of model size. This is not a scale limitation — Gemini 3 Pro is a flagship model. It reflects a structural failure in handling Indian cultural identity and socio-political attitudes. Across questions involving religious identity, caste, and political institutions, Gemini produces near-random distributions that average close to 50% accuracy. Scale cannot fix a cultural blind spot.

03

The hardest questions expose RLHF ceilings

Three question categories universally challenge LLMs. (1) Government trust (in09): LLMs conflate institutional trust with political approval — Indians distinguish them more sharply than Western populations do. (2) INC anger calibration (in04): LLMs underproduce strong anti-INC sentiment in Hindu-majority rural personas, reflecting a trained reluctance to generate politically charged negative affect. (3) Strong-leader preference (in07): India scores among the world's highest at 80% support for strong leaders — a distribution LLMs resist generating because it conflicts with RLHF-internalized liberal democratic norms.

04

Demographic context alone is insufficient

Every LLM received identical demographic profiles that a survey researcher would consider comprehensive: religion, caste, age, education, income, region, political lean. None broke 76%. The gap between 75.6% (best LLM) and 85.3% (Simulatte) — 9.7 pp — represents what cognitive architecture, construct independence, and calibration add. Cultural accuracy requires more than demographic input; it requires attitude structures that encode how those demographics have historically expressed opinion in specific political contexts.

05

Model families cluster — calibration does not

GPT models cluster between 72.4–75.6%. Claude models cluster at 70.2–71.9%. Gemini models cluster at 43.5–44.3%. These within-family clusters are tight (a 3.2 pp spread for GPT, 1.7 pp for Claude, 0.8 pp for Gemini), suggesting that cultural calibration failure is primarily a training alignment issue, not a parameter scale issue. Simulatte at 85.3% sits 9.7 pp above the best LLM family ceiling — a gap produced by 22 sprints of calibration, not by a larger model.

Verification

Audit integrity.

All 5,878 API calls are logged with SHA-256 prompt and response hashes. The stripped audit file has prompt text removed to protect the proprietary persona pool while remaining publicly verifiable. Any third party can confirm our published results correspond exactly to the prompts sent.

Root hash — stripped_audit.jsonl (SHA-256)
sha256:a76aa717a0971961220f314451fe23ac623bf01cb8ca790f39a6ad5ed273d3f0
Stripped audit log
audit/stripped_audit.jsonl
All 5,878 entries: timestamps, SHA-256 hashes of prompts and responses, raw answers. Prompt text removed to protect persona pool.
Audit manifest
audit/audit_manifest.json
Root hash, run IDs, entry counts per run, model list. Cross-references both run IDs against the main and supplemental runs.
Integrity verifier
audit/verify.py
Standalone Python script that confirms stripped_audit.jsonl is unmodified by recomputing and comparing the root hash.
Model score table
results/llm_scores.json
Distribution accuracy for all 10 models with per-question breakdown. Referenced by this report for all numerical claims.
Survey questions
questions.json
All 15 India Pew survey questions used in this benchmark. Sourced from publicly available Pew Research Center data.
NDA replication
Full prompt corpus
Full prompt text and persona definitions available under NDA for researchers who wish to independently replicate this benchmark.
Verification command

python3 audit/verify.py — Recomputes SHA-256 hash of stripped_audit.jsonl and confirms it matches the published root hash above. Outputs PASS or FAIL with entry count.
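
For readers who want to see the shape of the check before cloning the repository, here is a minimal sketch of that kind of file-hash verification. It assumes the manifest stores the root hash under a key named root_hash; the actual field names, and any extra checks verify.py performs, may differ.

import hashlib
import json

def file_sha256(path):
    # Stream the file and return its hex SHA-256 digest.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

manifest = json.load(open("audit/audit_manifest.json"))
expected = manifest["root_hash"].removeprefix("sha256:")   # key name assumed
actual = file_sha256("audit/stripped_audit.jsonl")
entries = sum(1 for _ in open("audit/stripped_audit.jsonl"))
print("PASS" if actual == expected else "FAIL", f"({entries} entries)")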

Reproducibility

Reproduce this study.

The audit files, verifier script, and questions are published in the public GitHub repository. The stripped audit allows verification without requiring access to the proprietary persona pool. Researchers requesting full replication access can contact us for NDA-gated materials.

GitHub repository

All audit files, manifests, verifier script, and results are public. Clone the repository and run verify.py to confirm audit integrity independently.

github.com/Iqbalahmed7/simulatte-credibility ↗

LLM comparison study

This benchmark lives in the studies/llm_comparison/ directory of the repository. It contains results, audit files, questions, and the verification script.

The Simulatte-only sprint progression for Study 1B (from which the 85.3% score is sourced) lives in studies/pew_india/.
