
References

Key papers that inform vcrowd's methodology. This is not an exhaustive list; the field is moving quickly, and we update our approach as new evidence becomes available.

Benchmark datasets

  • Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., & Hashimoto, T. (2023). Whose Opinions Do Language Models Reflect? ICML 2023.

    Introduced OpinionQA: 1,498 questions, 60 demographic subgroups, evaluated with Wasserstein distance between model and human response distributions. The field's primary benchmark (a sketch of the metric follows this list).

  • Suh, M., et al. (2025). SubPOP: Subgroup-Specific Population Simulation. ACL 2025.

    3,362 questions from ATP waves 61–132. Fine-tuning on SubPOP reduces Wasserstein distance gap by 32–46%.

  • Durmus, E., et al. (Anthropic) (2024). Towards Measuring the Representation of Subjective Global Opinions in Language Models. arXiv preprint.

    GlobalOpinionQA: 2,556 questions, 100+ countries. Uses Jensen-Shannon divergence (also covered in the sketch below). The only major cross-national benchmark.

  • WorldValuesBench authors (2024). WorldValuesBench. arXiv preprint.

    Largest scale: 20+ million examples from WVS Wave 7, 94,728 participants, 64 countries.

  • Toubia, O., et al. (Columbia) (2025). Twin-2K-500: A Digital Twin Benchmark. Marketing Science 44(6).

    2,058 participants, 500+ questions. Average individual-level correlation of ~0.2. Freely available on HuggingFace.
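
A minimal sketch of the two distributional metrics named above: the Wasserstein distance OpinionQA reports over ordered answer options, and the Jensen-Shannon divergence used by GlobalOpinionQA. The example distributions are invented for illustration and are not taken from either dataset.

```python
# Illustrative comparison of a simulated and a human answer distribution over
# the same ordered answer options; the numbers below are made up.
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.spatial.distance import jensenshannon

human = np.array([0.10, 0.25, 0.40, 0.25])   # human probability mass per option
model = np.array([0.05, 0.15, 0.30, 0.50])   # model probability mass per option
options = np.arange(len(human))              # ordinal positions 0..K-1

# 1-Wasserstein ("earth mover's") distance: respects option ordering, so
# confusing adjacent options costs less than confusing opposite extremes.
wd = wasserstein_distance(options, options, u_weights=human, v_weights=model)

# Jensen-Shannon divergence: symmetric and bounded; SciPy returns the JS
# distance (the square root of the divergence), so square it.
jsd = jensenshannon(human, model) ** 2

print(f"Wasserstein distance: {wd:.3f}")
print(f"Jensen-Shannon divergence: {jsd:.3f}")
```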

Key validation studies

  • Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J. R., Rytting, C., & Wingate, D. (2023). Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis 31(3).

    Foundational 'silicon sampling' paper. GPT-3 approximated ANES distributions, and human evaluators could not distinguish GPT-3-generated text from human-written text at above-chance rates.

  • Bisbee, J., Clinton, J. D., Dorff, C., Kenkel, B., & Larson, J. (2024). Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. Political Analysis.

    The most influential cautionary result. Aggregate means matched, but synthetic variance was only 31% of the human variance; 48% of regression coefficients differed significantly and 32% had the wrong sign.

  • Dominguez-Olmedo, R., Hardt, M., & Mendler-Dünner, C. (2024). Questioning the Survey Responses of Large Language Models. NeurIPS 2024 (Oral).

    Tested 43 LLMs against US Census data. After adjusting for ordering/labeling biases, responses trended toward uniformly random regardless of model size.

  • Park, J. S., et al. (Stanford) (2024). Generative Agent Simulations of 1,000 People. arXiv preprint.

    Agents built from two-hour qualitative interviews replicated GSS responses at 85% of participants' own test-retest accuracy, dramatically outperforming demographic-only prompting.

  • Maier, M., et al. (PyMC Labs / Colgate-Palmolive) (2025). Semantic Similarity Rating (SSR) for Purchase Intent. arXiv:2510.08338.

    The strongest commercial validation to date. Textual elicitation plus embedding similarity achieved KS > 0.85 and 90% of human test-retest reliability across 57 product surveys (9,300 human responses); direct Likert elicitation scored only 0.26–0.39. A sketch of the embedding-similarity step follows this list.

  • Brand, J., Israeli, A., & Ngwe, D. (Harvard Business School) (2025). Using GPT for Market Research. HBS Working Paper 23-062 (revised).

    Estimated willingness-to-pay for toothpaste: $3.40 (GPT) vs. $3.27 (human), with economically rational properties preserved. Fine-tuning helped within a product category but did not transfer across categories.
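
To make the Semantic Similarity Rating idea above concrete, here is a minimal sketch under stated assumptions: the anchor statements, embedding model, and softmax temperature are illustrative choices, not the exact ones used by Maier et al. A free-text answer (for example, elicited from an LLM persona) is embedded, scored against one anchor statement per scale point, and normalized into a distribution over a 5-point purchase-intent scale.

```python
# Hedged sketch of an SSR-style mapping from open-ended text to a Likert
# distribution. Anchor wording, embedding model, and temperature are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

ANCHORS = [
    "I would definitely not buy this product.",
    "I would probably not buy this product.",
    "I might or might not buy this product.",
    "I would probably buy this product.",
    "I would definitely buy this product.",
]

def ssr_distribution(free_text_answer: str, temperature: float = 0.05) -> np.ndarray:
    """Map an open-ended answer to a probability distribution over the 5-point scale."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode([free_text_answer] + ANCHORS, normalize_embeddings=True)
    answer, anchors = embeddings[0], embeddings[1:]
    sims = anchors @ answer            # cosine similarities (vectors are normalized)
    logits = sims / temperature        # sharpen before normalizing
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

print(ssr_distribution("It sounds useful, but I'd want to read reviews first."))
```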

Bias & limitations research

  • Salecha, A., et al. (2024). Social Desirability Amplification in LLM Survey Responses. PNAS Nexus.

    When LLMs infer that they are being evaluated, responses shift by up to 1.20 SD toward socially desirable answers; the effect is larger in newer models.

  • Wang, A., et al. (2025). LLM Identity-Group Analysis. Nature Machine Intelligence.

    Across 4 LLMs, 16 demographic identities, and 3,200 human responses, the models systematically misportray groups and flatten within-group heterogeneity.

  • Li, T., Castelo, N., Katona, Z., & Sarvary, M. (2024). LLM-Generated Brand Perceptual Maps. Marketing Science 43(2).

    75–85% agreement for well-known brands. Weaker for niche categories where models rely on simple heuristics.

Open-ended response evaluation

  • Mellon, J., et al. (2024). LLM Open-Text Coding. British Election Study analysis.

    Claude-1.3 achieved 93.9% accuracy (vs. 94.7% for human coders) when categorizing 'most important issue' responses into 50 categories across 657,000+ entries.

  • Prescott, J., et al. (2024). GenAI Theme Matching in Health Communication. Published study.

    GenAI-derived themes matched 71% of human inductive themes and were produced 28x faster (20 vs. 567 minutes).

This list focuses on the studies most relevant to vcrowd's approach. For the broader landscape, see Argyle et al. (2023) and the OpinionQA project (tatsu-lab/opinions_qa on GitHub). We do not claim affiliation with any of these research groups.
