Grounded in peer-reviewed research

How we validate

Synthetic survey research is a young field. The academic literature since 2023 shows both genuine promise and real limitations. We take both seriously.

85%+
Match rate with real survey distributions
85%
Real-world behavior reproduced in tests

Match rate from Maier et al. (2025), tested across 57 real product surveys with 9,300 human respondents. Replication rate from Park et al. (2024), 1,052 AI agents tested against real survey data.

What the evidence shows

The research is clear: synthetic respondents are good at capturing the big picture — which product people prefer, which message resonates more, which concept ranks highest. They're less reliable for fine-grained demographic breakdowns or predicting individual behavior.

We build vcrowd around these findings. We use the methods with the strongest published results, and we're upfront about where the technology has limits.

The chart shows what researchers have found: where synthetic and real human respondents agree closely, and where gaps remain.

Synthetic–human agreement by task
Overall preference direction: Strong agreement

Argyle et al. 2023, Brand et al. 2025 — willingness-to-pay estimates within $0.13 of real consumers

Purchase intent ranking: Strong agreement

Maier et al. 2025 — 85%+ distribution match across 57 product surveys

Brand perception (known brands): Moderate

Li et al. 2024 — 75–85% agreement; weaker for niche or regional brands

Accuracy across age, gender, income: Active research

Bisbee et al. 2024 — about half of demographic patterns diverge from real data

Capturing extreme & minority views: Active research

Synthetic opinions are ~3x narrower than real human opinions

How we measure quality

We test synthetic responses the same way researchers do. No single test tells the whole story — so we use five.

Distribution Match

Do synthetic answer patterns look like the real ones? We compare full response curves, not just averages.
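As a minimal sketch of what "comparing full response curves" can mean (an illustration of the general technique, not vcrowd's actual pipeline), a distribution match on a 1–5 scale can be scored as one minus the total variation distance between the two answer distributions:

```python
from collections import Counter

def distribution_match(human, synthetic, scale=range(1, 6)):
    """Compare full answer distributions on a Likert scale.

    Returns 1 - total variation distance: 1.0 means the two answer
    curves are identical, 0.0 means they do not overlap at all.
    """
    n_h, n_s = len(human), len(synthetic)
    h, s = Counter(human), Counter(synthetic)  # missing keys count as 0
    tvd = 0.5 * sum(abs(h[k] / n_h - s[k] / n_s) for k in scale)
    return 1.0 - tvd
```

Because the whole curve is compared, a synthetic sample with the right average but all answers bunched at "3" still scores poorly against a polarized human sample.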

Question Consistency

When humans link two topics, do synthetic respondents link them too? We check these relationships hold.
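One illustrative way to test whether linked topics stay linked (a sketch, not vcrowd's published method) is to compare the correlation between two questions in the human data against the same correlation in the synthetic data:

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation; assumes neither series is constant."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def consistency_gap(human_q1, human_q2, synth_q1, synth_q2):
    """How differently do the two crowds link question 1 to question 2?
    0.0 means the synthetic crowd links the topics exactly like humans."""
    return abs(pearson(human_q1, human_q2) - pearson(synth_q1, synth_q2))
```

A gap near 0 means the relationship holds; a gap near 2 means the synthetic crowd links the topics in the opposite direction.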

Direction Check

Does the synthetic crowd pick the same top answer as real people? We measure how often they agree.
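The direction check reduces to a simple question per survey item: do the two crowds share the same most common answer? A minimal sketch (illustrative, with ties broken arbitrarily):

```python
from collections import Counter

def direction_agreement(human_answers, synth_answers):
    """Fraction of questions where the synthetic crowd's most common
    answer matches the human crowd's. Each element of the two input
    lists holds all answers to one question."""
    agree = 0
    for h, s in zip(human_answers, synth_answers):
        top_h = Counter(h).most_common(1)[0][0]
        top_s = Counter(s).most_common(1)[0][0]
        agree += top_h == top_s
    return agree / len(human_answers)
```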

Bias Scan

We check for political lean, over-polite answers, and whether the spread of opinions is too narrow.

Demographic Fit

Do age, income, and gender groups respond differently in the same way real groups do?
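A demographic-fit check can be sketched (again, an illustration rather than vcrowd's exact metric) by comparing the gap in mean response between each pair of groups in human data against the same gap in synthetic data:

```python
import statistics

def demographic_fit(human_by_group, synth_by_group):
    """Average absolute mismatch between human and synthetic group gaps.

    Inputs map group label -> list of answers. 0.0 means every pair of
    groups differs by the same amount, in the same direction, in both
    datasets."""
    groups = sorted(human_by_group)
    mismatches = []
    for i, a in enumerate(groups):
        for b in groups[i + 1:]:
            gap_h = statistics.fmean(human_by_group[a]) - statistics.fmean(human_by_group[b])
            gap_s = statistics.fmean(synth_by_group[a]) - statistics.fmean(synth_by_group[b])
            mismatches.append(abs(gap_h - gap_s))
    return statistics.fmean(mismatches)
```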

What we're honest about

Known challenges in the field

Opinions cluster to the middle

Synthetic responses tend to avoid extremes. Strong supporters and strong critics are underrepresented.

Western & liberal skew

AI models reflect their training data, which over-represents Western, English-speaking, and liberal perspectives.

Too polite on sensitive topics

On controversial questions, synthetic respondents give more socially acceptable answers than real people do.

Groups, not individuals

Synthetic data captures crowd-level patterns well, but can't reliably predict what any single person would say.

How we address them

These are not problems we can wave away with marketing language; they are active areas of research, documented across dozens of peer-reviewed papers. We address them through methodology:

  • We ask AI to explain its reasoning in words first, then convert to ratings — the method with the best published accuracy (Maier et al.)
  • Every report includes a Data Confidence Score so you can assess the strength of the signal
  • We monitor for known bias patterns and flag responses where synthetic data is least reliable
  • We recommend synthetic research for early-stage screening and directional signal — not as a replacement for human panels in high-stakes decisions
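As a purely hypothetical illustration of how quality checks like the five above could roll up into a single confidence score (the weights and function name here are invented for illustration; the page does not describe vcrowd's actual formula):

```python
def data_confidence_score(distribution_match, direction_agreement, spread_ratio):
    """Blend three quality signals into a 0-100 score.

    Inputs: distribution_match and direction_agreement in [0, 1];
    spread_ratio is synthetic/human std dev, capped at 1.0 so only
    too-narrow spread is penalized. Weights are illustrative only."""
    spread_term = min(spread_ratio, 1.0)
    raw = 0.4 * distribution_match + 0.4 * direction_agreement + 0.2 * spread_term
    return round(100 * raw)
```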

The research foundation

Our approach draws on validation work from research groups at Harvard, Stanford, Columbia, PyMC Labs, Anthropic, and others. The field is moving fast — new benchmarks and methods are published regularly, and we update our pipeline accordingly.

5
Public benchmark datasets we test against
4,800+
Survey questions in those benchmarks
9,300
Real human responses used for comparison
64
Countries represented in validation data

Based on OpinionQA and SubPOP, the two largest public survey benchmarks. Full reference list →

Try it and compare

Run a study on a topic where you already have human survey data. See how the results compare. That's the best validation.

No credit card required

© 2026 vcrowd