Comparing LLM-Generated Reviews to Human Evaluations in Social Science Research

Authors

David Reinstein

Valentin Klotzbücher

Published

June 29, 2025

Project Overview

We are exploring whether large language models (LLMs) can generate research paper evaluations comparable to expert human reviews. In this project, we use LLMs (OpenAI’s o3 and Google’s Gemini 2.5 Pro) to review social science research papers against the same criteria used by human reviewers for The Unjournal.

Each paper is assessed on specific dimensions – for example, the strength of its evidence, the rigor of its methods, clarity of communication, openness and reproducibility, relevance to global priorities, and overall quality. The LLM provides quantitative scores (with uncertainty intervals) on these criteria and produces a written evaluation.
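As a rough illustration of the scoring step, the sketch below asks a chat model for structured ratings and parses them as JSON. The criterion names, prompt wording, output schema, and helper function are assumptions for illustration, not The Unjournal’s actual prompts or pipeline.

```python
# Minimal sketch: request Unjournal-style ratings with uncertainty intervals.
# Prompt wording, criterion names, and the JSON schema are illustrative only.
import json
from openai import OpenAI  # assumes the official OpenAI Python client

CRITERIA = [
    "overall_quality", "claims_evidence", "methods",
    "communication", "open_science", "global_relevance",
]

PROMPT = (
    "You are an expert reviewer of social science research. "
    "For the paper below, rate each criterion from 0 to 100 and give a 90% "
    "credible interval (lower, upper) reflecting your uncertainty. "
    "Respond with JSON only, e.g. "
    '{"methods": {"midpoint": 70, "lower": 55, "upper": 80}, ...}\n\n'
    f"Criteria: {', '.join(CRITERIA)}\n\nPaper text:\n"
)

def score_paper(paper_text: str, model: str = "o3") -> dict:
    """Return {criterion: {midpoint, lower, upper}} as rated by the model."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + paper_text}],
    )
    return json.loads(response.choices[0].message.content)
```

A real pipeline would also need to handle long papers (chunking or file input), retries, and validation of the returned JSON before the scores are stored.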

For now, we only illustrate and test the workflow on a single example paper (Brülhart et al. 2021).

Our initial dataset will include research papers that already have published human evaluations. For each paper, the LLM will generate (1) numeric ratings on the defined criteria, (2) an identification of the paper’s key claims, and (3) a detailed review discussing the paper’s contributions and weaknesses. We will then compare the LLM-generated evaluations to the published human evaluations.
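As a sketch of the comparison step, one could align the LLM and human ratings by paper and criterion and compute simple agreement statistics. The data layout and column names below are hypothetical.

```python
# Sketch of comparing LLM and human ratings; column names are hypothetical.
import pandas as pd

def compare_scores(llm: pd.DataFrame, human: pd.DataFrame) -> pd.DataFrame:
    """Both inputs: one row per (paper_id, criterion) with a 'score' column.
    Returns per-criterion mean absolute difference and Pearson correlation."""
    merged = llm.merge(human, on=["paper_id", "criterion"],
                       suffixes=("_llm", "_human"))
    return merged.groupby("criterion").apply(
        lambda g: pd.Series({
            "mean_abs_diff": (g["score_llm"] - g["score_human"]).abs().mean(),
            "pearson_r": g["score_llm"].corr(g["score_human"]),
        })
    )
```

With only one example paper so far, such statistics are not yet meaningful; they become informative once the dataset covers multiple papers with existing human evaluations.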

Next, we will focus on papers currently under evaluation, i.e., those for which no human evaluation exists yet, so that we can rule out any contamination.