Reproducibility Report for Homo Silicus Study by John Horton (Work in progress)
Introduction
Many important theories in social science and policy design, such as the evolution of norms and the effects of policy interventions on a community, cannot be tested directly due to the practical challenges of conducting large-scale longitudinal studies [1, 2, 3]. In response, one promising recent approach is to use large language models to create proxies of human participants, which allows us to simulate the outcomes of studies that would otherwise be impossible to conduct. In my research program at the intersection of human-computer interaction and natural language processing, I have introduced methods to simulate general computational agents, known as generative agents [3, 4]. These agents leverage a large language model within a novel agent cognitive architecture to produce human-like behaviors at both the individual and group levels (e.g., user behaviors in online social media, NPC behaviors in Sims-inspired games). My current research focuses on demonstrating these agents as a scientific tool for addressing questions in the social sciences that are best answered through simulations of human behavior.
In this replication study, I examine John Horton’s paper, “Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?”, which replicates existing social science experiments using large language models as proxies for human participants [5]. Horton’s work is among the notable early efforts [3, 4, 5, 9, 10] to leverage language models to simulate human participants in behavioral experiments. In his study, he replicates findings from prior experiments such as Charness and Rabin (2002) [6] and Kahneman, Knetsch, and Thaler (1986) [7] by prompting a large language model. He finds that the simulated participants, created by prompting the model with a description of each study and then querying how it would behave in that experiment, roughly matched the behavior of human subjects. My goal is to replicate Horton’s findings for two of the experiments he used.
Justification for choice of study
While large language model-based simulations offer an important new avenue for future studies, we are still in the early stages of developing them as a scientific method. Therefore, the best research and reporting practices for ensuring the replicability and robustness of the results are actively being developed. This makes the task of replicating existing large language model-based studies particularly interesting and important, as doing so will help us better understand the strengths and challenges of using language models for social scientific discovery, upon which we can build. In particular, I see three important challenges that remain unaddressed in this emerging field:
- Ensuring that simulated outcomes are robust to minor changes in the prompt and across different models, especially since many closed language models accessed through APIs may change without users being informed.
- Understanding which population we are representing in our simulated outcomes.
- Benchmarking language model-simulated outcomes against published experiments that may already be known to the model.
Continued efforts in replicating and refining research in the field of large language model-based studies may help us better understand the extent of these challenges and how to address them.
Anticipated challenges
The challenges in replicating Horton’s study are indicative of the broader challenges listed above. In this replication, I focus on the first of them, robustness, because it concerns the most fundamental question for large language model-based approaches: can we produce replicable results when using the same setup? I aim to expand upon Horton’s study 1) by reproducing his results using prompt variations that convey the same semantic meaning but are phrased differently, and 2) by comparing different versions of large language models. Ensuring that results are robust to changes in both the model and the prompt is crucial for the replicability of findings generated using a large language model.
Links
Project repository (on Github): https://github.com/psych251/horton2023
Original paper (as hosted in your repo): https://github.com/psych251/horton2023/blob/main/original_paper/horton_homo_silicus.pdf
Methods
The overarching strategy for the large language model-based studies we are attempting to replicate involves describing well-known social science experiments in natural language scenarios. We then prompt a language model to answer the question, “If you were a participant, how would you behave in this scenario?” We aim to replicate two well-known social scientific experiments previously studied by Horton: Charness and Rabin (2002) [6] and Kahneman, Knetsch, and Thaler (1986) [7].
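To make this strategy concrete, the sketch below (in Python) shows how a scenario description and the behavioral question above can be assembled into a single prompt. The helper function and its wording are illustrative assumptions on my part rather than Horton’s exact prompt text.

```python
# Illustrative framing of the general strategy; not Horton's exact wording.
def build_prompt(scenario: str, persona: str = "") -> str:
    """Combine an optional persona, a scenario description, and the behavioral
    question into a single prompt for the language model."""
    question = "If you were a participant, how would you behave in this scenario?"
    return f"{persona}\n\n{scenario}\n\n{question}".strip()
```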
Description of the steps required to reproduce the results
Horton described each of the original studies in natural language to make them understandable for a language model. For each experiment, he introduced an additional variable that characterizes the participants’ persona in the context of the study being replicated. For instance, in the case of Kahneman et al.’s (1986) study [7], where subjects were presented with a series of market scenarios to assess intuitions about fairness in market contexts, he presented the study as follows: “A hardware store has been selling snow shovels for $15. The morning after a large snowstorm, the store raises the price to $20. Please rate this action as: 1) Completely Fair 2) Acceptable 3) Unfair 4) Very Unfair”
Horton then prompted the language model to predict how a person of a particular political leaning (e.g., socialist, libertarian) might respond to the prompt. He compared the language model’s simulated outcome to the reported outcome in the published study, which indicated that with a price increase to $20, 82% of participants found it in some way “unacceptable.”
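To illustrate this step, the sketch below builds the persona-conditioned prompts for the snow-shovel scenario. The scenario text is quoted from Horton’s description above, while the list of six political leanings and the exact wording of the persona instruction are my assumptions and will be aligned with Horton’s paper where it reports them.

```python
# Sketch of the Kahneman, Knetsch, and Thaler (1986) fairness prompt with a
# political-leaning persona, following the description above. The set of six
# leanings and the instruction wording are assumptions to be checked against
# Horton's paper; the scenario text itself is quoted from his description.

SCENARIO = (
    "A hardware store has been selling snow shovels for $15. The morning after "
    "a large snowstorm, the store raises the price to $20. Please rate this "
    "action as: 1) Completely Fair 2) Acceptable 3) Unfair 4) Very Unfair"
)

# Assumed six political leanings (to be matched to Horton's exact labels).
POLITICAL_LEANINGS = [
    "socialist", "leftist", "liberal", "moderate", "conservative", "libertarian",
]

def kahneman_prompt(leaning: str) -> str:
    """Ask the model to predict how a person with the given political leaning
    would rate the market scenario."""
    return (
        f"{SCENARIO}\n\n"
        f"How would a person whose political leaning is {leaning} rate this "
        "action? Answer with one of the numbered options."
    )

prompts = {leaning: kahneman_prompt(leaning) for leaning in POLITICAL_LEANINGS}
```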
In his study, Horton found that, according to the language model simulation, only moderates and libertarians considered this price increase acceptable. Assuming that Kahneman et al. (1986) employed a nationally representative sample, and considering that only about 37% of Americans described themselves as “moderate” in 2021, Horton concluded that the language model estimate would be an underestimate relative to the original finding that 82% of participants found the increase in some way “unacceptable.”
My baseline replication will use the prompts and participant personas described in Horton’s study where they are available. However, not all prompts are precisely reported; in those cases, I will write a prompt myself in order to replicate the reported results. Additionally, I will employ the same large language model (GPT-3 text-davinci-003) and hyperparameters that were used in Horton’s study.
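A minimal sketch of this baseline query step is shown below, assuming the legacy openai Python SDK (pre-1.0) and its Completion endpoint for text-davinci-003. The decoding hyperparameters shown are placeholders that I will replace with Horton’s documented settings where available; each persona prompt from the sketch above is passed through this function.

```python
# Baseline query step: one completion from GPT-3 text-davinci-003 per prompt.
# Assumes the legacy openai SDK (<1.0); hyperparameters below are assumptions.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def query_davinci(prompt: str) -> str:
    """Query GPT-3 text-davinci-003 once and return the raw completion text."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=0.0,  # assumed; to be matched to Horton's settings if documented
        max_tokens=150,   # assumed
    )
    return response["choices"][0]["text"].strip()
```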
Differences from original study
There are two main aspects that differ, at least to some extent, between my replication and Horton’s original study: the prompt and the model version. Regarding the prompt, while I do not expect significant differences, there may be some discrepancies that arise as I fill in the gaps, given that some of the precise prompts used to generate the findings were not shared in the original report. Concerning the model, the difference mainly arises from the fact that centrally hosted language models, such as GPT-3 accessible through APIs, are not static but continually changing, sometimes without the knowledge of end users. Therefore, I would anticipate that the model accessible today will be different in some subtle or major way compared to the one used by Horton in his original study.
Given these differences, the challenge lies in understanding the extent to which these variances impact the final outcome of the study. Do subtle differences in the prompt and the model alter the study’s outcome, or do the outcomes remain relatively stable regardless of the specifics of the prompt and the model used? To address this question, I will expand my replication study to create and test syntactic variations of the prompts I will be using. Additionally, I will employ an additional model, GPT-4, to assess the impact of more controlled changes in the prompt and the model on the study’s outcome.
Project Progress Check 1
Measure of success
I have set up separate measures of success for the two experiments based on Horton’s work as follows:
Kahneman et al. (1986) [7]. As described above, in this experiment, Horton prompted the model to predict how a persona with a specific political leaning would react to a market scenario. The model was asked to rate the scenario as either 1) Completely Fair, 2) Acceptable, 3) Unfair, or 4) Very Unfair. In my replication, I aim to investigate whether a language model-generated persona with the same political leaning rates the scenario in a manner consistent with Horton’s original work. Horton’s study included six personas – I will report the proportion of personas that respond in the same way as in Horton’s work.
Charness and Rabin (2002) [6]. In this experiment, Horton focused on the unilateral dictator game from Charness and Rabin’s research. All dictator games were structured as follows: Left: Person B receives $600, and Person A receives $300, or Right: Person B receives $500, and Person A receives $700. Horton prompted the model to predict how a persona with a specific personality trait (e.g., someone who is inequity averse and “only cares about fairness between players”) would respond to the presented dictator’s game. Horton’s study featured three personas – similar to the experiment above, I will report the proportion of personas that responded in a manner consistent with Horton’s original work.
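The sketch below shows how I plan to encode the dictator game and compute the measure of success shared by both experiments: the proportion of personas whose simulated response matches the response Horton reported. Only the inequity-averse persona wording is quoted above; the remaining personas and the exact framing of the choice are placeholders to be filled in from Horton’s paper.

```python
# Dictator-game prompt and the shared success metric (proportion of personas
# matching Horton's reported responses). Persona placeholders are hypothetical.

DICTATOR_GAME = (
    "Left: Person B receives $600 and Person A receives $300.\n"
    "Right: Person B receives $500 and Person A receives $700.\n"
    "Which option do you choose: Left or Right?"
)

PERSONAS = {
    "inequity_averse": "You only care about fairness between players.",
    "persona_2": "<second persona from Horton's paper>",  # placeholder
    "persona_3": "<third persona from Horton's paper>",   # placeholder
}

def dictator_prompt(persona_text: str) -> str:
    """Combine a persona description with the dictator-game choice."""
    return f"{persona_text}\n\n{DICTATOR_GAME}"

def proportion_matching(simulated: dict, reported: dict) -> float:
    """Measure of success: share of personas whose simulated response equals
    the response reported in Horton's study."""
    matches = sum(simulated[p] == reported[p] for p in reported)
    return matches / len(reported)
```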
Pipeline progress
The replication pipeline consists of two steps. The first step involves crafting the prompt and personas. In this phase, I will use Horton’s work as the baseline and incorporate it verbatim, adding additional details to the prompts only where they are missing. In the second step, I will employ these resulting prompts and personas to prompt the language model and generate the agent’s response.
Then, to assess the robustness of this methodology in the face of slight syntactic changes in the prompt and variations in the model version, I will implement the following two additional steps: 1) I will generate a variation of the prompt by instructing the language model to “paraphrase” the original base prompt. This will yield an additional prompt with the same meaning but potentially in a slightly different syntactic form. I will then use this new prompt to rerun the experiment. 2) To evaluate the impact of the model version, I will rerun the original base prompt using a different, more recent model, specifically GPT-4.
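The sketch below outlines these two robustness steps, again assuming the legacy openai SDK: a paraphrasing call that produces a syntactic variant of a base prompt, and a rerun of the same prompt on GPT-4 through the chat endpoint. The paraphrasing instruction and the temperature values are my assumptions. Agreement with Horton’s reported responses is then computed with the same proportion measure as above.

```python
# Robustness steps: (1) paraphrase the base prompt, (2) rerun it on GPT-4.
# Assumes the legacy openai SDK (<1.0); instruction wording and temperatures
# are assumptions, not settings from Horton's study.
import openai

def paraphrase(base_prompt: str) -> str:
    """Produce a semantically equivalent, syntactically varied prompt."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=f"Paraphrase the following text without changing its meaning:\n\n{base_prompt}",
        temperature=0.7,  # assumed: some variation is desirable here
        max_tokens=300,   # assumed
    )
    return response["choices"][0]["text"].strip()

def query_gpt4(prompt: str) -> str:
    """Rerun the same prompt on GPT-4 to probe sensitivity to the model version."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # assumed
    )
    return response["choices"][0]["message"]["content"].strip()
```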
Works Cited
[1] Thomas Schelling. 1978. Micromotives and Macrobehavior.
[2] Eric Bonabeau. 2002. Agent-based modeling: Methods and techniques for simulating human systems. Proceedings of the National Academy of Sciences.
[3] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative Agents: Interactive Simulacra of Human Behavior.
[4] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social Simulacra: Creating Populated Prototypes for Social Computing Systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1–18.
[5] John Horton. 2023. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?
[6] Gary Charness and Matthew Rabin. 2002. Understanding social preferences with simple tests. The Quarterly Journal of Economics 117(3), 817–869.
[7] Daniel Kahneman, Jack L. Knetsch, and Richard Thaler. 1986. Fairness as a constraint on profit seeking: Entitlements in the market. The American Economic Review, 728–741.
[8] William Samuelson and Richard Zeckhauser. 1988. Status quo bias in decision making. Journal of Risk and Uncertainty 1(1), 7–59.
[9] Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis.
[10] Marcel Binz and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences.