Reproducibility Report for Homo Silicus Study by John Horton (Work in progress)
Introduction
Many important theories in social science and policy design, such as the evolution of norms and the effects of policy interventions on a community, cannot be tested directly due to the practical challenges of conducting large-scale longitudinal studies [1, 2, 3]. One promising modern response is to use large language models to create proxies of human participants, allowing us to simulate the outcomes of studies that would otherwise be impossible to conduct. In my research program at the intersection of human-computer interaction and natural language processing, I have introduced methods for simulating general computational agents, known as generative agents [3, 4]. These agents leverage a large language model within a novel agent cognitive architecture to produce human-like behaviors at both the individual and group levels (e.g., user behaviors on online social media, NPC behaviors in Sims-inspired games). My current research focuses on demonstrating these agents as a scientific tool for addressing questions in the social sciences that are best answered through simulations of human behavior.
In this replication study, I examine John Horton’s paper, “Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?”, which replicates existing social science experiments using large language models as proxies for human participants [5]. Horton’s work is among the notable early efforts [3, 4, 5, 9, 10] to leverage language models to simulate human participants in behavioral experiments. He replicates findings from prior experiments such as Charness and Rabin (2002) [6] and Kahneman, Knetsch, and Thaler (1986) [7] by prompting a large language model with a description of each study and querying how it would behave as a participant. He finds that these language model-simulated participants roughly matched the behavior of human subjects. My goal is to replicate Horton’s findings for two of the experiments he used.
Justification for choice of study
While large language model-based simulations offer an important new avenue for future studies, we are still in the early stages of developing them as a scientific method. Therefore, the best research and reporting practices for ensuring the replicability and robustness of the results are actively being developed. This makes the task of replicating existing large language model-based studies particularly interesting and important, as doing so will help us better understand the strengths and challenges of using language models for social scientific discovery, upon which we can build. In particular, I see three important challenges that remain unaddressed in this emerging field:
- Ensuring the robustness of the simulated outcomes given minor changes in the prompt and across different models, especially considering that many of the closed language models accessible through APIs may be changing without informing users.
- Understanding the population we are representing in our simulated outcomes.
- The challenges of benchmarking language model-simulated outcomes against published experiments that may be known to the model.
Continued efforts in replicating and refining research in the field of large language model-based studies may help us better understand the extent of these challenges and how to address them.
Anticipated challenges
The challenges in replicating Horton’s study are indicative of the broader challenges listed above. In this replication study, I focus on the first of these challenges, as it concerns what is arguably the most fundamental aspect of large language model-based approaches: can we produce replicable results when using the same setup? I aim to expand upon Horton’s study 1) by reproducing his results using prompts that convey the same semantic description of the experiments but are phrased differently, and 2) by comparing different versions of large language models. Ensuring the robustness of the results, especially in light of changes in both the model and the prompt, is crucial for the replicability of findings generated using a large language model.
Links
Project repository (on GitHub): https://github.com/psych251/horton2023
Original paper (as hosted in the project repository): https://github.com/psych251/horton2023/blob/main/original_paper/horton_homo_silicus.pdf
Methods
The overarching strategy for the large language model-based studies we are attempting to replicate involves describing well-known social science experiments in natural language scenarios. We then prompt a language model to answer the question, “If you were a participant, how would you behave in this scenario?” We aim to replicate two well-known social scientific experiments previously studied by Horton: Charness and Rabin (2002) [6] and Kahneman, Knetsch, and Thaler (1986) [7].
Description of the steps required to reproduce the results
Horton described each study in natural language to make it understandable for a language model. For each experiment, he introduced an additional variable that characterizes the participants’ persona in the context of the study being replicated. For instance, in the case of Kahneman et al.’s (1986) study [7], where subjects were presented with a series of market scenarios to assess intuitions about fairness in market contexts, he presented the study as follows: “A hardware store has been selling snow shovels for $15. The morning after a large snowstorm, the store raises the price to $20. Please rate this action as: 1) Completely Fair 2) Acceptable 3) Unfair 4) Very Unfair”
Horton then prompted the language model to predict how a person of a particular political leaning (e.g., socialist, libertarian) might respond to the prompt. He compared the language model’s simulated outcome to the reported outcome in the published study, which indicated that with a price increase to $20, 82% of participants found it in some way “unacceptable.”
In his study, Horton found that, according to the language model simulation, only moderates and libertarians considered this price increase acceptable. Assuming that Kahneman et al. (1986) employed a nationally representative sample, and considering that only about 37% of Americans described themselves as “moderate” in 2021, Horton concluded that the language model’s estimate would be an underestimate relative to the original finding that 82% of participants found the increase in some way “unacceptable.”
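To make this setup concrete, the following is a minimal sketch of how such a persona-conditioned prompt might be assembled. Because Horton’s exact wording is not fully reported, the template, the build_fairness_prompt helper, and the listed political leanings are illustrative assumptions rather than his verbatim prompts.

```python
# Illustrative sketch only: the template and persona wording below are
# assumptions, not Horton's verbatim prompts (which are not fully reported).

SCENARIO = (
    "A hardware store has been selling snow shovels for $15. "
    "The morning after a large snowstorm, the store raises the price to $20."
)
RATING_OPTIONS = "1) Completely Fair 2) Acceptable 3) Unfair 4) Very Unfair"

# Partial, illustrative list; Horton's study used six political personas.
POLITICAL_LEANINGS = ["socialist", "liberal", "moderate", "libertarian"]


def build_fairness_prompt(leaning: str) -> str:
    """Assemble a persona-conditioned version of the fairness question."""
    return (
        f"You are a {leaning}.\n"
        f"{SCENARIO}\n"
        f"Please rate this action as: {RATING_OPTIONS}\n"
        "Your rating:"
    )


if __name__ == "__main__":
    for leaning in POLITICAL_LEANINGS:
        print(build_fairness_prompt(leaning), end="\n\n")
```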
My baseline replication will involve using the prompts and participants’ personas as described in Horton’s study if those prompts are available. However, I have noticed that not all prompts are precisely described in the report. In such cases, I will create a prompt myself in order to replicate the reported results. Additionally, I will employ the same large language model (GPT-3 text-davinci-003) and hyperparameters that were used in Horton’s study.
Differences from original study
There are two main aspects that differ, at least to some extent, between my replication and Horton’s original study: the prompt and the model version. Regarding the prompt, while I do not expect significant differences, there may be some discrepancies that arise as I fill in the gaps, given that some of the precise prompts used to generate the findings were not shared in the original report. Concerning the model, the difference mainly arises from the fact that centrally hosted language models, such as GPT-3 accessible through APIs, are not static but continually changing, sometimes without the knowledge of end users. Therefore, I would anticipate that the model accessible today will be different in some subtle or major way compared to the one used by Horton in his original study.
Given these variations, the challenge lies in understanding the extent to which these discrepancies affect the ultimate results of the study. Do subtle differences in the model influence the study’s outcomes, or do the results remain relatively consistent regardless of the particulars of the prompt and the model used? To address this question, I will broaden my replication study to examine how different models perform, aiming to assess the impact of more controlled modifications to the prompt and the model on the study’s findings.
Project Progress Check 1
Measure of success
I have set up separate measures of success for the two experiments based on Horton’s work as follows:
Kahneman et al. (1986) [7]. As described above, in this experiment, Horton prompted the model to predict how a persona with a specific political leaning would react to a market scenario. The model was asked to rate the scenario as either 1) Completely Fair, 2) Acceptable, 3) Unfair, or 4) Very Unfair. In my replication, I aim to investigate whether a language model-generated persona with the same political leaning rates the scenario in a manner consistent with Horton’s original work. Horton’s study included six personas; I will report the proportion of personas that responded in the same way as in Horton’s work.
Charness and Rabin (2002) [6]. In this experiment, Horton focused on the unilateral dictator games from Charness and Rabin’s research. Each dictator game presented a binary choice, for example: Left: Person B receives $600 and Person A receives $300, or Right: Person B receives $500 and Person A receives $700. Horton prompted the model to predict how a persona with a specific personality trait (e.g., someone who is inequity averse and “only cares about fairness between players”) would respond to the presented dictator game. Horton’s study featured three personas; as in the experiment above, I will report the proportion of personas that responded in a manner consistent with Horton’s original work.
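To illustrate the measure of success, here is a brief sketch of how the agreement proportion might be computed. The persona wordings, the framing of the choice, and the choice values below are placeholders for illustration; the reference choices would be filled in from Horton’s reported results.

```python
# Illustrative sketch of the success measure: the persona wordings, the
# framing of who decides, and the choice values below are placeholders,
# not values taken from Horton's paper.

DICTATOR_GAME = (
    "Left: Person B receives $600 and Person A receives $300. "
    "Right: Person B receives $500 and Person A receives $700."
)

# Assumed persona descriptions, loosely paraphrasing the report above.
PERSONAS = {
    "inequity_averse": "You only care about fairness between players.",
    "efficiency_minded": "You only care about the total payoff of both players.",
    "self_interested": "You only care about your own payoff.",
}


def build_dictator_prompt(persona: str) -> str:
    """Assemble a persona-conditioned version of the dictator game question."""
    return (
        f"{PERSONAS[persona]}\n"
        f"You are deciding between two allocations. {DICTATOR_GAME}\n"
        "Do you choose Left or Right?"
    )


def agreement_proportion(replication: dict, reference: dict) -> float:
    """Proportion of personas whose replicated choice matches the reference."""
    matches = sum(replication[p] == reference[p] for p in reference)
    return matches / len(reference)


# Example with placeholder choices: two of three personas agree, giving 2/3.
reference = {"inequity_averse": "Left", "efficiency_minded": "Left",
             "self_interested": "Right"}
replication = {"inequity_averse": "Left", "efficiency_minded": "Right",
               "self_interested": "Right"}
print(agreement_proportion(replication, reference))  # 0.666...
```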
Pipeline progress
The replication process comprises two essential steps. The first step involves crafting the prompts and personas. During this phase, I will use Horton’s work as the foundation, incorporating it verbatim and supplementing the prompts with additional details only where they are absent. In the second step, I will use these refined prompts and personas to prompt the language model and generate the agents’ responses. To assess the robustness of this methodology to slight syntactic alterations and to variations in the model version, I will implement two additional stages: to evaluate the impact of the prompt wording, I will re-run the experiments with paraphrased prompts that preserve the original meaning; and to evaluate the impact of the model version, I will re-run the original base prompt using different, more recent models, specifically GPT-4 and Mistral, an open-source model.
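Below is a hedged sketch of how a single persona-conditioned prompt might be sent to the different OpenAI model versions. It assumes the legacy openai Python package (pre-1.0 interface) with an API key set in the environment; the query functions, hyperparameters, and prompt string are placeholders, and the actual replication code lives in main.py and gpt_structure in the project repository. Querying Mistral is omitted here because it depends on how the open-source model is hosted.

```python
# Sketch only: assumes the legacy `openai` Python package (<1.0) and an
# OPENAI_API_KEY environment variable. Hyperparameters are placeholders,
# not necessarily those used by Horton or in this replication.
import openai


def query_davinci(prompt: str, temperature: float = 0.0) -> str:
    """Query GPT-3 (text-davinci-003) via the completions endpoint."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        temperature=temperature,
        max_tokens=64,
    )
    return response["choices"][0]["text"].strip()


def query_gpt4(prompt: str, temperature: float = 0.0) -> str:
    """Query GPT-4 via the chat completions endpoint."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
        max_tokens=64,
    )
    return response["choices"][0]["message"]["content"].strip()


# Run the same persona-conditioned prompt against both model versions.
example_prompt = (
    "You are a moderate. A hardware store has been selling snow shovels for "
    "$15. The morning after a large snowstorm, the store raises the price to "
    "$20. Please rate this action as: 1) Completely Fair 2) Acceptable "
    "3) Unfair 4) Very Unfair"
)
for name, fn in [("text-davinci-003", query_davinci), ("gpt-4", query_gpt4)]:
    print(name, "->", fn(example_prompt))
```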
Results
In this section, we provide a summary of the outcomes from our replication study. Specifically, we present our reproductions of two studies that were examined in Horton’s paper: the replication of the experiments detailed in Charness and Rabin (2002) [6] and Kahneman, Knetsch, and Thaler (1986) [7].
Data preparation
Our code for data preparation and analysis can be found in the “experiments” folder within our publicly accessible repository, linked above. Specifically, the “main.py” file contains all the essential code for executing the primary analysis and data preparation procedures, and its results are stored in an “output” folder. Functions for interacting with the language models are located in the “gpt_structure” folder, which includes accompanying documentation.
Key analysis
Overall, our analysis reveals that the findings from Horton’s exploratory study generally replicate and exhibit similar trends. However, there are differences attributable to variations in the model used; notably, the underlying models, especially those maintained by OpenAI, undergo continuous updates and safety improvements. Below, we provide an interpretation of our results for each of the two studies. For each study, we present both the original results figure reported by Horton and a table generated as part of our replication effort.
Charness and Rabin (2002)
Horton’s original findings suggest that unless provided with a specific persona, the AI agent tends to prioritize maximizing its own payoff: it consistently chooses the option that yields the highest monetary benefit for itself. For example, when faced with a choice between option 1 (I receive $200 and the other receives $700) and option 2 (I receive $600 and the other receives $600), the agent selects the second option, which maximizes its personal gain. However, this behavior begins to shift when the agent is assigned an explicit persona that instructs it to consider fairness between players or the overall payoff for both participants.
In our replication study, we observe that the models generally exhibit a tendency toward self-maximization when no persona is provided. For GPT-4 and GPT-3, we notice a slight shift in behavior once an explicit persona is introduced. In this regard, Horton’s findings largely align with the trends observed in our replication. However, it is important to note that in our replication a significant number of responses consisted of the language model refusing to provide an answer, stating that it cannot make human-like decisions due to its AI nature. This phenomenon is particularly prominent in GPT-4 and GPT-3, OpenAI’s models, and we presume it is a result of OpenAI’s ongoing efforts to enhance safety measures and prevent undue anthropomorphism. Nonetheless, this makes a direct comparison between Horton’s work and our replication somewhat challenging.
In contrast, the open-source model, Mistral, did not refuse to answer, but it struggled to grasp the nuances conveyed by the persona. Its responses were mostly consistent across different conditions, leaning toward self-maximization, which contradicts Horton’s findings based on GPT-3 models.
Kahneman, Knetsch, and Thaler (1986)
In Horton’s original findings for the second study, by Kahneman, Knetsch, and Thaler (1986) [7], liberal individuals tended to perceive the situation as unfair when the price of an item changes due to unexpected market demand, whereas more conservative individuals were generally more inclined to adopt a neutral or accepting stance. Additionally, the wording of the scenario (“changes the price to” vs. “raises the price to”) did not have a significant impact. In our replication, we once again observe that these trends are generally consistent, particularly for the GPT-4 and GPT-3 models, and the wording again did not have a significant impact. However, Mistral exhibits a distinct pattern: it predominantly labels the situation as unfair, without fully registering the political leaning conveyed by the persona it is simulating.
Exploratory analyses
Models such as GPT-4, GPT-3, and Mistral, as employed in these studies, simulate collective behavioral patterns. However, is it possible to develop models capable of simulating an individual’s viewpoint rather than focusing solely on aggregate-level behavior?
In my exploratory analyses, I delve into this question by creating a generative agent that represents me. This agent is a language model, specifically GPT-4, augmented with an external memory containing information about me, extracted from an interview transcript in which I shared my personal history and viewpoints. I then completed the experiments outlined in this replication study myself to establish ground-truth labels for these tasks, and had “generative Joon” predict my actual responses. The results indicate that the predictions were generally accurate, although they occasionally misjudged the extent or degree of my responses.
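As a rough illustration of this setup, the sketch below conditions GPT-4 on snippets retrieved from an interview transcript before asking it to answer an experimental question. The file name, the naive word-overlap retrieval, and the prompt wording are simplifying assumptions; the full generative agent architecture [3] is considerably more elaborate.

```python
# Rough sketch only: a real generative agent [3] uses a richer memory and
# retrieval architecture. The transcript file, scoring heuristic, and prompt
# wording here are assumptions. Assumes the legacy `openai` package (<1.0).
import openai


def retrieve_memories(query: str, transcript_path: str = "interview.txt",
                      k: int = 3) -> list[str]:
    """Naively rank transcript paragraphs by word overlap with the query."""
    with open(transcript_path) as f:
        paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]
    query_words = set(query.lower().split())
    ranked = sorted(paragraphs,
                    key=lambda p: len(query_words & set(p.lower().split())),
                    reverse=True)
    return ranked[:k]


def generative_agent_answer(question: str) -> str:
    """Ask GPT-4 to answer as the interviewee, grounded in retrieved memories."""
    memories = "\n".join(retrieve_memories(question))
    prompt = (
        "Here are excerpts from an interview describing a person's history "
        f"and viewpoints:\n{memories}\n\n"
        "Answer the following question as that person would:\n"
        f"{question}"
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"].strip()
```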
Discussion
Summary of Reproduction Attempt
Our replication study using OpenAI’s GPT-4 and GPT-3 successfully reproduced the general trends discussed in Horton’s exploratory study, which employed large language models (LLMs) to simulate participants in behavioral studies. However, it is crucial to acknowledge that these models, despite being accessible through the same API, can evolve over time. We observed potential impacts of such changes in some of our findings, where the model declined to select any of the response options, stating that it would not make a human decision. While this emphasis on safety and the avoidance of anthropomorphism is a commendable effort by those maintaining these models, it presents a significant challenge when employing them for scientific purposes.
Open-source models, which have gained significant attention recently, offer a potential solution to this issue, as users have complete control over their modifications and updates. However, in our current study, we find that open-source models like Mistral may not sufficiently capture the nuances in human decision-making as derived from persona descriptions. Consequently, we were unable to replicate certain findings using this model.
Commentary
It is encouraging to see that the general trends and findings from Horton’s studies persist with the latest and arguably most powerful model. However, I anticipate that the role of open-source models will become increasingly important as we grapple with the need for full control over the behavior and versions of these language models to facilitate replicable scientific research.
Furthermore, these aggregate-level findings pose an intriguing challenge when assessing the model’s ability to generate human behavior. Namely, is the model replicating these findings because it genuinely generates human behavior in a reliable manner, or is it merely an artifact of having memorized well-known social science studies from its training data? I propose that modeling individuals and benchmarking the model’s ability to predict individual responses offers a promising solution to this problem. The fact that “generative Joon” appeared to replicate my behavior in these studies is promising and points towards an exciting new direction for LLM-based social science research.
Works Cited
[1] Thomas Schelling. 1978. Micromotives and Macrobehavior.
[2] Eric Bonabeau. 2002. Agent-based modeling: Methods and techniques for simulating human systems. PNAS.
[3] Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. Generative agents: Interactive simulacra of human behavior.
[4] Joon Sung Park, Lindsay Popowski, Carrie Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2022. Social simulacra: Creating populated prototypes for social computing systems. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, pages 1–18.
[5] John Horton. 2023. Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus?
[6] Gary Charness and Matthew Rabin. 2002. Understanding social preferences with simple tests. The Quarterly Journal of Economics, 117(3), 817–869.
[7] Daniel Kahneman, Jack L. Knetsch, and Richard Thaler. 1986. Fairness as a constraint on profit seeking: Entitlements in the market. The American Economic Review, 728–741.
[8] William Samuelson and Richard Zeckhauser. 1988. Status quo bias in decision making. Journal of Risk and Uncertainty, 1(1), 7–59.
[9] Lisa P. Argyle, Ethan C. Busby, Nancy Fulda, Joshua R. Gubler, Christopher Rytting, and David Wingate. 2023. Out of one, many: Using language models to simulate human samples. Political Analysis.
[10] Marcel Binz and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. PNAS.