Replication of Study X by Sample & Sample (20xx, Psychological Science)

Author

Replication Author[s] (contact information)

Published

December 12, 2025

Introduction

Justification: By replicating this project, I want to gain experience running experiments with dyadic interaction. I am interested in understanding how social norms form, and repeated games of this kind are a common way to investigate this problem, so they will continue to be useful in my research program.

Stimuli and Procedures: Participants will play a 2-player iterated cooperative decision-making task, which I will implement in Empirica. Participants will be informed that they must select a parking spot in a virtual parking lot over the course of several days. Different spots will cost different amounts of a virtual currency (Monetary Units; prices remain fixed over days). Participants will be incentivized to minimize the cost they pay, as a lower cost corresponds to higher pay for participating in the experiment. There are two zones in the lot, an orange zone and a purple zone, with two parking spots per zone, and participants will select a spot to park in before seeing their partner’s selection. Participants will be informed that selecting the same color zone as their partner gives them a “group discount”, but that selecting the exact same parking spot incurs a penalty price. After making their decisions, participants will be shown the action their partner took and the price each participant paid.
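To make the incentive structure concrete, here is a minimal sketch of a per-trial cost function. Every number in it (the base prices, the size of the group discount, the penalty price) is a made-up placeholder, and the spot names and the rule that the penalty replaces the base price are assumptions for illustration; the actual values are those used in the original paper and in the Empirica implementation.

```r
# All values below are hypothetical placeholders, NOT the prices used in the study.
base_price <- c(orange1 = 6, orange2 = 6, purple1 = 4, purple2 = 8)
group_discount <- 2      # hypothetical reduction when both players park in the same zone
collision_penalty <- 10  # hypothetical price when both players pick the same spot

# Strip the trailing digit to get the zone, e.g. "purple1" -> "purple"
zone_of <- function(spot) sub("[0-9]+$", "", spot)

# Cost paid by one player on one trial, given both players' choices
cost_paid <- function(my_spot, partner_spot) {
  if (my_spot == partner_spot) {
    return(collision_penalty)  # same spot: penalty price (assumed to replace the base price)
  }
  price <- base_price[[my_spot]]
  if (zone_of(my_spot) == zone_of(partner_spot)) {
    price <- price - group_discount  # same zone, different spot: group discount
  }
  price
}

cost_paid("purple1", "purple2")  # same zone, discount applies
cost_paid("purple1", "purple1")  # same spot, penalty applies
cost_paid("orange1", "purple2")  # different zones, base price
```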

Here are the links to the repo and paper.

Project Progress

Pilot A: Results and preliminary graphs corresponding to the original paper can be found in the pilotA_results folder of this GitHub repo.

Outcome Measure: The success of the replication will be measured by how well the distribution of strategies that the partners adopt matches the distribution from the original paper. More specifically, we will use the same statistical analyses to compare the results obtained in condition 3 with the results from conditions 1 and 2 of the experiment to see if the qualitative differences that they report replicate. The main result we want to replicate is that “participants developed an alternating norm more frequently than in the control (β = 0.10,CrI = [0.01,0.20])”. Additionally, we are also interested in the result that “participants were less likely to converge on stable selection on orange compared to [condition 1] (β=−0.75, CrI = [−0.92,−0.59]) and 2 (β=−0.29, CrI = [−0.46,−0.13])” and that “they failed to form any norm more frequently compared to the control (β= 0.18, CrI = [0.04, 0.31])”.

We will do the same statistical tests comparing results from other conditions in the original paper with our replication results of condition 3, to see whether these observations still hold.

We will also confirm whether “participants paid less over the course of the game” still holds, which is a key indicator of whether participants converged to systematic norms over time.

Methods

Power Analysis

The power analysis below indicates that we need approximately 52 pairs, and therefore 104 participants, to achieve a power of 0.8.

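library(pwr)

p1 <- 9.5 / 84  # probability of alternating on purple in condition 3
p2 <- 1 / 102   # probability of alternating on purple in condition 1

# Cohen's effect size h for the difference between the two proportions
h <- ES.h(p1, p2)

# Solve for n per group: one-sided test, alpha = 0.05, power = 0.8
result <- pwr.2p.test(h = h, sig.level = 0.05, alternative = "greater", power = 0.8)
result$n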

[1] 52.0063
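As a cross-check that does not depend on the pwr package, the same number can be reproduced in base R. The sketch below assumes that pwr.2p.test uses the standard one-sided normal approximation on Cohen's h with equal group sizes; the variable names are my own.

```r
# Proportions used in the power analysis above
p1 <- 9.5 / 84  # condition 3
p2 <- 1 / 102   # condition 1

# Cohen's h: difference between arcsine-transformed proportions
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

alpha <- 0.05
power <- 0.8

# One-sided normal approximation (assumed to be what pwr.2p.test solves):
#   power = pnorm(h * sqrt(n / 2) - qnorm(1 - alpha))
# Solving for n gives the closed form below.
n_per_group <- 2 * ((qnorm(1 - alpha) + qnorm(power)) / h)^2

n_per_group           # required pairs per condition (unrounded)
ceiling(n_per_group)  # rounded up to whole pairs
```

The unrounded value agrees with the pwr output above. Rounding up strictly gives 53 pairs per condition, so a sample of 52 pairs sits marginally below the 0.8 target.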

Planned Sample

We plan to run 104 participants.

Materials

The experiment will be run online. Participants will be recruited through Prolific and sent to a link hosted on Google Cloud, where they will complete the task with another participant. The task is implemented using Empirica. The code used for the online experiment is available at this link.

Procedure

This is the procedure as reported in the original paper:

> [Participants] were shown the parking lot of their assigned treatment and were asked to write a strategic plan describing how they would ideally play the game. After writing, participants progressed to a treatment-specific waiting room and were paired with the first available partner. They played 12 trials of the game; after each trial, participants were shown their partner’s move and cost paid on the previous trial, and needed to indicate the cost they themselves paid as an attention check. The task took a median time of 12 minutes. Participants were paid a base rate of $12.50/hr; they were incentivized to minimize their overall cost in the game through a performance-based compensation bonus. Participants were informed they would play multiple trials but not told precisely how many, to induce uncertainty in the time horizon.

We follow the same procedure, except that participants are paid a base rate of $8/hr.

Analysis Plan

We follow the same procedure as in the paper, quoted below:

> We excluded from analysis participants who failed to select a parking spot in the allotted time and did not finish the game, as well as participants who wrote fewer than 10 characters in a pre-game writing task.

The key analysis of interest is whether the proportion of participants who used the strategy of alternating between the two purple parking spots was greater in condition 3 than in the control (condition 1). We will compare the condition 1 results from the original paper against the condition 3 results we obtain, using a Bayesian regression model, brm(isAlternatingPurple ~ condition), to see whether the credible interval for the condition effect lies above 0. As a second metric, we will use the prop.test function to test whether the proportion of pairs alternating on purple differs between conditions 3 and 1.
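The proportion test can be sketched as follows. The control count of 1 alternating pair out of 102 is an assumption read off the inputs to the power analysis, and the replication counts are hypothetical placeholders, not observed data. The Bayesian model is not sketched here because it requires the brms package and a fitted model.

```r
# Assumption: 1 of 102 control pairs alternated on purple in the original study.
# The replication counts are placeholders, NOT observed data.
alt_original    <- 1
n_original      <- 102
alt_replication <- 6   # hypothetical
n_replication   <- 52  # planned number of pairs

# One-sided test: is the alternating proportion greater in condition 3?
pt <- prop.test(
  x = c(alt_replication, alt_original),
  n = c(n_replication, n_original),
  alternative = "greater"
)

# With counts this small the chi-squared approximation is shaky, so an exact
# test on the same 2 x 2 table is a useful cross-check.
tab <- matrix(
  c(alt_replication, n_replication - alt_replication,
    alt_original,    n_original - alt_original),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("condition3", "condition1"),
                  c("alternating", "not_alternating"))
)
ft <- fisher.test(tab, alternative = "greater")

c(prop_test = pt$p.value, fisher = ft$p.value)
```

Because so few pairs alternate in either condition, prop.test may warn that the chi-squared approximation is unreliable; reporting the exact test alongside it is one way to guard against that.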

Differences from Original Study

Here are the main differences from the original study:

We are only replicating condition 3 of the experiment, hence the results for condition 3 we replicate will be compared with results from condition 1 of the authors’ paper instead of results from our own experiment.
The classification algorithm for the different strategies is not publicly available so I created my own algorithm to classify the strategies. However, there should be minimal differences, as there is little ambiguity in which traces of the interaction should be classified into different strategies.
Participants were paid $8/hour instead of $12.50/hour as in the original study.

However, we do not expect these differences to substantially influence the outcome of this study.
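To illustrate what a strategy classifier of this kind might look like, here is a minimal sketch. It is not the original authors' algorithm and not the code used in this replication: the spot labels, the choice to look only at the final trials, and the window size are all assumptions made for the example.

```r
# Hypothetical classifier for one dyad. p1 and p2 are character vectors of the
# spots chosen by each player on each trial, e.g. "orange1", "purple2".
classify_dyad <- function(p1, p2, window = 6) {
  stopifnot(length(p1) == length(p2), length(p1) >= window)

  # Look only at the final `window` trials, where a norm should have settled
  idx <- seq(length(p1) - window + 1, length(p1))
  a <- p1[idx]
  b <- p2[idx]

  zone <- function(spots) sub("[0-9]+$", "", spots)  # "purple1" -> "purple"

  # Both players must stay in one shared zone and never pick the same spot
  same_zone <- all(zone(a) == zone(b)) && length(unique(zone(a))) == 1
  no_collision <- all(a != b)
  if (!same_zone || !no_collision) return("other")

  # Stable: each player keeps one spot. Alternating: each player switches
  # spot on every consecutive trial.
  stable <- length(unique(a)) == 1 && length(unique(b)) == 1
  alternating <- all(a[-1] != a[-window]) && all(b[-1] != b[-window])

  z <- zone(a)[1]
  if (stable) return(paste0("stable_", z))
  if (alternating && z == "purple") return("alt_purple")
  "other"
}

# Example traces over 12 trials
classify_dyad(rep(c("purple1", "purple2"), 6), rep(c("purple2", "purple1"), 6))
classify_dyad(rep("orange1", 12), rep("orange2", 12))
```

A stricter or looser definition (for example, tolerating one deviation inside the window) would change how borderline traces are labeled, which is where any disagreement with the original classification would most likely arise.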

Methods Addendum (Post Data Collection)

Actual Sample

Although we recruited 104 participants, server overload and participant disconnections meant that only 21 dyads completed the game. We followed the same exclusion criteria as the original paper, including the attention check protocol, in which we asked participants what cost they paid after each round of the game. Nine of the 21 dyads scored below 60% in correctly reporting the cost they paid and were therefore excluded, leaving us with 12 dyads to analyze. We suspect that one reason for these low scores is that we did not give participants feedback on whether their reported costs were correct. Furthermore, they may not have had enough time to fill in their answer, as suggested by the many empty submissions to the attention check.
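The attention-check exclusion can be sketched as below. The data frame, its column names, and its values are invented for illustration, and two choices are assumptions rather than documented details of the analysis: accuracy is pooled over both players in a dyad, and an empty submission counts as incorrect.

```r
# Hypothetical long-format data: one row per participant per trial
checks <- data.frame(
  dyad_id  = rep(c("d1", "d2"), each = 8),
  player   = rep(rep(c("p1", "p2"), each = 4), times = 2),
  paid     = c(3, 5, 3, 5,  5, 3, 5, 3,   4, 4, 6, 6,   4, 4, 6, 6),
  reported = c(3, 5, 3, 5,  5, 3, NA, 3,  4, NA, NA, 6, 1, 4, NA, 2)
)

# An empty submission (NA) is treated as incorrect (assumption)
checks$correct <- !is.na(checks$reported) & checks$reported == checks$paid

# Share of correct reports per dyad, pooled over both players (assumption)
accuracy <- aggregate(correct ~ dyad_id, data = checks, FUN = mean)

# Keep dyads that reach the 60% threshold
threshold <- 0.6
kept_dyads <- as.character(accuracy$dyad_id[accuracy$correct >= threshold])

accuracy
kept_dyads
```

Counting empty submissions as incorrect is the stricter reading of the rule; dropping them before computing accuracy would retain more dyads.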

Differences from pre-data collection methods plan

As reported in the section above, we were able to analyze only 12 dyads, substantially fewer than the 52 dyads we had planned.
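To give a rough sense of what the smaller sample implies, the sketch below re-evaluates the planned power calculation at 12 pairs per group. It reuses the proportions from the power analysis and assumes the same one-sided normal approximation with equal group sizes, so it is an illustration under those assumptions rather than an additional reported result. In particular, it ignores the fact that the comparison group from the original paper is larger than 12.

```r
# Proportions from the planned power analysis
p1 <- 9.5 / 84  # condition 3
p2 <- 1 / 102   # condition 1

# Cohen's h for the two proportions
h <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Power of a one-sided two-proportion test with n pairs per group
# (normal approximation, equal group sizes assumed)
power_at <- function(n, h, alpha = 0.05) {
  pnorm(h * sqrt(n / 2) - qnorm(1 - alpha))
}

power_at(52, h)  # planned sample
power_at(12, h)  # analyzed sample
```

Under these assumptions the power at 12 pairs per group is roughly one third, compared with about 0.8 at the planned 52.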

Results

Data preparation

Data preparation following the analysis plan.

Confirmatory analysis

The analyses as specified in the analysis plan.


library(ggplot2)
library(gridExtra)


Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

# Original data: keep condition 3 only
original_d <- read.csv("../data/simplified_original_data.csv")
original_d <- original_d[original_d$condition == "condition3", ]
# for each strategy, sum up the counts
strategy_counts <- data.frame(
    strategy = c("stable_orange", "stable_purple", "alt_purple", "other"),
    count = c(sum(original_d$stable_orange), sum(original_d$stable_purple), sum(original_d$alt_purple), sum(original_d$other))
)
p_original <- ggplot(strategy_counts, aes(x = strategy, y = count / sum(count))) +
  geom_col() +
  labs(x = "norm type", y = "proportion of pairs exhibiting norm", title = "Condition 3 in Figure 3\n in original paper") +
  theme_minimal() +
  scale_x_discrete(limits = c("stable_orange", "stable_purple", "alt_purple", "other"),
                   labels = c("stable\norange", "stable\npurple", "alt\npurple", "other"))

ggsave("../data/original_strategy_plot.png", plot = p_original, width = 6, height = 4, dpi = 300)

# Replication data
experimental_d <- read.csv("../data/full_replication/simplified_experimental_data.csv")
# for each strategy, sum up the counts
strategy_counts_exp <- data.frame(
    strategy = c("stable_orange", "stable_purple", "alt_purple", "other"),
    count = c(sum(experimental_d$stable_orange), sum(experimental_d$stable_purple), sum(experimental_d$alt_purple), sum(experimental_d$other))
)
# Plot proportions rather than raw counts so the y-axis matches its label
# and is comparable with the original panel
p_experimental <- ggplot(strategy_counts_exp, aes(x = strategy, y = count / sum(count))) +
  geom_col() +
  labs(x = "norm type", y = "proportion of pairs exhibiting norm", title = "Replication Results") +
  theme_minimal() +
  scale_x_discrete(limits = c("stable_orange", "stable_purple", "alt_purple", "other"),
                   labels = c("stable\norange", "stable\npurple", "alt\npurple", "other"))
ggsave("experimental_strategy_plot.png", plot = p_experimental, width = 6, height = 4, dpi = 300)
grid.arrange(p_original, p_experimental, ncol = 2)

Exploratory analyses

I also plan to compare the distribution of strategies in condition 3 of the original paper with the distribution in the replication.

Discussion

Summary of Replication Attempt

Qualitatively, the results from condition 3 of the original paper replicated. Furthermore, Bayesian regression analyses (included in the analysis section) showed that participants in condition 3 developed the alternating norm more frequently than in the control (β = 0.09, CrI = [0.05, 0.13]), consistent with the original paper’s estimate (β = 0.10, CrI = [0.01, 0.20]).* In addition, our proportion test showed the same result (p < 0.05). Therefore, we conclude that we were successful in a partial replication.

*We are not sure whether the same Bayesian model was used, as we do not have access to their analysis code.

Commentary


(a) I looked into the pre-game plans written by dyads who converged on the alternating norm. In one pair, both partners described the exact alternating strategy; the other pairs who developed the alternating strategy mentioned something along the lines of coordinating so that they both choose purple. Their plans were generally more sophisticated than the average initial plan given in the study. This lends further support to an exploratory analysis by the original authors, which suggested that the written plans were predictive of the strategy that the dyads developed during the interaction.

(b) The fact that some dyads in the replication developed an alternating norm shows that this norm can plausibly develop within the short timeframe of the experiment. This surprised me because, a priori, it seemed hard to come up with the alternating strategy, let alone coordinate on it with a partner without communication.

(c) It was hard to figure out the strategy classification algorithm, as the one shown in the paper (“algorithm 1”) served more as a conceptual explanation than as a description of what was actually used in the analysis. As the authors mention in a section of the paper, the classification algorithm actually used was different, which confused me.