Replication of People make the Same Bayesian Judgment They Criticize in Others by Cao, Kleiman-Weiner and Banaji (2019, Psychological Science)

Introduction

I chose this paper because it’s an experimental application of Bayes theorem, a ubiquitous framework in the social sciences. Replicating it will require me to internalize the theorem’s basic intuition as well as its more complex manifestations. The experimental paradigm involves a qualitative and computational exploration of how beliefs are updated with evidence. This is a great opportunity to think about how Bayesian principles fit in with people’s real world decision-making, an invaluable tool in a social scientist’s arsenal.

This paper explored participants’ judgments about the likelihood of a hypothetical person being of a particular occupation based on idiosyncratic statistical heuristics. The replication target will be study 5, which will consist of three parts; the first part will query people’s prior, posterior, and likelihood estimates of an air traffic control (ATC) communicator either being male or female; the second part will compute a model posterior for each participant and compare it with their actual posterior; and the third part will have the participants evaluate the moral character of a third party who makes a Bayesian judgment about the same scenario.

If the results of the original study hold, this study expects to find that participants make Bayesian judgments about the ATC scenario, with actual and model posteriors favoring the communicator being male rather than female. A second key finding will be that participants will judge a third party who makes the same Bayesian judgments as themselves as being unfair, unjust, inaccurate and unintelligent.The study is expected to be conducted on Amazon’s task crowd-sourcing marketplace, Mechanical Turk. Some challenges expected include low quality of data due to bots or inattentive participants, and the possibility of participants looking up answers to the filler questions about unrelated statistical phenomena.

Link to the repository
Link to the Qualtrics survey

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

Planned Sample

Planned sample size and/or termination rule, sampling frame, known demographics if any, preselection rules if any. .

Materials

The study will be conducted in October 2020. Participants will be recruited from Mturk and compensated $0.71 each.

Procedure

The study will proceed in three parts, each of which will correspond to a component of Bayes’s rule.

“In the first part, each participant was randomly assigned to learn that either a man or a woman had communicated with air traffic control during a flight. Participants provided their priors, posteriors, and likelihoods for this scenario”

(Quoted - edited) "Part 1: priors. Participants will be instructed to imagine a man and a woman who work at the same airline. One person is a pilot and the other person is not a pilot, but who is the pilot and who is not is unknown. Participants will estimate the percentage chance that each person is the pilot. Because there are two hypotheses—either the man or the woman is the pilot (and the other is the not)—both estimates had to sum to 1. Thus, each participant will provide his or her subjective prior about each person’s profession (e.g., the man has a 75% chance of being the pilot; the woman has a 25% chance of being not the pilot).

Part 2: posteriors. After providing priors, each participant will be randomly assigned to learn one of the following two pieces of data: (a) The man communicated with air traffic control(b) the woman communicated with air traffic control. After learning this datum, participants will again estimated the percentage chance that each person is the pilot. Thus, each participant will provide his or her subjective posterior.

Part 3: likelihoods. Each participant will estimate two likelihoods: the likelihood of observing the datum given the hypothesis that the target they learned about is the pilot and the likelihood of observing the datum given the hypothesis that the target they learned about is not the pilot. For example, if a participant learned that the woman had communicated with air traffic control, that participant will estimate the percentage of female pilots who communicate with air traffic control and the percentage of female non-pilots who communicate with air traffic control. If a participant learned that the man had communicated with air traffic control, that participant will estimated the percentage of male pilots who communicate with air traffic control and the percentage of male non-pilots who communicate with air traffic control. Thus, each participant will provide his or her subjective likelihood estimates, which will be combined by forming a ratio. Each participant will be randomly assigned to estimate the corresponding likelihoods either before or after providing subjective priors and posteriors. Each participant’s priors and likelihoods will be entered into Bayes’s rule to compute a model posterior, which represents what the participant’s posterior should be from a statistical perspective. This model posterior will be compared with the posterior that the participant actually reported."

Analysis Plan

Participants will be excluded who provide priors of either 0% of 100% since these cannot be updated in accordance with Bayes rule.

“Each participant’s priors and likelihoods will be entered into Bayes’s rule to compute a model posterior, which represents what the participant’s posterior should be from a statistical perspective. This model posterior will be compared with the posterior that the participant actually reported.”

Key descriptive statistics such as means and standard errors for judgments among participants in each condition will be computed and some plotted.

Likelihood ratios will be log-scaled and plotted in each condition.

A scatterplot will be generated showing the relationship between evaluation of person X (x-axis) and statistical accuracy (model posterior subtracted from reported posterior)

A paired t-test will be conducted to investigate whether participant’s model and reported posteriors favor the man or the woman to be the pilot. This will also be used to compute the difference between the model and reported posteriors among those who learnt that the man versus the woman communicated with ATC.

The four Likert scales used to generate moral judgments about person X’s will be tested for reliability by computing a Cronbach’s alpha. The means of all 4 scales will be averaged in each condition to form a composite measure of participant’s evaluation of person X.

Clarify key analysis of interest here You can also pre-specify additional analyses you plan to do.

Finally, a graph will be plotted for the relationship between reported posterior probabilities and evaluations of person X (an average of the four Likert scale DVs).

Differences from Original Study

The study is expected to be replicated with a much smaller sample than the original 353 due to budget concerns. A power analysis will be conducted to determine the ideal sample size under which an effect is expected to be observed. The procedure and analysis will be the same one used in study 5. A key difference is that the replication study will not ask participants to complete filler tasks consisting of unrelated statistical judgments on the second part of the study, as the authors did. These differences are not anticipated to make a differences to the final result.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Confirmatory analysis

The analyses as specified in the analysis plan.

Finally, a graph will be plotted for the relationship between reported posterior probabilities and evaluations of person X (an average of the four Likert scale DVs).

Side-by-side graph with original graph is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.