Replication of Experiment 1 by Paxton, Unger, and Greene (2012, Cognitive Science)

Author

Alexander Pereira (pereirak@stanford.edu)

Published

October 23, 2023

Introduction

Paxton, Unger, and Greene (2012) investigated the role of reflection and reasoning in moral judgements. I’m a PhD student in philosophy and a recent MA student in psychology. While I am not primarily interested in moral philosophy or psychology, I am interested in the project of empirically testing the intuitions and assumptions that philosophers often adopt when making arguments. Here, it has been widely assumed in philosophy that reflection and reasoning should change one’s moral views. In fact, if this weren’t assumed, it would be difficult to justify engaging in moral philosophy at all. However (by an ought-implies-can principle) we should be interested in knowing whether and in what ways reason and reflection can actually influence moral judgements in real people; this is the overarching theme of the present paper.

According to Haidt’s Social Intuitionist Model (SIM) of moral judgement, moral evaluations are furnished by quick and automatic emotional responses, not by reasoning or reflection. However, other theories of moral judgement grant moral reasoning and reflection more influence on evaluations. Experiment 1 of Paxton, Unger, and Greene (2012) tested whether reflection caused subjects to engage in utilitarian (cost-benefit) moral reasoning. Subjects were randomly assigned to complete the Cognitive Reflection Test (CRT) either before or after rating solutions to moral dilemmas. The CRT is thought to induce a reflective state that can override intuitions. It was hypothesized that subjects in the CRT-first condition would be more likely to evaluate moral quandaries according to utilitarian (cost-benefit) reasoning, as reflection would override initial aversions to extreme actions (like killing one person to save many). Three “high-conflict” moral dilemmas were used. All participants evaluated whether a utilitarian response was acceptable, first with a binary response (YES/NO), then on a rating scale (1 = Completely Unacceptable, 7 = Completely Acceptable). Finally, they answered basic demographic questions.

Anticipated challenges in replicating this study involve recreating the experimental design and, if needed, adding statistical checks.

Link to repo Link to original paper in repo

Summary of prior replication attempt

There were three main differences between the original study (Paxton et al., 2012) and the replication attempt (Fereday, 2019). First, the CRT questions were rewritten and a fourth CRT question was added. Fereday (2019) justifies this change by noting that the original phrasings of the CRT questions have been “overused” on Mechanical Turk, from which participants for both the original paper and the replication attempt were drawn. Second, an attention check was built into the survey. Third, the moral dilemma questions included two sub-conditions, which were randomly assigned to participants. In the personalized condition, characters in the questions had names; in the depersonalized condition, characters were not named. Fereday (2019) aimed to investigate the relationship between personalization of questions and the mean acceptability rating of utilitarian solutions to dilemmas.

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

How much power does your planned sample have for original effect? For an attenuated effect that is half the size of the original?

(If power analysis is not possible or precise, discuss more fully how you determined a sample size that would be sufficient for rescue.)
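As a starting point for the calculation requested above, the following is a minimal sketch that approximates the per-group sample size for a two-sided, two-sample t-test using the normal approximation. The effect size `D_PLACEHOLDER` is a hypothetical value chosen only for illustration, not the effect size reported in the original paper, and the function names are my own. A dedicated power-analysis tool based on the noncentral t distribution would give slightly larger sample sizes.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group n for a two-sided, two-sample t-test
    (normal approximation, equal group sizes)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def approx_power(d, n, alpha=0.05):
    """Approximate power with n subjects per group
    (ignores the negligible opposite tail)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(d * sqrt(n / 2) - z_alpha)


# Hypothetical placeholder, NOT the effect size from the original paper.
D_PLACEHOLDER = 0.5

for target in (0.80, 0.90, 0.95):
    n_full = n_per_group(D_PLACEHOLDER, power=target)
    n_half = n_per_group(D_PLACEHOLDER / 2, power=target)  # attenuated effect
    print(f"power={target:.2f}: n/group={n_full} (full), {n_half} (half-size effect)")
```

Because the required sample size scales with 1/d², halving the effect size roughly quadruples the sample needed for the same power. This matters when judging whether the planned sample is feasible.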

Planned Sample

Planned sample size and/or termination rule, sampling frame, known demographics if any, preselection rules if any.

Materials

The original Cognitive Reflection Test questions are below, as well as the rewritten version of the CRT. A fourth question, not used in the original paper, was added from a set of questions that another team of researchers used as a substitute for the CRT.

Cognitive Reflection Test (CRT) questions, quoted from Frederick (2005) and referenced in the original article:

“A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? _____ cents
If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? _____ minutes
In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? _____ days”
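Each of these items has an intuitive but incorrect answer (10 cents, 100 minutes, 24 days) and a correct answer that requires a moment of reflection. For the first item, if the ball costs x, then x + (x + 1.00) = 1.10, so x = 0.05, i.e., 5 cents. The sketch below scores the three original items. The item keys (`bat_ball`, `widgets`, `lily_pads`) are hypothetical names, not the variable names of any actual survey export, and the function assumes responses are entered as plain numbers.

```python
# Answer key for the three original CRT items, in the units each item asks for.
# The dictionary keys are hypothetical item names chosen for illustration.
CRT_KEY = {"bat_ball": 5, "widgets": 5, "lily_pads": 47}


def score_crt(responses):
    """Return the number of correct answers (0 to 3).

    Missing or non-numeric responses count as incorrect.
    """
    total = 0
    for item, correct in CRT_KEY.items():
        try:
            value = float(str(responses.get(item, "")).strip())
        except ValueError:
            continue  # blank or unparseable response
        total += int(value == correct)
    return total


# Example with made-up responses: one correct, two intuitive errors.
print(score_crt({"bat_ball": "5", "widgets": "100", "lily_pads": "24"}))
```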

Rewritten CRT Questions:

A book and a pencil cost $1.20 in total. The book costs $1.00 more than the pencil. How much does the pencil cost? _____ cents.
If it takes 10 programmers 10 minutes to make 10 improvements, how long would it take 50 programmers to make 50 improvements? _____ minutes

There is a grasshopper crossing a road. With every jump, the distance the grasshopper jumps doubles. If it takes 26 jumps for the grasshopper to cross the entire road, how many jumps would it take for the grasshopper to make it halfway across? _____ jumps.

A farmer had 15 sheep and all but 8 died. How many are left? _____ sheep.

Three high-conflict moral dilemmas, presented in randomized order either before or after the CRT questions (depending on the randomly assigned condition):

“John is the captain of a military submarine traveling underneath a large iceberg. An onboard explosion has caused the vessel to lose most of its oxygen supply and has injured a crewman who is quickly losing blood. The injured crewman is going to die from his wounds no matter what happens. The remaining oxygen is not sufficient for the entire crew to make it to the surface. The only way to save the other crew members is for John to shoot dead the injured crewman so that there will be just enough oxygen for the rest of the crew to survive.
Enemy soldiers have taken over Jane’s village. They have orders to kill all remaining civilians. Jane and some of her townspeople have sought refuge in the cellar of a large house. Outside they hear the voices of soldiers who have come to search the house for valuables. Jane’s baby begins to cry loudly. She covers his mouth to block the sound. If she removes her hand from his mouth, his crying will summon the attention of the soldiers, who will kill her, her child, and the others hiding out in the cellar. To save herself and the others, she must smother her child to death.
A runaway trolley is heading down the tracks toward five railway workmen, who will be killed if the trolley proceeds on its present course. Jane is on a footbridge over the tracks, in between the approaching trolley and the five workmen. Next to her on this footbridge is a lone railway workman, who happens to be wearing a large, heavy backpack. The only way to save the lives of the five workmen is for Jane to push the lone workman off the bridge and onto the tracks below, where he and his large backpack will stop the trolley. The lone workman will die if Jane does this, but the five workmen will be saved.”

This rescue project will follow the exact wording of the questions and dilemmas above.

Procedure

Quoted from original article:

“Subjects were randomly assigned to complete the CRT either before (CRT-First condition) or after (Dilemmas-First condition) responding to the dilemmas. Subjects evaluated the moral acceptability of the utilitarian action with a binary response (YES ⁄ NO), followed by a rating scale (1 = Completely Unacceptable, 7 = Completely Acceptable). No time limits were imposed on responses. Subjects completed the CRT questions and read and responded to the dilemmas at their own pace. Subjects subsequently completed a brief set of demographic questions.”

Controls

What attention checks, positive or negative controls, or other quality control measures are you adding so that a (positive or negative) result will be more interpretable?

I want to add an attention check to the survey to ensure that participants read the CRT questions, and especially the moral dilemmas, carefully. No attention check was mentioned in the original paper, and Fereday (2019) included an attention check in the “methods addendum” post data collection.

Analysis Plan

Exclusion rules: Exclude subjects who do not pass the attention check. Exclude subjects who did not answer at least one CRT question correctly (i.e., those with a CRT score of 0).
CRT scores: Calculate each participant’s CRT score by assigning 1 for each correct and 0 for each incorrect response. Minimum score is 0, maximum score is 3.
Reliability of moral acceptability ratings: Calculate Cronbach’s alpha to determine reliability across moral dilemmas.
Moral acceptability rating: Average each subject’s moral acceptability ratings across the dilemmas to create a single mean rating per subject.
Linear regression of utilitarian moral judgments on condition (CRT-First vs. Dilemmas-First).
Main Statistical Test: Between-subjects t-test comparing mean moral acceptability ratings in the CRT-First and Dilemmas-First conditions.
Controlling for trait reflectiveness: Test the correlation between CRT scores and moral acceptability ratings among subjects in the Dilemmas-First condition to rule out variation due to trait reflectiveness. Confirm with a Fisher r-to-z test.
Controlling for effects of performing a task before moral judgements: A potential objection is that simply doing any non-specific problem-solving task might change moral evaluations. To test this, calculate the correlation between CRT scores and moral acceptability ratings within the CRT-First condition.
Compare CRT scores across the two conditions by regressing CRT score on condition, to address the objection that completing the dilemmas first would influence subsequent CRT performance.
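The two correlation-based controls in the plan above can be sketched as follows. This is a minimal illustration, assuming Pearson correlations and the standard Fisher r-to-z comparison of two independent correlations. The function names are my own, and the numbers in the usage example are made up, not results from any of the studies.

```python
from math import atanh, sqrt
from statistics import NormalDist, mean


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def fisher_r_to_z(r1, n1, r2, n2):
    """Compare two independent correlations.

    Returns the z statistic and a two-sided p-value.
    """
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up inputs for illustration only: the CRT/rating correlation in the
# CRT-First group versus the Dilemmas-First group.
z, p = fisher_r_to_z(r1=0.40, n1=50, r2=0.00, n2=50)
print(f"z = {z:.2f}, p = {p:.3f}")
```

In practice, `pearson_r` would be applied within each condition to the per-subject CRT scores and mean acceptability ratings, and the two resulting correlations passed to `fisher_r_to_z` together with the group sizes.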

Clarify key analysis of interest here. You can also pre-specify additional analyses you plan to do.

Differences from Original Study and 1st replication

Explicitly describe known differences in sample, setting, procedure, and analysis plan from original study. The goal, of course, is to minimize those differences, but differences will inevitably occur. Also, note whether such differences are anticipated to make a difference based on claims in the original article or subsequent published research on the conditions for obtaining the effect.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Read in data
Delete unwanted rows and re-order data
Create new variables for CRT scores and Moral Acceptability Ratings.
Create new variable for total CRT score and average CRT score.
Create new column for CRT-first and Dilemma-first conditions.
Create new column to capture attention check responses across both conditions.
Create one column from the two conditions for binary judgement.
Create one column from the two conditions for moral acceptability.
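The column-merging steps above can be sketched as below. This assumes that the survey export stores each measure in two columns, one per condition, with only one of them filled for any participant. The column names and the `_crt_first` / `_dilemmas_first` suffixes are hypothetical and would need to be replaced with the names in the actual export.

```python
EMPTY = (None, "")

# Hypothetical measure names. Each is assumed to be exported as two columns,
# "<measure>_crt_first" and "<measure>_dilemmas_first", one of which is filled.
MEASURES = ("attention", "binary_submarine", "rating_submarine")


def coalesce(row, *columns):
    """Return the first non-empty value among the given columns, or None."""
    for col in columns:
        if row.get(col) not in EMPTY:
            return row[col]
    return None


def merge_conditions(row):
    """Add a 'condition' column and one merged column per measure."""
    merged = dict(row)
    if any(row.get(f"{m}_crt_first") not in EMPTY for m in MEASURES):
        merged["condition"] = "CRT-First"
    elif any(row.get(f"{m}_dilemmas_first") not in EMPTY for m in MEASURES):
        merged["condition"] = "Dilemmas-First"
    else:
        merged["condition"] = None  # no responses recorded in either block
    for m in MEASURES:
        merged[m] = coalesce(row, f"{m}_crt_first", f"{m}_dilemmas_first")
    return merged
```

Applying `merge_conditions` to every row yields one column per measure regardless of condition, so later steps do not need to know which block a response came from.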

Results of control measures

Filter out subjects with CRT scores of 0.
Filter out subjects who failed attention check.
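A minimal sketch of these two filters follows, assuming each participant is a dictionary with a boolean `attention_passed` field and an integer `crt_score` field (both field names are hypothetical). It also tallies exclusions by reason so the counts can be reported under Actual Sample.

```python
def apply_exclusions(participants):
    """Return (kept participants, tally of exclusion reasons).

    A participant who fails the attention check is counted under that
    reason only, even if their CRT score is also 0.
    """
    kept = []
    reasons = {"failed_attention_check": 0, "crt_score_zero": 0}
    for p in participants:
        if not p["attention_passed"]:
            reasons["failed_attention_check"] += 1
        elif p["crt_score"] == 0:
            reasons["crt_score_zero"] += 1
        else:
            kept.append(p)
    return kept, reasons
```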

Confirmatory analysis

Calculate Cronbach’s alpha to test reliability across moral acceptability scores.
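For reference, Cronbach's alpha for k items is k/(k-1) × (1 − Σ item variances / variance of the summed score). The sketch below computes it from per-dilemma rating columns. The function name and input layout (one list of ratings per dilemma, with subjects in the same order) are assumptions for illustration.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha.

    items: one list of scores per item (e.g., per dilemma), each with one
    entry per subject, with subjects in the same order across lists.
    """
    k = len(items)
    sum_item_vars = sum(variance(col) for col in items)
    totals = [sum(scores) for scores in zip(*items)]  # per-subject sum
    return k / (k - 1) * (1 - sum_item_vars / variance(totals))
```

With three dilemmas, `items` would hold three lists of 1-7 ratings. Values near 1 indicate that subjects who rate one utilitarian action as acceptable tend to rate the others similarly, which supports averaging the ratings into a single score.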

Main statistical test: Two-sample t-test comparing mean moral acceptability ratings between the CRT-First and Dilemmas-First conditions. If mean moral acceptability ratings are significantly greater in the CRT-First group than in the Dilemmas-First group, the results support the main inference of the paper, i.e., that CRT exposure (operationalizing “reflection”) causes an increase in utilitarian (cost-benefit) moral reasoning.
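The test statistic for this comparison can be sketched as follows. The sketch assumes a pooled-variance (Student) t-test, which is one reasonable reading of "two-sample t-test"; a Welch test would be an alternative if the group variances differ. It returns the t statistic, the degrees of freedom, and Cohen's d. The p-value is not computed here because it requires the t distribution, which is available in statistical libraries (for example, `scipy.stats.ttest_ind`).

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(crt_first, dilemmas_first):
    """Pooled-variance two-sample t-test on per-subject mean ratings.

    Returns (t, degrees of freedom, Cohen's d). Positive values mean the
    CRT-First group gave higher acceptability ratings.
    """
    n1, n2 = len(crt_first), len(dilemmas_first)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(crt_first)
                  + (n2 - 1) * variance(dilemmas_first)) / df
    diff = mean(crt_first) - mean(dilemmas_first)
    t = diff / sqrt(pooled_var * (1 / n1 + 1 / n2))
    d = diff / sqrt(pooled_var)
    return t, df, d
```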

Three-panel graph with original, 1st replication, and your replication is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Mini meta analysis

Combining across the original paper, 1st replication, and 2nd replication, what is the aggregate effect size?
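One common way to compute such an aggregate is a fixed-effect, inverse-variance weighted average of the standardized mean differences. The sketch below assumes each study is summarized by Cohen's d and its two group sizes, and uses the usual large-sample approximation for the variance of d. The numbers in the usage example are placeholders, not the effect sizes from the original paper or either replication.

```python
from math import sqrt


def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def fixed_effect(studies):
    """Inverse-variance weighted mean of d across studies.

    studies: iterable of (d, n1, n2) tuples.
    Returns (pooled d, standard error).
    """
    studies = list(studies)
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    pooled = sum(w * s[0] for w, s in zip(weights, studies)) / sum(weights)
    return pooled, sqrt(1 / sum(weights))


# Placeholder values for illustration only (d, n per group 1, n per group 2).
pooled, se = fixed_effect([(0.40, 50, 50), (0.10, 120, 120), (0.20, 80, 80)])
print(f"pooled d = {pooled:.2f}, "
      f"95% CI [{pooled - 1.96 * se:.2f}, {pooled + 1.96 * se:.2f}]")
```

With only three studies, a random-effects estimate of between-study heterogeneity would be very imprecise, which is why the fixed-effect form is used in this sketch.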

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.