Summary of prior replication attempts
There were three main differences in design between the original study (Paxton et al., 2012) and the previous PSYCH251 replication attempt (Fereday, 2019). First, the CRT questions were rewritten and a fourth CRT question was added. Second, Fereday added a sub-condition to the moral dilemmas, randomly assigned to participants: in the personalized condition the characters in the dilemmas were named, while in the depersonalized condition they were not. This was intended to investigate the relationship between personalization and the mean acceptability rating of utilitarian solutions to the dilemmas. Third, Fereday (2019) built an attention check into the survey.
Fereday’s (2019) study replicated neither of the two key findings from Paxton, Ungar, and Greene (2012). The first main statistical test was a two-sided t-test comparing the mean moral acceptability ratings of the CRT-First (experimental) and Dilemmas-First (control) groups. The replication yielded a nonsignificant difference (CRT-First: M = 3.54; Dilemmas-First: M = 3.66; t(82) = -.36, p = .72, d = .08), in contrast to the original study (CRT-First: M = 3.77; Dilemmas-First: M = 3.25; t(90) = 2.03, p = .05, d = .43). The second main statistical test was a correlation between the number of CRT questions participants answered correctly and their mean acceptability rating across the three dilemmas, computed separately for the CRT-First and Dilemmas-First conditions. The replication observed no significant correlation in the CRT-First group (r = .14, p = .32), while the original study observed a significant positive relationship (r = .39, p = .001). In the Dilemmas-First condition, the replication observed a nonsignificant positive relationship (r = .03, p = .85), whereas the original observed a nonsignificant relationship in the negative direction (r = -.03, p = .8).
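To make these two tests concrete, a minimal Python sketch is given below. The data frame and column names (condition, crt_score, mean_acceptability) are illustrative assumptions; this is not the original or replication analysis code.

```python
# Minimal sketch of the two key tests, assuming a hypothetical data file
# with one row per participant. All column names are illustrative.
import pandas as pd
from scipy import stats

df = pd.read_csv("replication_data.csv")  # hypothetical file name

crt_first = df[df["condition"] == "CRT-First"]
dilemmas_first = df[df["condition"] == "Dilemmas-First"]

# Test 1: two-sided independent-samples t-test on mean acceptability ratings
t, p = stats.ttest_ind(crt_first["mean_acceptability"],
                       dilemmas_first["mean_acceptability"])
print(f"t = {t:.2f}, p = {p:.2f}")

# Test 2: Pearson correlation between CRT score and mean acceptability,
# computed separately within each condition
for name, group in [("CRT-First", crt_first), ("Dilemmas-First", dilemmas_first)]:
    r, p = stats.pearsonr(group["crt_score"], group["mean_acceptability"])
    print(f"{name}: r = {r:.2f}, p = {p:.2f}")
```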
There have been several other replications of Experiment 1 from Paxton et al. (2012). Attie and Knobe (2017) failed to replicate the original results. There was no significant difference in responses between participants in the CRT-First condition (M = 3.19, SD = 1.52) and those in the CRT-After condition (M = 2.99, SD = 1.50), t(296) = -1.16, p = .24. There was also no significant correlation between utilitarian judgement and CRT score in the CRT-First condition, r(316) = .08, p = .11, 95% CI [-.03, .18].
Paxton et al. (2014), however, replicated both of the main findings using a slightly different moral dilemma (Sophie’s Choice). Their results were as follows: CRT-First M = 5.5, Dilemmas-First M = 3.0, t(15) = 2.2, p = .04; r = .44, p = .006.
Discussion
The present rescue aimed to improve on the experimental design of both the original and the replication in several ways. First, the rescue was higher powered than both (post-exclusion n = 160, for 95% power, compared with n = 92 in the original and n = 82 in the replication). Second, it used newer CRT questions from the CRT-2, along with reworded classic CRT questions, to address Fereday’s (2019) worry that familiarity with CRT questions impaired the induction of an appropriate reflective state. Third, participants were recruited from Prolific instead of MTurk, which also addressed the familiarity concern.
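As a sanity check on the power claim, the calculation can be sketched with statsmodels; the inputs below (80 participants per group, alpha = .05, two-sided) are illustrative assumptions rather than the rescue’s actual planning values.

```python
# Sketch of a two-sample t-test power analysis. Inputs are assumptions;
# the rescue's exact planning parameters are not restated in this section.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest standardized effect size (Cohen's d) detectable with 95% power,
# assuming 80 participants per group (160 total, post exclusions)
d_detectable = analysis.solve_power(nobs1=80, ratio=1.0, power=0.95,
                                    alpha=0.05, alternative="two-sided")
print(f"minimum detectable d: {d_detectable:.2f}")
```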
The rescue did not reproduce the main result of the original: participants who were shown CRT questions before the moral dilemmas did not deliver stronger utilitarian responses. This result might provide evidence against the claim that reflecting on moral dilemmas produces more utilitarian evaluations. However, the rescue did reproduce the other main result of the original study: in the CRT-First condition, participants who answered more CRT questions correctly were more likely to deliver stronger utilitarian responses. There was no such relationship in the Dilemmas-First condition, suggesting that the observed relationship was not caused by trait-level reflective or utilitarian attitudes.
However, a state-level explanation – on which CRT exposure induces a reflective state that causes stronger utilitarian judgements – does not clearly fit either: if it were correct, we should have seen a significant difference between the utilitarian ratings of the CRT-First and Dilemmas-First groups, which we did not.
This is a puzzle. The start of a solution is suggested by adding a post-hoc exclusion. The original study placed no limit on duration, while the present study excluded only extreme positive outliers who took over an hour to complete the study (n = 4; median duration was 7:37). However, a single participant (n = 1) in the CRT-First condition took just 95 seconds to complete the entire survey. This participant had a perfect CRT score of 5 and a high mean moral acceptability rating of 5.8, and reported being “definitely familiar” with both the CRT questions and the moral dilemmas.
It is likely this participant did not engage earnestly with the experiment. Furthermore, their responses likely inflated the observed within-group correlation between CRT scores and utilitarian judgements in the direction predicted by the secondary hypothesis. To test this, all participants who took less than 2 minutes on the survey were excluded in a post-hoc analysis (n = 1 in the CRT-First condition, n = 3 overall).
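A sketch of this exclusion, continuing the hypothetical data frame from the earlier example and assuming duration was recorded in seconds:

```python
# Post-hoc exclusion of completions under 2 minutes (column names illustrative)
from scipy import stats

kept = df[df["duration_seconds"] >= 120]

crt_first_kept = kept[kept["condition"] == "CRT-First"]
r, p = stats.pearsonr(crt_first_kept["crt_score"],
                      crt_first_kept["mean_acceptability"])
print(f"CRT-First after exclusion: r = {r:.2f}, p = {p:.2f}")
```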
Surprisingly, excluding just this one participant caused the partial replication to vanish: the correlation between CRT scores and moral acceptability ratings in the CRT-First condition became nonsignificant (r = .21, p = .07). As before, there was no significant correlation in the Dilemmas-First condition.
We can conclude that while there appears to be a slight positive association between CRT proficiency and the strength of one’s utilitarian judgements, it is neither robust nor statistically significant.
Further exploratory analyses were conducted on participants’ familiarity with the experimental materials. Familiarity was a concern of the first replication, and consequently the present rescue aimed to reduce it. This was only moderately successful: 37% of participants self-reported being “unfamiliar” with the CRT questions, while 47% were “somewhat familiar” and 16% were “definitely familiar”. The 63% who were either “somewhat” or “definitely” familiar represent an improvement on Fereday’s (2019) replication, in which 77% reported being either “somewhat” or “definitely” familiar.
It is possible that these levels of familiarity interfered with the targeted manipulation; perhaps participants in the CRT-First group who were familiar with the questions were not induced into an appropriate reflective state. Familiarity with CRT questions had a significant positive correlation with CRT performance (r = .25, p = .001), but no significant relationship with moral acceptability scores (r = .11, p = .16).
The present rescue also investigated how familiar participants were with the three moral dilemmas (an improvement on both the original and the first replication). Participants’ familiarity with the moral dilemmas showed a similar distribution to their familiarity with the CRT questions.
Interestingly, previous exposure to the moral dilemmas appears to have important implications for moral judgements and CRT performance; indeed, familiarity with the dilemmas appears more influential than familiarity with the CRT. Familiarity with the moral dilemmas showed a strong, significant positive correlation with utilitarian judgements (r = .30, p = .0001). That is, the more familiar participants were with the moral dilemmas, the more likely they were to produce utilitarian responses. It is possible that familiarity alone is a major explanation of the pattern of moral judgements across both conditions. Familiarity with the dilemmas also had a strong, significant positive correlation with CRT scores (r = .33, p = .0001).
In the exploratory analysis, I performed the two main statistical tests on two subgroups: participants who were completely unfamiliar with both the CRT questions and the moral dilemmas, and participants who were “relatively” unfamiliar with both (reporting being either “unfamiliar” or only “somewhat familiar”). Even though these analyses were moderately underpowered, if familiarity were interfering with the experimental manipulation, we should expect to see patterns closer to the significant observations of the original experiment.
However, all of these analyses returned nonsignificant results. For example, a t-test on participants who were relatively unfamiliar with the CRT (n = 131) showed no significant difference between the moral acceptability ratings of the CRT-First and Dilemmas-First conditions (p = .63), and the same test on participants who were completely unfamiliar (n = 33) was also nonsignificant (p = .56). Together, these results suggest that familiarity with either the CRT or the dilemmas was not interfering with the main experimental manipulation.
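A sketch of the CRT-familiarity subgroup test, again assuming the hypothetical data frame and a self-report column with levels “unfamiliar”, “somewhat familiar”, and “definitely familiar”:

```python
# Subgroup t-tests by self-reported CRT familiarity (columns illustrative)
from scipy import stats

subgroups = {
    "completely unfamiliar": df["crt_familiarity"] == "unfamiliar",
    "relatively unfamiliar": df["crt_familiarity"].isin(
        ["unfamiliar", "somewhat familiar"]),
}
for label, mask in subgroups.items():
    sub = df[mask]
    t, p = stats.ttest_ind(
        sub.loc[sub["condition"] == "CRT-First", "mean_acceptability"],
        sub.loc[sub["condition"] == "Dilemmas-First", "mean_acceptability"])
    print(f"{label} (n = {len(sub)}): t = {t:.2f}, p = {p:.2f}")
```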
Finally, two more exploratory analyses are of note. First, the pre-exclusion data (n = 241) were analyzed using the same tests. Both main statistical tests were again nonsignificant.
Second, a slight but significant negative relationship between duration and moral acceptability was observed: participants who took longer on the survey gave weaker utilitarian evaluations (r = -.16, p = .04). Intuitively, reflection is an activity requiring time, so one might expect longer durations to be associated with stronger reflection and therefore with stronger utilitarian responses. This contrary result must be reconciled with Paxton et al.’s (2012) claim that greater reflection produces stronger utilitarian judgements.
I will close by speculating on possible conclusions to draw from this failed rescue, which in turn suggest future directions. One explanation for the failures to replicate the findings of Experiment 1 in Paxton et al. (2012) concerns sweeping familiarity: perhaps current participants are simply too familiar with the CRT and its variants, and with trolley-problem-style moral dilemmas and their variants, for the relevant experimental manipulations and measurements to be effective and accurate, respectively. This explanation preserves the integrity of the original study, but suggests that present replications are doomed to fail. To address this, future experiments should adopt different methods and materials to manipulate and measure the constructs of interest. However, such paradigms must justify why the alternate materials should be interpreted as playing the same functional roles as the CRT and moral dilemmas in Paxton et al. (2012), which is not easy to do.
Another, more critical explanation takes issue with the background theoretical architecture of the original study. Paxton et al.’s (2012) hypotheses are derived from an intricate web of theories and empirical results, which in turn lean on several assumptions and argumentative moves, all made quickly early in the original paper. Of course, while the original experiment was successful, the choice and defense of the hypotheses could be accepted without too much scrutiny. But the growing pattern of failed replications and rescues calls for a critical re-examination of the background assumptions and arguments motivating the original hypotheses.
There are in fact four distinct assumptions guiding Paxton et al.’s (2012) predictions.
First, moral judgements tend to be initially automatic and unconscious (this is supported by Greene’s dual-process picture).
Second, cognitive reflection can override initial automatic judgements (whatever those judgements happen to be).
Third – and importantly – people’s initial unconscious moral judgements tend towards the deontological (i.e., respecting the rights of persons, and therefore not sacrificing one to save many).
Fourth, cognitive reflection pushes moral judgements towards the utilitarian (i.e., maximising overall good, and therefore sacrificing one to save many). This last assumption is, of course, the main experimental test; the rest are background.
Paxton et al. (2012) weave these assumptions together without much justification or clarity. To be clear, I am not saying the original authors are unjustified in assuming or predicting the above; I am saying that none of these assumptions is separately motivated or transparently justified in the original paper. Any of them could be wrong. The failed replications might therefore be detecting an error in the fourth assumption, or in any of the first three; assuming the failures are explained by a problem in the background assumptions, it is not clear from the present vantage point where the error lies – it could be in any single assumption, or in some combination.
For example, it might be true that initial moral judgements are made automatically and emotionally, and that cognitive reflection can override these initial judgements; yet perhaps the initial automatic judgements tend towards the utilitarian, and the overriding reflection pushes in the deontological direction. More likely, there may be substantial individual variation in people’s default moral attitudes, or context-specific effects on how moral attitudes are manipulated.
An updated literature review covering assumptions one through four would help assess the background reasoning of Paxton et al. (2012) against contemporary empirical data.