Replication of Reflection and Reasoning in Moral Judgment (2012, Cognitive Science)

Introduction

Despite the work of Dr. Joshua Greene and others, it remains unclear how intuition differs from hot cognition and whether faster response times actually elicit these dispositions in response to moral questions. Does a slower response time actually control for the effects of intuition? This study epitomizes the approach that leading researchers in moral psychology, such as Dr. Greene and Dr. Fiery Cushman, have taken to understand morality. It also illustrates why Dr. Jonathan Haidt doesn’t fully fit into that camp and why others, like Dr. William Damon, think it may be an inauthentic measurement of morality. Personally, I’m interested in how transcendent views of self or commitments (like a sense of purpose) influence moral decision making. Part of that work will involve developing a better understanding of how emotion, reasoning, and intuition affect the valuation of possible responses in moral situations. Replicating the first experiment in this study and familiarizing myself with its methodology will acquaint me with the field and with the expectations for research on moral decision making. I also hope to get a better idea of how to conduct my own experiments in the future.

Participants recruited through Mechanical Turk will be presented with the Cognitive Reflection Test (CRT) either before or after being presented with three moral dilemmas on the survey platform Qualtrics. The anticipated result is that participants who complete the CRT before the moral dilemmas will give more utilitarian responses than those who complete it afterward. I would also like to add the timing component of the second experiment the team performed, tools permitting; this would be significantly harder unless I can find that functionality within Qualtrics. Having never used Mechanical Turk before, I also wonder how demographically similar my sample will be to the one used in the original experiment. Does Mechanical Turk attract a certain kind of survey responder? The original study was also performed eight years ago, so it’s quite possible that results will differ because of that gap.

Link to project repository: https://github.com/psych251/paxton2012.git
Link to original paper: https://onlinelibrary.wiley.com/doi/full/10.1111/j.1551-6709.2011.01210.x
Link to Qualtrics survey: https://stanforduniversity.qualtrics.com/jfe/form/SV_5cdcPGasiw154i1

Methods

Power Analysis

The original effect size was d = 0.43. Given this effect size and the more conservative two-sided alternative, the following sample sizes are required for 80, 90, and 95 percent power:

For 80% power: n = 86
For 90% power: n = 115
For 95% power: n = 142

In light of the original paper’s 39% exclusion rate, 80% power would require recruiting 141 (140.98) participants.
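The figures above can be approximately reproduced with base R's `power.t.test()`, shown here as a minimal sketch rather than the exact procedure used for the plan (the analysis code later in this report calls `pwr.t.test()` instead). Both functions report n per group. The variable names are illustrative, and the 39% exclusion rate is the one quoted above.

```r
# Required n for a two-sample, two-sided t-test at d = 0.43.
# With sd = 1, delta is the standardized effect size d.
d <- 0.43
power_levels <- c(0.80, 0.90, 0.95)

n_per_group <- sapply(power_levels, function(p) {
  power.t.test(delta = d, sd = 1, sig.level = 0.05, power = p,
               type = "two.sample", alternative = "two.sided")$n
})
names(n_per_group) <- paste0(power_levels * 100, "%")

# Round up to whole participants
n_required <- ceiling(n_per_group)

# Inflate for the original paper's 39% exclusion rate
exclusion_rate <- 0.39
n_to_recruit <- n_required / (1 - exclusion_rate)

print(n_required)
print(round(n_to_recruit, 2))
```

Dividing by one minus the exclusion rate gives the number of responses to collect so that the expected number of retained participants still meets the target.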

Planned Sample

The planned sample consists of 144 participants recruited through Mechanical Turk, based on a power analysis of the statistics from the original paper.

Materials

The original Cognitive Reflection Test questions are below as well as the rewritten version of the CRT. I also added a fourth question from a set of questions another team of researchers used as a substitute for the CRT. Overuse of the CRT on Mechanical Turk necessitates new versions of the questions.

Cognitive Reflection Test (CRT) questions, quoted from Frederick (2005) and referenced in the original article:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? _____ cents
If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? _____ minutes
In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? _____ days

Rewritten CRT Questions:

A book and a pencil cost $1.20 in total. The book costs $1.00 more than the pencil. How much does the pencil cost? _____ cents.
If it takes 10 programmers 10 minutes to make 10 improvements, how long would it take 50 programmers to make 50 improvements? _____ minutes
There is a grasshopper crossing a road. With every jump, the distance the grasshopper jumps doubles. If it takes 26 jumps for the grasshopper to cross the entire road, how many jumps would it take for the grasshopper to make it halfway across? _____ jumps.
A farmer had 15 sheep and all but 8 died. How many are left? _____ sheep.

Three high-conflict moral dilemmas, presented in randomized order either before or after the CRT questions:

“John is the captain of a military submarine traveling underneath a large iceberg. An onboard explosion has caused the vessel to lose most of its oxygen supply and has injured a crewman who is quickly losing blood. The injured crewman is going to die from his wounds no matter what happens. The remaining oxygen is not sufficient for the entire crew to make it to the surface. The only way to save the other crew members is for John to shoot dead the injured crewman so that there will be just enough oxygen for the rest of the crew to survive.

Enemy soldiers have taken over Jane’s village. They have orders to kill all remaining civilians. Jane and some of her townspeople have sought refuge in the cellar of a large house. Outside they hear the voices of soldiers who have come to search the house for valuables. Jane’s baby begins to cry loudly. She covers his mouth to block the sound. If she removes her hand from his mouth, his crying will summon the attention of the soldiers, who will kill her, her child, and the others hiding out in the cellar. To save herself and the others, she must smother her child to death.

A runaway trolley is heading down the tracks toward five railway workmen, who will be killed if the trolley proceeds on its present course. Jane is on a footbridge over the tracks, in between the approaching trolley and the five workmen. Next to her on this footbridge is a lone railway workman, who happens to be wearing a large, heavy backpack. The only way to save the lives of the five workmen is for Jane to push the lone workman off the bridge and onto the tracks below, where he and his large backpack will stop the trolley. The lone workman will die if Jane does this, but the five workmen will be saved.”

The exact wording of the questions above was used in this replication.

Procedure

Quoted from original article:

“Subjects were randomly assigned to complete the CRT either before (CRT-First condition) or after (Dilemmas-First condition) responding to the dilemmas. Subjects evaluated the moral acceptability of the utilitarian action with a binary response (YES ⁄ NO), followed by a rating scale (1 = Completely Unacceptable, 7 = Completely Acceptable). No time limits were imposed on responses. Subjects completed the CRT questions and read and responded to the dilemmas at their own pace. Subjects subsequently completed a brief set of demographic questions.”

This replication followed this exact procedure.

Analysis Plan

Exclude subjects who didn’t answer at least one question on the CRT correctly (check that the proportion of subjects across conditions is maintained after exclusion).
Exclude subjects who did not pass the attention check.
Calculate each individual’s CRT score by assigning a 0 for each incorrect answer and a 1 for each correct answer, then summing the item scores.
Calculate Cronbach’s alpha to determine reliability across moral dilemmas.
Collapse each subject’s moral acceptability ratings into an average moral acceptability rating for that subject.
Linear regression of CRT-First condition on utilitarian moral judgments.
Main statistical test: between-subjects t-test of the effect of the CRT-First condition on individual acceptability ratings.
Test the correlation between CRT score and acceptability rating among subjects in the Dilemmas-First condition to rule out variation due to trait reflectiveness. Confirm with a Fisher r-to-z test.
The control condition (Dilemmas-First condition) didn’t include a task prior to the dilemmas, which could give credence to the objection that the reported effects of the CRT are merely effects of performing a non-specific task. To address this, calculate the correlation within the CRT-First condition, then compare it to the correlation observed in the Dilemmas-First condition.
Regress the CRT scores across the two conditions to address the objection that completing the dilemmas first would influence subsequent CRT performance.
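As an illustration of the reliability step in this plan, Cronbach's alpha can be computed directly from the item variances and the variance of the summed score. The sketch below is a hypothetical helper with made-up ratings, not the code used for the reported results (the analysis itself uses `alpha()` from the psych package); the column names simply mirror those used later in the report.

```r
# Cronbach's alpha from an items table (rows = subjects, columns = items):
# alpha = k / (k - 1) * (1 - sum of item variances / variance of the total score)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_vars <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Made-up 1-7 acceptability ratings for five subjects on three dilemmas
toy_ratings <- data.frame(
  d_1a = c(5, 2, 6, 3, 4),
  d_2a = c(4, 1, 6, 2, 5),
  d_3a = c(5, 2, 7, 3, 3)
)

toy_alpha <- cronbach_alpha(toy_ratings)
print(round(toy_alpha, 2))
```

A value near or above .7 is the threshold this report treats as adequate for collapsing the three ratings into one mean score.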

Differences from Original Study

While both the original study and this replication use Mechanical Turk, results may differ because of the time elapsed between the studies.
The CRT questions were rewritten and one new question was added.
An attention check was built into the survey.
Because the original paper did not give enough information about the specific question wording, I created personalized (the character’s name is used) and depersonalized (the character’s name is not used) versions of the dilemmas, which are randomly assigned to participants. I added these two sub-conditions to investigate the relationship between personalization and mean acceptability rating.

Methods Addendum (Post Data Collection)

The one significant modification, made at the behest of the teaching team, was the use of an attention check in addition to the CRT questions. Though none was noted in the original paper, it seemed a prudent choice, as the CRT questions did nothing to ensure that participants read the dilemmas sincerely.

Actual Sample

The original study collected responses from 150 Mechanical Turk participants and, after exclusions, performed the analyses on 92 observations. This replication received survey responses from 141 participants and retained 82 for analysis after the CRT exclusion and the additional attention check exclusion. After exclusions there were 29 females (35%) and 53 males (65%). The average participant age was 39.22 years.

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Load Relevant Libraries and Functions

library(tidyverse)

library(ggplot2)
library(stringr)
library(dplyr)
library(ggthemes)
library(umx) #For Cronbach's Alpha

library(styler)
library(lintr)
library(qualtRics)

library(psych)

library(pwr)
library(knitr)
library(kableExtra)

raw_data <- readSurvey("/Users/Brendan/Documents/GitHub/paxton2012/data.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   IPAddress = col_character(),
##   RecordedDate = col_datetime(format = ""),
##   ResponseId = col_character(),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   DistributionChannel = col_character(),
##   UserLanguage = col_character(),
##   Q18_3_TEXT = col_logical(),
##   Q20 = col_number(),
##   Q20_7_TEXT = col_logical(),
##   FL_10_DO = col_character(),
##   FL_9_DO = col_character(),
##   FL_21_DO = col_character(),
##   FL_22_DO = col_character(),
##   FL_23_DO = col_character(),
##   CRT_DO = col_character()
## )

## See spec(...) for full column specifications.

# Selects relevant columns, renames them, and recodes the numeric gender variable as strings
clean_data <-raw_data %>% 
  select(Q1, Q2, Q3, Q26, Q9, Q10, Q34, Q36, Q38, Q40, Q46, Q48, Q12, Q13, Q15, Q16, Q42, Q44, Q17, Q18, Q18_3_TEXT, Q19, Q20, Q20_7_TEXT, Q21, Q28, `Random ID`, FL_10_DO, FL_9_DO, FL_21_DO, FL_22_DO, FL_23_DO, "Duration (in seconds)") %>%   
  rename(crt_order = FL_10_DO, dilemma_order = FL_9_DO, education = Q21, gender = Q18, crt1 = Q1, crt2 = Q2, crt3 = Q3, age = Q17, crt4 = Q26, d1 = Q9, d1a = Q10, d1d = Q34, d1ad = Q36, d2 = Q12, d2a = Q13, exposure = Q28, d2d = Q38, d2ad = Q40, attention_check_1 = Q46, attention_check_2 = Q48, d3 = Q15, d3a = Q16, d3d = Q42, d3ad = Q44, ethnicity = Q20, duration = "Duration (in seconds)") %>%
  mutate(gender = case_when(
    gender == 1 ~ "male",
    gender == 2 ~ "female",
    gender == 3 ~ "other",
    TRUE ~ as.character(gender)
  ))

# Scores each CRT item as 1 (correct) or 0 (incorrect), then creates total and mean CRT scores.
# Answers to crt1 below 10 are treated as dollar amounts and converted to cents.
# Each item is scored with a single comparison so that a raw answer of 1 is not counted as correct.
clean_data <- clean_data %>%
  mutate(
    crt1 = ifelse(crt1 < 10, crt1 * 100, crt1),
    crt1 = as.numeric(crt1 == 10),
    crt2 = as.numeric(crt2 == 10),
    crt3 = as.numeric(crt3 == 25),
    crt4 = as.numeric(crt4 == 8),
    crt_total = crt1 + crt2 + crt3 + crt4,
    mean_crt = crt_total / 4
  )
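The scoring step can also be wrapped in a small helper so that each item is compared against its answer key exactly once. The sketch below is illustrative only: `score_crt()`, its `dollars_ok` argument, and the toy responses are hypothetical names and data, not part of the original analysis.

```r
# Hypothetical helper: score one CRT item as 1 (correct) or 0 (incorrect).
# A single comparison against the key avoids treating a raw answer of 1 as "correct".
score_crt <- function(response, correct, dollars_ok = FALSE) {
  # Optionally treat answers below 10 as dollar amounts and convert them to cents
  if (dollars_ok) {
    response <- ifelse(!is.na(response) & response < 10, response * 100, response)
  }
  # Small tolerance guards against floating-point error after the conversion
  as.numeric(abs(response - correct) < 1e-8)
}

# Made-up responses for illustration
toy <- data.frame(
  crt1 = c(10, 0.10, 20, 1),  # answer key: 10 cents
  crt2 = c(10, 50, 1, NA)     # answer key: 10 minutes
)

toy$crt1_score <- score_crt(toy$crt1, correct = 10, dollars_ok = TRUE)
toy$crt2_score <- score_crt(toy$crt2, correct = 10)
print(toy)
```

Missing responses stay `NA` rather than being scored as incorrect, which matches the later `na.omit()` step.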

# Creates one column from the two conditions for binary judgment
clean_data <- clean_data %>% 
  mutate(d_1 = coalesce(d1, d1d)) %>% 
  mutate(d_2 = coalesce(d2, d2d)) %>% 
  mutate(d_3 = coalesce(d3, d3d))

# Creates one column from the two conditions for moral acceptability
clean_data <- clean_data %>% 
  mutate(d_1a = coalesce(d1a, d1ad)) %>% 
  mutate(d_2a = coalesce(d2a, d2ad)) %>% 
  mutate(d_3a = coalesce(d3a, d3ad))

# Creates a new column to capture the attention check response by coalescing the attention checks from the two conditions
clean_data <- clean_data %>%   
  mutate(attention_check = coalesce(attention_check_1, attention_check_2)) %>% 
  select(attention_check, everything(), -attention_check_1, -attention_check_2, -d1a, -d1ad, -d2a, -d2ad, -d3a, -d3ad, -d1, -d1d, -d2, -d2d, -d3, -d3d)

# Creates new column showing mean acceptability
clean_data <- clean_data %>% 
  mutate(mean_acceptability = (d_1a + d_2a + d_3a) / 3)

# Recodes the binary moral judgments from 1/2 coding to 1/0
clean_data <- clean_data %>%
  mutate(
    d_1 = ifelse(d_1 == 2, 0, d_1),
    d_2 = ifelse(d_2 == 2, 0, d_2),
    d_3 = ifelse(d_3 == 2, 0, d_3)
  )

# Creates new column showing mean binary judgment
clean_data <- clean_data %>% 
  mutate(mean_judgment = (d_1 + d_2 + d_3) / 3)

# Removes unnecessary columns and reorders the rest
clean_data <- clean_data %>% 
  select(crt_order, crt_total, mean_crt, mean_acceptability, mean_judgment,  everything(), -Q18_3_TEXT, -Q20_7_TEXT)

# Removes observations with NA values
clean_data <- clean_data %>%
  na.omit()

# Recodes the attention check as 1 (passed, response == 4) or 0 (failed)
clean_data <- clean_data %>%
  mutate(attention_check = as.numeric(attention_check == 4))

Filters for exclusions

# Excludes subjects who didn't score at least 1 on the CRT
data_1 <- clean_data %>%
  filter(crt_total > 0)

# Excludes subjects who didn't pass the attention check
data_1 <- data_1 %>%
  filter(attention_check == 1)

# Excludes subjects who took less than two minutes (not used in the analyses below)
data_1_duration_exclusion <- data_1 %>%
  filter(duration >= 120)

Confirmatory analysis

# Calculates Cronbach's alpha for inter-dilemma reliability to justify a single moral acceptability score
data_1 %>%
  ungroup() %>%
  select(d_1a, d_2a, d_3a) %>%
  alpha(check.keys = TRUE)

## 
## Reliability analysis   
## Call: alpha(x = ., check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N   ase mean  sd median_r
##       0.68      0.68    0.61      0.42 2.1 0.063  3.6 1.4     0.42
## 
##  lower alpha upper     95% confidence boundaries
## 0.55 0.68 0.8 
## 
##  Reliability if an item is dropped:
##      raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## d_1a      0.43      0.43    0.27      0.27 0.76    0.126    NA  0.27
## d_2a      0.71      0.71    0.55      0.55 2.46    0.064    NA  0.55
## d_3a      0.59      0.59    0.42      0.42 1.45    0.090    NA  0.42
## 
##  Item statistics 
##       n raw.r std.r r.cor r.drop mean  sd
## d_1a 82  0.83  0.84  0.74   0.61  4.3 1.8
## d_2a 82  0.73  0.72  0.48   0.39  3.4 1.9
## d_3a 82  0.78  0.78  0.62   0.48  3.1 1.8
## 
## Non missing response frequency for each item
##         1    2    3    4    5    6    7 miss
## d_1a 0.15 0.02 0.09 0.22 0.27 0.17 0.09    0
## d_2a 0.29 0.09 0.09 0.16 0.27 0.07 0.04    0
## d_3a 0.29 0.13 0.21 0.10 0.15 0.09 0.04    0

## Chi-squared Test for Pre and Post Exclusion Data

# Counts subjects per condition before exclusions
original <- clean_data %>%
  group_by(crt_order) %>%
  summarise(n = n())

# Counts subjects per condition after the CRT exclusion
excluded <- clean_data %>%
  filter(crt_total > 0) %>%
  group_by(crt_order) %>%
  summarise(n = n())

# Joins the two data frames
original_excluded <- inner_join(original, excluded, by = "crt_order")
original_excluded <- original_excluded %>% 
  select(n.x, n.y, -crt_order)

# Chisquared Test
chisq.test(original_excluded)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  original_excluded
## X-squared = 0.22801, df = 1, p-value = 0.633
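The same test can be run directly on the condition counts quoted in the Discussion, which also covers the second comparison (CRT plus attention check exclusions) that is reported there but not shown in the code. This is a minimal sketch; the object names are illustrative, and the counts are copied from the Discussion text.

```r
# Subjects per condition (CRT-first, dilemmas-first), from the counts in the Discussion
pre_exclusion <- c(crt_first = 74, dilemmas_first = 141 - 74)
post_crt      <- c(crt_first = 68, dilemmas_first = 121 - 68)
post_crt_attn <- c(crt_first = 50, dilemmas_first = 82 - 50)

# 2 x 2 tables: condition by (before vs. after exclusion).
# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default.
chisq_crt  <- chisq.test(cbind(pre_exclusion, post_crt))
chisq_both <- chisq.test(cbind(pre_exclusion, post_crt_attn))

print(chisq_crt)
print(chisq_both)
```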

# (Main statistical test) t-test between mean acceptability and crt order
main_t <- data_1 %>% 
  t.test(data = ., mean_acceptability ~ crt_order, var.equal = TRUE)
main_t

## 
##  Two Sample t-test
## 
## data:  mean_acceptability by crt_order
## t = -0.35568, df = 80, p-value = 0.723
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.7666815  0.5341815
## sample estimates:
## mean in group CRT|FL_9 mean in group FL_9|CRT 
##                3.54000                3.65625

#Shows the count for each condition that can then be entered into the calculation for Cohen's d below
condition_counts <- data_1 %>% 
  group_by(crt_order) %>% 
  count()
kable(condition_counts) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

crt_order	n
CRT\|FL_9	50
FL_9\|CRT	32

# Calculating Cohen's d for effect size. (Manually calculated as I was not able to make either of these work effectively)
# t2d (main_t$statistic, 50, 32)
# cohen.d(data_1[mean_acceptability, crt_order, drop = FALSE])
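One way to obtain Cohen's d without the helper functions above is to compute it by hand in base R. The sketch below uses two standard routes: converting the pooled-variance t statistic, and dividing the mean difference by the pooled standard deviation. The inputs are copied from the t-test output and condition counts above and from the standard deviations reported later in the exploratory analyses; the variable names are illustrative.

```r
# Values copied from the report's output
t_stat <- -0.35568
n_crt_first <- 50
n_dilemmas_first <- 32

# Route 1: Cohen's d from a pooled-variance two-sample t statistic
cohens_d <- t_stat * sqrt(1 / n_crt_first + 1 / n_dilemmas_first)

# Route 2: mean difference over the pooled SD (means and rounded SDs from the report)
m1 <- 3.54
m2 <- 3.65625
s1 <- 1.45
s2 <- 1.44
pooled_sd <- sqrt(((n_crt_first - 1) * s1^2 + (n_dilemmas_first - 1) * s2^2) /
                    (n_crt_first + n_dilemmas_first - 2))
cohens_d_check <- (m1 - m2) / pooled_sd

print(round(c(from_t = cohens_d, from_means = cohens_d_check), 3))
```

Both routes give a magnitude of about .08, in line with the effect size reported in the Discussion.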

# Creates a data frame containing only the CRT-first observations
data_crt_first <- data_1 %>%
  filter(crt_order == "CRT|FL_9")

# (Main Statistical Test) Pearson's r calculation correlating CRT total score and mean acceptability for CRT first observations
crt_first_r <- cor.test(data_crt_first$mean_acceptability, data_crt_first$crt_total)
(crt_first_r)

## 
##  Pearson's product-moment correlation
## 
## data:  data_crt_first$mean_acceptability and data_crt_first$crt_total
## t = 1.0144, df = 48, p-value = 0.3155
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1390912  0.4068091
## sample estimates:
##      cor 
## 0.144865

# Creates a data frame containing only the dilemmas-first observations
data_dilemma_first <- data_1 %>%
  filter(crt_order == "FL_9|CRT")

# (Main Statistical Test) Pearson's r calculation correlating CRT total score and mean acceptability for dilemma first observations
dilemma_first_r <- cor.test(data_dilemma_first$mean_acceptability, data_dilemma_first$crt_total)

# Scatterplot with jitter and point color as gender and size of point as duration to match original study's figure 1
ggplot(data = data_crt_first, aes(x = jitter(crt_total), y = mean_acceptability)) +
  geom_point(aes(color = gender, size = duration)) + 
  stat_smooth(method = 'lm') +
  labs(x = "Number of Correct CRT Items", y = "Mean Moral Acceptability Rating", title = "Mean Acceptability on CRT Total")

# Fisher r-to-z test of the difference between the CRT-first and dilemmas-first correlation coefficients
crt_cor <- cor(data_crt_first$crt_total, data_crt_first$mean_acceptability)

dilemma_cor <- cor(data_dilemma_first$crt_total, data_dilemma_first$mean_acceptability)

paired.r(crt_cor, dilemma_cor, NULL, length(data_crt_first$crt_order), length(data_dilemma_first$crt_order),twotailed=TRUE)

## Call: paired.r(xy = crt_cor, xz = dilemma_cor, yz = NULL, n = length(data_crt_first$crt_order), 
##     n2 = length(data_dilemma_first$crt_order), twotailed = TRUE)
## [1] "test of difference between two independent correlations"
## z = 0.47  With probability =  0.64
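For reference, the comparison of two independent correlations can be written out by hand: each r is converted with Fisher's transformation (`atanh`), and the difference is divided by its standard error. The sketch below is a hypothetical helper, not the `paired.r()` implementation. The dilemmas-first r of .03 is the rounded value from the Discussion, so the result should land close to, but not exactly on, the z printed above.

```r
# Fisher r-to-z test for the difference between two independent correlations
fisher_r_to_z_test <- function(r1, n1, r2, n2) {
  z1 <- atanh(r1)                          # Fisher transformation of r1
  z2 <- atanh(r2)                          # Fisher transformation of r2
  se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
  z <- (z1 - z2) / se
  p <- 2 * pnorm(-abs(z))                  # two-tailed p-value
  list(z = z, p = p)
}

# CRT-first: r and n from the output above.
# Dilemmas-first: r rounded to .03 as reported in the Discussion, n = 32.
result <- fisher_r_to_z_test(r1 = 0.144865, n1 = 50, r2 = 0.03, n2 = 32)
print(lapply(result, round, 2))
```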

# t-test of CRT total score by CRT order, to respond to the concern that presenting the dilemmas first influenced subsequent CRT performance
clean_data %>% 
  t.test(data = ., crt_total ~ crt_order, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  crt_total by crt_order
## t = 1.3332, df = 139, p-value = 0.1847
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1598839  0.8218443
## sample estimates:
## mean in group CRT|FL_9 mean in group FL_9|CRT 
##               2.689189               2.358209

Exploratory analyses

# Power calculation in R (does not agree with G*Power)
pwr.t.test(d = 0.43, n=NULL, sig.level=0.05, power = 0.95)

## 
##      Two-sample t test power calculation 
## 
##               n = 141.527
##               d = 0.43
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

# Correlation between duration and acceptability
cor.test(x=clean_data$mean_acceptability, y=clean_data$duration, method = 'pearson')

## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$mean_acceptability and clean_data$duration
## t = -0.030623, df = 139, p-value = 0.9756
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1678375  0.1627846
## sample estimates:
##          cor 
## -0.002597417

# t-test between mean acceptability and attention check
clean_data %>% 
  t.test(data = ., mean_acceptability ~ attention_check, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  mean_acceptability by attention_check
## t = 4.9252, df = 139, p-value = 2.357e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.7085764 1.6590410
## sample estimates:
## mean in group 0 mean in group 1 
##        4.820513        3.636704

# t-test between mean judgment and attention check
clean_data %>% 
  t.test(data = ., mean_judgment ~ attention_check, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  mean_judgment by attention_check
## t = 2.9069, df = 139, p-value = 0.00425
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05694606 0.29914729
## sample estimates:
## mean in group 0 mean in group 1 
##       0.7435897       0.5655431

# t-test of mean acceptability comparing participants who were excluded for getting no CRT questions correct with those who answered at least one correctly
clean_data %>% 
  t.test(data = ., mean_acceptability ~ crt_total == 0, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  mean_acceptability by crt_total == 0
## t = -2.863, df = 139, p-value = 0.004847
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.6943252 -0.3100826
## sample estimates:
## mean in group FALSE  mean in group TRUE 
##            3.931129            4.933333

# Correlation between duration and moral acceptability rating
cor.test(clean_data$duration, clean_data$mean_acceptability)

## 
##  Pearson's product-moment correlation
## 
## data:  clean_data$duration and clean_data$mean_acceptability
## t = -0.030623, df = 139, p-value = 0.9756
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1678375  0.1627846
## sample estimates:
##          cor 
## -0.002597417

# t-test between duration and attention check
clean_data %>% 
  t.test(data = ., duration ~ attention_check, var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  duration by attention_check
## t = 0.64989, df = 139, p-value = 0.5168
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -110.5381  218.7866
## sample estimates:
## mean in group 0 mean in group 1 
##        343.5962        289.4719

# Scatterplot with jitter and point color as gender and size as duration to show self-assessed previous exposure to the questions effect on acceptability ratings
ggplot(data = clean_data, aes(x = jitter(exposure), y = mean_acceptability)) +
  geom_point(aes(color = gender, size = duration)) + 
  stat_smooth(method = 'lm') +
  labs(x = "Previous Exposure", y = "Mean Moral Acceptability Rating", title = "Previous Exposure on Mean Acceptability")

# Correlation between age and moral acceptability rating
cor.test(data_1$age, data_1$mean_acceptability)

## 
##  Pearson's product-moment correlation
## 
## data:  data_1$age and data_1$mean_acceptability
## t = -2.1984, df = 80, p-value = 0.03081
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.43325202 -0.02286294
## sample estimates:
##        cor 
## -0.2386858

# Scatterplot with jitter and point color as gender and size as duration to show age on acceptability ratings
ggplot(data = data_1, aes(x = jitter(age), y = mean_acceptability)) +
  geom_point(aes(color = gender, size = duration)) + 
  stat_smooth(method = 'lm') +
  labs(x = "Age", y = "Mean Moral Acceptability Rating", title = "Mean Acceptability on Age")

# Box plot to show how crt total correct relates to age
ggplot(data = data_1, mapping = aes(group = crt_total, x = crt_total, y = age)) + geom_boxplot()

# Box plot for crt_order and mean_acceptability
ggplot(data = data_1, mapping = aes(x = crt_order, y = mean_acceptability)) +
   geom_boxplot()

# Box plot for duration and attention check
ggplot(data = clean_data, mapping = aes(group = attention_check, x = attention_check, y = duration)) +
   geom_boxplot()

# Means of duration by attention check
clean_data %>% 
group_by(attention_check) %>%
  summarise(duration = mean(duration))

## # A tibble: 2 x 2
##   attention_check duration
##             <dbl>    <dbl>
## 1               0     344.
## 2               1     289.

# Shows mean acceptability by gender
data_1 %>% 
group_by(gender) %>%
  summarise(mean_acceptability = mean(mean_acceptability))

## # A tibble: 2 x 2
##   gender mean_acceptability
##   <chr>               <dbl>
## 1 female               3.11
## 2 male                 3.84

# Mean CRT score by gender
clean_data %>% 
group_by(gender) %>%
  summarise(crt_total = mean(crt_total))

## # A tibble: 2 x 2
##   gender crt_total
##   <chr>      <dbl>
## 1 female      2.02
## 2 male        2.81

# Mean judgment by gender
data_1 %>% 
group_by(gender) %>%
  summarise(mean_judgment = mean(mean_judgment))

## # A tibble: 2 x 2
##   gender mean_judgment
##   <chr>          <dbl>
## 1 female         0.437
## 2 male           0.585

# Standard deviation of mean acceptability by CRT order
data_1 %>% 
  group_by(crt_order) %>% 
  summarise(sd_acceptability = sd(mean_acceptability))

## # A tibble: 2 x 2
##   crt_order sd_acceptability
##   <chr>                <dbl>
## 1 CRT|FL_9              1.45
## 2 FL_9|CRT              1.44

# ANOVA for mean acceptability and CRT order
data_1 %>%
  aov(mean_acceptability ~ crt_order, data = ., projections = TRUE)

## Call:
##    aov(formula = mean_acceptability ~ crt_order, data = ., projections = TRUE)
## 
## Terms:
##                 crt_order Residuals
## Sum of Squares    0.26369 166.74986
## Deg. of Freedom         1        80
## 
## Residual standard error: 1.443736
## Estimated effects may be unbalanced

Discussion

Summary of Replication Attempt

While this replication agreed with the original on the internal consistency of the dilemmas and also supported the need to exclude participants based on CRT results (as shown by this study’s exploratory analysis), it ultimately failed to replicate the original Paxton, Ungar, and Greene study.

I found inter-dilemma reliability similar to that reported by the original authors (Cronbach’s α = .71), justifying a collapse of the acceptability responses from the three dilemmas into one score (pre-exclusions: Cronbach’s α = .75; post-exclusions: Cronbach’s α = .68). The original paper did not specify whether this analysis was performed on pre- or post-exclusion data, so I performed the test on both samples and found that both were adequate (near .7 or higher) to justify the collapse.

While this replication found a weaker chi-squared result than the original paper did, the difference in the proportion of subjects per condition before and after exclusions is still not statistically significant. I performed this test both on exclusions based on the CRT requirement (Pre-Exclusion CRT-First: 74 of 141 [52%], Post-Exclusion CRT-First: 68 of 121 [56%], χ² = .23, p = .63) and on exclusions based on both the CRT requirement and the attention check (Pre-Exclusion CRT-First: 74 of 141 [52%], Post-Exclusion CRT-First: 50 of 82 [61%], χ² = 1.19, p = .28).

The first main statistical test, a t-test of mean acceptability by CRT order, found no statistically significant difference, in contrast to the original study (CRT-First: M = 3.54; Dilemmas-First: M = 3.66; t(80) = -.36, p = .72, d = .08). The second main statistical test was a correlation between the number of CRT questions participants answered correctly and their mean acceptability rating across the three dilemmas. This was performed on the CRT-first and dilemmas-first conditions separately. In the CRT-first condition, the original study found a statistically significant positive correlation (r = .39, p = .001), while this replication found no such effect (r = .14, p = .32). In the dilemmas-first condition, the original study found no statistically significant correlation (r = -.03, p = .8). This replication likewise found no significant correlation, although the sign of the coefficient was reversed (r = .03, p = .85).

The final statistical test was a t-test across the two conditions on CRT score, responding to the concern that presenting the dilemmas before the CRT questions may influence how participants perform on the CRT. In agreement with the original study (CRT-First: M = 1.32; Dilemmas-First: M = 1.16; t(148) = 0.83, p = .41), this replication found no statistically significant difference between means (CRT-First: M = 2.69; Dilemmas-First: M = 2.36; t(141) = 1.33, p = .18). The original authors ran this test on the data set before exclusions, presumably to include those participants who answered all CRT questions incorrectly, in the event that their performance was due to the order in which the survey items were presented.

Commentary

It's unclear whether the original authors' collapsing across dilemmas included collapsing the Likert responses with the binary (yes/no) answer. In the original study, this looks to be the case, but that could lead to complications: the distance between two numbers on the Likert rating (say, between 5 and 6) signifies a difference of degree in acceptability, while on the judgment score the difference between 0 and 1 signifies a complete reversal of the acceptability judgment. In light of this and their use of acceptability means across the two conditions, I opted to use the collapsed moral acceptability rating wherever the paper referred to it and did not incorporate the dichotomous (YES/NO) moral judgment decision.

In the exploratory analysis, I examined the relationship between the amount of time a subject took on the survey and their moral acceptability ratings. After noticing several extremely fast participants, I was interested to see whether speed had any bearing on acceptability ratings of the dilemmas. Duration did not have a statistically significant effect on either mean acceptability ratings or whether a participant passed the attention check.

The attention check was found to have a statistically significant relationship with acceptability ratings: participants who failed the attention check rated the dilemmas more than one point higher in acceptability than those who passed (Failed attention check: M = 4.82; Passed attention check: M = 3.64; t(141) = 4.93, p < .001). Failing the attention check was also significantly associated with moral judgment (Failed attention check: M = .74; Passed attention check: M = .57; t(141) = 2.91, p = .004). Participants who failed the attention check were more likely both to agree with the utilitarian decision and to rate the utilitarian actions in the dilemmas as more acceptable. One would expect satisficing participants to simply select the first options available with the least amount of movement across the screen, which in this case would correspond to lower ratings and disagreement. Perhaps in an effort to avoid being perceived as satisficing, these participants did the opposite.

Excluding participants who did not correctly answer at least one of the four CRT questions also seems like a valid methodological choice, given the significant difference in acceptability ratings between those who were excluded and those who remained in the sample (CRT = 0: M = 4.76; CRT > 0: M = 3.95; t(146) = -2.39, p = .02). The correlations in the CRT-first and dilemmas-first conditions were found not to differ significantly (z = .47, p = .64), again deviating from the original study, which found that they did (z = 2.6, p = .01).
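The comparison of the two correlations can be illustrated with a Fisher r-to-z test for independent samples. This is a sketch under assumptions: the text does not name the test used, and the group sizes (50 CRT-first, 32 dilemmas-first) are taken from the post-exclusion counts reported earlier. The function name is hypothetical.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for two independent correlations.

    Returns (z, two-tailed p).
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)       # Fisher transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))   # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2))          # two-tailed normal p-value
    return z, p


# Replication correlations: CRT-first r = .14, dilemmas-first r = .03.
z, p = compare_correlations(r1=0.14, n1=50, r2=0.03, n2=32)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these inputs the sketch gives values that round to the reported z = .47 and p = .64, which suggests the reported comparison is between the two condition-level correlations.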

The resounding conclusion of this replication is that, even when presented with rewritten CRT questions, Mechanical Turk participants are too familiar with the study model and dilemmas for their responses to serve as an accurate representation of how people interact with the stimuli. In this sample, 77% of participants noted that they either believed or were sure they had prior exposure to dilemmas and questions like these.

There is a follow-up question to this, about how valid survey-based responses to moral dilemmas are in the first place. In real situations of moral dilemma or moral judgment, individuals encounter powerful moral emotions that could affect their decisions, emotions that hypotheticals may not evoke and surveys may not capture. The field needs to consider the limits of surveys and other data collection methods that separate rational decisions from the emotions one encounters in reality.