Replication of Experiment 1 by Paxton, Unger, and Greene (2012, Cognitive Science)

Author

Alexander Pereira (pereirak@stanford.edu)

Published

December 4, 2023

Introduction

Paxton, Unger and Greene (2012) investigated the role of reflection and reasoning on moral judgements. I am a PhD student in philosophy and MA student in psychology. While I am not primarily interested in moral philosophy or psychology, I am interested in the project of empirically testing the intuitions and assumptions that philosophers often adopt when making arguments. Here, it has been widely assumed in philosophy that reflection and reasoning can change one’s moral views. In fact, if this wasn’t assumed, it would be difficult to justify engaging in moral philosophy at all. However (by an ought-implies-can principle) we should be interested in knowing whether and in what ways reason and reflection can actually influence moral judgements in real people; this is the overarching theme of Paxton et al.’s (2012) experiments.

According to Haidt’s Social Intuitionist Model (SIM) of moral judgement, moral evaluations are furnished by quick and automatic emotional responses, not reasoning nor critical reflection. However, other theories of moral judgement grant reasoning and reflection more influence on moral evaluations. Experiment 1 or Paxton et al. (2012) tested whether reflection caused subjects to engage in utilitarian (cost-benefit) moral reasoning. Subjects were randomly selected to undertake a cognitive reflection task (CRT) either before or after evaluating utilitarian solutions to high-stales moral dilemmas. The CRT is thought to induce a reflective state that can override intuition. It was hypothesized that subjects in the CRT-first condition would be more likely to evaluate moral quandries according to utilitarian (cost-benefit) reasoning, as reflection would override initial emotional aversions to extreme actions (such as killing one person to save many). Three “high-conflict” moral dilemmas were used. All participants evaluated whether a utilitarian response was acceptable, first by a binary (YES/NO) then a rating scale (1 = Completely Unacceptable, 7 = Completely Acceptable). Finally, participants answered basic demographic questions.

Link to repo Link to original paper in repo Link to Qualtrics survey

Summary of prior replication attempt

There were five main differences in design between the original study (Paxton et al., 2012) and the replication attempt (Fereday 2019). First, the CRT questions were rewritten and an additional fourth CRT question was added. Second, Fereday added a sub-condition to the moral dilemmas. Half the dilemmas included personalized information (names of characters) while the rest were depersonalized. Fereday wanted to investigate the impact of personalization on moral evaluations. Second, an attention check was built into the survey. Third, the moral dilemma questions included two sub-conditions which were randomly assigned to participants. In the personalized condition, characters in the questions had names, and in the depersonalized conditions characters were not named. Fereday 2019 aimed to investigate the relationship between personalization of questions and the mean acceptability rating of utilitarian solutions to dilemmas. Finally, Fereday (2019) included an attention check in the survey.

Freyeday’s (2019) study did not replicate either of the two key findings from Paxton, Unger, and Greene (2012). The first main statistical test was a two-sided t-test of the mean moral acceptability ratings of the CRT-First (experiment) vs Dilemma-First (control) groups. The replication yielded a statistically insignificant relationship (CRT-First: M = 3.54; Dilemmas-First: M = 3.66; t(82) = -.36, p = .72, d = .08), conflicting with the original study (CRT-First: M = 3.77; Dilemmas-First: M = 3.25; t(90) = 2.03, p = .05, d = 0.43). The second main statistical test was a correlation between the number of CRT questions participants answered correctly and their mean acceptability rating across the three dilemmas. This was performed on the CRT first and dilemma first conditions separately. The replication observed no significant correlation in the CRT-First group (r = .14, p = .32), while the original study observed a significant positive relationship (r = .39, p = .001). The replication also observed an insignificant negative relationship in the Dilemmas-First condition (r = .03, p = .85). The original also observed an insignificant relationship, but in the positive direction (r = -.03, p = .8).

A list of what changes are being made to address potential reasons for replication failure

Major Changes

1. New CRT Questions

The present rescue uses a new pool of Cognitive Reflection Task (CRT) questions to induce a reflective state in the treatment condition. The original study used the CRT; this rescue will use the CRT-2, developed by Thomson and Oppenheimer (2016). The CRT-2 has two relevant advantages over the original CRT. (i) The validity of the CRT appears to depend on participants being naive to its materials and objectives (Thomson and Oppenheimer, 2016; Haigh, 2016; however, see Meyer, Zhou, and Frederick for a dissenting view). Participants generally, but especially those who work on services like MTurk and Prolific, are likely to be familiar with the original CRT questions but not the CRT-2. (ii) Success at the CRT is affected by numerical skills such that failure at the CRT might be explained by the absence of appropriate reflection, or factors such as poor mathematical reasoning or “math anxiety” (Primi et al., 2017). The CRT-2 meanwhile requires more generalized verbal and logical reasoning as well as close readings of the questions. The CRT-2 should track successful reflection with more accuracy and not screen-off participants who are reflecting but lack requisite numerical skills. Even though they do not require numerical reasoning, the CRT-2 questions chosen for the present rescue all require numerical answers, in line with the original study from Paxton, Greene, and Unger (2012)

2. Prolific

The present rescue uses participants from Prolific instead of MTurk. Both the original and replication used participants from MTurk, and the replicating author Fereday (2019) reported that the majority of participants were familiar with the original CRT questions. Prolific also has several advantages which make it more likely that participants are adequately engaged and earnest in their responses.

3. Attention and Commitment Check

The original study did not report including an attention check, while the replication included two attention checks. For the experimental manipulation to work, participants must be suitably engaged with the CRT-2 questions so they can override intially intuitive answers and enter the appropriate reflection state. This makes including an appropriate attention check challenging We want to ensure that appropriate attention is paid to the CRT-2 questions, but not interfere with any reflective state achieved by the manipulation. So, for the CRT-First group, an attention check during the CRT questions might ensure participants are more engaged, but risk interfering the manipulation. An attention check after the CRT block would fail to help induce engagement for the relevant manipulation, and it also might interfere with the moral judgements.

To address this, the rescue implements a different approach. It will include a commitment check at the beginning of the study, and an attention check at the end. The commitment check aims to motive earnest and engaged participant responses by simply reminding the participants that we (the experimenters) care about the quality of survey data, and that we request thoughtful answers to each survey question. Unlike some attention checks, the commitment check avoids any paternalistic or antagonistic dynamics between experimenter and participant.

The rescue will also include a simple attention check at the end. Negative answers on the commitment check or failure on the attention check will serve as exclusion criteria. The overall aim is to balance increasing the probability of adequate participant engagement and tracking non-engaged participants, while minimizing the probability of interfering with the experimental manipulation and creating an anatagonistic environment for the participants.

Minor Changes

1. Ignoring Binary YES/NO Moral Evaluations

The original study employed two measurements of participants’ acceptance of utilitarian responses to moral dilemmas. The first was a YES/NO binary (is it acceptable for the captain to kill an injured crewmate to save the whole ship?). The second was a 7-point likert rating of the degree of acceptability (how acceptable is it for the captain to kill an injured crewmate to save the whole ship?). As noted by the replication from Fereday (2019), it is unclear how the original study “collapsed” moral acceptability ratings and whether this included the binary judgement. It also seems strange to combine a categorical acceptability judgement with a judgement of degree. In line with the replication, I am not planning on incorporating the YES/NO judgement, and only the means of moral acceptability scores. However, I have presently included the binary judgement measure in the paradigm and analysis in the coding, in case it makes sense to use it.

2. Adding Measure of Moral Dilemma Familiarity

Neither the original study nor the replication measured how familiar participants were with the moral dilemmas. This seems like an oversight worth addressing, for two reasons.Like familiarity with the CRT, familiarity with the moral dilemmas might influence moral acceptability ratings. Trolley-problem variants are common in experiments on moral psychology and are reasonably well-known in popular culture.

Changes from Previous Replication

The previous replication made several additions to the paradigm and exploratory analysis. For example, Fereday (2019) created personalized (name of character in dilemmas is used) and depersonalized (name of character is not used) versions of the questions that were randomly assigned to participants. The author added these two sub-conditions to investigate the relationship between personalization of questions and mean acceptability rating. Since these additions to the paradigm and analysis are orthogonal to the research questions of the original study, I have to decided to exlcude them.

Response to comments on progress reports

Code has been simplified and rewritten using Tidyverse functions (where possible) as opposed to R Basics
Typos in paradigm have been fixed
The likert scale sizing on survey has been fixed

Methods

Power Analysis

The effect size in the original study was d = 0.43. Given the effect size and the more conservative two-sided alternative, the following sample sizes are required for 80, 90, and 95% power.

For 80% power: n=86 For 90% power: n=115 For 95% power: n=142

Planned Sample

The original study had an exclusion rate of 39%, while the replication had an exclusion rate of 42%. Given the expected exclusion rate, the present study will use 230 participants to ensure at least 95% power. Note that the present rescue is employing changes aimed at lowering the exclusion rate by increasing participant engagement and using new CRT questions that participants can pass even if they are low on mathematical reasoning capabilities.

Materials

The original Cognitive Reflection Test (CRT) questions, quoted from Frederick (2005) and referenced in the original article:

A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost? _____ cents
If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? _____ minutes
In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake? _____ days”

The present study opted to include a rewritten version of one original CRT questions (to decrease familiarity), alongside five CRT-2 questions (Thomson and Oppenheimer, 2016) which require numerical answers.

CRT and CRT-2 questions:

How many of each animal did Moses put on the Ark? _____ animals.
If it takes 10 programmers 10 minutes to make 10 improvements, how long would it take 50 programmers to make 50 improvements? _____ minutes
How many months have 28 days? ____ months.
A farmer had 15 sheep and all but 8 died. How many are left? _____ sheep.
There is a grasshopper crossing a road. With every jump, the distance the grasshopper jumps doubles. If it takes 26 jumps for the grasshopper to cross the entire road, how many jumps would it take for the grasshopper to make it halfway across? _____ jumps.

Three high conflict moral dilemmas, presented randomly before or after the CRT questions in randomized order:

“John is the captain of a military submarine traveling underneath a large iceberg. An onboard explosion has caused the vessel to lose most of its oxygen supply and has injured a crewman who is quickly losing blood. The injured crewman is going to die from his wounds no matter what happens. The remaining oxygen is not sufficient for the entire crew to make it to the surface. The only way to save the other crew members is for John to shoot dead the injured crewman so that there will be just enough oxygen for the rest of the crew to survive.
Enemy soldiers have taken over Jane’s village. They have orders to kill all remaining civilians. Jane and some of her townspeople have sought refuge in the cellar of a large house. Outside they hear the voices of soldiers who have come to search the house for valuables. Jane’s baby begins to cry loudly. She covers his mouth to block the sound. If she removes her hand from his mouth, his crying will summon the attention of the soldiers, who will kill her, her child, and the others hiding out in the cellar. To save herself and the others, she must smother her child to death.
A runaway trolley is heading down the tracks toward five railway workmen, who will be killed if the trolley proceeds on its present course. Jane is on a footbridge over the tracks, in between the approaching trolley and the five workmen. Next to her on this footbridge is a lone railway workman, who happens to be wearing a large, heavy backpack. The only way to save the lives of the five workmen is for Jane to push the lone work- man off the bridge and onto the tracks below, where he and his large backpack will stop the trolley. The lone workman will die if Jane does this, but the five workmen will be saved.”

This rescue project will follow the exact wording of the questions and dilemmas above.

Procedure

Quoted from original article:

“Subjects were randomly assigned to complete the CRT either before (CRT-First condition) or after (Dilemmas-First condition) responding to the dilemmas. Subjects evaluated the moral acceptability of the utilitarian action with a binary response (YES ⁄ NO), followed by a rating scale (1 = Completely Unacceptable, 7 = Completely Acceptable). No time limits were imposed on responses. Subjects completed the CRT questions and read and responded to the dilemmas at their own pace. Subjects subsequently completed a brief set of demographic questions.”

Controls

A committment check and attention check are included in the survey to ensure participants read the CRTs and especially the moral dilemmas sincerely. There was no attention check nor commitment check mentioned in the original paper, while Fedreday (2019) included an attention check in the “methods addendum” post data collection.

Analysis Plan

Exclusion rules: Exclude subjects who do not pass attention check. Exclude subjects who did not answer at least one CRT question correctly (i.e., those we a CRT score of 0).
CRT scores: Calculate each participant’s CRT score by assigning 1 for correct and 0 for incorrect response. Minimum score is 0, maximum score is 3.
Reliability of CRT scores: Calculate Cronbach’s alpha to determine reliability across moral dilemmas.
Moral acceptability rating: Collapse each subjects moral acceptability rating to create an average moral acceptability rating for each subject.
Linear regression of CRT-First condition on utilitarian moral judgments.
Main Statistical Test: Between-subject t-test of CRT-First condition on individual moral acceptability rating.
Controlling for trait-reflectiveness: Test correlation among subjects in the Dilemmas first condition to rule out variation due to trait reflectiveness. Confirm with a Fischer r-z test.
Controlling for effects of perfoming a task before moral judgements. A potential objection is that simply doing any non-specific problem solving task might change moral evaluations. To test this, calculate the within CRT-first condition correlation of CRT scores and moral acceptability ratings.
Regress the CRT scores across the two conditions to address the objection that receiving the Dilemmas-First condition would influence subsequent CRT Performance.

Key analyses of interest

Main inference of the paper If mean moral acceptability rating is significantly greater in CRT first group compared to Dilemmas first group, results support the main inference of the paper, i.e., that CRT exposure (operationalizing “reflection”) causes an increase in utilitarian (cost-benefit) moral reasoning. Statistical test: two sample t-test between mean_acceptability and CRT_order.

Secondary inference of the paper If higher CRT scores correlate positively with utilitarian judgements of moral acceptability, results support the secondary inference of the paper, i.e., that higher CRT scores in the CRT First group is expressive of stronger “reflection” which causes an increase in utilitarian (cost-benefit) ratings. Note that this result only supports the secondary inference if there is no positive correlation between CRT scores and moral acceptability ratings in the Dilemmas First condition. Statistical Test(s): Pearson’s r calculation correlating CRT total score and mean acceptability for CRT first observations. Then, Pearson’s r calculation correlating CRT total score and mean acceptability for Dilemma first observations.

Differences from Original Study and 1st replication

Explicitly describe known differences in sample, setting, procedure, and analysis plan from original study. The goal, of course, is to minimize those differences, but differences will inevitably occur. Also, note whether such differences are anticipated to make a difference based on claims in the original article or subsequent published research on the conditions for obtaining the effect.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

#### Import data
library(readr)
raw_d <- read_csv("/Users/alexpereira/Library/CloudStorage/OneDrive-Personal/Academics/Stanford/Courses/Fall_2023/PSYCH_251/paxton2012_rescue/data/pilot_b_data.csv")

Rows: 7 Columns: 46
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (46): StartDate, EndDate, Status, IPAddress, Progress, Duration (in seco...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

raw_d <- raw_d %>%
  slice(-c(1, 2))

view(raw_d)

#### Data Preparation for Analysis
# Selects relevant columns, renames columns, replaces numbers with strings for gender variables
clean_d  <- raw_d %>%
  select(Finished, COM_1, CRT_1, CRT_2, CRT_3, CRT_4, CRT_5, ATT_1, MDA_1, `MDA_2 _1`, MDB_1, `MDB_2 _1`, MDC_1, `MDC_2 _1`, DQ_1, DQ_2, `DQ_2 _3_TEXT`, DQ_3, DQ_4, DQ_5, `DQ_6 _1`, `Duration (in seconds)`, FL_17_DO, FL_18_DO) %>%
  rename(crt_order = FL_17_DO, dilemma_order = FL_18_DO, age = DQ_1, gender = DQ_2, income = DQ_3, ethnicity = DQ_4, education = DQ_5, familiarity = `DQ_6 _1`, attention_check = ATT_1, duration = 'Duration (in seconds)', D1_Binary = MDA_1, D2_Binary = MDB_1, D3_Binary = MDC_1, D1_Acceptability = `MDA_2 _1`, D2_Acceptability = `MDB_2 _1`, D3_Acceptability = `MDC_2 _1`) %>%
  mutate (gender, gender = ifelse(gender == 1, "male", gender)) %>%
  mutate(gender, gender = ifelse(gender == 2, "female", gender)) %>%
  mutate(gender, gender = ifelse(gender == 3, "other", gender))

# Converts data from character to numerical
clean_d <- clean_d %>%
  mutate_at(vars(CRT_1, CRT_2, CRT_3, CRT_4, CRT_5, Finished,
                 COM_1, attention_check, D1_Binary, D1_Acceptability,
                 D2_Binary, D2_Acceptability, D3_Binary, D3_Acceptability,
                 age, income, education, duration),
            as.numeric)

# Creates a new variable for total CRT score and average CRT score

clean_d <- clean_d %>%
  mutate(across(starts_with("CRT_1"), ~ ifelse(. == 0, 1, ifelse(. != 1, 0, .)), .names = "new_CRT_1"),
         across(starts_with("CRT_2"), ~ ifelse(. == 10, 1, ifelse(. != 1, 0, .)), .names = "new_CRT_2"),
         across(starts_with("CRT_3"), ~ ifelse(. == 12, 1, ifelse(. != 1, 0, .)), .names = "new_CRT_3"),
         across(starts_with("CRT_4"), ~ ifelse(. == 8, 1, ifelse(. != 1, 0, .)), .names = "new_CRT_4"),
         across(starts_with("CRT_5"), ~ ifelse(. == 25, 1, ifelse(. != 1, 0, .)), .names = "new_CRT_5")) %>%
  mutate(CRT_score = rowSums(select(., starts_with("new_CRT_"))),
         CRT_mean = rowMeans(select(., starts_with("new_CRT_"))))

# Creates a new column showing mean acceptability rating
clean_d <- clean_d %>%
  mutate(mean_acceptability = (D1_Acceptability + D2_Acceptability + D3_Acceptability)/3)

# Creates a new variable for attention check
clean_d <- clean_d %>% 
  mutate(across(starts_with("attention_check"), ~ ifelse(. == 4, 1, ifelse(. != 1, 0, .)), .names = "attention_check_1"))

view(clean_d)

# Reassigns binary moral judgements yes/no as 1 or 0
clean_d <- clean_d %>%
  mutate(across(starts_with("D1_Binary"), ~ifelse(. == 1, 1, ifelse(. == 2, 0, .)), .names = "{col}"),
         across(starts_with("D2_Binary"), ~ifelse(. == 1, 1, ifelse(. == 2, 0, .)), .names = "{col}"),
         across(starts_with("D3_Binary"), ~ifelse(. == 1, 1, ifelse(. == 2, 0, .)), .names = "{col}"))

# Creates new column showing mean binary judgment
clean_d <- clean_d %>% 
  mutate(mean_judgment = (D1_Binary+D2_Binary+D3_Binary)/3)

# Data exclusion / filtering. 
## Filters out participants that did not get at least one CRT question correct (CRT_total = 0) or failed attention check

filtered_d <- clean_d %>%
  group_by(CRT_score) %>%
  filter(CRT_score != 0) %>%
  group_by(attention_check_1) %>%
  filter(all(attention_check_1 != 0))

view(filtered_d)

Confirmatory analysis

# Calculates Cronbach's Alpha for inter-dilemma reliability to justify one moral acceptability score
filtered_d %>%
  select(D1_Acceptability, D2_Acceptability, D3_Acceptability) %>%
  alpha(check.keys = TRUE)

Adding missing grouping variables: `attention_check_1`

Warning in alpha(., check.keys = TRUE): Item = attention_check_1 had no
variance and was deleted but still is counted in the score


Reliability analysis   
Call: alpha(x = ., check.keys = TRUE)

  raw_alpha std.alpha G6(smc) average_r S/N ase mean   sd median_r
      0.79      0.79    0.98      0.56 3.7 0.2  2.8 0.89     0.67

    95% confidence boundaries 
         lower alpha upper
Feldt    -0.39  0.79  0.99
Duhachek  0.40  0.79  1.18

 Reliability if an item is dropped:
                 raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r
D1_Acceptability      0.91      0.91    0.83      0.83 10.0    0.091    NA
D2_Acceptability      0.29      0.29    0.17      0.17  0.4    0.714    NA
D3_Acceptability      0.80      0.80    0.67      0.67  4.0    0.200    NA
                 med.r
D1_Acceptability  0.83
D2_Acceptability  0.17
D3_Acceptability  0.67

 Item statistics 
                 n raw.r std.r r.cor r.drop mean  sd
D1_Acceptability 4  0.73  0.73  0.71   0.44    4 1.4
D2_Acceptability 4  0.99  0.99  0.99   0.98    3 1.4
D3_Acceptability 4  0.79  0.79  0.79   0.55    3 1.4

Non missing response frequency for each item
                    1    3    4    6 miss
D1_Acceptability 0.00 0.50 0.25 0.25    0
D2_Acceptability 0.25 0.25 0.50 0.00    0
D3_Acceptability 0.25 0.25 0.50 0.00    0

# Chisquared Test for Pre and Post Exclusion Data
## Creates and original data frame
original_d <- clean_d %>%
  group_by(crt_order) %>%
  summarise (n=n())

## Creates a data frame with excluded observation
excluded_d <- clean_d %>% 
  group_by(CRT_score) %>%
    filter(!any(CRT_score == 0)) %>% 
  group_by(crt_order) %>% 
  summarise(n=n())

## Joins the two data frames
original_excluded <- inner_join(original_d, excluded_d, by = "crt_order")
original_excluded <- original_excluded %>% 
  select(n.x, n.y)

## Chisquared Test
chisq.test(original_excluded)

Warning in chisq.test(original_excluded): Chi-squared approximation may be
incorrect


    Pearson's Chi-squared test

data:  original_excluded
X-squared = 0, df = 1, p-value = 1

print(chisq.test)

function (x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)), 
    rescale.p = FALSE, simulate.p.value = FALSE, B = 2000) 
{
    DNAME <- deparse(substitute(x))
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    if (is.matrix(x)) {
        if (min(dim(x)) == 1L) 
            x <- as.vector(x)
    }
    if (!is.matrix(x) && !is.null(y)) {
        if (length(x) != length(y)) 
            stop("'x' and 'y' must have the same length")
        DNAME2 <- deparse(substitute(y))
        xname <- if (length(DNAME) > 1L || nchar(DNAME, "w") > 
            30) 
            ""
        else DNAME
        yname <- if (length(DNAME2) > 1L || nchar(DNAME2, "w") > 
            30) 
            ""
        else DNAME2
        OK <- complete.cases(x, y)
        x <- factor(x[OK])
        y <- factor(y[OK])
        if ((nlevels(x) < 2L) || (nlevels(y) < 2L)) 
            stop("'x' and 'y' must have at least 2 levels")
        x <- table(x, y)
        names(dimnames(x)) <- c(xname, yname)
        DNAME <- paste(paste(DNAME, collapse = "\n"), "and", 
            paste(DNAME2, collapse = "\n"))
    }
    if (any(x < 0) || anyNA(x)) 
        stop("all entries of 'x' must be nonnegative and finite")
    if ((n <- sum(x)) == 0) 
        stop("at least one entry of 'x' must be positive")
    if (simulate.p.value) {
        setMETH <- function() METHOD <<- paste(METHOD, "with simulated p-value\n\t (based on", 
            B, "replicates)")
        almost.1 <- 1 - 64 * .Machine$double.eps
    }
    if (is.matrix(x)) {
        METHOD <- "Pearson's Chi-squared test"
        nr <- as.integer(nrow(x))
        nc <- as.integer(ncol(x))
        if (is.na(nr) || is.na(nc) || is.na(nr * nc)) 
            stop("invalid nrow(x) or ncol(x)", domain = NA)
        sr <- rowSums(x)
        sc <- colSums(x)
        E <- outer(sr, sc)/n
        v <- function(r, c, n) c * r * (n - r) * (n - c)/n^3
        V <- outer(sr, sc, v, n)
        dimnames(E) <- dimnames(x)
        if (simulate.p.value && all(sr > 0) && all(sc > 0)) {
            setMETH()
            tmp <- .Call(C_chisq_sim, sr, sc, B, E)
            STATISTIC <- sum(sort((x - E)^2/E, decreasing = TRUE))
            PARAMETER <- NA
            PVAL <- (1 + sum(tmp >= almost.1 * STATISTIC))/(B + 
                1)
        }
        else {
            if (simulate.p.value) 
                warning("cannot compute simulated p-value with zero marginals")
            if (correct && nrow(x) == 2L && ncol(x) == 2L) {
                YATES <- min(0.5, abs(x - E))
                if (YATES > 0) 
                  METHOD <- paste(METHOD, "with Yates' continuity correction")
            }
            else YATES <- 0
            STATISTIC <- sum((abs(x - E) - YATES)^2/E)
            PARAMETER <- (nr - 1L) * (nc - 1L)
            PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
        }
    }
    else {
        if (length(dim(x)) > 2L) 
            stop("invalid 'x'")
        if (length(x) == 1L) 
            stop("'x' must at least have 2 elements")
        if (length(x) != length(p)) 
            stop("'x' and 'p' must have the same number of elements")
        if (any(p < 0)) 
            stop("probabilities must be non-negative.")
        if (abs(sum(p) - 1) > sqrt(.Machine$double.eps)) {
            if (rescale.p) 
                p <- p/sum(p)
            else stop("probabilities must sum to 1.")
        }
        METHOD <- "Chi-squared test for given probabilities"
        E <- n * p
        V <- n * p * (1 - p)
        STATISTIC <- sum((x - E)^2/E)
        names(E) <- names(x)
        if (simulate.p.value) {
            setMETH()
            nx <- length(x)
            sm <- matrix(sample.int(nx, B * n, TRUE, prob = p), 
                nrow = n)
            ss <- apply(sm, 2L, function(x, E, k) {
                sum((table(factor(x, levels = 1L:k)) - E)^2/E)
            }, E = E, k = nx)
            PARAMETER <- NA
            PVAL <- (1 + sum(ss >= almost.1 * STATISTIC))/(B + 
                1)
        }
        else {
            PARAMETER <- length(x) - 1
            PVAL <- pchisq(STATISTIC, PARAMETER, lower.tail = FALSE)
        }
    }
    names(STATISTIC) <- "X-squared"
    names(PARAMETER) <- "df"
    if (any(E < 5) && is.finite(PARAMETER)) 
        warning("Chi-squared approximation may be incorrect")
    structure(list(statistic = STATISTIC, parameter = PARAMETER, 
        p.value = PVAL, method = METHOD, data.name = DNAME, observed = x, 
        expected = E, residuals = (x - E)/sqrt(E), stdres = (x - 
            E)/sqrt(V)), class = "htest")
}
<bytecode: 0x11c895510>
<environment: namespace:stats>

view(original_excluded)

# (Main statistical test) two sample t-test between mean_acceptability and CRT_order.
## If mean moral acceptability ratings is significantly greater in CRT-first group compared to Dilemmas-first group, results support the main inference of the paper, i.e., that CRT exposure (operationalizing "reflection") causes an increase in utilitarian (cost-benefit) moral reasoning. 
main_t <- filtered_d %>% 
  t.test(data = ., mean_acceptability ~ crt_order, var.equal = TRUE)
main_t


    Two Sample t-test

data:  mean_acceptability by crt_order
t = -1.9612, df = 2, p-value = 0.1889
alternative hypothesis: true difference in means between group CRT |FL_18 and group FL_18|CRT  is not equal to 0
95 percent confidence interval:
 -5.323218  1.989885
sample estimates:
mean in group CRT |FL_18 mean in group FL_18|CRT  
                2.500000                 4.166667

# Shows the count for each condition that can then be entered into the calculation for Cohen's d below
condition_counts <- filtered_d %>% 
  group_by(crt_order) %>% 
  count()
kable(condition_counts) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)

crt_order	n
CRT \|FL_18	2
FL_18\|CRT	2

# Calculating Cohen's D for effect size
## Preparing groups and fixing for hidden special characters in "CRT |FL_18" label [i.e., CRT |FL_18], the label for CRT First Condition
### Removing whitespace from crt_order column
filtered_d$crt_order <- str_trim(filtered_d$crt_order)

###Ensuring crt_order has character values instead of numeric values
filtered_d$crt_order <- as.character(filtered_d$crt_order)

### Replace special characters with regular spaces in the crt_order column
filtered_d$crt_order <- gsub("[^[:alnum:][:space:]|-]", " ", filtered_d$crt_order)
unique(filtered_d$crt_order)

[1] "FL 18|CRT"  "CRT |FL 18"

### Creating mean_acceptability scores for CRT_First and Dilemma_First condition
CRT_First <- filtered_d %>%
  filter(crt_order == "CRT |FL 18") %>%
  pull(mean_acceptability)

Dilemma_First <- filtered_d %>%
  filter(crt_order == "FL 18|CRT") %>%
  pull(mean_acceptability)

##Calculating Cohen's D for effect size
cohen_d <- cohen.d(CRT_First, Dilemma_First)
print(cohen_d)


Cohen's d

d estimate: -1.961161 (large)
95 percent confidence interval:
    lower     upper 
-7.196924  3.274602

# Create a dataframe subsetting CRT first observations
crt_first_d <- filtered_d %>% 
  group_by(crt_order) %>% 
  filter(!any(crt_order == "FL 18|CRT")) 

view(crt_first_d)

# (Main Statistical Test) Pearson's r calculation correlating CRT total score and mean acceptability for CRT first observations.
## If higher CRT scores correlate positively with utilitarian judgements of moral acceptability, results support the secondary inference of the paper, i.e., that higher CRT scores in the CRT First group is expressive of stronger "reflection" which causes an increase in utilitarian (cost-benefit) ratings. Note that this result only supports the secondary inference if there is no positive correlation between CRT scores and moral acceptability ratings in the Dilemmas First condition.

## Note: Code unable to run on pilot B data as there are too few finite observations

# Create a dataframe subsetting dilemma first observations
dilemma_first_d <- filtered_d %>% 
  group_by(crt_order) %>% 
  filter(!any(crt_order == "CRT |FL 18"))

# (Main Statistical Test) Pearson's r calculation correlating CRT total score and mean acceptability for dilemma first observations

## Note: Code unable to run on pilot B data as there are too few finite observations

# Scatterplot with jitter and point color as gender and size of point as duration to match original study's figure 1
ggplot(data = crt_first_d, aes(x = jitter(CRT_score), y = mean_acceptability)) +
  geom_point(aes(color = gender, size = duration)) + 
  stat_smooth(method = 'lm') +
  labs(x = "Number of Correct CRT Items", y = "Mean Moral Acceptability Rating", title = "Mean Acceptability on CRT Score")

`geom_smooth()` using formula = 'y ~ x'

Warning in qt((1 - level)/2, df): NaNs produced

Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
-Inf

# Fisher r to z test to test correlation between CRT first and Dilemma first r coefficients
## Calculate correlation coefficients
crt_cor <- crt_first_d %>%
  select(CRT_score, mean_acceptability) %>%
  summarise(correlation = cor(CRT_score, mean_acceptability))

Adding missing grouping variables: `crt_order`

Warning: There was 1 warning in `summarise()`.
ℹ In argument: `correlation = cor(CRT_score, mean_acceptability)`.
ℹ In group 1: `crt_order = "CRT |FL 18"`.
Caused by warning in `cor()`:
! the standard deviation is zero

dilemma_cor <- dilemma_first_d %>%
  select(CRT_score, mean_acceptability) %>%
  summarise(correlation = cor(CRT_score, mean_acceptability))

Adding missing grouping variables: `crt_order`

## Perform Fisher's r-to-z transformation test
paired.r(crt_cor$correlation, dilemma_cor$correlation, NULL, nrow(crt_first_d), nrow(dilemma_first_d), twotailed=TRUE)

Warning in sqrt(1/(n - 3) + 1/(n2 - 3)): NaNs produced

Call: paired.r(xy = crt_cor$correlation, xz = dilemma_cor$correlation, 
    yz = NULL, n = nrow(crt_first_d), n2 = nrow(dilemma_first_d), 
    twotailed = TRUE)
[1] "test of difference between two independent correlations"
z = NA  With probability =  NA

# t-test between mean acceptability and CRT order to respond to the concern that presentation of the dilemmas first influenced subsequent CRT performance
clean_d %>% 
  t.test(data = ., CRT_score ~ crt_order, var.equal = TRUE)


    Two Sample t-test

data:  CRT_score by crt_order
t = 0.3873, df = 3, p-value = 0.7244
alternative hypothesis: true difference in means between group CRT |FL_18 and group FL_18|CRT  is not equal to 0
95 percent confidence interval:
 -2.405680  3.072347
sample estimates:
mean in group CRT |FL_18 mean in group FL_18|CRT  
                3.333333                 3.000000

Three-panel graph with original, 1st replication, and your replication is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Mini meta analysis

Combining across the original paper, 1st replication, and 2nd replication, what is the aggregate effect size?

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.