Summary of prior replication attempts
There were three main differences in design between the original study (Paxton et al., 2012) and the previous PSYCH251 replication attempt (Fereday, 2019). First, the CRT questions were rewritten and a fourth CRT question was added. Second, Fereday added a sub-condition to the moral dilemmas, randomly assigned to participants: in the personalized condition the characters in the dilemmas were named, while in the depersonalized condition they were not. This was intended to investigate the relationship between personalization and the mean acceptability rating of utilitarian solutions to the dilemmas. Third, Fereday (2019) built an attention check into the survey.
Fereday’s (2019) study replicated neither of the two key findings from Paxton, Ungar, and Greene (2012). The first main statistical test was a two-sided t-test comparing the mean moral acceptability ratings of the CRT-First (experimental) and Dilemmas-First (control) groups. The replication yielded a nonsignificant difference (CRT-First: M = 3.54; Dilemmas-First: M = 3.66; t(82) = -.36, p = .72, d = .08), in contrast to the original study (CRT-First: M = 3.77; Dilemmas-First: M = 3.25; t(90) = 2.03, p = .05, d = .43). The second main statistical test was a correlation between the number of CRT questions participants answered correctly and their mean acceptability rating across the three dilemmas, computed separately for the CRT-First and Dilemmas-First conditions. The replication observed no significant correlation in the CRT-First group (r = .14, p = .32), while the original study observed a significant positive relationship (r = .39, p = .001). In the Dilemmas-First condition, the replication observed a nonsignificant positive relationship (r = .03, p = .85), whereas the original observed a nonsignificant relationship in the negative direction (r = -.03, p = .8).
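To make these two tests concrete, a minimal Python sketch is given below. The data frame and column names (condition, crt_score, mean_acceptability) are illustrative assumptions; this is not the original or replication analysis code.

```python
# Minimal sketch of the two key tests, assuming a hypothetical data file
# with one row per participant. All column names are illustrative.
import pandas as pd
from scipy import stats

df = pd.read_csv("replication_data.csv")  # hypothetical file name

crt_first = df[df["condition"] == "CRT-First"]
dilemmas_first = df[df["condition"] == "Dilemmas-First"]

# Test 1: two-sided independent-samples t-test on mean acceptability ratings
t, p = stats.ttest_ind(crt_first["mean_acceptability"],
                       dilemmas_first["mean_acceptability"])
print(f"t = {t:.2f}, p = {p:.2f}")

# Test 2: Pearson correlation between CRT score and mean acceptability,
# computed separately within each condition
for name, group in [("CRT-First", crt_first), ("Dilemmas-First", dilemmas_first)]:
    r, p = stats.pearsonr(group["crt_score"], group["mean_acceptability"])
    print(f"{name}: r = {r:.2f}, p = {p:.2f}")
```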
There have been several other replications of Experiment 1 from Paxton et al. (2012). Attie and Knobe (2017) failed to replicate the original results. There was no significant difference in responses between participants in the CRT-First condition (M = 3.19, SD = 1.52) and those in the CRT-After condition (M = 2.99, SD = 1.50), t(296) = -1.16, p = .24. There was also no significant correlation between utilitarian judgement and CRT score in the CRT-First condition, r(316) = .08, p = .11, 95% CI [-.03, .18].
Paxton et al. (2014), however, replicated both of the main findings using a slightly different moral dilemma (Sophie’s Choice). Their results were as follows: CRT-First M = 5.5, Dilemmas-First M = 3.0, t(15) = 2.2, p = .04; r = .44, p = .006.
Discussion
The present rescue aimed to improve on the experimental design of both the original and the replication in several ways. First, the rescue was higher powered than both (post-exclusion n = 160, for 95% power, compared with n = 92 in the original and n = 82 in the replication). Second, it used newer CRT questions from the CRT-2, along with reworded classic CRT questions, to address Fereday’s (2019) worry that familiarity with CRT questions impaired the induction of an appropriate reflective state. Third, participants were recruited from Prolific instead of MTurk, which also addressed the familiarity concern.
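As a sanity check on the power claim, the calculation can be sketched with statsmodels; the inputs below (80 participants per group, alpha = .05, two-sided) are illustrative assumptions rather than the rescue’s actual planning values.

```python
# Sketch of a two-sample t-test power analysis. Inputs are assumptions;
# the rescue's exact planning parameters are not restated in this section.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Smallest standardized effect size (Cohen's d) detectable with 95% power,
# assuming 80 participants per group (160 total, post exclusions)
d_detectable = analysis.solve_power(nobs1=80, ratio=1.0, power=0.95,
                                    alpha=0.05, alternative="two-sided")
print(f"minimum detectable d: {d_detectable:.2f}")
```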
The rescue did not reproduce the main result of the original: participants who were shown CRT questions before the moral dilemmas did not deliver stronger utilitarian responses. This result might provide evidence against the claim that reflecting on moral dilemmas produces more utilitarian evaluations. However, the rescue did reproduce the other main result of the original study: in the CRT-First condition, participants who answered more CRT questions correctly were more likely to deliver stronger utilitarian responses. There was no such relationship in the Dilemmas-First condition, suggesting that the observed relationship was not caused by trait-level reflective or utilitarian attitudes.
However, a state-level explanation – on which CRT exposure induces a reflective state that causes stronger utilitarian judgements – does not clearly fit either: if it were correct, we should have seen a significant difference between the utilitarian ratings of the CRT-First and Dilemmas-First groups, which we did not.
This is a puzzle. The start of a solution is suggested by adding a post-hoc exclusion. The original study placed no limit on duration, while the present study excluded only extreme positive outliers who took over an hour to complete the study (n = 4; median duration was 7:37). However, a single participant (n = 1) in the CRT-First condition took just 95 seconds to complete the entire survey. This participant had a perfect CRT score of 5 and a high mean moral acceptability rating of 5.8, and reported being “definitely familiar” with both the CRT questions and the moral dilemmas.
It is likely this participant did not engage earnestly with the experiment. Furthermore, their responses likely inflated the observed within-group correlation between CRT scores and utilitarian judgements in the direction predicted by the secondary hypothesis. To test this, all participants who took less than 2 minutes on the survey were excluded in a post-hoc analysis (n = 1 in the CRT-First condition, n = 3 overall).
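A sketch of this exclusion, continuing the hypothetical data frame from the earlier example and assuming duration was recorded in seconds:

```python
# Post-hoc exclusion of completions under 2 minutes (column names illustrative)
from scipy import stats

kept = df[df["duration_seconds"] >= 120]

crt_first_kept = kept[kept["condition"] == "CRT-First"]
r, p = stats.pearsonr(crt_first_kept["crt_score"],
                      crt_first_kept["mean_acceptability"])
print(f"CRT-First after exclusion: r = {r:.2f}, p = {p:.2f}")
```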
Surprisingly, excluding just this one participant caused the partial replication to vanish: the correlation between CRT scores and moral acceptability ratings in the CRT-First condition became nonsignificant (r = .21, p = .07). As before, there was no significant correlation in the Dilemmas-First condition.
We can conclude that while there appears to be a slight positive association between CRT proficiency and the strength of one’s utilitarian judgements, it is neither robust nor statistically significant.
Further exploratory analyses were conducted on participants’ familiarity with the experimental materials. Familiarity was a concern of the first replication, and consequently the present rescue aimed to reduce it. This was only moderately successful: 37% of participants self-reported being “unfamiliar” with the CRT questions, while 47% were “somewhat familiar” and 16% were “definitely familiar”. The 63% who were either “somewhat” or “definitely” familiar represent an improvement on Fereday’s (2019) replication, in which 77% reported being either “somewhat” or “definitely” familiar.
It is possible that these levels of familiarity interfered with the targeted manipulation; perhaps participants in the CRT-First group who were familiar with the questions were not induced into an appropriate reflective state. Familiarity with CRT questions had a significant positive correlation with CRT performance (r = .25, p = .001), but no significant relationship with moral acceptability scores (r = .11, p = .16).
The present rescue also investigated how familiar participants were with the three moral dilemmas (an improvement on both the original and the first replication). Participants’ familiarity with the moral dilemmas showed a similar distribution to their familiarity with the CRT questions.
Interestingly, previous exposure to the moral dilemmas appears to have important implications for moral judgements and CRT performance; indeed, familiarity with the dilemmas appears more influential than familiarity with the CRT. Familiarity with the moral dilemmas showed a strong, significant positive correlation with utilitarian judgements (r = .30, p = .0001). That is, the more familiar participants were with the moral dilemmas, the more likely they were to produce utilitarian responses. It is possible that familiarity alone is a major explanation of the pattern of moral judgements across both conditions. Familiarity with the dilemmas also had a strong, significant positive correlation with CRT scores (r = .33, p = .0001).
In the exploratory analysis, I performed the two main statistical tests on two subgroups: participants who were completely unfamiliar with both the CRT questions and the moral dilemmas, and participants who were “relatively” unfamiliar with both (reporting being either “unfamiliar” or only “somewhat familiar”). Even though these analyses were moderately underpowered, if familiarity were interfering with the experimental manipulation, we should expect to see patterns closer to the significant observations of the original experiment.
However, all of these analyses returned nonsignificant results. For example, a t-test on participants who were relatively unfamiliar with the CRT (n = 131) showed no significant difference between the moral acceptability ratings of the CRT-First and Dilemmas-First conditions (p = .63), and the same test on participants who were completely unfamiliar (n = 33) was also nonsignificant (p = .56). Together, these results suggest that familiarity with either the CRT or the dilemmas was not interfering with the main experimental manipulation.
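A sketch of the CRT-familiarity subgroup test, again assuming the hypothetical data frame and a self-report column with levels “unfamiliar”, “somewhat familiar”, and “definitely familiar”:

```python
# Subgroup t-tests by self-reported CRT familiarity (columns illustrative)
from scipy import stats

subgroups = {
    "completely unfamiliar": df["crt_familiarity"] == "unfamiliar",
    "relatively unfamiliar": df["crt_familiarity"].isin(
        ["unfamiliar", "somewhat familiar"]),
}
for label, mask in subgroups.items():
    sub = df[mask]
    t, p = stats.ttest_ind(
        sub.loc[sub["condition"] == "CRT-First", "mean_acceptability"],
        sub.loc[sub["condition"] == "Dilemmas-First", "mean_acceptability"])
    print(f"{label} (n = {len(sub)}): t = {t:.2f}, p = {p:.2f}")
```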
Finally, two more exploratory analyses are of note. First, the pre-exclusion data (n = 241) were analyzed using the same tests. Both main statistical tests were again nonsignificant.
Second, a slight but significant negative relationship between duration and moral acceptability was observed: participants who took longer on the survey gave weaker utilitarian evaluations (r = -.16, p = .04). Intuitively, reflection is an activity requiring time, so one might expect longer durations to be associated with stronger reflection and therefore with stronger utilitarian responses. This contrary result must be reconciled with Paxton et al.’s (2012) claim that greater reflection produces stronger utilitarian judgements.
I will close by speculating on possible conclusions to draw from this failed rescue, which in turn suggest future directions. One explanation for the failures to replicate the findings of Experiment 1 in Paxton et al. (2012) concerns sweeping familiarity: perhaps current participants are simply too familiar with the CRT and its variants, and with trolley-problem-style moral dilemmas and their variants, for the relevant experimental manipulations and measurements to be effective and accurate, respectively. This explanation preserves the integrity of the original study, but suggests that present replications are doomed to fail. To address this, future experiments should adopt different methods and materials to manipulate and measure the constructs of interest. However, such paradigms must justify why the alternate materials should be interpreted as playing the same functional roles as the CRT and moral dilemmas in Paxton et al. (2012), which is not easy to do.
Another, more critical explanation takes issue with the background theoretical architecture of the original study. Paxton et al.’s (2012) hypotheses are derived from an intricate web of theories and empirical results, which in turn lean on several assumptions and argumentative moves, all made quickly early in the original paper. Of course, while the original experiment was successful, the choice and defense of the hypotheses could be accepted without too much scrutiny. But the growing pattern of failed replications and rescues calls for a critical re-examination of the background assumptions and arguments motivating the original hypotheses.
There are in fact four distinct assumptions guiding Paxton et al.’s (2012) predictions.
First, moral judgements tend to be initially automatic and unconscious (this is supported by Greene’s dual-process picture).
Second, cognitive reflection can override initial automatic judgements (whatever those judgements happen to be).
Third – and importantly – people’s initial unconscious moral judgements tend towards the deontological (i.e., respecting the rights of persons, and therefore not sacrificing one to save many).
Fourth, cognitive reflection pushes moral judgements towards the utilitarian (i.e., maximising overall good, and therefore sacrificing one to save many). This last assumption is, of course, the main experimental test; the rest are background.
Paxton et al. (2012) weave these assumptions together without much justification or clarity. To be clear, I am not saying the original authors are unjustified in assuming or predicting the above; I am saying that none of these assumptions is separately motivated or transparently justified in the original paper. Any of them could be wrong. The failed replications might therefore be detecting an error in the fourth assumption, or in any of the first three; assuming the failures are explained by a problem in the background assumptions, it is not clear from the present vantage point where the error lies – it could be in any single assumption, or in some combination.
For example, it might be true that initial moral judgements are made automatically and emotionally, and that cognitive reflection can override these initial judgements; yet perhaps the initial automatic judgements tend towards the utilitarian, and the overriding reflection pushes in the deontological direction. More likely, there may be substantial individual variation in people’s default moral attitudes, or context-specific effects on how moral attitudes are manipulated.
An updated literature review covering assumptions one through four would help assess the background reasoning of Paxton et al. (2012) against contemporary empirical data.