Replication of “The Illusion of Moral Decline” by Mastroianni & Gilbert (2023, DOI)

Author

Fisher Anderson

Published

December 7, 2025

Introduction

My research interests lie in Symbolic Systems, a field concerned mainly with the brain and its ever-growing similarities with computers. Within it, I have focused on Human-Centered AI (HAI), with frequent detours into human consciousness. While searching for a study to replicate, I found none that was both relevant to my research and feasible in practice: most work on consciousness has been published as reports or commentary, and the few research articles that exist require large-scale machinery such as fMRI. This article by Mastroianni and Gilbert connects to my field by examining how the mind perceives morality, a hot topic in HAI, and it is practically feasible because the study can be reproduced directly on Prolific.

This study consists of five major segments, all of which use survey data to draw conclusions about what people think about morality. The first three segments show that people claim morality has declined when explicitly asked to assess moral change over a variety of time spans. The fourth segment shows that, in reality, people do not think morality is in decline when asked to assess their own contemporaries. Lastly, the fifth segment shows that the appearance of moral decline is also absent when people are asked about their own personal worlds (i.e., friends and family). I will be replicating study 2(c), which asks participants to rate morality today, around the time they were about 20 years old, and around the time they were born.

To conduct this experiment, I will need to recruit a similar pool of participants (all of whom are from the US) and survey them about how moral they perceive people to be and whether that has declined around them over varying spans of time. Prolific's screeners make this easy, and the study itself includes an additional screener on US cultural knowledge.

https://github.com/psych251/mastroianni2023

https://github.com/psych251/mastroianni2023/blob/main/original_paper/moral_decline_paper.pdf

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

– Three contrasts in study 2(c) were significant, with effect sizes of −0.72, −1.08, and −0.37.

– The original authors collected data from 484 respondents, about 50 in each of the 10 age bins, and were left with 347 participants after exclusions. Based on the smallest of the three effects (d = 0.37), reaching 80% power would require only 60 participants. Because Prolific makes a larger sample feasible and I want high power, I am using 100 participants (rounding up from 97) to reach a power level of 95%, as shown below.

library(pwr)

# smallest paired contrast effect size
d <- 0.37

power_levels <- c(0.80, 0.90, 0.95)

required_n <- function(power, d) {
  pwr.t.test(d = d,
             power = power,
             sig.level = 0.05,
             type = "paired",
             alternative = "two.sided")$n
}

raw_n <- sapply(power_levels, required_n, d = d)
rounded_n <- ceiling(raw_n)

data.frame(
  power       = power_levels,
  rounded_n   = rounded_n
)

  power rounded_n
1  0.80        60
2  0.90        79
3  0.95        97
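As a cross-check on the table above, the same sample sizes can be sketched with base R's `power.t.test`, which needs no extra packages. This assumes that d = 0.37 is standardized on the SD of the paired differences (so `delta = d` with `sd = 1`). The last step is a hypothetical planning calculation, not part of the original plan: it assumes exclusions would run at the original study's rate (347 of 484 retained).

```r
# Cross-check of the required sample sizes using base R.
# Assumption: d is standardized on the SD of the paired differences.
d <- 0.37
power_levels <- c(0.80, 0.90, 0.95)

n_check <- sapply(power_levels, function(p) {
  ceiling(power.t.test(delta = d, sd = 1, sig.level = 0.05, power = p,
                       type = "paired", alternative = "two.sided",
                       strict = TRUE)$n)
})

# Power achieved with the planned 100 participants (before exclusions)
power_100 <- power.t.test(n = 100, delta = d, sd = 1, sig.level = 0.05,
                          type = "paired", alternative = "two.sided",
                          strict = TRUE)$power

# Hypothetical: recruits needed to retain 97 participants if the
# exclusion rate matched the original study (347 of 484 retained)
retention <- 347 / 484
n_recruit <- ceiling(97 / retention)

data.frame(power = power_levels, n = n_check)
power_100
n_recruit
```

If the exclusion rate here did resemble the original's, 100 recruits would leave roughly 72 usable participants, which falls between the 80% and 90% rows of the table, so the planned sample may be worth revisiting once pilot exclusion rates are known.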

Planned Sample

The only major demographic screener is that participants must pass a three-item test of English proficiency and knowledge of US American culture; for instance, they must know that a “bell bottom” is not a type of footwear. Additionally, participants are distributed evenly across 10 age brackets spanning ages 18-69, in bins of roughly 5 years.

Participants will be excluded if they meet any of the following criteria: they incorrectly answer any of the 3 questions in the English proficiency and cultural screener; the exact age they report at the end of the study does not match the age bin they selected at the beginning of the study; they fail a built-in consistency check about perceived morality in the year they were born; or they fail an attention check asking them to select “other” and type “apple” manually.
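The exclusion rules above could be applied along the following lines. This is a minimal sketch on toy data: every column name is hypothetical (the real Qualtrics export will differ), the bin edges (18-24, then 5-year bins up to 65-69) are an assumption consistent with "10 brackets from 18-69", and tolerating case and whitespace in the typed word "apple" is my own choice.

```r
# Toy data with hypothetical column names
responses <- data.frame(
  id               = 1:5,
  age_bin          = c("18-24", "30-34", "30-34", "65-69", "45-49"),
  age_exact        = c(22, 31, 41, 67, 47),
  screener_correct = c(3, 3, 3, 2, 3),   # screener items correct out of 3
  consistency_pass = c(TRUE, TRUE, TRUE, TRUE, FALSE),
  attention_choice = rep("other", 5),
  attention_text   = c("apple", "Apple ", "apple", "apple", "apple"),
  stringsAsFactors = FALSE
)

# Assumed bin edges: 18-24, then 5-year bins up to 65-69 (10 bins)
bin_breaks <- c(18, seq(25, 70, by = 5))
bin_labels <- paste(head(bin_breaks, -1), tail(bin_breaks, -1) - 1, sep = "-")

# Recompute each participant's bin from the exact age reported at the end
responses$age_bin_check <- as.character(
  cut(responses$age_exact, breaks = bin_breaks, labels = bin_labels,
      right = FALSE)
)

# Keep only participants who pass every rule
keep <- with(responses,
  screener_correct == 3 &
  !is.na(age_bin_check) & age_bin == age_bin_check &
  consistency_pass &
  attention_choice == "other" &
  tolower(trimws(attention_text)) == "apple"
)
retained <- responses[keep, ]
retained$id
```

In this toy example, participant 3 is dropped for an age/bin mismatch, participant 4 for a screener error, and participant 5 for failing the consistency check.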

Planned sample size and/or termination rule, sampling frame, known demographics if any, preselection rules if any.

Materials

All materials - can quote directly from original article - just put the text in quotations and note that this was followed precisely. Or, quote directly and just point out exceptions to what was described in the original article.

– how do I link to the original work doc with screenshots of the qualtrics flow?

Procedure

Can quote directly from original article - just put the text in quotations and note that this was followed precisely. Or, quote directly and just point out exceptions to what was described in the original article.

– Here is the direct procedure as quoted in the paper: “Study 2c was conducted in 2020. Participants responded to an advertisement for a study on Amazon Mechanical Turk. After providing informed consent, participants reported how “kind, honest, nice and good” people are today. They then reported how “kind, honest, nice and good” people were when they (the participants) were about 20 years old, and at about the time they (the participants) were born. This was done by adjusting the wording of the subsequent questions on the basis of the participant’s age. For example, if the participant was between 30 and 34 years old, they were asked “How kind, honest, nice, and good were people about ten years ago?” and then “How kind, honest, nice, and good were people about 30 years ago?” If participants were under 25 years, they answered only the questions for today and when they were born. All questions were answered using a seven-point Likert scale with endpoints labelled ‘not very’ and ‘very’. As in previous studies, participants were then given a consistency check that required them to remember whether they had rated people today as more, equally or less moral compared to people in the year they were born. Participants then answered some further exploratory and demographic questions. Embedded among them was an attention check that required participants to select the option ‘other’ and type the word ‘apple’. Finally, participants were compensated and dismissed.”

– The only differences are that the study was conducted in 2025, and the platform was Prolific and not MTurk.
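The age-dependent wording described in the quoted procedure can be sketched as a small helper. The function name, the use of the bin's lower bound to compute "years ago", the numerals (the paper spells out "ten"), and the exact wording of the "today" item are all assumptions for illustration; the real survey logic lives in the Qualtrics flow.

```r
# Hypothetical helper: build the rating questions for a given age bin,
# identified by its lower bound (e.g. 30 for the 30-34 bin).
rating_questions <- function(bin_lower) {
  stem <- "How kind, honest, nice, and good"
  q <- c(today = paste(stem, "are people today?"))
  # Participants under 25 skip the "about 20 years old" question
  if (bin_lower >= 25) {
    q["twenty"] <- sprintf("%s were people about %d years ago?",
                           stem, bin_lower - 20)
  }
  q["born"] <- sprintf("%s were people about %d years ago?",
                       stem, bin_lower)
  q
}

rating_questions(30)  # three questions: today, 10 years ago, 30 years ago
rating_questions(18)  # two questions: today and the birth-year item
```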

Analysis Plan

Can also quote directly, though it is less often spelled out effectively for an analysis strategy section. The key is to report an analysis strategy that is as close to the original - data cleaning rules, data exclusion rules, covariates, etc. - as possible.

– Here is the direct analysis as quoted in the study: “To analyse the data, we fit a linear mixed effects model using the lme4 package in R, extracted P values using the lmerTest package and calculated planned contrasts using the emmeans package, using a Holm–Bonferroni correction for multiple comparisons. The outcome was participants’ ratings and the predictor was the year of those ratings (one factor with three levels: today, the year the participant turned 20, the year the participant was born). The model included a fixed effect of the year of each rating and a random intercept for each participant. For this and all models, we checked model assumptions by plotting the outcome variable, residuals and fitted values. All tests we report are two-tailed.”

Clarify key analysis of interest here. You can also pre-specify additional analyses you plan to do.

– I will fit a linear mixed effects model with a random intercept for each participant, then compute planned contrasts between each pair of time points, using the Holm–Bonferroni correction for multiple comparisons.
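To make the Holm–Bonferroni step concrete, here is a minimal sketch that computes the adjustment by hand and compares it with base R's `p.adjust`. The three p-values are hypothetical placeholders, not results from this study or the original.

```r
# Hypothetical unadjusted p-values for the three planned contrasts
p_raw <- c("twenty - today" = 0.030,
           "twenty - born"  = 0.200,
           "today - born"   = 0.004)
m <- length(p_raw)

# Holm: sort ascending, multiply the k-th smallest by (m - k + 1),
# enforce monotonicity with a running maximum, and cap at 1
o <- order(p_raw)
adj_sorted <- pmin(1, cummax((m - seq_len(m) + 1) * p_raw[o]))
p_holm_manual <- adj_sorted[order(o)]  # restore the original order

# Built-in equivalent
p_holm <- p.adjust(p_raw, method = "holm")

round(p_holm, 3)
all.equal(unname(p_holm_manual), unname(p_holm))
```

With these placeholder values the adjusted p-values are 0.060, 0.200, and 0.012: the smallest is multiplied by 3, the next by 2, and the largest is left unchanged.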

Differences from Original Study

Explicitly describe known differences in sample, setting, procedure, and analysis plan from original study. The goal, of course, is to minimize those differences, but differences will inevitably occur. Also, note whether such differences are anticipated to make a difference based on claims in the original article or subsequent published research on the conditions for obtaining the effect.

– A major consideration is the difference between the two platforms: Prolific (used here) and Amazon Mechanical Turk (used in the original). Each platform draws on its own participant population, and these differences may or may not affect the end result.

– Also, my age quotas are built into Prolific, whereas the original quotas were managed manually on MTurk.

– Also, I rejected exceptionally fast responses.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.
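The model code below expects a long-format data frame (`good_melt`) with one row per participant and time point. A minimal sketch of that reshaping on toy data, using only base R; the wide-format column names are hypothetical, and dropping missing ratings up front is my own choice rather than part of the original pipeline.

```r
# Toy wide-format data with hypothetical column names
wide <- data.frame(
  participant = c("p1", "p2", "p3"),
  today  = c(4, 3, 5),
  twenty = c(5, NA, 6),   # p2 is under 25, so has no "twenty" rating
  born   = c(6, 4, 6),
  stringsAsFactors = FALSE
)

time_points <- c("born", "twenty", "today")

# One row per participant x time point
long <- reshape(wide, direction = "long",
                varying = list(time_points), v.names = "rating",
                timevar = "time", times = time_points,
                idvar = "participant")

# Drop the structurally missing ratings and order time chronologically
long <- long[!is.na(long$rating), ]
long$time <- factor(long$time, levels = time_points)
rownames(long) <- NULL
long
```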

Confirmatory analysis

The analyses as specified in the analysis plan.

Side-by-side graph with original graph is ideal here

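#####model#####
library(lmerTest) # loads lme4 and adds Satterthwaite p-values to lmer() summaries
library(emmeans)  # estimated marginal means and planned contrasts
# reverse the level order; the first level becomes the reference level
good_melt$time <- factor(good_melt$time, levels = levels(good_melt$time)[c(3, 2, 1)])
# fixed effect of time, random intercept for each participant
good_mod <- lmer(rating ~ time + (1 | participant), data = good_melt) #still having errors here, but also in Pilot A
summary(good_mod)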

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rating ~ time + (1 | participant)
   Data: good_melt

REML criterion at convergence: 195.2

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.14615 -0.42903 -0.02105  0.31119  2.53613 

Random effects:
 Groups      Name        Variance Std.Dev.
 participant (Intercept) 0.7145   0.8453  
 Residual                1.2353   1.1115  
Number of obs: 58, groups:  participant, 20

Fixed effects:
            Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)   4.5492     0.3259 46.3056  13.960   <2e-16 ***
timetoday    -0.6492     0.3636 36.7965  -1.785   0.0825 .  
timeborn      0.4008     0.3636 36.7965   1.102   0.2775    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
          (Intr) timtdy
timetoday -0.595       
timeborn  -0.595  0.533

means <- emmeans(good_mod, specs = ~ time)
means

 time   emmean    SE   df lower.CL upper.CL
 twenty   4.55 0.326 46.3     3.89     5.21
 today    3.90 0.312 43.7     3.27     4.53
 born     4.95 0.312 43.7     4.32     5.58

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95

# edf is the degrees of freedom behind sigma; check that 677 suits this pilot (58 observations)
eff_size(means, sigma = sigma(good_mod), edf = 677)

 contrast       effect.size    SE   df lower.CL upper.CL
 twenty - today       0.584 0.328 43.7  -0.0771    1.245
 twenty - born       -0.361 0.328 43.7  -1.0214    0.300
 today - born        -0.945 0.317 43.7  -1.5842   -0.305

sigma used for effect sizes: 1.111 
Degrees-of-freedom method: inherited from kenward-roger when re-gridding 
Confidence level used: 0.95

contr <- contrast(means, method = "pairwise", adjust = "holm")
contr

 contrast       estimate    SE   df t.ratio p.value
 twenty - today    0.649 0.364 36.8   1.783  0.1658
 twenty - born    -0.401 0.364 36.8  -1.101  0.2782
 today - born     -1.050 0.351 36.0  -2.987  0.0151

Degrees-of-freedom method: kenward-roger 
P value adjustment: holm method for 3 tests

confint(contr)

 contrast       estimate    SE   df lower.CL upper.CL
 twenty - today    0.649 0.364 36.8   -0.264    1.563
 twenty - born    -0.401 0.364 36.8   -1.314    0.513
 today - born     -1.050 0.351 36.0   -1.933   -0.167

Degrees-of-freedom method: kenward-roger 
Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 3 estimates
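As a sanity check on the tables above, the standardized effect sizes reported by `eff_size` are the contrast estimates divided by the residual SD. The sketch below copies the rounded pilot values from the printed output, so agreement is only expected to three decimals.

```r
# Rounded values copied from the pilot output above
estimates <- c("twenty - today" =  0.649,
               "twenty - born"  = -0.401,
               "today - born"   = -1.050)
sigma_resid <- 1.111  # "sigma used for effect sizes"

# Standardized effect size = contrast estimate / residual SD
d_check <- round(estimates / sigma_resid, 3)
d_check
```

These match the `effect.size` column (0.584, -0.361, -0.945).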

####plot####
library(ggplot2)
# order the x-axis chronologically
good_melt$time <- factor(good_melt$time, levels = c("born", "twenty", "today"))
# mean rating with a bootstrapped 95% CI (mean_cl_boot requires the Hmisc package)
plot <- ggplot(good_melt, aes(x = time, y = rating)) +
  stat_summary(fun.data = "mean_cl_boot")

plot

Warning: Removed 35 rows containing non-finite outside the scale range
(`stat_summary()`).

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

I used Prolific, whereas the original authors used MTurk; this is a potentially big difference. The only other changes I made to the Qualtrics survey were to the consent email and the debriefing note.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.