Replication of “The Illusion of Moral Decline” by Mastroianni & Gilbert (2023, DOI)
Author
Fisher Anderson
Published
December 11, 2025
Introduction
My research interests lie in the field of Symbolic Systems, which concerns the brain and its ever-growing similarities with computers. Specifically, I have steered this path toward Human-Centered AI (HAI), with frequent detours into the study of human consciousness. In searching for a study to replicate, I found none that satisfied both research relevance and real-world feasibility: most work on consciousness is published as a report or commentary, and the few research articles that do exist require large-scale machinery such as fMRI. This article by Mastroianni ties into my field of study by examining how the mind perceives morality, a hot topic in HAI, and is also practically feasible because it can be run on Prolific, directly mimicking the original study.
This study consists of five major segments, all of which rely on survey data to draw conclusions about what people think about morality. The first three segments show that people claim morality has declined when explicitly asked to assess moral change over a variety of time spans. The fourth segment shows that, in reality, people do not think morality is in decline when asked to assess their own contemporaries. Lastly, the fifth segment shows that the appearance of moral decline also vanishes when people are asked about their own personal worlds (i.e., friends and family). I will be replicating study 2c, which asks about morality in the current year, in the year the participant was born, and in the year the participant turned about 20.
To conduct this experiment, I used a similar pool of people (all from the US) and surveyed them about morality and whether or not it has declined around them over varying spans of time. This is made easy by Prolific's built-in screeners, plus an additional screener about US cultural knowledge within the study itself.
Three aspects of study 2c proved significant, with effect sizes of −0.72, −1.08, and −0.37. The original authors collected data from 484 respondents, about 50 in each of 10 age bins (18–24, 25–29, 30–34, 35–39, 40–44, 45–49, 50–54, 55–59, 60–64, and 65–69); after exclusions, they were left with 347 participants. To reach 80% power on the smallest of these effects, I would only need about 60 participants. Given the feasibility of Prolific and a desire for a well-powered test, I am using 100 participants (rounding up from 97) to reach 95% power, as sketched below.
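As a sanity check on those sample sizes, here is a minimal power sketch, assuming the smallest original effect (d = 0.37) is treated as the target of a paired contrast; the original power analysis may have been set up differently.

```r
# Minimal power sketch (assumption: paired contrast targeting the smallest
# original effect size, d = 0.37, at alpha = .05).
library(pwr)

pwr.t.test(d = 0.37, sig.level = 0.05, power = 0.80, type = "paired")  # n is roughly 60
pwr.t.test(d = 0.37, sig.level = 0.05, power = 0.95, type = "paired")  # n is roughly 97
```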
The only major demographic screener is that participants must pass a three-item test of English proficiency and knowledge of US American culture; for instance, they need to know that a "bell bottom" is not a type of footwear. Additionally, the sample is evenly distributed across 10 age brackets from 18 to 69 in roughly five-year bins.
Participants were excluded if they met any of the following criteria: they incorrectly answered any of the three questions on the English proficiency and cultural screener; their exact reported age at the end of the study did not match the age bin they selected at the beginning; they failed a built-in consistency check about perceived morality in the year they were born; or they failed an attention check asking them to select "other" and type "apple" manually.
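For concreteness, here is a hypothetical dplyr sketch of those exclusions; the column names are placeholders, not the original variable names.

```r
# Hypothetical exclusion filter; column names are placeholders, not the original data.
library(dplyr)

clean <- raw %>%
  filter(
    screener_correct == 3,            # all three English / US-culture items answered correctly
    reported_age_matches_bin,         # exact age consistent with the age bin chosen at the start
    consistency_check_passed,         # remembered the today-vs-birth-year comparison correctly
    attention_check_choice == "other",
    attention_check_text == "apple"   # selected "other" and typed "apple"
  )
```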
Materials & Procedure
As quoted in the original report: “Participants responded to an advertisement for a study on [Prolific]. After providing informed consent, participants reported how “kind, honest, nice and good” people are today. They then reported how “kind, honest, nice and good” people were when they (the participants) were about 20 years old, and at about the time they (the participants) were born. This was done by adjusting the wording of the subsequent questions on the basis of the participant’s age. For example, if the participant was between 30 and 34 years old, they were asked “How kind, honest, nice, and good were people about ten years ago?” and then “How kind, honest, nice, and good were people about 30 years ago?” If participants were under 25 years, they answered only the questions for today and when they were born. All questions were answered using a seven-point Likert scale with endpoints labelled ‘not very’ and ‘very’. …Participants were then given a consistency check that required them to remember whether they had rated people today as more, equally or less moral compared to people in the year they were born. Participants then answered some further exploratory and demographic questions. Embedded among them was an attention check that required participants to select the option ‘other’ and type the word ‘apple’. Finally, participants were compensated and dismissed.”
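To make the age-conditional wording concrete, here is a small hypothetical sketch of how the "years ago" phrasing could be derived from the lower bound of a participant's age bin; this is an illustration, not the original Qualtrics logic.

```r
# Hypothetical illustration of the age-conditional wording (not the original
# Qualtrics logic). Offsets are taken from the lower bound of the age bin.
stem <- "How kind, honest, nice, and good were people about %d years ago?"

questions_for_bin <- function(bin_lower) {
  qs <- c(today = "How kind, honest, nice, and good are people today?")
  if (bin_lower >= 25) {                          # under-25s skip the "turned 20" item
    qs <- c(qs, at_20 = sprintf(stem, bin_lower - 20))
  }
  c(qs, born = sprintf(stem, bin_lower))
}

questions_for_bin(30)
# e.g. "... about 10 years ago?" and "... about 30 years ago?"
```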
Analysis Plan
As quoted in the study: “To analyse the data, [I] fit a linear mixed effects model using the lme4 package in R, extracted P values using the lmerTest package and calculated planned contrasts using the emmeans package, using a Holm–Bonferroni correction for multiple comparisons. The outcome was participants’ ratings and the predictor was the year of those ratings (one factor with three levels: today, the year the participant turned 20, the year the participant was born). The model included a fixed effect of the year of each rating and a random intercept for each participant. For this and all models, we checked model assumptions by plotting the outcome variable, residuals and fitted values. All tests we report are two-tailed.”
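Following that plan, here is a minimal sketch of the model as I fit it, assuming a long-format data frame `good_melt` with one rating per row (the same object used in the plotting code below) and a participant identifier column assumed here to be called `id`.

```r
# Sketch of the planned mixed model; `good_melt` holds one rating per row with
# columns `rating`, `time` (today / twenty / born) and a participant id (`id`).
library(lme4)
library(lmerTest)   # adds Satterthwaite p-values to lmer summaries
library(emmeans)

fit <- lmer(rating ~ time + (1 | id), data = good_melt)
summary(fit)

# Planned contrasts between the three time points, Holm-Bonferroni corrected.
emm <- emmeans(fit, ~ time)
pairs(emm, adjust = "holm")

# Basic assumption check: residuals vs. fitted values.
plot(fit)
```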
The key analyses of interest are the comparisons between perceived morality in the current year and perceived morality in the year the participant was born and in the year they turned 20. I expect a statistically significant effect of perceived moral decline in both comparisons.
Differences from Original Study
A large consideration is the inherent difference between the two platforms: Prolific (used here) and Amazon Mechanical Turk (used in the original). Each platform carries its own population differences, which may or may not matter for the end result. The common consensus online is that data quality is higher on Prolific because of its user pool, but there is no demonstrated difference in research effects between the two platforms. For example, in Mastroianni's search for a sample of 500, 301 Mechanical Turk participants were screened out by the US cultural questions. By contrast, in my search for 100 participants, only 1 was screened out, for answering "a pizza" to "Which of these is most likely to have a sign that says 'out of order?'"
Also, my age quotas are built into the Prolific study, whereas the original built them into the Qualtrics survey, meaning some of their participants would begin the survey and be immediately excluded by the quota. Lastly, I chose to have Prolific automatically reject exceptionally fast responses; this should have no effect on the results.
Methods Addendum
Actual Sample
There were 101 takers of my survey, with 1 screened out and 14 excluded. After the screener, 100 participants remained, with roughly 10 in each of the 10 aforementioned age bins, as planned. Of the 14 exclusions, 6 reported a different overall perception of morality at the end of the survey than at the beginning, and 8 failed the attention check, leaving 86 participants whose data are analyzed below. The final demographic layout of the analyzed sample was: 39 female, 47 male; 12% Asian, 12% Black or African-American, 1% Hispanic or Latino origin, 71% White, and 5% more than one of the above.
These proportions of ethnicity, gender, and exclusions closely mirror those reported in the original study.
Differences from pre-data collection methods plan
None.
Results
Data preparation
Data preparation followed the analysis plan; most of the code is hidden for legibility.
```r
#### plot ####
good_melt$time <- factor(good_melt$time, levels = c("born", "twenty", "today"))
plot_dat <- good_melt %>%
  dplyr::filter(is.finite(rating))  # missing ratings were omitted from the plots
```
[Figure: Replicated graph]
[Figure: Original graph, Mastroianni & Gilbert (2023)]
Exploratory analyses
In the original, they also examined the effects of age, gender, race, education (from “did not finish high school” to “graduate degree”), political ideology (from “very liberal” to “very conservative”) and parental status on perceptions of moral decline using an exploratory linear regression.
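My version of that exploratory regression looked roughly like the following sketch, assuming a per-participant decline score (today's rating minus the birth-year rating) as the outcome; the predictor column names are placeholders, not the original variable names.

```r
# Rough sketch of the exploratory regression (placeholder column names).
# Outcome: per-participant decline score = today's rating minus the rating
# for the year the participant was born.
library(dplyr)
library(tidyr)

decline_dat <- good_melt %>%
  pivot_wider(names_from = time, values_from = rating) %>%
  mutate(decline = today - born)

exploratory_fit <- lm(
  decline ~ age + gender + race + education + ideology + parent,
  data = decline_dat
)
summary(exploratory_fit)
```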
In the original study, Mastroianni found that more conservative participants perceived more moral decline than more liberal participants did, and that older participants perceived more moral decline than younger participants did. However, none of the listed predictors showed statistically reliable effects on perceived moral decline in my exploratory analyses, most likely because my much smaller sample did not provide enough data for covariate-level conclusions.
Instead, here are some helpful graphs to visualize the potential effect of ideology and race on perceived morality.
[Figure: Summary graph]
[Figure: Perceived morality by political ideology]
[Figure: Perceived morality by race]
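For reference, the ideology panel can be reproduced with a sketch along these lines, assuming the per-participant decline score from the regression sketch above and an ideology scale running from "very liberal" to "very conservative" (column names are placeholders); the linear `geom_smooth()` call matches the one used for the figure above.

```r
# Hypothetical sketch of the ideology plot (placeholder column names).
library(ggplot2)

ggplot(decline_dat, aes(x = ideology, y = decline)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x) +   # linear trend, as in the figure above
  labs(
    x = "Political ideology (very liberal to very conservative)",
    y = "Perceived moral decline (today minus birth-year rating)"
  )
```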
Discussion
Summary of Replication Attempt
Overall, Study 2c successfully replicated the primary findings of Mastroianni and Gilbert (2023). Participants in the present study rated people today as significantly less moral than people in the past, both relative to when they were born and when they were approximately 20 years old. Importantly, the magnitude of these effects closely matched those reported in the original study: the effect sizes observed here (−0.80, −1.08, and −0.28) were highly similar to the original estimates (−0.72, −1.08, and −0.37). This close correspondence suggests that the effect of perceived moral decline is robust to differences in sampling platform and time, and that the original findings generalize beyond the specific context in which they were first obtained. The near-identical study design, procedure, and analysis pipeline likely contributed to the high degree of replication fidelity observed here.
Beyond the confirmatory analyses, exploratory analyses provided further insight into the nature of perceived moral decline. Notably, perceived decline was not reliably associated with age, political ideology, gender, race, education level, or parental status. Neither simple models examining age continuously nor a multivariate regression including all covariates explained meaningful variance in perceived moral decline. These findings mirror those of the original paper and reinforce the interpretation that perceptions of moral decline are broadly shared rather than concentrated within specific demographic or ideological groups.
Commentary
The successful replication across different participant recruitment platforms (Amazon Mechanical Turk in the original study and Prolific in the present replication) further supports the robustness of the effect. Differences between these platforms—such as participant experience, attention levels, and demographic composition—do not appear to meaningfully moderate the phenomenon. This consistency suggests that perceived moral decline is not an artifact of a specific sampling method.
More broadly, this paradigm offers a powerful framework for studying beliefs about societal change. Mastroianni has applied similar methods to beliefs about intelligence and racism, finding comparable though sometimes weaker effects. Future work could extend this approach to domains such as religiosity or political polarization. In particular, comparing beliefs about broad societal trends with beliefs about one’s own friends and family — as done in later studies in the original paper — may help disentangle abstract pessimism about “the world” from perceptions grounded in direct personal experience.
Finally, this replication highlights the critical role of open science practices. The availability of the original data, materials, and analysis code made it possible to closely mirror the original study and evaluate its claims transparently. The clarity and organization of these materials significantly lowered the barrier to replication and underscore the value of data and code sharing for cumulative scientific progress. Thank you.