Replication of Figure–Ground Illusion Study by Oishi et al. (2021, Psychological Science)
Author
Peggy Yin (peggyyin@stanford.edu)
Published
December 9, 2025
Introduction
The field of psychology has long studied and defined a good life through two lenses: happiness and meaning. Oishi et al. (2021) proposed that psychological richness—a life filled with variety, interesting experiences, and perspective change (both positive and negative)—is a distinct component of a good life, separate from happiness and meaning. Across global studies, they find that psychological richness is a distinct and desirable aspect of well-being, with many people saying they would choose a psychologically rich life even at the expense of happiness or meaning.
Just as some experiences feel happier or more meaningful than others, some experiences may feel more psychologically rich. The study I chose to replicate sought to determine what makes one experience more psychologically rich than another in the visual domain. In this study, participants are shown four images in sequence. One set of images contains visual illusions; the other contains edited versions of the same images that are no longer illusions. The authors hypothesized that more complex visual stimuli, such as optical illusions, would be more likely to induce an experience of psychological richness than the edited, non-illusion versions.
I administered the experiment as a Qualtrics survey. Half of the participants saw a set of figure-ground illusions (where either the foreground or the background can be focused on, and a different image is produced depending on the focus), and the other half saw the same images, edited so that they were no longer illusions. Participants viewed and described each of the four drawings, then self-reported their moods and their aesthetic evaluations of the images. Psychological richness was measured as the mean of 11 items from the Psychological Richness Questionnaire (interesting, boring[r], intriguing, psychologically rich, complex, fresh, unique, surprised, unusual, typical[r], simple[r]) on a 5-point scale (1–5). Positive affect was measured as the mean of 6 items from the SPANE (Diener et al., 2010): positive, good, pleasant, happy, joyful, and contented, on a 5-point scale (1–5).
Power Analysis
Our sample size provides the statistical power of .85 with the expected effect size of .50.
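As a rough check on this figure, the power of a two-sample t test can be computed with base R's power.t.test. This is only a sketch under assumptions the report does not state: two equal groups of 76 (152 / 2), a two-sided alpha of .05, and the "effect size of .50" read as Cohen's d. If those assumptions match the original calculation, the result should land close to the reported .85.

```r
# Power of an unpaired, two-sided t test.
# Assumptions (not stated in the report): 76 participants per group,
# alpha = .05, and an effect size of .50 interpreted as Cohen's d
# (delta = 0.5 with sd = 1).
pw <- power.t.test(n = 76, delta = 0.5, sd = 1, sig.level = 0.05,
                   type = "two.sample", alternative = "two.sided")
print(pw$power)
```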
Planned Sample
The planned sample size was 152, matching the original study. (N.B.: I requested 150 participants on Prolific but received 153 responses. Based on the timestamps, I believe some participants started the survey and finished it after 150 others had already completed it. This allowed me to reach 152 with one exclusion.)
Materials
The stimuli and the original questionnaire were obtained from Dr. Oishi and are attached in the figures folder: Stimuli. The paradigm is linked here: Paradigm.
Procedure
Following the protocol from the original experiment, I showed participants one set of four images (either the control images or the illusion images). Participants viewed these images one at a time, describing what they saw in them in an open-ended response. After viewing four images, participants filled out the following scales:
“The current moods were measured using the 12-item Scale of Positive and Negative Experience (SPANE; Diener et al., 2010) on a 5-point scale (1 = not at all, 5 = extremely). The positive mood scale consists of positive, good, pleasant, happy, joyful, and contented at this moment (α = .90). The negative mood scale consists of negative, bad, unpleasant, sad, afraid, and angry at this moment (α = .85).
The psychological richness was measured by 11 items on a 5-point scale (1 = not at all, 5 = extremely): “The drawings were very interesting,” “They were very boring (r),” “They were intriguing,” “They were psychologically rich,” “They were complex,” “They were fresh,” “They were unique,” “I was surprised by them,” “They were unusual,” “They were typical (r),” and “They were simple (r)” (α = .88).
We also measured enjoyment, as Silvia and colleagues’ work (e.g., Silvia, 2005) suggests that enjoyment is independent of interest. The enjoyment scale consisted of the four items on a 5-point scale (1 = not at all, 5 = extremely): “I enjoyed them a lot,” “I liked them a lot,” “They were fun,” and “They were pleasing” (α = .88).”
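The scale scores described above are item means, with the items marked (r) reverse-scored first. A minimal sketch of that scoring step for the 11-item richness scale is below. The item names and the toy data are hypothetical (the actual Qualtrics export uses column names such as ratings_3), so this illustrates the arithmetic rather than the script used for this report.

```r
# Hypothetical column names for the 11 richness items.
rich_items    <- c("interesting", "boring", "intriguing", "rich", "complex",
                   "fresh", "unique", "surprised", "unusual", "typical", "simple")
reverse_items <- c("boring", "typical", "simple")  # items marked (r)

score_richness <- function(df, scale_min = 1, scale_max = 5) {
  items <- df[, rich_items, drop = FALSE]
  # Reverse-score: on a 1-5 scale, a response x becomes 6 - x.
  items[reverse_items] <- (scale_min + scale_max) - items[reverse_items]
  # Scale score = mean of the 11 (recoded) items per participant.
  rowMeans(items)
}

# Two made-up respondents: one answers 5 to everything, one answers 1.
toy <- as.data.frame(matrix(c(rep(5, 11), rep(1, 11)), nrow = 2, byrow = TRUE,
                            dimnames = list(NULL, rich_items)))
toy_scores <- score_richness(toy)
print(toy_scores)
```

Because three items are reversed, a respondent who answers 5 to every item does not receive the maximum score, which is the intended effect of reverse-keying.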
Analysis Plan
I plan to analyze differences between the Control and Figure-Ground conditions in psychological richness, enjoyment, positive affect, and negative affect. The key statistical test is an unpaired, two-tailed t test comparing the experimental and control groups. The hypotheses follow the original study: the figure-ground images should be rated as more psychologically rich, but psychologically rich experiences are not necessarily more enjoyable, positive, or negative.
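A sketch of the planned test on simulated data is shown here, together with a pooled-SD Cohen's d for comparison against the expected effect size. The data frame sim, its column names, and the helper cohens_d are all hypothetical and exist only for illustration; the simulated values are not the study's responses.

```r
set.seed(1)  # simulated data only, not the study's responses
sim <- data.frame(
  condition = factor(rep(c("Control", "FigureGround"), each = 76)),
  richness  = c(rnorm(76, mean = 3.0, sd = 0.7),
                rnorm(76, mean = 3.0, sd = 0.7))
)

# Unpaired, two-tailed Student t test (equal variances assumed).
tt <- t.test(richness ~ condition, data = sim,
             var.equal = TRUE, alternative = "two.sided")

# Hypothetical helper: Cohen's d using the pooled standard deviation,
# computed as (second group mean - first group mean) / pooled SD.
cohens_d <- function(x, g) {
  g  <- droplevels(factor(g))
  m  <- tapply(x, g, mean)
  v  <- tapply(x, g, var)
  n  <- tapply(x, g, length)
  sp <- sqrt(sum((n - 1) * v) / (sum(n) - 2))
  unname((m[2] - m[1]) / sp)
}

d <- cohens_d(sim$richness, sim$condition)
print(tt)
print(d)
```

Note that t.test reports the first group minus the second, so with this helper the sign of d is the opposite of the sign of t.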
Differences from Original Study
The participants in the original study were 152 undergraduates at a large university in the U.S. who received partial course credit toward an introductory psychology class. My sample was recruited on Prolific and was limited to people in the U.S. between the ages of 18 and 30. I chose this age range to capture the ages of undergraduate populations, including students who might be slightly older, as at community colleges.
Methods Addendum (Post Data Collection)
Actual Sample
I received 153 responses. Because some Prolific participants appear to have started the study and finished it after the first 150 responses had been collected, I kept the first 152 completed responses and excluded 1 based on this cutoff.
Differences from pre-data collection methods plan
None (checked gender balance; the race/ethnicity balance and age balance for the original study were not reported).
Results
None of my findings were significant: participants in the two groups did not differ in enjoyment (as predicted), but they also did not differ in psychological richness. Three of the four analyses were meant to support the hypothesis that a psychologically rich experience differs from an enjoyable or valenced experience. Without a difference in psychological richness, however, my results do not help to explain what psychological richness is and isn’t.
Side-by-side of psychological richness and enjoyment:
Side-by-side of SPANE:
Exploratory analyses
I wanted to see what the results would be if I used only the item with the highest factor loading in the psychological richness scale:
index_interesting <- which(str_detect(question_row, fixed("interesting", ignore_case = TRUE)))
if (length(index_interesting) == 0) {
  stop("Could not find the question")
}
interesting_col <- cn[index_interesting]
cat("Column for 'The drawings were very interesting':", interesting_col, "\n")
Column for 'The drawings were very interesting': ratings_3
t_test_interesting <- t.test(dat_filtered[[interesting_col]] ~ condition, data = dat_filtered, var.equal = TRUE, alternative = "two.sided")
print(t_test_interesting)
Two Sample t-test
data: dat_filtered[[interesting_col]] by condition
t = -0.95578, df = 150, p-value = 0.3407
alternative hypothesis: true difference in means between group Control and group FigureGround is not equal to 0
95 percent confidence interval:
-0.5246739 0.1825687
sample estimates:
mean in group Control mean in group FigureGround
3.210526 3.381579
I failed to replicate the original result. There was no significant difference in psychological richness between participants viewing the figure-ground illusions and participants viewing the control images, which was the key test. The larger pattern of results was there to contextualize psychological richness, so without this key finding the other results say more about the stimuli than about the construct. Enjoyment did not differ between groups, nor did positive affect. Negative affect showed a slight trend, although it is worth noting that in the original study participants in the figure-ground condition experienced more positive emotions, whereas in my replication they experienced more negative emotions.
Commentary
One explanation for the failure to replicate the original result is that my participants did not give the task the same amount of time and attention as participants in the original study. The authors did not specify how much time participants spent on the task, but my participants averaged around 7 minutes on the whole study, with 10 finishing in under two minutes and 30 finishing in under three minutes. However, in my exploratory analysis, excluding those participants did not change the result.
I also compared how interesting participants in the two groups found the images, as previous work on psychological richness has found that evaluations of interestingness have the highest factor loadings. There was no difference between the two groups on this item. This suggests that participants found both sets of stimuli about equally interesting, which may help explain why one set was not rated as more psychologically rich than the other.
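An exclusion check of this kind can be sketched as follows. The column name duration_sec, the helper rerun_without_fast, and the simulated data are hypothetical. The two- and three-minute cutoffs come from the counts above, but the report does not say which cutoff the exploratory analysis used, so both are shown.

```r
# Hypothetical helper: rerun the key t test after dropping participants
# whose completion time is below a cutoff (in seconds).
rerun_without_fast <- function(df, outcome, min_seconds) {
  kept <- df[df$duration_sec >= min_seconds, , drop = FALSE]
  fit  <- t.test(kept[[outcome]] ~ condition, data = kept,
                 var.equal = TRUE, alternative = "two.sided")
  list(min_seconds = min_seconds, n_kept = nrow(kept), p_value = fit$p.value)
}

set.seed(2)  # simulated data only, not the study's responses
sim_dur <- data.frame(
  condition    = factor(rep(c("Control", "FigureGround"), each = 76)),
  richness     = rnorm(152, mean = 3.0, sd = 0.7),
  duration_sec = round(runif(152, min = 60, max = 900))
)

# No exclusion, then the two-minute and three-minute cutoffs.
sensitivity <- lapply(c(0, 120, 180), function(s) {
  rerun_without_fast(sim_dur, "richness", s)
})
str(sensitivity)
```

Reporting the retained sample size next to each p value makes it easier to judge how much power is lost at each cutoff.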
The effect size in the original study was quite small, which could explain the failure to replicate.
In the future, I would be curious to know whether this experiment would replicate with generative stimuli that more clearly separate illusions from non-illusions, e.g., using the generative illusion methods developed in the Tenenbaum lab. Generative methods would allow more fine-grained control over the complexity of both image sets. This matters because my pilot participants mentioned that they did not really see the illusion in some of the illusion images.