I chose this experiment because it examines how prior lexical knowledge influences vocabulary acquisition under incidental learning conditions, which aligns closely with my research interests in language learning and memory integration. Most models of word learning emphasize explicit instruction, yet the majority of real-world vocabulary acquisition occurs incidentally through narrative and social contexts. I am particularly interested in how prior knowledge interacts with consolidation processes to shape long-term lexical representations—a question that sits at the intersection of language acquisition and memory systems research. Replicating this experiment offers an opportunity to examine whether the memory and language mechanisms observed in controlled learning settings extend to more naturalistic, story-based contexts that mirror how vocabulary is acquired in everyday life.
Description of Stimuli, Procedures, and Anticipated Challenges
The experiment presents 15 bisyllabic pseudowords embedded within an illustrated, spoken story (“Trouble at the Intergalactic Zoo”). Each pseudoword belongs to one of three phonological neighborhood conditions:
- No neighbors (e.g., femod)
- One neighbor (e.g., tabric ↔︎ fabric)
- Many neighbors (e.g., dester ↔︎ duster, pester)
Each pseudoword appears five times across the narrative. The story is paired with 15 corresponding cartoon scenes, each containing multiple pseudoword referents to maintain narrative coherence while preventing explicit word–object pairing. This design ensures incidental exposure rather than deliberate memorization.
Participants listen to the 7-minute story while viewing illustrations, then complete three types of memory tests:
- Stem completion – recall of word-forms from initial CV cues
- Form recognition – distinguishing target pseudowords from minimal phonological foils
- Form–picture recognition – mapping pseudowords to their illustrated referents
Each test is administered immediately after learning and the next day to assess consolidation effects.
Challenges in Replication
Stimulus control: Ensuring balanced pseudoword properties (phoneme/letter length, bigram probability, neighborhood frequency) and matching the auditory timing of exposures. (I would still like to work out how to check this.)
Incidental exposure fidelity: Participants must attend to the story without adopting explicit memorization strategies, especially in adult online samples.
Retention and engagement: Maintaining consistent participation across multiple testing sessions, particularly for the delayed tests.
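One way to start checking the stimulus-control point above is to count, for each pseudoword, how many real words differ from it by a single substitution. The sketch below is only illustrative: `lexicon` is a tiny hypothetical word list standing in for a real lexical database, and spelling is used as a stand-in for phonemic transcription, which the original neighborhood manipulation is based on.

```r
# ASSUMPTION: `lexicon` is a hypothetical mini word list, not a real database,
# and letters are used here as a proxy for phonemes.
lexicon <- c("fabric", "duster", "pester", "garden", "window")
pseudowords <- c("femod", "tabric", "dester")

# TRUE when two strings have equal length and differ at exactly one position
is_neighbor <- function(a, b) {
  if (nchar(a) != nchar(b)) return(FALSE)
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]]) == 1
}

# Number of lexicon entries that are one-substitution neighbors of `word`
count_neighbors <- function(word, lexicon) {
  sum(vapply(lexicon, function(w) is_neighbor(word, w), logical(1)))
}

neighbor_counts <- vapply(pseudowords, count_neighbors, integer(1),
                          lexicon = lexicon)

# Letter length and neighbor count side by side, one row per pseudoword
stimulus_check <- data.frame(
  item        = pseudowords,
  n_letters   = nchar(pseudowords),
  n_neighbors = unname(neighbor_counts)
)
print(stimulus_check)
```

With a full pronunciation dictionary in place of the toy list, the same table could be used to confirm that the none/one/many conditions are matched on length.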
Methods
Planned Sample
60 adults
Age 18-35 years old
Native monolingual English speakers residing in the US (will most likely change this to US English speakers residing in the US)
No reported visual, hearing, or literacy difficulties
Had not participated in experiment S1 (not running S1, so not a problem)
Have a working microphone that is compatible with the experiment platform
Prolific participants who have participated in at least ten studies with a minimum 95% approval rate
Must not self-report inappropriate strategy (i.e., writing the words down)
Must complete vocabulary test properly
Materials
The materials comprise the story audio and the illustrations, which I gathered from the materials posted on OSF. I have also obtained the order in which the test items were presented.
Procedure
Two Testing Days (the original study had three):
Session 1 / Day 1: ~20 minutes
Session 2 / Day 2: ~5 minutes
No Session 3, owing to retention and funding constraints
Analysis Plan
Analyses will be conducted in R, using lme4 to fit mixed effects models and ggplot2 for figures. A mixed effects binomial regression model will be used to analyze each of the dependent variables, with fixed effects of session, neighborhood condition, vocabulary ability, and all corresponding interactions. Orthogonal contrasts will be used for each of the factorial predictors. For the fixed effect of session, delay1 will contrast responses before and after opportunities for offline consolidation (T1 vs. T2). For the fixed effect of neighbors, neighb1 will contrast words without versus with neighbors (no vs. one & many), and neighb2 will contrast words with one versus many neighbors. I will attempt to work out how the authors used raw vocabulary scores in their analyses; these scores were scaled and centered before being entered into the model. For each analysis, I will compute a random-intercepts model with all fixed effects and interactions. If there is no indication of a three-way interaction in the model (all ps > .2), these interactions will be pruned to enable a more parsimonious model with better-specified random effects. I will then incorporate random slopes into the model using a forward best-path approach (Barr et al., 2013), progressively adding slopes and retaining only those random effects justified by the data under a liberal alpha criterion (p < .2).
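The contrast scheme described in this plan can be sketched in base R as follows. The specific weights (halves and thirds) are my assumption about one conventional way to code these comparisons, not necessarily the values used by the original authors, and `vocab_raw` holds made-up scores purely for illustration.

```r
# Toy factors standing in for the real predictors
session <- factor(c("Same day", "Next day"), levels = c("Same day", "Next day"))
neighb  <- factor(c("none", "one", "many"), levels = c("none", "one", "many"))

# delay1: same-day test (T1) vs. next-day test (T2)
contrasts(session) <- cbind(delay1 = c(-1/2, 1/2))

# neighb1: no neighbors vs. (one & many); neighb2: one vs. many
# ASSUMPTION: these weights are illustrative, chosen so the columns are orthogonal
contrasts(neighb) <- cbind(neighb1 = c(-2/3, 1/3, 1/3),
                           neighb2 = c(0, -1/2, 1/2))

# Orthogonality check: the off-diagonal cross-product should be zero
print(crossprod(contrasts(neighb)))

# Centering and scaling raw vocabulary scores (hypothetical values)
vocab_raw <- c(28, 35, 41, 30, 38)
vocab_z   <- as.numeric(scale(vocab_raw))  # mean 0, SD 1
```

Assigning the contrasts to the factors before fitting means that a model formula such as `acc ~ session * neighb` picks up the named comparisons automatically.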
Key Analysis of Interest
I am specifically interested in the form recognition analysis. In this test, the participant listens to two words (one that was present in the study and one foil) and decides which of them they heard in the study. I think it will be interesting to see how these results differ after the one-day delay.
Differences from Original Study
I will focus only on Experiment 2, which tested adults (18-35 years of age). I had trouble with a microphone check in my jsPsych script, so I decided to proceed with Gorilla (the platform originally used by the authors), which also seemed a good way to reproduce their procedures exactly.
My participants will be from the US rather than from the UK.
I also applied an additional exclusion criterion: Prolific offered to return submissions from participants who completed the task unrealistically fast (although I am not sure how the software made this determination).
I only conducted two test sessions rather than three. The original study demonstrated that neighborhood effects became more pronounced after longer consolidation intervals, particularly at the one-week delay.
Unfortunately, I had many problems with participant retention because participants had to return the next day to complete the remainder of the study. I would love to continue running the experiment to gain enough power to examine the results further, but given the financial constraints, and because I am only auditing the course, I did not want to spend more funds on this single experiment.
Methods Addendum (Post Data Collection)
In addition to following the methods outlined in the paper, I sent a Prolific message once participants completed Session 1 to inform them that another Prolific task would become available to them 22 hours after they completed Session 1. The next day, I sent the participants a reminder email.
Actual Sample
Number of participants: 17 participants obtained through Prolific
Age: 18-35 years old
Residing in the United States
First Language and Native Language: English (US)
No reported visual, hearing, or literacy difficulties
Have a working microphone that is compatible with the experiment platform
Prolific participants who have participated in at least ten studies with a minimum 95% approval rate
Must not self-report inappropriate strategy (i.e., writing the words down)
Differences from pre-data collection methods plan
The final sample (17 participants) was substantially smaller than the planned sample of 60 and than the sample in the original study, which limits statistical power. This is reflected in frequent singular model fits and the inability to justify random slopes, despite following an identical modeling strategy.
FR$neighb <- fct_relevel(FR$neighb, c("none", "one", "many")) # order for intuitive printing of means
FR %>%
  group_by(session) %>%
  summarise(mean = mean(acc), sd = sd(acc))
# A tibble: 2 × 3
session mean sd
<int> <dbl> <dbl>
1 1 0.706 0.457
2 2 0.767 0.423
# Compute participant means for the first session
# (assumes session has been recoded from 1/2 to "Same day"/"Next day" by this point)
pptMeans <- FR %>%
  select(ID, session, acc) %>%
  filter(session == "Same day") %>%
  group_by(ID) %>%
  summarise(mean = mean(acc, na.rm = TRUE))
# Perform one-sample t-test against chance performance (.5)
t.test(pptMeans$mean, mu = .5, alternative = "greater")
One Sample t-test
data: pptMeans$mean
t = 5.0402, df = 16, p-value = 6.035e-05
alternative hypothesis: true mean is greater than 0.5
95 percent confidence interval:
0.6345659 Inf
sample estimates:
mean of x
0.7058824
Model building
Pruning of fixed effects
# Full model with interaction of session and neighb only, drop the random effect with ~0 variance
fix.full <- glmer(acc ~ session * neighb + (1 | ID),
                  data = FR, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
fix.pars <- glmer(acc ~ session * neighb + (1 | ID) + (1 | item),
                  data = FR, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
anova(fix.pars, fix.full)
# Model without random effects for participants
mod.items <- glmer(acc ~ session * neighb + (1 | item),
                   data = FR, family = binomial,
                   control = glmerControl(optimizer = "bobyqa"))
anova(mod.items, fix.pars)
With my limited data, the participant variance is estimated at approximately zero. This indicates that there is not enough information to estimate participant-level variability reliably. I believe my comparison is structurally identical to the original, but statistically underpowered.
# Model without random effects for items
mod.ppts <- glmer(acc ~ session * neighb + (1 | ID),
                  data = FR, family = binomial,
                  control = glmerControl(optimizer = "bobyqa"))
anova(mod.ppts, fix.pars)
# Model with by-participant slopes for the effect of session, an uncorrelated random slope
mod1 <- glmer(acc ~ session * neighb + (1 | item) + (1 + session || ID),
              data = FR, family = binomial,
              control = glmerControl(optimizer = "bobyqa"))
# Model with by-participant slopes for the effect of neighbor condition
mod2 <- glmer(acc ~ session * neighb + (1 + neighb | ID) + (1 | item),
              data = FR, family = binomial,
              control = glmerControl(optimizer = "bobyqa"))
# Model with by-participant slopes for the effects of neighbor condition and test session
mod2a <- glmer(acc ~ session * neighb + (1 + neighb + session | ID) + (1 | item),
               data = FR, family = binomial,
               control = glmerControl(optimizer = "bobyqa"))
# Model with by-item slopes for the effects of test session
mod3 <- glmer(acc ~ session * neighb + (1 + neighb | ID) + (1 + session | item),
              data = FR, family = binomial,
              control = glmerControl(optimizer = "bobyqa"))
boundary (singular) fit: see help('isSingular')
Final model
The final model I arrived at includes random intercepts for participants and items and no random slopes (fix.pars above).
The figure I was able to create from my data shows mean form-recognition accuracy as a function of test session (Same day vs. Next day) and phonological neighborhood condition (none, one, many). Performance was above chance in both sessions and across all neighborhood conditions. Mean accuracy increased from the same-day test (M = 0.71, SD = 0.46) to the next-day test (M = 0.77, SD = 0.42), consistent with an overall benefit of offline consolidation. Across sessions, words with no phonological neighbors were recognized most accurately (M = 0.80), followed by many-neighbor words (M = 0.76), with one-neighbor words showing the lowest accuracy (M = 0.66).
A one-sample t-test on participant-level means from the same-day test confirmed that performance was significantly above chance (M = 0.71), t(16) = 5.04, p < .001, indicating reliable learning from incidental exposure.
Summary of Replication Attempt
This study sought to replicate the adult form-recognition results of Experiment 2 from James et al. (2021), examining how phonological neighborhood structure influences incidental word learning and its consolidation over time. The replication was partially successful. Participants reliably learned novel word forms from incidental story exposure, and performance was significantly above chance. Descriptively, accuracy improved from the same-day test to the next-day test, consistent with consolidation-related gains.
However, the replication did not fully reproduce the original inferential pattern. In particular, the present data did not support complex random-effects structures or yield strong evidence for interactions between session and neighborhood condition.
Mixed-Effects Modeling
Form-recognition accuracy was analyzed using mixed-effects logistic regression models with fixed effects of test session, phonological neighborhood condition, and their interaction. Orthogonal contrasts were used to compare (i) words with no neighbors versus words with neighbors, and (ii) words with one versus many neighbors.
Random Intercepts
Comparisons between models with and without random intercepts indicated that including random effects for both participants and items significantly improved model fit relative to models omitting either source of variance (both ps < .01). This supports the inclusion of crossed random intercepts for participants and items in subsequent models.
Random Slopes
Following the forward best-path approach used in the original study, random slopes were added incrementally. Adding a by-participant random slope for test session did not improve model fit relative to the intercepts-only model, χ²(3) = 0.01, p = .999, and resulted in singular fits. Similarly, more complex random-slope structures were not supported by the data.
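The retention decision for a single candidate slope can be illustrated with a small base-R sketch. `lrt_keep_slope` is a hypothetical helper written for this report, not an lme4 function; in practice the chi-square statistic and degrees of freedom would come from `anova()` on the two nested models.

```r
# Hypothetical helper: likelihood-ratio p-value and the liberal retention rule
lrt_keep_slope <- function(chisq, df, alpha = 0.2) {
  p <- pchisq(chisq, df = df, lower.tail = FALSE)
  list(p = p, keep = p < alpha)  # keep the slope only if p < alpha
}

# Values reported above for the by-participant slope for test session
session_slope <- lrt_keep_slope(chisq = 0.01, df = 3)
print(session_slope$p)     # very close to 1
print(session_slope$keep)  # FALSE: the slope is not retained
```

Under the p < .2 criterion, a statistic this small gives no grounds for adding the slope, which matches the decision to stay with the intercepts-only structure.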
As a result, the final model included random intercepts for participants and items, but no random slopes. This contrasts with the original study, which supported by-participant random slopes for session effects.
Comparison to Original Findings/Commentary
The overall pattern of results partially replicates the findings of James et al. (2021). As in the original study, participants demonstrated above-chance learning from incidental exposure, and descriptive results suggest improved performance after a delay. However, unlike the original study, the present replication did not support a more complex random-effects structure, nor did it provide sufficient evidence to detect interactions between session and neighborhood condition.
Importantly, the present study included only two test sessions (same day and next day), whereas the original study included a third delayed session one week later. The absence of this longer delay likely reduced sensitivity to consolidation-related changes that were central to the original theoretical claims.
Interpreting Differences from the Original Study
Several methodological differences likely contributed to these discrepancies.
First, the present replication included only two test sessions rather than three. The original study demonstrated that neighborhood effects became more pronounced after longer consolidation intervals, particularly at the one-week delay. By omitting this session, the present study may have attenuated precisely the effects most diagnostic of consolidation-based lexical integration.
Second, the sample size was substantially smaller than in the original study, limiting statistical power. This is reflected in frequent singular model fits and the inability to justify random slopes, despite following an identical modeling strategy. These issues suggest that the data contained insufficient information to reliably estimate participant-level variability in learning trajectories.
Third, although the procedural framework closely followed the original study, including use of the same experimental platform (Gorilla), stimuli, and task structure, the use of a U.S.-based Prolific sample rather than a U.K.-based sample may have introduced additional variability in phonological familiarity or response strategies.
Implications and Conclusions
Despite these limitations, the replication provides converging evidence that adults can incidentally acquire novel word forms from narrative exposure and that performance improves after a short delay. The failure to fully replicate the original random-effects structure and interaction patterns appears most consistent with reduced power and the absence of a longer consolidation interval, rather than a substantive failure of the theoretical account.
Taken together, these findings suggest that phonological neighborhood effects on lexical consolidation are likely robust, but their detectability depends critically on sufficient longitudinal sampling and statistical power. Future replication efforts should prioritize longer retention intervals and larger samples to more fully evaluate consolidation-based predictions.
Exploratory Analyses
Item-level variability in neighborhood effects – The original paper treats neighborhood density categorically (none / one / many), but individual items may differ substantially in phonological similarity structure.
An exploratory item-level analysis could assess whether neighborhood effects are driven by a subset of highly confusable items, rather than reflecting a uniform property of the neighborhood manipulation.
Trial-level learning dynamics within session – Recognition accuracy may change over the course of a session, even within the same test day.
Future analyses could examine trial-level learning dynamics by incorporating trial order as a predictor, allowing assessment of whether neighborhood density modulates within-session learning trajectories.
Individual differences in consolidation across sessions – My current mixed models suggest limited support for random slopes, but with adequate power, meaningful heterogeneity may emerge.
With a larger sample, exploratory analyses could investigate individual differences in consolidation, including variability in session-related gains across participants.
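As a first descriptive step toward the individual-differences idea, per-participant session gains could be computed directly from trial-level accuracy. The sketch below uses a small made-up data frame (`toy`) with the same column names as my data (ID, session, acc); the values are hypothetical and are not results from this study.

```r
# Hypothetical trial-level data: 3 participants x 2 sessions x 4 trials
toy <- data.frame(
  ID      = rep(c("p1", "p2", "p3"), each = 8),
  session = rep(rep(c("Same day", "Next day"), each = 4), times = 3),
  acc     = c(1, 0, 1, 0,  1, 1, 1, 0,   # p1: 0.50 then 0.75
              1, 1, 0, 1,  1, 1, 0, 1,   # p2: 0.75 then 0.75
              0, 0, 1, 0,  1, 0, 1, 1)   # p3: 0.25 then 0.75
)

# Mean accuracy per participant (rows) and session (columns)
means <- tapply(toy$acc, list(toy$ID, toy$session), mean)

# Session-related gain: next-day minus same-day accuracy
gain <- means[, "Next day"] - means[, "Same day"]
print(gain)
```

The spread of `gain` across participants gives a rough picture of the heterogeneity that by-participant random slopes for session would be trying to capture.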