Replication of Study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu & Smith (2007, Psychological Science)
Author
Junyi Hui, Yawen Dong, Allison Park, Pengjia Cui
Published
November 11, 2024
Introduction
In “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics,” Chen Yu and Linda B. Smith address the challenge of word learning in natural environments, where any word could in principle map onto an unbounded number of referents. Previous approaches have focused on how learners constrain this problem using linguistic, social, and representational cues within a single moment or trial. The authors propose an alternative strategy, cross-situational learning, in which learners accumulate statistical information about word-referent pairings across multiple encounters rather than relying on single-trial mapping. The core issue, famously highlighted by Quine (1960), is the indeterminacy of reference in any single instance of language learning: when someone says “gavagai” while pointing at a field, it is unclear what exactly the word refers to. Prior research has shown that children can use constraints to “fast map” words to their referents in a single encounter, but real-world learning environments are usually more ambiguous, with many words and potential referents present at once.
The authors suggest that learners might solve this indeterminacy problem by tracking the co-occurrences of words and referents across multiple learning trials. Though there have been simulations supporting this idea, there has been little empirical research to show whether humans can engage in such cross-situational statistical learning. This gap in understanding, particularly in highly ambiguous environments, is what the authors aim to address in their experiments.
Methods
The original experiment included 38 participants, all of whom were undergraduate students at Indiana University. Participants received either course credit or $7 for their participation.
Power Analysis
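library(pwr)

effect_size <- 1.425  # smallest effect size reported in the original study (4 × 4 condition)
alpha <- 0.05

# Note: pwr.t.test() defaults to type = "two.sample", which is why the output reports n per group
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha, alternative = "greater")
print(result)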
Two-sample t test power calculation
n = 38
d = 1.425
sig.level = 0.05
power = 0.9999967
alternative = greater
NOTE: n is number in *each* group
The effect size entered here, d = 1.425, is the value reported in the original study for the 4 × 4 condition.
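The original comparison against chance is a one-sample, one-tailed t-test, whereas the printed output above is for a two-sample design. As a rough cross-check, the one-sample power can be computed in base R from the noncentral t distribution. This is a minimal sketch under the assumption that the design is one-sample with n = 38 and d = 1.425; the variable names are illustrative only.

```r
# Assumed inputs, taken from the values quoted in this report
d <- 1.425     # effect size reported for the 4 x 4 condition
n <- 38        # planned sample size
alpha <- 0.05  # one-tailed significance level

df <- n - 1
t_crit <- qt(1 - alpha, df)  # critical value for a "greater" alternative
ncp <- d * sqrt(n)           # noncentrality parameter for a one-sample t-test

# Power = P(T > t_crit) when T follows a noncentral t distribution
power_one_sample <- pt(t_crit, df, ncp = ncp, lower.tail = FALSE)
print(power_one_sample)
```

With an effect this large the power should again be very close to 1, so the conclusion about the planned sample size is unlikely to change.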
Planned Sample
Planned sample size is 38.
Materials
The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone. There were 54 unique objects and 54 unique pseudowords partitioned into three sets of 18 words and referents for use in the three conditions. The training trials were generated by randomly pairing each word with one picture; these were the word-referent pairs to be discovered by the learner. The three learning conditions differed in the number of words and referents presented on each training trial.
Participants were then exposed to three learning conditions that differed in the number of words and referents presented per trial:
2 × 2 condition: 2 words and 2 pictures per trial
3 × 3 condition: 3 words and 3 pictures per trial
4 × 4 condition: 4 words and 4 pictures per trial
Each training trial presented the words and pictures together without indicating which picture corresponded to which word. Each word-referent pair appeared six times across trials, so the correct pairings could be recovered only from co-occurrence statistics across trials.
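The trial structure described above can be sketched in R. This is an illustrative reconstruction, not the original authors' stimulus-generation procedure: `make_trials` is a hypothetical helper, and the rule that a pair never appears twice within one trial is an assumption. With 18 pairs shown 6 times each, a condition has 18 × 6 / k trials when k pairs appear per trial.

```r
# Hypothetical helper: builds a training schedule for one condition.
# Each row is a trial; each entry is the index (1..n_pairs) of a word-referent pair.
make_trials <- function(n_pairs = 18, reps = 6, per_trial = 2, max_tries = 1000) {
  stopifnot((n_pairs * reps) %% per_trial == 0)
  for (attempt in seq_len(max_tries)) {
    # Concatenate `reps` independent shuffles so every pair occurs exactly `reps` times
    sequence <- unlist(lapply(seq_len(reps), function(i) sample(n_pairs)))
    trials <- matrix(sequence, ncol = per_trial, byrow = TRUE)
    # Accept the schedule only if no pair is repeated within a single trial
    if (all(apply(trials, 1, function(row) !anyDuplicated(row)))) {
      return(trials)
    }
  }
  stop("no valid schedule found")
}

set.seed(2007)
trials_2x2 <- make_trials(per_trial = 2)
trials_3x3 <- make_trials(per_trial = 3)
trials_4x4 <- make_trials(per_trial = 4)
c(nrow(trials_2x2), nrow(trials_3x3), nrow(trials_4x4))
```

Within a trial, the order of the spoken words would also need to be shuffled independently of the picture positions so that position gives no cue to the pairing; that step is omitted here.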
Procedure
The pictures were presented on a 17-in. computer screen, and the sound was played by the speakers connected to the same computer. Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.
Analysis Plan
The key statistics are the descriptive statistics (mean number of word-referent pairs discovered) and the inferential statistics (one-sample t-tests comparing performance to chance), as reported in the original paper:
“Figure 1 shows that in each condition, subjects learned more word-referent pairs than expected by chance, smallest t(37) = 8.785, p < .001, prep > .99, d = 1.425, one-tailed (4 × 4 condition). They discovered on average more than 16 of the 18 pairs in the 2 × 2 condition and more than 13 of the 18 pairs in the 3 × 3 condition—all this in less than 6 min of training per condition. Even in the 4 × 4 condition, with 16 potential associations per trial, subjects discovered almost 10 of the 18 word-referent pairs.”
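The numbers in this passage can be checked with a little arithmetic. For a one-sample t-test, Cohen's d equals t divided by the square root of n, and chance performance on an 18-item four-alternative test is 18 × 0.25 pairs. The short sketch below uses only values quoted above; the variable names are illustrative.

```r
n <- 38           # participants in the original study
t_value <- 8.785  # smallest reported t (4 x 4 condition)

# Cohen's d for a one-sample t-test: d = t / sqrt(n)
d <- t_value / sqrt(n)
round(d, 3)  # 1.425, matching the reported effect size

# Number of pairs expected by guessing on an 18-item, four-alternative test
chance_pairs <- 18 * 0.25
chance_pairs  # 4.5
```

The reported "almost 10 of the 18" pairs in the 4 × 4 condition is therefore roughly twice the 4.5 pairs expected by guessing.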
Differences from Original Study
The replication will draw on a different participant pool from the original study. The original research tested undergraduate students, whereas the replication will recruit participants from Prolific, an online platform with a more demographically diverse population. This change in sample composition could affect the outcomes. Specifically, we might observe lower learning accuracy and higher variance, because Prolific participants are likely to differ more in cognitive abilities, learning styles, and background knowledge than a relatively homogeneous group of university students. Higher variance would make it harder to replicate the precise findings of the original study.
Methods Addendum (Post Data Collection)
Actual Sample
Sample Size: 38 participants recruited on Prolific, who received coupons or California minimum wage for their participation.
Demographics: Prolific participants with varied backgrounds.
Data Exclusions: No data exclusion criteria were applied.
Differences from pre-data collection methods plan
None.
Results
In the pilot data, mean accuracy was 0.79 (95% CI [0.72, 0.86]), far above the chance level of 0.25 for a four-alternative forced-choice test. A two-sided one-sample t-test against 0.25 gave t(143) = 15.95, p < .001, which is strong evidence against the null hypothesis that participants were responding at chance.
Data preparation
Data preparation follows the analysis plan. For the pilot test, we selected only condition 1 and a sample of 8 participants: https://github.com/ucsd-psych201a/yu2007/blob/837703669871c26bd4e153f0272aa7bc2088548c/coding_test.html
The complete experiment is here: https://github.com/ucsd-psych201a/yu2007/blob/837703669871c26bd4e153f0272aa7bc2088548c/coding.html
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
One Sample t-test
data: correct$correct_numeric
t = 15.95, df = 143, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.7245359 0.8587974
sample estimates:
mean of x
0.7916667
Test of Learning Accuracy: The primary analysis will compare participants’ performance (number of correct word-referent pairings) across the three conditions: 2 × 2, 3 × 3, and 4 × 4.
Descriptive statistics: The mean number of correct word-referent pairings in each condition will be computed.
One-sample t-tests: These will determine whether the number of correct pairings in each condition is significantly greater than chance (25% in a four-alternative forced-choice test).
Repeated-measures ANOVA: This will compare performance across the three learning conditions to assess the effect of within-trial ambiguity on learning accuracy.
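One way these planned tests could be set up in base R is sketched below. The data are simulated purely for illustration and are not the study's data; the column names (`subject`, `condition`, `trial`, `correct`) and the success probabilities are assumptions. The sketch first averages trial-level responses within each participant, so that the t-tests and the ANOVA treat participants, rather than individual trials, as the unit of analysis.

```r
set.seed(201)

# SIMULATED trial-level data (not the study's data): 8 hypothetical participants,
# 3 conditions, 18 test trials per condition
sim <- expand.grid(trial = 1:18,
                   condition = c("2x2", "3x3", "4x4"),
                   subject = factor(1:8))
p_correct <- c("2x2" = 0.9, "3x3" = 0.7, "4x4" = 0.5)  # arbitrary illustration values
sim$correct <- rbinom(nrow(sim), 1, p_correct[as.character(sim$condition)])

# One accuracy score per participant per condition
by_subject <- aggregate(correct ~ subject + condition, data = sim, FUN = mean)

# One-sample, one-tailed t-test against chance (0.25) within each condition
chance_tests <- lapply(split(by_subject$correct, by_subject$condition),
                       function(acc) t.test(acc, mu = 0.25, alternative = "greater"))

# Repeated-measures ANOVA with condition as the within-participant factor
rm_anova <- aov(correct ~ condition + Error(subject / condition), data = by_subject)
summary(rm_anova)
```

Each element of `chance_tests` is a standard `htest` object, so the statistic, degrees of freedom, and p-value for a condition can be read from, for example, `chance_tests[["4x4"]]`.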
Exploring Variance: Because the original study tested undergraduate students and the replication uses a more diverse sample from Prolific, the variance in learning accuracy may be higher in the replication. Variability in performance will be assessed by examining the standard deviations and comparing them across conditions and between the original and replication samples.
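A minimal sketch of the variability summary follows, again on simulated scores rather than real data. The data frame `scores` and its columns are hypothetical; in practice they would be replaced by the per-participant accuracy from the replication sample.

```r
set.seed(59)

# SIMULATED per-participant accuracy scores (illustration only, not study data)
scores <- data.frame(
  subject = factor(rep(1:38, times = 3)),
  condition = rep(c("2x2", "3x3", "4x4"), each = 38),
  accuracy = c(rbinom(38, 18, 0.9), rbinom(38, 18, 0.7), rbinom(38, 18, 0.5)) / 18
)

# Mean and standard deviation of accuracy within each condition
mean_by_condition <- tapply(scores$accuracy, scores$condition, mean)
sd_by_condition <- tapply(scores$accuracy, scores$condition, sd)
print(round(rbind(mean = mean_by_condition, sd = sd_by_condition), 3))
```

The resulting standard deviations could then be set beside those of the original sample, if they are available, to judge whether the Prolific sample is more variable.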
# Calculate mean accuracy for each group (you can adjust these calculations as needed)
# Note: the 2x2 mean is taken over all rows of `correct`, with no group filter
accuracy_2x2 <- mean(correct$correct_numeric, na.rm = TRUE)
accuracy_3x3 <- mean(correct$correct_numeric[correct$group == "3x3"], na.rm = TRUE)
accuracy_4x4 <- mean(correct$correct_numeric[correct$group == "4x4"], na.rm = TRUE)

# Create a data frame with the accuracy values for each group
accuracy_data <- data.frame(
  Group = c("2x2", "3x3", "4x4"),  # The group labels
  Accuracy = c(accuracy_2x2, accuracy_3x3, accuracy_4x4)  # Mean accuracy for each group
)

# Create the bar plot
library(ggplot2)
ggplot(accuracy_data, aes(x = Group, y = Accuracy, fill = Group)) +
  geom_bar(stat = "identity", show.legend = FALSE, width = 0.5) +  # show.legend = FALSE to remove the legend
  labs(title = "Accuracy Bar Plot by Group", y = "Accuracy", x = "") +
  theme_minimal()
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_bar()`).
Exploratory analyses
Discussion
Summary of Replication Attempt
Commentary
Design Overview
Only one factor, consisting of three conditions, was manipulated in this study. We collected two measures: response time and accuracy. A within-participants design was used, with repeated measures taken for each participant. Moving from a within- to a between-participants design could increase the outcome variance, potentially decreasing result consistency. Steps were taken to reduce demand characteristics, such as implementing time constraints and accuracy tests. However, the design may have limitations, including audio clarity issues and limited reaction time, which could present challenges for individuals with hearing difficulties or varying reaction speeds.