The study “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” by Yu and Smith (2007) investigated how adults learn new words in ambiguous settings. The authors proposed that a cross-situational learning strategy could solve the indeterminacy problem, where multiple possible meanings exist for a new word. This strategy involves keeping track of word-referent pairings across multiple encounters and using statistical probabilities to determine the correct word-referent mappings.
Yu and Smith (2007) hypothesized that learners store potential word-referent pairings across trials, evaluate the statistical evidence, and eventually map words to their correct referents. This cross-situational learning mechanism was tested in an experiment where adults were briefly exposed to trials containing multiple spoken words and pictures of individual objects. No within-trial information was given to link the words and objects. The study found that participants could learn the word-picture mappings through cross-trial statistical relations.
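To make the idea of accumulating evidence across trials concrete, the following is a minimal co-occurrence counting sketch. It is not the model proposed by Yu and Smith (2007); the trial contents, word labels, and object labels are hypothetical placeholders, and the learner simply picks the object that co-occurred most often with each word.

```r
# Hypothetical 2x2 trials: each trial presents two words and two objects,
# with no within-trial cue about which word goes with which object.
trials <- list(
  list(words = c("bosa", "gasser"), objects = c("obj_A", "obj_B")),
  list(words = c("bosa", "manu"),   objects = c("obj_C", "obj_A")),
  list(words = c("gasser", "manu"), objects = c("obj_B", "obj_C"))
)

words   <- sort(unique(unlist(lapply(trials, `[[`, "words"))))
objects <- sort(unique(unlist(lapply(trials, `[[`, "objects"))))

# Word-by-object co-occurrence counts, accumulated over trials
counts <- matrix(0, nrow = length(words), ncol = length(objects),
                 dimnames = list(words, objects))
for (trial in trials) {
  counts[trial$words, trial$objects] <- counts[trial$words, trial$objects] + 1
}

# For each word, guess the object it co-occurred with most often
best_guess <- objects[apply(counts, 1, which.max)]
names(best_guess) <- words
print(counts)
print(best_guess)
```

Each single trial is ambiguous, but the correct pairing is the only one that recurs every time a word appears, so its count ends up highest.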
This replication project aims to reproduce the findings of Yu and Smith (2007) and further explore the role of cross-situational learning in word acquisition. We will be closely following their methodology and analysis to determine the robustness of their findings. Any modifications or deviations from the original study will be explicitly noted in our report.
Factor Manipulated: Within-trial ambiguity through three conditions (2×2, 3×3, and 4×4).
Measures: Accuracy in learning word-referent pairs and response time.
Design: Within-participants design, reducing variance from individual differences.
Confounds: Familiarity/memorization strategies and generalizability due to the use of pseudowords and uncommon objects.
The study reduced demand characteristics by using pseudowords and avoiding explicit cues, requiring participants to rely solely on cross-trial statistical learning. However, testing was limited to adults, which might limit applicability to children or naturalistic learning environments.
Methods
Power Analysis
The original experiment included 38 participants, all undergraduate students from Indiana University. Participants received course credit or $7 for participation.
library(pwr)
effect_size <- 1.425
alpha <- 0.05
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha, alternative = "greater")
print(result)
Two-sample t test power calculation
n = 38
d = 1.425
sig.level = 0.05
power = 0.9999967
alternative = greater
NOTE: n is number in *each* group
Power: 0.9999967, indicating nearly 100% probability of rejecting the null hypothesis if the alternative hypothesis is true.
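The output above is labelled as a two-sample test with n per group, whereas the design is within-participants. As a cross-check, a paired version can be sketched with base R's power.t.test. This assumes that 1.425 can be read as the standardized mean of the paired differences (delta / sd), which the source does not state.

```r
# Paired, one-sided power calculation with base R (stats package).
# Assumption: d = 1.425 is the standardized effect of the paired differences.
paired_power <- power.t.test(
  n = 38,            # number of participants (pairs)
  delta = 1.425,     # mean difference in SD units
  sd = 1,            # SD of the paired differences
  sig.level = 0.05,
  type = "paired",
  alternative = "one.sided"
)
print(paired_power)
```

Under this assumption the conclusion is unchanged: with an effect of this size, 38 participants give power very close to 1.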
Planned Sample
The replication aimed to include 42 participants from Prolific, ensuring high statistical power and maintaining methodological consistency with the original study. All data can be found in final data.zip on GitHub.
Pseudowords: Generated to mimic phonotactic probabilities in English.
Stimuli were divided into three sets of 18 word-referent pairs for each condition (2x2, 3x3, 4x4). Training trials were randomly paired, and testing used a 4-alternative forced-choice design.
The pseudowords were computer-generated to maintain English phonotactic probability. We then created the 3x3 and 4x4 stimuli using Google Voice Generator. The pictures are taken from the NOUN Database (http://www.sussex.ac.uk/wordlab/noun; Horst & Hout, 2016).
Procedure
The procedure will follow Yu & Smith (2007) with minimal modification: “Each trial began with the simultaneous visual presentation of the referents on a computer monitor. The names were then presented auditorily over the computer’s speakers.” As in the original study, no additional cues will be provided to suggest word-object pairings within individual trials, and participants will complete a 4-alternative forced-choice test following each condition.
Analysis Plan
The analysis plan will replicate the original study’s strategy closely. Data cleaning rules include removing trials with response times exceeding three standard deviations from the participant’s mean or below 200 ms. Data exclusions will follow the criteria set forth in the “Exclusions” section of this protocol. We will calculate the mean accuracy for each participant in each condition and conduct a repeated-measures ANOVA to test for differences across the 2x2, 3x3, and 4x4 conditions.
The primary analysis of interest is the repeated-measures ANOVA, testing whether accuracy varies significantly across conditions with different ambiguity levels. Additional exploratory analyses will examine potential learning patterns across trials.
Differences from Original Study
Our study aims to mirror the original as closely as possible. However, differences may arise from using a different sample frame (Prolific participants rather than a university student pool) and potential minor procedural adjustments due to updated software. Based on the literature, these differences are not expected to substantially impact the effect.
Methods Addendum (Post Data Collection)
Actual Sample
42 participants on Prolific received $5 for their participation. One participant was excluded from analysis for failure to complete the Condition 3 test phase.
Differences from pre-data collection methods plan
No substantial deviations from the original methods.
Results
Data preparation
Data preparation will follow the steps outlined in the Analysis Plan to ensure consistent and accurate processing. The following steps will be applied:
Data Cleaning:
Exclude trials with response times that exceed three standard deviations from each participant’s mean RT or are under 200 ms (The criteria may need to be slightly changed). Such trials are flagged as outliers, indicating inattentive or anticipatory responses.
Calculating Mean Accuracy:
Calculate the mean accuracy score for each participant in each condition (2x2, 3x3, and 4x4). These scores will be the dependent variable for ANOVA.
Exclusion of Participants:
Apply participant exclusion criteria as specified in the Exclusions section (The criteria may need to be slightly changed).
Final Dataset Creation:
The cleaned data will include mean accuracy scores per condition for each participant who meets the inclusion criteria. This dataset will be used for the confirmatory analysis.
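The steps above can be sketched in base R as follows. This is an illustration on hypothetical toy data, not the code used for the reported results. The column names respondent, condition, correct, and response_time follow the processing code in this report, and perc is assumed to be the per-participant mean accuracy.

```r
# Toy trial-level data standing in for the combined data frame (hypothetical values)
set.seed(1)
trials <- data.frame(
  respondent = rep(c("p01", "p02"), each = 36),
  condition = rep(rep(c("condition1", "condition2"), each = 18), times = 2),
  correct = sample(c(TRUE, FALSE), 72, replace = TRUE),
  response_time = round(rlnorm(72, meanlog = 7.5, sdlog = 0.4))
)
trials$response_time[c(5, 40)] <- c(120, 60000)  # one anticipatory, one very slow response

# Data cleaning: per-participant mean and SD of response time
rt_mean <- ave(trials$response_time, trials$respondent, FUN = mean)
rt_sd   <- ave(trials$response_time, trials$respondent, FUN = sd)

# Keep trials at or above 200 ms and within 3 SD of the participant's mean
keep <- trials$response_time >= 200 &
  abs(trials$response_time - rt_mean) <= 3 * rt_sd
cleaned <- trials[keep, ]

# Mean accuracy per participant and condition (dependent variable for the ANOVA)
participant_data <- aggregate(correct ~ respondent + condition, data = cleaned, FUN = mean)
names(participant_data)[3] <- "perc"
print(participant_data)
```

Computing the mean and SD per participant, rather than over the whole sample, keeps a generally slow participant from having all of their trials flagged as outliers.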
Data cleaning
library(dplyr)
library(ggplot2)  # ggplot() is used for the plots in the Results section
library(effsize)
library(pwr)

# Function to process files (improved)
process_files <- function(file_list, condition_label) {
  df_list <- list()
  for (i in seq_along(file_list)) {
    df <- read.csv(file_list[i])
    df <- df[, (ncol(df) - 4):ncol(df)]            # Select the last 5 columns
    df <- na.omit(df)                              # Remove rows with NA values
    df$respondent <- substr(file_list[i], 23, 32)  # Add a respondent ID
    df$condition <- condition_label                # Add a condition label
    df$correct <- tolower(df$correct) == "true"    # Convert to logical HERE!
    df_list[[i]] <- df
  }
  bind_rows(df_list)
}

# Get file lists
file_list_c1 <- list.files("final data/condition1/", pattern = "\\.csv$", full.names = TRUE)
file_list_c2 <- list.files("final data/condition2/", pattern = "\\.csv$", full.names = TRUE)
file_list_c3 <- list.files("final data/condition3/", pattern = "\\.csv$", full.names = TRUE)

# Process each condition's data
data_c1 <- process_files(file_list_c1, "condition1")
data_c2 <- process_files(file_list_c2, "condition2")
data_c3 <- process_files(file_list_c3, "condition3")

# Combine all data (now works correctly)
data <- bind_rows(data_c1, data_c2, data_c3)

# Now data$correct is consistently logical
str(data$correct)  # Check the structure
We will perform a repeated-measures ANOVA to assess differences in accuracy across the three conditions: 2x2, 3x3, and 4x4. This test will evaluate whether the degree of within-trial ambiguity significantly affects participants’ accuracy in learning word-referent pairings.
Compared to the original study, the error bars for Conditions 1 and 2 are larger in our experiment, indicating greater variability in participant performance. In the original study, the error bars are smaller, reflecting more consistent accuracy among participants across all conditions.
condition_accuracy <- data %>%
  group_by(condition) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')
condition_accuracy
Compared with the original experiment, our accuracy data show the same ordering of conditions but lower performance in every condition.
In the original study, participants discovered, on average, more than 16 of the 18 pairs in the 2x2 condition, corresponding to an accuracy above 88.9%, while our participants achieved an average accuracy of 68.1%.
For the 3x3 condition, the original study reported participants discovering more than 13 of the 18 pairs, or an accuracy above 72.2%, while our participants averaged 55.5%.
In the 4x4 condition, the original study indicated participants discovered nearly 10 pairs, or an accuracy of approximately 55.6%, while our participants averaged 39.0%.
Both our study and the original experiment demonstrate a pattern of declining performance as ambiguity increases across conditions.
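The comparison above can be reproduced with a few lines of arithmetic. The original values are the approximate pair counts quoted above ("more than 16", "more than 13", "nearly 10"), so the resulting gaps are rough estimates rather than exact differences. The chance level of 1/4 follows from the 4-alternative forced-choice test.

```r
pairs_total <- 18

# Approximate number of pairs learned in the original study, as quoted above
original_pairs <- c("2x2" = 16, "3x3" = 13, "4x4" = 10)
original_accuracy <- original_pairs / pairs_total

# Mean accuracy observed in this replication
replication_accuracy <- c("2x2" = 0.681, "3x3" = 0.555, "4x4" = 0.390)

# Chance performance in a 4-alternative forced-choice test
chance <- 1 / 4

comparison <- data.frame(
  original = round(original_accuracy, 3),
  replication = replication_accuracy,
  gap = round(original_accuracy - replication_accuracy, 3),
  above_chance = replication_accuracy > chance
)
print(comparison)
```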
Image-Specific Accuracy
img_acc_c1 <- data_c1 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))
ggplot(img_acc_c1, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey", linewidth = 1) +
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition1")

img_acc_c2 <- data_c2 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))
ggplot(img_acc_c2, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey", linewidth = 1) +
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition2")

img_acc_c3 <- data_c3 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))
ggplot(img_acc_c3, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey", linewidth = 1) +
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition3")
The plots present accuracy by image for each condition, showing how participants performed across the 18 images in the 2x2, 3x3, and 4x4 conditions. In Condition 1, accuracy remains relatively stable across all images, with minimal variability. Condition 2 and Condition 3 show slightly greater variability but no extreme deviations.
             Df Sum Sq Mean Sq F value   Pr(>F)
condition     2  1.367  0.6834   10.71 5.78e-05 ***
Residuals   106  6.761  0.0638
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our ANOVA results (F(2, 106) = 10.71, p = 5.78e-05) revealed a statistically significant effect of condition on accuracy, indicating that participants’ performance varied across the three conditions. This finding aligns with the original study, which also reported a decline in accuracy as task complexity increased from the 2x2 to the 4x4 condition.
This provides evidence that, as in the original study, ambiguity plays an important role in participants’ ability to learn word-referent pairs.
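The residual degrees of freedom in the table above (106) are what a one-way model that treats the conditions as independent groups would produce. A repeated-measures version, as described in the analysis plan, could be sketched as follows. The data here are simulated and purely illustrative. The names participant_data, respondent, condition, and perc follow this report, and the simulated means and standard deviations are assumptions.

```r
# Simulated balanced data: 41 participants, each contributing all three conditions
set.seed(2007)
n_participants <- 41
participant_data <- expand.grid(
  respondent = factor(sprintf("p%02d", seq_len(n_participants))),
  condition = factor(c("condition1", "condition2", "condition3"))
)

# Hypothetical generating values, loosely based on the observed condition means
participant_effect <- rnorm(n_participants, sd = 0.15)
condition_mean <- c(condition1 = 0.68, condition2 = 0.55, condition3 = 0.39)
participant_data$perc <- condition_mean[as.character(participant_data$condition)] +
  participant_effect[as.integer(participant_data$respondent)] +
  rnorm(nrow(participant_data), sd = 0.10)
participant_data$perc <- pmin(pmax(participant_data$perc, 0), 1)  # keep within [0, 1]

# Repeated-measures ANOVA: condition is tested within participants
rm_anova <- aov(perc ~ condition + Error(respondent / condition), data = participant_data)
rm_summary <- summary(rm_anova)
print(rm_summary)
```

The Error() term tests condition against within-participant variation, so stable differences between participants no longer count as noise. Note that aov with an Error() term expects balanced data, so participants missing a condition would need to be excluded or handled with a mixed model.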
Post-hoc
tukey <- TukeyHSD(anova)
print(tukey)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = perc ~ condition, data = participant_data)
$condition
                            diff        lwr         upr     p adj
condition2-condition1 -0.1356519 -0.2844401  0.01313628 0.0816094
condition3-condition1 -0.2860584 -0.4348466 -0.13727022 0.0000392
condition3-condition2 -0.1504065 -0.2829965 -0.01781646 0.0220384
The post-hoc analysis highlights statistically significant differences in accuracy across the conditions, particularly between Condition 3 and the other two conditions. This is consistent with the findings of the original study, where performance declined as ambiguity increased.
In detail, the difference between Condition 1 and Condition 2 did not reach significance at the .05 level (adjusted p = .082). Participants may have found the 3x3 condition somewhat more challenging than the 2x2 condition, but this effect is less pronounced than the drop in accuracy observed in the 4x4 condition.
Exploratory analyses
Response Time by Correctness
data_combined <- bind_rows(
  mutate(data_c1, condition = "Condition 1"),
  mutate(data_c2, condition = "Condition 2"),
  mutate(data_c3, condition = "Condition 3")
)

# Summarize response times by condition and correctness (optional, for printing)
response_time_summary <- data_combined %>%
  group_by(condition, correct) %>%
  summarize(
    mean_response_time = mean(response_time, na.rm = TRUE),
    median_response_time = median(response_time, na.rm = TRUE),
    sd_response_time = sd(response_time, na.rm = TRUE),
    .groups = "drop"
  )
print(response_time_summary)
# Create the faceted boxplot
ggplot(data_combined, aes(x = factor(correct, labels = c("Incorrect", "Correct")), y = response_time)) +
  geom_boxplot(fill = "skyblue3", alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) +  # Increased point size
  scale_y_log10() +
  facet_wrap(~ condition) +  # This creates the separate panels
  labs(
    title = "Response Time by Correctness Across Conditions",
    x = "Response Accuracy",
    y = "Response Time (log scale)"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(size = 12))  # Adjust the size of facet labels
We use Bartlett’s test to check whether the variance of response times is homogeneous across the three conditions.
bartlett.test(data$response_time~data$condition)
Bartlett test of homogeneity of variances
data: data$response_time by data$condition
Bartlett's K-squared = 401.64, df = 2, p-value < 2.2e-16
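The test is significant (p < 2.2e-16), indicating that the variance of response times differs across conditions. Note that this test compares variances, not mean response times. We now visualize the results over images.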
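Bartlett's test is known to be sensitive to departures from normality, and response times are typically right-skewed. One possible cross-check, sketched here on simulated data, is to run a rank-based test such as fligner.test on log-transformed response times. The data frame sim, its column log_rt, and the simulated parameters are hypothetical and are not part of the reported analysis.

```r
# Simulated right-skewed response times with different spread per condition
set.seed(42)
sim <- data.frame(
  condition = factor(rep(c("condition1", "condition2", "condition3"), each = 300)),
  response_time = c(
    rlnorm(300, meanlog = 7.4, sdlog = 0.3),
    rlnorm(300, meanlog = 7.6, sdlog = 0.5),
    rlnorm(300, meanlog = 7.8, sdlog = 0.7)
  )
)
sim$log_rt <- log(sim$response_time)

# Bartlett's test on raw response times (assumes normality within groups)
bartlett_raw <- bartlett.test(response_time ~ condition, data = sim)

# Fligner-Killeen test on log response times (rank-based, more robust to skew)
fligner_log <- fligner.test(log_rt ~ condition, data = sim)

print(bartlett_raw)
print(fligner_log)
```

If both tests agree, the conclusion about unequal variances does not depend on the normality assumption. If they disagree, the skew of the raw response times is a likely reason.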
Response Time by Images
response_time_by_image <- data %>%
  group_by(condition, correct_image) %>%
  summarise(
    Mean_ResponseTime = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )
ggplot(response_time_by_image, aes(x = correct_image, y = Mean_ResponseTime, group = condition, color = condition)) +
  geom_line(linewidth = 1) +  # linewidth replaces the size argument deprecated for lines in ggplot2 3.4.0
  geom_point(size = 2) +
  labs(
    title = "Response Times by Image for Each Condition",
    x = "Image",
    y = "Mean Response Time (ms)"
  ) +
  scale_x_continuous(breaks = unique(response_time_by_image$correct_image)) +
  facet_wrap(~condition, scales = "free_x", ncol = 1) +
  theme_minimal()
In Condition 1, response times stabilize quickly after a drop from Image 1 to Image 2, showing minimal variability. Condition 2 shows a pronounced initial drop but remains stable afterwards. Condition 3 has relatively stable response times and a smaller initial drop. The initial drop across all conditions might suggest that participants spend time adapting to the test phase at the beginning.
Discussion
Summary of Replication Attempt
Our confirmatory analysis revealed that participants’ performance significantly declined as task ambiguity increased across the three conditions, consistent with the original study’s findings. Accuracy was highest in the 2x2 condition, lower in the 3x3 condition, and lowest in the 4x4 condition, with all conditions exceeding chance levels. While the general pattern replicated the original study, the levels of accuracy in our study were lower across all conditions, and variability was higher, particularly in the more complex conditions. These results suggest a partial replication of the original findings, capturing the overall trend but with differences in the strength and consistency of participant performance.
Commentary
Our exploratory analyses showed that response times increased with task complexity, particularly in the 3x3 condition. Interestingly, a small number of participants in the 4x4 condition still achieved relatively high accuracy within a short response time, suggesting that they may have used more effective strategies to resolve high ambiguity.
While the overall trend of declining accuracy with increasing task complexity replicated the original findings, the lower accuracy and higher variability in our study suggest the presence of moderating factors. Differences in participant demographics (university students vs. Prolific users), experimental settings, or task delivery (e.g., instructions, timing) could potentially influence the test results.
The main challenge in interpreting the results lies in the high variability observed in our study, which could be attributed to uncontrolled external factors. Another important aspect not tested in our replication is the role of foil probability, that is, how often incorrect choices were presented. We did not examine this factor because of limited time, which may have affected participants’ ability to differentiate correct from incorrect word pairings. Future studies may attempt to take better control of external factors and take foil probability into account.