Replication of "Rapid Word Learning Under Uncertainty via Cross-Situational Statistics" by Yu & Smith (2007, Psychological Science)
Author
Junyi Hui
Published
December 10, 2024
Introduction
“Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” by Chen Yu and Linda B. Smith (2007) addresses the challenge of word learning in natural environments, where there are countless possible word-referent pairings. Previous approaches have focused on how learners constrain this problem using linguistic, social, and representational cues within a single moment or trial. The authors propose an alternative strategy: cross-situational learning, in which learners accumulate statistical information about word-referent pairings across multiple encounters rather than relying on single-trial mapping. The core issue, famously highlighted by Quine (1960), is the indeterminacy of reference in any given instance of language learning: when someone says “gavagai” while pointing at a field, it is unclear what exactly the word refers to. Traditional research has shown that children can use constraints to “fast map” words to their referents in a single encounter, but real-world learning environments are usually more ambiguous, with many words and many potential referents present at once.
The authors suggest that learners might solve this indeterminacy problem by tracking the co-occurrences of words and referents across multiple learning trials. Though there have been simulations supporting this idea, there has been little empirical research to show whether humans can engage in such cross-situational statistical learning. This gap in understanding, particularly in highly ambiguous environments, is what the authors aim to address in their experiments.
Methods
The original experiment included 38 participants, all of whom were undergraduate students at Indiana University. Participants received either course credit or $7 for their participation.
Power Analysis
library(pwr)
library(ggplot2)

effect_size <- 1.425
alpha <- 0.05

result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha, alternative = "greater")
print(result)
Two-sample t test power calculation
n = 38
d = 1.425
sig.level = 0.05
power = 0.9999967
alternative = greater
NOTE: n is number in *each* group
The effect size (d = 1.425) is taken from the original paper. With n = 38, the computed power is essentially 1; note that pwr.t.test defaults to a two-sample test, but power for the original within-subjects comparison against chance would likewise be near 1.
Planned Sample
The planned sample size is 38, matching the original study.
Materials
The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone. There were 54 unique objects and 54 unique pseudowords partitioned into three sets of 18 words and referents for use in the three conditions. The training trials were generated by randomly pairing each word with one picture; these were the word-referent pairs to be discovered by the learner. The three learning conditions differed in the number of words and referents presented on each training trial.
Participants were exposed to three learning conditions that differed in the number of words and referents presented per trial:
2 × 2 condition: 2 words and 2 pictures
3 × 3 condition: 3 words and 3 pictures
4 × 4 condition: 4 words and 4 pictures
Each training trial presented the words and pictures together, without indicating which picture corresponded to which word. Each word-referent pair was repeated six times across training trials, giving participants repeated exposure to the co-occurrence statistics.
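The original trial-generation script is not available here, but a minimal sketch of how such randomized training trials could be constructed is shown below. Object names are illustrative, not the authors' code, and a real script would also prevent the same pair from appearing twice within a trial.

# Illustrative sketch of trial construction for one condition (e.g., 4 x 4).
set.seed(1)

n_pairs   <- 18   # word-referent pairs per condition
per_trial <- 4    # words/pictures shown together on each trial
reps      <- 6    # each pair occurs 6 times across training

words     <- paste0("word", 1:n_pairs)
referents <- paste0("object", sample(1:n_pairs))   # random word-referent assignment
pairs     <- data.frame(word = words, referent = referents)

# Shuffle 6 repetitions of each pair into trials of `per_trial` pairs each
# (a simple shuffle; the real design also avoids repeating a pair within a trial).
pair_stream <- sample(rep(seq_len(n_pairs), reps))
n_trials    <- length(pair_stream) / per_trial     # 27 trials in the 4 x 4 condition
trials      <- split(pair_stream, rep(seq_len(n_trials), each = per_trial))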
Procedure
The pictures were presented on a 17-in. computer screen, and the sound was played through speakers connected to the same computer. Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.
Analysis Plan
The key results to replicate are the descriptive statistics (mean number of word-referent pairs discovered in each condition) and the inferential statistics (t-tests comparing performance to chance). From the original paper:
“Figure 1 shows that in each condition, subjects learned more word-referent pairs than expected by chance, smallest t(37) = 8.785, p < .001, prep > .99, d = 1.425, one-tailed (4 × 4 condition). They discovered on average more than 16 of the 18 pairs in the 2 × 2 condition and more than 13 of the 18 pairs in the 3 × 3 condition—all this in less than 6 min of training per condition. Even in the 4 × 4 condition, with 16 potential associations per trial, subjects discovered almost 10 of the 18 word-referent pairs.”
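As a quick consistency check, the reported effect size can be recovered from the smallest t statistic: for a within-subject comparison against chance, d = t / sqrt(n).

# Recover Cohen's d from the reported t and n (4 x 4 condition, original paper)
t_val <- 8.785
n     <- 38
t_val / sqrt(n)   # ~1.425, matching the d used in the power analysis above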
Differences from Original Study
This replication will use a different participant pool than the original study. Whereas the original research tested undergraduate students, the replication will recruit participants from Prolific, an online platform that draws from a more diverse population. This shift in sample composition could affect the outcomes: we might observe lower learning accuracy and greater variability, because the broader demographic range of Prolific workers may introduce more variation in cognitive abilities, learning styles, and background knowledge than a relatively homogeneous group of university students. The resulting increase in variance could make it more difficult to replicate the precise findings of the original study.
Methods Addendum (Post Data Collection)
Actual Sample
Sample size: 38 workers on Prolific, who received either coupons or California minimum wage for their participation. Demographics: Prolific participants from varied backgrounds. Data exclusions: there were no data exclusion criteria.
Differences from pre-data collection methods plan
None
Results
The very large t statistic and extremely small p-value provide strong evidence against the null hypothesis that accuracy is at the chance level of 0.25: the observed mean accuracy of 0.7917 is significantly above chance.
Data preparation
Pilot A
Pilot A Sample
The Pilot A sample size is 8, with only one condition.
Data Cleaning:
Trials with response times more than three standard deviations from a participant's mean RT, or under 200 ms, are excluded (these criteria may be adjusted slightly). Such trials are flagged as outliers indicating inattentive or anticipatory responses.
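The exclusion step itself is not echoed in the rendered output below. A minimal sketch of how it could be implemented, assuming a raw trial-level data frame raw_data with columns subject and response_time (both names are assumptions, not the actual pilot file structure):

library(dplyr)

# Sketch of the RT-based exclusion rule described above (illustrative only)
filtered_data <- raw_data %>%
  group_by(subject) %>%
  mutate(rt_mean = mean(response_time, na.rm = TRUE),
         rt_sd   = sd(response_time, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(response_time >= 200,                       # drop anticipatory responses
         abs(response_time - rt_mean) <= 3 * rt_sd)  # drop inattentive outliers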
library(dplyr)
One Sample t-test
data: filtered_data$correct_numeric
t = 15.95, df = 143, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.7245359 0.8587974
sample estimates:
mean of x
0.7916667
Test of Learning Accuracy: The primary analysis will compare the participants’ performance (number of correct word-referent pairings) across the three conditions: 2 × 2, 3 × 3, and 4 × 4 trials.
Descriptive statistics: mean correct word-referent pairings in each condition will be computed.
One-sample t-tests: will be used to determine whether the number of correct pairings is significantly greater than chance (25% in a four-alternative forced-choice test).
Repeated-measures ANOVA: will be conducted to compare performance across the three learning conditions to assess the effect of within-trial ambiguity on learning accuracy.
Exploring variance: Given that the original study used undergraduate students and the replication uses a more diverse sample from Prolific, the variance in learning accuracy might be higher in the replication. Variability in performance will be assessed by examining standard deviations and comparing them across conditions and between the original and replication samples (see the sketch below).
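No chunk in this report computes those standard deviations directly. A minimal sketch, assuming a long-format data frame test_data with one row per test trial and columns participant, condition, and correct_numeric (0/1); all three names are assumptions:

library(dplyr)

# Per-participant accuracy, then its spread within each condition
test_data %>%
  group_by(participant, condition) %>%
  summarize(acc = mean(correct_numeric), .groups = "drop") %>%
  group_by(condition) %>%
  summarize(mean_acc = mean(acc),
            sd_acc   = sd(acc),   # compare against the original study's SDs
            n        = n())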
accuracy <- mean(filtered_data$correct_numeric, na.rm = TRUE)
accuracy_all <- data.frame(Group = "Overall", Accuracy = accuracy)

ggplot(accuracy_all, aes(x = Group, y = Accuracy, fill = Group)) +
  geom_bar(stat = "identity", show.legend = FALSE, width = 0.5) +  # show.legend = FALSE hides the legend
  labs(title = "Accuracy for overall Group", y = "Accuracy", x = "") +
  theme_minimal()
ggplot(filtered_data, aes(x = factor(correct, labels = c("Incorrect", "Correct")), y = response_time)) +
  geom_boxplot(fill = "skyblue3", alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  labs(title = "Response Time by Correctness",
       x = "Response Accuracy",
       y = "Response Time") +
  theme_minimal()
Pilot B
Pilot B Sample
The Pilot B sample size is 5, with all three conditions. Participants took about 21 minutes on average to complete Pilot B.
Data Cleaning:
Trials with response times more than three standard deviations from a participant's mean RT, or under 200 ms, are excluded (these criteria may be adjusted slightly), the same rule as in Pilot A. Such trials are flagged as outliers indicating inattentive or anticipatory responses.
ggplot(long_data, aes(x = factor(correct, labels = c("Incorrect", "Correct")),
                      y = response_time,
                      fill = condition)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "black") +
  labs(title = "Response Time by Correctness and Condition",
       x = "Response Accuracy",
       y = "Response Time (ms)") +
  facet_wrap(~ condition) +   # separate panel for each condition
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 14, face = "bold")) +
  scale_fill_manual(values = c("#FFB5E8", "#B5EAD7", "#FFDAC1")) +
  scale_y_continuous(limits = c(0, 10000), breaks = seq(0, 10000, by = 2000))
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_summary()`).
Chance Performance
chance_level <- 0.25

# One-sample t-tests against chance for each condition
t_condition1 <- t.test(combined_data_B$correct_numeric_condition1, mu = chance_level, alternative = "two.sided", na.rm = TRUE)
t_condition2 <- t.test(combined_data_B$correct_numeric_condition2, mu = chance_level, alternative = "two.sided", na.rm = TRUE)
t_condition3 <- t.test(combined_data_B$correct_numeric_condition3, mu = chance_level, alternative = "two.sided", na.rm = TRUE)

# Print t-test results
print("Condition 1:")
[1] "Condition 1:"
print(t_condition1)
One Sample t-test
data: combined_data_B$correct_numeric_condition1
t = 10.311, df = 89, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.6401940 0.8264727
sample estimates:
mean of x
0.7333333
print("Condition 2:")
[1] "Condition 2:"
print(t_condition2)
One Sample t-test
data: combined_data_B$correct_numeric_condition2
t = 9.9462, df = 89, p-value = 4.135e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.6278852 0.8165593
sample estimates:
mean of x
0.7222222
print("Condition 3:")
[1] "Condition 3:"
print(t_condition3)
One Sample t-test
data: combined_data_B$correct_numeric_condition3
t = 5.8011, df = 89, p-value = 9.924e-08
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.4508980 0.6602131
sample estimates:
mean of x
0.5555556
Effect of Condition (ANOVA)
library(tidyr)

long_data <- combined_data_B %>%
  pivot_longer(cols = c("correct_numeric_condition1", "correct_numeric_condition2", "correct_numeric_condition3"),  # the correctness columns
               names_to = "condition",     # new column for condition names
               values_to = "correctness")  # new column for correctness values

anova_result <- aov(correctness ~ condition, data = long_data)
summary(anova_result)
Df Sum Sq Mean Sq F value Pr(>F)
condition 2 1.79 0.8926 4.118 0.0173 *
Residuals 267 57.88 0.2168
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The final sample size is 41, with all three conditions. Participants took about 21 minutes on average to complete the final experiment. The complete experiment is available here:
https://ucsd-psych201a.github.io/yu2007/123_234.html
https://ucsd-psych201a.github.io/yu2007/132_243.html
https://ucsd-psych201a.github.io/yu2007/213_324.html
https://ucsd-psych201a.github.io/yu2007/231_342.html
https://ucsd-psych201a.github.io/yu2007/321_432.html
https://ucsd-psych201a.github.io/yu2007/312_423.html
Data Cleaning:
Exclude participants who did not complete all three conditions (a sketch of one way to implement this appears after the data-processing chunk below).
library(dplyr)
library(purrr)

# Paths to the data
path1 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition1"
path2 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition2"
path3 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition3"

# List of files for each condition
list1 <- list.files(path = path1, pattern = "*.csv", full.names = TRUE)
list2 <- list.files(path = path2, pattern = "*.csv", full.names = TRUE)
list3 <- list.files(path = path3, pattern = "*.csv", full.names = TRUE)

# Function to clean and process each dataset
process_data <- function(file_list, condition) {
  file_list %>%
    lapply(read.csv) %>%
    lapply(function(df) {
      df <- df[, c("correct", "response_time")]         # select only the required columns
      df$correct <- as.logical(df$correct)              # ensure "correct" is logical (TRUE/FALSE)
      df$response_time <- as.numeric(df$response_time)  # ensure "response_time" is numeric
      na.omit(df)                                       # remove rows with NA values
    }) %>%
    bind_rows() %>%
    rename_with(~ paste0(., "_", condition))            # add condition-specific suffix to column names
}

# Process each condition
final_data1 <- process_data(list1, "condition1")
final_data2 <- process_data(list2, "condition2")
final_data3 <- process_data(list3, "condition3")

# Filter and transform each dataset
filtered_data1 <- final_data1 %>%
  filter(!is.na(correct_condition1) & !is.na(response_time_condition1)) %>%
  mutate(correct_numeric_condition1 = ifelse(correct_condition1, 1, 0))  # convert logical to numeric (1/0)

filtered_data2 <- final_data2 %>%
  filter(!is.na(correct_condition2) & !is.na(response_time_condition2)) %>%
  mutate(correct_numeric_condition2 = ifelse(correct_condition2, 1, 0))

filtered_data3 <- final_data3 %>%
  filter(!is.na(correct_condition3) & !is.na(response_time_condition3)) %>%
  mutate(correct_numeric_condition3 = ifelse(correct_condition3, 1, 0))

# Combine all filtered data frames into one
# Using bind_rows() instead of bind_cols() to handle row mismatches
final_data <- bind_rows(
  filtered_data1 %>% mutate(condition = "condition1"),
  filtered_data2 %>% mutate(condition = "condition2"),
  filtered_data3 %>% mutate(condition = "condition3")
)

# View the head of the final data frame
head(final_data)
correct_condition1 response_time_condition1 correct_numeric_condition1
1 TRUE 4313 1
2 FALSE 3969 0
3 FALSE 3782 0
4 FALSE 1746 0
5 FALSE 2075 0
6 FALSE 8708 0
condition correct_condition2 response_time_condition2
1 condition1 NA NA
2 condition1 NA NA
3 condition1 NA NA
4 condition1 NA NA
5 condition1 NA NA
6 condition1 NA NA
correct_numeric_condition2 correct_condition3 response_time_condition3
1 NA NA NA
2 NA NA NA
3 NA NA NA
4 NA NA NA
5 NA NA NA
6 NA NA NA
correct_numeric_condition3
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
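The stated exclusion of participants who did not complete all three conditions is not visible in the chunk above. One way to implement it before reading the files, assuming each participant's CSV carries the same ID-based filename in all three condition folders (an assumption about the naming scheme, not something shown in this report):

# Keep only file lists for participants present in all three condition folders
ids1 <- tools::file_path_sans_ext(basename(list1))
ids2 <- tools::file_path_sans_ext(basename(list2))
ids3 <- tools::file_path_sans_ext(basename(list3))

complete_ids <- Reduce(intersect, list(ids1, ids2, ids3))

list1 <- list1[ids1 %in% complete_ids]
list2 <- list2[ids2 %in% complete_ids]
list3 <- list3[ids3 %in% complete_ids]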
[Figure: original research plot]
Chance Performance
chance_level <- 0.25

# One-sample t-tests against chance for each condition
t_condition1 <- t.test(filtered_data1$correct_numeric_condition1, mu = chance_level, alternative = "two.sided", na.rm = TRUE)
t_condition2 <- t.test(filtered_data2$correct_numeric_condition2, mu = chance_level, alternative = "two.sided", na.rm = TRUE)
t_condition3 <- t.test(filtered_data3$correct_numeric_condition3, mu = chance_level, alternative = "two.sided", na.rm = TRUE)

# Print t-test results
print("Condition 1:")
[1] "Condition 1:"
print(t_condition1)
One Sample t-test
data: filtered_data1$correct_numeric_condition1
t = 24.935, df = 737, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.6450969 0.7126266
sample estimates:
mean of x
0.6788618
print("Condition 2:")
[1] "Condition 2:"
print(t_condition2)
One Sample t-test
data: filtered_data2$correct_numeric_condition2
t = 15.988, df = 737, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.5073392 0.5793817
sample estimates:
mean of x
0.5433604
print("Condition 3:")
[1] "Condition 3:"
print(t_condition3)
One Sample t-test
data: filtered_data3$correct_numeric_condition3
t = 7.946, df = 737, p-value = 7.215e-15
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
0.3576348 0.4282730
sample estimates:
mean of x
0.3929539
Effect of Condition (ANOVA)
data_long <- final_data %>%
  pivot_longer(cols = starts_with("correct_numeric_condition"),  # all condition columns
               values_to = "score",      # new column for scores
               values_drop_na = TRUE)    # drop rows with NA values

data_long <- data_long %>%
  group_by(name) %>%
  mutate(id = ceiling(row_number() / 18)) %>%  # 18 test trials per participant and condition
  ungroup()

aggregated_data <- data_long %>%
  group_by(id, condition) %>%
  summarize(mean_response = mean(score), .groups = "drop")

anova_result <- aov(mean_response ~ condition + Error(id / condition), data = aggregated_data)
summary(anova_result)
Error: id
Df Sum Sq Mean Sq F value Pr(>F)
Residuals 1 0.001176 0.001176
Error: id:condition
Df Sum Sq Mean Sq
condition 2 1.478 0.7388
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
condition 2 0.376 0.1880 2.717 0.0702 .
Residuals 117 8.097 0.0692
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
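The call that produced the pairwise comparisons below is not echoed in the rendered output; it was presumably along these lines (a sketch using the same aggregated_data as above, which must be ordered so that rows pair up by id within each condition):

# Bonferroni-adjusted paired t-tests between conditions
pairwise.t.test(aggregated_data$mean_response,
                aggregated_data$condition,
                paired = TRUE,
                p.adjust.method = "bonferroni")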
Pairwise comparisons using paired t tests
data: aggregated_data$mean_response and aggregated_data$condition
condition1 condition2
condition2 7.3e-05 -
condition3 6.2e-10 0.0013
P value adjustment method: bonferroni
Exploratory analyses
Response Time by Correctness
library(dplyr)
library(tidyr)
library(ggplot2)

final_data_na <- final_data %>%
  filter(!is.na(response_time_condition1) & !is.na(correct_condition1)) %>%  # remove rows with NA values
  select(1:4)  # select only the first four columns

response_time_summary <- final_data_na %>%
  group_by(condition) %>%
  summarize(mean_response_time = mean(response_time_condition1, na.rm = TRUE),
            median_response_time = median(response_time_condition1, na.rm = TRUE),
            sd_response_time = sd(response_time_condition1, na.rm = TRUE),
            n = n(),
            .groups = "drop")

print(response_time_summary)
ggplot(final_data_na, aes(x = factor(correct_condition1, labels = c("Incorrect", "Correct")),
                          y = response_time_condition1)) +
  geom_boxplot(alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "black") +
  labs(title = "Response Time by Correctness and Condition",
       x = "Response Accuracy",
       y = "Response Time (ms)") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.x = element_text(size = 14),
        axis.title.y = element_text(size = 14),
        axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        strip.text = element_text(size = 14, face = "bold")) +
  scale_fill_manual(values = c("#FFB5E8", "#B5EAD7", "#FFDAC1")) +
  scale_y_continuous(limits = c(0, 10000), breaks = seq(0, 10000, by = 2000))
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_summary()`).
Discussion
Summary of Replication Attempt
A repeated-measures ANOVA was conducted to evaluate the effect of condition on mean accuracy. There was no statistically significant main effect of condition (F(2, 117) = 2.717, p = 0.0702), suggesting that the overall differences among conditions were not strong enough to reach significance. However, pairwise comparisons using paired t-tests with Bonferroni adjustment indicated significant differences between all pairs of conditions:
Condition 1 vs. Condition 2: p < .001
Condition 1 vs. Condition 3: p < .001
Condition 2 vs. Condition 3: p = .0013
These results suggest that although the omnibus effect of condition was not statistically significant, specific pairs of conditions do differ significantly.
Commentary
The differences in participant recruitment (college students vs. Prolific participants) likely explain the discrepancies between this study and the original paper:
Lower accuracy: Prolific participants may be less engaged, more distracted, or less familiar with similar cognitive tasks than college students.
Higher variability: The broader demographics and uncontrolled testing environments of Prolific participants could increase response variability, reducing statistical power.
Non-significant main effect: The higher residual variance and lower performance levels in the Prolific sample likely reduced the F statistic, leading to a non-significant main effect of condition.
Broader generalizability: Despite these limitations, results from a diverse Prolific sample provide valuable insight into how this task generalizes beyond a college-student population.
Design Overview
Similarities: Performance in all conditions was significantly above chance, and pairwise differences followed the same trend (2 × 2 > 3 × 3 > 4 × 4). This suggests participants were able to learn word-referent pairs even under high levels of ambiguity.
Differences: Performance levels were lower across all conditions in this replication, and the main effect of condition was not significant, contrasting with the original paper's highly significant main effect and large effect size. These differences may reflect methodological variations (e.g., task design, training time, or participant characteristics) or sample size and variability in the replication dataset.