Replication of “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” (2007 Psychological Science)

Author

Allison Park, Yawen Dong, Junyi Hui & Pengjia Cui

The github repo for the full project

Introduction

The study “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” by Yu and Smith (2007) investigated how adults learn new words in ambiguous settings. The authors proposed that a cross-situational learning strategy could solve the indeterminacy problem, where multiple possible meanings exist for a new word. This strategy involves keeping track of word-referent pairings across multiple encounters and using statistical probabilities to determine the correct word-referent mappings.

Yu and Smith (2007) hypothesized that learners store potential word-referent pairings across trials, evaluate the statistical evidence, and eventually map words to their correct referents. This cross-situational learning mechanism was tested in an experiment where adults were briefly exposed to trials containing multiple spoken words and pictures of individual objects. No within-trial information was given to link the words and objects. The study found that participants could learn the word-picture mappings through cross-trial statistical relations.

This replication project aims to reproduce the findings of Yu and Smith (2007) and further explore the role of cross-situational learning in word acquisition. We will be closely following their methodology and analysis to determine the robustness of their findings. Any modifications or deviations from the original study will be explicitly noted in our report.

The original paper can be accessed via DOI.

Factor Manipulated: Within-trial ambiguity through three conditions (2×2, 3×3, and 4×4).
Measures: Accuracy in learning word-referent pairs and response time.
Design: Within-participants design, reducing variance from individual differences.
Confounds: Familiarity/memorization strategies and generalizability due to the use of pseudowords and uncommon objects.

The study reduced demand characteristics by using pseudowords and avoiding explicit cues, requiring participants to rely solely on cross-trial statistical learning. However, testing was limited to adults, which might limit applicability to children or naturalistic learning environments.

Methods

Power Analysis

The original experiment included 38 participants, all undergraduate students from Indiana University. Participants received course credit or $7 for participation.

library(pwr)

Warning: package 'pwr' was built under R version 4.4.2

effect_size<-1.425
alpha<-0.05
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha,alternative="greater")
print(result)


     Two-sample t test power calculation 

              n = 38
              d = 1.425
      sig.level = 0.05
          power = 0.9999967
    alternative = greater

NOTE: n is number in *each* group

Power: 0.9999967 for a one-sided two-sample t test with n = 38 per group, d = 1.425, and alpha = 0.05, indicating a nearly 100% probability of rejecting the null hypothesis if the alternative hypothesis is true.
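As a cross-check that does not depend on the pwr package, the same calculation can be sketched with base R's `power.t.test`. The paired call is an assumption added for illustration: it treats d = 1.425 as the standardized mean of within-participant difference scores, which the analysis above does not state.

```r
# Cross-check of the reported power with base R (stats::power.t.test).
d <- 1.425      # effect size used in the pwr call above
alpha <- 0.05

# Same setting as the pwr.t.test() call: one-sided, two-sample, n = 38 per group.
two_sample <- power.t.test(n = 38, delta = d, sd = 1, sig.level = alpha,
                           type = "two.sample", alternative = "one.sided")

# Assumption: a paired version for a within-participants design,
# treating d as the standardized mean difference score.
paired <- power.t.test(n = 38, delta = d, sd = 1, sig.level = alpha,
                       type = "paired", alternative = "one.sided")

# Smallest per-group n that reaches 80% power in the two-sample setting.
n_needed <- ceiling(power.t.test(delta = d, sd = 1, sig.level = alpha,
                                 power = 0.80, type = "two.sample",
                                 alternative = "one.sided")$n)

print(c(two_sample = two_sample$power, paired = paired$power))
print(n_needed)
```

With an effect this large, power is close to 1 under either specification, so the choice of test type matters little for the sample-size decision here.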

Planned Sample

The replication aimed to recruit 42 participants from Prolific, ensuring high statistical power and maintaining methodological consistency with the original study. All data can be found in final data.zip on GitHub.

Materials

Stimuli included:

Pictures: Uncommon objects (e.g., canister, facial sauna).
Pseudowords: Generated to mimic phonotactic probabilities in English.

Stimuli were divided into three sets of 18 word-referent pairs for each condition (2x2, 3x3, 4x4). Training trials were randomly paired, and testing used a 4-alternative forced-choice design.

The pseudowords were computer-generated to maintain English phonotactic probability. We then created the 3x3 and 4x4 stimuli using Google Voice Generator. The pictures come from the NOUN Database (http://www.sussex.ac.uk/wordlab/noun; Horst & Hout, 2016).
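One way to build training and test trials of this kind is sketched below. The number of repetitions per pair (`reps = 6`) and the helper names `make_training_trials` and `make_test_trial` are assumptions for illustration; they are not taken from the experiment code.

```r
# Sketch: training trials with k word-referent pairs each, plus 4AFC test trials.
make_training_trials <- function(n_pairs = 18, k = 4, reps = 6) {
  stopifnot((n_pairs * reps) %% k == 0)
  repeat {
    # Concatenate `reps` independent shuffles so every pair occurs `reps` times.
    seq_all <- unlist(lapply(seq_len(reps), function(i) sample(n_pairs)))
    trials <- matrix(seq_all, ncol = k, byrow = TRUE)
    # Accept the draw only if no trial contains the same pair twice.
    ok <- all(apply(trials, 1, function(row) !anyDuplicated(row)))
    if (ok) return(trials)
  }
}

make_test_trial <- function(target, n_pairs = 18, n_options = 4) {
  # One target plus (n_options - 1) foils drawn from the other pairs.
  foils <- sample(setdiff(seq_len(n_pairs), target), n_options - 1)
  sample(c(target, foils))  # shuffle option order
}

set.seed(2007)
trials_4x4 <- make_training_trials(k = 4)
print(dim(trials_4x4))        # 27 trials of 4 pairs
print(table(trials_4x4))      # each pair appears 6 times
print(make_test_trial(target = 5))
```

Rejection sampling keeps the code short; it is only needed when a trial straddles two shuffles, which happens for k = 4 but not for k = 2 or k = 3 with 18 pairs.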

Procedure

The procedure will follow Yu & Smith (2007) with minimal modification: “Each trial began with the simultaneous visual presentation of the referents on a computer monitor. The names were then presented auditorily over the computer’s speakers.” As in the original study, no additional cues will be provided to suggest word-object pairings within individual trials, and participants will complete a 4-alternative forced-choice test following each condition.

Analysis Plan

The analysis plan will replicate the original study’s strategy closely. Data cleaning rules include removing trials with response times exceeding three standard deviations from the participant’s mean or below 200 ms. Data exclusions will follow the criteria set forth in the “Exclusions” section of this protocol. We will calculate the mean accuracy for each participant in each condition and conduct a repeated-measures ANOVA to test for differences across the 2x2, 3x3, and 4x4 conditions.

The primary analysis of interest is the repeated-measures ANOVA, testing whether accuracy varies significantly across conditions with different ambiguity levels. Additional exploratory analyses will examine potential learning patterns across trials.

Differences from Original Study

Our study aims to mirror the original as closely as possible. However, differences may arise from using a different sample frame (Prolific participants rather than a university student pool) and from minor procedural adjustments due to updated software. Based on the literature, these differences are not expected to substantially impact the effect.

Methods Addendum (Post Data Collection)

Actual Sample

42 participants on Prolific received $5 for their participation. One participant was excluded from analysis for failure to complete the Condition 3 test phase.
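An exclusion of this kind can be checked programmatically by keeping only respondents who contributed data in every condition. The sketch below uses a small made-up data frame; the respondent IDs are hypothetical, while the column names follow the analysis code in this report.

```r
# Toy data: respondent "p3" has no condition3 data.
toy <- data.frame(
  respondent = c("p1", "p1", "p1", "p2", "p2", "p2", "p3", "p3"),
  condition  = c("condition1", "condition2", "condition3",
                 "condition1", "condition2", "condition3",
                 "condition1", "condition2"),
  correct    = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Count the distinct conditions each respondent completed.
n_cond <- tapply(toy$condition, toy$respondent, function(x) length(unique(x)))
complete_ids <- names(n_cond)[n_cond == 3]
excluded_ids <- setdiff(unique(toy$respondent), complete_ids)

# Keep only respondents with data in all three conditions.
toy_complete <- toy[toy$respondent %in% complete_ids, ]
print(excluded_ids)
```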

Differences from pre-data collection methods plan

No substantial deviations from the original methods.

Results

Data preparation

Data preparation will follow the steps outlined in the Analysis Plan to ensure consistent and accurate processing. The following steps will be applied:

Data Cleaning:
- Exclude trials with response times that are more than three standard deviations from each participant’s mean RT or under 200 ms (these thresholds may be adjusted slightly). Such trials are flagged as outliers, indicating inattentive or anticipatory responses.
Calculating Mean Accuracy:
- Calculate the mean accuracy score for each participant in each condition (2x2, 3x3, and 4x4). These scores will be the dependent variable for ANOVA.
Exclusion of Participants:
- Apply participant exclusion criteria as specified in the Exclusions section (these criteria may be adjusted slightly).
Final Dataset Creation:
- The cleaned data will include mean accuracy scores per condition for each participant who meets the inclusion criteria. This dataset will be used for the confirmatory analysis.
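The trial-level cleaning rule above could be implemented along the following lines. This is a base-R sketch on made-up response times, not the code used for the reported results; the helper name `trim_rt` is hypothetical, and treating "three standard deviations from the mean" as an absolute deviation in either direction is an assumption.

```r
# Sketch: drop trials faster than min_rt or beyond sd_cutoff SDs of the
# respondent's own mean response time.
trim_rt <- function(df, min_rt = 200, sd_cutoff = 3) {
  m <- ave(df$response_time, df$respondent, FUN = mean)  # per-respondent mean
  s <- ave(df$response_time, df$respondent, FUN = sd)    # per-respondent SD
  too_fast <- df$response_time < min_rt
  too_far  <- abs(df$response_time - m) > sd_cutoff * s
  df[!(too_fast | too_far), , drop = FALSE]
}

# Made-up response times (ms) for two hypothetical respondents.
toy <- data.frame(
  respondent = rep(c("A", "B"), each = 20),
  response_time = c(seq(2500, 3350, by = 50), 150, 60000,  # A: two bad trials
                    seq(2000, 3900, by = 100))             # B: all plausible
)
cleaned <- trim_rt(toy)
print(table(cleaned$respondent))
```

Because the mean and SD are computed before any trial is removed, a single extreme value inflates the SD; a median-based rule would be less sensitive to that, at the cost of departing from the stated plan.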

Data cleaning

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1
✔ purrr     1.0.2     ✔ tibble    3.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(effsize)

Warning: package 'effsize' was built under R version 4.4.2

library(pwr)

# Read every CSV for one condition and stack the trials into one data frame
process_files <- function(file_list, condition_label) {
  df_list <- list()
  for (i in seq_along(file_list)) {
    df <- read.csv(file_list[i])
    df <- df[, (ncol(df) - 4):ncol(df)] # Keep the last 5 columns
    df <- na.omit(df) # Remove rows with NA values
    df$respondent <- substr(file_list[i], 23, 32) # Respondent ID from fixed positions in the file path
    df$condition <- condition_label # Add a condition label
    df$correct <- tolower(df$correct) == "true" # Convert "true"/"false" strings to logical
    df_list[[i]] <- df
  }
  bind_rows(df_list)
}

# Get file lists
file_list_c1 <- list.files("final data/condition1/", pattern = "\\.csv$", full.names = TRUE)
file_list_c2 <- list.files("final data/condition2/", pattern = "\\.csv$", full.names = TRUE)
file_list_c3 <- list.files("final data/condition3/", pattern = "\\.csv$", full.names = TRUE)

# Process each condition's data
data_c1 <- process_files(file_list_c1, "condition1")
data_c2 <- process_files(file_list_c2, "condition2")
data_c3 <- process_files(file_list_c3, "condition3")

# Combine all data (now works correctly)
data <- bind_rows(data_c1, data_c2, data_c3)

# Now data$correct is consistently logical
str(data$correct) # Check the structure

 logi [1:2214] TRUE FALSE FALSE FALSE FALSE FALSE ...

# Select necessary columns for analysis
selected_data1 <- data_c1 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)
selected_data2 <- data_c2 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)
selected_data3 <- data_c3 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)

# Remove rows with NAs
cleaned_data1 <- na.omit(selected_data1)
cleaned_data1$correct <- as.logical(cleaned_data1$correct)
cleaned_data2 <- na.omit(selected_data2)
cleaned_data2$correct <- as.logical(cleaned_data2$correct)
cleaned_data3 <- na.omit(selected_data3)
cleaned_data3$correct <- as.logical(cleaned_data3$correct)

data <- bind_rows(cleaned_data1, cleaned_data2, cleaned_data3)
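The call `substr(file_list[i], 23, 32)` in `process_files` relies on fixed character positions in the full path, so it would pick up the wrong characters if the folder were renamed. A sketch of an extraction that depends only on the file name is shown below. The file name is hypothetical, and the assumption that the ID occupies the first 10 characters of the file name is an inference from the positions used above, since the real file names are not shown.

```r
# Hypothetical path, written by hand in the layout used above.
path <- "final data/condition1/abcde12345_session.csv"

# Approach used in process_files(): fixed positions in the full path.
id_fixed <- substr(path, 23, 32)

# Alternative: positions relative to the file name only.
id_base <- substr(basename(path), 1, 10)

print(c(id_fixed = id_fixed, id_base = id_base))
```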

Confirmatory Analysis

We will perform a repeated-measures ANOVA to assess differences in accuracy across the three conditions: 2x2, 3x3, and 4x4. This test will evaluate whether the degree of within-trial ambiguity significantly affects participants’ accuracy in learning word-referent pairings.

Accuracy

Overall Accuracy

# Calculate overall accuracy
overall_accuracy <- mean(data$correct, na.rm = TRUE)
print(paste("Overall Accuracy: ", round(overall_accuracy * 100, 2), "%"))

[1] "Overall Accuracy:  52.04 %"

# Check participant performance above chance level
participant_accuracy <- data %>%
  group_by(condition, respondent) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )

mean_accuracy <- participant_accuracy %>%
  group_by(condition) %>%
  summarise(
    mean_accuracy = mean(participant_accuracy, na.rm = TRUE),
    se_accuracy = sd(participant_accuracy, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"
  )

# Accuracy by condition plot
ggplot(mean_accuracy, aes(x = condition, y = mean_accuracy)) +
  geom_bar(stat = "identity", width = 0.4) +
  geom_errorbar(aes(ymin = mean_accuracy - se_accuracy, ymax = mean_accuracy + se_accuracy), 
                width = 0.1, color = "black") +
  geom_hline(yintercept = 0.25, linetype = "dotted", color = "black") +
  annotate("text", x = 3.35, y = 0.27, label = "Chance", 
           hjust = 0, size = 3, color = "black") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  labs(
    title = "Participant Accuracy by Condition",
    x = "Condition",
    y = "Accuracy"
  )

The original figure from Yu & Smith (2007):

Compared to the original study, the error bars for Conditions 1 and 2 are larger in our experiment, indicating greater variability in participant performance. In the original study, the error bars are smaller, reflecting more consistent accuracy among participants across all conditions.

condition_accuracy <- data %>%
  group_by(condition) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')

condition_accuracy

# A tibble: 3 × 2
  condition  Accuracy
  <chr>         <dbl>
1 condition1    0.679
2 condition2    0.543
3 condition3    0.393

Compared with the original experiment, our accuracy data show the same ordering of conditions but lower performance in every condition.
In the original study, participants discovered, on average, more than 16 of the 18 pairs in the 2x2 condition, corresponding to an accuracy above 88.9%, while our participants achieved an average accuracy of 67.9%.
For the 3x3 condition, the original study reported participants discovering more than 13 of the 18 pairs, or an accuracy above 72.2%, while our participants averaged 54.3%.
In the 4x4 condition, the original study indicated participants discovered nearly 10 pairs, or an accuracy of approximately 55.6%, while our participants averaged 39.3%.
Both our study and the original experiment show performance declining as within-trial ambiguity increases.
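The conversion from pair counts to percentages, and the resulting gap between the two studies, can be reproduced with a few lines of arithmetic. The replication values are copied from the `condition_accuracy` table above; the original study's counts are the approximate bounds quoted in the text ("more than 16", "more than 13", "nearly 10"), so the gaps are rough estimates.

```r
# Accuracy implied by the original study's reported pair counts (out of 18).
original_pairs <- c("2x2" = 16, "3x3" = 13, "4x4" = 10)
original_acc <- original_pairs / 18

# Replication accuracy from the condition_accuracy table above.
replication_acc <- c("2x2" = 0.679, "3x3" = 0.543, "4x4" = 0.393)

# Gap in percentage points between the original and the replication.
gap_points <- round(100 * (original_acc - replication_acc), 1)
print(round(100 * original_acc, 1))  # 88.9 72.2 55.6
print(gap_points)                    # 21.0 17.9 16.3
```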

Image-Specific Accuracy

img_acc_c1 <- data_c1 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))

ggplot(img_acc_c1, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey",linewidth = 1) + 
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition1")

img_acc_c2 <- data_c2 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))

ggplot(img_acc_c2, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey",linewidth = 1) + 
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition2")

img_acc_c3 <- data_c3 %>%
  group_by(correct_image) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE))

ggplot(img_acc_c3, aes(x = correct_image, y = accuracy)) +
  geom_line(color = "grey",linewidth = 1) + 
  geom_point(size = 1) +
  scale_x_continuous(breaks = seq(min(data$correct_image), max(data$correct_image), 1)) +
  labs(x = "Per Image", y = "Accuracy", title = "Condition3")

The plots present accuracy by image for each condition, showing how participants performed across the 18 images in the 2x2, 3x3, and 4x4 conditions. In Condition 1, accuracy remains relatively stable across all images, with minimal variability. Condition 2 and Condition 3 show slightly greater variability but no extreme deviations.

Effect of condition (ANOVA)

participant_data <- data %>%
  group_by(condition, respondent) %>%
  summarise(
    perc = mean(correct, na.rm = TRUE),
    res_time = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )
participant_data$condition <- as.factor(participant_data$condition)
anova <- aov(perc ~ condition, data = participant_data)
summary(anova)

             Df Sum Sq Mean Sq F value   Pr(>F)    
condition     2  1.367  0.6834   10.71 5.78e-05 ***
Residuals   106  6.761  0.0638                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our ANOVA results (F(2, 106) = 10.71, p = 5.78e-05) revealed a statistically significant effect of condition on accuracy, indicating that participants’ performance varied significantly across the three conditions. Note that the model as fit, aov(perc ~ condition), is a one-way ANOVA without a participant error term, so it treats each participant’s condition means as independent observations rather than as repeated measures. This finding aligns with the original study, which also reported a decline in accuracy as task complexity increased from the 2x2 to 4x4 condition.
This provides evidence that, as in the original study, within-trial ambiguity plays an important role in participants’ ability to learn word-referent pairs.
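The `aov(perc ~ condition)` model above has no participant term. For a within-participants design, a repeated-measures specification could look like the sketch below. It uses simulated data (the means, noise levels, and sample size are made up) and assumes every respondent has a score in all three conditions.

```r
# Sketch: repeated-measures ANOVA with a participant error stratum.
set.seed(1)
n <- 12
conds <- c("condition1", "condition2", "condition3")
toy <- expand.grid(respondent = factor(seq_len(n)),
                   condition = factor(conds))

# Simulated accuracy: condition mean + participant offset + trial noise.
cond_mean <- c(condition1 = 0.68, condition2 = 0.54, condition3 = 0.39)
subj_offset <- rnorm(n, mean = 0, sd = 0.10)
toy$perc <- unname(cond_mean[as.character(toy$condition)]) +
  subj_offset[as.integer(toy$respondent)] +
  rnorm(nrow(toy), mean = 0, sd = 0.05)

# Error(respondent / condition) tests condition within participants.
rm_fit <- aov(perc ~ condition + Error(respondent / condition), data = toy)
print(summary(rm_fit))
```

In this specification the condition effect is tested against the participant-by-condition residual, with 2 and 2 * (n - 1) degrees of freedom, rather than against between-participant variability.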

Post-hoc

tukey <- TukeyHSD(anova)
print(tukey)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = perc ~ condition, data = participant_data)

$condition
                            diff        lwr         upr     p adj
condition2-condition1 -0.1356519 -0.2844401  0.01313628 0.0816094
condition3-condition1 -0.2860584 -0.4348466 -0.13727022 0.0000392
condition3-condition2 -0.1504065 -0.2829965 -0.01781646 0.0220384

The post-hoc analysis shows statistically significant differences in accuracy between Condition 3 and each of the other two conditions (p < .001 versus Condition 1; p = .022 versus Condition 2). This is consistent with the findings of the original study, where performance declined as ambiguity increased.
The difference between Condition 1 and Condition 2 did not reach significance at the .05 level (p = .082), suggesting that participants found the 3x3 condition somewhat more challenging than the 2x2 condition, but that this effect is less pronounced than the drop in accuracy observed in the 4x4 condition.
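Tukey's HSD, as used above, compares independent groups. If the comparisons were instead treated as within-participant contrasts, one option would be paired t tests with a Holm correction. The sketch below runs on simulated data with made-up means and noise levels, and it assumes the rows are ordered so that respondents line up across conditions.

```r
# Sketch: paired post-hoc comparisons with Holm adjustment.
set.seed(42)
n <- 15
conds <- c("condition1", "condition2", "condition3")
# expand.grid varies respondent fastest, so rows align across conditions.
toy <- expand.grid(respondent = seq_len(n), condition = factor(conds))

base_acc <- c(0.68, 0.54, 0.39)[as.integer(toy$condition)]
subj_offset <- rnorm(n, sd = 0.10)
toy$perc <- base_acc + subj_offset[toy$respondent] + rnorm(nrow(toy), sd = 0.05)

ph <- pairwise.t.test(toy$perc, toy$condition,
                      paired = TRUE, p.adjust.method = "holm")
print(ph$p.value)
```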

Exploratory analyses

Response Time by Correctness

data_combined <- bind_rows(
  mutate(data_c1, condition = "Condition 1"),
  mutate(data_c2, condition = "Condition 2"),
  mutate(data_c3, condition = "Condition 3")
)

# Summarize response times by condition and correctness (optional, for printing)
response_time_summary <- data_combined %>%
  group_by(condition, correct) %>%
  summarize(
    mean_response_time = mean(response_time, na.rm = TRUE),
    median_response_time = median(response_time, na.rm = TRUE),
    sd_response_time = sd(response_time, na.rm = TRUE),
    .groups = "drop"
  )
print(response_time_summary)

# A tibble: 6 × 5
  condition   correct mean_response_time median_response_time sd_response_time
  <chr>       <lgl>                <dbl>                <dbl>            <dbl>
1 Condition 1 FALSE                4279.                3137             5442.
2 Condition 1 TRUE                 3746.                2652             4058.
3 Condition 2 FALSE                5873.                3630             8675.
4 Condition 2 TRUE                 3966.                2726             5161.
5 Condition 3 FALSE                3544.                3002.            2713.
6 Condition 3 TRUE                 3861.                2828.            3956.

# Create the faceted boxplot
ggplot(data_combined, aes(x = factor(correct, labels = c("Incorrect", "Correct")), y = response_time)) +
  geom_boxplot(fill = "skyblue3", alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) + # Increased point size
  scale_y_log10() +
  facet_wrap(~ condition) +  # This creates the separate panels
  labs(
    title = "Response Time by Correctness Across Conditions",
    x = "Response Accuracy",
    y = "Response Time (log scale)"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(size = 12)) #Adjust the size of facet labels

We use Bartlett’s test to check whether the variance of response times is homogeneous across the three conditions.

bartlett.test(data$response_time~data$condition)


    Bartlett test of homogeneity of variances

data:  data$response_time by data$condition
Bartlett's K-squared = 401.64, df = 2, p-value < 2.2e-16

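Bartlett’s test rejects homogeneity of variance (K-squared = 401.64, df = 2, p < 2.2e-16): the spread of response times differs significantly across conditions. Note that this test concerns variances, not mean response times. We now visualize the results over images.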
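Bartlett's test is sensitive to departures from normality, and raw response times are typically right-skewed. A possible robustness check, sketched here on simulated data rather than the study data, is to compare it with the Fligner-Killeen test on log-transformed response times. The simulated means and spreads are arbitrary assumptions.

```r
# Sketch: variance-homogeneity checks on simulated, right-skewed RTs.
set.seed(123)
sim <- data.frame(
  condition = rep(c("condition1", "condition2", "condition3"), each = 200),
  # Log-normal RTs (ms) with a deliberately larger spread in condition2.
  response_time = c(rlnorm(200, meanlog = 8, sdlog = 0.5),
                    rlnorm(200, meanlog = 8, sdlog = 1.0),
                    rlnorm(200, meanlog = 8, sdlog = 0.5))
)

bart <- bartlett.test(response_time ~ condition, data = sim)
flig <- fligner.test(log(response_time) ~ condition, data = sim)
print(c(bartlett_p = bart$p.value, fligner_p = flig$p.value))
```

A small p-value from either test indicates unequal spread across conditions; it says nothing about whether mean response times differ.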

Response Time by Images

response_time_by_image <- data %>%
  group_by(condition, correct_image) %>%
  summarise(
    Mean_ResponseTime = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )

ggplot(response_time_by_image, aes(x = correct_image, 
                                   y = Mean_ResponseTime, 
                                   group = condition, 
                                   color = condition)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Response Times by Image for Each Condition",
    x = "Image",
    y = "Mean Response Time (ms)"
  ) +
  scale_x_continuous(breaks = unique(response_time_by_image$correct_image)) +
  facet_wrap(~condition, scales = "free_x", ncol = 1) +
  theme_minimal()

In Condition 1, response times stabilize quickly after a drop from Image 1 to Image 2, showing minimal variability. Condition 2 shows a pronounced initial drop but remains stable afterwards. Condition 3 has relatively stable response times and a smaller initial drop. The initial drop across all conditions might suggest that participants spend time adapting to the test phase at the beginning.

Discussion

Summary of Replication Attempt

Our confirmatory analysis revealed that participants’ performance significantly declined as task ambiguity increased across the three conditions, consistent with the original study’s findings. Accuracy was highest in the 2x2 condition, lower in the 3x3 condition, and lowest in the 4x4 condition, with mean accuracy in all conditions exceeding the 25% chance level. While the general pattern replicated the original study, the levels of accuracy in our study were lower across all conditions, and variability was higher, particularly in the more complex conditions. These results suggest a partial replication of the original findings, capturing the overall trend but with differences in the strength and consistency of participant performance.

Commentary

Our exploratory analyses showed that response times did not rise steadily with task complexity: mean response times were longest in the 3x3 condition, particularly for incorrect responses, and were similar in the 2x2 and 4x4 conditions. Interestingly, a small number of participants in the 4x4 condition still achieved relatively high accuracy within a short response time, suggesting that they may use more effective strategies to resolve high ambiguity.
While the overall trend of declining accuracy with increasing task complexity replicated the original findings, the lower accuracy and higher variability in our study suggest the presence of moderating factors. Differences in participant demographics (university students vs. Prolific users), experimental settings, or task delivery (e.g., instructions, timing) could potentially influence the test results.
The main challenge in interpreting the results lies in the high variability observed in our study, which could be attributed to uncontrolled external factors. Another important aspect not tested in our replication is the role of foil probability, that is, how often incorrect choices were presented. We did not manipulate this factor due to limited time, which may have affected participants’ ability to differentiate correct from incorrect word pairings. Future studies may attempt to control external factors more tightly and take foil probability into account.

Author Contribution Statement

Junyi Hui: Experiment coding (trial/training phase coding; pre-test audio check and fullscreen check; randomization; condition order counterbalancing code; debugging; review); pre-registration revisions
Yawen Dong: Experiment coding (test phase coding; randomization; debugging; review); pre-registration revisions; OSF datapipe setup
Allison Park: Stimuli analysis and selection; pre-registration original draft; Prolific experiment management; code review
Pengjia Cui: Stimuli generation code; code review