Replication of “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” (2007 Psychological Science)

Author

Allison Park, Yawen Dong, Junyi Hui & Pengjia Cui

Published

Invalid Date

The github repo for the full project

Introduction

The study “Rapid Word Learning Under Uncertainty via Cross-Situational Statistics” by Yu and Smith (2007) examined how adults acquire new words in highly ambiguous settings. The authors proposed a cross-situational learning strategy to address the indeterminacy problem, where multiple possible meanings exist for a new word. This approach involves tracking word-referent pairings across multiple encounters and using statistical probabilities to identify the correct mappings.

Yu and Smith hypothesized that learners accumulate potential word-referent pairings over trials, assess the statistical evidence, and eventually resolve ambiguities to form accurate mappings. Their experiment tested this mechanism by briefly exposing adults to trials containing spoken words and images of individual objects, with no within-trial cues linking words to objects. The findings demonstrated that participants successfully learned word-picture pairings through statistical relations observed across trials.

This replication project seeks to reproduce the findings of Yu and Smith (2007) and further examine the role of cross-situational learning in word acquisition. By closely adhering to their methodology and analysis, we aim to assess the robustness of their conclusions. Any modifications or deviations from the original study will be explicitly detailed in this report.

The original paper is accessible via DOI.

Key aspects of the study design include:

Factor Manipulated: Within-trial ambiguity across three conditions (2×2, 3×3, and 4×4).
Measures: Accuracy in learning word-referent pairs and response time.
Design: A within-participants design to minimize variance due to individual differences.
Confounds: The potential influence of familiarity or memorization strategies, and limited generalizability due to the use of pseudowords and uncommon objects.

The study minimized demand characteristics by employing pseudowords and avoiding explicit within-trial cues, requiring participants to rely solely on cross-trial statistical learning. However, restricting the participant pool to adults may limit the findings’ applicability to children or naturalistic learning contexts.

Methods

Power Analysis

The original experiment included 38 participants, all undergraduate students from Indiana University. Participants received course credit or $7 for participation.

library(pwr)

Warning: package 'pwr' was built under R version 4.4.2

effect_size<-1.425
alpha<-0.05
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha,alternative="greater")
print(result)


     Two-sample t test power calculation 

              n = 38
              d = 1.425
      sig.level = 0.05
          power = 0.9999967
    alternative = greater

NOTE: n is number in *each* group

Power: 0.9999967, indicating nearly 100% probability of rejecting the null hypothesis if the alternative hypothesis is true.

Planned Sample

The replication aimed to recruit 42 participants through Prolific to ensure high statistical power and maintain methodological consistency with the original study. Participants were compensated for their time, and data collection adhered to ethical research guidelines. All collected data are available in the final_data.zip file on the project’s GitHub repository.

Materials

The stimuli included the following:

Pictures: Images of uncommon objects (e.g., canister, facial sauna) selected to ensure minimal prior familiarity.
Pseudowords: Artificially generated to mimic English phonotactic probabilities.

The stimuli were divided into three sets of 18 word-referent pairs, corresponding to the three experimental conditions (2×2, 3×3, and 4×4). Training trials consisted of randomly paired word-referent combinations, while testing employed a 4-alternative forced-choice design.

Pseudoword Generation: The pseudowords were computer-generated to align with English phonotactic probabilities. Auditory stimuli for the 3×3 and 4×4 conditions were created using the Google Voice Generator.
Picture Selection: Images were sourced from the NOUN Database (Horst & Hout, 2016).

This design ensured that stimuli were consistent with the original study and controlled for prior familiarity and linguistic biases.

Procedure

The procedure will closely follow Yu and Smith (2007) with minimal modifications to ensure replication fidelity.

Each trial will consist of the following steps:

Visual Presentation: Referents will be displayed simultaneously on a computer monitor.
Auditory Presentation: Pseudoword names will be presented audibly through the computer’s speakers.

No additional cues will be provided during trials to indicate word-object pairings. This ensures that participants rely solely on cross-trial statistical learning to deduce correct pairings.

At the end of each condition (2×2, 3×3, and 4×4), participants will complete a 4-alternative forced-choice test. During this test, participants will hear a pseudoword and select its corresponding referent from four images displayed on the screen. The test evaluates the accuracy of word-referent pair learning across conditions.

Analysis Plan

The analysis plan will replicate the original study’s strategy closely. Data cleaning rules include removing trials with response times exceeding three standard deviations from the participant’s mean or below 200 ms. Data exclusions will follow the criteria set forth in the “Exclusions” section of this protocol. We will calculate the mean accuracy for each participant in each condition and conduct a repeated-measures ANOVA to test for differences across the 2x2, 3x3, and 4x4 conditions.

The primary analysis of interest is the repeated-measures ANOVA, testing whether accuracy varies significantly across conditions with different ambiguity levels. Additional exploratory analyses will examine potential learning patterns across trials.

Differences from Original Study

Our study aims to mirror the original as closely as possible. However, differences may arise from using a different sample frame (our university student pool) and potential minor procedural adjustments due to updated software. Based on the literature, these differences are not expected to substantially impact the effect.

Methods Addendum (Post Data Collection)

Actual Sample

42 participants on Prolific received $5 for their participation. One participant was excluded from analysis for failure to complete the Condition 3 test phase.

Differences from pre-data collection methods plan

No substantial deviations from the original methods.

Results

Data preparation

Data preparation will follow the steps outlined in the Analysis Plan to ensure consistent and accurate processing. The following steps will be applied:

Data Cleaning:
- Exclude trials with response times that exceed three standard deviations from each participant’s mean RT or are under 200 ms (The criteria may need to be slightly changed). Such trials are flagged as outliers, indicating inattentive or anticipatory responses.
Calculating Mean Accuracy:
- Calculate the mean accuracy score for each participant in each condition (2x2, 3x3, and 4x4). These scores will be the dependent variable for ANOVA.
Exclusion of Participants:
- Apply participant exclusion criteria as specified in the Exclusions section (The criteria may need to be slightly changed).
Final Dataset Creation:
- The cleaned data will include mean accuracy scores per condition for each participant who meets the inclusion criteria. This dataset will be used for the confirmatory analysis.

Data cleaning

library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

library(ggplot2)
library(tidyr)
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ lubridate 1.9.3     ✔ stringr   1.5.1
✔ purrr     1.0.2     ✔ tibble    3.2.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(effsize)

Warning: package 'effsize' was built under R version 4.4.2

library(pwr)
library(car)

Warning: package 'car' was built under R version 4.4.2

Loading required package: carData

Warning: package 'carData' was built under R version 4.4.2


Attaching package: 'car'

The following object is masked from 'package:purrr':

    some

The following object is masked from 'package:dplyr':

    recode

# Function to process files (improved)
process_files <- function(file_list, condition_label) {
  df_list <- list()
  for (i in seq_along(file_list)) {
    df <- read.csv(file_list[i])
    df <- df[, (ncol(df) - 4):ncol(df)] # Select the last 5 columns
    df <- na.omit(df) # Remove rows with NA values
    df$respondent <- substr(file_list[i], 23, 32) # Add a respondent ID
    df$condition <- condition_label # Add a condition label
    df$correct <- tolower(df$correct) == "true" # Convert to logical HERE!
    df_list[[i]] <- df
  }
  bind_rows(df_list)
}

# Get file lists
file_list_c1 <- list.files("final data/condition1/", pattern = "\\.csv$", full.names = TRUE)
file_list_c2 <- list.files("final data/condition2/", pattern = "\\.csv$", full.names = TRUE)
file_list_c3 <- list.files("final data/condition3/", pattern = "\\.csv$", full.names = TRUE)

# Process each condition's data
data_c1 <- process_files(file_list_c1, "condition1")
data_c2 <- process_files(file_list_c2, "condition2")
data_c3 <- process_files(file_list_c3, "condition3")

# Combine all data (now works correctly)
data <- bind_rows(data_c1, data_c2, data_c3)

# Now data$correct is consistently logical
str(data$correct) # Check the structure

 logi [1:2214] TRUE FALSE FALSE FALSE FALSE FALSE ...

# Select necessary columns for analysis
selected_data1 <- data_c1 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)
selected_data2 <- data_c2 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)
selected_data3 <- data_c3 %>% select(correct_choice, correct_image, response_letter, correct, response_time, respondent, condition)

# Remove rows with NAs
cleaned_data1 <- na.omit(selected_data1)
cleaned_data1$correct <- as.logical(cleaned_data1$correct)
cleaned_data2 <- na.omit(selected_data2)
cleaned_data2$correct <- as.logical(cleaned_data2$correct)
cleaned_data3 <- na.omit(selected_data3)
cleaned_data3$correct <- as.logical(cleaned_data3$correct)

data <- bind_rows(cleaned_data1, cleaned_data2, cleaned_data3)

Confirmatory Analysis

We will perform a repeated-measures ANOVA to assess differences in accuracy across the three conditions: 2x2, 3x3, and 4x4. This test will evaluate whether the degree of within-trial ambiguity significantly affects participants’ accuracy in learning word-referent pairings.

Accuracy

Overall Accuracy

# Calculate overall accuracy
overall_accuracy <- mean(data$correct, na.rm = TRUE)
print(paste("Overall Accuracy: ", round(overall_accuracy * 100, 2), "%"))

[1] "Overall Accuracy:  52.04 %"

# Check participant performance above chance level
participant_accuracy <- data %>%
  group_by(condition, respondent) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )

mean_accuracy <- participant_accuracy %>%
  group_by(condition) %>%
  summarise(
    mean_accuracy = mean(participant_accuracy, na.rm = TRUE),
    se_accuracy = sd(participant_accuracy, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"
  )

# Accuracy by condition plot
ggplot(mean_accuracy, aes(x = condition, y = mean_accuracy)) +
  geom_bar(stat = "identity", width = 0.4) +
  geom_errorbar(aes(ymin = mean_accuracy - se_accuracy, ymax = mean_accuracy + se_accuracy), 
                width = 0.1, color = "black") +
  geom_hline(yintercept = 0.25, linetype = "dotted", color = "black") +
  annotate("text", x = 3.35, y = 0.27, label = "Chance", 
           hjust = 0, size = 3, color = "black") +
  theme_minimal() +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  labs(
    title = "Participant Accuracy by Condition",
    x = "Condition",
    y = "Accuracy"
  )

The original figure from Yu & Smith (2007):

Accuracy decreases as within-trial ambiguity increases. This trend aligns with the findings of the original study, indicating that task complexity negatively impacts learning performance.

condition_accuracy <- data %>%
  group_by(condition) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')

condition_accuracy

# A tibble: 3 × 2
  condition  Accuracy
  <chr>         <dbl>
1 condition1    0.679
2 condition2    0.543
3 condition3    0.393

Our replication reflects the same pattern of declining accuracy with increasing ambiguity observed in the original study, but with systematically lower performance. In the 2×2 condition, accuracy was 67.9% in our study compared to over 88.9% in the original. For the 3×3 condition, our participants achieved 54.3%, compared to 72.2% in the original. In the 4×4 condition, accuracy dropped to 39.3% in our study versus 55.6% in the original. While participants in both studies consistently outperformed chance, overall accuracy was lower in our replication.
These discrepancies likely stem from differences in participant populations (Prolific users vs. university students), the online testing environment, and minor procedural variations. Despite this, our findings reaffirm the robustness of cross-situational learning and the impact of task complexity on performance. Future replications should address these methodological differences to reduce variability and better align with the original study.

Image-Specific Accuracy

# Combine data from all conditions
combined_data <- bind_rows(
  data_c1 %>% mutate(Condition = "Condition1"),
  data_c2 %>% mutate(Condition = "Condition2"),
  data_c3 %>% mutate(Condition = "Condition3")
)

# Calculate accuracy per image for each condition
image_accuracy <- combined_data %>%
  group_by(correct_image, Condition) %>%
  summarize(accuracy = mean(correct, na.rm = TRUE), .groups = 'drop')

# Plot all conditions in one plot with distinct colors
ggplot(image_accuracy, aes(x = correct_image, y = accuracy, color = Condition, group = Condition)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  scale_x_continuous(breaks = seq(min(image_accuracy$correct_image), max(image_accuracy$correct_image), 1)) +
  labs(
    x = "Per Image",
    y = "Accuracy",
    title = "Accuracy by Image Across Conditions"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("Condition1" = "blue", "Condition2" = "navy", "Condition3" = "lightblue"))

The combined plot highlights clear trends in accuracy across conditions, with performance declining as ambiguity increases.

The overlap between Conditions 1 and 2 for certain images suggests that low-ambiguity strategies may partially carry over to moderate ambiguity. However, the lack of overlap with Condition 3 highlights the disruption caused by higher complexity. Variability across images, particularly in Conditions 2 and 3, likely reflects image-specific factors such as distinctiveness or interference from distractors.

These results support the hypothesis that increased ambiguity negatively impacts learning, consistent with the original study. The variability within conditions underscores the importance of pair-specific factors, suggesting future research should investigate how such features influence learning under ambiguous conditions.

Effect of condition (ANOVA)

participant_data <- data %>%
  group_by(condition, respondent) %>%
  summarise(
    perc = mean(correct, na.rm = TRUE),
    res_time = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )
participant_data$condition <- as.factor(participant_data$condition)
anova <- aov(perc ~ condition, data = participant_data)
summary(anova)

             Df Sum Sq Mean Sq F value   Pr(>F)    
condition     2  1.367  0.6834   10.71 5.78e-05 ***
Residuals   106  6.761  0.0638                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA revealed a significant effect of condition on accuracy (F = 10.8, p = 5.37e-05), indicating that performance declined as task complexity increased.
This finding aligns with the original study, supporting the hypothesis that higher ambiguity impairs participants’ ability to learn word-referent pairs. The results highlight the critical role of ambiguity in influencing cross-situational learning outcomes.

Post-hoc

tukey <- TukeyHSD(anova)
print(tukey)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = perc ~ condition, data = participant_data)

$condition
                            diff        lwr         upr     p adj
condition2-condition1 -0.1356519 -0.2844401  0.01313628 0.0816094
condition3-condition1 -0.2860584 -0.4348466 -0.13727022 0.0000392
condition3-condition2 -0.1504065 -0.2829965 -0.01781646 0.0220384

The Tukey post-hoc analysis highlights significant declines in accuracy as ambiguity increases. The difference between Condition 3 (4×4) and the other two conditions is statistically significant, indicating that participants struggled substantially more in the highest ambiguity condition. However, the difference between Condition 2 (3×3) and Condition 1 (2×2) was not significant, suggesting only a modest increase in difficulty between low and moderate ambiguity.
These results are consistent with the original study, which reported the most pronounced drop in accuracy between the 3×3 and 4×4 conditions, emphasizing the impact of ambiguity on learning performance.

Exploratory analyses

Response Time by Correctness

data_combined <- bind_rows(
  mutate(data_c1, condition = "Condition 1"),
  mutate(data_c2, condition = "Condition 2"),
  mutate(data_c3, condition = "Condition 3")
)

# Summarize response times by condition and correctness (optional, for printing)
response_time_summary <- data_combined %>%
  group_by(condition, correct) %>%
  summarize(
    mean_response_time = mean(response_time, na.rm = TRUE),
    median_response_time = median(response_time, na.rm = TRUE),
    sd_response_time = sd(response_time, na.rm = TRUE),
    .groups = "drop"
  )
print(response_time_summary)

# A tibble: 6 × 5
  condition   correct mean_response_time median_response_time sd_response_time
  <chr>       <lgl>                <dbl>                <dbl>            <dbl>
1 Condition 1 FALSE                4279.                3137             5442.
2 Condition 1 TRUE                 3746.                2652             4058.
3 Condition 2 FALSE                5873.                3630             8675.
4 Condition 2 TRUE                 3966.                2726             5161.
5 Condition 3 FALSE                3544.                3002.            2713.
6 Condition 3 TRUE                 3861.                2828.            3956.

# Create the faceted boxplot
ggplot(data_combined, aes(x = factor(correct, labels = c("Incorrect", "Correct")), y = response_time)) +
  geom_boxplot(fill = "skyblue3", alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", color = "red", size = 3) + # Increased point size
  scale_y_log10() +
  facet_wrap(~ condition) +  # This creates the separate panels
  labs(
    title = "Response Time by Correctness Across Conditions",
    x = "Response Accuracy",
    y = "Response Time (log scale)"
  ) +
  theme_minimal() +
  theme(strip.text = element_text(size = 12)) #Adjust the size of facet labels

The analysis indicates that response times increase with task complexity, as seen in the jump from Condition 1 to Condition 2 for both correct and incorrect responses. Correct responses were generally faster than incorrect ones in Conditions 1 and 2, suggesting greater confidence in resolving ambiguity. However, in Condition 3, correct responses were slower than incorrect ones, indicating higher cognitive effort for successful learning under high ambiguity. The variability in response times also grows with complexity, as seen in the larger spread in Conditions 2 and 3, reflecting increased uncertainty and guessing. This supports the conclusion that ambiguity influences not just accuracy but the cognitive load required for both correct and incorrect responses.

Here, we perform Levene’s test.

# Perform Levene's Test for homogeneity of variances
levene_test <- leveneTest(response_time ~ condition, data = data_combined)

Warning in leveneTest.default(y = y, group = group, ...): group coerced to
factor.

print(levene_test)

Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)    
group    2  9.7043 6.366e-05 ***
      2211                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The Levene’s test results indicate that the variances of response times across conditions are significantly different (F = 9.704, p < 0.001).

Response Time by Images

response_time_by_image <- data %>%
  group_by(condition, correct_image) %>%
  summarize(
    Mean_ResponseTime = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )

ggplot(response_time_by_image, aes(x = correct_image, 
                                   y = Mean_ResponseTime, 
                                   color = condition, 
                                   group = condition)) +
  geom_line(size = 1) + 
  geom_point(size = 3) +
  labs(
    title = "Response Times by Image Across Conditions",
    x = "Image",
    y = "Mean Response Time (ms)",
    color = "Condition"
  ) +
  scale_x_continuous(breaks = seq(min(response_time_by_image$correct_image), 
                                  max(response_time_by_image$correct_image), 1)) +
  theme_minimal() +
  theme(
    legend.position = "top",
    legend.title = element_text(size = 12),
    legend.text = element_text(size = 10),
    plot.title = element_text(size = 14, face = "bold")
  )

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

The trends suggest that task familiarity and ambiguity interact to influence response times. In Condition 1, participants quickly adapted due to lower ambiguity, while in Condition 2, the initial ambiguity resulted in slower responses that improved over time. In Condition 3, response times reflect persistent difficulty, highlighting the impact of high ambiguity on learning and decision-making processes. These patterns reinforce the role of task complexity in shaping cognitive load and adaptive strategies during word-referent learning.

Correctness by Response Time Quartiles

To explore how response time relates to performance, we divide response times into quartiles and examine the proportion of correct responses within each quartile across conditions.

data_combined <- data_combined %>%
  group_by(condition) %>%
  mutate(
    response_time_quartile = ntile(response_time, 4) # Divide into 4 quartiles
  )

# Calculate proportion correct within each quartile and condition
correctness_by_quartile <- data_combined %>%
  group_by(condition, response_time_quartile) %>%
  summarize(
    proportion_correct = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )

# Plot correctness by response time quartiles
ggplot(correctness_by_quartile, aes(x = response_time_quartile, y = proportion_correct, color = condition, group = condition)) +
  geom_line(size = 1) +
  geom_point(size = 3) +
  labs(
    title = "Proportion Correct by Response Time Quartiles",
    x = "Response Time Quartile (1 = Fastest, 4 = Slowest)",
    y = "Proportion Correct",
    color = "Condition"
  ) +
  theme_minimal() +
  scale_x_continuous(breaks = 1:4)

This suggests that response time is a stronger predictor of accuracy in low-ambiguity conditions, where faster responses reflect confidence and effective learning. In high-ambiguity conditions, slower response times do not appear to improve accuracy, indicating that participants struggle regardless of deliberation.

Discussion

Summary of Replication Attempt

Our replication successfully reproduced the general trends reported by Yu and Smith (2007), demonstrating that participant accuracy declined as within-trial ambiguity increased across the three conditions (2×2, 3×3, and 4×4). Participants achieved the highest accuracy in the 2×2 condition, moderate accuracy in the 3×3 condition, and the lowest accuracy in the 4×4 condition, consistent with the original findings. However, overall accuracy levels in our replication were slightly lower, suggesting differences due to sample population, testing environment, or procedural nuances.

Response time analyses revealed further insights into learning performance. Participants responded faster when correct under low and moderate ambiguity, while slower responses in the high-ambiguity condition did not necessarily improve accuracy. These findings align with the original study’s assertion that ambiguity significantly impacts learning outcomes, with increased complexity leading to greater cognitive load and reduced performance.

Commentary

The replication highlights the robustness of cross-situational learning mechanisms under low ambiguity while emphasizing their limitations as task complexity increases. The slightly lower accuracy and higher variability in our replication may reflect differences in the online testing environment compared to the controlled lab settings of the original study, as well as potential differences in participant characteristics between the Prolific sample and university students.

Exploratory analyses provided additional insights, showing how response time patterns differ across conditions. Faster responses correlated with higher accuracy in low-ambiguity conditions, suggesting confidence and effective learning strategies. In contrast, the high-ambiguity condition showed limited gains from slower responses, implying that participants struggled to resolve the task regardless of deliberation time.

Overall, the replication confirms the critical role of ambiguity in shaping word-referent learning and provides a foundation for future research to explore factors such as visual complexity, distractor interference, and individual differences in learning strategies. Addressing these factors in subsequent studies could enhance the generalizability of findings to real-world learning contexts.

Author Contribution Statement

Hui Junyi: Experiment coding (trial/training phase coding; pre-test audio check and fullscreen check; randomization; condition order counterbalancing code; debugging; review); pre-registration revisions
Yawen Dong: Experiment coding (test phase coding; randomization; debugging; review); pre-registration revisions; OSF datapipe setup
Allison Park: Stimuli analysis and selection; pre-registration original draft; Prolific experiment management; code review
Pengjia Cui: Stimuli generation code; code review