Replication of Study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu & Smith (2007, Psychological Science)

Author

Alison Park, Junyi Hui, Pengjia Cui, and Yawen Dong

Published

December 6, 2024

Introduction

The study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu and Smith explored how adults can learn word-referent pairs in highly ambiguous settings. Past studies of word learning have focused on constraints such as social, attentional, or linguistic cues to solve the word-referent mapping problem. While these strategies perform well in controlled, minimally ambiguous contexts, real-world learning environments present learners with greater complexity.

This raises an important question: can learners successfully acquire word-referent pairs in highly ambiguous settings through alternative means, even when they cannot determine correct pairings within a single trial? To address this question, Yu and Smith proposed an alternative mechanism, cross-situational learning. They demonstrated that learners could track word-referent pairings across multiple trials by calculating statistical associations over time rather than relying on immediate clarity within each learning instance.
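As a rough illustration of the idea of tracking associations across trials, the sketch below counts how often each word co-occurs with each object and then guesses the most frequent partner. The pseudowords, the three toy trials, and the simple "pick the maximum" rule are our own assumptions for illustration; they are not the stimuli or the learning model from the original study.

```r
# Hypothetical toy data: each trial lists the words heard and the objects seen.
# Within a trial, nothing indicates which word goes with which object.
trials <- list(
  list(words = c("bosa", "gasser"), objects = c("canister", "rasp")),
  list(words = c("bosa", "manu"),   objects = c("canister", "facial_sauna")),
  list(words = c("gasser", "manu"), objects = c("rasp", "facial_sauna"))
)

words   <- sort(unique(unlist(lapply(trials, `[[`, "words"))))
objects <- sort(unique(unlist(lapply(trials, `[[`, "objects"))))

# Co-occurrence matrix: rows are words, columns are objects
cooc <- matrix(0, nrow = length(words), ncol = length(objects),
               dimnames = list(words, objects))
for (trial in trials) {
  cooc[trial$words, trial$objects] <- cooc[trial$words, trial$objects] + 1
}

# For each word, guess the object it co-occurred with most often
guess <- setNames(colnames(cooc)[apply(cooc, 1, which.max)], rownames(cooc))
print(guess)
```

No single trial identifies a pairing here, but each word ends up with one object that it has accompanied more often than the others.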

Design Overview

One factor was manipulated in the study: within-trial ambiguity. The manipulation operates through three conditions in which the number of words and referents presented per trial varied (2×2, 3×3, and 4×4).
Two measures were taken: accuracy in learning word-referent pairs and response time.
The study employed a within-participants design as each participant experienced all three conditions.
Measures were repeated across each condition for every participant.
Applying a between-participants design instead of a within-participants design would increase variance due to individual learning differences.
The study reduced demand characteristics by using pseudowords and not providing explicit cues linking words to specific referents; participants therefore had to rely solely on cross-trial statistical learning.
A potential confound is the repeated exposure to pseudowords and objects, which could lead participants to develop strategies based on familiarity or memorization rather than on cross-trial statistical learning.
The use of pseudowords and uncommon objects may limit generalizability to real-world language learning, where learners often have social and contextual cues available. Also, testing was limited to adult participants, so findings may not generalize well to children.

Power Analysis

The original experiment included 38 participants, all of whom were undergraduate students from Indiana University. Participants received either course credit or $7 for their participation.

library(pwr)

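# Effect size and sample size taken from the original study
effect_size <- 1.425
alpha <- 0.05
# type is left at its default ("two.sample")
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha, alternative = "greater")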
print(result)


     Two-sample t test power calculation 

              n = 38
              d = 1.425
      sig.level = 0.05
          power = 0.9999967
    alternative = greater

NOTE: n is number in *each* group

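With the effect size taken from the original study and 38 participants per group, the calculation yields very high statistical power (above 0.99). This indicates that the probability of correctly rejecting the null hypothesis, if the alternative hypothesis is true, is nearly 100%. Note that the calculation uses a two-sample t test, whereas the study itself used a within-participants design, so the result is best read as an approximation.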
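Because every participant completes all three conditions, a paired calculation may be closer to the design. The following is a minimal sketch using base R's `power.t.test`, under the assumption (ours, not the original authors') that d = 1.425 can be treated as a standardized effect on within-participant differences.

```r
# Assumption: d = 1.425 is interpreted as the mean paired difference
# divided by the SD of the paired differences (so delta = d and sd = 1).
paired_power <- power.t.test(
  n = 38,                      # number of pairs (participants)
  delta = 1.425,
  sd = 1,
  sig.level = 0.05,
  type = "paired",
  alternative = "one.sided"
)
print(paired_power)
```

Under this assumption the power is again close to 1, so the conclusion about sample size would not change.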

Planned Sample

Given the high statistical power of the original study, our replication aims to include a similar or slightly larger sample, recruited from Prolific, to maintain consistency with the original design.

Methods

Materials

“The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone. There were 54 unique objects and 54 unique pseudowords partitioned into three sets of 18 words and referents for use in the three conditions. The training trials were generated by randomly pairing each word with one picture; these were the word-referent pairs to be discovered by the learner. The three learning conditions differed in the number of words and referents presented on each training trial: 2-2 Condition: 2 words and 2 pictures; 3-3 Condition: 3 words and 3 pictures; 4-4 Condition: 4 words and 4 pictures” (Yu and Smith 2007)
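To make the trial structure concrete, here is one way a single pass through 18 word-referent pairs could be cut into training trials. The labels, the seed, and the "shuffle then chunk" scheme are hypothetical; the quoted materials do not specify how trials were assembled beyond the random word-picture pairing.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

n_pairs   <- 18
per_trial <- 3   # 2, 3, or 4 depending on the condition

# Hypothetical labels; pair i links word i with object i
pairs <- data.frame(
  word   = sprintf("word%02d", 1:n_pairs),
  object = sprintf("object%02d", 1:n_pairs)
)

# One pass through the set: shuffle the pairs and cut them into trials
make_pass <- function(pairs, per_trial) {
  shuffled <- pairs[sample(nrow(pairs)), ]
  trial_id <- ceiling(seq_len(nrow(shuffled)) / per_trial)
  lapply(split(shuffled, trial_id), function(trial) {
    list(
      words   = sample(trial$word),    # word order shuffled independently
      objects = sample(trial$object)   # object order shuffled independently
    )
  })
}

one_pass <- make_pass(pairs, per_trial)
length(one_pass)   # number of trials in one pass
```

With `per_trial = 4` this simple chunking leaves a shorter final trial, since 18 is not divisible by 4; a real implementation would need to handle that case.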

Procedure

“The pictures were presented on a 17-in. computer screen, and the sound was played by the speakers connected to the same computer. Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.” (Yu and Smith 2007)
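The test format described above can be sketched as follows: one target plus three foils drawn from the same set of 18 training pictures, which gives a chance level of 1/4. The object labels and the helper name `make_test_trial` are hypothetical.

```r
set.seed(2)  # arbitrary seed
objects <- sprintf("object%02d", 1:18)  # hypothetical picture labels

# Build one four-alternative forced-choice trial for a given target picture
make_test_trial <- function(target, objects, n_foils = 3) {
  foils   <- sample(setdiff(objects, target), n_foils)
  choices <- sample(c(target, foils))   # randomize screen positions
  list(
    target = target,
    choices = choices,
    correct_position = match(target, choices)
  )
}

trial <- make_test_trial("object07", objects)
print(trial)

# Probability of a correct answer by guessing
chance_level <- 1 / length(trial$choices)
```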

Analysis Plan

The primary analysis will involve a one-way ANOVA to compare learning accuracy across the three conditions (2×2, 3×3, and 4×4). In this setup, the independent variable is the condition (level of ambiguity), and the dependent variable is the accuracy of word-object pair identification. We will also examine response times across conditions to investigate whether higher ambiguity affects the speed of responding, which may contribute to understanding cognitive processing under different conditions. Data cleaning will exclude incomplete responses and trials where response times are excessively high or low.

Differences from Original Study

Sample: The original study included 38 undergraduate participants from Indiana University. Our sample may differ slightly due to recruitment constraints; participants will probably be drawn from a broader demographic pool, which could introduce variability in learning abilities or prior exposure to similar experimental tasks. However, as cross-situational learning mechanisms are believed to be consistent across adult populations, the sample difference is not expected to significantly impact the findings.
Setting: In the original study, participants completed the trials in a controlled lab environment. Our replication may be conducted entirely online. Conducting the experiment outside of a laboratory could introduce additional distractions or variations. As the original research suggests that cross-situational learning effects are resilient to minor environmental changes, we do not expect this variation to significantly influence the outcome.

Methods Addendum (Post Data Collection)

Actual Sample

Forty-two participants recruited on Prolific received $5 for their participation. One participant was excluded from analysis for failing to complete the Condition 3 test phase.

Differences from pre-data collection methods plan

None

Results

Data preparation

Load Relevant Libraries and Functions

library(jsonlite)
library(dplyr)
library(ggplot2)
library(effectsize)
library(car)
library(tidyr)
library(stringr)

Import data

# Condition 1 (2 * 2)

setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition1")
files1 <- list.files(pattern = "\\.csv$")

data1 <- lapply(files1, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  df$correct <- as.character(df$correct)
  df <- df %>% filter(!is.na(correct))
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()

# Condition 2 (3 * 3)

setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition2")
files2 <- list.files(pattern = "\\.csv$")

data2 <- lapply(files2, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  df$correct <- as.character(df$correct)
  df <- df %>% filter(!is.na(correct))
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()

# Condition 3 (4 * 4)

setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition3")
files3 <- list.files(pattern = "\\.csv$")
data3 <- lapply(files3, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()

Data exclusion / filtering

# Select necessary columns for analysis
selected_data1 <- data1 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)
selected_data2 <- data2 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)
selected_data3 <- data3 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)

# Remove rows with NAs
cleaned_data1 <- na.omit(selected_data1)
cleaned_data1$correct <- as.logical(cleaned_data1$correct)
cleaned_data2 <- na.omit(selected_data2)
cleaned_data2$correct <- as.logical(cleaned_data2$correct)
cleaned_data3 <- na.omit(selected_data3)
cleaned_data3$correct <- as.logical(cleaned_data3$correct)

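# Function to identify outlier participants.
# Bounds are the median +/- 3 SD of all trial-level response times in the condition;
# a participant is flagged when their mean response time falls outside those bounds.
outliers <- function(data) {
  stats <- data %>%
    summarise(
      median = median(response_time, na.rm = TRUE),
      sd = sd(response_time, na.rm = TRUE)
    )
  lower_bound <- stats$median - (3 * stats$sd)
  upper_bound <- stats$median + (3 * stats$sd)
  # Mean response time per participant
  participant <- data %>%
    group_by(ParticipantID) %>%
    summarise(
      avg = mean(response_time, na.rm = TRUE)
    )
  # IDs of participants whose mean falls outside the bounds
  flagged <- participant %>%
    filter(avg < lower_bound | avg > upper_bound) %>%
    pull(ParticipantID)
  return(flagged)
}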

# Identify outliers in each condition
outliers1 <- outliers(cleaned_data1)
outliers2 <- outliers(cleaned_data2)
outliers3 <- outliers(cleaned_data3)

# Combine outlier participant IDs from all conditions
all_outliers <- unique(c(outliers1, outliers2, outliers3))

# Exclude outliers from all conditions
filtered_data1 <- cleaned_data1 %>%
  filter(!ParticipantID %in% all_outliers)
filtered_data2 <- cleaned_data2 %>%
  filter(!ParticipantID %in% all_outliers)
filtered_data3 <- cleaned_data3 %>%
  filter(!ParticipantID %in% all_outliers)

Prepare data for analysis - create columns etc.

# Create a 'Condition' column
filtered_data1$Condition <- 'Condition1'
filtered_data2$Condition <- 'Condition2'
filtered_data3$Condition <- 'Condition3'

# Combine Condition 1, 2, and 3
combined_data <- bind_rows(filtered_data1, filtered_data2, filtered_data3)

Confirmatory analysis

As noted before, we collected data from 42 participants across three experimental conditions. One participant was excluded for failing to complete all tests, and another two were removed because their mean response times in Condition 1 were more than 3 standard deviations above the median. The following confirmatory analysis is therefore based on data from the remaining 39 participants.

Accuracy

Overall Accuracy

# Calculate accuracy over condition
accuracy <- combined_data %>%
  group_by(Condition) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')

print(accuracy)

# A tibble: 3 × 2
  Condition  Accuracy
  <chr>         <dbl>
1 Condition1    0.692
2 Condition2    0.551
3 Condition3    0.403

# Calculate condition 3 accuracy by participant
condition3_accuracy <- combined_data %>%
  filter(Condition == "Condition3") %>%
  group_by(ParticipantID) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )
high_accuracy <- condition3_accuracy %>%
  filter(participant_accuracy > 0.75)

print(high_accuracy)

# A tibble: 4 × 2
  ParticipantID participant_accuracy
  <chr>                        <dbl>
1 8nkzlv9ca3                   0.944
2 ab2tzywewf                   0.944
3 aqurvmvs3g                   0.833
4 d8bz99odus                   0.889

In comparison to the original experiment, our accuracy data show the same ordering of conditions but noticeably lower performance in each, roughly 15 to 20 percentage points below the approximate figures reported in the original.
In the original study, participants discovered, on average, more than 16 of the 18 pairs in the 2x2 condition, corresponding to an accuracy above 88.9%, while our participants achieved an average accuracy of 69.2%.
For the 3x3 condition, the original study reported participants discovering more than 13 of the 18 pairs, or an accuracy above 72.2%, while our participants averaged 55.1%.
Lastly, in the 4x4 condition, the original study indicated that participants discovered nearly 10 pairs, or an accuracy of approximately 55.6%, while our participants averaged 40.3%. In addition, only 4 participants in our experiment discovered more than 75% of the pairs in this condition, fewer than half of the 9 participants who reached that level in the original experiment.
Despite the differences in accuracy levels, both our study and the original experiment demonstrate a pattern of declining performance as ambiguity increases across conditions.
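The percentages above follow from dividing the number of discovered pairs by 18. A short sketch of that arithmetic is given below; the pair counts are the approximate figures quoted above ("more than 16", "more than 13", "nearly 10"), so the resulting gaps are approximate as well.

```r
pairs_total <- 18

# Approximate pair counts reported for the original study
original_pairs <- c(`2x2` = 16, `3x3` = 13, `4x4` = 10)
original_accuracy <- original_pairs / pairs_total

# Mean accuracies from our replication (see the table above)
replication_accuracy <- c(`2x2` = 0.692, `3x3` = 0.551, `4x4` = 0.403)

round(original_accuracy, 3)
# Approximate gap between original and replication, per condition
round(original_accuracy - replication_accuracy, 3)
```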

Accuracy over images

accuracy_by_image <- combined_data %>%
  group_by(Condition, correct_image) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')

# linewidth replaces the deprecated size argument for lines (ggplot2 >= 3.4.0)
ggplot(accuracy_by_image, aes(x = correct_image, y = Accuracy, group = Condition, color = Condition)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Accuracy by Image for Each Condition",
    x = "Image",
    y = "Accuracy"
  ) +
  scale_x_continuous(breaks = unique(accuracy_by_image$correct_image)) +
  facet_wrap(~Condition, scales = "free_x", ncol = 1) +
  theme_minimal()

The plot presents accuracy by image for each condition, showing how participants performed across the 18 images in the 2x2, 3x3, and 4x4 conditions. In Condition 1, accuracy remains relatively stable across all images, with minimal variability. Condition 2 and Condition 3 show slightly greater variability but no extreme deviations.

Mean Accuracy Over Condition

# Calculate means and standard errors for plotting
participant_accuracy <- combined_data %>%
  group_by(Condition, ParticipantID) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )

mean_accuracy <- participant_accuracy %>%
  group_by(Condition) %>%
  summarise(
    mean_accuracy = mean(participant_accuracy, na.rm = TRUE),
    se_accuracy = sd(participant_accuracy, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"
  )

# Create the bar plot
ggplot(mean_accuracy, aes(x = Condition, y = mean_accuracy, fill = Condition)) +
  geom_bar(stat = "identity", width = 0.4) +
  geom_errorbar(aes(ymin = mean_accuracy - se_accuracy, ymax = mean_accuracy + se_accuracy), 
                width = 0.1, color = "black") +
  geom_hline(yintercept = 0.25, linetype = "dotted", color = "black") +
  annotate("text", x = 3.35, y = 0.27, label = "Chance", 
           hjust = 0, size = 3, color = "black") +
  labs(
    title = "Mean Accuracy by Condition",
    x = "Learning Condition",
    y = "Proportion Correct"
  ) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  theme_minimal() +
  theme(legend.position = "none")

Figure: Original study plot

Compared to the original study, the error bars for Conditions 1 and 2 are larger in our experiment, indicating greater variability in participant performance. In the original study, the error bars are smaller, reflecting more consistent accuracy among participants across all conditions.

Chance Performance

# Chance performance is 1/4 in all three conditions (four-alternative forced-choice test)
combined_data <- combined_data %>%
  mutate(chance_level = case_when(
    Condition == "Condition1" ~ 0.25,
    Condition == "Condition2" ~ 0.25,
    Condition == "Condition3" ~ 0.25
  ))

t_test_results <- combined_data %>%
  group_by(Condition) %>%
  summarise(
    t_test_p_value = t.test(correct, mu = unique(chance_level))$p.value,
    .groups = 'drop'
  )

print(t_test_results)

# A tibble: 3 × 2
  Condition  t_test_p_value
  <chr>               <dbl>
1 Condition1      2.94e-101
2 Condition2      1.52e- 49
3 Condition3      6.96e- 16

In the original study, it was reported that participants’ performance in all conditions significantly exceeded chance levels. In our study, the t-test results similarly show that performance across all conditions was significantly above the chance level of 0.25. The p-values for each condition are exceptionally small, indicating that participants were not guessing randomly but learning word-referent pairs. Note that these tests were run on trial-level responses, which treats trials from the same participant as independent observations.
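A more conservative alternative is to test participant-level accuracy against chance, so that each participant contributes one observation. The sketch below uses simulated accuracies as a stand-in; the values are not the replication data, and the assumed true accuracy of 0.4 is arbitrary.

```r
set.seed(3)  # arbitrary seed

# Simulated stand-in for per-participant accuracy in one condition
# (39 participants, 18 test trials each); NOT the replication data
n_participants <- 39
n_trials <- 18
participant_acc <- rbinom(n_participants, size = n_trials, prob = 0.4) / n_trials

# One-sample t test of mean accuracy against the chance level of 0.25
chance_test <- t.test(participant_acc, mu = 0.25, alternative = "greater")
print(chance_test)
```

With the real data, `participant_acc` would be replaced by the per-participant means for one condition, for example the `participant_accuracy` column computed earlier.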

Effect of Condition (ANOVA)

# Aggregate trial-level data into participant-level data
participant_data <- combined_data %>%
  group_by(Condition, ParticipantID) %>%
  summarise(
    perc = mean(correct, na.rm = TRUE),
    res_time = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )

# Run Anova test
participant_data$Condition <- as.factor(participant_data$Condition)
anova <- aov(perc ~ Condition, data = participant_data)
summary(anova)

             Df Sum Sq Mean Sq F value   Pr(>F)    
Condition     2  1.631  0.8155   11.68 2.42e-05 ***
Residuals   114  7.956  0.0698                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Our one-way ANOVA revealed a statistically significant effect of condition on accuracy, F(2, 114) = 11.68, p < .001, indicating that participants’ performance differed across the three conditions. This finding aligns with the original study, which also reported a decline in accuracy as task complexity increased from the 2x2 to the 4x4 condition. Note that this model treats the three conditions as independent groups, although each participant completed all three.
As in the original study, this indicates that within-trial ambiguity plays an important role in participants’ ability to learn word-referent pairs.
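Since the design is within-participants, a repeated-measures model is one way to account for the fact that the same people appear in every condition. The following is a minimal sketch on simulated data; the condition means, the noise levels, and the data frame `sim` are assumptions for illustration and do not reproduce our results.

```r
set.seed(4)  # arbitrary seed
n <- 39

# Simulated long-format data: one row per participant per condition
sim <- data.frame(
  ParticipantID = factor(rep(sprintf("p%02d", 1:n), times = 3)),
  Condition = factor(rep(c("Condition1", "Condition2", "Condition3"), each = n))
)

# Accuracy = condition mean + stable participant ability + noise (all assumed)
ability   <- rnorm(n, mean = 0, sd = 0.15)
cond_mean <- c(Condition1 = 0.70, Condition2 = 0.55, Condition3 = 0.40)
raw <- unname(cond_mean[as.character(sim$Condition)]) +
  ability[as.integer(sim$ParticipantID)] +
  rnorm(nrow(sim), mean = 0, sd = 0.10)
sim$perc <- pmin(pmax(raw, 0), 1)  # keep proportions within [0, 1]

# Repeated-measures ANOVA: Condition is tested within participants
rm_fit <- aov(perc ~ Condition + Error(ParticipantID / Condition), data = sim)
summary(rm_fit)
```

The `Error(ParticipantID / Condition)` term separates between-participant variability from the within-participant comparison of conditions, which the one-way model above does not do.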

Post-Hoc Analysis

tukey <- TukeyHSD(anova)
print(tukey)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = perc ~ Condition, data = participant_data)

$Condition
                            diff        lwr          upr     p adj
Condition2-Condition1 -0.1410256 -0.2830956  0.001044269 0.0521636
Condition3-Condition1 -0.2891738 -0.4312437 -0.147103879 0.0000125
Condition3-Condition2 -0.1481481 -0.2902181 -0.006078238 0.0388659

The post-hoc analysis shows statistically significant differences in accuracy between Condition 3 and each of the other two conditions (Condition 3 vs. Condition 1: p < .001; Condition 3 vs. Condition 2: p = .039). This is consistent with the findings of the original study, where performance declined as ambiguity increased.
The difference between Condition 1 and Condition 2 did not reach significance at the .05 level (p = .052), although its size (about 14 percentage points) is similar to the drop from Condition 2 to Condition 3 (about 15 percentage points). The evidence that the 3x3 condition was harder than the 2x2 condition is therefore weaker than the evidence for the drop from the 2x2 to the 4x4 condition.
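Tukey's HSD as used above assumes independent groups. For a within-participants design, paired pairwise comparisons with a multiplicity correction are a common alternative. Below is a hedged sketch on simulated data (not our results); the Holm correction is our choice for illustration.

```r
set.seed(5)  # arbitrary seed
n <- 39

# Simulated accuracies; rows are ordered by participant within each condition,
# which pairwise.t.test relies on when paired = TRUE
ability   <- rnorm(n, mean = 0, sd = 0.15)
acc       <- c(0.70 + ability, 0.55 + ability, 0.40 + ability) +
  rnorm(3 * n, mean = 0, sd = 0.10)
condition <- factor(rep(c("Condition1", "Condition2", "Condition3"), each = n))

# Paired t tests for every pair of conditions, Holm-adjusted p-values
posthoc <- pairwise.t.test(acc, condition,
                           p.adjust.method = "holm", paired = TRUE)
print(posthoc)
```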

Exploratory analyses

Response Times

Overall response times

response_time <- combined_data %>%
  group_by(Condition) %>%
  summarise(
    Mean_ResponseTime = mean(response_time),
    SD_ResponseTime = sd(response_time),
    .groups = 'drop'
  )

print(response_time)

# A tibble: 3 × 3
  Condition  Mean_ResponseTime SD_ResponseTime
  <chr>                  <dbl>           <dbl>
1 Condition1             3487.           2868.
2 Condition2             3774.           3167.
3 Condition3             3688.           3016.

In Condition 1, participants had the shortest mean response time (about 3.5 s), suggesting that lower ambiguity allowed for faster responses. Condition 2 had the longest mean response time, and Condition 3 fell between the two.
The standard deviations are large in all three conditions, nearly as large as the means, indicating substantial variability in response times across trials.

Response times over images

response_time_by_image <- combined_data %>%
  group_by(Condition, correct_image) %>%
  summarise(
    Mean_ResponseTime = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )

ggplot(response_time_by_image, aes(x = correct_image, 
                                   y = Mean_ResponseTime, 
                                   group = Condition, 
                                   color = Condition)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2) +
  labs(
    title = "Response Times by Image for Each Condition",
    x = "Image",
    y = "Mean Response Time (ms)"
  ) +
  scale_x_continuous(breaks = unique(response_time_by_image$correct_image)) +
  facet_wrap(~Condition, scales = "free_x", ncol = 1) +
  theme_minimal()

In Condition 1, response times stabilize quickly after a drop from Image 1 to Image 2, showing minimal variability. Condition 2 shows a pronounced initial drop but remains stable afterwards. Condition 3 has relatively stable response times and a smaller initial drop. The initial drop across all conditions might suggest that participants spent time adapting to the test phase at the beginning.

Response times vs. Accuracy

ggplot(participant_data, 
       aes(x = res_time, y = perc)) + 
  geom_point(alpha = 0.7, 
             position = position_jitter(width = 0.2), size = 2,
             aes(color = ParticipantID)) +
  facet_wrap(~Condition, scales = "free") +
  labs(
    x = "Response Time (ms)", 
    y = "Proportion Correct", 
    title = "Response Time vs. Proportion Correct"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 12, face = "bold")
  )

The scatterplot shows the relationship between response time and proportion correct across conditions.
- In Condition 1, participants generally achieved high scores with lower response times, showing a cluster near the top-left.
- In Condition 2, there is more variability in both response times and scores, with some participants taking significantly longer to respond.
- In Condition 3, scores are lower overall, with response times spread more evenly, reflecting increased task difficulty.
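The visual impression from the scatterplot could be quantified with a rank correlation between mean response time and proportion correct within a condition. The sketch below uses simulated values for one condition; the distributions are arbitrary assumptions and the output says nothing about our data.

```r
set.seed(6)  # arbitrary seed
n <- 39

# Simulated participant-level values for one condition (NOT the replication data)
res_time <- rlnorm(n, meanlog = 8.1, sdlog = 0.4)  # mean response time in ms
perc     <- round(runif(n, min = 0, max = 18)) / 18  # proportion correct out of 18

# Spearman rank correlation; exact = FALSE avoids warnings when ranks are tied
rt_acc <- cor.test(res_time, perc, method = "spearman", exact = FALSE)
print(rt_acc)
```

With the real data, `res_time` and `perc` would come from `participant_data`, filtered to one condition at a time.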

Discussion

Summary of Replication Attempt

Our confirmatory analysis revealed that participants’ performance significantly declined as task ambiguity increased across the three conditions, consistent with the original study’s findings. Accuracy was highest in the 2x2 condition, lower in the 3x3 condition, and lowest in the 4x4 condition, with all conditions exceeding chance levels. While the general pattern replicated the original study, the levels of accuracy in our study were lower across all conditions, and variability was higher. These results suggest a partial replication of the original findings, capturing the overall trend but with differences in the strength and consistency of participant performance.

Commentary

Our exploratory analyses showed that mean response times were longer in the two more ambiguous conditions than in the 2x2 condition, with the longest mean in the 3x3 condition rather than the 4x4 condition. Interestingly, a small number of participants in the 4x4 condition still achieved relatively high accuracy within a short response time, suggesting that they may have used more effective strategies to resolve high ambiguity.
While the overall trend of declining accuracy with increasing task complexity replicated the original findings, the lower accuracy and higher variability in our study suggest the presence of moderating factors. Differences in participant demographics (university students vs. Prolific users), experimental settings, or task delivery (e.g., instructions, timing) could potentially influence the test results.
The main challenge in interpreting the results lies in the high variability observed in our study, which could be attributed to uncontrolled external factors. Another important aspect not tested in our replication is the role of foil probability, that is, how often incorrect choices were presented. We did not control for this factor due to limited time, which may have affected participants’ ability to differentiate correct from incorrect word pairings. Future studies may attempt to control external factors more tightly and take foil probability into account.

Author Contribution Statement

Junyi Hui: Experiment coding (trial/training phase coding; pre-test audio check and fullscreen check; randomization; condition order counterbalancing code; debugging; review); pre-registration revisions
Yawen Dong: Experiment coding (test phase coding; randomization; debugging; review); pre-registration revisions; OSF datapipe setup
Allison Park: Stimuli analysis and selection; pre-registration original draft; Prolific experiment management; code review
Pengjia Cui: Stimuli generation code; code review