library(pwr)
Replication of Study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu & Smith (2007, Psychological Science)
Introduction
The study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu and Smith explored how adults learn word-referent pairs in highly ambiguous settings. Past studies on word learning have focused on constraints such as social, attentional, or linguistic cues to solve the word-referent mapping problem. While these strategies perform well in controlled, minimally ambiguous contexts, real-world learning environments present learners with greater complexity.
This raises an important question: can learners successfully acquire word-referent pairs in highly ambiguous settings through alternative means, even when they cannot determine correct pairings within a single trial? To address this question, Yu and Smith proposed an alternative mechanism, cross-situational learning. They demonstrated that learners could track word-referent pairings across multiple trials by calculating statistical associations over time rather than relying on immediate clarity within each learning instance.
Design Overview
- One factor was manipulated in the study: within-trial ambiguity. The manipulation operates through three conditions in which the number of words and referents presented per trial varied (2×2, 3×3, and 4×4).
- Two measures were taken: accuracy in learning word-referent pairs and response time.
- The study employed a within-participants design as each participant experienced all three conditions.
- Measures were repeated across each condition for every participant.
- Applying a between-participants design instead of a within-participants design would increase variance due to individual learning differences.
- The study reduced demand characteristics by using pseudowords and by not providing explicit cues linking words to specific referents, so participants had to rely solely on cross-trial statistical learning.
- A potential confound is the repeated exposure to pseudowords and objects, which could lead participants to develop strategies based on familiarity or memorization rather than on cross-trial statistical learning.
- The use of pseudowords and uncommon objects may limit generalizability to real-world language learning, where learners often have social and contextual cues available. Also, testing was limited to adult participants, so findings may not generalize well to children.
Power Analysis
The original experiment included 38 participants, all of whom were undergraduate students from Indiana University. Participants received either course credit or $7 for their participation.
effect_size <- 1.425
alpha <- 0.05
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha, alternative = "greater")
print(result)
Two-sample t test power calculation
n = 38
d = 1.425
sig.level = 0.05
power = 0.9999967
alternative = greater
NOTE: n is number in *each* group
Using the effect size derived from the data in the original study (d = 1.425), a sample of 38 participants per group yields very high statistical power (0.9999967). In other words, if the alternative hypothesis is true, the probability of correctly rejecting the null hypothesis is close to 100%.
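As a cross-check, the same calculation can be run with base R's `power.t.test` and then solved in the other direction for the per-group sample size at a given power target. This is a minimal sketch: the 80% target is an assumption chosen for illustration and is not taken from the original study or from our plan.

```r
# Cross-check of the power calculation with base R (stats package).
# delta / sd plays the role of Cohen's d, so sd = 1 gives d = 1.425.
check <- power.t.test(n = 38, delta = 1.425, sd = 1, sig.level = 0.05,
                      type = "two.sample", alternative = "one.sided")
print(check$power)

# Assumption for illustration: solve for n per group at 80% power.
needed <- power.t.test(delta = 1.425, sd = 1, sig.level = 0.05, power = 0.80,
                       type = "two.sample", alternative = "one.sided")
print(ceiling(needed$n))
```

With an effect this large, the sample size required for 80% power is far below 38 per group, which is why the power reported above is so close to 1.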
Planned Sample
Given the high statistical power of the original study, our replication aims to include a similar or slightly larger sample, recruited from Prolific, to maintain consistency with the original design.
Methods
Materials
“The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone. There were 54 unique objects and 54 unique pseudowords partitioned into three sets of 18 words and referents for use in the three conditions. The training trials were generated by randomly pairing each word with one picture; these were the word-referent pairs to be discovered by the learner. The three learning conditions differed in the number of words and referents presented on each training trial: 2-2 Condition: 2 words and 2 pictures; 3-3 Condition: 3 words and 3 pictures; 4-4 Condition: 4 words and 4 pictures” (Yu and Smith 2007)
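The pairing and partitioning described in the quoted passage can be sketched as follows. The word and object labels, the seed, and the condition labels are hypothetical placeholders; the actual stimuli and randomization procedure are those of Yu and Smith (2007).

```r
# Hypothetical labels standing in for the 54 pseudowords and 54 objects.
set.seed(2007)  # arbitrary seed so the sketch is reproducible
words   <- sprintf("word%02d", 1:54)
objects <- sprintf("object%02d", 1:54)

# Randomly pair each word with exactly one object.
pairs <- data.frame(word = words, object = sample(objects))

# Shuffle the pairs, then partition them into three sets of 18,
# one set per learning condition.
pairs <- pairs[sample(nrow(pairs)), ]
pairs$condition <- rep(c("2x2", "3x3", "4x4"), each = 18)
table(pairs$condition)
```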
Procedure
“The pictures were presented on a 17-in. computer screen, and the sound was played by the speakers connected to the same computer. Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.” (Yu and Smith 2007)
Analysis Plan
The primary analysis will involve a one-way ANOVA to compare learning accuracy across the three conditions (2×2, 3×3, and 4×4). In this setup, the independent variable is the condition (level of ambiguity), and the dependent variable is the accuracy of word-object pair identification. We will also examine response times across conditions to investigate whether higher ambiguity affects the speed of responding, which may contribute to understanding cognitive processing under different conditions. Data cleaning will exclude incomplete responses and trials where response times are excessively high or low.
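The sketch below runs the planned one-way ANOVA on simulated data and, next to it, a repeated-measures variant that accounts for each participant contributing to all three conditions, as described in the Design Overview. All values are simulated for illustration, and the repeated-measures model is shown only as one possible alternative, not as the analysis carried out in this report.

```r
# Toy data, simulated for illustration only (not our results).
set.seed(42)
n_participants <- 12
toy <- expand.grid(
  ParticipantID = sprintf("p%02d", 1:n_participants),
  Condition = c("Condition1", "Condition2", "Condition3")
)
toy$ParticipantID <- factor(toy$ParticipantID)
toy$Condition <- factor(toy$Condition)

# Accuracy = condition mean + per-participant offset + trial noise.
baseline <- rnorm(n_participants, mean = 0, sd = 0.10)
effect <- c(Condition1 = 0.70, Condition2 = 0.55, Condition3 = 0.40)
toy$perc <- unname(effect[as.character(toy$Condition)]) +
  baseline[as.integer(toy$ParticipantID)] +
  rnorm(nrow(toy), sd = 0.05)
toy$perc <- pmin(pmax(toy$perc, 0), 1)  # keep proportions in [0, 1]

# One-way ANOVA as in the plan (treats all observations as independent).
fit_between <- aov(perc ~ Condition, data = toy)
summary(fit_between)

# Repeated-measures variant: Condition is tested within participants.
fit_within <- aov(perc ~ Condition + Error(ParticipantID / Condition), data = toy)
summary(fit_within)
```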
Differences from Original Study
Sample: The original study included 38 undergraduate participants from Indiana University. Our sample may differ slightly due to recruitment constraints; participants will probably be drawn from a broader demographic pool, which could introduce variability in learning abilities or prior exposure to similar experimental tasks. However, as cross-situational learning mechanisms are believed to be consistent across adult populations, the sample difference is not expected to significantly affect the findings.
Setting: In the original study, participants completed the trials in a controlled lab environment. Our replication may only involve online settings. Conducting the experiment outside of a laboratory could introduce additional distractions or variations. As the original research suggests that cross-situational learning effects are resilient to minor environmental changes, we do not expect this variation to significantly influence the outcome.
Methods Addendum (Post Data Collection)
Actual Sample
Forty-two participants recruited on Prolific received $5 for their participation. One participant was excluded from the analysis for failing to complete the Condition 3 test phase.
Differences from pre-data collection methods plan
none
Results
Data preparation
Load Relevant Libraries and Functions
library(jsonlite)
Warning: package 'jsonlite' was built under R version 4.2.3
library(dplyr)
Warning: package 'dplyr' was built under R version 4.2.3
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
Warning: package 'ggplot2' was built under R version 4.2.3
library(effectsize)
library(car)
Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
library(tidyr)
library(stringr)
Import data
# Condition 1 (2 * 2)
setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition1")
files1 <- list.files(pattern = "\\.csv$")
data1 <- lapply(files1, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  df$correct <- as.character(df$correct)
  df <- df %>% filter(!is.na(correct))
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()
# Condition 2 (3 * 3)
setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition2")
files2 <- list.files(pattern = "\\.csv$")
data2 <- lapply(files2, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  df$correct <- as.character(df$correct)
  df <- df %>% filter(!is.na(correct))
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()
# Condition 3 (4 * 4)
setwd("/Users/yawendong/Documents/GitHub/psych final project/final_data/Condition3")
files3 <- list.files(pattern = "\\.csv$")
data3 <- lapply(files3, function(file) {
  # Extract the first 10 characters as ParticipantID
  participant_id <- substr(file, 1, 10)
  df <- read.csv(file)
  # Create a 'ParticipantID' column
  df$ParticipantID <- participant_id
  return(df)
}) %>% bind_rows()
Data exclusion / filtering
# Select necessary columns for analysis
selected_data1 <- data1 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)
selected_data2 <- data2 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)
selected_data3 <- data3 %>% select(correct_choice, correct_image, response_letter, correct, response_time, ParticipantID)
# Remove rows with NAs
cleaned_data1 <- na.omit(selected_data1)
cleaned_data1$correct <- as.logical(cleaned_data1$correct)
cleaned_data2 <- na.omit(selected_data2)
cleaned_data2$correct <- as.logical(cleaned_data2$correct)
cleaned_data3 <- na.omit(selected_data3)
cleaned_data3$correct <- as.logical(cleaned_data3$correct)
# Function to identify outlier participants
outliers <- function(data) {
  # Trial-level median and SD of response times within the condition
  stats <- data %>%
    summarise(
      median = median(response_time, na.rm = TRUE),
      sd = sd(response_time, na.rm = TRUE)
    )
  lower_bound <- stats$median - (3 * stats$sd)
  upper_bound <- stats$median + (3 * stats$sd)
  # Mean response time per participant
  participant <- data %>%
    group_by(ParticipantID) %>%
    summarise(
      avg = mean(response_time, na.rm = TRUE)
    )
  # Participants whose mean falls outside the bounds
  outliers <- participant %>%
    filter(avg < lower_bound | avg > upper_bound) %>%
    pull(ParticipantID)
  return(outliers)
}
# Identify outliers in each condition
outliers1 <- outliers(cleaned_data1)
outliers2 <- outliers(cleaned_data2)
outliers3 <- outliers(cleaned_data3)
# Combine outlier participant IDs from all conditions
all_outliers <- unique(c(outliers1, outliers2, outliers3))
#### Exclude outliers from all conditions
filtered_data1 <- cleaned_data1 %>%
  filter(!ParticipantID %in% all_outliers)
filtered_data2 <- cleaned_data2 %>%
  filter(!ParticipantID %in% all_outliers)
filtered_data3 <- cleaned_data3 %>%
  filter(!ParticipantID %in% all_outliers)
Prepare data for analysis - create columns etc.
# Create a 'Condition' column
filtered_data1$Condition <- 'Condition1'
filtered_data2$Condition <- 'Condition2'
filtered_data3$Condition <- 'Condition3'
# Combine Condition 1, 2, and 3
combined_data <- bind_rows(filtered_data1, filtered_data2, filtered_data3)
Confirmatory analysis
As noted before, we collected data from 42 participants across three experimental conditions. One participant was excluded for failing to complete all tests, and another two were removed due to excessively high response times (greater than 3 standard deviations above the median) in Condition 1. Therefore, the following confirmatory analysis is based on data from the remaining 39 participants.
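To make the exclusion rule concrete, the sketch below applies the same logic to made-up data using base R: bounds are set at the trial-level median plus or minus 3 standard deviations, and each participant's mean response time is compared against them. The participant IDs and response times are invented for illustration.

```r
# Toy data: 9 typical participants and 1 very slow one, 2 trials each.
toy <- data.frame(
  ParticipantID = rep(sprintf("p%02d", 1:10), each = 2),
  response_time = c(rep(c(1000, 1200), times = 9), 20000, 22000)
)

# Bounds come from the trial-level median and SD of the whole condition.
center <- median(toy$response_time)
spread <- sd(toy$response_time)
lower_bound <- center - 3 * spread
upper_bound <- center + 3 * spread

# Each participant's mean response time is compared against those bounds.
participant_mean <- tapply(toy$response_time, toy$ParticipantID, mean)
flagged <- names(participant_mean)[participant_mean < lower_bound |
                                     participant_mean > upper_bound]
print(flagged)
```

In this toy example only the slow participant (p10) is flagged. Because the standard deviation is computed over all trials, including the slow ones, extreme response times also widen the bounds they are judged against.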
Accuracy
Overall Accuracy
# Calculate accuracy over condition
accuracy <- combined_data %>%
  group_by(Condition) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')
print(accuracy)
# A tibble: 3 × 2
Condition Accuracy
<chr> <dbl>
1 Condition1 0.692
2 Condition2 0.551
3 Condition3 0.403
# Calculate condition 3 accuracy by participant
condition3_accuracy <- combined_data %>%
  filter(Condition == "Condition3") %>%
  group_by(ParticipantID) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )
high_accuracy <- condition3_accuracy %>%
  filter(participant_accuracy > 0.75)
print(high_accuracy)
# A tibble: 4 × 2
ParticipantID participant_accuracy
<chr> <dbl>
1 8nkzlv9ca3 0.944
2 ab2tzywewf 0.944
3 aqurvmvs3g 0.833
4 d8bz99odus 0.889
- Compared to the original experiment, our accuracy data show the same ordering of conditions but lower performance in every condition, by roughly 15 to 20 percentage points.
- In the original study, participants discovered, on average, more than 16 of the 18 pairs in the 2x2 condition, corresponding to an accuracy above 88.9%, while our participants achieved an average accuracy of 69.2%.
- For the 3x3 condition, the original study reported participants discovering more than 13 of the 18 pairs, or an accuracy above 72.2%, while our participants averaged 55.1%.
- Lastly, in the 4x4 condition, the original study indicated participants discovered nearly 10 pairs, or an accuracy of approximately 55.6%, while our participants averaged 40.3%. In addition, only 4 participants in our experiment discovered more than 75% of the pairs in this condition, fewer than half of the 9 participants who reached that level in the original experiment.
- Despite the differences in accuracy levels, both our study and the original experiment demonstrate a pattern of declining performance as ambiguity increases across conditions.
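The percentages in the comparison above follow from dividing the number of pairs discovered by the 18 pairs in each condition. The short calculation below reproduces them; note that the original counts are reported as "more than 16", "more than 13", and "nearly 10", so the original proportions are approximate bounds rather than exact means.

```r
# Approximate pair counts reported for the original study (out of 18).
original_pairs <- c(Condition1 = 16, Condition2 = 13, Condition3 = 10)
original_accuracy <- original_pairs / 18

# Accuracies observed in our replication.
replication_accuracy <- c(Condition1 = 0.692, Condition2 = 0.551, Condition3 = 0.403)

comparison <- data.frame(
  original = round(original_accuracy, 3),
  replication = replication_accuracy,
  gap = round(original_accuracy - replication_accuracy, 3)
)
print(comparison)
```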
Accuracy over images
accuracy_by_image <- combined_data %>%
  group_by(Condition, correct_image) %>%
  summarise(Accuracy = mean(correct), .groups = 'drop')
ggplot(accuracy_by_image, aes(x = correct_image, y = Accuracy, group = Condition, color = Condition)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Accuracy by Image for Each Condition",
    x = "Image",
    y = "Accuracy"
  ) +
  scale_x_continuous(breaks = unique(accuracy_by_image$correct_image)) +
  facet_wrap(~Condition, scales = "free_x", ncol = 1) +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
The plot presents accuracy by image for each condition, showing how participants performed across the 18 images in the 2x2, 3x3, and 4x4 conditions. In Condition 1, accuracy remains relatively stable across all images, with minimal variability. Condition 2 and Condition 3 show slightly greater variability but no extreme deviations.
Mean Accuracy Over Condition
# Calculate means and standard errors for plotting
participant_accuracy <- combined_data %>%
  group_by(Condition, ParticipantID) %>%
  summarise(
    participant_accuracy = mean(correct, na.rm = TRUE),
    .groups = "drop"
  )
mean_accuracy <- participant_accuracy %>%
  group_by(Condition) %>%
  summarise(
    mean_accuracy = mean(participant_accuracy, na.rm = TRUE),
    se_accuracy = sd(participant_accuracy, na.rm = TRUE) / sqrt(n()),
    .groups = "drop"
  )
# Create the bar plot
ggplot(mean_accuracy, aes(x = Condition, y = mean_accuracy, fill = Condition)) +
  geom_bar(stat = "identity", width = 0.4) +
  geom_errorbar(aes(ymin = mean_accuracy - se_accuracy, ymax = mean_accuracy + se_accuracy),
                width = 0.1, color = "black") +
  geom_hline(yintercept = 0.25, linetype = "dotted", color = "black") +
  annotate("text", x = 3.35, y = 0.27, label = "Chance",
           hjust = 0, size = 3, color = "black") +
  labs(
    title = "Mean Accuracy by Condition",
    x = "Learning Condition",
    y = "Proportion Correct"
  ) +
  scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.2)) +
  theme_minimal() +
  theme(legend.position = "none")
Original Study Plot
Compared to the original study, the error bars for Conditions 1 and 2 are larger in our experiment, indicating greater variability in participant performance. In the original study, the error bars are smaller, reflecting more consistent accuracy among participants across all conditions.
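For reference, the error bars in our plot span one standard error of the mean on either side of the condition mean, where the standard error is the standard deviation of participant-level accuracies divided by the square root of the number of participants. A minimal sketch with made-up accuracies:

```r
# Toy participant-level accuracies for one condition (made-up values).
participant_accuracy <- c(0.50, 0.72, 0.61, 0.83, 0.44, 0.67)

n <- length(participant_accuracy)
mean_accuracy <- mean(participant_accuracy)
se_accuracy <- sd(participant_accuracy) / sqrt(n)

# Bounds drawn by geom_errorbar: mean minus and plus one standard error.
c(lower = mean_accuracy - se_accuracy, upper = mean_accuracy + se_accuracy)
```

For a fixed number of participants, a wider bar therefore reflects a larger spread of accuracy between participants.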
Chance Performance
# The expected performance by chance for 2*2, 3*3, and 4*4 Condition are all 1/4
combined_data <- combined_data %>%
  mutate(chance_level = case_when(
    Condition == "Condition1" ~ 0.25,
    Condition == "Condition2" ~ 0.25,
    Condition == "Condition3" ~ 0.25
  ))
t_test_results <- combined_data %>%
  group_by(Condition) %>%
  summarise(
    t_test_p_value = t.test(correct, mu = unique(chance_level))$p.value,
    .groups = 'drop'
  )
print(t_test_results)
# A tibble: 3 × 2
Condition t_test_p_value
<chr> <dbl>
1 Condition1 2.94e-101
2 Condition2 1.52e- 49
3 Condition3 6.96e- 16
In the original study, it was reported that participants’ performance in all conditions significantly exceeded chance levels. In our study, the t-test results similarly show that performance across all conditions was significantly above the chance level of 0.25. The p-values for each condition are exceptionally small, confirming that participants were not guessing randomly but learning word-referent pairs.
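The test reported above is run on trial-level responses. As a hedged alternative, the same comparison against chance can be made on one accuracy score per participant, which treats the participant rather than the trial as the unit of analysis. The sketch below uses made-up accuracies and is not the analysis reported here.

```r
# Toy participant-level accuracies (made-up values, one per participant).
participant_accuracy <- c(0.50, 0.60, 0.40, 0.70, 0.55, 0.45)

# Chance level for a four-alternative forced-choice test.
chance_level <- 1 / 4

# One-sided, one-sample t-test: is mean accuracy greater than chance?
result <- t.test(participant_accuracy, mu = chance_level, alternative = "greater")
print(result$p.value)
```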
Effect of Condition (ANOVA)
# Aggregate trial-level data into participant-level data
participant_data <- combined_data %>%
  group_by(Condition, ParticipantID) %>%
  summarise(
    perc = mean(correct, na.rm = TRUE),
    res_time = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )
# Run Anova test
participant_data$Condition <- as.factor(participant_data$Condition)
anova <- aov(perc ~ Condition, data = participant_data)
summary(anova)
Df Sum Sq Mean Sq F value Pr(>F)
Condition 2 1.631 0.8155 11.68 2.42e-05 ***
Residuals 114 7.956 0.0698
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- Our ANOVA results (F(2, 114) = 11.68, p = 2.42e-05) revealed a statistically significant effect of condition on accuracy, indicating that participants’ performance varied significantly across the three conditions. This finding aligns with the original study, which also reported a decline in accuracy as task complexity increased from the 2x2 to 4x4 condition.
- As in the original study, this indicates that within-trial ambiguity plays an important role in participants’ ability to learn word-referent pairs.
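The size of the condition effect can be read off the ANOVA table as eta squared, the share of the total sum of squares attributed to Condition. The calculation below uses only the values printed in the table and also recomputes F as a consistency check; it is a hand calculation, not output from the `effectsize` package.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_condition <- 1.631
ss_residual  <- 7.956
df_condition <- 2
df_residual  <- 114

# Eta squared: share of total variance in accuracy attributed to Condition.
eta_squared <- ss_condition / (ss_condition + ss_residual)

# Recompute F from the mean squares as a consistency check.
f_value <- (ss_condition / df_condition) / (ss_residual / df_residual)

round(c(eta_squared = eta_squared, F = f_value), 3)
```

Eta squared comes out at about 0.17, meaning that roughly 17% of the variance in participant-level accuracy is associated with condition.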
Post-Hoc Analysis
tukey <- TukeyHSD(anova)
print(tukey)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = perc ~ Condition, data = participant_data)
$Condition
diff lwr upr p adj
Condition2-Condition1 -0.1410256 -0.2830956 0.001044269 0.0521636
Condition3-Condition1 -0.2891738 -0.4312437 -0.147103879 0.0000125
Condition3-Condition2 -0.1481481 -0.2902181 -0.006078238 0.0388659
- The post-hoc analysis shows that accuracy in Condition 3 was significantly lower than in both Condition 1 (p adj < 0.001) and Condition 2 (p adj = 0.039). This is consistent with the findings of the original study, where performance declined as ambiguity increased.
- The difference between Condition 1 and Condition 2 fell just short of the 0.05 threshold (p adj = 0.052). This suggests that participants found the 3x3 condition somewhat more challenging than the 2x2 condition, but the effect is less pronounced than the drop in accuracy observed in the 4x4 condition.
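When reading Tukey output, it can help to pull the adjusted p-values into a data frame and flag which pairwise differences cross the 0.05 threshold, rather than judging them by eye. The sketch below does this on made-up data; the toy values and the `significant` column are illustrative assumptions.

```r
# Toy participant-level data (made-up values), five per condition.
toy <- data.frame(
  Condition = factor(rep(c("Condition1", "Condition2", "Condition3"), each = 5)),
  perc = c(0.80, 0.75, 0.70, 0.85, 0.78,
           0.60, 0.66, 0.58, 0.70, 0.62,
           0.40, 0.35, 0.45, 0.38, 0.42)
)

fit <- aov(perc ~ Condition, data = toy)
tukey <- TukeyHSD(fit)

# TukeyHSD returns one matrix per factor; convert it to a data frame
# and flag comparisons whose adjusted p-value is below 0.05.
pairwise <- as.data.frame(tukey$Condition)
pairwise$significant <- pairwise[["p adj"]] < 0.05
print(pairwise)
```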
Exploratory analyses
Response Times
Overall response times
response_time <- combined_data %>%
  group_by(Condition) %>%
  summarise(
    Mean_ResponseTime = mean(response_time),
    SD_ResponseTime = sd(response_time),
    .groups = 'drop'
  )
print(response_time)
# A tibble: 3 × 3
Condition Mean_ResponseTime SD_ResponseTime
<chr> <dbl> <dbl>
1 Condition1 3487. 2868.
2 Condition2 3774. 3167.
3 Condition3 3688. 3016.
- In Condition 1, participants had the shortest mean response time, indicating that tasks with lower ambiguity allowed for faster responses. While Condition 2 had the longest mean response time, Condition 3 showed a mean response time that was slightly shorter than Condition 2 but still longer than Condition 1.
- The standard deviations are large in all three conditions, close in size to the means themselves, indicating substantial variability in participants’ response times across trials.
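One way to compare this variability across conditions is the coefficient of variation, the standard deviation expressed as a fraction of the mean. The calculation below uses the means and standard deviations printed in the response-time summary; the column names are chosen for this sketch only.

```r
# Means and SDs (ms) copied from the response-time summary table.
rt_summary <- data.frame(
  Condition = c("Condition1", "Condition2", "Condition3"),
  mean_rt = c(3487, 3774, 3688),
  sd_rt = c(2868, 3167, 3016)
)

# Coefficient of variation: SD expressed as a fraction of the mean.
rt_summary$cv <- round(rt_summary$sd_rt / rt_summary$mean_rt, 2)
print(rt_summary)
```

The ratio is above 0.8 in every condition, so relative variability is similar across the three conditions.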
Response times over images
response_time_by_image <- combined_data %>%
  group_by(Condition, correct_image) %>%
  summarise(
    Mean_ResponseTime = mean(response_time, na.rm = TRUE),
    .groups = 'drop'
  )
ggplot(response_time_by_image, aes(x = correct_image,
                                   y = Mean_ResponseTime,
                                   group = Condition,
                                   color = Condition)) +
  geom_line(size = 1) +
  geom_point(size = 2) +
  labs(
    title = "Response Times by Image for Each Condition",
    x = "Image",
    y = "Mean Response Time (ms)"
  ) +
  scale_x_continuous(breaks = unique(response_time_by_image$correct_image)) +
  facet_wrap(~Condition, scales = "free_x", ncol = 1) +
  theme_minimal()
In Condition 1, response times stabilize quickly after a drop from Image 1 to Image 2, showing minimal variability. Condition 2 shows a large initial drop but remains stable afterwards. Condition 3 has relatively stable response times and a smaller initial drop. The initial drop across all conditions might suggest that participants spend time adapting to the test phase at the beginning.
Response times vs. Accuracy
ggplot(participant_data,
       aes(x = res_time, y = perc)) +
  geom_point(alpha = 0.7,
             position = position_jitter(width = 0.2), size = 2,
             aes(color = ParticipantID)) +
  facet_wrap(~Condition, scales = "free") +
  labs(
    x = "Response Time (ms)",
    y = "Proportion Correct",
    title = "Response Time vs. Proportion Correct"
  ) +
  theme_minimal() +
  theme(
    legend.position = "none",
    strip.text = element_text(size = 12, face = "bold")
  )
The scatterplot shows the relationship between response time and proportion correct across conditions.
- In Condition 1, participants generally achieved high scores with lower response times, showing a cluster near the top-left.
- In Condition 2, there is more variability in both response times and scores, with some participants taking considerably longer to respond.
- In Condition 3, scores are lower overall, with response times spread more evenly, reflecting increased task difficulty.
Discussion
Summary of Replication Attempt
Our confirmatory analysis revealed that participants’ performance significantly declined as task ambiguity increased across the three conditions, consistent with the original study’s findings. Accuracy was highest in the 2x2 condition, lower in the 3x3 condition, and lowest in the 4x4 condition, with all conditions exceeding chance levels. While the general pattern replicated the original study, accuracy in our study was lower across all conditions, and variability was higher. These results suggest a partial replication of the original findings, capturing the overall trend but with differences in the strength and consistency of participant performance.
Commentary
- Our exploratory analyses showed that mean response times were longer in the 3x3 and 4x4 conditions than in the 2x2 condition, with the longest mean in the 3x3 condition. Interestingly, a small number of participants in the 4x4 condition still achieved relatively high accuracy within a short response time, suggesting that they may have used more effective strategies to resolve high ambiguity.
- While the overall trend of declining accuracy with increasing task complexity replicated the original findings, the lower accuracy and higher variability in our study suggest the presence of moderating factors. Differences in participant demographics (university students vs. Prolific users), experimental settings, or task delivery (e.g., instructions, timing) could potentially influence the test results.
- The main challenge in interpreting the results lies in the high variability observed in our study, which could be attributed to uncontrolled external factors. Another important aspect not tested in our replication is the role of foil probability, that is, how often incorrect choices were presented. We left this factor out because of limited time, which may have affected participants’ ability to differentiate correct from incorrect word pairings. Future studies could control external factors more tightly and take foil probability into account.