Replication of Study Rapid Word Learning Under Uncertainty via Cross-Situational Statistics by Yu & Smith (2007, Psychological Science)

Author

Junyi Hui

Published

December 10, 2024

Introduction

In Rapid Word Learning Under Uncertainty via Cross-Situational Statistics, Chen Yu and Linda B. Smith address the challenge of word learning in natural environments, where a word can in principle map onto indefinitely many referents. Previous approaches have focused on how learners constrain this problem using linguistic, social, and representational cues within a single moment or trial. The authors propose an alternative strategy: cross-situational learning, in which learners accumulate statistical information about word-referent pairings across multiple encounters rather than relying on single-trial mapping. The core issue, famously highlighted by Quine (1960), is the indeterminacy of reference in any single instance of language learning: when someone says “gavagai” while pointing at a field, it is unclear what exactly the word refers to. Traditional research has shown that children can use constraints to “fast map” words to their referents in a single encounter, but real-world learning environments are usually more ambiguous, with many words and potential referents presented at once.

The authors suggest that learners might solve this indeterminacy problem by tracking the co-occurrences of words and referents across multiple learning trials. Though there have been simulations supporting this idea, there has been little empirical research to show whether humans can engage in such cross-situational statistical learning. This gap in understanding, particularly in highly ambiguous environments, is what the authors aim to address in their experiments.

Methods

The original experiment included 38 participants, all of whom were undergraduate students from Indiana University. Participants received either course credit or $7 for their participation.

Power Analysis

library(pwr)
library(ggplot2)
effect_size<-1.425
alpha<-0.05
result <- pwr.t.test(d = effect_size, n = 38, sig.level = alpha,alternative="greater")
print(result)

     Two-sample t test power calculation 

              n = 38
              d = 1.425
      sig.level = 0.05
          power = 0.9999967
    alternative = greater

NOTE: n is number in *each* group

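The effect size used here (d = 1.425) is the smallest effect reported in the original study (4 × 4 condition). Note that pwr.t.test() defaults to a two-sample design, as the output header shows, whereas the planned comparison against chance is a one-sample test.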
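Because the planned comparison against chance is a one-sample, one-sided test, the power calculation can also be run for that design. The following is a minimal sketch using base R's power.t.test() rather than the pwr call above. The effect size and sample size come from the text; setting sd = 1 so that delta can be read as Cohen's d is an assumption of this sketch.

```r
# One-sample, one-sided power analysis with base R (stats package).
# With sd = 1, `delta` is interpreted as Cohen's d.
effect_size <- 1.425  # smallest effect reported in the original study
alpha <- 0.05

# Power achieved with the planned sample of 38 participants
power_at_38 <- power.t.test(
  n = 38, delta = effect_size, sd = 1, sig.level = alpha,
  type = "one.sample", alternative = "one.sided"
)
print(power_at_38$power)

# Sample size needed to reach 80% power for the same effect
n_for_80 <- power.t.test(
  delta = effect_size, sd = 1, sig.level = alpha, power = 0.80,
  type = "one.sample", alternative = "one.sided"
)
print(ceiling(n_for_80$n))
```

Since the original effect is very large, a replication sample may show a smaller effect; rerunning the second call with a more conservative value for effect_size gives a sense of how sensitive the required sample size is to that assumption.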

Planned Sample

The planned sample size is 38, matching the original study.

Materials

The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone. There were 54 unique objects and 54 unique pseudowords partitioned into three sets of 18 words and referents for use in the three conditions. The training trials were generated by randomly pairing each word with one picture; these were the word-referent pairs to be discovered by the learner. The three learning conditions differed in the number of words and referents presented on each training trial.

Participants were exposed to three learning conditions that differed in the number of words and referents presented per trial:

    • 2-2 Condition: 2 words and 2 pictures
    • 3-3 Condition: 3 words and 3 pictures
    • 4-4 Condition: 4 words and 4 pictures

Each training trial presented its words and pictures without indicating which picture corresponded to which word. Participants experienced six repetitions of each word-referent pair across trials, allowing for exposure to statistical relationships.
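The numbers above determine how many training trials each condition needs and how many candidate word-picture pairings a learner faces on a single trial (16 in the 4-4 condition, as the original report notes). A small sketch of that arithmetic follows; the variable names are chosen here for illustration.

```r
# Design arithmetic for the three conditions, using numbers from the text:
# 18 word-referent pairs per condition, each presented 6 times.
n_pairs <- 18
n_repetitions <- 6
items_per_trial <- c(2, 3, 4)

design <- data.frame(
  condition = paste0(items_per_trial, "-", items_per_trial),
  items_per_trial = items_per_trial,
  # Each trial shows `items_per_trial` pairs, so
  # trials = total presentations / items per trial
  n_trials = n_pairs * n_repetitions / items_per_trial,
  # Each word on a trial could go with each picture on that trial
  candidate_pairings_per_trial = items_per_trial^2
)
print(design)
```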

Procedure

The pictures were presented on a 17-in. computer screen, and the sound was played by the speakers connected to the same computer. Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.
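The test structure described above (one target plus three foils drawn from the same 18 training pictures) can be sketched as follows. This is an illustration only, not the code of the original or replication experiment; pictures are represented by hypothetical integer IDs, and make_test_trial() is a helper name invented for this sketch.

```r
# Build one four-alternative forced-choice (4AFC) test trial per trained word.
# Pictures are represented by hypothetical integer IDs 1..18.
set.seed(2024)  # arbitrary seed so the sketch is reproducible
n_pairs <- 18

make_test_trial <- function(target, n_pictures = n_pairs, n_foils = 3) {
  # Foils come from the other training pictures of the same condition
  foils <- sample(setdiff(seq_len(n_pictures), target), n_foils)
  # Shuffle so the target's position carries no information
  options <- sample(c(target, foils))
  list(target = target, options = options)
}

test_trials <- lapply(seq_len(n_pairs), make_test_trial)
print(test_trials[[1]])
```

With four options per trial, guessing yields an expected accuracy of 1/4, which is the chance level of 0.25 used in the tests against chance.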

Analysis Plan

The original study reports descriptive statistics (mean number of word-referent pairs discovered) and inferential statistics (t-tests comparing performance to chance):

“Figure 1 shows that in each condition, subjects learned more word-referent pairs than expected by chance, smallest t(37) = 8.785, p < .001, prep > .99, d = 1.425, one-tailed (4 × 4 condition). They discovered on average more than 16 of the 18 pairs in the 2 × 2 condition and more than 13 of the 18 pairs in the 3 × 3 condition—all this in less than 6 min of training per condition. Even in the 4 × 4 condition, with 16 potential associations per trial, subjects discovered almost 10 of the 18 word-referent pairs.”
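For a one-sample t-test, Cohen's d can be recovered from the test statistic as d = t / sqrt(n). Applying this to the values quoted above is a quick consistency check on the effect size used in the power analysis; the variable names below are illustrative.

```r
# Recover Cohen's d for a one-sample t-test from the reported t and sample size.
t_reported <- 8.785   # smallest t, 4 x 4 condition
n <- 38               # participants, so df = n - 1 = 37
d_recovered <- t_reported / sqrt(n)
print(round(d_recovered, 3))

# Number of pairs expected by chance on an 18-item 4AFC test
pairs_expected_by_chance <- 18 * 0.25
print(pairs_expected_by_chance)
```

The recovered value agrees with the reported d = 1.425 to three decimal places, and the chance expectation of 4.5 pairs is the baseline against which "almost 10 of the 18" pairs in the 4 × 4 condition should be read.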

Differences from Original Study

This replication will involve a different participant pool from the original study. While the original research tested undergraduate students, the replication will recruit participants from Prolific, an online platform that draws from a more diverse population. This shift in sample composition could affect the outcomes. Specifically, we might observe lower learning accuracy and greater variability in the results, because the broader demographic diversity of Prolific workers could introduce more variation in cognitive abilities, learning styles, and background knowledge than is found in the more homogeneous group of university students. This higher variance could make it more challenging to replicate the precise findings of the original study.

Methods Addendum (Post Data Collection)

Actual Sample

Sample Size: 38 workers on Prolific received coupons or minimum wage in California for their participation. Demographics: Participants from Prolific with varied backgrounds. Data Exclusions: There were no data exclusion criteria.

Differences from pre-data collection methods plan

None

Results

In Pilot A, the very large t-value and extremely small p-value indicate strong statistical evidence against the null hypothesis that mean accuracy equals the chance level of 0.25. The observed sample mean of 0.7917 is significantly higher than 0.25.

Data preparation

Pilot A

Pilot A Sample

Pilot A included 8 participants, each tested in only 1 condition.

  1. Data Cleaning:
    • Exclude trials with response times that exceed three standard deviations from each participant’s mean RT or are under 200 ms (The criteria may need to be slightly changed). Such trials are flagged as outliers, indicating inattentive or anticipatory responses.
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
folder_path <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/osfstorage-archive"
file_list <- list.files(path = folder_path, pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression, not a glob
combined_data <- file_list %>%
  lapply(read.csv) %>%
  bind_rows()

filtered_data <- combined_data %>%
  select(c(correct,response_time))%>%
  filter(!is.na(correct) & !is.na(response_time) & (correct == "true" | correct == "false")) %>%
  mutate(correct_numeric = ifelse(correct == "true", 1, 0))

t_test_result <- t.test(filtered_data$correct_numeric, mu = 0.25)

print(t_test_result)

    One Sample t-test

data:  filtered_data$correct_numeric
t = 15.95, df = 143, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.7245359 0.8587974
sample estimates:
mean of x 
0.7916667 
head(filtered_data)
  correct response_time correct_numeric
1    true          4226               1
2    true          3081               1
3    true          2087               1
4    true          1673               1
5    true          3188               1
6    true          5857               1
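The exclusion rule listed under Data Cleaning is not applied in the code above. A minimal sketch of how it could be implemented in base R is given below. It assumes a participant identifier column, here called participant_id, which is a hypothetical name not shown in the data above, and it reads "exceed three standard deviations from the mean" as "more than 3 SDs above the participant's mean".

```r
# Flag trials that are too fast (< 200 ms) or more than 3 SDs above the
# participant's own mean RT. `participant_id` is a hypothetical column name.
flag_rt_outliers <- function(df, min_rt = 200, n_sd = 3) {
  rt <- df$response_time
  id <- df$participant_id
  rt_mean <- ave(rt, id, FUN = function(x) mean(x, na.rm = TRUE))
  rt_sd   <- ave(rt, id, FUN = function(x) sd(x, na.rm = TRUE))
  df$rt_outlier <- rt < min_rt | rt > rt_mean + n_sd * rt_sd
  df
}

# Toy data: participant "p1" has one very slow trial, "p2" one anticipatory trial
toy <- data.frame(
  participant_id = rep(c("p1", "p2"), times = c(20, 10)),
  response_time  = c(rep(1000, 19), 20000, 100, rep(1500, 9))
)
flagged <- flag_rt_outliers(toy)
cleaned <- flagged[!flagged$rt_outlier, ]
print(table(flagged$participant_id, flagged$rt_outlier))
```

With only 18 test trials per participant and condition, a single extreme trial inflates that participant's SD considerably, so a 3 SD criterion will rarely flag more than one trial; this is one reason the criteria may need adjusting.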
  2. Calculate Cohen’s d and effect size:
mean_sample <- 0.7916667 
hypothesized_mean <- 0.25  
sd_sample <- sd(filtered_data$correct_numeric)
# Cohen's d
cohen_d <- (mean_sample - hypothesized_mean) / sd_sample
cohen_d
[1] 1.329133
t_statistic <- 15.95
df <- 7

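# Effect-size correlation r = t / sqrt(t^2 + df). The variable is named prep, but this is
# not Killeen's p_rep (probability of replication); df = 7 corresponds to 8 participants - 1,
# not the trial-level df = 143 of the t-test above.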
prep <- t_statistic / sqrt(t_statistic^2 + df)
prep
[1] 0.9865198
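The t-test above treats all 144 trials as independent observations (df = 143), while the effect-size step uses df = 7, which corresponds to the 8 pilot participants. One way to align the two is to average accuracy within each participant first and then test the participant means against chance, one-sided as in the original study. A minimal sketch on simulated data follows; the participant_id column and the simulated values are assumptions, not part of the pilot files shown here.

```r
# Participant-level test against chance (simulated data for illustration).
set.seed(1)  # arbitrary seed for reproducibility
n_participants <- 8
n_trials <- 18

trials <- data.frame(
  participant_id = rep(seq_len(n_participants), each = n_trials),
  correct_numeric = rbinom(n_participants * n_trials, size = 1, prob = 0.8)
)

# One accuracy score per participant
by_participant <- aggregate(correct_numeric ~ participant_id, data = trials, FUN = mean)

# One-sample, one-sided t-test of participant means against chance (0.25)
chance_level <- 0.25
tt <- t.test(by_participant$correct_numeric, mu = chance_level, alternative = "greater")
print(tt)

# Cohen's d computed on the same participant means
cohen_d_participant <- (mean(by_participant$correct_numeric) - chance_level) /
  sd(by_participant$correct_numeric)
print(cohen_d_participant)
```

The degrees of freedom of this test equal the number of participants minus one, which matches how the original study reports t(37) for 38 participants.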

Confirmatory analysis

  1. Test of Learning Accuracy: The primary analysis will compare the participants’ performance (number of correct word-referent pairings) across the three conditions: 2 × 2, 3 × 3, and 4 × 4 trials.

Descriptive statistics: mean correct word-referent pairings in each condition will be computed.

One-sample t-tests: will be used to determine whether the number of correct pairings is significantly greater than chance (25% in a four-alternative forced-choice test).

Repeated-measures ANOVA: will be conducted to compare performance across the three learning conditions to assess the effect of within-trial ambiguity on learning accuracy.

2. Exploring Variance: Given that the original study used undergraduate students and the replication will use a more diverse sample from Prolific, the variance in learning accuracy might be higher in the replication study. Variability in performance will be assessed by examining the standard deviations and comparing them across conditions and between the original and replication samples.

accuracy <- mean(filtered_data$correct_numeric, na.rm = TRUE)

accuracy_all<- data.frame(
  Group = "Overall", 
  Accuracy = accuracy
)

ggplot(accuracy_all, aes(x = Group, y = Accuracy, fill = Group)) +
  geom_bar(stat = "identity", show.legend = FALSE, width = 0.5) +  # show.legend = FALSE 
  labs(title = "Accuracy for overall Group", y = "Accuracy", x = "") +
  theme_minimal()

Response Time by Correctness

response_time_summary <- filtered_data %>%
  group_by(correct) %>%
  summarize(
    mean_response_time = mean(response_time, na.rm = TRUE),
    median_response_time = median(response_time, na.rm = TRUE),
    sd_response_time = sd(response_time, na.rm = TRUE)
  )
print(response_time_summary)
# A tibble: 2 × 4
  correct mean_response_time median_response_time sd_response_time
  <chr>                <dbl>                <dbl>            <dbl>
1 false                4303.                 4041            1841.
2 true                 3495.                 3158            1637.
ggplot(filtered_data, aes(x = factor(correct, labels = c("Incorrect", "Correct")), y = response_time)) +
  geom_boxplot(fill = "skyblue3", alpha = 0.7) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "red") +
  labs(
    title = "Response Time by Correctness",
    x = "Response Accuracy",
    y = "Response Time"
  ) +
  theme_minimal()

Pilot B

Pilot B Sample

Pilot B included 5 participants, each of whom completed all three conditions. Participants took about 21 minutes on average.

  1. Data Cleaning:
    • Exclude trials with response times that exceed three standard deviations from each participant’s mean RT or are under 200 ms (The criteria may need to be slightly changed). Such trials are flagged as outliers, indicating inattentive or anticipatory responses.
library(dplyr)
library(purrr)

folder_path1 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/osfstorage-archive pilot B/condition1"
folder_path2 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/osfstorage-archive pilot B/condition2"
folder_path3 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/osfstorage-archive pilot B/condition3"
# pattern is a regular expression, so match the ".csv" extension explicitly
file_listB1 <- list.files(path = folder_path1, pattern = "\\.csv$", full.names = TRUE)
file_listB2 <- list.files(path = folder_path2, pattern = "\\.csv$", full.names = TRUE)
file_listB3 <- list.files(path = folder_path3, pattern = "\\.csv$", full.names = TRUE)

combined_dataB1 <- file_listB1 %>%
  lapply(read.csv) %>%
  bind_rows()

combined_dataB2 <- file_listB2 %>%
  lapply(read.csv) %>%
  bind_rows()

combined_dataB3 <- file_listB3 %>%
  lapply(read.csv) %>%
  bind_rows()

colnames(combined_dataB1) <- paste0(colnames(combined_dataB1), "_condition1")
colnames(combined_dataB2) <- paste0(colnames(combined_dataB2), "_condition2")
colnames(combined_dataB3) <- paste0(colnames(combined_dataB3), "_condition3")

filtered_dataB1 <- combined_dataB1 %>%
  select(c(correct_condition1,response_time_condition1))%>%
  filter(!is.na(correct_condition1) & !is.na(response_time_condition1) & (correct_condition1 == "true" | correct_condition1 == "false")) %>%
  mutate(correct_numeric_condition1 = ifelse(correct_condition1 == "true", 1, 0))

filtered_dataB2 <- combined_dataB2 %>%
  select(c(correct_condition2,response_time_condition2))%>%
  filter(!is.na(correct_condition2) & !is.na(response_time_condition2) & (correct_condition2 == "true" | correct_condition2 == "false")) %>%
  mutate(correct_numeric_condition2 = ifelse(correct_condition2 == "true", 1, 0))

filtered_dataB3 <- combined_dataB3 %>%
  select(c(correct_condition3,response_time_condition3))%>%
  filter(!is.na(correct_condition3) & !is.na(response_time_condition3) & (correct_condition3 == "true" | correct_condition3 == "false")) %>%
  mutate(correct_numeric_condition3 = ifelse(correct_condition3 == "true", 1, 0))

combined_data_B <- bind_cols(filtered_dataB1, filtered_dataB2, filtered_dataB3)


head(combined_data_B)
  correct_condition1 response_time_condition1 correct_numeric_condition1
1               true                     7512                          1
2               true                     2971                          1
3              false                     4610                          0
4              false                     4259                          0
5               true                     1782                          1
6              false                     4434                          0
  correct_condition2 response_time_condition2 correct_numeric_condition2
1              false                     3377                          0
2              false                     2275                          0
3              false                     2440                          0
4              false                     4393                          0
5               true                     2575                          1
6              false                     2421                          0
  correct_condition3 response_time_condition3 correct_numeric_condition3
1              false                     6317                          0
2              false                     1870                          0
3               true                     1814                          1
4               true                     1062                          1
5              false                     1862                          0
6               true                     1564                          1
  2. Overall Accuracy
correctness <- combined_data_B[, c("correct_numeric_condition1", "correct_numeric_condition2", "correct_numeric_condition3")]
overall_accuracy <- sum(correctness, na.rm = TRUE) / (nrow(correctness) * ncol(correctness))
print(overall_accuracy)
[1] 0.6703704
  3. Calculate Mean Accuracy:
library(ggplot2)
mean_condition1 <- mean(filtered_dataB1$correct_numeric_condition1, na.rm = TRUE)
mean_condition2 <- mean(filtered_dataB2$correct_numeric_condition2, na.rm = TRUE)
mean_condition3 <- mean(filtered_dataB3$correct_numeric_condition3, na.rm = TRUE)

se_condition1 <- sd(filtered_dataB1$correct_numeric_condition1, na.rm = TRUE) / sqrt(nrow(filtered_dataB1))
se_condition2 <- sd(filtered_dataB2$correct_numeric_condition2, na.rm = TRUE) / sqrt(nrow(filtered_dataB2))
se_condition3 <- sd(filtered_dataB3$correct_numeric_condition3, na.rm = TRUE) / sqrt(nrow(filtered_dataB3))

mean_dataB <- data.frame(
  condition = c("Condition 1", "Condition 2", "Condition 3"),
  mean_correct_numeric = c(mean_condition1, mean_condition2, mean_condition3),
  se = c(se_condition1, se_condition2, se_condition3)
)

ggplot(mean_dataB, aes(x = condition, y = mean_correct_numeric, fill = condition)) +
  geom_bar(stat = "identity", width = 0.6) + 
  geom_errorbar(aes(ymin = mean_correct_numeric - se, ymax = mean_correct_numeric + se), 
                width = 0.2, color = "black") + 
  labs(title = "Mean Correct Numeric by Condition", 
       x = "Condition", 
       y = "Mean Correct Numeric") +
  theme_minimal() +
  theme(
    legend.position = "none",                
    plot.title = element_text(hjust = 0.5)   
  ) +
  scale_fill_manual(values = c("#FFB5E8", "#B5EAD7", "#FFDAC1"))

4. Response Time by Correctness

library(dplyr)
library(tidyr)
library(ggplot2)

long_data <- combined_data_B %>%
  pivot_longer(
    cols = c(
      response_time_condition1, response_time_condition2, response_time_condition3,
      correct_condition1, correct_condition2, correct_condition3
    ),
    names_to = c(".value", "condition"), 
    names_pattern = "(.*)_(condition[1-3])"
  ) %>%
  mutate(
    condition = case_when(
      condition == "condition1" ~ "Condition 1",
      condition == "condition2" ~ "Condition 2",
      condition == "condition3" ~ "Condition 3"
    )
  )

response_time_summary <- long_data %>%
  group_by(condition, correct) %>%
  summarize(
    mean_response_time = mean(response_time, na.rm = TRUE),
    median_response_time = median(response_time, na.rm = TRUE),
    sd_response_time = sd(response_time, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
print(response_time_summary)
# A tibble: 6 × 6
  condition   correct mean_response_time median_response_time sd_response_time
  <chr>       <chr>                <dbl>                <dbl>            <dbl>
1 Condition 1 false                4386.                4279             1667.
2 Condition 1 true                 3879.                2454             8601.
3 Condition 2 false                3578.                3501             1386.
4 Condition 2 true                 2455.                1840             1520.
5 Condition 3 false                3518.                3051             1336.
6 Condition 3 true                 2661.                2338.            1360.
# ℹ 1 more variable: n <int>
ggplot(long_data, aes(
  x = factor(correct, labels = c("Incorrect", "Correct")),
  y = response_time,
  fill = condition
)) +
  geom_boxplot(alpha = 0.7) + 
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "black") + 
  labs(
    title = "Response Time by Correctness and Condition",
    x = "Response Accuracy",
    y = "Response Time (ms)"
  ) +
  facet_wrap(~ condition) + # Separate plots for each condition
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12),
    strip.text = element_text(size = 14, face = "bold") 
  ) +
  scale_fill_manual(values = c("#FFB5E8", "#B5EAD7", "#FFDAC1")) + 
  scale_y_continuous(limits = c(0, 10000), breaks = seq(0, 10000, by = 2000)) 
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 2 rows containing non-finite outside the scale range
(`stat_summary()`).

  5. Chance Performance
chance_level <- 0.25

# Perform one-sample t-tests for each condition (t.test() drops missing values itself)
t_condition1 <- t.test(combined_data_B$correct_numeric_condition1, mu = chance_level, alternative = "two.sided")
t_condition2 <- t.test(combined_data_B$correct_numeric_condition2, mu = chance_level, alternative = "two.sided")
t_condition3 <- t.test(combined_data_B$correct_numeric_condition3, mu = chance_level, alternative = "two.sided")

# Print t-test results
print("Condition 1:")
[1] "Condition 1:"
print(t_condition1)

    One Sample t-test

data:  combined_data_B$correct_numeric_condition1
t = 10.311, df = 89, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.6401940 0.8264727
sample estimates:
mean of x 
0.7333333 
print("Condition 2:")
[1] "Condition 2:"
print(t_condition2)

    One Sample t-test

data:  combined_data_B$correct_numeric_condition2
t = 9.9462, df = 89, p-value = 4.135e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.6278852 0.8165593
sample estimates:
mean of x 
0.7222222 
print("Condition 3:")
[1] "Condition 3:"
print(t_condition3)

    One Sample t-test

data:  combined_data_B$correct_numeric_condition3
t = 5.8011, df = 89, p-value = 9.924e-08
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.4508980 0.6602131
sample estimates:
mean of x 
0.5555556 

6. Effect of Condition (ANOVA)

library(tidyr)
long_data <- combined_data_B %>%
  pivot_longer(
    cols = c("correct_numeric_condition1", "correct_numeric_condition2", "correct_numeric_condition3"),  # Specify the correctness columns
    names_to = "condition",                              # New column for condition names
    values_to = "correctness"                            # New column for correctness values
  )
anova_result <- aov(correctness ~ condition, data = long_data)

summary(anova_result)
             Df Sum Sq Mean Sq F value Pr(>F)  
condition     2   1.79  0.8926   4.118 0.0173 *
Residuals   267  57.88  0.2168                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7. Post-Hoc Analysis

tukey <- TukeyHSD(anova_result)
print(tukey)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = correctness ~ condition, data = long_data)

$condition
                                                             diff        lwr
correct_numeric_condition2-correct_numeric_condition1 -0.01111111 -0.1746902
correct_numeric_condition3-correct_numeric_condition1 -0.17777778 -0.3413569
correct_numeric_condition3-correct_numeric_condition2 -0.16666667 -0.3302457
                                                               upr     p adj
correct_numeric_condition2-correct_numeric_condition1  0.152467963 0.9859710
correct_numeric_condition3-correct_numeric_condition1 -0.014198703 0.0294468
correct_numeric_condition3-correct_numeric_condition2 -0.003087592 0.0447158

Final Experiment

Final Experiment Sample

The final sample included 41 participants, each of whom completed all three conditions. Participants took about 21 minutes on average. The complete experiment is available at the following links:

    • https://ucsd-psych201a.github.io/yu2007/123_234.html
    • https://ucsd-psych201a.github.io/yu2007/132_243.html
    • https://ucsd-psych201a.github.io/yu2007/213_324.html
    • https://ucsd-psych201a.github.io/yu2007/231_342.html
    • https://ucsd-psych201a.github.io/yu2007/321_432.html
    • https://ucsd-psych201a.github.io/yu2007/312_423.html

  1. Data Cleaning:
    • Exclude participants who did not complete all 3 conditions.
library(dplyr)
library(purrr)

# Paths to the data
path1 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition1"
path2 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition2"
path3 <- "/Users/user/Desktop/Fall_quarter/PSCH/final project/final data/condition3"

# List of files for each condition
list1 <- list.files(path = path1, pattern = "\\.csv$", full.names = TRUE)  # pattern is a regular expression
list2 <- list.files(path = path2, pattern = "\\.csv$", full.names = TRUE)
list3 <- list.files(path = path3, pattern = "\\.csv$", full.names = TRUE)

# Function to clean and process each dataset
process_data <- function(file_list, condition) {
  file_list %>%
    lapply(read.csv) %>%
    lapply(function(df) {
      df <- df[, c("correct", "response_time")]  # Select only the required columns
      df$correct <- as.logical(df$correct)      # Ensure "correct" is logical (TRUE/FALSE)
      df$response_time <- as.numeric(df$response_time)  # Ensure "response_time" is numeric
      na.omit(df)                               # Remove rows with NA values
    }) %>%
    bind_rows() %>%
    rename_with(~ paste0(., "_", condition))    # Add condition-specific suffix to column names
}

# Process each condition
final_data1 <- process_data(list1, "condition1")
final_data2 <- process_data(list2, "condition2")
final_data3 <- process_data(list3, "condition3")

# Filter and transform each dataset
filtered_data1 <- final_data1 %>%
  filter(!is.na(correct_condition1) & !is.na(response_time_condition1)) %>%
  mutate(correct_numeric_condition1 = ifelse(correct_condition1, 1, 0))  # Convert logical to numeric (1/0)

filtered_data2 <- final_data2 %>%
  filter(!is.na(correct_condition2) & !is.na(response_time_condition2)) %>%
  mutate(correct_numeric_condition2 = ifelse(correct_condition2, 1, 0))  # Convert logical to numeric (1/0)

filtered_data3 <- final_data3 %>%
  filter(!is.na(correct_condition3) & !is.na(response_time_condition3)) %>%
  mutate(correct_numeric_condition3 = ifelse(correct_condition3, 1, 0))  # Convert logical to numeric (1/0)

# Combine all filtered data frames into one
# Using bind_rows() instead of bind_cols() to handle row mismatches
final_data <- bind_rows(
  filtered_data1 %>% mutate(condition = "condition1"),
  filtered_data2 %>% mutate(condition = "condition2"),
  filtered_data3 %>% mutate(condition = "condition3")
)

# View the head of the final data frame
head(final_data)
  correct_condition1 response_time_condition1 correct_numeric_condition1
1               TRUE                     4313                          1
2              FALSE                     3969                          0
3              FALSE                     3782                          0
4              FALSE                     1746                          0
5              FALSE                     2075                          0
6              FALSE                     8708                          0
   condition correct_condition2 response_time_condition2
1 condition1                 NA                       NA
2 condition1                 NA                       NA
3 condition1                 NA                       NA
4 condition1                 NA                       NA
5 condition1                 NA                       NA
6 condition1                 NA                       NA
  correct_numeric_condition2 correct_condition3 response_time_condition3
1                         NA                 NA                       NA
2                         NA                 NA                       NA
3                         NA                 NA                       NA
4                         NA                 NA                       NA
5                         NA                 NA                       NA
6                         NA                 NA                       NA
  correct_numeric_condition3
1                         NA
2                         NA
3                         NA
4                         NA
5                         NA
6                         NA
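  2. Overall Accuracy
correctness <- final_data[, c("correct_numeric_condition1", "correct_numeric_condition2", "correct_numeric_condition3")]
# final_data is stacked with bind_rows(), so each row holds a score for only one condition;
# average over the non-missing scores rather than dividing by rows x columns
overall_accuracy <- mean(as.matrix(correctness), na.rm = TRUE)
print(overall_accuracy)
[1] 0.5383921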

3. Calculate Mean Accuracy:

library(ggplot2)
mean_condition1 <- mean(filtered_data1$correct_numeric_condition1, na.rm = TRUE)
mean_condition2 <- mean(filtered_data2$correct_numeric_condition2, na.rm = TRUE)
mean_condition3 <- mean(filtered_data3$correct_numeric_condition3, na.rm = TRUE)

se_condition1 <- sd(filtered_data1$correct_numeric_condition1, na.rm = TRUE) / sqrt(nrow(filtered_data1))
se_condition2 <- sd(filtered_data2$correct_numeric_condition2, na.rm = TRUE) / sqrt(nrow(filtered_data2))
se_condition3 <- sd(filtered_data3$correct_numeric_condition3, na.rm = TRUE) / sqrt(nrow(filtered_data3))

mean_data <- data.frame(
  condition = c("Condition 1", "Condition 2", "Condition 3"),
  mean_correct_numeric = c(mean_condition1, mean_condition2, mean_condition3),
  se = c(se_condition1, se_condition2, se_condition3)
)

ggplot(mean_data, aes(x = condition, y = mean_correct_numeric, fill = condition)) +
  geom_bar(stat = "identity", width = 0.6) + 
  geom_errorbar(aes(ymin = mean_correct_numeric - se, ymax = mean_correct_numeric + se), 
                width = 0.2, color = "black") + 
  geom_hline(yintercept = 0.25, linetype = "dotted", color = "grey", linewidth = 1) +  # chance level
  labs(title = "Mean Correct Numeric by Condition", 
       x = "Condition", 
       y = "Mean Correct Numeric") +
  theme_minimal() +
  theme(
    legend.position = "none",                
    plot.title = element_text(hjust = 0.5)   
  ) +
  scale_fill_manual(values = c("#FFB5E8", "#B5EAD7", "#FFDAC1"))

Figure: original research plot
  4. Chance Performance
chance_level <- 0.25

# Perform one-sample t-tests for each condition (t.test() drops missing values itself)
t_condition1 <- t.test(filtered_data1$correct_numeric_condition1, mu = chance_level, alternative = "two.sided")
t_condition2 <- t.test(filtered_data2$correct_numeric_condition2, mu = chance_level, alternative = "two.sided")
t_condition3 <- t.test(filtered_data3$correct_numeric_condition3, mu = chance_level, alternative = "two.sided")


# Print t-test results
print("Condition 1:")
[1] "Condition 1:"
print(t_condition1)

    One Sample t-test

data:  filtered_data1$correct_numeric_condition1
t = 24.935, df = 737, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.6450969 0.7126266
sample estimates:
mean of x 
0.6788618 
print("Condition 2:")
[1] "Condition 2:"
print(t_condition2)

    One Sample t-test

data:  filtered_data2$correct_numeric_condition2
t = 15.988, df = 737, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.5073392 0.5793817
sample estimates:
mean of x 
0.5433604 
print("Condition 3:")
[1] "Condition 3:"
print(t_condition3)

    One Sample t-test

data:  filtered_data3$correct_numeric_condition3
t = 7.946, df = 737, p-value = 7.215e-15
alternative hypothesis: true mean is not equal to 0.25
95 percent confidence interval:
 0.3576348 0.4282730
sample estimates:
mean of x 
0.3929539 

5. Effect of Condition (ANOVA)

data_long <- final_data %>%
  pivot_longer(
    cols = starts_with("correct_numeric_condition"), # All condition columns
    values_to = "score",                            # New column for scores
    values_drop_na = TRUE                           # Drop rows with NA values
  ) 
data_long<-data_long %>%
  group_by(name)%>%
  mutate(id = ceiling(row_number() / 18))%>%
  ungroup()
aggregated_data <- data_long %>%
  group_by(id, condition) %>%
  summarize(mean_response = mean(score), .groups = "drop")
anova_result <- aov(mean_response ~ condition + Error(id / condition), data = aggregated_data )

summary(anova_result)

Error: id
          Df   Sum Sq  Mean Sq F value Pr(>F)
Residuals  1 0.001176 0.001176               

Error: id:condition
          Df Sum Sq Mean Sq
condition  2  1.478  0.7388

Error: Within
           Df Sum Sq Mean Sq F value Pr(>F)  
condition   2  0.376  0.1880   2.717 0.0702 .
Residuals 117  8.097  0.0692                 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
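
In the output above, the Error: id stratum has a single degree of freedom, which suggests that id entered the model as a numeric variable rather than as a factor. For a repeated-measures ANOVA, the participant identifier is normally converted with factor() so that each participant gets its own level; with 41 participants and 3 conditions, the condition effect would then be tested against (41 - 1) × (3 - 1) = 80 error degrees of freedom. A minimal sketch on simulated data follows (all values and names below are illustrative, not the study data):

```r
# Repeated-measures ANOVA with the participant identifier as a factor.
set.seed(42)  # arbitrary seed for reproducibility
n_participants <- 10
conditions <- c("condition1", "condition2", "condition3")

# One mean accuracy per participant and condition (simulated)
sim <- expand.grid(id = seq_len(n_participants), condition = conditions)
true_means <- c(0.70, 0.55, 0.40)
sim$mean_response <- rnorm(
  nrow(sim),
  mean = true_means[as.integer(sim$condition)],
  sd = 0.10
)

# Treat the participant identifier as a factor, not a number
sim$id <- factor(sim$id)

rm_anova <- aov(mean_response ~ condition + Error(id / condition), data = sim)
summary(rm_anova)
```

In this sketch the condition effect appears in the id:condition stratum with (10 - 1) × (3 - 1) = 18 residual degrees of freedom, and there is no separate Within stratum because each participant contributes exactly one score per condition.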

6. Post-Hoc Analysis

pairwise.t.test(aggregated_data$mean_response, aggregated_data$condition, paired = TRUE, p.adjust.method = "bonferroni")

    Pairwise comparisons using paired t tests 

data:  aggregated_data$mean_response and aggregated_data$condition 

           condition1 condition2
condition2 7.3e-05    -         
condition3 6.2e-10    0.0013    

P value adjustment method: bonferroni 
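
For reference, the Bonferroni adjustment used above multiplies each unadjusted p-value by the number of comparisons and caps the result at 1. The short sketch below illustrates this with `p.adjust()`; the vector `raw_p` holds hypothetical values, not the unadjusted p-values from this study.

```r
# Sketch: Bonferroni adjustment = raw p-value * number of comparisons, capped at 1.
raw_p <- c(0.001, 0.02, 0.40)   # hypothetical unadjusted p-values

# Built-in adjustment
adjusted_p <- p.adjust(raw_p, method = "bonferroni")

# Same computation written out by hand
manual_p <- pmin(1, raw_p * length(raw_p))

print(adjusted_p)
print(manual_p)
```

With three pairwise comparisons, an adjusted p-value below 0.05 corresponds to an unadjusted p-value below roughly 0.0167.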

Exploratory analyses

Response Time by Correctness

library(dplyr)
library(tidyr)
library(ggplot2)

final_data_na <- final_data %>%
  filter(!is.na(response_time_condition1) & !is.na(correct_condition1)) %>% # Remove rows with NA values
  select(1:4) # Select only the first four columns

response_time_summary <- final_data_na %>%
  group_by(condition) %>%
  summarize(
    mean_response_time = mean(response_time_condition1, na.rm = TRUE),
    median_response_time = median(response_time_condition1, na.rm = TRUE),
    sd_response_time = sd(response_time_condition1, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )
print(response_time_summary)
# A tibble: 1 × 5
  condition  mean_response_time median_response_time sd_response_time     n
  <chr>                   <dbl>                <dbl>            <dbl> <int>
1 condition1              3917.                 2833            4552.   738
# Boxplot of Condition 1 response times, split by whether the response was correct
ggplot(final_data_na, aes(
  x = factor(correct_condition1, labels = c("Incorrect", "Correct")),
  y = response_time_condition1
)) +
  geom_boxplot(alpha = 0.7) +
  # Black diamond marks the group mean
  stat_summary(fun = mean, geom = "point", shape = 18, size = 3, color = "black") +
  labs(
    title = "Response Time by Correctness (Condition 1)",
    x = "Response Accuracy",
    y = "Response Time (ms)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
    axis.title.x = element_text(size = 14),
    axis.title.y = element_text(size = 14),
    axis.text.x = element_text(size = 12),
    axis.text.y = element_text(size = 12)
  ) +
  # Scale limits drop responses slower than 10,000 ms before the statistics
  # are computed (see the warnings below)
  scale_y_continuous(limits = c(0, 10000), breaks = seq(0, 10000, by = 2000))
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Warning: Removed 28 rows containing non-finite outside the scale range
(`stat_summary()`).
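
The summary table above groups by condition only, so it does not separate correct from incorrect responses. A minimal sketch of a summary split by correctness is given below, using base R on simulated data. The data frame `sim_rt`, its column names, and the simulated response-time distribution are assumptions for illustration and do not come from the replication dataset.

```r
# Sketch on simulated data: response-time summary split by response accuracy.
set.seed(3)
n_rows <- 200

# Hypothetical trial-level data: 1 = correct, 0 = incorrect, times in ms
sim_rt <- data.frame(
  correct = rbinom(n_rows, size = 1, prob = 0.6),
  response_time = rlnorm(n_rows, meanlog = 8, sdlog = 0.6)
)

# Label the accuracy groups explicitly so the level order is unambiguous
sim_rt$accuracy <- factor(sim_rt$correct, levels = c(0, 1),
                          labels = c("Incorrect", "Correct"))

# Mean, median, and count of response times for each accuracy group
rt_by_accuracy <- data.frame(
  accuracy = levels(sim_rt$accuracy),
  mean_rt = as.numeric(tapply(sim_rt$response_time, sim_rt$accuracy, mean)),
  median_rt = as.numeric(tapply(sim_rt$response_time, sim_rt$accuracy, median)),
  n = as.integer(table(sim_rt$accuracy))
)
print(rt_by_accuracy)
```

Setting `levels` and `labels` together avoids relying on the default ordering of the accuracy values when the labels are attached.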

Discussion

Summary of Replication Attempt

A repeated-measures ANOVA was conducted to evaluate the effect of condition on mean accuracy. As specified, it showed no statistically significant main effect of condition (F(2, 117) = 2.717, p = 0.0702). This result should be read with caution: the "Error: id" stratum in the ANOVA output has a single degree of freedom, which indicates that id was treated as a numeric variable rather than as a participant factor, so the reported F-test may not reflect the intended within-subject comparison. Pairwise comparisons using paired t-tests with Bonferroni adjustment indicated significant differences between all pairs of conditions:

Condition 1 vs. Condition 2: p < 0.001
Condition 1 vs. Condition 3: p < 0.001
Condition 2 vs. Condition 3: p = 0.0013

These results suggest that, although the omnibus test as run did not reach significance, every pair of conditions differs significantly.

Commentary

The differences in participant recruitment (college students in the original study vs. Prolific participants here) likely explain the discrepancies between my study and the original paper:

Lower Accuracy: Prolific participants may be less engaged, more distracted, or less familiar with similar cognitive tasks compared to college students.
Higher Variability: The broader demographics and uncontrolled testing environments of Prolific participants could increase response variability, reducing statistical power.
Non-Significant Main Effect: The higher residual variance and lower performance levels in my Prolific sample likely reduced the F-statistic, leading to a non-significant main effect of condition.
Broader Generalizability: Despite these limitations, the results from a diverse Prolific sample provide valuable insights into how this task might generalize beyond a college-student population.

Design Overview

Similarities: Performance in all conditions was significantly above chance, and pairwise differences followed the same trend (2×2 > 3×3 > 4×4). This suggests participants were able to learn word-referent pairs, even under high levels of ambiguity.

Differences: Performance levels were lower across all conditions in my study. The main effect of condition was not significant in my study, contrasting with the original paper’s highly significant main effect and large effect size. These differences may be due to methodological variations (e.g., task design, training time, or participant characteristics) or issues related to sample size and variability in my dataset.

Author Contribution Statement

Hui Junyi: Experiment coding (trial/training phase coding; pre-test audio check and fullscreen check; randomization; condition order counterbalancing code; debugging; review); pre-registration revisions; Yawen Dong: Experiment coding (test phase code; randomization; debugging; review); pre-registration revisions; OSF datapipe setup; Allison Park: Stimuli analysis and selection; pre-registration original draft; Prolific experiment management; code review; Pengjia Cui: Stimuli generation code; code review