Replication of Yu & Smith (2007) by Cui, Dong, Hui, and Park (2024)

Author

Cui, Dong, Hui, and Park

Published

December 11, 2024

Introduction

This project aims to replicate Experiment 1 of Yu & Smith (2007). The experiment is a cross-situational word-learning task involving pairs of pseudowords and images. It investigates whether word learning can happen under uncertainty (when it is not clear what exactly a word refers to) and cross-situationally (over multiple trials or instances, not just one targeted occurrence). The experiment involves two components: a training phase and a knowledge test. During the training phase, images are presented in sets of two, three, or four, varying by condition; at the same time, the corresponding pseudowords are played. This comprises one training trial. After the training phase, the participant takes a four-alternative forced-choice test: four images are presented and a single pseudoword is played, and the participant is tasked with selecting the referent of the pseudoword. Each participant experiences all three learning conditions (2x2, 3x3, and 4x4), with condition order counterbalanced across participants.

The repository for this project is available on GitHub. The original paper, Rapid Word Learning Under Uncertainty via Cross-Situational Statistics (Yu & Smith, 2007), is also available on GitHub. The replication’s pre-registration is available on OSF.

The replication project’s experimental paradigm is available on the repository’s GitHub Pages.

Methods

Power Analysis

The original paper reported an effect size of d = 1.425. This effect size is large enough (d > 1) that the G*Power application was unable to run a post-hoc power analysis. As such, the paper’s original sample size of n = 38 should be sufficient to power this replication project.
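As a sanity check, the implied power at the original effect size and sample size can be computed directly; below is a minimal sketch assuming a one-sample, one-tailed t-test at α = .05, using the pwr package (this was not part of the original analysis).

## Sketch: power for a one-sample, one-tailed t-test at the original
## effect size (d = 1.425) and sample size (n = 38). Assumes the pwr package.
library(pwr)
pwr.t.test(n = 38, d = 1.425, sig.level = 0.05,
           type = "one.sample", alternative = "greater")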

Planned Sample

The original paper recruited 38 college students for its first experiment. This replication project originally aimed to recruit the same number of participants from the online survey platform Prolific. However, because the study was divided into six condition orders for counterbalancing purposes, we rounded up to 42 participants so that the same number of participants could be recruited for each condition order.
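For illustration, the six condition orders and the even allocation of 42 participants can be generated as in the sketch below (our illustration, not the code actually used to configure the experiment).

## Sketch: enumerate the six possible orders of the three conditions and
## assign 7 participants to each (6 x 7 = 42). Not the experiment's assignment code.
orders <- subset(expand.grid(first = 1:3, second = 1:3, third = 1:3),
                 first != second & second != third & first != third)
nrow(orders)                                      # 6 condition orders
assignment <- orders[rep(seq_len(nrow(orders)), each = 7), ]
nrow(assignment)                                  # 42 participants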

Materials

The original paper describes its stimuli as follows:

“The stimuli were slides containing pictures of uncommon objects (e.g., canister, facial sauna, and rasp) paired with auditorily presented pseudowords. These artificial words were generated by a computer program to sample English forms that were broadly phonotactically probable; they were produced by a synthetic female voice in monotone.”

We obtained 18 pairs of images and audio files from the original experiment. However, the experiment requires 54 pairs (18 per condition) to complete the within-participant manipulation of learning conditions. The remaining 36 image and pseudoword pairs were selected from the Novel Object and Unusual Name (NOUN) Database.

Sixteen out of the eighteen original audio stimuli available are two-syllable pseudowords with stress on the initial syllable. Two pseudowords (numbered 6 and 10 in the set of original stimuli) are single syllables. The majority of pseudoword stimuli follow CV or CVC syllable structure, avoiding vowel onsets in both word-initial and syllable-initial positions. The stimuli also avoid highly clustered codas, with only two out of eighteen original stimuli containing syllable-final CC clusters; the other stimuli tend toward single-C codas if present, with eight stimuli featuring no word-final coda. The stimuli tend not to feature fricatives and instead generally consist of stops. There are no diphthongs or vowel hiatus in the original stimuli. The original stimuli are, as reported in the paper, generally delivered in monotone.

Accordingly, the majority of newly selected stimuli consist of two-syllable, consonant-initial pseudowords preferring CV or CVC syllables, with no diphthongs or vowel hiatus; strongly preferring stops over fricatives; and with consonant clustering no more complex than CC in either onset or coda position. Where applicable, stress falls on the first syllable. Newly selected stimuli were also produced in monotone, using the Google TTS audio download option on voicegenerator.io.

Approximately 24 NOUN pseudowords best matched the phonological criteria of the original 18 stimuli. The remaining 12 pseudowords still fit the original paper’s criterion of being “broadly phonotactically probable”; while matching the original stimuli in syllable structure and count, they contain some sound types not represented in the original stimuli. However, word-initial voiced fricatives, the affricate “ch”, word-initial liquids, and word-initial “h” are all very common sounds in English, and their inclusion in the new stimuli should not affect participants’ ability to learn word and image pairs.
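A rough, orthography-based version of this screening can be written as a filter over candidate names. The sketch below is hypothetical (it is not the procedure used to query the NOUN database) and checks only for a consonant onset, exactly two vowel nuclei, and no adjacent vowels.

## Hypothetical screen for candidate pseudowords, approximated over spelling:
## consonant-initial, exactly two vowel letters (nuclei), no vowel sequences.
is_candidate <- function(word) {
  w <- tolower(word)
  m <- gregexpr("[aeiou]+", w)[[1]]                # runs of vowel letters
  n_nuclei <- if (m[1] == -1) 0 else length(m)
  no_hiatus <- all(attr(m, "match.length") == 1)   # no adjacent vowels
  !grepl("^[aeiou]", w) && n_nuclei == 2 && no_hiatus
}

## made-up example strings, not the actual stimuli
sapply(c("bosa", "regli", "stigson", "aiboo"), is_candidate)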

Procedure

The original paper’s procedure is described as follows:

“To form each trial, we randomly selected several (2, 3, or 4, depending on condition) word-referent pairs from the 18 word-referent pairs for that condition; across trials in a condition, each word and referent were presented six times.”

“Subjects were instructed that their task was to learn the words and referents, but they were not told that there was one referent per word. They were told that multiple words and pictures would co-occur on each trial and that their task was to figure out across trials which word went with which picture. After training in each condition, subjects received a four-alternative forced-choice test of learning. On the test, they were presented with 1 word and 4 pictures and asked to indicate the picture named by that word. The target picture and the 3 foils were all drawn from the set of 18 training pictures.”

This procedure was followed as closely as possible.
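As one concrete way to implement the trial construction described in the quote, the sketch below (our R illustration; the experiment itself runs in the browser, so this is illustrative only) draws k distinct pairs per trial from a pool in which each of the 18 pairs appears six times.

## Sketch (not the original authors' or this project's experiment code):
## build one condition's training schedule. Each of 18 word-referent pairs
## appears 6 times; each trial shows k distinct pairs.
build_schedule <- function(n_pairs = 18, n_reps = 6, k = 3, max_restarts = 100) {
  n_trials <- n_pairs * n_reps / k
  for (r in seq_len(max_restarts)) {
    remaining <- rep(seq_len(n_pairs), times = n_reps)   # appearance slots
    trials <- vector("list", n_trials)
    ok <- TRUE
    for (t in seq_len(n_trials)) {
      pool <- unique(remaining)
      if (length(pool) < k) { ok <- FALSE; break }       # stuck; restart
      chosen <- sample(pool, k)                          # k distinct pairs
      trials[[t]] <- chosen
      remaining <- remaining[-match(chosen, remaining)]  # use up one copy of each
    }
    if (ok) return(do.call(rbind, trials))               # rows = trials
  }
  stop("could not build a schedule; increase max_restarts")
}
head(build_schedule(k = 3))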

Analysis Plan

The original paper features a one-way ANOVA analyzing the differences between the 2x2, 3x3, and 4x4 learning conditions. It also includes a one-tailed t-test comparing the number of word-referent pairs learned by participants against the chance rate of correct guesses on the post-training test. These are the primary statistics for Experiment 1. The t-test addresses the original paper’s underlying inference (whether participants are able to learn words under rapid and uncertain presentation) by testing whether participants learned words at a rate better than chance. The ANOVA tests the paper’s primary inference by comparing participant results between conditions to see whether manipulating the number of word and image pairs presented per trial has an effect.

The original paper does not describe data cleaning or whether any subjects withdrew or were dropped from the first experiment.

Participant results from the knowledge test after each condition were selected from the data, then transformed into boolean True (correct) or False (incorrect) values rather than the character strings “Correct” or “Incorrect”. Each participant’s percentage of correct answers was then calculated for use in the t-test and ANOVA.

Differences from Original Study

The original paper specifies that stimuli were presented on a 17-in. computer screen; this replication cannot standardize screen or speaker quality. However, this should not affect the results as long as a participant’s speakers or headphones are good enough to distinguish the pseudowords. The project implemented an audio check that displayed pictures of a wallet, some walleye fish, walnuts, and a walrus. The word “walrus”, rendered with the same voice generator as the pseudoword stimuli, was then played, and participants were required to correctly select the walrus before they were allowed to proceed. Participants should therefore have adjusted their audio to reasonable and understandable levels prior to beginning the trials.

This replication project recruited participants from the online survey platform Prolific and compensated them with $5 USD, rather than recruiting undergraduate students to complete an in-person experiment for course credits.

For this project, we pre-registered the exclusion of participants with reaction times 3 standard deviations away from the mean (in either direction). While the original study did not note the exclusion of any participants, the participants for this replication were drawn from an online platform, and this metric was implemented to exclude bots or inattentive participants.

Methods Addendum (Post Data Collection)

Actual Sample

One participant did not complete all three conditions, resulting in a sample size of 41 participants for this replication project. While some participants had average reaction times greater than three standard deviations away from the mean, no participants met this criterion across all three conditions, and thus no participants were excluded from the final analysis.

Differences from pre-data collection methods plan

There were no differences from the planned methods.

Results

Data preparation

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(ggthemes)

## initialize dataframe
all_conditions <- data.frame(matrix(ncol = 4, nrow = 0))
colnames(all_conditions) <- c("id", "perc", "res_time", "condition")

for (j in 1:3) {
  for (k in 1:41) {
    ## build the file name for this condition & participant, then read it in
    filename <- paste("condition", j, " (", k, ").csv", sep = "")
    item <- read.csv(filename)

    ## parse dataframe:
    ## select relevant columns & calculate percent-correct and mean reaction time
    item_parsed <- item |>
      select(c("correct", "response_time")) |>
      mutate(bool_correct = as.logical(correct)) |>
      na.omit() |>
      mutate(perc = sum(bool_correct == TRUE) / n(),
             res_time = mean(response_time),
             id = k,
             condition = j) |>
      select(c("id", "bool_correct", "perc", "res_time", "condition"))

    ## append one summary row per participant per condition
    all_conditions[nrow(all_conditions) + 1, ] <- c(item_parsed$id[1],
                                                    item_parsed$perc[1],
                                                    item_parsed$res_time[1],
                                                    item_parsed$condition[1])
  }
}

## cast condition number as a factor
## cast id number as a factor (this is the internal ID assigned in the loop above, not a Prolific ID)
all_conditions <- all_conditions |>
    mutate(
      condition = factor(condition),
      id = factor(id)
      )

## create dataframe of condition mean score & mean reaction time
condition_means <- all_conditions |> 
    group_by(condition) |> 
    summarize(perc_cond = mean(perc),
              res_cond = mean(res_time),
              sd_res = sd(res_time)) |>
  
    ## calculate the threshold time that disqualifies a participant 
    ## no lower threshold needed: the mean minus 3 SD would be negative
    mutate(
      exclu = res_cond + 3*sd_res
    )

## create per-condition subsets to avoid repeated filtering
condition1 <- all_conditions[all_conditions$condition == 1,] 
condition2 <- all_conditions[all_conditions$condition == 2,] 
condition3 <- all_conditions[all_conditions$condition == 3,]

Selecting rows from each condition where the average reaction time exceeded the exclusion threshold.

condition1[condition1$res_time > condition_means$exclu[1],]
   id perc res_time condition
38 38  0.5 21715.72         1
condition2[condition2$res_time > condition_means$exclu[2],]
   id      perc res_time condition
51 10 0.2777778 25317.17         2
79 38 0.5000000 25792.50         2
condition3[condition3$res_time > condition_means$exclu[3],]
    id      perc res_time condition
116 34 0.6111111 11660.78         3

In condition 1, one participant met the exclusion criterion. In condition 2, that same participant and one other met the exclusion criterion. In condition 3, a different participant met the exclusion criterion.

While some participants’ average response times were greater than 3 standard deviations from the mean, no participant met this criterion in all three conditions. No participants were excluded from the data analysis.

Confirmatory analysis

Statistical testing

T-tests
## t-testing condition 3, which had the lowest average score on the test
t.test(x = condition3$perc, mu = 0.25, alternative = "greater")

    One Sample t-test

data:  condition3$perc
t = 4.0661, df = 40, p-value = 0.0001089
alternative hypothesis: true mean is greater than 0.25
95 percent confidence interval:
 0.3337543       Inf
sample estimates:
mean of x 
0.3929539 
## calculating effect size
mean_sample <- 0.3929539
hypothesized_mean <- 0.25

sd_cond3 <- sd(condition3$perc)
# Cohen's d
cohen_d <- (mean_sample - hypothesized_mean) / sd_cond3
cohen_d
[1] 0.6350214

The t-test returns a p-value of 0.0001089, indicating a significant result.

The Cohen’s d value of 0.6 can be interpreted as a little above a “medium” effect size. It is less than half the effect size in the original experiment (d = 1.425), but it still reflects the presence of an effect.

ANOVA
## ANOVA for the three conditions' effect on average score
nova3 <- aov(perc ~ condition, data = all_conditions)
print(summary(nova3))
             Df Sum Sq Mean Sq F value   Pr(>F)    
condition     2  1.677  0.8386   12.16 1.55e-05 ***
Residuals   120  8.274  0.0690                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because the ANOVA returned a significant result, we then ran a post-hoc analysis to determine which groups are different from each other.

## post-hoc to see what groups are different from each other
TukeyHSD(nova3, conf.level = 0.95)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = perc ~ condition, data = all_conditions)

$condition
          diff        lwr          upr     p adj
2-1 -0.1355014 -0.2731333  0.002130633 0.0546546
3-1 -0.2859079 -0.4235398 -0.148275871 0.0000079
3-2 -0.1504065 -0.2880385 -0.012774516 0.0285965
plot(TukeyHSD(nova3, conf.level=.95), las = 2)

The confidence interval for the difference between conditions 2 and 1 (the 3x3 and 2x2 conditions) crosses the 0 line, but the intervals involving condition 3 (the 4x4 condition) do not, indicating significant differences between condition 3 and condition 1, and between condition 3 and condition 2. The difference between condition 3 and condition 1 is the largest.

Plots & Figures

## calculating standard errors for error bars
se_condition1 <- sd(condition1$perc, na.rm = TRUE) / sqrt(nrow(condition1)) 
se_condition2 <- sd(condition2$perc, na.rm = TRUE) / sqrt(nrow(condition2)) 
se_condition3 <- sd(condition3$perc, na.rm = TRUE) / sqrt(nrow(condition3)) 

## add calculated SE to a dataframe
condition_means_se <- condition_means |>
  mutate(
    se = c(se_condition1, se_condition2, se_condition3)
  )

## plot for mean scores by condition
percplot <- ggplot(condition_means_se, aes(x = condition, y = perc_cond, fill = condition)) +
    geom_bar(width = 0.65, stat = "identity") +
    geom_errorbar(
      aes(ymin = perc_cond - se, 
          ymax = perc_cond + se), 
                 width = 0.2, color = "black"
    ) +
    geom_hline(yintercept = 0.25, linetype = "dotted", color = "black", linewidth = 1) +
    scale_y_continuous(
      limits = c(0,1),
      n.breaks = 6
    ) +
    labs(title = "Average percent scores by condition", 
        x = "Condition", 
        y = "Average score") +
    scale_fill_hue(labels = c("2x2", "3x3", "4x4")) +
    scale_x_discrete(labels = c("2x2", "3x3", "4x4")) +
  theme(legend.position = "none")

percplot

Side-by-side figure comparison to original paper:

The replication project’s average scores were much lower than those in the original paper. While participants’ average scores in all three conditions were clearly higher than chance, the apparent average of 90% in the original experiment’s 2x2 condition is much higher than the 68% average in the replication; likewise, the original’s apparent 75% in the 3x3 condition compares with 54% here, and the apparent 50% in the 4x4 condition with 40% here. However, the overall pattern of performance is the same: scores are highest in the 2x2 condition, decline somewhat in the 3x3 condition, and drop further in the 4x4 condition.

The overall lower scores in the replication project are consistent with the lower effect size compared to the original paper.

Exploratory analyses

A cursory investigation of reaction time vs. participant score.

scatter_facet <- ggplot(all_conditions, 
                  aes(x = res_time, y = perc)) + 
                  geom_point(alpha = 0.55, 
                             position = position_jitter(), size = 2,
                  aes(color = condition)) +
                  facet_wrap(~condition) +
                labs(x = "Reaction time", y = "Participant mean knowledge test score") +
                ggtitle("Reaction time vs score, faceted by condition")
              
scatter_facet

Five outliers severely warp the scale of the plots. Limiting the x-axis to 8000 ms allows us to see the patterns a little more clearly:

scatter_facet_zoom <- ggplot(all_conditions, 
                  aes(x = res_time, y = perc)) + 
                  geom_point(alpha = 0.55, 
                             position = position_jitter(), size = 2,
                  aes(colour = condition)) +
                  facet_wrap(~condition) +
                scale_x_continuous(limits = c(0, 8000)) +
                labs(x = "Reaction time", y = "Participant mean knowledge test score") +
                ggtitle("Reaction time vs score, faceted by condition") 
              
scatter_facet_zoom
Warning: Removed 5 rows containing missing values or values outside the scale range
(`geom_point()`).

There is a small cluster in condition 1: participants at 100% performance have reaction times of between 2 and 3 seconds.

Somewhat similarly, in condition 2, all scores above 75% fall in a 2-4s range.

This pattern is not fully reflected in condition 3, where scores decline and times increase, resulting in a general appearance of the data points moving to the bottom right. Three out of four scores above 75% in condition 3 remain within the 2 to 4 second reaction time range; however, one participant achieves a score above 75% with an average reaction time of approximately 6.5s.

The pattern of reaction times below 4s generally reflecting higher scores may indicate that taking one’s time on these trials does not necessarily result in higher accuracy.
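One way to follow up on this impression, left for future exploratory work, would be a within-condition correlation between mean reaction time and test score; a minimal sketch (not run as part of the pre-registered analysis) follows.

## Possible follow-up (not part of the pre-registered analysis):
## within-condition correlation between mean reaction time and test score.
all_conditions |>
  group_by(condition) |>
  summarize(r = cor(res_time, perc),
            p_value = cor.test(res_time, perc)$p.value)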

Discussion

Summary of Replication Attempt

The overall result (participants in each condition scoring higher than chance) replicated. The pattern of scores in the original paper, where performance was best in the 2x2 condition, declined in the 3x3 condition, and was lowest in the 4x4 condition, was also present in the replication results. However, participants in the replication scored lower in all conditions than in the original experiment. The replication’s effect size was much smaller than that of the original paper, and the F-value from the ANOVA was also much smaller than in the original experiment.

Nevertheless, the effect was still present, and participant scores appeared to fall in roughly the same proportions as the original experiment.

Commentary

Differences between the subject pools of the original experiment and this replication may have contributed to the replication’s participants scoring much lower than the original results. Online survey participants are not supervised and are not necessarily motivated to pay attention throughout the training trials; some participants may have been multitasking or entirely forgotten the experiment was running between questions, given response times of 30 s or more on individual questions in the test phases. Although the experiment entered fullscreen during its introduction, there was no way to guarantee that participants stayed in fullscreen or continued paying attention throughout. The design of the training trials also made it impossible to implement attention checks.

Statement of Contributions

Pengjia Cui: Stimuli generation code; code review;

Yawen Dong: Experiment coding (test phase code; randomization; debugging; review); pre-registration revisions; OSF datapipe setup;

Hui Junyi: Experiment coding (trial/training phase coding; pre-test audio check and fullscreen check; randomization; condition order counterbalancing code; debugging; review); pre-registration revisions;

Allison Park: Stimuli analysis and selection; pre-registration original draft; Prolific experiment management; code review.