Replication of Study ‘What makes words special? Words as unmotivated cues’ by Edmiston & Lupyan (2015, Cognition)

Author

Replication Author[s]: Sihan Yang (siy009@ucsd.edu)

Published

December 11, 2024

Introduction

How language supports cognition is a central question in cognitive linguistics. Previous research suggests that language may play a unique role in the mental processing of categories. Lupyan and Thompson-Schill (2012) found that, during an image-recognition task, participants judged whether an image and a preceding auditory cue belonged to the same category faster when the cue was verbal rather than non-verbal. For instance, people identified a guitar image as matching the word ‘guitar’ faster than as matching the sound of an acoustic guitar being played.

Edmiston and Lupyan (2015) further explored why verbal cues activate concepts more effectively. They proposed that verbal cues are associated with the abstract concept rather than specific instances, thereby cueing more diagnostic features of a category. In their Experiment 1A, they found that in an image-recognition task, participants (1) responded faster to an image when the cue was a verbal label rather than an environmental sound, and (2) responded faster to an image when the environmental sound cue was congruent (i.e., matching the instance in the image) rather than incongruent (i.e., matching the category but not the exact instance). These findings suggest a potential mechanism by which language supports conceptual thinking.

Our study aims to replicate Experiment 1A from Edmiston and Lupyan (2015). We define the replication as successful if the results meet the following conditions: (1) the ordering of reaction times across auditory cue conditions is the same as described above, and (2) the differences in reaction time are statistically attributable to the type of auditory cue. Specifically, we expect reaction times to follow the pattern: verbal-label condition < congruent-environmental-sound condition < incongruent-environmental-sound condition, with statistical tests confirming that auditory cue type reliably predicts these differences.

The following are the links to our repository, pre-registration, and hosted experiment: Project GitHub repository, Study pre-registration, Experiment link

Methods

Power Analysis

Since the original study did not provide means and standard deviations for the effects and employed a linear mixed-effects regression model, it is challenging to conduct a conventional power analysis to estimate a proper sample size. However, based on guidance from course instructional staff, the original sample size (\(N=43\)) should be adequate to achieve 80% power for detecting the effect reported in the original study, with a significance criterion of \(\alpha=0.05\).
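As a rough illustration only, the sketch below shows how power could be approximated if the label vs. congruent-sound comparison were simplified to a paired t-test on per-participant mean RTs. The 25.53 ms effect is the estimate reported by the original study; the standard deviations of the within-participant differences are hypothetical placeholders, since the original paper does not report them, and this simplification is not the mixed-effects analysis we actually run.

```r
# Hedged sketch: approximate power for a paired comparison with N = 43.
# ASSUMPTION: the SDs of within-participant RT differences below are
# hypothetical placeholders; the original paper does not report them.
effect_ms   <- 25.53           # label vs. congruent-sound estimate (original study)
assumed_sds <- c(40, 60, 80)   # hypothetical SDs of paired differences (ms)

power_by_sd <- sapply(assumed_sds, function(s) {
  # power.t.test() is part of base R's stats package
  power.t.test(n = 43, delta = effect_ms, sd = s,
               sig.level = 0.05, type = "paired")$power
})
names(power_by_sd) <- paste0("sd_", assumed_sds)
print(round(power_by_sd, 2))
```

Under these assumptions, power falls as the assumed SD grows, so any conclusion about adequacy depends heavily on a quantity the original paper does not report.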

Planned Sample

Given that the original study was well-powered (\(N=43\)) but conducted in a laboratory setting, we choose a slightly larger sample size (\(N=50\)) to account for the added uncertainty of an online setting. Participants are recruited through Prolific, prescreened for English fluency, and compensated monetarily for their participation.

Materials

We use the exact same images and auditory cues as in the original study. The authors describe the materials as follows: “The auditory cues comprised basic-level category labels and environmental sounds for six categories: bird, dog, drum, guitar, motorcycle, and phone. For each category, we obtained two distinct environmental sound cues, e.g., <electric guitar>, <acoustic guitar>, and two separate images for each subordinate category, e.g., two electric guitars for <electric guitar>, two acoustic guitars for <acoustic guitar>. To control for cue variability, we also used two versions of each spoken category label: one pronounced by a female speaker, one by a male speaker. All auditory cues were equated in duration (600 ms.) and normalized in volume. The images were color photographs (four images per category). The materials, obtained from online repositories, are available for download at http://sapir.psych.wisc.edu/stimuli/MotivatedCuesExp1A-1B.zip.” (Edmiston & Lupyan, 2015)

Procedure

Participants are instructed to use their personal computer to complete the task in a quiet environment free of distractions. During the experiment, they respond by pressing ‘f’ on their keyboard for ‘Yes’ and ‘j’ for ‘No’.

During the experiment, “on each trial participants heard a cue and saw a picture…participants decide as quickly and accurately as possible if the picture they saw came from the same basic-level category as the word or sound they heard…Trials began with a 250 ms. fixation cross followed immediately by the auditory cue, delivered via headphones. The target image appeared centrally 1 s after the offset of the auditory cue and remained visible until a response was made. Each participant completed 6 practice and 384 test trials. If the picture matched the auditory cue (50% of trials) participants were instructed to respond ‘Yes’…(e.g., or ‘phone’ followed by a picture of any phone). Otherwise, they were to press ‘No’ (e.g., or ‘phone’ followed by a dog). All factors (cue type, congruence) varied randomly within subjects. Auditory feedback (buzz or bleep) was given after each trial.” (Edmiston & Lupyan, 2015)

Analysis Plan

Data Cleaning and Data Exclusion Rules

Before testing the main hypothesis of the original study, we will first evaluate participants’ performance for data quality assurance. This will involve calculating each participant’s accuracy. Given that the average accuracy is 97% in the original study, we set the data exclusion threshold at 90%: only participants who have achieved an accuracy above 90% will be included in the further analysis.

Trials with response times (RTs) outside the range of 250 ms to 1500 ms will be excluded, as in the original study. Before applying these filters, we will examine the distribution of RT among participants who meet the accuracy requirement.

Finally, only RTs from matching trials with correct responses will be included for further analysis, following the original study’s approach.
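The three exclusion rules above can be sketched on a small made-up data set. The column names mirror those used in our analysis code, and ‘f’ is the ‘Yes’ key as described in the Procedure, but the values are invented purely for illustration.

```r
# Toy trial-level data (values are made up for illustration)
trials <- data.frame(
  ID               = rep(c("p1", "p2"), each = 5),
  rt               = c(420, 180, 900, 1600, 700, 500, 650, 800, 300, 1200),
  correct_response = c("f", "f", "j", "f", "f", "f", "j", "f", "j", "f"),
  response         = c("f", "f", "j", "f", "f", "j", "f", "f", "f", "j")
)
trials$correct <- trials$response == trials$correct_response

# Rule 1: keep participants whose accuracy is above 90%
accuracy <- tapply(trials$correct, trials$ID, mean)
good_ids <- names(accuracy)[accuracy > 0.9]
kept <- trials[trials$ID %in% good_ids, ]

# Rule 2: keep trials with 250 ms < RT < 1500 ms
kept <- kept[kept$rt > 250 & kept$rt < 1500, ]

# Rule 3: keep matching ('Yes', key 'f') trials that were answered correctly
kept <- kept[kept$correct_response == "f" & kept$correct, ]
print(kept)
```

In this toy example, participant p2 is dropped for low accuracy, and only two of p1's trials survive all three filters.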

Covariates

The main covariate in this experiment will be “the presence of incongruent sound trials (e.g., hearing a and responding ‘Yes’ if a rotary phone was presented) may have led participants to be more careful on these trials, inflating the RTs” (Edmiston & Lupyan, 2015). However, as this factor is not the focus of the experiment to replicate and was addressed in another following experiment in the original study, we will not include it in our analysis.

Key Analysis

We will “fit RTs for correct responses on matching trials ( responses)” using a “linear mixed regression using maximum likelihood estimation (Bates, Maechler, Bolker, & Walker, 2013)” (Edmiston & Lupyan, 2015). The model has the cue condition as the main factor (“label”, “congruent-sound”, and “incongruent-sound”), and also includes “random intercepts and random slopes for within-subject factors and random intercepts for repeated items (unique trial types) following the recommendations of Barr, Levy, Scheepers, and Tily (2013)” (Edmiston & Lupyan, 2015).

To further quantify the estimated differences in RTs between congruency conditions, we will then apply a post-hoc analysis, estimated marginal means (EMM) contrasts, to the fitted linear mixed-effects regression model. The three contrasts of interest are: (1) “label” vs. “congruent-sound”, (2) “label” vs. “incongruent-sound”, and (3) “congruent-sound” vs. “incongruent-sound”. For each contrast of interest, we will report the “parameter estimates (b) and confidence intervals” (Edmiston & Lupyan, 2015).

Finally, significance tests will be calculated “using chi-square tests that compared nested models—models with and without the factor of interest—on improvement in log-likelihood” (Edmiston & Lupyan, 2015). This further confirms whether the factor of interest truly accounts for the differences in RTs. We will report the resulting chi-square value and p-value.

Criteria for successful replication: according to the original study, we predict that the results should be as follows:

In the post-hoc analysis: the estimates for both the “label” vs. “congruent-sound” contrast and the “congruent-sound” vs. “incongruent-sound” contrast should be significantly below zero, with significance assessed using the adjusted p-values.
In the chi-square test: the model with the factor of interest (the congruency of sound cues) should fit the data significantly better than the model without it. This is assessed by comparing nested models with and without congruency as the main factor, where the full model should fit significantly better than the reduced model (significance quantified by the p-value).

Differences from Original Study

Sample

While the original study recruited participants on campus, our recruitment is conducted on a public online crowdsourcing platform (Prolific), expanding the sampling pool from U.S. college undergraduates to a more diverse population.

Procedure

While the original study was conducted in person in a lab, our replication will be hosted online, allowing participants to complete the task independently at a location of their choice, without the supervision of experimenters. In addition, responses will be made using a keyboard rather than a game controller.

Methods Addendum (Post Data Collection)

Actual Sample

We successfully collected data from 50 participants as planned, with 42 passing the data quality check (accuracy rate above 90%). Refer to the Results section for an overview of performance and the code for data quality assurance.

Differences from pre-data collection methods plan

None. We collected the data the same way we specified in the plan and managed to collect the same number of participants.

Results

Data preparation

As proposed in our analysis plan, we will first examine the distribution of accuracy and response time, then apply data cleaning and transformation.

First, we load the data and perform basic preprocessing. This includes removing practice trial data, converting the data types of correct responses and responses, and calculating the accuracy for each participant.

### Data Preparation

#### Load Relevant Libraries and Functions
library(tidyverse)
library(readr)
library(lme4)

# Import data: read every CSV in the data folder and stack them into one data frame
folder_path <- "../data/complete"
# `pattern` is a regular expression, so match a literal ".csv" suffix
csv_files <- list.files(folder_path, pattern = "\\.csv$", full.names = TRUE)
df_list <- lapply(csv_files, read_csv, show_col_types = FALSE)
original_df <- bind_rows(df_list)

# exclude practice trials
original_df <- original_df |>
  filter(exp_part == "actual")

# convert the data type and compute correctness of each trial
original_df <- original_df %>%
  mutate(correct_response = as.character(correct_response)) %>%
  mutate(response = as.character(response)) %>%
  mutate(correct = correct_response == response)

# compute the accuracy of each participant
accuracy_table <- original_df %>%
  group_by(ID) %>%
  summarise(accuracy = mean(correct, na.rm = TRUE))

Next, we check the distribution of accuracy.

library(ggplot2)

# set the accuracy filtering threshold
accuracy_thresh <- 0.9

# plot the distribution of accuracy
ggplot(accuracy_table, aes(x = accuracy)) +
  geom_histogram(
    breaks=seq(0, 1, length.out=21),
    fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = accuracy_thresh, color = "red", linetype = "dashed", size = 1) + # mark the threshold for QA
  labs(
    title = "Distribution of Accuracy",
    x = "Accuracy",
    y = ""
  ) +
  theme_minimal()

42 participants (84% of the total) achieved an accuracy above 90%. Considering that in the original study “all participants performed very accurately on all items (M = 97%)” (Edmiston & Lupyan, 2015), the accuracy of participants in our replication is notably lower.

We also examine the distribution of response times for participants with an accuracy above 90%:

library(ggplot2)

# get the list of IDs of participants whose accuracy is above the threshold
id_pass_qa <- accuracy_table %>%
  filter(accuracy > accuracy_thresh) %>%
  pull(ID)
# filter the original dataframe to keep only data from the qualified participants
original_df_with_good_subj <- original_df %>%
  filter(ID %in% id_pass_qa)

# plot the distribution of response time
ggplot(original_df_with_good_subj, aes(x = rt)) +
  geom_histogram(
    breaks=seq(0, 2000, length.out=40),
    fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = 250, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = 1500, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Reaction Time",
    x = "Reaction Time",
    y = ""
  ) +
  theme_minimal()

# proportion of trials at or below each RT cutoff (empirical CDF)
ecdf_func <- ecdf(original_df_with_good_subj$rt)
rt_quantiles <- ecdf_func(c(250, 1500))
# logical mask: TRUE for trials with 250 ms < RT < 1500 ms
rt_pass_mask <- (original_df_with_good_subj$rt > 250) & (original_df_with_good_subj$rt < 1500)

15083 trials (94% of the trials from these participants) have response times within the range of 250-1500 ms.
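As a side note on the two ways of computing this share, the sketch below uses made-up RTs to show that an empirical CDF counts values at or below a cutoff, whereas a strict-inequality mask excludes values exactly on the boundary. The numbers are illustrative only and are not taken from our data.

```r
# Made-up response times in ms, including one value exactly at the upper cutoff
rt <- c(200, 300, 450, 800, 1500, 1700)

F_rt <- ecdf(rt)                           # empirical CDF: F(x) = P(RT <= x)
share_ecdf <- F_rt(1500) - F_rt(250)       # P(250 < RT <= 1500)
share_mask <- mean(rt > 250 & rt < 1500)   # P(250 < RT < 1500), strict bounds

print(c(ecdf = share_ecdf, mask = share_mask))
```

The two quantities differ only when an RT falls exactly on a cutoff, which is rare with millisecond-resolution timing but worth keeping in mind when the two are reported side by side.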

Now we further preprocess the data by:

excluding participants who do not achieve an accuracy above 90%
excluding trials whose reaction time is shorter than 250 ms or longer than 1500 ms
keeping only trials whose correct answer is ‘Yes’ and to which participants responded correctly
recoding the trial ‘congruency’ conditions as ‘label’, ‘congruent’, and ‘incongruent’

# keep only people who passed the QA check
combined_df <- original_df_with_good_subj

# keep only correct 'YES' trials ('f' is the 'Yes' key)
combined_df <- combined_df %>%
  filter(response == "f") %>%
  filter(correct_response == "f")

prepare_fitting_ready_df <- function(df) {
  # rename "sound_subtype" to "cue"
  df <- df %>%
    rename(cue = sound_subtype)

  # create congruency column 
  df <- df %>%
     mutate(congruency = case_when(
      correct_response == 'j' ~ NA_character_,
      cue == "label" ~ "label",
      (img_subtype %in% c("song", "york", "bongo", "acoustic", "harley", "rotary") & sound_version == "A") ~ "incongruent",
      TRUE ~ "congruent"
    ))
  
  # drop the correctness column
  df <- df %>%
    select(-correct)

  # keep trials with 250 ms < RT < 1500 ms
  df <- df %>%
    filter(rt > 250, rt < 1500)

  # Prepare data for analysis - keep only columns relevant to the model.
  df <- df %>%
    select(rt, ID, sound_category,cue, congruency)
  
  return(df)
}

combined_df <- prepare_fitting_ready_df(combined_df)

Confirmatory analysis

Key Analysis

As proposed in our analysis plan, in the following section, we will first report the parameter estimates and confidence intervals for each “contrast of interest” from the EMM contrast analysis of a linear mixed-effects model fitted to RTs. We will then conduct the chi-square test to examine whether the model with the factor of interest (congruency) yields a significant improvement in log-likelihood, and report the resulting chi-square values and p-values.

The following code constructs and fits the linear mixed-effects regression model:

library(lmerTest)
library(emmeans)
library(lme4)

# set reference level to congruent
combined_df$congruency <- factor(
  combined_df$congruency, levels = c("congruent", "label", "incongruent"))

# compare between reference (congruent) and the other two values (incongruent and label)
## dependent variable: rt (response time)
## main independent variable: congruency (of auditory cue)
## random intercepts and random slopes for within-subject factors: (1 + congruency|ID)
## random intercepts for repeated items: (1|sound_category)
model_full <- lmer(rt ~ congruency + (1 + congruency|ID) + (1|sound_category), data = combined_df)

summary(model_full)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ congruency + (1 + congruency | ID) + (1 | sound_category)
   Data: combined_df

REML criterion at convergence: 99407.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3748 -0.6827 -0.1974  0.4509  4.2050 

Random effects:
 Groups         Name                  Variance Std.Dev. Corr       
 ID             (Intercept)           18945.89 137.64              
                congruencylabel        1767.99  42.05   -0.46      
                congruencyincongruent   822.65  28.68   -0.10  0.30
 sound_category (Intercept)              16.16   4.02              
 Residual                             44351.39 210.60              
Number of obs: 7328, groups:  ID, 42; sound_category, 6

Fixed effects:
                      Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)            690.928     21.686  41.062  31.860  < 2e-16 ***
congruencylabel        -52.267      8.390  39.660  -6.229 2.33e-07 ***
congruencyincongruent   15.047      9.365  41.782   1.607    0.116    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) cngrncyl
congrncylbl -0.439         
cngrncyncng -0.126  0.317

The associated 95% confidence intervals are:

#get 95% CI
confint.merMod(model_full,method="Wald")

                           2.5 %    97.5 %
.sig01                        NA        NA
.sig02                        NA        NA
.sig03                        NA        NA
.sig04                        NA        NA
.sig05                        NA        NA
.sig06                        NA        NA
.sig07                        NA        NA
.sigma                        NA        NA
(Intercept)           648.424523 733.43240
congruencylabel       -68.712156 -35.82273
congruencyincongruent  -3.307511  33.40136
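For reference, a Wald interval for a fixed effect is the estimate plus or minus the normal critical value times its standard error; lme4 computes Wald intervals for fixed effects only, so the variance parameters (.sig01 to .sigma) are returned as NA. The sketch below recomputes the intervals from the rounded estimates and standard errors printed in the model summary, so the last digits may differ slightly from the output above.

```r
# Rounded fixed-effect estimates and standard errors copied from summary(model_full)
est <- c(intercept = 690.928, label = -52.267, incongruent = 15.047)
se  <- c(intercept = 21.686,  label = 8.390,   incongruent = 9.365)

z_crit <- qnorm(0.975)   # two-sided 95% normal critical value (about 1.96)

wald_ci <- cbind(lower = est - z_crit * se,
                 upper = est + z_crit * se)
print(round(wald_ci, 2))
```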

Now we have the average effect of each congruency level on the RTs with random effects accounted for. When the cue is ‘congruent’, the estimated average RT is \(690.93\) ms, with a 95% confidence interval of \([648.42, 733.43]\). RT in the ‘label’ condition is estimated to be \(52.27\) ms shorter than in the ‘congruent’ condition, and the difference has a 95% confidence interval of \([-68.71, -35.82]\). RT in the ‘incongruent’ condition is estimated to be \(15.05\) ms longer than in the ‘congruent’ condition, and the difference has a 95% confidence interval of \([-3.31, 33.40]\). However, these coefficients are not sufficient on their own: they compare each condition only with the ‘congruent’ reference level, so they omit the direct ‘label’ vs. ‘incongruent’ comparison, and their p-values are not adjusted for multiple comparisons. These issues can be addressed by applying a post-hoc analysis:

#post-hoc test:
fit_result <- model_full %>% 
  emmeans(
    pairwise ~ congruency, # contrast across 'congruency' conditions
    adjust = "bonferroni" # p-value adjusted for multiple comparisons
  ) %>% 
  pluck("contrasts")

fit_result

 contrast                estimate    SE  df z.ratio p.value
 congruent - label           52.3  8.39 Inf   6.229  <.0001
 congruent - incongruent    -15.0  9.36 Inf  -1.607  0.3243
 label - incongruent        -67.3 10.40 Inf  -6.470  <.0001

Degrees-of-freedom method: asymptotic 
P value adjustment: bonferroni method for 3 tests

To compute the 95% confidence intervals:

ci <- confint(fit_result, level=0.95)
ci

 contrast                estimate    SE  df asymp.LCL asymp.UCL
 congruent - label           52.3  8.39 Inf      32.2     72.35
 congruent - incongruent    -15.0  9.36 Inf     -37.5      7.37
 label - incongruent        -67.3 10.40 Inf     -92.2    -42.41

Degrees-of-freedom method: asymptotic 
Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 3 estimates

To compute \(\chi^2(1)\) for each contrast, we square its z ratio:

emm_results <- summary(fit_result)
z_squared <- emm_results["z.ratio"] ^ 2
z_squared

    z.ratio
1 38.806637
2  2.581723
3 41.862952

The result suggests that:

RT in the ‘congruent’ condition is estimated to be longer than in the ‘label’ condition by \(b=52.3\) ms, 95% CI \([32.2, 72.35]\), \(\chi^2(1)=38.81\), \(p<0.001\). For comparison, the original study reported \(b=25.53\) ms, 95% CI \([10.17, 40.89]\), \(\chi^2(1)=9.96\), \(p=0.002\). Both studies found that participants responded significantly faster when a label cue was used than when a congruent sound cue was used.
RT in the ‘congruent’ condition is estimated to be shorter than in the ‘incongruent’ condition, \(b=-15.0\) ms, 95% CI \([-37.5, 7.37]\), \(\chi^2(1)=2.58\), \(p=0.324\), but the difference is not statistically significant. For comparison, the original study reported \(b=-38.71\) ms, 95% CI \([-55.33, -22.10]\), \(\chi^2(1)=17.61\), \(p<0.001\). Only the original study supports the claim that, when an environmental sound cue is used, participants respond significantly faster when the cue is congruent than when it is incongruent.
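The statistics reported above can be approximately reproduced by hand from the rounded emmeans output. The sketch below squares each z ratio to obtain a 1-df chi-square value, applies the Bonferroni adjustment to the two-sided p-values, and widens the confidence intervals using a Bonferroni-adjusted critical value. Note that a squared z ratio is a Wald-type statistic, which is not computed the same way as the nested-model comparison quoted in our analysis plan, so the two need not coincide exactly.

```r
# Rounded estimates, standard errors, and z ratios copied from the emmeans output
contrast <- c("congruent - label", "congruent - incongruent", "label - incongruent")
estimate <- c(52.267, -15.047, -67.3)
se       <- c(8.390, 9.365, 10.40)
z        <- c(6.229, -1.607, -6.470)

k <- length(z)                           # number of pairwise contrasts (3)
wald_chisq <- z^2                        # squared z ratio, 1 degree of freedom
p_raw  <- 2 * pnorm(-abs(z))             # two-sided unadjusted p-values
p_bonf <- pmin(1, k * p_raw)             # Bonferroni-adjusted p-values

z_crit <- qnorm(1 - 0.05 / (2 * k))      # Bonferroni-adjusted 95% critical value
lower  <- estimate - z_crit * se
upper  <- estimate + z_crit * se

print(data.frame(contrast,
                 chisq  = round(wald_chisq, 2),
                 p_bonf = signif(p_bonf, 3),
                 lower  = round(lower, 2),
                 upper  = round(upper, 2)))
```

Small differences from the reported values are expected because the inputs here are rounded.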

To further test whether congruency has a meaningful impact on RT (i.e., whether the full model, which includes a fixed effect of congruency, fits the data better than a reduced model without it), we apply the chi-square test. The result is as follows:

# create the model without congruency as the main factor
model_reduced <- lmer(rt ~ (1 + congruency|ID) + (1|sound_category), data = combined_df)
# compare their log likelihood
anova(model_full, model_reduced)

Data: combined_df
Models:
model_reduced: rt ~ (1 + congruency | ID) + (1 | sound_category)
model_full: rt ~ congruency + (1 + congruency | ID) + (1 | sound_category)
              npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
model_reduced    9 99481 99543 -49731    99463                         
model_full      11 99450 99526 -49714    99428 34.967  2  2.553e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results suggest that the full model (which includes the congruency factor) fits the data significantly better than the reduced model, \(\chi^2(2)=34.97\), \(p<0.001\).
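To make the test explicit: the likelihood-ratio statistic is the difference in deviance between the reduced and full models (equivalently, twice the difference in log-likelihood), referred to a chi-square distribution whose degrees of freedom equal the number of extra parameters. By default, anova() in lme4 refits REML-fitted models with maximum likelihood before making this comparison. The sketch below recomputes the p-value from the values printed in the table above.

```r
# Values copied from the anova() table above
chisq_reported <- 34.967
df_diff        <- 11 - 9   # parameters in the full model minus the reduced model

# Upper-tail probability of a chi-square distribution with df_diff degrees of freedom
p_value <- pchisq(chisq_reported, df = df_diff, lower.tail = FALSE)
print(signif(p_value, 4))

# Cross-check: difference between the (rounded) deviances in the same table
chisq_from_deviance <- 99463 - 99428
print(chisq_from_deviance)
```

The deviance difference of 35 agrees with the reported statistic up to the rounding of the printed deviances.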

Finally, we compare our results with those of the original study:

library(gridExtra)
library(jpeg)
library(grid)

# rename and reorder some columns for plotting
plot_df <- combined_df |>
  mutate(congruency=factor(
    congruency, levels = c("label", "congruent", "incongruent"))) |>
  mutate(congruency = fct_recode(congruency, 
    "Label"="label", 
    "Congruent Sound"="congruent", 
    "Incongruent Sound"="incongruent")
  )

# plot the result of our replication
our_plot <- ggplot(data = plot_df,
       mapping = aes(x = congruency,
                     y = rt,
                     color = congruency, 
                     fill = congruency)) +
  geom_bar(stat = "summary", fun = "mean", 
           width = 1,
           color = "black") + 
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "linerange", 
               color = "black") +
   geom_errorbar(stat = "summary", 
                fun.data = "mean_cl_boot", 
                width = 0.5, 
                size=0.5,
                color = "black") +
  scale_fill_manual(values = c("firebrick", "darkblue", "slategray2")) +
  scale_color_manual(values = c("firebrick", "darkblue", "slategray2")) +
  labs(
    x = "Experiment 1A",
    y = "Verification Speed (ms)",
    color = "Cue Type",
    fill = "Cue Type") +
  coord_cartesian(xlim=c(0, 4), ylim = c(400, 725), expand=FALSE) +
  scale_y_continuous(breaks = seq(400, 700, by = 50), expand=c(0, 25)) +
  theme_classic() +
  theme(
    axis.text.x = element_blank(), 
    axis.ticks.x = element_blank(),
    axis.text.y = element_text(size = 10, color = "black"),
    axis.title.x = element_text(size = 10, color = "black"), 
    axis.title.y = element_text(size = 10, color = "black"), 
    legend.title = element_text(size = 10, color = "black", face='bold'), 
    legend.text = element_text(size = 10, color = "black"),
    plot.margin = unit(c(0.0, 0.1, 0.0, 0.1), "npc")
  ) +
  theme(
    aspect.ratio = 1.8)

# load the plot from the original study
edmiston_img <- rasterGrob(readJPEG("edmiston_exp1a.jpg"), interpolate = TRUE)

# display side by side
plot_title <- textGrob("Our Result", gp = gpar(fontface = "bold", fontsize = 14))
image_title <- textGrob("Original Study Result", gp = gpar(fontface = "bold", fontsize = 14))

grid.arrange(
  arrangeGrob(image_title, edmiston_img, ncol = 1, heights = c(0.5, 5)),
  arrangeGrob(plot_title, ggplotGrob(our_plot), ncol = 1, heights = c(0.5, 5)),
  ncol = 2, widths=c(1, 1.7))

In summary, both our study and the original study observe the pattern that RTs follow the order: label cue < congruent-sound cue < incongruent-sound cue. However, while the original study found all pairwise differences to be statistically significant, we found significant differences only between the label cue and the congruent-sound cue and between the label cue and the incongruent-sound cue, not between the congruent-sound cue and the incongruent-sound cue. Additionally, reaction times were generally longer in our replication than in the original study.

Exploratory analyses

We also plotted the distribution of RTs under each condition to investigate whether there were an excessive number of unusually short or long RTs in certain conditions that might distort the overall reaction time pattern.

plot_df <- combined_df |>
  mutate(congruency=factor(
    congruency, levels = c("label", "congruent", "incongruent"))) |>
  mutate(congruency = fct_recode(congruency, 
    "Label"="label", 
    "Congruent Sound"="congruent", 
    "Incongruent Sound"="incongruent")
  )

ggplot(data = plot_df,
       mapping = aes(x = congruency,
                     y = rt,
                     fill = congruency)) +
  geom_boxplot(width = 0.6, outlier.shape = NA, alpha=0.5) + 
  geom_jitter(shape = 21, color = "black", width = 0.2, alpha = 0.2, size = 1.5) +
  geom_hline(yintercept = 1200, color = "red", linetype = "dashed", size = 1) +
  scale_fill_manual(values = c("firebrick", "darkblue", "slategray")) +
  labs(
    y = "Reaction Time (ms)",
    color = "Cue Type",
    x = "Cue Type",
    fill = "Cue Type "
  ) +
  theme_classic() +
  theme(
  axis.text.x = element_blank(), 
  axis.text.y = element_text(size = 14, color = "black"),
  axis.title.x = element_text(size = 16, color = "black", face = "bold"), 
  axis.title.y = element_text(size = 16, color = "black", face = "bold"), 
  legend.title = element_text(size = 14, color = "black"), 
  legend.text = element_text(size = 14, color = "black") 
)

While the original study found that reaction times were significantly shorter in congruent trials than in incongruent trials, our results reveal a heavy tail in the congruent condition, with some extremely long RTs (i.e., greater than 1200 ms). It is unclear why this occurs, but we suspect these trials may have boosted the average RTs in the congruent condition, thereby reducing the difference between the congruent and incongruent conditions.

In the original study, trials where participants gave incorrect responses were excluded from further analysis. However, we were curious whether including incorrect trials might yield different results. Specifically, while there is no evidence that reaction times differ significantly between correct trials with congruent and incongruent cues, we wondered whether reaction times would differ significantly once trials on which participants may have been uncertain (i.e., those with incorrect responses) are included. To investigate this, we fitted a linear mixed-effects regression model and conducted the same post-hoc analysis as in the confirmatory analysis, but this time including trials with incorrect responses.

# include wrong trials
include_wrong_df <- original_df_with_good_subj %>%
  filter(correct_response=='f')
include_wrong_df <- prepare_fitting_ready_df(include_wrong_df)

# lmer
# set reference level to congruent
include_wrong_df$congruency <- factor(
  include_wrong_df$congruency, levels = c("congruent", "label", "incongruent"))
model_wrong_full <- lmer(rt ~ congruency + (1 + congruency|ID) + (1|sound_category), data = include_wrong_df)

# summary(model_wrong_full)

# EMM contrast
include_wrong_fit_result <- model_wrong_full %>%
  emmeans(
    pairwise ~ congruency, # contrast across 'congruency' conditions
    adjust = "bonferroni" # p-value adjusted for multiple comparisons
  ) %>%
  pluck("contrasts")
include_wrong_fit_result

 contrast                estimate   SE  df z.ratio p.value
 congruent - label           53.8 8.16 Inf   6.591  <.0001
 congruent - incongruent    -19.4 8.82 Inf  -2.198  0.0838
 label - incongruent        -73.2 9.65 Inf  -7.586  <.0001

Degrees-of-freedom method: asymptotic 
P value adjustment: bonferroni method for 3 tests

confint(include_wrong_fit_result, level=0.95)

 contrast                estimate   SE  df asymp.LCL asymp.UCL
 congruent - label           53.8 8.16 Inf      34.3     73.36
 congruent - incongruent    -19.4 8.82 Inf     -40.5      1.73
 label - incongruent        -73.2 9.65 Inf     -96.3    -50.11

Degrees-of-freedom method: asymptotic 
Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 3 estimates

The result suggests that including incorrect trials does not change the previously observed pattern: RTs are significantly shorter (\(p<0.001\)) in the label cue condition than in the congruent-sound cue condition, and shorter in the congruent-sound cue condition than in the incongruent-sound cue condition, but the latter difference is not statistically significant (adjusted \(p=0.084\)).

Discussion

Summary of Replication Attempt

In our version of the experiment:

We found that RTs were significantly shorter when the sound cue was a label rather than a congruent environmental sound, using the standard p-value threshold of 0.05. This is consistent with the results of the original study.
We did not find statistically significant evidence that RTs are shorter when the sound cue is a congruent environmental sound compared to an incongruent one. Although the estimated effect is in the predicted direction, the p-value exceeds the standard threshold of p = 0.05. This is inconsistent with the results of the original study, which reported that RTs are significantly shorter when the environmental sound cue is congruent rather than incongruent.

To summarize, the primary result has only been partially replicated.

Commentary

We propose that one of the most likely causes of this discrepancy is that our experiment was conducted online. An online setting could introduce several challenges: (1) It was difficult to communicate effectively with participants and receive feedback during the instruction and practice phases, making it harder to ensure that participants fully understood the task. Confusion about the task may have altered the reaction time patterns. (2) It was challenging to monitor participants’ attention during the experiment. These concerns are consistent with the observations that overall accuracy was lower and reaction times were generally longer than in the original study. Therefore, we recommend that researchers who aim to replicate this study in the future use the same laboratory settings as the original study to minimize potential discrepancies.

Statement of Contributions

Jenna Brooks: Project Administration (lead), Software (JsPsych Experiment) (supporting), Writing and Editing (equal)
Noah Khaloo: Formal Analysis (lead), Data Curation (lead), Statistical Analysis and Visualization (lead), Writing and Editing (equal)
Sihan Yang: Conceptualization (lead), Software (JsPsych Experiment) (lead), Writing and Editing (equal)
Reeka Estacio: Software (JsPsych Experiment) (supporting), Validation (lead), JsPsych Experiment (supporting), Presentation (lead), Writing and Editing (equal)

References

Lupyan, G., & Thompson-Schill, S. L. (2012). The evocative power of words: Activation of concepts by verbal and nonverbal means. Journal of Experimental Psychology: General, 141(1), 170–186. http://dx.doi.org/10.1037/a0024904
Edmiston, P., & Lupyan, G. (2015). What makes words special? Words as unmotivated cues. Cognition, 143, 93–100.