Replication of Study What makes words special? Words as unmotivated cues by Pierce Edmiston & Gary Lupyan (2015, Cognition)

Author

Noah Khaloo

Published

December 11, 2024

Introduction

The study presented in Edminston & Lupyan (2015) examines the relationship between verbal labels (i.e., “dog” or “guitar”) and activation of perceptual knowledge. Previous research has demostrated that environmental sounds, like a dog bark or the strum of a guiater, prompts the listener towards specific instances of a concept (i.e. a ‘big’ dog, or an electric guitar), while words, like “dog,” cue broader, more abstract mental representations (Lupyan & Thompson-Schill, 2012). Experiment 1A presented in Edminston & Lupyan (2015) provides further evidence towards notion that ambiguous labels (i.e., “dog” or “guitar”) activate knowledge of their referent quicker than environmental sounds.In other words, labels should elicit faster reaction times than sounds. In this experiment, they also provide novel evidence towards the notion that hearing some sort of mismatch in environmental sounds (i.e. hearing a loud, low pitched bark, but being presented with a picture of a small dog), increases reaction time even further. They show this by creating a picture recognition task. In this task, participants hear one of three conditions:

1.a direct label (i.e. the word “dog”, or “guitar”) 2. a “congruent” sound (i.e. a low pitched bark matched with a big dog) 3. an “incongruent” sound (i.e. a low pitched bark matched with a small dog)

More specifically, Edminston & Lupyan (2015) found that labels elicit a shorter reaction time than congruent trials and incongruent trials. They also found that congruent trials elicit a faster reaction time than incongruent trials.

In our replication, we will be attempting to replicate experiment 1A of Edminston & Lupyan (2015) exactly; using the same stimuli and task. In order to satisfy our replication, our data should suggest the exact same results that Edminston & Lupyan (2015) found:

labels elicit a elicit a faster reaction time than congruent trials and incongruent trials
congruent trials elicit a faster reaction time than incongruent trials.

We will conduct the same experiment, and run the same statistical tests that they do in the original paper. We will also visualize the results in the same way. Our main criteria for replicability will be found in the results of the statistical analysis; we must see the same statistical trends and significance levels as they found in the original paper in order to deem the reproduction successful. We (or at least, I) expect that both main results will replicate.

Methods

Participants: We collected data from n=50 participants (see Power Analysis). Edminston & Lupyan (2015) only targeted undergraduates, but, since we have an online study, we targeted anyone who was willing to take the study.

Materials: We use same materials as Edminston & Lupyan (2015): labels and environmental sounds for six categories: bird, dog, drum, guitar, motorcycle, and phone. Each category will have two separate images and two separate environmental sounds. We also implement a male and female voice for our spoken labels.

Procedure: We follow the same procedure as Edminston & Lupyan (2015). Participants hear a cue and see a picture, and are prompted to decide as “quickly and accurately as possible if the picture they saw came from the same basic-level category as the word or sound they heard” (Edminston & Lupyan pp.94). Following Edminston & Lupyan (2015), some environmental sounds will be congruent with the picture (i.e., a big dog matched with a low-pitched bark). These will be labeled as “congruent sounds”. Other environmental sounds will be deemed “incongruent sounds”, in which the picture does not necessarily match the environmental sounds (i.e., a low pitched bark matched with apicture of a small dog).

Power Analysis

Based on guidance from instructional staff, the sample size was determined with an a-priori power analysis with the package simr, and is adequate to achieve at least 80% power for detecting the effect reported in the original study at a significance criterion of alpha = .05 (any random effects not specified in the original paper were taken from a small pilot study).

Planned Sample

We plan to have a sample size of n = 50. For pre-screening, participants are required to speak English fluently and not have any hearing difficulties. This is put in place to ensure that there are no issues in task comprehension. Participants are recruited and compensated via Prolific.

Materials

From the original study: “The auditory cues comprised basic-level category labels and environmental sounds for six categories: bird, dog, drum, guitar, motorcycle, and phone. For each category, we obtained two distinct environmental sound cues, e.g., , , and two separate images for each subordinate category, e.g., two electric guitars for , two acoustic guitars for . To control for cue variability, we also used two versions of each spoken category label:one pronounced by a female speaker, one by a male speaker. All auditory cues were equated in duration (600 ms.) and normalized in volume. The images were color photographs (four images per category).” (pp. 94)

Procedure

From the original study: “On each trial participants heard a cue and saw a picture. We instructed participants to decide as quickly and accurately as possible if the picture they saw came from the same basic-level category as the word or sound they heard Participants were tested in individual rooms sitting approximately 2400 from a monitor such that images subtended 10 10. Trials began with a 250 ms. fixation cross followed immediately by the auditory cue, delivered via headphones. The target image appeared centrally 1s after the offset of the auditory cue and remained visible until a response was made. Each participant completed 6 practice and 384 test trials. If the picture matched the auditory cue (50% of trials) participants were instructed to respond ‘Yes’ on a gaming controller (e.g., or ‘‘phone’’ followed by a picture of any phone). Otherwise, they were to press ‘No’ (e.g., or ‘‘phone’’ followed by a dog). All factors (cue type, congruence) varied randomly within subjects. Auditory feedback (buzz or bleep) was given after each trial.” (pp. 94)

Analysis Plan

From the original study: “All participants performed very accurately on all items (M = 97%). Response times (RTs) shorter than 250 ms. or longer than 1500 ms. were removed (292 trials removed, 1.77% of total). We fit RTs for correct responses on matching trials (‘Yes’ responses) with linear mixed regression using maximum likelihood estimation (Bates, Maechler, Bolker, & Walker, 2013), including random intercepts and random slopes for within-subject factors and random intercepts for repeated items (unique trial types). Reported below are the parameter estimates (b) and confidence intervals for each contrast of interest. Significance tests were calculated using chi-square tests that compared nested models—models with and without the factor of interest—on improvement in log-likelihood.” (pp.94/95)

In our replication, we follow their analysis as closely as possible: we remove trials where response time is shorter than 250 ms or longer than 1500ms. We also replicate the same linear mixed-effects regression model, including the same main effects and random structure. Furthermore, we will also implement the chi-squared tests they ran that compared models with-and without the main effect using log-likelihood. We predict that their results will replicate; that verbal labels will elicit faster responses than sounds, and that congruent sounds will eleicit faster response times that incongruent sounds.

Clarify key analysis of interest here Linear mixed regression model, chi-square tests of nested models with and without the factor of interest.

Differences from Original Study

Our project will be online, while Edminston & Lupyan’s (2015) was in-person, in a lab setting. One thing we have to consider that might influence reaction time in our replication is internet speed, or other technical issues, that we cannot control for as participants take our online study. Furthermore, participants may be more prone to distractions. In addition, our participants will have to use their keyboards on their computers to react to stimuli, while, in the original study, participants used a video-game controller. This may result in a difference in reaction time in either direction.

Methods Addendum (Post Data Collection)

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

This replication, instead of being conducted in a lab, will take place online, which may influence reaction times. Additionally, participants will respond to trials using a keyboard rather than the gaming controller.

Design Review

1 factor was directly manipulated in Edminston & Lupyan (2015). This was the auditory cue participants heard, which could either be a label, a congruent environmental sound, or an incongruent environmental sound.

One measurement was taken, which was reaction time of the participants.

They used a within-participant design, subjecting participants to the same condition

The measures were repeated. 384 trials

There would be more confounds, and a difference in reaction time if the experimental design was switched.

They did not mention any direct measures to lessen the impact of demand characteristics.

They should have added some attention checks, given how many trials they had. I also would have kept the “incorrect” trials, given the fact that the incongruent labels may have resulted in a “no” response, which could be experimentally valid.

A confound could have been that they did not use a varied enough sample, more specifically, their sample was primarily undergraduates, which could mean that their population was skewed more towards the WEIRD population.

Pilot A

For Pilot A, we closely followed the experiment described in the original study. The GitHub page for our paradigm is linked below:

For our pilot, we collected data from three participants that we recruited using personal connections. This data was imported and uploaded to the data folder in the project repository, which can be found on our GitHub repoisitory here: Pilot A data: https://github.com/ucsd-psych201a/edmiston2015/tree/main/data/pilot_a

Pilot B

For pilot B, we collected 5 participants total, recruited via prolific. The data was uploaded to a folder in our github repo, which can be found here:

Pilot B data: https://github.com/ucsd-psych201a/edmiston2015/tree/main/data/pilot_b

median time it took participants to do the experiment: 25 minutes (via Prolific)

Results

Data preparation

As mentioned previously, we replicate experiment 1A from Edminston & Lupyan (2015). As for our data preparation, we make sure to label the type of relationship between the sound participants hear in a given trial, and its match to the picture. In other words, we define whether a particular trial is a label (which will just be a word such as “dog, or”guitar”), a congruent sound, or an incongruent sound. Furthermore, we will use the exact same audio and visual stimuli as Edminston & Lupyan (2015). All of the different audio files have already been clipped to make them match in duration, and they have all been amplitude normalized.

We start by loading our raw data and pre-processing it:

### Data Preparation

#### Load Relevant Libraries and Functions
library(tidyverse)
library(readr)
library(lme4)

# Import data
folder_path <- "../data/complete"
csv_files <- list.files(folder_path, pattern = "*.csv", full.names = TRUE)
df_list <- lapply(csv_files, read_csv, show_col_types = FALSE)
original_df <- bind_rows(df_list)

#exclude practice trials 
original_df <- original_df |>
  filter(exp_part == "actual")

#create an extra df for exploratory analysis
df_n <- original_df

# convert the data type and compute correctness of each trial
original_df <- original_df %>%
  mutate(correct_response = as.character(correct_response)) %>%
  mutate(response = as.character(response)) %>%
  mutate(correct = correct_response == response)

# compute the accuracy of each participant
accuracy_table <- original_df %>%
  group_by(ID) %>%
  summarise(accuracy = mean(correct, na.rm = TRUE))

Next, we check the distribution of accuracy:

library(ggplot2)

# set the accuracy filtering threshold
accuracy_thresh <- 0.9

# plot the distribution of accuract
ggplot(accuracy_table, aes(x = accuracy)) +
  geom_histogram(
    breaks=seq(0, 1, length.out=21),
    fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = accuracy_thresh, color = "red", linetype = "dashed", size = 1) + # mark the threshold for QA
  labs(
    title = "Distribution of Accuracy",
    x = "Accuracy",
    y = ""
  ) +
  theme_minimal()

42 people (which are 84% of the total participants) achieved an accuracy above 90%. Considering that in the original study “all participants performed very accurately on all items (M = 97%)” (Edmiston & Lupyan, 2015), the accuracy of participants in our replication is notably lower.

We also examine the distribution of response times for participants with an accuracy above 90%:

library(ggplot2)

# get the list of IDs of participants whose accuracy is above the threshold
id_pass_qa <- accuracy_table %>%
  filter(accuracy > accuracy_thresh) %>%
  pull(ID)
# filter the original dataframe to keep only data from the qualified participants
original_df_with_good_subj <- original_df %>%
  filter(ID %in% id_pass_qa)

# plot the distribution of response time
ggplot(original_df_with_good_subj, aes(x = rt)) +
  geom_histogram(
    breaks=seq(0, 2000, length.out=40),
    fill = "blue", color = "black", alpha = 0.7) +
  geom_vline(xintercept = 250, color = "red", linetype = "dashed", size = 1) +
  geom_vline(xintercept = 1500, color = "red", linetype = "dashed", size = 1) +
  labs(
    title = "Distribution of Reaction Time",
    x = "Reaction Time",
    y = ""
  ) +
  theme_minimal()

# compute the quantile
ecdf_func <- ecdf(original_df_with_good_subj$rt)
rt_quantiles <- ecdf_func(c(250, 1500))
# compute the mask for whether each trial has passed the check
rt_pass_mask <- (original_df_with_good_subj$rt > 250) & (original_df_with_good_subj$rt < 1500)

15083 trials (which are 94% of the trials) have response time within the range of 250-1500 ms.

Now we further preprocess the data, by

excluding participants who do not achieve an accuracy of 90%
excluding trials whose reaction time is shorter than 250 ms or longer than 1500 ms
keeping only trials whose correct answer is ‘YES’ and people made the correct response.
recoding the trial ‘congruency’ conditions as ‘label’, ‘congruent’, and ‘incongruent’

combined_df <- original_df
  
## keep only people who passed the QA check
combined_df <- combined_df %>%
  filter(ID %in% id_pass_qa) %>%
  select(-correct) # drop the correctness column

# exclude only correct 'YES' trials 
combined_df <- combined_df %>%
  filter(response == "f") %>%
  filter(correct_response == "f")

# rename "sound_subtype" to "cue"
combined_df <- combined_df %>%
  rename(cue = sound_subtype)

# create congruency column 
combined_df <- combined_df %>%
  mutate(congruency = case_when(
    cue == "label" ~ "label",
    img_subtype %in% c("song", "york", "bongo", "acoustic", "harley", "rotary") & sound_version == "A" ~ "incongruent",
    TRUE ~ "congruent"
  ))

# filter reaction time 
combined_df <- combined_df %>%
  filter(rt >250, rt <1500)

# Prepare data for analysis - keep only columns relevant to the model.
combined_df <- combined_df %>%
  select(rt, ID, sound_category,cue, congruency)

Confirmatory analysis

Following Edminston & Lupyan (2015), we fit a mixed effects linear regression model to our response time data (which is filtered for correct responses on matching trials (i.e., ‘Yes’ responses)). The model includes random intercepts and random slopes for within-subject factors (i.e. “subject”), and random intercepts for repeated items (i.e. what item was used (bird, guitar, dog, etc.)). The main effect of the model will be the “condition” variable, which denotes whether a particular trial presented a label, congruent sound, or incongruent sound. Furthermore, we calculate chi-square tests that “compare nested models–models with and without the factor of interest–on improvement in log-likelihood” (pp.94).

Our analysis tests the main inference of the paper, which is that condition affects reaction time. More specifically, our parameter estimates help us get a sense of exactly how the different conditions influence reaction time. Crucially, the model also allows us to compare the effect on reaction time each condition has to each other. We also implement a pairwise comparison to get a sense of the relationship between ‘label’ and ‘incongreunt’ trials, which is not directly represented in our base model where the reference level is condition.

Furthermore, creating a new model without the factor of interest and comparing it performance to the full model will allow us to better understand if the factor of interest truly accounts for the trends we see.

Before creating the model, however, we first visualize the data and compare to the original study:

#load libraries:
library(ggplot2)
library(gridExtra)


Attaching package: 'gridExtra'

The following object is masked from 'package:dplyr':

    combine

library(jpeg)
library(grid)

# rename and reorder some columns for plotting
plot_df <- combined_df |>
  mutate(congruency=factor(
    congruency, levels = c("label", "congruent", "incongruent"))) |>
  mutate(congruency = fct_recode(congruency, 
    "Label"="label", 
    "Congruent Sound"="congruent", 
    "Incongruent Sound"="incongruent")
  )

# plot the result of our replication
our_plot <- ggplot(data = plot_df,
       mapping = aes(x = congruency,
                     y = rt,
                     color = congruency, 
                     fill = congruency)) +
  geom_bar(stat = "summary", fun = "mean", 
           width = 1,
           color = "black") + 
  stat_summary(fun.data = "mean_cl_boot", 
               geom = "linerange", 
               color = "black") +
   geom_errorbar(stat = "summary", 
                fun.data = "mean_cl_boot", 
                width = 0.5, 
                size=0.5,
                color = "black") +
  scale_fill_manual(values = c("firebrick", "darkblue", "slategray2")) +
  scale_color_manual(values = c("firebrick", "darkblue", "slategray2")) +
  labs(
    x = "Experiment 1A",
    y = "Verification Speed (ms)",
    color = "Cue Type",
    fill = "Cue Type") +
  coord_cartesian(xlim=c(0, 4), ylim = c(400, 725), expand=FALSE) +
  scale_y_continuous(breaks = seq(400, 700, by = 50), expand=c(0, 25)) +
  theme_classic() +
  theme(
    axis.text.x = element_blank(), 
    axis.text.y = element_text(size = 10, color = "black"),
    axis.title.x = element_text(size = 10, color = "black"), 
    axis.title.y = element_text(size = 10, color = "black"), 
    legend.title = element_text(size = 10, color = "black", face='bold'), 
    legend.text = element_text(size = 10, color = "black"),
    plot.margin = unit(c(0.0, 0.1, 0.0, 0.1), "npc")
  ) +
  theme(
    aspect.ratio = 1.8)

# load the plot from the original study
edmiston_img <- rasterGrob(readJPEG("edmiston_exp1a.jpg"), interpolate = TRUE)

# display side by side
plot_title <- textGrob("Our Result", gp = gpar(fontface = "bold", fontsize = 14))
image_title <- textGrob("Original Study Result", gp = gpar(fontface = "bold", fontsize = 14))

grid.arrange(
  arrangeGrob(image_title, edmiston_img, ncol = 1, heights = c(0.5, 5)),
  arrangeGrob(plot_title, ggplotGrob(our_plot), ncol = 1, heights = c(0.5, 5)),
  ncol = 2, widths=c(1, 1.7))

Now that we have visualized the data, we fit our regression model:

#load libraries
library(lmerTest)
library(emmeans)
library(lme4)

#find best optimizer
#allFit(model_full)

#set up model and view results
model_full <- lmer(rt ~ congruency + (1 + congruency|ID) + (1|sound_category), data = combined_df, control = lmerControl(optimizer = "bobyqa"))

summary(model_full)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ congruency + (1 + congruency | ID) + (1 | sound_category)
   Data: combined_df
Control: lmerControl(optimizer = "bobyqa")

REML criterion at convergence: 99407.8

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.3748 -0.6827 -0.1974  0.4509  4.2050 

Random effects:
 Groups         Name                  Variance Std.Dev. Corr       
 ID             (Intercept)           18945.80 137.644             
                congruencyincongruent   822.77  28.684  -0.10      
                congruencylabel        1768.00  42.048  -0.46  0.30
 sound_category (Intercept)              16.17   4.021             
 Residual                             44351.37 210.598             
Number of obs: 7328, groups:  ID, 42; sound_category, 6

Fixed effects:
                      Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)            690.928     21.686  41.062  31.861  < 2e-16 ***
congruencyincongruent   15.047      9.365  41.783   1.607    0.116    
congruencylabel        -52.267      8.390  39.661  -6.229 2.33e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) cngrncyn
cngrncyncng -0.126         
congrncylbl -0.439  0.317

#get 95% CI
confint.merMod(model_full,method="Wald")

                           2.5 %    97.5 %
.sig01                        NA        NA
.sig02                        NA        NA
.sig03                        NA        NA
.sig04                        NA        NA
.sig05                        NA        NA
.sig06                        NA        NA
.sig07                        NA        NA
.sigma                        NA        NA
(Intercept)           648.424558 733.43232
congruencyincongruent  -3.307666  33.40179
congruencylabel       -68.712147 -35.82268

The model only gives us a comparison between congruent vs. incongruent trials, and congruent vs. label trials. To see the full extent of the results, we conduct a post-hoc analysis:

#post-hoc test:
fit_result <- model_full %>% 
  emmeans(
    pairwise ~ congruency, # contrast across 'congruency' conditions
    adjust = "bonferroni" # p-value adjusted for multiple comparisons
  ) %>% 
  pluck("contrasts")

fit_result

 contrast                estimate    SE  df z.ratio p.value
 congruent - incongruent    -15.0  9.36 Inf  -1.607  0.3243
 congruent - label           52.3  8.39 Inf   6.229  <.0001
 incongruent - label         67.3 10.40 Inf   6.470  <.0001

Degrees-of-freedom method: asymptotic 
P value adjustment: bonferroni method for 3 tests

To compute the 95 confidence interval:

ci <- confint(fit_result, level=0.95)
ci

 contrast                estimate    SE  df asymp.LCL asymp.UCL
 congruent - incongruent    -15.0  9.36 Inf     -37.5      7.37
 congruent - label           52.3  8.39 Inf      32.2     72.35
 incongruent - label         67.3 10.40 Inf      42.4     92.22

Degrees-of-freedom method: asymptotic 
Confidence level used: 0.95 
Conf-level adjustment: bonferroni method for 3 estimates

Next, we conduct our significance test:

#create model without factor of interest 
model_reduced <- lmer(rt ~ (1 + congruency|ID) + (1|sound_category), data = combined_df)

#run an anova to get chi-squared values
anova(model_full, model_reduced)

Data: combined_df
Models:
model_reduced: rt ~ (1 + congruency | ID) + (1 | sound_category)
model_full: rt ~ congruency + (1 + congruency | ID) + (1 | sound_category)
              npar   AIC   BIC logLik deviance  Chisq Df Pr(>Chisq)    
model_reduced    9 99481 99543 -49731    99463                         
model_full      11 99450 99526 -49714    99428 34.967  2  2.553e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results

The results of the statistical analysis showed us that labels elicited faster reaction times than congruent trials (b = -52.3, CI [-68.71, -35.82]) and that congruent trials were not significantly faster than incongruent trials (b = 15.04, CI [-3.31 33.4]. Furthermore, the results of our post-hoc analysis shows us that incongurent trials elicited slower reaction times than labels trials (b = 67.3, CI [42.4, 92.22]). Lastly, the results of a chi-squared signifcance test shows us that the model with the factor of interest (congruency) out-performs a model without this factor of interest.

Exploratory analyses

The results of the original data were presented using only a bar graph, however, we thought it was important to see the distribution the data, especially since our results were slightly different. To that end, we include a box plot that shows the individual data points. We expand a bit on the results shown in the original paper by re-running the same analysis, but instead of filtering out ‘No’ responses, we filter out ‘Yes’ responses.

#box plot
combined_df$congruency <- factor(combined_df$congruency, levels = c("label", "congruent", "incongruent"))
ggplot(data = combined_df,
       mapping = aes(x = congruency,
                     y = rt,
                     color = congruency)) +
  geom_boxplot(width = 0.6, outlier.shape = NA) + 
  geom_jitter(width = 0.2, alpha = 0.1, size = 1.5) +
  scale_color_manual(values = c("darkred", "darkblue", "darkgreen")) + #I set the color to darkgreen instead of slategrey so that the points are easier to see
  labs(
    y = "Reaction Time (Ms)",
    color = "Cue Type",
    x = "Cue Type",
    fill = "Cue Type "
  ) +
  theme_classic() +
  theme(
  axis.text.x = element_blank(), 
  axis.text.y = element_text(size = 14, color = "black"),
  axis.title.x = element_text(size = 16, color = "black", face = "bold"), 
  axis.title.y = element_text(size = 16, color = "black", face = "bold"), 
  legend.title = element_text(size = 14, color = "black"), 
  legend.text = element_text(size = 14, color = "black") 
)

Looking at the graph with the data distribution, it becomes clear that there are quite a few participants who had long rt’s for congruent trials. I imagine this brought the mean rt for this category up, thus creating a less straightforward distinction in rt between congruent and incongruent trials.

Next, we run the same statistical analysis on the data, but instead, filter out the ‘yes’ trials:

#re-run statistics on "No" data
df_n <- df_n%>%
  filter(exp_part == "actual") %>%
  filter(response == "j") %>%
  filter(correct_response == "j") %>%
  rename(cue = sound_subtype) %>%
  mutate(congruency = case_when(
    cue == "label" ~ "label",
    img_subtype %in% c("song", "york", "bongo", "acoustic", "harley", "rotary") & sound_version == "A" ~ "incongruent",
    TRUE ~ "congruent"
  )) %>%
  filter(rt >250, rt <1500)

  
model_n <- lmer(rt ~ congruency + (1 + congruency|ID) + (1|sound_category), data = df_n, control = lmerControl(optimizer = "bobyqa"))
summary(model_n)

Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ congruency + (1 + congruency | ID) + (1 | sound_category)
   Data: df_n
Control: lmerControl(optimizer = "bobyqa")

REML criterion at convergence: 109130.2

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.5803 -0.6826 -0.1987  0.4881  4.1562 

Random effects:
 Groups         Name                  Variance Std.Dev. Corr       
 ID             (Intercept)           23630.5  153.72              
                congruencyincongruent   669.0   25.87    0.35      
                congruencylabel        2016.4   44.90   -0.44  0.45
 sound_category (Intercept)             181.7   13.48              
 Residual                             45966.8  214.40              
Number of obs: 8022, groups:  ID, 48; sound_category, 6

Fixed effects:
                      Estimate Std. Error      df t value Pr(>|t|)    
(Intercept)            757.720     23.214  49.748  32.641  < 2e-16 ***
congruencyincongruent    2.594      8.777  45.036   0.295    0.769    
congruencylabel        -37.253      8.336  45.758  -4.469 5.14e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) cngrncyn
cngrncyncng  0.064         
congrncylbl -0.409  0.366

confint.merMod(model_full,method="Wald")

                           2.5 %    97.5 %
.sig01                        NA        NA
.sig02                        NA        NA
.sig03                        NA        NA
.sig04                        NA        NA
.sig05                        NA        NA
.sig06                        NA        NA
.sig07                        NA        NA
.sigma                        NA        NA
(Intercept)           648.424558 733.43232
congruencyincongruent  -3.307666  33.40179
congruencylabel       -68.712147 -35.82268

We get the same trends when the statistical model is run on the “No” responses: the difference between label and congruent is signifcant, but not the difference between congruent and incongruent.

Discussion

Summary of Replication Attempt

The main finding of the original paper (i.e. that ambiguous labels elicit faster cue recognition than environmental sounds) held true. However, our results only partially replicated, as the difference in rt between congruent and incongruent sound trials was insignificant (though, trending in the right direction!). Our team belives this discrepency between the original study and our own is due to the difference in environement: i.e., the original results were taken in a lab, using video ga,e controllers, while ours was taken online using a keyboard.

Commentary

As mentioned previously, our results only partially replicated. I actually find this a bit odd; earlier studies (e.g.Lupyan and Schill 2012) found the same results: that labels elicit faster rt’s than sounds. However, I believe the major contribution Edminston & Lupyan (2015) were trying to make is that this discrepancy in rt between label and sound is due to the extra processing time it takes to decide whether the sound/picture combination is congruent or incongruent. In their own words “the label-advantage is obtained precisely because labels are detached from idiosyncracies of specific category members, and thereby able to selectively activating the features/dimensions most diagnostic of the named category…While a dog-bark or a guitar strum are both readily recognizable, the information conveyed by these environmental cues is necessarily specific” (pp. 98). What makes our results confusing, then, is that we found a difference between label and sound trials, but the notion that this difference is conditioned by the extra processing sounds require is less clear, given that deciding between congruent and incongruent sound trials does not seem to influence rt in our data.

However, it is worth noting that our experiment differed from the original study in terms of modality and location; our study was online, and the keyboard was different. It could be the case that these differences caused a skew in our data that was not present in the original study, but this seems a bit unlikely to me. If you compare the mean rt’s for congruency (i.e. compare the boxplot in the original study to ours), the means for label and incongruent are actually pretty similar; its the congruent that have a higher rt (about ~100 ms). In other words, it seems sort of unlikely that the differences presented in our rendition of the experiment only really affected the congruent trials.

Statement of Contributions

Jenna Brooks: Project Administration (lead), Software (JsPsych Experiment) (supporting), Writing and Editing (equal)
Noah Khaloo: Formal Analysis (lead), Data Curation (lead), Statistical Analysis and Visualization (lead), Writing and Editing (equal)
Sihan Yang: Conceptualization (lead), Software (JsPsych Experiment) (lead), Writing and Editing (equal)
Reeka Estacio: Software (JsPsych Experiment) (supporting), Validation (lead), JsPsych Experiment (supporting), Presentation (lead), Writing and Editing (equal)

References

Lupyan, G., & Thompson-Schill, S. L. (2012). The evocative power of words: Activation of concepts by verbal and nonverbal means. Journal of Experimental Psychology-General, 141(1), 170–186. http://dx.doi.org/10.1037/a0024904.