Replication of Study 1 by Corps & Rabagliati (2020, Journal of Memory and Language)

Author

Xirong Hu (xirohu@stanford.edu)

Published

December 14, 2025

Introduction

Justification

In what form predictions occur remains an unsolved problem in the field of speech comprehension. Corps and Rabagliati (2020) revealed that predictions about semantic content, instead of word forms, facilitated the perception of distorted speech in Experiment 1. This closely aligns with my research interest in how top-down information facilitates speech comprehension in challenging scenarios. Another reason for choosing this study is that the context manipulation used in Experiment 1 is relatively new and was proposed to better separate word-form prediction from semantic content prediction. Thus, it is worthwhile to replicate these findings. In addition to these theoretical motivations, there are two practical considerations. First, the original study was conducted on Prolific Academic, which makes the project more feasible. Second, the procedure was relatively well-documented in the paper, with stimulus list and analysis scripts shared on OSF, which makes it easier to adhere to the original procedure and analysis as closely as possible.

Stimuli and Procedures

The stimuli consist of 30 auditorily presented questions and 30 distorted answers. For the procedure, I will strictly follow the one described in the original paper (Experiment 1/Method/Procedure).

I anticipate two main challenges. The first concerns the stimuli. The authors have not made the stimuli publicly available. Moreover, the original study was conducted in the United Kingdom, where the low-level acoustic features are different. That might lead to perceptual differences after applying the noise-vocoding. This means I will need to record new auditory stimuli and apply the noise-vocoding. However, the script for that procedure was not shared. One solution is to reach out to the original authors. The second challenge concerns the programming of the experiment, since I have no experience with jsPsych.

Methods

Power Analysis

The paper’s central research question is whether high-level knowledge enhances comprehension through:

  • Semantic predictions, indicated by a null interaction between Question Constraint and Answer Consistency.
  • Form predictions, indicated by a Question Constraint effect and its interaction with Answer Consistency.

The key observation that favors the Semantic prediction instead of form prediction is the absence of interaction.

Original effect size:

  • Answer Consistency * Question Constraint: b = -0.18, SE = 0.30, p = 0.55, Bayes factor = 0.19
  • Original sample size: 80

Because simulating power for the full Bayesian mixed model is computationally heavy, I approximated the power as follows. I simplified the original model to a two-way ANOVA (Prop.Expected ~ ACondition * QCondition) and calculated the effect size (Cohen’s f²). I then used the BayesPower package to calculate the sample size needed to achieve 80% and 90% power. The script can be found in the repository.
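The effect-size step can be sketched as follows. This is a minimal illustration on simulated data, not the script from the repository: the cell means, the noise level, and the `Replicate` column are assumptions chosen only to make the example run. It shows how Cohen’s f² for the interaction term can be derived from the sums of squares of a two-way ANOVA.

```r
set.seed(42)

# Simulated 2 x 2 data (hypothetical values, not the original data set):
# two main effects and no true interaction
sim <- expand.grid(
  Replicate  = 1:80,
  ACondition = c("Implausible", "Plausible"),
  QCondition = c("Predictable", "Unpredictable")
)
sim$Prop.Expected <- 0.2 +
  0.5 * (sim$ACondition == "Plausible") +
  0.1 * (sim$QCondition == "Predictable") +
  rnorm(nrow(sim), sd = 0.15)

# Two-way ANOVA with the same formula as the simplified model
fit <- aov(Prop.Expected ~ ACondition * QCondition, data = sim)
tab <- summary(fit)[[1]]
ss  <- setNames(tab[["Sum Sq"]], trimws(rownames(tab)))

# Partial eta squared and Cohen's f2 for the interaction term
eta2_p <- ss[["ACondition:QCondition"]] /
  (ss[["ACondition:QCondition"]] + ss[["Residuals"]])
f2_interaction <- eta2_p / (1 - eta2_p)
f2_interaction
```

Because f² equals the ratio of the interaction sum of squares to the residual sum of squares, a design with no true interaction yields a value close to zero, which is the situation described for the original data.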

Power Analysis Results:

  • 80% power: 104 participants
  • 90% power: 324 participants

Planned Sample

My planned sample size is 80. I chose to stay with the original sample size because of budget constraints. This does not meet the estimate for 80% power. However, the lowest Cohen’s f² I can choose in the BayesPower package is 0.01, whereas my calculated Cohen’s f² is 0.0015. This suggests that the number of participants actually needed to achieve 80% power might be smaller than 104. A follow-up simulation analysis could test this, although it exceeds my capacity.

Participants were recruited from Prolific with the following criteria:

  • Residence: United Kingdom
  • Nationality: United Kingdom
  • First Language: English

Materials

To ensure that my replication closely aligns with the original study, I reached out to the first author for materials and scripts. The author kindly shared all the stimuli. Because some questions asked about the political leaders of that time and their answers no longer hold, I deleted those items, leaving 14 items per list, whereas the original stimuli had 15 items per list. The items that I deleted are:

  • List_1a: (Question)Which female candidate recently ran for president of the United States? (Answer)Hillary Clinton
  • List_1b: (Question)What is the name of the British prime minister? (Answer)Theresa May
  • List_2a: (Question)Which female candidate recently ran for president of the United States? (Answer)The Northern Lights
  • List_2b: (Question)What is the name of the British prime minister? (Answer)New York
  • List_3a: (Question)Who did you see when you visited America? (Answer)Hillary Clinton
  • List_3b: (Question)Who did you see when you visited London? (Answer)Theresa May
  • List_4a: (Question)What did you buy from the shop? (Answer)Theresa May
  • List_4b: (Question)What is your least favorite method of transport? (Answer)Hillary Clinton

Procedure


“The experiment was administered online on Prolific Academic. Stimulus presentation was controlled using jsPsych (De Leeuw, 2015) and data was recorded using MySQL (version 5.7). Participants were warned that they would be listening to audio stimuli, and so were encouraged to complete the experiment in a quiet environment or to use headphones. Before the task, participants were instructed: “First you will hear a female speaker ask a question in a clear voice. You will then hear a male answer this question in a distorted voice. Your task is to listen carefully and type exactly what you think the male speaker said. If you do not know, then please guess”. To make stimulus onset salient, a fixation cross appeared 500 ms before question playback (see Fig. 1a). The fixation cross then turned red and answer playback began 500 ms later. After listening to the answer, participants were prompted to type their response and press a “submit answer” button.”

Due to implementation challenges, the data were recorded using AJAX calls instead of MySQL. Everything else remained the same.

The eight experiments corresponding to eight lists are listed below:

Analysis Plan

Following Corps and Rabagliati (2020), I will conduct a signal detection analysis to assess participants’ sensitivity to the words they actually heard while controlling for response bias. Participants’ responses will be manually coded for the number of words matching the semantically consistent answer. I will follow these rules, as specified by the original author:

  • correct obvious spelling mistakes
  • do not correct morphological mismatches
  • do not score words reported in the wrong order as matching
  • exclude trials where participants typed the question rather than the answer
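The coding itself is manual, but the order rule can be illustrated with a small sketch. The function below is hypothetical and not part of the original scripts: it assumes spelling has already been corrected by hand, compares words exactly (so morphological variants do not match), and counts matches with the longest common subsequence of the two word sequences, which is one possible reading of "words in the wrong order do not count".

```r
# Hypothetical helper: number of expected words reported in the correct
# relative order, computed as the longest common subsequence (LCS) of tokens.
count_matching_words <- function(response, expected) {
  tokenize <- function(x) {
    # Lowercase, replace punctuation (except apostrophes) with spaces, split
    x <- tolower(gsub("[^[:alnum:]' ]", " ", x))
    tokens <- strsplit(trimws(x), "\\s+")[[1]]
    tokens[nzchar(tokens)]
  }
  r <- tokenize(response)
  e <- tokenize(expected)
  if (length(r) == 0 || length(e) == 0) return(0L)

  # Dynamic-programming table; row/column 1 stand for the empty prefix
  lcs <- matrix(0L, nrow = length(r) + 1, ncol = length(e) + 1)
  for (i in seq_along(r)) {
    for (j in seq_along(e)) {
      lcs[i + 1, j + 1] <- if (r[i] == e[j]) {
        lcs[i, j] + 1L
      } else {
        max(lcs[i, j + 1], lcs[i + 1, j])
      }
    }
  }
  lcs[length(r) + 1, length(e) + 1]
}

count_matching_words("the northern lights", "The Northern Lights")
count_matching_words("lights northern", "The Northern Lights")
count_matching_words("northern light", "The Northern Lights")
```

A helper like this could serve as a first pass or a consistency check on the manual codes. It does not replace the manual judgment required for spelling corrections.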

After that, I will fit the same mixed-effects logistic regression model as the original authors. The scripts can be found in the GitHub repository.

Key analysis of interest

The model structure is as follows:

cbind(ReportedExpectedWords, UnreportedExpectedWords) ~ 
  Question_Constraint * Answer_Consistency * Block + 
  (1 + Block || Participant) + 
  (1 + Question_Constraint * Answer_Consistency * Block || Item)
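To make the response structure concrete, here is a simplified sketch on simulated data. It is not the confirmatory model: the random effects for Participant and Item and the Block factor are deliberately omitted so that it runs in base R with `glm()`, and the ±0.5 predictor coding, the three words per answer, and the effect sizes are assumptions for illustration only. It shows how `cbind(successes, failures)` defines a binomial outcome per trial and how the interaction term enters the formula.

```r
set.seed(1)

# Hypothetical trial-level data for a 2 x 2 design with 3 expected words per answer
n_trials <- 400
trials <- data.frame(
  Question_Constraint = rep(c(-0.5, 0.5), each = n_trials / 2),
  Answer_Consistency  = rep(c(-0.5, 0.5), times = n_trials / 2),
  n_words             = 3L
)

# Simulate word reports with two main effects and no interaction (log-odds scale)
eta <- -0.5 + 1.0 * trials$Question_Constraint + 2.0 * trials$Answer_Consistency
trials$ReportedExpectedWords   <- rbinom(n_trials, trials$n_words, plogis(eta))
trials$UnreportedExpectedWords <- trials$n_words - trials$ReportedExpectedWords

# Binomial regression on (reported, unreported) counts; fixed effects only
fit <- glm(
  cbind(ReportedExpectedWords, UnreportedExpectedWords) ~
    Question_Constraint * Answer_Consistency,
  family = binomial,
  data = trials
)
round(coef(summary(fit)), 3)
```

In the full analysis, the same left-hand side would be passed to a mixed-effects fitting function together with the random-effects terms shown in the formula.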

Differences from Original Study

As noted in previous sections, the differences are:

  • Items referencing outdated political figures were removed, resulting in 14 items per list instead of the original 15 (see the Materials section).
  • Data were recorded using AJAX calls rather than the MySQL method in the original paper, and the resulting data structure was transformed to match the original format for analysis.

This project was registered before data collection at the link

Methods Addendum (Post Data Collection)

Actual Sample

The final dataset consists of 77 participants recruited via Prolific Academic (three were excluded because they typed the questions instead of the answers). Forty-one participants were female, and the median age was 45.

Differences from pre-data collection methods plan

To better understand participant behavior, I conducted an exploratory analysis on the type of errors made in the “implausible” conditions after collecting data. Given the difficulty of the task (noise-vocoded speech with semantically inconsistent answers), we might expect a high rate of “giving up”. This analysis was inspired by reviewing the raw responses. Below is the code.

library(tidyverse)
library(knitr)
library(lme4)
library(doBy)

Attaching package: 'doBy'

The following object is masked from 'package:dplyr':

    order_by
library(broom.mixed)
# read data
data_path = 'https://raw.githubusercontent.com/psych251/corps2020/main/data/processed/transformed_dprime.csv'
data = read_csv(data_path)
# read_csv() reports 1078 rows and 26 columns
# calculate proportion of "don't know" responses
dont_know_keywords <- c("don't know", "dont know", "unsure", "muffled", "no idea", "unclear")
data$IsDontKnow <- grepl(paste(dont_know_keywords, collapse="|"), tolower(data$response))

# Summarize  rates by condition
dont_know_summary <- summaryBy(IsDontKnow ~ ACondition + QCondition, data=data, FUN=c(mean))
kable(dont_know_summary, caption = "Proportion of 'Don't Know' Responses by Condition")
Proportion of ‘Don’t Know’ Responses by Condition

| ACondition | QCondition | IsDontKnow.mean |
|---|---|---|
| Implausible | Predictable | 0.0413534 |
| Implausible | Unpredictable | 0.0680272 |
| Plausible | Predictable | 0.0075188 |
| Plausible | Unpredictable | 0.0119048 |
# Fit a model to see how the two factors influence the "don't know" rate
model_interaction <- glm(IsDontKnow ~ ACondition * QCondition, 
                        data = data, 
                        family = binomial)
model_summary <- tidy(model_interaction) %>%
  mutate(
    OR = exp(estimate),  # Odds ratios
    `95% CI` = paste0("[", round(exp(estimate - 1.96*std.error), 2), 
                      ", ", round(exp(estimate + 1.96*std.error), 2), "]"),
    across(c(estimate, std.error, statistic, p.value), ~round(., 3))
  ) %>%
  select(term, estimate, std.error, statistic, p.value, OR, `95% CI`)

kable(model_summary,
      caption = "Logistic Regression: 'Don't Know' Responses ~ Answer Consistency × Question Constraint",
      col.names = c("Predictor", "β", "SE", "z", "p", "OR", "95% CI"),
      digits = 3)
Logistic Regression: ‘Don’t Know’ Responses ~ Answer Consistency × Question Constraint

| Predictor | β | SE | z | p | OR | 95% CI |
|---|---|---|---|---|---|---|
| (Intercept) | -3.143 | 0.308 | -10.208 | 0.000 | 0.043 | [0.02, 0.08] |
| AConditionPlausible | -1.739 | 0.774 | -2.248 | 0.025 | 0.176 | [0.04, 0.8] |
| QConditionUnpredictable | 0.526 | 0.385 | 1.365 | 0.172 | 1.692 | [0.8, 3.6] |
| AConditionPlausible:QConditionUnpredictable | -0.062 | 0.995 | -0.062 | 0.950 | 0.940 | [0.13, 6.6] |
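As a consistency check, the coefficients can be converted back into cell proportions. The sketch below copies the rounded estimates from the regression output and applies the inverse logit. The vector names are illustrative only. Because the model has one parameter per cell of the 2 × 2 design, the recovered values should agree with the descriptive proportions reported by condition, up to rounding.

```r
# Coefficients copied from the logistic regression output (log-odds scale)
b <- c(
  intercept   = -3.143,  # Implausible answer, Predictable question (reference cell)
  plausible   = -1.739,  # AConditionPlausible
  unpredict   =  0.526,  # QConditionUnpredictable
  interaction = -0.062   # AConditionPlausible:QConditionUnpredictable
)

# Linear predictor for each cell of the 2 x 2 design
log_odds <- c(
  Implausible_Predictable   = b[["intercept"]],
  Implausible_Unpredictable = b[["intercept"]] + b[["unpredict"]],
  Plausible_Predictable     = b[["intercept"]] + b[["plausible"]],
  Plausible_Unpredictable   = sum(b)
)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
cell_props <- plogis(log_odds)
round(cell_props, 4)
```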

Results

Data preparation

I converted the stored data into a structure that matches the data structure of the original paper (see https://osf.io/kwa32) using a Python script. The converted file can be found here. The processed data file includes trial-level information on question constraint, answer consistency, and the proportion of expected words reported.

Confirmatory analysis

To replicate the findings of Corps and Rabagliati (2020), I fitted a generalized linear mixed-effects model with a binomial family. The model predicted the proportion of expected words reported from the fixed effects of Question Constraint, Answer Consistency, Block, and their interactions. The scripts for the figure and the mixed-effects model can be found in the scripts folder.

Table 1 summarizes the fixed effects from the model.

Table 1: Full model output for fixed effects for the analysis of word report scores in Experiment 1.
Words Reported in Expected Answer

| Predictor | Estimate (SE) | z | p value | Bayes factor |
|---|---|---|---|---|
| Intercept | −0.71 (0.40) | −1.80 | .071 | |
| Question Constraint | 1.87 (0.37) | 5.02 | < .001 | > 100 |
| Answer Consistency | 5.19 (0.53) | 9.87 | < .001 | > 100 |
| Block | 0.41 (0.24) | 1.74 | .082 | 0.30 |
| Question Constraint * Answer Consistency | −0.02 (0.37) | −0.05 | .961 | 0.04 |
| Question Constraint * Block | −0.10 (0.23) | −0.43 | .668 | 0.07 |
| Answer Consistency * Block | 0.85 (0.25) | 3.40 | < .001 | 13.17 |
| Question Constraint * Answer Consistency * Block | 0.10 (0.25) | 0.41 | .685 | 0.35 |

The original table can be found below.

Original Table

Figure 1 displays the average proportion of heard-answer words and of expected-answer words that participants reported across conditions. Figure 2 is the corresponding figure from Corps and Rabagliati (2020).

Figure 1: Observed means of the proportion of words in the heard answer (left panel) and the expected answer (right panel) reported correctly across conditions. Error bars represent +/- 1 standard error from the mean
Figure 2: Figures in Original Paper

The descriptive figures and model output reveal a strong main effect of Answer Consistency. Participants were significantly more accurate at reporting expected words when the answer was semantically consistent than when it was inconsistent. The interaction between Question Constraint and Answer Consistency, the key statistic for distinguishing “form” from “semantic” predictions, did not reach significance (beta = -0.02, SE = 0.37, p = 0.961, BF = 0.04). This mirrors the findings of Corps and Rabagliati (2020) (beta = -0.18, SE = 0.30, p = 0.55, BF = 0.19), supporting the hypothesis that prediction facilitates perception of distorted speech via semantic content rather than specific word-form anticipation.

Exploratory analyses

The exploratory analysis of “don’t know” responses is reported in the Methods Addendum above.

Discussion

Summary of Replication Attempt

This project successfully replicated the primary result of Corps & Rabagliati (2020). The confirmatory analysis demonstrated a robust main effect of Answer Consistency: participants were significantly better at identifying noise-vocoded words when the answer was semantically plausible given the question. Crucially, as in the original study, this benefit did not interact strongly with Question Constraint. This supports the theoretical claim that top-down predictions during speech comprehension operate largely at the semantic level, allowing listeners to leverage consistency to resolve perceptual ambiguity, regardless of how strongly the question constrains specific word forms.

Commentary

The successful replication of this effect, despite modifications to the stimulus list (removal of items referencing outdated political figures), demonstrates the robustness of the underlying phenomenon. The main effect of Answer Consistency is substantial, and the absence of an interaction with Question Constraint is evident in both the statistical model and the descriptive visualizations.

The exploratory analysis provided further insight into the nature of the “Implausible” condition. “Don’t know” responses were more frequent when the answer was implausible (about 4% to 7% of trials) than when it was plausible (about 1%), which suggests that without semantic support, the bottom-up acoustic signal (noise-vocoded speech) is sometimes insufficient for any successful lexical retrieval. This is consistent with an “all-or-nothing” view of perception in highly degraded conditions: semantic prediction doesn’t just refine perception; it can enable it entirely.

From a data science perspective, the replication was greatly facilitated by the original authors’ transparent code sharing. However, the reliance on specific cultural knowledge (names of prime ministers) required restricting the participant population to prevent cultural and historical artifacts from confounding the “Plausibility” metric. For example, the prime minister in 2019 was no longer the prime minister in 2025. Future studies on semantic prediction should consider using more timeless and cross-cultural facts.