Replication of Superficial Auditory (Dis)fluency Biases Higher-Level Social Judgment by Walter-Terrill, Ongchoco & Scholl (2025, Proceedings of the National Academy of Sciences)
# define sample and effect sizes for original experiments of interest
pwr_original = tibble(experiment = c("Exp. 1: Hireability",
                                     "Exp. 4: Hireability",
                                     "Exp. 5: Intelligence"),
                      voice_type = c("Human", "Computer", "Computer"),
                      n_per_group = c(300, 600, 600),
                      total_n = n_per_group * 2,
                      effect_size = c(0.35, 0.43, 0.20))
# add row for replication
pwr_replication = pwr_original %>%
  add_row(experiment = "Exp. 1 + 4: Pooled",
          voice_type = "Both",
          n_per_group = 125,
          total_n = n_per_group * 2,
          effect_size = 0.43)
# set significance threshold
alpha = 0.05
1 Introduction
Human interpersonal communication is increasingly mediated by videoconferencing technology. While platforms like Zoom allow users to see themselves, users typically cannot hear how they sound to others. Walter-Terrill et al. (2025) found that participants rated a job candidate more favorably when the audio quality of their application statement was clear rather than distorted but otherwise comprehensible. This finding provides empirical support for the influential communication theory of media ecology, which suggests that characteristics of the medium through which a message is delivered may affect people independently of the content of the message itself (Griffin et al., 2019). Practically, microphone quality represents a relatively simple target for intervention that could increase a family's socioeconomic status if it is indeed critical in hiring decisions. The increasing popularity of podcasts as news sources could also affect how listeners judge the quality of the information presented, regardless of its content.
The target of this replication is Experiment 4, in which 1200 participants from the Prolific platform were randomly assigned to the clear or distorted audio conditions (600 in each group). After listening to the audio statement, participants rated how likely they would be to hire the candidate on a continuous scale from 0 to 100. I will additionally conduct several exploratory analyses with the goal of generating hypotheses to test in future work.
All materials and details needed to conduct this replication can be found in the repository and in the original paper.
2 Methods
2.1 Power Analysis
2.1.1 Original Study
Table 1 displays the results of a post-hoc power analysis for the original study's Experiments 1, 4, and 5, which all used the same application statement, recorded either by a human male speaker (Experiment 1) or generated by computer (Experiments 4 and 5). All three experiments also used the same audio quality manipulation. Experiment 5 was identical to Experiments 1 and 4, except that the outcome variable was the perceived intelligence of the speaker, measured on the same continuous 0-100 scale.
pwr_original %>%
mutate(power = map2_chr(n_per_group,
effect_size,
~{paste0(round(pwr.t.test(
n = .x,
d = .y,
sig.level = alpha,
type = "two.sample",
alternative = "two.sided")
$power * 100, 1), "%")})) %>%
  kable(digits = 2)
| experiment | voice_type | n_per_group | total_n | effect_size | power |
|---|---|---|---|---|---|
| Exp. 1: Hireability | Human | 300 | 600 | 0.35 | 99% |
| Exp. 4: Hireability | Computer | 600 | 1200 | 0.43 | 100% |
| Exp. 5: Intelligence | Computer | 600 | 1200 | 0.20 | 93.3% |
2.1.2 Replication
Table 2 shows the total N needed to achieve 80-95% power to detect the main effect of audio quality on hireability ratings. Because Walter-Terrill et al. (2025) found a significant effect of audio quality on social judgment across voice types (male, female, American, British, human, and artificial) and contexts (hireability, romantic desirability, and credibility), this power analysis pools across voice type.
# calculate total sample size needed for 3 power levels
tibble(analysis = "Audio quality main effect (pooled)",
effect_size = 0.43, # from Exp. 4
n_80 = map_dbl(effect_size, ~ceiling(
pwr.t.test(d = .x,
power = 0.80,
sig.level = alpha,
type = "two.sample")$n)),
n_90 = map_dbl(effect_size, ~ceiling(
pwr.t.test(d = .x,
power = 0.90,
sig.level = alpha,
type = "two.sample")$n)),
n_95 = map_dbl(effect_size, ~ceiling(
pwr.t.test(d = .x,
power = 0.95,
sig.level = alpha,
type = "two.sample")$n)),
`80%` = n_80 * 2,
`90%` = n_90 * 2,
`95%` = n_95 * 2) %>%
select(analysis, effect_size, `80%`, `90%`, `95%`) %>%
kable(col.names = c("Analysis", "Effect Size (d)",
"80% Power", "90% Power", "95% Power"))| Analysis | Effect Size (d) | 80% Power | 90% Power | 95% Power |
|---|---|---|---|---|
| Audio quality main effect (pooled) | 0.43 | 172 | 230 | 284 |
Accordingly, I selected a total N of 250 to yield > 90% power and compared this sample size’s power to detect the effect sizes of the original Experiments 1, 4, and 5. As shown in Table 3, these unpooled experiments would be underpowered (< 80%) at N = 250. Thus, I chose to proceed with a replication of the original study’s Experiments 1 and 4, which were identical except for the voice type.
# compare power across experiments
pwr_replication = pwr_replication %>%
mutate(n_per_group = c(62, 62, 62, 125),
total_n = n_per_group * 2,
power = map2_chr(n_per_group, effect_size, ~{
paste0(round(pwr.t.test(n = .x,
d = .y,
sig.level = alpha,
type = "two.sample")
$power * 100, 1), "%")}))
pwr = pwr_replication$power[4]
# print the table
pwr_replication %>%
  kable()
| experiment | voice_type | n_per_group | total_n | effect_size | power |
|---|---|---|---|---|---|
| Exp. 1: Hireability | Human | 62 | 124 | 0.35 | 48.9% |
| Exp. 4: Hireability | Computer | 62 | 124 | 0.43 | 66.1% |
| Exp. 5: Intelligence | Computer | 62 | 124 | 0.20 | 19.7% |
| Exp. 1 + 4: Pooled | Both | 125 | 250 | 0.43 | 92.3% |
2.2 Planned Sample
Based on the above power analysis, I will recruit 250 participants from Prolific to yield 92.3% power to detect the main effect of audio quality. Participants will be prescreened using the criteria from the original study:
“Prescreening criteria required listeners to have English as a first language, a Prolific approval rating of at least 95%, previous completion of at least 100 Prolific studies, no prior participation in another experiment from this project, and the use of a laptop or desktop computer (but not a phone or tablet).” (Walter-Terrill et al., 2025)
The study will remain active on the Prolific platform until 250 submissions are received. As this study involves random assignment to one of four conditions (see below), this sampling strategy may result in slight differences in n per group, which will be reported.
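To illustrate why the cell sizes may differ slightly, the following sketch simulates independent random assignment of 250 participants to the 2 (voice type) x 2 (audio quality) design and tabulates the resulting cell counts. This is a hypothetical illustration only, not the assignment code used by the experiment software, and the column names are assumed for this example.
# illustrative simulation of independent random assignment (hypothetical)
set.seed(1)
sim_assignment = tibble(id = 1:250,
                        voice_type = sample(c("computer", "human"),
                                            size = 250, replace = TRUE),
                        audio_quality = sample(c("Clear", "Distorted"),
                                               size = 250, replace = TRUE))
# tabulate simulated cell sizes; counts vary around the expected ~62-63 per cell
sim_assignment %>%
  count(voice_type, audio_quality)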
2.3 Materials
This replication will use four of the eight audio stimuli from the original study, which are available in the repo and were also published with the PNAS paper.
2.3.1 Human voice
The human voice recording will be identical to the original. It was prepared by the original authors as follows:
“The Clear audio stimulus (in mp3 format) was a recorded narration by a naive male speaker. The recording (Audio S1) was 27.5 s and consisted of the following spoken text: ‘After 8 y in sales, I am currently seeking a new challenge which will utilize my meticulous attention to detail, and friendly, professional manner. I am an excellent fit for your company and will be an asset to your team as a senior sales manager. As an experienced sales manager with my previous company, my tenacious and proactive approach resulted in numerous important contract wins. Through this experience, I have improved and developed my networking skills, which have proven to be very effective in increasing my number of sales.’” (Walter-Terrill et al., 2025)
2.3.2 Computer voice
The artificial voice recording will be identical to the original. It was created by the original authors using Amazon Polly (https://aws.amazon.com/polly/):
“The audio stimulus (Audios S7 and S8) was generated (as in Experiment 3) using a simulated male American English voice (“Matthew”). The recording was 28.5 s long and consisted of the same text as used in Experiment 1.” (Walter-Terrill et al., 2025)
2.3.3 Audio quality manipulation
The distorted audio stimuli will be identical to the original. They were prepared by the original authors as follows:
“The Distorted audio stimulus (Audio S2) was created by modifying the Clear recording using the open-source VST effect “MDACombo” (42) in the TwistedWave Online Audio Editor (https://twistedwave.com/online), with the following settings: Model: 4x12>; Drive: –55; Bias: 83; Output: 0; Process HPF Frequency: 20%; HPF Resolution: 90%. This resulted in speech that was fully comprehensible, but that had a high-frequency tinny quality like that commonly experienced during videoconferences with a low-quality computer microphone.” (Walter-Terrill et al., 2025)
2.4 Procedure
The experimental procedure will be identical to the original except as noted below:
“All experiments were completed on custom webpages created with software written in a combination of PHP, JavaScript, CSS, and HTML, with the jsPsych libraries (40). Before beginning each experiment, listeners were asked to either wear headphones or move to a quiet environment. To discourage multitasking, listeners completed the experiment with their browser in full-screen mode. During debriefing, listeners reported whether they heard the stimuli using speakers or headphones.” (Walter-Terrill et al., 2025)
Participants will be randomly assigned to one of four conditions using a between-subjects, 2 (voice type: computer or human) x 2 (audio quality: clear or distorted) design.
“Listeners first heard a short audio recording which confirmed that they could hear speech, and which allowed them to adjust the volume to a comfortable level…Listeners were then presented with the following centered written prompts [across a few separate screens, with all written text presented in black “Open Sans” font, scaled to a point size that was 1.5% of the full-screen pixel width of the listener’s display, on a light gray (#D3D3D3) background]: “For this study, we’d like you to imagine that as part of your job, you are tasked with making a hiring decision for a highly competitive position: senior sales manager. You will listen to a few lines from a personal statement from an application for this position. We will then ask you a few questions about your impressions of the personal statement and its author. Click on the ‘Play’ button below to play the audio clip of the personal statement. You will only be able to listen to this once. (You cannot replay it.)” Listeners then heard the audio stimuli after clicking on the relevant button (which then disappeared). After the recording finished playing, listeners were asked “What is the likelihood that you would hire this person?”, and responded by using their computer mouse to position a slider (a light blue circle 35px in diameter) on a continuous scale (depicted by a white bar, 25px high, with a width equal to 75% of the full-screen width) from Very Unlikely to Very Likely (with these terms appearing just below the bar on the far left and right, respectively). (Listeners could adjust the marker’s position as many times as they wished, after which they clicked on a “Continue” button.) Responses were recorded as values ranging from 0 (matching the bar’s far left) to 100 (representing the bar’s far right). (Walter-Terrill et al., 2025)
In addition to the hireability question, participants will respond to the following question from the original study's Experiment 5, which used the same script and audio manipulation: "How intelligent is the author of this personal statement?" The scale will also be continuous, from 0 ("Very Unintelligent") to 100 ("Very Intelligent"). The order of the two questions will be counterbalanced. Participants will answer the following follow-up questions during debriefing:
- Demographics: Age (text box), Gender (select one: Male, Female, Other, Prefer not to say)
- How are you listening to the audio from the experiment? (select one: Headphones, Laptop Speakers, External Speakers, Earbuds, Other)
- In 1-2 sentences, what do you think this experiment was testing? (text box)
- Did you experience any technical difficulties in playing the audio or did anything distract you while you were listening to the audio? If so, please describe. (This will not affect whether you receive credit or compensation.) (text box)
- Using the slider below, on a scale of 1-100 (with 1 being very distracted, and 100 being very focused), how well did you pay attention to the experiment? (This will not affect whether you receive credit or compensation.)
- Have you ever been employed in a position where you made hiring decisions? (Yes or No)
- Using the slider below, on a scale of 1-100, how would you rate the recording quality of the audio clip you just heard? (With 1 being very poor, and 100 being excellent.)
- Using the slider below, on a scale of 1-100, how well were you able to understand the words spoken in the audio clip you heard – setting aside any possible issues with the recording quality? (With 0 being you understood none of the words, 50 being you understood about half of the words, and 100 being you understood all the words.) (No Understanding to Complete Understanding)
- Is there anything else we should know (either about you or how you did the experiment) that may have had an impact on your results? (text box)
Finally, participants will see the following debriefing screen:
Thank you for completing this experiment! I really appreciate your help with this experiment and making it through to the end. Your participation will help us better understand how audio quality influences the judgements we make about other people.
To get credit for this experiment, please click on continue, which will redirect you to the completion page on Prolific.
2.5 Analysis Plan
The primary analyses will be conducted using the following linear model:
\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3(X_{1i} \times X_{2i}) + \epsilon_i \tag{1}\]
where
\(Y_i\) (Outcome) is the hireability rating for participant \(i\),
\(\beta_0\) (Intercept) is the mean outcome when all predictors = 0,
\(\beta_1\) is the main effect of audio quality,
\(X_{1i}\) is audio quality for participant \(i\), dummy coded as 0 = Clear and 1 = Distorted,
\(\beta_2\) is the main effect of voice type,
\(X_{2i}\) is voice type for participant \(i\), dummy coded as 0 = Computer and 1 = Human,
\(\beta_3\) (Interaction) is the interaction effect between audio quality and voice type, and
\(\epsilon_i\) (Error) is the residual variation not explained by the other predictors.
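For concreteness, here is a minimal sketch of how Equation 1 could be specified in R, assuming factor columns named audio_quality and voice_type whose reference levels are Clear and Computer, so that R's default treatment coding reproduces the dummy coding above. This is a sketch of the model form, not the final analysis code.
# minimal sketch of Equation 1 in R model syntax (assumed column names)
# audio_quality * voice_type expands to both main effects plus their interaction
model_sketch = lm(hire ~ audio_quality * voice_type, data = full_sample)
summary(model_sketch)  # coefficient estimates correspond to beta_0 through beta_3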
Confirmatory analysis: To replicate the finding from the original study's Experiment 4, I will test the effect of audio quality on hireability using estimated marginal means (Lenth & Piaskowski, 2025). This approach is equivalent to the "between-samples two-tailed t test" used by Walter-Terrill et al. (2025). Additionally, I will rule out the potential confound of comprehension using an independent-samples t-test.
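Sketched in code, and reusing the object and column names that appear in the Results section (model_hire, full_sample, debrief_understandable), these two planned tests could look roughly like the following.
# sketch of the confirmatory contrast and the comprehension check
emmeans(model_hire, ~ audio_quality) %>%
  pairs(infer = TRUE)           # audio quality contrast on hireability ratings
t.test(debrief_understandable ~ audio_quality,
       data = full_sample)      # Welch t-test for self-reported comprehension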
Exploratory analyses: I will use the interaction term in Equation 1 to test the potential effect of voice type on hireability, which was not directly tested in the original study. A non-significant interaction will support the generalizability of the audio quality effect and, thus, the pooling strategy.
I will also examine the effect of audio quality on perceived intelligence using the same form of Equation 1, except the outcome variable \(Y_i\) will be the intelligence rating. This test will attempt to replicate the original study’s Experiment 5, although this result should be interpreted with caution due to low power (d = 0.20).
Finally, I will report the correlation between hireability and intelligence ratings.
All analyses will be conducted in R (R Core Team, 2025).
2.6 Differences from Original Study
The main difference from the original study will be the pooling of human and computer voice types. The original Experiment 1 included a transcription task to assess participants' understanding of the recorded statement. Due to resource constraints, this replication will instead rely on participants' self-reported level of understanding. The instructional sample audio in the original experiments also applied the audio manipulation (clear or distorted) to match the experimental condition. For simplicity, this replication uses the same clear sample audio for all conditions. These differences are not expected to influence the results or interpretations of this replication.
2.7 Methods Addendum (Post Data Collection)
2.7.1 Actual Sample
# describe sample characteristics
full_sample <- read_csv(here("data", "full_sample.csv")) %>%
filter(data_summary == TRUE)
mean_age <- mean(full_sample$debrief_age)
sd_age <- sd(full_sample$debrief_age)
percent_female <- full_sample %>%
group_by(debrief_gender) %>%
summarize(n = n()) %>%
filter(debrief_gender == "female") %>%
  pull()/nrow(full_sample)*100
A total of 248 participants (\(M_{age}\) = 45.97, \(\text{SD}_{age}\) = 12.77, 52% female) completed the study on Prolific. Table 4 shows the number of participants randomly assigned to each condition.
full_sample %>%
group_by(voice_type, distorted_audio) %>%
summarize(n = n(), .groups = "drop") %>%
pivot_wider(names_from = distorted_audio, values_from = n) %>%
adorn_totals(c("row", "col")) %>%
kable(col.names = c("Voice Type", "Clear",
"Distorted", "Total"))| Voice Type | Clear | Distorted | Total |
|---|---|---|---|
| computer | 60 | 56 | 116 |
| human | 59 | 73 | 132 |
| Total | 119 | 129 | 248 |
2.7.2 Differences from pre-data collection methods plan
Data collection followed the preregistered methods. Data from two participants were missing from Prolific, resulting in 248 completed submissions rather than the planned 250.
3 Results
3.1 Data preparation
This code anonymizes the data file and removes participants from the analysis according to the exclusion criteria used in the original study: self-reported attention < 70/100, response times more than 2.5 SD from the mean, and comprehension scores more than 2 SD from the mean. Data from participants who self-report technical difficulties or distractions will be excluded only if I deem these challenges severe enough to interfere with their ability to complete the task.
#### Import data
full_sample <- read_csv(here("data", "full_sample.csv")) %>%
filter(data_summary == TRUE) %>%
# remove identifying information
select(-rt, -subject, -prolific_study_id,
-prolific_session_id, -participant_id) %>%
# create sequential subject ids
mutate(id = seq(1:nrow(.))) %>%
# create anonymized data file
  write_csv(., "../data/full_sample-anon.csv")
#### Data exclusion / filtering
# define function to calculate distance from the mean
remove_outliers <- function(x, threshold = 2.5) {
mean_x <- mean(x, na.rm = TRUE)
sd_x <- sd(x, na.rm = TRUE)
abs(x - mean_x) <= threshold * sd_x
}
# define original sample size
n_original <- nrow(full_sample)
# attention check
n_before <- nrow(full_sample)
full_sample <- full_sample %>%
filter(debrief_attention >= 70)
n_attention <- n_before - nrow(full_sample)
# hire response time
n_before <- nrow(full_sample)
full_sample <- full_sample %>%
filter(remove_outliers(hire_rt, threshold = 2.5))
n_hire_rt <- n_before - nrow(full_sample)
# intelligence response time
n_before <- nrow(full_sample)
full_sample <- full_sample %>%
filter(remove_outliers(intelligence_rt, threshold = 2.5))
n_intelligence_rt <- n_before - nrow(full_sample)
# comprehension check
n_before <- nrow(full_sample)
full_sample <- full_sample %>%
filter(remove_outliers(debrief_understandable, threshold = 2))
n_understandable <- n_before - nrow(full_sample)
# total exclusions
n_excluded <- n_original - nrow(full_sample)
A total of 31 participants were excluded from the analysis for the following reasons (defined above) in the following order: low attention (n = 3), hireability rating response time (n = 6), intelligence rating response time (n = 4), and comprehension issues (n = 18). No participants were excluded for technical difficulties or distractions.
#### Prepare data for analysis - create columns etc.
full_sample = full_sample %>%
select(-c(key_press, stimulus, trial_type, trial_index)) %>%
mutate(distorted_audio = as.factor(distorted_audio),
audio_quality = fct_recode(distorted_audio,
"Clear" = "FALSE",
"Distorted" = "TRUE"))3.2 Confirmatory analysis
The figure below shows the results from the original experiments by Walter-Terrill et al. (2025). The figures and analysis from my replication follow.
3.2.1 No effect of audio quality on hireability
# plot main effect of audio quality on hireability
ggplot(data = full_sample,
mapping = aes(x = audio_quality,
y = hire,
color = audio_quality)) +
geom_beeswarm(cex = 2.5,
size = 1,
alpha = 0.5) +
stat_summary(fun = mean,
geom = "crossbar",
width = 0.5,
color = "black") +
scale_color_brewer(palette = "Set1") +
labs(title = "Hireability",
x = "Simulated Audio Quality",
y = "Rating") +
scale_y_continuous(breaks = seq(from = 0,
to = 100,
by = 10)) +
theme_classic() +
theme(text = element_text(size = 13)) +
theme(plot.title = element_text(hjust = 0.5)) +
  theme(legend.position = "none")
# linear model for hireability
model_hire = lm(hire ~ 1 + audio_quality + voice_type +
audio_quality*voice_type, data = full_sample)
# get descriptive statistics
full_sample_descriptives = full_sample %>%
group_by(audio_quality) %>%
summarise(
n = n(),
M_hire = mean(hire, na.rm = TRUE),
SD_hire = sd(hire, na.rm = TRUE))
# save means and standard deviations
mean_hire_clear <- full_sample_descriptives %>%
filter(audio_quality == "Clear") %>%
select(M_hire) %>%
pull()
sd_hire_clear <- full_sample_descriptives %>%
filter(audio_quality == "Clear") %>%
select(SD_hire) %>%
pull()
mean_hire_distorted <- full_sample_descriptives %>%
filter(audio_quality == "Distorted") %>%
select(M_hire) %>%
pull()
sd_hire_distorted <- full_sample_descriptives %>%
filter(audio_quality == "Distorted") %>%
select(SD_hire) %>%
pull()
# main effect of audio quality on hireability
emm_hire <- pairs(emmeans(model_hire, ~ audio_quality), infer = TRUE)
# save inferential statistic values
df_hire <- summary(emm_hire)$df
t_hire <- round(summary(emm_hire)$t.ratio, digits = 2)
ci_lower_hire <- round(summary(emm_hire)$lower.CL, digits = 2)
ci_upper_hire <- round(summary(emm_hire)$upper.CL, digits = 2)
p_hire <- round(summary(emm_hire)$p.value, digits = 3)
# effect size
emm_hire_means <- emmeans(model_hire, ~ audio_quality)
eff_size_hire <- eff_size(emm_hire_means, sigma = sigma(model_hire), edf = df.residual(model_hire))
eff_size_hire <- round(summary(eff_size_hire)$effect.size, digits = 2)
There was no significant difference in hireability ratings between participants who heard the distorted (M = 66.53, SD = 23.82) vs. clear (M = 68.44, SD = 24.01) recording (t(213) = 1.03, 95% CI [-2.95, 9.39], p = 0.305, d = 0.14).
# comprehension check
comprehension_check = t.test(debrief_understandable ~ audio_quality,
data = full_sample,
var.equal = FALSE)
format_pvalue <- function(p) {
case_when(
p < 0.001 ~ "< .001",
p < 0.01 ~ "< .01",
p < 0.05 ~ "< .05",
p >= 0.05 ~ as.character(round(p, digits = 3))
)
}
# save inferential statistic values
df_understandable <- round(comprehension_check$parameter, digits = 2)
t_understandable <- round(comprehension_check$statistic, digits = 2)
ci_lower_understandable <- round(comprehension_check$conf.int[1],
digits = 2)
ci_upper_understandable <- round(comprehension_check$conf.int[2],
digits = 2)
p_understandable <- format_pvalue(comprehension_check$p.value)
eff_size_understandable <- round(cohens_d(debrief_understandable
~ audio_quality,
data = full_sample)$Cohens_d,
digits = 2)
# get descriptive statistics for comprehension check
full_sample_descriptives = full_sample_descriptives %>%
bind_cols(full_sample %>%
group_by(audio_quality) %>%
summarise(M_understand = mean(debrief_understandable,
na.rm = TRUE),
SD_understand = sd(debrief_understandable,
na.rm = TRUE)) %>%
select(-audio_quality))
# save values
mean_understand_distorted <- full_sample_descriptives %>%
filter(audio_quality == "Distorted") %>%
select(M_understand) %>%
pull() %>%
round(digits = 2)
sd_understand_distorted <- full_sample_descriptives %>%
filter(audio_quality == "Distorted") %>%
select(SD_understand) %>%
pull() %>%
round(digits = 2)
mean_understand_clear <- full_sample_descriptives %>%
filter(audio_quality == "Clear") %>%
select(M_understand) %>%
pull() %>%
round(digits = 2)
sd_understand_clear <- full_sample_descriptives %>%
filter(audio_quality == "Clear") %>%
select(SD_understand) %>%
pull() %>%
  round(digits = 2)
However, participants reported significantly lower comprehension of the statement when the audio was distorted (M = 92.06, SD = 10.36) vs. clear (M = 97.17, SD = 5.21) (t(155.51) = 4.58, 95% CI [2.91, 7.33], p < .001, d = 0.63).
3.3 Exploratory analyses
3.3.1 Null effect of audio quality holds across voice type
# plot interaction voice type x audio quality
ggplot(data = full_sample,
mapping = aes(x = voice_type,
y = hire,
color = audio_quality,
group = audio_quality)) +
geom_jitter(width = 0.1,
alpha = 0.1) +
stat_summary(fun.data = mean_cl_boot,
geom = "pointrange",
size = 0.5) +
stat_summary(fun = mean,
geom = "line",
linewidth = 1) +
scale_color_brewer(palette = "Set1") +
scale_y_continuous(breaks = seq(from = 0,
to = 100,
by = 10)) +
labs(title = "Audio Quality x Voice Type",
x = "Voice Type",
y = "Hireability Rating") +
theme_classic() +
theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))
anova(model_hire)
Analysis of Variance Table
Response: hire
Df Sum Sq Mean Sq F value Pr(>F)
audio_quality 1 197 196.6 0.3777 0.5395
voice_type 1 12125 12125.3 23.3005 2.642e-06 ***
audio_quality:voice_type 1 42 41.8 0.0803 0.7772
Residuals 213 110843 520.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eta_squared(model_hire, partial = TRUE)
# Effect Size for ANOVA (Type I)
Parameter | Eta2 (partial) | 95% CI
--------------------------------------------------------
audio_quality | 1.77e-03 | [0.00, 1.00]
voice_type | 0.10 | [0.04, 1.00]
audio_quality:voice_type | 3.77e-04 | [0.00, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
The interaction between audio quality and voice type was not significant; the null effect of audio quality held across both voice types (F(1, 213) = 0.08, p = 0.78, η²p < .001).
Human voices (M = 74.1, SD = 16.6) were rated as significantly more hireable than computer voices (M = 59.2, SD = 28.7) (t(213) = -3.92, p < .001, d = 0.66).
# plot main effect of audio quality on intelligence (Exp. 5)
ggplot(data = full_sample,
mapping = aes(x = audio_quality,
y = intelligence,
color = audio_quality)) +
geom_beeswarm(cex = 3,
size = 1,
alpha = 0.5) +
stat_summary(fun = mean,
geom = "crossbar",
width = 0.5,
color = "black") +
scale_color_brewer(palette = "Set1") +
labs(title = "Intelligence",
x = "Simulated Audio Quality",
y = "Rating") +
scale_y_continuous(breaks = seq(from = 0,
to = 100,
by = 10)) +
theme_classic() +
theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))
# build linear model for intelligence
model_intel = lm(intelligence ~ 1 + audio_quality + voice_type + audio_quality*voice_type,
data = full_sample)
summary(model_intel)
Call:
lm(formula = intelligence ~ 1 + audio_quality + voice_type +
audio_quality * voice_type, data = full_sample)
Residuals:
Min 1Q Median 3Q Max
-72.717 -7.263 2.279 9.891 28.279
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 72.7170 2.1627 33.623 < 2e-16 ***
audio_qualityDistorted -0.9961 3.2315 -0.308 0.75821
voice_typehuman 8.5462 3.0044 2.845 0.00488 **
audio_qualityDistorted:voice_typehuman -0.1577 4.3203 -0.037 0.97091
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.74 on 213 degrees of freedom
Multiple R-squared: 0.06754, Adjusted R-squared: 0.05441
F-statistic: 5.143 on 3 and 213 DF, p-value: 0.001878
# main effect of audio quality on intelligence
emm_intel <- emmeans(model_intel, ~ audio_quality)
NOTE: Results may be misleading due to involvement in interactions
pairs(emm_intel)
 contrast estimate SE df t.ratio p.value
 Clear - Distorted 1.07 2.16 213 0.498 0.6193
Results are averaged over the levels of: voice_type
eff_size(emm_intel, sigma = sigma(model_intel), edf = df.residual(model_intel))
 contrast effect.size SE df lower.CL upper.CL
Clear - Distorted 0.0683 0.137 213 -0.202 0.339
Results are averaged over the levels of: voice_type
sigma used for effect sizes: 15.74
Confidence level used: 0.95
# are hire and intelligence ratings correlated?
cor_overall <- cor.test(full_sample$hire, full_sample$intelligence,
method = "pearson")
r_value <- cor_overall$estimate
p_value <- cor_overall$p.value
ci_lower <- cor_overall$conf.int[1]
ci_upper <- cor_overall$conf.int[2]
ggplot(full_sample, aes(x = hire, y = intelligence)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm", alpha = 1, fill = "lightblue") +
labs(
title = "Hireability-Intelligence Correlation",
subtitle = sprintf("r = %.2f, p = %.3f", r_value, p_value),
x = "Hireability Rating",
y = "Intelligence Rating"
) +
scale_y_continuous(breaks = seq(from = 0,
to = 100,
by = 10)) +
scale_x_continuous(breaks = seq(from = 0,
to = 100,
by = 10)) +
scale_color_brewer(palette = "Set1") +
theme_classic() +
theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))
`geom_smooth()` using formula = 'y ~ x'
4 Discussion
4.1 Summary of Replication Attempt
To be completed post data collection.
4.2 Commentary
To be completed post data collection.
4.3 COI Disclosure
In the interest of transparency, I disclose that I occasionally work as an independent contractor for the National Academies of Sciences, the publisher of PNAS, where the original paper was published. Last year I was assigned to write a brief summary of this paper for the journal. However, I had no role in selecting this paper to cover nor relationships to the authors. My work for NAS/PNAS has no bearing on the journal’s editorial decisions as research summaries are assigned only after acceptance.
5 References
6 Session Info
This document was written and compiled in RStudio.
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS 26.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Los_Angeles
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] lubridate_1.9.4 forcats_1.0.0 stringr_1.6.0 dplyr_1.1.4
[5] purrr_1.0.4 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[9] tidyverse_2.0.0 janitor_2.2.1 emmeans_1.11.1 effectsize_1.0.0
[13] ggbeeswarm_0.7.2 ggplot2_3.5.2 here_1.0.2 knitr_1.50
[17] pwr_1.3-0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 beeswarm_0.4.0 xfun_0.52 bayestestR_0.15.3
[5] htmlwidgets_1.6.4 insight_1.2.0 lattice_0.22-7 tzdb_0.5.0
[9] vctrs_0.6.5 tools_4.4.2 generics_0.1.4 datawizard_1.0.2
[13] parallel_4.4.2 cluster_2.1.8.1 pkgconfig_2.0.3 Matrix_1.7-3
[17] data.table_1.17.0 checkmate_2.3.2 RColorBrewer_1.1-3 lifecycle_1.0.4
[21] compiler_4.4.2 farver_2.1.2 snakecase_0.11.1 vipor_0.4.7
[25] htmltools_0.5.8.1 yaml_2.3.10 htmlTable_2.4.3 Formula_1.2-5
[29] pillar_1.10.2 crayon_1.5.3 Hmisc_5.2-3 rpart_4.1.24
[33] nlme_3.1-168 tidyselect_1.2.1 digest_0.6.39 mvtnorm_1.3-3
[37] stringi_1.8.7 splines_4.4.2 rprojroot_2.1.1 fastmap_1.2.0
[41] grid_4.4.2 colorspace_2.1-1 cli_3.6.5 magrittr_2.0.4
[45] base64enc_0.1-3 foreign_0.8-90 withr_3.0.2 backports_1.5.0
[49] scales_1.4.0 bit64_4.6.0-1 timechange_0.3.0 estimability_1.5.1
[53] rmarkdown_2.29 bit_4.6.0 nnet_7.3-20 gridExtra_2.3
[57] hms_1.1.3 coda_0.19-4.1 evaluate_1.0.5 parameters_0.25.0
[61] mgcv_1.9-3 rlang_1.1.6 xtable_1.8-4 glue_1.8.0
[65] rstudioapi_0.17.1 vroom_1.6.5 jsonlite_2.0.0 R6_2.6.1