Replication of “Superficial Auditory (Dis)fluency Biases Higher-Level Social Judgment” by Walter-Terrill, Ongchoco & Scholl (2025, Proceedings of the National Academy of Sciences)

Author

David Barnstone (dbarnsto@stanford.edu)

Published

December 2, 2025

1 Introduction

Human interpersonal communication is increasingly mediated by videoconferencing technology. While platforms like Zoom allow users to see themselves, users typically cannot hear how they sound to others. Walter-Terrill et al. (2025) found that participants rated a job candidate more favorably when the audio quality of the candidate’s application statement was clear rather than distorted, even though the distorted audio remained fully comprehensible. This finding provides empirical support for media ecology, an influential communication theory which suggests that characteristics of the medium through which a message is delivered may affect people independently of the content of the message itself. Practically, microphone quality represents a relatively simple target for intervention: if it is indeed critical in hiring decisions, improving it could meaningfully affect a family’s socioeconomic prospects. The increasing popularity of podcasts as news sources also raises the possibility that audio quality shapes how listeners judge the information presented, regardless of its content.

The target of this replication is Experiment 4, in which 1200 participants from the Prolific platform were randomly assigned to the clear or distorted audio conditions (600 in each group). After listening to the audio statement, participants rated how likely they would be to hire the candidate on a continuous scale from 0 to 100. I will additionally conduct several exploratory analyses with the goal of generating hypotheses to test in future work.

All materials and details needed to conduct this replication can be found in the repository and in the original paper.

2 Methods

2.1 Power Analysis

# load packages used throughout this report
# (assumed to be loaded in a setup chunk of the rendered document)
library(tidyverse)   # dplyr, purrr, ggplot2, forcats, readr
library(pwr)         # power analysis (pwr.t.test)
library(knitr)       # kable() tables
library(here)        # project-relative file paths
library(emmeans)     # estimated marginal means and eff_size()
library(effectsize)  # cohens_d() and eta_squared()
library(ggbeeswarm)  # geom_beeswarm()

# define sample and effect sizes for original experiments of interest
pwr_original = tibble(experiment = c("Exp. 1: Hireability",
                                     "Exp. 4: Hireability",
                                     "Exp. 5: Intelligence"),
                      voice_type = c("Human", "Computer", "Computer"),
                      n_per_group = c(300, 600, 600),
                      total_n = n_per_group * 2,
                      effect_size = c(0.35, 0.43, 0.20))

# add row for replication
pwr_replication = pwr_original %>% 
  add_row(experiment = "Exp. 1 + 4: Pooled",
          voice_type = "Both",
          n_per_group = 125,
          total_n = n_per_group * 2,
          effect_size = 0.43)

# set significance threshold
alpha = 0.05

2.1.1 Original Study

Table 1 displays the results of a post-hoc power analysis for the original study’s Experiments 1, 4, and 5, all of which used the same application statement, recorded either by a human male speaker (Experiment 1) or by a computer-generated voice (Experiments 4 and 5). All three experiments also used the same audio quality manipulation. Experiment 5 was identical to Experiments 1 and 4, except that the outcome variable was the perceived intelligence of the speaker, measured on the same continuous 0-100 scale.

pwr_original %>%
  mutate(power = map2_chr(n_per_group,
                          effect_size,
                          ~{paste0(round(pwr.t.test(
                            n = .x,
                            d = .y,
                            sig.level = alpha,
                            type = "two.sample",
                            alternative = "two.sided")
                            $power * 100, 1), "%")})) %>%
  kable(digits = 2)
Table 1: Original Study Power Analysis (post hoc)
| experiment | voice_type | n_per_group | total_n | effect_size | power |
|----------------------|------------|-------------|---------|-------------|-------|
| Exp. 1: Hireability | Human | 300 | 600 | 0.35 | 99% |
| Exp. 4: Hireability | Computer | 600 | 1200 | 0.43 | 100% |
| Exp. 5: Intelligence | Computer | 600 | 1200 | 0.20 | 93.3% |

2.1.2 Replication

Table 2 shows the total N needed to achieve 80%, 90%, and 95% power to detect the main effect of audio quality on hireability ratings. Because Walter-Terrill et al. (2025) found a significant effect of audio quality on social judgment across voice types (male, female, American, British, human, and artificial) and contexts (hireability, romantic desirability, and credibility), this power analysis pools across voice type.

# calculate total sample size needed for 3 power levels
tibble(analysis = "Audio quality main effect (pooled)",
       effect_size = 0.43, # from Exp. 4
       n_80 = map_dbl(effect_size, ~ceiling(
         pwr.t.test(d = .x,
                    power = 0.80,
                    sig.level = alpha,
                    type = "two.sample")$n)),
       n_90 = map_dbl(effect_size, ~ceiling(
         pwr.t.test(d = .x,
                    power = 0.90,
                    sig.level = alpha,
                    type = "two.sample")$n)),
       n_95 = map_dbl(effect_size, ~ceiling(
         pwr.t.test(d = .x,
                    power = 0.95,
                    sig.level = alpha,
                    type = "two.sample")$n)),
       `80%` = n_80 * 2,
       `90%` = n_90 * 2,
       `95%` = n_95 * 2) %>%
  select(analysis, effect_size, `80%`, `90%`, `95%`) %>%
  kable(col.names = c("Analysis", "Effect Size (d)",
                      "80% Power", "90% Power", "95% Power"))
Table 2: Replication Power Analysis (a priori)
| Analysis | Effect Size (d) | 80% Power | 90% Power | 95% Power |
|------------------------------------|------|-----|-----|-----|
| Audio quality main effect (pooled) | 0.43 | 172 | 230 | 284 |

Accordingly, I selected a total N of 250 to yield > 90% power, and compared the power of this sample to detect the effect sizes of the original Experiments 1, 4, and 5. As shown in Table 3, these unpooled experiments would be underpowered (< 80%) at N = 250. Thus, I chose to proceed with a replication of the original study’s Experiments 1 and 4, which were identical except for voice type.

# compare power across experiments
pwr_replication = pwr_replication %>% 
  mutate(n_per_group = c(62, 62, 62, 125),
         total_n = n_per_group * 2,
         power = map2_chr(n_per_group, effect_size, ~{
           paste0(round(pwr.t.test(n = .x,
                                   d = .y,
                                   sig.level = alpha,
                                   type = "two.sample")
                        $power * 100, 1), "%")}))

pwr = pwr_replication$power[4]

# print the table
pwr_replication %>%
  kable()
Table 3: Replication Power Comparison (a priori)
| experiment | voice_type | n_per_group | total_n | effect_size | power |
|----------------------|------------|-------------|---------|-------------|-------|
| Exp. 1: Hireability | Human | 62 | 124 | 0.35 | 48.9% |
| Exp. 4: Hireability | Computer | 62 | 124 | 0.43 | 66.1% |
| Exp. 5: Intelligence | Computer | 62 | 124 | 0.20 | 19.7% |
| Exp. 1 + 4: Pooled | Both | 125 | 250 | 0.43 | 92.3% |

2.2 Planned Sample

Based on the above power analysis, I will recruit 250 participants from Prolific to yield 92.3% power to detect the main effect of audio quality. Participants will be prescreened using the criteria from the original study:

“Prescreening criteria required listeners to have English as a first language, a Prolific approval rating of at least 95%, previous completion of at least 100 Prolific studies, no prior participation in another experiment from this project, and the use of a laptop or desktop computer (but not a phone or tablet).” (Walter-Terrill et al., 2025)

The study will remain active on the Prolific platform until 250 submissions are received. Because this study involves random assignment to one of four conditions (see below), this sampling strategy may result in slight differences in n per group, which will be reported.
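Once data collection is complete, the realized group sizes can be tabulated directly. A minimal sketch, assuming a hypothetical data frame `replication_data` with the voice_type and audio_quality columns used in the analyses below:

# report realized n per condition
# (`replication_data` is a hypothetical stand-in for the full dataset)
replication_data %>%
  count(voice_type, audio_quality, name = "n") %>%
  kable()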

2.3 Materials

This replication will use four of the eight audio stimuli from the original study, which are available in the repository and were also published with the original PNAS paper.

2.3.1 Human voice

The human voice recording will be identical to the original. It was prepared by the original authors as follows:

“The Clear audio stimulus (in mp3 format) was a recorded narration by a naive male speaker. The recording (Audio S1) was 27.5 s and consisted of the following spoken text: ‘After 8 y in sales, I am currently seeking a new challenge which will utilize my meticulous attention to detail, and friendly, professional manner. I am an excellent fit for your company and will be an asset to your team as a senior sales manager. As an experienced sales manager with my previous company, my tenacious and proactive approach resulted in numerous important contract wins. Through this experience, I have improved and developed my networking skills, which have proven to be very effective in increasing my number of sales.’” (Walter-Terrill et al., 2025)

2.3.2 Computer voice

The artificial voice recording will be identical to the original. It was created by the original authors using Amazon Polly (https://aws.amazon.com/polly/):

“The audio stimulus (Audios S7 and S8) was generated (as in Experiment 3) using a simulated male American English voice (“Matthew”). The recording was 28.5 s long and consisted of the same text as used in Experiment 1.” (Walter-Terrill et al., 2025)

2.3.3 Audio quality manipulation

The distorted audio stimuli will be identical to the original. They were prepared by the original authors as follows:

“The Distorted audio stimulus (Audio S2) was created by modifying the Clear recording using the open-source VST effect “MDACombo” (42) in the TwistedWave Online Audio Editor (https://twistedwave.com/online), with the following settings: Model: 4x12>; Drive: –55; Bias: 83; Output: 0; Process HPF Frequency: 20%; HPF Resolution: 90%. This resulted in speech that was fully comprehensible, but that had a high-frequency tinny quality like that commonly experienced during videoconferences with a low-quality computer microphone.” (Walter-Terrill et al., 2025)

2.4 Procedure

The experimental procedure will be identical to the original except as noted below:

“All experiments were completed on custom webpages created with software written in a combination of PHP, JavaScript, CSS, and HTML, with the jsPsych libraries (40). Before beginning each experiment, listeners were asked to either wear headphones or move to a quiet environment. To discourage multitasking, listeners completed the experiment with their browser in full-screen mode. During debriefing, listeners reported whether they heard the stimuli using speakers or headphones.” (Walter-Terrill et al., 2025)

Participants will be randomly assigned to one of four conditions in a between-subjects 2 (voice type: computer or human) × 2 (audio quality: clear or distorted) design.

“Listeners first heard a short audio recording which confirmed that they could hear speech, and which allowed them to adjust the volume to a comfortable level…Listeners were then presented with the following centered written prompts [across a few separate screens, with all written text presented in black “Open Sans” font, scaled to a point size that was 1.5% of the full-screen pixel width of the listener’s display, on a light gray (#D3D3D3) background]: “For this study, we’d like you to imagine that as part of your job, you are tasked with making a hiring decision for a highly competitive position: senior sales manager. You will listen to a few lines from a personal statement from an application for this position. We will then ask you a few questions about your impressions of the personal statement and its author. Click on the ‘Play’ button below to play the audio clip of the personal statement. You will only be able to listen to this once. (You cannot replay it.)” Listeners then heard the audio stimuli after clicking on the relevant button (which then disappeared). After the recording finished playing, listeners were asked “What is the likelihood that you would hire this person?”, and responded by using their computer mouse to position a slider (a light blue circle 35px in diameter) on a continuous scale (depicted by a white bar, 25px high, with a width equal to 75% of the full-screen width) from Very Unlikely to Very Likely (with these terms appearing just below the bar on the far left and right, respectively). (Listeners could adjust the marker’s position as many times as they wished, after which they clicked on a “Continue” button.) Responses were recorded as values ranging from 0 (matching the bar’s far left) to 100 (representing the bar’s far right).” (Walter-Terrill et al., 2025)

In addition to the hireability question, participants will respond to the following question from the original study’s Experiment 5, which used the same script and audio manipulation: “How intelligent is the author of this personal statement?” The scale will also be continuous, from 0 (“Very Unintelligent”) to 100 (“Very Intelligent”). The order of the two questions will be counterbalanced. Participants will answer the following follow-up questions during debriefing:

  1. Demographics: Age (text box), Gender (select one: Male, Female, Other, Prefer not to say)
  2. How are you listening to the audio from the experiment? (select one: Headphones, Laptop Speakers, External Speakers, Earbuds, Other)
  3. In 1-2 sentences, what do you think this experiment was testing? (text box)
  4. Did you experience any technical difficulties in playing the audio, or did anything distract you while you were listening to the audio? If so, please describe. (This will not affect whether you receive credit or compensation.) (text box)
  5. Using the slider below, on a scale of 1-100 (with 1 being very distracted, and 100 being very focused), how well did you pay attention to the experiment? (This will not affect whether you receive credit or compensation.)
  6. Have you ever been employed in a position where you made hiring decisions? (Yes or No)
  7. Using the slider below, on a scale of 1-100, how would you rate the recording quality of the audio clip you just heard? (With 1 being very poor, and 100 being excellent.)
  8. Using the slider below, on a scale of 0-100, how well were you able to understand the words spoken in the audio clip you heard – setting aside any possible issues with the recording quality? (With 0 being you understood none of the words, 50 being you understood about half of the words, and 100 being you understood all the words.) (No Understanding to Complete Understanding)
  9. Is there anything else we should know (either about you or how you did the experiment) that may have had an impact on your results? (text box)

Finally, participants will see the following debriefing screen:

Thank you for completing this experiment! I really appreciate your help and your making it through to the end. Your participation will help us better understand how audio quality influences the judgments we make about other people.

To get credit for this experiment, please click on continue, which will redirect you to the completion page on Prolific.

2.5 Analysis Plan

The primary analyses will be conducted using the following linear model:

\[Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3(X_{1i} \times X_{2i}) + \epsilon_i \tag{1}\]

where

\(Y_i\) (Outcome) is the hireability rating for participant \(i\),
\(\beta_0\) (Intercept) is the mean outcome when all predictors = 0,
\(\beta_1\) is the main effect of audio quality,
\(X_{1i}\) is audio quality for participant \(i\), dummy coded as 0 = Clear and 1 = Distorted,
\(\beta_2\) is the main effect of voice type,
\(X_{2i}\) is voice type for participant \(i\), dummy coded as 0 = Computer and 1 = Human,
\(\beta_3\) (Interaction) is the interaction effect between audio quality and voice type, and
\(\epsilon_i\) (Error) is the residual variation not explained by the other predictors.
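As a sketch, Equation 1 maps onto R’s model syntax as follows, assuming a hypothetical data frame `dat` containing the hire, audio_quality, and voice_type columns used in the Results section, with factor reference levels set to match the dummy coding above:

# set reference levels so that Clear = 0 and computer = 0, as in Equation 1
# (`dat` is a hypothetical stand-in for the cleaned dataset)
dat <- dat %>%
  mutate(audio_quality = relevel(factor(audio_quality), ref = "Clear"),
         voice_type = relevel(factor(voice_type), ref = "computer"))

# audio_quality * voice_type expands to both main effects plus their interaction
model_hire <- lm(hire ~ audio_quality * voice_type, data = dat)
summary(model_hire)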

Confirmatory analysis: To replicate the finding from the original study’s Experiment 4, I will test the effect of audio quality on hireability using estimated marginal means. This approach is equivalent to the “between-samples two-tailed t test” used by Walter-Terrill et al. (2025). Additionally, I will rule out the potential confound of comprehension using an independent-samples (Welch) t-test.
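A minimal sketch of these two tests, reusing the hypothetical model_hire and dat from above:

# estimated marginal means for audio quality, averaged over voice type
emm <- emmeans(model_hire, ~ audio_quality)
pairs(emm)  # Clear vs. Distorted contrast

# comprehension check: Welch t-test on self-reported understanding
t.test(debrief_understandable ~ audio_quality, data = dat, var.equal = FALSE)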

Exploratory analyses: I will use the interaction term in Equation 1 to test the potential moderating effect of voice type on hireability, which was not directly tested in the original study. A non-significant interaction would be consistent with the generalizability of the audio quality effect and, thus, would support the pooling strategy.

I will also examine the effect of audio quality on perceived intelligence using the same form of Equation 1, except that the outcome variable \(Y_i\) will be the intelligence rating. This test will attempt to replicate the original study’s Experiment 5, although the result should be interpreted with caution because the planned sample is underpowered for the smaller original effect size (d = 0.20).
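Concretely, this is a one-line change to the sketch above (again using the hypothetical dat):

# same specification as Equation 1, with intelligence ratings as the outcome
model_intel <- lm(intelligence ~ audio_quality * voice_type, data = dat)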

Finally, I will report the correlation between hireability and intelligence ratings.

All analyses will be conducted in R.

2.6 Differences from Original Study

The main difference from the original study will be the pooling of human and computer voice types. The original Experiment 1 included a transcription task to assess participants’ understanding of the recorded statement; due to resource constraints, this replication will instead rely on participants’ self-reported level of understanding. The instructional sample audio in the original experiments also applied the audio manipulation (clear or distorted) to match the experimental condition; for simplicity, this replication uses the same clear sample audio in all conditions. These differences are not expected to influence the results or their interpretation.

2.7 Methods Addendum (Post Data Collection)

2.7.1 Actual Sample

To be completed post data collection.

2.7.2 Differences from pre-data collection methods plan

To be completed post data collection.

3 Results

3.1 Data preparation

This code anonymizes the data file and removes participants from the analysis according to the exclusion criteria used in the original study: self-reported attention < 70/100, response times more than 2.5 SD from the mean, and comprehension scores more than 2 SD from the mean. Data from participants who self-report technical difficulties or distractions will be excluded only if I deem these challenges severe enough to interfere with their ability to complete the task.

#### Import data
pilot_B <- read_csv(here("data", "pilot_B.csv")) %>%
  filter(data_summary == TRUE) %>% 
  # remove identifying information
  select(-rt, -subject, -prolific_study_id,
         -prolific_session_id, -participant_id) %>%
  # create sequential subject ids
  mutate(id = row_number()) %>% 
  # write the anonymized data file (here() keeps the path project-relative)
  write_csv(here("data", "pilot_B-anon.csv"))
Rows: 16 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (17): rt, stimulus, trial_type, internal_node_id, subject, prolific_stud...
dbl (11): trial_index, time_elapsed, condition, hire, hire_rt, intelligence,...
lgl  (3): key_press, distorted_audio, data_summary

#### Data exclusion / filtering

# define function to calculate distance from the mean
remove_outliers <- function(x, threshold = 2.5) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  abs(x - mean_x) <= threshold * sd_x
}

# exclusion
pilot_B = pilot_B %>% 
  filter(debrief_attention >= 70) %>% 
  filter(remove_outliers(hire_rt, threshold = 2.5)) %>% 
  filter(remove_outliers(intelligence_rt, threshold = 2.5)) %>% 
  filter(remove_outliers(debrief_understandable, threshold = 2))

#### Prepare data for analysis - create columns etc.
pilot_B = pilot_B %>%
  select(-c(key_press, stimulus, trial_type, trial_index)) %>%
  mutate(distorted_audio = as.factor(distorted_audio),
         audio_quality = fct_recode(distorted_audio,
                                    "Clear" = "FALSE",
                                    "Distorted" = "TRUE"))

3.2 Confirmatory analysis

The figure below shows the results from the original experiments by Walter-Terrill et al. (2025). The figures and analysis from my replication follow.

Figure from the original study to reproduce (in part)
# plot main effect of audio quality on hireability
ggplot(data = pilot_B,
       mapping = aes(x = audio_quality,
                     y = hire,
                     color = audio_quality)) +
  geom_beeswarm(cex = 10,
                size = 2) +
  stat_summary(fun = mean,
               geom = "crossbar",
               width = 0.5,
               color = "grey") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Hireability",
       x = "Simulated Audio Quality",
       y = "Rating") +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 100,
                                  by = 10)) +
  theme_classic() +
  theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))

# build linear model for hireability
model_hire = lm(hire ~ 1 + audio_quality + voice_type + audio_quality*voice_type,
           data = pilot_B)
summary(model_hire)

Call:
lm(formula = hire ~ 1 + audio_quality + voice_type + audio_quality * 
    voice_type, data = pilot_B)

Residuals:
         1          2          3          4          5          6          7 
-3.000e+00  3.000e+00  7.500e+00 -7.500e+00  2.158e-16 -1.150e+01  1.150e+01 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)   
(Intercept)                              97.000      8.114  11.955  0.00126 **
audio_qualityDistorted                  -97.000     14.053  -6.902  0.00623 **
voice_typehuman                         -27.500     11.475  -2.397  0.09617 . 
audio_qualityDistorted:voice_typehuman   46.000     18.143   2.535  0.08502 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 11.47 on 3 degrees of freedom
Multiple R-squared:  0.9605,    Adjusted R-squared:  0.921 
F-statistic: 24.32 on 3 and 3 DF,  p-value: 0.01317
# main effect of audio quality on hireability
emm <- emmeans(model_hire, ~ audio_quality)
NOTE: Results may be misleading due to involvement in interactions
pairs(emm)
 contrast          estimate   SE df t.ratio p.value
 Clear - Distorted       74 9.07  3   8.157  0.0039

Results are averaged over the levels of: voice_type 
eff_size(emm, sigma = sigma(model_hire), edf = df.residual(model_hire))
 contrast          effect.size   SE df lower.CL upper.CL
 Clear - Distorted        6.45 2.75  3     -2.3     15.2

Results are averaged over the levels of: voice_type 
sigma used for effect sizes: 11.47 
Confidence level used: 0.95 
# comprehension check
t.test(debrief_understandable ~ audio_quality, 
                             data = pilot_B,
                             var.equal = FALSE)

    Welch Two Sample t-test

data:  debrief_understandable by audio_quality
t = 1, df = 2, p-value = 0.4226
alternative hypothesis: true difference in means between group Clear and group Distorted is not equal to 0
95 percent confidence interval:
 -5.504421  8.837755
sample estimates:
    mean in group Clear mean in group Distorted 
              100.00000                98.33333 
cohens_d(debrief_understandable ~ audio_quality, 
                           data = pilot_B)
Warning: 'y' is numeric but has only 2 unique values.
  If this is a grouping variable, convert it to a factor.
Cohen's d |        95% CI
-------------------------
0.91      | [-0.72, 2.47]

- Estimated using pooled SD.

3.3 Exploratory analyses

# plot interaction voice type x audio quality
ggplot(data = pilot_B,
       mapping = aes(x = voice_type,
                     y = hire,
                     color = audio_quality,
                     group = audio_quality)) +
  geom_jitter(width = 0.1,
              alpha = 0.1) +
  stat_summary(fun.data = mean_cl_boot,
               geom = "pointrange",
               size = 0.5) +
  stat_summary(fun = mean,
               geom = "line",
               linewidth = 1) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 100,
                                  by = 10)) +
  labs(title = "Hireability x Voice Type",
       x = "Voice Type",
       y = "Rating") +
  theme_classic() +
  theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))

anova(model_hire)
Analysis of Variance Table

Response: hire
                         Df Sum Sq Mean Sq F value   Pr(>F)   
audio_quality             1 8621.4  8621.4 65.4793 0.003944 **
voice_type                1  138.0   138.0  1.0482 0.381266   
audio_quality:voice_type  1  846.4   846.4  6.4284 0.085020 . 
Residuals                 3  395.0   131.7                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eta_squared(model_hire, partial = TRUE)
# Effect Size for ANOVA (Type I)

Parameter                | Eta2 (partial) |       95% CI
--------------------------------------------------------
audio_quality            |           0.96 | [0.67, 1.00]
voice_type               |           0.26 | [0.00, 1.00]
audio_quality:voice_type |           0.68 | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
# plot main effect of audio quality on intelligence (Exp. 5)
ggplot(data = pilot_B,
       mapping = aes(x = audio_quality,
                     y = intelligence,
                     color = audio_quality)) +
  geom_beeswarm(cex = 10,
                size = 2) +
  stat_summary(fun = mean,
               geom = "crossbar",
               width = 0.5,
               color = "grey") +
  scale_color_brewer(palette = "Set1") +
  labs(title = "Intelligence",
       x = "Simulated Audio Quality",
       y = "Rating") +
  scale_y_continuous(breaks = seq(from = 0,
                                  to = 100,
                                  by = 10)) +
  theme_classic() +
  theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))

# build linear model for intelligence
model_intel = lm(intelligence ~ 1 + audio_quality + voice_type + audio_quality*voice_type,
           data = pilot_B)
summary(model_intel)

Call:
lm(formula = intelligence ~ 1 + audio_quality + voice_type + 
    audio_quality * voice_type, data = pilot_B)

Residuals:
         1          2          3          4          5          6          7 
-5.000e+00  5.000e+00 -1.000e+01  1.000e+01  1.405e-15  1.000e+00 -1.000e+00 

Coefficients:
                                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)                              95.000      6.481  14.659 0.000689 ***
audio_qualityDistorted                  -64.000     11.225  -5.702 0.010700 *  
voice_typehuman                         -18.000      9.165  -1.964 0.144294    
audio_qualityDistorted:voice_typehuman    9.000     14.491   0.621 0.578553    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.165 on 3 degrees of freedom
Multiple R-squared:  0.964, Adjusted R-squared:  0.9281 
F-statistic: 26.81 on 3 and 3 DF,  p-value: 0.01145
# main effect of audio quality on intelligence
emm <- emmeans(model_intel, ~ audio_quality)
NOTE: Results may be misleading due to involvement in interactions
pairs(emm)
 contrast          estimate   SE df t.ratio p.value
 Clear - Distorted     59.5 7.25  3   8.212  0.0038

Results are averaged over the levels of: voice_type 
eff_size(emm, sigma = sigma(model_intel), edf = df.residual(model_intel))
 contrast          effect.size   SE df lower.CL upper.CL
 Clear - Distorted        6.49 2.77  3    -2.31     15.3

Results are averaged over the levels of: voice_type 
sigma used for effect sizes: 9.165 
Confidence level used: 0.95 
# are hire and intelligence ratings correlated?
cor_overall <- cor.test(pilot_B$hire, pilot_B$intelligence, 
                        method = "pearson")
r_value <- cor_overall$estimate
p_value <- cor_overall$p.value
ci_lower <- cor_overall$conf.int[1]
ci_upper <- cor_overall$conf.int[2]

ggplot(pilot_B, aes(x = hire, y = intelligence)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", alpha = 0.1, fill = "lightblue") +
  labs(
    title = "Hireability-Intelligence Correlation",
    subtitle = sprintf("r = %.2f, p = %.3f", r_value, p_value),
    x = "Hireability Rating",
    y = "Intelligence Rating"
  ) +
    scale_y_continuous(breaks = seq(from = 0,
                                  to = 100,
                                  by = 10)) +
  scale_x_continuous(breaks = seq(from = 0,
                                  to = 100,
                                  by = 10)) +
  scale_color_brewer(palette = "Set1") +
  theme_classic() +
  theme(text = element_text(size = 13)) +
  theme(plot.title = element_text(hjust = 0.5))
`geom_smooth()` using formula = 'y ~ x'

4 Discussion

4.1 Summary of Replication Attempt

To be completed post data collection.

4.2 Commentary

To be completed post data collection.

4.3 COI Disclosure

In the interest of transparency, I disclose that I occasionally work as an independent contractor for the National Academies of Sciences, the publisher of PNAS, where the original paper was published. Last year I was assigned to write a brief summary of this paper for the journal. However, I had no role in selecting this paper for coverage, and I have no relationship with the authors. My work for NAS/PNAS has no bearing on the journal’s editorial decisions, as research summaries are assigned only after acceptance.

5 References

Walter-Terrill, R., Ongchoco, J. D. K., & Scholl, B. J. (2025). Superficial auditory (dis)fluency biases higher-level social judgment. Proceedings of the National Academy of Sciences, 122(13), e2415254122. https://doi.org/10.1073/pnas.2415254122