Replication of Study 1a by Rubin et al. (2025, Nature Human Behaviour)
Author
Dora Zhao (dorothyz@stanford.edu)
Published
October 24, 2025
Introduction
Justification
Rubin et al. seek to understand whether people perceive empathy differently when it comes from another person compared to an AI system. In Study 1a, Rubin et al. focus specifically on whether the perceived source of an empathic response affects how empathic participants perceive it to be. As a part of my PhD research, I have ongoing work on the relationship between AI companion usage and user well-being. In prior work, we found that interacting with AI companions on Character.AI has a negative relationship with user well-being. One benefit of AI companions that our study participants reported was that these systems could provide emotional support and reduce loneliness; a detriment, however, is that using AI companions can displace human-human relationships. Nonetheless, our current work is limited to survey-based, correlational results. Replicating Rubin et al.’s work on perceived empathy depending on human vs. AI sources is a good first step toward teasing apart the causal mechanisms related to my line of work.
Stimuli and Procedures
For Study 1a, participants are randomly assigned to one of two conditions: one group is told that they will receive a response generated by an AI, and the other group is told that they will receive a response from a human. Participants are then asked to describe a recent emotional experience. After 60 seconds, they are shown a response to their experience. In both conditions, the response is AI-generated. Participants then rate how empathic they felt the response was and answer questions that disaggregate the dimensions of empathy.
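The random assignment step could be sketched as follows. This is a minimal illustration that assumes simple (unblocked) randomization; the participant IDs and the seed are hypothetical, and the actual experimental platform may assign conditions differently.

```r
# Simple randomization sketch; participant IDs and the seed are hypothetical.
set.seed(2025)
participant_ids <- sprintf("P%03d", 1:10)

# Each participant is independently assigned to the "AI" or "Human" label condition.
assignments <- data.frame(
  ID = participant_ids,
  Condition = sample(c("AI", "Human"), size = length(participant_ids), replace = TRUE)
)

print(table(assignments$Condition))
```

Simple randomization does not guarantee equal group sizes; if balanced groups matter, a blocked scheme would be one alternative.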
There are two main challenges I anticipate with this study. The first is developing the experimental code. The authors provide the prompts that they used to generate the AI responses in the Methods section, but they do not provide the code for the experimental platform, which I will need to develop myself. One modification I will make to the original methods is to use GPT-4.1, the latest state-of-the-art non-reasoning model, to generate responses rather than GPT-4, which is no longer offered by OpenAI. The second challenge is recruiting a sufficient number of participants, as the original study recruited ~800. While recruiting this many participants is feasible for an online study, it is outside the scope and budget of the class. I therefore ran a power analysis using the effect size (Cohen’s d = 0.34) reported in the paper: to achieve a power of 80% at an alpha level of 0.05 in a one-sided test, we would need a total sample size of 216 participants (108 per condition). Given that the study is on the shorter side, I believe this is feasible. Nonetheless, it is possible that my study will be underpowered.
In the paper, the authors report an effect size of 0.34. I estimate the required sample size at powers of 0.80, 0.90, and 0.95 for the effect size in the paper as well as for a more conservative Cohen’s d of 0.3.
library(pwr)
d_original <- 0.34
d_conservative <- 0.3
## power levels to test
powers <- list(.80, .90, .95)
## power analyses for t-tests with effect size reported in original paper
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_original, power = power, type = "two.sample", alternative = "greater"))
}
[1] 0.8
Two-sample t test power calculation
n = 107.6474
d = 0.34
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: n is number in *each* group
[1] 0.9
Two-sample t test power calculation
n = 148.8448
d = 0.34
sig.level = 0.05
power = 0.9
alternative = greater
NOTE: n is number in *each* group
[1] 0.95
Two-sample t test power calculation
n = 187.9153
d = 0.34
sig.level = 0.05
power = 0.95
alternative = greater
NOTE: n is number in *each* group
## power analyses for t-tests with the more conservative effect size
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_conservative, power = power, type = "two.sample", alternative = "greater"))
}
[1] 0.8
Two-sample t test power calculation
n = 138.0716
d = 0.3
sig.level = 0.05
power = 0.8
alternative = greater
NOTE: n is number in *each* group
[1] 0.9
Two-sample t test power calculation
n = 190.9879
d = 0.3
sig.level = 0.05
power = 0.9
alternative = greater
NOTE: n is number in *each* group
[1] 0.95
Two-sample t test power calculation
n = 241.1723
d = 0.3
sig.level = 0.05
power = 0.95
alternative = greater
NOTE: n is number in *each* group
Planned Sample
Based on the power analysis for the more conservative effect size (Cohen’s d = .30), I will collect data from 276 participants on Prolific (138 per condition, which gives approximately 80% power in a one-sided test). Data collection will stop once complete data from all 276 participants have been collected. The experiment should take about 2-3 minutes to complete.
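As a quick check on the planned sample, the power it implies can be computed with base R’s power.t.test. This is a sketch under stated assumptions: equal allocation across the two conditions, a true effect of d = 0.30, and the same one-sided test at alpha = .05 used in the power analysis above.

```r
# Base-R check (stats::power.t.test) of the power implied by the planned sample.
# Assumptions: equal allocation (276 / 2 = 138 per condition), d = 0.30,
# and a one-sided test at alpha = .05.
n_per_condition <- 276 / 2

planned <- power.t.test(
  n = n_per_condition,
  delta = 0.30,        # with sd = 1, delta equals Cohen's d
  sd = 1,
  sig.level = 0.05,
  type = "two.sample",
  alternative = "one.sided"
)

print(planned$power)
```

If fewer participants remain after exclusions, rerunning this with the final per-condition n shows how much power is lost.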
Materials
Participants are asked to describe an emotional experience and are then shown a response generated using a large language model (LLM). Given the stochastic nature of LLMs and the different inputs that participants provide, the responses will vary. Nonetheless, we use the prompt provided by the authors in the SI when generating responses. We deviate slightly from the original work’s materials in that we use Gemini-2.5-Flash to generate responses rather than GPT-4o. These models should be similar in capabilities, and both are closed-source.
Procedure
We follow the procedure as described in the original article as follows:
“We told the participants they were paired with either an AI or another participant. The participants shared a recent emotional experience and waited for 60 seconds; they were told either that the AI was generating a response or that the other participant was writing one, depending on the experimental condition. Participants in both conditions were then shown an AI-generated response, with the AI having been prompted to respond to their specific experience and include all three aspects of empathy.”
Analysis Plan
First, we will exclude participants who fail the attention check question in the task. Next, following Rubin et al., we will remove “any participants that were more than 2.5 standard deviations away from the mean of the dependent variable for that analysis” and ensure that “all independent categorical variables were effect-coded.”
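The 2.5 standard deviation exclusion rule could be implemented along these lines. This is a minimal sketch: remove_outliers is a hypothetical helper, the toy data are made up, and it assumes the mean and standard deviation are computed over the whole sample for that analysis rather than within each condition.

```r
# Hypothetical helper: keep rows whose value on `dv` lies within `cutoff`
# standard deviations of the sample mean of `dv`.
remove_outliers <- function(data, dv, cutoff = 2.5) {
  z <- (data[[dv]] - mean(data[[dv]], na.rm = TRUE)) / sd(data[[dv]], na.rm = TRUE)
  data[!is.na(z) & abs(z) <= cutoff, , drop = FALSE]
}

# Toy data: nineteen typical ratings and one extreme value.
toy_ratings <- data.frame(
  ID = 1:20,
  Empathy_1 = c(rep(c(6, 7, 8), length.out = 19), 100)
)

toy_filtered <- remove_outliers(toy_ratings, "Empathy_1")
print(nrow(toy_filtered))
```

Because the rule is applied per dependent variable, the helper would be called separately for each analysis, which can leave slightly different samples for each test.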
The first analysis compares how empathic participants found the response in each condition, using a Welch’s t-test on the general empathy question.
The second analysis compares how positively resonant participants found the response in each condition, again using a Welch’s t-test, this time on the positive resonance measure (mean across the three questions).
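The positive resonance composite is the mean of three items, which could be computed as in the sketch below. The item column names (resonance_1 to resonance_3) and the toy values are hypothetical; only the composite name positive_resonance matches the variable used in the Results.

```r
# Toy data; the item column names are hypothetical.
toy_resonance <- data.frame(
  ID = 1:3,
  resonance_1 = c(40, 60, 80),
  resonance_2 = c(50, 70, 90),
  resonance_3 = c(60, 80, 100)
)

# Composite score: the row-wise mean across the three items.
item_cols <- c("resonance_1", "resonance_2", "resonance_3")
toy_resonance$positive_resonance <- rowMeans(toy_resonance[, item_cols])

print(toy_resonance$positive_resonance)
```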
Finally, to see whether condition affected different types of empathy (e.g., cognitive, affective, motivational), we will fit “a linear mixed-effect model to predict empathy with condition, aspect of empathy (cognitive, affective or motivational) and their interaction.”
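Because the analysis plan calls for effect-coded categorical predictors, the condition and empathy-type factors would need sum-to-zero contrasts before the model is fit. The sketch below shows one way to set them in base R; the toy data frame is made up, and its column names simply mirror the model formula used in the Results.

```r
# Toy long-format data: one row per participant per empathy type.
toy_long <- data.frame(
  ID = rep(1:4, each = 3),
  Condition = factor(rep(c("AI", "Human"), each = 6)),
  empathy_type = factor(
    rep(c("cognitive", "affective", "motivational"), times = 4),
    levels = c("cognitive", "affective", "motivational")
  )
)

# Sum-to-zero (effect) coding: each contrast column sums to 0 across levels,
# so the intercept reflects the grand mean rather than a reference group.
contrasts(toy_long$Condition) <- contr.sum(2)
contrasts(toy_long$empathy_type) <- contr.sum(3)

print(contrasts(toy_long$Condition))
print(contrasts(toy_long$empathy_type))
```

With these contrasts, the fitted model labels its coefficients with numeric suffixes (for example Condition1) rather than level names such as ConditionHuman, which is a quick way to confirm that effect coding was actually applied.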
Differences from Original Study
There are two main differences. First, we will be using a different LLM to generate the responses; however, this change should not produce material differences given that it is a model with similar capabilities to that used in the original study. Furthermore, the original article also reports the same study run with an open-source Llama model, with similar results. Second, we are recruiting a much smaller sample than the original study, which had 725 participants.
Methods Addendum (Post Data Collection)
You can comment this section out prior to final report with data collection.
Actual Sample
Sample size, demographics, data exclusions based on rules spelled out in analysis plan
Differences from pre-data collection methods plan
Any differences from what was described as the original plan, or “none”.
Results
Data preparation
Data preparation following the analysis plan.
Confirmatory analysis
First, we conduct two t-tests comparing perceived empathy and positive resonance between conditions.
empathy_ttest <- t.test(formula = Empathy_1 ~ Condition, data = Q1_filtered)
print(empathy_ttest)
Welch Two Sample t-test
data: Empathy_1 by Condition
t = -0.39736, df = 2.635, p-value = 0.7211
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
-4.837302 3.837302
sample estimates:
mean in group AI mean in group Human
7.0 7.5
resonance_ttest <- t.test(formula = positive_resonance ~ Condition, data = Resonance_filtered)
print(resonance_ttest)
Welch Two Sample t-test
data: positive_resonance by Condition
t = -1.4806, df = 2.6826, p-value = 0.2455
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
-67.11361 26.44694
sample estimates:
mean in group AI mean in group Human
35.00000 55.33333
library(lmerTest)  # lmer() with Satterthwaite t-tests, as reported in the summary
model <- lmer(empathy.s ~ Condition * empathy_type + (1 | ID), data = filtered_data)
boundary (singular) fit: see help('isSingular')
summary(model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: empathy.s ~ Condition * empathy_type + (1 | ID)
Data: filtered_data
REML criterion at convergence: 28.9
Scaled residuals:
Min 1Q Median 3Q Max
-1.6020 -0.5149 0.0000 0.6294 0.9917
Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 0.0000 0.0000
Residual 0.7962 0.8923
Number of obs: 15, groups: ID, 5
Fixed effects:
                                        Estimate Std. Error     df t value Pr(>|t|)
(Intercept)                              -0.7760     0.5152 9.0000  -1.506   0.1663
ConditionHuman                            1.3954     0.8145 9.0000   1.713   0.1208
empathy_typeaffective                     1.0210     0.7285 9.0000   1.401   0.1946
empathy_typemotivational                  1.7017     0.7285 9.0000   2.336   0.0443 *
ConditionHuman:empathy_typeaffective     -2.1442     1.1519 9.0000  -1.861   0.0956 .
ConditionHuman:empathy_typemotivational  -3.0290     1.1519 9.0000  -2.630   0.0274 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Correlation of Fixed Effects:
(Intr) CndtnH empthy_typf empthy_typm CndtnHmn:mpthy_typf
ConditinHmn -0.632
empthy_typf -0.707 0.447
empthy_typm -0.707 0.447 0.500
CndtnHmn:mpthy_typf 0.447 -0.707 -0.632 -0.316
CndtnHmn:mpthy_typm 0.447 -0.707 -0.316 -0.632 0.500
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')
The analyses as specified in the analysis plan.
Side-by-side graph with original graph is ideal here
Exploratory analyses
Any follow-up analyses desired (not required).
Discussion
Summary of Replication Attempt
Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.
Commentary
Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.