Replication of Study 1a by Rubin et al. (2025, Nature Human Behaviour)
Author
Dora Zhao (dorothyz@stanford.edu)
Published
December 15, 2025
Introduction
As AI systems proliferate, people are increasingly turning to these technologies not only as functional tools but also as relational partners. Given that these relationships can supplant offline social interactions, it is important to better understand how human–AI interactions compare to human–human interactions. In this project, I aim to replicate Study 1a from Rubin et al.’s paper “Comparing the value of perceived human versus AI-generated empathy,” published in Nature Human Behaviour. In their work, Rubin et al. conduct a series of experiments to examine how the perceived source of empathy (human or AI) affects people’s experience of empathy. The key hypothesis is that participants who believe they are receiving an AI-generated response will perceive it as less empathic than a response they believe to be human-written. In Study 1a, participants are randomly assigned to one of two conditions: one group is told that they will receive a response generated by an AI, and the other is told that they will receive a response from a human. Participants are then asked to describe a recent emotional experience. After 60 seconds, they are shown a response to their experience. In both conditions, the response is AI-generated. Participants then rate how much empathy they felt in the response and answer questions that break empathy into separate dimensions.
Justification for Study
As a part of my PhD research, I have ongoing work focused on the relationship between AI companion usage and user well-being. In prior work, we found that interacting with AI companions on Character.AI has a negative relationship with user well-being [1]. One of the benefits of AI companions that our study participants reported was that these systems could provide emotional support and reduce loneliness; however, a detriment is that using AI companions can displace human-human relationships. Nonetheless, our current work is limited to only survey results, reporting correlations between usage and well-being. Replicating Rubin et al.’s work on perceived empathy depending on human vs. AI sources is a good first step in teasing apart more of the causal mechanisms related to my line of work.
[1] Zhang et al. “The Rise of AI Companions: How Human-Chatbot Relationships Influence Well-Being.” arXiv 2025.
Anticipated Challenges
There are two main challenges I anticipate with this study. The first will be developing the experimental code. The authors provide the prompts that they used to generate the AI responses in the Methods section, but they do not provide the code for the experimental platform, which I will need to develop myself. One modification I initially planned was to use GPT-4.1, a recent non-reasoning model from OpenAI, rather than GPT-4, which is no longer offered by OpenAI; as described under Differences from Original Study, the final materials use Gemini-2.5-Flash instead. The second challenge is recruiting a sufficient number of participants, as the original study recruited ~800. While recruiting this many participants is feasible for an online study, it is out of scope and budget for the class. I therefore ran a power analysis: using a slightly more conservative effect size (Cohen’s d = 0.30) than the one reported in the paper (Cohen’s d = 0.34), achieving a power of 80% at an alpha level of 0.05 requires a total sample size of roughly 350 participants (175 per condition). Given that the study is short, I believe this is feasible. Nonetheless, it is possible that my study could be underpowered.
Pre-registration: https://osf.io/n37rt/overview
Note: For some reason, my GitHub repo was not showing up in the original pre-registration. It should be available in this link!
Methods
Power Analysis
In the paper, the authors report an effect size of 0.34 for a Welch’s t-test on the general empathy question between the Human and AI conditions. I estimate the required sample size per group for power levels of 0.80, 0.90, and 0.95, both for the effect size in the paper and for a more conservative Cohen’s d of 0.3.
library(pwr)
d_original <- 0.34
d_conservative <- 0.3
## power levels to test
powers <- list(.80, .90, .95)
## power analyses for t-tests with effect size reported in original paper
print("Cohen's d = 0.34")
[1] "Cohen's d = 0.34"
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_original, power = power, type = "two.sample"))
}
[1] 0.8
Two-sample t test power calculation
n = 136.7605
d = 0.34
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
[1] 0.9
Two-sample t test power calculation
n = 182.755
d = 0.34
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
[1] 0.95
Two-sample t test power calculation
n = 225.7869
d = 0.34
sig.level = 0.05
power = 0.95
alternative = two.sided
NOTE: n is number in *each* group
print("Cohen's d = 0.3")
[1] "Cohen's d = 0.3"
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_conservative, power = power, type = "two.sample"))
}
[1] 0.8
Two-sample t test power calculation
n = 175.3847
d = 0.3
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
[1] 0.9
Two-sample t test power calculation
n = 234.4627
d = 0.3
sig.level = 0.05
power = 0.9
alternative = two.sided
NOTE: n is number in *each* group
[1] 0.95
Two-sample t test power calculation
n = 289.7353
d = 0.3
sig.level = 0.05
power = 0.95
alternative = two.sided
NOTE: n is number in *each* group
Planned Sample
Assuming a conservative effect size (Cohen’s d = .30), a sample size of 350 participants (175 per condition) provides approximately 80% power at \(\alpha\) = .05. Data collection will stop once complete data from all 350 participants have been collected on Prolific. The experiment should take about 4 minutes to complete.
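As a rough check on the planned sample, the power achieved with 175 participants per condition can be computed with base R’s `power.t.test`. This is only a sketch: it assumes a standard two-sample t-test and sets `sd = 1` so that `delta` is on the Cohen’s d scale, the same simplification made by the `pwr.t.test` calls above.

```r
# Sketch: achieved power for the planned sample (175 per condition).
# Assumption: sd = 1, so that delta is expressed as Cohen's d.
planned_n_per_group <- 175
achieved <- power.t.test(n = planned_n_per_group, delta = 0.30, sd = 1,
                         sig.level = 0.05, type = "two.sample",
                         alternative = "two.sided")
achieved_power <- achieved$power
print(round(achieved_power, 3))
```

The result should land very close to the 80% target, since the power analysis above gives a required n of 175.38 per group for d = 0.3.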
Materials
The participants are asked to describe an emotional experience and are then shown a response generated using a large language model (LLM). Given the stochastic nature of LLMs and the different inputs that participants provide, the responses the model provides will vary. Nonetheless, we use the prompt provided by the authors in the SI when generating responses.
We deviate slightly from the original work’s materials in that we use Gemini-2.5-Flash to generate responses rather than GPT-4. These models should be similar in capabilities and are both closed-source models.
Procedure
We follow the procedure as described in the original article as follows:
“We told the participants they were paired with either an AI or another participant. The participants shared a recent emotional experience and waited for 60 seconds; they were told either that the AI was generating a response or that the other participant was writing one, depending on the experimental condition. Participants in both conditions were then shown an AI-generated response, with the AI having been prompted to respond to their specific experience and include all three aspects of empathy.”
Analysis Plan
First, we will exclude participants who fail the attention check question in the task and those who do not provide an emotional response, which we will manually validate. Next, following Rubin et al., we will remove “any participants that were more than 2.5 standard deviations away from the mean of the dependent variable for that analysis” and ensure that “all independent categorical variables were effect-coded.”
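The outlier rule and effect coding described above might look like the following on made-up data. This is a minimal sketch, not the analysis code: the toy values are invented, the column names simply mirror those used in the analysis code, and it assumes the 2.5 SD cutoff is applied to the pooled sample rather than within each condition.

```r
# Sketch of the planned exclusion and coding steps on made-up data.
# Assumptions: toy values are invented; the 2.5 SD cutoff is applied to the
# pooled sample (not within condition).
toy <- data.frame(
  Condition = factor(rep(c("AI", "Human"), each = 50)),
  overall   = c(rep(8, 49), 0, rep(9, 50))  # one extreme rating of 0
)

# Drop values more than 2.5 standard deviations from the mean of the DV
z <- (toy$overall - mean(toy$overall)) / sd(toy$overall)
toy_filtered <- toy[abs(z) <= 2.5, ]

# Effect-code the categorical predictor (levels coded +1 / -1)
contrasts(toy_filtered$Condition) <- contr.sum(2)

print(nrow(toy_filtered))
print(contrasts(toy_filtered$Condition))
```

In this toy example only the single rating of 0 exceeds the cutoff, so 99 of the 100 rows remain.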
The first analysis of interest is comparing which response participants found more empathic by conducting a Welch’s t-test on the general empathy question between conditions.
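For reference, Welch’s t-test does not assume equal variances across conditions. With sample means \(\bar{x}\), sample variances \(s^2\), and group sizes \(n\) (symbols introduced here for illustration), the statistic and its approximate degrees of freedom are:

```latex
t = \frac{\bar{x}_{\mathrm{AI}} - \bar{x}_{\mathrm{Human}}}
         {\sqrt{\dfrac{s_{\mathrm{AI}}^{2}}{n_{\mathrm{AI}}} + \dfrac{s_{\mathrm{Human}}^{2}}{n_{\mathrm{Human}}}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_{\mathrm{AI}}^{2}}{n_{\mathrm{AI}}} + \dfrac{s_{\mathrm{Human}}^{2}}{n_{\mathrm{Human}}}\right)^{2}}
                 {\dfrac{\left(s_{\mathrm{AI}}^{2}/n_{\mathrm{AI}}\right)^{2}}{n_{\mathrm{AI}} - 1}
                + \dfrac{\left(s_{\mathrm{Human}}^{2}/n_{\mathrm{Human}}\right)^{2}}{n_{\mathrm{Human}} - 1}}
```

The degrees of freedom are generally not an integer. Writing the difference as AI minus Human follows the order in which R’s formula interface subtracts factor levels, so a negative \(t\) corresponds to higher ratings in the Human condition.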
The second analysis compares which response participants found more positively resonant, again by conducting a Welch’s t-test between conditions, this time on the positivity resonance measure (the mean across the three questions).
Finally, to see whether condition affected different types of empathy (e.g., cognitive, affective, motivational), we fit “a linear mixed-effect model to predict empathy with condition, aspect of empathy (cognitive, affective, or motivational) and their interaction.”
Differences from Original Study
There are three main differences. First, we use a different LLM (Gemini-2.5-Flash instead of GPT-4-0613) to generate the responses. We select Gemini-2.5-Flash due to practical cost constraints: Gemini offers a more generous rate limit for lower-tier API accounts than OpenAI, and the model is cost-efficient while still having capabilities comparable to GPT-4. This change should not produce material differences given that the model has capabilities similar to the one used in the original study. Furthermore, the original authors also ran the same study using an open-source Llama model and found similar results. Second, we are recruiting a much smaller sample of 350 participants, compared to 725 participants in the original study. Third, we add a validation step to ensure that participants provided a description of a recent emotional event. Since the LLM will generate a response regardless of whether the participant inputs an emotional experience, we require a post-hoc verification that the participant’s input matched our question. Thus, I will manually review the submitted responses and remove those that do not describe a recent event.
Methods Addendum (Post Data Collection)
Actual Sample
I recruited 348 participants on Prolific. After excluding participants who failed the attention check (N=66), did not provide an emotional response to the LLM (N=2), or did not receive a response from the LLM responding to their experience (N=8), I was left with a total of 272 participants (median age, 46.5; 64.7% female; 75.4% White; 9.6% Black; 7.4% Asian; 4.4% mixed; 2.9% other) for the final analysis. There were 140 participants in the Human condition and 132 in the AI condition. Compared to Rubin et al., our participants skew older (a median age of 46.5 compared to a mean age of 28.4) and include a higher proportion of women (64.7% compared to 53%).
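The sample accounting above can be checked in a few lines of R. The counts are those reported in the paragraph; the vector names are only labels chosen for this sketch, and the calculation assumes the three exclusion categories do not overlap.

```r
# Sample accounting using the counts reported above.
# Assumption: the three exclusion categories are mutually exclusive.
recruited <- 348
excluded <- c(attention_check = 66,
              no_emotional_input = 2,
              irrelevant_llm_reply = 8)
final_n <- recruited - sum(excluded)
per_condition <- c(Human = 140, AI = 132)

print(final_n)
print(sum(per_condition))
print(round(100 * sum(excluded) / recruited, 1))  # percent of recruits excluded
```

Both the subtraction and the per-condition counts give 272 participants, and roughly a fifth of recruited participants were excluded, mostly for failing the attention check.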
Differences from pre-data collection methods plan
I originally planned to screen only for participants who did not provide an emotional experience to the LLM. However, this screening did not account for the fact that the LLM could provide responses unrelated to the participant’s input, thus undermining our manipulation. For example, in one trial, a participant wrote that an emotional experience was “Asking about natal charts”; however, the model interpreted this input as a request for information about what natal charts are and provided a description to the participant. To account for this, I also removed participants (N=8) who received a response from the LLM that was not relevant to their input.
Results
Data preparation
Data preparation following the analysis plan.
Confirmatory analysis
library(papaja)
library(effsize)
library(emmeans)
First, I will reproduce the main statistical analyses from Rubin et al.’s Study 1a.
General Empathy
To start, I will conduct Welch’s t-test on the general empathy question — the key statistic for our replication of Study 1a.
# T-test comparing general empathy across conditions
empathy_ttest <- t.test(formula = overall ~ Condition, data = Q1_filtered)
empathy_cohensd <- cohen.d(overall ~ Condition, data = Q1_filtered) # effect size
empathy_ttest
Welch Two Sample t-test
data: overall by Condition
t = -3.7494, df = 228.05, p-value = 0.0002248
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
-1.0665432 -0.3317112
sample estimates:
mean in group AI mean in group Human
8.455285 9.154412
On average, both groups found the responses to be empathic. In the AI condition, the mean perceived empathy of the response was 8.46 out of 10, and in the Human condition, it was 9.15. The difference between conditions was significant, \(t(228.05) = -3.75\), \(p < .001\), with a Cohen’s \(d\) of -0.47 (the negative sign reflects the AI-minus-Human direction of the comparison). This result mirrors the findings of Rubin et al., who report \(t(685.77) = -4.56\), \(p < .001\), Cohen’s \(d = 0.34\).
Next, we provide a violin plot comparing the perceived general empathy in the AI condition and Human condition from our replication and then the original figure from the paper.
## Visualization code comparing general empathy across conditions
Q1PlotViolinE1 <- ggplot(data = Q1_filtered,
                         aes(x = Condition, y = overall, fill = Condition, color = Condition)) +
  geom_boxplot(color = "black", width = 0.25,
               # removing outliers
               outlier.color = NA,
               alpha = 0.8) +
  ggdist::stat_dots(aes(fill = Condition), side = "left", dotsize = 0.1,
                    position = position_nudge(x = -0.2),
                    binwidth = 0.15, overflow = "compress") +
  ggdist::stat_halfeye(
    # for distributions
    xmax = 3, # creates distance between them - the max height
    adjust = 1, # adjust bandwidth
    justification = -.3, # less close to boxplot
    alpha = 0.6,
    # remove the slab interval
    .width = 0,
    point_colour = NA) +
  scale_color_manual(values = c("indianred1", "darkturquoise")) +
  scale_fill_manual(values = c("indianred1", "darkturquoise")) +
  theme(axis.text.x = element_text(face = "bold", size = 13),
        legend.position = "none") +
  stat_summary(geom = "errorbar", fun.data = mean_cl_normal,
               position = position_dodge(.75), linewidth = .3, width = .1, colour = "black") +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 2, colour = "black",
               position = position_dodge(0.75)) +
  xlab("Condition") +
  ylim(c(0, 11.5)) +
  ylab("General empathy in response") +
  theme_classic(base_size = 14, base_family = "Helvetica") +
  theme(legend.position = "none") +
  stat_compare_means(aes(label = paste0("p", scales::label_pvalue()(..p..)))) +
  ggtitle("Effect of condition on perceived empathy")
Q1PlotViolinE1
Positivity Resonance
Next, I will run a set of confirmatory analyses on positivity resonance, which is a mean over three dimensions: mutual warmth and concern, mutual sense of feeling energized and uplifted, and a mutual sense of trust and respect. Each of these dimensions is rated on a scale from 0 - 100.
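The composite described above is a simple row mean of the three items. A minimal sketch on invented data follows; the item column names are hypothetical, and only `positive_resonance` matches a name used in the analysis code.

```r
# Sketch: positivity resonance as the row mean of three 0-100 items.
# Assumption: item column names and values are made up for illustration.
toy_items <- data.frame(
  warmth_concern = c(80, 55, 100),
  energized      = c(70, 40, 90),
  trust_respect  = c(90, 60, 95)
)
item_cols <- c("warmth_concern", "energized", "trust_respect")
toy_items$positive_resonance <- rowMeans(toy_items[, item_cols])
print(toy_items$positive_resonance)
```

Because each item is on a 0 - 100 scale, the composite stays on the same scale, which is why the means reported below are read as scores out of 100.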
Mirroring our analyses for general empathy, we conduct a Welch’s t-test on positivity resonance.
## T-test on positivity resonance across conditions
res_ttest <- t.test(formula = positive_resonance ~ Condition, data = Resonance_filtered)
res_cohensd <- cohen.d(positive_resonance ~ Condition, data = Resonance_filtered) # effect size
print(res_cohensd$estimate)
[1] -0.3721895
print(res_ttest)
Welch Two Sample t-test
data: positive_resonance by Condition
t = -2.9708, df = 243.5, p-value = 0.003268
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
-13.18851 -2.67216
sample estimates:
mean in group AI mean in group Human
69.67751 77.60784
On average, both groups found the responses to be somewhat positively resonant. In the AI condition, the mean positivity resonance of the response was 69.7 out of 100, and in the Human condition, it was 77.6. The difference between conditions was significant, \(t(243.50) = -2.97\), \(p = .003\), with a Cohen’s \(d\) of -0.37. This result is similar to the findings of Rubin et al., who report \(t(694) = -5.63\), \(p < .001\), Cohen’s \(d = 0.43\), albeit with a slightly smaller effect size in our replication.
Again, we provide the violin plots for positivity resonance from the replication and then the original figure from the paper.
## Visualization code comparing positivity resonance across conditions
PosResPlotE1 <- ggplot(data = Resonance_filtered,
                       aes(x = Condition, y = positive_resonance,
                           fill = Condition, color = Condition)) +
  ggdist::stat_halfeye(
    # for distributions
    xmax = 3, # creates distance between them - the max height
    adjust = 1, # adjust bandwidth
    justification = -.3, # less close to boxplot
    alpha = 0.6,
    # remove the slab interval
    .width = 0,
    point_colour = NA) +
  geom_boxplot(color = "black", width = 0.25,
               # removing outliers
               outlier.color = NA,
               alpha = 0.8) +
  ggdist::stat_dots(aes(fill = Condition), side = "left", justification = (1.2),
                    binwidth = 0.15, overflow = "compress") +
  scale_color_manual(values = c("indianred1", "darkturquoise")) +
  scale_fill_manual(values = c("indianred1", "darkturquoise")) +
  theme(axis.text.x = element_text(face = "bold", size = 13),
        legend.position = "none") +
  stat_summary(geom = "errorbar", fun.data = mean_cl_normal,
               position = position_dodge(.75), linewidth = .3, width = .1, colour = "black") +
  stat_summary(fun = mean, geom = "point", shape = 4, size = 2, colour = "black",
               position = position_dodge(0.75)) +
  xlab("Condition") +
  ylab("Positivity Resonance") +
  theme_classic(base_size = 13, base_family = "Helvetica") +
  theme(legend.position = "none") +
  stat_compare_means(method = "t.test") +
  ylim(c(0, 105)) +
  ggtitle("Effect of condition on positivity resonance")
PosResPlotE1
Aspects of Empathy
## Linear mixed-effect model looking at aspects of empathy
model_inter <- lmer(empathy.s ~ Condition * empathy_type + (1 | ID), data = filtered_data)
anova_res <- anova(model_inter)
# Post-hoc comparison with emmeans
emm <- emmeans(model_inter, ~ Condition * empathy_type)
conts1 <- contrast(emm, adjust = "bonferroni", method = "pairwise", by = "empathy_type")
emm_diff <- as.data.frame(summary(conts1))
Similar to Rubin et al., our analysis reveals a significant main effect of condition (\(F(1, 262.78) = 7.24\), \(p = .008\)). However, unlike Rubin et al., we find there is a significant effect of aspects of empathy (\(F(2, 521.42) = 31.90\), \(p < .001\)).
Post-hoc pairwise contrasts showed that, for the cognitive aspect of empathy, the AI–Human difference was smaller and not statistically significant (\(t(343.05) = -1.77\), \(p = .078\)). In contrast, the difference was larger and significant for affective and motivational empathy. While Rubin et al. did not observe this pattern in Study 1a, they did find it in Study 1b.
Exploratory analyses
We conduct an exploratory analysis to understand whether our results are moderated by whether participants believe an AI aided in generating the human response or a human aided in generating the AI response. This belief was captured on a Likert-type scale from 0 to 10, where 0 corresponded to no aid from the other source at all and 10 corresponded to heavy aid from the other source.
# Linear model exploring moderator "other source"
model <- lm(general_empathy ~ Condition * other_source_z, data = moderator_data)
summary(model)
Call:
lm(formula = general_empathy ~ Condition * other_source_z, data = moderator_data)
Residuals:
Min 1Q Median 3Q Max
-4.5493 -0.6811 0.3351 0.9806 2.1035
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.96775 0.14886 53.527 < 2e-16 ***
ConditionHuman 0.62631 0.20396 3.071 0.00237 **
other_source_z -0.09047 0.14328 -0.631 0.52834
ConditionHuman:other_source_z -0.22054 0.20403 -1.081 0.28075
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.395 on 254 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.04115, Adjusted R-squared: 0.02983
F-statistic: 3.634 on 3 and 254 DF, p-value: 0.0135
We do observe that the main effect of condition is significant, indicating that responses from the Human Condition are rated more empathic than responses from the AI Condition (\(b = 0.63\), 95% CI \([0.22, 1.03]\), \(t(254) = 3.07\), \(p = .002\)). We do not find a significant interaction between condition and participant’s belief about other sources assisting the response generation (\(b = -0.22\), 95% CI \([-0.62, 0.18]\), \(t(254) = -1.08\), \(p = .281\)).
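The confidence intervals reported above can be recovered from the coefficient table alone. The sketch below copies the estimates and standard errors from the printed summary and uses the residual degrees of freedom (254); the vector names are just labels for this example.

```r
# Sketch: 95% confidence intervals from the estimates and standard errors
# printed in the model summary above (residual df = 254).
est <- c(ConditionHuman = 0.62631, interaction = -0.22054)
se  <- c(ConditionHuman = 0.20396, interaction = 0.20403)
t_crit <- qt(0.975, df = 254)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 2))
```

On the fitted object itself, `confint(model)` should return the same intervals without the manual step.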
Discussion
Summary of Replication Attempt
The primary finding of the original study was that participants rate responses they perceive as being from a human as more empathic than responses they perceive as being generated by AI, with an effect size of \(d=0.34\). The primary statistical test did replicate in this project. I conducted a Welch’s t-test between the Human and AI conditions, finding that the generated responses in the Human condition were rated as significantly more empathic than those in the AI condition. In fact, the effect size identified in this project, \(d=-0.47\), is larger in magnitude than that of the original study.
Commentary
The results from this project succeeded in replicating the results on empathy (described above) and positivity resonance from Study 1a. This outcome suggests that the results from Rubin et al. are robust even with different models (i.e., Gemini-2.5-Flash rather than GPT-4) and as model capabilities improve. Interestingly, the additional analysis exploring the different aspects of empathy (cognitive, affective, and motivational) found a significant effect of empathy type on perceived empathy. In the original Study 1a, Rubin et al. found a null effect here, although in later studies in their paper they report results similar to ours. Thus, our replication provides evidence for their later findings, which did not appear in the original Study 1a.
Another exploratory analysis I conducted examined whether the results are moderated by whether participants believe an AI aided in generating the human response or vice versa. I expected that our results would mirror what Rubin et al. found: that in the Human condition, the more participants thought an AI was involved, the less empathic they perceived the response to be. However, while the interaction effect is negative, it is not statistically significant. It is possible that this result is due to our sample size being much smaller than that used in Rubin et al.’s work. I also wonder whether societal perceptions toward AI usage are changing as these technologies become more prevalent. That is, people may find it more permissible for others to use AI to assist in tasks, and thus there is no significant reduction in perceived empathy so long as they believe the response still originates from another human. It would be interesting to see whether participants’ own AI usage moderates this relationship.
Finally, I am grateful to the main author (Matan Rubin) who not only has replicable code and data for the project available on OSF but also responded to my emails to provide direct feedback on my study paradigm.