Replication of Study 1a by Rubin et al. (2025, Nature Human Behaviour)

Author

Dora Zhao (dorothyz@stanford.edu)

Published

October 24, 2025

Introduction

Justification

Rubin et al. seek to understand whether people perceive empathy differently when it comes from another person rather than from an AI system. In Study 1a, they focus specifically on whether the perceived source of a response (human vs. AI) affects how much empathy participants feel from it. As part of my PhD research, I have ongoing work on the relationship between AI companion usage and user well-being. In prior work, we found that interacting with AI companions on Character.AI has a negative relationship with user well-being. One benefit of AI companions that our study participants reported was that these systems can provide emotional support and reduce loneliness; a detriment, however, is that using AI companions can displace human-human relationships. However, our current work is limited to survey-based, correlational evidence. Replicating Rubin et al.’s work on perceived empathy from human vs. AI sources is a good first step toward teasing apart the causal mechanisms relevant to this line of work.

Stimuli and Procedures

For Study 1a, participants are randomly assigned to one of two conditions: one group is told that it will receive a response generated by an AI, and the other group is told that it will receive a response from a human. Participants are then asked to describe a recent emotional experience. After 60 seconds, they are shown a response to their experience. In both conditions, the response participants are shown is AI-generated. Participants then rate how much empathy they felt from the response, and also answer questions that disaggregate different dimensions of empathy.

There are two main challenges I anticipate with this study. The first is developing the experimental code. The authors provide the prompts they used to generate the AI responses in the Methods section, but they do not provide code for the experimental platform, which I will need to develop myself. One modification I will make to the original methods is to use GPT-4.1, the latest state-of-the-art non-reasoning model, to generate responses rather than GPT-4, which is no longer offered by OpenAI. The second challenge is whether I will be able to recruit a sufficient number of participants, as the original study recruited ~800. While recruiting this many participants is feasible for an online study, I believe it is out of scope (and budget) for the class. Nonetheless, I ran a power analysis using the effect size reported in the paper (Cohen’s d = 0.34); to achieve 80% power at an alpha level of 0.05, we would need a total sample size of 216 participants (108 per condition). Given that the study is on the shorter side, I believe this is feasible. It is still possible, however, that my study will be underpowered.

Methods

Power Analysis

In the paper, the authors report an effect size of Cohen’s d = 0.34. I estimate the required sample size at power levels of 0.80, 0.90, and 0.95 for the effect size reported in the paper as well as for a more conservative Cohen’s d of 0.30.

library(pwr)
d_original <- 0.34
d_conservative <- 0.3

## power levels to test
powers <- list(.80, .90, .95)

## power analyses for t-tests with effect size reported in original paper
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_original, power = power, type = "two.sample", alternative = "greater"))
}
[1] 0.8

     Two-sample t test power calculation 

              n = 107.6474
              d = 0.34
      sig.level = 0.05
          power = 0.8
    alternative = greater

NOTE: n is number in *each* group

[1] 0.9

     Two-sample t test power calculation 

              n = 148.8448
              d = 0.34
      sig.level = 0.05
          power = 0.9
    alternative = greater

NOTE: n is number in *each* group

[1] 0.95

     Two-sample t test power calculation 

              n = 187.9153
              d = 0.34
      sig.level = 0.05
          power = 0.95
    alternative = greater

NOTE: n is number in *each* group
for (power in powers) {
  print(power)
  print(pwr.t.test(d = d_conservative, power = power, type = "two.sample", alternative = "greater"))
}
[1] 0.8

     Two-sample t test power calculation 

              n = 138.0716
              d = 0.3
      sig.level = 0.05
          power = 0.8
    alternative = greater

NOTE: n is number in *each* group

[1] 0.9

     Two-sample t test power calculation 

              n = 190.9879
              d = 0.3
      sig.level = 0.05
          power = 0.9
    alternative = greater

NOTE: n is number in *each* group

[1] 0.95

     Two-sample t test power calculation 

              n = 241.1723
              d = 0.3
      sig.level = 0.05
          power = 0.95
    alternative = greater

NOTE: n is number in *each* group

Planned Sample

Based on the power analysis for the more conservative effect size (Cohen’s d = .30) at 80% power, I will collect data from 276 participants (138 per condition). Data collection will stop once complete data from all 276 participants have been collected via Prolific. The experiment should take about 2-3 minutes to complete.
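For reference, the planned total can be recovered from the pwr output for the conservative effect size. This is a minimal sketch: it simply rounds the per-group n from the power analysis above and doubles it.

library(pwr)

## per-group n for d = 0.30 at 80% power with a one-sided alpha of .05
n_per_group <- round(pwr.t.test(d = 0.3, power = 0.80, type = "two.sample",
                                alternative = "greater")$n)
2 * n_per_group  # 276 participants in total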

Materials

Participants are asked to describe an emotional experience and are then shown a response generated using a large language model (LLM). Given the stochastic nature of LLMs and the different inputs that participants provide, the responses the model generates will vary. Nonetheless, we use the prompt provided by the authors in the SI when generating responses. We deviate slightly from the original work’s materials in that we use Gemini-2.5-Flash to generate responses rather than GPT-4o. The two models should be similar in capability, and both are closed-source.
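As a rough illustration, responses could be generated with a call like the one below. This is a minimal sketch rather than the final experimental code: it assumes the public Gemini generateContent REST endpoint, an API key stored in the GEMINI_API_KEY environment variable, and an inline placeholder instruction standing in for the full prompt in the paper’s SI.

library(httr2)

## Sketch: generate an empathic response to a participant's shared experience.
## The instruction text below is a placeholder; the actual wording comes from the SI prompt.
generate_response <- function(experience_text, api_key = Sys.getenv("GEMINI_API_KEY")) {
  prompt <- paste(
    "Respond empathically to the experience below, conveying cognitive,",
    "affective, and motivational empathy.",
    "\n\nExperience:", experience_text
  )
  resp <- request("https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent") |>
    req_headers("x-goog-api-key" = api_key) |>
    req_body_json(list(contents = list(list(parts = list(list(text = prompt)))))) |>
    req_perform() |>
    resp_body_json()
  resp$candidates[[1]]$content$parts[[1]]$text
}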

Procedure

We follow the procedure described in the original article:

“We told the participants they were paired with either an AI or another participant. The participants shared a recent emotional experience and waited for 60 seconds; they were told either that the AI was generating a response or that the other participant was writing one, depending on the experimental condition. Participants in both conditions were then shown an AI-generated response, with the AI having been prompted to respond to their specific experience and include all three aspects of empathy.”

Analysis Plan

First, we will exclude participants who fail the attention check question in the task. Next, following Rubin et al., we will remove “any participants that were more than 2.5 standard deviations away from the mean of the dependent variable for that analysis” and ensure that “all independent categorical variables were effect-coded.”
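A minimal sketch of these preprocessing steps is below. The names raw_data and attention_check are hypothetical; the other names (Empathy_1, Condition, Q1_filtered) match those used in the confirmatory analysis.

library(dplyr)

## exclusion rule from Rubin et al.: drop attention-check failures, then ratings
## more than 2.5 SD from the mean of the dependent variable for this analysis
Q1_filtered <- raw_data |>
  filter(attention_check == "pass") |>
  filter(abs(Empathy_1 - mean(Empathy_1)) <= 2.5 * sd(Empathy_1))

## effect-code the condition factor (sum-to-zero contrasts) rather than R's default dummy coding
Q1_filtered$Condition <- factor(Q1_filtered$Condition, levels = c("AI", "Human"))
contrasts(Q1_filtered$Condition) <- contr.sum(2)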

The first analysis of interest compares how empathic participants found the response across conditions, using a Welch’s t-test on the general empathy question.

The second analysis compares positive resonance between conditions, again using a Welch’s t-test on the positive resonance measure (the mean of the three resonance questions); see the sketch below for how the composite is formed.
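The composite can be computed as a per-participant mean over the three resonance items before running the test. This is a sketch operating on the already-filtered resonance data; the individual item names Resonance_1 through Resonance_3 are hypothetical.

library(dplyr)

## positive_resonance = mean of the three resonance items for each participant
Resonance_filtered <- Resonance_filtered |>
  mutate(positive_resonance = rowMeans(across(Resonance_1:Resonance_3)))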

Finally, to test whether condition affected different types of empathy (cognitive, affective, motivational), we will fit “a linear mixed-effect model to predict empathy with condition, aspect of empathy (cognitive, affective or motivational) and their interaction.”
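Because each participant contributes one rating per empathy aspect, the data are first reshaped to long format, one row per participant and aspect. This is a sketch with hypothetical wide-format column names; the model itself is fit in the confirmatory analysis below.

library(tidyr)

## hypothetical wide columns cognitive, affective, motivational; empathy.s holds the ratings
filtered_data <- wide_data |>
  pivot_longer(cols = c(cognitive, affective, motivational),
               names_to = "empathy_type", values_to = "empathy.s")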

Differences from Original Study

There are two main differences. First, we use a different LLM to generate the responses; however, this change should not produce material differences, given that the model’s capabilities are similar to those of the model used in the original study. Furthermore, the original article ran the same study with an open-source Llama model and found similar results. Second, we are recruiting a much smaller sample than the original study, which had 725 participants.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

Confirmatory analysis

First, we conduct two t-tests comparing perceived empathy and positive resonance between conditions.

## Welch's t-test (R's default, var.equal = FALSE) on general empathy between conditions
empathy_ttest <- t.test(formula = Empathy_1 ~ Condition, data = Q1_filtered)
print(empathy_ttest)

    Welch Two Sample t-test

data:  Empathy_1 by Condition
t = -0.39736, df = 2.635, p-value = 0.7211
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
 -4.837302  3.837302
sample estimates:
   mean in group AI mean in group Human 
                7.0                 7.5 
## Welch's t-test on the positive resonance composite between conditions
resonance_ttest <- t.test(formula = positive_resonance ~ Condition, data = Resonance_filtered)
print(resonance_ttest)

    Welch Two Sample t-test

data:  positive_resonance by Condition
t = -1.4806, df = 2.6826, p-value = 0.2455
alternative hypothesis: true difference in means between group AI and group Human is not equal to 0
95 percent confidence interval:
 -67.11361  26.44694
sample estimates:
   mean in group AI mean in group Human 
           35.00000            55.33333 
library(lmerTest)  # lmer() with Satterthwaite p-values, used by summary() below
model <- lmer(empathy.s ~ Condition * empathy_type + (1 | ID), data = filtered_data)
boundary (singular) fit: see help('isSingular')
summary(model)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: empathy.s ~ Condition * empathy_type + (1 | ID)
   Data: filtered_data

REML criterion at convergence: 28.9

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.6020 -0.5149  0.0000  0.6294  0.9917 

Random effects:
 Groups   Name        Variance Std.Dev.
 ID       (Intercept) 0.0000   0.0000  
 Residual             0.7962   0.8923  
Number of obs: 15, groups:  ID, 5

Fixed effects:
                                        Estimate Std. Error      df t value
(Intercept)                              -0.7760     0.5152  9.0000  -1.506
ConditionHuman                            1.3954     0.8145  9.0000   1.713
empathy_typeaffective                     1.0210     0.7285  9.0000   1.401
empathy_typemotivational                  1.7017     0.7285  9.0000   2.336
ConditionHuman:empathy_typeaffective     -2.1442     1.1519  9.0000  -1.861
ConditionHuman:empathy_typemotivational  -3.0290     1.1519  9.0000  -2.630
                                        Pr(>|t|)  
(Intercept)                               0.1663  
ConditionHuman                            0.1208  
empathy_typeaffective                     0.1946  
empathy_typemotivational                  0.0443 *
ConditionHuman:empathy_typeaffective      0.0956 .
ConditionHuman:empathy_typemotivational   0.0274 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
                    (Intr) CndtnH empthy_typf empthy_typm CndtnHmn:mpthy_typf
ConditinHmn         -0.632                                                   
empthy_typf         -0.707  0.447                                            
empthy_typm         -0.707  0.447  0.500                                     
CndtnHmn:mpthy_typf  0.447 -0.707 -0.632      -0.316                         
CndtnHmn:mpthy_typm  0.447 -0.707 -0.316      -0.632       0.500             
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

The analyses as specified in the analysis plan.

Side-by-side graph with original graph is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.