Replication of Individual Differences in Emotion Prediction and Implications for Social Success by Barrick et al. (2024, Emotion)

Author

Izzy Aslarus (aslarus@stanford.edu)

Published

December 14, 2025

Introduction

I chose to replicate Individual Differences in Emotion Prediction and Implications for Social Success by Barrick et al. (2024, Emotion) because I am interested in the relationship between social cognition (what we know about those around us) and interpersonal emotion regulation (how we help those around us to improve their emotional states). In this study, Barrick et al. focus on subjects’ ability to accurately predict how other people transition between different emotions over time. Emotion prediction accuracy is an interesting measure of social cognition that has theoretical links to interpersonal emotion regulation. For instance, someone who is better able to predict how another person’s emotions change over time might be better able to simulate the emotional consequences of different interpersonal emotion regulation strategies (e.g., distraction, humor, reappraisal, venting), then intervene with the most effective strategy. Barrick et al. found that emotion prediction accuracy is linked to a constellation of other socioemotional outcomes, such as larger social networks and better emotion perception from facial expressions. This constellation of outcomes is likewise theoretically relevant for interpersonal emotion regulation (e.g., having a larger social network may be related to being more effective at improving others’ emotions when desired).

The main task in Barrick et al. (2024) was the Emotion Transitions Task. In this task, subjects rated the likelihood of a generic other person transitioning between every possible pair of emotions from the following list: irritable, anxious, calm, happy, sad, full of thought, sluggish. The study’s key measure of interest, emotion prediction accuracy, was operationalized as the correlation between subjects’ ratings of a generic other person and a set of average transition probabilities obtained from a preexisting experience sampling dataset. Subjects then completed several other tasks and questionnaires to measure potential correlates of emotion prediction accuracy. First, subjects rated the likelihood that they themselves would transition between every pair of emotions, enabling subjects’ emotion typicality to be operationalized as the correlation between their ratings of their own emotion transitions and the average transition probabilities. To measure emotion perception, subjects completed the multiracial version of the Reading the Mind in the Eyes task. Finally, subjects completed a series of questionnaires: the Toronto Alexithymia Scale, which measures emotion understanding (i.e., understanding of one’s own internal states); the Communication/Mind-Reading and Social Skills subscales of the Autism Quotient, which measure autistic traits related to social difficulties; the UCLA loneliness scale, which measures feelings of loneliness; the Multidimensional Scale of Perceived Social Support, which measures feelings of social support; and social network nominations, which measure social network size. Barrick et al. (2024) found that emotion prediction accuracy was significantly associated with lower loneliness, larger social networks, more typical emotion transitions, better emotion perception, better emotion understanding, and fewer communication difficulties. 
Counter to the authors’ predictions, emotion prediction accuracy was associated with poorer social skills, and it was not significantly associated with perceived social support.

There are two main challenges for this replication. Firstly, analyses of the original dataset should be reproduced using code and data available on OSF prior to conducting the replication. This is particularly important because a key element of the analysis involves data from an experience sampling study published prior to Barrick et al.’s 2024 paper, which provides the “ground truth” emotion transitions from which accuracy and typicality are calculated. Therefore, computational reproducibility is needed to validate the operationalization of these two key measures. Secondly, it is possible that there are too many tasks and questionnaires for available resources to support while maintaining sufficient power, which would necessitate narrowing the scope of the replication (e.g., focusing only on social outcomes). The scope of the replication has already been narrowed relative to the original paper, which also examined emotion prediction accuracy at the level of a specific community (e.g., a college campus) and a specific other person (e.g., a close friend). This replication focuses on outcomes related to generic emotion prediction accuracy because of the difficulty of collecting data from subjects in the same community, or from dyads, which would be necessary in order to establish a “ground truth” for specific emotion prediction accuracy.

The repository for this replication can be found here: https://github.com/psych251/barrick2024/

Furthermore, the original paper can be found here: https://github.com/psych251/barrick2024/blob/main/original_paper/barrick2024.pdf

Methods

Power Analysis

I conducted power analyses for two key statistical tests: loneliness predicted by emotion prediction accuracy (i.e., does emotion prediction accuracy have downstream social consequences?), and emotion prediction accuracy predicted by emotion understanding (alexithymia; i.e., does poor understanding of one’s own emotions have downstream consequences for one’s ability to understand others’ emotions?).

Based on the results of these power analyses, I determined that I will be under-powered for the loneliness analysis. Therefore, I will only test whether emotion prediction accuracy is predicted by emotion understanding (alexithymia).
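
As a rough cross-check on the sample sizes reported below, the required N for a standardized slope can be sketched in base R with the noncentral t distribution. This is a minimal sketch, not the code that produced the output in this report. It assumes a single predictor, standardized variables, and R² equal to the squared standardized coefficient. The function names `power_std_slope` and `n_for_power` are hypothetical, and results may differ by a few participants from a dedicated power package.

```r
# Power of a two-sided t-test on the slope of a simple regression with
# standardized variables. Assumption: R^2 = beta^2 (single predictor).
power_std_slope <- function(n, beta, alpha = 0.05) {
  df <- n - 2                                     # residual df with one predictor
  ncp <- abs(beta) * sqrt(n) / sqrt(1 - beta^2)   # approximate noncentrality
  t_crit <- qt(1 - alpha / 2, df)
  # Probability of landing in either rejection region under the alternative
  pt(t_crit, df, ncp = ncp, lower.tail = FALSE) + pt(-t_crit, df, ncp = ncp)
}

# Smallest n whose power reaches the target (simple linear search).
n_for_power <- function(beta, target, alpha = 0.05, n_max = 1e5) {
  n <- 4
  while (n < n_max && power_std_slope(n, beta, alpha) < target) n <- n + 1
  n
}

targets <- c(0.80, 0.90, 0.95)
# Emotion Prediction Accuracy ~ Emotion Understanding (original beta = -.35)
sapply(targets, function(p) n_for_power(beta = -0.35, target = p))
# Loneliness ~ Emotion Prediction Accuracy (original beta = -.10)
sapply(targets, function(p) n_for_power(beta = -0.10, target = p))
```

Because the noncentrality parameter grows with the square root of N, shrinking the effect from −.35 to −.10 raises the required sample size by more than an order of magnitude, which is why only the first test is feasible here.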

Loneliness ~ Emotion Prediction Accuracy

Original effect (with standardized coefficients):

β = −.10, SE = 0.04, t(730) = −2.56, p = .01

Sample size for 80% power: 780

Sample size for 90% power: 1044

Sample size for 95% power: 1290

Warning: `r.squared` is possibly larger.
+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 780  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.200
  Statistical Power      = 0.8
Warning: `r.squared` is possibly larger.
+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 1044  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.100
  Statistical Power      = 0.9
Warning: `r.squared` is possibly larger.
+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 1290  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.050
  Statistical Power      = 0.95
[1] "N for 80% power = 780"
[1] "N for 90% power = 1044"
[1] "N for 95% power = 1290"

Emotion Prediction Accuracy ~ Emotion Understanding

Original effect (with standardized coefficients):

β = −.35, SE = 0.03, t(1010) = −12.44, p < .001

Sample size for 80% power: 58

Sample size for 90% power: 77

Sample size for 95% power: 94

+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 58  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.197
  Statistical Power      = 0.803
+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 77  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.098
  Statistical Power      = 0.902
+--------------------------------------------------+
|             SAMPLE SIZE CALCULATION              |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 94  <<
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.050
  Statistical Power      = 0.95
[1] "N for 80% power = 58"
[1] "N for 90% power = 77"
[1] "N for 95% power = 94"

Planned Sample

My planned sample size will be 100 participants. Given the discrepancy between the power analysis for Emotion Understanding (N=58 for 80% power and N=94 for 95% power) and the power analysis for Loneliness (N=780 for 80% power and N=1290 for 95% power), I will only be conducting the first analysis (Emotion Prediction Accuracy ~ Emotion Understanding) as my main confirmatory hypothesis test.

Materials

Emotion Transitions Task

Re-used materials from original study, including verbatim wording for task instructions and task blocks. Reproduced original analyses of task performance to obtain the final measure of Emotion Prediction Accuracy. Below is the description of this task from the Methods section of the original study:

“In the emotion transitions task (Thornton & Tamir, 2017), participants rated the likelihood that a person (generic other, community member, or specific person) would transition between two hypothetical mental states… On each trial of this task, participants were presented with two mental states connected by an arrow (e.g., happy → angry) and informed that the state to the left of the arrow is the person’s current state, and the mental state on the right side of the arrow is a mental state the person might experience next. Participants then rated the likelihood of that person making that transition on a continuous scale from 0% to 100%. Instructions did not include a specific time interval for the mental state transition (see Supplemental Methods for complete instructions). Participants rated all possible transitions between mental states, including transitions of a state back to itself and transitions in both directions between states… Emotions for Studies 1–4 (irritable, anxious, calm, happy, sad, full of thought, sluggish) were chosen from a previous study (Tamir et al., 2016)… General emotion prediction accuracy was calculated by correlating participant transition ratings with real-world emotion transition likelihoods between the states that were obtained from a previous experience sampling study (Thornton & Tamir, 2017; Trampe et al., 2015; Wilt et al., 2011; see Supplemental Methods for how ground truths were calculated)” (Barrick et al. 2024).

Emotion Understanding Scale

Used same scale as the original study. Below is the description of this scale from the Methods section of the original study:

“To measure participants’ understanding of their emotions, we administered the Toronto Alexithymia Scale, a 20-item self-report scale that measures difficulty describing feelings, difficulty identifying feelings, and externally oriented thinking (Bagby et al., 1994). Participants answer each item on a 5-point scale anchored at strongly disagree to strongly agree. Responses were summed to obtain a total score; higher scores indicate more difficulty understanding emotions” (Barrick et al., 2024).

Procedure

Participants and exclusions

The original study recruited participants from Amazon’s Mechanical Turk, whereas I will recruit 100 participants from Prolific. My replication will be conducted using Qualtrics, which is the same platform used in the original study. I will use the same or similar inclusion criteria as the original study. Firstly, I will recruit participants who have a minimum approval rating of 95% on Prolific, matching this inclusion criterion: “Study participation was restricted to workers in the United States with >95% approval ratings.” I will also restrict recruitment to participants who report being fluent in English on Prolific, which is conceptually similar to this exclusion criterion: “Participants… were excluded if they… indicated an English comprehension less than ‘Good’.” Finally, I will not enable participants to skip questions in my Qualtrics survey, which amounts to a stricter version of this exclusion criterion: “Participants… were excluded if they completed less than half of the questions” (Barrick et al. 2024).

Tasks and Scales

After providing informed consent, participants will first complete the Emotion Transitions Task. Next, they will complete the Emotion Understanding Scale. Finally, they will provide demographic information and receive a debrief.

Analysis Plan

Data cleaning

For the Emotion Transitions Task, Emotion Prediction Accuracy will be calculated using the same method as the original study:

“General emotion prediction accuracy was calculated by correlating participant transition ratings with real-world emotion transition likelihoods between the states that were obtained from a previous experience sampling study (Thornton & Tamir, 2017; Trampe et al., 2015; Wilt et al., 2011; see Supplemental Methods for how ground truths were calculated)” (Barrick et al. 2024).

The authors provide ground-truth emotion transition probabilities, which they computed using ecological momentary assessment data normalized by the overall frequency of different emotions. Thus, I will be able to compute Emotion Prediction Accuracy by correlating participants’ transition likelihood ratings with the same ground-truth transitions used in the original study.
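
The per-participant computation can be illustrated with a minimal base R sketch. The ground-truth values and ratings here are simulated placeholders rather than the authors’ data, and the column names (`from`, `to`, `rating`, `ground_truth`) are hypothetical. The sketch uses Pearson correlation (the default of `cor()`) as an assumption; the quoted description does not specify the correlation type, so the original analysis code should be consulted.

```r
set.seed(1)
emotions <- c("irritable", "anxious", "calm", "happy", "sad",
              "full of thought", "sluggish")

# All 7 x 7 = 49 ordered transitions, including self-transitions.
transitions <- expand.grid(from = emotions, to = emotions,
                           stringsAsFactors = FALSE)

# Placeholder for the ground-truth transition likelihoods (simulated).
transitions$ground_truth <- runif(nrow(transitions))

# Simulated 0-100 ratings from three hypothetical participants, long format.
ratings <- do.call(rbind, lapply(1:3, function(id) {
  noisy <- 100 * transitions$ground_truth + rnorm(nrow(transitions), sd = 30)
  data.frame(sub_id = id,
             transitions[, c("from", "to")],
             rating = pmin(100, pmax(0, noisy)))
}))

# Attach the ground truth to each rating, then correlate within participant.
merged <- merge(ratings, transitions, by = c("from", "to"))
accuracy <- sapply(split(merged, merged$sub_id),
                   function(d) cor(d$rating, d$ground_truth))
accuracy
```

Each participant ends up with one accuracy score between −1 and 1, which then serves as the dependent variable in the regression.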

For the Toronto Alexithymia Scale, scores will be obtained by summing item ratings (with reverse-coding where applicable), as described in the scale instructions.
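
A minimal sketch of this scoring step in base R follows, using simulated responses. The reverse-keyed items and sub-scale assignments below reflect my understanding of the standard TAS-20 key and are stated as an assumption; they should be verified against the scale instructions before use. The item column names (`tas_1` to `tas_20`) are hypothetical, and the sub-scale names are guesses based on the labels that appear in the results tables.

```r
set.seed(2)
# Simulated responses: 5 hypothetical participants x 20 items on a 1-5 scale.
tas <- as.data.frame(matrix(sample(1:5, 5 * 20, replace = TRUE), ncol = 20))
names(tas) <- paste0("tas_", 1:20)

# Assumed scoring key (verify against the scale documentation).
reverse_items <- c(4, 5, 10, 18, 19)
subscales <- list(
  tas_idfeel       = c(1, 3, 6, 7, 9, 13, 14),          # identifying feelings
  tas_describefeel = c(2, 4, 11, 12, 17),               # describing feelings
  tas_exorthink    = c(5, 8, 10, 15, 16, 18, 19, 20)    # externally oriented thinking
)

# Reverse-code on a 1-5 scale: 1 <-> 5, 2 <-> 4, 3 stays 3.
scored <- tas
scored[reverse_items] <- 6 - scored[reverse_items]

# Sum items within each sub-scale, then across all 20 items for the total.
scores <- as.data.frame(lapply(subscales,
                               function(items) rowSums(scored[items])))
scores$tas_total <- rowSums(scored)
scores
```

Higher totals indicate more difficulty understanding emotions, so a negative slope in the confirmatory model corresponds to the original finding.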

Emotion understanding: Key confirmatory test

This is my key confirmatory analysis, given that my power analysis indicates that the replication will be sufficiently powered to detect the original effect.

Analyses in the original study were as follows:

“To test the roles of internal and external information in emotion prediction, we fit separate linear mixed effects models for each source of information (typicality, emotion understanding, emotion perception) using a REML in R, using the lmer() function of the lme4 package. Initial models consisted of participants’ general emotion prediction accuracy scores as the dependent variable, a fixed effect of information source (typicality, emotion understanding, and emotion perception), a random intercept for study to account for study-level variation, and a random slope for each information source. The random slope term was removed to allow the model to converge, so only the random intercept for study was included as a random effect. p-values and standardized coefficients were obtained using the parameters package in R (Lüdecke et al., 2020)” (Barrick et al. 2024).

Because I will not be collecting multiple waves of data, I will not need to run a multilevel model accounting for study-level variation. Instead, I will use the lm() function to conduct a simple linear regression with emotion prediction accuracy as the dependent variable and emotion understanding as the predictor. I will use the same method as the original study to obtain standardized coefficients from this model.
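
To make the standardization step concrete, here is a minimal base R sketch on simulated data. It z-scores both variables by hand and refits the model, which is my understanding of what refit-based standardization amounts to; that equivalence is an assumption about the parameters package rather than a quotation of its behavior. The data frame `d` and its values are hypothetical.

```r
set.seed(3)
# Simulated data standing in for the replication sample (N = 100).
d <- data.frame(tas_total = round(rnorm(100, mean = 50, sd = 10)))
d$accuracy <- 0.4 - 0.002 * d$tas_total + rnorm(100, sd = 0.15)

# Z-score the outcome and the predictor, then refit the regression.
d$accuracy_z <- as.numeric(scale(d$accuracy))
d$tas_z <- as.numeric(scale(d$tas_total))
fit_std <- lm(accuracy_z ~ tas_z, data = d)
beta_std <- coef(fit_std)[["tas_z"]]

# With a single predictor, the standardized slope equals Pearson's r.
all.equal(beta_std, cor(d$accuracy, d$tas_total))
```

This equivalence is why the standardized coefficient from the simple regression can be compared directly with bivariate correlations.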

Differences from Original Study

I do not anticipate major differences from the original study due to the availability of all materials (i.e., task instructions, ground truth data for calculating Emotion Prediction Accuracy), the use of the same platform (Qualtrics) to administer the task, and the use of an online sample (although the original sample was recruited from Mechanical Turk, whereas I will be recruiting participants from Prolific).

Sample size for my replication (N=100) is based on a power analysis, so it will differ from the sample sizes in the original study across multiple waves of data collection. Sample sizes for each wave of data collection in the original study were based on power analyses for other effects that I am not replicating (e.g., a mediation analysis), so the original sample sizes were mostly larger than my planned replication sample size.

Because the original study collected data in multiple waves that each included a different set of measures, my replication study will be shorter in duration, and will involve fewer tasks and scales, since I am only collecting data relevant for my analyses of interest. Also, my analysis plan differs in that I will not need to conduct multilevel models with random intercepts for each study (i.e., each wave of data collection).

Methods Addendum (Post Data Collection)

Actual Sample

As pre-registered, my final sample size is N=100.

The mean age of participants was 44.7 years, with a standard deviation of 12.91. The sample consisted of 39 men, 58 women, 3 other, 1 prefer not to answer. 71 participants were white, 7 were Black, 7 were Hispanic or Latino/a, 2 were East Asian, 2 were Southeast Asian, 1 prefer not to answer, and 10 were mixed race.

Differences from pre-data collection methods plan

No differences to report.

Results

Data preparation

Import data

Note: run “anonymize.R” in the data/scripts folder first!

Compute emotion prediction accuracy

Pilot A
Pilot B
Full sample

Compute emotion understanding

Pilot A
Pilot B
Full sample

Visualize emotion transitions

Ground truth

Participants
Pilot A

Pilot B

Full sample

Visualizing only a random sub-sample of 15 subjects for readability.

Confirmatory analysis

Test

Pilot A

Testing analysis pipeline; pilot results should not be interpreted.

Parameter   | Coefficient |   SE |        95% CI |      t(2) |      p
---------------------------------------------------------------------
(Intercept) |   -9.02e-17 | 0.61 | [-2.63, 2.63] | -1.47e-16 | > .999
tas total   |       -0.01 | 0.71 | [-3.05, 3.03] |     -0.02 | 0.987 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
Pilot B

Testing analysis pipeline; pilot results should not be interpreted.

Parameter   | Coefficient | 95% CI | df
---------------------------------------
(Intercept) |   -1.46e-16 |        |  0
tas total   |       -1.00 |        |  0

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
Full sample

Here, we fail to replicate the original finding that Emotion Understanding Impairment (i.e., alexithymia) is associated with poorer Emotion Prediction Accuracy. Compared to the original effect (β = -.35, p < .001), we find no significant relationship (β = -.05, p = .653).

# Join accuracy scores to questionnaire data, then regress accuracy on TAS total
tas_acc_model_full_sample <- emo_predict_accuracy_full_sample %>%
  left_join(ques_demo_full_sample, by = "sub_id") %>%
  lm(accuracy ~ tas_total, data = .)

# Same method for standardizing coefficients as in original analyses
# (model_parameters() is from the parameters package)
model_parameters(tas_acc_model_full_sample, standardize = "refit")
Parameter   | Coefficient |   SE |        95% CI |     t(98) |      p
---------------------------------------------------------------------
(Intercept) |   -3.45e-17 | 0.10 | [-0.20, 0.20] | -3.44e-16 | > .999
tas total   |       -0.05 | 0.10 | [-0.25, 0.15] |     -0.45 | 0.653 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Visualize

Pilot A

Testing visualization pipeline; pilot results should not be interpreted.

Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Pilot B

Testing visualization pipeline; pilot results should not be interpreted.

Warning in stats::qt(ci, df = dof): NaNs produced
Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Full sample

In our final replication, we find a non-significant relationship (β = -.05, 95% CI = [-.25, .15], p = .653).

Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Original figure

The analysis I am replicating is the second panel in Figure 2A (the middle panel of this figure).

Original Figure

Exploratory analyses

TAS sub-scales

Here, we test whether Emotion Prediction Accuracy is linked to any of the three sub-scales measuring different facets of Emotion Understanding Impairment. If one sub-scale drives the relationship between these variables, while there is no relationship with the other sub-scales, it could be the case that its effect is masked when we use the total score for Emotion Understanding Impairment as a predictor in our main confirmatory test.

However, we do not see any relationships between Emotion Prediction Accuracy and any sub-scale (see results below).

Identifying feelings
Parameter   | Coefficient |   SE |        95% CI |     t(98) |      p
---------------------------------------------------------------------
(Intercept) |   -4.20e-17 | 0.10 | [-0.20, 0.20] | -4.20e-16 | > .999
tas idfeel  |       -0.09 | 0.10 | [-0.29, 0.11] |     -0.93 | 0.356 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Describing feelings
Parameter        | Coefficient |   SE |        95% CI |     t(98) |      p
--------------------------------------------------------------------------
(Intercept)      |   -1.63e-17 | 0.10 | [-0.20, 0.20] | -1.62e-16 | > .999
tas describefeel |        0.01 | 0.10 | [-0.19, 0.21] |      0.11 | 0.911 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Externally-Oriented Thinking
Parameter     | Coefficient |   SE |        95% CI |     t(98) |      p
-----------------------------------------------------------------------
(Intercept)   |   -1.10e-17 | 0.10 | [-0.20, 0.20] | -1.09e-16 | > .999
tas exorthink |       -0.02 | 0.10 | [-0.22, 0.18] |     -0.20 | 0.841 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
Data points may overlap. Use the `jitter` argument to add some amount of
  random variation to the location of data points and avoid overplotting.

Post-hoc power analysis of individual studies from original paper

My pre-registered power analysis was based on the multilevel model reported in the original paper, which combined data from across 5 studies (i.e., 5 waves of data collection), with an intercept for each study. In the original study, the authors visualized bivariate correlations between the predictor and outcome variables of interest for each of the 5 studies separately, but all statistical tests and reported results were based on multilevel models combining the studies.

Given the null results for my preregistered replication, I was curious whether there would be heterogeneity in power analyses conducted based on bivariate correlations for each individual study from the original paper. It could be the case that my replication sample would be considered underpowered based on some waves of data collection from the original paper, but sufficiently powered based on other waves of data collection.

Import data from original paper
Reproduce original figure

Because I am conducting a power analysis based on bivariate correlations between Emotion Understanding Impairment and Emotion Prediction Accuracy for each study in the original paper, I took this opportunity to reproduce the original figure with bivariate correlations from each study.

Compute bivariate correlations by study
Power analysis for each study

Compute N needed to reach 80%, 90%, and 95% power based on each individual study. Dashed line indicates N=100, which is the size of my replication sample.

My sample size of N=100 appears to be sufficiently powered based on Studies 2, 3, and 5, but underpowered based on Studies 1 and 4.

Power given N=100 based on each study

Compute power for a sample size of N=100 (i.e., the size of my replication sample) based on each individual study. Dashed lines indicate power of 80%, 90%, and 95%.

Again, based on Studies 1 and 4, my sample size of N=100 is below 80% power (69% in the case of Study 1, and 29% in the case of Study 4), but above 90% power based on Study 2 and above 95% power based on Studies 3 and 5.

N needed to reach significance based on replication data

[1] "Based on observed r, N = 3788 needed to reach 80% power"
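
For illustration, a figure of this magnitude can be approximated in base R with the Fisher z approximation for a correlation. This is a sketch under stated assumptions and not necessarily the method used to produce the number above: the observed r is recovered from the rounded t statistic and degrees of freedom in the confirmatory table (t = −0.45, df = 98), so the result will differ somewhat from a calculation based on the unrounded correlation. The function name `n_for_r` is hypothetical.

```r
# Recover the observed correlation from the rounded t statistic and df.
t_obs <- -0.45
df_obs <- 98
r_obs <- t_obs / sqrt(t_obs^2 + df_obs)

# Approximate N for a two-sided test of a correlation (Fisher z approximation).
n_for_r <- function(r, power = 0.80, alpha = 0.05) {
  z_r <- atanh(abs(r))                          # Fisher z transform of |r|
  z_sum <- qnorm(1 - alpha / 2) + qnorm(power)  # critical value + power quantile
  ceiling((z_sum / z_r)^2 + 3)
}

r_obs
n_for_r(r_obs)
```

Required N scales roughly with 1/r², so an effect this close to zero demands a sample in the thousands.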

Excluding fast responders

Participants responded to 72 items (7*7 = 49 emotion transitions; 20 items for the Toronto Alexithymia Scale; and 3 items for demographics). The survey was intended to take 10 minutes, based on Pilot A participants, whom I trust to have engaged with the survey in good faith. Pilot A completion times were 8.3 minutes (my time, likely the fastest due to familiarity with the survey), and 12.0, 13.5, and 15.2 minutes. However, in Pilot A, participants completed 2 additional surveys (measuring Loneliness and Perceived Social Support), which were removed from the final replication study and which comprised 32 additional items. When Pilot A participants’ times are adjusted to reflect the proportion of items that actually appeared on the final version of the study (i.e., multiplied by 72/(72+32) ≈ 0.69), the projected Pilot A response times for the shortened study are 5.7 minutes (my time), and 8.3, 9.3, and 10.5 minutes.

A completion time of 6 minutes (360 seconds) on the final survey, approximately the time it took me to take the study, with which I was already highly familiar, would reflect an average response time of 5 seconds per item. This is a conservative estimate, as it does not account for any additional time needed to read the consent form and instructions.

A completion time of 8 minutes (480 seconds) on the final survey — slightly faster than the fastest adjusted time of the Pilot A participants — would reflect an average response time of 6.7 seconds per item, which, again, doesn’t account for the consent form and instructions.

Based on suspicions of poor data quality from participants with unexpectedly fast response times, here I re-run the analysis while excluding participants with response times faster than 6 minutes and response times faster than 8 minutes.
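
The arithmetic behind these cutoffs, and the grouping of participants by completion time, can be sketched in base R as follows. The `duration_sec` values are hypothetical placeholders, not the replication data, and the treatment of participants who land exactly on a cutoff (kept rather than excluded) is an assumption.

```r
n_final <- 49 + 20 + 3     # items in the final survey (72)
n_removed <- 32            # Pilot A items dropped from the final survey
scale_factor <- n_final / (n_final + n_removed)

# Project Pilot A completion times (minutes) onto the shortened survey.
pilot_a_minutes <- c(8.3, 12.0, 13.5, 15.2)
projected_minutes <- round(pilot_a_minutes * scale_factor, 1)
projected_minutes

# Average seconds per item implied by each cutoff.
cutoffs_sec <- c(360, 480)
sec_per_item <- round(cutoffs_sec / n_final, 1)
sec_per_item

# Hypothetical completion times in seconds, grouped by the two cutoffs.
duration_sec <- c(250, 359, 360, 479, 480, 900)
speed <- cut(duration_sec,
             breaks = c(-Inf, 360, 480, Inf),
             right = FALSE,   # intervals are [lower, upper)
             labels = c("Faster than 6 minutes",
                        "Between 6 and 8 minutes",
                        "Slower than 8 minutes"))
table(speed)

# Logical filters for the two sensitivity analyses.
keep_6min <- duration_sec >= 360
keep_8min <- duration_sec >= 480
```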

Distribution of response times

# A tibble: 3 × 2
  Speed                   Count
  <fct>                   <int>
1 Faster than 6 minutes      18
2 Between 6 and 8 minutes    21
3 Slower than 8 minutes      61
Power analysis with 6+ minute (N=82) and 8+ minute (N=61) subsamples
Power based on original multilevel model

Based on the original multilevel model, a sample size of N=82 (i.e., the number of participants in the replication sample who took longer than 6 minutes) would have 92% power to detect the effect, and a sample size of N=61 (i.e., the number of participants in the replication sample who took longer than 8 minutes) would have 82% power to detect the original effect.

+--------------------------------------------------+
|                POWER CALCULATION                 |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 82
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.080
  Statistical Power      = 0.92  <<
+--------------------------------------------------+
|                POWER CALCULATION                 |
+--------------------------------------------------+

Linear Regression Coefficient (T-Test)

---------------------------------------------------
Hypotheses
---------------------------------------------------
  H0 (Null Claim) : beta - null.beta = 0 
  H1 (Alt. Claim) : beta - null.beta != 0 

---------------------------------------------------
Results
---------------------------------------------------
  Sample Size            = 61
  Type 1 Error (alpha)   = 0.050
  Type 2 Error           = 0.177
  Statistical Power      = 0.823  <<
Power based on each individual study

A sub-sample of participants from my replication sample who took at least 6 minutes to complete the study would be underpowered based on 2 of the 5 original studies. A sub-sample who took at least 8 minutes would be underpowered based on 3 of the 5 original studies.

Emotion understanding and emotion prediction accuracy by response time

Visual inspection of the relationship between Emotion Understanding and Emotion Prediction Accuracy reveals that, for participants who took at least 8 minutes to complete the study, a slightly more negative trend is apparent (as expected) compared to the full replication sample.

Tests
Excluding 6 minutes or faster

Analysis based on participants who took at least 6 minutes to complete the study does not reach significance.

Parameter   | Coefficient |   SE |        95% CI |    t(80) |      p
--------------------------------------------------------------------
(Intercept) |    1.04e-17 | 0.11 | [-0.22, 0.22] | 9.39e-17 | > .999
tas total   |       -0.09 | 0.11 | [-0.31, 0.13] |    -0.81 | 0.422 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Excluding 8 minutes or faster

Analysis based on participants who took at least 8 minutes to complete the study does not reach significance.

Note that this analysis is based on a substantially smaller sample size of N=61 (vs. the target sample size of N=100), and that the effect approaches significance (p = .102).

Furthermore, note that the magnitude of the standardized beta coefficient increases, in the direction of the original effect, as we exclude participants with short response times. In the original paper, β = −.35 from the multilevel model combining across all studies. In this replication, β = −.05 for the full sample (N=100); β = −.09 for the sub-sample who took at least 6 minutes (N=82), and β = −.21 for the sub-sample who took at least 8 minutes (N=61).

Parameter   | Coefficient |   SE |        95% CI |    t(59) |      p
--------------------------------------------------------------------
(Intercept) |    8.76e-17 | 0.13 | [-0.25, 0.25] | 6.95e-16 | > .999
tas total   |       -0.21 | 0.13 | [-0.47, 0.04] |    -1.66 | 0.102 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.
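Because each model in this section has a single standardized predictor, the reported standard error and t statistic follow directly from the standardized coefficient and the sample size. The sketch below is a sanity check on the tables, not the code used to produce them. The function name `slope_stats` is hypothetical, and the inputs are the rounded coefficients reported here, so the recomputed t values can differ from the tables in the second decimal place.

```python
from math import sqrt


def slope_stats(beta, n):
    """Standard error, t statistic, and df for a standardized slope.

    Assumes a simple regression (one predictor, both variables z-scored),
    where the slope equals the Pearson correlation.
    """
    df = n - 2
    se = sqrt((1 - beta**2) / df)
    t = beta / se
    return se, t, df


# Rounded coefficients as reported for the two response-time sub-samples
for label, beta, n in [("at least 6 min", -0.09, 82), ("at least 8 min", -0.21, 61)]:
    se, t, df = slope_stats(beta, n)
    print(f"{label}: SE = {se:.2f}, t({df}) = {t:.2f}")
```

The same relationship shows why the coefficient must grow considerably before reaching significance at N=61: the standard error stays near 0.13 regardless of which TAS score is used as the predictor.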

Excluding 8 minutes or faster, with each sub-scale as an outcome variable

Identifying feelings:

In this exploratory analysis, difficulty identifying feelings is significantly associated with Emotion Prediction Accuracy for participants who took at least 8 minutes to complete the study (β = -.28, 95% CI = [-.53, -.03], p = .031). However, this result is exploratory and should be interpreted with great caution; the significance test should not be used to evaluate our hypothesis.

Parameter   | Coefficient |   SE |         95% CI |    t(59) |      p
---------------------------------------------------------------------
(Intercept) |    1.36e-16 | 0.12 | [-0.25,  0.25] | 1.10e-15 | > .999
tas idfeel  |       -0.28 | 0.13 | [-0.53, -0.03] |    -2.20 | 0.031 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Describing feelings:

In this exploratory analysis, no significant relationship was found between difficulty describing feelings and Emotion Prediction Accuracy for participants who took at least 8 minutes to complete the study (β = -.07, 95% CI = [-.33, .19], p = .612).

Parameter        | Coefficient |   SE |        95% CI |    t(59) |      p
-------------------------------------------------------------------------
(Intercept)      |    1.05e-16 | 0.13 | [-0.26, 0.26] | 8.14e-16 | > .999
tas describefeel |       -0.07 | 0.13 | [-0.33, 0.19] |    -0.51 | 0.612 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Externally-oriented thinking:

In this exploratory analysis, no significant relationship was found between externally-oriented thinking and Emotion Prediction Accuracy for participants who took at least 8 minutes to complete the study (β = -.15, 95% CI = [-.41, .10], p = .238).

Parameter     | Coefficient |   SE |        95% CI |    t(59) |      p
----------------------------------------------------------------------
(Intercept)   |    5.94e-17 | 0.13 | [-0.26, 0.26] | 4.66e-16 | > .999
tas exorthink |       -0.15 | 0.13 | [-0.41, 0.10] |    -1.19 | 0.238 

Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
  using a Wald t-distribution approximation.

Discussion

Summary of Replication Attempt

The primary confirmatory analysis, which tested for a negative association between impaired emotion understanding (alexithymia) and emotion prediction accuracy, yielded a small, nonsignificant effect, β = -.05, 95% CI = [-0.25, 0.15], p = .653. Therefore, this replication attempt failed to replicate the original result.

Commentary

Follow-up exploratory analyses revolved around several questions: (1) do different subscales of the alexithymia scale show a relationship with emotion prediction accuracy? (2) Given that the original study conducted multiple waves of data collection, do power analyses differ across waves, and what implication does this have for our sample size? (3) Is there evidence of poor-quality data in the replication sample? And (4) is there suggestive evidence in favor of the effect of interest once we account for these issues?

Firstly, in the full replication sample, none of the alexithymia subscales had a significant relationship with emotion prediction accuracy. This renders unlikely the possibility that the general relationship between alexithymia and emotion prediction accuracy is driven by a particular subscale, such that its effects might be masked when analyzing the full scale.

Secondly, power analyses of the individual waves of data collection in the original study reveal substantial heterogeneity in predicted power, which was masked in my original power analysis, conducted on a mixed-effects model that combined all waves. Based on two of the five waves of data collection in the original study, the replication sample size of N=100 is underpowered. I therefore suspect that the original effect is “real” but smaller in magnitude than the effect reported in the original paper, with the reported estimate likely reflecting a combination of statistical luck and publication bias. If this is the case, then the true effect size may be more in line with the power analyses suggesting that my replication sample is underpowered.
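To make the link between a smaller true effect and an underpowered sample concrete, the power relationship can be inverted to give an approximate required sample size. This is a minimal sketch using a normal approximation for a two-sided test on a single standardized slope. The function name `approx_required_n` is hypothetical, and the effect sizes passed in are illustrative inputs, not estimates of the true effect.

```python
from math import ceil
from statistics import NormalDist


def approx_required_n(beta, power=0.80, alpha=0.05):
    """Approximate N needed to detect a standardized slope `beta`.

    Normal approximation for a two-sided test with one predictor;
    slightly underestimates the N an exact t-based calculation would give.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_power = std_normal.inv_cdf(power)
    # Solve |beta| * sqrt(n) / sqrt(1 - beta^2) = z_alpha + z_power for n
    n = (z_alpha + z_power) ** 2 * (1 - beta**2) / beta**2
    return ceil(n)


# Illustrative inputs: the original combined estimate and a smaller effect
for beta in (-0.35, -0.21):
    print(beta, approx_required_n(beta))
```

Under this approximation, an effect as large as the original β = −.35 needs fewer than 100 participants for 80% power. An effect near the −.21 observed in the 8-minute sub-sample would need well over 100.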

Thirdly, based on completion times, a substantial number of responses struck me as somewhat suspicious (N=39 under 8 minutes, faster than the fastest Pilot A time for a participant unfamiliar with the survey) or very suspicious (N=18 under 6 minutes, approximately how long it took me, as the researcher who assembled the survey and was therefore very familiar with it, to click through without reading any instructions or items). If these do indeed represent bad-faith responses, then my effective sample size would be smaller (i.e., even more underpowered), with additional noise potentially diluting any effects.

Fourthly, when participants who took less than 8 minutes are excluded in exploratory analyses, the test approaches but does not reach significance (p = .102), and the estimated effect is larger in magnitude and closer to the original study’s effect than in the full replication sample (β = −.21 for the exploratory analysis of a sub-sample of N=61, vs. β = −.35 in the original study and β = −.05 in the main confirmatory analysis). Furthermore, the exploratory analysis of one subscale (difficulty identifying feelings) does reach significance in this sub-sample, but this should be interpreted with extreme caution.

Ultimately, I would not interpret these exploratory analyses as evidence of replication per se. Rather, I believe they are consistent with my hypothesis that my sample was underpowered and diluted by poor-quality responses. I believe it is likely that the effect reported in the original study is replicable but smaller in magnitude than originally reported, which would leave my replication underpowered.