Cohen’s d: Why Individual Responses Matter

Author

Timothy Achala

1 Background

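Cohen’s d is a widely used standardised effect size. For a pre–post design it can be computed as the mean change divided by the pooled standard deviation of the two time points: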

\[d = \frac{\bar{X}_{post} - \bar{X}_{pre}}{SD_{pooled}}\]

where \(SD_{pooled} = \sqrt{\dfrac{SD_{pre}^2 + SD_{post}^2}{2}}\).
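
As a quick check of the formula, the sketch below computes d from summary statistics alone. The helper name `d_from_summary` and the input values are illustrative assumptions, not figures from any real study.

```r
# Hypothetical helper: Cohen's d from pre/post means and SDs,
# using the pooled SD defined above.
d_from_summary <- function(mean_pre, mean_post, sd_pre, sd_post) {
  sd_pooled <- sqrt((sd_pre^2 + sd_post^2) / 2)
  (mean_post - mean_pre) / sd_pooled
}

# Illustrative values: a 5-point gain with SD = 10 at both time points
d_from_summary(mean_pre = 50, mean_post = 55, sd_pre = 10, sd_post = 10)
#> [1] 0.5
```

Note that nothing about individual change enters this calculation: only the two means and the two cross-sectional SDs do.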

Cohen proposed rough benchmarks (d = 0.2 small, 0.5 medium, 0.8 large) as a practical guide, not as rigid thresholds. The problem arises when d is reported in isolation: a single standardised mean difference says nothing about how individual outcomes are distributed around that mean.

The Core Problem

Two studies can yield near-identical Cohen’s d values even though one intervention helps almost everyone while the other harms a substantial proportion of participants. The aggregate masks the individual.


2 Simulating Two Hypothetical Studies

We simulate n = 100 participants in each of two pre–post intervention studies. Both have the same expected mean change (5 units) but very different variability in individual outcomes.

Code
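library(tidyverse)   # tibble, dplyr, tidyr, stringr, ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), column_spec()
library(ggdist)      # stat_halfeye()

set.seed(42)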
n <- 100

# Helper: compute Cohen's d from a data frame with $pre and $post
cohens_d <- function(df) {
  sd_p <- sqrt((sd(df$pre)^2 + sd(df$post)^2) / 2)
  round((mean(df$post) - mean(df$pre)) / sd_p, 3)
}

# Study A: Consistent, low-variance improvement
study_a <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 2),
  study = "Study A — Low Variance"
)

# Study B: Same mean improvement, but wide spread (some participants harmed)
study_b <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 12),
  study = "Study B — High Variance"
)

studies <- bind_rows(study_a, study_b) |>
  mutate(
    change   = post - pre,
    outcome  = if_else(change < 0, "Harmed", "Benefited / No Change"),
    study_id = str_extract(study, "Study [AB]")
  )

# Print Cohen's d side by side
tibble(
  Study     = c("Study A", "Study B"),
  `Cohen's d` = c(cohens_d(study_a), cohens_d(study_b))
) |>
  kable(align = "c") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

| Study | Cohen's d |
| Study A | 0.459 |
| Study B | 0.435 |

Both studies produce nearly identical d values — yet their stories are very different, as the plots below reveal.
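
One way to see why the two d values land so close together is to work out the population quantities implied by the simulation settings. The sketch below assumes, as in the simulation, that baseline scores have SD 10 and that each participant's change is drawn independently of baseline; the helper name `population_d` is hypothetical.

```r
# Population Cohen's d implied by: pre ~ N(50, sd_pre), change ~ N(mean_change, sd_change),
# with change independent of pre, so Var(post) = sd_pre^2 + sd_change^2.
population_d <- function(mean_change, sd_pre, sd_change) {
  sd_post   <- sqrt(sd_pre^2 + sd_change^2)
  sd_pooled <- sqrt((sd_pre^2 + sd_post^2) / 2)
  mean_change / sd_pooled
}

d_a <- population_d(mean_change = 5, sd_pre = 10, sd_change = 2)   # Study A settings
d_b <- population_d(mean_change = 5, sd_pre = 10, sd_change = 12)  # Study B settings
round(c(study_a = d_a, study_b = d_b), 2)
```

Under these assumptions the implied values are roughly 0.50 and 0.38. The between-person spread at baseline dominates the pooled SD, so a sixfold increase in the SD of individual change moves d only modestly.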


3 Visualising the Evidence

3.1 Distribution of Individual Change Scores

The simplest diagnostic: plot the distribution of Δ = Post − Pre for each participant. The zero line marks the threshold between benefit and harm.

Code
ggplot(studies, aes(x = change, fill = outcome)) +
  geom_histogram(binwidth = 2.5, colour = "white", alpha = 0.88) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey25", linewidth = 0.9) +
  annotate("text", x = -1.2, y = Inf, label = "← Harmed",
           hjust = 1, vjust = 1.6, size = 3.5, colour = "grey35") +
  annotate("text", x =  1.2, y = Inf, label = "Benefited →",
           hjust = 0, vjust = 1.6, size = 3.5, colour = "grey35") +
  facet_wrap(~study, ncol = 2) +
  scale_fill_manual(
    values = c("Benefited / No Change" = "#27ae60",
               "Harmed"                = "#e74c3c")
  ) +
  labs(
    title    = "Distribution of Pre–Post Change Scores",
    subtitle = "Near-identical Cohen's d, radically different distributions",
    x        = "Change Score (Post − Pre)",
    y        = "Count",
    fill     = NULL,
    caption  = "Dashed line = zero change (no effect)"
  )

Histograms of individual change scores. Green = benefited or no change; red = harmed. Both distributions have similar means (about 5 units), but in Study B a large share of participants fall below zero, meaning they got worse.

Interpretation: Study A’s histogram is tightly packed around a small positive shift — nearly everyone benefited modestly. Study B’s histogram is spread widely; a sizeable left tail indicates participants who were made worse by the intervention.
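
The share of participants below the zero line can also be anticipated from the simulation settings. Assuming change scores are normally distributed with mean 5 and SD 2 (Study A) or SD 12 (Study B), as in the simulation code, the expected proportion with a negative change is a normal tail probability:

```r
# Expected proportion with change < 0 under Normal(mean, sd) change scores
p_harmed_a <- pnorm(0, mean = 5, sd = 2)    # Study A settings
p_harmed_b <- pnorm(0, mean = 5, sd = 12)   # Study B settings

# Express as percentages
round(100 * c(study_a = p_harmed_a, study_b = p_harmed_b), 1)
```

This gives well under 1% for Study A and about a third for Study B. The gap comes entirely from the SD of the change scores, which Cohen's d as defined here does not use.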


3.2 Half-Eye Plot: Full Distribution + Uncertainty

The ggdist half-eye combines a density slab with a point-interval, showing both the shape of the distribution and its key quantiles simultaneously.

Code
ggplot(studies, aes(x = study_id, y = change, fill = study_id)) +
  stat_halfeye(
    adjust       = 1.2,
    width        = 0.65,
    .width       = c(0.50, 0.95),
    point_colour = "black",
    point_size   = 2.5,
    slab_alpha   = 0.75
  ) +
  geom_hline(yintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.8) +
  scale_fill_manual(values = c("Study A" = "#2980b9",
                               "Study B" = "#e67e22")) +
  labs(
    title    = "Individual Change Scores: Half-Eye Plot",
    subtitle = "Dot = median · Thick bar = central 50% · Thin bar = central 95% of scores",
    x        = NULL,
    y        = "Change Score (Post − Pre)",
    fill     = NULL,
    caption  = "Dashed line = zero change threshold"
  ) +
  theme(legend.position = "none")

Half-eye plots. The central dot = median; thick bar = central 50% of observed change scores; thin bar = central 95%. These are quantile intervals of the data, not confidence or credible intervals. Study B’s wide slab exposes the heterogeneity hidden in its Cohen’s d.

Interpretation: Study A’s slab is narrow and sits well above zero — consistent benefit. Study B’s slab straddles the zero line with a thick left tail. Even though the median is similar, the 95% interval extends deeply into negative territory, flagging real harm at the tails.
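
The point and bars of a half-eye plot can be reproduced numerically. The sketch below assumes ggdist's default summary for `stat_halfeye` (median with quantile intervals) and generates fresh change scores with the same mean and SD settings as the simulation. `change_a`, `change_b` and `interval_summary` are illustrative names, and the numbers are not guaranteed to match the figure exactly.

```r
set.seed(42)
n <- 100
change_a <- rnorm(n, mean = 5, sd = 2)    # Study A-style change scores
change_b <- rnorm(n, mean = 5, sd = 12)   # Study B-style change scores

# Median plus the bounds of the central 50% and 95% of observed scores
interval_summary <- function(x) {
  quantile(x, probs = c(0.025, 0.25, 0.5, 0.75, 0.975))
}

round(rbind(study_a = interval_summary(change_a),
            study_b = interval_summary(change_b)), 2)
```

Reading the 2.5% column against zero gives the same message as the thin bar in the plot: for Study B-style data the lower bound sits well below zero.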


3.3 Spaghetti Plot: Individual Trajectories

Each line represents one participant’s journey from pre to post. This is the most direct way to see who improved and who deteriorated.

Code
studies |>
  pivot_longer(cols = c(pre, post),
               names_to  = "time",
               values_to = "score") |>
  mutate(time = factor(time, levels = c("pre", "post"),
                       labels = c("Pre", "Post"))) |>
  ggplot(aes(x = time, y = score,
             group = id, colour = outcome)) +
  geom_line(alpha = 0.35, linewidth = 0.55) +
  geom_point(alpha = 0.55, size = 0.9) +
  facet_wrap(~study) +
  scale_colour_manual(
    values = c("Benefited / No Change" = "#2980b9",
               "Harmed"                = "#c0392b")
  ) +
  labs(
    title    = "Individual Trajectories: Pre → Post",
    subtitle = "Red lines = participants who worsened after the intervention",
    x        = "Time Point",
    y        = "Score",
    colour   = NULL,
    caption  = "Each line = one participant (n = 100 per study)"
  )

Each line is one participant. Blue = benefited or no change; red = harmed (post < pre). The density of red crossing lines in Study B makes harm visible in a way a single number never could.

Interpretation: In Study A, lines are nearly parallel — small upward shifts, almost no crossings below baseline. In Study B, the red lines crossing downward make harm viscerally obvious. No summary statistic can convey this as clearly.


4 Summary Statistics

Code
studies |>
  group_by(Study = study_id) |>
  summarise(
    `Mean Change (SD)` = paste0(
      round(mean(change), 2), " (",
      round(sd(change), 2), ")"
    ),
    `Median Change`  = round(median(change), 2),
    `% Harmed`       = paste0(round(mean(outcome == "Harmed") * 100, 1), "%"),
    `n Harmed`       = sum(outcome == "Harmed"),
    `Cohen's d`      = cohens_d(pick(everything()))
  ) |>
  kable(align = "lccccr") |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE
  ) |>
  column_spec(5, bold = TRUE, color = "#c0392b") |>
  column_spec(6, bold = TRUE)

Aggregate summary confirms the paradox: nearly the same Cohen’s d, very different harm profiles.

| Study | Mean Change (SD) | Median Change | % Harmed | n Harmed | Cohen's d |
| Study A | 4.83 (1.81) | 4.86 | 0% | 0 | 0.459 |
| Study B | 5.4 (10.51) | 4.45 | 31% | 31 | 0.435 |

5 Key Takeaways

| What Cohen's d tells you | What Cohen's d misses |
| Standardised mean difference | Variability of individual responses |
| Group-level signal strength | Who is harmed vs. who benefits |
| Comparability across studies | Shape of the response distribution |
| Effect 'magnitude' label (small/medium/large) | Clinical / practical significance |

Complement Effect Sizes With

  • Histograms / density plots of individual change scores
  • Spaghetti plots for repeated-measures data
  • Half-eye or violin plots to show the full response distribution
  • Proportion harmed, plus NNT (Number Needed to Treat) and NNH (Number Needed to Harm) where a comparison group is available
  • SD of change scores — not just the mean
  • Bayesian hierarchical models when estimating individual-level variance is the goal
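
To make the point about the SD of change scores concrete, one option is to standardise the mean change by the SD of the change scores themselves, often written d_z in the effect size literature (see Lakens, 2013). The sketch below is an illustration under that assumption, not a replacement for the d used above; the function name `d_z` and the toy vectors are hypothetical.

```r
# d_z: mean change divided by the SD of individual change scores
d_z <- function(pre, post) {
  change <- post - pre
  mean(change) / sd(change)
}

# Toy data: same mean change (+5), different spread of individual responses
pre         <- c(40, 45, 50, 55, 60)
post_steady <- pre + c(4, 5, 5, 5, 6)        # everyone gains about 5
post_mixed  <- pre + c(-10, -2, 5, 12, 20)   # mean gain 5, but two participants decline

c(steady = d_z(pre, post_steady), mixed = d_z(pre, post_mixed))
```

Unlike a d based on the pooled pre/post SD, d_z shrinks sharply when individual responses are heterogeneous, so reporting it alongside d is one way to flag the pattern seen in Study B.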

6 References

  • Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
  • Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
  • Wilkinson, L., & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54(8), 594–604.