Cohen’s d: Why Individual Responses Matter

Author

Timothy Achala

1 Background

Cohen’s d is a widely used standardised effect size metric defined as:

\[d = \frac{\bar{X}_{post} - \bar{X}_{pre}}{SD_{pooled}}\]

where $SD_{pooled} = \sqrt{\dfrac{SD_{pre}^2 + SD_{post}^2}{2}}$.

Cohen proposed rough benchmarks — d = 0.2 (small), 0.5 (medium), 0.8 (large) — as a practical guide, not rigid thresholds. The problem arises when these values are reported in isolation, stripping away everything that makes individual outcomes meaningful.

The Core Problem

Two studies can yield identical Cohen’s d values while one intervention consistently helps, and the other harms a substantial proportion of participants. The aggregate masks the individual.

2 Simulating Two Hypothetical Studies

We simulate n = 100 participants in each of two pre–post intervention studies. Both have the same expected mean change (5 units) but very different variability in individual outcomes.

Code

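# Packages used throughout (loaded in the document's hidden setup chunk)
library(tidyverse)
library(ggdist)
library(knitr)
library(kableExtra)

set.seed(42)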
n <- 100

# Helper: compute Cohen's d from a data frame with $pre and $post
cohens_d <- function(df) {
  sd_p <- sqrt((sd(df$pre)^2 + sd(df$post)^2) / 2)
  round((mean(df$post) - mean(df$pre)) / sd_p, 3)
}

# Study A: Consistent, low-variance improvement
study_a <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 2),
  study = "Study A — Low Variance"
)

# Study B: Same mean improvement, but wide spread (some participants harmed)
study_b <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 12),
  study = "Study B — High Variance"
)

studies <- bind_rows(study_a, study_b) |>
  mutate(
    change   = post - pre,
    outcome  = if_else(change < 0, "Harmed", "Benefited / No Change"),
    study_id = str_extract(study, "Study [AB]")
  )

# Print Cohen's d side by side
tibble(
  Study     = c("Study A", "Study B"),
  `Cohen's d` = c(cohens_d(study_a), cohens_d(study_b))
) |>
  kable(align = "c") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

Study	Cohen's d
Study A	0.459
Study B	0.435

Both studies produce nearly identical d values — yet their stories are very different, as the plots below reveal.
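The sample values can be compared with what the simulation design implies. The sketch below assumes, as in the simulation, that post = pre + change with the change drawn independently of pre, so that the post variance is the sum of the two variances. The helper `expected_d` is a hypothetical name introduced here for illustration.

```r
# Population-level Cohen's d implied by the simulation design:
# post = pre + change, with change independent of pre (an assumption
# that matches how study_a and study_b are generated).
expected_d <- function(mean_change, sd_pre, sd_change) {
  var_post  <- sd_pre^2 + sd_change^2           # variances add under independence
  sd_pooled <- sqrt((sd_pre^2 + var_post) / 2)  # same pooling as the formula in Background
  mean_change / sd_pooled
}

d_a <- expected_d(mean_change = 5, sd_pre = 10, sd_change = 2)
d_b <- expected_d(mean_change = 5, sd_pre = 10, sd_change = 12)
round(c(`Study A` = d_a, `Study B` = d_b), 3)
```

Under these assumptions the expected values are about 0.495 and 0.381. A sixfold increase in the SD of individual change moves d only modestly, because that SD enters the formula solely through the post-test SD. Sampling variation in one draw of 100 participants accounts for the remaining gap to the values in the table.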

3 Visualising the Evidence

3.1 Distribution of Individual Change Scores

The simplest diagnostic: plot the distribution of Δ = Post − Pre for each participant. The zero line marks the threshold between benefit and harm.

Code

ggplot(studies, aes(x = change, fill = outcome)) +
  geom_histogram(binwidth = 2.5, colour = "white", alpha = 0.88) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey25", linewidth = 0.9) +
  annotate("text", x = -1.2, y = Inf, label = "← Harmed",
           hjust = 1, vjust = 1.6, size = 3.5, colour = "grey35") +
  annotate("text", x =  1.2, y = Inf, label = "Benefited →",
           hjust = 0, vjust = 1.6, size = 3.5, colour = "grey35") +
  facet_wrap(~study, ncol = 2) +
  scale_fill_manual(
    values = c("Benefited / No Change" = "#27ae60",
               "Harmed"                = "#e74c3c")
  ) +
  labs(
    title    = "Distribution of Pre–Post Change Scores",
    subtitle = "Near-identical Cohen's d, radically different distributions",
    x        = "Change Score (Post − Pre)",
    y        = "Count",
    fill     = NULL,
    caption  = "Dashed line = zero change (no effect)"
  )

Histograms of individual change scores. Green = benefited or no change; red = harmed. Both distributions have a similar mean (about 5 units), but Study B's much wider spread places many participants below zero.

Interpretation: Study A's histogram is tightly packed around a small positive shift, and no participant's score fell. Study B's histogram is spread widely: about a third of participants (31 of 100) sit below zero, meaning their scores worsened after the intervention.

3.2 Half-Eye Plot: Full Distribution + Uncertainty

The ggdist half-eye combines a density slab with a point-interval, showing both the shape of the distribution and its key quantiles simultaneously.

Code

ggplot(studies, aes(x = study_id, y = change, fill = study_id)) +
  stat_halfeye(
    adjust       = 1.2,
    width        = 0.65,
    .width       = c(0.50, 0.95),
    point_colour = "black",
    point_size   = 2.5,
    slab_alpha   = 0.75
  ) +
  geom_hline(yintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.8) +
  scale_fill_manual(values = c("Study A" = "#2980b9",
                               "Study B" = "#e67e22")) +
  labs(
    title    = "Individual Change Scores: Half-Eye Plot",
    subtitle = "Dot = median · Thick bar = 50% interval · Thin bar = 95% interval",
    x        = NULL,
    y        = "Change Score (Post − Pre)",
    fill     = NULL,
    caption  = "Dashed line = zero change threshold"
  ) +
  theme(legend.position = "none")

Half-eye plots. The central dot = median; thick bar = central 50% of change scores; thin bar = central 95%. These are quantile intervals of the observed data, not credible or confidence intervals. Study B's wide slab exposes the heterogeneity hidden in its Cohen's d.

Interpretation: Study A’s slab is narrow and sits well above zero — consistent benefit. Study B’s slab straddles the zero line with a thick left tail. Even though the median is similar, the 95% interval extends deeply into negative territory, flagging real harm at the tails.
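By default `stat_halfeye()` summarises the plotted values with a median and quantile intervals (ggdist's `median_qi`), so the dot and bars can be reproduced with plain quantiles. The following is a minimal sketch in base R; `interval_summary` and `demo_change` are hypothetical names, and the demo data are freshly simulated with Study B's design values rather than taken from `studies`.

```r
# Hypothetical helper (not part of ggdist): the quantiles behind the dot and bars
interval_summary <- function(change) {
  q <- quantile(change, probs = c(0.025, 0.25, 0.5, 0.75, 0.975), names = FALSE)
  c(lower_95 = q[1], lower_50 = q[2], median = q[3],
    upper_50 = q[4], upper_95 = q[5])
}

# Illustrative change scores drawn with Study B's design values (mean 5, SD 12)
set.seed(1)
demo_change <- rnorm(100, mean = 5, sd = 12)
summary_b <- interval_summary(demo_change)
round(summary_b, 2)
```

Reading the lower 95% bound next to zero gives a quick numeric check of how far the distribution reaches into harm, without needing the plot.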

3.3 Spaghetti Plot: Individual Trajectories

Each line represents one participant’s journey from pre to post. This is the most direct way to see who improved and who deteriorated.

Code

studies |>
  pivot_longer(cols = c(pre, post),
               names_to  = "time",
               values_to = "score") |>
  mutate(time = factor(time, levels = c("pre", "post"),
                       labels = c("Pre", "Post"))) |>
  ggplot(aes(x = time, y = score,
             group = id, colour = outcome)) +
  geom_line(alpha = 0.35, linewidth = 0.55) +
  geom_point(alpha = 0.55, size = 0.9) +
  facet_wrap(~study) +
  scale_colour_manual(
    values = c("Benefited / No Change" = "#2980b9",
               "Harmed"                = "#c0392b")
  ) +
  labs(
    title    = "Individual Trajectories: Pre → Post",
    subtitle = "Red lines = participants who worsened after the intervention",
    x        = "Time Point",
    y        = "Score",
    colour   = NULL,
    caption  = "Each line = one participant (n = 100 per study)"
  )

Each line is one participant. Blue = benefited or no change; red = harmed (post < pre). The density of red crossing lines in Study B makes harm visible in a way a single number never could.

Interpretation: In Study A, lines are nearly parallel, with small upward shifts and no participant ending below baseline. In Study B, the many downward-sloping red lines make the harm immediately visible. No single summary statistic conveys this as clearly.

4 Summary Statistics

Code

studies |>
  group_by(Study = study_id) |>
  summarise(
    `Mean Change (SD)` = paste0(
      round(mean(change), 2), " (",
      round(sd(change), 2), ")"
    ),
    `Median Change`  = round(median(change), 2),
    `% Harmed`       = paste0(round(mean(outcome == "Harmed") * 100, 1), "%"),
    `n Harmed`       = sum(outcome == "Harmed"),
    `Cohen's d`      = cohens_d(pick(everything()))
  ) |>
  kable(align = "lccccr") |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width        = FALSE
  ) |>
  column_spec(5, bold = TRUE, color = "#c0392b") |>
  column_spec(6, bold = TRUE)

Aggregate summary confirms the paradox: near-identical Cohen's d, very different harm profiles.
Study	Mean Change (SD)	Median Change	% Harmed	n Harmed	Cohen's d
Study A	4.83 (1.81)	4.86	0%	0	0.459
Study B	5.4 (10.51)	4.45	31%	31	0.435
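The mean and SD of change are enough to anticipate the harm rate, which is why the SD deserves a place next to d. As a rough sketch, assuming change scores are approximately normal (true of this simulation, not guaranteed for real data), the share expected below zero follows from `pnorm()`. The helper name `expected_harmed` is hypothetical.

```r
# Share of change scores expected below zero if changes were roughly normal.
# Inputs are the mean and SD of change reported in the summary table.
expected_harmed <- function(mean_change, sd_change) {
  pnorm(0, mean = mean_change, sd = sd_change)
}

p_a <- expected_harmed(4.83, 1.81)
p_b <- expected_harmed(5.40, 10.51)
round(100 * c(`Study A` = p_a, `Study B` = p_b), 1)
```

This gives well under 1% for Study A and roughly 30% for Study B, in line with the observed 0% and 31%.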

5 Key Takeaways

What Cohen's d tells you	What Cohen's d misses
Standardised mean difference	Variability of individual responses
Group-level signal strength	Who is harmed vs. who benefits
Comparability across studies	Shape of the response distribution
Effect 'magnitude' label (small/medium/large)	Clinical / practical significance

Complement Effect Sizes With

Histograms / density plots of individual change scores
Spaghetti plots for repeated-measures data
Half-eye or violin plots to show the full response distribution
Proportion harmed, plus NNT / NNH (number needed to treat / number needed to harm) where a comparison group is available
SD of change scores — not just the mean
Bayesian hierarchical models when estimating individual-level variance is the goal
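A reported proportion harmed is more useful with its uncertainty attached. One possible sketch, using base R's `binom.test()` and the Study B count from the summary table (31 of 100), computes an exact Clopper-Pearson interval. The variable names are illustrative only.

```r
# Exact (Clopper-Pearson) 95% interval for the proportion harmed in Study B
harmed_b <- 31    # participants with change < 0, from the summary table
n_b      <- 100   # participants in the study

fit <- binom.test(harmed_b, n_b)
fit$estimate   # observed proportion harmed
fit$conf.int   # 95% confidence interval for that proportion
```

The exact interval is conservative; other interval methods would be equally defensible choices here.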

6 References

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
Wilkinson, L. & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals. American Psychologist, 54(8), 594–604.
