where \(SD_{pooled} = \sqrt{\dfrac{SD_{pre}^2 + SD_{post}^2}{2}}\).
Cohen proposed rough benchmarks (d = 0.2 small, 0.5 medium, 0.8 large) as a practical guide, not as rigid thresholds. The problem arises when d is reported in isolation: a standardised mean difference alone says nothing about how individual participants responded.
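As a quick check of the formula, the sketch below computes d by hand in base R for a tiny made-up data set. The five scores are illustrative only and are not taken from any study discussed here. Every participant gains exactly 5 points, so both SDs equal √10 and d = 5/√10 ≈ 1.58.

```r
# Illustrative scores only (not data from the studies discussed here)
pre  <- c(46, 48, 50, 52, 54)
post <- pre + 5                      # every participant gains exactly 5 points

# Pooled SD as defined above: square root of the mean of the two variances
sd_pooled <- sqrt((sd(pre)^2 + sd(post)^2) / 2)

# Cohen's d: mean difference expressed in pooled-SD units
d <- (mean(post) - mean(pre)) / sd_pooled
print(round(d, 3))
```

Note that d here is "large" by Cohen's benchmarks even though the only thing it measures is the mean shift relative to between-person spread.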
The Core Problem
Two studies can yield nearly identical Cohen’s d values even though one intervention helps almost everyone and the other harms a substantial proportion of participants. The aggregate masks the individual.
2 Simulating Two Hypothetical Studies
We simulate n = 100 participants in each of two pre–post intervention studies with the same expected mean change (5 units) but very different variability in individual outcomes.
```r
library(tidyverse)   # tibble, dplyr, stringr, tidyr, ggplot2
library(knitr)       # kable()
library(kableExtra)  # kable_styling()

set.seed(42)
n <- 100

# Helper: compute Cohen's d from a data frame with $pre and $post
cohens_d <- function(df) {
  sd_p <- sqrt((sd(df$pre)^2 + sd(df$post)^2) / 2)
  round((mean(df$post) - mean(df$pre)) / sd_p, 3)
}

# Study A: Consistent, low-variance improvement
study_a <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 2),
  study = "Study A — Low Variance"
)

# Study B: Same mean improvement, but wide spread (some participants harmed)
study_b <- tibble(
  id    = 1:n,
  pre   = rnorm(n, mean = 50, sd = 10),
  post  = pre + rnorm(n, mean = 5, sd = 12),
  study = "Study B — High Variance"
)

# Stack both studies and flag anyone whose score dropped as "Harmed"
studies <- bind_rows(study_a, study_b) |>
  mutate(
    change   = post - pre,
    outcome  = if_else(change < 0, "Harmed", "Benefited / No Change"),
    study_id = str_extract(study, "Study [AB]")
  )

# Print Cohen's d side by side
tibble(
  Study       = c("Study A", "Study B"),
  `Cohen's d` = c(cohens_d(study_a), cohens_d(study_b))
) |>
  kable(align = "c") |>
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))
```
| Study | Cohen's d |
|:-------:|:---------:|
| Study A | 0.459 |
| Study B | 0.435 |
Both studies produce nearly identical d values — yet their stories are very different, as the plots below reveal.
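Because the simulation parameters are known, the values each study is expected to produce can be worked out directly. The sketch below assumes the set-up described above: pre ~ N(50, 10), and a change score ~ N(5, σ) drawn independently of pre, with σ = 2 for Study A and σ = 12 for Study B. The function name `expected_values()` is introduced here for illustration only. Under these assumptions the population d values come out at roughly 0.50 and 0.38, so the closeness of the two sample values partly reflects sampling variation, while the expected share harmed is under 1% versus roughly 34%.

```r
# Population-level quantities implied by the simulation parameters.
# Assumes post = pre + change with change independent of pre,
# so Var(post) = Var(pre) + Var(change).
expected_values <- function(mean_change, sd_change, sd_pre = 10) {
  var_post  <- sd_pre^2 + sd_change^2
  sd_pooled <- sqrt((sd_pre^2 + var_post) / 2)
  c(
    d           = mean_change / sd_pooled,                        # population Cohen's d
    prop_harmed = pnorm(0, mean = mean_change, sd = sd_change)    # P(change < 0)
  )
}

study_a_expected <- expected_values(mean_change = 5, sd_change = 2)
study_b_expected <- expected_values(mean_change = 5, sd_change = 12)

print(round(rbind(`Study A` = study_a_expected,
                  `Study B` = study_b_expected), 3))
```

The contrast in `prop_harmed` is far larger than the contrast in `d`, which is the pattern the plots make visible.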
3 Visualising the Evidence
3.1 Distribution of Individual Change Scores
The simplest diagnostic: plot the distribution of Δ = Post − Pre for each participant. The zero line marks the threshold between benefit and harm.
```r
# Histogram of change scores, one panel per study, coloured by outcome
ggplot(studies, aes(x = change, fill = outcome)) +
  geom_histogram(binwidth = 2.5, colour = "white", alpha = 0.88) +
  geom_vline(xintercept = 0, linetype = "dashed",
             colour = "grey25", linewidth = 0.9) +
  annotate("text", x = -1.2, y = Inf, label = "← Harmed",
           hjust = 1, vjust = 1.6, size = 3.5, colour = "grey35") +
  annotate("text", x = 1.2, y = Inf, label = "Benefited →",
           hjust = 0, vjust = 1.6, size = 3.5, colour = "grey35") +
  facet_wrap(~study, ncol = 2) +
  scale_fill_manual(
    values = c("Benefited / No Change" = "#27ae60",
               "Harmed"                = "#e74c3c")
  ) +
  labs(
    title    = "Distribution of Pre–Post Change Scores",
    subtitle = "Identical Cohen's d, radically different distributions",
    x        = "Change Score (Post − Pre)",
    y        = "Count",
    fill     = NULL,
    caption  = "Dashed line = zero change (no effect)"
  )
```
Histograms of individual change scores. Green = benefited or no change; red = harmed. Both distributions have similar means (about 5 units), but Study B's much wider spread puts a substantial share of participants below zero.
Interpretation: Study A’s histogram is tightly packed around a small positive shift — nearly everyone benefited modestly. Study B’s histogram is spread widely; a sizeable left tail indicates participants who were made worse by the intervention.
3.2 Half-Eye Plot: Full Distribution + Uncertainty
The ggdist half-eye combines a density slab with a point-interval, showing both the shape of the distribution and its key quantiles simultaneously.
```r
library(ggdist)  # provides stat_halfeye()

# Density slab plus median and 50% / 95% quantile intervals per study
ggplot(studies, aes(x = study_id, y = change, fill = study_id)) +
  stat_halfeye(
    adjust       = 1.2,
    width        = 0.65,
    .width       = c(0.50, 0.95),
    point_colour = "black",
    point_size   = 2.5,
    slab_alpha   = 0.75
  ) +
  geom_hline(yintercept = 0, linetype = "dashed",
             colour = "grey30", linewidth = 0.8) +
  scale_fill_manual(values = c("Study A" = "#2980b9",
                               "Study B" = "#e67e22")) +
  labs(
    title    = "Individual Change Scores: Half-Eye Plot",
    subtitle = "Dot = median · Thick bar = 50% interval · Thin bar = 95% interval",
    x        = NULL,
    y        = "Change Score (Post − Pre)",
    fill     = NULL,
    caption  = "Dashed line = zero change threshold"
  ) +
  theme(legend.position = "none")
```
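Half-eye plots. The central dot is the median; the thick bar spans the central 50% of the observed change scores and the thin bar the central 95%. These are quantile intervals of the data, not credible or confidence intervals. Study B’s wide slab exposes the heterogeneity hidden in its Cohen’s d.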
Interpretation: Study A’s slab is narrow and sits well above zero — consistent benefit. Study B’s slab straddles the zero line with a thick left tail. Even though the median is similar, the 95% interval extends deeply into negative territory, flagging real harm at the tails.
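The point and bars of a half-eye can also be read off numerically. The sketch below assumes the bars are equal-tailed quantile intervals around the median, which we take to be the default `median_qi` summary used by `stat_halfeye()`. The helper name `median_qi_summary()` and the toy vector are hypothetical and serve only to make the arithmetic checkable.

```r
# Median plus equal-tailed quantile intervals for a vector of change scores.
# `widths` are the interval coverages (0.50 = central 50%, and so on).
median_qi_summary <- function(x, widths = c(0.50, 0.95)) {
  rows <- lapply(widths, function(w) {
    tail_prob <- (1 - w) / 2           # probability left in each tail
    data.frame(
      width  = w,
      median = median(x),
      lower  = unname(quantile(x, tail_prob)),
      upper  = unname(quantile(x, 1 - tail_prob))
    )
  })
  do.call(rbind, rows)
}

# Toy change scores: the integers from -50 to 50 (illustrative only)
toy_change <- seq(-50, 50)
qi <- median_qi_summary(toy_change)
print(qi)
```

Applied to each study's `change` column, a lower 95% bound below zero is the numerical counterpart of a thin bar that crosses the dashed line.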
3.3 Spaghetti Plot: Individual Trajectories
Each line represents one participant’s journey from pre to post. This is the most direct way to see who improved and who deteriorated.
```r
# Reshape to long format (one row per participant per time point),
# then draw one line per participant
studies |>
  pivot_longer(
    cols      = c(pre, post),
    names_to  = "time",
    values_to = "score"
  ) |>
  mutate(time = factor(time, levels = c("pre", "post"),
                       labels = c("Pre", "Post"))) |>
  ggplot(aes(x = time, y = score,
             group = id, colour = outcome)) +
  geom_line(alpha = 0.35, linewidth = 0.55) +
  geom_point(alpha = 0.55, size = 0.9) +
  facet_wrap(~study) +
  scale_colour_manual(
    values = c("Benefited / No Change" = "#2980b9",
               "Harmed"                = "#c0392b")
  ) +
  labs(
    title    = "Individual Trajectories: Pre → Post",
    subtitle = "Red lines = participants who worsened after the intervention",
    x        = "Time Point",
    y        = "Score",
    colour   = NULL,
    caption  = "Each line = one participant (n = 100 per study)"
  )
```
Each line is one participant. Blue = benefited or no change; red = harmed (post < pre). The density of red crossing lines in Study B makes harm visible in a way a single number never could.
Interpretation: In Study A, lines are nearly parallel — small upward shifts, almost no crossings below baseline. In Study B, the red lines crossing downward make harm viscerally obvious. No summary statistic can convey this as clearly.
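4 Summary Statistics
Aggregate summary confirms the paradox: nearly the same Cohen's d, very different harm profiles.

| Study | Mean Change (SD) | Median Change | % Harmed | n Harmed | Cohen's d |
|:--------|:----------------:|:-------------:|:--------:|:--------:|----------:|
| Study A | 4.83 (1.81) | 4.86 | 0% | 0 | 0.459 |
| Study B | 5.4 (10.51) | 4.45 | 31% | 31 | 0.435 |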
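The change-score columns of this table can be recomputed from raw pre and post scores with a few lines of base R. The helper `change_summary()` and the six scores below are illustrative assumptions, not the code or data behind the table; "harmed" follows the definition used throughout (post lower than pre).

```r
# Summarise individual change scores from paired pre/post vectors
change_summary <- function(pre, post) {
  change <- post - pre
  harmed <- change < 0                 # TRUE where the score dropped
  data.frame(
    mean_change   = mean(change),
    sd_change     = sd(change),
    median_change = median(change),
    n_harmed      = sum(harmed),
    pct_harmed    = 100 * mean(harmed)
  )
}

# Six made-up participants: four improve, two get worse
pre  <- c(50, 52, 47, 55, 49, 51)
post <- c(56, 49, 53, 61, 45, 58)

toy_summary <- change_summary(pre, post)
print(toy_summary)
```

In this toy example the mean change is positive even though a third of participants worsen, which is the same pattern Study B shows at larger scale.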
5 Key Takeaways
| What Cohen's d tells you | What Cohen's d misses |
|:--|:--|
| Standardised mean difference | Variability of individual responses |
| Group-level signal strength | Who is harmed vs. who benefits |
| Comparability across studies | Shape of the response distribution |
| Effect 'magnitude' label (small/medium/large) | Clinical / practical significance |
Complement Effect Sizes With
- Histograms / density plots of individual change scores
- Spaghetti plots for repeated-measures data
- Half-eye or violin plots to show the full response distribution
- Proportion harmed, plus NNT / NNH (number needed to treat / number needed to harm)
- SD of change scores, not just the mean
- Bayesian hierarchical models when estimating individual-level variance is the goal
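As one example from this list, a number needed to harm can be derived from the proportion harmed once a comparison rate is available. A single-arm pre–post design has no such rate, so the control proportion below is a purely hypothetical assumption, as is the helper name `number_needed_to_harm()`; only the 31% figure comes from Study B.

```r
# Number needed to harm: 1 / (excess risk of harm in the treated group).
# Returns Inf when the treated group shows no excess harm.
number_needed_to_harm <- function(risk_treated, risk_control) {
  risk_increase <- risk_treated - risk_control
  if (risk_increase <= 0) {
    return(Inf)
  }
  1 / risk_increase
}

# 0.31 = proportion harmed in Study B; 0.10 = HYPOTHETICAL control-group rate
nnh <- number_needed_to_harm(risk_treated = 0.31, risk_control = 0.10)

# Conventionally rounded up to a whole number of participants
print(ceiling(nnh))
```

Under that assumed control rate, roughly one additional participant would be harmed for every five treated, a statement that d on its own cannot support.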
6 References
Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
Wilkinson, L. & the Task Force on Statistical Inference. (1999). Statistical methods in psychology journals. American Psychologist, 54(8), 594–604.