It is often said that p values should be accompanied by effect sizes as a measure of the strength and practical importance of effects (Cumming, 2014; Sullivan & Feinn, 2012). However, even Jacob Cohen, who devised Cohen’s d, was fairly adamant that sample results are “always dependent upon the size of the sample” (Cohen, 1988, p. 6). The issue is that smaller samples produce far less reliable effect size estimates and typically lack power (Lakens, 2022). As such, this page will serve as a tutorial on how effect sizes can be spurious when one does not account for the size of the sample.
To simplify things greatly, this tutorial will look only at Cohen’s d for an independent samples t-test. The classic formula for Cohen’s d for this test is below:
\[ \begin{aligned} \text{Cohen's d} = \frac{\mu_1 - \mu_2}{\sigma} \end{aligned} \]
Or simplified:
\[ \begin{aligned} \text{Cohen's d} = \frac{\text{mean of sample 1} - \text{mean of sample 2}}{\text{pooled standard deviation}} \end{aligned} \]
This essentially means that when testing two samples, one subtracts the mean of one sample from the mean of the other, and then divides this difference by the pooled standard deviation of both samples. Here the pooling is done by simply averaging the two variances, as shown below:
\[ \begin{aligned} \text{pooled sd} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}} \end{aligned} \]
This formula presents two issues. First, the effect size is determined while ignoring group-level sample sizes. If we have, for example, a group with 20 participants and another group with 40 participants, this formula ignores that difference and weights both groups’ variability equally. Some alternatives, like Hedges’ g, account for this difference in sample sizes (Cumming, 2013). Second, and the topic of this tutorial, the effect size is undoubtedly affected by the total number of participants. As \(N\) becomes smaller, the effect size is more likely to be large because the \(SD\) estimates will be more erratic in smaller samples (Lakens, 2022).
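To make the formula concrete, here is a minimal by-hand sketch in R. The vectors g1 and g2 and the names pooled_sd and d_by_hand are illustrative placeholders, not part of this tutorial’s simulation. Note that effectsize’s cohens_d() (used later) pools with a df-weighted formula, which matches this simple average whenever the two groups are the same size.
g1 = c(52, 48, 55, 49, 51)   # two small made-up groups
g2 = c(47, 50, 46, 53, 44)
# Pool by averaging the two variances, as in the formula above
pooled_sd = sqrt((sd(g1)^2 + sd(g2)^2) / 2)
# Difference in means divided by the pooled SD
d_by_hand = (mean(g1) - mean(g2)) / pooled_sd
d_by_hand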
A visual depiction of what Cohen’s d captures can be seen below. These are two simulated samples, with each sample’s mean drawn as a red dashed line. The region between these lines represents the difference in means. The further apart the two lines are, the larger the effect. To word it another way, if the values of the groups differ by a lot, their “hills” here should be further apart, whereas if the hills overlap a lot, we could assume the effect between these groups is too small to be practically useful.
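If you want to reproduce a figure like this yourself, here is a rough sketch using ggplot2 (loaded as part of the tidyverse below). The means, SDs, and the object names viz_data and group_means are my own illustrative choices, not values from the original figure.
viz_data = data.frame(value = c(rnorm(1000, mean = 45, sd = 10),
                                rnorm(1000, mean = 55, sd = 10)),
                      group = rep(c("Group 1", "Group 2"), each = 1000))
# Sample mean of each group, for the dashed red lines
group_means = tapply(viz_data$value, viz_data$group, mean)
viz_data %>%
  ggplot(aes(x = value, fill = group)) +
  geom_density(alpha = .5) +                  # the two "hills"
  geom_vline(xintercept = group_means,
             linetype = "dashed", color = "red") +
  theme_bw()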
First, we will load two libraries within R. The first is the effectsize package, which provides easy-to-read effect size summaries for our samples of interest. The second collection of packages, called the tidyverse, will allow us to plot and summarise values from our simulation. To install these packages, use the code below.
install.packages("tidyverse")
install.packages("effectsize")
To load these packages, use the library function.
library(tidyverse)
library(effectsize)
Now we will set the criteria for our simulation. The small_n and large_n objects are the number of participants in each group. As each group will be drawn twice for a proper two-sample Cohen’s d, these yield total sample sizes of 10 and 100 respectively. We will also create a sample mean and standard deviation for each sample, with the SD made relatively high to simulate issues with variation in different sample sizes. Finally, we will run the simulation 1000 times with the n_boots object in our upcoming for loop.
n_boots = 1000
small_n = 5
large_n = 50
sample_mean = 50
sample_sd = 35
Next, we will create placeholders for the small sample Cohen’s d values and the large sample Cohen’s d values respectively. This is accomplished by filling the placeholders with NA values that will be replaced by the for loop later, with n_boots representing the 1000 simulations we set earlier.
small_cd = rep(NA, n_boots)
large_cd = rep(NA, n_boots)
Next, we will write the actual for loop, which takes our pre-defined arguments and runs them 1000 times. The opening for statement essentially says “from simulation 1 to 1000, do the following things.” On each pass, the loop creates two normally distributed small samples and stores their Cohen’s d value in small_cd. The same is then done with two large samples for large_cd.
for(i in 1:n_boots){
  # Draw two small samples from the same normal distribution
  small_response_1 = rnorm(n = small_n,
                           mean = sample_mean,
                           sd = sample_sd)
  small_response_2 = rnorm(n = small_n,
                           mean = sample_mean,
                           sd = sample_sd)
  # Store this iteration's small-sample effect size
  small_cd[i] = cohens_d(small_response_1,
                         small_response_2)$Cohens_d
  # Repeat with two large samples
  large_response_1 = rnorm(n = large_n,
                           mean = sample_mean,
                           sd = sample_sd)
  large_response_2 = rnorm(n = large_n,
                           mean = sample_mean,
                           sd = sample_sd)
  large_cd[i] = cohens_d(large_response_1,
                         large_response_2)$Cohens_d
}
Our final step before plotting is to store this data in a data frame called cohen_data. The Cohens_D variable is simply the simulated values we made for the small and large sample effect sizes. The Group variable takes a label (“Small” or “Large”), repeats it by the number of simulations (1000), and turns this into a factor that records whether a simulated effect size came from a 10-subject or 100-subject sample.
cohen_data = data.frame(Cohens_D = c(small_cd,
large_cd),
Group = factor(c(rep("Small",
length(small_cd)),
rep("Large",
length(large_cd)))))
A quick summary of the data will demonstrate that while the effect sizes have very similar means and medians, their respective standard deviations and IQRs are obviously different.
cohen_data %>%
group_by(Group) %>%
summarise(Mean = mean(Cohens_D),
Median = median(Cohens_D),
SD = sd(Cohens_D),
IQR = IQR(Cohens_D))
## # A tibble: 2 × 5
## Group Mean Median SD IQR
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Large -0.000349 -0.00280 0.203 0.278
## 2 Small -0.0339 -0.0268 0.732 0.913
Plotting from here will demonstrate why. We can use typical ggplot code here. To effectively demonstrate how the data are distributed, I have used violin plots; boxplots do a fairly poor job of showing exactly how dense the distributions are across the data range, so violin plots are visually more informative.
cohen_data %>%
ggplot(aes(x=Group,
y=Cohens_D))+
geom_violin(fill="lightblue",
alpha = .5)+
labs(y="Cohen's D",
title = "Variation in Cohen's D by Two Sample Sizes")+
theme_bw()
The plot may look slightly different from the one posted here, as I purposely didn’t set a random seed so you can test this over multiple iterations. However, your plot should look fairly similar each time you run the simulation. As you can probably see now, when using a “large” sample size of 100 participants, the average Cohen’s d value is very consistent. Most of the distribution is centered around zero and only deviates slightly above and below this.
The simulated Cohen’s d values for the small samples are far more problematic. There is still a fair amount of density around the zero mark, but the variation is so high that it ranges far past the minimum and maximum values of the large group. This means that a group with 10 participants can show wild fluctuations in effect size depending on which 10-person sample we happened to draw. The large samples, comparatively, are far more consistent.
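One quick way to see this numerically is to compare the middle 95% of the simulated values for each group. This snippet assumes small_cd and large_cd are still in your workspace from the loop above; the small-sample interval should be several times wider.
# Middle 95% of simulated effect sizes for each sample size
quantile(small_cd, probs = c(.025, .975))
quantile(large_cd, probs = c(.025, .975))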
You can probably already see an issue with this first simulation. Why set an arbitrary distinction of 5 per group (\(n=10\)) and 50 per group (\(n=100\))? Does this accurately reflect Cohen’s d for various sample sizes? The next simulation may show why I used these specific numbers. Instead of simulating two arbitrary sample sizes, we will now simulate 1000 samples, each with a size picked at random, and see what their respective Cohen’s d values look like. Effectively, we will pretend we ran 1000 experiments, each with a different sample size ranging from 5 to 500 subjects, and then calculate the effect size from each experiment.
We will make a slight adaptation to the previous code, but we will still set n_boots to the same 1000 simulation count. We will then set a minimum and maximum number of participants to draw each time, ranging from 5 participants to 500 participants with min_n and max_n. The sample mean and SD will be the same as before, as will the placeholder for Cohen’s d. We will additionally use a placeholder for the sample size in each iteration of the for loop.
n_boots = 1000
min_n = 5
max_n = 500
sample_mean = 50
sample_sd = 35
cohens_d = rep(NA, n_boots)
sample_size = rep(NA, n_boots)
I have tried to give labels that are as instructive as possible below, but this may require some more explanation. We first create this_sample_size as a variable that randomly picks a sample size from a uniform distribution. As highlighted before, this will range from 5 participants to 500 participants. Instead of creating a response for two separate groups this time, we just create two responses to compare within each drawn sample. Finally, we store the effect size of each sample draw in the cohens_d vector, and the size of the sample it was drawn from in sample_size.
for(i in 1:n_boots){
  # Pick a random sample size between min_n and max_n
  # (rounded so n is a whole number of participants)
  this_sample_size = round(runif(1, min = min_n, max = max_n))
  # Draw two responses of that size from the same distribution
  response_1 = rnorm(n = this_sample_size, mean = sample_mean, sd = sample_sd)
  response_2 = rnorm(n = this_sample_size, mean = sample_mean, sd = sample_sd)
  # Store the effect size and the sample size for this iteration
  cohens_d[i] = cohens_d(response_1, response_2)$Cohens_d
  sample_size[i] = this_sample_size
}
Now we just create a new data frame like we did before, using the two final variables cohens_d and sample_size.
new_data <- data.frame(cohens_d,
sample_size)
This time we will use a scatter plot. The x-axis will contain the randomly drawn sample size, while the y-axis will represent the Cohen’s d measure calculated from the two responses of each sample.
new_data %>%
ggplot(aes(y=cohens_d,
x=sample_size))+
geom_point(alpha = .5)+
theme_bw()+
labs(x="Sample Size",
y="Cohen's D",
title = "Simulated Cohen's D Variation by Sample Size")
Again, the plot may fluctuate a little based on how the for loop draws the data. However, you will almost always see a similar pattern: the far-left region, which represents small sample sizes, ranges considerably across the y-axis, while the Cohen’s d values bottleneck around 100 participants and stay fairly consistent thereafter.
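This funnel shape is not an accident of the simulation. A standard large-sample approximation for the variance of Cohen’s d (not derived in the original tutorial, but consistent with what the plot shows) is:
\[ \begin{aligned} \text{Var}(d) \approx \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)} \end{aligned} \]
With equal groups of size \(n\) and a true effect near zero, this is roughly \(2/n\), so the spread of d shrinks with the square root of the sample size, which is exactly the narrowing the scatter plot shows.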
The conclusion from this plot is probably clear. The small sample size groups are all over the place, and thus far less reliable than the larger samples. Even though we sometimes get fairly large effect sizes, there is no guarantee they can be reproduced. It’s likely that a replication of such a study would fail simply because the chances of getting the same effect size are incredibly slim.
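To put a rough number on this, we can bin the simulated experiments by sample size and ask how often each bin produced at least a “medium” effect (|d| > 0.5) purely by chance. The bin boundaries and the 0.5 cutoff here are my own illustrative choices.
new_data %>%
  mutate(size_bin = cut(sample_size, breaks = c(0, 100, 250, 500))) %>%
  group_by(size_bin) %>%
  summarise(prop_medium_or_larger = mean(abs(cohens_d) > 0.5))
Since both responses come from the same distribution, every one of these “medium” effects is pure noise, and you should see them almost exclusively in the smallest bin.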
Hopefully this simulation has been instructive about why p values or effect sizes alone aren’t very informative. One must consider all the factors involved in an experiment, but sample size is a very obvious indicator of the power of a test, and this simulation should at least provide insight as to why. Special thanks to Sal Mangiafico on Cross Validated for providing advice on the simulation.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis (1st ed.). Routledge. https://doi.org/10.4324/9780203807002
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Lakens, D. (2022). Sample size justification. Collabra: Psychology, 8(1), 33267. https://doi.org/10.1525/collabra.33267
Sullivan, G. M., & Feinn, R. (2012). Using effect size—Or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1