It is often said that p values should be accompanied by effect sizes as a measure of the strength and practical importance of effects (Cumming, 2014; Sullivan & Feinn, 2012). However, even Jacob Cohen, who devised Cohen’s d, was fairly adamant that sample results are “always dependent upon the size of the sample” (Cohen, 1988, p. 6). The issue is that smaller samples produce far less reliable effect size estimates and typically lack power (Lakens, 2022). As such, this page will serve as a tutorial on how effect sizes can be spurious when one does not account for the size of the sample.
To simplify things greatly, this tutorial will look only at Cohen’s d for an independent samples t-test. The classic formula for Cohen’s d for this test is below:
\[ \begin{aligned} \text{Cohen's d} = \frac{\mu_1 - \mu_2}{\sigma} \end{aligned} \]
Or simplified:
\[ \begin{aligned} \text{Cohen's d} = \frac{\text{mean of sample 1} - \text{mean of sample 2}}{\text{pooled standard deviation}} \end{aligned} \]
This essentially means that when testing two samples, one subtracts the mean of one sample from the mean of the other, and then divides this difference by the pooled standard deviation of both samples. Here the pooling is done by simply averaging the two variances, as shown below:
\[ \begin{aligned} \text{pooled sd} = \sqrt{\frac{SD_1^2 + SD_2^2}{2}} \end{aligned} \]
This formula presents two issues. First, the effect size is determined while ignoring group-level sample sizes. If we have, for example, a group with 20 participants and another group with 40 participants, this formula ignores that difference and weights both groups’ variability equally. Some alternatives, like Hedges’ g, account for this difference in sample sizes (Cumming, 2013). Second, and the topic of this tutorial, the effect size is undoubtedly affected by the total number of participants. As \(N\) becomes smaller, the effect size is more likely to be large because the \(SD\) estimates will be more erratic in smaller samples (Lakens, 2022).
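To make the formula concrete, here is a minimal by-hand sketch in R. The vectors g1 and g2 and the names pooled_sd and d_by_hand are illustrative placeholders, not part of this tutorial’s simulation. Note that effectsize’s cohens_d() (used later) pools with a df-weighted formula, which matches this simple average whenever the two groups are the same size.
g1 = c(52, 48, 55, 49, 51)   # two small made-up groups
g2 = c(47, 50, 46, 53, 44)
# Pool by averaging the two variances, as in the formula above
pooled_sd = sqrt((sd(g1)^2 + sd(g2)^2) / 2)
# Difference in means divided by the pooled SD
d_by_hand = (mean(g1) - mean(g2)) / pooled_sd
d_by_hand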
A visual depiction of what Cohen’s d captures can be seen below. These are two simulated samples, with each sample’s mean drawn as a red dashed line. The region between these lines represents the difference in means. The further apart the two lines are, the larger the effect. To word it another way, if the values of the groups differ by a lot, their “hills” here should be further apart, whereas if the hills overlap a lot, we could assume the effect between these groups is too small to be practically useful.
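If you want to reproduce a figure like this yourself, here is a rough sketch using ggplot2 (loaded as part of the tidyverse below). The means, SDs, and the object names viz_data and group_means are my own illustrative choices, not values from the original figure.
viz_data = data.frame(value = c(rnorm(1000, mean = 45, sd = 10),
                                rnorm(1000, mean = 55, sd = 10)),
                      group = rep(c("Group 1", "Group 2"), each = 1000))
# Sample mean of each group, for the dashed red lines
group_means = tapply(viz_data$value, viz_data$group, mean)
viz_data %>%
  ggplot(aes(x = value, fill = group)) +
  geom_density(alpha = .5) +                  # the two "hills"
  geom_vline(xintercept = group_means,
             linetype = "dashed", color = "red") +
  theme_bw()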
First, we will load two libraries within R. The first is the effectsize package, which provides easy-to-read effect size summaries for our samples of interest. The second collection of packages, called the tidyverse, will allow us to plot and summarise values from our simulation. To install these packages, use the code below.
install.packages("tidyverse")
install.packages("effectsize")
To load these packages, use the library function.
library(tidyverse)
library(effectsize)
Now we will set the criteria for our simulation. The small_n and large_n objects are the number of participants in each group. As each group will be drawn twice for a proper two-sample Cohen’s d, these yield total sample sizes of 10 and 100 respectively. We will also create a sample mean and standard deviation for each sample, with the SD made relatively high to simulate issues with variation in different sample sizes. Finally, we will run the simulation 1000 times with the n_boots object in our upcoming for loop.
n_boots = 1000
small_n = 5
large_n = 50
sample_mean = 50
sample_sd = 35
Next, we will create placeholders for the small sample Cohen’s d values and the large sample Cohen’s d values respectively. This is accomplished by filling the placeholders with NA values that will be replaced by the for loop later, with n_boots representing the 1000 simulations we set earlier.
small_cd = rep(NA, n_boots)
large_cd = rep(NA, n_boots)
Next, we will write the actual for loop, which takes our pre-defined arguments and runs them 1000 times. The opening for statement essentially says “from simulation 1 to 1000, do the following things.” On each pass, the loop creates two normally distributed small samples and stores their Cohen’s d value in small_cd. The same is then done with two large samples for large_cd.
for(i in 1:n_boots){
  # Draw two small samples from the same normal distribution
  small_response_1 = rnorm(n = small_n,
                           mean = sample_mean,
                           sd = sample_sd)
  small_response_2 = rnorm(n = small_n,
                           mean = sample_mean,
                           sd = sample_sd)
  # Store this iteration's small-sample effect size
  small_cd[i] = cohens_d(small_response_1,
                         small_response_2)$Cohens_d
  # Repeat with two large samples
  large_response_1 = rnorm(n = large_n,
                           mean = sample_mean,
                           sd = sample_sd)
  large_response_2 = rnorm(n = large_n,
                           mean = sample_mean,
                           sd = sample_sd)
  large_cd[i] = cohens_d(large_response_1,
                         large_response_2)$Cohens_d
}
Our final step before plotting is to store this data in a data frame called cohen_data. The Cohens_D variable is simply the simulated values we made for the small and large sample effect sizes. The Group variable takes a label (“Small” or “Large”), repeats it by the number of simulations (1000), and turns this into a factor that records whether a simulated effect size came from a 10-subject or 100-subject sample.
cohen_data = data.frame(Cohens_D = c(small_cd,
large_cd),
Group = factor(c(rep("Small",
length(small_cd)),
rep("Large",
length(large_cd)))))
A quick summary of the data will demonstrate that while the effect sizes have very similar means and medians, their respective standard deviations and IQRs are obviously different.
cohen_data %>%
group_by(Group) %>%
summarise(Mean = mean(Cohens_D),
Median = median(Cohens_D),
SD = sd(Cohens_D),
IQR = IQR(Cohens_D))
## # A tibble: 2 × 5
## Group Mean Median SD IQR
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Large -0.000349 -0.00280 0.203 0.278
## 2 Small -0.0339 -0.0268 0.732 0.913
Plotting from here will demonstrate why. We can use typical ggplot code here. To effectively demonstrate how the data are distributed, I have used violin plots; boxplots do a fairly poor job of showing exactly how dense the distributions are across the data range, so violin plots are visually more informative.
cohen_data %>%
ggplot(aes(x=Group,
y=Cohens_D))+
geom_violin(fill="lightblue",
alpha = .5)+
labs(y="Cohen's D",
title = "Variation in Cohen's D by Two Sample Sizes")+
theme_bw()
The plot may look slightly different from the one posted here, as I purposely didn’t set a random seed so you can test this over multiple iterations. However, your plot should look fairly similar each time you run the simulation. As you can probably see now, when using a “large” sample size of 100 participants, the average Cohen’s d value is very consistent. Most of the distribution is centered around zero and only deviates slightly above and below this.
The simulated Cohen’s d values for the small samples are far more problematic. There is still a fair amount of density around the zero mark, but the variation is so high that it ranges far past the minimum and maximum values of the large group. This means that a group with 10 participants can show wild fluctuations in effect size depending on which 10-person sample we happened to draw. The large samples, comparatively, are far more consistent.
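One quick way to see this numerically is to compare the middle 95% of the simulated values for each group. This snippet assumes small_cd and large_cd are still in your workspace from the loop above; the small-sample interval should be several times wider.
# Middle 95% of simulated effect sizes for each sample size
quantile(small_cd, probs = c(.025, .975))
quantile(large_cd, probs = c(.025, .975))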
You can probably already see an issue with this first simulation. Why set an arbitrary distinction of 5 per group (\(n=10\)) and 50 per group (\(n=100\))? Does this accurately reflect Cohen’s d for various sample sizes? The next simulation may show why I used these specific numbers. Instead of simulating two arbitrary sample sizes, we will now simulate 1000 samples, each with a size picked at random, and see what their respective Cohen’s d values look like. Effectively, we will pretend we ran 1000 experiments, each with a different sample size ranging from 5 to 500 subjects, and then calculate the effect size from each experiment.
We will make a slight adaptation to the previous code, but we will still set n_boots to the same 1000 simulation count. We will then set a minimum and maximum number of participants to draw each time, ranging from 5 participants to 500 participants with min_n and max_n. The sample mean and SD will be the same as before, as will the placeholder for Cohen’s d. We will additionally use a placeholder for the sample size in each iteration of the for loop.
n_boots = 1000
min_n = 5
max_n = 500
sample_mean = 50
sample_sd = 35
cohens_d = rep(NA, n_boots)
sample_size = rep(NA, n_boots)
I have tried to give labels that are as instructive as possible below, but this may require some more explanation. We first create this_sample_size as a variable that randomly picks a sample size from a uniform distribution. As highlighted before, this will range from 5 participants to 500 participants. Instead of creating a response for two separate groups this time, we just create two responses to compare within each drawn sample. Finally, we store the effect size of each sample draw in the cohens_d vector, and the size of the sample it was drawn from in sample_size.
for(i in 1:n_boots){
  # Pick a random sample size between min_n and max_n
  # (rounded so n is a whole number of participants)
  this_sample_size = round(runif(1, min = min_n, max = max_n))
  # Draw two responses of that size from the same distribution
  response_1 = rnorm(n = this_sample_size, mean = sample_mean, sd = sample_sd)
  response_2 = rnorm(n = this_sample_size, mean = sample_mean, sd = sample_sd)
  # Store the effect size and the sample size for this iteration
  cohens_d[i] = cohens_d(response_1, response_2)$Cohens_d
  sample_size[i] = this_sample_size
}
Now we just create a new data frame like we did before, using the two final variables cohens_d and sample_size.
new_data <- data.frame(cohens_d,
sample_size)
This time we will use a scatter plot. The x-axis will contain the randomly drawn sample size, while the y-axis will represent the Cohen’s d measure calculated from the two responses of each sample.
new_data %>%
ggplot(aes(y=cohens_d,
x=sample_size))+
geom_point(alpha = .5)+
theme_bw()+
labs(x="Sample Size",
y="Cohen's D",
title = "Simulated Cohen's D Variation by Sample Size")
Again, the plot may fluctuate a little based on how the for loop draws the data. However, you will almost always see a similar pattern: the far-left region, which represents small sample sizes, ranges considerably across the y-axis, while the Cohen’s d values bottleneck around 100 participants and stay fairly consistent thereafter.
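This funnel shape is not an accident of the simulation. A standard large-sample approximation for the variance of Cohen’s d (not derived in the original tutorial, but consistent with what the plot shows) is:
\[ \begin{aligned} \text{Var}(d) \approx \frac{n_1 + n_2}{n_1 n_2} + \frac{d^2}{2(n_1 + n_2)} \end{aligned} \]
With equal groups of size \(n\) and a true effect near zero, this is roughly \(2/n\), so the spread of d shrinks with the square root of the sample size, which is exactly the narrowing the scatter plot shows.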
The conclusion from this plot is probably clear. The small sample size groups are all over the place, and thus far less reliable than the larger samples. Even though we sometimes get fairly large effect sizes, there is no guarantee they can be reproduced. It’s likely that a replication of such a study would fail simply because the chances of getting the same effect size are incredibly slim.
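To put a rough number on this, we can bin the simulated experiments by sample size and ask how often each bin produced at least a “medium” effect (|d| > 0.5) purely by chance. The bin boundaries and the 0.5 cutoff here are my own illustrative choices.
new_data %>%
  mutate(size_bin = cut(sample_size, breaks = c(0, 100, 250, 500))) %>%
  group_by(size_bin) %>%
  summarise(prop_medium_or_larger = mean(abs(cohens_d) > 0.5))
Since both responses come from the same distribution, every one of these “medium” effects is pure noise, and you should see them almost exclusively in the smallest bin.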
Hopefully this simulation has been instructive about why p values or effect sizes alone aren’t very informative. One must consider all the factors involved in an experiment, but sample size is a very obvious indicator of the power of a test, and this simulation should at least provide insight as to why. Special thanks to Sal Mangiafico on Cross Validated for providing advice on the simulation.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum Associates.
Cumming, G. (2013). Understanding the new statistics: Effect sizes, confidence intervals, and meta-analysis (1st ed.). Routledge. https://doi.org/10.4324/9780203807002
Cumming, G. (2014). The new statistics: Why and how. Psychological Science, 25(1), 7–29. https://doi.org/10.1177/0956797613504966
Lakens, D. (2022). Sample size justification. Collabra: Psychology, 8(1), 33267. https://doi.org/10.1525/collabra.33267
Sullivan, G. M., & Feinn, R. (2012). Using effect size—Or why the p value is not enough. Journal of Graduate Medical Education, 4(3), 279–282. https://doi.org/10.4300/JGME-D-12-00156.1