In this week’s forum threads for the Principles and Practice of Clinical Research course, one of the posts raises the question:
Why do we underestimate variance in sample size calculations?
The colleagues who posted before me all found this question hard to answer, and, looking at their answers, I agreed: I couldn’t come up with verifiable reasons why this would happen systematically. The opening post claims:
An article by Livingston 2005 (1) shows that the SD used for the sample size calculation was smaller than the actual SD found in the trial in 80% of the studied trials.
Well, if this were due to chance alone, one would expect something closer to 50%: half of the planned SDs smaller than the observed ones and half bigger. I thought it might be a good idea to run a hypothesis test on these data to see whether the difference is statistically significant, but I couldn’t find the original claim in the article, so I just let it be.
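For illustration, the test I had in mind would be an exact binomial test of the observed proportion against 50%. The counts below are hypothetical placeholders (I don’t know the actual number of trials behind the 80% figure), so treat this as a sketch rather than a reanalysis:

```r
# Hypothetical: suppose 80 of 100 trials used an SD smaller than the one observed.
# binom.test() checks whether 80/100 is compatible with a true proportion of 0.5.
binom.test(x = 80, n = 100, p = 0.5)
```

With these made-up counts the p-value would land far below 0.05, but the real conclusion depends on the actual denominator.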
Either way, the question remained, and I felt I lacked the mathematical background to answer it. I had a feeling, though, that two things might play a part here, since many trials use previous pilot studies to determine the values they plug into the sample size calculation:
1. As the sampling is done by convenience, a smaller sample might also be more homogeneous than the one that gets enrolled in the larger, subsequent trial.
2. The sample standard deviation might be systematically lower than the population standard deviation, with the sample SD approaching the population value (getting larger and larger) as the sample size increases; see the sketch after this list.
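The second hunch is, in fact, a known property of the sample SD: for normally distributed data, E[S] = c4(n) × σ, where c4(n) < 1 and approaches 1 as n grows. Here is a quick sketch of that correction factor in R (the c4() helper is my own addition, not something from the forum thread or the article):

```r
# Expected sample SD for normal data: E[S] = c4(n) * sigma, where
# c4(n) = sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
round(c4(c(2, 5, 10, 24)), 3)
#> [1] 0.798 0.940 0.973 0.989
```

So a sample of size 2 underestimates σ by about 20% on average, and the bias shrinks as n grows.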
These are just gut feelings, and I have no compelling evidence that this is what drives the pattern, but I thought I might be able to test it empirically with R.
The first step was to simulate a “population”: one million draws from a normal distribution with a predetermined SD.
library(tidyverse) # load required packages
set.seed(1234) # to ensure reproducibility
normal <- rnorm(1000000, mean = 0, sd = 100) # one million draws, mean 0, SD 100
as.data.frame(normal) %>% ggplot(aes(x = normal)) + geom_density() # visualize
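As a quick sanity check (my addition, not part of the original code), the simulated population should land very close to the target parameters:

```r
mean(normal) # should be close to 0
sd(normal)   # should be close to 100
```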
The next step is to repeatedly sample from this distribution and build a data frame that records each sample’s size and standard deviation.
# Create an empty data frame
df <- data.frame(n = numeric(), sd = numeric())

# Loop through each sample size (n)
for (n in 2:24) {
  # Draw 100 samples for each sample size
  for (i in 1:100) {
    # Generate a sample of size n from the "normal" variable
    sample_data <- sample(normal, n)
    # Calculate the standard deviation of the sample
    sample_sd <- sd(sample_data)
    # Append the sample size (n) and sample SD to the data frame
    df <- rbind(df, data.frame(n = n, sd = sample_sd))
  }
}
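Growing a data frame with rbind() inside a double loop works at this scale, but it copies the data frame on every iteration. An equivalent one-pass construction (my rewrite, not from the original post; the draws will differ unless the seed is reset) would be:

```r
# Same structure as df, built without repeated rbind() calls
df <- do.call(rbind, lapply(2:24, function(n) {
  data.frame(n = n, sd = replicate(100, sd(sample(normal, n))))
}))
```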
First, we can plot the average standard deviation for each sample size. The horizontal line marks the true population SD of 100.
df %>%
  group_by(n) %>%
  summarize(average.sd = mean(sd)) %>%
  ggplot(aes(x = n, y = average.sd)) +
  geom_point() +
  geom_hline(yintercept = 100)
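To tie the simulation back to the theory sketched earlier, we can overlay the expected curve E[S] = c4(n) × 100 (this assumes the c4() helper defined above, which is my addition):

```r
df %>%
  group_by(n) %>%
  summarize(average.sd = mean(sd)) %>%
  mutate(expected.sd = c4(n) * 100) %>% # theoretical expectation of the sample SD
  ggplot(aes(x = n, y = average.sd)) +
  geom_point() +
  geom_line(aes(y = expected.sd), colour = "blue") +
  geom_hline(yintercept = 100)
```

If the second hunch holds, the simulated averages should hug the blue curve.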
Next, we can test whether the average sample SD for a given sample size is statistically different from the population SD of 100, using a one-sample t-test for each n.
library(knitr)

df2 <- data.frame(n = numeric(), p.value = numeric(), significant = logical())

for (i in 2:24) {
  # One-sample t-test of the 100 sample SDs against the true SD of 100
  ttest <- df %>% filter(n == i) %>% pull(sd) %>% t.test(mu = 100)
  sig <- ttest$p.value < 0.05
  df2 <- rbind(df2, data.frame(n = i, p.value = ttest$p.value, significant = sig))
}

kable(df2)
| n | p.value | significant |
|---|---|---|
| 2 | 0.0833216 | FALSE |
| 3 | 0.0132271 | TRUE |
| 4 | 0.0112484 | TRUE |
| 5 | 0.0125649 | TRUE |
| 6 | 0.1247991 | FALSE |
| 7 | 0.0666966 | FALSE |
| 8 | 0.0074822 | TRUE |
| 9 | 0.2523369 | FALSE |
| 10 | 0.3442947 | FALSE |
| 11 | 0.0678563 | FALSE |
| 12 | 0.0563202 | FALSE |
| 13 | 0.3115111 | FALSE |
| 14 | 0.5915422 | FALSE |
| 15 | 0.0013134 | TRUE |
| 16 | 0.1019738 | FALSE |
| 17 | 0.4076397 | FALSE |
| 18 | 0.4318861 | FALSE |
| 19 | 0.0541384 | FALSE |
| 20 | 0.5566075 | FALSE |
| 21 | 0.1187056 | FALSE |
| 22 | 0.4587887 | FALSE |
| 23 | 0.1820971 | FALSE |
| 24 | 0.4685764 | FALSE |
These results suggest that low sample sizes tend to underestimate the true SD of the population; therefore, when researchers base their sample size calculations on small pilot studies, they may be plugging in an SD that is too low.
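As a final check connecting this back to the Livingston-style claim (my addition, not part of the original analysis), we can ask what fraction of the individual sample SDs fall below the true SD at each sample size:

```r
# Proportion of the 100 simulated sample SDs that underestimate sigma = 100
df %>%
  group_by(n) %>%
  summarize(prop.below = mean(sd < 100))
```

For small n, more than half of the simulated SDs should come out below 100, which is at least qualitatively consistent with the 80% figure quoted above.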