In this week’s forum threads for the Principles and Practice of Clinical Research course, one of the posts raises the question:

Why do we underestimate variance for sample size calculation?

The colleagues who posted before me all found this question hard to answer, and, reading their answers, I agreed: I couldn’t come up with verifiable reasons why this would happen systematically. The opening post claims:

An article by Livingston 2005 (1) shows that the SD used for the sample size calculation was smaller than the SD actually observed in the trial in 80% of the studied trials.

Well, if this were due to chance alone, one would expect the figure to be closer to 50%: half of the trials would observe an SD smaller than the one used in the calculation, and half would observe a larger one. I thought it might be a good idea to run a hypothesis test on these data to see whether the difference is statistically significant, but I couldn’t find the original claim in the article, so I let it be.
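For what it’s worth, the test itself would have been simple had the counts been available. Here is a purely hypothetical sketch, with invented numbers for illustration (say, 80 of 100 trials with an underestimated SD), using an exact binomial test against the 50% expected by chance:

# Hypothetical counts for illustration only; the article's actual numbers are unknown
binom.test(x = 80, n = 100, p = 0.5)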

Either way, the question remained, and I felt I lacked the mathematical background to answer it. I had a feeling, though, that because many trials use previous pilot studies to determine the values that go into the sample size calculation, two things might play a part here:

  1. As the sampling is done by convenience, a smaller sample size might also lead to a more homogeneous sample than the one that gets enrolled in the larger, subsequent trial.

  2. The sample standard deviation might be systematically lower than the population standard deviation, and, as the sample size increases, the sample SD approaches the population standard deviation, getting larger and larger (see the sketch below).

These are just gut feelings, and I have no compelling evidence that they are correct, but I thought I might be able to test them empirically with R.
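Point 2, at least, has a known theoretical basis: for normally distributed data, the sample SD is a biased estimator of the population SD even with Bessel’s correction, with E[s] = c4(n) · σ and c4(n) < 1. A minimal sketch of what that predicts for the σ = 100 I will use below (c4 is a helper function I am defining here for illustration):

# Known result for normal data: E[s] = c4(n) * sigma, with c4(n) < 1, so the
# sample SD underestimates sigma on average, even with Bessel's correction
c4 <- function(n) sqrt(2 / (n - 1)) * gamma(n / 2) / gamma((n - 1) / 2)
round(100 * c4(c(2, 3, 5, 10, 24)), 1) # expected sample SDs when sigma = 100

At n = 2 the expected sample SD is only about 80, climbing to roughly 99 by n = 24.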

The first step was to simulate a large population from a normal distribution with a predetermined SD.

library(tidyverse) # loading required libraries
set.seed(1234) # to ensure reproducibility

# One million draws from a normal distribution with mean 0 and SD 100
normal <- rnorm(1000000, mean = 0, sd = 100)
as.data.frame(normal) %>% ggplot(aes(x = normal)) + geom_density()

The next step is to repeatedly sample from this distribution and build a data frame that records each sample’s size and its standard deviation.

# Create an empty data frame
df <- data.frame(n = numeric(), sd = numeric())

# Loop through each sample size (n)
for (n in 2:24) {
        # Create 100 samples for each sample size
        for (i in 1:100) {
                # Generate a sample of size n from the "normal" variable
                sample_data <- sample(normal, n)

                # Calculate the standard deviation of the sample
                sample_sd <- sd(sample_data)

                # Append the sample size (n) and sample standard deviation (sd) to the data frame
                df <- rbind(df, data.frame(n = n, sd = sample_sd))
        }
}
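As an aside, the same data frame can be built without growing it with rbind inside a loop. A sketch of a more idiomatic tidyverse version (the individual draws will differ because the random numbers are consumed in a different order, but the logic is identical):

# One row per (sample size, repetition) pair, then one draw per row
df <- expand.grid(n = 2:24, rep = 1:100) %>%
        rowwise() %>%
        mutate(sd = sd(sample(normal, n))) %>%
        ungroup() %>%
        select(n, sd)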

First, we can plot the average standard deviation for each sample size. The horizontal line shows the population standard deviation.

df %>% group_by(n) %>%
        summarize(average.sd = mean(sd)) %>%
        ggplot(aes(x = n, y = average.sd)) +
        geom_point() +
        geom_hline(yintercept = 100)
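Since the c4 sketch above predicts exactly where these points should fall, we can also overlay the theoretical expectation 100 · c4(n) on the same plot (reusing the illustrative c4() helper defined earlier):

df %>% group_by(n) %>%
        summarize(average.sd = mean(sd)) %>%
        mutate(expected.sd = 100 * c4(n)) %>% # c4() from the sketch above
        ggplot(aes(x = n)) +
        geom_point(aes(y = average.sd)) +
        geom_line(aes(y = expected.sd)) +
        geom_hline(yintercept = 100)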

Next, we can test whether the average SD for a given sample size is statistically different from the population SD of 100.

library(knitr)

# Run a one-sample t-test for each sample size and collect the results
df2 <- data.frame(n = numeric(), p.value = numeric(), significant = logical())
for (i in 2:24) {
        # Does the mean sample SD at size i differ from the population SD of 100?
        ttest <- df %>% filter(n == i) %>% pull(sd) %>% t.test(mu = 100)
        sig <- ttest$p.value < 0.05
        df2 <- rbind(df2, data.frame(n = i, p.value = ttest$p.value, significant = sig))
}
kable(df2)
|  n |   p.value | significant |
|---:|----------:|:------------|
|  2 | 0.0833216 | FALSE       |
|  3 | 0.0132271 | TRUE        |
|  4 | 0.0112484 | TRUE        |
|  5 | 0.0125649 | TRUE        |
|  6 | 0.1247991 | FALSE       |
|  7 | 0.0666966 | FALSE       |
|  8 | 0.0074822 | TRUE        |
|  9 | 0.2523369 | FALSE       |
| 10 | 0.3442947 | FALSE       |
| 11 | 0.0678563 | FALSE       |
| 12 | 0.0563202 | FALSE       |
| 13 | 0.3115111 | FALSE       |
| 14 | 0.5915422 | FALSE       |
| 15 | 0.0013134 | TRUE        |
| 16 | 0.1019738 | FALSE       |
| 17 | 0.4076397 | FALSE       |
| 18 | 0.4318861 | FALSE       |
| 19 | 0.0541384 | FALSE       |
| 20 | 0.5566075 | FALSE       |
| 21 | 0.1187056 | FALSE       |
| 22 | 0.4587887 | FALSE       |
| 23 | 0.1820971 | FALSE       |
| 24 | 0.4685764 | FALSE       |

These results lead me to think that small sample sizes may indeed produce systematic underestimates of the true population SD; notably, most of the statistically significant differences in the table occur at the smaller sample sizes. Therefore, when researchers base their calculations on small pilot studies, they may be underestimating the true SD.
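A natural refinement, sketched below on the same simulated data: since the suspicion is specifically that the SD is underestimated, a one-sided test against mu = 100 asks the more pointed question.

# One-sided version for the smallest sample size: is the average sample SD below 100?
df %>% filter(n == 2) %>% pull(sd) %>% t.test(mu = 100, alternative = "less")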