Sample Size Estimation and Power Analysis

Mark Bounthavong

12/29/2021

Introduction

This tutorial is available on RPubs

Estimating the sample size for a prospective study is one of the first things that researchers do prior to enrolling patients. Doing this exercise helps the researcher to determine if they could feasibly enroll the necessary number of patients to answer the research question they developed. If a study’s estimated sample size was over a million patients, it would be a very daunting project to undertake. Knowing this information at the beginning of a study design helps to calibrate and adjust expectations to more reasonable levels.

In this tutorial, you will learn how to use R to:

  1. estimate sample size for a prospective study (e.g., randomized controlled trial)
  2. determine how much power you have to detect a difference if one existed (post hoc analysis)

In this course, we will use R to estimate sample size and perform power analysis. Alternatively, you can also download and use G*Power to perform these exercises.

If you are performing a prospective study where you need to enroll patients, estimating the optimal number of patients is necessary for your Statistical Analysis Plan.

If you are performing a retrospective study, you can determine how much power you have to detect a difference if one existed. This is not commonly done unless you have a non-significant result and want to determine whether there is a high potential for type II error due to a small sample size.

Sample size estimation

Before enrolling patients, you want to make sure that you estimate the optimal number of patients you’ll need for your study. It would not be efficient to enroll more patients than necessary, because doing so would be costly and potentially unethical.

Sample size estimations for two proportions

This is a common sample size calculation when you have two groups and the outcome is dichotomous (e.g., Yes / No)

Let’s assume that you were asked to estimate the number of patients that would be needed for an upcoming randomized controlled trial. The study’s primary aim is to estimate the efficacy of Treatment A versus Treatment B. The main end point is a dichotomous variable: “Response” or “No response.”

The null hypothesis is:

\(H_{0}\): There is no difference in the proportion of responders between Treatment A and Treatment B

The alternative hypothesis is:

\(H_{a}\): There is a difference in the proportion of responders between Treatment A and Treatment B

To determine the sample size needed to detect a meaningful difference in the proportion of responders between Treatment A and Treatment B, we need the following information:

The alpha level is the threshold at which you determine whether something is statistically significantly different. We normally use an alpha level of 0.05 (type I error). This is the convention for most biomedical sciences, but it can vary depending on the situation (e.g., multiple comparisons).

Effect size is the standardized difference one would expect to see between treatment groups. It is an estimate, usually based on past studies. Oftentimes, researchers have to make an educated guess about the effect size.

Power (1 - \(\beta\)) is the ability to correctly reject the null hypothesis if the actual effect in the population is equal to or greater than the specified effect size. (Recall that type II error is \(\beta\).) In other words, with adequate power, if you conclude that there is no difference in the sample, you can be reasonably confident that there is no true difference of at least the specified effect size in the population.

We’ll set the two-tailed alpha level to 0.05.

We can estimate the effect size \(h\) using the following equation:

\(h = \varphi_{1} - \varphi_{2}\),

where \(\varphi_{i} = 2 * arcsin(\sqrt{p_{i}})\)

where \(p_{i}\) denotes the proportion in treatment \(i\) that were classified as “responders.”

We generally set the power to 80%.

With these pieces of information, we can estimate the sample size for a study with two proportions as the outcome using the following formula:

With R you will not need to use this formula. However, if you decide to estimate sample size by hand, you need to make sure to use the correct \(Z_{\alpha/2}\) = 1.96 and \(Z_{(1-\beta)}\) = 0.84.

\(\begin{aligned} n_{i} = \frac{(Z_{\alpha/2} + Z_{1 - \beta})^2 * [p_{1}(1 - p_{1}) + p_{2}(1 - p_{2})]}{(p_{1} - p_{2})^2} \end{aligned}\)

where \(n_{i}\) is the sample size for one group.
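If you want to check this formula by hand in R, here is a minimal sketch. The helper n_two_prop() is written for this tutorial (it is not part of the pwr package), and it is evaluated with the response rates adopted later in this example (60% vs. 50%).

### By-hand sample size for two proportions (normal approximation)
n_two_prop <- function(p1, p2, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)   ### Z_alpha/2, approximately 1.96
  z_beta  <- qnorm(power)           ### Z_(1 - beta), approximately 0.84
  (z_alpha + z_beta)^2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p1 - p2)^2
}

n_two_prop(p1 = 0.60, p2 = 0.50)    ### approximately 385 per group

The result (about 385 per group) is close to, but not identical to, the estimate pwr.2p.test() returns below; the small difference comes from the arcsine transformation that pwr.2p.test() uses.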

Fortunately, we don’t have to do all this work. We can estimate the sample size using the R package pwr to do the heavy lifting.

Make sure you have the package installed by typing install.packages("pwr").

Once installed, make sure to load the package by typing library("pwr").

Use the pwr.2p.test() function to find out how many patients are needed. We’re going to need a few pieces of information. We already explained that you need the alpha level (alpha = 0.05), effect size \(h\), and power level (power = 80%).

We also need the \(p_{i}\) for Treatment A and Treatment B. This is something that we have to guess. We make this guess by looking at past studies and determining the typical response rate. Sometimes you can’t find this in the literature, so you have to make an educated guess. I usually start with 50% because this is a nice neutral point; it’s like a coin flip (50-50). I’ll assign Treatment B as the reference (e.g., control) with a 50% response rate. Now, I’ll “guess” that Treatment A is slightly better with a response rate of 60%. Therefore, I have:

\(p_{1} = 0.60\)
\(p_{2} = 0.50\)

pwr.2p.test makes estimating \(h\) easy. All you need to do is enter values for \(p_{1}\) and \(p_{2}\) into the ES.h(p1, p2) function.
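To see what ES.h() is doing under the hood, here is a minimal manual check of \(h\) using the arcsine transformation defined above; it should match the value returned by ES.h(p1 = 0.60, p2 = 0.50).

### Manual calculation of the effect size h (arcsine transformation)
h_manual <- 2 * asin(sqrt(0.60)) - 2 * asin(sqrt(0.50))
h_manual    ### approximately 0.201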

With these pieces of information, I can estimate the sample size required to detect a 10% difference (\(p_{1} - p_{2}\)) in “response” rate between Treatment A and Treatment B with an alpha of 5% and power of 80%.

### alpha = sig.level option and is equal to 0.05
### power = 0.80
### p1 = 0.60 
### p2 = 0.50

library(pwr)    ### load the pwr package (if not already loaded)

power1 <- pwr.2p.test(h = ES.h(p1 = 0.60, p2 = 0.50), sig.level = 0.05, power = .80)
power1
## 
##      Difference of proportion power calculation for binomial distribution (arcsine transformation) 
## 
##               h = 0.2013579
##               n = 387.1677
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: same sample sizes

The effect size h is 0.201, and the sample size n is 387.2, which we round up to 388. This n is for one group only; we assume a 1:1 allocation ratio. Hence, based on the parameters of our study, we need approximately 388 patients in Treatment A and 388 patients in Treatment B to detect a difference of 10% or greater in response rate with an alpha of 0.05 and power of 80%.
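As a quick sanity check, we can plug n = 388 back into pwr.2p.test() and solve for power by leaving power out of the call; this returns a power of approximately 80% (about 80.1%), consistent with the iteration results shown later.

### Post hoc check: power with n = 388 per group at p1 = 0.60 and p2 = 0.50
pwr.2p.test(h = ES.h(p1 = 0.60, p2 = 0.50), n = 388, sig.level = 0.05)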

Power analysis for two proportions

Power (1 - \(\beta\)) is the ability to correctly reject the null hypothesis if the actual effect in the population is equal to or greater than the specified effect size. In other words, with adequate power, if you conclude that there is no difference in the sample, you can be reasonably confident that there is no true difference of at least the specified effect size in the population.

We can also use the plot feature to see how the power level changes with varying sample sizes. As the sample size goes up, power increases. As the sample size goes down, power decreases. This is important to understand. As we increase our sample size, we reduce the uncertainty around the estimates. By reducing this uncertainty, we gain greater precision in our estimates, which results in greater confidence in our ability to avoid making a type II error.

### We can plot the power relative to different levels of the sample size. 
plot(power1)

Let’s change the \(p_{i}\) and see how the power level changes; we are fixing our sample size at 388 for each group with an alpha of 0.05. We create a sequence of values by varying the proportion of “responders” for Treatment A. We will change these from 50% to 100% in intervals of 5%.

p1 <- seq(0.5, 1.0, 0.05)
power1 <-pwr.2p.test(h = ES.h(p1 = p1, p2 = 0.50),
                     n = 388,
                     sig.level = 0.05)
powerchange <- data.frame(p1, power = power1$power * 100)
plot(powerchange$p1, 
     powerchange$power, 
     type = "b", 
     xlab = "Proportion of Responders in Treatment A", 
     ylab = "Power (%)")

We can also write a function for this:

Learning to write functions can make programming much easier. In this example, I wrote a function called iteration() that lets me pass in parameters and generate the power level at different values of p1.

iteration <- function(p_i, P_i, i_i, p2, n, alpha) {
              ### p_i, P_i, i_i: start, end, and step size for the sequence of p1 values
              ### p2: response rate in the reference group; n: sample size per group; alpha: significance level
              p1 <- seq(p_i, P_i, i_i)
              power1 <- pwr.2p.test(h = ES.h(p1 = p1, p2 = p2),
                                    n = n,
                                    sig.level = alpha)
              powerchange <- data.frame(p1, power = power1$power * 100)
              powerchange
}

iteration(0.5, 1.0, 0.05, 0.50, 388, 0.05)
##      p1     power
## 1  0.50   5.00000
## 2  0.55  28.65038
## 3  0.60  80.08415
## 4  0.65  98.88117
## 5  0.70  99.99190
## 6  0.75 100.00000
## 7  0.80 100.00000
## 8  0.85 100.00000
## 9  0.90 100.00000
## 10 0.95 100.00000
## 11 1.00 100.00000

Sample size estimations for two averages

This is another common sample size estimation for two groups when the outcome is a continuous variable. Now let’s estimate the sample size for a study where we are comparing the averages between two groups. Let’s suppose that we are working on a randomized controlled trial that seeks to evaluate the difference in the average change in hemoglobin A1c (HbA1c) from baseline between Treatment A and Treatment B.

The null hypothesis is:

\(H_{0}\): There is no difference in the average change in HbA1c from baseline between Treatment A and Treatment B

The alternative hypothesis is:

\(H_{a}\): There is a difference in the average change in HbA1c from baseline between Treatment A and Treatment B

To determine the sample size needed to detect a meaningful difference in the average HbA1c change from baseline between Treatment A and Treatment B, we can use the following formula:

With R you will not need to use this formula. However, if you decide to estimate sample size by hand, you need to make sure to use the correct \(Z_{\alpha/2}\) = 1.96 and \(Z_{(1-\beta)}\) = 0.84.

\(\begin{aligned} n_{i} = \frac{2 \sigma^2 * (Z_{\alpha/2} + Z_{1 - \beta})^2}{({\mu_{1} - \mu_{2}})^2} \end{aligned}\)

where \(n_{i}\) is the sample size for one of the groups,
\(\mu_{1}\) and \(\mu_{2}\) are the average changes in HbA1c from baseline for Treatment A and Treatment B, respectively, and \(\sigma\) is the pooled standard deviation.

We also need the following information:

We’ll set the two-tailed alpha level to 0.05.

There are other formulas that provide standardized measures, such as Hedges’ \(g\). However, it is not always clear what the standardized effect size means. The interpretation of effect size varies, but common convention recommends the following:
\(d\) = 0.2 (small effect)
\(d\) = 0.5 (medium effect)
\(d\) = 0.8 (large effect)

The effect size, also known as Cohen’s \(d\), is estimated using the following equation:

\(\begin{aligned} d = \frac{\mu_{1} - \mu_{2}}{\sigma_{pooled}} \end{aligned}\)

The pooled standard deviation is estimated using the following formula:

Sometimes it is not easy to figure out the pooled standard deviation (\(\sigma_{pooled}\)), so you may have to make an educated guess.

\(\begin{aligned} \sigma_{pooled} = \sqrt{\frac{sd_{1}^2 + sd_{2}^2}{2}} \end{aligned}\)

Once again, the R pwr package can make this task easy for us. However, we’ll need to estimate the Cohen’s \(d\).

Since we haven’t started the study, we have to make some assumptions about each treatment strategy’s change in HbA1c. You can do this by reviewing the literature to get a general sense of what is considered the average change in HbA1c and then making an educated guess. For our example, let’s assume that the expected average change in HbA1c from baseline for Treatment A was 1.5% with a standard deviation of 0.25%. Additionally, let’s assume that the expected average change in HbA1c from baseline for Treatment B was 1.0% with a standard deviation of 0.30%.

First, we’ll calculate the pooled standard deviation (\(\sigma_{pooled}\)):

sd1 <- 0.25
sd2 <- 0.30
sd_pooled <- sqrt((sd1^2 +sd2^2) / 2)
sd_pooled
## [1] 0.276134

Once we have the \(\sigma_{pooled}\), we can estimate the Cohen’s \(d\):

mu1 <- 1.5
mu2 <- 1.0
d <- (mu1 - mu2) / sd_pooled
d
## [1] 1.810715

The Cohen’s \(d\) is 1.81, which is considered a large effect size.

Recall that the average change in HbA1c from baseline for Treatment A was 1.5% and 1.0% for Treatment B.

Now, we can take advantage of the pwr package and estimate the sample size needed to detect a difference of 0.5% (1.5% - 1.0%) in the average HbA1c change from baseline between Treatment A and Treatment B with 80% power and a significance threshold of 5%.

We will use the pwr.t.test() function to estimate the sample size for a study where there are two groups and the outcome is a continuous variable.

### d = Cohen's d
### power = 0.80
### alpha = 0.05

n_i <- pwr.t.test(d = d, power = 0.80, sig.level = 0.05)
n_i
## 
##      Two-sample t test power calculation 
## 
##               n = 5.921286
##               d = 1.810715
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

Recall that the average change in HbA1c from baseline for Treatment A was 1.5% and 1.0% for Treatment B. Hence, the difference is 0.5%. Based on our study parameters, we need 6 patients in each group to detect a difference of 0.5% or greater with 80% power and a significance threshold of 5%.
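As a rough cross-check, we can compute the normal-approximation formula from earlier by hand (a minimal sketch using the mu1, mu2, and sd_pooled values defined above). It gives roughly 4.8, or about 5 per group, slightly fewer than pwr.t.test() because the t distribution requires a few more observations at very small sample sizes.

### By-hand check with the normal-approximation formula
z_alpha <- qnorm(1 - 0.05 / 2)   ### approximately 1.96
z_beta  <- qnorm(0.80)           ### approximately 0.84
2 * sd_pooled^2 * (z_alpha + z_beta)^2 / (mu1 - mu2)^2   ### approximately 4.8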

Power analysis for two averages

Power (1 - \(\beta\)) is the ability to correctly reject the null hypothesis if the actual effect in the population is equal to or greater than the specified effect size. In other words, with adequate power, if you conclude that there is no difference in the sample, you can be reasonably confident that there is no true difference of at least the specified effect size in the population.

We can plot how the power will change as the sample size changes. As the sample size increases, power increases. This should make sense. Like our previous example with two proportions, as we increase our sample size, we reduce the uncertainty around the estimates. By reducing this uncertainty, we gain greater precision in our estimates, which results in greater confidence in our ability to avoid making a type II error.

### We can plot the power relative to different levels of the sample size. 
n <- seq(1, 10, 1)
nchange <- pwr.t.test(d = d, n = n, sig.level = 0.05)
## Warning in qt(sig.level/tside, nu, lower = FALSE): NaNs produced
nchange.df <- data.frame(n, power = nchange$power * 100)
nchange.df
##     n    power
## 1   1      NaN
## 2   2 19.03307
## 3   3 39.61785
## 4   4 57.33850
## 5   5 70.87945
## 6   6 80.64997
## 7   7 87.42531
## 8   8 91.98145
## 9   9 94.96979
## 10 10 96.88938
plot(nchange.df$n, 
     nchange.df$power, 
     type = "b", 
     xlab = "Sample size, n", 
     ylab = "Power (%)")

We don’t use the plot(n_i) because it doesn’t do a good job of plotting the power when the sample size is small.

As the sample size increases, our power increases. This makes sense because we have more patients to detect differences that may be smaller. But we fixed our effect size (Cohen’s \(d\)), so as we increase the sample size, our power to detect that difference ultimately increases.

Let’s change the \(\mu_{i}\) and see how the power level changes; we are fixing our sample size at 6 for each group with an alpha of 0.05.

Estimating the power is easy. You type the same command, but the only difference is that you enter a value for the sample size for one group (n) and make power = NULL by removing it from the options. Here is an example:
pwr.t.test(d = d, n = 6, sig.level = 0.05)

We create a sequence of values by varying the average change in HbA1c from baseline for Treatment A. We will change these from 0% to 2% in intervals of 0.1%.

mu1 <- seq(0.0, 2.0, 0.1)

d <- (mu1 - mu2) / sd_pooled

power1 <- pwr.t.test(d = d, n = 6, sig.level = 0.05)
powerchange <- data.frame(d, power = power1$power * 100)
powerchange
##            d     power
## 1  -3.621430 99.986261
## 2  -3.259287 99.898978
## 3  -2.897144 99.437136
## 4  -2.535001 97.615372
## 5  -2.172858 92.271971
## 6  -1.810715 80.649971
## 7  -1.448572 61.960800
## 8  -1.086429 39.817661
## 9  -0.724286 20.610368
## 10 -0.362143  8.785855
## 11  0.000000  5.000000
## 12  0.362143  8.785855
## 13  0.724286 20.610368
## 14  1.086429 39.817661
## 15  1.448572 61.960800
## 16  1.810715 80.649971
## 17  2.172858 92.271971
## 18  2.535001 97.615372
## 19  2.897144 99.437136
## 20  3.259287 99.898978
## 21  3.621430 99.986261
plot(powerchange$d, 
     powerchange$power, 
     type = "b", 
     xlab = "Cohen's d", 
     ylab = "Power (%)")

This figure shows how the power changes with Cohen’s \(d\). It has a symmetrical pattern because Cohen’s \(d\) can take both negative and positive values. But the story is the same: as the absolute effect size increases (the sign does not matter; we only care about the magnitude), the power increases. This makes sense because we only have enough power to detect large differences with the current sample size (which is fixed in this case). If the differences are small, then we do not have enough power with the current sample size of 6.

Sample size estimation for paired data (before and after)

So far, we discussed how to perform sample size estimations for “between-groups” comparisons. However, many studies investigate “within-group” changes. These are paired data, which means that one observation is dependent on another observation. A common study design where paired data are collected is the longitudinal study. These types of studies involve repeated measures. For example, a pretest-posttest study design will measure a data point for a patient at baseline and then repeat that measurement at another point in time. In the figure, Patient A has two measurements at \(t_{0}\) and \(t_{f}\). Since these measurements were made in the same person, the change is “within” the patient. Alternatively, we can think of this as a “repeated” measure since the patient had the measurement performed twice.

Figure caption: Pretest-Posttest (repeated measures) framework

When you have a study where you are performing a paired t test, you can use the same pwr.t.test() function for a two sample test from the pwr package.

Let’s assume that we want to conduct a prospective study to measure the weight change of a cohort of patients who started a diet. We want to enroll enough patients to detect a 5 lb reduction in weight 3 weeks after the diet started. Let’s assume that at baseline the expected average weight for the cohort was 130 lbs with a standard deviation of 11. After 3 weeks on the diet, the expected average weight was 125 lbs with a standard deviation of 12.

We can estimate the effect size (Cohen’s \(d_{z}\)) for a paired t test.

\(\rho\) can range between 0 and 1 where 0 means no correlation and 1 means perfect correlation. For sample size estimations, we can start at 0.5 if we’re uncertain of the actual correlation.

\(\begin{aligned} d_{z} = \frac{ | \mu_{z} |}{\sigma_{z}} = \frac{| \mu_{x} - \mu_{y} | }{ \sqrt{\sigma_{x}^{2} + \sigma_{y}^{2} - 2 \rho_{x, y} \sigma_{x} \sigma_{y}}} \end{aligned}\)

\(x\) denotes “before” (or baseline)
\(y\) denotes “after”
\(d_{z}\) denotes the Cohen’s \(d\) for paired analysis
\(\rho\) denotes the correlation between the measures before and after the diet. (For simplicity, I use 0.50 if I don’t have prior information about this correlation.)

### Parameters for paired analysis or a pretest-posttest study design

mu_x <- 130     ### Average weight before the diet (baseline)
mu_y <- 125     ### Average weight after the diet

sd_x <- 11      ### Standard deviation before the diet
sd_y <- 12      ### Standard deviation after the diet

rho <- 0.5      ### Correlation between measures before and after the diet

sd_z <- sqrt(sd_x^2 + sd_y^2 - 2*rho*sd_x*sd_y)
  
d_z <- abs(mu_x - mu_y) / sd_z
d_z
## [1] 0.433555

The Cohen’s \(d_{z}\) is 0.433. We can input this into the pwr.t.test() function.

n.paired <- pwr.t.test(d = d_z, power = 0.80, sig.level = 0.05, type = "paired")
n.paired
## 
##      Paired t test power calculation 
## 
##               n = 43.71557
##               d = 0.433555
##       sig.level = 0.05
##           power = 0.8
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*

We need 44 patients, each with two measurements (before and after the diet), to detect a difference of 5 lbs or greater with 80% power and a significance level of 0.05.
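As a quick check of this estimate, we can plug n = 44 pairs back into pwr.t.test() with type = "paired" and leave power out of the call; the returned power should be just above 80%.

### Post hoc check: power with 44 pairs at d_z = 0.433555
pwr.t.test(d = d_z, n = 44, sig.level = 0.05, type = "paired")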

Power analysis of paired samples (paired t test)

We can plot how the power will change as the sample size changes for the paired t test analysis. As the sample size increases, power increases. This should make sense. Like our previous examples, as we increase our sample size, we reduce the uncertainty around the estimates. By reducing this uncertainty, we gain greater precision in our estimates, which results in greater confidence in our ability to avoid making a type II error.

### We can plot the power relative to different levels of the sample size for paired analysis. 
n_z <- seq(1, 80, 5)
n_z.change <- pwr.t.test(d = d_z, n = n_z, sig.level = 0.05, type = "paired")
## Warning in qt(sig.level/tside, nu, lower = FALSE): NaNs produced
n_z.change.df <- data.frame(n_z, power = n_z.change$power * 100)
n_z.change.df
##    n_z    power
## 1    1      NaN
## 2    6 14.03624
## 3   11 25.58334
## 4   16 36.84309
## 5   21 47.26307
## 6   26 56.56985
## 7   31 64.66154
## 8   36 71.54769
## 9   41 77.30572
## 10  46 82.04980
## 11  51 85.90929
## 12  56 89.01478
## 13  61 91.48950
## 14  66 93.44465
## 15  71 94.97744
## 16  76 96.17076
plot(n_z.change.df$n_z, 
     n_z.change.df$power, 
     type = "b", 
     xlab = "Sample size, n", 
     ylab = "Power (%)")

We increase the sample size from 1 to 80 at 5-unit intervals.

As the sample size increases, we generate more power to detect a difference of 5 lbs with a significance level of 0.05 and a fixed effect size (\(d_{z}\) = 0.433) based on two measurements (before and after the diet) per patient.

Let’s see how power changes when we change the effect size. Let’s change the average weight after the patients implement their diet: instead of fixing it at 125 lbs, we will vary it from 50 lbs to 130 lbs and see how the power changes.

### Vary the mu_y from 50 lbs to 130 lbs in intervals of 5 lbs.
mu_y <- seq(50, 130, 5)

d_z <- abs(mu_x - mu_y) / sd_z

n_z.change <- pwr.t.test(d = d_z, n = 44, sig.level = 0.05)
n_z.change.df <- data.frame(d_z, power = n_z.change$power * 100)
n_z.change.df
##         d_z     power
## 1  6.936880 100.00000
## 2  6.503325 100.00000
## 3  6.069770 100.00000
## 4  5.636215 100.00000
## 5  5.202660 100.00000
## 6  4.769105 100.00000
## 7  4.335550 100.00000
## 8  3.901995 100.00000
## 9  3.468440 100.00000
## 10 3.034885 100.00000
## 11 2.601330 100.00000
## 12 2.167775 100.00000
## 13 1.734220 100.00000
## 14 1.300665  99.99766
## 15 0.867110  98.03639
## 16 0.433555  52.03146
## 17 0.000000   5.00000
plot(n_z.change.df$d_z, 
     n_z.change.df$power, 
     type = "b", 
     xlab = "Cohen's d_z", 
     ylab = "Power (%)",
     xlim = c(0, 2))

We vary the average weight \(\mu_{y}\) between 50 lbs and 130 lbs in intervals of 5 lbs.

When we increase the effect size (Cohen’s \(d_{z}\)), our power goes up; recall that the sample size is fixed at 44 and significance level is 0.05. But when the effect size gets smaller (or when the average weight loss shrinks), we lose power to detect a difference because our sample size is too small. We’ll need to increase our sample size to have a reasonable power to detect small differences.

We can also see how power changes when we vary \(\rho\). If we set \(\rho\) = 0, then the Cohen’s \(d_{z}\) = 0.307. If we set \(\rho\) = 1, then the Cohen’s \(d_{z}\) = 5.

mu_x <- 130     ### Average weight before the diet (baseline)
mu_y <- 125     ### Average weight after the diet

sd_x <- 11      ### Standard deviation before the diet
sd_y <- 12      ### Standard deviation after the diet

sd_z_1 <- sqrt(sd_x^2 + sd_y^2 - 2*1*sd_x*sd_y)
sd_z_0 <- sqrt(sd_x^2 + sd_y^2 - 2*0*sd_x*sd_y)

d_z_1 <- abs(mu_x - mu_y) / sd_z_1
d_z_0 <- abs(mu_x - mu_y) / sd_z_0

d_z_1
## [1] 5
d_z_0
## [1] 0.3071476

So, a higher \(\rho\) results in a larger \(d_{z}\), and a lower \(\rho\) results in a smaller \(d_{z}\).

Let’s see how power changes when we vary \(\rho\) from 0 to 1 in intervals of 0.1.

rho <- seq(0.0, 1.0, 0.1)

sd_z <- sqrt(sd_x^2 + sd_y^2 - 2*rho*sd_x*sd_y)

d_z <- abs(mu_x - mu_y) / sd_z

rho.change <- pwr.t.test(d = d_z, n = 44, sig.level = 0.05)
rho.change.df <- data.frame(d_z, power = rho.change$power * 100)
rho.change.df
##          d_z     power
## 1  0.3071476  29.65480
## 2  0.3236941  32.35129
## 3  0.3432395  35.66281
## 4  0.3668151  39.80733
## 5  0.3960280  45.10546
## 6  0.4335550  52.03146
## 7  0.4842743  61.25990
## 8  0.5583195  73.54735
## 9  0.6816774  88.52181
## 10 0.9552009  99.32401
## 11 5.0000000 100.00000
plot(rho.change.df$d_z, 
     rho.change.df$power, 
     type = "b", 
     xlab = "Cohen's d_z", 
     ylab = "Power (%)",
     xlim = c(0, 1.5))

We vary \(\rho\) from 0 to 1 at intervals of 0.1 unit.

As \(\rho\) increases, our power increases. This makes sense because we are nearing “perfect” correlation, which would require a smaller sample to detect a difference if one existed. As the correlation becomes less “perfect,” our power drops, suggesting that we need to increase our sample size to make up for the weaker correlation.

Power analysis with unequal sample sizes

It is common to perform power analysis on a study where the sample sizes between groups are different. The pwr.t2n.test() function is a useful tool for estimating power given the two groups’ sample sizes.

Suppose you have a retrospective study where the patients were prescribed Treatment A and Treatment B. There were 130 patients in Treatment A (\(n_{A}\) = 130) and 120 patients in Treatment B (\(n_{B}\) = 120). The average change in HbA1c was 1.5% with a standard deviation of 1.25% in Treatment A, and the average change in HbA1c was 1.4% with a standard deviation of 1.01% in Treatment B.

First, we’ll calculate the pooled standard deviation (\(\sigma_{pooled}\)):

sd1 <- 1.25
sd2 <- 1.01
sd_pooled <- sqrt((sd1^2 +sd2^2) / 2)
sd_pooled
## [1] 1.136354

Once we have the \(\sigma_{pooled}\), we can estimate the Cohen’s \(d\):

mu1 <- 1.5
mu2 <- 1.4
d <- (mu1 - mu2) / sd_pooled
d
## [1] 0.08800076

Now, we can estimate the power with the different sample sizes across the groups (\(n_{A}\) = 130, \(n_{B}\) = 120).

n1 <- 130
n2 <- 120

power.diff_n <- pwr.t2n.test(d = d, n1 = n1, n2 = n2, sig.level = 0.05)
power.diff_n
## 
##      t test power calculation 
## 
##              n1 = 130
##              n2 = 120
##               d = 0.08800076
##       sig.level = 0.05
##           power = 0.1064836
##     alternative = two.sided

Since the average HbA1c change from baseline for Treatment A is 1.5% and 1.4% for Treatment B, the average difference in the HbA1c change from baseline is 0.1%. This is a difference-of-differences calculation: the difference between the groups of the within-group changes from baseline.

You only have 11% power to detect a difference of 0.10% or greater in the HbA1c change from baseline. This means that you are underpowered to detect a difference of 0.10% or greater in the HbA1c change from baseline with \(n_{A}\) = 130, \(n_{B}\) = 120, and a significance level of 0.05. When studies are underpowered, there is a high potential for type II error. The only way to address this problem is to enroll more patients or expand the sample by relaxing inclusion criteria. But this may increase the threats to the study’s internal validity.
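For perspective, here is a minimal sketch of solving for the sample size that would be needed to reach 80% power at this small effect size (d of about 0.088); it comes out to roughly 2,000 patients per group.

### Sample size needed per group for 80% power at d = 0.088
pwr.t.test(d = d, power = 0.80, sig.level = 0.05)   ### roughly 2,000 per group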

Conclusions

Sample size estimations and power analysis are very useful tools to determine how many patients you need in your study and how confident you are that you didn’t make a type II error. Depending on the type of study, you will need to use different functions from the pwr package. I highly encourage you to explore the other functions of the pwr package to see if those fit the study design you have planned.

Acknowledgements

I generated this tutorial using examples from an excellent website. The R CRAN network was a great resource for learning how to use the pwr package. I recommend interested students who want to expand their knowledge of sample size estimations and power analysis with other statistical inferences (e.g., paired t test, one-way ANOVA) to explore their vignette on pwr at this link.

I used the Tufte R Markdown package to create the style for this tutorial. The authors of the package are JJ Allaire and Yihui Xie, and their work can be found at this link or on their GitHub page.

You can email me with feedback at: internal.validity.blog@gmail.com.

This is a work in progress, and I expect to update this in the future.