Statistical Power for Randomized Experiments: Examples of Calculations and Pitfalls

E. C. Hedberg, Ph.D.
Assistant Professor, Sanford School of Social and Family Dynamics
Senior Fellow, Morrison Institute for Public Policy
Faculty Affiliate, Southwest Interdisciplinary Research Center
Arizona State University

Jan 20, 2016

The what, why, and when of power analyses

What is power?

The power of a statistical test is the probability that it will yield statistically significant results.

- Cohen (1988)

Why do power analyses?

Reasons for your funders:

Reasons for your discipline:

Today, many results from experiments are probably false, and low power contributes to this.

For example, fewer than half of replicated psychology results were statistically significant.

When to do power analyses

Before you start your study.

Power analyses guide sample design and study expectations before data are collected.

Once the data are collected, power analyses do not change the results. A power analysis does not negate a significant result.

At best, power analyses serve as a post mortem for nonsignificant results.

A more detailed what and how of power analyses

Types of error

Hypothesis testing

Biau et al., “P Value and the Theory of Hypothesis Testing.” (2010)

Type I error based on null distribution

Type II error based on alternative distribution with z score of 1

The bigger the test statistic, the larger the power.

The expected test statistic is the mean of the alternative distribution, which determines power.

Balancing three things

Power analyses include three tasks: finding the power implied by a given effect size and sample size, finding the sample size required for a given power and effect size, and finding the minimum detectable effect size for a given power and sample size.

Doing any of these tasks requires mastery of the test statistic in question.

Two group tests

In this presentation, we will focus on the simple two-group comparison, but the same principles apply to other tests (e.g., multilevel two group tests).

Example Analysis: Breakfast intervention to promote weight loss

Dhurandhar, Emily J., et al. “The effectiveness of breakfast recommendations on weight loss: a randomized controlled trial.” The American Journal of Clinical Nutrition 100.2 (2014): 507-513.

I selected only two of three conditions and randomly sampled 25 observations from each. Site variation was ignored.

Reading in the data

bmi <- read.csv(file = "bmi.csv") #read the data (read.csv returns a data frame)
bmi[c(5,10,35,40),] #list rows 5, 10, 35, and 40
##    treat        x        y
## 5      0 27.78202 27.02582
## 10     0 37.47405 37.61246
## 35     1 37.38256 38.64678
## 40     1 38.32031 40.85937

Summary statistics: mean and standard deviation by group

attach(bmi)
aggregate(cbind(x, y) ~ treat, FUN = mean)
##   treat        x        y
## 1     0 32.03280 31.91977
## 2     1 32.71625 33.13091
aggregate(cbind(x, y) ~ treat, FUN = sd)
##   treat        x        y
## 1     0 5.495015 5.762522
## 2     1 4.541026 5.289409

Data generating process

The outcome for person \(i\) in group \(j\) is a function of the overall mean, \(\mu\), plus the effect of the group, \(\tau_j\), plus the within-group residual, \(\epsilon_{ij}\):

\[ y_{ij} = \mu + \tau_j + \epsilon_{ij} \] \[ \tau_j = \mu_j - \mu \] \[ \epsilon_{ij} = y_{ij} - \mu_j \] \[ \epsilon_{ij} \sim N\left[0, \sigma^2\right] \]

where \(\sigma^2\) is the population variance of the outcome.

Post-intervention BMI = Average Post-intervention BMI + Effect of Intervention or Control + Random Error

The test where \(\sigma^2\) is known

If \(\bar{y}_1\) is the mean for the intervention group and \(\bar{y}_0\) is the mean of the control group, and each group has \(n\) observations (i.e., balanced data), then

\[\bar{y}_1 - \bar{y}_0 \sim N\left[\mu_1-\mu_0,\sigma^2\frac{2}{n}\right]\]

Thus, to get the p-value, you calculate the z-score of the observed difference in means: subtract the null-hypothesis value from the difference and divide by the standard deviation of the mean difference.

\[ z = \frac{(\bar{y}_1 - \bar{y}_0)-\mbox{Null hypothesis}}{\sqrt{\sigma^2\frac{2}{n}}} \]

If \(z\) is greater in magnitude than the critical value, you reject the null.
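
For illustration only, here is a minimal sketch of this \(z\) test on the BMI data, pretending \(\sigma\) is known; the value \(\sigma = 5.5\) is a hypothetical stand-in near the observed group standard deviations.

sigma <- 5.5 #assumed "known" population SD (hypothetical)
n <- 25 #cases per group
mdiff <- mean(y[treat==1]) - mean(y[treat==0]) #observed mean difference
z <- (mdiff - 0)/sqrt(sigma^2*(2/n)) #z-score against a null of 0
2*pnorm(abs(z), lower.tail = FALSE) #two-sided p-value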

The test statistic for samples (where \(\sigma^2\) is unknown)

When \(\sigma^2\) is unknown, we estimate it with the pooled within-group variance, \(\hat{\sigma}^2\). The sampling variance of the estimated difference in means is then (when data are balanced) \[ \hat{V}\left\{\bar{y}_1-\bar{y}_0\right\} = \hat{\sigma}^2\frac{2}{n}\] making the test statistic \(t\) \[ t = \frac{(\bar{y}_1-\bar{y}_0) - \mbox{Null hypothesis}}{\sqrt{\hat{\sigma}^2\frac{2}{n}}}\] which follows a \(t\) distribution with \(2n-2\) degrees of freedom under the null. Note that the pooled variance, \(\hat{\sigma}^2\), assumes that the two treatment groups have the same population variance (so any difference between the observed group variances is due to chance).
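
As a quick check of these formulas on the BMI data (a sketch; assumes bmi is attached as above, and the result should match the regression-based test on the next slides):

n <- 25 #cases per group
mdiff <- mean(y[treat==1]) - mean(y[treat==0]) #mean difference
s2 <- ((n-1)*var(y[treat==1]) + (n-1)*var(y[treat==0]))/(2*n-2) #pooled variance
t.stat <- mdiff/sqrt(s2*(2/n)) #test statistic against a null of 0
2*pt(abs(t.stat), df = 2*n-2, lower.tail = FALSE) #two-sided p-value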

Testing homogeneity of variance

aggregate(y ~ treat, FUN = sd)
##   treat        y
## 1     0 5.762522
## 2     1 5.289409
bartlett.test(y ~ treat)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  y by treat
## Bartlett's K-squared = 0.17233, df = 1, p-value = 0.678

Performing the \(t\)-test with regression

printCoefmat(coefficients(summary(lm(y ~ treat))))
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  31.9198     1.1062 28.8552   <2e-16
## treat         1.2111     1.5644  0.7742   0.4426
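
The same test can also be run directly with t.test; with var.equal = TRUE it is mathematically identical to the regression slope test, so \(|t| = 0.7742\) here as well (a sketch; the sign depends on which group mean is subtracted):

t.test(y ~ factor(treat), var.equal = TRUE) #pooled-variance t-test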

Power of that test

The alternative distribution is centered at the observed \(t\) statistic (0.77); most of its area falls before the critical value, so the power is very low.

All about the test

So, the expected test is what determines power. The alternative distribution is governed by its mean, the noncentrality parameter, \(\lambda\) \[\mbox{Expected test} = \lambda \]

The goal is to break down the expected test into scale-free parameters that don’t require knowledge about the unit of measurement. We can figure out the anatomy of the test by looking at the “Statistics 101” regression formula.

The regression model is \[ y_{ij} = \gamma_0 + \gamma_1\times\mbox{treat}_{ij} + \epsilon_{ij} \] and \(\gamma_1\) is defined as \[ \gamma_1 = \frac{\sum_j\sum_i\left(y_{ij}-\bar{y}\right)\times\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)}{\sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2}\]

The slope

In the balanced case, the denominator of \(\gamma_1\) is the total sample size divided by four, or half the sample, \(n\), divided by 2, \[ \sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2 = \sum_j\sum_i 0.25 =\frac{N}{4}= \frac{n}{2} \] The numerator of \(\gamma_1\) simplifies to \[\sum_j\sum_i\left(y_{ij}-\bar{y}\right)\times\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right) = \frac{1}{2}\sum_i\left(y_{i1}-\bar{y}\right)-\frac{1}{2}\sum_i\left(y_{i0}-\bar{y}\right)\] Thus, we can express the slope as \[ \hat{\gamma}_1 = \frac{\frac{1}{2}\sum_i\left(y_{i1}-\bar{y}\right)-\frac{1}{2}\sum_i\left(y_{i0}-\bar{y}\right)}{n/2}=\frac{\sum_i\left(y_{i1}-\bar{y}\right)-\sum_i\left(y_{i0}-\bar{y}\right)}{n} \] which is equivalent to \[ \hat{\gamma}_1 = \frac{\sum_iy_{i1}}{n} - \frac{\sum_iy_{i0}}{n} = \bar{y}_1-\bar{y}_0 \]

The variance of the slope

The estimated sampling variance of the slope, \(\gamma_1\), in a bi-variate regression is often presented as \[ \hat{\mbox{V}}\left\{\hat{\gamma}_1\right\}=\frac{\hat{\sigma}_\epsilon^2}{\sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2} \] where the mean squared error of a bi-variate regression (MSE or \(\sigma_\epsilon^2\)) is estimated as \[ \hat{\sigma}_\epsilon^2 = \frac{\sum_j\sum_i\hat{\epsilon}_{ij}^2}{N-2} = \frac{\left(n_1-1\right)\hat{\sigma}_1^2+\left(n_0-1\right)\hat{\sigma}_0^2}{n_1+n_0-2} \] which means that for bi-variate models with a dichotomous predictor, the estimate of the MSE is the same as the pooled estimate of the population variance, \(\sigma^2\), meaning that the estimated sampling variance of \(\gamma_1\) is \[ \hat{\mbox{V}}\left\{\hat{\gamma}_1\right\} = \hat{\sigma}^2\frac{2}{n}. \]
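
Both identities are easy to verify numerically on the BMI data (a sketch; the values should match the regression output above, a slope of 1.2111 with standard error 1.5644):

n <- 25 #cases per group
slope <- mean(y[treat==1]) - mean(y[treat==0]) #gamma_1 as a mean difference
mse <- ((n-1)*var(y[treat==1]) + (n-1)*var(y[treat==0]))/(2*n-2) #pooled MSE
se <- sqrt(mse*(2/n)) #standard error of the slope
c(slope, se) #compare with the lm() estimates for treat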

Let’s rearrange the test into scale-free parameters

Thus, the slope is the mean difference \[ \hat{\gamma}_1 = \bar{y}_1-\bar{y}_0 \] and the standard error is \[ \sqrt{\hat{\sigma}^2\frac{2}{n}} \] so the test is \[ \frac{\left(\bar{y}_1-\bar{y}_0\right)-\mbox{Null hypothesis}}{\sqrt{\hat{\sigma}^2\frac{2}{n}}}\] making the expected test \[ \lambda = \frac{\mu_1-\mu_0}{\sqrt{\sigma^2\frac{2}{n}}} = \frac{\mu_1-\mu_0}{\sigma}\sqrt{\frac{n}{2}} = \delta\sqrt{\frac{n}{2}} \] \[\mbox{Effect size} = \delta = \frac{\mu_1-\mu_0}{\sigma} \]

Effect sizes

In order to make \(\lambda\) scale free, we replaced \(\frac{\mu_1-\mu_0}{\sigma}\) with the effect size \(\delta\).

Effect sizes quantify the association between treatment condition and an outcome. Typically, effect sizes take the form of either the correlation coefficient \(r\) or Cohen’s \(d\) (a standardized difference between means). In the balanced case, the translations between the two are

\[ \delta = \frac{\mu_1-\mu_0}{\sigma} \] \[ \delta = \frac{2\rho}{\sqrt{1-\rho^2}} \] \[ \rho = \sqrt{\frac{\delta^2}{\delta^2+4}} \]

These formulas are useful for reading previous literature to inform expectations.

Cohen, “A Power Primer.” (1992)

Ferguson, “An Effect Size Primer: A Guide for Clinicians and Researchers.” (2009)
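
As a minimal sketch, these conversions can be wrapped in small R helpers (the names d_to_r and r_to_d are my own):

d_to_r <- function(d) sqrt(d^2/(d^2 + 4)) #Cohen's d to correlation (balanced case)
r_to_d <- function(r) 2*r/sqrt(1 - r^2) #correlation to Cohen's d
d_to_r(0.5) #a d of 0.5 corresponds to r of about 0.24
r_to_d(d_to_r(0.5)) #the round trip recovers 0.5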

Simulation example

sim <- function(delta, n) {
  y.0 <- rnorm(n, mean = 0) #cases in control
  y.1 <- rnorm(n, mean = delta) #cases in treatment
  test <- t.test(y.1, y.0, var.equal = TRUE) #the test
  test$statistic #return the t statistic
}
sim(delta = .5, n = 50) #one simulation
##        t 
## 1.427165

Simulation example

delta <- .5 #effect size
n <- 50 #size per group
delta*sqrt(n/2) #expected test
## [1] 2.5
results <- 
  replicate(5000,sim(delta,n)) #5000 simulations
mean(results) #mean of the simulation
## [1] 2.510451

What proportion of the results are after the critical value?

alpha <- .05
df <- 2*n - 2
ct <- qt(alpha/2, df, lower.tail = FALSE)
sim.power <- length(results[results >= ct])/length(results)
sim.power
## [1] 0.6944


The noncentrality parameter

\[ \lambda = \delta\sqrt{\frac{n}{2}} \]

With the non-centrality parameter, we can use an expected effect size, \(\delta\), and an expected sample size, \(n\), to estimate what our test-statistic is likely to be, \(\lambda\).

We then use \(\lambda\) to calculate the expected Type II error given a level of Type I error, and thus, power.

The Type II error is the area of the alternative distribution that falls between the two critical values. How do we calculate the area of the non-central distribution between the two critical values?

Area between critical values

If we assume some Type I error (\(\alpha\), usually 0.05 with two tails) and a sample size (which produces degrees of freedom), Type II error is

\[\beta = \underbrace{H\left[t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]}_\text{Area before right critical value}-\underbrace{H\left[-t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]}_\text{Area before left critical value} \]

where \(H\left[a,b,c\right]\) is the cumulative function of the non-central \(t\) distribution at point \(a\) with \(b\) degrees of freedom and non-centrality parameter \(c\).

The power, then, is the complement of this area

\[\mbox{Power} = 1-\beta.\]

Power given effect size and sample size

In R, we can use the pt function to get the cumulative non-central \(t\) distribution. Suppose we expect 50 cases in each group with an effect size of 0.5 (just like the simulation); power would then be

n <- 50 #cases per group
df <- 2*n-2 #degrees of freedom
delta <- .5 #effect size
alpha <- .05 #type i error
ct <- qt(alpha/2,df, lower.tail = FALSE) #critical t
lambda <- delta*sqrt(n/2) #noncentrality parameter
beta <- pt(ct,df,lambda)-pt(-ct,df,lambda) #type ii error
power <- 1-beta #power
power
## [1] 0.6968934
sim.power
## [1] 0.6944

Intuition about power

Power increases with both sample size and effect size, as the sketch below illustrates.
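
As a sketch of this intuition, the power.curve helper below (my own wrapper around the steps from the previous slide) plots power against the per-group sample size for \(\delta = 0.5\):

power.curve <- function(n, delta, alpha = .05) { #power as a function of n per group
  df <- 2*n - 2 #degrees of freedom
  ct <- qt(alpha/2, df, lower.tail = FALSE) #critical t
  lambda <- delta*sqrt(n/2) #noncentrality parameter
  1 - (pt(ct, df, lambda) - pt(-ct, df, lambda)) #power
}
curve(power.curve(x, delta = .5), from = 5, to = 150,
      xlab = "n per group", ylab = "Power") #power rises with n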

What is the effect size in the BMI data?

mean1 <- mean(y[treat==1]) #mean treatment
mean0 <- mean(y[treat==0]) #mean control
var1 <- var(y[treat==1]) #variance treatment
var0 <- var(y[treat==0]) #variance control
varp <- (var1+var0)/2 #pooled variance
bmi.delta <- (mean1-mean0)/sqrt(varp) #effect size
bmi.delta
## [1] 0.2189726

What was the power for the BMI test?

n <- 25 #cases per group
df <- 2*n-2 #degrees of freedom
delta <- bmi.delta #effect size
alpha <- .05 #type i error
ct <- qt(alpha/2,df, lower.tail = FALSE) #critical
lambda <- delta*sqrt(n/2) #expected test statistic
lambda
## [1] 0.774185
beta <- pt(ct,df,lambda)-pt(-ct,df,lambda) #type ii error
power <- 1-beta #power
power
## [1] 0.1181177

Sample size given power and effect size

This is difficult because there is no closed-form way to solve \[ \mbox{Power} = 1- H\left[t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]+H\left[-t_{\frac{\alpha}{2}}^{critical},df,\lambda\right] \] for the sample size: the critical value \(t_{\frac{\alpha}{2}}^{critical}\) keeps changing with the sample size through the degrees of freedom.

However, there is a trick. When \(\sigma^2\) is known, \[ Z_{1-\beta}-Z_{\alpha/2} = \lambda \] where \(Z_{\alpha/2}\) is the quantile of the standard normal distribution at \(\frac{\alpha}{2}\); thus, if \(\alpha = 0.05\), \(Z_{0.025} = -1.96\). This leads to a quantity \(M_z\) that approximates the expected test \[ M_z = Z_{\mbox{Power}}-Z_{\alpha/2} \]

Sample size given power and effect size

This means \[ M_z \approx \lambda = \delta\sqrt{\frac{n}{2}} \] which we can rearrange to find \[ n_z = \frac{2M_z^2}{\delta^2} \] Now the approximate sample size is a function of \(1-\beta\) (power), \(\alpha\) (Type I error), and the effect size (\(\delta\)).

Next, we use \(n_z\) for the degrees of freedom to calculate a sample-based factor, \(M_t\). Finally, we use the same formula \((n_t = \frac{2M_t^2}{\delta^2})\) to find the appropriate sample size.

Sample size given power and effect size

On the computer, we can do this easily

alpha <- .05 #set alpha
power <- .8 #set power
delta <- .5 #set effect size
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- (2*M_z^2)/delta^2 #approx sample size

Next, we use the approximate sample size to get a better estimate with \(M_t\):

M_t <- qt(power,2*n_z-2)-
  qt(alpha/2,2*n_z-2) #sample factor
n_t <- (2*M_t^2)/delta^2 #sample size
n_t
## [1] 63.79463

How many cases would the BMI study need?

alpha <- .05 #set alpha
power <- .8 #set power
delta <- bmi.delta #set effect size
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- (2*M_z^2)/delta^2 #approx sample size
M_t <- qt(power,2*n_z-2)-
  qt(alpha/2,2*n_z-2) #sample factor
n_t <- (2*M_t^2)/delta^2 #sample size
n_t
## [1] 328.3649

About 328 cases in each group, far more than the 25 per group in the example data.

Lowest effect size possible with a given sample size and power

We call this the minimum detectable effect size (MDES).

This is approximated by \[ \delta_{m.d.} = M_t\sqrt{\frac{2}{n}} \]

n <- 25
df <- 2*n-2
alpha <- .05
power <- 0.8
M_t <- qt(power,df)-qt(alpha/2,df)
mdes <- M_t*sqrt(2/n)
mdes
## [1] 0.808876

Intuition about MDES

Larger samples are more sensitive: as \(n\) grows, the minimum detectable effect size shrinks, as the sketch below shows.
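
As a sketch, plotting the MDES formula against the per-group sample size (with power 0.8 and two-tailed \(\alpha = 0.05\)) makes this visible:

mdes.curve <- function(n, alpha = .05, power = .8) { #MDES as a function of n per group
  df <- 2*n - 2 #degrees of freedom
  (qt(power, df) - qt(alpha/2, df))*sqrt(2/n) #M_t times sqrt(2/n)
}
curve(mdes.curve, from = 5, to = 200,
      xlab = "n per group", ylab = "MDES") #the MDES shrinks as n grows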

Adding covariates

Adding covariates (like a pretest) to the analysis of an experiment can increase power by reducing the sampling variance of the treatment effect.

However, any correlation between the treatment condition and the covariate works against this gain and can even make matters worse.

Example BMI analysis with pretest covariate

Compare effect of treat with and without the covariate

printCoefmat(coefficients(summary(lm(y ~ treat)))) #without the covariate
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  31.9198     1.1062 28.8552   <2e-16
## treat         1.2111     1.5644  0.7742   0.4426
printCoefmat(coefficients(summary(lm(y ~ x + treat)))) #with the covariate
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.375131   1.149875 -2.0656  0.04441
## x            1.070618   0.035073 30.5256  < 2e-16
## treat        0.479431   0.347263  1.3806  0.17393
aggregate(x ~ treat, FUN = mean)
##   treat        x
## 1     0 32.03280
## 2     1 32.71625

Adding covariates

New data generating process \[ y_{ij}=\mu+\tau_j^{'}+\phi\left(x_{ij}-\bar{x}\right)+\epsilon_{ij} \] where \(\tau_j^{'}\) is the adjusted treatment effect and \(\phi\) is the within-group slope for \(x\), the covariate, predicting the outcome, \(y\). \[ \tau_j^{'} = \tau_j - \phi\left(\bar{x}_1-\bar{x}_0\right) \] This means that the adjusted treatment effect, \(\tau_j^{'}\), is related to the unadjusted treatment effect through \(\phi\) and the difference between the means of the covariate by treatment and control groups.

When there is no correlation between treatment and the covariate \(x\), the means of \(x\) for each group are the same. In such cases, \(\tau_j^{'} = \tau_j\).

Regression slope for treatment group

The regression model is

\[ y_{ij} = \gamma_0 + \gamma_1\times \mbox{treat}_{ij} + \gamma_2\times (x_{ij}-\bar{x}) + \epsilon_{ij} \]

The collinearity between the treatment group and the covariate makes the test more complex. The regression slope for the treatment effect is \[ \gamma_1 = \left(\bar{y}_1-\bar{y}_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right) \] and the variance of the effect is \[ \hat{V}\left\{\hat{\gamma_1} \right\} = \hat{\sigma}^2\left(1-\hat{\rho}_{yx,w}^2\right)\frac{2n-2}{2n-3}\frac{2}{n}\frac{1}{1-\hat{\rho}_{tx}^2} \]

Porter & Raudenbush. “Analysis of covariance: Its model and use in psychological research.” (1987)

Noncentrality parameter

The final test statistic has many components \[ \lambda = \underbrace{\frac{\left(\mu_1-\mu_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right)}{\sigma}}_\text{Adjusted effect size}\underbrace{\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}}_\text{Covariate effect}\underbrace{\sqrt{\frac{n}{2}\frac{2n-3}{2n-2}}}_\text{Sample size}\underbrace{\sqrt{1-\rho_{tx}^2}}_\text{Multicollinearity} \] where \(\rho_{yx,w}^2\) is the effectiveness of the covariate in explaining variation in the outcome and \(\rho_{tx}\) is the multicollinearity that can be expected between the group indicator and the covariate.

If we can plan on zero correlation between the covariate and treatment group, this expression simplifies to \[ \lambda = \underbrace{\frac{\left(\mu_1-\mu_0\right)}{\sigma}}_\text{Effect size}\underbrace{\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}}_\text{Covariate effect}\underbrace{\sqrt{\frac{n}{2}\frac{2n-3}{2n-2}}}_\text{Sample size} \]

Further simplifications

The square root of \(\frac{2n-3}{2n-2}\) is always less than 1, but never dips below about 0.95 for samples of \(n = 6\) or greater.

curve(sqrt((2*x-3)/(2*x-2)), from = 4, to = 10)

Further simplifications

Thus, unless the sample is extremely small, the effect of \(\frac{2n-3}{2n-2}\) on the non-centrality parameter is minimal. Knowing this, the non-centrality parameter can be approximated with the simplified formula \[\lambda \approx \frac{\left(\mu_1-\mu_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right)}{\sigma}\sqrt{\frac{n}{2}}\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}\sqrt{1-\rho_{tx}^2}\]

Finding power with a covariate

n <- 25 #sample size per group
df <- 2*n-2-1 #degrees of freedom
delta_a <- 0.5 #adjusted effect size
r_yx.w <- 0.5 #within group correlation
r_tx <- 0.2 #multicollinearity
alpha <- .05 #type i error
ct <- qt(alpha/2,df, 
         lower.tail = FALSE) #critical t
lambda <- delta_a*sqrt(n/2)*
  sqrt(1/(1-r_yx.w^2))*
  sqrt(1-r_tx^2) #noncentrality parameter
beta <- pt(ct,df,lambda)-pt(-ct,df,lambda) #type ii error
power <- 1-beta #power
power
## [1] 0.4996

Finding the sample size

\[ n \approx \frac{2M_t^2}{\delta_a^2}\frac{1-\rho_{yx,w}^2}{1-\rho_{tx}^2} \]

alpha <- 0.05 #type i error
power <- 0.8 #power
delta_a <- 0.5 #adjusted effect size
r_yx.w <- 0.5 #within group correlation
r_tx <- 0.2 #multicolinearity
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- ((2*M_z^2)/delta_a^2)*
  ((1-r_yx.w^2)/(1-r_tx^2)) #approx. sample size
M_t <- qt(power,2*n_z-2-1)-
  qt(alpha/2,2*n_z-2-1) #sample factor
n_t <- ((2*M_t^2)/delta_a^2)*
  ((1-r_yx.w^2)/(1-r_tx^2)) #sample size
n_t
## [1] 50.07819

Other topics: Multilevel models

Multilevel studies require knowledge about how the total population variance breaks down within and between clusters, summarized by the intraclass correlation (\(ICC\))

In the case of a two-level cluster randomized trial (where each treatment group has \(m\) clusters and each cluster has \(n\) units), the non-centrality parameter (i.e., the test) is \[ \lambda=\delta\sqrt{\frac{mn}{2}}\sqrt{\frac{1}{1+\left(n-1\right)ICC-\left(R_1^2+\left(nR_2^2-R_1^2\right)ICC\right)}} \]

Hedges & Rhoads. “Statistical Power Analysis in Education Research.” (2010)
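
A sketch of a power calculation built from this formula; the design numbers, ICC, and R-squared values below are hypothetical, and the degrees of freedom are based on the number of clusters:

m <- 20 #clusters per treatment group (hypothetical)
n <- 30 #units per cluster (hypothetical)
icc <- 0.15 #intraclass correlation (hypothetical)
R1 <- 0; R2 <- 0 #no covariates at level 1 or level 2
delta <- .25 #effect size
alpha <- .05 #type i error
df <- 2*m - 2 #degrees of freedom from the clusters
ct <- qt(alpha/2, df, lower.tail = FALSE) #critical t
lambda <- delta*sqrt(m*n/2)*
  sqrt(1/(1 + (n-1)*icc - (R1^2 + (n*R2^2 - R1^2)*icc))) #expected test
1 - (pt(ct, df, lambda) - pt(-ct, df, lambda)) #power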

Other topics: Generalized outcomes

Nonlinear models (e.g., generalized linear models and multilevel generalized linear models) typically require larger samples.

Power is difficult to estimate because the variance of the outcome is usually a function of the mean, and the experiment changes the means, so an estimate of the baseline mean is essential to planning sample sizes.

For example, extreme probabilities for dichotomous outcomes (< 0.05 or > 0.95) dramatically increase the required sample size for the same odds ratio, as the sketch below suggests.
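
A quick illustration with base R's power.prop.test (the probabilities are hypothetical, chosen so both comparisons share an odds ratio of 2):

power.prop.test(p1 = 0.50, p2 = 2/3, power = 0.8) #moderate baseline probability
power.prop.test(p1 = 0.95, p2 = 38/39, power = 0.8) #extreme baseline; same odds ratio, far larger n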

Thank you!

If you have any questions or concerns, feel free to contact me.

E. C. Hedberg

ehedberg@asu.edu