E. C. Hedberg, Ph.D.
Assistant Professor, Sanford School of Social and Family Dynamics
Senior Fellow, Morrison Institute for Public Policy
Faculty Affiliate, Southwest Interdisciplinary Research Center
Arizona State University
Jan 20, 2016
The power of a statistical test is the probability that it will yield statistically significant results
-Cohen (1988)
Reasons for your funders:
Reasons for your discipline:
Today, many results from experiments are probably false, and low power contributes to this.
For example, less than half of replicated psychology results were significant.
Before you start your study.
Power analyses guide sample design and study expectations before data are collected.
Once the data are collected, power analyses do not change the results. A power analysis does not negate a significant result.
At best, power analyses are a post mortem for insignificant results.
Jean Biau, et al. “P Value and the Theory of Hypothesis Testing.”, (2010)
The expected test statistic is the mean of the alternative distribution, which determines power.
Power analyses include three tasks:
Doing any of these tasks requires mastery of the test statistic in question.
In this presentation, we will focus on the simple two-group comparison, but the same principles apply to other tests (e.g., multilevel two group tests).
I selected only two of three conditions and randomly sampled 25 observations from each. Site variation was ignored.
bmi <- data.frame(
read.csv(file = "bmi.csv")) #create dataframe
bmi[c(5,10,35,40),] #list rows 5, 10, 35, and 40
## treat x y
## 5 0 27.78202 27.02582
## 10 0 37.47405 37.61246
## 35 1 37.38256 38.64678
## 40 1 38.32031 40.85937
attach(bmi)
aggregate(cbind(x, y) ~ treat, FUN = mean)
## treat x y
## 1 0 32.03280 31.91977
## 2 1 32.71625 33.13091
aggregate(cbind(x, y) ~ treat, FUN = sd)
## treat x y
## 1 0 5.495015 5.762522
## 2 1 4.541026 5.289409
The outcome for person \(i\) in group \(j\) is a function of the overall mean, \(\mu\), plus the effect of the group, \(\tau_j\), plus the within-group residual, \(\epsilon\)
\[ y_{ij} = \mu + \tau_j + \epsilon_{ij} \] \[ \tau_j = \mu_j - \mu \] \[ \epsilon_{ij} = y_{ij} - \mu_j \] \[ \epsilon_{ij} \sim N\left[0, \sigma^2\right] \]
where \(\sigma^2\) is the population variance of the outcome.
Post-intervention BMI = Average Post-intervention BMI + Effect of Intervention or Control + Random Error
If \(\bar{y}_1\) is the mean for the intervention group and \(\bar{y}_0\) is the mean of the control group, and each group has \(n\) observations (i.e., balanced data), then
\[\bar{y}_1 - \bar{y}_0 \sim N\left[\mu_1-\mu_0,\sigma^2\frac{2}{n}\right]\]
Thus, to get the p-value, you calculate the z-score of the observed difference in the means as it relates to the null hypothesis and the variance of the mean difference.
\[ z = \frac{(\bar{y}_1 - \bar{y}_0)-\mbox{Null hypothesis}}{\sqrt{\sigma^2\frac{2}{n}}} \]
If \(z\) is greater in magnitude than the critical value, you reject the null.
Thus, the sampling variance of the estimated difference in means is (when data are balanced) \[ \hat{V}\left\{\bar{y}_1-\bar{y}_0\right\} = \hat{\sigma}^2\frac{2}{n}\] making the test statistic \(t\) \[ t = \frac{(\bar{y}_1-\bar{y}_0) - \mbox{Null hypothesis}}{\sqrt{\hat{\sigma}^2\frac{2}{n}}}\] distributed with \(2n-2\) degrees of freedom. Note that the computation of the pooled variance, \(\hat{\sigma}^2\) assumes that each treatment group has the same variance in reality (thus the observed differences are by chance).
aggregate(y ~ treat, FUN = sd)
## treat y
## 1 0 5.762522
## 2 1 5.289409
bartlett.test(y ~ treat)
##
## Bartlett test of homogeneity of variances
##
## data: y by treat
## Bartlett's K-squared = 0.17233, df = 1, p-value = 0.678
printCoefmat(coefficients(summary(lm(y ~ treat))))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9198 1.1062 28.8552 <2e-16
## treat 1.2111 1.5644 0.7742 0.4426
The alternative distribution has a mean of the \(t\)-test, thus most of the distribution is before the critical value. The power is very low.
So, the expected test is what determines power. The alternative distribution is governed by its mean, the noncentrality parameter, \(\lambda\) \[\mbox{Expected test} = \lambda \]
The goal is to break down the expected test into scale-free parameters that don’t require knowledge about the unit of measurement. We can figure out the anatomy of the test by looking at the “Statistics 101” regression formula.
The regression model is \[ y_{ij} = \gamma_0 + \gamma_1\times\mbox{treat}_{ij} + \epsilon_{ij} \] and \(\gamma_1\) is defined as \[ \gamma_1 = \frac{\sum_j\sum_i\left(y_{ij}-\bar{y}\right)\times\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)}{\sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2}\]
In the balanced case, the denominator of \(\gamma_1\) is the total sample size divided by four, or half the sample, \(n\), divided by 2, \[ \sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2 = \sum_j\sum_i 0.25 =\frac{N}{4}= \frac{n}{2} \] The numerator of \(\gamma_1\) simplifies to \[\sum_j\sum_i\left(y_{ij}-\bar{y}\right)\times\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right) = \frac{1}{2}\sum_i\left(y_{i1}-\bar{y}\right)-\frac{1}{2}\sum_i\left(y_{i0}-\bar{y}\right)\] Thus, we can express the slope as \[ \hat{\gamma}_1 = \frac{\frac{1}{2}\sum_i\left(y_{i1}-\bar{y}\right)-\frac{1}{2}\sum_i\left(y_{i0}-\bar{y}\right)}{n/2}=\frac{\sum_i\left(y_{i1}-\bar{y}\right)-\sum_i\left(y_{i0}-\bar{y}\right)}{n} \] which is equivalent to \[ \hat{\gamma}_1 = \frac{\sum_iy_{i1}}{n} - \frac{\sum_iy_{i0}}{n} = \bar{y}_1-\bar{y}_0 \]
The estimated sampling variance of the slope, \(\gamma_1\), in a bi-variate regression is often presented as \[ \hat{\mbox{V}}\left\{\hat{\gamma}_1\right\}=\frac{\hat{\sigma}_\epsilon^2}{\sum_j\sum_i\left(\mbox{treat}_{ij}-\bar{\mbox{treat}}\right)^2} \] where the mean squared error of a bi-variate regression (MSE or \(\sigma_\epsilon^2\)) is estimated as \[ \hat{\sigma}_\epsilon^2 = \frac{\sum_j\sum_i\hat{\epsilon}_{ij}^2}{N-2} = \frac{\left(n_1-1\right)\sigma_1^2+\left(n_0-1\right)\sigma_0^2}{n_1+n_0-2} \] which means that for bi-variate models with a dichotomous predictor, the estimate of the MSE is the same as the estimate of the population variance, \(\sigma^2\), meaning that that the estimated sampling variance of \(\gamma_1\) is \[ \hat{\mbox{V}}\left\{\hat{\gamma}_1\right\} = \hat{\sigma}^2\frac{2}{n}. \]
Thus, the slope is the mean difference is \[ \hat{\gamma}_1 = \bar{y}_1-\bar{y}_0 \] and the standard error is \[ \sqrt{\hat{\sigma}^2\frac{2}{n}} \] so the test is \[ \frac{\left(\bar{y}_1-\bar{y}_0\right)-\mbox{Null hypothesis}}{\sqrt{\hat{\sigma}^2\frac{2}{n}}}\] making the expected test \[ \lambda = \frac{\mu_1-\mu_0}{\sqrt{\sigma^2\frac{2}{n}}} = \frac{\mu_1-\mu_0}{\sigma}\sqrt{\frac{n}{2}} = \delta\sqrt{\frac{n}{2}} \] \[\mbox{Effect size} = \delta = \frac{\mu_1-\mu_0}{\sigma} \]
In order to make \(\lambda\) scale free, we replaced \(\frac{\mu_1-\mu_0}{\sigma}\) with the effect size \(\delta\).
Effect sizes represent the correlation between treatment conditions and an outcome. Typically, effect sizes are either in the form of the correlation coefficient \(r\) or Cohen’s \(d\) (a standardized difference between means). In the balanced case, the translations between the two are
\[ \delta = \frac{\mu_1-\mu_0}{\sigma} \] \[ \delta = \frac{2\rho}{\sqrt{1-\rho^2}} \] \[ \rho = \sqrt{\frac{\delta^2}{\delta^2+4}} \]
These formulas are useful for reading previous literature to inform expectations.
Cohen, “A Power Primer.” (1992)
Ferguson, “An Effect Size Primer: A Guide for Clinicians and Researchers.” (2009)
sim <- function(delta, n) {
y.0 <- rnorm(n, mean = 0) #cases in control
y.1 <- rnorm(n, mean = delta) #cases in treatment
test <- t.test(y.1,y.0, var.equal = TRUE) #the test
return(t = test$statistic) #return the result
}
sim(delta = .5, n = 50) #one simulation
## t
## 1.427165
delta <- .5 #effect size
n <- 50 #size per group
delta*sqrt(n/2) #expected test
## [1] 2.5
results <-
replicate(5000,sim(delta,n)) #5000 simulations
mean(results) #mean of the simulation
## [1] 2.510451
alpha <- .05
df <- 2*n - 2
ct <- qt(alpha/2, df, lower.tail = FALSE)
sim.power <- length(results[results >= ct])/length(results)
sim.power
## [1] 0.6944
length(results[results >= ct])/length(results)
## [1] 0.6944
\[ \lambda = \delta\sqrt{\frac{n}{2}} \]
With the non-centrality parameter, we can use an expected effect size, \(\delta\), and an expected sample size, \(n\), to estimate what our test-statistic is likely to be, \(\lambda\).
We then use \(\lambda\) calculate the expected Type II error given a level of Type I error, and thus, power.
The shaded area is the Type II error. It is the area of the alternative distribution between the two critical values. How do we calculate the area of the non-central distribution between the two critical values?
If we assume some Type I error (\(\alpha\), usually 0.05 with two tails) and a sample size (which produces degrees of freedom), Type II error is
\[\beta = \underbrace{H\left[t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]}_\text{Area before right critial value}-\underbrace{H\left[-t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]}_\text{Area before left critical value} \]
where \(H\left[a,b,c\right]\) is the cumulative function of the non-central \(t\) distribution at point \(a\) with \(b\) degrees of freedom and non-centrality parameter \(c\).
The power, then, is the complement of this area
\[\mbox{Power} = 1-\beta.\]
In R, we can use the pt function to get the cumulative non-central \(t\) distribution. Suppose we thought that we would have 50 cases in each group with an effect size of 0.5 (just like the simulation), power would then be
n <- 50 #cases per group
df <- 2*n-2 #degrees of freedom
delta <- .5 #effect size
alpha <- .05 #type i error
ct <- qt(alpha/2,df, lower.tail = F) #critial t
lambda <- delta*sqrt(n/2) #noncentrality parameter
beta <- pt(ct,df,lambda)-pt(-ct,df,lambda) #type ii error
power <- 1-beta #power
power
## [1] 0.6968934
sim.power
## [1] 0.6944
Power increases both with sample size and effect size.
mean1 <- mean(y[treat==1]) #mean treatment
mean0 <- mean(y[treat==0]) #mean control
var1 <- var(y[treat==1]) #variance treatment
var0 <- var(y[treat==0]) #variance control
varp <- (var1+var0)/2 #pooled variance
bmi.delta <- (mean1-mean0)/sqrt(varp) #effect size
bmi.delta
## [1] 0.2189726
n <- 25 #cases per group
df <- 2*n-2 #degrees of freedom
delta <- bmi.delta #effect size
alpha <- .05 #type i error
ct <- qt(alpha/2,df, lower.tail = FALSE) #critical
lambda <- delta*sqrt(n/2) #expected test statistic
lambda
## [1] 0.774185
beta <- pt(ct,df,lambda)-pt(-ct,df,lambda) #type ii error
power <- 1-beta #power
power
## [1] 0.1181177
This is difficult because there is no way to solve \[ \mbox{Power} = 1- H\left[t_{\frac{\alpha}{2}}^{critical},df,\lambda\right]+H\left[-t_{\frac{\alpha}{2}}^{critical},df,\lambda\right] \] for the sample size because \(t_{\frac{\alpha}{2}}^{critical}\) keeps changing with the sample size.
However, there is a trick. When \(\sigma^2\) is known, \[ Z_{1-\beta}-Z_{\alpha/2} = \lambda \] Where \(Z_{\alpha/2}\) is the quantile value of the normal distribution at half of \(\alpha\) Thus, if \(\alpha = 0.05\), \(Z_{0.025} = -1.96\). Leading to a quantity \(M\) that is approximate to the expected test \[ M_z = Z_{\mbox{Power}}-Z_{\alpha/2} \]
This means \[ M_z \approx \lambda = \delta\sqrt{\frac{n}{2}} \] which we can rearrange to find \[ n_z = \frac{2M_z^2}{\delta^2} \] now, the approximate sample size is a function of \(1-\beta\) (power), \(\alpha\) (Type I error), and the effect size (\(\delta\)).
Next, we use \(n_z\) for the degrees of freedom to calculate a sample-based factor, \(M_t\). Finally, we use the same formula \((n_t = \frac{2M_t^2}{\delta^2})\) to find the appropriate sample size.
On the computer, we can do this easily
alpha <- .05 #set alpha
power <- .8 #set power
delta <- .5 #set effect size
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- (2*M_z^2)/delta^2 #approx sample size
next, we use the approximate sample size to get a better estimate with \(M_t\)
M_t <- qt(power,2*n_z-2)-
qt(alpha/2,2*n_z-2) #sample factor
n_t <- (2*M_t^2)/delta^2 #sample size
n_t
## [1] 63.79463
alpha <- .05 #set alpha
power <- .8 #set power
delta <- bmi.delta #set effect size
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- (2*M_z^2)/delta^2 #approx sample size
M_t <- qt(power,2*n_z-2)-
qt(alpha/2,2*n_z-2) #sample factor
n_t <- (2*M_t^2)/delta^2 #sample size
n_t
## [1] 328.3649
328 in each group
We call this the minimum detectable effect size (MDES).
This is approximated by \[ \delta_{m.d.} = M_t\sqrt{\frac{2}{n}} \]
n <- 25
df <- 2*n-2
alpha <- .05
power <- 0.8
M_t <- qt(power,df)-qt(alpha/2,df)
mdes <- M_t*sqrt(2/n)
mdes
## [1] 0.808876
Larger samples are more sensitive.
Adding covariates (like a pretest) to an analysis of an experiment can (possibly) increase power by reducing the variance of the treatment effect.
However, if there is any correlation between the treatment condition and the covariate, it could make matters worse.
Compare effect of treat with and without the covariate
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.9198 1.1062 28.8552 <2e-16
## treat 1.2111 1.5644 0.7742 0.4426
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.375131 1.149875 -2.0656 0.04441
## x 1.070618 0.035073 30.5256 < 2e-16
## treat 0.479431 0.347263 1.3806 0.17393
aggregate(x ~ treat, FUN = mean)
## treat x
## 1 0 32.03280
## 2 1 32.71625
New data generating process \[ y_{ij}=\mu+\tau_j^{'}+\phi\left(x_{ij}-\bar{x}\right)+\epsilon_{ij} \] where \(\tau_j^{'}\) is the adjusted treatment effect and \(\phi\) is the within-group slope for \(x\), the covariate, predicting the outcome, \(y\). \[ \tau_j^{'} = \tau_j - \phi\left(\bar{x}_1-\bar{x}_0\right) \] This means that the adjusted treatment effect, \(\tau_j^{'}\), is related to the unadjusted treatment effect through \(\phi\) and the difference between the means of the covariate by treatment and control groups.
When there is no correlation between treatment and the covariate \(x\), the means of \(x\) for each group are the same. In such cases, \(\tau_j^{'} = \tau_j\).
The regression model is
\[ y_{ij} = \gamma_0 + \gamma_1\times \mbox{treat}_{ij} + \gamma_2\times (x_{ij}-\bar{x}) + \epsilon_{ij} \]
The colinearity between treatment group and the covariance makes the test more complex. The regression slope for the treatment effect is \[ \gamma_1 = \left(\bar{y}_1-\bar{y}_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right) \] and the variance of the effect is \[ \hat{V}\left\{\hat{\gamma_1} \right\} = \hat{\sigma}^2\left(1-\hat{\rho}_{yx,w}^2\right)\frac{2n-2}{2n-3}\frac{2}{n}\frac{1}{1-\hat{\rho}_{tx}^2} \]
Porter & Raudenbush. “Analysis of covariance: Its model and use in psychological research.” (1987)
The final test statistic has many components \[ \lambda = \underbrace{\frac{\left(\mu_1-\mu_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right)}{\sigma}}_\text{Adjusted Effect size}\underbrace{\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}}_\text{Covariate effect}\underbrace{\sqrt{\frac{n}{2}\frac{2n-3}{2n-2}}}_\text{Sample size}\underbrace{\sqrt{1-\rho_{tx}^2}}_\text{Multicolinearity} \] where the effectiveness of the covariate in explaining variation in the outcome is \(\rho_{yx,w}^2\) and the multicolinearity that can be expected between the group indicator and the covariate is \(\rho_{tx}\)
If we can plan on zero correlation between the covariate and treatment group, this expression simplifies to \[ \lambda = \underbrace{\frac{\left(\mu_1-\mu_0\right)}{\sigma}}_\text{Effect size}\underbrace{\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}}_\text{Covariate effect}\underbrace{\sqrt{\frac{n}{2}\frac{2n-3}{2n-2}}}_\text{Sample size} \]
The square root of \(\frac{2n-3}{2n-2}\) is always less than 1, but never dips below about 0.95 for samples of \(n = 6\) or greater.
curve(sqrt((2*x-3)/(2*x-2)), from = 4, to = 10)
Thus, unless the sample is extremely small, the effect of \(\frac{2n-3}{2n-2}\) on the non-centrality parameter is minimal. Knowing this, the non-centrality parameter be approximated with the simplified formula \[\lambda \approx \frac{\left(\mu_1-\mu_0\right) - \gamma_2\left(\bar{x}_1-\bar{x}_0\right)}{\sigma}\sqrt{\frac{n}{2}}\sqrt{\frac{1}{\left(1-\rho_{yx,w}^2\right)}}\sqrt{1-\rho_{tx}^2}\]
n <- 25 #sample size
df <- 2*n-2-1 #degrees of freedom
delta_a <- 0.5 #adjusted effect size
r_yx.w <- 0.5 #within group correlation
r_tx <- 0.2 #multicolinearity
alpha <- .05 #type i error
ct <- qt(alpha/2,df,
lower.tail = FALSE) #critical t
lambda <- delta_a*sqrt(n/2)*
sqrt(1/(1-r_yx.w^2))*
sqrt(1-r_tx^2) #noncentrality parameter
beta <- pt(ct,df,ncp)-pt(-ct,df,ncp) #type ii error
power <- 1-beta #power
power
## [1] 0.1947309
\[ n \approx \frac{2M_t^2}{\delta_a^2}\frac{1-\rho_{yx,w}^2}{1-\rho_{tx}^2} \]
alpha <- 0.05 #type i error
power <- 0.8 #power
delta_a <- 0.5 #adjusted effect size
r_yx.w <- 0.5 #within group correlation
r_tx <- 0.2 #multicolinearity
M_z <- qnorm(power)-qnorm(alpha/2) #normal factor
n_z <- ((2*M_z^2)/delta_a^2)*
((1-r_yx.w^2)/(1-r_tx^2)) #approx. sample size
M_t <- qt(power,2*n_z-2-1)-
qt(alpha/2,2*n_z-2-1) #sample factor
n_t <- ((2*M_t^2)/delta_a^2)*
((1-r_yx.w^2)/(1-r_tx^2)) #sample size
n_t
## [1] 50.07819
Multilevel studies require knowledge about how the total population variance breaks down within and between clusters, summarized by the intraclass correlation (\(ICC\))
In the case of a two-level cluster randomized trial (where each treatment group has \(m\) clusters and each cluster has \(n\) units), the non-centrality parameter (i.e., the test) is \[ \lambda=\delta\sqrt{\frac{mn}{2}}\sqrt{\frac{1}{1+\left(n-1\right)ICC-\left(R_1^2+\left(nR_2^2-R_1^2\right)ICC\right)}} \]
Hedges & Rhoads. “Statistical Power Analysis in Education Research.” (2010)
Nonlinear models (i.e. generalized linear models and multilevel generalized linear models) typically require larger samples.
Power is difficult to estimate because the variance of the outcome is usually a function of the mean, and the experiment changes the means…so estimates of the baseline mean is essential to planning sample sizes.
For example, extreme probabilities for dichotomous outcomes (< 0.05 or > 0.95) exponentially increase required sample sizes for the same odds ratio.