The power of a test is the probability of rejecting the null hypothesis when the alternative hypothesis is true. Mathematically,
\[\text{Power} = 1 - P(\text{Type II Error})\]
The null hypothesis is the hypothesis of no difference, whereas the alternative hypothesis is that a non-zero difference exists. A non-zero difference is not the same as a clinically relevant difference: a clinically relevant difference is the magnitude of difference that the researchers actually care about and hope to detect in their experiment. For example, in a clinical trial, a new vaccine might be approved for large-scale production only if it can be shown to help 20% more patients than the vaccine currently in use. In such a case, your power analysis should be set up to detect a difference of 20% and not merely a non-zero difference. Yes, you will need more samples to detect the clinically relevant difference than a non-zero difference, but don’t be afraid to report the higher sample size if you have followed the appropriate steps. Your boss may be unhappy, but you will have done the correct thing, and the statistics forefathers won’t be turning in their graves.
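As a quick illustration of how a clinically relevant difference drives sample size, SAS’s PROC POWER can solve for the required number of subjects. The sketch below is hypothetical: it assumes the current vaccine helps 60% of patients and we want 80% power to detect an improvement to 80% (the 20% clinically relevant difference) using a two-sample chi-square test.

```
/* Hypothetical sketch: sample size needed to detect a 20% improvement */
/* (60% vs. 80%) with 80% power at the 0.05 significance level.       */
proc power;
   twosamplefreq test=pchi
      groupproportions = (0.60 0.80)   /* current vs. new vaccine */
      alpha            = 0.05
      power            = 0.80
      npergroup        = .;            /* solve for n per group   */
run;
```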
The Completely Randomized Design (CRD) is the simplest experimental design. In this design, we assume that the variation in the response is explained only by the treatments and the inherent variation in the experimental units. The statistical model for the CRD (assuming a Normal distribution for the response, though this could be any known statistical distribution) is \[\begin{aligned}Y_{ij} \sim N(\mu_{i}, \sigma^2) \\ \mu_i = \mu + \tau_{i}\end{aligned}\] Here i represents the \(i^{th}\) treatment and j represents the \(j^{th}\) experimental unit within the \(i^{th}\) treatment. These two equations say that the variation in the response for any experimental unit can be attributed to the deterministic effect of the treatment (\(\tau_{i}\)) and the stochastic experimental unit uniqueness quantified by \(\sigma^2\).
The Randomized Complete Block Design (RCBD) is used when we suspect there are factors other than the treatment and the experimental unit uniqueness that can explain the variation in the response. These factors are known as blocks, and an RCBD can be a seamless way of improving the power of the design if there is indeed a significant blocking factor that impacts the response. Using similar statistical notation as in the case of the CRD, the model for an RCBD is as follows: \[\begin{aligned}Y_{ij} \sim N(\mu_{ij}, \sigma^2) \\ \mu_{ij} = \mu + \tau_{i} + b_j\end{aligned}\] Here i represents the \(i^{th}\) treatment and j represents the \(j^{th}\) block. Any experimental unit can be identified by a combination of treatment and block, which is why we do not need any additional indices apart from i and j. This equation tells us that, in addition to the treatment and the experimental unit uniqueness, the blocking factor explains a portion of the variation in the response, making it only natural to include it in the probabilistic model.
Fixed vs. Random Effects - The discussion of RCBDs provides a natural opportunity to broach the concepts of fixed vs. random effects in experimental design. Imagine your blocking factor is the operating system; there are very few operating systems, such as iOS, Android, etc., and most of the time your inference would target results specifically across these categories. In such a case, your blocks are said to be fixed effects. In simple terms, it means that there is absolutely no ambiguity or stochasticity about the effect of the blocks on the response. However, if you are dealing with, say, a geographical blocking factor like counties, of which there can be hundreds or thousands, then it might be infeasible to involve all the counties in your experimental setup. In such a case, you can sample a few blocks to be part of your experimental setup, and because of this sampling, you assume that the blocks you chose are a random sample from the population of blocks you could have chosen from. This would be a classic example of a random effect. (On a side note, if you choose an RCBD and expect missing data, you should treat the blocks as random and not fixed. Using simple algebra, it can be shown that in the case of missing data, fixed block analysis leads to confounding, in the sense that you cannot separate the effects of the treatments from the effects of the blocks.) The probabilistic model for the RCBD with random block effects is as follows.
\[\begin{aligned}Y_{ij} \sim N(\mu_{ij}, \sigma^2) \\ \mu_{ij} = \mu + \tau_{i} + b_j \\ b_j \sim N (0, \sigma_{b}^2)\end{aligned}\] The only change in the above model compared to the fixed block RCBD is the statement \(b_j \sim N (0, \sigma_{b}^2)\). This probabilistic statement says two things: first, the block effects are stochastic and not fixed; second, the block effects follow a normal distribution with mean 0 and variance \(\sigma_{b}^2\).
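In PROC GLIMMIX syntax, the fixed vs. random distinction comes down to where the block term goes. Here is a minimal sketch, assuming a hypothetical dataset mydata with a response y, a treatment trt, and a block variable block:

```
/* Fixed block effects: block enters the MODEL statement */
proc glimmix data=mydata;
   class trt block;
   model y = trt block;
run;

/* Random block effects: block moves to a RANDOM statement */
proc glimmix data=mydata;
   class trt block;
   model y = trt;
   random block;    /* b_j ~ N(0, sigma_b^2) */
run;
```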
I will use the excellently laid out steps on page 104 of the SAS for Mixed Models book by Stroup et al. (I had the extreme fortune to take three classes with Professor Walter Stroup at the University of Nebraska - Lincoln and can’t express how grateful I am to him for brilliantly explaining very complex experimental design and generalized linear mixed modeling concepts). Stroup et al. recommend four steps, but I am rephrasing the main ones that are pertinent to the power analysis:
Step 1: Create a dataset that looks similar to the one we will collect at the culmination of the experiment for analysis. We need not worry about the actual values we will observe for the treatments, but we do need an idea of the expected value of the response for each treatment. Don’t worry if you do not know the exact expected values; what really matters is that the differences between the expected values reflect the clinically relevant difference you are hoping to detect. For instance, if you want to detect a difference of 20 between your treatments, you can set the expected value of one treatment to 10 and the other to 30 so that the difference is 20.
Step 2: Compute the test statistics using PROC GLIMMIX or PROC MIXED. Here, we need to provide reasonable estimates for every source of variation in our design. For example, if we are doing the power analysis for a CRD, the only source of variation is the experimental unit variation; but if we are using an RCBD with random block effects, we need to provide variance estimates for both the experimental unit variance and the block variance.
Step 3: Use the F-statistic calculated in Step 2 and the statistical definition of power to calculate the numerical value of power. Contrary to the calculation of p-values, where we assume that the null hypothesis is true, in the calculation of power we assume that the alternative hypothesis is true. When the alternative hypothesis is true, the F-statistic no longer follows a central F-distribution; its distribution is known as a non-central F-distribution and has a non-centrality parameter. This non-centrality parameter can be computed by multiplying the computed F-statistic by the numerator degrees of freedom, and you will need it for the power calculations. You will also need the critical F-value (\(F_{crit}\)), the quantile of the central F-distribution such that the area to its right equals \(\alpha\), the significance level of the test (which is also the probability of rejecting the null hypothesis when it is true). Once you have all of this, you can compute power (remember, we reject the null when the observed F-statistic is greater than the critical value) as \[\text{Power} = P(F_{num_{df},\ den_{df},\ ncp} > F_{crit})\]
Here \(num_{df}\) is the numerator degrees of freedom, \(den_{df}\) is the denominator degrees of freedom, and \(ncp\) is the non-centrality parameter.
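In SAS, both pieces of this calculation are one-liners thanks to the FINV and PROBF functions, which accept an optional non-centrality parameter. Here is a generic sketch; the degrees of freedom and F-value below are placeholders, not results from any particular design:

```
data power_calc;
   alpha  = 0.05;
   num_df = 2;      /* placeholder numerator df   */
   den_df = 9;      /* placeholder denominator df */
   fvalue = 4.0;    /* placeholder observed F     */
   ncp    = num_df * fvalue;                        /* non-centrality parameter */
   fcrit  = finv(1 - alpha, num_df, den_df, 0);     /* central F critical value */
   power  = 1 - probf(fcrit, num_df, den_df, ncp);  /* P(F > Fcrit | HA true)   */
run;

proc print data=power_calc; run;
```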
Now let’s put all of this into action and write some SAS code. Why SAS, one might ask? I am choosing SAS because of the comfort I have with it: we did the entire experimental design sequence of my graduate coursework in the SAS language. Also, as far as I know, PROC GLIMMIX and PROC MIXED are two of the most powerful procedures for experimental design analysis; they have simple syntax and excellent documentation online. But take the last few sentences with a grain of salt because they reflect my biased views. Irrespective of which language you choose (I would still recommend SAS to make life a little easier), following these steps provides a structured way of thinking about and executing power analysis.
Since I do not have SAS on my system, I cannot set up R Markdown to work with executable SAS code, so I will post screenshots of the SAS code and the relevant outputs.
Figure 1 shows the code to create the dataset for the CRD (Step 1).
Figure 1: SAS code to create the CRD dataset
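Since the code itself is a screenshot, here is a sketch of what such a data step could look like. The treatment means (50, 55, 60) and the four replications per treatment are my assumptions, chosen so that pairs of treatments differ by the 5- and 10-unit clinically relevant differences discussed below; the book’s actual program may differ.

```
/* Sketch of an exemplary CRD dataset: the response is set to the     */
/* assumed expected value of each treatment (no random noise needed). */
data crd_exemplary;
   input trt $ mu;
   do rep = 1 to 4;    /* assumed four replications per treatment */
      output;
   end;
   datalines;
A 50
B 55
C 60
;
run;
```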
Figure 2 shows the dataset created using the code in Figure 1:
Figure 2: CRD Dataset View
The PROC GLIMMIX code in Figure 3 shows how to get the F-test information (Step 2). In particular, we get test statistic results for three different hypotheses. Why does it make sense to calculate power for these three hypotheses? Because we have constructed our exemplary dataset in such a way that the null hypothesis in all three is false. (Remember: power is always calculated under the assumption that the alternative is true.)
The hypothesis in (1) is the overall hypothesis that tests for a non-zero difference. The hypotheses in (2) and (3) test for clinically relevant differences of 5 and 10 units, respectively.
\[\begin{align} \tag{1} H_0&: \mu_A = \mu_B = \mu_C \\ H_A&: \mu_i \neq \mu_j \ \text{ for some } i \neq j \end{align}\]
\[\begin{align} \tag{2} H_0&: \mu_B = \mu_C \\ H_A&: \mu_B > \mu_C \end{align}\]
\[\begin{align} \tag{3} H_0&: \mu_C = \mu_A \\ H_A&: \mu_C > \mu_A \end{align}\]
Figure 3: GLIMMIX code for Power Analysis for a CRD
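Here is a sketch of what this GLIMMIX step could look like, continuing with the hypothetical crd_exemplary dataset from above. The PARMS statement with HOLD= fixes the experimental unit variance at an assumed value of 25 instead of estimating it, and ODS OUTPUT captures the F-test tables. (Note that the contrast F-test depends only on the magnitude of the difference, not its direction.)

```
proc glimmix data=crd_exemplary noprofile;  /* noprofile: keep the variance fixed */
   class trt;
   model mu = trt;                 /* overall F-test: hypothesis (1)      */
   parms (25) / hold=1;            /* assumed experimental unit variance  */
   contrast 'B vs C' trt  0 1 -1;  /* 5-unit difference: hypothesis (2)   */
   contrast 'C vs A' trt -1 0  1;  /* 10-unit difference: hypothesis (3)  */
   ods output tests3=F_overall contrasts=F_contrast;
run;
```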
The code in Figure 4 visualizes what is stored in the F_overall and F_contrast datasets that were obtained from PROC GLIMMIX.
Figure 4: Code to visualize the F test details for CRD
Figure 5 shows what the output from the above code looks like.
Figure 5: F test detailed outputs for CRD
Once we have the F-test details for the overall F-test and the specific contrasts of interest, we use a SAS data step to compute the power from its statistical definition.
Figure 6 shows the SAS code for computing power for a CRD design (Step 3).
Figure 6: SAS Data Step for Power computation for CRD
The output in Figure 7 shows the power for the overall test and the contrasts of interest.
Figure 7: Power values for the three hypotheses of interest with a CRD design
We can see above that with four replications per treatment, the power to detect a non-zero difference between at least one pair of treatments is 56%, to detect a five-unit difference is 24%, and to detect a ten-unit difference is 71%. You can change the number of replications in the code to see how it affects the power for the three hypotheses.
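If you would rather automate that exploration than rerun the code by hand, one way (a sketch, reusing my assumed means and variance from above) is to wrap the steps in a macro with the number of replications as a parameter. DATALINES is not allowed inside a macro, so the dataset is built with assignment statements instead:

```
%macro crd_power(reps=4);
   /* Build the exemplary dataset without DATALINES */
   data crd_exemplary;
      array mus{3} _temporary_ (50 55 60);   /* assumed treatment means */
      do t = 1 to 3;
         trt = byte(64 + t);                 /* 'A', 'B', 'C' */
         mu  = mus{t};
         do rep = 1 to &reps;
            output;
         end;
      end;
      drop t;
   run;

   proc glimmix data=crd_exemplary noprofile;
      class trt;
      model mu = trt;
      parms (25) / hold=1;                   /* assumed variance */
      ods output tests3=F_overall;
   run;

   /* Turn the F-test into a power value */
   data power_&reps;
      set F_overall;
      fcrit = finv(0.95, numdf, dendf, 0);
      power = 1 - probf(fcrit, numdf, dendf, numdf * fvalue);
   run;

   proc print data=power_&reps; run;
%mend crd_power;

%crd_power(reps=8)   /* try doubling the replications */
```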
Figure 8 shows the code used for creating the dataset for the RCBD.
Figure 8: SAS code to create the RCBD dataset
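As with the CRD, here is a sketch of what such a data step could look like under my assumed treatment means; the only structural change is that the inner loop now indexes blocks rather than replications:

```
/* Sketch of an exemplary RCBD dataset: 4 blocks x 3 treatments */
data rcbd_exemplary;
   input trt $ mu;
   do block = 1 to 4;   /* assumed four blocks */
      output;
   end;
   datalines;
A 50
B 55
C 60
;
run;
```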
Figure 9 shows the dataset created using the code:
Figure 9: RCBD Dataset View
Notice that every block contains all three treatments, which is why it is called a complete block design.
Figure 10 shows the PROC GLIMMIX code for obtaining the requisite F-test details. It is quite similar to the PROC GLIMMIX code for the CRD, the only addition being the blocking variable in the CLASS and MODEL statements.
Figure 10: GLIMMIX code for Power Analysis for an RCBD
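A sketch of the corresponding GLIMMIX call, under the same assumptions as the CRD sketch; only the two block-related additions differ:

```
proc glimmix data=rcbd_exemplary noprofile;
   class trt block;                /* block added to CLASS                  */
   model mu = trt block;           /* block added to MODEL (fixed blocks)   */
   parms (25) / hold=1;            /* same assumed experimental unit variance */
   contrast 'B vs C' trt  0 1 -1;
   contrast 'C vs A' trt -1 0  1;
   ods output tests3=F_overall_rcbd contrasts=F_contrast_rcbd;
run;
```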
Figure 11 shows the code to obtain what is stored in the F_overall_rcbd and F_contrast_rcbd datasets that were obtained from PROC GLIMMIX.
Figure 11: Code to visualize the F test details for RCBD
Figure 12 shows the output from the code in Figure 11.
Figure 12: F test detailed outputs for RCBD
Figure 13 shows the SAS code for the computation of power for the RCBD design.
Figure 13: SAS Data Step for Power computation for RCBD
Figure 14 shows the power for the overall test and the contrasts of interest for the RCBD design.
Figure 14: Power values for the three hypotheses of interest with an RCBD design
Notice that all the powers decreased even though we used the same experimental unit variance value of 25. This is because in an RCBD we lose degrees of freedom to the estimation of the block effects (you can see in the outputs that the denominator degrees of freedom for the RCBD is 6 as compared to 9 for the CRD; this is because we have 4 blocks, which means 3 degrees of freedom are used for estimating the block effects). This loss of degrees of freedom increases the critical value, making it harder to reject the null hypothesis, thereby bringing down the power. Therefore, unless blocking reduces the experimental unit variance, it will not usually lead to a benefit in terms of power. You can change the experimental unit variance value of 25 in the PARMS statement of the PROC GLIMMIX code to see the effect it has on the power. For example, a value of 20 would make all the powers in the RCBD design higher than those of the CRD design. Therefore, for the RCBD to be better than the CRD, the blocking needs to be done in such a way that it decreases the experimental unit variance from 25 to 20 or lower.
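To see the degrees-of-freedom penalty concretely, you can compare the two critical values directly. A quick sketch: with three treatments the numerator degrees of freedom are 2, and the denominator degrees of freedom are 9 for the CRD and 6 for the RCBD, as noted above.

```
data fcrit_compare;
   fcrit_crd  = finv(0.95, 2, 9, 0);   /* approx. 4.26 */
   fcrit_rcbd = finv(0.95, 2, 6, 0);   /* approx. 5.14 */
run;

proc print data=fcrit_compare; run;
```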
In this post, we directly used the experimental unit variance values without going into much detail on how to obtain them. We also introduced the concept of random block effects but didn’t explore it. In the next post, I will talk in more detail about how to obtain the variance components for the experimental unit variance, and about random block effects and when they are useful.
This post is heavily influenced by the book SAS for Mixed Models - Introduction and Basic Applications by Stroup et al. All of the SAS programs shown in the figures are taken from this book with minor modifications and the addition of comments to explain the programming statements. It is a fascinating book with the right mix of theory and applications in SAS, and I would recommend it to anyone who wants to study experimental design and analysis from a statistical point of view.