From Allen et al. (2016). A Combined Patient and Provider
Intervention for Management of Osteoarthritis in Veterans: A Randomized
Clinical Trial. Annals of Internal Medicine, 164(2): 73-83.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4732728/
Study Design.
This was a cluster randomized, controlled trial, with primary care
providers (PCPs) assigned to an osteoarthritis intervention group or a
usual care control group. Randomization was computer-generated,
maintained by the study statistician, and stratified on the basis of the
providers’ volume of female patients (<15% vs. ≥15%). We aimed to
enroll 10 patient participants (5 white and 5 nonwhite) from each of 30
PCPs.
Sample Size.
We based our sample size of 300 patient participants on detection of a
moderate effect size of approximately 0.30 for the difference in mean
WOMAC scores between groups, with 80% power and a type I error rate of
0.05. This translates to a 4.2-point difference at 12 months, which is
equivalent to an improvement of approximately 11% from the anticipated
mean baseline score; this allowed sufficient power to detect a
clinically relevant difference (12% to 18%, based on prior relevant
literature) (37–39). We used a 2-sample t test sample size calculation
for the between-group difference at 12 months multiplied by a factor of
1 − ρ², where ρ represents the Pearson correlation between baseline and
follow-up outcome measures (0.60) (40). This sample size was then
adjusted to reflect provider clustering using an intraclass correlation
coefficient of 0.02 (41) and was inflated to compensate for potential
attrition (12%). On the basis of our pilot work, we assumed a mean
baseline WOMAC score of 38 with a standard deviation of 14.
1. An effect of \(d = 0.30 = \frac{\text{Mean Difference}}{SD} = \frac{4.2}{14}\) yields a sample size of \(n_j = 175.385\) per arm (group), rounded up to 176 per arm; \(N = 352\) total subjects. So where does the sample size of 300 come from?
2. The ANCOVA adjustment factor is actually applied to the variance; equivalently, multiply the SD by the square root of the factor: \[SD_C = SD_B\sqrt{1 - \rho^2}\]where \(SD_C\) is the covariate-adjusted SD of the outcome and \(SD_B\) is the assumed baseline SD. In this case, \(SD_B = 14\) and \(\rho = 0.60\), so \(SD_C = 14\sqrt{1 - 0.6^2} = 14(0.8) = 11.2\).
Now an effect size of \(d = \frac{4.2}{11.2} = 0.375\), or equivalently \(\frac{d}{\sqrt{1 - \rho^2}} = \frac{0.3}{0.8} = 0.375\). This yields \(n_j = 112.5967\) per arm, rounded up to 113; \(N = 226\) total subjects for \(80\%\) power at \(\alpha = 0.05\).
3. The adjustment reflects that patients are nested within providers (clustering) and that a random-effects mixed model will be used for the analyses. The adjustment factor is called the Design Effect. The authors state they will “enroll 10 patient participants from each of 30 PCPs.” The Design Effect formula is \[D = 1 + ((C - 1)\times ICC)\] where \(C\) is the average cluster size and \(ICC\) is the Intra-Class (Intra-Cluster) Correlation. In this case, \(C = 10\) and \(ICC = 0.02\): \(D = 1 + ((10-1)(0.02)) = 1.18\). This yields a sample size of \(n_j = 112.5967 \times 1.18 = 132.864\) per arm; \(N = 265.72\) total subjects.
4. Inflating for attrition divides by \(1 - A\), where \(A\) is the anticipated proportion of attrition: final \(N = \frac{N}{1-A}\). Here \(A = 0.12\), so final \(N = \frac{265.72}{1 - 0.12} = 301.96\), rounded to \(302\). The full chain of steps 2 through 4 is sketched in the code below.
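Putting steps 2 through 4 together, a minimal SAS data-step sketch of the arithmetic (the numbers are from the text; this is not the authors' code):

data chain;
  n_arm   = 112.5967;              /* step 2: ANCOVA-adjusted n per arm     */
  deff    = 1 + (10 - 1)*0.02;     /* step 3: design effect, C=10, ICC=0.02 */
  n_clust = n_arm * deff;          /* 132.864 per arm                       */
  n_total = 2 * n_clust;           /* 265.72 total                          */
  n_final = n_total / (1 - 0.12);  /* step 4: attrition -> 301.96, ~302     */
  put n_clust= n_total= n_final=;
run;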
We based our sample size of 300 patient participants on detection of a moderate effect size of approximately \(d = 0.30\) for the difference in mean WOMAC scores between groups, with \(80\%\) power and a type I error rate of \(\alpha = 0.05\). This translates to a 4.2-point difference at 12 months, which is equivalent to an improvement of approximately \(11\%\) from the anticipated mean baseline score (since \(11\% = 4.2/38\)); this allowed sufficient power to detect a clinically relevant difference (12% to 18%, based on prior relevant literature) (37–39).
1. An effect of \(d = 0.30 = \frac{\text{Mean Difference}}{SD} = \frac{4.2}{14}\) yields a sample size of \(n_j = 175.385\) per arm (group), rounded up to 176 per arm; \(N = 352\) total subjects. So where does the sample size of 300 come from?
The POWER Procedure
Two-Sample t Test for Mean Difference (standardized units)

Fixed Scenario Elements
    Distribution          Normal
    Method                Exact
    Alpha                 0.05
    Mean Difference       0.3
    Standard Deviation    1
    Nominal Power         0.8
    Number of Sides       2
    Null Difference       0

Computed Ceiling N per Group
    Fractional N per Group    Actual Power    Ceiling N per Group
    175.384669                0.801           176

The POWER Procedure
Two-Sample t Test for Mean Difference (raw units)

Fixed Scenario Elements
    Distribution          Normal
    Method                Exact
    Alpha                 0.05
    Mean Difference       4.2
    Standard Deviation    14
    Nominal Power         0.8
    Number of Sides       2
    Null Difference       0

Computed Ceiling N per Group
    Fractional N per Group    Actual Power    Ceiling N per Group
    175.384669                0.801           176
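The handout shows only the output; a minimal PROC POWER call that should reproduce the raw-units run (assumed syntax, not the authors' actual code) is:

proc power;
  twosamplemeans test=diff
    meandiff  = 4.2    /* anticipated 12-month difference */
    stddev    = 14     /* assumed baseline SD             */
    alpha     = 0.05
    power     = 0.8
    npergroup = .;     /* solve for n per group           */
run;

Setting meandiff = 0.3 and stddev = 1 instead gives the standardized-units run.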
2. The ANCOVA adjustment factor is actually applied to the variance; equivalently, multiply the SD by the square root of the factor:
\[ SD_C = SD_B \sqrt{1 - \rho^2} \]
where \(SD_C\) is the covariate-adjusted SD of the outcome and \(SD_B\) is the assumed baseline SD.
In this case, \(SD_B = 14\) and \(\rho = 0.60\), so:
\[ SD_C = 14\sqrt{1 - 0.6^2} = 14(0.8) = 11.2 \]
Now an effect size of \(d = \frac{4.2}{11.2} = 0.375\), which yields \(n_j = 112.5967\) per arm; \(N = 226\) total subjects for 80% power at \(\alpha = 0.05\).
So where does the sample size of 300 come from?
\(n_j = 112.5967\) per arm; \(N = 226\) total subjects for 80% power at \(\alpha = 0.05\).
The POWER Procedure
Two-Sample t Test for Mean Difference (standardized units)

Fixed Scenario Elements
    Distribution          Normal
    Method                Exact
    Alpha                 0.05
    Mean Difference       0.375
    Standard Deviation    1
    Nominal Power         0.8
    Number of Sides       2
    Null Difference       0

Computed Ceiling N per Group
    Fractional N per Group    Actual Power    Ceiling N per Group
    112.596695                0.801           113

The POWER Procedure
Two-Sample t Test for Mean Difference (raw units)

Fixed Scenario Elements
    Distribution          Normal
    Method                Exact
    Alpha                 0.05
    Mean Difference       4.2
    Standard Deviation    11.2
    Nominal Power         0.8
    Number of Sides       2
    Null Difference       0

Computed Ceiling N per Group
    Fractional N per Group    Actual Power    Ceiling N per Group
    112.596695                0.801           113
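Again, the assumed call behind the covariate-adjusted run:

proc power;
  twosamplemeans test=diff
    meandiff  = 4.2
    stddev    = 11.2   /* 14 * sqrt(1 - 0.6**2) */
    alpha     = 0.05
    power     = 0.8
    npergroup = .;
run;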
SIDE NOTE on the COVARIANCE ADJUSTMENT (Borm et al., 2007)
When calculating power or sample size for pre–post designs there are two different covariance adjustments that can be applied.
Difference Score Covariance Adjustment (ANOVA model):
\[
(Y_{\text{FOLLOW}} - Y_{\text{BASE}}) = \beta_0 + \beta_G G +
\varepsilon
\]
Assume pre (baseline) and post (follow-up) have the same
variance:
\[
S_B^2 = S_F^2 = S_{*}^2
\]
Then the variance of the difference score is:
\[
\begin{aligned}
S_D^2 &= S_B^2 + S_F^2 - 2\,\text{COV}(B,F) \\
&= S_B^2 + S_F^2 - 2 r_{BF} S_B S_F \\
&= S_{*}^2 + S_{*}^2 - 2 r_{BF} S_{*} S_{*} \\
&= 2S_{*}^2 - 2 r_{BF} S_{*}^2 \\
&= 2S_{*}^2(1 - r_{BF})
\end{aligned}
\]
Current Example:
\[
\begin{aligned}
S_D^2 &= 2 \times 14^2 (1 - 0.6) = 156.8 \\
S_D &= 12.522 \\
d &= \frac{4.2}{12.522} = 0.3354
\end{aligned}
\]
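A hedged PROC POWER sketch under the difference-score adjustment (an assumed call; the handout itself only shows ANCOVA runs):

proc power;
  twosamplemeans test=diff
    meandiff  = 4.2
    stddev    = 12.522   /* sqrt(2 * 14**2 * (1 - 0.6)) */
    alpha     = 0.05
    power     = 0.8
    npergroup = .;       /* exceeds the 112.6 per arm from the ANCOVA run */
run;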
ANCOVA Covariance Adjustment (ANCOVA model):
\[
\hat{Y}_{\text{FOLLOW}} = \beta_0 + \beta_G G + \beta_B Y_{\text{BASE}}
\]
The slope for the baseline covariate is a function of the baseline–follow-up correlation \(r_{BF}\).
Assume that group (\(G\)) and baseline (\(X = Y_{\text{BASE}}\)) are uncorrelated (\(r_{GX} = 0\)), which is a reasonable assumption when groups are randomized. In any finite sample, however, the correlation is almost never exactly zero, which is one reason we design for 80% or 90% power (to allow for random fluctuation).
Since we assume \(r_{GX} = 0\):
\[
\beta_B = r_{BF}\left(\frac{S_F}{S_B}\right)
\]
If \(S_B^2 = S_F^2 = S_{*}^2\), then \(\beta_B = r_{BF}\).
Thus, the ANCOVA model becomes:
\[
\begin{aligned}
Y_{\text{FOLLOW}} &= \beta_0 + \beta_G G + \beta_B Y_{\text{BASE}} +
e \\
&= \beta_0 + \beta_G G + r_{BF} Y_{\text{BASE}} +
e \\
Y_{\text{FOLLOW}} - r_{BF} Y_{\text{BASE}} &= \beta_0 + \beta_G G +
e
\end{aligned}
\]
(The rearrangement above is conceptual.)
The variance of \(\hat{Y}_F\)
reduces as a function of the pre–post correlation:
\[
S_{\hat{Y}}^2 = S_{*}^2(1 - r_{BF}^2)
\]
A more formal treatment is presented below.
Starting from the adjusted ANCOVA model: \[ Y_{\text{FOLLOW}} - r_{BF} Y_{\text{BASE}} = \beta_0 + \beta_G G + e \]
Assume \(S_B^2 = S_F^2 = S_{*}^2 \;\Rightarrow\; S_B = S_F = S_{*}\).
Expanding the quadratic form: \[ (Y_F - r_{BF} Y_B)'(Y_F - r_{BF} Y_B) = Y_F^2 + r_{BF}^2 Y_B^2 - 2r_{BF} Y_B Y_F \]
After summation: \[ \Sigma Y_F^2 + r_{BF}^2 \Sigma Y_B^2 - 2r_{BF} \Sigma Y_B Y_F \]
In variance terms: \[ S_F^2 + r_{BF}^2 S_B^2 - 2r_{BF} S_{BF} \]
With the assumption \(S_B^2 = S_F^2 = S_{*}^2\), we have \[ S_{BF} = r_{BF} S_B S_F = r_{BF} S_{*} S_{*} = r_{BF} S_{*}^2. \]
Substituting \(r_{BF} S_{*}^2\) for \(S_{BF}\) gives \[ S_{*}^2 + r_{BF}^2 S_{*}^2 - 2 r_{BF} r_{BF} S_{*}^2 \]
Continuing on with some factoring, we have \[ \begin{aligned} S_{*}^2 + r_{BF}^2 S_{*}^2 - 2 r_{BF}^2 S_{*}^2 &= S_{*}^2\,(1 + r_{BF}^2 - 2 r_{BF}^2) \\ &= S_{*}^2\,(1 - r_{BF}^2). \end{aligned} \]
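Comparing the two adjustments for the working example (\(r_{BF} = 0.6\)): required sample size is proportional to the outcome variance, so
\[
\frac{S_{*}^2(1 - r_{BF}^2)}{2S_{*}^2(1 - r_{BF})} = \frac{1 - 0.6^2}{2(1 - 0.6)} = \frac{0.64}{0.80} = 0.80,
\]
meaning the ANCOVA design needs only 80% as many subjects as the difference-score design. In general, \(1 - r_{BF}^2 = (1 - r_{BF})(1 + r_{BF}) < 2(1 - r_{BF})\) whenever \(r_{BF} < 1\), so ANCOVA is never less efficient.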
This sample size was then adjusted to reflect provider clustering using an intraclass correlation coefficient of 0.02 (Batistatou et al., 2014).
The Design Effect Formula is:
\[
D = 1 + ((C - 1)\, ICC)
\]
where \(C\) is the average cluster size
and \(ICC\) is the Intra-Class
(Cluster) Correlation.
Based on this formula, for a fixed total sample size, many small clusters will yield more power than a few large clusters (smaller \(C\) gives a smaller design effect).
ICC is a measure of the relatedness of clustered
data. It accounts for this relatedness by comparing the
variance within clusters with the variance between clusters:
\[
ICC = \frac{S_B^2}{S_B^2 + S_W^2}
\]
where \(S_B^2\) =
Between Cluster Variance and \(S_W^2\) = Within
Cluster Variance.
The authors state they will “enroll 10 patient participants from each of 30 PCPs.”
Example values of Design Effect D:
    C     ICC     D
    1     1       1.00
    2     0.05    1.05
    5     0.05    1.20
    10    0.05    1.45
    20    0.05    1.95
    2     0.02    1.02
    5     0.02    1.08
    10    0.02    1.18
    20    0.02    1.38
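A small SAS sketch that regenerates the non-degenerate rows of this table from the formula:

data deff;
  do icc = 0.02, 0.05;
    do c = 2, 5, 10, 20;
      d = 1 + (c - 1)*icc;   /* design effect */
      output;
    end;
  end;
run;

proc print data=deff noobs; run;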
Interpretation for RCTs: the larger the design effect, the more subjects are needed to preserve power; for a fixed total \(N\), adding clusters is more efficient than enlarging them.
This sample size was then adjusted to reflect provider clustering using an intraclass correlation coefficient of 0.02 (41) and was inflated to compensate for potential attrition (12%).
The authors state they will “enroll 10 patient participants from
each of 30 PCPs.”
In this case \(C = 10\) and \(ICC = 0.02\).
\[ D = \big(1 + ((10 - 1)(0.02))\big) = 1.18 \]
Yielding a sample size of:
\[
n_j = (112.5967 \times 1.18) = 132.864 \ \text{per arm}; \quad N =
265.72 \ \text{Total.}
\]
This was then inflated to compensate for potential attrition (12%).
4. The inflation for attrition divides the sample size by \(1 - A\), where \(A\) is the anticipated proportion of attrition:
\[
N_{\text{final}} = \frac{N}{1 - A}
\]
Here \(A = 0.12\). So:
\[
\text{Final } N = \frac{265.72}{1 - 0.12} = 301.96 \ \ (\text{rounded to } 302).
\]
Most likely the authors simply rounded down to an even 300 subjects, or used slightly different (rounded) values of the SD and ICC in their calculation.
Working backwards
1. Take the total sample size and multiply by the attrition factor \[ N(1 - A) = 300 \times (1 - 0.12) = 264 \]
2. Divide by the clustering design effect \[ \frac{N(1 - A)}{D} = \frac{300 \times (1 - 0.12)}{1.18} = \frac{264}{1.18} = 223.729 \]
3. Find the effect size that gives 80% power with 223.729 total subjects; it should be close to the covariate-adjusted \(d = 0.375\):
SAS output (sweep of MeanDiff values):

    Obs    MeanDiff    Power
    124    0.37623     0.7999874329
    125    0.37624     0.8000082792
4. Multiply \(d^*\) by the covariate adjustment factor:
\[
d = d^* \sqrt{1 - \rho^2} = 0.37624 \times \sqrt{1 - 0.6^2} = 0.37624 \times 0.8 = 0.300992,
\]
recovering the stated effect size of approximately \(0.30\).
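The sweep output in step 3 suggests a grid search over candidate mean differences. A hedged SAS sketch of such a data step (exact noncentral-t power for a two-sample t test in standardized units; the grid start and step are assumptions):

data backsolve;
  n_arm = 223.729 / 2;                  /* per-arm n from step 2        */
  df    = 2*n_arm - 2;
  tcrit = tinv(1 - 0.05/2, df);         /* two-sided critical value     */
  do meandiff = 0.375 to 0.377 by 0.00001;
    ncp   = meandiff * sqrt(n_arm/2);   /* noncentrality, SD = 1        */
    power = 1 - probt(tcrit, df, ncp) + probt(-tcrit, df, ncp);
    output;
  end;
run;

proc print data=backsolve; run;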
References
Borm GF, Fransen J, Lemmens WA. A simple sample size formula for analysis of covariance in randomized clinical trials. J Clin Epidemiol. 2007;60:1234–1238. https://pubmed.ncbi.nlm.nih.gov/17998077/
Donner A, Klar N. Design and Analysis of Cluster Randomized Trials in Health Research. New York: Oxford University Press; 2000.
Power for Unequal Variances and Unequally Sized Treatment Arms
Batistatou E, Roberts C, Roberts S. Sample size and power calculations for trials and quasi-experimental studies with clustering. The Stata Journal. 2014;14(1):159–175. https://journals.sagepub.com/doi/pdf/10.1177/1536867X1401400111
Power for Unequal Cluster Sizes (variation in cluster size can affect power)
Guittet L, Ravaud P, Giraudeau B. Planning a cluster randomized trial with unequal cluster sizes: practical issues involving continuous outcomes. BMC Med Res Methodol. 2006;6:17. https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-6-17
Kerry SM, Bland JM. Unequal cluster sizes for trials in English and Welsh general practice: implications for sample size calculations. Stat Med. 2001;20:377–390. https://onlinelibrary.wiley.com/doi/epdf/10.1002/1097-0258%2820010215%2920%3A3%3C377%3A%3AAID-SIM799%3E3.0.CO%3B2-N
From Bennell et al. (2019). Comparison of weight-bearing functional exercise and non-weight-bearing quadriceps strengthening exercise on pain and function for people with knee osteoarthritis and obesity: protocol for the TARGET randomized controlled trial. BMC Musculoskeletal Disorders, 20:291. https://doi.org/10.1186/s12891-019-2662-5
Trial sample size
The sample size was calculated based on both primary outcomes of pain and function. For an effect size of 0.5, power 80% and two-sided significance level 0.05, with a correlation between pre- and post-measurements of 0.45 for pain, 51 participants per arm will be required (using analysis of covariance including baseline pain measurement as a covariate). To account for 20% loss to follow up, sample size will be increased to 64 per arm, for a total of 128. This gives power of 83% to detect an effect size of 0.5 for function with a correlation between pre- and post-measurements of 0.49 and a two-sided significance level of 0.05.
Can you reproduce these values?
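A hedged attempt, mirroring the ANCOVA adjustment used above (the adjusted SDs \(\sqrt{1 - 0.45^2} \approx 0.893\) and \(\sqrt{1 - 0.49^2} \approx 0.872\) are hand-computed; the calls are assumptions, not the trial's actual code):

/* Pain: solve for n per group with the ANCOVA-adjusted SD */
proc power;
  twosamplemeans test=diff
    meandiff  = 0.5
    stddev    = 0.893    /* sqrt(1 - 0.45**2) */
    alpha     = 0.05
    power     = 0.8
    npergroup = .;       /* expect about 51 per arm */
run;

/* Function: solve for power with 51 analyzable per arm */
proc power;
  twosamplemeans test=diff
    meandiff  = 0.5
    stddev    = 0.872    /* sqrt(1 - 0.49**2) */
    alpha     = 0.05
    npergroup = 51
    power     = .;       /* expect a value close to the reported 83% */
run;

With 20% loss to follow-up, 51 / (1 - 0.2) = 63.75, rounded up to 64 per arm, or 128 in total.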