Power Analysis with Multiple Regression

Data and Background

Suppose there are two researchers: Researcher 1 and Researcher 2. These two researchers do their own pilot study (N = 30 for each study). The data from Researcher 1 is stored in dat1.txt (or dat1.sav). The data from Researcher 2 is in dat2.txt (or dat2.sav).

Analysis

Let us first read the data of researcher 1 and 2 as follows,

dat1 = read.table(file.choose(), header = TRUE)
dat2 = read.table(file.choose(), header = TRUE)
head(dat1)

##           x1         x2          y
## 1  0.2829038 -0.9622935  0.5785019
## 2  0.6976416 -0.9788763 -0.9911112
## 3  0.1087889 -0.6763632  0.2456311
## 4  1.4218883  0.2334207  0.9166005
## 5 -0.3361515 -0.4413236  0.4254810
## 6  0.5995572  0.5996114  1.4206643

head(dat2)

##            x1         x2          y
## 1  0.12891980  0.9167267  0.2439615
## 2  0.29807332  0.5763166  1.7826756
## 3  0.06258626  0.2355066  0.3947717
## 4 -1.09422800 -1.4970410 -1.6438435
## 5  1.05027245  0.2900778 -0.5219040
## 6 -0.65525488 -0.9462623 -2.0847839

Part 1 - “Analyze pilot data in dat1”

Now the estimates of the regression coefficients are as follows,

fit1 = lm(y~x1+x2, dat1)
summary(fit1)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4305 -0.7631  0.1318  0.7031  2.1560 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.551e-16  1.795e-01   0.000    1.000
## x1           1.000e-01  1.826e-01   0.548    0.588
## x2           3.000e-01  1.826e-01   1.643    0.112
## 
## Residual standard error: 0.9832 on 27 degrees of freedom
## Multiple R-squared:    0.1,  Adjusted R-squared:  0.03333 
## F-statistic:   1.5 on 2 and 27 DF,  p-value: 0.2411

Here observe that the fitted multiple regression is y = -1.551e-16 + 1.000e-01 x1 + 3.000e-01 x2. Also, note that, here multiple R-squared is 0.1. As the R squared is low, so fitted model is not well enough. Also, notice that for the test of significance of the model, p-value > 0.05, so we null hypothesis at level 0.05, so the fitted model is not significant.

Now, the partial R-squared for x2,

summary(lm(y~x2, dat1))

## 
## Call:
## lm(formula = y ~ x2, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5414 -0.7420  0.2039  0.7912  2.2615 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.763e-16  1.772e-01   0.000    1.000
## x2           3.000e-01  1.803e-01   1.664    0.107
## 
## Residual standard error: 0.9708 on 28 degrees of freedom
## Multiple R-squared:   0.09,  Adjusted R-squared:  0.0575 
## F-statistic: 2.769 on 1 and 28 DF,  p-value: 0.1072

Now the partial r squared is,

SS_full = (0.9832^2)*26
SS_reduced = (0.9708^2)*27
multiple_r_square = (SS_reduced - SS_full)/SS_reduced
multiple_r_square

## [1] 0.01228014

Now the f^2 is as follows,

f2 = multiple_r_square/(1 - multiple_r_square)
f2

## [1] 0.01243281

Part 2 - Conduct power analyses based on dat1

First call the library “pwr” as follows,

library(pwr)

Now, for N = 120, the power of the test is,

# P1 = 2
# P2 = 1
pwr.f2.test(2, 120 - 2 - 1, f2)

## 
##      Multiple regression power calculation 
## 
##               u = 2
##               v = 117
##              f2 = 0.01243281
##       sig.level = 0.05
##           power = 0.1740897

Here the df for numerator is 2. Here the df for denominator is 120 - 2 - 1 = 117. Here observe that the power of the test is greater than the significance level, so the test is unbiased. Now, again observe that the power is very far away from 1, so test is not so good.

Part 3 - Analyze pilot data in dat2

Now the estimates of the regression coefficients are as follows,

fit2 = lm(y~x1+x2, dat2)
summary(fit2)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7354 -0.6760 -0.1489  0.7349  1.7125 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.321e-17  1.753e-01   0.000    1.000
## x1           1.000e-01  2.496e-01   0.401    0.692
## x2           3.000e-01  2.496e-01   1.202    0.240
## 
## Residual standard error: 0.96 on 27 degrees of freedom
## Multiple R-squared:  0.142,  Adjusted R-squared:  0.07844 
## F-statistic: 2.234 on 2 and 27 DF,  p-value: 0.1265

Here observe that the fitted multiple regression is y = -5.321e-17 + 1.000e-01 x1 + 3.000e-01 x2. Also, note that, here multiple R-squared is 0.1.As the R squared is low, so fitted model is not well enough. Also, notice that for the test of significance of the model, p-value > 0.05, so we null hypothesis at level 0.05, so the fitted model is not significant.

Now, the partial R-squared for x2,

summary(lm(y~x2, dat2))

## 
## Call:
## lm(formula = y ~ x2, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7347 -0.6442 -0.1434  0.7319  1.6560 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1.313e-16  1.726e-01   0.000   1.0000  
## x2           3.700e-01  1.756e-01   2.107   0.0442 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9455 on 28 degrees of freedom
## Multiple R-squared:  0.1369, Adjusted R-squared:  0.1061 
## F-statistic: 4.441 on 1 and 28 DF,  p-value: 0.04416

Now the partial r squared is,

SS_full_1 = (0.96^2)*26
SS_reduced_1 = (0.9455^2)*27
multiple_r_square1 = (SS_reduced_1 - SS_full_1)/SS_reduced_1
multiple_r_square1

## [1] 0.007274944

Now the f^2 is as follows,

f2_1 = multiple_r_square1/(1 - multiple_r_square1)
f2_1

## [1] 0.007328256

Part 4 - Conduct power analyses based on dat2

Now, for N = 120, the power of the test is,

# P1 = 2
# P2 = 1
pwr.f2.test(2, 120 - 2 - 1, f2_1)

## 
##      Multiple regression power calculation 
## 
##               u = 2
##               v = 117
##              f2 = 0.007328256
##       sig.level = 0.05
##           power = 0.1200965

Part 5 - Compare results from dat1 and dat2

Now first standardize our data as follows,

scaled.dat1 = data.frame(scale(dat1))
scaled.dat2 = data.frame(scale(dat2))
head(scaled.dat1)

##           x1         x2          y
## 1  0.2829038 -0.9622935  0.5785019
## 2  0.6976416 -0.9788763 -0.9911112
## 3  0.1087889 -0.6763632  0.2456311
## 4  1.4218883  0.2334207  0.9166005
## 5 -0.3361515 -0.4413236  0.4254810
## 6  0.5995572  0.5996114  1.4206643

head(scaled.dat2)

##            x1         x2          y
## 1  0.12891980  0.9167267  0.2439615
## 2  0.29807332  0.5763166  1.7826756
## 3  0.06258626  0.2355066  0.3947717
## 4 -1.09422800 -1.4970410 -1.6438435
## 5  1.05027245  0.2900778 -0.5219040
## 6 -0.65525488 -0.9462623 -2.0847839

Now, the first fit the multiple regression for scaled.dat1 as follows,

fit3 = lm(y~x1+x2, scaled.dat1)
summary(fit3)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = scaled.dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.4305 -0.7631  0.1318  0.7031  2.1560 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.446e-17  1.795e-01   0.000    1.000
## x1          1.000e-01  1.826e-01   0.548    0.588
## x2          3.000e-01  1.826e-01   1.643    0.112
## 
## Residual standard error: 0.9832 on 27 degrees of freedom
## Multiple R-squared:    0.1,  Adjusted R-squared:  0.03333 
## F-statistic:   1.5 on 2 and 27 DF,  p-value: 0.2411

So, the fitted multiple regression for the standardized data is y = 3.446e-17 + 1.000e-01 x1 + 3.000e-01 x2 and the fitted multiple regression in part 1 was y = -1.551e-16 + 1.000e-01 x1 + 3.000e-01 x2. So, the regression coefficients didn’t changed, only intercept terms changed. Now, fit the multiple regression for scaled.dat2 as follows,

fit4 = lm(y~x1+x2, scaled.dat2)
summary(fit4)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = scaled.dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7354 -0.6760 -0.1489  0.7349  1.7125 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.078e-17  1.753e-01   0.000    1.000
## x1           1.000e-01  2.496e-01   0.401    0.692
## x2           3.000e-01  2.496e-01   1.202    0.240
## 
## Residual standard error: 0.96 on 27 degrees of freedom
## Multiple R-squared:  0.142,  Adjusted R-squared:  0.07844 
## F-statistic: 2.234 on 2 and 27 DF,  p-value: 0.1265

So, the fitted multiple regression for the standardized data is y = -2.078e-17 + 1.000e-01 x1 + 3.000e-01 x2 and the fitted multiple regression in part 3 was y = -5.321e-17 + 1.000e-01 x1 + 3.000e-01 x2. So, the regression coefficients didn’t changed, only intercept terms changed.

Now observe that the partial f2 in part 1 was 0.01243281 and the in part 3 was 0.007328256. So, the partial f2 of part 1 is higher.

The power for part 2 which is based on dat1 is 0.1740897 and for part 4 which is based on dat2 is 0.1200965 Hence the power based on dat1 is higher.

Actually standardized regression coefficients don’t affect Partial f2 (or partial R2), as they are independent of scaling.

Part 6 - Reflection

Actually when we standardize the data, we here loosing some information about the data i.e. around what the data actually varies or what is the actual mean of the data, so the power of the tests decreases.

Part 7 - Required sample size or minimum detectable effect size

Suppose instead of choosing N = 120, Researcher 1 would like to know the required sample size for detecting a significant effect at 80% power. Assume the estimate of Partial f2 from dat1, and α = .05. So,

pwr.f2.test(u=2, f2 = 0.01243281, power = 0.8)

## 
##      Multiple regression power calculation 
## 
##               u = 2
##               v = 774.9441
##              f2 = 0.01243281
##       sig.level = 0.05
##           power = 0.8

Hence the v = 775, then the required sample size is 775 + 2 + 1 = 778.

Suppose Researcher 2 does not trust their estimate of Partial f2, but would still like to do a future study with N = 120. Then,

pwr.f2.test(2,117, power = 0.8)

## 
##      Multiple regression power calculation 
## 
##               u = 2
##               v = 117
##              f2 = 0.08238326
##       sig.level = 0.05
##           power = 0.8

Hence the estimate of f2 is 0.08238326.

part 8 - Extra credit

Let us find the correlation between x1 and x2 for dat1 as follows,

cor(dat1$x1, dat1$x2)

## [1] -7.31171e-16

Now let us find the correlation between x1 and x2 for dat2 as follows,

cor(dat2$x1, dat2$x2)

## [1] 0.7

Hence observe that for dat2, the correlation is 0.7 which is significantly than dat1. Hence dat2 has higher multicorrelation, thats why dat1 has higher partial f2.