a). Backward elimination
Backward elimination is the simplest of all variable selection procedures and can be easily implemented without special software. We start with all the predictors in the model and remove the predictor with the largest p-value that exceeds α_crit (here 0.05). We refit and repeat until every remaining predictor has a p-value below α_crit; at that point all "nonsignificant" predictors have been removed and the selection is complete.
From the regression analysis below, we first remove gleason, which has the largest p-value (0.77503 > 0.05), and rerun the analysis. In the new fit, lcp has the largest p-value (0.25127 > 0.05), so we remove it next. We then remove pgg45 (p-value 0.25331 > 0.05), followed by age (p-value 0.169528 > 0.05), and finally lbph (p-value 0.11213 > 0.05). At that point the remaining predictors (lcavol, lweight and svi) are all significant and the elimination stops.
We notice that the adjusted R^2 of 0.6234 for the full model is reduced only slightly, to 0.6144, in the final model. Thus the removal of five predictors causes only a minor reduction in fit. It is important to understand that the variables omitted from the model may still be related to the response: fitting the five removed predictors on their own (last regression below) shows that lcp, in particular, has a clear association with lpsa. Thus one failing of backward elimination is that it cannot reliably distinguish between important and unimportant predictors.
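As a sketch, the whole procedure can also be automated with a short loop (the names alpha_crit and lmod_auto below are just illustrative); the step-by-step removals that follow do the same thing by hand and end at the same three-predictor model.
library(faraway)                      # provides the prostate data
data(prostate)
alpha_crit <- 0.05
lmod_auto <- lm(lpsa ~ ., prostate)   # start from the full model
repeat {
  pvals <- summary(lmod_auto)$coefficients[-1, "Pr(>|t|)"]  # p-values, intercept excluded
  if (max(pvals) <= alpha_crit) break                       # stop once everything is "significant"
  worst <- names(which.max(pvals))                          # predictor with the largest p-value
  lmod_auto <- update(lmod_auto, as.formula(paste(". ~ . -", worst)))
}
summary(lmod_auto)                    # ends with lcavol, lweight and svi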
library(faraway)
data(prostate)
lmod <- lm(lpsa ~., prostate)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7331 -0.3713 -0.0170 0.4141 1.6381
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.669337 1.296387 0.516 0.60693
## lcavol 0.587022 0.087920 6.677 2.11e-09 ***
## lweight 0.454467 0.170012 2.673 0.00896 **
## age -0.019637 0.011173 -1.758 0.08229 .
## lbph 0.107054 0.058449 1.832 0.07040 .
## svi 0.766157 0.244309 3.136 0.00233 **
## lcp -0.105474 0.091013 -1.159 0.24964
## gleason 0.045142 0.157465 0.287 0.77503
## pgg45 0.004525 0.004421 1.024 0.30886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7084 on 88 degrees of freedom
## Multiple R-squared: 0.6548, Adjusted R-squared: 0.6234
## F-statistic: 20.86 on 8 and 88 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - gleason)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp +
## pgg45, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73117 -0.38137 -0.01728 0.43364 1.63513
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.953926 0.829439 1.150 0.25319
## lcavol 0.591615 0.086001 6.879 8.07e-10 ***
## lweight 0.448292 0.167771 2.672 0.00897 **
## age -0.019336 0.011066 -1.747 0.08402 .
## lbph 0.107671 0.058108 1.853 0.06720 .
## svi 0.757734 0.241282 3.140 0.00229 **
## lcp -0.104482 0.090478 -1.155 0.25127
## pgg45 0.005318 0.003433 1.549 0.12488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7048 on 89 degrees of freedom
## Multiple R-squared: 0.6544, Adjusted R-squared: 0.6273
## F-statistic: 24.08 on 7 and 89 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - lcp)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + pgg45,
## data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.77711 -0.41708 0.00002 0.40676 1.59681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.980085 0.830665 1.180 0.24116
## lcavol 0.545770 0.076431 7.141 2.31e-10 ***
## lweight 0.449450 0.168078 2.674 0.00890 **
## age -0.017470 0.010967 -1.593 0.11469
## lbph 0.105755 0.058191 1.817 0.07249 .
## svi 0.641666 0.219757 2.920 0.00442 **
## pgg45 0.003528 0.003068 1.150 0.25331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7061 on 90 degrees of freedom
## Multiple R-squared: 0.6493, Adjusted R-squared: 0.6259
## F-statistic: 27.77 on 6 and 90 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - pgg45)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83505 -0.39396 0.00414 0.46336 1.57888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.95100 0.83175 1.143 0.255882
## lcavol 0.56561 0.07459 7.583 2.77e-11 ***
## lweight 0.42369 0.16687 2.539 0.012814 *
## age -0.01489 0.01075 -1.385 0.169528
## lbph 0.11184 0.05805 1.927 0.057160 .
## svi 0.72095 0.20902 3.449 0.000854 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7073 on 91 degrees of freedom
## Multiple R-squared: 0.6441, Adjusted R-squared: 0.6245
## F-statistic: 32.94 on 5 and 91 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - age)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + lbph + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82653 -0.42270 0.04362 0.47041 1.48530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.14554 0.59747 0.244 0.80809
## lcavol 0.54960 0.07406 7.422 5.64e-11 ***
## lweight 0.39088 0.16600 2.355 0.02067 *
## lbph 0.09009 0.05617 1.604 0.11213
## svi 0.71174 0.20996 3.390 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7108 on 92 degrees of freedom
## Multiple R-squared: 0.6366, Adjusted R-squared: 0.6208
## F-statistic: 40.29 on 4 and 92 DF, p-value: < 2.2e-16
lmod <- update(lmod, . ~ . - lbph)
summary(lmod)
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + svi, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.72964 -0.45764 0.02812 0.46403 1.57013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.26809 0.54350 -0.493 0.62298
## lcavol 0.55164 0.07467 7.388 6.3e-11 ***
## lweight 0.50854 0.15017 3.386 0.00104 **
## svi 0.66616 0.20978 3.176 0.00203 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7168 on 93 degrees of freedom
## Multiple R-squared: 0.6264, Adjusted R-squared: 0.6144
## F-statistic: 51.99 on 3 and 93 DF, p-value: < 2.2e-16
summary(lm(lpsa ~ gleason + lcp + pgg45 + age + lbph, prostate))  # the five removed predictors fit on their own
##
## Call:
## lm(formula = lpsa ~ gleason + lcp + pgg45 + age + lbph, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0633 -0.5770 -0.0574 0.5861 2.3477
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.615915 1.529515 1.056 0.2935
## gleason 0.102677 0.207699 0.494 0.6222
## lcp 0.398056 0.091097 4.370 3.3e-05 ***
## pgg45 0.002105 0.005904 0.357 0.7222
## age 0.002753 0.014668 0.188 0.8516
## lbph 0.133615 0.072286 1.848 0.0678 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9608 on 91 degrees of freedom
## Multiple R-squared: 0.3433, Adjusted R-squared: 0.3072
## F-statistic: 9.512 on 5 and 91 DF, p-value: 2.527e-07
b). AIC
The leaps package exhaustively searches all possible combinations of the predictors. For each model size p, it finds the subset of variables that produces the minimum RSS. From the analysis, we see that the best one-predictor model uses lcavol, the best two-predictor model adds lweight, and so on. For a least-squares model with p predictors, AIC = n log(RSS/n) + 2(p + 1) up to an additive constant, with n = 97 here. From the figure below, the AIC is minimized by a model with four to five predictors (lcavol, lweight, lbph, svi and possibly age), somewhat larger than the three-predictor model reached by backward elimination.
require(leaps)
## Loading required package: leaps
b <- regsubsets(lpsa ~., prostate)
rs <- summary(b)
rs$which
## (Intercept) lcavol lweight age lbph svi lcp gleason pgg45
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 TRUE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# AIC = n*log(RSS/n) + 2*(p+1) up to a constant; here n = 97 observations and
# the candidate models have 2 to 9 parameters (intercept plus 1 to 8 predictors)
AIC <- 97 * log(rs$rss/97) + (2:9) * 2
plot(1:8, AIC, xlab = "Number of Predictors", ylab = "AIC")
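As a cross-check, base R's step() carries out a backward search guided by AIC. It drops one term at a time rather than examining every subset, but its choice should agree closely with the exhaustive search above; a minimal sketch:
step_mod <- step(lm(lpsa ~ ., prostate), direction = "backward", trace = 0)
summary(step_mod)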
c). Adjusted R^2
The model that the adjusted R^2 criterion selects is model 7, which uses lcavol, lweight, age, lbph, svi, lcp and pgg45 as predictors (every predictor except gleason).
plot(1:8, rs$adjr2, xlab = "Number of Predictors", ylab = "Adjusted R-squared")
which.max(rs$adjr2)
## [1] 7
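As a sketch of what this criterion is doing, adjusted R^2 can be recomputed directly from the RSS values returned by regsubsets, using 1 - (RSS/(n - p - 1))/(TSS/(n - 1)) for a model with p predictors (variable names below are just illustrative):
n <- nrow(prostate)                                  # 97 observations
p <- 1:8                                             # number of predictors in each model
tss <- sum((prostate$lpsa - mean(prostate$lpsa))^2)  # total sum of squares of lpsa
adjr2_byhand <- 1 - (rs$rss/(n - p - 1)) / (tss/(n - 1))
max(abs(adjr2_byhand - rs$adjr2))                    # should be essentially zero
which.max(adjr2_byhand)                              # should again point to model 7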
d). Mallows Cp
One final criterion is Mallows' Cp statistic. A good model should predict well, so the average mean squared error of prediction is a natural criterion, and Cp estimates it (scaled by the error variance of the full model). Models whose Cp values fall on or below the line Cp = p fit adequately; from the plot, model 7 meets this condition.
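The Cp plot referred to above can be produced from the regsubsets summary; a minimal sketch, plotting Cp against the number of parameters with the reference line Cp = p:
plot(2:9, rs$cp, xlab = "Number of Parameters", ylab = "Cp statistic")
abline(0, 1)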
The best model may therefore be model 7, which is picked by both Mallows' Cp and the adjusted R^2 criterion. This model excludes only gleason and has an adjusted R^2 of 0.6273.