There are 3 strategies we learned in section 5.1 to building a good linear model. They are forward selection, backward elimination, and all subsets regression.
With forward selection we start by looking at simple linear models of our response with each of our possible predictors. We take the predictor that fits the best and eliminate any that don’t fit. We then make new models with our response using the first accepted predictor and each of the variables that remain. We repeat this process until we find that none of the remaining predictors are good fits.
In backward elimination we start by using a single model with all of our possible predictors. We then eliminate the one that is the worst fit, assuming it is not a significant predictor. We repeat this until all of the remaining predictors are found to be significant.
In the case of all subsets regression, we make models for every possible combination of predictors and use the one that fits best. As you might imagine, this process takes a ridiculous amount of time if there are even more than a couple variables so for our example I’ll just show the first two processes.
The tools we use to determine significance of each predictor in these processes are t-tests, partial F-tests, AIC, and Mallow’s Cp. We have already learned how to do t-tests and partial F-tests but the other two are new. AIC is a measurement that balances the number of predictors with the model fit, thus penalizing models with large amounts of predictors. The smaller the AIC value, the better. Mallow’s Cp is calculated by finding the squared residuals of a chosen subset over the mean squared error of the model with every predictor. This value is also one that we would like to minimize. When comparing different models, as is done while creating a model, we can use any of the four methods if the models are nested in our large model. However, if they are not, we can only use AIC and Mallow’s Cp to compare the models.
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
summary(water)
## Year APMAM APSAB APSLAKE
## Min. :1948 Min. : 2.700 Min. : 1.450 Min. : 1.77
## 1st Qu.:1958 1st Qu.: 4.975 1st Qu.: 3.390 1st Qu.: 3.36
## Median :1969 Median : 7.080 Median : 4.460 Median : 4.62
## Mean :1969 Mean : 7.323 Mean : 4.652 Mean : 4.93
## 3rd Qu.:1980 3rd Qu.: 9.115 3rd Qu.: 5.685 3rd Qu.: 5.83
## Max. :1990 Max. :18.080 Max. :11.960 Max. :13.02
## OPBPC OPRC OPSLAKE BSAAM
## Min. : 4.050 Min. : 4.350 Min. : 4.600 Min. : 41785
## 1st Qu.: 7.975 1st Qu.: 7.875 1st Qu.: 8.705 1st Qu.: 59857
## Median : 9.550 Median :11.110 Median :12.140 Median : 69177
## Mean :12.836 Mean :12.002 Mean :13.522 Mean : 77756
## 3rd Qu.:16.545 3rd Qu.:14.975 3rd Qu.:16.920 3rd Qu.: 92206
## Max. :43.370 Max. :24.850 Max. :33.070 Max. :146345
We are working with the water date, trying to relate snowfall to runoff. We want to create a model using only efficient predictors. We will start with the forward selection, using t-tests to compare models.
mod1 <- lm(BSAAM ~ APMAM)
mod2 <- lm(BSAAM ~ APSAB)
mod3 <- lm(BSAAM ~ APSLAKE)
mod4 <- lm(BSAAM ~ OPBPC)
mod5 <- lm(BSAAM ~ OPRC)
mod6 <- lm(BSAAM ~ OPSLAKE)
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37043 -16339 -5457 17158 72467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63364 9917 6.389 1.21e-07 ***
## APMAM 1965 1249 1.573 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared: 0.05692, Adjusted R-squared: 0.03391
## F-statistic: 2.474 on 1 and 41 DF, p-value: 0.1234
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41314 -16784 -5101 16492 70942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67152 9689 6.931 2.06e-08 ***
## APSAB 2279 1909 1.194 0.239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared: 0.0336, Adjusted R-squared: 0.01003
## F-statistic: 1.425 on 1 and 41 DF, p-value: 0.2394
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46438 -16907 -5661 19028 69464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63864 9249 6.905 2.25e-08 ***
## APSLAKE 2818 1709 1.649 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared: 0.06217, Adjusted R-squared: 0.0393
## F-statistic: 2.718 on 1 and 41 DF, p-value: 0.1069
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21183 -7298 -819 4731 38430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40017.4 3589.1 11.15 5.47e-14 ***
## OPBPC 2940.1 240.6 12.22 3.00e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7793
## F-statistic: 149.3 on 1 and 41 DF, p-value: 2.996e-15
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741.4 4044.1 5.376 3.32e-06 ***
## OPRC 4667.3 311.3 14.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.842
## F-statistic: 224.7 on 1 and 41 DF, p-value: < 2.2e-16
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17603.8 -5338.0 332.1 3410.6 20875.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27014.6 3218.9 8.393 1.93e-10 ***
## OPSLAKE 3752.5 215.7 17.394 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
## F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
After finding the p-values for each of the individual predictors, we have found that the OPSLAKE predictor is the best. It can also be seen that each of the predictors starting with AP are insignificant on their own, so we have no need to test them further. We can continue now by adding the two remaining predictors to OPSLAKE.
mod14 <- lm(BSAAM ~ OPSLAKE + OPBPC)
mod15 <- lm(BSAAM ~ OPSLAKE + OPRC)
summary(mod14)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17591.0 -5276.6 275.6 3380.7 20867.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27050.95 3540.07 7.641 2.44e-09 ***
## OPSLAKE 3736.16 658.24 5.676 1.35e-06 ***
## OPBPC 14.37 546.41 0.026 0.979
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8747
## F-statistic: 147.6 on 2 and 40 DF, p-value: < 2.2e-16
summary(mod15)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
These summaries show us that OPBPC is insignificant with OPSLAKE but OPRC is not, so our model will have OPRC and OPSLAKE as predictors. Now we can try the backward elimination method while still using t-tests as our tests of significance.
bigmod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(bigmod)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
This test showed us that APMAM was our worst predictor so we eliminate it and move on with the rest of the predictors.
cincomod <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(cincomod)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
This time we eliminate OPBPC.
backmod4 <- lm(BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE)
summary(backmod4)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## APSAB -650.6 1392.8 -0.467 0.643055
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## OPRC 1910.2 622.3 3.070 0.003942 **
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
This time we eliminate APSAB.
backmod3 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE)
summary(backmod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
All of our remaining predictors have been found to be significant so after backwards elimination we have a model with APSLAKE, OPRC, and OPSLAKE as predictors. Notice this is slightly different than what we obtained using forward selection.
Next we will try to use the stepAIC function in the MASS package to do the same thing but with AIC as our test for significance.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(bigmod)
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
R does everything for us and came out with predictors APSLAKE, OPRC, and OPSLAKE. Now we can use this same function to go backwards.
simplemod <- lm(BSAAM ~ 1)
stepAIC(simplemod, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE))
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
The backwards elimination method using AIC comes out with the same predictors.