We reviewed why Multicollinearity is a problem for multiple reasons.
1) point estimates become unstable and differ from sample to sample.
2) test statistics shrink closer to 0 and it becomes more difficult to reject the null hypothesis.
3) Confidence intervals widen and are less helpful.
We discussed the ways to analyze and decide the best amount of predictors.
1) t testing for nested models where one model is a proper subset of another.
2) partial f testing for nested models using the anova() command.
3) adjusted R^2 (but we don’t actually use this so… nevermind)
4) AIC = 2* (k +1) - 2 * log likelihood to compare non-nested and the results can be compared to other model’s AIC
5) Mallow CP which is callculated as the SSE of subset / MSE of all predictors
Using the snowfall runoff data, crease a model for runoff using some of the snowfall measurements: include forward selection and backward elimination.
USing the water data set, we will start with the more reasonable method for this group of variables - backward elimination.
library(alr3)
## Loading required package: car
data(water)
Name the full model with all predictors included watermod.
watermod <- lm(BSAAM~Year+APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, data=water)
summary(watermod)
##
## Call:
## lm(formula = BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC +
## OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12772.4 -5164.9 -360.6 4379.1 16807.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -227814.8 197920.2 -1.151 0.25752
## Year 123.9 100.6 1.232 0.22621
## APMAM 143.4 715.2 0.200 0.84228
## APSAB -546.0 1515.1 -0.360 0.72074
## APSLAKE 1885.0 1368.1 1.378 0.17699
## OPBPC 76.6 458.4 0.167 0.86827
## OPRC 2081.5 650.7 3.199 0.00293 **
## OPSLAKE 2055.0 758.1 2.711 0.01033 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7503 on 35 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.9136
## F-statistic: 64.4 on 7 and 35 DF, p-value: < 2.2e-16
I will eliminate OPBPC, APSAB, and APMAM first since they have super high p-values and I don’t thinnk they’re good predictors.
watermod2 <- lm(BSAAM~Year+APSLAKE+OPRC+OPSLAKE, data=water)
summary(watermod2)
##
## Call:
## lm(formula = BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12722.2 -5121.2 -248.7 3903.5 16807.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -227559.31 185510.06 -1.227 0.22749
## Year 123.66 94.39 1.310 0.19804
## APSLAKE 1597.29 503.72 3.171 0.00300 **
## OPRC 2009.88 585.52 3.433 0.00146 **
## OPSLAKE 2206.53 464.59 4.749 2.9e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7218 on 38 degrees of freedom
## Multiple R-squared: 0.9276, Adjusted R-squared: 0.92
## F-statistic: 121.8 on 4 and 38 DF, p-value: < 2.2e-16
We will compare this using the partial f test method since it’s nested models. I use anova() command to see if we should eliminate these three predictors or not.
anova(watermod,watermod2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 38 1979541875 -3 -9141602 0.0541 0.9831
Since it is a high p-value, we will not reject the null that the reduced model, watermod2, is better. Let’s try getting rid of one other of the predictors, year, and see if it is better
watermod3 <- lm(BSAAM~APSLAKE+OPRC+OPSLAKE, data=water)
summary(watermod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
anova(watermod2,watermod3)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 1979541875
## 2 39 2068947585 -1 -89405710 1.7163 0.198
This says we also can’t reject the null that our model is better without the year predictor.
Another way to check is using a different method of comparison.
I’ll try looking at the AIC to compare:
AIC(watermod)
## [1] 898.5617
AIC(watermod2)
## [1] 892.7608
AIC(watermod3)
## [1] 892.6603
The AIC is not a large difference between the 3; but especially between the model 2 and 3, where we got rid of Year as a predictor. In this case, watermod 3 is the predictor you would use, because you can only justify using a more complicated model if the AIC drops more than 10. This is the same conclusion we came to in the anova test.
start with OPSLAKE
wmod <- lm(BSAAM ~ OPSLAKE, data=water)
summary(wmod)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17603.8 -5338.0 332.1 3410.6 20875.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27014.6 3218.9 8.393 1.93e-10 ***
## OPSLAKE 3752.5 215.7 17.394 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
## F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
I will now add another predictor, APSLAKE.
wmod2 <- lm(BSAAM~ APSLAKE + OPSLAKE, data=water)
summary(wmod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13335.8 -5893.2 -171.8 4219.5 19500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19144.9 3812.0 5.022 1.1e-05 ***
## APSLAKE 1768.8 553.7 3.194 0.00273 **
## OPSLAKE 3689.5 196.0 18.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
## F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16
Comparing these two models using anova() yields:
anova(wmod,wmod2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 3264010454
## 2 40 2600641788 1 663368666 10.203 0.002734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We reject the null that the model with one predictor is better, so we should use APSLAKE as well as OPSLAKE.
We will continue the pattern by adding another predictor, OPRC:
wmod3 <- lm(BSAAM~ APSLAKE + OPRC+ OPSLAKE, data=water)
summary(wmod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
Now we must check if it’s a better model with the third predictor using anova() again.
anova(wmod3,wmod2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 2068947585
## 2 40 2600641788 -1 -531694203 10.023 0.002998 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We will use the model with 3 predictors since our pvalue is so small, not the model with only 2, because we can reject the null that the beta for OPRC is = 0.
AIC(wmod)
## [1] 908.2647
AIC(wmod2)
## [1] 900.4951
AIC(wmod3)
## [1] 892.6603
The smallest AIC is third model, but the difference between each of the models is not >10 , and we were instructed to only involve a more complicated model if AIC(wmod2)-AID(wmod3) is >10. The conclusion we will draw, then, is that the best model is wmod, with only 1 predictor. This is a different conclusion than what we drew for the anova() test, and this can happen.
After all that work, we were shown a much easier way to compare AIC for this model. We can use stepAIC() function to get rid of each unhelpful predictor.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(watermod,direction="backward")
## Start: AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1571591 1971971864 772.57
## - APMAM 1 2262194 1972662466 772.58
## - APSAB 1 7311109 1977711381 772.69
## - Year 1 85430461 2055830733 774.36
## <none> 1970400272 774.53
## - APSLAKE 1 106880993 2077281265 774.80
## - OPSLAKE 1 413707192 2384107464 780.73
## - OPRC 1 576007855 2546408128 783.56
##
## Step: AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 2626064 1974597927 770.62
## - APSAB 1 6887325 1978859189 770.72
## - Year 1 85160498 2057132362 772.39
## <none> 1971971864 772.57
## - APSLAKE 1 105315871 2077287734 772.80
## - OPRC 1 574517654 2546489517 781.56
## - OPSLAKE 1 964675516 2936647380 787.69
##
## Step: AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 4943947 1979541875 768.73
## - Year 1 82535451 2057133378 770.39
## <none> 1974597927 770.62
## - APSLAKE 1 127432687 2102030614 771.31
## - OPRC 1 575963916 2550561844 779.63
## - OPSLAKE 1 968394770 2942992697 785.78
##
## Step: AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - Year 1 89405710 2068947585 768.63
## <none> 1979541875 768.73
## - APSLAKE 1 523812582 2503354457 776.83
## - OPRC 1 613807319 2593349193 778.35
## - OPSLAKE 1 1175063776 3154605651 786.77
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
according to this backward method, the same conclusion we drew earlier is the result here: use OPRC, APSLAKE, and OPSLAKE as predictors.
stepAIC(watermod,direction="forward")
## Start: AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Call:
## lm(formula = BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC +
## OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) Year APMAM APSAB APSLAKE
## -227814.8 123.9 143.4 -546.0 1885.0
## OPBPC OPRC OPSLAKE
## 76.6 2081.5 2055.0
This was the code Knudson typed for the forward stepAIC, but I cannot seem to make it run. I’m not sure what’s wrong with my code but in class it worked for her.
#simple<- lm(BSAAM~APSLAKE,data=water)
#stepAIC(simple, direction="forward", scope= list(upper = BSAAM~ OPSLAKE + OPRC, lower = BASAAM~1)