Multicollinearity

We reviewed why Multicollinearity is a problem for multiple reasons.
1) point estimates become unstable and differ from sample to sample.
2) test statistics shrink closer to 0 and it becomes more difficult to reject the null hypothesis.
3) Confidence intervals widen and are less helpful.

Predictors

We discussed the ways to analyze and decide the best amount of predictors.
1) t testing for nested models where one model is a proper subset of another.
2) partial f testing for nested models using the anova() command.
3) adjusted R^2 (but we don’t actually use this so… nevermind)
4) AIC = 2* (k +1) - 2 * log likelihood to compare non-nested and the results can be compared to other model’s AIC
5) Mallow CP which is callculated as the SSE of subset / MSE of all predictors

Model Building

Using the snowfall runoff data, crease a model for runoff using some of the snowfall measurements: include forward selection and backward elimination.

Backward elimination

USing the water data set, we will start with the more reasonable method for this group of variables - backward elimination.

library(alr3)
## Loading required package: car
data(water)

Name the full model with all predictors included watermod.

watermod <- lm(BSAAM~Year+APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, data=water)
summary(watermod)
## 
## Call:
## lm(formula = BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + 
##     OPRC + OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12772.4  -5164.9   -360.6   4379.1  16807.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -227814.8   197920.2  -1.151  0.25752   
## Year            123.9      100.6   1.232  0.22621   
## APMAM           143.4      715.2   0.200  0.84228   
## APSAB          -546.0     1515.1  -0.360  0.72074   
## APSLAKE        1885.0     1368.1   1.378  0.17699   
## OPBPC            76.6      458.4   0.167  0.86827   
## OPRC           2081.5      650.7   3.199  0.00293 **
## OPSLAKE        2055.0      758.1   2.711  0.01033 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7503 on 35 degrees of freedom
## Multiple R-squared:  0.928,  Adjusted R-squared:  0.9136 
## F-statistic:  64.4 on 7 and 35 DF,  p-value: < 2.2e-16

I will eliminate OPBPC, APSAB, and APMAM first since they have super high p-values and I don’t thinnk they’re good predictors.

watermod2 <- lm(BSAAM~Year+APSLAKE+OPRC+OPSLAKE, data=water)
summary(watermod2)
## 
## Call:
## lm(formula = BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12722.2  -5121.2   -248.7   3903.5  16807.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -227559.31  185510.06  -1.227  0.22749    
## Year            123.66      94.39   1.310  0.19804    
## APSLAKE        1597.29     503.72   3.171  0.00300 ** 
## OPRC           2009.88     585.52   3.433  0.00146 ** 
## OPSLAKE        2206.53     464.59   4.749  2.9e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7218 on 38 degrees of freedom
## Multiple R-squared:  0.9276, Adjusted R-squared:   0.92 
## F-statistic: 121.8 on 4 and 38 DF,  p-value: < 2.2e-16

We will compare this using the partial f test method since it’s nested models. I use anova() command to see if we should eliminate these three predictors or not.

anova(watermod,watermod2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## Model 2: BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     35 1970400272                           
## 2     38 1979541875 -3  -9141602 0.0541 0.9831

Since it is a high p-value, we will not reject the null that the reduced model, watermod2, is better. Let’s try getting rid of one other of the predictors, year, and see if it is better

watermod3 <- lm(BSAAM~APSLAKE+OPRC+OPSLAKE, data=water)
summary(watermod3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
anova(watermod2,watermod3)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1     38 1979541875                           
## 2     39 2068947585 -1 -89405710 1.7163  0.198

This says we also can’t reject the null that our model is better without the year predictor.
Another way to check is using a different method of comparison.
I’ll try looking at the AIC to compare:

AIC(watermod)
## [1] 898.5617
AIC(watermod2)
## [1] 892.7608
AIC(watermod3)
## [1] 892.6603

The AIC is not a large difference between the 3; but especially between the model 2 and 3, where we got rid of Year as a predictor. In this case, watermod 3 is the predictor you would use, because you can only justify using a more complicated model if the AIC drops more than 10. This is the same conclusion we came to in the anova test.

using the forward method

start with OPSLAKE

wmod <- lm(BSAAM ~ OPSLAKE, data=water)
summary(wmod)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

I will now add another predictor, APSLAKE.

wmod2 <- lm(BSAAM~ APSLAKE + OPSLAKE, data=water)
summary(wmod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13335.8  -5893.2   -171.8   4219.5  19500.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
## APSLAKE       1768.8      553.7   3.194  0.00273 ** 
## OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9002 
## F-statistic: 190.3 on 2 and 40 DF,  p-value: < 2.2e-16

Comparing these two models using anova() yields:

anova(wmod,wmod2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPSLAKE
##   Res.Df        RSS Df Sum of Sq      F   Pr(>F)   
## 1     41 3264010454                                
## 2     40 2600641788  1 663368666 10.203 0.002734 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We reject the null that the model with one predictor is better, so we should use APSLAKE as well as OPSLAKE.
We will continue the pattern by adding another predictor, OPRC:

wmod3 <- lm(BSAAM~ APSLAKE + OPRC+ OPSLAKE, data=water)
summary(wmod3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

Now we must check if it’s a better model with the third predictor using anova() again.

anova(wmod3,wmod2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Model 2: BSAAM ~ APSLAKE + OPSLAKE
##   Res.Df        RSS Df  Sum of Sq      F   Pr(>F)   
## 1     39 2068947585                                 
## 2     40 2600641788 -1 -531694203 10.023 0.002998 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We will use the model with 3 predictors since our pvalue is so small, not the model with only 2, because we can reject the null that the beta for OPRC is = 0.

AIC(wmod)
## [1] 908.2647
AIC(wmod2)
## [1] 900.4951
AIC(wmod3)
## [1] 892.6603

The smallest AIC is third model, but the difference between each of the models is not >10 , and we were instructed to only involve a more complicated model if AIC(wmod2)-AID(wmod3) is >10. The conclusion we will draw, then, is that the best model is wmod, with only 1 predictor. This is a different conclusion than what we drew for the anova() test, and this can happen.

Step AIC

After all that work, we were shown a much easier way to compare AIC for this model. We can use stepAIC() function to get rid of each unhelpful predictor.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(watermod,direction="backward")
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

according to this backward method, the same conclusion we drew earlier is the result here: use OPRC, APSLAKE, and OPSLAKE as predictors.

stepAIC(watermod,direction="forward")
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
## Call:
## lm(formula = BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + 
##     OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)         Year        APMAM        APSAB      APSLAKE  
##   -227814.8        123.9        143.4       -546.0       1885.0  
##       OPBPC         OPRC      OPSLAKE  
##        76.6       2081.5       2055.0

This was the code Knudson typed for the forward stepAIC, but I cannot seem to make it run. I’m not sure what’s wrong with my code but in class it worked for her.

#simple<- lm(BSAAM~APSLAKE,data=water)
#stepAIC(simple, direction="forward", scope= list(upper = BSAAM~ OPSLAKE + OPRC, lower = BASAAM~1)