Today we finished talking about multicollinearity and then talked mostly about model building.
library(alr3)
## Loading required package: car
data(water)
attach(water)
We briefly wrapped up this topic with the reasons why multicollinearity can be a problem. First, it inflates the predictors' standard errors. Next, it makes our point estimates unstable and highly variable. Also, those inflated standard errors shrink our test statistics, which pushes us toward failing to reject the null hypothesis even when a predictor matters. Finally, our confidence intervals become very wide. In the end, multicollinearity is just something we want to stay away from.
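As a quick sketch of my own (not something we ran in class), the car package that alr3 loaded above has a vif() function for checking this: variance inflation factors much above 5 or 10 suggest a predictor is close to a linear combination of the others. The name full_mod is just mine.
full_mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
vif(full_mod)   # one variance inflation factor per predictor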
In order to build the best model we need to know which criteria to compare. When adding predictors we want to stay away from comparing raw R^2 values, because R^2 can never decrease when another predictor is added. Some more helpful things to look at instead are t-tests, F-tests, adjusted R^2, and AIC.
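Here is a small sketch of that point (my own addition, not from class), using the water data we already attached; the names noise, fit1, and fit2 are just for illustration.
set.seed(1)
noise <- rnorm(length(BSAAM))                             # a made-up junk predictor
fit1 <- lm(BSAAM ~ OPSLAKE)
fit2 <- lm(BSAAM ~ OPSLAKE + noise)
summary(fit1)$r.squared; summary(fit2)$r.squared          # raw R^2 cannot decrease when a term is added
summary(fit1)$adj.r.squared; summary(fit2)$adj.r.squared  # adjusted R^2 charges a penalty for the extra term
AIC(fit1); AIC(fit2)                                      # so does AIC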
Now there are three ways to build our models: forward selection, backward elimination, and all-subsets regression.
Forward selection is the process of starting small and adding predictors one at a time, keeping each one only if it seems to help. The criterion I compared when adding variables was AIC: if it decreased I kept the new variable, and if it increased I knew I didn't want to add it.
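For reference, base R can automate this kind of search with step() (a sketch of my own, not the by-hand process below): start from an intercept-only model and let it add one term at a time for as long as the AIC keeps dropping. null_mod is just my name for the starting model.
null_mod <- lm(BSAAM ~ 1)   # intercept-only starting point
step(null_mod,
     scope = ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE,
     direction = "forward")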
mod<- lm(BSAAM~APSLAKE)
summary(mod)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46438 -16907 -5661 19028 69464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63864 9249 6.905 2.25e-08 ***
## APSLAKE 2818 1709 1.649 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared: 0.06217, Adjusted R-squared: 0.0393
## F-statistic: 2.718 on 1 and 41 DF, p-value: 0.1069
AIC(mod)
## [1] 996.9139
Here we started with APSLAKE and looked at our original model’s AIC.
mod2<- lm(BSAAM~APSLAKE+APSAB)
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47127 -15536 -6104 18465 67360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65536 9701 6.756 4.11e-08 ***
## APSLAKE 5029 3956 1.271 0.211
## APSAB -2703 4354 -0.621 0.538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25200 on 40 degrees of freedom
## Multiple R-squared: 0.07113, Adjusted R-squared: 0.02468
## F-statistic: 1.531 on 2 and 40 DF, p-value: 0.2286
AIC(mod2)
## [1] 998.5013
Then we tried adding APSAB, but our AIC got worse, so we decided not to keep it.
mod3<- lm(BSAAM~APSLAKE+OPSLAKE)
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13335.8 -5893.2 -171.8 4219.5 19500.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19144.9 3812.0 5.022 1.1e-05 ***
## APSLAKE 1768.8 553.7 3.194 0.00273 **
## OPSLAKE 3689.5 196.0 18.829 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.9002
## F-statistic: 190.3 on 2 and 40 DF, p-value: < 2.2e-16
AIC(mod3)
## [1] 900.4951
Then we tried OPSLAKE instead, and our AIC went down like we wanted, so we will keep it in the model.
mod4<- lm(BSAAM~APSLAKE+OPSLAKE+OPBPC)
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13321.6 -5876.3 -166.9 4198.4 19487.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19198.01 4054.50 4.735 2.88e-05 ***
## APSLAKE 1768.90 560.79 3.154 0.00309 **
## OPSLAKE 3665.47 595.45 6.156 3.15e-07 ***
## OPBPC 21.17 493.94 0.043 0.96604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8166 on 39 degrees of freedom
## Multiple R-squared: 0.9049, Adjusted R-squared: 0.8976
## F-statistic: 123.7 on 3 and 39 DF, p-value: < 2.2e-16
AIC(mod4)
## [1] 902.4931
Next we tried adding OPBPC, and our AIC got worse again, so we chose not to include this one either.
mod5<- lm(BSAAM~APSLAKE+OPSLAKE+OPRC)
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## OPRC 1797.5 567.8 3.166 0.002998 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
AIC(mod5)
## [1] 892.6603
Then we tried OPRC, and the AIC improved, so we will keep it.
mod6<- lm(BSAAM~APSLAKE+OPSLAKE+OPRC+APMAM)
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12898 -5137 -1519 4319 18579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15559.2 3778.4 4.118 0.000199 ***
## APSLAKE 1828.5 880.8 2.076 0.044726 *
## OPSLAKE 2377.9 458.8 5.182 7.5e-06 ***
## OPRC 1815.8 586.2 3.098 0.003659 **
## APMAM -104.5 648.8 -0.161 0.872947
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7376 on 38 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9165
## F-statistic: 116.2 on 4 and 38 DF, p-value: < 2.2e-16
AIC(mod6)
## [1] 894.6309
Finally, adding APMAM made the AIC worse, so our best model was mod5. This is how you add variables in forward selection.
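One convenience worth noting (my own addition): AIC() accepts several fitted models at once and lines them up in a single table, which makes it easy to double-check the path we just walked.
AIC(mod, mod2, mod3, mod4, mod5, mod6)   # one row per candidate model; mod5 should have the lowest AIC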
The next process was backward elimination. This is easier in my opinion because you start with all the variables at once, find the worst one, and drop them one by one until all the remaining variables are significant.
model<- lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPBPC 69.70 461.69 0.151 0.880839
## OPRC 1916.45 641.36 2.988 0.005031 **
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
AIC(model)
## [1] 898.3868
This is our model with everything included right away. The easiest way to pick a variable to drop is to look for the highest p-value that is above our .05 level of significance; that is the best one to drop first. I also ran AIC so we can track it as well, since the AIC should improve with each drop.
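A shortcut for choosing what to drop (my own addition, not something we did in class) is drop1(), which refits the model with each predictor left out in turn and reports the AIC plus an F-test for every single-term deletion.
drop1(model, test = "F")   # the row with the lowest AIC (here APMAM) is the best one to drop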
model2<- lm(BSAAM~APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model2)
##
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPBPC 68.94 453.50 0.152 0.879996
## OPRC 1915.75 631.46 3.034 0.004399 **
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
AIC(model2)
## [1] 896.3872
This is the model after we dropped the APMAM variable, since it had the highest p-value for its t-test. The AIC improved.
model3<- lm(BSAAM~APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12932 -5010 -1181 4412 18621
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15538.35 3845.65 4.041 0.00025 ***
## APSLAKE 1712.68 506.99 3.378 0.00170 **
## OPBPC 46.22 446.34 0.104 0.91806
## OPRC 1798.53 575.20 3.127 0.00338 **
## OPSLAKE 2336.53 685.62 3.408 0.00156 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7378 on 38 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9164
## F-statistic: 116.1 on 4 and 38 DF, p-value: < 2.2e-16
AIC(model3)
## [1] 894.6481
Next we dropped the APSAB variable and our AIC improved again.
model4<- lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(model4)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
AIC(model4)
## [1] 892.6603
Again we dropped another one, OPBPC, and our AIC improved. We would choose to stop here because all of our remaining variables have significant p-values below .05 for their t-tests, so the model is good. Let's drop one more anyway just to see whether the AIC would get any better.
model5<- lm(BSAAM~OPRC+OPSLAKE)
summary(model5)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
AIC(model5)
## [1] 901.9472
This definitely made our AIC worse, so we want to go back to the previous model (model4).
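Since F-tests were on our list of criteria, a quick confirmation (my own addition) is the partial F-test between the two nested models; a small p-value says APSLAKE really does belong in the model.
anova(model5, model4)   # partial F-test for adding APSLAKE back; a small p-value means keep it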
Now, there is an easier way to do all of this in R. We can use the stepAIC command from the MASS package, and it finds the best model by comparing AIC at each step.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(model)
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
And as you can see, it landed on the same model we did, because it kept dropping terms until no single deletion would lower the AIC any further. Note that the AIC values stepAIC prints are on a different scale than the ones AIC() gave us, but the difference is only a constant for a fixed data set, so the ranking of the models (and therefore the chosen model) is the same.
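A short check of that scale difference (my own addition): stepAIC() gets its numbers from extractAIC(), which drops a constant that AIC() keeps.
AIC(model4)          # about 892.66 on the AIC() scale
extractAIC(model4)   # the equivalent degrees of freedom and the AIC on the scale stepAIC prints (about 768.63)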