Today we finished talking about multicollinearity and then talked mostly about model building.

library(alr3)
## Loading required package: car
data(water)
attach(water)

Multicollinearity

We just breifly finished this topic on the reasons why multicollinearity can be a problem. First it inflates the predictors standard error. Next it makes our point estimates unstable and highly variable. Also it shrinks our test stats and pushes us to reject our null hypothesis. Finally, our confidence intervals become very wide. In the end multicollinearity is just something we want to stay away from.

Model Building

In order to get the best model we need to know the best steps to take to do this. When adding predictors we want to stay away from comparing R^2 terms. Some helpful things to look at instead are t-tests, f-tests, adjusted R^2, and AIC.

Now there are 3 ways to make our models: forward selection, backward elimination, and all subsets regression.

Forward selection is the process of picking a predictor and adding one at a time if they seem to work. One thing I compared when adding variables was AIC. If it decreased I added it and if it increased I knew that I didn’t want to add that variable.

mod<- lm(BSAAM~APSLAKE)
summary(mod)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
AIC(mod)
## [1] 996.9139

Here we started with APSLAKE and looked at our original model’s AIC.

mod2<- lm(BSAAM~APSLAKE+APSAB)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -47127 -15536  -6104  18465  67360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    65536       9701   6.756 4.11e-08 ***
## APSLAKE         5029       3956   1.271    0.211    
## APSAB          -2703       4354  -0.621    0.538    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25200 on 40 degrees of freedom
## Multiple R-squared:  0.07113,    Adjusted R-squared:  0.02468 
## F-statistic: 1.531 on 2 and 40 DF,  p-value: 0.2286
AIC(mod2)
## [1] 998.5013

Then we tried to add APSAB but our AIC got worse so we decided not to.

mod3<- lm(BSAAM~APSLAKE+OPSLAKE)
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13335.8  -5893.2   -171.8   4219.5  19500.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19144.9     3812.0   5.022  1.1e-05 ***
## APSLAKE       1768.8      553.7   3.194  0.00273 ** 
## OPSLAKE       3689.5      196.0  18.829  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8063 on 40 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.9002 
## F-statistic: 190.3 on 2 and 40 DF,  p-value: < 2.2e-16
AIC(mod3)
## [1] 900.4951

Then we added OPSLAKE instead and our AIC went down like we wanted it to so we will keep this in the model.

mod4<- lm(BSAAM~APSLAKE+OPSLAKE+OPBPC)
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13321.6  -5876.3   -166.9   4198.4  19487.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 19198.01    4054.50   4.735 2.88e-05 ***
## APSLAKE      1768.90     560.79   3.154  0.00309 ** 
## OPSLAKE      3665.47     595.45   6.156 3.15e-07 ***
## OPBPC          21.17     493.94   0.043  0.96604    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8166 on 39 degrees of freedom
## Multiple R-squared:  0.9049, Adjusted R-squared:  0.8976 
## F-statistic: 123.7 on 3 and 39 DF,  p-value: < 2.2e-16
AIC(mod4)
## [1] 902.4931

Now we tried adding OPBPC and our AIC got worse again so we choose not to include this either.

mod5<- lm(BSAAM~APSLAKE+OPSLAKE+OPRC)
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
AIC(mod5)
## [1] 892.6603

Again we tried another one and the AIC improved so we will keep OPRC.

mod6<- lm(BSAAM~APSLAKE+OPSLAKE+OPRC+APMAM)
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPSLAKE + OPRC + APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12898  -5137  -1519   4319  18579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15559.2     3778.4   4.118 0.000199 ***
## APSLAKE       1828.5      880.8   2.076 0.044726 *  
## OPSLAKE       2377.9      458.8   5.182  7.5e-06 ***
## OPRC          1815.8      586.2   3.098 0.003659 ** 
## APMAM         -104.5      648.8  -0.161 0.872947    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7376 on 38 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9165 
## F-statistic: 116.2 on 4 and 38 DF,  p-value: < 2.2e-16
AIC(mod6)
## [1] 894.6309

And finally our AIC wont get any better so our best model was mod5. This is how you add variables in forward selection.

The next process was Backward elimination. This is easier in my opinion because you can add all the variables right away and see which one is the worst and get rid of them 1 by 1 until all our variables are significant.

model<- lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16
AIC(model)
## [1] 898.3868

This is our model with everything included right away. The easiest way is to look for the highest p-value that is above our .05 level of significance. This is the best one to drop first. I also ran the AIC so we can look at that as well as our AIC should improve each time.

model2<- lm(BSAAM~APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model2)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPBPC          68.94     453.50   0.152 0.879996    
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16
AIC(model2)
## [1] 896.3872

This is the model after we dropped the APMAM variable because that was the highest p-value for the t-test. The AIC improved.

model3<- lm(BSAAM~APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(model3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12932  -5010  -1181   4412  18621 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15538.35    3845.65   4.041  0.00025 ***
## APSLAKE      1712.68     506.99   3.378  0.00170 ** 
## OPBPC          46.22     446.34   0.104  0.91806    
## OPRC         1798.53     575.20   3.127  0.00338 ** 
## OPSLAKE      2336.53     685.62   3.408  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7378 on 38 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9164 
## F-statistic: 116.1 on 4 and 38 DF,  p-value: < 2.2e-16
AIC(model3)
## [1] 894.6481

Next we dropped APSAB variable and our AIC improved again.

model4<- lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(model4)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
AIC(model4)
## [1] 892.6603

Again we dropped another one, OPBPC, and our AIC improved. We would choose to stop here because all our variables have significant p-values below .05 for the t-test so our model is good. Let’s just drop one more to see if our AIC would be better.

model5<- lm(BSAAM~OPRC+OPSLAKE)
summary(model5)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16
AIC(model5)
## [1] 901.9472

This definitely made our AIC worse so we want to go back to the last model we made.

Now there is an easier way to do this in R. We can use the stepAIC commmand and it finds the best model by comparing AIC.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(model)
## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

And as you can see it was left with the same model as we were because it found the lowest AIC possible.