Today during class we finalized our discussion on multicollinearity and started talking about model building. Some ways the know if predictors should be added or not is to run t-test, F-test, AIC of Mallow’s \({Cp}\). The one used most often in practice is the AIC or the Akaike Information Criteria. Better models will have a lower AIC score and you would need a difference of 10 or more to justify a more complex model. It is important ot mension that the numbers mean nothing on their own and must be used as a comparison. In this example I used both ANOVA and AIC to justify my model building.
There are three different types of model building:
Forward Selection: starting simple, adding more predictors using test/predictors
Backward Elimination: starting with all the predictors and drop one at a time
All Subsets: Looking at every possible iteration of the model and comparing.
I did a long hand approach to backwards elimination for my example.
library(alr3)
## Loading required package: car
data(water)
summary(water)
## Year APMAM APSAB APSLAKE
## Min. :1948 Min. : 2.700 Min. : 1.450 Min. : 1.77
## 1st Qu.:1958 1st Qu.: 4.975 1st Qu.: 3.390 1st Qu.: 3.36
## Median :1969 Median : 7.080 Median : 4.460 Median : 4.62
## Mean :1969 Mean : 7.323 Mean : 4.652 Mean : 4.93
## 3rd Qu.:1980 3rd Qu.: 9.115 3rd Qu.: 5.685 3rd Qu.: 5.83
## Max. :1990 Max. :18.080 Max. :11.960 Max. :13.02
## OPBPC OPRC OPSLAKE BSAAM
## Min. : 4.050 Min. : 4.350 Min. : 4.600 Min. : 41785
## 1st Qu.: 7.975 1st Qu.: 7.875 1st Qu.: 8.705 1st Qu.: 59857
## Median : 9.550 Median :11.110 Median :12.140 Median : 69177
## Mean :12.836 Mean :12.002 Mean :13.522 Mean : 77756
## 3rd Qu.:16.545 3rd Qu.:14.975 3rd Qu.:16.920 3rd Qu.: 92206
## Max. :43.370 Max. :24.850 Max. :33.070 Max. :146345
mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
summary(mod)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC +
## OPSLAKE + Year, data = water)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12772.4 -5164.9 -360.6 4379.1 16807.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -227814.8 197920.2 -1.151 0.25752
## APMAM 143.4 715.2 0.200 0.84228
## APSAB -546.0 1515.1 -0.360 0.72074
## APSLAKE 1885.0 1368.1 1.378 0.17699
## OPBPC 76.6 458.4 0.167 0.86827
## OPRC 2081.5 650.7 3.199 0.00293 **
## OPSLAKE 2055.0 758.1 2.711 0.01033 *
## Year 123.9 100.6 1.232 0.22621
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7503 on 35 degrees of freedom
## Multiple R-squared: 0.928, Adjusted R-squared: 0.9136
## F-statistic: 64.4 on 7 and 35 DF, p-value: < 2.2e-16
mod2 <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 36 1972662466 -1 -2262194 0.0402 0.8423
AIC(mod, mod2)
## df AIC
## mod 9 898.5617
## mod2 8 896.6111
Removing the least significant variable first, leaves us with a p value in our ANOVA of 0.9857. This means there is no statistical difference within the two models.
AIC for mod: 898.3868
AIC for mod2: 896.3872
Since these are not more than 10 apart from each other, we can again say that there is no significat difference, but it is lower which is moving in the rigth direction and move on to removing another predictor.
mod3 <- lm(BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod3)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 37 1974597927 -2 -4197655 0.0373 0.9634
AIC(mod, mod3)
## df AIC
## mod 9 898.5617
## mod3 7 894.6532
Again the high p value and the lower AIC means that this predictor should be removed.
mod4 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year, data = water)
anova(mod, mod4)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 38 1979541875 -3 -9141602 0.0541 0.9831
AIC(mod, mod4)
## df AIC
## mod 9 898.5617
## mod4 6 892.7608
Again the high p value and the lower AIC means that this predictor should be removed.
mod5 <- lm(BSAAM ~ OPRC + OPSLAKE + Year, data = water)
anova(mod, mod5)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ OPRC + OPSLAKE + Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 39 2503354457 -4 -532954185 2.3667 0.07162 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod5)
## df AIC
## mod 9 898.5617
## mod5 5 900.8557
As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.
mod6 <- lm(BSAAM ~ APSLAKE + OPSLAKE + Year, data = water)
anova(mod, mod6)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPSLAKE + Year
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 39 2593349193 -4 -622948921 2.7663 0.04251 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod6)
## df AIC
## mod 9 898.5617
## mod6 5 902.3744
As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.
mod7 <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
anova(mod, mod7)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 39 2068947585 -4 -98547313 0.4376 0.7805
AIC(mod, mod7)
## df AIC
## mod 9 898.5617
## mod7 5 892.6603
Again the high p value and the lower AIC means that this predictor should be removed.
mod8 <- lm(BSAAM ~ APSLAKE + OPRC, data = water)
anova(mod, mod8)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
## Model 2: BSAAM ~ APSLAKE + OPRC
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 1970400272
## 2 40 3584866125 -5 -1614465853 5.7355 0.000575 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mod, mod8)
## df AIC
## mod 9 898.5617
## mod8 4 914.2965
As you can see, the AIC increased and the p value fell below .05 which indicates that this predictor needs to stay in the model.
This means that our final and best model is mod7 using the predictors, APSLAKE, OPRC and OPSLAKE.
R does not make the user always to this though (Thank goodness!) There is a step command that makes this far faster.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
mod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, data = water)
stepAIC(mod, direction = "backward")
## Start: AIC=774.53
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1571591 1971971864 772.57
## - APMAM 1 2262194 1972662466 772.58
## - APSAB 1 7311109 1977711381 772.69
## - Year 1 85430461 2055830733 774.36
## <none> 1970400272 774.53
## - APSLAKE 1 106880993 2077281265 774.80
## - OPSLAKE 1 413707192 2384107464 780.73
## - OPRC 1 576007855 2546408128 783.56
##
## Step: AIC=772.57
## BSAAM ~ APMAM + APSAB + APSLAKE + OPRC + OPSLAKE + Year
##
## Df Sum of Sq RSS AIC
## - APMAM 1 2626064 1974597927 770.62
## - APSAB 1 6887325 1978859189 770.72
## - Year 1 85160498 2057132362 772.39
## <none> 1971971864 772.57
## - APSLAKE 1 105315871 2077287734 772.80
## - OPRC 1 574517654 2546489517 781.56
## - OPSLAKE 1 964675516 2936647380 787.69
##
## Step: AIC=770.62
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE + Year
##
## Df Sum of Sq RSS AIC
## - APSAB 1 4943947 1979541875 768.73
## - Year 1 82535451 2057133378 770.39
## <none> 1974597927 770.62
## - APSLAKE 1 127432687 2102030614 771.31
## - OPRC 1 575963916 2550561844 779.63
## - OPSLAKE 1 968394770 2942992697 785.78
##
## Step: AIC=768.73
## BSAAM ~ APSLAKE + OPRC + OPSLAKE + Year
##
## Df Sum of Sq RSS AIC
## - Year 1 89405710 2068947585 768.63
## <none> 1979541875 768.73
## - APSLAKE 1 523812582 2503354457 776.83
## - OPRC 1 613807319 2593349193 778.35
## - OPSLAKE 1 1175063776 3154605651 786.77
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
This leads us to the exact same model as was shown above!
You can also use this model to go forwards.
mod9 <- lm(BSAAM ~ OPRC, data = water)
stepAIC(mod9, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE + Year, lower = BSAAM~1))
## Start: AIC=795.28
## BSAAM ~ OPRC
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 1529924515 2689959758 777.92
## + OPBPC 1 888721040 3331163233 787.11
## + APSLAKE 1 635018148 3584866125 790.27
## + Year 1 609982013 3609902260 790.57
## + APMAM 1 261271943 3958612330 794.53
## + APSAB 1 205266122 4014618151 795.14
## <none> 4219884273 795.28
##
## Step: AIC=777.92
## BSAAM ~ OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 621012173 2068947585 768.63
## + APSAB 1 457345396 2232614362 771.91
## + APMAM 1 387970039 2301989719 773.22
## + Year 1 186605301 2503354457 776.83
## <none> 2689959758 777.92
## + OPBPC 1 450573 2689509185 779.91
##
## Step: AIC=768.63
## BSAAM ~ OPRC + OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + Year 1 89405710 1979541875 768.73
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
##
## Coefficients:
## (Intercept) OPRC OPSLAKE APSLAKE
## 15425 1797 2390 1712
Again this leads to the exact same model!
This is important because it allows us to be most efficient in our models and only chose predictors that add helpful information.