Our main focus in class today was on model building and what the best way is to build an optimal model for our data set. What we learned is that there are two main approaches to model building. Both of them work equally well, though it is possible that they could result in different final models. The two approaches are:
Approach 1 (Forward Selection): This is when we start with a simple model and then add more predictors using any tests/critera to determine if new predictors are a good fit. So we will start with what is essentially a simple linear regression model, and then go through step by step and add variables. As we add each variable we have to retest the model to see if the new addition enhances it. I will talk about the tests we use to do this further down.
Approach 2 (Backward Elimination): This is when we start with complicated model with many predictors and then we will cut it doww until we have a satisfactory model. We try to remove all of the unnecessary predictors until we only have significant ones left. Similar to forward selection, we have a variaty of tests we can use to test the significance of our predictors.
One thing to consider when building our model is if our model will be nested. That is, is our model exactly the same as our previous version just with one less variable. If it is not nested we have to use the AIC, but if not we can use any of the tests (but not Adjusted R2. ever.). The tests we can use to test our model are:
T-test for adding a predictor which we have done previously
F-test (partial) which we have done previously
Adjusted R2 which is like R2 but penalizes for number of predictors (not recommended)
AIC (Akaike Information critera) = 2(K+1)-2*log(likelihood); We want as small of value as possible for AIC because it means our 2log(likelihood) is large. This is useful because we don’t need to have nested models, so we can compare y=x1+x2 versus y=x1+x3. (Nested model means one models predictors are a subset of other models predictors) We need to see an AIC drop >= 10 in order to justify using a more complicated model. The actual value is not useful at all, it is only useful to use it to compare with other models.
Mallow’s Cp: We want it to be small and ideally we want it to be approximately equal to (K+1) where K is the number of betas. Not something that is super common to use in practice.
Here we will look at an example using the water data from the alr3 package. Our goal will be to build a model using the backwards elimination method. One nice thing about R is that we are able to automate the model building process through the use of the stepAIC function. This automizes the process and allows us to build our model either forwards or backwards. All we have to do is give it the initial model and specify which direction we would like our model to be built. In this first command we will build our complex model, and then using the stepAIC command and a backwards direction, it will narrow down our model until we reach our final predictors.
data(water, package = "alr3")
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
library(MASS)
water.mod1 <- lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, data = water)
stepAIC(water.mod1, direction = "backward")
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
Looking at the output above, it appears that our final model will include APSLAKE, OPRC, and OPSLAKE as our predictors of stream runoff. It is easy to follow along above, as it starts with our complex model at the top, and then as we step down each time it retests all of our predictors and determines if there is a predictor that can be eliminated. We chose to use AIC for this model but we could ahve used any of the tests discussed above to create our model as well.
start.mod <- lm(BSAAM~1, data = water)
complete.mod <- lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, data = water)
simple.mod <- lm(BSAAM~1, data = water)
stepAIC(start.mod, scope = list(upper = complete.mod, lower = simple.mod), direction = "forward")
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC, data = water)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
This is just another example with the exact same data set, except we are building this model using forward selection rather than backawards elimination. In the end, both approaches ended up yielding the same model, but the doesn’t necessarily have to be the case. These are both completely viable options for building models and will be used as we move forward in class.