Methods of Model Building: Section 5.1

Today in class we covered the rest of 5.1. We started by looking at the criteria for adding variables, then moved on to the actual model building.

Criteria for Adding Variables

R-squared is a poor criterion for deciding whether to add a predictor to our model or drop one from it. R-squared is the proportion of variability in the response that’s explained by its linear relationship with the predictors. It doesn’t tell us much here, because adding any variable, useful or not, makes our R-squared go up, or at least stay the same.
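To see why R-squared is a weak criterion, here is a small sketch that adds a column of pure noise to a model. It uses the built-in mtcars data as a stand-in (an assumption, not the data from class), and the names `cars`, `small`, and `big` are just for illustration.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
set.seed(1)                                  # make the noise column reproducible
cars <- mtcars
cars$noise <- rnorm(nrow(cars))              # pure noise, unrelated to mpg

small <- lm(mpg ~ wt, data = cars)           # one real predictor
big   <- lm(mpg ~ wt + noise, data = cars)   # same model plus the noise column

r2_small <- summary(small)$r.squared
r2_big   <- summary(big)$r.squared

# R-squared for the bigger model is never lower, even though noise is useless
print(c(small = r2_small, big = r2_big))
```

Even a useless predictor cannot lower R-squared, which is why it can't tell us when to stop adding variables.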

A slightly better, but still not great, criterion is the adjusted R-squared. This is like R-squared, but it penalizes for the number of predictors in the model. There are better tools out there that we can use.
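As a rough illustration of the penalty, the sketch below computes adjusted R-squared by hand with the usual formula, 1 - (1 - R²)(n - 1)/(n - k - 1) for k predictors, and compares it with what `summary()` reports. The mtcars data and the choice of predictors are assumptions made for the example.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n  <- nrow(mtcars)            # number of observations
k  <- 2                       # number of predictors (wt and hp)
r2 <- summary(fit)$r.squared

# The (n - 1) / (n - k - 1) factor grows with k, which is the penalty
adj_by_hand <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
adj_from_r  <- summary(fit)$adj.r.squared

print(c(by_hand = adj_by_hand, from_R = adj_from_r))
```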

Better Criteria

There are four other tools we discussed in class that we can use to decide whether or not to keep a predictor.

t-test

The first is the t-test for adding/dropping \(x_j\). This is what we have been using the past few weeks. We calculate the test statistic and the p-value for \(x_j\) and add the variable if the p-value is low enough. However, this test has limitations: it tests a single coefficient, so we can only use it for a quantitative predictor or a categorical predictor with only two categories.
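Here is a minimal sketch of pulling the t-test for one predictor out of a fitted model and reproducing it by hand. The mtcars data and the predictor `hp` stand in for the class example and are assumptions.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
t_hp  <- coefs["hp", "t value"]
p_hp  <- coefs["hp", "Pr(>|t|)"]

# Same test by hand: estimate / standard error, two-sided p-value
t_by_hand <- coefs["hp", "Estimate"] / coefs["hp", "Std. Error"]
p_by_hand <- 2 * pt(abs(t_by_hand), df = df.residual(fit), lower.tail = FALSE)

print(c(t = t_hp, p = p_hp))
```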

Partial F-test

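The second is the partial F-test. Again, we have been using this over the past few weeks. This works a little better than the t-test because it can test several coefficients at once, so we can use it for any type of variable, including categorical variables with more than two categories.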
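As a sketch of the partial F-test on a categorical predictor with more than two categories, the example below compares a reduced and a full model with `anova()`. It assumes the built-in mtcars data, with `cyl` converted to a three-level factor for the purpose of the example.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
cars <- mtcars
cars$cyl <- factor(cars$cyl)                 # three categories: 4, 6, 8

reduced <- lm(mpg ~ wt, data = cars)         # model without cyl
full    <- lm(mpg ~ wt + cyl, data = cars)   # adds two dummy coefficients

# Partial F-test: do the two cyl coefficients jointly improve the fit?
ftest   <- anova(reduced, full)
p_value <- ftest[2, "Pr(>F)"]

print(ftest)
```

A single t-test could not handle this case, since adding `cyl` brings in two coefficients at once.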

AIC

The third is AIC, the Akaike Information Criterion. This balances the model fit against the number of predictors. For a model with k predictors, \(AIC = 2(k + 1) - 2\log(L)\), where \(L\) is the likelihood of the model. We want this to be as small as possible: the first term grows with the number of parameters, and the second shrinks as the likelihood grows. An AIC value on its own is arbitrary; it only means something when compared with the AIC of another model fit to the same data. We use the model with the lower AIC if the difference is large enough to matter.
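R reports AIC in two flavors that differ only by a constant for models fit to the same data, so either one ranks models the same way. The sketch below computes both by hand for a linear model with k predictors. The mtcars data is a stand-in and an assumption of this example.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)
k   <- 2                               # number of predictors
rss <- sum(residuals(fit)^2)           # residual sum of squares

# Flavor used by extractAIC() (and printed by stepAIC()):
# constants in the log-likelihood are dropped
aic_step <- n * log(rss / n) + 2 * (k + 1)

# Flavor used by AIC(): full log-likelihood, with sigma counted as a parameter
aic_full <- 2 * (k + 2) - 2 * as.numeric(logLik(fit))

print(c(by_hand = aic_step, extractAIC = extractAIC(fit)[2]))
print(c(by_hand = aic_full, AIC = AIC(fit)))
```

Because the two versions are offset by a constant, only differences in AIC between models carry information.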

Mallows’ Cp

The fourth is Mallows’ \(C_p\). The smaller this number is, the better, and it’s ideal if it’s close to k + 1. AIC is used more frequently than Mallows’ \(C_p\), so I won’t get into the details of it here.

Methods of Model Building

There are several ways we can go about building an actual model. There are two different stepwise methods as well as all subsets regression.

All Subsets Regression

With all subsets regression, we fit every possible model using the variables we want to look at, and get the AIC for each one. With p candidate variables that means \(2^p\) models, so this gets time consuming, and eventually too much for the computer to handle, as more variables are added.
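The idea can be sketched in base R by looping over every subset of a small set of candidate predictors. The mtcars data, the four candidates, and the helper names `fit_aic` and `label` are all assumptions made for illustration; with four candidates there are only 16 models to fit.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
candidates <- c("wt", "hp", "qsec", "drat")

# Build every subset of the candidates, starting with the empty set
subsets <- list(character(0))
for (m in seq_along(candidates)) {
  subsets <- c(subsets, combn(candidates, m, simplify = FALSE))
}

# Turn a subset into the right-hand side of a formula ("1" = intercept only)
label <- function(vars) {
  if (length(vars) == 0) "1" else paste(vars, collapse = " + ")
}

# Fit one model and return its AIC
fit_aic <- function(vars) {
  AIC(lm(as.formula(paste("mpg ~", label(vars))), data = mtcars))
}

results <- data.frame(model = sapply(subsets, label),
                      AIC   = sapply(subsets, fit_aic))
results <- results[order(results$AIC), ]   # best (lowest AIC) first

print(head(results, 3))
```

Each extra candidate doubles the number of rows in `results`, which is why this approach stops being practical quickly.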

Stepwise Regression

A more practical way to go about model building is stepwise regression. With stepwise, we add or remove one variable at a time, checking at each step whether the variable is worth keeping. There are two types of stepwise regression: forward selection and backward elimination.

Forward Selection

In forward selection, we start with a simple model and gradually add predictors. We add the most significant predictor first (the one with the lowest p-value from simple linear regression), and continue this process until we can’t add any more significant variables. We can do this by hand, but R can also do it for us pretty simply.

library(MASS)
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
## 
## Attaching package: 'alr3'
## The following object is masked from 'package:MASS':
## 
##     forbes
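data(water)    # load the water data set from alr3
attach(water)  # make its columns available by name

# full model with all six candidate predictors (upper limit of the search)
mod6 <- lm(BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM)
# intercept-only model (starting point of the search)
simplemod <- lm(BSAAM ~ 1)
stepAIC(simplemod, direction = "forward", scope = list(upper = mod6))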
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797

We can see above that at each step R tries adding each remaining variable and adds the one that gives the lowest AIC. It stops when <none> has the lowest AIC, meaning no further addition improves the model. In our example above, BSAAM is best predicted with OPSLAKE, APSLAKE, and OPRC.
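A single forward step can also be done by hand with `add1()`, which produces the same kind of table. This is only a sketch: it uses the built-in mtcars data instead of the water data, and the candidate predictors are assumptions chosen for the example.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
null_mod <- lm(mpg ~ 1, data = mtcars)          # intercept-only starting model

# Try adding each candidate on its own; one row per candidate plus <none>
step1 <- add1(null_mod, scope = ~ wt + hp + qsec + drat)
print(step1)

# The row with the lowest AIC is the variable forward selection adds first
best <- rownames(step1)[which.min(step1$AIC)]
print(best)
```

Repeating this with the updated model, until `<none>` has the lowest AIC, is what the forward search automates.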

Backward Elimination

In backward elimination, we start with the more complicated model and gradually pare it down, eliminating the least important predictors. We stop when none of the remaining predictors should be cut. R presents this to us in a similar fashion.

mod6 <- lm(BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM) # create a model with all predictors
stepAIC(mod6, direction = "backward") # "backward" is also the default when no scope is given
## Start:  AIC=774.36
## BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ OPSLAKE + OPRC + APSLAKE + APSAB
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + OPRC + APSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + APSLAKE)
## 
## Coefficients:
## (Intercept)      OPSLAKE         OPRC      APSLAKE  
##       15425         2390         1797         1712

We come to the same conclusion: BSAAM is best predicted with OPSLAKE, OPRC, and APSLAKE.
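One backward step can be checked by hand with `drop1()`, which lists the AIC after removing each predictor in turn. As a sketch, this uses the built-in mtcars data rather than the water data, and the starting model is an assumption made for the example.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
full_mod <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Try dropping each predictor on its own; one row per predictor plus <none>
step1 <- drop1(full_mod)
print(step1)

# Lowest AIC wins: if it is "<none>", nothing should be dropped
candidate <- rownames(step1)[which.min(step1$AIC)]
print(candidate)
```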

Comparing Two Models

As seen above, when R builds models, it uses AIC to compare the models against each other. If the models we are comparing are nested, we could also use the partial F-test, the t-test (when the models differ by a single coefficient), or Mallows’ \(C_p\) to compare. If the models aren’t nested, we could use either AIC or Mallows’ \(C_p\).
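For two models that aren't nested, a minimal sketch of the AIC comparison looks like this. The mtcars data and the two models `m1` and `m2` are assumptions for illustration; neither model's predictors are a subset of the other's.

```r
# Stand-in data: mtcars ships with base R (not the data set from class).
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ hp + qsec, data = mtcars)   # not nested with m1

# AIC() accepts several models fit to the same response and data
comparison <- AIC(m1, m2)
print(comparison)

# The model with the lower AIC is preferred
preferred <- rownames(comparison)[which.min(comparison$AIC)]
print(preferred)
```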