Model Building (Section 5.1)

In class on Tuesday we finished section 5.1, on model building. We discussed criteria for adding variables and learned how to use R to build models.

Determining Usefulness of Predictors

We talked about multiple methods/tests to determine if predictors are useful to our regression model.

R-Squared and Adjusted R-Squared

R-squared is not very helpful for deciding whether to add or drop a predictor from our model. This is because R-squared measures the proportion of variation in the response that is explained by the linear relationship with the predictors, and adding a predictor can never decrease it: R-squared either increases or stays the same with each additional predictor. Adjusted R-squared is slightly better because it penalizes the model for the number of predictors it contains.
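
For \(n\) observations and \(k\) predictors, the penalty is visible in the formula \(R^2_{adj} = 1-(1-R^2)\frac{n-1}{n-k-1}\). Adding a predictor increases \(k\), so adjusted R-squared only goes up if the gain in \(R^2\) outweighs the penalty.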

T-Test

Running a T-test helps us determine whether to add/drop \(x_j\). Just as we have been doing for simple linear regression, if the p-value is less than our significance level \(\alpha\), we reject \(H_0: \beta_j = 0\) and conclude that the predictor is useful in our model. However, we can only use a T-test for quantitative variables or categorical variables with two categories.
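
As a quick illustration (a sketch using R’s built-in mtcars data, not the lecture example), summary() reports a T-test for every coefficient in a fitted model:

# fit a multiple regression on the built-in mtcars data
fit <- lm(mpg ~ wt + hp, data = mtcars)
# the Pr(>|t|) column gives the p-value for testing H0: beta_j = 0
summary(fit)$coefficients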

Partial F-Test

The F-test lets us test whether at least one of our predictors provides useful information. The partial F-test compares a reduced model to a full model, which lets us test whether a particular variable (or group of variables) should be added or dropped. Unlike the T-test, it can handle any type of predictor, including categorical variables with more than two categories.
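
As a sketch (again on mtcars rather than the class data), anova() on nested models carries out the partial F-test:

# reduced model, and a full model that adds wt and hp
reduced <- lm(mpg ~ cyl, data = mtcars)
full <- lm(mpg ~ cyl + wt + hp, data = mtcars)
# partial F-test of H0: the coefficients on wt and hp are both zero
anova(reduced, full)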

AIC

AIC, or the Akaike Information Criterion, lets us determine which variables to add/drop by balancing model fit against the number of predictors: \(AIC = 2(k+1) - 2\ln(\hat{L})\), where \(k+1\) is the number of parameters in the model and \(\hat{L}\) is the maximized likelihood. A lower AIC value indicates a better trade-off between fewer parameters and a larger likelihood. We compare two models and can justify removal of a predictor if the AIC drops by more than 10.
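
In R, the AIC() function accepts several fitted models at once; a minimal sketch with mtcars:

# two nested models on the built-in mtcars data
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + hp, data = mtcars)
# lower AIC suggests the better fit/complexity trade-off
AIC(fit1, fit2)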

Mallow’s Cp

Mallow’s \(C_p\) is another way of determining whether to add/drop predictors. We want this statistic to be small and close to \(k+1\), the number of parameters in the candidate model. For this class, we will focus on AIC.
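
For reference (one common form of the statistic, not written out in class): \(C_p = \frac{SSE_p}{MSE_{full}} - (n - 2(k+1))\), where \(SSE_p\) is the error sum of squares of the candidate model with \(k\) predictors and \(MSE_{full}\) is the mean squared error of the full model. A candidate model that fits well gives \(C_p \approx k+1\).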

Methods of Model Building

The model building methods covered in class are stepwise methods (forward selection and backward elimination) and all subsets regression.

Forward Selection

In this method, we start with a simple model and add the most meaningful predictors first. We identify the most meaningful predictors as those with the lowest p-values from simple linear regression. We can also have R do the work for us and use AIC values instead:

library(MASS)    # provides stepAIC()
library(alr3)    # provides the water dataset
data(water)
attach(water)
# full model with all six predictors, and an intercept-only starting model
watermod <- lm(BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM)
simplemod <- lm(BSAAM ~ 1)
# forward selection: start from simplemod, add terms up to the full model
stepAIC(simplemod, direction = "forward", scope = list(upper = watermod))
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797

In the first step, R evaluates the effect of adding each predictor to the intercept-only model, and the predictor that gives the model the lowest AIC is added. R then moves to the model containing that predictor and again lists the current model’s AIC and each remaining variable’s impact on AIC, adding the predictor with the lowest resulting AIC. Only one predictor is added per step, to avoid accidentally adding a predictor that really isn’t useful. In the final step, <none> has the lowest AIC, meaning that adding any of the remaining variables would raise the AIC, so we stop. Thus, BSAAM is predicted by OPSLAKE, APSLAKE, and OPRC.

Backward Elimination

Just as it sounds, in this method we start with the full model and eliminate predictors one at a time to determine which are most important.

# with no scope argument, stepAIC defaults to backward elimination
stepAIC(watermod)
## Start:  AIC=774.36
## BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB + APMAM
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ OPSLAKE + OPRC + OPBPC + APSLAKE + APSAB
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ OPSLAKE + OPRC + APSLAKE + APSAB
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + OPRC + APSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + APSLAKE)
## 
## Coefficients:
## (Intercept)      OPSLAKE         OPRC      APSLAKE  
##       15425         2390         1797         1712

In the first step, we see that the model with APMAM eliminated has the lowest AIC, so R drops this predictor. We continue the process, removing one predictor per step, and arrive at the same set of informative predictors as the forward method.

All Subsets Regression

All subsets regression requires fitting every possible combination of the variables (\(2^k\) models for \(k\) predictors) and comparing the AIC of each model.
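
A brute-force sketch of this idea (illustrative code, not from class), reusing the water data loaded above:

# fit every subset of the six predictors and record each model's AIC
predictors <- c("OPSLAKE", "OPRC", "OPBPC", "APSLAKE", "APSAB", "APMAM")
results <- data.frame(model = character(), AIC = numeric())
for (m in 1:length(predictors)) {
  for (vars in combn(predictors, m, simplify = FALSE)) {
    f <- reformulate(vars, response = "BSAAM")
    results <- rbind(results,
                     data.frame(model = deparse(f), AIC = AIC(lm(f, data = water))))
  }
}
results[which.min(results$AIC), ]  # the subset with the lowest AIC

Note that AIC() keeps constants that stepAIC()’s extractAIC() drops, so the absolute values differ from the stepwise output above, but the models are ranked the same way.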

Comparing Models

It is important to use the correct comparisons between models. If we are comparing nested models, we can use the T-test, partial F-test, or Mallow’s \(C_p\). If the models aren’t nested, AIC or Mallow’s \(C_p\) can be used.