Model Building

This chapter we went over the different tools that we can use for Model Comparison. These are R^2, which we learned in chapter 3, AIC, Stepwise, t-tests, F-tests, and Mallows’s CP.

When using R^2, if we add predictors it can only stay the same or increase no matter how significant the variables are. This makes this model comparison tool not very uses full. Adjusted R^2 adds a penalty for datasets with a large number of variables. It is still not the most accurate but can be useful if you want something done fast. So the other comparison tools are better.

AIC

This balances the number of predictors with how accurate the model is. It also gives a penalty for using more predictors but since it focuses on the accurate of the model it is a better comparison than Adjusted R^2. The value given by the resulting AIC has no meaning. The values are only there to compare two or more models in the same dataset to see which is better. The smaller the AIC is the better. If an AIC is within 2 of another, then both of the models are equally accurate and we will pick the one with the smaller amount of variables.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
modwater1<-lm(BSAAM~APMAM+APSAB)
modwater2<-lm(BSAAM~APMAM+OPBPC)
AIC(modwater1,modwater2)
##           df      AIC
## modwater1  4 999.1251
## modwater2  4 932.0902

As you can see from the results of the model comparison above modwater2 is much more accurate than modwater1. This is because modwater2 had an AIC of 932 and modwater1 had an AIC of 999.

Stepwise Selection

This a tool that for forward selection takes a simple model and keeps adding variables until the AIC is at its lowest. For backward selection it takes all the variables in the data and removes the variables until the AIC is at its lowest. For both it starts out with all the variables and removes and adds variables until it has the lowest AIC.

library(alr3)
data(water)
attach(water)
## The following objects are masked from water (pos = 3):
## 
##     APMAM, APSAB, APSLAKE, BSAAM, OPBPC, OPRC, OPSLAKE, Year
lower.scope <- lm(BSAAM ~ 1, data = water)
upper.scope <- lm(BSAAM ~ ., data = water)
stepFwd <- step(lm(BSAAM ~ 1, data = water), scope = list(lower = lower.scope, upper = upper.scope), direction = "forward")
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## + Year     1 7.9010e+08 2.6561e+10 874.38
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + Year     1  45570705 3218439749 785.63
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + Year   1   7292595 2593349193 778.35
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
stepBwd<-step(lm(BSAAM ~ ., data = water), scope = list(lower = lower.scope, upper = upper.scope), direction = "backward")
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
stepBoth<- step(lm(BSAAM ~ ., data = water), scope = list(lower = lower.scope, upper = upper.scope), direction = "both")
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## + OPBPC    1   1571591 1970400272 774.53
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## + APMAM    1   2626064 1971971864 772.57
## + OPBPC    1   1935461 1972662466 772.58
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## + APSAB    1    4943947 1974597927 770.62
## + OPBPC    1    1344102 1978197773 770.70
## + APMAM    1     682686 1978859189 770.72
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## + Year     1   89405710 1979541875 768.73
## + APSAB    1   11814207 2057133378 770.39
## + APMAM    1    1410311 2067537274 770.60
## + OPBPC    1     583748 2068363837 770.62
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
stepFwd
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC, data = water)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797
stepBwd
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390
stepBoth
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

As you can see from the results forward selection, backward selection, and both selection all end out with the same result: BSAAM= 15425+ 1712APSLAKE+ 1797OPRC+ 2390*OPSLAKE. This result might have changed if we added in a lot more variables because then there are more paths for each stepwise to take.

water is not the only dataset that we can use this with. We could also try out the dataset OKCupid.

love<-read.csv("http://www.cknudson.com/data/OKCupid.csv")
attach(love)
lower.scope2 <- lm(IdealMateHeight ~ 1,data=love)
upper.scope2 <- lm(IdealMateHeight ~ .,data=love)
stepFwd2 <- step(lm(IdealMateHeight ~ 1,data=love), scope = list(lower = lower.scope2, upper = upper.scope2), direction = "forward")
## Start:  AIC=356.86
## IdealMateHeight ~ 1
## 
##          Df Sum of Sq     RSS    AIC
## + Sex     1    994.84  898.47 258.98
## + Height  1    109.33 1783.99 350.89
## <none>                1893.31 356.86
## + Age     1      0.13 1893.18 358.86
## 
## Step:  AIC=258.98
## IdealMateHeight ~ Sex
## 
##          Df Sum of Sq    RSS    AIC
## + Height  1    302.88 595.59 205.89
## <none>                898.47 258.98
## + Age     1      0.00 898.47 260.98
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##        Df Sum of Sq    RSS    AIC
## <none>              595.59 205.89
## + Age   1   0.43821 595.15 207.79
stepBwd2<-step(lm(IdealMateHeight ~ .,data=love), scope = list(lower = lower.scope2, upper = upper.scope2), direction = "backward")
## Start:  AIC=207.79
## IdealMateHeight ~ Sex + Height + Age
## 
##          Df Sum of Sq     RSS    AIC
## - Age     1      0.44  595.59 205.89
## <none>                 595.15 207.79
## - Height  1    303.32  898.47 260.98
## - Sex     1   1188.55 1783.70 352.87
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##          Df Sum of Sq     RSS    AIC
## <none>                 595.59 205.89
## - Height  1    302.88  898.47 258.98
## - Sex     1   1188.40 1783.99 350.89
stepBoth2<- step(lm(IdealMateHeight ~ .,data=love), scope = list(lower = lower.scope2, upper = upper.scope2), direction = "both")
## Start:  AIC=207.79
## IdealMateHeight ~ Sex + Height + Age
## 
##          Df Sum of Sq     RSS    AIC
## - Age     1      0.44  595.59 205.89
## <none>                 595.15 207.79
## - Height  1    303.32  898.47 260.98
## - Sex     1   1188.55 1783.70 352.87
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##          Df Sum of Sq     RSS    AIC
## <none>                 595.59 205.89
## + Age     1      0.44  595.15 207.79
## - Height  1    302.88  898.47 258.98
## - Sex     1   1188.40 1783.99 350.89
stepFwd2
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height, data = love)
## 
## Coefficients:
## (Intercept)         SexM       Height  
##      39.246       -8.538        0.493
stepBwd2
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height, data = love)
## 
## Coefficients:
## (Intercept)         SexM       Height  
##      39.246       -8.538        0.493
stepBoth2
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height, data = love)
## 
## Coefficients:
## (Intercept)         SexM       Height  
##      39.246       -8.538        0.493

As we can see from above the results are all the same again. The IdealMateHeight=39.246-8.538SexM+.493Height.

Now there are more than 1 code that can do stepwise selections. Below are some of them using the dataset water and the library(MASS).

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(upper.scope)
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

This is another way to use stepwise backward selection.

stepAIC(lower.scope, direction="forward", scope=list(upper=upper.scope))
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## + Year     1 7.9010e+08 2.6561e+10 874.38
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + Year     1  45570705 3218439749 785.63
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + Year   1   7292595 2593349193 778.35
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC, data = water)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797

This is another way to use stepwise forward selection.

F-test and t-tests

F-tests and t-tests are very simple and have been covered in many of the learning logs so instead of showing the code to create them I will just show you the 3 steps it takes using the water dataset.

Step 1: make 6 single linear regressions Find the one with the smallest p value. Step 2 make a couple of multiple linear regression models and pick the one with the smallest p value. Step 3 Find variables that finds the best p value. You will find sometimes that there are no better model. They are equally as good.

Mallows’s CP

This is very similar to AIC in that smaller variables are better but their accuracy also matters. To do this we take a model made of a collection of some variables from a dataset and find its residual sum of squares. Then we divide this by the model with the complete collection of variables in the same dataset’s residual sum of squares. If the variables in the first model are good the value of Mallows’s CP will be close to 1.

modelwater<- lm(BSAAM~OPSLAKE+APSLAKE+OPRC)
modelwater2<-lm(BSAAM~.,data=water)
anova(modelwater)
## Analysis of Variance Table
## 
## Response: BSAAM
##           Df     Sum Sq    Mean Sq F value    Pr(>F)    
## OPSLAKE    1 2.4087e+10 2.4087e+10 454.044 < 2.2e-16 ***
## APSLAKE    1 6.6337e+08 6.6337e+08  12.505  0.001065 ** 
## OPRC       1 5.3169e+08 5.3169e+08  10.023  0.002998 ** 
## Residuals 39 2.0689e+09 5.3050e+07                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(modelwater2)
## Analysis of Variance Table
## 
## Response: BSAAM
##           Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Year       1 7.9010e+08 7.9010e+08  14.0345  0.000646 ***
## APMAM      1 1.5584e+09 1.5584e+09  27.6813 7.274e-06 ***
## APSAB      1 4.6712e+07 4.6712e+07   0.8297  0.368580    
## APSLAKE    1 3.0374e+08 3.0374e+08   5.3954  0.026128 *  
## OPBPC      1 1.9567e+10 1.9567e+10 347.5608 < 2.2e-16 ***
## OPRC       1 2.7013e+09 2.7013e+09  47.9827 4.748e-08 ***
## OPSLAKE    1 4.1371e+08 4.1371e+08   7.3486  0.010328 *  
## Residuals 35 1.9704e+09 5.6297e+07                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can see from the anova that the smaller model has a sum of squared residuals equal to 2.0689x10^9 and the complete model has one equal to 1.9704x10^9.

So Mallows’s CP is:

(2.0689*10^9)/(1.9704*10^9)
## [1] 1.04999

As we can see these variables are good.