Model Building Basics

Criteria for Adding Variables

There are many potential criteria for adding variables. We disucssed a few in class.

BAD CRITERIA:

R^2. This is bad beccause it will always increase when more variables are added. This makes it difficult to compare models with different numbers of variables. If you use R^2 you could end up overfitting your data.

DECENTISH CRITERIA:

Adjusted R^2. This is like R^2 but it penalizes for the number of predictors in the model. This might be an okay route to take if you are explaining things to an audience without much stats knowledge.

BETTER TOOLS:

  1. T-test
  2. F-test
  3. AIC : balances model fit with # of predictors. -small AIC is better -If model with lower AIC is more complicated, then it needs to have AIC of at least 10 less than the simpler model in order to be chosen over the simpler model.
  4. Mallow’s Cp: -smaller is better -ideal if close to k+1 -imagine all predictors in the model, then take a subset and make a model -this method isn’t really used much

Methods of Model Building

Stepwise (forward selection)

-start with simple model -gradually add predictors -add most significant first -you can pick which criteria you want to use to determine significance, but remember to make your choice clear.

Stepwise (backward elimination)

-sart with many/all predictors -gradually eliminate predictors -eliminate least significant first -stop when all remaining predictors are significant -once again, specify which method you are using to determine significance

All Subset Regression

-look at ALL possible models -this can get out of control quite quickly

Nested Models

Model1 is said to be nested in model2 if all of the predictor variable in model1 are also contained in model2.

For nested models, we can use all of the possible criteria we talked about. It is better to use a formal test (f-test,t-test) if possible.

For models that are not nested, we cannot use t-test or f-test to determine significance.

Examples

Let’s look at some examples using the “water” data.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
data("water")
attach(water)

Forward Selection (long way)

First, we used forward selection. I used the t-test to tell which predictors were most significant.

#make all models
modone<-lm(BSAAM~APSLAKE)
modtwo<-lm(BSAAM~OPSLAKE)
modthree<-lm(BSAAM~APMAM)
modfour<-lm(BSAAM~APSAB)
modfive<-lm(BSAAM~OPBPC)
modsix<-lm(BSAAM~OPRC)
summary(modone) #no
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
summary(modtwo) #yes
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16
summary(modthree) #no
## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234
summary(modfour) #no
## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394
summary(modfive) #yes
## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15
summary(modsix) #yes
## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16
#if no SIMPLE linear relationship then eliminate from the running.

mod1<-lm(BSAAM ~ OPSLAKE,water)

#Considering these two
mod2<-lm(BSAAM~ OPRC+OPSLAKE, water)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16
mod3<-lm(BSAAM~OPBPC +OPSLAKE)
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17591.0  -5276.6    275.6   3380.7  20867.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27050.95    3540.07   7.641 2.44e-09 ***
## OPBPC          14.37     546.41   0.026    0.979    
## OPSLAKE      3736.16     658.24   5.676 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8747 
## F-statistic: 147.6 on 2 and 40 DF,  p-value: < 2.2e-16
#mod 3 has worse p-value
#mod 2 is BEST

Using Forward Selection, I ended up with the model: mod2<-lm(BSAAM~ OPRC+OPSLAKE, water) We were considering adding OPRC and OPBPC. OPRC was more significant so we added that one.

Backward Elimination(long way)

Next, we used Backward Elimination. I used the t-test to tell which predictors were least significant.

summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16
#take out worst
mod4<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPRC+OPSLAKE)
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12748  -5096  -1501   4242  18593 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15752.558   3845.515   4.096 0.000219 ***
## APMAM          -2.978    696.532  -0.004 0.996611    
## APSAB        -648.489   1499.041  -0.433 0.667815    
## APSLAKE      2246.477   1313.976   1.710 0.095701 .  
## OPRC         1910.356    631.576   3.025 0.004506 ** 
## OPSLAKE      2295.408    501.450   4.578 5.16e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7456 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9146 
## F-statistic: 90.99 on 5 and 37 DF,  p-value: < 2.2e-16
#take out worst
mod5<-lm(BSAAM~APSAB+APSLAKE+OPRC+OPSLAKE)
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## APSAB         -650.6     1392.8  -0.467 0.643055    
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16
#take out worst
mod6<-lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16
#all are significant so mod6 is final

Using Backward Elimination, I got the model: mod6<-lm(BSAAM~APSLAKE+OPRC+OPSLAKE). At this point all the predictors are significant so we will stop removing any.

AIC (automated)

It is much easier to get R to do this for us! The first chunk is a way I learned in my other class and the second is the command we learned in 333. Both produce the same resluts. They use AIC to determine which predictors to add/subtract.

To do this for Backward elimntation just change type from “forward” to “backward”.

modnone<-lm(BSAAM~1)
lower.scope <- lm(BSAAM ~ 1, data = water)
upper.scope <- lm(BSAAM ~ ., data = water)
stepfwd<- step(modnone,scope = list(lower=lower.scope, upper= upper.scope),direction = c("forward"))
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## + Year     1 7.9010e+08 2.6561e+10 874.38
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + Year     1  45570705 3218439749 785.63
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + Year   1   7292595 2593349193 778.35
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
modall<-lm(BSAAM~.,water)
stepAIC(modall)
## Start:  AIC=774.53
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1571591 1971971864 772.57
## - APMAM    1   2262194 1972662466 772.58
## - APSAB    1   7311109 1977711381 772.69
## - Year     1  85430461 2055830733 774.36
## <none>                 1970400272 774.53
## - APSLAKE  1 106880993 2077281265 774.80
## - OPSLAKE  1 413707192 2384107464 780.73
## - OPRC     1 576007855 2546408128 783.56
## 
## Step:  AIC=772.57
## BSAAM ~ Year + APMAM + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1   2626064 1974597927 770.62
## - APSAB    1   6887325 1978859189 770.72
## - Year     1  85160498 2057132362 772.39
## <none>                 1971971864 772.57
## - APSLAKE  1 105315871 2077287734 772.80
## - OPRC     1 574517654 2546489517 781.56
## - OPSLAKE  1 964675516 2936647380 787.69
## 
## Step:  AIC=770.62
## BSAAM ~ Year + APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APSAB    1   4943947 1979541875 768.73
## - Year     1  82535451 2057133378 770.39
## <none>                 1974597927 770.62
## - APSLAKE  1 127432687 2102030614 771.31
## - OPRC     1 575963916 2550561844 779.63
## - OPSLAKE  1 968394770 2942992697 785.78
## 
## Step:  AIC=768.73
## BSAAM ~ Year + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - Year     1   89405710 2068947585 768.63
## <none>                  1979541875 768.73
## - APSLAKE  1  523812582 2503354457 776.83
## - OPRC     1  613807319 2593349193 778.35
## - OPSLAKE  1 1175063776 3154605651 786.77
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390
stepAIC(modnone,direction = "forward", scope = list(lower=lower.scope,upper=upper.scope))
## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## + Year     1 7.9010e+08 2.6561e+10 874.38
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + Year     1  45570705 3218439749 785.63
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + Year   1   7292595 2593349193 778.35
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + Year   1  89405710 1979541875 768.73
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797
##look if would improve by 10.

Using AIC produced a little different result than using the t-tests. If we wanted we could could check the AICS the test produced because if the AIC is not at least 10 less we should just use the simpler model. So the model we would actually pick would be: BSAAM ~ OPSLAKE.