Concepts

In class we disscused the rest of multicollinearity. Our main focus in class was Model building. In model building we are trying to find the best model to use to predict our response variable given our predictors that we have. Models can be either nested, like y~x1+x2 and y~x1+x2+x3, or not nested, y~x1+x2 and y~x1+x3. To test if a predictor should be used in the model use T-test or F-test if the model is nested, or use AIC(Akaike Information Criteria) or Mallow’s Cp if the model is not nested. To build a model we can use forward and backward selection as well as all subsets. For backwards and forwards selection we will use the water data in the alr3 library.

Backward Selection

In backward selection you start with a complicated model with all your predictars and pare it down. Testing each time to see if the predictor should be left in the model. For backwards selection I used the step by step approch of deleting the predictor with the highest pvalue.

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
names(water)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"
mod1=lm(BSAAM~OPBPC+OPRC+OPSLAKE+APSLAKE+APSAB+APMAM)
summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC + OPRC + OPSLAKE + APSLAKE + APSAB + 
##     APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APMAM         -12.77     708.89  -0.018 0.985725    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16
mod2=lm(BSAAM~OPRC+OPSLAKE+APSLAKE+APSAB+APMAM)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE + APSAB + APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12748  -5096  -1501   4242  18593 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15752.558   3845.515   4.096 0.000219 ***
## OPRC         1910.356    631.576   3.025 0.004506 ** 
## OPSLAKE      2295.408    501.450   4.578 5.16e-05 ***
## APSLAKE      2246.477   1313.976   1.710 0.095701 .  
## APSAB        -648.489   1499.041  -0.433 0.667815    
## APMAM          -2.978    696.532  -0.004 0.996611    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7456 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9146 
## F-statistic: 90.99 on 5 and 37 DF,  p-value: < 2.2e-16
mod3=lm(BSAAM~OPRC+OPSLAKE+APSLAKE+APSAB)
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE + APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## APSAB         -650.6     1392.8  -0.467 0.643055    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16
mod4=lm(BSAAM~OPRC+OPSLAKE+APSLAKE)
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

Our best model is BSAAM~OPRC+OPSLAKE+APSLAKE.

Forward Selection

In forward selection you start simple and add more predictors and then test them to see if they should have been added. The stepAIC function, with direction=“foward”, gives us a model that adds a predictor with the next lowest AIC until we reach the lowest AIC possible with our model.

mod5=lm(BSAAM~OPRC, data=water)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(mod5, direction="forward", scope=list(upper=BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE, lower=BSAAM~1))
## Start:  AIC=795.28
## BSAAM ~ OPRC
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 1529924515 2689959758 777.92
## + OPBPC    1  888721040 3331163233 787.11
## + APSLAKE  1  635018148 3584866125 790.27
## + APMAM    1  261271943 3958612330 794.53
## + APSAB    1  205266122 4014618151 795.14
## <none>                  4219884273 795.28
## 
## Step:  AIC=777.92
## BSAAM ~ OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 621012173 2068947585 768.63
## + APSAB    1 457345396 2232614362 771.91
## + APMAM    1 387970039 2301989719 773.22
## <none>                 2689959758 777.92
## + OPBPC    1    450573 2689509185 779.91
## 
## Step:  AIC=768.63
## BSAAM ~ OPRC + OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62
## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE, data = water)
## 
## Coefficients:
## (Intercept)         OPRC      OPSLAKE      APSLAKE  
##       15425         1797         2390         1712

Our best model is again BSAAM~OPRC+OPSLAKE+APSLAKE.

All subsets

With all subsets you look at every possible model with your predictors and test using AIC and keep the lowest AIC. For this example we will use the Cupid data. Below I made a subset of every possible predictor combination for this model and I used AIC to test it.

data=read.csv("http://cknudson.com/data/OKCupid.csv")
attach(data)
names(data)
## [1] "Sex"             "Height"          "IdealMateHeight" "Age"
m1=lm(IdealMateHeight~Sex)
AIC(m1)
## [1] 641.2577
m2=lm(IdealMateHeight~Height)
AIC(m2)
## [1] 733.1703
m3=lm(IdealMateHeight~Age)
AIC(m3)
## [1] 741.1309
m4=lm(IdealMateHeight~Sex+Height)
AIC(m4)
## [1] 588.1644
m5=lm(IdealMateHeight~Sex+Age)
AIC(m5)
## [1] 643.2577
m6=lm(IdealMateHeight~Height+Age)
AIC(m6)
## [1] 735.1483
m7=lm(IdealMateHeight~Sex+Height+Age)
AIC(m7)
## [1] 590.0658

As you can see from the AIC our smallest AIC is with model 4 using IdealMaleHeight~Sex+Height.