Today in class we went over model building. There are a few different methods of model building: forward selection, backward elimination, and all subsets regression.

There are also a few different criteria that can be used to compare models (a short R illustration follows the data summary below): 1. the t-test for adding or dropping a single predictor \(x_j\); 2. the partial F-test for any set of predictors; 3. AIC (Akaike Information Criterion), which balances fit against the number of predictors — in model building, you want the AIC to be as small as possible; 4. Mallows's Cp, which, like AIC, is better the smaller it is.

Let’s try to build some models!

We can use the water data to see how the methods differ. Let's load the data and take a look.

library(alr3)
data(water)
attach(water)
summary(water)
##       Year          APMAM            APSAB           APSLAKE     
##  Min.   :1948   Min.   : 2.700   Min.   : 1.450   Min.   : 1.77  
##  1st Qu.:1958   1st Qu.: 4.975   1st Qu.: 3.390   1st Qu.: 3.36  
##  Median :1969   Median : 7.080   Median : 4.460   Median : 4.62  
##  Mean   :1969   Mean   : 7.323   Mean   : 4.652   Mean   : 4.93  
##  3rd Qu.:1980   3rd Qu.: 9.115   3rd Qu.: 5.685   3rd Qu.: 5.83  
##  Max.   :1990   Max.   :18.080   Max.   :11.960   Max.   :13.02  
##      OPBPC             OPRC           OPSLAKE           BSAAM       
##  Min.   : 4.050   Min.   : 4.350   Min.   : 4.600   Min.   : 41785  
##  1st Qu.: 7.975   1st Qu.: 7.875   1st Qu.: 8.705   1st Qu.: 59857  
##  Median : 9.550   Median :11.110   Median :12.140   Median : 69177  
##  Mean   :12.836   Mean   :12.002   Mean   :13.522   Mean   : 77756  
##  3rd Qu.:16.545   3rd Qu.:14.975   3rd Qu.:16.920   3rd Qu.: 92206  
##  Max.   :43.370   Max.   :24.850   Max.   :33.070   Max.   :146345
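
Before building anything, here is how two of the criteria from the list above are computed in R. This is just a sketch using two models we will fit properly later in the post.

fit1<-lm(BSAAM~OPSLAKE)
fit2<-lm(BSAAM~OPSLAKE+OPRC)
AIC(fit1) # Akaike Information Criterion; smaller is better
AIC(fit2) # adding OPRC should lower the AIC if it pulls its weight
anova(fit1,fit2) # partial F-test for the added predictor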

Forward Selection

Forward selection is a model building technique in which you start with the simplest model and add predictors one at a time, stopping once added predictors no longer improve the model. Generally speaking, we keep adding predictors as long as each new one is significant (its p-value is small) and the AIC keeps dropping; once a candidate's p-value becomes large, we stop.

Using the water data as an example, we can build a model using forward selection.

The first step is to see which predictor has the strongest relationship with our response variable. We can do this by fitting a simple linear regression model for each predictor and then looking at the p-value of each slope.

SLR1<-lm(BSAAM~APMAM)
SLR2<-lm(BSAAM~APSAB)
SLR3<-lm(BSAAM~APSLAKE)
SLR4<-lm(BSAAM~OPBPC)
SLR5<-lm(BSAAM~OPRC)
SLR6<-lm(BSAAM~OPSLAKE)
summary(SLR1)
## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234
summary(SLR2)
## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394
summary(SLR3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
summary(SLR4)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15
summary(SLR5)
## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16
summary(SLR6)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16
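
Rather than scrolling through six summaries, the slope p-values can also be pulled out in one pass. This is a sketch using base R; the helper names here (preds, pvals) are just for illustration.

preds<-c("APMAM","APSAB","APSLAKE","OPBPC","OPRC","OPSLAKE")
pvals<-sapply(preds,function(p){
  fit<-lm(reformulate(p,response="BSAAM"),data=water)
  summary(fit)$coefficients[2,4] # Pr(>|t|) for the slope
})
round(pvals,4)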

Either way, here are all of the p-values in one place:
APMAM: .123
APSAB: .239
APSLAKE: .107
OPBPC: \(3*10^{-15}\)
OPRC: \(<2*10^{-16}\)
OPSLAKE: \(<2*10^{-16}\)

Our second step is to choose the predictor with the strongest relationship to the response. As a shortcut, we will set the AP predictors aside for the rest of this forward selection, since none of them is significant on its own. Be careful with this rule of thumb, though: a predictor that looks unrelated in a simple regression can still matter in a multiple regression, and in fact the backward elimination later in this post keeps APSLAKE in its final model.

The variables with the strongest relationships are OPRC and OPSLAKE, whose p-values are both reported as \(<2*10^{-16}\). To break the tie, note that OPSLAKE has the higher R-squared (0.8807 versus 0.8457 for OPRC), so OPSLAKE goes in first.

Our third step is to create a model with OPSLAKE.

mod1<-lm(BSAAM~OPSLAKE)
summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

We can see our p-value is less than \(2*10^{-16}\). That is small enough that we can continue and try to add another predictor. Let's add OPRC, since it essentially tied with OPSLAKE in the simple linear regression setting.

mod2<-lm(BSAAM~OPSLAKE+OPRC)
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16
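
The same conclusion can be reached with a partial F-test (criterion 2 from the list at the top), comparing the one-predictor and two-predictor models directly:

anova(mod1,mod2) # a small p-value supports keeping OPRC in the model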

The p-values of OPRC and OPSLAKE are both extremely small, so we can safely say our model keeps at least these two predictors. Let's try to add another. The only remaining candidate is OPBPC, the only other predictor significant at the simple-regression level. (Strictly speaking, forward selection would add OPBPC on top of mod2; the check below pairs it with OPSLAKE alone, which already shows that OPBPC contributes nothing new.)

mod3<-lm(BSAAM~OPSLAKE+OPBPC)
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17591.0  -5276.6    275.6   3380.7  20867.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27050.95    3540.07   7.641 2.44e-09 ***
## OPSLAKE      3736.16     658.24   5.676 1.35e-06 ***
## OPBPC          14.37     546.41   0.026    0.979    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8747 
## F-statistic: 147.6 on 2 and 40 DF,  p-value: < 2.2e-16

We can see that the p-value for OPBPC is .979, which is nowhere near significant. So we throw that predictor out and declare mod2 our model for these data under forward selection: \[\hat{BSAAM}=\hat\beta_0+\hat\beta_1*OPSLAKE+\hat\beta_2*OPRC\]
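
For the record, R can automate this entire forward search with step(), which adds predictors by AIC. A minimal sketch, assuming the same six candidate predictors:

nullmod<-lm(BSAAM~1,data=water)
step(nullmod,scope=~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE,direction="forward")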

Backward Elimination

If you decide to use backward elimination instead, you might get a different answer. I will build a backward elimination model from the same data to demonstrate.

Backward elimination is when you start with the most complicated model and gradually eliminate the least important predictors. You stop eliminating when every remaining predictor is significant, that is, when dropping any of them would noticeably worsen the fit (for example, raise the AIC).

mod4<-lm(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Looking at all of the p-values, we want to eliminate the predictor with the highest p-value, which is APMAM (p = .986). We will now build another model, this time without APMAM.

mod5<-lm(BSAAM~APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE)
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPBPC          68.94     453.50   0.152 0.879996    
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16

We can see that OPBPC now has the largest p-value (.880), so we will eliminate that one next and build another model.

mod6<-lm(BSAAM~APSAB+APSLAKE+OPRC+OPSLAKE)
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## APSAB         -650.6     1392.8  -0.467 0.643055    
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16

Now, looking at this model, we see that APSAB has the highest p-value (.643), so we will eliminate that predictor and build another model.

mod7<-lm(BSAAM~APSLAKE+OPRC+OPSLAKE)
summary(mod7)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

We see here that all of the predictors have significant p-values, which means this is our final model: \[\hat{BSAAM}=\hat\beta_0+\hat\beta_1*OPSLAKE+\hat\beta_2*OPRC+\hat\beta_3*APSLAKE\]
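
The same elimination can be automated with step(), starting from the full model mod4 and dropping predictors by AIC. A sketch:

step(mod4,direction="backward")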

This is clearly different from our forward selection model. Forward selection tends to produce models with fewer predictors, while backward elimination tends to keep more.

All Subsets Regression

The final method of model building is all subsets regression. Here you fit every possible model from every possible combination of predictors, compare a criterion such as AIC (or Mallows's Cp) across all of them, and pick the winner. I will not demonstrate it by hand, but the sketch below shows how it can be done in R.
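
This sketch uses the leaps package, which is an assumption on my part: it is not loaded anywhere above, so it would need to be installed. Note that regsubsets() reports Mallows's Cp and BIC rather than AIC, but the smaller-is-better logic is the same.

library(leaps)
allsubs<-regsubsets(BSAAM~APMAM+APSAB+APSLAKE+OPBPC+OPRC+OPSLAKE,data=water)
summary(allsubs)$cp # Mallows's Cp for the best model of each size; smaller is better
summary(allsubs)$bic # BIC, a close cousin of AIC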