General Summary of Topics

Model Building

Today we learned how to build models using the forward selection and backward elimination methods. From my understanding, in forward selection we start out with a simple linear regression model using the predictor with the most significant relationship to the response. We then add predictors one by one, in order from most significant to least significant, until adding a new predictor no longer significantly improves the model. In backward elimination, we start with a model that includes all of the predictors and remove the least significant ones one at a time until only the necessary predictors are left.

Criteria for Comparison

We can use five criteria to carry out the model building described above: partial F-tests, t-tests, R^2, the AIC, and Mallows’ Cp. The partial F-test tells us whether a predictor (or a group of predictors) significantly improves a model that already contains the others, and therefore which ones we should keep in our final model. The t-test does the same job for a single predictor. R^2 is not very helpful for comparing models of different sizes because it never decreases when a predictor is added, so we typically save it as a last resort. AIC balances the number of predictors against the log likelihood: AIC = 2(k+1) - 2(log likelihood of the model), where the model has k predictors. The first term penalizes extra predictors and the second rewards fit, so we prefer the model with the smallest AIC (the most negative one if the values are negative). The same pattern goes for Mallows’ Cp.
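As a rough check on the AIC formula above, the sketch below computes it by hand in R and compares it with two built-in functions. The simulated data frame `dat` and all variable names are made up for illustration. As far as I understand it, R's `AIC()` also counts the error variance as a parameter, and `extractAIC()` (whose scale `stepAIC` prints) drops constants that are the same for every model fit to the same data, so the absolute numbers differ while the ranking of models does not.

```r
# Simulated data, purely illustrative (not the water data)
set.seed(3)
n <- 30
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = dat)
k   <- length(coef(fit)) - 1          # number of predictors (intercept excluded)
ll  <- as.numeric(logLik(fit))        # log likelihood of the fitted model

# Formula as written in these notes
aic_notes <- 2 * (k + 1) - 2 * ll

# R's AIC() counts the error variance too, so it is larger by exactly 2
aic_r <- AIC(fit)
aic_r - aic_notes

# Scale used by extractAIC() and printed by stepAIC: n*log(RSS/n) + 2*(k+1)
rss      <- sum(resid(fit)^2)
aic_step <- n * log(rss / n) + 2 * (k + 1)
extractAIC(fit)[2]                    # second element is the AIC value

# Same arithmetic with numbers copied from the stepAIC output later in
# these notes: n = 43, RSS = 2055830733, six predictors plus an intercept.
# Compare with the "Start: AIC=774.36" line there.
aic_water_full <- 43 * log(2055830733 / 43) + 2 * 7
round(aic_water_full, 2)
```

Because the three versions differ only by constants for a given data set, any of them picks the same model when comparing fits to the same data.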

Forward Selection using Water Data and the t-test

library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
data(water)
attach(water)

mod1<-lm(BSAAM ~ APMAM)
mod2<-lm(BSAAM ~ APSAB)
mod3<-lm(BSAAM ~ APSLAKE)
mod4<-lm(BSAAM ~ OPBPC)
mod5<-lm(BSAAM ~ OPRC)
mod6<-lm(BSAAM ~ OPSLAKE)

summary(mod1)
## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234
summary(mod2)
## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394
summary(mod3)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069
summary(mod4)
## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15
summary(mod5)
## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16
summary(mod6)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

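So OPSLAKE, OPRC, and OPBPC are the most significant predictors (each has a p-value below 1e-14), with OPSLAKE the strongest (t = 17.394). The first three predictors (APMAM, APSAB, and APSLAKE) do not have significant linear relationships with the response on their own (p-values 0.123, 0.239, and 0.107). Starting from OPSLAKE, let’s test the significant ones by adding one predictor at a time.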
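Fitting and reading six separate summaries gets tedious, so the first forward-selection step can be sketched as a small helper that fits every single-predictor model and sorts the slope p-values. The function name `single_predictor_pvalues` and the simulated data are hypothetical and only there for illustration. With the water data, the analogous call would presumably pass `water`, `"BSAAM"`, and the six predictor names.

```r
# Simulated data, purely illustrative (not the water data)
set.seed(42)
n <- 50
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 3 * dat$x1 + 0.5 * dat$x2 + rnorm(n)

# Hypothetical helper: fit response ~ predictor for each candidate and
# return the slope p-values, smallest (most significant) first
single_predictor_pvalues <- function(data, response, predictors) {
  pvals <- sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$coefficients[p, "Pr(>|t|)"]
  })
  sort(pvals)
}

ranked <- single_predictor_pvalues(dat, "y", c("x1", "x2", "x3"))
ranked            # named vector of p-values in increasing order
names(ranked)[1]  # the predictor to start forward selection with
```

Passing `data = ...` to `lm()` inside the helper also avoids relying on `attach()`.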

mod7<-lm(BSAAM ~ OPSLAKE + OPRC, data=water)
summary(mod7)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC, data = water)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16

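OPRC stays significant (p = 0.0057) after OPSLAKE is already in the model, so we cement OPRC into the model with OPSLAKE. OPBPC, by contrast, does not have a significant linear relationship with the response when OPSLAKE is included (that two-predictor fit is not shown here), so we should leave it out. Do we need all three, though?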

mod13<-lm(BSAAM ~ OPSLAKE + OPRC + OPBPC)
summary(mod13)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC + OPBPC)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15964.1  -6491.8   -404.4   4741.9  19921.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 22991.85    3545.32   6.485  1.1e-07 ***
## OPSLAKE      2353.96     771.71   3.050  0.00410 ** 
## OPRC         1867.46     647.04   2.886  0.00633 ** 
## OPBPC          40.61     502.40   0.081  0.93599    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8304 on 39 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8941 
## F-statistic: 119.2 on 3 and 39 DF,  p-value: < 2.2e-16

This 3-predictor model is NOT better: OPBPC is not significant once OPSLAKE and OPRC are in the model (p = 0.936), and the adjusted R-squared drops slightly from 0.8967 to 0.8941. Thus we use the model with OPSLAKE and OPRC only. This is not surprising, since OPBPC was already insignificant when it was combined with OPSLAKE alone.
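The comparison between the 2-predictor and 3-predictor models can also be run as a partial F-test with `anova()`. The sketch below uses simulated data with made-up names, not the water data. It shows that when the larger model adds exactly one predictor, the partial F statistic equals the square of that predictor's t statistic, so both tests give the same p-value.

```r
# Simulated data, purely illustrative (not the water data)
set.seed(7)
n <- 40
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 + 1 * dat$x2 + rnorm(n)

small <- lm(y ~ x1 + x2, data = dat)        # reduced model
big   <- lm(y ~ x1 + x2 + x3, data = dat)   # adds one candidate predictor

# Partial F-test: does x3 improve on the model that already has x1 and x2?
ftab   <- anova(small, big)
f_stat <- ftab$F[2]
f_pval <- ftab[["Pr(>F)"]][2]

# t-test for x3 in the larger model
t_stat <- summary(big)$coefficients["x3", "t value"]
t_pval <- summary(big)$coefficients["x3", "Pr(>|t|)"]

c(F = f_stat, t_squared = t_stat^2)   # identical
c(F_test = f_pval, t_test = t_pval)   # identical
```

The partial F-test becomes the more useful tool when several predictors are added or dropped at once, where there is no single t statistic to read.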

Backward Elimination using the Water Data and AIC

Now we’ll use backward elimination by starting with all of our predictors in one model and dropping them one by one, here using AIC rather than p-values to decide which predictor to drop.
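Backward elimination can also be driven by p-values. A minimal sketch of that loop follows, assuming a cutoff of 0.05 and using `drop1()` with F-tests. The function name `backward_by_p`, the cutoff, and the simulated data are all assumptions made for illustration, not something taken from the class code.

```r
# Simulated data, purely illustrative (not the water data)
set.seed(1)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 3 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Hypothetical helper: repeatedly drop the least significant predictor
# until every remaining predictor has a p-value at or below alpha
backward_by_p <- function(fit, alpha = 0.05) {
  repeat {
    tab        <- drop1(fit, test = "F")
    pvals      <- tab[["Pr(>F)"]][-1]     # first row is "<none>"
    terms_left <- rownames(tab)[-1]
    if (length(pvals) == 0 || max(pvals) <= alpha) break
    worst <- terms_left[which.max(pvals)]
    fit   <- update(fit, as.formula(paste(". ~ . -", worst)))
  }
  fit
}

final <- backward_by_p(lm(y ~ x1 + x2 + x3, data = dat))
formula(final)
```

Each pass refits the model after a drop, which matters because the p-values of the remaining predictors change once a correlated predictor leaves the model.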

mod9<-lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
summary(mod9)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16
library(MASS)
## Warning: package 'MASS' was built under R version 3.3.3
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(mod9)
## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

So these three predictors (APSLAKE, OPRC, and OPSLAKE) are what we should include in our model according to backward elimination with AIC. Note that these are the same predictors we got before using backward elimination with p-values, but that will not always be true! The direction matters too: the forward selection with t-tests above ended with only OPSLAKE and OPRC.
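To run the AIC search in the forward direction, the search has to start from an intercept-only model and be told which predictors it may add. The sketch below uses `step()` from base R on simulated data with made-up names. My understanding is that `stepAIC` from MASS accepts the same `scope` and `direction` arguments. For the water data the starting model would presumably be `lm(BSAAM ~ 1, data = water)` with the six predictors in the upper scope, which I have not run here.

```r
# Simulated data, purely illustrative (not the water data)
set.seed(11)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

# Forward: start with the intercept only and let step() add predictors
null_fit <- lm(y ~ 1, data = dat)
fwd <- step(null_fit,
            scope = list(lower = ~ 1, upper = ~ x1 + x2 + x3),
            direction = "forward",
            trace = 0)                 # trace = 0 hides the step-by-step tables

# Backward: start with everything and let step() drop predictors
full_fit <- lm(y ~ x1 + x2 + x3, data = dat)
bwd <- step(full_fit, direction = "backward", trace = 0)

formula(fwd)
formula(bwd)
```

Setting `trace = 1` (the default) prints the same kind of step tables shown in the stepAIC output above, which makes it easier to see where the two directions part ways.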