Building Models

Adding Predictors

When model building and adding predictors to your model, its important to keep in mind what ways you can test if adding predictors is good or not. A bad way is by looking at \(R^2\). It is bad, because \(R^2\) will always increase as more predictors are being added. Therefore, its impossible to know if you are adding unneeded variables.

Better ways to test the added predictors are:

  1. Adjusted \(R^2\) - It penalizes for the number of predictors
  2. T-Test
  3. Partial F-test
  4. AIC
  5. Mallow’s cp - smaller the better => ideally = k+1 (number of betas) => doesn’t get used much

Practice

It’s important to know when building models, you can use forward selection or backward elimination. Forward selection means you start with a reduced model and then add more predictors using tests mentioned above. Backward elimination means you start with a complete model and eliminate any predictors that are unnecessary.

I will use the data set water to put these tests in practice.

library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)

I will now create a model for runoff using some of the snowfall measurements and perform multiple tests that use backward elimination.

T-test

First, create a model with all variables and run summary(model) to determine what variables are needed. Look at steps below to see what the steps are needed in determining model.

snowmod <- lm(BSAAM~ APMAM +APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data=water)
summary(snowmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + 
##     OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPBPC          69.70     461.69   0.151 0.880839    
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16
snowmod2 <- lm(BSAAM~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, data=water)
summary (snowmod2)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE, 
##     data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPBPC          68.94     453.50   0.152 0.879996    
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16
snowmod3 <- lm(BSAAM~ APSAB + APSLAKE  + OPRC + OPSLAKE, data=water)
summary (snowmod3)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## APSAB         -650.6     1392.8  -0.467 0.643055    
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16
snowmod4 <- lm(BSAAM~  APSLAKE  + OPRC + OPSLAKE, data=water)
summary (snowmod4)
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

Steps:

  1. drop APMAM because it has highest p value, then create new model without APMAM

  2. drop OPBPC because it has highest p value, then create new model w/out APMAM, OPBPC

  3. drop APSAB because it has highest p value, then create new model w/out APMAM,OPBPC,APSAB

The three site predictors left with low p-values mean that those predictors the most useful and important in our model and the the three predictors I dropped did not have a linear relationship between those predictors and runoff response variable.

AIC

When determining whether to use AIC or T-test, you need to look if your two models are nested with each other. They are nested if ones model’s predictors are a subset of the other model’s predictors. If they are, use T-test or F-test.

If the two models are not nested, you should use AIC.

I will use the data set water and use AIC with backward elimination. The stepAIC function takes the complete model with all predictors (snowmod) and then automates the process of backward elimination to find the best model with least amount of predictors.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
## 
##     forbes
stepAIC(snowmod, direction="backward")
## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE, data = water)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

We want the smallest AIC as possible.

To begin, we need to look at the first step in the output with all predictors listed. Whatever predictor we choose to eliminate, the AIC of the model will become whatever the AIC value is in that row. Therefore, we will eliminate the predictor with the lowest AIC value because that’s what the model’s AIC will become.

Steps

1) drop APMAM

2) drop OPBPC

3) drop APSAB

4) finally, drop none because <none> has the lowest AIC

THEREFORE,

The model BSAAM ~ APSLAKE + OPRC + OPSLAKE has the lowest AIC of 768.63. These are the 3 sites that are most useful and good enough predictors for understanding the runoff.