Create a Model for runoff using some of the snowfall measurements

library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
head(water)
##   Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
## 1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
## 2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
## 3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
## 4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
## 5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
## 6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594

Forward Selection

Using forward selection, we will start with a model with fewer predictors and then add more to it. For our simpler model, let’s use the first two snowfall measurements as predictors, and the stream runoff as the response. We will then test it compared to using the same first two predictors, but then adding the third snowfall measurement as another predictor, then we will compare the two models using the t-test

mysimpmod <- lm(BSAAM ~ APMAM + APSAB)
summary(mysimpmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -35987 -15899  -4993  16981  72494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63719.6    10267.8   6.206 2.43e-07 ***
## APMAM         2272.1     2253.3   1.008    0.319    
## APSAB         -559.3     3401.1  -0.164    0.870    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 40 degrees of freedom
## Multiple R-squared:  0.05755,    Adjusted R-squared:  0.01043 
## F-statistic: 1.221 on 2 and 40 DF,  p-value: 0.3056
myaddmod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE)
summary(myaddmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -42240 -16042  -4762  18625  69017 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63410      10269   6.175 2.96e-07 ***
## APMAM           1580       2354   0.671    0.506    
## APSAB          -3874       4718  -0.821    0.417    
## APSLAKE         4218       4163   1.013    0.317    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25380 on 39 degrees of freedom
## Multiple R-squared:  0.08173,    Adjusted R-squared:  0.01109 
## F-statistic: 1.157 on 3 and 39 DF,  p-value: 0.3384

Looking at the p-values of the F-test, we are not satisfied with either value produced, so we will add further predictors to decrease that value. let’s add the next one on the list.

my3rdmod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC)
summary(my3rdmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24999  -7388   -654   5089  37150 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31556.94    5378.60   5.867 8.66e-07 ***
## APMAM        -111.20    1086.14  -0.102    0.919    
## APSAB         -73.65    2181.42  -0.034    0.973    
## APSLAKE      2083.17    1913.14   1.089    0.283    
## OPBPC        2889.13     237.31  12.175 1.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11610 on 38 degrees of freedom
## Multiple R-squared:  0.8126, Adjusted R-squared:  0.7929 
## F-statistic:  41.2 on 4 and 38 DF,  p-value: 2.498e-13

Here we see that the F-test gives our third model a p-value of 2.498 x 10^(-13), we would want to move forward with this model compared to the two models with fewer predictors.

Backward Elimination

Now we will work backwards from our third model and see if we can still produce an accurate model with fewer predictors. First, we will call up our third model again.

summary(my3rdmod)
## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24999  -7388   -654   5089  37150 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 31556.94    5378.60   5.867 8.66e-07 ***
## APMAM        -111.20    1086.14  -0.102    0.919    
## APSAB         -73.65    2181.42  -0.034    0.973    
## APSLAKE      2083.17    1913.14   1.089    0.283    
## OPBPC        2889.13     237.31  12.175 1.10e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11610 on 38 degrees of freedom
## Multiple R-squared:  0.8126, Adjusted R-squared:  0.7929 
## F-statistic:  41.2 on 4 and 38 DF,  p-value: 2.498e-13

Next, let’s look at a model with one less predictor and compare the F-test p-values of each model, and see whether the reduced model produces a lower p-value. Let’s try eliminating the first predictor.

mynewmod <- lm(BSAAM ~ APSAB + APSLAKE + OPBPC)
summary(mynewmod)
## 
## Call:
## lm(formula = BSAAM ~ APSAB + APSLAKE + OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24680  -7524   -546   5175  37145 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31444.0     5197.0   6.050 4.41e-07 ***
## APSAB         -158.8     1991.1  -0.080    0.937    
## APSLAKE       2029.3     1815.9   1.118    0.271    
## OPBPC         2886.0      232.4  12.421 3.94e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11470 on 39 degrees of freedom
## Multiple R-squared:  0.8126, Adjusted R-squared:  0.7982 
## F-statistic: 56.36 on 3 and 39 DF,  p-value: 3.044e-14

Now we see that our reduced model produced the smaller p-value when comparing using the F-test. We actually gained form eliminating that one predictor, possibly it was a poor one. If we had to choose between these two models, the reduced model would be favored.

Testing Using AIC

Now that we have done both forward selection and backward elimination, let’s try testing two non-nested models using AIC. We can use our reduced model from the previous example as our first model.

AIC(mynewmod)
## [1] 931.677

And let’s compare that to a different model with a few changed predictors.

adiffmod <- lm (BSAAM ~ APSLAKE + OPBPC + OPSLAKE)
AIC(adiffmod)
## [1] 902.4931

As we look at both AIC values, first it is noticed that they are both very high, but our second model is about thirty less than the original. With a difference of more than ten we would switch from the original model to the second due to its lower AIC value.

myfullmod <- lm (BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE )
step(myfullmod)
## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
## 
## Coefficients:
## (Intercept)      APSLAKE         OPRC      OPSLAKE  
##       15425         1712         1797         2390

Through the steps function in R, it compares the AIC’s of all the variables if they were dropped. Here we see that when doing this test that APSLAKE, OPRC, and OPSLAKE are the best three predictors for BSAAM which is the measurement of runoff.