In our most recent class, we discussed model building, which essentially means deciding which significant predictors to include in a model. We focused on three ways of comparing models: the t-test, the F-test, and AIC. T-tests and F-tests require nested models, meaning that one model’s predictors are a subset of the other model’s. AIC does not have this requirement; when deciding whether a predictor should be added, the model with the smaller AIC is preferred.
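In R, the F-test for two nested models is the anova() function, and AIC() works for any fitted models. Here is a quick sketch on the built-in mtcars data (not the class data; the object names small_mod and big_mod are made up for illustration):
small_mod <- lm(mpg ~ wt, data = mtcars)     # simpler model
big_mod <- lm(mpg ~ wt + hp, data = mtcars)  # one extra predictor, so the models are nested
anova(small_mod, big_mod)  # F-test: requires nesting
AIC(small_mod)             # AIC: smaller is better, nesting not required
AIC(big_mod)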
We will use the water data set to look at examples:
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
When we build models, we can use either “forward selection” or “backward elimination.” Forward selection means starting with a simple model and adding predictors one at a time using these tests, while backward elimination means starting with a complicated model and removing predictors. We can automate this process for AIC with the stepAIC function applied to our water data. In this example, we will create a complicated model with all the predictors of runoff (BSAAM) included and tell the function that we want to perform backward elimination.
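The stepAIC function comes from the MASS package, so we load it first:
library(MASS)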
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
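We then fit the full model and pass it to stepAIC, asking for backward elimination (a minimal sketch; the object name fullmod is our own choice):
fullmod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
stepAIC(fullmod, direction = "backward")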
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
This procedure starts with the model containing all the predictors and then, one at a time, eliminates the predictor whose removal results in the lowest AIC. So it began by eliminating APMAM, which brought our AIC from 774.36 down to 772.36. It then eliminated OPBPC, which brought the AIC from 772.36 down to 770.39, and so on until removing any of the remaining predictors would only increase the AIC. In this example, we are left with the snowfall at sites APSLAKE, OPRC, and OPSLAKE as the best predictors of runoff.
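If we want to keep working with this final model, we can refit it directly (the object name finalmod here is just one possible choice):
finalmod <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE)  # same fit as the Call printed above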
We can also perform this kind of model selection manually using t-tests. We will use the OKCupid data for this example.
cupid <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(cupid)
head(cupid)
## Sex Height IdealMateHeight Age
## 1 F 59 66 28
## 2 F 60 70 36
## 3 F 60 70 39
## 4 F 60 72 26
## 5 F 60 72 30
## 6 F 61 65 26
We will perform forward selection this time, so we start with a simple model.
cmod1 <- lm(IdealMateHeight ~ Sex)
summary(cmod1)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.147 -1.697 0.303 1.303 6.303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.1471 0.3164 224.88 <2e-16 ***
## SexM -5.4501 0.4508 -12.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared: 0.5255, Adjusted R-squared: 0.5219
## F-statistic: 146.2 on 1 and 132 DF, p-value: < 2.2e-16
With this simple model, we have a very small p-value for our t-test (less than 2.2 × 10^-16), so sex is a good predictor of ideal mate height. We will now add height to our model as another predictor.
cmod2 <- lm(IdealMateHeight ~ Sex + Height)
summary(cmod2)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1770 -1.2642 0.2219 1.2602 5.3020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2456 3.9171 10.019 < 2e-16 ***
## SexM -8.5383 0.5281 -16.168 < 2e-16 ***
## Height 0.4930 0.0604 8.162 2.38e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6806
## F-statistic: 142.7 on 2 and 131 DF, p-value: < 2.2e-16
Because we still have a small p-value (2.38 × 10^-13) for our t-test, we can see that height is also a good predictor of ideal mate height. We will now add our last predictor, age.
cmod3 <- lm(IdealMateHeight ~ Sex + Height + Age)
summary(cmod3)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1639 -1.2728 0.2494 1.3221 5.2108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.65668 4.14920 9.558 < 2e-16 ***
## SexM -8.54402 0.53027 -16.113 < 2e-16 ***
## Height 0.49372 0.06066 8.140 2.8e-13 ***
## Age -0.01520 0.04913 -0.309 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared: 0.6857, Adjusted R-squared: 0.6784
## F-statistic: 94.52 on 3 and 130 DF, p-value: < 2.2e-16
Our p-value for age is large (0.758), so age is not a good predictor of ideal mate height, and we would stick with the model that uses only sex and height (cmod2).
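Since cmod2 and cmod3 are nested (cmod2’s predictors are a subset of cmod3’s), we could also compare them with an F-test using the anova() function; with only one added predictor, this F-test gives the same p-value as the t-test for age above (output omitted here):
anova(cmod2, cmod3)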
We could have also performed AIC tests manually for all of these three models:
AIC(cmod1)
## [1] 641.2577
AIC(cmod2)
## [1] 588.1644
AIC(cmod3)
## [1] 590.0658
As we can see, adding height made our AIC go down, which is a good thing, but adding age to the model caused it to increase, which we do not want.
In general, we would rather have a simple model, so our rule of thumb is that we need an AIC drop of at least 10 to justify using a more complicated model.
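As a quick check of that rule of thumb against the three models above (using the AIC values we just computed):
AIC(cmod1) - AIC(cmod2)  # about 53: a drop well beyond 10, so adding height is justified
AIC(cmod2) - AIC(cmod3)  # about -2: AIC actually increases, so we leave age out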