In our most recent class, we discussed model building, which essentially means deciding which significant predictors to include in a model. We focused on three ways of comparing models: the t-test, the F-test, and AIC. T-tests and F-tests require nested models, meaning that one model’s predictors are a subset of the other model’s. AIC does not have this requirement; when deciding whether a predictor should be added, the model with the smaller AIC is preferred.
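In R, the F-test for two nested models is the anova() function, and AIC() works for any fitted models. Here is a quick sketch on the built-in mtcars data (not the class data; the object names small_mod and big_mod are made up for illustration):
small_mod <- lm(mpg ~ wt, data = mtcars)     # simpler model
big_mod <- lm(mpg ~ wt + hp, data = mtcars)  # one extra predictor, so the models are nested
anova(small_mod, big_mod)  # F-test: requires nesting
AIC(small_mod)             # AIC: smaller is better, nesting not required
AIC(big_mod)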
We will use the water data set to look at examples:
library(alr3)
## Warning: package 'alr3' was built under R version 3.4.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
When we build models, we can use either “forward selection” or “backward elimination.” Forward selection means starting with a simple model and adding predictors one at a time using these tests, while backward elimination means starting with a complicated model and removing predictors. We can automate this process for AIC with the stepAIC function applied to our water data. In this example, we will create a complicated model with all the predictors of runoff (BSAAM) included and tell the function that we want to perform backward elimination.
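The stepAIC function comes from the MASS package, so we load it first:
library(MASS)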
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
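We then fit the full model and pass it to stepAIC, asking for backward elimination (a minimal sketch; the object name fullmod is our own choice):
fullmod <- lm(BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE)
stepAIC(fullmod, direction = "backward")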
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + APSLAKE + OPBPC + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ APSLAKE + OPRC + OPSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ APSLAKE + OPRC + OPSLAKE)
##
## Coefficients:
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
This procedure starts with the model containing all the predictors and then, one at a time, eliminates the predictor whose removal results in the lowest AIC. So it began by eliminating APMAM, which brought our AIC from 774.36 down to 772.36. It then eliminated OPBPC, which brought the AIC from 772.36 down to 770.39, and so on until removing any of the remaining predictors would only increase the AIC. In this example, we are left with the snowfall at sites APSLAKE, OPRC, and OPSLAKE as the best predictors of runoff.
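If we want to keep working with this final model, we can refit it directly (the object name finalmod here is just one possible choice):
finalmod <- lm(BSAAM ~ APSLAKE + OPRC + OPSLAKE)  # same fit as the Call printed above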
We can also perform this kind of model selection manually using t-tests. We will use the OKCupid data for this example.
cupid <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(cupid)
head(cupid)
## Sex Height IdealMateHeight Age
## 1 F 59 66 28
## 2 F 60 70 36
## 3 F 60 70 39
## 4 F 60 72 26
## 5 F 60 72 30
## 6 F 61 65 26
We will perform forward selection this time, so we start with a simple model.
cmod1 <- lm(IdealMateHeight ~ Sex)
summary(cmod1)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.147 -1.697 0.303 1.303 6.303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.1471 0.3164 224.88 <2e-16 ***
## SexM -5.4501 0.4508 -12.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared: 0.5255, Adjusted R-squared: 0.5219
## F-statistic: 146.2 on 1 and 132 DF, p-value: < 2.2e-16
With this simple model, we have a very small p-value for our t-test (less than 2.2 × 10^-16), so sex is a good predictor of ideal mate height. We will now add height to our model as another predictor.
cmod2 <- lm(IdealMateHeight ~ Sex + Height)
summary(cmod2)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1770 -1.2642 0.2219 1.2602 5.3020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2456 3.9171 10.019 < 2e-16 ***
## SexM -8.5383 0.5281 -16.168 < 2e-16 ***
## Height 0.4930 0.0604 8.162 2.38e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6806
## F-statistic: 142.7 on 2 and 131 DF, p-value: < 2.2e-16
Because we still have a small p-value (2.38 × 10^-13) for our t-test, we can see that height is also a good predictor of ideal mate height. We will now add our last predictor, age.
cmod3 <- lm(IdealMateHeight ~ Sex + Height + Age)
summary(cmod3)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1639 -1.2728 0.2494 1.3221 5.2108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.65668 4.14920 9.558 < 2e-16 ***
## SexM -8.54402 0.53027 -16.113 < 2e-16 ***
## Height 0.49372 0.06066 8.140 2.8e-13 ***
## Age -0.01520 0.04913 -0.309 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared: 0.6857, Adjusted R-squared: 0.6784
## F-statistic: 94.52 on 3 and 130 DF, p-value: < 2.2e-16
Our p-value for age is large (0.758), so age is not a good predictor of ideal mate height, and we would stick with the model that uses only sex and height (cmod2).
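Since cmod2 and cmod3 are nested (cmod2’s predictors are a subset of cmod3’s), we could also compare them with an F-test using the anova() function; with only one added predictor, this F-test gives the same p-value as the t-test for age above (output omitted here):
anova(cmod2, cmod3)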
We could have also performed AIC tests manually for all of these three models:
AIC(cmod1)
## [1] 641.2577
AIC(cmod2)
## [1] 588.1644
AIC(cmod3)
## [1] 590.0658
As we can see, adding height made our AIC go down, which is a good thing, but adding age to the model caused it to increase, which we do not want.
In general, we would rather have a simple model, so our rule of thumb is that we need an AIC drop of at least 10 to justify using a more complicated model.
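As a quick check of that rule of thumb against the three models above (using the AIC values we just computed):
AIC(cmod1) - AIC(cmod2)  # about 53: a drop well beyond 10, so adding height is justified
AIC(cmod2) - AIC(cmod3)  # about -2: AIC actually increases, so we leave age out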