Today we discussed how to compare models and how to build models using a stepwise process.

Recall that \(r^2\) tells us how much of the response is attributed to its linear relationship with the predictor(s). However, this is a poor way to compare models. If we add more predictors, \(r^2\) can only increase or, worst case scenario, stay the same. If we only rely on this method to compare models, we will end up with one that is bloated and overly fitted for the observed data, and not useful for predictions for the larger population.

Adjusted \(r^2\) adds a penalty for the number of predictors, and while this is better, it’s still not ideal for comparing models. Adjusted \(r^2\) might be useful in comparing models for an audience lacking a background in statistics.

Instead we should rely on t-tests, F-tests, AIC, or Mallows’ CP to compare models. We have already covered t-tests and F-tests; we use these to see if the additional predictor or predictors are statistically significant.

AIC balances the number of predictors and overall model fit. We want to maximize fit while minimizing predictors, so we are looking for small values of AIC. If we are comparing two models, we will usually choose the model with a lower AIC. The exception is when the lower AIC model is more complicated; we need an AIC drop of at least 10 to choose the more complex model. AIC is used for comparing two or more models, so the AIC of a single model has little meaning.

Mallow’s CP is another tool that can be used to compare models. As with AIC, smaller values of Mallow’s are better. First we take a subset of predictors from a model, and calculate the subset model’s residual sum of squares. We divide this value by the residual sum of squares for the complete model to get the value of Mallow’s CP.

If the subset of predictors is good, then the RSS for that subset will be close to the RSS of the entire model, giving a smaller value.

It is not required to use all four means to compare models. Instead, whichever method used should be clearly stated. These methods are useful if we already have several models and wish to compare them. They can also be used when we build models.

Building Models: Stepwise

The stepwise approach to building models can be done by either forward selection or backwards elimination.

Forward Selection

In forward selection, we start with a simple model, then gradually add predictors (starting with the most significant ones first). After adding each predictor, we test to see if it’s significant. We stop adding predictors when the remaining ones are no longer significant.

If we do this manually, we first fit simple linear regression models for each predictor and take note of their respective p-values. Any predictors that have a non-significant p-value can be immediately eliminated; we won’t even consider adding them to the model. Then, we find the predictor with the lowest p-value and add that to our model.

Doing this process with the OK Cupid dataset was helpful to understand how it works.

datedata <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(datedata)
summary(datedata)
##  Sex        Height      IdealMateHeight      Age       
##  F:68   Min.   :59.00   Min.   :60.00   Min.   :20.00  
##  M:66   1st Qu.:64.25   1st Qu.:66.00   1st Qu.:28.00  
##         Median :68.00   Median :68.00   Median :30.00  
##         Mean   :67.79   Mean   :68.46   Mean   :29.96  
##         3rd Qu.:71.00   3rd Qu.:72.00   3rd Qu.:32.00  
##         Max.   :76.00   Max.   :77.00   Max.   :41.00

Ideal mate height is the response, which leaves us with height, sex and age as predictors. Next we fit simple linear regression models with IdealMateHeight as the response for each of the three predictors, and look at the p-values.

pred01 <- lm(IdealMateHeight ~ Age)
summary(pred01)
## 
## Call:
## lm(formula = IdealMateHeight ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4880 -2.4797 -0.4506  3.5203  8.5286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.213927   2.623250  26.004   <2e-16 ***
## Age          0.008304   0.086889   0.096    0.924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.787 on 132 degrees of freedom
## Multiple R-squared:  6.92e-05,   Adjusted R-squared:  -0.007506 
## F-statistic: 0.009135 on 1 and 132 DF,  p-value: 0.924

Looks like age has a very high p-value of 0.924, so we won’t even consider adding it to our model.

pred02 <- lm(IdealMateHeight ~ Height)
summary(pred02)
## 
## Call:
## lm(formula = IdealMateHeight ~ Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6262 -2.6511 -0.1796  2.7022  8.9938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.47144    4.93571  16.709  < 2e-16 ***
## Height      -0.20665    0.07266  -2.844  0.00516 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.676 on 132 degrees of freedom
## Multiple R-squared:  0.05774,    Adjusted R-squared:  0.0506 
## F-statistic: 8.089 on 1 and 132 DF,  p-value: 0.005163

Height has a p-value of 0.005163, so it’s definitely significant. We will potentially add this predictor.

pred03 <- lm(IdealMateHeight ~ Sex)
summary(pred03)
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.147 -1.697  0.303  1.303  6.303 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.1471     0.3164  224.88   <2e-16 ***
## SexM         -5.4501     0.4508  -12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared:  0.5255, Adjusted R-squared:  0.5219 
## F-statistic: 146.2 on 1 and 132 DF,  p-value: < 2.2e-16

The p-value for Sex is very small, less than \(2.2*10^{\neg16}\). This p-value is also smaller than the p-value for height, so we will add Sex as the first predictor. A model with only sex as the predictor will be significant, so we can add height.

datemodel01 <- lm(IdealMateHeight ~ Sex + Height)
summary(datemodel01)
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

If we look at the t-test p-value for Height in this case, we can see it’s very small. Height is definitely a significant predictor and we should keep it. The only other predictor in this dataset is age, which we found to be insignificant in the simple linear regression model, so it definitely won’t be significant in a multiple linear regression model. Thus, our two predictors for ideal mate height for the respondents in this dataset are the respondents’ own height and their sex.

Backwards Elimination

In backwards elimination, we start with a model that has all of the predictors, and then we eliminate ones that are not significant. We stop eliminating predictors when the remaining ones should not be dropped.

Doing this with the OK Cupid data would first require a model with all three predictors.

alldate <- lm(IdealMateHeight ~ Sex + Height + Age)
summary(alldate)
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height + Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1639 -1.2728  0.2494  1.3221  5.2108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.65668    4.14920   9.558  < 2e-16 ***
## SexM        -8.54402    0.53027 -16.113  < 2e-16 ***
## Height       0.49372    0.06066   8.140  2.8e-13 ***
## Age         -0.01520    0.04913  -0.309    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared:  0.6857, Adjusted R-squared:  0.6784 
## F-statistic: 94.52 on 3 and 130 DF,  p-value: < 2.2e-16

Using the t-test, we can see that Age is not significant, so it will be dropped. The resulting model:

droppedone <- lm(IdealMateHeight ~ Sex + Height)
summary(droppedone)
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

If we look at t-test p-values for each predictor in this model, we can see they are all significant. Therefore, we do not drop any more. Our model for ideal mate height respondents in the OK Cupid survey is determined by the respondents’ own height and their sex.

Additional Notes

In this case, both forward selection and backwards elimination resulted in the same model. However, this is not always the case. In the wblake example we did in class, forward selection resulted in a model with two predictors, whereas performing backwards elimination resulted in a model with three predictors.

Both models in that example are fine. Backwards elimination has a tendency to produce a more complicated model than forward selection.

Using R for Stepwise Methods

We can also use R to handle the task of building models. This is done using the stepAIC command in the MASS package. The arguments can be kind of confusing at first.

First, we fit the response variable without any predictors. This model is saved as “simplemod” below. Then we give stepAIC the necessary arguments. The first argument is that model without predictors that we just created. This is used as a starting point.

Then we specify direction. Direction can be “forward” for forward selection, “backward” for backwards elimination, or “both” for a combination of the two. The code below will use forward selection.

Finally, we need to define the scope. Scope determines the range of models, and upper is used to denote the most complicated model possible with our dataset. In this case, it’s the multiple linear regression model with all of our predictors.

library(MASS)
simplemod <- lm(IdealMateHeight ~ 1)
stepAIC (simplemod, direction = "forward", scope = list(upper = IdealMateHeight ~ Age + Sex + Height))
## Start:  AIC=356.86
## IdealMateHeight ~ 1
## 
##          Df Sum of Sq     RSS    AIC
## + Sex     1    994.84  898.47 258.98
## + Height  1    109.33 1783.99 350.89
## <none>                1893.31 356.86
## + Age     1      0.13 1893.18 358.86
## 
## Step:  AIC=258.98
## IdealMateHeight ~ Sex
## 
##          Df Sum of Sq    RSS    AIC
## + Height  1    302.88 595.59 205.89
## <none>                898.47 258.98
## + Age     1      0.00 898.47 260.98
## 
## Step:  AIC=205.89
## IdealMateHeight ~ Sex + Height
## 
##        Df Sum of Sq    RSS    AIC
## <none>              595.59 205.89
## + Age   1   0.43821 595.15 207.79
## 
## Call:
## lm(formula = IdealMateHeight ~ Sex + Height)
## 
## Coefficients:
## (Intercept)         SexM       Height  
##      39.246       -8.538        0.493

If we run this code, R will go through and begin to compare different possibilities using AIC. It starts with that intial model, with zero predictors, then considers models with only one of the three predictors. R calculates the AIC of each possibility, and compares them. The different possibilities are listed with the lowest AIC first.

We can see in the first step the model without any predictors has an AIC of about 357, with the lowest AIC being the model with only sex added. Sex is added into the model, and R proceeds to look at the remaining predictors.

The AIC of the model with only sex as the predictor is about 259, but we can see that adding height results in an AIC drop of more than 50. Height is then added, and R considers the last predictor, age. Adding age will increase the AIC, so R recognizes it’s not useful, and gives the coefficients and predictors in the last chunk of output.

Manually doing the stepwise process of building models is really helpful in understanding how the code works and interpreting the output.

Building Models: All Subsets Regression

Another way we could build models is to use a process called all subsets regression. This involves fitting a regression model for every possible combination of predictors and choosing the best one by comparing the AICs. In the OK Cupid dataset, our combinations of predictors are:

  1. Height

  2. Sex

  3. Age

  4. Height and Sex

  5. Height and Age

  6. Age and Sex

  7. Age, Height and Sex

A model is said to be nested if all of its predictors are also in another model. We can use a t-test, F-test, AIC, or Mallows’ CP in the case of nested models. In this example, combinations 1 through 6 are all nested in combination 7. Generally we will use a t-test or F-test whenever possible.

If a model is not nested, then we are restricted to AIC or Mallows’ CP. This makes sense because when we did the t-tests and partial F-tests, we were trying to see if we could drop, instead of swap, one or more existing predictors. In this example, if we were comparing combination 4 (Height and Sex) to combination 5 (Height and Age), we can only use AIC or Mallows’ CP. Only height is shared as a predictor, so neither model is nested in this case.

Let’s fit all these combinations and find the AICs. I will use the extractAIC function to calculate the AICs. This function outputs two values. The AIC is the second one. The first value is equivalent degrees of freedom for the fitted model.

mod0 <- lm(IdealMateHeight ~ Height)
extractAIC(mod0)
## [1]   2.0000 350.8948

AIC for height only = 350.8948

mod1 <- lm(IdealMateHeight ~ Sex)
extractAIC(mod1)
## [1]   2.0000 258.9822

AIC for Sex only = 258.9822

mod2 <- lm(IdealMateHeight ~ Age)
extractAIC(mod2)
## [1]   2.0000 358.8554

AIC for age only = 358.8554

mod3 <- lm(IdealMateHeight ~ Height + Sex)
extractAIC(mod3)
## [1]   3.0000 205.8889

AIC for height and sex = 205.8889

mod4 <- lm(IdealMateHeight ~ Height + Age)
extractAIC(mod4)
## [1]   3.0000 352.8728

AIC for height and age = 352.8728

mod5 <- lm(IdealMateHeight ~ Sex + Age)
extractAIC(mod5)
## [1]   3.0000 260.9821

AIC for sex and age = 260.9821

mod6 <- lm(IdealMateHeight ~ Sex + Age + Height)
extractAIC(mod6)
## [1]   4.0000 207.7903

AIC for all three predictors = 207.7903

Our lowest AIC is for the model with height and sex as the predictors. This is the same model we found using the stepwise methods.

Even with three predictors, all subsets regression requires a lot of work. All subsets regression is not very practical for datasets with many predictors.

Conclusion

We already covered how to do t-tests and F-tests in cases where we had a model and were considering dropping one or more predictors. This was essentially comparing the complete model to another model with one or more predictors excluded and seeing which one is better. AIC and Mallows’ CP are two new ways to compare models.

Forward selection and backwards elimination are both fairly intuitive, changing the model one predictor at a time and looking at the p-values. It’s also possible to use AIC or Mallows’ CP when doing the stepwise methods. Using different methods or means of comparison can sometimes result in a slightly different model, so it’s very important to clearly state the methodology used.