We covered the rest of section 5.1 today (model building). Describe the different methods of model building and the different criteria (F-test, AIC, etc). Then build models with the water and/or OKCupid data.
In Tuesday’s class, we introduced and discussed the process of building regression models and the different ways we can compare models to determine which one predicts our response variable with greater accuracy. In this learning log, we will explore forward stepwise selection, backward elimination, and all subsets as model building techniques. From there, we will discuss how we can compare models using t-tests, partial F-tests, AIC, and Mallow’s Cp to assess which model is the most accurate. After applying these methods to the water dataset, we will conclude with some final thoughts regarding model building and selection.
The following model building methods are relatively straightforward and are usable for quantitative and categorical variables, as well as interaction terms.
The forward stepwise regression method is a process of starting with a small set of predictor variables and then methodically adding (and potentially removing) predictors until an optimal model is reached. Let’s dive in.
Probably the easiest way to begin forward stepwise selection is by starting with a simple model of an intercept and the strongest (and maybe second strongest) predictor. When referring to the “strongest” predictors, I simply mean the most significant predictors, or those with the smallest p-values if you were to fit an SLR model for each predictor on its own.
Now that you’ve got a baseline model, we can move forward with adding more strong predictors. Adding only one predictor at a time, add a candidate to your baseline model and check whether the model has improved. Checking for improvement can be done in many ways (t-test, F-test, AIC), which we will explore in depth in the next section. Generally, you will need to establish an entry threshold: a significance value that new predictors must clear to be added to the model. It is most efficient to test candidates in order of strength so that you avoid testing unnecessary predictors before the significant ones. If a new predictor is added to the model, the current predictors should also be re-tested for significance, and thus an exit threshold should be established as well. If a current predictor exceeds the exit threshold, meaning it is no longer significant to the model, it should be removed.
This process is complete when no remaining candidate is significant enough to enter the model, and every predictor already in the model remains significant with respect to the entry and exit thresholds.
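To make the recipe concrete, here is a minimal sketch of a single forward step in R using the base add1() function; dat, y, and the x predictors are placeholder names, not data from class.
base_mod <- lm(y ~ x1, data = dat)
# add1() refits the model with each remaining candidate added one at a time
# and reports a partial F-test for each addition
add1(base_mod, scope = ~ x1 + x2 + x3, test = "F")
You would then add the candidate whose p-value clears the entry threshold and re-check the predictors already in the model against the exit threshold.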
The backward elimination method is just what you would expect: the opposite of forward stepwise selection. This method has you start with a large group of predictors and then eliminate nonessential, insignificant predictors one at a time.
Start with a large group of predictors in your model. How large? Depending on the number of available predictors and other practical constraints, you should start with every predictor that seems at least semi-reasonable for the model. As with forward stepwise selection, we should order the predictors in the model, except this time from least to greatest significance.
Starting with the least significant predictor, you will again need to establish an exit threshold of significance and compare this first predictor against it. If the predictor is less significant than the threshold, it should be removed from the model. If a predictor is removed, the remaining predictors should be reordered from least to greatest significance.
This process continues until no predictors in the model are less significant than the exit threshold.
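Here is the mirror-image sketch for a single backward step, again with placeholder names; drop1() in base R tests each current predictor for removal.
full_mod <- lm(y ~ x1 + x2 + x3, data = dat)
# drop1() refits the model with each predictor removed one at a time and
# reports a partial F-test for each removal
drop1(full_mod, test = "F")
The predictor with the largest p-value above the exit threshold gets dropped, the model is refit, and the process repeats until everything left clears the threshold.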
Our third and final method is all subsets, which is a very appropriate title. For this method, a different model is built for each and every possible combination of the listed predictors. So if we have x1, x2, and x3 as our predictors, the following models will be built under the all subsets method: x1; x2; x3; x1 + x2; x1 + x3; x2 + x3; and x1 + x2 + x3.
Each of these models is then compared against the others, and the model with the greatest accuracy in predicting the response is selected. As you can probably guess, this technique can become very taxing as the number of predictors increases.
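If you would rather not fit every subset by hand, the regsubsets() function from the leaps package automates this; here is a quick sketch with placeholder names (we did not use leaps in class):
library(leaps)
# regsubsets() fits every subset of the candidate predictors and keeps the
# best model of each size; summary() reports comparison criteria for each
all_subs <- regsubsets(y ~ x1 + x2 + x3, data = dat, nvmax = 3)
summary(all_subs)$cp    # Mallow's Cp for the best model of each size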
Now that we are familiar with a few ways to develop a candidate model, we will discuss the methods for comparing these models so that we end up with the strongest one. We will be using two additional terms in the following section: nested and non-nested models. Nested models refer to a pair in which one model contains all of the predictors of the other plus additional predictors; said another way, the smaller model is composed of part of the larger model. Intuitively, non-nested models are models in which the less complicated model cannot be composed of part of the more complicated model; there are predictors in the less complicated model that are not in the more complicated model. Understanding the distinction between nested and non-nested models will be important as we move forward.
Our good friend the t-test can help us build models, just like it can help us with pretty much anything related to statistics. When we are looking to remove or add a predictor, we can test the significance of that predictor with a t-test at a given alpha level. This significance level serves as our threshold for entry to or exit from the model in the techniques mentioned earlier. Because a t-test evaluates a single predictor being added to or removed from the current model, it can only be used to compare nested models.
In most MLR cases where we have t-tests, we can also use partial F-tests. Similar to t-tests, we can use the partial F-test to see whether the predictors that we wish to add to or remove from the model are significant enough to enter or exit at a given significance level. By comparing a specific model against the same model with one or more predictors added or removed, the partial F-test tells us whether those predictors clear the entry hurdle or slide under the exit threshold. Again, the models being compared in the partial F-test must be nested, just as for the t-test.
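In R, the partial F-test is just a call to anova() with the two nested fits; below is a minimal sketch with placeholder models (a real example with the water data appears later in this log).
reduced_mod  <- lm(y ~ x1, data = dat)
complete_mod <- lm(y ~ x1 + x2 + x3, data = dat)
anova(reduced_mod, complete_mod)   # the F-test judges whether x2 and x3 are worth adding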
In the east corner, we have a newer measure entering the MLR ring: AIC. AIC stands for the Akaike Information Criterion and can be calculated with the following formula:
\[ AIC = 2(k+1) - 2(\text{log-likelihood}) \].
On the right of the equals sign, the first term penalizes the number of parameters (the k predictors plus the intercept), while the second term rewards the fit of the model through the log-likelihood. Thus, a smaller AIC value corresponds to a better balance of fit and complexity, and that is what we want.
Another nifty characteristic of AIC is that we are not limited to comparing nested models. Since AIC is simply a measure of a specific model’s penalized fit, we can directly compare any two models fit to the same data.
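In R, the AIC() function returns this value for any fitted model, so comparing two candidates, nested or not, is one line each; mod_a and mod_b below are placeholder fits. Note that AIC() counts the error variance as an extra parameter, so its value differs from the hand formula above by a constant, which does not affect comparisons.
AIC(mod_a)   # smaller AIC indicates the better trade-off of fit and complexity
AIC(mod_b)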
The next newcomer to the statistical significance comparison scene is Mallow’s Cp or the Cp statistic. Mallow’s Cp can be calculated with the following equation:
\[ C_p = \dfrac {SSE_k}{s^2_p} - [n-2(k+1)] \].
The SSE_k in the numerator is the sum of squared residuals for a sub-model with k predictors, while the s^2_p in the denominator is the estimated error variance (the MSE) from the full model with all p predictors (which accounts for the p in Mallow’s Cp). We want Cp to be relatively small; a value of roughly k + 1 is ideal.
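Here is a rough sketch of computing Cp by hand in R, assuming sub_mod (k predictors) is nested inside full_mod (all p predictors); both are placeholder objects, not models from class.
n    <- nrow(model.frame(full_mod))
k    <- length(coef(sub_mod)) - 1      # predictors in the sub-model
SSEk <- sum(resid(sub_mod)^2)          # numerator: SSE of the sub-model
s2p  <- summary(full_mod)$sigma^2      # denominator: MSE of the full model
SSEk / s2p - (n - 2 * (k + 1))         # want this close to k + 1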
We have previously used measures like adjusted R-squared and the residual standard error when comparing models; however, we will try to stay away from those here since they can be influenced by multicollinearity. The other measures mentioned above are more fun anyway.
Now that the conceptual backdrop has been laid for model building, let’s run through an example using the water data.
library(alr3)
## Warning: package 'alr3' was built under R version 3.3.3
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
data(water)
attach(water)
head(water)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
Using the water data, we’ll run through a couple of model building scenarios, starting with forward stepwise regression using t-tests. We will also note at the outset that our significance threshold for entry and exit will be alpha = 0.05.
Without showing you all of the code (a sketch of one way to compute these follows the list), I’ll list the predictors in order of increasing p-value from their own SLR models with BSAAM as the response:
OPRC: 0.0, OPSLAKE: 0.0, OPBPC: 0.0, APSLAKE: 0.107, APMAM: 0.123, APSAB: 0.239
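For the curious, here is a sketch of the kind of loop that produces those p-values (the code I actually ran may have looked a little different):
predictors <- c("APMAM", "APSAB", "APSLAKE", "OPBPC", "OPRC", "OPSLAKE")
sapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "BSAAM"), data = water)
  summary(fit)$coefficients[2, 4]   # p-value for the slope
})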
Given these results, we’ll start our first simple model with the predictor OPRC and the y-intercept.
Mod1 <- lm(BSAAM ~ OPRC)
summary(Mod1)
##
## Call:
## lm(formula = BSAAM ~ OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741.4 4044.1 5.376 3.32e-06 ***
## OPRC 4667.3 311.3 14.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.842
## F-statistic: 224.7 on 1 and 41 DF, p-value: < 2.2e-16
Since OPRC, the only predictor currently in the model, does not exceed the significance exit threshold, we’ll add the next predictor in order, OPSLAKE.
Mod2 <- lm(BSAAM ~ OPSLAKE + OPRC)
summary(Mod2)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
Via t-test, we see that OPSLAKE has a very small p-value of 2.46e-05, which is below our entry hurdle, so OPSLAKE stays in our model. We also check to see whether OPRC has exceeded the exit significance threshold and should be tossed from the model. Its p-value of 0.0057 is still well under 0.05, so it hasn’t, and we can continue building our model by testing the next predictor, OPBPC.
For this next step, we will use a partial F-test, since we are still comparing nested models.
Modc <- lm(BSAAM ~ OPRC + OPSLAKE + OPBPC)
Modr <- lm(BSAAM ~ OPRC + OPSLAKE)
anova(Modc,Modr)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPRC + OPSLAKE + OPBPC
## Model 2: BSAAM ~ OPRC + OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 39 2689509185
## 2 40 2689959758 -1 -450573 0.0065 0.936
We see that the p-value from our partial F-test is very high at 0.936, far above our significance level of 0.05, so OPBPC fails to clear the entry threshold. We conclude that the OPBPC predictor would not be significant in our model.
From here, we could continue to check predictors, but I already have, and I know that this is as good as it gets. We’ll now show an example of backward elimination using AIC to see if we conclude with a similar model.
For backward elimination, we will start with a full model of all of our predictors (yes, even the really junky ones). We’ll demonstrate our last comparison metric, AIC, using the stepAIC command from the MASS package. This command essentially runs the model building process for us using AIC as the metric.
As you see below, we start with the full model and eliminate predictors one at a time whenever doing so lowers the AIC. The stepAIC command takes four steps to find the optimal model, with OPRC, OPSLAKE, and APSLAKE as the best predictors and the lowest AIC. I love it when R does all of the work for you.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
Mod.Full <- lm(BSAAM ~ OPRC + OPSLAKE + OPBPC + APSLAKE + APSAB + APMAM)
stepAIC(Mod.Full, direction="backward")
## Start: AIC=774.36
## BSAAM ~ OPRC + OPSLAKE + OPBPC + APSLAKE + APSAB + APMAM
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ OPRC + OPSLAKE + OPBPC + APSLAKE + APSAB
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ OPRC + OPSLAKE + APSLAKE + APSAB
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ OPRC + OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE + APSLAKE)
##
## Coefficients:
## (Intercept) OPRC OPSLAKE APSLAKE
## 15425 1797 2390 1712
Now that we’ve gone through the model building process, we will conclude with some final thoughts and insights for you to consider.
Carefully choose your entry and exit significance thresholds. For t-tests and partial F-tests, this should be a p-value; for AIC and Mallow’s Cp, it should be a value for the difference between the models. Usually the entry and exit values are equal, but hypothetically they do not need to be. Also, a common practice for AIC is to look for a difference of 10.
Be wary that different methods can produce different optimal models. This is not uncommon, but it does suggest that it can be worthwhile to check multiple methods and see how the results vary.