STA 363/663 Lab 5 : Best Subset Selection

Goal

In statistical learning, we are often in a scenario in which we know the response variable, and we know the possible features, but we don't know which features we should use to train a model for the response variable. What combination gives us the highest predictive accuracy? A good fit? Today, we are going to explore one technique that can be used to help us in these scenarios, which is called best subset selection.

The Data

We will be using the same data as Lab 4. Recall that this is a random sample of 20,000 people from the BRFSS survey. We begin by loading the training data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.

source("http://www.openintro.org/stat/data/cdc.R")

We have 9 variables in our training data.

genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor.

exerany: indicates whether the respondent exercised in the past month (1) or did not (0).

hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0)

smoke100: indicates whether the respondent had smoked at least 100 cigarettes in their lifetime

height: in inches

weight: in pounds

wtdesire: desired weight in pounds

age: in years

gender: biological sex, limited to male/female.

We are going to make one quick change.

cdc$wtdesire<- cdc$weight -cdc$wtdesire

This means that wtdesire is now represents how many pounds in weight a person would like to change. A positive number reflects a number of pounds a person would like to lose, and a negative number reflects a number of pounds a person would like to gain.

Our response variable is weight.

In the last lab, we built an LSLR model using Y = weight and X = height. However, there are other possible features in this data set. Can we improve our model fit by using more of them?

We could do this by adding in variables one at a time, or in groups, and comparing fit, but this will take a long time. There is another technique we can use to help us choose a model when we have a set of features, but don't know which features we want to use. This technique is called Best Subset Selection, or BSS.

BSS: The Process

Basically, BSS is a computer algorithm that runs in two stages. In Stage 1, we work to choose the "best" (with respect to some metric) model with one feature, the "best" with two, the "best" with three, etc. In Stage 2, we compare all of these "best" models to try and find the one that optimizes some metric like the AIC, BIC, etc.

Let's start with Stage 1, which has a few steps. In Stage 1 (Step 1), BSS will train every model of the form $ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$, using each of the possible features in the data as the X. Let's see if we can figure out just how many models that might be!

We have 9 variables in our training data, but one of them is the response variable (weight).

Based on this, how many models do you think we fit in Stage 1 (Step 1) of BSS? Hint: Think carefully here. Some of the variables are categorical with more than two levels. Each level counts as a possible X.

At this point, we have multiple models, all of the form $ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$. To conclude Stage 1 (Step 1), we want to select exactly one of these models. To do this, we are going to need to compute a metric that allows us to compare the models and choose one.

The $R^2$ is most commonly used in LSLR for deciding among our models. You can use other metrics, but the code we are going to use today uses $R^2$ by default.

To conclude Stage 1 (Step 1), we compute the $R^2$ on all of the LSLR models we have trained so far. We then determine which has the highest $R^2$. This is the model we will select out of all the models we have trained so far.

To conclude Stage 1 (Step 1), the computer will store this LSLR model with the highest $R^2$. We will use this model as part of Stage 2 of BSS. We have now concluded Stage 1 (Step 1)!

Now we are ready for Stage 1 (Step 2). In this step, BSS will train every model with of the form $ Y_i = \beta_0 + \beta_1 X_i + \beta_1 Z_i + \epsilon_i$, using each of the possible combinations of X and Z using our data.

Once we have fit all these models in Stage 1 (Step 2) of BSS, what do you think we do? Hint: Same as Stage 1 (Step 1).

We continue this process with Stage 1 (Step 3), Stage 1 (Step 4), etc., until we train the LSLR model using all possible features, which is called the full model.

How many $\beta$ terms are in this full model? In other words, we end with Stage 1 ( Step what) ? Hint: If you are stuck, just fit the model in R and look. You can do this by putting in all the predictors manually, or using ModelFull <- lm( weight ~ . , data = cdc).

When we conclude Stage 1 of BSS, we have (1) the model with one slope and the highest $R^2$, (2) the model with two slopes and the highest $R^2$, and so on, all the way to the full model.

Let's see the results of this process in R.

BSS in R

The first step in running BSS in R is to load the library we need.

library(leaps)

This library contains all the code we will need to run BSS. Once we have it loaded, the command we need to run the Stage 1 of BSS is regsubsets. For example, suppose I wanted to run BSS on three variables X1, X2, and X3. The code is then:

BSSout <- regsubsets( Y ~ X1 + X2 + X3, data = cdc, nvmax = 3)

What does the nvmax do? It tells R the maximum number of beta terms we want to allow in our model, not including the intercept. So, right now we consider models of 1, 2, or 3 predictors.

Let's try running stage 1 with just the categorical predictors. Why? Because we are learning right now, and it helps to have a smaller subset to start with. You don't have to do this first in practice.

Look at only the categorical features in the data. Using these categorical features only, run the first stage of BSS and call the output BSScat. Remember to change the nvmax part of the code!!! You will notice that nothing seems to happen, as the output has been stored. Let's look at the $R_{adj}^2$ value of each of these models by using the code summary(BSScat)$adjr2. What is the $R_{adj}^2$ of the model fit with one X? (This is the first value). With two Xs? (This is the second value).

BSS Stage 2

Okay, so now we have run Stage 1, and we have a whole lot of models. Our task now is to try to decide among these models. To do this, we need to choose a metric that will help us compare the models.

If our goal is prediction, something like the AIC is a good choice. For interpretation, the BIC or the $R_{adj}^2$ are good choices. There are others, but these are the ones that are built in to the code we are using today.

For the moment, let's pick $R_{adj}^2$, as we are interested in association right now.

Create a plot to help us see the values of the $R_{adj}^2$ for all our models from Stage 1. To do this, you can use code plot(BSScat, scale = "adjr2"). Note: You can also use plot(BSScat, scale = "Cp") (which does AIC) or plot(BSScat, scale = "bic"), if you are wanting to use other metrics in the future.

This plot we have created is unusual, as we read it in rows, not columns. For instance, look at the lowest row in the table. You will see the value of $R_{adj}^2$ = .22. You will also see that in that row there are two colored boxes: Intercept, and genderf. This means that the model

has an $R_{adj}^2$ of .22.

What is the $R_{adj}^2$ of the model with health Good, health Fair, gender female and the intercept?

Now, you will notice that some $R_{adj}^2$ values seem to be the same. They are likely not exactly the same when you look at more than two decimal places. To see this, run summary(BSScat)$adjr2. These are the exact values, rather than the approximations shown in the visual. Even tiny changes in $R_{adj}^2$ will be shown in the ordering in the table, though.

Now that we have explored this a little, let's actually run BSS for all of our features.

Use all the possible feature variables (categorical and numeric) and run the first stage of BSS and call the output BSSall. Then, use the code plot(BSSall, scale = "adjr2") to plot the results.

Which features are used in the model with the lowest value of $R_{adj}^2$, and what is the value of $R_{adj}^2$ for that model?

Which features are used in the model with a $R_{adj}^2$ of .73?

Now the question is...which model do we choose?

This is where we have to make some decisions. Do we just use the model with the highest $R_{adj}^2$ ? Usually not. Some of the models have $R_{adj}^2$ values that are very, very close to that. This may mean that we may feel that a tiny gain in $R_{adj}^2$ is worth adding in a new predictor. Or, we may be predicting and even little changes in the metric may represent an improvement. That is up to us to decide.

Based on the results, which features would you choose to use? Explain. There is more than one correct answer here, so make sure you justify your reasoning.

Conclusion

In this lab, we have explored Best Subset Selection, a process we can use to help us determine which features we might want to include in a regression model. Next time, we are going to try three more techniques that let us do this.

Last updated 2021 March 3.

The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.

The data set set used in this lab is the cdc data set, provided by OpenIntro.