We have two main goals for today. The first is to practice working with LSLR (least squares linear regression) in matrix format. This is going to be important for the next few techniques we will be learning!
Our second goal is to think about feature selection. In statistical learning, we are often in a scenario in which we know the response variable, and we know the possible features, but we don't know which features we should use to train a model to predict the response variable. What combination gives us the highest predictive accuracy? Today, we are going to explore one technique that can be used to help us in these scenarios, which is called best subset selection. We will learn alternatives to this as we move through our next few classes.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the training data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.
source("http://www.openintro.org/stat/data/cdc.R")
We have 9 variables in our training data.
genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor.
exerany: indicates whether the respondent exercised in the past month (1) or did not (0).
hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0).
smoke100: indicates whether the respondent had smoked at least 100 cigarettes in their lifetime.
height: in inches.
weight: in pounds.
wtdesire: desired weight in pounds.
age: in years.
gender: biological sex, limited to male/female.

We have a client who wants to build a model to predict Y = how much weight a person wants to lose/gain. This is important to study as it can reflect current societal health trends. Perceptions of a "target" weight can also lead to mental health struggles, lifestyle choices, changes in exercise or other life habits, and so on.
Okay, so now we know the goal. Can we get started? Well, not quite. If you look at our data set, you will notice that we don't have a variable in the data set that measures desired weight change. In other words, the response variable of interest is not in the data set.
This is actually fairly common in statistical learning - we don't have a particular variable, but we do have the things we need to build it. To create the variable we need, we use the following:
cdc$wtchange <- cdc$weight - cdc$wtdesire
What does this code do? Well, it creates a new variable called wtchange in our data set. This represents how many pounds in weight a person would like to change. A positive number reflects a number of pounds a person would like to lose, and a negative number reflects a number of pounds a person would like to gain. This will be our response variable Y for our analysis today.
The last thing we need to do before starting is to make sure our feature classes are correct. In other words, does R think our numeric features are numbers, and our categorical features are categorical? Check, and change the classes if needed! For example,
cdc$smoke100 <- as.factor(cdc$smoke100)
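If you want to check all of the classes at once, a minimal sketch along these lines should work (assuming the variable names listed above; genhlth and gender are often already stored as factors, so confirm with the output of str()):

# Check how R is currently storing each variable
str(cdc)

# Convert the remaining categorical features to factors if R read them in as numbers
cdc$exerany <- as.factor(cdc$exerany)
cdc$hlthplan <- as.factor(cdc$hlthplan)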
We have a client who first wants to explore the relationship between X = actual weight (in pounds) and Y = desired weight change (in pounds). They have asked us to consider LSLR as an approach for modeling these data.
Now that we have checked to see whether LSLR seems reasonable, let's build the model! We are going to build the model in matrix form today.
Now that we have the form of the model, it is time to train the model. This means we need to estimate our intercept and our slope. The first step is to create the design matrix.
Create the design matrix and call it XD. Show the first few rows of your matrix (head(XD)).

Once we have the design matrix, we are able to solve for the estimates \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) that minimize the residual sum of squares for the LSLR model. We have learned that we can actually do this ourselves using matrix algebra!
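As a minimal sketch of what this looks like in R (using the XD name from above, the wtchange variable we created, and the usual normal-equations solution \( \hat{\beta} = (X_D^T X_D)^{-1} X_D^T Y \)):

# Design matrix: a column of 1s for the intercept and a column for weight
XD <- cbind(1, cdc$weight)
colnames(XD) <- c("Intercept", "weight")
head(XD)

# Response vector
Y <- cdc$wtchange

# Least squares estimates: (X^T X)^(-1) X^T Y
betaHat <- solve(t(XD) %*% XD) %*% t(XD) %*% Y
betaHat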
We will be getting more and more mathematical notation like this in class, so it is helpful to practice typing it up!
For this question, use matrix algebra rather than lm.

In the last section, we built an LSLR model using Y = weight change and X = current weight. However, there are other possible features in this data set. Can we improve our ability to predict weight change by using more of them? Let's find out!
There are several possible features in the data. We could inspect them all by hand, but it would take time, and if we had many more features, it would become prohibitive. It would be even more daunting to try to explore different model combinations! Fortunately, there are automatic variable selection techniques which try to choose the "best" model from the available variables. We are going to explore one, best subset selection (BSS), today.
Variable selection is a good choice when:
(1) You have a lot of variables, and you are interested in which are the "most important/useful" for explaining the response.
(2) You have a lot of variables, and you want to quickly build a model that does a good job predicting the response.
BSS is what we call an exhaustive search approach. This means that the method will literally consider every possible model that can be fit with the features you provide. The "you provide" part is important; BSS will not, for instance, consider any interactions or transformations that you have not specifically asked it to.
Variable selection consists of two parts: (Stage 1) a way of searching through different models and (Stage 2) a way of measuring model performance.
Stage 1 is the exhaustive search part of BSS. In Stage 1, we fit every possible LSLR model made with combinations of the given features. We then work to choose the "best" (with respect to some metric) model with one feature, the "best" with two, the "best" with three, etc. This means if we have 10 features, we end Stage 1 with 10 LSLR models.
In Stage 2, we compare all of these "best" models to try and find the one that optimizes some metric like the AIC, BIC, etc.
Let's try this in R.
Let's start with Stage 1, which has a few steps. In Stage 1 (Step 1), BSS will train every model of the form \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), using each of the possible features in the data as the \( X_i \). In other words, we will have:
\( (1) WeightChange_i = \beta_0 + \beta_1 Age_i + \epsilon_i\)
\( (2) WeightChange_i = \beta_0 + \beta_1 Height_i + \epsilon_i\)
\( (3) WeightChange_i = \beta_0 + \beta_1 (Smoke100 = Yes)_i + \epsilon_i\)
And so on!!
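To make Stage 1 (Step 1) concrete, here is a small sketch that trains two of these one-feature models directly and pulls out their \(R^2\) values; the BSS code we use later will automate this for every feature:

# Two of the one-feature candidate models, fit by hand
m_age <- lm(wtchange ~ age, data = cdc)
m_height <- lm(wtchange ~ height, data = cdc)

# Compare their R^2 values; Stage 1 (Step 1) keeps whichever is largest
summary(m_age)$r.squared
summary(m_height)$r.squared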
Let's see if we can figure out just how many models we would need in total to complete Stage 1 Step 1. In other words, how many models with exactly one feature can we build?
We have 10 variables in our training data, but one of them is the response variable (wtchange) and one of them is desired weight (wtdesire), which we need to remove from the data before running BSS.
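One minimal way to drop that column (assuming you created wtchange as above):

# Remove wtdesire so BSS cannot use it as a feature
cdc$wtdesire <- NULL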
Why do we need to remove wtdesire from the data before running BSS? Make sure you remove that column from the data before proceeding.

At this point, we have multiple models, all of the form \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), each with a different feature. To conclude Stage 1 (Step 1), we want to select exactly one of these models. To do this, we are going to need to compute a metric that allows us to compare the models and choose one.
The \(R^2\) is most commonly used in LSLR for deciding among our models. You can use other metrics, but the code we are going to use today uses \(R^2\) by default.
To conclude Stage 1 (Step 1), we compute the \(R^2\) on all of the LSLR models we have trained so far. All of these models have exactly one feature, so we don't need a metric that penalizes for different numbers of predictors. We then determine which model has the highest \(R^2\). This is the model we will select out of all the models we have trained so far.
To conclude Stage 1 (Step 1), the computer will store this LSLR model with the highest \(R^2\). We will use this model as part of Stage 2 of BSS. We have now concluded Stage 1 (Step 1)!
Now we are ready for Stage 1 (Step 2). In this step, BSS will train every model of the form \( Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \epsilon_i\), using each of the possible combinations of X and Z in our data. In other words, we will have:
\( (1) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 Weight_i + \epsilon_i\)
\( (2) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 Height_i + \epsilon_i\)
\( (3) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 (Smoke100 = Yes)_i + \epsilon_i\)
And so on!!
We continue this process with Stage 1 (Step 3), Stage 1 (Step 4), etc., until we train the LSLR model using all possible features, which is called the full model.
ModelFull <- lm(wtchange ~ ., data = cdc)

When we conclude Stage 1 of BSS, we have (1) the model with one coefficient (and the intercept) and the highest \(R^2\), (2) the model with two coefficients (and the intercept) and the highest \(R^2\), and so on, all the way to the full model.
Let's see the results of this process in R.
The first step in running BSS in R is to load the library we need.
library(leaps)
This library contains all the code we will need to run BSS. Once we have it loaded, the command we need to run Stage 1 of BSS is regsubsets. For example, suppose I wanted to run BSS on three variables X1, X2, and X3. The code is then:
BSSout <- regsubsets( Y ~ X1 + X2 + X3, data = cdc, nvmax = 3, method="exhaustive")
What does the nvmax argument do? It tells R the maximum number of beta terms we want to allow in our model, not including the intercept. So, right now we consider models with 1, 2, or 3 predictors.
Let's try running Stage 1 with just the categorical predictors. Why? Because we are learning right now, and it helps to have a smaller subset to start with. You don't have to do this first in practice.
Look at only the categorical features in the data. Using these categorical features only, run the first stage of BSS and call the output BSScat. Remember to change the nvmax part of the code!!! You will notice that nothing seems to happen, as the output has been stored. Let's look at the \(R_{adj}^2\) value of each of these models by using the code summary(BSScat)$adjr2. What is the \(R_{adj}^2\) of the model fit with one feature? (This is the first value). With two features? (This is the second value).
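If you want to check your setup, a sketch along these lines should work; here we assume the categorical features are genhlth, exerany, hlthplan, smoke100, and gender, and since genhlth contributes four indicator terms, nvmax is set to 8:

# Stage 1 of BSS using only the categorical features
BSScat <- regsubsets(wtchange ~ genhlth + exerany + hlthplan + smoke100 + gender,
                     data = cdc, nvmax = 8, method = "exhaustive")

# Adjusted R^2 for the best one-term model, two-term model, and so on
summary(BSScat)$adjr2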
Okay, so now we have run Stage 1, and we have a whole lot of models. Our task now is to try to decide among these models. To do this, we need to choose a penalized metric that will help us compare the models. For now, let's use \(R_{adj}^2\).
Create a plot to help us see the values of \(R_{adj}^2\) for all our models from Stage 1. To do this, you can use the code plot(BSScat, scale = "adjr2"). Note: You can also use plot(BSScat, scale = "Cp") (which uses Mallows' Cp) or plot(BSScat, scale = "bic"), if you want to use other metrics in the future.
This plot we have created is unusual, as we read it in rows, not columns. For instance, look at the lowest row in the table. You will see the value of \(R_{adj}^2\) = .024. You will also see that in that row there are two colored boxes: Intercept, and genderf. This means that the model
$$WeightChange_i = \beta_0 + \beta_1 GenderF_i + \epsilon_i$$ has an \(R_{adj}^2\) of .024.
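If the plot is hard to read, you can also inspect the same information directly; each row of the matrix below corresponds to one of the models kept from Stage 1 (ordered by model size), and TRUE marks the terms that model includes:

# Which terms appear in each of the models kept from Stage 1
summary(BSScat)$which

# Adjusted R^2 values for those same models, in the same order
summary(BSScat)$adjr2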
What is the \(R_{adj}^2\) of the model with health Good, health Fair, gender female and the intercept?
Now, you will notice that some \(R_{adj}^2\) values seem to be the same. They are likely not exactly the same when you look at more than two decimal places. To see this, run summary(BSScat)$adjr2. These are the exact values, rather than the approximations shown in the visual. Even tiny differences in \(R_{adj}^2\) will still affect the ordering in the table, though.
Now that we have explored this a little, let's actually run BSS for all of our features.
Use all the possible feature variables (categorical and numeric), run the first stage of BSS, and call the output BSSall. Then, use the code plot(BSSall, scale = "adjr2") to plot the results.
Which features are used in the model with the lowest value of \(R_{adj}^2\), and what is the value of \(R_{adj}^2\) for that model?
Which features are used in the model with an \(R_{adj}^2\) of 0.6955846? To figure this out, look at the values for each model. Figure out which of the models (1, 2, 3, etc.) has the desired value. Then, run summary(BSSall)$which[HERE,], where HERE is replaced with the number of the model you want.
Now the question is...which model do we choose?
This is where we have to make some decisions. Do we just use the model with the highest \(R_{adj}^2\)? Usually not. Some of the models have \(R_{adj}^2\) values that are very, very close to that. This may mean that we do not feel a tiny gain in \(R_{adj}^2\) is worth adding in a new predictor. Or, we may be focused on prediction, where even small changes in the metric may represent an improvement. That is up to us to decide.
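Whatever metric you settle on, a sketch like the following pulls the model with the single highest adjusted \(R^2\) out of the Stage 1 results, along with its coefficient estimates (coef() takes the id of the model you want):

# Index of the model with the largest adjusted R^2
best_id <- which.max(summary(BSSall)$adjr2)

# Terms included in that model, and its fitted coefficients
summary(BSSall)$which[best_id, ]
coef(BSSall, id = best_id)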
Based on the results, which features would you choose to use? Explain. There is more than one correct answer here, so make sure you justify your reasoning.
In this lab, we have explored Best Subset Selection, a process we can use to help us determine which features we might want to include in a regression model. This method is convenient; however, it does have its downfalls. BSS does not consider possible correlations among predictors, nor can it consider the need for possible transformations. Next time, we are going to try three more techniques for variable selection that will take some of these into account.