We have two main goals for today. The first is to practice working with LSLR (least squares linear regression) in matrix format. This is going to be important for the next few techniques we will be learning!
Our second goal is to think about feature selection. In statistical learning, we are often in a scenario in which we know the response variable, and we know the possible features, but we don't know which features we should use to train a model to predict the response variable. What combination gives us the highest predictive accuracy? Today, we are going to explore one technique that can be used to help us in these scenarios, which is called best subset selection. We will learn alternatives to this as we move through our next few classes.
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.
We will focus on a random sample of 20,000 people from the BRFSS survey. While there are over 200 variables in this data set, we will work with a small subset.
We begin by loading the training data set of 20,000 observations with the following command. NOTE: This data set takes a second to load; be patient.
source("http://www.openintro.org/stat/data/cdc.R")
We have 9 variables in our training data.
genhlth: respondents were asked to evaluate their general health, responding either excellent, very good, good, fair, or poor.
exerany: indicates whether the respondent exercised in the past month (1) or did not (0).
hlthplan: indicates whether the respondent had some form of health coverage (1) or did not (0).
smoke100: indicates whether the respondent had smoked at least 100 cigarettes in their lifetime.
height: in inches.
weight: in pounds.
wtdesire: desired weight in pounds.
age: in years.
gender: biological sex, limited to male/female.

We have a client who wants to build a model to predict Y = how much weight a person wants to lose/gain. This is important to study as it can reflect current societal health trends. Perceptions of a "target" weight can also lead to mental health struggles, lifestyle choices, changes in exercise or other life habits, and so on.
Okay, so now we know the goal. Can we get started? Well, not quite. If you look at our data set, you will notice that we don't have a variable in the data set that measures desired weight change. In other words, the response variable of interest is not in the data set.
This is actually fairly common in statistical learning - we don't have a particular variable, but we do have the things we need to build it. To create the variable we need, we use the following:
cdc$wtchange <- cdc$weight - cdc$wtdesire
What does this code do? Well, it creates a new variable called wtchange in our data set. This represents how many pounds in weight a person would like to change. A positive number reflects a number of pounds a person would like to lose, and a negative number reflects a number of pounds a person would like to gain. This will be our response variable Y for our analysis today.
The last thing we need to do before starting is to make sure our feature classes are correct. In other words, does R think our numeric features are numbers, and our categorical features are categorical? Check, and change the classes if needed! For example,
cdc$smoke100 <- as.factor(cdc$smoke100)
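If you want to check all of the classes at once, a minimal sketch along these lines should work (assuming the variable names listed above; genhlth and gender are often already stored as factors, so confirm with the output of str()):

# Check how R is currently storing each variable
str(cdc)

# Convert the remaining categorical features to factors if R read them in as numbers
cdc$exerany <- as.factor(cdc$exerany)
cdc$hlthplan <- as.factor(cdc$hlthplan)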
We have a client who first wants to explore the relationship between X = actual weight (in pounds) and Y = desired weight change (in pounds). They have asked us to consider LSLR as an approach for modeling these data.
Now that we have checked to see whether LSLR seems reasonable, let's build the model! We are going to build the model in matrix form today.
Now that we have the form of the model, it is time to train the model. This means we need to estimate our intercept and our slope. The first step is to create the design matrix.
Create the design matrix and call it XD. Show the first few rows of your matrix (head(XD)).

Once we have the design matrix, we are able to solve for the estimates \( \hat{\beta}_0 \) and \( \hat{\beta}_1 \) that minimize the residual sum of squares for the LSLR model. We have learned that we can actually do this ourselves using matrix algebra!
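As a minimal sketch of what this looks like in R (using the XD name from above, the wtchange variable we created, and the usual normal-equations solution \( \hat{\beta} = (X_D^T X_D)^{-1} X_D^T Y \)):

# Design matrix: a column of 1s for the intercept and a column for weight
XD <- cbind(1, cdc$weight)
colnames(XD) <- c("Intercept", "weight")
head(XD)

# Response vector
Y <- cdc$wtchange

# Least squares estimates: (X^T X)^(-1) X^T Y
betaHat <- solve(t(XD) %*% XD) %*% t(XD) %*% Y
betaHat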
We will be getting more and more mathematical notation like this in class, so it is helpful to practice typing it up!
For this question, use matrix algebra rather than lm.

In the last section, we built an LSLR model using Y = weight change and X = current weight. However, there are other possible features in this data set. Can we improve our ability to predict weight change by using more of them? Let's find out!
There are several possible features in the data. We could inspect them all by hand, but it would take time, and if we had many more features, it would become prohibitive. It would be even more daunting to try to explore different model combinations! Fortunately, there are automatic variable selection techniques which try to choose the "best" model from the available variables. We are going to explore one, best subset selection (BSS), today.
Variable selection is a good choice when:
(1) You have a lot of variables, and you are interested in which are the "most important/useful" for explaining the response.
(2) You have a lot of variables, and you want to quickly build a model that does a good job predicting the response.
BSS is what we call an exhaustive search approach. This means that the method will literally consider every possible model that can be fit with the features you provide. The "you provide" part is important; BSS will not, for instance, consider any interactions or transformations that you have not specifically asked it to.
Variable selection consists of two parts: (Stage 1) a way of searching through different models and (Stage 2) a way of measuring model performance.
Stage 1 is the exhaustive search part of BSS. In Stage 1, we fit every possible LSLR model made with combinations of the given features. We then work to choose the "best" (with respect to some metric) model with one feature, the "best" with two, the "best" with three, etc. This means if we have 10 features, we end Stage 1 with 10 LSLR models.
In Stage 2, we compare all of these "best" models to try and find the one that optimizes some metric like the AIC, BIC, etc.
Let's try this in R.
Let's start with Stage 1, which has a few steps. In Stage 1 (Step 1), BSS will train every model of the form \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), using each of the possible features in the data as the \( X_i \). In other words, we will have:
\( (1) WeightChange_i = \beta_0 + \beta_1 Age_i + \epsilon_i\)
\( (2) WeightChange_i = \beta_0 + \beta_1 Height_i + \epsilon_i\)
\( (3) WeightChange_i = \beta_0 + \beta_1 (Smoke100 = Yes)_i + \epsilon_i\)
And so on!!
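To make Stage 1 (Step 1) concrete, here is a small sketch that trains two of these one-feature models directly and pulls out their \(R^2\) values; the BSS code we use later will automate this for every feature:

# Two of the one-feature candidate models, fit by hand
m_age <- lm(wtchange ~ age, data = cdc)
m_height <- lm(wtchange ~ height, data = cdc)

# Compare their R^2 values; Stage 1 (Step 1) keeps whichever is largest
summary(m_age)$r.squared
summary(m_height)$r.squared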
Let's see if we can figure out just how many models we would need in total to complete Stage 1 Step 1. In other words, how many models with exactly one feature can we build?
We have 10 variables in our training data, but one of them is the response variable (wtchange) and one of them is desired weight (wtdesire), which we need to remove from the data before running BSS.
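One minimal way to drop that column (assuming you created wtchange as above):

# Remove wtdesire so BSS cannot use it as a feature
cdc$wtdesire <- NULL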
Why do we need to remove wtdesire from the data before running BSS? Make sure you remove that column from the data before proceeding.

At this point, we have multiple models, all of the form \( Y_i = \beta_0 + \beta_1 X_i + \epsilon_i\), each with a different feature. To conclude Stage 1 (Step 1), we want to select exactly one of these models. To do this, we are going to need to compute a metric that allows us to compare the models and choose one.
The \(R^2\) is most commonly used in LSLR for deciding among our models. You can use other metrics, but the code we are going to use today uses \(R^2\) by default.
To conclude Stage 1 (Step 1), we compute the \(R^2\) on all of the LSLR models we have trained so far. All of these models have exactly one feature, so we don't need a metric that penalizes for different numbers of predictors. We then determine which model has the highest \(R^2\). This is the model we will select out of all the models we have trained so far.
To conclude Stage 1 (Step 1), the computer will store this LSLR model with the highest \(R^2\). We will use this model as part of Stage 2 of BSS. We have now concluded Stage 1 (Step 1)!
Now we are ready for Stage 1 (Step 2). In this step, BSS will train every model of the form \( Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \epsilon_i\), using each of the possible combinations of X and Z in our data. In other words, we will have:
\( (1) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 Weight_i + \epsilon_i\)
\( (2) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 Height_i + \epsilon_i\)
\( (3) WeightChange_i = \beta_0 + \beta_1 Age_i + \beta_2 (Smoke100 = Yes)_i + \epsilon_i\)
And so on!!
We continue this process with Stage 1 (Step 3), Stage 1 (Step 4), etc., until we train the LSLR model using all possible features, which is called the full model.
ModelFull <- lm(wtchange ~ ., data = cdc)

When we conclude Stage 1 of BSS, we have (1) the model with one coefficient (and the intercept) and the highest \(R^2\), (2) the model with two coefficients (and the intercept) and the highest \(R^2\), and so on, all the way to the full model.
Let's see the results of this process in R.
The first step in running BSS in R is to load the library we need.
library(leaps)
This library contains all the code we will need to run BSS. Once we have it loaded, the command we need to run Stage 1 of BSS is regsubsets. For example, suppose I wanted to run BSS on three variables X1, X2, and X3. The code is then:
BSSout <- regsubsets( Y ~ X1 + X2 + X3, data = cdc, nvmax = 3, method="exhaustive")
What does the nvmax argument do? It tells R the maximum number of beta terms we want to allow in our model, not including the intercept. So, right now we consider models with 1, 2, or 3 predictors.
Let's try running Stage 1 with just the categorical predictors. Why? Because we are learning right now, and it helps to have a smaller subset to start with. You don't have to do this first in practice.
Look at only the categorical features in the data. Using these categorical features only, run the first stage of BSS and call the output BSScat. Remember to change the nvmax part of the code!!! You will notice that nothing seems to happen, as the output has been stored. Let's look at the \(R_{adj}^2\) value of each of these models by using the code summary(BSScat)$adjr2. What is the \(R_{adj}^2\) of the model fit with one feature? (This is the first value). With two features? (This is the second value).
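If you want to check your setup, a sketch along these lines should work; here we assume the categorical features are genhlth, exerany, hlthplan, smoke100, and gender, and since genhlth contributes four indicator terms, nvmax is set to 8:

# Stage 1 of BSS using only the categorical features
BSScat <- regsubsets(wtchange ~ genhlth + exerany + hlthplan + smoke100 + gender,
                     data = cdc, nvmax = 8, method = "exhaustive")

# Adjusted R^2 for the best one-term model, two-term model, and so on
summary(BSScat)$adjr2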
Okay, so now we have run Stage 1, and we have a whole lot of models. Our task now is to try to decide among these models. To do this, we need to choose a penalized metric that will help us compare the models. For now, let's use \(R_{adj}^2\).
Create a plot to help us see the values of \(R_{adj}^2\) for all our models from Stage 1. To do this, you can use the code plot(BSScat, scale = "adjr2"). Note: You can also use plot(BSScat, scale = "Cp") (which uses Mallows' Cp) or plot(BSScat, scale = "bic"), if you want to use other metrics in the future.
This plot we have created is unusual, as we read it in rows, not columns. For instance, look at the lowest row in the table. You will see the value of \(R_{adj}^2\) = .024. You will also see that in that row there are two colored boxes: Intercept, and genderf. This means that the model
$$WeightChange_i = \beta_0 + \beta_1 GenderF_i + \epsilon_i$$ has an \(R_{adj}^2\) of .024.
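If the plot is hard to read, you can also inspect the same information directly; each row of the matrix below corresponds to one of the models kept from Stage 1 (ordered by model size), and TRUE marks the terms that model includes:

# Which terms appear in each of the models kept from Stage 1
summary(BSScat)$which

# Adjusted R^2 values for those same models, in the same order
summary(BSScat)$adjr2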
What is the \(R_{adj}^2\) of the model with health Good, health Fair, gender female and the intercept?
Now, you will notice that some \(R_{adj}^2\) values seem to be the same. They are likely not exactly the same when you look at more than two decimal places. To see this, run summary(BSScat)$adjr2. These are the exact values, rather than the approximations shown in the visual. Even tiny differences in \(R_{adj}^2\) will still affect the ordering in the table, though.
Now that we have explored this a little, let's actually run BSS for all of our features.
Use all the possible feature variables (categorical and numeric), run the first stage of BSS, and call the output BSSall. Then, use the code plot(BSSall, scale = "adjr2") to plot the results.
Which features are used in the model with the lowest value of \(R_{adj}^2\), and what is the value of \(R_{adj}^2\) for that model?
Which features are used in the model with an \(R_{adj}^2\) of 0.6955846? To figure this out, look at the values for each model. Figure out which of the models (1, 2, 3, etc.) has the desired value. Then, run summary(BSSall)$which[HERE,], where HERE is replaced with the number of the model you want.
Now the question is...which model do we choose?
This is where we have to make some decisions. Do we just use the model with the highest \(R_{adj}^2\)? Usually not. Some of the models have \(R_{adj}^2\) values that are very, very close to that. This may mean that we do not feel a tiny gain in \(R_{adj}^2\) is worth adding in a new predictor. Or, we may be focused on prediction, where even small changes in the metric may represent an improvement. That is up to us to decide.
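Whatever metric you settle on, a sketch like the following pulls the model with the single highest adjusted \(R^2\) out of the Stage 1 results, along with its coefficient estimates (coef() takes the id of the model you want):

# Index of the model with the largest adjusted R^2
best_id <- which.max(summary(BSSall)$adjr2)

# Terms included in that model, and its fitted coefficients
summary(BSSall)$which[best_id, ]
coef(BSSall, id = best_id)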
Based on the results, which features would you choose to use? Explain. There is more than one correct answer here, so make sure you justify your reasoning.
In this lab, we have explored Best Subset Selection, a process we can use to help us determine which features we might want to include in a regression model. This method is convenient; however, it does have its downfalls. BSS does not consider possible correlations among predictors, nor can it consider the need for possible transformations. Next time, we are going to try three more techniques for variable selection that will take some of these into account.