Goal

In our last lab, we practiced best subset selection (BSS), which is a selection technique. This means that by using BSS, we could identify the set of features that optimized some comparison metric, like the adjusted R-squared or the AIC. Today, we are going to focus on techniques that involve shrinkage. These techniques focus on optimizing a metric while still keeping the estimates of our coefficients and their standard errors in a reasonable range.

The Data

We have a client who is interested in predicting graduation rates at different universities and colleges in the United States. Graduation rate refers to the percent of the students who graduate within 4 years. The client provides us with a random sample of 776 colleges and universities in the United States. These data are linked on Canvas. The data contain the response variable graduation rate as well as 18 possible features.
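
To follow along with the code later in the lab, the data first need to be read into R. A minimal sketch is below; the file name College.csv is an assumption, so use whatever name the file linked on Canvas actually has.

    # Hypothetical read-in: adjust the file name/path to match the Canvas download
    collegedata <- read.csv("College.csv")
    dim(collegedata)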

  • X = The name of the university.
  • Private = Yes if the school is a private school, No if it is public
  • Apps = the number of applications received on average per academic year
  • Accept = the number of applications accepted on average per academic year
  • Enroll = the average number of students enrolled each year out of the accepted students
  • Top10Perc = how many of the accepted students were in the top 10% of their high school class
  • Top25Perc = how many of the accepted students were in the top 25% of their high school class
  • F.Undergrad = the number of full-time undergraduate students enrolled at the university
  • P.undergrad = the number of part-time undergraduate students enrolled at the university
  • Out.State = the average out of state tuition
  • Room.Board = the average yearly cost of room and board
  • Books = the average cost of books for class per year
  • Personal = the average student personal expenses per year
  • PhD = the percent of the faculty who hold a PhD
  • Terminal = the percent of faculty who hold a terminal degree in their field
  • S.F.Ratio = the student faculty ratio
  • perc.alumni = the percent of alumni who are active in the alumni association
  • Expend = the average amount of money a university spends per student per year

1. Is this a regression or classification task? Based on this, what metric will we likely use to assess predictive accuracy?
2. Which column in the data set cannot be used as a feature in the model (aside from the response variable)? Explain why this column cannot be used.
3. Create a correlation matrix for these data (a small sketch of code for this is given below). Based on what you see, is a shrinkage technique what you would recommend for these data? Why or why not?
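
Here is a small sketch of one way to build the correlation matrix, assuming the data frame is called collegedata; cor() only works on numeric columns, so the non-numeric columns are dropped first.

    # Correlations among the numeric columns only (sketch)
    numeric_cols <- sapply(collegedata, is.numeric)
    round(cor(collegedata[, numeric_cols]), 2)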

Geometry Pause!

To respond to our client's request, we are going to try using ridge regression, lasso, and elastic net. These three techniques can be explored using the geometric shapes involved. Let's see how.

Suppose we have only one feature in our data. We can then write the RSS as:

\( RSS = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^n Y_i^2 - 2\hat{\beta}_0 \sum_{i=1}^n Y_i - 2\hat{\beta}_1 \sum_{i=1}^n Y_i X_i + 2 \hat{\beta}_0 \hat{\beta}_1 \sum_{i=1}^n X_i + n \hat{\beta}_0^2 + \hat{\beta}_1^2 \sum_{i=1}^n X_i^2 \)

This looks long, but it is expressed in the form:

\( A\hat{\beta}_0^2 + B\hat{\beta}_0\hat{\beta}_1 + C\hat{\beta}_1^2 + D\hat{\beta}_0 + E\hat{\beta}_1 + F = K \)

This is the equation for an ellipse!! This means that for any given value K of the RSS, the values of \( (\hat{\beta}_0, \hat{\beta}_1)\) that solve the equation above form an ellipse.
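
To make the correspondence explicit, matching terms with the expansion above gives

\( A = n, \quad B = 2\sum_{i=1}^n X_i, \quad C = \sum_{i=1}^n X_i^2, \quad D = -2\sum_{i=1}^n Y_i, \quad E = -2\sum_{i=1}^n Y_i X_i, \quad F = \sum_{i=1}^n Y_i^2, \)

with K being the value of the RSS. Since \( B^2 - 4AC = 4\left[ \left( \sum_{i=1}^n X_i \right)^2 - n \sum_{i=1}^n X_i^2 \right] < 0 \) whenever the \( X_i \) are not all identical, the curve really is an ellipse.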

For LSLR, we are interested in finding the values of \( (\hat{\beta}_0, \hat{\beta}_1)\) that minimize the RSS. It turns out the solution to this is the unique \( (\hat{\beta}_0, \hat{\beta}_1)\) that lies at the very center of the ellipse. For any other value of the RSS, there are many combinations of \( (\hat{\beta}_0, \hat{\beta}_1)\) that can give us the same value of the RSS. These combinations, when plotted, form an ellipse.

Okay, great. So the solution to LSLR is a single point in the middle of an ellipse. Why is this important? Well, when we change from using LSLR to using ridge or lasso or elastic net, we are no longer constraining ourselves to the solution \( (\hat{\beta}_0, \hat{\beta}_1)\) that minimizes the RSS. In ridge, for instance, we find the solution that minimizes:

\( RSS + \lambda \sum \hat{\beta}_j^2.\)

Well, what does this solution look like? Is it an ellipse too? Not exactly. It turns out that the solution to ridge regression ends up being where the ellipse of the RSS and a circle defined by \( \sum \hat{\beta}_j^2\) collide.

In ridge, we use \( \lambda \) to essentially declare how large the sum \( \sum \hat{\beta}_j^2\) is allowed to be. This means we define a space of solutions \( (\hat{\beta}_0, \hat{\beta}_1)\) we decide are acceptable (meaning they are not too large). This space forms a circle. We then want to get the RSS as small as it can be while only using \( (\hat{\beta}_0, \hat{\beta}_1)\) values inside that circle. What is the solution to the problem? It is the point where the ellipse of the RSS and that circle collide. In other words, we keep the RSS as small as we can while only allowing solutions \( (\hat{\beta}_0, \hat{\beta}_1)\) that are within the circle we have chosen. (And how do we decide on how big the circle gets to be? We choose \( \lambda\) using 10-fold cross validation!)
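
Another standard way to write the same idea, which makes the circle explicit, is as a constrained problem:

\( \min_{\hat{\beta}_0, \hat{\beta}_1} RSS \quad \text{subject to} \quad \sum_j \hat{\beta}_j^2 \le s \)

where the budget \( s \) shrinks as \( \lambda \) grows. Choosing \( \lambda \) is therefore the same as choosing how big the circle is allowed to be.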

To see this set up plotted, take a look at your textbook (linked here) on page 244 (type 252 into the box to get there). The ridge solutions are on the right of this image, and the lasso solutions are on the left. For both of these images, the solution to \( (\hat{\beta}_0, \hat{\beta}_1)\) for LSLR is plotted as a single point in the middle of the ellipse. The red contours show different ellipses defined by the \( (\hat{\beta}_0, \hat{\beta}_1)\) solutions that we get depending on how large we let the RSS be.

1. Do you think that larger values of \( \lambda\) in ridge regression make our circle of possible solutions to \( (\hat{\beta}_0, \hat{\beta}_1)\) a bigger circle or a smaller circle? Hint: Remember that when \( \lambda=0\) we allow all possible coefficients.
2. As you will see in the image, the difference between ridge and lasso is that in ridge, we use a circle to define our space of possible \( (\hat{\beta}_0, \hat{\beta}_1)\) values and in lasso we use a diamond. Do you see that in the ridge image, the collision of our circle and the ellipse does not allow either coefficient to be 0, but in lasso it does? This is why ridge can only shrink our estimates, but lasso can allow selection as well.

So in ridge, we define a circle of possible solutions and in lasso we define a diamond. What shape does elastic net make? Take a look at this image. The red curve in the middle is the region of possible coefficients defined by elastic net. It's not a circle, but not a diamond either. The shape becomes closer to a circle or closer to a diamond depending on the value of \( \alpha \) we choose in the elastic net penalty.
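
For reference, the elastic net penalty as implemented in the glmnet package blends the two:

\( \lambda \sum_{j} \left[ \frac{1-\alpha}{2} \hat{\beta}_j^2 + \alpha \left| \hat{\beta}_j \right| \right] \)

so values of \( \alpha \) near 0 make the region behave like the ridge circle, and values of \( \alpha \) near 1 make it behave like the lasso diamond.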

3. Suppose I choose \( \alpha = .001\). Does that define a space that is closer to a circle or closer to a diamond?
4. Suppose I choose \( \alpha = .99\). Does that define a space that is closer to a circle or closer to a diamond?

Ridge Regression

Now that we have explored the concept, let's try the coding. We are going to start with ridge regression. All three of our techniques today require the glmnet package, so loading that is the first step.

    library(glmnet)

The next thing we need is the design matrix. We have a lot of features in this data set, and some are categorical. This means building the design matrix by hand can be laborious. Luckily, there is an easy piece of code we can use to create the matrix we need.

    XD <- model.matrix(Grad.Rate ~ ., data = )

There is one piece of information missing from the code above, and this is the data = part of the code. Here, you would put collegedata, which is the name of our data set. However, we have discussed that one column should not be used as a feature. This means that we need to exclude that column from the design matrix. For example, if we exclude the 4th column, we would use data = collegedata[,-4].
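
For instance, if the column of school names happened to be the first column of collegedata, the filled-in call might look like the sketch below; use whichever column index actually applies to your data.

    # Hypothetical fill-in: drop the school-name column (assumed here to be column 1)
    XD <- model.matrix(Grad.Rate ~ ., data = collegedata[, -1])
    dim(XD)  # rows = schools, columns = intercept + features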

1. Using the code above, create the design matrix. State the dimensions of this matrix.

Now that we have the design matrix, we only need one more thing to find the coefficients for ridge regression. This missing piece is the tuning parameter \( \lambda\). As we have discussed, the choice of tuning parameter defines the range of coefficient values we are going to allow.

2. Briefly explain the process of choosing the tuning parameter in ridge regression.

To run this process in R, we need the function cv.glmnet.

    set.seed(100)
    ridge.mod <- cv.glmnet(XD[,-1], RESPONSE, alpha = , lambda = )

Here, we have a few missing pieces.

RESPONSE: here you need to fill in the response variable you are interested in.

alpha = : Here, you are going to put either a 0 or a 1. A 0 indicates that we want to use ridge regression, and a 1 indicates that we want to use lasso.

lambda = : Here, you are going to put the sequence of \( \lambda\) values you want to consider. Make sure that you include 0 in this list!
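
As one example, a filled-in call might look like the sketch below. The \( \lambda\) grid here is made up; widen or refine it as needed for your data.

    # Hypothetical grid of candidate lambda values, starting at 0
    lambda.grid <- seq(from = 0, to = 10, by = 0.05)
    set.seed(100)
    ridge.mod <- cv.glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0, lambda = lambda.grid)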

3. Why do we want to make sure to include \( \lambda = 0\) in our list of possible tuning parameters?

Once we have run 10-fold CV, we want to look at our results. We can do that using the code plot(ridge.mod).

4. Plot your results from 10-fold CV, and show your result. Approximately what choice of \( \lambda\) seems to give the best value of our test metric?

Pictures are great, but it can be difficult to tell from the image exactly which value of \( \lambda \) we need to use. Luckily, we can pull that value directly.

    ridge.mod$lambda.min

5. What choice of \( \lambda\) gives the best value of our test metric?

Using this choice of tuning parameter, we can then see what value of test MSE we would obtain using ridge regression.

    min(ridge.mod$cvm)
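
One note: cvm stores the cross-validated MSE, so to report an RMSE you would take a square root, along the lines of:

    sqrt(min(ridge.mod$cvm))  # estimated test RMSE for ridge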
                          
6. Using your choice of \( \lambda\), what is the estimated test RMSE?
7. How many features are in your chosen ridge regression model?

At this point, there is only one other thing we might want to do with ridge regression, and that is to look at the coefficients. What values do we obtain with our choice of tuning parameter?

To figure this out, we actually need to train our ridge regression model with our choice of tuning parameter. To do this, fill in the code below.

    # Train Ridge
    ridge.final <- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = , lambda = )

8. Using the code above as a template, train your ridge regression model. Then, train the LSLR model using the same code template and call the result lslr.final. Once we have trained the models, we can make a data frame holding the coefficients for both LSLR and ridge using the following code. Show the resulting data frame as the answer to this question. Make sure you have formatted it using knitr::kable().

    ridge.betas <- as.numeric(coefficients(ridge.final))
    lslr.betas <- as.numeric(coefficients(lslr.final))
    BetasFinal <- data.frame("LSLR" = lslr.betas, "Ridge" = ridge.betas)
    rownames(BetasFinal) <- colnames(XD)
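    # One possible final step (a sketch): format the table as the question asks
    knitr::kable(BetasFinal, digits = 3)
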
9. State the estimated test RMSE for both of the trained models (LSLR and ridge regression). Which has the better predictive accuracy?

Lasso

Now that we have used all the features in LSLR and ridge regression, we are going to move on to a shrinkage and selection technique called lasso. The code for lasso is very similar to the code we used for ridge. The only difference is we need to use alpha = 1.
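
Here is a minimal sketch of the cross-validation step for lasso, mirroring the ridge call above; lambda.grid refers to the same hypothetical grid of \( \lambda\) values used earlier.

    set.seed(100)
    lasso.mod <- cv.glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 1, lambda = lambda.grid)
    plot(lasso.mod)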

1. Plot your results from 10-fold CV, and show your result.
2. What choice of \( \lambda_{Lasso}\) gives the best value of our test metric?
3. Using your choice of \( \lambda_{Lasso}\), what is the estimated test RMSE?
4. How many features are in your chosen lasso regression model? Which features are included in your model?
5. Which model (LSLR, ridge, or lasso) has the best predictive accuracy?

Elastic Net

The last technique we want to explore is elastic net. Elastic net is designed to balance between ridge and lasso. The code used to run elastic net is a little different from what we use for ridge and lasso, and it requires the caret package.

    library(caret)

Elastic net requires not one but two tuning parameters, \( \lambda \) and \( \alpha \). One thing we have not seen yet is how to define the range of possible values for the tuning parameters. Let's do that now!

    tuningGrid <- data.frame("alpha" = seq(from = 0, to = 1, by = .05), "lambda" = seq(from = 0, to = 5, by = .25))
                                            

Take a look at the tuningGrid data frame you have just created. Notice that the first column defines possible values for \( \alpha\), so these values must be between 0 and 1. The second column defines possible values for \( \lambda\), and these values can be anything greater than or equal to 0, so you will notice more range here. How did I choose the upper bound for \( \lambda\)? Well, I already knew what range of values I had gotten for ridge and lasso, so that gave me an idea of how big I might need to go.

Now, let's train elastic net! You will notice one line of code that is different from the slides, and that is the tuneGrid component. This allows us to specify the range of tuning parameters that we want to try. As before, make sure the column of school names is not included as a feature when you train this model.

    set.seed(100)
    college_elnet = train(
      Grad.Rate ~ ., data = collegedata,
      method = "glmnet",
      tuneGrid = tuningGrid,
      trControl = trainControl(method = "cv", number = 10)
    )
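
Once training finishes, one way to pull out the chosen tuning parameters and the corresponding cross-validated RMSE is sketched below; both pieces are stored inside the fitted train object.

    college_elnet$bestTune                         # alpha and lambda chosen by 10-fold CV
    min(college_elnet$results$RMSE, na.rm = TRUE)  # cross-validated RMSE at that choice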
                                            
1. What values of the tuning parameters would you recommend using?
2. Which method (LSLR, ridge, lasso, or elastic net) has the best predictive accuracy?

Last updated 2022 March 16.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.