In our last lab, we practiced best subset selection (BSS), which is a selection technique. This means that by using BSS, we could identify the set of features that optimized some comparison metric, like the adjusted R-squared or the AIC. Today, we are going to focus on techniques that involve shrinkage. These techniques focus on optimizing a metric while still keeping the estimates of our coefficients, and their standard errors, in a reasonable range.
We have a client who is interested in predicting graduation rates at different universities and colleges in the United States. Graduation rate refers to the percent of students who graduate within four years. The client provides us with a random sample of 776 colleges and universities in the United States. These data are linked on Canvas. The data contain the response variable graduation rate as well as 18 possible features.
To respond to our client's request, we are going to try using ridge regression, lasso, and elastic net. One helpful way to understand these three techniques is through the geometry involved. Let's see how.
Suppose we have only one feature in our data. We can then write the RSS as:
\( RSS = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^n Y_i^2 - 2\hat{\beta}_0 \sum_{i=1}^n Y_i - 2\hat{\beta}_1 \sum_{i=1}^n X_i Y_i + 2 \hat{\beta}_0 \hat{\beta}_1 \sum_{i=1}^n X_i + n \hat{\beta}_0^2 + \hat{\beta}_1^2 \sum_{i=1}^n X_i^2 \)
This looks long, but it is expressed in the form:
\( A\hat{\beta}_0^2 + B\hat{\beta}_0\hat{\beta}_1 + C\hat{\beta}_1^2 + D\hat{\beta}_0 + E\hat{\beta}_1 + F = K \)
This is the equation for an ellipse!! This means that for any given value K of the RSS, the values of \( (\hat{\beta}_0, \hat{\beta}_1)\) that solve the equation above form an ellipse.
For LSLR, we are interested in finding the values of \( (\hat{\beta}_0, \hat{\beta}_1)\) that minimize the RSS. It turns out the solution to this is the unique \( (\hat{\beta}_0, \hat{\beta}_1)\) that lies at the very center of the ellipse. For any other value of the RSS, there are many combinations of \( (\hat{\beta}_0, \hat{\beta}_1)\) that can give us the same value of the RSS. These combinations, when plotted, form an ellipse.
Okay, great. So the solution to LSLR is a single point in the middle of an ellipse. Why is this important? Well, when we change from using LSLR to using ridge or lasso or elastic net, we are no longer constraining ourselves to the solution \( (\hat{\beta}_0, \hat{\beta}_1)\) that minimizes the RSS. In ridge, for instance, we find the solution that minimizes:
\( RSS + \lambda \sum \hat{\beta}_j^2.\)
Well, what does this solution look like? Is it an ellipse too? Not exactly. It turns out that the solution to ridge regression ends up being where the ellipse of the RSS and a circle defined by \( \sum \hat{\beta}_j^2\) collide.
In ridge, we use \( \lambda \) to essentially declare how large the sum \( \sum \hat{\beta}_j^2\) is allowed to be. This means we define a space of solutions \( (\hat{\beta}_0, \hat{\beta}_1)\) that we decide are acceptable (meaning they are not too large). This space forms a circle. We then want to get the RSS as small as it can be while only using \( (\hat{\beta}_0, \hat{\beta}_1)\) values inside that circle. What is the solution to this problem? It is the point where the ellipse of the RSS and that circle collide. In other words, we keep the RSS as small as we can while only allowing solutions \( (\hat{\beta}_0, \hat{\beta}_1)\) that are within the circle we have chosen. (And how do we decide how big the circle gets to be? We choose \( \lambda\) using 10-fold cross-validation!)
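One way to make the "circle of acceptable solutions" idea precise is to note that ridge regression can be written in an equivalent constrained form (this equivalence is standard, not something specific to our lab):

\( \min_{\hat{\beta}_0, \hat{\beta}_1} RSS \quad \text{subject to} \quad \sum_j \hat{\beta}_j^2 \leq s, \)

where the budget \( s \) shrinks as \( \lambda \) grows. The boundary of the region \( \sum_j \hat{\beta}_j^2 \leq s\) is exactly the circle described above, and the ridge solution is the first point where an RSS contour ellipse touches that circle.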
To see this setup plotted, take a look at your textbook (linked here) on page 244 (type 252 into the box to get there). The ridge solutions are on the right of this image, and the lasso solutions are on the left. For both of these images, the LSLR solution \( (\hat{\beta}_0, \hat{\beta}_1)\) is plotted as a single point in the middle of the ellipses. The red contours show the ellipses of \( (\hat{\beta}_0, \hat{\beta}_1)\) combinations we get depending on how large we let the RSS be.
As you will see in the image, the difference between ridge and lasso is that in ridge, we use a circle to define our space of possible \( (\hat{\beta}_0, \hat{\beta}_1)\) values, and in lasso we use a diamond. Do you see that in the ridge image, the collision of our circle and the ellipse does not allow either coefficient to be 0, but in lasso it does? This is why ridge only shrinks our estimates, while lasso can perform selection as well.
So in ridge, we define a circle of possible solutions, and in lasso we define a diamond. What shape does elastic net make? Take a look at this image. The red curve in the middle outlines the region of possible coefficients defined by elastic net. It's not a circle, but not a diamond either. The shape becomes closer to a circle or closer to a diamond depending on the value of \( \alpha \) we choose in the elastic net penalty.
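To see why \( \alpha \) controls the shape, it helps to write out the elastic net penalty. In the parameterization documented by the glmnet package (the same package we use below), the penalty added to the loss is

\( \lambda \left[ \frac{(1-\alpha)}{2} \sum_j \hat{\beta}_j^2 + \alpha \sum_j |\hat{\beta}_j| \right]. \)

When \( \alpha = 0\), only the squared (ridge) term remains and the acceptable region is a circle; when \( \alpha = 1\), only the absolute-value (lasso) term remains and the region is a diamond; values in between blend the two shapes.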
Now that we have explored the concept, let's try the coding. We are going to start with ridge regression. All three of our techniques today require the glmnet package, so loading that is the first step.
library(glmnet)
The next thing we need is the design matrix. We have a lot of features in this data set, and some are categorical. This means building the design matrix by hand can be laborious. Luckily, there is a simple function, model.matrix(), that we can use to create the matrix we need.
XD <- model.matrix(Grad.Rate ~ ., data = )
There is one piece of information missing from the code above, and that is the data = part of the code. Here, you would put collegedata, which is the name of our data set. However, we have discussed that one column should not be used as a feature. This means that we need to exclude that column from the design matrix. For example, if we wanted to exclude the 4th column, we would use data = collegedata[,-4].
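As a rough sketch of what the completed line might look like (keeping the illustrative 4th column from the example above; double-check which column actually needs to be excluded in your data):

# Build the design matrix, dropping the column that should not be a feature
# (the -4 below is only an example; replace it with the correct column index)
XD <- model.matrix(Grad.Rate ~ ., data = collegedata[ , -4])
head(colnames(XD))   # the first column is the intercept
dim(XD)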
Now that we have the design matrix, we only need one more thing to find the coefficients for ridge regression. This missing piece is the tuning parameter \( \lambda\). As we have discussed, the choice of tuning parameter defines our range of possible values of the coefficients we are going to allow.
To run this process in R, we need the cv.glmnet function.
set.seed(100)
ridge.mod <- cv.glmnet(XD[,-1], RESPONSE, alpha = , lambda = )
Here, we have a few missing pieces; a sketch of the completed call appears after this list.
RESPONSE: here you need to fill in the response variable you are interested in.
alpha = : here, you are going to put either a 0 or a 1. A 0 indicates that we want to use ridge regression, and a 1 indicates that we want to use lasso.
lambda = : here, you are going to put the sequence of \( \lambda\) values you want to consider. Make sure that you include 0 in this list!
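As promised, here is a sketch of what the completed call might look like. The response column Grad.Rate comes from our data, but the \( \lambda\) grid below (0 to 10 in steps of 0.05) is purely an illustrative choice; pick a range that makes sense for your results.

# A hypothetical grid of lambda values to try; note that 0 is included
lambda_grid <- seq(from = 0, to = 10, by = 0.05)

set.seed(100)
# alpha = 0 requests ridge regression
ridge.mod <- cv.glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0, lambda = lambda_grid)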
Once we have run 10-fold CV, we want to look at our results. We can do that using the code plot(ridge.mod).
Pictures are great, but it can be difficult to tell from the image exactly which value of \( \lambda \) we need to use. Luckily, we can pull that value directly.
ridge.mod$lambda.min
Using this choice of tuning parameter, we can then see the estimated test MSE we would obtain using ridge regression.
min(ridge.mod$cvm)
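A quick note on what this line is doing (based on how cv.glmnet stores its output): ridge.mod$cvm holds the cross-validated MSE for every \( \lambda\) in our grid, in the same order as ridge.mod$lambda, so its minimum is the CV MSE at lambda.min. The line below pulls out the same value explicitly, as a check.

# CV MSE at the chosen lambda; should match min(ridge.mod$cvm)
ridge.mod$cvm[ridge.mod$lambda == ridge.mod$lambda.min]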
At this point, there is only one other thing we might want to do with ridge regression, and that is look at our coefficients. What values do we obtain with our choice of tuning parameter?
To figure this out, we actually need to train our ridge regression model with our choice of tuning parameter. To do this, fill in the code below.
# Train Ridge
ridge.final <- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = , lambda = )
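As a sketch of how those blanks might be filled in (assuming we use the \( \lambda \) chosen by cross-validation, and assuming, for illustration, that we also get our LSLR fit from glmnet by setting \( \lambda = 0\), which removes the penalty entirely):

# Ridge, refit on the full data with the lambda chosen by 10-fold CV
ridge.final <- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0, lambda = ridge.mod$lambda.min)

# With lambda = 0 there is no penalty, so this is (essentially) the LSLR fit
lslr.final <- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0, lambda = 0)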
Once we have trained both models (our ridge model, ridge.final, and our LSLR model, lslr.final), we can make a data frame holding the coefficients for both LSLR and ridge using the following code. Show the resultant data frame as the answer to this question. Make sure you have formatted it using knitr::kable().
# Pull the coefficient estimates out of each fitted model
ridge.betas <- as.numeric(coefficients(ridge.final))
lslr.betas <- as.numeric(coefficients(lslr.final))

# Combine them into one data frame, labeled by the design matrix columns
BetasFinal <- data.frame("LSLR" = lslr.betas, "Ridge" = ridge.betas)
rownames(BetasFinal) <- colnames(XD)
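Since the lab asks for the table to be formatted with knitr::kable(), one way to do that (the digits argument is just a readability choice) is:

# Display the coefficient comparison as a formatted table
knitr::kable(BetasFinal, digits = 3)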
Now that we have used all the features in LSLR and ridge regression, we are going to move on to a shrinkage and selection technique called lasso. The code for lasso is very similar to the code we used for ridge. The only difference is that we need to use alpha = 1.
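As a sketch of what that looks like (reusing the design matrix from before and, purely for illustration, the same hypothetical lambda_grid as in the ridge sketch above):

set.seed(100)
# alpha = 1 requests lasso
lasso.mod <- cv.glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 1, lambda = lambda_grid)

lasso.mod$lambda.min   # lambda chosen by 10-fold CV
min(lasso.mod$cvm)     # estimated test MSE at that lambda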
The last technique we want to explore is elastic net. Elastic net is designed to balance between ridge and lasso. The code used to run elastic net is a little different from what we use for ridge and lasso, and it requires the caret package.
library(caret)
Elastic net requires not one but two tuning parameters, \( \lambda \) and \( \alpha \). One thing we have not seen yet is how to define the range of possible values for the tuning parameters. Let's do that now!
tuningGrid <- data.frame("alpha" = seq(from = 0, to = 1, by = 0.05), "lambda" = seq(from = 0, to = 5, by = 0.25))
Take a look at the tuningGrid data frame you have just created. Notice that the first column defines possible values for \( \alpha\), so these values must be between 0 and 1. The second column defines possible values for \( \lambda\), and these values can be anything greater than or equal to 0, so you will notice more range here. How did I choose the upper bound for \( \lambda\)? Well, I already knew what range of values I had gotten for ridge and lasso, so that gave me an idea of how big I might need to go.
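One design note, in case you want to be more thorough: built this way, the data frame pairs the two sequences row by row, so only 21 \( (\alpha, \lambda)\) combinations get tried. If you would rather try every combination of the two sequences, expand.grid() builds the full grid instead; this is just an option, not something the lab requires.

# Every combination of alpha and lambda (441 rows), rather than row-by-row pairs
tuningGrid <- expand.grid("alpha" = seq(from = 0, to = 1, by = 0.05), "lambda" = seq(from = 0, to = 5, by = 0.25))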
Now, let's train elastic net! You will notice one line of code that is different from the slides, and that is the tuneGrid component. This allows us to specify the range of tuning parameters that we want to try.
set.seed(100)
college_elnet = train(
  Grad.Rate ~ ., data = collegedata,
  method = "glmnet",                                     # fit elastic net via glmnet
  tuneGrid = tuningGrid,                                 # the (alpha, lambda) values to try
  trControl = trainControl(method = "cv", number = 10)   # 10-fold cross-validation
)
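Once training finishes, caret stores the winning \( (\alpha, \lambda)\) pair and the cross-validation results for every pair it tried. These are standard components of a caret train object, so we can peek at them directly.

college_elnet$bestTune   # the chosen alpha and lambda
college_elnet$results    # CV performance for each pair tried

One caveat when comparing with the ridge and lasso output: caret reports RMSE rather than MSE, so square the RMSE if you want a direct comparison with min(ridge.mod$cvm).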