STA 363 Lab 6
Goal
In our last lab, we practiced best subset selection (BSS), which is a selection technique. This means that by using BSS, we can identify a set of features that optimizes some comparison metric, like the \(R^2_{adj}\) or the AIC. In other words, we use selection techniques to help us choose a subset of features from our original list of possible features.
Today, we are going to focus on techniques that involve shrinkage. These techniques focus on optimizing a metric while still keeping the estimates of our coefficients and their standard errors in a reasonable range.
The Data
We have a client who is interested in predicting graduation rates at different universities and colleges in the United States. Graduation rate refers to the percent of the students who start at a college/university who graduate within 4 years. The client provides us with a random sample of 776 colleges and universities in the United States. These data are linked on Canvas. The data set contain the response variable graduation rate as well as 18 possible features.
Question 1
Is this a regression or classification task? Based on this, what metric will we likely use to assess predictive accuracy?
Now that we know the response variable (\(Y= Grad.Rate\)) and have identified our goal, it helps to briefly explore the response variable.
Question 2
Make a visualization to explore the distribution of graduation rate. What is the smallest value of graduation rate? The largest?
Exploring \(Y\) is important because it gives us some idea of the scale of the variable we are working with. This will be helpful later when we are looking at our predictive metrics and are attempting to determine if our estimation error is relatively large or relatively small relative to the scale of \(Y\).
Now, suppose we want to predict \(Y\), but we do not yet use any feature information. In this case, our best guess at \(Y\) is \(\bar{Y}\), the mean of the response variable.
Question 3
Compute the MSE and RMSE for these data using \(\hat{Y}_i = \bar{Y}\) for all rows \(i\).
The metrics that we have created above are generally what we use to compare other metrics to later; this is about as poor as our predictions can get, so when we build other models later, we can determine how much our predictive accuracy has improved.
Question 4
Train an LSLR (OLS) regression model using the whole data set and the single feature X = the student faculty ratio. Find the MSE and RMSE. How much have these metrics improved from the values you got in Question 3?
As we can see, our metrics improve when we use this feature (case student faculty ratio) to predict \(Y\). This is the goal! By building a model and incorporating information from the features, we hope to end up with more accurate predictions of \(Y\). Now, let’s try adding more features.
Building a model with more features
Before we add features into the model, we do need to think about the features that are available to us.
Question 5
Which column in the data set cannot be used as a feature (aside from the response variable)? Explain why this column cannot be used.
Once we have decided on the possible set of features, it is time to start thinking about what model might be appropriate. Our response variable is numeric, so our first thought is likely LSLR (OLS) regression. However, as we have discussed in class, before we choose that model it is beneficial to look for multicollinearity in our features.
Question 6
Build a correlation plot (you can choose the styling you like best!) to explore the correlations in the features in this data set.
Question 7
Based on the plot in Question 6, why would you suggest we start with Ridge Regression instead of LSLR?
Ridge Regression
Let’s try building a ridge regression model in R. Ridge regression
requires the glmnet
package, so loading that is the first
step.
library(glmnet)
The next thing we need is the design matrix. We have a lot of features in this data set, and some are categorical. This means building the design matrix by hand can be laborious. Luckily, there is an easy code we can use to create the matrix we need.
<- model.matrix(Grad.Rate ~ ., data = ) XD
There is one piece of information missing from the code above, and
this is the data=
part of the code. Here, you would put
collegedata
, which is the name of our data set. However, we
have discussed that one column should not be used as a feature. This
means that we need to exclude that column from the design matrix. For
example, if we exclude the 4th column, we would use
data = collegedata[,-4]
.
Question 8
Using the code above, create the design matrix. State the dimensions of this matrix.
Now that we have the design matrix, we only need one more thing to find the coefficients for ridge regression. This missing piece is the tuning parameter \(\lambda\). As we have discussed, the choice of tuning parameter defines our range of possible values of the coefficients we are going to allow.
Exploring the tuning parameter
We have already discussed a principled way to choose \(\lambda\), but for the moment, let’s play with a few choices and see what happens. To train a ridge regression model in R, we use
# Train Ridge
<- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0,
ridge.model2lambda = , standardize = TRUE)
You will notice a few things in this code. First of all, we use the
design matrix XD[,-1]
. Why the [,-1]
? Well,
otherwise we end up with two intercepts (this is just a code thing). You
will also notice that right now the lambda =
part of the
code is missing. We will fix that in a second.
Once we have run ridge regression, we would like to get some idea of the values of \(\boldsymbol{\hat{\beta}_{Ridge}}\). To get that, we use
# Print the coefficients
<- data.frame("Ridge" = as.numeric(coefficients(ridge.model2)))
Betas rownames(Betas) <- colnames(XD)
::kable(Betas,
knitrcaption = "Coefficients with Tuning Parameter = ")
Question 9
Suppose we let \(\lambda =2\). Adapting the code above, what is \(\boldsymbol{\hat{\beta}_{Ridge}}\)?
Question 10
What is the sum of the coefficients (excluding the intercept) when
\(\lambda = 2\)? Hint: To find this,
use sum(ridge.model2$beta)
.
Why do we exclude the intercept? Well, the intercept is not actually penalized in ridge. We typically do not interpret the intercept when we have so many coefficients, as the intercept typically represents an impossible combination of features. This means that the intercept actually often gets larger in ridge regression, and that’s okay!
Question 11
Suppose we let \(\lambda =200\). Adapting the code above, what is \(\boldsymbol{\hat{\beta}_{Ridge}}\)?
Question 12
What is the sum of the coefficients (excluding the intercept) when \(\lambda = 200\)?
Question 13
What do you notice happens to the sum of the coefficients as \(\lambda\) increases?
This is the whole idea of shrinkage! We are adding a penalty term in that essentially constrains (i.e., bounds) the size of the coefficients. In doing this, we also constrain the standard error of these coefficients.
Choosing The Tuning Parameter using Cross Validation
Okay, now we have explored what happens when we change \(\lambda\). How do we actually decide which value of \(\lambda\) we want to use for a given data set?
Question 14
Briefly explain the process of choosing the tuning parameter in ridge regression. Use words, not code!
Now that we have explained the process, we are ready for code. To run
this process in R, we need the code cv.glmnet
.
set.seed(100)
<- cv.glmnet(XD[,-1], RESPONSE , alpha = , lambda = , standardize = TRUE) ridge.mod
Question 15
Why do we need to set a random seed here?
Here, we have a few missing pieces.
- RESPONSE = here you need to replace the word RESPONSE with the response variable you are interested in.
alpha =
: Here, you are going to put either a 0 (alpha = 0
) or a 1 (alpha = 1
). A 0 indicates that we want to use ridge regression, and a 1 indicates that we want to use lasso. Hint: we want ridge.lambda =
: Here, you are going to put the sequence of \(\lambda\) values you want to consider. Make sure that you include 0 in this list! For today, we will use a sequence of values from 0 to 50 in intervals of .05.
Question 16
Why do we want to make sure to include \(\lambda = 0\) in our list of possible tuning parameters?
Once we have run 10-fold CV, we want to look at our results. To do this, copy and paste the following into a chunk and press play:
<- function(ridge.mod, metric, title){
ridgePlot library(ggplot2)
<- ridge.mod$lambda[which.min(ridge.mod$cvm)]
smallestLambda
if(metric == "MSE"){
<- ggplot( data.frame(ridge.mod$lambda), aes( x = ridge.mod$lambda, y = (ridge.mod$cvm))) + geom_point() + geom_vline( xintercept = smallestLambda, col = "blue" , lty = 2 ) + labs(caption = paste("Test MSE values for Different Tuning Parameters. Smallest MSE at lambda = ", smallestLambda), title = title, y = "Test MSE", x = "Tuning Parameter")
g1
}
if(metric == "RMSE"){
<- ggplot( data.frame(ridge.mod$lambda), aes( x = ridge.mod$lambda, y = sqrt(ridge.mod$cvm))) + geom_point() + geom_vline( xintercept = smallestLambda, col = "blue" , lty = 2 ) + labs(caption = paste("Test RMSE values for Different Tuning Parameters. Smallest RMSE at lambda = ", smallestLambda), title = title, y = "Test RMSE", x = "Tuning Parameter")
g1
}
g1 }
This teaches R a function called ridgePlot
that I have
written for you. To use the code, you need
ridgePlot(ridge.mod, metric = , title = )
where metric = "RMSE"
will put the RMSE on the y axis
and metric = "MSE"
will put the MSE on the y-axis. You can
also add a title.
Question 17
Plot your results using the code above, and show your result. Approximately what choice of \(\lambda\) gives the best value of our test metric?
Pictures are great, but it can be difficult from the image to tell exactly which value of \(\lambda\) we need to use. Luckily, we can pull that value directly if we don’t want to look at the plot.
$lambda.min ridge.mod
Question 18
Why might it be helpful to look at the plot instead of just using
ridge.mod$lambda.min
all the time?
Using this choice of tuning parameter, we can then see what value of test MSE we would obtain using ridge regression.
min(ridge.mod$cvm)
Question 19
Using your choice of \(\lambda\), what is the estimated test RMSE?
Question 20
How much has our RMSE improved from the values you got in Question 3? This allows us to see how much better our model does than using no feature information.
At this point, there is only one other thing we might want to do with ridge regression, and that is look at your coefficients. What values do we obtain with our choice of tuning parameter?
To figure this out, we actually need to train out ridge regression model with our choice of tuning parameter. To do this, fill in the code below.
# Train Ridge
<- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = , lambda = , standardize = TRUE) ridge.final
Question 21
Using the code above as a template, train your ridge regression
model. Then, train the LSLR model using the same code template and call
the result lslr.final
. Once we have trained the model, we
can make a data frame holding the coefficients for both LSLR and ridge
using the following code. Show the resultant data frame as the answer to
this question. Make sure you have formatted it using
knitr::kable()
.
<- as.numeric(coefficients(ridge.final))
ridge.betas <- as.numeric(coefficients(lslr.final))
lslr.betas <- data.frame("LSLR" = lslr.betas, "Ridge" = ridge.betas)
BetasFinal rownames(BetasFinal) <- colnames(XD)
Question 22
State the estimated test RMSE that you got from 10-fold CV for both of the two trained models (LSLR or ridge regression). Which has the best predictive accuracy? By how much (in percent difference)?
NOTE: How do we do this in code???
# Run 10-fold CV for only the two choices of lambda
set.seed(100)
<- cv.glmnet(XD[,-1], collegedata$Grad.Rate , alpha = 0 , lambda = c(LAMBDA_LSLR, LAMBDA_RIDGE), standardize = TRUE)
ridge.final.out # Output the Lambdas
$lambda
ridge.final.out# Output the test MSE
$cvm
ridge.final.out# Output the test RMSE
sqrt(ridge.final.out$cvm)
All you need to do is replace LAMBDA_LSLR
with the
numeric value of \(\lambda\) that
associates with LSLR and LAmbda_RIDGE
with the numeric
value of \(\lambda\) you have chosen
for ridge.
This
work was created by Nicole Dalzell is licensed under a
Creative
Commons Attribution-NonCommercial 4.0 International License. Last
updated 2022 October 10.