STA 363 Lab 6

Goal

In our last lab, we practiced best subset selection (BSS), which is a selection technique. This means that by using BSS, we can identify a set of features that optimizes some comparison metric, like the $R^2_{adj}$ or the AIC. In other words, we use selection techniques to help us choose a subset of features from our original list of possible features.

Today, we are going to focus on techniques that involve shrinkage. These techniques focus on optimizing a metric while still keeping the estimates of our coefficients and their standard errors in a reasonable range.

The Data

We have a client who is interested in predicting graduation rates at different universities and colleges in the United States. Graduation rate refers to the percent of the students who start at a college/university who graduate within 4 years. The client provides us with a random sample of 776 colleges and universities in the United States. These data are linked on Canvas. The data set contain the response variable graduation rate as well as 18 possible features.

X = The name of the university.

Private = Yes if the school is a private school, 1 if public

Apps = the number of apps received on average per academic year

Accept - the number of applications accepted on average per academic year

Enroll = the average number of student enrolled each year out of the accepted students

Top10Perc = how many of the accepted students were in the 10 % of their high school.

Top25Perc = how many of the accepted students were in the 25% of their high school.

F.Undergrad = the number of full time undergraduate students enrolled at the university

P.undergrad = the number of part time undergraduate students enrolled at the university

Out.State = the average out of state tuition

Room.Board = the average yearly cost of room and board

Books = the average cost of books for class per year

Personal = the average student personal expenses per year

PhD = the percent of the faculty who hold a PhD

Terminal = the percent of faculty who hold a terminal degree in their field

S.F.Ratio = the student faculty ratio

perc.alumni = the percent of alumni who are active in the alumni association

Expend = the average amount of money a university spends per student per year

Question 1

Is this a regression or classification task? Based on this, what metric will we likely use to assess predictive accuracy?

Now that we know the response variable ($Y= Grad.Rate$) and have identified our goal, it helps to briefly explore the response variable.

Question 2

Make a visualization to explore the distribution of graduation rate. What is the smallest value of graduation rate? The largest?

Exploring $Y$ is important because it gives us some idea of the scale of the variable we are working with. This will be helpful later when we are looking at our predictive metrics and are attempting to determine if our estimation error is relatively large or relatively small relative to the scale of $Y$.

Now, suppose we want to predict $Y$, but we do not yet use any feature information. In this case, our best guess at $Y$ is $\bar{Y}$, the mean of the response variable.

Question 3

Compute the MSE and RMSE for these data using $\hat{Y}_i = \bar{Y}$ for all rows $i$.

The metrics that we have created above are generally what we use to compare other metrics to later; this is about as poor as our predictions can get, so when we build other models later, we can determine how much our predictive accuracy has improved.

Question 4

Train an LSLR (OLS) regression model using the whole data set and the single feature X = the student faculty ratio. Find the MSE and RMSE. How much have these metrics improved from the values you got in Question 3?

As we can see, our metrics improve when we use this feature (case student faculty ratio) to predict $Y$. This is the goal! By building a model and incorporating information from the features, we hope to end up with more accurate predictions of $Y$. Now, let’s try adding more features.

Building a model with more features

Before we add features into the model, we do need to think about the features that are available to us.

Question 5

Which column in the data set cannot be used as a feature (aside from the response variable)? Explain why this column cannot be used.

Once we have decided on the possible set of features, it is time to start thinking about what model might be appropriate. Our response variable is numeric, so our first thought is likely LSLR (OLS) regression. However, as we have discussed in class, before we choose that model it is beneficial to look for multicollinearity in our features.

Question 6

Build a correlation plot (you can choose the styling you like best!) to explore the correlations in the features in this data set.

Question 7

Based on the plot in Question 6, why would you suggest we start with Ridge Regression instead of LSLR?

Ridge Regression

Let’s try building a ridge regression model in R. Ridge regression requires the glmnet package, so loading that is the first step.

library(glmnet)

The next thing we need is the design matrix. We have a lot of features in this data set, and some are categorical. This means building the design matrix by hand can be laborious. Luckily, there is an easy code we can use to create the matrix we need.

XD <- model.matrix(Grad.Rate ~ ., data = )

There is one piece of information missing from the code above, and this is the data= part of the code. Here, you would put collegedata, which is the name of our data set. However, we have discussed that one column should not be used as a feature. This means that we need to exclude that column from the design matrix. For example, if we exclude the 4th column, we would use data = collegedata[,-4].

Question 8

Using the code above, create the design matrix. State the dimensions of this matrix.

Now that we have the design matrix, we only need one more thing to find the coefficients for ridge regression. This missing piece is the tuning parameter $\lambda$. As we have discussed, the choice of tuning parameter defines our range of possible values of the coefficients we are going to allow.

Exploring the tuning parameter

We have already discussed a principled way to choose $\lambda$, but for the moment, let’s play with a few choices and see what happens. To train a ridge regression model in R, we use

# Train Ridge
ridge.model2<- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = 0, 
               lambda = , standardize = TRUE)

You will notice a few things in this code. First of all, we use the design matrix XD[,-1]. Why the [,-1]? Well, otherwise we end up with two intercepts (this is just a code thing). You will also notice that right now the lambda = part of the code is missing. We will fix that in a second.

Once we have run ridge regression, we would like to get some idea of the values of $\boldsymbol{\hat{\beta}_{Ridge}}$. To get that, we use

# Print the coefficients
Betas <- data.frame("Ridge" = as.numeric(coefficients(ridge.model2)))
rownames(Betas) <- colnames(XD)
knitr::kable(Betas,
            caption = "Coefficients with Tuning Parameter = ")

Question 9

Suppose we let $\lambda =2$. Adapting the code above, what is $\boldsymbol{\hat{\beta}_{Ridge}}$?

Question 10

What is the sum of the coefficients (excluding the intercept) when $\lambda = 2$? Hint: To find this, use sum(ridge.model2$beta).

Why do we exclude the intercept? Well, the intercept is not actually penalized in ridge. We typically do not interpret the intercept when we have so many coefficients, as the intercept typically represents an impossible combination of features. This means that the intercept actually often gets larger in ridge regression, and that’s okay!

Question 11

Suppose we let $\lambda =200$. Adapting the code above, what is $\boldsymbol{\hat{\beta}_{Ridge}}$?

Question 12

What is the sum of the coefficients (excluding the intercept) when $\lambda = 200$?

Question 13

What do you notice happens to the sum of the coefficients as $\lambda$ increases?

This is the whole idea of shrinkage! We are adding a penalty term in that essentially constrains (i.e., bounds) the size of the coefficients. In doing this, we also constrain the standard error of these coefficients.

Choosing The Tuning Parameter using Cross Validation

Okay, now we have explored what happens when we change $\lambda$. How do we actually decide which value of $\lambda$ we want to use for a given data set?

Question 14

Briefly explain the process of choosing the tuning parameter in ridge regression. Use words, not code!

Now that we have explained the process, we are ready for code. To run this process in R, we need the code cv.glmnet.

set.seed(100)
ridge.mod <- cv.glmnet(XD[,-1], RESPONSE , alpha =  , lambda = , standardize = TRUE)

Question 15

Why do we need to set a random seed here?

Here, we have a few missing pieces.

RESPONSE = here you need to replace the word RESPONSE with the response variable you are interested in.
alpha =: Here, you are going to put either a 0 (alpha = 0) or a 1 (alpha = 1). A 0 indicates that we want to use ridge regression, and a 1 indicates that we want to use lasso. Hint: we want ridge.
lambda =: Here, you are going to put the sequence of $\lambda$ values you want to consider. Make sure that you include 0 in this list! For today, we will use a sequence of values from 0 to 50 in intervals of .05.

Question 16

Why do we want to make sure to include $\lambda = 0$ in our list of possible tuning parameters?

Once we have run 10-fold CV, we want to look at our results. To do this, copy and paste the following into a chunk and press play:

ridgePlot <- function(ridge.mod, metric, title){
  library(ggplot2)
  
  smallestLambda <- ridge.mod$lambda[which.min(ridge.mod$cvm)] 
  
  if(metric == "MSE"){
  g1 <- ggplot( data.frame(ridge.mod$lambda), aes( x = ridge.mod$lambda, y = (ridge.mod$cvm))) + geom_point() + geom_vline( xintercept = smallestLambda, col = "blue" , lty = 2 ) +  labs(caption = paste("Test MSE values for Different Tuning Parameters. Smallest MSE at lambda = ", smallestLambda), title = title, y = "Test MSE", x = "Tuning Parameter")
  
  }
  
  if(metric == "RMSE"){
  g1 <- ggplot( data.frame(ridge.mod$lambda), aes( x = ridge.mod$lambda, y = sqrt(ridge.mod$cvm))) + geom_point() + geom_vline( xintercept = smallestLambda, col = "blue" , lty = 2 ) +  labs(caption = paste("Test RMSE values for Different Tuning Parameters. Smallest RMSE at lambda = ", smallestLambda), title = title, y = "Test RMSE", x = "Tuning Parameter")

  }
  
  g1
}

This teaches R a function called ridgePlot that I have written for you. To use the code, you need

ridgePlot(ridge.mod, metric = , title = )

where metric = "RMSE" will put the RMSE on the y axis and metric = "MSE" will put the MSE on the y-axis. You can also add a title.

Question 17

Plot your results using the code above, and show your result. Approximately what choice of $\lambda$ gives the best value of our test metric?

Pictures are great, but it can be difficult from the image to tell exactly which value of $\lambda$ we need to use. Luckily, we can pull that value directly if we don’t want to look at the plot.

ridge.mod$lambda.min

Question 18

Why might it be helpful to look at the plot instead of just using ridge.mod$lambda.min all the time?

Using this choice of tuning parameter, we can then see what value of test MSE we would obtain using ridge regression.

min(ridge.mod$cvm)

Question 19

Using your choice of $\lambda$, what is the estimated test RMSE?

Question 20

How much has our RMSE improved from the values you got in Question 3? This allows us to see how much better our model does than using no feature information.

At this point, there is only one other thing we might want to do with ridge regression, and that is look at your coefficients. What values do we obtain with our choice of tuning parameter?

To figure this out, we actually need to train out ridge regression model with our choice of tuning parameter. To do this, fill in the code below.

# Train Ridge
ridge.final <- glmnet(XD[,-1], collegedata$Grad.Rate, alpha = , lambda = , standardize = TRUE)

Question 21

Using the code above as a template, train your ridge regression model. Then, train the LSLR model using the same code template and call the result lslr.final. Once we have trained the model, we can make a data frame holding the coefficients for both LSLR and ridge using the following code. Show the resultant data frame as the answer to this question. Make sure you have formatted it using knitr::kable().

ridge.betas <- as.numeric(coefficients(ridge.final))
lslr.betas <- as.numeric(coefficients(lslr.final))
BetasFinal <- data.frame("LSLR" = lslr.betas, "Ridge" = ridge.betas)
rownames(BetasFinal) <- colnames(XD)

Question 22

State the estimated test RMSE that you got from 10-fold CV for both of the two trained models (LSLR or ridge regression). Which has the best predictive accuracy? By how much (in percent difference)?

NOTE: How do we do this in code???

# Run 10-fold CV for only the two choices of lambda
set.seed(100)
ridge.final.out <- cv.glmnet(XD[,-1], collegedata$Grad.Rate , alpha = 0 , lambda = c(LAMBDA_LSLR, LAMBDA_RIDGE), standardize = TRUE)
# Output the Lambdas
ridge.final.out$lambda
# Output the test MSE
ridge.final.out$cvm
# Output the test RMSE
sqrt(ridge.final.out$cvm)

All you need to do is replace LAMBDA_LSLR with the numeric value of $\lambda$ that associates with LSLR and LAmbda_RIDGE with the numeric value of $\lambda$ you have chosen for ridge.

This work was created by Nicole Dalzell is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 October 10.