STA 363 Lab 5

Goal

Before the break, we started talking about ridge regression. Today, we are going to motivate when and why we might want to consider using this technique. We will also see how to use R to train these models.

The Data

We are going to work with a data set from your textbook called the Hitters data set. This data set provides information about major league baseball players with the goal of predicting \(Y\) = the salary of the players in thousands of U.S. dollars.

To load the data, run the following code:

# Load the library 
library(ISLR)

# Load the data 
data(Hitters)

# Remove columns we don't need for today
Hitters <- Hitters[,-c(14,15,19,20)]

# Scale the matrix (we'll talk later in the lab about why!)
Hitters <- as.data.frame(scale(Hitters))

Our data set has \(n = 322\) rows with 16 possible features. All of these features are numeric. Note: Categorical features are allowed in Ridge Regression! We just aren’t using them for our example today.

If you open the data and look at the features, you will notice that they all seem to contain relatively small numbers. This is because each feature has been scaled to convert it into a z-score. This means a value of 1 for Hits means that for this player, their number of total hits this season is 1 standard deviation above the average number of hits across all players. A value of -2 means that for this player, their number of total hits this season is 2 standard deviations below the average number of hits across all players.
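To see on a small scale what scale() did to each column, here is a toy sketch. The vector x below is made up purely for illustration; it is not part of the Hitters data.

# A made-up vector, just for illustration
x <- c(2, 4, 6, 8)

# A z-score subtracts the mean and divides by the standard deviation
(x - mean(x)) / sd(x)

# scale() produces the same values
as.numeric(scale(x))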

Question 1

We know things about z-scores from introductory statistics. The distribution of z-scores has a mean of what?

Hint: If you are stuck, take the mean of the first column in the data set. What is this value approximately equal to?

Question 2

The distribution of z-scores has a standard deviation of what?

Hint: If you are stuck, take the sd of the first column in the data set. What is this value approximately equal to?

This may seem like an odd scale to use for features. However, for shrinkage techniques like ridge regression, this is very useful. Our technique will involve using the numeric value of the coefficients as a penalty, so it is important that all our features are on the same scale.
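To see why a common scale matters when we penalize coefficient size, here is a small simulated example. The variables x.km, x.m, and y below are made up purely for illustration and have nothing to do with the Hitters data. The same relationship, recorded in different units, produces coefficients of very different magnitudes, so a penalty on the size of the coefficients would treat the two versions of the feature very differently.

# A made-up feature recorded in kilometers, and the same feature in meters
set.seed(1)
x.km <- rnorm(100, mean = 10, sd = 2)
x.m  <- x.km * 1000

# A response that depends on the feature
y <- 3 * x.km + rnorm(100)

# Same relationship, but the coefficient is 1000 times smaller in meters,
# so an unscaled penalty would barely shrink it
coef(lm(y ~ x.km))["x.km"]
coef(lm(y ~ x.m))["x.m"]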

Creating Y and the coefficients

If you look at the data set, you will notice that \(Y\) = salary is not in the data set. That is because for today, we want to explore the properties of ridge regression vs. LSLR. To do this, we are actually going to set \(\boldsymbol{\beta}\) ourselves, and we will use that to create \(Y\).

Do we do this with real data? No. However, we do this all the time when we conduct simulation studies. These are experiments we run using the computer in which we know what the results should be and we check to see how well different methods are able to uncover that truth.

Let’s start with creating our coefficients, meaning we need to build the vector \(\boldsymbol{\beta}\). Excluding the intercept, which we set at 150, we will sample the coefficients from a normal distribution with a mean of 5 and a standard deviation of 4.

# Set a random seed 
set.seed(100)

# Create the betas 
betas <- rnorm(ncol(Hitters), mean = 5, sd = 4)

# Add on an intercept
betas <- c(150, betas)

These values are chosen randomly. However, once they are chosen, we will use them to create \(Y\). This means that \(\boldsymbol{\beta}\) captures the relationship between \(Y\) and the features. We will use the linear regression equation to create \(Y\).

\[\mathbf{Y}= \mathbf{X_{D}} \boldsymbol{\beta} + \boldsymbol{\epsilon},\] where \(\epsilon_i \stackrel{iid}{\sim} N(0, 25^2)\), i.e., the errors have mean 0 and standard deviation 25.

# Create Y
Y <- as.matrix(cbind(1, Hitters)) %*% as.matrix(betas) +
      rnorm(nrow(Hitters), 0, 25)

Question 3

Create a summary of \(Y\). What is the average salary? What is the maximum salary? Remember, \(Y\) is on the scale of thousands of U.S. dollars.

At this point we have a \(Y\), we have our features, and we have our vector of parameters \(\boldsymbol{\beta}\).

We are now going to see if using linear regression can give us estimates \(\boldsymbol{\hat{\beta}}\) that are close to the true parameter values \(\boldsymbol{\beta}\).

LSLR: Checking for Independence

Models like LSLR assume that we do not have strong relationships among the features that we will use to predict \(Y\). The model treats each feature as though it were independent. This means that before we use LSLR, it is important to check whether it is reasonable to assume that the features are independent.

One way to check this is to create a correlogram, which is a plot showing the correlation between every pair of features. This isn’t a perfect tool, as it only checks for linear relationships among numeric features, but it is a useful one.

To create a correlogram in R, we use the following.

# Load the library 
library(corrplot)

# Create the plot 
corrplot(cor(Hitters), method = "circle", type = "upper")

The plot you will see is an upper triangular matrix. I specified that in the code using type = "upper". If you prefer a lower triangular matrix, you can use type = "lower".

Each box in the matrix represents the correlation between two features. For instance, the 2nd box on the 1st row represents the correlation between AtBat and Hits. The correlation is represented with a dark blue circle to indicate a strong, positive correlation. The larger the circle and the darker the color, the higher the correlation between two features is. This means that the correlation between AtBat and Hits is close to 1.

You will notice a line of solid blue circles along the diagonal of the plot, as every feature is perfectly correlated with itself.
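If you want the numeric value behind any circle in the plot, you can pull it from the correlation matrix directly. For example, a quick sketch for AtBat and Hits:

# The correlation between AtBat and Hits as a number
cor(Hitters$AtBat, Hitters$Hits)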

Question 4

Based on the plot, what is another pair of features that are highly correlated (other than AtBat and Hits)? There are many, just pick one example.

Question 5

Based on the plot, what is a pair of features that is NOT highly correlated? There are many, just pick one example.

This plot has a lot of different options if you don’t like the circles. See what happens if you change method to method = "number", method = "ellipse", or method = "pie". This website gives a nice overview of how much you can personalize this plot: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram

Question 6

Which version of the plot do you find easiest to read: numbers, circles, ellipses, or the pie chart? Show the plot you prefer and explain in 1 sentence why you prefer it.

Note: There is no correct answer here! It just helps me for future classes to know what you prefer.

Whichever version of the plot you prefer, you should notice a complicated structure of relationships among the features. A lot of features are correlated with other features, which means that we are violating an assumption of linear regression. What does this mean for our ability to estimate the relationship between the features and a response? Let’s explore.

LSLR: Training the Model

Question 7

Build an LSLR model using all the rows in Hitters to predict \(Y\). Show a professionally formatted table of the coefficients you get.

Hint: We can professionally format output using knitr::kable( model$coefficients, col.names = "coefficients")

Because we created the response variable ourselves, we actually know the true value of \(\boldsymbol{\beta}\). This means we can compare our estimated coefficients from LSLR to the true values of the parameters.

Question 8

Create a data frame called Estimates with a column called LSLR and a column called Truth. Fill in the first column with the estimated parameter values from your LSLR model. Fill in the 2nd column with the true values, betas.

Professionally output your result using knitr::kable.

Question 9

Add a column to your Estimates data frame called Difference. You can do this by using the code:

Estimates$Difference <- Estimates$LSLR - Estimates$Truth

How far off is our estimate for the coefficient of cumulative at bats (CAtBat) from the truth?

Question 10

What do you notice about the estimated values of the coefficients versus the true value of the parameters for most of the cumulative features (meaning the ones that have C at the start of their name)? Based on the correlogram we created earlier, why might this be happening?

Right now, our LSLR model is not able to estimate some of the coefficients well. We were expecting that because some of our conditions were violated, but now we actually see the problem in action.

The trick now is to figure out how we might estimate the coefficients differently in order to improve our results.

Ridge Regression

In LSLR, we choose the value of \(\boldsymbol{\hat{\beta}}\) that gives us the smallest value possible for the RSS.

\[RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2\]
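As a quick check of this formula in R, the RSS of a fitted LSLR model can be computed from its residuals. A minimal sketch, assuming your LSLR model from Question 7 is stored in an object called m1 (rename as needed):

# RSS = the sum of the squared residuals from the fitted model
sum(residuals(m1)^2)

# Equivalently, using the formula directly
sum((Y - predict(m1))^2)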

However, when our features are correlated, we see that choosing the value of \(\boldsymbol{\hat{\beta}}\) that makes the RSS its absolute smallest yields estimates that do not capture the true value of \(\boldsymbol{\beta}\) very well.

There are many, many different possible values of \(\boldsymbol{\hat{\beta}}\). Each one yields a potentially different value of the RSS if we choose it. The value of \(\boldsymbol{\hat{\beta}}\) that we got from LSLR yields the smallest possible value of the RSS, but the estimates were not reflective of the true parameters.

Specifically, what we noticed is that many of our coefficient estimates from LSLR are too large in absolute value. For example, the coefficient for CHits (the cumulative number of hits over the season) is supposed to be 1.70, but with LSLR we obtain an estimate of -46.68, a value over 20 times larger in absolute value than the truth!

Because we are noticing that the coefficients are too large, we are going to change how we estimate \(\boldsymbol{\hat{\beta}}\). Instead of choosing the value of \(\boldsymbol{\hat{\beta}}\) that minimizes the RSS, we will choose the value that minimizes:

\[RSS + \lambda \sum_{j} \hat{\beta}_j^2.\]

This means we are now choosing estimates of \(\boldsymbol{\hat{\beta}}\) based on two components:

\[\underbrace{RSS}_{\text{Data Component}} + \underbrace{\lambda \sum_{j} \hat{\beta}_j^2}_{\text{Penalty Component}}\] The penalty component is called a shrinkage penalty, because it gets larger as the sum of the squared coefficients gets larger.

Because we want to minimize this entire term (the data component + the penalty component), we can encourage the choice of coefficients that have smaller values than we get without the penalty component. Since the problem we have currently with LSLR is that our coefficients are too big, this is a reasonable choice! Estimating the coefficients in this way is called using ridge regression.
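To make the two components concrete, here is a minimal sketch of the quantity ridge regression minimizes, written as an R function. Nothing here is required for the lab; the function name ridge_objective and its arguments are just for illustration.

# The ridge criterion for a candidate coefficient vector b:
#   RSS (data component) + lambda * (sum of squared slopes) (penalty component)
# X is a design matrix whose first column is all 1s, so b[1] is the
# intercept and is left out of the penalty.
ridge_objective <- function(b, X, Y, lambda){
  rss     <- sum((Y - X %*% b)^2)      # data component
  penalty <- lambda * sum(b[-1]^2)     # shrinkage penalty
  rss + penalty
}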

Think about it like a budget. Our goal is to make the RSS small. In LSLR, our budget is unlimited - we can choose any coefficients we want to make the RSS as small as possible. However, in Ridge, we have a budget. We still want to make the RSS small, but we are restricted in how large we can let our coefficients be in order to achieve that goal.

The key to ridge regression is choosing the tuning parameter \(\lambda\), which is a scalar where \(\lambda \geq 0\). This tuning parameter allows us to control how much shrinkage we want on the coefficient estimates. In other words, it allows us to control the balance between the data component and the penalty component; this controls the budget. A larger \(\lambda\) means a smaller budget.

Question 11

What values of \(\lambda\) result in smaller coefficient estimates: large choices of \(\lambda\) or smaller choices of \(\lambda\)?

Ridge in R

Now that we have the idea in place, let’s try it in R. Ridge regression requires the glmnet package, so loading that is the first step.

library(glmnet)

To build a ridge regression model in R with a specific value of \(\lambda\), we can use the following:

# Train Ridge
ridge.model <- glmnet(as.matrix(Hitters), Y, alpha = 0, lambda = )

The inputs of this function are:

    1. The design matrix (R automatically adds a column of 1s)
    2. The response variable
    3. A flag that tells R to use ridge regression. Remember that I told you we had two more techniques to learn? We will change this to get our code from ridge to the next technique.
    4. The tuning parameter value
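As a concrete illustration of the template above, here is what a call looks like with a tuning parameter plugged in. The value 1 is arbitrary and purely for illustration; you will choose your own values in the questions below.

# Example only: ridge regression with lambda = 1 (an arbitrary choice)
ridge.example <- glmnet(as.matrix(Hitters), Y, alpha = 0, lambda = 1)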

Once we have run ridge regression, we would like to get some idea of the values of \(\boldsymbol{\hat{\beta}_{Ridge}}\). To get that, we use

# Look at the coefficients 
coefficients(ridge.model)

The claim right now is that when we use ridge regression, we can use the tuning parameter \(\lambda\) to encourage smaller coefficients. Let’s test that.

Question 12

Ridge regression and LSLR are equivalent when \(\lambda\) is equal to what?

Note: If you try to use glmnet to fit LSLR, you need to add thresh = 1e-14 to the end of your glmnet code in order to get the same results as lm. This is because glmnet estimates the coefficients with an iterative numerical algorithm (after internally standardizing \(Y\)) rather than solving for them exactly, and thresh controls how precise that algorithm must be before it stops. Making the threshold very small forces glmnet to agree with lm to many decimal places. You ONLY have to do this when you are trying to make glmnet fit LSLR, not when you are doing ridge in general.

Question 13

Choose any value of \(\lambda\) that is bigger than the value you specified in Question 12. Train a ridge regression model with that choice of \(\lambda\). State the value of \(\lambda\) that you choose and state the sum of the squared coefficients (excluding the intercept) from your trained model.

Hint: You can use this code to find this, with model substituted for the name of your trained ridge regression model.

sum(coefficients( model )[-1]^2)

Question 14

What is the sum of squared coefficients (minus the intercept) for LSLR? Is this larger or smaller than what you got in Question 13?

Why do we exclude the intercept? With all of our features scaled to have a mean of 0 and a standard deviation of 1, the intercept represents the salary we predict for a baseball player with a value of 0 (meaning equal to the mean) for all features. The predicted salary for an average player is not something we can or want to shrink; the average is the average. This means the penalization in ridge is focused on the features, not the intercept.

We can include the intercept in our calculations, but it is typically easier to see the change in our coefficient sizes if we don’t.

Question 15

State the estimate of \(\hat{\beta}_0\) for LSLR and for ridge. What is the sum of squared coefficients (INCLUDING the intercept) for LSLR and for your ridge model? Which is larger?

This example has verified that by using ridge regression we can get shrinkage of our coefficients. However, we do not want to shrink our coefficients so much that we choose values of the coefficients which do not reflect the relationship between the features and \(Y\). Capturing that relationship so we can predict salary well is the whole point. Since the shrinkage is controlled by \(\lambda\), this means that we need to be very careful in how we choose \(\lambda\).

Ridge: Choosing The Tuning Parameter

We want to choose a value of \(\lambda\) that shrinks the coefficients, but we do not want to shrink them so much that the relationship between the features and \(Y\) is no longer represented by those coefficients. To help us choose an appropriate value of \(\lambda\), we are going to return to a concept that is very familiar to us now - cross validation.

The process begins as we did when choosing the tuning parameter \(K\) in KNN - we will try a whole bunch of different values of the tuning parameter \(\lambda\). For each value of \(\lambda\), we use cross validation (specifically 10-fold CV today) to estimate the test RMSE we would get with that choice of \(\lambda\).

If we choose a value of \(\lambda\) that starts to break the relationship between the features and the response, the test RMSE will increase. This means the test RMSE is a good tool to help us select \(\lambda\). We choose the value of \(\lambda\) that gives us the smallest validation RMSE.

In summary, we start with a list of possible choices for \(\lambda\). For each \(\lambda\) on the list, we

    1. Solve for the ridge estimator \(\boldsymbol{\hat{\beta}}_{Ridge}\).
    2. Use 10-fold CV to estimate the test RMSE with that choice of \(\boldsymbol{\hat{\beta}}_{Ridge}\).

We then ultimately select the value of \(\lambda\) for which the validation RMSE is the smallest.

The Code

Now that we have explained the process, we are ready for code. You are probably thinking that a for loop would be a good approach, and you would be right!! However, there is actually a built-in function in R that does this process for us, so we don’t have to write a for loop today (mini celebration!!!!).

To run the CV process in R, we use the function cv.glmnet.

# Set a random seed 
set.seed(100)

# Run 10 fold-CV for a variety of choices of lambda
cv.ridge.mod <- cv.glmnet(as.matrix(Hitters), Y, alpha = 0, lambda = )

Question 16

Why do we need to set a random seed here?

The only missing piece in the code above is the list of \(\lambda\) values we will consider. For today, we will use a sequence of values from 0 to 50 in intervals of .05. To create such a sequence in R, we use the code seq.

seq( from = , to = , by = )

There are three inputs to this function.

    1. from =: Where do we start the sequence? In other words, what is the smallest value of \(\lambda\) we want to consider?
    2. to =: Where do we want to end the sequence? In other words, what is the largest value of \(\lambda\) we want to consider?
    3. by =: In what intervals do we want the sequence to progress? If we want 1, 2, 3, we use by = 1. If we want 1, 1.5, 2, 2.5, we use by = .5.
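Here is a tiny example of seq() in action, using arbitrary values that have nothing to do with the lab, just to show how the three inputs work:

# A sequence from 1 to 3 in steps of 0.5
seq(from = 1, to = 3, by = 0.5)
# Result: 1.0 1.5 2.0 2.5 3.0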

Question 17

Finish the code above Question 16 to run 10-fold CV. Show your code as the answer to this question; there won’t be any other output.

Visualizing the CV results

Once we have run 10-fold CV, we want to look at our results. To do this, copy and paste the following into a chunk and press play:

ridgePlot <- function(ridge.mod, metric, title){
  suppressMessages(library(ggplot2))
  
  smallestLambda <- ridge.mod$lambda[which.min(ridge.mod$cvm)] 
  
  if(metric == "MSE"){
    g1 <- ggplot(data.frame(ridge.mod$lambda),
                 aes(x = ridge.mod$lambda, y = ridge.mod$cvm)) +
      geom_point() +
      geom_vline(xintercept = smallestLambda, col = "blue", lty = 2) +
      labs(caption = paste("Validation MSE values for Different Tuning Parameters. Smallest MSE at lambda = ", smallestLambda),
           title = title, y = "Validation MSE", x = "Tuning Parameter")
  }
  
  if(metric == "RMSE"){
    g1 <- ggplot(data.frame(ridge.mod$lambda),
                 aes(x = ridge.mod$lambda, y = sqrt(ridge.mod$cvm))) +
      geom_point() +
      geom_vline(xintercept = smallestLambda, col = "blue", lty = 2) +
      labs(caption = paste("Validation RMSE values for Different Tuning Parameters. Smallest RMSE at lambda = ", smallestLambda),
           title = title, y = "Validation RMSE", x = "Tuning Parameter")
  }
  
  g1
}

This teaches R a function called ridgePlot that I have written for you. To use the code, you need

ridgePlot(cv.ridge.mod, metric = , title = )

where metric = "RMSE" will put the validation RMSE on the y axis and metric = "MSE" will put the validation MSE on the y-axis. You can (and should!) also add a title.

Question 18

Plot your results using the code above, and show your result. What choice of \(\lambda\) gives the best value of the validation RMSE?

We can also print out the choice of \(\lambda\) that gives the best value of the validation RMSE.

cv.ridge.mod$lambda.min

Question 19

Why might it be helpful to look at the plot in addition to using cv.ridge.mod$lambda.min before choosing \(\lambda\)? In other words, if we can print out the value, what information could we get from the plot that might be helpful?

Using our choice of tuning parameter, we can then see what value of validation RMSE we would obtain using ridge regression.

min(sqrt(cv.ridge.mod$cvm))

cvm stands for the cross-validation metric, which in this case is the validation MSE. We take the square root to get the validation RMSE.
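If you want to see the raw results, the cv.glmnet output stores one validation MSE for each value of \(\lambda\) it tried, aligned with the vector of \(\lambda\) values. A quick sketch:

# The first few lambda values tried and their validation MSEs
head(cv.ridge.mod$lambda)
head(cv.ridge.mod$cvm)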

Question 20

Using your choice of \(\lambda\), what is the estimated test RMSE?

Question 21

We can use 10-fold CV to find the validation RMSE for LSLR using:

sqrt(cv.ridge.mod$cvm[which(cv.ridge.mod$lambda==0)])

Is this value larger or smaller than what we got using ridge?

What we notice is that by changing how we estimate the coefficients, we have not harmed our ability to predict salary!! The LSLR estimator has some beautiful mathematical properties when the conditions are all satisfied, but it is not the only approach we can use.

Training the Final Ridge Regression Model

Now that we have chosen \(\lambda\), we can train a ridge regression model with that choice of tuning parameter. To do this, fill in the code below.

# Train Ridge
ridge.final <- glmnet(as.matrix(Hitters), Y, alpha = 0, lambda = )

Question 22

Using the code above as a template, train your ridge regression model.

Remember the data frame called Estimates we created near the beginning of the lab? Add your ridge estimates to this data frame.

Estimates$Ridge <- matrix(coefficients(ridge.final))[,1]  

Show the resultant data frame as the answer to this question. Make sure you have formatted it using knitr::kable().

Now, the order in the table right now is a little odd. We probably want the truth first, then the LSLR estimates, and then the ridge estimates. We also might want a column showing the difference between the truth and ridge, as well as between truth and LSLR.

We’ve done a lot of work today, so I’ll do this part for you:

# Note: this assumes your LSLR model from Question 7 is stored as m1;
# change the name if yours is called something else.
Estimates <- data.frame("Truth" = betas)
Estimates$LSLR <- m1$coefficients
Estimates$Ridge <- matrix(coefficients(ridge.final))[,1]
Estimates$LSLRvsTruth <- Estimates$LSLR - Estimates$Truth
Estimates$RidgevsTruth <- Estimates$Ridge - Estimates$Truth
Estimates$CloserToTruth <- ifelse( abs(Estimates$LSLRvsTruth) < abs(Estimates$RidgevsTruth),
                                   "LSLR", "Ridge")
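If you would like a quick way to tally the last column, here is one sketch (not the only way to do it):

# Count how many coefficients each method estimated more accurately
table(Estimates$CloserToTruth)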

Question 23

How many of the coefficients are estimated more accurately using ridge rather than LSLR?

Turning in your assignment

When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or html document to Canvas. No other formats will be accepted. Make sure you look through the final document to make sure everything has knit correctly.

Creative Commons License
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 October 12.