Introduction

The purpose of this document is to demonstrate three types of regularized regression: Ridge Regression, Lasso, and Elastic Net. While the glmnet package offers some tools for visualizing the results of these techniques, the coefplot package contains several visualization tools that are a bit easier to interpret.

Baseball Data

In order to illustrate regularized regression and the coefplot package, we need some data on which to perform variable selection. The data for this exercise will be Major League Baseball data from the 1986 and 1987 seasons. We will have a number of predictor variables, for example Hits, Runs, RBI, Home Runs, and several others. From this data we want to build a model that can help us predict (or justify) a player's salary based on performance.

(Reference: James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York)

rm(list=ls())
set.seed(100)
library(ISLR)
names(Hitters)
 [1] "AtBat"     "Hits"      "HmRun"     "Runs"      "RBI"       "Walks"     "Years"     "CAtBat"    "CHits"     "CHmRun"    "CRuns"    
[12] "CRBI"      "CWalks"    "League"    "Division"  "PutOuts"   "Assists"   "Errors"    "Salary"    "NewLeague"
Hitters<-na.omit(Hitters)
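
As a quick sanity check (a small sketch, not part of the original post), we can confirm the size of the data after dropping the players with missing salaries:

dim(Hitters)               # rows and columns remaining after na.omit
sum(is.na(Hitters$Salary)) # should be 0 once incomplete rows are removed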

We will want to do some comparison of models, so we will split the data into testing and training sets.

## 75% of the sample size
smp_size <- floor(0.75 * nrow(Hitters))

train_ind <- sample(seq_len(nrow(Hitters)), size = smp_size)

train <- Hitters[train_ind, ]
test <- Hitters[-train_ind, ]

Basic Model (No Variable Selection)

Now that we have some data, let's review what we want to do with regularized variable selection. In a multiple regression problem, we have multiple predictors and a target variable. In R, it is relatively easy to perform multiple regression on this data with the lm function. The result is a set of coefficients that we apply to each predictor. The model summary also reports a t-test for each coefficient, which we can use to judge the statistical significance of each predictor in the overall model.

mdl1 <- lm(Salary~., data=train)
summary(mdl1)

Call:
lm(formula = Salary ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-705.79 -150.74  -24.62  125.17 1063.50 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  33.57152   93.09123   0.361 0.718806    
AtBat        -2.41967    0.65567  -3.690 0.000298 ***
Hits         11.08951    2.60137   4.263 3.28e-05 ***
HmRun        -3.55591    6.29651  -0.565 0.572963    
Runs         -4.52955    2.96590  -1.527 0.128493    
RBI          -1.52156    2.62867  -0.579 0.563438    
Walks         7.05505    1.83780   3.839 0.000172 ***
Years        -3.40822   12.19201  -0.280 0.780153    
CAtBat        0.05533    0.13555   0.408 0.683618    
CHits        -0.80779    0.71644  -1.128 0.261056    
CHmRun        0.98164    1.73619   0.565 0.572517    
CRuns         2.05346    0.80316   2.557 0.011405 *  
CRBI          0.64249    0.72080   0.891 0.373949    
CWalks       -1.11918    0.33262  -3.365 0.000939 ***
LeagueN     116.04168   83.58204   1.388 0.166773    
DivisionW   -69.47276   41.41353  -1.678 0.095202 .  
PutOuts       0.32933    0.07670   4.294 2.89e-05 ***
Assists       0.10522    0.21140   0.498 0.619270    
Errors        1.46384    4.13709   0.354 0.723886    
NewLeagueN  -89.55925   83.87487  -1.068 0.287077    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 272 on 177 degrees of freedom
Multiple R-squared:  0.6702,    Adjusted R-squared:  0.6348 
F-statistic: 18.93 on 19 and 177 DF,  p-value: < 2.2e-16
mdl1.pred <- predict(mdl1, newdata=test)
lm_mse <- mean((mdl1.pred - test$Salary)^2)

In the above output, we can see the various coefficients and their statistical significance.
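
As a small aside (a sketch, not part of the original analysis), the predictors that the full model flags as significant at the 0.05 level can be pulled out of the summary programmatically:

# Coefficient table from the fitted model, filtered on the p-value column
coefs <- summary(mdl1)$coefficients
coefs[coefs[, "Pr(>|t|)"] < 0.05, ]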

Bias-Variance Tradeoff

More predictor variables in a model will typically result in lower bias, but higher variance in the model's predictions. A common way to think about this is in terms of the graphic below. A model with many predictors would likely land in the upper right-hand corner, while a model with few predictors lands in the lower left. When making predictions, it is often preferable to accept some bias in exchange for lower variance.

[Figure: Bias and Variance]
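
To see the tradeoff with this data, here is a rough sketch (an illustration, not part of the original analysis) that adds the predictors one at a time and tracks training and test MSE; training error keeps falling as predictors are added, while test error often levels off or climbs as variance grows:

preds <- setdiff(names(train), "Salary")
errs <- sapply(seq_along(preds), function(k) {
  # Fit a model using only the first k predictors
  fit <- lm(reformulate(preds[1:k], response = "Salary"), data = train)
  c(train = mean(resid(fit)^2),
    test  = mean((predict(fit, newdata = test) - test$Salary)^2))
})
matplot(seq_along(preds), t(errs), type = "l", lty = 1, col = c("black", "red"),
        xlab = "Number of predictors", ylab = "MSE")
legend("topright", legend = c("train", "test"), col = c("black", "red"), lty = 1)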

The baseball salary model that we created above has several predictors. Its mean squared error (MSE) is \(3.3566217 \times 10^{5}\). Next, we will look at creating models using regularized variable selection techniques. We will then compare our regularized models with the full model above.

Mathematical Formulation of Regularization Regression Techniques

Regularized regression techniques include Lasso, Ridge Regression, and Elastic Net. Elastic Net is a generalization of the other two. Mathematically, these regression techniques all look like this:

\[ \min_{\beta} \sum_{i=1}^n(y_i-(\beta_0+\beta_1x_{1,i}+\beta_2x_{2,i}+...+\beta_px_{p,i}))^2 \] Subject to:

\[ \alpha\sum_{j=1}^p|\beta_j| + (1-\alpha)\sum_{j=1}^p\beta_j^2\leq\tau \]

We look for the coefficient values \(\beta\) that minimize the first expression, subject to the second constraint. Equivalently, the constraint can be folded into the objective as a penalty term \(\lambda\left(\alpha\sum_{j=1}^p|\beta_j| + (1-\alpha)\sum_{j=1}^p\beta_j^2\right)\), which is (up to scaling constants) the form that glmnet optimizes; each budget \(\tau\) corresponds to a penalty weight \(\lambda\), and it is \(\lambda\) that we will tune.

In effect, we are limiting either the sum of the absolute values of the coefficients or the sum of the squared coefficients. In the above, if we set \(\alpha=0\), the absolute-value term goes away and we are left with Ridge Regression. Similarly, if we set \(\alpha=1\), the sum-of-squares term goes away and we are left with Lasso. If we choose \(\alpha\) somewhere in between, we have Elastic Net. One of the more subtle points here is that Ridge Regression does not actually eliminate variables, while Lasso will shrink some of the coefficients all the way to 0. This is a consequence of the \(L_1\) and \(L_2\) penalties used in Lasso and Ridge respectively: the corners of the \(L_1\) constraint region allow coefficients to land exactly on zero, whereas the smooth \(L_2\) region only shrinks them toward zero.
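
As a quick illustration of that last point (a sketch, not part of the original analysis; it uses the glmnet package that is loaded later in this post), we can fit a ridge path and a lasso path and look at df, the number of nonzero coefficients at each \(\lambda\) along the path:

library(glmnet)
x <- model.matrix(Salary ~ ., train)[, -1]
y <- train$Salary
ridge_fit <- glmnet(x, y, alpha = 0)   # L2 penalty only
lasso_fit <- glmnet(x, y, alpha = 1)   # L1 penalty only
# df is the number of nonzero coefficients at each lambda on the path
range(ridge_fit$df)   # ridge shrinks coefficients but does not remove them
range(lasso_fit$df)   # lasso drops more and more predictors as lambda grows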

Determining \(\alpha\) and \(\lambda\)

So, how do we determine the \(\alpha\) and \(\lambda\) values used in Elastic Net?

Essentially, \(\alpha\) is like a switch that can turn Elastic Net into Ridge, Lasso, or a combination of the two. We can simply choose \(\alpha\). Once we have decided on an approach, we are left with finding an optimal value for \(\lambda\).

Picking \(\lambda\) can be done using cross validation. In this approach, we divide the training data into a number of folds; we hold out one fold, fit the model on the remaining data, and evaluate it on the held-out fold, as the following diagram illustrates. Repeating this over a range of \(\lambda\) values lets us determine the best \(\lambda\) based on some accuracy metric, typically the mean squared error on the held-out folds. Note: as we perform cross validation, we do so entirely within the training set. We keep the test data separate until we wish to compare all our models.

[Figure: 10 Fold Cross Validation]
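
If we do want to let the data pick \(\alpha\) as well, one simple approach (an illustrative sketch, not part of the original analysis; the alpha grid and fold assignment below are arbitrary choices) is to run cv.glmnet over a small grid of \(\alpha\) values while reusing the same folds, and keep the combination with the lowest cross-validated error:

library(glmnet)
x <- model.matrix(Salary ~ ., train)[, -1]
y <- train$Salary
set.seed(100)
foldid <- sample(rep(1:10, length.out = nrow(x)))  # same folds for every alpha
alphas <- c(0, 0.25, 0.5, 0.75, 1)
cv_results <- sapply(alphas, function(a) {
  cv <- cv.glmnet(x, y, alpha = a, foldid = foldid, family = "gaussian")
  c(alpha = a, lambda.min = cv$lambda.min, cv.error = min(cv$cvm))
})
t(cv_results)  # one row per alpha: its best lambda and cross-validated error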

Ridge Regression in R

The following code performs a ridge regression on the Hitters data training set.

library(glmnet) 
library(coefplot)
set.seed(100)
std <- TRUE  # Standardize predictors; the penalty is sensitive to the scale of each variable
Y <- train$Salary
X <- model.matrix(Salary~., train)[,-1]
# 10-fold CV to find the optimal lambda
ridge.cv=cv.glmnet(X, Y, alpha=0, type.measure="deviance", family="gaussian", standardize=std, nfolds=10)
## Fit ridge model with 100 values for lambda
ridge_mdl = glmnet(X, Y, alpha=0, nlambda=100, standardize=std, family="gaussian")
## Extract coefficients at optimal lambda
coef(ridge_mdl, s=ridge.cv$lambda.min)
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) -31.031994595
AtBat        -0.503926301
Hits          3.150077786
HmRun        -7.412961727
Runs          0.719601377
RBI           0.505273525
Walks         3.076459773
Years        -4.587219638
CAtBat        0.003743458
CHits         0.096606473
CHmRun        1.567366492
CRuns         0.271540296
CRBI          0.268394560
CWalks       -0.291011895
LeagueN      87.133238411
DivisionW   -81.542304442
PutOuts       0.309545878
Assists       0.058823277
Errors       -0.317446290
NewLeagueN  -73.521157177
plot(ridge_mdl, xvar="lambda", lwd=2, label=T)
abline(v=log(ridge.cv$lambda.min), col="blue", lty=2)

The resulting value of \(\lambda\) for this variable selection method is 29.9798498.

To get a sense of how the code settled on this value of lambda, we can take a look at the following plots. The first shows how the coefficients change as the log of lambda changes. The second shows the coefficients sorted by magnitude at the optimal lambda value.

coefpath(ridge_mdl)

coefplot(ridge_mdl, lambda=ridge.cv$lambda.min, sort="magnitude")
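
The cross-validation object itself can also be examined directly (a quick sketch): calling plot() on the cv.glmnet result shows the cross-validated error as a function of log(\(\lambda\)), and lambda.1se gives a more conservative choice, the largest \(\lambda\) whose error is within one standard error of the minimum.

plot(ridge.cv)   # CV error curve; vertical lines mark lambda.min and lambda.1se
c(lambda.min = ridge.cv$lambda.min, lambda.1se = ridge.cv$lambda.1se)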

Lasso Regression in R

We can repeat the same steps below for Lasso regression. This time we set the value of \(\alpha\) to 1 to force the model to use Lasso.

# 10-fold CV to find the optimal lambda 
lasso.cv=cv.glmnet(X,Y,alpha=1, type.measure="deviance", family="gaussian", standardize=std, nfolds=10)
## Fit lasso model with 100 values for lambda
lasso_mdl = glmnet(X,Y,alpha=1,family="gaussian", standardize=std, nlambda=100)
## Extract coefficients at optimal lambda
coef(lasso_mdl,s=lasso.cv$lambda.min)
20 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) -19.58957875
AtBat        -1.44032127
Hits          6.50025004
HmRun        -7.88008124
Runs          .         
RBI           .         
Walks         4.17171810
Years        -0.06029097
CAtBat        .         
CHits         .         
CHmRun        2.30642850
CRuns         0.67656267
CRBI          0.05039562
CWalks       -0.54228215
LeagueN      76.47735074
DivisionW   -69.56215881
PutOuts       0.32083857
Assists       0.08607745
Errors        0.13554310
NewLeagueN  -52.41700858
plot(lasso_mdl, xvar="lambda", lwd=2, label=T)
abline(v=log(lasso.cv$lambda.min), col="blue", lty=2)

The vertical line in the above plot indicates the optimal value for lambda.
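
Because the lasso drives some coefficients exactly to zero, it can be useful to list which predictors actually remain in the model at the optimal \(\lambda\) (a small sketch, not part of the original analysis):

lasso_coefs <- as.matrix(coef(lasso_mdl, s = lasso.cv$lambda.min))
# Names of the predictors with nonzero coefficients (intercept excluded)
setdiff(rownames(lasso_coefs)[lasso_coefs[, 1] != 0], "(Intercept)")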

The following two plots are similar to the ones we viewed above: a slightly nicer display of the coefficient “path”, and the coefficients sorted by magnitude at the optimal lambda.

coefpath(lasso_mdl)

coefplot(lasso_mdl, lambda=lasso.cv$lambda.min, sort="magnitude")

Elastic Net Regression in R

Finally, we will perform Elastic Net regression on the data. We choose \(\alpha=0.5\) in this case.

# 10-fold CV to find the optimal lambda 
enet.cv=cv.glmnet(X,Y,alpha=0.5, type.measure="deviance", family="gaussian", standardize=std, nfolds=10)
## Fit elastic net model with 100 values for lambda
enet_mdl = glmnet(X,Y,alpha=0.5,standardize=std,nlambda=100)
## Extract coefficients at optimal lambda
coef(enet_mdl,s=enet.cv$lambda.min)
20 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept) -15.37942454
AtBat        -1.31472788
Hits          6.06170510
HmRun        -7.85800229
Runs          .         
RBI           .         
Walks         4.16726809
Years        -1.65056351
CAtBat        .         
CHits         .         
CHmRun        2.10626204
CRuns         0.63563136
CRBI          0.15183129
CWalks       -0.53395778
LeagueN      85.84990831
DivisionW   -72.62437940
PutOuts       0.32095315
Assists       0.08592847
Errors        0.01992083
NewLeagueN  -64.54422786
plot(enet_mdl, xvar="lambda", lwd=2, label=T)
abline(v=log(enet.cv$lambda.min), col="blue", lty=2)

As before, the vertical dashed line indicates the optimal value for \(\lambda\).

The following two plots show the coefficients sorted by magnitude at the optimal value of \(\lambda\), along with a slightly nicer display of the coefficient paths.

coefplot(enet_mdl, lambda=enet.cv$lambda.min, sort="magnitude")

coefpath(enet_mdl)

Comparing the Resulting Final Models

The following chart and table compare the results of the three regularized models and the full model:

tmatrix <- model.matrix(Salary~.,test)[,-1]
ridge.pred = predict(ridge_mdl, s=ridge.cv$lambda.min, newx=tmatrix)
ridge_mse = mean((ridge.pred-test$Salary)^2)

lasso.pred = predict(lasso_mdl, s=lasso.cv$lambda.min, newx=tmatrix)
lasso_mse = mean((lasso.pred-test$Salary)^2)

enet.pred = predict(enet_mdl, s=enet.cv$lambda.min, newx=tmatrix)
enet_mse = mean((enet.pred-test$Salary)^2)



barplot( c(ridge_mse, lasso_mse, enet_mse, lm_mse), main="Mean Squared Error", names.arg=c('Ridge','Lasso','Elastic Net','Full'))

| Method | \(\lambda\) | MSE |
|---|---|---|
| Ridge | 29.9798498 | \(1.9084335 \times 10^{5}\) |
| Elastic Net | 4.7517036 | \(1.9444812 \times 10^{5}\) |
| Lasso | 3.1407363 | \(1.9503638 \times 10^{5}\) |
| Full Model | na | \(3.3566217 \times 10^{5}\) |
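
For a closer look than the MSE comparison alone, the sketch below (not part of the original post) lines up the coefficients that each regularized fit settles on at its own optimal \(\lambda\):

coef_compare <- cbind(
  as.matrix(coef(ridge_mdl, s = ridge.cv$lambda.min)),
  as.matrix(coef(lasso_mdl, s = lasso.cv$lambda.min)),
  as.matrix(coef(enet_mdl,  s = enet.cv$lambda.min))
)
colnames(coef_compare) <- c("Ridge", "Lasso", "Elastic Net")
round(coef_compare, 3)  # side-by-side coefficients; zeros show which variables were dropped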

Conclusion

Regularized variable selection techniques provide a way to manage the bias-variance tradeoff in multiple regression problems. Ridge Regression, Lasso, and Elastic Net are forms of regularized variable selection, and they are mathematically interrelated. The coefplot package contains several useful tools for visualizing the results of regularized regression.

Thanks for reading this post.
