September 19, 2014

$$\DeclareMathOperator*{\argmin}{arg\,min}$$

## Statistical Learning Group (SLG)

First of three statistical learning topics

• Today: Big picture talk
• Level of technicality will vary
• Sep. 26: Joshua Day - Comparison of Regularization Penalties
• Oct. 3: ?? - ??

• How can the SLG be improved?
• How can I be a more effective presenter?

## Outline

1. Review & Notation
2. Motivation
3. Regularized Regression
4. Implementation

## Supervised Learning

Recall from Justin's talk:

• Both inputs (features or predictors) & an output (response) are observed
• Want to build a learner (model) to understand the relationship between the two
• Often interested in prediction

## Notation

• $${y}_i$$: output or response variable
• $$i = 1,...,n$$
• $$x_{ij}$$: input, predictor, or feature variable
• $$j = 1,...,p$$

Matrix notation

$$\boldsymbol{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} \quad$$ and $$\quad \boldsymbol{X} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}$$

## Predictor Standardization

For regularized regression, predictors should be standardized before estimation

• Done using the classic "z-score" formula
• Puts each predictor on the same scale
• Mean of 0 and variance of 1
• Importance will be more clear later
• Use $$\hat{\beta}_0 = \bar{y}$$ and estimate the other coefficients
• Can transform back to original scale after estimation
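
A minimal sketch of this preprocessing in R (hypothetical predictor matrix `X` and response vector `y`; `scale()` performs the z-score computation):

```r
# Standardize each predictor: mean 0, variance 1
X_std <- scale(X)          # keeps the column means/SDs as attributes
y_bar <- mean(y)           # intercept estimate when predictors are standardized

# After estimating coefficients on the standardized scale (beta_std, hypothetical),
# they can be mapped back to the original scale:
# beta_orig <- beta_std / attr(X_std, "scaled:scale")
```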

## Estimation Goals

Given $$\boldsymbol{X}$$, find a function, $$f(\boldsymbol{X})$$, to model or predict $$Y$$

• Loss function, $$L(y, f(\boldsymbol{X}))$$
• Squared error loss is most common

$L(y, f(\boldsymbol{X})) = (y - f(\boldsymbol{X}))^2$

• Choose $$f$$ by minimizing the expected loss $E[(Y - f(\boldsymbol{X}))^2]$
• $$\Rightarrow f = E[Y | \boldsymbol{X} = \boldsymbol{x}]$$
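
A short justification of why the conditional mean is the minimizer: writing $$\mu(\boldsymbol{x}) = E[Y \mid \boldsymbol{X} = \boldsymbol{x}]$$ and adding and subtracting $$\mu(\boldsymbol{x})$$, the cross term has conditional mean zero, so for any $$f$$

$\begin{aligned} E[(Y - f(\boldsymbol{x}))^2 \mid \boldsymbol{X} = \boldsymbol{x}] &= E[(Y - \mu(\boldsymbol{x}))^2 \mid \boldsymbol{X} = \boldsymbol{x}] + (\mu(\boldsymbol{x}) - f(\boldsymbol{x}))^2 \\ &\ge E[(Y - \mu(\boldsymbol{x}))^2 \mid \boldsymbol{X} = \boldsymbol{x}] \end{aligned}$

with equality only when $$f(\boldsymbol{x}) = \mu(\boldsymbol{x})$$.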

## Linear Regression

Assume $$E[Y | \boldsymbol{X} = \boldsymbol{x}]$$ is a linear function, $$\displaystyle \sum_{j=1}^{p} x_{ij} \beta_j = \boldsymbol{x}_i'\boldsymbol{\beta}$$

• $$Y_i = \beta_0 + x_{i1}\beta_1 + x_{i2}\beta_2 + \dots + x_{ip}\beta_p + \epsilon_i$$

• Estimate $$\boldsymbol{\beta}$$ using ordinary least squares (OLS) $\hat{\boldsymbol{\beta}} = \argmin_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 \\ = \argmin_{\boldsymbol{\beta}} \| \boldsymbol{y} - \boldsymbol{X}\boldsymbol{\beta} \|^2_2$ $\Rightarrow \hat{\boldsymbol{\beta}}^{OLS} = (\boldsymbol{X}'\boldsymbol{X})^{-1} \boldsymbol{X}'\boldsymbol{y} \Rightarrow \hat{f}(\boldsymbol{x}_i) = \boldsymbol{x}_i'\hat{\boldsymbol{\beta}}$
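
A quick sketch of computing $$\hat{\boldsymbol{\beta}}^{OLS}$$ in R (hypothetical `X` and `y`; here `X` contains only the standardized predictors, with no intercept column):

```r
# Normal equations: (X'X)^{-1} X'y
beta_ols <- solve(crossprod(X), crossprod(X, y))

# Fitted values f_hat(x_i) = x_i' beta_hat
y_hat <- X %*% beta_ols

# The same fit via lm(), which is the numerically preferred route
fit_ols <- lm(y ~ X)
```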

## Linear Regression Shortcomings

Despite its wide use and elegant theory, linear regression has its shortcomings

• Prediction accuracy
• Often can be improved upon
• Model interpretability
• Linear model does not automatically do variable selection

## Assessing Prediction Accuracy

Given a new input, $$\boldsymbol{x}_0$$, how do we assess our prediction $$\hat{f}(\boldsymbol{x}_0)$$?

• Expected Prediction Error (EPE)
• \begin{aligned} EPE(\boldsymbol{x}_0) &= E[(Y_0 - \hat{f}(\boldsymbol{x}_0))^2] \\ &= \text{Var}(\epsilon) + \text{Var}(\hat{f}(\boldsymbol{x}_0)) + \text{Bias}(\hat{f}(\boldsymbol{x}_0))^2 \\ &= \text{Var}(\epsilon) + MSE(\hat{f}(\boldsymbol{x}_0)) \end{aligned}
• $$\text{Var}(\epsilon)$$: irreducible error variance
• $$\text{Var}(\hat{f}(\boldsymbol{x}_0))$$: sample-to-sample variability of $$\hat{f}(\boldsymbol{x}_0)$$
• $$\text{Bias}(\hat{f}(\boldsymbol{x}_0))$$: average difference of $$\hat{f}(\boldsymbol{x}_0)$$ & $$f(\boldsymbol{x}_0)$$

## Estimating Prediction Error

A common approach to estimating prediction error

• Randomly split data into "training" and "test" sets
• Test set has $$m$$ observations
• Calculate $$\hat{f}$$ using training data
• Estimate prediction error using test set MSE $\widehat{MSE}(\hat{f}) = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{f}(x_i))^2$
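
A sketch of this train/test approach in R, assuming a hypothetical data frame `dat` with a response column `y` (an OLS fit stands in for whatever learner is being evaluated):

```r
set.seed(1)                                  # reproducible split
n     <- nrow(dat)
test  <- sample(n, size = floor(0.25 * n))   # hold out ~25% as the test set
train <- setdiff(seq_len(n), test)

fit  <- lm(y ~ ., data = dat[train, ])       # fit using training data only
pred <- predict(fit, newdata = dat[test, ])

mse_hat <- mean((dat$y[test] - pred)^2)      # test-set estimate of prediction error
```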

Why?

• Want our model/predictions to perform well with new/future data

Source: Hastie, Tibshirani, & Friedman (2009)

## Improving Prediction Accuracy

If $$f(x)$$ is approximately linear, $$\hat{f}$$ will have low bias but possibly high variance, for example with

• Correlated predictors
• $$p \approx n$$ or high-dimensional setting where $$p > n$$

Want to minimize $MSE(\hat{f}(x)) = \text{Var}(\hat{f}(x)) + \text{Bias}(\hat{f}(x))^2$

• Sacrifice bias to reduce variance
• Hopefully leads to decrease in $$MSE$$
• Regularization allows us to tune this trade-off

## Variable Selection

Often interested in using only a subset of the predictors

• Want to identify the most relevant predictors
• Usually the big picture is of interest
• Helps avoid an overly complex model that is difficult to interpret

Linear regression does not directly determine which predictors to use

How do we choose which predictors to use?

• Theory or expertise
• Ad hoc trial and error using p-values

## Automatic Subset Selection Methods

• Best subset
• Forward or backward stepwise
• Forward stagewise
• Pick the best subset using test set prediction $$MSE$$, $$C_p$$, AIC, or BIC

Still not ideal

• Discrete process: predictor is either included or excluded
• Unstable & highly variable
• Regularization is more continuous, & less variable as a result

## Regularization Framework

Same setup as before

• Given $$\boldsymbol{X}$$, find a function, $$f(\boldsymbol{X})$$, to model or predict $$y$$
• Assume linear model and use squared error loss

$\hat{\boldsymbol{\beta}}(\lambda) = \argmin_{\beta} \left\{\sum_{i=1}^n (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda J(\boldsymbol{\beta})\right\}$

• $$\lambda \ge 0$$ is a regularization (tuning or penalty) parameter
• $$J(\boldsymbol{\beta})$$ is a user-defined penalty function
• Typically, the intercept is not penalized

## Role of the Penalty Term

Consider $$\displaystyle J(\boldsymbol{\beta}) = \sum_{j=1}^p \beta_j^2 = \| \boldsymbol{\beta} \|^2_2 \hspace{0.2cm}$$ (Ridge Regression)

Equivalent formulations $\hat{\boldsymbol{\beta}}(\lambda)^{RR} = \argmin_{\boldsymbol{\beta}} \left\{\sum_{i=1}^n (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p \beta_j^2 \right\}$

$\hat{\boldsymbol{\beta}}(t)^{RR} = \argmin_{\boldsymbol{\beta}} \sum_{i=1}^n (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 \\ \text{subject to } \sum_{j=1}^p \beta_j^2 \le t$
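
A sketch of ridge regression with the glmnet package in R (`alpha = 0` requests the ridge penalty; `x` and `y` are hypothetical, and note that glmnet standardizes predictors by default and scales the loss by $$1/(2n)$$, so its $$\lambda$$ is parameterized slightly differently than above):

```r
library(glmnet)

fit_ridge <- glmnet(x, y, alpha = 0)   # fits the whole ridge solution path

plot(fit_ridge, xvar = "lambda")       # coefficient paths vs. log(lambda)
coef(fit_ridge, s = 0.1)               # coefficients at one (arbitrary) lambda
```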

## Role of the Regularization Parameter

$$\lambda$$ directly controls the bias-variance trade-off:

• $$\lambda = 0$$ corresponds to OLS
• $$\lambda \rightarrow \infty$$ puts more weight on the penalty function and results in more shrinkage of the coefficients
• Introduces bias for the sake of reducing variance

Each $$\lambda$$ results in a different solution $$\hat{\boldsymbol{\beta}}(\lambda)$$

• Choosing $$\lambda$$ correctly is crucial (discussed later)

## The Lasso

Tibshirani (1996): Uses $$\displaystyle J(\boldsymbol{\beta}) = \sum_{j=1}^p |\beta_j| = \| \boldsymbol{\beta} \|_1$$

$\hat{\boldsymbol{\beta}}(\lambda)^{L} = \argmin_{\beta} \left\{\sum_{i=1}^n (y_i - \sum_{j=1}^{p} x_{ij} \beta_j)^2 + \lambda \sum_{j=1}^p |\beta_j| \right\}$

The subtle change in the penalty makes a big difference

• Not only shrinks coefficients towards zero, but sets some of them to be exactly zero
• Thus performs continuous variable selection
• Hence the name, Least Absolute Shrinkage and Selection Operator (LASSO)
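
The corresponding glmnet sketch for the lasso (again with hypothetical `x` and `y`; `alpha = 1` is the default):

```r
library(glmnet)

fit_lasso <- glmnet(x, y, alpha = 1)   # lasso solution path

coef(fit_lasso, s = 0.1)               # some entries are exactly zero at this lambda
fit_lasso$df                           # number of nonzero coefficients along the path
```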

## Geometric Interpretation

In the constraint form, the lasso's $$\ell_1$$ constraint region is a diamond with corners on the coordinate axes, so the elliptical RSS contours often first touch it at a corner, where some coefficients are exactly zero; the ridge $$\ell_2$$ constraint region is a disk, so coefficients are shrunk but rarely set exactly to zero.

Source: Hastie, Tibshirani, & Friedman (2009)

## General Regularization Framework

$\min_{f \in \mathcal{H}} \sum_{i=1}^n \left\{L(y_i, f(x_i)) + \lambda J(f)\right\}$

• $$\mathcal{H}$$ is a space of possible functions

Very general and flexible framework

• $$L$$: squared error, absolute error, zero-one, negative log-likelihood (GLM), hinge loss (support vector machines), …
• $$J$$: ridge regression, lasso, adaptive lasso, group lasso, fused lasso, thresholded lasso, generalized lasso, constrained lasso, elastic-net, Dantzig selector, SCAD, smoothing splines, …
• Allows us to incorporate prior knowledge (sparsity, structure, etc.)

## Example: NCAA Dataset

• From Mangold, Bean, & Adams (2003) via Dr. Dennis Boos
• Contains information on 94 major NCAA division I universities
• $$y$$: average 6-year graduation rate for 1996-1998
• $$bbindex$$: author-created basketball index
• $$attend$$: average basketball attendance
• 17 other predictors suggested by the literature
• Acceptance rate, student-to-faculty ratio, etc.

Goal: Use OLS, ridge regression, and the lasso to find the best predictive model

Implementation: glmnet package in R

## Computational Considerations

Regularized regression estimates are a function of $$\lambda$$

Fortunately efficient algorithms exist

• Solution path algorithms
• LARS Algorithm for the Lasso (Efron et al., 2004)
• Piecewise linearity (Rosset & Zhu, 2007)
• Generic path algorithm (Zhou & Wu, 2013)
• Others
• Pathwise coordinate descent (Friedman et al., 2007)
• Alternating Direction Method of Multipliers (ADMM) (Boyd et al. 2011)
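
To make the coordinate descent idea concrete, here is a minimal, didactic sketch (not the glmnet implementation) of cyclic coordinate descent for the lasso at a single $$\lambda$$, using the objective exactly as written on the lasso slide and assuming standardized predictors and a centered response:

```r
# Soft-thresholding operator: the closed-form univariate lasso solution
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Cyclic coordinate descent for sum_i (y_i - x_i'beta)^2 + lambda * sum_j |beta_j|
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  p    <- ncol(X)
  beta <- rep(0, p)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual with predictor j removed
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      # Univariate least-squares coefficient, then soft-threshold
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}
```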

## Choice of Regularization Parameter

Efficiently obtaining the entire solution path is nice, but we still have to choose $$\lambda$$

• Critical since $$\lambda$$ controls the bias-variance trade-off

• Classical criteria: $$C_p$$, AIC, BIC, adjusted $$R^2$$

Cross validation is a popular alternative

• Choice based on predictive performance
• Makes fewer model assumptions
• More widely applicable

## Cross Validation Motivation

Ideally:

• Separate validation set for choosing $$\lambda$$ for a given method
• Reusing training set encourages overfitting
• Using test set to pick $$\lambda$$ underestimates the true error
• Often we do not have enough data for a separate validation set
• Cross validation remedies this problem

## $$K$$-Fold Cross Validation

1. Randomly split training data into $$K$$ parts ("folds")

2. Fit the model using the data in the other $$K-1$$ folds, over a grid of $$\lambda$$ values
3. Calculate prediction MSE on remaining fold
4. Repeat process for each fold and average the MSEs
• Common choices of $$K$$: 5, 10, and $$n$$ (leave-one-out CV)
• One standard error rule
• Choose the $$\lambda$$ giving the simplest (most regularized) model whose MSE is within one standard error of the minimum MSE
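
A sketch of this procedure with `cv.glmnet` (hypothetical `x` and `y`; 10 folds by default, and both the minimum-MSE and one-standard-error choices are returned):

```r
library(glmnet)

set.seed(1)
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)   # 10-fold CV over the lasso path

plot(cv_fit)          # CV MSE curve with one-standard-error bars
cv_fit$lambda.min     # lambda minimizing the cross-validated MSE
cv_fit$lambda.1se     # one-standard-error rule: most regularized model within 1 SE

coef(cv_fit, s = "lambda.1se")   # coefficients at the chosen lambda
```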