Linear model selection and regularization

Statistical Learning
James Scott (UT-Austin)

Reference: Introduction to Statistical Learning Chapter 6

Outline

  1. Goals of model selection and regularization
  2. Model selection by stepwise selection
  3. Ridge regression
  4. The lasso
  5. Dimension reduction

Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).

Model selection and regularization

In this set of notes, we're still talking about the good ol' linear model: \[ E(y_i \mid x_i) = \beta_0 + \sum_{j=1}^p \beta_j x_{ij} \]

But we're focused on improving the linear model by using some other model-fitting strategy beyond ordinary least squares (OLS).

Why move beyond OLS?
1. Improved prediction accuracy—sometimes radically improved.
2. More interpretable models.

Reason 1: improved prediction accuracy

  • A linear model is generally thought of as a “high bias, low variance” kind of estimator, at least compared with alternatives we've seen (and others we have yet to see).
  • But if the number of observations \( n \) is not much larger than the number of features \( p \), even OLS can have high estimation variance.
  • The result: overfitting and poor prediction accuracy.
  • One solution: shrinking or regularizing the coefficient estimates so that they don't just chase noise in the training data.
  • Extreme case: if \( p>n \), there isn't even a unique OLS solution! No option but to do something else.

Reason 2: model interpretability

  • A major source of variance in OLS estimation is the inclusion of unnecessary variables in the regression equation. (Here “unnecessary” means “\( \beta_j = 0 \)”.)
  • This problem is especially pronounced under sparsity: you have lots of candidate features for predicting \( y \), only a few of them are actually useful, and you don't know which ones.
  • By removing these variables (i.e. setting their coefficients to zero), we get a simpler model that is easier to interpret and communicate.
  • In this chapter, we'll learn some ways to do this automatically.

Model selection

The seemingly obvious approach is exhaustive enumeration:

  • fit all possible models to a training set
  • measure the generalization error of each one on a testing set.

But this can be too exhausting. What are the limiting performance factors here?

  • it might take a long time to fit each model (when?)
  • if so, it will take even longer to repeatedly re-fit each model to multiple training sets
  • and there might be way too many models to check them all.

Iterative model selection

Thus a more practical approach to model-building is iterative:

  • Start with a working model.
  • Consider a limited set of “small” changes to the working model.
  • Measure the performance gains/losses resulting from each small change.
  • Choose the “best” small change to the working model. This becomes the new working model.
  • Repeat until you can't make any more changes that make the model better.

Iterative model selection

OK, so:

  • Where do I start?
  • What is a “small change” to a model?
  • What is a “limited set” of such changes?
  • How do we measure performance gains/losses?

Forward selection

Suppose we have \( p \) candidate variables and interactions (called the “scope”). Start with a working model having no variables in it (the null model).

  1. Consider all possible one-variable additions to the working model, and measure how much each addition improves the model's performance. (Note: we'll define “improvement” here soon!)
  2. Choose the single addition that improves the model the most. This becomes the new working model. (If no addition yields an improvement, stop the algorithm.)
  3. Repeat steps 1 and 2.
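In R, forward selection can be sketched with the built-in step() function. A minimal sketch; the data frame df and the variable names are hypothetical placeholders:

```r
# A minimal sketch of forward selection using R's built-in step().
null_model <- lm(y ~ 1, data = df)            # working model: no variables
fwd <- step(null_model,
            scope = ~ x1 + x2 + x3 + x1:x2,   # the "scope": candidate variables and interactions
            direction = "forward")            # one-variable additions only, judged by AIC
summary(fwd)
```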

Backward selection

Suppose we have \( p \) candidate variables and interactions. Start with a working model having all these variables in it (the full model).

  1. Consider all possible one-variable deletions to the working model, and measure how much each deletion improves the model's performance.
  2. Choose the single deletion that improves the model the most. This becomes the new working model. (If no deletion yields an improvement, stop the algorithm.)
  3. Repeat steps 1 and 2.

Stepwise selection (forward/backward)

Suppose we have \( p \) candidate variables and interactions (the scope). Start with any working model containing some subset of these variables. Ideally this should be a reasonable guess at a good model.

  1. Consider all possible one-variable additions or deletions to the working model, and measure how much each addition or deletion improves the model's performance.
  2. Choose the single addition or deletion that improves the model the most. This becomes the new working model. (If no addition or deletion yields an improvement, stop the algorithm.)
  3. Repeat steps 1 and 2.
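A sketch of this forward/backward version with step(); starting from the full model with direction = "backward" gives pure backward selection instead. Names are again hypothetical:

```r
# Stepwise selection from a reasonable starting guess, via step().
working <- lm(y ~ x1 + x2, data = df)                    # initial working model
swise <- step(working,
              scope = list(lower = ~ 1,                  # smallest allowed model
                           upper = ~ (x1 + x2 + x3)^2),  # main effects + pairwise interactions
              direction = "both")                        # consider additions AND deletions, by AIC
```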

Measuring model performance

  • Ideally, we'd measure model performance using out-of-sample MSE, calculated by cross-validation.
  • But this adds another computationally expensive step to the algorithm.
  • Thus stepwise selection usually involves some approximation: most statistical software measures performance not by actually calculating \( \mathrm{MSE}_{\mathrm{out}} \) on test data, but by using one of several heuristic approximations to \( \mathrm{MSE}_{\mathrm{out}} \).

AIC

The most common one is called AIC (“Akaike information criterion”): \[ {\mathrm{MSE}}_\mathrm{AIC} = {\mathrm{MSE}}_\mathrm{in} \left(1 + \frac{p}{n} \right)\, , \] where \[ {\mathrm{MSE}}_\mathrm{in} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-p} \] is the usual unbiased estimate of the residual variance \( \sigma^2 \) in a linear model.

Thus the AIC estimate is an inflated version of the in-sample MSE. It is not a true out-of-sample estimate.
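To make the inflation concrete: with \( n = 100 \) observations and \( p = 10 \) coefficients, \[ \mathrm{MSE}_{\mathrm{AIC}} = \mathrm{MSE}_{\mathrm{in}} \left(1 + \frac{10}{100}\right) = 1.1 \cdot \mathrm{MSE}_{\mathrm{in}} \, , \] a 10% inflation. With \( p = 50 \), the inflation grows to 50%.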

AIC: a technical note

The general definition of AIC for a model with \( p \) coefficients is \[ \mbox{AIC} = \mbox{Deviance} + 2\cdot p \] where you'll recall that the deviance is \( -2 \) times the log likelihood of the model.

This is an estimate of the out-of-sample deviance of the model. Note that we inflate the in-sample (fitted) deviance by an additive penalty of \( 2 \cdot p \).

In the special case of a linear model fit by OLS, this general definition reduces to the version of \( \mathrm{MSE}_{\mathrm{AIC}} \) given above.
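As a quick sanity check, here's a sketch using R's built-in mtcars data: for an OLS fit, the Gaussian deviance is \( n \log(\mathrm{RSS}/n) \) up to an additive constant, and extractAIC() reports that quantity plus the \( 2 \cdot p \) penalty.

```r
# Check that AIC = deviance + 2p for an OLS fit (up to constants).
fit <- lm(mpg ~ wt + hp, data = mtcars)
n   <- nobs(fit)
rss <- sum(resid(fit)^2)
p   <- n - df.residual(fit)     # number of estimated coefficients (here 3)
n * log(rss / n) + 2 * p        # by hand
extractAIC(fit)[2]              # should agree
```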

AIC: some notes

  • The inflation factor of \( (1 + p/n) \) is always larger than 1.
  • But the more parameters \( p \) you have relative to data points \( n \), the larger the inflation factor gets.

  • The AIC estimate of MSE is just an approximation to a true out-of-sample estimate. It still relies upon some pretty specific mathematical assumptions (linear model true, error Gaussian) that can easily be wrong in practice.

AIC: some notes

  • Don't view the AIC + stepwise selection algorithm as having God-like powers for discerning the single best model.

  • Don't treat it as an excuse to be careless. Instead proceed cautiously. Always verify that the stepwise-selected model makes sense and doesn’t violate any crucial assumptions.

  • It’s also a good idea to perform a quick train/test split of your data and compute \( \mathrm{MSE}_{\mathrm{out}} \) for your final model (a sketch follows this list).

  • This ensures you’re actually improving the generalization error versus your baseline model.
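A minimal sketch of that check; df, the 80/20 split, and the model formulas are hypothetical:

```r
# Quick train/test check of the final model against a baseline.
set.seed(1)
train_idx <- sample(nrow(df), 0.8 * nrow(df))   # 80/20 split
train <- df[train_idx, ]
test  <- df[-train_idx, ]
mse_out <- function(fit) mean((test$y - predict(fit, test))^2)
c(baseline = mse_out(lm(y ~ 1, data = train)),        # intercept-only model
  final    = mse_out(lm(y ~ x1 + x2, data = train)))  # the stepwise-selected model
```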

Problems with subset/stepwise selection

  • step() is very slow.

  • This is a generic problem with stepwise selection: each candidate model along the path must be refit from scratch.

  • A related, subtle (but massively important) issue is stability. MLEs can have high sampling variability when \( p \approx n \): they change a lot from one dataset to another. So which MLE model is “best” changes a lot, too.

Regularization

The key to modern statistical learning is regularization: departing from optimality to stabilize a system.

This is very common in engineering:

  • Would you drive on an “optimal” bridge (absolute minimum concrete required to support a load)?
  • Would you fly on an “optimal” airplane? (Ask Boeing!)

Regularization: some intuition

Recall that the deviance (\( = -2 \cdot \log \mathrm{LHD} \)) is the cost of mis-fitting the data:

  • It penalizes discrepancy between model predictions \( \hat{y} \) and actual data \( y \).
  • This cost is explicit in maximum likelihood estimation.

But nonzero choices of \( \beta_j \) also have costs:

  • \( \beta_j = 0 \) is a “safe” (zero-variance) choice.
  • Any other nonzero choice of \( \beta_j \) incurs a cost: the need to use data (a finite resource) to estimate it.
  • This cost is hidden in maximum likelihood estimation.
  • Solution: make the cost explicit!!

Regularization

The “optimal” fit \( \hat{\beta}_{\mathrm{MLE}} \) maximizes the likelihood, or equivalently minimizes the deviance, as a function of \( \beta \):
\[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n} \mathrm{dev}(\beta) \]

The regularized fit minimizes the deviance plus a “penalty” on the complexity of the estimate:
\[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n}\mathrm{dev}(\beta) + \lambda \cdot \mathrm{pen}(\beta) \] Here \( \lambda \) is the penalty weight, while “pen” is some cost function that penalizes departures of the fitted \( \beta \) from \( 0 \).

Regularization: two common penalties

Ridge regression: \[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n} \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p \beta_j^2 \]

Lasso regression: \[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n} \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p |\beta_j| \]

The advantage of the lasso approach is that it gives sparse solutions: many of the \( \hat{\beta}_j \)'s are identically zero. This yields automatic variable selection.
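A minimal sketch of both fits using the glmnet package (see the software notes below); the feature matrix X and response y are hypothetical:

```r
# Ridge vs. lasso with glmnet: alpha selects the penalty.
library(glmnet)
ridge <- glmnet(X, y, alpha = 0)    # squared-coefficient (ridge) penalty
lasso <- glmnet(X, y, alpha = 1)    # absolute-value (lasso) penalty
# The lasso sets many coefficients exactly to zero along its path;
# ridge only shrinks them toward zero.
plot(lasso, xvar = "lambda")        # coefficient paths across lambda
```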

Regularization: path estimation

The lasso fits \( \hat{\beta} \) to minimize \( \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p |\beta_j| \). We'll do this for a sequence of penalties \( \lambda_1 > \lambda_2 > \cdots > \lambda_T \).

We can then apply standard model-selection tools (e.g. cross validation, AIC) to pick the best \( \lambda \).

Path estimation:

  • Start with big \( \lambda_1 \): so big that \( \hat{\beta} = 0 \).
  • For \( t = 2, \ldots, T \), update \( \hat{\beta} \) to be optimal under \( \lambda_t < \lambda_{t-1} \).

Regularization: path estimation

Some key facts:

  • The estimate \( \hat{\beta} \) changes continuously along this path of \( \lambda \)'s.
  • It's fast! Each update is computationally very easy.
  • It's stable: the optimal \( \lambda \) may change a bit from sample to sample, but these changes don't affect the overall fit that much.

This is like a better version of traditional stepwise selection.
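Here's a sketch of the whole recipe with cv.glmnet, which fits the full \( \lambda \) path and cross-validates each \( \lambda \); X and y are again hypothetical:

```r
# Fit the lasso path and choose lambda by 10-fold cross-validation.
library(glmnet)
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)
plot(cvfit)                      # estimated out-of-sample error across the path
coef(cvfit, s = "lambda.min")    # coefficients at the CV-optimal lambda
coef(cvfit, s = "lambda.1se")    # a simpler, more heavily penalized choice
```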

Regularization: software

There are many, many packages for fitting lasso regressions in R.

  • glmnet is most common, and gamlr is my favorite. These two are very similar, and they share syntax.
  • Side note: the big difference is what they do beyond a simple lasso: glmnet offers an “elastic net” penalty, while gamlr offers a “gamma lasso” penalty. Both are a little beyond the scope of the course, but cool stuff.
  • Since we stick mostly to lasso, they’re nearly equivalent for us. gamlr just makes it easier to apply some model selection rules.

Both use the Matrix library representation for sparse matrices.

Sparse matrices

Often, your feature matrix will be very sparse (i.e., mostly zeros).

  • This is especially true when you have lots of factors (categorical variables) as features.
  • It is then efficient to ignore zero elements in actually storing \( X \).

A simple triplet matrix is a very common storage format:

  • It stores only the nonzero elements,
  • each as a triplet: row index \( i \), column index \( j \), and entry value \( v \).

Sparse matrices

For example:

\[ X = \left[ \begin{array}{r r r} 0 & 0 & 3.1 \\ 1.7 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right] \]

would be stored as

  • \( i = 1, 2 \)
  • \( j = 3, 1 \)
  • \( v = 3.1, 1.7 \)
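The Matrix package can build this matrix directly from its triplets; note that the constructor's argument for the values is x, and the result is stored internally in a compressed (rather than raw triplet) format:

```r
# Build the 3x3 example above from its (i, j, v) triplets.
library(Matrix)
X <- sparseMatrix(i = c(1, 2),       # row indices of the nonzero entries
                  j = c(3, 1),       # column indices
                  x = c(3.1, 1.7),   # the nonzero values themselves
                  dims = c(3, 3))
X  # zero entries are never stored, only implied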

Sparse matrices

For gamlr/glmnet you need to build your own design matrix, i.e., do for yourself what the \( y \sim x_1 + x_2 \) formula does inside glm.

We've seen how to do this with model.matrix for dense matrices. The sparse version sparse.model.matrix works the same way.
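A minimal sketch, with a hypothetical data frame df containing numeric features and a factor:

```r
# Build a sparse design matrix by formula, then fit a lasso path.
library(Matrix)
library(gamlr)
X <- sparse.model.matrix(~ x1 + x2 + f1, data = df)[, -1]  # drop the intercept column
fit <- gamlr(X, df$y)    # gamlr adds its own intercept
plot(fit)                # coefficient paths across lambda
```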

We'll walk through the example in semiconductor.R

Dimensionality reduction

Basic idea:

  • Summarize the information in the \( p \) original features into a smaller set of \( K \) summary features (\( K < p \)).
  • Then try to predict \( y \) using the derived features.
  • Summaries are often linear combinations of the original variables.

This is useful when:

  • there are still too many variables (the haystack is too big).
  • and/or the features are highly correlated and it doesn't make sense to just select a handful of them.

Dimensionality reduction: example

Suppose we want to build a model for how the returns on Apple's stock depend on all the other stocks in the market. So let

  • \( y_t \) = return on Apple stock in month \( t \).
  • \( x_{jt} \) = return on other stock \( j \) in month \( t \).

A linear model would say \[ y_t = \alpha + \sum_{j=1}^p \beta_j x_{jt} + e_t \]

Dimensionality reduction: example

This poses a pretty substantial estimation problem:

  • only 12 months per year, and perhaps 40 years of data;
  • yet thousands of other stocks in the market.

We could do variable selection from among these thousands of variables, but remember: each extra bit of hay we throw onto the haystack makes it harder to find the needles!

Dimensionality reduction: example

So we try dimensionality reduction:
\[ y_t = \alpha + \beta m_t + e_t \] where

  • \( y_t \) is the return on a stock of interest (e.g. Apple) in period \( t \)
  • \( m_t \) is the “market return”, i.e. the weighted-average return on all other stocks:
    \[ m_t = \sum_{j=1}^p w_j x_{jt} \, . \] (The weight \( w_j \) is proportional to the size of the company.)

Dimensionality reduction: example

In finance this is called the “single index” model, a variant of the “capital asset pricing model”.

See, e.g., https://finance.yahoo.com/quote/AAPL

Look for their estimate of \( \beta \)!

Dimensionality reduction

The single-index model filters \( p \) variables down to a single variable, via one linear combination: \[ m = \sum_{j=1}^p w_j x_{j} \, . \]

In the case of a single-index model for stocks, it is natural to take a market-weighted linear combination of all stocks. The weights come from economics, not statistics.

But what if we don't have a natural or “obvious” set of weights?

Dimensionality reduction: PCA

A very popular way to proceed here is Principal Components Analysis (PCA).

  • PCA is a dimensionality reduction technique that tries to represent \( p \) variables with \( K < p \) new variables.
  • Like in the single-index model, these “new” variables are linear combinations of the original variables.
  • The hope is that a small number of them are able to effectively represent what is going on in the original data.
  • We'll study PCA in detail later. For now, we'll use it as a black box “machine” for generating summary features.

PCA: some intuition

In PCA, we extract \( K \) summary variables from each feature vector \( x_i \). Each summary \( k=1, \ldots, K \) is of the form: \[ s_{ik} = \sum_{j=1}^p v_{jk} x_{ij} \] where \( x_{ij} \) is original feature \( j \), and \( s_{ik} \) is summary feature \( k \), for observation \( i \).

The coefficients \( v_{jk} \) are the “loadings” or “weights” on the original variables. The goal of PCA is to choose an “optimal” set of weights.
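In R, the weights and summaries come straight out of prcomp(); the numeric feature matrix X below is hypothetical:

```r
# Extract PCA loadings (the v's) and summary features (the s's).
pc <- prcomp(X, center = TRUE)   # center the data before running PCA
V  <- pc$rotation[, 1:2]         # loadings v_jk for the first K = 2 summaries
S  <- pc$x[, 1:2]                # summary features s_ik (the "scores")
# The scores are just linear combinations of the centered features:
all.equal(S, scale(X, center = TRUE, scale = FALSE) %*% V,
          check.attributes = FALSE)
```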

PCA: some intuition

[Figure: scatterplot of two highly correlated variables, \( x_1 \) and \( x_2 \).]

  • The two variables x1 and x2 are very correlated. PC1 tells you almost everything that is going on in this dataset: \[ s_{1} = 0.781 x_1 - 0.625 x_2 \]
  • PCA will look for linear combinations of the original variables that account for most of their variability.

PCA: a simple example

Data from 1970s Europe on consumption of 9 types of foods: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, vegetables.

[Figure: plot of the first two principal components of the European food-consumption data.]

Can you interpret the summaries?

PCA: quick summary

  • PCA is a great way to “collapse” many features into few.

  • The choice of \( K \) (how many summaries) can be evaluated via the out-of-sample fit.

  • The units of each summary variable are not interpretable in an absolute sense—only relative to each other.

  • It's always a good idea to center the data before running PCA.

  • Much more on PCA later!
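As a preview, here's a minimal sketch of using the summaries as regression features (principal components regression); X, y, and the choice \( K = 4 \) are hypothetical:

```r
# Regress y on the first K principal-component summaries.
pc  <- prcomp(X, center = TRUE)
K   <- 4                               # how many summaries; tune via out-of-sample fit
dat <- data.frame(y = y, pc$x[, 1:K])  # columns become PC1, ..., PCK
fit <- lm(y ~ ., data = dat)
summary(fit)
```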

Let's see an example in gasoline.R.