Statistical Learning
James Scott (UT-Austin)
Reference: Introduction to Statistical Learning Chapter 6
Note: although we talk about regression here, everything applies to logistic regression as well (and hence classification).
In this set of notes, we're still talking about the good ol' linear model: \[ E(y \mid x) = \beta_0 + \sum_{j=1}^p \beta_j x_{j} \]
But we're focused on improving the linear model by using some other model-fitting strategy beyond ordinary least squares (OLS).
Why move beyond OLS?
1. Improved prediction accuracy—sometimes radically improved.
2. More interpretable models.
The seemingly obvious approach is exhaustive enumeration: fit every possible subset of the candidate variables and pick the best one.
But this can be too exhausting: with \( p \) candidates there are \( 2^p \) possible subsets to fit. What are the limiting performance factors here?
Thus a more practical approach to model-building is iterative:
OK, so there are three common ways to initialize the search:
Forward selection: suppose we have \( p \) candidate variables and interactions (called the “scope”). Start with a working model having no variables in it (the null model).
Backward selection: suppose we have \( p \) candidate variables and interactions. Start with a working model having all these variables in it (the full model).
Stepwise selection: suppose we have \( p \) candidate variables and interactions (the scope). Start with any working model containing some subset of these variables. Ideally this should be a reasonable guess at a good model.
All three variants can be run in R with step(); see the sketch below.
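Here is a minimal sketch of the three variants using R's step() function (which scores candidate models by AIC, discussed next). The data frame df and the variables y, x1, x2, x3 are hypothetical placeholders.

```r
# Hypothetical data frame `df` with response y and candidates x1, x2, x3.
null_model <- lm(y ~ 1, data = df)                 # the null model: no variables
full_model <- lm(y ~ (x1 + x2 + x3)^2, data = df)  # all variables + pairwise interactions

# Forward: start from the null model and add terms from the scope one at a time.
fwd <- step(null_model, scope = formula(full_model), direction = "forward")

# Backward: start from the full model and drop terms one at a time.
bwd <- step(full_model, direction = "backward")

# Stepwise ("both"): start anywhere inside the scope; terms can be added or dropped.
start_model <- lm(y ~ x1, data = df)
both <- step(start_model, scope = formula(full_model), direction = "both")
```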
The most common one is called AIC (“Akaike information criterion”): \[ {\mathrm{MSE}}_\mathrm{AIC} = {\mathrm{MSE}}_\mathrm{in} \left(1 + \frac{p}{n} \right)\, , \] where \[ {\mathrm{MSE}}_\mathrm{in} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n-p} \] is the usual unbiased estimate of the residual variance \( \sigma^2 \) in a linear model.
Thus the AIC estimate is an inflated version of the in-sample MSE. It is not a true out-of-sample estimate.
The general definition of AIC for a model with \( p \) coefficients is \[ \mbox{AIC} = \mbox{Deviance} + 2\cdot p \] where you'll recall that the deviance is \( -2 \) times the log likelihood of the model.
This is an estimate of the out-of-sample deviance of a model. Note that we inflate the in-sample (fitted) deviance by an additive factor of \( 2 \cdot p \).
In the special case of a linear model fit by OLS, this general definition reduces to the version of \( \mathrm{MSE}_\mathrm{AIC} \) just given.
But the more parameters \( p \) you have relative to data points \( n \), the larger the inflation factor gets.
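As a quick sanity check of the general definition, here is a small sketch with simulated data. For a logistic regression with 0/1 responses, the reported deviance is exactly \( -2 \) times the log likelihood, so AIC() should equal the deviance plus \( 2 \cdot p \).

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 + 1.2 * x))   # simulated 0/1 outcomes
fit <- glm(y ~ x, family = binomial)

p <- length(coef(fit))      # number of coefficients (here 2)
fit$deviance + 2 * p        # deviance + 2p ...
AIC(fit)                    # ... matches R's AIC
```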
The AIC estimate of MSE is just an approximation to a true out-of-sample estimate. It still relies upon some pretty specific mathematical assumptions (the linear model being true, Gaussian errors) that can easily be wrong in practice.
Don't view the AIC + stepwise selection algorithm as having God-like powers for discerning the single best model.
Don't treat it as an excuse to be careless. Instead proceed cautiously. Always verify that the stepwise-selected model makes sense and doesn’t violate any crucial assumptions.
It’s also a good idea to perform a quick train/test split of your data and compute \( \mathrm{MSE}_\mathrm{out} \) for your final model.
This ensures you’re actually improving the generalization error versus your baseline model.
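For instance, here is a sketch of such a check, assuming a data frame df with response y and your candidate predictors:

```r
set.seed(1)
test_idx <- sample(nrow(df), floor(0.2 * nrow(df)))   # hold out 20% as a test set
train <- df[-test_idx, ]
test  <- df[test_idx, ]

# Baseline model vs. stepwise-selected model, both fit on the training data only.
baseline <- lm(y ~ 1, data = train)
selected <- step(lm(y ~ ., data = train), direction = "both", trace = 0)

mse_out <- function(model, newdata) mean((newdata$y - predict(model, newdata))^2)
mse_out(baseline, test)
mse_out(selected, test)
```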
step() is very slow.
This is a generic problem with stepwise selection: each candidate model along the path must be refit from scratch.
A related, subtle (but massively important) issue is stability. MLEs can have high sampling variability when \( p \approx n \): they change a lot from one dataset to another. So which MLE model is “best” changes a lot.
The key to modern statistical learning is regularization: departing from optimality to stabilize a system.
This is very common in engineering:
Recall that the deviance (\( = -2 \cdot \log \mathrm{LHD} \), i.e. \( -2 \) times the log likelihood) is the cost of mis-fitting the data:
But nonzero choices of \( \beta_j \) also have costs:
The “optimal” fit \( \hat{\beta}_{\mathrm{MLE}} \) maximizes the likelihood, or equivalently minimizes the deviance, as a function of \( \beta \):
\[
\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}}
\quad \frac{1}{n} \mathrm{dev}(\beta)
\]
The regularized fit minimizes the deviance plus a “penalty” on the complexity of the estimate:
\[
\underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}}
\quad \frac{1}{n}\mathrm{dev}(\beta) + \lambda \cdot \mathrm{pen}(\beta)
\]
Here \( \lambda \) is the penalty weight, while “pen” is some cost function that penalizes departures of the fitted \( \beta \) from \( 0 \).
Ridge regression: \[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n} \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p \beta_j^2 \]
Lasso regression: \[ \underset{\beta \in \mathbb{R}^p}{\mathrm{minimize}} \quad \frac{1}{n} \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p |\beta_j| \]
The advantage of the lasso approach is that it gives sparse solutions: many of the \( \hat{\beta}_j \)'s are identically zero. This yields automatic variable selection.
The lasso fits \( \hat{\beta} \) to minimize \( \mathrm{dev}(\beta) + \lambda \cdot \sum_{j=1}^p |\beta_j| \). We'll do this for a sequence of penalties \( \lambda_1 > \lambda_2 > \cdots > \lambda_T \).
We can then apply standard model-selection tools (e.g. cross validation, AIC) to pick the best \( \lambda \).
Path estimation:
Some key facts:
This is like a better version of traditional stepwise selection.
There are many, many packages for fitting lasso regressions in R.
glmnet is the most common, and gamlr is my favorite. These two are very similar, and they share syntax.
gamlr does a “gamma lasso” penalty. A little beyond the scope of the course, but cool stuff.
gamlr just makes it easier to apply some model selection rules.
Both use the Matrix library representation for sparse matrices.
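A minimal sketch of the shared syntax, assuming you already have a numeric response y and a design matrix X (dense or sparse) without an intercept column, since both packages add the intercept themselves:

```r
library(gamlr)

fit <- gamlr(X, y)        # fits the whole path lambda_1 > lambda_2 > ... > lambda_T
plot(fit)                 # coefficient paths as a function of lambda
coef(fit)                 # coefficients at gamlr's default (AICc-selected) lambda

cvfit <- cv.gamlr(X, y, nfold = 10)   # cross-validation along the path
coef(cvfit, select = "min")           # lambda with minimum cross-validated error

# The glmnet equivalents:
# library(glmnet)
# fit   <- glmnet(X, y)
# cvfit <- cv.glmnet(X, y)
# coef(cvfit, s = "lambda.min")
```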
Often, your feature matrix will be very sparse (i.e., mostly zeros).
A simple triplet matrix is a very common storage format:
For example:
\[ X = \left[ \begin{array}{r r r} 0 & 0 & 3.1 \\ 1.7 & 0 & 0 \\ 0 & 0 & 0 \end{array} \right] \]
would be stored as the list of (row, column, value) triplets for its nonzero entries: \( (1, 3, 3.1) \) and \( (2, 1, 1.7) \).
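With the Matrix library in R, you can build X directly from those triplets:

```r
library(Matrix)
X <- sparseMatrix(i = c(1, 2),       # row indices of the nonzero entries
                  j = c(3, 1),       # column indices
                  x = c(3.1, 1.7),   # the nonzero values themselves
                  dims = c(3, 3))
X   # only the two nonzero entries are stored
```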
For gamlr/glmnet you need to build your own design matrix, i.e., do what the \( y \sim x1 + x2 \) formula does for you inside glm.
We've seen how to do this with model.matrix for dense matrices. The sparse version sparse.model.matrix works the same way.
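For example, a sketch assuming a data frame df with a numeric predictor x1, a factor x2, and response y:

```r
library(Matrix)
# "-1" drops the intercept column, since gamlr/glmnet add their own intercept.
X <- sparse.model.matrix(~ x1 + x2 - 1, data = df)
fit <- gamlr::gamlr(X, df$y)
```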
We'll walk through the example in semiconductor.R…
Basic idea:
This is useful when:
This poses a pretty substantial estimation problem:
Could do variable selection from among these thousands of variables, but remember: each extra bit of hay we throw onto the haystack makes it harder to find the needles!
Suppose we want to build a model for how the returns on Apple's stock depend on all other stocks in the market. So let \( y_t \) be the return on Apple's stock in period \( t \), and let \( x_{jt} \) be the return on stock \( j \) in period \( t \), for \( j = 1, \ldots, p \).
A linear model would say \[ y_t = \alpha + \sum_{j=1}^p \beta_j x_{jt} + e_t \]
So we try dimensionality reduction:
\[
y_t = \alpha + \beta m_t + e_t
\]
where \( m_t \) is the return on the overall market in period \( t \): a single weighted combination of all the individual stock returns.
In finance this is called the “single index” model, a variant of the “capital asset pricing model”.
See, e.g., https://finance.yahoo.com/quote/AAPL
Look for their estimate of \( \beta \)!
The single-index model filters \( p \) variables into 1 variable, via a single linear combination: \[ m = \sum_{j=1}^p w_j x_{j} \, . \]
In the case of a single-index model for stocks, it is natural to take a market-weighted linear combination of all stocks. The weights come from economics, not statistics.
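For example, a tiny sketch assuming a hypothetical \( T \times p \) matrix returns of individual stock returns and a vector mktcap of market capitalizations:

```r
w <- mktcap / sum(mktcap)        # market weights, summing to 1
m <- as.numeric(returns %*% w)   # one market-index value per time period
```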
But what if we don't have a natural or “obvious” set of weights?
A very popular way to proceed here is Principal Components Analysis (PCA).
In PCA, we extract \( K \) summary variables from each feature vector \( x_i \). Each summary \( k=1, \ldots, K \) is of the form: \[ s_{ik} = \sum_{j=1}^p v_{jk} x_{ij} \] where \( x_{ij} \) is original feature \( j \), and \( s_{ik} \) is summary feature \( k \), for observation \( i \).
The coefficients \( v_{jk} \) are the “loadings” or “weights” on the original variables. The goal of PCA is to choose an “optimal” set of weights.
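In R, prcomp computes these weights; a sketch, assuming a numeric feature matrix X:

```r
pc <- prcomp(X, center = TRUE, scale. = TRUE)  # center (and usually scale) the features first

pc$rotation[, 1:2]       # the loadings v_jk for the first two summaries
scores <- pc$x[, 1:2]    # the summary variables s_ik for each observation i
```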
Data from 1970s Europe on consumption of 9 types of foods: red meat, white meat, eggs, milk, fish, cereals, starch, nuts, vegetables.
Can you interpret the summaries?
PCA is a great way to “collapse” many features into few.
The choice of \( K \) (how many summaries) can be evaluated via the out-of-sample fit.
The units of each summary variable are not interpretable in an absolute sense—only relative to each other.
It's always a good idea to center the data before running PCA.
Much more on PCA later!
Let's see an example in gasoline.R.