Statistical Learning
James Scott (UT-Austin)
Reference: Introduction to Statistical Learning, Chapters 1-2
The goal is to predict a target variable (\( y \)) with feature variables (\( x \)).
In data mining/ML/AI, this is called “supervised learning.”
A useful way to frame this problem is to think that \( y \) and \( x \) are related like this:
\[ y_i = f(x_i) + e_i \]
where:

- \( f(x) \) is an unknown function describing the systematic relationship between \( x \) and \( y \);
- \( e_i \) is a random error term, assumed to have mean zero.
In the first half of this course, our main purpose is to learn \( f(x) \) from the observed data.
ERCOT operates the electricity grid for 75% of Texas.
The 8 ERCOT regions are shown at right.
We'll focus on a basic prediction task: using the temperature in Houston (KHOU, our \( x \)) to predict electricity demand in ERCOT's Coast region (COAST, our \( y \)).
library(ggplot2)  # for ggplot()

# electricity demand (COAST) versus temperature (KHOU)
ggplot(data = loadhou) +
  geom_point(mapping = aes(x = KHOU, y = COAST), color = 'darkgrey') +
  ylim(7000, 20000)
One candidate model is a straight line: \[ f(x) = \beta_0 + \beta_1 x \]
Another is a quadratic: \[ f(x) = \beta_0 + \beta_1 x + \beta_2 x^2 \]
Or we could try something far more flexible: \[ f(x) = ??? \]
We can't write down an equation for this \( f(x) \). But we can define it by its behavior!
Our training data consists of pairs \[ D_{\mbox{tr}} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} \]
We then use some statistical method to estimate \( f(x) \). Here “statistical” just means “we apply some recipe to the data.”
There are two general families of strategy.
Parametric:
\( f(x) = \beta_0 + \beta_1 x \)
Nonparametric (k-nearest neighbors):
\( f(x) \) = average \( y \) value of the 50 points closest to \( x \)
Choose a functional form for the model, e.g. \[ f(x) = \beta_0 + \beta_1 x \]
Choose a loss function that measures the discrepancy between the model's predictions \( f(x_i) \) and the actual outcomes \( y_i \). E.g. least squares: \[ \begin{aligned} L(f) &= \sum_{i=1}^N (y_i - f(x_i))^2 \\ &= \sum_{i=1}^N (y_i - \beta_0 - \beta_1 x_i)^2 \end{aligned} \]
Find the parameters that minimize the loss function (a short sketch in R follows below).
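For instance, with the linear and quadratic models above, least-squares fitting in R is a one-liner. A minimal sketch, assuming the loadhou data from the plot earlier:

# least-squares fit of the linear model f(x) = beta0 + beta1*x
lm_linear <- lm(COAST ~ KHOU, data = loadhou)

# least-squares fit of the quadratic model f(x) = beta0 + beta1*x + beta2*x^2
lm_quad <- lm(COAST ~ KHOU + I(KHOU^2), data = loadhou)

coef(lm_linear)  # the fitted parameters (the beta-hats)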
Suppose we have our training data in the form of \( (x_i, y_i) \) pairs.
Now we want to predict \( y \) at some new point \( x^{\star} \).
There are no parameters (i.e. the \( \beta \)'s in a linear model) to estimate. Rather, the estimate for \( f(x) \) is defined by a particular algorithm applied to the data set. (Note for the mathematically rigorous: \( f(x) \) is defined point-wise, that is, by applying the same recipe at any point \( x \).)
OK, that seems sensible, but why average the nearest \( K=50 \) neighbors? Why not \( K=2 \), or \( K=200 \)?
And if we're free to pick any value of \( K \) we like, how should we choose?
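To make the KNN recipe concrete, here is a minimal sketch of the averaging step, written from scratch; the function name knn_predict and the default K = 50 are just for illustration, and in practice you would likely use an existing KNN implementation:

# K-nearest-neighbors prediction at a single new point x_star
knn_predict <- function(x_star, x, y, K = 50) {
  dists <- abs(x - x_star)        # distance from x_star to every training point
  nearest <- order(dists)[1:K]    # indices of the K closest training points
  mean(y[nearest])                # average their y values
}

# e.g. predicted demand when the temperature is 30 degrees:
knn_predict(30, loadhou$KHOU, loadhou$COAST, K = 50)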
As we have seen in the examples above, there are lots of options in estimating \( f(x) \).
Some methods are very flexible, and some are not… why might we ever choose a less flexible model?
Simple, more restrictive methods are usually easier to interpret.
More importantly, simple models often make more accurate predictions than very complex ones.
Let's turn to a deceptively subtle question: how accurate is each of these models?
A standard measure of accuracy is the root mean-squared error, or RMSE:
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 }
\]
This measures the typical size of the “mistakes” (errors) the model makes on the training data. (OLS minimizes this quantity.)
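In R, the training-sample RMSE is easy to compute from a fitted model. A minimal sketch, reusing the lm_linear fit from the earlier sketch (the helper name rmse is just for illustration):

# root mean-squared error between actual and fitted/predicted values
rmse <- function(y, yhat) {
  sqrt(mean((y - yhat)^2))
}

rmse(loadhou$COAST, fitted(lm_linear))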
\[ \mathrm{RMSE} = 1807 \]
\[ \mathrm{RMSE} = 1001 \]
\[ \mathrm{RMSE} = 669 \]
Making good predictions about the past isn't very impressive.
Our very complex (K=2) model earned a low RMSE by simply memorizing the random pattern of noise in the training data.
It's like getting a perfect score on the GRE when someone tells you what the questions are ahead of time: it doesn't predict anything about how well you'll do in the future.
Key idea: what really matters is our prediction accuracy out-of-sample!!!
Suppose we have \( M \) additional observations \( (x_i^{\star}, y_i^{\star}) \) for \( i = 1, \ldots, M \), which we did not use to fit the model. We'll call this the “testing” data, to distinguish it from our original (“training”) data.
What really matters is the model's out-of-sample root mean-squared error: \[ \mathrm{RMSE}_{\mathrm{out}} = \sqrt{ \frac{1}{M} \sum_{i=1}^M (y_i^{\star} - f(x_i^{\star}))^2 } \]
We don't have any “future data” to use to test our model. We just have our \( N \) original data points. So we have to fake it by splitting our data set \( D \) into two subsets: a training set \( D_{in} \) and a testing set \( D_{out} \).
We use \( D_{in} \) only to fit the models, and \( D_{out} \) only to compare the out-of-sample predictive performance of the models.
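A minimal sketch of this train/test split in R, reusing the rmse helper from above; the 80/20 split is a common convention, not something fixed by these notes:

n <- nrow(loadhou)
train_idx <- sample(1:n, size = round(0.8 * n))  # random 80% of rows for training
D_in  <- loadhou[train_idx, ]
D_out <- loadhou[-train_idx, ]

# fit on the training set only
fit_in <- lm(COAST ~ KHOU, data = D_in)

# evaluate predictions on the held-out testing set
rmse(D_out$COAST, predict(fit_in, D_out))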
\[ \mathrm{RMSE}_{\mathrm{out}} = 1805 \]
\[ \mathrm{RMSE}_{\mathrm{out}} = 953 \]
Not too simple, not too complex… This plot illustrates the bias-variance trade-off, one of the key ideas of this course.
\[ \mathrm{RMSE}_{\mathrm{out}} = 893 \]
Try it yourself with the loadhou data set and the starter R script. Get a feel for how the code behaves: for example, add + ylim(7000, 20000) + xlim(0, 36), or whatever limits seem appropriate.

High K = high bias, low variance:
Low K = low bias, high variance:
Let's take a deeper look at prediction error.
- Let \( \{(x_1, y_1), \ldots, (x_n, y_n)\} \) be your training data.
- Suppose that \( y = f(x) + e \), where \( E(e) = 0 \) and \( \mbox{var}(e) = \sigma^2 \).
- Let \( \hat{f}(x) \) be the function estimate arising from your training data.
- Let \( x^{\star} \) be some future \( x \) point, and let \( y^{\star} \) be the corresponding outcome.
- \( x^{\star} \) is fixed a priori, but the training data and the future outcome \( y^{\star} \) are random.
Define the expected squared prediction error at \( x^{\star} \) as:
\[
MSE^{\star} = E\left\{ \left( y^{\star} - \hat{f} (x^{\star}) \right)^2 \right\}
\]
For any random variable \( A \), \( E(A^2) = \mbox{var}(A) + E(A)^2 \). Applying this with \( A = y^{\star} - \hat{f}(x^{\star}) \), and using the facts that \( E(e^{\star}) = 0 \) and that \( e^{\star} \) is independent of the training data (and hence of \( \hat{f}(x^{\star}) \)):
\[
\begin{aligned}
MSE^{\star} &= E\left\{ \left( y^{\star} - \hat{f} (x^{\star}) \right)^2 \right\} \\
&= \mbox{var} \left\{ y^{\star} - \hat{f} (x^{\star}) \right\} + \left( E \left\{ y^{\star} - \hat{f} (x^{\star}) \right\} \right)^2 \\
&= \mbox{var} \left\{ f(x^\star) + e^\star - \hat{f} (x^{\star}) \right\} + \left( E \left\{ f(x^\star) + e^\star - \hat{f} (x^{\star}) \right\} \right)^2 \\
&= \sigma^2 + \mbox{var} \left\{ \hat{f} (x^{\star}) \right\} + \left( E \left\{ f(x^\star) - \hat{f} (x^{\star}) \right\} \right)^2 \\
&= \sigma^2 + \mbox{(Estimation variance)} + \mbox{(Squared estimation bias)}
\end{aligned}
\]
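The decomposition can also be checked by simulation. A minimal sketch, using an assumed “true” \( f \), noise level, and sample size that are purely illustrative (not taken from the ERCOT data), with KNN as the estimator \( \hat{f} \):

set.seed(1)
f_true <- function(x) sin(x)     # assumed true f, for illustration only
sigma <- 0.5; n <- 100; x_star <- 2; K <- 20

# refit KNN on many fresh training sets and record f-hat(x_star) each time
fhat_star <- replicate(2000, {
  x <- runif(n, 0, 2 * pi)
  y <- f_true(x) + rnorm(n, 0, sigma)
  nearest <- order(abs(x - x_star))[1:K]
  mean(y[nearest])
})

var(fhat_star)                         # estimation variance
(mean(fhat_star) - f_true(x_star))^2   # squared estimation bias
# MSE-star is approximately sigma^2 plus the two quantities above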
Classification is the term used when we want to predict a target variable \( y \) that is categorical (win/lose, sick/healthy, pay/default, good/bad…). For example, the plot below shows the longest run length of capital letters for a sample of spam and non-spam e-mails:
In this context, our approach is similar in spirit, but differs in a few specific ways. If \( y \) is binary (0/1), then our assumed model is:
\[
P(y = 1 \mid x) = f(x)
\]
In general, if \( y \in \{1, \ldots, K\} \), then we have a multi-class classification problem, and our goal is to estimate \[ P(y = k \mid x) = f_k(x) \] for each category \( k \). Note there is an implicit constraint: \[ \sum_{k=1}^K f_k(x) = 1 \]
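For example, in the binary case (\( K = 2 \)), the constraint means we only need to estimate one of the two functions:
\[
f_2(x) = 1 - f_1(x) \, ,
\]
which is why the binary model above is written in terms of a single \( f(x) \).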
For example, in the spam classification problem, here is a linear estimate for \( f(x) \). It behaves somewhat reasonably, but also has some obviously undesirable features (for instance, a linear fit can produce “probabilities” below 0 or above 1).
Here's a very different fit:
In binary classification, a very natural prediction rule is to threshold the probabilities:
\[
\hat{y} = \left\{
\begin{array}{ll}
1 & \mbox{if } f(x) \geq t \\
0 & \mbox{if } f(x) < t
\end{array}
\right.
\]
A natural choice is \( t=0.5 \), although this might not always be appropriate.
In multi-class problems, the extension of this idea is to predict using the “highest probability” class:
\[
\hat{y} = \arg\max_k \, f_k(x) \, .
\]
All the discussion about complexity vs. predictive accuracy (bias-variance trade-off) applies to this setting… we'll explore this very soon!
What differs is our measure of accuracy: we are no longer dealing with a numerical outcome variable.
One common approach is to measure the accuracy of a model's predictions using the raw error rate on the training sample: \[ ER = \frac{1}{n} \sum_{i=1}^n I(y_i \neq \hat{y}_i) \] Here \( I(\cdot) \) is the indicator function, taking the value 1 if its argument is true, and 0 otherwise.
Just like with numerical prediction, we can define an out-of-sample version of the error rate: \[ ER_{out} = \frac{1}{m} \sum_{i=1}^m I(y_i^{\star} \neq \hat{y}_i^{\star}) \] where \( (x_1^\star, y_1^\star), \ldots, (x_m^\star, y_m^\star) \) is a collection of \( m \) “future” (out-of-sample) data points that weren't used to train the model.
As before, the practical way to estimate this quantity is to use a train/test split of the original data set.
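A minimal sketch of that workflow in R, assuming a data frame called spam with a 0/1 outcome y and a single numeric feature x (hypothetical names; substitute the actual spam variables), using a linear estimate of \( f(x) \) and the threshold \( t = 0.5 \):

# train/test split
n <- nrow(spam)
train_idx <- sample(1:n, size = round(0.8 * n))
spam_train <- spam[train_idx, ]
spam_test  <- spam[-train_idx, ]

# linear estimate of f(x) = P(y = 1 | x), fit on the training set only
fit <- lm(y ~ x, data = spam_train)

# threshold the estimated probabilities at t = 0.5 to get class predictions
yhat_test <- ifelse(predict(fit, spam_test) >= 0.5, 1, 0)

# out-of-sample error rate
mean(yhat_test != spam_test$y)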