Statistical Learning
James Scott (UT-Austin)
Reference: Introduction to Statistical Learning Chapter 3
Linear regression is the most widely used tool in the world for fitting a model of the form \( y = f(x) + e \).
A linear model is a parametric model: we can write down \( f(x) \) in the form of an equation: \[ \begin{aligned} y & = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + e \\ & = x \cdot \beta + e \end{aligned} \] where \( x = (1, x_1, x_2, \dots, x_p) \) and \( \beta = (\beta_0, \beta_1, \beta_2, \ldots, \beta_p) \). A notational convenience: the intercept gets absorbed into the vector of predictors by including a leading term of 1.
They often work pretty well for prediction.
They're a simple foundation for more sophisticated techniques.
They're easy to estimate, even for extremely large data sets.
They have lower estimation variance than most nonlinear/nonparametric alternatives.
They're easier to interpret than nonparametric models.
Techniques for feature selection (choosing which \( x_j \)'s matter) are very well developed for linear models.
Linear models are pretty much always wrong, sometimes subtly and sometimes spectacularly. If the truth is nonlinear, then the linear model will provide a biased estimate.
Linear models depend on which transformation of the data you use (e.g. \( x \) versus \( \log(x) \)).
Linear models don't estimate interactions among the predictors unless you explicitly build them in.
Three brands of OJ: Tropicana, Minute Maid, Dominicks
83 Chicago-area stores
Demographic info for each store
Price, sales (log units moved), and whether advertised (feat)
Data in oj.csv, code in oj.R.
Each brand occupies a well-defined price range.
Sales decrease with price (duh).
Log-linear models: thinking about scale in linear models.
Interpreting regression models when some of the predictors are categorical (factor effects, model matrices).
Interactions.
When fitting a linear model (this goes up, that goes down), think about the scale on which you expect to find linearity.
A very common scale is to model the mean for \( \log(y) \) rather than \( y \). Why? It allows us to model multiplicative rather than additive change.
\[ \log(y) = \log(\beta_0) + \beta_1 x \iff y = \beta_0 e^{\beta_1 x} \]
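To see the multiplicative interpretation concretely, compare the model's predictions at \( x \) and \( x+1 \): \[ \frac{y(x+1)}{y(x)} = \frac{\beta_0 e^{\beta_1 (x+1)}}{\beta_0 e^{\beta_1 x}} = e^{\beta_1}. \] Each one-unit increase in \( x \) multiplies the expected \( y \) by \( e^{\beta_1} \); for small coefficients, \( e^{\beta_1} \approx 1 + \beta_1 \), i.e. roughly a \( 100\beta_1 \) percent change.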
Key guideline: whenever \( y \) changes on a percentage scale, use \( \log(y) \) as a response. E.g.
volatility, failures, weather events… lots of things that are non-negative are expressed most naturally on a log scale.
A simple “elasticity model” for orange juice sales \( y \) might be: \[ \log(y) = \gamma \log(\mathrm{price}) + x \cdot \beta \]
The rough interpretation of a log-log regression like this: for every 1% increase in price, we can expect a \( \gamma \% \) change in sales.
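Where does this reading come from? In a log-log model, \( \gamma \) is the derivative of \( \log(y) \) with respect to \( \log(\mathrm{price}) \): \[ \gamma = \frac{d \log(y)}{d \log(\mathrm{price})} = \frac{dy/y}{d\,\mathrm{price}/\mathrm{price}} \approx \frac{\%\ \text{change in } y}{\%\ \text{change in price}}, \] which is exactly the definition of a price elasticity of demand.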
Let's try this in R, using brand as a feature (\( x \)):
reg = lm(logmove ~ log(price) + brand, data=oj)
coef(reg) ## look at coefficients
(Intercept) log(price) brandminute.maid brandtropicana
10.8288216 -3.1386914 0.8701747 1.5299428
What happened to branddominicks?
Our regression formulas look like \( \beta_0 + \beta_1 x_1 + \cdots \). But brand is not a number, so you can't compute \( \beta \cdot \mathrm{brand} \) directly.
The first step of lm is to create a numeric “model matrix” from the input variables:
Input:
| Brand |
|---|
| dominicks |
| minute maid |
| tropicana |
Variable expressed as a factor.
Output:
| Intercept | brand=minute maid | brand=tropicana |
|---|---|---|
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
Variable coded as a set of numeric “dummy variables” that we can multiply against \( \beta \) coefficients. This is done using model.matrix.
Our OJ model used model.matrix to build a 4 column matrix:
> x <- model.matrix( ~ log(price) + brand, data=oj)
> x[1,]
(Intercept) log(price) brandminute.maid brandtropicana
1.00000 1.353255 0.000000 1.000000
Each factor’s reference level is absorbed by the intercept. Coefficients are “change relative to reference level” (dominicks here).
To check the reference level of your factors, use
levels(oj$brand)
[1] "dominicks" "minute.maid" "tropicana"
The first level is the reference. (By default, R orders factor levels alphabetically.)
To change the reference level, use relevel:
oj$brand2 = relevel(oj$brand, 'minute.maid')
Now if you re-run the regression, you'll see a different baseline category. But crucially, the price coefficient doesn't change:
reg2 = lm(logmove ~ log(price) + brand2, data=oj)
coef(reg) # old model
(Intercept) log(price) brandminute.maid brandtropicana
10.8288216 -3.1386914 0.8701747 1.5299428
coef(reg2) # new model
(Intercept) log(price) brand2dominicks brand2tropicana
11.6989962 -3.1386914 -0.8701747 0.6597681
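These two fits describe exactly the same model, just measured against different baselines: the new intercept equals the old intercept plus the old Minute Maid effect, \( 10.829 + 0.870 = 11.699 \); the new brand2tropicana coefficient is Tropicana's effect relative to Minute Maid, \( 1.530 - 0.870 = 0.660 \); and Dominicks, formerly the baseline, now shows up at \( -0.870 \).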
An interaction is when one feature changes how another feature acts on y.
In regression, an interaction is expressed as the product of two features: \[ E(y \mid x) = f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 + \cdots \] so that the effect on \( y \) of a one-unit increase in \( x_1 \) is \( \beta_1 + \beta_{12} x_2 \). (It depends on \( x_2 \)!)
Interactions play a massive role in statistical learning. They are important for making good predictions, and they are often central to social science and business questions.
This model statement says: elasticity (the log(price) coefficient) is different from one brand to the next:
reg3 = lm(logmove ~ log(price)*brand, data=oj)
coef(reg3)
(Intercept) log(price)
10.95468173 -3.37752963
brandminute.maid brandtropicana
0.88825363 0.96238960
log(price):brandminute.maid log(price):brandtropicana
0.05679476 0.66576088
This output is telling us that the elasticities are:
dominicks: -3.4, minute maid: -3.3, tropicana: -2.7
Where do these numbers come from? Do they make sense?
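These come straight from the coefficients above. Dominicks is the reference brand, so its elasticity is the log(price) coefficient itself: \( -3.38 \approx -3.4 \). The interaction terms shift that slope for the other brands: minute maid is \( -3.38 + 0.06 = -3.32 \approx -3.3 \) and tropicana is \( -3.38 + 0.67 = -2.71 \approx -2.7 \).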
A key question: what changes when we feature a brand? Here, this means in-store display promo or flier ad (feat in oj.csv):
The effect of feat on log sales volume; on price elasticity; and on brand-specific elasticities. Let's see the R code in oj.R for runs of all three models.
Goal: connect the regression formula and output to a specific equation, i.e. \( f(x) = E(y \mid x) \) for each brand individually.
|  | Dominicks | Minute Maid | Tropicana |
|---|---|---|---|
| Not featured | -2.8 | -2.0 | -2.0 |
| Featured | -3.2 | -3.6 | -3.5 |
Findings: for every brand, elasticity is more negative (demand is more price-sensitive) when the brand is featured.
Why does marketing increase price sensitivity? And how does this influence pricing/marketing strategy?
Before including feat, Minute Maid behaved like Dominicks.
reg_interact = lm(logmove ~ log(price)*brand, data=oj)
coef(reg_interact)
(Intercept) log(price)
10.95468173 -3.37752963
brandminute.maid brandtropicana
0.88825363 0.96238960
log(price):brandminute.maid log(price):brandtropicana
0.05679476 0.66576088
(Compare the elasticities!)
With feat, Minute Maid looks more like Tropicana. Why?
reg_ads3 <- lm(logmove ~ log(price)*brand*feat, data=oj)
coef(reg_ads3)
(Intercept) log(price)
10.40657579 -2.77415436
brandminute.maid brandtropicana
0.04720317 0.70794089
feat log(price):brandminute.maid
1.09440665 0.78293210
log(price):brandtropicana log(price):feat
0.73579299 -0.47055331
brandminute.maid:feat brandtropicana:feat
1.17294361 0.78525237
log(price):brandminute.maid:feat log(price):brandtropicana:feat
-1.10922376 -0.98614093
(Again, compare the elasticities!)
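As a sanity check, the elasticities in the table shown earlier fall right out of these coefficients. Not featured: dominicks \( -2.77 \approx -2.8 \), minute maid \( -2.77 + 0.78 = -1.99 \approx -2.0 \), tropicana \( -2.77 + 0.74 = -2.03 \approx -2.0 \). Featured, we also add the log(price):feat terms: dominicks \( -2.77 - 0.47 = -3.24 \approx -3.2 \), minute maid \( -2.77 + 0.78 - 0.47 - 1.11 = -3.57 \approx -3.6 \), tropicana \( -2.77 + 0.74 - 0.47 - 0.99 = -3.49 \approx -3.5 \).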
Minute Maid was more heavily promoted!
| feat | dominicks | minute.maid | tropicana |
|---|---|---|---|
| 0 (not featured) | 0.743 | 0.711 | 0.834 |
| 1 (featured) | 0.257 | 0.289 | 0.166 |
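This table shows, within each brand, the fraction of observations that were featured (feat = 1) versus not. A sketch of how such a table could be computed (the actual code is in oj.R and may differ):
feat_table = xtabs(~ feat + brand, data=oj)   # counts of featured vs. not, by brand
prop.table(feat_table, margin=2)              # proportions within each brand (columns sum to 1)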
Because Minute Maid was more heavily promoted, AND promotions push elasticity in the negative direction, we were confounding the brand effect with the promotion effect when we estimated brand elasticities without feat in the model.
Including feat helped deconfound the estimate!
An interaction is when one variable changes the effect of another variable on y. For example: whether a brand is being featured changes the effect of price on sales.
In linear models, we express interactions as products of variables: \[ f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_{1} x_2 + \cdots \] so that the effect on \( y \) of a one-unit increase in \( x_1 \) is \( \beta_1 + \beta_{12} x_2 \).
Just like with K-nearest neighbors in one \( x \) variable, it's important to consider out-of-sample fit for a linear model.
Especially in a model with lots of variables, the in-sample fit and out-of-sample fit can be very different.
Let's see these ideas in practice, by comparing three predictive models for house prices in Saratoga, New York.
We'll consider three possible models for price constructed from these 11 predictors.
A starter script is in saratoga_lm.R. Goals for today:
# The SaratogaHouses data is in the mosaicData package
library(mosaicData)
data(SaratogaHouses)

# Split into training and testing sets
n = nrow(SaratogaHouses)
n_train = round(0.8*n) # round to nearest integer
n_test = n - n_train
train_cases = sample.int(n, n_train, replace=FALSE)
test_cases = setdiff(1:n, train_cases)
saratoga_train = SaratogaHouses[train_cases,]
saratoga_test = SaratogaHouses[test_cases,]
# Fit to the training data
# lm1: small model with just three predictors
lm1 = lm(price ~ lotSize + bedrooms + bathrooms, data=saratoga_train)
# lm2: "medium" model with main effects of everything except four excluded variables
lm2 = lm(price ~ . - sewer - waterfront - landValue - newConstruction, data=saratoga_train)
# lm3: "big" model; the ^2 expands to main effects plus all pairwise interactions of the lm2 variables
lm3 = lm(price ~ (. - sewer - waterfront - landValue - newConstruction)^2, data=saratoga_train)
# Predictions out of sample
yhat_test1 = predict(lm1, saratoga_test)
yhat_test2 = predict(lm2, saratoga_test)
yhat_test3 = predict(lm3, saratoga_test)
rmse = function(y, yhat) {
sqrt( mean( (y - yhat)^2 ) )
}
# Root mean-squared prediction error
rmse(saratoga_test$price, yhat_test1)
[1] 73773.61
rmse(saratoga_test$price, yhat_test2)
[1] 62221.42
rmse(saratoga_test$price, yhat_test3)
[1] 65707.78
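(In this particular split, the medium model lm2 predicts best out of sample; the big interaction model lm3 fits better in-sample but a bit worse out of sample. The exact numbers will change with the random train/test split.)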
If the assumed parametric form is far from the truth, then a linear model can predict poorly.
A nonparametric model like KNN is different: it assumes no particular functional form for \( f(x) \); it just averages the \( y \) values of nearby points, so it can adapt to nonlinearity, at the cost of higher estimation variance.
Which approach leads to a more favorable spot along the bias-variance tradeoff?
Let's fit a KNN model to the Saratoga house-price data using all the same variables we used in our “medium” model. Recall the basic recipe:
\[ \hat{f}(x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} y_i \]
(The summation over \( i \in \mathcal{N}_0 \) means “sum over all data points \( i \) that fall in the neighborhood.”)
There's a caveat here: how do we measure which \( K \) points are the closest to \( x_0 \)? To see why this is trickier than it sounds at first, consider three houses:
House A: 4 bedrooms, 2 bathrooms, 2200 square feet
House B: 4 bedrooms, 2 bathrooms, 2300 square feet
House C: 2 bedrooms, 1.5 bathrooms, 2125 square feet
Which house (B or C) should be “closer to” A?
Most people would say that A and B are the most similar: they both have 4 BR/2 BA, and their sizes are similar. House C has only 2 BR/1.5 BA and is enormous for a house with that number of bedrooms. Yet if we naively calculate pairwise Euclidean distances, we find the following:
\[ \begin{aligned} d(A, B) &= \sqrt{(4 - 4)^2 + (2-2)^2 + (2200 - 2300)^2} \\ & = 100 \\ d(A, C) &= \sqrt{(4 - 2)^2 + (2-1.5)^2 + (2200 - 2125)^2} \\ &\approx 75.03 \end{aligned} \]
So house C ends up as a neighbor of A, not B!
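A quick check of this arithmetic in R, writing each house as a (bedrooms, bathrooms, sqft) vector:
A = c(4, 2, 2200)    # house A: 4 BR, 2 BA, 2200 sqft
B = c(4, 2, 2300)    # house B: 4 BR, 2 BA, 2300 sqft
C = c(2, 1.5, 2125)  # house C: 2 BR, 1.5 BA, 2125 sqft
sqrt(sum((A - B)^2))   # 100
sqrt(sum((A - C)^2))   # about 75.03, so C is "closer" to A than B is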
What happened here is that the sqft variable completely overwhelmed the BR and BA variables in the distance calculations.
The root of the problem: square footage is measured on an entirely different scale than bedrooms or bathrooms. Treating them as if they're on the same scale leads to counter-intuitive distance calculations.
The simple way around this: weighting. That is, we treat some variables as more important than others in the distance calculation.
Ordinary Euclidean distance: \[ d(x_1, x_2) = \sqrt{ \sum_{j=1}^p \left\{ x_{1,j} - x_{2,j} \right\}^2 } \]
Weighted Euclidean distance:
\[ d_w(x_1, x_2) = \sqrt{ \sum_{j=1}^p \left\{ w_j \cdot ( x_{1,j} - x_{2,j}) \right\}^2 } \]
This depends upon the choice of weights \( w_1, \ldots, w_p \) for each feature variable.
We can always choose the scales/weights to reflect our substantive knowledge of the problem. For example:
This might lead us to scales like the following:
\[ d_w(x, x') = \sqrt{ (x_{1} - x_1')^2 + \left\{200(x_{2} - x_2')\right\}^2 + \left\{ 50(x_{3} - x_3')\right\}^2 } \] where \( x_{1} \) is square footage, \( x_2 \) is bedrooms, and \( x_3 \) is bathrooms. Thus a difference of 1 bedroom counts 200 times as much as a difference of 1 square foot.
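With these weights, the ordering flips to match intuition. A quick check, reusing the house vectors from the snippet above (weights given in (bedrooms, bathrooms, sqft) order):
w = c(200, 50, 1)          # 1 bedroom counts like 200 sqft, 1 bathroom like 50 sqft
sqrt(sum((w*(A - B))^2))   # still 100: A and B differ only in sqft
sqrt(sum((w*(A - C))^2))   # about 408: A and C are now far apart, as they should be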
But choosing weights by hand can be a real pain.
A “default” choice of weights that requires no manual specification is to weight by the inverse standard deviation of each feature variable:
\[ d_w(x_1, x_2) = \sqrt{ \sum_{j=1}^p \left( \frac{x_{1,j} - x_{2,j}}{s_j} \right)^2 } \] where \( s_j \) is the sample standard deviation of feature \( j \) across all cases in the training set.
This is equivalent to calculating “ordinary” distances using a rescaled feature matrix, where we center and scale each feature variable to have mean 0 and standard deviation 1: \[ \tilde{X}_{ij} = \frac{(x_{ij} - \mu_j)}{s_j} \] where \( (\mu_j, s_j) \) are the sample mean and standard deviation of feature variable \( j \).
Then we can run ordinary (unweighted) KNN using Euclidean distances based on \( \tilde{X} \).
library(tidyverse)   # for the pipe (%>%) used below

# construct the training and test-set feature matrices
# note the "-1": this says "don't add a column of ones for the intercept"
Xtrain = model.matrix(~ . - (price + sewer + waterfront + landValue + newConstruction) - 1, data=saratoga_train)
Xtest = model.matrix(~ . - (price + sewer + waterfront + landValue + newConstruction) - 1, data=saratoga_test)
# training and testing set responses
ytrain = saratoga_train$price
ytest = saratoga_test$price
# now rescale:
scale_train = apply(Xtrain, 2, sd) # calculate std dev for each column
Xtilde_train = scale(Xtrain, scale = scale_train)
Xtilde_test = scale(Xtest, scale = scale_train) # use the training set scales!
head(Xtrain, 2)
lotSize age livingArea pctCollege bedrooms fireplaces bathrooms rooms
1246 0.69 3 3208 62 4 1 2.5 11
1513 0.30 32 1778 64 3 1 1.5 6
heatinghot air heatinghot water/steam heatingelectric fuelelectric
1246 1 0 0 0
1513 1 0 0 0
fueloil centralAirNo
1246 0 0
1513 0 1
head(Xtilde_train, 2) %>% round(3)
lotSize age livingArea pctCollege bedrooms fireplaces bathrooms
1246 0.294 -0.863 2.316 0.618 1.023 0.708 0.909
1513 -0.290 0.134 0.037 0.811 -0.186 0.708 -0.598
rooms heatinghot air heatinghot water/steam heatingelectric
1246 1.697 0.727 -0.467 -0.447
1513 -0.457 0.727 -0.467 -0.447
fuelelectric fueloil centralAirNo
1246 -0.459 -0.379 -1.305
1513 -0.459 -0.379 0.766
library(FNN)
K = 10
# fit the model
knn_model = knn.reg(Xtilde_train, Xtilde_test, ytrain, k=K)
# calculate test-set performance
rmse(ytest, knn_model$pred)
[1] 63658.92
rmse(ytest, yhat_test2) # from the linear model with the same features
[1] 62221.42
Let's try many values of \( K \):
library(foreach)
k_grid = exp(seq(log(2), log(300), length=100)) %>% round %>% unique
rmse_grid = foreach(K = k_grid, .combine='c') %do% {
knn_model = knn.reg(Xtilde_train, Xtilde_test, ytrain, k=K)
rmse(ytest, knn_model$pred)
}
plot(k_grid, rmse_grid, log='x')
abline(h=rmse(ytest, yhat_test2)) # linear model benchmark
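To pull the best K out of the grid (a small addition, not in the starter script):
k_grid[which.min(rmse_grid)]   # value of K with the lowest test-set RMSE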