Regularized Regression

Basic Idea

  1. Fit a regression model
  2. Penalize (or shrink) large coefficients

Pros:

  • Can help with the bias/variance tradeoff
  • Can help with model selection

Cons:

  • May be computationally demanding on large data sets
  • Does not perform as well as random forests or boosting

A motivating example

Suppose a regression model with two covariates \(X_1\) and \(X_2\) is considered:

\[ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon \] where \(X_1\) and \(X_2\) are nearly perfectly correlated (co-linear). You can approximate this model by: \[ Y = \beta_0 + \left( \beta_1 + \beta_2 \right) X_1 + \epsilon \] The result is:

  • You will get a good estimate of \(Y\).
  • The estimate of \(Y\) will be biased.
  • We may reduce variance in the estimate.
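
A small simulation (with made-up coefficients, purely illustrative) shows the effect: dropping one of two nearly collinear covariates biases the individual coefficients, but the combined effect \(\beta_1 + \beta_2\) is recovered by the single remaining covariate.

# Sketch: two nearly collinear covariates; compare the full fit with the
# approximation that keeps only X1. All numbers here are illustrative.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)             # nearly perfectly correlated with x1
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)       # true model uses both covariates
coef(lm(y ~ x1 + x2))                      # unstable individual coefficients
coef(lm(y ~ x1))                           # x1 coefficient is close to beta1 + beta2 = 5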
library(ElemStatLearn)
data(prostate)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

Model selection approach: split samples

No approach is better than this one when the amount of data and the computation time permit it.

  • Approach

    1. Divide the data into training/test/validation sets.
    2. Treat the validation set as test data: train all competing models on the training data and pick the best one on the validation set.
    3. To appropriately assess performance on new data, apply the chosen model to the test set.
    4. You may re-split and re-perform steps 1-3 (a minimal sketch follows this list).
  • Two common problems

    • Limited data
    • Computational complexity
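
A minimal sketch of this splitting step using caret's createDataPartition() (the iris data and the roughly 60/20/20 proportions are an arbitrary illustrative choice; the Wage example later in these notes follows the same pattern):

# Sketch: split a data set into training / test / validation with caret.
# The 60/20/20 proportions are arbitrary; adjust them to the problem at hand.
library(caret)
data(iris)
inBuild    <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
validation <- iris[-inBuild, ]             # held out until the very end
build      <- iris[inBuild, ]
inTrain    <- createDataPartition(build$Species, p = 0.75, list = FALSE)
training   <- build[inTrain, ]             # roughly 60% of the data
testing    <- build[-inTrain, ]            # roughly 20% of the data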

Decomposing expected prediction error

Assume the test set is formed by the observations \((x_i, y_i)\), \(i = 1, \ldots, n\), and assume the underlying model

\[ y_i = f(x_i) + \epsilon_i, \quad i = 1, 2, \ldots, n, \] with the errors:

  • \(\epsilon_i\) i.i.d.,
  • centered, i.e. \(\mathbb{E} \left[ \epsilon_1 \right] = 0\),
  • with variance \(\mathbb{V} \left[ \epsilon_1 \right] = \sigma^2\).

For a (previously unseen) test observation \((x_0, y_0)\), not used to train the statistical learning method, and an estimate of \(f\), denoted \(\widehat{f}\) and obtained from the training data, the expected test MSE (expected prediction error in the MSE sense) is

\[ \mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right] , \]
where the expectation \(\mathbb{E}\) accounts for the randomness of the estimate \(\widehat{f}\) (through the training set) and of the test error \(\epsilon_0\).

The expected test MSE can be decomposed into:

  • the irreducible error: \(\sigma^2\),

  • the bias: \(\mathbf{Bias} \left( \widehat{f}(x_0) \right) = f(x_0) - \mathbb{E} \left[ \widehat{f} (x_0) \right]\), corresponding to the error induced by the approximation of the real \(f\) by the estimate \(\widehat{f}\),

  • the variance: \(\mathbb{V} \left[ \widehat{f} (x_0) \right]\) - corresponding to the amount by which \(\widehat{f}\) would change when estimated using a different training set, different training sets resulting in different \(\widehat{f}\). Ideally, the estimate for \(f\) should not vary too much between training sets.

The expected test MSE can be written as

\[ \mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right] = \mathbb{E} \left[ y_0^2 \right] - 2 \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] + \mathbb{E} \left[ \widehat{f}^2 (x_0) \right] \]

Using the assumption on the underlying model, the first term of the expected test MSE becomes

\[ \begin{align} \mathbb{E} \left[ y_0^2 \right] = \mathbb{E} \left[ \left( f(x_0) + \epsilon_0 \right)^2 \right] = \mathbb{E} \left[ f^2(x_0) \right] + 2 \mathbb{E} \left[\epsilon_0 f(x_0) \right] + \mathbb{E} \left[ \epsilon_0^2 \right] = f^2(x_0) + \sigma^2 , \end{align} \] using the fact that \(f(x_0)\) is deterministic (hence \(\mathbb{E} \left[ \epsilon_0 f(x_0) \right] = f(x_0) \mathbb{E} \left[ \epsilon_0 \right]\)), that \(\mathbb{E} \left[ \epsilon_0 \right] = 0\) by assumption, and that \(\mathbb{E} \left[ \epsilon_0^2 \right] = \mathbb{V} \left[ \epsilon_0 \right] = \sigma^2\).

The second term of the expected test MSE becomes

\[ \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] = \mathbb{E} \left[ \left( f(x_0) + \epsilon_0 \right) \widehat{f} (x_0) \right] = \mathbb{E} \left[ f(x_0) \widehat{f}(x_0) \right] + \mathbb{E} \left[ \epsilon_0 \widehat{f}(x_0) \right] = f(x_0) \mathbb{E} \left[ \widehat{f} (x_0) \right], \] using the fact that \(\epsilon_0\) and \(\widehat{f}(x_0)\) are independent (the test error plays no role in training, hence \(\mathbb{E} \left[\epsilon_0 \widehat{f}(x_0) \right] = \mathbb{E} \left[\epsilon_0 \right] \mathbb{E} \left[ \widehat{f}(x_0) \right] = 0\)). Lastly, the third term of the expected test MSE is

\[ \mathbb{E} \left[ \widehat{f}^2 (x_0)\right] = \mathbb{V} \left[ \widehat{f} (x_0) \right] + \mathbb{E}^2 \left[ \widehat{f} (x_0) \right] \]

Plugging these back in, the expected test MSE becomes

\[ \begin{align} \mathbb{E} \left[ \left( y_0 - \widehat{f} (x_0) \right)^2 \right] & = \mathbb{E} \left[ y_0^2 \right] - 2 \mathbb{E} \left[ y_0 \widehat{f} (x_0) \right] + \mathbb{E} \left[ \widehat{f}^2 (x_0) \right] \\ & = f^2(x_0) + \sigma^2 - 2 f(x_0) \mathbb{E} \left[ \widehat{f} (x_0) \right] + \mathbb{V} \left[ \widehat{f} (x_0) \right] + \mathbb{E}^2 \left[ \widehat{f} (x_0) \right] \\ & = \sigma^2 + \left(f(x_0) - \mathbb{E} \left[ \widehat{f} (x_0) \right] \right)^2 + \mathbb{V} \left[ \widehat{f} (x_0) \right] \\ & = \sigma^2 + \mathbf{Bias}^2 \left( \widehat{f}(x_0) \right) + \mathbb{V} \left[ \widehat{f} (x_0) \right] \\ & = \text{irreducible error} + \text{squared bias} + \text{variance} \end{align} \] The overall expected test MSE (overall expected prediction error in the MSE sense) can be computed by averaging the expected test MSE at one test point over all possible values of \(x_0\) in the test set.
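
A small simulation can make the decomposition concrete. The sketch below (the true \(f\), the noise level, and the deliberately misspecified linear fit are illustrative choices, not part of the notes) re-estimates a model on many fresh training sets and compares the average test MSE at a single point \(x_0\) with \(\sigma^2 + \mathbf{Bias}^2 + \text{variance}\):

# Sketch: empirical bias-variance decomposition at a single test point x0.
set.seed(1)
f <- function(x) sin(2 * pi * x)           # true regression function (assumed)
sigma <- 0.3                               # sd of the irreducible noise
x0 <- 0.9                                  # test point of interest
fhat_x0 <- replicate(2000, {
  x <- runif(50); y <- f(x) + rnorm(50, sd = sigma)
  fit <- lm(y ~ x)                         # deliberately misspecified linear fit
  predict(fit, data.frame(x = x0))
})
y0 <- f(x0) + rnorm(2000, sd = sigma)      # fresh test responses at x0
mean((y0 - fhat_x0)^2)                               # approximate expected test MSE
sigma^2 + (f(x0) - mean(fhat_x0))^2 + var(fhat_x0)   # sigma^2 + bias^2 + variance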

Another issue for high-dimensional data

library(ElemStatLearn)
data(prostate)
smallset <- prostate[1:5,]
lm(lpsa ~ ., data = smallset)
## 
## Call:
## lm(formula = lpsa ~ ., data = smallset)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##     9.60615      0.13901     -0.79142      0.09516           NA           NA  
##         lcp      gleason        pgg45    trainTRUE  
##          NA     -2.08710           NA           NA

When the number of predictors is greater than the number of observations used to fit the model, some of the coefficients will be NA: the design matrix is rank deficient, so \(\boldsymbol{X}^T \boldsymbol{X}\) cannot be inverted.

Hard thresholding

  • Model: \(y_i = f(x_i) + \epsilon_i\)

  • Assume a linear model \(\widehat{f} = \beta^T x\)

  • Constrain so that only \(\lambda\) of the coefficients are nonzero.

  • The selection problem is then, after choosing \(\lambda\), to figure out which coefficients should be nonzero (this can be computationally demanding; see the sketch below).
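
As a hedged sketch of the idea, best-subset selection with the leaps package (assumed installed; it is not used elsewhere in these notes) searches over which coefficients are allowed to be nonzero, using the prostate data loaded above:

# Sketch: hard thresholding via best-subset selection (leaps package assumed).
# 'nvmax' plays the role of the number of coefficients allowed to be nonzero.
library(leaps)
subsets <- regsubsets(lpsa ~ . - train, data = prostate, nvmax = 4)
summary(subsets)$which                     # predictors selected at each model size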

Regularization for regression

If the \(\beta_i\)s are unconstrained:

  • they can explode
  • they are susceptible to high variance

To control variance, we might regularize/shrink the coefficients by using a penalized form of the sum of squares.

\[ \sum_{j = 1}^{n} \left( y_j - \sum_{i = 1}^{m} \beta_{i}x_{ij} \right)^2 + P(\lambda; \beta) \] Properties commonly sought in the penalty:

  • penalty reduces complexity
  • penalty reduces variance
  • penalty respects structure of the problem

Ridge Regression (Tikhonov regularization)

The ridge regression solves

\[ \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 + \lambda \| \boldsymbol{\beta} \|_2^2 = \sum_{j = 1}^{n} \left( y_j - \sum_{i = 1}^{m} \beta_{i}x_{ij} \right)^2 + \lambda \sum_{i=1}^{m} \beta_i^2 \] The penalty term is the squared \(\ell_2\) norm of the coefficient vector.

It can be shown that this estimator is the solution to the least squares problem subject to the constraint \(\| \boldsymbol{\beta} \|_2^2 \leq c\), for some \(c\) depending on \(\lambda\):

\[ \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \quad \text{subject to} \quad \| \boldsymbol{\beta} \|_2^2 \leq c. \] Including the penalty with \(\lambda > 0\) also makes the problem non-singular even when \(\boldsymbol{X}^T \boldsymbol{X}\) is not invertible.
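
For completeness, the resulting ridge estimator has the standard closed form

\[ \widehat{\boldsymbol{\beta}}^{\text{ridge}} = \left( \boldsymbol{X}^T \boldsymbol{X} + \lambda \boldsymbol{I} \right)^{-1} \boldsymbol{X}^T \boldsymbol{y}, \] which is well defined for any \(\lambda > 0\) because \(\boldsymbol{X}^T \boldsymbol{X} + \lambda \boldsymbol{I}\) is always invertible.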

Tuning the \(\lambda\) parameter

  • \(\lambda\) controls the size of the coefficients.
  • \(\lambda\) controls the amount of regularization.
  • as \(\lambda \to 0\) we obtain the least squares solution.
  • as \(\lambda \to \infty\) we obtain \(\widehat{\beta}^{\text{ridge}}_{\lambda = \infty} = 0\) (all coefficients shrink to zero).
  • setting the \(\lambda\) parameter can be done with cross-validation or other techniques (see the sketch after this list).
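
A minimal sketch of ridge regression with a cross-validated \(\lambda\), assuming the glmnet package is installed and the prostate data is still loaded (in glmnet, alpha = 0 corresponds to the ridge penalty):

# Sketch: ridge regression on the prostate data with glmnet (alpha = 0).
library(glmnet)
x <- as.matrix(subset(prostate, select = -c(lpsa, train)))
y <- prostate$lpsa
cvRidge <- cv.glmnet(x, y, alpha = 0)      # lambda chosen by 10-fold cross-validation
cvRidge$lambda.min                         # lambda with minimal cross-validated error
coef(cvRidge, s = "lambda.min")            # shrunken (but nonzero) coefficients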

LASSO (Least Absolute Shrinkage and Selection Operator)

The objective of LASSO is to solve

\[ \min_{\boldsymbol{\beta}} \|\boldsymbol{y} - \boldsymbol{X} \boldsymbol{\beta} \|_2^2 \quad \text{subject to} \quad \sum_{i = 1}^{p} | \beta_i | \leq s, \]

with the corresponding Lagrangian

\[ \sum_{i=1}^{N} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{p} | \beta_j |, \] which has an analytical (soft-thresholding) solution when the design matrix \(\boldsymbol{X}\) is orthonormal.
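
A corresponding sketch for the LASSO, under the same assumptions as the ridge example above (alpha = 1 gives the \(\ell_1\) penalty; note that some coefficients are set exactly to zero, which performs variable selection):

# Sketch: LASSO on the prostate data with glmnet (alpha = 1).
# Reuses the x and y objects created in the ridge sketch above.
cvLasso <- cv.glmnet(x, y, alpha = 1)
coef(cvLasso, s = "lambda.min")            # exact zeros correspond to dropped variables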

Combining Predictors (Ensemble methods)

Key Ideas

  • Combining classifiers by averaging/voting.
  • Combining classifiers improves accuracy.
  • Combining classifiers reduces interpretability.
  • Boosting, bagging & random forests are variants on this theme.

Basic intuition - majority vote

For \(n\) independent classifiers \(c_i\), \(i = 1, \ldots, n\), each correct with probability (accuracy)

\[ \mathbb{P}(c_i = 1) = p, \quad i = 1, \ldots, n, \]

the majority vote accuracy is

\[ \mathbb{P} \left( \| \boldsymbol{c} \| \geq \lceil n/2 \rceil \right) = 1 - \mathbb{P} \left( \| \boldsymbol{c} \| \leq \lfloor n/2 \rfloor \right) = \sum_{k = 0}^{\lfloor n/2 \rfloor} {n \choose k} p^{n-k} (1-p)^{k} = 1 - \text{pbinom}(\lfloor n/2 \rfloor, n, p), \] where \(\| \boldsymbol{c} \|\) denotes the number of correct classifiers. For 5 completely independent classifiers (\(n = 5\)) with accuracy \(70\)% (\(p = 0.7\)), the majority vote accuracy is \(83.7\)%: \[ \mathbb{P} \left( \| \boldsymbol{c} \| \geq \lceil 5/2 \rceil = 3 \right) = 1 - \text{pbinom}(\text{floor}(n/2), n, p) = \text{pbinom}(\text{floor}(n/2), n, p, \text{FALSE}) = 0.83692 , \]

p = 0.7
n = 5
pbinom(floor(n/2), n, p, FALSE)
## [1] 0.83692

For 101 completely independent classifiers (\(n = 101\)) with the same accuracy \(70\)% (\(p = 0.7\)), the majority vote accuracy exceeds \(99.99\)%:

p = 0.7
n = 101
pbinom(floor(n/2), n, p, FALSE)
## [1] 0.9999871

Approaches for combining classifiers

  1. Bagging, boosting, random forests
    • Usually, combine similar classifiers
  2. Combining different classifiers
    • Model stacking
    • Model ensembling

Example with Wage data

library(ISLR)
data(Wage)
library(ggplot2)
library(caret)

# Remove the logwage 
Wage <- subset(Wage, select = -c(logwage))

# Create a building data set and a validation set
inBuild <- createDataPartition(y = Wage$wage, p = 0.7, list = FALSE)

# Create the validation set
validation <- Wage[-inBuild, ]

build <- Wage[inBuild, ]
inTrain <- createDataPartition(y = build$wage, p = 0.7, list = FALSE)

# Create the testing set
test <- build[-inTrain, ]
# Create the training set
train <- build[inTrain, ]

# The validation, testing and training sets and their dimensions
head(validation); dim(validation)
##        year age           maritl     race       education             region
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
## 377954 2008  30 1. Never Married 3. Asian 3. Some College 2. Middle Atlantic
## 305706 2007  34       2. Married 1. White      2. HS Grad 2. Middle Atlantic
## 153561 2003  39       2. Married 1. White 4. College Grad 2. Middle Atlantic
## 449654 2009  54       2. Married 1. White      2. HS Grad 2. Middle Atlantic
##              jobclass         health health_ins      wage
## 86582  2. Information 2. >=Very Good      2. No  70.47602
## 376662 2. Information 2. >=Very Good     1. Yes 127.11574
## 377954 2. Information      1. <=Good     1. Yes 111.72085
## 305706  1. Industrial 2. >=Very Good      2. No  81.28325
## 153561  1. Industrial 2. >=Very Good     1. Yes 134.70538
## 449654 2. Information 2. >=Very Good     1. Yes 134.70538
## [1] 898  10
head(train); dim(train)
##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 81404  2004  52       2. Married 1. White      2. HS Grad 2. Middle Atlantic
## 160191 2003  37 1. Never Married 3. Asian 4. College Grad 2. Middle Atlantic
## 301585 2007  56       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins      wage
## 231655  1. Industrial      1. <=Good      2. No  75.04315
## 161300  1. Industrial      1. <=Good     1. Yes 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 154.68529
## 81404  2. Information 2. >=Very Good     1. Yes 128.68049
## 160191  1. Industrial 2. >=Very Good      2. No  82.67964
## 301585  1. Industrial      1. <=Good     1. Yes 129.15669
## [1] 1474   10
head(test); dim(test)
##        year age           maritl     race       education             region
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 450601 2009  44       2. Married 4. Other 3. Some College 2. Middle Atlantic
## 228963 2006  41 1. Never Married 2. Black 3. Some College 2. Middle Atlantic
## 302778 2007  45      4. Divorced 1. White 3. Some College 2. Middle Atlantic
## 8690   2005  35 1. Never Married 1. White      2. HS Grad 2. Middle Atlantic
## 447660 2009  51       2. Married 1. White 3. Some College 2. Middle Atlantic
##              jobclass         health health_ins      wage
## 11443  2. Information      1. <=Good     1. Yes  75.04315
## 450601  1. Industrial 2. >=Very Good     1. Yes 169.52854
## 228963 2. Information 2. >=Very Good     1. Yes 118.88436
## 302778 2. Information      1. <=Good     1. Yes 117.14682
## 8690   2. Information 2. >=Very Good     1. Yes  89.49248
## 447660  1. Industrial 2. >=Very Good     1. Yes  90.48191
## [1] 628  10

Build two different models

# GLM model
fitGLM <- train(wage ~., method = 'glm', data = train)
predGLM <- predict(fitGLM, test)
errGLM <- predGLM - test$wage
#sqrt(sum(errGLM^2))
#length(predGLM)
qplot(test$wage, predGLM)

plot(errGLM, type = "l", pch = 17, col = 'blue', xlab = 'testing', ylab = 'error')

# RF model
fitRF <- train(wage ~., method = 'rf', data = train,
               trControl = trainControl(method = 'cv', number = 3))
predRF <- predict(fitRF, test)
errRF <- predRF - test$wage
#sqrt(sum(errRF^2))
#length(predRF)
qplot(test$wage, predRF)

plot(errRF, type = "l", pch = 17, col = 'cyan', xlab = 'testing', ylab = 'error')

qplot(predGLM, predRF, color = wage, data = test)

Build a model combining the two predictors

# Create the data frame of the two predictions and the wages (from the testing set)
predDF <- data.frame(predGLM, predRF, wage = test$wage)
head(predDF); dim(predDF)
##          predGLM    predRF      wage
## 11443   92.02262  98.01184  75.04315
## 450601 107.58982 134.11919 169.52854
## 228963  97.76183  96.47581 118.88436
## 302778 100.82265 106.52115 117.14682
## 8690    94.03700  96.49075  89.49248
## 447660 124.79846 130.23127  90.48191
## [1] 628   3
# Combined model (trained using the testing set)
fitCOMB <- train(wage ~., method = 'gam', data = predDF)
predCOMB <- predict(fitCOMB, predDF)
errCOMB <- predCOMB - test$wage
#sqrt(sum(errCOMB^2))
#length(predCOMB)
qplot(test$wage, predCOMB)

plot(errCOMB, type = "l", pch = 17, col = 'pink', xlab = 'testing', ylab = 'error')

# Plot comparing the errors corresponding to the three models (GLM,RF,COMB)
plot(errGLM, type = 'l', pch = 17, col = 'blue', xlab = "testing", ylab = "error")
lines(errRF, type = 'l', pch = 17, col = 'cyan')
lines(errCOMB, type = 'l', pch = 17, col = 'pink')
legend('topright',
       legend = c('GLM', 'RF', 'combined'),
       col = c('blue', 'cyan', 'pink'),
       lty = 1, cex = 0.8)

sqrt(sum(errGLM^2))
## [1] 923.8519
sqrt(sum(errRF^2))
## [1] 952.6409
sqrt(sum(errCOMB^2))
## [1] 901.1065

Predict on the validation set

# Prediction over the validation set using the GLM (trained over training)
predGLMVal <- predict(fitGLM, validation)
errGLMVal <- predGLMVal - validation$wage
#sqrt(sum(errGLMVal^2))
#length(predGLMVal)
qplot(validation$wage, predGLMVal)

plot(errGLMVal, type = "l", pch = 17, col = 'blue', xlab = 'validation', ylab = 'error')

# Prediction over the validation set using the RF (trained over training)
predRFVal <- predict(fitRF, validation)
errRFVal <- predRFVal - validation$wage
#sqrt(sum(errRFVal^2))
#length(predRFVal)
qplot(validation$wage, predRFVal)

plot(errRFVal, type = "l", pch = 17, col = 'cyan', xlab = 'validation', ylab = 'error')

# Create the data frame of the two predictions corresponding to the two models.
predDFVal <- data.frame(predGLM = predGLMVal, 
                        predRF = predRFVal)
head(predDFVal, 10); dim(predDFVal)
##          predGLM    predRF
## 86582   95.17083  81.60250
## 376662 143.20212 156.71460
## 377954  91.59363 105.01108
## 305706  92.32045  96.49526
## 153561 130.33402 141.23856
## 449654 121.14431 106.38376
## 230312 141.88892 133.49794
## 229379 104.82888 100.16875
## 86064   84.80882  95.76331
## 7690   123.51298 115.36876
## [1] 898   2
# Prediction over the validation set using the combined model (trained on the test-set predictions)
predCOMBVal <- predict(fitCOMB, predDFVal)
errCOMBVal <- predCOMBVal - validation$wage
#sqrt(sum(errCOMBVal^2))
#length(predCOMBVal)
qplot(validation$wage, predCOMBVal)

plot(errCOMBVal, type = "l", pch = 17, col = 'pink', xlab = 'validation', ylab = 'error')

plot(errGLMVal, type = 'l', pch = 17, col = 'blue', xlab = "validation", ylab = "error")
lines(errRFVal, type = 'l', pch = 17, col = 'cyan')
lines(errCOMBVal, type = 'l', pch = 17, col = 'pink')
legend('topright',
       legend = c('GLM', 'RF', 'combined'),
       col = c('blue', 'cyan', 'pink'),
       lty = 1, cex = 0.8)

sqrt(sum(errGLMVal^2))
## [1] 1023.509
sqrt(sum(errRFVal^2))
## [1] 1058.78
sqrt(sum(errCOMBVal^2))
## [1] 1039.941

Notes

  • Even simple blending can be useful
  • Typical model for binary/multiclass data
    • Build an odd number of models
    • Predict with each model
    • Predict the class by majority
  • This can get dramatically more complicated
    • Simple blending in caret: caretEnsemble (a sketch follows below)
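
A minimal sketch of simple blending with caretEnsemble (hedged: the caretList()/caretEnsemble() calls below follow that package's documented interface, which is not used elsewhere in these notes and may differ between versions), reusing the Wage training and test sets from above:

# Sketch: blend two caret models with the caretEnsemble package (assumed installed).
library(caretEnsemble)
ctrl <- trainControl(method = 'cv', number = 5, savePredictions = 'final')
models <- caretList(wage ~ ., data = train, trControl = ctrl,
                    methodList = c('glm', 'rpart'))
blend <- caretEnsemble(models)             # greedy linear blend of the two models
predict(blend, newdata = test)             # blended predictions on the test set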

Forecasting

Forecasting is typically applied to time series data, which introduces specific types of dependence structure and additional challenges that should be taken into account when making predictions.

What is different?

  • Data are dependent over time.

  • Specific pattern types:

    • trends - long-term increase or decrease.
    • seasonal patterns - patterns related to time of week, month, year.
    • cycles - patterns that rise and fall periodically.
  • Subsampling into training/test sets is more complicated (see the sketch after this list).

  • Similar issues arise in spatial data.

    • Dependency between nearby observations.
    • Location specific effects.
  • Typically, the goal is to predict one or more observations into the future.

  • All standard predictions can be used.
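
As referenced above, a minimal sketch of a time-ordered split using caret's createTimeSlices() (the series y here is a hypothetical placeholder; the point is that each training window precedes its forecast horizon, so no information leaks from the future):

# Sketch: time-ordered resampling with caret::createTimeSlices().
library(caret)
y <- rnorm(100)                            # placeholder series
slices <- createTimeSlices(y, initialWindow = 60,  # first 60 points for training
                           horizon = 10,           # forecast the next 10 points
                           fixedWindow = TRUE)
str(slices$train[[1]])                     # indices 1:60
str(slices$test[[1]])                      # indices 61:70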

Things that should be considered:

  • “spurious correlations”: random variables can be correlated without being causally related (by coincidence, or through other unseen factors); these are also common in geographic analysis.
  • extrapolation beyond the range of the observed data.

Unsupervised Prediction

Iris example ignoring species labels

data("iris")
library(ggplot2)
inTrain <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
train <- iris[inTrain,]
test <- iris[-inTrain,]

dim(train)
## [1] 105   5
dim(test)
## [1] 45  5
kMeans1 <- kmeans(subset(train, select = -c(Species)), centers = 3)
train$clusters <- as.factor(kMeans1$cluster)
qplot(Petal.Width, Petal.Length, colour = clusters, data = train)

Compare to real labels

table(kMeans1$cluster, train$Species)

Build predictor

fitRPART <- train(clusters ~., data = subset(train, select = -c(Species)), method = 'rpart')
table(predict(fitRPART, train), train$Species)
##    
##     setosa versicolor virginica
##   1     35          0         0
##   2      0         35        14
##   3      0          0        21

Apply on test

predRPART <- predict(fitRPART, test)
table(predRPART, test$Species)
##          
## predRPART setosa versicolor virginica
##         1     15          0         0
##         2      0         15         6
##         3      0          0         9