Boosting and Greedy Algorithms (III)

Josh Day
March 24, 2014

Sections 12.6 and 12.7

Introduction

  • Review L2 Boosting
  • Nonparametric curve estimation
    • L2 Boosting using kernel estimators
  • L2 Boosting for high-dimensional linear models
    • Connection with LASSO
  • Forward Selection and Orthogonal Matching Pursuit

L2 Boosting

Review - What is L2 Boosting?

  • Functional gradient descent using the squared error loss.
  • This amounts to repeatedly refitting the base procedure to the current residuals, as sketched below. [1]
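A minimal sketch of this idea (an assumed illustration, not the gbm implementation used later): a hypothetical helper that repeatedly fits a shallow rpart tree to the current residuals and adds a shrunken version of each fit.

library(rpart)

l2boost <- function(x, y, M = 100, nu = 0.1) {
  # x: data frame of predictors, y: numeric response
  fhat <- rep(mean(y), length(y))            # f^[0]: start at the mean
  fits <- vector("list", M)
  for (m in 1:M) {
    r <- y - fhat                            # current residuals
    fits[[m]] <- rpart(r ~ ., data = data.frame(x, r = r), maxdepth = 2)
    fhat <- fhat + nu * predict(fits[[m]])   # shrunken update (step size nu)
  }
  list(fits = fits, fitted = fhat)
}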

L2 Boosting Example

library(MASS)          # Data (Boston)
library(rpart)         # Trees
library(ipred)         # Bagging
library(randomForest)  # Random Forest
library(gbm)           # L2 Boosting

n <- nrow(Boston)                       # 506 observations
train <- sample(1:n, floor(.66 * n))    # ~2/3 of rows for training
Boston.train <- Boston[train, ]
Boston.validate <- Boston[-train, ]

L2 Boosting Example

  • MSE on the validation set (fitting code sketched below):
    • rpart: 17.3746
    • Bagging (ipred): 12.4235
    • Random Forest: 10.8361
    • L2 Boosting (gbm): 10.6906
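The fits behind these numbers are not shown on the slides; a hedged sketch of how they might be produced, using the packages and train/validation split loaded above (the gbm tuning values here are assumptions, so exact MSEs will differ):

fit.rpart <- rpart(medv ~ ., data = Boston.train)
fit.bag   <- bagging(medv ~ ., data = Boston.train)
fit.rf    <- randomForest(medv ~ ., data = Boston.train)
fit.gbm   <- gbm(medv ~ ., data = Boston.train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.1)

mse <- function(pred) mean((Boston.validate$medv - pred)^2)
mse(predict(fit.rpart, Boston.validate))
mse(predict(fit.bag,   Boston.validate))
mse(predict(fit.rf,    Boston.validate))
mse(predict(fit.gbm,   Boston.validate, n.trees = 1000))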


Nonparametric Curve Estimation

Nonparametric Curve Estimation

  • Setup
    • Estimating the regression function \( E(Y \;|\; X=x) \) for univariate \( X \) and continuous \( Y \)

\[ Y_i = f^0(X_i)+\epsilon_i, \;i=1,...,n \] \[ \epsilon_i \sim_{iid} [0, \;\sigma^2] \]

  • \( f^0(\cdot) \) is a real-valued, typically nonlinear, function

Nonparametric Curve Estimation

  • Suppose a linear base procedure with hat matrix \( H \)
    • Examples include nonparametric kernel smoothers and smoothing splines
  • Let \( B_m \) be the hat matrix at iteration \( m \) (so \( B_m Y = \hat{Y}_m \))
  • For simplicity, take \( \hat{f}^{[0]}\equiv0 \) and \( \nu=1 \)

\[ \begin{aligned} B_m &= B_{m-1}+H(I-B_{m-1}) \\ \Rightarrow\; I-B_m &= (I-H)(I-B_{m-1}) \\ \Rightarrow\; B_m &= I-(I-H)^m \quad (\text{since } B_1 = H) \end{aligned} \]
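A small numerical check of this identity (the Gaussian-kernel smoother, grid size, and bandwidth below are assumptions for illustration): build the row-normalized hat matrix \( H \) of a Nadaraya-Watson smoother on a fixed grid and confirm the recursion reproduces \( I-(I-H)^m \).

n <- 50
x <- (1:n) / n
K <- outer(x, x, function(a, b) dnorm((a - b) / 0.1))
H <- K / rowSums(K)                   # row-normalized smoother (hat) matrix

B <- H                                # B_1 = H  (f^[0] = 0, nu = 1)
for (m in 2:10) B <- B + H %*% (diag(n) - B)

ImH10 <- Reduce(`%*%`, replicate(10, diag(n) - H, simplify = FALSE))
max(abs(B - (diag(n) - ImH10)))       # ~ 0: recursion matches the closed form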

Nonparametric Curve Estimation

  • Insights:
    • If \( ||I-H||<1 \) for a suitable norm, then \( B_m \) converges to \( I \) as \( m\to\infty \)
    • Hence the fitted values \( \hat{Y}_m = B_m Y \) converge to \( Y \) itself
    • Iterations must therefore be stopped early to avoid overfitting


L2 Boosting Using Kernel Estimators

Nadaraya-Watson kernel estimator [1]:

\[ \hat{f}(x) = \frac{\sum_{i=1}^n K_h(x-X_i)\,Y_i}{\sum_{i=1}^n K_h(x-X_i)} \]

where:

  • \( h>0 \) is the bandwidth
  • \( K(\cdot) \) is a kernel in the form of a probability density symmetric around 0
  • \( K_h(x)=\frac{1}{h}K\!\left(\frac{x}{h}\right) \)

L2 Boosting Using Kernel Estimators

  • Using the Nadaraya-Watson kernel estimator under the following conditions:
    • \( \hat{f}^{[0]}\equiv0, \; \nu=1 \)
    • Fixed design points \( X_i=\frac{i}{n} \)
  • L2 Boosting with \( m=2 \) iterations amounts to a kernel estimator with a higher-order kernel (see the sketch below)
    • This gives an intuitive explanation of how boosting can improve on the base procedure.
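A short sketch of this "twicing" step (the helper function, bandwidth, and example data below are assumptions, not from the slides): the second boosting iteration simply adds a Nadaraya-Watson fit of the first iteration's residuals.

nw <- function(y, x, h) {
  W <- outer(x, x, function(a, b) dnorm((a - b) / h))
  drop((W / rowSums(W)) %*% y)        # NW fit evaluated at the design points
}

x <- (1:100) / 100                    # fixed design X_i = i/n
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
f1 <- nw(y, x, h = 0.1)               # base NW fit (m = 1)
f2 <- f1 + nw(y - f1, x, h = 0.1)     # boosted fit (m = 2), i.e. "twicing"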

L2 Boosting For High-Dimensional Linear Models

L2 Boosting For High-Dimensional Linear Models

\[ Y_i = \sum_{j=1}^p \beta_j X_i^{(j)}+\epsilon_i, \; i=1,...,n \]

where

  • \( \epsilon_i \sim_{iid} [0, \sigma^2] \)
  • \( \epsilon_i,\; i=1,...,n \) independent of all \( X_i \)
  • Estimation can be done with the component-wise linear least squares base procedure

L2 Boosting For High-Dimensional Linear Models

  • Component-wise linear least squares base procedure:
    • in every iteration, fits the single predictor variable that reduces the residual sum of squares the most (a sketch follows this list)
  • This method shares some basic properties with the Lasso:
    • when stopping early (needed to avoid overfitting), L2 Boosting (often) does variable selection, and the coefficient estimates \( \hat{\beta}^{[m]} \) are (typically) shrunken versions of the least squares estimate \( \hat{\beta}_{OLS} \).
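A minimal sketch of component-wise L2 Boosting for the linear model (the function name, the standardization of the predictors, and the step size \( \nu \) are assumptions; the slides describe the algorithm, not this code):

l2boost.lm <- function(X, y, M = 200, nu = 0.1) {
  X <- scale(X)                               # standardize predictors
  beta <- rep(0, ncol(X))
  r <- y - mean(y)                            # start from the mean
  for (m in 1:M) {
    b <- drop(crossprod(X, r)) / colSums(X^2) # univariate LS coefficient per predictor
    j <- which.max(b^2 * colSums(X^2))        # predictor reducing RSS the most
    beta[j] <- beta[j] + nu * b[j]            # update only the selected coefficient
    r <- r - nu * b[j] * X[, j]
  }
  beta                                        # sparse, shrunken coefficients
}
# e.g. l2boost.lm(as.matrix(Boston[, -14]), Boston$medv) with MASS loaded as above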

Connection to LASSO

  • Efron et al. (2004) consider Forward Stagewise Linear Regression (FSLR), a boosting-type algorithm with small step sizes
    • As the step size tends to 0, its solution path is equivalent to the LASSO path obtained by varying the regularization parameter \( \lambda \) (see the sketch below). [2]
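A quick way to look at this path equivalence on the Boston data (the lars package is an assumption; it is not used elsewhere in these slides): lars computes both the LASSO path and the infinitesimal forward stagewise path.

library(lars)
library(MASS)                                 # Boston
X <- as.matrix(Boston[, -14])                 # predictors (medv is column 14)
y <- Boston$medv
plot(lars(X, y, type = "lasso"))              # LASSO coefficient path
plot(lars(X, y, type = "forward.stagewise"))  # forward stagewise path (small steps)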


Forward Selection and Orthogonal Matching Pursuit

Forward Selection and Orthogonal Matching Pursuit

Consider the generalized linear model

\[ g\big(E[Y_i \mid X_i]\big) = \sum_{j=1}^p \beta_j X_i^{(j)}, \quad i=1,...,n, \]

with link function \( g(\cdot) \) and empirical risk function

\[ Q_n(\beta) = n^{-1}\sum_{i=1}^n \rho\Big(Y_i,\; \sum_{j=1}^p \beta_j X_i^{(j)}\Big), \]

where \( \rho(\cdot,\cdot) \) is a loss function such as the negative log-likelihood.

Forward Selection and Orthogonal Matching Pursuit

In every iteration \( m \), we have the previously selected set of variables \( S^{[m-1]}\subseteq\{1,...,p\} \). For any candidate set \( S\subseteq\{1,...,p\} \), write the restricted empirical risk minimizer

\[ \hat{\beta}(S) = \arg\min_{\beta:\;\beta_j=0 \text{ for all } j\notin S} Q_n(\beta). \]

Both methods below enlarge \( S^{[m-1]} \) by one variable per iteration and set \( \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \).

Forward Selection

\[ \hat{j}_m = \arg\min_{j\notin S^{[m-1]}} Q_n\big(\hat{\beta}(S^{[m-1]}\cup\{j\})\big), \qquad S^{[m]} = S^{[m-1]}\cup\{\hat{j}_m\}, \qquad \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \]

Orthogonal Matching Pursuit

Under squared error loss, with current residuals \( r^{[m-1]} = Y - X\hat{\beta}^{[m-1]} \):

\[ \hat{j}_m = \arg\max_{j\notin S^{[m-1]}} \frac{\big|(X^{(j)})^T r^{[m-1]}\big|}{\|X^{(j)}\|_2}, \qquad S^{[m]} = S^{[m-1]}\cup\{\hat{j}_m\}, \qquad \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \]

Forward Selection and Orthogonal Matching Pursuit

  • OMP is the “boosting version” of Forward Selection
    • Both methods fit \( p-|S^{[m-1]}| \) candidate models in each iteration
      • Forward Selection refits the full model for every candidate variable
      • OMP fits univariate predictors to the current residuals, then refits once on the selected set
  • OMP is computationally much more efficient
  • OMP has statistical properties comparable to L1-penalty methods (a sketch follows below)
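A minimal OMP sketch under squared error loss (the function name and the use of the Boston data are assumptions for illustration): select the predictor most correlated with the current residuals, then refit by least squares on the enlarged active set.

omp <- function(X, y, M = 10) {
  S <- integer(0)                             # active set S^[0]
  r <- y
  for (m in 1:M) {
    score <- abs(drop(crossprod(X, r))) / sqrt(colSums(X^2))
    score[S] <- -Inf                          # never reselect active variables
    S <- c(S, which.max(score))               # S^[m] = S^[m-1] plus best predictor
    fit <- lm(y ~ X[, S, drop = FALSE])       # refit (orthogonal projection)
    r <- resid(fit)
  }
  list(active = S, fit = fit)
}

library(MASS)                                 # Boston, for a quick example
omp(scale(as.matrix(Boston[, -14])), Boston$medv, M = 5)$active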

References

  1. Bühlmann, Peter, and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Heidelberg: Springer, 2011.
  2. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009.