Boosting and Greedy Algorithms (III)

Josh Day
March 24, 2014

Sections 12.6 and 12.7

Introduction

  • Review L2 Boosting
  • Nonparametric curve estimation
    • L2 Boosting using kernel estimators
  • L2 Boosting for high-dimensional linear models
    • Connection with LASSO
  • Forward Selection and Orthogonal Matching Pursuit

L2 Boosting

Review - What is L2 Boosting?

  • Functional gradient descent using the squared error loss.
  • This amounts to repeatedly refitting the base procedure to the current residuals, as sketched below. [1]
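A minimal sketch of this idea (an assumed illustration, not the gbm implementation used later): a hypothetical helper that repeatedly fits a shallow rpart tree to the current residuals and adds a shrunken version of each fit.

library(rpart)

l2boost <- function(x, y, M = 100, nu = 0.1) {
  # x: data frame of predictors, y: numeric response
  fhat <- rep(mean(y), length(y))            # f^[0]: start at the mean
  fits <- vector("list", M)
  for (m in 1:M) {
    r <- y - fhat                            # current residuals
    fits[[m]] <- rpart(r ~ ., data = data.frame(x, r = r), maxdepth = 2)
    fhat <- fhat + nu * predict(fits[[m]])   # shrunken update (step size nu)
  }
  list(fits = fits, fitted = fhat)
}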

L2 Boosting Example

library(MASS)          # Data (Boston)
library(rpart)         # Trees
library(ipred)         # Bagging
library(randomForest)  # Random Forest
library(gbm)           # L2 Boosting

n <- nrow(Boston)                       # 506 observations
train <- sample(1:n, floor(.66 * n))    # ~2/3 of rows for training
Boston.train <- Boston[train, ]
Boston.validate <- Boston[-train, ]

L2 Boosting Example

  • MSE on the validation set (fitting code sketched below):
    • rpart: 17.3746
    • Bagging (ipred): 12.4235
    • Random Forest: 10.8361
    • L2 Boosting (gbm): 10.6906
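The fits behind these numbers are not shown on the slides; a hedged sketch of how they might be produced, using the packages and train/validation split loaded above (the gbm tuning values here are assumptions, so exact MSEs will differ):

fit.rpart <- rpart(medv ~ ., data = Boston.train)
fit.bag   <- bagging(medv ~ ., data = Boston.train)
fit.rf    <- randomForest(medv ~ ., data = Boston.train)
fit.gbm   <- gbm(medv ~ ., data = Boston.train, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.1)

mse <- function(pred) mean((Boston.validate$medv - pred)^2)
mse(predict(fit.rpart, Boston.validate))
mse(predict(fit.bag,   Boston.validate))
mse(predict(fit.rf,    Boston.validate))
mse(predict(fit.gbm,   Boston.validate, n.trees = 1000))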


Nonparametric Curve Estimation

Nonparametric Curve Estimation

  • Setup
    • Estimating the regression function \( E(Y \;|\; X=x) \) for univariate \( X \) and continuous \( Y \)

\[ Y_i = f^0(X_i)+\epsilon_i, \;i=1,...,n \] \[ \epsilon_i \sim_{iid} [0, \;\sigma^2] \]

  • \( f^0(\cdot) \) is a real-valued, typically nonlinear, function

Nonparametric Curve Estimation

  • Suppose a linear base procedure with hat matrix \( H \)
    • Examples include nonparametric kernel smoothers and smoothing splines
  • Let \( B_m \) be the hat matrix at iteration \( m \) (so \( B_m Y = \hat{Y}_m \))
  • For simplicity, take \( \hat{f}^{[0]}\equiv0 \) and \( \nu=1 \)

\[ \begin{aligned} B_m &= B_{m-1}+H(I-B_{m-1}) \\ \Rightarrow\; I-B_m &= (I-H)(I-B_{m-1}) \\ \Rightarrow\; B_m &= I-(I-H)^m \quad (\text{since } B_1 = H) \end{aligned} \]
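A small numerical check of this identity (the Gaussian-kernel smoother, grid size, and bandwidth below are assumptions for illustration): build the row-normalized hat matrix \( H \) of a Nadaraya-Watson smoother on a fixed grid and confirm the recursion reproduces \( I-(I-H)^m \).

n <- 50
x <- (1:n) / n
K <- outer(x, x, function(a, b) dnorm((a - b) / 0.1))
H <- K / rowSums(K)                   # row-normalized smoother (hat) matrix

B <- H                                # B_1 = H  (f^[0] = 0, nu = 1)
for (m in 2:10) B <- B + H %*% (diag(n) - B)

ImH10 <- Reduce(`%*%`, replicate(10, diag(n) - H, simplify = FALSE))
max(abs(B - (diag(n) - ImH10)))       # ~ 0: recursion matches the closed form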

Nonparametric Curve Estimation

  • Insights:
    • If \( ||I-H||<1 \) for a suitable norm, then \( B_m \) converges to \( I \) as \( m\to\infty \)
    • Hence the fitted values \( \hat{Y}_m = B_m Y \) converge to \( Y \) itself
    • Iterations must therefore be stopped early to avoid overfitting


L2 Boosting Using Kernel Estimators

Nadaraya-Watson kernel estimator [1]:

\[ \hat{f}(x) = \frac{\sum_{i=1}^n K_h(x-X_i)\,Y_i}{\sum_{i=1}^n K_h(x-X_i)} \]

where:

  • \( h>0 \) is the bandwidth
  • \( K(\cdot) \) is a kernel in the form of a probability density symmetric around 0
  • \( K_h(x)=\frac{1}{h}K\!\left(\frac{x}{h}\right) \)

L2 Boosting Using Kernel Estimators

  • Using the Nadaraya-Watson kernel estimator under the following conditions:
    • \( \hat{f}^{[0]}\equiv0, \; \nu=1 \)
    • Fixed design points \( X_i=\frac{i}{n} \)
  • L2 Boosting with \( m=2 \) iterations amounts to a kernel estimator with a higher-order kernel (see the sketch below)
    • This gives an intuitive explanation of how boosting can improve on the base procedure.
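A short sketch of this "twicing" step (the helper function, bandwidth, and example data below are assumptions, not from the slides): the second boosting iteration simply adds a Nadaraya-Watson fit of the first iteration's residuals.

nw <- function(y, x, h) {
  W <- outer(x, x, function(a, b) dnorm((a - b) / h))
  drop((W / rowSums(W)) %*% y)        # NW fit evaluated at the design points
}

x <- (1:100) / 100                    # fixed design X_i = i/n
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
f1 <- nw(y, x, h = 0.1)               # base NW fit (m = 1)
f2 <- f1 + nw(y - f1, x, h = 0.1)     # boosted fit (m = 2), i.e. "twicing"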

L2 Boosting For High-Dimensional Linear Models

L2 Boosting For High-Dimensional Linear Models

\[ Y_i = \sum_{j=1}^p \beta_j X_i^{(j)}+\epsilon_i, \; i=1,...,n \]

where

  • \( \epsilon_i \sim_{iid} [0, \sigma^2] \)
  • \( \epsilon_i,\; i=1,...,n \) independent of all \( X_i \)
  • Estimation can be done with the component-wise linear least squares base procedure

L2 Boosting For High-Dimensional Linear Models

  • Component-wise linear least squares base procedure:
    • in every iteration, fits the single predictor variable that reduces the residual sum of squares the most (a sketch follows this list)
  • This method shares some basic properties with the Lasso:
    • when stopping early (needed to avoid overfitting), L2 Boosting (often) does variable selection, and the coefficient estimates \( \hat{\beta}^{[m]} \) are (typically) shrunken versions of the least squares estimate \( \hat{\beta}_{OLS} \).
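A minimal sketch of component-wise L2 Boosting for the linear model (the function name, the standardization of the predictors, and the step size \( \nu \) are assumptions; the slides describe the algorithm, not this code):

l2boost.lm <- function(X, y, M = 200, nu = 0.1) {
  X <- scale(X)                               # standardize predictors
  beta <- rep(0, ncol(X))
  r <- y - mean(y)                            # start from the mean
  for (m in 1:M) {
    b <- drop(crossprod(X, r)) / colSums(X^2) # univariate LS coefficient per predictor
    j <- which.max(b^2 * colSums(X^2))        # predictor reducing RSS the most
    beta[j] <- beta[j] + nu * b[j]            # update only the selected coefficient
    r <- r - nu * b[j] * X[, j]
  }
  beta                                        # sparse, shrunken coefficients
}
# e.g. l2boost.lm(as.matrix(Boston[, -14]), Boston$medv) with MASS loaded as above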

Connection to LASSO

  • Efron et al. (2004) consider Forward Stagewise Linear Regression (FSLR), a boosting-type algorithm with small step sizes
    • As the step size tends to 0, its solution path is equivalent to the LASSO path obtained by varying the regularization parameter \( \lambda \) (see the sketch below). [2]
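A quick way to look at this path equivalence on the Boston data (the lars package is an assumption; it is not used elsewhere in these slides): lars computes both the LASSO path and the infinitesimal forward stagewise path.

library(lars)
library(MASS)                                 # Boston
X <- as.matrix(Boston[, -14])                 # predictors (medv is column 14)
y <- Boston$medv
plot(lars(X, y, type = "lasso"))              # LASSO coefficient path
plot(lars(X, y, type = "forward.stagewise"))  # forward stagewise path (small steps)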


Forward Selection and Orthogonal Matching Pursuit

Forward Selection and Orthogonal Matching Pursuit

Consider the generalized linear model

\[ g\big(E[Y_i \mid X_i]\big) = \sum_{j=1}^p \beta_j X_i^{(j)}, \quad i=1,...,n, \]

with link function \( g(\cdot) \) and empirical risk function

\[ Q_n(\beta) = n^{-1}\sum_{i=1}^n \rho\Big(Y_i,\; \sum_{j=1}^p \beta_j X_i^{(j)}\Big), \]

where \( \rho(\cdot,\cdot) \) is a loss function such as the negative log-likelihood.

Forward Selection and Orthogonal Matching Pursuit

In every iteration \( m \), we have the previously selected set of variables \( S^{[m-1]}\subseteq\{1,...,p\} \). For any candidate set \( S\subseteq\{1,...,p\} \), write the restricted empirical risk minimizer

\[ \hat{\beta}(S) = \arg\min_{\beta:\;\beta_j=0 \text{ for all } j\notin S} Q_n(\beta). \]

Both methods below enlarge \( S^{[m-1]} \) by one variable per iteration and set \( \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \).

Forward Selection

\[ \hat{j}_m = \arg\min_{j\notin S^{[m-1]}} Q_n\big(\hat{\beta}(S^{[m-1]}\cup\{j\})\big), \qquad S^{[m]} = S^{[m-1]}\cup\{\hat{j}_m\}, \qquad \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \]

Orthogonal Matching Pursuit

Under squared error loss, with current residuals \( r^{[m-1]} = Y - X\hat{\beta}^{[m-1]} \):

\[ \hat{j}_m = \arg\max_{j\notin S^{[m-1]}} \frac{\big|(X^{(j)})^T r^{[m-1]}\big|}{\|X^{(j)}\|_2}, \qquad S^{[m]} = S^{[m-1]}\cup\{\hat{j}_m\}, \qquad \hat{\beta}^{[m]} = \hat{\beta}(S^{[m]}) \]

Forward Selection and Orthogonal Matching Pursuit

  • OMP is the “boosting version” of Forward Selection
    • Both methods fit \( p-|S^{[m-1]}| \) candidate models in each iteration
      • Forward Selection refits the full model for every candidate variable
      • OMP fits univariate predictors to the current residuals, then refits once on the selected set
  • OMP is computationally much more efficient
  • OMP has statistical properties comparable to L1-penalty methods (a sketch follows below)
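A minimal OMP sketch under squared error loss (the function name and the use of the Boston data are assumptions for illustration): select the predictor most correlated with the current residuals, then refit by least squares on the enlarged active set.

omp <- function(X, y, M = 10) {
  S <- integer(0)                             # active set S^[0]
  r <- y
  for (m in 1:M) {
    score <- abs(drop(crossprod(X, r))) / sqrt(colSums(X^2))
    score[S] <- -Inf                          # never reselect active variables
    S <- c(S, which.max(score))               # S^[m] = S^[m-1] plus best predictor
    fit <- lm(y ~ X[, S, drop = FALSE])       # refit (orthogonal projection)
    r <- resid(fit)
  }
  list(active = S, fit = fit)
}

library(MASS)                                 # Boston, for a quick example
omp(scale(as.matrix(Boston[, -14])), Boston$medv, M = 5)$active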

References

  1. Bühlmann, Peter, and Sara van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Heidelberg: Springer, 2011.
  2. Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009.