ID5059 Lecture 12 - more boosting (real boost)

C. Donovan
04 April 2018

Administrivia

Group 2 project, hopefully you've seen it, started it (or talked about starting, or thought about talking about starting, or thinking “project 2?”)
Marking for project 1 - I'll aim for mid-next week
A little reshuffle of content given industrial action.
- We'll get onto neural nets asap (3-4 lectures)
- Reading for random forests and extreme gradient boosting
- Brief clustering (1 instead of 2)
- Some tidying up: ROC, gain, etc
- SVMs in remaining

If it's not in the lecture or lab, it's not in the exam

Big picture

Until now, we've been looking at finding just one (sort of best) model to describe our data
In many situations, we get better predictions by combining models
Broadly these are referred to as ensembles
Here we look at boosting
In short - fit lots of models changing the observation weights to favour hard-to-predict observations

combining models be good!

Ought-to-knows

How boosting works - produce a schematic or pseudo-code etc.
Details of two variants (Adaboost and RealBoost)
What is meant by: a weak-learner, a stump, an Out-Of-Bag (OOB) estimate
Rules of thumb: how many iterations? what complexity of base learner should be used?
Where boosting should work, where it probably won't
General pros/cons of the method

High-level view

In principle another simple process

Step 1: Fit a classification model (classifier) and find cases that are misclassified
Step 2: Re-weight misclassified cases, increasing their importance
Step 3: Refit a classification model
Step 4: Determine new weights and iterate
Step 5: Combine all these models

We'll looked at adaboost, we now look at realBoost

Real AdaBoost - one-page version

The problem is similar to before:

Two class problem \( y=(-1,1) \)
seek (weak) classifier models/functions which provide probability estimates \( p(\mathbf{x})=P(y=1|\mathbf{x})\in [0,1] \) and
use a set of weights \( w_i (i=1,..,n) \).

Real AdaBoost - 1 page hard to digest

Initialise the algorithm with \( w_i=1/n \quad(i=1,..,n) \) and repeat the following steps for \( m=1,..,M \):

Classifier model produces class probabilities \( \mathbf{p}_m(\mathbf{X}) \) on the training data
Define \( f_m(\mathbf{X})=\frac{1}{2}\log\left(\frac{\mathbf{p}_m(\mathbf{X})}{(1-\mathbf{p}_m(\mathbf{X}))}\right)\in \mathcal{R} \)
Update the weights \( w_{(m+1,i)}=w_{m,i}e^{-yf_m(\mathbf{x}_i)}, \quad(i=1,..,n) \)
Normalise weights \( \sum_{i=1}^n w_{m+1,i}=1 \)

Iterate over \( m \)

Class prediction for new data \( \mathbf{x}^* \) is sign\( (\sum_{m=1}^Mf_m(\mathbf{x^*})) \)

High-level view

In principle another simple process

Step 1: Fit a classification model (classifier) and find cases that are misclassified
Step 2: Re-weight misclassified cases, increasing their importance
Step 3: Refit a classification model
Step 4: Determine new weights and iterate
Step 5: Combine all these models

Detailed view step 1 - fitting

We have a covariate set \( \mathbf{X} \) (\( n\times p \)) and our target/response \( \mathbf{y} \) (\( n\times 1 \)), which can be considered as just a binary vector recoded to \( -1,1 \) start.
Set initial weights for each of the \( n \) observations to be \( \frac{1}{n} \) (note these sum to one). Start iterations at \( m=1 \).
Fit classifier and predict probabilities for \( i=1,...,n \) i.e. \( \mathbf{p}_m(\mathbf{X}) \) to get this:

\[ f_m(\mathbf{X})=\frac{1}{2}\log\left(\frac{\mathbf{p}_m(\mathbf{X})}{(1-\mathbf{p}_m(\mathbf{X}))}\right) \]

At it's heart, it is \( p/(1-p) \), which are odds

Detailed view step 1 - odds?

At it's heart, it is \( p/(1-p) \), which are odds

\[ f_m(\mathbf{X})=\frac{1}{2}\log\left(\frac{\mathbf{p}_m(\mathbf{X})}{(1-\mathbf{p}_m(\mathbf{X}))}\right) \]

this thing gets big if our model is confident \( y=1 \),
this thing gets “small” (big negative values) if our model is confident \( y=-1 \),
this thing tends to 0 if our model can't pick

Detailed view step 1 - odds?

converting our predicted probabilities to log-odds - similar to logistic regression/GLM with logit link an binomial errors

ultimately predict class \( y=1 \) if \( \sum_m \log\left(\frac{p_m}{(1-p_m)}\right)>0 \) else predict class \( y=-1 \).
if \( p(\mathbf{x})=(1-p(\mathbf{x})), f(\mathbf{x})=0 \)
if \( p(\mathbf{x})>(1-p(\mathbf{x})) \) then \( \log\left(\frac{p_m}{(1-p_m)}\right)>0 \) i.e. \( y=1 \) is most likely.
if \( p(\mathbf{x})<(1-p(\mathbf{x})) \) then \( \log\left(\frac{p_m}{(1-p_m)}\right)<0 \) i.e. \( y=-1 \) is most likely.

High-level view

In principle another simple process

Step 1: Fit a classification model (classifier) and find cases that are misclassified
Step 2: Re-weight misclassified cases, increasing their importance
Step 3: Refit a classification model
Step 4: Determine new weights and iterate
Step 5: Combine all these models

Detailed view step 2: reweighting

Reweight for \( m+1 \) using:

\[ w_{(m+1,i)}=w_{m,i}e^{-yf_m(\mathbf{x}_i)}, \quad(i=1,..,n) \]

WTF?

Detailed view step 2: WTF explained

Consider the weight updating function \( we^{-yf(\mathbf{x})} \):

\( e^{-yf(\mathbf{x})}>1 \) if \( y=-1 \) and \( f(\mathbf{x})>0 \) i.e. implies false positive.
\( e^{-yf(\mathbf{x})}>1 \) if \( y=1 \) and \( f(\mathbf{x})<0 \) i.e. implies false negative.
\( e^{-yf(\mathbf{x})}<1 \) if \( y=-1 \) and \( f(\mathbf{x})<0 \) i.e. implies correct negative.
\( e^{-yf(\mathbf{x})}<1 \) if \( y=1 \) and \( f(\mathbf{x})>0 \) i.e. implies correct positive.

So… weights are increased for observations that the model is misclassifying - just as in the ordinary Adaboost.

High-level view

In principle another simple process

Step 1: Fit a classification model (classifier) and find cases that are misclassified
Step 2: Re-weight misclassified cases, increasing their importance
Step 3: Refit a classification model
Step 4: Determine new weights and iterate
Step 5: Combine all these models

Detailed view steps 3-4: reweighting, iterating

For the incorrectly classified observations we have mutliplied by >1
For the correctly classified observations we have mutliplied by <1
The magnitude of the multiplier is related to how wrong/right we were
Confidently right or wrong gets big weight changes up or down

Then, normalise the weights i.e. make \( \sum_{i=1}^nw_{m,i}=1 \)

Iterate over \( m \) - refitting and reweighting
Terminate by looking at generalisation error

High-level view

In principle another simple process

Step 1: Fit a classification model (classifier) and find cases that are misclassified
Step 2: Re-weight misclassified cases, increasing their importance
Step 3: Refit a classification model
Step 4: Determine new weights and iterate
Step 5: Combine all these models

Detailed view step 5: combining

Class prediction for new data \( \mathbf{x}^* \) is sign\( (\sum_{m=1}^Mf_m(\mathbf{x^*})) \) so \( \hat{y}_i \) is either \( -1,1 \).

if the combined models tend to offer \( f \)<0, our we predict -1
if the combined models tend to offer \( f \)>0, our we predict 1

Weights? How do we include these?

If the classifier naturally includes weighted observations, no problem (let's look at such a thing … R:GLMs)
Alternatives?
- Use the weights as a probability measure for a resampling process (e.g. bootstrapping)
- Missclassified observations are therefore progressively over-sampled

Pros and cons

(as usual) Loss of interpretability
More computer intensive (fitting 200-400 times)
Potentially large gains in predictive power
In most cases a boosted classifier will perform as well or better than the type of classifier it is based on
Can be extended to multiple classes (turn into a set of larger binary problems), but this greatly increases the computational effort

What classifier to use?

While theoretically a wide range of classifiers might be used, tree models are most common
Intended(/works well) for use with weak learners - but what is weak?
Original specification built on stumps
More complex (yet usually still weak) base learners have been found to give better performance in some cases.
Two, four or eight terminal nodes are most common - stumps being most common of all.

Practicalities

Rules of thumb

Difficult to overfit, 200-400 iterations often sufficient.
Stumps work pretty well. However some cases benefit from more complex classifiers, but <8 leaves are recommended. Can try several and compare performance. If you a priori believe complex interactions exist, several terminal nodes as base.
Which algorithm? Probably dictated by software - personally favour realBoost or gentleBoost given statistical basis.

Review

In a nutshell:

The predictive aggregate of lots of unstable models is better than the, almost random, best one you get from one fit to all the data.

Short exercise

Everybody on the Left

I want build some software to predict someone's age of death

Everybody on the Right

I want build some software to decide whether I should give someone a loan

Shorrt exercise

What data do I need?
How do I get it?
How do I treat it?
What is the model going to look like?
How do I know it's any good?