ID5059 Lecture 10 - bootstrapping and bagging

C. Donovan
02 March 2018

Administrivia

  • Industrial action is continuing - dates are on UCU.org.uk

Big picture

  • Until now, we've been looking at finding just one (sort of best) model to describe our data
  • In many situations, we get better predictions by combining models
  • Broadly these are referred to as ensembles
  • Here we look at a simple early version often applied to trees: bagging (from bootstrap aggregation)
  • In short - make lots of trees and combine them (Random forests are a variant we'll also look at)

Combining models can be good!

Big picture

This makes some sense:

  • When we did model selection with a greedy search, we didn't expect to find the optimal model
  • When we do model selection, a different sample might well lead to a different finishing point
  • Relatedly, if we randomly perturb our data and model select, we often get different answers

Our best model wasn't all that special, so why settle for one?

Big picture

This idea also aligns with evidence/ideas elsewhere:

  • The wisdom of crowds (in some respects)
  • Model averaging as found in statistics (average predictions from models weighted by their AIC)
  • Kaggle competitions are often won by some top competitors joining forces and averaging their models in some way

Road map

Sooo, today we'll:

  • look at bootstrapping (bagging: bootstrap aggregation you'll recall)
  • see how unreliable a single tree is
  • show how to stick lots together to get a better tool

Resampling

  • We have seen and used resampling

    • cross validation & \( k \)-fold CV
  • We've also briefly discussed OOB validation

    • Create \( k \) datasets of size \( N \) by sampling from \( \mathbf{y} \) and \( \mathbf{X} \) with replacement
    • Get \( k \) test errors using the non-chosen rows of \( \mathbf{y} \) and \( \mathbf{X} \) as “unseen” data (a sketch follows below)
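A minimal sketch of this OOB idea, with made-up data and a regression tree standing in for the predictive model (illustrative only, not the lecture's own code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # any predictive model would do

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                             # toy predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

k, n = 50, len(y)
oob_errors = []
for _ in range(k):
    boot = rng.integers(0, n, size=n)                     # rows drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)                # rows never drawn = "unseen" data
    model = DecisionTreeRegressor().fit(X[boot], y[boot])
    oob_errors.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))

print(np.mean(oob_errors))                                # average OOB test error (MSE)
```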

Nonparametric bootstrapping

Concept is simple

  • Resample with replacement many times to get lots of different datasets of size \( n \)
  • Do something with each dataset - collectively, the results tell us how sensitive our answers are to random perturbation (e.g. the precision of estimates given sampling variability); see the sketch below
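As a minimal sketch (with simulated data, not tied to any dataset from the course), bootstrapping the sample mean to gauge its precision:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)       # an illustrative sample of size n

B = 2000
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])     # recompute the statistic on each resample

print(x.mean())                                # original estimate
print(boot_means.std())                        # bootstrap estimate of its standard error
print(np.percentile(boot_means, [2.5, 97.5]))  # simple percentile 95% CI
```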

Sensitivity of results to perturbation

Let's look at trees we've fitted already. How stable were they?

I'm neither surprised nor filled with confidence.

Inferential motivation

Bootstrapping was originally developed to help with the core statistical problem: different samples will give different estimates! Statistics is all about quantifying uncertainty (and hence assessing confidence in “answers”)

  • refer ppt slides

Traditional approach

  • Assume/posit/show that the data come from some distribution
  • Assume/posit/show that the data has been randomly sampled
  • Derive a sampling distribution for the statistic(s) under consideration
  • Base classical analysis on this distribution
  • Run statistical tests, get \( p \)-values, etc.

Problems with the traditional approach

  • If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate (we'll ignore asymptotic bits).
  • The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.

New approach

  • Estimate the sampling distribution of the statistic empirically
  • No assumptions about population distributions
  • No explicit derivation of sampling distributions
  • Treat our \( \mathbf{y} \) and \( \mathbf{X} \) as an estimate of the population
  • Each row of our \( \mathbf{y} \) and \( \mathbf{X} \) will be chosen with probability \( 1/n \)

  • Claim: this mimics the original sampling of \( \mathbf{y} \) and \( \mathbf{X} \) from some population

Bootstraps

CI for linear regression: "trad" and bootstrap
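A hedged sketch of the comparison on this slide, using simulated data and statsmodels (variable names and data are illustrative): a traditional 95% CI for a regression slope alongside a percentile bootstrap CI.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=80)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(fit.conf_int()[1])                       # "trad" 95% CI for the slope

B, n = 2000, len(y)
slopes = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # resample (x, y) pairs together
    slopes.append(sm.OLS(y[idx], X[idx]).fit().params[1])
print(np.percentile(slopes, [2.5, 97.5]))      # percentile bootstrap 95% CI
```

With well-behaved data like these the two intervals should agree closely; the bootstrap earns its keep when the traditional assumptions are doubtful.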

Summary

  • Bootstrapping can be computationally expensive
  • The results are extremely useful
  • Good CIs for any or all of the statistics associated with a dataset or a model
  • Care is needed when choosing the type of CI, e.g. with dependent data such as repeated measures on subjects
  • Large area - refer to Efron for seminal work.

Bagging

What you need to know

  • The general process behind bagging of a predictive model. You should be able to describe the process algorithmically.
  • How the predictions are produced for both quantitative and categorical response type models.
  • The benefits and shortcomings of bagging. In particular, where you'd expect a predictive process to benefit from bagging.

History

  • Proposed by Breiman (1996) - one of the creators of CART
  • Basically bootstrapping a la statistics, but we're interested in predictions as opposed to inference

Overview

A high-level description of the method is simple:

  • create a number of bootstrapped datasets,
  • fit the predictive model to each,
  • combine the predictions of the multiple models

Note we have two classes of predictive model in mind: a quantitative response (i.e. regression type problems) or a categorical response (classification problems).

More specifically

Following the original notation of Breiman (1996):

  • Take \( \mathcal{L} \) as some learning/training set of data consisting of \( (y_i, \mathbf{x}_i), i = 1, \dots, n \)
  • By standard sampling with replacement produce \( M \) variants of these from our original data: \( \mathcal{L}_m \) (\( m = 1, 2, \dots, M \))
  • We can produce predictors for each of the \( \mathcal{L}_m \) giving \( \hat{f}_m(\mathbf{X}_{\mathcal{L}_m}) \).
  • ‘Combine’ this multitude of models.

What? Bootstrap your data \( M \) times and fit a model to each, giving \( M \) models. Want a prediction? You now get \( M \) of them.
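A minimal hand-rolled version of this, under the assumption that the base model is a regression tree and the data are simulated (sklearn's BaggingRegressor wraps the same loop if you'd rather not write it yourself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))                  # toy learning set L
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.3, size=300)

M, n = 100, len(y)
models = []
for m in range(M):
    idx = rng.integers(0, n, size=n)           # L_m: rows sampled with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# `models` now holds the M fitted predictors, one per bootstrap dataset
```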

Combining bootstrapped predictions

  • Quantitative response: take the average of the \( \hat{y}_m \).
  • For a given input vector \( \mathbf{x} \), pass this down through all \( M \) models giving \( M \) predictions. A simple mean gives the bagged prediction.
  • Categorical response (i.e. a classification problem): seek a ‘vote’ across the collective classifiers.
  • For a given input vector \( \mathbf{x} \), pass this down each of our \( M \) classifiers to give a predicted class \( \hat{j} \) in each case. The ‘vote’ is the class with the greatest frequency of predictions.

Note - the type of model is not specified, it can be anything.
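A sketch of the two combination rules, assuming we already have a list `models` of \( M \) fitted predictors (e.g. from the previous sketch) and a new input matrix `X_new` (both names hypothetical):

```python
import numpy as np

def bagged_predict_regression(models, X_new):
    # quantitative response: average the M predictions for each new observation
    preds = np.column_stack([m.predict(X_new) for m in models])   # n_new x M
    return preds.mean(axis=1)

def bagged_predict_classification(models, X_new):
    # categorical response: plurality vote - the most frequently predicted class wins
    preds = np.column_stack([m.predict(X_new) for m in models])   # n_new x M
    votes = []
    for row in preds:
        classes, counts = np.unique(row, return_counts=True)
        votes.append(classes[counts.argmax()])
    return np.array(votes)
```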

Performance

  • The resulting aggregate predictions will have lower variability than the individual predictions generated from a single classifier/model. Simulations and analyses of real data found that the improvement in performance was marked.
  • Across a large number of datasets, reductions in misclassification rates ranged from 6% to 77%, but were typically in the range of 20%-40%.

(of course, they only published good results yeah?)

It's really simple

  • The process is very simple to apply, even without dedicated software
  • Requires only:

    • a front-end that will perform the bootstrap sampling of the data,
    • a predictive modelling tool,
    • and a back-end that will gather the results together.

How many? You can use a validation dataset to check that the generalisation error has plateaued for your \( M \) - see the sketch below.
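One illustrative way to do that check (simulated data; the train/validation split and the tree base learner are assumptions, not part of the lecture): grow the bagged ensemble one tree at a time and watch the validation error until it flattens out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.3, size=500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

running_sum = np.zeros(len(y_val))             # running sum of validation predictions
for m in range(1, 201):
    idx = rng.integers(0, len(y_tr), size=len(y_tr))
    tree = DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx])
    running_sum += tree.predict(X_val)
    val_mse = np.mean((y_val - running_sum / m) ** 2)   # error of the m-tree ensemble
    if m % 25 == 0:
        print(m, round(val_mse, 4))            # watch for the plateau
```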

It's really simple

Let's do one…

Pros and cons

  • There is no particular gain for models that do not have some inherent instability (evidenced under perturbation)
  • For example, application to a \( k \)-nearest neighbour classifier may show no improvement, whereas a classification tree might show great gains.
  • Something like ordinary regression, where an automated model selection process proves quite variable under perturbation, would similarly benefit.

Pros and cons

  • Low gains are to be expected if the base model is already giving good performance - e.g. its misclassification rate is close to the best achievable.
  • There is a loss of interpretability. We have predictions from multiple models, so the exact nature of the model and the contribution of the variables is complex.

Review

In a nutshell:

  • We build lots of datasets by (non-parametric) bootstrapping of our sample, i.e. sampling with replacement, each with \( n \) observations.
  • We fit a model to each, e.g. classification or regression trees
  • We make predictions by passing new data through all the models
  • We aggregate the predictions either by averaging (numeric responses) or by using a voting system (plurality: the most popular class wins)

The aggregate of lots of unstable models is better than the (almost random) ‘best’ one you get from a single fit to all the data.