ID5059 Lecture 10 - bootstrapping and bagging

C. Donovan
02 March 2018

Administrivia

  • Industrial action is continuing - dates are on UCU.org.uk

Big picture

  • Until now, we've been looking at finding just one (sort of best) model to describe our data
  • In many situations, we get better predictions by combining models
  • Broadly these are referred to as ensembles
  • Here we look at a simple early version often applied to trees: bagging (from bootstrap aggregation)
  • In short - make lots of trees and combine them (Random forests are a variant we'll also look at)

Combining models can be good!

Big picture

This makes some sense:

  • When we did model selection with a greedy search, we didn't expect to find the optimal model
  • When we do model selection, a different sample might well lead to a different finishing point
  • Relatedly, if we randomly perturb our data and model select, we often get different answers

Our best model wasn't all that special, so why settle for one?

Big picture

This idea also aligns with evidence/ideas elsewhere:

  • The wisdom of crowds (in some respects)
  • Model averaging as found in statistics (average predictions from models weighted by their AIC)
  • Kaggle competitions are often won by some top competitors joining forces and averaging their models in some way

Road map

Sooo, today we'll:

  • look at bootstrapping (bagging: bootstrap aggregation you'll recall)
  • see how unreliable a single tree is
  • show how to stick lots together to get a better tool

Resampling

  • We have seen and used resampling

    • cross validation & \( k \)-fold CV
  • We've also briefly discussed OOB validation

    • Create \( k \) datasets of size \( N \) by sampling from \( \mathbf{y} \) and \( \mathbf{X} \) with replacement
    • Get \( k \) test errors using the non-chosen rows of \( \mathbf{y} \) and \( \mathbf{X} \) as “unseen” data (a sketch follows below)
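A minimal sketch of this OOB idea, with made-up data and a regression tree standing in for the predictive model (illustrative only, not the lecture's own code):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # any predictive model would do

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                             # toy predictors
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

k, n = 50, len(y)
oob_errors = []
for _ in range(k):
    boot = rng.integers(0, n, size=n)                     # rows drawn with replacement
    oob = np.setdiff1d(np.arange(n), boot)                # rows never drawn = "unseen" data
    model = DecisionTreeRegressor().fit(X[boot], y[boot])
    oob_errors.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))

print(np.mean(oob_errors))                                # average OOB test error (MSE)
```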

Nonparametric bootstrapping

Concept is simple

  • Resample with replacement many times to get lots of different datasets of size \( n \)
  • Do something with each dataset - collectively, the results tell us how sensitive our answers are to random perturbation (e.g. the precision of estimates given sampling variability); see the sketch below
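As a minimal sketch (with simulated data, not tied to any dataset from the course), bootstrapping the sample mean to gauge its precision:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=100)       # an illustrative sample of size n

B = 2000
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(B)])     # recompute the statistic on each resample

print(x.mean())                                # original estimate
print(boot_means.std())                        # bootstrap estimate of its standard error
print(np.percentile(boot_means, [2.5, 97.5]))  # simple percentile 95% CI
```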

Sensitivity of results to perturbation

Let's look at trees we've fitted already. How stable were they?

I'm neither surprised nor filled with confidence.

Inferential motivation

Bootstrapping was originally developed to help with the core statistical problem: different samples will give different estimates! Statistics is all about quantifying uncertainty (and hence assessing confidence in “answers”)

  • refer ppt slides

Traditional approach

  • Assume/posit/show that the data come from some distribution
  • Assume/posit/show that the data has been randomly sampled
  • Derive a sampling distribution for the statistic(s) under consideration
  • Base classical analysis on this distribution
  • Run statistical tests, get \( p \)-values, etc.

Problems with the traditional approach

  • If the assumptions about the population are wrong, then the corresponding sampling distribution of the statistic may be seriously inaccurate (we'll ignore asymptotic bits).
  • The approach requires sufficient mathematical prowess to derive the sampling distribution of the statistic of interest. In some cases, such a derivation may be prohibitively difficult.

New approach

  • Estimate the sampling distribution of the statistic empirically
  • No assumptions about population distributions
  • No explicit derivation of sampling distributions
  • Treat our \( \mathbf{y} \) and \( \mathbf{X} \) as an estimate of the population
  • Each row of our \( \mathbf{y} \) and \( \mathbf{X} \) will be chosen with probability \( 1/n \)

  • Claim: this mimics the original sampling of \( \mathbf{y} \) and \( \mathbf{X} \) from some population

Bootstraps

CI for linear regression: "trad" and bootstrap
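A hedged sketch of the comparison on this slide, using simulated data and statsmodels (variable names and data are illustrative): a traditional 95% CI for a regression slope alongside a percentile bootstrap CI.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)
y = 1.5 + 0.8 * x + rng.normal(scale=2.0, size=80)

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()
print(fit.conf_int()[1])                       # "trad" 95% CI for the slope

B, n = 2000, len(y)
slopes = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)           # resample (x, y) pairs together
    slopes.append(sm.OLS(y[idx], X[idx]).fit().params[1])
print(np.percentile(slopes, [2.5, 97.5]))      # percentile bootstrap 95% CI
```

With well-behaved data like these the two intervals should agree closely; the bootstrap earns its keep when the traditional assumptions are doubtful.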

Summary

  • Bootstrapping can be computationally expensive
  • The results are extremely useful
  • Good CIs for any or all of the statistics associated with a dataset or a model
  • Care is needed when choosing the type of CI, e.g. with dependent data such as repeated measures on subjects
  • Large area - refer to Efron for seminal work.

Bagging

What you need to know

  • The general process behind bagging of a predictive model. You should be able to describe the process algorithmically.
  • How the predictions are produced for both quantitative and categorical response type models.
  • The benefits and shortcomings of bagging. In particular, where you'd expect a predictive process to benefit from bagging.

History

  • Proposed by Breiman (1996) - one of the creators of CART
  • Basically bootstrapping a la statistics, but we're interested in predictions as opposed to inference

Overview

A high-level description of the method is simple:

  • create a number of bootstrapped datasets,
  • fit the predictive model to each,
  • combine the predictions of the multiple models

Note we have two classes of predictive model in mind: a quantitative response (i.e. regression type problems) or a categorical response (classification problems).

More specifically

Following the original notation of Breiman (1996):

  • Take \( \mathcal{L} \) as some learning/training set of data consisting of \( (y_i, \mathbf{x}_i), i = 1, \dots, n \)
  • By standard sampling with replacement produce \( M \) variants of these from our original data: \( \mathcal{L}_m \) (\( m = 1, 2, \dots, M \))
  • We can produce predictors for each of the \( \mathcal{L}_m \) giving \( \hat{f}_m(\mathbf{X}_{\mathcal{L}_m}) \).
  • ‘Combine’ this multitude of models.

What? Bootstrap your data \( M \) times and fit a model to each, giving \( M \) models. Want a prediction? You now get \( M \) of them.
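A minimal hand-rolled version of this, under the assumption that the base model is a regression tree and the data are simulated (sklearn's BaggingRegressor wraps the same loop if you'd rather not write it yourself):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 5))                  # toy learning set L
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.3, size=300)

M, n = 100, len(y)
models = []
for m in range(M):
    idx = rng.integers(0, n, size=n)           # L_m: rows sampled with replacement
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# `models` now holds the M fitted predictors, one per bootstrap dataset
```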

Combining bootstrapped predictions

  • Quantitative response: take the average of the \( \hat{y}_m \).
  • For a given input vector \( \mathbf{x} \), pass this down through all \( M \) models giving \( M \) predictions. A simple mean gives the bagged prediction.
  • Categorical response (i.e. a classification problem): seek a ‘vote’ across the collective classifiers.
  • For a given input vector \( \mathbf{x} \), pass this down each of our \( M \) classifiers to give a predicted class \( \hat{j} \) in each case. The ‘vote’ is the class with the greatest frequency of predictions.

Note - the type of model is not specified, it can be anything.
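A sketch of the two combination rules, assuming we already have a list `models` of \( M \) fitted predictors (e.g. from the previous sketch) and a new input matrix `X_new` (both names hypothetical):

```python
import numpy as np

def bagged_predict_regression(models, X_new):
    # quantitative response: average the M predictions for each new observation
    preds = np.column_stack([m.predict(X_new) for m in models])   # n_new x M
    return preds.mean(axis=1)

def bagged_predict_classification(models, X_new):
    # categorical response: plurality vote - the most frequently predicted class wins
    preds = np.column_stack([m.predict(X_new) for m in models])   # n_new x M
    votes = []
    for row in preds:
        classes, counts = np.unique(row, return_counts=True)
        votes.append(classes[counts.argmax()])
    return np.array(votes)
```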

Performance

  • The resulting aggregate predictions will have lower variability than the individual predictions generated from a single classifier/model. Simulations and analyses of real data found that the improvement in performance was marked.
  • Across a large number of datasets, reductions in misclassification rates ranged from 6% to 77%, but were typically in the range of 20%-40%.

(of course, they only published good results yeah?)

It's really simple

  • The process is very simple to apply, even without dedicated software
  • Requires only:

    • a front-end that will perform the bootstrap sampling of the data,
    • a predictive modelling tool,
    • and a back-end that will gather the results together.

How many? You can use a validation dataset to check that the generalisation error has plateaued for your \( M \) - see the sketch below.
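One illustrative way to do that check (simulated data; the train/validation split and the tree base learner are assumptions, not part of the lecture): grow the bagged ensemble one tree at a time and watch the validation error until it flattens out.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 5))
y = np.sin(X[:, 0]) + 0.3 * X[:, 1] + rng.normal(scale=0.3, size=500)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

running_sum = np.zeros(len(y_val))             # running sum of validation predictions
for m in range(1, 201):
    idx = rng.integers(0, len(y_tr), size=len(y_tr))
    tree = DecisionTreeRegressor().fit(X_tr[idx], y_tr[idx])
    running_sum += tree.predict(X_val)
    val_mse = np.mean((y_val - running_sum / m) ** 2)   # error of the m-tree ensemble
    if m % 25 == 0:
        print(m, round(val_mse, 4))            # watch for the plateau
```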

It's really simple

Let's do one…

Pros and cons

  • There is no particular gain for models that do not have some inherent instability (evidenced under perturbation)
  • For example, application to a \( k \)-nearest neighbour classifier may show no improvement, whereas a classification tree might show great gains.
  • Something like ordinary regression, where an automated model selection process proves quite variable under perturbation, would similarly benefit.

Pros and cons

  • Low gains are to be expected if the base model is already giving good performance - e.g. its misclassification rate is close to the best achievable.
  • There is a loss of interpretability. We have predictions from multiple models, so the exact nature of the model and the contribution of the variables is complex.

Review

In a nutshell:

  • We build lots of datasets by (non-parametric) bootstrapping of our sample, i.e. sampling with replacement, each with \( n \) observations.
  • We fit a model to each, e.g. classification or regression trees
  • We make predictions by passing new data through all the models
  • We aggregate the predictions either by averaging (numeric responses) or by using a voting system (plurality: the most popular class wins)

The aggregate of lots of unstable models is better than the (almost random) ‘best’ one you get from a single fit to all the data.