ID5059 Lecture 07 - Tree methods - Introduction

C. Donovan
14 Feb 2018

Administrivia

  • Additional lab session - 5-6pm on Thursdays
  • Industrial action starts next week, so it currently affects next week's Friday lecture.
  • Happy Valentine's Day! (where you get unsolicited mail and gifts from someone who has apparently been watching you without your knowledge - creepy?)

Selection & Validation Summary

Want to choose the best model amongst a set of candidate models

  • "Best" will usually be judged by generalisation error (under various measures), because these are predictive models
  • The models might be:
    • very different things (Neural Net vs SVM vs Logistic regression, etc.)
    • different sets of inputs/covariates (they're either in or out)
    • different complexities within a model type (discrete, e.g. polynomial order, or a continuous smoothing parameter)

We know how to fit some of these models and choose between them

Selection & Validation Summary

Commonly encounter:

Common scenario

Big picture

We want more model types!

  • We want methods that scale in complexity, are efficient, place few constraints on input or response types, and that work.
  • We look first at a simple method that has some of these properties (except that it often doesn't work well on its own). It is the basis for better things.

Your Lab 01

  • You might recall this romantic moment

Flying

  • slightly interrupted by this

Now, sinking

  • but allowing this for your predilection:

Wee tree

Tree methods

Need-to-knows

  • How recursive binary partitioning of \( \mathcal{R}_p \) works.
  • How to sketch a partitioning of \( \mathcal{R}_2 \) on the basis of a series of simple binary splits.
  • How to go from a series of binary splitting rules to a tree representation and vice versa.

A classification problem....

A bit of drawing required

Historical perspective

  • 1960: Automatic Interaction Detection (AID), related to the clustering literature
  • THAID, CHAID
  • ID3, C4.5, C5.0
  • 1984: CART (Breiman et al.)

Recursive partitioning on \(\mathcal{R}_p\)

Take an \( n\times p \) matrix \( \mathbf{X} \), which defines a \( p \)-dimensional space \( \mathcal{R}_p \). We wish to apply a simple rule recursively:

  • Select a variable \( x_i \) and split on the basis of a single value \( x_i=a \). We now have two spaces: \( x_i\le a \) and \( x_i>a \).
  • Select one of the current sub-spaces, select a variable \( x_j \), and split this sub-space on the basis of a single value \( x_j=b \).
  • Repeatedly select sub-spaces and split in two.
  • Note that this process extends beyond the two dimensions represented by \( x_1 \) and \( x_2 \). In 3 dimensions (i.e. including an \( x_3 \)) the partitions would be boxes, and beyond this the partitions are conceptually hyper-rectangles. (A small R sketch of the splitting process follows below.)
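As a rough illustration (not part of the original slides), the following R sketch applies two hand-chosen binary splits to an arbitrary 2-D cloud of points and colours each point by the sub-space it lands in; the data and the split points a and b are invented purely for illustration.

## Sketch: two recursive binary splits of an arbitrary 2-D space
## (the data and the split points a, b are invented for illustration)
set.seed(1)
dat <- data.frame(x1 = runif(20), x2 = runif(20))

a <- 0.5   # first split: over the whole space, on x1
b <- 0.3   # second split: only within the sub-space x1 <= a, on x2

## assign each point to one of the three resulting sub-spaces
dat$region <- ifelse(dat$x1 > a, "x1 > a",
              ifelse(dat$x2 > b, "x1 <= a, x2 > b", "x1 <= a, x2 <= b"))

## plot the points and the two splitting lines
plot(dat$x1, dat$x2, pch = 19, col = as.integer(factor(dat$region)),
     xlab = "x1", ylab = "x2", main = "Two recursive binary splits")
abline(v = a, lty = 2)          # first split spans the whole space
segments(0, b, a, b, lty = 2)   # second split lives only inside x1 <= a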

An arbitrary 2-D space

An arbitrary 2-D space

Space splitting

A single split

Space splitting

Split of a subspace

Space splitting

Further splitting of a subspace

Space splitting

Further splitting

Space splitting

Potential 3-D surface

Binary partitioning process as a tree

An example tree diagram for a contrived partitioning

Tree representation

  • The splitting points are called nodes - these have a binary splitting rule associated with them
  • The two new spaces created by a split are represented by lines leaving the node; these are referred to as branches.
  • A tree with one split is a stump.
  • The nodes at the bottom of the diagram are referred to as the terminal nodes and collectively represent all the final partitions/subspaces of the data.
  • You can 'drop' a vector \( \mathbf{x} \) down the tree to determine which subspace this point falls into (a small sketch follows below).
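To make the 'dropping' idea concrete, here is a tiny sketch (with invented splitting rules, not a tree from the lecture) that follows a single observation through two binary rules to a terminal node.

## Sketch: 'dropping' a vector x down a tree with two invented splitting rules
## node 1: x1 <= 0.5?   node 2 (left child of node 1): x2 <= 0.3?
drop_down_tree <- function(x) {
  if (x["x1"] <= 0.5) {
    if (x["x2"] <= 0.3) "terminal node A" else "terminal node B"
  } else {
    "terminal node C"
  }
}

drop_down_tree(c(x1 = 0.2, x2 = 0.7))   # lands in terminal node B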

Tree construction

  • We can model the response as a constant for each region (or, equivalently, each leaf)
  • If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with the region/leaf
  • Computing the optimal binary partition for given inputs and output is computationally intractable, in general
  • Instead, a greedy algorithm is used: find the best variable and split point for the current region, then repeat the search within each resulting sub-region
  • This is quick to compute (means and sums of squares), but errors at the root lead to errors at the leaves (a sketch of a single greedy split follows below)
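The greedy step can be sketched in a few lines of R: for a single numeric covariate, try every candidate split point, model each side by its mean, and keep the split with the smallest residual sum of squares. This is a simplified one-variable illustration of the idea, not the actual rpart implementation.

## Sketch of one greedy split for a single numeric covariate x:
## model each side of a candidate split by its mean response and
## keep the split point that minimises the residual sum of squares
best_split <- function(x, y) {
  candidates <- head(sort(unique(x)), -1)   # can't split above the largest x
  sse <- sapply(candidates, function(a) {
    left  <- y[x <= a]
    right <- y[x >  a]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(sse)], sse = min(sse))
}

## tiny made-up example: a step in the mean at x = 0.6
set.seed(2)
x <- runif(50)
y <- ifelse(x > 0.6, 2, 0) + rnorm(50, sd = 0.3)
best_split(x, y)   # should pick a split point near 0.6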

How big should the tree be?

  • Tradeoff between bias and variance
  • Small tree – high bias, low variance
    • Not big enough to capture the correct model structure
  • Large tree – low bias, high variance
    • Overfitting – in the extreme case each observation gets its own region
  • Optimal size should be adaptively chosen from the data

We could stop splitting based on a threshold for decreases in sum of squares, but this might rule out a useful split further down the tree.

How big should the tree be?

  • We could stop splitting based on a threshold for decreases in sum of squares, meaning that we never build a tree that is too big
  • But this might rule out a useful split further down the tree
  • Instead we construct a tree that is probably too large, and prune it back by cost-complexity calculations – next lecture (a minimal rpart sketch of growing an over-large tree follows below)
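In rpart both ideas show up through the complexity parameter cp: a split is only attempted if it improves the fit by at least cp, so setting cp close to zero deliberately grows an over-large tree that can then be pruned. A minimal sketch on a built-in dataset (mtcars, chosen only for illustration):

library(rpart)

## grow a deliberately over-large regression tree by making the
## stopping threshold (cp = minimum relative improvement per split) tiny
bigTree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(cp = 0.001, minsplit = 5))

printcp(bigTree)   # complexity table used later for cost-complexity pruning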

Tree methods

Need-to-knows

  • How recursive binary partitioning of \( \mathcal{R}_p \) works.
  • How to sketch a partitioning of \( \mathcal{R}_2 \) on the basis of a series of simple binary splits.
  • How to go from a series of binary splitting rules to a tree representation and vice versa.

An example

## Titanic passenger training data
leoIsDead <- read.csv("train.csv", header=TRUE)

##### Classification tree
## packages: rpart fits the trees; rattle/rpart.plot/RColorBrewer give nicer plots
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(rpart)

An example

## women and children first!
simpleTree <- rpart(survived ~ sex + age, method="class", data=leoIsDead)

## plot the tree
plot(simpleTree, uniform=TRUE,
     main="Women and Children First", margin = 0.1)
text(simpleTree, use.n=TRUE, all=TRUE)

[Plot: base rpart tree of simpleTree, "Women and Children First"]

An example

## the rpart plot is terrible, so get something better
fancyRpartPlot(simpleTree)

[Plot: fancyRpartPlot of simpleTree]

An example

## learn a tree for a larger subset of covariates
biggerTree <- rpart(survived ~ pclass + sex + age + sibsp + parch + fare, method="class", data=leoIsDead)

## view details
summary(biggerTree)

An example

Call:
rpart(formula = survived ~ pclass + sex + age + sibsp + parch + 
    fare, data = leoIsDead, method = "class")
  n= 891 

          CP nsplit rel error    xerror       xstd
1 0.44444444      0 1.0000000 1.0000000 0.04244576
2 0.03070175      1 0.5555556 0.5555556 0.03574957
3 0.02339181      3 0.4941520 0.5000000 0.03437157
4 0.02046784      4 0.4707602 0.5029240 0.03444798
5 0.01267057      5 0.4502924 0.5087719 0.03459945
6 0.01000000      8 0.4122807 0.5000000 0.03437157

Variable importance
   sex   fare pclass  sibsp  parch    age 
    48     19     14      7      7      6 

Node number 1: 891 observations,    complexity param=0.4444444
  predicted class=0  expected loss=0.3838384  P(node) =1
    class counts:   549   342
   probabilities: 0.616 0.384 
  left son=2 (577 obs) right son=3 (314 obs)
  Primary splits:
      sex    splits as  RL,           improve=124.426300, (0 missing)
      pclass < 2.5      to the right, improve= 43.781830, (0 missing)
      fare   < 10.48125 to the left,  improve= 37.941940, (0 missing)
      parch  < 0.5      to the left,  improve=  9.157774, (0 missing)
      age    < 6.5      to the right, improve=  8.814172, (177 missing)
  Surrogate splits:
      fare  < 77.6229  to the left,  agree=0.679, adj=0.089, (0 split)
      parch < 0.5      to the left,  agree=0.678, adj=0.086, (0 split)

Node number 2: 577 observations,    complexity param=0.02339181
  predicted class=0  expected loss=0.1889081  P(node) =0.647587
    class counts:   468   109
   probabilities: 0.811 0.189 
  left son=4 (553 obs) right son=5 (24 obs)
  Primary splits:
      age    < 6.5      to the right, improve=10.788930, (124 missing)
      fare   < 26.26875 to the left,  improve=10.216720, (0 missing)
      pclass < 1.5      to the right, improve=10.019140, (0 missing)
      parch  < 0.5      to the left,  improve= 3.350327, (0 missing)
      sibsp  < 0.5      to the left,  improve= 1.501502, (0 missing)

Node number 3: 314 observations,    complexity param=0.03070175
  predicted class=1  expected loss=0.2579618  P(node) =0.352413
    class counts:    81   233
   probabilities: 0.258 0.742 
  left son=6 (144 obs) right son=7 (170 obs)
  Primary splits:
      pclass < 2.5      to the right, improve=31.163130, (0 missing)
      fare   < 48.2     to the left,  improve=10.114210, (0 missing)
      sibsp  < 2.5      to the right, improve= 9.372551, (0 missing)
      parch  < 3.5      to the right, improve= 5.140857, (0 missing)
      age    < 12       to the left,  improve= 1.891684, (53 missing)
  Surrogate splits:
      fare  < 25.69795 to the left,  agree=0.799, adj=0.563, (0 split)
      sibsp < 1.5      to the right, agree=0.592, adj=0.111, (0 split)
      parch < 1.5      to the right, agree=0.567, adj=0.056, (0 split)
      age   < 18.5     to the left,  agree=0.564, adj=0.049, (0 split)

Node number 4: 553 observations
  predicted class=0  expected loss=0.1681736  P(node) =0.620651
    class counts:   460    93
   probabilities: 0.832 0.168 

Node number 5: 24 observations,    complexity param=0.02046784
  predicted class=1  expected loss=0.3333333  P(node) =0.02693603
    class counts:     8    16
   probabilities: 0.333 0.667 
  left son=10 (9 obs) right son=11 (15 obs)
  Primary splits:
      sibsp  < 2.5      to the right, improve=8.8888890, (0 missing)
      pclass < 2.5      to the right, improve=3.8095240, (0 missing)
      fare   < 20.825   to the right, improve=2.6666670, (0 missing)
      age    < 1.5      to the right, improve=0.6095238, (0 missing)
  Surrogate splits:
      pclass < 2.5      to the right, agree=0.792, adj=0.444, (0 split)
      fare   < 26.95    to the right, agree=0.750, adj=0.333, (0 split)

Node number 6: 144 observations,    complexity param=0.03070175
  predicted class=0  expected loss=0.5  P(node) =0.1616162
    class counts:    72    72
   probabilities: 0.500 0.500 
  left son=12 (27 obs) right son=13 (117 obs)
  Primary splits:
      fare  < 23.35    to the right, improve=10.051280, (0 missing)
      sibsp < 2.5      to the right, improve= 4.571429, (0 missing)
      age   < 38.5     to the right, improve= 3.875163, (42 missing)
      parch < 1.5      to the right, improve= 3.773262, (0 missing)
  Surrogate splits:
      sibsp < 2.5      to the right, agree=0.882, adj=0.37, (0 split)
      parch < 1.5      to the right, agree=0.882, adj=0.37, (0 split)

Node number 7: 170 observations
  predicted class=1  expected loss=0.05294118  P(node) =0.1907969
    class counts:     9   161
   probabilities: 0.053 0.947 

Node number 10: 9 observations
  predicted class=0  expected loss=0.1111111  P(node) =0.01010101
    class counts:     8     1
   probabilities: 0.889 0.111 

Node number 11: 15 observations
  predicted class=1  expected loss=0  P(node) =0.01683502
    class counts:     0    15
   probabilities: 0.000 1.000 

Node number 12: 27 observations
  predicted class=0  expected loss=0.1111111  P(node) =0.03030303
    class counts:    24     3
   probabilities: 0.889 0.111 

Node number 13: 117 observations,    complexity param=0.01267057
  predicted class=1  expected loss=0.4102564  P(node) =0.1313131
    class counts:    48    69
   probabilities: 0.410 0.590 
  left son=26 (93 obs) right son=27 (24 obs)
  Primary splits:
      age   < 16.5     to the right, improve=2.4685870, (34 missing)
      fare  < 7.8875   to the right, improve=2.0325270, (0 missing)
      sibsp < 0.5      to the right, improve=0.3076923, (0 missing)
      parch < 1.5      to the left,  improve=0.1582418, (0 missing)
  Surrogate splits:
      sibsp < 1.5      to the left,  agree=0.747, adj=0.087, (34 split)
      fare  < 20.8     to the left,  agree=0.747, adj=0.087, (0 split)

Node number 26: 93 observations,    complexity param=0.01267057
  predicted class=1  expected loss=0.4516129  P(node) =0.1043771
    class counts:    42    51
   probabilities: 0.452 0.548 
  left son=52 (56 obs) right son=53 (37 obs)
  Primary splits:
      fare  < 7.8875   to the right, improve=2.92648500, (0 missing)
      age   < 36.5     to the right, improve=1.66181500, (33 missing)
      sibsp < 0.5      to the right, improve=0.34061040, (0 missing)
      parch < 1.5      to the left,  improve=0.05969685, (0 missing)
  Surrogate splits:
      sibsp < 0.5      to the right, agree=0.667, adj=0.162, (0 split)

Node number 27: 24 observations
  predicted class=1  expected loss=0.25  P(node) =0.02693603
    class counts:     6    18
   probabilities: 0.250 0.750 

Node number 52: 56 observations,    complexity param=0.01267057
  predicted class=0  expected loss=0.4464286  P(node) =0.06285073
    class counts:    31    25
   probabilities: 0.554 0.446 
  left son=104 (33 obs) right son=105 (23 obs)
  Primary splits:
      fare  < 14.8729  to the left,  improve=3.30439500, (0 missing)
      age   < 23.5     to the left,  improve=1.31484800, (13 missing)
      parch < 1.5      to the left,  improve=1.04027400, (0 missing)
      sibsp < 0.5      to the left,  improve=0.02216117, (0 missing)
  Surrogate splits:
      sibsp < 0.5      to the left,  agree=0.696, adj=0.261, (0 split)
      parch < 1.5      to the left,  agree=0.679, adj=0.217, (0 split)

Node number 53: 37 observations
  predicted class=1  expected loss=0.2972973  P(node) =0.04152637
    class counts:    11    26
   probabilities: 0.297 0.703 

Node number 104: 33 observations
  predicted class=0  expected loss=0.3030303  P(node) =0.03703704
    class counts:    23    10
   probabilities: 0.697 0.303 

Node number 105: 23 observations
  predicted class=1  expected loss=0.3478261  P(node) =0.02581369
    class counts:     8    15
   probabilities: 0.348 0.652 

An example

## plot the tree
fancyRpartPlot(biggerTree)

[Plot: fancyRpartPlot of biggerTree]

An example

Whoops - too male, too old, too poor

Regression trees

Consider our general regression problem (note: it could equally be classification): \[ \mathbf{y}=f(\mathbf{X}) + \mathbf{e} \] and the usual approximating model (linear in its parameters): \[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{e} \]

  • 'Standard' interactions of the form \( \beta_p(X_1 X_2) \)
  • These are simple in form and quite hard to interpret succinctly
  • What is probably the simplest interaction form to interpret?
  • Recursive binary splitting rules for the covariate space (a small comparison sketch follows below)
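As a rough illustration of that last point, the sketch below (using the built-in mtcars data, chosen only for illustration) fits a 'standard' linear interaction and a small regression tree to the same two covariates: the lm interaction is a single, hard-to-read coefficient on the product term, while the tree expresses the interaction as nested, readable splitting rules.

library(rpart)

## a 'standard' linear interaction versus a regression tree on the same covariates
linMod  <- lm(mpg ~ wt * hp, data = mtcars)
treeMod <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova",
                 control = rpart.control(minsplit = 10))

coef(linMod)     # the wt:hp coefficient is the interaction term
print(treeMod)   # nested splits on wt and hp: the interaction read as rules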