ID5059 Lecture 07 - Tree methods - Introduction

C. Donovan
14 Feb 2018

Administrivia

  • Additional lab session - 5-6pm on Thursdays
  • Industrial action starts next week, so it currently affects next week's Friday lecture.
  • Happy Valentine's Day! (where you get unsolicited mail and gifts from someone who has apparently been watching you without your knowledge - creepy?)

Selection & Validation Summary

Want to choose the best model amongst a set of candidate models

  • "Best" will usually be judged by generalisation error (under various measures), because these are predictive models
  • The models might be:
    • very different things (Neural Net vs SVM vs Logistic regression, etc.)
    • different sets of inputs/covariates (they're either in or out)
    • different complexities within a model type (discrete, e.g. polynomial order, or a continuous smoothing parameter)

We know how to fit some of these models and choose between them

Selection & Validation Summary

Commonly encounter:

Common scenario

Big picture

We want more model types!

  • We want methods that scale in complexity, are efficient, place few constraints on input or response types, and that work.
  • We look first at a simple method that has some of these properties (except that it often doesn't work well on its own). It is the basis for better things.

Your Lab 01

  • You might recall this romantic moment

Flying

  • slightly interrupted by this

Now, sinking

  • but allowing this for your predilection:

Wee tree

Tree methods

Need-to-knows

  • How recursive binary partitioning of \( \mathcal{R}_p \) works.
  • How to sketch a partitioning of \( \mathcal{R}_2 \) on the basis of a series of simple binary splits.
  • How to go from a series of binary splitting rules to a tree representation and vice versa.

A classification problem....

A bit of drawing required

Historical perspective

  • 1960: Automatic Interaction Detection (AID), related to the clustering literature
  • THAID, CHAID
  • ID3, C4.5, C5.0
  • 1984: CART (Breiman et al.)

Recursive partitioning on \(\mathcal{R}_p\)

Take an \( n\times p \) matrix \( \mathbf{X} \), which defines a \( p \)-dimensional space \( \mathcal{R}_p \). We wish to apply a simple rule recursively:

  • Select a variable \( x_i \) and split on the basis of a single value \( x_i=a \). We now have two spaces: \( x_i\le a \) and \( x_i>a \).
  • Select one of the current sub-spaces, select a variable \( x_j \), and split this sub-space on the basis of a single value \( x_j=b \).
  • Repeatedly select sub-spaces and split in two.
  • Note that this process extends beyond the two dimensions represented by \( x_1 \) and \( x_2 \). In 3 dimensions (i.e. including an \( x_3 \)) the partitions would be boxes, and beyond this the partitions are conceptually hyper-rectangles. (A small R sketch of the splitting process follows below.)
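As a rough illustration (not part of the original slides), the following R sketch applies two hand-chosen binary splits to an arbitrary 2-D cloud of points and colours each point by the sub-space it lands in; the data and the split points a and b are invented purely for illustration.

## Sketch: two recursive binary splits of an arbitrary 2-D space
## (the data and the split points a, b are invented for illustration)
set.seed(1)
dat <- data.frame(x1 = runif(20), x2 = runif(20))

a <- 0.5   # first split: over the whole space, on x1
b <- 0.3   # second split: only within the sub-space x1 <= a, on x2

## assign each point to one of the three resulting sub-spaces
dat$region <- ifelse(dat$x1 > a, "x1 > a",
              ifelse(dat$x2 > b, "x1 <= a, x2 > b", "x1 <= a, x2 <= b"))

## plot the points and the two splitting lines
plot(dat$x1, dat$x2, pch = 19, col = as.integer(factor(dat$region)),
     xlab = "x1", ylab = "x2", main = "Two recursive binary splits")
abline(v = a, lty = 2)          # first split spans the whole space
segments(0, b, a, b, lty = 2)   # second split lives only inside x1 <= a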

An arbitrary 2-D space

An arbitrary 2-D space

Space splitting

A single split

Space splitting

Split of a subspace

Space splitting

Further splitting of a subspace

Space splitting

Further splitting

Space splitting

Potential 3-D surface

Binary partitioning process as a tree

An example tree diagram for a contrived partitioning

Tree representation

  • The splitting points are called nodes - these have a binary splitting rule associated with them
  • The two new spaces created by a split are represented by lines leaving the node; these are referred to as branches.
  • A tree with one split is a stump.
  • The nodes at the bottom of the diagram are referred to as the terminal nodes and collectively represent all the final partitions/subspaces of the data.
  • You can 'drop' a vector \( \mathbf{x} \) down the tree to determine which subspace this point falls into (a small sketch follows below).
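To make the 'dropping' idea concrete, here is a tiny sketch (with invented splitting rules, not a tree from the lecture) that follows a single observation through two binary rules to a terminal node.

## Sketch: 'dropping' a vector x down a tree with two invented splitting rules
## node 1: x1 <= 0.5?   node 2 (left child of node 1): x2 <= 0.3?
drop_down_tree <- function(x) {
  if (x["x1"] <= 0.5) {
    if (x["x2"] <= 0.3) "terminal node A" else "terminal node B"
  } else {
    "terminal node C"
  }
}

drop_down_tree(c(x1 = 0.2, x2 = 0.7))   # lands in terminal node B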

Tree construction

  • We can model the response as a constant for each region (or, equivalently, each leaf)
  • If we are minimising sums of squares, the optimal constant for a region/leaf is the average of the observed outputs for all inputs associated with the region/leaf
  • Computing the optimal binary partition for given inputs and output is computationally intractable, in general
  • Instead, a greedy algorithm is used: find the best variable and split point for the current region, then repeat the search within each resulting sub-region
  • This is quick to compute (means and sums of squares), but errors at the root lead to errors at the leaves (a sketch of a single greedy split follows below)
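The greedy step can be sketched in a few lines of R: for a single numeric covariate, try every candidate split point, model each side by its mean, and keep the split with the smallest residual sum of squares. This is a simplified one-variable illustration of the idea, not the actual rpart implementation.

## Sketch of one greedy split for a single numeric covariate x:
## model each side of a candidate split by its mean response and
## keep the split point that minimises the residual sum of squares
best_split <- function(x, y) {
  candidates <- head(sort(unique(x)), -1)   # can't split above the largest x
  sse <- sapply(candidates, function(a) {
    left  <- y[x <= a]
    right <- y[x >  a]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(split = candidates[which.min(sse)], sse = min(sse))
}

## tiny made-up example: a step in the mean at x = 0.6
set.seed(2)
x <- runif(50)
y <- ifelse(x > 0.6, 2, 0) + rnorm(50, sd = 0.3)
best_split(x, y)   # should pick a split point near 0.6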

How big should the tree be?

  • Tradeoff between bias and variance
  • Small tree – high bias, low variance
    • Not big enough to capture the correct model structure
  • Large tree – low bias, high variance
    • Overfitting – in the extreme case each observation gets its own region
  • Optimal size should be adaptively chosen from the data

We could stop splitting based on a threshold for decreases in sum of squares, but this might rule out a useful split further down the tree.

How big should the tree be?

  • We could stop splitting based on a threshold for decreases in sum of squares, meaning that we never build a tree that is too big
  • But this might rule out a useful split further down the tree
  • Instead we construct a tree that is probably too large, and prune it back by cost-complexity calculations – next lecture (a minimal rpart sketch of growing an over-large tree follows below)
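In rpart both ideas show up through the complexity parameter cp: a split is only attempted if it improves the fit by at least cp, so setting cp close to zero deliberately grows an over-large tree that can then be pruned. A minimal sketch on a built-in dataset (mtcars, chosen only for illustration):

library(rpart)

## grow a deliberately over-large regression tree by making the
## stopping threshold (cp = minimum relative improvement per split) tiny
bigTree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(cp = 0.001, minsplit = 5))

printcp(bigTree)   # complexity table used later for cost-complexity pruning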

Tree methods

Need-to-knows

  • How recursive binary partitioning of \( \mathcal{R}_p \) works.
  • How to sketch a partitioning of \( \mathcal{R}_2 \) on the basis of a series of simple binary splits.
  • How to go from a series of binary splitting rules to a tree representation and vice versa.

An example

## Titanic passenger training data
leoIsDead <- read.csv("train.csv", header=TRUE)

##### Classification tree
## packages: rpart fits the trees; rattle/rpart.plot/RColorBrewer give nicer plots
library(rattle)
library(rpart.plot)
library(RColorBrewer)
library(rpart)

An example

## women and children first!
simpleTree <- rpart(survived ~ sex + age, method="class", data=leoIsDead)

## plot the tree
plot(simpleTree, uniform=TRUE,
     main="Women and Children First", margin = 0.1)
text(simpleTree, use.n=TRUE, all=TRUE)

[Plot: base rpart tree of simpleTree, "Women and Children First"]

An example

## the rpart plot is terrible, so get something better
fancyRpartPlot(simpleTree)

[Plot: fancyRpartPlot of simpleTree]

An example

## learn a tree for a larger subset of covariates
biggerTree <- rpart(survived ~ pclass + sex + age + sibsp + parch + fare, method="class", data=leoIsDead)

## view details
summary(biggerTree)

An example

Call:
rpart(formula = survived ~ pclass + sex + age + sibsp + parch + 
    fare, data = leoIsDead, method = "class")
  n= 891 

          CP nsplit rel error    xerror       xstd
1 0.44444444      0 1.0000000 1.0000000 0.04244576
2 0.03070175      1 0.5555556 0.5555556 0.03574957
3 0.02339181      3 0.4941520 0.5000000 0.03437157
4 0.02046784      4 0.4707602 0.5029240 0.03444798
5 0.01267057      5 0.4502924 0.5087719 0.03459945
6 0.01000000      8 0.4122807 0.5000000 0.03437157

Variable importance
   sex   fare pclass  sibsp  parch    age 
    48     19     14      7      7      6 

Node number 1: 891 observations,    complexity param=0.4444444
  predicted class=0  expected loss=0.3838384  P(node) =1
    class counts:   549   342
   probabilities: 0.616 0.384 
  left son=2 (577 obs) right son=3 (314 obs)
  Primary splits:
      sex    splits as  RL,           improve=124.426300, (0 missing)
      pclass < 2.5      to the right, improve= 43.781830, (0 missing)
      fare   < 10.48125 to the left,  improve= 37.941940, (0 missing)
      parch  < 0.5      to the left,  improve=  9.157774, (0 missing)
      age    < 6.5      to the right, improve=  8.814172, (177 missing)
  Surrogate splits:
      fare  < 77.6229  to the left,  agree=0.679, adj=0.089, (0 split)
      parch < 0.5      to the left,  agree=0.678, adj=0.086, (0 split)

Node number 2: 577 observations,    complexity param=0.02339181
  predicted class=0  expected loss=0.1889081  P(node) =0.647587
    class counts:   468   109
   probabilities: 0.811 0.189 
  left son=4 (553 obs) right son=5 (24 obs)
  Primary splits:
      age    < 6.5      to the right, improve=10.788930, (124 missing)
      fare   < 26.26875 to the left,  improve=10.216720, (0 missing)
      pclass < 1.5      to the right, improve=10.019140, (0 missing)
      parch  < 0.5      to the left,  improve= 3.350327, (0 missing)
      sibsp  < 0.5      to the left,  improve= 1.501502, (0 missing)

Node number 3: 314 observations,    complexity param=0.03070175
  predicted class=1  expected loss=0.2579618  P(node) =0.352413
    class counts:    81   233
   probabilities: 0.258 0.742 
  left son=6 (144 obs) right son=7 (170 obs)
  Primary splits:
      pclass < 2.5      to the right, improve=31.163130, (0 missing)
      fare   < 48.2     to the left,  improve=10.114210, (0 missing)
      sibsp  < 2.5      to the right, improve= 9.372551, (0 missing)
      parch  < 3.5      to the right, improve= 5.140857, (0 missing)
      age    < 12       to the left,  improve= 1.891684, (53 missing)
  Surrogate splits:
      fare  < 25.69795 to the left,  agree=0.799, adj=0.563, (0 split)
      sibsp < 1.5      to the right, agree=0.592, adj=0.111, (0 split)
      parch < 1.5      to the right, agree=0.567, adj=0.056, (0 split)
      age   < 18.5     to the left,  agree=0.564, adj=0.049, (0 split)

Node number 4: 553 observations
  predicted class=0  expected loss=0.1681736  P(node) =0.620651
    class counts:   460    93
   probabilities: 0.832 0.168 

Node number 5: 24 observations,    complexity param=0.02046784
  predicted class=1  expected loss=0.3333333  P(node) =0.02693603
    class counts:     8    16
   probabilities: 0.333 0.667 
  left son=10 (9 obs) right son=11 (15 obs)
  Primary splits:
      sibsp  < 2.5      to the right, improve=8.8888890, (0 missing)
      pclass < 2.5      to the right, improve=3.8095240, (0 missing)
      fare   < 20.825   to the right, improve=2.6666670, (0 missing)
      age    < 1.5      to the right, improve=0.6095238, (0 missing)
  Surrogate splits:
      pclass < 2.5      to the right, agree=0.792, adj=0.444, (0 split)
      fare   < 26.95    to the right, agree=0.750, adj=0.333, (0 split)

Node number 6: 144 observations,    complexity param=0.03070175
  predicted class=0  expected loss=0.5  P(node) =0.1616162
    class counts:    72    72
   probabilities: 0.500 0.500 
  left son=12 (27 obs) right son=13 (117 obs)
  Primary splits:
      fare  < 23.35    to the right, improve=10.051280, (0 missing)
      sibsp < 2.5      to the right, improve= 4.571429, (0 missing)
      age   < 38.5     to the right, improve= 3.875163, (42 missing)
      parch < 1.5      to the right, improve= 3.773262, (0 missing)
  Surrogate splits:
      sibsp < 2.5      to the right, agree=0.882, adj=0.37, (0 split)
      parch < 1.5      to the right, agree=0.882, adj=0.37, (0 split)

Node number 7: 170 observations
  predicted class=1  expected loss=0.05294118  P(node) =0.1907969
    class counts:     9   161
   probabilities: 0.053 0.947 

Node number 10: 9 observations
  predicted class=0  expected loss=0.1111111  P(node) =0.01010101
    class counts:     8     1
   probabilities: 0.889 0.111 

Node number 11: 15 observations
  predicted class=1  expected loss=0  P(node) =0.01683502
    class counts:     0    15
   probabilities: 0.000 1.000 

Node number 12: 27 observations
  predicted class=0  expected loss=0.1111111  P(node) =0.03030303
    class counts:    24     3
   probabilities: 0.889 0.111 

Node number 13: 117 observations,    complexity param=0.01267057
  predicted class=1  expected loss=0.4102564  P(node) =0.1313131
    class counts:    48    69
   probabilities: 0.410 0.590 
  left son=26 (93 obs) right son=27 (24 obs)
  Primary splits:
      age   < 16.5     to the right, improve=2.4685870, (34 missing)
      fare  < 7.8875   to the right, improve=2.0325270, (0 missing)
      sibsp < 0.5      to the right, improve=0.3076923, (0 missing)
      parch < 1.5      to the left,  improve=0.1582418, (0 missing)
  Surrogate splits:
      sibsp < 1.5      to the left,  agree=0.747, adj=0.087, (34 split)
      fare  < 20.8     to the left,  agree=0.747, adj=0.087, (0 split)

Node number 26: 93 observations,    complexity param=0.01267057
  predicted class=1  expected loss=0.4516129  P(node) =0.1043771
    class counts:    42    51
   probabilities: 0.452 0.548 
  left son=52 (56 obs) right son=53 (37 obs)
  Primary splits:
      fare  < 7.8875   to the right, improve=2.92648500, (0 missing)
      age   < 36.5     to the right, improve=1.66181500, (33 missing)
      sibsp < 0.5      to the right, improve=0.34061040, (0 missing)
      parch < 1.5      to the left,  improve=0.05969685, (0 missing)
  Surrogate splits:
      sibsp < 0.5      to the right, agree=0.667, adj=0.162, (0 split)

Node number 27: 24 observations
  predicted class=1  expected loss=0.25  P(node) =0.02693603
    class counts:     6    18
   probabilities: 0.250 0.750 

Node number 52: 56 observations,    complexity param=0.01267057
  predicted class=0  expected loss=0.4464286  P(node) =0.06285073
    class counts:    31    25
   probabilities: 0.554 0.446 
  left son=104 (33 obs) right son=105 (23 obs)
  Primary splits:
      fare  < 14.8729  to the left,  improve=3.30439500, (0 missing)
      age   < 23.5     to the left,  improve=1.31484800, (13 missing)
      parch < 1.5      to the left,  improve=1.04027400, (0 missing)
      sibsp < 0.5      to the left,  improve=0.02216117, (0 missing)
  Surrogate splits:
      sibsp < 0.5      to the left,  agree=0.696, adj=0.261, (0 split)
      parch < 1.5      to the left,  agree=0.679, adj=0.217, (0 split)

Node number 53: 37 observations
  predicted class=1  expected loss=0.2972973  P(node) =0.04152637
    class counts:    11    26
   probabilities: 0.297 0.703 

Node number 104: 33 observations
  predicted class=0  expected loss=0.3030303  P(node) =0.03703704
    class counts:    23    10
   probabilities: 0.697 0.303 

Node number 105: 23 observations
  predicted class=1  expected loss=0.3478261  P(node) =0.02581369
    class counts:     8    15
   probabilities: 0.348 0.652 

An example

## plot the tree
fancyRpartPlot(biggerTree)

[Plot: fancyRpartPlot of biggerTree]

An example

Whoops - too male, too old, too poor

Regression trees

Consider our general regression problem (note: it could equally be classification): \[ \mathbf{y}=f(\mathbf{X}) + \mathbf{e} \] and the usual approximating model (linear in its parameters): \[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta} + \mathbf{e} \]

  • 'Standard' interactions of the form \( \beta_p(X_1 X_2) \)
  • These are simple in form and quite hard to interpret succinctly
  • What is probably the simplest interaction form to interpret?
  • Recursive binary splitting rules for the covariate space (a small comparison sketch follows below)
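As a rough illustration of that last point, the sketch below (using the built-in mtcars data, chosen only for illustration) fits a 'standard' linear interaction and a small regression tree to the same two covariates: the lm interaction is a single, hard-to-read coefficient on the product term, while the tree expresses the interaction as nested, readable splitting rules.

library(rpart)

## a 'standard' linear interaction versus a regression tree on the same covariates
linMod  <- lm(mpg ~ wt * hp, data = mtcars)
treeMod <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova",
                 control = rpart.control(minsplit = 10))

coef(linMod)     # the wt:hp coefficient is the interaction term
print(treeMod)   # nested splits on wt and hp: the interaction read as rules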