ID5059 Lecture 09 - Regression & Decision Trees

C. Donovan
21 Feb 2018

Administrivia

  • No lecture Friday. Complain to the VP
  • (pssst: suggest reading section 8 of James et al)

Big picture

  • We're adding new ways to build \( f \) (as per my favourite eqn \( \mathbf{y} = f(\mathbf{X}) \))
  • Trees are really simple
    • They're good for telling stories (interpretable), but not great at predictive performance
    • We can, however, combine many of them for good prediction (ensembles, later)

Today

  • How do we search for the best splits?
    • Splitting on numeric versus categorical covariates/features
  • How/why do we stop splitting?
  • How do we get the right-sized tree?
  • Missing values and surrogate splits

Refer to: James et al Sections 8.1.1 & 8.1.2

First up - nature of the response

In short:

  • If your response is a class, it's a classification tree

    • Split based on purity (measured using Gini or Entropy)
    • Stop
    • Prune: aim for something like low misclassification (generalisation) error
  • If your response is numeric, it's a regression tree

    • Split based on a measure of the residuals \( y-\hat{y} \) (e.g. the sum of squared errors)
    • Stop
    • Prune: aim for low generalisation error based on \( y-\hat{y} \) e.g. Mean Squared Error

simples!
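For concreteness, a minimal sketch of both cases in rpart. The data frame df and the variables y, g, x1, x2 here are hypothetical placeholders, not the Titanic data used later:

  library(rpart)

  # Regression tree: numeric response, splits chosen to reduce sum((y - yhat)^2)
  regTree <- rpart(y ~ x1 + x2, data = df, method = "anova")

  # Classification tree: class response, splits chosen to increase node purity
  clsTree <- rpart(g ~ x1 + x2, data = df, method = "class")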

Searching for splits

  • The splitting criterion is either purity (e.g. Gini or Entropy) or based on the residuals \( y-\hat{y} \)
  • It is a brute-force approach - look “everywhere” (see the sketch below for one numeric covariate)
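To make the brute-force idea concrete, here is a rough sketch (not rpart's actual implementation) of the exhaustive search over a single numeric covariate x with a class response y, scoring each candidate threshold by the size-weighted Gini impurity of the two children:

  # Gini impurity of a set of class labels
  giniImpurity <- function(y) {
    p <- table(y) / length(y)
    1 - sum(p^2)
  }

  # Try every midpoint between adjacent observed x values; keep the threshold
  # giving the lowest size-weighted impurity of the two child nodes
  bestSplit <- function(x, y) {
    xs   <- sort(unique(x))
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2
    score <- sapply(cuts, function(cc) {
      left  <- y[x <  cc]
      right <- y[x >= cc]
      (length(left) * giniImpurity(left) +
         length(right) * giniImpurity(right)) / length(y)
    })
    list(threshold = cuts[which.min(score)], impurity = min(score))
  }

The full greedy search simply repeats this over every covariate at every node and takes the overall best split.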

Searching for splits

[Figures: step-by-step illustration of the exhaustive split search, from Arthur Charpentier – freakonometrics.hypotheses.org]

Recap: splitting objective

  • The thing we're maximising or minimising is a measure of homogeneity
    • For classes, make nodes pure
    • For numeric response, make nodes contain similar values (reduce the variance or \( y-\hat{y} \) or…)
  • A node is maximally pure when its impurity is zero

Recap: Our two measures of (im)purity \(\phi\)

Take the proportions \( p_j \) of each class \( j \) within the node.

The entropy of a set of output classes: \[ \phi(p_1,\ldots,p_J) = -\sum_j p_j\log_2 p_j \]

The Gini Index

\[ \phi(p_1,\ldots,p_J) = 1 - \sum_j p_j^2 \]

Relatively large values are bad - the node has no clear majority class. Smaller values mean some classes dominate.
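As a quick sanity check, both measures computed directly from the class proportions (a sketch for two classes):

  entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))   # ignore empty classes
  gini    <- function(p) 1 - sum(p^2)

  entropy(c(0.5, 0.5));  gini(c(0.5, 0.5))   # 1 and 0.5   - worst case for 2 classes
  entropy(c(0.9, 0.1));  gini(c(0.9, 0.1))   # ~0.47 and 0.18 - fairly pure
  entropy(c(1.0, 0.0));  gini(c(1.0, 0.0))   # 0 and 0     - completely pure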

Categorical covariates

  • This has all looked at numeric covariates - which is intuitive
  • For class/categorical covariates, we need to consider all possible groupings of the levels - this can be very expensive (see the sketch below)
    • For this reason (and for many other models too) we don't like categorical variables with lots of levels
    • We may not be able to use them blindly - we may condense classes/extract the relevant bits (e.g. truncate post-codes) or drop them
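To see why, a tiny sketch counting the distinct binary splits of a factor: a covariate with \( L \) levels can be divided into two non-empty groups in \( 2^{L-1}-1 \) ways, so the search explodes as \( L \) grows.

  nSplits <- function(L) 2^(L - 1) - 1   # distinct binary partitions of L levels

  nSplits(2)    # 1
  nSplits(5)    # 15
  nSplits(10)   # 511
  nSplits(20)   # 524287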

Stopping your tree

This isn't fancy - there are a bunch of rules underlying your software:

  • Maximum number of splits/levels e.g. no more than 32
  • Minimum number of data in a node e.g. >5
  • Others? Depends on the software

We're usually just allowing the tree to become complex, if the data require it.

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  library(rpart)   # recursive partitioning trees

  leoIsDead <- read.csv("data/train.csv", header=TRUE)

  # cp=0 removes the complexity penalty while growing, so only the other
  # control settings (printed below) limit the tree's size
  wacfTrain <- rpart(survived ~ sex + age, method="class", data=leoIsDead, control = rpart.control(cp=0))

  wacfTrain$control
$minsplit
[1] 20

$minbucket
[1] 7

$cp
[1] 0

$maxcompete
[1] 4

$maxsurrogate
[1] 5

$usesurrogate
[1] 2

$surrogatestyle
[1] 0

$maxdepth
[1] 30

$xval
[1] 10

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  wacfTrain <- rpart(survived ~ sex + age, method="class", data=leoIsDead, control = rpart.control(cp=0))

  wacfTrain
n= 891 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 891 342 0 (0.61616162 0.38383838)  
    2) sex=male 577 109 0 (0.81109185 0.18890815)  
      4) age>=6.5 553  93 0 (0.83182640 0.16817360)  
        8) age< 24.75 137  17 0 (0.87591241 0.12408759) *
        9) age>=24.75 416  76 0 (0.81730769 0.18269231)  
         18) age>=27.5 373  63 0 (0.83109920 0.16890080)  
           36) age>=53 37   4 0 (0.89189189 0.10810811) *
           37) age< 53 336  59 0 (0.82440476 0.17559524)  
             74) age< 47.5 312  51 0 (0.83653846 0.16346154) *
             75) age>=47.5 24   8 0 (0.66666667 0.33333333)  
              150) age>=49.5 15   3 0 (0.80000000 0.20000000) *
              151) age< 49.5 9   4 1 (0.44444444 0.55555556) *
         19) age< 27.5 43  13 0 (0.69767442 0.30232558) *
      5) age< 6.5 24   8 1 (0.33333333 0.66666667) *
    3) sex=female 314  81 1 (0.25796178 0.74203822)  
      6) age< 12 32  13 1 (0.40625000 0.59375000)  
       12) age>=7.5 8   1 0 (0.87500000 0.12500000) *
       13) age< 7.5 24   6 1 (0.25000000 0.75000000) *
      7) age>=12 282  68 1 (0.24113475 0.75886525)  
       14) age< 48.5 258  66 1 (0.25581395 0.74418605)  
         28) age>=42.5 17   7 1 (0.41176471 0.58823529) *
         29) age< 42.5 241  59 1 (0.24481328 0.75518672)  
           58) age< 32.25 190  52 1 (0.27368421 0.72631579)  
            116) age>=24.5 52  16 1 (0.30769231 0.69230769) *
            117) age< 24.5 138  36 1 (0.26086957 0.73913043)  
              234) age< 21.5 105  31 1 (0.29523810 0.70476190)  
                468) age>=19.5 9   4 0 (0.55555556 0.44444444) *
                469) age< 19.5 96  26 1 (0.27083333 0.72916667) *
              235) age>=21.5 33   5 1 (0.15151515 0.84848485) *
           59) age>=32.25 51   7 1 (0.13725490 0.86274510) *
       15) age>=48.5 24   2 1 (0.08333333 0.91666667) *

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  library(rattle)   # provides fancyRpartPlot()

  fancyRpartPlot(wacfTrain)

[Plot: the fully grown tree drawn with fancyRpartPlot]

The right-sized tree

Too complex is, well, too complex and bad for prediction.

We want the tree with the best generalisation error (obvs). But

  • What set of trees are we considering?
  • What error measure do we use?

High level view

It's just this problem again: training error keeps falling as complexity grows, but generalisation error eventually rises, so we want the sweet spot in between.

In our case, complexity is tree size (number of terminal nodes), with the error measure TBD

Mid level view

  • We grow our trees in a greedy fashion; we prune in reverse, collapsing splits back together
  • This gives a sequence of trees from the most complex down to the stump, losing a split each time
  • Generalisation error will be from validation or cross-validation
  • Generalisation error may be expressed as:
    • misclassification rate for a class problem (there are others)
    • Mean Squared Error or similar for a numeric response (i.e. where the \( y-\hat{y} \) makes sense)

Mid level view

  • Recall: the misclassification rate is the proportion of observations off the leading diagonal of a confusion matrix
  • It is the proportion of observations we misclassify - small is good
  • To get a generalisation version, apply the model to validation or cross-validation data before building the confusion matrix (see the sketch below)
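A sketch of that calculation with rpart, assuming a held-out data frame leoIsDeadTest with the same columns as the training data (the name is hypothetical):

  # Predicted classes on the held-out data
  predClass <- predict(wacfTrain, newdata = leoIsDeadTest, type = "class")

  # Confusion matrix: observed vs predicted
  confMat <- table(observed = leoIsDeadTest$survived, predicted = predClass)
  confMat

  # Misclassification rate = proportion off the leading diagonal
  1 - sum(diag(confMat)) / sum(confMat)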

More detail than you want

The cost-complexity measure is given by

\[ R_\alpha(T) = R(T) + \alpha |T| \]

where \( R(T) \) is the resubstitution error rate for tree \( T \) and \( |T| \) is the size/complexity of the tree (the number of terminal nodes).

  • We minimise this!
  • Note again we have fidelity to the data balanced against the complexity of the model (like AIC, adjusted \( R^2 \), …)

More detail than you want

The cost-complexity measure is given by

\[ R_\alpha(T) = R(T) + \alpha |T| \]

  • We can see \( \alpha \) balances the emphasis: if \( \alpha = 0 \), no constraint is imposed on complexity.
  • If \( \alpha \) were very large, the complexity term would dominate the minimisation and the simplest model - no splits at all - would be favoured.

Hiding under the hood, your software is probably doing cost-complexity pruning: we find an \( \alpha \) to give good generalisation error.
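In rpart the complexity parameter cp plays the role of \( \alpha \) (rescaled by the root-node error), and the cross-validated error for each candidate cp is stored in the fitted object. A sketch of choosing cp by the common one-standard-error rule (one convention among several, not the only option); the graphical version via plotcp follows on the next slide:

  printcp(wacfTrain)               # columns: CP, nsplit, rel error, xerror, xstd
  cpTable <- wacfTrain$cptable

  # 1-SE rule: simplest tree whose cross-validated error (xerror) is within
  # one standard error of the minimum
  best   <- which.min(cpTable[, "xerror"])
  cutoff <- cpTable[best, "xerror"] + cpTable[best, "xstd"]
  cpPick <- cpTable[which(cpTable[, "xerror"] <= cutoff)[1], "CP"]

  prunedTree <- prune(wacfTrain, cp = cpPick)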

More detail than you want

  plotcp(wacfTrain)

[Plot: cross-validated error against the complexity parameter cp, from plotcp]

  prunedTree <- prune(wacfTrain, cp=0.0023)

More detail than you want

  fancyRpartPlot(prunedTree)

[Plot: the pruned tree drawn with fancyRpartPlot]

Surrogate splits, missing values

Trees are touted as having good treatment of missing values

Surrogate splits, missing values

Missing values are a pain. If I have/want \( y = f(\mathbf{X}) \):

  • I usually can't use a row of data missing some \( x \) to fit my model
  • If I have \( f(\mathbf{X}) \), I usually can't make predictions without all the \( x \)

Decision trees may store surrogate splits as they are built

High level view - surrogates

  • If I choose a particular split as best, is there another split on another variable that divides the data in almost the same way?
    • If so, store it
    • (draws on board)
    • My software may store several surrogates of decreasing similarity for each selected split
    • (It's not too costly - I did a massive search anyway)

Mid level view - surrogates

Advantage?

  • If I seek to predict \( y \) but am missing some \( x \) values, I might use a surrogate to bypass a node whose splitting variable I don't have a value for
  • e.g. split A uses \( X_1 \), but split B on \( X_2 \) has a similar effect
  • If I have a missing value for \( x_1 \), I move down the tree using the surrogate split on \( x_2 \) and continue as usual (see the sketch below)
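A sketch of this in rpart, predicting for a made-up passenger whose age is missing. With only two covariates here the available surrogates are limited, so rpart may instead send the case in the majority direction; the behaviour is controlled by usesurrogate in rpart.control:

  # Hypothetical new passenger with age missing
  newPassenger <- data.frame(sex = factor("male", levels = c("female", "male")),
                             age = NA_real_)

  # rpart follows surrogate splits (or the majority direction) past any node
  # whose primary split variable is NA
  predict(wacfTrain, newdata = newPassenger, type = "class")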

Cautionary tales

Be aware/things not covered

  • Imbalances of classes make things hard (if \( y \) is mainly one class, misclassification is naturally low). Not just for trees - we'll return to this
  • Relatedly: costs of errors need not be treated the same
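Not covered today, but for reference: rpart accepts a loss matrix so the two kinds of error can be costed differently. A sketch with made-up costs (as I read the rpart documentation, rows index the true class and columns the predicted class, with zeros on the diagonal):

  # Cost 3 for classifying a true survivor as not surviving, cost 1 for the
  # reverse error (values are illustrative only)
  lossMat <- matrix(c(0, 1,
                      3, 0), nrow = 2, byrow = TRUE)

  costTree <- rpart(survived ~ sex + age, method = "class", data = leoIsDead,
                    parms = list(loss = lossMat))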