ID5059 Lecture 09 - Regression & Decision Trees

C. Donovan
21 Feb 2018

Administrivia

  • No lecture Friday. Complain to the VP
  • (pssst: suggest reading section 8 of James et al)

Big picture

  • We're adding new ways to build \( f \) (as per my favourite eqn \( \mathbf{y} = f(\mathbf{X}) \))
  • Trees are really simple
    • They're good for telling stories (interpretable), but not great at predictive performance
    • We can, however, combine many of them for good prediction (ensembles, later)

Today

  • How do we search for the best splits?
    • Splitting on numeric versus categorical covariates/features
  • How/why do we stop splitting?
  • How do we get the right-sized tree?
  • Missing values and surrogate splits

Refer to: James et al Sections 8.1.1 & 8.1.2

First up - nature of the response

In short:

  • If your response is a class, it's a classification tree

    • Split based on purity (measured using Gini or Entropy)
    • Stop
    • Prune: aim for something like low misclassification (generalisation) error
  • If your response is numeric, it's a regression tree

    • Split based on a measure of the residuals \( y-\hat{y} \) (e.g. the sum of squared errors)
    • Stop
    • Prune: aim for low generalisation error based on \( y-\hat{y} \) e.g. Mean Squared Error

simples!
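For concreteness, a minimal sketch of both cases in rpart. The data frame df and the variables y, g, x1, x2 here are hypothetical placeholders, not the Titanic data used later:

  library(rpart)

  # Regression tree: numeric response, splits chosen to reduce sum((y - yhat)^2)
  regTree <- rpart(y ~ x1 + x2, data = df, method = "anova")

  # Classification tree: class response, splits chosen to increase node purity
  clsTree <- rpart(g ~ x1 + x2, data = df, method = "class")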

Searching for splits

  • The splitting criterion is either purity (e.g. Gini or Entropy) or based on the residuals \( y-\hat{y} \)
  • It is a brute-force approach - look “everywhere” (see the sketch below for one numeric covariate)
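To make the brute-force idea concrete, here is a rough sketch (not rpart's actual implementation) of the exhaustive search over a single numeric covariate x with a class response y, scoring each candidate threshold by the size-weighted Gini impurity of the two children:

  # Gini impurity of a set of class labels
  giniImpurity <- function(y) {
    p <- table(y) / length(y)
    1 - sum(p^2)
  }

  # Try every midpoint between adjacent observed x values; keep the threshold
  # giving the lowest size-weighted impurity of the two child nodes
  bestSplit <- function(x, y) {
    xs   <- sort(unique(x))
    cuts <- (head(xs, -1) + tail(xs, -1)) / 2
    score <- sapply(cuts, function(cc) {
      left  <- y[x <  cc]
      right <- y[x >= cc]
      (length(left) * giniImpurity(left) +
         length(right) * giniImpurity(right)) / length(y)
    })
    list(threshold = cuts[which.min(score)], impurity = min(score))
  }

The full greedy search simply repeats this over every covariate at every node and takes the overall best split.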

Searching for splits

[Figures: step-by-step illustration of the exhaustive split search, from Arthur Charpentier – freakonometrics.hypotheses.org]

Recap: splitting objective

  • The thing we're maximising or minimising is a measure of homogeneity
    • For classes, make nodes pure
    • For numeric response, make nodes contain similar values (reduce the variance or \( y-\hat{y} \) or…)
  • A node is maximally pure when its impurity is zero

Recap: Our two measures of (im)purity \(\phi\)

Take the proportions \( p_j \) of each class \( j \) within the node.

The entropy of a set of output classes: \[ \phi(p_1,\ldots,p_J) = -\sum_j p_j\log_2 p_j \]

The Gini Index

\[ \phi(p_1,\ldots,p_J) = 1 - \sum_j p_j^2 \]

Relatively large values are bad - the node has no clear majority class. Smaller values mean some classes dominate.
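As a quick sanity check, both measures computed directly from the class proportions (a sketch for two classes):

  entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))   # ignore empty classes
  gini    <- function(p) 1 - sum(p^2)

  entropy(c(0.5, 0.5));  gini(c(0.5, 0.5))   # 1 and 0.5   - worst case for 2 classes
  entropy(c(0.9, 0.1));  gini(c(0.9, 0.1))   # ~0.47 and 0.18 - fairly pure
  entropy(c(1.0, 0.0));  gini(c(1.0, 0.0))   # 0 and 0     - completely pure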

Categorical covariates

  • This has all looked at numeric covariates - which is intuitive
  • For class/categorical covariates, we need to consider all possible groupings of the levels - this can be very expensive (see the sketch below)
    • For this reason (and for many other models too) we don't like categorical variables with lots of levels
    • We may not be able to use them blindly - we may condense classes/extract the relevant bits (e.g. truncate post-codes) or drop them
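To see why, a tiny sketch counting the distinct binary splits of a factor: a covariate with \( L \) levels can be divided into two non-empty groups in \( 2^{L-1}-1 \) ways, so the search explodes as \( L \) grows.

  nSplits <- function(L) 2^(L - 1) - 1   # distinct binary partitions of L levels

  nSplits(2)    # 1
  nSplits(5)    # 15
  nSplits(10)   # 511
  nSplits(20)   # 524287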

Stopping your tree

This isn't fancy - there are a bunch of rules underlying your software:

  • Maximum number of splits/levels e.g. no more than 32
  • Minimum number of data in a node e.g. >5
  • Others? Depends on the software

We're usually just allowing the tree to become complex, if the data require it.

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  library(rpart)   # recursive partitioning trees

  leoIsDead <- read.csv("data/train.csv", header=TRUE)

  # cp=0 removes the complexity penalty while growing, so only the other
  # control settings (printed below) limit the tree's size
  wacfTrain <- rpart(survived ~ sex + age, method="class", data=leoIsDead, control = rpart.control(cp=0))

  wacfTrain$control
$minsplit
[1] 20

$minbucket
[1] 7

$cp
[1] 0

$maxcompete
[1] 4

$maxsurrogate
[1] 5

$usesurrogate
[1] 2

$surrogatestyle
[1] 0

$maxdepth
[1] 30

$xval
[1] 10

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  wacfTrain <- rpart(survived ~ sex + age, method="class", data=leoIsDead, control = rpart.control(cp=0))

  wacfTrain
n= 891 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

  1) root 891 342 0 (0.61616162 0.38383838)  
    2) sex=male 577 109 0 (0.81109185 0.18890815)  
      4) age>=6.5 553  93 0 (0.83182640 0.16817360)  
        8) age< 24.75 137  17 0 (0.87591241 0.12408759) *
        9) age>=24.75 416  76 0 (0.81730769 0.18269231)  
         18) age>=27.5 373  63 0 (0.83109920 0.16890080)  
           36) age>=53 37   4 0 (0.89189189 0.10810811) *
           37) age< 53 336  59 0 (0.82440476 0.17559524)  
             74) age< 47.5 312  51 0 (0.83653846 0.16346154) *
             75) age>=47.5 24   8 0 (0.66666667 0.33333333)  
              150) age>=49.5 15   3 0 (0.80000000 0.20000000) *
              151) age< 49.5 9   4 1 (0.44444444 0.55555556) *
         19) age< 27.5 43  13 0 (0.69767442 0.30232558) *
      5) age< 6.5 24   8 1 (0.33333333 0.66666667) *
    3) sex=female 314  81 1 (0.25796178 0.74203822)  
      6) age< 12 32  13 1 (0.40625000 0.59375000)  
       12) age>=7.5 8   1 0 (0.87500000 0.12500000) *
       13) age< 7.5 24   6 1 (0.25000000 0.75000000) *
      7) age>=12 282  68 1 (0.24113475 0.75886525)  
       14) age< 48.5 258  66 1 (0.25581395 0.74418605)  
         28) age>=42.5 17   7 1 (0.41176471 0.58823529) *
         29) age< 42.5 241  59 1 (0.24481328 0.75518672)  
           58) age< 32.25 190  52 1 (0.27368421 0.72631579)  
            116) age>=24.5 52  16 1 (0.30769231 0.69230769) *
            117) age< 24.5 138  36 1 (0.26086957 0.73913043)  
              234) age< 21.5 105  31 1 (0.29523810 0.70476190)  
                468) age>=19.5 9   4 0 (0.55555556 0.44444444) *
                469) age< 19.5 96  26 1 (0.27083333 0.72916667) *
              235) age>=21.5 33   5 1 (0.15151515 0.84848485) *
           59) age>=32.25 51   7 1 (0.13725490 0.86274510) *
       15) age>=48.5 24   2 1 (0.08333333 0.91666667) *

Stopping your tree

# Fit and explore tree ----------------------------------------------------

  library(rattle)   # provides fancyRpartPlot()

  fancyRpartPlot(wacfTrain)

[Plot: the fully grown tree drawn with fancyRpartPlot]

The right-sized tree

Too complex is, well, too complex and bad for prediction.

We want the tree with the best generalisation error (obvs). But

  • What set of trees are we considering?
  • What error measure do we use?

High level view

It's just this problem again: training error keeps falling as complexity grows, but generalisation error eventually rises, so we want the sweet spot in between.

In our case, complexity is tree size (number of terminal nodes), with the error measure TBD

Mid level view

  • We grow our trees in a greedy fashion; we prune in reverse, collapsing splits back together
  • This gives a sequence of trees from the most complex down to the stump, losing a split each time
  • Generalisation error will be from validation or cross-validation
  • Generalisation error may be expressed as:
    • misclassification rate for a class problem (there are others)
    • Mean Squared Error or similar for a numeric response (i.e. where the \( y-\hat{y} \) makes sense)

Mid level view

  • Recall: the misclassification rate is the proportion of observations off the leading diagonal of a confusion matrix
  • It is the proportion of observations we misclassify - small is good
  • To get a generalisation version, apply the model to validation or cross-validation data before building the confusion matrix (see the sketch below)
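A sketch of that calculation with rpart, assuming a held-out data frame leoIsDeadTest with the same columns as the training data (the name is hypothetical):

  # Predicted classes on the held-out data
  predClass <- predict(wacfTrain, newdata = leoIsDeadTest, type = "class")

  # Confusion matrix: observed vs predicted
  confMat <- table(observed = leoIsDeadTest$survived, predicted = predClass)
  confMat

  # Misclassification rate = proportion off the leading diagonal
  1 - sum(diag(confMat)) / sum(confMat)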

More detail than you want

The cost-complexity measure is given by

\[ R_\alpha(T) = R(T) + \alpha |T| \]

where \( R(T) \) is the resubstitution error rate for tree \( T \) and \( |T| \) is the size/complexity of the tree (the number of terminal nodes).

  • We minimise this!
  • Note again we have fidelity to the data balanced against the complexity of the model (like AIC, adjusted \( R^2 \), …)

More detail than you want

The cost-complexity measure is given by

\[ R_\alpha(T) = R(T) + \alpha |T| \]

  • We can see \( \alpha \) balances the emphasis: if \( \alpha = 0 \), no constraint is imposed on complexity.
  • If \( \alpha \) were very large, the complexity term would dominate the minimisation and the simplest model - no splits at all - would be favoured.

Hiding under the hood, your software is probably doing cost-complexity pruning: we find an \( \alpha \) to give good generalisation error.
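In rpart the complexity parameter cp plays the role of \( \alpha \) (rescaled by the root-node error), and the cross-validated error for each candidate cp is stored in the fitted object. A sketch of choosing cp by the common one-standard-error rule (one convention among several, not the only option); the graphical version via plotcp follows on the next slide:

  printcp(wacfTrain)               # columns: CP, nsplit, rel error, xerror, xstd
  cpTable <- wacfTrain$cptable

  # 1-SE rule: simplest tree whose cross-validated error (xerror) is within
  # one standard error of the minimum
  best   <- which.min(cpTable[, "xerror"])
  cutoff <- cpTable[best, "xerror"] + cpTable[best, "xstd"]
  cpPick <- cpTable[which(cpTable[, "xerror"] <= cutoff)[1], "CP"]

  prunedTree <- prune(wacfTrain, cp = cpPick)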

More detail than you want

  plotcp(wacfTrain)

[Plot: cross-validated error against the complexity parameter cp, from plotcp]

  prunedTree <- prune(wacfTrain, cp=0.0023)

More detail than you want

  fancyRpartPlot(prunedTree)

[Plot: the pruned tree drawn with fancyRpartPlot]

Surrogate splits, missing values

Trees are touted as having good treatment of missing values

Surrogate splits, missing values

Missing values are a pain. If I have/want \( y = f(\mathbf{X}) \):

  • I usually can't use a row of data missing some \( x \) to fit my model
  • If I have \( f(\mathbf{X}) \), I usually can't make predictions without all the \( x \)

Decision trees may store surrogate splits as they are built

High level view - surrogates

  • If I choose a particular split as best, is there another split on another variable that divides the data in almost the same way?
    • If so, store it
    • (draws on board)
    • My software may store several surrogates of decreasing similarity for each selected split
    • (It's not too costly - I did a massive search anyway)

Mid level view - surrogates

Advantage?

  • If I seek to predict \( y \) but am missing some \( x \) values, I might use a surrogate to bypass a node whose splitting variable I don't have a value for
  • e.g. split A uses \( X_1 \), but split B on \( X_2 \) has a similar effect
  • If I have a missing value for \( x_1 \), I move down the tree using the surrogate split on \( x_2 \) and continue as usual (see the sketch below)
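A sketch of this in rpart, predicting for a made-up passenger whose age is missing. With only two covariates here the available surrogates are limited, so rpart may instead send the case in the majority direction; the behaviour is controlled by usesurrogate in rpart.control:

  # Hypothetical new passenger with age missing
  newPassenger <- data.frame(sex = factor("male", levels = c("female", "male")),
                             age = NA_real_)

  # rpart follows surrogate splits (or the majority direction) past any node
  # whose primary split variable is NA
  predict(wacfTrain, newdata = newPassenger, type = "class")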

Cautionary tales

Be aware/things not covered

  • Imbalances of classes make things hard (if \( y \) is mainly one class, misclassification is naturally low). Not just for trees - we'll return to this
  • Relatedly: costs of errors need not be treated the same
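Not covered today, but for reference: rpart accepts a loss matrix so the two kinds of error can be costed differently. A sketch with made-up costs (as I read the rpart documentation, rows index the true class and columns the predicted class, with zeros on the diagonal):

  # Cost 3 for classifying a true survivor as not surviving, cost 1 for the
  # reverse error (values are illustrative only)
  lossMat <- matrix(c(0, 1,
                      3, 0), nrow = 2, byrow = TRUE)

  costTree <- rpart(survived ~ sex + age, method = "class", data = leoIsDead,
                    parms = list(loss = lossMat))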