ID5059 Lecture 08 - Regression & Decision Trees

C. Donovan
16 Feb 2018

Administrivia

  • Lab: today in main comp sci lab, 1-3pm

    • Lab tasks are on moodle
    • Tutor and I will be there - project questions are also fair game
  • Discussion forum on Moodle - post questions

Big picture

  • We're adding new ways to build \( f \) (as per my favourite eqn \( \mathbf{y} = f(\mathbf{X}) \))
  • Trees are really simple
    • They're good to tell stories with (interpretable) but not great predictive performance
    • We can however use many of them combined for good prediction (ensembles, later)

Need-to-knows

Mainly for today:

  • Tree interpretation again for fun
  • How to decide/measure if a candidate split is better than another (for both class or numeric response)
    • So, what node purity is
  • How we predict a value for a node

Remember your tree-fu, Padawan

[satisfy the condition, go left]

Remember your tree-fu, Padawan

Note

  • majority prediction at nodes
  • satisfy the numeric condition, go left
  • can estimate probabilities too for each node
  • we can calculate confusion matrices (and related measures) at any node and overall

Some preliminary jargon

  • Training, Validation & Test data
  • Resubstitution error (i.e. training error) the error rate of a tree on the cases from which it was constructed
  • Generalisation error the error rate of a tree on unseen cases
  • Purity how mixed-up the training data is at a node

Tree Generation

All tree methods follow this basic scheme:

  • We have a set of mixed-up (response) data
  • so immediately we need some measure of how mixed-up the data is
  • Split the data (based on covariates) into two subsets
  • the data in each ought to be less mixed-up overall
  • each split forms the branches of a new tree
  • Recurse by repeating for each subtree/branch (til stop)

Tree Generation

The methods differ based on some choices:

  1. How “mixed-up” is measured e.g.

    • Variance of the response if \( y \) is numeric
    • node purity if we're looking a class \( y \)
  2. How we decide when to stop splitting

    • the heuristics for this are common to all methods
    • often as simple as “stop when 4 or fewer items are in subset”
  3. How we get the right tree size (we prune to get good generalisation error)

  4. How we predict \( y \)

First up - nature of the response

In short:

  • If your response is a class, it's a classificaion tree
  • If your response is numeric, it's a regression tree

simples!

Regression Trees

Classical statistical approach:

  • Mixed-up is measured by standard deviation (or any measure of variability of \( y \))
  • Predict using the average of the \( y \) that fall into each split (i.e. predict with the mean)

Decision/Classification Trees

Information Theory approach:

  • Mixed-up is measured by amount of (im)purity
  • Predict by reporting the majority class (or a variant based on class proportions)

Purity?

Purity can be measured many ways: Entropy, Gini index & two-ing.

  • A group/subspace/node/whatever is pure if it contains only one class
  • (In regression tree terminology, the SD of the outputs is zero)
  • In classification tree terminology, the impurity is zero
  • We split and resplit in order to increase node purity
  • Complete tree purity is analogous to overfitting

Types of correct/incorrect classification

When considering a two-class problem, the range of prediction outcomes is small:

  • False positive or false negative
  • Correct negative or correct positive

which give rise to our familiar confusion matrix.

For a multi-class problem this range of outcomes is considerably larger:

  • Incorrectly predicting class as \( j \) when correct class is \( i \) (\( i,j \in 1,..,J \)).
  • Correctly predicting class as \( j \) (\( j \in 1,..,J \)).

so we have \( J \) types of correct classification, and \( J^2-J \) types of incorrect classification.

What do/don't want in a node?

Drawing required…

Purity of nodes - quantifying

  • intuitively that we want to optimise for some measure of the purity of nodes
  • Consider a \( J \)-level categorical response variable. A node gives a vector of proportions \( \mathbf{p}=[p_1, ..., p_J] \) for the levels of our response.
  • \( \sum_{j=1}^J p_j=1 \) so the vector \( \mathbf{p} \) is a probability distribution of the response classes within the node.

Purity of nodes - quantifying

  • We can list some desirable/necessary properties of an impurity measure, which will be a function of the these proportions \( \phi(\mathbf{p}) \).
  • \( \phi(\mathbf{p}) \) will be a maximum when \( \mathbf{p}=[\frac{1}{J},...,\frac{1}{J}] \). This is our definition of the least pure node, i.e there is an equal mixture of all classes.
  • \( \phi(\mathbf{p}) \) will be a minimum when \( p_j=1 \) (and therefore \( p_k=0 \quad\forall\quad k\ne j \)). This is our definition of our most pure node, only one class exists.

Purity of nodes - quantifying

  • Our measure of the impurity of a node \( t \) will be given by \( i(t)=\phi((p_1|t),...,(p_J|t)) \)
  • A measure of the decrease of impurity resulting from splitting node \( t \) into a left node \( t_L \) and a right node \( t_R \) will be given by \( \Delta i(t)=i(t)-p_L i(t_L)-p_R i(t_R) \), where \( p_L \) and \( p_R \) are the proportion of points in \( t \) that go to the left and right respectively.

Entropy measure of Purity

We work in base 2, taking the bit as our unit.

We can now precisely define the entropy of a set of output classes as: \[ \phi(p_1,\ldots,p_n) = \sum p_i \log_2 {1\over p_i} = -\sum p_i\log_2 p_i \]

You'll see it is large when impure (largest when all \( p \) are equal), vanishingly small when pure. For a two class problem it is between 0 and 1.

Example

Class Bus Car Train
Probability 0.4 0.3 0.3

\[ \begin{eqnarray*}\phi(0.4,0.3,0.3) &=& -0.4 \log_2 0.4 - 0.3 \log_2 0.3 - 0.3 \log_2 0.3\\ &\approx& 1.571 \end{eqnarray*} \]

We say our output class data has entropy about 1.57 bits per class.

Example

Class Bus Car Train
Probability 0 1 0

\[ \begin{eqnarray*}\phi(0,1,0) &=& - 1 \log_2 1 \\ &=& 0 \end{eqnarray*} \]

We say our output class data has zero entropy, meaning zero randomness

Example

Class Bus Car Train
Probability 1/3 1/3 1/3

\[ \begin{eqnarray*} \phi(\frac{1}{3},\frac{1}{3},\frac{1}{3}) &=& -\frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} \\ &\approx& -\frac{-1.584963}{3} - \frac{-1.584963}{3} - \frac{-1.584963}{3} \\ &\approx& 0.528321 + 0.528321 + 0.528321 \\ &\approx& 1.584963 \end{eqnarray*} \]

We say our output class data has maximum entropy for a class of this size, meaning the most randomness

Gini Index

One minus the sum of the squared output probabilities:

\[ 1 - \Sigma p_j^2 \]

In our example, \( 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660 \).

Minimum Gini index is zero

Maximum Gini index is \( 1 - n(\frac{1}{n})^2 = 1 - \frac{1}{n} \), two thirds in our example

Numeric Value Discretisation

For each numeric attribute:

  • Sort the data instances on that attribute
  • Look for potential knots

    • points in the sorted list above which the class labels change
  • Apply a measure on each of the candidate knots, and choose the one with the maximum value

    • weight the measures by size of subset
    • measure is SD, entropy, Gini index, ….
  • Repeat recursively in both subsets until either

    • the subset is pure
    • some stopping criterion is reached

Numeric Value Discretisation

  • The process just described is a supervised method for collecting numeric values into discrete sets
  • Once complete, we treat as categorical

    • \( 0\ldots14.6 \rightarrow \) Bus, \( 14.7\ldots19.3 \rightarrow \) Car, etc.
  • There is often no need to be so sophisticated

  • We can use unsupervised discretisation methods

    • sort values into centiles, fixed width bins, above & below average, “good” above 25\ and the rest “bad”, etc.
  • Unsupervised means “ignore the class variable”

Trees Recap

Regression trees

  • Select splits – reduce group heterogeneity via Standard Deviation
  • Determine when to stop splitting
  • Select the “right-sized” tree

Classification trees

  • Specify the criteria for predictive accuracy
  • Select splits – reduce group heterogeneity via information
  • Determine when to stop splitting
  • Select the “right-sized” tree

Next Lecture

  • So far we've talked about building trees
  • But we haven't built any yet
  • We fix this next, looking at worked exmples of regression and classification trees

    • using all three measures
    • involving categorical and numeric input values
  • After that, we consider validation – pruning the trees