ID5059 Lecture 08 - Regression & Decision Trees

C. Donovan
16 Feb 2018

Administrivia

Lab: today in main comp sci lab, 1-3pm
- Lab tasks are on moodle
- Tutor and I will be there - project questions are also fair game
Discussion forum on Moodle - post questions

Big picture

We're adding new ways to build \( f \) (as per my favourite eqn \( \mathbf{y} = f(\mathbf{X}) \))
Trees are really simple
- They're good to tell stories with (interpretable) but not great predictive performance
- We can however use many of them combined for good prediction (ensembles, later)

Need-to-knows

Mainly for today:

Tree interpretation again for fun
How to decide/measure if a candidate split is better than another (for both class or numeric response)
- So, what node purity is
How we predict a value for a node

Remember your tree-fu, Padawan

[satisfy the condition, go left]

Remember your tree-fu, Padawan

Note

majority prediction at nodes
satisfy the numeric condition, go left
can estimate probabilities too for each node
we can calculate confusion matrices (and related measures) at any node and overall

Some preliminary jargon

Training, Validation & Test data
Resubstitution error (i.e. training error) the error rate of a tree on the cases from which it was constructed
Generalisation error the error rate of a tree on unseen cases
Purity how mixed-up the training data is at a node

Tree Generation

All tree methods follow this basic scheme:

We have a set of mixed-up (response) data
so immediately we need some measure of how mixed-up the data is
Split the data (based on covariates) into two subsets
the data in each ought to be less mixed-up overall
each split forms the branches of a new tree
Recurse by repeating for each subtree/branch (til stop)

Tree Generation

The methods differ based on some choices:

How “mixed-up” is measured e.g.
- Variance of the response if \( y \) is numeric
- node purity if we're looking a class \( y \)
How we decide when to stop splitting
- the heuristics for this are common to all methods
- often as simple as “stop when 4 or fewer items are in subset”
How we get the right tree size (we prune to get good generalisation error)
How we predict \( y \)

First up - nature of the response

In short:

If your response is a class, it's a classificaion tree
If your response is numeric, it's a regression tree

simples!

Regression Trees

Classical statistical approach:

Mixed-up is measured by standard deviation (or any measure of variability of \( y \))
Predict using the average of the \( y \) that fall into each split (i.e. predict with the mean)

Decision/Classification Trees

Information Theory approach:

Mixed-up is measured by amount of (im)purity
Predict by reporting the majority class (or a variant based on class proportions)

Purity?

Purity can be measured many ways: Entropy, Gini index & two-ing.

A group/subspace/node/whatever is pure if it contains only one class
(In regression tree terminology, the SD of the outputs is zero)
In classification tree terminology, the impurity is zero
We split and resplit in order to increase node purity
Complete tree purity is analogous to overfitting

Types of correct/incorrect classification

When considering a two-class problem, the range of prediction outcomes is small:

False positive or false negative
Correct negative or correct positive

which give rise to our familiar confusion matrix.

For a multi-class problem this range of outcomes is considerably larger:

Incorrectly predicting class as \( j \) when correct class is \( i \) (\( i,j \in 1,..,J \)).
Correctly predicting class as \( j \) (\( j \in 1,..,J \)).

so we have \( J \) types of correct classification, and \( J^2-J \) types of incorrect classification.

What do/don't want in a node?

Drawing required…

Purity of nodes - quantifying

intuitively that we want to optimise for some measure of the purity of nodes
Consider a \( J \)-level categorical response variable. A node gives a vector of proportions \( \mathbf{p}=[p_1, ..., p_J] \) for the levels of our response.
\( \sum_{j=1}^J p_j=1 \) so the vector \( \mathbf{p} \) is a probability distribution of the response classes within the node.

Purity of nodes - quantifying

We can list some desirable/necessary properties of an impurity measure, which will be a function of the these proportions \( \phi(\mathbf{p}) \).
\( \phi(\mathbf{p}) \) will be a maximum when \( \mathbf{p}=[\frac{1}{J},...,\frac{1}{J}] \). This is our definition of the least pure node, i.e there is an equal mixture of all classes.
\( \phi(\mathbf{p}) \) will be a minimum when \( p_j=1 \) (and therefore \( p_k=0 \quad\forall\quad k\ne j \)). This is our definition of our most pure node, only one class exists.

Purity of nodes - quantifying

Our measure of the impurity of a node \( t \) will be given by \( i(t)=\phi((p_1|t),...,(p_J|t)) \)
A measure of the decrease of impurity resulting from splitting node \( t \) into a left node \( t_L \) and a right node \( t_R \) will be given by \( \Delta i(t)=i(t)-p_L i(t_L)-p_R i(t_R) \), where \( p_L \) and \( p_R \) are the proportion of points in \( t \) that go to the left and right respectively.

Entropy measure of Purity

We work in base 2, taking the bit as our unit.

We can now precisely define the entropy of a set of output classes as: \[ \phi(p_1,\ldots,p_n) = \sum p_i \log_2 {1\over p_i} = -\sum p_i\log_2 p_i \]

You'll see it is large when impure (largest when all \( p \) are equal), vanishingly small when pure. For a two class problem it is between 0 and 1.

Example

Class	Bus	Car	Train
Probability	0.4	0.3	0.3

\[ \begin{eqnarray*}\phi(0.4,0.3,0.3) &=& -0.4 \log_2 0.4 - 0.3 \log_2 0.3 - 0.3 \log_2 0.3\\ &\approx& 1.571 \end{eqnarray*} \]

We say our output class data has entropy about 1.57 bits per class.

Example

Class	Bus	Car	Train
Probability	0	1	0

\[ \begin{eqnarray*}\phi(0,1,0) &=& - 1 \log_2 1 \\ &=& 0 \end{eqnarray*} \]

We say our output class data has zero entropy, meaning zero randomness

Example

Class	Bus	Car	Train
Probability	1/3	1/3	1/3

\[ \begin{eqnarray*} \phi(\frac{1}{3},\frac{1}{3},\frac{1}{3}) &=& -\frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} \\ &\approx& -\frac{-1.584963}{3} - \frac{-1.584963}{3} - \frac{-1.584963}{3} \\ &\approx& 0.528321 + 0.528321 + 0.528321 \\ &\approx& 1.584963 \end{eqnarray*} \]

We say our output class data has maximum entropy for a class of this size, meaning the most randomness

Gini Index

One minus the sum of the squared output probabilities:

\[ 1 - \Sigma p_j^2 \]

In our example, \( 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660 \).

Minimum Gini index is zero

Maximum Gini index is \( 1 - n(\frac{1}{n})^2 = 1 - \frac{1}{n} \), two thirds in our example

Numeric Value Discretisation

For each numeric attribute:

Sort the data instances on that attribute
Look for potential knots
- points in the sorted list above which the class labels change
Apply a measure on each of the candidate knots, and choose the one with the maximum value
- weight the measures by size of subset
- measure is SD, entropy, Gini index, ….
Repeat recursively in both subsets until either
- the subset is pure
- some stopping criterion is reached

Numeric Value Discretisation

The process just described is a supervised method for collecting numeric values into discrete sets
Once complete, we treat as categorical
- \( 0\ldots14.6 \rightarrow \) Bus, \( 14.7\ldots19.3 \rightarrow \) Car, etc.
There is often no need to be so sophisticated
We can use unsupervised discretisation methods
- sort values into centiles, fixed width bins, above & below average, “good” above 25\ and the rest “bad”, etc.
Unsupervised means “ignore the class variable”

Trees Recap

Regression trees

Select splits – reduce group heterogeneity via Standard Deviation
Determine when to stop splitting
Select the “right-sized” tree

Classification trees

Specify the criteria for predictive accuracy
Select splits – reduce group heterogeneity via information
Determine when to stop splitting
Select the “right-sized” tree

Next Lecture

So far we've talked about building trees
But we haven't built any yet
We fix this next, looking at worked exmples of regression and classification trees
- using all three measures
- involving categorical and numeric input values
After that, we consider validation – pruning the trees