C. Donovan
16 Feb 2018
Lab: today in main comp sci lab, 1-3pm
Discussion forum on Moodle - post questions
Mainly for today:
[Convention: if an observation satisfies the split condition, it goes down the left branch]
Note
All tree methods follow the same basic recursive-partitioning scheme (a rough code sketch follows the list below):
The methods differ based on some choices:
How “mixed-up-ness” (impurity) is measured, e.g. entropy or the Gini index
How we decide when to stop splitting
How we get the right tree size (we prune to get good generalisation error)
How we predict \( y \)
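A rough, self-contained sketch of that generic scheme for a single numeric attribute, with the points of choice flagged in comments. The function names (`grow_tree`, `entropy`) and the toy data are my own, and entropy plus a majority-class vote stand in for whichever measure and prediction rule a particular method actually uses:

```python
import math

def entropy(labels):
    """How 'mixed-up' a list of class labels is, in bits (one choice of measure)."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c)
               for c in (labels.count(y) for y in set(labels)))

def grow_tree(xs, ys, min_size=2):
    """Generic recursive partitioning for one numeric attribute x and class y.
    The parts that differ between tree methods are flagged in the comments."""
    # Stopping rule (a choice): node is pure or too small to split further.
    if len(set(ys)) == 1 or len(ys) < min_size:
        # Prediction rule (a choice): here, the majority class in the node.
        return {"predict": max(set(ys), key=ys.count)}
    # Split selection (a choice of measure): try every midpoint between
    # successive distinct x values, keep the one with the least mixed-up
    # (lowest weighted-entropy) children.
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no knot between identical attribute values
        left, right = pairs[:i], pairs[i:]
        score = (len(left) * entropy([y for _, y in left]) +
                 len(right) * entropy([y for _, y in right])) / len(pairs)
        if best is None or score < best[0]:
            knot = (pairs[i][0] + pairs[i - 1][0]) / 2
            best = (score, knot, left, right)
    if best is None:  # all x values identical: cannot split, make a leaf
        return {"predict": max(set(ys), key=ys.count)}
    _, knot, left, right = best
    return {"knot": knot,  # satisfy x <= knot: go left
            "left": grow_tree([x for x, _ in left], [y for _, y in left], min_size),
            "right": grow_tree([x for x, _ in right], [y for _, y in right], min_size)}

# Tiny made-up example: small x values are 'car', large ones are 'bus'.
print(grow_tree([1, 2, 3, 10, 11, 12], ["car", "car", "car", "bus", "bus", "bus"]))
# -> {'knot': 6.5, 'left': {'predict': 'car'}, 'right': {'predict': 'bus'}}
```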
In short:
simples!
Classical statistical approach:
Information Theory approach:
Purity can be measured in many ways: entropy, the Gini index and the twoing rule.
When considering a two-class problem, the range of prediction outcomes is small: true positives, false positives, false negatives and true negatives,
which give rise to our familiar \( 2 \times 2 \) confusion matrix.
For a multi-class problem this range of outcomes is considerably larger:
so we have \( J \) types of correct classification, and \( J^2-J \) types of incorrect classification.
Drawing required…
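A possible form of that drawing, for \( J = 3 \) using the Bus/Car/Train classes from the example below: the diagonal holds the \( J = 3 \) types of correct classification, the off-diagonal cells the \( J^2 - J = 6 \) types of error.

\[
\begin{array}{c|ccc}
 & \text{predicted Bus} & \text{predicted Car} & \text{predicted Train} \\
\hline
\text{actual Bus}   & \text{correct} & \text{error}   & \text{error} \\
\text{actual Car}   & \text{error}   & \text{correct} & \text{error} \\
\text{actual Train} & \text{error}   & \text{error}   & \text{correct}
\end{array}
\]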
We work in base 2, taking the bit as our unit.
We can now precisely define the entropy of a set of output classes as: \[ \phi(p_1,\ldots,p_n) = \sum p_i \log_2 {1\over p_i} = -\sum p_i\log_2 p_i \]
You'll see it is large when the classes are mixed (largest when all \( p_i \) are equal) and exactly zero when the set is pure. For a two-class problem it lies between 0 and 1.
Class | Bus | Car | Train |
---|---|---|---|
Probability | 0.4 | 0.3 | 0.3 |
\[ \begin{eqnarray*}\phi(0.4,0.3,0.3) &=& -0.4 \log_2 0.4 - 0.3 \log_2 0.3 - 0.3 \log_2 0.3\\ &\approx& 1.571 \end{eqnarray*} \]
We say our output class data has entropy of about 1.57 bits per observation.
Class | Bus | Car | Train |
---|---|---|---|
Probability | 0 | 1 | 0 |
\[ \begin{eqnarray*}\phi(0,1,0) &=& - 1 \log_2 1 \\ &=& 0 \end{eqnarray*} \]
We say our output class data has zero entropy, meaning no randomness at all (terms with \( p_i = 0 \) contribute nothing, since \( p \log_2 p \to 0 \) as \( p \to 0 \)).
Class | Bus | Car | Train |
---|---|---|---|
Probability | 1/3 | 1/3 | 1/3 |
\[ \begin{eqnarray*} \phi(\frac{1}{3},\frac{1}{3},\frac{1}{3}) &=& -\frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} - \frac{1}{3} \log_2 \frac{1}{3} \\ &\approx& -\frac{-1.584963}{3} - \frac{-1.584963}{3} - \frac{-1.584963}{3} \\ &\approx& 0.528321 + 0.528321 + 0.528321 \\ &\approx& 1.584963 \end{eqnarray*} \]
We say our output class data has the maximum possible entropy for a three-class problem (\( \log_2 3 \approx 1.585 \) bits), meaning the most randomness.
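The three worked values above can be checked with a few lines of Python (the `entropy` helper name is my own, not from the lecture):

```python
import math

def entropy(probs):
    """Entropy in bits: sum of p * log2(1/p), skipping zero probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

print(entropy([0.4, 0.3, 0.3]))      # ~1.571 bits: mixed classes
print(entropy([0, 1, 0]))            # 0 bits: a pure set, no randomness
print(entropy([1/3, 1/3, 1/3]))      # ~1.585 bits: the maximum for three classes
```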
The Gini index is one minus the sum of the squared class probabilities:
\[ 1 - \sum_j p_j^2 \]
In our example, \( 1 - (0.4^2 + 0.3^2 + 0.3^2) = 0.660 \).
Minimum Gini index is zero (a completely pure set)
Maximum Gini index is \( 1 - n\left(\frac{1}{n}\right)^2 = 1 - \frac{1}{n} \), i.e. two thirds in our three-class example
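A quick check of those Gini values in the same style (again, the helper name is my own):

```python
def gini(probs):
    """Gini index: one minus the sum of the squared class probabilities."""
    return 1 - sum(p ** 2 for p in probs)

print(gini([0.4, 0.3, 0.3]))    # ~0.66, as in the worked example
print(gini([0, 1, 0]))          # 0.0: the minimum, a pure set
print(gini([1/3, 1/3, 1/3]))    # ~0.667: the maximum, 1 - 1/n with n = 3
```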
For each numeric attribute:
Look for potential knots
Apply a measure (e.g. information gain) to each candidate knot, and choose the knot with the maximum value (a small sketch follows this list)
Repeat recursively in both subsets until either the subset is pure or it is too small to split further
Once complete, we treat the discretised attribute as categorical
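A minimal sketch of that knot search for one numeric attribute, scoring each candidate knot by its information gain (drop in entropy) as one common choice of measure. The function names (`candidate_knots`, `entropy`) and the toy data are invented for illustration:

```python
import math

def entropy(labels):
    """Entropy (bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c)
               for c in (labels.count(y) for y in set(labels)))

def candidate_knots(xs, ys):
    """Score every midpoint between successive distinct sorted x values by the
    information gain it would achieve, and report the best knot."""
    pairs = sorted(zip(xs, ys))
    labels = [y for _, y in pairs]
    parent, n = entropy(labels), len(pairs)
    scores = {}
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no knot between identical attribute values
        knot = (pairs[i][0] + pairs[i - 1][0]) / 2
        left, right = labels[:i], labels[i:]
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        scores[knot] = parent - child  # information gain at this knot
    return scores, max(scores, key=scores.get)

# Made-up data: one numeric attribute, two classes.
scores, best = candidate_knots([1, 2, 3, 10, 11, 12],
                               ["car", "car", "car", "bus", "bus", "bus"])
print(scores)   # gain at each candidate knot
print(best)     # 6.5 -- separates the two classes perfectly (gain = 1 bit)
```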
There is often no need to be so sophisticated
We can use unsupervised discretisation methods
Unsupervised means “ignore the class variable”
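For instance, equal-width binning pays no attention to the class labels at all; a small numpy sketch (the data are made up):

```python
import numpy as np

# Equal-width, unsupervised discretisation: only x is used, never the class.
x = np.array([2.1, 3.5, 4.4, 7.9, 8.0, 9.6, 12.3, 15.0])
edges = np.linspace(x.min(), x.max(), num=4)   # boundaries of 3 equal-width bins
bins = np.digitize(x, edges[1:-1])             # which bin each value falls in
print(edges)                                   # ~[ 2.1  6.4  10.7  15. ]
print(bins)                                    # [0 0 0 1 1 1 2 2]
```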
Regression trees
Classification trees
We fix this next, looking at worked examples of regression and classification trees
After that, we consider validation – pruning the trees