C. Donovan
21 Feb 2018
Refer to: James et al., Sections 8.1.1 & 8.1.2
In short:
If your response is a class, it's a classification tree
If your response is numeric, it's a regression tree
simples!
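To make that concrete, here is a minimal sketch in rpart; the built-in iris and mtcars datasets are just stand-ins here, not the data used below:

library(rpart)
# Categorical response (Species) -> classification tree
classTree <- rpart(Species ~ ., data = iris, method = "class")
# Numeric response (mpg) -> regression tree
regTree <- rpart(mpg ~ ., data = mtcars, method = "anova")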
(figures from Arthur Charpentier – freakonometrics.hypotheses.org)
Take the proportions of each class (\( p_j \)) in the node.
The entropy of the set of output classes: \[ \phi(p_1,\ldots,p_n) = -\sum_j p_j\log_2 p_j \]
The Gini Index
\[ \phi(p_1,\ldots,p_n) = 1 - \sum_j p_j^2 \]
For both measures, relatively big is bad - the node doesn't have a clear majority class. Smaller means one or a few classes dominate, i.e. the node is purer.
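For a feel for the numbers, a quick sketch computing both impurity measures for some made-up node proportions:

# Made-up class proportions for a single node
p <- c(0.7, 0.2, 0.1)
# Entropy: -sum(p_j * log2(p_j)); 0 for a pure node
-sum(p * log2(p))    # approx 1.16
# Gini index: 1 - sum(p_j^2); also 0 for a pure node
1 - sum(p^2)         # 0.46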
This isn't fancy - there are a bunch of rules underlying your software:
We're usually just allowing the tree to be complex, if this is required.
# Fit and explore tree ----------------------------------------------------
library(rpart)     # recursive partitioning trees
library(rattle)    # for fancyRpartPlot() used later
leoIsDead <- read.csv("data/train.csv", header = TRUE)
# cp = 0: grow the tree with no complexity penalty
wacfTrain <- rpart(survived ~ sex + age, method = "class", data = leoIsDead,
                   control = rpart.control(cp = 0))
wacfTrain$control
$minsplit
[1] 20
$minbucket
[1] 7
$cp
[1] 0
$maxcompete
[1] 4
$maxsurrogate
[1] 5
$usesurrogate
[1] 2
$surrogatestyle
[1] 0
$maxdepth
[1] 30
$xval
[1] 10
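Those are the defaults, apart from cp, which we set to 0 above. The same arguments can be overridden if you want to restrain the tree yourself; the values (and object names) in this sketch are purely illustrative:

# Illustrative, stricter settings: bigger minimum node sizes, shallower tree
strictControl <- rpart.control(minsplit = 40, minbucket = 15, cp = 0.01, maxdepth = 5)
smallTree <- rpart(survived ~ sex + age, method = "class", data = leoIsDead,
                   control = strictControl)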
wacfTrain
n= 891
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 891 342 0 (0.61616162 0.38383838)
2) sex=male 577 109 0 (0.81109185 0.18890815)
4) age>=6.5 553 93 0 (0.83182640 0.16817360)
8) age< 24.75 137 17 0 (0.87591241 0.12408759) *
9) age>=24.75 416 76 0 (0.81730769 0.18269231)
18) age>=27.5 373 63 0 (0.83109920 0.16890080)
36) age>=53 37 4 0 (0.89189189 0.10810811) *
37) age< 53 336 59 0 (0.82440476 0.17559524)
74) age< 47.5 312 51 0 (0.83653846 0.16346154) *
75) age>=47.5 24 8 0 (0.66666667 0.33333333)
150) age>=49.5 15 3 0 (0.80000000 0.20000000) *
151) age< 49.5 9 4 1 (0.44444444 0.55555556) *
19) age< 27.5 43 13 0 (0.69767442 0.30232558) *
5) age< 6.5 24 8 1 (0.33333333 0.66666667) *
3) sex=female 314 81 1 (0.25796178 0.74203822)
6) age< 12 32 13 1 (0.40625000 0.59375000)
12) age>=7.5 8 1 0 (0.87500000 0.12500000) *
13) age< 7.5 24 6 1 (0.25000000 0.75000000) *
7) age>=12 282 68 1 (0.24113475 0.75886525)
14) age< 48.5 258 66 1 (0.25581395 0.74418605)
28) age>=42.5 17 7 1 (0.41176471 0.58823529) *
29) age< 42.5 241 59 1 (0.24481328 0.75518672)
58) age< 32.25 190 52 1 (0.27368421 0.72631579)
116) age>=24.5 52 16 1 (0.30769231 0.69230769) *
117) age< 24.5 138 36 1 (0.26086957 0.73913043)
234) age< 21.5 105 31 1 (0.29523810 0.70476190)
468) age>=19.5 9 4 0 (0.55555556 0.44444444) *
469) age< 19.5 96 26 1 (0.27083333 0.72916667) *
235) age>=21.5 33 5 1 (0.15151515 0.84848485) *
59) age>=32.25 51 7 1 (0.13725490 0.86274510) *
15) age>=48.5 24 2 1 (0.08333333 0.91666667) *
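As a side note, a sketch of getting fitted classes and the training-set (resubstitution) error from this tree, assuming the leoIsDead data and wacfTrain fit above; that error rate reappears in the pruning discussion below:

# Fitted classes on the training data
trainPred <- predict(wacfTrain, type = "class")
# Confusion matrix and resubstitution error rate
table(observed = leoIsDead$survived, predicted = trainPred)
mean(trainPred != leoIsDead$survived)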
# Fit and explore tree ----------------------------------------------------
fancyRpartPlot(wacfTrain)
Too complex is, well, too complex - and bad for prediction.
We want the tree with the best generalisation error (obvs). But it's just the usual over-fitting problem again: training error keeps falling as the tree grows, while generalisation error eventually gets worse.
In our case, complexity is tree size (the number of terminal nodes), with the error measure defined below.
The cost-complexity measure is given by
\[ R_\alpha(T) = R(T) + \alpha |T| \]
where \( R(T) \) is the resubstitution error rate for tree \( T \), and \( |T| \) is the size/complexity of the tree (the number of terminal nodes).
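As a toy illustration (the numbers are invented): suppose a large tree has \( R(T) = 0.16 \) with \( |T| = 10 \) terminal nodes, and a pruned subtree has \( R(T') = 0.18 \) with \( |T'| = 5 \). At \( \alpha = 0.01 \),
\[ R_\alpha(T) = 0.16 + 0.01 \times 10 = 0.26, \qquad R_\alpha(T') = 0.18 + 0.01 \times 5 = 0.23 \]
so the smaller tree is preferred once \( \alpha \) penalises the extra terminal nodes enough.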
Hiding under the hood, your software is probably doing cost-complexity pruning: we find an \( \alpha \) to give good generalisation error.
plotcp(wacfTrain)                            # cross-validated error against cp
prunedTree <- prune(wacfTrain, cp = 0.0023)  # prune back at the chosen cp
fancyRpartPlot(prunedTree)                   # plot the pruned tree
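The cp = 0.0023 above was read off the plot; a sketch of picking it programmatically instead, taking the cp with the smallest cross-validated error from the fitted tree's cptable (the 1-SE rule is another common choice). The object names here are illustrative:

# cp with the smallest cross-validated error
cpTable <- wacfTrain$cptable
bestCp <- cpTable[which.min(cpTable[, "xerror"]), "CP"]
prunedTree2 <- prune(wacfTrain, cp = bestCp)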
Trees are touted as having good treatment of missing values.
Missing values are a pain. If I have/want \( y = f(\mathbf{X}) \):
Decision trees may store surrogate splits when the tree is built.
Advantage?
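A small sketch, assuming the fitted tree above (the newPassenger data frame is hypothetical): summary() lists the surrogate splits stored at each node, and predict() can still classify a passenger whose age is missing.

# Surrogate splits are listed per node in the model summary
summary(prunedTree)
# Hypothetical passenger with age missing: surrogates (or the majority
# direction) still send them to a terminal node
newPassenger <- data.frame(sex = "male", age = NA)
predict(prunedTree, newdata = newPassenger, type = "class")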
Be aware/things not covered