Our primary focus in this Chapter is on the Model phase.
library(tigerstats)
library(tree)
library(tigerTree)
Trees are a convenient way to predict the value of a response variable and to see which explanatory variables matter in the prediction.
Two types of trees: regression trees (for a numerical response) and classification trees (for a factor response).
We'll work with classification trees, first.
For the m111survey data, let's make a tree model to predict sex from:
trSex <- tree(sex ~ fastest + height, data = m111survey)
plot(trSex); text(trSex)
A tree has nodes: a root node at the top, interior nodes where further splits occur, and terminal nodes (the "leaves").
The majority sex in each terminal node is listed under the node.
trSex
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 71 97.280 female ( 0.56338 0.43662 )  
##    2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )  
##      4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
##        8) fastest < 122.5 24  0.000 female ( 1.00000 0.00000 ) *
##        9) fastest > 122.5 5  5.004 female ( 0.80000 0.20000 ) *
##      5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )  
##       10) fastest < 97.5 6  5.407 female ( 0.83333 0.16667 ) *
##       11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
##    3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )  
##      6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
##      7) fastest > 106.5 15  0.000 male ( 0.00000 1.00000 ) *
How does the tree decide to make its splits?
How does it decide when to stop splitting?
Under the hood, tree uses the control argument and the tree.control function:
trSex <- tree(sex ~ fastest + height, data = m111survey,
control = tree.control(
nobs = nrow(m111survey),
mincut = 5,
minsize = 10,
mindev = 0.01
))
nobs is the number of observations you are using to build your tree.
We are using all the rows in m111survey.
So nobs = nrow(m111survey).
minsize is the smallest-sized node that you will consider splitting into two child nodes.
By default, minsize = 10, but we could change that.
If you plan to split a node, each of the two child nodes must be at least mincut in size.
By default, mincut = 5, but you can change this.
Note: mincut must not be more than half of minsize.
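For instance, here is a quick sketch (the values 20 and 10 below are purely illustrative, not from the notes) of growing a coarser version of trSex by raising minsize and mincut:

trSexCoarse <- tree(sex ~ fastest + height, data = m111survey,
                    control = tree.control(
                      nobs = nrow(m111survey),
                      mincut = 10,   # illustrative: each child must contain at least 10 observations
                      minsize = 20,  # illustrative: nodes with fewer than 20 observations are never split
                      mindev = 0.01))
summary(trSexCoarse)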
mindev pays attention to the deviance.
But what is deviance?
Each node has a deviance. When you are predicting sex, the deviance formula is:
\[-2 \left(n_{\text{female}} \ln(p_{\text{female}}) + n_{\text{male}} \ln(p_{\text{male}})\right),\]
where:
\(n_{\text{female}}\) and \(n_{\text{male}}\) are the numbers of females and males at the node,
\(p_{\text{female}}\) and \(p_{\text{male}}\) are the corresponding proportions of females and males at the node, and
\(\ln\) is the natural logarithm function.
For the root node of the tree the deviance is:
\[ -2\left(40\ln(40/71) + 31\ln(31/71)\right) \approx 97.280.\]
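We can check this arithmetic directly in R (log is the natural logarithm):

-2 * (40 * log(40/71) + 31 * log(31/71))   # about 97.28, matching the root deviance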
Deviance is a measure of how "pure" a node is. The closer the node is to being all male or all female, the closer its deviance is to 0.
Consider the following function:
\[f(p) = -2\left(p\ln(p) + (1-p)\ln(1-p)\right),\] where \(0 < p < 1\).
Let \(n\) be the number of items at the node, and consider:
\[-2 \left(\frac{n_{\text{female}}}{n} \ln(p_{\text{female}}) + \frac{n_{\text{male}}}{n} \ln(p_{\text{male}})\right) = -2 \left(p_{\text{female}} \ln(p_{\text{female}}) + p_{\text{male}} \ln(p_{\text{male}})\right).\] The LHS is the deviance divided by \(n\). The RHS is \(f(p_{\text{female}})\).
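Here is a quick sketch of f (an added illustration, not part of the original notes). The curve peaks at p = 0.5, where the node is an even mix of females and males, and drops to 0 as p approaches 0 or 1, i.e., as the node becomes pure:

# per-item deviance f(p) for a two-level response
f <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
curve(f, from = 0.001, to = 0.999,
      xlab = "p", ylab = "f(p)",
      main = "Per-item deviance of a node")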
The idea in splitting a node is to choose a split so that:
the sum of the deviance of the two child nodes is as small as possible.
mindev sets a limit on this. To split a node into two child nodes, the deviance of that node must be at least:
\[\text{mindev} \times \text{root deviance}.\]
By default, mindev = 0.01, but you can change this.
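For our trSex tree, whose root deviance is 97.280, the default threshold works out to:

0.01 * 97.280   # about 0.97; nodes with deviance below this will not be split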
Let's examine the construction of our trSex model, step-by-step.
1) root 71 97.280 female ( 0.56338 0.43662 )
1) root 71 97.280 female ( 0.56338 0.43662 )
  2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
  3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )
3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )
  6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
  7) fastest > 106.5 15 0.000 male ( 0.00000 1.00000 ) *
2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
4) height < 67.375 29 8.700 female ( 0.96552 0.03448 )
5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )
5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )
  10) fastest < 97.5 6 5.407 female ( 0.83333 0.16667 ) *
  11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
4) height < 67.375 29 8.700 female ( 0.96552 0.03448 )
8) fastest < 122.5 24 0.000 female ( 1.00000 0.00000 ) *
9) fastest > 122.5 5 5.004 female ( 0.80000 0.20000 ) *
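As a quick check on the "smallest total deviance" idea, compare node 2 with its two children (deviances taken from the output above):

34.750           # deviance of node 2 before the split
8.700 + 18.250   # 26.95, the combined deviance of nodes 4 and 5 after the split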
Trees can work with any number of explanatory (predictor) variables. Each predictor must be a factor or a numerical variable.
Could we predict sex better if we used more predictor variables?
Let's use all the other variables in m111survey!
trSex2 <- tree(sex ~ ., data = m111survey)
plot(trSex2); text(trSex2)
Note that we are sticking with the default values of control.
summary(trSex2)
## 
## Classification tree:
## tree(formula = sex ~ ., data = m111survey)
## Variables actually used in tree construction:
## [1] "ideal_ht" "GPA"      "fastest" 
## Number of terminal nodes:  4 
## Residual mean deviance:  0.1975 = 12.64 / 64 
## Misclassification error rate: 0.04412 = 3 / 68 
Lead-up info:
Classification tree:
tree(formula = sex ~ ., data = m111survey)
Variables actually used in tree construction:
[1] "ideal_ht" "GPA" "fastest"
Number of terminal nodes: 4
Residual mean deviance: 0.1975 = 12.64 / 64
R added the deviance at all four terminal nodes, then divided by:
\[\text{observations} - \text{terminal nodes} = 68 - 4 = 64.\]
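A quick check of the arithmetic:

12.64 / (68 - 4)   # about 0.1975, the residual mean deviance reported above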
The smaller the residual mean deviance, the more "pure" the terminal nodes are, on average.
At each terminal node, you predict sex based on which sex is the majority at the node.
The minority observations are "mis-classed."
You can find the three mis-classed observations below:
trSex2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 68 93.320 female ( 0.55882 0.44118 )  
##   2) ideal_ht < 71 39 15.780 female ( 0.94872 0.05128 )  
##     4) GPA < 2.785 6  7.638 female ( 0.66667 0.33333 ) *
##     5) GPA > 2.785 33  0.000 female ( 1.00000 0.00000 ) *
##   3) ideal_ht > 71 29  8.700 male ( 0.03448 0.96552 )  
##     6) fastest < 92.5 5  5.004 male ( 0.20000 0.80000 ) *
##     7) fastest > 92.5 24  0.000 male ( 0.00000 1.00000 ) *
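Another way to look for mis-classed rows (a sketch added here, not part of the original notes) is to compare the tree's predictions with the actual values. predict() with type = "class" reports the majority sex at each observation's terminal node; keep in mind that rows with missing predictor values are handled specially, so the counts may not rest on exactly 68 cases.

# compare predicted sex with actual sex; off-diagonal cells are mis-classed
pred <- predict(trSex2, newdata = m111survey, type = "class")
table(predicted = pred, actual = m111survey$sex)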
The 3/68 sounds great, but this is how the tree did on the same data that was used to build it.
If you used the tree to predict new observations of a similar sort, it probably would not do as well.
Wait a minute! The error rate was "3 out of 68". But m111survey has 71 observations:
nrow(m111survey)
## [1] 71
It turns out that three observations were missing values for predictor variables, so R could not use them when building the tree. This will happen frequently.
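A quick check (if those three rows are the only ones with missing values, this should come to 68):

sum(complete.cases(m111survey))   # number of rows with no missing values at all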
Let's allow the tree to grow "big" (i.e., have as many terminal nodes as possible).
trSex3 <- tree(sex ~ ., data = m111survey,
control = tree.control(
nobs = nrow(m111survey),
mincut = 1,
minsize = 2,
mindev = 0
))
plot(trSex3); text(trSex3)
summary(trSex3)
## 
## Classification tree:
## tree(formula = sex ~ ., data = m111survey, control = tree.control(nobs = nrow(m111survey), 
##     mincut = 1, minsize = 2, mindev = 0))
## Variables actually used in tree construction:
## [1] "ideal_ht" "GPA"      "height"   "sleep"    "fastest" 
## Number of terminal nodes:  6 
## Residual mean deviance:  0 = 0 / 62 
## Misclassification error rate: 0 = 0 / 68 
All terminal nodes are pure!
seat didn't make it into the "big" tree model. Is it relevant to sex at all?
trSex4 <- tree(sex ~ seat, data = m111survey)
plot(trSex4); text(trSex4)
In the plot, levels of a factor are coded by letters: a,b,c, …
trSex4
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 71 97.28 female ( 0.5634 0.4366 )  
##   2) seat: 1_front 27 32.82 female ( 0.7037 0.2963 ) *
##   3) seat: 2_middle,3_back 44 60.91 male ( 0.4773 0.5227 ) *
seat is not much related to sex. Look at the high mis-classification rate:
summary(trSex4)
Classification tree:
tree(formula = sex ~ seat, data = m111survey)
Number of terminal nodes: 2
Residual mean deviance: 1.358 = 93.72 / 69
Misclassification error rate: 0.4085 = 29 / 71
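A two-way table (an added check, not part of the original notes) tells the same story: women predominate somewhat in the front rows, but the middle and back are close to an even split.

# cross-tabulate sex by seating preference
xtabs(~ sex + seat, data = m111survey)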
data(iris)
View(iris)
help(iris)
Use the iris data to make a tree model that predicts Species from Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.
How many terminal nodes does it have?
What is the mis-classification rate?
We said that the mis-classification rate underestimates the error a tree model will have when applied to new observations.
We also have options about "how large" to grow our tree.
data("verlander", package = "tigerstats")
View(verlander)
?tigerstats::verlander
The Pitch-fx machine classifies pitches by type.
Research Question:
Can we predict how the machine will classify a pitch? What variables are important in determining pitch-type?
verlander %>% ggplot(aes(x = speed)) + geom_density()
verlander %>% ggplot(aes(x = pitch_type, y = speed)) + geom_boxplot()
He had one very slow pitch (probably an intentional ball). Let's get rid of it:
ver2 <- verlander %>% filter(speed > 60)
We'll work with the ver2 data frame from now on.
Try various plots (see Addins for cloud plots), etc.
gamedate probably doesn't matter. Anyway, it's neither a factor nor numerical, so the trees won't work with it. Let's remove it:
ver2 <- ver2 %>% select(-gamedate)
trMod <- tree(pitch_type ~ ., data = ver2)
plot(trMod); text(trMod)
trMod
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 15306 44070.0 FF ( 0.1666013 0.1773814 0.4413955 0.1320397 0.0825820 )  
##    2) speed < 91.75 6565 14350.0 CU ( 0.3859863 0.4135567 0.0004570 0.0085301 0.1914699 )  
##      4) pfx_x < -2.025 2649  1152.0 CH ( 0.9554549 0.0000000 0.0011325 0.0211401 0.0222726 ) *
##      5) pfx_x > -2.025 3916  4870.0 CU ( 0.0007661 0.6933095 0.0000000 0.0000000 0.3059244 )  
##       10) pfx_z < -2.365 2755   866.1 CU ( 0.0000000 0.9633394 0.0000000 0.0000000 0.0366606 ) *
##       11) pfx_z > -2.365 1161   519.6 SL ( 0.0025840 0.0525409 0.0000000 0.0000000 0.9448751 ) *
##    3) speed > 91.75 8741  9652.0 FF ( 0.0018305 0.0000000 0.7725661 0.2248027 0.0008008 )  
##      6) pfx_x < -8.305 2063  1075.0 FT ( 0.0000000 0.0000000 0.0727096 0.9272904 0.0000000 )  
##       12) pfx_x < -8.525 1726   112.6 FT ( 0.0000000 0.0000000 0.0052144 0.9947856 0.0000000 ) *
##       13) pfx_x > -8.525 337   458.2 FT ( 0.0000000 0.0000000 0.4183976 0.5816024 0.0000000 ) *
##      7) pfx_x > -8.305 6678   943.2 FF ( 0.0023959 0.0000000 0.9887691 0.0077868 0.0010482 ) *
summary(trMod)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = ver2)
## Variables actually used in tree construction:
## [1] "speed" "pfx_x" "pfx_z"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.2648 = 4052 / 15300 
## Misclassification error rate: 0.03319 = 508 / 15306 
Using the Cloud Addin, we can get something like this:
cloud(speed ~ pfx_x * pfx_z,
data = ver2,
screen = list(x = -90,
y = 0,
z = 0),
groups = pitch_type,
auto.key = list(
space = "top",
title = "Type of Pitch",
cex.title = 1,
columns = 3),
pch = 20,
zoom = 0.7,
main = "Verlander Pitches")
Our tree is kinda small. Let's make another tree with lots of terminal nodes:
trMod2 <- tree(pitch_type ~ ., data = ver2,
control = tree.control(nobs = nrow(ver2),
mincut = 1, minsize = 2, mindev = 0))
plot(trMod2); text(trMod2)
summary(trMod2)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = ver2, control = tree.control(nobs = nrow(ver2), 
##     mincut = 1, minsize = 2, mindev = 0))
## Number of terminal nodes:  261 
## Residual mean deviance:  0 = 0 / 15040 
## Misclassification error rate: 0 = 0 / 15306 
The tree kept growing, making more and more splits until no further helpful splits could be made.
Which is better: a tree with lots of terminal nodes, or one with fewer terminal nodes?
And can we estimate how well our tree will do on new data?
We'll subdivide our data into three groups: a training set, a quiz set, and a test set.
Let's say we'll assign 60% of the observations to the training set, 20% to the quiz set, and the remaining 20% to the test set.
This assignment should be done randomly!
The package tigerTree contains a function that will divide a data frame randomly into the desired groups.
library(tigerTree)
dfs <- divideTrainTest(seed = 3030,
prop.train = 0.6, prop.quiz = 0.2,
data = ver2)
verTrain <- dfs$train
verQuiz <- dfs$quiz
verTest <- dfs$test
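A quick check on the sizes of the three pieces (they should be roughly 60%, 20%, and 20% of the 15,306 rows of ver2; the exact counts below match the denominators in the summaries that follow):

nrow(verTrain)   # 9183
nrow(verQuiz)    # 3061
nrow(verTest)    # 3062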
First, a "conservative" tree:
tr.small <- tree(pitch_type ~ ., data = verTrain)
summary(tr.small)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = verTrain)
## Variables actually used in tree construction:
## [1] "speed" "pfx_x" "pfx_z"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.2153 = 1975 / 9176 
## Misclassification error rate: 0.02995 = 275 / 9183 
Next, an "overgrown" tree:
tr.big <- tree(pitch_type ~ ., data = verTrain,
control = tree.control(nobs = nrow(verTrain),
mincut = 1, minsize = 2, mindev = 0))
summary(tr.big)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = verTrain, control = tree.control(nobs = nrow(verTrain), 
##     mincut = 1, minsize = 2, mindev = 0))
## Number of terminal nodes:  152 
## Residual mean deviance:  0 = 0 / 9031 
## Misclassification error rate: 0 = 0 / 9183 
Wow, no errors!
Now let's check each tree against the quiz set. This is done with the tryTree function from the tigerTree package:
tryTree(mod = tr.small,
testSet = verQuiz,
truth = verQuiz$pitch_type)
## Residual mean deviance: 0.2665 = 813.7 / 3054 
## Misclassification error rate: 0.03659 = 112 / 3061 
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  481    0    2   10   13
##         CU    0  510    0    0   32
##         FF    2    0 1367   13    4
##         FT    0    0   31  406    0
##         SL    0    5    0    0  185
Error rate bigger than on the training set!
Again use tryTree but with the tr.big model:
tryTree(mod = tr.big,
testSet = verQuiz,
truth = verQuiz$pitch_type)
## Residual mean deviance: 0.1228 = 363.8 / 2963 
## Misclassification error rate: 0.02973 = 91 / 3061 
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  474    0    4    2    9
##         CU    0  507    0    0   13
##         FF    1    0 1379   26    3
##         FT    1    0   17  401    0
##         SL    7    8    0    0  209
Error rate also bigger than on the training set!
There are very scientific ways to choose the size of the tree, but we'll just work by hand, tuning the tree with this local Shiny app:
tuneTree(pitch_type ~ ., data = verTrain,
testSet = verQuiz,
truth = verQuiz$pitch_type)
I finally decided to go with:
tr.chosen <- tree(pitch_type ~ ., data = verTrain,
control = tree.control(
nobs = nrow(verTrain),
mincut = 7,
minsize = 14,
mindev = 0.0003))
Variables actually used in tree construction:
[1] "speed"   "pfx_x"   "pz"      "season"  "pfx_z"   "px"      "pitches"
Number of terminal nodes: 41
Residual mean deviance: 0.07468 = 682.7 / 9142
Misclassification error rate: 0.01568 = 144 / 9183
On the quiz set, my choice gave:
tryTree(tr.chosen, testSet = verQuiz, truth = verQuiz$pitch_type)
## Residual mean deviance: 0.1517 = 458.2 / 3020 
## Misclassification error rate: 0.02777 = 85 / 3061 
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  481    0    3    1   10
##         CU    0  506    0    0   13
##         FF    1    0 1373   20    3
##         FT    1    0   24  408    0
##         SL    0    9    0    0  208
tryTree(tr.chosen, testSet = verTest, truth = verTest$pitch_type)
## Residual mean deviance: 0.1274 = 385 / 3021 
## Misclassification error rate: 0.02253 = 69 / 3062 
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  498    0    3    0   10
##         CU    0  544    0    0   11
##         FF    2    0 1329   20    2
##         FT    1    0   14  363    0
##         SL    4    2    0    0  259
Our tree does pretty well at imitating Pitch-fx, but it's not exactly how Pitch-fx makes its classifications:
verlander