The Big Picture

Our primary focus in this Chapter is on the Model phase.

Packages We Need

library(tigerstats)   # provides the m111survey and verlander data
library(tree)         # fits classification and regression trees
library(tigerTree)    # helpers: divideTrainTest(), tryTree(), tuneTree()

Introduction to Classification Trees

Tree Models

Trees are a convenient way to:

  • describe how a set of explanatory variables is related to a response variable
  • predict values of the response variable for a new observation

Two types of trees:

  • classification trees (response is a factor)
  • regression trees (response is numerical)

We'll work with classification trees first.

A Classification Tree

For the m111survey data, let's make a tree model to predict sex from:

  • height
  • fastest

trSex <- tree(sex ~ fastest + height, data = m111survey)
plot(trSex); text(trSex)

The Tree Model (graph)

Nodes

A tree has nodes:

  • the root node is at the top
  • each node can be split into two child nodes, based on the value of a variable
  • nodes that don't get split are called terminal nodes

The majority sex in each terminal node is listed under the node.

The Tree Model (more detail)

trSex
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 71 97.280 female ( 0.56338 0.43662 )  
##    2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )  
##      4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
##        8) fastest < 122.5 24  0.000 female ( 1.00000 0.00000 ) *
##        9) fastest > 122.5 5  5.004 female ( 0.80000 0.20000 ) *
##      5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )  
##       10) fastest < 97.5 6  5.407 female ( 0.83333 0.16667 ) *
##       11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *
##    3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )  
##      6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
##      7) fastest > 106.5 15  0.000 male ( 0.00000 1.00000 ) *

How Does This Work?

How does the tree decide to make its splits?

How does it decide when to stop splitting?

By Tree-Control!

Under the hood, tree uses the control argument and the tree.control function:

trSex <- tree(sex ~ fastest + height, data = m111survey,
              control = tree.control(
                nobs = nrow(m111survey),
                mincut = 5,
                minsize = 10,
                mindev = 0.01
              ))

nobs

This is the number of observations you are using to build your tree.

We are using all the rows in m111survey.

So nobs = nrow(m111survey).

minsize

This is the smallest-sized node that you will consider splitting up into two child nodes.

By default, minsize = 10, but we could change that.

mincut

If you plan to split a node, each of the two child nodes must be at least mincut in size.

By default, mincut = 5, but you can change this.

Note: mincut must not be more than half of minsize.

mindev

mindev pays attention to the deviance.

But what is deviance?

Deviance

Each node has a deviance. When you are predicting sex, the deviance formula is:

\[-2 \left(n_{\text{female}} \ln(p_{\text{female}}) + n_{\text{male}} \ln(p_{\text{male}})\right),\]

where:

  • \(n_{\text{female}} =\) number of females in the node;
  • \(p_{\text{female}} =\) proportion of females in the node;
  • \(n_{\text{male}} =\) number of males in the node;
  • \(p_{\text{male}} =\) proportion of males in the node;

\(\ln\) is the natural logarithm function.

Deviance

For the root node of the tree the deviance is:

\[ -2\left(40\ln(40/71) + 31\ln(31/71)\right) \approx 97.280.\]

Deviance is a measure of how "pure" a node is. The closer the node is to being all male or all female, the closer its deviance is to 0.
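We can verify this with a quick computation in R (the counts 40 and 31 come from the root line of the tree printout):

# deviance at the root node: 40 females and 31 males out of 71 students
n_f <- 40; n_m <- 31; n <- n_f + n_m
-2 * (n_f * log(n_f / n) + n_m * log(n_m / n))
## [1] 97.28298

(The tree printout displays this value rounded, as 97.280.)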

Note on Deviance

Consider the following function:

\[f(p) = -2\left(p\ln(p) + (1-p)\ln(1-p)\right),\] where \(0 < p < 1\).

Graph of f
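The original slide shows a graph of \(f\); a minimal sketch to reproduce it:

# per-observation deviance as a function of the proportion p at a node
f <- function(p) -2 * (p * log(p) + (1 - p) * log(1 - p))
curve(f, from = 0.001, to = 0.999, xlab = "p", ylab = "f(p)")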

Notice:

  • \(f\) is smallest for \(p\) near 0 and near 1
  • \(f\) is biggest at \(p = 1/2\)

Let \(n\) be the number of items at the node, and consider:

\[-2 \left(\frac{n_{\text{female}}}{n} \ln(p_{\text{female}}) + \frac{n_{\text{male}}}{n} \ln(p_{\text{male}})\right) = -2 \left(p_{\text{female}} \ln(p_{\text{female}}) + p_{\text{male}} \ln(p_{\text{male}})\right).\] The LHS is the deviance divided by \(n\). The RHS is \(f(p_{\text{female}})\).

Accordingly …

  • Deviance is biggest when the node is "most impure" (\(p_{\text{female}}\) and \(p_{\text{male}}\) both near \(1/2\).)
  • Deviance is smallest when the node is maximally pure (\(p_{\text{female}} = 0\) and \(p_{\text{male}} = 1\), or vice versa).

Deviance in Splitting

The idea in splitting a node is to choose the split so that:

the sum of the deviances of the two child nodes is as small as possible.

In our trSex tree, for example, the root (deviance 97.280) split into children with deviances 34.750 and 19.070, for a total of only 53.82.

mindev in Splitting

mindev sets a limit on this. To split a node into two child nodes, the deviance of that node must be at least:

\[\text{mindev} \times \text{root deviance}.\]

By default, mindev = 0.01, but you can change this.
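For our trSex tree, with the default mindev = 0.01, the threshold works out to:

0.01 * 97.280   # mindev times the root deviance
## [1] 0.9728

So a node whose deviance is below about 0.97 (for instance, the pure node 8, with deviance 0.000) will never be split further.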

A Walk-Through

Let's examine the construction of our trSex model, step-by-step.

Beginning (no splits yet)

1) root 71 97.280 female ( 0.56338 0.43662 ) 

First split: the root splits on height.

1) root 71 97.280 female ( 0.56338 0.43662 )  
  2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )
  3) height > 69.5 28 19.070 male ( 0.10714 0.89286 ) 

Node 3 then splits on fastest, producing two terminal nodes.

3) height > 69.5 28 19.070 male ( 0.10714 0.89286 )  
  6) fastest < 106.5 13 14.050 male ( 0.23077 0.76923 ) *
  7) fastest > 106.5 15  0.000 male ( 0.00000 1.00000 ) *

Node 2 splits again on height.

2) height < 69.5 43 34.750 female ( 0.86047 0.13953 )  
     4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
     5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )

Node 5 splits on fastest.

5) height > 67.375 14 18.250 female ( 0.64286 0.35714 )  
  10) fastest < 97.5 6  5.407 female ( 0.83333 0.16667 ) *
  11) fastest > 97.5 8 11.090 male ( 0.50000 0.50000 ) *

Finally, node 4 splits on fastest.

4) height < 67.375 29  8.700 female ( 0.96552 0.03448 )  
    8) fastest < 122.5 24  0.000 female ( 1.00000 0.00000 ) *
    9) fastest > 122.5 5  5.004 female ( 0.80000 0.20000 ) *

(Note the numbering: the children of node k are always nodes 2k and 2k + 1.)

More Predictor Variables

Trees can work with any number of explanatory (predictor) variables. Each predictor must be a factor or a numerical variable.

Could we predict sex better if we used more predictor variables?

Let's use all the other variables in m111survey!

Use All Other Variables

trSex2 <- tree(sex ~ ., data = m111survey)
plot(trSex2); text(trSex2)

Note that we are sticking with the default values of control.

The Result

A Useful Summary

summary(trSex2)
## 
## Classification tree:
## tree(formula = sex ~ ., data = m111survey)
## Variables actually used in tree construction:
## [1] "ideal_ht" "GPA"      "fastest" 
## Number of terminal nodes:  4 
## Residual mean deviance:  0.1975 = 12.64 / 64 
## Misclassification error rate: 0.04412 = 3 / 68

The Output

Lead-up info:

Classification tree:
tree(formula = sex ~ ., data = m111survey)
Variables actually used in tree construction:
[1] "ideal_ht" "GPA"      "fastest" 
Number of terminal nodes:  4 

Output: Deviance

Residual mean deviance:  0.1975 = 12.64 / 64 

R added the deviances at all four terminal nodes, then divided by:

\[\text{observations} - \text{terminal nodes} = 68 - 4 = 64.\]

The smaller the residual mean deviance, the more "pure" the terminal nodes are, on average.
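You can check the arithmetic with the summary's own numbers:

# total terminal-node deviance divided by (observations - terminal nodes)
12.64 / 64
## [1] 0.1975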

Output: Error Rate

At each terminal node, you predict sex based on which sex is the majority at the node.

The minority observations are "mis-classed."

Mis-Classed Observations

You can find the three mis-classed observations below:

trSex2
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 68 93.320 female ( 0.55882 0.44118 )  
##   2) ideal_ht < 71 39 15.780 female ( 0.94872 0.05128 )  
##     4) GPA < 2.785 6  7.638 female ( 0.66667 0.33333 ) *
##     5) GPA > 2.785 33  0.000 female ( 1.00000 0.00000 ) *
##   3) ideal_ht > 71 29  8.700 male ( 0.03448 0.96552 )  
##     6) fastest < 92.5 5  5.004 male ( 0.20000 0.80000 ) *
##     7) fastest > 92.5 24  0.000 male ( 0.00000 1.00000 ) *
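You can also tabulate them with predict() from the tree package (a sketch: na.omit() should leave the 68 complete cases that tree() actually used):

complete <- na.omit(m111survey)   # drop the rows with missing values
preds <- predict(trSex2, newdata = complete, type = "class")
table(predicted = preds, actual = complete$sex)   # off-diagonal cells are mis-classed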

Caution About the Error Rate

The 3/68 sounds great, but this is how the tree did on the same data that was used to build it.

If you used the tree to predict new observations of a similar sort, it probably would not do as well.

Missing Observations

Wait a minute! The error rate was "3 out of 68". But m111survey has 71 observations:

nrow(m111survey)
## [1] 71

It turns out that three observations were missing values for predictor variables, so R could not use them when building (or checking) the tree. This happens frequently with real data.
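You can count the incomplete rows directly (assuming those three rows are the only ones with missing values anywhere in the data frame):

sum(!complete.cases(m111survey))   # rows with at least one missing value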

Control Your Tree

Let's allow the tree to grow "big" (i.e., have as many terminal nodes as possible).

trSex3 <- tree(sex ~ ., data = m111survey,
               control = tree.control(
                 nobs = nrow(m111survey),
                 mincut = 1,
                 minsize = 2,
                 mindev = 0
               ))
plot(trSex3); text(trSex3)

The Model

The Summary

summary(trSex3)
## 
## Classification tree:
## tree(formula = sex ~ ., data = m111survey, control = tree.control(nobs = nrow(m111survey), 
##     mincut = 1, minsize = 2, mindev = 0))
## Variables actually used in tree construction:
## [1] "ideal_ht" "GPA"      "height"   "sleep"    "fastest" 
## Number of terminal nodes:  6 
## Residual mean deviance:  0 = 0 / 62 
## Misclassification error rate: 0 = 0 / 68

All terminal nodes are pure!

Dealing with Factor Predictors

seat didn't make it into the "big" tree model. Is it relevant to sex at all?

trSex4 <- tree(sex ~ seat, data = m111survey)
plot(trSex4); text(trSex4)

Factors in a Tree Plot:

In the plot, levels of a factor are coded by letters: a,b,c, …
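To see which letter stands for which level, check the order of the factor's levels (a is the first level, b the second, and so on):

levels(m111survey$seat)
## [1] "1_front"  "2_middle" "3_back"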

trSex4
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 71 97.28 female ( 0.5634 0.4366 )  
##   2) seat: 1_front 27 32.82 female ( 0.7037 0.2963 ) *
##   3) seat: 2_middle,3_back 44 60.91 male ( 0.4773 0.5227 ) *

seat and sex

seat is not much related to sex. Look at the high mis-classification rate:

summary(trSex4)
## 
## Classification tree:
## tree(formula = sex ~ seat, data = m111survey)
## Number of terminal nodes:  2 
## Residual mean deviance:  1.358 = 93.72 / 69 
## Misclassification error rate: 0.4085 = 29 / 71 

Practice

data(iris)
View(iris)
help(iris)

Use the iris data to make a tree model that predicts Species from Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

How many terminal nodes does it have?

What is the mis-classification rate?
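A starting sketch (the questions can be answered from the plot and the summary output):

trIris <- tree(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris)
plot(trIris); text(trIris)
summary(trIris)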

Prediction with Classification Trees

Thinking about Error Rates and Trees Sizes

We said that the mis-classification rate underestimates the error a tree model will have when applied to new observations.

We also have options about "how large" to grow our tree.

  • What's the "best" tree size?
  • How can we properly estimate the error rate for new observations?

Justin Verlander

data("verlander", package = "tigerstats")
View(verlander)
?tigerstats::verlander

The Pitch-fx machine classifies pitches by type.

Research Question:

Can we predict how the machine will classify a pitch? What variables are important in determining pitch-type?

First, Look Around a Bit

# ggplot2 and dplyr may already be loaded by the packages above; if not:
library(ggplot2)
library(dplyr)

verlander %>%
  ggplot(aes(x = speed)) +
  geom_density()

Look with Box Plots

verlander %>%
  ggplot(aes(x = pitch_type, y = speed)) +
  geom_boxplot()

Remove an Unusual Observation

He had one very slow pitch (probably an intentional ball). Let's get rid of it:

ver2 <- verlander %>% filter(speed > 60)

We'll work with the ver2 data frame from now on.

More Looking

Try various other plots (see the RStudio Addins menu for cloud plots), and so on.

Unimportant Variables

gamedate probably doesn't matter. In any case, it is neither a factor nor a numerical variable, so tree() can't work with it. Let's remove it:

ver2 <-
  ver2 %>% 
  select(-gamedate)

A Classification Tree

trMod <- tree(pitch_type ~ ., data = ver2)
plot(trMod); text(trMod)

A Closer Look

trMod
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 15306 44070.0 FF ( 0.1666013 0.1773814 0.4413955 0.1320397 0.0825820 )  
##    2) speed < 91.75 6565 14350.0 CU ( 0.3859863 0.4135567 0.0004570 0.0085301 0.1914699 )  
##      4) pfx_x < -2.025 2649  1152.0 CH ( 0.9554549 0.0000000 0.0011325 0.0211401 0.0222726 ) *
##      5) pfx_x > -2.025 3916  4870.0 CU ( 0.0007661 0.6933095 0.0000000 0.0000000 0.3059244 )  
##       10) pfx_z < -2.365 2755   866.1 CU ( 0.0000000 0.9633394 0.0000000 0.0000000 0.0366606 ) *
##       11) pfx_z > -2.365 1161   519.6 SL ( 0.0025840 0.0525409 0.0000000 0.0000000 0.9448751 ) *
##    3) speed > 91.75 8741  9652.0 FF ( 0.0018305 0.0000000 0.7725661 0.2248027 0.0008008 )  
##      6) pfx_x < -8.305 2063  1075.0 FT ( 0.0000000 0.0000000 0.0727096 0.9272904 0.0000000 )  
##       12) pfx_x < -8.525 1726   112.6 FT ( 0.0000000 0.0000000 0.0052144 0.9947856 0.0000000 ) *
##       13) pfx_x > -8.525 337   458.2 FT ( 0.0000000 0.0000000 0.4183976 0.5816024 0.0000000 ) *
##      7) pfx_x > -8.305 6678   943.2 FF ( 0.0023959 0.0000000 0.9887691 0.0077868 0.0010482 ) *

A Summary of the Tree

summary(trMod)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = ver2)
## Variables actually used in tree construction:
## [1] "speed" "pfx_x" "pfx_z"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.2648 = 4052 / 15300 
## Misclassification error rate: 0.03319 = 508 / 15306

Graphical Look

Using the Cloud Addin, we can get something like this:

library(lattice)  # cloud() comes from the lattice package

cloud(speed ~ pfx_x * pfx_z,
    data = ver2,
    screen = list(x = -90,
            y = 0,
            z = 0),
    groups = pitch_type,
    auto.key = list(
        space = "top",
        title = "Type of Pitch",
        cex.title = 1,
        columns = 3),
    pch = 20,
    zoom = 0.7,
    main = "Verlander Pitches")

A Cloud Plot

An Overgrown Tree

Our tree is kinda small. Let's make another tree with lots of terminal nodes:

tr.Mod2 <- tree(pitch_type ~ ., data = ver2,
               control = tree.control(nobs = nrow(ver2), 
                                      mincut = 1, minsize = 2, mindev = 0))
plot(tr.Mod2); text(tr.Mod2)

Yuck!

What's Going On?

summary(tr.Mod2)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = ver2, control = tree.control(nobs = nrow(ver2), 
##     mincut = 1, minsize = 2, mindev = 0))
## Number of terminal nodes:  261 
## Residual mean deviance:  0 = 0 / 15040 
## Misclassification error rate: 0 = 0 / 15306

The tree kept growing, making more and more splits until no further helpful splits could be made.

Which is Better …

… a tree with lots of terminal nodes, or fewer terminal nodes?

And can we estimate how well our tree will do on new data?

Training, Quiz, Test

We'll subdivide our data into three groups:

  • training set
  • quiz set
  • test set

Then …

  • We'll build as many trees as we like, using only the training set.
  • We'll try each tree model on the quiz set, and note the error rate.
  • We'll pick the model with the lowest error rate (or one that we like for other good reasons).
  • We will try that model, and ONLY that model, on the test set.
  • The error rate on the test set is our best estimate of the error rate for new data.

Subdivision

Let's say we'll assign:

  • 60% of the data to the training set;
  • 20% to the quiz set;
  • 20% to the test set.

This assignment should be done randomly!

Subdivision

The package tigerTree contains a function that will divide a data frame randomly into the desired groups.

library(tigerTree)
dfs <- divideTrainTest(seed = 3030, 
                       prop.train = 0.6, prop.quiz = 0.2,
                       data = ver2)
verTrain <- dfs$train
verQuiz <- dfs$quiz
verTest <- dfs$test
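If you are curious, here is a base-R sketch of the same idea (this is not necessarily divideTrainTest's exact algorithm, so the group sizes may differ slightly):

set.seed(3030)
n <- nrow(ver2)
sizes <- c(train = round(0.6 * n), quiz = round(0.2 * n))
sizes <- c(sizes, test = n - sum(sizes))         # the remainder goes to the test set
grp <- sample(rep(names(sizes), times = sizes))  # randomly shuffled group labels
verTrain2 <- ver2[grp == "train", ]
verQuiz2  <- ver2[grp == "quiz", ]
verTest2  <- ver2[grp == "test", ]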

Build Trees with Training Set

First, a "conservative" tree:

tr.small <- tree(pitch_type ~ ., data = verTrain)
summary(tr.small)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = verTrain)
## Variables actually used in tree construction:
## [1] "speed" "pfx_x" "pfx_z"
## Number of terminal nodes:  7 
## Residual mean deviance:  0.2153 = 1975 / 9176 
## Misclassification error rate: 0.02995 = 275 / 9183

Build Trees with Training Set

Next, an "overgrown" tree:

tr.big <- tree(pitch_type ~ ., data = verTrain,
               control = tree.control(nobs = nrow(verTrain), 
                                      mincut = 1, minsize = 2, mindev = 0))
summary(tr.big)
## 
## Classification tree:
## tree(formula = pitch_type ~ ., data = verTrain, control = tree.control(nobs = nrow(verTrain), 
##     mincut = 1, minsize = 2, mindev = 0))
## Number of terminal nodes:  152 
## Residual mean deviance:  0 = 0 / 9031 
## Misclassification error rate: 0 = 0 / 9183

Wow, no errors!

Try Small Tree on the Quiz Set:

This is done with the tryTree function from the tigerTree package:

tryTree(mod = tr.small, 
        testSet = verQuiz, 
        truth = verQuiz$pitch_type)
## Residual mean deviance:  0.2665 = 813.7 / 3054
## Misclassification error rate:  0.03659 = 112 / 3061
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  481    0    2   10   13
##         CU    0  510    0    0   32
##         FF    2    0 1367   13    4
##         FT    0    0   31  406    0
##         SL    0    5    0    0  185

Error rate bigger than on the training set!
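The mis-classification count is just the sum of the off-diagonal cells of the confusion matrix:

sum(2, 10, 13, 32, 2, 13, 4, 31, 5)   # off-diagonal entries above
## [1] 112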

Try Big Tree on the Quiz Set

Again use tryTree but with the tr.big model:

tryTree(mod = tr.big, 
        testSet = verQuiz, 
        truth = verQuiz$pitch_type)
## Residual mean deviance:  0.1228 = 363.8 / 2963
## Misclassification error rate:  0.02973 = 91 / 3061
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  474    0    4    2    9
##         CU    0  507    0    0   13
##         FF    1    0 1379   26    3
##         FT    1    0   17  401    0
##         SL    7    8    0    0  209

Error rate also bigger than on the training set!

What's the best Tree Size?

There are more systematic ways to decide, but we'll just work by hand, tuning the tree with this local Shiny app from the tigerTree package:

tuneTree(pitch_type ~ ., data = verTrain,
         testSet = verQuiz,
         truth = verQuiz$pitch_type)

My Choice

I finally decided to go with:

tr.chosen <- tree(pitch_type ~ ., data = verTrain,
                  control = tree.control(
                    nobs = nrow(verTrain),
                    mincut = 7,
                    minsize = 14,
                    mindev = 0.0003))
summary(tr.chosen)
## Variables actually used in tree construction:
## [1] "speed"   "pfx_x"   "pz"      "season"  "pfx_z"   "px"      "pitches"
## Number of terminal nodes:  41 
## Residual mean deviance:  0.07468 = 682.7 / 9142 
## Misclassification error rate: 0.01568 = 144 / 9183 

On the Quiz Set …

… my choice gave:

tryTree(tr.chosen, testSet = verQuiz, truth = verQuiz$pitch_type)
## Residual mean deviance:  0.1517 = 458.2 / 3020
## Misclassification error rate:  0.02777 = 85 / 3061
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  481    0    3    1   10
##         CU    0  506    0    0   13
##         FF    1    0 1373   20    3
##         FT    1    0   24  408    0
##         SL    0    9    0    0  208

On the Test Set, I Get …

tryTree(tr.chosen, testSet = verTest, truth = verTest$pitch_type)
## Residual mean deviance:  0.1274 = 385 / 3021
## Misclassification error rate:  0.02253 = 69 / 3062
## Confusion matrix:
##           truth
## prediction   CH   CU   FF   FT   SL
##         CH  498    0    3    0   10
##         CU    0  544    0    0   11
##         FF    2    0 1329   20    2
##         FT    1    0   14  363    0
##         SL    4    2    0    0  259

Note

Our tree does pretty well at imitating Pitch-fx, but it's not exactly how Pitch-fx makes its classifications:

  • Pitch-fx may use other predictors besides the ones in verlander
  • It probably uses more sophisticated prediction methods