This is intended to be one of a series of bite-size documents to help you understand what machine learning is and how to do it in R, without knowing a lot of math or computation. The goal is to get you doing machine learning right away, with enough knowledge to get you interested in doing more.
Machine learning has many applications and meanings, but for our purposes, we’ll define it this way: “Machine learning consists of taking some data with a known result, and coming up with good rules about how the result is related to the rest of the data, so that we can apply those rules and make predictions about data for which we don’t know the result.”
This is super simple, obviously, and leaves out lots of important concepts, like signal vs. noise, overfitting, etc. But it’s enough to get started.
It’s also important to note that we’re starting here with a classification question with two possible outcomes (we could say a binomial or binary outcome): hazardous or non-hazardous conditions. There are other kinds of questions, like multi-class classification (with more than 2 bins) and continuous outcomes (like predicting salary or IQ). Also, we’re starting with a fairly tough classification problem: the classes are unbalanced, with hazardous conditions much rarer than non-hazardous ones.
If you’ve ever dealt with a crying baby, you know all about trees. Imagine that there’s a crying baby. How do you calm her? You go through some quick mental calculations to try to predict the cause of the crying. Let’s make some pseudocode:
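One plausible version (the exact checks are up to you, and every function here is made up) might look like this:
# Hypothetical R-flavored pseudocode: none of these helper functions exist;
# they just stand in for the checks a tired parent runs through.
calm_baby <- function(baby) {
  if (is_hungry(baby)) {
    feed(baby)
  } else if (has_dirty_diaper(baby)) {
    change_diaper(baby)
  } else if (is_tired(baby)) {
    rock_to_sleep(baby)
  } else {
    cuddle(baby)  # nothing obvious? comfort her and re-check
  }
}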
Here’s a diagram called a flowchart that captures the same decision tree:
You get the idea. Trees don’t construct a mathematical model based on linear, algebraic relationships, the way regression does. Instead, tree methods (also called CART algorithms, for Classification And Regression Trees) build a decision tree that sorts cases into categories. Let’s do a quick example.
There are lots of places to get data – from your own research, or online. For this example, I’m using some data from http://archive.ics.uci.edu/ml/datasets/seismic-bumps. It’s intended to help predict when dangerous seismic events might happen, in order to keep miners safe.
First I’ll get the data. It’s in arff format, so I’ll use the “foreign” package to read it in.
library(foreign)
seismic_data<-read.arff("http://archive.ics.uci.edu/ml/machine-learning-databases/00266/seismic-bumps.arff")
Next, let me take a look at the attribute information, in order to understand what the result or outcome is, and what my predictors are. The list below comes from the original arff file and can also be found on the website mentioned above.
Attribute information:
seismic: result of shift seismic hazard assessment in the mine working obtained by the seismic method (a - lack of hazard, b - low hazard, c - high hazard, d - danger state);
seismoacoustic: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method;
shift: information about type of a shift (W - coal-getting, N - preparation shift);
genergy: seismic energy recorded within previous shift by the most active geophone (GMax) out of geophones monitoring the longwall;
gpuls: a number of pulses recorded within previous shift by GMax;
gdenergy: a deviation of energy recorded within previous shift by GMax from average energy recorded during eight previous shifts;
gdpuls: a deviation of a number of pulses recorded within previous shift by GMax from average number of pulses recorded during eight previous shifts;
ghazard: result of shift seismic hazard assessment in the mine working obtained by the seismoacoustic method based on registration coming from GMax only;
nbumps: the number of seismic bumps recorded within previous shift;
nbumps2: the number of seismic bumps (in energy range [10^2,10^3)) registered within previous shift;
nbumps3: the number of seismic bumps (in energy range [10^3,10^4)) registered within previous shift;
nbumps4: the number of seismic bumps (in energy range [10^4,10^5)) registered within previous shift;
nbumps5: the number of seismic bumps (in energy range [10^5,10^6)) registered within previous shift;
nbumps6: the number of seismic bumps (in energy range [10^6,10^7)) registered within previous shift;
nbumps7: the number of seismic bumps (in energy range [10^7,10^8)) registered within previous shift;
nbumps89: the number of seismic bumps (in energy range [10^8,10^10)) registered within previous shift;
energy: total energy of seismic bumps registered within previous shift;
maxenergy: the maximum energy of the seismic bumps registered within previous shift;
class: the decision attribute - ‘1’ means that high energy seismic bump occurred in the next shift (‘hazardous state’), ‘0’ means that no high energy seismic bumps occurred in the next shift (‘non-hazardous state’).
We want to look at attributes 1-18, to see if we can come up with a rule set (a “model”) that will allow us to predict whether attribute 19 turns out to be hazardous or non-hazardous.
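Before splitting anything, it’s worth a quick check of how much data we have and how unbalanced the outcome is:
# How many rows and columns are we working with?
dim(seismic_data)
# How unbalanced is the outcome? There are roughly 14 non-hazardous
# shifts for every hazardous one, so plain accuracy can be misleading.
table(seismic_data$class)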
Let’s first split up our data. We want to “train” (or create the model) on some data, and test it on a different set of data.
First off, let’s load some libraries we’ll want.
library(lattice)
library(ggplot2)
library(caret)
Then, let’s create a data partition. We have over 2500 data points, so we can split them into training and testing sets. We’ll do a 60/40 split:
# We set a seed for the sake of reproducibility
set.seed(42)
# First, we'll pick off the training data:
inTrain<-createDataPartition(y=seismic_data$class, p = 0.60, list=FALSE)
training<-seismic_data[inTrain,]
# Then what's left is testing data.
testing<-seismic_data[-inTrain,]
Our plan is to first train a tree on our training data and then measure its accuracy on the same data used to generate the model (which carries a risk of overfitting, or making the model too contingent on this data’s random noise). We will do this a few times, using different tree-generation algorithms. Once we have a good enough model, we’ll apply it to the testing data just once (this is super important), see what kind of error we get, and find out whether we have a model that’s going to help miners stay safe!
First, we’ll start with the easiest possible model: predicting class from every other variable, using all the default settings and an rpart (Recursive PARTitioning) algorithm. We’ll use the caret package, because it makes it so easy to try different tree algorithms (see http://topepo.github.io/caret/Tree_Based_Model.html). Additionally, caret’s default settings include resampling (fitting the model repeatedly) and some parameter tuning. Best of all, caret takes existing R machine learning packages and standardizes a single approach to using them, so you only have to learn one syntax. It’s handy!
To see how good our model is, we’ll use confusionMatrix. A confusion matrix takes reality, compares it to your model’s predictions, and sees how correct they are. We tell confusionMatrix which category counts as “positive” for things like positive predictive value, sensitivity, and specificity. In our case, positive equals 1, a hazardous condition.
set.seed(42)
tree_model_1<-train(class ~ ., data=training, method="rpart")
confusionMatrix(predict(tree_model_1, training), reference=training$class, positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 102
## 1 0 0
##
## Accuracy : 0.9342
## 95% CI : (0.9207, 0.9461)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.5263
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.93424
## Prevalence : 0.06576
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
93% accuracy? That’s awesome! BUT… looking more closely, we’re lacking sensitivity and positive predictive value, and the confusion matrix shows that the model just predicts “non-hazardous” for every case. Something’s wonky here. We want to do better than the “eh, it’s probably not hazardous” call a human could make by simply ruling everything non-hazardous. Let’s try some other tree models.
rpart2 is based on the same recursive partitioning as rpart, but caret tunes it differently: over the maximum tree depth rather than the complexity parameter.
set.seed(42)
tree_model_2<-train(class ~ ., data=training, method="rpart2")
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
confusionMatrix(predict(tree_model_2, training), reference=training$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1433 65
## 1 16 37
##
## Accuracy : 0.9478
## 95% CI : (0.9355, 0.9583)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.01553
##
## Kappa : 0.4528
## Mcnemar's Test P-Value : 9.643e-08
##
## Sensitivity : 0.36275
## Specificity : 0.98896
## Pos Pred Value : 0.69811
## Neg Pred Value : 0.95661
## Prevalence : 0.06576
## Detection Rate : 0.02386
## Detection Prevalence : 0.03417
## Balanced Accuracy : 0.67585
##
## 'Positive' Class : 1
##
OK, this looks better, because we have more values filled in… But what do they all mean? Which is the most important?
We could favor sensitivity – catching every hazardous event, even if that means we tag some non-hazardous situations as hazardous. But what if people get so used to false alarms that they disregard alerts that a situation is hazardous?
Or we could favor, say, positive predictive value – the probability we got it right when we said a situation was dangerous. Or negative predictive value – the probability we got it right when we said things were safe.
Or just overall accuracy – the overall percentage of correct predictions.
Ideally, we’d want perfect scores on everything, but in the real world, this is impossible. We have to measure model “success” by one (or more) of these measures.
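To make those measures concrete, here’s how each of them falls out of the four cells of tree model 2’s confusion matrix above:
# The four cells of tree_model_2's training confusion matrix
TP <- 37    # predicted hazardous, actually hazardous
FP <- 16    # predicted hazardous, actually non-hazardous
FN <- 65    # predicted non-hazardous, actually hazardous
TN <- 1433  # predicted non-hazardous, actually non-hazardous
TP / (TP + FN)                    # sensitivity: about 0.36
TP / (TP + FP)                    # positive predictive value: about 0.70
TN / (TN + FN)                    # negative predictive value: about 0.96
(TP + TN) / (TP + FP + FN + TN)   # accuracy: about 0.95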
So much depends on what our priorities are. Let’s say we want high sensitivity and high positive predictive value. We could do some sophisticated stuff to our train() command to make the model selection more tuned to those metrics, but to keep things simple, let’s just manually keep an eye out for the models that do best on sensitivity!
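If you’re curious what that “sophisticated stuff” might look like, here’s a rough sketch (we won’t run or rely on it here). It assumes we recode the class levels to valid R names, which caret requires before it will compute class probabilities, and it uses 5-fold cross-validation purely as an example:
# A sketch of telling caret to pick models by sensitivity instead of accuracy.
# twoClassSummary() treats the FIRST factor level as the "event" of interest,
# so we put the hazardous class first and give the levels valid R names.
train_sens <- training
train_sens$class <- factor(train_sens$class, levels = c("1", "0"),
                           labels = c("hazard", "nohazard"))
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(42)
sens_model <- train(class ~ ., data = train_sens, method = "rpart2",
                    metric = "Sens", trControl = ctrl)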
So far, we have 0% sensitivity from our first tree model, but 36% sensitivity from tree model 2, with an awesome 70% PPV. Can we top that? Let’s try a more advanced algorithm: a “boosted” tree.
A boosted tree builds many small trees in sequence, with each new tree concentrating on the cases the earlier trees got wrong, and then combines them into a single prediction.
set.seed(42)
tree_model_3<-train(class ~ ., data=training, method="bstTree")
confusionMatrix(predict(tree_model_3, training), reference = training$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1449 102
## 1 0 0
##
## Accuracy : 0.9342
## 95% CI : (0.9207, 0.9461)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.5263
##
## Kappa : 0
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.00000
## Specificity : 1.00000
## Pos Pred Value : NaN
## Neg Pred Value : 0.93424
## Prevalence : 0.06576
## Detection Rate : 0.00000
## Detection Prevalence : 0.00000
## Balanced Accuracy : 0.50000
##
## 'Positive' Class : 1
##
Oh dear, we’re back to predicting that everything is non-hazardous. Normally, at this point, I’d wonder whether my features (variables) made sense to use in a tree model after all, and whether I could do a bit of feature engineering. But that’s not what we’re doing here: we don’t want perfectly groomed machine learning, we want naive, quick-start machine learning! If you do want to stop here and learn about features, you can do so in the Feature Selection tutorial.
The conditional inference tree makes its splits not based just on any increase in information (that is, any reduction in how mixed the classes are, as measured by entropy), but on whether that increase is statistically significant.
set.seed(42)
tree_model_4<-train(class ~ ., data=training, method="ctree")
confusionMatrix(predict(tree_model_4, training), reference = training$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1443 93
## 1 6 9
##
## Accuracy : 0.9362
## 95% CI : (0.9228, 0.9478)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.4043
##
## Kappa : 0.1393
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.088235
## Specificity : 0.995859
## Pos Pred Value : 0.600000
## Neg Pred Value : 0.939453
## Prevalence : 0.065764
## Detection Rate : 0.005803
## Detection Prevalence : 0.009671
## Balanced Accuracy : 0.542047
##
## 'Positive' Class : 1
##
This one has lower sensitivity and PPV than model 2. Let’s just call model 2 the winner, and take a closer look at it. The nice thing about it is that it’s an rpart type of tree algorithm, so it’s quite simple and can be shown visually in a tree diagram (called a dendrogram). Not all tree models have this visual simplicity!
plot(tree_model_2$finalModel, uniform=TRUE, margin=0.1, compress=FALSE)
text(tree_model_2$finalModel, use.n = TRUE, all = TRUE, cex=0.7)
In every branching, “true” goes to the left and “false” to the right. So if we look at the first branch, we see nbumps< 1.5. This is the question that gives rise to the two branches below it. But first, we see another line of text, “0” (aka non-hazardous), and some numbers below that: 1449/102. The label is the predicted class for that node, and the numbers are the counts of “0” and “1” cases it contains. So we can read the top node as “We begin with 1449 non-hazardous cases and 102 hazardous cases, and the best single guess for this group is 0 (non-hazardous). We will split next on whether nbumps is less than 1.5.” Subsequent branches work the same way. At the end of each branch, there are “0” and “1” bins that each give their counts. For example, the leftmost “leaf” is a “0” (non-hazardous) bin, and it contains 1150 non-hazardous and 26 hazardous events.
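If the plotted tree is hard to read, you can also print the same splits, node counts, and predicted classes as plain text (output not shown here):
# Text version of the same tree
print(tree_model_2$finalModel)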
Sometimes just plotting produces a pretty squished dendrogram. Here’s a nicer way to plot it, which requires the rpart.plot package.
library(rpart.plot)
rpart.plot(tree_model_2$finalModel, type=4, extra=1)
A dendrogram makes it relatively easy (at least once you have the data dictionary and know what each variable measures) to describe the decision tree well. Note that in the dendrogram above, leaves are different colors based on whether they represent bins of hazardous or non-hazardous situations.
You can see how trees make rough categories that might not always work that precisely. For some data, tree algorithms might not be right at all. For others, you might benefit from combining lots of trees, each of which looks at only part of the data and a subset of the variables. Combining many such partial trees forms a random forest, which we’ll discuss in another document!
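(As a teaser, and assuming the randomForest package is installed, caret makes that switch nearly a one-word change. A sketch we won’t run or interpret here:)
# Not run here: a random forest on the same data via caret
# (requires the randomForest package)
# forest_model <- train(class ~ ., data = training, method = "rf")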
Our final step, once we’ve come up with a “winner”, is to apply it to our testing data. Our model won’t be as accurate on the testing data as it was on the training data. Let’s think about why: any model is going to do really well on the data it trained on, since it’s tuned to both the signal (the predictive relationship we’re trying to identify, which will be the same in training and testing) and the noise (random variation, which will differ between training and testing) of the training data. The more our model is tuned to random noise, the bigger the drop in performance between training and testing data.
Let’s see how much sensitivity and positive predictive value we lose, when we apply our model to the testing data.
confusionMatrix(predict(tree_model_2, testing), reference=testing$class, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 942 62
## 1 23 6
##
## Accuracy : 0.9177
## 95% CI : (0.8993, 0.9337)
## No Information Rate : 0.9342
## P-Value [Acc > NIR] : 0.9836
##
## Kappa : 0.0878
## Mcnemar's Test P-Value : 3.761e-05
##
## Sensitivity : 0.088235
## Specificity : 0.976166
## Pos Pred Value : 0.206897
## Neg Pred Value : 0.938247
## Prevalence : 0.065828
## Detection Rate : 0.005808
## Detection Prevalence : 0.028074
## Balanced Accuracy : 0.532201
##
## 'Positive' Class : 1
##
Wow, we really dropped a lot in both sensitivity and positive predictive value! The 21% PPV means that out of 100 alarms that something’s hazardous, only about 21 will be legit. The steep decrease in our model’s effectiveness shows that we’re “overfitting” the data. What to do about overfitting? Stay tuned for another microtutorial in machine learning!
One thing to avoid at this point is continuing to tune the tree or trying other models to see if we can get the testing performance to improve. The reason is that we’d then be tuning to signal and noise again, which essentially turns the testing data into training data.
To understand a bit more about what we did in this post you can:
Catch all the intro to machine learning documents: