Classification with Decision Trees

part of the Data Mining Series by Karen Mazidi

See more at my RPubs site.

A classification tree is a model that predicts the class label of data items. The tree is built by repeatedly dividing the data into groups based on attribute values. The attribute on which to divide is selected by information gain, a statistical technique for determining which attribute split will most cleanly divide the data.
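To make the idea concrete, here is a minimal sketch of entropy and information gain in R. It is not part of the original analysis: the helper functions entropy() and info_gain() and the candidate split Petal.Width <= 0.6 are illustrative assumptions, not functions from the packages used below.

# Illustrative helpers, not from RWeka/C50: entropy of a label vector and the
# information gain of a candidate binary split
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]            # drop empty classes so log2(0) never occurs
  -sum(p * log2(p))
}

info_gain <- function(labels, split) {
  # 'split' is a logical vector; gain = entropy before - weighted entropy after
  w <- mean(split)
  entropy(labels) - (w * entropy(labels[split]) + (1 - w) * entropy(labels[!split]))
}

# Example: gain from splitting iris on the assumed candidate Petal.Width <= 0.6
info_gain(iris$Species, iris$Petal.Width <= 0.6)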

This script explores the C4.5 algorithm and the newer C5.0 algorithm.

C4.5

The C4.5 algorithm, created by Ross Quinlan, builds decision trees. The algorithm starts with all instances in one group, then repeatedly splits the data on attribute values until each item is classified. To avoid overfitting, the tree is sometimes pruned back; C4.5 does this automatically. C4.5 handles both continuous and discrete attributes.

Load packages

First we load the RWeka and caret packages.

J48 is an open source Java implementation of the C4.5 algorithm from the Weka toolkit, available in R through the RWeka package.

The caret package (Classification And REgression Training) is a set of functions that streamline the model training process by providing tools for data splitting, feature selection, model tuning, and more.

library(RWeka)
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
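As a quick illustration of caret's data-splitting helpers (with caret loaded above), a stratified train/test split might look like the sketch below. This is illustrative only; the analysis that follows uses createFolds() instead, and the object names are my own.

# Illustrative only: an 80/20 stratified split of iris with caret
in_train <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
iris_tr  <- iris[in_train, ]
iris_te  <- iris[-in_train, ]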

Create the model

First, use caret's createFolds() to split the iris data into 10 folds for cross-validation. Then train the model with the J48 method. We are using the well-known iris data set, which is included with R.

set.seed(1958)  # set a seed to get replicable results
train <- createFolds(iris$Species, k=10)
C45Fit <- train(Species ~., method="J48", data=iris,
                tuneLength = 5,
                trControl = trainControl(
                  method="cv", indexOut=train))

Look at the model results

The results first describe the data: 150 samples, 4 predictors, and 3 classes.

Next, it tells us that 10-fold cross-validation was performed.

Finally, we get an accuracy of 0.96 and a Kappa of 0.94 (Kappa compares the observed accuracy to the accuracy expected by chance, so values near 1 are good). Pretty good!

C45Fit
## C4.5-like Trees 
## 
## 150 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 135, 135, 135, 135, 135, 135, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.96      0.94 
## 
## Tuning parameter 'C' was held constant at a value of 0.25
## 

Look at the model

Looking at the tree, we see that the first split is on whether the petal width is 0.6 or less; the 50 instances meeting that condition are all classified as setosa. Each level of indentation is the next split down the tree.

C45Fit$finalModel
## J48 pruned tree
## ------------------
## 
## Petal.Width <= 0.6: setosa (50.0)
## Petal.Width > 0.6
## |   Petal.Width <= 1.7
## |   |   Petal.Length <= 4.9: versicolor (48.0/1.0)
## |   |   Petal.Length > 4.9
## |   |   |   Petal.Width <= 1.5: virginica (3.0)
## |   |   |   Petal.Width > 1.5: versicolor (3.0/1.0)
## |   Petal.Width > 1.7: virginica (46.0/1.0)
## 
## Number of Leaves  :  5
## 
## Size of the tree :   9
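As a quick sanity check (a usage sketch, not part of the original post), the fitted caret object can be used to predict a few rows directly; predict() on a train object returns the predicted class labels by default.

# Predict one row of each species with the fitted J48 model
predict(C45Fit, newdata = iris[c(1, 51, 101), ])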

C5.0

Quinlan made improvements to C4.5 and called it C5.0. The newer algorithm is faster, requires less memory and gets results similar to C4.5 but with smaller decision trees.

The algorithm is available in the C50 package. The printr package is a companion to knitr that prints objects such as tables and data frames more cleanly.

library(C50)
library(printr)

Split iris data into train and test

train.indices <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indices, ]
iris.test <- iris[-train.indices, ]

Train decision tree

model <- C5.0(Species ~., data=iris.train)

Test

results <- predict(object=model, newdata=iris.test, type="class")

Look at confusion matrix

table(results, iris.test$Species)
results/     setosa  versicolor  virginica
setosa           19           0          0
versicolor        0          16          1
virginica         0           0         14
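The overall accuracy can be read off the confusion matrix (49 of 50 test rows correct); as a sketch, the same figure can be computed directly from the predictions:

# Proportion of test rows classified correctly (49/50 = 0.98 for this split)
mean(results == iris.test$Species)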

Plot the tree

plot(model)

[Figure: plot of the fitted C5.0 decision tree]

Comparing C4.5 and C5.0 results

Comparing the splits between the two algorithms, we see that C4.5 made 4 splits and got 96% accuracy (cross-validated), whereas C5.0 made only 3 splits and got 98% accuracy (on the held-out test set).

Note that the random seed influences how the data is split, and therefore influences the result. This is a small data set, so it is more sensitive to the particular train/test split than a larger data set would be.
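One way to make the C5.0 train/test split repeatable (an addition on my part, not in the original run) is to set a seed immediately before sampling:

set.seed(1958)   # any fixed seed works; 1958 simply mirrors the earlier example
train.indices <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indices, ]
iris.test  <- iris[-train.indices, ]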