Decision Trees for Categorical Response

Umair Durrani

August 14, 2016

Introduction

A decision tree is a supervised learning algorithm that builds non-linear decision boundaries by combining a sequence of simple linear rules, each of which tests a single variable.

Dataset

We consider a subset of the 2012 National Collision Database of Canada, available from Transport Canada through the Open Data catalog website (Anon 2014). This subset contains 3 variables: CollisionType, Sex and Severity. CollisionType is the type of collision (Rear End (RE), Right Angle (RA), Head-on (HO), or Right Turn (RT)). Sex is the gender (Male (M) or Female (F)), and Severity is the injury severity (1 = No injury, 2 = Injury) of the driver or other occupant of the vehicle. The response variable is therefore Severity.

Problem Description

Our objective is to predict the injury Severity of a driver (or occupant) in a crash, given their Sex and the CollisionType in which the vehicle was involved.

Decision Tree for predicting Severity

We will first divide the data into training and test sets. The next step is to train the decision tree algorithm on the training set. Finally, the Severity of the observations in the test data will be predicted using the learnt tree and the accuracy will be determined.

Reading the data

df <- read.csv("Data/ncdb_2012_filtered.csv", header=TRUE)

The data set df is shown below:

library(DT)
datatable(df, filter="top")

Train/Test Split

Before applying the algorithm, we’ll first divide the data set into test and train sets:

# Set random seed
set.seed(1)

# Shuffling the dataset
n <- nrow(df)
dfs <- df[sample(n),]

# Split the data in train and test
train_indices <- 1:round(0.7 * n)
train <- dfs[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test <- dfs[test_indices, ]
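Because the split is random, it can be worth confirming that the two sets do not overlap, that together they cover every row, and that their class balance is roughly similar. The following is a minimal sketch of such a check on a made-up data frame; toy_split and its values are hypothetical and only stand in for df.

```r
# Hypothetical stand-in for df; the values are made up for illustration
toy_split <- data.frame(
  CollisionType = rep(c("RE", "RA", "HO", "RT"), times = 5),
  Sex           = rep(c("M", "F"), times = 10),
  Severity      = rep(c(1, 2, 2, 2, 1), times = 4)
)

set.seed(1)
n_toy   <- nrow(toy_split)
shuffle <- sample(n_toy)          # random permutation of the row indices
n_train <- round(0.7 * n_toy)

train_rows <- shuffle[1:n_train]
test_rows  <- shuffle[(n_train + 1):n_toy]

# The two index sets should not overlap and should cover every row
length(intersect(train_rows, test_rows)) == 0
setequal(c(train_rows, test_rows), seq_len(n_toy))

# Class proportions in each set; large differences suggest an unlucky split
prop.table(table(toy_split$Severity[train_rows]))
prop.table(table(toy_split$Severity[test_rows]))
```

With very small data sets the class proportions in the two parts can differ considerably from one seed to the next, which is one reason a single accuracy figure should be read with caution.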

How does the Decision Tree algorithm decide which variable to split first?

The decision tree algorithm divides the data using linear rules, e.g. is the gender M or F? But the data contain a second variable, CollisionType, so the algorithm needs a criterion for choosing which variable to split on first. One such criterion is the information gain:

\[ \text{Information Gain} = \text{Entropy(parent)} - \text{Weighted average of Entropy(children)} \]
where entropy is defined as:

\[ \text{Entropy} = -\sum_{j} p_j \log_2 p_j \]

where \(p_j\) is the proportion of observations with response class j. Entropy measures the impurity of a data set. In this example, if every entry of Severity is 1, or every entry is 2, the response is pure and the entropy is zero (the single class has probability 1, and the log of 1 is zero). On the other hand, if Severity contains equal numbers of 1s and 2s, the entropy is 1, its maximum for two classes.
Apart from entropy, the Gini index and the classification error rate are also used as impurity measures when choosing a split.
For a solved example, please refer to Kardi Teknomo’s tutorial.
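The entropy and information gain formulas can also be sketched directly in R. The helper functions entropy, gini and info_gain below are illustrative names, not part of any package, and the label vectors are made up.

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)   # class proportions
  p <- p[p > 0]               # drop empty classes so 0 * log2(0) never appears
  -sum(p * log2(p))
}

# Gini index of a vector of class labels
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Information gain from splitting the labels y by a grouping variable x
info_gain <- function(y, x) {
  groups  <- split(y, x)
  weights <- sapply(groups, length) / length(y)
  entropy(y) - sum(weights * sapply(groups, entropy))
}

entropy(c(1, 1, 1, 1))   # pure node: 0
entropy(c(1, 1, 2, 2))   # evenly mixed node: 1
gini(c(1, 1, 2, 2))      # evenly mixed node: 0.5

severity <- c(1, 1, 2, 2)
info_gain(severity, c("F", "F", "M", "M"))   # split separates the classes: 1
info_gain(severity, c("F", "M", "F", "M"))   # split is uninformative: 0
```

A variable whose split yields the larger information gain would be preferred, since it leaves purer child nodes.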

Building the Decision Tree Model

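To build the decision tree model, we will use the rpart package. Note that for classification rpart splits on the Gini index by default; splitting on information gain is requested with parms = list(split = "information").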

library(rpart)
tree <- rpart(Severity ~ ., data = train, method = "class", minsplit = 2, minbucket = 1)
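For comparison, the sketch below fits a tree that splits on information gain rather than on rpart's default Gini index. It uses a made-up data frame, toy_tree, whose values are hypothetical and are not the collision data; only the call shape mirrors the model above.

```r
library(rpart)

# Made-up data for illustration; not the collision data set
toy_tree <- data.frame(
  CollisionType = factor(rep(c("RE", "RA", "HO", "RT"), each = 10)),
  Sex           = factor(rep(c("M", "F"), times = 20)),
  Severity      = factor(c(rep(c(2, 1), times = 5), rep(2, 30)))
)

# Same call shape as the model above, but splitting on information gain
tree_info <- rpart(Severity ~ ., data = toy_tree, method = "class",
                   parms = list(split = "information"),
                   minsplit = 2, minbucket = 1)

# Text listing of the fitted rules, one line per node
print(tree_info)
```

The printed listing is a plain-text alternative to a plot: each line shows the rule leading to a node, the number of observations, and the predicted class.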

Visualizing the tree

library(rattle)
library(rpart.plot)
library(RColorBrewer)

fancyRpartPlot(tree)

Starting at the top node, there are two proportions. The left one, 0.27, is the share of observations with Severity 1, i.e. No Injury. The right one, 0.73, is the share with Severity 2, i.e. Injury; because Injury is the majority class, a 2 is written at the top of the node. The first splitting variable is CollisionType. If the collision is not Rear End (RE), 89% of the observations are of type Injury, so all other CollisionTypes are classified as Injury.
When the CollisionType is RE, Severity is split evenly: 50% Injury and 50% No Injury. The next rule is derived from the Sex variable: if the driver/occupant is Female (F), the predicted class is No Injury; if Male (M), the predicted class is Injury.
Next, we predict the class labels for the test data set.

Prediction

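# Predict class labels for the test set
pred_test <- predict(tree, test, type = "class")

# Confusion matrix: rows are actual classes, columns are predicted classes
conf <- table(actual = test$Severity, predicted = pred_test)

# Accuracy: share of observations on the diagonal
acc <- sum(diag(conf)) / sum(conf)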

The accuracy is only 33.3%. This suggests that these data are either insufficient or not appropriate for decision tree modeling.
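An accuracy figure is easier to judge next to a simple baseline, such as always predicting the most frequent class in the training set. The sketch below illustrates the comparison on made-up label vectors; train_labels, actual and predicted are hypothetical and are not the results reported above.

```r
# Made-up labels for illustration; not the results reported above
train_labels <- c(2, 2, 2, 1, 2, 1, 2)
actual       <- factor(c(1, 2, 2, 1, 2, 2), levels = c(1, 2))
predicted    <- factor(c(1, 2, 1, 2, 2, 2), levels = c(1, 2))

# Confusion matrix with a fixed set of levels, so it is always square
conf_toy <- table(actual = actual, predicted = predicted)
acc_toy  <- sum(diag(conf_toy)) / sum(conf_toy)

# Baseline: always predict the most frequent class in the training labels
majority     <- names(which.max(table(train_labels)))
baseline_acc <- mean(actual == majority)

acc_toy        # 4 of 6 predictions are correct
baseline_acc   # 4 of 6 test labels belong to the majority class
```

A model whose accuracy does not exceed this baseline has learnt nothing useful beyond the class frequencies.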

References

Anon, 2014. National Collision Database. Available at: http://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a [Accessed March 8, 2016].