Introduction
A decision tree is a supervised learning algorithm that creates non-linear decision boundaries by combining a sequence of linear rules, each of which tests a single variable.
Dataset
We consider a subset of the 2012 National Collision Database of Canada, available from Transport Canada at the Open Data catalog website (Anon 2014). This subset contains 3 variables: CollisionType, Sex and Severity. CollisionType refers to the type of collision (Rear End (RE), Right Angle (RA), Head-on (HO), and Right Turn (RT)). Sex is the gender (Male (M) or Female (F)) and Severity is the injury severity (1 = No injury and 2 = Injury) of the driver or other occupant of the vehicle. The response variable, therefore, is Severity.
Problem Description
Our objective is to predict the Severity of a driver (or occupant) in a crash, given their Sex and the CollisionType in which the vehicle is involved.
Decision Tree for predicting Severity
We will first divide the data into training and test sets. The next step is to train the decision tree algorithm on the training set. Finally, the Severity of the observations in the test data will be predicted using the learnt tree and the accuracy will be determined.
Reading the data
df <- read.csv("Data/ncdb_2012_filtered.csv", header=TRUE)
The data set df is shown below:
library(DT)
datatable(df, filter="top")
Train/Test Split
Before applying the algorithm, we’ll first divide the data set into test and train sets:
# Set random seed
set.seed(1)
# Shuffling the dataset
n <- nrow(df)
dfs <- df[sample(n),]
# Split the data in train and test
train_indices <- 1:round(0.7 * n)
train <- dfs[train_indices, ]
test_indices <- (round(0.7 * n) + 1):n
test <- dfs[test_indices, ]
How does the Decision Tree algorithm decide which variable to split first?
The decision tree algorithm divides the data using linear rules, e.g. is Sex M or F? But there is another variable in the data, CollisionType. The algorithm uses a quantity called information gain to decide which variable to split on first.
\[ \text{Information Gain} = \text{Entropy of parent} - \text{Weighted average of entropy of children} \]
where entropy is defined as:
\[ \text{Entropy} = -\sum_{j} p_j \log_2 p_j \]
where \(p_j\) is the probability of response type j. Entropy is a measure of impurity in the data set. In this example, if all entries in Severity are 1, or all are 2, the response is pure, with an entropy of zero (the probability is 1 and the log of 1 is zero). On the other hand, if Severity has equal numbers of 1 and 2, the entropy is 1.
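The two formulas above can be sketched in a few lines of R. This is only an illustration: the function names entropy and info_gain and the toy label vectors are our own and do not come from the collision data.

```r
# Entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions p_j
  p <- p[p > 0]                        # drop empty classes so 0 * log2(0) counts as 0
  -sum(p * log2(p))
}

# Information gain obtained by splitting `labels` according to `groups`
info_gain <- function(labels, groups) {
  children <- split(labels, groups)                      # labels in each child node
  weights  <- sapply(children, length) / length(labels)  # share of rows in each child
  entropy(labels) - sum(weights * sapply(children, entropy))
}

# Hypothetical toy labels
pure     <- c(1, 1, 1, 1)
balanced <- c(1, 1, 2, 2)

entropy(pure)      # 0: a pure node
entropy(balanced)  # 1: equal numbers of both classes

# A split that separates the two classes perfectly removes all of the parent's entropy
info_gain(balanced, c("a", "a", "b", "b"))  # 1
```

A split whose children have the same class mix as the parent would give an information gain of zero, so the algorithm prefers the variable with the larger gain.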
Apart from entropy, the Gini index and the classification error rate are also used as impurity measures when choosing a split.
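To see how these measures compare, the sketch below evaluates each one for a two-class node as a function of p, the proportion of one class. The function names are hypothetical helpers written for this illustration.

```r
# Impurity of a two-class node, where p is the proportion of one class
entropy_2 <- function(p) {
  # Both branches are evaluated, but pure nodes (p = 0 or 1) are set to 0
  ifelse(p %in% c(0, 1), 0, -p * log2(p) - (1 - p) * log2(1 - p))
}
gini_2  <- function(p) 2 * p * (1 - p)   # equals 1 - p^2 - (1 - p)^2
error_2 <- function(p) pmin(p, 1 - p)    # share of the minority class

p <- c(0, 0.25, 0.5, 0.75, 1)
data.frame(p, entropy = entropy_2(p), gini = gini_2(p), error = error_2(p))
```

All three measures are zero for a pure node and reach their maximum at p = 0.5, so they tend to rank candidate splits similarly.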
For a solved example, please refer to Kardi Teknomo’s tutorial.
Building the Decision Tree Model
For building the decision tree model, we will use the rpart library.
library(rpart)
tree <- rpart(Severity~., train, method = "class", minsplit=2, minbucket=1)
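Note that rpart uses the Gini index as its default splitting criterion for classification trees. As a sketch, the call below shows how the information-based criterion discussed earlier could be requested through the parms argument. The data frame toy is hypothetical: it only borrows the column names of the collision subset so that the example runs on its own.

```r
library(rpart)

# Hypothetical toy data with the same column names as the collision subset
toy <- data.frame(
  CollisionType = c("RE", "RE", "RA", "HO", "RT", "RA", "RE", "HO"),
  Sex           = c("F",  "M",  "M",  "F",  "M",  "F",  "F",  "M"),
  Severity      = c(1,    2,    2,    2,    2,    2,    1,    2),
  stringsAsFactors = TRUE
)

# Same arguments as the call above, plus a request to split on information
tree_info <- rpart(Severity ~ ., data = toy, method = "class",
                   parms = list(split = "information"),
                   minsplit = 2, minbucket = 1)
print(tree_info)
```

On a data set with only two predictors, both criteria may well select the same splits; the parms argument simply makes the choice explicit.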
Visualizing the tree
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 4.0.5 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)
fancyRpartPlot(tree)
Starting at the top node, we have two proportions. The left one, 0.27, means that 27% of the observations have Severity 1, i.e. No Injury. The right one, 0.73, means that 73% are Injury, which is why the node is labeled 2. The first splitting variable is CollisionType. If the collision is not Rear End (RE), 89% of the observations are of type Injury, so for all other CollisionTypes the predicted class is Injury.
When the CollisionType is RE, Severity is 50% Injury and 50% No Injury. The next rule is derived from the Sex variable: if the driver/occupant is Female (F), the predicted class is No Injury; if Male (M), the predicted class is Injury.
Now we will predict the class labels for the test data set.
Prediction
pred_test <- predict(tree, test, type = "class")
conf <- table(test$Severity, pred_test)
acc <- sum(diag(conf))/sum(conf)
The accuracy is only 33.3%. This indicates that these data are insufficient or unsuitable for decision tree modeling.
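When computing accuracy as sum(diag(conf))/sum(conf), it helps to make sure the confusion matrix is square. table() only creates rows and columns for labels that actually occur, so on a small test set a missing class can shift the diagonal. The sketch below illustrates this with hypothetical label vectors (not the collision data) and fixes the levels on both sides.

```r
# Hypothetical labels: the true labels happen to contain only class 2
truth <- c(2, 2, 2, 2)
pred  <- factor(c(1, 2, 1, 1), levels = c("1", "2"))

# Naive table: 1 row x 2 columns, so diag() picks the wrong cell
naive_conf <- table(truth, pred)
naive_acc  <- sum(diag(naive_conf)) / sum(naive_conf)   # 0.75, misleading

# Use the same levels for both vectors so the table is always 2 x 2
classes <- c("1", "2")
conf <- table(actual    = factor(truth, levels = classes),
              predicted = factor(pred,  levels = classes))
acc <- sum(diag(conf)) / sum(conf)                       # 0.25, the true accuracy

# Baseline for comparison: always predict the most frequent true class
baseline <- max(table(truth)) / length(truth)
```

Comparing acc with baseline gives a quick sense of whether the tree does better than always predicting the majority class.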
References
Anon, 2014. National Collision Database. Available at: http://open.canada.ca/data/en/dataset/1eb9eba7-71d1-4b30-9fb1-30cbdab7e63a [Accessed March 8, 2016].