Intro to Decision Trees with R Example

Robert Ness

Table of Contents

  • Intro to Decision Trees
    • Advantages of method
    • Disadvantages
  • Recursive Partitioning Algorithms
    • Algorithm pseudocode and objective
  • R Example: Titanic Data

Intro to Decision Trees

Advantages of Decision Trees

  • Simple to understand and interpret. White box.
  • Requires little data preparation. (No need for normalization or dummy vars, works with NAs)
  • Works with both numerical and categorical data.
  • Handles nonlinearity (in contrast to logistic regression)
  • Possible to validate a model using statistical tests. Gives you confidence it will work on new data sets.
  • Robust. Performs well even if the data deviate from modeling assumptions
  • Scales to big data

Limitations of Decision Trees

  • Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search
  • Easy to overfit the tree (unconstrained, prediction accuracy is 100% on training data)
  • Complex “if-then” relationships between features inflate tree size, e.g. the XOR gate or a multiplexer

Recursive partitioning algorithms

Recursive partitioning algorithms follow three basic steps

  1. Given a subset of training data, find the best feature for predicting the labels on that subset.
  2. Find the split on that feature that best separates the labels, and partition the subset into two new subsets.
  3. Repeat steps one and two recursively until a stopping criterion is met (a minimal sketch follows this list).
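A minimal, runnable sketch of this recursion (an assumed illustration, not the algorithm used by rpart or party): a single numeric feature, binary labels, and candidate cutpoints scored by misclassification error. The name grow_tree is hypothetical.

grow_tree <- function(x, y, depth = 0, max_depth = 3) {
  # Stop when the node is pure or the maximum depth is reached
  if (length(unique(y)) == 1 || depth >= max_depth) {
    return(list(leaf = TRUE, label = names(which.max(table(y)))))
  }
  # Steps 1-2: exhaustive search for the cutpoint with the fewest misclassifications
  cuts <- sort(unique(x))[-1]
  errors <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    sum(left != names(which.max(table(left)))) +
      sum(right != names(which.max(table(right))))
  })
  best <- cuts[which.min(errors)]
  # Step 3: recurse on the two new subsets
  list(leaf = FALSE, cut = best,
       left  = grow_tree(x[x < best],  y[x < best],  depth + 1, max_depth),
       right = grow_tree(x[x >= best], y[x >= best], depth + 1, max_depth))
}

# Toy example: the feature separates the two labels perfectly at x = 10
grow_tree(x = c(1, 2, 3, 10, 11, 12),
          y = c("died", "died", "died", "lived", "lived", "lived"))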

Search problem: given a subset of the data, the algorithm must choose the best next split for that subset on one of the features.

Popular algorithms perform an exhaustive search over all possible splits.

  • CART (Breiman, Friedman, Olshen, and Stone 1984)
  • C4.5 (Quinlan 1993)
  • Rely on search criteria such as information gain
  • Implemented in R package 'rpart'

Default stopping criterion: recursion continues until each data point is in its own subset and there is no more data to split.

Information gain is a criterion used for split search but leads to overfitting

  • Let \( p_i \) be the proportion of observations in the subset that have label \( i \), for each of the \( m \) possible labels.
  • The entropy of the subset is \( H = -\sum_{i=1}^m p_i \log_2 p_i \); information gain (IG) is the reduction in entropy achieved by a split.
  • Intuition: choose the split that minimizes entropy in the resulting subsets' distributions of labels.

Caveats:

  • Cannot distinguish between statistically significant and insignificant improvements in IG, leading to overfitting.
  • For categorical features, IG is biased in favor of features with more levels.
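A small worked example may make the entropy calculation concrete (an assumed illustration; the functions entropy and info_gain below are not from any package):

# Shannon entropy (in bits) of a subset's label distribution
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain from partitioning `labels` by the logical vector `split`
info_gain <- function(labels, split) {
  n <- length(labels)
  entropy(labels) -
    (sum(split) / n) * entropy(labels[split]) -
    (sum(!split) / n) * entropy(labels[!split])
}

# A perfectly informative split recovers the full parent entropy (1 bit here)
y <- c(0, 0, 0, 1, 1, 1)
info_gain(y, split = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))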

The conditional inference tree approach stops splitting when splits are not statistically significant

  • Using the observations in the subset, apply statistical test of independence between each feature and the labels.
  • Example for feature \( X_j \): null hypothesis \( Y \perp X_j \)
  • Select split on feature with lowest p-value
  • Stop recursion if no features have significant p-values.
  • The R package party uses permutation tests; a parametric test is available as well (see the sketch below)
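As a rough illustration of the idea (not party's actual permutation test), a simple test of independence between one categorical feature and the labels on a simulated subset might look like this:

# Simulate a node's subset: a binary label and one uninformative categorical feature
set.seed(1)
subset_df <- data.frame(y = rbinom(100, 1, 0.5),
                        x = factor(sample(c("a", "b"), 100, replace = TRUE)))

# Chi-squared test of independence between the feature and the labels
p_value <- chisq.test(table(subset_df$x, subset_df$y))$p.value

# Split on x only if the association is significant; otherwise stop recursion here
split_here <- p_value < 0.05

In party, the significance threshold is controlled through ctree_control (e.g. mincriterion = 0.95, corresponding to a 5% test level).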

R Example: Titanic Data

Packages and Data processing

  • readr to read in the data
  • dplyr to process data
  • party and rpart for the classification tree algorithms

Note: dplyr imports magrittr, which provides the “%>%” pipe syntax used below. Google magrittr to learn more.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
set.seed(100)

Select features that may explain survival

Each row in the data is a passenger. Columns are features:

  • survived: 0 if died, 1 if survived
  • embarked: Port of Embarkation (Cherbourg, Queenstown, Southampton)
  • sex: Gender
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • fare: Fare Paid

    Other columns could make interesting features with some further engineering.

Before fitting the model, categorical features should be made into factors

titanic3 <- "https://goo.gl/At238b" %>%
  read_csv %>% # read in the data
  select(survived, embarked, sex, 
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))

Predictors and response are organized in a data frame

Source: local data frame [5 x 6]

  survived embarked    sex sibsp parch     fare
1        1        S female     0     0 211.3375
2        1        S   male     1     2 151.5500
3        0        S female     1     2 151.5500
4        0        S   male     1     2 151.5500
5        0        S female     1     2 151.5500
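This preview can be reproduced with, for example:

head(titanic3, 5)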

View the pairwise plots (red = survived)

[Figure: pairwise plots of the selected features, red = survived]
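One way to produce such a plot (an assumed sketch, not necessarily the original plotting code) is base R's pairs, coloring points by survival:

pairs(titanic3[, c("embarked", "sex", "sibsp", "parch", "fare")],
      col = ifelse(titanic3$survived == 1, "red", "black"))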

Split data into training and test sets

# Randomly assign each row to "training" or "test", then split the
# data frame into a list with elements $training and $test
.data <- c("training", "test") %>%
  sample(nrow(titanic3), replace = TRUE) %>%
  split(titanic3, .)

Recursive partitioning is implemented in the "rpart" package

rtree_fit <- rpart(survived ~ ., data = .data$training)
rpart.plot(rtree_fit)

[Figure: the fitted rpart tree]

Conditional inference partitioning is implemented in the "ctree" function

tree_fit <- ctree(survived ~ ., 
                  data = .data$training)

Use plot to visualize the partitions.
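For example, calling plot on the fitted object draws the splits and the terminal-node distributions:

plot(tree_fit)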

[Figure: the fitted conditional inference tree]

Use the ROCR package to visualize the ROC curve and compare methods.

tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%
  prediction(.data$test$survived) %>%
  performance("tpr", "fpr")

For comparison, we plot ROC curves for the decision tree (rpart), the conditional inference tree (ctree), and logistic regression.
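A sketch of how the remaining two curves might be computed and overlaid, reusing the same ROCR pipeline (an assumed illustration; dropping rows with missing values is a simplifying assumption):

# Score all three models on the same complete cases
test <- na.omit(.data$test)

# ROC for the rpart tree
rtree_roc <- rtree_fit %>%
  predict(newdata = test) %>%
  prediction(test$survived) %>%
  performance("tpr", "fpr")

# ROC for logistic regression
logit_roc <- glm(survived ~ ., data = .data$training, family = binomial) %>%
  predict(newdata = test, type = "response") %>%
  prediction(test$survived) %>%
  performance("tpr", "fpr")

plot(tree_roc, col = "blue")
plot(rtree_roc, col = "red", add = TRUE)
plot(logit_roc, col = "darkgreen", add = TRUE)
legend("bottomright", legend = c("ctree", "rpart", "logistic regression"),
       col = c("blue", "red", "darkgreen"), lty = 1)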

ROC of Classification Tree Performance

[Figure: ROC curves for the three classifiers]