Intro to Decision Trees with R Example

Robert Ness

Table of Contents

  • Intro to Decision Trees
    • Advantages of method
    • Disadvantages
  • Recursive Partitioning Algorithms
    • Algorithm pseudocode and objective
  • R Example: Titanic Data

Intro to Decision Trees

Advantages of Decision Trees

  • Simple to understand and interpret. White box.
  • Requires little data preparation. (No need for normalization or dummy vars, works with NAs)
  • Works with both numerical and categorical data.
  • Handles nonlinearity (in contrast to logistic regression)
  • Possible to validate a model using statistical tests. Gives you confidence it will work on new data sets.
  • Robust. Performs well even if the data deviate from modeling assumptions
  • Scales to big data

Limitations of Decision Trees

  • Learning a globally optimal tree is NP-hard, so algorithms rely on greedy search
  • Easy to overfit the tree (unconstrained, prediction accuracy is 100% on training data)
  • Complex “if-then” relationships between features inflate tree size, e.g. the XOR gate or a multiplexer

Recursive partitioning algorithms

Recursive partitioning algorithms follow three basic steps

  1. Given a subset of training data, find the best feature for predicting the labels on that subset.
  2. Find the split on that feature that best separates the labels, and partition the subset into two new subsets.
  3. Repeat steps one and two recursively until a stopping criterion is met (a minimal sketch follows this list).
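A minimal, runnable sketch of this recursion (an assumed illustration, not the algorithm used by rpart or party): a single numeric feature, binary labels, and candidate cutpoints scored by misclassification error. The name grow_tree is hypothetical.

grow_tree <- function(x, y, depth = 0, max_depth = 3) {
  # Stop when the node is pure or the maximum depth is reached
  if (length(unique(y)) == 1 || depth >= max_depth) {
    return(list(leaf = TRUE, label = names(which.max(table(y)))))
  }
  # Steps 1-2: exhaustive search for the cutpoint with the fewest misclassifications
  cuts <- sort(unique(x))[-1]
  errors <- sapply(cuts, function(cut) {
    left <- y[x < cut]; right <- y[x >= cut]
    sum(left != names(which.max(table(left)))) +
      sum(right != names(which.max(table(right))))
  })
  best <- cuts[which.min(errors)]
  # Step 3: recurse on the two new subsets
  list(leaf = FALSE, cut = best,
       left  = grow_tree(x[x < best],  y[x < best],  depth + 1, max_depth),
       right = grow_tree(x[x >= best], y[x >= best], depth + 1, max_depth))
}

# Toy example: the feature separates the two labels perfectly at x = 10
grow_tree(x = c(1, 2, 3, 10, 11, 12),
          y = c("died", "died", "died", "lived", "lived", "lived"))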

Search problem: given a subset of the data, the algorithm must choose the best next split for that subset on one of the features.

Popular algorithms perform an exhaustive search over all possible splits.

  • CART (Breiman, Friedman, Olshen, and Stone 1984)
  • C4.5 (Quinlan 1993)
  • Rely on search criteria such as information gain
  • Implemented in R package 'rpart'

Default stopping criterion: recursion continues until each data point is in its own subset and there is no more data to split.

Information gain is a criterion used for split search but leads to overfitting

  • Let \( p_i \) be the proportion of observations in the subset that have label \( i \), for each of the \( m \) possible labels.
  • The entropy of the subset is \( H = -\sum_{i=1}^m p_i \log_2 p_i \); information gain (IG) is the reduction in entropy achieved by a split.
  • Intuition: choose the split that minimizes entropy in the resulting subsets' distributions of labels.

Caveats:

  • Cannot distinguish between statistically significant and insignificant improvements in IG, leading to overfitting.
  • For categorical features, IG is biased in favor of features with more levels.
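A small worked example may make the entropy calculation concrete (an assumed illustration; the functions entropy and info_gain below are not from any package):

# Shannon entropy (in bits) of a subset's label distribution
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  -sum(p * log2(p))
}

# Information gain from partitioning `labels` by the logical vector `split`
info_gain <- function(labels, split) {
  n <- length(labels)
  entropy(labels) -
    (sum(split) / n) * entropy(labels[split]) -
    (sum(!split) / n) * entropy(labels[!split])
}

# A perfectly informative split recovers the full parent entropy (1 bit here)
y <- c(0, 0, 0, 1, 1, 1)
info_gain(y, split = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))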

The conditional inference tree approach stops splitting when splits are not statistically significant

  • Using the observations in the subset, apply statistical test of independence between each feature and the labels.
  • Example for feature \( X_j \): null hypothesis \( Y \perp X_j \)
  • Select split on feature with lowest p-value
  • Stop recursion if no features have significant p-values.
  • The R package party uses permutation tests; a parametric test is available as well (see the sketch below)
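As a rough illustration of the idea (not party's actual permutation test), a simple test of independence between one categorical feature and the labels on a simulated subset might look like this:

# Simulate a node's subset: a binary label and one uninformative categorical feature
set.seed(1)
subset_df <- data.frame(y = rbinom(100, 1, 0.5),
                        x = factor(sample(c("a", "b"), 100, replace = TRUE)))

# Chi-squared test of independence between the feature and the labels
p_value <- chisq.test(table(subset_df$x, subset_df$y))$p.value

# Split on x only if the association is significant; otherwise stop recursion here
split_here <- p_value < 0.05

In party, the significance threshold is controlled through ctree_control (e.g. mincriterion = 0.95, corresponding to a 5% test level).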

R Example: Titanic Data

Packages and Data processing

  • readr to read in the data
  • dplyr to process data
  • party and rpart for the classification tree algorithms

Note: dplyr imports magrittr, which provides the “%>%” pipe syntax used below. Google magrittr to learn more.
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
set.seed(100)

Select features that may explain survival

Each row in the data is a passenger. Columns are features:

  • survived: 0 if died, 1 if survived
  • embarked: Port of Embarkation (Cherbourg, Queenstown, Southampton)
  • sex: Gender
  • sibsp: Number of Siblings/Spouses Aboard
  • parch: Number of Parents/Children Aboard
  • fare: Fare Paid

    Other columns could make interesting features with some further engineering.

Before fitting the model, categorical features should be made into factors

titanic3 <- "https://goo.gl/At238b" %>%
  read_csv %>% # read in the data
  select(survived, embarked, sex, 
         sibsp, parch, fare) %>%
  mutate(embarked = factor(embarked),
         sex = factor(sex))

Predictors and response are organized in a data frame

Source: local data frame [5 x 6]

  survived embarked    sex sibsp parch     fare
1        1        S female     0     0 211.3375
2        1        S   male     1     2 151.5500
3        0        S female     1     2 151.5500
4        0        S   male     1     2 151.5500
5        0        S female     1     2 151.5500
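This preview can be reproduced with, for example:

head(titanic3, 5)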

View the pairwise plots (red = survived)

[Figure: pairwise plots of the selected features, red = survived]
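One way to produce such a plot (an assumed sketch, not necessarily the original plotting code) is base R's pairs, coloring points by survival:

pairs(titanic3[, c("embarked", "sex", "sibsp", "parch", "fare")],
      col = ifelse(titanic3$survived == 1, "red", "black"))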

Split data into training and test sets

# Randomly assign each row to "training" or "test", then split the
# data frame into a list with elements $training and $test
.data <- c("training", "test") %>%
  sample(nrow(titanic3), replace = TRUE) %>%
  split(titanic3, .)

Recursive partitioning is implemented in the "rpart" package

rtree_fit <- rpart(survived ~ ., data = .data$training)
rpart.plot(rtree_fit)

[Figure: the fitted rpart tree]

Conditional inference partitioning is implemented in the "ctree" function

tree_fit <- ctree(survived ~ ., 
                  data = .data$training)

Use plot to visualize the partitions.
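For example, calling plot on the fitted object draws the splits and the terminal-node distributions:

plot(tree_fit)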

[Figure: the fitted conditional inference tree]

Use the ROCR package to visualize the ROC curve and compare methods.

tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%
  prediction(.data$test$survived) %>%
  performance("tpr", "fpr")

For comparison, we plot ROC curves for the decision tree (rpart), the conditional inference tree (ctree), and logistic regression.
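A sketch of how the remaining two curves might be computed and overlaid, reusing the same ROCR pipeline (an assumed illustration; dropping rows with missing values is a simplifying assumption):

# Score all three models on the same complete cases
test <- na.omit(.data$test)

# ROC for the rpart tree
rtree_roc <- rtree_fit %>%
  predict(newdata = test) %>%
  prediction(test$survived) %>%
  performance("tpr", "fpr")

# ROC for logistic regression
logit_roc <- glm(survived ~ ., data = .data$training, family = binomial) %>%
  predict(newdata = test, type = "response") %>%
  prediction(test$survived) %>%
  performance("tpr", "fpr")

plot(tree_roc, col = "blue")
plot(rtree_roc, col = "red", add = TRUE)
plot(logit_roc, col = "darkgreen", add = TRUE)
legend("bottomright", legend = c("ctree", "rpart", "logistic regression"),
       col = c("blue", "red", "darkgreen"), lty = 1)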

ROC of Classification Tree Performance

[Figure: ROC curves for the three classifiers]