FFTrees

An R package to create, visualise, and impliment fast and frugal decision trees

Nathaniel Phillips, Economic Psychology, University of Basel
BaselR Meeting, March 2017

A growing problem at the Cook County Hospital in 1996

plot of chunk cookmap

Patient overload

plot of chunk unnamed-chunk-2

Diagnosing ER patients with chest pain.

Situation

  • 30 people a day worried about a heart attack.
  • Coronary care bed costs $2,000 a night and requires a 3 day stay.

Decision task

  • Send people with heart attacks to the coronary care bed, and healthy patients to a normal bed.

Multiple, uncertain measures

  • Blood pressure, Stethescope,
  • Questions: How long? How much? During exercise? History? Cholesterol? Drugs? etc.
  • Electrocardiogram (ECG) reading.

Intuition based on clinical expertise was not working.

Study: How much do doctors agree on diagnoses?

  • Asked doctors to estimate from 0 to 100 the probability of a heart attack of 20 separate patients.

plot of chunk unnamed-chunk-3

Conclusion: Terribly low agreement

"In each case the answers we got pretty much ranged from 0 to 100. It was extraordinary" (Brenden Reilly, Department of Medicine chairman)

Solution

  • A decision tree developed by a cardiologist named Lee Goldman.

plot of chunk unnamed-chunk-4

Why use a decision tree?

  • Speed, Easy of understanding and implementation

The Cook hospital decision tree

  • Over two years, the performance of the tree was compared to the physician's intuitive judgments.

Results

  • Tree dramatically outperformed the doctor's clinical judgments and resulted in far fewer false-positives and huge cost savings

  • To this day, the tree is still used at the hospital.

Dr. Lee Goldman

Fast and frugal decision trees (FFT)

  • A fast and frugal decision tree (FFT) is a very simple, highly restricted decision tree.

  • In an FFT, each node has exactly two branches, where at least one branch is an exit branch (Martignon et al., 2008).

  • Because of their restrictions, FFTs are even faster and require less information than non-FFT trees.

plot of chunk unnamed-chunk-6

Depression Tree

  • Jenny et al. (2013): Simple rules for detecting depression

plot of chunk unnamed-chunk-7

Bank failure

  • Neth et al. (2013): Homo heuristics in the financial world: From risk management to managing uncertainty

plot of chunk unnamed-chunk-8

Problem

There is no off-the-shelf method to construct FFTs

Task

  • Create an easy-to-use R package that constructs, visualizes, and implements FFTs called FFTrees.

plot of chunk unnamed-chunk-9

FFTrees

# Available on CRAN
install.packages("FFTrees")
devtools::github("ndphillips/FFTrees", include_vignette = TRUE)

Heart disease datatset

library(FFTrees)

head(heartdisease)
##     age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope
## 257  67   0  a      106  223   0      normal     142     0     0.3    up
## 118  35   0  a      138  183   0      normal     182     0     1.4    up
## 242  41   0 aa      126  306   0      normal     163     0     0.0    up
## 12   56   0 aa      140  294   0 hypertrophy     153     0     1.3  flat
## 131  54   1 np      120  258   0 hypertrophy     147     0     0.4  flat
## 108  57   1 np      128  229   0 hypertrophy     150     0     0.4  flat
##     ca   thal diagnosis
## 257  2 normal         0
## 118  0 normal         0
## 242  0 normal         0
## 12   0 normal         0
## 131  0     rd         0
## 108  1     rd         1

Creating a Heart Disease FFT

# Step 1: Create training and test data
set.seed(100)

heartdisease <- heartdisease[sample(nrow(heartdisease)),]
heart.train <- heartdisease[1:150,]
heart.test <- heartdisease[151:303,]

# Step 2: Create heart.fft
heart.fft <- FFTrees(formula = diagnosis ~.,
                     data = heart.train,
                     data.test = heart.test)

Evaluating a decision algorithm

plot of chunk unnamed-chunk-13

FFT summary statistics

# Step 3: Summary statistics
heart.fft
## [1] "7 FFTs using up to 4 of 13 cues"
## [1] "FFT #5 uses 3 cues {thal,thalach,cp} with the following performance:"
##       train   test
## n    150.00 153.00
## pci    0.88   0.87
## mcu    1.67   1.78
## acc    0.81   0.73
## bacc   0.82   0.73
## sens   0.89   0.81
## spec   0.74   0.66

Heart disease cue accuracies

plot(heart.fft, what = "cues", main = "Heart Disease")

plot of chunk unnamed-chunk-17

Heart Disease FFT

plot(heart.fft, main = "Heart Disease", 
     decision.names = c("healthy", "sick"),
     stats = FALSE)

plot of chunk unnamed-chunk-20

Heart Disease FFT - Training

plot of chunk unnamed-chunk-23

Heart Disease FFT - Prediction

plot of chunk unnamed-chunk-25

Heart Disease FFT - Tree 3

plot of chunk unnamed-chunk-27

Heart Disease FFT - Tree 6

plot of chunk unnamed-chunk-29

How do FFTs compare to regression and CART?

  • Simplicity: How much information is used and how is it combined?
  • Accuracy: How well can the algorithm predict new data?

Heart disease: regression

  • 4 significant cues: (sex, cp, trestbps, ca)

plot of chunk unnamed-chunk-30

Heart disease: rpart

  • 8 cues (thal, cp, oldpeak, ca, age, exang, thalach, chol)

plot of chunk unnamed-chunk-31

Heart disease: FFT

  • 3 cues (thal, cp, ca)
  • However, when applied to the data, only about 1.75 cues are even used on average. As a result, > 85% of the cue information is completely ignored.

plot of chunk unnamed-chunk-32

Heart disease classification accuracy

plot of chunk unnamed-chunk-35

Heart disease classification accuracy

plot of chunk unnamed-chunk-37

Heart disease classification accuracy

plot of chunk unnamed-chunk-39

How accurate are FFTs built by FFTrees?

  • Prediction competition
    • 10 datasets taken from the UCI machine learning database
    • 50% Fitting / 50% Prediction subsample splitting, DV: balanced accuracy = (sensitivity + specificity) / 2
dataset cases cues base.rate
arrhythmia 68 280 0.29
audiology 226 70 0.10
breast 683 10 0.35
bridges 92 10 0.39
cmc 1473 10 0.35

Table: 5 of the 10 prediction datasets

Aggregate simulation prediction results

plot of chunk unnamed-chunk-40

Aggregate simulation prediction results

plot of chunk unnamed-chunk-41

Aggregate simulation prediction results

plot of chunk unnamed-chunk-42

Simulation prediction results by dataset

plot of chunk unnamed-chunk-43

Conclusions

  • FFTrees makes it easy to develop simple, effective, transparent decision trees.
  • Decision aids built with FFTrees can compete with complex, compensatory algorithms in prediction.

Next steps

  • Speed up code with c++ or Julia.
  • Include cue costs into algorithm.
    • Heart disease FFT: $75.91
    • Regression: $300
  • Quantify when and how a tree fails when it is applied to data over time.

plot of chunk unnamed-chunk-44

Questions?

plot of chunk unnamed-chunk-45

Tree construction algorithm

Step Time
1. Decision threshold for each cue
2. Select cues 2
3. Order cues 4
4. Exit branch 6

Exploring trees with FFForest()

  • The FFForest() function will create a forest of FFTs by creating trees from random subsets of the data.

  • You can then extrapolate cue importance and co-occurence from an FFForest object:

plot of chunk unnamed-chunk-46