FFTrees

An R package to create, visualise, and impliment fast and frugal decision trees

Nathaniel Phillips, Economic Psychology, University of Basel
BaselR Meeting - March 2017

A growing problem at the Cook County Hospital in 1996

plot of chunk cookmap

Patient overload

plot of chunk unnamed-chunk-2

Diagnosing ER patients with chest pain.

Situation

  • 30 people a day worried about a heart attack.
  • Coronary care bed costs $2,000 a night and requires a 3 day stay.

Decision task

  • Send people with heart attacks to the coronary care bed, and healthy patients to a normal bed.

Multiple, uncertain measures

  • Blood pressure, Stethescope,
  • Questions: How long? How much? During exercise? History? Cholesterol? Drugs? etc.
  • Electrocardiogram (ECG) reading.

Intuition based on clinical expertise was not working.

Study: How much do doctors agree on diagnoses?

  • Asked doctors to estimate from 0 to 100 the probability of a heart attack of 20 separate patients.

plot of chunk unnamed-chunk-3

Conclusion: Terribly low agreement

"In each case the answers we got pretty much ranged from 0 to 100. It was extraordinary" (Brenden Reilly, Department of Medicine chairman)

Solution

  • A decision tree developed by a cardiologist named Lee Goldman.

plot of chunk unnamed-chunk-4

Why use a decision tree?

  • Speed, Easy of understanding and implementation

The Cook hospital decision tree

  • Over two years, the performance of the tree was compared to the physician's intuitive judgments.

Results

  • Tree dramatically outperformed the doctor's clinical judgments and resulted in far fewer false-positives and huge cost savings

  • To this day, the tree is still used at the hospital.

Dr. Lee Goldman

Fast and frugal decision trees (FFT)

  • A fast and frugal decision tree (FFT) is a very simple, highly restricted decision tree.

  • In an FFT, each node has exactly two branches, where at least one branch is an exit branch (Martignon et al., 2008).

  • Because of their restrictions, FFTs are even faster and require less information than non-FFT trees.

plot of chunk unnamed-chunk-7

Depression Tree

  • Jenny et al. (2013): Simple rules for detecting depression

plot of chunk unnamed-chunk-8

Bank failure

  • Neth et al. (2013): Homo heuristics in the financial world: From risk management to managing uncertainty

plot of chunk unnamed-chunk-9

Problem

There is no off-the-shelf method to construct FFTs

Task

  • Create an easy-to-use R package that constructs, visualizes, and implements FFTs called FFTrees.

plot of chunk unnamed-chunk-10

FFTrees

# Available on CRAN
install.packages("FFTrees")
devtools::github("ndphillips/FFTrees", include_vignette = TRUE)

Heart disease datatset

library(FFTrees)

head(heartdisease)
##   age sex cp trestbps chol fbs     restecg thalach exang oldpeak slope ca
## 1  63   1 ta      145  233   1 hypertrophy     150     0     2.3  down  0
## 2  67   1  a      160  286   0 hypertrophy     108     1     1.5  flat  3
## 3  67   1  a      120  229   0 hypertrophy     129     1     2.6  flat  2
## 4  37   1 np      130  250   0      normal     187     0     3.5  down  0
## 5  41   0 aa      130  204   0 hypertrophy     172     0     1.4    up  0
## 6  56   1 aa      120  236   0      normal     178     0     0.8    up  0
##     thal diagnosis
## 1     fd         0
## 2 normal         1
## 3     rd         1
## 4 normal         0
## 5 normal         0
## 6 normal         0

Creating a Heart Disease FFT

# Step 1: Create training and test data
set.seed(100)

heartdisease <- heartdisease[sample(nrow(heartdisease)),]
heart.train <- heartdisease[1:150,]
heart.test <- heartdisease[151:303,]

# Step 2: Create heart.fft
heart.fft <- FFTrees(formula = diagnosis ~.,
                     data = heart.train,
                     data.test = heart.test)

Evaluating a decision algorithm

plot of chunk unnamed-chunk-14

FFT summary statistics

# Step 3: Summary statistics
heart.fft
## [1] "7 FFTs using up to 4 of 13 cues"
## [1] "FFT #4 uses 3 cues {thal,cp,ca} with the following performance:"
##       train   test
## n    150.00 153.00
## pci    0.88   0.88
## mcu    1.74   1.73
## acc    0.80   0.82
## bacc   0.80   0.82
## sens   0.82   0.88
## spec   0.79   0.76

Heart disease cue accuracies

plot(heart.fft, what = "cues", main = "Heart Disease")

plot of chunk unnamed-chunk-18

Heart Disease FFT

plot(heart.fft, main = "Heart Disease", 
     decision.names = c("healthy", "sick"),
     stats = FALSE)

plot of chunk unnamed-chunk-21

Heart Disease FFT - Training

plot of chunk unnamed-chunk-24

Heart Disease FFT - Prediction

plot of chunk unnamed-chunk-26

Heart Disease FFT - Tree 3

plot of chunk unnamed-chunk-28

Heart Disease FFT - Tree 6

plot of chunk unnamed-chunk-30

How do FFTs compare to regression and CART?

  • Simplicity: How much information is used and how is it combined?
  • Accuracy: How well can the algorithm predict new data?

Heart disease: regression

  • 4 significant cues: (sex, cp, trestbps, ca)

plot of chunk unnamed-chunk-31

Heart disease: rpart

  • 8 cues (thal, cp, oldpeak, ca, age, exang, thalach, chol)

plot of chunk unnamed-chunk-32

Heart disease: FFT

  • 3 cues (thal, cp, ca)
  • However, when applied to the data, only about 1.75 cues are even used on average. As a result, > 85% of the cue information is completely ignored.

plot of chunk unnamed-chunk-33

Heart disease classification accuracy

plot of chunk unnamed-chunk-36

Heart disease classification accuracy

plot of chunk unnamed-chunk-38

Heart disease classification accuracy

plot of chunk unnamed-chunk-40

How accurate are FFTs built by FFTrees?

  • Prediction competition
    • 10 datasets taken from the UCI machine learning database
    • 50% Fitting / 50% Prediction subsample splitting, DV: balanced accuracy = (sensitivity + specificity) / 2
dataset cases cues base.rate
arrhythmia 68 280 0.29
audiology 226 70 0.10
breast 683 10 0.35
bridges 92 10 0.39
cmc 1473 10 0.35

Table: 5 of the 10 prediction datasets

Aggregate simulation prediction results

plot of chunk unnamed-chunk-41

Aggregate simulation prediction results

plot of chunk unnamed-chunk-42

Aggregate simulation prediction results

plot of chunk unnamed-chunk-43

Simulation prediction results by dataset

plot of chunk unnamed-chunk-44

Conclusions

  • FFTrees makes it easy to develop simple, effective, transparent decision trees.
  • Decision aids built with FFTrees can compete with complex, compensatory algorithms in prediction.

Next steps

  • Speed up code with c++ or Julia.
  • Include cue costs into algorithm.
    • Heart disease FFT: $75.91
    • Regression: $300
  • Quantify when and how a tree fails when it is applied to data over time.

plot of chunk unnamed-chunk-45

Questions?

plot of chunk unnamed-chunk-46

Tree construction algorithm

Step Time
1. Decision threshold for each cue
2. Select cues 2
3. Order cues 4
4. Exit branch 6

Exploring trees with FFForest()

  • The FFForest() function will create a forest of FFTs by creating trees from random subsets of the data.

  • You can then extrapolate cue importance and co-occurence from an FFForest object:

plot of chunk unnamed-chunk-47