Statistical Learning Part 3

Ross Jacobucci
December 3rd, 2014

Overview

Today, we are going to cover a number of more “advanced” algorithms that are both popular and easy to implement

The talk is going to be split into two parts:

  • Decision Trees for Interpretable Results
  • Random Forests for Increased Predictive Accuracy

The Importance of Interpretation & Ease of Implementation

In my previous talk, I briefly mentioned the Netflix Prize. What I didn't mention is that Netflix actually never used the winning algorithm: https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml

The problem: too much effort to implement.

Are you willing to let a program run for hours, days, weeks?

Interpretation

Ask yourself, do I care most about getting the best possible prediction, or do I care about doing well, but being able to interpret the results?

This is a fundamental tradeoff when deciding which method to use.

Part 1 is for those who care more about being able to interpret the results.

Part 2 is for situations when getting the most accurate prediction matters most.

Interpretation Cont'd

Decision Tree Overview

Think of it as a form of non-parametric regression

  • Used for both classification and regression
  • Can produce simpler and more interpretable solutions than linear regression
  • Also, in many cases, does better than linear regression

The algorithm recursively splits the dataset into partitions, or subgroups, choosing splits on the predictors so that each subgroup is as homogeneous as possible on the criterion (outcome) variable.

Recent non-technical overview in JCCP: http://psycnet.apa.org/journals/ccp/82/5/895/
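
As a minimal sketch of what this looks like in R (not the exact code behind the following slides, but the same idea), a classification tree for the iris data can be grown with the rpart package:

library(rpart)
# Grow a classification tree predicting Species from the four flower measurements
iris.tree <- rpart(Species ~ ., data=iris)
iris.tree                           # text summary of the splits
plot(iris.tree); text(iris.tree)    # tree diagram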

Decision Trees Example

Iris Dataset

[Figure: plot of the iris data]

Build an Actual Tree -- Get the First Split

[Figure: tree showing the first split]

What Does This Split Look Like?

[Figure: the first split drawn on the iris data]

Let's Go Down Another Level

[Figure: tree with a second level of splits]

How Does This Translate

[Figure: the resulting partitions of the iris data]

So What Does The Algorithm Actually Do?

The first 8 observed values of Petal.Length, and how many cases take each value:

Petal.Length:  1    1.1  1.2  1.3  1.4  1.5  1.6  1.7
Count:         1    1    2    7    13   13   7    4

The algorithm starts at 1, comparing every person with a value of 1 VERSUS every person with a value GREATER than 1, and computes a measure of impurity within each subgroup. Next, it moves to 1.1, comparing every person with a value of 1 OR 1.1 VERSUS everyone with a value GREATER than 1.1, and so on.

After this is done for Petal.Length, it moves on to the next variable, exhaustively searching the sample space.
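
A rough sketch of that exhaustive search for a single predictor, using Gini impurity as the impurity measure (a sketch only; rpart's actual implementation differs in its details):

# Sketch: exhaustive search over cut points for one predictor, using Gini impurity
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

x <- iris$Petal.Length
y <- iris$Species

cuts <- sort(unique(x))
cuts <- cuts[-length(cuts)]          # candidate cut points (dropping the largest value)

impurity <- sapply(cuts, function(cut) {
  left  <- y[x <= cut]
  right <- y[x >  cut]
  # impurity of the two subgroups, weighted by their size
  (length(left) * gini(left) + length(right) * gini(right)) / length(y)
})

cuts[which.min(impurity)]            # best cut point for Petal.Length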

How About For Quantitative Outcomes?

Same strategy, just uses a different cost function – most often RMSE

The relationship between covariate and outcome becomes a step function

Plot for Linear Regression

library(ggplot2)                      # qplot() is from ggplot2
lm.out <- lm(mpg ~ wt, data=mtcars)   # regress mpg on weight
pred.lm <- predict(lm.out)            # fitted values
qplot(mtcars$wt, pred.lm, geom=c("line","point"))

[Figure: fitted regression line for mpg on wt]

Tree Plot for Quantitative Outcome

library(rpart)                       # rpart() fits the tree
quant.out <- rpart(mpg ~ wt, data=mtcars)
plot(quant.out); text(quant.out)     # draw the tree and label the splits

[Figure: regression tree for mpg on wt]

pred.quant <- predict(quant.out)

Relationship between outcome and predictor

qplot(mtcars$wt,pred.quant,geom=c("step","point"))

[Figure: step-function relationship between wt and predicted mpg]

Predictive Accuracy

In many cases a single tree does slightly better than linear (logistic) regression, and this comes with an increased ability to interpret the relationship between the predictors and the outcome.

How about with the mtcars dataset? Predict mpg; we are looking for the lower RMSE.

library(caret)                            # train() handles resampling and model tuning

out.lm <- train(mpg ~ wt, data=mtcars, method="lm")
out.lm$results$RMSE                       # resampled RMSE for the linear model
[1] 3.172

out.rpart <- train(mpg ~ wt, data=mtcars, method="rpart")
out.rpart$results$RMSE[2]                 # RMSE at the second value of the cp tuning grid
[1] 4.269

How does it model interactions?

Interaction – when the effect of one variable depends on another. Because each split is made within the subgroups defined by earlier splits, a variable used lower in the tree has an effect that is conditional on the variables above it, so interactions are modeled implicitly by the tree structure.

An example is given in Strobl et al. (2009).
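
To make this concrete, here is a small, purely hypothetical simulation (not from Strobl et al.): x1 affects y only when x2 is high, and a single tree captures this by splitting on x2 first and then on x1 within the high-x2 branch.

set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- ifelse(x2 > 0, 3 + 2 * x1, 0) + rnorm(n, sd = 0.5)   # x1 matters only when x2 > 0

library(rpart)
int.tree <- rpart(y ~ x1 + x2)
int.tree   # first split is on x2; splits on x1 show up mainly in the x2 > 0 branch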

So What Are Some Applications of Decision Trees?

Account for attrition in longitudinal studies: http://books.google.com/books?hl=en&lr=&id=OV9tAAAAQBAJ&oi=fnd&pg=PA282&dq=mcardle,+jj&ots=iTbvlgy9K3&sig=OxWX9nQjWgeN3WroeqbL9ouMgBs#v=onepage&q=mcardle%2C%20jj&f=false

To create sample weights: http://www.ncbi.nlm.nih.gov/pubmed/24385641

Decision trees for survival analysis: http://link.springer.com/article/10.1007/s11336-014-9413-1

These are among many other applications in psychological research; see the JCCP overview of CART for others.

SEM Trees

Decision Trees, except the outcome is a SEM

Within each candidate subgroup, the -2 log-likelihood (-2LL) of the SEM is computed. SEM Trees chooses the split with the largest improvement in fit, i.e., where -2LL(full sample) exceeds -2LL(group 1) + -2LL(group 2) by the largest amount.

It looks for subgroups of people who are maximally similar with respect to the given SEM model.
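
As a rough illustration of that criterion (not the semtree implementation itself), the -2LL comparison for one candidate split could be computed by hand with lavaan, here using its built-in HolzingerSwineford1939 data and a simple one-factor model as stand-ins:

library(lavaan)

# One-factor model and a single candidate split on age (illustration only)
model <- 'visual =~ x1 + x2 + x3'
dat   <- HolzingerSwineford1939

fit.full <- cfa(model, data = dat)
fit.g1   <- cfa(model, data = subset(dat, ageyr <= 12))
fit.g2   <- cfa(model, data = subset(dat, ageyr >  12))

m2ll <- function(fit) -2 * as.numeric(logLik(fit))

# Improvement in -2LL from splitting at this cut point;
# SEM Trees searches all covariates and cut points for the largest such improvement
m2ll(fit.full) - (m2ll(fit.g1) + m2ll(fit.g2))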

SEM Trees Example

Example taken from Brandmaier, von Oertzen, McArdle, and Lindenberger (2012)

  • Active sample: examine whether cognitive training has an effect on cognitive aging in an older sample

Use covariates to predict different trajectories in a growth curve model (longitudinal SEM)

Tree Plot

Trajectories

Programming SEM Trees

The semtree package in R; download from: http://brandmaier.de/semtree/download/

Uses either the OpenMx or lavaan package to estimate the SEM

See the .Rmd file in the Dropbox folder for an example
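
A minimal sketch of a semtree call with a lavaan growth model; the model, data name (mydata), and covariates here are placeholders, and the exact arguments should be checked against the semtree documentation:

library(lavaan)
library(semtree)

# Hypothetical linear growth model for four repeated measures y1-y4
growth.model <- '
  i =~ 1*y1 + 1*y2 + 1*y3 + 1*y4
  s =~ 0*y1 + 1*y2 + 2*y3 + 3*y4
'
fit <- growth(growth.model, data = mydata)   # mydata: placeholder data frame with y1-y4 plus covariates

# Grow the SEM Tree; covariates in mydata that are not in the model serve as split candidates
tree <- semtree(fit, data = mydata)
plot(tree)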

Part 2

Random Forests

Random Forests Overview

Random Forests build an ensemble of trees

  • Uses bootstrap aggregating (bagging) to average predictions across all of the trees
  • Can build tens, hundreds, or thousands of trees
  • Much harder to interpret, as there is no single tree

For each tree:

  • a random bootstrap sample of the data is taken
  • at each split, only a random subset of the predictors is considered (the mtry parameter)
    • this serves to de-correlate the trees and decreases suppression effects

You may also see the term bagged trees.

  • This refers to an ensemble of trees built with bagging, but with mtry = p, i.e., all predictors considered at each split (see the sketch below)
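
A quick sketch of that distinction with the randomForest package (mtry defaults to roughly the square root of the number of predictors for classification; setting mtry = p gives bagged trees):

library(randomForest)

p <- ncol(iris) - 1   # number of predictors (4)

# Random forest: only a random subset of predictors (mtry, default ~ sqrt(p)) is tried at each split
rf.fit  <- randomForest(Species ~ ., data = iris, ntree = 500)

# Bagged trees: same procedure, but all p predictors are candidates at every split
bag.fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = p)

c(rf = rf.fit$mtry, bagged = bag.fit$mtry)   # 2 vs. 4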

Variable Importance

library(randomForest)
rf.out <- randomForest(Species ~ ., data=iris, importance=TRUE)   # grow the forest, store importance measures
varImpPlot(rf.out)                                                # plot the importance measures for each predictor

[Figure: variable importance plot for the iris predictors]

Why use Random Forests?

One of the most powerful learning algorithms

Fewer things to tune in comparison to other advanced algorithms

  • one of the best out-of-the-box models
    • counter: Friedman has referred to GBMs (gradient boosted machines) as the best out-of-the-box method

In a recent JMLR article, Random Forests came out as the top algorithm across hundreds of datasets.

Random Forests in R

The randomForest package is probably the most well known

  • pretty simple to use, and can be accessed through caret's train() with method="rf"

cforest() from the party package implements conditional inference forests

  • slightly different formulation, in which variable selection is unbiased
    • leads to unbiased variable importance measures
  • also prefers subsampling (sampling without replacement)
    • the default in randomForest is bootstrap sampling, although subsampling can be used (replace=FALSE)

There are many other variants; however, randomForest() and cforest() seem to be the two most popular.
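
A minimal sketch of the conditional inference forest interface from party, alongside its (conditional) variable importance; the argument values here (ntree, mtry) are arbitrary choices:

library(party)

# Conditional inference forest on the iris data
cf.out <- cforest(Species ~ ., data = iris,
                  controls = cforest_unbiased(ntree = 500, mtry = 2))

# Permutation importance; conditional = TRUE gives the conditional variant
varimp(cf.out)
varimp(cf.out, conditional = TRUE)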

SEM Forests

Just as Random Forests are a generalization of Decision Trees, SEM Forests are a generalization of SEM Trees.

For SEM Forests, we care more about predictive power as opposed to interpretation

  • Much harder to interpret, as there is no single tree
  • Just as in Random Forests, you get variable importance measures
  • You also get a proximity matrix
  • This clusters cases that are similar to each other