Ross Jacobucci
December 3rd, 2014
Today, we are going to cover a number of more “advanced” algorithms that are both popular and easy to implement
The talk is going to be split into two parts:
In my previous talk, I briefly mentioned the Netflix Prize. What I didn't mention is that Netflix actually never used the winning algorithm: https://www.techdirt.com/blog/innovation/articles/20120409/03412518422/why-netflix-never-implemented-algorithm-that-won-netflix-1-million-challenge.shtml
The problem: too much effort to implement.
Are you willing to let a program run for hours, days, weeks?
Ask yourself: do I care most about getting the best possible prediction, or do I care about doing well while still being able to interpret the results?
This is a fundamental tradeoff when deciding which method to use.
Part 1 is for those who care more about being able to interpret the results.
Part 2 is for situations where getting the most accurate prediction matters most.

Think of decision trees as a form of non-parametric regression
The algorithm recursively splits the dataset into partitions, or subgroups, based on the criterion variable.
Recent non-technical overview in JCCP: http://psycnet.apa.org/journals/ccp/82/5/895/
Iris Dataset
The first 8 observed values of Petal.Length and their frequencies:

Petal.Length:   1  1.1  1.2  1.3  1.4  1.5  1.6  1.7
Count:          1    1    2    7   13   13    7    4
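A frequency table like this can be produced directly in R (a minimal sketch; the [1:8] just keeps the first eight observed values):

# Counts of each observed Petal.Length value in the iris data
table(iris$Petal.Length)[1:8]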
The algorithm starts at 1, compares every observation with a value of 1 VERSUS every observation with a value GREATER than 1, and computes a measure of impurity within each subgroup. Next, it moves to 1.1 and compares every observation with a value of 1 or 1.1 VERSUS everyone with values GREATER than 1.1.
After this is done for Petal.Length, it moves on to the next variable, exhaustively searching the sample space.
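As a rough sketch of that exhaustive search (illustration only – rpart's actual implementation differs in details such as using midpoints between observed values and additional stopping rules), the size-weighted Gini impurity can be computed for every candidate cut point of Petal.Length:

# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Candidate cut points: every observed value except the largest
# (the largest would leave an empty "greater than" subgroup)
cuts <- sort(unique(iris$Petal.Length))
cuts <- cuts[-length(cuts)]

# For each cut point, split into <= value vs > value and compute
# the size-weighted impurity of the two subgroups
impurity <- sapply(cuts, function(value) {
  left  <- iris$Species[iris$Petal.Length <= value]
  right <- iris$Species[iris$Petal.Length >  value]
  (length(left) * gini(left) + length(right) * gini(right)) / nrow(iris)
})

cuts[which.min(impurity)]  # the best single split point for Petal.Length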
Same strategy, just uses a different cost function – most often RMSE
The relationship between covariate and outcome becomes a step function
library(rpart)    # recursive partitioning (CART)
library(ggplot2)  # for qplot()

# Linear regression: predicted mpg is a straight line in wt
lm.out <- lm(mpg ~ wt, data=mtcars)
pred.lm <- predict(lm.out)
qplot(mtcars$wt, pred.lm, geom=c("line","point"))

# Regression tree: predicted mpg is a step function of wt
quant.out <- rpart(mpg ~ wt, data=mtcars)
plot(quant.out); text(quant.out)
pred.quant <- predict(quant.out)
qplot(mtcars$wt, pred.quant, geom=c("step","point"))
In many cases, trees do slightly better than linear (logistic) regression, and this comes with an increased ability to interpret the relationship between the predictors and the outcome.
How about with the mtcars dataset? Predict mpg, looking for a lower RMSE.
library(caret)

# Linear regression, evaluated with caret's default resampling
out.lm <- train(mpg ~ wt, data=mtcars, method="lm")
out.lm$results$RMSE
[1] 3.172

# CART via rpart; results has one row per tuning value of cp
out.rpart <- train(mpg ~ wt, data=mtcars, method="rpart")
out.rpart$results$RMSE[2]
[1] 4.269
Interaction – when the effect of one variable depends on the level of another
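As a toy illustration (hypothetical simulated data, not from Strobl et al.), a tree can pick up an interaction by splitting first on the moderator and then on the predictor within the relevant branch:

# Simulated data: x affects y only when group == "B"
set.seed(1)
n <- 400
group <- factor(sample(c("A", "B"), n, replace=TRUE))
x <- runif(n)
y <- ifelse(group == "B", 3 * x, 0) + rnorm(n, sd=0.5)

library(rpart)
int.tree <- rpart(y ~ x + group, data=data.frame(y, x, group))
plot(int.tree); text(int.tree)  # typically splits on group first, then on x within group B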
Example from Strobl et al., 2009:

Account for attrition in longitudinal studies: http://books.google.com/books?hl=en&lr=&id=OV9tAAAAQBAJ&oi=fnd&pg=PA282&dq=mcardle,+jj&ots=iTbvlgy9K3&sig=OxWX9nQjWgeN3WroeqbL9ouMgBs#v=onepage&q=mcardle%2C%20jj&f=false
To create sample weights: http://www.ncbi.nlm.nih.gov/pubmed/24385641
Decision trees for survival analysis: http://link.springer.com/article/10.1007/s11336-014-9413-1
These are among many other applications in psychological research; see the JCCP overview of CART for more.
Decision Trees, except the outcome is a SEM
Within each subgroup, the -2LL is computed for the SEM model. SEM Trees looks for the split with the largest improvement, i.e. where -2LL(full sample) > -2LL(subgroup 1) + -2LL(subgroup 2)
Looking for people that are maximally similar with respect to the given SEM model.
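A rough sketch of that comparison using lavaan's built-in HolzingerSwineford1939 data (illustration only – the semtree package automates the search over covariates and the associated test):

library(lavaan)

# A simple one-factor CFA as the template model
model <- 'f =~ x1 + x2 + x3 + x4'

# -2 log-likelihood for the full sample
fit.full <- cfa(model, data=HolzingerSwineford1939)
ll.full  <- -2 * logLik(fit.full)

# -2LL within two candidate subgroups defined by a covariate (here, school)
d1 <- subset(HolzingerSwineford1939, school == "Pasteur")
d2 <- subset(HolzingerSwineford1939, school == "Grant-White")
ll.g1 <- -2 * logLik(cfa(model, data=d1))
ll.g2 <- -2 * logLik(cfa(model, data=d2))

# A split on school is attractive when this improvement is large enough
ll.full - (ll.g1 + ll.g2)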
Example taken from Brandmaier, von Oertzen, McArdle, and Lindenberger, 2012
Use covariates to predict different trajectories in growth curve model (longitudinal SEM)


semtree package in R; download from: http://brandmaier.de/semtree/download/
Uses either the OpenMx or lavaan package to estimate the SEM
See the .Rmd file in the Dropbox folder for an example
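A rough sketch of a semtree call, assuming a lavaan growth model and that semtree() takes the fitted model plus a data frame whose extra columns serve as split candidates (mydata and the t1-t4 variable names are hypothetical; check the package documentation for the exact interface):

library(lavaan)
library(semtree)

# Linear growth curve model over four occasions t1-t4 (hypothetical variables)
growth.model <- '
  i =~ 1*t1 + 1*t2 + 1*t3 + 1*t4
  s =~ 0*t1 + 1*t2 + 2*t3 + 3*t4
'
fit <- growth(growth.model, data=mydata)

# Covariates in mydata that are not part of the model are used for splitting
tree <- semtree(fit, data=mydata)
plot(tree)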
Random Forests
Random Forests build an ensemble of trees
For each tree, a bootstrap sample of the data is drawn, and at each split only a random subset of the predictors is considered.
You may also see the term "bagged trees" (trees grown on bootstrap samples and then aggregated).
library(randomForest)

# Classification forest on iris; importance=TRUE stores variable importance measures
rf.out <- randomForest(Species ~ ., data=iris, importance=TRUE)
varImpPlot(rf.out)  # plots mean decrease in accuracy and in Gini
One of the most powerful learning algorithms
Fewer things to tune in comparison to other advanced algorithms
In a recent JMLR article, Random Forests came out as the top algorithm across hundreds of datasets.
The randomForest package is probably the most well known
cforest() from the party package implements conditional inference forests
Many other variants exist; however, randomForest() and cforest() seem to be the two most popular
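A minimal sketch of the cforest() interface from party (the controls argument and varimp() are used as I understand them from the package; verify against its documentation):

library(party)

# Conditional inference forest on iris
cf.out <- cforest(Species ~ ., data=iris,
                  controls=cforest_unbiased(ntree=500))

# Permutation-based variable importance for the forest
varimp(cf.out)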
Just as Random Forests are a generalization of Decision Trees, the same idea applies to SEM Forests and SEM Trees
For SEM Forests, we care more about predictive power than about interpretation
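A rough sketch, assuming the semtree package also provides a semforest() function and a varimp() method for the resulting forest (fit and mydata are the hypothetical objects from the semtree sketch above; check the documentation):

library(semtree)

# Forest of SEM Trees built from the fitted growth model
forest <- semforest(fit, data=mydata)

# Aggregate covariate importance across the forest (assumed varimp() method)
vim <- varimp(forest)
plot(vim)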