Online location: http://rpubs.com/jcross/titanic_trees

To access R and RStudio, which are installed on the Saint Ann’s server, go to http://rstudio.saintannsny.org:8787/ and log in with your Saint Ann’s email address.

Past labs:

http://rpubs.com/jcross/titanic_plot

http://rpubs.com/jcross/intro_to_R

http://rpubs.com/jcross/titanic_transformation

Introduction

In this lab, we’ll create decision trees in order to predict who survived the sinking of the Titanic. We’ll start by loading the data and writing the code to find a branch. Then we’ll use the rpart package to build entire trees.

.libPaths("/home/rstudioshared/shared_files/packages")
library(dplyr)
library(ggplot2)

titanic <- read.csv("/home/rstudioshared/shared_files/data/titanic_train.csv")
titanic_test <- read.csv("/home/rstudioshared/shared_files/data/titanic_test.csv")

Finding a Branch

Let’s start by splitting everyone on the Titanic into two groups based on age: those 45 and younger and those older than 45. For each group, we’ll calculate the survival rate. We’ll then use this survival rate as a “prediction” for every passenger in the group and compute the root mean square error (RMSE) of these predictions.

RMSE <- function(x, y){sqrt(mean((x-y)^2))}

break.point <- 45
titanic <- titanic %>% mutate(age.groups = cut(Age, c(0, break.point, 90)))
age.summary <- titanic %>% group_by(age.groups) %>% 
  summarize(SurvivalRate = mean(Survived))
left_join(titanic, age.summary, by="age.groups") %>% summarize(RMSE(Survived, SurvivalRate))
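
To see what this calculation does, here is a small hand-checkable sketch in base R. The six passengers and their group labels are invented for illustration, and ave() stands in for the group_by/left_join steps above.

```r
RMSE <- function(x, y) sqrt(mean((x - y)^2))

# Invented toy data: 1 = survived, 0 = died
survived <- c(1, 1, 0, 0, 0, 1)
group    <- c("young", "young", "young", "old", "old", "old")

# Each passenger's "prediction" is the survival rate of their own group
# (2/3 for the young group, 1/3 for the old group)
predicted <- ave(survived, group)

RMSE(survived, predicted)
```

Every passenger is off by either 1/3 or 2/3, so the mean squared error is 2/9 and the RMSE is its square root, about 0.47.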

We can try splitting passengers into groups using a different age cut-off:

break.point <- 10
titanic <- titanic %>% mutate(age.groups = cut(Age, c(0, break.point, 90)))
age.summary <- titanic %>% group_by(age.groups) %>% 
  summarize(SurvivalRate = mean(Survived))
left_join(titanic, age.summary, by="age.groups") %>% summarize(RMSE(Survived, SurvivalRate))

Better still, we can look at all possible age cut points and find the split that gives us the lowest RMSE using the code below:

ages.to.check <- seq(0.5, 89.5, 0.5)

results <- data.frame(Age=NULL, RMSE=NULL)

for(age in ages.to.check){
  # Split passengers at the candidate age and compute each group's survival rate
  break.point <- age
  titanic <- titanic %>% mutate(age.groups = cut(Age, c(0, break.point, 90)))
  age.summary <- titanic %>% group_by(age.groups) %>%
    summarize(SurvivalRate = mean(Survived))
  # Use each group's survival rate as the prediction and record the RMSE
  rmse <- left_join(titanic, age.summary, by="age.groups") %>%
    summarize(RMSE(Survived, SurvivalRate)) %>% as.numeric()

  results <- rbind(results, data.frame(Age=age, RMSE=rmse))
}

results %>% ggplot(aes(Age, RMSE, label=Age))+geom_label() 
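
The same search can also be written as a small function applied to every candidate cut point. This is a base R sketch on invented toy data with no missing ages; the name split_rmse is made up for this example.

```r
RMSE <- function(x, y) sqrt(mean((x - y)^2))

# RMSE obtained when passengers are split at a single age break point
split_rmse <- function(break.point, age, survived) {
  group <- age <= break.point
  predicted <- ave(survived, group)  # group survival rate for each passenger
  RMSE(survived, predicted)
}

# Invented toy data
age      <- c(2, 5, 8, 30, 40, 50, 60, 70)
survived <- c(1, 1, 1, 0, 0, 0, 1, 0)

ages.to.check <- seq(0.5, 89.5, 0.5)
results <- data.frame(
  Age  = ages.to.check,
  RMSE = sapply(ages.to.check, split_rmse, age = age, survived = survived)
)

# The row with the lowest RMSE is the best split
best <- results[which.min(results$RMSE), ]
best
```

Using sapply() builds the whole results table in one step instead of growing it with rbind() inside a loop. In this toy data every cut point between 8 and 30 separates the passengers the same way, so which.min() simply reports the first of them.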

We can save a little time by letting the rpart package do this for us.

library(rpart)
rpart(Survived ~ Age, data=titanic, maxdepth=1)

Here “deviance” is the sum of the squared errors (if you were to divide by the number of passengers and take the square root, you’d have RMSE). We can visualize our branch with:

library(rpart.plot)
age.split <- rpart(Survived ~ Age, data=titanic, maxdepth=1)
prp(age.split)
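
The link between deviance and RMSE described above can be checked on a tiny example. This sketch uses invented toy data rather than the Titanic file, and the control settings (minsplit, minbucket, xval) are loosened only so that rpart will split such a small data set.

```r
library(rpart)

# Invented toy data
toy <- data.frame(
  age      = c(2, 5, 8, 30, 40, 50, 60, 70),
  survived = c(1, 1, 1, 0, 0, 0, 1, 0)
)

fit <- rpart(survived ~ age, data = toy, method = "anova",
             control = rpart.control(maxdepth = 1, cp = 0, minsplit = 2,
                                     minbucket = 1, xval = 0))

# Keep only the terminal nodes; each row holds that node's n and deviance
leaves <- fit$frame[fit$frame$var == "<leaf>", ]

# RMSE recovered from the leaf deviances
rmse_from_deviance <- sqrt(sum(leaves$dev) / sum(leaves$n))

# RMSE computed directly from the fitted values
rmse_direct <- sqrt(mean((toy$survived - predict(fit))^2))

c(rmse_from_deviance, rmse_direct)
```

The two numbers agree, which is why minimizing the deviance and minimizing the RMSE lead to the same split.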

Recursive Partitioning

rpart stands for recursive partitioning. This means that it finds the best way to split the data into two groups, and then it can do it again, searching each of these groups for the best split and slicing up the data again.

age.split <- rpart(Survived ~ Age, data=titanic, maxdepth=2, cp=0)
age.split
prp(age.split)

age.split <- rpart(Survived ~ Age, data=titanic, maxdepth=5, cp=0)
age.split
prp(age.split)
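
The idea of splitting and then splitting again can be sketched in a few lines of base R. This is only an illustration on invented toy data, not how rpart is implemented. The helper names sse, best_split and grow are made up for this example, and taking candidate cut points halfway between neighbouring ages is an assumption of the sketch.

```r
# Sum of squared errors around the group mean
sse <- function(y) sum((y - mean(y))^2)

# Best single cut point for one numeric predictor (NULL if no split is possible)
best_split <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(NULL)
  cuts  <- (head(vals, -1) + tail(vals, -1)) / 2
  total <- sapply(cuts, function(b) sse(y[x <= b]) + sse(y[x > b]))
  list(cut = cuts[which.min(total)], sse = min(total))
}

# Split, then call the same function again on each half
grow <- function(x, y, depth = 0, maxdepth = 2) {
  s <- best_split(x, y)
  # Stop at the depth limit, or when no split reduces the error
  if (depth >= maxdepth || is.null(s) || s$sse >= sse(y)) {
    return(list(leaf = TRUE, n = length(y), prediction = mean(y)))
  }
  left <- x <= s$cut
  list(leaf = FALSE, cut = s$cut,
       left  = grow(x[left],  y[left],  depth + 1, maxdepth),
       right = grow(x[!left], y[!left], depth + 1, maxdepth))
}

# Invented toy data
age      <- c(2, 5, 8, 30, 40, 50, 60, 70)
survived <- c(1, 1, 1, 0, 0, 0, 1, 0)

tree <- grow(age, survived, maxdepth = 2)
str(tree)
```

On this toy data the first split separates the three children from everyone else, and the second split is made only inside the older group, because the children's group is already perfectly predicted.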

Better yet, rpart is not limited to looking for the best split within any one variable. We can search through several variables all at once and find the best possible way to split the data (remember that the “best” split is the one that minimizes the RMSE or, equivalently, the deviance).

rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=titanic, maxdepth=1, cp=0)
rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=titanic, maxdepth=2, cp=0)

tree <- rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=titanic, 
              maxdepth=2, cp=0)
prp(tree)
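
Searching several variables at once amounts to finding the best split within each variable and then keeping the overall winner. A minimal base R sketch, assuming invented toy data, a 0/1 column named female standing in for Sex, and made-up helper names sse and best_sse:

```r
# Sum of squared errors around the group mean
sse <- function(y) sum((y - mean(y))^2)

# Lowest total squared error achievable with one cut on predictor x
best_sse <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2
  min(sapply(cuts, function(b) sse(y[x <= b]) + sse(y[x > b])))
}

# Invented toy data
toy <- data.frame(
  age      = c(2, 5, 8, 30, 40, 50, 60, 70),
  female   = c(1, 0, 1, 1, 0, 0, 1, 0),
  survived = c(1, 0, 1, 1, 0, 1, 1, 0)
)

# Best achievable error for each predictor; the smallest one wins
scores <- sapply(toy[c("age", "female")], best_sse, y = toy$survived)
scores
names(which.min(scores))
```

Here the split on female leaves less error than any split on age, so it would be chosen as the first branch.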

While, in theory, there’s no limit to the number of branches in our trees, a tree that’s too complex will “over-fit” the data. Imagine a tree where every passenger ends up on their own branch: this tree has taught us nothing about the data. We may want to limit the size of the tree by changing the “complexity parameter” rather than altering the maximum depth of the tree. Counter-intuitively, the smaller your complexity parameter, the more complex your tree will be. The complexity parameter is the minimum improvement in “fit” needed for a branch to be used (we’ll talk more about this later).

tree <- rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=titanic, 
              maxdepth=10, cp=0)
prp(tree)

tree <- rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Fare+Embarked, data=titanic, cp=0.005)
prp(tree)
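
To see the effect of the complexity parameter in numbers, we can count the leaves of trees grown with different cp values. This sketch uses R’s built-in mtcars data rather than the Titanic data so that it runs anywhere; the helper n_leaves is made up for this example, and xval=0 just skips cross-validation.

```r
library(rpart)

# Count the terminal nodes (leaves) of a fitted rpart tree
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# From a demanding cp (few branches allowed) down to cp = 0
cps <- c(0.5, 0.05, 0)
leaves <- sapply(cps, function(cp) {
  fit <- rpart(mpg ~ ., data = mtcars, cp = cp, xval = 0)
  n_leaves(fit)
})

data.frame(cp = cps, leaves = leaves)
```

As cp shrinks, the number of leaves never goes down: a smaller cp only allows additional branches.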

Actual Predictions

Finally, let’s use our trees to make predictions on the test set. After making a tree with the training set, we can simply find the appropriate branch for each passenger in the test set and return the predicted chance of survival for that branch as our prediction.

titanic_test <- read.csv("/home/rstudioshared/shared_files/data/titanic_test.csv")
predict(tree, titanic_test)
titanic_test$Survived <- predict(tree, titanic_test)

Once we have our predictions, we can submit them to Kaggle:

titanic_test %>% 
  mutate(Survived = round(Survived)) %>% 
  select(PassengerId, Survived) %>% 
  write.csv("kaggle_preds_decision_tree.csv", row.names=FALSE)
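
One detail of the round() step is worth knowing: R rounds halves to the nearest even number, so a predicted survival chance of exactly 0.5 becomes 0. The sketch below uses made-up probabilities to compare round() with an explicit cutoff; the name cutoff is only for illustration.

```r
# Made-up predicted survival chances
p <- c(0.1, 0.5, 0.73, 1)

# round() sends exactly 0.5 to 0
rounded <- round(p)

# An explicit cutoff makes the treatment of 0.5 a visible choice
cutoff <- 0.5
thresholded <- as.integer(p >= cutoff)

data.frame(p, rounded, thresholded)
```

Which rule to use for a passenger sitting exactly at 0.5 is a judgment call; the point is only that round() makes that choice for you.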

You May Want To…

?rpart
?rpart.control
tree <- rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Embarked+Fare, data=titanic, 
              cp=0.005, method="class")
prp(tree)

tree <- rpart(Survived ~ Pclass+Sex+Age+SibSp+Parch+Embarked+Fare, data=titanic, 
              cp=0.005, method="anova")
prp(tree)
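
If you try method="class", note that predict() behaves differently for classification trees: by default it returns a matrix with one column of probabilities per class rather than a single vector. A minimal sketch on invented toy data, with the control settings loosened only because the data set is tiny:

```r
library(rpart)

# Invented toy data
toy <- data.frame(
  age      = c(2, 5, 8, 30, 40, 50, 60, 70),
  survived = c(1, 1, 1, 0, 0, 0, 1, 0)
)

ctrl <- rpart.control(maxdepth = 1, cp = 0, minsplit = 2,
                      minbucket = 1, xval = 0)
class_tree <- rpart(survived ~ age, data = toy, method = "class",
                    control = ctrl)

new_passengers <- data.frame(age = c(4, 65))

# Default for a classification tree: one probability column per class
probs <- predict(class_tree, new_passengers)

# type = "class" returns the predicted label instead
labels <- predict(class_tree, new_passengers, type = "class")

# The column named "1" holds the predicted chance of survival
survival_prob <- probs[, "1"]
```

With a classification tree, either survival_prob or labels could take the place of the single vector that the anova tree returns.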