Robert Ness
Search problem: Given a subset of the data, the algorithm must choose the best next split for that subset on one of the features.
Default stopping criterion: each data point is its own subset, so there is no more data to split.
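The search step can be sketched in a few lines of base R. This is a minimal illustration and not the procedure used by the packages below: it assumes Gini impurity as the split criterion and a single numeric feature, and the helper names `gini` and `best_split` and the toy data are made up for this example.

```r
# Gini impurity of a vector of class labels (0 means the subset is pure)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Try every candidate threshold on one numeric feature and keep the one
# that minimizes the size-weighted impurity of the two child subsets
best_split <- function(x, y) {
  vals <- sort(unique(x))
  if (length(vals) < 2) return(NULL)  # stopping case: nothing left to split
  thresholds <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints
  score <- sapply(thresholds, function(t) {
    left  <- y[x <= t]
    right <- y[x > t]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  list(threshold = thresholds[which.min(score)], impurity = min(score))
}

# Toy data, invented for illustration: low values are class 0, high are class 1
x <- c(5, 7, 8, 30, 50, 80)
y <- c(0, 0, 0, 1, 1, 1)
split_found <- best_split(x, y)
print(split_found)
```

A full tree learner would repeat this search over every feature, apply the winning split, and recurse on each child subset until a stopping criterion is met.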
library(readr)
library(dplyr)
library(party)
library(rpart)
library(rpart.plot)
library(ROCR)
set.seed(100)
Each row in the data is a passenger. Columns are features:
fare: Fare paid
Other columns could make interesting features with some further engineering.
titanic3 <- "https://goo.gl/At238b" %>%
  read_csv %>% # read in the data
  select(survived, embarked, sex,
         sibsp, parch, fare) %>% # keep the outcome and five features
  mutate(embarked = factor(embarked), # treat categorical columns as factors
         sex = factor(sex))
#load("/Users/robertness/Downloads/titanic.Rdata")
Source: local data frame [5 x 6]

  survived embarked    sex sibsp parch     fare
1        1        S female     0     0 211.3375
2        1        S   male     1     2 151.5500
3        0        S female     1     2 151.5500
4        0        S   male     1     2 151.5500
5        0        S female     1     2 151.5500
# Randomly label each row "training" or "test", then split the data by label
.data <- c("training", "test") %>%
  sample(nrow(titanic3), replace = TRUE) %>%
  split(titanic3, .)
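# survived is numeric (0/1), so rpart fits a regression (anova) tree here;
# its predictions are survival rates within each leaf
rtree_fit <- rpart(survived ~ .,
                   data = .data$training)
rpart.plot(rtree_fit)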
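# Conditional inference tree from the party package
tree_fit <- ctree(survived ~ .,
                  data = .data$training)
# Score the held-out test set and compute the ROC curve
tree_roc <- tree_fit %>%
  predict(newdata = .data$test) %>%
  prediction(.data$test$survived) %>%
  performance("tpr", "fpr")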
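A ROC curve is often summarized by the area under it (AUC). The sketch below shows one way to get that number from ROCR, using the package's `"auc"` measure. The scores and labels are toy values invented for illustration, not Titanic results.

```r
library(ROCR)

# Toy predicted scores and true 0/1 labels (made up for illustration)
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.1)
labels <- c(1, 1, 0, 1, 0, 0)

pred <- prediction(scores, labels)

# Same ROC coordinates as used for the tree above
roc <- performance(pred, "tpr", "fpr")

# Area under the ROC curve; ROCR stores the value in the y.values slot
auc <- performance(pred, "auc")@y.values[[1]]
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1 to perfect ranking, which gives a single number for comparing models on the same test set.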
For comparison, we evaluate the decision tree, the conditional tree, and logistic regression on the same test set.
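A logistic regression baseline can be scored in the same way as the trees. The following is a minimal sketch under stated assumptions: the data are simulated rather than the Titanic data, and the names `sim`, `train`, `test`, `logit_fit`, and `logit_roc` are hypothetical. With the real data, the same `glm` call would take `survived ~ .` and `.data$training`.

```r
library(ROCR)

set.seed(1)

# Simulated stand-in data: one numeric feature and a binary outcome
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- rbinom(n, 1, plogis(2 * sim$x))

train <- sim[1:150, ]
test  <- sim[151:200, ]

# Logistic regression is a binomial GLM with the default logit link
logit_fit <- glm(y ~ x, family = binomial, data = train)

# type = "response" returns predicted probabilities instead of log-odds
logit_prob <- predict(logit_fit, newdata = test, type = "response")

# ROC curve and AUC on the held-out rows
logit_pred <- prediction(logit_prob, test$y)
logit_roc  <- performance(logit_pred, "tpr", "fpr")
logit_auc  <- performance(logit_pred, "auc")@y.values[[1]]
print(logit_auc)
```

Because every model is reduced to a `performance` object on the same test rows, the curves can be drawn on one plot, for example with `plot(logit_roc)` followed by `plot(tree_roc, add = TRUE)`.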