Learn a decision tree

As a big fan of shipwrecks, you decide to go to your local library and look up data about Titanic passengers. You find a data set of 714 passengers, and store it in the titanic data frame (Source: Kaggle). Each passenger has a set of features - Pclass, Sex and Age - and is labeled as survived (1) or perished (0) in the Survived column.

To test your classification skills, you can build a decision tree that uses a person’s age, gender, and travel class to predict whether or not they survived the Titanic. The titanic data frame has already been divided into training and test sets (named train and test).

In this exercise, you’ll use train to build a decision tree. You can use the rpart() function from the rpart package for this. Behind the scenes, it performs the steps that Vincent explained in the video: coming up with possible feature tests and building a tree with the best of these tests.

Finally, a fancy plot can help you interpret the tree. You will need the rattle, rpart.plot and RColorBrewer packages to display this.

Note: In problems that have a random aspect, the set.seed() function is used to enforce reproducibility. Don’t worry about it, just don’t remove it!

train <- read.csv("train.csv")
# Set random seed. Don't remove this line
set.seed(1)

# Load the rpart, rattle, rpart.plot and RColorBrewer packages
library(rpart)
## Warning: package 'rpart' was built under R version 3.2.5
library(rattle)
## Warning: package 'rattle' was built under R version 3.2.5
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.2.5
library(RColorBrewer)
## Warning: package 'RColorBrewer' was built under R version 3.2.3
# Build a tree model: tree
tree <- rpart(Survived ~., data=train, method="class")

# Draw the decision tree
fancyRpartPlot(tree)

Understanding the tree plot

In the previous exercise you made a fancy plot of the tree that you’ve learned on the training set. Have another look at the close-up of a node:

Remember how Vincent told you that a tree is learned by separating the training set step by step? In an ideal world, the separations lead to subsets that all have the same class. In reality, however, each division will contain both positive and negative training observations. In this node, 76% of the training instances are positive and 24% are negative. The majority class thus is positive, or 1, which is signaled by the number 1 on top. The 36% bit tells you which percentage of the entire training set passes through this particular node. On each tree level, these percentages thus sum up to 100%. Finally, the Pclass = 1,2 bit specifies the feature test on which this node will be separated next. If the test comes out positive, the left branch is taken; if it’s negative, the right branch is taken.
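If you’d rather check these numbers than read them off the plot, the fitted rpart object can also be inspected in text form: printing it lists, for every node, the number of observations, the predicted class and the class probabilities. A minimal sketch, reusing the tree object from the previous exercise:

# Text view of the tree: each node shows n, the predicted class (yval)
# and the class probabilities (yprob)
print(tree)

# The per-node split variable and counts are also stored in the frame component
head(tree$frame[, c("var", "n", "yval")])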

Now that you can interpret the tree, can you tell which of the following statements is correct?

  • The majority class of the root node is positive, denoting survival.
  • The feature test that follows when the Sex is not female is based on a categorical variable.
  • The tree will predict female passengers in class 3 to not survive, although it’s close.
  • The leftmost leaf is very impure, as the vast majority of the training instances in this leaf are positive.

Answer: statement 3 is correct.

Classify with the decision tree

The previous learning step involved proposing different tests on which to split nodes and then selecting the best tests using an appropriate splitting criterion. You were spared all the implementation hassle that comes with this: the rpart() function did it for you.

Now you are going to classify the instances that are in the test set. As before, the data frames titanic, train and test are available in the workspace. You’ll only want to work with the test set, though.

test <- read.csv("test.csv")
# Code from previous exercise
set.seed(1)
library(rpart)

## Use tree to predict the labels of the test set with the predict() function; store the resulting prediction in pred.

tree <- rpart(Survived ~ ., train, method = "class")

# Predict the values of the test set: pred
pred <- predict(tree, test, type="class")

## Create a confusion matrix, conf, of your predictions on the test set. The true values, test$Survived, should be on the rows.
# Construct the confusion matrix: conf
conf <- table(test$Survived, pred)

## Use the confusion matrix to print out the accuracy. This is the ratio of all correctly classified instances divided by the total number of classified instances.
# Print out the accuracy
sum(diag(conf)) / sum(conf)
## [1] 0.7990654
  • Around 80 percent of all test instances have been classified correctly.

Pruning the tree

A like-minded shipwreck fanatic is also doing some Titanic predictions. He passes you some code that builds another decision tree model. The resulting model, tree, seems to work well but it’s pretty hard to interpret. Generally speaking, the harder it is to interpret the model, the more likely you’re overfitting on the training data.

You decide to prune his tree. This implies that you’re increasing the bias, as you’re restricting the amount of detail your tree can model. Finally, you’ll plot this pruned tree to compare it to the previous one.

# Calculation of a complex tree
set.seed(1)
tree <- rpart(Survived ~ ., train, method = "class", control = rpart.control(cp=0.00001))

# Draw the complex tree
fancyRpartPlot(tree)

# Prune the tree: pruned
## Use the prune() method to shrink tree to a more compact tree, pruned. Also specify the cp argument to be 0.01. This is a complexity parameter. It basically tells the algorithm to remove node splits that do not sufficiently decrease the impurity.
pruned <- prune(tree, cp=0.01)

# Draw pruned
fancyRpartPlot(pruned)

  • Another way to check whether you overfit your model is by comparing the accuracy on the training set with the accuracy on the test set: you’d see that the difference between the two is smaller for the simpler tree (a sketch of this check follows below). You can also set the cp argument while learning the tree with rpart(), using rpart.control().
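A rough sketch of that comparison, assuming train, test, tree and pruned are still in your workspace; tree_accuracy() is a hypothetical helper, not part of the exercise:

# Hypothetical helper: accuracy of an rpart model on a labeled data set
tree_accuracy <- function(model, data) {
  pred <- predict(model, data, type = "class")
  conf <- table(data$Survived, pred)
  sum(diag(conf)) / sum(conf)
}

# Gap between training and test accuracy; expect a larger gap
# for the complex tree than for the pruned one
tree_accuracy(tree, train) - tree_accuracy(tree, test)
tree_accuracy(pruned, train) - tree_accuracy(pruned, test)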

Splitting criterion

Do you remember the spam filters we built and tested in chapters 1 and 2? Well, it’s time to make the filter more serious! We added some relevant data for every email that will help filter the spam, such as word and character frequencies. All of these can be found in the emails dataset, which is loaded in your workspace. Also, a training and test set have already been built from it: train and test.

In this exercise, you’ll build two decision trees based on different splitting criteria. In the video you’ve learned about information gain: the higher the gain when you split, the better. However, the standard splitting criterion of rpart() is the Gini impurity.

It is up to you now to compare the information gain criterion with the Gini impurity criterion: how do the accuracy and resulting trees differ?

# Set random seed. Don't remove this line.
set.seed(1)

# Train and test tree with gini criterion (the emails train and test data are not available here, so the code is left commented out)
##tree_g <- rpart(spam ~ ., train, method = "class")
##pred_g <- predict(tree_g, test, type = "class")
##conf_g <- table(test$spam, pred_g)
##acc_g <- sum(diag(conf_g)) / sum(conf_g)

# Change the first line of code to use information gain as splitting criterion
##tree_i <- rpart(spam ~ ., train, method = "class", parms = list(split = "information"))
##pred_i <- predict(tree_i, test, type = "class")
##conf_i <- table(test$spam, pred_i)
##acc_i <- sum(diag(conf_i)) / sum(conf_i)

# Draw a fancy plot of both tree_g and tree_i
##fancyRpartPlot(tree_g)
##fancyRpartPlot(tree_i)

# Print out acc_g and acc_i
##acc_g
##acc_i
  • [1] 0.8905797
  • [1] 0.8963768
  • Using a different splitting criterion can influence the resulting model of your learning algorithm. However, the resulting trees are quite similar: the same variables are often present in both trees and the accuracy on the test set is comparable, 89% and 90%. A quick way to verify this is sketched below.
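A sketch of that check (commented out like the code above, since the emails data is not available here; it assumes tree_g and tree_i were built as shown) is to compare the variables each tree actually splits on:

# Variables used for splitting in each tree ("<leaf>" marks terminal nodes)
##vars_g <- setdiff(as.character(unique(tree_g$frame$var)), "<leaf>")
##vars_i <- setdiff(as.character(unique(tree_i$frame$var)), "<leaf>")
##intersect(vars_g, vars_i)  # variables shared by both trees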

Preprocess the data

Let’s return to the tragic titanic dataset. This time, you’ll classify its observations differently, with k-Nearest Neighbors (k-NN). However, there is one problem you’ll have to tackle first: scale.

As you’ve seen in the video, the scale of your input variables may have a great influence on the outcome of the k-NN algorithm. In your case, the Age is on an entirely different scale than Sex and Pclass, hence it’s best to rescale first!

For example, to normalize a vector x, you could do the following:

# (x - min(x)) / (max(x) - min(x))

Head over to the instructions to normalize Age and Pclass for both the training and the test set.
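If you prefer to wrap the formula in a small helper, something along these lines would work; normalize() is a hypothetical convenience function, not required by the exercise, and the minimum and maximum should come from the training set only:

# Hypothetical helper: min-max normalization, with the reference min and max
# passed in explicitly so the test set can reuse the training-set statistics
normalize <- function(x, min_x = min(x), max_x = max(x)) {
  (x - min_x) / (max_x - min_x)
}
# e.g. normalize(knn_test$Age, min_age, max_age) once min_age and max_age
# have been computed from the training set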

train <- read.csv("train3.csv")
test <- read.csv("test3.csv")
# Store the Survived column of train and test in train_labels and test_labels
train_labels <- train$Survived
test_labels <- test$Survived

# Copy train and test to knn_train and knn_test
knn_train <- train
knn_test <- test

# Drop Survived column for knn_train and knn_test
knn_train$Survived <- NULL
knn_test$Survived <- NULL

# normalize Pclass
##Pclass is an ordinal value between 1 and 3. Have a look at the code that normalizes this variable in both the training and the test set. To define the minimum and maximum, only the training set is used; we can't use information on the test set (like the minimums or maximums) to normalize the data.
min_class <- min(knn_train$Pclass)
max_class <- max(knn_train$Pclass)
knn_train$Pclass <- (knn_train$Pclass - min_class) / (max_class - min_class)
knn_test$Pclass <- (knn_test$Pclass - min_class) / (max_class - min_class)

# normalize Age
##In a similar fashion, normalize the Age column of knn_train as well as knn_test. Again, you should only use features from the train set to decide on the normalization! Use the intermediate variables min_age and max_age.
min_age <- min(knn_train$Age)
max_age <- max(knn_train$Age)
knn_train$Age <- (knn_train$Age - min_age) / (max_age - min_age)
knn_test$Age <- (knn_test$Age - min_age) / (max_age - min_age)
head(knn_train$Age)
## [1] 0.2926614 0.1916130 0.4442339 0.4821271 0.3684476 0.3684476
head(knn_test$Age)
## [1] 0.520020210 0.002147278 0.254768220 0.374763168 0.355816597 0.418971833

The knn() function

Now that you have your preprocessed data - available in your workspace as knn_train, knn_test, train_labels and test_labels - you are ready to start with actually classifying some instances with k-Nearest Neighbors.

To do this, you can use the knn() function, which is available from the class package. It takes four arguments:

  • train: the observations in the training set, without the class labels,
  • test: the observations in the test set, without the class labels,
  • cl: a factor of true class labels of the training set, and
  • k: the number of nearest neighbors you want to consider.

Let’s try it out and see what the result looks like.

# Set random seed. Don't remove this line.
set.seed(1)

# Load the class package
library(class)
## Warning: package 'class' was built under R version 3.2.5
# Make predictions using knn: pred
pred <- knn(knn_train, knn_test, train_labels, k=5)
    
# Construct the confusion matrix: conf
conf <- table(test_labels, pred)

# Print out the confusion matrix
conf
##            pred
## test_labels   0   1
##           0 113  16
##           1  26  59
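Just as with the decision tree, this confusion matrix can be turned into an accuracy. From the counts above that is (113 + 59) / 214, roughly 0.80:

# Accuracy of the k-NN predictions (correctly classified / total)
sum(diag(conf)) / sum(conf)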

K’s choice

A big issue with k-Nearest Neighbors is the choice of a suitable k. How many neighbors should you use to decide on the label of a new observation? Let’s have R answer this question for us and assess the performance of k-Nearest Neighbor classification for increasing values of k.

# Set random seed. Don't remove this line.
set.seed(1)

# Load the class package, define range and accs
library(class)
range <- 1:round(0.2 * nrow(knn_train))
accs <- rep(0, length(range))

for (k in range) {
  # Make predictions using knn: pred
  pred <- knn(knn_train, knn_test, train_labels, k = k)

  # Construct the confusion matrix: conf
  conf <- table(test_labels, pred)

  # Calculate the accuracy and store it in accs[k]
  accs[k] <- sum(diag(conf)) / sum(conf)
}

# Plot the accuracies. Title of x-axis is "k".
plot(range, accs, xlab="k")

# Calculate the best k
which.max(accs)
## [1] 1

Creating the ROC curve (1)

In this exercise you will work with a medium-sized dataset about people’s incomes, given a set of features like education, race, sex, and so on. Each observation is labeled with 1 or 0: 1 means the observation has an annual income equal to or above $50,000, 0 means it has an annual income below $50,000 (Source: UCIMLR). This label information is stored in the income variable.

A tree model, tree, which tries to predict income based on all other variables in the dataset, has been learned for you on a training set.

Before, you used this tree to make class predictions, by setting the type argument in predict() to “class”.

To build an ROC curve, however, you need the probabilities that the observations are positive. In this case, you’ll want to predict the probability of each observation in the test set (already available) having an annual income equal to or above $50,000. For this, you’ll have to set the type argument of predict() to “prob”.

train4 <- read.csv("train4.csv")
test4 <- read.csv("test4.csv")
# Set random seed. Don't remove this line
set.seed(1)

# Build a tree on the training set: tree
tree <- rpart(income ~ ., train4, method = "class")

# Predict probability values using the model: all_probs
all_probs <- predict(tree, test4, type="prob")

# Print out all_probs
head(all_probs)
##       <=50K       >50K
## 1 0.6989796 0.30102041
## 2 0.9500130 0.04998705
## 3 0.9500130 0.04998705
## 4 0.9500130 0.04998705
## 5 0.2836489 0.71635112
## 6 0.6989796 0.30102041
# Select second column of all_probs: probs
probs <- all_probs[,2]
head(probs)
##          1          2          3          4          5          6 
## 0.30102041 0.04998705 0.04998705 0.04998705 0.71635112 0.30102041

Creating the ROC curve (2)

Now that you have the probabilities of every observation in the test set belonging to the positive class (annual income equal or above $50,000), you can build the ROC curve.

You’ll use the ROCR package for this. First, you have to build a prediction object with prediction(). Next, you can use performance() with the appropriate arguments to build the actual ROC data and plot it.

probs, which you had to calculate in the previous exercise, is already coded for you.

# Code of previous exercise
set.seed(1)
tree <- rpart(income ~ ., train4, method = "class")
probs <- predict(tree, test4, type = "prob")[,2]

# Load the ROCR library
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.2.5
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.2.5
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
# Make a prediction object: pred
pred <- prediction(probs, test4$income)

# Make a performance object: perf
perf <- performance(pred, "tpr" , "fpr")

# Plot this curve
plot(perf)

The area under the curve

The same package you used for constructing the ROC curve can be used to quantify the area under the curve, or AUC. The same tree model is loaded into your workspace, and the test set’s probabilities have again been calculated for you.

Again using the ROCR package, you can calculate the AUC. The use of prediction() is identical to before. However, the performance() function needs some tweaking.

# Build tree and predict probability values for the test set
set.seed(1)
tree <- rpart(income ~ ., train4, method = "class")
probs <- predict(tree, test4, type = "prob")[,2]

# Load the ROCR library
library(ROCR)

# Make a prediction object: pred
pred <- prediction(probs, test4$income)

# Make a performance object: perf
perf <- performance(pred, "auc")

# Print out the AUC
perf@y.values[[1]]
## [1] 0.8481775

Comparing the methods

In this exercise you’re going to assess two models: a decision tree model and a k-Nearest Neighbor model. You will compare the ROC curves of these models to draw your conclusions.

You finished the previous chapter by building a spam filter. This time, we have some predictions from two spam filters! These spam filters calculated the probabilities of unseen observations in the test set being spam. The real spam labels of the test set can be found in test$spam.

It is your job to use your knowledge about the ROCR package to plot two ROC curves, one for each classifier. The assigned probabilities for the observations in the test set are loaded into your workspace: probs_t for the decision tree model, probs_k for k-Nearest Neighbors.

In the course environment the test set is loaded into your workspace as test; here it is read in as test5. It’s a subset of the emails dataset.

test5 <- read.csv("test5.csv")
# Make the prediction objects for both models: pred_t, pred_k
##pred_t <- prediction(probs_t, test5$spam)
##pred_k <- prediction(probs_k, test5$spam)

# Make the performance objects for both models: perf_t, perf_k
##perf_t <- performance(pred_t, "tpr","fpr")
##perf_k <- performance(pred_k, "tpr", "fpr")

# Draw the ROC lines using draw_roc_lines()
##draw_roc_lines(perf_t, perf_k)
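Outside the course environment draw_roc_lines() is not available, but ROCR’s own plot method can produce the same picture. A minimal sketch, commented out like the code above since probs_t and probs_k are not available here:

# Draw both ROC curves on the same axes with ROCR's plot method
##plot(perf_t, col = "red")
##plot(perf_k, col = "blue", add = TRUE)
##legend("bottomright", legend = c("decision tree", "k-NN"),
##       col = c("red", "blue"), lty = 1)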