STA 363 Lab 8

Goal

In today’s lab, we are going to explore forests. In class, we built a regression forest. Today, we will focus on a classification forest. Most of the ideas are the same, but this will give us practice with both!

The Data

We are going to work with the same data set as the last lab. As a reminder, this data set contains information on n = 333 penguins. To load the data, you need to use the following three lines of code:

library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)

We have a client who is interested in building a model for Y = the sex of the penguin.

In addition to this response variable, we have information on 7 features.

body_mass_g - the body mass of the penguin in grams.
species - the species of the penguin.
island - the island where the penguin lives.
bill_length_mm - the length of the penguin bill in millimeters.
bill_depth_mm - the depth of the penguin bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
year - the year the penguin was measured.

Classification Tree

Before we start building a forest, let’s build a single tree. This is generally good practice, because it will help us to see what the trees that make up our forest might look like.

Question 1

Grow a classification tree for \(Y=\) sex using all the available features. Call this tree tree1. Show your tree as your answer to this question. You are welcome to use the standard stopping rules (meaning you do not have to set your own unless you would like to).

Question 2

What is the Gini Index of the first split of this tree?
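As a reminder, one common way to write the Gini Index is sketched below. The notation is an assumption for illustration and is not taken from this lab: \(\hat{p}_k\) is the proportion of observations in a leaf that belong to class \(k\), \(K\) is the number of classes, and \(n_L\) and \(n_R\) are the numbers of observations in the left and right leaves created by a split of \(n\) observations. If the definition from class differs, use that one.

```latex
\[
G = \sum_{k=1}^{K} \hat{p}_k \,(1 - \hat{p}_k) = 1 - \sum_{k=1}^{K} \hat{p}_k^{2},
\qquad
G_{\text{split}} = \frac{n_L}{n}\, G_L + \frac{n_R}{n}\, G_R .
\]
```

For example, a made-up leaf containing 8 females and 2 males has \(G = 1 - (0.8^2 + 0.2^2) = 0.32\).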

Question 3

Using your tree, what sex would you predict for the first penguin in the data set?

Question 4

We can use the code predict(tree1) to get predicted probabilities from our tree. What is the predicted probability of being male and of being female for the 3rd penguin in the data set?

Question 5

We can use the code predict(tree1, type = "class") to get the predicted sex for each penguin. What is the predicted sex of the 3rd penguin in the data set? Note: this code chooses the class associated with the highest predicted probability.
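To see how the two kinds of predictions relate, here is a minimal sketch on a different data set. It assumes the tree was grown with the rpart package (the lab does not name the package) and uses R's built-in iris data, so toy_tree, probs, and classes are illustrative names only.

```r
library(rpart)

# Grow a small classification tree on the built-in iris data
toy_tree <- rpart(Species ~ ., data = iris, method = "class")

# Default predictions for a classification tree: one row per observation,
# one column of predicted probabilities per class
probs <- predict(toy_tree)

# Predicted class: the class with the highest predicted probability
classes <- predict(toy_tree, type = "class")

# Look at the third observation both ways
probs[3, ]
classes[3]
```

Each row of probs sums to 1, and classes simply reports the column of probs with the largest value in that row.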

Question 6

Create a confusion matrix for tree1. Hint: You may need to refer back to your logistic regression lab for this.
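If you do not have your logistic regression lab handy, the sketch below shows one way to cross-tabulate true and predicted classes with table(). The vectors truth and pred are made up for illustration and have nothing to do with the penguins.

```r
# Made-up true and predicted classes for six observations
truth <- factor(c("female", "male", "male", "female", "male", "female"),
                levels = c("female", "male"))
pred  <- factor(c("female", "male", "female", "female", "male", "male"),
                levels = c("female", "male"))

# Rows are the true classes, columns are the predicted classes
conf_mat <- table(Truth = truth, Predicted = pred)
conf_mat
```

Naming the two arguments (Truth and Predicted) labels the rows and columns, which makes it harder to mix them up later.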

Because we are working with a categorical response variable, we are going to use classification metrics to evaluate our tree.

Question 7

What is the sensitivity of your classification tree? Let 0 = female and 1 = male.

Question 8

What is the specificity of your classification tree? Let 0 = female and 1 = male.

Question 9

What percent of penguins in the training data are incorrectly classified by your tree? In other words, what is the classification error rate (CER)?

Question 10

What percent of penguins in the training data are correctly classified by your tree? In other words, what is the accuracy?
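The four metrics above can all be read off a confusion matrix. Here is a minimal sketch using a made-up matrix (the counts are hypothetical, not penguin results), with rows as true classes, columns as predicted classes, and male treated as the 1 class.

```r
# Hypothetical confusion matrix: rows = truth, columns = prediction
conf_mat <- matrix(c(40, 10,
                      5, 45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(Truth = c("female", "male"),
                                   Predicted = c("female", "male")))

# Sensitivity: proportion of true males (1s) predicted to be male
sensitivity <- conf_mat["male", "male"] / sum(conf_mat["male", ])

# Specificity: proportion of true females (0s) predicted to be female
specificity <- conf_mat["female", "female"] / sum(conf_mat["female", ])

# Accuracy: proportion of all observations classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Classification error rate: proportion classified incorrectly
cer <- 1 - accuracy

c(sensitivity = sensitivity, specificity = specificity,
  accuracy = accuracy, cer = cer)
```

Notice that accuracy and CER always add up to 1, so once you have one you have the other.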

Now that we have assessed the predictive performance of a single tree, let’s consider growing a forest.

Building a Forest: One Tree at a Time

Before we use R’s built-in functions to help us build our forest, let’s make sure we take some time to understand what is going on under the hood when that code runs.

We know that forests are built by creating bootstrap samples from the original penguins data set. So, let’s start there.

Question 11

Create a single bootstrap sample from the penguins data. Use a random seed of 363663. Once you have your sample, grow a tree on that bootstrap sample. Call this tree tree2. Show the tree as the answer to this question.

Question 12

Which rows in the penguins data set are OOB for your bootstrap sample? Would the OOB rows necessarily be the same for a different bootstrap sample?
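The mechanics of a bootstrap sample and its OOB rows can be sketched on a tiny made-up data set. Everything here (the data frame toy, the seed, and the names boot_rows and oob_rows) is an assumption for illustration; the seed is deliberately not the one used in the lab.

```r
set.seed(1)  # arbitrary seed for illustration only

# A tiny made-up data set with 10 rows
n <- 10
toy <- data.frame(id = 1:n, x = rnorm(n))

# Bootstrap: draw n row numbers WITH replacement
boot_rows <- sample(1:n, size = n, replace = TRUE)
boot_sample <- toy[boot_rows, ]

# OOB rows: the original rows that were never drawn
oob_rows <- setdiff(1:n, boot_rows)

boot_rows
oob_rows
```

Because rows are drawn with replacement, some rows show up more than once in boot_sample and others not at all; the ones left out are the OOB rows.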

Now we have two trees, grown on different data sets.

Question 13

Create a confusion matrix for tree2 and show the matrix as part of your answer. Is this the same as the confusion matrix we got from tree1?

Once we can grow one tree, we could grow more than one!

Question 14

Create a for loop that grows and plots a bagged classification forest with 3 trees.
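As a starting point, here is a minimal sketch of the loop structure on a different data set. It assumes the rpart package for growing trees (the lab does not name one) and uses the built-in iris data; toy_forest, B, and the seed are illustrative choices.

```r
library(rpart)

set.seed(2)  # arbitrary seed for illustration only

B <- 3                        # number of trees in the toy forest
n <- nrow(iris)
toy_forest <- vector("list", B)

for (b in 1:B) {
  # Step 1: draw a bootstrap sample of the rows
  boot_rows <- sample(1:n, size = n, replace = TRUE)

  # Step 2: grow a tree on that bootstrap sample and store it
  toy_forest[[b]] <- rpart(Species ~ ., data = iris[boot_rows, ],
                           method = "class")

  # Step 3: plot the tree with its split labels
  plot(toy_forest[[b]], margin = 0.1)
  text(toy_forest[[b]])
}
```

Storing each tree in a list means we can go back later and get predictions from every tree in the forest.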

Growing a forest by hand like this is slow. It is much faster to let an R package grow the forest for us, so let’s do that next.

Building a Forest: Using the Package

In practice, we typically use the randomForest package to build our forests.

Question 15

Using the package, grow a forest with \(B = 1000\) trees and call it forest1. Use the random seed 363663. When you look at the output, you will see a line labeled OOB estimate of error rate. This is the OOB estimate of the CER. State this CER value as your answer to this question.

You will see in the output of the random forest that you also get a confusion matrix on the OOB observations! The rows are the true values and the columns are the predicted values.
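To see where these pieces live in the output, here is a minimal sketch on the built-in iris data rather than the penguins. The choices ntree = 500 and mtry = 4 (all four features tried at every split, which gives a bagged forest) and the names toy_rf and oob_cer are assumptions for illustration.

```r
library(randomForest)

set.seed(3)  # arbitrary seed for illustration only

# A factor response means randomForest grows a classification forest
toy_rf <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 4)

# Printing the forest shows the OOB estimate of error rate
# and the OOB confusion matrix
toy_rf

# The OOB confusion matrix (rows = truth, columns = prediction,
# plus a class.error column)
toy_rf$confusion

# The OOB error rate after the final tree has been added
oob_cer <- toy_rf$err.rate[toy_rf$ntree, "OOB"]
oob_cer
```

The err.rate matrix has one row per tree, so its last row gives the OOB CER of the full forest.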

Question 16

What is the OOB estimate of the sensitivity?

Interpreting a Forest

Okay, so we can now build a forest. What about interpretability? Our forest is composed of \(B=1000\) trees. How in the world can we decide which features are important in our 1000-tree forest?

Suppose we want to express how important the feature body mass is to our forest. To determine this, we grow our bagged forest using all the predictors we are considering, and we compute and store the OOB error rate (our estimate of the test CER) of this forest. Now, to see how important body mass was to our prediction process, we want to see what happens if body mass has no relationship with sex. How do we do that? We scramble up the order of the values in the body mass column! (The randomForest package does this scrambling tree by tree, on each tree's OOB rows, after the trees have been grown.)

What this does is essentially break the relationship between Y and X. If X was important in our model, our predictive metrics (test MSE/RMSE for a regression forest or test CER for a classification forest) should become worse once we do so.
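The scrambling idea can be sketched by hand. The code below is a simplified illustration, not how randomForest computes importance internally: it uses a single rpart tree, a held-out set in place of OOB rows, and the built-in iris data. The helper cer() and all object names are hypothetical.

```r
library(rpart)

set.seed(4)  # arbitrary seed for illustration only

# Split iris into rows used to grow the tree and held-out rows
n <- nrow(iris)
train_rows <- sample(1:n, size = 100)
train <- iris[train_rows, ]
held_out <- iris[-train_rows, ]

toy_tree <- rpart(Species ~ ., data = train, method = "class")

# Hypothetical helper: classification error rate of a model on a data set
cer <- function(model, data) {
  mean(predict(model, newdata = data, type = "class") != data$Species)
}

# CER on the held-out rows as they are
base_cer <- cer(toy_tree, held_out)

# Scramble one feature to break its relationship with the response
permuted <- held_out
permuted$Petal.Length <- sample(permuted$Petal.Length)

# CER after scrambling; a large increase suggests the feature mattered
perm_cer <- cer(toy_tree, permuted)
perm_cer - base_cer
```

Only the one column is shuffled; every other column, including the response, stays exactly where it was.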

To check the importance of each feature in a classification forest, use the following code:

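# Load the library to make the graph
suppressMessages(library(lattice))

# Plot the importance.
# Column 3 is MeanDecreaseAccuracy for a two-class response; it is only
# available if forest1 was grown with importance = TRUE.
barchart(sort(randomForest::importance(forest1)[, 3]),
         xlab = "Percent Increase in OOB CER",
         main = "Figure: Importance")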

Question 17

Which feature is most important in the forest? How can you tell?

Question 18

Which feature is least important in the forest? How can you tell?

Question 19

By how much (by what percent) does the OOB CER (remember, this means test CER) get worse if we permute the values of species?

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 November 16.