STA 363 Lab 8
Goal
In today’s lab, we are going to explore forests. In class, we did a regression forest. Today, we will focus on a classification forest. Most of the ideas are the same, but this will give us practice with both!
The Data
We are going to work with the same data set as the last lab. As a reminder, this data set contains information on n = 333 penguins. To load the data, you need to use the following three lines of code:
library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)
We have a client who is interested in building a model for Y = the sex of the penguin.
In addition to this response variable, we have information on 7 features.
- body_mass_g: the body mass of the penguin in grams.
- species: the type of penguin.
- island: the island where the penguin lives.
- bill_length_mm: the length of the penguin bill in millimeters.
- bill_depth_mm: the depth of the penguin bill in millimeters.
- flipper_length_mm: the flipper length of the penguin in millimeters.
- year: the year the penguin was measured.
Classification Tree
Before we start building a forest, let’s build a single tree. This is generally good practice, because it will help us to see what the trees that make up our forest might look like.
Question 1
Grow a classification tree for \(Y=\) sex using all the available features. Call this tree tree1. Show your tree as your answer to this question. You are welcome to use the standard stopping rules (meaning you do not have to set your own unless you would like to).
Question 2
What is the Gini Index of the first split of this tree?
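As a reminder of the arithmetic, here is a minimal sketch that assumes the Gini index of a split is the size-weighted average of the Gini impurity in its two child nodes. The counts are made up for illustration and are not from the penguins data, and the helper names gini_node and gini_split are hypothetical.

```r
# Gini impurity of one node, given the class counts in that node.
gini_node <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}

# Gini index of a split: child impurities weighted by child size.
gini_split <- function(left_counts, right_counts) {
  n_left <- sum(left_counts)
  n_right <- sum(right_counts)
  n <- n_left + n_right
  (n_left / n) * gini_node(left_counts) + (n_right / n) * gini_node(right_counts)
}

# Hypothetical split: left leaf has 40 female / 10 male,
# right leaf has 5 female / 45 male.
left <- c(female = 40, male = 10)
right <- c(female = 5, male = 45)

gini_left <- gini_node(left)           # 1 - (0.8^2 + 0.2^2) = 0.32
gini_right <- gini_node(right)         # 1 - (0.1^2 + 0.9^2) = 0.18
gini_total <- gini_split(left, right)  # 0.5 * 0.32 + 0.5 * 0.18 = 0.25
```

The same two helpers can be applied to the counts you read off the leaves of your own tree.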
Question 3
Using your tree, what sex would you predict for the first penguin in the data set?
Question 4
We can use the code predict(tree1) to get predicted probabilities from our tree. What is the predicted probability of being male and of being female for the 3rd penguin in the data set?
Question 5
We can use the code predict(tree1, type = "class") to get predicted values of sex for each penguin. What is the predicted sex of the 3rd penguin in the data set? Note: this code chooses the class associated with the highest predicted probability.
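To see what "chooses the class associated with the highest predicted probability" means, here is a small sketch on a made-up probability matrix. The numbers are hypothetical and only mimic the shape of the output you get from predict; they are not predictions for any real penguin.

```r
# Hypothetical predicted probabilities for three observations (rows sum to 1).
probs <- matrix(c(0.80, 0.20,
                  0.35, 0.65,
                  0.10, 0.90),
                ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("female", "male")))

# For each row, find the column with the largest probability
# and look up that column's class label.
pred_class <- colnames(probs)[apply(probs, 1, which.max)]
pred_class
```

Here the three rows are classified as female, male, and male, because those are the columns with the larger probability in each row.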
Question 6
Create a confusion matrix for tree1. Hint: You may need to refer back to your logistic regression lab for this.
Because we are working with a categorical response variable, this means that we are going to be using classification metrics to evaluate our tree.
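As a refresher on these metrics, the sketch below computes them from a confusion matrix built on made-up labels (0 = female, 1 = male). The ten observations are hypothetical, and the layout (rows = truth, columns = prediction) is an assumption of this example that you should match to your own matrix.

```r
# Hypothetical true and predicted labels for 10 observations.
truth <- factor(c(0, 0, 0, 0, 1, 1, 1, 1, 1, 1), levels = c(0, 1))
pred  <- factor(c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0), levels = c(0, 1))

# Rows are the true classes, columns are the predicted classes.
conf_mat <- table(truth = truth, predicted = pred)

TN <- conf_mat["0", "0"]  # true 0, predicted 0
FP <- conf_mat["0", "1"]  # true 0, predicted 1
FN <- conf_mat["1", "0"]  # true 1, predicted 0
TP <- conf_mat["1", "1"]  # true 1, predicted 1

sensitivity <- TP / (TP + FN)     # share of true 1s predicted as 1: 4/6
specificity <- TN / (TN + FP)     # share of true 0s predicted as 0: 3/4
cer <- (FP + FN) / sum(conf_mat)  # classification error rate: 3/10
accuracy <- 1 - cer               # 7/10
```

If your confusion matrix has predictions in the rows instead, swap the two indices when extracting FP and FN.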
Question 7
What is the sensitivity of your classification tree? Let 0 = female and 1 = male.
Question 8
What is the specificity of your classification tree? Let 0 = female and 1 = male.
Question 9
What percent of penguins in the training data are incorrectly classified by your tree? In other words, what is the classification error rate (CER)?
Question 10
What percent of penguins in the training data are correctly classified by your tree? In other words, what is the accuracy?
Now that we have assessed the predictive performance of a single tree, let’s consider growing a forest.
Building a Forest: One Tree at a time
Before we use the built-in R code to help us build our forest, let’s take some time to understand what is going on under the hood when that code runs.
We know that forests are built by creating bootstrap samples from the original penguins data set. So, let’s start there.
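As a warm-up, here is a minimal sketch of a bootstrap sample and its out-of-bag (OOB) rows on a tiny made-up data frame. The data frame toy and the seed are assumptions for illustration only; the lab questions use the penguins data and a different seed.

```r
# Toy data set with 10 rows (a stand-in for any data frame).
toy <- data.frame(id = 1:10,
                  x = c(2.1, 3.4, 1.8, 5.0, 4.2, 3.9, 2.7, 4.8, 1.5, 3.1))
n <- nrow(toy)

set.seed(1)  # arbitrary seed, chosen only for this illustration

# A bootstrap sample draws n row indices WITH replacement,
# so some rows appear more than once and others not at all.
boot_rows <- sample(1:n, size = n, replace = TRUE)
boot_sample <- toy[boot_rows, ]

# Out-of-bag (OOB) rows are the original rows that were never drawn.
oob_rows <- setdiff(1:n, boot_rows)
```

Changing the seed changes boot_rows, and therefore usually changes which rows end up in oob_rows.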
Question 11
Create a single bootstrap sample from the penguins data. Use a random seed of 363663. Once you have your sample, grow a tree on that bootstrap sample. Call this tree tree2. Show the tree as the answer to this question.
Question 12
Which rows in the penguins data set are OOB for your bootstrap sample? Would the OOB rows necessarily be the same for a different bootstrap sample?
Now we have two trees, grown on different data sets.
Question 13
Create a confusion matrix for tree2 and show the matrix as part of your answer. Is this the same as the confusion matrix we got from tree1?
Once we can grow one tree, we could grow more than one!
Question 14
Create a for loop that grows and plots a bagged classification forest with 3 trees.
In practice, it is faster to use some R packages to grow a forest. Let’s do that next.
Building a Forest: Using the Package
In practice, we typically use the randomForest package to build our forests.
Question 15
Using the package, grow a forest with \(B = 1000\) trees and call it forest1. Use the random seed 363663. When you look at the output, you will see OOB estimate of error rate. This is the OOB estimate of the CER. State this CER value as your answer to this question.
You will see in the output of the random forest that you also get a confusion matrix on the OOB observations! The rows are the true values and the columns are the predicted values.
Question 16
What is the OOB estimate of the sensitivity?
Interpreting a Forest
Okay, so we can now build a forest. What about interpretability? Our forest is composed of \(B=1000\) trees. How in the world can we decide which features are important in our 1000 tree forest?
Suppose we are attempting to express how important the feature body mass is to our tree building process. To determine this, we grow our bagged forest, using all predictors we are considering. We compute and store the OOB error rate (the test CER) of this forest. Now, to see how important body mass was to our prediction process, we want to see what happens if we assume that body mass has no relationship with sex. How do we do that? We scramble up the order of the rows in the body mass column in the data before we grow our trees!
What this does is essentially break the relationship between Y and X. If X was important in our model, our predictive metrics (test MSE/RMSE for a regression forest or test CER for a classification forest) should become worse once we do so.
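The idea of breaking the relationship between Y and X can be sketched without a forest at all. The example below is a simplified illustration on made-up data: it uses a fixed, hypothetical classification rule (classify) in place of a fitted forest, and it compares the error rate before and after shuffling the feature. It is not how the randomForest package computes importance internally.

```r
set.seed(2)  # arbitrary seed for this illustration
n <- 200

# Toy data: y is completely determined by x, so x is a very important feature.
x <- rnorm(n)
y <- ifelse(x > 0, "male", "female")

# Hypothetical stand-in for a fitted model: classify using the sign of x.
classify <- function(x) ifelse(x > 0, "male", "female")

# Error rate when x is left intact.
cer_original <- mean(classify(x) != y)

# Permute x: the same values, shuffled across rows, so x no longer lines up with y.
x_permuted <- sample(x)
cer_permuted <- mean(classify(x_permuted) != y)

# How much worse the error rate gets once the X-Y relationship is broken.
cer_increase <- cer_permuted - cer_original
```

Because x fully determines y here, the error rate is 0 before shuffling and jumps well above 0 afterwards. For a feature unrelated to y, shuffling would change the error rate very little.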
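To check feature importance in a classification forest, use the following code. It assumes forest1 was grown with importance = TRUE, so that column 3 of the importance matrix holds the permutation-based importance for our two-class response: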
# Load the library to make the graph
suppressMessages(library(lattice))
# Plot the importance
barchart(sort(randomForest::importance(forest1)[, 3]),
         xlab = "Percent Increase in OOB CER",
         main = "Figure: Importance")
Question 17
Which feature is most important in the forest? How can you tell?
Question 18
Which feature is least important in the forest? How can you tell?
Question 19
By how much (by what percent) does the OOB CER (remember, this means test CER) get worse if we permute the values of species?
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 November 16.