STA 363 Lab 8
Goal
In today’s lab, we are going to explore forests. We are going to work with the same data set as the last lab. As a reminder, this data set contains information on n = 333 penguins. To load the data, you need to use the following three lines of code:
We have information on:
- body_mass_g: the body mass of the penguin in grams.
- sex: the sex of the penguin.
- species: the species of the penguin.
- island: the island where the penguin lives.
- bill_length_mm: the length of the penguin bill in millimeters.
- bill_depth_mm: the depth of the penguin bill in millimeters.
- flipper_length_mm: the flipper length of the penguin in millimeters.
- year: the year the penguin was measured.
Regression Forest
There are two different classes of forests. We use regression forests when our response variable is numeric and classification forests when our response variable is categorical.
We will start with a regression forest predicting \(Y\) = body mass.
Question 1
Grow a regression tree on the penguins data. Call this tree1, and show a plot of your tree. Use the default stopping rules.
Before we use the default R codes to help us build our forest, let’s make sure we take some time to understand what is going on under the hood when that code runs.
Building a Forest: One Tree at a time
We know that forests are built by creating bootstrap samples from the original penguins data set. So, let’s start there.
Question 2
Create a single bootstrap sample from the penguins data. Use a random seed of 363663. Once you have your sample, grow a tree on that bootstrap sample and call it tree2. Show a plot of the tree as the answer to this question.
Hint: The sample command will help us do this!
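As a minimal sketch of the hint, the code below draws one bootstrap sample from a small made-up data frame. The data frame toy and the seed are assumptions for illustration only and are not the penguins data. The one idea it relies on is that a bootstrap sample draws n row numbers from 1 to n with replacement.

```r
# Hypothetical toy data; stands in for any data frame with n rows
toy <- data.frame(id = 1:10, y = seq(2, 20, by = 2))
n <- nrow(toy)

# Arbitrary seed so the sketch is reproducible
set.seed(1)

# Draw n row numbers WITH replacement
boot_rows <- sample(1:n, n, replace = TRUE)

# Use those row numbers to build the bootstrap sample
boot_sample <- toy[boot_rows, ]

# The bootstrap sample has the same number of rows as the original
nrow(boot_sample)
```

Because rows are drawn with replacement, some rows of toy typically appear more than once in boot_sample and others do not appear at all.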
Question 3
Are the tree you grew in Question 1 (tree1) and the tree you grew in Question 2 (tree2) the same?
This is how we create a forest! We create multiple bootstrap samples of the original data set. We then grow a tree on each of the bootstrap samples. Each of these trees can be different.
What is the purpose of this? We only have one sample from the population, and we use it to build one tree. However, what if we had a different sample from the population? What would our tree look like then? If we are able to look at trees from many different samples in the population, we will likely be able to predict \(Y\) better than if we just used one sample.
However, we only have one sample. With bootstrapping, we can create many samples from it to approximate repeated sampling from the population.
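The idea of growing one tree per bootstrap sample and combining them can be sketched by hand. This is only an illustration under stated assumptions: the toy data, the choice of B = 25, and the use of the rpart package for the individual trees are all hypothetical choices. It also leaves out the random selection of features at each split that a random forest adds on top of bootstrapping, so strictly speaking it shows bagging.

```r
library(rpart)  # rpart is a recommended package that ships with standard R installs

# Hypothetical toy data with one numeric feature
set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- 3 * toy$x + rnorm(200)

B <- 25          # number of bootstrap samples / trees (arbitrary)
n <- nrow(toy)

# Grow one tree on each bootstrap sample
trees <- lapply(1:B, function(b) {
  boot_rows <- sample(1:n, n, replace = TRUE)
  rpart(y ~ x, data = toy[boot_rows, ])
})

# Predict with every tree (one column per tree), then average across trees
all_preds <- sapply(trees, predict, newdata = toy)
bagged_pred <- rowMeans(all_preds)
```

Each column of all_preds comes from a different tree, and the columns generally differ because each tree saw a different bootstrap sample. The averaged prediction is what the collection of trees reports.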
Question 4
How many rows in the penguins data set are OOB for your bootstrap sample? In other words, how many rows are OOB for tree2?
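One way to find the out-of-bag (OOB) rows is to compare the full set of row numbers to the row numbers that were drawn. The sketch below uses a made-up n = 10 and an arbitrary seed rather than the penguins data; both are assumptions for illustration.

```r
n <- 10        # hypothetical number of rows
set.seed(1)    # arbitrary seed

# Row numbers drawn for the bootstrap sample
boot_rows <- sample(1:n, n, replace = TRUE)

# Rows that were never drawn are out of bag
oob_rows <- setdiff(1:n, boot_rows)

# Number of OOB rows
length(oob_rows)
```

For large n, the chance that a given row is left out of a bootstrap sample is (1 - 1/n)^n, which is roughly 0.37, so about a third of the rows are typically OOB for any one tree.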
In practice, of course, we can use R packages to grow a forest. Let’s do that next.
Building a Forest: Using the Package
In class, we learned about the randomForest package to build forests. There are other packages, but let’s start with this one because the syntax is nice.
# Load the library
library(randomForest)
# Set a seed
set.seed(363663)
# Grow a forest
# (template: replace y with the response, and fill in data and ntree)
forest1 <- randomForest(y ~ ., data = ,
                        ntree = ,
                        importance = TRUE)
Question 5
Using the package, grow a forest with \(B = 1000\) trees and call it forest1. Use the random seed 363663. What is the estimated test RMSE of your forest?
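As a hedged sketch of where such an estimate can come from, the code below fits a small forest to the built-in mtcars data (not the penguins data) and reads off the OOB error. It assumes that "estimated test RMSE" refers to the OOB estimate. The object name toy_forest, the seed, and ntree = 200 are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(1)  # arbitrary seed
toy_forest <- randomForest(mpg ~ ., data = mtcars, ntree = 200)

# For a regression forest, mse holds the OOB MSE after 1, 2, ..., ntree trees;
# the last entry uses the whole forest
oob_mse <- toy_forest$mse[toy_forest$ntree]
oob_rmse <- sqrt(oob_mse)

# A second route: predict() without newdata returns the OOB predictions
oob_pred <- predict(toy_forest)
oob_rmse_check <- sqrt(mean((mtcars$mpg - oob_pred)^2))
```

Because every row is OOB for some of the trees, the OOB predictions act like predictions on data the trees did not see, which is why the OOB RMSE is used as an estimate of test RMSE.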
Interpreting a Forest
Okay, so we can now build a forest. What about interpretability? Our forest is composed of \(B=1000\) trees. How in the world can we decide how the features are related to \(Y\) and what features matter in our 1000 tree forest???
With forests, we use something called feature importance as a tool to describe how much different features are contributing to the fit of the forest.
Let’s consider the feature flipper length. Suppose flipper length is not at all important to the forest, meaning this feature is not related to body mass. Under this assumption, we could completely change the ordering of the flipper length variable in the data set, and the forest should not be impacted.
In other words, I should be able to put the flipper length of the 3rd penguin in row 5, and the flipper length of the 5th penguin in row 1, and so on, and the forest should not be impacted.
This suggests that one way to see how important a feature is to a forest is to do just that. We fit the forest with the original version of the feature, and then we fit a second forest where we have permuted (scrambled up) the ordering of the feature. We can then compare the fit of the forests to see if changing the ordering made a difference! If a feature is important in our forest, our predictive metrics (test MSE for a regression forest or test CER for a classification forest) should become worse once we permute the feature.
To permute the ordering of flipper length, we can use:
# Create a copy of the data set
penguinsPermute <- penguins
# Permute the feature
penguinsPermute$flipper_length_mm <- penguins$flipper_length_mm[sample(1:333, 333)]
We then grow a tree on this new data set penguinsPermute and compute the percent increase in the MSE. This is the importance of the feature.
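The permute-and-compare idea can be sketched with a single tree on made-up data. Everything here is an assumption for illustration: the toy data, the helper train_mse, the seed, and the use of rpart for the tree. The response is built to depend on x1 only, so permuting x1 should noticeably hurt the fit.

```r
library(rpart)  # rpart is a recommended package that ships with standard R installs

# Hypothetical toy data: y depends on x1 but not on x2
set.seed(1)
toy <- data.frame(x1 = runif(200, 0, 10), x2 = runif(200, 0, 10))
toy$y <- 3 * toy$x1 + rnorm(200)

# Hypothetical helper: grow a tree on a data set and return its training MSE
train_mse <- function(data) {
  tree <- rpart(y ~ ., data = data)
  mean((data$y - predict(tree, newdata = data))^2)
}

# Training MSE with the original feature
mse_original <- train_mse(toy)

# Copy the data and permute (scramble) x1
toy_permute <- toy
toy_permute$x1 <- toy$x1[sample(1:nrow(toy), nrow(toy))]

# Training MSE with the permuted feature
mse_permuted <- train_mse(toy_permute)

# Percent increase in training MSE
pct_increase <- 100 * (mse_permuted - mse_original) / mse_original
```

Repeating the same steps with x2 permuted instead would be expected to change the training MSE very little, since x2 was not used to build y.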
Question 6
Create a data set called penguinsPermute where you have permuted the values of the feature sex. Based on comparing a tree grown on the original data set to a tree grown on this permuted data set, what is the training importance of sex?
In other words, by what percent does the training MSE increase if we permute the feature sex?
We did this using just one tree to see the idea of permuting a feature, but we don’t typically use importance with a single tree. Instead, we use it for the whole forest.
To check the importance across an entire forest, use the following code:
# Print out the importance
knitr::kable( randomForest::importance(forest1)[,1], col.names = "% Inc in MSE" )
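The value that you see in the column (which the package labels %IncMSE) reflects the increase in the OOB MSE, our estimate of the test MSE, when each feature is permuted. Note that by default the importance function scales this increase by a measure of its variability across the trees, so the values are best used to rank features: the larger the value, the more important the feature.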
Question 7
Which feature has the highest importance? The lowest?
We can also graph the importance, which is particularly helpful when you have a lot of features.
# Load the library to make the graph
suppressMessages(library(lattice))
# Plot the importance
barchart(sort(randomForest::importance(forest1)[, 1]),
         xlab = "Percent Increase in OOB MSE",
         main = "Figure 1: Importance")
Question 8
Change the color of the bars in the graph to be a color other than cyan.
If we only want to show the top 3 features in terms of importance, we can change the graph to:
barchart(sort(randomForest::importance(forest1)[, 1])[-c(1:4)],
         xlab = "Percent Increase in OOB MSE",
         main = "Figure 1: Importance")
How did I know to remove 1:4? Well, there are 7 features in total. If I only want the top 3, I need to remove the lowest 4.
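The indexing trick is easier to see on a small named vector. The importance values below are made up for illustration and do not come from forest1.

```r
# Hypothetical importance values for 7 features (made-up numbers)
imp <- c(a = 12, b = 3, c = 25, d = 8, e = 1, f = 17, g = 5)

# sort() orders from lowest to highest
sorted_imp <- sort(imp)

# Dropping the first 4 (the lowest 4) leaves the top 3
top3 <- sorted_imp[-c(1:4)]
names(top3)
# "a" "f" "c"

# An equivalent way to keep the last 3 entries
top3_tail <- tail(sorted_imp, 3)
```

In general, with p features and a request for the top k, the number of entries to drop from the sorted vector is p - k.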
Question 9
Adapt the plot in Question 8 to include only the top 5 features in terms of importance.
Partial Plots
Another option for interpreting how features in a forest are related to a response involves what are called partial plots.
These plots allow us to see the impact of each variable on the predictions made by our forest. They can show us the shape of the relationship, as well as the direction.
For a regression forest, the partial plot answers the question: if we leave all other features as they are, but we change the value of a certain feature, how does the predicted value of \(Y\) change on average?
For a classification forest, the partial plot answers the question: if we leave all other features as they are, but we change the value of a certain feature, how do the predicted log odds that \(Y = 1\) change on average?
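To make the regression case concrete, here is a sketch of computing a partial plot by hand. It is an illustration under stated assumptions, not the package's own code: the toy data are made up, a single rpart tree stands in for the forest, and the grid of 20 values is arbitrary. For each grid value, the feature is set to that value in every row, the other features keep their observed values, and the predictions are averaged.

```r
library(rpart)  # rpart is a recommended package that ships with standard R installs

# Hypothetical toy data
set.seed(1)
toy <- data.frame(x1 = runif(200, 0, 10), x2 = runif(200, 0, 10))
toy$y <- 3 * toy$x1 + toy$x2 + rnorm(200)

# A single tree stands in for the forest in this sketch
fit <- rpart(y ~ ., data = toy)

# Values of x1 at which to evaluate the partial relationship
grid <- seq(min(toy$x1), max(toy$x1), length.out = 20)

# For each grid value: set x1 to that value in every row,
# leave x2 as observed, and average the predictions
partial <- sapply(grid, function(v) {
  temp <- toy
  temp$x1 <- v
  mean(predict(fit, newdata = temp))
})

# Plot the average prediction against the grid values
plot(grid, partial, type = "l",
     xlab = "x1", ylab = "Average predicted y",
     main = "Partial plot by hand: x1")
```

The curve rises as x1 increases because y was built to increase with x1. Reading a partial plot from a forest works the same way: look at the direction and shape of the curve.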
The Code
To create a partial plot, we first must ensure our data set is a data frame.
Let’s say we want to find the partial plot for flipper length. To create this plot, we use the following code:
partialPlot(forest1, penguins,
            x.var = "flipper_length_mm",
            xlab = "Flipper Length (in mm)",
            ylab = "Body Mass (in grams)",
            main = "Partial Plot: Flipper Length")
Question 10
Describe the relationship you see between flipper length and body mass. In other words, as flipper length increases, in general what happens to body mass?
We can also create partial plots for categorical features.
partialPlot(forest1, penguins,
            x.var = "species",
            xlab = "Penguin Species",
            ylab = "Body Mass (in grams)",
            main = "Partial Plot: Species")
Question 11
Describe the relationship you see between species and body mass.
Question 12
Create a partial plot for island vs. body mass and describe the relationship.
Question 13
Create a partial plot for bill length vs. body mass and describe the relationship.
Turning in your assignment
When your Markdown document is complete, do a final knit and make sure everything compiles. You will then submit your document on Canvas. You must submit a PDF or HTML document; no other formats will be accepted. Look through the final document to make sure everything has knit correctly.
References
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2025 April 4.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.