STA 363 Lab 6
Goal
Our goal today is to start exploring decision trees in R. There are two different kinds of decision trees: regression trees and classification trees. Regression trees are used when \(Y\) is numeric and classification trees are used when \(Y\) is categorical. We will do regression trees today and we will start classification trees next class.
The Data
We are going to work with our familiar penguins data set, which has information on n = 333 penguins. To load the data, you need to use the following four lines of code:
library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)
penguins <- data.frame(penguins)
We have a client who is interested in building a model for Y = the body mass of a penguin in grams. In addition to this response variable, we have information on 7 features.
- species - the type of penguin.
- island - the island where the penguin lives.
- bill_length_mm - the length of the penguin bill in millimeters.
- bill_depth_mm - the depth of the penguin bill in millimeters.
- flipper_length_mm - the flipper length of the penguin in millimeters.
- sex - the biological sex of the penguin.
- year - the year the penguin was measured.
The client is particularly interested in understanding the relationship between different values of the features and higher body mass.
Question 1
Is this an association task or a prediction task?
We are going to use a decision tree model to approach this task today.
Regression Trees
Regression trees are supervised models (like all the models we do in this course) with a response variable Y that is numeric. We will use classification trees if Y is categorical, but we won’t get to those until next week.
When we build regression trees, the first step is data cleaning, just like it is for every other model. We still need to be on the lookout for outliers, features that are impossible to use, and other traits of the data that we need to be aware of. The EDA step, however, is a little less intensive with regression trees than with a linear model. Regression trees don’t assume that the relationship between Y and each X can be represented by a line, a curve, etc., so we don’t need to check for shape the way we are used to doing for LSLR models. The tree model is not built that way.
With that in mind, let’s start!
Building a Regression Tree with a Categorical Feature
The first tree we are going to build will use only one feature: species. Why only one feature? Well, because it is much easier to explore how the tree grows if we limit ourselves to only one feature to start with. We will add more features later.
All trees begin with all of the rows of the data in one giant cluster, called the root.
Question 2
In the root, what is the predicted value \(\hat{y}_i\) for all penguins?
Question 3
What are the RSS and RMSE with all the data in the root node (i.e., using no features)?
Why do we care about this? Well, knowing the information about the root gives us an idea of the starting point of our tree. We build a tree by minimizing the RSS, so it helps to know what the RSS in the root node is so we can track the improvement as the tree grows.
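As a minimal sketch of these root-node quantities, the code below uses a small made-up response vector, y_toy (hypothetical values, not the penguin data). It assumes RMSE is computed as the square root of RSS divided by n; if we use a different denominator in class, adjust accordingly.

```r
# Hypothetical toy response values (not the penguin data)
y_toy <- c(10, 12, 14, 20, 24)

# With no features, every row in the root gets the same prediction: the mean
root_pred <- mean(y_toy)

# Residual sum of squares and root mean squared error in the root
root_rss  <- sum((y_toy - root_pred)^2)
root_rmse <- sqrt(root_rss / length(y_toy))

root_pred
root_rss
root_rmse
```

The same three lines work for any numeric response, so they can serve as the baseline against which later splits are compared.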
Now that we have explored the root node, it is time to start growing the tree! Remember, we only have one feature: species.
Question 4
With species as a feature, what are all the possible splitting rules we could use to divide the root into two leaves? Note: Order doesn’t matter.
Now that we know all the possible splitting rules we can choose, how do we decide which splitting rule to use to grow our tree? We choose the splitting rule that gives us the smallest RSS after the split. This means that in order to decide which splitting rule from Question 4 to choose, we need to actually compute the RSS we would get if we used each splitting rule.
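To make the idea of "compute the RSS for each splitting rule, then keep the smallest" concrete, here is a sketch on made-up data. The data frame toy, its columns, and the helper functions leaf_rss() and split_rss() are all hypothetical names introduced for illustration; this is one way to organize the computation, not the only one.

```r
# Hypothetical toy data: a numeric response and a three-level categorical feature
toy <- data.frame(
  y     = c(10, 12, 14, 20, 24, 30),
  group = c("a", "a", "b", "b", "c", "c")
)

# RSS of a single leaf: squared deviations from the leaf mean
leaf_rss <- function(y) sum((y - mean(y))^2)

# RSS after splitting the rows into "in_left" versus everything else
split_rss <- function(y, in_left) leaf_rss(y[in_left]) + leaf_rss(y[!in_left])

# Each rule here puts one level in its own leaf and the remaining levels in the other
rss_by_rule <- sapply(unique(toy$group), function(level) {
  split_rss(toy$y, toy$group == level)
})

rss_by_rule
names(which.min(rss_by_rule))  # the rule with the smallest RSS
```

The RSS after a split is just the sum of the RSS values inside the two leaves, which is why one small helper function is enough.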
Question 5
Find the training RSS that we would get if we built a tree with one split using each of the splitting rules from Question 4. Show your code and the RSS for each splitting rule. Hint: Write out the steps in words before you start to code. This will help you figure out what each line of code needs to do. You do not need a for loop, though you can use one if you wish.
Question 6
Based on this, which splitting rule would you recommend we use to split the root node into two leaves? What is the percent reduction in RSS you get with this split (comparing to the root RSS)?
Building a Tree in R
Now that we have determined for ourselves which splitting rule we should use, let’s verify it using R. The function that we need is called rpart(). To use this function and to visualize its results, we need three libraries.
# Grows the tree
library(rpart)
# Allows us to visualize the tree
library(rattle)
library(rpart.plot)
To build a tree with one feature (species) and only one split, we use the following code:
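A minimal sketch of such a call, written to match the description that follows. It assumes the response column is body_mass_g (its name in palmerpenguins) and stores the result as tree1, a name chosen here for illustration.

```r
library(rpart)
library(palmerpenguins)

# Same data preparation as at the start of the lab
data(penguins)
penguins <- data.frame(na.omit(penguins))

# Regression tree for body mass with species as the only feature,
# limited to a single split
tree1 <- rpart(body_mass_g ~ species, data = penguins,
               method = "anova", maxdepth = 1)
tree1
```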
You will notice that the structure of this code is similar to the code for other regression models. We specify the Y variable, the feature(s), and the data. We use method = "anova" to tell R that we are fitting a regression tree rather than a classification tree.
The maxdepth=1 part of the code specifies that right now, we are only allowing one split in the tree. If you don’t put anything in the maxdepth part of a tree code, the tree grows until it hits the built in stopping rules. More on that later!
Once your tree is built (meaning you have run the line of code above), it helps to visualize the results. One of the biggest advantages of trees is that they are highly interpretable, meaning it is relatively straightforward to use trees to explain relationships in the data.
To visualize our tree, we use one of the following:
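A sketch of two plotting calls, assuming the fitted tree is stored as tree1 (a name chosen here for illustration) and using a placeholder caption. The first uses fancyRpartPlot() from rattle and the second uses rpart.plot() from rpart.plot.

```r
library(rpart)
library(rattle)
library(rpart.plot)
library(palmerpenguins)

# Refit the one-split tree so this chunk runs on its own
data(penguins)
penguins <- data.frame(na.omit(penguins))
tree1 <- rpart(body_mass_g ~ species, data = penguins,
               method = "anova", maxdepth = 1)

# Option 1: rattle's plot, with a caption placed below the tree
fancyRpartPlot(tree1, sub = "Figure 1: One-split tree using species")

# Option 2: the rpart.plot version of the same tree
rpart.plot(tree1)
```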
The sub= part of the code is where you add a title to your tree. It is much easier in a tree to add a title at the bottom rather than at the top - there is more space.
Look at your tree and verify that your answer to Question 6 matches the tree you have drawn!
Now that we have drawn the tree, let’s use it.
Question 7
What percent of penguins are in leaf 2?
Question 8
What body mass does the tree predict for all Chinstrap penguins?
Now, with only this one binary feature, it turns out that the tree can be compared to a least-squares linear regression model:
\[BodyMass_i = \beta_0 + \beta_1 GentooPenguin_i + \epsilon_i, \quad \epsilon_i \sim N(0,\sigma)\]
Let’s see why.
Question 9
Fit a least squares linear regression model for body mass, using whether or not a penguin is a Gentoo penguin as a feature. Call this model LSLR1. Write out the fitted regression line. Hint: To specify an indicator variable for a specific level of a categorical variable, you can use, for instance, (species=="Adelie"). The ( ) are important.
Question 10
Based on the regression model, what body mass would you predict for a Chinstrap penguin? Keeping in mind that in the visualization our trees round to the nearest whole number, how do these predictions compare to those you made from the tree?
What do we notice? Well, with a single binary feature, a tree yields one predicted value for each level of the feature. LSLR models do exactly the same thing. That prediction, for both models, is the mean of the response variable for all the rows at each level of the feature. Translation: The models may look different, but they are in fact the same at this stage.
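Here is a small sketch of why the two models agree, using made-up data rather than the penguins. The data frame toy and its indicator column is_c are hypothetical. The leaf means are what a one-split tree on the indicator would predict, and the LSLR predictions come from lm() with the same indicator as its only feature.

```r
# Hypothetical toy data: a numeric response and a TRUE/FALSE indicator
toy <- data.frame(
  y    = c(10, 12, 14, 20, 24, 30),
  is_c = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# What a one-split tree on is_c would predict: the mean of y in each leaf
leaf_means <- tapply(toy$y, toy$is_c, mean)

# LSLR with the same indicator as its only feature
toy_lm   <- lm(y ~ is_c, data = toy)
lm_preds <- predict(toy_lm, newdata = data.frame(is_c = c(FALSE, TRUE)))

leaf_means  # mean of y when is_c is FALSE, then when it is TRUE
lm_preds    # intercept, then intercept + slope
```

The intercept is the mean of the FALSE group and the slope is the difference between the two group means, so the two sets of predictions match.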
Okay, then what is the point of a tree?? When we start adding in multiple features, or when the features are numeric, we will find that trees and LSLR models are not the same.
Regression Trees: Numeric Feature
Now that we have explored using a categorical feature, let’s switch to using a numeric feature to predict \(Y\).
Before we use the built in code, let’s make sure we can verify ourselves what splitting rule we should use.
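With a numeric feature, a splitting rule is a cutoff: rows below the cutoff go to one leaf and the rest go to the other. The sketch below shows one way to search over cutoffs on made-up data. The vectors x_toy and y_toy and the helper leaf_rss() are hypothetical, and using midpoints between consecutive feature values as the candidate cutoffs is an assumption made here for illustration.

```r
# Hypothetical toy data: one numeric feature and a numeric response
x_toy <- c(1, 2, 3, 4, 5, 6)
y_toy <- c(10, 12, 14, 20, 24, 30)

# RSS of a single leaf: squared deviations from the leaf mean
leaf_rss <- function(y) sum((y - mean(y))^2)

# Candidate cutoffs: midpoints between consecutive distinct feature values
x_sorted <- sort(unique(x_toy))
cutoffs  <- (head(x_sorted, -1) + tail(x_sorted, -1)) / 2

# Try each rule "x < cutoff" and record the RSS of the two resulting leaves
rss <- numeric(length(cutoffs))
for (i in seq_along(cutoffs)) {
  left   <- x_toy < cutoffs[i]
  rss[i] <- leaf_rss(y_toy[left]) + leaf_rss(y_toy[!left])
}

data.frame(cutoff = cutoffs, rss = rss)
cutoffs[which.min(rss)]  # cutoff with the smallest RSS
```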
Question 11
Find the training RSS that we would get if we built a tree with one split using each of the possible splitting rules on flipper length. Show your code and the RSS for each splitting rule. Which splitting rule should we use? Hint: A for loop is required here. Write out the steps in words before you start to code. This will help you figure out what each line of code needs to do.
Once we know which splitting rule we need, we can verify it using the built in R commands.
Question 12
Create a tree using only flipper length as a feature (this means you should not include species in this tree). Use the maxdepth = 1 stopping criterion to make sure that for the moment, the tree only has one split. Call the tree tree2, and show a visualization of your tree as your answer to this question (Figure 2).
Question 13
Based on your tree, what body mass would you predict for a penguin with a flipper length of 210 mm?
Now, we saw in a previous example that using a tree with a single split and a single categorical feature was the same as fitting an LSLR model with a single categorical feature. Let’s see if the same thing holds now that we have a numeric feature.
Question 14
Fit a least squares linear regression model for body mass, using flipper length as the only feature. Call this model LSLR2 and write out the fitted regression line.
Question 15
Based on your LSLR line, what body mass would you predict for a penguin with a flipper length of 210 mm? How does this compare to what you get from a tree?
This time, you will see that our values are not identical. We are still using one split - why the change? This is because LSLR assumes the relationship between body mass (Y) and flipper length (X) is linear, so the prediction is based on the slope of a line. The tree model assumes only that body mass differs depending on whether X is above or below a particular cutoff value. With one split, there are exactly two possible predictions, one for each side of the cutoff. This is very different from an LSLR line, which makes a different prediction for each distinct value of the flipper length.
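The contrast can be sketched on simulated data (hypothetical, not the penguins). The data frame toy, the seed, and the values in new_x are all made up for illustration. We fit a one-split tree and an LSLR line to the same numeric feature and compare their predictions at a few new values.

```r
library(rpart)

# Simulated toy data (hypothetical): y rises roughly linearly with x
set.seed(363)
toy <- data.frame(x = seq(1, 100))
toy$y <- 50 + 2 * toy$x + rnorm(100, sd = 10)

# One-split regression tree and an LSLR line on the same feature
toy_tree <- rpart(y ~ x, data = toy, method = "anova", maxdepth = 1)
toy_lm   <- lm(y ~ x, data = toy)

# Predictions at a few new feature values
new_x <- data.frame(x = c(10, 40, 60, 90))
predict(toy_tree, newdata = new_x)  # one value per leaf, so values repeat
predict(toy_lm,   newdata = new_x)  # a different value for each x
```

The tree's predictions form a step function with a single jump at the cutoff, while the LSLR predictions change steadily with x.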
So…are trees always better? Or is LSLR always better? Neither. Remember, there is almost never an “always” situation in statistical learning. As we have been doing all semester, we have to think through what our goal of modeling is, what properties our data have, what properties we want our model to have, and then compare a few plausible options!
Question 16
Find the training RMSE for your tree and for your LSLR model with flipper length as a feature. Based on training metrics, which model is a stronger fit to the sample data?
Adding More Splits
So far, we have limited ourselves to trees with one split. Now, let’s let the tree grow a bit.
Question 17
Build a tree using flipper length as a feature, but this time allow the tree to grow up to 3 levels deep (maxdepth=3). Plot your tree and label your image Figure 3.
Fitting the Tree with multiple features
Now we have explored two trees, but both included only one feature. One of the great things about trees is that they can very easily handle a lot of features in the data. Let’s now grow a tree using them all!
Question 18
Create a tree using all possible features in the data. Do not restrict the number of splits. Call the tree tree4, and show a visualization of your tree as your answer.
Question 19
Which features does your tree use?
Question 20
In Tree 4, which feature was able to give us the largest reduction in training RSS in one split?
Question 21
Based on Tree 4, what is the predicted body mass for the 3rd penguin in the data set?
We seem to have specified no stopping rules…but we did. R has default stopping rules built in that are designed to protect your computer. Let’s take a look at these.
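As a sketch of how stopping rules are passed to rpart(), the code below lists the settings that rpart.control() manages and then fits two trees to simulated data (hypothetical, not the penguins): one with the defaults and one with custom values. The specific values maxdepth = 2 and minbucket = 25 are arbitrary choices for illustration.

```r
library(rpart)

# Names of the stopping-rule settings that rpart.control() manages
names(rpart.control())

# Simulated toy data (hypothetical), used only to show the syntax
set.seed(363)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 5 * toy$x1 + 3 * (toy$x2 > 0.5) + rnorm(200, sd = 0.5)

# Default stopping rules
tree_default <- rpart(y ~ x1 + x2, data = toy, method = "anova")

# Custom stopping rules, passed through the control argument
tree_small <- rpart(y ~ x1 + x2, data = toy, method = "anova",
                    control = rpart.control(maxdepth = 2, minbucket = 25))

# Number of leaves in each tree
sum(tree_default$frame$var == "<leaf>")
sum(tree_small$frame$var == "<leaf>")
```

Settings can also be given directly in the rpart() call, as we did earlier with maxdepth; wrapping them in rpart.control() keeps them together when there are several.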
Question 22
Type the code ?rpart.control into a chunk, and hit play, and then put a # in front of the code. What will pop up is the R help page. This page shows all of the stopping criteria you can choose to use when growing a tree. It also shows (in the code at the top) the default stopping criteria that R uses if we don’t specify our own. What is the default number of rows that have to be in a leaf in order for it to split?
Question 23
Create a tree using all the features, but this time add the stopping rule that the R-squared needs to increase by at least 0.1% (0.001) in order to split. Call this tree tree5. Show your result and discuss the changes in the tree.
Now, trees sometimes have a habit of getting too big to read clearly in the visuals. There are some great tips at this website, starting on page 9, that can help make our plot more readable.
Conclusion
Question 24
The client wanted to know what penguin traits were associated with higher body mass. Based on your tree, respond to the client’s question.
This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 October 27.
The data set used in this lab is from the palmerpenguins library in R: Horst AM, Hill AP, Gorman KB (2020). palmerpenguins: Palmer Archipelago (Antarctica) penguin data. R package version 0.1.0. https://allisonhorst.github.io/palmerpenguins/. doi: 10.5281/zenodo.3960218.