Today we are going to explore how to fit and use regression trees in R.
We are going to work with a data set containing information on n = 333 penguins. To load the data, you need to use the following three lines of code:
library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)
We have a client who is interested in building a model for Y = the body mass of a penguin in grams.
In addition to this response variable, we have information on 7 features.
species - the species of the penguin.
island - the island where the penguin lives.
bill_length_mm - the length of the penguin's bill in millimeters.
bill_depth_mm - the depth of the penguin's bill in millimeters.
flipper_length_mm - the flipper length of the penguin in millimeters.
sex - the biological sex of the penguin.
year - the year the penguin was measured.
The client is particularly interested in understanding the relationship between different values of the features and higher body mass. To build a model to explore this, we will use a regression tree.
Regression trees are supervised models (like all the models we do in this course) with a response variable Y that is numeric. We will use classification trees if Y is categorical, but we won't get to those until next week.
When we build regression trees, the first step is data cleaning, just like it is for every other model. The EDA step, however, is a little less intensive. Regression trees don't assume that the relationship between Y and each X can be represented by a line, a curve, etc., so we don't need to check for shape the way we are used to for LSLR models. We also don't assume normality of the residuals. The tree model is simply not built that way: it is a non-parametric model, meaning it is not structured using parameters like a slope or an intercept. This means we do not have to make assumptions about the shape of the relationships.
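Since we have already run na.omit(), a quick check is usually all the data cleaning this data set needs. As a small illustrative sketch (not a required step), the lines below confirm the variable types and that no missing values remain:
# check variable types and remaining missing values (illustrative only)
str(penguins)
colSums(is.na(penguins))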
With that in mind, let's start! We are going to start with just one feature: species. We are going to use this to give us a way to explore how the tree grows, and then we will add more features later.
Now that we have determined for ourselves which splitting rule we should use, let's verify it. We can use rpart()
as a tool to build trees in R. To use this function, we need two libraries.
library(rpart)
library(rattle)
Once you have the libraries, we can start building our tree. To build a tree with one feature (species) and only one split, we use the following:
tree1 <- rpart(body_mass_g ~ species, data = penguins, method = "anova", maxdepth = 1)
You will notice that the structure of this code is similar to other regression codes. We specify the Y variable, the features, and the data. We use method = "anova"
to tell R that we are fitting a regression tree rather than a classification tree.
The maxdepth=1
part of the code specifies that right now, we are only allowing one split. If you don't put anything in the maxdepth part of a tree code, the tree grows until it hits the built-in stopping rules. More on that later!
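If you would like to see the split as text before plotting anything, printing the fitted object works; R lists each node with its splitting rule, the number of rows in the node, and the predicted body mass (yval):
# print the tree as text; the * marks terminal nodes (leaves)
tree1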
Once your tree is built, it helps to visualize the results. One of the biggest advantages of trees is that they are highly interpretable. To visualize our tree, we use the following:
fancyRpartPlot(tree1, sub = "Figure 1: One Split")
The sub=
part of the code is where you add a title to your tree. It is much easier in a tree to add a title at the bottom rather than at the top - there is more space.
Look at your tree and verify that your answer to Question 6 matches the tree you have drawn.
Now, with only this one binary feature, it turns out that the tree can be compared to a least-squares linear regression model: \( BodyMass_i = \beta_0 + \beta_1 GentooPenguin_i + \epsilon_i, \epsilon_i \sim N(0,\sigma)\). Let's see why.
Fit the LSLR model shown above and call it LSLR1. Write out the fitted regression line. Hint: To specify an indicator variable for a specific level of a categorical variable, you can use, for instance, (species=="Adelie"). The ( ) are important.
What do we notice? Well, with a single categorical feature, trees yield one predicted value for each level of the feature. LSLR models do exactly the same thing. That prediction, for both models, is the mean of the response variable for all the rows at each level of the feature. Translation: The models may look different, but they are in fact the same at this stage.
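If you want an optional check of this claim, the sketch below fits the LSLR model with an indicator for Gentoo penguins (assuming, as the model above suggests, that the single split separates the Gentoo penguins from the other two species) and compares the group means to the tree's leaf predictions:
# fit the LSLR model with a Gentoo indicator (a sketch, not the required answer)
LSLR1 <- lm(body_mass_g ~ (species == "Gentoo"), data = penguins)
coef(LSLR1)
# mean body mass for non-Gentoo vs. Gentoo penguins
tapply(penguins$body_mass_g, penguins$species == "Gentoo", mean)
# the tree's predictions: one value per leaf
unique(predict(tree1))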
Okay, then what is the point of a tree?? When we start adding in multiple predictors, or when the predictors are numeric, we will find that trees and LSLR models are not the same.
Let's see if that is true.
Build a regression tree for body mass using flipper length (flipper_length_mm) as the only feature. Use the maxdepth = 1 stopping criterion to make sure that for the moment, the tree only has one split. If you don't do this, the tree will keep growing, and for now, we only want one split. Call the tree tree2, and show a visualization of your tree as your answer (Figure 2). Then fit the corresponding LSLR model using flipper length as the feature, call it LSLR2, and write out the fitted regression line.
Now, you will see that our values are not identical. We are still using one split - why the change? This is because LSLR assumes the relationship between body mass (Y) and flipper length (X) is linear, so the prediction is based on the slope of a line. The tree model assumes that body mass is different depending on whether X is greater than a particular value. There are exactly two possible predictions at this point, depending on flipper length. This is very different from an LSLR line, which right now can make many different predictions depending on the exact value of the flipper length.
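To make this concrete, here is one possible version of tree2 and LSLR2; the exact split point R chooses is something you should read off your own plot, so treat this purely as a sketch:
# a one-split tree and an LSLR model, both using flipper length
tree2 <- rpart(body_mass_g ~ flipper_length_mm, data = penguins, method = "anova", maxdepth = 1)
fancyRpartPlot(tree2, sub = "Figure 2: One Split")
LSLR2 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)
# the tree can only make two distinct predictions; the line makes many
unique(predict(tree2))
length(unique(predict(LSLR2)))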
So...are trees always better? Or is LSLR always better? Neither. Remember, there is almost never an "always" situation in statistical learning. As we have been doing all semester, we have to think through what our goal of modeling is, what properties our data have, what properties we want our model to have, and then compare a few plausible options!
So far, we have limited ourselves to trees with one split. Now, let's let the tree grow a bit.
Now we have explored two trees, but both included only one feature. One of the great things about trees is that they can very easily handle a lot of features in the data. Let's now grow a tree using them all!
Grow a tree for body mass using all of the features in the data set, without specifying any stopping rules yourself. Call the tree tree4, and show a visualization of your tree as your answer (one possible sketch appears below).
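Here is a minimal sketch of how such a tree might be grown; note that nothing is passed to maxdepth or any other stopping rule:
# grow a tree using every feature, leaving the stopping rules at their defaults
tree4 <- rpart(body_mass_g ~ ., data = penguins, method = "anova")
fancyRpartPlot(tree4, sub = "A tree using all features")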
We seem to have specified no stopping rules...but we did. R has default stopping rules built in that are designed to protect your computer. Let's take a look at these.
Type ?rpart.control into a chunk, hit play, and then put a # in front of the code. What will pop up is the R help page. This page shows all of the stopping criteria you can choose to use when growing a tree. It also shows (in the code at the top) the default stopping criteria that R uses if we don't specify our own. What is the default number of rows that have to be in a leaf in order for it to split?
Now, trees sometimes have a habit of getting too big to see clearly in the visuals. There are some great tips at this website, starting on Page 9, that can help make our plot more readable.
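Once you have looked at the help page: if you ever want to change those defaults yourself, rpart() accepts a control argument built with rpart.control(). The values below are purely illustrative, not recommendations:
# override two of the default stopping criteria (example values only)
tree4_custom <- rpart(body_mass_g ~ ., data = penguins, method = "anova", control = rpart.control(minsplit = 30, cp = 0.005))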