Goal

Today we are going to explore how to fit and use regression trees in R.

The Data

We are going to work with a data set containing information on n = 333 penguins. To load the data, you need to use the following three lines of code:

library(palmerpenguins)
data(penguins)
penguins <- na.omit(penguins)

We have a client who is interested in building a model for Y = the body mass of a penguin in grams.

In addition to this response variable, we have information on 7 features.

  • species - the type of penguin.
  • island - the island where the penguin lives.
  • bill_length_mm - the length of the penguin bill in millimeters.
  • bill_depth_mm - the depth of the penguin bill in millimeters.
  • flipper_length_mm - the flipper length of the penguin in millimeters.
  • sex - the biological sex of the penguin.
  • year - the year the penguin was measured.

The client is particularly interested in understanding the relationship between different values of the features and higher body mass. To build a model to explore this, we will use a regression tree.

  1. Is this an association task or a prediction task?

Regression Trees: Categorical Feature

Regression trees are supervised models (like all the models we use in this course) with a response variable Y that is numeric. We use classification trees when Y is categorical, but we won't get to those until next week.

When we build regression trees, the first step is data cleaning, just like it is for every other model. The EDA step, however, is a little less intensive. Regression trees do not assume that the relationship between Y and each X can be represented by a line, a curve, etc., so we don't need to check for shape the way we do for LSLR models. We also don't assume normality of the residuals. The tree model is not built that way: it is a non-parametric model, meaning it is not structured using parameters such as a slope or an intercept. This means we do not have to make assumptions about the shape of relationships.

With that in mind, let's start! We are going to start with just one feature: species. We are going to use this to give us a way to explore how the tree grows, and then we will add more features later.

Figuring out the First Split

  2. Right now, assume we have not started building the tree, so all the data is in the root node. In this root node, what is the predicted value for Y = body mass?
  3. What is the training RSS and RMSE with all the data in the root node (i.e., using no features)?
  4. With species as a feature, what are the possible splitting rules we could use to divide the root node into two leaves?
  5. Using a for loop, find the training RSS that we would get if we built a tree with one split using each of these splitting rules. Hint: Write out the steps in words; this will help you figure out what each line of code needs to do. Show your code and your results. (One possible skeleton for this loop is sketched after this list.)
  6. Based on this, which splitting rule would you recommend we use to split the root node into two leaves? What is the percent reduction in training RSS you get with this split (compared to the root node RSS)?
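
If you get stuck on Questions 3 and 5, here is a minimal sketch of one way to structure the computation. It assumes the data have been loaded and cleaned as above; the names rootPred, rules, inLeaf, and RSS are just illustrative choices, not anything R requires.

# Root node: every row gets the same prediction, the overall mean
rootPred <- mean(penguins$body_mass_g)
rootRSS <- sum((penguins$body_mass_g - rootPred)^2)

# Try each splitting rule of the form "is the penguin this species or not?"
rules <- levels(penguins$species)
RSS <- numeric(length(rules))
for (i in seq_along(rules)) {
  inLeaf <- penguins$species == rules[i]
  # Each leaf predicts the mean body mass of its own rows
  pred <- ifelse(inLeaf,
                 mean(penguins$body_mass_g[inLeaf]),
                 mean(penguins$body_mass_g[!inLeaf]))
  RSS[i] <- sum((penguins$body_mass_g - pred)^2)
}
data.frame(rule = rules, RSS = RSS)
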
Building a Tree in R

Now that we have determined for ourselves which splitting rule we should use, let's verify it. We can use rpart() as a tool to build trees in R. To use this function, we need two libraries.

library(rpart)
library(rattle)

Once you have the libraries, we can start building our tree. To build a tree with one feature (species) and only one split, we use the following:

tree1 <- rpart(body_mass_g ~ species, data = penguins, method = "anova", maxdepth = 1)

You will notice that the structure of this code is similar to other regression code. We specify the Y variable, the features, and the data. We use method = "anova" to tell R that we are fitting a regression tree rather than a classification tree.

The maxdepth = 1 part of the code specifies that, for now, we are only allowing one split. If you don't put anything in the maxdepth part of a tree code, the tree grows until it hits the built-in stopping rules. More on that later!

Once your tree is built, it helps to visualize the results. One of the biggest advantages of trees is that they are highly interpretable. To visualize our tree, we use the following:

fancyRpartPlot(tree1, sub = "Figure 1: One Split")

The sub = part of the code is where you add a title to your tree. With a tree, it is much easier to add the title at the bottom rather than at the top - there is more space.

Look at your tree and verify that your answer to Question 6 matches the tree you have drawn.

  7. What percent of penguins are in leaf 2?
  8. What body mass would you predict for a Chinstrap penguin?

Now, with only this one binary feature, it turns out that the tree can be compared to a least-squares linear regression model: \( BodyMass_i = \beta_0 + \beta_1 GentooPenguin_i + \epsilon_i, \ \epsilon_i \sim N(0,\sigma)\). Let's see why.

  9. Fit a least squares linear regression model for body mass, using whether or not a penguin is a Gentoo penguin as a feature. Call this model LSLR1. Write out the fitted regression line. Hint: To specify an indicator variable for a specific level of a categorical variable, you can use, for instance, (species == "Adelie"). The ( ) are important. (A short sketch of this hint in action follows Question 10.)
  10. Based on the regression model, what body mass would you predict for a Chinstrap penguin? Keeping in mind that in the visualization our trees round to the nearest whole number, how do these predictions compare to those you made from the tree?
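
One way to translate the hint into code (a sketch, not the only approach; it assumes the penguins data from above, and the name LSLR1 follows the lab's convention):

# Indicator feature: TRUE for Gentoo penguins, FALSE otherwise
LSLR1 <- lm(body_mass_g ~ (species == "Gentoo"), data = penguins)
summary(LSLR1)

# For a Chinstrap penguin the indicator is FALSE, so the prediction
# is just the intercept of the fitted line
predict(LSLR1, data.frame(species = "Chinstrap"))
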
What do we notice? Well, with a single categorical feature, a tree yields one predicted value for each level of the feature. LSLR models do exactly the same thing. That prediction, for both models, is the mean of the response variable for all the rows at each level of the feature. Translation: the models may look different, but they are in fact the same at this stage.

Okay, then what is the point of a tree? When we start adding in multiple predictors, or when the predictors are numeric, we will find that trees and LSLR models are not the same.

Regression Trees: Numeric Feature

Let's see if that is true.

  11. Create a tree using only flipper length as a feature. Use the maxdepth = 1 stopping criterion to make sure that for the moment, the tree only has one split. If you don't do this, the tree will keep growing, and for now, we only want one split. Call the tree tree2, and show a visualization of your tree as your answer (Figure 2).
  12. Based on your tree, what body mass would you predict for a penguin with a flipper length of 210 mm?
  13. Fit a least squares linear regression model for body mass, using flipper length as the only feature. Call this model LSLR2 and write out the fitted regression line.
  14. Based on your LSLR line, what body mass would you predict for a penguin with a flipper length of 210 mm? How does this compare to what you get from the tree?

Now, you will see that our values are not identical. We are still using one split - why the change? This is because LSLR assumes the relationship between body mass (Y) and flipper length (X) is linear, so the prediction is based on the slope of a line. The tree model assumes that body mass differs depending on whether X is greater than a particular cutoff value. There are exactly two possible predictions at this point, depending on flipper length. This is very different from an LSLR line, which can make many different predictions depending on the exact value of the flipper length.
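
To make the contrast concrete, here is a sketch of the two models side by side (tree2 and LSLR2 follow the lab's names; the exact cutoff rpart picks is chosen by the algorithm):

tree2 <- rpart(body_mass_g ~ flipper_length_mm, data = penguins,
               method = "anova", maxdepth = 1)
LSLR2 <- lm(body_mass_g ~ flipper_length_mm, data = penguins)

newdata <- data.frame(flipper_length_mm = 210)
predict(tree2, newdata)  # one of exactly two leaf means
predict(LSLR2, newdata)  # a point on the fitted line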

So...are trees always better? Or is LSLR always better? Neither. Remember, there is almost never an "always" situation in statistical learning. As we have been doing all semester, we have to think through what our goal of modeling is, what properties our data have, what properties we want our model to have, and then compare a few plausible options!

  15. Find the training RMSE for your tree and for your LSLR model with flipper length as a feature. Based on training metrics, which model is a stronger fit to the sample data? (One way to compute these is sketched below.)
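
A sketch of one way to compute the training RMSE for both models (it assumes tree2 and LSLR2 from the sketch above):

sqrt(mean((penguins$body_mass_g - predict(tree2, penguins))^2))
sqrt(mean((penguins$body_mass_g - predict(LSLR2, penguins))^2))
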
Adding More Splits

So far, we have limited ourselves to trees with one split. Now, let's let the tree grow a bit.

  16. Build a tree using flipper length as a feature, but this time allow 3 splits. Plot your tree and label your image Figure 3. (One way to set this up is sketched after this question.)
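
One possible reading, sketched here as an assumption rather than the required approach: maxdepth counts the depth of the tree rather than the number of splits, so maxdepth = 2 lets the root split and each of its two children split, for up to three splits in total. The name tree3 is an assumed choice following the lab's naming pattern.

tree3 <- rpart(body_mass_g ~ flipper_length_mm, data = penguins,
               method = "anova", maxdepth = 2)
fancyRpartPlot(tree3, sub = "Figure 3: Three Splits")
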
Fitting the Tree with Multiple Features

Now we have explored a few trees, but each included only one feature. One of the great things about trees is that they can very easily handle a lot of features in the data. Let's now grow a tree using them all!

  17. Create a tree using all possible features in the data. Do not restrict the number of splits. Call the tree tree4, and show a visualization of your tree as your answer. (A sketch of the formula shorthand for "all features" follows this list.)
  18. Which features does your tree use?
  19. In tree4, which feature was able to give us the largest reduction in training RSS in one split?
  20. Based on tree4, what is the predicted body mass for the 3rd penguin in the data set?
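
A minimal sketch for growing and inspecting this tree (the ~ . shorthand means "use all other columns in the data as features"):

tree4 <- rpart(body_mass_g ~ ., data = penguins, method = "anova")
fancyRpartPlot(tree4, sub = "Figure 4: All Features")

# Predicted body mass for the 3rd penguin in the data set
predict(tree4, penguins[3, ])
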
We seem to have specified no stopping rules...but we did. R has default stopping rules built in that are designed to protect your computer. Let's take a look at these.

  21. Type the code ?rpart.control into a chunk, hit play, and then put a # in front of the code. What will pop up is the R help page. This page shows all of the stopping criteria you can choose to use when growing a tree. It also shows (in the code at the top) the default stopping criteria that R uses if we don't specify our own. What is the default number of rows that have to be in a leaf in order for it to split?
  22. Create a tree using all the features, but this time add the stopping rule that the R-squared needs to increase by .1% (.001) in order for a split to be made. Call this tree tree5. Show your result and discuss the changes in the tree. (One way to encode this rule is sketched below.)
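
In rpart, this kind of rule is controlled by the complexity parameter cp: with anova splitting, a split is only attempted if it increases the overall R-squared by at least cp, so cp = 0.001 matches the .1% rule above. A sketch:

tree5 <- rpart(body_mass_g ~ ., data = penguins, method = "anova",
               cp = 0.001)
fancyRpartPlot(tree5, sub = "Figure 5: cp = 0.001")
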
Now, trees sometimes have a habit of getting too big to see clearly in a visual. There are some great tips at this website, starting on Page 9, that can help make our plot more readable.

Conclusion

  23. The client wanted to know what features of a penguin were associated with higher body mass. Based on your tree, respond to the client's question.

Creative Commons License

This work was created by Nicole Dalzell and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2022 March 30.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version 2016 Jan 13.