Goal

We have been working with regression trees in class. Today we are going to explore how to fit and use trees in R.

The Data

We will be using a data set on universities (the same data as from Project 2). Go ahead and load that data into your RMarkdown file.
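If your data are saved as a csv file, loading them might look something like this (a minimal sketch; the file name universities.csv is a placeholder, so adjust it to match your copy of the Project 2 data). We also load the two packages we will use for trees throughout the lab:

```r
# Packages for growing and visualizing regression trees
library(rpart)
library(rpart.plot)

# Placeholder file name -- replace with your actual Project 2 data file
train <- read.csv("universities.csv")
```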

We have 19 variables in our training data.

  • Private = Yes if the school is a private school, No if public
  • Apps = the number of applications received on average per academic year
  • Accept = the number of applications accepted on average per academic year
  • Enroll = the average number of students enrolled each year out of the accepted students
  • Top10Perc = the percent of accepted students who were in the top 10% of their high school class
  • Top25Perc = the percent of accepted students who were in the top 25% of their high school class
  • F.Undergrad = the number of full-time undergraduate students enrolled at the university
  • P.Undergrad = the number of part-time undergraduate students enrolled at the university
  • Out.State = the average out of state tuition
  • Room.Board = the average yearly cost of room and board
  • Books = the average cost of books for class per year
  • Personal = the average student personal expenses per year
  • PhD = the percent of the faculty who hold a PhD
  • Terminal = the percent of faculty who hold a terminal degree in their field
  • S.F.Ratio = the student faculty ratio
  • perc.alumni = the percent of alumni who are active in the alumni association
  • Expend = the average amount of money a university spends per student per year
  • Grad.Rate = the percent of students who graduate within 4 years.

For today, our response variable is Grad.Rate, the graduation rate of the university.

In the project, we used a variety of models to try to make predictions using this data set. Today, we are going to give trees a try.

Regression Trees: Categorical Feature

Regression trees are supervised models (like all of the models in this course) with a response variable Y that is numeric. If Y is categorical, we use classification trees instead.

When we build regression trees, the first step is data cleaning, just like it is for every other model. The EDA step, however, is a little less intensive. Regression trees don't assume that the relationship between Y and each X can be represented by a line, a curve, etc., so we don't need to check for shape the way we are used to for LSLR models.

With that in mind, let's start! We are going to start with three features: student faculty ratio (S.F.Ratio), expenses per student (Expend), and whether or not the school is a private school (Private).

Let's start growing our tree.

1. Trees begin with a split on a single feature. Suppose we decide to consider splitting on whether or not the school is a private school. Explain in 1-2 sentences how you would use this feature to create one split, and how you would use the splitting rule to move rows into leaves.
2. Without using rpart to build the tree, find the training RSS we would get if we split on whether or not a school is a private school. Show your code. (One possible approach is sketched after this list.)
3. Now, using the rpart code, create a tree using only Private as a feature. Call the tree Tree1. Show a visualization of your tree as your answer.
4. Based on your tree, what percent of your training data comes from public schools?
5. Based on your tree, what graduation rate would you predict for a public school?
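Here is one possible approach for Questions 2 and 3 (a sketch, assuming your training data are in a data frame called train with columns Private and Grad.Rate):

```r
# Question 2: each leaf predicts the mean graduation rate of the rows
# that fall into it, so the training RSS is the sum of squared
# distances between each row and its leaf mean.
leaf_mean <- ave(train$Grad.Rate, train$Private)  # group mean, one per row
sum((train$Grad.Rate - leaf_mean)^2)

# Question 3: the same single split, grown with rpart
Tree1 <- rpart(Grad.Rate ~ Private, data = train)
rpart.plot(Tree1)
```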
Now, with only this one binary feature, it turns out that the tree can be compared to a least squares linear regression model: \( GradRate_i = \beta_0 + \beta_1 PrivateYes_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \). Let's see why.

6. Fit a least squares linear regression model for graduation rate, using whether or not a school is a private school as a feature. Call this model LSLR1. Write out the fitted regression line. (A sketch appears after this list.)
7. Based on the LSLR model, what graduation rate would you predict for a public school? Keeping in mind that in the visualization our trees round to the nearest whole number, how do these predictions compare to those you made from the tree?
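A sketch for Questions 6 and 7, assuming public schools are coded as the "No" level of Private:

```r
# Question 6: LSLR model with Private as the only feature
LSLR1 <- lm(Grad.Rate ~ Private, data = train)
summary(LSLR1)

# Question 7: predicted graduation rate for a public school,
# from the LSLR model and from the tree
new_school <- data.frame(Private = "No")
predict(LSLR1, newdata = new_school)
predict(Tree1, newdata = new_school)
```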
What do we notice? Well, with a single, binary categorical feature, trees yield one predicted value for each level of the feature. LSLR models do exactly the same thing. That prediction, for both models, is the mean of the response variable for all the rows at each level of the feature. Translation: the models may look different, but they are in fact the same at this stage.

Okay, then what is the point of a tree? When we start adding in multiple predictors, or when the predictors are numeric, we will find that trees and LSLR models are not the same.
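You can check the claim about means directly (a sketch):

```r
# Mean graduation rate at each level of Private...
tapply(train$Grad.Rate, train$Private, mean)

# ...matches the LSLR coefficients: the intercept is the mean for the
# baseline level, and intercept + slope is the mean for the other level
coef(LSLR1)
```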

Regression Trees: Numeric Feature

Let's see if that is true.

8. Create a tree using only student faculty ratio as a feature. Use the maxdepth = 1 stopping criterion to make sure that, for the moment, the tree only has one split. If you don't do this, the tree will keep growing, and for now, we only want one split. Call the tree Tree2, and show a visualization of your tree as your answer.
9. Based on your tree, what graduation rate would you predict for a school with a student faculty ratio of 10 (10 students per faculty member)?
10. Fit a least squares linear regression model for graduation rate, using student faculty ratio as the only feature. Call this model LSLR2. Write out the fitted regression line.
11. Based on your LSLR model, what graduation rate would you predict for a school with a student faculty ratio of 10? How does this compare to what you get from the tree? (A sketch of these steps appears after this list.)
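A sketch for Questions 8 through 11:

```r
# Question 8: a tree with a single split on S.F.Ratio
# (maxdepth = 1 is passed through to rpart.control)
Tree2 <- rpart(Grad.Rate ~ S.F.Ratio, data = train, maxdepth = 1)
rpart.plot(Tree2)

# Question 10: the corresponding LSLR model
LSLR2 <- lm(Grad.Rate ~ S.F.Ratio, data = train)
summary(LSLR2)

# Questions 9 and 11: predictions for a ratio of 10
new_school <- data.frame(S.F.Ratio = 10)
predict(Tree2, newdata = new_school)
predict(LSLR2, newdata = new_school)
```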
Now, you will see that our values are not identical. We are still using one split, so why the change? This is because LSLR assumes the relationship between graduation rate (Y) and student faculty ratio (X) is linear, so the prediction is based on the slope of a line. The tree model assumes that graduation rate differs depending on whether X is above or below a particular split value. There are exactly two possible predictions at this point, no matter the student faculty ratio. This is very different from an LSLR line, which can make a different prediction for every value of the student faculty ratio.

So...are trees always better? Or is LSLR always better? Neither. Remember, there is almost never an "always" situation in statistical learning. Trees are not always better at predicting than LSLR, and LSLR is not always better at predicting than a tree. As we have been doing all semester, we have to think through what our goal of modeling is, what properties our data have, what properties we want our model to have, and then compare a few plausible options!
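One quick way to see the difference between the two models (a sketch): the one-split tree produces exactly two distinct fitted values on the training data, while the LSLR line produces a different fitted value for nearly every ratio.

```r
# Two distinct fitted values from the tree...
table(predict(Tree2, newdata = train))

# ...versus a continuum of fitted values from the line
head(predict(LSLR2, newdata = train))
```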

12. Find the test MSE for your tree and for your LSLR model with student faculty ratio as a feature. Based on test metrics, which model would you choose and why? (A sketch appears below.)
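A sketch for Question 12, assuming your test data are in a data frame called test with the same columns as train:

```r
# Test MSE = average squared prediction error on the test set
tree_MSE <- mean((test$Grad.Rate - predict(Tree2, newdata = test))^2)
lslr_MSE <- mean((test$Grad.Rate - predict(LSLR2, newdata = test))^2)
c(tree = tree_MSE, LSLR = lslr_MSE)
```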
Fitting the Tree with Three Features

Now we have explored two trees, but neither included all three features we wanted to explore. Recall that we started with three features: student faculty ratio (S.F.Ratio), expenses per student (Expend), and whether or not the school is a private school (Private). Let's now grow a tree using them all!

13. Create a tree using student faculty ratio, whether or not the school is a private school, and expenses on each student as features. Call the tree Tree3, and show a visualization of your tree as your answer. (A sketch appears below.)
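A sketch for Question 13:

```r
# A tree grown with all three features; rpart picks the best split
# at each step from among the features we supply
Tree3 <- rpart(Grad.Rate ~ S.F.Ratio + Private + Expend, data = train)
rpart.plot(Tree3)
```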
We seem to have specified no stopping rules...but we did. R has default stopping rules built in that are designed to protect your computer. Let's take a look at these.

14. Type the code ?rpart.control into a chunk, hit play, and then put a # in front of the code. What will pop up is the R help page. This page shows all of the stopping criteria you can choose to use when growing a tree. It also shows (in the code at the top) the default stopping criteria that R uses if we don't specify our own. What is the default number of rows that have to be in a leaf in order for it to split?
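For reference, the top of that help page lists the defaults (copied here as comments; run the help page yourself to confirm):

```r
# ?rpart.control   # run once, then comment out so the file will knit
# Key defaults from the help page:
#   minsplit  = 20                    # rows required in a node before a split is attempted
#   minbucket = round(minsplit / 3)   # smallest allowed leaf
#   cp        = 0.01                  # minimum improvement required for a split
#   maxdepth  = 30                    # maximum depth of any node
```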
Okay, so now that we understand the stopping rules, let's try our tree.

15. Which feature was able to give us the largest reduction in training RSS in one split?
16. Based on our tree, what is the predicted graduation rate of a public school that spends about 12,000 US dollars per student in terms of school expenses, with a student faculty ratio of 20 (20 students per faculty member)? (A sketch appears below.)
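A sketch for Question 16 (the Private level for a public school is assumed to be "No"):

```r
# A hypothetical public school matching the question's description
new_school <- data.frame(Private = "No", Expend = 12000, S.F.Ratio = 20)
predict(Tree3, newdata = new_school)
```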
Fitting the Tree with All Features

Now let's really see the power of a tree and try using all the features (except university name!).

17. Create a tree using all of the features (except university name). Call the tree TreeAll, and show a visualization of your tree as your answer.
18. Prune your tree. Call the final tree TreeFinal, and show a visualization of your tree as your answer.
19. With your pruned tree, how many leaves did you remove from the original tree? Hint: it is okay in practice if the answer is 0; it just means that the stopping rules already gave us a tree that predicted (relatively!) well.
20. What is the predicted test RMSE of your final pruned tree? (A sketch of these steps appears after this list.)
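A sketch for Questions 17 through 20, assuming the university name is the first column of train (adjust the index if yours is stored elsewhere) and a test set called test:

```r
# Question 17: grow a tree on every feature except the name column
TreeAll <- rpart(Grad.Rate ~ ., data = train[, -1])
rpart.plot(TreeAll)

# Question 18: prune using the complexity parameter (cp) table --
# one common choice is the cp with the smallest cross-validated error
printcp(TreeAll)
best_cp <- TreeAll$cptable[which.min(TreeAll$cptable[, "xerror"]), "CP"]
TreeFinal <- prune(TreeAll, cp = best_cp)
rpart.plot(TreeFinal)

# Question 20: test RMSE of the pruned tree
sqrt(mean((test$Grad.Rate - predict(TreeFinal, newdata = test))^2))
```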
Now, trees sometimes have a habit of getting too big to read clearly in the visuals. There are some great tips at this website, starting on page 9, that can help make our plot more readable.
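A few rpart.plot arguments that often help with crowded trees (a sketch):

```r
rpart.plot(TreeFinal,
           tweak  = 1.2,  # enlarge the text slightly
           faclen = 3,    # abbreviate factor levels to 3 characters
           varlen = 8)    # abbreviate variable names to 8 characters
```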

Conclusion

In this lab, we have explored regression trees. Our next stop is classification trees!

Last updated April 5, 2021.
The css file used to format this lab was retrieved from the GitHub of Mine Çetinkaya-Rundel, version January 13, 2016.