We have been working with regression trees in class. Today we are going to explore how to fit and use trees in R.
We will be using a data set on universities (the same data as from Project 2). Go ahead and load that data into your RMarkdown file.
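If it helps, a minimal loading sketch might look like the following; the file name colleges.csv is a placeholder, so swap in whatever your Project 2 data file is actually called.

```r
# A minimal sketch: "colleges.csv" is a placeholder file name --
# use whatever name your Project 2 data set actually has
colleges <- read.csv("colleges.csv")

# A quick look at the variables
str(colleges)
```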
We have 19 variables in our training data.

Private = Yes if the school is a private school, No if public
Apps = the number of applications received on average per academic year
Accept = the number of applications accepted on average per academic year
Enroll = the average number of students enrolled each year out of the accepted students
Top10Perc = how many of the accepted students were in the top 10% of their high school class
Top25Perc = how many of the accepted students were in the top 25% of their high school class
F.Undergrad = the number of full-time undergraduate students enrolled at the university
P.Undergrad = the number of part-time undergraduate students enrolled at the university
Out.State = the average out-of-state tuition
Room.Board = the average yearly cost of room and board
Books = the average cost of books for class per year
Personal = the average student personal expenses per year
PhD = the percent of the faculty who hold a PhD
Terminal = the percent of faculty who hold a terminal degree in their field
S.F.Ratio = the student faculty ratio
perc.alumni = the percent of alumni who are active in the alumni association
Expend = the average amount of money a university spends per student per year
Grad.Rate = the percent of students who graduate within 4 years

For today, our response variable is Grad.Rate, the graduation rate of the university.
In the project, we used a variety of models to try and make predictions using this data set. Today, we are going to give trees a try.
Regression trees are supervised models (like all the models we do in this course) with a response variable Y that is numeric. We will use classification trees if Y is categorical.
When we build regression trees, the first step is data cleaning, just like it is for every other model. The EDA step, however, is a little less intensive. Regression trees don't have the assumption that the relationship between Y and each X can be represented by a line, or a curve, etc., so we don't need to check for shape the way we are used to for LSLR models.
With that in mind, let's start! We are going to start with three features: student faculty ratio (S.F.Ratio), expenses per student (Expend), and whether or not the school is a private school (Private).
Let's start growing our tree.
Question 1: Without using rpart to build the tree, find the training RSS we would get if we split on whether or not a school is a private school. Show your code.

Question 2: Using rpart code, create a tree using only Private as a feature. Call the tree Tree1, and show a visualization of your tree as your answer.
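As a hint for the rpart step, one possible sketch is below; it assumes the data frame is called colleges, as in the loading sketch above.

```r
library(rpart)        # grows the tree
library(rpart.plot)   # draws the tree

# Grow a regression tree for Grad.Rate using only Private
Tree1 <- rpart(Grad.Rate ~ Private, data = colleges, method = "anova")

# Visualize the tree
rpart.plot(Tree1)
```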
Now, with only this one binary feature, it turns out that the tree can be compared to a least-squares linear regression model: \( GradRate_i = \beta_0 + \beta_1 PrivateYes_i + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2) \). Let's see why.
Question 3: Fit a least-squares linear regression model using Private as the feature and Grad.Rate as the response. Call the model LSLR1. Write out the fitted regression line.
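A possible sketch for this model, again assuming the data frame is called colleges:

```r
# Fit the LSLR model with Private as the only feature
LSLR1 <- lm(Grad.Rate ~ Private, data = colleges)

# The fitted coefficients give the regression line
summary(LSLR1)
```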
What do we notice? Well, with a single, binary categorical feature, trees yield one predicted value for each level of the feature. LSLR models do exactly the same thing. That prediction, for both models, is the mean of the response variable for all the rows at each level of the feature. Translation: the models may look different, but they are in fact the same at this stage.
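We can check this claim directly by comparing the group means to the predictions from both models (a quick sketch, reusing Tree1 and LSLR1 from above):

```r
# Mean graduation rate within each level of Private
tapply(colleges$Grad.Rate, colleges$Private, mean)

# Both models make exactly two distinct predictions: those same means
unique(predict(Tree1))
unique(predict(LSLR1))
```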
Okay, then what is the point of a tree?? When we start adding in multiple predictors, or when the predictors are numeric, we will find that trees and LSLR models are not the same.
Let's see if that is true.
Question 4: Using rpart, create a tree with S.F.Ratio as the only feature. Use the maxdepth = 1 stopping criterion to make sure that, for the moment, the tree only has one split. If you don't do this, the tree will keep growing, and for now, we only want one split. Call the tree Tree2, and show a visualization of your tree as your answer.

Question 5: Fit a least-squares linear regression model using S.F.Ratio as the feature. Call the model LSLR2. Write out the fitted regression line.
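One possible sketch for both models, under the same assumptions as before:

```r
# Grow the tree, but stop after a single split
Tree2 <- rpart(Grad.Rate ~ S.F.Ratio, data = colleges,
               control = rpart.control(maxdepth = 1))
rpart.plot(Tree2)

# The LSLR model with the same feature
LSLR2 <- lm(Grad.Rate ~ S.F.Ratio, data = colleges)
summary(LSLR2)
```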
Now, you will see that our values are not identical. We are still using one split, so why the change? This is because LSLR assumes the relationship between graduation rate (Y) and student faculty ratio (X) is linear, so the prediction is based on the slope of a line. The tree model assumes that graduation rate is different depending on whether X is greater or less than a particular cutoff value. There are exactly two possible predictions at this point, for any student faculty ratio. This is very different from an LSLR line, which right now can make many different predictions depending on the student faculty ratio.
So...are trees always better? Or is LSLR always better? Neither. Remember, there is almost never an "always" situation in statistical learning. Trees are not always better at predicting than LSLR, and LSLR is not always better at predicting than a tree. As we have been doing all semester, we have to think through what our goal of modeling is, what properties our data have, what properties we want our model to have, and then compare a few plausible options!
Now we have explored two trees, but neither included all three features we wanted to explore. Recall that we started with three features: student faculty ratio (S.F.Ratio), expenses per student (Expend), and whether or not the school is a private school (Private). Let's now grow a tree using them all!
Question 6: Grow a tree using all three features. Call the tree Tree3, and show a visualization of your tree as your answer.
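A possible sketch, with no stopping rules specified:

```r
# Grow a tree with all three features; no stopping rules specified (yet)
Tree3 <- rpart(Grad.Rate ~ S.F.Ratio + Expend + Private, data = colleges)
rpart.plot(Tree3)
```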
We seem to have specified no stopping rules...but we did. R has default stopping rules built in that are designed to protect your computer. Let's take a look at these.
Question 7: Type ?rpart.control into a chunk, hit play, and then put a # in front of the code. What will pop up is the R help page. This page shows all of the stopping criteria you can choose to use when growing a tree. It also shows (in the code at the top) the default stopping criteria that R uses if we don't specify our own. What is the default number of rows that have to be in a leaf in order for it to split?

Okay, so now that we understand the stopping rules, let's try our tree.
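To see how overriding the defaults works, here is a sketch; the values 10 and 0.005 and the name Tree3Custom are just for illustration, not required choices:

```r
# Overriding two of the defaults shown on the help page:
#   minsplit = how many rows a node needs before R will try to split it
#   cp       = how much a split must improve the fit to be kept
Tree3Custom <- rpart(Grad.Rate ~ S.F.Ratio + Expend + Private,
                     data = colleges,
                     control = rpart.control(minsplit = 10, cp = 0.005))
rpart.plot(Tree3Custom)
```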
Now let's really see the power of a tree and try using all the features (except university name!!).
Question 8: Grow a tree using all of the features (except university name). Call the tree TreeAll, and show a visualization of your tree as your answer.

Question 9: Now choose your own stopping criteria and grow the tree one more time. Call the tree TreeFinal, and show a visualization of your tree as your answer.

Now, trees sometimes have a habit of getting too big to see clearly in a plot. There are some great tips at this website, starting on Page 9, that can help make our plot more readable.
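For instance, a sketch along these lines can help; the column name Name for the university name is an assumption, so adjust it to match your data:

```r
# Grow a tree with every feature except the university name
# (here the name is assumed to live in a column called Name --
#  adjust the formula to match your data)
TreeAll <- rpart(Grad.Rate ~ . - Name, data = colleges)

# tweak enlarges the text; faclen abbreviates long factor labels
rpart.plot(TreeAll, tweak = 1.1, faclen = 3)
```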
In this lab, we have explored regression trees. Our next stop is classification trees!