Visit my website for more like this!

Heavily borrowed from:

Textbook: Introduction to statistical learning

Textbook: Elements of statistical learning

UCLA Example link

`require(knitr)`

`## Loading required package: knitr`

This notebook is about *tree-based* methods for regression and classification. They involve *stratifying* or *segmenting* the predictor space into a number of more simple regions. To make a prediction for any given observation, we typically use the mean of the mode of the training observations in these regions to which is belongs. These splitting rules used to segment the predictor space can be summarized in a tree, hence they are usually called *decision tree* methods.

Tree methods are simple and useful for interpretation, however, they typically are not competitive with the best supervised learning methods in terms of prediction accuracy. Hence, in subsequent notebooks we also introduce bagging, random forests, and boosting. Each of these examples involve producing multiples trees, which are then combined to yield a single consensus prediction. We see that combining a large number of trees can result in dramatic improvements in prediction accuracy at the expensive of a loss in interpretation.

Decision trees can be applied to both regression and classification problems. We will first consider regression.

We begin with a simple example:

We use the `Hitters`

data from the `ISLR`

library to predict a baseball playerâ€™s `Salary`

based on the number of `Years`

he has played in the major league, and the number of `Hits`

he made in the previous year.

The result would be a series of splitting rules. The first split would segment the data into `Years < 4.5`

on the left branch, and the remainder to the right. The predicted salary for these players is given by the mean response value from the players in either branch. Players with `Years >= 4.5`

are assigned to the right branch, and then further subdivided by `Hits`

. Players with `Years >= 4.5`

and `Hits < 118`

fall into the third region, and players with `Years >= 4.5`

and `Hits >= 118`

fall in the fourth region, each with their own predicted probabilities. The end points of the trees are called nodes, or leaves. We might interpret such a response as `Years`

are the most important factor in determining `Salary`

, and players with less experience have lower salaries. If the player is less experienced, the number of `Hits`

last year plays a roll in his `Salary`

. If we code this model, we see that the relationship ends up being slightly more complicated.

```
library(tree)
library(ISLR)
attach(Hitters)
# Remove NA data
Hitters<- na.omit(Hitters)
# log transform Salary to make it a bit more normally distributed
hist(Hitters$Salary)
```

```
Hitters$Salary <- log(Hitters$Salary)
hist(Hitters$Salary)
```

```
tree.fit <- tree(Salary~Hits+Years, data=Hitters)
summary(tree.fit)
```

```
##
## Regression tree:
## tree(formula = Salary ~ Hits + Years, data = Hitters)
## Number of terminal nodes: 8
## Residual mean deviance: 0.271 = 69.1 / 255
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.2400 -0.2980 -0.0365 0.0000 0.3230 2.1500
```

`plot(tree.fit)`

Now we discuss prediction via stratification of feature space, to build a regression tree. In general, there are two steps.

Find the variable / split that best separates the response variable, which yields the lowest RSS.

Divide the data into two leaves on the first identified node.

Within each leaf, find the best variable/split that separates the outcomes.

Continue until the groups are too small or sufficiently â€˜pureâ€™.

The goal being to find the number of regions the minimize RSS. However, it computationally unfeasible to consider every possible partition into *J* regions. For this reason we take a *top-down*, *greedy* approach. It is top-down because we start at a point where all the observation belongs to a single region. It is greedy because at each step of the tree-building process, the best split is chosen at that particular step, rather than looking ahead to see a split that will lead to a better tree in some future step.

Once all the regions have been created, we predict the response for a given test observation using the mean of the training observations in each region.

While the model above can produce good prediction on training data, basic tree methods are likely to over fit the data, leading to poor test performance. This is because the resulting trees tend to be too complex. A smaller tree with fewer splits often leads to lower variance, easier interpretation and lower test errors, at the cost of a little bias. One possible way to achieve this is to build a tree only so long as the decrease in RSS due to each split exceeds some (high) threshold. While this will certainly reduce tree size, it is too short sighted. This is because a seemingly worthless split early on in a tree can be followed by a very good split later.

Therefore, a better strategy is to grow a large tree, then *prune* it back to obtain a better sub tree. Intuitively, our goal is to select a sub tree that leads to the lowest test error rate. To do this, we would normally use cross validation. However it is too cumbersome since there is an extremely large number of possible sub trees.

*Cost complexity pruning* - also know as weakest link pruning gives us a way to remedy this problem. Rather than considering every possible sub tree, we consider a sequence of trees indexed by a non negative tuning parameter `alpha`

.

__Revised steps to building a regression tree_

Use recursive binary splitting to grow a large tree based on training data, stopping only when each terminal node has fewer than some minimum number of observations.

Apply cost complexity pruning to the large tree in order to obtain a sequence of best sub trees as a function of

`alpha`

.Use

*k*-fold cross validation to choose`alpha`

.Return the sub tree from step 2 that corresponds to the chosen value of

`alpha`

.

`library(caret)`

```
## Loading required package: lattice
## Loading required package: ggplot2
```

```
split <- createDataPartition(y=Hitters$Salary, p=0.5, list=FALSE)
train <- Hitters[split,]
test <- Hitters[-split,]
#Create tree model
trees <- tree(Salary~., train)
plot(trees)
text(trees, pretty=0)
```

```
#Cross validate to see whether pruning the tree will improve performance
cv.trees <- cv.tree(trees)
plot(cv.trees)
```

It seems like the 7th sized trees result in the lowest deviance. We can then prune the tree. However, this doesnâ€™t really prune the model, therefore we can select a smaller size where the improvement in deviance plateaus. This would be around the 4rd split.

```
prune.trees <- prune.tree(trees, best=4)
plot(prune.trees)
text(prune.trees, pretty=0)
```

Use the pruned tree to make predictions on the test set.

```
yhat <- predict(prune.trees, test)
plot(yhat, test$Salary)
abline(0,1)
```

`mean((yhat - test$Salary)^2)`

`## [1] 0.3531`

Classification trees are very similar to regression trees, except that it is used to predict a qualitative response rather than a quantitative one. For a regression tree, the predicted response for an observation is given by the mean response of the training observations in that branch. In contract, for classification trees, we predict that each observation belongs to the most *commonly occuring* class of training observation in the region in belongs. When interpreting the results of a classification tree, we are often interested in not only the predictions for each node, but also the *class proportions* in the region.

To grow a classification tree, we use the same recursive binary splitting, but now RSS cannot be used as a splitting criterion. The alternative is to use the *classification error rate*. While it is intuitive, it turns out that this method is not sensitive enough for tree-growing.

In practise two other methods are preferable, though they are quite similar numerically:

__Gini index_ is a measure of the total variance across *K* classes.

__Cross-entropy_ will take on a value near zero if the proportion of training observations in the given category are all near zero or one.

These two methods are preferred when pruning the tree, but the regular classification error rate is preferable if the prediction accuracy of the final pruned model is the goal.

To demonstrate this we will use the `Heart`

dataset. These data contain a binary outcome variable `AHD`

for 303 patients who presented with chest pain. The outcomes are coded as `Yes`

or `No`

for presence of heart disease.

```
Heart <-read.csv('http://www-bcf.usc.edu/~gareth/ISL/Heart.csv')
kable(head(Heart))
```

X | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

1 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | No |

2 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | Yes |

3 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversable | Yes |

4 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | No |

5 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | No |

6 | 56 | 1 | nontypical | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | normal | No |

`dim(Heart)`

[1] 303 15

```
split <- createDataPartition(y=Heart$AHD, p = 0.5, list=FALSE)
train <- Heart[split,]
test <- Heart[-split,]
trees <- tree(AHD ~., train)
plot(trees)
```