Decision Trees

We have described two types of machine learning algorithms so far: linear approaches (linear regression, generalized linear models (GLMs), discriminant analysis) and a smoothing approach (k-nearest neighbors, covered in the notes). The linear approaches are limited in that the partition of the prediction space has to be linear (or, in the case of QDA, quadratic). A limitation of the smoothing approach is that with a large number of predictors we run into the curse of dimensionality.

The Curse of Dimensionality

A useful way of understanding the curse of dimensionality is to consider how large we have to make a neighborhood/window to include a given percentage of the data. For example, suppose we have one continuous predictor with equally spaced points in the [0,1] interval and we want to create windows that include 1/10th of the data. Then it’s easy to see that our windows have to be of size 0.1:

rafalib::mypar()
x <- seq(0, 1, len=100)
y <- rep(1, 100)
plot(x, y, xlab = "", ylab = "", cex = 0.25, yaxt = "n", xaxt = "n", type = "n")
lines(x[c(15,35)], y[c(15,35)], col = "blue", lwd = 3)
points(x,y, cex = 0.25)
points(x[25],y[25], col = "blue", cex = 0.5, pch = 4)
text(x[c(15,35)], y[c(15,35)], c("[","]"))

Now, for two predictors, if we decide to keep the neighborhood just as small, 10% for each dimension, we include only 1 point:

or, if we want to include 10% of the data, we need to increase the window size to \(\sqrt{0.10} \approx 0.316\) in each dimension:

To include 10% of the data in a case with \(p\) features, we need an interval for each predictor that covers \(0.10^{1/p}\) of its total range. This proportion quickly gets close to 1, at which point we are using essentially all the data and are no longer smoothing:
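To see how quickly this happens, we can plot \(0.10^{1/p}\) against \(p\). This is a small sketch using base R plotting, consistent with the code above:

# Side of the window needed per predictor to capture 10% of the data, as a function of p
p <- 1:100
plot(p, 0.1^(1/p), type = "l", ylim = c(0, 1),
     xlab = "Number of predictors p",
     ylab = "Proportion of each predictor's range needed")
abline(h = 1, lty = 2)   # a proportion of 1 means the window spans the entire range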

Here we look at a set of elegant and versatile methods that adapt to higher dimensions and allow the prediction regions to take more complex shapes, while still producing models that are interpretable. These are very popular, well-studied methods. We will concentrate on regression and decision trees and their extension, random forests.

Regression and Decision Trees

Consider the olives dataset below. We show two measured predictors, linoleic (percent linoleic acid of sample) and eicosenoic (percent eicosenoic acid of sample). Suppose we wanted to predict the olive’s region using these two predictors.

library(tidyverse)
olives <- read.csv("https://raw.githubusercontent.com/datasciencelabs/data/master/olive.csv", as.is = TRUE) %>% as_tibble()
names(olives)[1] <- "province"
region_names <- c("Southern Italy", "Sardinia", "Northern Italy")
olives <- olives %>% mutate(Region = factor(region_names[Region]))

p <- olives %>% ggplot(aes(eicosenoic, linoleic, fill = Region)) +
     geom_point(pch = 21)
p

Note that we can describe a classification algorithm that would work pretty much perfectly:

p <- p + geom_vline(xintercept = 6.5) + 
         geom_segment(x = -2, y = 1053.5, xend = 6.5, yend = 1053.5)
p

The prediction algorithm inferred from the figure above is what we call a decision tree. If eicosenoic is larger than 6.5, predict Southern Italy. If not, then predict Sardinia if linoleic is larger than about 1,054, and Northern Italy otherwise. We can draw this decision tree like this:

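This rule is simple enough to write down directly. Here is a hand-coded sketch of the same tree (the cutoffs 6.5 and 1,053.5 are read off the lines drawn in the figure above), which we can check against the data:

# Hand-coded version of the decision tree described above
predict_region <- function(eicosenoic, linoleic) {
  ifelse(eicosenoic > 6.5, "Southern Italy",
         ifelse(linoleic > 1053.5, "Sardinia", "Northern Italy"))
}
table(predicted = predict_region(olives$eicosenoic, olives$linoleic),
      truth = olives$Region)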

Decision trees like this are often used in practice. For example, to decide if a person is at risk of having a heart attack, doctors use the following:

The general idea of the methods we are describing is to define an algorithm that uses data to create these trees. Regression and decision trees operate by predicting an outcome variable \(Y\) by partitioning the feature (predictor) space.

Regression Trees

Let’s start with the case of a continuous outcome. The general idea here is to build a decision tree and, at each terminal node (leaf), make a different prediction \(\hat{Y}\) for the outcome \(Y\).

The regression tree model then:

  1. Partitions space into \(J\) non-overlapping regions, \(R_1, R_2, \ldots, R_J\).
  2. For every observation that falls within region \(R_j\), predict the response as the mean of responses for training observations in \(R_j\).

The important observation is that Regression Trees create partitions recursively.

For example, consider finding a good predictor \(j\) to partition space along its axis. A recursive algorithm would look like this:

Find predictor \(j\) and value \(s\) that minimize RSS:

\[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]

Where \(R_1\) and \(R_2\) are regions resulting from splitting observations on predictor \(j\) and value \(s\):

\[ R_1(j,s) = \{X \mid X_j < s\} \text{ and } R_2(j,s) = \{X \mid X_j \geq s\} \]

This is then applied recursively to regions \(R_1\) and \(R_2\). Within each region a prediction is made using \(\hat{y}_{R_j}\) which is the mean of the response \(Y\) of observations in \(R_j\).
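To make the recursion concrete, here is a minimal sketch (for illustration only; this is not the implementation used by the tree package below) of a single best-split search, i.e., finding the predictor \(j\) and cutpoint \(s\) that minimize the RSS defined above:

# Sketch: exhaustive search for the single best split of a data frame of predictors X and outcome y
best_split <- function(X, y) {
  best <- list(rss = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {
    for (s in sort(unique(X[[j]]))) {
      left  <- y[X[[j]] <  s]   # observations falling in R_1(j, s)
      right <- y[X[[j]] >= s]   # observations falling in R_2(j, s)
      if (length(left) == 0 || length(right) == 0) next
      rss <- sum((left - mean(left))^2) + sum((right - mean(right))^2)
      if (rss < best$rss) best <- list(rss = rss, j = names(X)[j], s = s)
    }
  }
  best
}
# Try it on simulated data with a single true cutpoint at x1 = 0.5
set.seed(2)
toy   <- data.frame(x1 = runif(100), x2 = runif(100))
toy_y <- ifelse(toy$x1 > 0.5, 10, 2) + rnorm(100)
best_split(toy, toy_y)   # should recover a cutpoint near x1 = 0.5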

Let’s take a look at what this algorithm does on the Boston Housing dataset. This dataset contains information collected by the U.S. Census Service concerning housing in the area of Boston, MA in the 1970s, and was obtained from the StatLib archive. It includes medv (median value of owner-occupied homes in $1000s), lstat (percent of individuals with lower socioeconomic status), rm (average number of rooms per dwelling), and dis (weighted distances to five Boston employment centres), among others, for a total of 14 variables.

The dataset is small in size with only 506 cases, but we’ll use it for educational purposes.

library(tree)
library(MASS)
set.seed(1)
# Randomly sample half of the data for training
train = sample(1:nrow(Boston), nrow(Boston)/2) 
# Fit a regression tree using all of the available predictors
fit = tree(medv ~ ., Boston, subset = train)  
# Print a summary of the tree
summary(fit)
## 
## Regression tree:
## tree(formula = medv ~ ., data = Boston, subset = train)
## Variables actually used in tree construction:
## [1] "rm"    "lstat" "crim"  "age"  
## Number of terminal nodes:  7 
## Residual mean deviance:  10.38 = 2555 / 246 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -10.1800  -1.7770  -0.1775   0.0000   1.9230  16.5800
# Use tree for prediction
preds <- predict(fit, newdata = Boston[-train,])
test = Boston[-train, "medv"]

Note that the output indicates only 4 of the predictors (rm, lstat, crim, and age) were used to construct the tree. Let’s plot it and take a look:

plot(fit, type = "uniform")
text(fit, cex = 1)

The tree suggests that lower values of lstat correspond to more expensive houses and predicts a median house price of $46,380 for larger homes in the suburbs in which residents have high socioeconomic status (rm >= 7.437 and lstat < 9.715).

The idea behind the regression tree is that the outcome \(Y\) is estimated (or predicted) to be its mean within each of the data partitions. Think of it as the conditional mean of \(Y\), where the conditioning is given by this partitioning of the predictor space.

We can also use the predictions made to calculate MSE (mean square error) to compare models.

plot(preds, test)
abline(0,1)

mean((preds-test)^2)
## [1] 35.28688

The test set MSE associated with this tree is about 35.3. The square root of the MSE is about 5.94, indicating that this model leads to test predictions that are, on average, within approximately $5,940 of the true median home value for the suburb.
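For reference, the dollar figure comes from taking the square root of the test set MSE (recall that medv is recorded in thousands of dollars):

sqrt(mean((preds - test)^2))   # roughly 5.94, i.e. about $5,940 in the units of medv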

Specifics of the regression tree algorithm

The recursive partitioning algorithm described above leads to a set of natural questions:

  1. When do we stop partitioning?

We stop when adding a partition does not reduce MSE, or when a partition has too few training observations. Even then, trees built with this stopping criterion tend to overfit the training data. To avoid this, a post-processing step called pruning is used to make the tree smaller.

  2. Why would a smaller tree tend to generalize better?

A smaller tree partitions the space into fewer, larger regions, so each prediction averages over more training observations and has lower variance; a very deep tree can end up chasing noise in the training data. The cv.tree function uses cross-validation to determine a reasonable tree size (number of terminal nodes) for the given dataset. For this dataset the cross-validated error, or “deviance”, seems to reach its minimum at the largest tree size (here 7 terminal nodes), suggesting little pruning is needed:

cv_boston = cv.tree(fit)
plot(cv_boston$size, cv_boston$dev, type = 'b')

However, if we decide to prune a tree we can do so using the prune.tree() function:

prune_boston = prune.tree(fit, best = 5)
plot(prune_boston, type = "uniform")
text(prune_boston)

Let’s calculate the test set MSE for the pruned tree:

preds_prune <- predict(prune_boston, newdata = Boston[-train,])
test = Boston[-train, "medv"]
mean((preds_prune-test)^2)
## [1] 35.90102

The MSE of the pruned tree is slightly greater than that of the original tree, so pruning does not help here and we keep the unpruned tree.

Classification (Decision) Trees

Classification, or decision trees, are used in classification problems where the outcome is categorical. The same partitioning principle is used, but now each region predicts the majority class for training observations within that region. The recursive partitioning algorithm we saw previously requires a score function to choose predictors (and values) to partition with. In classification we could use a naive approach of looking for partitions that minimize training error. However, better performing approaches use more sophisticated metrics. Here are two of the most popular (denoted for leaf \(m\)):

  • Gini Index: \(\sum_{k=1}^K \hat{p}_{mk}(1-\hat{p}_{mk})\), or

  • Entropy: \(-\sum_{k=1}^K \hat{p}_{mk}\log(\hat{p}_{mk})\)

where \(\hat{p}_{mk}\) is the proportion of training observations in partition \(m\) labeled as class \(k\). Both measures are smallest when a partition contains observations of predominantly one class, so minimizing them seeks partitions that are as pure (homogeneous in label) as possible.
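As a quick numerical check of this intuition, here is a small sketch computing both measures from a vector of class proportions \(\hat{p}_{m1}, \ldots, \hat{p}_{mK}\) for a given partition:

# Impurity measures as functions of the class proportions in a partition
gini    <- function(p_hat) sum(p_hat * (1 - p_hat))
entropy <- function(p_hat) -sum(ifelse(p_hat > 0, p_hat * log(p_hat), 0))

gini(c(1, 0));     entropy(c(1, 0))       # a pure partition: both measures are 0
gini(c(0.5, 0.5)); entropy(c(0.5, 0.5))   # a 50/50 partition: both are at their maximum (K = 2)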

Let us look at how a classification tree performs on the digits example we examined before:
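A minimal fit consistent with the code used later in this section (assuming digits_train contains a two-class factor label and the predictors X_1 and X_2) would be:

library(tree)
fit <- tree(label ~ X_1 + X_2, data = digits_train)   # classification tree on the two-predictor digits data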

We can see the prediction here:

f_hat_cart <- predict(fit, newdata = true_f)[,2]

p <-true_f %>% mutate(f = f_hat_cart) %>%
               ggplot(aes(X_1, X_2, fill = f)) +
               scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) + 
               geom_raster() +  
               stat_contour(aes(x = X_1, y = X_2, z = f), 
               data = mutate(true_f, f = f_hat_cart),
               breaks = c(0.5), color = "black", lwd = 1.5) +
               guides(fill = "none")

library(gridExtra)
grid.arrange(true_f_plot, p, nrow=1)

We can again prune the tree if we wish, but in this case the pruned tree does not differ much from the original.

pruned_fit <- prune.tree(fit)
plot(pruned_fit)

Here is what a pruned tree looks like:

pruned_fit  <- prune.tree(fit, k = 160)
f_hat_cart2 <- predict(pruned_fit, newdata = true_f)[,2]
p <-true_f %>% mutate(f = f_hat_cart2) %>%
               ggplot(aes(X_1, X_2, fill = f)) +
               scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) + 
               geom_raster() +  
               stat_contour(aes(x = X_1, y = X_2, z = f), 
               data = mutate(true_f, f = f_hat_cart2),
               breaks = c(0.5),color = "black", lwd = 1.5)
p

Classification trees have certain advantages that make them very useful. They are highly interpretable, arguably even more so than linear models; they are easy to visualize (if small enough); they arguably mimic human decision processes; and they do not require dummy variables for categorical predictors.

On the other hand, the greedy approach via recursive partitioning is a bit harder to train than linear regression, and it may not always be the best performing method: it is not very flexible and is highly unstable to changes in the training data. Below we introduce the bootstrap, which will help address this.

Bootstrap

Suppose the income distribution of your population is as follows:

hist(log10(income))

The population median is

m <- median(income)
m
## [1] 45317.72

Suppose we don’t have access to the entire population but want to estimate the median \(m\). We take a sample of 250 and estimate the population median \(m\) with the sample median \(M\):

set.seed(1)
N <- 250
X <- sample(income, N)
M <- median(X)
M
## [1] 41626.3

Can we construct a confidence interval? What is the distribution of \(M\)?

From a Monte Carlo simulation we see that the distribution of \(M\) is approximately normal with the following expected value and standard error:

B <- 10^5
Ms <- replicate(B, {
  X <- sample(income, N)
  M <- median(X)
})
par(mfrow=c(1,2))
hist(Ms)
qqnorm(Ms)
qqline(Ms)

mean(Ms)
## [1] 45471.42
sd(Ms)
## [1] 3650.328

The problem here is that, as we have described before, in practice we do not have access to the entire population, so we cannot run this Monte Carlo simulation. In the past we have used the central limit theorem, but the CLT we studied applies to averages, and here we are interested in the median.

The bootstrap permits us to approximate a Monte Carlo simulation without access to the entire population. The general idea is relatively simple: we act as if the observed sample is the population and draw datasets of the same size from it, sampling with replacement. We then compute the summary statistic, in this case the median, on each of these bootstrap samples.

There is theory telling us that the distribution of the statistic computed on bootstrap samples approximates the distribution of our actual statistic. This is how we construct bootstrap samples and an approximate distribution:

B <- 10^5
M_stars <- replicate(B, {
  X_star <- sample(X, N, replace = TRUE)
  M_star <- median(X_star)
})

Now we can check how close it is to the actual distribution:

qqplot(Ms, M_stars)
abline(0,1)  

We see it is not perfect but it provides a decent approximation:

quantile(Ms, c(0.05, 0.95))
##       5%      95% 
## 39727.90 51722.74
quantile(M_stars, c(0.05, 0.95))
##       5%      95% 
## 38041.02 46207.35

This is much better than what we get if we mindlessly use the CLT:

median(X) + 1.96 * sd(X)/sqrt(N) * c(-1,1)
## [1] 29392.54 53860.06

If the distribution of \(M\) is approximately normal, as the qq-plot above suggests, we can also construct confidence intervals using the expected value and standard error estimated from the Monte Carlo simulation or from the bootstrap samples:

mean(Ms) + 1.96*sd(Ms)*c(-1,1)
## [1] 38316.78 52626.07
mean(M_stars) + 1.96*sd(M_stars)*c(-1,1)
## [1] 37239.74 46112.89

Random Forests

Random Forests are a very popular approach that addresses the shortcomings of decision trees by re-sampling the training data. The goal is to improve prediction performance and reduce instability by averaging multiple decision trees (a forest of trees constructed with randomness). The approach has two features that help accomplish this.

The first trick is bagging (bootstrap aggregation). The general scheme is:

  1. Build many decision trees \(T_1, T_2, \ldots, T_B\) from training set
  2. Given a new observation, let each \(T_j\) predict \(\hat{y}_j\)
  3. For regression: predict average \(\frac{1}{B} \sum_{j=1}^B \hat{y}_j\), for classification: predict with majority vote (most frequent class)

But how do we get many decision trees from a single training set?

For this we use the bootstrap. To create \(T_j, \, j=1,\ldots,B\) from a training set of size \(N\):

  1. Create a bootstrap training set by sampling \(N\) observations from training set with replacement
  2. Build a decision tree from the bootstrap training set (see the sketch below)
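
Here is a minimal hand-rolled sketch of this scheme for the Boston data, for illustration only; it reuses Boston, the train indices, and the test values from the regression tree section, and below we use randomForest(), which implements this properly:

library(tree)
library(MASS)
set.seed(1)
B <- 25                                                   # number of bagged trees (kept small here)
boot_preds <- sapply(1:B, function(b) {
  idx   <- sample(train, length(train), replace = TRUE)   # step 1: bootstrap training set
  fit_b <- tree(medv ~ ., data = Boston[idx, ])           # step 2: tree built on the bootstrap sample
  predict(fit_b, newdata = Boston[-train, ])              # predictions on the held-out rows
})
preds_manual_bag <- rowMeans(boot_preds)                  # regression: average the B predictions
mean((preds_manual_bag - test)^2)                         # test set MSE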

The randomForest() function does this for us (and much faster). Let’s apply it to the Boston housing dataset. Here, mtry = 13 indicates that all 13 predictors should be considered as candidates for each split of the tree; in other words, that bagging should be done.

library(randomForest)
library(MASS)
fit_bag <- randomForest(medv ~ ., data = Boston, mtry = 13)
fit_bag
## 
## Call:
##  randomForest(formula = medv ~ ., data = Boston, mtry = 13) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##           Mean of squared residuals: 10.54275
##                     % Var explained: 87.51

How well does it perform on the test set?

preds_bag = predict(fit_bag, newdata = Boston[-train,])
plot(preds_bag, test)
abline(0,1)

mean((preds_bag - test)^2)
## [1] 2.304846

This looks much better than the single decision tree we used above, but the comparison is not entirely fair: this forest was fit to the full Boston dataset, so the test observations were also part of its training data. Refitting with subset = train (as we did for the single tree) gives an honest test set MSE; bagging typically still improves substantially on a single tree.

We can grow a random forest in the same way, except now we use a smaller value for mtry. The default for regression trees is \(p/3\) (\(p\) is the number of predictors) and the default for classification trees is \(\sqrt p\). Let’s try mtry = 6.

fit_rf <- randomForest(medv ~ ., data = Boston, mtry = 6, importance = TRUE)
preds_rf = predict(fit_rf, newdata = Boston[-train,])
mean((preds_rf - test)^2)
## [1] 2.265554

The MSE for this random forest (mtry = 6) is in fact slightly lower than for the bagged fit that used all the predictors, so restricting the predictors considered at each split does not hurt performance here (and the same caveat about fitting on the full dataset applies).

The second Random Forest feature is to use a random selection of features when deciding how to split. Specifically, when building each tree \(T_j\), at each recursive partition we only consider a randomly selected subset of predictors as candidates for the best split. This reduces the correlation between trees in the forest, improving prediction accuracy.
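Continuing the earlier best_split() sketch, the random-subset idea amounts to something like this (illustration only, not the randomForest() internals):

# At each split, consider only a random subset of mtry candidate predictors
best_split_rf <- function(X, y, mtry = floor(sqrt(ncol(X)))) {
  candidates <- sample(names(X), mtry)            # random subset of predictors for this split
  best_split(X[, candidates, drop = FALSE], y)    # best split restricted to those candidates
}
# e.g. best_split_rf(toy, toy_y, mtry = 1) considers a single randomly chosen predictor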

Here is the random forest fit for the digits data:

detach("package:MASS", unload=TRUE)
library(randomForest)
fit <- randomForest(label ~ X_1 + X_2, data = digits_train)
f_hat_rf <- predict(fit, newdata = true_f, type = "prob")[,2]

p <-true_f %>% mutate(f = f_hat_rf) %>%
               ggplot(aes(X_1, X_2, fill = f)) +
               scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) + 
               geom_raster() + 
               stat_contour(aes(x = X_1, y = X_2, z = f), 
               data = mutate(true_f, f = f_hat_rf),
               breaks = c(0.5), color = "black", lwd = 1.5) +
               guides(fill = "none")

library(gridExtra)
grid.arrange(true_f_plot, p, nrow = 1)

We can control the “smoothness” of the random forest estimate in several ways. One way is to limit the size of each tree by requiring a larger minimum number of training observations per terminal node (the nodesize argument):

fit <- randomForest(as.factor(label) ~ X_1  +X_2,
                    nodesize = 250,
                    data = digits_train)
f_hat_rf <- predict(fit, newdata = true_f, type = "prob")[,2]

p <-true_f %>% mutate(f = f_hat_rf) %>%
               ggplot(aes(X_1, X_2, fill = f))  +
               scale_fill_gradientn(colors = c("#F8766D","white","#00BFC4")) + 
               geom_raster() +  
               stat_contour(aes(x = X_1, y = X_2, z = f), 
               data = mutate(true_f, f = f_hat_rf),
               breaks = c(0.5), color = "black", lwd = 1.5) +
               guides(fill = "none")
p

We can compare the results:

library(caret)
get_accuracy <- function(fit){
  pred <- predict(fit, newdata = digits_test, type = "class")
  confusionMatrix(table(pred = pred, true = digits_test$label))$overall[1]
}
fit <- tree(label ~ X_1 + X_2, data = digits_train)
get_accuracy(fit)
## Accuracy 
## 0.808069
fit <- randomForest(label ~ X_1 + X_2, data = digits_train)
get_accuracy(fit)
##  Accuracy 
## 0.8059701
fit <- randomForest(label ~ X_1 + X_2,
                    nodesize = 250,
                    data = digits_train)
get_accuracy(fit)
##  Accuracy 
## 0.8206623

A disadvantage of random forests is that we lose the interpretability of a single tree. However, we can still measure variable importance from the random forest: for example, how much each predictor decreases node impurity (the Gini index reported below) across the trees, or, using the bootstrap, how much out-of-bag prediction accuracy drops when a predictor is permuted.

Let’s see this using all the digits data:

url <- "https://raw.githubusercontent.com/datasciencelabs/data/master/hand-written-digits-train.csv"
digits <- read_csv(url)

digits <- mutate(digits, label = as.factor(label))

inTrain   <- createDataPartition(y = digits$label, p = 0.9)
train_set <- slice(digits, inTrain$Resample1)
test_set  <- slice(digits, -inTrain$Resample1)

fit <- randomForest(label~., ntree = 100, data = train_set)

How well does it do?

pred <- predict(fit, newdata = test_set, type = "class")
confusionMatrix(table(pred = pred, true = test_set$label))
## Confusion Matrix and Statistics
## 
##     true
## pred   0   1   2   3   4   5   6   7   8   9
##    0 406   0   2   1   0   1   2   0   1   4
##    1   0 461   0   0   2   1   1   1   2   0
##    2   1   2 406   5   0   1   0   2   1   0
##    3   0   2   1 408   0   9   0   0   4   6
##    4   0   0   3   0 391   1   0   5   3   2
##    5   0   0   0   6   0 356   4   0   3   2
##    6   0   1   2   1   1   1 404   0   1   0
##    7   0   1   0   2   0   0   0 425   0   6
##    8   6   1   1   7   2   7   2   1 387   4
##    9   0   0   2   5  11   2   0   6   4 394
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9623          
##                  95% CI : (0.9561, 0.9679)
##     No Information Rate : 0.1115          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9581          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.98305   0.9850  0.97362  0.93793  0.96069  0.93931
## Specificity           0.99709   0.9981  0.99682  0.99415  0.99631  0.99607
## Pos Pred Value        0.97362   0.9850  0.97129  0.94884  0.96543  0.95957
## Neg Pred Value        0.99815   0.9981  0.99709  0.99283  0.99578  0.99399
## Prevalence            0.09843   0.1115  0.09938  0.10367  0.09700  0.09032
## Detection Rate        0.09676   0.1099  0.09676  0.09724  0.09318  0.08484
## Detection Prevalence  0.09938   0.1115  0.09962  0.10248  0.09652  0.08842
## Balanced Accuracy     0.99007   0.9916  0.98522  0.96604  0.97850  0.96769
##                      Class: 6 Class: 7 Class: 8 Class: 9
## Sensitivity           0.97821   0.9659  0.95320  0.94258
## Specificity           0.99815   0.9976  0.99182  0.99206
## Pos Pred Value        0.98297   0.9793  0.92584  0.92925
## Neg Pred Value        0.99762   0.9960  0.99497  0.99364
## Prevalence            0.09843   0.1049  0.09676  0.09962
## Detection Rate        0.09628   0.1013  0.09223  0.09390
## Detection Prevalence  0.09795   0.1034  0.09962  0.10105
## Balanced Accuracy     0.98818   0.9818  0.97251  0.96732

Here is a table of variable importance for the random forest we just constructed.

library(knitr)
variable_importance <- importance(fit) 
tmp <- tibble(feature = rownames(variable_importance),
              Gini = variable_importance[,1]) %>%
       arrange(desc(Gini))
kable(tmp[1:10,])
feature        Gini
pixel350   353.6952
pixel433   339.3635
pixel378   316.8655
pixel461   280.3615
pixel487   247.0005
pixel377   244.1287
pixel405   237.6620
pixel381   231.1008
pixel489   228.5381
pixel290   226.4892

We can see where the “important” features are:

expand.grid(Row = 1:28, Column = 1:28) %>%
            mutate(value = variable_importance[,1]) %>%
            ggplot(aes(Row, Column, fill = value)) +
            geom_raster() +
            scale_y_reverse() 

And a barplot of the same data showing only the most predictive features:

tmp %>% filter(Gini > 200) %>%
        ggplot(aes(x=reorder(feature, Gini), y=Gini)) +
        geom_bar(stat='identity') +
        coord_flip() + xlab("Feature") +
        theme(axis.text=element_text(size=8))

Tree-based methods summary

Tree-based methods are very interpretable prediction models for which some inferential tasks are possible (e.g., assessing variable importance in random forests), although these are much more limited than for the linear models we saw previously. These methods are commonly used across many application domains, and Random Forests often deliver state-of-the-art performance on many tasks.