Decision trees can be applied to both regression and classification problems. Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions. To make a prediction for a given observation, we typically use the mean (for regression) or the mode (for classification) of the response values of the training observations in the region to which it belongs.
We first discuss decision trees for regression and classification problems, followed by bagging and random forests in the next section.
## Regression Trees Using Advertising Data ##
library(ISLR2)
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
# Divide Data to Train and Test Set
set.seed(27)
train.index <- sample(c(1:200), 150, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Build the tree
library(tree)
tree.ads <- tree(Sales ~ TV + Radio + Newspaper, data=train)
summary(tree.ads)
##
## Regression tree:
## tree(formula = Sales ~ TV + Radio + Newspaper, data = train)
## Variables actually used in tree construction:
## [1] "TV" "Radio"
## Number of terminal nodes: 9
## Residual mean deviance: 1.684 = 237.4 / 141
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.1910 -0.8625 0.0858 0.0000 0.7661 3.5330
plot(tree.ads)
text(tree.ads, pretty=0)
# Size of Tree and Prediction Performance
cv.tree.ads <- cv.tree(tree.ads) # Cross-validation over tree sizes
plot(cv.tree.ads$size, cv.tree.ads$dev, type="b") # Deviance vs number of terminal nodes
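If the cross-validation results favor a smaller tree, it can be pruned to that size with prune.tree(). A minimal sketch (prune.ads is a name introduced here only for illustration):
# Prune to the size with the lowest cross-validated deviance
best.size <- cv.tree.ads$size[which.min(cv.tree.ads$dev)]
prune.ads <- prune.tree(tree.ads, best=best.size)
plot(prune.ads)
text(prune.ads, pretty=0)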
# Test Set Prediction
tree.ads.pred <- predict(tree.ads, newdata=test)
plot(tree.ads.pred, test$Sales) # Scatter Plot of Predicted vs Observed
abline(0,1) # 45-degree reference line: points on the line are perfect predictions
mean((tree.ads.pred - test$Sales)^2) # MSE
## [1] 2.906198
## Classification Trees Using Iris Data ##
# Iris Data
iris <- datasets::iris
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Divide Data into Train and Test Sets
library(caret)
set.seed(123)
train.index <- createDataPartition(iris$Species, list=FALSE, p=0.7)
train <- iris[train.index,]
test <- iris[-train.index,]
# Build the Classification Tree Using Train Data
tree.iris <- tree(Species ~ . , train)
summary(tree.iris)
##
## Classification tree:
## tree(formula = Species ~ ., data = train)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width"
## Number of terminal nodes: 5
## Residual mean deviance: 0.1173 = 11.73 / 100
## Misclassification error rate: 0.02857 = 3 / 105
plot(tree.iris)
text(tree.iris, pretty=0)
# Predict the Species in Test Data
tree.iris.pred <- predict(tree.iris, test, type="class")
# Construct Confusion Matrix
table(tree.iris.pred, test$Species)
##
## tree.iris.pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
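The confusion matrix shows 42 of the 45 test observations classified correctly; the overall test accuracy can also be computed directly:
# Proportion of test observations classified correctly
mean(tree.iris.pred == test$Species)
## [1] 0.9333333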
Decision trees for regression and classification have a number of advantages over more classical approaches:
- Trees are very easy to explain, arguably even easier than linear regression.
- Decision trees are believed to mirror human decision-making more closely than regression and classification approaches do.
- Trees can be displayed graphically, especially when they are small.
- Trees can handle qualitative predictors without the need to create dummy variables.
However, trees also have disadvantages:
- Trees generally do not have the same level of predictive accuracy as some other regression and classification approaches.
- Trees can be very non-robust: a small change in the data can cause a large change in the final estimated tree.
These disadvantages can be mitigated by aggregating many trees through methods such as bagging, random forests, and boosting, which are discussed in the next section.
An ensemble method is an approach that combines simple models (called weak learners) in order to obtain a single potentially powerful model.
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method, and it is particularly useful for decision trees. The idea is to generate B training sets through bootstrap resampling and build a model on each one. For regression problems, the prediction is the average of the predictions of the B trees; for classification problems, the predicted class is the mode (majority vote) over the B trees. Note that using a very large number of bootstrap resamples B does not lead to overfitting; in practice, B is taken large enough for the error to settle down.
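To make the averaging step concrete, here is a minimal hand-rolled bagging sketch for the Advertising regression problem using the tree() function (B is kept small for illustration; the randomForest package used below handles this automatically):
# Hand-rolled bagging: grow B trees on bootstrap resamples and average their predictions
library(tree)
set.seed(27)
train.index <- sample(c(1:200), 150, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
B <- 25 # number of bootstrap resamples
preds <- matrix(NA, nrow(test), B)
for (b in 1:B) {
  boot <- train[sample(nrow(train), replace=TRUE), ] # bootstrap resample of the training set
  fit <- tree(Sales ~ TV + Radio + Newspaper, data=boot) # unpruned tree on the resample
  preds[, b] <- predict(fit, newdata=test)
}
bag.manual.pred <- rowMeans(preds) # average over the B trees (regression)
mean((bag.manual.pred - test$Sales)^2) # test MSE of the hand-rolled bagging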
The bagging procedure for regression and classification, using the randomForest package, is shown in R below.
## Bagging Using Advertising Data (m = p) ##
library(randomForest)
# Divide Data to Train and Test Set
set.seed(27)
train.index <- sample(c(1:200), 150, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Train the Model (B = 500)
set.seed(23)
bag.ads <- randomForest(Sales ~ TV + Radio + Newspaper,
mtry=3, data=train, importance=TRUE)
bag.ads
##
## Call:
## randomForest(formula = Sales ~ TV + Radio + Newspaper, data = train, mtry = 3, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.6962456
## % Var explained: 97.47
# Test the Model
bag.ads.pred <- predict(bag.ads, newdata=test)
plot(bag.ads.pred, test$Sales)
abline(0,1)
mean((bag.ads.pred - test$Sales)^2)
## [1] 0.532457
# Variable Importance
importance(bag.ads)
## %IncMSE IncNodePurity
## TV 99.47349400 2746.19196
## Radio 92.73291947 1266.24661
## Newspaper -0.05267697 25.06207
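The same importance measures can be displayed graphically with the package's built-in plot:
varImpPlot(bag.ads) # Dot charts of %IncMSE and IncNodePurity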
## Bagging Using Iris Data ##
library(randomForest)
# Divide Data into Train and Test Sets
library(caret)
set.seed(123)
train.index <- createDataPartition(iris$Species, list=FALSE, p=0.7)
train <- iris[train.index,]
test <- iris[-train.index,]
# Train the model
set.seed(21)
bag.iris <- randomForest(Species ~ . , data=train,
mtry=4, importance=TRUE) # mtry = p = 4: using all predictors at each split makes this bagging
bag.iris
##
## Call:
## randomForest(formula = Species ~ ., data = train, mtry = 4, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 4.76%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 35 0 0 0.00000000
## versicolor 0 33 2 0.05714286
## virginica 0 3 32 0.08571429
# Test the model
bag.iris.pred <- predict(bag.iris, test)
table(bag.iris.pred, test$Species)
##
## bag.iris.pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
# Variable Importance
importance(bag.iris) # The higher the better
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length 1.001002 5.715499 1.162282 5.224539
## Sepal.Width 0.000000 2.032614 2.777429 3.412962
## Petal.Length 21.271979 30.320031 28.593691 32.827405
## Petal.Width 25.347255 35.362470 36.661378 38.108545
## MeanDecreaseGini
## Sepal.Length 0.6746152
## Sepal.Width 0.7066065
## Petal.Length 29.3222733
## Petal.Width 38.6080478
Random forests improve on bagged trees by randomly selecting a sample of m predictors from the full set of p predictors as split candidates each time a split is considered in a bootstrapped tree. By default, the randomForest package uses roughly the square root of p for classification and p/3 for regression as the value of m. The rationale is that the resulting trees are less correlated with one another: a single very strong predictor cannot dominate every tree, so averaging the trees yields a larger reduction in variance.
## Random Forest (m < p): Sales Regression Problem ##
# Divide Data to Train and Test Set
set.seed(27)
train.index <- sample(c(1:200), 150, replace=FALSE)
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Train the model
set.seed(24)
rf.ads <- randomForest(Sales ~ TV + Radio + Newspaper, data=train,
mtry = 2, importance=TRUE) # (m = 2) < (3 = p)
rf.ads
##
## Call:
## randomForest(formula = Sales ~ TV + Radio + Newspaper, data = train, mtry = 2, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.8505621
## % Var explained: 96.9
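To check whether another value of m would do better, the out-of-bag error can be compared across mtry values. A quick sketch, reusing the train object from the split above:
# Compare OOB MSE across the number of split candidates
for (m in 1:3) {
  set.seed(24)
  fit <- randomForest(Sales ~ TV + Radio + Newspaper, data=train, mtry=m)
  cat("mtry =", m, "OOB MSE =", tail(fit$mse, 1), "\n")
}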
# Test the model
rf.ads.pred <- predict(rf.ads, newdata=test)
plot(rf.ads.pred, test$Sales)
abline(0,1)
mean((rf.ads.pred - test$Sales)^2)
## [1] 0.6254764
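Putting the three Advertising fits side by side on the same test set (this assumes tree.ads.pred and bag.ads.pred from the earlier chunks are still in the workspace; all three splits used set.seed(27), so the test sets are identical):
# Test MSE of the single tree, bagging, and the random forest
c(tree = mean((tree.ads.pred - test$Sales)^2),
  bagging = mean((bag.ads.pred - test$Sales)^2),
  randomForest = mean((rf.ads.pred - test$Sales)^2))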
## Random Forest (m < p): Iris Classification Problem ##
# Divide Data into Train and Test Sets
set.seed(123)
train.index <- createDataPartition(iris$Species, list=FALSE, p=0.7)
train <- iris[train.index,]
test <- iris[-train.index,]
# Train the model
set.seed(24)
rf.iris <- randomForest(Species ~ . , data=train,
mtry=2, importance=TRUE) # mtry = 2 = sqrt(p), with p = 4 predictors
# Test the model
rf.iris.pred <- predict(rf.iris, test)
table(rf.iris.pred, test$Species)
##
## rf.iris.pred setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 2
## virginica 0 1 13
# Variable Importance
importance(rf.iris)
## setosa versicolor virginica MeanDecreaseAccuracy
## Sepal.Length 7.009222 8.402265 7.739983 10.914456
## Sepal.Width 4.394139 1.608043 4.537595 5.462419
## Petal.Length 22.205962 29.103983 27.855758 33.110557
## Petal.Width 21.538905 26.952956 31.627524 31.306264
## MeanDecreaseGini
## Sepal.Length 7.469165
## Sepal.Width 1.550380
## Petal.Length 30.039106
## Petal.Width 30.223216