Decision trees can be applied to both regression and classification problems. Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions. To make a prediction for a given observation, we typically use the mean or the mode of the response values of the training observations in the region to which it belongs.
We first discuss decision trees for regression and classification problems, followed by bagging and random forests in the next section.
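As a toy illustration of this idea, consider a single predictor split into two regions at an assumed cutpoint; the prediction for each region is simply the mean of the training responses that fall in it (the simulated data and the cutpoint of 100 below are made up purely for illustration):
# Toy illustration: stratify one predictor at an assumed cutpoint and predict with region means
set.seed(1)
x <- runif(50, 0, 300)                    # hypothetical predictor
y <- ifelse(x < 100, 5, 15) + rnorm(50)   # hypothetical response
cutpoint <- 100                           # assumed split point
mean(y[x < cutpoint])   # prediction for observations with x < 100
mean(y[x >= cutpoint])  # prediction for observations with x >= 100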
## Regression Trees Using Advertising Data ##
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
# Divide Data to Train and Test Set
set.seed(27) # You may change this seed number to change the random selection
train.index <- sample(1:200, 160, replace=FALSE) # Randomly assign 160 of the 200 observations (80%) to training
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Build the regression tree using "tree" package
# Run install.packages("tree") first if not yet installed
library(tree)
fit.tree <- tree(Sales ~ TV + Radio + Newspaper, data=train)
summary(fit.tree)
##
## Regression tree:
## tree(formula = Sales ~ TV + Radio + Newspaper, data = train)
## Variables actually used in tree construction:
## [1] "TV" "Radio"
## Number of terminal nodes: 8
## Residual mean deviance: 2.056 = 312.5 / 152
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -5.0350 -0.9386  0.1393  0.0000  0.8645  3.5330
plot(fit.tree)
text(fit.tree)
# Size of Tree and Prediction Performance
cv.fit.tree <- cv.tree(fit.tree)
plot(cv.fit.tree$size, cv.fit.tree$dev, type="b")
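If the cross-validation plot suggests that a smaller tree achieves a comparable deviance, the fit can be pruned with prune.tree(); a minimal sketch (the choice of 5 terminal nodes is only an assumed example and should be read off the plot):
# Prune the regression tree to an assumed best size of 5 terminal nodes
prune.fit.tree <- prune.tree(fit.tree, best=5)
plot(prune.fit.tree)
text(prune.fit.tree)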
# Test Set Prediction
fit.tree.pred <- predict(fit.tree, newdata=test)
plot(fit.tree.pred, test$Sales) # Scatter Plot of Predicted vs Observed
abline(0,1) # 45-degree line (perfect prediction)
mean((fit.tree.pred - test$Sales)^2) # MSE
## [1] 3.085704
## Classification Trees Using Default Data ##
# Load the Default data from the ISLR2 package
library(ISLR2)
data("Default")
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
# Validation Set Approach
# We divide the data into train and test sets with proportional allocation to the outcome variable
library(caret)
set.seed(160)
train.index <- createDataPartition(Default$default, list=FALSE, p=0.7) # 70% for training
train <- Default[train.index,]
test <- Default[-train.index,]
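To check that createDataPartition() indeed preserved the class proportions, the outcome distribution can be compared between the two sets:
# Compare the proportion of defaulters in the training and test sets
prop.table(table(train$default))
prop.table(table(test$default))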
# Build Classification tree using the training data
library(tree)
fit.tree <- tree(default ~ ., data=train)
summary(fit.tree)
##
## Classification tree:
## tree(formula = default ~ ., data = train)
## Variables actually used in tree construction:
## [1] "balance"
## Number of terminal nodes: 5
## Residual mean deviance: 0.1657 = 1159 / 6996
## Misclassification error rate: 0.02828 = 198 / 7001
plot(fit.tree)
text(fit.tree)
# Predict Default status in test data
predicted <- predict(fit.tree, test, type="class")
actual <- test$default
# Construct Confusion Matrix
table(predicted, actual)
##          actual
## predicted   No  Yes
##       No  2888   60
##       Yes   12   39
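The overall test misclassification rate follows directly from the confusion matrix, here (60 + 12) / 2999 ≈ 0.024; it can also be computed directly:
# Test set misclassification rate and accuracy
mean(predicted != actual)
mean(predicted == actual)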
## Classification Trees Using Iris Data ##
# Load Iris data from the datasets package.
# Unlike the Default data, the outcome variable here has 3 categories.
iris <- datasets::iris
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Divide Data into Train and Test Sets
library(caret)
set.seed(123)
train.index <- createDataPartition(iris$Species, list=FALSE, p=0.7)
train <- iris[train.index,]
test <- iris[-train.index,]
# Build the Classification Tree Using Train Data
fit.tree <- tree(Species ~ ., data=train)
summary(fit.tree)
##
## Classification tree:
## tree(formula = Species ~ ., data = train)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width"
## Number of terminal nodes: 5
## Residual mean deviance: 0.1173 = 11.73 / 100
## Misclassification error rate: 0.02857 = 3 / 105
plot(fit.tree)
text(fit.tree)
# Predict the Species in Test Data
predicted <- predict(fit.tree, test, type="class")
actual <- test$Species
# Construct Confusion Matrix
table(predicted, actual)
##             actual
## predicted    setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13
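As with regression trees, cross-validation can guide pruning of a classification tree, this time using the misclassification rate rather than the deviance as the criterion; a minimal sketch (the size of 3 is only an assumed example and should be read off the plot):
# Cross-validate the classification tree using misclassification error
cv.fit.tree <- cv.tree(fit.tree, FUN=prune.misclass)
plot(cv.fit.tree$size, cv.fit.tree$dev, type="b")
# Prune to an assumed best size of 3 terminal nodes
prune.fit.tree <- prune.misclass(fit.tree, best=3)
plot(prune.fit.tree)
text(prune.fit.tree)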
Decision trees for regression and classification have a number of advantages and disadvantages relative to more classical approaches:
- Trees are very easy to explain, arguably even easier than linear regression.
- It is believed that decision trees more closely mirror human decision-making than regression and classification approaches do.
- Trees can be displayed graphically, especially if they are small.
- Trees can handle qualitative predictors without the need to create dummy variables.
- On the other hand, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
- Trees can also be very non-robust: a small change in the data can cause a large change in the final estimated tree, as the sketch after this list illustrates.
The last disadvantage can be mitigated by aggregation methods such as bagging, random forests, and boosting, which are discussed in the next section.
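As a rough illustration of this instability, one can fit the same tree on two slightly different random subsets of the iris data and compare the resulting structures (the subset size of 120 and the seed values below are arbitrary):
# Illustrate non-robustness: fit trees on two slightly different subsets
set.seed(1)
sub1 <- iris[sample(nrow(iris), 120), ]
set.seed(2)
sub2 <- iris[sample(nrow(iris), 120), ]
tree1 <- tree(Species ~ ., data=sub1)
tree2 <- tree(Species ~ ., data=sub2)
par(mfrow=c(1,2))
plot(tree1); text(tree1)
plot(tree2); text(tree2)
par(mfrow=c(1,1))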