Decision trees can be applied to both regression and classification problems. Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions. To make a prediction for a given observation, we typically use the mean or the mode of the response values of the training observations in the region to which it belongs.
We first discuss decision trees for regression and classification problems, followed by bagging and random forests in the next section.
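As a toy illustration of this idea, consider a single predictor split into two regions at an assumed cutpoint; the prediction for each region is simply the mean of the training responses that fall in it (the simulated data and the cutpoint of 100 below are made up purely for illustration):
# Toy illustration: stratify one predictor at an assumed cutpoint and predict with region means
set.seed(1)
x <- runif(50, 0, 300)                    # hypothetical predictor
y <- ifelse(x < 100, 5, 15) + rnorm(50)   # hypothetical response
cutpoint <- 100                           # assumed split point
mean(y[x < cutpoint])   # prediction for observations with x < 100
mean(y[x >= cutpoint])  # prediction for observations with x >= 100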
## Regression Trees Using Advertising Data ##
setwd("C:\\Users\\Asus\\Documents\\UP Files\\UPV Subjects\\Stat 197 (Intro to BI)")
Advertising <- read.csv(".\\Advertising.csv")
# Divide Data to Train and Test Set
set.seed(27) # You may change this seed number to change the random selection
train.index <- sample(1:200, 160, replace=FALSE) # Randomly assign 160 of the 200 observations (80%) to training
train <- Advertising[train.index,]
test <- Advertising[-train.index,]
# Build the regression tree using "tree" package
# Run install.packages("tree") first if not yet installed
library(tree)
fit.tree <- tree(Sales ~ TV + Radio + Newspaper, data=train)
summary(fit.tree)
##
## Regression tree:
## tree(formula = Sales ~ TV + Radio + Newspaper, data = train)
## Variables actually used in tree construction:
## [1] "TV" "Radio"
## Number of terminal nodes: 8
## Residual mean deviance: 2.056 = 312.5 / 152
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -5.0350 -0.9386  0.1393  0.0000  0.8645  3.5330
plot(fit.tree)
text(fit.tree)
# Size of Tree and Prediction Performance
cv.fit.tree <- cv.tree(fit.tree)
plot(cv.fit.tree$size, cv.fit.tree$dev, type="b")
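If the cross-validation plot suggests that a smaller tree achieves a comparable deviance, the fit can be pruned with prune.tree(); a minimal sketch (the choice of 5 terminal nodes is only an assumed example and should be read off the plot):
# Prune the regression tree to an assumed best size of 5 terminal nodes
prune.fit.tree <- prune.tree(fit.tree, best=5)
plot(prune.fit.tree)
text(prune.fit.tree)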
# Test Set Prediction
fit.tree.pred <- predict(fit.tree, newdata=test)
plot(fit.tree.pred, test$Sales) # Scatter Plot of Predicted vs Observed
abline(0,1) # 45-degree line (perfect prediction)
mean((fit.tree.pred - test$Sales)^2) # MSE
## [1] 3.085704
## Classification Trees Using Default Data ##
# Load the Default data from the ISLR2 package
library(ISLR2)
data("Default")
head(Default)
##   default student   balance    income
## 1      No      No  729.5265 44361.625
## 2      No     Yes  817.1804 12106.135
## 3      No      No 1073.5492 31767.139
## 4      No      No  529.2506 35704.494
## 5      No      No  785.6559 38463.496
## 6      No     Yes  919.5885  7491.559
# Validation Set Approach
# We divide the data into train and test sets with proportional allocation to the outcome variable
library(caret)
set.seed(160)
train.index <- createDataPartition(Default$default, list=FALSE, p=0.7) # 70% for training
train <- Default[train.index,]
test <- Default[-train.index,]
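To check that createDataPartition() indeed preserved the class proportions, the outcome distribution can be compared between the two sets:
# Compare the proportion of defaulters in the training and test sets
prop.table(table(train$default))
prop.table(table(test$default))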
# Build Classification tree using the training data
library(tree)
fit.tree <- tree(default ~ ., data=train)
summary(fit.tree)
##
## Classification tree:
## tree(formula = default ~ ., data = train)
## Variables actually used in tree construction:
## [1] "balance"
## Number of terminal nodes: 5
## Residual mean deviance: 0.1657 = 1159 / 6996
## Misclassification error rate: 0.02828 = 198 / 7001
plot(fit.tree)
text(fit.tree)
# Predict Default status in test data
predicted <- predict(fit.tree, test, type="class")
actual <- test$default
# Construct Confusion Matrix
table(predicted, actual)
##          actual
## predicted   No  Yes
##       No  2888   60
##       Yes   12   39
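The overall test misclassification rate follows directly from the confusion matrix, here (60 + 12) / 2999 ≈ 0.024; it can also be computed directly:
# Test set misclassification rate and accuracy
mean(predicted != actual)
mean(predicted == actual)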
## Classification Trees Using Iris Data ##
# Load Iris data from the datasets package.
# Unlike the Default data, the outcome variable here has 3 categories.
iris <- datasets::iris
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Divide Data into Train and Test Sets
library(caret)
set.seed(123)
train.index <- createDataPartition(iris$Species, list=FALSE, p=0.7)
train <- iris[train.index,]
test <- iris[-train.index,]
# Build the Classification Tree Using Train Data
fit.tree <- tree(Species ~ ., data=train)
summary(fit.tree)
##
## Classification tree:
## tree(formula = Species ~ ., data = train)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width"
## Number of terminal nodes: 5
## Residual mean deviance: 0.1173 = 11.73 / 100
## Misclassification error rate: 0.02857 = 3 / 105
plot(fit.tree)
text(fit.tree)
# Predict the Species in Test Data
predicted <- predict(fit.tree, test, type="class")
actual <- test$Species
# Construct Confusion Matrix
table(predicted, actual)
##             actual
## predicted    setosa versicolor virginica
##   setosa         15          0         0
##   versicolor      0         14         2
##   virginica       0          1        13
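As with regression trees, cross-validation can guide pruning of a classification tree, this time using the misclassification rate rather than the deviance as the criterion; a minimal sketch (the size of 3 is only an assumed example and should be read off the plot):
# Cross-validate the classification tree using misclassification error
cv.fit.tree <- cv.tree(fit.tree, FUN=prune.misclass)
plot(cv.fit.tree$size, cv.fit.tree$dev, type="b")
# Prune to an assumed best size of 3 terminal nodes
prune.fit.tree <- prune.misclass(fit.tree, best=3)
plot(prune.fit.tree)
text(prune.fit.tree)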
Decision trees for regression and classification have a number of advantages and disadvantages relative to more classical approaches:
- Trees are very easy to explain, arguably even easier than linear regression.
- It is believed that decision trees more closely mirror human decision-making than regression and classification approaches do.
- Trees can be displayed graphically, especially if they are small.
- Trees can handle qualitative predictors without the need to create dummy variables.
- On the other hand, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
- Trees can also be very non-robust: a small change in the data can cause a large change in the final estimated tree, as the sketch after this list illustrates.
The last disadvantage can be mitigated by aggregation methods such as bagging, random forests, and boosting, which are discussed in the next section.
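As a rough illustration of this instability, one can fit the same tree on two slightly different random subsets of the iris data and compare the resulting structures (the subset size of 120 and the seed values below are arbitrary):
# Illustrate non-robustness: fit trees on two slightly different subsets
set.seed(1)
sub1 <- iris[sample(nrow(iris), 120), ]
set.seed(2)
sub2 <- iris[sample(nrow(iris), 120), ]
tree1 <- tree(Species ~ ., data=sub1)
tree2 <- tree(Species ~ ., data=sub2)
par(mfrow=c(1,2))
plot(tree1); text(tree1)
plot(tree2); text(tree2)
par(mfrow=c(1,1))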