This document explores tree-based methods for classification and regression, covering decision trees, pruning, bagging, random forests, and boosting. Tree-based methods are non-parametric and effective for capturing complex relationships in data.
# Install required packages if not installed
# install.packages(c("ISLR", "tree", "MASS", "randomForest", "gbm"))
# Load packages
library(ISLR) # Datasets from ISLR book
library(tree) # Decision Trees
library(MASS) # Boston dataset
library(randomForest) # Bagging & Random Forests
library(gbm) # Boosting
The Carseats dataset contains 400 simulated records of child car seat sales, along with predictors such as price, shelf location, and advertising. We classify sales as High (Yes) or Low (No) and build a decision tree to predict this.
# Load dataset
data(Carseats)
# Convert Sales into a categorical variable
Carseats$High <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))
# Fit a classification tree
tree.carseats <- tree(High ~ . - Sales, data = Carseats)
summary(tree.carseats)
##
## Classification tree:
## tree(formula = High ~ . - Sales, data = Carseats)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "CompPrice" "Population"
## [6] "Advertising" "Age" "US"
## Number of terminal nodes: 27
## Residual mean deviance: 0.4575 = 170.7 / 373
## Misclassification error rate: 0.09 = 36 / 400
The summary shows that the most influential splitting variables include ShelveLoc and Price.
plot(tree.carseats)
text(tree.carseats, pretty = 0)
The initial tree is quite complex, making it prone to overfitting. We will use cost-complexity pruning to simplify the tree and improve its generalisation.
We use cv.tree() with FUN = prune.misclass so that cross-validation selects the tree size based on the misclassification error rate rather than the deviance.
# Perform cross-validation for pruning
set.seed(42)
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)
# Identify the best tree size
best.size <- cv.carseats$size[which.min(cv.carseats$dev)]
# Prune the tree to the optimal size
prune.carseats <- prune.misclass(tree.carseats, best = best.size)
# Plot the pruned tree
plot(prune.carseats)
text(prune.carseats, pretty = 0)
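As a quick sanity check on the chosen size, the cross-validation results stored in cv.carseats can be plotted directly (a minimal diagnostic sketch):
# Visualise cross-validation error against tree size
plot(cv.carseats$size, cv.carseats$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV misclassifications")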
# Split data into training (50%) and test (50%) sets
set.seed(42)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train, ]
High.test <- Carseats$High[-train]
# Train model on training set
tree.carseats <- tree(High ~ . - Sales, Carseats, subset = train)
# Predict on test data
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")
# Accuracy assessment
test.table <- table(tree.pred, High.test)
accuracy <- sum(diag(test.table)) / sum(test.table)
paste("Test Accuracy:", round(accuracy * 100, 2), "%")
## [1] "Test Accuracy: 77 %"
We now turn to the Boston housing data and predict median home values (medv) based on neighborhood attributes.
# Load dataset
data(Boston)
# Train-test split
set.seed(42)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.train <- Boston[train, ]
boston.test <- Boston[-train, ]
# Fit a regression tree
tree.boston <- tree(medv ~ ., data = boston.train)
summary(tree.boston)
##
## Regression tree:
## tree(formula = medv ~ ., data = boston.train)
## Variables actually used in tree construction:
## [1] "rm" "lstat" "dis" "nox"
## Number of terminal nodes: 10
## Residual mean deviance: 16.67 = 4051 / 243
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -22.04000 -2.02900 -0.02258 0.00000 1.64000 13.97000
The fitted tree splits mainly on the average number of rooms (rm) and the lower status percentage (lstat).
# Cross-validation to find optimal tree size
cv.boston <- cv.tree(tree.boston)
# Best tree size
best.size <- cv.boston$size[which.min(cv.boston$dev)]
# Prune tree
prune.boston <- prune.tree(tree.boston, best = best.size)
# Plot pruned tree
plot(prune.boston)
text(prune.boston, pretty = 0)
# Predict on test set
yhat <- predict(prune.boston, newdata = boston.test)
# Compute test MSE
test.mse <- mean((yhat - boston.test$medv)^2)
paste("Test MSE:", round(test.mse, 2))
## [1] "Test MSE: 21.06"
# Bagging: Use all predictors in every split
bag.boston <- randomForest(medv ~ ., data = boston.train, mtry = 13, importance = TRUE)
# Predict on test set
yhat.bag <- predict(bag.boston, newdata = boston.test)
bag.mse <- mean((yhat.bag - boston.test$medv)^2)
paste("Bagging Test MSE:", round(bag.mse, 2))
## [1] "Bagging Test MSE: 9.78"
Bagging and Random Forests improve stability and reduce variance compared to a single tree. We fine-tune these models by adjusting the number of predictors considered at each split (mtry) and the number of trees (ntree).
# Bagging: Use all predictors in every split (mtry = total predictors)
bag.boston <- randomForest(medv ~ ., data = boston.train, mtry = 13, ntree = 500, importance = TRUE)
# Predict on test set
yhat.bag <- predict(bag.boston, newdata = boston.test)
bag.mse <- mean((yhat.bag - boston.test$medv)^2)
paste("Bagging Test MSE:", round(bag.mse, 2))
## [1] "Bagging Test MSE: 9.69"
Random forests instead consider only a random subset of predictors at each split (mtry).
# Random Forest: Use a subset of predictors at each split (mtry = 6)
rf.boston <- randomForest(medv ~ ., data = boston.train, mtry = 6, ntree = 500, importance = TRUE)
# Predict on test set
yhat.rf <- predict(rf.boston, newdata = boston.test)
rf.mse <- mean((yhat.rf - boston.test$medv)^2)
paste("Random Forest Test MSE:", round(rf.mse, 2))
## [1] "Random Forest Test MSE: 9.67"
# Variable importance plot
varImpPlot(rf.boston)
Bagging vs Random Forests: bagging uses all predictors at every split, while random forests consider only a random subset of predictors at each split (mtry), reducing correlation between trees and further lowering variance.

Choosing the Best Model:

- mtry = total predictors => behaves like bagging.
- mtry < total predictors => benefits from decorrelated trees, improving prediction accuracy (a tuning sketch follows the next code block).

# Random Forest: Use only 6 predictors in each split
rf.boston <- randomForest(medv ~ ., data = boston.train, mtry = 6, importance = TRUE)
# Predict on test set
yhat.rf <- predict(rf.boston, newdata = boston.test)
rf.mse <- mean((yhat.rf - boston.test$medv)^2)
paste("Random Forest Test MSE:", round(rf.mse, 2))
## [1] "Random Forest Test MSE: 9.26"
# Variable importance
varImpPlot(rf.boston)
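To make the mtry comparison concrete, a small grid of mtry values can be evaluated on the test set (a sketch for illustration; the grid values are assumptions and the exact errors vary with the seed):
# A small grid search over mtry values (Boston has 13 predictors)
mtry.grid <- c(2, 4, 6, 9, 13)
mtry.mse <- sapply(mtry.grid, function(m) {
  set.seed(42)
  fit <- randomForest(medv ~ ., data = boston.train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = boston.test) - boston.test$medv)^2)
})
data.frame(mtry = mtry.grid, test.mse = round(mtry.mse, 2))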
Boosting improves performance by sequentially correcting weak learners. We tune key parameters:

- n.trees: the number of boosting iterations (too high can lead to overfitting).
- interaction.depth: controls tree complexity (higher values allow deeper trees).
- shrinkage (learning rate): a small value (e.g., 0.01) improves generalisation but increases computation time.

# Boosting model with tuned shrinkage and built-in cross-validation
boost.boston <- gbm(medv ~ ., data = boston.train, distribution = "gaussian",
n.trees = 5000, interaction.depth = 4, shrinkage = 0.01, cv.folds = 5)
# Predict on test set
yhat.boost <- predict(boost.boston, newdata = boston.test, n.trees = 5000)
boost.mse <- mean((yhat.boost - boston.test$medv)^2)
paste("Boosting Test MSE:", round(boost.mse, 2))
## [1] "Boosting Test MSE: 8.91"
Key settings in this fit:

- shrinkage = 0.01: a low learning rate, so each tree contributes only a small update.
- interaction.depth = 4: allows up to four-way interactions between predictors.
- cv.folds = 5: five-fold cross-validation, which can be used to select the number of trees.
Boosting sequentially corrects weak models, reducing test error.
# Fit a boosting model with the default shrinkage (0.1) for comparison
boost.boston <- gbm(medv ~ ., data = boston.train, distribution = "gaussian",
n.trees = 5000, interaction.depth = 4)
# Predict on test set
yhat.boost <- predict(boost.boston, newdata = boston.test, n.trees = 5000)
boost.mse <- mean((yhat.boost - boston.test$medv)^2)
paste("Boosting Test MSE:", round(boost.mse, 2))
## [1] "Boosting Test MSE: 9.95"
| Model | Best Use Case |
|---|---|
| Pruned Decision Tree | Best for interpretability, when explaining results is important. |
| Bagging | Reduces variance, good for datasets with moderate complexity. |
| Random Forests | Best for generalisation, handles high-dimensional data well. |
| Boosting | Best for minimising error, but may overfit if not tuned properly. |
Tree-based methods are powerful tools for classification and regression, especially when using ensembles.