Introduction

This document explores tree-based methods for classification and regression, focusing on:

  • Decision Trees: Understanding the structure and pruning techniques.
  • Bagging and Random Forests: Reducing variance and improving accuracy.
  • Boosting: Improving predictions by sequentially correcting weak models.

Tree-based methods are non-parametric and effective for capturing complex relationships in data.


1. Load Required Packages

# Install required packages if not installed
# install.packages(c("ISLR", "tree", "MASS", "randomForest", "gbm"))

# Load packages
library(ISLR)          # Datasets from ISLR book
library(tree)          # Decision Trees
library(MASS)          # Boston dataset
library(randomForest)  # Bagging & Random Forests
library(gbm)           # Boosting

2. Classification Trees: Carseats Dataset

Objective

The Carseats dataset contains 400 simulated records of child car seat sales, along with predictors such as price, shelving location, and advertising budget.
We create a binary response, High, equal to Yes when Sales exceeds 8 (thousand units) and No otherwise, and build a decision tree to predict it.

# Load dataset
data(Carseats)

# Convert Sales into a categorical variable
Carseats$High <- factor(ifelse(Carseats$Sales <= 8, "No", "Yes"))

# Fit a classification tree
tree.carseats <- tree(High ~ . - Sales, data = Carseats)
summary(tree.carseats)
## 
## Classification tree:
## tree(formula = High ~ . - Sales, data = Carseats)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Income"      "CompPrice"   "Population" 
## [6] "Advertising" "Age"         "US"         
## Number of terminal nodes:  27 
## Residual mean deviance:  0.4575 = 170.7 / 373 
## Misclassification error rate: 0.09 = 36 / 400

Interpretation

  • The tree splits on the most informative predictors (e.g., ShelveLoc and Price appear near the top).
  • The training misclassification error rate (0.09 = 36 / 400) measures how well the tree fits the training data; the quick check below reproduces it.
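
As a quick check (a minimal sketch, assuming the objects above are still in the workspace), the reported training error can be recomputed from the fitted classes:

# Recompute the training misclassification rate from the fitted classes
train.pred <- predict(tree.carseats, type = "class")  # fitted classes for the training data
mean(train.pred != Carseats$High)                     # should match the reported 0.09 (36 / 400)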

Visualising the Tree

plot(tree.carseats)
text(tree.carseats, pretty = 0)

Pruning the Decision Tree using Cross-Validation

The unpruned tree has 27 terminal nodes, making it complex and prone to overfitting. We will use cost-complexity pruning to simplify the tree and improve its generalisation.

Steps for Pruning:

  1. Perform Cross-Validation (cv.tree)
    • This function identifies the optimal tree size by minimising classification error.
  2. Find the Best Tree Size
    • The tree size with the lowest cross-validation error is selected.
  3. Prune the Tree (prune.misclass)
    • We trim the tree down to its optimal size.

Applying Cross-Validation and Pruning

# Perform cross-validation for pruning
set.seed(42)
cv.carseats <- cv.tree(tree.carseats, FUN = prune.misclass)

# Identify the best tree size
best.size <- cv.carseats$size[which.min(cv.carseats$dev)]

# Prune the tree to the optimal size
prune.carseats <- prune.misclass(tree.carseats, best = best.size)

# Plot the pruned tree
plot(prune.carseats)
text(prune.carseats, pretty = 0)

Interpreting the Pruned Tree

  • The pruned tree retains only the most important splits, improving interpretability.
  • It avoids overfitting by reducing unnecessary complexity.
  • The test accuracy may improve as a result of better generalisation.

Model Evaluation: Train-Test Split

# Split data into training (50%) and test (50%) sets
set.seed(42)
train <- sample(1:nrow(Carseats), 200)
Carseats.test <- Carseats[-train, ]
High.test <- Carseats$High[-train]

# Train model on training set
tree.carseats <- tree(High ~ . - Sales, Carseats, subset = train)

# Predict on test data
tree.pred <- predict(tree.carseats, Carseats.test, type = "class")

# Accuracy assessment
test.table <- table(tree.pred, High.test)
accuracy <- sum(diag(test.table)) / sum(test.table)
paste("Test Accuracy:", round(accuracy * 100, 2), "%")
## [1] "Test Accuracy: 77 %"

3. Regression Trees: Boston Housing Data

Objective

Predict the median home value (medv, in $1000s) from neighbourhood attributes.

# Load dataset
data(Boston)

# Train-test split
set.seed(42)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
boston.train <- Boston[train, ]
boston.test <- Boston[-train, ]

# Fit a regression tree
tree.boston <- tree(medv ~ ., data = boston.train)
summary(tree.boston)
## 
## Regression tree:
## tree(formula = medv ~ ., data = boston.train)
## Variables actually used in tree construction:
## [1] "rm"    "lstat" "dis"   "nox"  
## Number of terminal nodes:  10 
## Residual mean deviance:  16.67 = 4051 / 243 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -22.04000  -2.02900  -0.02258   0.00000   1.64000  13.97000

Interpretation

  • The tree selects key predictors, such as the average number of rooms (rm) and the percentage of lower-status residents (lstat).
  • The deviance is the sum of squared residuals on the training data; the residual mean deviance divides it by n minus the number of terminal nodes (4051 / 243). The sketch below recovers it from the fitted values.
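
To see this concretely, the reported deviance can be recovered from the fitted values (a quick sketch, assuming tree.boston from the chunk above):

# The reported deviance is the residual sum of squares on the training data
yhat.train <- predict(tree.boston)                # fitted values for boston.train
sum((yhat.train - boston.train$medv)^2)           # approximately the reported 4051
mean((yhat.train - boston.train$medv)^2)          # training MSE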

Pruning the Tree

# Cross-validation to find optimal tree size
cv.boston <- cv.tree(tree.boston)

# Best tree size
best.size <- cv.boston$size[which.min(cv.boston$dev)]

# Prune tree
prune.boston <- prune.tree(tree.boston, best = best.size)

# Plot pruned tree
plot(prune.boston)
text(prune.boston, pretty = 0)

Test Set Evaluation

# Predict on test set
yhat <- predict(prune.boston, newdata = boston.test)

# Compute test MSE
test.mse <- mean((yhat - boston.test$medv)^2)
paste("Test MSE:", round(test.mse, 2))
## [1] "Test MSE: 21.06"

4. Bagging and Random Forests

Why Use Bagging and Random Forests?

  • Bagging reduces variance by averaging predictions from many trees grown on bootstrap samples of the training data.
  • Random Forests improve on bagging by considering only a random subset of predictors at each split, which decorrelates the trees.

# Bagging: Use all predictors in every split (mtry = 13 for the Boston data)
bag.boston <- randomForest(medv ~ ., data = boston.train, mtry = 13, importance = TRUE)

# Predict on test set
yhat.bag <- predict(bag.boston, newdata = boston.test)
bag.mse <- mean((yhat.bag - boston.test$medv)^2)
paste("Bagging Test MSE:", round(bag.mse, 2))
## [1] "Bagging Test MSE: 9.78"

Optimising Bagging and Random Forests

Bagging and Random Forests improve stability and reduce variance compared to a single tree. We fine-tune these models by adjusting the number of predictors considered at each split (mtry) and the number of trees (ntree).

Bagging (Bootstrap Aggregation)

# Bagging: Use all predictors in every split (mtry = total predictors)
bag.boston <- randomForest(medv ~ ., data = boston.train, mtry = 13, ntree = 500, importance = TRUE)

# Predict on test set
yhat.bag <- predict(bag.boston, newdata = boston.test)
bag.mse <- mean((yhat.bag - boston.test$medv)^2)
paste("Bagging Test MSE:", round(bag.mse, 2))
## [1] "Bagging Test MSE: 9.69"

Random Forest (Choosing Optimal mtry)

# Random Forest: use a subset of predictors at each split (here mtry = 6; p/3 is the regression default)
rf.boston <- randomForest(medv ~ ., data = boston.train, mtry = 6, ntree = 500, importance = TRUE)

# Predict on test set
yhat.rf <- predict(rf.boston, newdata = boston.test)
rf.mse <- mean((yhat.rf - boston.test$medv)^2)
paste("Random Forest Test MSE:", round(rf.mse, 2))
## [1] "Random Forest Test MSE: 9.67"
# Variable importance plot
varImpPlot(rf.boston)

Interpreting the Results

  • Bagging vs Random Forests:
    • Bagging uses all predictors at each split, so the bootstrapped trees remain highly correlated and the variance reduction is limited.
    • Random Forests introduce extra randomness by considering only a subset of predictors (mtry) at each split, which decorrelates the trees and lowers variance further.
  • Choosing the Best Model:
    • If mtry = total number of predictors → the model behaves exactly like bagging.
    • If mtry < total number of predictors → the trees are decorrelated, which usually improves prediction accuracy (compared directly in the sketch below).
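
To make the comparison concrete, one rough sketch is to loop over a few mtry values and compare test MSEs (numbers will vary with the seed, and this takes a little longer to run):

# Compare test MSE across a few mtry values (p = 13 predictors for Boston)
mtry.values <- c(4, 6, 13)   # ~p/3, the value used above, and full bagging
mse.by.mtry <- sapply(mtry.values, function(m) {
  fit <- randomForest(medv ~ ., data = boston.train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = boston.test) - boston.test$medv)^2)
})
names(mse.by.mtry) <- paste0("mtry = ", mtry.values)
mse.by.mtry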

Random Forest (Improving Bagging)

# Refit the Random Forest with mtry = 6 (the result differs slightly from the fit above because each run draws different bootstrap samples)
rf.boston <- randomForest(medv ~ ., data = boston.train, mtry = 6, importance = TRUE)

# Predict on test set
yhat.rf <- predict(rf.boston, newdata = boston.test)
rf.mse <- mean((yhat.rf - boston.test$medv)^2)
paste("Random Forest Test MSE:", round(rf.mse, 2))
## [1] "Random Forest Test MSE: 9.26"
# Variable importance
varImpPlot(rf.boston)
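
The numeric scores behind the plot can also be inspected directly; randomForest's importance() reports both the increase in MSE when a variable is permuted (%IncMSE) and the total decrease in node impurity (IncNodePurity):

# Numeric importance measures behind varImpPlot()
importance(rf.boston)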


5. Boosting: Improving Performance

Optimising Boosting Parameters

Boosting improves performance by sequentially correcting weak learners. We tune key parameters:

  • n.trees: The number of boosting iterations (too high can lead to overfitting).
  • interaction.depth: Controls complexity (higher values allow deeper trees).
  • shrinkage (learning rate): A small value (e.g., 0.01) improves generalisation but increases computation time.

Fine-Tuning Boosting Parameters

# Boosting model with a small learning rate (shrinkage = 0.01) and 5-fold cross-validation
boost.boston <- gbm(medv ~ ., data = boston.train, distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4, shrinkage = 0.01, cv.folds = 5)

# Predict on test set
yhat.boost <- predict(boost.boston, newdata = boston.test, n.trees = 5000)
boost.mse <- mean((yhat.boost - boston.test$medv)^2)
paste("Boosting Test MSE:", round(boost.mse, 2))
## [1] "Boosting Test MSE: 8.91"

Interpreting the Results

  • Choosing shrinkage:
    • A lower shrinkage value (e.g., 0.01) improves accuracy but requires more trees.
    • A higher shrinkage value (e.g., 0.2) converges faster but may overfit.
  • Choosing interaction.depth:
    • If too small → underfits the data.
    • If too large → overfits and increases variance.
  • Using Cross-Validation (cv.folds):
    • Helps identify the best number of trees by monitoring the cross-validation error (see the sketch below).
    • Guards against overfitting and gives a more realistic estimate of out-of-sample performance.
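
To act on the cross-validation point, gbm.perf() estimates the optimal number of trees from the CV error curve; a minimal sketch using the cv.folds = 5 fit from earlier:

# Pick the number of trees at the minimum of the CV error curve, then predict with it
best.iter <- gbm.perf(boost.boston, method = "cv")   # also plots training vs CV error
yhat.cv <- predict(boost.boston, newdata = boston.test, n.trees = best.iter)
paste("Boosting Test MSE (CV-chosen trees):", round(mean((yhat.cv - boston.test$medv)^2), 2))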

Why Boosting?

Boosting fits trees sequentially, with each new tree correcting the errors of the current ensemble, which typically reduces test error. For comparison, here is a fit with the package's default learning rate (shrinkage = 0.1 in current versions of gbm) and no cross-validation:

# Fit a boosting model with the default shrinkage
boost.boston <- gbm(medv ~ ., data = boston.train, distribution = "gaussian",
                    n.trees = 5000, interaction.depth = 4)

# Predict on test set
yhat.boost <- predict(boost.boston, newdata = boston.test, n.trees = 5000)
boost.mse <- mean((yhat.boost - boston.test$medv)^2)
paste("Boosting Test MSE:", round(boost.mse, 2))
## [1] "Boosting Test MSE: 9.95"

In summary

Model                  Best use case
Pruned Decision Tree   Best for interpretability, when explaining results is important.
Bagging                Reduces variance; good for datasets with moderate complexity.
Random Forests         Best for generalisation; handles high-dimensional data well.
Boosting               Best for minimising error, but may overfit if not tuned properly.

Conclusion

  • Decision Trees provide interpretability but may overfit.
  • Bagging & Random Forests reduce variance, improving predictions.
  • Boosting achieved the lowest test MSE in this analysis (8.91 with tuned parameters), but requires careful tuning.

Tree-based methods are powerful tools for classification and regression, especially when using ensembles.