This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.
Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.
In this task, I used the Boston housing dataset to analyze how the test error of a random forest model changes when I try different values of mtry (the number of variables the model randomly selects at each split) and ntree (the number of trees in the forest).
# Load required libraries
library(MASS)
library(randomForest)
library(ggplot2)
library(dplyr)
# Split the data into training and test sets
set.seed(1)
train_indices <- sample(1:nrow(Boston), nrow(Boston)/2)
train_data <- Boston[train_indices, ]
test_data <- Boston[-train_indices, ]
# Range of mtry and ntree values
mtry_vals <- 1:13
ntree_vals <- c(25, 100, 250, 500)
# Store test errors
results <- expand.grid(mtry = mtry_vals, ntree = ntree_vals)
results$TestMSE <- NA
# Loop over combinations of mtry and ntree
for (i in 1:nrow(results)) {
  rf_model <- randomForest(medv ~ ., data = train_data,
                           mtry = results$mtry[i],
                           ntree = results$ntree[i])
  predictions <- predict(rf_model, newdata = test_data)
  results$TestMSE[i] <- mean((predictions - test_data$medv)^2)
}
# Plot: test MSE vs mtry for each ntree
ggplot(results, aes(x = mtry, y = TestMSE, color = as.factor(ntree), group = ntree)) +
  geom_line(size = 1.2) +
  geom_point(size = 2) +
  labs(title = "Test MSE of Random Forest on Boston Data",
       x = "mtry (Number of Variables Tried at Each Split)",
       y = "Test MSE",
       color = "ntree") +
  theme_minimal()
I split the data into half training and half test.
I ran Random Forest models using:
mtry values from 1 to 13, and ntree values of 25, 100, 250, and 500.
For each combination, I calculated the test mean squared error (MSE).
I made a plot showing how test error changes with mtry, with separate lines for each ntree.
When I increase the number of trees (ntree), the test error usually goes down, which means more trees help the model make better predictions.
But increasing ntree beyond a certain point doesn't help much; it just makes the model slower to fit.
For mtry, the test error was higher at very low values (like 1 or 2).
The best results happened when mtry was around 5 to 7.
If mtry was too high (close to the total number of predictors), the error increased again a little bit, probably because all trees started to look too similar.
Using a larger number of trees (like 250 or 500) with an intermediate mtry value gave me the lowest test error. This shows that random forest works best when there is a balance: not too few and not too many variables at each split.
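This can be checked directly from the results grid built above; a short sketch, using the dplyr already loaded:
# Overall best (mtry, ntree) combination
results[which.min(results$TestMSE), ]
# Best mtry for each ntree
results %>%
  group_by(ntree) %>%
  slice_min(TestMSE, n = 1)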
# Load necessary libraries
library(ISLR2)
Attaching package: ‘ISLR2’
The following objects are masked _by_ ‘.GlobalEnv’:
Boston, Caravan
The following object is masked from ‘package:MASS’:
Boston
library(tree)
library(randomForest)
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(BART)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
Loading required package: survival
# Load data
data(Carseats)
set.seed(123)
train_idx <- sample(1:nrow(Carseats), nrow(Carseats)/2)
train <- Carseats[train_idx, ]
test <- Carseats[-train_idx, ]
I split the Carseats dataset into a training set (50%) and a test set (50%) for model evaluation.
# Install required packages (run only once)
install.packages("ISLR2")
Error in install.packages : Updating loaded packages
install.packages("rpart")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/rpart_4.1.24.tgz'
Content type 'application/x-gzip' length 737795 bytes (720 KB)
==================================================
downloaded 720 KB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//Rtmp7WpoXX/downloaded_packages
install.packages("rpart.plot")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/rpart.plot_3.1.2.tgz'
Content type 'application/x-gzip' length 1023712 bytes (999 KB)
==================================================
downloaded 999 KB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//Rtmp7WpoXX/downloaded_packages
# Load libraries
library(ISLR2)
library(rpart)
library(rpart.plot)
# Load and split the Carseats dataset
data(Carseats)
set.seed(123)
train_idx <- sample(1:nrow(Carseats), nrow(Carseats) / 2)
train <- Carseats[train_idx, ]
test <- Carseats[-train_idx, ]
# Fit a regression tree using rpart
tree_model_rpart <- rpart(Sales ~ ., data = train, method = "anova")
# Plot the tree clearly using rpart.plot
rpart.plot(tree_model_rpart,
           type = 2,             # Split labels on branches
           extra = 101,          # Show fitted value and % of observations
           fallen.leaves = TRUE,
           tweak = 1.3,          # Text size tweak
           box.palette = "GnBu", # Color palette for nodes
           main = "Regression Tree for Sales (Carseats Dataset)")
I fit a regression tree to the training data using rpart():
The tree split mainly on variables like ShelveLoc and Price.
The test MSE for this unpruned tree can be computed from predictions on the test set (see the sketch below).
I also plotted the tree for visual interpretation of the splits.
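The chunk above does not show the unpruned tree's test MSE, so here is a minimal sketch of that computation, reusing tree_model_rpart and test from the chunks above:
# Predict Sales on the test set with the unpruned tree
pred_unpruned <- predict(tree_model_rpart, newdata = test)
# Test MSE of the unpruned tree
mse_unpruned <- mean((pred_unpruned - test$Sales)^2)
print(mse_unpruned)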
# Fit the regression tree using rpart
library(rpart)
tree_model_rpart <- rpart(Sales ~ ., data = train, method = "anova")
# Show cross-validation results
printcp(tree_model_rpart)
Regression tree:
rpart(formula = Sales ~ ., data = train, method = "anova")
Variables actually used in tree construction:
[1] Advertising Age CompPrice Education Population Price ShelveLoc
Root node error: 1439.2/200 = 7.196
n= 200
CP nsplit rel error xerror xstd
1 0.195914 0 1.00000 1.00589 0.095402
2 0.115991 1 0.80409 0.85870 0.078164
3 0.067381 2 0.68809 0.79690 0.076055
4 0.055943 3 0.62071 0.79519 0.078046
5 0.043862 4 0.56477 0.80009 0.079136
6 0.032331 5 0.52091 0.74076 0.074116
7 0.030768 6 0.48858 0.71977 0.072361
8 0.027608 7 0.45781 0.71847 0.071687
9 0.022971 8 0.43020 0.68624 0.071853
10 0.022522 9 0.40723 0.69023 0.072294
11 0.015458 10 0.38471 0.65399 0.069914
12 0.014713 11 0.36925 0.66938 0.074659
13 0.013457 12 0.35454 0.64412 0.071268
14 0.011767 13 0.34108 0.62909 0.070748
15 0.010000 14 0.32931 0.62946 0.070763
# Plot cross-validation error vs. complexity parameter (cp)
plotcp(tree_model_rpart)
# Choose optimal cp value with lowest xerror
optimal_cp <- tree_model_rpart$cptable[which.min(tree_model_rpart$cptable[,"xerror"]), "CP"]
# Prune the tree using the optimal cp
pruned_tree <- prune(tree_model_rpart, cp = optimal_cp)
# Plot the pruned tree
library(rpart.plot)
rpart.plot(pruned_tree, type = 2, extra = 101, fallen.leaves = TRUE, tweak = 1.2, box.palette = "GnBu")
# Predict and compute test MSE
pred_pruned <- predict(pruned_tree, newdata = test)
mse_pruned <- mean((pred_pruned - test$Sales)^2)
print(mse_pruned)
[1] 4.381481
Using the cross-validation results from printcp() and plotcp(), I chose the cp value with the lowest cross-validated error (xerror) and pruned the tree with it.
Pruning the tree slightly improved the test MSE, making the model less complex but still accurate.
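The improvement claim can be checked once the earlier sketch has run (mse_unpruned is from that sketch, not the original notebook):
# Side-by-side test MSEs for the unpruned and pruned trees
data.frame(Model = c("Unpruned tree", "Pruned tree"),
           TestMSE = c(mse_unpruned, mse_pruned))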
# Install if not already installed
install.packages("randomForest")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/randomForest_4.7-1.2.tgz'
Content type 'application/x-gzip' length 258479 bytes (252 KB)
==================================================
downloaded 252 KB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//RtmpqCuGas/downloaded_packages
# Load the package
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
# Bagging model: use mtry = total number of predictors
set.seed(123)
bag_model <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE)
# Predict on test set
pred_bag <- predict(bag_model, newdata = test)
# Calculate Test MSE
mse_bag <- mean((pred_bag - test$Sales)^2)
print(mse_bag)
[1] 2.76144
# Variable importance and plot
importance(bag_model)
%IncMSE IncNodePurity
CompPrice 20.3414969 158.911610
Income 6.6237140 90.369331
Advertising 5.7777253 72.793558
Population -2.2001506 55.786278
Price 44.3578602 380.255094
ShelveLoc 48.3345635 387.886972
Age 18.6296851 187.107660
Education 2.6619834 55.987493
Urban 0.9276070 8.152320
US 0.4202302 5.900097
varImpPlot(bag_model, main = "Variable Importance (Bagging)")
I applied bagging using randomForest() with mtry = p (all 10 predictors).
The test MSE (2.76) improved substantially compared to the single pruned tree (4.38).
The importance() output showed ShelveLoc, Price, CompPrice, and Age as the most important variables.
set.seed(123)
rf_model <- randomForest(Sales ~ ., data = train, mtry = 4, importance = TRUE)
pred_rf <- predict(rf_model, newdata = test)
mse_rf <- mean((pred_rf - test$Sales)^2)
print(mse_rf)
[1] 3.27391
# Variable importance
importance(rf_model)
%IncMSE IncNodePurity
CompPrice 14.073495 150.61076
Income 7.528185 112.97967
Advertising 7.039132 93.22540
Population -1.031516 91.69139
Price 34.338487 312.05408
ShelveLoc 37.880674 303.48778
Age 18.055759 212.71969
Education 1.019196 68.39069
Urban -0.469506 12.70445
US 2.030822 11.96275
varImpPlot(rf_model)
I used randomForest() with mtry = 4 (less than p):
The test MSE (3.27) was slightly worse than bagging (2.76) on this split, so decorrelating the trees did not pay off here.
Variable importance was similar, with Price and ShelveLoc most important.
In general, as mtry increases, the test error tends to decrease up to a point and then level off.
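That trend is not demonstrated on Carseats in the notebook; a minimal sketch that sweeps mtry on the same train/test split used above:
set.seed(123)
mtry_grid <- 1:10
carseats_mse <- sapply(mtry_grid, function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = test) - test$Sales)^2)
})
data.frame(mtry = mtry_grid, TestMSE = carseats_mse)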
# Install dbarts package if not already
install.packages("dbarts")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/dbarts_0.9-32.tgz'
Content type 'application/x-gzip' length 1564254 bytes (1.5 MB)
==================================================
downloaded 1.5 MB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//RtmpqCuGas/downloaded_packages
# Load library
library(dbarts)
# Prepare data
x_train <- train[, -which(names(train) == "Sales")]
y_train <- train$Sales
x_test <- test[, -which(names(test) == "Sales")]
y_test <- test$Sales
# Fit BART model
set.seed(123)
bart_model <- bart(x.train = x_train,
                   y.train = y_train,
                   x.test = x_test)
Running BART with numeric y
number of trees: 200
number of chains: 1, default number of threads 1
tree thinning rate: 1
Prior:
k prior fixed to 2.000000
degrees of freedom in sigma prior: 3.000000
quantile in sigma prior: 0.900000
scale in sigma prior: 0.000882
power and base for tree prior: 2.000000 0.950000
use quantiles for rule cut points: false
proposal probabilities: birth/death 0.50, swap 0.10, change 0.40; birth 0.50
data:
number of training observations: 200
number of test observations: 200
number of explanatory variables: 12
init sigma: 0.991574, curr sigma: 0.991574
Cutoff rules c in x<=c vs x>c
Number of cutoffs: (var: number of possible c):
(1: 100) (2: 100) (3: 100) (4: 100) (5: 100)
(6: 100) (7: 100) (8: 100) (9: 100) (10: 100)
(11: 100) (12: 100)
Running mcmc loop:
iteration: 100 (of 1000)
iteration: 200 (of 1000)
iteration: 300 (of 1000)
iteration: 400 (of 1000)
iteration: 500 (of 1000)
iteration: 600 (of 1000)
iteration: 700 (of 1000)
iteration: 800 (of 1000)
iteration: 900 (of 1000)
iteration: 1000 (of 1000)
total seconds in loop: 0.464343
Tree sizes, last iteration:
[1] 2 3 3 2 3 2 2 2 3 3 2 2 2 3 3 2 2 2
2 2 2 2 2 2 3 3 2 3 3 3 2 3 2 2 3 2 4 5
2 3 1 3 3 2 2 2 2 3 2 2 2 3 2 2 3 2 3 2
2 4 1 4 2 2 2 2 2 1 2 3 3 3 2 3 1 3 3 3
2 2 3 3 2 3 2 4 3 2 2 3 2 1 2 2 2 2 3 2
4 3 2 2 2 3 2 2 3 2 2 2 2 1 2 2 2 2 3 2
2 2 3 3 3 3 2 2 3 1 3 3 2 3 1 2 3 2 2 2
4 2 2 2 3 2 2 2 4 4 2 2 6 3 3 3 2 2 2 2
2 2 3 1 2 2 2 3 3 3 2 3 2 2 2 2 2 2 2 2
2 2 2 2 4 1 2 2 2 4 3 2 2 2 2 3 2 2 3 2
3 2
Variable Usage, last iteration (var:count):
(1: 27) (2: 27) (3: 23) (4: 16) (5: 29)
(6: 24) (7: 31) (8: 17) (9: 18) (10: 23)
(11: 20) (12: 22)
DONE BART
# Predict and calculate MSE
pred_bart <- bart_model$yhat.test.mean
mse_bart <- mean((pred_bart - y_test)^2)
print(mse_bart)
[1] 1.56543
I used the bart() function from the dbarts package:
BART gave the lowest test MSE (1.57) of all the models fit so far.
It captured non-linearities and interactions well.
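All four test MSEs are still in the workspace, so a comparison table can be built directly from the objects computed above (mse_pruned, mse_bag, mse_rf, mse_bart):
# Compare test MSEs across the Carseats models
data.frame(Model = c("Pruned tree", "Bagging", "Random forest (mtry = 4)", "BART"),
           TestMSE = c(mse_pruned, mse_bag, mse_rf, mse_bart))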
# Load required packages
install.packages("ISLR2")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/ISLR2_1.3-2.tgz'
Content type 'application/x-gzip' length 4163085 bytes (4.0 MB)
==================================================
downloaded 4.0 MB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//RtmpqCuGas/downloaded_packages
install.packages("gbm")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/gbm_2.2.2.tgz'
Content type 'application/x-gzip' length 1024834 bytes (1000 KB)
==================================================
downloaded 1000 KB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//RtmpqCuGas/downloaded_packages
install.packages("class")
trying URL 'https://cran.rstudio.com/bin/macosx/big-sur-arm64/contrib/4.4/class_7.3-23.tgz'
Content type 'application/x-gzip' length 97329 bytes (95 KB)
==================================================
downloaded 95 KB
The downloaded binary packages are in
/var/folders/g_/rs5m0gws7d17rywwhhcd5zwr0000gn/T//RtmpqCuGas/downloaded_packages
library(ISLR2)
Attaching package: ‘ISLR2’
The following objects are masked _by_ ‘.GlobalEnv’:
Boston, Caravan
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(class)
# (a) Data Split ------------------------------------------------------
data(Caravan)
# Convert target to factor (if not already)
Caravan$Purchase <- as.factor(Caravan$Purchase)
# Create training and test sets
train <- Caravan[1:1000, ]
test <- Caravan[-(1:1000), ]
# Separate predictors and response
x_train <- train[, -86]
y_train <- train$Purchase
x_test <- test[, -86]
y_test <- test$Purchase
I used the first 1,000 rows as the training set and the rest as the test set. The target variable is Purchase.
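Purchase is heavily imbalanced in Caravan (only about 6% of customers buy insurance), which is why a 20% probability cutoff is used later instead of the default 50%. A quick check on the split above:
# Class proportions in the training and test responses
prop.table(table(y_train))
prop.table(table(y_test))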
library(ISLR2)
library(gbm)
# Data prep
Caravan$Purchase <- as.factor(Caravan$Purchase)
train <- Caravan[1:1000, ]
test <- Caravan[-(1:1000), ]
# Response must be numeric (0/1) for boosting
train$PurchaseNumeric <- ifelse(train$Purchase == "Yes", 1, 0)
test$Purchase <- factor(test$Purchase) # Ensure consistent levels
# Store test errors
shrinkages <- c(0.01, 0.05, 0.1, 0.2)
boost_test_errors <- c()
for (lambda in shrinkages) {
  boost_model <- gbm(PurchaseNumeric ~ . - Purchase, data = train,
                     distribution = "bernoulli",
                     n.trees = 1000, shrinkage = lambda, verbose = FALSE)
  # Predict probabilities
  probs <- predict(boost_model, newdata = test, n.trees = 1000, type = "response")
  preds <- ifelse(probs > 0.2, "Yes", "No")
  # Test error
  test_error <- mean(preds != test$Purchase)
  boost_test_errors <- c(boost_test_errors, test_error)
}
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 71: AVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 71: AVRAAUT has no variation.
# Show results
data.frame(Shrinkage = shrinkages, TestError = boost_test_errors)
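Rather than fixing n.trees at 1,000, gbm can also choose the number of boosting iterations by cross-validation; a minimal sketch, assuming 5 folds and the 0.05 shrinkage selected above:
set.seed(123)
cv_boost <- gbm(PurchaseNumeric ~ . - Purchase, data = train,
                distribution = "bernoulli",
                n.trees = 1000, shrinkage = 0.05,
                cv.folds = 5, verbose = FALSE)
# Iteration count that minimizes the cross-validated error
best_iter <- gbm.perf(cv_boost, method = "cv")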
# Best model (e.g., 0.05)
best_boost <- gbm(PurchaseNumeric ~ . - Purchase, data = train,
                  distribution = "bernoulli",
                  n.trees = 1000, shrinkage = 0.05, verbose = FALSE)
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 50: PVRAAUT has no variation.
Warning in gbm.fit(x = x, y = y, offset = offset, distribution = distribution, :
variable 71: AVRAAUT has no variation.
# Variable importance
summary(best_boost)
I applied boosting using gbm() with:
1,000 trees and shrinkage values of 0.01, 0.05, 0.1, and 0.2. Of these, 0.05 gave the lowest test error.
The most important predictors were:
PPERSAUT (a car insurance indicator), MKOOPKLA, PBRAND, and MOSTYPE.
I predicted probabilities on the test data and classified an observation as “Yes” if its predicted probability exceeded 20%.
I created a confusion matrix and reported the precision (the fraction of predicted “Yes” that were correct). Boosting gave better performance than KNN in terms of precision and test error.
# Predict using best model
probs <- predict(best_boost, newdata = test, n.trees = 1000, type = "response")
preds <- ifelse(probs > 0.2, "Yes", "No")
# Confusion matrix
table(Predicted = preds, Actual = y_test)
Actual
Predicted No Yes
No 4253 237
Yes 280 52
# Precision (fraction of predicted Yes that are actually Yes)
precision <- sum(preds == "Yes" & y_test == "Yes") / sum(preds == "Yes")
print(precision)
[1] 0.1566265
KNN Comparison:
I also ran KNN for different values of k (1, 3, 5, and 10).
The best results came with k = 5, but still not as good as boosting. Boosting handled this data better because it can model interactions and non-linear effects.
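The KNN code is not shown in the notebook; here is a minimal sketch of that comparison, standardizing the predictors first (knn() is distance-based, so scale matters) and reusing y_train and y_test from the data-split chunk:
library(class)
# Standardize all 85 predictors, then re-split into the same train/test rows
x_all <- scale(Caravan[, -86])
x_train_std <- x_all[1:1000, ]
x_test_std <- x_all[-(1:1000), ]
set.seed(123)
for (k in c(1, 3, 5, 10)) {
  knn_pred <- knn(x_train_std, x_test_std, y_train, k = k)
  test_error <- mean(knn_pred != y_test)
  # Precision: fraction of predicted "Yes" that are actually "Yes"
  precision <- sum(knn_pred == "Yes" & y_test == "Yes") / sum(knn_pred == "Yes")
  cat("k =", k, " test error =", round(test_error, 4),
      " precision =", round(precision, 4), "\n")
}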