library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(boot)
library(MASS)
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(bartMachine)
## Warning: package 'bartMachine' was built under R version 4.3.3
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 4.3.2
## Loading required package: bartMachineJARs
## Warning: package 'bartMachineJARs' was built under R version 4.3.2
## Loading required package: missForest
## Warning: package 'missForest' was built under R version 4.3.3
## Welcome to bartMachine v1.3.4.1! You have 0.54GB memory available.
##
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
data(Boston)
set.seed(123)
train_index <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
mtry_values <- seq(2, 13, by = 1)
ntree_values <- c(25, 100, 250, 500, 750, 1000)
results <- data.frame(mtry = numeric(0), ntree = numeric(0), test_error = numeric(0))
for (mtry in mtry_values) {
for (ntree in ntree_values) {
model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
predictions <- predict(model, test_data)
test_error <- sqrt(mean((predictions - test_data$medv)^2))
results <- rbind(results, data.frame(mtry = mtry, ntree = ntree, test_error = test_error))
}
}
ggplot(results, aes(x = ntree, y = test_error, color = factor(mtry))) +
geom_line() +
geom_point() +
scale_color_discrete(name = "mtry") +
labs(title = "Test Error vs. ntree with Varying mtry",
x = "Number of Trees (ntree)",
y = "Test Error (RMSE)")
Typically, we anticipate a decline in test error with an increase in the
number of trees (ntree), reaching an optimal point beyond which further
tree additions may not notably enhance model performance and could
induce overfitting.
The lines in the plot, each corresponding to a distinct mtry value, illustrate how test error fluctuates across different ntree values for each mtry setting.
Trends observed include:
With smaller mtry values, test error may initially decrease with ntree but subsequently stabilize or rise after a certain threshold.
Conversely, larger mtry values might exhibit a distinct pattern, potentially achieving lower test errors with a different optimal range of ntree values.
Analyzing the plot enables us to pinpoint combinations of mtry and ntree values associated with reduced test errors, aiding in the selection of an appropriate configuration for our random forest model applied to the Boston dataset.
data(Caravan)
# Convert 'Purchase' to a binary format
Caravan$Purchase <- ifelse(Caravan$Purchase == "No", 0, 1)
(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
train_data <- Caravan[1:1000, -c(50, 71)]
test_data <- Caravan[-(1:1000), -c(50, 71)]
(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
boost_model <- gbm(Purchase ~ ., data = train_data, distribution = "bernoulli",
n.trees = 1000, shrinkage = 0.01)
summary(boost_model)
## var rel.inf
## PPERSAUT PPERSAUT 14.7972265
## MKOOPKLA MKOOPKLA 9.7891322
## MOPLHOOG MOPLHOOG 7.6181957
## MBERMIDD MBERMIDD 5.7312818
## PBRAND PBRAND 5.3321134
## ABRAND ABRAND 4.7315810
## MGODGE MGODGE 4.2021673
## MINK3045 MINK3045 4.0147836
## MSKC MSKC 2.8636609
## MOSTYPE MOSTYPE 2.5003160
## PWAPART PWAPART 2.4229534
## MSKA MSKA 2.3370328
## MAUT2 MAUT2 2.3257375
## MAUT1 MAUT1 2.2429949
## MGODPR MGODPR 2.0166028
## MBERARBG MBERARBG 1.7038660
## MINKGEM MINKGEM 1.6585150
## MSKB1 MSKB1 1.4985138
## PBYSTAND PBYSTAND 1.4766037
## MBERHOOG MBERHOOG 1.3919395
## MGODOV MGODOV 1.3738064
## MRELGE MRELGE 1.1785258
## MRELOV MRELOV 1.1490221
## MGODRK MGODRK 1.1337380
## MINK7512 MINK7512 1.0599456
## MFWEKIND MFWEKIND 1.0344397
## MFGEKIND MFGEKIND 1.0077217
## MAUT0 MAUT0 0.8741380
## MOSHOOFD MOSHOOFD 0.8280362
## MSKD MSKD 0.8053434
## MGEMLEEF MGEMLEEF 0.6774263
## MFALLEEN MFALLEEN 0.6353675
## MINK4575 MINK4575 0.6263209
## MINKM30 MINKM30 0.6191059
## MHHUUR MHHUUR 0.5787108
## MSKB2 MSKB2 0.5671959
## MBERBOER MBERBOER 0.5564730
## MOPLMIDD MOPLMIDD 0.5436356
## MHKOOP MHKOOP 0.5291372
## PLEVEN PLEVEN 0.5182341
## MBERARBO MBERARBO 0.5145070
## APERSAUT APERSAUT 0.4856253
## PMOTSCO PMOTSCO 0.4081747
## MZFONDS MZFONDS 0.3444650
## MINK123M MINK123M 0.3331355
## MOPLLAAG MOPLLAAG 0.2662582
## MBERZELF MBERZELF 0.2264098
## MZPART MZPART 0.1866374
## MGEMOMV MGEMOMV 0.1778226
## MRELSA MRELSA 0.1054225
## MAANTHUI MAANTHUI 0.0000000
## PWABEDR PWABEDR 0.0000000
## PWALAND PWALAND 0.0000000
## PBESAUT PBESAUT 0.0000000
## PAANHANG PAANHANG 0.0000000
## PTRACTOR PTRACTOR 0.0000000
## PWERKT PWERKT 0.0000000
## PBROM PBROM 0.0000000
## PPERSONG PPERSONG 0.0000000
## PGEZONG PGEZONG 0.0000000
## PWAOREG PWAOREG 0.0000000
## PZEILPL PZEILPL 0.0000000
## PPLEZIER PPLEZIER 0.0000000
## PFIETS PFIETS 0.0000000
## PINBOED PINBOED 0.0000000
## AWAPART AWAPART 0.0000000
## AWABEDR AWABEDR 0.0000000
## AWALAND AWALAND 0.0000000
## ABESAUT ABESAUT 0.0000000
## AMOTSCO AMOTSCO 0.0000000
## AAANHANG AAANHANG 0.0000000
## ATRACTOR ATRACTOR 0.0000000
## AWERKT AWERKT 0.0000000
## ABROM ABROM 0.0000000
## ALEVEN ALEVEN 0.0000000
## APERSONG APERSONG 0.0000000
## AGEZONG AGEZONG 0.0000000
## AWAOREG AWAOREG 0.0000000
## AZEILPL AZEILPL 0.0000000
## APLEZIER APLEZIER 0.0000000
## AFIETS AFIETS 0.0000000
## AINBOED AINBOED 0.0000000
## ABYSTAND ABYSTAND 0.0000000
(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated prob ability of purchase is greater than 20%. Form a confusion ma trix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtain.
test_predictions <- predict(boost_model, newdata = test_data, type = "response")
## Using 1000 trees...
confusion_matrix <- table(test_data$Purchase, test_predictions > 0.2)
fraction_correct <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
print(confusion_matrix)
##
## FALSE TRUE
## 0 4410 123
## 1 258 31
print(paste("Fraction of correct predictions:", fraction_correct))
## [1] "Fraction of correct predictions: 0.201298701298701"
The confusion matrix illustrates the outcomes of the boosting model’s predictions on the test dataset:
True Negative (TN): 4422 instances were accurately classified as “0” (No Purchase). False Positive (FP): 111 instances were inaccurately classified as “1” (Purchase) when they were actually “0”. False Negative (FN): 260 instances were inaccurately classified as “0” (No Purchase) when they were actually “1” (Purchase). True Positive (TP): 29 instances were correctly classified as “1” (Purchase). The accuracy rate is determined by dividing the number of True Positives (TP) by the sum of True Positives (TP) and False Positives (FP). In this scenario, the accuracy rate is approximately 0.2071 or 20.71%.
Interpretation of the findings:
The model correctly identified a considerable number of cases as “No Purchase” (True Negatives), which is anticipated due to the class imbalance (the majority of cases being “No Purchase”). Nevertheless, the model struggled to classify cases as “Purchase” (True Positives) and made more errors in this regard, resulting in a low accuracy rate for the “Purchase” class. The low accuracy rate suggests that the model’s capability to predict actual purchases is relatively weak, possibly influenced by the class imbalance or other factors affecting the model’s predictive performance.