Importing necessary libaries

library(ISLR) 
## Warning: package 'ISLR' was built under R version 4.3.2
library(boot) 
library(MASS)
library(tidyverse)
## Warning: package 'dplyr' was built under R version 4.3.2
## Warning: package 'lubridate' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(bartMachine)
## Warning: package 'bartMachine' was built under R version 4.3.3
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 4.3.2
## Loading required package: bartMachineJARs
## Warning: package 'bartMachineJARs' was built under R version 4.3.2
## Loading required package: missForest
## Warning: package 'missForest' was built under R version 4.3.3
## Welcome to bartMachine v1.3.4.1! You have 0.54GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.

7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.**

data(Boston)
set.seed(123)
train_index <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
mtry_values <- seq(2, 13, by = 1)
ntree_values <- c(25, 100, 250, 500, 750, 1000)
results <- data.frame(mtry = numeric(0), ntree = numeric(0), test_error = numeric(0))
for (mtry in mtry_values) {
  for (ntree in ntree_values) {
    model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
    predictions <- predict(model, test_data)
    test_error <- sqrt(mean((predictions - test_data$medv)^2))  
    results <- rbind(results, data.frame(mtry = mtry, ntree = ntree, test_error = test_error))
  }
}
ggplot(results, aes(x = ntree, y = test_error, color = factor(mtry))) +
  geom_line() +
  geom_point() +
  scale_color_discrete(name = "mtry") +
  labs(title = "Test Error vs. ntree with Varying mtry",
       x = "Number of Trees (ntree)",
       y = "Test Error (RMSE)")

Typically, we anticipate a decline in test error with an increase in the number of trees (ntree), reaching an optimal point beyond which further tree additions may not notably enhance model performance and could induce overfitting.

The lines in the plot, each corresponding to a distinct mtry value, illustrate how test error fluctuates across different ntree values for each mtry setting.

Trends observed include:

With smaller mtry values, test error may initially decrease with ntree but subsequently stabilize or rise after a certain threshold.

Conversely, larger mtry values might exhibit a distinct pattern, potentially achieving lower test errors with a different optimal range of ntree values.

Analyzing the plot enables us to pinpoint combinations of mtry and ntree values associated with reduced test errors, aiding in the selection of an appropriate configuration for our random forest model applied to the Boston dataset.

11. This question uses the Caravan data set.**

data(Caravan)
# Convert 'Purchase' to a binary format
Caravan$Purchase <- ifelse(Caravan$Purchase == "No", 0, 1)

(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

train_data <- Caravan[1:1000, -c(50, 71)]
test_data <- Caravan[-(1:1000), -c(50, 71)]

(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

boost_model <- gbm(Purchase ~ ., data = train_data, distribution = "bernoulli",
                   n.trees = 1000, shrinkage = 0.01)

summary(boost_model)

##               var    rel.inf
## PPERSAUT PPERSAUT 14.7972265
## MKOOPKLA MKOOPKLA  9.7891322
## MOPLHOOG MOPLHOOG  7.6181957
## MBERMIDD MBERMIDD  5.7312818
## PBRAND     PBRAND  5.3321134
## ABRAND     ABRAND  4.7315810
## MGODGE     MGODGE  4.2021673
## MINK3045 MINK3045  4.0147836
## MSKC         MSKC  2.8636609
## MOSTYPE   MOSTYPE  2.5003160
## PWAPART   PWAPART  2.4229534
## MSKA         MSKA  2.3370328
## MAUT2       MAUT2  2.3257375
## MAUT1       MAUT1  2.2429949
## MGODPR     MGODPR  2.0166028
## MBERARBG MBERARBG  1.7038660
## MINKGEM   MINKGEM  1.6585150
## MSKB1       MSKB1  1.4985138
## PBYSTAND PBYSTAND  1.4766037
## MBERHOOG MBERHOOG  1.3919395
## MGODOV     MGODOV  1.3738064
## MRELGE     MRELGE  1.1785258
## MRELOV     MRELOV  1.1490221
## MGODRK     MGODRK  1.1337380
## MINK7512 MINK7512  1.0599456
## MFWEKIND MFWEKIND  1.0344397
## MFGEKIND MFGEKIND  1.0077217
## MAUT0       MAUT0  0.8741380
## MOSHOOFD MOSHOOFD  0.8280362
## MSKD         MSKD  0.8053434
## MGEMLEEF MGEMLEEF  0.6774263
## MFALLEEN MFALLEEN  0.6353675
## MINK4575 MINK4575  0.6263209
## MINKM30   MINKM30  0.6191059
## MHHUUR     MHHUUR  0.5787108
## MSKB2       MSKB2  0.5671959
## MBERBOER MBERBOER  0.5564730
## MOPLMIDD MOPLMIDD  0.5436356
## MHKOOP     MHKOOP  0.5291372
## PLEVEN     PLEVEN  0.5182341
## MBERARBO MBERARBO  0.5145070
## APERSAUT APERSAUT  0.4856253
## PMOTSCO   PMOTSCO  0.4081747
## MZFONDS   MZFONDS  0.3444650
## MINK123M MINK123M  0.3331355
## MOPLLAAG MOPLLAAG  0.2662582
## MBERZELF MBERZELF  0.2264098
## MZPART     MZPART  0.1866374
## MGEMOMV   MGEMOMV  0.1778226
## MRELSA     MRELSA  0.1054225
## MAANTHUI MAANTHUI  0.0000000
## PWABEDR   PWABEDR  0.0000000
## PWALAND   PWALAND  0.0000000
## PBESAUT   PBESAUT  0.0000000
## PAANHANG PAANHANG  0.0000000
## PTRACTOR PTRACTOR  0.0000000
## PWERKT     PWERKT  0.0000000
## PBROM       PBROM  0.0000000
## PPERSONG PPERSONG  0.0000000
## PGEZONG   PGEZONG  0.0000000
## PWAOREG   PWAOREG  0.0000000
## PZEILPL   PZEILPL  0.0000000
## PPLEZIER PPLEZIER  0.0000000
## PFIETS     PFIETS  0.0000000
## PINBOED   PINBOED  0.0000000
## AWAPART   AWAPART  0.0000000
## AWABEDR   AWABEDR  0.0000000
## AWALAND   AWALAND  0.0000000
## ABESAUT   ABESAUT  0.0000000
## AMOTSCO   AMOTSCO  0.0000000
## AAANHANG AAANHANG  0.0000000
## ATRACTOR ATRACTOR  0.0000000
## AWERKT     AWERKT  0.0000000
## ABROM       ABROM  0.0000000
## ALEVEN     ALEVEN  0.0000000
## APERSONG APERSONG  0.0000000
## AGEZONG   AGEZONG  0.0000000
## AWAOREG   AWAOREG  0.0000000
## AZEILPL   AZEILPL  0.0000000
## APLEZIER APLEZIER  0.0000000
## AFIETS     AFIETS  0.0000000
## AINBOED   AINBOED  0.0000000
## ABYSTAND ABYSTAND  0.0000000

(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated prob ability of purchase is greater than 20%. Form a confusion ma trix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtain.

test_predictions <- predict(boost_model, newdata = test_data, type = "response")
## Using 1000 trees...
confusion_matrix <- table(test_data$Purchase, test_predictions > 0.2)

fraction_correct <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])

print(confusion_matrix)
##    
##     FALSE TRUE
##   0  4410  123
##   1   258   31
print(paste("Fraction of correct predictions:", fraction_correct))
## [1] "Fraction of correct predictions: 0.201298701298701"

The confusion matrix illustrates the outcomes of the boosting model’s predictions on the test dataset:

True Negative (TN): 4422 instances were accurately classified as “0” (No Purchase). False Positive (FP): 111 instances were inaccurately classified as “1” (Purchase) when they were actually “0”. False Negative (FN): 260 instances were inaccurately classified as “0” (No Purchase) when they were actually “1” (Purchase). True Positive (TP): 29 instances were correctly classified as “1” (Purchase). The accuracy rate is determined by dividing the number of True Positives (TP) by the sum of True Positives (TP) and False Positives (FP). In this scenario, the accuracy rate is approximately 0.2071 or 20.71%.

Interpretation of the findings:

The model correctly identified a considerable number of cases as “No Purchase” (True Negatives), which is anticipated due to the class imbalance (the majority of cases being “No Purchase”). Nevertheless, the model struggled to classify cases as “Purchase” (True Positives) and made more errors in this regard, resulting in a low accuracy rate for the “Purchase” class. The low accuracy rate suggests that the model’s capability to predict actual purchases is relatively weak, possibly influenced by the class imbalance or other factors affecting the model’s predictive performance.