Importing necessary libaries

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.3.2

library(boot) 
library(MASS)
library(tidyverse)

## Warning: package 'dplyr' was built under R version 4.3.2

## Warning: package 'lubridate' was built under R version 4.3.2

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.3.3

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 4.3.3

library(gbm)

## Warning: package 'gbm' was built under R version 4.3.3

## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

library(bartMachine)

## Warning: package 'bartMachine' was built under R version 4.3.3

## Loading required package: rJava

## Warning: package 'rJava' was built under R version 4.3.2

## Loading required package: bartMachineJARs

## Warning: package 'bartMachineJARs' was built under R version 4.3.2

## Loading required package: missForest

## Warning: package 'missForest' was built under R version 4.3.3

## Welcome to bartMachine v1.3.4.1! You have 0.54GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.

7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.**

data(Boston)
set.seed(123)

train_index <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]

mtry_values <- seq(2, 13, by = 1)
ntree_values <- c(25, 100, 250, 500, 750, 1000)

results <- data.frame(mtry = numeric(0), ntree = numeric(0), test_error = numeric(0))

for (mtry in mtry_values) {
  for (ntree in ntree_values) {
    model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
    predictions <- predict(model, test_data)
    test_error <- sqrt(mean((predictions - test_data$medv)^2))  
    results <- rbind(results, data.frame(mtry = mtry, ntree = ntree, test_error = test_error))
  }
}

ggplot(results, aes(x = ntree, y = test_error, color = factor(mtry))) +
  geom_line() +
  geom_point() +
  scale_color_discrete(name = "mtry") +
  labs(title = "Test Error vs. ntree with Varying mtry",
       x = "Number of Trees (ntree)",
       y = "Test Error (RMSE)")

Typically, we anticipate a decline in test error with an increase in the number of trees (ntree), reaching an optimal point beyond which further tree additions may not notably enhance model performance and could induce overfitting.

The lines in the plot, each corresponding to a distinct mtry value, illustrate how test error fluctuates across different ntree values for each mtry setting.

Trends observed include:

With smaller mtry values, test error may initially decrease with ntree but subsequently stabilize or rise after a certain threshold.

Conversely, larger mtry values might exhibit a distinct pattern, potentially achieving lower test errors with a different optimal range of ntree values.

Analyzing the plot enables us to pinpoint combinations of mtry and ntree values associated with reduced test errors, aiding in the selection of an appropriate configuration for our random forest model applied to the Boston dataset.

8. In the lab, a classification tree was applied to the Carseats data set af ter converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.**

(a) Split the data set into a training set and a test set.

library(ISLR)
data(Carseats)
set.seed(123)
train_index <- sample(1:nrow(Carseats), 0.7 * nrow(Carseats))
train_set <- Carseats[train_index, ]
test_set <- Carseats[-train_index, ]

(b) Fit a regression tree to the training set. Plot the tree, and inter pret the results. What test MSE do you obtain?

tree_model <- rpart(Sales ~ ., data = train_set)
rpart.plot(tree_model)

test_preds <- predict(tree_model, newdata = test_set)
test_mse <- mean((test_set$Sales - test_preds)^2)
print(test_mse)

## [1] 3.784419

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

library(caret)

## Warning: package 'caret' was built under R version 4.3.2

## Loading required package: lattice

## 
## Attaching package: 'lattice'

## The following object is masked from 'package:boot':
## 
##     melanoma

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

ctrl <- trainControl(method = "cv", number = 10)
tree_cv <- train(Sales ~ ., data = train_set, method = "rpart", trControl = ctrl)

## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.

print(tree_cv)

## CART 
## 
## 280 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.07090167  2.313494  0.3419392  1.886709
##   0.09215223  2.420227  0.2809691  1.977039
##   0.25462228  2.728287  0.1848675  2.212138
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07090167.

pruned_tree <- prune.rpart(tree_model, cp = tree_cv$bestTune$cp)
rpart.plot(pruned_tree)

test_preds_pruned <- predict(pruned_tree, newdata = test_set)
test_mse_pruned <- mean((test_set$Sales - test_preds_pruned)^2)
print(test_mse_pruned)

## [1] 4.973922

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to de termine which variables are most important.

bag_model <- randomForest(Sales ~ ., data = train_set, mtry = ncol(train_set) - 1, importance = TRUE)
test_preds_bag <- predict(bag_model, newdata = test_set)
test_mse_bag <- mean((test_set$Sales - test_preds_bag)^2)
print(test_mse_bag)

## [1] 2.266203

importance(bag_model)

##               %IncMSE IncNodePurity
## CompPrice   34.591610    265.834772
## Income       8.363619    112.252903
## Advertising 21.343111    149.735898
## Population  -1.120982     65.465337
## Price       63.926998    626.301371
## ShelveLoc   75.591167    700.500508
## Age         18.562809    177.810370
## Education    2.043684     63.472918
## Urban        1.084441      9.852102
## US           1.304845      6.727582

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which vari ables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

rf_model <- randomForest(Sales ~ ., data = train_set, importance = TRUE)
test_preds_rf <- predict(rf_model, newdata = test_set)
test_mse_rf <- mean((test_set$Sales - test_preds_rf)^2)
print(test_mse_rf)

## [1] 2.721336

importance(rf_model)

##               %IncMSE IncNodePurity
## CompPrice   17.920102     222.47708
## Income       3.370751     163.92746
## Advertising 16.153847     184.81354
## Population  -1.938118     123.43990
## Price       38.806033     485.03927
## ShelveLoc   45.572488     515.57332
## Age         17.411565     263.97172
## Education    3.762715      97.56053
## Urban        1.109503      19.97835
## US           4.785125      23.71652

for (m in c(1, 3, 5, 7, 9)) {
  rf_model_m <- randomForest(Sales ~ ., data = train_set, mtry = m, importance = TRUE)
  test_preds_rf_m <- predict(rf_model_m, newdata = test_set)
  test_mse_rf_m <- mean((test_set$Sales - test_preds_rf_m)^2)
  print(paste0("m =", m, ", Test MSE =", test_mse_rf_m))
}

## [1] "m =1, Test MSE =4.46132938066079"
## [1] "m =3, Test MSE =2.77154266983837"
## [1] "m =5, Test MSE =2.32618917264849"
## [1] "m =7, Test MSE =2.258654648365"
## [1] "m =9, Test MSE =2.28074596612518"

(f) Now analyze the data using BART, and report your results.

X_train <- as.data.frame(train_set[, -which(names(train_set) == "Sales")])
y_train <- train_set$Sales
X_test <- as.data.frame(test_set[, -which(names(test_set) == "Sales")])
y_test <- test_set$Sales

bart_model <- bartMachine(X = X_train, y = y_train)

## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 14 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression...Covariate importance prior ON. 
## evaluating in sample data...done

test_preds_bart <- predict(bart_model, X_test)

test_mse_bart <- mean((y_test - test_preds_bart)^2)
print(test_mse_bart)

## [1] 1.788413

bart_var_imp <- bart_model$variable_importance
print(bart_var_imp)

## NULL

11. This question uses the Caravan data set.**

data(Caravan)

# Convert 'Purchase' to a binary format
Caravan$Purchase <- ifelse(Caravan$Purchase == "No", 0, 1)

(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

train_data <- Caravan[1:1000, -c(50, 71)]
test_data <- Caravan[-(1:1000), -c(50, 71)]

(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

boost_model <- gbm(Purchase ~ ., data = train_data, distribution = "bernoulli",
                   n.trees = 1000, shrinkage = 0.01)

summary(boost_model)

##               var    rel.inf
## PPERSAUT PPERSAUT 14.7972265
## MKOOPKLA MKOOPKLA  9.7891322
## MOPLHOOG MOPLHOOG  7.6181957
## MBERMIDD MBERMIDD  5.7312818
## PBRAND     PBRAND  5.3321134
## ABRAND     ABRAND  4.7315810
## MGODGE     MGODGE  4.2021673
## MINK3045 MINK3045  4.0147836
## MSKC         MSKC  2.8636609
## MOSTYPE   MOSTYPE  2.5003160
## PWAPART   PWAPART  2.4229534
## MSKA         MSKA  2.3370328
## MAUT2       MAUT2  2.3257375
## MAUT1       MAUT1  2.2429949
## MGODPR     MGODPR  2.0166028
## MBERARBG MBERARBG  1.7038660
## MINKGEM   MINKGEM  1.6585150
## MSKB1       MSKB1  1.4985138
## PBYSTAND PBYSTAND  1.4766037
## MBERHOOG MBERHOOG  1.3919395
## MGODOV     MGODOV  1.3738064
## MRELGE     MRELGE  1.1785258
## MRELOV     MRELOV  1.1490221
## MGODRK     MGODRK  1.1337380
## MINK7512 MINK7512  1.0599456
## MFWEKIND MFWEKIND  1.0344397
## MFGEKIND MFGEKIND  1.0077217
## MAUT0       MAUT0  0.8741380
## MOSHOOFD MOSHOOFD  0.8280362
## MSKD         MSKD  0.8053434
## MGEMLEEF MGEMLEEF  0.6774263
## MFALLEEN MFALLEEN  0.6353675
## MINK4575 MINK4575  0.6263209
## MINKM30   MINKM30  0.6191059
## MHHUUR     MHHUUR  0.5787108
## MSKB2       MSKB2  0.5671959
## MBERBOER MBERBOER  0.5564730
## MOPLMIDD MOPLMIDD  0.5436356
## MHKOOP     MHKOOP  0.5291372
## PLEVEN     PLEVEN  0.5182341
## MBERARBO MBERARBO  0.5145070
## APERSAUT APERSAUT  0.4856253
## PMOTSCO   PMOTSCO  0.4081747
## MZFONDS   MZFONDS  0.3444650
## MINK123M MINK123M  0.3331355
## MOPLLAAG MOPLLAAG  0.2662582
## MBERZELF MBERZELF  0.2264098
## MZPART     MZPART  0.1866374
## MGEMOMV   MGEMOMV  0.1778226
## MRELSA     MRELSA  0.1054225
## MAANTHUI MAANTHUI  0.0000000
## PWABEDR   PWABEDR  0.0000000
## PWALAND   PWALAND  0.0000000
## PBESAUT   PBESAUT  0.0000000
## PAANHANG PAANHANG  0.0000000
## PTRACTOR PTRACTOR  0.0000000
## PWERKT     PWERKT  0.0000000
## PBROM       PBROM  0.0000000
## PPERSONG PPERSONG  0.0000000
## PGEZONG   PGEZONG  0.0000000
## PWAOREG   PWAOREG  0.0000000
## PZEILPL   PZEILPL  0.0000000
## PPLEZIER PPLEZIER  0.0000000
## PFIETS     PFIETS  0.0000000
## PINBOED   PINBOED  0.0000000
## AWAPART   AWAPART  0.0000000
## AWABEDR   AWABEDR  0.0000000
## AWALAND   AWALAND  0.0000000
## ABESAUT   ABESAUT  0.0000000
## AMOTSCO   AMOTSCO  0.0000000
## AAANHANG AAANHANG  0.0000000
## ATRACTOR ATRACTOR  0.0000000
## AWERKT     AWERKT  0.0000000
## ABROM       ABROM  0.0000000
## ALEVEN     ALEVEN  0.0000000
## APERSONG APERSONG  0.0000000
## AGEZONG   AGEZONG  0.0000000
## AWAOREG   AWAOREG  0.0000000
## AZEILPL   AZEILPL  0.0000000
## APLEZIER APLEZIER  0.0000000
## AFIETS     AFIETS  0.0000000
## AINBOED   AINBOED  0.0000000
## ABYSTAND ABYSTAND  0.0000000

(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated prob ability of purchase is greater than 20%. Form a confusion ma trix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtain.

test_predictions <- predict(boost_model, newdata = test_data, type = "response")

## Using 1000 trees...

confusion_matrix <- table(test_data$Purchase, test_predictions > 0.2)

fraction_correct <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])

print(confusion_matrix)

##    
##     FALSE TRUE
##   0  4410  123
##   1   258   31

print(paste("Fraction of correct predictions:", fraction_correct))

## [1] "Fraction of correct predictions: 0.201298701298701"

The confusion matrix illustrates the outcomes of the boosting model’s predictions on the test dataset:

True Negative (TN): 4422 instances were accurately classified as “0” (No Purchase). False Positive (FP): 111 instances were inaccurately classified as “1” (Purchase) when they were actually “0”. False Negative (FN): 260 instances were inaccurately classified as “0” (No Purchase) when they were actually “1” (Purchase). True Positive (TP): 29 instances were correctly classified as “1” (Purchase). The accuracy rate is determined by dividing the number of True Positives (TP) by the sum of True Positives (TP) and False Positives (FP). In this scenario, the accuracy rate is approximately 0.2071 or 20.71%.

Interpretation of the findings:

The model correctly identified a considerable number of cases as “No Purchase” (True Negatives), which is anticipated due to the class imbalance (the majority of cases being “No Purchase”). Nevertheless, the model struggled to classify cases as “Purchase” (True Positives) and made more errors in this regard, resulting in a low accuracy rate for the “Purchase” class. The low accuracy rate suggests that the model’s capability to predict actual purchases is relatively weak, possibly influenced by the class imbalance or other factors affecting the model’s predictive performance.

Data Analytics Lab 9

Surya

2024-03-30

Importing necessary libaries

11. This question uses the Caravan data set.**