R Markdown

7. In the lab, we applied random forests to the Boston data using mtry = 6 and ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
data(Boston)
set.seed(123)
train_index <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
mtry_values <- seq(2, 13, by = 1)
ntree_values <- c(25, 100, 250, 500, 750, 1000)
results <- data.frame(mtry = numeric(0), ntree = numeric(0), test_error = numeric(0))
for (mtry in mtry_values) {
  for (ntree in ntree_values) {
    model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
    predictions <- predict(model, test_data)
    test_error <- sqrt(mean((predictions - test_data$medv)^2))  
    results <- rbind(results, data.frame(mtry = mtry, ntree = ntree, test_error = test_error))
  }
}
ggplot(results, aes(x = ntree, y = test_error, color = factor(mtry))) +
  geom_line() +
  geom_point() +
  scale_color_discrete(name = "mtry") +
  labs(title = "Test Error vs. ntree with Varying mtry",
       x = "Number of Trees (ntree)",
       y = "Test Error (RMSE)")

The plot shows how the test RMSE varies jointly with mtry and ntree. As is typical for random forests, the error stabilizes once ntree is reasonably large, so most of the remaining differences between the curves come from the choice of mtry; the best-performing combinations can be read off directly, as in the snippet below.
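As a small convenience (using only the results data frame built above), the lowest-error combinations can be listed explicitly:

# Best few (mtry, ntree) combinations by test RMSE.
head(results[order(results$test_error), ], 5)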

8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

(f) Now analyze the data using BART, and report your results.

(a) Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
data(Carseats)
set.seed(123)
train_index <- sample(1:nrow(Carseats), 0.7 * nrow(Carseats))
train_set <- Carseats[train_index, ]
test_set <- Carseats[-train_index, ]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results.
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
tree_model <- rpart(Sales ~ ., data = train_set)
rpart.plot(tree_model)

test_preds <- predict(tree_model, newdata = test_set)
test_mse <- mean((test_set$Sales - test_preds)^2)
print(test_mse)
## [1] 3.784419
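The ISLR lab fits regression trees with the tree package rather than rpart; an equivalent fit, as a sketch assuming the tree package is installed, would be:

# Same regression tree using the tree package, as in the lab.
library(tree)
tree_fit <- tree(Sales ~ ., data = train_set)
plot(tree_fit)
text(tree_fit, pretty = 0)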
(c) Use cross-validation to determine the optimal level of tree complexity.
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
ctrl <- trainControl(method = "cv", number = 10)
tree_cv <- train(Sales ~ ., data = train_set, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(tree_cv)
## CART 
## 
## 280 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.07090167  2.313494  0.3419392  1.886709
##   0.09215223  2.420227  0.2809691  1.977039
##   0.25462228  2.728287  0.1848675  2.212138
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07090167.
pruned_tree <- prune.rpart(tree_model, cp = tree_cv$bestTune$cp)
rpart.plot(pruned_tree)

test_preds_pruned <- predict(pruned_tree, newdata = test_set)
test_mse_pruned <- mean((test_set$Sales - test_preds_pruned)^2)
print(test_mse_pruned)
## [1] 4.973922

The pruned tree's test MSE (about 4.97) is higher than that of the unpruned tree (about 3.78), so pruning at the CV-selected cp does not improve the test MSE here.
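As an alternative to caret, rpart itself runs 10-fold cross-validation when the tree is grown (xval = 10 by default) and stores the results in cptable. A minimal sketch of pruning with the one-standard-error rule, reusing the tree_model fitted above:

# Cross-validated error for each candidate cp (from rpart's internal 10-fold CV).
printcp(tree_model)

# One-standard-error rule: smallest tree whose xerror is within one SE of the minimum.
cp_tab  <- tree_model$cptable
min_row <- which.min(cp_tab[, "xerror"])
cutoff  <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
cp_1se  <- cp_tab[which(cp_tab[, "xerror"] <= cutoff)[1], "CP"]
pruned_1se <- prune(tree_model, cp = cp_1se)
rpart.plot(pruned_1se)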
(d) Use the bagging approach to analyze the data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(randomForest)

bag_model <- randomForest(Sales ~ ., data = train_set, mtry = ncol(train_set) - 1, importance = TRUE)
test_preds_bag <- predict(bag_model, newdata = test_set)
test_mse_bag <- mean((test_set$Sales - test_preds_bag)^2)
print(test_mse_bag)
## [1] 2.266203
importance(bag_model)
##               %IncMSE IncNodePurity
## CompPrice   34.591610    265.834772
## Income       8.363619    112.252903
## Advertising 21.343111    149.735898
## Population  -1.120982     65.465337
## Price       63.926998    626.301371
## ShelveLoc   75.591167    700.500508
## Age         18.562809    177.810370
## Education    2.043684     63.472918
## Urban        1.084441      9.852102
## US           1.304845      6.727582
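The same importance measures can be displayed graphically with randomForest's varImpPlot():

# Plot %IncMSE and IncNodePurity for the bagged model.
varImpPlot(bag_model)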
(e) Use random forests to analyze the data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
rf_model <- randomForest(Sales ~ ., data = train_set, importance = TRUE)
test_preds_rf <- predict(rf_model, newdata = test_set)
test_mse_rf <- mean((test_set$Sales - test_preds_rf)^2)
print(test_mse_rf)
## [1] 2.721336
importance(rf_model)
##               %IncMSE IncNodePurity
## CompPrice   17.920102     222.47708
## Income       3.370751     163.92746
## Advertising 16.153847     184.81354
## Population  -1.938118     123.43990
## Price       38.806033     485.03927
## ShelveLoc   45.572488     515.57332
## Age         17.411565     263.97172
## Education    3.762715      97.56053
## Urban        1.109503      19.97835
## US           4.785125      23.71652
for (m in c(1, 3, 5, 7, 9)) {
  rf_model_m <- randomForest(Sales ~ ., data = train_set, mtry = m, importance = TRUE)
  test_preds_rf_m <- predict(rf_model_m, newdata = test_set)
  test_mse_rf_m <- mean((test_set$Sales - test_preds_rf_m)^2)
  print(paste0("m =", m, ", Test MSE =", test_mse_rf_m))
}
## [1] "m =1, Test MSE =4.46132938066079"
## [1] "m =3, Test MSE =2.77154266983837"
## [1] "m =5, Test MSE =2.32618917264849"
## [1] "m =7, Test MSE =2.258654648365"
## [1] "m =9, Test MSE =2.28074596612518"
(f) Analyze the data using BART and report your results.
library(bartMachine)
## Warning: package 'bartMachine' was built under R version 4.3.3
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 4.3.2
## Loading required package: bartMachineJARs
## Warning: package 'bartMachineJARs' was built under R version 4.3.2
## Loading required package: missForest
## Warning: package 'missForest' was built under R version 4.3.3
## Welcome to bartMachine v1.3.4.1! You have 0.54GB memory available.
## 
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
X_train <- as.data.frame(train_set[, -which(names(train_set) == "Sales")])
y_train <- train_set$Sales
X_test <- as.data.frame(test_set[, -which(names(test_set) == "Sales")])
y_test <- test_set$Sales

bart_model <- bartMachine(X = X_train, y = y_train)
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 14 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression...Covariate importance prior ON. 
## evaluating in sample data...done
test_preds_bart <- predict(bart_model, X_test)

test_mse_bart <- mean((y_test - test_preds_bart)^2)
print(test_mse_bart)
## [1] 1.603244
# bartMachine objects do not store a variable_importance element (accessing it
# returns NULL), so we use the package's investigate_var_importance() to get
# the average variable inclusion proportions instead.
bart_var_imp <- investigate_var_importance(bart_model)
print(bart_var_imp)
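An alternative, as a sketch assuming the BART package (the one used in the ISLR lab) is installed, is to fit the same model with gbart(); factors are expanded to dummy columns with model.matrix so the train and test design matrices line up. The bartMachine results above remain the primary answer.

# BART via the BART package, with explicit dummy coding of the factors.
library(BART)
x_tr <- model.matrix(Sales ~ . - 1, data = train_set)
x_te <- model.matrix(Sales ~ . - 1, data = test_set)
bart_fit <- gbart(x.train = x_tr, y.train = train_set$Sales, x.test = x_te)
mean((test_set$Sales - bart_fit$yhat.test.mean)^2)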

11. This question uses the Caravan data set.

(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.

(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?

(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated probability of purchase is greater than 20%. Form a confusion matrix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtained from applying KNN or logistic regression to this data set?

Answer:

library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
data(Caravan)
# Convert 'Purchase' to a binary format
Caravan$Purchase <- ifelse(Caravan$Purchase == "No", 0, 1)
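It is worth checking the class balance first, since only a small fraction of customers purchase; this matters when reading the confusion matrix later:

# Class balance after recoding: 0 = no purchase, 1 = purchase (roughly 6% of cases).
table(Caravan$Purchase)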
(a) Create a training set and a test set.
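# Columns 50 (PVRAAUT) and 71 (AVRAAUT) are dropped below because they show
# essentially no variation in the first 1,000 observations and would only
# trigger warnings from gbm.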
train_data <- Caravan[1:1000, -c(50, 71)]
test_data <- Caravan[-(1:1000), -c(50, 71)]
(b) Fit a boosting model to the training set.
boost_model <- gbm(Purchase ~ ., data = train_data, distribution = "bernoulli",
                   n.trees = 1000, shrinkage = 0.01)

summary(boost_model)

##               var    rel.inf
## PPERSAUT PPERSAUT 14.7972265
## MKOOPKLA MKOOPKLA  9.7891322
## MOPLHOOG MOPLHOOG  7.6181957
## MBERMIDD MBERMIDD  5.7312818
## PBRAND     PBRAND  5.3321134
## ABRAND     ABRAND  4.7315810
## MGODGE     MGODGE  4.2021673
## MINK3045 MINK3045  4.0147836
## MSKC         MSKC  2.8636609
## MOSTYPE   MOSTYPE  2.5003160
## PWAPART   PWAPART  2.4229534
## MSKA         MSKA  2.3370328
## MAUT2       MAUT2  2.3257375
## MAUT1       MAUT1  2.2429949
## MGODPR     MGODPR  2.0166028
## MBERARBG MBERARBG  1.7038660
## MINKGEM   MINKGEM  1.6585150
## MSKB1       MSKB1  1.4985138
## PBYSTAND PBYSTAND  1.4766037
## MBERHOOG MBERHOOG  1.3919395
## MGODOV     MGODOV  1.3738064
## MRELGE     MRELGE  1.1785258
## MRELOV     MRELOV  1.1490221
## MGODRK     MGODRK  1.1337380
## MINK7512 MINK7512  1.0599456
## MFWEKIND MFWEKIND  1.0344397
## MFGEKIND MFGEKIND  1.0077217
## MAUT0       MAUT0  0.8741380
## MOSHOOFD MOSHOOFD  0.8280362
## MSKD         MSKD  0.8053434
## MGEMLEEF MGEMLEEF  0.6774263
## MFALLEEN MFALLEEN  0.6353675
## MINK4575 MINK4575  0.6263209
## MINKM30   MINKM30  0.6191059
## MHHUUR     MHHUUR  0.5787108
## MSKB2       MSKB2  0.5671959
## MBERBOER MBERBOER  0.5564730
## MOPLMIDD MOPLMIDD  0.5436356
## MHKOOP     MHKOOP  0.5291372
## PLEVEN     PLEVEN  0.5182341
## MBERARBO MBERARBO  0.5145070
## APERSAUT APERSAUT  0.4856253
## PMOTSCO   PMOTSCO  0.4081747
## MZFONDS   MZFONDS  0.3444650
## MINK123M MINK123M  0.3331355
## MOPLLAAG MOPLLAAG  0.2662582
## MBERZELF MBERZELF  0.2264098
## MZPART     MZPART  0.1866374
## MGEMOMV   MGEMOMV  0.1778226
## MRELSA     MRELSA  0.1054225
## MAANTHUI MAANTHUI  0.0000000
## PWABEDR   PWABEDR  0.0000000
## PWALAND   PWALAND  0.0000000
## PBESAUT   PBESAUT  0.0000000
## PAANHANG PAANHANG  0.0000000
## PTRACTOR PTRACTOR  0.0000000
## PWERKT     PWERKT  0.0000000
## PBROM       PBROM  0.0000000
## PPERSONG PPERSONG  0.0000000
## PGEZONG   PGEZONG  0.0000000
## PWAOREG   PWAOREG  0.0000000
## PZEILPL   PZEILPL  0.0000000
## PPLEZIER PPLEZIER  0.0000000
## PFIETS     PFIETS  0.0000000
## PINBOED   PINBOED  0.0000000
## AWAPART   AWAPART  0.0000000
## AWABEDR   AWABEDR  0.0000000
## AWALAND   AWALAND  0.0000000
## ABESAUT   ABESAUT  0.0000000
## AMOTSCO   AMOTSCO  0.0000000
## AAANHANG AAANHANG  0.0000000
## ATRACTOR ATRACTOR  0.0000000
## AWERKT     AWERKT  0.0000000
## ABROM       ABROM  0.0000000
## ALEVEN     ALEVEN  0.0000000
## APERSONG APERSONG  0.0000000
## AGEZONG   AGEZONG  0.0000000
## AWAOREG   AWAOREG  0.0000000
## AZEILPL   AZEILPL  0.0000000
## APLEZIER APLEZIER  0.0000000
## AFIETS     AFIETS  0.0000000
## AINBOED   AINBOED  0.0000000
## ABYSTAND ABYSTAND  0.0000000

PPERSAUT, MKOOPKLA, and MOPLHOOG have by far the largest relative influence, so they appear to be the most important predictors.
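For a compact answer, the top of the relative-influence ranking can also be printed without regenerating the plot (plotit is a standard summary.gbm argument):

# Ten most influential predictors by relative influence.
head(summary(boost_model, plotit = FALSE), 10)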
(c) Use the boosting model to predict on the test data and form a confusion matrix.
test_predictions <- predict(boost_model, newdata = test_data, type = "response")
## Using 1000 trees...
confusion_matrix <- table(test_data$Purchase, test_predictions > 0.2)

fraction_correct <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])

print(confusion_matrix)
##    
##     FALSE TRUE
##   0  4410  123
##   1   258   31
print(paste("Fraction of correct predictions:", fraction_correct))
## [1] "Fraction of correct predictions: 0.201298701298701"

The confusion matrix shows the results of the predictions made by the boosting model on the test data:

- True Negatives (TN): 4,410 cases were correctly predicted as “0” (No Purchase).

- False Positives (FP): 123 cases were predicted as “1” (Purchase) when they were actually “0”.

- False Negatives (FN): 258 cases were predicted as “0” (No Purchase) when they were actually “1” (Purchase).

- True Positives (TP): 31 cases were correctly predicted as “1” (Purchase).

The question asks what fraction of the people predicted to make a purchase actually make one. This is TP / (TP + FP) = 31 / (31 + 123) ≈ 0.2013, or about 20.1%.

Interpreting the results:

- The model correctly identifies most “No Purchase” cases (true negatives), which is expected given the strong class imbalance: the great majority of customers do not purchase.

- It is much weaker on the “Purchase” class: only 31 of the 289 purchasers in the test set are flagged at the 20% cutoff, and only about 20% of the people it flags actually buy.

- This low precision for the “Purchase” class reflects the class imbalance and the limited predictive signal in the available variables.

Overall, the model performs well on the majority “No Purchase” class but struggles to identify purchasers, which is reflected in both the confusion matrix and the roughly 20% precision among predicted purchasers.
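The question also asks how this compares with a simpler classifier. A minimal sketch (results not shown here) that fits logistic regression on the same split and applies the same 20% cutoff would be:

# Logistic regression on the same training data, with the same 20% probability cutoff.
# Warnings about fitted probabilities of 0 or 1 are common on this data set.
logit_model <- glm(Purchase ~ ., data = train_data, family = binomial)
logit_probs <- predict(logit_model, newdata = test_data, type = "response")
table(test_data$Purchase, logit_probs > 0.2)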