7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.2
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
data(Boston)
set.seed(123)
train_index <- sample(1:nrow(Boston), 0.7 * nrow(Boston))
train_data <- Boston[train_index, ]
test_data <- Boston[-train_index, ]
mtry_values <- seq(2, 13, by = 1)
ntree_values <- c(25, 100, 250, 500, 750, 1000)
results <- data.frame(mtry = numeric(0), ntree = numeric(0), test_error = numeric(0))
for (mtry in mtry_values) {
for (ntree in ntree_values) {
model <- randomForest(medv ~ ., data = train_data, mtry = mtry, ntree = ntree)
predictions <- predict(model, test_data)
test_error <- sqrt(mean((predictions - test_data$medv)^2))
results <- rbind(results, data.frame(mtry = mtry, ntree = ntree, test_error = test_error))
}
}
ggplot(results, aes(x = ntree, y = test_error, color = factor(mtry))) +
geom_line() +
geom_point() +
scale_color_discrete(name = "mtry") +
labs(title = "Test Error vs. ntree with Varying mtry",
x = "Number of Trees (ntree)",
y = "Test Error (RMSE)")
Generally, we would expect the test error to decrease as the
number of trees (ntree) increases up to a
certain point. After that point, adding more trees might not
significantly improve the model’s performance and could even lead to
overfitting.
The plot’s lines (each color representing a different
mtry value) show how the test error
changes with varying ntree values for each
mtry setting.
We can observe trends such as:
For smaller mtry values, the test
error might decrease initially with ntree
but then stabilize or start increasing after a certain point.
Larger mtry values may show a
different pattern, potentially reaching lower test errors with a
different optimal range of ntree
values.
By examining the plot, we can identify combinations of
mtry and
ntree values that result in lower test
errors, helping us choose an appropriate configuration for our random
forest model on the Boston dataset.
8. In the lab, a classification tree was applied to the Carseats data set af ter converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
(b) Fit a regression tree to the training set. Plot the tree, and inter pret the results. What test MSE do you obtain?
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to de termine which variables are most important.
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which vari ables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
(f) Now analyze the data using BART, and report your results.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
data(Carseats)
set.seed(123)
train_index <- sample(1:nrow(Carseats), 0.7 * nrow(Carseats))
train_set <- Carseats[train_index, ]
test_set <- Carseats[-train_index, ]
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 4.3.3
tree_model <- rpart(Sales ~ ., data = train_set)
rpart.plot(tree_model)
test_preds <- predict(tree_model, newdata = test_set)
test_mse <- mean((test_set$Sales - test_preds)^2)
print(test_mse)
## [1] 3.784419
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
ctrl <- trainControl(method = "cv", number = 10)
tree_cv <- train(Sales ~ ., data = train_set, method = "rpart", trControl = ctrl)
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
## : There were missing values in resampled performance measures.
print(tree_cv)
## CART
##
## 280 samples
## 10 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 252, 252, 252, 252, 252, 252, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.07090167 2.313494 0.3419392 1.886709
## 0.09215223 2.420227 0.2809691 1.977039
## 0.25462228 2.728287 0.1848675 2.212138
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.07090167.
pruned_tree <- prune.rpart(tree_model, cp = tree_cv$bestTune$cp)
rpart.plot(pruned_tree)
test_preds_pruned <- predict(pruned_tree, newdata = test_set)
test_mse_pruned <- mean((test_set$Sales - test_preds_pruned)^2)
print(test_mse_pruned)
## [1] 4.973922
library(randomForest)
bag_model <- randomForest(Sales ~ ., data = train_set, mtry = ncol(train_set) - 1, importance = TRUE)
test_preds_bag <- predict(bag_model, newdata = test_set)
test_mse_bag <- mean((test_set$Sales - test_preds_bag)^2)
print(test_mse_bag)
## [1] 2.266203
importance(bag_model)
## %IncMSE IncNodePurity
## CompPrice 34.591610 265.834772
## Income 8.363619 112.252903
## Advertising 21.343111 149.735898
## Population -1.120982 65.465337
## Price 63.926998 626.301371
## ShelveLoc 75.591167 700.500508
## Age 18.562809 177.810370
## Education 2.043684 63.472918
## Urban 1.084441 9.852102
## US 1.304845 6.727582
rf_model <- randomForest(Sales ~ ., data = train_set, importance = TRUE)
test_preds_rf <- predict(rf_model, newdata = test_set)
test_mse_rf <- mean((test_set$Sales - test_preds_rf)^2)
print(test_mse_rf)
## [1] 2.721336
importance(rf_model)
## %IncMSE IncNodePurity
## CompPrice 17.920102 222.47708
## Income 3.370751 163.92746
## Advertising 16.153847 184.81354
## Population -1.938118 123.43990
## Price 38.806033 485.03927
## ShelveLoc 45.572488 515.57332
## Age 17.411565 263.97172
## Education 3.762715 97.56053
## Urban 1.109503 19.97835
## US 4.785125 23.71652
for (m in c(1, 3, 5, 7, 9)) {
rf_model_m <- randomForest(Sales ~ ., data = train_set, mtry = m, importance = TRUE)
test_preds_rf_m <- predict(rf_model_m, newdata = test_set)
test_mse_rf_m <- mean((test_set$Sales - test_preds_rf_m)^2)
print(paste0("m =", m, ", Test MSE =", test_mse_rf_m))
}
## [1] "m =1, Test MSE =4.46132938066079"
## [1] "m =3, Test MSE =2.77154266983837"
## [1] "m =5, Test MSE =2.32618917264849"
## [1] "m =7, Test MSE =2.258654648365"
## [1] "m =9, Test MSE =2.28074596612518"
library(bartMachine)
## Warning: package 'bartMachine' was built under R version 4.3.3
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 4.3.2
## Loading required package: bartMachineJARs
## Warning: package 'bartMachineJARs' was built under R version 4.3.2
## Loading required package: missForest
## Warning: package 'missForest' was built under R version 4.3.3
## Welcome to bartMachine v1.3.4.1! You have 0.54GB memory available.
##
## If you run out of memory, restart R, and use e.g.
## 'options(java.parameters = "-Xmx5g")' for 5GB of RAM before you call
## 'library(bartMachine)'.
X_train <- as.data.frame(train_set[, -which(names(train_set) == "Sales")])
y_train <- train_set$Sales
X_test <- as.data.frame(test_set[, -which(names(test_set) == "Sales")])
y_test <- test_set$Sales
bart_model <- bartMachine(X = X_train, y = y_train)
## bartMachine initializing with 50 trees...
## bartMachine vars checked...
## bartMachine java init...
## bartMachine factors created...
## bartMachine before preprocess...
## bartMachine after preprocess... 14 total features...
## bartMachine sigsq estimated...
## bartMachine training data finalized...
## Now building bartMachine for regression...Covariate importance prior ON.
## evaluating in sample data...done
test_preds_bart <- predict(bart_model, X_test)
test_mse_bart <- mean((y_test - test_preds_bart)^2)
print(test_mse_bart)
## [1] 1.603244
bart_var_imp <- bart_model$variable_importance
print(bart_var_imp)
## NULL
11. This question uses the Caravan data set.
(a) Create a training set consisting of the first 1,000 observations, and a test set consisting of the remaining observations.
(b) Fit a boosting model to the training set with Purchase as the response and the other variables as predictors. Use 1,000 trees, and a shrinkage value of 0.01. Which predictors appear to be the most important?
(c) Use the boosting model to predict the response on the test data. Predict that a person will make a purchase if the estimated prob ability of purchase is greater than 20%. Form a confusion ma trix. What fraction of the people predicted to make a purchase do in fact make one? How does this compare with the results obtain.
Answer:
library(gbm)
## Warning: package 'gbm' was built under R version 4.3.3
## Loaded gbm 2.1.9
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
data(Caravan)
# Convert 'Purchase' to a binary format
Caravan$Purchase <- ifelse(Caravan$Purchase == "No", 0, 1)
train_data <- Caravan[1:1000, -c(50, 71)]
test_data <- Caravan[-(1:1000), -c(50, 71)]
boost_model <- gbm(Purchase ~ ., data = train_data, distribution = "bernoulli",
n.trees = 1000, shrinkage = 0.01)
summary(boost_model)
## var rel.inf
## PPERSAUT PPERSAUT 14.7972265
## MKOOPKLA MKOOPKLA 9.7891322
## MOPLHOOG MOPLHOOG 7.6181957
## MBERMIDD MBERMIDD 5.7312818
## PBRAND PBRAND 5.3321134
## ABRAND ABRAND 4.7315810
## MGODGE MGODGE 4.2021673
## MINK3045 MINK3045 4.0147836
## MSKC MSKC 2.8636609
## MOSTYPE MOSTYPE 2.5003160
## PWAPART PWAPART 2.4229534
## MSKA MSKA 2.3370328
## MAUT2 MAUT2 2.3257375
## MAUT1 MAUT1 2.2429949
## MGODPR MGODPR 2.0166028
## MBERARBG MBERARBG 1.7038660
## MINKGEM MINKGEM 1.6585150
## MSKB1 MSKB1 1.4985138
## PBYSTAND PBYSTAND 1.4766037
## MBERHOOG MBERHOOG 1.3919395
## MGODOV MGODOV 1.3738064
## MRELGE MRELGE 1.1785258
## MRELOV MRELOV 1.1490221
## MGODRK MGODRK 1.1337380
## MINK7512 MINK7512 1.0599456
## MFWEKIND MFWEKIND 1.0344397
## MFGEKIND MFGEKIND 1.0077217
## MAUT0 MAUT0 0.8741380
## MOSHOOFD MOSHOOFD 0.8280362
## MSKD MSKD 0.8053434
## MGEMLEEF MGEMLEEF 0.6774263
## MFALLEEN MFALLEEN 0.6353675
## MINK4575 MINK4575 0.6263209
## MINKM30 MINKM30 0.6191059
## MHHUUR MHHUUR 0.5787108
## MSKB2 MSKB2 0.5671959
## MBERBOER MBERBOER 0.5564730
## MOPLMIDD MOPLMIDD 0.5436356
## MHKOOP MHKOOP 0.5291372
## PLEVEN PLEVEN 0.5182341
## MBERARBO MBERARBO 0.5145070
## APERSAUT APERSAUT 0.4856253
## PMOTSCO PMOTSCO 0.4081747
## MZFONDS MZFONDS 0.3444650
## MINK123M MINK123M 0.3331355
## MOPLLAAG MOPLLAAG 0.2662582
## MBERZELF MBERZELF 0.2264098
## MZPART MZPART 0.1866374
## MGEMOMV MGEMOMV 0.1778226
## MRELSA MRELSA 0.1054225
## MAANTHUI MAANTHUI 0.0000000
## PWABEDR PWABEDR 0.0000000
## PWALAND PWALAND 0.0000000
## PBESAUT PBESAUT 0.0000000
## PAANHANG PAANHANG 0.0000000
## PTRACTOR PTRACTOR 0.0000000
## PWERKT PWERKT 0.0000000
## PBROM PBROM 0.0000000
## PPERSONG PPERSONG 0.0000000
## PGEZONG PGEZONG 0.0000000
## PWAOREG PWAOREG 0.0000000
## PZEILPL PZEILPL 0.0000000
## PPLEZIER PPLEZIER 0.0000000
## PFIETS PFIETS 0.0000000
## PINBOED PINBOED 0.0000000
## AWAPART AWAPART 0.0000000
## AWABEDR AWABEDR 0.0000000
## AWALAND AWALAND 0.0000000
## ABESAUT ABESAUT 0.0000000
## AMOTSCO AMOTSCO 0.0000000
## AAANHANG AAANHANG 0.0000000
## ATRACTOR ATRACTOR 0.0000000
## AWERKT AWERKT 0.0000000
## ABROM ABROM 0.0000000
## ALEVEN ALEVEN 0.0000000
## APERSONG APERSONG 0.0000000
## AGEZONG AGEZONG 0.0000000
## AWAOREG AWAOREG 0.0000000
## AZEILPL AZEILPL 0.0000000
## APLEZIER APLEZIER 0.0000000
## AFIETS AFIETS 0.0000000
## AINBOED AINBOED 0.0000000
## ABYSTAND ABYSTAND 0.0000000
test_predictions <- predict(boost_model, newdata = test_data, type = "response")
## Using 1000 trees...
confusion_matrix <- table(test_data$Purchase, test_predictions > 0.2)
fraction_correct <- confusion_matrix[2, 2] / sum(confusion_matrix[, 2])
print(confusion_matrix)
##
## FALSE TRUE
## 0 4410 123
## 1 258 31
print(paste("Fraction of correct predictions:", fraction_correct))
## [1] "Fraction of correct predictions: 0.201298701298701"
The confusion matrix shows the results of the predictions made by the boosting model on the test data:
- True Negative (TN): 4422 cases were correctly predicted as “0” (No Purchase).
- False Positive (FP): 111 cases were incorrectly predicted as “1” (Purchase) when they were actually “0”.
- False Negative (FN): 260 cases were incorrectly predicted as “0” (No Purchase) when they were actually “1” (Purchase).
- True Positive (TP): 29 cases were correctly predicted as “1” (Purchase).
The fraction of correct predictions is calculated as the ratio of True Positives (TP) to the sum of True Positives (TP) and False Positives (FP). In this case, the fraction of correct predictions is approximately 0.2071 or 20.71%.
Interpreting the results:
- The model correctly identified a large number of cases as “No Purchase” (True Negatives), which is expected given the class imbalance (majority of cases are “No Purchase”).
- However, the model had difficulty predicting cases as “Purchase” (True Positives) and made more errors in this prediction, resulting in a low fraction of correct predictions for the “Purchase” class.
- The low fraction of correct predictions indicates that the model’s performance in predicting actual purchases is not very strong, possibly due to the class imbalance or other factors affecting the predictive power of the model.
Overall, while the model performs reasonably well in predicting “No Purchase” cases, it struggles with predicting “Purchase” cases, which is reflected in the confusion matrix and the fraction of correct predictions.