library(ISLR2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(tree)
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
library(reshape2)
library(ggplot2)
Attaching package: 'ggplot2'
The following object is masked from 'package:randomForest':
margin
library(MASS)
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
The following object is masked from 'package:ISLR2':
Boston
library(BART)
Loading required package: nlme
Attaching package: 'nlme'
The following object is masked from 'package:dplyr':
collapse
Loading required package: survival
library(gbm)
Loaded gbm 2.2.2
This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(class)
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:survival':
cluster
library(pROC)
Type 'citation("pROC")' for a citation.
Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
7. In the lab, we applied random forests to the Boston data using mtry = 6 and using ntree = 25 and ntree = 500. Create a plot displaying the test error resulting from random forests on this data set for a more comprehensive range of values for mtry and ntree. You can model your plot after Figure 8.10. Describe the results obtained.
set.seed(18)
train <- sample(1:nrow(Boston), nrow(Boston)/2)
test <- Boston[-train, ]
test_medv <- test$medv

# Adjusted mtry values (p, p/2, sqrt(p), p/4)
mtry_vals <- c(12, 6, 3, 2)           # 12 = all predictors, sqrt(12) ≈ 3.46
ntree_vals <- seq(25, 500, by = 25)   # Reduced range for computational efficiency

# Create an empty matrix to store test MSE
test_mse <- matrix(NA, nrow = length(ntree_vals), ncol = length(mtry_vals),
                   dimnames = list(ntree_vals, paste0("mtry_", mtry_vals)))

# Loop over combinations of mtry and ntree
for (i in seq_along(mtry_vals)) {
  for (j in seq_along(ntree_vals)) {
    rf.model <- randomForest(medv ~ ., data = Boston, subset = train,
                             mtry = mtry_vals[i], ntree = ntree_vals[j])
    pred <- predict(rf.model, newdata = test)
    test_mse[j, i] <- mean((pred - test_medv)^2)
  }
}

# Create data frame for plotting
library(reshape2)
library(ggplot2)
df_plot <- melt(test_mse, varnames = c("ntree", "mtry"), value.name = "TestMSE")
df_plot$mtry <- as.factor(gsub("mtry_", "", df_plot$mtry))
df_plot$ntree <- as.numeric(as.character(df_plot$ntree))

# Plot with better formatting
ggplot(df_plot, aes(x = ntree, y = TestMSE, color = mtry)) +
  geom_line(size = 1) +
  geom_point(size = 0.8, alpha = 0.5) +
  scale_color_brewer(palette = "Set1", name = "mtry") +
  labs(title = "Random Forest Test MSE vs. Number of Trees",
       subtitle = "Boston Housing Dataset",
       x = "Number of Trees (ntree)",
       y = "Test Mean Squared Error") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"),
        legend.title = element_text(face = "bold"))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Observations:
1) Based on the validation-set approach, the test MSE tends to decrease as mtry increases.
2) As ntree increases, the test MSE tends to decrease up to a certain number of trees and then stabilize.
3) Therefore mtry = 12 (i.e., all of the predictors in Boston) gives a better test MSE than mtry < 12, and once ntree is sufficiently large, additional trees yield only minimal gains and minor fluctuations in test MSE.
8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
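The chunk that performs the split is not echoed in the rendered output; a minimal sketch, assuming the seed 18 used throughout this document and the object names train and testset referenced in the later chunks, would be:

set.seed(18)
train <- sample(1:nrow(Carseats), nrow(Carseats) / 2)   # training indices
testset <- Carseats[-train, ]                           # held-out test set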
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
Interpretation of the plot:
- Based on whether ShelveLoc is Bad/Medium or Good, the tree first splits into two regions, and each of those regions is then split further on Price.
- As we traverse further down toward the leaves, each region is segmented further on Income, Advertising, Age, and so on.
- Looking at one specific region: when ShelveLoc is Good and Price is less than 108.5, predicted Sales is 12.340; when Price is greater than 108.5, the region is split further into three regions depending on Advertising.
- Clearly, ShelveLoc and Price matter most: when a well-shelved product is offered at a lower price, sales are very high.
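The tree-fitting chunk itself is also not echoed; a minimal sketch, consistent with the object name tree.cs used in the pruning step and with the test MSE reported next (pred_tree is an illustrative name), would be:

tree.cs <- tree(Sales ~ ., data = Carseats, subset = train)
plot(tree.cs)
text(tree.cs, pretty = 0)

pred_tree <- predict(tree.cs, newdata = testset)
mean((pred_tree - testset$Sales)^2)   # test MSE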
Test MSE is 4.679982
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
set.seed(18)
cv.carseats <- cv.tree(tree.cs)
plot(cv.carseats$size, cv.carseats$dev, type = "b")
# Choose best size and prune
best_size <- cv.carseats$size[which.min(cv.carseats$dev)]
pruned_tree <- prune.tree(tree.cs, best = best_size)

# Plot pruned tree
plot(pruned_tree)
text(pruned_tree, pretty = 0)
# Predict and compute test MSE
pred_pruned <- predict(pruned_tree, newdata = testset)
mse_pruned <- mean((pred_pruned - testset$Sales)^2)
mse_pruned
[1] 4.851463
Cross-validation selects a size at or near that of the fully grown tree (the deviance is lowest at the largest sizes), so pruning is of little use here.
Pruning does not improve the test MSE: the pruned tree's test MSE (4.85) is no better than that of the unpruned tree (4.68).
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
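The bagging chunk is not echoed; a minimal sketch, assuming the object name bag.carseats used below and m = p = 10 (all Carseats predictors), with pred_bag as an illustrative name, would be:

set.seed(18)
bag.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train,
                             mtry = 10, importance = TRUE)   # mtry = p gives bagging

pred_bag <- predict(bag.carseats, newdata = testset)
mean((pred_bag - testset$Sales)^2)   # test MSE

importance(bag.carseats)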
               %IncMSE IncNodePurity
CompPrice   20.6396894    126.937918
Income       7.5331432     68.054506
Advertising 20.2842349    146.401868
Population   4.2280119     72.655529
Price       56.5406491    506.875694
ShelveLoc   66.7630152    543.736624
Age          9.0077036     80.950331
Education    0.5039699     33.564259
Urban       -1.6446822      5.584607
US          -0.3553105      6.908309
varImpPlot(bag.carseats)
For bagging, we use randomForest with m = p (all 10 predictors).
Test MSE is 3.124937.
Importance interpretation:
ShelveLoc is the most important variable, indicating that the quality of the shelf location has the greatest impact on predicted sales.
Price is the second most important. As expected, price significantly affects purchasing behavior: lower prices likely lead to higher sales.
Advertising suggests that marketing efforts contribute notably to the sales predictions.
CompPrice indicates that competitive pricing matters; customers compare store prices with those of competitors.
(Some of this importance could already be read off the earlier decision tree, but bagging further confirms the major role of ShelveLoc, Price, and Advertising.)
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
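As with bagging, the random-forest chunk is not echoed; a minimal sketch, assuming the object name rf.carseats and the m = 4 reported below, would be:

set.seed(18)
rf.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train,
                            mtry = 4, importance = TRUE)

pred_rf <- predict(rf.carseats, newdata = testset)
mean((pred_rf - testset$Sales)^2)   # test MSE

importance(rf.carseats)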
              %IncMSE IncNodePurity
CompPrice    9.550173     118.95355
Income       4.856627      99.18224
Advertising 12.961444     133.31174
Population   3.329297     110.76742
Price       47.107384     441.43832
ShelveLoc   51.782647     456.91039
Age          6.297534     110.21048
Education    2.067177      59.45038
Urban        1.291690      10.24085
US           4.486863      17.45096
varImpPlot(rf.carseats)
Interpretation: the same overall picture as with bagging, with somewhat different importance values.
ShelveLoc and Price are dominant predictors, aligning with business intuition: better visibility and affordability drive higher sales.
Advertising and CompPrice offer moderate predictive power, suggesting a multi-faceted influence on customer behavior.
Test MSE is 3.164371 for m = 4
set.seed(18)
mtry_val <- c(3, 6, 8, 10)
test_mse_368 <- rep(0, 4)

for (i in seq(1, 4)) {
  m <- mtry_val[i]
  rf.carseats <- randomForest(Sales ~ ., data = Carseats, subset = train,
                              mtry = m, importance = TRUE)
  pred_rf <- predict(rf.carseats, newdata = testset)
  test_mse_368[i] <- mean((pred_rf - testset$Sales)^2)
}
test_mse_368
[1] 3.273626 3.101766 3.081288 3.107321
From this run, m = 8 gives the lowest test MSE, and m = 6 and m = 10 are very close to it, indicating that the error stabilizes as the number of predictors considered at each split increases.
Thus, considering more predictors at each split works better here than a small m, and the sweet spot is roughly where the test MSE starts to stabilize.
(f) Now analyze the data using BART, and report your results.
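The BART chunk is not shown in the output; a minimal sketch using gbart() from the BART package, assuming the same train index and illustrative object names, and consistent with the test MSE of about 1.62 discussed below, might be:

x <- Carseats[, names(Carseats) != "Sales"]
y <- Carseats$Sales

set.seed(18)
bart.carseats <- gbart(x[train, ], y[train], x.test = x[-train, ])

yhat.bart <- bart.carseats$yhat.test.mean
mean((yhat.bart - y[-train])^2)   # test MSE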
Regression Tree (4.68): The single tree has the highest test MSE. A lone tree is a high-variance model: it fits the training data closely but does not generalize as well to the test set.
Bagging (3.12): A significant improvement over a single tree. Bagging reduces variance by aggregating many trees trained on bootstrap samples. The result is a much lower test error.
Random Forest (3.16): Comparable to bagging, with a slight increase in MSE. Random forests introduce additional randomness (through feature subsetting), which can improve generalization but sometimes reduce performance slightly if too aggressive.
BART (1.62): Best performance. BART models the function non-parametrically using a sum of trees and includes uncertainty quantification. It captures complex relationships and interactions better than other models here.
Thus:
- BART clearly outperforms all other models in terms of test error, making it the most trustworthy model for this dataset.
- Random forests and bagging are solid and interpretable alternatives. They allow for variable importance assessment and are easier to explain.
- Regression trees are too simple for this task and do not generalize well.
This question uses the Caravan data set: with the first 1,000 observations as a training set, fit a boosting model to predict Purchase, predict a purchase whenever the estimated probability exceeds 20%, and compare the fraction of correctly predicted purchasers with the results from KNN.

library(class)
library(caret)

clean_data <- na.omit(Caravan)
train_data <- clean_data[1:1000, ]
test_data <- clean_data[-(1:1000), ]

# Select numeric columns and remove zero-variance columns
numeric_cols <- sapply(clean_data[, -86], is.numeric)
numeric_vars <- names(clean_data)[-86][numeric_cols]
zero_var_cols <- sapply(train_data[, numeric_vars], function(col) sd(col) == 0)
numeric_vars <- numeric_vars[!zero_var_cols]

# Standardize using training data
train_X <- scale(train_data[, numeric_vars])
test_X <- scale(test_data[, numeric_vars],
                center = attr(train_X, "scaled:center"),
                scale = attr(train_X, "scaled:scale"))

train_y_knn <- train_data$Purchase
test_y_knn <- test_data$Purchase

evaluate_knn <- function(k) {
  pred_knn <- knn(train = train_X, test = test_X, cl = train_y_knn, k = k)
  cm <- table(Predicted = pred_knn, Actual = test_y_knn)
  if ("Yes" %in% rownames(cm)) {
    precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
  } else {
    precision <- NA
  }
  cat(paste0("\n--- KNN (k = ", k, ") ---\n"))
  print(cm)
  cat(sprintf("Fraction of predicted 'Yes' that are correct: %.3f\n", precision))
  return(list(pred = pred_knn, cm = cm, precision = precision))
}

# Try multiple k values
knn_results <- list()
for (k in c(1, 3, 5, 7, 9)) {
  knn_results[[paste0("k", k)]] <- evaluate_knn(k)
}
--- KNN (k = 1) ---
         Actual
Predicted   No  Yes
      No  4255  249
      Yes  278   40
Fraction of predicted 'Yes' that are correct: 0.126

--- KNN (k = 3) ---
         Actual
Predicted   No  Yes
      No  4449  267
      Yes   84   22
Fraction of predicted 'Yes' that are correct: 0.208

--- KNN (k = 5) ---
         Actual
Predicted   No  Yes
      No  4516  281
      Yes   17    8
Fraction of predicted 'Yes' that are correct: 0.320

--- KNN (k = 7) ---
         Actual
Predicted   No  Yes
      No  4525  288
      Yes    8    1
Fraction of predicted 'Yes' that are correct: 0.111

--- KNN (k = 9) ---
         Actual
Predicted   No  Yes
      No  4532  289
      Yes    1    0
Fraction of predicted 'Yes' that are correct: 0.000
# Store best KNN result
best_k <- 5   # k = 5 gave the highest precision among the values tried
best_knn_pred <- knn_results[[paste0("k", best_k)]]$pred
test_Y <- test_y_knn
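The boosting fit that the discussion below refers to is not shown in the output. A minimal sketch with gbm(), assuming 1,000 trees, a small grid of shrinkage values, and the 20% probability threshold described below (the grid and object names are illustrative), might look like this:

set.seed(18)

# Recode the response to 0/1 for the bernoulli loss
boost_train <- train_data
boost_train$Purchase <- ifelse(boost_train$Purchase == "Yes", 1, 0)

# Try a few shrinkage values and record the precision at a 20% threshold
for (lambda in c(0.001, 0.01, 0.1)) {
  boost.caravan <- gbm(Purchase ~ ., data = boost_train,
                       distribution = "bernoulli",
                       n.trees = 1000, shrinkage = lambda)
  probs <- predict(boost.caravan, newdata = test_data,
                   n.trees = 1000, type = "response")
  pred_boost <- ifelse(probs > 0.2, "Yes", "No")
  cm <- table(Predicted = pred_boost, Actual = test_data$Purchase)
  precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])
  cat(sprintf("shrinkage = %.3f: fraction of predicted 'Yes' that are correct = %.3f\n",
              lambda, precision))
}

# Refit at the best shrinkage and inspect relative influence
boost.best <- gbm(Purchase ~ ., data = boost_train,
                  distribution = "bernoulli", n.trees = 1000, shrinkage = 0.01)
head(summary(boost.best))   # PPERSAUT and MKOOPKLA are reported below as most influential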
Best Shrinkage Value: The optimal λ value is 0.01, which yielded the highest precision of approximately 20.9%. This suggests that a smaller step size in the gradient boosting process worked best for this dataset, allowing the model to make more careful, incremental improvements that capture the subtleties in the data without overfitting.
Most Important Predictors: The top predictors PPERSAUT (likely representing car ownership) and MKOOPKLA (likely representing purchasing power or socioeconomic class) are the most influential variables in determining whether someone will purchase insurance. This provides valuable business insight by highlighting customer characteristics that strongly correlate with purchase behavior.
Precision: With a precision of 20.9% when using a 0.01 shrinkage value, approximately 1 in 5 customers predicted to make a purchase actually did so. While this may seem low in absolute terms, it represents a significant improvement over random targeting, considering the low base rate of purchases in insurance data.
The confusion matrix shows that of all people predicted to make a purchase, about 20.9% actually did make one (with the best shrinkage value of 0.01). Comparing this with KNN:
KNN with k=5 achieved a higher precision of 32%, meaning 32% of those predicted to make a purchase actually did.
However, the KNN model with k=5 made far fewer positive predictions overall (only 25 test cases were predicted as purchases with 8 being correct).
The boosting model likely identified more potential customers while maintaining a reasonable precision rate.
The boosting model offers an advantage in identifying important variables (PPERSAUT and MKOOPKLA) that strongly influence purchase decisions, providing interpretability that KNN lacks.
The 20% probability threshold was an appropriate choice for this imbalanced dataset, as it allowed the model to identify more potential customers than would be possible with the default 50% threshold.
Interpretability Advantage: While KNN performed better in precision, the boosting model offers greater interpretability by identifying important variables like PPERSAUT and MKOOPKLA. This provides actionable insights that can inform broader marketing strategies beyond just prediction.
Thresholding Strategy: The 20% probability threshold for boosting predictions was effective, acknowledging the imbalanced nature of insurance purchase data. This approach appropriately balances false positives and false negatives in a domain where missed opportunities (false negatives) can be costly.
In summary, while KNN with k = 5 achieved higher precision (32% vs. 20.9%), the boosting model provided a better balance between prediction volume and accuracy, making it potentially more useful in a practical marketing scenario where identifying a reasonable number of likely purchasers is important.