library(ISLR)
library(tree)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(BART)
## Loading required package: nlme
## Loading required package: survival
library(caret)
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
##
## cluster
library(MASS)
(3) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
The plot below shows how the Gini index, classification error, and entropy vary as \(\hat{p}_{m1}\) increases from 0 to 1. All three measures are maximized when the two classes are equally likely (\(\hat{p}_{m1} = 0.5\)): entropy reaches 1 there, while the Gini index and classification error both reach 0.5. Entropy is the most sensitive to changes in class purity near this point, classification error is the least sensitive, and the Gini index falls in between, which is why the Gini index and entropy are typically preferred when growing decision trees.
p <- seq(0, 1, by = 0.01)
gini <- 2 * p * (1 - p)
class_error <- 1 - pmax(p, 1 - p)
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] <- 0 # treat 0 * log2(0) as 0 at p = 0 and p = 1
plot(p, gini, type = "l", col = "blue", ylim = c(0, 1), ylab = "Impurity Measure", xlab = expression(hat(p)[m1]), lwd = 2)
lines(p, class_error, col = "red", lwd = 2)
lines(p, entropy, col = "darkgreen", lwd = 2)
legend("top", legend = c("Gini", "Classification Error", "Entropy"),
col = c("blue", "red", "darkgreen"), lty = 1, lwd = 2)
(8) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
Split the data set into a training set and a test set.
The Carseats data was randomly divided into a training set (70%) and a test set (30%).
set.seed(1)
train_index <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
train <- Carseats[train_index, ]
test <- Carseats[-train_index, ]
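As a quick sanity check (an addition), the resulting split sizes can be confirmed; createDataPartition rounds, so the training set holds roughly 70% of the 400 rows.
nrow(train)
nrow(test)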
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
A regression tree was fitted to predict Sales. The tree used six variables and produced 18 terminal nodes. The test mean squared error (MSE) from this model was 4.47.
tree_carseats <- tree(Sales ~ ., data = train)
summary(tree_carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "Advertising" "CompPrice"
## [6] "Age"
## Number of terminal nodes: 18
## Residual mean deviance: 2.22 = 583.8 / 263
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.64200 -0.96000 0.02174 0.00000 0.96000 3.40700
plot(tree_carseats)
text(tree_carseats, pretty = 0)
pred_tree <- predict(tree_carseats, newdata = test)
mse_tree <- mean((pred_tree - test$Sales)^2)
mse_tree
## [1] 4.469011
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
Cross-validation was used to select the tree size. The cross-validated deviance was minimized at the size of the full tree, so pruning returned the original 18-node tree and the test MSE remained 4.47; pruning did not improve performance here.
set.seed(2)
cv_tree <- cv.tree(tree_carseats, FUN = prune.tree)
plot(cv_tree$size, cv_tree$dev, type = "b")
pruned_tree <- prune.tree(tree_carseats, best = cv_tree$size[which.min(cv_tree$dev)])
plot(pruned_tree)
text(pruned_tree, pretty = 5)
pred_pruned <- predict(pruned_tree, newdata = test)
mse_pruned <- mean((pred_pruned - test$Sales)^2)
mse_pruned
## [1] 4.469011
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Bagging was applied by considering all ten predictors at each split (mtry = 10). The test MSE dropped to 2.95, a clear improvement over the single regression tree. By both importance measures, ShelveLoc and Price were by far the most important variables, followed by Advertising and CompPrice.
set.seed(1)
bag_carseats <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE) # all 10 predictors at each split = bagging
pred_bag <- predict(bag_carseats, newdata = test)
mse_bag <- mean((pred_bag - test$Sales)^2)
mse_bag
## [1] 2.95016
importance(bag_carseats)
## %IncMSE IncNodePurity
## CompPrice 29.3462503 184.444145
## Income 2.1950934 84.971443
## Advertising 28.4461920 226.470679
## Population -0.7596617 67.714326
## Price 68.7814941 651.996671
## ShelveLoc 71.7824230 628.164064
## Age 21.3597782 166.657371
## Education 1.4885394 55.640049
## Urban -3.4002675 8.580206
## US 5.0762780 12.380845
varImpPlot(bag_carseats)
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
A random forest was fitted with mtry = 4, giving a test MSE of 3.50. ShelveLoc and Price remained the most important variables. Looping over mtry = 1 to 10 (below) shows how the test error depends on m: in this split the forest with mtry = 4 did somewhat worse than bagging with all ten predictors (test MSE 2.95), so larger values of m helped here. Decreasing m decorrelates the trees, but when a few strong predictors such as ShelveLoc and Price dominate, restricting the candidate variables at each split can hurt accuracy.
set.seed(1)
rf_carseats <- randomForest(Sales ~ ., data = train, mtry = 4, importance = TRUE)
pred_rf <- predict(rf_carseats, newdata = test)
mse_rf <- mean((pred_rf - test$Sales)^2)
mse_rf
## [1] 3.500971
importance(rf_carseats)
## %IncMSE IncNodePurity
## CompPrice 16.8566344 180.61779
## Income 1.6173049 119.33836
## Advertising 22.7107139 230.95192
## Population 0.1108546 107.96778
## Price 50.2654195 571.56201
## ShelveLoc 54.7490099 534.09345
## Age 16.3214308 194.92518
## Education -0.5807710 79.17980
## Urban -0.5769700 15.64903
## US 6.0425707 23.57621
varImpPlot(rf_carseats)
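For easier reading (a small addition using the same fitted forest), the importance scores can be sorted by %IncMSE:
imp <- importance(rf_carseats)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]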
errors <- sapply(1:10, function(m) {
rf_tmp <- randomForest(Sales ~ ., data = train, mtry = m)
pred_tmp <- predict(rf_tmp, newdata = test)
mean((pred_tmp - test$Sales)^2)
})
plot(1:10, errors, type = "b", xlab = "mtry", ylab = "Test MSE")
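As a small follow-up (not part of the original output), the value of mtry with the lowest test MSE in this particular run can be read off the errors vector:
best_m <- which.min(errors) # mtry value giving the lowest test MSE in this run
best_m
errors[best_m]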
Now analyze the data using BART, and report your results.
BART was used to predict Sales and gave the lowest test MSE of all the Carseats models, 1.73, outperforming the single tree, the pruned tree, bagging, and the random forest.
x_train <- train[, -which(names(train) == "Sales")]
x_test <- test[, -which(names(test) == "Sales")]
y_train <- train$Sales
set.seed(1)
bart_fit <- wbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 281, 14, 119
## y1,yn: 1.974733, 2.184733
## x1,x[n*p]: 138.000000, 1.000000
## xp1,xp[np*p]: 122.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 64 ... 1
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.287616,3.000000,0.182475
## *****sigma: 0.967869
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
pred_bart <- colMeans(bart_fit$yhat.test)
mse_bart <- mean((pred_bart - test$Sales)^2)
mse_bart
## [1] 1.73499
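To summarize the Carseats analysis, the test MSEs computed above can be collected into a single table (a small addition built from the objects already created):
results <- data.frame(
  Model = c("Single tree", "Pruned tree", "Bagging", "Random forest (mtry = 4)", "BART"),
  TestMSE = c(mse_tree, mse_pruned, mse_bag, mse_rf, mse_bart)
)
results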
(9) This problem involves the OJ data set, which is part of the ISLR package.
data(OJ)
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
The OJ dataset was split into 800 observations for training and 270 for testing.
set.seed(1)
train_index <- sample(1:nrow(OJ), 800)
train <- OJ[train_index, ]
test <- OJ[-train_index, ]
Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
A classification tree was built to predict Purchase from the other variables. The tree used five predictors (LoyalCH, PriceDiff, SpecialCH, ListPriceDiff, and PctDiscMM), had 9 terminal nodes, and a training error rate of 15.88% (127 of 800 observations misclassified). LoyalCH was the most influential variable.
tree_oj <- tree(Purchase ~ ., data = train)
summary(tree_oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
Consider terminal node 8, defined by LoyalCH < 0.0356. It contains 59 training observations, the predicted class is MM, and 98.3% of the observations in the node are in fact MM (deviance 10.14), so the tree is highly confident in this prediction.
tree_oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Create a plot of the tree, and interpret the results.
The plot shows that LoyalCH drives the first two levels of splits, so brand loyalty to Citrus Hill is the dominant predictor; PriceDiff, SpecialCH, ListPriceDiff, and PctDiscMM appear only in deeper splits.
plot(tree_oj)
text(tree_oj, pretty = 5)
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
Using the test set, the classification tree had a test error rate of 17.04%, based on a confusion matrix comparing predicted and actual values.
pred_test <- predict(tree_oj, test, type = "class")
confusion <- table(Predicted = pred_test, Actual = test$Purchase)
confusion
## Actual
## Predicted CH MM
## CH 160 38
## MM 8 64
test_error <- 1 - sum(diag(confusion)) / sum(confusion)
test_error
## [1] 0.1703704
Apply the cv.tree() function to the training set in order to determine the optimal tree size.
Cross-validation suggested the optimal tree size was 9, the same as the full tree, so no pruning was required.
set.seed(2)
cv_oj <- cv.tree(tree_oj, FUN = prune.misclass)
Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
A plot of tree size versus cross-validated classification error showed the lowest error at tree size 9.
plot(cv_oj$size, cv_oj$dev, type = "b",
xlab = "Tree Size", ylab = "CV Misclassification Rate")
Which tree size corresponds to the lowest cross-validated classification error rate?
The lowest cross-validation error occurred with a tree size of 9, which matched the original unpruned tree.
min_dev_index <- which.min(cv_oj$dev)
optimal_size <- cv_oj$size[min_dev_index]
optimal_size
## [1] 9
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
Since cross-validation selected the full 9-node tree, the tree was instead pruned to five terminal nodes, as instructed, to see how a smaller tree performs.
# CV did not select a smaller tree, so prune to the 5-node fallback
pruned_tree <- prune.misclass(tree_oj, best = 5)
plot(pruned_tree)
text(pruned_tree, pretty = 0)
Compare the training error rates between the pruned and unpruned trees. Which is higher?
The unpruned tree had a training error rate of 15.88%, while the pruned tree had a slightly higher rate of 16.25%.
train_pred_unpruned <- predict(tree_oj, train, type = "class")
train_err_unpruned <- mean(train_pred_unpruned != train$Purchase)
train_pred_pruned <- predict(pruned_tree, train, type = "class")
train_err_pruned <- mean(train_pred_pruned != train$Purchase)
train_err_unpruned
## [1] 0.15875
train_err_pruned
## [1] 0.1625
Compare the test error rates between the pruned and unpruned trees. Which is higher?
The unpruned tree had a test error rate of 17.04%, while the pruned tree's test error rate was 16.30%, so the unpruned tree's error was higher. Despite its slightly higher training error, the pruned tree generalized a little better on this test set.
test_pred_unpruned <- predict(tree_oj, test, type = "class")
test_err_unpruned <- mean(test_pred_unpruned != test$Purchase)
test_pred_pruned <- predict(pruned_tree, test, type = "class")
test_err_pruned <- mean(test_pred_pruned != test$Purchase)
test_err_unpruned
## [1] 0.1703704
test_err_pruned
## [1] 0.162963
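Finally, the training and test error rates of the unpruned and pruned trees can be placed side by side (a small addition using the error rates computed above):
error_summary <- data.frame(
  Tree = c("Unpruned (9 nodes)", "Pruned (5 nodes)"),
  TrainError = c(train_err_unpruned, train_err_pruned),
  TestError = c(test_err_unpruned, test_err_pruned)
)
error_summary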