3. Consider the Gini index, classification error, and entropy
in a simple classification setting with two classes. Create a single
plot that displays each of these quantities as a function of pm1. The
x-axis should display pm1, ranging from 0 to 1, and the y-axis should
display the value of the Gini index, classification error, and
entropy.
Hint: In a setting with two classes, pm1 = 1 - pm2. You could
make this plot by hand, but it will be much easier to make in
R.
p <- seq(0, 1, 0.0001)
# Gini index: G = 2 * p * (1 - p)
G <- 2 * p * (1 - p)
# Classification error: E = 1 - max(p, 1 - p)
E <- 1 - pmax(p, 1 - p)
# Entropy (natural log): D = -[p*log(p) + (1 - p)*log(1 - p)]
# (NaN at the endpoints p = 0 and p = 1 is harmless; those points are simply not drawn)
D <- -(p * log(p) + (1 - p) * log(1 - p))
plot(p, D, type = "l", col = "red", xlab = "pm1", ylab = "")
lines(p, E, col = "green")
lines(p, G, col = "blue")
legend(0.3, 0.15, c("Entropy", "Misclassification", "Gini"),
       lty = c(1, 1, 1), lwd = c(2.5, 2.5, 2.5), col = c("red", "green", "blue"))
All three measures equal 0 at pm1 = 0 and pm1 = 1 and peak at pm1 = 0.5: the Gini index and classification error reach a maximum of 0.5 there, while the entropy (with natural logs) reaches log(2) ≈ 0.693. The Gini index and entropy are smooth, whereas the classification error is piecewise linear.
8. In the lab, a classification tree was applied to the
Carseats data set after converting Sales into a qualitative response
variable. Now we will seek to predict Sales using regression trees and
related approaches, treating the response as a quantitative
variable.
(a) Split the data set into a training set and a test
set.
library(ISLR2)   # Carseats data
set.seed(2020)
train_idx <- sample(1:nrow(Carseats), 0.7 * nrow(Carseats))
train <- Carseats[train_idx, ]
test <- Carseats[-train_idx, ]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
library(tree)
tree_model <- tree(Sales ~ ., data = train)
plot(tree_model)
text(tree_model, pretty = 0, cex = 0.6)
pred <- predict(tree_model, newdata = test)
test_mse <- mean((test$Sales - pred)^2)
test_mse
## [1] 4.754288
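Since the plotted tree itself is not reproduced here, a quick way to interpret the fit is to check which variables the tree actually uses and how many terminal nodes it has (a minimal sketch using the tree_model object above; output omitted):
# Variables actually used, number of terminal nodes, and residual mean deviance
summary(tree_model)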
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
set.seed(2020)
cv_tree <- cv.tree(tree_model)
best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.tree(tree_model, best = best_size)
plot(pruned_tree)
text(pruned_tree, pretty = 0, cex = 0.6)
pred_pruned <- predict(pruned_tree, newdata = test)
test_mse_pruned <- mean((test$Sales - pred_pruned)^2)
test_mse_pruned
## [1] 4.663427
Yes, pruning improved the test MSE slightly, from 4.754 for the unpruned tree to 4.663 for the pruned tree.
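For reference, a minimal sketch (using the cv_tree and best_size objects above) that plots the cross-validated deviance against tree size, showing where the selected size falls:
# Cross-validated deviance as a function of tree size
plot(cv_tree$size, cv_tree$dev, type = "b",
     xlab = "Tree size (number of terminal nodes)", ylab = "CV deviance")
abline(v = best_size, lty = 2)  # size selected by cross-validation above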
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(randomForest)
set.seed(2020)
bag_model <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE)
pred_bag <- predict(bag_model, newdata = test)
test_mse_bag <- mean((pred_bag - test$Sales)^2)
print(test_mse_bag)
## [1] 2.028936
importance(bag_model)
## %IncMSE IncNodePurity
## CompPrice 28.188162 244.332299
## Income 10.233560 132.978844
## Advertising 18.167490 130.000913
## Population 1.957555 73.874975
## Price 64.467679 713.494804
## ShelveLoc 63.407009 678.546104
## Age 17.712847 171.786843
## Education 5.310596 71.466695
## Urban -2.343292 8.421996
## US 4.492922 17.967224
varImpPlot(bag_model)
The bagging test MSE is 2.029, less than half of the pruned tree's 4.663. The variable Price has the highest %IncMSE (64.47), indicating that it is the most important predictor of Sales; ShelveLoc is a close second at 63.41, showing that the product's shelf location also strongly affects sales.
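As a quick check on the same point, the predictors can be ranked directly from the importance matrix (a minimal sketch using the bag_model object above; "%IncMSE" is the column name printed by importance()):
# Sort predictors by the permutation-based %IncMSE measure
imp <- importance(bag_model)
sort(imp[, "%IncMSE"], decreasing = TRUE)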
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
set.seed(2020)
rf_model <- randomForest(Sales ~ ., data = train, importance = TRUE)
pred_rf <- predict(rf_model, newdata = test)
test_mse_rf <- mean((pred_rf - test$Sales)^2)
print(test_mse_rf)
## [1] 2.295825
importance(rf_model)
## %IncMSE IncNodePurity
## CompPrice 15.302446 221.69756
## Income 7.093673 177.69205
## Advertising 13.491613 154.47106
## Population -1.008164 135.43304
## Price 45.811047 583.05688
## ShelveLoc 44.912973 498.34143
## Age 15.375731 237.39417
## Education 3.684364 102.71834
## Urban -1.164159 19.38440
## US 3.091141 27.55037
varImpPlot(rf_model)
for (m in c(2, 4, 6, 8, 10)) {
  set.seed(2020)
  rf <- randomForest(Sales ~ ., data = train, mtry = m)
  pred <- predict(rf, newdata = test)
  mse <- mean((pred - test$Sales)^2)
  cat("mtry =", m, "Test MSE =", round(mse, 4), "\n")
}
## mtry = 2 Test MSE = 2.6094
## mtry = 4 Test MSE = 2.1
## mtry = 6 Test MSE = 2.0003
## mtry = 8 Test MSE = 1.9838
## mtry = 10 Test MSE = 2.0065
The most important variables for predicting Sales with the random forest are again Price and ShelveLoc, with Age, CompPrice, and Advertising also notable, while Urban, Population, and Education contribute least. The test MSE falls from 2.6094 at mtry = 2 to a minimum of 1.9838 at mtry = 8, then rises slightly to 2.0065 at mtry = 10 (which is bagging, since all 10 predictors are considered at every split). A value of m around 6-8 works best here: very small m decorrelates the trees but frequently excludes the dominant predictors (Price and ShelveLoc) from a split, while larger m produces trees that are more correlated with one another.
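To visualize this trade-off, the loop above can be rerun while storing the results (a minimal sketch; it assumes the train and test sets from part (a) and re-fits the same forests):
# Test MSE as a function of m, the number of predictors tried at each split
mtry_vals <- c(2, 4, 6, 8, 10)
mse_by_mtry <- sapply(mtry_vals, function(m) {
  set.seed(2020)
  rf <- randomForest(Sales ~ ., data = train, mtry = m)
  mean((predict(rf, newdata = test) - test$Sales)^2)
})
plot(mtry_vals, mse_by_mtry, type = "b",
     xlab = "m (predictors considered at each split)", ylab = "Test MSE")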
(f) Now analyze the data using BART, and report your results.
x.train <- train[, -which(names(train) == "Sales")]
y.train <- train$Sales
x.test <- test[, -which(names(test) == "Sales")]
y.test <- test$Sales
library(BART)
set.seed(2020)
bart_fit <- gbart(x.train, y.train, x.test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 280, 14, 120
## y1,yn: -1.990714, -1.420714
## x1,x[n*p]: 126.000000, 0.000000
## xp1,xp[np*p]: 141.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.276302,3,0.21968,7.52071
## *****sigma: 1.061966
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
yhat_bart <- bart_fit$yhat.test.mean
mse_bart <- mean((yhat_bart - y.test)^2)
cat("Test MSE for BART model:", round(mse_bart, 4), "\n")
## Test MSE for BART model: 1.2006
The BART model for the Carseats data, fit on the same 70/30 split with the default 200 trees, achieved a test MSE of 1.2006, outperforming the unpruned tree (4.754), the pruned tree (4.663), bagging (2.029), and the random forest (2.296). Based on the earlier analyses, Price and ShelveLoc are likely the most influential predictors here as well, with CompPrice, Advertising, and Age also contributing.
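This can be checked from the BART fit itself: gbart() records how often each predictor appears in a splitting rule, and varcount.mean averages those counts over the posterior draws, giving a rough importance measure. A minimal sketch using the bart_fit object above (output not shown):
# Average number of splitting rules involving each predictor, largest first
ord <- order(bart_fit$varcount.mean, decreasing = TRUE)
bart_fit$varcount.mean[ord]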
9. This problem involves the OJ data set which is part of the
ISLR2 package.
(a) Create a training set containing a random sample of 800
observations, and a test set containing the remaining
observations.
set.seed(2020)
train_idx <- sample(1:nrow(OJ), 800)
train <- OJ[train_idx, ]
test <- OJ[-train_idx, ]
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree_oj <- tree(Purchase ~ ., data = train)
summary(tree_oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7058 = 559 / 792
## Misclassification error rate: 0.1525 = 122 / 800
The misclassification error rate is 0.1525, or 15.25%, meaning 122 of the 800 training observations are misclassified. The tree has 8 terminal nodes, the final "leaf" nodes where classification decisions are made; more terminal nodes mean a more complex tree, but also a greater risk of overfitting. Only four of the predictors (LoyalCH, PriceDiff, ListPriceDiff, and PctDiscMM) are actually used in constructing the tree.
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree_oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1059.00 CH ( 0.62500 0.37500 )
## 2) LoyalCH < 0.5036 356 431.80 MM ( 0.29494 0.70506 )
## 4) LoyalCH < 0.0356415 57 0.00 MM ( 0.00000 1.00000 ) *
## 5) LoyalCH > 0.0356415 299 387.60 MM ( 0.35117 0.64883 )
## 10) LoyalCH < 0.275386 111 113.30 MM ( 0.20721 0.79279 ) *
## 11) LoyalCH > 0.275386 188 257.60 MM ( 0.43617 0.56383 )
## 22) PriceDiff < 0.065 77 78.70 MM ( 0.20779 0.79221 ) *
## 23) PriceDiff > 0.065 111 149.90 CH ( 0.59459 0.40541 ) *
## 3) LoyalCH > 0.5036 444 308.40 CH ( 0.88964 0.11036 )
## 6) LoyalCH < 0.745156 169 187.30 CH ( 0.75740 0.24260 )
## 12) ListPriceDiff < 0.235 66 91.25 CH ( 0.53030 0.46970 )
## 24) PctDiscMM < 0.196196 49 63.26 CH ( 0.65306 0.34694 ) *
## 25) PctDiscMM > 0.196196 17 15.84 MM ( 0.17647 0.82353 ) *
## 13) ListPriceDiff > 0.235 103 65.64 CH ( 0.90291 0.09709 ) *
## 7) LoyalCH > 0.745156 275 72.36 CH ( 0.97091 0.02909 ) *
Terminal node 23 contains 111 observations with 0.275386 < LoyalCH < 0.5036 and PriceDiff > 0.065, i.e., customers with moderate loyalty to Citrus Hill for whom Minute Maid's sale price exceeds CH's by more than 6.5 cents (PriceDiff is the sale price of MM minus the sale price of CH). The predicted class is CH, with class probabilities of approximately 59.5% CH and 40.5% MM, and a deviance of 149.90, higher than most other terminal nodes, indicating relatively uncertain predictions. In other words, when customers are somewhat loyal to CH and CH is the cheaper option, a majority still choose CH, but not by an overwhelming margin.
(d) Create a plot of the tree, and interpret the results.
plot(tree_oj)
text(tree_oj, pretty = 0)
The plot shows that LoyalCH dominates the tree: the first two levels of splits use brand loyalty alone, and the price-related variables (PriceDiff, ListPriceDiff, PctDiscMM) only come into play for customers with intermediate loyalty. Customers with very low loyalty (LoyalCH < 0.036) are all predicted to buy MM, while highly loyal customers (LoyalCH > 0.745) are predicted to buy CH almost regardless of price.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
pred_test <- predict(tree_oj, newdata = test, type = "class")
table(pred_test, test$Purchase)
##
## pred_test CH MM
## CH 143 42
## MM 10 75
mean(pred_test != test$Purchase)
## [1] 0.1925926
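Equivalently, the test error rate can be read off the confusion matrix just produced (a quick check):
# One minus the proportion of correct predictions on the diagonal
conf_mat <- table(pred_test, test$Purchase)
1 - sum(diag(conf_mat)) / sum(conf_mat)  # (42 + 10) / 270, about 19.3%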
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv_oj <- cv.tree(tree_oj, FUN = prune.misclass)
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv_oj$size, cv_oj$dev, type = "b", xlab = "Tree Size", ylab = "CV Error Rate")
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
cv_oj$size[which.min(cv_oj$dev)]
## [1] 8
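Since size 8 is the full, unpruned tree from part (b), cross-validation does not actually favor pruning here, which is why part (i) falls back to a five-node tree. The full size/error table can be inspected directly (a minimal sketch using the cv_oj object; dev is the number of cross-validated misclassifications because FUN = prune.misclass was used):
# Cross-validated misclassification counts for each candidate tree size
data.frame(size = cv_oj$size, cv_misclass = cv_oj$dev)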
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
best_size <- cv_oj$size[which.min(cv_oj$dev)]
if (best_size == max(cv_oj$size)) {
  pruned_oj <- prune.misclass(tree_oj, best = 5)
} else {
  pruned_oj <- prune.misclass(tree_oj, best = best_size)
}
plot(pruned_oj)
text(pruned_oj, pretty = 0)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
mean(predict(tree_oj, type = "class") != train$Purchase) # unpruned
## [1] 0.1525
mean(predict(pruned_oj, type = "class") != train$Purchase) # pruned
## [1] 0.16625
The pruned tree has the higher training error rate (16.625% vs. 15.25%). This is expected: pruning removes splits that were fitted to the training data, so the training fit can only get worse.
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
mean(predict(tree_oj, newdata = test, type = "class") != test$Purchase) # unpruned
## [1] 0.1925926
mean(predict(pruned_oj, newdata = test, type = "class") != test$Purchase) # pruned
## [1] 0.2148148
The pruned tree also has the higher test error rate (21.48% vs. 19.26%), so in this case pruning down to five terminal nodes does not improve test performance either.