library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Loading required package: lattice
library(rattle)
## Warning: package 'rattle' was built under R version 4.4.3
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tree)
## Warning: package 'tree' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(BART)
## Warning: package 'BART' was built under R version 4.4.3
## Loading required package: nlme
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
attach(Carseats)
attach(OJ)
3. Consider the Gini index, classification error, and entropy in a
simple classification setting with two classes. Create a single plot
that displays each of these quantities as a function of $\hat{p}_{m1}$. The
x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should
display the value of the Gini index, classification error, and
entropy.
Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make
this plot by hand, but it will be much easier to make in R.
p <- seq(0, 1, 0.01)
gini <- 2 * p * (1 - p)
classification_error <- 1 - pmax(p, 1 - p)
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.na(entropy)] <- 0
plot(p, gini, type = "l", col = "blue", ylim = c(0,1),
ylab = "Value", xlab = "pm1")
lines(p, entropy, col = "red")
lines(p, classification_error, col = "green")
legend("top", legend = c("Gini", "Entropy", "Classification Error"),
col = c("blue", "red", "green"), lty = 1)

8. In the lab, a classification tree was applied to the Carseats
data set after converting Sales into a qualitative response variable.
Now we will seek to predict Sales using regression trees and related
approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
set.seed(62)
train_index <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
car_train <- Carseats[train_index, ]
car_test <- Carseats[-train_index, ]
(b) Fit a regression tree to the training set. Plot the tree, and
interpret the results. What test MSE do you obtain?
car_tree <- rpart(Sales ~ ., data = car_train)
fancyRpartPlot(car_tree, sub = "", cex = 0.7)

The decision tree predicts Sales from features such as ShelveLoc,
Price, and Income. The first split checks whether ShelveLoc is Bad or
Medium; within each branch the tree keeps splitting on other variables
such as Price and Advertising. The tree ends in leaf nodes, which give
the predicted Sales along with the number and share of training
observations that reach each node. Each successive split refines the
prediction.
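To see how these splits are applied in practice, here is a minimal
sketch (not part of the original output; the column selection is
illustrative) that traces a single test observation through the fitted
tree:
# Illustrative only: inspect the features used near the top of the tree,
# then obtain the leaf-node prediction for that same row.
car_test[1, c("ShelveLoc", "Price", "Income")]
predict(car_tree, newdata = car_test[1, ])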
tree_pred <- predict(car_tree, newdata = car_test)
mse_tree <- mean((tree_pred - car_test$Sales)^2)
mse_tree
## [1] 4.741038
(c) Use cross-validation in order to determine the optimal level of
tree complexity. Does pruning the tree improve the test MSE?
printcp(car_tree)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = car_train)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Education Income Price
## [7] ShelveLoc
##
## Root node error: 2204.8/281 = 7.8463
##
## n= 281
##
## CP nsplit rel error xerror xstd
## 1 0.209522 0 1.00000 1.00909 0.087406
## 2 0.119805 1 0.79048 0.80463 0.065598
## 3 0.065052 2 0.67067 0.70854 0.058902
## 4 0.039570 3 0.60562 0.68788 0.054292
## 5 0.035441 4 0.56605 0.72448 0.056051
## 6 0.029500 5 0.53061 0.73349 0.053593
## 7 0.027752 6 0.50111 0.71589 0.053731
## 8 0.024447 7 0.47336 0.71297 0.053083
## 9 0.019810 8 0.44891 0.71525 0.053177
## 10 0.019158 9 0.42910 0.71966 0.053711
## 11 0.017220 10 0.40994 0.70184 0.052208
## 12 0.016757 11 0.39272 0.68745 0.052450
## 13 0.015813 12 0.37596 0.68626 0.053120
## 14 0.015727 13 0.36015 0.68323 0.052788
## 15 0.013609 14 0.34442 0.66967 0.051432
## 16 0.013460 15 0.33082 0.65294 0.048939
## 17 0.012952 16 0.31736 0.65294 0.048939
## 18 0.010835 17 0.30440 0.63746 0.048019
## 19 0.010000 18 0.29357 0.63109 0.047565
plotcp(car_tree)

best_cp <- car_tree$cptable[which.min(car_tree$cptable[, "xerror"]), "CP"]
car_prun <- prune(car_tree, cp = best_cp)
pred_prun <- predict(car_prun, car_test)
mse_prun <- mean((car_test$Sales - pred_prun)^2)
mse_prun
## [1] 4.741038
After pruning with the CP value that minimizes the cross-validated
error, the test MSE is unchanged at 4.74. The minimum xerror in the CP
table occurs at the full tree (18 splits), so pruning removes no splits
here and neither improves nor worsens performance on the test set.
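As an alternative, the one-standard-error rule picks the simplest tree
whose cross-validated error is within one standard error of the
minimum, which usually selects a smaller tree. A minimal sketch (not
run above, so its test MSE may differ from the values reported here;
the object names are illustrative):
# Sketch: prune with the one-standard-error rule instead of the minimum xerror.
cp_tab <- car_tree$cptable
min_row <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]  # simplest tree meeting the threshold
car_prun_1se <- prune(car_tree, cp = cp_1se)
mean((car_test$Sales - predict(car_prun_1se, car_test))^2)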
(d) Use the bagging approach in order to analyze this data. What
test MSE do you obtain? Use the importance() function to determine which
variables are most important.
set.seed(62)
car_bag <- randomForest(Sales ~ ., data = car_train, mtry = ncol(car_train) - 1, importance = TRUE)
car_bag
##
## Call:
## randomForest(formula = Sales ~ ., data = car_train, mtry = ncol(car_train) - 1, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.466286
## % Var explained: 68.57
bag_pred <- predict(car_bag, newdata = car_test)
mse_bag <- mean((car_test$Sales - bag_pred)^2)
mse_bag
## [1] 2.419629
Using the bagging approach, the test MSE was 2.42, a substantial
improvement over the single regression tree (4.74). Averaging many
trees grown on bootstrap samples reduces variance and noticeably
improves prediction accuracy.
importance(car_bag)
## %IncMSE IncNodePurity
## CompPrice 29.8580734 197.688262
## Income 12.5910626 122.873227
## Advertising 25.2242668 212.756458
## Population -0.3157721 75.005450
## Price 66.9930266 668.898161
## ShelveLoc 62.9411213 543.989191
## Age 22.7934087 232.611362
## Education 4.7508081 64.795992
## Urban -2.3355119 9.792186
## US 3.6163318 14.813023
varImpPlot(car_bag)

The most important variables are Price and ShelveLoc, followed by
Advertising, CompPrice, and Age, which are moderately important.
Variables such as Income, Education, Urban, and US have low
importance.
(e) Use random forests to analyze this data. What test MSE do you
obtain? Use the importance() function to determine which variables are
most important. Describe the effect of m, the number of variables
considered at each split, on the error rate obtained.
set.seed(62)
car_rf <- randomForest(Sales ~ ., data = car_train, mtry = 5, importance = TRUE)
car_rf
##
## Call:
## randomForest(formula = Sales ~ ., data = car_train, mtry = 5, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 2.604994
## % Var explained: 66.8
rf_pred <- predict(car_rf, newdata = car_test)
mse_rf <- mean((car_test$Sales - rf_pred)^2)
mse_rf
## [1] 2.551815
importance(car_rf)
## %IncMSE IncNodePurity
## CompPrice 22.0619269 184.08901
## Income 8.1554907 143.48805
## Advertising 21.7979130 231.15250
## Population -0.2317539 107.04063
## Price 57.9510177 599.13072
## ShelveLoc 52.3995633 491.56403
## Age 21.6935478 250.48481
## Education 4.1393667 79.45348
## Urban -2.6527966 12.04230
## US 5.9685629 25.13044
With m = 5 the test MSE was 2.55, slightly higher than the 2.42
obtained with bagging (m = 10). For this data set, increasing the
number of variables considered at each split reduced the test error; a
sketch for comparing a wider range of mtry values follows below.
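To examine the effect of m more systematically, a short sketch
(illustrative; the exact MSE values depend on the seed, and the helper
names below are hypothetical) refits the forest over a grid of mtry
values and compares test MSEs:
# Sketch: test MSE as a function of mtry, the number of variables tried at each split.
mtry_vals <- 1:10
mtry_mse <- sapply(mtry_vals, function(m) {
  set.seed(62)
  fit <- randomForest(Sales ~ ., data = car_train, mtry = m)
  mean((car_test$Sales - predict(fit, car_test))^2)
})
plot(mtry_vals, mtry_mse, type = "b", xlab = "mtry", ylab = "Test MSE")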
(f) Now analyze the data using BART, and report your results.
x_train <- data.matrix(car_train[, -which(names(car_train) == "Sales")])
y_train <- car_train$Sales
x_test <- data.matrix(car_test[, -which(names(car_test) == "Sales")])
y_test <- car_test$Sales
set.seed(62)
car_bart <- gbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 281, 10, 119
## y1,yn: 1.984591, -1.575409
## x1,x[n*p]: 138.000000, 2.000000
## xp1,xp[np*p]: 113.000000, 2.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 68 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.703589,7.51541
## *****sigma: 1.900530
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,10,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
bart_pred <- car_bart$yhat.test.mean
mse_bart <- mean((y_test - bart_pred)^2)
mse_bart
## [1] 1.406111
Using BART, the test MSE was 1.41, which is the lowest among all
models, indicating the best prediction accuracy on the test set.
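BART has no importance() method, but the average number of times each
predictor is chosen for a split across the ensemble gives a rough
analogue. A minimal sketch using the varcount.mean component returned
by gbart():
# Sketch: average split counts per variable across the BART trees (rough importance).
ord <- order(car_bart$varcount.mean, decreasing = TRUE)
car_bart$varcount.mean[ord]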
9. This problem involves the OJ data set which is part of the ISLR2
package.
(a) Create a training set containing a random sample of 800
observations, and a test set containing the remaining observations.
set.seed(62)
n <- nrow(OJ)
oj_train_index <- sample(1:n, 800)
oj_train <- OJ[oj_train_index, ]
oj_test <- OJ[-oj_train_index, ]
(b) Fit a tree to the training data, with Purchase as the response
and the other variables as predictors. Use the summary() function to
produce summary statistics about the tree, and describe the results
obtained. What is the training error rate? How many terminal nodes does
the tree have?
oj_tree <- tree(Purchase ~ ., data = oj_train)
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscCH"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7594 = 600.6 / 791
## Misclassification error rate: 0.1612 = 129 / 800
The training error rate is 0.1612 and the tree has 9 terminal nodes.
Only four predictors (LoyalCH, PriceDiff, ListPriceDiff, PctDiscCH) are
actually used in constructing the tree.
(c) Type in the name of the tree object in order to get a detailed
text output. Pick one of the terminal nodes, and interpret the
information displayed.
oj_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.00 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.48285 297 326.70 MM ( 0.23906 0.76094 )
## 4) LoyalCH < 0.0616725 66 17.92 MM ( 0.03030 0.96970 ) *
## 5) LoyalCH > 0.0616725 231 281.70 MM ( 0.29870 0.70130 )
## 10) PriceDiff < 0.31 179 195.00 MM ( 0.23464 0.76536 )
## 20) PriceDiff < -0.34 19 0.00 MM ( 0.00000 1.00000 ) *
## 21) PriceDiff > -0.34 160 184.20 MM ( 0.26250 0.73750 ) *
## 11) PriceDiff > 0.31 52 72.01 CH ( 0.51923 0.48077 ) *
## 3) LoyalCH > 0.48285 503 460.20 CH ( 0.82903 0.17097 )
## 6) LoyalCH < 0.705699 204 258.30 CH ( 0.67157 0.32843 )
## 12) ListPriceDiff < 0.235 79 106.70 MM ( 0.40506 0.59494 )
## 24) PctDiscCH < 0.052007 66 82.56 MM ( 0.31818 0.68182 ) *
## 25) PctDiscCH > 0.052007 13 11.16 CH ( 0.84615 0.15385 ) *
## 13) ListPriceDiff > 0.235 125 109.90 CH ( 0.84000 0.16000 ) *
## 7) LoyalCH > 0.705699 299 141.50 CH ( 0.93645 0.06355 )
## 14) PriceDiff < -0.39 8 10.59 MM ( 0.37500 0.62500 ) *
## 15) PriceDiff > -0.39 291 112.30 CH ( 0.95189 0.04811 ) *
In node 15 there are 291 training observations with LoyalCH > 0.705699
and PriceDiff > -0.39. The predicted class is CH, and about 95% of the
observations in this node purchased CH, so the model is confident in
this prediction.
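As a quick check (a sketch, not part of the original output; node15 is
an illustrative name), the node can be reproduced by hand by filtering
the training data on the two conditions that define it:
# Sketch: observations falling in node 15 and their class proportions.
node15 <- subset(oj_train, LoyalCH > 0.705699 & PriceDiff > -0.39)
nrow(node15)
prop.table(table(node15$Purchase))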
(d) Create a plot of the tree, and interpret the results.
plot(oj_tree)
text(oj_tree)

Most of the splitting is driven by LoyalCH: the top splits are on brand
loyalty to Citrus Hill, and price-related variables such as PriceDiff
only enter within the loyalty branches.
(e) Predict the response on the test data, and produce a confusion
matrix comparing the test labels to the predicted test labels. What is
the test error rate?
oj_pred <- predict(oj_tree, newdata = oj_test, type = "class")
confusionMatrix(data = oj_pred, reference = oj_test$Purchase)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 130 19
## MM 35 86
##
## Accuracy : 0.8
## 95% CI : (0.7472, 0.846)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 2.106e-11
##
## Kappa : 0.5906
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 0.7879
## Specificity : 0.8190
## Pos Pred Value : 0.8725
## Neg Pred Value : 0.7107
## Prevalence : 0.6111
## Detection Rate : 0.4815
## Detection Prevalence : 0.5519
## Balanced Accuracy : 0.8035
##
## 'Positive' Class : CH
##
oj_error <- mean(oj_pred != oj_test$Purchase)
oj_error
## [1] 0.2
The test error rate is 0.2, consistent with the reported accuracy of 0.8.
(f) Apply the cv.tree() function to the training set in order to
determine the optimal tree size.
set.seed(62)
oj_cv <- cv.tree(oj_tree, FUN = prune.misclass)
oj_cv
## $size
## [1] 9 8 6 5 2 1
##
## $dev
## [1] 152 151 150 149 160 312
##
## $k
## [1] -Inf 0 1 2 8 155
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(g) Produce a plot with tree size on the x-axis and cross-validated
classification error rate on the y-axis.
plot(oj_cv$size, oj_cv$dev, type = "b")

(h) Which tree size corresponds to the lowest cross-validated
classification error rate?
oj_cv$size[which.min(oj_cv$dev)]
## [1] 5
The optimal tree size is 5, as it gives the lowest
cross-validated classification error.
(i) Produce a pruned tree corresponding to the optimal tree size
obtained using cross-validation. If cross-validation does not lead to
selection of a pruned tree, then create a pruned tree with five terminal
nodes.
oj_pruned <- prune.misclass(oj_tree, best = 5)
plot(oj_pruned)
text(oj_pruned, pretty = 0)

(j) Compare the training error rates between the pruned and unpruned
trees. Which is higher?
pred_unpruned <- predict(oj_tree, newdata = oj_train, type = "class")
pred_pruned <- predict(oj_pruned, newdata = oj_train, type = "class")
error_unpruned <- mean(pred_unpruned != oj_train$Purchase)
error_unpruned
## [1] 0.16125
The unpruned tree's training error rate is 0.16125.
error_pruned <- mean(pred_pruned != oj_train$Purchase)
error_pruned
## [1] 0.16625
The pruned tree's training error rate is 0.16625.
The training error rate for the unpruned tree was
0.16125, while the pruned tree had a
slightly higher error rate of 0.16625.
(k) Compare the test error rates between the pruned and unpruned
trees. Which is higher?
test_unpruned <- predict(oj_tree, newdata = oj_test, type = "class")
test_pruned <- predict(oj_pruned, newdata = oj_test, type = "class")
error_unpruned2 <- mean(test_unpruned != oj_test$Purchase)
error_unpruned2
## [1] 0.2
The test error rate for the unpruned tree is 0.2.
error_pruned2 <- mean(test_pruned != oj_test$Purchase)
error_pruned2
## [1] 0.1851852
The test error rate for the pruned tree is 0.1851852.
The test error rate for the unpruned tree was 0.2, while the pruned
tree had a slightly lower error rate of 0.1852. The pruned tree
therefore generalizes slightly better despite its marginally higher
training error, which is consistent with pruning reducing overfitting.
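For completeness, the confusion-matrix comparison from part (e) could
be repeated for the pruned tree (a sketch, not run above); its accuracy
should agree with 1 - 0.1852.
# Sketch: confusion matrix for the pruned tree on the test set.
confusionMatrix(data = test_pruned, reference = oj_test$Purchase)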