Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
p = seq(0, 1, 0.001)
gini = 2 * p * (1 - p)
class.error = 1 - pmax(p, 1 - p)
entropy = - (p * log(p) + (1 - p) * log(1 - p))
errors <- cbind(gini, class.error, entropy)
matplot(p, errors, ylab = "Error", xlab = "p", col = c("purple", "hotpink", "blue"), pch = 20)
legend('bottom', inset=.05, legend = c('Gini Index', 'Classification Error', 'Entropy'), col = c("purple", "hotpink", "blue"), pch = 20)
In the lab, a classification tree was applied to the Carseats data set
after converting Sales into a qualitative response
variable. Now we will seek to predict Sales using
regression trees and related approaches, treating the response as a
quantitative variable.
(a) Split the data set into a training set and a test set.
# Splitting into train and test using a 70/30 split
set.seed(13) ; index = sample(nrow(Carseats), 0.7*nrow(Carseats), replace = FALSE)
cs.train <- Carseats[index, ]
cs.test <- Carseats[-index, ]
…
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
# Fitting tree
tree.cs = tree(Sales ~ ., data = cs.train)
summary(tree.cs)
##
## Regression tree:
## tree(formula = Sales ~ ., data = cs.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "CompPrice" "Income"
## [6] "Advertising"
## Number of terminal nodes: 19
## Residual mean deviance: 2.245 = 585.9 / 261
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.8430 -0.8754 -0.0050 0.0000 0.8856 4.1270
# Plotting tree
plot(tree.cs)
text(tree.cs, pretty = 0, cex = 0.6)
# Calculating MSE
preds.cs = predict(tree.cs, cs.test)
mean((cs.test$Sales - preds.cs)^2)
## [1] 5.035004
The MSE is approximately 5.04. We observe from the plot and summary output that only 6 out of the 10 predictors were used in constructing the tree, which contains 19 terminal nodes.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
# Fitting tree using cross-validation
set.seed(13) ; cv.tree.cs = cv.tree(tree.cs)
# Plotting tree size against deviance
plot(cv.tree.cs$size, cv.tree.cs$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
From the plot above, it appears that using a tree with 15 terminal nodes results in the lowest deviance. However, a tree size of 12 exhibits almost equivalent performance while decreasing the chances of over-fitting. For this reason, I will prune the tree using 12 terminal nodes and evaluate its performance on the test set.
# Pruning tree, predicting on the test set, and calculating MSE
pruned.tree.cs <- prune.tree(tree.cs, best = 12)
preds.pruned.cs = predict(pruned.tree.cs, cs.test)
round(mean((cs.test$Sales - preds.pruned.cs)^2), 2)
## [1] 5.01
The MSE of the pruned tree is 5.01, showing that pruning improves the error rate as expected, although not significantly.
(d) Use the bagging approach in order to analyze this
data. What test MSE do you obtain? Use the importance()
function to determine which variables are most important.
# Fitting bagged model
set.seed(13) ; bag.cs <- randomForest(Sales ~ ., data = cs.train, mtry = 10, importance = TRUE)
bag.cs
##
## Call:
## randomForest(formula = Sales ~ ., data = cs.train, mtry = 10, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.453833
## % Var explained: 68.9
# Assessing variable importance
importance(bag.cs)
## %IncMSE IncNodePurity
## CompPrice 30.8637031 210.24617
## Income 10.0309275 122.62188
## Advertising 18.3580695 171.38890
## Population -1.8089018 80.24687
## Price 69.4041828 648.17256
## ShelveLoc 70.9038306 651.19973
## Age 21.3879874 187.98531
## Education 3.3753411 59.31635
## Urban 0.5094791 11.22996
## US 3.6151313 8.30000
# Predicting on the test set and calculating MSE
preds.bag.cs = predict(bag.cs, cs.test)
round(mean((cs.test$Sales - preds.bag.cs)^2), 2)
## [1] 2.4
The bagged model performs remarkable better than the previous
single-tree models assessed, resulting in an MSE of 2.40. This clearly
demonstrates the advantages inherent in combining multiple trees within
the model. The output also reveals that the two most important variables
are Price and ShelveLoc.
(e) Use random forests to analyze this data. What test
MSE do you obtain? Use the importance() function to
determine which variables are most important. Describe the effect of
\(m\), the number of variables
considered at each split, on the error rate obtained.
# Fitting random forest
set.seed(13) ; rf.cs <- randomForest(Sales ~ ., data = cs.train, importance = TRUE)
rf.cs
##
## Call:
## randomForest(formula = Sales ~ ., data = cs.train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 2.873453
## % Var explained: 63.59
# Assessing variable importance
importance(rf.cs)
## %IncMSE IncNodePurity
## CompPrice 18.431048 197.16339
## Income 6.047285 166.71194
## Advertising 15.912619 205.91286
## Population -1.309743 152.27778
## Price 43.694487 495.84108
## ShelveLoc 47.746890 504.29165
## Age 16.526979 227.96415
## Education 1.029517 97.02562
## Urban -1.727364 20.61620
## US 3.504014 24.70489
# Predicting on the test set and calculating MSE
preds.rf.cs = predict(rf.cs, cs.test)
round(mean((cs.test$Sales - preds.rf.cs)^2), 2)
## [1] 2.8
The random forest results in an MSE of 2.80, slightly worse than the
bagged model. However, after experimenting with different numbers of
variables at each split, I found that 7 variables (\(mtry = 7\)) results in the lowest MSE
(3.35) for this particular case. Again, we observe that the most
important variables are Price and
ShelveLoc.
(f) Now analyze the data using BART, and report your results.
# Splitting into train and test set for y and x-variables
set.seed(13)
x = Carseats[,2:10]
y = Carseats[,"Sales"]
xtrain = x[index, ]
ytrain = y[index]
xtest = x[-index, ]
ytest = y[-index]
# Fitting BART model
bart.fit <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 280, 12, 120
## y1,yn: -3.949607, 1.150393
## x1,x[n*p]: 108.000000, 1.000000
## xp1,xp[np*p]: 117.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.278247,3,0.185217,7.41961
## *****sigma: 0.975113
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,12,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
# Predicting on the test set and computing MSE
yhat.bart <- bart.fit$yhat.test.mean
round(mean((ytest - yhat.bart)^2), 2)
## [1] 1.72
The results show that the Bayesian Additive Regression Trees (BART) model exhibits the best performance of all with its MSE of 1.72.
This problem involves the OJ data set which is
part of the ISLR2 package.
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
# Splitting into train and test according to the instructions
set.seed(13) ; index = sample(1:nrow(OJ), 800)
oj.train = OJ[index, ]
oj.test = OJ[-index, ]
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
# Fitting tree
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "WeekofPurchase" "ListPriceDiff"
## [5] "DiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7313 = 578.5 / 791
## Misclassification error rate: 0.1562 = 125 / 800
The training error rate is 15.62% and the tree contains 9 terminal nodes.
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
# Printing text output
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1061.00 CH ( 0.62250 0.37750 )
## 2) LoyalCH < 0.5036 344 416.50 MM ( 0.29360 0.70640 )
## 4) LoyalCH < 0.051325 57 10.07 MM ( 0.01754 0.98246 ) *
## 5) LoyalCH > 0.051325 287 371.10 MM ( 0.34843 0.65157 )
## 10) PriceDiff < 0.065 117 110.10 MM ( 0.17949 0.82051 ) *
## 11) PriceDiff > 0.065 170 234.80 MM ( 0.46471 0.53529 )
## 22) WeekofPurchase < 249 73 89.35 MM ( 0.30137 0.69863 ) *
## 23) WeekofPurchase > 249 97 131.50 CH ( 0.58763 0.41237 )
## 46) LoyalCH < 0.430291 70 96.98 MM ( 0.48571 0.51429 ) *
## 47) LoyalCH > 0.430291 27 22.65 CH ( 0.85185 0.14815 ) *
## 3) LoyalCH > 0.5036 456 351.30 CH ( 0.87061 0.12939 )
## 6) LoyalCH < 0.764572 201 225.50 CH ( 0.75124 0.24876 )
## 12) ListPriceDiff < 0.235 78 108.10 CH ( 0.51282 0.48718 )
## 24) DiscMM < 0.15 40 47.05 CH ( 0.72500 0.27500 ) *
## 25) DiscMM > 0.15 38 45.73 MM ( 0.28947 0.71053 ) *
## 13) ListPriceDiff > 0.235 123 78.64 CH ( 0.90244 0.09756 ) *
## 7) LoyalCH > 0.764572 255 77.87 CH ( 0.96471 0.03529 ) *
Node 47 is reached following a series of decisions based on the the
variables LoyalCH, PriceDiff, and
WeekofPurchase. The terminal node is characterized by a
decision based on the value of LoyalCH, indicating the
customer’s brand loyalty for Citrus Hill (CH). If
LoyalCH is greater than 0.430291, the outcome will be
assigned to Citrus Hill with a probability of 85.19%, and to Minute Maid
(MM) with a probability of 14.81%. This decision results in a
deviance value of approximately 22.65.
(d) Create a plot of the tree, and interpret the results.
# Plotting tree
plot(tree.oj)
text(tree.oj, pretty = 0, cex = 0.6)
The plot serves as a visual representation of the text output
examined above. At the top of the tree, we observe LoyalCH
as the root node, dividing the tree into two main branches. The
customer’s loyalty to Citrus Hill appears to be a determining factor of
the prediction, seeing that 4/5 of the terminal nodes on the left branch
(LoyalCH < 0.5036) classifies Purchase as Minute Maid,
while the 3/4 of the terminal nodes on the right branch (LoyalCH >
0.5036) classifies Purchase as Citrus Hill. Other variables
contributing to the decision-making process are PriceDiff,
WeekofPurchase, ListPriceDiff, and
DiscMM.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
# Predicting on the test set and preoducing confusion matrix
preds.tree.oj = predict(tree.oj, newdata = oj.test, type = "class")
table(preds.tree.oj, oj.test$Purchase)
##
## preds.tree.oj CH MM
## CH 126 20
## MM 29 95
# Calculating error rate
round((20 + 19) / 270, 4)
## [1] 0.1444
It’s atypical to observe a situation where the test error rate (14.44%) is smaller than the training error rate (15.62%). However, due to random chance or variability in the training data, models sometimes perform better on the test set than on the training set.
(f) Apply the cv.tree() function to the
training set in order to determine the optimal tree
size.
# Applying cross-validation to determine optimal tree size
set.seed(13) ; cv.tree.oj <- cv.tree(tree.oj, FUN = prune.misclass)
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
# Plotting tree size against classification error (deviance)
plot(cv.tree.oj$size, cv.tree.oj$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
text(cv.tree.oj$size, cv.tree.oj$dev, labels = round(cv.tree.oj$dev, 2), pos = 3)
(h) Which tree size corresponds to the lowest
cross-validated classification error rate?
It is evident from the plot that 8 terminal nodes results in the best
cross-validated classification error, although just marginally better
than the error using 9 terminal nodes. Consequently, cross-validation
suggest minimal pruning, decreasing the tree from a size of 9 to 8.
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
# Pruning tree and printing summary to obtain training error rate
pruned.tree.oj <- prune.misclass(tree.oj, best = 8)
summary(pruned.tree.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 23L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "WeekofPurchase" "ListPriceDiff"
## [5] "DiscMM"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7454 = 590.3 / 792
## Misclassification error rate: 0.1588 = 127 / 800
# Predicting on the test set and producing confusion matrix
preds.pruned.oj <- predict(pruned.tree.oj, oj.test, type = "class")
table(preds.pruned.oj, oj.test$Purchase)
##
## preds.pruned.oj CH MM
## CH 134 35
## MM 21 80
# Calculating test error rate
round((35 + 21) / 270, 4)
## [1] 0.2074
(j) Compare the training error rates between the pruned
and unpruned trees. Which is higher?
The pruned tree using 8 terminal nodes as determined by cross-validation
resulted in a training error of 15.88%. As expected, the training error
rate slightly increased from the one of the unpruned tree (15.62%), as
reduction in model complexity typically results in a higher training
error.
(k) Compare the test error rates between the pruned and
unpruned trees. Which is higher?
When comparing the test errors, we surprisingly find that the pruned
tree performed somewhat worse than the unpruned tree when evaluating on
the test set. As seen earlier, the unpruned tree resulted in a test
error rate of 14.44%, while the pruned tree with one less terminal nodes
yields a test error of 20.74%.