Exercise 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
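
In the two-class setting, writing \(p = \hat{p}_{m1}\) (so that \(\hat{p}_{m2} = 1 - p\)), the three measures reduce to

\[
G = 2p(1 - p), \qquad E = 1 - \max(p,\, 1 - p), \qquad D = -p\log p - (1 - p)\log(1 - p),
\]

which are exactly the quantities computed in the code below.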

# Gini index, classification error, and entropy for two classes, as functions of p-hat_m1
p = seq(0, 1, 0.001)
gini = 2 * p * (1 - p)
class.error = 1 - pmax(p, 1 - p)
# At p = 0 and p = 1, 0 * log(0) evaluates to NaN; the entropy's limit there is 0,
# and matplot simply omits those two points
entropy = - (p * log(p) + (1 - p) * log(1 - p))

errors <- cbind(gini, class.error, entropy)

matplot(p, errors, ylab = "Error", xlab = "p", col = c("purple", "hotpink", "blue"), pch = 20)
legend('bottom', inset=.05, legend = c('Gini Index', 'Classification Error', 'Entropy'), col = c("purple", "hotpink", "blue"), pch = 20)

Exercise 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
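
The code in this exercise assumes the relevant packages have been attached beforehand; the exact calls are not shown in the original write-up, so the following is a minimal setup sketch:

# Assumed setup (loading calls not shown in the original write-up)
library(ISLR2)        # Carseats and OJ data sets
library(tree)         # tree(), cv.tree(), prune.tree(), prune.misclass()
library(randomForest) # randomForest(), importance()
library(BART)         # gbart()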

(a) Split the data set into a training set and a test set.

# Splitting into train and test using a 70/30 split
set.seed(13) ; index = sample(nrow(Carseats), 0.7*nrow(Carseats), replace = FALSE)
cs.train <- Carseats[index, ]
cs.test <- Carseats[-index, ]

…

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

# Fitting tree
tree.cs = tree(Sales ~ ., data = cs.train)
summary(tree.cs)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = cs.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "CompPrice"   "Income"     
## [6] "Advertising"
## Number of terminal nodes:  19 
## Residual mean deviance:  2.245 = 585.9 / 261 
## Distribution of residuals:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.8430 -0.8754 -0.0050  0.0000  0.8856  4.1270
# Plotting tree
plot(tree.cs)
text(tree.cs, pretty = 0, cex = 0.6)

# Calculating MSE
preds.cs = predict(tree.cs, cs.test)
mean((cs.test$Sales - preds.cs)^2)
## [1] 5.035004

The test MSE is approximately 5.04. From the plot and the summary output, we see that only 6 of the 10 predictors were used in constructing the tree, which contains 19 terminal nodes.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

# Running cost-complexity cross-validation on the fitted tree
set.seed(13) ; cv.tree.cs = cv.tree(tree.cs)

# Plotting tree size against deviance
plot(cv.tree.cs$size, cv.tree.cs$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")

From the plot above, a tree with 15 terminal nodes appears to give the lowest cross-validated deviance. However, a tree with 12 terminal nodes performs almost as well while reducing the risk of overfitting. For this reason, I will prune the tree to 12 terminal nodes and evaluate its performance on the test set.

# Pruning tree, predicting on the test set, and calculating MSE
pruned.tree.cs <- prune.tree(tree.cs, best = 12)
preds.pruned.cs = predict(pruned.tree.cs, cs.test)
round(mean((cs.test$Sales - preds.pruned.cs)^2), 2)
## [1] 5.01

The test MSE of the pruned tree is 5.01, a slight improvement over the unpruned tree's 5.04; pruning helps here, although only marginally.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

# Fitting bagged model
set.seed(13) ; bag.cs <- randomForest(Sales ~ ., data = cs.train, mtry = 10, importance = TRUE)
bag.cs
## 
## Call:
##  randomForest(formula = Sales ~ ., data = cs.train, mtry = 10,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.453833
##                     % Var explained: 68.9
# Assessing variable importance 
importance(bag.cs)
##                %IncMSE IncNodePurity
## CompPrice   30.8637031     210.24617
## Income      10.0309275     122.62188
## Advertising 18.3580695     171.38890
## Population  -1.8089018      80.24687
## Price       69.4041828     648.17256
## ShelveLoc   70.9038306     651.19973
## Age         21.3879874     187.98531
## Education    3.3753411      59.31635
## Urban        0.5094791      11.22996
## US           3.6151313       8.30000
# Predicting on the test set and calculating MSE
preds.bag.cs = predict(bag.cs, cs.test)
round(mean((cs.test$Sales - preds.bag.cs)^2), 2)
## [1] 2.4

The bagged model performs remarkably better than the single-tree models considered so far, with a test MSE of 2.40. This clearly demonstrates the benefit of aggregating many trees. The importance output also shows that the two most important variables are Price and ShelveLoc.
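
The same importance measures can also be inspected visually; a quick sketch using randomForest's built-in plotting function:

# Dotchart of %IncMSE and IncNodePurity for the bagged model
varImpPlot(bag.cs, main = "Variable importance (bagging)")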

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of \(m\), the number of variables considered at each split, on the error rate obtained.

# Fitting random forest
set.seed(13) ; rf.cs <- randomForest(Sales ~ ., data = cs.train, importance = TRUE)
rf.cs
## 
## Call:
##  randomForest(formula = Sales ~ ., data = cs.train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 2.873453
##                     % Var explained: 63.59
# Assessing variable importance 
importance(rf.cs)
##               %IncMSE IncNodePurity
## CompPrice   18.431048     197.16339
## Income       6.047285     166.71194
## Advertising 15.912619     205.91286
## Population  -1.309743     152.27778
## Price       43.694487     495.84108
## ShelveLoc   47.746890     504.29165
## Age         16.526979     227.96415
## Education    1.029517      97.02562
## Urban       -1.727364      20.61620
## US           3.504014      24.70489
# Predicting on the test set and calculating MSE
preds.rf.cs = predict(rf.cs, cs.test)
round(mean((cs.test$Sales - preds.rf.cs)^2), 2)
## [1] 2.8

The random forest results in a test MSE of 2.80, slightly worse than the bagged model. However, after experimenting with different numbers of variables considered at each split, I found that \(m = 7\) (mtry = 7) gave the lowest MSE (3.35) in this particular case. Again, we observe that the most important variables are Price and ShelveLoc.
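
To make the effect of \(m\) explicit, one could refit the forest for a range of mtry values and record the test MSE each time; a sketch of that experiment (exact numbers will vary with the seed):

# Test MSE of the random forest as a function of mtry
mse.by.mtry <- sapply(1:10, function(m) {
  set.seed(13)
  fit <- randomForest(Sales ~ ., data = cs.train, mtry = m)
  mean((cs.test$Sales - predict(fit, cs.test))^2)
})
plot(1:10, mse.by.mtry, type = "b", xlab = "mtry", ylab = "Test MSE")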

(f) Now analyze the data using BART, and report your results.

# Splitting into train and test set for y and x-variables
set.seed(13)
x = Carseats[,2:10]
y = Carseats[,"Sales"]
xtrain = x[index, ]
ytrain = y[index]
xtest = x[-index, ]
ytest = y[-index]

# Fitting BART model
bart.fit <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 280, 12, 120
## y1,yn: -3.949607, 1.150393
## x1,x[n*p]: 108.000000, 1.000000
## xp1,xp[np*p]: 117.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.278247,3,0.185217,7.41961
## *****sigma: 0.975113
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,12,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 2s
## trcnt,tecnt: 1000,1000
# Predicting on the test set and computing MSE
yhat.bart <- bart.fit$yhat.test.mean
round(mean((ytest - yhat.bart)^2), 2)
## [1] 1.72

The Bayesian additive regression trees (BART) model achieves the best performance of all the approaches considered, with a test MSE of 1.72.
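
As in the ISLR2 lab, a rough sense of variable importance for BART can be obtained from how often each predictor appears across the sampled trees, via the fitted object's varcount component:

# Average number of splits on each variable per posterior draw, in decreasing order
ord <- order(bart.fit$varcount.mean, decreasing = TRUE)
round(bart.fit$varcount.mean[ord], 2)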

Exercise 9

This problem involves the OJ data set which is part of the ISLR2 package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

# Splitting into train and test according to the instructions
set.seed(13) ; index = sample(1:nrow(OJ), 800)
oj.train = OJ[index, ]
oj.test = OJ[-index, ]

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

# Fitting tree
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"        "PriceDiff"      "WeekofPurchase" "ListPriceDiff" 
## [5] "DiscMM"        
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7313 = 578.5 / 791 
## Misclassification error rate: 0.1562 = 125 / 800

The training error rate is 15.62% and the tree contains 9 terminal nodes.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

# Printing text output
tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1061.00 CH ( 0.62250 0.37750 )  
##    2) LoyalCH < 0.5036 344  416.50 MM ( 0.29360 0.70640 )  
##      4) LoyalCH < 0.051325 57   10.07 MM ( 0.01754 0.98246 ) *
##      5) LoyalCH > 0.051325 287  371.10 MM ( 0.34843 0.65157 )  
##       10) PriceDiff < 0.065 117  110.10 MM ( 0.17949 0.82051 ) *
##       11) PriceDiff > 0.065 170  234.80 MM ( 0.46471 0.53529 )  
##         22) WeekofPurchase < 249 73   89.35 MM ( 0.30137 0.69863 ) *
##         23) WeekofPurchase > 249 97  131.50 CH ( 0.58763 0.41237 )  
##           46) LoyalCH < 0.430291 70   96.98 MM ( 0.48571 0.51429 ) *
##           47) LoyalCH > 0.430291 27   22.65 CH ( 0.85185 0.14815 ) *
##    3) LoyalCH > 0.5036 456  351.30 CH ( 0.87061 0.12939 )  
##      6) LoyalCH < 0.764572 201  225.50 CH ( 0.75124 0.24876 )  
##       12) ListPriceDiff < 0.235 78  108.10 CH ( 0.51282 0.48718 )  
##         24) DiscMM < 0.15 40   47.05 CH ( 0.72500 0.27500 ) *
##         25) DiscMM > 0.15 38   45.73 MM ( 0.28947 0.71053 ) *
##       13) ListPriceDiff > 0.235 123   78.64 CH ( 0.90244 0.09756 ) *
##      7) LoyalCH > 0.764572 255   77.87 CH ( 0.96471 0.03529 ) *

Node 47 is reached following a series of splits on the variables LoyalCH, PriceDiff, and WeekofPurchase. The terminal node itself is defined by a split on LoyalCH, the customer's brand loyalty to Citrus Hill (CH): when LoyalCH is greater than 0.430291 (given the earlier splits), the predicted class is CH. Of the 27 training observations falling in this node, 85.19% purchased CH and 14.81% purchased Minute Maid (MM), and the node has a deviance of approximately 22.65.

(d) Create a plot of the tree, and interpret the results.

# Plotting tree
plot(tree.oj)
text(tree.oj, pretty = 0, cex = 0.6)

The plot serves as a visual representation of the text output examined above. At the top of the tree, LoyalCH appears as the root node, dividing the tree into two main branches. The customer's loyalty to Citrus Hill appears to be the dominant factor in the prediction: 4 of the 5 terminal nodes on the left branch (LoyalCH < 0.5036) classify Purchase as Minute Maid, while 3 of the 4 terminal nodes on the right branch (LoyalCH > 0.5036) classify Purchase as Citrus Hill. Other variables contributing to the decision-making process are PriceDiff, WeekofPurchase, ListPriceDiff, and DiscMM.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

# Predicting on the test set and producing the confusion matrix
preds.tree.oj = predict(tree.oj, newdata = oj.test, type = "class")
table(preds.tree.oj, oj.test$Purchase)
##              
## preds.tree.oj  CH  MM
##            CH 126  20
##            MM  29  95
# Calculating test error rate
round((20 + 29) / 270, 4)
## [1] 0.1815

The test error rate is 18.15%, somewhat higher than the training error rate (15.62%). This is the expected pattern: the tree was grown to fit the training data, so it generally performs a little worse on observations it has not seen.

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

# Applying cross-validation to determine optimal tree size
set.seed(13) ; cv.tree.oj <- cv.tree(tree.oj, FUN = prune.misclass)

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

# Plotting tree size against the cross-validated misclassification count (stored in $dev)
plot(cv.tree.oj$size, cv.tree.oj$dev, type = "b", xlab = "Tree Size", ylab = "CV Misclassifications")
text(cv.tree.oj$size, cv.tree.oj$dev, labels = cv.tree.oj$dev, pos = 3)

(h) Which tree size corresponds to the lowest cross-validated classification error rate?
It is evident from the plot that a tree with 8 terminal nodes achieves the lowest cross-validated classification error, although only marginally lower than the error with 9 terminal nodes. Consequently, cross-validation suggests minimal pruning, reducing the tree from 9 to 8 terminal nodes.
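
The optimal size can also be read off programmatically rather than from the plot; a small sketch:

# Tree size(s) achieving the minimum cross-validated misclassification count
cv.tree.oj$size[cv.tree.oj$dev == min(cv.tree.oj$dev)]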

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

# Pruning tree and printing summary to obtain training error rate
pruned.tree.oj <- prune.misclass(tree.oj, best = 8)
summary(pruned.tree.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 23L)
## Variables actually used in tree construction:
## [1] "LoyalCH"        "PriceDiff"      "WeekofPurchase" "ListPriceDiff" 
## [5] "DiscMM"        
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7454 = 590.3 / 792 
## Misclassification error rate: 0.1588 = 127 / 800
# Predicting on the test set and producing confusion matrix
preds.pruned.oj <- predict(pruned.tree.oj, oj.test, type = "class")
table(preds.pruned.oj, oj.test$Purchase)
##                
## preds.pruned.oj  CH  MM
##              CH 134  35
##              MM  21  80
# Calculating test error rate
round((35 + 21) / 270, 4)
## [1] 0.2074

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
The pruned tree with 8 terminal nodes, as determined by cross-validation, has a training error rate of 15.88%. As expected, this is slightly higher than the training error rate of the unpruned tree (15.62%), since reducing model complexity typically increases the training error.

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
When comparing the test errors, we find, somewhat surprisingly, that the pruned tree performed worse than the unpruned tree on the test set. As seen earlier, the unpruned tree had a test error rate of 18.15%, while the pruned tree, with one fewer terminal node, yields a test error rate of 20.74%.