Exercises from the ISLR2 book, starting on PDF page 372.

# Load dependencies
pacman::p_load(ISLR2, caret, tree, randomForest, BART)

Question 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, \(\hat{p}_{m1}\) = 1 − \(\hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
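
In the two-class setting, writing \(p = \hat{p}_{m1}\), the three measures reduce to

\[ G = 2p(1-p), \qquad E = 1 - \max(p,\, 1-p), \qquad D = -p\log_2 p - (1-p)\log_2(1-p), \]

which is exactly what the code below computes. Base-2 logarithms are used for the entropy so that its maximum is 1, putting all three curves on a comparable scale.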

# Create sequence of p values
p = seq(0, 1, length.out = 200)

# Compute impurity measures
gini = 2 * p * (1 - p)
error = 1 - pmax(p, 1 - p)
entropy = -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] = 0  # Handle log(0) at endpoints

# Plot Gini
plot(p, gini, 
     type = "l", 
     col = "blue", 
     lwd = 2, 
     ylim = c(0, 1),
     xlab = expression(hat(p)[m1]),
     ylab = "Value",
     main = "Gini Index, Classification Error, and Entropy")

# Add Classification Error and Entropy
lines(p, error, col = "red", lwd = 2, lty = 2)
lines(p, entropy, col = "darkgreen", lwd = 2, lty = 3)

# Add legend
legend("top", legend = c("Gini Index", 
                         "Classification Error", 
                         "Entropy"),
       col = c("blue", "red", "darkgreen"), 
       lwd = 2, 
       lty = c(1, 2, 3), 
       cex = 0.65) # Shrink the legend text (and hence the box)


Question 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

Part a)

Split the data set into a training set and a test set.

# Read in the data
d1 = Carseats
head(d1)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
set.seed(42)

# Split data into train and test
train_index = createDataPartition(d1$Sales, p= 0.7, list=FALSE)

train = d1[train_index, ]
test = d1[-train_index, ]
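
A quick sanity check (not part of the original output): createDataPartition() samples within percentile groups of a numeric response, so the distribution of Sales is similar in the two halves. The resulting sizes, 281 training and 119 test rows, match the n and np later reported by gbart() in Part f).

# Verify the split sizes (281 train / 119 test)
nrow(train)
nrow(test)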


Part b)

Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?


The test MSE obtained is 4.42. The fitted tree has 17 terminal nodes and splits only on ShelveLoc, Price, CompPrice, Age, Advertising, and Education.

# Fit a Regression Tree
fit_tree = tree(Sales ~ ., data= train)

summary(fit_tree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "CompPrice"   "Age"         "Advertising"
## [6] "Education"  
## Number of terminal nodes:  17 
## Residual mean deviance:  2.46 = 649.5 / 264 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.91700 -1.00100  0.01913  0.00000  0.96830  4.18100
# Plot the Tree
plot(fit_tree)
text(fit_tree, pretty= 0)

# Calculate the MSE
preds = predict(fit_tree, test)
mean((preds - test$Sales)^2)
## [1] 4.421924
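
For comparison (not in the original output), the training MSE can be computed in the same way; it should be well below the test MSE, since the tree was grown to fit the training data.

# Training MSE, for comparison with the test MSE above
train_preds = predict(fit_tree, train)
mean((train_preds - train$Sales)^2)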


Part c)

Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?


Using cross-validation, the optimal number of terminal nodes was determined to be 9. Pruning the tree to this size improved the test MSE from 4.42 to 4.24, a reduction of roughly 0.19.

set.seed(42)

cv_tree = cv.tree(fit_tree)
plot(cv_tree$size, cv_tree$dev, type= 'b',
     xlab = "Tree Size (Number of Terminal Nodes)",
     ylab = "Deviance",
     main = "Cross-Validation for Tree Pruning")

# Find index of minimum deviance
best_index = which.min(cv_tree$dev)

# Get corresponding tree size
cv_tree$size[best_index]
## [1] 9
# Prune to best number of nodes selected by cross-validation
prune = prune.tree(fit_tree, best = 9)
plot(prune)
text(prune, pretty = 0)

# Calculate the MSE
preds = predict(prune, test)
mean((preds - test$Sales)^2)
## [1] 4.23625
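
To see how sensitive the result is to the chosen size, a short sketch (not part of the original output) computes the test MSE for each candidate subtree size in the cost-complexity sequence; exact values depend on the seed.

# Sketch: test MSE for each candidate subtree size (skip the 1-node stump)
sizes = rev(cv_tree$size[cv_tree$size > 1])
mse_by_size = sapply(sizes, function(s) {
  pruned = prune.tree(fit_tree, best = s)
  mean((predict(pruned, test) - test$Sales)^2)
})
plot(sizes, mse_by_size, type = 'b',
     xlab = "Terminal Nodes", ylab = "Test MSE")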


Part d)

Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.


Bagging lowers the test MSE to 2.06, with Price and ShelveLoc as the most important predictors.

set.seed(42)

# Bagging method
bag_model = randomForest(Sales ~ ., data= train, mtry = 10, importance = TRUE) # mtry = 10 uses all 10 predictors at each split, i.e., bagging

# Calculate the MSE
yhat = predict(bag_model, test)
mean((yhat - test$Sales)^2)
## [1] 2.063321
importance(bag_model)
##               %IncMSE IncNodePurity
## CompPrice   32.066701    251.211444
## Income       7.807017    117.174920
## Advertising 22.253304    162.647032
## Population  -1.420411     67.974963
## Price       66.328927    645.348379
## ShelveLoc   68.144349    770.526801
## Age         20.056238    175.868521
## Education    4.409980     53.236309
## Urban       -0.421840     11.171375
## US           2.450300      9.298957
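
The importance scores are easier to compare visually; the randomForest package provides varImpPlot() for this.

# Dot plots of %IncMSE and IncNodePurity for each predictor
varImpPlot(bag_model, main = "Variable Importance (Bagging)")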


Part e)

Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of \(m\), the number of variables considered at each split, on the error rate obtained.


The test MSE increased from 2.06 (bagging) to 2.40, with Price and ShelveLoc again identified as the most important predictors. As \(m\), the number of variables considered at each split, increases, the test MSE here tends to decrease, and the random forest approaches the behavior of bagging as \(m\) nears the total number of predictors (see the sketch after the importance output below).

set.seed(42)

fit_rforest = randomForest(Sales ~ ., data= train, mtry = 3, importance = TRUE) # mtry = floor(10/3) = 3, the randomForest default for regression

# Calculate the MSE
yhat = predict(fit_rforest, test)
mean((yhat - test$Sales)^2)
## [1] 2.403375
importance(fit_rforest)
##                %IncMSE IncNodePurity
## CompPrice   18.7356408     215.35309
## Income       6.6339752     173.26772
## Advertising 13.7031915     189.72684
## Population  -1.5548213     135.25324
## Price       42.5710171     529.19502
## ShelveLoc   49.5414748     585.17801
## Age         14.6065870     238.28702
## Education    0.7162662      90.63694
## Urban       -0.4856681      22.95345
## US           4.6618975      28.30025
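
The effect of \(m\) can be checked directly with a short sketch (not part of the original output): refit the forest for each value of \(m\) and record the test MSE. Exact numbers will vary with the seed.

# Test MSE as a function of m, the number of variables tried at each split
set.seed(42)
mse_by_m = sapply(1:10, function(m) {
  fit = randomForest(Sales ~ ., data= train, mtry = m)
  mean((predict(fit, test) - test$Sales)^2)
})
plot(1:10, mse_by_m, type = 'b',
     xlab = "m (Variables Tried at Each Split)", ylab = "Test MSE")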


Part f)

Now analyze the data using BART, and report your results.


The BART model produced a test MSE of 1.47, the lowest among all models evaluated.

# Create x (predictors, columns 2-11) and y (Sales) for train and test
x = d1[, 2:11]
y = d1$Sales

xtrain = x[train_index, ]
ytrain = y[train_index]

xtest = x[-train_index, ]
ytest = y[-train_index]
set.seed(42)

# BART fit
fit_bart = gbart(xtrain, ytrain, x.test= xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 281, 14, 119
## y1,yn: 3.685623, 2.175623
## x1,x[n*p]: 111.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 68 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.196244,7.53438
## *****sigma: 1.003722
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
# Calculate MSE
yhat = fit_bart$yhat.test.mean
mean((ytest - yhat)^2)
## [1] 1.472166
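
As in the chapter's lab, a rough variable-importance measure for BART is how often each variable appears in the ensemble's trees, averaged over the MCMC iterations.

# Average number of appearances of each variable per sweep of trees
ord = order(fit_bart$varcount.mean, decreasing = TRUE)
fit_bart$varcount.mean[ord]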


Question 9

This problem involves the OJ data set which is part of the ISLR2 package.

Part a)

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(42)

# Read in the data
d2 = OJ
head(d2)
##   Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1       CH            237       1    1.75    1.99   0.00    0.0         0
## 2       CH            239       1    1.75    1.99   0.00    0.3         0
## 3       CH            245       1    1.86    2.09   0.17    0.0         0
## 4       MM            227       1    1.69    1.69   0.00    0.0         0
## 5       CH            228       7    1.69    1.69   0.00    0.0         0
## 6       CH            230       7    1.69    1.99   0.00    0.0         0
##   SpecialMM  LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1         0 0.500000        1.99        1.75      0.24     No  0.000000
## 2         1 0.600000        1.69        1.75     -0.06     No  0.150754
## 3         0 0.680000        2.09        1.69      0.40     No  0.000000
## 4         0 0.400000        1.69        1.69      0.00     No  0.000000
## 5         0 0.956535        1.69        1.69      0.00    Yes  0.000000
## 6         1 0.965228        1.99        1.69      0.30    Yes  0.000000
##   PctDiscCH ListPriceDiff STORE
## 1  0.000000          0.24     1
## 2  0.000000          0.24     1
## 3  0.091398          0.23     1
## 4  0.000000          0.00     1
## 5  0.000000          0.00     0
## 6  0.000000          0.30     0
# Split data into train and test
train_index = sample(1:nrow(d2), 800)

train = d2[train_index, ]
test = d2[-train_index, ]


Part b)

Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?


The fitted tree has 8 terminal nodes and a training misclassification error rate of 16.38% (131 of 800 observations). Only three predictors are actually used in tree construction: LoyalCH, SalePriceMM, and PriceDiff.

# Fit a Classification Tree
fit_tree = tree(Purchase ~ ., data= train)

summary(fit_tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"     "SalePriceMM" "PriceDiff"  
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7392 = 585.5 / 792 
## Misclassification error rate: 0.1638 = 131 / 800
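
The 16.38% figure can also be recovered directly (not in the original output) by comparing the tree's fitted classes against the training labels.

# Training error rate computed directly
train_class = predict(fit_tree, train, type= 'class')
mean(train_class != train$Purchase)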


Part c)

Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.


10) SalePriceMM < 2.04 128 123.50 MM ( 0.18750 0.81250 ) *

This terminal node (node 10, marked with a *) contains the observations for which LoyalCH lies between 0.064 and 0.483 (from the parent splits) and the Minute Maid sale price is below $2.04. Reading the fields from left to right:

  • 128 observations fall in this node.
  • 123.50 is the node deviance (a measure of impurity; lower means purer).
  • MM is the predicted class.
  • The class proportions are (0.18750, 0.81250): 18.75% of the observations are CH purchases and 81.25% are MM purchases.
# Print the Tree model
fit_tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1066.00 CH ( 0.61500 0.38500 )  
##    2) LoyalCH < 0.48285 285  296.00 MM ( 0.21404 0.78596 )  
##      4) LoyalCH < 0.064156 64    0.00 MM ( 0.00000 1.00000 ) *
##      5) LoyalCH > 0.064156 221  260.40 MM ( 0.27602 0.72398 )  
##       10) SalePriceMM < 2.04 128  123.50 MM ( 0.18750 0.81250 ) *
##       11) SalePriceMM > 2.04 93  125.00 MM ( 0.39785 0.60215 ) *
##    3) LoyalCH > 0.48285 515  458.10 CH ( 0.83689 0.16311 )  
##      6) LoyalCH < 0.753545 230  282.70 CH ( 0.69565 0.30435 )  
##       12) PriceDiff < 0.265 149  203.00 CH ( 0.57718 0.42282 )  
##         24) PriceDiff < -0.165 32   38.02 MM ( 0.28125 0.71875 ) *
##         25) PriceDiff > -0.165 117  150.30 CH ( 0.65812 0.34188 )  
##           50) LoyalCH < 0.703993 105  139.60 CH ( 0.61905 0.38095 ) *
##           51) LoyalCH > 0.703993 12    0.00 CH ( 1.00000 0.00000 ) *
##       13) PriceDiff > 0.265 81   47.66 CH ( 0.91358 0.08642 ) *
##      7) LoyalCH > 0.753545 285  111.70 CH ( 0.95088 0.04912 ) *


Part d)

Create a plot of the tree, and interpret the results.


The plot shows the tree's 8 terminal nodes. Only three variables appear in the splits: LoyalCH, SalePriceMM, and PriceDiff. LoyalCH forms the root split and reappears further down, indicating that brand loyalty is the dominant predictor, with the price variables refining the prediction within each loyalty group.

# Plot the tree
plot(fit_tree)
text(fit_tree, pretty= 0)


Part e)

Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?


The test error rate is 18.89%.

# Predictions
oj_preds = predict(fit_tree, test, type= 'class')

# Confusion Matrix
confusionMatrix(oj_preds, test$Purchase, positive= 'MM')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 125  15
##         MM  36  94
##                                          
##                Accuracy : 0.8111         
##                  95% CI : (0.7592, 0.856)
##     No Information Rate : 0.5963         
##     P-Value [Acc > NIR] : 3.4e-14        
##                                          
##                   Kappa : 0.6195         
##                                          
##  Mcnemar's Test P-Value : 0.005101       
##                                          
##             Sensitivity : 0.8624         
##             Specificity : 0.7764         
##          Pos Pred Value : 0.7231         
##          Neg Pred Value : 0.8929         
##              Prevalence : 0.4037         
##          Detection Rate : 0.3481         
##    Detection Prevalence : 0.4815         
##       Balanced Accuracy : 0.8194         
##                                          
##        'Positive' Class : MM             
## 
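
Equivalently (not in the original output), the test error rate is one minus the reported accuracy and can be computed directly.

# Test error rate: proportion of misclassified test observations
mean(oj_preds != test$Purchase)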


Part f)

Apply the cv.tree() function to the training set in order to determine the optimal tree size.

set.seed(42)

# Conduct cross-validation; with FUN = prune.misclass, pruning is guided by
# the misclassification rate, and cv_tree$dev stores the number of
# cross-validated misclassifications at each tree size
cv_tree = cv.tree(fit_tree, FUN= prune.misclass)


Part g)

Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.

# cv_tree$dev counts misclassifications, so divide by n to get an error rate
plot(cv_tree$size, cv_tree$dev / nrow(train), type= 'b',
     xlab = "Tree Size (Number of Terminal Nodes)",
     ylab = "Cross-Validated Classification Error Rate",
     main = "CV Error vs. Tree Size")


Part h)

Which tree size corresponds to the lowest cross-validated classification error rate?


A tree size of 5 yields the lowest cross-validated classification error rate.

# Find the index of the minimum CV misclassification count
# (which.min returns the first minimum in case of ties)
best_index = which.min(cv_tree$dev)

# Get corresponding tree size
cv_tree$size[best_index]
## [1] 5


Part i)

Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

# Prune to best number of nodes selected by cross-validation
prune = prune.tree(fit_tree, best = 5)
plot(prune)
text(prune, pretty = 0)


Part j)

Compare the training error rates between the pruned and unpruned trees. Which is higher?


Pruning increased the training error rate from 16.38% to 18.12%, so the pruned tree has the higher training error. This is expected: removing splits can only make the tree fit the training data less closely.

summary(prune)
## 
## Classification tree:
## snip.tree(tree = fit_tree, nodes = c(5L, 12L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.7833 = 622.7 / 795 
## Misclassification error rate: 0.1812 = 145 / 800


Part k)

Compare the test error rates between the pruned and unpruned trees. Which is higher?


The test error rate also increased, from 18.89% to 21.85%, so the pruned tree has the higher test error as well; for this particular seed, the complexity removed by pruning was evidently still useful on the test set.

# Predictions
prune_preds = predict(prune, test, type= 'class')

# Confusion Matrix
confusionMatrix(prune_preds, test$Purchase, positive= 'MM')
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 128  26
##         MM  33  83
##                                           
##                Accuracy : 0.7815          
##                  95% CI : (0.7274, 0.8293)
##     No Information Rate : 0.5963          
##     P-Value [Acc > NIR] : 8.681e-11       
##                                           
##                   Kappa : 0.5508          
##                                           
##  Mcnemar's Test P-Value : 0.4347          
##                                           
##             Sensitivity : 0.7615          
##             Specificity : 0.7950          
##          Pos Pred Value : 0.7155          
##          Neg Pred Value : 0.8312          
##              Prevalence : 0.4037          
##          Detection Rate : 0.3074          
##    Detection Prevalence : 0.4296          
##       Balanced Accuracy : 0.7782          
##                                           
##        'Positive' Class : MM              
##