Exercises from the ISLR2 book, starting on PDF page 372.
# Load dependencies
pacman::p_load(ISLR2, caret, tree, randomForest, BART)
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, \(\hat{p}_{m1}\) = 1 − \(\hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
# Create sequence of p values
p = seq(0, 1, length.out = 200)
# Compute impurity measures
gini = 2 * p * (1 - p)
error = 1 - pmax(p, 1 - p)
entropy = -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] = 0 # Handle log(0) at endpoints
# Plot Gini
plot(p, gini,
type = "l",
col = "blue",
lwd = 2,
ylim = c(0, 1),
xlab = expression(hat(p)[m1]),
ylab = "Value",
main = "Gini Index, Classification Error, and Entropy")
# Add Classification Error and Entropy
lines(p, error, col = "red", lwd = 2, lty = 2)
lines(p, entropy, col = "darkgreen", lwd = 2, lty = 3)
# Add legend
legend("top", legend = c("Gini Index",
"Classification Error",
"Entropy"),
col = c("blue", "red", "darkgreen"),
lwd = 2,
lty = c(1, 2, 3),
cex = 0.65) # Changes the size of the legend box
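A quick numeric check can complement the plot. The sketch below wraps the same three formulas in small helper functions (the names `gini_index`, `class_error`, and `entropy_bits` are illustrative and not from the book) and evaluates them at a few values of \(\hat{p}_{m1}\). All three measures are zero at a pure node and peak at \(\hat{p}_{m1} = 0.5\), where the Gini index and classification error equal 0.5 and the entropy (in bits) equals 1.

```r
# Two-class impurity measures as functions of p = p_hat_m1
# (helper names are illustrative, not from the ISLR2 text)
gini_index = function(p) 2 * p * (1 - p)
class_error = function(p) 1 - pmax(p, 1 - p)
entropy_bits = function(p) {
  h = -p * log2(p) - (1 - p) * log2(1 - p)
  h[is.nan(h)] = 0  # 0 * log2(0) is NaN in R; the limiting value is 0
  h
}

# Evaluate at the endpoints, the midpoint, and one interior point
p_check = c(0, 0.25, 0.5, 1)
impurity = data.frame(p = p_check,
                      gini = gini_index(p_check),
                      error = class_error(p_check),
                      entropy = entropy_bits(p_check))
print(impurity)
```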
In the lab, a classification tree was applied to the
Carseats data set after converting Sales into
a qualitative response variable. Now we will seek to predict
Sales using regression trees and related approaches,
treating the response as a quantitative variable.
Split the data set into a training set and a test set.
# Read in the data
d1 = Carseats
head(d1)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
set.seed(42)
# Split data into train and test
train_index = createDataPartition(d1$Sales, p= 0.7, list=FALSE)
train = d1[train_index, ]
test = d1[-train_index, ]
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
The unpruned tree has 17 terminal nodes and uses six of the ten predictors:
ShelveLoc, Price, CompPrice, Age,
Advertising, and Education. Its test MSE is 4.42.
# Fit a Regression Tree
fit_tree = tree(Sales ~ ., data= train)
summary(fit_tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "CompPrice" "Age" "Advertising"
## [6] "Education"
## Number of terminal nodes: 17
## Residual mean deviance: 2.46 = 649.5 / 264
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.91700 -1.00100 0.01913 0.00000 0.96830 4.18100
# Plot the Tree
plot(fit_tree)
text(fit_tree, pretty= 0)
# Calculate the MSE
preds = predict(fit_tree, test)
mean((preds - test$Sales)^2)
## [1] 4.421924
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
Cross-validation selects a tree with 9 terminal nodes. Pruning the tree to this size lowers the test MSE from 4.42 to 4.24, a reduction of about 0.19.
set.seed(42)
cv_tree = cv.tree(fit_tree)
plot(cv_tree$size, cv_tree$dev, type= 'b',
xlab = "Tree Size (Number of Terminal Nodes)",
ylab = "Deviance",
main = "Cross-Validation for Tree Pruning")
# Find index of minimum deviance
best_index = which.min(cv_tree$dev)
# Get corresponding tree size
cv_tree$size[best_index]
## [1] 9
# Prune to best number of nodes selected by cross-validation
prune = prune.tree(fit_tree, best = 9)
plot(prune)
text(prune, pretty = 0)
# Calculate the MSE
preds = predict(prune, test)
mean((preds - test$Sales)^2)
## [1] 4.23625
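The size of the improvement can be computed directly from the two test MSE values printed above. This is plain arithmetic on the reported numbers; the variable names are illustrative.

```r
# Test MSE values copied from the output above
mse_unpruned = 4.421924  # full tree with 17 terminal nodes
mse_pruned = 4.23625     # tree pruned to 9 terminal nodes

# Absolute and relative reduction in test MSE
abs_drop = mse_unpruned - mse_pruned
rel_drop = abs_drop / mse_unpruned
round(c(absolute = abs_drop, relative = rel_drop), 3)
```

The drop is about 0.19, or roughly 4% of the unpruned test MSE, so pruning helps only modestly on this particular split.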
Use the bagging approach in order to analyze this data. What test
MSE do you obtain? Use the importance() function to
determine which variables are most important.
Bagging gives a test MSE of 2.06, less than half that of the single tree. ShelveLoc and
Price are the two most important predictors on both importance measures, followed by CompPrice.
set.seed(42)
# Bagging method
bag_model = randomForest(Sales ~ ., data= train, mtry = 10, importance = TRUE)
# Calculate the MSE
yhat = predict(bag_model, test)
mean((yhat - test$Sales)^2)
## [1] 2.063321
importance(bag_model)
## %IncMSE IncNodePurity
## CompPrice 32.066701 251.211444
## Income 7.807017 117.174920
## Advertising 22.253304 162.647032
## Population -1.420411 67.974963
## Price 66.328927 645.348379
## ShelveLoc 68.144349 770.526801
## Age 20.056238 175.868521
## Education 4.409980 53.236309
## Urban -0.421840 11.171375
## US 2.450300 9.298957
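To read the ranking more easily, the %IncMSE column can be sorted. The sketch below re-enters the values from the output above by hand so that it runs on its own; with the fitted model in the session, `importance(bag_model)[, "%IncMSE"]` should supply the same vector. The names `inc_mse` and `ranked` are illustrative.

```r
# %IncMSE values copied from the importance(bag_model) output above
inc_mse = c(CompPrice = 32.066701, Income = 7.807017,
            Advertising = 22.253304, Population = -1.420411,
            Price = 66.328927, ShelveLoc = 68.144349,
            Age = 20.056238, Education = 4.409980,
            Urban = -0.421840, US = 2.450300)

# Sort from most to least important
ranked = sort(inc_mse, decreasing = TRUE)
round(ranked, 2)
```

ShelveLoc and Price are far ahead of the rest, and the negative values for Population and Urban suggest that permuting those variables did not hurt out-of-bag accuracy.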
Use random forests to analyze this data. What test MSE do you
obtain? Use the importance() function to determine which
variables are most important. Describe the effect of \(m\), the number of variables considered at
each split, on the error rate obtained.
With \(m = 3\), the random forest gives a test MSE of 2.40, higher than the 2.06 obtained
with bagging (\(m = 10\)). ShelveLoc and Price are again the most important predictors.
On this split, the larger value of \(m\) gave the lower test MSE. When \(m\) equals the
total number of predictors, every split considers all variables and the random forest is
identical to bagging.
set.seed(42)
fit_rforest = randomForest(Sales ~ ., data= train, mtry = 3, importance = TRUE) # mtry set from (# of predictors)/3
# Calculate the MSE
yhat = predict(fit_rforest, test)
mean((yhat - test$Sales)^2)
## [1] 2.403375
importance(fit_rforest)
## %IncMSE IncNodePurity
## CompPrice 18.7356408 215.35309
## Income 6.6339752 173.26772
## Advertising 13.7031915 189.72684
## Population -1.5548213 135.25324
## Price 42.5710171 529.19502
## ShelveLoc 49.5414748 585.17801
## Age 14.6065870 238.28702
## Education 0.7162662 90.63694
## Urban -0.4856681 22.95345
## US 4.6618975 28.30025
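The comparison above rests on only two values of \(m\). One way to examine the effect more fully is to refit the forest for every value from 1 to 10 and record the test MSE each time. The following is a minimal sketch, assuming the same seed and 70/30 split as above; the names `cs_train`, `cs_test`, `m_values`, and `test_mse` are illustrative, and no output is shown because this sweep was not part of the original analysis.

```r
library(ISLR2)         # Carseats data
library(caret)         # createDataPartition()
library(randomForest)  # randomForest()

# Recreate the 70/30 split used above
set.seed(42)
idx = createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
cs_train = Carseats[idx, ]
cs_test = Carseats[-idx, ]

# Fit one forest per value of mtry and record its test MSE
m_values = 1:10
test_mse = sapply(m_values, function(m) {
  set.seed(42)  # same seed for every fit so only mtry differs
  fit = randomForest(Sales ~ ., data = cs_train, mtry = m)
  mean((predict(fit, cs_test) - cs_test$Sales)^2)
})

# Test MSE as a function of m; m = 10 corresponds to bagging
plot(m_values, test_mse, type = "b",
     xlab = "Number of variables tried at each split (m)",
     ylab = "Test MSE")
```

Because each fit is random, small differences between neighbouring values of \(m\) should not be over-interpreted; the overall shape of the curve is the more reliable signal.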
Now analyze the data using BART, and report your results.
The BART model produced a test MSE of 1.47, the lowest among all models evaluated.
# Create x and y train and test
x = d1[, 2:11]
y = d1$Sales
xtrain = x[train_index, ]
ytrain = y[train_index]
xtest = x[-train_index, ]
ytest = y[-train_index]
set.seed(42)
# BART fit
fit_bart = gbart(xtrain, ytrain, x.test= xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 281, 14, 119
## y1,yn: 3.685623, 2.175623
## x1,x[n*p]: 111.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 68 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.196244,7.53438
## *****sigma: 1.003722
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
# Calculate MSE
yhat = fit_bart$yhat.test.mean
mean((ytest - yhat)^2)
## [1] 1.472166
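To put the BART result in context, the test MSE values reported in this exercise can be collected and sorted. The numbers are copied from the outputs above; the vector name `test_mse_all` is illustrative.

```r
# Test MSE for each Carseats model, copied from the outputs above
test_mse_all = c(tree = 4.421924,
                 pruned_tree = 4.23625,
                 bagging = 2.063321,
                 random_forest = 2.403375,
                 BART = 1.472166)

# Sort from best (lowest) to worst
round(sort(test_mse_all), 2)
```

On this split the ordering from best to worst is BART, bagging, random forest, pruned tree, and unpruned tree.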
This problem involves the OJ data set which is part
of the ISLR2 package.
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(42)
# Read in the data
d2 = OJ
head(d2)
## Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH
## 1 CH 237 1 1.75 1.99 0.00 0.0 0
## 2 CH 239 1 1.75 1.99 0.00 0.3 0
## 3 CH 245 1 1.86 2.09 0.17 0.0 0
## 4 MM 227 1 1.69 1.69 0.00 0.0 0
## 5 CH 228 7 1.69 1.69 0.00 0.0 0
## 6 CH 230 7 1.69 1.99 0.00 0.0 0
## SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM
## 1 0 0.500000 1.99 1.75 0.24 No 0.000000
## 2 1 0.600000 1.69 1.75 -0.06 No 0.150754
## 3 0 0.680000 2.09 1.69 0.40 No 0.000000
## 4 0 0.400000 1.69 1.69 0.00 No 0.000000
## 5 0 0.956535 1.69 1.69 0.00 Yes 0.000000
## 6 1 0.965228 1.99 1.69 0.30 Yes 0.000000
## PctDiscCH ListPriceDiff STORE
## 1 0.000000 0.24 1
## 2 0.000000 0.24 1
## 3 0.091398 0.23 1
## 4 0.000000 0.00 1
## 5 0.000000 0.00 0
## 6 0.000000 0.30 0
# Split data into train and test
train_index = sample(1:nrow(d2), 800)
train = d2[train_index, ]
test = d2[-train_index, ]
Fit a tree to the training data, with Purchase as
the response and the other variables as predictors. Use the
summary() function to produce summary statistics about the
tree, and describe the results obtained. What is the training error
rate? How many terminal nodes does the tree have?
The tree has 8 terminal nodes and a training error rate of 16.38%
(131 of 800 observations misclassified). Only three variables were used in its
construction: LoyalCH, SalePriceMM, and PriceDiff.
# Fit a Classification Tree
fit_tree = tree(Purchase ~ ., data= train)
summary(fit_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "SalePriceMM" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7392 = 585.5 / 792
## Misclassification error rate: 0.1638 = 131 / 800
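The two ratios in the summary can be checked by hand. As described in the ISLR2 lab, the residual mean deviance is the total deviance divided by \(n - |T_0|\), where \(|T_0|\) is the number of terminal nodes; the training error rate is the number of misclassified observations divided by \(n\). The variable names below are illustrative.

```r
# Quantities copied from the summary() output above
n_train = 800           # training observations
n_leaves = 8            # terminal nodes
deviance_total = 585.5  # total deviance (rounded in the printout)
misclassified = 131     # misclassified training observations

c(residual_mean_deviance = deviance_total / (n_train - n_leaves),
  training_error_rate = misclassified / n_train)
```

The denominator 792 in the summary is therefore 800 minus the 8 terminal nodes.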
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
SalePriceMM < 2.04 128 123.50 MM ( 0.18750 0.81250 )
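This is node 10, a terminal node (marked with * in the output). It is reached through the splits LoyalCH < 0.48285, LoyalCH > 0.064156, and SalePriceMM < 2.04, so it holds customers with low to moderate Citrus Hill loyalty who were offered Minute Maid at a sale price below $2.04. The node contains 128 training observations and has a deviance of 123.50. Its predicted class is MM: 81.25% of the observations in the node purchased Minute Maid and 18.75% purchased Citrus Hill. The full tree is printed below.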
# Print the Tree model
fit_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1066.00 CH ( 0.61500 0.38500 )
## 2) LoyalCH < 0.48285 285 296.00 MM ( 0.21404 0.78596 )
## 4) LoyalCH < 0.064156 64 0.00 MM ( 0.00000 1.00000 ) *
## 5) LoyalCH > 0.064156 221 260.40 MM ( 0.27602 0.72398 )
## 10) SalePriceMM < 2.04 128 123.50 MM ( 0.18750 0.81250 ) *
## 11) SalePriceMM > 2.04 93 125.00 MM ( 0.39785 0.60215 ) *
## 3) LoyalCH > 0.48285 515 458.10 CH ( 0.83689 0.16311 )
## 6) LoyalCH < 0.753545 230 282.70 CH ( 0.69565 0.30435 )
## 12) PriceDiff < 0.265 149 203.00 CH ( 0.57718 0.42282 )
## 24) PriceDiff < -0.165 32 38.02 MM ( 0.28125 0.71875 ) *
## 25) PriceDiff > -0.165 117 150.30 CH ( 0.65812 0.34188 )
## 50) LoyalCH < 0.703993 105 139.60 CH ( 0.61905 0.38095 ) *
## 51) LoyalCH > 0.703993 12 0.00 CH ( 1.00000 0.00000 ) *
## 13) PriceDiff > 0.265 81 47.66 CH ( 0.91358 0.08642 ) *
## 7) LoyalCH > 0.753545 285 111.70 CH ( 0.95088 0.04912 ) *
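The deviance column can be reproduced from a node's size and class proportions. For a classification tree, ISLR2 gives the deviance as \(-2 \sum_m \sum_k n_{mk} \log \hat{p}_{mk}\), so a single node contributes \(-2 \sum_k n_k \log \hat{p}_k\). The sketch below applies this to node 10, using the values from the printout; the variable names are illustrative.

```r
# Node 10: SalePriceMM < 2.04, values copied from the printout above
n_node = 128
p_hat = c(CH = 0.18750, MM = 0.81250)

# Class counts implied by the proportions: 24 CH and 104 MM
n_k = n_node * p_hat

# Node deviance: -2 * sum over classes of n_k * log(p_hat_k)
node_deviance = -2 * sum(n_k * log(p_hat))
round(node_deviance, 2)
```

The result agrees with the printed 123.50 up to the rounding used in the printout.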
Create a plot of the tree, and interpret the results.
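The plotted tree has 8 terminal nodes and 7 internal splits, and uses only three variables:
LoyalCH, SalePriceMM, and PriceDiff. LoyalCH is the first split and appears again at lower
levels, so brand loyalty dominates the prediction: customers with LoyalCH > 0.753545 are
predicted CH (about 95% of them bought CH), while those with LoyalCH < 0.064156 all bought MM.
The price variables matter only for customers with intermediate loyalty.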
# Plot the tree
plot(fit_tree)
text(fit_tree, pretty= 0)
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
The test error rate is 18.89%.
# Predictions
oj_preds = predict(fit_tree, test, type= 'class')
# Confusion Matrix
confusionMatrix(oj_preds, test$Purchase, positive= 'MM')
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 125 15
## MM 36 94
##
## Accuracy : 0.8111
## 95% CI : (0.7592, 0.856)
## No Information Rate : 0.5963
## P-Value [Acc > NIR] : 3.4e-14
##
## Kappa : 0.6195
##
## Mcnemar's Test P-Value : 0.005101
##
## Sensitivity : 0.8624
## Specificity : 0.7764
## Pos Pred Value : 0.7231
## Neg Pred Value : 0.8929
## Prevalence : 0.4037
## Detection Rate : 0.3481
## Detection Prevalence : 0.4815
## Balanced Accuracy : 0.8194
##
## 'Positive' Class : MM
##
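The test error rate follows directly from the confusion matrix: it is the share of off-diagonal counts. The sketch below re-enters the four counts from the output above so that it runs without the fitted model; the object name `cm` is illustrative.

```r
# Counts copied from the confusion matrix above
# (rows are predictions, columns are true labels; matrix() fills by column)
cm = matrix(c(125, 36, 15, 94), nrow = 2,
            dimnames = list(Prediction = c("CH", "MM"),
                            Reference = c("CH", "MM")))

accuracy = sum(diag(cm)) / sum(cm)
test_error = 1 - accuracy

# Sensitivity for the positive class MM: correct MM predictions / true MM
sensitivity_mm = cm["MM", "MM"] / sum(cm[, "MM"])

round(c(accuracy = accuracy, test_error = test_error,
        sensitivity_MM = sensitivity_mm), 4)
```

Of the 270 test observations, 51 are misclassified (15 + 36), which gives the 18.89% error rate.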
Apply the cv.tree() function to the training set in
order to determine the optimal tree size.
set.seed(42)
# Conduct cross validation
cv_tree = cv.tree(fit_tree, FUN= prune.misclass)
Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.
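# With FUN = prune.misclass, cv_tree$dev is the number of CV misclassifications;
# divide by the training set size to plot an error rate
plot(cv_tree$size, cv_tree$dev / nrow(train), type= 'b',
xlab = "Tree Size (Number of Terminal Nodes)",
ylab = "Cross-Validated Classification Error Rate",
main = "CV Error vs. Tree Size")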
Which tree size corresponds to the lowest cross-validated classification error rate?
A tree with 5 terminal nodes has the lowest cross-validated classification error rate.
# Find index of the minimum number of CV misclassifications
best_index = which.min(cv_tree$dev)
# Get corresponding tree size
cv_tree$size[best_index]
## [1] 5
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
# Prune to best number of nodes selected by cross-validation
prune = prune.tree(fit_tree, best = 5)
plot(prune)
text(prune, pretty = 0)
Compare the training error rates between the pruned and unpruned trees. Which is higher?
The training error rate increased from 16.38% to 18.12% after pruning, indicating that the pruned tree has a higher training error.
summary(prune)
##
## Classification tree:
## snip.tree(tree = fit_tree, nodes = c(5L, 12L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 5
## Residual mean deviance: 0.7833 = 622.7 / 795
## Misclassification error rate: 0.1812 = 145 / 800
Compare the test error rates between the pruned and unpruned trees. Which is higher?
The test error rate increased from 18.89% to 21.85% after pruning, indicating that the pruned tree has a higher test error.
# Predictions
prune_preds = predict(prune, test, type= 'class')
# Confusion Matrix
confusionMatrix(prune_preds, test$Purchase, positive= 'MM')
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 128 26
## MM 33 83
##
## Accuracy : 0.7815
## 95% CI : (0.7274, 0.8293)
## No Information Rate : 0.5963
## P-Value [Acc > NIR] : 8.681e-11
##
## Kappa : 0.5508
##
## Mcnemar's Test P-Value : 0.4347
##
## Sensitivity : 0.7615
## Specificity : 0.7950
## Pos Pred Value : 0.7155
## Neg Pred Value : 0.8312
## Prevalence : 0.4037
## Detection Rate : 0.3074
## Detection Prevalence : 0.4296
## Balanced Accuracy : 0.7782
##
## 'Positive' Class : MM
##
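The training and test error rates of the two OJ trees can be placed side by side. The counts below are copied from the summaries and confusion matrices above; the data frame name `errors` and its column names are illustrative.

```r
# Misclassification counts copied from the outputs above:
# training errors out of 800, test errors out of 270
errors = data.frame(
  tree = c("unpruned (8 leaves)", "pruned (5 leaves)"),
  train_error = c(131, 145) / 800,
  test_error = c(15 + 36, 26 + 33) / 270
)

# Gap between test and training error for each tree
errors$gap = errors$test_error - errors$train_error
print(errors, digits = 3)
```

On this split the pruned tree is worse on both training and test data, so the simpler tree gains interpretability at some cost in accuracy.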