Problem 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.

phat1 = seq(0,1,0.01)
phat2 = 1 - phat1

We have the following equation for Gini index:

\[ G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \]

We can plug in the hint to get the following:

gini = phat1 * phat2 + phat2 * phat1 # two equal terms, each phat_mk * (1 - phat_mk)
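Since the two terms of the sum are identical when there are only two classes, this is simply

\[ G = 2\,\hat{p}_{m1}\hat{p}_{m2} = 2\,\hat{p}_{m1}(1 - \hat{p}_{m1}). \]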

We can then move on to the classification error below:

\[ E = 1 - \max_{k} (\hat{p}_{mk}). \]

Which we can translate into R:

e <- 1 - pmax(phat1, phat2)
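With two classes this reduces to

\[ E = 1 - \max(\hat{p}_{m1},\, 1 - \hat{p}_{m1}) = \min(\hat{p}_{m1},\, 1 - \hat{p}_{m1}), \]

which is why the classification error appears as a piecewise-linear triangle peaking at 0.5 in the plot below.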

We finally move on to entropy below:

\[ D = - \sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \]

Which is the following in R:

d <- -(phat1 * log(phat1) + phat2 * log(phat2)) # 0 * log(0) gives NaN at the two endpoints

Now we can plot everything together:

library(tidyr)   # pivot_longer()
library(ggplot2)

impurity_df <- data.frame(
  phat1 = phat1,
  Gini = gini,
  Classification_Error = e,
  Entropy = d
)

impurity_long <- pivot_longer(impurity_df, 
                               cols = c(Gini, Classification_Error, Entropy),
                               names_to = "Measure", 
                               values_to = "Value")

ggplot(impurity_long, aes(x = phat1, y = Value, color = Measure)) +
  geom_line(linewidth = 1) +
  labs(
    x = expression(hat(p)[1]),
    y = "Value",
    color = "Measure"
  ) +
  theme_minimal() +
  coord_cartesian(xlim = c(0, 1), ylim = c(0, 1))
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
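As a quick sanity check on the curves (a small addition, not in the original output), we can confirm the peak values at $\hat{p}_{m1} = 0.5$:

max(gini)            # Gini peaks at 2 * 0.5 * 0.5 = 0.5
max(e)               # classification error also peaks at 0.5
max(d, na.rm = TRUE) # entropy peaks at log(2), about 0.693; na.rm drops the NaN endpoints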

Problem 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

  1. Split the data set into a training set and a test set.
# 80/20 train/test split
library(ISLR2)   # Carseats data
library(caret)   # createDataPartition(), confusionMatrix()
set.seed(123)
train_idx <- createDataPartition(Carseats$Sales, p = 0.8, list = FALSE)
train <- Carseats[train_idx,]
test <- Carseats[-train_idx,]
  1. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
library(tree)

tree <- tree(Sales ~ ., data = train)
plot(tree)
text(tree, pretty=0)

# summary(tree)

tree_preds <- predict(tree, newdata = test)
cat('MSE of regression trees: ', mean((test$Sales - tree_preds)^2))
## MSE of regression trees:  3.989909

We calculate a test MSE of about 3.99 for the regression tree. From the plot, the most important variable is shelf location (ShelveLoc), which forms the top split, followed by Price.

  1. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv_tree <- cv.tree(tree)
plot(cv_tree$size, cv_tree$dev, type='b')

We see the lowest cross-validated deviance at a size of 20, but the curve plateaus around a size of 10, so we will prune the tree to 10 terminal nodes and compare.
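Rather than reading the optimum off the plot, we can pull it from the CV object directly (a small helper using the components of cv_tree above):

cv_tree$size[which.min(cv_tree$dev)] # tree size with the lowest cross-validated deviance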

prune_tree <- prune.tree(tree, best = 10)
plot(prune_tree)
text(prune_tree, pretty=0)

prune_preds <- predict(prune_tree, newdata=test)
cat('MSE of pruned regression trees: ', mean((test$Sales - prune_preds)^2))
## MSE of pruned regression trees:  4.17508

We get a test MSE of about 4.18 for the pruned tree, slightly higher than the unpruned tree's 3.99, so pruning does not improve the test MSE here.

  1. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(randomForest)

bag <- randomForest(Sales ~ ., data = train, mtry = 10, importance = TRUE) # mtry = 10 uses all predictors, i.e. bagging

bag_preds <- predict(bag, newdata=test)

cat('MSE of bagging: ', mean((test$Sales - bag_preds)^2))
## MSE of bagging:  2.207355

We see a much better performance with a test MSE of 2.21 using bagging. We can examine which variables are important next:

importance(bag)
##                %IncMSE IncNodePurity
## CompPrice   32.9658615    261.321866
## Income       8.3720174    119.142486
## Advertising 21.9712398    171.089174
## Population  -0.8515952     78.469427
## Price       73.3770038    735.787501
## ShelveLoc   77.6120330    786.114217
## Age         23.7411566    245.395461
## Education    2.3243803     65.264136
## Urban        0.4501544     11.374272
## US           2.0875754      7.874132

From this we can see that ShelveLoc and Price are by far the most important variables, followed by CompPrice, Age, and Advertising. Each of them produces a large increase in node purity when used for splitting and a large increase in MSE when its values are permuted out of the model.
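The same ranking can be viewed graphically with randomForest's varImpPlot() (an optional extra, not part of the original output):

varImpPlot(bag, main = "Variable importance (bagging)") # dotcharts of %IncMSE and IncNodePurity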

  1. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
rf <- randomForest(Sales~., data=train, mtry=3, importance=T) # p/3 for regression

rf_preds <- predict(rf, newdata=test)

cat('MSE of random forest: ', mean((test$Sales - rf_preds)^2))
## MSE of random forest:  2.687778

We calculate a test MSE for random forest at 2.69. We can check the importance once again:

importance(rf)
##                %IncMSE IncNodePurity
## CompPrice   13.3317563     227.72227
## Income       6.5171317     196.31469
## Advertising 13.2852877     207.80164
## Population  -2.2650990     145.37736
## Price       42.8014305     580.20990
## ShelveLoc   46.9570829     603.81542
## Age         15.2863156     282.51052
## Education    2.8357733     105.57985
## Urban        0.2410994      21.99124
## US           5.2215714      32.28122

We once again see ShelveLoc and Price as the two most important variables, followed by Age, CompPrice, and Advertising, agreeing with the results we got from bagging.
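The question also asks about the effect of m. One way to examine it directly (a sketch, not run here) is to refit the forest over a grid of mtry values and track the test MSE:

# test MSE as a function of m, the number of predictors tried at each split
m_values <- 1:10
mse_by_m <- sapply(m_values, function(m) {
  fit <- randomForest(Sales ~ ., data = train, mtry = m)
  mean((test$Sales - predict(fit, newdata = test))^2)
})
plot(m_values, mse_by_m, type = "b", xlab = "m (mtry)", ylab = "Test MSE")

Typically the test MSE drops sharply as m increases from 1 and then levels off: very small m forces many splits onto weak predictors, while m = p is exactly bagging, so intermediate values usually perform about as well as, or slightly better than, bagging by decorrelating the trees.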

  1. Now analyze the data using BART, and report your results.
library(BART)

# Sales is column 1 of Carseats; columns 2:11 are the predictors
xtrain <- train[, 2:11]
xtest <- test[,2:11]
ytrain <- train[,1]
ytest <- test[,1]

bart <- gbart(xtrain, ytrain, x.test = xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 321, 14, 79
## y1,yn: 3.754922, -1.525078
## x1,x[n*p]: 111.000000, 1.000000
## xp1,xp[np*p]: 138.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 69 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.198224,7.46508
## *****sigma: 1.008771
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 4s
## trcnt,tecnt: 1000,1000
bart_preds <- bart$yhat.test.mean # posterior mean predictions for the held-out test set
cat('MSE of BART: ', mean((test$Sales - bart_preds)^2))

Note that the test-set predictions are stored in yhat.test.mean; yhat.train.mean holds the 321 in-sample fits, so comparing those against the 79 test responses would not give a valid test error. With the held-out predictions, BART's test MSE can be compared directly with the tree, bagging, and random forest results above.
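BART also records how often each predictor is used across the sampled trees, which serves as a rough variable-importance measure (as in the ISLR2 lab):

ord <- order(bart$varcount.mean, decreasing = TRUE) # most frequently used predictors first
bart$varcount.mean[ord]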

Problem 9

This problem involves the OJ data set which is part of the ISLR2 package.

str(OJ)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
library(tree)
set.seed(123)
train_idx <- sample(1:length(OJ$Purchase), size = 800) # train indices

train <- OJ[train_idx,]
test <- OJ[-train_idx,]
  1. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree <- tree(Purchase ~., data=train)
summary(tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800

The training error rate here is the misclassification error rate: 16.5% (132 of 800 observations). The tree has 8 terminal nodes and a residual mean deviance of 0.7625, and only two predictors, LoyalCH and PriceDiff, are actually used in its construction.

  1. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1071.00 CH ( 0.60875 0.39125 )  
##    2) LoyalCH < 0.5036 350  415.10 MM ( 0.28000 0.72000 )  
##      4) LoyalCH < 0.276142 170  131.00 MM ( 0.12941 0.87059 )  
##        8) LoyalCH < 0.0356415 56   10.03 MM ( 0.01786 0.98214 ) *
##        9) LoyalCH > 0.0356415 114  108.90 MM ( 0.18421 0.81579 ) *
##      5) LoyalCH > 0.276142 180  245.20 MM ( 0.42222 0.57778 )  
##       10) PriceDiff < 0.05 74   74.61 MM ( 0.20270 0.79730 ) *
##       11) PriceDiff > 0.05 106  144.50 CH ( 0.57547 0.42453 ) *
##    3) LoyalCH > 0.5036 450  357.10 CH ( 0.86444 0.13556 )  
##      6) PriceDiff < -0.39 27   32.82 MM ( 0.29630 0.70370 ) *
##      7) PriceDiff > -0.39 423  273.70 CH ( 0.90071 0.09929 )  
##       14) LoyalCH < 0.705326 130  135.50 CH ( 0.78462 0.21538 )  
##         28) PriceDiff < 0.145 43   58.47 CH ( 0.58140 0.41860 ) *
##         29) PriceDiff > 0.145 87   62.07 CH ( 0.88506 0.11494 ) *
##       15) LoyalCH > 0.705326 293  112.50 CH ( 0.95222 0.04778 ) *

Let us examine node 8, which is a terminal node. To reach it, the first split requires LoyalCH < 0.5036, leaving 350 observations on that branch. The next split requires LoyalCH < 0.276, narrowing it to 170 observations, and the final split requires LoyalCH < 0.036, leaving 56 observations. These 56 are classified as MM, since 98.2% of the training observations meeting all three criteria purchased MM (only about one person in this group purchased CH). This outcome makes sense: people with very low loyalty to CH are very unlikely to buy CH.

  1. Create a plot of the tree, and interpret the results.
plot(tree)
text(tree, pretty = 0)

In this tree diagram, the left branch at each split means the condition is met, while the right branch means it is not. We can see the same pattern as in the previous question: observations with LoyalCH < 0.5036, LoyalCH < 0.276, and LoyalCH < 0.036 are classified as MM.

Seeing the entire tree shows an overall trend: if LoyalCH < 0.5 you are unlikely to purchase CH, while higher loyalty makes a CH purchase more likely. The exceptions involve PriceDiff (the sale price of MM minus the sale price of CH): on the low-loyalty side, customers still tend to buy CH when MM is sufficiently more expensive (PriceDiff > 0.05), and on the high-loyalty side, even loyal customers switch to MM when CH is much more expensive (PriceDiff < -0.39).

Interestingly enough, some splits produce two children with the same classification (such as nodes 8 and 9 seen in the last section). These splits still increase node purity: the estimated probability of MM is much higher in node 8 (98.2%) than in node 9 (81.6%).

  1. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
tree_preds <- predict(tree, newdata = test)
tree_preds <- as.factor(ifelse(tree_preds[,1] >= 0.5, "CH","MM")) # classify as CH or MM
confusionMatrix(as.factor(test$Purchase), tree_preds)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 150  16
##         MM  34  70
##                                           
##                Accuracy : 0.8148          
##                  95% CI : (0.7633, 0.8593)
##     No Information Rate : 0.6815          
##     P-Value [Acc > NIR] : 5.99e-07        
##                                           
##                   Kappa : 0.596           
##                                           
##  Mcnemar's Test P-Value : 0.01621         
##                                           
##             Sensitivity : 0.8152          
##             Specificity : 0.8140          
##          Pos Pred Value : 0.9036          
##          Neg Pred Value : 0.6731          
##              Prevalence : 0.6815          
##          Detection Rate : 0.5556          
##    Detection Prevalence : 0.6148          
##       Balanced Accuracy : 0.8146          
##                                           
##        'Positive' Class : CH              
## 
cat('misclassification error: ', mean(tree_preds != as.factor(test$Purchase)))
## misclassification error:  0.1851852

We see misclassification error is 18.5%, slightly higher than the training error.
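Equivalently, predict() for tree objects can return class labels directly, which avoids thresholding the probability matrix by hand (this should reproduce the predictions above):

tree_class <- predict(tree, newdata = test, type = "class") # predicted factor labels
mean(tree_class != test$Purchase)                           # test misclassification error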

  1. Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv_tree <- cv.tree(tree) # cost-complexity pruning with cross-validation; the default FUN = prune.tree uses deviance
print(cv_tree)
## $size
## [1] 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  690.2346  692.8691  683.0423  717.4446  717.4446  743.3604  793.1266
## [8] 1073.0916
## 
## $k
## [1]      -Inf  12.03823  14.92474  25.76707  26.02613  38.91686  50.61655
## [8] 298.68751
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
  1. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv_tree)

  1. Which tree size corresponds to the lowest cross-validated classification error rate?

We see the lowest cross-validated deviance, 683, at a size of 6, meaning that the optimal size to prune the tree to is 6.
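Since the question asks about the classification error rate specifically, cv.tree() can also be run with FUN = prune.misclass, so that dev holds cross-validated misclassification counts rather than deviance (a sketch; the chosen size may or may not match the deviance-based one above):

set.seed(123)
cv_misclass <- cv.tree(tree, FUN = prune.misclass)
cv_misclass$size[which.min(cv_misclass$dev)] # size with the fewest CV misclassifications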

  1. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
tree_pruned <- prune.tree(tree, best = 6)
plot(tree_pruned)
text(tree_pruned, pretty = 0)

  1. Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800
summary(tree_pruned)
## 
## Classification tree:
## snip.tree(tree = tree, nodes = c(4L, 14L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.7945 = 630.9 / 794 
## Misclassification error rate: 0.165 = 132 / 800

We see a training error of 16.5% for both trees. This is because the two subtrees that were snipped off (below nodes 4 and 14) each split into children with the same predicted class, so collapsing them changes no predicted labels. The training error rates are therefore equal (neither is higher), even though the residual mean deviance of the pruned tree is slightly larger (0.7945 vs 0.7625).

  1. Compare the test error rates between the pruned and unpruned trees. Which is higher?
prune_preds <- predict(tree_pruned, newdata = test)
prune_preds <- as.factor(ifelse(prune_preds[,1] >= 0.5, "CH","MM")) # classify as CH or MM
cat('misclassification error for full tree: ', mean(tree_preds != as.factor(test$Purchase)), '\n')
## misclassification error for full tree:  0.1851852
cat('misclassification error for pruned tree: ', mean(prune_preds != as.factor(test$Purchase)))
## misclassification error for pruned tree:  0.1851852

We see the same test error for the pruned tree, for the same reason: the pruned splits only improved node purity without changing any predicted class labels, so neither tree has a higher test error rate.