library(ggplot2)

# Two-class node impurity measures as a function of the class-1 proportion
pm1 = seq(0, 1, 0.01)
pm2 = 1 - pm1
class_error = 1 - pmax(pm1, pm2)
gini = pm1*(1 - pm1) + pm2*(1 - pm2)
entropy = -pm1*log(pm1) - pm2*log(pm2)
df_proportion = data.frame(pm1,pm2,class_error,gini,entropy)
ggplot(data = df_proportion) +
geom_line(aes(x = pm1, y = class_error, col = 'Classification Error')) +
geom_line(aes(x = pm1, y = gini, col = 'Gini Index')) +
geom_line(aes(x = pm1, y = entropy, col = 'Entropy')) +
labs(y = 'Function Value', col = 'Function') +
theme_minimal()
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_line()`).
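The warning is expected here: at pm1 = 0 and pm1 = 1 the entropy expression involves log(0), which evaluates to NaN, so ggplot drops those two endpoint rows. One way to avoid it (a minimal sketch, not part of the original figure) is to apply the convention 0 * log(0) = 0:
# Entropy with the 0 * log(0) = 0 convention, so the endpoints are defined
xlogx = function(p) ifelse(p > 0, p * log(p), 0)
entropy_safe = -xlogx(pm1) - xlogx(pm2)
range(entropy_safe)  # finite over the whole [0, 1] grid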
library(tree)

OJ = ISLR2::OJ
set.seed(1)
# Split the 1,070 observations into a training set of 800 and a test set of 270
index = sample(1:nrow(OJ), 800)
OJ_train = OJ[index,]
OJ_test = OJ[-index,]
oj_tree = tree(Purchase ~ ., data = OJ_train)
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The training error rate is 15.88%, the tree has 9 terminal nodes, and of the 17 predictors in the data only 5 were used to construct the tree: LoyalCH, PriceDiff, SpecialCH, ListPriceDiff, and PctDiscMM.
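If we want these programmatically rather than reading them off the summary, they can be recovered from the tree's frame (a small sketch; it assumes the tree package's convention of marking leaf rows with "<leaf>" in the var column):
# Split variables actually used in the fitted tree
used_vars = setdiff(unique(as.character(oj_tree$frame$var)), "<leaf>")
used_vars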
oj_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Node 8 is a terminal node, as indicated by the trailing asterisk on the line 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *. 59 observations fall into this node and its deviance is 10.14. When LoyalCH < 0.0356, the model predicts a Minute Maid purchase with an estimated probability of 98.3%, versus 1.7% for Citrus Hill, so the node is nearly pure.
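As a quick sanity check, for a classification tree the node deviance is -2 * sum over classes of n_k * log(n_k / n). A short sketch reproducing the 10.14 from the printed counts (1 CH and 58 MM purchases out of 59):
# Node 8 deviance: -2 * (n_CH * log(p_CH) + n_MM * log(p_MM))
n_ch = 1; n_mm = 58; n = n_ch + n_mm
-2 * (n_ch * log(n_ch / n) + n_mm * log(n_mm / n))  # approximately 10.14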
plot(oj_tree)
text(oj_tree, pretty = 0, cex = 0.6)
Customers with very low LoyalCH scores are predicted to purchase Minute Maid. In the moderate-loyalty branch a split occurs on PriceDiff: when PriceDiff > 0.05 these customers are predicted to purchase Citrus Hill, while for PriceDiff < 0.05 the tree splits again on SpecialCH, with the SpecialCH >= 0.5 node leaning toward Citrus Hill (about 60/40) and the rest predicted to purchase Minute Maid. On the other side of the tree, customers with a high LoyalCH score are predicted to purchase Citrus Hill, as expected. Within that branch a split occurs on ListPriceDiff: when ListPriceDiff >= 0.235 loyal customers stay with Citrus Hill, and when it is smaller the tree splits again on PctDiscMM, where even loyal customers are predicted to purchase Minute Maid if its percentage discount is large enough.
preds = predict(oj_tree, OJ_test, type = 'class')
table(preds, test_actual = OJ_test$Purchase)
## test_actual
## preds CH MM
## CH 160 38
## MM 8 64
mean(preds == OJ_test$Purchase)
## [1] 0.8296296
1 - mean(preds == OJ_test$Purchase)
## [1] 0.1703704
We get a test error rate of 17.04%, or 82.96% accuracy. Of the 270 test observations we misclassify 46, and the majority of these errors (38) are observations predicted as Citrus Hill that were actually Minute Maid. This could be partly due to the imbalance between Citrus Hill and Minute Maid in the data.
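To check the imbalance hypothesis, a quick look at the class counts in the training set (a sketch using the objects defined above; the root node of the tree already showed roughly a 61% / 39% CH / MM split):
# Class balance in the training data
table(OJ_train$Purchase)
prop.table(table(OJ_train$Purchase))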
set.seed(1)
oj_cv = cv.tree(oj_tree, K = 10, FUN = prune.misclass)
oj_cv
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 145 145 146 146 167 315
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(oj_cv$size, oj_cv$dev/ nrow(OJ_train), type = 'b',
xlab = 'Tree Size', ylab = 'Error Rate')
Tree sizes 8 and 9 give the lowest cross-validated misclassification count of 145 (an error rate of roughly 145/800 = 0.18), but I will choose tree size 8 for simplicity.
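The same choice can be made programmatically instead of reading it off the plot (a sketch; it picks the smallest size among those tied for the minimum CV deviance):
# Smallest tree size achieving the minimum cross-validated misclassification count
best_size = min(oj_cv$size[oj_cv$dev == min(oj_cv$dev)])
best_size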
oj_pruned = prune.tree(oj_tree, best = 8)
oj_pruned
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
Unpruned:
mean(predict(oj_tree, type = 'class') != OJ_train$Purchase)
## [1] 0.15875
Pruned:
mean(predict(oj_pruned, type = 'class') != OJ_train$Purchase)
## [1] 0.1625
The pruned tree has a slightly higher training error rate of 0.1625 compared to the unpruned rate of 0.1588. This makes sense: pruning removes a split, reducing the flexibility (and variance) of the model, so the fit to the training data cannot improve.
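To confirm the pruned tree really is smaller, the terminal-node counts of the two trees can be compared (a sketch; it counts the leaf rows in each tree's frame):
# Terminal-node counts: 9 for the unpruned tree, 8 for the pruned tree
sum(oj_tree$frame$var == "<leaf>")
sum(oj_pruned$frame$var == "<leaf>")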
Unpruned:
mean(predict(oj_tree, type = 'class', newdata = OJ_test) != OJ_test$Purchase)
## [1] 0.1703704
Pruned:
mean(predict(oj_pruned, type = 'class', newdata = OJ_test) != OJ_test$Purchase)
## [1] 0.162963
Both test error rates are at or above the corresponding training error rates, but the test error for the pruned tree (0.163) is lower than for the unpruned tree (0.170). The difference is small, and with a different random seed the unpruned tree could just as easily come out ahead, but with more data the gap might become more pronounced.
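Collecting the four error rates in one place makes the comparison easier to see (a sketch reusing the fitted objects above; the helper err() is just for illustration):
# Training and test misclassification rates for both trees
err = function(fit, data) mean(predict(fit, newdata = data, type = 'class') != data$Purchase)
data.frame(
  tree  = c('unpruned', 'pruned'),
  train = c(err(oj_tree, OJ_train), err(oj_pruned, OJ_train)),
  test  = c(err(oj_tree, OJ_test),  err(oj_pruned, OJ_test))
)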