Conceptual

3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The \(x\)-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the \(y\)-axis should display the value of the Gini index, classification error, and entropy.
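
With two classes, write \(p = \hat{p}_{m1}\). The three measures plotted below are then

\[
G = 2p(1 - p), \qquad
D = -p \log p - (1 - p)\log(1 - p), \qquad
E = 1 - \max(p,\, 1 - p),
\]

so all three are maximized at \(p = 0.5\) and equal zero at \(p = 0\) and \(p = 1\).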

library(tidyverse)
library(plotly)

Create Gini index data:

gini <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = p * (1 - p) * 2,
         measure = "Gini")


Create entropy data:

entr <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = -(p * log(p) + (1 - p) * log(1 - p)),
         # treat 0 * log(0) as 0 so the endpoints are not NaN
         value = ifelse(is.nan(value), 0, value),
         measure = "Entropy")


Create classification error data:

error <- as_tibble(list(p = seq(0, 1, 0.001))) %>% 
  mutate(value = 1 - pmax(p, 1 - p),
         measure = "Classification Error")


df <- bind_rows(gini, entr, error) %>% 
  arrange(measure, p)
plt <- ggplot(df, aes(x = p, y = value, col = measure)) +
  geom_line() +
  scale_color_manual(values = c("#377eb8","#e41a1c","#4daf4a")) +
  labs(x = "p",
       y = "Value fot Split",
       title = "Max Value for Each Criterion Occurs at p = 0.50") +
  theme_minimal()
figure <- ggplotly(plt, width = 600, height = 300)
figure


Applied

9. This problem involves the OJ data set which is part of the ISLR2 package.

a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

library(ISLR2)
library(tree)

set.seed(23)
index <- sample(nrow(OJ), 800)
oj_train <- OJ[index, ]
oj_test <- OJ[-index, ]

b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

oj_tree <- tree(Purchase ~., data = oj_train)
summary(oj_tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "SpecialCH"
## Number of terminal nodes:  10 
## Residual mean deviance:  0.7116 = 562.2 / 790 
## Misclassification error rate: 0.145 = 116 / 800


Three variables are used in the construction of the tree: LoyalCH, PriceDiff, and SpecialCH. There are 10 terminal nodes, and the training error rate is 0.145.


c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

oj_tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##    2) LoyalCH < 0.48285 298  327.20 MM ( 0.23826 0.76174 )  
##      4) LoyalCH < 0.142213 101   55.92 MM ( 0.07921 0.92079 ) *
##      5) LoyalCH > 0.142213 197  246.90 MM ( 0.31980 0.68020 )  
##       10) PriceDiff < 0.31 154  169.80 MM ( 0.24026 0.75974 )  
##         20) SpecialCH < 0.5 137  136.00 MM ( 0.19708 0.80292 ) *
##         21) SpecialCH > 0.5 17   23.03 CH ( 0.58824 0.41176 ) *
##       11) PriceDiff > 0.31 43   57.71 CH ( 0.60465 0.39535 ) *
##    3) LoyalCH > 0.48285 502  437.00 CH ( 0.84263 0.15737 )  
##      6) LoyalCH < 0.705699 208  258.40 CH ( 0.68750 0.31250 )  
##       12) PriceDiff < -0.165 27   25.87 MM ( 0.18519 0.81481 ) *
##       13) PriceDiff > -0.165 181  198.50 CH ( 0.76243 0.23757 )  
##         26) PriceDiff < 0.265 97  125.70 CH ( 0.64948 0.35052 )  
##           52) LoyalCH < 0.6864 92  114.70 CH ( 0.68478 0.31522 ) *
##           53) LoyalCH > 0.6864 5    0.00 MM ( 0.00000 1.00000 ) *
##         27) PriceDiff > 0.265 84   57.20 CH ( 0.89286 0.10714 ) *
##      7) LoyalCH > 0.705699 294  112.60 CH ( 0.95238 0.04762 )  
##       14) PriceDiff < -0.39 14   19.12 CH ( 0.57143 0.42857 ) *
##       15) PriceDiff > -0.39 280   72.65 CH ( 0.97143 0.02857 ) *


Looking at node 20, we see that the splitting variable at this node is SpecialCH and the splitting value is 0.5. There are 137 observations in this branch of the tree, and their deviance is 136.00. The * at the end of the line denotes that this is a terminal node. The prediction at this node is Purchase = MM. According to the text output, about 19.71% of the observations in this node have CH as the value of Purchase, while roughly 80.29% have MM.
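
The same per-node statistics can also be pulled directly from the tree object (a quick sketch; the tree package stores them in the frame component, whose row names are the node numbers shown above):

# Inspect node 20 directly; frame rows are indexed by node number
oj_tree$frame["20", ]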


d) Create a plot of the tree, and interpret the results.

plot(oj_tree)
text(oj_tree, pretty = 0, cex = 0.55)


LoyalCH is clearly the most important variable in the tree, since the top three splits are on LoyalCH. If LoyalCH < 0.142213, the tree predicts MM for the value of Purchase. If LoyalCH > 0.705699, the tree predicts CH for the value of Purchase. For intermediate values of LoyalCH, the prediction also depends on the values of two additional variables: PriceDiff and SpecialCH.


e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

library(caret)

oj_pred <- predict(oj_tree, oj_test, type = "class")
confusionMatrix(oj_test$Purchase, oj_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 139  20
##         MM  37  74
##                                          
##                Accuracy : 0.7889         
##                  95% CI : (0.7353, 0.836)
##     No Information Rate : 0.6519         
##     P-Value [Acc > NIR] : 6.278e-07      
##                                          
##                   Kappa : 0.5537         
##                                          
##  Mcnemar's Test P-Value : 0.03407        
##                                          
##             Sensitivity : 0.7898         
##             Specificity : 0.7872         
##          Pos Pred Value : 0.8742         
##          Neg Pred Value : 0.6667         
##              Prevalence : 0.6519         
##          Detection Rate : 0.5148         
##    Detection Prevalence : 0.5889         
##       Balanced Accuracy : 0.7885         
##                                          
##        'Positive' Class : CH             
## 


The prediction accuracy is 78.89%, so the test error rate is 21.11%.
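
Equivalently, the test error rate can be computed directly from the predictions made above, without caret:

# Proportion of misclassified test observations for the unpruned tree
mean(oj_pred != oj_test$Purchase)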


f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

oj_cv <- cv.tree(oj_tree, FUN = prune.tree)


g) Produce a plot with tree size on the \(x\)-axis and cross-validated classification error rate on the \(y\)-axis.

plot(oj_cv$size, oj_cv$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")

h) Which tree size corresponds to the lowest cross-validated classification error rate?


According to the plot, it appears that a tree size of 3 gives the lowest cross-validation error.
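
This can be confirmed programmatically from the components of the cv.tree() output:

# Tree size with the smallest cross-validated deviance
oj_cv$size[which.min(oj_cv$dev)]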


i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

oj_prune <- prune.tree(oj_tree, best = 3)
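
As a quick visual check, the pruned tree can be plotted with the same plot()/text() calls used in part (d):

plot(oj_prune)
text(oj_prune, pretty = 0)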


j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(oj_tree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff" "SpecialCH"
## Number of terminal nodes:  10 
## Residual mean deviance:  0.7116 = 562.2 / 790 
## Misclassification error rate: 0.145 = 116 / 800
summary(oj_prune)
## 
## Classification tree:
## snip.tree(tree = oj_tree, nodes = c(7L, 2L, 6L))
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes:  3 
## Residual mean deviance:  0.876 = 698.2 / 797 
## Misclassification error rate: 0.1875 = 150 / 800


The unpruned tree has a slightly lower training error rate (0.145) than that of the pruned tree (0.1875).
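
The same comparison can be made by computing both training error rates explicitly from predictions on oj_train:

# Training error rate of the unpruned tree
mean(predict(oj_tree, oj_train, type = "class") != oj_train$Purchase)
# Training error rate of the pruned tree
mean(predict(oj_prune, oj_train, type = "class") != oj_train$Purchase)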


k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

pred_unpruned <- predict(oj_tree, oj_test, type = "class")
unpruned_error <- sum(oj_test$Purchase != pred_unpruned)
unpruned_error / length(pred_unpruned)
## [1] 0.2111111
pred_pruned <- predict(oj_prune, oj_test, type = "class")
pruned_error <- sum(oj_test$Purchase != pred_pruned)
pruned_error / length(pred_pruned)
## [1] 0.2


The pruned tree has a slightly lower test error rate (0.2) than that of the unpruned tree (0.2111).