3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂_m1. The x-axis should display p̂_m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂_m1 = 1 − p̂_m2. You could make this plot by hand, but it will be much easier to make in R.

library(ISLR2)
library(tidyverse)
library(modelr)
library(splines)
library(corrplot)
library(caret)
library(boot)
library(leaps)
library(MASS)
library(gam)
library(rpart)
library(tree)
library(randomForest)
library(dbarts)
library(BART)
data("Carseats")
set.seed(4)
data("OJ")
set.seed(4)
# Set range of p_hat_m1 values from 0 to 1
p_hat_m1 <- seq(0, 1, length.out = 100)

# Calculate Gini index, classification error, and entropy for each p_hat_m1 value
gini <- 2 * p_hat_m1 * (1 - p_hat_m1)
class_err <- 1 - pmax(p_hat_m1, 1 - p_hat_m1)
entropy <- - p_hat_m1 * log2(p_hat_m1) - (1 - p_hat_m1) * log2(1 - p_hat_m1)

# Plot the results
plot(p_hat_m1, gini, type = "l", col = "red", xlab = expression(hat(p)[m][1]), ylab = "Index Value", ylim = c(0, 1))
lines(p_hat_m1, class_err, col = "blue")
lines(p_hat_m1, entropy, col = "green")
legend("topright", c("Gini Index", "Classification Error", "Entropy"), lty = 1, col = c("red", "blue", "green"))

8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

set.seed(4)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
train <- sample(nrow(Carseats), size = 0.7 * nrow(Carseats))
test <- -train
TRAIN <- Carseats[train, ]
TEST <- Carseats[-train, ]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

tree.Carseats=tree(Sales~., Carseats, subset=train)
summary(tree.Carseats)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "Population" 
## [6] "CompPrice"  
## Number of terminal nodes:  19 
## Residual mean deviance:  2.228 = 581.6 / 261 
## Distribution of residuals:
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.083000 -0.967100  0.001414  0.000000  0.946500  4.891000
plot(tree.Carseats)
text(tree.Carseats,pretty = 0)

yhat <- predict(tree.Carseats, newdata = Carseats[-train, ])
Carseats.test <- Carseats[-train, "Sales"]
plot(yhat, Carseats.test)
abline(0, 1)

mean((yhat - Carseats.test)^2)
## [1] 5.018197

The test MSE is roughly 5.02.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

set.seed(4)
cv.Carseats <- cv.tree(tree.Carseats)
plot(cv.Carseats$size, cv.Carseats$dev, type = "b", xlab = "Tree Size", ylab = "Cross-Validated Deviance")
max_x <- max(cv.Carseats$size)
axis(1, at = seq(1, max_x, by = 1))
points(cv.Carseats$size[which.min(cv.Carseats$dev)], min(cv.Carseats$dev), col = "purple", pch = 18, cex = 2)

prune.Carseats <- prune.tree(tree.Carseats, best = 17)
plot(prune.Carseats)
text(prune.Carseats, pretty = 0)

yhat <- predict(prune.Carseats, newdata = Carseats[-train, ])
Carseats.test <- Carseats[-train, "Sales"]
plot(yhat, Carseats.test)
abline(0, 1)

mean((yhat - Carseats.test)^2)
## [1] 4.949164

The test MSE drops from 5.02 to about 4.95, so pruning the tree yields a small improvement in the test MSE.
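
Instead of reading the best size off the CV plot, it can be pulled directly from the cv.Carseats object computed above; a minimal sketch:

# Tree size with the lowest cross-validated deviance
best.size <- cv.Carseats$size[which.min(cv.Carseats$dev)]
best.size
prune.best <- prune.tree(tree.Carseats, best = best.size)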

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

set.seed(4)
# Bagging: mtry equals the number of predictors (10 for Carseats)
bag.carseats <- randomForest(Sales ~ ., data = TRAIN, mtry = 10, ntree = 25, importance = TRUE)
yhat.bag <- predict(bag.carseats, newdata = Carseats[-train, ])
plot(yhat.bag, Carseats.test)
abline(0, 1)

mean((yhat.bag - Carseats.test)^2)
## [1] 2.825339

The test MSE is about 2.83, a substantial improvement over the single regression tree.

varImp(bag.carseats)
##                 Overall
## CompPrice    8.39430626
## Income       1.32136702
## Advertising  7.43345531
## Population   0.04170951
## Price       13.16343330
## ShelveLoc   16.67029538
## Age          4.25687571
## Education    1.28678556
## Urban        2.24366326
## US          -0.51588071
max=max(varImp(bag.carseats))
max
## [1] 16.6703

ShelveLoc is the most important variable, followed by Price.
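
The exercise asks specifically for the importance() function; a short sketch applying randomForest's importance() and the companion varImpPlot() to the bagged fit above, which report %IncMSE and IncNodePurity for each predictor:

# Permutation-based (%IncMSE) and node-purity importance from randomForest
importance(bag.carseats)
varImpPlot(bag.carseats)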

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed(4)
rf.carseats = randomForest(Sales ~ ., data = TRAIN, mtry = 5, ntree =25, importance = T)
rf.pred = predict(rf.carseats, newdata = Carseats[-train , ])
mean((rf.pred-Carseats.test)^2)
## [1] 2.976509

The test MSE is about 2.98 with m = 5, slightly higher than the 2.83 obtained by bagging, which uses all ten predictors at every split. In general, smaller values of m decorrelate the trees, which helps when the predictors are highly correlated, while values of m that are too small can miss the strongest predictors; a sketch for examining the test MSE over a range of m values follows.
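
To show the effect of m directly, one can refit the forest over a grid of mtry values and track the test MSE; a sketch under the same TRAIN/TEST split (exact numbers will vary with the seed and ntree):

# Variable importance for the m = 5 forest, as the exercise requests
importance(rf.carseats)

# Test MSE as a function of m, the number of predictors tried at each split
mse.by.m <- sapply(1:10, function(m) {
  fit <- randomForest(Sales ~ ., data = TRAIN, mtry = m, ntree = 25)
  mean((predict(fit, newdata = TEST) - TEST$Sales)^2)
})
plot(1:10, mse.by.m, type = "b", xlab = "m (mtry)", ylab = "Test MSE")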

(f) Now analyze the data using BART, and report your results.

set.seed(4)
x=Carseats[,1:10]
y=Carseats[,"Sales"]
xtrain=x[train, ]
ytrain=y[train]
xtest=x[test, ]
ytest=y[test]
bart.fit=gbart(xtrain,ytrain,x.test=xtest)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 280, 13, 120
## y1,yn: -1.215786, -0.465786
## x1,x[n*p]: 6.200000, 1.000000
## xp1,xp[np*p]: 10.060000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.276302,3,8.81255e-31,7.41579
## *****sigma: 0.000000
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,13,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
yhat.bart=bart.fit$yhat.test.mean
mean((ytest-yhat.bart)^2)
## [1] 0.1334018

BART produces a very low test MSE of about 0.13, far lower than any of the other approaches considered above.
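
To see which predictors BART relies on most, the gbart() fit stores the posterior mean number of times each variable appears in the tree ensemble; a short sketch inspecting it:

# Average number of appearances of each predictor across the BART trees
ord <- order(bart.fit$varcount.mean, decreasing = TRUE)
bart.fit$varcount.mean[ord]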

9. This problem involves the OJ data set, which is part of the ISLR2 package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

dim(OJ)
## [1] 1070   18
names(OJ)
##  [1] "Purchase"       "WeekofPurchase" "StoreID"        "PriceCH"       
##  [5] "PriceMM"        "DiscCH"         "DiscMM"         "SpecialCH"     
##  [9] "SpecialMM"      "LoyalCH"        "SalePriceMM"    "SalePriceCH"   
## [13] "PriceDiff"      "Store7"         "PctDiscMM"      "PctDiscCH"     
## [17] "ListPriceDiff"  "STORE"
train <- sample(nrow(OJ), 800)

# Create the training set
train_OJ <- OJ[train, ]

# Create the test set by selecting the observations that are not in the training set
test_OJ<- OJ[-train, ]

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

# Fit a tree to the training data
set.seed(4)
tree.fit <- tree(Purchase ~ ., data = train_OJ)

# Print the summary of the tree
summary(tree.fit)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train_OJ)
## Variables actually used in tree construction:
## [1] "LoyalCH"     "PriceDiff"   "SalePriceMM"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7186 = 569.1 / 792 
## Misclassification error rate: 0.1775 = 142 / 800

The training error rate is 0.1775 (142 of the 800 training observations are misclassified), and the tree has 8 terminal nodes. Only three predictors are actually used in the splits: LoyalCH, PriceDiff, and SalePriceMM.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

# Print the detailed text output of the tree
tree.fit
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1074.00 CH ( 0.60375 0.39625 )  
##    2) LoyalCH < 0.559036 381  461.50 MM ( 0.29396 0.70604 )  
##      4) LoyalCH < 0.276142 174  124.10 MM ( 0.11494 0.88506 )  
##        8) LoyalCH < 0.051325 64   10.30 MM ( 0.01562 0.98438 ) *
##        9) LoyalCH > 0.051325 110  101.20 MM ( 0.17273 0.82727 ) *
##      5) LoyalCH > 0.276142 207  284.40 MM ( 0.44444 0.55556 )  
##       10) PriceDiff < -0.165 36   20.65 MM ( 0.08333 0.91667 ) *
##       11) PriceDiff > -0.165 171  236.80 CH ( 0.52047 0.47953 ) *
##    3) LoyalCH > 0.559036 419  298.30 CH ( 0.88544 0.11456 )  
##      6) LoyalCH < 0.764572 154  178.50 CH ( 0.73377 0.26623 )  
##       12) PriceDiff < -0.35 15   11.78 MM ( 0.13333 0.86667 ) *
##       13) PriceDiff > -0.35 139  139.70 CH ( 0.79856 0.20144 )  
##         26) SalePriceMM < 2.125 88  106.80 CH ( 0.70455 0.29545 ) *
##         27) SalePriceMM > 2.125 51   16.88 CH ( 0.96078 0.03922 ) *
##      7) LoyalCH > 0.764572 265   64.69 CH ( 0.97358 0.02642 ) *

Terminal node 8), "LoyalCH < 0.051325 64 10.30 MM ( 0.01562 0.98438 ) *": this node contains 64 training observations whose brand loyalty to Citrus Hill is very low (LoyalCH < 0.051325). The predicted class (yval) is MM, and about 98.4% of the observations in the node are MM purchases, with only about 1.6% CH. The deviance of 10.30 is small, indicating a very pure node; for a classification tree the deviance is −2 times the sum of the log-likelihoods of the observations in the node, not a sum of squared differences. Because this is a terminal node (marked with *), it is not split further, and any new customer falling into it is predicted to purchase MM.
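
The same per-node information can also be pulled out of the fitted object; a small sketch listing the terminal nodes from tree.fit$frame (the tree package marks leaves with "<leaf>" in the var column):

# Terminal nodes: size, deviance, and predicted class
tree.fit$frame[tree.fit$frame$var == "<leaf>", c("n", "dev", "yval")]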

(d) Create a plot of the tree, and interpret the results.

plot(tree.fit)
text(tree.fit,pretty = 0,
     cex = 0.7)

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

# Predict response on the test data
pred <- predict(tree.fit, newdata = test_OJ,type = "class")

# Create confusion matrix
conf_mat <- table(test_OJ$Purchase, pred)
conf_mat
##     pred
##       CH  MM
##   CH 158  12
##   MM  49  51
# Calculate test error rate
test_error <- 1 - sum(diag(conf_mat)) / sum(conf_mat)
test_error
## [1] 0.2259259

The test error rate is about 0.226 (61 of the 270 test observations are misclassified).

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

set.seed(4)
# Perform cross-validation to find optimal tree size
cv <- cv.tree(tree.fit, FUN = prune.tree)
cv
## $size
## [1] 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  713.2674  705.0023  715.4586  735.0959  735.0959  794.0907  789.3464
## [8] 1078.6529
## 
## $k
## [1]      -Inf  12.59792  15.96376  26.98020  27.03401  52.97201  55.11284
## [8] 314.54569
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
which.min(cv$dev)
## [1] 2

The optimal tree size is 7: which.min(cv$dev) returns position 2 of the $dev vector, and the corresponding entry of cv$size is 7 (see the sketch below).
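
Note that which.min(cv$dev) returns a position in the $dev vector, not a tree size; a minimal sketch mapping that position back to the size it corresponds to:

# Pair each candidate size with its cross-validated deviance
data.frame(size = cv$size, dev = cv$dev)

# Size with the lowest deviance
cv$size[which.min(cv$dev)]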

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

set.seed(4)
# Plot the cross-validated deviance from cv.tree() against tree size
cv.tree.fit <- cv.tree(tree.fit)
plot(cv.tree.fit$size, cv.tree.fit$dev, type = "b", xlab = "Tree Size", ylab = "Cross-Validated Deviance")

# Add a point for the minimum error rate
points(cv.tree.fit$size[which.min(cv.tree.fit$dev)], cv.tree.fit$dev[which.min(cv.tree.fit$dev)], col = "blue", cex=2, pch = 19)

(h) Which tree size corresponds to the lowest cross-validated classification error rate? Tree size 7 gives the lowest cross-validated error (measured here by the deviance), as shown in the plot above.

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

# Prune the tree to the optimal tree size
min.cv.tree=which.min(cv$dev)
if (min.cv.tree <= 5) {
  pruned.tree <- prune.misclass(tree.fit, best = min.cv.tree)
} else {
  pruned.tree <- prune.misclass(tree.fit, best = 5)
}
plot(pruned.tree)
text(pruned.tree, pretty = 0, cex = 0.7)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(pruned.tree)
## 
## Classification tree:
## snip.tree(tree = tree.fit, nodes = 2:3)
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes:  2 
## Residual mean deviance:  0.9521 = 759.8 / 798 
## Misclassification error rate: 0.2 = 160 / 800

The pruned tree's training error rate is 0.2, while the unpruned tree's training error rate is 0.1775, so the pruned tree has the higher training error rate.
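
The same training error rates can also be recomputed directly from the two fitted trees; a minimal sketch:

# Training error rates for the unpruned and pruned trees
mean(predict(tree.fit, newdata = train_OJ, type = "class") != train_OJ$Purchase)
mean(predict(pruned.tree, newdata = train_OJ, type = "class") != train_OJ$Purchase)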

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

oj.pruned.pred=predict(pruned.tree, newdata=test_OJ,type='class')
table(oj.pruned.pred,test_OJ$Purchase)
##               
## oj.pruned.pred  CH  MM
##             CH 124  24
##             MM  46  76
mean(oj.pruned.pred != test_OJ$Purchase)
## [1] 0.2592593

The unpruned tree's test error rate is about 0.226, while the pruned tree's test error rate is about 0.259, so the pruned tree has the higher test error rate.
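
For completeness, both test error rates can be computed side by side on the same test set; a small sketch:

# Test error rates for the unpruned and pruned trees
mean(predict(tree.fit, newdata = test_OJ, type = "class") != test_OJ$Purchase)
mean(predict(pruned.tree, newdata = test_OJ, type = "class") != test_OJ$Purchase)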