library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Loading required package: lattice
library(rattle)
## Warning: package 'rattle' was built under R version 4.4.3
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(rpart)
library(tree)
## Warning: package 'tree' was built under R version 4.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.3
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(BART)
## Warning: package 'BART' was built under R version 4.4.3
## Loading required package: nlme
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
attach(Carseats)
attach(OJ)
3. Consider the Gini index, classification error, and entropy in a
simple classification setting with two classes. Create a single plot
that displays each of these quantities as a function of $\hat{p}_{m1}$. The
x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should
display the value of the Gini index, classification error, and
entropy.
Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make
this plot by hand, but it will be much easier to make in R.
p <- seq(0, 1, 0.01)
gini <- 2 * p * (1 - p)
classification_error <- 1 - pmax(p, 1 - p)
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.na(entropy)] <- 0
plot(p, gini, type = "l", col = "blue", ylim = c(0,1),
ylab = "Value", xlab = "pm1")
lines(p, entropy, col = "red")
lines(p, classification_error, col = "green")
legend("top", legend = c("Gini", "Entropy", "Classification Error"),
col = c("blue", "red", "green"), lty = 1)

8. In the lab, a classification tree was applied to the Carseats
data set after converting Sales into a qualitative response variable.
Now we will seek to predict Sales using regression trees and related
approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
set.seed(62)
train_index <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
car_train <- Carseats[train_index, ]
car_test <- Carseats[-train_index, ]
(b) Fit a regression tree to the training set. Plot the tree, and
interpret the results. What test MSE do you obtain?
car_tree <- rpart(Sales ~ ., data = car_train)
fancyRpartPlot(car_tree, sub = "", cex = 0.7)

The decision tree predicts Sales from features such as ShelveLoc,
Price, and Income. The first split checks whether ShelveLoc is Bad or
Medium; within each branch the tree keeps splitting on other variables
such as Price and Advertising. The tree ends in leaf nodes, which give
the predicted Sales along with the number and share of training
observations that reach each node. Each successive split refines the
prediction.
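To see how these splits are applied in practice, here is a minimal
sketch (not part of the original output; the column selection is
illustrative) that traces a single test observation through the fitted
tree:
# Illustrative only: inspect the features used near the top of the tree,
# then obtain the leaf-node prediction for that same row.
car_test[1, c("ShelveLoc", "Price", "Income")]
predict(car_tree, newdata = car_test[1, ])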
tree_pred <- predict(car_tree, newdata = car_test)
mse_tree <- mean((tree_pred - car_test$Sales)^2)
mse_tree
## [1] 4.741038
(c) Use cross-validation in order to determine the optimal level of
tree complexity. Does pruning the tree improve the test MSE?
printcp(car_tree)
##
## Regression tree:
## rpart(formula = Sales ~ ., data = car_train)
##
## Variables actually used in tree construction:
## [1] Advertising Age CompPrice Education Income Price
## [7] ShelveLoc
##
## Root node error: 2204.8/281 = 7.8463
##
## n= 281
##
## CP nsplit rel error xerror xstd
## 1 0.209522 0 1.00000 1.00909 0.087406
## 2 0.119805 1 0.79048 0.80463 0.065598
## 3 0.065052 2 0.67067 0.70854 0.058902
## 4 0.039570 3 0.60562 0.68788 0.054292
## 5 0.035441 4 0.56605 0.72448 0.056051
## 6 0.029500 5 0.53061 0.73349 0.053593
## 7 0.027752 6 0.50111 0.71589 0.053731
## 8 0.024447 7 0.47336 0.71297 0.053083
## 9 0.019810 8 0.44891 0.71525 0.053177
## 10 0.019158 9 0.42910 0.71966 0.053711
## 11 0.017220 10 0.40994 0.70184 0.052208
## 12 0.016757 11 0.39272 0.68745 0.052450
## 13 0.015813 12 0.37596 0.68626 0.053120
## 14 0.015727 13 0.36015 0.68323 0.052788
## 15 0.013609 14 0.34442 0.66967 0.051432
## 16 0.013460 15 0.33082 0.65294 0.048939
## 17 0.012952 16 0.31736 0.65294 0.048939
## 18 0.010835 17 0.30440 0.63746 0.048019
## 19 0.010000 18 0.29357 0.63109 0.047565
plotcp(car_tree)

best_cp <- car_tree$cptable[which.min(car_tree$cptable[, "xerror"]), "CP"]
car_prun <- prune(car_tree, cp = best_cp)
pred_prun <- predict(car_prun, car_test)
mse_prun <- mean((car_test$Sales - pred_prun)^2)
mse_prun
## [1] 4.741038
After pruning with the CP value that minimizes the cross-validated
error, the test MSE is unchanged at 4.74. The minimum xerror in the CP
table occurs at the full tree (18 splits), so pruning removes no splits
here and neither improves nor worsens performance on the test set.
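As an alternative, the one-standard-error rule picks the simplest tree
whose cross-validated error is within one standard error of the
minimum, which usually selects a smaller tree. A minimal sketch (not
run above, so its test MSE may differ from the values reported here;
the object names are illustrative):
# Sketch: prune with the one-standard-error rule instead of the minimum xerror.
cp_tab <- car_tree$cptable
min_row <- which.min(cp_tab[, "xerror"])
threshold <- cp_tab[min_row, "xerror"] + cp_tab[min_row, "xstd"]
cp_1se <- cp_tab[which(cp_tab[, "xerror"] <= threshold)[1], "CP"]  # simplest tree meeting the threshold
car_prun_1se <- prune(car_tree, cp = cp_1se)
mean((car_test$Sales - predict(car_prun_1se, car_test))^2)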
(d) Use the bagging approach in order to analyze this data. What
test MSE do you obtain? Use the importance() function to determine which
variables are most important.
set.seed(62)
car_bag <- randomForest(Sales ~ ., data = car_train, mtry = ncol(car_train) - 1, importance = TRUE)
car_bag
##
## Call:
## randomForest(formula = Sales ~ ., data = car_train, mtry = ncol(car_train) - 1, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.466286
## % Var explained: 68.57
bag_pred <- predict(car_bag, newdata = car_test)
mse_bag <- mean((car_test$Sales - bag_pred)^2)
mse_bag
## [1] 2.419629
Using the bagging approach, the test MSE was 2.42, a substantial
improvement over the single regression tree (4.74). Averaging many
trees grown on bootstrap samples reduces variance and noticeably
improves prediction accuracy.
importance(car_bag)
## %IncMSE IncNodePurity
## CompPrice 29.8580734 197.688262
## Income 12.5910626 122.873227
## Advertising 25.2242668 212.756458
## Population -0.3157721 75.005450
## Price 66.9930266 668.898161
## ShelveLoc 62.9411213 543.989191
## Age 22.7934087 232.611362
## Education 4.7508081 64.795992
## Urban -2.3355119 9.792186
## US 3.6163318 14.813023
varImpPlot(car_bag)

The most important variables are Price and ShelveLoc, followed by
Advertising, CompPrice, and Age, which are moderately important.
Variables such as Income, Education, Urban, and US have low
importance.
(e) Use random forests to analyze this data. What test MSE do you
obtain? Use the importance() function to determine which variables are
most important. Describe the effect of m, the number of variables
considered at each split, on the error rate obtained.
set.seed(62)
car_rf <- randomForest(Sales ~ ., data = car_train, mtry = 5, importance = TRUE)
car_rf
##
## Call:
## randomForest(formula = Sales ~ ., data = car_train, mtry = 5, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 2.604994
## % Var explained: 66.8
rf_pred <- predict(car_rf, newdata = car_test)
mse_rf <- mean((car_test$Sales - rf_pred)^2)
mse_rf
## [1] 2.551815
importance(car_rf)
## %IncMSE IncNodePurity
## CompPrice 22.0619269 184.08901
## Income 8.1554907 143.48805
## Advertising 21.7979130 231.15250
## Population -0.2317539 107.04063
## Price 57.9510177 599.13072
## ShelveLoc 52.3995633 491.56403
## Age 21.6935478 250.48481
## Education 4.1393667 79.45348
## Urban -2.6527966 12.04230
## US 5.9685629 25.13044
With m = 5 the test MSE was 2.55, slightly higher than the 2.42
obtained with bagging (m = 10). For this data set, increasing the
number of variables considered at each split reduced the test error; a
sketch for comparing a wider range of mtry values follows below.
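To examine the effect of m more systematically, a short sketch
(illustrative; the exact MSE values depend on the seed, and the helper
names below are hypothetical) refits the forest over a grid of mtry
values and compares test MSEs:
# Sketch: test MSE as a function of mtry, the number of variables tried at each split.
mtry_vals <- 1:10
mtry_mse <- sapply(mtry_vals, function(m) {
  set.seed(62)
  fit <- randomForest(Sales ~ ., data = car_train, mtry = m)
  mean((car_test$Sales - predict(fit, car_test))^2)
})
plot(mtry_vals, mtry_mse, type = "b", xlab = "mtry", ylab = "Test MSE")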
(f) Now analyze the data using BART, and report your results.
x_train <- data.matrix(car_train[, -which(names(car_train) == "Sales")])
y_train <- car_train$Sales
x_test <- data.matrix(car_test[, -which(names(car_test) == "Sales")])
y_test <- car_test$Sales
set.seed(62)
car_bart <- gbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 281, 10, 119
## y1,yn: 1.984591, -1.575409
## x1,x[n*p]: 138.000000, 2.000000
## xp1,xp[np*p]: 113.000000, 2.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 68 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.287616,3,0.703589,7.51541
## *****sigma: 1.900530
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,10,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## trcnt,tecnt: 1000,1000
bart_pred <- car_bart$yhat.test.mean
mse_bart <- mean((y_test - bart_pred)^2)
mse_bart
## [1] 1.406111
Using BART, the test MSE was 1.41, which is the lowest among all
models, indicating the best prediction accuracy on the test set.
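BART has no importance() method, but the average number of times each
predictor is chosen for a split across the ensemble gives a rough
analogue. A minimal sketch using the varcount.mean component returned
by gbart():
# Sketch: average split counts per variable across the BART trees (rough importance).
ord <- order(car_bart$varcount.mean, decreasing = TRUE)
car_bart$varcount.mean[ord]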
9. This problem involves the OJ data set which is part of the ISLR2
package.
(a) Create a training set containing a random sample of 800
observations, and a test set containing the remaining observations.
set.seed(62)
n <- nrow(OJ)
oj_train_index <- sample(1:n, 800)
oj_train <- OJ[oj_train_index, ]
oj_test <- OJ[-oj_train_index, ]
(b) Fit a tree to the training data, with Purchase as the response
and the other variables as predictors. Use the summary() function to
produce summary statistics about the tree, and describe the results
obtained. What is the training error rate? How many terminal nodes does
the tree have?
oj_tree <- tree(Purchase ~ ., data = oj_train)
summary(oj_tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj_train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscCH"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7594 = 600.6 / 791
## Misclassification error rate: 0.1612 = 129 / 800
The training error rate is 0.1612 and the tree has 9 terminal nodes.
Only four predictors (LoyalCH, PriceDiff, ListPriceDiff, PctDiscCH) are
actually used in constructing the tree.
(c) Type in the name of the tree object in order to get a detailed
text output. Pick one of the terminal nodes, and interpret the
information displayed.
oj_tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.00 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.48285 297 326.70 MM ( 0.23906 0.76094 )
## 4) LoyalCH < 0.0616725 66 17.92 MM ( 0.03030 0.96970 ) *
## 5) LoyalCH > 0.0616725 231 281.70 MM ( 0.29870 0.70130 )
## 10) PriceDiff < 0.31 179 195.00 MM ( 0.23464 0.76536 )
## 20) PriceDiff < -0.34 19 0.00 MM ( 0.00000 1.00000 ) *
## 21) PriceDiff > -0.34 160 184.20 MM ( 0.26250 0.73750 ) *
## 11) PriceDiff > 0.31 52 72.01 CH ( 0.51923 0.48077 ) *
## 3) LoyalCH > 0.48285 503 460.20 CH ( 0.82903 0.17097 )
## 6) LoyalCH < 0.705699 204 258.30 CH ( 0.67157 0.32843 )
## 12) ListPriceDiff < 0.235 79 106.70 MM ( 0.40506 0.59494 )
## 24) PctDiscCH < 0.052007 66 82.56 MM ( 0.31818 0.68182 ) *
## 25) PctDiscCH > 0.052007 13 11.16 CH ( 0.84615 0.15385 ) *
## 13) ListPriceDiff > 0.235 125 109.90 CH ( 0.84000 0.16000 ) *
## 7) LoyalCH > 0.705699 299 141.50 CH ( 0.93645 0.06355 )
## 14) PriceDiff < -0.39 8 10.59 MM ( 0.37500 0.62500 ) *
## 15) PriceDiff > -0.39 291 112.30 CH ( 0.95189 0.04811 ) *
In node 15 there are 291 training observations with LoyalCH > 0.705699
and PriceDiff > -0.39. The predicted class is CH, and about 95% of the
observations in this node purchased CH, so the model is confident in
this prediction.
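As a quick check (a sketch, not part of the original output; node15 is
an illustrative name), the node can be reproduced by hand by filtering
the training data on the two conditions that define it:
# Sketch: observations falling in node 15 and their class proportions.
node15 <- subset(oj_train, LoyalCH > 0.705699 & PriceDiff > -0.39)
nrow(node15)
prop.table(table(node15$Purchase))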
(d) Create a plot of the tree, and interpret the results.
plot(oj_tree)
text(oj_tree)

Most of the splitting is driven by LoyalCH: the top splits are on brand
loyalty to Citrus Hill, and price-related variables such as PriceDiff
only enter within the loyalty branches.
(e) Predict the response on the test data, and produce a confusion
matrix comparing the test labels to the predicted test labels. What is
the test error rate?
oj_pred <- predict(oj_tree, newdata = oj_test, type = "class")
confusionMatrix(data = oj_pred, reference = oj_test$Purchase)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CH MM
## CH 130 19
## MM 35 86
##
## Accuracy : 0.8
## 95% CI : (0.7472, 0.846)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 2.106e-11
##
## Kappa : 0.5906
##
## Mcnemar's Test P-Value : 0.04123
##
## Sensitivity : 0.7879
## Specificity : 0.8190
## Pos Pred Value : 0.8725
## Neg Pred Value : 0.7107
## Prevalence : 0.6111
## Detection Rate : 0.4815
## Detection Prevalence : 0.5519
## Balanced Accuracy : 0.8035
##
## 'Positive' Class : CH
##
oj_error <- mean(oj_pred != oj_test$Purchase)
oj_error
## [1] 0.2
The test error rate is 0.2, consistent with the reported accuracy of 0.8.
(f) Apply the cv.tree() function to the training set in order to
determine the optimal tree size.
set.seed(62)
oj_cv <- cv.tree(oj_tree, FUN = prune.misclass)
oj_cv
## $size
## [1] 9 8 6 5 2 1
##
## $dev
## [1] 152 151 150 149 160 312
##
## $k
## [1] -Inf 0 1 2 8 155
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(g) Produce a plot with tree size on the x-axis and cross-validated
classification error rate on the y-axis.
plot(oj_cv$size, oj_cv$dev, type = "b")

(h) Which tree size corresponds to the lowest cross-validated
classification error rate?
oj_cv$size[which.min(oj_cv$dev)]
## [1] 5
The optimal tree size is 5, as it gives the lowest
cross-validated classification error.
(i) Produce a pruned tree corresponding to the optimal tree size
obtained using cross-validation. If cross-validation does not lead to
selection of a pruned tree, then create a pruned tree with five terminal
nodes.
oj_pruned <- prune.misclass(oj_tree, best = 5)
plot(oj_pruned)
text(oj_pruned, pretty = 0)

(j) Compare the training error rates between the pruned and unpruned
trees. Which is higher?
pred_unpruned <- predict(oj_tree, newdata = oj_train, type = "class")
pred_pruned <- predict(oj_pruned, newdata = oj_train, type = "class")
error_unpruned <- mean(pred_unpruned != oj_train$Purchase)
error_unpruned
## [1] 0.16125
The unpruned tree's training error rate is 0.16125.
error_pruned <- mean(pred_pruned != oj_train$Purchase)
error_pruned
## [1] 0.16625
The pruned tree's training error rate is 0.16625.
The training error rate for the unpruned tree was
0.16125, while the pruned tree had a
slightly higher error rate of 0.16625.
(k) Compare the test error rates between the pruned and unpruned
trees. Which is higher?
test_unpruned <- predict(oj_tree, newdata = oj_test, type = "class")
test_pruned <- predict(oj_pruned, newdata = oj_test, type = "class")
error_unpruned2 <- mean(test_unpruned != oj_test$Purchase)
error_unpruned2
## [1] 0.2
The test error rate for the unpruned tree is 0.2.
error_pruned2 <- mean(test_pruned != oj_test$Purchase)
error_pruned2
## [1] 0.1851852
The test error rate for the pruned tree is 0.1851852.
The test error rate for the unpruned tree was 0.2, while the pruned
tree had a slightly lower error rate of 0.1852. The pruned tree
therefore generalizes slightly better despite its marginally higher
training error, which is consistent with pruning reducing overfitting.
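For completeness, the confusion-matrix comparison from part (e) could
be repeated for the pruned tree (a sketch, not run above); its accuracy
should agree with 1 - 0.1852.
# Sketch: confusion matrix for the pruned tree on the test set.
confusionMatrix(data = test_pruned, reference = oj_test$Purchase)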