library(ISLR)
library(tree)
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
library(BART)
## Loading required package: nlme
## Loading required package: survival
library(caret)       
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(MASS)        

(3) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

A plot was created to show how the Gini index, classification error, and entropy vary as \(\hat{p}_{m1}\) increases from 0 to 1. All three measures are maximized when the two classes are equally likely (\(\hat{p}_{m1} = 0.5\)): the Gini index and classification error both peak at 0.5, while entropy peaks at 1. Entropy and the Gini index are smooth and more sensitive to changes in node purity near this point, whereas classification error is piecewise linear and the least sensitive. This difference is why the Gini index and entropy are typically preferred when growing a tree, while classification error is often used for pruning.
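
For reference, in the two-class setting the three plotted quantities reduce to the following (entropy is written with base-2 logarithms to match the code below):

\[
G = 2\hat{p}_{m1}(1 - \hat{p}_{m1}), \qquad
E = 1 - \max(\hat{p}_{m1},\, 1 - \hat{p}_{m1}), \qquad
D = -\hat{p}_{m1}\log_2\hat{p}_{m1} - (1 - \hat{p}_{m1})\log_2(1 - \hat{p}_{m1}).
\]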

p <- seq(0, 1, by = 0.01)

# Impurity measures for a two-class node as a function of p = p_hat_m1
gini <- 2 * p * (1 - p)
class_error <- 1 - pmax(p, 1 - p)
entropy <- -p * log2(p) - (1 - p) * log2(1 - p)
entropy[is.nan(entropy)] <- 0  # 0 * log2(0) evaluates to NaN; its limit is 0

plot(p, gini, type = "l", col = "blue", ylim = c(0, 1), ylab = "Impurity Measure", xlab = expression(hat(p)[m1]), lwd = 2)
lines(p, class_error, col = "red", lwd = 2)
lines(p, entropy, col = "darkgreen", lwd = 2)
legend("top", legend = c("Gini", "Classification Error", "Entropy"), 
       col = c("blue", "red", "darkgreen"), lty = 1, lwd = 2)

(8) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

  1. Split the data set into a training set and a test set.

    The Carseats data was split into a training set (70%) and a test set (30%) using caret's createDataPartition(), which samples within groups of the response so both sets cover the full range of Sales.

set.seed(1)
train_index <- createDataPartition(Carseats$Sales, p = 0.7, list = FALSE)
train <- Carseats[train_index, ]
test <- Carseats[-train_index, ]
  2. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

    A regression tree was fitted to predict Sales. The tree used six variables and produced 18 terminal nodes. The test mean squared error (MSE) from this model was 4.47.

tree_carseats <- tree(Sales ~ ., data = train)
summary(tree_carseats)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Income"      "Advertising" "CompPrice"  
## [6] "Age"        
## Number of terminal nodes:  18 
## Residual mean deviance:  2.22 = 583.8 / 263 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.64200 -0.96000  0.02174  0.00000  0.96000  3.40700
plot(tree_carseats)
text(tree_carseats, pretty = 0)

pred_tree <- predict(tree_carseats, newdata = test)
mse_tree <- mean((pred_tree - test$Sales)^2)
mse_tree
## [1] 4.469011
  3. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

    Cross-validation was used to choose the tree size. The CV deviance was minimized at the full 18-node tree, so the "pruned" tree is identical to the original and the test MSE remained 4.47; pruning did not improve performance here.

set.seed(2)
cv_tree <- cv.tree(tree_carseats, FUN = prune.tree)
plot(cv_tree$size, cv_tree$dev, type = "b")

pruned_tree <- prune.tree(tree_carseats, best = cv_tree$size[which.min(cv_tree$dev)])
plot(pruned_tree)
text(pruned_tree, pretty = 5)

pred_pruned <- predict(pruned_tree, newdata = test)
mse_pruned <- mean((pred_pruned - test$Sales)^2)
mse_pruned
## [1] 4.469011
  4. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

    The bagging method was applied using all ten predictors at each split. The test MSE dropped to 2.95, a clear improvement over the single tree's 4.47. By %IncMSE, ShelveLoc and Price were by far the most important variables, followed by CompPrice and Advertising.

set.seed(1)
# Bagging is a random forest with mtry equal to the full number of predictors (10)
bag_carseats <- randomForest(Sales ~ ., data = train, mtry = ncol(train) - 1, importance = TRUE)
pred_bag <- predict(bag_carseats, newdata = test)
mse_bag <- mean((pred_bag - test$Sales)^2)
mse_bag
## [1] 2.95016
importance(bag_carseats)
##                %IncMSE IncNodePurity
## CompPrice   29.3462503    184.444145
## Income       2.1950934     84.971443
## Advertising 28.4461920    226.470679
## Population  -0.7596617     67.714326
## Price       68.7814941    651.996671
## ShelveLoc   71.7824230    628.164064
## Age         21.3597782    166.657371
## Education    1.4885394     55.640049
## Urban       -3.4002675      8.580206
## US           5.0762780     12.380845
varImpPlot(bag_carseats)
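
A small optional sketch, sorting the permutation importance so the ranking above is explicit (with importance = TRUE, importance() returns a matrix containing a %IncMSE column):

# Rank predictors by the permutation-based importance measure
imp <- importance(bag_carseats)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]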

  5. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

    A random forest model was built with mtry = 4, giving a test MSE of 3.50. The most important variables remained ShelveLoc and Price. Sweeping mtry from 1 to 10 (plot below) showed the test MSE generally decreasing as m grows: with very small m, strong predictors such as ShelveLoc and Price are often excluded from the candidate splits, and on this data the best performance occurred near the bagging end (mtry = 10 gave 2.95, versus 3.50 at mtry = 4).

set.seed(1)
rf_carseats <- randomForest(Sales ~ ., data = train, mtry = 4, importance = TRUE)
pred_rf <- predict(rf_carseats, newdata = test)
mse_rf <- mean((pred_rf - test$Sales)^2)
mse_rf
## [1] 3.500971
importance(rf_carseats)
##                %IncMSE IncNodePurity
## CompPrice   16.8566344     180.61779
## Income       1.6173049     119.33836
## Advertising 22.7107139     230.95192
## Population   0.1108546     107.96778
## Price       50.2654195     571.56201
## ShelveLoc   54.7490099     534.09345
## Age         16.3214308     194.92518
## Education   -0.5807710      79.17980
## Urban       -0.5769700      15.64903
## US           6.0425707      23.57621
varImpPlot(rf_carseats)

# Sweep mtry from 1 to 10 and record the test MSE at each value
set.seed(1)
errors <- sapply(1:10, function(m) {
  rf_tmp <- randomForest(Sales ~ ., data = train, mtry = m)
  pred_tmp <- predict(rf_tmp, newdata = test)
  mean((pred_tmp - test$Sales)^2)
})
plot(1:10, errors, type = "b", xlab = "mtry", ylab = "Test MSE")
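
As an aside, the same sweep can be done without touching the test set by using the out-of-bag (OOB) error that bagged models provide for free; a minimal sketch, reusing train from above (for regression, the fitted object's mse component holds the running OOB MSE, so its last element is the estimate for the full forest):

# Tune mtry on out-of-bag error instead of the held-out test set
set.seed(1)
oob_mse <- sapply(1:10, function(m) {
  rf_tmp <- randomForest(Sales ~ ., data = train, mtry = m)
  tail(rf_tmp$mse, 1)  # OOB MSE after all 500 trees
})
plot(1:10, oob_mse, type = "b", xlab = "mtry", ylab = "OOB MSE")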

  6. Now analyze the data using BART, and report your results.

    BART was used to predict Sales, and it gave the lowest test MSE of 1.73, outperforming all other models in this task.

# wbart() takes the predictors and the response as separate objects
x_train <- train[, -which(names(train) == "Sales")]
x_test <- test[, -which(names(test) == "Sales")]
y_train <- train$Sales


set.seed(1)
bart_fit <- wbart(x.train = x_train, y.train = y_train, x.test = x_test)
## *****Into main of wbart
## *****Data:
## data:n,p,np: 281, 14, 119
## y1,yn: 1.974733, 2.184733
## x1,x[n*p]: 138.000000, 1.000000
## xp1,xp[np*p]: 122.000000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 64 ... 1
## *****burn and ndpost: 100, 1000
## *****Prior:beta,alpha,tau,nu,lambda: 2.000000,0.950000,0.287616,3.000000,0.182475
## *****sigma: 0.967869
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,14,0
## *****nkeeptrain,nkeeptest,nkeeptestme,nkeeptreedraws: 1000,1000,1000,1000
## *****printevery: 100
## *****skiptr,skipte,skipteme,skiptreedraws: 1,1,1,1
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 3s
## check counts
## trcnt,tecnt,temecnt,treedrawscnt: 1000,1000,1000,1000
pred_bart <- colMeans(bart_fit$yhat.test)
mse_bart <- mean((pred_bart - test$Sales)^2)
mse_bart
## [1] 1.73499
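
To pull the Carseats results together, a short sketch collecting the test MSEs computed above (it assumes the mse_* objects from the earlier chunks are still in the workspace; note that wbart also returns the posterior mean predictions directly as bart_fit$yhat.test.mean):

# Test MSE for every Carseats model fitted above, best model first
results <- data.frame(
  model = c("Single tree", "Pruned tree", "Bagging", "Random forest", "BART"),
  test_mse = c(mse_tree, mse_pruned, mse_bag, mse_rf, mse_bart)
)
results[order(results$test_mse), ]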

(9) This problem involves the OJ data set, which is part of the ISLR package.

data(OJ)
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

    The OJ dataset was split into 800 observations for training and 270 for testing.

set.seed(1)
train_index <- sample(1:nrow(OJ), 800)
train <- OJ[train_index, ]
test <- OJ[-train_index, ]
  2. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

    A classification tree was built to predict Purchase. It had 9 terminal nodes and a training error rate of 15.88%. LoyalCH was the most influential variable in the tree.

    tree_oj <- tree(Purchase ~ ., data = train)
    summary(tree_oj)
    ## 
    ## Classification tree:
    ## tree(formula = Purchase ~ ., data = train)
    ## Variables actually used in tree construction:
    ## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
    ## [5] "PctDiscMM"    
    ## Number of terminal nodes:  9 
    ## Residual mean deviance:  0.7432 = 587.8 / 791 
    ## Misclassification error rate: 0.1588 = 127 / 800
  3. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

    For example, terminal node 8 splits off observations with LoyalCH < 0.0356: it contains 59 observations, 98.3% of which purchased MM, so the tree predicts MM there with high confidence (node deviance of only 10.14).

    tree_oj
    ## node), split, n, deviance, yval, (yprob)
    ##       * denotes terminal node
    ## 
    ##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
    ##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
    ##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
    ##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
    ##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
    ##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
    ##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
    ##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
    ##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
    ##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
    ##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
    ##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
    ##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
    ##         24) PctDiscMM < 0.196196 55   73.14 CH ( 0.61818 0.38182 ) *
    ##         25) PctDiscMM > 0.196196 17   12.32 MM ( 0.11765 0.88235 ) *
    ##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
    ##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *
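
    For reference, the node deviance printed above is \(-2\sum_k n_{mk}\log\hat{p}_{mk}\), where \(n_{mk}\) is the count of class \(k\) in node \(m\). A quick check for terminal node 8 (1 CH and 58 MM purchases out of 59):

    # Deviance of node 8: counts (CH, MM) = (1, 58)
    -2 * sum(c(1, 58) * log(c(1/59, 58/59)))  # approximately 10.14, matching the printout
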
  4. Create a plot of the tree, and interpret the results.

    A plot of the tree showed that LoyalCH dominates the top splits, with PriceDiff, SpecialCH, ListPriceDiff, and PctDiscMM appearing only in deeper splits.

    plot(tree_oj)
    text(tree_oj, pretty = 5)

  5. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

    Using the test set, the classification tree had a test error rate of 17.04%, based on a confusion matrix comparing predicted and actual values.

    pred_test <- predict(tree_oj, test, type = "class")
    confusion <- table(Predicted = pred_test, Actual = test$Purchase)
    confusion
    ##          Actual
    ## Predicted  CH  MM
    ##        CH 160  38
    ##        MM   8  64
    test_error <- 1 - sum(diag(confusion)) / sum(confusion)
    test_error
    ## [1] 0.1703704
  6. Apply the cv.tree() function to the training set in order to determine the optimal tree size.

    Cross-validation suggested the optimal tree size was 9, the same as the full tree, so no pruning was required.

    set.seed(2)
    cv_oj <- cv.tree(tree_oj, FUN = prune.misclass)
  7. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

    A plot of tree size versus cross-validated classification error showed the lowest error at tree size 9.

    # cv_oj$dev counts misclassifications, so divide by n to plot an error rate
    plot(cv_oj$size, cv_oj$dev / nrow(train), type = "b",
         xlab = "Tree Size", ylab = "CV Error Rate")

  8. Which tree size corresponds to the lowest cross-validated classification error rate?

    The lowest cross-validation error occurred with a tree size of 9, which matched the original unpruned tree.

    min_dev_index <- which.min(cv_oj$dev)
    optimal_size <- cv_oj$size[min_dev_index]
    optimal_size
    ## [1] 9
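
    One caveat: cv_oj$dev can be tied across several sizes, and because cv.tree() lists sizes in decreasing order, which.min() then returns the largest tied size. A hedged one-liner that prefers the smallest tree among the minimizers:

    # Smallest tree size that achieves the minimum CV error (robust to ties)
    min(cv_oj$size[cv_oj$dev == min(cv_oj$dev)])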

  9. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

    To compare performance, the tree was manually pruned to five terminal nodes, even though cross-validation did not suggest pruning.

# CV selected the full 9-node tree, so prune to five terminal nodes as the exercise instructs
pruned_tree <- prune.misclass(tree_oj, best = 5)


plot(pruned_tree)
text(pruned_tree, pretty = 0)

  10. Compare the training error rates between the pruned and unpruned trees. Which is higher?

    The unpruned tree had a training error rate of 15.88%, while the pruned tree had a slightly higher rate of 16.25%.

train_pred_unpruned <- predict(tree_oj, train, type = "class")
train_err_unpruned <- mean(train_pred_unpruned != train$Purchase)


train_pred_pruned <- predict(pruned_tree, train, type = "class")
train_err_pruned <- mean(train_pred_pruned != train$Purchase)

train_err_unpruned
## [1] 0.15875
train_err_pruned
## [1] 0.1625
  11. Compare the test error rates between the pruned and unpruned trees. Which is higher?

    The pruned tree had a test error rate of 16.30%, which was slightly lower than the unpruned tree's test error of 17.04%. This suggests the pruned tree generalized slightly better.

test_pred_unpruned <- predict(tree_oj, test, type = "class")
test_err_unpruned <- mean(test_pred_unpruned != test$Purchase)


test_pred_pruned <- predict(pruned_tree, test, type = "class")
test_err_pruned <- mean(test_pred_pruned != test$Purchase)

test_err_unpruned
## [1] 0.1703704
test_err_pruned
## [1] 0.162963
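
Finally, a short sketch gathering the four error rates computed above into one table (it assumes the *_err_* objects from the two previous chunks are still in the workspace):

# Training and test error rates, unpruned vs. pruned OJ tree
data.frame(
  tree = c("Unpruned (9 nodes)", "Pruned (5 nodes)"),
  train_error = c(train_err_unpruned, train_err_pruned),
  test_error = c(test_err_unpruned, test_err_pruned)
)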