1. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of pˆm1. The x-axis should display pˆm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(tree)
## Warning: package 'tree' was built under R version 4.3.3
library(rpart)
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 4.3.2
p = seq(0, 1, 0.001)
gini.index = 2 * p * (1 - p)                          # Gini index for two classes
class.error = 1 - pmax(p, 1 - p)                      # classification error
cross.entropy = -(p * log(p) + (1 - p) * log(1 - p))  # entropy (NaN at p = 0 and p = 1)
matplot(p, cbind(gini.index, class.error, cross.entropy), type = "l", lty = 1,
        col = c("red", "blue", "darkgreen"), xlab = expression(hat(p)[m1]), ylab = "Value")
legend("top", legend = c("Gini index", "Classification error", "Entropy"),
       col = c("red", "blue", "darkgreen"), lty = 1)

  1. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
  1. Split the data set into a training set and a test set.
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats)/2)
strain = Carseats[train, ]
stest = Carseats[-train, ]
  1. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree.seats = tree(Sales ~ ., data = strain)
summary(tree.seats)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = strain)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "CompPrice"  
## [6] "US"         
## Number of terminal nodes:  18 
## Residual mean deviance:  2.167 = 394.3 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.88200 -0.88200 -0.08712  0.00000  0.89590  4.09900
plot(tree.seats)
text(tree.seats, pretty = 0)

treeseat.pred = predict(tree.seats, newdata = stest)
mean((treeseat.pred - stest$Sales)^2)
## [1] 4.922039

The test MSE is 4.922.

  1. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
set.seed(1)
cv.seats = cv.tree(tree.seats)
plot(cv.seats$size, cv.seats$dev, type = "b")

prune.car = prune.tree(tree.seats, best = 10)
plot(prune.car)
text(prune.car,pretty=0)

treeseat.pred = predict(prune.car, newdata = stest)
mean((treeseat.pred - stest$Sales)^2)
## [1] 4.918134

Yes. Pruning decreased the test MSE slightly, from 4.922 to 4.918.
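
The optimal size can also be read programmatically from the cv.tree() output instead of off the plot; a minimal sketch using the cv.seats object fit above:

# Size with the smallest cross-validated deviance
best.size = cv.seats$size[which.min(cv.seats$dev)]
prune.best = prune.tree(tree.seats, best = best.size)
mean((predict(prune.best, newdata = stest) - stest$Sales)^2)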

  1. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() values to determine which variables are most important.
set.seed(1)
bag.seats = randomForest(Sales~., data = strain, mtry = 10, ntree = 551, importance = TRUE)
bagseat.pred = predict(bag.seats, newdata = stest)
mean((bagseat.pred - stest$Sales)^2)
## [1] 2.599099
importance(bag.seats)
##                 %IncMSE IncNodePurity
## CompPrice   26.18616309    170.781666
## Income       5.25063979     90.717958
## Advertising 13.25673204     97.498810
## Population  -2.14346969     58.289311
## Price       60.58241525    503.478806
## ShelveLoc   50.77308639    380.258594
## Age         19.03720001    158.282846
## Education    1.24264920     44.834257
## Urban       -0.08461165      9.883299
## US           4.71515903     17.907727

The test MSE from bagging is 2.599. The most important variables are Price and ShelveLoc.
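
The same ranking is easier to see graphically; varImpPlot() from the randomForest package plots both the %IncMSE and IncNodePurity measures for the bagged model:

varImpPlot(bag.seats)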

  1. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() values to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
set.seed(1)
rando.seats = randomForest(Sales~., data = strain, mtry = 10, importance = TRUE)
randseat.pred = predict(rando.seats, newdata = stest)
mean((randseat.pred - stest$Sales)^2)
## [1] 2.605253
importance(rando.seats)
##                %IncMSE IncNodePurity
## CompPrice   24.8888481    170.182937
## Income       4.7121131     91.264880
## Advertising 12.7692401     97.164338
## Population  -1.8074075     58.244596
## Price       56.3326252    502.903407
## ShelveLoc   48.8886689    380.032715
## Age         17.7275460    157.846774
## Education    0.5962186     44.598731
## Urban        0.1728373      9.822082
## US           4.2172102     18.073863

The test MSE is 2.605, and Price and ShelveLoc are still the most important variables. Note that mtry = 10 considers all ten predictors at every split, so this fit is effectively bagging again; decreasing m decorrelates the trees, which usually lowers the test error when a few strong predictors dominate, though a very small m can increase bias. The sketch below refits the forest over a range of m values to show this.
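
To see the effect of m directly, one can refit the forest over a grid of mtry values and compare test MSEs; a minimal sketch using the same training/test split as above:

# Test MSE as a function of m (mtry)
set.seed(1)
mse.by.m = sapply(1:10, function(m) {
  rf = randomForest(Sales ~ ., data = strain, mtry = m)
  mean((predict(rf, newdata = stest) - stest$Sales)^2)
})
plot(1:10, mse.by.m, type = "b", xlab = "m (mtry)", ylab = "Test MSE")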

  1. Now analyze the data using BART, and report your results.
library(BART)
## Warning: package 'BART' was built under R version 4.3.3
## Loading required package: nlme
## Loading required package: nnet
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.3.3
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
# x = Carseats[, 2:11]       # predictors (Sales is column 1)
# y = Carseats[, "Sales"]    # response
# xtrain = x[train, ]
# ytrain = y[train]
# xtest = x[-train, ]        # index with -train (stest is a data frame, not an index vector)
# ytest = y[-train]
# set.seed(1)
# bartfit = gbart(xtrain, ytrain, x.test = xtest)
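
A minimal sketch for reporting results, assuming the commented-out gbart() fit above is run: gbart() stores the posterior mean prediction for each test observation in yhat.test.mean, from which a test MSE can be computed and compared with the tree, bagging, and random-forest fits.

# yhat.bart = bartfit$yhat.test.mean   # posterior mean predictions on the test set
# mean((ytest - yhat.bart)^2)          # BART test MSE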
  1. This problem involves the OJ data set, which is part of the ISLR package.
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)
train = sample(1:nrow(OJ), 800)
OJtrain = OJ[train, ]
OJtest = OJ[-train, ]
  1. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. What is the training error rate?
tree.OJ = tree(Purchase ~ ., data = OJtrain)
summary(tree.OJ)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJtrain)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
plot(tree.OJ)
text(tree.OJ, pretty = 0)

The tree uses 5 variables, has 9 terminal nodes, and a training error rate of 0.1588.

  1. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.OJ
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1073.00 CH ( 0.60625 0.39375 )  
##    2) LoyalCH < 0.5036 365  441.60 MM ( 0.29315 0.70685 )  
##      4) LoyalCH < 0.280875 177  140.50 MM ( 0.13559 0.86441 )  
##        8) LoyalCH < 0.0356415 59   10.14 MM ( 0.01695 0.98305 ) *
##        9) LoyalCH > 0.0356415 118  116.40 MM ( 0.19492 0.80508 ) *
##      5) LoyalCH > 0.280875 188  258.00 MM ( 0.44149 0.55851 )  
##       10) PriceDiff < 0.05 79   84.79 MM ( 0.22785 0.77215 )  
##         20) SpecialCH < 0.5 64   51.98 MM ( 0.14062 0.85938 ) *
##         21) SpecialCH > 0.5 15   20.19 CH ( 0.60000 0.40000 ) *
##       11) PriceDiff > 0.05 109  147.00 CH ( 0.59633 0.40367 ) *
##    3) LoyalCH > 0.5036 435  337.90 CH ( 0.86897 0.13103 )  
##      6) LoyalCH < 0.764572 174  201.00 CH ( 0.73563 0.26437 )  
##       12) ListPriceDiff < 0.235 72   99.81 MM ( 0.50000 0.50000 )  
##         24) PctDiscMM < 0.196196 55   73.14 CH ( 0.61818 0.38182 ) *
##         25) PctDiscMM > 0.196196 17   12.32 MM ( 0.11765 0.88235 ) *
##       13) ListPriceDiff > 0.235 102   65.43 CH ( 0.90196 0.09804 ) *
##      7) LoyalCH > 0.764572 261   91.20 CH ( 0.95785 0.04215 ) *

Terminal node 7 corresponds to the branch LoyalCH > 0.764572 and contains 261 observations; about 95.8% of them take the value CH and 4.2% take MM, so the node predicts CH.

  1. Create a plot of the tree, and interpret the results.
plot(tree.OJ)
text(tree.OJ, pretty = 0)

LoyalCH dominates the top splits and is clearly the most important variable; PriceDiff, SpecialCH, ListPriceDiff, and PctDiscMM appear only in the lower splits.

  1. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
treeOJ.pred = predict(tree.OJ, newdata = OJtest, type = "class")
table(treeOJ.pred, OJtest$Purchase)
##            
## treeOJ.pred  CH  MM
##          CH 160  38
##          MM   8  64

The test error rate is (38 + 8) / 270 ≈ 0.1704; the quantity (160 + 64) / 270 = 0.8296 is the accuracy, not the error rate.
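
Equivalently, the test error rate can be computed directly from the predicted and true labels:

mean(treeOJ.pred != OJtest$Purchase)   # (38 + 8) / 270, about 0.1704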

  1. Use cross-validation on the training set in order to determine the optimal tree size.
OJcv = cv.tree(tree.OJ, FUN = prune.misclass)
OJcv
## $size
## [1] 9 8 7 4 2 1
## 
## $dev
## [1] 150 150 149 158 172 315
## 
## $k
## [1]       -Inf   0.000000   3.000000   4.333333  10.500000 151.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
  1. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(OJcv$size, OJcv$dev, type = "b")

  1. Which tree size corresponds to the lowest cross-validated classification error rate?

Size 7, with a cross-validated deviance (misclassification count) of 149.

  2. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

prune.OJ=prune.tree(tree.OJ,best=7)
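
The pruned seven-node tree can be displayed the same way as the unpruned one; a quick sketch:

plot(prune.OJ)
text(prune.OJ, pretty = 0)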
  1. Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(tree.OJ)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJtrain)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## [5] "PctDiscMM"    
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7432 = 587.8 / 791 
## Misclassification error rate: 0.1588 = 127 / 800
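
The pruned tree's training error is not shown above. Because pruning only merges terminal regions, its training error rate is at least as high as the unpruned tree's 0.1588, so the pruned tree has the higher training error rate. A quick sketch to confirm this on the fits above:

summary(prune.OJ)
mean(predict(prune.OJ, newdata = OJtrain, type = "class") != OJtrain$Purchase)   # pruned training error rate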
  1. Compare the test error rates between the pruned and unpruned trees. Which is higher?
treeOJ.pred = predict(tree.OJ, newdata = OJtest, type = "class")
table(treeOJ.pred, OJtest$Purchase)
##            
## treeOJ.pred  CH  MM
##          CH 160  38
##          MM   8  64
unprunedOJvalerr = (38 + 8) / 270
unprunedOJvalerr
## [1] 0.1703704
pruneOJ.pred = predict(prune.OJ, newdata = OJtest, type = "class")
table(pruneOJ.pred, OJtest$Purchase)
##             
## pruneOJ.pred  CH  MM
##           CH 160  36
##           MM   8  66
prunedOJvalerr = (36 + 8) / 270
prunedOJvalerr
## [1] 0.162963

The pruned tree's test error rate is 0.163 and the unpruned tree's is 0.170, so the unpruned tree has the higher test error rate.