Assignment 7

Question 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.

p=seq(0,1,0.0001)
#Gini
G=2*p*(1-p)
#Classification Error
E=1-pmax(p,1-p)
#Entropy (0*log(0) evaluates to NaN in R, so set the endpoints to their limiting value 0)
D=-(p*log(p) + (1-p)*log(1-p))
D[is.nan(D)]=0

plot(p,D, type="l", col="red",ylab="")
lines(p,E,col='green')
lines(p,G,col='blue')
legend(0.3,0.15,c("Entropy", "Misclassification","Gini"),lty=c(1,1,1),lwd=c(2.5,2.5,2.5),col=c('red','green','blue'))
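
As a quick numerical check on the plot, the sketch below evaluates the three measures at a few values of p. The function names `gini`, `miscls`, and `entropy` are illustrative, and the entropy uses the convention 0 * log(0) = 0 so that the endpoints are defined. All three measures are zero at p = 0 and p = 1 and peak at p = 0.5, where the Gini index and classification error equal 0.5 and the entropy equals log(2).

```r
# Two-class impurity measures as functions of p, the proportion in class 1
gini   = function(p) 2 * p * (1 - p)
miscls = function(p) 1 - pmax(p, 1 - p)
entropy = function(p) {
  # Convention: 0 * log(0) = 0, so the measure is defined at p = 0 and p = 1
  term = function(q) ifelse(q > 0, q * log(q), 0)
  -(term(p) + term(1 - p))
}

p.check = c(0, 0.25, 0.5, 0.75, 1)
impurity = data.frame(p = p.check,
                      gini = gini(p.check),
                      error = miscls(p.check),
                      entropy = entropy(p.check))
print(impurity)
```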

Question 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

library(ISLR2)

attach(Carseats)
set.seed(96)

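# Hold out one third of the rows as the test set; the remaining rows form the training set
test.idx = sample(nrow(Carseats), floor(nrow(Carseats)/3))
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]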

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

library(tree)

tree.carseats = tree(Sales ~ ., data = Carseats.train)
summary(tree.carseats)


Regression tree:
tree(formula = Sales ~ ., data = Carseats.train)
Variables actually used in tree construction:
[1] "ShelveLoc"   "Price"       "Age"         "Income"      "CompPrice"  
[6] "Education"   "Advertising"
Number of terminal nodes:  19 
Residual mean deviance:  2.408 = 597.1 / 248 
Distribution of residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-3.87000 -1.10000  0.06133  0.00000  1.05900  3.93500

plot(tree.carseats)
text(tree.carseats, pretty = 0)

pred.carseats = predict(tree.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.carseats)^2)

[1] 4.106416

The test MSE is roughly 4.11.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

cv.carseats = cv.tree(tree.carseats, FUN = prune.tree)
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")

Best size = 13.
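
The size above was read off the cross-validation plot. As a minimal sketch, it can also be extracted from the `cv.tree` result by taking the size with the smallest cross-validated deviance. The block refits the tree so that it runs on its own; the name `best.size` is illustrative, and because `cv.tree` assigns folds at random the selected size can change with the seed.

```r
library(ISLR2)  # Carseats data
library(tree)

set.seed(96)  # same seed as above; results depend on it
test.idx = sample(nrow(Carseats), floor(nrow(Carseats) / 3))
Carseats.train = Carseats[-test.idx, ]

tree.carseats = tree(Sales ~ ., data = Carseats.train)
cv.carseats = cv.tree(tree.carseats, FUN = prune.tree)

# Number of terminal nodes with the smallest cross-validated deviance;
# which.min() returns the first minimum if several sizes tie
best.size = cv.carseats$size[which.min(cv.carseats$dev)]
best.size
```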

pruned.carseats = prune.tree(tree.carseats, best = 13)
par(mfrow = c(1, 1))
plot(pruned.carseats)
text(pruned.carseats, pretty = 0)

pred.pruned = predict(pruned.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.pruned)^2)

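[1] 4.316175

Pruning to 13 terminal nodes gives a test MSE of about 4.32, slightly higher than the 4.11 obtained with the unpruned tree, so pruning does not improve the test MSE here.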

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the feature_importance_ values to determine which variables are most important.

library(randomForest)

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.
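
No bagging code appears for this part. One way to approach it, as a sketch rather than the author's solution: bagging is a random forest in which every predictor is a candidate at each split, so `randomForest()` can be called with `mtry` set to the number of predictors. The seed, `ntree = 500`, and the object names are assumptions; in R, `importance()` from the randomForest package plays the role of the feature importance values mentioned in the question.

```r
library(ISLR2)          # Carseats data
library(randomForest)

set.seed(96)
test.idx = sample(nrow(Carseats), floor(nrow(Carseats) / 3))
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]

# Bagging: all predictors (every column except Sales) are tried at each split
n.pred = ncol(Carseats) - 1
bag.carseats = randomForest(Sales ~ ., data = Carseats.train,
                            mtry = n.pred, ntree = 500, importance = TRUE)

# Test MSE of the bagged ensemble
pred.bag = predict(bag.carseats, newdata = Carseats.test)
mse.bag = mean((Carseats.test$Sales - pred.bag)^2)
mse.bag

# One row per predictor, with columns %IncMSE and IncNodePurity
importance(bag.carseats)
```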

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the feature_importance_ values to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

1 + 1

[1] 2
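
The code above is a placeholder. A possible sketch for this part fits one random forest per value of m (the `mtry` argument) and records the test MSE for each, which shows how the error changes as more predictors are considered at each split. The seed, `ntree = 500`, and all object names are assumptions, and the resulting numbers will vary with the seed.

```r
library(ISLR2)          # Carseats data
library(randomForest)

set.seed(96)
test.idx = sample(nrow(Carseats), floor(nrow(Carseats) / 3))
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]

# Try every m from 1 up to the number of predictors (m = all predictors is bagging)
m.values = 1:(ncol(Carseats) - 1)
test.mse = sapply(m.values, function(m) {
  rf = randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  mean((Carseats.test$Sales - predict(rf, newdata = Carseats.test))^2)
})
data.frame(m = m.values, test.mse = test.mse)

# Variable importance for a forest with the default mtry
rf.carseats = randomForest(Sales ~ ., data = Carseats.train, importance = TRUE)
importance(rf.carseats)
```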

(f) Now analyze the data using BART, and report your results.

1 + 1

[1] 2
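
The code above is a placeholder. As a hedged sketch, BART can be fit with `gbart()` from the BART package, assuming that package is installed; it takes the predictors and the response separately rather than a formula. The seed and object names are assumptions, and no results are reported here because they depend on the random split and the MCMC draws.

```r
library(ISLR2)  # Carseats data
library(BART)

set.seed(96)
test.idx = sample(nrow(Carseats), floor(nrow(Carseats) / 3))
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]

# gbart() expects predictors and response as separate objects
is.pred = names(Carseats) != "Sales"
x.train = Carseats.train[, is.pred]
y.train = Carseats.train$Sales
x.test = Carseats.test[, is.pred]

bart.carseats = gbart(x.train, y.train, x.test = x.test)

# Posterior mean prediction for each test observation, then the test MSE
pred.bart = bart.carseats$yhat.test.mean
mse.bart = mean((Carseats.test$Sales - pred.bart)^2)
mse.bart

# Average number of times each variable was used in the trees
sort(bart.carseats$varcount.mean, decreasing = TRUE)
```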

Question 9

This problem involves the OJ data set which is part of the ISLP package.

Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

[1] 4

Fit a tree to the training data, with Purchase as the response and the other variables as predictors. What is the training error rate?

[1] 4
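
The outputs above are placeholders. A minimal sketch of the split and the tree fit, assuming the `OJ` data from the ISLR2 package and the tree package used earlier in this assignment; the seed and object names are arbitrary choices. `summary()` reports the misclassification rate on the training data, and the same quantity is computed directly at the end.

```r
library(ISLR2)  # OJ data
library(tree)

set.seed(1)  # arbitrary seed, chosen only for reproducibility
train.idx = sample(nrow(OJ), 800)
OJ.train = OJ[train.idx, ]
OJ.test = OJ[-train.idx, ]

# Classification tree with Purchase as the response and all other variables as predictors
tree.oj = tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)

# Training error rate: share of training observations that are misclassified
pred.train = predict(tree.oj, OJ.train, type = "class")
train.err = mean(pred.train != OJ.train$Purchase)
train.err
```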

Create a plot of the tree, and interpret the results. How many terminal nodes does the tree have?

[1] 4

Use the export_tree() function to produce a text summary of the fitted tree. Pick one of the terminal nodes, and interpret the information displayed.

[1] 4

Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

[1] 4
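
The output above is a placeholder. One way to obtain the confusion matrix and test error rate, sketched under the same assumptions as before (OJ from ISLR2, an arbitrary seed, illustrative object names): predict class labels on the test set, cross-tabulate them against the true labels, and take one minus the share on the diagonal.

```r
library(ISLR2)  # OJ data
library(tree)

set.seed(1)  # arbitrary seed
train.idx = sample(nrow(OJ), 800)
OJ.train = OJ[train.idx, ]
OJ.test = OJ[-train.idx, ]

tree.oj = tree(Purchase ~ ., data = OJ.train)

# Rows are predicted labels, columns are the true test labels
pred.test = predict(tree.oj, OJ.test, type = "class")
conf.mat = table(predicted = pred.test, actual = OJ.test$Purchase)
conf.mat

# Off-diagonal cells are the misclassified observations
test.err = 1 - sum(diag(conf.mat)) / sum(conf.mat)
test.err
```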

Use cross-validation on the training set in order to determine the optimal tree size.

[1] 4

Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

[1] 4

Which tree size corresponds to the lowest cross-validated classification error rate?

[1] 4
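
The outputs above are placeholders. A sketch of the cross-validation steps, assuming `cv.tree()` with `FUN = prune.misclass` so that pruning is guided by misclassification: in that case `dev` holds the number of cross-validated misclassifications, so dividing by the training-set size gives an error rate for the plot. The seed and object names are assumptions, and the selected size depends on the random folds.

```r
library(ISLR2)  # OJ data
library(tree)

set.seed(1)  # arbitrary seed
train.idx = sample(nrow(OJ), 800)
OJ.train = OJ[train.idx, ]

tree.oj = tree(Purchase ~ ., data = OJ.train)
cv.oj = cv.tree(tree.oj, FUN = prune.misclass)

# Cross-validated classification error rate for each candidate tree size
cv.err = cv.oj$dev / nrow(OJ.train)
plot(cv.oj$size, cv.err, type = "b",
     xlab = "Tree size", ylab = "CV classification error rate")

# Size with the lowest cross-validated error (first minimum if there are ties)
best.size = cv.oj$size[which.min(cv.oj$dev)]
best.size
```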

Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

[1] 4

Compare the training error rates between the pruned and unpruned trees. Which is higher?

[1] 4

Compare the test error rates between the pruned and unpruned trees. Which is higher?

[1] 4
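
The outputs above are placeholders. The pruning and comparison steps can be sketched as follows, with the same assumptions as before (OJ from ISLR2, an arbitrary seed, illustrative names such as `err.rate`). If cross-validation selects the full tree, the sketch falls back to five terminal nodes as the question instructs. Since a pruned tree is a subtree of the full tree, its training error cannot be lower than that of the unpruned tree; the test error can go either way.

```r
library(ISLR2)  # OJ data
library(tree)

set.seed(1)  # arbitrary seed
train.idx = sample(nrow(OJ), 800)
OJ.train = OJ[train.idx, ]
OJ.test = OJ[-train.idx, ]

tree.oj = tree(Purchase ~ ., data = OJ.train)
cv.oj = cv.tree(tree.oj, FUN = prune.misclass)
best.size = cv.oj$size[which.min(cv.oj$dev)]

# Fall back to five terminal nodes if CV does not select a smaller tree
n.leaves = sum(tree.oj$frame$var == "<leaf>")
if (best.size >= n.leaves) best.size = 5
pruned.oj = prune.misclass(tree.oj, best = best.size)

# Helper: misclassification rate of a fitted tree on a data set
err.rate = function(fit, data) {
  mean(predict(fit, data, type = "class") != data$Purchase)
}

errors = data.frame(
  tree  = c("unpruned", "pruned"),
  train = c(err.rate(tree.oj, OJ.train), err.rate(pruned.oj, OJ.train)),
  test  = c(err.rate(tree.oj, OJ.test),  err.rate(pruned.oj, OJ.test))
)
errors
```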