Assignment 7
Question 3
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of $\hat{p}_{m1}$. The x-axis should display $\hat{p}_{m1}$, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.
p = seq(0, 1, 0.0001)
# Gini index
G = 2 * p * (1 - p)
# Classification error
E = 1 - pmax(p, 1 - p)
# Entropy (natural log); 0 * log(0) evaluates to NaN in R, so set the endpoints to 0
D = -(p * log(p) + (1 - p) * log(1 - p))
D[is.nan(D)] = 0
# Draw all three curves as lines so they match the legend
plot(p, D, type = "l", col = "red", ylab = "")
lines(p, E, col = "green")
lines(p, G, col = "blue")
legend(0.3, 0.15, c("Entropy", "Classification error", "Gini"), lty = c(1, 1, 1), lwd = c(2.5, 2.5, 2.5), col = c("red", "green", "blue"))
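As a quick numerical check on the three measures plotted above, the sketch below evaluates them at a few values of $\hat{p}_{m1}$. The function names (gini, class.error, entropy) and the vector p.check are illustrative and not part of the assignment; the entropy uses the natural log and treats 0 * log(0) as 0.

```r
# Two-class impurity measures as functions of p = proportion in class 1
gini = function(p) 2 * p * (1 - p)
class.error = function(p) 1 - pmax(p, 1 - p)
entropy = function(p) {
  # Treat 0 * log(0) as 0 so the endpoints p = 0 and p = 1 are defined
  term = function(q) ifelse(q > 0, q * log(q), 0)
  -(term(p) + term(1 - p))
}

# Evaluate at a pure node, an intermediate node, and a 50/50 node
p.check = c(0, 0.25, 0.5, 1)
impurity = cbind(p = p.check,
                 gini = gini(p.check),
                 error = class.error(p.check),
                 entropy = entropy(p.check))
print(impurity)
```

All three measures are zero at the pure nodes and peak at p = 0.5, where the Gini index and classification error both equal 0.5 and the entropy equals log(2).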
Question 8
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
library(ISLR2)
attach(Carseats)
set.seed(96)
# Hold out a random third of the rows as the test set; the remaining rows form the training set
test.idx = sample(dim(Carseats)[1], dim(Carseats)[1]/3)
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
library(tree)
tree.carseats = tree(Sales ~ ., data = Carseats.train)
summary(tree.carseats)
Regression tree:
tree(formula = Sales ~ ., data = Carseats.train)
Variables actually used in tree construction:
[1] "ShelveLoc" "Price" "Age" "Income" "CompPrice"
[6] "Education" "Advertising"
Number of terminal nodes: 19
Residual mean deviance: 2.408 = 597.1 / 248
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.87000 -1.10000 0.06133 0.00000 1.05900 3.93500
plot(tree.carseats)
text(tree.carseats, pretty = 0)
pred.carseats = predict(tree.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.carseats)^2)
[1] 4.106416
The fitted tree has 19 terminal nodes and uses seven of the ten predictors. The test MSE is roughly 4.11.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv.carseats = cv.tree(tree.carseats, FUN = prune.tree)
par(mfrow = c(1, 2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")
The best size read from the cross-validation plots is 13 terminal nodes.
pruned.carseats = prune.tree(tree.carseats, best = 13)
par(mfrow = c(1, 1))
plot(pruned.carseats)
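text(pruned.carseats, pretty = 0)
pred.pruned = predict(pruned.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.pruned)^2)
[1] 4.316175
Pruning to 13 terminal nodes raises the test MSE from about 4.11 to about 4.32, so pruning does not improve the test MSE here.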
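The tree size can also be taken from the cv.tree() output directly instead of being read off the plot. The following is a minimal sketch that assumes the ISLR2 and tree packages are installed and repeats the split from part (a); test.idx and best.size are illustrative names. Because the cross-validation folds are random, the selected size can differ between runs.

```r
library(ISLR2)
library(tree)

# Same split as in part (a): a random third of the rows is held out
set.seed(96)
test.idx = sample(nrow(Carseats), nrow(Carseats) / 3)
Carseats.train = Carseats[-test.idx, ]

tree.carseats = tree(Sales ~ ., data = Carseats.train)
cv.carseats = cv.tree(tree.carseats, FUN = prune.tree)

# Size (number of terminal nodes) with the smallest cross-validated deviance
best.size = cv.carseats$size[which.min(cv.carseats$dev)]
best.size
```

which.min() returns the first minimum, and cv.tree() lists sizes from largest to smallest, so ties are resolved in favour of the larger tree.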
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the feature_importance_ values to determine which variables are most important.
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
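One way part (d) could be approached in R is sketched below, assuming the ISLR2 and randomForest packages are installed. Bagging is a random forest in which every predictor is a candidate at each split, so mtry is set to the number of predictors. The importance() function plays the role of the importance values named in the question. Names such as test.idx, bag.carseats, and mse.bag are illustrative, and ntree = 500 is an arbitrary choice; no output is shown because it depends on the random draws.

```r
library(ISLR2)
library(randomForest)

# Same split as in part (a)
set.seed(96)
test.idx = sample(nrow(Carseats), nrow(Carseats) / 3)
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]

# Bagging: consider all predictors (every column except Sales) at each split
n.pred = ncol(Carseats) - 1
bag.carseats = randomForest(Sales ~ ., data = Carseats.train,
                            mtry = n.pred, ntree = 500, importance = TRUE)

# Test MSE of the bagged model
pred.bag = predict(bag.carseats, newdata = Carseats.test)
mse.bag = mean((Carseats.test$Sales - pred.bag)^2)
mse.bag

# Variable importance: %IncMSE and IncNodePurity for each predictor
importance(bag.carseats)
```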
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the feature_importance_ values to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
1 + 1
[1] 2
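A possible starting point for part (e), again assuming the ISLR2 and randomForest packages are installed: fit one random forest per value of mtry and compare the test MSEs. The names mtry.values and mse.rf are illustrative, and ntree = 500 is an arbitrary choice. The resulting curve will vary somewhat from run to run.

```r
library(ISLR2)
library(randomForest)

# Same split as in part (a)
set.seed(96)
test.idx = sample(nrow(Carseats), nrow(Carseats) / 3)
Carseats.train = Carseats[-test.idx, ]
Carseats.test = Carseats[test.idx, ]

# Try every value of m from 1 up to the number of predictors (m = 10 is bagging)
mtry.values = 1:(ncol(Carseats) - 1)
mse.rf = sapply(mtry.values, function(m) {
  fit = randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  mean((Carseats.test$Sales - predict(fit, newdata = Carseats.test))^2)
})

# Test MSE as a function of m
plot(mtry.values, mse.rf, type = "b", xlab = "m (mtry)", ylab = "Test MSE")
```

Variable importance for any one of these fits can be obtained by refitting with importance = TRUE and calling importance() on the result.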
(f) Now analyze the data using BART, and report your results.
1 + 1
[1] 2
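For part (f), a minimal sketch using gbart() from the BART package, in the same way as the ISLR2 lab, might look like the following. It assumes the ISLR2 and BART packages are installed and that gbart() expands the factor predictors (ShelveLoc, Urban, US) into dummy variables when given a data frame. The names x, y, bart.fit, and mse.bart are illustrative.

```r
library(ISLR2)
library(BART)

# Same split as in part (a)
set.seed(96)
test.idx = sample(nrow(Carseats), nrow(Carseats) / 3)

# Predictors and response as separate objects, as gbart() expects
x = Carseats[, names(Carseats) != "Sales"]
y = Carseats$Sales

# Fit BART on the training rows and predict the held-out rows
bart.fit = gbart(x[-test.idx, ], y[-test.idx], x.test = x[test.idx, ])

# Test MSE from the posterior mean predictions
mse.bart = mean((y[test.idx] - bart.fit$yhat.test.mean)^2)
mse.bart
```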
Question 9
This problem involves the OJ data set which is part of the ISLP package.
- Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
[1] 4
- Fit a tree to the training data, with Purchase as the response and the other variables as predictors. What is the training error rate?
[1] 4
- Create a plot of the tree, and interpret the results. How many terminal nodes does the tree have?
[1] 4
- Use the export_tree() function to produce a text summary of the fitted tree. Pick one of the terminal nodes, and interpret the information displayed.
[1] 4
- Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
[1] 4
- Use cross-validation on the training set in order to determine the optimal tree size.
[1] 4
- Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
[1] 4
- Which tree size corresponds to the lowest cross-validated classification error rate?
[1] 4
- Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
[1] 4
- Compare the training error rates between the pruned and unpruned trees. Which is higher?
[1] 4
- Compare the test error rates between the pruned and unpruned trees. Which is higher?
[1] 4
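The steps of Question 9 could be sketched in R as follows. This assumes the ISLR2 and tree packages are installed (the OJ data set also ships with ISLR2), uses an arbitrary seed, and uses printing the fitted tree as the R counterpart of the text summary requested in the question. All object names are illustrative, and no results are quoted because they depend on the random split.

```r
library(ISLR2)
library(tree)

# Training set of 800 random observations; the rest form the test set
set.seed(1)
train.idx = sample(nrow(OJ), 800)
OJ.train = OJ[train.idx, ]
OJ.test = OJ[-train.idx, ]

# Fit the tree; summary() reports the training error rate and terminal nodes
tree.oj = tree(Purchase ~ ., data = OJ.train)
summary(tree.oj)
tree.oj                      # text summary of every node
plot(tree.oj)
text(tree.oj, pretty = 0)

# Helper: misclassification rate of a fitted tree on a data set
err.rate = function(fit, data) {
  mean(predict(fit, data, type = "class") != data$Purchase)
}

# Confusion matrix and test error rate for the unpruned tree
pred.oj = predict(tree.oj, OJ.test, type = "class")
conf.mat = table(predicted = pred.oj, actual = OJ.test$Purchase)
conf.mat
test.err = err.rate(tree.oj, OJ.test)

# Cross-validation on the training set, guided by misclassification error
cv.oj = cv.tree(tree.oj, FUN = prune.misclass)
plot(cv.oj$size, cv.oj$dev / nrow(OJ.train), type = "b",
     xlab = "Tree size", ylab = "CV classification error rate")
best.size = cv.oj$size[which.min(cv.oj$dev)]

# If CV keeps the full tree, fall back to five terminal nodes as the question asks
n.leaves = sum(tree.oj$frame$var == "<leaf>")
if (best.size >= n.leaves) best.size = 5
pruned.oj = prune.misclass(tree.oj, best = best.size)

# Training and test error rates, unpruned versus pruned
c(train.unpruned = err.rate(tree.oj, OJ.train),
  train.pruned = err.rate(pruned.oj, OJ.train),
  test.unpruned = test.err,
  test.pruned = err.rate(pruned.oj, OJ.test))
```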