Problem 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from \(0\) to \(1\), and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
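
In the two-class setting, write \(\hat{p} = \hat{p}_{m1}\), so that \(\hat{p}_{m2} = 1 - \hat{p}\). The three measures then reduce to functions of \(\hat{p}\) alone:

Gini index: \(G = 2\hat{p}(1 - \hat{p})\)

Classification error: \(E = 1 - \max(\hat{p},\, 1 - \hat{p})\)

Entropy: \(D = -\hat{p}\log\hat{p} - (1 - \hat{p})\log(1 - \hat{p})\)

These are exactly the quantities computed below.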

p <- seq(0, 1, 0.01)

# Two-class impurity measures as functions of p = p_hat_m1
gini.index <- 2 * p * (1 - p)
classification.error <- 1 - pmax(p, 1 - p)
# Natural-log entropy; at p = 0 and p = 1 the expression evaluates to NaN,
# which lines() silently skips (the limiting value there is 0)
entropy <- -(p * log(p) + (1 - p) * log(1 - p))

par(bg = '#006666')
plot(NA, NA, xlim = c(0, 1), ylim = c(0, 1), xlab = 'p', ylab = 'f')

lines(p, gini.index, col = '#FF33CC')
lines(p, classification.error, col = 'darkslategray1')
lines(p, entropy, col = 'chartreuse')
legend('topright', legend = c('Gini index', 'Classification error', 'Entropy'),
       col = c('#FF33CC', 'darkslategray1', 'chartreuse'), lty = 1)
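
All three curves are symmetric about \(\hat{p}_{m1} = 0.5\), where each attains its maximum: \(0.5\) for the Gini index and the classification error, and \(\log 2 \approx 0.693\) for the natural-log entropy. A quick numerical check using the vectors defined above:

max(gini.index)             # 0.5
max(classification.error)   # 0.5
max(entropy, na.rm = TRUE)  # log(2), about 0.693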


Problem 9

This problem involves the OJ data set, which is part of the ISLR package.

(a) Create a training set containing a random sample of \(800\) observations and a test set containing the remaining observations.
library(ISLR)  # provides the OJ data set
library(tree)  # provides tree(), cv.tree(), prune.tree()

set.seed(2)
# Note: the exercise asks for 800 training observations; this solution
# uses a 66% split instead, giving 706 training and 364 test observations
set.training.oj <- sample(nrow(OJ), nrow(OJ) * 0.66)
training.oj <- OJ[set.training.oj, ]
test.oj <- OJ[-set.training.oj, ]
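
As a quick check on the split sizes (a sketch; OJ has 1,070 rows, so a 66% split gives 706 training and 364 test observations):

dim(training.oj)  # 706 rows
dim(test.oj)      # 364 rows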


(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?


Fit Tree and Summary
tree.oj <- tree(Purchase ~ ., data = training.oj)
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = training.oj)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.7523 = 526.6 / 700 
## Misclassification error rate: 0.1728 = 122 / 706

The tree has 6 terminal nodes and uses only two of the predictors, LoyalCH and PriceDiff. The training error rate is 0.1728, i.e., 122 of the 706 training observations are misclassified.
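
The training error rate can also be recomputed directly by predicting on the training set (a sketch):

train.pred <- predict(tree.oj, training.oj, type = 'class')
mean(train.pred != training.oj$Purchase)  # 0.1728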


(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree.oj
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 706 947.50 CH ( 0.60482 0.39518 )  
##    2) LoyalCH < 0.5036 323 382.20 MM ( 0.27864 0.72136 )  
##      4) LoyalCH < 0.280875 154 115.10 MM ( 0.12338 0.87662 ) *
##      5) LoyalCH > 0.280875 169 230.00 MM ( 0.42012 0.57988 )  
##       10) PriceDiff < 0.05 64  64.60 MM ( 0.20312 0.79688 ) *
##       11) PriceDiff > 0.05 105 144.40 CH ( 0.55238 0.44762 ) *
##    3) LoyalCH > 0.5036 383 281.20 CH ( 0.87990 0.12010 )  
##      6) LoyalCH < 0.737888 151 176.60 CH ( 0.72848 0.27152 )  
##       12) PriceDiff < 0.015 47  64.96 MM ( 0.46809 0.53191 ) *
##       13) PriceDiff > 0.015 104  89.30 CH ( 0.84615 0.15385 ) *
##      7) LoyalCH > 0.737888 232  48.26 CH ( 0.97845 0.02155 ) *

The tree has 6 terminal nodes and 11 nodes in total (the labels run to 13, but 8 and 9 are unused). Only LoyalCH and PriceDiff are used in the splits. As an example, consider terminal node 7, "LoyalCH > 0.737888": it contains 232 training observations with a deviance of 48.26 and predicts Purchase = CH; about 97.8% of the observations in this node bought Citrus Hill, so it is a very pure node of highly brand-loyal CH customers.


(d) Create a plot of the tree, and interpret the results.
plot(tree.oj)
text(tree.oj, pretty =0)

The plot shows that LoyalCH is the dominant variable: it defines the root split and both second-level splits, with PriceDiff refining the prediction in the intermediate loyalty ranges. Customers with high LoyalCH are predicted to buy CH, those with low LoyalCH to buy MM, and for those in between the price difference tips the decision.
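
The variables actually used in the fit can also be extracted programmatically (a sketch; this relies on the used component of the object returned by summary.tree):

summary(tree.oj)$used  # LoyalCH, PriceDiff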


(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
pred.tree.oj <- predict(tree.oj, test.oj, type = 'class')
table(test.oj$Purchase, pred.tree.oj, dnn = c('Actual', 'Predicted'))
##       Predicted
## Actual  CH  MM
##     CH 196  30
##     MM  39  99
mse.pred.oj <- mean(pred.tree.oj != test.oj$Purchase)
mse.pred.oj
## [1] 0.1895604

The test error rate is approximately 0.19, or 19%. (Despite the variable name, this is a misclassification rate, not a mean squared error.)
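
The same figure follows from the confusion matrix, since the off-diagonal entries are the errors:

(30 + 39) / (196 + 30 + 39 + 99)  # 69 / 364 = 0.1896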


(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv.tree.oj <- cv.tree(tree.oj, FUN = prune.tree)  # the default FUN; cross-validates on deviance
cv.tree.oj
## $size
## [1] 6 5 4 3 2 1
## 
## $dev
## [1] 590.2541 599.2843 598.8823 615.9539 668.4509 952.7208
## 
## $k
## [1]      -Inf  20.94305  22.33652  37.19361  56.35952 284.02935
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

The cross-validated deviance is minimized at a size of 6, so the optimal tree size appears to be 6 terminal nodes.
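
To cross-validate on the misclassification rate instead, as parts (g) and (h) ask about, one could use prune.misclass (a sketch; $dev would then hold misclassification counts, so the numbers would differ from those above):

cv.misclass.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.misclass.oj$dev  # misclassification counts by tree size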


(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.tree.oj$size, cv.tree.oj$dev, type = "b", xlab = "Tree Size", ylab = "CV Deviance")

(With FUN = prune.tree above, the y-axis shows cross-validated deviance rather than the classification error rate itself.)
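
To highlight the minimum on the plot (a small sketch, run immediately after the plot call above):

best.size <- cv.tree.oj$size[which.min(cv.tree.oj$dev)]  # 6
points(best.size, min(cv.tree.oj$dev), col = 'red', pch = 19)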


(h) Which tree size corresponds to the lowest cross-validated classification error rate?

A tree with 6 terminal nodes has the lowest cross-validated deviance (590.3), although trees of size 4 and 5 are not far behind (598.9 and 599.3).


(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj <- prune.tree(tree.oj, best = 5)

Since cross-validation selected the full tree (6 terminal nodes) rather than a pruned one, we create a pruned tree with five terminal nodes for comparison, as instructed.
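
To see what was removed, the pruned tree can be plotted the same way as the original:

plot(prune.oj)
text(prune.oj, pretty = 0)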


(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(prune.oj)
## 
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 5L)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.7811 = 547.5 / 701 
## Misclassification error rate: 0.1884 = 133 / 706
summary(tree.oj)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = training.oj)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.7523 = 526.6 / 700 
## Misclassification error rate: 0.1728 = 122 / 706

The pruned tree has the higher training error rate: it rises from 0.1728 (122/706) for the unpruned tree to 0.1884 (133/706) after pruning, i.e., from about 17.3% to 18.8%.
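
The two rates can also be extracted programmatically (a sketch; it assumes the misclass component, c(errors, n), that summary.tree returns for classification trees):

train.err <- function(fit) { m <- summary(fit)$misclass; m[1] / m[2] }
train.err(tree.oj)   # 0.1728
train.err(prune.oj)  # 0.1884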


(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
pred.oj.prune <- predict(prune.oj, test.oj, type = 'class')
mse.pred.oj.prune <- mean(pred.oj.prune != test.oj$Purchase)

print("MSE of Pruned Tree")
## [1] "MSE of Pruned Tree"
mse.pred.oj.prune
## [1] 0.2225275
print("MSE of Original Tree")
## [1] "MSE of Original Tree"
mse.pred.oj
## [1] 0.1895604

The pruned tree also has the higher test error rate: approximately 0.2225 (22.3%), compared with 0.1896 (19.0%) for the unpruned tree.
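
A compact side-by-side comparison using the quantities already computed (a sketch):

data.frame(tree = c('unpruned', 'pruned'),
           test.error = c(mse.pred.oj, mse.pred.oj.prune))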

