3) Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. (Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.)
# Grid of p-hat_m1 values from 0 to 1
p <- seq(0, 1, 0.01)
# Calculate each measure. With two classes, p-hat_m2 = 1 - p-hat_m1
classification.error <- 1 - pmax(p, 1 - p)
gini.index <- 2 * p * (1 - p)
# Entropy (natural log); the endpoints p = 0 and p = 1 evaluate to NaN
# and are simply omitted from the plot
entropy <- -(p * log(p) + (1 - p) * log(1 - p))
# Build the plot
matplot(p, cbind(classification.error, gini.index, entropy),
        col = c("blue", "green", "red"), pch = 16, lty = 1,
        main = "Classification Tree Measures with 2 classes",
        xlab = "Proportion of training observations",
        ylab = "Measure Values", ylim = c(0, 1.1))
legend(x = "top", pch = 16, title = "Measures",
       col = c("blue", "green", "red"),
       legend = c("Classification Error Rate", "Gini Index", "Entropy"),
       box.lty = 1)

Classification Error Rate: The fraction of the training observations in that region that do not belong to the most common class.
Gini Index: A measure of total variance across the K classes, often read as a measure of node purity. It takes on a small value when all of the p̂mk's are close to zero or one, i.e. when a node contains predominantly observations from a single class.
Entropy: Another measure of node purity that, like the Gini index, takes on a value near zero when the p̂mk's are all near zero or near one.
When building a classification tree, either the Gini index or the entropy is typically used to evaluate the quality of a particular split, since these two measures are more sensitive to node purity than the classification error rate. Any of the three approaches may be used when pruning the tree, but the classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
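As a quick numerical check of that sensitivity claim, the sketch below evaluates the three measures (same formulas as the plot code above, with natural-log entropy) at a moderately impure node and at a nearly pure one; the helper node.measures() is introduced here purely for illustration. The Gini index and entropy fall off much more sharply than the classification error as the node becomes purer.
# Illustrative helper: evaluate all three measures at a given p-hat_m1
node.measures <- function(p) {
  c(class.error = 1 - pmax(p, 1 - p),
    gini        = 2 * p * (1 - p),
    entropy     = -(p * log(p) + (1 - p) * log(1 - p)))
}
round(node.measures(0.70), 3)  # error 0.300, gini 0.420, entropy 0.611
round(node.measures(0.95), 3)  # error 0.050, gini 0.095, entropy 0.199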
8) In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Get p and n; the response (Sales) is one of the columns, so subtract 1 for the predictor count
p = ncol(Carseats) - 1
n = nrow(Carseats)
cat ('p = ', p , '; n = ', n)
## p = 10 ; n = 400
8a) Split the data set into a training set and a test set.
# set the seed
set.seed(1)
# Divide the data set into training (75%) and test (25%) sets, even though n is just 400
# using the caret package for practice
library(caret)  # provides createDataPartition()
inTrain <- createDataPartition(Carseats$Sales, p = 0.75, list = FALSE, times = 1)
carseats.train <- Carseats[inTrain,]
carseats.test <- Carseats[-inTrain,]
dim(carseats.train)
## [1] 301 11
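Note that createDataPartition() draws a split stratified on the response (Sales). For reference, an unstratified split could be drawn with base R's sample(); this is only a sketch (train.idx is an illustrative name), and the exact rows selected would differ from the caret partition used above.
# Alternative (not used above): simple random 75/25 split with base R
train.idx <- sample(seq_len(nrow(Carseats)), size = round(0.75 * nrow(Carseats)))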
8b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
# Fit a regression tree with the tree package
library(tree)
tree.carseats <- tree(Sales ~ ., data = carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "Age" "Advertising"
## [6] "CompPrice"
## Number of terminal nodes: 16
## Residual mean deviance: 2.295 = 654 / 285
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.5490 -1.0160 -0.1187 0.0000 0.9648 3.5840
The regression tree uses 6 of the 10 predictors in its construction (ShelveLoc, Price, Income, Age, Advertising, and CompPrice) and has 16 terminal nodes.
# Build a plot
plot(tree.carseats)
text(tree.carseats, pretty = 0)

The tree plot indicates that the most important predictor of Sales is shelving location (ShelveLoc), a categorical variable.
# Get the MSE by using predict function
preds.carseats = predict(tree.carseats, newdata = carseats.test)
tree.mse = mean((preds.carseats - carseats.test$Sales)^2)
tree.mse
## [1] 4.935788
The test MSE from this unpruned tree is 4.935788.
8c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
# Use cross validation to determine the optimal level of tree complexity
set.seed(2)
cv.carseats <- cv.tree(tree.carseats)
cv.carseats
## $size
## [1] 16 15 14 13 12 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1096.549 1096.851 1089.865 1075.781 1099.359 1235.014 1317.378 1317.028
## [9] 1324.641 1389.604 1504.873 1530.520 1618.503 1813.690 2354.377
##
## $k
## [1] -Inf 24.57854 25.70560 27.05048 29.90736 43.69234 48.99582
## [8] 49.70574 54.03148 78.84306 112.38463 123.79024 152.23047 280.88027
## [15] 593.16181
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The sizes are listed in decreasing order. We select the size with the lowest cross-validation deviance, which in this case is size = 13.
# Gather the size for which the deviance is low
tree.min.cv = which.min(cv.carseats$dev)
tree.min.size = cv.carseats$size[tree.min.cv]
tree.min.size
## [1] 13
# Use the size value from above to highlight the size at which deviance is low
plot(cv.carseats$size, cv.carseats$dev, type = "b",xlab='Size',ylab='Deviance',main='Cross Validation Error Plot')
points(tree.min.size, cv.carseats$dev[tree.min.cv], col = "red", cex = 2, pch = 20)

As the plot indicates, tree size = 13 has the lowest CV error, i.e. it corresponds to the optimal level of tree complexity. Let's prune the tree to 13 terminal nodes.
# Prune the tree to the best size found by cross-validation
prune.carseats <- prune.tree(tree.carseats, best = tree.min.size)
plot(prune.carseats)
text(prune.carseats, pretty = 0)

# Use predict function to get the predictions and find the MSE
prune.preds <- predict(prune.carseats, newdata = carseats.test)
prune.mse = mean((prune.preds - carseats.test$Sales)^2)
prune.mse
## [1] 5.021645
Pruning the tree actually increased the test MSE (from 4.94 to 5.02).
8d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Bagging is a special case of a random forest with m = p, so we use the randomForest function with the mtry argument set to the number of predictors, i.e. p = 10.
# Use the randomForest function for the bagging approach (mtry = p)
library(randomForest)
set.seed(3)
bag.carseats <- randomForest(Sales ~ ., data = carseats.train, mtry = p, importance = TRUE)
bag.carseats
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, mtry = p, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.389486
## % Var explained: 69.3
The bagged model explains 69.3% of the variability in Sales. Let's find the test MSE.
bag.preds <- predict(bag.carseats, newdata = carseats.test)
bag.mse = mean((bag.preds - carseats.test$Sales)^2)
bag.mse
## [1] 3.178911
The test MSE obtained from bagging is 3.178911, which is lower than the single regression tree's test MSE of 4.94.
Let's determine which variables are most important.
# Use importance function to determine the importance of variables
importance(bag.carseats)
## %IncMSE IncNodePurity
## CompPrice 29.6775410 215.26019
## Income 7.4441919 98.62490
## Advertising 25.7177819 182.87717
## Population 0.2845831 73.86004
## Price 72.5086085 751.09396
## ShelveLoc 79.1673609 705.64459
## Age 20.0807016 184.05511
## Education 2.7658912 58.03403
## Urban -1.0251162 10.42711
## US 3.3490685 12.88167
The importance measures show that Price and ShelveLoc are by far the most important variables.
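The same information could also be summarized visually with the randomForest package's varImpPlot() function (not shown in the output above).
# Optional: dot plots of %IncMSE and IncNodePurity for each predictor
varImpPlot(bag.carseats)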
8e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
This time, use the randomForest function without specifying mtry, letting the function fall back to its regression default of m ≈ p/3.
# set the seed
set.seed(4)
# use randomForest function
rf.carseats <- randomForest(Sales ~ ., data = carseats.train, importance = TRUE)
rf.carseats
##
## Call:
## randomForest(formula = Sales ~ ., data = carseats.train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 2.792944
## % Var explained: 64.11
The random forest tried 3 variables at each split and explains about 64.11% of the variability. Let's get the test MSE and the variable importance.
rf.preds <- predict(rf.carseats, newdata = carseats.test)
rf.mse = mean((rf.preds - carseats.test$Sales)^2)
rf.mse
## [1] 3.748591
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 14.0763435 198.72262
## Income 3.0894890 156.39168
## Advertising 20.3108832 202.91520
## Population 0.2388015 138.20203
## Price 46.8728499 578.50601
## ShelveLoc 50.1638302 542.98987
## Age 16.2776580 234.60743
## Education -0.1216019 103.10753
## Urban -0.1260788 19.19768
## US 6.7317642 33.33732
The importance measures again show that Price and ShelveLoc are the most important variables.
However, the test MSE obtained with the random forest (3.75) is worse than the test MSE obtained from bagging (3.18). Bagging uses m = p = 10 variables at each split, whereas the random forest used the regression default of m ≈ p/3, i.e. m = 3, at each split.
We can conclude that in this scenario, the bagging approach, which considers all 10 variables at each split, produced the better test error.
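To describe the effect of m more systematically, one could refit the model over a grid of mtry values and compare the resulting test MSEs. The sketch below (not run above) assumes the carseats.train and carseats.test objects defined earlier; mtry.values and mtry.mse are illustrative names.
# Sketch: test MSE as a function of m, the number of variables tried at each split
mtry.values <- 1:p
mtry.mse <- sapply(mtry.values, function(m) {
  set.seed(4)
  fit <- randomForest(Sales ~ ., data = carseats.train, mtry = m, ntree = 500)
  mean((predict(fit, newdata = carseats.test) - carseats.test$Sales)^2)
})
plot(mtry.values, mtry.mse, type = "b",
     xlab = "m (variables tried at each split)", ylab = "Test MSE")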
9) This problem involves the OJ data set which is part of the ISLR package.
attach(OJ)
str(OJ)
## 'data.frame': 1070 obs. of 18 variables:
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: num 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : num 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : num 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : num 1 1 1 1 0 0 0 0 0 0 ...
p = ncol(OJ)-1
n = nrow(OJ)
cat ('p = ', p , '; n = ', n)
## p = 17 ; n = 1070
9a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
# set the seed
set.seed(5)
# Use the caret package to split the data set; with stratification on Purchase, p = 0.746 yields a training set of 800 observations
oj.inTrain <- createDataPartition(OJ$Purchase, p = .746, list = FALSE, times = 1)
oj.train <- OJ[oj.inTrain,]
oj.test <- OJ[-oj.inTrain,]
cat('training observations count = ', dim(oj.train)[1], '\ntest observations count = ', dim(oj.test)[1])
## training observations count = 800
## test observations count = 270
9b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
# set the seed
set.seed(6)
# Call the tree function
tree.oj <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7823 = 620.3 / 793
## Misclassification error rate: 0.1675 = 134 / 800
The response variable Purchase is a categorical variable, so this is a classification problem. The training (misclassification) error rate is 16.75%, and the tree has 7 terminal nodes. The variables actually used in tree construction are LoyalCH, PriceDiff and ListPriceDiff.
9c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
# Print the tree fit
tree.oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.00 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.469289 293 319.90 MM ( 0.23549 0.76451 )
## 4) LoyalCH < 0.280875 170 127.10 MM ( 0.12353 0.87647 ) *
## 5) LoyalCH > 0.280875 123 164.50 MM ( 0.39024 0.60976 )
## 10) PriceDiff < -0.24 13 0.00 MM ( 0.00000 1.00000 ) *
## 11) PriceDiff > -0.24 110 150.70 MM ( 0.43636 0.56364 ) *
## 3) LoyalCH > 0.469289 507 468.00 CH ( 0.82643 0.17357 )
## 6) LoyalCH < 0.764572 242 302.70 CH ( 0.68182 0.31818 )
## 12) PriceDiff < 0.145 91 124.80 MM ( 0.43956 0.56044 )
## 24) ListPriceDiff < 0.235 63 78.74 MM ( 0.31746 0.68254 ) *
## 25) ListPriceDiff > 0.235 28 33.50 CH ( 0.71429 0.28571 ) *
## 13) PriceDiff > 0.145 151 138.70 CH ( 0.82781 0.17219 ) *
## 7) LoyalCH > 0.764572 265 91.54 CH ( 0.95849 0.04151 ) *
We pick terminal node 24 (marked with an asterisk). Its split criterion is ListPriceDiff < 0.235; the branch contains 63 observations with a deviance of 78.74, and the overall prediction for the branch is MM (about 68% of the observations in this node are MM). Its sibling, node 25 (ListPriceDiff > 0.235), contains 28 observations and predicts CH (Citrus Hill). Neither node is split further in this fit.
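For reference, the same figures can be read programmatically from the tree object's frame component, whose row names are the node numbers; this lookup is purely illustrative.
# Illustrative: extract node 24's size, deviance and predicted class
tree.oj$frame[rownames(tree.oj$frame) == "24", c("var", "n", "dev", "yval")]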
9d) Create a plot of the tree, and interpret the results.
# Plot the tree
plot(tree.oj)
text(tree.oj, pretty = 0)

The above tree plot shows that LoyalCH, PriceDiff and ListPriceDiff are the only variables used in tree construction, and LoyalCH appears to be the most important predictor since it defines the first split. On the left branch (LoyalCH < 0.469289), i.e. customers with weak Citrus Hill loyalty, every terminal node predicts MM, so the model always predicts Minute Maid for these customers. On the far right branch (LoyalCH > 0.764572), i.e. customers with strong Citrus Hill loyalty, the model always predicts CH (Citrus Hill).
9e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
# use predict function to get the response on the test data
tree.preds <- predict(tree.oj, oj.test, type = "class")
# Build a confusion matrix
table(tree.preds, oj.test$Purchase)
##
## tree.preds CH MM
## CH 135 17
## MM 30 88
# Get the misclassification rate and accuracy rate
mse.oj = (17+30)/(dim(oj.test)[1])
acc.oj = 1-mse.oj
cat('Misclassification/test error rate = ', mse.oj, '\nAccuracy Rate = ',acc.oj)
## Misclassification/test error rate = 0.1740741
## Accuracy Rate = 0.8259259
The test error rate is 0.1740741, i.e. a misclassification rate of about 17.4%.
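Equivalently, the test error rate can be computed directly from the predictions rather than by hard-coding the off-diagonal counts of the confusion matrix.
# Same test error rate, computed without reading counts off the table
mean(tree.preds != oj.test$Purchase)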
9f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
# Get the cross validation error rate
cv.oj <- cv.tree(tree.oj, FUN = prune.misclass)
cv.oj
## $size
## [1] 7 5 2 1
##
## $dev
## [1] 168 168 174 312
##
## $k
## [1] -Inf 0.000000 7.666667 155.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.oj)

9g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj$size, cv.oj$dev, type = "b", xlab = "Tree size", ylab = "Deviance", main='Cross Validation Error Plot')
points(5,cv.oj$dev[2],col='green',cex=2,pch=20)
points(7,cv.oj$dev[1],col='red',cex=2,pch=20)

9h) Which tree size corresponds to the lowest cross-validated classification error rate?
The above plot shows that sizes 5 and 7 have the same (lowest) deviance. We prefer the smaller tree to reduce complexity and the risk of overfitting, so a tree size of 5 corresponds to the lowest cross-validated classification error rate.
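The same choice can be made programmatically by taking the smallest size that attains the minimum cross-validation error; this one-liner uses the cv.oj object from above.
# Smallest tree size achieving the minimum CV error (returns 5 here)
min(cv.oj$size[cv.oj$dev == min(cv.oj$dev)])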
9i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.oj <- prune.misclass(tree.oj, best = 5)
plot(prune.oj)
text(prune.oj, pretty = 0)

The pruned tree is simpler than the original. The subtree under LoyalCH < 0.469289 was collapsed into a single terminal node: before pruning it contained three terminal nodes that all predicted MM anyway, so removing the two extra nodes loses no information about the predicted class.
9j) Compare the training error rates between the pruned and un-pruned trees. Which is higher?
From 9b), we know that the training error rate of the un-pruned tree is 0.1675, i.e. a misclassification rate of 16.75%. Let's find the training error rate of the pruned tree.
summary(prune.oj)
##
## Classification tree:
## snip.tree(tree = tree.oj, nodes = 2L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 5
## Residual mean deviance: 0.8332 = 662.4 / 795
## Misclassification error rate: 0.1675 = 134 / 800
From the above, the training error rate of the pruned tree is also 0.1675, so the pruned and un-pruned trees have the same training error rate.
The key difference is that the pruned tree has fewer terminal nodes (5 instead of 7).
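As a sanity check, both training error rates can also be computed directly from the fitted trees using the training data.
# Training error rates computed directly (both should equal 0.1675)
mean(predict(tree.oj, newdata = oj.train, type = "class") != oj.train$Purchase)
mean(predict(prune.oj, newdata = oj.train, type = "class") != oj.train$Purchase)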
9k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
We know from 9e) that the test error rate of the unpruned tree is 0.1740741. Let's find the test error rate of the pruned tree.
# Get the test error of the pruned tree using the predict function
prune.oj.preds = predict(prune.oj, newdata = oj.test, type = "class")
table(prune.oj.preds, oj.test$Purchase)
##
## prune.oj.preds CH MM
## CH 135 17
## MM 30 88
mse.prune.oj = (17+30)/270
mse.prune.oj
## [1] 0.1740741
From the above, the test error rate of the pruned tree is 0.1740741, the same as that of the un-pruned tree; pruning did not change the misclassification rate on the test set.