Assignment 7

3. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1}\) = 1− \(\hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.

The classification error rate is:

\[E = 1 − \displaystyle \max_{k}(\hat{p}_{mk}).\]

Therefore, \[E = 1 - max \{\hat{p}_{m1}, \hat{p}_{m2}\}\]

The Gini index is:

\[G = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}) \]

Therefore, \[G = \hat{p}_{m1} (1 - \hat{p}_{m1}) + \hat{p}_{m2} (1 - \hat{p}_{m2})\]

The Entropy is:

\[D = - \sum_{k=1}^{K} \hat{p}_{mk} log(\hat{p}_{mk}) \]

Therefore, \[D = - \hat{p}_{m1} log(\hat{p}_{m1}) - \hat{p}_{m2} log(\hat{p}_{m2}) \]

When we plot for the above quantities in a simple classification settings in R, we get the below plot.

#Two classes
p1 <- seq(0, 1, 0.001)
p2 <- 1 - p1

#Classification error rate
classification.error.rate <- 1 - pmax(p1, p2)
#Gini index
gini.index <- p1 * (1 - p1) + p2 * (1 - p2)
#Entropy
entropy <- -p1 * log(p1) - p2*log(p2)

#Data frame to hold the 2 classes with the quantities
df.classification.quantities <-  data.frame(p1, p2, classification.error.rate, gini.index, entropy) 
df.classification.quantities <- pivot_longer(df.classification.quantities, cols = c(classification.error.rate, gini.index, entropy), names_to = "metric")

##Plot classification quantities
ggplot(data=df.classification.quantities,
        aes(x = p1, y = value, col = factor(metric))) + 
        geom_line(size = 1) + 
        scale_y_continuous(breaks = seq(0, 1, 0.1), minor_breaks = NULL) + 
        scale_color_hue(labels = c("Classification Error", "Entropy", "Gini")) +
        labs(col = "Metric", 
        y = "Value", 
        x = "P")

8. In the lab, a classification tree was applied to the `Carseats` data set after converting `Sales` into a qualitative response variable. Now we will seek to predict `Sales` using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

Split using the same code from the Lab.

set.seed(123)
train <- sample(1 : nrow(Carseats), 200)
carseats.train <- Carseats[train, ]
carseats.test <- Carseats[-train ,]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

tree.model.train <- tree(Sales ~ ., carseats.train)

plot(tree.model.train)
text(tree.model.train, pretty = 0, cex = 0.5)

From the above tree, we can identify that Price and ShelveLoc have the best split of the Carseats data. So, they can be used to predict Sales of the Carseats.

summary(tree.model.train)

## 
## Regression tree:
## tree(formula = Sales ~ ., data = carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Income"      "Age"         "Population" 
## [6] "Education"   "CompPrice"   "Advertising"
## Number of terminal nodes:  18 
## Residual mean deviance:  2.132 = 388.1 / 182 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.08000 -0.92870  0.06244  0.00000  0.87020  3.71700

There are 18 terminal nodes in the tree.

test.carseats.pred <- predict(tree.model.train, newdata = carseats.test)
mean((test.carseats.pred - carseats.test$Sales)^2)

## [1] 4.395357

The test MSE is 4.40 (approximately).

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

Using the cross-validation in order to determine the optimal level of tree complexity.

#Use Cross-validation with K=10 fold
set.seed(3)
cv.carseats <- cv.tree(tree.model.train, K = 10)
#Construct the data frame with deviance and size 
df.cv.carseats <- data.frame(Tree.Size = cv.carseats$size,
                              CV.RSS = cv.carseats$dev)
#Find the minimum RSS
df.cv.carseats <- mutate(df.cv.carseats, min.cv.rss = as.numeric(min(CV.RSS) == CV.RSS))
##Use the ggplot
ggplot(df.cv.carseats, aes(x = Tree.Size, y = CV.RSS)) +
  geom_line(col = "black") +
  geom_point(size = 2, aes(col = factor(min.cv.rss))) +
  scale_x_continuous(breaks = seq(1, 20, 1)) +
  scale_y_continuous(labels = scales::comma_format()) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "Carseats Dataset - Cross Validation",
       x = "Tree nodes",
       y = "CV - RSS")

The tree with 13 terminal nodes results in the lowest cross-validation error rate.

We now apply the prune.tree() function in order to prune the tree to obtain the 13-node tree.

prune.carseats <- prune.tree(tree.model.train, best = 13)
plot(prune.carseats)
text(prune.carseats, pretty=0, cex = 0.5)

test.prune.carseats.pred <- predict(prune.carseats, newdata = carseats.test)
mean((test.prune.carseats.pred - carseats.test$Sales)^2)

## [1] 4.674089

The test MSE is 4.67 (approximately), which indicates that pruning the tree increases slightly the test MSE.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important.

set.seed(1)
bag.carseats <- randomForest(Sales ~ . , data = carseats.train, mtry=10, importance =TRUE)
bag.carseats

## 
## Call:
##  randomForest(formula = Sales ~ ., data = carseats.train, mtry = 10,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.914407
##                     % Var explained: 59.5

bag.carseats.pred <- predict(bag.carseats, newdata = carseats.test)
mean((bag.carseats.pred - carseats.test$Sales)^2)

## [1] 2.719623

The test MSE is 2.72 (approximately), which indicates that the bagging reduces the test MSE considerably.

importance(bag.carseats)

##                 %IncMSE IncNodePurity
## CompPrice   19.45546048    157.493242
## Income       6.54808190     88.084674
## Advertising  9.35661044     76.584090
## Population  -0.03332117     52.232916
## Price       46.38347866    383.642775
## ShelveLoc   50.78194122    388.510326
## Age         19.66414864    181.155199
## Education    2.95721774     54.469563
## Urban       -0.22123191      8.959947
## US           1.51594678      6.052909

The above results indicate that across all of the trees considered in the bagging, the price (Price) and the shelve location (ShelveLoc) are by far the two most important variables.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

set.seed(3)
rf.carseats <- randomForest(Sales ~ ., data = carseats.train, mtry = 3, ntree = 500, importance = TRUE)
rf.carseats

## 
## Call:
##  randomForest(formula = Sales ~ ., data = carseats.train, mtry = 3,      ntree = 500, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 3.330894
##                     % Var explained: 53.71

rf.carseats.pred <- predict(rf.carseats, newdata = carseats.test)
mean((rf.carseats.pred - carseats.test$Sales)^2)

## [1] 3.546194

The test MSE is 3.55 (approximately), which indicates that the random forest increases the test MSE considerably from the bagging but less than the tree regression model.

importance(rf.carseats)

##                %IncMSE IncNodePurity
## CompPrice   10.9368252     149.53849
## Income       6.2376902     122.01057
## Advertising  6.3414943      99.06308
## Population  -0.7555239      95.37645
## Price       29.5402685     296.11531
## ShelveLoc   30.4862904     276.79837
## Age         17.7260347     213.36623
## Education    1.0378316      70.73823
## Urban       -0.3607377      15.65228
## US           1.1709668      14.10679

The above results also indicate that across all of the trees considered in the random forest, the price (Price) and the shelve location (ShelveLoc) are by far the two most important variables.

9. This problem involves the `OJ` data set which is part of the `ISLR` package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

Creating the training set with 800 observations and remaining in the test set.

set.seed(123)
train <- sample(1:nrow(OJ), 800)
oj.train <- OJ[train, ]
oj.test <- OJ[-train, ]

(b) Fit a tree to the training data, with `Purchase` as the response and the other variables as predictors. Use the `summary()` function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

tree.oj.model <- tree(Purchase ~ ., data = oj.train)
summary(tree.oj.model)

## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800

The classification tree has 8 terminal nodes and a training error rate of 16.5%.

From the available 17 predictors, the classification tree splits used only 2 predictors:
(i) LoyalCH - Customer brand loyalty for CH
(ii) PriceDiff - Sale price of MM less sale price of CH

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

tree.oj.model

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1071.00 CH ( 0.60875 0.39125 )  
##    2) LoyalCH < 0.5036 350  415.10 MM ( 0.28000 0.72000 )  
##      4) LoyalCH < 0.276142 170  131.00 MM ( 0.12941 0.87059 )  
##        8) LoyalCH < 0.0356415 56   10.03 MM ( 0.01786 0.98214 ) *
##        9) LoyalCH > 0.0356415 114  108.90 MM ( 0.18421 0.81579 ) *
##      5) LoyalCH > 0.276142 180  245.20 MM ( 0.42222 0.57778 )  
##       10) PriceDiff < 0.05 74   74.61 MM ( 0.20270 0.79730 ) *
##       11) PriceDiff > 0.05 106  144.50 CH ( 0.57547 0.42453 ) *
##    3) LoyalCH > 0.5036 450  357.10 CH ( 0.86444 0.13556 )  
##      6) PriceDiff < -0.39 27   32.82 MM ( 0.29630 0.70370 ) *
##      7) PriceDiff > -0.39 423  273.70 CH ( 0.90071 0.09929 )  
##       14) LoyalCH < 0.705326 130  135.50 CH ( 0.78462 0.21538 )  
##         28) PriceDiff < 0.145 43   58.47 CH ( 0.58140 0.41860 ) *
##         29) PriceDiff > 0.145 87   62.07 CH ( 0.88506 0.11494 ) *
##       15) LoyalCH > 0.705326 293  112.50 CH ( 0.95222 0.04778 ) *

I am picking the terminal node 8 (because there is an asterisk at the end). The split criterion for that node is LoyalCH < 0.0356415 with 56 observations in the branch. The deviance of the branch is 10.03 with overall prediction for the branch of MM. Within that branch, MM takes approximately 81% of the observations and CH takes approximately 18% of the observations.

(d) Create a plot of the tree, and interpret the results.

plot(tree.oj.model)
text(tree.oj.model, pretty=0, cex = 0.5)

From the tree plot, we can identify that the LoyalCH is the most important variable followed by PriceDiff to predict the Purchase. The first branch differentiates the intensity of the Customer brand loyalty for CH (LoyalCH). The top 2 nodes are LoyalCH and the three node is PriceDiff. There are 8 terminal nodes and the number of terminal nodes are equally divided among CH and MM.

The first node split is that the customer less loyal to Citrus Hill (CH) orange juice to the left and more loyal to the right. On the more loyal to Citrus Hill side, the PriceDiff plays the role in dividing the nodes, that’s if the PriceDiff is less than -0.39, the customer prefers Minute Maid Orange Juice.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

Prediction and the confusion matrix are:

tree.oj.pred <- predict(tree.oj.model, oj.test, type = "class")
table(tree.oj.pred, oj.test$Purchase)

##             
## tree.oj.pred  CH  MM
##           CH 150  34
##           MM  16  70

\[Test Error rate = 1 - (150 + 70)/270 = 0.18518519\]

We can say the test error rate would be 18.5% (approximately).

(f) Apply the `cv.tree()` function to the training set in order to determine the optimal tree size.

Apply prune.misclass to the `cv.tree() function in order to determine the classification error rate.

set.seed(2)

cv.tree.model <- cv.tree(tree.oj.model, K = 10, FUN = prune.misclass)
cv.tree.model

## $size
## [1] 8 5 3 2 1
## 
## $dev
## [1] 138 138 162 165 313
## 
## $k
## [1] -Inf    0    8   11  154
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

#Construct the data frame with deviance and size 
df.cv.oj <- data.frame(Tree.Size = cv.tree.model$size,
                        CV.Error = cv.tree.model$dev/nrow(oj.train))
#Find the minimum Error rate
df.cv.oj <- mutate(df.cv.oj, min.cv.error = as.numeric(min(CV.Error) == CV.Error))
##Use the ggplot
ggplot(df.cv.oj, aes(x = Tree.Size, y = CV.Error)) +
  geom_line(col = "black") +
  geom_point(size = 2, aes(col = factor(min.cv.error))) +
  scale_x_continuous(breaks = seq(1, 10, 1)) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_color_manual(values = c("green", "red")) +
  theme(panel.background = element_rect(fill = 'white', colour = 'red'), legend.position = "none") +
  labs(title = "OJ Dataset - Cross Validation",
       x = "Tree size",
       y = "CV - Error")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

From the above (g) cross-validation tree plot, we find out tree terminal nodes 5 and 8 have the same cross-validation error rate. It is more obvious to select the model with 5 terminal nodes.

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

From the above cross-validation (h), the optimal tree size is 5.

prune.oj <- prune.misclass(tree.oj.model, best = 5)
plot(prune.oj)
text(prune.oj, pretty=0, cex = 0.5)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(tree.oj.model)

## 
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7625 = 603.9 / 792 
## Misclassification error rate: 0.165 = 132 / 800

summary(prune.oj)

## 
## Classification tree:
## snip.tree(tree = tree.oj.model, nodes = c(4L, 7L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.826 = 656.6 / 795 
## Misclassification error rate: 0.165 = 132 / 800

Since both the cross-validation error rate is same for the nodes 5 and 8 (as determined in (h)), it looks like the training error rates are same when compared between pruned and unpruned trees.

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

mean(predict(tree.oj.model, type = "class", newdata = oj.test) != oj.test$Purchase)

## [1] 0.1851852

mean(predict(prune.oj, type = "class", newdata = oj.test) != oj.test$Purchase)

## [1] 0.1851852

Since both the cross-validation error rate is same for the nodes 5 and 8 (as determined in (h)), it looks like we have the same test error rates as well between the pruned and unpruned trees.

Assignment 7

Alexis Chinnappan (UTSA ID: sez126)

8/2/2021

9. This problem involves the `OJ` data set which is part of the `ISLR` package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

(b) Fit a tree to the training data, with `Purchase` as the response and the other variables as predictors. Use the `summary()` function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

(d) Create a plot of the tree, and interpret the results.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

(f) Apply the `cv.tree()` function to the training set in order to determine the optimal tree size.

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

Assignment 7

Alexis Chinnappan (UTSA ID: sez126)

8/2/2021

8. In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

9. This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

(d) Create a plot of the tree, and interpret the results.

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

8. In the lab, a classification tree was applied to the `Carseats` data set after converting `Sales` into a qualitative response variable. Now we will seek to predict `Sales` using regression trees and related approaches, treating the response as a quantitative variable.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the `importance()` function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

9. This problem involves the `OJ` data set which is part of the `ISLR` package.

(b) Fit a tree to the training data, with `Purchase` as the response and the other variables as predictors. Use the `summary()` function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

(f) Apply the `cv.tree()` function to the training set in order to determine the optimal tree size.