Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.
The Gini index is a measure of the total variance across the K classes, and it takes a small value when each p̂mk (the proportion of training observations in the mth region that are from the kth class) is close to either 0 or 1; the same can be said for the entropy. The Gini index is the sum over the K classes of p̂mk multiplied by its complement, G = Σ_k p̂mk(1 − p̂mk), while the entropy is the negative sum over the K classes of p̂mk multiplied by its logarithm, D = −Σ_k p̂mk log p̂mk. The classification error is simply the fraction of training observations in the region that do NOT belong to the most common class, E = 1 − max_k p̂mk. With two classes and p̂ = p̂m1 (so p̂m2 = 1 − p̂), these become G = 2p̂(1 − p̂), D = −p̂ log p̂ − (1 − p̂) log(1 − p̂), and E = 1 − max(p̂, 1 − p̂), which is what the code below plots.
p <- seq(0, 1, 0.01)
gini_index <- 2 * p * (1 - p) #this is because we have 2 classes
class_error <- 1 - pmax(p, 1 - p) #pmax takes the element-wise maximum of p and 1 - p
entropy <- - (p * log(p) + (1 - p) * log(1 - p))
par(bg = "orchid3")
matplot(p, cbind(gini_index, class_error, entropy), type = "b", pch = c(15, 17, 19),
        col = c("ghostwhite", "gray0", "gray"),
        ylab = "gini.index, class.error, cross.entropy")
legend("bottom", inset = 0.01, legend = c("gini.index", "class.error", "cross.entropy"),
       col = c("ghostwhite", "gray0", "gray"), pch = c(15, 17, 19))
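Note that at p̂m1 = 0 or 1 the entropy expression above evaluates to NaN (0 * log(0)), so matplot silently drops those endpoints. A minimal sketch that defines 0 · log 0 as 0, using a hypothetical helper safe_entropy() (not from any package):
safe_entropy <- function(p) {
  # treat 0 * log(0) as 0 so the curve is defined at p = 0 and p = 1 as well
  term <- function(q) ifelse(q == 0, 0, q * log(q))
  -(term(p) + term(1 - p))
}
entropy <- safe_entropy(p)  # can be used in place of the entropy vector above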
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
library(ISLR)          # Carseats and OJ data sets (if not already loaded earlier)
library(tree)          # tree(), cv.tree(), prune.tree()
library(randomForest)  # bagging and random forests
set.seed(24)
train_division <- sample(1:nrow(Carseats), nrow(Carseats) * 0.75)
training_carseats <- Carseats[train_division, ]
test_carseat <- Carseats[-train_division, ]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree_carseats <- tree(Sales ~ ., data = training_carseats)
plot(tree_carseats)
text(tree_carseats, pretty = 1)
To obtain the test MSE, we compute the predicted sales for the test data, subtract the observed sales, square the differences, and average them. Thus, we do the following:
yhat_tree <- predict(tree_carseats, test_carseat)
obs_sales <- test_carseat$Sales
mean((yhat_tree - obs_sales)^2)
## [1] 5.064223
Therefore, our test MSE is about 5.064.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cross_validation_carseats <- cv.tree(tree_carseats)
cross_validation_carseats
## $size
## [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1466.526 1495.742 1548.589 1511.621 1501.165 1546.583 1530.132 1538.571
## [9] 1512.517 1533.302 1542.787 1574.725 1512.818 1512.818 1469.126 1498.218
## [17] 1706.665 1708.358 1856.341 2330.926
##
## $k
## [1] -Inf 24.58710 28.41399 30.78746 32.30501 35.19482 36.22118
## [8] 36.60542 37.96271 38.27342 40.64363 43.52142 54.20696 54.49613
## [15] 63.35705 83.46101 130.39388 132.78774 232.13313 537.53515
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cross_validation_carseats$size, cross_validation_carseats$dev,
xlab = "size",
ylab = "cv error")
abline(h = min(cross_validation_carseats$dev) + 0.2*sd(cross_validation_carseats$dev))
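The cross-validation output above has its lowest deviance at the full tree (size 20), with size 6 nearly as low (1469.1 vs. 1466.5), so pruning is not expected to improve the test MSE by much. A sketch of how one could check directly, pruning to 6 terminal nodes and recomputing the test MSE (results not reproduced here):
pruned_carseats <- prune.tree(tree_carseats, best = 6)  # prune to 6 terminal nodes
plot(pruned_carseats)
text(pruned_carseats, pretty = 1)
yhat_pruned <- predict(pruned_carseats, test_carseat)
mean((yhat_pruned - test_carseat$Sales)^2)  # compare with the unpruned test MSE of ~5.06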
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
bag_carseats <- randomForest(Sales ~ ., data = training_carseats, mtry = 10, ntree = 500,
                             importance = TRUE)
bag_pred <- predict(bag_carseats, test_carseat)
mean((test_carseat$Sales - bag_pred)^2)
## [1] 2.659297
Bagging gives a test MSE of about 2.66, a substantial improvement over the single regression tree's test MSE of about 5.06.
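Part (d) also asks for the importance() function. A minimal sketch for the bagged fit above; its numeric output is essentially the same as the importance table shown for the mtry = 10 fit in part (e) below:
importance(bag_carseats)  # %IncMSE and IncNodePurity for each predictor
varImpPlot(bag_carseats)  # dot plot of the same importance measures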
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
set.seed(24)
bag_carseats<-randomForest(Sales ~ ., data = training_carseats, mtry = 10, importance = TRUE)
yhat_bag<-predict(bag_carseats, newdata = test_carseat)
mean((yhat_bag - test_carseat$Sales)^2)
## [1] 2.651329
importance(bag_carseats)
## %IncMSE IncNodePurity
## CompPrice 36.5403195 256.47385
## Income 7.0174602 110.17012
## Advertising 20.2839775 171.36030
## Population 0.6991205 87.06898
## Price 75.2681369 648.05673
## ShelveLoc 68.1472870 618.35931
## Age 29.8947733 270.79924
## Education 6.3758695 68.78396
## Urban -1.7614733 10.72299
## US 1.9791749 10.25786
The test MSE obtained by this fit is about 2.65, almost the same as with the bagging approach; this is expected, since mtry = 10 means all ten predictors are considered at every split, which is exactly bagging. According to the importance() output, the most important variables are Price and ShelveLoc, followed by CompPrice and Age. The sketch below shows how one could vary m to study its effect on the error.
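Because mtry = 10 uses every one of the ten predictors at each split, the fit above is effectively bagging rather than a random forest with m < p. A sketch (results not reproduced here) of how the test MSE could be traced out as a function of m, the number of predictors considered at each split:
set.seed(24)
# refit the forest for each value of m and record the resulting test MSE
test_mse <- sapply(1:10, function(m) {
  rf_fit <- randomForest(Sales ~ ., data = training_carseats, mtry = m)
  yhat <- predict(rf_fit, newdata = test_carseat)
  mean((yhat - test_carseat$Sales)^2)
})
plot(1:10, test_mse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")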
This problem involves the OJ data set which is part of the ISLR package. (a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
oj <- ISLR::OJ
train_division_oj <- sample(1:nrow(oj), nrow(oj)*0.75) #0.75 of 1070 gives 802 observations, close to the requested 800
training_oj <- oj[train_division_oj, ]
test_oj <- oj[-train_division_oj, ]
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
tree_oj <- tree(Purchase ~ ., data = training_oj)
summary(tree_oj)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = training_oj)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 8
## Residual mean deviance: 0.7264 = 576.8 / 794
## Misclassification error rate: 0.1534 = 123 / 802
The tree has 8 terminal nodes and uses only 2 variables: LoyalCH and PriceDiff. The training error rate is 0.1534 (123 of the 802 training observations are misclassified).
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
tree_oj
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 802 1065.00 CH ( 0.61970 0.38030 )
## 2) LoyalCH < 0.48285 298 322.50 MM ( 0.23154 0.76846 )
## 4) LoyalCH < 0.0356415 52 0.00 MM ( 0.00000 1.00000 ) *
## 5) LoyalCH > 0.0356415 246 292.00 MM ( 0.28049 0.71951 )
## 10) PriceDiff < 0.195 111 95.05 MM ( 0.15315 0.84685 ) *
## 11) PriceDiff > 0.195 135 180.00 MM ( 0.38519 0.61481 )
## 22) LoyalCH < 0.277977 63 64.14 MM ( 0.20635 0.79365 ) *
## 23) LoyalCH > 0.277977 72 99.31 CH ( 0.54167 0.45833 ) *
## 3) LoyalCH > 0.48285 504 427.50 CH ( 0.84921 0.15079 )
## 6) LoyalCH < 0.764572 233 275.90 CH ( 0.72103 0.27897 )
## 12) PriceDiff < -0.165 34 39.30 MM ( 0.26471 0.73529 ) *
## 13) PriceDiff > -0.165 199 199.70 CH ( 0.79899 0.20101 )
## 26) PriceDiff < 0.265 111 133.30 CH ( 0.71171 0.28829 ) *
## 27) PriceDiff > 0.265 88 53.62 CH ( 0.90909 0.09091 ) *
## 7) LoyalCH > 0.764572 271 92.04 CH ( 0.95941 0.04059 ) *
We will interpret the node labelled “12”. The splitting variable for this node is PriceDiff and the split value is -0.165. The node contains 34 training observations and its deviance is about 39.3. The * at the end of the line denotes that it is a terminal node, and the prediction at this node is MM. About 26% of the observations in this node have Purchase equal to CH, while the remaining 74% have Purchase equal to MM.
(d) Create a plot of the tree, and interpret the results.
plot(tree_oj)
text(tree_oj, pretty = 1)
LoyalCH is the most important variable in the tree; in fact, the top three splits all use LoyalCH. If LoyalCH < 0.035, the tree predicts MM. If LoyalCH > 0.76, the tree predicts CH. For intermediate values of LoyalCH, the decision also depends on the value of PriceDiff.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
oj_pred = predict(tree_oj, test_oj, type = "class")
table(test_oj$Purchase, oj_pred)
## oj_pred
## CH MM
## CH 137 19
## MM 37 75
The test error rate for this model is the number of off-diagonal entries divided by the total: (19 + 37) / 268 = 56 / 268 ≈ 0.209, so the test error rate is about 21%.
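Equivalently, the test error rate can be computed directly from the confusion matrix rather than by hand:
conf_matrix <- table(test_oj$Purchase, oj_pred)
1 - sum(diag(conf_matrix)) / sum(conf_matrix)  # (19 + 37) / 268, about 0.209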
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv_oj = cv.tree(tree_oj, FUN = prune.tree)
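Since Purchase is qualitative, cv.tree() can instead be asked to guide pruning by the classification error rate; the plot below uses the deviance from the call above. A sketch of the alternative:
# FUN = prune.misclass makes cv.tree() report cross-validated misclassification
# counts in $dev instead of the deviance
cv_oj_misclass <- cv.tree(tree_oj, FUN = prune.misclass)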
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv_oj$size, cv_oj$dev, type = "b", xlab = "Tree Size", ylab = "Deviance")
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
The tree size with the lowest cross-validated error (deviance, in this case) is 7.
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj_pruned = prune.tree(tree_oj, best = 7)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(oj_pruned)
##
## Classification tree:
## snip.tree(tree = tree_oj, nodes = 13L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7416 = 589.6 / 795
## Misclassification error rate: 0.1534 = 123 / 802
As we can see from the pruned tree summary, the training error rate is exactly the same as for the unpruned tree: 0.1534 (123/802). Neither is higher, although the residual mean deviance of the pruned tree is slightly larger (0.7416 vs. 0.7264), as expected when nodes are removed.
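A quick way to confirm this from the fitted objects, assuming the misclass component that summary.tree() returns for classification trees (the misclassification count followed by the number of training observations):
summary(tree_oj)$misclass    # unpruned tree: 123 misclassified out of 802
summary(oj_pruned)$misclass  # pruned tree:   123 misclassified out of 802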
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
pred_unpruned = predict(tree_oj, test_oj, type = "class")
misclass_unpruned = sum(test_oj$Purchase != pred_unpruned)
misclass_unpruned/length(pred_unpruned)
## [1] 0.2089552
pred_pruned = predict(oj_pruned, test_oj, type = "class")
misclass_pruned = sum(test_oj$Purchase != pred_pruned)
misclass_pruned/length(pred_pruned)
## [1] 0.2089552
As we can see, the test error rate for the pruned and unpruned trees is exactly the same: about 0.209 (20.9%). Pruning to 7 terminal nodes therefore neither helps nor hurts the test error here.
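For reference, the two test error rates can be collected side by side from the objects computed above:
data.frame(tree = c("unpruned", "pruned"),
           test_error = c(misclass_unpruned / length(pred_unpruned),
                          misclass_pruned / length(pred_pruned)))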