Question 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.

Hint: In a setting with two classes, \(\hat{p}_{m1} = 1 - \hat{p}_{m2}\). You could make this plot by hand, but it will be much easier to make in R.
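For two classes, writing \(\hat{p} = \hat{p}_{m1}\), the three criteria reduce to

\[ G = 2\hat{p}(1-\hat{p}), \qquad E = 1 - \max(\hat{p},\, 1-\hat{p}), \qquad D = -\hat{p}\log\hat{p} - (1-\hat{p})\log(1-\hat{p}), \]

and these are exactly the quantities computed in the code below.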

p = seq(0, 1, 0.01)
gini.index = 2 * p * (1 - p)
class.err = 1 - pmax(p, 1 - p)
# log(0) gives NaN at the endpoints; matplot silently drops those two points
entropy = -(p * log(p) + (1 - p) * log(1 - p))
matplot(p, cbind(gini.index, class.err, entropy), col=c("dodgerblue", "darkorchid", "darkseagreen"), pch=16, main="Classification Trees", xlab=expression(hat(p)[m1]), ylab="Splitting Criterion")
legend("topright", pch=16, col=c("dodgerblue", "darkorchid", "darkseagreen"), legend=c("Gini Index", "Class Error", "Entropy"))

Question 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

  1. Split the data set into a training set and a test set.
library(ISLR)
set.seed(1)
data(Carseats)
trainingindex=sample(nrow(Carseats), 0.75*nrow(Carseats))
head(trainingindex)
## [1] 107 149 228 361  80 355
Car.train<-Carseats[trainingindex, ]
Car.test<-Carseats[-trainingindex, ]
  2. Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
library(tree)
reg.tree=tree(Sales~., data=Car.train)
summary(reg.tree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = Car.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "Age"         "Advertising" "Population" 
## [6] "CompPrice"  
## Number of terminal nodes:  15 
## Residual mean deviance:  2.849 = 812 / 285 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -4.12700 -1.04800 -0.01051  0.00000  1.03200  5.15700

The summary of the tree lists the variables used as internal nodes and gives the number of terminal nodes. The variables used are ShelveLoc, Price, Age, Advertising, Population, and CompPrice.

plot(reg.tree)
text(reg.tree, pretty=0)

Since the text summary of the tree can be hard to decipher, the tree is plotted above. A prettier version of this tree can be created with the rpart.plot() function, as sketched below. Finding the subtree with minimum deviance takes a further step; cross-validation is used for that in the next part.
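Here is a minimal sketch of that prettier plot, assuming the rpart and rpart.plot packages are installed (rpart.plot() draws rpart objects, so the tree is refit with rpart()):

library(rpart)
library(rpart.plot)
# Refit an analogous CART regression tree and draw a cleaner rendering of it
rpart.tree = rpart(Sales ~ ., data = Car.train)
rpart.plot(rpart.tree)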

yhat=predict(reg.tree, newdata = Car.test)
mean((yhat-Car.test$Sales)^2)
## [1] 5.491716

The test MSE obtained from this regression tree is 5.491716.

  3. Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
library(tree)
reg.tree=tree(Sales ~ ., data = Car.train)
cv.car=cv.tree(reg.tree)
plot(cv.car$size, cv.car$dev, type = "b")

In this case, I see that the tree size chosen by cross-validation is 7. I will prune the tree to obtain the 7-node tree.

prune.car=prune.tree(reg.tree, best=7)
plot(prune.car)
text(prune.car, pretty=0)

yhat=predict(prune.car, newdata = Car.test)
mean((yhat-Car.test$Sales)^2)
## [1] 5.419728

The test MSE of the pruned tree is 5.419728, quite close to that of the unpruned tree (5.491716).
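To see how pruning affects the test MSE more systematically, here is a short sketch (using the reg.tree and Car.test objects from above) that computes the test MSE for every candidate subtree size:

# Test MSE for pruned subtrees with 2 through 15 terminal nodes
sizes = 2:15
test.mse = sapply(sizes, function(b) {
  pruned = prune.tree(reg.tree, best = b)
  mean((predict(pruned, newdata = Car.test) - Car.test$Sales)^2)
})
plot(sizes, test.mse, type = "b", xlab = "Terminal Nodes", ylab = "Test MSE")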

  4. Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
bag.car=randomForest(Sales~., data=Car.train, mtry=10, importance=TRUE)
yhat.bag=predict(bag.car, newdata = Car.test)
mean((yhat.bag-Car.test$Sales)^2)
## [1] 2.209486
importance(bag.car)
##               %IncMSE IncNodePurity
## CompPrice   30.260612     228.92041
## Income       8.102690     116.70390
## Advertising 21.884122     190.20770
## Population   3.621385      95.10519
## Price       72.265798     731.37342
## ShelveLoc   76.197166     655.36199
## Age         22.506044     218.59352
## Education    2.551942      63.46486
## Urban       -3.905544      10.01294
## US           4.222018      13.39688
varImpPlot(bag.car)

The most important variables are ShelveLoc and Price. The test MSE from the bagged regression tree is 2.209486, which is markedly smaller than the 5.419728 obtained from the pruned single tree.

  5. Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
library(randomForest)
set.seed(1)
rf.car=randomForest(Sales~., data=Car.train, mtry=3, importance=TRUE)
yhat.rf=predict(rf.car, newdata=Car.test)
mean((yhat.rf-Car.test$Sales)^2)
## [1] 2.683383

The random forest test MSE with m = 3 is 2.683383, so this choice of m did not outperform bagging (m = p = 10). Reducing m decorrelates the trees, but each individual tree also becomes weaker, and for this data set using all ten predictors at each split worked better.
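To describe the effect of m directly, a short sketch can refit the forest for each value of mtry and compare the resulting test MSEs (the exact values will depend on the random seed):

# Test MSE as a function of m, the number of predictors tried at each split
set.seed(1)
m.values = 1:10  # Carseats has 10 predictors
rf.mse = sapply(m.values, function(m) {
  fit = randomForest(Sales ~ ., data = Car.train, mtry = m)
  mean((predict(fit, newdata = Car.test) - Car.test$Sales)^2)
})
plot(m.values, rf.mse, type = "b", xlab = "m (mtry)", ylab = "Test MSE")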

Question 9

This problem involves the OJ data set which is part of the ISLR package.

library(ISLR)
data("OJ")
summary(OJ)
##  Purchase WeekofPurchase     StoreID        PriceCH         PriceMM     
##  CH:653   Min.   :227.0   Min.   :1.00   Min.   :1.690   Min.   :1.690  
##  MM:417   1st Qu.:240.0   1st Qu.:2.00   1st Qu.:1.790   1st Qu.:1.990  
##           Median :257.0   Median :3.00   Median :1.860   Median :2.090  
##           Mean   :254.4   Mean   :3.96   Mean   :1.867   Mean   :2.085  
##           3rd Qu.:268.0   3rd Qu.:7.00   3rd Qu.:1.990   3rd Qu.:2.180  
##           Max.   :278.0   Max.   :7.00   Max.   :2.090   Max.   :2.290  
##      DiscCH            DiscMM         SpecialCH        SpecialMM     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.05186   Mean   :0.1234   Mean   :0.1477   Mean   :0.1617  
##  3rd Qu.:0.00000   3rd Qu.:0.2300   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.0000   Max.   :1.0000  
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff      
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700  
##  1st Qu.:0.325257   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000  
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2300  
##  Mean   :0.565782   Mean   :1.962   Mean   :1.816   Mean   : 0.1465  
##  3rd Qu.:0.850873   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3200  
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400  
##  Store7      PctDiscMM        PctDiscCH       ListPriceDiff  
##  No :714   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  Yes:356   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.140  
##            Median :0.0000   Median :0.00000   Median :0.240  
##            Mean   :0.0593   Mean   :0.02731   Mean   :0.218  
##            3rd Qu.:0.1127   3rd Qu.:0.00000   3rd Qu.:0.300  
##            Max.   :0.4020   Max.   :0.25269   Max.   :0.440  
##      STORE      
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :2.000  
##  Mean   :1.631  
##  3rd Qu.:3.000  
##  Max.   :4.000
  1. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)
OJ.trainingindex=sample(dim(OJ)[1], 800)
OJ.train=OJ[OJ.trainingindex,]
OJ.test=OJ[-OJ.trainingindex,]
  2. Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
OJ.regtree=tree(Purchase~., data=OJ.train)
summary(OJ.regtree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7305 = 578.6 / 792 
## Misclassification error rate: 0.165 = 132 / 800

The classification tree for the OJ data uses four variables (LoyalCH, PriceDiff, SpecialCH, and ListPriceDiff). It has 8 terminal nodes and a training error rate of 0.165.

  3. Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
OJ.regtree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1064.00 CH ( 0.61750 0.38250 )  
##    2) LoyalCH < 0.508643 350  409.30 MM ( 0.27143 0.72857 )  
##      4) LoyalCH < 0.264232 166  122.10 MM ( 0.12048 0.87952 )  
##        8) LoyalCH < 0.0356415 57   10.07 MM ( 0.01754 0.98246 ) *
##        9) LoyalCH > 0.0356415 109  100.90 MM ( 0.17431 0.82569 ) *
##      5) LoyalCH > 0.264232 184  248.80 MM ( 0.40761 0.59239 )  
##       10) PriceDiff < 0.195 83   91.66 MM ( 0.24096 0.75904 )  
##         20) SpecialCH < 0.5 70   60.89 MM ( 0.15714 0.84286 ) *
##         21) SpecialCH > 0.5 13   16.05 CH ( 0.69231 0.30769 ) *
##       11) PriceDiff > 0.195 101  139.20 CH ( 0.54455 0.45545 ) *
##    3) LoyalCH > 0.508643 450  318.10 CH ( 0.88667 0.11333 )  
##      6) LoyalCH < 0.764572 172  188.90 CH ( 0.76163 0.23837 )  
##       12) ListPriceDiff < 0.235 70   95.61 CH ( 0.57143 0.42857 ) *
##       13) ListPriceDiff > 0.235 102   69.76 CH ( 0.89216 0.10784 ) *
##      7) LoyalCH > 0.764572 278   86.14 CH ( 0.96403 0.03597 ) *

I will interpret node 7 (a terminal node, as indicated by the asterisk). Starting from the root, if LoyalCH > 0.508643 we reach node 3, which then tests whether LoyalCH is above or below 0.764572. Node 7 is reached when LoyalCH > 0.764572, and no further splits are made. It contains 278 observations with a deviance of 86.14; about 3.6% of those observations take the value MM and over 96% take the value CH, so the predicted class for this node is CH.
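As a side note, the same numbers can be pulled out programmatically; this sketch assumes (as is the case for tree objects) that the row names of the frame component are the node numbers:

# Node 7's size, deviance, and predicted class, extracted from the tree frame
OJ.regtree$frame["7", c("n", "dev", "yval")]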

  4. Create a plot of the tree, and interpret the results.
plot(OJ.regtree)
text(OJ.regtree, pretty=0)

The plot shows that brand loyalty dominates the tree: the top splits are all on LoyalCH, and the price-based variables (PriceDiff, ListPriceDiff) only refine the predictions further down. SpecialCH appears once, deep in the low-loyalty branch.

  5. Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
OJtree.pred=predict(OJ.regtree, newdata = OJ.test, type="class")
table(OJtree.pred, OJ.test$Purchase)
##            
## OJtree.pred  CH  MM
##          CH 147  49
##          MM  12  62
(147+62)/270
## [1] 0.7740741

Based on the OJ.regtree model, the test observations are correctly predicted 77.41% of the time, giving a test error rate of approximately 22.6%.
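Equivalently, the test error rate can be computed in one line from the predictions above:

# Direct test error rate; equals (49 + 12) / 270 = 0.2259259
mean(OJtree.pred != OJ.test$Purchase)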

  6. Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv.OJtree=cv.tree(OJ.regtree, FUN=prune.misclass)
cv.OJtree
## $size
## [1] 8 5 2 1
## 
## $dev
## [1] 156 156 156 306
## 
## $k
## [1]       -Inf   0.000000   4.666667 160.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"
plot(cv.OJtree)

  7. Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.OJtree$size, cv.OJtree$dev, type="b", xlab="Tree Size", ylab="CV Classification Errors")

  8. Which tree size corresponds to the lowest cross-validated classification error rate?

From the CV output and the previous two plots, the lowest cross-validated classification error rate (156 misclassifications) is shared by tree sizes 8, 5, and 2; the smallest of these, size 2, is chosen.
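This can also be read off programmatically from the cv.OJtree object; with ties, taking the smallest tied size gives the most interpretable tree:

# Smallest tree size attaining the minimum CV error (sizes 8, 5, and 2 tie at 156)
min(cv.OJtree$size[cv.OJtree$dev == min(cv.OJtree$dev)])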

  9. Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
prune.OJtree=prune.misclass(OJ.regtree, best=2)
plot(prune.OJtree)
text(prune.OJtree, pretty=0)

  10. Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(OJ.regtree)
## 
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "SpecialCH"     "ListPriceDiff"
## Number of terminal nodes:  8 
## Residual mean deviance:  0.7305 = 578.6 / 792 
## Misclassification error rate: 0.165 = 132 / 800
summary(prune.OJtree)
## 
## Classification tree:
## snip.tree(tree = OJ.regtree, nodes = c(3L, 2L))
## Variables actually used in tree construction:
## [1] "LoyalCH"
## Number of terminal nodes:  2 
## Residual mean deviance:  0.9115 = 727.4 / 798 
## Misclassification error rate: 0.1825 = 146 / 800

The training error rate for the pruned tree (0.1825) is a little higher than that of the unpruned tree (0.165). This is expected: with a single split on LoyalCH, the pruned tree has fewer decision rules with which to separate CH from MM purchasers.
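The same comparison can be made directly by recomputing each tree's training error rate from its fitted classes (predict() with no newdata returns the fitted values on the training set):

# Training error rates, matching the summaries above (0.165 and 0.1825)
mean(predict(OJ.regtree, type = "class") != OJ.train$Purchase)
mean(predict(prune.OJtree, type = "class") != OJ.train$Purchase)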

  11. Compare the test error rates between the pruned and unpruned trees. Which is higher?
prune.OJtreepred=predict(prune.OJtree, OJ.test, type="class")
table(prune.OJtreepred, OJ.test$Purchase)
##                 
## prune.OJtreepred  CH  MM
##               CH 119  30
##               MM  40  81
# Unpruned tree test error rate
1-((147+62)/270)
## [1] 0.2259259
# Pruned tree test error rate
1-((119+81)/270)
## [1] 0.2592593

For this problem, the pruned tree produced a higher test error rate, about 25.9% versus 22.6% for the unpruned tree. However, the pruned tree is much easier to interpret, since it needs only a single split on LoyalCH to decide which brand a customer will purchase.