# Grid of probabilities for one of the two classes
p = seq(0, 1, 0.01)
# Gini index, cross-entropy and classification error for a binary response
gini = 2 * p * (1 - p)
entropy = -(p * log(p) + (1 - p) * log(1 - p))
class_error = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class_error), col = c("red", "grey", "orange"))
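The default matplot call draws unlabelled points, so the three curves are hard to tell apart. A minimal sketch that redraws them as labelled lines, reusing the vectors defined above:
matplot(p, cbind(gini, entropy, class_error), type = "l", lty = 1,
        col = c("red", "grey", "orange"), xlab = "p", ylab = "Impurity")
legend("topright", legend = c("Gini index", "Cross-entropy", "Classification error"),
       col = c("red", "grey", "orange"), lty = 1)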
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.2
attach(Carseats)
library(tree)
## Warning: package 'tree' was built under R version 4.0.2
set.seed(1)
train = sample(dim(Carseats)[1], dim(Carseats)[1]/2)
Carseats.train = Carseats[train, ]
Carseats.test = Carseats[-train,]
tree.carseats = tree(Sales ~ ., data = Carseats.train)
summary(tree.carseats)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(tree.carseats)
text(tree.carseats, pretty = 0)
pred.carseats = predict( tree.carseats, Carseats.test)
mean(( Carseats.test$Sales - pred.carseats)^2)
## [1] 4.922039
The test MSE of the unpruned tree is 4.922.
cv.carseats = cv.tree(tree.carseats, FUN = prune.tree )
par(mfrow = c(1,2))
plot(cv.carseats$size, cv.carseats$dev, type = "b")
plot(cv.carseats$k, cv.carseats$dev, type = "b")
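Rather than reading the best size off the plots, the tree size that minimizes the cross-validated deviance can be extracted directly from cv.carseats (a one-line check; the value depends on the CV folds):
cv.carseats$size[which.min(cv.carseats$dev)]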
pruned.carseats = prune.tree(tree.carseats, best = 9)
par(mfrow = c(1,1))
plot(pruned.carseats)
text(pruned.carseats, pretty = 0)
pred.pruned = predict(pruned.carseats, Carseats.test)
mean((Carseats.test$Sales - pred.pruned)^2)
## [1] 4.918134
After pruning, the test MSE improves slightly to 4.918.
d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
# Bagging: use all 10 predictors at each split (mtry = 10)
bag.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 10, ntree = 500, importance = TRUE)
bag.pred = predict(bag.carseats, Carseats.test)
mean((Carseats.test$Sales - bag.pred)^2)
## [1] 2.622915
importance(bag.carseats)
## IncNodePurity
## CompPrice 169.829553
## Income 91.464308
## Advertising 102.156091
## Population 59.921849
## Price 495.642493
## ShelveLoc 381.170255
## Age 152.330816
## Education 45.600313
## Urban 9.229808
## US 15.470199
Bagging improves the test MSE to 2.62, and the importance() output above shows that Price, ShelveLoc and Age are the most important predictors of Sales.
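The same ranking can be shown graphically with randomForest's varImpPlot() function, applied to the bag.carseats fit above:
varImpPlot(bag.carseats)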
# Random forest: consider 5 of the 10 predictors at each split
rf.carseats = randomForest(Sales ~ ., data = Carseats.train, mtry = 5, ntree = 500, importance = TRUE)
rf.pred = predict(rf.carseats, Carseats.test)
mean((Carseats.test$Sales - rf.pred)^2)
## [1] 2.732066
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 19.4239781 159.18237
## Income 4.8774271 110.16804
## Advertising 10.0913784 105.89067
## Population -2.4916697 80.15222
## Price 44.8465403 456.01518
## ShelveLoc 43.2983207 329.47794
## Age 13.4429336 169.72365
## Education 1.4438055 58.56204
## Urban -0.5453909 12.05602
## US 6.7127513 26.31279
With mtry = 5, the random forest gives a slightly higher test MSE (2.73) than bagging. Again, Price, ShelveLoc and Age are the three most important predictors of Sales.
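Since bagging (mtry = 10) and this random forest (mtry = 5) give slightly different test MSEs, it can be instructive to vary mtry and track the test error. A minimal sketch (the exact curve will depend on the random seed):
test.mse = rep(NA, 10)
for (m in 1:10) {
  rf.m = randomForest(Sales ~ ., data = Carseats.train, mtry = m, ntree = 500)
  test.mse[m] = mean((Carseats.test$Sales - predict(rf.m, Carseats.test))^2)
}
plot(1:10, test.mse, type = "b", xlab = "mtry", ylab = "Test MSE")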
This problem involves the OJ data set, which is part of the ISLR package. a. Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
attach(OJ)
set.seed(1013)
train = sample(dim(OJ)[1], 800)
OJ.train = OJ[train, ]
OJ.test = OJ[-train, ]
oj.tree = tree(Purchase ~., data = OJ.train)
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "SalePriceMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7564 = 599.8 / 793
## Misclassification error rate: 0.1612 = 129 / 800
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1069.00 CH ( 0.61125 0.38875 )
##   2) LoyalCH < 0.5036 344 407.30 MM ( 0.27907 0.72093 )
##     4) LoyalCH < 0.276142 163 121.40 MM ( 0.12270 0.87730 ) *
##     5) LoyalCH > 0.276142 181 246.30 MM ( 0.41989 0.58011 )
##       10) PriceDiff < 0.065 75 75.06 MM ( 0.20000 0.80000 ) *
##       11) PriceDiff > 0.065 106 144.50 CH ( 0.57547 0.42453 ) *
##   3) LoyalCH > 0.5036 456 366.30 CH ( 0.86184 0.13816 )
##     6) LoyalCH < 0.753545 189 224.30 CH ( 0.71958 0.28042 )
##       12) ListPriceDiff < 0.235 79 109.40 MM ( 0.48101 0.51899 )
##         24) SalePriceMM < 1.64 22 20.86 MM ( 0.18182 0.81818 ) *
##         25) SalePriceMM > 1.64 57 76.88 CH ( 0.59649 0.40351 ) *
##       13) ListPriceDiff > 0.235 110 75.81 CH ( 0.89091 0.10909 ) *
##   7) LoyalCH > 0.753545 267 85.31 CH ( 0.96255 0.03745 ) *
Picking a node at random, for example terminal node 25) SalePriceMM: the splitting criterion is SalePriceMM > 1.64, and 57 observations fall in this terminal node. The predicted class is CH, and about 60% of the observations in the node (0.59649 × 57 ≈ 34) have Purchase equal to CH.
plot(oj.tree)
text(oj.tree, pretty = 0 )
LoyalCH is the most important variable in this tree; the root and the next two levels all split on it. If LoyalCH < 0.27 the tree predicts MM, and if LoyalCH > 0.753 it predicts CH. For intermediate values of LoyalCH, the prediction also depends on PriceDiff (and on ListPriceDiff / SalePriceMM).
oj.pred = predict(oj.tree, OJ.test, type = "class")
table(OJ.test$Purchase, oj.pred)
## oj.pred
## CH MM
## CH 149 15
## MM 30 76
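The test error rate implied by this confusion matrix is (15 + 30) / 270 ≈ 0.167; it can also be computed directly from the predictions:
mean(OJ.test$Purchase != oj.pred)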
cv.oj = cv.tree(oj.tree, FUN = prune.tree)
plot(cv.oj$size, cv.oj$dev, type= "b", xlab = "Tree Size", ylab = "Deviance")
h. Which tree size corresponds to the lowest cross-validated classification error rate? Size 5 gives the lowest cross-validation error.
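Note that cv.tree() was called with FUN = prune.tree, so dev above is the deviance. To cross-validate on the classification error rate directly, prune.misclass can be used instead; a small sketch (results depend on the CV folds):
cv.oj.misclass = cv.tree(oj.tree, FUN = prune.misclass)
plot(cv.oj.misclass$size, cv.oj.misclass$dev, type = "b",
     xlab = "Tree Size", ylab = "CV misclassifications")
cv.oj.misclass$size[which.min(cv.oj.misclass$dev)]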
oj.pruned = prune.tree(oj.tree, best = 6)
summary(oj.pruned)
##
## Classification tree:
## snip.tree(tree = oj.tree, nodes = 12L)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 6
## Residual mean deviance: 0.7701 = 611.5 / 794
## Misclassification error rate: 0.175 = 140 / 800
For the original tree the training misclassification error rate is 0.1612; after pruning it increases slightly to 0.175.
pred.unpruned = predict(oj.tree, OJ.test, type = "class")
misclass.unpruned = sum(OJ.test$Purchase != pred.unpruned)
misclass.unpruned / length(pred.unpruned)
## [1] 0.1666667
pred.pruned = predict(oj.pruned, OJ.test, type = "class")
misclass.pruned = sum(OJ.test$Purchase != pred.pruned)
misclass.pruned / length(pred.pruned)
## [1] 0.1666667
The pruned and unpruned trees have the same test error rate of about 0.167.