data("Carseats")
data("OJ")
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of ˆpm1. The x-axis should display ˆpm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy.
Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.
x1=seq(0, 1, .01)
x2=1-x1
c=1-pmax(x1, x2)
g=x1*x2*2
e=-(x1*log(x1))-(x2*log(x2))
plot(x1, c, type="l", col="Blue", ylab="Value", xlab="pm1", ylim=c(min(0), max(1)))
lines(x1, g, col="Red")
lines(x1, e, col="Green")
text(.5, .35, "Classification", col="Blue")
text(.5, .55, "Gini", col="Red")
text(.5, .75, "Entropy", col="Green")
grid()
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
set.seed(1)
train=sample(1:nrow(Carseats),200)
test=Carseats[-train,]
regtree=tree(Sales~., data=Carseats, subset=train)
plot(regtree)
text(regtree, pretty=0, cex=.5)
summary(regtree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
pred=predict(regtree, test)
error=mean((test$Sales-pred)^2)
error
## [1] 4.922039
From the results we can see that we only use 6 variables and the test MSE is 4.922039.
cv=cv.tree(regtree, FUN=prune.tree)
plot(cv$size, cv$dev, type="b")
p=prune.tree(regtree, best=12)
plot(p)
text(p, pretty=0, cex=.5)
pred=predict(p, test)
error=mean((test$Sales-pred)^2)
error
## [1] 4.966929
We can see that using 12 as the optimal level and pruning the model actually raises the MSE to 4.966929.
set.seed (1)
bag=randomForest(Sales~., data=Carseats, subset=train, mtry=10, importance=TRUE)
pred=predict(bag, test)
error=mean((test$Sales-pred)^2)
error
## [1] 2.605253
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 24.8888481 170.182937
## Income 4.7121131 91.264880
## Advertising 12.7692401 97.164338
## Population -1.8074075 58.244596
## Price 56.3326252 502.903407
## ShelveLoc 48.8886689 380.032715
## Age 17.7275460 157.846774
## Education 0.5962186 44.598731
## Urban 0.1728373 9.822082
## US 4.2172102 18.073863
The test MSE for bagging is 2.605253.
bag=randomForest(Sales~., data=Carseats, subset=train, mtry=3, importance=TRUE)
pred=predict(bag, test)
error=mean((test$Sales-pred)^2)
error
## [1] 3.054306
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 12.9540442 157.53376
## Income 2.1683293 129.18612
## Advertising 8.7289900 111.38250
## Population -2.5290493 102.78681
## Price 33.9482500 393.61313
## ShelveLoc 34.1358807 289.28756
## Age 12.0804387 172.03776
## Education 0.2213600 72.02479
## Urban 0.9793293 14.73763
## US 4.1072742 33.91622
bag=randomForest(Sales~., data=Carseats, subset=train, mtry=5, importance=TRUE)
pred=predict(bag, test)
error=mean((test$Sales-pred)^2)
error
## [1] 2.736102
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 18.6728671 158.17550
## Income 3.1491322 109.09946
## Advertising 10.8214206 108.75115
## Population -1.5217588 76.78543
## Price 48.0253398 445.75915
## ShelveLoc 41.7681844 348.48696
## Age 14.9304083 166.65241
## Education 0.6073698 58.68779
## Urban 0.6843160 11.95040
## US 8.2025379 27.20123
bag=randomForest(Sales~., data=Carseats, subset=train, mtry=7, importance=TRUE)
pred=predict(bag, test)
error=mean((test$Sales-pred)^2)
error
## [1] 2.670094
importance(bag)
## %IncMSE IncNodePurity
## CompPrice 22.2364796 163.634007
## Income 4.4330080 100.551727
## Advertising 11.0280661 107.823416
## Population -0.2689341 67.196288
## Price 52.2754197 484.547908
## ShelveLoc 44.1715023 360.811941
## Age 13.6084631 162.451138
## Education 2.0785295 48.640094
## Urban 0.1480040 9.686486
## US 4.7421244 22.120670
We can see that with a lower mtry level that it raises the test MSE.
This problem involves the OJ data set which is part of the ISLR2 package.
set.seed(1)
train=sample(1:nrow(OJ),800)
test=OJ[-train,]
regtree=tree(Purchase~., data=OJ, subset=train)
summary(regtree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ, subset = train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The model uses 5 variables and has 9 nodes. The training error rate is .1588.
regtree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196196 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196196 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
From node 7, the splitting value is LoyalCH, we can see that there are 261 observations from the node with the splits being 95.79% CH and 4.21% MM and the deviance at 91.20
plot(regtree)
text(regtree, pretty=0, cex=.5)
We can see that LoyalCH is the most important variable.
pred=predict(regtree, test, type="class")
table(test$Purchase, pred)
## pred
## CH MM
## CH 160 8
## MM 38 64
1-mean(pred==test$Purchase)
## [1] 0.1703704
The test error rate is .1703704.
cv=cv.tree(regtree, FUN=prune.tree)
plot(cv$size, cv$dev, xlab="Tree Size", ylab="Cross-validated Classification")
The size 7 corresponds to the lowest cross-validation classification.
p=prune.tree(regtree, best=7)
summary(p)
##
## Classification tree:
## snip.tree(tree = regtree, nodes = c(10L, 4L))
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff" "PctDiscMM"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7748 = 614.4 / 793
## Misclassification error rate: 0.1625 = 130 / 800
The training error for the pruned model is higher at .1625
predu=predict(regtree, test, type="class")
tpredu=sum(test$Purchase != predu)
tpredu/length(predu)
## [1] 0.1703704
predp=predict(p, test, type="class")
tpredp=sum(test$Purchase != predp)
tpredp/length(predp)
## [1] 0.162963
The test error for the unpruned model is higher at a .1703704.