---
title: "Assignment 7"
author: "Oscar Cancino"
date: "4/22/2022"
output: html_document
---
Hint: In a setting with two classes, $\hat{p}_{m1} = 1 - \hat{p}_{m2}$. You could make this plot by hand, but it will be much easier to make in R.
p = seq(0, 1, 0.01)                                    # grid of values for p_m1
gini.index = 2 * p * (1 - p)                           # Gini index for two classes
class.error = 1 - pmax(p, 1 - p)                       # classification error
cross.entropy = -(p * log(p) + (1 - p) * log(1 - p))   # cross-entropy (NaN at p = 0 and 1, dropped from the plot)
par(bg = "white")
matplot(p, cbind(gini.index, class.error, cross.entropy), pch=c(15,17,19) ,ylab = "gini.index, class.error, cross.entropy",col = c("green" , "yellow", "red"), type = 'b')
legend('bottom', inset=.01, legend = c('gini.index', 'class.error', 'cross.entropy'), col = c("green" , "yellow", "red"), pch=c(15,17,19))
library(ISLR)
set.seed(1)
train = sample(1:nrow(Carseats), nrow(Carseats) / 2)
Car.train = Carseats[train, ]
Car.test = Carseats[-train,]
library(tree)
reg.tree = tree(Sales ~ ., data = Car.train)
summary(reg.tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = Car.train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Age" "Advertising" "CompPrice"
## [6] "US"
## Number of terminal nodes: 18
## Residual mean deviance: 2.167 = 394.3 / 182
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.88200 -0.88200 -0.08712 0.00000 0.89590 4.09900
plot(reg.tree)
text(reg.tree ,pretty =0)
yhat = predict(reg.tree,newdata = Car.test)
mean((yhat - Car.test$Sales)^2)
## [1] 4.922039
The test MSE comes out to 4.92.
set.seed(1)
cv.car = cv.tree(reg.tree)
plot(cv.car$size, cv.car$dev, type = "b")
prune.car = prune.tree(reg.tree, best = 8)
plot(prune.car)
text(prune.car,pretty=0)
yhat=predict(prune.car, newdata= Car.test)
mean((yhat-Car.test$Sales)^2)
## [1] 5.113254
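Pruning to 8 terminal nodes increases the test MSE slightly, to about 5.11. As a minimal sketch (not part of the original output), the subtree size could instead be chosen as the size that minimizes the cross-validated deviance, reusing the cv.car, reg.tree, and Car.test objects from above; the names best.size, prune.auto, and yhat.auto are just for illustration.
best.size = cv.car$size[which.min(cv.car$dev)]   # size with the smallest CV deviance
prune.auto = prune.tree(reg.tree, best = best.size)
yhat.auto = predict(prune.auto, newdata = Car.test)
mean((yhat.auto - Car.test$Sales)^2)             # test MSE for the CV-selected subtree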
library(randomForest)
set.seed(1)
bag.car = randomForest(Sales ~ ., data = Car.train, mtry = 10, importance = TRUE)   # mtry = 10 uses all predictors, i.e. bagging
yhat.bag = predict(bag.car,newdata=Car.test)
mean((yhat.bag-Car.test$Sales)^2)
## [1] 2.605253
importance(bag.car)
## %IncMSE IncNodePurity
## CompPrice 24.8888481 170.182937
## Income 4.7121131 91.264880
## Advertising 12.7692401 97.164338
## Population -1.8074075 58.244596
## Price 56.3326252 502.903407
## ShelveLoc 48.8886689 380.032715
## Age 17.7275460 157.846774
## Education 0.5962186 44.598731
## Urban 0.1728373 9.822082
## US 4.2172102 18.073863
varImpPlot(bag.car)
The most important variables are the price at each site (Price) and the quality of the shelving location (ShelveLoc). The test MSE comes out to 2.60, which is lower than the test MSE of the single regression tree above.
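As a small sketch (using the bag.car fit above; the object imp is just for illustration), the importance scores can also be ranked numerically instead of being read off the plot:
imp = importance(bag.car)
imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]   # predictors ranked by %IncMSE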
library(randomForest)
set.seed(1)
rf.car = randomForest(Sales ~ ., data = Car.train, mtry = 3, importance = TRUE)   # mtry = 3 is roughly p/3, the usual regression default
yhat.rf = predict(rf.car,newdata=Car.test)
mean((yhat.rf-Car.test$Sales)^2)
## [1] 2.960559
The test MSE comes out to 2.96, which indicates that this random forest (mtry = 3) does not provide an improvement over bagging on this split.
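As a hedged sketch (not part of the original output), the same training/test split can be used to check how the test MSE behaves over a few values of mtry; the names mtry.grid and mtry.mse are illustrative only.
set.seed(1)
mtry.grid = c(3, 5, 7, 10)
mtry.mse = sapply(mtry.grid, function(m) {
  fit = randomForest(Sales ~ ., data = Car.train, mtry = m)
  mean((predict(fit, newdata = Car.test) - Car.test$Sales)^2)
})
data.frame(mtry = mtry.grid, test.mse = mtry.mse)   # test MSE for each mtry value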
library(ISLR)
set.seed(1)
train = sample(dim(OJ)[1],800)
OJ.train = OJ[train,]
OJ.test = OJ[-train,]
OJ.tree = tree(Purchase~., data=OJ.train)
summary(OJ.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = OJ.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "SpecialCH" "ListPriceDiff"
## [5] "PctDiscMM"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7432 = 587.8 / 791
## Misclassification error rate: 0.1588 = 127 / 800
The fitted tree has 9 terminal nodes and a training misclassification error rate of 0.1588.
OJ.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1073.00 CH ( 0.60625 0.39375 )
## 2) LoyalCH < 0.5036 365 441.60 MM ( 0.29315 0.70685 )
## 4) LoyalCH < 0.280875 177 140.50 MM ( 0.13559 0.86441 )
## 8) LoyalCH < 0.0356415 59 10.14 MM ( 0.01695 0.98305 ) *
## 9) LoyalCH > 0.0356415 118 116.40 MM ( 0.19492 0.80508 ) *
## 5) LoyalCH > 0.280875 188 258.00 MM ( 0.44149 0.55851 )
## 10) PriceDiff < 0.05 79 84.79 MM ( 0.22785 0.77215 )
## 20) SpecialCH < 0.5 64 51.98 MM ( 0.14062 0.85938 ) *
## 21) SpecialCH > 0.5 15 20.19 CH ( 0.60000 0.40000 ) *
## 11) PriceDiff > 0.05 109 147.00 CH ( 0.59633 0.40367 ) *
## 3) LoyalCH > 0.5036 435 337.90 CH ( 0.86897 0.13103 )
## 6) LoyalCH < 0.764572 174 201.00 CH ( 0.73563 0.26437 )
## 12) ListPriceDiff < 0.235 72 99.81 MM ( 0.50000 0.50000 )
## 24) PctDiscMM < 0.196197 55 73.14 CH ( 0.61818 0.38182 ) *
## 25) PctDiscMM > 0.196197 17 12.32 MM ( 0.11765 0.88235 ) *
## 13) ListPriceDiff > 0.235 102 65.43 CH ( 0.90196 0.09804 ) *
## 7) LoyalCH > 0.764572 261 91.20 CH ( 0.95785 0.04215 ) *
I picked node 9, which is a terminal node. The split criterion is LoyalCH > 0.0356415, the node contains 118 observations with a deviance of 116.40, and the overall prediction for this branch is MM (about 80.5% of the observations in the node are MM).
plot(OJ.tree)
text(OJ.tree, pretty = 0)
The most important indicator of Purchase appears to be LoyalCH, since the first split separates observations by the intensity of customer brand loyalty to CH. In fact, the top three splits all involve LoyalCH.
tree.pred = predict(OJ.tree, newdata = OJ.test, type = "class")
table(tree.pred,OJ.test$Purchase)
##
## tree.pred CH MM
## CH 160 38
## MM 8 64
(160+64)/270
## [1] 0.8296296
About 83% of the test observations are correctly classified, so the test error rate is about 17%.
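The same accuracy can be computed without hard-coding the cell counts; a small sketch using the tree.pred and OJ.test objects above (the name conf is illustrative):
conf = table(tree.pred, OJ.test$Purchase)
sum(diag(conf)) / sum(conf)       # test accuracy
1 - sum(diag(conf)) / sum(conf)   # test error rate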
cv.OJ = cv.tree(OJ.tree, FUN = prune.misclass)
cv.OJ
## $size
## [1] 9 8 7 4 2 1
##
## $dev
## [1] 150 150 149 158 172 315
##
## $k
## [1] -Inf 0.000000 3.000000 4.333333 10.500000 151.000000
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.OJ$size,cv.OJ$dev,type='b', xlab = "Tree size", ylab = "Deviance")
From the cross-validation output, the 7-node tree attains the lowest cross-validated classification error (dev = 149), with the 8- and 9-node trees close behind.
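Because cv.tree with FUN = prune.misclass reports dev as the number of cross-validated misclassifications, dividing by the number of training observations converts it to an error rate. A minimal sketch using the cv.OJ and OJ.train objects above (cv.err is an illustrative name):
cv.err = cv.OJ$dev / nrow(OJ.train)   # CV misclassification rate for each tree size
plot(cv.OJ$size, cv.err, type = "b", xlab = "Tree size", ylab = "CV error rate")
cv.OJ$size[which.min(cv.OJ$dev)]      # tree size with the lowest CV error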
prune.OJ = prune.misclass(OJ.tree, best = 5)   # if no 5-node subtree exists in the cost-complexity sequence, the nearest available size is returned
plot(prune.OJ)
text(prune.OJ,pretty=0)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
tree.pred = predict(prune.OJ, newdata = OJ.test, type = "class")
table(tree.pred,OJ.test$Purchase)
##
## tree.pred CH MM
## CH 160 36
## MM 8 66
(160+66)/270
## [1] 0.837037
In this case the pruned tree attains a slightly lower test error rate (about 16.3% versus 17.0% for the unpruned tree), and it produces a far more interpretable tree.
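To address part (j) directly, here is a hedged sketch (not part of the original output) comparing the training error rates of the unpruned and pruned trees using the fits above; since the pruned tree is a restriction of the full tree, its training error rate should be at least as high. The names train.full and train.pruned are illustrative only.
train.full = predict(OJ.tree, newdata = OJ.train, type = "class")
train.pruned = predict(prune.OJ, newdata = OJ.train, type = "class")
mean(train.full != OJ.train$Purchase)     # training error rate, unpruned tree
mean(train.pruned != OJ.train$Purchase)   # training error rate, pruned tree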