p = seq(0, 1, 0.01)
gini = 2 * p * (1 - p)                               # Gini index for a two-class node
classerror = 1 - pmax(p, 1 - p)                      # classification error rate
crossentropy = -(p * log(p) + (1 - p) * log(1 - p))  # cross-entropy (NaN at p = 0 and 1; matplot drops those points)
matplot(p, cbind(gini, classerror, crossentropy), type = "l", lty = 1,
        col = c("red", "green", "blue"), xlab = "p", ylab = "Impurity")
legend("topright", legend = c("Gini", "Class. error", "Cross-entropy"),
       col = c("red", "green", "blue"), lty = 1)
Split the data set into a training set and a test set.
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.3
library(tree)
## Warning: package 'tree' was built under R version 4.0.4
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.5
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(caret)
## Warning: package 'caret' was built under R version 4.0.3
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
#View(Carseats)
set.seed(1)
index = sample(1:nrow(Carseats), 0.75*nrow(Carseats))
train.carseats = Carseats[index,]
test.carseats = Carseats[-index,]
Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
Looking at the plot, the most important variable is ShelveLoc: the first split separates the Bad and Medium shelf locations from Good. The next most important variable is Price. Computing the test MSE gives 4.91.
tree.carseats = tree(Sales~ ., data = train.carseats)
plot(tree.carseats)
text(tree.carseats, pretty = 0, cex = .7)
tree.predict <- predict(tree.carseats, newdata = test.carseats)
mean((tree.predict - test.carseats$Sales)^2)
## [1] 4.910268
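To put 4.91 in context (a quick check that is not part of the original write-up), a baseline that always predicts the training-set mean gives a point of comparison:
baseline.pred = mean(train.carseats$Sales)              # constant-mean predictor
mean((baseline.pred - test.carseats$Sales)^2)           # baseline test MSE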
Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
Calling which.min() on cv.carseats$size returns 17, but that is just the position of the smallest tree in the cross-validation sequence, not the optimal complexity. Plotting the cross-validated deviance against tree size, I chose 7 terminal nodes, since that is the lowest point before the curve begins to increase.
After pruning the tree to 7 nodes the test MSE is 5.10, which does not improve on the unpruned tree.
cv.carseats = cv.tree(tree.carseats)
#cv.carseats
which.min(cv.carseats$size)  # position of the smallest tree in the sequence, not the optimal size
## [1] 17
plot(cv.carseats$size, cv.carseats$dev, type = 'b')
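As a sketch (using the cv.carseats object above, and the same idea applied in the OJ exercise later), the size with the lowest cross-validated deviance can be extracted directly:
cv.carseats$size[which.min(cv.carseats$dev)]  # size with the smallest CV deviance; ties go to the first entry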
prune.carseats = prune.tree(tree.carseats, best = 7)
plot(prune.carseats)
text(prune.carseats, pretty = 0)
prune.predict <- predict(prune.carseats, newdata = test.carseats)
mean((prune.predict - test.carseats$Sales)^2)
## [1] 5.104051
Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
Using randomForest() with mtry = 10, i.e. the bagging approach, we get the lowest test MSE so far at 2.82. The importance() function shows that the most important variable is ShelveLoc, followed by Price.
set.seed(1)
rf.carseats = randomForest(Sales ~ ., data = train.carseats, mtry = 10, ntree = 500, importance = TRUE)
predict.bag = predict(rf.carseats, newdata = test.carseats)
mean((predict.bag - test.carseats$Sales)^2)
## [1] 2.820284
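As a quick check (an addition, not in the original write-up), mtry = 10 corresponds to using every predictor at each split, which is what makes this bagging rather than an ordinary random forest:
ncol(Carseats) - 1  # number of predictors once the response Sales is excluded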
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 36.1922541 243.46553
## Income 8.1825801 129.36706
## Advertising 21.7200717 170.81770
## Population -2.9629193 70.42358
## Price 69.0884762 701.82750
## ShelveLoc 78.6820621 691.74585
## Age 26.6644544 248.85675
## Education 4.0907048 56.33708
## Urban 0.1344484 10.43790
## US 1.8235528 11.35347
varImpPlot(rf.carseats)
Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
I first tuned mtry with caret's train(), and bestTune returned 11. That value reflects caret's dummy-variable expansion of the factors (ShelveLoc, Urban, US), which yields 11 predictors; randomForest() with the formula interface treats each factor as a single variable, so it has only 10 predictors and mtry = 11 is out of range there. The cross-validation table shows mtry = 6 performs nearly as well, so I used 6 in the final model. That gives a test MSE of 2.79, slightly lower than the bagging result in part d, and importance() again ranks Price and ShelveLoc as the top two variables. As for the effect of m, the cross-validation results below show the error falling as m increases from 2 toward the full set of predictors; the sketch after the importance plot illustrates the same pattern on the test set.
carseat.rf = train(Sales ~ ., data = train.carseats, method = "rf",
                   trControl = trainControl("cv", number = 10), importance = TRUE)
carseat.rf$results
## mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 2 1.873057 0.6763707 1.525785 0.2512151 0.09254889 0.2105758
## 2 6 1.599387 0.7104705 1.279808 0.2013524 0.08760461 0.1696911
## 3 11 1.556582 0.7100990 1.246339 0.2136924 0.08545993 0.1749030
carseat.rf$bestTune
## mtry
## 3 11
carseat.rf$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = ..1)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 11
##
## Mean of squared residuals: 2.395651
## % Var explained: 69.86
set.seed(1)
rf.carseats = randomForest(Sales ~ ., data = train.carseats, mtry = 6, ntree = 500, importance = TRUE)
predict.bag = predict(rf.carseats, newdata = test.carseats)
mean((predict.bag - test.carseats$Sales)^2)
## [1] 2.793601
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 27.2576997 228.07002
## Income 7.9502187 141.21172
## Advertising 20.0426577 179.95941
## Population -2.5580791 92.72272
## Price 64.7466610 660.69802
## ShelveLoc 70.4708127 651.12878
## Age 23.1915453 262.61350
## Education 1.5418740 69.42682
## Urban -0.1863153 13.29967
## US 4.6346726 19.22371
varImpPlot(rf.carseats)
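To illustrate the effect of m directly (a sketch, not part of the original analysis; it refits one forest per value of mtry, so it takes a moment to run):
# Test MSE as a function of mtry (m), holding ntree fixed at 500
set.seed(1)
mse.by.m = sapply(1:10, function(m) {
  rf = randomForest(Sales ~ ., data = train.carseats, mtry = m, ntree = 500)
  mean((predict(rf, newdata = test.carseats) - test.carseats$Sales)^2)
})
mse.by.m
On this split the error tends to fall as m grows toward the full set of predictors, consistent with the cross-validation results above.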
Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(1)
index = 1:800  # note: this takes the first 800 rows; a random sample as the exercise asks would be sample(1:nrow(OJ), 800)
oj.train = OJ[index,]
oj.test = OJ[-index,]
Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
Looking at the summary of the classification tree fit to the OJ training data, the tree uses only LoyalCH and PriceDiff, has 7 terminal nodes, and a training error rate of 16.5%.
oj.tree = tree(Purchase ~., data = oj.train)
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7342 = 582.3 / 793
## Misclassification error rate: 0.165 = 132 / 800
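The 16.5% figure can also be reproduced from the fitted classes (a small check, not in the original write-up):
mean(predict(oj.tree, oj.train, type = "class") != oj.train$Purchase)  # training error rate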
Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1062.00 CH ( 0.62125 0.37875 )
## 2) LoyalCH < 0.5036 348 413.70 MM ( 0.28161 0.71839 )
## 4) LoyalCH < 0.276142 164 113.50 MM ( 0.10976 0.89024 )
## 8) LoyalCH < 0.035047 62 0.00 MM ( 0.00000 1.00000 ) *
## 9) LoyalCH > 0.035047 102 95.06 MM ( 0.17647 0.82353 ) *
## 5) LoyalCH > 0.276142 184 251.90 MM ( 0.43478 0.56522 )
## 10) PriceDiff < 0.05 75 77.75 MM ( 0.21333 0.78667 ) *
## 11) PriceDiff > 0.05 109 147.80 CH ( 0.58716 0.41284 ) *
## 3) LoyalCH > 0.5036 452 326.70 CH ( 0.88274 0.11726 )
## 6) LoyalCH < 0.753545 172 188.90 CH ( 0.76163 0.23837 )
## 12) PriceDiff < 0.145 67 92.15 CH ( 0.55224 0.44776 ) *
## 13) PriceDiff > 0.145 105 70.44 CH ( 0.89524 0.10476 ) *
## 7) LoyalCH > 0.753545 280 99.08 CH ( 0.95714 0.04286 ) *
Create a plot of the tree, and interpret the results.
From the plot we see that the most important variable is LoyalCH: the root split is on LoyalCH, and the next splits are again on LoyalCH at finer thresholds before PriceDiff comes into play.
plot(oj.tree)
text(oj.tree, pretty = 0)
Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
The test error rate is 21.85%.
oj.pred = predict(oj.tree, oj.test, type = "class")
table(oj.pred, oj.test$Purchase)
##
## oj.pred CH MM
## CH 141 44
## MM 15 70
1-(141+70)/270
## [1] 0.2185185
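Equivalently, the test error can be computed in one line from the objects above:
mean(oj.pred != oj.test$Purchase)  # fraction of misclassified test observations, same 0.2185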
Apply the cv.tree() function to the training set in order to determine the optimal tree size.
Tree sizes 7 and 4 tie for the lowest cross-validated deviance (152); which.min() picks 7 because it is listed first.
oj.cv = cv.tree(oj.tree, FUN = prune.misclass)
oj.cv
## $size
## [1] 7 4 2 1
##
## $dev
## [1] 152 152 157 303
##
## $k
## [1] -Inf 0.0 9.5 152.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
bestsize = oj.cv$size[which.min(oj.cv$dev)]
bestsize
## [1] 7
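Because sizes 7 and 4 tie, the smallest tree attaining the minimum can be pulled out explicitly (a sketch using the oj.cv object above):
min(oj.cv$size[oj.cv$dev == min(oj.cv$dev)])  # smallest size among those tying for the lowest CV deviance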
Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(oj.cv$size, oj.cv$dev, type = "b", xlab = "Tree size", ylab = "Deviance")
Which tree size corresponds to the lowest cross-validated classification error rate? Sizes 4 and 7 both attain the lowest cross-validated error; looking at the plot, 4 is the smallest tree that achieves it.
Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj.prune = prune.misclass(oj.tree, best = 7)
plot(oj.prune)
text(oj.prune, pretty = 0)
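Since best = 7 leaves the tree unchanged (it already has 7 terminal nodes), the exercise's fallback of a five-node tree would look like this (a sketch; it is not used in the comparisons below):
oj.prune5 = prune.misclass(oj.tree, best = 5)  # prune to five terminal nodes
plot(oj.prune5)
text(oj.prune5, pretty = 0)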
Compare the training error rates between the pruned and unpruned trees. Which is higher?
The unpruned tree has a training error rate of 16.5%, and the pruned tree has exactly the same rate, because pruning to 7 terminal nodes leaves the original 7-node tree unchanged.
summary(oj.prune)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7342 = 582.3 / 793
## Misclassification error rate: 0.165 = 132 / 800
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = oj.train)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff"
## Number of terminal nodes: 7
## Residual mean deviance: 0.7342 = 582.3 / 793
## Misclassification error rate: 0.165 = 132 / 800
Compare the test error rates between the pruned and unpruned trees. Which is higher?
Again, the test error rate is the same (21.85%), since the pruned tree is identical to the unpruned one.
pred.prune <- predict(oj.prune, oj.test, type = "class")
table(pred.prune, oj.test$Purchase)
##
## pred.prune CH MM
## CH 141 44
## MM 15 70
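For completeness (an addition; the identical confusion matrix already implies the same error), the pruned tree's test error can be computed the same way:
mean(pred.prune != oj.test$Purchase)  # 0.2185, matching the unpruned tree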