library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.1.0
## ✔ tidyr 1.3.0 ✔ stringr 1.5.0
## ✔ readr 2.1.4 ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ISLR2)
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.2.3
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
##
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
##
## Loaded glmnet 4.1-7
library(pls)
## Warning: package 'pls' was built under R version 4.2.3
##
## Attaching package: 'pls'
##
## The following object is masked from 'package:stats':
##
## loadings
library(dplyr)
library(leaps)
## Warning: package 'leaps' was built under R version 4.2.3
library(boot)
library(gam)
## Warning: package 'gam' was built under R version 4.2.3
## Loading required package: splines
## Loading required package: foreach
##
## Attaching package: 'foreach'
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
## Loaded gam 1.22-2
library(leaps)
library(tree)
## Warning: package 'tree' was built under R version 4.2.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(BART)
## Warning: package 'BART' was built under R version 4.2.3
## Loading required package: nlme
##
## Attaching package: 'nlme'
##
## The following object is masked from 'package:dplyr':
##
## collapse
##
## Loading required package: nnet
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:boot':
##
## aml
Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of ˆpm1. The x-axis should display ˆpm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.
p = seq(0, 1, 0.01)
gini = p * (1 - p) * 2
entropy = -(p * log(p) + (1 - p) * log(1 - p))
class.err = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class.err), col = c("Blue", "red", "purple"))
In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.
(a) Split the data set into a training set and a test set.
attach(Carseats)
set.seed(24)
c<-Carseats
train<-sample(1:nrow(c), nrow(c)/2)
c.tr<-c[train,]
c.te<-c[-train,]
(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?
tree.c=tree(Sales~., data=c.tr)
plot(tree.c)
text(tree.c, pretty=0)
yhat.tree=predict(tree.c, c.te)
obs.sales<-c.te$Sales
mean((yhat.tree-obs.sales)^2)
## [1] 4.541017
The Means Squared Error is found to be rounded to 4.54.
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?
cv.c=cv.tree(tree.c)
cv.c
## $size
## [1] 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
##
## $dev
## [1] 1012.989 1041.573 1020.262 1043.493 1057.997 1064.050 1131.822 1130.429
## [9] 1211.419 1182.640 1250.044 1239.531 1352.795 1438.036 1382.154 1686.631
##
## $k
## [1] -Inf 17.46067 18.21007 25.43324 29.77121 32.67272 44.19972
## [8] 46.09869 49.55016 50.87585 60.04473 63.14420 99.94859 132.15272
## [15] 143.22869 395.57046
##
## $method
## [1] "deviance"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.c$size,cv.c$dev, xlab = 'size', ylab = 'cv error ')
abline(h = min(cv.c$dev) + .2*sd(cv.c$dev))
As the above model shows the lowest CV error is at 16 variables but there is a few others that are very close so they may also be worth trying out.
prune.c<-list()
yhat.prune<-list()
mse.prune<-list()
for (i in 2:16) {
prune.c[[i]]=prune.tree(tree.c, best = i)
yhat.prune[[i]]=predict(prune.c[[i]], c.te)
obs.sales<-c.te$Sales
mse.prune[[i]]<-mean((yhat.prune[[i]]-obs.sales)^2)
}
mse.mat<-as.matrix(mse.prune)
mse.df<-as.data.frame(mse.mat)
par(bg = "cornsilk3" )
matplot(2:16, mse.df[-1,1], xlab = 'size', ylab = 'MSE', main = "What size tree \n gives the lowest MSE?")
lines(2:16, mse.df[-1,1], type = "o")
mse.df <- data.frame(size=2:16,lapply(mse.df, unlist) )
mse.df$mse.rank<-rank(mse.df$V1)
names(mse.df)[2]<-'mse'
mse.df[mse.df$mse.rank %in% 1:3,]
## size mse mse.rank
## 13 14 4.727879 3
## 14 15 4.499964 1
## 15 16 4.541017 2
It now seems from the above model that 15 nodes of size 14 variable has the lowest MSE of 4.50
(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.
set.seed(24)
bag.c<-randomForest(Sales~., data = c.tr, mtry = 10, importance = TRUE)
yhat.bag<-predict(bag.c, newdata = c.te)
mean((yhat.bag - c.te$Sales)^2)
## [1] 2.692938
importance(bag.c)
## %IncMSE IncNodePurity
## CompPrice 22.8321261 176.578526
## Income 4.1064747 77.183648
## Advertising 12.8743758 99.953925
## Population 0.3779757 76.359264
## Price 51.4435675 465.155918
## ShelveLoc 51.6714645 438.596460
## Age 21.2202209 201.116544
## Education 1.3558349 36.305982
## Urban -2.0566741 6.577210
## US 0.6451619 5.857904
The above model shows the lowest MSE at 2.69.As this is a learning and test envoriment it may be a good idea to practice And use mtry = 1:10 for extra assurance.
rf.c<- list()
yhat.rf<-list()
mse.rf<-list()
for ( i in 1:10 ) {
set.seed(24)
rf.c[[i]]<-randomForest(Sales ~ ., data = c.tr, mtry = i, importance = TRUE)
yhat.rf[[i]]<-predict(rf.c[[i]], newdata = c.te)
mse.rf[[i]]<- mean((yhat.rf[[i]] - c.te$Sales)^2)
}
par(bg = "cornflowerblue" )
matplot(1:10, mse.rf, xlab = 'mtry', ylab = 'MSE', main = "What # of predictors \n gives the lowest MSE?")
lines(1:10, mse.rf, type = "o")
mseRF<-t(as.data.frame(lapply(mse.rf, unlist)))
mse.rf<-data.frame(size=1:10,mseRF)
mse.rf.rownames<-paste0("mtry", 1:10)
rownames(mse.rf)<-mse.rf.rownames
mse.rf$mse.rank<-rank(mse.rf$mseRF)
mse.rf[order(mse.rf$mse.rank),]
## size mseRF mse.rank
## mtry10 10 2.692938 1
## mtry8 8 2.704865 2
## mtry9 9 2.722630 3
## mtry7 7 2.759607 4
## mtry6 6 2.843481 5
## mtry5 5 2.897282 6
## mtry4 4 2.970406 7
## mtry3 3 3.204131 8
## mtry2 2 3.573238 9
## mtry1 1 4.770207 10
With this test model mtry= 10 has the lowest mse as 2.69 while mtry= 1 has the highest as 4.77.
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
importance(bag.c)
## %IncMSE IncNodePurity
## CompPrice 22.8321261 176.578526
## Income 4.1064747 77.183648
## Advertising 12.8743758 99.953925
## Population 0.3779757 76.359264
## Price 51.4435675 465.155918
## ShelveLoc 51.6714645 438.596460
## Age 21.2202209 201.116544
## Education 1.3558349 36.305982
## Urban -2.0566741 6.577210
## US 0.6451619 5.857904
rf.carseats = randomForest(Sales ~ ., data = c.tr, mtry = 5, ntree = 500,
importance = T)
rf.pred = predict(rf.carseats, c.te)
mean((c.te$Sales - rf.pred)^2)
## [1] 2.829245
importance(rf.carseats)
## %IncMSE IncNodePurity
## CompPrice 17.218971 153.98693
## Income 2.253299 95.92800
## Advertising 12.276418 108.15561
## Population 2.731687 104.08481
## Price 42.454124 410.96911
## ShelveLoc 45.481810 388.04732
## Age 20.430292 227.72396
## Education 1.802344 47.67768
## Urban -3.730875 10.22056
## US 2.257687 10.89850
The Random Forrest Model as shown above resulted in an MSE of 2.83 which is worse than the previous model. To add the most important predictors are Price, ShelveLoc, and followed by age.
(f) Now analyze the data using BART, and report your results.
library(BART)
x=Carseats[,1:11]
y=Carseats[,"Sales"]
xtrain=x[train,]
ytrain=y[train]
xtest=x[-train,]
ytest=y[-train]
set.seed(5)
bart.fit=gbart(xtrain,ytrain,x.test=xtest)
## Warning in summary.lm(lm(y.train ~ ., data.frame(t(x.train), y.train))):
## essentially perfect fit: summary may be unreliable
## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 15, 200
## y1,yn: -2.153500, 0.486500
## x1,x[n*p]: 5.160000, 1.000000
## xp1,xp[np*p]: 11.220000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.263397,3,7.18123e-31,7.3135
## *****sigma: 0.000000
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,15,0
## *****printevery: 100
##
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 5s
## trcnt,tecnt: 1000,1000
yhat.bart=bart.fit$yhat.test.mean
mean((ytest-yhat.bart)^2)
## [1] 0.0810372
The Bart resulted in a very low MSE close to 0 of .081 which show it as being insignificant.
detach(Carseats)
This problem involves the OJ data set which is part of the ISLR2 package.
attach(OJ)
(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.
set.seed(5)
train=sample(dim(OJ)[1],800)
train.OJ=OJ[train,]
test.OJ=OJ[-train,]
summary(train.OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM
## CH:490 Min. :227.0 Min. :1.000 Min. :1.69 Min. :1.690
## MM:310 1st Qu.:240.0 1st Qu.:2.000 1st Qu.:1.79 1st Qu.:2.090
## Median :257.0 Median :3.000 Median :1.86 Median :2.130
## Mean :254.6 Mean :3.926 Mean :1.87 Mean :2.089
## 3rd Qu.:269.0 3rd Qu.:7.000 3rd Qu.:1.99 3rd Qu.:2.180
## Max. :278.0 Max. :7.000 Max. :2.09 Max. :2.290
## DiscCH DiscMM SpecialCH SpecialMM
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.000 Median :0.0000 Median :0.0000
## Mean :0.05221 Mean :0.125 Mean :0.1525 Mean :0.1613
## 3rd Qu.:0.00000 3rd Qu.:0.200 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :0.50000 Max. :0.800 Max. :1.0000 Max. :1.0000
## LoyalCH SalePriceMM SalePriceCH PriceDiff Store7
## Min. :0.000011 Min. :1.190 Min. :1.390 Min. :-0.6700 No :542
## 1st Qu.:0.331072 1st Qu.:1.690 1st Qu.:1.750 1st Qu.: 0.0000 Yes:258
## Median :0.600000 Median :2.090 Median :1.860 Median : 0.2400
## Mean :0.570457 Mean :1.964 Mean :1.818 Mean : 0.1457
## 3rd Qu.:0.859921 3rd Qu.:2.180 3rd Qu.:1.890 3rd Qu.: 0.3200
## Max. :0.999947 Max. :2.290 Max. :2.090 Max. : 0.6400
## PctDiscMM PctDiscCH ListPriceDiff STORE
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.1400 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.2400 Median :2.000
## Mean :0.05992 Mean :0.02759 Mean :0.2185 Mean :1.669
## 3rd Qu.:0.11268 3rd Qu.:0.00000 3rd Qu.:0.3000 3rd Qu.:3.000
## Max. :0.40201 Max. :0.25269 Max. :0.4400 Max. :4.000
summary(test.OJ)
## Purchase WeekofPurchase StoreID PriceCH PriceMM
## CH:163 Min. :227.0 Min. :1.000 Min. :1.690 Min. :1.690
## MM:107 1st Qu.:238.0 1st Qu.:2.000 1st Qu.:1.760 1st Qu.:1.990
## Median :256.0 Median :3.000 Median :1.860 Median :2.090
## Mean :253.7 Mean :4.059 Mean :1.860 Mean :2.076
## 3rd Qu.:266.8 3rd Qu.:7.000 3rd Qu.:1.982 3rd Qu.:2.180
## Max. :278.0 Max. :7.000 Max. :2.090 Max. :2.290
## DiscCH DiscMM SpecialCH SpecialMM
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.000
## Median :0.00000 Median :0.0000 Median :0.0000 Median :0.000
## Mean :0.05081 Mean :0.1184 Mean :0.1333 Mean :0.163
## 3rd Qu.:0.00000 3rd Qu.:0.2400 3rd Qu.:0.0000 3rd Qu.:0.000
## Max. :0.50000 Max. :0.8000 Max. :1.0000 Max. :1.000
## LoyalCH SalePriceMM SalePriceCH PriceDiff Store7
## Min. :0.000014 Min. :1.190 Min. :1.390 Min. :-0.6700 No :172
## 1st Qu.:0.320000 1st Qu.:1.690 1st Qu.:1.750 1st Qu.: 0.0000 Yes: 98
## Median :0.574400 Median :2.090 Median :1.860 Median : 0.2150
## Mean :0.551933 Mean :1.958 Mean :1.809 Mean : 0.1488
## 3rd Qu.:0.823003 3rd Qu.:2.130 3rd Qu.:1.890 3rd Qu.: 0.3150
## Max. :0.999934 Max. :2.290 Max. :2.090 Max. : 0.6400
## PctDiscMM PctDiscCH ListPriceDiff STORE
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.1400 1st Qu.:0.000
## Median :0.00000 Median :0.00000 Median :0.2400 Median :1.000
## Mean :0.05746 Mean :0.02648 Mean :0.2164 Mean :1.519
## 3rd Qu.:0.11268 3rd Qu.:0.00000 3rd Qu.:0.3000 3rd Qu.:3.000
## Max. :0.40201 Max. :0.25269 Max. :0.4400 Max. :4.000
(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?
oj.tree=tree(Purchase~.,data=train.OJ)
summary(oj.tree)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train.OJ)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7347 = 581.1 / 791
## Misclassification error rate: 0.1662 = 133 / 800
As shown the MisClassification error rate is 0.1662 and the number of terminal nodes the tree has is 9< Then the variable used were LoyalCH, PriceDiff, and ListPriceDiff.
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.
oj.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1068.00 CH ( 0.61250 0.38750 )
## 2) LoyalCH < 0.5036 346 412.40 MM ( 0.28324 0.71676 )
## 4) LoyalCH < 0.280875 164 125.50 MM ( 0.12805 0.87195 )
## 8) LoyalCH < 0.0356415 56 10.03 MM ( 0.01786 0.98214 ) *
## 9) LoyalCH > 0.0356415 108 103.50 MM ( 0.18519 0.81481 ) *
## 5) LoyalCH > 0.280875 182 248.00 MM ( 0.42308 0.57692 )
## 10) PriceDiff < 0.05 71 67.60 MM ( 0.18310 0.81690 ) *
## 11) PriceDiff > 0.05 111 151.30 CH ( 0.57658 0.42342 ) *
## 3) LoyalCH > 0.5036 454 362.00 CH ( 0.86344 0.13656 )
## 6) PriceDiff < -0.39 31 40.32 MM ( 0.35484 0.64516 )
## 12) LoyalCH < 0.638841 10 0.00 MM ( 0.00000 1.00000 ) *
## 13) LoyalCH > 0.638841 21 29.06 CH ( 0.52381 0.47619 ) *
## 7) PriceDiff > -0.39 423 273.70 CH ( 0.90071 0.09929 )
## 14) LoyalCH < 0.705326 135 143.00 CH ( 0.77778 0.22222 )
## 28) ListPriceDiff < 0.255 67 89.49 CH ( 0.61194 0.38806 ) *
## 29) ListPriceDiff > 0.255 68 30.43 CH ( 0.94118 0.05882 ) *
## 15) LoyalCH > 0.705326 288 99.77 CH ( 0.95833 0.04167 ) *
The best terminal node to pick seems to be 28, as this node has 67 observations in it. This point it splits is .255 with a deviance of 89.43 and this branch predicts that the Citrus HIll is being purchased. THie terminal node though is only correct 61.19% of the time.
(d) Create a plot of the tree, and interpret the results.
plot(oj.tree)
text(oj.tree,pretty=0)
As show the splitting variable is LoyalCH, which how loyal CH's customers are. This meaning Loyalch is the most imprtant variable followed by Pricediff and listprice. This also means/show the less loyal customers are to citrus hill the more likely they are to purchase minute maid and for Citrus hill the more loyal they are the more it take for them to purchase the other brand.
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?
oj.pred=predict(oj.tree,newdata=test.OJ,type="class")
table(test.OJ$Purchase,oj.pred)
## oj.pred
## CH MM
## CH 148 15
## MM 32 75
mean(oj.pred!=test.OJ$Purchase)
## [1] 0.1740741
This model show the test error rate of 0.174
(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.
cv.oj=cv.tree(oj.tree,FUN=prune.misclass)
cv.oj
## $size
## [1] 9 6 5 3 2 1
##
## $dev
## [1] 155 156 156 172 170 310
##
## $k
## [1] -Inf 0.0 1.0 8.5 9.0 150.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.
plot(cv.oj$size,cv.oj$dev,type="b",xlab="Tree Size",ylab="Deviance")
(h) Which tree size corresponds to the lowest cross-validated classification error rate?
THe previous model in g show the tree size with the lowest CV error rate is the one with 9 nodes but there may be ones that are very close.
(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.
oj.pruned=prune.misclass(oj.tree,best=9)
plot(oj.pruned)
text(oj.pruned,pretty=0)
(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?
summary(oj.pruned)
##
## Classification tree:
## tree(formula = Purchase ~ ., data = train.OJ)
## Variables actually used in tree construction:
## [1] "LoyalCH" "PriceDiff" "ListPriceDiff"
## Number of terminal nodes: 9
## Residual mean deviance: 0.7347 = 581.1 / 791
## Misclassification error rate: 0.1662 = 133 / 800
It looks like the test error rate between the pruned and unpruned trees in this case are the same because the pruned tree actually not pruned. As shown the cv with the lowest cv error of terminal nodes is 9 still.
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?
ojprune.pred=predict(oj.pruned, newdata=test.OJ, type="class")
mean(ojprune.pred != test.OJ$Purchase)
## [1] 0.1740741
As assumed we again get the same test error rate of 0.174 as the CV found the unpruned treee is the best model.