Assignment 7

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.4     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(ISLR2)
library(glmnet)

## Warning: package 'glmnet' was built under R version 4.2.3

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7

library(pls)

## Warning: package 'pls' was built under R version 4.2.3

## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:stats':
## 
##     loadings

library(dplyr)
library(leaps)

## Warning: package 'leaps' was built under R version 4.2.3

library(boot)
library(gam)

## Warning: package 'gam' was built under R version 4.2.3

## Loading required package: splines
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loaded gam 1.22-2

library(leaps)
library(tree)

## Warning: package 'tree' was built under R version 4.2.3

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.2.3

## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin

library(BART)

## Warning: package 'BART' was built under R version 4.2.3

## Loading required package: nlme
## 
## Attaching package: 'nlme'
## 
## The following object is masked from 'package:dplyr':
## 
##     collapse
## 
## Loading required package: nnet
## Loading required package: survival
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:boot':
## 
##     aml

Question 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of ˆpm1. The x-axis should display ˆpm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, pˆm1 = 1 − pˆm2. You could make this plot by hand, but it will be much easier to make in R.

p = seq(0, 1, 0.01)
gini = p * (1 - p) * 2
entropy = -(p * log(p) + (1 - p) * log(1 - p))
class.err = 1 - pmax(p, 1 - p)
matplot(p, cbind(gini, entropy, class.err), col = c("Blue", "red", "purple"))

Question 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

attach(Carseats)

set.seed(24)
c<-Carseats
train<-sample(1:nrow(c), nrow(c)/2)

c.tr<-c[train,]
c.te<-c[-train,]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

tree.c=tree(Sales~., data=c.tr)
plot(tree.c)
text(tree.c, pretty=0)

yhat.tree=predict(tree.c, c.te)
obs.sales<-c.te$Sales
mean((yhat.tree-obs.sales)^2)

## [1] 4.541017

    The Means Squared Error is found to be rounded to 4.54.
    
(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

cv.c=cv.tree(tree.c)
cv.c

## $size
##  [1] 16 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
## 
## $dev
##  [1] 1012.989 1041.573 1020.262 1043.493 1057.997 1064.050 1131.822 1130.429
##  [9] 1211.419 1182.640 1250.044 1239.531 1352.795 1438.036 1382.154 1686.631
## 
## $k
##  [1]      -Inf  17.46067  18.21007  25.43324  29.77121  32.67272  44.19972
##  [8]  46.09869  49.55016  50.87585  60.04473  63.14420  99.94859 132.15272
## [15] 143.22869 395.57046
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

plot(cv.c$size,cv.c$dev, xlab = 'size', ylab = 'cv error ')
abline(h = min(cv.c$dev) + .2*sd(cv.c$dev))

    As the above model shows the lowest CV error is at 16 variables but there is a few others that are very close so they may also be worth trying out.

prune.c<-list()
yhat.prune<-list()
mse.prune<-list()
for (i in 2:16) {
prune.c[[i]]=prune.tree(tree.c, best = i)
yhat.prune[[i]]=predict(prune.c[[i]], c.te)
obs.sales<-c.te$Sales
mse.prune[[i]]<-mean((yhat.prune[[i]]-obs.sales)^2)
}

mse.mat<-as.matrix(mse.prune)
mse.df<-as.data.frame(mse.mat)
par(bg = "cornsilk3" )
matplot(2:16, mse.df[-1,1], xlab = 'size', ylab = 'MSE', main = "What size tree \n gives the lowest MSE?")
lines(2:16, mse.df[-1,1], type = "o")

mse.df <- data.frame(size=2:16,lapply(mse.df, unlist) )
mse.df$mse.rank<-rank(mse.df$V1)
names(mse.df)[2]<-'mse'
mse.df[mse.df$mse.rank %in% 1:3,]

##    size      mse mse.rank
## 13   14 4.727879        3
## 14   15 4.499964        1
## 15   16 4.541017        2

    It now seems from the above model that 15 nodes of size 14 variable has the lowest MSE of 4.50 

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

set.seed(24)
bag.c<-randomForest(Sales~., data = c.tr, mtry = 10, importance = TRUE)

yhat.bag<-predict(bag.c, newdata = c.te)
mean((yhat.bag - c.te$Sales)^2)

## [1] 2.692938

importance(bag.c)

##                %IncMSE IncNodePurity
## CompPrice   22.8321261    176.578526
## Income       4.1064747     77.183648
## Advertising 12.8743758     99.953925
## Population   0.3779757     76.359264
## Price       51.4435675    465.155918
## ShelveLoc   51.6714645    438.596460
## Age         21.2202209    201.116544
## Education    1.3558349     36.305982
## Urban       -2.0566741      6.577210
## US           0.6451619      5.857904

    The above model shows the lowest MSE at 2.69.As this is a learning and test envoriment it may be a good idea to practice And use mtry = 1:10 for extra assurance.

rf.c<- list()
yhat.rf<-list()
mse.rf<-list()
for ( i in 1:10 ) {
    set.seed(24)
    rf.c[[i]]<-randomForest(Sales ~ ., data = c.tr, mtry = i, importance = TRUE)
    yhat.rf[[i]]<-predict(rf.c[[i]], newdata = c.te)
    mse.rf[[i]]<- mean((yhat.rf[[i]] - c.te$Sales)^2)
}
par(bg = "cornflowerblue" )
matplot(1:10, mse.rf, xlab = 'mtry', ylab = 'MSE', main = "What # of predictors \n gives the lowest MSE?")
lines(1:10, mse.rf, type = "o")

mseRF<-t(as.data.frame(lapply(mse.rf, unlist)))
mse.rf<-data.frame(size=1:10,mseRF)
mse.rf.rownames<-paste0("mtry", 1:10)
rownames(mse.rf)<-mse.rf.rownames
mse.rf$mse.rank<-rank(mse.rf$mseRF)
mse.rf[order(mse.rf$mse.rank),]

##        size    mseRF mse.rank
## mtry10   10 2.692938        1
## mtry8     8 2.704865        2
## mtry9     9 2.722630        3
## mtry7     7 2.759607        4
## mtry6     6 2.843481        5
## mtry5     5 2.897282        6
## mtry4     4 2.970406        7
## mtry3     3 3.204131        8
## mtry2     2 3.573238        9
## mtry1     1 4.770207       10

    With this test model mtry= 10 has the lowest mse as 2.69 while mtry= 1 has the highest as 4.77.
    
(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

importance(bag.c)

##                %IncMSE IncNodePurity
## CompPrice   22.8321261    176.578526
## Income       4.1064747     77.183648
## Advertising 12.8743758     99.953925
## Population   0.3779757     76.359264
## Price       51.4435675    465.155918
## ShelveLoc   51.6714645    438.596460
## Age         21.2202209    201.116544
## Education    1.3558349     36.305982
## Urban       -2.0566741      6.577210
## US           0.6451619      5.857904

rf.carseats = randomForest(Sales ~ ., data = c.tr, mtry = 5, ntree = 500,
importance = T)
rf.pred = predict(rf.carseats, c.te)
mean((c.te$Sales - rf.pred)^2)

## [1] 2.829245

importance(rf.carseats)

##               %IncMSE IncNodePurity
## CompPrice   17.218971     153.98693
## Income       2.253299      95.92800
## Advertising 12.276418     108.15561
## Population   2.731687     104.08481
## Price       42.454124     410.96911
## ShelveLoc   45.481810     388.04732
## Age         20.430292     227.72396
## Education    1.802344      47.67768
## Urban       -3.730875      10.22056
## US           2.257687      10.89850

The Random Forrest Model as shown above resulted in an MSE of 2.83 which is worse than the previous model. To add the most important predictors are Price, ShelveLoc, and followed by age.

(f) Now analyze the data using BART, and report your results.

library(BART)

x=Carseats[,1:11]
y=Carseats[,"Sales"]
xtrain=x[train,]
ytrain=y[train]
xtest=x[-train,]
ytest=y[-train]
set.seed(5)
bart.fit=gbart(xtrain,ytrain,x.test=xtest)

## Warning in summary.lm(lm(y.train ~ ., data.frame(t(x.train), y.train))):
## essentially perfect fit: summary may be unreliable

## *****Calling gbart: type=1
## *****Data:
## data:n,p,np: 200, 15, 200
## y1,yn: -2.153500, 0.486500
## x1,x[n*p]: 5.160000, 1.000000
## xp1,xp[np*p]: 11.220000, 1.000000
## *****Number of Trees: 200
## *****Number of Cut Points: 100 ... 1
## *****burn,nd,thin: 100,1000,1
## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,0.263397,3,7.18123e-31,7.3135
## *****sigma: 0.000000
## *****w (weights): 1.000000 ... 1.000000
## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,15,0
## *****printevery: 100
## 
## MCMC
## done 0 (out of 1100)
## done 100 (out of 1100)
## done 200 (out of 1100)
## done 300 (out of 1100)
## done 400 (out of 1100)
## done 500 (out of 1100)
## done 600 (out of 1100)
## done 700 (out of 1100)
## done 800 (out of 1100)
## done 900 (out of 1100)
## done 1000 (out of 1100)
## time: 5s
## trcnt,tecnt: 1000,1000

yhat.bart=bart.fit$yhat.test.mean
mean((ytest-yhat.bart)^2)

## [1] 0.0810372

The Bart resulted in a very low MSE close to 0 of .081 which show it as being insignificant.

detach(Carseats)

Question 9

This problem involves the OJ data set which is part of the ISLR2 package.

attach(OJ)

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

set.seed(5)
train=sample(dim(OJ)[1],800)
train.OJ=OJ[train,]
test.OJ=OJ[-train,]

summary(train.OJ)

##  Purchase WeekofPurchase     StoreID         PriceCH        PriceMM     
##  CH:490   Min.   :227.0   Min.   :1.000   Min.   :1.69   Min.   :1.690  
##  MM:310   1st Qu.:240.0   1st Qu.:2.000   1st Qu.:1.79   1st Qu.:2.090  
##           Median :257.0   Median :3.000   Median :1.86   Median :2.130  
##           Mean   :254.6   Mean   :3.926   Mean   :1.87   Mean   :2.089  
##           3rd Qu.:269.0   3rd Qu.:7.000   3rd Qu.:1.99   3rd Qu.:2.180  
##           Max.   :278.0   Max.   :7.000   Max.   :2.09   Max.   :2.290  
##      DiscCH            DiscMM        SpecialCH        SpecialMM     
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.000   Median :0.0000   Median :0.0000  
##  Mean   :0.05221   Mean   :0.125   Mean   :0.1525   Mean   :0.1613  
##  3rd Qu.:0.00000   3rd Qu.:0.200   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :0.50000   Max.   :0.800   Max.   :1.0000   Max.   :1.0000  
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff       Store7   
##  Min.   :0.000011   Min.   :1.190   Min.   :1.390   Min.   :-0.6700   No :542  
##  1st Qu.:0.331072   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000   Yes:258  
##  Median :0.600000   Median :2.090   Median :1.860   Median : 0.2400            
##  Mean   :0.570457   Mean   :1.964   Mean   :1.818   Mean   : 0.1457            
##  3rd Qu.:0.859921   3rd Qu.:2.180   3rd Qu.:1.890   3rd Qu.: 0.3200            
##  Max.   :0.999947   Max.   :2.290   Max.   :2.090   Max.   : 0.6400            
##    PctDiscMM         PctDiscCH       ListPriceDiff        STORE      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.1400   1st Qu.:0.000  
##  Median :0.00000   Median :0.00000   Median :0.2400   Median :2.000  
##  Mean   :0.05992   Mean   :0.02759   Mean   :0.2185   Mean   :1.669  
##  3rd Qu.:0.11268   3rd Qu.:0.00000   3rd Qu.:0.3000   3rd Qu.:3.000  
##  Max.   :0.40201   Max.   :0.25269   Max.   :0.4400   Max.   :4.000

summary(test.OJ)

##  Purchase WeekofPurchase     StoreID         PriceCH         PriceMM     
##  CH:163   Min.   :227.0   Min.   :1.000   Min.   :1.690   Min.   :1.690  
##  MM:107   1st Qu.:238.0   1st Qu.:2.000   1st Qu.:1.760   1st Qu.:1.990  
##           Median :256.0   Median :3.000   Median :1.860   Median :2.090  
##           Mean   :253.7   Mean   :4.059   Mean   :1.860   Mean   :2.076  
##           3rd Qu.:266.8   3rd Qu.:7.000   3rd Qu.:1.982   3rd Qu.:2.180  
##           Max.   :278.0   Max.   :7.000   Max.   :2.090   Max.   :2.290  
##      DiscCH            DiscMM         SpecialCH        SpecialMM    
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.000  
##  Mean   :0.05081   Mean   :0.1184   Mean   :0.1333   Mean   :0.163  
##  3rd Qu.:0.00000   3rd Qu.:0.2400   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :0.50000   Max.   :0.8000   Max.   :1.0000   Max.   :1.000  
##     LoyalCH          SalePriceMM     SalePriceCH      PriceDiff       Store7   
##  Min.   :0.000014   Min.   :1.190   Min.   :1.390   Min.   :-0.6700   No :172  
##  1st Qu.:0.320000   1st Qu.:1.690   1st Qu.:1.750   1st Qu.: 0.0000   Yes: 98  
##  Median :0.574400   Median :2.090   Median :1.860   Median : 0.2150            
##  Mean   :0.551933   Mean   :1.958   Mean   :1.809   Mean   : 0.1488            
##  3rd Qu.:0.823003   3rd Qu.:2.130   3rd Qu.:1.890   3rd Qu.: 0.3150            
##  Max.   :0.999934   Max.   :2.290   Max.   :2.090   Max.   : 0.6400            
##    PctDiscMM         PctDiscCH       ListPriceDiff        STORE      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.1400   1st Qu.:0.000  
##  Median :0.00000   Median :0.00000   Median :0.2400   Median :1.000  
##  Mean   :0.05746   Mean   :0.02648   Mean   :0.2164   Mean   :1.519  
##  3rd Qu.:0.11268   3rd Qu.:0.00000   3rd Qu.:0.3000   3rd Qu.:3.000  
##  Max.   :0.40201   Max.   :0.25269   Max.   :0.4400   Max.   :4.000

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

oj.tree=tree(Purchase~.,data=train.OJ)
summary(oj.tree)

## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train.OJ)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7347 = 581.1 / 791 
## Misclassification error rate: 0.1662 = 133 / 800

    As shown the MisClassification error rate is 0.1662 and the number of terminal nodes the tree has is 9< Then the variable used were LoyalCH, PriceDiff, and ListPriceDiff.
    
(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

oj.tree

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1068.00 CH ( 0.61250 0.38750 )  
##    2) LoyalCH < 0.5036 346  412.40 MM ( 0.28324 0.71676 )  
##      4) LoyalCH < 0.280875 164  125.50 MM ( 0.12805 0.87195 )  
##        8) LoyalCH < 0.0356415 56   10.03 MM ( 0.01786 0.98214 ) *
##        9) LoyalCH > 0.0356415 108  103.50 MM ( 0.18519 0.81481 ) *
##      5) LoyalCH > 0.280875 182  248.00 MM ( 0.42308 0.57692 )  
##       10) PriceDiff < 0.05 71   67.60 MM ( 0.18310 0.81690 ) *
##       11) PriceDiff > 0.05 111  151.30 CH ( 0.57658 0.42342 ) *
##    3) LoyalCH > 0.5036 454  362.00 CH ( 0.86344 0.13656 )  
##      6) PriceDiff < -0.39 31   40.32 MM ( 0.35484 0.64516 )  
##       12) LoyalCH < 0.638841 10    0.00 MM ( 0.00000 1.00000 ) *
##       13) LoyalCH > 0.638841 21   29.06 CH ( 0.52381 0.47619 ) *
##      7) PriceDiff > -0.39 423  273.70 CH ( 0.90071 0.09929 )  
##       14) LoyalCH < 0.705326 135  143.00 CH ( 0.77778 0.22222 )  
##         28) ListPriceDiff < 0.255 67   89.49 CH ( 0.61194 0.38806 ) *
##         29) ListPriceDiff > 0.255 68   30.43 CH ( 0.94118 0.05882 ) *
##       15) LoyalCH > 0.705326 288   99.77 CH ( 0.95833 0.04167 ) *

    The best terminal node to pick seems to be 28, as this node has 67 observations in it. This point it splits is .255 with a deviance of 89.43 and this branch predicts that the Citrus HIll is being purchased. THie terminal node though is only correct 61.19% of the time. 
    
(d) Create a plot of the tree, and interpret the results.

plot(oj.tree)
text(oj.tree,pretty=0)

    As show the splitting variable is LoyalCH, which how loyal CH's customers are. This meaning Loyalch is the most imprtant variable followed by Pricediff and listprice. This also means/show the less loyal customers are to citrus hill the more likely they are to purchase minute maid and for Citrus hill the more loyal they are the more it take for them to purchase the other brand. 
    
(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

oj.pred=predict(oj.tree,newdata=test.OJ,type="class")
table(test.OJ$Purchase,oj.pred)

##     oj.pred
##       CH  MM
##   CH 148  15
##   MM  32  75

mean(oj.pred!=test.OJ$Purchase)

## [1] 0.1740741

This model show the test error rate of 0.174

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

cv.oj=cv.tree(oj.tree,FUN=prune.misclass)
cv.oj

## $size
## [1] 9 6 5 3 2 1
## 
## $dev
## [1] 155 156 156 172 170 310
## 
## $k
## [1]  -Inf   0.0   1.0   8.5   9.0 150.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

(g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis.

plot(cv.oj$size,cv.oj$dev,type="b",xlab="Tree Size",ylab="Deviance")

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

    THe previous model in g show the tree size with the lowest CV error rate is the one with 9 nodes but there may be ones that are very close. 

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

oj.pruned=prune.misclass(oj.tree,best=9)
plot(oj.pruned)
text(oj.pruned,pretty=0)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(oj.pruned)

## 
## Classification tree:
## tree(formula = Purchase ~ ., data = train.OJ)
## Variables actually used in tree construction:
## [1] "LoyalCH"       "PriceDiff"     "ListPriceDiff"
## Number of terminal nodes:  9 
## Residual mean deviance:  0.7347 = 581.1 / 791 
## Misclassification error rate: 0.1662 = 133 / 800

    It looks like the test error rate between the pruned and unpruned trees in this case are the same because the pruned tree actually not pruned. As shown the cv with the lowest cv error of terminal nodes is 9 still. 
    
(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

ojprune.pred=predict(oj.pruned, newdata=test.OJ, type="class")
mean(ojprune.pred != test.OJ$Purchase)

## [1] 0.1740741

    As assumed we again get the same test error rate of 0.174 as the CV found the unpruned treee is the best model.

Assignment 7

Kelley Williams

2023-05-02

Question 3

Question 8

Question 9