Assignment 7

Problem 3

Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of ˆpm1. The xaxis should display ˆpm1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy

Hint: In a setting with two classes, p^m1 = 1 − p^m2. You could make this plot by hand, but it will be much easier to make in R.

pm1=seq(0,1,0.01)

gini.index= 2*pm1*(1-pm1)
c.error= 1-pmax(pm1,1-pm1)
crossentropy= -(pm1*log(pm1)+(1-pm1)*log(1-pm1))

plot(NA,NA,xlim=c(0,1),ylim=c(0,1),xlab='pm1',ylab='f')

lines(pm1,gini.index,type='l',ls = 2)
lines(pm1,c.error,col='blue',ls =2 )
lines(pm1,crossentropy,col='red',ls = 2)

legend(x='top',legend=c('gini.index','classification error','cross entropy'),
       col=c('black','blue','red'),lty=1,text.width = 0.22)

Problem 8

In the lab, a classification tree was applied to the Carseats data set after converting Sales into a qualitative response variable. Now we will seek to predict Sales using regression trees and related approaches, treating the response as a quantitative variable.

(a) Split the data set into a training set and a test set.

library(ISLR)
df <- Carseats
str(df)

## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

set.seed(1)
inTrain=createDataPartition(df$Sales,p=0.7,list=FALSE)
train=df[inTrain,]
test = df[-inTrain,]

(b) Fit a regression tree to the training set. Plot the tree, and interpret the results. What test MSE do you obtain?

##fit the tree
tree.df=rpart(Sales~., data=df, 
                    method="anova", 
                    control=rpart.control(minsplit=15, cp=0.01))

library(rattle)

## Warning: package 'rattle' was built under R version 4.0.5

## Loading required package: bitops

## Warning: package 'bitops' was built under R version 4.0.3

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

fancyRpartPlot(tree.df)

tree.df

## n= 400 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 400 3182.27500  7.496325  
##    2) ShelveLoc=Bad,Medium 315 1859.56000  6.762984  
##      4) Price>=105.5 207  956.57240  6.018792  
##        8) ShelveLoc=Bad 61  240.81970  4.722459  
##         16) Population< 196.5 25   88.22930  3.767200 *
##         17) Population>=196.5 36  113.93510  5.385833 *
##        9) ShelveLoc=Medium 146  570.41420  6.560411  
##         18) Advertising< 5.5 77  280.11340  5.902468  
##           36) Price>=127 34  133.53970  4.986765 *
##           37) Price< 127 43   95.52198  6.626512 *
##         19) Advertising>=5.5 69  219.77110  7.294638  
##           38) CompPrice< 121.5 19   40.33360  6.230000 *
##           39) CompPrice>=121.5 50  149.71840  7.699200  
##             78) Price>=127 28   71.99441  6.731786 *
##             79) Price< 127 22   18.16730  8.930455 *
##      5) Price< 105.5 108  568.61750  8.189352  
##       10) Age>=54.5 65  303.05690  7.380154  
##         20) Income< 105.5 56  203.03290  6.946071  
##           40) ShelveLoc=Bad 20   76.96006  5.786500  
##             80) Income< 95 15   43.80929  5.047333 *
##             81) Income>=95 5    0.36872  8.004000 *
##           41) ShelveLoc=Medium 36   84.24070  7.590278 *
##         21) Income>=105.5 9   23.81549 10.081110 *
##       11) Age< 54.5 43  158.66040  9.412558  
##         22) Income< 57.5 13   19.24283  7.987692 *
##         23) Income>=57.5 30  101.58740 10.030000  
##           46) ShelveLoc=Bad 9   22.75640  8.396667 *
##           47) ShelveLoc=Medium 21   44.53100 10.730000 *
##    3) ShelveLoc=Good 85  525.52220 10.214000  
##      6) Price>=109.5 57  277.26520  9.244386  
##       12) Advertising< 13.5 48  185.42030  8.742500  
##         24) Price>=142.5 12   36.64722  7.152500 *
##         25) Price< 142.5 36  108.32350  9.272500  
##           50) Income< 40.5 9    9.82780  7.603333 *
##           51) Income>=40.5 27   65.06227  9.828889 *
##       13) Advertising>=13.5 9   15.27049 11.921110 *
##      7) Price< 109.5 28   85.57727 12.187860 *

The most important indicator of Sales appears to be shelving location, since the first branch differentiates Good locations from Bad and Medium locations. And we can see that there are 315 observations in that branch and its overall prediction that came out to be 6.762984. We can also observe that there are branches that are leading to the terminal nodes denoted with asterisk.

# TEST MSE
pred = predict(tree.df,newdata = test)
print(mean((test$Sales-pred)^2)) #2.576552

## [1] 2.576552

print(RMSE(pred,test$Sales))

## [1] 1.605164

After predicting on the test dataset with model I got 2.576552 mean square error.

printcp(tree.df)

## 
## Regression tree:
## rpart(formula = Sales ~ ., data = df, method = "anova", control = rpart.control(minsplit = 15, 
##     cp = 0.01))
## 
## Variables actually used in tree construction:
## [1] Advertising Age         CompPrice   Income      Population  Price      
## [7] ShelveLoc  
## 
## Root node error: 3182.3/400 = 7.9557
## 
## n= 400 
## 
##          CP nsplit rel error  xerror     xstd
## 1  0.250510      0   1.00000 1.00785 0.069393
## 2  0.105073      1   0.74949 0.75911 0.051685
## 3  0.051121      2   0.64442 0.72164 0.048052
## 4  0.045671      3   0.59330 0.71745 0.046503
## 5  0.033592      4   0.54763 0.69225 0.045981
## 6  0.024063      5   0.51403 0.64638 0.042529
## 7  0.023948      6   0.48997 0.66288 0.042683
## 8  0.022163      7   0.46602 0.65964 0.042692
## 9  0.016043      8   0.44386 0.62908 0.040890
## 10 0.014027      9   0.42782 0.59288 0.038382
## 11 0.013145     11   0.39976 0.58667 0.036284
## 12 0.012711     12   0.38662 0.59137 0.036144
## 13 0.012147     13   0.37391 0.59671 0.036464
## 14 0.011888     14   0.36176 0.59180 0.036279
## 15 0.010778     15   0.34987 0.59038 0.037069
## 16 0.010506     16   0.33909 0.59503 0.036884
## 17 0.010301     17   0.32859 0.60020 0.037307
## 18 0.010000     18   0.31829 0.59637 0.037200

plotcp(tree.df)

We can see that the the value of cp we mentioned while training the data is almost nearer to the value that is suggested by the plot.

(c) Use cross-validation in order to determine the optimal level of tree complexity. Does pruning the tree improve the test MSE?

set.seed(1)
# define training control
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
# fit the model
tree2 = train(Sales ~.,data=train, trControl=train_control,method='rpart')
tree2

## CART 
## 
## 281 samples
##  10 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 253, 253, 253, 253, 253, 252, ... 
## Resampling results across tuning parameters:
## 
##   cp          RMSE      Rsquared   MAE     
##   0.06683057  2.301610  0.3231498  1.881713
##   0.11422248  2.359155  0.2767136  1.927404
##   0.25399156  2.634895  0.1775481  2.147800
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.06683057.

From the output we can see that the minimum RMSE was recorded for the CP value 0.06683057 which is the smallest. The MSE has increased when compared to the first model we fit. So it looks like the pruning of the DT does improve the MSE.

(d) Use the bagging approach in order to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important.

## fit random forest
library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

bag.fit = randomForest(Sales~.,data=train, 
                       mtry =ncol(df), 
                       importance = TRUE)
pred.bag = predict (bag.fit , newdata =test)
mean (( pred.bag - test$Sales)^2)

## [1] 3.050789

The noted test MSE was 3.050789

varImp(bag.fit)

##               Overall
## CompPrice   29.248121
## Income       2.486601
## Advertising 28.822622
## Population  -1.195730
## Price       70.208687
## ShelveLoc   70.650068
## Age         21.269075
## Education   -1.888371
## Urban       -3.402822
## US           4.235365

The tow most important variables are Price and ShelveLoc from the output above.

(e) Use random forests to analyze this data. What test MSE do you obtain? Use the importance() function to determine which variables are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

## fit random forest
rf.fit=train(Sales~.,data=train,
              method='rf',
              trControl = trainControl("cv", number = 10),
              importance = TRUE)

## best tuning parameter
rf.fit$bestTune

##   mtry
## 3   11

## final model
rf.fit$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##           Mean of squared residuals: 2.34911
##                     % Var explained: 69.12

From the observation we can see that the mtry value is 11 so there are 11 variables that are randomly chose by our bagging approach. And the opted test MSE was 2.34911 On using only the selected variables for the splitting purpose the error rate has been decreased and we can see that when compared this RF model to other models.

varImp(rf.fit)

## rf variable importance
## 
##                 Overall
## ShelveLocGood   100.000
## Price            96.349
## Advertising      42.988
## ShelveLocMedium  40.464
## CompPrice        39.462
## Age              27.472
## USYes            10.221
## Income            7.138
## Population        2.575
## Education         1.981
## UrbanYes          0.000

plot(varImp(rf.fit))

The most important variable are ShelvLocprice and Good in the data set. I got 2.34911 as mean squared error. which is reduced from previous fit.

Problem 9

This problem involves the OJ data set which is part of the ISLR package.

(a) Create a training set containing a random sample of 800 observations, and a test set containing the remaining observations.

data <- OJ
str(data)

## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...

set.seed(123)
index = sample(nrow(data),800)
train.set = data[index,]
test.set = data[-index,]

(b) Fit a tree to the training data, with Purchase as the response and the other variables as predictors. Use the summary() function to produce summary statistics about the tree, and describe the results obtained. What is the training error rate? How many terminal nodes does the tree have?

##fit the tree
oj.fit=rpart(Purchase ~., data=train.set, 
                    method= "class", 
                    control=rpart.control(minsplit=15, cp=0.01))

summary(oj.fit)

## Call:
## rpart(formula = Purchase ~ ., data = train.set, method = "class", 
##     control = rpart.control(minsplit = 15, cp = 0.01))
##   n= 800 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.49201278      0 1.0000000 1.0000000 0.04410089
## 2 0.03514377      1 0.5079872 0.5335463 0.03672576
## 3 0.02555911      2 0.4728435 0.5335463 0.03672576
## 4 0.01277955      4 0.4217252 0.4504792 0.03443205
## 5 0.01000000      7 0.3833866 0.4728435 0.03508854
## 
## Variable importance
##        LoyalCH        StoreID      PriceDiff    SalePriceMM WeekofPurchase 
##             45              9              9              6              6 
##        PriceMM         DiscMM      PctDiscMM        PriceCH  ListPriceDiff 
##              5              5              4              4              3 
##    SalePriceCH          STORE      SpecialCH 
##              2              1              1 
## 
## Node number 1: 800 observations,    complexity param=0.4920128
##   predicted class=CH  expected loss=0.39125  P(node) =1
##     class counts:   487   313
##    probabilities: 0.609 0.391 
##   left son=2 (450 obs) right son=3 (350 obs)
##   Primary splits:
##       LoyalCH   < 0.5036    to the right, improve=134.49530, (0 missing)
##       StoreID   < 3.5       to the right, improve= 40.88655, (0 missing)
##       STORE     < 0.5       to the left,  improve= 20.84871, (0 missing)
##       Store7    splits as  RL, improve= 20.84871, (0 missing)
##       PriceDiff < 0.015     to the right, improve= 19.14298, (0 missing)
##   Surrogate splits:
##       StoreID        < 3.5       to the right, agree=0.660, adj=0.223, (0 split)
##       WeekofPurchase < 246.5     to the right, agree=0.625, adj=0.143, (0 split)
##       PriceCH        < 1.825     to the right, agree=0.600, adj=0.086, (0 split)
##       PriceMM        < 1.89      to the right, agree=0.596, adj=0.077, (0 split)
##       ListPriceDiff  < 0.035     to the right, agree=0.581, adj=0.043, (0 split)
## 
## Node number 2: 450 observations,    complexity param=0.03514377
##   predicted class=CH  expected loss=0.1355556  P(node) =0.5625
##     class counts:   389    61
##    probabilities: 0.864 0.136 
##   left son=4 (423 obs) right son=5 (27 obs)
##   Primary splits:
##       PriceDiff   < -0.39     to the right, improve=18.543390, (0 missing)
##       DiscMM      < 0.72      to the left,  improve= 9.309254, (0 missing)
##       SalePriceMM < 1.435     to the right, improve= 9.309254, (0 missing)
##       PctDiscMM   < 0.3342595 to the left,  improve= 9.309254, (0 missing)
##       LoyalCH     < 0.7645725 to the right, improve= 8.822549, (0 missing)
##   Surrogate splits:
##       DiscMM      < 0.72      to the left,  agree=0.967, adj=0.444, (0 split)
##       SalePriceMM < 1.435     to the right, agree=0.967, adj=0.444, (0 split)
##       PctDiscMM   < 0.3342595 to the left,  agree=0.967, adj=0.444, (0 split)
##       SalePriceCH < 2.075     to the left,  agree=0.949, adj=0.148, (0 split)
## 
## Node number 3: 350 observations,    complexity param=0.02555911
##   predicted class=MM  expected loss=0.28  P(node) =0.4375
##     class counts:    98   252
##    probabilities: 0.280 0.720 
##   left son=6 (180 obs) right son=7 (170 obs)
##   Primary splits:
##       LoyalCH   < 0.2761415 to the right, improve=14.991900, (0 missing)
##       StoreID   < 3.5       to the right, improve= 6.562913, (0 missing)
##       Store7    splits as  RL, improve= 4.617311, (0 missing)
##       STORE     < 0.5       to the left,  improve= 4.617311, (0 missing)
##       SpecialCH < 0.5       to the right, improve= 4.512108, (0 missing)
##   Surrogate splits:
##       STORE       < 1.5       to the left,  agree=0.629, adj=0.235, (0 split)
##       StoreID     < 1.5       to the left,  agree=0.589, adj=0.153, (0 split)
##       PriceCH     < 1.875     to the left,  agree=0.589, adj=0.153, (0 split)
##       SalePriceCH < 1.875     to the left,  agree=0.586, adj=0.147, (0 split)
##       SalePriceMM < 1.84      to the left,  agree=0.571, adj=0.118, (0 split)
## 
## Node number 4: 423 observations
##   predicted class=CH  expected loss=0.09929078  P(node) =0.52875
##     class counts:   381    42
##    probabilities: 0.901 0.099 
## 
## Node number 5: 27 observations
##   predicted class=MM  expected loss=0.2962963  P(node) =0.03375
##     class counts:     8    19
##    probabilities: 0.296 0.704 
## 
## Node number 6: 180 observations,    complexity param=0.02555911
##   predicted class=MM  expected loss=0.4222222  P(node) =0.225
##     class counts:    76   104
##    probabilities: 0.422 0.578 
##   left son=12 (106 obs) right son=13 (74 obs)
##   Primary splits:
##       PriceDiff     < 0.05      to the right, improve=12.110850, (0 missing)
##       SalePriceMM   < 2.04      to the right, improve=11.572070, (0 missing)
##       DiscMM        < 0.25      to the left,  improve= 5.760121, (0 missing)
##       PctDiscMM     < 0.1345485 to the left,  improve= 5.760121, (0 missing)
##       ListPriceDiff < 0.18      to the right, improve= 5.597236, (0 missing)
##   Surrogate splits:
##       SalePriceMM   < 1.94      to the right, agree=0.933, adj=0.838, (0 split)
##       DiscMM        < 0.08      to the left,  agree=0.822, adj=0.568, (0 split)
##       PctDiscMM     < 0.038887  to the left,  agree=0.822, adj=0.568, (0 split)
##       ListPriceDiff < 0.135     to the right, agree=0.800, adj=0.514, (0 split)
##       PriceMM       < 2.04      to the right, agree=0.783, adj=0.473, (0 split)
## 
## Node number 7: 170 observations
##   predicted class=MM  expected loss=0.1294118  P(node) =0.2125
##     class counts:    22   148
##    probabilities: 0.129 0.871 
## 
## Node number 12: 106 observations,    complexity param=0.01277955
##   predicted class=CH  expected loss=0.4245283  P(node) =0.1325
##     class counts:    61    45
##    probabilities: 0.575 0.425 
##   left son=24 (8 obs) right son=25 (98 obs)
##   Primary splits:
##       LoyalCH        < 0.3084325 to the left,  improve=3.118983, (0 missing)
##       WeekofPurchase < 247.5     to the right, improve=2.489639, (0 missing)
##       SpecialMM      < 0.5       to the left,  improve=2.454538, (0 missing)
##       PriceCH        < 1.755     to the right, improve=2.048863, (0 missing)
##       PriceMM        < 2.04      to the right, improve=1.514675, (0 missing)
## 
## Node number 13: 74 observations
##   predicted class=MM  expected loss=0.2027027  P(node) =0.0925
##     class counts:    15    59
##    probabilities: 0.203 0.797 
## 
## Node number 24: 8 observations
##   predicted class=CH  expected loss=0  P(node) =0.01
##     class counts:     8     0
##    probabilities: 1.000 0.000 
## 
## Node number 25: 98 observations,    complexity param=0.01277955
##   predicted class=CH  expected loss=0.4591837  P(node) =0.1225
##     class counts:    53    45
##    probabilities: 0.541 0.459 
##   left son=50 (46 obs) right son=51 (52 obs)
##   Primary splits:
##       LoyalCH        < 0.442144  to the right, improve=3.071463, (0 missing)
##       WeekofPurchase < 248.5     to the right, improve=2.208454, (0 missing)
##       SpecialMM      < 0.5       to the left,  improve=2.011796, (0 missing)
##       STORE          < 0.5       to the left,  improve=1.624324, (0 missing)
##       StoreID        < 5.5       to the right, improve=1.624324, (0 missing)
##   Surrogate splits:
##       WeekofPurchase < 255       to the left,  agree=0.622, adj=0.196, (0 split)
##       SalePriceCH    < 1.755     to the right, agree=0.571, adj=0.087, (0 split)
##       STORE          < 2.5       to the right, agree=0.571, adj=0.087, (0 split)
##       PriceMM        < 2.205     to the right, agree=0.561, adj=0.065, (0 split)
##       DiscCH         < 0.115     to the left,  agree=0.561, adj=0.065, (0 split)
## 
## Node number 50: 46 observations
##   predicted class=CH  expected loss=0.326087  P(node) =0.0575
##     class counts:    31    15
##    probabilities: 0.674 0.326 
## 
## Node number 51: 52 observations,    complexity param=0.01277955
##   predicted class=MM  expected loss=0.4230769  P(node) =0.065
##     class counts:    22    30
##    probabilities: 0.423 0.577 
##   left son=102 (8 obs) right son=103 (44 obs)
##   Primary splits:
##       SpecialCH      < 0.5       to the right, improve=2.020979, (0 missing)
##       STORE          < 1.5       to the left,  improve=1.724009, (0 missing)
##       SpecialMM      < 0.5       to the left,  improve=1.680070, (0 missing)
##       WeekofPurchase < 245       to the right, improve=1.384615, (0 missing)
##       StoreID        < 5.5       to the right, improve=1.319751, (0 missing)
##   Surrogate splits:
##       DiscCH      < 0.27      to the right, agree=0.942, adj=0.625, (0 split)
##       SalePriceCH < 1.54      to the left,  agree=0.942, adj=0.625, (0 split)
##       PctDiscCH   < 0.149059  to the right, agree=0.942, adj=0.625, (0 split)
##       SalePriceMM < 1.64      to the left,  agree=0.923, adj=0.500, (0 split)
##       DiscMM      < 0.42      to the right, agree=0.904, adj=0.375, (0 split)
## 
## Node number 102: 8 observations
##   predicted class=CH  expected loss=0.25  P(node) =0.01
##     class counts:     6     2
##    probabilities: 0.750 0.250 
## 
## Node number 103: 44 observations
##   predicted class=MM  expected loss=0.3636364  P(node) =0.055
##     class counts:    16    28
##    probabilities: 0.364 0.636

The minimum training error rate was 0.3833866 that was noted as cp = 0.01 value. The major split has happened at LoyalCH variable so it is the most important variable in the data set. Total number of terminal nodes observed was 103.

(c) Type in the name of the tree object in order to get a detailed text output. Pick one of the terminal nodes, and interpret the information displayed.

oj.fit

## n= 800 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 800 313 CH (0.60875000 0.39125000)  
##     2) LoyalCH>=0.5036 450  61 CH (0.86444444 0.13555556)  
##       4) PriceDiff>=-0.39 423  42 CH (0.90070922 0.09929078) *
##       5) PriceDiff< -0.39 27   8 MM (0.29629630 0.70370370) *
##     3) LoyalCH< 0.5036 350  98 MM (0.28000000 0.72000000)  
##       6) LoyalCH>=0.2761415 180  76 MM (0.42222222 0.57777778)  
##        12) PriceDiff>=0.05 106  45 CH (0.57547170 0.42452830)  
##          24) LoyalCH< 0.3084325 8   0 CH (1.00000000 0.00000000) *
##          25) LoyalCH>=0.3084325 98  45 CH (0.54081633 0.45918367)  
##            50) LoyalCH>=0.442144 46  15 CH (0.67391304 0.32608696) *
##            51) LoyalCH< 0.442144 52  22 MM (0.42307692 0.57692308)  
##             102) SpecialCH>=0.5 8   2 CH (0.75000000 0.25000000) *
##             103) SpecialCH< 0.5 44  16 MM (0.36363636 0.63636364) *
##        13) PriceDiff< 0.05 74  15 MM (0.20270270 0.79729730) *
##       7) LoyalCH< 0.2761415 170  22 MM (0.12941176 0.87058824) *

Branches that lead to terminal nodes are indicated using asterisks.The tree has 8 branches that are leading to the terminal nodes.The output is also showing on criterion the split is done. When we see the node 4 the split criterion it used was PriceDiff>=-0.39 and for node 2 LoyalCH>=0.5036. The numbers that are displayed after the split indicates the number of observations in that branch.

(d) Create a plot of the tree, and interpret the results

plotcp(oj.fit)

representation to the cross validated error summary. there is a group of CP values we can select the one which has the lowest cross-validated error and we can further use it for pruning. In this case the tree with 4 nodes could be the best at a cp value of 0.0127796

(e) Predict the response on the test data, and produce a confusion matrix comparing the test labels to the predicted test labels. What is the test error rate?

pred.data <-predict(oj.fit,test.set)
y.hat <-ifelse(pred.data[,1] >= 0.5,'CH','MM') # I considered 0.5 as the cut-off we can also find out the optimal cut-off and can do the confusion matrix accordingly.
confusionMatrix(as.factor(y.hat),as.factor(test.set$Purchase))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  CH  MM
##         CH 141  24
##         MM  25  80
##                                           
##                Accuracy : 0.8185          
##                  95% CI : (0.7673, 0.8626)
##     No Information Rate : 0.6148          
##     P-Value [Acc > NIR] : 3.407e-13       
##                                           
##                   Kappa : 0.6175          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8494          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.8545          
##          Neg Pred Value : 0.7619          
##              Prevalence : 0.6148          
##          Detection Rate : 0.5222          
##    Detection Prevalence : 0.6111          
##       Balanced Accuracy : 0.8093          
##                                           
##        'Positive' Class : CH              
##

#test error rate
# FN + FP/total number of observations in the test data
# FN  falsely predicted negative 
# FP falsely predicted positive
(test.error = (25+24)/ncol(test)) # 4.454545

## [1] 4.454545

(f) Apply the cv.tree() function to the training set in order to determine the optimal tree size.

library(tree)

## Warning: package 'tree' was built under R version 4.0.5

## Registered S3 method overwritten by 'tree':
##   method     from
##   print.tree cli

set.seed(123)
cv_tree <- tree(Purchase ~., data=train.set)

set.seed(123)
(op <- cv.tree(cv_tree,,prune.tree))

## $size
## [1] 8 7 6 5 4 3 2 1
## 
## $dev
## [1]  736.9128  711.0227  703.7038  727.7832  727.7832  753.5331  792.3054
## [8] 1072.5980
## 
## $k
## [1]      -Inf  12.03823  14.92474  25.76707  26.02613  38.91686  50.61655
## [8] 298.68751
## 
## $method
## [1] "deviance"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

plot(op)

the tree we can see that size 6 has the lowest deviance. Deviance means a measure of the error remaining in the tree after construction.So size 6 could be the optimal size. ##### (g) Produce a plot with tree size on the x-axis and cross-validated classification error rate on the y-axis

set.seed(123)
oj.fit2=tree(Purchase~.,data=train.set) # another way of building a tree before we used rpart. 
OJ.tree.cv = cv.tree(oj.fit2,K = 10,FUN = prune.misclass)
plot(OJ.tree.cv)

(h) Which tree size corresponds to the lowest cross-validated classification error rate?

Tree size 5 corresponds to the lowest croos-validated classification error rate.

(i) Produce a pruned tree corresponding to the optimal tree size obtained using cross-validation. If cross-validation does not lead to selection of a pruned tree, then create a pruned tree with five terminal nodes.

library(kableExtra) #for kable

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

oj.fit2=prune.misclass(oj.fit2,best = 5)

OJ.pred.train=predict(oj.fit2,train.set,type = 'class')
table(train.set[,'Purchase'],OJ.pred.train)

##     OJ.pred.train
##       CH  MM
##   CH 442  45
##   MM  87 226

table(train.set[,'Purchase'],OJ.pred.train)/nrow(train.set)

##     OJ.pred.train
##           CH      MM
##   CH 0.55250 0.05625
##   MM 0.10875 0.28250

OJ.pred.test=predict(oj.fit2,test.set,type = 'class')
table(test.set[,'Purchase'],OJ.pred.test)

##     OJ.pred.test
##       CH  MM
##   CH 150  16
##   MM  34  70

table(test.set[,'Purchase'],OJ.pred.test)/nrow(test.set)

##     OJ.pred.test
##              CH         MM
##   CH 0.55555556 0.05925926
##   MM 0.12592593 0.25925926

plot(oj.fit2)
text(oj.fit2)

(j) Compare the training error rates between the pruned and unpruned trees. Which is higher?

summary(oj.fit2)

## 
## Classification tree:
## snip.tree(tree = oj.fit2, nodes = c(4L, 7L))
## Variables actually used in tree construction:
## [1] "LoyalCH"   "PriceDiff"
## Number of terminal nodes:  5 
## Residual mean deviance:  0.826 = 656.6 / 795 
## Misclassification error rate: 0.165 = 132 / 800

(k) Compare the test error rates between the pruned and unpruned trees. Which is higher?

The unpruned tree got test error rate around 44% while the pruned tree with 2 leaf nodes achieves a misclassification test rate of 58% . This represents a 14% improvement over an unpruned tree for the test data. However, the training error for the unpruned tree is 0.3833866 i.e., 38% approx which is lower. This is quite evident that the decision trees overfit the data.