Question 3

Consider the Gini Index, Classification error and entrophy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1 and the y-axis should display the value of Gini index, classification error and entropy.

Classification Error Rate:

It is simply the fraction of the training observations in a given region that do not belong to the most common class

\[ E = 1-max_k(\hat{p}_{mk}) \]

When Class 1 is most common class The classification error value will be \(E = 1-\hat{p}_{m1}\)

When Class 2 is most common class The classification error value will be \(E = 1-\hat{p}_{m2}\)

Gini Index:

It is the measure of Node Purity. Its value lies between 0 and 1. It is defined as below. \[ G = \sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk}) \]

Gini Index in terms of class 1 is \(G = \hat{p}_{m1}(1-\hat{p}_{m1})\)

Gini Index in terms of class 2 is \(G = \hat{p}_{m2}(1-\hat{p}_{m2})\)

Entropy:

It is the measure of Node Purity. It is Alternative to Gini Index.Its value lies between 0 and 1. It is defined as below. \[ D = -\sum_{k=1}^K\hat{p}_{mk}(\log\hat{p}_{mk}) \] Entropy in terms of class 1 is \(D = -(\hat{p}_{m1}(\log\hat{p}_{m1}))\)

Entropy in terms of class 2 is \(G = -(\hat{p}_{m2}(\log\hat{p}_{m2}))\)

Plot with Entropy and Gini Index:

Alt text

Question 8

In the lab, a classification tree was applied to the carseats data set after converting sales into qualitative response variable now we will seek to predict sales using regression trees and related approaches treating the response as a quantitative variable.

data("Carseats")
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Question 8a

Split the data set into training set and test set.

set.seed(1)
carseats.data=sample.split(Carseats$Sales,SplitRatio=0.8)
train.carseat=subset(Carseats,carseats.data==TRUE)
test.carseat=subset(Carseats,carseats.data==FALSE)

Question 8b

Fit the regression tree to the training data set. Plot the tree and interpret the results. What test MSE do you obtain?

carseat.tree = tree(Sales~.,data=train.carseat)
summary(carseat.tree)
## 
## Regression tree:
## tree(formula = Sales ~ ., data = train.carseat)
## Variables actually used in tree construction:
## [1] "ShelveLoc"   "Price"       "CompPrice"   "Age"         "Income"     
## [6] "Advertising"
## Number of terminal nodes:  19 
## Residual mean deviance:  2.314 = 696.5 / 301 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -5.72300 -0.99680 -0.04119  0.00000  1.04500  3.97900
carseat.tree
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##   1) root 320 2514.000  7.552  
##     2) ShelveLoc: Bad,Medium 246 1406.000  6.759  
##       4) Price < 105.5 85  416.700  8.055  
##         8) CompPrice < 123.5 64  255.700  7.426  
##          16) Age < 68.5 49  152.400  7.972  
##            32) Income < 57.5 15   30.560  6.656 *
##            33) Income > 57.5 34   84.430  8.553  
##              66) Age < 37 7   22.340 10.240 *
##              67) Age > 37 27   36.870  8.114 *
##          17) Age > 68.5 15   40.750  5.640 *
##         9) CompPrice > 123.5 21   58.350  9.974 *
##       5) Price > 105.5 161  771.300  6.075  
##        10) ShelveLoc: Bad 46  208.800  4.707  
##          20) CompPrice < 124.5 17   41.050  3.516 *
##          21) CompPrice > 124.5 29  129.500  5.406  
##            42) Price < 132.5 19   57.010  6.308 *
##            43) Price > 132.5 10   27.620  3.691 *
##        11) ShelveLoc: Medium 115  442.000  6.622  
##          22) CompPrice < 122.5 36  100.800  5.526 *
##          23) CompPrice > 122.5 79  278.400  7.121  
##            46) Advertising < 13.5 65  207.700  6.732  
##              92) Price < 131.5 38   71.570  7.448  
##               184) Age < 74.5 30   40.120  7.891 *
##               185) Age > 74.5 8    3.550  5.789 *
##              93) Price > 131.5 27   89.190  5.723 *
##            47) Advertising > 13.5 14   15.030  8.929 *
##     3) ShelveLoc: Good 74  438.500 10.190  
##       6) Price < 109.5 25   73.110 12.290 *
##       7) Price > 109.5 49  198.700  9.117  
##        14) Advertising < 14 42  135.400  8.700  
##          28) Income < 47 15   16.490  7.575 *
##          29) Income > 47 27   89.390  9.324  
##            58) CompPrice < 141 22   60.760  8.852  
##             116) Advertising < 6 10   13.980  7.546 *
##             117) Advertising > 6 12   15.500  9.941 *
##            59) CompPrice > 141 5    2.143 11.400 *
##        15) Advertising > 14 7   12.020 11.620 *
plot(carseat.tree)
text(carseat.tree,pretty=0)

carseat.tree.predict=predict(carseat.tree,newdata=test.carseat)
plot(carseat.tree.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")

carseat.unprune.mse=mean((carseat.tree.predict-test.carseat$Sales)^2)
carseat.unprune.mse
## [1] 5.202575
carseat.unprune.rmse=sqrt(mean((carseat.tree.predict-test.carseat$Sales)^2))
carseat.unprune.rmse
## [1] 2.280915

As per the training Summary, below 5 variables are used while constructing the full decision tree which ends up in 19 terminal nodes. The Root mean square error is 2.280915 which implies that the prediction of car seat sales is within 2280 car seats of the actual median car seat sales across different stores

Question 8c

Use cross Validation on order to determine the optimal level of tree complexity. Does pruning the tree improved the test MSE

carseat.cv=cv.tree(carseat.tree)
plot(carseat.cv$size,carseat.cv$dev,type="b")

Based on The cross validation tree, the minimum error occur when the size of the tree is with 15 nodes

Prune the training data set with the size as 15 and check the RMSE for predicted model

prune.carseat=prune.tree(carseat.tree,best = 15)

Plot the prune tree

plot(prune.carseat)
text(prune.carseat,pretty=0)

Predict the carseat sales of the test dataset using the pruned tree model with the leaf node size as 15

carseat.prune.predict=predict(prune.carseat,newdata=test.carseat)
plot(carseat.prune.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")

carseat.prune.mse=mean((carseat.prune.predict-test.carseat$Sales)^2)
carseat.prune.mse
## [1] 5.003813
carseat.prune.rmse=sqrt(mean((carseat.prune.predict-test.carseat$Sales)^2))
carseat.prune.rmse
## [1] 2.236921

As per the cross validation , the minimum cv error occurs at the size of 15. After Pruning the data set for that size and find out the MSE, the difference between the MSE of un-pruned tree and pruned tree to the size of 15 is very minimum.As an Eyeball view of the Plot the maximum dip in the CV error occur when the size of the tree is less than 6 nodes after that the dip in the CV error is minimum.So pruning the data set with the size as 6 and compare the MSE

Prune the training data set with the size as 6 and check the RMSE for predicted model

prune.carseat.1=prune.tree(carseat.tree,best = 6)

Plot the prune tree

plot(prune.carseat.1)
text(prune.carseat.1,pretty=0)

Predict the carseat sales of the test dataset using the pruned tree model with the leaf node size as 10

carseat.prune.predict.1=predict(prune.carseat.1,newdata=test.carseat)
plot(carseat.prune.predict.1,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")

carseat.prune1.mse=mean((carseat.prune.predict.1-test.carseat$Sales)^2)
carseat.prune1.mse
## [1] 5.064176
carseat.prune1.rmse=sqrt(mean((carseat.prune.predict.1-test.carseat$Sales)^2))
carseat.prune1.rmse
## [1] 2.250372

As per the comparison of the MSE between the Pruned tree and the un-pruned tree there is no much improvement in the MSE.Test MSE is slightly lower for the pruned tree compared to the unpruned tree

tree.type=c("unpruned","prune-15","prune-5")
tree.mses=c(carseat.unprune.mse,carseat.prune.mse,carseat.prune1.mse)
tree.rmses=c(carseat.unprune.rmse,carseat.prune.rmse,carseat.prune1.rmse)
data.frame(tree.type,tree.mses,tree.rmses)
##   tree.type tree.mses tree.rmses
## 1  unpruned  5.202575   2.280915
## 2  prune-15  5.003813   2.236921
## 3   prune-5  5.064176   2.250372

Question 8d

Use the bagging approach in-order to analyze this data. what test MSE do you obtain?.Use the Importance function to determine which variable are most important.

set.seed(1)
carseat.bag = randomForest(Sales~.,data=train.carseat,mtry=10,importance=TRUE)
carseat.bag
## 
## Call:
##  randomForest(formula = Sales ~ ., data = train.carseat, mtry = 10,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 10
## 
##           Mean of squared residuals: 2.482236
##                     % Var explained: 68.41
carseat.bag.predict=predict(carseat.bag,newdata = test.carseat )
plot(carseat.bag.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")

carseat.bag.mse=mean((carseat.bag.predict-test.carseat$Sales)^2)
carseat.bag.mse
## [1] 2.609163
carseat.bag.rmse=sqrt(mean((carseat.bag.predict-test.carseat$Sales)^2))
carseat.bag.rmse
## [1] 1.61529

Find out the Variable importance of the bagged model

varImpPlot(carseat.bag)

Test MSE is low in bagging approach compared to the pruned tree. Also Based on the variable importance, shelf location, price takes the higher priority compared to all other variables.

Question 8e

Use the Random forest to analyze this data. what test MSE do you obtain?.Use the Importance function to determine which variable are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.

Fit the Random forest model using caret package

carseat.rf=train(Sales~.,data=train.carseat,trControl=trainControl(method = "cv",number=10),method='rf',importance=TRUE)

Find out the best Tune value to check how many predictors are needed

carseat.rf$bestTune
##   mtry
## 3   11
carseat.rf.predict=predict(carseat.rf,newdata = test.carseat )
plot(carseat.rf.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")

carseat.rf.mse=mean((carseat.rf.predict-test.carseat$Sales)^2)
carseat.rf.mse
## [1] 2.566788
carseat.rf.rmse=sqrt(mean((carseat.rf.predict-test.carseat$Sales)^2))
carseat.rf.rmse
## [1] 1.60212

Plot the variable importance of the final model

plot(varImp(carseat.rf))

Test MSE is low in Random forest modelling technique compared to the pruned tree and bagg approach. Also based on the variable importance, shelf location, price takes the higher priority compared to all other variables.Overall Random forest model is best model to predict the number of car seat sales across 400 stores

Question 9

This problem involves the OJ data set which is part of the ISLR package

data(OJ)
str(OJ)
## 'data.frame':    1070 obs. of  18 variables:
##  $ Purchase      : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
##  $ WeekofPurchase: num  237 239 245 227 228 230 232 234 235 238 ...
##  $ StoreID       : num  1 1 1 1 7 7 7 7 7 7 ...
##  $ PriceCH       : num  1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceMM       : num  1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
##  $ DiscCH        : num  0 0 0.17 0 0 0 0 0 0 0 ...
##  $ DiscMM        : num  0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
##  $ SpecialCH     : num  0 0 0 0 0 0 1 1 0 0 ...
##  $ SpecialMM     : num  0 1 0 0 0 1 1 0 0 0 ...
##  $ LoyalCH       : num  0.5 0.6 0.68 0.4 0.957 ...
##  $ SalePriceMM   : num  1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
##  $ SalePriceCH   : num  1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
##  $ PriceDiff     : num  0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
##  $ Store7        : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
##  $ PctDiscMM     : num  0 0.151 0 0 0 ...
##  $ PctDiscCH     : num  0 0 0.0914 0 0 ...
##  $ ListPriceDiff : num  0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
##  $ STORE         : num  1 1 1 1 0 0 0 0 0 0 ...

Question 9a

Create a training set containing random sample of 800 observations and the test set containing the remaining observation

df = OJ
orange.juice.data = sample.split(df$Purchase, SplitRatio = 800/1070) #800 observations for the test set.
orange.juice.train = subset(df, orange.juice.data==T)
orange.juice.test = subset(df, orange.juice.data==F)

Question 9b

Fit a train to the training data, with purchase as the response and other variables as predictors.use the summary() function to produce summary statistics about the tree and describe the results obtained. what is the training error rate? how many terminal nodes does the tree have.

orange.juice.tree = tree(Purchase~.,data=orange.juice.train)
unprune.train.summary=summary(orange.juice.tree)

Based on the summary report, the number of terminal nodes used in the above tree is 8 and the training error rate is 16%

Question 9c

Type in the name of the tree object in order to get a detailed text output.pick one of the terminal nodes and interpret the results

orange.juice.tree
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 800 1070.000 CH ( 0.61000 0.39000 )  
##    2) LoyalCH < 0.52192 367  446.500 MM ( 0.29700 0.70300 )  
##      4) LoyalCH < 0.277221 161  120.800 MM ( 0.12422 0.87578 ) *
##      5) LoyalCH > 0.277221 206  281.800 MM ( 0.43204 0.56796 )  
##       10) PriceDiff < 0.05 79   70.650 MM ( 0.16456 0.83544 ) *
##       11) PriceDiff > 0.05 127  171.100 CH ( 0.59843 0.40157 ) *
##    3) LoyalCH > 0.52192 433  325.800 CH ( 0.87529 0.12471 )  
##      6) LoyalCH < 0.764572 168  193.200 CH ( 0.73810 0.26190 )  
##       12) PriceDiff < 0.215 80  110.500 CH ( 0.53750 0.46250 ) *
##       13) PriceDiff > 0.215 88   48.870 CH ( 0.92045 0.07955 ) *
##      7) LoyalCH > 0.764572 265   85.160 CH ( 0.96226 0.03774 )  
##       14) PriceDiff < -0.39 7    9.561 CH ( 0.57143 0.42857 ) *
##       15) PriceDiff > -0.39 258   64.310 CH ( 0.97287 0.02713 )  
##         30) STORE < 1.5 139    0.000 CH ( 1.00000 0.00000 ) *
##         31) STORE > 1.5 119   53.240 CH ( 0.94118 0.05882 ) *

Question 9d

Create a plot of a tree and interpret the results

plot(orange.juice.tree)
text(orange.juice.tree,pretty=0)

Question 9e

Predict the response of the test data and produce a confusion matrix comparing the test labels to the predicted test labels.what is the test error rate?

orange.juice.tree.predict=predict(orange.juice.tree,newdata = orange.juice.test,type = "class" )

Confusion matrix

unprune.test.conf=confusionMatrix(orange.juice.tree.predict,orange.juice.test$Purchase)
unprune.test.error.rate=(unprune.test.conf$table[1,2]+unprune.test.conf$table[2,1])/(dim(orange.juice.test)[1])
unprune.test.accuracy=(unprune.test.conf$table[1,1]+unprune.test.conf$table[2,2])/(dim(orange.juice.test)[1])

Question 9f

Apply the cv.tree() function to the training data set in order to obtain the optimal tree size

orange.juice.cv=cv.tree(orange.juice.tree,FUN = prune.misclass)
orange.juice.cv$dev
## [1] 153 152 174 312
summary(orange.juice.cv)
##        Length Class  Mode     
## size   4      -none- numeric  
## dev    4      -none- numeric  
## k      4      -none- numeric  
## method 1      -none- character

Question 9g

Produce a plot with the tree size on the x acis and the cross validated error classification rate on the y axis

plot(orange.juice.cv$size,orange.juice.cv$dev,type="b")

Question 9h

Which tree size corresponds to the lowest cross validated classification rate?

Based on the graph, the minimum deviation error occur at the size of 2 and from there the error pumps a little bit when the size is 4 ,after that the minimum error stay constant still the size 8

Question 9i

Produce a pruned tree corresponding to the optimal tree size obtained using cross validation. If the cross validation does not lead to the selection of a pruned tree, the create the pruned tree with five terminal nodes

prune.orange.juice=prune.tree(orange.juice.tree,best = 2)

Plot the prune tree

plot(prune.orange.juice)
text(prune.orange.juice,pretty=0)

prune.train.summary=summary(prune.orange.juice)

Question 9j

Compare the training error rates between the pruned and un-pruned trees. which is higher?

TreeType = c("Un-pruned Tree","Pruned Tree")
ErrorRate = c(unprune.train.summary$misclass[1]/unprune.train.summary$misclass[2],
              prune.train.summary$misclass[1]/prune.train.summary$misclass[2])
(Train.error.df=data.frame(TreeType,ErrorRate))
##         TreeType ErrorRate
## 1 Un-pruned Tree   0.17250
## 2    Pruned Tree   0.20375

Training mis classification rate is slightly higher for prune tree with 5 terminal nodes than the full tree

Question 9k

Compare the test error rates between the pruned and un-pruned trees. which is higher?

prune.orange.juice.predict=predict(prune.orange.juice,newdata = orange.juice.test,type = "class")
prune.test.conf=confusionMatrix(prune.orange.juice.predict,orange.juice.test$Purchase)
prune.test.error.rate=(prune.test.conf$table[1,2]+prune.test.conf$table[2,1])/(dim(orange.juice.test)[1])
prune.test.accuracy=(prune.test.conf$table[1,1]+prune.test.conf$table[2,2])/(dim(orange.juice.test)[1])
TestErrorRate = c(unprune.test.error.rate,prune.test.error.rate)
TestAccuracy=c(unprune.test.accuracy,prune.test.accuracy)
(test.error.df=data.frame(TreeType,TestErrorRate,TestAccuracy))
##         TreeType TestErrorRate TestAccuracy
## 1 Un-pruned Tree     0.2000000    0.8000000
## 2    Pruned Tree     0.2185185    0.7814815

Based on the table above test mis-classification rate and accuracy is almost same for prune tree with 5 terminal nodes and the full tree. There is negligible difference between the values.