Consider the Gini Index, Classification error and entrophy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of \(\hat{p}_{m1}\). The x-axis should display \(\hat{p}_{m1}\), ranging from 0 to 1 and the y-axis should display the value of Gini index, classification error and entropy.
It is simply the fraction of the training observations in a given region that do not belong to the most common class
\[ E = 1-max_k(\hat{p}_{mk}) \]
When Class 1 is most common class The classification error value will be \(E = 1-\hat{p}_{m1}\)
When Class 2 is most common class The classification error value will be \(E = 1-\hat{p}_{m2}\)
It is the measure of Node Purity. Its value lies between 0 and 1. It is defined as below. \[ G = \sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk}) \]
Gini Index in terms of class 1 is \(G = \hat{p}_{m1}(1-\hat{p}_{m1})\)
Gini Index in terms of class 2 is \(G = \hat{p}_{m2}(1-\hat{p}_{m2})\)
It is the measure of Node Purity. It is Alternative to Gini Index.Its value lies between 0 and 1. It is defined as below. \[ D = -\sum_{k=1}^K\hat{p}_{mk}(\log\hat{p}_{mk}) \] Entropy in terms of class 1 is \(D = -(\hat{p}_{m1}(\log\hat{p}_{m1}))\)
Entropy in terms of class 2 is \(G = -(\hat{p}_{m2}(\log\hat{p}_{m2}))\)
Alt text
In the lab, a classification tree was applied to the carseats data set after converting sales into qualitative response variable now we will seek to predict sales using regression trees and related approaches treating the response as a quantitative variable.
data("Carseats")
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Split the data set into training set and test set.
set.seed(1)
carseats.data=sample.split(Carseats$Sales,SplitRatio=0.8)
train.carseat=subset(Carseats,carseats.data==TRUE)
test.carseat=subset(Carseats,carseats.data==FALSE)
Fit the regression tree to the training data set. Plot the tree and interpret the results. What test MSE do you obtain?
carseat.tree = tree(Sales~.,data=train.carseat)
summary(carseat.tree)
##
## Regression tree:
## tree(formula = Sales ~ ., data = train.carseat)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "CompPrice" "Age" "Income"
## [6] "Advertising"
## Number of terminal nodes: 19
## Residual mean deviance: 2.314 = 696.5 / 301
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -5.72300 -0.99680 -0.04119 0.00000 1.04500 3.97900
carseat.tree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 320 2514.000 7.552
## 2) ShelveLoc: Bad,Medium 246 1406.000 6.759
## 4) Price < 105.5 85 416.700 8.055
## 8) CompPrice < 123.5 64 255.700 7.426
## 16) Age < 68.5 49 152.400 7.972
## 32) Income < 57.5 15 30.560 6.656 *
## 33) Income > 57.5 34 84.430 8.553
## 66) Age < 37 7 22.340 10.240 *
## 67) Age > 37 27 36.870 8.114 *
## 17) Age > 68.5 15 40.750 5.640 *
## 9) CompPrice > 123.5 21 58.350 9.974 *
## 5) Price > 105.5 161 771.300 6.075
## 10) ShelveLoc: Bad 46 208.800 4.707
## 20) CompPrice < 124.5 17 41.050 3.516 *
## 21) CompPrice > 124.5 29 129.500 5.406
## 42) Price < 132.5 19 57.010 6.308 *
## 43) Price > 132.5 10 27.620 3.691 *
## 11) ShelveLoc: Medium 115 442.000 6.622
## 22) CompPrice < 122.5 36 100.800 5.526 *
## 23) CompPrice > 122.5 79 278.400 7.121
## 46) Advertising < 13.5 65 207.700 6.732
## 92) Price < 131.5 38 71.570 7.448
## 184) Age < 74.5 30 40.120 7.891 *
## 185) Age > 74.5 8 3.550 5.789 *
## 93) Price > 131.5 27 89.190 5.723 *
## 47) Advertising > 13.5 14 15.030 8.929 *
## 3) ShelveLoc: Good 74 438.500 10.190
## 6) Price < 109.5 25 73.110 12.290 *
## 7) Price > 109.5 49 198.700 9.117
## 14) Advertising < 14 42 135.400 8.700
## 28) Income < 47 15 16.490 7.575 *
## 29) Income > 47 27 89.390 9.324
## 58) CompPrice < 141 22 60.760 8.852
## 116) Advertising < 6 10 13.980 7.546 *
## 117) Advertising > 6 12 15.500 9.941 *
## 59) CompPrice > 141 5 2.143 11.400 *
## 15) Advertising > 14 7 12.020 11.620 *
plot(carseat.tree)
text(carseat.tree,pretty=0)
carseat.tree.predict=predict(carseat.tree,newdata=test.carseat)
plot(carseat.tree.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")
carseat.unprune.mse=mean((carseat.tree.predict-test.carseat$Sales)^2)
carseat.unprune.mse
## [1] 5.202575
carseat.unprune.rmse=sqrt(mean((carseat.tree.predict-test.carseat$Sales)^2))
carseat.unprune.rmse
## [1] 2.280915
As per the training Summary, below 5 variables are used while constructing the full decision tree which ends up in 19 terminal nodes. The Root mean square error is 2.280915 which implies that the prediction of car seat sales is within 2280 car seats of the actual median car seat sales across different stores
Use cross Validation on order to determine the optimal level of tree complexity. Does pruning the tree improved the test MSE
carseat.cv=cv.tree(carseat.tree)
plot(carseat.cv$size,carseat.cv$dev,type="b")
Based on The cross validation tree, the minimum error occur when the size of the tree is with 15 nodes
Prune the training data set with the size as 15 and check the RMSE for predicted model
prune.carseat=prune.tree(carseat.tree,best = 15)
Plot the prune tree
plot(prune.carseat)
text(prune.carseat,pretty=0)
Predict the carseat sales of the test dataset using the pruned tree model with the leaf node size as 15
carseat.prune.predict=predict(prune.carseat,newdata=test.carseat)
plot(carseat.prune.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")
carseat.prune.mse=mean((carseat.prune.predict-test.carseat$Sales)^2)
carseat.prune.mse
## [1] 5.003813
carseat.prune.rmse=sqrt(mean((carseat.prune.predict-test.carseat$Sales)^2))
carseat.prune.rmse
## [1] 2.236921
As per the cross validation , the minimum cv error occurs at the size of 15. After Pruning the data set for that size and find out the MSE, the difference between the MSE of un-pruned tree and pruned tree to the size of 15 is very minimum.As an Eyeball view of the Plot the maximum dip in the CV error occur when the size of the tree is less than 6 nodes after that the dip in the CV error is minimum.So pruning the data set with the size as 6 and compare the MSE
Prune the training data set with the size as 6 and check the RMSE for predicted model
prune.carseat.1=prune.tree(carseat.tree,best = 6)
Plot the prune tree
plot(prune.carseat.1)
text(prune.carseat.1,pretty=0)
Predict the carseat sales of the test dataset using the pruned tree model with the leaf node size as 10
carseat.prune.predict.1=predict(prune.carseat.1,newdata=test.carseat)
plot(carseat.prune.predict.1,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")
carseat.prune1.mse=mean((carseat.prune.predict.1-test.carseat$Sales)^2)
carseat.prune1.mse
## [1] 5.064176
carseat.prune1.rmse=sqrt(mean((carseat.prune.predict.1-test.carseat$Sales)^2))
carseat.prune1.rmse
## [1] 2.250372
As per the comparison of the MSE between the Pruned tree and the un-pruned tree there is no much improvement in the MSE.Test MSE is slightly lower for the pruned tree compared to the unpruned tree
tree.type=c("unpruned","prune-15","prune-5")
tree.mses=c(carseat.unprune.mse,carseat.prune.mse,carseat.prune1.mse)
tree.rmses=c(carseat.unprune.rmse,carseat.prune.rmse,carseat.prune1.rmse)
data.frame(tree.type,tree.mses,tree.rmses)
## tree.type tree.mses tree.rmses
## 1 unpruned 5.202575 2.280915
## 2 prune-15 5.003813 2.236921
## 3 prune-5 5.064176 2.250372
Use the bagging approach in-order to analyze this data. what test MSE do you obtain?.Use the Importance function to determine which variable are most important.
set.seed(1)
carseat.bag = randomForest(Sales~.,data=train.carseat,mtry=10,importance=TRUE)
carseat.bag
##
## Call:
## randomForest(formula = Sales ~ ., data = train.carseat, mtry = 10, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 10
##
## Mean of squared residuals: 2.482236
## % Var explained: 68.41
carseat.bag.predict=predict(carseat.bag,newdata = test.carseat )
plot(carseat.bag.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")
carseat.bag.mse=mean((carseat.bag.predict-test.carseat$Sales)^2)
carseat.bag.mse
## [1] 2.609163
carseat.bag.rmse=sqrt(mean((carseat.bag.predict-test.carseat$Sales)^2))
carseat.bag.rmse
## [1] 1.61529
Find out the Variable importance of the bagged model
varImpPlot(carseat.bag)
Test MSE is low in bagging approach compared to the pruned tree. Also Based on the variable importance, shelf location, price takes the higher priority compared to all other variables.
Use the Random forest to analyze this data. what test MSE do you obtain?.Use the Importance function to determine which variable are most important. Describe the effect of m, the number of variables considered at each split, on the error rate obtained.
Fit the Random forest model using caret package
carseat.rf=train(Sales~.,data=train.carseat,trControl=trainControl(method = "cv",number=10),method='rf',importance=TRUE)
Find out the best Tune value to check how many predictors are needed
carseat.rf$bestTune
## mtry
## 3 11
carseat.rf.predict=predict(carseat.rf,newdata = test.carseat )
plot(carseat.rf.predict,test.carseat$Sales,xlab = "Predicted Carseat Sales",ylab = "Actual Carseat sales")
abline(0,1,col="red")
carseat.rf.mse=mean((carseat.rf.predict-test.carseat$Sales)^2)
carseat.rf.mse
## [1] 2.566788
carseat.rf.rmse=sqrt(mean((carseat.rf.predict-test.carseat$Sales)^2))
carseat.rf.rmse
## [1] 1.60212
Plot the variable importance of the final model
plot(varImp(carseat.rf))
Test MSE is low in Random forest modelling technique compared to the pruned tree and bagg approach. Also based on the variable importance, shelf location, price takes the higher priority compared to all other variables.Overall Random forest model is best model to predict the number of car seat sales across 400 stores
This problem involves the OJ data set which is part of the ISLR package
data(OJ)
str(OJ)
## 'data.frame': 1070 obs. of 18 variables:
## $ Purchase : Factor w/ 2 levels "CH","MM": 1 1 1 2 1 1 1 1 1 1 ...
## $ WeekofPurchase: num 237 239 245 227 228 230 232 234 235 238 ...
## $ StoreID : num 1 1 1 1 7 7 7 7 7 7 ...
## $ PriceCH : num 1.75 1.75 1.86 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceMM : num 1.99 1.99 2.09 1.69 1.69 1.99 1.99 1.99 1.99 1.99 ...
## $ DiscCH : num 0 0 0.17 0 0 0 0 0 0 0 ...
## $ DiscMM : num 0 0.3 0 0 0 0 0.4 0.4 0.4 0.4 ...
## $ SpecialCH : num 0 0 0 0 0 0 1 1 0 0 ...
## $ SpecialMM : num 0 1 0 0 0 1 1 0 0 0 ...
## $ LoyalCH : num 0.5 0.6 0.68 0.4 0.957 ...
## $ SalePriceMM : num 1.99 1.69 2.09 1.69 1.69 1.99 1.59 1.59 1.59 1.59 ...
## $ SalePriceCH : num 1.75 1.75 1.69 1.69 1.69 1.69 1.69 1.75 1.75 1.75 ...
## $ PriceDiff : num 0.24 -0.06 0.4 0 0 0.3 -0.1 -0.16 -0.16 -0.16 ...
## $ Store7 : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 2 2 2 2 2 ...
## $ PctDiscMM : num 0 0.151 0 0 0 ...
## $ PctDiscCH : num 0 0 0.0914 0 0 ...
## $ ListPriceDiff : num 0.24 0.24 0.23 0 0 0.3 0.3 0.24 0.24 0.24 ...
## $ STORE : num 1 1 1 1 0 0 0 0 0 0 ...
Create a training set containing random sample of 800 observations and the test set containing the remaining observation
df = OJ
orange.juice.data = sample.split(df$Purchase, SplitRatio = 800/1070) #800 observations for the test set.
orange.juice.train = subset(df, orange.juice.data==T)
orange.juice.test = subset(df, orange.juice.data==F)
Fit a train to the training data, with purchase as the response and other variables as predictors.use the summary() function to produce summary statistics about the tree and describe the results obtained. what is the training error rate? how many terminal nodes does the tree have.
orange.juice.tree = tree(Purchase~.,data=orange.juice.train)
unprune.train.summary=summary(orange.juice.tree)
Based on the summary report, the number of terminal nodes used in the above tree is 8 and the training error rate is 16%
Type in the name of the tree object in order to get a detailed text output.pick one of the terminal nodes and interpret the results
orange.juice.tree
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 800 1070.000 CH ( 0.61000 0.39000 )
## 2) LoyalCH < 0.52192 367 446.500 MM ( 0.29700 0.70300 )
## 4) LoyalCH < 0.277221 161 120.800 MM ( 0.12422 0.87578 ) *
## 5) LoyalCH > 0.277221 206 281.800 MM ( 0.43204 0.56796 )
## 10) PriceDiff < 0.05 79 70.650 MM ( 0.16456 0.83544 ) *
## 11) PriceDiff > 0.05 127 171.100 CH ( 0.59843 0.40157 ) *
## 3) LoyalCH > 0.52192 433 325.800 CH ( 0.87529 0.12471 )
## 6) LoyalCH < 0.764572 168 193.200 CH ( 0.73810 0.26190 )
## 12) PriceDiff < 0.215 80 110.500 CH ( 0.53750 0.46250 ) *
## 13) PriceDiff > 0.215 88 48.870 CH ( 0.92045 0.07955 ) *
## 7) LoyalCH > 0.764572 265 85.160 CH ( 0.96226 0.03774 )
## 14) PriceDiff < -0.39 7 9.561 CH ( 0.57143 0.42857 ) *
## 15) PriceDiff > -0.39 258 64.310 CH ( 0.97287 0.02713 )
## 30) STORE < 1.5 139 0.000 CH ( 1.00000 0.00000 ) *
## 31) STORE > 1.5 119 53.240 CH ( 0.94118 0.05882 ) *
Create a plot of a tree and interpret the results
plot(orange.juice.tree)
text(orange.juice.tree,pretty=0)
Predict the response of the test data and produce a confusion matrix comparing the test labels to the predicted test labels.what is the test error rate?
orange.juice.tree.predict=predict(orange.juice.tree,newdata = orange.juice.test,type = "class" )
Confusion matrix
unprune.test.conf=confusionMatrix(orange.juice.tree.predict,orange.juice.test$Purchase)
unprune.test.error.rate=(unprune.test.conf$table[1,2]+unprune.test.conf$table[2,1])/(dim(orange.juice.test)[1])
unprune.test.accuracy=(unprune.test.conf$table[1,1]+unprune.test.conf$table[2,2])/(dim(orange.juice.test)[1])
Apply the cv.tree() function to the training data set in order to obtain the optimal tree size
orange.juice.cv=cv.tree(orange.juice.tree,FUN = prune.misclass)
orange.juice.cv$dev
## [1] 153 152 174 312
summary(orange.juice.cv)
## Length Class Mode
## size 4 -none- numeric
## dev 4 -none- numeric
## k 4 -none- numeric
## method 1 -none- character
Produce a plot with the tree size on the x acis and the cross validated error classification rate on the y axis
plot(orange.juice.cv$size,orange.juice.cv$dev,type="b")
Which tree size corresponds to the lowest cross validated classification rate?
Based on the graph, the minimum deviation error occur at the size of 2 and from there the error pumps a little bit when the size is 4 ,after that the minimum error stay constant still the size 8
Produce a pruned tree corresponding to the optimal tree size obtained using cross validation. If the cross validation does not lead to the selection of a pruned tree, the create the pruned tree with five terminal nodes
prune.orange.juice=prune.tree(orange.juice.tree,best = 2)
Plot the prune tree
plot(prune.orange.juice)
text(prune.orange.juice,pretty=0)
prune.train.summary=summary(prune.orange.juice)
Compare the training error rates between the pruned and un-pruned trees. which is higher?
TreeType = c("Un-pruned Tree","Pruned Tree")
ErrorRate = c(unprune.train.summary$misclass[1]/unprune.train.summary$misclass[2],
prune.train.summary$misclass[1]/prune.train.summary$misclass[2])
(Train.error.df=data.frame(TreeType,ErrorRate))
## TreeType ErrorRate
## 1 Un-pruned Tree 0.17250
## 2 Pruned Tree 0.20375
Training mis classification rate is slightly higher for prune tree with 5 terminal nodes than the full tree
Compare the test error rates between the pruned and un-pruned trees. which is higher?
prune.orange.juice.predict=predict(prune.orange.juice,newdata = orange.juice.test,type = "class")
prune.test.conf=confusionMatrix(prune.orange.juice.predict,orange.juice.test$Purchase)
prune.test.error.rate=(prune.test.conf$table[1,2]+prune.test.conf$table[2,1])/(dim(orange.juice.test)[1])
prune.test.accuracy=(prune.test.conf$table[1,1]+prune.test.conf$table[2,2])/(dim(orange.juice.test)[1])
TestErrorRate = c(unprune.test.error.rate,prune.test.error.rate)
TestAccuracy=c(unprune.test.accuracy,prune.test.accuracy)
(test.error.df=data.frame(TreeType,TestErrorRate,TestAccuracy))
## TreeType TestErrorRate TestAccuracy
## 1 Un-pruned Tree 0.2000000 0.8000000
## 2 Pruned Tree 0.2185185 0.7814815
Based on the table above test mis-classification rate and accuracy is almost same for prune tree with 5 terminal nodes and the full tree. There is negligible difference between the values.