Tree-based methods involve stratifying or segmenting the predictor space (\(X_1, X_2, \dots, X_p\)) into a number of simple regions. They generate a set of \(splitting \ rules\) which are used to segment the predictor space, and these regions \(R_j\) together define a Decision Tree. Decision Trees are used in both regression and classification problems. They are statistical learning techniques that are easy to understand and simple to interpret.
The rules generated are of the form
\[ R_j = \{\,X : X_1 \in A_1 \ \cap\ X_2 \in A_2 \ \cap\ \dots \ \cap\ X_p \in A_p \,\} \ \Longrightarrow\ \hat{y}_{R_j}, \]
i.e. each region \(R_j\) is defined by a conjunction of conditions on the predictors, and every observation falling in \(R_j\) is assigned the same predicted response \(\hat{y}_{R_j}\).
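As a toy illustration of such a rule (made-up cutoffs, not from the model fitted below), a region rule is just a conditional prediction:
#A hand-written region rule with illustrative cutoffs: every observation falling
#in the region gets that region's predicted class
predict_region <- function(Price, ShelveLoc){
  if(ShelveLoc == "Good" && Price < 135) "Yes" else "No"
}
predict_region(Price = 100, ShelveLoc = "Good") #returns "Yes"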
require(ISLR) #package containing data
require(ggplot2)
require(tree)
#Using the Carseats data set
attach(Carseats)
?Carseats
Carseats is a simulated data set containing sales of child car seats at 400 different stores.
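A quick look at its size and structure (output omitted):
dim(Carseats) #400 observations of 11 variables (Sales plus 10 predictors)
str(Carseats)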
#Checking the distribution of Sales
ggplot(aes(x = Sales),data = Carseats) +
geom_histogram(color="black",fill = 'purple',alpha = 0.6, bins=30) +
labs(x = "Unit Sales in Thousands", y = "Frequency")
As the histogram suggests, Sales is roughly normally distributed, with the highest frequency around 8 thousand unit sales.
#Making a binary (Yes/No) variable from Sales
HighSales<-ifelse(Sales <= 8,"No","Yes")
head(HighSales)
## [1] "Yes" "Yes" "Yes" "No" "No" "Yes"
#Making a Data frame
Carseats<-data.frame(Carseats,HighSales)
Now we are going to fit a tree to the Carseats data to predict whether we will have high sales or not. The tree() function uses a top-down greedy approach to fit the tree, also known as Recursive Binary Splitting. It is greedy because it doesn't search for the best split among all possible sequences of splits, but only for the best split at the immediate place it is looking, i.e. the best split at that particular step.
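To make that greedy step concrete, here is a small sketch (helper functions assumed for illustration, not the actual internals of tree()) that finds the best single binary split of one numeric predictor by Gini impurity, the step that recursive binary splitting repeats at every node:
#Gini impurity of a set of class labels
gini <- function(y){ p <- table(y)/length(y); 1 - sum(p^2) }
#Best single binary split of a numeric predictor x for labels y
best_split <- function(x, y){
  cuts <- sort(unique(x))
  cuts <- (head(cuts,-1) + tail(cuts,-1))/2 #candidate cut points (midpoints)
  impurity <- sapply(cuts, function(cp){
    left <- y[x < cp]; right <- y[x >= cp]
    (length(left)*gini(left) + length(right)*gini(right))/length(y) #weighted impurity
  })
  cuts[which.min(impurity)] #cut point with the lowest weighted impurity
}
best_split(Carseats$Price, Carseats$HighSales) #e.g. best single split of Price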
#We will use the tree() function to fit a Decision Tree
?tree
#Excluding the Sales attribute
CarTree<-tree(HighSales ~ . -Sales , data = Carseats,split = c("deviance","gini"))
#The split argument specifies the splitting criterion to use (deviance or gini; deviance here)
CarTree #Outputs the tree: the splits on the various variables and the response at the terminal nodes
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 400 541.500 No ( 0.59000 0.41000 )
## 2) ShelveLoc: Bad,Medium 315 390.600 No ( 0.68889 0.31111 )
## 4) Price < 92.5 46 56.530 Yes ( 0.30435 0.69565 )
## 8) Income < 57 10 12.220 No ( 0.70000 0.30000 )
## 16) CompPrice < 110.5 5 0.000 No ( 1.00000 0.00000 ) *
## 17) CompPrice > 110.5 5 6.730 Yes ( 0.40000 0.60000 ) *
## 9) Income > 57 36 35.470 Yes ( 0.19444 0.80556 )
## 18) Population < 207.5 16 21.170 Yes ( 0.37500 0.62500 ) *
## 19) Population > 207.5 20 7.941 Yes ( 0.05000 0.95000 ) *
## 5) Price > 92.5 269 299.800 No ( 0.75465 0.24535 )
## 10) Advertising < 13.5 224 213.200 No ( 0.81696 0.18304 )
## 20) CompPrice < 124.5 96 44.890 No ( 0.93750 0.06250 )
## 40) Price < 106.5 38 33.150 No ( 0.84211 0.15789 )
## 80) Population < 177 12 16.300 No ( 0.58333 0.41667 )
## 160) Income < 60.5 6 0.000 No ( 1.00000 0.00000 ) *
## 161) Income > 60.5 6 5.407 Yes ( 0.16667 0.83333 ) *
## 81) Population > 177 26 8.477 No ( 0.96154 0.03846 ) *
## 41) Price > 106.5 58 0.000 No ( 1.00000 0.00000 ) *
## 21) CompPrice > 124.5 128 150.200 No ( 0.72656 0.27344 )
## 42) Price < 122.5 51 70.680 Yes ( 0.49020 0.50980 )
## 84) ShelveLoc: Bad 11 6.702 No ( 0.90909 0.09091 ) *
## 85) ShelveLoc: Medium 40 52.930 Yes ( 0.37500 0.62500 )
## 170) Price < 109.5 16 7.481 Yes ( 0.06250 0.93750 ) *
## 171) Price > 109.5 24 32.600 No ( 0.58333 0.41667 )
## 342) Age < 49.5 13 16.050 Yes ( 0.30769 0.69231 ) *
## 343) Age > 49.5 11 6.702 No ( 0.90909 0.09091 ) *
## 43) Price > 122.5 77 55.540 No ( 0.88312 0.11688 )
## 86) CompPrice < 147.5 58 17.400 No ( 0.96552 0.03448 ) *
## 87) CompPrice > 147.5 19 25.010 No ( 0.63158 0.36842 )
## 174) Price < 147 12 16.300 Yes ( 0.41667 0.58333 )
## 348) CompPrice < 152.5 7 5.742 Yes ( 0.14286 0.85714 ) *
## 349) CompPrice > 152.5 5 5.004 No ( 0.80000 0.20000 ) *
## 175) Price > 147 7 0.000 No ( 1.00000 0.00000 ) *
## 11) Advertising > 13.5 45 61.830 Yes ( 0.44444 0.55556 )
## 22) Age < 54.5 25 25.020 Yes ( 0.20000 0.80000 )
## 44) CompPrice < 130.5 14 18.250 Yes ( 0.35714 0.64286 )
## 88) Income < 100 9 12.370 No ( 0.55556 0.44444 ) *
## 89) Income > 100 5 0.000 Yes ( 0.00000 1.00000 ) *
## 45) CompPrice > 130.5 11 0.000 Yes ( 0.00000 1.00000 ) *
## 23) Age > 54.5 20 22.490 No ( 0.75000 0.25000 )
## 46) CompPrice < 122.5 10 0.000 No ( 1.00000 0.00000 ) *
## 47) CompPrice > 122.5 10 13.860 No ( 0.50000 0.50000 )
## 94) Price < 125 5 0.000 Yes ( 0.00000 1.00000 ) *
## 95) Price > 125 5 0.000 No ( 1.00000 0.00000 ) *
## 3) ShelveLoc: Good 85 90.330 Yes ( 0.22353 0.77647 )
## 6) Price < 135 68 49.260 Yes ( 0.11765 0.88235 )
## 12) US: No 17 22.070 Yes ( 0.35294 0.64706 )
## 24) Price < 109 8 0.000 Yes ( 0.00000 1.00000 ) *
## 25) Price > 109 9 11.460 No ( 0.66667 0.33333 ) *
## 13) US: Yes 51 16.880 Yes ( 0.03922 0.96078 ) *
## 7) Price > 135 17 22.070 No ( 0.64706 0.35294 )
## 14) Income < 46 6 0.000 No ( 1.00000 0.00000 ) *
## 15) Income > 46 11 15.160 Yes ( 0.45455 0.54545 ) *
#The numeric values within the parentheses are the proportions of No and Yes at each node.
#Summary of the Decision Tree
summary(CarTree)
##
## Classification tree:
## tree(formula = HighSales ~ . - Sales, data = Carseats, split = c("deviance",
## "gini"))
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Price" "Income" "CompPrice" "Population"
## [6] "Advertising" "Age" "US"
## Number of terminal nodes: 27
## Residual mean deviance: 0.4575 = 170.7 / 373
## Misclassification error rate: 0.09 = 36 / 400
The summary of the model lists the important variables actually used for splitting the data. The default splitting criterion minimizes the deviance (error); the other available criterion is the Gini index, a measure of node purity.
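For reference, writing \(\hat{p}_{mk}\) for the proportion of class \(k\) in node \(m\) and \(n_{mk}\) for the corresponding count, the two criteria are
\[ D = -2\sum_{m}\sum_{k} n_{mk}\,\log \hat{p}_{mk}, \qquad G_m = \sum_{k} \hat{p}_{mk}\,(1-\hat{p}_{mk}), \]
and the residual mean deviance reported above is \(D\) divided by \(n - |T|\), the number of observations minus the number of terminal nodes (here \(400 - 27 = 373\)).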
plot(CarTree)
#Adding Predictors as text to plot
text(CarTree ,pretty = 1 )
This tree is quite complicated and hard to understand due to the large number of splits and variables included in the predictor space. The leaf nodes show the predicted response, i.e. Yes / No.
set.seed(1001)
#A training sample of 250 examples sampled without replacement
train<-sample(1:nrow(Carseats), 250)
#Fitting another Model
tree1<-tree(HighSales ~ .-Sales , data = Carseats, subset = train)
summary(tree1)
##
## Classification tree:
## tree(formula = HighSales ~ . - Sales, data = Carseats, subset = train)
## Variables actually used in tree construction:
## [1] "ShelveLoc" "Urban" "Income" "Education" "Price"
## [6] "Advertising" "CompPrice" "Age"
## Number of terminal nodes: 22
## Residual mean deviance: 0.4124 = 94.02 / 228
## Misclassification error rate: 0.092 = 23 / 250
#Plotting
plot(tree1);text(tree1)
Now the tree is somewhat different and more detailed, but it is still quite hard to interpret because of the many splits.
Predicting on Test Set
#Predicting the Class labels for Test set
pred<-predict(tree1, newdata = Carseats[-train,],type = "class")
head(pred)
## [1] Yes No Yes No No Yes
## Levels: No Yes
#Confusion Matrix to check number of Misclassifications
with(Carseats[-train,],table(pred,HighSales))
## HighSales
## pred No Yes
## No 71 22
## Yes 18 39
#Misclassification Error Rate on Test Set
mean(pred!=Carseats[-train,]$HighSales)
## [1] 0.2666667
The diagonal entries are the correctly classified test examples, whereas the off-diagonal entries represent the misclassified ones. The test error rate is about \(26.7\%\).
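Equivalently, the error rate can be read straight off the confusion matrix (a quick check using the objects defined above):
conf <- with(Carseats[-train,], table(pred, HighSales))
sum(diag(conf))/sum(conf) #accuracy: (71 + 39)/150 = 0.733
1 - sum(diag(conf))/sum(conf) #error rate = 0.267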
The above tree was grown to full depth and may include many variables that degrade its performance on new data. We will now use 10-fold cross-validation to prune the tree.
#10 fold CV
#Performing Cost Complexity Pruning
cv.tree1<-cv.tree(tree1, FUN=prune.misclass)
cv.tree1
## $size
## [1] 22 15 14 12 8 6 4 1
##
## $dev
## [1] 76 77 82 82 84 92 88 108
##
## $k
## [1] -Inf 0.00000 1.00000 1.50000 3.00000 4.50000 7.00000 13.66667
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
plot(cv.tree1)
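The size can also be picked programmatically from the cv.tree object (a small convenience sketch; here the strict minimum is at size 22, with size 15 almost as good and much smaller):
cv.tree1$size[which.min(cv.tree1$dev)] #size with the lowest CV misclassification count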
#CV misclassification is (nearly) minimal at size 15, i.e. a tree with 15 terminal nodes
prune.tree1<-prune.misclass(tree1,best = 15)
plot(prune.tree1);text(prune.tree1)
Testing the pruned Tree on Test Set -
pred1<-predict(prune.tree1 , Carseats[-train,],type="class")
#Confusion Matrix
with(Carseats[-train,],table(pred1,HighSales))
## HighSales
## pred1 No Yes
## No 74 23
## Yes 15 38
#Misclassification Rate
ErrorPrune<-mean(pred1!=Carseats[-train,]$HighSales)
ErrorPrune
## [1] 0.2533333
#Error reduced to 25 %
As we can see from the performance on the test set, the pruned tree performs only marginally better: the error rate dropped from about 26.7% to 25.3%. What pruning really gave us is a simpler tree with fewer splits and a subset of the predictors, which is easier to interpret and understand.
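To see the simplification concretely (using the fitted objects from above; a tree object stores its nodes in $frame, with terminal nodes marked "<leaf>"):
sum(tree1$frame$var == "<leaf>") #22 terminal nodes in the unpruned tree
sum(prune.tree1$frame$var == "<leaf>") #15 terminal nodes after pruning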
A single tree usually doesn't give great performance on test sets on its own; it is what is called a weak learner.
Applying ensemble techniques such as bagging, Random Forests, and boosting improves the performance of trees a lot: many trees are trained on (re)samples of the training examples, and their predictions are combined (by averaging or voting) into a single strong learner that performs much better.
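As a taste of what that looks like on this data (a sketch assuming the randomForest package is installed; not part of the analysis above):
require(randomForest)
set.seed(1001)
#randomForest needs a factor response (a no-op if HighSales is already a factor)
Carseats$HighSales <- as.factor(Carseats$HighSales)
rf <- randomForest(HighSales ~ . -Sales, data = Carseats, subset = train, ntree = 500, importance = TRUE)
rf.pred <- predict(rf, newdata = Carseats[-train,])
mean(rf.pred != Carseats[-train,]$HighSales) #test error of the ensemble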
Hope you guys liked the article; make sure to share and like it.