Decision Tree Classifier

INTRODUCTION

“The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” (Wikipedia)

A Decision Tree is a supervised machine learning algorithm that looks like an inverted tree: each internal node tests a predictor variable (feature), each link between nodes represents a decision, and each leaf node represents an outcome (response variable).

We shall use a Decision Tree to classify the 3 species of Iris flowers in this dataset.
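
To make the node, link and leaf vocabulary concrete, a small tree can be written out as nested conditions. The sketch below is illustrative only: the function name classify_iris_sketch and the two thresholds (2.45 and 1.75) are assumptions chosen for the example, not values taken from the models fitted later in this document.

```r
# Hypothetical hand-written tree: each `if` is a node testing one feature,
# each branch is a decision, and each returned label is a leaf.
classify_iris_sketch <- function(petal_length, petal_width) {
  if (petal_length < 2.45) {          # root node: test Petal.Length
    "setosa"                          # leaf
  } else if (petal_width < 1.75) {    # second node: test Petal.Width
    "versicolor"                      # leaf
  } else {
    "virginica"                       # leaf
  }
}

# Apply the sketch to every row of the built-in iris data.
sketch_pred <- mapply(classify_iris_sketch, iris$Petal.Length, iris$Petal.Width)
sketch_accuracy <- mean(sketch_pred == iris$Species)
sketch_accuracy
```

A fitted tree has the same shape; the difference is that the algorithm chooses the features and thresholds from the training data instead of having them written by hand.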

library(caret)

Loading required package: lattice

Loading required package: ggplot2

library(rpart.plot)

Loading required package: rpart

The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems. Source: https://cran.r-project.org/web/packages/caret/vignettes/caret.html

In the rpart.plot package, the rpart.plot function is a simplified front end to the prp function; prp is used here to draw the Decision Tree.

> irisdata<-datasets::iris
> table(irisdata$Species)


    setosa versicolor  virginica 
        50         50         50

Summary of data

> summary(irisdata)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50

Structure of data

> str(irisdata)

'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

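Since we are using a tree-based classifier, there is no need to scale the dataset: each split compares a single feature against a threshold, so rescaling a feature changes the threshold but not how the observations are partitioned.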
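
One way to check this claim is to fit the same tree on raw and on standardised features and compare the predictions. The sketch below calls rpart directly on the full iris data rather than going through caret, and the object names (fit_raw, fit_scaled, same_predictions) are introduced here for illustration.

```r
library(rpart)

# Fit one tree on the raw measurements.
fit_raw <- rpart(Species ~ ., data = iris, method = "class")

# Fit a second tree on centred and scaled copies of the four numeric columns.
iris_scaled <- iris
iris_scaled[1:4] <- as.data.frame(scale(iris[1:4]))
fit_scaled <- rpart(Species ~ ., data = iris_scaled, method = "class")

# Compare the class predictions of the two trees on their own training data.
pred_raw <- predict(fit_raw, iris, type = "class")
pred_scaled <- predict(fit_scaled, iris_scaled, type = "class")
same_predictions <- all(pred_raw == pred_scaled)
same_predictions
```

If the two trees partition the rows in the same way, same_predictions is TRUE even though the split thresholds are expressed in different units.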

We will now split the dataset into train and test sets, with 70% of the data in train and 30% in test.

> set.seed(3033)
> intrain <- createDataPartition(y = irisdata$Species, p= 0.7, list = FALSE)
> training <- irisdata[intrain,]
> testing <- irisdata[-intrain,]
> dim(training);dim(testing)

[1] 105   5

[1] 45  5
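
When y is a factor, createDataPartition samples within each level, which is why the 105/45 split keeps the three species balanced. The base R sketch below shows the same idea under simple assumptions; it is not caret's implementation, it will not select the same rows as the call above, and the names rows_by_species, train_sketch and test_sketch are introduced only for this example.

```r
set.seed(3033)

# Row indices grouped by species, so each class is sampled separately.
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Draw 70% of the rows within each species (35 of 50 here).
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.7 * length(rows)))
}), use.names = FALSE)

train_sketch <- iris[train_rows, ]
test_sketch <- iris[-train_rows, ]

# Class counts in each partition.
table(train_sketch$Species)
table(test_sketch$Species)
```

Sampling per class rather than over all 150 rows at once guarantees that no species is under-represented in the test set by chance.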

Classification by Information Gain

> trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
> set.seed(3333)
> dtree_fit_info <- train(Species ~., data = training, method = "rpart",
+                    parms = list(split = "information"),
+                    trControl=trctrl,
+                    tuneLength = 10)
> prp(dtree_fit_info$finalModel, box.palette="Reds", tweak=1.2)
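
In the call above, trainControl(method = "repeatedcv", number = 10, repeats = 3) requests 10-fold cross-validation repeated three times, so each candidate value of the tuning parameter (the complexity parameter cp for method = "rpart") is scored on 30 held-out folds, and tuneLength = 10 asks for ten candidate values. The base R sketch below only illustrates the fold bookkeeping for 105 training rows; it uses plain random folds, whereas caret balances folds by class, and all names in it are introduced for this example.

```r
set.seed(3333)

n_rows <- 105   # size of the training set
k <- 10         # number of folds
repeats <- 3    # number of repetitions

# For each repetition, shuffle fold labels 1..k over the rows and
# collect the row indices held out in each fold.
resamples <- lapply(seq_len(repeats), function(r) {
  fold_id <- sample(rep(seq_len(k), length.out = n_rows))
  split(seq_len(n_rows), fold_id)
})

# Flatten to one list entry per (repetition, fold) pair.
held_out <- unlist(resamples, recursive = FALSE)
length(held_out)           # number of resamples
range(lengths(held_out))   # held-out fold sizes
```

Within a repetition every row is held out exactly once, so the averaged accuracy for each cp value is based on predictions for the whole training set, three times over.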

Let us check the outcome

> test_pred_info<-predict(dtree_fit_info,newdata = testing)
> confusionMatrix(test_pred_info,testing$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         1
  virginica       0          2        14

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.8173, 0.986)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.9333
Specificity                 1.0000            0.9667           0.9333
Pos Pred Value              1.0000            0.9286           0.8750
Neg Pred Value              1.0000            0.9355           0.9655
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3111
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9167           0.9333
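
The headline statistics can be reproduced by hand from the confusion matrix, which helps when reading the report. The sketch below re-enters the printed counts and recomputes accuracy, Cohen's kappa, and the per-class sensitivity and positive predictive value; the variable names are introduced here and are not part of caret.

```r
# Confusion matrix copied from the printed output
# (rows: prediction, columns: reference).
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(15,  0,  0,
                0, 13,  1,
                0,  2, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                      # (15 + 13 + 14) / 45
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (accuracy - expected) / (1 - expected)

sensitivity <- diag(cm) / colSums(cm)      # correct / actual, per class
pos_pred_value <- diag(cm) / rowSums(cm)   # correct / predicted, per class

round(c(accuracy = accuracy, kappa = cohen_kappa), 4)
round(rbind(sensitivity, pos_pred_value), 4)
```

For example, versicolor sensitivity is 13/15 because 13 of the 15 true versicolor flowers were predicted as versicolor, while its positive predictive value is 13/14 because 14 flowers were predicted as versicolor in total.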

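The Information Gain model performed well on the test set, with an accuracy of 0.93 (42 of 45 flowers classified correctly). All setosa flowers were identified; the three errors are confusions between versicolor and virginica.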
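
With split = "information", candidate splits are ranked by how much they reduce entropy. The sketch below computes that reduction for a single candidate split on the full iris data. The helper names entropy and information_gain and the threshold 2.45 are assumptions made for this example, and base-2 logarithms are used; a different log base rescales the values without changing which split ranks highest.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  p <- p[p > 0]               # drop empty classes so 0 * log2(0) never appears
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a logical condition `goes_left`:
# parent entropy minus the size-weighted entropy of the two children.
information_gain <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  entropy(labels) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# Candidate split: Petal.Length below an illustrative threshold.
gain <- information_gain(iris$Species, iris$Petal.Length < 2.45)
gain
```

This split sends all 50 setosa flowers to one side, leaving a pure child with entropy 0 and a 50/50 child with entropy 1 bit, so the gain is log2(3) - (100/150) * 1.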

Classification by Gini Index

> set.seed(3333)
> dtree_fit_gini <- train(Species ~., data = training, method = "rpart",
+                         parms = list(split = "gini"),
+                         trControl=trctrl,
+                         tuneLength = 10)
> prp(dtree_fit_gini$finalModel,box.palette = "Blues", tweak = 1.2)

Let us check the outcome

> test_pred_gini<-predict(dtree_fit_gini,newdata = testing)
> confusionMatrix(test_pred_gini,testing$Species)

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         13         1
  virginica       0          2        14

Overall Statistics
                                         
               Accuracy : 0.9333         
                 95% CI : (0.8173, 0.986)
    No Information Rate : 0.3333         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9            
                                         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8667           0.9333
Specificity                 1.0000            0.9667           0.9333
Pos Pred Value              1.0000            0.9286           0.8750
Neg Pred Value              1.0000            0.9355           0.9655
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.2889           0.3111
Detection Prevalence        0.3333            0.3111           0.3556
Balanced Accuracy           1.0000            0.9167           0.9333

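The Gini index model reached the same test accuracy of 0.93, with a confusion matrix identical to that of the Information Gain model: on this test set the two splitting criteria produce the same predictions.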
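
With split = "gini", candidate splits are ranked by how much they reduce Gini impurity, defined as one minus the sum of squared class proportions. The sketch below evaluates one candidate split on the full iris data; the helper names gini_impurity and gini_decrease and the threshold 2.45 are assumptions made for this example.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared proportions.
gini_impurity <- function(labels) {
  if (length(labels) == 0) return(0)
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in impurity from splitting `labels` by a logical condition:
# parent impurity minus the size-weighted impurity of the two children.
gini_decrease <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  gini_impurity(labels) -
    (length(left) / n) * gini_impurity(left) -
    (length(right) / n) * gini_impurity(right)
}

# Candidate split: Petal.Length below an illustrative threshold.
decrease <- gini_decrease(iris$Species, iris$Petal.Length < 2.45)
decrease
```

The parent impurity is 1 - 3 * (1/3)^2 = 2/3, the pure setosa child has impurity 0 and the 50/50 child has impurity 1/2, so the decrease is 2/3 - (100/150) * 1/2 = 1/3. Entropy and Gini impurity often rank candidate splits in the same order, which is consistent with the two models giving identical results here.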

References:
1) https://dataaspirant.com/decision-tree-classifier-implementation-in-r/ - Author: Rahul Saxena
2) https://drive.google.com/file/d/1mQguC2gku2-QFruj09a30N0TYDwCmPkq/view - Author: Xaltius Pte. Ltd.

Anubrata Das

5/11/2021