INTRODUCTION
“The Iris flower data set or Fisher’s Iris data set is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis” -Wikipedia
A Decision Tree is a supervised machine learning algorithm that looks like an inverted tree: each internal node tests a predictor variable (feature), each link between nodes represents a decision, and each leaf node represents an outcome (a value of the response variable).
We shall use a Decision Tree to classify the three species of Iris flowers in this dataset.
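To make the inverted-tree picture concrete, a fitted tree can be read as a chain of if/else rules. The sketch below hand-codes such a chain for the built-in iris data. The function name classify_iris and the two thresholds (2.45 and 1.75) are illustrative assumptions, not values taken from the models fitted later in this article.

```r
# Hypothetical hand-written tree: each if() is an internal node,
# each returned label is a leaf. Thresholds are illustrative assumptions.
classify_iris <- function(petal_length, petal_width) {
  if (petal_length < 2.45) {
    "setosa"
  } else if (petal_width < 1.75) {
    "versicolor"
  } else {
    "virginica"
  }
}

# Apply the rules to every flower in the built-in iris data
rule_pred <- mapply(classify_iris, iris$Petal.Length, iris$Petal.Width)

# Share of flowers whose rule-based label matches the recorded species
rule_accuracy <- mean(rule_pred == as.character(iris$Species))
print(rule_accuracy)
```

A tree learner such as rpart searches for the features and thresholds automatically instead of having them written by hand.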
library(caret)
Loading required package: lattice
Loading required package: ggplot2
library(rpart.plot)
Loading required package: rpart
The caret package (short for Classification And REgression Training) contains functions to streamline the model training process for complex regression and classification problems -https://cran.r-project.org/web/packages/caret/vignettes/caret.html
The rpart.plot package draws trees fitted with rpart; its prp function is used below to plot the Decision Tree
> irisdata<-datasets::iris
> table(irisdata$Species)
setosa versicolor virginica
50 50 50
Summary of data
> summary(irisdata)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Structure of data
> str(irisdata)
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
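Since we are using a tree-based classifier, there is no need to scale the dataset: each split compares a single feature against a threshold, so only the ordering of the values matters, not their units.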
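The claim about scaling can be checked directly. The sketch below fits one tree on the raw measurements and one on standardised copies, then compares their predictions. It calls rpart directly on the full built-in iris data rather than going through caret, and the object names (raw_fit, scaled_fit and so on) are assumptions made for this illustration.

```r
library(rpart)

# Fit one tree on the raw measurements
raw_fit <- rpart(Species ~ ., data = iris, method = "class")

# Standardise the four numeric columns (mean 0, sd 1) and fit again
scaled_iris <- iris
scaled_iris[1:4] <- as.data.frame(scale(iris[1:4]))
scaled_fit <- rpart(Species ~ ., data = scaled_iris, method = "class")

raw_pred <- predict(raw_fit, iris, type = "class")
scaled_pred <- predict(scaled_fit, scaled_iris, type = "class")

# TRUE when both trees assign every flower to the same species
same_predictions <- all(raw_pred == scaled_pred)
print(same_predictions)
```

Standardising shifts and stretches each feature but keeps the order of its values, so the same groups of flowers end up on each side of every split.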
We will now split the dataset into train and test sets, with 70% of the data in train and 30% in test.
> set.seed(3033)
> intrain <- createDataPartition(y = irisdata$Species, p= 0.7, list = FALSE)
> training <- irisdata[intrain,]
> testing <- irisdata[-intrain,]
> dim(training);dim(testing)
[1] 105 5
[1] 45 5
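createDataPartition samples within each level of the outcome, so the class balance of the full data carries over to both subsets. The sketch below repeats the split on the built-in iris data and tabulates the species in each part. The names idx, train_counts and test_counts are assumptions introduced for this illustration.

```r
library(caret)

# Same seed and proportion as the split above
set.seed(3033)
idx <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)[, 1]

# Count the species on each side of the split
train_counts <- table(iris$Species[idx])
test_counts <- table(iris$Species[-idx])

print(train_counts)
print(test_counts)
```

With 50 flowers per species and p = 0.7, each species should contribute 35 rows to training and 15 to testing, which agrees with the 105 and 45 rows reported by dim().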
Classification by Information Gain
> trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
> set.seed(3333)
> dtree_fit_info <- train(Species ~., data = training, method = "rpart",
+ parms = list(split = "information"),
+ trControl=trctrl,
+ tuneLength = 10)
> prp(dtree_fit_info$finalModel, box.palette="Reds", tweak=1.2)
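The parms = list(split = "information") argument asks rpart to rank candidate splits with an entropy-based criterion. As a rough sketch of the idea (not rpart's internal code), the helpers below compute Shannon entropy and the information gain of one candidate split on the built-in iris data. The function names entropy and info_gain and the 2.45 threshold are assumptions chosen for illustration.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]              # drop empty classes so log2(0) never appears
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by a logical vector `goes_left`
info_gain <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  entropy(labels) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# Illustrative candidate split (the 2.45 threshold is an assumption)
gain <- info_gain(iris$Species, iris$Petal.Length < 2.45)
print(gain)
```

The gain is the parent node's entropy minus the size-weighted entropy of the two children; a split that isolates one species completely, as this one does for setosa, scores highly.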
Let us check the outcome
> test_pred_info<-predict(dtree_fit_info,newdata = testing)
> confusionMatrix(test_pred_info,testing$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 15 0 0
versicolor 0 13 1
virginica 0 2 14
Overall Statistics
Accuracy : 0.9333
95% CI : (0.8173, 0.986)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8667 0.9333
Specificity 1.0000 0.9667 0.9333
Pos Pred Value 1.0000 0.9286 0.8750
Neg Pred Value 1.0000 0.9355 0.9655
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.2889 0.3111
Detection Prevalence 0.3333 0.3111 0.3556
Balanced Accuracy 1.0000 0.9167 0.9333
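The Information Gain model performed well: it classified 42 of the 45 test flowers correctly (accuracy 0.93). All three errors are confusions between versicolor and virginica; setosa is separated perfectly.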
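The headline figures in the confusionMatrix output can be recomputed by hand from the 3 x 3 table. The sketch below copies the counts from the output above and derives accuracy, per-class sensitivity, positive predictive value and Cohen's kappa. The object names (cm, cohen_kappa and so on) are assumptions for this illustration.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(c(15,  0,  0,
                0, 13,  1,
                0,  2, 14),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

n <- sum(cm)
accuracy <- sum(diag(cm)) / n               # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)       # correct / actual members of each class
pos_pred_value <- diag(cm) / rowSums(cm)    # correct / predicted members of each class

# Cohen's kappa: agreement beyond what the row and column totals give by chance
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, kappa = cohen_kappa), 4))
print(round(sensitivity, 4))
print(round(pos_pred_value, 4))
```

Sensitivity divides by the column totals (actual class sizes), while positive predictive value divides by the row totals (predicted class sizes).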
Classification by Gini Index
> set.seed(3333)
> dtree_fit_gini <- train(Species ~., data = training, method = "rpart",
+ parms = list(split = "gini"),
+ trControl=trctrl,
+ tuneLength = 10)
> prp(dtree_fit_gini$finalModel,box.palette = "Blues", tweak = 1.2)
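With split = "gini", candidate splits are ranked by how much they reduce Gini impurity, defined as one minus the sum of squared class proportions in a node. The sketch below illustrates the idea on the built-in iris data and is not rpart's internal code. The function names gini_impurity and gini_decrease and the 2.45 threshold are assumptions chosen for illustration.

```r
# Gini impurity of a vector of class labels: 1 - sum of squared class shares
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Size-weighted impurity decrease for a split described by a logical vector
gini_decrease <- function(labels, goes_left) {
  n <- length(labels)
  left <- labels[goes_left]
  right <- labels[!goes_left]
  gini_impurity(labels) -
    (length(left) / n) * gini_impurity(left) -
    (length(right) / n) * gini_impurity(right)
}

# Impurity at the root and the decrease from one illustrative split
root_gini <- gini_impurity(iris$Species)
decrease <- gini_decrease(iris$Species, iris$Petal.Length < 2.45)
print(c(root = root_gini, decrease = decrease))
```

A node containing a single species has impurity 0, and a node with three equally frequent species has impurity 2/3, the maximum for three classes.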
Let us check the outcome
> test_pred_gini<-predict(dtree_fit_gini,newdata = testing)
> confusionMatrix(test_pred_gini,testing$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 15 0 0
versicolor 0 13 1
virginica 0 2 14
Overall Statistics
Accuracy : 0.9333
95% CI : (0.8173, 0.986)
No Information Rate : 0.3333
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.9
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8667 0.9333
Specificity 1.0000 0.9667 0.9333
Pos Pred Value 1.0000 0.9286 0.8750
Neg Pred Value 1.0000 0.9355 0.9655
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.2889 0.3111
Detection Prevalence 0.3333 0.3111 0.3556
Balanced Accuracy 1.0000 0.9167 0.9333
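The Gini index model produced exactly the same confusion matrix as the Information Gain model, again with an accuracy of 0.93 (42 of 45 test flowers correct), so on this split the choice of splitting criterion made no difference.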
References:
1) https://dataaspirant.com/decision-tree-classifier-implementation-in-r/ -Author: Rahul Saxena
2) https://drive.google.com/file/d/1mQguC2gku2-QFruj09a30N0TYDwCmPkq/view -Author: Xaltius Pte. Ltd.