Classification using Naive Bayes and Decision Tree algorithms

Apply a Decision Tree classifier and a Naive Bayesian classifier to the given dataset. Compare the performance of these two classifiers using all the reported parameters.

Case Study: IRIS Classification

#About the data set

The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It is one of the most widely used datasets for learning machine learning and statistics. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres. The fifth column is the species of the flower observed.

Installing the required packages

We need to load a few important packages before we begin our analysis.

#CARET Package

The caret package provides a consistent interface to hundreds of machine learning algorithms, along with useful convenience methods for data visualization, data resampling, model tuning and model comparison, among other features.

install.packages("caret")
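The other packages used later in this case study can be installed the same way. A possible one-time setup (assuming none of them are installed yet; klaR is included because caret's "nb" method relies on it) is:

# Install all packages used in this case study (run once)
install.packages(c("caret", "rpart", "rpart.plot", "e1071", "klaR",
                   "ggplot2", "ggthemes"))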

Loading the data from within R

# Attach the dataset to the environment
data(iris)
# Get help on the data
help(iris)
# Rename the data
iris_filedata<-iris
# View the data
View(iris_filedata)
# View the top few rows of the data in R console
head(iris_filedata,25)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1           5.1         3.5          1.4         0.2  setosa
## 2           4.9         3.0          1.4         0.2  setosa
## 3           4.7         3.2          1.3         0.2  setosa
## 4           4.6         3.1          1.5         0.2  setosa
## 5           5.0         3.6          1.4         0.2  setosa
## 6           5.4         3.9          1.7         0.4  setosa
## 7           4.6         3.4          1.4         0.3  setosa
## 8           5.0         3.4          1.5         0.2  setosa
## 9           4.4         2.9          1.4         0.2  setosa
## 10          4.9         3.1          1.5         0.1  setosa
## 11          5.4         3.7          1.5         0.2  setosa
## 12          4.8         3.4          1.6         0.2  setosa
## 13          4.8         3.0          1.4         0.1  setosa
## 14          4.3         3.0          1.1         0.1  setosa
## 15          5.8         4.0          1.2         0.2  setosa
## 16          5.7         4.4          1.5         0.4  setosa
## 17          5.4         3.9          1.3         0.4  setosa
## 18          5.1         3.5          1.4         0.3  setosa
## 19          5.7         3.8          1.7         0.3  setosa
## 20          5.1         3.8          1.5         0.3  setosa
## 21          5.4         3.4          1.7         0.2  setosa
## 22          5.1         3.7          1.5         0.4  setosa
## 23          4.6         3.6          1.0         0.2  setosa
## 24          5.1         3.3          1.7         0.5  setosa
## 25          4.8         3.4          1.9         0.2  setosa
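Besides head(), an optional quick check of the column types and per-column summary statistics:

# Structure (types) and summary statistics of the data
str(iris_filedata)
summary(iris_filedata)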

Splitting the Data into Training and Testing Sets

This is one of the most important steps and concepts in machine learning. Before we do any meaningful work and learn from the dataset available to us, we need to split it into a training set and a testing set, and sometimes into a third validation set.

TRAINING SET: This is the SEEN DATA, which is used to build and train the model. In classification problems such as this one, the model is trained and assessed using the classification error rate: the percentage of incorrectly classified instances. We use the training set to understand the data, select the appropriate model and determine the model parameters.

TESTING SET: This is the UNSEEN DATA. We build a model because we want to classify new data, and we are chiefly interested in the model's performance (error rate) on this new data, as it is a more realistic estimate of how the model will fit in the real world.

VALIDATION SET: Sometimes a part of the training set is split off into a validation set to help us tune and optimize our models. It can be thought of as a practice testing set used before we finally test the model on the TESTING SET. A small illustrative three-way split is sketched below.
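A minimal sketch of such a three-way split (the 60/20/20 proportions and all object names here are illustrative only; the actual analysis uses the 80/20 caret split shown next):

# Purely illustrative three-way split of iris; these objects are not used later
set.seed(123)
shuffled   <- iris[sample(nrow(iris)), ]
train_part <- shuffled[1:90, ]     # ~60% training
valid_part <- shuffled[91:120, ]   # ~20% validation
test_part  <- shuffled[121:150, ]  # ~20% testing
# Classification error rate on any of these sets = share of misclassified rows,
# e.g. mean(predicted_labels != test_part$Species) once predictions exist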

# Load the Caret package which allows us to partition the data
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
# We use the dataset to create a partition (80% training 20% testing)
index <- createDataPartition(iris_filedata$Species, p=0.80, list=FALSE)
# select 20% of the data for testing
testset <- iris_filedata[-index,]
# select 80% of data to train the models
trainset <- iris_filedata[index,]
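As an optional sanity check on the partition:

# Confirm the split sizes and that the class proportions are preserved
dim(trainset); dim(testset)
table(trainset$Species); table(testset$Species)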

#Visualization and Exploration

# begin by loading the library
library(ggplot2)
## Box plot to understand how the distribution varies by class of flower
par(mfrow=c(1,4))
  for(i in 1:4) {
  boxplot(trainset[,i], main=names(trainset)[i])
}
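caret also provides the featurePlot() convenience wrapper; a possible equivalent view of the same per-attribute distributions (assuming the default lattice backend) is:

# Box plots of each attribute, grouped by species, via caret's featurePlot
featurePlot(x = trainset[, 1:4], y = trainset$Species, plot = "box")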

## Faceting: Producing multiple charts in one plot
library(ggthemes)
facet <- ggplot(data=trainset, aes(Sepal.Length, y=Sepal.Width, color=Species))+
    geom_point(aes(shape=Species), size=1.5) + 
    geom_smooth(method="lm") +
    xlab("Sepal Length") +
    ylab("Sepal Width") +
    ggtitle("Faceting") +
    theme_fivethirtyeight() +
    facet_grid(. ~ Species) # one panel per species, laid out side by side
print(facet)

#The Problem: Given a set of measurements for a flower (columns 1 through 4), can we predict which of the 3 species it belongs to?

We will build and fit a few different models to the training set and try to learn from the trainset data. This is machine learning. We will later use the model to predict the classification for the testset.

#Decision Tree Classifiers

Decision Trees are a widely used family of algorithms for classification as well as regression problems in data mining. Decision Trees classify observations by sorting them down the tree from the root node to a leaf node, which provides the classification for the observation. Each node specifies a test on a particular attribute, and each branch from that node represents one of the possible outcomes of that test. They are a form of supervised learning: a tree is first learnt from the training observations and then used to predict on the test set.

There are many decision tree algorithms available; they vary in how the trees are constructed and grown. Here we will use the simple rpart algorithm to classify our data set and make predictions.

library(rpart)
dtm<- rpart(Species~., trainset, method = "class")
dtm
## n= 120 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 120 80 setosa (0.33333333 0.33333333 0.33333333)  
##   2) Petal.Length< 2.45 40  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.45 80 40 versicolor (0.00000000 0.50000000 0.50000000)  
##     6) Petal.Width< 1.75 43  4 versicolor (0.00000000 0.90697674 0.09302326) *
##     7) Petal.Width>=1.75 37  1 virginica (0.00000000 0.02702703 0.97297297) *
library(rpart.plot)
rpart.plot(dtm)
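If desired, rpart's complexity-parameter table can be inspected and the tree pruned back to the cross-validated optimum; a brief optional sketch (for a small tree like this one, pruning may leave it unchanged):

# Inspect the complexity-parameter (cp) table from rpart's internal cross-validation
printcp(dtm)
# Prune at the cp value with the lowest cross-validated error
best_cp <- dtm$cptable[which.min(dtm$cptable[, "xerror"]), "CP"]
dtm_pruned <- prune(dtm, cp = best_cp)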

p<- predict(dtm, testset, type="class")
table(testset[,5], p)
##             p
##              setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          1         9
library("caret")

confusionMatrix(p, testset$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         1
##   virginica       0          0         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           0.9000
## Specificity                 1.0000            0.9500           1.0000
## Pos Pred Value              1.0000            0.9091           1.0000
## Neg Pred Value              1.0000            1.0000           0.9524
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3000
## Detection Prevalence        0.3333            0.3667           0.3000
## Balanced Accuracy           1.0000            0.9750           0.9500
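The overall accuracy can also be recomputed directly from the predictions, as a quick cross-check of the confusion matrix output:

# Proportion of correctly classified test rows (should agree with the Accuracy above)
mean(p == testset$Species)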

Naive-Bayes Algorithm

#Building a model
#split data into training and test data sets
indxTrain <- createDataPartition(y = iris$Species,p = 0.75,list = FALSE)
training <- iris[indxTrain,]
testing <- iris[-indxTrain,]
 
#Check the class distribution (proportions) in the data
 
prop.table(table(iris$Species)) * 100
## 
##     setosa versicolor  virginica 
##   33.33333   33.33333   33.33333
#create object x which holds the predictor variables and object y which holds the response variable
x = training[,-5]  # all columns except Species (column 5)
y = training$Species
library(e1071)
# Note: caret's 'nb' method also requires the klaR package to be installed
model = train(x,y,'nb',trControl=trainControl(method='cv',number=10))
model
## Naive Bayes 
## 
## 114 samples
##   5 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 102, 103, 103, 102, 102, 103, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  Accuracy  Kappa
##   FALSE      1         1    
##    TRUE      1         1    
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = FALSE
##  and adjust = 1.
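Since e1071 is already loaded, an equivalent model could also be fitted directly with its naiveBayes() function. This is only an alternative sketch (the nb_e1071 name is introduced here and not used later); the caret model above is the one evaluated below.

# Alternative fit using e1071's naiveBayes (not used in the evaluation below)
nb_e1071 <- naiveBayes(Species ~ ., data = training)
head(predict(nb_e1071, newdata = testing))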
#Model Evaluation
#Predict testing set
Predict <- predict(model,newdata = testing )
Predict
##  [1] setosa     setosa     setosa     setosa     setosa     setosa    
##  [7] setosa     setosa     setosa     setosa     setosa     setosa    
## [13] versicolor versicolor versicolor versicolor versicolor versicolor
## [19] versicolor versicolor versicolor versicolor versicolor versicolor
## [25] virginica  virginica  virginica  virginica  virginica  virginica 
## [31] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica
#Get the confusion matrix to see accuracy value and other parameter values
 
confusionMatrix(Predict, testing$Species )
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         12         0
##   virginica       0          0        12
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9026, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Comparison between Decision Tree and Naive-Bayes algorithms

The advantage of using Decision Trees for classification is that they are simple to understand and interpret. However, decision trees have disadvantages: 1) most of the algorithms (such as ID3 and C4.5) require the target attribute to have only discrete values; 2) because decision trees use the “divide and conquer” method, they tend to perform well when a few highly relevant attributes exist, but less so when many complex interactions are present.

Naïve Bayesian classifiers assume that there are no dependencies amongst attributes; this assumption is called class conditional independence. The advantages of Naive Bayes are: it uses a very intuitive technique, and Bayes classifiers, unlike neural networks, do not have several free parameters that must be set, which greatly simplifies the design process.
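Concretely, class conditional independence means the posterior over classes factorises into one likelihood term per attribute (standard Naive Bayes formulation):

P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)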

Since the classifier returns probabilities, it is simpler to apply these results to a wide variety of tasks than if an arbitrary scale were used.

It does not require large amounts of data before learning can begin, and Naive Bayes classifiers are computationally fast when making decisions.

The experiment we carried out shows that Naïve Bayes outperforms the Decision Tree on this dataset.

When comparing Naïve Bayes and the Decision Tree on the IRIS test sets, we find that the accuracy and Kappa value of Naïve Bayes are 1.00 and 1.00 respectively. This is better than the Decision Tree, whose accuracy and Kappa value are 0.9667 and 0.95 respectively. The same figures can be extracted programmatically, as sketched below.
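The overall figures can be pulled out of the two confusionMatrix objects computed above (the cm_dt and cm_nb names below are introduced only for this comparison):

# Side-by-side accuracy and Kappa for the two classifiers on their test sets
cm_dt <- confusionMatrix(p, testset$Species)        # Decision Tree
cm_nb <- confusionMatrix(Predict, testing$Species)  # Naive Bayes
rbind(DecisionTree = cm_dt$overall[c("Accuracy", "Kappa")],
      NaiveBayes   = cm_nb$overall[c("Accuracy", "Kappa")])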

We can also see that some observations are misclassified by the Decision Tree because the petal measurements of the two species ‘versicolor’ and ‘virginica’ are very close to each other and not clearly separable; this is why the model got them wrong.

REFERENCES

  1. Huang, Jin, Jingjing Lu, and Charles X. Ling. “Comparing naive Bayes, decision trees, and SVM with AUC and accuracy.” Third IEEE International Conference on Data Mining. IEEE, 2003.

  2. Ashari, Ahmad, Iman Paryudi, and A. Min Tjoa. “Performance comparison between Naïve Bayes, decision tree and k-nearest neighbor in searching alternative design in an energy simulation tool.” International Journal of Advanced Computer Science and Applications (IJACSA) 4.11 (2013).