Naive Bayes Classifier

Naive Bayes is a supervised learning classification algorithm based on Bayes' theorem. It is a simplification of the exact Bayes classifier that assumes the predictors are conditionally independent of each other given the class. This makes the method well suited to problems where the dimensionality of the inputs is high. It relies on simple conditional probabilities, is computationally inexpensive, and can sometimes outperform more sophisticated models. We will again use the Presidential debate data set for classification and compare the results with those of the K Nearest Neighbor classifier.
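
The independence assumption is what keeps the method cheap. As a brief sketch in standard notation (the symbols C for the class and x_1, ..., x_p for the predictors are introduced here for illustration and do not come from the data set), Bayes' theorem combined with conditional independence gives:

```latex
P(C = c \mid x_1, \dots, x_p) \;\propto\; P(C = c) \prod_{i=1}^{p} P(x_i \mid C = c)
```

The predicted class is the value of c that maximizes the right-hand side, so only one prior and p one-dimensional conditional distributions per class need to be estimated.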

Basic Settings and Data Import

Let’s begin by loading the required libraries and importing the dataset we are going to use for this model.

#- set working directory 
setwd("C:/Users/awani/Desktop")
options(scipen = 999)

#- load required libraries
library(knitr)
library(ggplot2)
library(tidyr)
library(caret)
library(e1071)
library(ROCR)

#- Import Data
data = read.csv("US Presidential Data.csv", stringsAsFactors = FALSE)

Exploratory Analysis

Before we proceed any further, it is essential to perform a basic exploratory data analysis. Since both outcomes are fairly well represented in the data (595 losses against 929 wins), oversampling won’t be required. There is also nothing unusual in the distribution plots of the predictors.

#dependent variable
kable(table(data$Win.Loss),
      col.names = c("Win", "Frequency"), align = 'l')

Win	Frequency
0	595
1	929
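
To put a number on the class balance, the counts in the table above can be converted to proportions. This is a small sketch that re-enters the two counts by hand rather than reading the CSV; the names `win_counts` and `win_share` are ours.

```r
# Class counts copied from the frequency table above
win_counts <- c("0" = 595, "1" = 929)

# Share of each outcome in the full data set
win_share <- prop.table(win_counts)
round(win_share, 3)
```

Roughly 39% of the debates are losses and 61% are wins, so a classifier that always predicts a win would already be right about 61% of the time. That is the baseline any model here has to beat.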

# independent variables
ggplot(gather(data[,2:ncol(data)]), aes(value)) + 
  geom_histogram(bins = 5, fill = "blue", alpha = 0.6) + 
  facet_wrap(~key, scales = 'free_x')

Data Manipulation

We now need to ensure our data is in the correct format and split it into training and test datasets. We have used a 60-40 split here.

str(data)

## 'data.frame':    1524 obs. of  14 variables:
##  $ Win.Loss      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Optimism      : num  0.105 0.115 0.113 0.107 0.106 ...
##  $ Pessimism     : num  0.0505 0.0592 0.0493 0.0463 0.0517 ...
##  $ PastUsed      : num  0.438 0.291 0.416 0.463 0.334 ...
##  $ FutureUsed    : num  0.495 0.621 0.517 0.467 0.582 ...
##  $ PresentUsed   : num  0.067 0.0874 0.0672 0.0698 0.0836 ...
##  $ OwnPartyCount : int  2 1 1 1 3 0 6 2 2 1 ...
##  $ OppPartyCount : int  2 4 1 3 4 0 4 4 5 2 ...
##  $ NumericContent: num  0.00188 0.00142 0.00213 0.00187 0.00223 ...
##  $ Extra         : num  4.04 3.45 3.46 4.2 4.66 ...
##  $ Emoti         : num  4.05 3.63 4.04 4.66 4.02 ...
##  $ Agree         : num  3.47 3.53 3.28 4.01 3.28 ...
##  $ Consc         : num  2.45 2.4 2.16 2.8 2.42 ...
##  $ Openn         : num  2.55 2.83 2.46 3.07 2.84 ...

data$Win.Loss = as.factor(data$Win.Loss)


#- split data in training and test set.
Index = sample(1:nrow(data), size = round(0.6*nrow(data)), replace=FALSE)
train = data[Index ,]
test = data[-Index ,]

rm(Index)
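
One caveat: because `sample()` draws at random, every run produces a different split, and therefore slightly different model output and accuracy figures. A minimal sketch of a reproducible version follows. It assumes an arbitrary seed value of 42 and a synthetic stand-in data frame `toy` with the same class counts as the real data; neither of these, nor the helper `split_rows`, is part of the original analysis.

```r
# Synthetic stand-in for the debate data: same number of rows and class counts
toy <- data.frame(Win.Loss = factor(rep(c(0, 1), times = c(595, 929))))

# Hypothetical helper: draw the 60% training indices after fixing the seed
split_rows <- function(df, seed) {
  set.seed(seed)  # fixing the seed makes sample() return the same rows each time
  sample(1:nrow(df), size = round(0.6 * nrow(df)), replace = FALSE)
}

idx_a <- split_rows(toy, seed = 42)
idx_b <- split_rows(toy, seed = 42)

identical(idx_a, idx_b)    # same seed, same split
length(idx_a)              # rows in the training set
nrow(toy) - length(idx_a)  # rows in the test set
```

With 1524 rows this gives 914 training rows and 610 test rows, which matches the total of the confusion matrix reported later. In the tutorial code, the equivalent change would be a single `set.seed()` call placed before the `sample()` line.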

Model Training and Prediction

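Training the Naive Bayes model could not be simpler: just call naiveBayes() from the e1071 package. Once the model is trained, we use it to predict the outcome of each debate in the test data set. For numeric predictors, each table under "Conditional probabilities" in the printed model gives the per-class mean in column [,1] and the per-class standard deviation in column [,2].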

NBClassifier = naiveBayes(Win.Loss ~., data = train)
NBClassifier

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##         0         1 
## 0.3971554 0.6028446 
## 
## Conditional probabilities:
##    Optimism
## Y        [,1]       [,2]
##   0 0.1255548 0.03205499
##   1 0.1182657 0.03705627
## 
##    Pessimism
## Y         [,1]       [,2]
##   0 0.06972860 0.02420495
##   1 0.05437613 0.02387045
## 
##    PastUsed
## Y        [,1]      [,2]
##   0 0.3804478 0.1156789
##   1 0.3873053 0.1392698
## 
##    FutureUsed
## Y        [,1]      [,2]
##   0 0.4114288 0.1077240
##   1 0.4411565 0.1500569
## 
##    PresentUsed
## Y        [,1]       [,2]
##   0 0.2081234 0.09169307
##   1 0.1715382 0.10223423
## 
##    OwnPartyCount
## Y       [,1]      [,2]
##   0 3.537190 12.532196
##   1 3.163339  9.508207
## 
##    OppPartyCount
## Y       [,1]     [,2]
##   0 3.209366 5.264309
##   1 3.709619 4.929602
## 
##    NumericContent
## Y          [,1]        [,2]
##   0 0.001584887 0.001109454
##   1 0.002485811 0.001847611
## 
##    Extra
## Y       [,1]     [,2]
##   0 3.909000 1.039050
##   1 3.524283 1.040212
## 
##    Emoti
## Y       [,1]      [,2]
##   0 3.152339 0.8284708
##   1 3.202405 0.7814007
## 
##    Agree
## Y       [,1]      [,2]
##   0 3.759741 0.6501189
##   1 3.657318 0.6861890
## 
##    Consc
## Y       [,1]      [,2]
##   0 3.783562 0.9715516
##   1 3.435554 0.9971333
## 
##    Openn
## Y       [,1]      [,2]
##   0 3.731763 0.9054545
##   1 3.737779 0.9873256

# Predict using Naive Bayes
test$predicted = predict(NBClassifier,test)
test$actual = test$Win.Loss
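
To make the printed model less of a black box: for numeric predictors, `naiveBayes` stores a per-class mean (column `[,1]`) and standard deviation (column `[,2]`) and scores new observations with a normal density. The sketch below reproduces that calculation by hand for just two of the thirteen predictors, using parameters copied from the output above. The observation `x` is hypothetical, and the resulting posterior will not match `predict()` on the full model, which uses all predictors.

```r
# Priors and Gaussian parameters copied from the printed model above
prior <- c("0" = 0.3971554, "1" = 0.6028446)
params <- list(
  Pessimism = list(
    mean = c("0" = 0.06972860, "1" = 0.05437613),
    sd   = c("0" = 0.02420495, "1" = 0.02387045)
  ),
  NumericContent = list(
    mean = c("0" = 0.001584887, "1" = 0.002485811),
    sd   = c("0" = 0.001109454, "1" = 0.001847611)
  )
)

# Hypothetical observation (values chosen for illustration only)
x <- c(Pessimism = 0.05, NumericContent = 0.002)

# Work in log space: log prior plus the sum of per-predictor log densities
log_post <- log(prior)
for (v in names(x)) {
  log_post <- log_post +
    dnorm(x[[v]], mean = params[[v]]$mean, sd = params[[v]]$sd, log = TRUE)
}

# Normalize so the two class probabilities sum to 1
posterior <- exp(log_post - max(log_post))
posterior <- posterior / sum(posterior)
posterior
```

Summing log densities instead of multiplying raw densities avoids numerical underflow when many predictors are involved.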

Confusion Matrix

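The confusion matrix gives us a good understanding of how well the model is doing. In the run shown below, overall accuracy is about 71%, while sensitivity and specificity are 68% and 72% respectively. Note that caret treats class 0 (a loss) as the positive class here, so sensitivity measures how well losses are detected. Because the train/test split is random, these figures vary slightly from run to run. If we compare them with the results we obtained using KNN, KNN comes out as the clear winner in this case.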

confusionMatrix(factor(test$predicted),
                factor(test$actual))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 158 105
##          1  74 273
##                                           
##                Accuracy : 0.7066          
##                  95% CI : (0.6687, 0.7424)
##     No Information Rate : 0.6197          
##     P-Value [Acc > NIR] : 0.000004223     
##                                           
##                   Kappa : 0.3931          
##  Mcnemar's Test P-Value : 0.02494         
##                                           
##             Sensitivity : 0.6810          
##             Specificity : 0.7222          
##          Pos Pred Value : 0.6008          
##          Neg Pred Value : 0.7867          
##              Prevalence : 0.3803          
##          Detection Rate : 0.2590          
##    Detection Prevalence : 0.4311          
##       Balanced Accuracy : 0.7016          
##                                           
##        'Positive' Class : 0               
##
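
As a sanity check on the report above, the headline statistics can be recomputed directly from the four cell counts. This sketch re-enters the counts by hand (the object name `cm` and the variable names are ours) and treats class 0 as the positive class, as caret does in this output.

```r
# Cell counts copied from the confusion matrix above (filled column by column)
cm <- matrix(c(158, 74, 105, 273), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)         # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])   # actual 0s predicted as 0
specificity <- cm["1", "1"] / sum(cm[, "1"])   # actual 1s predicted as 1
ppv         <- cm["0", "0"] / sum(cm["0", ])   # predicted 0s that are truly 0
balanced    <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, balanced = balanced), 4)
```

The values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value and Balanced Accuracy lines in the caret report.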