Naive Bayes is a supervised classification algorithm based on Bayes' theorem. It is a simplified version of the exact Bayes classifier that assumes the predictors are conditionally independent of each other given the class. This method is suitable when the dimensionality of the inputs is high. It relies on simple conditional probabilities, is computationally inexpensive, and can sometimes outperform more sophisticated models. We will again use the Presidential debate data set for classification and compare the results with those of the K Nearest Neighbor classifier.
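The independence assumption can be written out explicitly. In the sketch below, the symbols are notation introduced here for illustration: $C_k$ is a class (win or loss), and $x_1, \dots, x_n$ are the predictor values for one debate.

```latex
% Bayes' theorem for class C_k given predictors x_1, ..., x_n
P(C_k \mid x_1, \dots, x_n) = \frac{P(C_k)\, P(x_1, \dots, x_n \mid C_k)}{P(x_1, \dots, x_n)}

% Naive assumption: predictors are conditionally independent given the class
P(x_1, \dots, x_n \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k)

% The denominator is the same for every class, so the predicted class is
\hat{C} = \arg\max_{k} \; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

Because each $P(x_i \mid C_k)$ is estimated separately, the number of quantities to estimate grows only linearly with the number of predictors, which is why the method copes well with high-dimensional inputs.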
Let’s begin by loading the required libraries and importing the dataset we are going to use for this model.
#- set working directory
setwd("C:/Users/awani/Desktop")
options(scipen = 999)
#- load required libraries
library(knitr)
library(ggplot2)
library(tidyr)
library(caret)
library(e1071)
library(ROCR)
#- Import Data
data = read.csv("US Presidential Data.csv", stringsAsFactors = FALSE)
Before we proceed any further, it is essential to perform a basic exploratory data analysis. Since both outcomes are fairly well represented in the data (595 observations of class 0 and 929 of class 1), oversampling won’t be required. There is also nothing unusual in the distribution plots.
#dependent variable
kable(table(data$Win.Loss),
col.names = c("Win", "Frequency"), align = 'l')
| Win | Frequency |
|---|---|
| 0 | 595 |
| 1 | 929 |
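To put the class balance in relative terms, the counts in the table can be converted to shares. This is a small sketch that hard-codes the two counts shown above rather than reading the CSV; the variable names `counts` and `shares` are illustrative.

```r
# Class counts copied from the frequency table above
counts = c("0" = 595, "1" = 929)

# Share of each class in the full data set
shares = counts / sum(counts)
print(round(shares, 2))
```

Roughly 39% of the debates fall in class 0 and 61% in class 1, so neither class is rare.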
# independent variables
ggplot(gather(data[,2:ncol(data)]), aes(value)) +
geom_histogram(bins = 5, fill = "blue", alpha = 0.6) +
facet_wrap(~key, scales = 'free_x')
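We now need to ensure our data is in the correct format and split it into training and validation datasets. We have used a 60-40 split here. Because the split is drawn at random and no seed is set, the exact numbers will vary from run to run.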
str(data)
## 'data.frame': 1524 obs. of 14 variables:
## $ Win.Loss : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Optimism : num 0.105 0.115 0.113 0.107 0.106 ...
## $ Pessimism : num 0.0505 0.0592 0.0493 0.0463 0.0517 ...
## $ PastUsed : num 0.438 0.291 0.416 0.463 0.334 ...
## $ FutureUsed : num 0.495 0.621 0.517 0.467 0.582 ...
## $ PresentUsed : num 0.067 0.0874 0.0672 0.0698 0.0836 ...
## $ OwnPartyCount : int 2 1 1 1 3 0 6 2 2 1 ...
## $ OppPartyCount : int 2 4 1 3 4 0 4 4 5 2 ...
## $ NumericContent: num 0.00188 0.00142 0.00213 0.00187 0.00223 ...
## $ Extra : num 4.04 3.45 3.46 4.2 4.66 ...
## $ Emoti : num 4.05 3.63 4.04 4.66 4.02 ...
## $ Agree : num 3.47 3.53 3.28 4.01 3.28 ...
## $ Consc : num 2.45 2.4 2.16 2.8 2.42 ...
## $ Openn : num 2.55 2.83 2.46 3.07 2.84 ...
data$Win.Loss = as.factor(data$Win.Loss)
#- split data in training and test set.
Index = sample(1:nrow(data), size = round(0.6*nrow(data)), replace=FALSE)
train = data[Index ,]
test = data[-Index ,]
rm(Index)
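If a repeatable split is wanted, one option is to fix the random seed before sampling. The sketch below uses a small made-up data frame (`toy`) in place of the debate data, and the seed value 42 is arbitrary; neither comes from the original analysis.

```r
# Fix the seed so the same rows are sampled on every run (42 is an arbitrary choice)
set.seed(42)

# Hypothetical stand-in for the debate data: 100 rows, a binary outcome, one predictor
toy = data.frame(Win.Loss = factor(rep(c(0, 1), times = c(40, 60))),
                 x = rnorm(100))

# Same 60-40 logic as above: sample 60% of the row indices without replacement
idx = sample(seq_len(nrow(toy)), size = round(0.6 * nrow(toy)), replace = FALSE)
toy_train = toy[idx, ]
toy_test = toy[-idx, ]

print(c(train = nrow(toy_train), test = nrow(toy_test)))
```

Rerunning the block with the same seed selects the same rows, so downstream results such as the confusion matrix stay comparable between runs.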
Training the Naive Bayes model could not be simpler: just use the naiveBayes() function that comes in the e1071 package. Once the model is trained, we use it to predict the outcome of each debate in the test data set.
NBClassifier = naiveBayes(Win.Loss ~., data = train)
NBClassifier
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.3971554 0.6028446
##
## Conditional probabilities:
## Optimism
## Y [,1] [,2]
## 0 0.1255548 0.03205499
## 1 0.1182657 0.03705627
##
## Pessimism
## Y [,1] [,2]
## 0 0.06972860 0.02420495
## 1 0.05437613 0.02387045
##
## PastUsed
## Y [,1] [,2]
## 0 0.3804478 0.1156789
## 1 0.3873053 0.1392698
##
## FutureUsed
## Y [,1] [,2]
## 0 0.4114288 0.1077240
## 1 0.4411565 0.1500569
##
## PresentUsed
## Y [,1] [,2]
## 0 0.2081234 0.09169307
## 1 0.1715382 0.10223423
##
## OwnPartyCount
## Y [,1] [,2]
## 0 3.537190 12.532196
## 1 3.163339 9.508207
##
## OppPartyCount
## Y [,1] [,2]
## 0 3.209366 5.264309
## 1 3.709619 4.929602
##
## NumericContent
## Y [,1] [,2]
## 0 0.001584887 0.001109454
## 1 0.002485811 0.001847611
##
## Extra
## Y [,1] [,2]
## 0 3.909000 1.039050
## 1 3.524283 1.040212
##
## Emoti
## Y [,1] [,2]
## 0 3.152339 0.8284708
## 1 3.202405 0.7814007
##
## Agree
## Y [,1] [,2]
## 0 3.759741 0.6501189
## 1 3.657318 0.6861890
##
## Consc
## Y [,1] [,2]
## 0 3.783562 0.9715516
## 1 3.435554 0.9971333
##
## Openn
## Y [,1] [,2]
## 0 3.731763 0.9054545
## 1 3.737779 0.9873256
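For numeric predictors, the two columns in each table above are the per-class mean ([,1]) and standard deviation ([,2]), and the model treats each predictor as Gaussian within a class. The sketch below shows how those numbers turn into a posterior, using only the Pessimism table and the a-priori probabilities printed above. The input value `x = 0.05` is a hypothetical score chosen for illustration, and the full model would multiply the likelihoods of all 13 predictors rather than just one.

```r
# Values copied from the printed model: priors, and Pessimism mean / sd per class
prior = c("0" = 0.3971554, "1" = 0.6028446)
pess_mean = c("0" = 0.06972860, "1" = 0.05437613)
pess_sd = c("0" = 0.02420495, "1" = 0.02387045)

x = 0.05  # hypothetical Pessimism score for a new debate

# Gaussian likelihood of x under each class
likelihood = dnorm(x, mean = pess_mean, sd = pess_sd)

# Prior times likelihood, then normalise so the two posteriors sum to 1
unnorm = prior * likelihood
posterior = unnorm / sum(unnorm)
print(round(posterior, 3))
```

With this single predictor, class 1 receives the higher posterior, since 0.05 lies closer to the class 1 mean and class 1 also has the larger prior.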
# Predict using Naive Bayes
test$predicted = predict(NBClassifier,test)
test$actual = test$Win.Loss
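By default, predict() returns class labels. When the posterior probabilities themselves are of interest, the e1071 predict method also accepts type = "raw". The sketch below demonstrates this on the built-in iris data as a stand-in, since the debate CSV is not bundled here; the object names `m` and `p` are illustrative.

```r
library(e1071)

# Stand-in example on iris; the same call pattern applies to NBClassifier and test
m = naiveBayes(Species ~ ., data = iris)

# type = "raw" returns one column of posterior probabilities per class
p = predict(m, iris[1:5, ], type = "raw")
print(round(p, 3))
```

Each row of the result sums to 1, and the predicted class is the column with the largest value.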
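The confusion matrix gives us a good understanding of how well the model is doing. Overall accuracy is about 71%, while sensitivity and specificity are 68% and 72% respectively. Note that class 0 is treated as the 'positive' class here, so sensitivity is the share of class 0 debates that were correctly identified. If we compare this with the results we obtained using KNN, KNN comes out as the clear winner in this case.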
confusionMatrix(factor(test$predicted),
factor(test$actual))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 158 105
## 1 74 273
##
## Accuracy : 0.7066
## 95% CI : (0.6687, 0.7424)
## No Information Rate : 0.6197
## P-Value [Acc > NIR] : 0.000004223
##
## Kappa : 0.3931
## Mcnemar's Test P-Value : 0.02494
##
## Sensitivity : 0.6810
## Specificity : 0.7222
## Pos Pred Value : 0.6008
## Neg Pred Value : 0.7867
## Prevalence : 0.3803
## Detection Rate : 0.2590
## Detection Prevalence : 0.4311
## Balanced Accuracy : 0.7016
##
## 'Positive' Class : 0
##
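The headline statistics can be reproduced by hand from the four cell counts, which makes it clear what each one measures. The sketch below hard-codes the confusion matrix printed above and treats class 0 as the positive class, as caret does in this output; the variable names are illustrative.

```r
# Confusion matrix from the output above: rows are predictions, columns are actual classes
cm = matrix(c(158, 74, 105, 273), nrow = 2,
            dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correctly classified cases over all cases
accuracy = sum(diag(cm)) / sum(cm)

# Sensitivity: actual class 0 cases that were predicted as 0
sensitivity = cm["0", "0"] / sum(cm[, "0"])

# Specificity: actual class 1 cases that were predicted as 1
specificity = cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

The three values agree with the Accuracy, Sensitivity, and Specificity lines reported by confusionMatrix().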