The United States Congressional Voting Records 1984 data set is retrieved from the UCI Repository of Machine Learning Databases via the mlbench package. The data contain 435 observations of 17 variables: one class variable (democrat or republican) and 16 votes (yes or no) on different topics.
library(e1071)
library(mlbench)
library(ggplot2)
library(caret)
data(HouseVotes84)
str(HouseVotes84)
## 'data.frame': 435 obs. of 17 variables:
## $ Class: Factor w/ 2 levels "democrat","republican": 2 2 1 1 1 1 1 2 2 1 ...
## $ V1 : Factor w/ 2 levels "n","y": 1 1 NA 1 2 1 1 1 1 2 ...
## $ V2 : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
## $ V3 : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 1 1 1 2 ...
## $ V4 : Factor w/ 2 levels "n","y": 2 2 NA 1 1 1 2 2 2 1 ...
## $ V5 : Factor w/ 2 levels "n","y": 2 2 2 NA 2 2 2 2 2 1 ...
## $ V6 : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 1 ...
## $ V7 : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
## $ V8 : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
## $ V9 : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
## $ V10 : Factor w/ 2 levels "n","y": 2 1 1 1 1 1 1 1 1 1 ...
## $ V11 : Factor w/ 2 levels "n","y": NA 1 2 2 2 1 1 1 1 1 ...
## $ V12 : Factor w/ 2 levels "n","y": 2 2 1 1 NA 1 1 1 2 1 ...
## $ V13 : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 NA 2 2 1 ...
## $ V14 : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 2 2 1 ...
## $ V15 : Factor w/ 2 levels "n","y": 1 1 1 1 2 2 2 NA 1 NA ...
## $ V16 : Factor w/ 2 levels "n","y": 2 NA 1 2 2 2 2 2 2 NA ...
summary(HouseVotes84$Class)
## democrat republican
## 267 168
summary(HouseVotes84)
## Class V1 V2 V3 V4 V5
## democrat :267 n :236 n :192 n :171 n :247 n :208
## republican:168 y :187 y :195 y :253 y :177 y :212
## NA's: 12 NA's: 48 NA's: 11 NA's: 11 NA's: 15
## V6 V7 V8 V9 V10 V11
## n :152 n :182 n :178 n :206 n :212 n :264
## y :272 y :239 y :242 y :207 y :216 y :150
## NA's: 11 NA's: 14 NA's: 15 NA's: 22 NA's: 7 NA's: 21
## V12 V13 V14 V15 V16
## n :233 n :201 n :170 n :233 n : 62
## y :171 y :209 y :248 y :174 y :269
## NA's: 31 NA's: 25 NA's: 17 NA's: 28 NA's:104
The summary shows missing values in every vote column. First we will remove the rows that contain NA values.
# Is there missing data?
head(is.na(HouseVotes84))
## Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## V12 V13 V14 V15 V16
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE TRUE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5 TRUE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
CleanDataset <- na.omit(HouseVotes84)
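Dropping incomplete rows can discard a large share of the data, so it may help to quantify the loss before committing to `na.omit`. The snippet below is a minimal sketch using only base R functions on the same `HouseVotes84` data; the variable names `na_per_column` and `n_complete` are illustrative and not part of the original analysis.

```r
library(mlbench)
data(HouseVotes84)

# Number of missing values in each column (matches the NA's rows of summary())
na_per_column <- colSums(is.na(HouseVotes84))
print(na_per_column)

# Number of rows with no missing value at all; these are the rows na.omit() keeps
n_complete <- sum(complete.cases(HouseVotes84))
print(n_complete)

# Share of the original observations that survive the cleaning step
print(n_complete / nrow(HouseVotes84))
```

If the retained share is small, imputing the missing votes or treating NA as its own level would be alternatives worth considering.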
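# Bar chart of the class counts after cleaning (qplot() is deprecated in recent ggplot2 releases)
ggplot(CleanDataset, aes(x = Class)) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 1))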
set.seed(20)
# Stratified sampling: select the training rows using the Class variable as strata
TrainingDataIndex <- createDataPartition(CleanDataset$Class, p = 0.75, list = FALSE)
# Create the training data as the subset of CleanDataset with the row indices identified above, keeping all columns
trainingData <- CleanDataset[TrainingDataIndex, ]
# Everything not in the training set is test data. Note the - (minus) sign
testData <- CleanDataset[-TrainingDataIndex, ]
# also save the labels
vote_train_labels <- trainingData$Class
vote_test_labels <- testData$Class
# check that the class proportions are similar in the training and test sets
prop.table(table(vote_train_labels))
## vote_train_labels
## democrat republican
## 0.5344828 0.4655172
prop.table(table(vote_test_labels))
## vote_test_labels
## democrat republican
## 0.5344828 0.4655172
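The identical proportions in both sets follow from sampling within each class separately. The sketch below illustrates that idea in base R; it is not caret's implementation of `createDataPartition`, and the helper `stratified_index` is hypothetical. The class counts of 124 and 108 are an assumption inferred from the proportions and the 58-row test set reported in this section.

```r
# Hypothetical helper: draw a fraction p of the indices from every class
stratified_index <- function(labels, p) {
  idx_by_class <- split(seq_along(labels), labels)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(p * length(idx))      # rows to take from this class
    idx[sample.int(length(idx), n_take)]    # random rows within the class
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(20)
# Assumed class counts for the cleaned data (inferred, not stated in the text)
labels <- factor(rep(c("democrat", "republican"), times = c(124, 108)))

train_idx <- stratified_index(labels, 0.75)

# Class shares in the training part and in the held-out part
print(prop.table(table(labels[train_idx])))
print(prop.table(table(labels[-train_idx])))
```

Because each class contributes the same fraction of its rows, the class shares of the two parts can differ only through rounding.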
# Train the classifier (note: trainingData still includes the Class column as a predictor)
vote_classifier <- naiveBayes(trainingData, vote_train_labels)
We will store the Naive Bayes predictions in vote_test_pred and then compare them with the true labels.
vote_test_pred <- predict(vote_classifier, testData)
head(vote_test_pred)
## [1] democrat democrat republican republican democrat democrat
## Levels: democrat republican
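Since `trainingData` contains the `Class` column, the `naiveBayes(trainingData, vote_train_labels)` call passes the label to the model as one of its predictors. A variant that keeps the label out of the feature set can be sketched with the formula interface of `e1071::naiveBayes`. This is an illustrative alternative, not the model that produced the outputs in this section; it uses a plain random split via `sample` instead of `createDataPartition` to keep the dependencies small, and the names `clean`, `train`, `test`, `fit` and `pred` are assumptions.

```r
library(e1071)
library(mlbench)
data(HouseVotes84)

clean <- na.omit(HouseVotes84)

set.seed(20)
# Simple (non-stratified) 75/25 split, used here only for illustration
train_rows <- sample(nrow(clean), size = round(0.75 * nrow(clean)))
train <- clean[train_rows, ]
test  <- clean[-train_rows, ]

# Class ~ . uses every column except Class as a predictor
fit <- naiveBayes(Class ~ ., data = train)

# Predict from the vote columns only, then compare with the true labels
pred <- predict(fit, newdata = test[, names(test) != "Class"])
print(mean(pred == test$Class))
```

The accuracy printed by this variant need not match the figures reported for the original model, because both the split and the feature set differ.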
library(gmodels)
CrossTable(vote_test_pred, vote_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 58
##
##
## | actual
## predicted | democrat | republican | Row Total |
## -------------|------------|------------|------------|
## democrat | 30 | 1 | 31 |
## | 0.968 | 0.037 | |
## -------------|------------|------------|------------|
## republican | 1 | 26 | 27 |
## | 0.032 | 0.963 | |
## -------------|------------|------------|------------|
## Column Total | 31 | 27 | 58 |
## | 0.534 | 0.466 | |
## -------------|------------|------------|------------|
##
##
The overall accuracy of the model is 0.966 (56 of the 58 test observations are classified correctly). One democrat is misclassified as republican and one republican is misclassified as democrat.
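The accuracy can be recomputed directly from the four cell counts in the table above. The following base R sketch rebuilds the confusion matrix by hand (the object name `conf` is illustrative) and derives the accuracy along with per-class recall and precision.

```r
# Confusion matrix from the CrossTable output: rows = predicted, columns = actual
conf <- matrix(c(30, 1, 1, 26), nrow = 2,
               dimnames = list(predicted = c("democrat", "republican"),
                               actual    = c("democrat", "republican")))

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(conf)) / sum(conf)

# Recall per class: correct predictions over the column (actual) totals
recall <- diag(conf) / colSums(conf)

# Precision per class: correct predictions over the row (predicted) totals
precision <- diag(conf) / rowSums(conf)

print(round(accuracy, 3))   # 0.966
print(round(recall, 3))     # democrat 0.968, republican 0.963
print(round(precision, 3))
```

The recall values correspond to the "N / Col Total" entries on the diagonal of the table.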
Next we set the laplace parameter of the model to check whether it improves the accuracy.
The confusion matrix below shows that the accuracy remained the same; Laplace smoothing did not help the model perform better.
vote_classifier2 <- naiveBayes(trainingData, vote_train_labels, laplace = 1)
vote_test_pred2 <- predict(vote_classifier2, testData)
CrossTable(vote_test_pred2, vote_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Col Total |
## |-------------------------|
##
##
## Total Observations in Table: 58
##
##
## | actual
## predicted | democrat | republican | Row Total |
## -------------|------------|------------|------------|
## democrat | 30 | 1 | 31 |
## | 0.968 | 0.037 | |
## -------------|------------|------------|------------|
## republican | 1 | 26 | 27 |
## | 0.032 | 0.963 | |
## -------------|------------|------------|------------|
## Column Total | 31 | 27 | 58 |
## | 0.534 | 0.466 | |
## -------------|------------|------------|------------|
##
##
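To see why the smoothing made no difference here, it helps to look at what additive (Laplace) smoothing does to a conditional probability table. The sketch below applies the usual add-k formula, (count + k) / (total + k * number of levels), which is assumed to correspond to what the `laplace` argument controls. The helper `smoothed_prob` and the vote counts are hypothetical and are not taken from the data.

```r
# Hypothetical helper: additive smoothing of category counts within one class
smoothed_prob <- function(counts, laplace = 0) {
  (counts + laplace) / (sum(counts) + laplace * length(counts))
}

# Made-up counts: no member of this class voted "n" on some bill
counts <- c(n = 0, y = 81)

# Without smoothing the "n" probability is exactly zero
print(smoothed_prob(counts))

# With laplace = 1 the zero is replaced by a small positive value (1/83)
print(smoothed_prob(counts, laplace = 1))

# With balanced, non-zero counts the smoothing barely moves the estimates
print(smoothed_prob(c(n = 40, y = 41)))
print(smoothed_prob(c(n = 40, y = 41), laplace = 1))
```

Smoothing mainly matters when some category is never observed within a class. If every vote value occurs in both parties in the training data, the estimates change only slightly, which would be consistent with the unchanged confusion matrix.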