1. Collecting Data

The United States Congressional Voting Records 1984 data set is retrieved from the UCI Repository of Machine Learning Databases via the mlbench package. The data includes 435 observations with 17 variables: 1 class variable (democrat, republican) and 16 votes (yes, no) on different topics.

2. Exploring and Preparing the Data

library(e1071)
library(mlbench)
library(ggplot2)
library(caret)
data(HouseVotes84)
str(HouseVotes84)
## 'data.frame':    435 obs. of  17 variables:
##  $ Class: Factor w/ 2 levels "democrat","republican": 2 2 1 1 1 1 1 2 2 1 ...
##  $ V1   : Factor w/ 2 levels "n","y": 1 1 NA 1 2 1 1 1 1 2 ...
##  $ V2   : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V3   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 1 1 1 2 ...
##  $ V4   : Factor w/ 2 levels "n","y": 2 2 NA 1 1 1 2 2 2 1 ...
##  $ V5   : Factor w/ 2 levels "n","y": 2 2 2 NA 2 2 2 2 2 1 ...
##  $ V6   : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ V7   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V8   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V9   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V10  : Factor w/ 2 levels "n","y": 2 1 1 1 1 1 1 1 1 1 ...
##  $ V11  : Factor w/ 2 levels "n","y": NA 1 2 2 2 1 1 1 1 1 ...
##  $ V12  : Factor w/ 2 levels "n","y": 2 2 1 1 NA 1 1 1 2 1 ...
##  $ V13  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 NA 2 2 1 ...
##  $ V14  : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 2 2 1 ...
##  $ V15  : Factor w/ 2 levels "n","y": 1 1 1 1 2 2 2 NA 1 NA ...
##  $ V16  : Factor w/ 2 levels "n","y": 2 NA 1 2 2 2 2 2 2 NA ...
summary(HouseVotes84$Class)
##   democrat republican 
##        267        168
summary(HouseVotes84)
##         Class        V1         V2         V3         V4         V5     
##  democrat  :267   n   :236   n   :192   n   :171   n   :247   n   :208  
##  republican:168   y   :187   y   :195   y   :253   y   :177   y   :212  
##                   NA's: 12   NA's: 48   NA's: 11   NA's: 11   NA's: 15  
##     V6         V7         V8         V9        V10        V11     
##  n   :152   n   :182   n   :178   n   :206   n   :212   n   :264  
##  y   :272   y   :239   y   :242   y   :207   y   :216   y   :150  
##  NA's: 11   NA's: 14   NA's: 15   NA's: 22   NA's:  7   NA's: 21  
##    V12        V13        V14        V15        V16     
##  n   :233   n   :201   n   :170   n   :233   n   : 62  
##  y   :171   y   :209   y   :248   y   :174   y   :269  
##  NA's: 31   NA's: 25   NA's: 17   NA's: 28   NA's:104

We see that there are missing values in the dataset. As a first step we will remove the rows containing NA values.

# Is there missing data?
head(is.na(HouseVotes84))
##   Class    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##     V12   V13   V14   V15   V16
## 1 FALSE FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE  TRUE
## 3 FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE
## 5  TRUE FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE FALSE
CleanDataset <- na.omit(HouseVotes84)
qplot(Class, data=CleanDataset, geom = "bar") + theme(axis.text.x = element_text(angle = 90, hjust = 1))
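
Dropping incomplete rows is the simplest option, though it discards a fair amount of data (V16 alone has 104 missing values). Counting the missing values per column and the complete rows that survive na.omit() is a useful sanity check; a minimal sketch using base R:

# Missing values per column (should match the NA counts in the summary above)
colSums(is.na(HouseVotes84))

# Number of complete rows kept by na.omit()
nrow(CleanDataset)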

Creating test and training datasets
set.seed(20)
# Let's do stratified sampling: select rows based on the Class variable as strata
TrainingDataIndex <- createDataPartition(CleanDataset$Class, p=0.75, list = FALSE)

# Create training data as the subset of the cleaned dataset with the row index numbers identified above and all columns
trainingData <- CleanDataset[TrainingDataIndex,]

# Everything else not in training is test data. Note the - (minus) sign
testData <- CleanDataset[-TrainingDataIndex,]

# also save the labels
vote_train_labels <- trainingData$Class
vote_test_labels  <- testData$Class

# check that the class proportions are similar in both sets
prop.table(table(vote_train_labels))
## vote_train_labels
##   democrat republican 
##  0.5344828  0.4655172
prop.table(table(vote_test_labels))
## vote_test_labels
##   democrat republican 
##  0.5344828  0.4655172

3. Training Model on the Data

# Drop the Class column so the label itself is not used as a predictor
vote_classifier <- naiveBayes(trainingData[ , -1], vote_train_labels)
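
The returned object is a simple list: $apriori holds the class counts used for the prior, and $tables holds one conditional probability table per predictor. Printing a couple of these components is a quick check on what the model has learned (a minimal sketch; V1 is just an arbitrary example):

# Class distribution used for the prior probabilities
vote_classifier$apriori

# Conditional probabilities P(V1 | Class), one row per class
vote_classifier$tables$V1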

4. Evaluating Model Performance

We will store the Naive Bayes predictions in vote_test_pred and then compare them with the true labels.

vote_test_pred <- predict(vote_classifier, testData)

head(vote_test_pred)
## [1] democrat   democrat   republican republican democrat   democrat  
## Levels: democrat republican
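
By default predict() returns the most probable class. If you also want to see how confident the classifier is, passing type = "raw" returns the posterior probabilities instead; a minimal sketch:

# Posterior class probabilities for the first few test observations
head(predict(vote_classifier, testData, type = "raw"))
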
library(gmodels)
CrossTable(vote_test_pred, vote_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  58 
## 
##  
##              | actual 
##    predicted |   democrat | republican |  Row Total | 
## -------------|------------|------------|------------|
##     democrat |         30 |          1 |         31 | 
##              |      0.968 |      0.037 |            | 
## -------------|------------|------------|------------|
##   republican |          1 |         26 |         27 | 
##              |      0.032 |      0.963 |            | 
## -------------|------------|------------|------------|
## Column Total |         31 |         27 |         58 | 
##              |      0.534 |      0.466 |            | 
## -------------|------------|------------|------------|
## 
## 

The overall accuracy of the model is 0.966 (56 of 58 test observations classified correctly): one democrat is misclassified as republican and one republican as democrat.
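
Instead of reading the accuracy off the cross table by hand, confusionMatrix() from caret (already loaded above) computes accuracy together with kappa, sensitivity and specificity in one call; a minimal sketch:

# Accuracy, kappa, sensitivity and specificity for the test predictions
confusionMatrix(vote_test_pred, vote_test_labels)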

5. Improving Model Performance

We will add Laplace smoothing via the laplace parameter to check whether it improves the accuracy.
The confusion matrix below shows that the accuracy remains the same; the Laplace parameter did not help the model perform better.

vote_classifier2 <- naiveBayes(trainingData[ , -1], vote_train_labels, laplace = 1)
vote_test_pred2 <- predict(vote_classifier2, testData)
CrossTable(vote_test_pred2, vote_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  58 
## 
##  
##              | actual 
##    predicted |   democrat | republican |  Row Total | 
## -------------|------------|------------|------------|
##     democrat |         30 |          1 |         31 | 
##              |      0.968 |      0.037 |            | 
## -------------|------------|------------|------------|
##   republican |          1 |         26 |         27 | 
##              |      0.032 |      0.963 |            | 
## -------------|------------|------------|------------|
## Column Total |         31 |         27 |         58 | 
##              |      0.534 |      0.466 |            | 
## -------------|------------|------------|------------|
## 
##
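
Laplace smoothing only changes the estimated probabilities for vote/class combinations that are rare or absent in the training data, so with a well-populated table like this one it can easily leave the hard class predictions unchanged. To confirm that no other smoothing value helps either, a small grid search is enough; a minimal sketch (the grid of values is an arbitrary choice):

# Compare test accuracy over a few laplace values
for (lp in c(0, 0.5, 1, 2, 5)) {
  clf  <- naiveBayes(trainingData[ , -1], vote_train_labels, laplace = lp)
  pred <- predict(clf, testData)
  cat("laplace =", lp, "accuracy =", round(mean(pred == vote_test_labels), 3), "\n")
}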