The data come from the UCI Repository of Machine Learning Databases and are also available in R (loaded below from the mlbench package). In the HouseVotes84 dataset, each vote is recorded as “Yes” or “No” per cell. Some cells are missing values, meaning the vote was neither a “Yes” nor a “No”.
The data contain 435 observations in rows, one per House representative, and 17 features/variables in columns: the “Class” label (party) plus 16 bills on which each representative may have voted differently.
library(e1071)
library(mlbench)
data("HouseVotes84")
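Before modelling, it can help to confirm the dimensions and see how many votes are missing. The following is a minimal sketch using only base R functions on the loaded data; the variable name na_per_column is introduced here for illustration.

```r
library(mlbench)
data("HouseVotes84")

# Rows are representatives; columns are Class plus the individual votes
dim(HouseVotes84)

# Number of missing votes (NA) in each column
na_per_column <- colSums(is.na(HouseVotes84))
na_per_column

# Share of representatives with at least one missing vote
mean(!complete.cases(HouseVotes84))
```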
It is good practice to shuffle the dataset by row so that observations with the same class level are not clustered together.
HouseVotes84 = HouseVotes84[sample(nrow(HouseVotes84)),]
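Because sample() is random, every run produces a different shuffle and therefore different results downstream. A possible way to make the shuffle repeatable is to fix the random seed first. The seed value 123 and the names shuffled_a and shuffled_b are arbitrary choices for this sketch, not part of the original analysis.

```r
library(mlbench)
data("HouseVotes84")

set.seed(123)  # arbitrary seed, chosen only for illustration
shuffled_a <- HouseVotes84[sample(nrow(HouseVotes84)), ]

set.seed(123)  # same seed again gives the same row order
shuffled_b <- HouseVotes84[sample(nrow(HouseVotes84)), ]

# TRUE when both shuffles produced identical data frames
identical(shuffled_a, shuffled_b)
```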
The whole dataset is split by rows, keeping all columns: about 75% of the rows (326) form the training dataset and the remaining 25% (109) form the test dataset.
house_vote_train <- HouseVotes84[1:326, ]
house_vote_test <- HouseVotes84[327:435, ]
Similarly, the “Class” feature will be used as the target label for classification. We split the “Class” vector the same way, taking about 75% of it as the training labels and the remaining 25% as the test labels. We can also use prop.table() to convert the class table from counts into proportions and check whether the fractions of the two class levels are similar across the training and test labels.
train_labels <- HouseVotes84[1:326, ]$Class
test_labels <- HouseVotes84[327:435, ]$Class
prop.table(table(train_labels))
train_labels
democrat republican
0.5828221 0.4171779
prop.table(table(test_labels))
test_labels
democrat republican
0.706422 0.293578
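In the output above the democrat share is about 58% in the training labels and about 71% in the test labels, so a plain row split does not guarantee matching proportions. One alternative is a stratified split, which samples 75% within each class separately. This is a sketch in base R under that assumption; the seed and the names train_idx, strat_train_labels and strat_test_labels are hypothetical.

```r
library(mlbench)
data("HouseVotes84")

set.seed(123)  # arbitrary seed for illustration

# For each class, sample about 75% of its row indices for training
train_idx <- unlist(lapply(
  split(seq_len(nrow(HouseVotes84)), HouseVotes84$Class),
  function(idx) sample(idx, size = round(0.75 * length(idx)))
), use.names = FALSE)

strat_train_labels <- HouseVotes84$Class[train_idx]
strat_test_labels  <- HouseVotes84$Class[-train_idx]

# The two class proportions should now be close to each other
prop.table(table(strat_train_labels))
prop.table(table(strat_test_labels))
```

The same train_idx can be used to subset the rows of the full data frame, so features and labels stay aligned.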
Using the training dataset and its labels, we can train with naiveBayes() to generate a Naive Bayes model.
house_vote_model <- naiveBayes(house_vote_train, train_labels)
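Note that house_vote_train, as built above, still contains the Class column, so the label itself is passed to naiveBayes() as one of the predictors. A variant that keeps only the vote columns as predictors might look like the sketch below. The seed and the names predictor_cols, x_train, y_train, x_test, y_test, model_no_label and pred_no_label are assumptions for illustration, and the resulting accuracy will generally differ from the figures reported in this walkthrough.

```r
library(e1071)
library(mlbench)
data("HouseVotes84")

set.seed(123)  # arbitrary seed for illustration
shuffled <- HouseVotes84[sample(nrow(HouseVotes84)), ]

# Every column except the label is used as a predictor
predictor_cols <- setdiff(names(shuffled), "Class")

x_train <- shuffled[1:326, predictor_cols]
y_train <- shuffled[1:326, "Class"]
x_test  <- shuffled[327:435, predictor_cols]
y_test  <- shuffled[327:435, "Class"]

# Train on the votes only, then predict the held-out rows
model_no_label <- naiveBayes(x_train, y_train)
pred_no_label  <- predict(model_no_label, x_test)

# Fraction of test rows predicted correctly
mean(pred_no_label == y_test)
```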
Using the model we just generated with the Naive Bayes algorithm, along with the test dataset, we can predict the class level of each observation (row) in the test dataset.
test_pred <- predict(house_vote_model, house_vote_test)
head(test_pred)
[1] democrat democrat republican democrat democrat republican
Levels: democrat republican
Now we can compare the predicted labels with the original test labels and see how accurately the Naive Bayes model has learned from the data.
library(gmodels)
CrossTable(test_pred, test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

Total Observations in Table:  109

             | actual
   predicted |   democrat | republican |  Row Total |
-------------|------------|------------|------------|
    democrat |         75 |          1 |         76 |
             |      0.974 |      0.031 |            |
-------------|------------|------------|------------|
  republican |          2 |         31 |         33 |
             |      0.026 |      0.969 |            |
-------------|------------|------------|------------|
Column Total |         77 |         32 |        109 |
             |      0.706 |      0.294 |            |
-------------|------------|------------|------------|
As it turns out, 2 democrats were misclassified as republican and 1 republican was misclassified as democrat. The accuracy of this Naive Bayes model is 97.2%, which is quite high. Accuracy = 106/109*100 = 97.2%
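The accuracy arithmetic can be checked directly from the four cell counts in the table above. This is a small base R sketch; conf_mat and accuracy are names introduced here.

```r
# Cell counts copied from the CrossTable output (rows: predicted, columns: actual)
conf_mat <- matrix(
  c(75, 2, 1, 31), nrow = 2,
  dimnames = list(predicted = c("democrat", "republican"),
                  actual    = c("democrat", "republican"))
)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
round(100 * accuracy, 1)  # 97.2
```

With the objects from this walkthrough in memory, mean(test_pred == test_labels) should give the same fraction without retyping the counts.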
One way to try to improve model performance is Laplace smoothing: a small count is added to every feature-value count within each class, so that a zero probability for one feature cannot drastically overrule the evidence of the others. Using laplace = 1 here, however, did not improve the model: accuracy dropped slightly to 96.3% (105/109*100).
house_vote_model2 <- naiveBayes(house_vote_train, train_labels, laplace = 1)
test_pred2 <- predict(house_vote_model2, house_vote_test)
CrossTable(test_pred2, test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))
   Cell Contents
|-------------------------|
|                       N |
|           N / Col Total |
|-------------------------|

Total Observations in Table:  109

             | actual
   predicted |   democrat | republican |  Row Total |
-------------|------------|------------|------------|
    democrat |         74 |          1 |         75 |
             |      0.961 |      0.031 |            |
-------------|------------|------------|------------|
  republican |          3 |         31 |         34 |
             |      0.039 |      0.969 |            |
-------------|------------|------------|------------|
Column Total |         77 |         32 |        109 |
             |      0.706 |      0.294 |            |
-------------|------------|------------|------------|
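To make the effect of the laplace argument more concrete, here is a sketch of the usual add-one smoothing formula for a categorical feature. The helper laplace_prob() and the counts are hypothetical and written only for illustration; this is the textbook formula and is not claimed to reproduce the internals of naiveBayes() line for line.

```r
# Smoothed estimate of P(feature value | class):
#   count       = rows in the class with this feature value
#   class_total = rows in the class
#   n_levels    = number of possible values of the feature
laplace_prob <- function(count, class_total, n_levels, laplace = 1) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# Hypothetical case: no "y" votes among 50 members of one class,
# for a vote with two levels ("n" and "y")
laplace_prob(0, 50, 2, laplace = 0)  # 0: would zero out the whole product
laplace_prob(0, 50, 2, laplace = 1)  # 1/52, about 0.019
```

With smoothing, an unseen combination becomes unlikely rather than impossible, so the remaining votes can still influence the prediction.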