Naive Bayes Classifier Analysis in R with the United States Congressional Voting 1984 Data

Copyright @ Sya’roni @ Prof.Dr.Drs. Agus Widodo, M.Kes @ Magister Informatika @ UIN Maulana Malik Ibrahim @ UIN Malang

DATA

This Naive Bayes analysis uses the United States Congressional Voting 1984 dataset (HouseVotes84), taken from the UCI Repository of Machine Learning Databases via the mlbench package. The data comprise 435 observations of 17 variables: one class variable (democrat, republican) and 16 votes (yes, no) on different topics.

DATA EXPLORATION AND PREPARATION

library(mlbench)
library(e1071)
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(ggplot2)
library(gmodels)

# Load the data
data(HouseVotes84)

# Structure
str(HouseVotes84)

## 'data.frame':    435 obs. of  17 variables:
##  $ Class: Factor w/ 2 levels "democrat","republican": 2 2 1 1 1 1 1 2 2 1 ...
##  $ V1   : Factor w/ 2 levels "n","y": 1 1 NA 1 2 1 1 1 1 2 ...
##  $ V2   : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ V3   : Factor w/ 2 levels "n","y": 1 1 2 2 2 2 1 1 1 2 ...
##  $ V4   : Factor w/ 2 levels "n","y": 2 2 NA 1 1 1 2 2 2 1 ...
##  $ V5   : Factor w/ 2 levels "n","y": 2 2 2 NA 2 2 2 2 2 1 ...
##  $ V6   : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 2 2 2 1 ...
##  $ V7   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V8   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V9   : Factor w/ 2 levels "n","y": 1 1 1 1 1 1 1 1 1 2 ...
##  $ V10  : Factor w/ 2 levels "n","y": 2 1 1 1 1 1 1 1 1 1 ...
##  $ V11  : Factor w/ 2 levels "n","y": NA 1 2 2 2 1 1 1 1 1 ...
##  $ V12  : Factor w/ 2 levels "n","y": 2 2 1 1 NA 1 1 1 2 1 ...
##  $ V13  : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 NA 2 2 1 ...
##  $ V14  : Factor w/ 2 levels "n","y": 2 2 2 1 2 2 2 2 2 1 ...
##  $ V15  : Factor w/ 2 levels "n","y": 1 1 1 1 2 2 2 NA 1 NA ...
##  $ V16  : Factor w/ 2 levels "n","y": 2 NA 1 2 2 2 2 2 2 NA ...

summary(HouseVotes84$Class)

##   democrat republican 
##        267        168

summary(HouseVotes84)

##         Class        V1         V2         V3         V4         V5     
##  democrat  :267   n   :236   n   :192   n   :171   n   :247   n   :208  
##  republican:168   y   :187   y   :195   y   :253   y   :177   y   :212  
##                   NA's: 12   NA's: 48   NA's: 11   NA's: 11   NA's: 15  
##     V6         V7         V8         V9        V10        V11        V12     
##  n   :152   n   :182   n   :178   n   :206   n   :212   n   :264   n   :233  
##  y   :272   y   :239   y   :242   y   :207   y   :216   y   :150   y   :171  
##  NA's: 11   NA's: 14   NA's: 15   NA's: 22   NA's:  7   NA's: 21   NA's: 31  
##    V13        V14        V15        V16     
##  n   :201   n   :170   n   :233   n   : 62  
##  y   :209   y   :248   y   :174   y   :269  
##  NA's: 25   NA's: 17   NA's: 28   NA's:104

The dataset contains missing values: every vote column has some NA entries, and V16 alone has 104. First, we remove every row that contains an NA value.

head(is.na(HouseVotes84))

##   Class    V1    V2    V3    V4    V5    V6    V7    V8    V9   V10   V11   V12
## 1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## 2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 3 FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## 6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##     V13   V14   V15   V16
## 1 FALSE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE  TRUE
## 3 FALSE FALSE FALSE FALSE
## 4 FALSE FALSE FALSE FALSE
## 5 FALSE FALSE FALSE FALSE
## 6 FALSE FALSE FALSE FALSE
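The logical matrix above shows missing values cell by cell. A more compact view counts them per column and per row. The following is a minimal sketch in base R on a small made-up data frame; the name `votes` and its values are illustrative assumptions, not part of HouseVotes84.

```r
# Toy stand-in for a slice of the voting data (hypothetical values)
votes <- data.frame(
  Class = factor(c("republican", "democrat", "democrat", "republican")),
  V1    = factor(c("n", NA, "y", "n"), levels = c("n", "y")),
  V2    = factor(c("y", "y", NA, "n"), levels = c("n", "y"))
)

# Number of missing values in each column
na_per_column <- colSums(is.na(votes))
print(na_per_column)

# Number of rows without any missing value
n_complete <- sum(complete.cases(votes))
print(n_complete)

# na.omit() keeps exactly those complete rows
clean <- na.omit(votes)
print(nrow(clean))
```

The same three calls can be applied to HouseVotes84 directly to see how many rows survive before na.omit() is used.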

CleanDataset <- na.omit(HouseVotes84)
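# Bar chart of the class counts after removing incomplete rows (qplot() is deprecated in recent ggplot2 releases)
ggplot(CleanDataset, aes(x = Class)) + geom_bar() + theme(axis.text.x = element_text(angle = 0, hjust = 2))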

set.seed(20)
# Stratified sampling: select rows using the Class variable as the strata
TrainingDataIndex <- createDataPartition(CleanDataset$Class, p=0.50, list = FALSE)

# Build the training data as the subset of the dataset with the row indices identified above, keeping all columns
trainingData <- CleanDataset[TrainingDataIndex,]

# Everything that is not in the training set becomes the test data. Note the - (minus) sign
testData <- CleanDataset[-TrainingDataIndex,]

# Store the labels
vote_train_labels <- trainingData$Class
vote_test_labels  <- testData$Class

# Check the class proportions
prop.table(table(vote_train_labels))

## vote_train_labels
##   democrat republican 
##  0.5344828  0.4655172

prop.table(table(vote_test_labels))

## vote_test_labels
##   democrat republican 
##  0.5344828  0.4655172
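The identical proportions in both sets come from the stratified split: for a factor outcome, createDataPartition samples within each class level. The idea can be sketched in base R as follows. The helper `stratified_index` is a hypothetical name introduced for illustration, the labels are made up (124 and 108 members, an assumption), and this is not caret's actual implementation.

```r
set.seed(20)

# Hypothetical labels; the class sizes are an assumption for illustration
labels <- factor(rep(c("democrat", "republican"), times = c(124, 108)))

stratified_index <- function(y, p) {
  # For every class, draw a fraction p of its row numbers at random
  idx_by_class <- lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = ceiling(p * length(rows)))]
  })
  sort(unlist(idx_by_class, use.names = FALSE))
}

train_idx    <- stratified_index(labels, p = 0.5)
train_labels <- labels[train_idx]
test_labels  <- labels[-train_idx]

# Both sets keep the class mix of the full label vector
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```

Sampling each class separately is what keeps the class mix stable; a plain sample() over all rows would only match it on average.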

TRAINING THE MODEL ON THE DATA

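# Note: trainingData still contains the Class column, so the label itself is passed as a predictor;
# trainingData[, -1] would exclude it. The results shown in this document come from the call as written.
vote_classifier <- naiveBayes(trainingData, vote_train_labels)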
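To make the training step less of a black box: a naive Bayes model for categorical predictors stores a prior probability per class and, for each predictor, a table of conditional probabilities given the class. Prediction picks the class with the largest product of those terms. The following is a minimal base R sketch on made-up data, with the label kept separate from the predictors. The helpers `fit_nb` and `predict_nb` are hypothetical names, and the Laplace smoothing of 1 is an assumption chosen to avoid zero probabilities; it is not a description of how e1071 implements naiveBayes.

```r
# Hypothetical toy votes; the label is kept separate from the predictors
x <- data.frame(
  V1 = factor(c("y", "y", "n", "n", "y", "n"), levels = c("n", "y")),
  V2 = factor(c("n", "n", "y", "y", "n", "y"), levels = c("n", "y"))
)
y <- factor(c("democrat", "democrat", "republican",
              "republican", "democrat", "republican"))

fit_nb <- function(x, y, laplace = 1) {
  # Class priors as a plain named numeric vector
  prior <- setNames(as.numeric(table(y)) / length(y), levels(y))
  # One table per predictor: rows are classes, columns are vote values
  cond <- lapply(x, function(col) {
    counts <- table(y, col) + laplace
    counts / rowSums(counts)
  })
  list(prior = prior, cond = cond)
}

predict_nb <- function(model, newdata) {
  classes <- names(model$prior)
  scores <- sapply(classes, function(cl) {
    # Log prior plus the log conditional probability of every observed vote
    logp <- rep(log(model$prior[[cl]]), nrow(newdata))
    for (v in names(model$cond)) {
      logp <- logp + log(model$cond[[v]][cl, as.character(newdata[[v]])])
    }
    logp
  })
  # Force a rows-by-classes matrix even when newdata has a single row
  scores <- matrix(scores, nrow = nrow(newdata), dimnames = list(NULL, classes))
  factor(classes[max.col(scores, ties.method = "first")], levels = classes)
}

model <- fit_nb(x, y)
new_votes <- data.frame(
  V1 = factor(c("y", "n"), levels = c("n", "y")),
  V2 = factor(c("n", "y"), levels = c("n", "y"))
)
pred <- predict_nb(model, new_votes)
print(pred)
```

Working in log space avoids numerical underflow when many probabilities are multiplied, which matters more with 16 predictors than with the two used here.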

EVALUATING MODEL PERFORMANCE

The Naive Bayes predictions are stored in vote_test_pred and then compared with the true labels.

vote_test_pred <- predict(vote_classifier, testData)
head(vote_test_pred)

## [1] democrat   republican democrat   democrat   republican democrat  
## Levels: democrat republican

CrossTable(vote_test_pred, vote_test_labels,
           prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE,
           dnn = c('predicted', 'actual'))

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Col Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  116 
## 
##  
##              | actual 
##    predicted |   democrat | republican |  Row Total | 
## -------------|------------|------------|------------|
##     democrat |         59 |          2 |         61 | 
##              |      0.952 |      0.037 |            | 
## -------------|------------|------------|------------|
##   republican |          3 |         52 |         55 | 
##              |      0.048 |      0.963 |            | 
## -------------|------------|------------|------------|
## Column Total |         62 |         54 |        116 | 
##              |      0.534 |      0.466 |            | 
## -------------|------------|------------|------------|
## 
##

The overall accuracy of the model is 0.957 (111 of the 116 test cases are classified correctly). Three democrats were predicted to be republicans and two republicans were labeled as democrats.
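The accuracy can be recomputed from the counts in the cross table. The following base R sketch copies the four cell counts by hand; the object names `confusion`, `accuracy` and `recall` are introduced here for illustration only.

```r
# Counts copied from the cross table (rows: predicted, columns: actual)
confusion <- matrix(
  c(59, 3, 2, 52), nrow = 2,
  dimnames = list(predicted = c("democrat", "republican"),
                  actual    = c("democrat", "republican"))
)

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(confusion)) / sum(confusion)

# Per-class recall: correct predictions divided by the column (actual) totals
recall <- diag(confusion) / colSums(confusion)

print(round(accuracy, 3))
print(round(recall, 3))
```

The per-class recall values correspond to the "N / Col Total" entries on the diagonal of the cross table.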