In this LBB we study how to use Supervised Learning (classification machine learning) to predict gender from voice data that has already been converted into tabular form. The supervised learning methods used are Naive Bayes, Decision Tree, and Random Forest classification.
Initial setup chunk to configure the session for this markdown document
options(scipen = 9999)
rm(list=ls())
Load the libraries that will be used
library(tidyverse)
library(e1071)
library(caret)
library(partykit)
library(rsample)
library(randomForest)
# so that all plots use theme_minimal()
theme_set(theme_minimal())
voice <- read.csv("voice.csv")
voice
glimpse(voice)
## Rows: 3,168
## Columns: 21
## $ meanfreq <dbl> 0.05978099, 0.06600874, 0.07731550, 0.15122809, 0.13512039, 0~
## $ sd <dbl> 0.06424127, 0.06731003, 0.08382942, 0.07211059, 0.07914610, 0~
## $ median <dbl> 0.03202691, 0.04022874, 0.03671846, 0.15801119, 0.12465623, 0~
## $ Q25 <dbl> 0.015071489, 0.019413867, 0.008701057, 0.096581728, 0.0787202~
## $ Q75 <dbl> 0.09019344, 0.09266619, 0.13190802, 0.20795525, 0.20604493, 0~
## $ IQR <dbl> 0.07512195, 0.07325232, 0.12320696, 0.11137352, 0.12732471, 0~
## $ skew <dbl> 12.8634618, 22.4232854, 30.7571546, 1.2328313, 1.1011737, 1.9~
## $ kurt <dbl> 274.402905, 634.613855, 1024.927705, 4.177296, 4.333713, 8.30~
## $ sp.ent <dbl> 0.8933694, 0.8921932, 0.8463891, 0.9633225, 0.9719551, 0.9631~
## $ sfm <dbl> 0.4919178, 0.5137238, 0.4789050, 0.7272318, 0.7835681, 0.7383~
## $ mode <dbl> 0.00000000, 0.00000000, 0.00000000, 0.08387818, 0.10426140, 0~
## $ centroid <dbl> 0.05978099, 0.06600874, 0.07731550, 0.15122809, 0.13512039, 0~
## $ meanfun <dbl> 0.08427911, 0.10793655, 0.09870626, 0.08896485, 0.10639785, 0~
## $ minfun <dbl> 0.01570167, 0.01582591, 0.01565558, 0.01779755, 0.01693122, 0~
## $ maxfun <dbl> 0.2758621, 0.2500000, 0.2711864, 0.2500000, 0.2666667, 0.2539~
## $ meandom <dbl> 0.007812500, 0.009014423, 0.007990057, 0.201497396, 0.7128125~
## $ mindom <dbl> 0.0078125, 0.0078125, 0.0078125, 0.0078125, 0.0078125, 0.0078~
## $ maxdom <dbl> 0.0078125, 0.0546875, 0.0156250, 0.5625000, 5.4843750, 2.7265~
## $ dfrange <dbl> 0.0000000, 0.0468750, 0.0078125, 0.5546875, 5.4765625, 2.7187~
## $ modindx <dbl> 0.00000000, 0.05263158, 0.04651163, 0.24711908, 0.20827389, 0~
## $ label <chr> "male", "male", "male", "male", "male", "male", "male", "male~
Check whether the data contains any missing values
anyNA(voice)
## [1] FALSE
Check the class proportions of the target
prop.table(table(voice$label))
##
## female male
## 0.5 0.5
Check the summary of the data frame.
summary(voice)
## meanfreq sd median Q25
## Min. :0.03936 Min. :0.01836 Min. :0.01097 Min. :0.0002288
## 1st Qu.:0.16366 1st Qu.:0.04195 1st Qu.:0.16959 1st Qu.:0.1110865
## Median :0.18484 Median :0.05916 Median :0.19003 Median :0.1402864
## Mean :0.18091 Mean :0.05713 Mean :0.18562 Mean :0.1404556
## 3rd Qu.:0.19915 3rd Qu.:0.06702 3rd Qu.:0.21062 3rd Qu.:0.1759388
## Max. :0.25112 Max. :0.11527 Max. :0.26122 Max. :0.2473469
## Q75 IQR skew kurt
## Min. :0.04295 Min. :0.01456 Min. : 0.1417 Min. : 2.068
## 1st Qu.:0.20875 1st Qu.:0.04256 1st Qu.: 1.6496 1st Qu.: 5.670
## Median :0.22568 Median :0.09428 Median : 2.1971 Median : 8.319
## Mean :0.22476 Mean :0.08431 Mean : 3.1402 Mean : 36.569
## 3rd Qu.:0.24366 3rd Qu.:0.11418 3rd Qu.: 2.9317 3rd Qu.: 13.649
## Max. :0.27347 Max. :0.25223 Max. :34.7255 Max. :1309.613
## sp.ent sfm mode centroid
## Min. :0.7387 Min. :0.03688 Min. :0.0000 Min. :0.03936
## 1st Qu.:0.8618 1st Qu.:0.25804 1st Qu.:0.1180 1st Qu.:0.16366
## Median :0.9018 Median :0.39634 Median :0.1866 Median :0.18484
## Mean :0.8951 Mean :0.40822 Mean :0.1653 Mean :0.18091
## 3rd Qu.:0.9287 3rd Qu.:0.53368 3rd Qu.:0.2211 3rd Qu.:0.19915
## Max. :0.9820 Max. :0.84294 Max. :0.2800 Max. :0.25112
## meanfun minfun maxfun meandom
## Min. :0.05557 Min. :0.009775 Min. :0.1031 Min. :0.007812
## 1st Qu.:0.11700 1st Qu.:0.018223 1st Qu.:0.2540 1st Qu.:0.419828
## Median :0.14052 Median :0.046110 Median :0.2712 Median :0.765795
## Mean :0.14281 Mean :0.036802 Mean :0.2588 Mean :0.829211
## 3rd Qu.:0.16958 3rd Qu.:0.047904 3rd Qu.:0.2775 3rd Qu.:1.177166
## Max. :0.23764 Max. :0.204082 Max. :0.2791 Max. :2.957682
## mindom maxdom dfrange modindx
## Min. :0.004883 Min. : 0.007812 Min. : 0.000 Min. :0.00000
## 1st Qu.:0.007812 1st Qu.: 2.070312 1st Qu.: 2.045 1st Qu.:0.09977
## Median :0.023438 Median : 4.992188 Median : 4.945 Median :0.13936
## Mean :0.052647 Mean : 5.047277 Mean : 4.995 Mean :0.17375
## 3rd Qu.:0.070312 3rd Qu.: 7.007812 3rd Qu.: 6.992 3rd Qu.:0.20918
## Max. :0.458984 Max. :21.867188 Max. :21.844 Max. :0.93237
## label
## Length:3168
## Class :character
## Mode :character
##
##
##
From the EDA above, the data has 3,168 rows and 21 columns. The variable label is used as the target variable, and the remaining 20 columns are used as predictors.
In this LBB the dataset is already fairly clean, so the only wrangling needed is converting label to a factor.
voice <- voice %>% mutate(label=as.factor(label))
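The order of the factor levels matters later on: with two classes, caret's confusionMatrix() treats the first level as the 'Positive' class unless told otherwise. A minimal sketch on a toy vector (the values below are hypothetical stand-ins, not rows from voice.csv):

```r
# Toy stand-in for voice$label (hypothetical values)
label <- c("male", "female", "male", "female")

# as.factor() sorts the levels alphabetically
label_f <- as.factor(label)
levels(label_f)

# The first level is the one confusionMatrix() reports as 'Positive'
# when the positive argument is not set
positive_class <- levels(label_f)[1]
positive_class
```

This is consistent with the confusion matrix outputs in this report, which all list female as the 'Positive' class.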
RNGkind(sample.kind = "Rounding")
set.seed(123)
splitter <- initial_split(voice, prop = 0.75, strata = "label")
data_train <- training(splitter)
data_test <- testing(splitter)
Remove the target variable from the test data
predict_set <- data_test %>% select(-label)
Check the class proportions of the training data
prop.table(table(data_train$label))
##
## female male
## 0.5 0.5
Since the training data is already balanced, it can go straight into the modelling stage
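The split stays balanced because it was stratified on label. The idea can be sketched in base R by sampling 75% of the rows within each class separately. This is only an illustration of the concept on a hypothetical toy data frame, not the internal implementation of rsample's initial_split():

```r
set.seed(123)

# Hypothetical toy data: 20 rows per class
toy <- data.frame(
  x     = rnorm(40),
  label = factor(rep(c("female", "male"), each = 20))
)

# Sample 75% of the row indices inside each class
rows_by_class <- split(seq_len(nrow(toy)), toy$label)
train_idx <- unlist(lapply(rows_by_class, function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both classes keep the same share in the training set
prop.table(table(toy_train$label))
```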
We build a Naive Bayes model from the processed data, with label as the target variable and data_train as the training data.
# Build the Naive Bayes model
model_naive <- naiveBayes(label ~ ., data_train)
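For numeric predictors, naiveBayes() from e1071 assumes a Gaussian distribution within each class. The calculation behind one prediction can be sketched by hand for a single predictor. The data below is simulated and the means and standard deviations are assumptions chosen for illustration, not values estimated from voice.csv:

```r
set.seed(1)

# Simulated single predictor (hypothetical values)
x <- c(rnorm(50, mean = 0.10, sd = 0.02),   # "male" rows
       rnorm(50, mean = 0.17, sd = 0.02))   # "female" rows
y <- factor(rep(c("male", "female"), each = 50))

lv    <- levels(y)                          # "female", "male"
prior <- as.numeric(table(y)) / length(y)   # P(class)
mu    <- as.numeric(tapply(x, y, mean))     # per-class mean
sigma <- as.numeric(tapply(x, y, sd))       # per-class standard deviation

# Posterior = prior * Gaussian likelihood, normalised to sum to 1
posterior <- function(x_new) {
  unnorm <- prior * dnorm(x_new, mean = mu, sd = sigma)
  setNames(unnorm / sum(unnorm), lv)
}

p <- posterior(0.10)
p
names(p)[which.max(p)]   # predicted class for x_new = 0.10
```

With 20 predictors, the model multiplies one such likelihood per predictor, which is where the "naive" independence assumption comes in. The per-class means and standard deviations of the fitted model can be inspected in model_naive$tables.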
We also use a decision tree as a prediction model, again with label as the target variable and data_train as the training data
modeltree <- ctree(label ~ ., data_train)
plot(modeltree, type = "simple")
For the random forest model we use 5-fold cross-validation, and the whole procedure is repeated 3 times. A simple holdout validation splits the data only once into training and testing data. K-fold cross-validation instead splits the data into \(k\) parts of equal size, and each part takes a turn as the testing data while the remaining \(k-1\) parts are used for training.
set.seed(123)
ctrl <- trainControl(method="repeatedcv", number=5, repeats=3)
model_rforest <- train(label~ ., data=data_train, method="rf", trControl = ctrl)
saveRDS(model_rforest, file = "model_rforest.rds")
model_rforest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 1.89%
## Confusion matrix:
## female male class.error
## female 1166 22 0.01851852
## male 23 1165 0.01936027
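The fold rotation behind the 5-fold cross-validation used to tune the random forest can be sketched in base R. This is a simplified illustration on hypothetical row counts, not the resampling code caret runs internally:

```r
set.seed(123)

n <- 20   # hypothetical number of training rows
k <- 5    # number of folds

# Assign each row to one of k folds, in random order
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once; the other k - 1 folds are used for fitting
for (i in seq_len(k)) {
  holdout_idx <- which(fold_id == i)
  fit_idx     <- which(fold_id != i)
  cat("fold", i, ": fit on", length(fit_idx),
      "rows, validate on", length(holdout_idx), "rows\n")
}
```

With repeats = 3, this rotation is run three times with a different random fold assignment each time, and the performance estimates are averaged.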
preds_naive <- predict(model_naive, newdata = predict_set)
confusionMatrix(preds_naive,reference = data_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction female male
## female 357 47
## male 39 349
##
## Accuracy : 0.8914
## 95% CI : (0.8676, 0.9122)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.7828
##
## Mcnemar's Test P-Value : 0.4504
##
## Sensitivity : 0.9015
## Specificity : 0.8813
## Pos Pred Value : 0.8837
## Neg Pred Value : 0.8995
## Prevalence : 0.5000
## Detection Rate : 0.4508
## Detection Prevalence : 0.5101
## Balanced Accuracy : 0.8914
##
## 'Positive' Class : female
##
From the Naive Bayes confusion matrix, the accuracy is 89.14%.
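The headline metrics can be recomputed by hand from the four cells of the Naive Bayes confusion matrix above, which helps when reading the other two matrices. A short worked example, with female as the positive class as in the output:

```r
# Cells copied from the Naive Bayes confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(357, 39,    # reference female: predicted female, predicted male
    47, 349),   # reference male:   predicted female, predicted male
  nrow = 2,
  dimnames = list(Prediction = c("female", "male"),
                  Reference  = c("female", "male"))
)

accuracy    <- sum(diag(cm)) / sum(cm)                       # correct / all
sensitivity <- cm["female", "female"] / sum(cm[, "female"])  # true positive rate
specificity <- cm["male", "male"] / sum(cm[, "male"])        # true negative rate
precision   <- cm["female", "female"] / sum(cm["female", ])  # Pos Pred Value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```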
predict_dtree <- predict(object = modeltree,newdata = data_test, type = "response")
confusionMatrix(predict_dtree,reference = data_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction female male
## female 389 23
## male 7 373
##
## Accuracy : 0.9621
## 95% CI : (0.9464, 0.9743)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 0.0000000000000002
##
## Kappa : 0.9242
##
## Mcnemar's Test P-Value : 0.00617
##
## Sensitivity : 0.9823
## Specificity : 0.9419
## Pos Pred Value : 0.9442
## Neg Pred Value : 0.9816
## Prevalence : 0.5000
## Detection Rate : 0.4912
## Detection Prevalence : 0.5202
## Balanced Accuracy : 0.9621
##
## 'Positive' Class : female
##
From the Decision Tree confusion matrix, the accuracy is 96.21%.
predict_forest <- predict(model_rforest, predict_set)
confusionMatrix(predict_forest,reference = data_test$label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction female male
## female 391 9
## male 5 387
##
## Accuracy : 0.9823
## 95% CI : (0.9705, 0.9903)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : <0.0000000000000002
##
## Kappa : 0.9646
##
## Mcnemar's Test P-Value : 0.4227
##
## Sensitivity : 0.9874
## Specificity : 0.9773
## Pos Pred Value : 0.9775
## Neg Pred Value : 0.9872
## Prevalence : 0.5000
## Detection Rate : 0.4937
## Detection Prevalence : 0.5051
## Balanced Accuracy : 0.9823
##
## 'Positive' Class : female
##
From the Random Forest confusion matrix, the accuracy is 98.23%.
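To compare the three models side by side, the metrics reported above can be collected into one small table. The numbers are copied from the confusion matrix outputs in this report; the data frame name results is only an illustrative choice:

```r
# Metrics copied from the three confusionMatrix() outputs above
results <- data.frame(
  model       = c("Naive Bayes", "Decision Tree", "Random Forest"),
  accuracy    = c(0.8914, 0.9621, 0.9823),
  sensitivity = c(0.9015, 0.9823, 0.9874),
  specificity = c(0.8813, 0.9419, 0.9773)
)

# Sort from highest to lowest accuracy
results <- results[order(-results$accuracy), ]
results
```

On this test set, Random Forest has the highest accuracy, followed by the Decision Tree and then Naive Bayes.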
plot(varImp(model_rforest))