Classification is a data mining method that assigns each observation to a class or label. In machine learning terms, it is a form of supervised learning. This analysis uses two classification techniques: Decision Tree and Naive Bayes. The dataset comes from the UCI Machine Learning Repository and contains 520 observations of the 17 variables listed below.
diabet.df = read.csv("C:/Personal Files/R Files/diabetes_data_upload.csv")
str(diabet.df)
## 'data.frame': 520 obs. of 17 variables:
## $ Age : int 40 58 41 45 60 55 57 66 67 70 ...
## $ Gender : chr "Male" "Male" "Male" "Male" ...
## $ Polyuria : chr "No" "No" "Yes" "No" ...
## $ Polydipsia : chr "Yes" "No" "No" "No" ...
## $ sudden.weight.loss: chr "No" "No" "No" "Yes" ...
## $ weakness : chr "Yes" "Yes" "Yes" "Yes" ...
## $ Polyphagia : chr "No" "No" "Yes" "Yes" ...
## $ Genital.thrush : chr "No" "No" "No" "Yes" ...
## $ visual.blurring : chr "No" "Yes" "No" "No" ...
## $ Itching : chr "Yes" "No" "Yes" "Yes" ...
## $ Irritability : chr "No" "No" "No" "No" ...
## $ delayed.healing : chr "Yes" "No" "Yes" "Yes" ...
## $ partial.paresis : chr "No" "Yes" "No" "No" ...
## $ muscle.stiffness : chr "Yes" "No" "Yes" "No" ...
## $ Alopecia : chr "Yes" "Yes" "Yes" "No" ...
## $ Obesity : chr "Yes" "No" "No" "No" ...
## $ class : chr "Positive" "Positive" "Positive" "Positive" ...
# Convert every character column to a factor
diabet.data = diabet.df
diabet.data = as.data.frame(unclass(diabet.data),
                            stringsAsFactors = TRUE)
str(diabet.data)
## 'data.frame': 520 obs. of 17 variables:
## $ Age : int 40 58 41 45 60 55 57 66 67 70 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
## $ Polyuria : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 1 ...
## $ Polydipsia : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 2 2 2 2 2 ...
## $ sudden.weight.loss: Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 2 1 2 ...
## $ weakness : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Polyphagia : Factor w/ 2 levels "No","Yes": 1 1 2 2 2 2 2 1 2 2 ...
## $ Genital.thrush : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 2 1 2 1 ...
## $ visual.blurring : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 2 1 2 1 2 ...
## $ Itching : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 1 2 2 2 ...
## $ Irritability : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 2 2 2 ...
## $ delayed.healing : Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 2 1 1 1 ...
## $ partial.paresis : Factor w/ 2 levels "No","Yes": 1 2 1 1 2 1 2 2 2 1 ...
## $ muscle.stiffness : Factor w/ 2 levels "No","Yes": 2 1 2 1 2 2 1 2 2 1 ...
## $ Alopecia : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 2 1 1 1 2 ...
## $ Obesity : Factor w/ 2 levels "No","Yes": 2 1 1 1 2 2 1 1 2 1 ...
## $ class : Factor w/ 2 levels "Negative","Positive": 2 2 2 2 2 2 2 2 2 2 ...
set.seed(123)
# Randomly assign each row to group 1 (training, about 80%) or group 2 (test, about 20%)
pd = sample(2, nrow(diabet.data), replace = TRUE, prob = c(0.8,0.2))
train_data = diabet.data[pd == 1,]
test_data = diabet.data[pd == 2,]
library(ggplot2)
library(dplyr)
library(ggpubr)
p1 = ggplot(diabet.data, aes(x = class, fill = Alopecia)) +
geom_bar(position = "dodge")
p2 = ggplot(diabet.data, aes(x = class, fill = Polyuria)) +
geom_bar(position = "dodge")
p3 = ggplot(diabet.data, aes(x = class, fill = Gender)) +
geom_bar(position = "dodge")
p4 = ggplot(diabet.data, aes(x = class, fill = Polydipsia)) +
geom_bar(position = "dodge")
p5 = ggplot(diabet.data, aes(x = class, fill = sudden.weight.loss)) +
geom_bar(position = "dodge")
p6 = ggplot(diabet.data, aes(x = class, fill = weakness)) +
geom_bar(position = "dodge")
ggarrange(p1, p2, p3, p4, p5, p6, nrow = 3, ncol = 2)
table(diabet.data$class)
##
## Negative Positive
##      200      320
# Initial entropy of the class variable, before any split
entropy_awal = - 200/520 * log2(200/520) - 320/520 * log2(320/520)
entropy_awal
## [1] 0.9612366
The initial entropy is 0.9612366, close to the maximum of 1 for a two-class problem. This indicates that neither class strongly dominates: the classes are reasonably, though not perfectly, balanced (200 Negative versus 320 Positive).
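For reference, the quantity computed above is the Shannon entropy of the class distribution. In the formula below, \(S\) denotes the dataset and \(p_i\) the proportion of observations in class \(i\); both symbols are notation introduced here and do not appear in the original analysis.

\[ H(S) = -\sum_{i} p_i \log_2 p_i = -\frac{200}{520}\log_2\frac{200}{520} - \frac{320}{520}\log_2\frac{320}{520} \approx 0.9612 \]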
table(Gender = diabet.data$Gender, Class = diabet.data$class)
##         Class
## Gender   Negative Positive
##   Female       19      173
##   Male        181      147
entropy_female =
- (173/(173+19)) * log2(173/(173+19)) - (19/(173+19)) * log2(19/(173+19))
entropy_male =
- (147/(147+181)) * log2(147/(147+181)) - (181/(147+181)) * log2(181/(147+181))
inf_gain_gender =
entropy_awal - ((173+19)/520 * entropy_female + (147+181)/520 * entropy_male)
inf_gain_gender
## [1] 0.16342
table(Polyuria = diabet.data$Polyuria, Class = diabet.data$class)
##         Class
## Polyuria Negative Positive
##      No       185       77
##      Yes       15      243
entropy_polyuria_no =
- (77/(77+185)) * log2((77/(77+185))) - (185/(77+185)) * log2((185/(77+185)))
entropy_polyuria_yes =
- (243/(243+15)) * log2((243/(243+15))) - (15/(243+15)) * log2((15/(243+15)))
inf_gain_polyuria =
entropy_awal -
((77+185)/520 * entropy_polyuria_no + (15+243)/520 * entropy_polyuria_yes)
inf_gain_polyuria
## [1] 0.362251
Comparing the information gain of Gender (0.16342) with that of Polyuria (0.362251) shows that Polyuria is more informative about the class than Gender. Note that ctree(), used here to fit the tree, selects splits with permutation-based significance tests rather than information gain, but it likewise places Polyuria at the root. The fitted decision tree is as follows:
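The manual calculation for Gender and Polyuria generalises to any categorical predictor. The following is a minimal sketch and not part of the original analysis: the helper names `entropy` and `info_gain` are introduced here for illustration, and the counts are copied from the two contingency tables printed earlier.

```r
# Shannon entropy (base 2) of a vector of class counts
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                 # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

# Information gain of a predictor, given a contingency table with
# predictor levels in rows and classes in columns
info_gain <- function(tab) {
  parent   <- entropy(colSums(tab))       # entropy before the split
  weights  <- rowSums(tab) / sum(tab)     # share of rows in each level
  children <- apply(tab, 1, entropy)      # entropy within each level
  parent - sum(weights * children)
}

# Counts copied from the tables above (columns: Negative, Positive)
gender_tab   <- matrix(c(19, 173, 181, 147), nrow = 2, byrow = TRUE,
                       dimnames = list(c("Female", "Male"),
                                       c("Negative", "Positive")))
polyuria_tab <- matrix(c(185, 77, 15, 243), nrow = 2, byrow = TRUE,
                       dimnames = list(c("No", "Yes"),
                                       c("Negative", "Positive")))

gain_gender   <- info_gain(gender_tab)
gain_polyuria <- info_gain(polyuria_tab)
c(Gender = gain_gender, Polyuria = gain_polyuria)
```

Both values should agree with `inf_gain_gender` and `inf_gain_polyuria` computed by hand. Passing `table(diabet.data$Gender, diabet.data$class)` instead of a hand-typed matrix should work the same way, since the helper only relies on row and column sums.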
library(party)
library(partykit)
tree_all = ctree(class ~ ., data = train_data)
tree_all
##
## Model formula:
## class ~ Age + Gender + Polyuria + Polydipsia + sudden.weight.loss +
## weakness + Polyphagia + Genital.thrush + visual.blurring +
## Itching + Irritability + delayed.healing + partial.paresis +
## muscle.stiffness + Alopecia + Obesity
##
## Fitted party:
## [1] root
## | [2] Polyuria in No
## | | [3] Gender in Female
## | | | [4] Alopecia in No: Positive (n = 36, err = 11.1%)
## | | | [5] Alopecia in Yes: Negative (n = 14, err = 21.4%)
## | | [6] Gender in Male
## | | | [7] Polydipsia in No
## | | | | [8] Irritability in No: Negative (n = 123, err = 4.1%)
## | | | | [9] Irritability in Yes: Negative (n = 13, err = 30.8%)
## | | | [10] Polydipsia in Yes: Positive (n = 22, err = 36.4%)
## | [11] Polyuria in Yes
## | | [12] Polydipsia in No
## | | | [13] delayed.healing in No: Positive (n = 24, err = 0.0%)
## | | | [14] delayed.healing in Yes
## | | | | [15] Alopecia in No: Positive (n = 10, err = 0.0%)
## | | | | [16] Alopecia in Yes: Negative (n = 17, err = 29.4%)
## | | [17] Polydipsia in Yes: Positive (n = 164, err = 0.0%)
##
## Number of inner nodes: 8
## Number of terminal nodes: 9
plot(tree_all, gp = gpar(fontsize = 6), # font size changed to 6
     inner_panel = node_inner,
     ip_args = list(
       abbreviate = TRUE,
       id = FALSE)
)
library(caret)
# Predict on the held-out test set and compare with the true classes
pred_test_dt = predict(tree_all, test_data)
tree_conma = confusionMatrix(pred_test_dt, test_data$class, positive = "Positive")
tree_conma
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 36 8
## Positive 2 51
##
## Accuracy : 0.8969
## 95% CI : (0.8186, 0.9494)
## No Information Rate : 0.6082
## P-Value [Acc > NIR] : 2.123e-10
##
## Kappa : 0.7896
##
## Mcnemar's Test P-Value : 0.1138
##
## Sensitivity : 0.8644
## Specificity : 0.9474
## Pos Pred Value : 0.9623
## Neg Pred Value : 0.8182
## Prevalence : 0.6082
## Detection Rate : 0.5258
## Detection Prevalence : 0.5464
## Balanced Accuracy : 0.9059
##
## 'Positive' Class : Positive
##
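The headline statistics in this output can be checked by hand from the four cells of the confusion matrix. The sketch below is an illustration added here, with the counts copied from the matrix above and "Positive" taken as the positive class; the variable names are not from the original analysis.

```r
# Counts copied from the decision tree confusion matrix
tp <- 51  # predicted Positive, actually Positive
fn <- 8   # predicted Negative, actually Positive
fp <- 2   # predicted Positive, actually Negative
tn <- 36  # predicted Negative, actually Negative

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
sensitivity <- tp / (tp + fn)                   # share of actual positives found
specificity <- tn / (tn + fp)                   # share of actual negatives found

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
# accuracy 0.8969, sensitivity 0.8644, specificity 0.9474
```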
Bayes' theorem can be stated as follows:
\[ P(A|B) = \frac{P(B|A)\ P(A)}{P(B|A)\ P(A) + P(B|\neg A)\ P(\neg A)} \]
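Naive Bayes applies this theorem under the additional assumption that the predictors are conditionally independent given the class. Under that assumption, with \(C\) denoting the class and \(x_1, \dots, x_n\) the observed predictor values (notation introduced here for illustration), the posterior is proportional to the prior times the product of the per-predictor conditional probabilities:

\[ P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C) \]

The predicted class is the one with the largest value of this product.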
library(naivebayes)
bayes_all = naive_bayes(class ~ ., train_data)
bayes_all
##
## ================================== Naive Bayes ==================================
##
## Call:
## naive_bayes.formula(formula = class ~ ., data = train_data)
##
## ---------------------------------------------------------------------------------
##
## Laplace smoothing: 0
##
## ---------------------------------------------------------------------------------
##
## A priori probabilities:
##
## Negative Positive
## 0.3829787 0.6170213
##
## ---------------------------------------------------------------------------------
##
## Tables:
##
## ---------------------------------------------------------------------------------
## ::: Age (Gaussian)
## ---------------------------------------------------------------------------------
##
## Age Negative Positive
## mean 46.46296 48.67433
## sd 12.07952 12.01687
##
## ---------------------------------------------------------------------------------
## ::: Gender (Bernoulli)
## ---------------------------------------------------------------------------------
##
## Gender Negative Positive
## Female 0.09259259 0.56704981
## Male 0.90740741 0.43295019
##
## ---------------------------------------------------------------------------------
## ::: Polyuria (Bernoulli)
## ---------------------------------------------------------------------------------
##
## Polyuria Negative Positive
## No 0.92592593 0.22222222
## Yes 0.07407407 0.77777778
##
## ---------------------------------------------------------------------------------
## ::: Polydipsia (Bernoulli)
## ---------------------------------------------------------------------------------
##
## Polydipsia Negative Positive
## No 0.95061728 0.27969349
## Yes 0.04938272 0.72030651
##
## ---------------------------------------------------------------------------------
## ::: sudden.weight.loss (Bernoulli)
## ---------------------------------------------------------------------------------
##
## sudden.weight.loss Negative Positive
## No 0.8518519 0.4137931
## Yes 0.1481481 0.5862069
##
## ---------------------------------------------------------------------------------
##
## # ... and 11 more tables
##
## ---------------------------------------------------------------------------------
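To see how these tables turn into a prediction, the sketch below combines the prior with two of the conditional probability tables for a hypothetical patient who reports both Polyuria and Polydipsia. It is an illustration only: the numbers are copied from the output above, the variable names are introduced here, and because just two of the 16 predictors are used, the result is not the posterior the fitted model would return for a real patient.

```r
# A priori class probabilities, copied from the model output
prior <- c(Negative = 0.3829787, Positive = 0.6170213)

# P(Polyuria = "Yes" | class) and P(Polydipsia = "Yes" | class)
lik_polyuria_yes   <- c(Negative = 0.07407407, Positive = 0.77777778)
lik_polydipsia_yes <- c(Negative = 0.04938272, Positive = 0.72030651)

# Unnormalised score: prior times the product of the conditional probabilities
score <- prior * lik_polyuria_yes * lik_polydipsia_yes

# Normalise so that the two posteriors sum to 1
posterior <- score / sum(score)
posterior
```

With these two symptoms alone, the posterior probability of the Positive class comes out above 0.99, which is consistent with the decision tree placing Polyuria and Polydipsia near the top.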
pred_test_bayes = predict(bayes_all, test_data)
bayes_conma = confusionMatrix(pred_test_bayes, test_data$class, positive = "Positive")
bayes_conma
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Positive
## Negative 35 13
## Positive 3 46
##
## Accuracy : 0.8351
## 95% CI : (0.746, 0.9027)
## No Information Rate : 0.6082
## P-Value [Acc > NIR] : 1.12e-06
##
## Kappa : 0.6694
##
## Mcnemar's Test P-Value : 0.02445
##
## Sensitivity : 0.7797
## Specificity : 0.9211
## Pos Pred Value : 0.9388
## Neg Pred Value : 0.7292
## Prevalence : 0.6082
## Detection Rate : 0.4742
## Detection Prevalence : 0.5052
## Balanced Accuracy : 0.8504
##
## 'Positive' Class : Positive
##
Results for the Naive Bayes model:
bayes_conma$overall["Accuracy"]
##  Accuracy
## 0.8350515
bayes_conma$byClass["Sensitivity"]
## Sensitivity
##    0.779661
bayes_conma$byClass["Specificity"]
## Specificity
##   0.9210526
Results for the Decision Tree model:
tree_conma$overall["Accuracy"]
##  Accuracy
## 0.8969072
tree_conma$byClass["Sensitivity"]
## Sensitivity
##   0.8644068
tree_conma$byClass["Specificity"]
## Specificity
##   0.9473684
Both a Decision Tree and a Naive Bayes model were fitted to classify early-stage diabetes. On the test set of 97 observations, the Decision Tree reaches a higher accuracy (0.8969072) than Naive Bayes (0.8350515), so it is the better model on this dataset. The two 95% confidence intervals overlap, however, so the difference should be read with some caution.
Sensitivity is the proportion of actual positive cases that the model correctly identifies. The Decision Tree is again better here, with a sensitivity of 0.8644068 against 0.779661 for Naive Bayes.
Specificity, the proportion of actual negative cases correctly identified, is likewise higher for the Decision Tree (0.9473684), although with a specificity of 0.9210526 the Naive Bayes model remains usable.
Taking accuracy, sensitivity, and specificity together, the Decision Tree is the better model for classifying early-stage diabetes on this dataset.
According to the tree, a patient is classified as Positive when both Polyuria and Polydipsia are present; this terminal node contains 164 training observations, all of them Positive.
The largest Negative node contains 123 training observations: male patients with no Polyuria, no Polydipsia, and no Irritability.
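The printed tree can also be read as a set of nested rules. The function below is a hand transcription of the terminal nodes shown in the `tree_all` output, offered as a reading aid under the assumption that the printout was copied correctly; `classify_by_tree` is a hypothetical name, and the function is not a substitute for `predict(tree_all, ...)`.

```r
# x: a named list (or one-row data frame) holding the symptom values
classify_by_tree <- function(x) {
  if (x$Polyuria == "No") {
    if (x$Gender == "Female") {
      # nodes 4 and 5
      if (x$Alopecia == "No") "Positive" else "Negative"
    } else {
      # nodes 8 and 9 both predict Negative, so Irritability does not
      # change the predicted class; node 10 predicts Positive
      if (x$Polydipsia == "No") "Negative" else "Positive"
    }
  } else {
    if (x$Polydipsia == "Yes") {
      "Positive"                                   # node 17
    } else if (x$delayed.healing == "No") {
      "Positive"                                   # node 13
    } else {
      # nodes 15 and 16
      if (x$Alopecia == "No") "Positive" else "Negative"
    }
  }
}

# Hypothetical patient: male, no Polyuria, no Polydipsia
patient <- list(Polyuria = "No", Gender = "Male", Polydipsia = "No",
                Alopecia = "No", delayed.healing = "No")
classify_by_tree(patient)
# "Negative"
```

Writing the rules out this way makes one detail visible: the Irritability split separates a low-error Negative node (4.1%) from a higher-error one (30.8%), but both sides predict the same class.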
Islam M.M.F., Ferdousi R., Rahman S., Bushra H.Y. (2020) Likelihood Prediction of Diabetes at Early Stage Using Data Mining Techniques. In: Gupta M., Konar D., Bhattacharyya S., Biswas S. (eds) Computer Vision and Machine Intelligence in Medical Image Analysis. Advances in Intelligent Systems and Computing, vol 992. Springer, Singapore. https://doi.org/10.1007/978-981-13-8798-2_12