Classification is a supervised machine learning method in which the
model tries to predict the correct label from the input data. In short,
classification is a form of “pattern recognition”: a classification
algorithm applied to training data finds similar patterns (similar
sequences of numbers, words, or sentiments, and the like) in future data
sets. There are four types of classification tasks in machine learning
(Banoula, Mayank. 2023):
> - Binary Classification
> - Multi-Class Classification
> - Multi-Label Classification
> - Imbalanced Classification
Here are some common classification models:
> - Logistic Regression
> - Naive Bayes
> - K-Nearest Neighbor
> - Decision Tree
> - Support Vector Machine
> - Neural Network
(mathworks.com)
In this report, we will use the Naive Bayes, Decision Tree, and Random Forest models. In the previous report, Classification I, we already discussed Logistic Regression and K-Nearest Neighbor.
Diabetes (high blood sugar) is a chronic (long-term) disease you must be
aware of. The main sign of this disease is an increase in blood sugar
(glucose) levels above normal values. This condition occurs when the
sufferer’s body can no longer take sugar (glucose) into cells and use it
as energy, which ultimately results in the accumulation of extra sugar
in the bloodstream. Uncontrolled diabetes can have serious consequences,
causing damage to various organs and tissues.
The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general, along with their diabetes diagnosis. The 21 features consist of demographics, lab test results, and answers to survey questions for each patient. The original target variable records whether a patient has diabetes, is pre-diabetic, or is healthy; in the version used here it has been collapsed to a binary label. This data set is available on the UC Irvine ML Repository.
library(dplyr)        # data wrangling
library(e1071)        # naiveBayes()
library(caret)        # upSample(), confusionMatrix(), train()
library(ROCR)         # prediction() and performance() for ROC/AUC
library(partykit)     # ctree() decision trees
library(randomForest) # random forest engine used by caret
This data set will be imported and saved as diab.
diab <- read.csv("diab_new.csv")
rmarkdown::paged_table(diab)
We don’t need the X column in our analysis, so we will take it out.
diab <- diab %>% select(-X)
Rename the Diabetes_012 column to Diabetes to make it more intuitive.
diab <- rename(diab, Diabetes = Diabetes_012)
Data Columns
- Diabetes: 0 = no diabetes; 1 = prediabetes or diabetes
- HighBP: 0 = no high blood pressure; 1 = high blood pressure
- HighChol: 0 = no high cholesterol; 1 = high cholesterol
- CholCheck: 0 = no cholesterol check in the past 5 years; 1 = cholesterol check in the past 5 years
- BMI: Body Mass Index
- Smoker: have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no; 1 = yes
- Stroke: (ever told) you had a stroke. 0 = no; 1 = yes
- HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI). 0 = no; 1 = yes
- PhysActivity: physical activity in the past 30 days, not including job. 0 = no; 1 = yes
- Fruits: consume fruit 1 or more times per day. 0 = no; 1 = yes
- Veggies: consume vegetables 1 or more times per day. 0 = no; 1 = yes
- HvyAlcoholConsump: heavy drinker (adult men having more than 14 drinks per week, adult women having more than 7 drinks per week). 0 = no; 1 = yes
- AnyHealthcare: have any kind of health care coverage, including health insurance or prepaid plans such as an HMO. 0 = no; 1 = yes
- NoDocbcCost: was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no; 1 = yes
- GenHlth: would you say that in general your health is: scale 1-5. 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor
- MentHlth: now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? scale 0-30 days
- PhysHlth: now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? scale 0-30 days
- DiffWalk: do you have serious difficulty walking or climbing stairs? 0 = no; 1 = yes
- Sex: 0 = female; 1 = male
- Age: 13-level age category (_AGEG5YR, see codebook). 1 = 18-24; 9 = 60-64; 13 = 80 or older
- Education: education level (EDUCA, see codebook), scale 1-6. 1 = never attended school or only kindergarten; 2 = grades 1 through 8 (elementary); 3 = grades 9 through 11 (some high school); 4 = grade 12 or GED (high school graduate); 5 = college 1 to 3 years (some college or technical school); 6 = college 4 years or more (college graduate)
- Income: income scale (INCOME2, see codebook), scale 1-8. 1 = less than $10,000; 5 = less than $35,000; 8 = $75,000 or more
We should check whether there are any duplicate rows and missing values.
sum(duplicated(diab))
## [1] 0
colSums(is.na(x = diab))
## Diabetes HighBP HighChol
## 0 0 0
## CholCheck BMI Smoker
## 0 0 0
## Stroke HeartDiseaseorAttack PhysActivity
## 0 0 0
## Fruits Veggies HvyAlcoholConsump
## 0 0 0
## AnyHealthcare NoDocbcCost GenHlth
## 0 0 0
## MentHlth PhysHlth DiffWalk
## 0 0 0
## Sex Age Education
## 0 0 0
## Income
## 0
Check the data type and convert some columns into factor data type.
glimpse(diab)
## Rows: 700
## Columns: 22
## $ Diabetes <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1…
## $ HighChol <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1…
## $ CholCheck <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ BMI <int> 40, 25, 28, 27, 24, 25, 30, 25, 24, 34, 26, 33, 3…
## $ Smoker <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Stroke <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ HeartDiseaseorAttack <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ PhysActivity <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0…
## $ Fruits <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1…
## $ Veggies <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0…
## $ HvyAlcoholConsump <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ AnyHealthcare <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ GenHlth <int> 5, 3, 5, 2, 2, 2, 3, 3, 2, 3, 3, 4, 2, 3, 2, 2, 3…
## $ MentHlth <int> 18, 0, 30, 0, 3, 0, 0, 0, 0, 0, 0, 30, 5, 0, 15, …
## $ PhysHlth <int> 15, 0, 30, 0, 0, 2, 14, 0, 0, 30, 15, 28, 0, 0, 0…
## $ DiffWalk <int> 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ Sex <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Age <int> 9, 7, 9, 11, 11, 10, 9, 11, 8, 10, 7, 4, 6, 10, 2…
## $ Education <int> 4, 6, 4, 3, 5, 6, 6, 4, 4, 5, 5, 6, 6, 4, 6, 6, 4…
## $ Income <int> 3, 1, 8, 6, 4, 8, 7, 4, 3, 1, 7, 2, 8, 3, 7, 8, 4…
diab <- diab %>%
mutate_at(.vars = c("Diabetes","HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth", "DiffWalk", "Sex", "Age", "Education", "Income"),
.funs = as.factor) %>%
glimpse()
## Rows: 700
## Columns: 22
## $ Diabetes <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1…
## $ HighChol <fct> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1…
## $ CholCheck <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ BMI <int> 40, 25, 28, 27, 24, 25, 30, 25, 24, 34, 26, 33, 3…
## $ Smoker <fct> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Stroke <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ PhysActivity <fct> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0…
## $ Fruits <fct> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1…
## $ Veggies <fct> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0…
## $ HvyAlcoholConsump <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ AnyHealthcare <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ GenHlth <fct> 5, 3, 5, 2, 2, 2, 3, 3, 2, 3, 3, 4, 2, 3, 2, 2, 3…
## $ MentHlth <int> 18, 0, 30, 0, 3, 0, 0, 0, 0, 0, 0, 30, 5, 0, 15, …
## $ PhysHlth <int> 15, 0, 30, 0, 0, 2, 14, 0, 0, 30, 15, 28, 0, 0, 0…
## $ DiffWalk <fct> 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ Sex <fct> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Age <fct> 9, 7, 9, 11, 11, 10, 9, 11, 8, 10, 7, 4, 6, 10, 2…
## $ Education <fct> 4, 6, 4, 3, 5, 6, 6, 4, 4, 5, 5, 6, 6, 4, 6, 6, 4…
## $ Income <fct> 3, 1, 8, 6, 4, 8, 7, 4, 3, 1, 7, 2, 8, 3, 7, 8, 4…
table(diab$Diabetes) %>%
prop.table()
##
## 0 1
## 0.7142857 0.2857143
Our data is imbalanced. Imbalanced data is a common challenge in machine learning where the distribution of classes in a data set is skewed, with one class significantly outnumbering the others. This phenomenon can lead to biased models and reduced performance (Shah & Mukhtar, 2023). So we will balance the data before we continue building the models.
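To see why imbalance matters, here is a minimal sketch: a trivial “model” that always predicts the majority class already reaches about 71.4% accuracy (the no-information rate) while detecting zero diabetes cases, so accuracy alone would reward a useless model.
# A majority-class "classifier": always predict 0 (no diabetes)
majority_pred <- factor(rep("0", nrow(diab)), levels = c("0", "1"))
mean(majority_pred == diab$Diabetes) # ~0.714 accuracy, but recall for class 1 is 0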
GGally::ggcorr(diab[,-12], hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
Based on the plot above, there are no numeric variables which have a
high correlation with one another.
The diabetes data set will be split into two sections: a training set
containing 80% of the data and a testing set containing the remaining
20%. We will use the sample() function in this process.
RNGkind(sample.kind = "Rounding")
set.seed(200)
split <- sample(nrow(diab), nrow(diab)*0.80)
diab_train <- diab[split, ]
diab_test <- diab[-split, ]
Then, we will balance the data using the upSample() function. Balancing
must be applied to the training data only, so the testing data keeps the
original class distribution.
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(200)
diab_train_up <- upSample(
x = diab_train %>% select(-Diabetes),
y = diab_train$Diabetes,
yname = "Diabetes"
)
head(diab_train_up)
## HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity
## 1 1 0 1 31 0 0 0 1
## 2 1 0 1 30 1 0 0 1
## 3 1 1 1 25 1 0 0 1
## 4 0 0 1 28 1 0 0 0
## 5 1 1 1 34 0 0 0 0
## 6 0 1 1 29 1 0 0 1
## Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth
## 1 1 1 0 1 0 4 0
## 2 1 1 0 1 0 2 0
## 3 1 1 0 1 0 3 0
## 4 0 0 0 1 0 2 0
## 5 0 1 0 1 0 3 10
## 6 0 0 0 1 0 2 0
## PhysHlth DiffWalk Sex Age Education Income Diabetes
## 1 30 0 1 10 6 7 0
## 2 0 1 1 11 6 8 0
## 3 2 0 1 9 4 1 0
## 4 0 0 1 9 5 6 0
## 5 0 1 0 12 5 5 0
## 6 1 0 1 5 5 8 0
#Check the proportion of class target
prop.table(table(diab_train_up$Diabetes))
##
## 0 1
## 0.5 0.5
Yeay, our data is now balanced, so it is ready for the next process.
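For reference, caret also provides downSample(), the mirror-image design choice that balances by discarding majority-class rows instead. We use upSample() here so that none of the 560 training observations are thrown away; a sketch of the alternative (not run):
# diab_train_down <- downSample(x = diab_train %>% select(-Diabetes),
#                               y = diab_train$Diabetes,
#                               yname = "Diabetes")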
We will create a Naive Bayes model with all predictors using the upsampled training data, with 1 as the value of Laplace smoothing.
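To see what Laplace smoothing does, consider a hypothetical factor level that never co-occurs with a class in the training data: without smoothing, its conditional probability would be exactly 0 and would zero out the whole posterior for any observation carrying that level. With laplace = 1, every level’s count is shifted by one (the 406 below is the per-class size of our upsampled training data; the zero count is a made-up illustration):
# e1071 computes P(level | class) = (count + laplace) / (class size + laplace * number of levels)
(0 + 1) / (406 + 1 * 2) # ~0.0025 instead of 0 for a two-level predictor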
diab.bayes <- naiveBayes(formula = Diabetes ~ .,
data = diab_train_up,
laplace = 1)
diab.bayes
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## 0 1
## 0.5 0.5
##
## Conditional probabilities:
## HighBP
## Y 0 1
## 0 0.4803922 0.5196078
## 1 0.4117647 0.5882353
##
## HighChol
## Y 0 1
## 0 0.5098039 0.4901961
## 1 0.4289216 0.5710784
##
## CholCheck
## Y 0 1
## 0 0.031862745 0.968137255
## 1 0.009803922 0.990196078
##
## BMI
## Y [,1] [,2]
## 0 28.14039 5.571754
## 1 29.90887 6.679918
##
## Smoker
## Y 0 1
## 0 0.5269608 0.4730392
## 1 0.6348039 0.3651961
##
## Stroke
## Y 0 1
## 0 0.95588235 0.04411765
## 1 0.98774510 0.01225490
##
## HeartDiseaseorAttack
## Y 0 1
## 0 0.90196078 0.09803922
## 1 0.88235294 0.11764706
##
## PhysActivity
## Y 0 1
## 0 0.3725490 0.6274510
## 1 0.4877451 0.5122549
##
## Fruits
## Y 0 1
## 0 0.4019608 0.5980392
## 1 0.4411765 0.5588235
##
## Veggies
## Y 0 1
## 0 0.2328431 0.7671569
## 1 0.1642157 0.8357843
##
## HvyAlcoholConsump
## Y 0 1
## 0 0.96323529 0.03676471
## 1 0.94607843 0.05392157
##
## AnyHealthcare
## Y 0 1
## 0 0.04656863 0.95343137
## 1 0.05882353 0.94117647
##
## NoDocbcCost
## Y 0 1
## 0 0.8823529 0.1176471
## 1 0.8725490 0.1274510
##
## GenHlth
## Y 1 2 3 4 5
## 0 0.13138686 0.31873479 0.33333333 0.13868613 0.07785888
## 1 0.08029197 0.14598540 0.34793187 0.36982968 0.05596107
##
## MentHlth
## Y [,1] [,2]
## 0 4.413793 8.745578
## 1 5.396552 10.302721
##
## PhysHlth
## Y [,1] [,2]
## 0 5.396552 9.750866
## 1 6.578818 10.818934
##
## DiffWalk
## Y 0 1
## 0 0.7696078 0.2303922
## 1 0.6985294 0.3014706
##
## Sex
## Y 0 1
## 0 0.6470588 0.3529412
## 1 0.6274510 0.3725490
##
## Age
## Y 1 2 3 4 5 6
## 0 0.009546539 0.016706444 0.021479714 0.042959427 0.052505967 0.057279236
## 1 0.011933174 0.019093079 0.040572792 0.062052506 0.059665871 0.093078759
## Age
## Y 7 8 9 10 11 12
## 0 0.119331742 0.116945107 0.150357995 0.140811456 0.102625298 0.081145585
## 1 0.093078759 0.100238663 0.136038186 0.088305489 0.155131265 0.057279236
## Age
## Y 13
## 0 0.088305489
## 1 0.083532220
##
## Education
## Y 1 2 3 4 5 6
## 0 0.004854369 0.019417476 0.065533981 0.296116505 0.276699029 0.337378641
## 1 0.002427184 0.087378641 0.140776699 0.257281553 0.223300971 0.288834951
##
## Income
## Y 1 2 3 4 5 6
## 0 0.06038647 0.07004831 0.10144928 0.09178744 0.08695652 0.17632850
## 1 0.18840580 0.11352657 0.20289855 0.14251208 0.10628019 0.09661836
## Income
## Y 7 8
## 0 0.15942029 0.25362319
## 1 0.07487923 0.07487923
Prediction using the Naive Bayes model:
diab_pred <- predict(diab.bayes,
diab_test,
type = "class")
diab_pred
## [1] 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## [38] 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1
## [75] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0
## [112] 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1
## Levels: 0 1
Confusion Matrix Evaluation
A confusion matrix is a machine learning performance evaluation tool that shows a classification model’s accuracy: it reports the number of true positives, true negatives, false positives, and false negatives. This matrix helps with misclassification detection, predictive accuracy improvement, and model performance analysis (Bhandari, Aniruddha. 2024).
confusionMatrix(data = diab_pred,
reference = diab_test$Diabetes,
positive = "1",
mode = "everything")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 62 20
## 1 32 26
##
## Accuracy : 0.6286
## 95% CI : (0.5429, 0.7087)
## No Information Rate : 0.6714
## P-Value [Acc > NIR] : 0.8783
##
## Kappa : 0.2108
##
## Mcnemar's Test P-Value : 0.1272
##
## Sensitivity : 0.5652
## Specificity : 0.6596
## Pos Pred Value : 0.4483
## Neg Pred Value : 0.7561
## Precision : 0.4483
## Recall : 0.5652
## F1 : 0.5000
## Prevalence : 0.3286
## Detection Rate : 0.1857
## Detection Prevalence : 0.4143
## Balanced Accuracy : 0.6124
##
## 'Positive' Class : 1
##
The Naive Bayes model has accuracy 62.86%, precision 44.83%, and recall
56.52%.
Accuracy is the ratio of true predictions (positive and negative) to the
total data. The diab.bayes model predicts 62.86% of all data correctly,
covering both patients correctly predicted to have diabetes and patients
correctly predicted to be healthy.
Precision (Pos Pred Value) is the ratio of true positive predictions to
the total number of positive predictions. Of the patients the
diab.bayes model predicts to have diabetes, 44.83% actually do.
Recall (sensitivity) is the ratio of true positive predictions to the
total number of actual positives. The diab.bayes model detects 56.52% of
the patients who actually have diabetes.
The evaluation of this model will focus on the recall value because a
False Positive is better than a False Negative here: it is better for a
patient to be predicted to have diabetes but actually be healthy than to
be predicted healthy but actually have diabetes.
ROC-AUC Evaluation
The ROC-AUC curve measures classification performance at different
threshold settings. ROC is a probability curve, whereas AUC is a metric
or degree of separability: it indicates the degree to which the model
can discriminate between classes. The higher the AUC, the better the
model performs at predicting 0 classes as 0 and 1 classes as 1.
Likewise, a higher AUC indicates a better ability to discriminate
between patients who have the condition and those who do not
(Narkhede, Sarang. 2018).
# prediction: take the probability value
pred_test_prob <- predict(diab.bayes,
diab_test,
type = "raw")
# take the probabilities of class "0" (first column); note that the
# positive class "1" is actually in the second column
pred_prob <- pred_test_prob[,1]
bayes_roc <- prediction(predictions = pred_prob,
labels = diab_test$Diabetes,
label.ordering = c("0",
"1"))
model_roc_vec <- performance(bayes_roc,
"tpr",
"fpr")
plot(model_roc_vec)
abline(0,1 , lty = 2)
# Calculate the AUC value
bayes_auc <- performance(bayes_roc, "auc")
bayes_auc@y.values[[1]]
## [1] 0.3126735
The AUC value of this model is 0.3127, far below 0.5, which means the score ordering is reversed: the model appears to reciprocate the classes, predicting the negative class as positive and vice versa. This is not a flaw in the model itself but in the score we supplied: we used the probability of class 0 (the first column) as the prediction score while declaring "1" as the positive label.
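A sketch of the fix: score with the second column of the probability matrix instead. Mirroring the scores mirrors the ROC curve, so the AUC should come out at approximately 1 - 0.3127 = 0.6873.
# take the probability of the positive class "1" (second column)
pred_prob_pos <- pred_test_prob[, 2]
bayes_roc_pos <- prediction(predictions = pred_prob_pos,
                            labels = diab_test$Diabetes,
                            label.ordering = c("0", "1"))
performance(bayes_roc_pos, "auc")@y.values[[1]] # ~0.6873
The same flipped-score issue recurs in the Decision Tree and Random Forest ROC sections below.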
Next, we build a Decision Tree model with ctree() from the partykit package, again using all predictors on the balanced training data.
diab_tree <- ctree(formula = Diabetes ~ .,
                   data = diab_train_up)
plot(diab_tree, type = "simple")
Prediction & Evaluation
# Prediction in training data
pred_train <- predict(diab_tree,
diab_train_up,
type = "response")
# Confusion matrix in training data
confusionMatrix(pred_train,
diab_train_up$Diabetes,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 322 145
## 1 84 261
##
## Accuracy : 0.718
## 95% CI : (0.6857, 0.7487)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.436
##
## Mcnemar's Test P-Value : 7.342e-05
##
## Sensitivity : 0.6429
## Specificity : 0.7931
## Pos Pred Value : 0.7565
## Neg Pred Value : 0.6895
## Prevalence : 0.5000
## Detection Rate : 0.3214
## Detection Prevalence : 0.4249
## Balanced Accuracy : 0.7180
##
## 'Positive' Class : 1
##
# Prediction in testing data
pred_test <- predict(diab_tree,
diab_test,
type = "response")
# Confusion matrix in testing data
confusionMatrix(pred_test,
diab_test$Diabetes,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 68 25
## 1 26 21
##
## Accuracy : 0.6357
## 95% CI : (0.5502, 0.7153)
## No Information Rate : 0.6714
## P-Value [Acc > NIR] : 0.8389
##
## Kappa : 0.1789
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.4565
## Specificity : 0.7234
## Pos Pred Value : 0.4468
## Neg Pred Value : 0.7312
## Prevalence : 0.3286
## Detection Rate : 0.1500
## Detection Prevalence : 0.3357
## Balanced Accuracy : 0.5900
##
## 'Positive' Class : 1
##
Check on training data: accuracy 71.8%, recall/sensitivity 64.29%,
precision/Pos Pred Value 75.65%.
Check on testing data: accuracy 63.57%, recall/sensitivity 45.65%,
precision/Pos Pred Value 44.68%.
These checks indicate that the model is overfitting: it performs well on
the training data but poorly on the testing data. The model above is
very complex and therefore at high risk of overfitting, so it needs to
be simplified by pruning, namely by limiting the branches the model can
create. The pruning method involves setting the mincriterion, minsplit,
and minbucket values.
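For orientation, these are the partykit defaults that the tuning below moves away from (values as documented in ?ctree_control):
ctree_control(mincriterion = 0.95, # 1 - alpha: the test criterion a split must exceed
              minsplit = 20,       # minimum node size required to attempt a split
              minbucket = 7)       # minimum number of observations in a terminal node
Lowering mincriterion permits more splits, while raising minsplit and minbucket forbids small, overly specific nodes; the tuning below combines both levers.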
The appropriate metric for health cases is sensitivity, or recall: the model must have a low FN (False Negative) count, i.e., few diabetic patients predicted as healthy. So the tuning process will focus on achieving the optimum sensitivity value.
Tuning is done by manually setting the mincriterion, minsplit, and
minbucket values to obtain the optimum evaluation result: the highest
recall value, with the evaluation results on the train and test data not
too far apart (the trials can be seen in the table below). In these
trials, the recall values on the train and test data came out the same
for several settings. The optimum mincriterion value was then sought
with the party package, which gave a mincriterion of 0.198; this value
was used in the subsequent trials until the optimum recall value was
obtained.
knitr::include_graphics("decision-tree-tuning-trial.png")
From the trials, we can conclude that the best mincriterion, minsplit,
and minbucket values are 0.198, 100, and 50, respectively. Here are the
steps of the tuning trial:
1. Adjust mincriterion, minsplit, and minbucket not too far from the original model to make a tuned model. We found the same recall value on the train and test data across various values of mincriterion, minsplit, and minbucket.
2. Find the optimum mincriterion value using the party package.
3. Use the mincriterion from step 2 to make a tuned model.
4. Compare the results of step 3 and step 1.
Adjust mincriterion, minsplit, and minbucket, and make the tuned model.
diab_tree_tune <- ctree(formula = Diabetes~ .,
data = diab_train_up,
control = ctree_control(mincriterion = 0.198,
minsplit = 100,
minbucket = 50))
plot(diab_tree_tune, type = "simple")
# Prediction in training data
pred_train_tune <- predict(diab_tree_tune,
diab_train_up,
type = "response")
# Confusion matrix in training data
confusionMatrix(pred_train_tune,
diab_train_up$Diabetes,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 305 130
## 1 101 276
##
## Accuracy : 0.7155
## 95% CI : (0.6831, 0.7463)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.431
##
## Mcnemar's Test P-Value : 0.06544
##
## Sensitivity : 0.6798
## Specificity : 0.7512
## Pos Pred Value : 0.7321
## Neg Pred Value : 0.7011
## Prevalence : 0.5000
## Detection Rate : 0.3399
## Detection Prevalence : 0.4643
## Balanced Accuracy : 0.7155
##
## 'Positive' Class : 1
##
# Prediction in testing data
pred_test_tune <- predict(diab_tree_tune,
diab_test,
type = "response")
# Confusion matrix in testing data
confusionMatrix(pred_test_tune,
diab_test$Diabetes,
positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 66 20
## 1 28 26
##
## Accuracy : 0.6571
## 95% CI : (0.5723, 0.7352)
## No Information Rate : 0.6714
## P-Value [Acc > NIR] : 0.6765
##
## Kappa : 0.256
##
## Mcnemar's Test P-Value : 0.3123
##
## Sensitivity : 0.5652
## Specificity : 0.7021
## Pos Pred Value : 0.4815
## Neg Pred Value : 0.7674
## Prevalence : 0.3286
## Detection Rate : 0.1857
## Detection Prevalence : 0.3857
## Balanced Accuracy : 0.6337
##
## 'Positive' Class : 1
##
Find the optimum mincriterion value using the party package (through
caret’s train()), then use it in the tuning trial. The search below is
commented out because it takes a long time to run; its key console
output follows.
#classifier = train(form = Diabetes ~ .,
# data = diab_train_up,
# method = 'ctree',
# tuneGrid = data.frame(mincriterion = seq(0.01,0.99,length.out = 100)),
# trControl = trainControl(method = "boot",
# summaryFunction = defaultSummary,
# verboseIter = TRUE))
## Aggregating results
## Selecting tuning parameters
## Fitting mincriterion = 0.198 on full training set
mincriterion : 0.198
minsplit : 100
minbucket : 50
A : Recall/Sensitivity in training data 67.98%
B : Recall/Sensitivity testing data 56.52%
The difference between A and B: 11.46%
After we tune the model by changing the mincriterion, minsplit, and minbucket values, we find that the optimum recall values on the training and test data are 67.98% and 56.52%, respectively. The difference between the training and testing metrics is not large, so the effect of overfitting is reduced.
ROC-AUC Evaluation
# prediction: take the probability value
predict_tree <- predict(diab_tree_tune,
diab_test,
type = "prob")
head(predict_tree)
## 0 1
## 8 0.7000000 0.3000000
## 9 0.5578947 0.4421053
## 20 0.7200000 0.2800000
## 23 0.3076923 0.6923077
## 25 0.2037037 0.7962963
## 27 0.7000000 0.3000000
# take the probabilities of class "0" (first column); as before, the
# positive class "1" is in the second column
pred_prob_tree <- predict_tree[,1]
tree_roc <- prediction(predictions = pred_prob_tree,
labels = diab_test$Diabetes,
label.ordering = c("0",
"1"))
model_roc_tree <- performance(tree_roc,
"tpr",
"fpr")
plot(model_roc_tree)
abline(0,1 , lty = 2)
# Calculate the AUC value
tree_auc <- performance(tree_roc, "auc")
tree_auc@y.values[[1]]
## [1] 0.318802
After we tune the model by changing the mincriterion, minsplit, and
minbucket values, the AUC value is 0.3188. As with the Naive Bayes
model, this is below 0.5 because the probability of class 0 was used as
the prediction score; scored with the positive-class probability
(predict_tree[, 2]) instead, the curve mirrors and the AUC becomes
1 - 0.3188 = 0.6812.
Because building a random forest model takes a lot of computing time, we save the model as an RDS file and load it when we want to use it. Here is the syntax for building the random forest model:
#train_ctrl <- trainControl(method = "repeatedcv",
# number = 5,
# repeats = 3)
#diab_forest <- train(Diabetes ~ .,
# data = diab_train_up,
# method = "rf",
# trControl = train_ctrl)
#saveRDS(diab_forest, "diab_forest.RDS")
Load the random forest model.
diab_forest <- readRDS("diab_forest.RDS")
diab_forest
## Random Forest
##
## 812 samples
## 21 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 650, 649, 650, 649, 650, 650, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8083253 0.6166501
## 23 0.8641589 0.7283188
## 45 0.8575820 0.7151636
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 23.
The diab_forest model uses 812 samples, 21 predictors, and a 2-class
target. The optimal number of variables considered for splitting at each
tree node (mtry) is 23.
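The 45 in the tuning grid comes from dummy-variable expansion: caret’s formula interface converts the 21 predictors (several of them multi-level factors) into 45 model-matrix columns, and its default grid for rf spans mtry = 2 up to that maximum. A quick sanity check, assuming the training objects above:
# number of model-matrix columns, excluding the intercept
ncol(model.matrix(Diabetes ~ ., data = diab_train_up)) - 1 # 45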
Out-of-Bag Score
Because a random forest already provides out-of-bag (OOB) estimates,
which serve as a trustworthy estimate of the accuracy on unseen samples,
we are not strictly required to divide our data set into train and test
sets when using random forest. However, a typical train-test validation
can also be executed. In our case, the OOB estimate was constructed from
the upsampled training data (diab_train_up), as seen in the summary
below.
diab_forest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 23
##
## OOB estimate of error rate: 9.36%
## Confusion matrix:
## 0 1 class.error
## 0 346 60 0.14778325
## 1 16 390 0.03940887
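The OOB numbers in this summary can also be pulled out programmatically: err.rate stores the cumulative OOB and per-class error for forests of 1 to 500 trees, and its last row matches the printed estimates (a sketch, assuming diab_forest is loaded):
# last row = error rates of the full 500-tree forest: OOB, class "0", class "1"
tail(diab_forest$finalModel$err.rate, 1)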
plot(diab_forest$finalModel)
legend("topright", colnames(diab_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)
Our model has an error rate of 3.94% when it predicts the positive class
and 14.78% when it predicts the negative class. The OOB estimate of the
error rate is 9.36%, which means the model accuracy on OOB data is
90.64%.
Variable Importance
Although random forests are often said to be uninterpretable models, we can see which predictors are most important in forming the random forest.
varImp(diab_forest) %>% plot()
From the graphic above, the top 5 most important variables are BMI,
PhysHlth, GenHlth4, MentHlth, and Income8 (the numeric suffixes denote
dummy-encoded factor levels).
Confusion Matrix Evaluation
#Prediction
predict_forest <- predict(diab_forest, diab_test)
#Confusion matrix
(conf_matrix_forestI <- table(predict_forest, diab_test$Diabetes))
##
## predict_forest 0 1
## 0 79 33
## 1 15 13
confusionMatrix(conf_matrix_forestI)
## Confusion Matrix and Statistics
##
##
## predict_forest 0 1
## 0 79 33
## 1 15 13
##
## Accuracy : 0.6571
## 95% CI : (0.5723, 0.7352)
## No Information Rate : 0.6714
## P-Value [Acc > NIR] : 0.67647
##
## Kappa : 0.1367
##
## Mcnemar's Test P-Value : 0.01414
##
## Sensitivity : 0.8404
## Specificity : 0.2826
## Pos Pred Value : 0.7054
## Neg Pred Value : 0.4643
## Prevalence : 0.6714
## Detection Rate : 0.5643
## Detection Prevalence : 0.8000
## Balanced Accuracy : 0.5615
##
## 'Positive' Class : 0
##
diab_forest gives accuracy 65.71%, recall/sensitivity 84.04%, and
precision 70.54%. Note, however, that positive = "1" was not specified
in this confusionMatrix() call, so it defaulted to "0" as the positive
class (see 'Positive' Class : 0 above): the 84.04% sensitivity measures
how well healthy patients are identified, not diabetic ones.
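To make the random forest comparable with the other models, the evaluation can be repeated with "1" declared as the positive class; the equivalent values can also be derived by hand from the printed matrix (TP = 13, FN = 33, FP = 15):
confusionMatrix(data = predict_forest,
                reference = diab_test$Diabetes,
                positive = "1")
13 / (13 + 33) # recall for the diabetes class = 0.2826
13 / (13 + 15) # precision = 0.4643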
ROC-AUC Evaluation
# prediction: take the probability value
predict_forest1 <- predict(diab_forest,
diab_test,
type = "prob")
head(predict_forest1)
## 0 1
## 8 0.806 0.194
## 9 0.446 0.554
## 20 0.868 0.132
## 23 0.442 0.558
## 25 0.448 0.552
## 27 0.718 0.282
# take the probabilities of class "0" (first column); as before, the
# positive class "1" is in the second column
pred_prob_rf <- predict_forest1[,1]
rf_roc <- prediction(predictions = pred_prob_rf,
labels = diab_test$Diabetes,
label.ordering = c("0",
"1"))
model_roc_rf <- performance(rf_roc,
"tpr",
"fpr")
plot(model_roc_rf)
abline(0,1 , lty = 2)
# Calculate the AUC value
rf_auc <- performance(rf_roc, "auc")
rf_auc@y.values[[1]]
## [1] 0.3619334
The AUC value of the diab_forest model is 0.3619. Again, this is below
0.5 because the probability of class 0 (predict_forest1[, 1]) was used
as the score; with the positive-class probability, the AUC would be
1 - 0.3619 = 0.6381.
Conclusion
conclusion <- data.frame(matrix(c(62.86, 65.71, 65.71, 56.52, 56.52, 84.04, 44.83, 48.15, 70.54, 0.3127, 0.3188, 0.3619),
                                nrow = 3,
                                dimnames = list(c("Naive Bayes", "Decision Tree", "Random Forest"),
                                                c("Accuracy", "Recall", "Precision", "AUC"))))
conclusion
##               Accuracy Recall Precision    AUC
## Naive Bayes      62.86  56.52     44.83 0.3127
## Decision Tree    65.71  56.52     48.15 0.3188
## Random Forest    65.71  84.04     70.54 0.3619
Based on the data above, and since our concern is to find the model with the highest recall value, we would choose the random forest model to predict diabetic patients. Two caveats are worth stating, though. First, the random forest metrics were computed with "0" as the positive class (the confusionMatrix() default when positive = "1" is not supplied), so its recall of 84.04% measures detection of healthy patients; recomputed with "1" as the positive class it drops to 28.26%, and the Naive Bayes and tuned Decision Tree models (both 56.52%) detect diabetic patients better. Second, the AUC values are all below 0.5 only because the probability of class 0 was used as the prediction score; scored with the positive-class probability they become 0.6873, 0.6812, and 0.6381, respectively.
- Bhandari, Aniruddha. 2024. https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/
- Banoula, Mayank. 2023. https://www.simplilearn.com/tutorials/machine-learning-tutorial/classification-in-machine-learning
- Narkhede, Sarang. 2018. https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
- Shah, Mukhtar. 2023. Imbalanced Data in Machine Learning: A Comprehensive Review. DOI: 10.13140/RG.2.2.18456.98564
- UC Irvine Machine Learning Repository. CDC Diabetes Health Indicators. https://archive.ics.uci.edu/dataset/891/cdc+diabetes+health+indicators
- MathWorks. Choosing the Best Machine Learning Classification Model and Avoiding Overfitting. https://www.mathworks.com/campaigns/offers/next/choosing-the-best-machine-learning-classification-model-and-avoiding-overfitting.html