Introduction

Classification II

Classification is a supervised machine learning method in which the model tries to predict the correct label from the input data. In short, classification is a form of “pattern recognition”: a classification algorithm fitted to training data learns patterns (similar sequences of numbers, words, or sentiments, and the like) and applies them to future data sets. There are four types of classification tasks in machine learning: binary classification, multi-class classification, multi-label classification, and imbalanced classification (Banoula, 2023).

Some common classification models are:
> - Logistic Regression
> - Naive Bayes
> - K-Nearest Neighbor
> - Decision Tree
> - Support Vector Machine
> - Neural Network
(mathworks.com)

In this report, we will use the Naive Bayes, Decision Tree, and Random Forest models. In the previous report, Classification I, we already discussed Logistic Regression and K-Nearest Neighbors.

Diabetes Case

Diabetes (high blood sugar) is a chronic (long-term) disease you must be aware of. The main sign of this disease is a blood sugar (glucose) level that rises above normal values.

This condition occurs when the sufferer’s body can no longer move sugar (glucose) into cells and use it as energy, so extra sugar accumulates in the bloodstream. Uncontrolled diabetes can have serious consequences, causing damage to various organs and tissues.

The Diabetes Health Indicators Dataset contains healthcare statistics and lifestyle survey information about people in general along with their diagnosis of diabetes. The 35 features consist of some demographics, lab test results, and answers to survey questions for each patient. The target variable for classification is whether a patient has diabetes, is pre-diabetic, or is healthy. This data set is available on the UC Irvine ML Repository.

Data Preparation

Importing Libraries

library(dplyr)        # data wrangling
library(e1071)        # naiveBayes()
library(caret)        # upSample(), confusionMatrix(), train()
library(ROCR)         # prediction() and performance() for ROC/AUC
library(partykit)     # ctree() decision trees
library(randomForest) # random forest engine used by caret

Importing Dataset

The data set will be imported and saved as diab.

diab <- read.csv("diab_new.csv")
rmarkdown::paged_table(diab)

We don’t need the X column in our analysis, so we will remove it.

diab <- diab %>% select(-X)

Rename the Diabetes_012 column to Diabetes to make it more intuitive.

diab <- rename(diab, Diabetes = Diabetes_012)

Data Columns
Diabetes: 0 = no diabetes; 1 = prediabetes or diabetes
HighBP: 0 = no high BP; 1 = high BP
HighChol: 0 = no high cholesterol; 1 = high cholesterol
CholCheck: 0 = no cholesterol check in 5 years; 1 = cholesterol check in 5 years
BMI: Body Mass Index
Smoker: Have you smoked at least 100 cigarettes in your entire life? [Note: 5 packs = 100 cigarettes] 0 = no; 1 = yes
Stroke: (Ever told) you had a stroke. 0 = no; 1 = yes
HeartDiseaseorAttack: coronary heart disease (CHD) or myocardial infarction (MI). 0 = no; 1 = yes
PhysActivity: physical activity in past 30 days, not including job. 0 = no; 1 = yes
Fruits: Consume fruit 1 or more times per day. 0 = no; 1 = yes
Veggies: Consume vegetables 1 or more times per day. 0 = no; 1 = yes
HvyAlcoholConsump: Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week). 0 = no; 1 = yes
AnyHealthcare: Have any kind of health care coverage, including health insurance, prepaid plans such as HMO, etc. 0 = no; 1 = yes
NoDocbcCost: Was there a time in the past 12 months when you needed to see a doctor but could not because of cost? 0 = no; 1 = yes
GenHlth: Would you say that in general your health is: scale 1-5. 1 = excellent; 2 = very good; 3 = good; 4 = fair; 5 = poor
MentHlth: Now thinking about your mental health, which includes stress, depression, and problems with emotions, for how many days during the past 30 days was your mental health not good? Scale 1-30 days
PhysHlth: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good? Scale 1-30 days
DiffWalk: Do you have serious difficulty walking or climbing stairs? 0 = no; 1 = yes
Sex: 0 = female; 1 = male
Age: 13-level age category (_AGEG5YR, see codebook). 1 = 18-24; 9 = 60-64; 13 = 80 or older
Education: Education level (EDUCA, see codebook), scale 1-6. 1 = Never attended school or only kindergarten; 2 = Grades 1 through 8 (Elementary); 3 = Grades 9 through 11 (Some high school); 4 = Grade 12 or GED (High school graduate); 5 = College 1 year to 3 years (Some college or technical school); 6 = College 4 years or more (College graduate)
Income: Income scale (INCOME2, see codebook), scale 1-8. 1 = less than $10,000; 5 = less than $35,000; 8 = $75,000 or more
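
For readability in plots and summaries, the coded levels could optionally be given text labels. A minimal sketch (purely illustrative; the rest of the report keeps the numeric codes):

# optional: relabel the target's 0/1 codes for nicer output
diab_labeled <- diab %>%
  mutate(Diabetes = factor(Diabetes, levels = c(0, 1),
                           labels = c("healthy", "prediabetes/diabetes")))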

Data Cleaning

We should check whether there are any duplicate rows and missing values.

Duplicates

sum(duplicated(diab))
## [1] 0

Missing Values

colSums(is.na(x = diab))
##             Diabetes               HighBP             HighChol 
##                    0                    0                    0 
##            CholCheck                  BMI               Smoker 
##                    0                    0                    0 
##               Stroke HeartDiseaseorAttack         PhysActivity 
##                    0                    0                    0 
##               Fruits              Veggies    HvyAlcoholConsump 
##                    0                    0                    0 
##        AnyHealthcare          NoDocbcCost              GenHlth 
##                    0                    0                    0 
##             MentHlth             PhysHlth             DiffWalk 
##                    0                    0                    0 
##                  Sex                  Age            Education 
##                    0                    0                    0 
##               Income 
##                    0
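
No duplicate rows and no missing values were found, so no imputation is needed. Had any NAs appeared, one simple remedy (a sketch only; it is not needed for this data) would be to keep complete rows:

# keep only rows with no missing values (illustrative; our data has none)
diab_complete <- diab[complete.cases(diab), ]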

Exploratory Data Analysis

Data Type

Check the data types and convert some columns to the factor data type.

glimpse(diab)
## Rows: 700
## Columns: 22
## $ Diabetes             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP               <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1…
## $ HighChol             <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1…
## $ CholCheck            <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ BMI                  <int> 40, 25, 28, 27, 24, 25, 30, 25, 24, 34, 26, 33, 3…
## $ Smoker               <int> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Stroke               <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ HeartDiseaseorAttack <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ PhysActivity         <int> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0…
## $ Fruits               <int> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1…
## $ Veggies              <int> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0…
## $ HvyAlcoholConsump    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ AnyHealthcare        <int> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost          <int> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ GenHlth              <int> 5, 3, 5, 2, 2, 2, 3, 3, 2, 3, 3, 4, 2, 3, 2, 2, 3…
## $ MentHlth             <int> 18, 0, 30, 0, 3, 0, 0, 0, 0, 0, 0, 30, 5, 0, 15, …
## $ PhysHlth             <int> 15, 0, 30, 0, 0, 2, 14, 0, 0, 30, 15, 28, 0, 0, 0…
## $ DiffWalk             <int> 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ Sex                  <int> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Age                  <int> 9, 7, 9, 11, 11, 10, 9, 11, 8, 10, 7, 4, 6, 10, 2…
## $ Education            <int> 4, 6, 4, 3, 5, 6, 6, 4, 4, 5, 5, 6, 6, 4, 6, 6, 4…
## $ Income               <int> 3, 1, 8, 6, 4, 8, 7, 4, 3, 1, 7, 2, 8, 3, 7, 8, 4…
diab <- diab %>% 
  mutate_at(.vars = c("Diabetes","HighBP", "HighChol", "CholCheck", "Smoker", "Stroke", "HeartDiseaseorAttack", "PhysActivity", "Fruits", "Veggies", "HvyAlcoholConsump", "AnyHealthcare", "NoDocbcCost", "GenHlth", "DiffWalk", "Sex", "Age", "Education", "Income"),
            .funs = as.factor) %>% 
  glimpse()
## Rows: 700
## Columns: 22
## $ Diabetes             <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ HighBP               <fct> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1…
## $ HighChol             <fct> 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1…
## $ CholCheck            <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1…
## $ BMI                  <int> 40, 25, 28, 27, 24, 25, 30, 25, 24, 34, 26, 33, 3…
## $ Smoker               <fct> 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0…
## $ Stroke               <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1…
## $ HeartDiseaseorAttack <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1…
## $ PhysActivity         <fct> 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0…
## $ Fruits               <fct> 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1…
## $ Veggies              <fct> 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0…
## $ HvyAlcoholConsump    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0…
## $ AnyHealthcare        <fct> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ NoDocbcCost          <fct> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ GenHlth              <fct> 5, 3, 5, 2, 2, 2, 3, 3, 2, 3, 3, 4, 2, 3, 2, 2, 3…
## $ MentHlth             <int> 18, 0, 30, 0, 3, 0, 0, 0, 0, 0, 0, 30, 5, 0, 15, …
## $ PhysHlth             <int> 15, 0, 30, 0, 0, 2, 14, 0, 0, 30, 15, 28, 0, 0, 0…
## $ DiffWalk             <fct> 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ Sex                  <fct> 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0…
## $ Age                  <fct> 9, 7, 9, 11, 11, 10, 9, 11, 8, 10, 7, 4, 6, 10, 2…
## $ Education            <fct> 4, 6, 4, 3, 5, 6, 6, 4, 4, 5, 5, 6, 6, 4, 6, 6, 4…
## $ Income               <fct> 3, 1, 8, 6, 4, 8, 7, 4, 3, 1, 7, 2, 8, 3, 7, 8, 4…

Proportion of target class

table(diab$Diabetes) %>%
  prop.table()
## 
##         0         1 
## 0.7142857 0.2857143

Our data is imbalanced. Imbalanced data is a common challenge in machine learning: the distribution of classes in a data set is skewed, with one class significantly outnumbering the others, which can lead to biased models and reduced performance (Shah & Mukhtar, 2023). We will therefore balance the data before building the models.
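
To see why the imbalance matters, consider the majority-class baseline: a useless model that predicts “no diabetes” for everyone would already reach about 71% accuracy while catching zero diabetic patients. A quick check:

# accuracy of the "always predict 0" baseline: the majority-class proportion
mean(diab$Diabetes == "0")   # 0.7142857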

Correlation of each variable

GGally::ggcorr(diab[,-12], hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

Based on the plot above, no numeric variables are highly correlated with one another.

Preprocessing Data

Cross Validation

The diabetes data set will be split into two parts: a training set containing 80% of the rows and a testing set containing the remaining 20%. We will use the sample() function in this process.

RNGkind(sample.kind = "Rounding") 
set.seed(200)

split <- sample(nrow(diab), nrow(diab)*0.80)
diab_train <- diab[split, ] 
diab_test <- diab[-split, ] 

Then, we will balance the data using the upSample() function. The balancing process must only be applied to the training set; the testing set keeps its original class distribution so that the evaluation stays honest.

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(200)

diab_train_up <- upSample(
  x = diab_train %>% select(-Diabetes),
  y = diab_train$Diabetes,
  yname = "Diabetes"
)

head(diab_train_up)
##   HighBP HighChol CholCheck BMI Smoker Stroke HeartDiseaseorAttack PhysActivity
## 1      1        0         1  31      0      0                    0            1
## 2      1        0         1  30      1      0                    0            1
## 3      1        1         1  25      1      0                    0            1
## 4      0        0         1  28      1      0                    0            0
## 5      1        1         1  34      0      0                    0            0
## 6      0        1         1  29      1      0                    0            1
##   Fruits Veggies HvyAlcoholConsump AnyHealthcare NoDocbcCost GenHlth MentHlth
## 1      1       1                 0             1           0       4        0
## 2      1       1                 0             1           0       2        0
## 3      1       1                 0             1           0       3        0
## 4      0       0                 0             1           0       2        0
## 5      0       1                 0             1           0       3       10
## 6      0       0                 0             1           0       2        0
##   PhysHlth DiffWalk Sex Age Education Income Diabetes
## 1       30        0   1  10         6      7        0
## 2        0        1   1  11         6      8        0
## 3        2        0   1   9         4      1        0
## 4        0        0   1   9         5      6        0
## 5        0        1   0  12         5      5        0
## 6        1        0   1   5         5      8        0
#Check the proportion of class target
prop.table(table(diab_train_up$Diabetes))
## 
##   0   1 
## 0.5 0.5

Our data is now balanced, so it is ready for the next step.

Modeling: Naive Bayes

Build the model

We will create a Naive Bayes model with all predictors, using the training data from cross-validation and 1 as the value of Laplace smoothing.
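
Laplace smoothing matters because a factor level that never co-occurs with a class in the training data would otherwise receive a conditional probability of exactly 0, zeroing out the whole posterior product. A minimal sketch on a hypothetical toy data frame (the names here are illustrative, not part of the report’s data):

# toy example: level "b" never occurs together with class "0"
toy <- data.frame(y = factor(c("0", "0", "1", "1")),
                  x = factor(c("a", "a", "a", "b")))
naiveBayes(y ~ x, data = toy, laplace = 0)$tables$x  # P(x = "b" | y = "0") is exactly 0
naiveBayes(y ~ x, data = toy, laplace = 1)$tables$x  # smoothed to a small positive value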

diab.bayes <- naiveBayes(formula = Diabetes ~ .,
                          data = diab_train_up,
                          laplace = 1)
diab.bayes
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    HighBP
## Y           0         1
##   0 0.4803922 0.5196078
##   1 0.4117647 0.5882353
## 
##    HighChol
## Y           0         1
##   0 0.5098039 0.4901961
##   1 0.4289216 0.5710784
## 
##    CholCheck
## Y             0           1
##   0 0.031862745 0.968137255
##   1 0.009803922 0.990196078
## 
##    BMI
## Y       [,1]     [,2]
##   0 28.14039 5.571754
##   1 29.90887 6.679918
## 
##    Smoker
## Y           0         1
##   0 0.5269608 0.4730392
##   1 0.6348039 0.3651961
## 
##    Stroke
## Y            0          1
##   0 0.95588235 0.04411765
##   1 0.98774510 0.01225490
## 
##    HeartDiseaseorAttack
## Y            0          1
##   0 0.90196078 0.09803922
##   1 0.88235294 0.11764706
## 
##    PhysActivity
## Y           0         1
##   0 0.3725490 0.6274510
##   1 0.4877451 0.5122549
## 
##    Fruits
## Y           0         1
##   0 0.4019608 0.5980392
##   1 0.4411765 0.5588235
## 
##    Veggies
## Y           0         1
##   0 0.2328431 0.7671569
##   1 0.1642157 0.8357843
## 
##    HvyAlcoholConsump
## Y            0          1
##   0 0.96323529 0.03676471
##   1 0.94607843 0.05392157
## 
##    AnyHealthcare
## Y            0          1
##   0 0.04656863 0.95343137
##   1 0.05882353 0.94117647
## 
##    NoDocbcCost
## Y           0         1
##   0 0.8823529 0.1176471
##   1 0.8725490 0.1274510
## 
##    GenHlth
## Y            1          2          3          4          5
##   0 0.13138686 0.31873479 0.33333333 0.13868613 0.07785888
##   1 0.08029197 0.14598540 0.34793187 0.36982968 0.05596107
## 
##    MentHlth
## Y       [,1]      [,2]
##   0 4.413793  8.745578
##   1 5.396552 10.302721
## 
##    PhysHlth
## Y       [,1]      [,2]
##   0 5.396552  9.750866
##   1 6.578818 10.818934
## 
##    DiffWalk
## Y           0         1
##   0 0.7696078 0.2303922
##   1 0.6985294 0.3014706
## 
##    Sex
## Y           0         1
##   0 0.6470588 0.3529412
##   1 0.6274510 0.3725490
## 
##    Age
## Y             1           2           3           4           5           6
##   0 0.009546539 0.016706444 0.021479714 0.042959427 0.052505967 0.057279236
##   1 0.011933174 0.019093079 0.040572792 0.062052506 0.059665871 0.093078759
##    Age
## Y             7           8           9          10          11          12
##   0 0.119331742 0.116945107 0.150357995 0.140811456 0.102625298 0.081145585
##   1 0.093078759 0.100238663 0.136038186 0.088305489 0.155131265 0.057279236
##    Age
## Y            13
##   0 0.088305489
##   1 0.083532220
## 
##    Education
## Y             1           2           3           4           5           6
##   0 0.004854369 0.019417476 0.065533981 0.296116505 0.276699029 0.337378641
##   1 0.002427184 0.087378641 0.140776699 0.257281553 0.223300971 0.288834951
## 
##    Income
## Y            1          2          3          4          5          6
##   0 0.06038647 0.07004831 0.10144928 0.09178744 0.08695652 0.17632850
##   1 0.18840580 0.11352657 0.20289855 0.14251208 0.10628019 0.09661836
##    Income
## Y            7          8
##   0 0.15942029 0.25362319
##   1 0.07487923 0.07487923

Prediction

Prediction using the Naive Bayes model:

diab_pred <- predict(diab.bayes,
                      diab_test,
                      type = "class")
diab_pred
##   [1] 1 0 0 1 1 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [38] 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 1 0 1 1
##  [75] 0 0 1 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 1 0
## [112] 0 1 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 0 1 1 0 1 1 1 1 0 1
## Levels: 0 1

Evaluation Model

Confusion Matrix Evaluation
A confusion matrix is a machine learning performance evaluation tool that shows a classification model’s accuracy. It reports the numbers of true positives, true negatives, false positives, and false negatives, which helps with detecting misclassifications, improving predictive accuracy, and analyzing model performance (Bhandari, 2024).

confusionMatrix(data = diab_pred,
                reference = diab_test$Diabetes,
                positive = "1",
                mode = "everything")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 62 20
##          1 32 26
##                                           
##                Accuracy : 0.6286          
##                  95% CI : (0.5429, 0.7087)
##     No Information Rate : 0.6714          
##     P-Value [Acc > NIR] : 0.8783          
##                                           
##                   Kappa : 0.2108          
##                                           
##  Mcnemar's Test P-Value : 0.1272          
##                                           
##             Sensitivity : 0.5652          
##             Specificity : 0.6596          
##          Pos Pred Value : 0.4483          
##          Neg Pred Value : 0.7561          
##               Precision : 0.4483          
##                  Recall : 0.5652          
##                      F1 : 0.5000          
##              Prevalence : 0.3286          
##          Detection Rate : 0.1857          
##    Detection Prevalence : 0.4143          
##       Balanced Accuracy : 0.6124          
##                                           
##        'Positive' Class : 1               
## 

The Naive Bayes model has an accuracy of 62.86%, a precision of 44.83%, and a recall of 56.52%.
Accuracy is the ratio of correct predictions (both positive and negative) to the total number of observations: the diab.bayes model classifies 62.86% of all patients correctly, counting both those correctly predicted to have diabetes and those correctly predicted to be healthy.
Precision (Pos Pred Value) is the ratio of true positive predictions to all positive predictions: of the patients the diab.bayes model predicts to have diabetes, 44.83% actually do.
Recall (sensitivity) is the ratio of true positive predictions to all actually positive cases: the diab.bayes model identifies 56.52% of the patients who actually have diabetes.
The evaluation of this model will focus on the recall value, because here a false positive is better than a false negative: it is better for a patient to be predicted to have diabetes but actually be healthy than to be predicted healthy while actually having diabetes.
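
As a sanity check, the headline metrics can be recomputed directly from the four cells of the confusion matrix above (positive class = "1"):

# TP = 26, FN = 20, FP = 32, TN = 62
TP <- 26; FN <- 20; FP <- 32; TN <- 62
TP / (TP + FN)                    # recall    = 0.5652
TP / (TP + FP)                    # precision = 0.4483
(TP + TN) / (TP + TN + FP + FN)   # accuracy  = 0.6286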

ROC-AUC Evaluation
The ROC curve measures a classifier’s performance across different threshold settings, and the AUC summarizes it as a single degree of separability: ROC is a probability curve, while AUC indicates how well the model can discriminate between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; here, a higher AUC means better discrimination between patients with and without the condition (Narkhede, 2018).

# prediction: take the probability value
pred_test_prob <- predict(diab.bayes,
                          diab_test,
                          type = "raw")
# column 1 of the probability matrix is class "0" (the negative class);
# using it as the score inverts the ROC curve (see below)
pred_prob <- pred_test_prob[,1]
bayes_roc <- prediction(predictions = pred_prob,
                        labels = diab_test$Diabetes,
                        label.ordering = c("0",
                                           "1"))
model_roc_vec <- performance(bayes_roc, 
                             "tpr",
                             "fpr")
plot(model_roc_vec)
abline(0,1 , lty = 2)

# Calculate the AUC value
bayes_auc <- performance(bayes_roc, "auc")
bayes_auc@y.values[[1]]
## [1] 0.3126735

The AUC value is far below 0.5 because the score passed to prediction() was the probability of the negative class (column 1 of pred_test_prob) rather than the positive class (column 2), which mirrors the ROC curve across the diagonal and makes the model appear to swap the classes. Since the two probability columns sum to 1, scoring with the positive-class probability simply complements the AUC: 1 - 0.3127 = 0.6873, which indicates moderate (better than random) discriminative power.
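
A corrected sketch, using the same objects as above with only the column changed:

# use the positive-class probability as the score
pred_prob_pos <- pred_test_prob[, 2]
bayes_roc_pos <- prediction(predictions = pred_prob_pos,
                            labels = diab_test$Diabetes,
                            label.ordering = c("0", "1"))
performance(bayes_roc_pos, "auc")@y.values[[1]]   # 1 - 0.3127 = 0.6873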

Modeling: Decision Tree

Build the model

diab_tree <- ctree(formula = Diabetes~ .,
                     data = diab_train_up)
plot(diab_tree, type = "simple")

Prediction & Evaluation

# Prediction in training data
pred_train <- predict(diab_tree, 
                           diab_train_up, 
                           type = "response")

# Confusion matrix in training data
confusionMatrix(pred_train, 
                diab_train_up$Diabetes, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 322 145
##          1  84 261
##                                           
##                Accuracy : 0.718           
##                  95% CI : (0.6857, 0.7487)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.436           
##                                           
##  Mcnemar's Test P-Value : 7.342e-05       
##                                           
##             Sensitivity : 0.6429          
##             Specificity : 0.7931          
##          Pos Pred Value : 0.7565          
##          Neg Pred Value : 0.6895          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3214          
##    Detection Prevalence : 0.4249          
##       Balanced Accuracy : 0.7180          
##                                           
##        'Positive' Class : 1               
## 
# Prediction in testing data
pred_test <- predict(diab_tree, 
                     diab_test, 
                     type = "response")
# Confusion matrix in testing data
confusionMatrix(pred_test, 
                diab_test$Diabetes, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 68 25
##          1 26 21
##                                           
##                Accuracy : 0.6357          
##                  95% CI : (0.5502, 0.7153)
##     No Information Rate : 0.6714          
##     P-Value [Acc > NIR] : 0.8389          
##                                           
##                   Kappa : 0.1789          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.4565          
##             Specificity : 0.7234          
##          Pos Pred Value : 0.4468          
##          Neg Pred Value : 0.7312          
##              Prevalence : 0.3286          
##          Detection Rate : 0.1500          
##    Detection Prevalence : 0.3357          
##       Balanced Accuracy : 0.5900          
##                                           
##        'Positive' Class : 1               
## 

Training data: accuracy 71.80%, recall/sensitivity 64.29%, precision/Pos Pred Value 75.65%.
Testing data: accuracy 63.57%, recall/sensitivity 45.65%, precision/Pos Pred Value 44.68%.

These checks indicate that the model is overfitting: it performs well on the training data but noticeably worse on the testing data. The tree above is very complex, which carries a high risk of overfitting, so it needs to be simplified by pruning, namely by limiting the number of branches the model can create. The pruning method involves setting the mincriterion, minsplit, and minbucket values.

The appropriate metric for health cases is sensitivity (recall): the model must keep the number of false negatives (diabetic patients predicted healthy) low. The tuning process will therefore focus on achieving the optimum sensitivity value.

Model Tuning

Tuning is done by manually setting the mincriterion, minsplit, and minbucket values to obtain the optimum evaluation results: the highest recall value, with the results on the training and testing data not too far apart (see the table below). Trial adjustments were made to the mincriterion, minsplit, and minbucket values, and in several trials the recall values on the training and testing data came out the same. The optimum mincriterion value was then sought with caret’s train() over the ctree method, which selected mincriterion = 0.198; this value was used in the subsequent trials until the optimum recall value was obtained.
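
A hedged sketch of what one such trial loop could look like (the candidate minbucket values and the minsplit rule below are illustrative assumptions, not the author’s exact grid):

# try a few settings at the selected mincriterion and compare train/test recall
for (mb in c(25, 50, 100)) {
  m <- ctree(Diabetes ~ ., data = diab_train_up,
             control = ctree_control(mincriterion = 0.198,
                                     minsplit = 2 * mb,
                                     minbucket = mb))
  rec_train <- sensitivity(predict(m, diab_train_up), diab_train_up$Diabetes, positive = "1")
  rec_test  <- sensitivity(predict(m, diab_test), diab_test$Diabetes, positive = "1")
  cat("minbucket =", mb,
      "| train recall =", round(rec_train, 4),
      "| test recall =", round(rec_test, 4), "\n")
}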

knitr::include_graphics("decision-tree-tuning-trial.png")

From the trial, we can conclude that the best mincriterion, minsplit, and minbucket values are 0.198, 100, and 50, respectively. Here are the steps of the tuning trial:
1. Adjust mincriterion, minsplit, and minbucket, not too far from the base model, to build a tuned model. We found the same recall value on the training and testing data across various values of mincriterion, minsplit, and minbucket.
2. Find the optimum mincriterion value using caret’s train() over the ctree method.
3. Use the mincriterion from step 2 and build the tuned model.
4. Compare the results of step 3 and step 1.

Adjust mincriterion, minsplit, and minbucket, and build the tuned model.

diab_tree_tune <- ctree(formula = Diabetes~ .,
                     data = diab_train_up,
                     control = ctree_control(mincriterion = 0.198,
                                             minsplit = 100,
                                             minbucket = 50))
plot(diab_tree_tune, type = "simple")

# Prediction in training data
pred_train_tune <- predict(diab_tree_tune, 
                           diab_train_up, 
                           type = "response")

# Confusion matrix in training data
confusionMatrix(pred_train_tune, 
                diab_train_up$Diabetes, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 305 130
##          1 101 276
##                                           
##                Accuracy : 0.7155          
##                  95% CI : (0.6831, 0.7463)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.431           
##                                           
##  Mcnemar's Test P-Value : 0.06544         
##                                           
##             Sensitivity : 0.6798          
##             Specificity : 0.7512          
##          Pos Pred Value : 0.7321          
##          Neg Pred Value : 0.7011          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3399          
##    Detection Prevalence : 0.4643          
##       Balanced Accuracy : 0.7155          
##                                           
##        'Positive' Class : 1               
## 
# Prediction in testing data
pred_test_tune <- predict(diab_tree_tune, 
                     diab_test, 
                     type = "response")
# Confusion matrix in testing data
confusionMatrix(pred_test_tune, 
                diab_test$Diabetes, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 66 20
##          1 28 26
##                                           
##                Accuracy : 0.6571          
##                  95% CI : (0.5723, 0.7352)
##     No Information Rate : 0.6714          
##     P-Value [Acc > NIR] : 0.6765          
##                                           
##                   Kappa : 0.256           
##                                           
##  Mcnemar's Test P-Value : 0.3123          
##                                           
##             Sensitivity : 0.5652          
##             Specificity : 0.7021          
##          Pos Pred Value : 0.4815          
##          Neg Pred Value : 0.7674          
##              Prevalence : 0.3286          
##          Detection Rate : 0.1857          
##    Detection Prevalence : 0.3857          
##       Balanced Accuracy : 0.6337          
##                                           
##        'Positive' Class : 1               
## 

Find the optimum mincriterion value with caret’s train() (commented out below because the search is slow), then use it in the tuning trial.

#classifier = train(form = Diabetes ~ ., 
#                   data = diab_train_up, 
#                   method = 'ctree',
#                   tuneGrid = data.frame(mincriterion = seq(0.01,0.99,length.out = 100)),
#                   trControl = trainControl(method = "boot",
#                                            summaryFunction = defaultSummary,
#                                            verboseIter = TRUE))

The training log ends with:
## Aggregating results
## Selecting tuning parameters
## Fitting mincriterion = 0.198 on full training set

mincriterion: 0.198
minsplit: 100
minbucket: 50
A: recall/sensitivity on training data = 67.98%
B: recall/sensitivity on testing data = 56.52%
Difference between A and B: 11.46%

After tuning the model by changing the mincriterion, minsplit, and minbucket values, we find that the optimum recall values on the training and testing data are 67.98% and 56.52%, respectively. The difference between the training and testing metrics is not large, so the effect of overfitting has been reduced.

ROC-AUC Evaluation

# prediction: take the probability value
predict_tree <- predict(diab_tree_tune,
                          diab_test,
                          type = "prob")
head(predict_tree)
##            0         1
## 8  0.7000000 0.3000000
## 9  0.5578947 0.4421053
## 20 0.7200000 0.2800000
## 23 0.3076923 0.6923077
## 25 0.2037037 0.7962963
## 27 0.7000000 0.3000000
# as with the Naive Bayes model, column 1 is the probability of class "0",
# which again inverts the ROC curve
pred_prob_tree <- predict_tree[,1]
tree_roc <- prediction(predictions = pred_prob_tree,
                        labels = diab_test$Diabetes,
                        label.ordering = c("0",
                                           "1"))
model_roc_tree <- performance(tree_roc, 
                             "tpr",
                             "fpr")
plot(model_roc_tree)
abline(0,1 , lty = 2)

# Calculate the AUC value
tree_auc <- performance(tree_roc, "auc")
tree_auc@y.values[[1]]
## [1] 0.318802

The AUC value of the tuned tree is 0.3188, again below 0.5 for the same reason as before: the probability of the negative class was used as the score, which mirrors the ROC curve. Scoring with the positive-class column (predict_tree[, 2]) complements it to 1 - 0.3188 = 0.6812, slightly better separation than the Naive Bayes model achieved.

Modeling: Random Forest

Build the model

Because building a random forest model takes a lot of computing time, we save the model as an RDS file and load it when we want to use it. Here is the syntax for building the random forest model (commented out so the report does not retrain it):

#train_ctrl <- trainControl(method = "repeatedcv",
#                            number = 5,
#                            repeats = 3) 
#diab_forest <- train(Diabetes ~ .,
#                    data = diab_train_up,
#                    method = "rf",
#                    trControl = train_ctrl)
#saveRDS(diab_forest, "diab_forest.RDS")

Load the random forest model.

diab_forest <- readRDS("diab_forest.RDS")
diab_forest
## Random Forest 
## 
## 812 samples
##  21 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 650, 649, 650, 649, 650, 650, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.8083253  0.6166501
##   23    0.8641589  0.7283188
##   45    0.8575820  0.7151636
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 23.

The diab_forest model uses 812 samples, 21 predictors, and a 2-class target. Because caret dummy-encodes the factor predictors, there are 45 candidate columns, and the default grid tried mtry = 2, 23, and 45; the optimum number of variables considered for splitting at each tree node is 23.

Out-of-Bag Score
Because random forest already provides out-of-bag (OOB) estimates, which serve as a trustworthy estimate of the accuracy on unseen samples, we are strictly speaking not required to divide our data set into train and test sets. However, a typical train-test split can still be used alongside it, as we do here. The OOB estimate for the forest built on diab_train_up can be seen in the summary below.

diab_forest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 23
## 
##         OOB estimate of  error rate: 9.36%
## Confusion matrix:
##     0   1 class.error
## 0 346  60  0.14778325
## 1  16 390  0.03940887
plot(diab_forest$finalModel)
legend("topright", colnames(diab_forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

From the confusion matrix, the model misclassifies 3.94% of the rows whose actual class is positive (diabetes) and 14.78% of the rows whose actual class is negative. The overall OOB estimate of the error rate is 9.36%, meaning the model’s accuracy on OOB data is 90.64%.
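
These figures follow directly from the OOB confusion matrix above, and the overall rate can also be recomputed from the per-row OOB predictions that randomForest stores on the fitted object (a sketch, assuming diab_train_up is still in the order used for training):

# per-class errors and overall OOB error, recomputed from the matrix
60 / (346 + 60)    # 0.1478: actual "0" rows misclassified
16 / (16 + 390)    # 0.0394: actual "1" rows misclassified
(60 + 16) / 812    # 0.0936: overall OOB error rate

# each row's OOB prediction comes only from trees that did not see that row
oob_pred <- diab_forest$finalModel$predicted
mean(oob_pred != diab_train_up$Diabetes)   # should match the ~9.36% OOB error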

Variable Importance
Although random forests are often said to be uninterpretable models, we can see which predictors are most important in forming the random forest.

varImp(diab_forest) %>% plot()

From the graphic above, the top 5 most important variables are BMI, PhysHlth, GenHlth4, MentHlth, and Income8.

Prediction & Evaluation

Confusion Matrix Evaluation

#Prediction
predict_forest <- predict(diab_forest, diab_test)

#Confusion matrix
(conf_matrix_forestI <- table(predict_forest, diab_test$Diabetes))
##               
## predict_forest  0  1
##              0 79 33
##              1 15 13
confusionMatrix(conf_matrix_forestI)
## Confusion Matrix and Statistics
## 
##               
## predict_forest  0  1
##              0 79 33
##              1 15 13
##                                           
##                Accuracy : 0.6571          
##                  95% CI : (0.5723, 0.7352)
##     No Information Rate : 0.6714          
##     P-Value [Acc > NIR] : 0.67647         
##                                           
##                   Kappa : 0.1367          
##                                           
##  Mcnemar's Test P-Value : 0.01414         
##                                           
##             Sensitivity : 0.8404          
##             Specificity : 0.2826          
##          Pos Pred Value : 0.7054          
##          Neg Pred Value : 0.4643          
##              Prevalence : 0.6714          
##          Detection Rate : 0.5643          
##    Detection Prevalence : 0.8000          
##       Balanced Accuracy : 0.5615          
##                                           
##        'Positive' Class : 0               
## 

Note that this confusionMatrix() call was not given positive = "1", so it defaulted to "0" (see 'Positive' Class above). diab_forest therefore reaches an accuracy of 65.71%, but the reported recall/sensitivity of 84.04% and precision of 70.54% describe the healthy class. For the diabetes class ("1"), the same matrix gives a recall of only 28.26% and a precision of 46.43%, as computed below.
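
Recomputing the metrics with "1" (diabetes) as the positive class:

# from the matrix above, for class "1": TP = 13, FN = 33, FP = 15, TN = 79
13 / (13 + 33)   # recall for the diabetes class    = 0.2826
13 / (13 + 15)   # precision for the diabetes class = 0.4643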

ROC-AUC Evaluation

# prediction: take the probability value
predict_forest1 <- predict(diab_forest,
                          diab_test,
                          type = "prob")
head(predict_forest1)
##        0     1
## 8  0.806 0.194
## 9  0.446 0.554
## 20 0.868 0.132
## 23 0.442 0.558
## 25 0.448 0.552
## 27 0.718 0.282
# again, column 1 is the probability of class "0", which inverts the ROC curve
pred_prob_rf <- predict_forest1[,1]
rf_roc <- prediction(predictions = pred_prob_rf,
                        labels = diab_test$Diabetes,
                        label.ordering = c("0",
                                           "1"))
model_roc_rf <- performance(rf_roc, 
                             "tpr",
                             "fpr")
plot(model_roc_rf)
abline(0,1 , lty = 2)

# Calculate the AUC value
rf_auc <- performance(rf_roc, "auc")
rf_auc@y.values[[1]]
## [1] 0.3619334

The AUC value of the diab_forest model is 0.3619, once more below 0.5 because the negative-class probability was used as the score. With the positive-class column (predict_forest1[, 2]) the AUC complements to 1 - 0.3619 = 0.6381.

Conclusion

# conclusion: all metrics use "1" (diabetes) as the positive class,
# and the AUC column holds the corrected (positive-class-scored) values
conclusion <- data.frame(matrix(c(62.86, 65.71, 65.71, 56.52, 56.52, 28.26, 44.83, 48.15, 46.43, 0.6873, 0.6812, 0.6381), 
                    nrow = 3, 
                    dimnames = list(c("Naive Bayes", "Decision Tree", "Random Forest"), 
                                    c("Accuracy", "Recall", "Precision", "AUC"))))

conclusion
##               Accuracy Recall Precision    AUC
## Naive Bayes      62.86  56.52     44.83 0.6873
## Decision Tree    65.71  56.52     48.15 0.6812
## Random Forest    65.71  28.26     46.43 0.6381

Based on the table above, and given that our concern is to find the model with the highest recall on the diabetes class, the Naive Bayes and tuned decision tree models tie at 56.52%, well ahead of the random forest (28.26%, despite its strong performance on the healthy class). Because the tuned decision tree matches that recall while offering higher accuracy and precision, it is the model we would choose to predict the diabetes patients.

References