Intro

On this occasion, I will try to predict whether patients in a hospital have diabetes or not, based on several supporting variables. The algorithms I will use are Naive Bayes and Decision Tree, both of which belong to supervised learning.

Data Preparation

Load the required libraries.

library(caret)        # data splitting, downSample(), confusionMatrix()
library(e1071)        # naiveBayes()
library(randomForest) # random forest
library(ggplot2)      # visualization
library(class)        # k-nearest neighbors
library(tidyr)        # data tidying
library(dplyr)        # data wrangling (%>%, mutate, select)
library(partykit)     # ctree() decision tree

Load the data to build the classification models.

diabetes <- read.csv(file="data_input/diabetes.csv",stringsAsFactors = F)

Inspect Data

After we have successfully imported our data, we inspect it to find out what it contains. We could use the View() function to look at the whole dataset, but that takes time, so we use head() to see only the first few rows.

head(diabetes)

Descriptions:

  • pregnant: Number of times pregnant
  • glucose: Plasma glucose concentration (glucose tolerance test)
  • pressure: Diastolic blood pressure (mm Hg)
  • triceps: Triceps skin fold thickness (mm)
  • insulin: 2-Hour serum insulin (mu U/ml)
  • mass: Body mass index (weight in kg/(height in m)^2)
  • pedigree: Diabetes pedigree function
  • age: Age (years)
  • diabetes: Diabetes test result (pos/neg)

Data Wrangling & Exploratory Data Analysis

Check the structure of the data. Are there any columns whose data types do not match?

str(diabetes)
#> 'data.frame':    768 obs. of  9 variables:
#>  $ pregnant: int  6 1 8 1 0 5 3 10 2 8 ...
#>  $ glucose : int  148 85 183 89 137 116 78 115 197 125 ...
#>  $ pressure: int  72 66 64 66 40 74 50 0 70 96 ...
#>  $ triceps : int  35 29 0 23 35 0 32 0 45 0 ...
#>  $ insulin : int  0 0 0 94 168 0 88 0 543 0 ...
#>  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#>  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
#>  $ age     : int  50 31 32 21 33 30 26 29 53 54 ...
#>  $ diabetes: chr  "pos" "neg" "pos" "neg" ...

Target: diabetes (pos, neg)

Columns whose data types do not match:

  • diabetes -> factor

diabetes <- diabetes %>% 
  mutate(diabetes = as.factor(diabetes))

glimpse(diabetes)
#> Rows: 768
#> Columns: 9
#> $ pregnant <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1~
#> $ glucose  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,~
#> $ pressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,~
#> $ triceps  <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0~
#> $ insulin  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230~
#> $ mass     <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37~
#> $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158~
#> $ age      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3~
#> $ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n~

Check for blank (missing) data in our dataset.

colSums(is.na(diabetes))
#> pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
#>        0        0        0        0        0        0        0        0 
#> diabetes 
#>        0

From the results of our exploratory check, we find that there is no NA or blank data.

Cross Validation

Cross validation is the stage where we divide the data into two parts: a train set and a test set.

  • Train data will be used for model training.
  • Test data will be used for model performance testing. The model will be tested to predict the test data. Prediction results and actual data from the test data will be compared to validate the model’s performance.

The purpose of cross validation is to find out how well the model predicts unseen data.

RNGkind(sample.kind = "Rounding") 
set.seed(123)

index <- sample(x = nrow(diabetes),
                size = nrow(diabetes)*0.8)

# splitting
diab_train <- diabetes[index, ]
diab_test <- diabetes[-index, ]

Data Pre-Processing

Before doing the modeling, we first need to look at the class proportions of the target variable.

prop.table(table(diab_train$diabetes))
#> 
#>       neg       pos 
#> 0.6563518 0.3436482

Looking at the proportions, the two classes are not very balanced, so we need an additional pre-processing step to balance the proportions of the two target classes.

Downsampling reduces the observations of the majority class until they equal the minority class. Disadvantage: it discards information from the data we hold. It is usually used when the minority class still contains quite a lot of data.

RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
diab_train <- downSample(x = diab_train %>% select(-diabetes),
                         y = diab_train$diabetes,
                         yname = "diabetes")

prop.table(table(diab_train$diabetes))
#> 
#> neg pos 
#> 0.5 0.5

After downsampling, the proportions in the train data look balanced and we can proceed to the modeling process.

Modeling

The next step is to build classification models with different algorithms, namely Naive Bayes and Decision Tree, and compare the accuracy of all the models that have been made.

Naive Bayes

Naive Bayes is a machine learning model that applies Bayes' Theorem to classification. Each predictor is assumed to be related to the target variable, but the model is called "Naive" because the predictors are assumed to be independent of each other (not related to each other) and to carry the same weight (the same importance or influence) in making predictions. This simplifies the calculation (the formula becomes simpler) and reduces the computational burden.
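Concretely, the independence assumption lets the posterior probability factor into per-predictor likelihoods; this is the standard form of the model, not something estimated from this dataset:

$$P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

The predicted class is simply the y that maximizes this product.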

# Model with Naive Bayes
model_naive <- naiveBayes(formula = diabetes ~ ., data = diab_train)

We will make predictions using type = "class", which returns the class label with the highest posterior probability (for two classes, equivalent to a default threshold of 0.5).

# Prediction
preds_naive <- predict(model_naive, newdata = diab_test, type = "class")
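If we want the underlying posterior probabilities rather than hard labels, e1071 also accepts type = "raw"; a minimal sketch:

# posterior probability per class, one row per test observation
probs_naive <- predict(model_naive, newdata = diab_test, type = "raw")
head(probs_naive)

These probabilities are what a custom cutoff would later be applied to.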

After making predictions, we evaluate the model with confusionMatrix and see how it turns out.

# Confusion Matrix
confusionMatrix(data = preds_naive, reference = diab_test$diabetes, positive = "pos" )
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg  81  15
#>        pos  16  42
#>                                           
#>                Accuracy : 0.7987          
#>                  95% CI : (0.7266, 0.8589)
#>     No Information Rate : 0.6299          
#>     P-Value [Acc > NIR] : 0.000004461     
#>                                           
#>                   Kappa : 0.5698          
#>                                           
#>  Mcnemar's Test P-Value : 1               
#>                                           
#>             Sensitivity : 0.7368          
#>             Specificity : 0.8351          
#>          Pos Pred Value : 0.7241          
#>          Neg Pred Value : 0.8438          
#>              Prevalence : 0.3701          
#>          Detection Rate : 0.2727          
#>    Detection Prevalence : 0.3766          
#>       Balanced Accuracy : 0.7859          
#>                                           
#>        'Positive' Class : pos             
#> 

The confusion matrix shows that the Naive Bayes classifier correctly predicts 81 of the 97 patients who do not have diabetes, misclassifying 16 as positive. Likewise, it correctly predicts 42 of the 57 diabetic patients, misclassifying 15 as negative. From this output, the model's accuracy is 79.87% and its Recall is 73.68%.
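To make the link between the matrix and these metrics explicit, here is a small sketch that recomputes them by hand from the same objects:

# accuracy: (TN + TP) / total ; recall: TP / (TP + FN)
tab <- table(Prediction = preds_naive, Reference = diab_test$diabetes)
sum(diag(tab)) / sum(tab)
tab["pos", "pos"] / sum(tab[, "pos"])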

Previously we made predictions on the test data. What happens if we make predictions on the train data instead? Will the results be better than on the test data?

preds_naive <- predict(model_naive, newdata = diab_train, type = "class")

After making predictions, we again evaluate the model with confusionMatrix.

confusionMatrix(data = preds_naive, reference = diab_train$diabetes, positive = "pos" )
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg 165  70
#>        pos  46 141
#>                                               
#>                Accuracy : 0.7251              
#>                  95% CI : (0.6799, 0.7672)    
#>     No Information Rate : 0.5                 
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.4502              
#>                                               
#>  Mcnemar's Test P-Value : 0.03272             
#>                                               
#>             Sensitivity : 0.6682              
#>             Specificity : 0.7820              
#>          Pos Pred Value : 0.7540              
#>          Neg Pred Value : 0.7021              
#>              Prevalence : 0.5000              
#>          Detection Rate : 0.3341              
#>    Detection Prevalence : 0.4431              
#>       Balanced Accuracy : 0.7251              
#>                                               
#>        'Positive' Class : pos                 
#> 

Referring to the accuracy and recall values: on the train data the model reaches an accuracy of 72.51% and a Recall of 66.82%, while on the test data it reaches an accuracy of 79.87% and a Recall of 73.68%. Since the model does not even fit the train data well, we can assume it is underfitting: the results on both the train data and the test data are not good.
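To see this diagnosis at a glance, a small sketch that puts the two accuracies side by side (acc is a helper defined here, not a package function):

# compare train vs test accuracy for the Naive Bayes model
acc <- function(pred, actual) mean(pred == actual)
data.frame(data = c("train", "test"),
           accuracy = c(acc(predict(model_naive, newdata = diab_train), diab_train$diabetes),
                        acc(predict(model_naive, newdata = diab_test), diab_test$diabetes)))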

Decision Tree

Decision Tree is a fairly simple tree-based model with robust/powerful performance for prediction. Decision Tree produces a visualization in the form of a decision tree that can be interpreted easily.

Additional Decision Tree characters:

  • The predictor variables may be interdependent; no independence assumption is made, so the model can cope with multicollinearity.
  • It is robust to outliers in numeric predictors.

# Model with Decision Tree
model_tree <- ctree(formula = diabetes ~ . , data = diab_train)

plot(model_tree,type = "simple")

From the plot we can see the number of splits/leaves (the tree's width) and the number of layers/levels (its depth), where:

  • [1] is the Root Node (the root).
  • [2], [3], [5], [8], and [11] are Internal Nodes (branches). A branch has an arrow pointing at it and an arrow pointing away from it.
  • [4], [6], [7], [9], [10], [12], and [13] are Leaf Nodes (leaves). A leaf has an arrow pointing at it, but no arrow pointing away from it.
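As a quick check on the leaves, predict.party can also report which terminal node each observation falls into; a sketch using type = "node":

# count training observations per terminal node (leaf)
table(predict(model_tree, newdata = diab_train, type = "node"))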

We will make predictions using type = "response", which for ctree returns the predicted class label (for two classes, equivalent to a default threshold of 0.5).

# Prediction
preds_tree <- predict(model_tree, newdata = diab_test, type = "response")

After making predictions, we evaluate the model with confusionMatrix and see how it turns out.

# Confusion Matrix
confusionMatrix(data = preds_tree, reference = diab_test$diabetes, positive = "pos" )
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg  61  15
#>        pos  36  42
#>                                           
#>                Accuracy : 0.6688          
#>                  95% CI : (0.5885, 0.7425)
#>     No Information Rate : 0.6299          
#>     P-Value [Acc > NIR] : 0.179643        
#>                                           
#>                   Kappa : 0.3399          
#>                                           
#>  Mcnemar's Test P-Value : 0.005101        
#>                                           
#>             Sensitivity : 0.7368          
#>             Specificity : 0.6289          
#>          Pos Pred Value : 0.5385          
#>          Neg Pred Value : 0.8026          
#>              Prevalence : 0.3701          
#>          Detection Rate : 0.2727          
#>    Detection Prevalence : 0.5065          
#>       Balanced Accuracy : 0.6829          
#>                                           
#>        'Positive' Class : pos             
#> 

The confusion matrix shows that the Decision Tree correctly predicts 61 of the 97 patients who do not have diabetes, misclassifying 36 as positive. It correctly predicts 42 of the 57 diabetic patients, misclassifying 15 as negative. From this output, the model's accuracy is 66.88% and its Recall is 73.68%.

Previously we made predictions on the test data. What happens if we make predictions on the train data instead? Will the results be better than on the test data?

preds_tree <- predict(model_tree, newdata = diab_train, type = "response")

After making predictions, we again evaluate the model with confusionMatrix.

confusionMatrix(data = preds_tree, reference = diab_train$diabetes, positive = "pos" )
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction neg pos
#>        neg 154  42
#>        pos  57 169
#>                                              
#>                Accuracy : 0.7654             
#>                  95% CI : (0.722, 0.805)     
#>     No Information Rate : 0.5                
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.5308             
#>                                              
#>  Mcnemar's Test P-Value : 0.1594             
#>                                              
#>             Sensitivity : 0.8009             
#>             Specificity : 0.7299             
#>          Pos Pred Value : 0.7478             
#>          Neg Pred Value : 0.7857             
#>              Prevalence : 0.5000             
#>          Detection Rate : 0.4005             
#>    Detection Prevalence : 0.5355             
#>       Balanced Accuracy : 0.7654             
#>                                              
#>        'Positive' Class : pos                
#> 

Referring to the accuracy and recall values: on the train data the model reaches an accuracy of 76.54% and a Recall of 80.09%, while on the test data it only reaches an accuracy of 66.88% and a Recall of 73.68%. Since the results on the train data are good but drop noticeably on the test data, we can assume the model is overfitting.

The disadvantage of the Decision Tree is that it is prone to overfitting, because it can keep splitting the data in very fine detail, even to the point where a leaf node contains only one observation. The model then merely memorizes the train data patterns through overly complex rules instead of learning generalizable patterns. As a result, it is less able to generalize to the test data, and its performance there is much worse.
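A common remedy is to constrain how far the tree is allowed to grow; partykit exposes this through ctree_control. A minimal sketch with illustrative, untuned values:

# pre-pruning: stricter split test, larger minimum node size, capped depth
model_tree_pruned <- ctree(diabetes ~ ., data = diab_train,
                           control = ctree_control(mincriterion = 0.99,
                                                   minsplit = 40,
                                                   maxdepth = 3))

Whether these particular settings actually help here would need to be re-validated on the test data.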

Conclusion

To get the best performance from each model, especially the Naive Bayes and Decision Tree models, performance can still be improved by finding the most suitable cutoff value. In this analysis we chiefly want to minimize false negatives (sick patients predicted as healthy), so we need to find a cutoff that gives a high Recall without changing the accuracy too much.
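A minimal sketch of such a search for the Naive Bayes model, using an illustrative grid of cutoffs on the posterior probabilities:

# accuracy and recall on the test set for several candidate cutoffs
probs_pos <- predict(model_naive, newdata = diab_test, type = "raw")[, "pos"]
for (cut in seq(0.3, 0.7, by = 0.1)) {
  pred <- factor(ifelse(probs_pos > cut, "pos", "neg"), levels = c("neg", "pos"))
  cm <- table(Prediction = pred, Reference = diab_test$diabetes)
  cat(sprintf("cutoff %.1f: accuracy %.3f, recall %.3f\n",
              cut, sum(diag(cm)) / sum(cm), cm["pos", "pos"] / sum(cm[, "pos"])))
}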

Based on accuracy and Recall, the Naive Bayes model is the best model, with an accuracy of 79.87% and a Recall of 73.68%, while the Decision Tree reaches an accuracy of 66.88% and a Recall of 73.68%. However, we can still change the cutoff value of each model and re-validate it.