Introduction

According to Wikipedia, diabetes mellitus is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period. Diabetes is due to either the pancreas not producing enough insulin or the cells of the body not responding properly to the insulin produced.

It is estimated that in 2017 around 425 million people had diabetes worldwide, which represents 8.8% of the worldwide adult population (both men and women).

In this instance, we shall look at the diabetes dataset published on Kaggle.com and collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It consists of samples from the Pima Native American population; all of the patients are females aged 21 or above and of Pima heritage.

Dataset Preparation

Overview

Using the data we have gathered from Kaggle.com, we must first look at how the dataset is structured.

'data.frame':   768 obs. of  9 variables:
 $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
 $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
 $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
 $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
 $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
 $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
 $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
 $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
 $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

From the structure above, we can see that some of the variables' data types need correcting. For instance, the Outcome variable should be a factor instead of an integer. We can also scale the numeric variables.

'data.frame':   768 obs. of  9 variables:
 $ Pregnancies             : num  0.64 -0.844 1.233 -0.844 -1.141 ...
 $ Glucose                 : num  0.848 -1.123 1.942 -0.998 0.504 ...
 $ BloodPressure           : num  0.15 -0.16 -0.264 -0.16 -1.504 ...
 $ SkinThickness           : num  0.907 0.531 -1.287 0.154 0.907 ...
 $ Insulin                 : num  -0.692 -0.692 -0.692 0.123 0.765 ...
 $ BMI                     : num  0.204 -0.684 -1.103 -0.494 1.409 ...
 $ DiabetesPedigreeFunction: num  0.468 -0.365 0.604 -0.92 5.481 ...
 $ Age                     : num  1.4251 -0.1905 -0.1055 -1.0409 -0.0205 ...
 $ Outcome                 : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
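The conversion described above can be sketched as follows. This is a minimal illustration, not necessarily the code used for this report: `diab_demo` is a hypothetical stand-in built from the first five values of three columns in the `str()` output, so its scaled values will differ from those shown for the full data.

```r
# Small stand-in for the Kaggle data: first five values of three columns
# from the str() output above (the full data frame is assumed to be `diab`).
diab_demo <- data.frame(
  Glucose = c(148, 85, 183, 89, 137),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1),
  Outcome = c(1L, 0L, 1L, 0L, 1L)
)

# Every column except the label is treated as a predictor.
predictors <- setdiff(names(diab_demo), "Outcome")

# Standardise each predictor to mean 0 and standard deviation 1.
diab_demo[predictors] <- lapply(
  diab_demo[predictors],
  function(x) as.numeric(scale(x))
)

# Turn the 0/1 label into a two-level factor.
diab_demo$Outcome <- factor(diab_demo$Outcome, levels = c(0, 1))

str(diab_demo)
```

`scale()` returns a one-column matrix for a vector input, which is why each result is passed through `as.numeric()` before being stored back in the data frame.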

For safekeeping, we should check whether our data contains NA values using the is.na() function. If there are any, and they make up less than about 5% of our entire data, we can drop them using tidyr::drop_na().

is.na(diab)

       Pregnancies Glucose BloodPressure SkinThickness Insulin   BMI
  [1,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [2,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [3,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [4,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [5,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [6,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [7,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
  [8,]       FALSE   FALSE         FALSE         FALSE   FALSE FALSE
       DiabetesPedigreeFunction   Age Outcome
  [1,]                    FALSE FALSE   FALSE
  [2,]                    FALSE FALSE   FALSE
  [3,]                    FALSE FALSE   FALSE
  [4,]                    FALSE FALSE   FALSE
  [5,]                    FALSE FALSE   FALSE
  [6,]                    FALSE FALSE   FALSE
  [7,]                    FALSE FALSE   FALSE
  [8,]                    FALSE FALSE   FALSE
 [ reached getOption("max.print") -- omitted 760 rows ]

From our inspection, none of the values in our dataset are NA, so we can proceed to building the classification models.
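Printing the full `is.na()` matrix is hard to scan for 768 rows, so a per-column summary may be more convenient. The sketch below uses a hypothetical stand-in data frame, `diab_demo`, with one NA planted for illustration. It also counts zeros separately: the `str()` output earlier shows zeros in columns such as Insulin and BMI, and `is.na()` does not flag those. Whether those zeros stand for missing measurements is not stated in the source, so treat that as an open question.

```r
# Hypothetical stand-in for `diab`, with one NA planted so the checks
# have something to find.
diab_demo <- data.frame(
  Glucose = c(148, 85, NA, 89),
  Insulin = c(0, 0, 0, 94)
)

anyNA(diab_demo)                                   # is anything missing at all?
na_per_column <- colSums(is.na(diab_demo))         # missing values per variable
na_row_share  <- mean(!complete.cases(diab_demo))  # share of rows with any NA

# Zeros are not NA, so they need their own count.
zeros_per_column <- colSums(diab_demo == 0, na.rm = TRUE)

na_per_column
na_row_share
zeros_per_column
```

`na_row_share` is the fraction of rows that `tidyr::drop_na()` would remove, which is the number to compare against the 5% rule of thumb mentioned above.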

Data Splitting

After tweaking our dataset, we can now split the data. In this case we use an 80:10:10 approach: 80% for model training, 10% for model testing, and the final 10% for model validation.

library(rsample)  # initial_split(), training(), testing()

set.seed(746)
idx_1 <- initial_split(diab, prop = 0.8, strata = "Outcome")  # train : test+val = 80:20

dt_train <- training(idx_1)  # 80% of the dataset for model training
dt_tsval <- testing(idx_1)   # remaining 20%, split again below

set.seed(178)
idx_2 <- initial_split(dt_tsval, prop = 0.5, strata = "Outcome")  # test : val = 50:50

dt_test <- training(idx_2)  # 10% of the dataset for model testing
dt_val  <- testing(idx_2)   # 10% of the dataset for validation

Classification

For our diabetes prediction, we will use two classification methods: logistic regression and k-Nearest Neighbour (kNN). We label patients who are diabetic (positive) with ‘1’ and patients who are not (negative) with ‘0’.

Logistic Regression

Logistic regression models the probability that an observation belongs to one class of a binary dependent variable. It is used in machine learning for binary classification problems such as sick/healthy or sunny/rainy.

In this case, we want to create a model that classifies whether our patients have diabetes or not.


Call:
glm(formula = Outcome ~ ., family = "binomial", data = dt_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4062  -0.7442  -0.4305   0.7602   2.9428  

Coefficients:
                          Estimate Std. Error z value Pr(>|z|)    
(Intercept)              -0.884485   0.107761  -8.208 2.25e-16 ***
Pregnancies               0.343678   0.119612   2.873  0.00406 ** 
Glucose                   1.172099   0.136167   8.608  < 2e-16 ***
BloodPressure            -0.291090   0.111768  -2.604  0.00920 ** 
SkinThickness             0.004958   0.120547   0.041  0.96719    
Insulin                  -0.203966   0.121173  -1.683  0.09232 .  
BMI                       0.802722   0.134773   5.956 2.58e-09 ***
DiabetesPedigreeFunction  0.148443   0.106932   1.388  0.16507    
Age                       0.125647   0.124876   1.006  0.31433    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 796.91  on 615  degrees of freedom
Residual deviance: 591.80  on 607  degrees of freedom
AIC: 609.8

Number of Fisher Scoring iterations: 5

From the model summary above, some variables have fairly large p-values (>0.05) while others have much smaller p-values (<0.05). While we could use the current model to predict on our test dataset, it is better to drop some variables, especially those with large p-values, to achieve a better model.


Call:
glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
    BMI, family = "binomial", data = dt_train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1785  -0.7421  -0.4331   0.7814   2.9301  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.8768     0.1065  -8.235  < 2e-16 ***
Pregnancies     0.4204     0.1020   4.120 3.78e-05 ***
Glucose         1.1284     0.1208   9.342  < 2e-16 ***
BloodPressure  -0.2827     0.1072  -2.638  0.00834 ** 
BMI             0.7699     0.1255   6.133 8.62e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 796.91  on 615  degrees of freedom
Residual deviance: 598.48  on 611  degrees of freedom
AIC: 608.48

Number of Fisher Scoring iterations: 5
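The two `Call:` lines above show the form of the fits. A minimal runnable sketch of the same pattern follows, assuming a synthetic stand-in for `dt_train` (the names `demo_train`, `fit_all` and `fit_sub` are hypothetical, and the simulated coefficients are arbitrary), since the Kaggle file is not bundled here.

```r
set.seed(1)  # arbitrary seed for the synthetic stand-in data

# Synthetic stand-in for dt_train: two standardised predictors and a 0/1 label.
n <- 200
demo_train <- data.frame(Glucose = rnorm(n), BMI = rnorm(n))
p <- plogis(-0.9 + 1.2 * demo_train$Glucose + 0.8 * demo_train$BMI)
demo_train$Outcome <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Full model: Outcome regressed on every other column.
fit_all <- glm(Outcome ~ ., family = "binomial", data = demo_train)

# Reduced model: keep only a chosen subset of the predictors.
fit_sub <- glm(Outcome ~ Glucose, family = "binomial", data = demo_train)

summary(fit_all)        # coefficient table with z values and p-values
AIC(fit_all, fit_sub)   # lower AIC suggests the better trade-off
```

Comparing AIC values is one way to check that dropping predictors has not cost much fit. In the summaries above, the AIC moves from 609.8 for the full model to 608.48 for the reduced one.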

We now have a more compact model after dropping some variables, with a slightly lower AIC (608.48 versus 609.8). To check whether our model is fitted just right, overfit, or underfit, we need to compare its performance on the training and testing datasets. A sign of overfitting is performance on the training dataset that is markedly higher than on the testing dataset.

We now apply the model to both our training and testing datasets to see its performance.
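To make the prediction step concrete, here is a worked example for a single patient. It uses the coefficients of the reduced model and the scaled values of the first row from the `str()` output of the scaled data. The 0.5 cut-off is an assumption, as the threshold used for this report is not stated.

```r
# Coefficients copied from the reduced model summary above.
coefs <- c(
  "(Intercept)" = -0.8768,
  Pregnancies   =  0.4204,
  Glucose       =  1.1284,
  BloodPressure = -0.2827,
  BMI           =  0.7699
)

# Scaled values of the first row, from the str() output of the scaled data.
patient <- c(Pregnancies = 0.64, Glucose = 0.848,
             BloodPressure = 0.15, BMI = 0.204)

# Linear predictor (log-odds), then the logistic transform to a probability.
log_odds <- coefs[["(Intercept)"]] + sum(coefs[names(patient)] * patient)
prob <- plogis(log_odds)

# Assumed cut-off of 0.5; the original threshold is not stated.
pred_class <- as.integer(prob > 0.5)

c(log_odds = log_odds, prob = prob, pred_class = pred_class)
```

The log-odds come out at roughly 0.46, a probability of about 0.61, so this patient would be classified as positive under the assumed cut-off.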

Model performance on the training dataset

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 353  95
         1  48 120
                                          
               Accuracy : 0.7679          
                 95% CI : (0.7325, 0.8007)
    No Information Rate : 0.651           
    P-Value [Acc > NIR] : 2.173e-10       
                                          
                  Kappa : 0.4619          
                                          
 Mcnemar's Test P-Value : 0.0001197       
                                          
            Sensitivity : 0.5581          
            Specificity : 0.8803          
         Pos Pred Value : 0.7143          
         Neg Pred Value : 0.7879          
             Prevalence : 0.3490          
         Detection Rate : 0.1948          
   Detection Prevalence : 0.2727          
      Balanced Accuracy : 0.7192          
                                          
       'Positive' Class : 1

Model performance on the testing dataset

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 43 13
         1  7 14
                                          
               Accuracy : 0.7403          
                 95% CI : (0.6277, 0.8336)
    No Information Rate : 0.6494          
    P-Value [Acc > NIR] : 0.05786         
                                          
                  Kappa : 0.3989          
                                          
 Mcnemar's Test P-Value : 0.26355         
                                          
            Sensitivity : 0.5185          
            Specificity : 0.8600          
         Pos Pred Value : 0.6667          
         Neg Pred Value : 0.7679          
             Prevalence : 0.3506          
         Detection Rate : 0.1818          
   Detection Prevalence : 0.2727          
      Balanced Accuracy : 0.6893          
                                          
       'Positive' Class : 1
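The headline statistics in both outputs follow directly from the four cells of each confusion matrix. The sketch below recomputes them in base R from the counts printed above; `cm_metrics` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Derive summary statistics from the four cells of a 2x2 confusion matrix,
# treating class 1 as the positive class.
cm_metrics <- function(tn, fn, fp, tp) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),  # share of actual positives that were found
    specificity = tn / (tn + fp),  # share of actual negatives that were found
    precision   = tp / (tp + fp)   # "Pos Pred Value" in the output above
  )
}

# Counts copied from the two logistic regression matrices above
# (rows are predictions, columns are the reference labels).
train_metrics <- cm_metrics(tn = 353, fn = 95, fp = 48, tp = 120)
test_metrics  <- cm_metrics(tn = 43,  fn = 13, fp = 7,  tp = 14)

round(rbind(training = train_metrics, testing = test_metrics), 4)
```

Reading the matrix this way also makes clear why sensitivity is the weakest figure: 95 of the 215 diabetic patients in the training data are predicted as negative.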

We can summarise the performance above in an easy-to-read table:

 Dataset    Accuracy  Sensitivity  Specificity  Pos Pred Value
 Training     0.7679       0.5581       0.8803          0.7143
 Testing      0.7403       0.5185       0.8600          0.6667

Comparing the two, accuracy drops only slightly from training (0.768) to testing (0.740), so we can say that our logistic regression model is neither overfit nor underfit, but just right. The model can therefore proceed to the validation dataset, where we will see its final performance.

k-Nearest Neighbour

k-Nearest Neighbour, or kNN for short, is a classification method that uses the distance between observations in predictor space to predict which class a new observation belongs to. kNN assumes that the closer two observations are, the more similar they are, and hence the more likely they are to belong to the same class.

Applying the kNN method to our data, we will try to classify whether a patient is positive (has diabetes) or negative (healthy).

Before we go further, we first need to define the number of neighbours for kNN (denoted by ‘k’). A common rule of thumb for the value of k is \[k = \sqrt{n}\] where \(n\) is the number of observations in the training data.

Applying this expression to our training data (616 observations) gives a square root of about 24.8, which truncates to a k value of 24. We then use it to create our kNN model.
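As a quick check of that arithmetic, assuming the square root is truncated rather than rounded to the nearest integer (which is what a result of 24 implies):

```r
# Rows in the training set: the cells of the training confusion matrix
# add up to 353 + 95 + 48 + 120 = 616.
n_train <- 616

sqrt(n_train)              # about 24.8
k <- floor(sqrt(n_train))  # truncate to a whole number of neighbours
k
```

Because the outcome has two classes, some practitioners prefer an odd k to avoid tied votes; the report keeps the value given by the rule of thumb.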

Our kNN model uses the predictors selected in the logistic regression earlier, namely Pregnancies, Glucose, BloodPressure and BMI, all of which have low p-values (<0.05).
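The source does not show which kNN implementation was used. One common choice is `knn()` from the `class` package, and the sketch below assumes it. The data are synthetic stand-ins (`demo_train`, `demo_test` and the helper `make_demo` are hypothetical), with two predictors instead of four to keep the example short.

```r
library(class)  # provides knn(); ships with standard R installations

set.seed(2)  # arbitrary seed for the synthetic stand-in data

# Synthetic stand-ins for the scaled training and testing sets.
make_demo <- function(n) {
  x <- data.frame(Glucose = rnorm(n), BMI = rnorm(n))
  x$Outcome <- factor(
    as.integer(x$Glucose + x$BMI + rnorm(n) > 0.5),
    levels = c(0, 1)
  )
  x
}
demo_train <- make_demo(100)
demo_test  <- make_demo(30)
predictors <- c("Glucose", "BMI")

# Square-root rule of thumb for the number of neighbours.
k <- floor(sqrt(nrow(demo_train)))

# knn() has no separate fitting step: it classifies the test rows
# directly from the labelled training rows.
pred <- knn(
  train = demo_train[predictors],
  test  = demo_test[predictors],
  cl    = demo_train$Outcome,
  k     = k
)

table(Prediction = pred, Reference = demo_test$Outcome)
```

Since kNN works on distances, it matters that the predictors were scaled earlier; otherwise a variable measured on a large scale would dominate the neighbour search.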

As before, we check whether our kNN model is fitted just right by comparing its performance on the training and testing datasets.

kNN model performance on the training dataset

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 364 103
         1  37 112
                                          
               Accuracy : 0.7727          
                 95% CI : (0.7376, 0.8053)
    No Information Rate : 0.651           
    P-Value [Acc > NIR] : 3.644e-11       
                                          
                  Kappa : 0.4615          
                                          
 Mcnemar's Test P-Value : 3.940e-08       
                                          
            Sensitivity : 0.5209          
            Specificity : 0.9077          
         Pos Pred Value : 0.7517          
         Neg Pred Value : 0.7794          
             Prevalence : 0.3490          
         Detection Rate : 0.1818          
   Detection Prevalence : 0.2419          
      Balanced Accuracy : 0.7143          
                                          
       'Positive' Class : 1

kNN model performance on the testing dataset

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 41 14
         1  9 13
                                          
               Accuracy : 0.7013          
                 95% CI : (0.5862, 0.8003)
    No Information Rate : 0.6494          
    P-Value [Acc > NIR] : 0.2027          
                                          
                  Kappa : 0.3149          
                                          
 Mcnemar's Test P-Value : 0.4042          
                                          
            Sensitivity : 0.4815          
            Specificity : 0.8200          
         Pos Pred Value : 0.5909          
         Neg Pred Value : 0.7455          
             Prevalence : 0.3506          
         Detection Rate : 0.1688          
   Detection Prevalence : 0.2857          
      Balanced Accuracy : 0.6507          
                                          
       'Positive' Class : 1

For easy comparison, we summarise the performance in a table:

 Dataset    Accuracy  Sensitivity  Specificity  Pos Pred Value
 Training     0.7727       0.5209       0.9077          0.7517
 Testing      0.7013       0.4815       0.8200          0.5909

With a training accuracy of 0.773 and a testing accuracy of 0.701, the gap between the two datasets is modest, and we consider our kNN model to be fitted about right. The model can therefore proceed to the validation dataset, where we will see its final performance.

Comparison

Our final result is the summary of the predictions on the validation set. The two methods give us the following:

1. Logistic Regression Model

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 43 12
         1  6 14
                                          
               Accuracy : 0.76            
                 95% CI : (0.6475, 0.8511)
    No Information Rate : 0.6533          
    P-Value [Acc > NIR] : 0.03172         
                                          
                  Kappa : 0.4398          
                                          
 Mcnemar's Test P-Value : 0.23859         
                                          
            Sensitivity : 0.5385          
            Specificity : 0.8776          
         Pos Pred Value : 0.7000          
         Neg Pred Value : 0.7818          
             Prevalence : 0.3467          
         Detection Rate : 0.1867          
   Detection Prevalence : 0.2667          
      Balanced Accuracy : 0.7080          
                                          
       'Positive' Class : 1

2. kNN Model

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 47  8
         1  2 18
                                          
               Accuracy : 0.8667          
                 95% CI : (0.7684, 0.9342)
    No Information Rate : 0.6533          
    P-Value [Acc > NIR] : 2.757e-05       
                                          
                  Kappa : 0.6888          
                                          
 Mcnemar's Test P-Value : 0.1138          
                                          
            Sensitivity : 0.6923          
            Specificity : 0.9592          
         Pos Pred Value : 0.9000          
         Neg Pred Value : 0.8545          
             Prevalence : 0.3467          
         Detection Rate : 0.2400          
   Detection Prevalence : 0.2667          
      Balanced Accuracy : 0.8257          
                                          
       'Positive' Class : 1

Using the validation dataset, the performance of both models is summarised in the table below:

 Model                Accuracy  Sensitivity  Specificity  Pos Pred Value
 Logistic regression    0.7600       0.5385       0.8776          0.7000
 kNN                    0.8667       0.6923       0.9592          0.9000

Both models give a good result on the validation set, with kNN ahead on accuracy (0.867 versus 0.760). For our case of detecting whether a patient has diabetes, both models show high specificity (0.878 and 0.959), so few healthy patients are flagged as diabetic. Sensitivity is lower (0.538 and 0.692), which means that a share of diabetic patients is still predicted as negative.

Conclusion

For our case of predicting whether a patient has diabetes from recorded predictors, both logistic regression and kNN can be used, as both gave good prediction results on the validation set.

Diabetes Prediction Classifier

2019-12-06