According to Wikipedia, diabetes is a group of metabolic disorders characterized by a high blood sugar level over a prolonged period. Diabetes is due to either the pancreas not producing enough insulin or the cells of the body not responding properly to the insulin produced.
It is estimated that in 2017 around 425 million people had diabetes worldwide, which represents 8.8% of the worldwide adult population (both men and women).
In this instance, we shall look at the diabetes dataset published on Kaggle.com and collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It consists of samples from the Pima Native American people; all of the patients are female and at least 21 years old.
Using the data we have gathered from Kaggle.com, we first look at how the dataset is structured.
'data.frame': 768 obs. of 9 variables:
$ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
$ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
$ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
$ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
$ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
$ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
$ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
$ Age : int 50 31 32 21 33 30 26 29 53 54 ...
$ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
From the structure above, we can see that the data types of some variables need correcting. For instance, the Outcome variable should be a factor instead of an integer. We can also scale the numeric variables so that they are on a comparable range.
'data.frame': 768 obs. of 9 variables:
$ Pregnancies : num 0.64 -0.844 1.233 -0.844 -1.141 ...
$ Glucose : num 0.848 -1.123 1.942 -0.998 0.504 ...
$ BloodPressure : num 0.15 -0.16 -0.264 -0.16 -1.504 ...
$ SkinThickness : num 0.907 0.531 -1.287 0.154 0.907 ...
$ Insulin : num -0.692 -0.692 -0.692 0.123 0.765 ...
$ BMI : num 0.204 -0.684 -1.103 -0.494 1.409 ...
$ DiabetesPedigreeFunction: num 0.468 -0.365 0.604 -0.92 5.481 ...
$ Age : num 1.4251 -0.1905 -0.1055 -1.0409 -0.0205 ...
$ Outcome : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
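The type conversion and scaling described above could be sketched as follows. This is a minimal illustration, not the author's actual code: the data frame name `diabetes` is an assumption, and the toy data holds only the first ten rows of four columns as printed in the first str() output, so the scaled values will differ from those shown for the full 768 rows.

```r
# Toy data frame built from the first ten values printed by str() above
# (four of the nine columns only; the name `diabetes` is hypothetical).
diabetes <- data.frame(
  Pregnancies = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  Glucose     = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BMI         = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0),
  Outcome     = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1)
)

# The target becomes a two-level factor ("0" = negative, "1" = positive)
diabetes$Outcome <- factor(diabetes$Outcome, levels = c(0, 1))

# Every remaining column is centred to mean 0 and scaled to sd 1;
# as.numeric() drops the matrix attributes that scale() returns.
predictors <- setdiff(names(diabetes), "Outcome")
diabetes[predictors] <- lapply(
  diabetes[predictors],
  function(x) as.numeric(scale(x))
)

str(diabetes)
```

Scaling matters most for kNN later on, because distances would otherwise be dominated by the variables with the largest raw ranges.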
For safekeeping, we should check whether our data contains NA values using the is.na() function. If there are any, and they make up less than about 5% of our entire data, we can drop them using tidyr::drop_na().
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
[1,] FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE FALSE FALSE
DiabetesPedigreeFunction Age Outcome
[1,] FALSE FALSE FALSE
[2,] FALSE FALSE FALSE
[3,] FALSE FALSE FALSE
[4,] FALSE FALSE FALSE
[5,] FALSE FALSE FALSE
[6,] FALSE FALSE FALSE
[7,] FALSE FALSE FALSE
[8,] FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 760 rows ]
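Printing is.na() for every cell is hard to read on 768 rows. A more compact check could look like the sketch below. The data frame `df` is a hypothetical 40-row example with one missing value, and base R's na.omit() is used as a stand-in for tidyr::drop_na() so that the example runs without extra packages.

```r
# Hypothetical 40-row data frame with a single missing Glucose value
df <- data.frame(
  Glucose = c(NA, seq(80, by = 2, length.out = 39)),
  BMI     = seq(20, by = 0.5, length.out = 40)
)

# Number of NA values per column (one number per column, not one per cell)
na_per_column <- colSums(is.na(df))
print(na_per_column)

# Share of rows that contain at least one NA
na_row_share <- mean(!complete.cases(df))
print(na_row_share)

# Drop incomplete rows only when they are a small share of the data
threshold <- 0.05
if (na_row_share > 0 && na_row_share < threshold) {
  df_clean <- na.omit(df)  # base-R stand-in for tidyr::drop_na(df)
} else {
  df_clean <- df
}
print(nrow(df_clean))
```

On the diabetes data, the same colSums(is.na(...)) call would return zero for every column, which matches the all-FALSE output above.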
From our inspection, we can say that none of the values in our dataset are NA, so we can proceed to the creation of the classification model.
After tweaking our dataset, we can now split the data. In this case we use an 80:10:10 approach: 80% for model training, 10% for model testing, and the final 10% for model validation.
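One way to produce such a split is to shuffle the row indices once and cut them into three blocks. The sketch below is an assumption about how this could be done; the seed and variable names are hypothetical, and the exact subset sizes used in this article may differ slightly because the original splitting code is not shown.

```r
set.seed(100)  # hypothetical seed, only for reproducibility of the sketch

n   <- 768          # number of rows in the diabetes dataset
idx <- sample(n)    # random permutation of the row indices 1..n

n_train <- floor(0.8 * n)  # 80% of the rows for training
n_test  <- floor(0.1 * n)  # 10% of the rows for testing

train_idx <- idx[1:n_train]
test_idx  <- idx[(n_train + 1):(n_train + n_test)]
valid_idx <- idx[(n_train + n_test + 1):n]  # remainder for validation

# The three index sets do not overlap and together cover all rows
print(c(train = length(train_idx),
        test  = length(test_idx),
        valid = length(valid_idx)))
```

Each index vector can then be used to subset the data frame, for example `dt_train <- data[train_idx, ]`, where `data` stands for the prepared dataset.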
For our diabetes prediction, we will use two classification methods: logistic regression and k-Nearest Neighbour (kNN). We label patients who are diabetic positive with ‘1’ and patients who are diabetic negative with ‘0’.
Logistic regression models the probability that an observation belongs to a certain class of a binary dependent variable. It is used in machine learning for binary classification problems such as sick/healthy or sunny/rainy.
In this case, we want to create a model that classifies whether a patient has diabetes or not.
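As a rough sketch of the workflow, the example below fits a logistic regression with glm() and turns predicted probabilities into classes. It runs on simulated data, not on the diabetes dataset: the data frame `sim`, the coefficients used to simulate it, and the 0.5 cut-off are all assumptions made for illustration.

```r
set.seed(1)

# Simulated stand-in data: two scaled predictors and a binary outcome
n   <- 200
sim <- data.frame(Glucose = rnorm(n), BMI = rnorm(n))
p   <- plogis(-0.9 + 1.2 * sim$Glucose + 0.8 * sim$BMI)  # true probabilities
sim$Outcome <- factor(rbinom(n, 1, p), levels = c(0, 1))

# Logistic regression: with a factor response, glm() models the
# probability of the second level ("1")
fit <- glm(Outcome ~ ., family = "binomial", data = sim)

# Predicted probability of class "1" for each row
prob <- predict(fit, newdata = sim, type = "response")

# Convert probabilities to class labels with an assumed 0.5 cut-off
pred <- factor(ifelse(prob > 0.5, 1, 0), levels = c(0, 1))

print(table(Prediction = pred, Reference = sim$Outcome))
```

The same call shape, with `data = dt_train`, corresponds to the model summary printed in this section.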
Call:
glm(formula = Outcome ~ ., family = "binomial", data = dt_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4062 -0.7442 -0.4305 0.7602 2.9428
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.884485 0.107761 -8.208 2.25e-16 ***
Pregnancies 0.343678 0.119612 2.873 0.00406 **
Glucose 1.172099 0.136167 8.608 < 2e-16 ***
BloodPressure -0.291090 0.111768 -2.604 0.00920 **
SkinThickness 0.004958 0.120547 0.041 0.96719
Insulin -0.203966 0.121173 -1.683 0.09232 .
BMI 0.802722 0.134773 5.956 2.58e-09 ***
DiabetesPedigreeFunction 0.148443 0.106932 1.388 0.16507
Age 0.125647 0.124876 1.006 0.31433
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 796.91 on 615 degrees of freedom
Residual deviance: 591.80 on 607 degrees of freedom
AIC: 609.8
Number of Fisher Scoring iterations: 5
The model summary above shows that some variables have a fairly large p-value (>0.05), while others have a much smaller p-value (<0.05). We could use the current model to predict our test dataset, but it is better to first drop the variables with large p-values (SkinThickness, Insulin, DiabetesPedigreeFunction, and Age) to obtain a simpler model.
Call:
glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
BMI, family = "binomial", data = dt_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1785 -0.7421 -0.4331 0.7814 2.9301
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8768 0.1065 -8.235 < 2e-16 ***
Pregnancies 0.4204 0.1020 4.120 3.78e-05 ***
Glucose 1.1284 0.1208 9.342 < 2e-16 ***
BloodPressure -0.2827 0.1072 -2.638 0.00834 **
BMI 0.7699 0.1255 6.133 8.62e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 796.91 on 615 degrees of freedom
Residual deviance: 598.48 on 611 degrees of freedom
AIC: 608.48
Number of Fisher Scoring iterations: 5
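The two summaries can be compared using only the numbers they print. The sketch below assumes the usual relationship for a binary-response binomial GLM, where AIC equals the residual deviance plus twice the number of estimated coefficients, and adds a likelihood-ratio test on the deviance difference as one possible check; the test is an addition for illustration and was not part of the original analysis.

```r
# Values copied from the two model summaries above
dev_full    <- 591.80; k_full    <- 9  # intercept + 8 predictors
dev_reduced <- 598.48; k_reduced <- 5  # intercept + 4 predictors

# AIC = residual deviance + 2 * number of coefficients
aic_full    <- dev_full    + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced
print(c(full = aic_full, reduced = aic_reduced))

# Likelihood-ratio test: does dropping the 4 predictors hurt the fit?
lr_stat <- dev_reduced - dev_full   # difference in deviance
lr_df   <- k_full - k_reduced       # 4 dropped coefficients
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(p_value)
```

The reduced model has the slightly lower AIC (608.48 against 609.8), and a p-value above 0.05 in the likelihood-ratio test would indicate that the four dropped variables do not add a significant improvement.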
We now have a simpler model after dropping some variables. To check whether the model is fitted just right, overfit, or underfit, we compare its performance on both the training and the testing dataset. A sign of overfitting is performance on the training dataset that is clearly higher than on the testing dataset.
We now apply the model to both the train and the test dataset to see its performance.
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 353 95
1 48 120
Accuracy : 0.7679
95% CI : (0.7325, 0.8007)
No Information Rate : 0.651
P-Value [Acc > NIR] : 2.173e-10
Kappa : 0.4619
Mcnemar's Test P-Value : 0.0001197
Sensitivity : 0.5581
Specificity : 0.8803
Pos Pred Value : 0.7143
Neg Pred Value : 0.7879
Prevalence : 0.3490
Detection Rate : 0.1948
Detection Prevalence : 0.2727
Balanced Accuracy : 0.7192
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 43 13
1 7 14
Accuracy : 0.7403
95% CI : (0.6277, 0.8336)
No Information Rate : 0.6494
P-Value [Acc > NIR] : 0.05786
Kappa : 0.3989
Mcnemar's Test P-Value : 0.26355
Sensitivity : 0.5185
Specificity : 0.8600
Pos Pred Value : 0.6667
Neg Pred Value : 0.7679
Prevalence : 0.3506
Detection Rate : 0.1818
Detection Prevalence : 0.2727
Balanced Accuracy : 0.6893
'Positive' Class : 1
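The headline statistics in the two outputs above can be reproduced by hand from the four cells of each confusion matrix. The sketch below uses a hypothetical helper, `metrics()`, with ‘1’ as the positive class; rows are predictions and columns are reference values, as in the printed matrices.

```r
# Hypothetical helper: basic metrics from the four confusion-matrix cells
#   tn = predicted 0, reference 0    fn = predicted 0, reference 1
#   fp = predicted 1, reference 0    tp = predicted 1, reference 1
metrics <- function(tn, fn, fp, tp) {
  c(
    accuracy    = (tp + tn) / (tp + tn + fp + fn),
    sensitivity = tp / (tp + fn),  # share of true positives detected
    specificity = tn / (tn + fp)   # share of true negatives detected
  )
}

# Cell counts copied from the logistic regression outputs above
train <- metrics(tn = 353, fn = 95, fp = 48, tp = 120)
test  <- metrics(tn = 43,  fn = 13, fp = 7,  tp = 14)

print(round(rbind(train, test), 4))
```

The rounded results agree with the Accuracy, Sensitivity, and Specificity lines in both outputs.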
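We can simplify the performance above into an easy-to-read table (values taken from the two outputs above):
Dataset   Accuracy   Sensitivity   Specificity
Train     0.7679     0.5581        0.8803
Test      0.7403     0.5185        0.8600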
Comparing both performances, we can say that our logistic regression model is neither overfit nor underfit: accuracy drops only slightly, from 0.7679 on the training dataset to 0.7403 on the testing dataset. The model can therefore proceed to the validation dataset, where we will see its final performance.
k-Nearest Neighbour, or kNN for short, is a classification method that assigns an observation to a class based on its distance to other observations in the space of the predictors. kNN assumes that the closer two observations are, the more similar they are, and hence the more likely they are to belong to the same class.
Using the kNN method on our data, we will try to classify whether a patient is positive (has diabetes) or negative (healthy).
Before we go further, we first need to define the number of neighbours for kNN (denoted by ‘k’). A common rule of thumb takes the square root of the number of observations: \[k = \sqrt{n}\] where \(n\) is the number of rows in the data.
Applying this expression to our data gives a k value of 24. This is consistent with the 616 training rows, since \(\sqrt{616} \approx 24.8\), presumably rounded down. We then use this value to create our kNN model.
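As a quick check of this value, the calculation can be written out as below. The training size of 616 is read off the earlier outputs (615 null degrees of freedom plus one, and the total of the training confusion matrix); rounding down with floor() is an assumption that reproduces the reported 24.

```r
n_train <- 616              # training rows, as implied by the outputs above
k_raw   <- sqrt(n_train)    # square-root rule of thumb
k       <- floor(k_raw)     # rounded down to a whole number of neighbours

print(k_raw)
print(k)
```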
Our kNN model uses the predictors selected in the logistic regression earlier, namely Pregnancies, Glucose, BloodPressure, and BMI, which all have a low p-value (<0.05).
As before, we check whether our kNN model is fitted just right by applying it to both the training and the testing dataset and comparing the two performances.
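A kNN classifier of this kind could be sketched with knn() from the class package, which ships with standard R installations. The example uses simulated data with two scaled predictors rather than the diabetes dataset; the object names, the simulation coefficients, and the 240/60 split are all assumptions for illustration.

```r
library(class)  # provides knn()

set.seed(1)

# Simulated stand-in data: two scaled predictors and a binary class
n <- 300
x <- data.frame(Glucose = rnorm(n), BMI = rnorm(n))
y <- factor(
  rbinom(n, 1, plogis(-0.9 + 1.2 * x$Glucose + 0.8 * x$BMI)),
  levels = c(0, 1)
)

train_id <- 1:240                          # first 240 rows for training
k <- floor(sqrt(length(train_id)))         # square-root rule of thumb

# knn() has no separate fit step: it classifies `test` rows directly
# from the labelled `train` rows
pred_train <- knn(train = x[train_id, ], test = x[train_id, ],
                  cl = y[train_id], k = k)
pred_test  <- knn(train = x[train_id, ], test = x[-train_id, ],
                  cl = y[train_id], k = k)

acc_train <- mean(pred_train == y[train_id])
acc_test  <- mean(pred_test  == y[-train_id])
print(c(k = k, train = acc_train, test = acc_test))
```

Comparing `acc_train` with `acc_test` is the same overfitting check that is applied to the real model in this section.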
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 364 103
1 37 112
Accuracy : 0.7727
95% CI : (0.7376, 0.8053)
No Information Rate : 0.651
P-Value [Acc > NIR] : 3.644e-11
Kappa : 0.4615
Mcnemar's Test P-Value : 3.940e-08
Sensitivity : 0.5209
Specificity : 0.9077
Pos Pred Value : 0.7517
Neg Pred Value : 0.7794
Prevalence : 0.3490
Detection Rate : 0.1818
Detection Prevalence : 0.2419
Balanced Accuracy : 0.7143
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 41 14
1 9 13
Accuracy : 0.7013
95% CI : (0.5862, 0.8003)
No Information Rate : 0.6494
P-Value [Acc > NIR] : 0.2027
Kappa : 0.3149
Mcnemar's Test P-Value : 0.4042
Sensitivity : 0.4815
Specificity : 0.8200
Pos Pred Value : 0.5909
Neg Pred Value : 0.7455
Prevalence : 0.3506
Detection Rate : 0.1688
Detection Prevalence : 0.2857
Balanced Accuracy : 0.6507
'Positive' Class : 1
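For easy comparison, we put the results into an easy-to-read table (values taken from the two outputs above):
Dataset   Accuracy   Sensitivity   Specificity
Train     0.7727     0.5209        0.9077
Test      0.7013     0.4815        0.8200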
We can say that our kNN model is fitted reasonably well: accuracy is 0.7727 on the training dataset and 0.7013 on the testing dataset, a wider gap than for logistic regression but still a moderate one. The model can therefore proceed to the validation dataset, where we will see its final performance.
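The size of the train/test gap for each model can be computed directly from the confusion-matrix counts printed earlier. The helper `acc()` below is hypothetical; its arguments follow the order of the matrix cells (predicted 0/reference 0, predicted 0/reference 1, predicted 1/reference 0, predicted 1/reference 1).

```r
# Hypothetical helper: accuracy from the four confusion-matrix cells
acc <- function(tn, fn, fp, tp) (tn + tp) / (tn + fn + fp + tp)

# Train accuracy minus test accuracy, using the counts printed above
gap_logit <- acc(353, 95, 48, 120)  - acc(43, 13, 7, 14)
gap_knn   <- acc(364, 103, 37, 112) - acc(41, 14, 9, 13)

print(round(c(logistic = gap_logit, knn = gap_knn), 4))
```

A gap close to zero suggests a model that generalises about as well as it fits; the kNN gap is larger than the logistic regression gap, though the test set has only 77 rows, so both estimates are noisy.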
Our final result is the summary of the predictions on the validation set. The two methods give us the following:
1. Logistic Regression Model
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 43 12
1 6 14
Accuracy : 0.76
95% CI : (0.6475, 0.8511)
No Information Rate : 0.6533
P-Value [Acc > NIR] : 0.03172
Kappa : 0.4398
Mcnemar's Test P-Value : 0.23859
Sensitivity : 0.5385
Specificity : 0.8776
Pos Pred Value : 0.7000
Neg Pred Value : 0.7818
Prevalence : 0.3467
Detection Rate : 0.1867
Detection Prevalence : 0.2667
Balanced Accuracy : 0.7080
'Positive' Class : 1
2. kNN Model
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 47 8
1 2 18
Accuracy : 0.8667
95% CI : (0.7684, 0.9342)
No Information Rate : 0.6533
P-Value [Acc > NIR] : 2.757e-05
Kappa : 0.6888
Mcnemar's Test P-Value : 0.1138
Sensitivity : 0.6923
Specificity : 0.9592
Pos Pred Value : 0.9000
Neg Pred Value : 0.8545
Prevalence : 0.3467
Detection Rate : 0.2400
Detection Prevalence : 0.2667
Balanced Accuracy : 0.8257
'Positive' Class : 1
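The "P-Value [Acc > NIR]" line asks whether the accuracy is significantly better than always predicting the majority class (the No Information Rate). The sketch below assumes this is computed with a one-sided exact binomial test, which is my understanding of what the caret package does; the helper name `acc_vs_nir()` is hypothetical.

```r
# Hypothetical helper: one-sided exact binomial test of accuracy
# against the no-information rate (share of the largest reference class)
acc_vs_nir <- function(tn, fn, fp, tp) {
  n   <- tn + fn + fp + tp
  nir <- max(tn + fp, fn + tp) / n   # majority-class share
  binom.test(tn + tp, n, p = nir, alternative = "greater")$p.value
}

# Cell counts copied from the two validation outputs above
p_first  <- acc_vs_nir(tn = 43, fn = 12, fp = 6, tp = 14)
p_second <- acc_vs_nir(tn = 47, fn = 8,  fp = 2, tp = 18)

print(c(first = p_first, second = p_second))
```

Both p-values fall below 0.05, so on the validation set each model does better than the baseline of predicting ‘0’ for every patient.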
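The validation performance of both models is summarised in the table below (values taken from the two outputs above):
Model                 Accuracy   Sensitivity   Specificity
Logistic regression   0.7600     0.5385        0.8776
kNN                   0.8667     0.6923        0.9592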
On the validation dataset, both models give a good result: accuracy is 0.76 for logistic regression and 0.8667 for kNN, and both are above the no-information rate of 0.6533. Specificity is high for both models (0.8776 and 0.9592), so healthy patients are rarely flagged as diabetic. Sensitivity is lower (0.5385 and 0.6923), which means a notable share of diabetic patients is still predicted as negative. For detecting whether a patient has diabetes, this is the figure to watch, since a missed positive case is the more costly misprediction.
For our case of predicting diabetes from the recorded predictors, both logistic regression and kNN can be used, as both gave a good prediction result, with kNN scoring higher on this validation set of 75 patients.