Introduction

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Since all values in this dataset are numerical, I intend to build machine learning models that predict whether a patient has diabetes from these numerical variables, both as practice in building classification models with Logistic Regression and K-Nearest Neighbors and to fulfill my assignment at Algoritma Data Science School.

Data Processing

Pre-processing

Define libraries
library(rsample)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(class)
library(gtools)
## 
## Attaching package: 'gtools'
## The following object is masked from 'package:rsample':
## 
##     permutations
library(cmplot)
## Warning: replacing previous import 'ggplot2::last_plot' by 'plotly::last_plot'
## when loading 'cmplot'
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Read and preview the data
diabetes <- read.csv("data/diabetes_data.csv")
diabetes

Data description:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

The Outcome variable will be our target, where 0 is categorized as negative and 1 as positive for diabetes.

Change the Outcome value to a readable value
diabetes <- diabetes %>% 
  mutate(Outcome = as.factor(case_when(Outcome == 1 ~ "Positive Diabetes",
                             T ~ "Negative Diabetes")))

Cross validation

Split train and test data
set.seed(129)

index <- sample(nrow(diabetes), nrow(diabetes) * 0.75)
diab_train <- diabetes[index,]
diab_test <- diabetes[-index,]
Observe the proportion of the Outcome
table(diab_train$Outcome)
## 
## Negative Diabetes Positive Diabetes 
##               367               209
prop.table(table(diab_train$Outcome))
## 
## Negative Diabetes Positive Diabetes 
##         0.6371528         0.3628472

Since the outcome classes are not balanced, we need to balance the data before we build a model on it. I chose downSample() because the difference between the two class counts is not too large.

Balancing
set.seed(129)
ds_diab_train <- downSample(x = diab_train %>% select(-Outcome),
                            y = diab_train$Outcome,
                            yname = "Outcome")

table(ds_diab_train$Outcome)
## 
## Negative Diabetes Positive Diabetes 
##               209               209

Modeling

Logistic Regression

For comparison, I will create one Logistic Regression model using all possible predictors and another using automatic backward stepwise selection.

Modeling
model_diab <- glm(Outcome ~ ., ds_diab_train, family = "binomial")
model_diab_step <- step(glm(Outcome ~ ., ds_diab_train, family = "binomial"), direction = "backward")
## Start:  AIC=414.8
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - Insulin                   1   396.82 412.82
## - SkinThickness             1   397.16 413.16
## <none>                          396.80 414.80
## - BloodPressure             1   399.51 415.51
## - DiabetesPedigreeFunction  1   400.12 416.12
## - Age                       1   404.01 420.01
## - Pregnancies               1   407.13 423.13
## - BMI                       1   416.11 432.11
## - Glucose                   1   465.64 481.64
## 
## Step:  AIC=412.82
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   397.38 411.38
## <none>                          396.82 412.82
## - BloodPressure             1   399.51 413.51
## - DiabetesPedigreeFunction  1   400.13 414.13
## - Age                       1   404.02 418.02
## - Pregnancies               1   407.16 421.16
## - BMI                       1   416.22 430.22
## - Glucose                   1   475.38 489.38
## 
## Step:  AIC=411.38
## Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction + 
##     Age
## 
##                            Df Deviance    AIC
## <none>                          397.38 411.38
## - DiabetesPedigreeFunction  1   400.35 412.35
## - BloodPressure             1   400.50 412.50
## - Age                       1   405.13 417.13
## - Pregnancies               1   407.79 419.79
## - BMI                       1   416.87 428.87
## - Glucose                   1   476.35 488.35
Model summary
summary(model_diab)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = ds_diab_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.83497  -0.72036   0.04313   0.74668   2.83781  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.4209307  0.9486984  -8.876  < 2e-16 ***
## Pregnancies               0.1408466  0.0448686   3.139  0.00169 ** 
## Glucose                   0.0370386  0.0051567   7.183 6.84e-13 ***
## BloodPressure            -0.0128447  0.0078927  -1.627  0.10365    
## SkinThickness            -0.0058008  0.0097002  -0.598  0.54984    
## Insulin                  -0.0001893  0.0013216  -0.143  0.88610    
## BMI                       0.0802895  0.0196893   4.078 4.55e-05 ***
## DiabetesPedigreeFunction  0.7190993  0.4024713   1.787  0.07398 .  
## Age                       0.0373147  0.0141490   2.637  0.00836 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 579.47  on 417  degrees of freedom
## Residual deviance: 396.80  on 409  degrees of freedom
## AIC: 414.8
## 
## Number of Fisher Scoring iterations: 5
summary(model_diab_step)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = ds_diab_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.85523  -0.71935   0.03976   0.74589   2.82588  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.348991   0.932664  -8.952  < 2e-16 ***
## Pregnancies               0.141323   0.044855   3.151  0.00163 ** 
## Glucose                   0.036874   0.004846   7.609 2.77e-14 ***
## BloodPressure            -0.013551   0.007759  -1.746  0.08073 .  
## BMI                       0.075483   0.018401   4.102 4.09e-05 ***
## DiabetesPedigreeFunction  0.670956   0.396784   1.691  0.09084 .  
## Age                       0.038521   0.014089   2.734  0.00625 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 579.47  on 417  degrees of freedom
## Residual deviance: 397.38  on 411  degrees of freedom
## AIC: 411.38
## 
## Number of Fisher Scoring iterations: 5

According to the stepwise model, a patient's skin thickness is not a significant predictor, which is reasonable: in most cases, skin fold thickness does not change with diabetes status. The insulin result may seem surprising, since insulin is the hormone that regulates blood sugar, and people with diabetes either do not produce enough of it or stop producing it naturally altogether. However, the Insulin predictor here is the 2-hour post-meal measurement. Laboratories more often rely on a longer fast (around 12 hours after eating) rather than the 2-hour value, which is typically used for personal at-home checks to regularly monitor sugar intake. In this light, it is understandable that the Insulin predictor is not a significant contributor to the target.

Odds and Probability
exp(0.670956)       # odds ratio for the DiabetesPedigreeFunction estimate (stepwise model)
## [1] 1.956106
inv.logit(0.670956) # probability equivalent of the same coefficient
## [1] 0.6617172

Predictors with negative values in the Estimate column increase the chance of testing negative for diabetes as their value grows, while predictors with positive estimates decrease that chance. Taking the DiabetesPedigreeFunction predictor as an example: when its value increases by 1, the odds of a person suffering from diabetes become 1.96 times higher, which corresponds to a probability of about 66.17%.
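For reference, the two numbers above are connected by the logistic link; the short sketch below spells out the arithmetic for the DiabetesPedigreeFunction coefficient from the stepwise model (the object names are mine, and inv.logit(x) from gtools is simply 1 / (1 + exp(-x))).

beta <- 0.670956      # DiabetesPedigreeFunction estimate from model_diab_step
odds <- exp(beta)     # ~1.956: odds are multiplied by ~1.96 per one-unit increase
odds / (1 + odds)     # ~0.6617: the same value inv.logit(beta) returns
1 / (1 + exp(-beta))  # equivalent definition of inv.logit()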

Prediction and evaluation
pred_diab <- predict(model_diab, diab_test, type = "response")
pred_diab_step <- predict(model_diab_step, diab_test, type = "response")

diab_test$pred_diab <- factor(ifelse(pred_diab > 0.5, "Positive Diabetes", "Negative Diabetes"))
diab_test$pred_diab_step <- factor(ifelse(pred_diab_step > 0.5, "Positive Diabetes", "Negative Diabetes"))
confusionMatrix(diab_test$pred_diab, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Negative Diabetes Positive Diabetes
##   Negative Diabetes                98                21
##   Positive Diabetes                35                38
##                                            
##                Accuracy : 0.7083           
##                  95% CI : (0.6386, 0.7716) 
##     No Information Rate : 0.6927           
##     P-Value [Acc > NIR] : 0.35110          
##                                            
##                   Kappa : 0.3573           
##                                            
##  Mcnemar's Test P-Value : 0.08235          
##                                            
##             Sensitivity : 0.6441           
##             Specificity : 0.7368           
##          Pos Pred Value : 0.5205           
##          Neg Pred Value : 0.8235           
##              Prevalence : 0.3073           
##          Detection Rate : 0.1979           
##    Detection Prevalence : 0.3802           
##       Balanced Accuracy : 0.6905           
##                                            
##        'Positive' Class : Positive Diabetes
## 
confusionMatrix(diab_test$pred_diab_step, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Negative Diabetes Positive Diabetes
##   Negative Diabetes                98                20
##   Positive Diabetes                35                39
##                                            
##                Accuracy : 0.7135           
##                  95% CI : (0.644, 0.7763)  
##     No Information Rate : 0.6927           
##     P-Value [Acc > NIR] : 0.29452          
##                                            
##                   Kappa : 0.3716           
##                                            
##  Mcnemar's Test P-Value : 0.05906          
##                                            
##             Sensitivity : 0.6610           
##             Specificity : 0.7368           
##          Pos Pred Value : 0.5270           
##          Neg Pred Value : 0.8305           
##              Prevalence : 0.3073           
##          Detection Rate : 0.2031           
##    Detection Prevalence : 0.3854           
##       Balanced Accuracy : 0.6989           
##                                            
##        'Positive' Class : Positive Diabetes
## 

In terms of raw performance, the stepwise model has 0.52 percentage points higher accuracy and 1.69 percentage points higher sensitivity than the model using all predictors, so there is almost no difference between the two. In a medical setting, however, a false negative prediction can result in a fatal health condition or even death. We therefore want the model to catch as many patients who actually have diabetes as possible, so we can tune the classification cutoff to increase its sensitivity (recall).

Performance plot
confmat_plot(pred_diab,
             diab_test$Outcome,
             postarget = "Positive Diabetes",
             negtarget = "Negative Diabetes")
## Warning in confusionMatrix.default(predict, ref, positive = postarget): Levels
## are not in the same order for reference and data. Refactoring data to match.

The plot above helps us visualize what happens when we increase or decrease the cutoff value of our model. Since there is not much difference in performance between the stepwise model and the model with all predictors, I will only use the stepwise model for tuning.
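Before settling on a cutoff, it can also help to tabulate the metrics at several candidate values rather than reading them off the plot alone. The sketch below is my own illustration (the cutoff grid and object names are not part of the original workflow); it reuses pred_diab_step and caret::confusionMatrix().

# sketch: accuracy/sensitivity/specificity of the stepwise model across candidate cutoffs
cutoffs <- seq(0.2, 0.8, by = 0.1)

cutoff_metrics <- t(sapply(cutoffs, function(co) {
  pred <- factor(ifelse(pred_diab_step > co, "Positive Diabetes", "Negative Diabetes"),
                 levels = levels(diab_test$Outcome))
  cm <- confusionMatrix(pred, diab_test$Outcome, positive = "Positive Diabetes")
  c(cutoff      = co,
    accuracy    = unname(cm$overall["Accuracy"]),
    sensitivity = unname(cm$byClass["Sensitivity"]),
    specificity = unname(cm$byClass["Specificity"]))
}))

cutoff_metrics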

Model tuning
diab_test$pred_diab_step <- factor(ifelse(pred_diab_step > 0.3, "Positive Diabetes", "Negative Diabetes"))

confusionMatrix(diab_test$pred_diab_step, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Negative Diabetes Positive Diabetes
##   Negative Diabetes                73                10
##   Positive Diabetes                60                49
##                                            
##                Accuracy : 0.6354           
##                  95% CI : (0.5631, 0.7035) 
##     No Information Rate : 0.6927           
##     P-Value [Acc > NIR] : 0.9624           
##                                            
##                   Kappa : 0.307            
##                                            
##  Mcnemar's Test P-Value : 4.724e-09        
##                                            
##             Sensitivity : 0.8305           
##             Specificity : 0.5489           
##          Pos Pred Value : 0.4495           
##          Neg Pred Value : 0.8795           
##              Prevalence : 0.3073           
##          Detection Rate : 0.2552           
##    Detection Prevalence : 0.5677           
##       Balanced Accuracy : 0.6897           
##                                            
##        'Positive' Class : Positive Diabetes
## 

With a cutoff of 0.3, we obtain a sensitivity of 83.05% with an overall accuracy of 63.54%. Below a cutoff of 0.3, we would give up too much accuracy, which results in too many false positive diagnoses.

K Nearest Neighbor

Besides Logistic Regression, we can use the K-Nearest Neighbor method to predict which patients have diabetes. However, we first have to make sure the predictors are on comparable scales. Simply inspecting the values in the data frame tells us whether scaling is needed; in this case it is, so the data must be scaled before we build the K-NN model, as shown below.
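A quick way to check is to compare the predictor ranges, for example with summary(): variables such as Glucose and Insulin run into the hundreds, while DiabetesPedigreeFunction stays in the low single digits, so the columns clearly sit on very different scales. This check is a minimal sketch and not part of the original pipeline.

# inspect predictor ranges to see whether scaling is needed
summary(diab_train %>% select(-Outcome))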

Scaling using z-score standardization
diab_train_norm <- diab_train %>% mutate_if(is.numeric, scale)
diab_test_norm <- diab_test %>% 
  select(-c(pred_diab, pred_diab_step)) %>% 
  mutate_if(is.numeric, scale)
Balancing
set.seed(129)
ds_diab_train_norm <- downSample(x = diab_train_norm %>% select(-Outcome),
                            y = diab_train_norm$Outcome,
                            yname = "Outcome")
Train Test
diab_train_x <- ds_diab_train_norm %>% select(-Outcome)
diab_test_x <- diab_test_norm %>% select(-Outcome)
diab_train_y <- ds_diab_train_norm$Outcome
diab_test_y <- diab_test_norm$Outcome

To choose the K value for the knn() function, a common rule of thumb is to take the square root of the number of training observations. When the result is an even number, we reduce it by 1 to make it odd, so that the majority vote between the two classes cannot tie.

Optimum K value
round(sqrt(nrow(diab_train_x)))
## [1] 20
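Since the square root comes out even (20), I reduce it by 1 and use k = 19. A small sketch of expressing that rule programmatically (k_opt is an illustrative name, not part of the original code):

k_opt <- round(sqrt(nrow(diab_train_x)))  # 20 for this training set
if (k_opt %% 2 == 0) k_opt <- k_opt - 1   # force an odd k to avoid tied votes -> 19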
Prediction and model evaluation
pred_diab_knn <- knn(train = diab_train_x, 
                test = diab_test_x, 
                cl = diab_train_y, 
                k = 19)

confusionMatrix(pred_diab_knn, diab_test_y, "Positive Diabetes")
## Confusion Matrix and Statistics
## 
##                    Reference
## Prediction          Negative Diabetes Positive Diabetes
##   Negative Diabetes                88                18
##   Positive Diabetes                45                41
##                                            
##                Accuracy : 0.6719           
##                  95% CI : (0.6006, 0.7378) 
##     No Information Rate : 0.6927           
##     P-Value [Acc > NIR] : 0.760793         
##                                            
##                   Kappa : 0.3163           
##                                            
##  Mcnemar's Test P-Value : 0.001054         
##                                            
##             Sensitivity : 0.6949           
##             Specificity : 0.6617           
##          Pos Pred Value : 0.4767           
##          Neg Pred Value : 0.8302           
##              Prevalence : 0.3073           
##          Detection Rate : 0.2135           
##    Detection Prevalence : 0.4479           
##       Balanced Accuracy : 0.6783           
##                                            
##        'Positive' Class : Positive Diabetes
## 

Although the K-NN accuracy is 4.16 percentage points lower than that of the stepwise model, its sensitivity is 3.39 percentage points higher than the stepwise model's before tuning.
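For readability, the headline metrics quoted above can be collected from the confusion matrices into one small table. This is only a restatement of the printed results, not a new computation.

# recap of the confusion-matrix results reported earlier in this section
data.frame(
  model       = c("Logistic, all predictors", "Logistic, stepwise",
                  "Logistic, stepwise (cutoff 0.3)", "K-NN, k = 19"),
  accuracy    = c(0.7083, 0.7135, 0.6354, 0.6719),
  sensitivity = c(0.6441, 0.6610, 0.8305, 0.6949),
  specificity = c(0.7368, 0.7368, 0.5489, 0.6617)
)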

Conclusion

Both methods can predict the target with reasonable accuracy. However, Logistic Regression can be tuned to achieve the desired performance and is interpretable. Meanwhile, the K-NN model, which often performs better than Logistic Regression, was not as flexible in this case and is not interpretable. For these reasons, Logistic Regression is better suited to this problem than K-NN.