Objective

The dataset used was the Pima Indian Diabetes dataset from Machine Learning Repository (originally from National Institute of Diabetes and Digestive and Kidney Disease) which contains 8 medical diagnostic attributes and one target variable (i.e, Outcome) of 768 female patients with 34.9% having diabetes (268 patients). The variance for insulin for both categories was quite high. This dataset is used to predict whether a person with certain medical diagnostic attributes is likely to have a diabetes or not.

Data Preparation

#Loading Library
library(dplyr)
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)
#Import csv data to data frame
diabetes <- read.csv("diabetes.csv")

Data Wrangling

#Inspect Data
glimpse(diabetes)
#> Rows: 768
#> Columns: 9
#> $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
#> $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
#> $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
#> $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
#> $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
#> $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
#> $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
#> $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~

From a glimpse, there is a few data that doesn’t make sense, for example BMI with 0 value, etc.

#Filling the data with mean value of each corresponding columns
diabetes_clean <- diabetes %>% 
  mutate(Outcome=as.factor(Outcome),
         Glucose=ifelse(Glucose==0,mean(Glucose),Glucose),
         BMI=ifelse(BMI==0,mean(BMI),BMI),
         BloodPressure=ifelse(BloodPressure==0,mean(BloodPressure),BloodPressure),
         SkinThickness=ifelse(SkinThickness==0,mean(SkinThickness),SkinThickness),
         Insulin=ifelse(Insulin==0,mean(Insulin),Insulin))

diabetes_clean
#Check target variable proportion
prop.table(table(diabetes_clean$Outcome))
#> 
#>         0         1 
#> 0.6510417 0.3489583

0 - not diabetic (65%) 1 - diabetic (35%) Both data from each class is balanced

Exploratory Data Analysis

#Checking missing value
colSums(is.na(diabetes_clean))
#>              Pregnancies                  Glucose            BloodPressure 
#>                        0                        0                        0 
#>            SkinThickness                  Insulin                      BMI 
#>                        0                        0                        0 
#> DiabetesPedigreeFunction                      Age                  Outcome 
#>                        0                        0                        0
#Checking 5 nummber summary
summary(diabetes_clean)
#>   Pregnancies        Glucose       BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
#>  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:20.54  
#>  Median : 3.000   Median :117.00   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :121.68   Mean   : 72.25   Mean   :26.61  
#>  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.: 79.8   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 79.8   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   :118.7   Mean   :32.45   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>  Outcome
#>  0:500  
#>  1:268  
#>         
#>         
#>         
#> 

Cross-Validation

From the data, we can try to split train-test data with 75%-25%

#Splitting data to data train with 75% observation and the rest 25% observation to data test
RNGkind(sample.kind = "Rounding")
set.seed(1234)

index <- sample(x = nrow(diabetes_clean),
                size = nrow(diabetes_clean)*0.75)

# splitting
diabetes_train <- diabetes_clean[index, ]
diabetes_test <- diabetes_clean[-index, ]
diabetes_test_knn <- diabetes_clean[-index, ]

Logistic Regression

After we clean, and split the data, now we can go ahead and try to build the logistic regression model

Modeling

#Build model with all predictor
model_all <- glm(Outcome~.,
                 data=diabetes_train,
                 family="binomial")

summary(model_all)
#> 
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.7268  -0.7270  -0.4063   0.7182   2.4307  
#> 
#> Coefficients:
#>                            Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)              -9.0444268  0.9343913  -9.679  < 2e-16 ***
#> Pregnancies               0.1225614  0.0369163   3.320 0.000900 ***
#> Glucose                   0.0377215  0.0045713   8.252  < 2e-16 ***
#> BloodPressure            -0.0054068  0.0099837  -0.542 0.588117    
#> SkinThickness            -0.0049890  0.0128468  -0.388 0.697761    
#> Insulin                  -0.0002465  0.0011973  -0.206 0.836900    
#> BMI                       0.0892791  0.0211122   4.229 2.35e-05 ***
#> DiabetesPedigreeFunction  1.1539762  0.3399108   3.395 0.000686 ***
#> Age                       0.0089110  0.0107678   0.828 0.407917    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 753.47  on 575  degrees of freedom
#> Residual deviance: 541.73  on 567  degrees of freedom
#> AIC: 559.73
#> 
#> Number of Fisher Scoring iterations: 5
#Feature Selecting with backward model
model_step <- step(object = model_all,
                   direction = "both",
                   trace = F)

summary(model_step)
#> 
#> Call:
#> glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
#>     family = "binomial", data = diabetes_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.7040  -0.7197  -0.4022   0.7389   2.4426  
#> 
#> Coefficients:
#>                          Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)              -9.07217    0.81761 -11.096  < 2e-16 ***
#> Pregnancies               0.13557    0.03166   4.282 1.86e-05 ***
#> Glucose                   0.03786    0.00410   9.234  < 2e-16 ***
#> BMI                       0.08055    0.01749   4.605 4.12e-06 ***
#> DiabetesPedigreeFunction  1.14537    0.33640   3.405 0.000662 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 753.47  on 575  degrees of freedom
#> Residual deviance: 542.75  on 571  degrees of freedom
#> AIC: 552.75
#> 
#> Number of Fisher Scoring iterations: 5

Prediction

#From the first 5 observations on data test, in form of log of odds
predict(object = model_step, 
        newdata = diabetes_test[1:5,], 
        type = "link")
#>          3          4          5          6         17 
#>  1.5874580 -3.1121854  2.2071801 -1.7101047 -0.2841825
#From the first 5 observations on data test, in form of probability
predict(object = model_step, 
        newdata = diabetes_test[1:5,], 
        type = "response")
#>          3          4          5          6         17 
#> 0.83025815 0.04260741 0.90089243 0.15315013 0.42942867
#Adding prediction column to the test data
diabetes_test$prediction <- predict(object = model_step,
                                 newdata = diabetes_test,
                                 type = "response")
diabetes_test$diabetic <- ifelse(test = diabetes_test$prediction > 0.5,
                                yes = "1",
                                no = "0")

diabetes_test <- diabetes_test %>% 
  mutate(diabetic = as.factor(diabetic))

diabetes_test

Model Evaluation

confusionMatrix(data = diabetes_test$diabetic,
                reference = diabetes_test$Outcome,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 119  27
#>          1  13  33
#>                                           
#>                Accuracy : 0.7917          
#>                  95% CI : (0.7273, 0.8468)
#>     No Information Rate : 0.6875          
#>     P-Value [Acc > NIR] : 0.0008541       
#>                                           
#>                   Kappa : 0.4822          
#>                                           
#>  Mcnemar's Test P-Value : 0.0398326       
#>                                           
#>             Sensitivity : 0.5500          
#>             Specificity : 0.9015          
#>          Pos Pred Value : 0.7174          
#>          Neg Pred Value : 0.8151          
#>              Prevalence : 0.3125          
#>          Detection Rate : 0.1719          
#>    Detection Prevalence : 0.2396          
#>       Balanced Accuracy : 0.7258          
#>                                           
#>        'Positive' Class : 1               
#> 

Because we are predicting whether a person is diabetic or not, which means one class is more important than the others, we going to use either Recall or Precision

#Recall/Sensitivity
33 / (33 + 27)
#> [1] 0.55
#Pos Pred Value/Precision
33 / (33 + 13)
#> [1] 0.7173913

Model Tunning

We are going to focus on the False Negative point, because there might be dangerous when a patient is falsely diagnosed. So we will focus on the Recall metrics.

#Decreasing threshold value to 0.35
diabetes_test$diabetic_tunning <- ifelse(test = diabetes_test$prediction > 0.35,
                                yes = "1",
                                no = "0")

diabetes_test <- diabetes_test %>% 
  mutate(diabetic_tunning = as.factor(diabetic_tunning))

diabetes_test
confusionMatrix(data = diabetes_test$diabetic_tunning,
                reference = diabetes_test$Outcome,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 100  14
#>          1  32  46
#>                                           
#>                Accuracy : 0.7604          
#>                  95% CI : (0.6937, 0.8189)
#>     No Information Rate : 0.6875          
#>     P-Value [Acc > NIR] : 0.01614         
#>                                           
#>                   Kappa : 0.4846          
#>                                           
#>  Mcnemar's Test P-Value : 0.01219         
#>                                           
#>             Sensitivity : 0.7667          
#>             Specificity : 0.7576          
#>          Pos Pred Value : 0.5897          
#>          Neg Pred Value : 0.8772          
#>              Prevalence : 0.3125          
#>          Detection Rate : 0.2396          
#>    Detection Prevalence : 0.4062          
#>       Balanced Accuracy : 0.7621          
#>                                           
#>        'Positive' Class : 1               
#> 

After tunning, the Recall metrics on the model increases to 76%. It is concluded that the model is 76% accurate to predict whether someone is diabetic or not, according to few medical variables that use to diagnose diabetes

K-Nearest Neighbor

Scaling

# Diabetes Train data predictor
diabetes_train_predictor <- diabetes_train %>% 
  select(-Outcome)

# Diabetes Train data target
diabetes_train_target <- diabetes_train %>% 
  pull(Outcome)
# Diabetes Train data predictor
diabetes_test_predictor <- diabetes_test_knn %>% 
  select(-Outcome)

# Diabetes Train data target
diabetes_test_target <- diabetes_test_knn %>% 
  pull(Outcome)
# Scaling data train diabetes predictor only
diabetes_train_predictor_scale <- diabetes_train_predictor %>% 
  scale()
# Scaling data test diabetes predictor only
diabetes_test_predictor_scale <- diabetes_test_predictor %>% 
  scale(center = attr(diabetes_train_predictor_scale, "scaled:center"),
        scale = attr(diabetes_train_predictor_scale, "scaled:scale"))

Prediction

#Finding optimum k
sqrt(nrow(diabetes_train_predictor))
#> [1] 24
diabetic_pred <- knn(train = diabetes_train_predictor_scale,
                     test = diabetes_test_predictor_scale,
                     cl = diabetes_train_target,
                     k = 24)
confusionMatrix(data = diabetic_pred,
                reference = diabetes_test_target,
                positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 112  29
#>          1  20  31
#>                                          
#>                Accuracy : 0.7448         
#>                  95% CI : (0.677, 0.8048)
#>     No Information Rate : 0.6875         
#>     P-Value [Acc > NIR] : 0.04913        
#>                                          
#>                   Kappa : 0.3807         
#>                                          
#>  Mcnemar's Test P-Value : 0.25310        
#>                                          
#>             Sensitivity : 0.5167         
#>             Specificity : 0.8485         
#>          Pos Pred Value : 0.6078         
#>          Neg Pred Value : 0.7943         
#>              Prevalence : 0.3125         
#>          Detection Rate : 0.1615         
#>    Detection Prevalence : 0.2656         
#>       Balanced Accuracy : 0.6826         
#>                                          
#>        'Positive' Class : 1              
#> 

With K = 24, we got around 74% of Accuracy, and Recall/Sensitivity of 51%.

Conclusion

According to the metrics of both models, both models have perform good, in terms of accuracy from both we got around 76% from Logistic Regression model after some tuning, while having Recall/Sensitivity of 76%. On K-NN model, with K value of 24, we got accuracy of 74%, and Recall/Sensitivity of 51%. Because we focusing on the Recall metrics, its concluded that the Logistic Regression model is better to use here. However, the K-NN model can also be improved by oversampling, which means more observational data, and will probably took more time to do so.