# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

Intro

This article is made for completing the assignment for Algoritma : Machine Learning Course in Classification 1 Theia Batch 2022.

In this article, we will try to create model to predict if someone is diabetic or not based on many variables.

From CDC.gov, Diabetes is Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy.

With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.

There isn’t a cure yet for diabetes, but losing weight, eating healthy food, and being active can really help

Diabetes

There are three main types of diabetes: type 1, type 2, and gestational diabetes (diabetes while pregnant).

Type of Diabetes

The data was collected and made available by “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here belong to the Pima Indian heritage (subgroup of Native Americans), and are females of ages 21 and above.

Explaratory Data Analysis

Load Library

options(scipen = 99) #non active science annotation (example : e-1)
library(tidyverse) #load several important libraries 
library(dplyr) #grammar of data manipulation
library(ggplot2) #plotting data visualization
library(scales) #edit scales
library(GGally) #plotting data visualization

Load Dataset

#load dataset
diabetes <- read.csv("data-input/diabetes2.csv")
glimpse(diabetes)

#> Rows: 768
#> Columns: 9
#> $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
#> $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
#> $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
#> $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
#> $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
#> $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
#> $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
#> $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

We want to see how many unique values on each character columns.

c <- c(length(unique(diabetes$Pregnancies)), 
       length(unique(diabetes$Age)), 
       length(unique(diabetes$Outcome)))
c

#> [1] 17 52  2

Based on glimpse above, we change some column datatypes for easier analysis

diabetes <- diabetes %>%
  mutate(Outcome = as.factor(Outcome))

Columns Description

Pregnancies : Pregnancies Experienced, we assumed in the dataset every individual is a Woman
Glucose : Level of Glucose
BloodPressure : Blood Pressure Level
SkinThickness : Skin Thicknes of Individual
Insulin : Level of Insulin
BMI : Body Mass Index, Bodyweight Compared to Height.
DiabetesPedigreeFunction : Likelihood of Diabetes based on Family History
Age : Age of Individual
Outcome : The Result if individual is diabetes or not

Check Missing Value

colSums(is.na(diabetes))

#>              Pregnancies                  Glucose            BloodPressure 
#>                        0                        0                        0 
#>            SkinThickness                  Insulin                      BMI 
#>                        0                        0                        0 
#> DiabetesPedigreeFunction                      Age                  Outcome 
#>                        0                        0                        0

Pre Processing Dataset

Check Summary of Data

summary(diabetes)

#>   Pregnancies        Glucose      BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
#>  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
#>  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
#>  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>  Outcome
#>  0:500  
#>  1:268  
#>         
#>         
#>         
#>

We can see that the Outcome is not balance.

Divide into two dataset, one for training and one for testing

RNGkind(sample.kind = "Rounding")
set.seed(100)
library(rsample)

# Stratified Random Sampling

index_n <- initial_split(data = diabetes, # data yang digunakan untuk splitting
                         prop = 0.8,  # proporsi data train
                         strata = Outcome) # variabel target

diab_train <- training(index_n) # fungsi training bawaan dari library rsample
diab_test <- testing(index_n)

Check Proportion of Target Variable

prop.table(table(diab_train$Outcome))

#> 
#>         0         1 
#> 0.6514658 0.3485342

prop.table(table(diab_test$Outcome))

#> 
#>         0         1 
#> 0.6493506 0.3506494

Downsampling the data train to make better model, make it 50:50

# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)

diab_train <- downSample(x=diab_train %>% select(-Outcome), #hanya select variabel prediktor saja
                         y=diab_train$Outcome, # hanya selesct variabel target saja
                         yname="Outcome") # nama kolom target

prop.table(table(diab_train$Outcome))

#> 
#>   0   1 
#> 0.5 0.5

Modelling Data

We try several steps to make the best model we can.

Initial Model

Using all predictor for the Model

model_diab_1 <- glm(formula = Outcome ~.,
                   data = diab_train,
                   family = "binomial")
summary(model_diab_1)

#> 
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.6571  -0.8576  -0.0265   0.8390   2.7050  
#> 
#> Coefficients:
#>                           Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)              -7.758683   0.917167  -8.459 < 0.0000000000000002 ***
#> Pregnancies               0.154585   0.041463   3.728             0.000193 ***
#> Glucose                   0.036257   0.004957   7.315    0.000000000000258 ***
#> BloodPressure            -0.012327   0.006843  -1.801             0.071627 .  
#> SkinThickness             0.005792   0.009187   0.630             0.528409    
#> Insulin                  -0.001713   0.001210  -1.416             0.156778    
#> BMI                       0.093538   0.019499   4.797    0.000001609269932 ***
#> DiabetesPedigreeFunction  0.625532   0.373458   1.675             0.093940 .  
#> Age                       0.002779   0.011732   0.237             0.812788    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 593.33  on 427  degrees of freedom
#> Residual deviance: 442.81  on 419  degrees of freedom
#> AIC: 460.81
#> 
#> Number of Fisher Scoring iterations: 5

Stepwise Model

Using Stepwise to Pick Predictor from the Intial Model model_diab_1

# stepwise Backward
model_diab_2 <- step(object = model_diab_1, direction = "backward", trace = F)
summary(model_diab_2)

#> 
#> Call:
#> glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
#>     BMI + DiabetesPedigreeFunction, family = "binomial", data = diab_train)
#> 
#> Deviance Residuals: 
#>      Min        1Q    Median        3Q       Max  
#> -2.59464  -0.85677  -0.00793   0.82378   2.68100  
#> 
#> Coefficients:
#>                           Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)              -7.504248   0.873282  -8.593 < 0.0000000000000002 ***
#> Pregnancies               0.161202   0.036277   4.444   0.0000088466714820 ***
#> Glucose                   0.034095   0.004506   7.567   0.0000000000000381 ***
#> BloodPressure            -0.011195   0.006540  -1.712               0.0869 .  
#> BMI                       0.093157   0.018346   5.078   0.0000003817828481 ***
#> DiabetesPedigreeFunction  0.602369   0.365725   1.647               0.0995 .  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 593.33  on 427  degrees of freedom
#> Residual deviance: 444.93  on 422  degrees of freedom
#> AIC: 456.93
#> 
#> Number of Fisher Scoring iterations: 4

Prediction

Predict Data using our Models

diab_test$prob_diab_1<-predict(model_diab_1, type = "response", newdata = diab_test)
diab_test$prob_diab_2<-predict(model_diab_2, type = "response", newdata = diab_test)

Density Data using our Models

Using Model 1

ggplot(diab_test, aes(x=prob_diab_1)) +
  geom_density(lwd=1) +
  labs(title = "Distribution of Probability Prediction Data")

Using Model 2

ggplot(diab_test, aes(x=prob_diab_2)) +
  geom_density(lwd=1) +
  labs(title = "Distribution of Probability Prediction Data")

Based on Density Plot above, we can see that there are two peak of population. Peak on the right considered as Diabetic, and needs more observation to check if she has diabetes or not. Peak on the left considered as Healthy.

We Split the model using threshold 0.7

diab_test$pred_diab_1 <- factor(ifelse(diab_test$prob_diab_1 > 0.7, "1","0"))
diab_test$pred_diab_2 <- factor(ifelse(diab_test$prob_diab_2 > 0.7, "1","0"))

Model Evaluation

Confusion Matrix

library(caret)
log_conf_1 <- confusionMatrix(diab_test$pred_diab_1, diab_test$Outcome, positive = "1")
log_conf_1

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 92 24
#>          1  8 30
#>                                           
#>                Accuracy : 0.7922          
#>                  95% CI : (0.7195, 0.8533)
#>     No Information Rate : 0.6494          
#>     P-Value [Acc > NIR] : 0.00008061      
#>                                           
#>                   Kappa : 0.5103          
#>                                           
#>  Mcnemar's Test P-Value : 0.00801         
#>                                           
#>             Sensitivity : 0.5556          
#>             Specificity : 0.9200          
#>          Pos Pred Value : 0.7895          
#>          Neg Pred Value : 0.7931          
#>              Prevalence : 0.3506          
#>          Detection Rate : 0.1948          
#>    Detection Prevalence : 0.2468          
#>       Balanced Accuracy : 0.7378          
#>                                           
#>        'Positive' Class : 1               
#>

library(caret)
log_conf_2 <- confusionMatrix(diab_test$pred_diab_2, diab_test$Outcome, positive = "1")
log_conf_2

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 93 23
#>          1  7 31
#>                                           
#>                Accuracy : 0.8052          
#>                  95% CI : (0.7337, 0.8645)
#>     No Information Rate : 0.6494          
#>     P-Value [Acc > NIR] : 0.00001687      
#>                                           
#>                   Kappa : 0.5409          
#>                                           
#>  Mcnemar's Test P-Value : 0.00617         
#>                                           
#>             Sensitivity : 0.5741          
#>             Specificity : 0.9300          
#>          Pos Pred Value : 0.8158          
#>          Neg Pred Value : 0.8017          
#>              Prevalence : 0.3506          
#>          Detection Rate : 0.2013          
#>    Detection Prevalence : 0.2468          
#>       Balanced Accuracy : 0.7520          
#>                                           
#>        'Positive' Class : 1               
#>

It seems that the performance of these two models is not so different, but Model 2 prove could make better prediction

Model Optimization

We find the best CutOff (the number we split the data as positive class and negative class) using find optimum value of Sensitivity, Specificity, Precision, Accuracy.

# create function for performance
performa <- function(cutoff, prob, ref, postarget, negtarget) 
{
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict , ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]
  prec <- conf$byClass[3]
  spec <- conf$byClass[2]
  mat <- t(as.matrix(c(rec , acc , prec, spec))) 
  colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
  return(mat)
}

co <- seq(0.01,0.80,length=100)
result <- matrix(0,100,4)

Create Dataframe of Performance using Model 2

for(i in 1:100){
  result[i,] = performa(cutoff = co[i], 
                     prob = diab_test$prob_diab_2, 
                     ref = diab_test$Outcome, 
                     postarget = "1", 
                     negtarget = "0")
}
plot_data_2 <- data_frame("Recall" = result[,1], 
                          "Accuracy" = result[,2], 
                          "Precision" = result[,3], 
                          "Specificity" = result[,4], 
                          "Cutoff" = co) %>% 
  gather(key = "performa", value = "value", 1:4)

Plot of Model 2 Performance

plot_2 <- ggplot(data = plot_data_2, aes(x = Cutoff, y = value, col = performa)) + 
  geom_line(lwd = 1) +
  scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
  scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
  scale_x_continuous(breaks = seq(0,1,0.1)) +
  labs(title = "Model Performance Compared to Model Cutoff") +
  theme(legend.position = "top", 
        legend.title = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.grid.minor.x = element_blank())
plot_2

Based on business Model, we want the model to has more best Recall as possible because we don’t want to reduce False Negative. Which the person is predicted healthy but she has high chance has diabetes.

Based on the plot above, we can see that the optimum cutOFF for every performance : 0.5

Create New Prediction using Model 2

diab_test$pred_diab_3 <- factor(ifelse(diab_test$prob_diab_2 > 0.5, "1","0"))

Model Interpretation

# Odds ratio all coefficients
exp(model_diab_2$coefficients) %>% 
  data.frame() %>%
  arrange(desc(.))

Based on model interpretation above, we can see that on this dataset people with family history of diabetes is more likely to have diabetes and it is the most significant predictor.

K-Nearest Neighbor Modelling

Scale Data

head(diab_train, 5)

# predictor
diab_train_x <- diab_train %>% select_if(is.numeric)

diab_test_x <- diab_test %>% select_if(is.numeric)

# target
diab_train_y <- diab_train[,"Outcome"]

diab_test_y <- diab_test[,"Outcome"]

# scale predictor data

diab_train_xs <- scale(diab_train_x)

diab_test_xs <- scale(diab_train_x,
                      center = attr(diab_train_xs, "scaled:center"),
                      scale = attr(diab_train_xs, "scaled:scale"))

Create Model

Find Optimum K Number based on Number of Data

sqrt(nrow(diab_train))

#> [1] 20.68816

library(class) # package untuk fungsi `knn()`

pred_diab_knn <- knn(train = diab_train_xs,
                 test = diab_test_xs,
                 cl = diab_train_y,
                 k = 21)

KNN Model Performance

# confusion matrix
pred_knn_conf <- confusionMatrix(data = as.factor(pred_diab_knn), 
                                 reference = diab_train_y, 
                                 positive = "1")
pred_knn_conf

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 163  54
#>          1  51 160
#>                                              
#>                Accuracy : 0.7547             
#>                  95% CI : (0.7111, 0.7947)   
#>     No Information Rate : 0.5                
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.5093             
#>                                              
#>  Mcnemar's Test P-Value : 0.8453             
#>                                              
#>             Sensitivity : 0.7477             
#>             Specificity : 0.7617             
#>          Pos Pred Value : 0.7583             
#>          Neg Pred Value : 0.7512             
#>              Prevalence : 0.5000             
#>          Detection Rate : 0.3738             
#>    Detection Prevalence : 0.4930             
#>       Balanced Accuracy : 0.7547             
#>                                              
#>        'Positive' Class : 1                  
#>

Based on Confusion Matrix above, our KNN Model has slightly lower Accuracy compared to the Model 2, we will compare it further.

Logistic Regression vs KNN

Create Confusion Matrix for Model 2 with Cut Off 0.5 that we prove has the best performance in every Measurement.

log_conf <- confusionMatrix(diab_test$pred_diab_3, diab_test$Outcome, positive = "1")
log_conf

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 78 11
#>          1 22 43
#>                                           
#>                Accuracy : 0.7857          
#>                  95% CI : (0.7124, 0.8477)
#>     No Information Rate : 0.6494          
#>     P-Value [Acc > NIR] : 0.0001665       
#>                                           
#>                   Kappa : 0.5505          
#>                                           
#>  Mcnemar's Test P-Value : 0.0817228       
#>                                           
#>             Sensitivity : 0.7963          
#>             Specificity : 0.7800          
#>          Pos Pred Value : 0.6615          
#>          Neg Pred Value : 0.8764          
#>              Prevalence : 0.3506          
#>          Detection Rate : 0.2792          
#>    Detection Prevalence : 0.4221          
#>       Balanced Accuracy : 0.7881          
#>                                           
#>        'Positive' Class : 1               
#>

eval_logit <- data_frame(Accuracy = log_conf$overall[1],
           Recall = log_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = log_conf$byClass[3])

eval_knn <- data_frame(Accuracy = pred_knn_conf$overall[1],
           Recall = pred_knn_conf$byClass[1],
           Specificity = log_conf$byClass[2],
           Precision = pred_knn_conf$byClass[3])

eval_logit

eval_knn

Conclusion

Based on model comparation above, it seems that Logistic Regression perform better for our application because it has significantly higher Recall Value compared to the KNN Method.

Comparation between Model 1 which using all predictor and Model 2 which using several predictor is not very different, it maybe because the sample is too small to see the significance the rest of the Predictors.

It is better to predict someone as Diabetic wrong rather than predict someone as Not Diabetic wrong. Why? because the treatment for Diabetic symptoms is to pursue better Lifestyle like eating less-sugar, exercise more, and have enough sleep. This treatment will also beneficial to someone that is Not Diabetic

For more advanced treatment, it must be take further investigation to the patient.

Learning By Building - Logistic Regression and KNN : Diabetes Prediction

Muhammad Irfani

November 21, 2022