Introduction

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

The dataset was obtained from Kaggle.

Importing data

diabetes <- read.csv("diabetes.csv")

Importing the necessary tools

library(ggthemes)
library(magrittr)
library(tidyverse)
library(dplyr)
library(inspectdf)
library(e1071)
library(caret)
library(doParallel)
theme_set(theme_minimal())

Exploratory Data Analysis

Overview of the data

glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~

We have a dataset of 768 observations on 9 variables:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age in years
  • Outcome: Target variable (0 or 1), where 1 means the patient has diabetes

We will convert the Outcome variable to a factor to make it categorical:

diabetes$Outcome %<>% as.factor()

Description of variables

summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##  Outcome
##  0:500  
##  1:268  
##         
##         
##         
## 

For some variables a value of 0 makes no sense and can only mean that the actual value is missing, so we will replace 0 with NA in these variables: Glucose, BloodPressure, SkinThickness, Insulin, and BMI.

cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
diabetes[,cols][diabetes[,cols] == 0] <- NA

Missing values

colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

To fill these NA values, the distribution of each predictor needs to be understood with respect to the target. For every predictor with missing values we will compute the median separately for each level of the target variable and impute with that class-specific median.

median_target <- function(v) {
  df <- diabetes %>% 
    filter(!is.na((!!as.symbol(v))))
  
  df <- df %>% select((!!as.symbol(v)), Outcome) %>%
    group_by(Outcome) %>%
    summarise(median = median((!!as.symbol(v))))
  return(df)
}

# na_median <- function(v)
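
The manual, per-variable fills below could also be automated. The following is a minimal sketch (the helper name impute_median_by_outcome is hypothetical and not part of the original analysis) that fills the NAs in one column with the Outcome-specific median:

impute_median_by_outcome <- function(data, v) {
  # fill NAs in column v with the median of the non-missing values per Outcome level
  for (lvl in levels(data$Outcome)) {
    in_group <- data$Outcome == lvl
    m <- median(data[[v]][in_group], na.rm = TRUE)
    data[[v]][in_group & is.na(data[[v]])] <- m
  }
  data
}

# equivalent to the manual fills in the sections below:
# for (v in cols) diabetes <- impute_median_by_outcome(diabetes, v)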

Glucose

median_target("Glucose")
diabetes[,"Glucose"][(diabetes[,"Glucose"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 107
diabetes[,"Glucose"][(diabetes[,"Glucose"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 140

BloodPressure

median_target("BloodPressure")
diabetes[,"BloodPressure"][(diabetes[,"BloodPressure"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 70
diabetes[,"BloodPressure"][(diabetes[,"BloodPressure"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 74.5

SkinThickness

median_target("SkinThickness")
diabetes[,"SkinThickness"][(diabetes[,"SkinThickness"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 27
diabetes[,"SkinThickness"][(diabetes[,"SkinThickness"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 32

Insulin

median_target("Insulin")
diabetes[,"Insulin"][(diabetes[,"Insulin"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 102.5
diabetes[,"Insulin"][(diabetes[,"Insulin"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 169.5

BMI

median_target("BMI")
diabetes[,"BMI"][(diabetes[,"BMI"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 30.1
diabetes[,"BMI"][(diabetes[,"BMI"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 34.3

Missing Values (Final Check)

Now, let us recheck the missing values

colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

There are no missing values left in the dataset.

Visualisation

Correlation matrix between variables

library(ggcorrplot)

corr <- round(cor(diabetes %>% select(-Outcome)), 1)
# to do create color for target variable
ggcorrplot(corr, hc.order = TRUE, type = "lower",
   lab = TRUE)

As the plot above shows, no pair of predictors is highly correlated.
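
As an optional check (a sketch, not part of the original analysis), caret's findCorrelation() flags predictors whose pairwise correlation exceeds a cutoff; an empty result supports the claim above.

# flag predictors with absolute pairwise correlation above 0.75
findCorrelation(cor(diabetes %>% select(-Outcome)), cutoff = 0.75)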

Modeling

For evaluating the predictions we will focus on recall (sensitivity), which tells us what proportion of actual positives was identified correctly. In this case we would rather have a patient falsely predicted to have diabetes (a false positive) than falsely predicted not to have it (a false negative), because a missed case delays prevention and treatment.
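
As a quick illustration with made-up counts (not taken from any model below), recall is simply the share of true positives among all actual positives:

# recall (sensitivity) = TP / (TP + FN)
TP <- 40; FN <- 10   # illustrative counts only
TP / (TP + FN)       # 0.8: 80% of actual positives were identified
# caret::confusionMatrix() reports this as "Sensitivity" when positive = "1"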

Pre-processing

Check the target variable proportion

diabetes$Outcome %>% 
   table() %>% 
   prop.table()
## .
##         0         1 
## 0.6510417 0.3489583

The target variable is slightly imbalanced, but it should not be a problem for model training and prediction.
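
If the imbalance were more severe, one option (a sketch using caret::downSample(); it is not applied in this analysis) would be to down-sample the majority class:

# balance classes by randomly dropping rows from the majority class
balanced <- downSample(x = diabetes %>% select(-Outcome),
                       y = diabetes$Outcome, yname = "Outcome")
table(balanced$Outcome)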

Train-test split

RNGkind(sample.kind = "Rounding") # use the pre-R 3.6 sampling algorithm for reproducibility
set.seed(123)
# hold out 20% of the rows as a test set
split_index <- sample(nrow(diabetes), nrow(diabetes)*0.80)
diabetes_train <- diabetes[split_index, ] 
diabetes_test <- diabetes[-split_index, ]

Logistic Regression

Model Building

model_log <- glm(Outcome ~ ., data = diabetes_train, family = "binomial")

summary(model_log)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4800  -0.7138  -0.3776   0.7455   2.4514  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.068789   0.899325 -10.084  < 2e-16 ***
## Pregnancies               0.111113   0.036633   3.033  0.00242 ** 
## Glucose                   0.030155   0.004317   6.986 2.83e-12 ***
## BloodPressure            -0.002812   0.009919  -0.283  0.77682    
## SkinThickness             0.037210   0.015476   2.404  0.01620 *  
## Insulin                   0.004644   0.001583   2.934  0.00334 ** 
## BMI                       0.058073   0.019541   2.972  0.00296 ** 
## DiabetesPedigreeFunction  0.680046   0.332773   2.044  0.04100 *  
## Age                       0.010072   0.011007   0.915  0.36015    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 790.13  on 613  degrees of freedom
## Residual deviance: 561.46  on 605  degrees of freedom
## AIC: 579.46
## 
## Number of Fisher Scoring iterations: 5

Making prediction (Logistic Regression)

diabetes_log_pred <- predict(model_log, newdata = diabetes_test) # link scale (log-odds); not used further
diabetes_test$pred_odd <-  predict(object = model_log, 
                                newdata = diabetes_test, 
                                type = "response") # predicted probability of Outcome == 1
diabetes_test$pred_Label <- ifelse(test = diabetes_test$pred_odd > 0.5, yes = "1", no = "0") # classify with a 0.5 threshold

Evaluation (Logistic Regression)

# diabetes_test <- diabetes_test %>% mutate(pred_Label = as.factor(pred_Label))
confusionMatrix(data = as.factor(diabetes_test$pred_Label), 
                reference = diabetes_test$Outcome, 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 87 22
##          1 10 35
##                                           
##                Accuracy : 0.7922          
##                  95% CI : (0.7195, 0.8533)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 1.037e-05       
##                                           
##                   Kappa : 0.5341          
##                                           
##  Mcnemar's Test P-Value : 0.05183         
##                                           
##             Sensitivity : 0.6140          
##             Specificity : 0.8969          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.7982          
##              Prevalence : 0.3701          
##          Detection Rate : 0.2273          
##    Detection Prevalence : 0.2922          
##       Balanced Accuracy : 0.7555          
##                                           
##        'Positive' Class : 1               
## 

Result of Logistic Regression Model: the confusion matrix above shows an accuracy of ~79% and a recall (sensitivity) of ~61% for the diabetic class.

Naive Bayes

Model Building

model_bayes <- naiveBayes(Outcome ~ ., data = diabetes_train)

Making prediction (Naive Bayes)

diabetes_nb_pred <- predict(model_bayes, newdata = diabetes_test)

Evaluation (Naive Bayes)

confusionMatrix(data = diabetes_nb_pred, reference = diabetes_test$Outcome, positive = "1") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 20
##          1 12 37
##                                           
##                Accuracy : 0.7922          
##                  95% CI : (0.7195, 0.8533)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 1.037e-05       
##                                           
##                   Kappa : 0.5411          
##                                           
##  Mcnemar's Test P-Value : 0.2159          
##                                           
##             Sensitivity : 0.6491          
##             Specificity : 0.8763          
##          Pos Pred Value : 0.7551          
##          Neg Pred Value : 0.8095          
##              Prevalence : 0.3701          
##          Detection Rate : 0.2403          
##    Detection Prevalence : 0.3182          
##       Balanced Accuracy : 0.7627          
##                                           
##        'Positive' Class : 1               
## 

Result of Naive Bayes Model: the confusion matrix above shows an accuracy of ~79% and a recall (sensitivity) of ~65% for the diabetic class.

K-NN

Because all the predictors are numeric, we can proceed to building the K-NN model (after scaling).
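
A quick check (a sketch, not in the original code) confirms this:

# every predictor should be numeric for distance-based K-NN
sapply(diabetes %>% select(-Outcome), is.numeric)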

Train-test split

We recreate the same split (same seed) so that diabetes_test no longer carries the prediction columns added in the logistic regression section.

RNGkind(sample.kind = "Rounding") 
set.seed(123)
split_index <- sample(nrow(diabetes), nrow(diabetes)*0.80)
diabetes_train <- diabetes[split_index, ] 
diabetes_test <- diabetes[-split_index, ]

Model Building

Splitting the data into predictors and target variable:

# predictors
diabetes_train_x <- diabetes_train %>% select_if(is.numeric)
diabetes_test_x <- diabetes_test %>% select_if(is.numeric)

# target
diabetes_train_y <- diabetes_train %>% select(Outcome)
diabetes_test_y <- diabetes_test %>% select(Outcome)

Scaling the predictor data:

# scale the training predictors, then apply the same center and scale to the test set
diabetes_train_xs <- diabetes_train_x %>% scale()
diabetes_test_xs <- diabetes_test_x %>% 
                      scale(center = attr(diabetes_train_xs, "scaled:center"),
                            scale = attr(diabetes_train_xs, "scaled:scale"))

Model Prediction

First, find a starting value for k; a common rule of thumb is the square root of the number of training rows.

k <- sqrt(nrow(diabetes_train)) %>% round(digits = 0)

Because there is an even number of classes (2), we should use an odd value of k to avoid tied votes. Here sqrt(614) rounds to 25, which is already odd.
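
A small guard (a sketch, not part of the original code) would enforce this automatically:

# bump k to the next odd number if the rounded square root happens to be even
if (k %% 2 == 0) k <- k + 1
k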

library(class)
cl <- diabetes_train_y[,"Outcome"]
knn_pred <- knn(train = diabetes_train_xs, # scaled training predictors
                 test = diabetes_test_xs,  # scaled test predictors
                 cl = cl,                  # actual training labels
                 k = k)

Model Evaluation

confusionMatrix(data = knn_pred, # confusionMatrix() expects factors
                reference = diabetes_test_y[,"Outcome"], 
                positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 15
##          1  9 42
##                                          
##                Accuracy : 0.8442         
##                  95% CI : (0.777, 0.8975)
##     No Information Rate : 0.6299         
##     P-Value [Acc > NIR] : 3.871e-09      
##                                          
##                   Kappa : 0.6583         
##                                          
##  Mcnemar's Test P-Value : 0.3074         
##                                          
##             Sensitivity : 0.7368         
##             Specificity : 0.9072         
##          Pos Pred Value : 0.8235         
##          Neg Pred Value : 0.8544         
##              Prevalence : 0.3701         
##          Detection Rate : 0.2727         
##    Detection Prevalence : 0.3312         
##       Balanced Accuracy : 0.8220         
##                                          
##        'Positive' Class : 1              
## 

Result of KNN Model: the confusion matrix above shows an accuracy of ~84% and a recall (sensitivity) of ~74% for the diabetic class.

Decision Tree

Model Building

library(partykit)
library(ROCR)

model_tree <- ctree(Outcome ~ ., data=diabetes_train) 

# Visualize decision tree
plot(model_tree, type = "simple") 

Model Prediction

pred_dtree <- predict(model_tree, diabetes_test_x)

Model Evaluation

confusionMatrix(pred_dtree, diabetes_test$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 80 15
##          1 17 42
##                                           
##                Accuracy : 0.7922          
##                  95% CI : (0.7195, 0.8533)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 1.037e-05       
##                                           
##                   Kappa : 0.5576          
##                                           
##  Mcnemar's Test P-Value : 0.8597          
##                                           
##             Sensitivity : 0.8247          
##             Specificity : 0.7368          
##          Pos Pred Value : 0.8421          
##          Neg Pred Value : 0.7119          
##              Prevalence : 0.6299          
##          Detection Rate : 0.5195          
##    Detection Prevalence : 0.6169          
##       Balanced Accuracy : 0.7808          
##                                           
##        'Positive' Class : 0               
## 

Result of Decision Tree Model: the confusion matrix above shows an accuracy of ~79%. Note that positive = "1" was not set here, so caret treats 0 as the positive class; the recall for the diabetic class is the value reported as specificity, ~74%.

Support Vector Machine

Model Building

model_svm <-  svm(Outcome ~ ., data = diabetes_train)

Model Prediction

pred_svm <- predict(model_svm, diabetes_test_x)

Model Evaluation

confusionMatrix(pred_svm, diabetes_test$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 91 11
##          1  6 46
##                                           
##                Accuracy : 0.8896          
##                  95% CI : (0.8291, 0.9344)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 3.151e-13       
##                                           
##                   Kappa : 0.7589          
##                                           
##  Mcnemar's Test P-Value : 0.332           
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.8070          
##          Pos Pred Value : 0.8922          
##          Neg Pred Value : 0.8846          
##              Prevalence : 0.6299          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.6623          
##       Balanced Accuracy : 0.8726          
##                                           
##        'Positive' Class : 0               
## 

Result of Support Vector Machine Model: the confusion matrix above shows an accuracy of ~89%. Again 0 is treated as the positive class, so the recall for the diabetic class is the reported specificity, ~81%.

Random Forests

Model Building

library(randomForest)

control <- trainControl(method='repeatedcv', 
                        number=10, 
                        repeats=3,
                        search = 'random')

set.seed(100)
model_rf <- train(Outcome ~ .,
                   data = diabetes_train,
                   method = 'rf',
                   metric = 'Accuracy',
                   trControl = control)

saveRDS(model_rf, "model_rf.RDS") # save model
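
As an optional follow-up (a sketch, not part of the original analysis), the tuned mtry and the variable importance of the fitted random forest can be inspected:

model_rf$bestTune   # mtry value selected by repeated cross-validation
varImp(model_rf)    # scaled importance of each predictor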

Model Prediction

pred_rf <- predict(model_rf, diabetes_test_x)

Model Evaluation

confusionMatrix(pred_rf, diabetes_test$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 91 10
##          1  6 47
##                                           
##                Accuracy : 0.8961          
##                  95% CI : (0.8368, 0.9394)
##     No Information Rate : 0.6299          
##     P-Value [Acc > NIR] : 6.496e-14       
##                                           
##                   Kappa : 0.7739          
##                                           
##  Mcnemar's Test P-Value : 0.4533          
##                                           
##             Sensitivity : 0.9381          
##             Specificity : 0.8246          
##          Pos Pred Value : 0.9010          
##          Neg Pred Value : 0.8868          
##              Prevalence : 0.6299          
##          Detection Rate : 0.5909          
##    Detection Prevalence : 0.6558          
##       Balanced Accuracy : 0.8814          
##                                           
##        'Positive' Class : 0               
## 

Result of Random Forest Model: the confusion matrix above shows an accuracy of ~89.6%. With 0 as the positive class, the recall for the diabetic class is the reported specificity, ~82%.

Conclusion

Based on the several models we tested, we want a model with accuracy above 80%. For this diabetes case it is also important to minimize false negatives so that patients can be treated sooner, so we also consider which model has the best recall for the diabetic class. By both criteria the Random Forest model is the best choice, with ~89.6% accuracy and the highest recall for the diabetic class (~82%, reported as specificity above because 0 was treated as the positive class).
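
As a sketch (the helper recall_pos is hypothetical and not part of the original analysis), the recall for the diabetic class can be compared directly across the models whose predictions were kept as objects above:

# recall for the diabetic class: share of Outcome == 1 cases predicted as 1
recall_pos <- function(pred, ref) {
  sum(pred == "1" & ref == "1") / sum(ref == "1")
}
sapply(list(knn = knn_pred, tree = pred_dtree, svm = pred_svm, rf = pred_rf),
       recall_pos, ref = diabetes_test$Outcome)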