Diabetes Tests Model

Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database

Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Content

The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Acknowledgements

Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.

Load the Required Libraries

library(tidyverse)
library(caret)
library(corrplot)
library(class)
library(GGally)
library(rsample)
library(performance)

Read data

# import the data file and take a first look
diab <- read.csv("diabetes2.csv")
head(diab,3)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1

Using this dataset, we would like to compare two classification models: k-NN (k-Nearest Neighbour) and logistic regression.

First, check the data using glimpse() and summary().

glimpse(diab)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
summary(diab)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Data Pre-Processing

Before we begin our own analysis, make sure the data has been explored and cleaned. The target variable Outcome is read in as an integer, so we convert it to a factor so that R treats it as a category rather than a number.

diab <- diab %>%
  mutate(Outcome = as.factor(Outcome))
class(diab$Outcome)
## [1] "factor"

Missing Values

Any NA or NULL values

Make sure the data contains no missing (NA) or NULL values. (Note that is.null() tests whether the whole object is NULL rather than checking element-wise, so any(is.na(...)) is the check that actually matters here.)

cat(
any(is.na(diab)),
any(is.null(diab))
)
## FALSE FALSE

NA or not NA

Confirm that no column contains any remaining missing values.

colSums(is.na(diab))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0
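Although there are no explicit NAs, the summary above shows zeros in Glucose, BloodPressure, SkinThickness, Insulin and BMI. A value of zero is physiologically implausible for these measurements, and in this dataset such zeros are generally understood to encode missing values. A minimal sketch to flag them (the recoding is not applied in the analysis below):

# count physiologically impossible zeros per column (a zero BMI or blood pressure cannot be real)
zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
colSums(diab[zero_cols] == 0)

# one option would be to recode them as NA, e.g.:
# diab <- diab %>% mutate(across(all_of(zero_cols), ~ na_if(., 0)))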

With no explicit NAs, we can proceed to the next step.

EDA (Exploratory data analysis)

# let's check for incorrect values / outliers
ggplot(gather(diab[, -9]), aes(value)) +
  geom_boxplot() +
  facet_wrap(key ~ ., scales = "free_x")

# correlation matrix of all variables (GGally already loaded above)
ggcorr(
  diab,
  label = T,
  label_size = 2.9, 
  hjust = 0.9,
  layout.exp = 2
)

ggplot(diab,aes(x=DiabetesPedigreeFunction,y=Insulin)) + geom_smooth() + geom_point(aes(color=Outcome)) 

Logistic regression models (GLM)

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Logistic regression coefficients can be difficult to interpret directly because they are expressed on the log-odds scale rather than as probabilities.
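Concretely, the model passes a linear combination of the predictors through the logistic (sigmoid) function to obtain a probability between 0 and 1. In base R the logistic function is available as plogis(); a quick illustration:

# probability = 1 / (1 + exp(-linear_predictor)); plogis() computes exactly this
linear_pred <- c(-4, -1, 0, 1, 4)
plogis(linear_pred)                          # 0.018 0.269 0.500 0.731 0.982
all.equal(plogis(linear_pred), 1 / (1 + exp(-linear_pred)))  # TRUE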

Train & Test Split

RNGkind(sample.kind = "Rounding")
set.seed(777)
index <- initial_split(diab, prop = 0.8, strata = "Outcome")
diab_test <- testing(index)
diab_train <- training(index)

Check the class proportions of the train and test sets

We should make sure the target classes are not badly imbalanced, and that the split preserves their proportions, so we check both the training and the test set.

# re-check class imbalance
prop.table(table(diab_train$Outcome))
## 
##         0         1 
## 0.6514658 0.3485342
prop.table(table(diab_test$Outcome))
## 
##         0         1 
## 0.6493506 0.3506494
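Both splits keep roughly the 65/35 proportion of the full data, so no rebalancing is needed here. If the classes were more heavily skewed, caret provides helpers such as downSample(); a sketch (not used in this analysis):

# resample the majority class down to the size of the minority class
balanced <- downSample(x = diab_train %>% select(-Outcome),
                       y = diab_train$Outcome,
                       yname = "Outcome")
prop.table(table(balanced$Outcome))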

Model fitting

Now we use every variable except Outcome as a predictor, because Outcome (0 or 1) is what we want to predict.

model_diab <- glm(Outcome ~ . , diab_train, family = "binomial")
summary(model_diab) # Check Coefficiens and Pvalue
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4346  -0.7416  -0.4196   0.7530   2.4646  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.9581359  0.7671525 -10.374  < 2e-16 ***
## Pregnancies               0.1409855  0.0353711   3.986 6.72e-05 ***
## Glucose                   0.0349569  0.0041426   8.438  < 2e-16 ***
## BloodPressure            -0.0132597  0.0056436  -2.349   0.0188 *  
## SkinThickness            -0.0012581  0.0075366  -0.167   0.8674    
## Insulin                  -0.0013918  0.0009881  -1.409   0.1590    
## BMI                       0.0858112  0.0165734   5.178 2.25e-07 ***
## DiabetesPedigreeFunction  0.5914459  0.3255888   1.817   0.0693 .  
## Age                       0.0112643  0.0100991   1.115   0.2647    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 793.94  on 613  degrees of freedom
## Residual deviance: 585.92  on 605  degrees of freedom
## AIC: 603.92
## 
## Number of Fisher Scoring iterations: 5
model_null <- glm(Outcome ~ 1, diab_train, family = "binomial")

Stepwise directions

step() drops (or adds) one predictor at a time, keeping whichever change lowers the AIC the most, and stops when no single change improves it.

model_back <- stats::step(object = model_diab, direction = "backward")
## Start:  AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   585.95 601.95
## - Age                       1   587.16 603.16
## - Insulin                   1   587.90 603.90
## <none>                          585.92 603.92
## - DiabetesPedigreeFunction  1   589.29 605.29
## - BloodPressure             1   591.54 607.54
## - Pregnancies               1   602.50 618.50
## - BMI                       1   616.11 632.11
## - Glucose                   1   675.39 691.39
## 
## Step:  AIC=601.95
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - Age                       1   587.22 601.22
## <none>                          585.95 601.95
## - Insulin                   1   588.58 602.58
## - DiabetesPedigreeFunction  1   589.30 603.30
## - BloodPressure             1   591.90 605.90
## - Pregnancies               1   602.51 616.51
## - BMI                       1   618.67 632.67
## - Glucose                   1   677.17 691.17
## 
## Step:  AIC=601.22
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## <none>                          587.22 601.22
## - Insulin                   1   590.29 602.29
## - DiabetesPedigreeFunction  1   590.82 602.82
## - BloodPressure             1   592.43 604.43
## - Pregnancies               1   614.74 626.74
## - BMI                       1   619.12 631.12
## - Glucose                   1   689.41 701.41
model_forward <- stats::step(model_diab,
                             scope = list(lower = model_null,
                                          upper = model_diab),
                             direction = "forward")
## Start:  AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
model_both <- stats::step(model_diab,
                             scope = list(lower = model_null,
                                          upper = model_diab),
                             direction = "both")
## Start:  AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   585.95 601.95
## - Age                       1   587.16 603.16
## - Insulin                   1   587.90 603.90
## <none>                          585.92 603.92
## - DiabetesPedigreeFunction  1   589.29 605.29
## - BloodPressure             1   591.54 607.54
## - Pregnancies               1   602.50 618.50
## - BMI                       1   616.11 632.11
## - Glucose                   1   675.39 691.39
## 
## Step:  AIC=601.95
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - Age                       1   587.22 601.22
## <none>                          585.95 601.95
## - Insulin                   1   588.58 602.58
## - DiabetesPedigreeFunction  1   589.30 603.30
## + SkinThickness             1   585.92 603.92
## - BloodPressure             1   591.90 605.90
## - Pregnancies               1   602.51 616.51
## - BMI                       1   618.67 632.67
## - Glucose                   1   677.17 691.17
## 
## Step:  AIC=601.22
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## <none>                          587.22 601.22
## + Age                       1   585.95 601.95
## - Insulin                   1   590.29 602.29
## - DiabetesPedigreeFunction  1   590.82 602.82
## + SkinThickness             1   587.16 603.16
## - BloodPressure             1   592.43 604.43
## - Pregnancies               1   614.74 626.74
## - BMI                       1   619.12 631.12
## - Glucose                   1   689.41 701.41

Comparing the models from each stepwise direction

compare_performance(model_back, model_forward, model_both,model_diab,metrics = c("RMSE","R2"))
## # Comparison of Model Performance Indices
## 
## Name          | Model | Tjur's R2 |  RMSE
## -----------------------------------------
## model_back    |   glm |     0.309 | 0.397
## model_forward |   glm |     0.310 | 0.397
## model_both    |   glm |     0.309 | 0.397
## model_diab    |   glm |     0.310 | 0.397

Model fitting after stepwise

The four models perform almost identically, so we refit a reduced model with the weakest predictors dropped. Note that this refit uses the full dataset (data = diab), so the evaluation on diab_test below is not strictly out-of-sample; a stricter comparison would refit on diab_train.

model_diab_back <- glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
     BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
    data = diab)
summary(model_diab_back)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction + Age, family = "binomial", 
##     data = diab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7380  -0.7313  -0.4123   0.7276   2.8984  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.239812   0.701970 -11.738  < 2e-16 ***
## Pregnancies               0.124919   0.031972   3.907 9.34e-05 ***
## Glucose                   0.033492   0.003440   9.736  < 2e-16 ***
## BloodPressure            -0.013487   0.005114  -2.637  0.00836 ** 
## BMI                       0.087676   0.014268   6.145 7.99e-10 ***
## DiabetesPedigreeFunction  0.896150   0.294862   3.039  0.00237 ** 
## Age                       0.016325   0.009237   1.767  0.07719 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 725.46  on 761  degrees of freedom
## AIC: 739.46
## 
## Number of Fisher Scoring iterations: 5
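Because the coefficients are on the log-odds scale, exponentiating them turns them into odds ratios; for example, each additional pregnancy multiplies the odds of diabetes by about exp(0.125) ≈ 1.13:

# odds ratios: multiplicative change in the odds per one-unit increase in each predictor
exp(coef(model_diab_back))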

Now we create predicted probabilities for the test set from our model.

diab_test$pred.Outcome <- predict(
  object = model_diab_back,
  newdata = diab_test %>% select(
    Pregnancies,
    Glucose,
    BloodPressure,
    Insulin,
    BMI,
    DiabetesPedigreeFunction,
    Age
  ),
  type = "response"
)
summary(diab_test$pred.Outcome)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.002298 0.129029 0.286313 0.359490 0.608122 0.955191

The distribution of predicted probabilities

ggplot(diab_test, aes(x=pred.Outcome)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data") +
  theme_minimal()

Now convert the predicted probabilities into class labels:

  • values above 0.5 are labelled 1, otherwise 0
  • 0 means no diabetes, 1 means diabetes

diab_test$pred.label <- ifelse(diab_test$pred.Outcome > 0.5, 1, 0)
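The 0.5 cutoff is a modelling choice rather than a fixed rule: lowering it catches more diabetics (higher sensitivity) at the cost of more false alarms. A quick sketch of that trade-off (the helper objects below are new, not part of the model pipeline):

# sensitivity and specificity at a few candidate cutoffs
cutoffs <- c(0.3, 0.4, 0.5, 0.6)
sapply(cutoffs, function(k) {
  pred <- factor(ifelse(diab_test$pred.Outcome > k, 1, 0), levels = c(0, 1))
  cm <- table(pred, actual = diab_test$Outcome)
  c(cutoff      = k,
    sensitivity = cm["1", "1"] / sum(cm[, "1"]),
    specificity = cm["0", "0"] / sum(cm[, "0"]))
})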

Make sure the predicted label is a factor, i.e. a category (no diabetes / diabetes).

diab_test$pred.label <- as.factor(diab_test$pred.label)

Double-check the predictions we have attached to the test set.

diab_test %>% 
  select(Outcome, pred.Outcome, pred.label) %>% 
  head(10)
##    Outcome pred.Outcome pred.label
## 11       0   0.20762991          0
## 20       1   0.23553000          0
## 23       1   0.92745355          1
## 29       0   0.55750015          1
## 32       1   0.59675406          1
## 45       0   0.60239713          1
## 62       1   0.50152164          1
## 67       1   0.17653544          0
## 73       1   0.80274594          1
## 75       0   0.04922499          0

The result of our logistic regression model

result_glm <- confusionMatrix(data = diab_test$pred.label, 
                reference = diab_test$Outcome,
                positive = "1")
result_glm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 86 15
##          1 14 39
##                                           
##                Accuracy : 0.8117          
##                  95% CI : (0.7409, 0.8701)
##     No Information Rate : 0.6494          
##     P-Value [Acc > NIR] : 7.279e-06       
##                                           
##                   Kappa : 0.5847          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7222          
##             Specificity : 0.8600          
##          Pos Pred Value : 0.7358          
##          Neg Pred Value : 0.8515          
##              Prevalence : 0.3506          
##          Detection Rate : 0.2532          
##    Detection Prevalence : 0.3442          
##       Balanced Accuracy : 0.7911          
##                                           
##        'Positive' Class : 1               
## 

kNN (k-Nearest Neighbour) model

The k-NN algorithm assumes that similar observations lie close together in the predictor space: a new patient is classified by the majority vote of the k most similar patients in the training data.
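"Close" here means distance in the predictor space: class::knn() uses plain Euclidean distance, which is why features on large scales (such as Glucose or Insulin) would dominate unscaled distances. A tiny illustration using values from the first two rows of the data:

# Euclidean distance between two patients on two predictors
patient_a <- c(Glucose = 148, BMI = 33.6)
patient_b <- c(Glucose = 85,  BMI = 26.6)
sqrt(sum((patient_a - patient_b)^2))   # ~63.4, dominated by Glucose -- hence the scaling below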

Data Manipulation

# all predictor variables
diab_test_x <- diab_test %>% select(-Outcome, -pred.Outcome, -pred.label)

diab_train_x <- diab_train %>% select(-Outcome)

#target
diab_test_y <- diab_test %>% select(Outcome)

diab_train_y <- diab_train %>% select(Outcome)

The predictors will be rescaled using min-max normalization. Note that the test set should be scaled using the training set's parameters; otherwise information from the test set, which is supposed to be unseen data, leaks into the preprocessing.

Scaling Data

#Min-Max Normalization

normalize <- function(x){
  return (
    (x - min(x))/(max(x) - min(x))
  )
}

# scaling the predictor data
diab_test_xs <- normalize(diab_test_x)

diab_train_xs <- normalize(diab_train_x)
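Two caveats about the chunk above: applying normalize() to a whole data frame rescales by a single global min/max across all columns, and the test set is scaled with its own parameters rather than the training set's, contrary to the note above. A per-column sketch that reuses the training parameters (the helper names are ours, not part of the original pipeline):

# per-column min-max scaling; the test set reuses the training set's min/max
train_min <- sapply(diab_train_x, min)
train_max <- sapply(diab_train_x, max)

scale_minmax <- function(df, mins, maxs) {
  as.data.frame(Map(function(x, mn, mx) (x - mn) / (mx - mn), df, mins, maxs))
}

diab_train_xs2 <- scale_minmax(diab_train_x, train_min, train_max)
diab_test_xs2  <- scale_minmax(diab_test_x,  train_min, train_max)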

One benefit of k-NN is that it does not build an explicit model: it is a lazy learner that simply stores the training data and votes among neighbours at prediction time.

Find the optimum k

sqrt(nrow(diab))
## [1] 27.71281
  • sqrt(768) ≈ 27.7; we round to a nearby odd number so the two-class vote cannot tie
  • k = 29
diab_pred <- knn(train = diab_train_xs,
                 test = diab_test_xs,
                 cl = diab_train_y$Outcome,
                 k = 29)
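The sqrt(n) rule of thumb is only a starting point; k can also be tuned by cross-validation, for instance with caret (a sketch, not run for the results below):

# 5-fold cross-validation over odd k values; preProcess = "range" applies min-max scaling
set.seed(777)
knn_cv <- train(Outcome ~ ., data = diab_train,
                method = "knn",
                preProcess = "range",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(k = seq(3, 41, by = 2)))
knn_cv$bestTune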

The result of KNN (k-Nearest Neighbour)

result_knn <- confusionMatrix(data = diab_pred,
                reference = diab_test_y$Outcome,
                positive = "1")
result_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  3
##          1 65 51
##                                           
##                Accuracy : 0.5584          
##                  95% CI : (0.4763, 0.6383)
##     No Information Rate : 0.6494          
##     P-Value [Acc > NIR] : 0.9921          
##                                           
##                   Kappa : 0.2329          
##                                           
##  Mcnemar's Test P-Value : 1.389e-13       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.3500          
##          Pos Pred Value : 0.4397          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.3506          
##          Detection Rate : 0.3312          
##    Detection Prevalence : 0.7532          
##       Balanced Accuracy : 0.6472          
##                                           
##        'Positive' Class : 1               
## 

Comparison Logistic & k-NN models

Logistic regression

result_glm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 86 15
##          1 14 39
##                                           
##                Accuracy : 0.8117          
##                  95% CI : (0.7409, 0.8701)
##     No Information Rate : 0.6494          
##     P-Value [Acc > NIR] : 7.279e-06       
##                                           
##                   Kappa : 0.5847          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7222          
##             Specificity : 0.8600          
##          Pos Pred Value : 0.7358          
##          Neg Pred Value : 0.8515          
##              Prevalence : 0.3506          
##          Detection Rate : 0.2532          
##    Detection Prevalence : 0.3442          
##       Balanced Accuracy : 0.7911          
##                                           
##        'Positive' Class : 1               
## 

KNN

result_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 35  3
##          1 65 51
##                                           
##                Accuracy : 0.5584          
##                  95% CI : (0.4763, 0.6383)
##     No Information Rate : 0.6494          
##     P-Value [Acc > NIR] : 0.9921          
##                                           
##                   Kappa : 0.2329          
##                                           
##  Mcnemar's Test P-Value : 1.389e-13       
##                                           
##             Sensitivity : 0.9444          
##             Specificity : 0.3500          
##          Pos Pred Value : 0.4397          
##          Neg Pred Value : 0.9211          
##              Prevalence : 0.3506          
##          Detection Rate : 0.3312          
##    Detection Prevalence : 0.7532          
##       Balanced Accuracy : 0.6472          
##                                           
##        'Positive' Class : 1               
## 

Summary

In conclusion, which of the models we built is preferable depends on what the doctor wants to achieve. The logistic model has the better accuracy, 81% (0.812), and balances true positives and true negatives well; it also has the higher precision (Pos Pred Value 0.736), so if we spend money targeting patients predicted as positive (diabetic) and want that investment to be worthwhile, the logistic model is the one to prioritize. Overall, the logistic model is better than k-NN here.

The k-NN model wins only on recall: its sensitivity of 94% (0.944) is the highest of the two, so it misses very few diabetic patients, but it pays for that with many false positives and a much lower overall accuracy (0.558).