Intro

Objective

The authors will try to classify wether patient have diabetes or not in this dataset using logistic regression.

Context

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Preparation

Library and setup

pacman::p_load(tidyverse, 
       GGally, 
       gtools, 
       skimr,
       mice,
       caret,
       pROC,
       class,
       car,
       cmplot)

Data import

diab <- read_csv("diabetes.zip")

a glimpse at our dataset.

head(diab)

The dataset includes information as follow:

  • Pregnancies: Number of times pregnant

  • Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test

  • Blood Pressure : Diastolic blood pressure (mm Hg)

  • Skin Thickness : Triceps skin fold thickness (mm)

  • Insulin : 2-Hour serum insulin (mu U/ml)

  • BMI : Body mass index (weight in kg/(height in m)^2)

  • Diabetes pedigree function : a function that represents how likely they are to get the disease by extrapolating from their ancestor’s history

  • Age : Self-explanatory

  • Outcome : 1 is True (Diabetic), 0 is False (Non-Diabetic)

skim(diab)
Data summary
Name diab
Number of rows 768
Number of columns 9
_______________________
Column type frequency:
numeric 9
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 0 1 120.89 31.97 0.00 99.00 117.00 140.25 199.00 ▁▁▇▆▂
BloodPressure 0 1 69.11 19.36 0.00 62.00 72.00 80.00 122.00 ▁▁▇▇▁
SkinThickness 0 1 20.54 15.95 0.00 0.00 23.00 32.00 99.00 ▇▇▂▁▁
Insulin 0 1 79.80 115.24 0.00 0.00 30.50 127.25 846.00 ▇▁▁▁▁
BMI 0 1 31.99 7.88 0.00 27.30 32.00 36.60 67.10 ▁▃▇▂▁
DiabetesPedigreeFunction 0 1 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
Age 0 1 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁
Outcome 0 1 0.35 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅

Data Cleansing

I use skim() to take a look at the statistical summary and data structure on the dataset, it seems that all the variables are numeric type, that means this dataset needs scrubbing a little bit by converting some of the variables, I also notice that most of the variables have the minimum value of zero, it does not seem right for BloodPressure, SkinThickness, Insulin, and BMI to have zero values.

In the code below, I replaced 0s within all variables except Pregnancies and Outcome with NA’s, created a new variable called BMICategory to categorize Body Mass Index and see how it correlates with the Outcome, and I also converted Outcome as factor.

Notes: A person whose body produces no insulin will simply die.

diab_clean <- diab %>% 
  mutate(Outcome = as.factor(Outcome)) %>%
  mutate(BMICategory =as.factor( case_when(BMI > 0 & BMI < 18.5 ~ "underweight",
                                 BMI >= 18.5 & BMI <= 25 ~ "normal",
                                 BMI > 25 & BMI <= 30 ~ "overweight",
                                 BMI > 30 ~ "obese",
                                 TRUE ~ "other"))) %>%
  mutate(BMI = ifelse(BMI == 0, NA, BMI)) %>% 
  mutate(BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure)) %>% 
  mutate(SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness)) %>% 
  mutate(Insulin = ifelse(Insulin == 0, NA, Insulin)) %>%  
  mutate(Glucose = ifelse(Glucose == 0, NA, Glucose)) 
  

skim(diab_clean)
Data summary
Name diab_clean
Number of rows 768
Number of columns 10
_______________________
Column type frequency:
factor 2
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Outcome 0 1 FALSE 2 0: 500, 1: 268
BMICategory 0 1 FALSE 5 obe: 465, ove: 180, nor: 108, oth: 11

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1.00 3.85 3.37 0.00 1.00 3.00 6.00 17.00 ▇▃▂▁▁
Glucose 5 0.99 121.69 30.54 44.00 99.00 117.00 141.00 199.00 ▁▇▇▃▂
BloodPressure 35 0.95 72.41 12.38 24.00 64.00 72.00 80.00 122.00 ▁▃▇▂▁
SkinThickness 227 0.70 29.15 10.48 7.00 22.00 29.00 36.00 99.00 ▆▇▁▁▁
Insulin 374 0.51 155.55 118.78 14.00 76.25 125.00 190.00 846.00 ▇▂▁▁▁
BMI 11 0.99 32.46 6.92 18.20 27.50 32.30 36.60 67.10 ▅▇▃▁▁
DiabetesPedigreeFunction 0 1.00 0.47 0.33 0.08 0.24 0.37 0.63 2.42 ▇▃▁▁▁
Age 0 1.00 33.24 11.76 21.00 24.00 29.00 41.00 81.00 ▇▃▁▁▁

Now we can see that this dataset have a lot of missing values in it, let’s take a look at the proportion of missing values in each variables.

Missing Values

colSums(prop.table(is.na(diab_clean)))
#>              Pregnancies                  Glucose            BloodPressure 
#>              0.000000000              0.007668712              0.053680982 
#>            SkinThickness                  Insulin                      BMI 
#>              0.348159509              0.573619632              0.016871166 
#> DiabetesPedigreeFunction                      Age                  Outcome 
#>              0.000000000              0.000000000              0.000000000 
#>              BMICategory 
#>              0.000000000

EDA

diab_clean %>% 
  select(BMICategory, Outcome) %>% 
  group_by(Outcome) %>% 
  plot()

Here is the visualization of BMICategory vs Outcome, It safe to say that most of the diabetic person tend to have more than 25 BMI and a person who is considered underweight and have diabetic is almost non-existent.

let’s use boxplot to check the statistical distribution of this dataset.

boxplot(diab_clean, las = 2, varwidth = T)

we need to watch out for multicollinearity in our dataset, since that our logistic regression model is sensitive to it.

ggcorr(diab_clean, label = T)

we can see that BMI v SkinThickness and Insulin v Glucose have a relatively high correlation, let’s check for multi-co later with VIF() to measures how much the variance of an estimated regression coefficient increases if the predictors are correlated.

Based on this target proportion we will determine the ratio of our train and test data.

prop.table(table(diab_clean$Outcome))
#> 
#>         0         1 
#> 0.6510417 0.3489583

Modeling

Train and test dataset splitting

We will split the dataset by 65/35 in accordance to our target proportion.

library(rsample)
set.seed(100)

index <- initial_split(data = diab_clean %>% 
                         select(-BMICategory),prop = 0.65,strata = "Outcome") #drop bmi category, karena sudah selesai di visualisasikan

diab.train <- training(index)
diab.test <- testing(index)
diab.test_step <- testing(index) #for stepwise

let’s check the target proportion on each dataset.

prop.table(table(diab.train$Outcome))
#> 
#>    0    1 
#> 0.65 0.35
prop.table(table(diab.test$Outcome))
#> 
#>         0         1 
#> 0.6529851 0.3470149
prop.table(table(diab.test_step$Outcome))
#> 
#>         0         1 
#> 0.6529851 0.3470149

Imputation on data train

in this project i will use mice to fill-in the missing values to preserve the distribution of our data, using the pmm(polynomial regression) method, mice will give different values to our NA’s based on it’s relation with other variables.

stripplot(imp, col = c("grey", mdc(2)), pch = c(1, 20))

Iteration number 3 seems to fit right into the data distribution , so we will impute the missing values using iteration number 3.

diab.train.clean <- complete(imp,3)

EDA on train dataset

skim(diab.train.clean)
Data summary
Name diab.train.clean
Number of rows 500
Number of columns 9
_______________________
Column type frequency:
factor 1
numeric 8
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Outcome 0 1 FALSE 2 0: 325, 1: 175

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Pregnancies 0 1 4.02 3.42 0.00 1.00 3.00 6.00 17.00 ▇▃▃▁▁
Glucose 0 1 122.20 30.83 44.00 100.00 118.50 142.00 199.00 ▁▇▇▅▂
BloodPressure 0 1 72.96 12.46 38.00 64.00 72.00 80.00 122.00 ▁▇▇▂▁
SkinThickness 0 1 28.73 11.13 7.00 20.00 28.00 36.00 99.00 ▆▇▁▁▁
Insulin 0 1 153.25 119.24 14.00 75.75 120.00 190.00 846.00 ▇▂▁▁▁
BMI 0 1 32.35 7.16 18.20 27.28 32.00 36.60 67.10 ▅▇▃▁▁
DiabetesPedigreeFunction 0 1 0.48 0.35 0.08 0.24 0.37 0.65 2.42 ▇▃▁▁▁
Age 0 1 33.28 11.67 21.00 24.00 29.00 40.00 81.00 ▇▃▁▁▁
boxplot(diab.train, las = 2, varwidth = T)

ggcorr(diab.train, label = T)

let’s compare diab_clean and diab.train statistical distribution on insulin variable, since it have the highest number of missingvalues.

Median :

  • Median of Insulin on diab_clean : 125
  • Median of Insulin on diab.train.clean : 120

SD :

  • SD of Insulin on diab_clean : 119
  • SD of Insulin on diab.train.clean : 119

Logistic Regression

Since it does not take a lot of computational power to perform logistic regression model on a relatively small dataset, I will try to create three models and later compare them.

I will start modelling using all of the variables available as predictor and see how it performs.

model_all <- glm( Outcome ~ .,family = "binomial", data = diab.train.clean)

In the next model I will perform a feature selection using step in a backward direction, to select features with high enough significance.

model_step <- step( model_all,direction = "backward", trace = 0)

I later reduce the number of predictor variables to only 3 of them based on their significance.

model_fitted <- glm( Outcome ~ Pregnancies+Glucose+BMI,family = "binomial", data = diab.train.clean)

Model Evaluation

Residual deviance of each models

model_all$deviance
#> [1] 467.0651
model_step$deviance
#> [1] 467.3661
model_fitted$deviance
#> [1] 472.0083

AIC

model_all$aic
#> [1] 485.0651
model_step$aic
#> [1] 477.3661
model_fitted$aic
#> [1] 480.0083

Based on errors and AIC value of each models, model_fitted seem to have the lowest accuracy but better AIC than model_all. I will pick two models as our candidate based on the residual deviance value and compare these models performance.

VIF

Let’s see if the model is stable by measuring how much the variance of an estimated regression coefficient increases if your predictors are correlated.

vif(model_all)
#>              Pregnancies                  Glucose            BloodPressure 
#>                 1.445646                 1.334125                 1.299358 
#>            SkinThickness                  Insulin                      BMI 
#>                 1.573632                 1.351720                 1.832557 
#> DiabetesPedigreeFunction                      Age 
#>                 1.014315                 1.705619
vif(model_step)
#>              Pregnancies                  Glucose                      BMI 
#>                 1.019347                 1.004173                 1.019998 
#> DiabetesPedigreeFunction 
#>                 1.010721

A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection, all of the variables do not have VIF greater than 10, so it is safe to say that multicollinearity does not occured in both models.

Tuning the threshold

Cutoff tuning

Tradeoff model performance on model_all

confmat_plot(model_all$fitted.values, diab.train$Outcome, "1", "0")

Tradeoff model performance on model_step

confmat_plot(model_step$fitted.values, diab.train$Outcome, "1", "0")

it seems reasonable to have 80% of recall and still have relatively good accuracy, so we will set our threshold to 0.3 for both models.

Prediction

diab.test$pred.Outcome <- predict(model_all,newdata = diab.test, type = "response")
diab.test_step$pred.Outcome_step <- predict(model_step,newdata = diab.test_step, type = "response")


diab.test$pred.Outcome <- as.factor(ifelse(diab.test$pred.Outcome > 0.3 , "1","0"))
diab.test_step$pred.Outcome_step <- as.factor(ifelse(diab.test_step$pred.Outcome_step > 0.3 , "1","0")) 

Model Comparison

confusionMatrix(data = diab.test$pred.Outcome,reference = diab.test$Outcome, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction  0  1
#>          0 66  6
#>          1 20 41
#>                                           
#>                Accuracy : 0.8045          
#>                  95% CI : (0.7268, 0.8681)
#>     No Information Rate : 0.6466          
#>     P-Value [Acc > NIR] : 0.00005162      
#>                                           
#>                   Kappa : 0.5993          
#>                                           
#>  Mcnemar's Test P-Value : 0.01079         
#>                                           
#>             Sensitivity : 0.8723          
#>             Specificity : 0.7674          
#>          Pos Pred Value : 0.6721          
#>          Neg Pred Value : 0.9167          
#>              Prevalence : 0.3534          
#>          Detection Rate : 0.3083          
#>    Detection Prevalence : 0.4586          
#>       Balanced Accuracy : 0.8199          
#>                                           
#>        'Positive' Class : 1               
#> 
confusionMatrix(data = diab.test_step$pred.Outcome_step,reference = diab.test_step$Outcome, positive = "1")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction   0   1
#>          0 128  21
#>          1  42  70
#>                                          
#>                Accuracy : 0.7586         
#>                  95% CI : (0.702, 0.8092)
#>     No Information Rate : 0.6513         
#>     P-Value [Acc > NIR] : 0.0001212      
#>                                          
#>                   Kappa : 0.4956         
#>                                          
#>  Mcnemar's Test P-Value : 0.0117434      
#>                                          
#>             Sensitivity : 0.7692         
#>             Specificity : 0.7529         
#>          Pos Pred Value : 0.6250         
#>          Neg Pred Value : 0.8591         
#>              Prevalence : 0.3487         
#>          Detection Rate : 0.2682         
#>    Detection Prevalence : 0.4291         
#>       Balanced Accuracy : 0.7611         
#>                                          
#>        'Positive' Class : 1              
#> 

Using all the variables available on the dataset seem to be a better option, we reached 80% of accuracy and only have 6 patients that are positive incorrectly classified as negative by model_all.

Model Interpretation

I will try interpret model_all since it has the best performance.

summary(model_all)
#> 
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diab.train.clean)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.7247  -0.7088  -0.4067   0.6838   2.3790  
#> 
#> Coefficients:
#>                            Estimate Std. Error z value             Pr(>|z|)    
#> (Intercept)              -8.5598977  0.9658166  -8.863 < 0.0000000000000002 ***
#> Pregnancies               0.1518398  0.0404018   3.758             0.000171 ***
#> Glucose                   0.0351849  0.0048599   7.240    0.000000000000449 ***
#> BloodPressure            -0.0017156  0.0107892  -0.159             0.873659    
#> SkinThickness             0.0062901  0.0135674   0.464             0.642921    
#> Insulin                   0.0001934  0.0011027   0.175             0.860807    
#> BMI                       0.0754036  0.0230906   3.266             0.001093 ** 
#> DiabetesPedigreeFunction  0.7513657  0.3538178   2.124             0.033704 *  
#> Age                      -0.0025173  0.0125656  -0.200             0.841218    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 647.45  on 499  degrees of freedom
#> Residual deviance: 467.07  on 491  degrees of freedom
#> AIC: 485.07
#> 
#> Number of Fisher Scoring iterations: 5

Based on the summary, it is safe to say that Glucose is affecting the model a lot more than other variables.

exp(model_all$coefficients)
#>              (Intercept)              Pregnancies                  Glucose 
#>             0.0001916389             1.1639737688             1.0358112546 
#>            BloodPressure            SkinThickness                  Insulin 
#>             0.9982858517             1.0063099063             1.0001933790 
#>                      BMI DiabetesPedigreeFunction                      Age 
#>             1.0783192934             2.1198931781             0.9974858247

Odds interpretation

Pregnancies : Every time a person is pregnant, the odds of that person to have diabetes is increasing for ~16%.

Glucose : For a one-unit increase in glucose level, the odds to have diabetes is increasing for ~3%.

BloodPressure : The odds value is less than 1, means, that as bloodpressure level of a patient is increasing by one unit, the odds of that patient to have diabetes is decreasing by ~0.2%.

SkinThickness : For every 1 mm increase on a patient skin thickness, their odds of having diabetes is increasing for ~0.6%.

Insulin : A one-unit increase in insulin level adds ~0.00019% to their odds of having diabetes.

BMI : For every one-unit increase on a patient BMI, their odds of having diabetes is increasing for ~0.78%.

DiabetesPedigreeFunction : The odds value is more than 1, means that the higher DiabetesPedigreeFunction level of a patient the more likely they are to have diabetes.

Age : The odds value is less than 1, means that the higher age level of a patient the less likely they are to have diabetes.

Conclusion

Disclaimer first, the author is not a doctor, I built these logistic regression models to study more about how classification in machine learning works, do not examine whether a person is diabetic or not based on this report.

With logistic regression, we can build a model to help doctors and medical workers alike to make a decision based on the model prediction, but it is clear that is far too dangerous to make decision based on the prediction alone, so interpretability of a model can also help to gather insights on which feature is significantly affecting the model and which is not, the odds of whether a patient is more likely to have diabetes or not based on the level of every features recorded. In comparison to blackbox models, I think that for this particular case it is better to use models that we can interpret and gather some insights from, to further investigate the symptom, rather than just relying on the machine prediction which could lead to fatal mistakes, since it involves the health and safety of a person.

Since it is not recommended to clean the test data set, I am not able to perform prediction using KNN, because of the amount of missing values on the dataset.