The authors will try to classify wether patient have diabetes or not in this dataset using logistic regression.
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
pacman::p_load(tidyverse,
GGally,
gtools,
skimr,
mice,
caret,
pROC,
class,
car,
cmplot)diab <- read_csv("diabetes.zip")a glimpse at our dataset.
head(diab)The dataset includes information as follow:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
Blood Pressure : Diastolic blood pressure (mm Hg)
Skin Thickness : Triceps skin fold thickness (mm)
Insulin : 2-Hour serum insulin (mu U/ml)
BMI : Body mass index (weight in kg/(height in m)^2)
Diabetes pedigree function : a function that represents how likely they are to get the disease by extrapolating from their ancestor’s history
Age : Self-explanatory
Outcome : 1 is True (Diabetic), 0 is False (Non-Diabetic)
skim(diab)| Name | diab |
| Number of rows | 768 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| numeric | 9 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| Glucose | 0 | 1 | 120.89 | 31.97 | 0.00 | 99.00 | 117.00 | 140.25 | 199.00 | ▁▁▇▆▂ |
| BloodPressure | 0 | 1 | 69.11 | 19.36 | 0.00 | 62.00 | 72.00 | 80.00 | 122.00 | ▁▁▇▇▁ |
| SkinThickness | 0 | 1 | 20.54 | 15.95 | 0.00 | 0.00 | 23.00 | 32.00 | 99.00 | ▇▇▂▁▁ |
| Insulin | 0 | 1 | 79.80 | 115.24 | 0.00 | 0.00 | 30.50 | 127.25 | 846.00 | ▇▁▁▁▁ |
| BMI | 0 | 1 | 31.99 | 7.88 | 0.00 | 27.30 | 32.00 | 36.60 | 67.10 | ▁▃▇▂▁ |
| DiabetesPedigreeFunction | 0 | 1 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
| Outcome | 0 | 1 | 0.35 | 0.48 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▅ |
I use skim() to take a look at the statistical summary and data structure on the dataset, it seems that all the variables are numeric type, that means this dataset needs scrubbing a little bit by converting some of the variables, I also notice that most of the variables have the minimum value of zero, it does not seem right for BloodPressure, SkinThickness, Insulin, and BMI to have zero values.
In the code below, I replaced 0s within all variables except Pregnancies and Outcome with NA’s, created a new variable called BMICategory to categorize Body Mass Index and see how it correlates with the Outcome, and I also converted Outcome as factor.
Notes: A person whose body produces no insulin will simply die.
diab_clean <- diab %>%
mutate(Outcome = as.factor(Outcome)) %>%
mutate(BMICategory =as.factor( case_when(BMI > 0 & BMI < 18.5 ~ "underweight",
BMI >= 18.5 & BMI <= 25 ~ "normal",
BMI > 25 & BMI <= 30 ~ "overweight",
BMI > 30 ~ "obese",
TRUE ~ "other"))) %>%
mutate(BMI = ifelse(BMI == 0, NA, BMI)) %>%
mutate(BloodPressure = ifelse(BloodPressure == 0, NA, BloodPressure)) %>%
mutate(SkinThickness = ifelse(SkinThickness == 0, NA, SkinThickness)) %>%
mutate(Insulin = ifelse(Insulin == 0, NA, Insulin)) %>%
mutate(Glucose = ifelse(Glucose == 0, NA, Glucose))
skim(diab_clean)| Name | diab_clean |
| Number of rows | 768 |
| Number of columns | 10 |
| _______________________ | |
| Column type frequency: | |
| factor | 2 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Outcome | 0 | 1 | FALSE | 2 | 0: 500, 1: 268 |
| BMICategory | 0 | 1 | FALSE | 5 | obe: 465, ove: 180, nor: 108, oth: 11 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1.00 | 3.85 | 3.37 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▂▁▁ |
| Glucose | 5 | 0.99 | 121.69 | 30.54 | 44.00 | 99.00 | 117.00 | 141.00 | 199.00 | ▁▇▇▃▂ |
| BloodPressure | 35 | 0.95 | 72.41 | 12.38 | 24.00 | 64.00 | 72.00 | 80.00 | 122.00 | ▁▃▇▂▁ |
| SkinThickness | 227 | 0.70 | 29.15 | 10.48 | 7.00 | 22.00 | 29.00 | 36.00 | 99.00 | ▆▇▁▁▁ |
| Insulin | 374 | 0.51 | 155.55 | 118.78 | 14.00 | 76.25 | 125.00 | 190.00 | 846.00 | ▇▂▁▁▁ |
| BMI | 11 | 0.99 | 32.46 | 6.92 | 18.20 | 27.50 | 32.30 | 36.60 | 67.10 | ▅▇▃▁▁ |
| DiabetesPedigreeFunction | 0 | 1.00 | 0.47 | 0.33 | 0.08 | 0.24 | 0.37 | 0.63 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1.00 | 33.24 | 11.76 | 21.00 | 24.00 | 29.00 | 41.00 | 81.00 | ▇▃▁▁▁ |
Now we can see that this dataset have a lot of missing values in it, let’s take a look at the proportion of missing values in each variables.
colSums(prop.table(is.na(diab_clean)))#> Pregnancies Glucose BloodPressure
#> 0.000000000 0.007668712 0.053680982
#> SkinThickness Insulin BMI
#> 0.348159509 0.573619632 0.016871166
#> DiabetesPedigreeFunction Age Outcome
#> 0.000000000 0.000000000 0.000000000
#> BMICategory
#> 0.000000000
diab_clean %>%
select(BMICategory, Outcome) %>%
group_by(Outcome) %>%
plot() Here is the visualization of
BMICategory vs Outcome, It safe to say that most of the diabetic person tend to have more than 25 BMI and a person who is considered underweight and have diabetic is almost non-existent.
let’s use boxplot to check the statistical distribution of this dataset.
boxplot(diab_clean, las = 2, varwidth = T)we need to watch out for multicollinearity in our dataset, since that our logistic regression model is sensitive to it.
ggcorr(diab_clean, label = T) we can see that
BMI v SkinThickness and Insulin v Glucose have a relatively high correlation, let’s check for multi-co later with VIF() to measures how much the variance of an estimated regression coefficient increases if the predictors are correlated.
Based on this target proportion we will determine the ratio of our train and test data.
prop.table(table(diab_clean$Outcome))#>
#> 0 1
#> 0.6510417 0.3489583
We will split the dataset by 65/35 in accordance to our target proportion.
library(rsample)
set.seed(100)
index <- initial_split(data = diab_clean %>%
select(-BMICategory),prop = 0.65,strata = "Outcome") #drop bmi category, karena sudah selesai di visualisasikan
diab.train <- training(index)
diab.test <- testing(index)
diab.test_step <- testing(index) #for stepwiselet’s check the target proportion on each dataset.
prop.table(table(diab.train$Outcome))#>
#> 0 1
#> 0.65 0.35
prop.table(table(diab.test$Outcome))#>
#> 0 1
#> 0.6529851 0.3470149
prop.table(table(diab.test_step$Outcome))#>
#> 0 1
#> 0.6529851 0.3470149
in this project i will use mice to fill-in the missing values to preserve the distribution of our data, using the pmm(polynomial regression) method, mice will give different values to our NA’s based on it’s relation with other variables.
stripplot(imp, col = c("grey", mdc(2)), pch = c(1, 20))Iteration number 3 seems to fit right into the data distribution , so we will impute the missing values using iteration number 3.
diab.train.clean <- complete(imp,3)skim(diab.train.clean)| Name | diab.train.clean |
| Number of rows | 500 |
| Number of columns | 9 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 8 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Outcome | 0 | 1 | FALSE | 2 | 0: 325, 1: 175 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Pregnancies | 0 | 1 | 4.02 | 3.42 | 0.00 | 1.00 | 3.00 | 6.00 | 17.00 | ▇▃▃▁▁ |
| Glucose | 0 | 1 | 122.20 | 30.83 | 44.00 | 100.00 | 118.50 | 142.00 | 199.00 | ▁▇▇▅▂ |
| BloodPressure | 0 | 1 | 72.96 | 12.46 | 38.00 | 64.00 | 72.00 | 80.00 | 122.00 | ▁▇▇▂▁ |
| SkinThickness | 0 | 1 | 28.73 | 11.13 | 7.00 | 20.00 | 28.00 | 36.00 | 99.00 | ▆▇▁▁▁ |
| Insulin | 0 | 1 | 153.25 | 119.24 | 14.00 | 75.75 | 120.00 | 190.00 | 846.00 | ▇▂▁▁▁ |
| BMI | 0 | 1 | 32.35 | 7.16 | 18.20 | 27.28 | 32.00 | 36.60 | 67.10 | ▅▇▃▁▁ |
| DiabetesPedigreeFunction | 0 | 1 | 0.48 | 0.35 | 0.08 | 0.24 | 0.37 | 0.65 | 2.42 | ▇▃▁▁▁ |
| Age | 0 | 1 | 33.28 | 11.67 | 21.00 | 24.00 | 29.00 | 40.00 | 81.00 | ▇▃▁▁▁ |
boxplot(diab.train, las = 2, varwidth = T)ggcorr(diab.train, label = T) let’s compare
diab_clean and diab.train statistical distribution on insulin variable, since it have the highest number of missingvalues.
Median :
Insulin on diab_clean : 125Insulin on diab.train.clean : 120SD :
Insulin on diab_clean : 119Insulin on diab.train.clean : 119Since it does not take a lot of computational power to perform logistic regression model on a relatively small dataset, I will try to create three models and later compare them.
I will start modelling using all of the variables available as predictor and see how it performs.
model_all <- glm( Outcome ~ .,family = "binomial", data = diab.train.clean)In the next model I will perform a feature selection using step in a backward direction, to select features with high enough significance.
model_step <- step( model_all,direction = "backward", trace = 0)I later reduce the number of predictor variables to only 3 of them based on their significance.
model_fitted <- glm( Outcome ~ Pregnancies+Glucose+BMI,family = "binomial", data = diab.train.clean)model_all$deviance#> [1] 467.0651
model_step$deviance#> [1] 467.3661
model_fitted$deviance#> [1] 472.0083
model_all$aic#> [1] 485.0651
model_step$aic#> [1] 477.3661
model_fitted$aic#> [1] 480.0083
Based on errors and AIC value of each models, model_fitted seem to have the lowest accuracy but better AIC than model_all. I will pick two models as our candidate based on the residual deviance value and compare these models performance.
Let’s see if the model is stable by measuring how much the variance of an estimated regression coefficient increases if your predictors are correlated.
vif(model_all)#> Pregnancies Glucose BloodPressure
#> 1.445646 1.334125 1.299358
#> SkinThickness Insulin BMI
#> 1.573632 1.351720 1.832557
#> DiabetesPedigreeFunction Age
#> 1.014315 1.705619
vif(model_step)#> Pregnancies Glucose BMI
#> 1.019347 1.004173 1.019998
#> DiabetesPedigreeFunction
#> 1.010721
A common rule of thumbs is that a VIF number greater than 10 may indicate high collinearity and worth further inspection, all of the variables do not have VIF greater than 10, so it is safe to say that multicollinearity does not occured in both models.
Tradeoff model performance on model_all
confmat_plot(model_all$fitted.values, diab.train$Outcome, "1", "0")confmat_plot(model_step$fitted.values, diab.train$Outcome, "1", "0")it seems reasonable to have 80% of recall and still have relatively good accuracy, so we will set our threshold to 0.3 for both models.
diab.test$pred.Outcome <- predict(model_all,newdata = diab.test, type = "response")
diab.test_step$pred.Outcome_step <- predict(model_step,newdata = diab.test_step, type = "response")
diab.test$pred.Outcome <- as.factor(ifelse(diab.test$pred.Outcome > 0.3 , "1","0"))
diab.test_step$pred.Outcome_step <- as.factor(ifelse(diab.test_step$pred.Outcome_step > 0.3 , "1","0")) confusionMatrix(data = diab.test$pred.Outcome,reference = diab.test$Outcome, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 66 6
#> 1 20 41
#>
#> Accuracy : 0.8045
#> 95% CI : (0.7268, 0.8681)
#> No Information Rate : 0.6466
#> P-Value [Acc > NIR] : 0.00005162
#>
#> Kappa : 0.5993
#>
#> Mcnemar's Test P-Value : 0.01079
#>
#> Sensitivity : 0.8723
#> Specificity : 0.7674
#> Pos Pred Value : 0.6721
#> Neg Pred Value : 0.9167
#> Prevalence : 0.3534
#> Detection Rate : 0.3083
#> Detection Prevalence : 0.4586
#> Balanced Accuracy : 0.8199
#>
#> 'Positive' Class : 1
#>
confusionMatrix(data = diab.test_step$pred.Outcome_step,reference = diab.test_step$Outcome, positive = "1")#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 128 21
#> 1 42 70
#>
#> Accuracy : 0.7586
#> 95% CI : (0.702, 0.8092)
#> No Information Rate : 0.6513
#> P-Value [Acc > NIR] : 0.0001212
#>
#> Kappa : 0.4956
#>
#> Mcnemar's Test P-Value : 0.0117434
#>
#> Sensitivity : 0.7692
#> Specificity : 0.7529
#> Pos Pred Value : 0.6250
#> Neg Pred Value : 0.8591
#> Prevalence : 0.3487
#> Detection Rate : 0.2682
#> Detection Prevalence : 0.4291
#> Balanced Accuracy : 0.7611
#>
#> 'Positive' Class : 1
#>
Using all the variables available on the dataset seem to be a better option, we reached 80% of accuracy and only have 6 patients that are positive incorrectly classified as negative by model_all.
I will try interpret model_all since it has the best performance.
summary(model_all)#>
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diab.train.clean)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7247 -0.7088 -0.4067 0.6838 2.3790
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -8.5598977 0.9658166 -8.863 < 0.0000000000000002 ***
#> Pregnancies 0.1518398 0.0404018 3.758 0.000171 ***
#> Glucose 0.0351849 0.0048599 7.240 0.000000000000449 ***
#> BloodPressure -0.0017156 0.0107892 -0.159 0.873659
#> SkinThickness 0.0062901 0.0135674 0.464 0.642921
#> Insulin 0.0001934 0.0011027 0.175 0.860807
#> BMI 0.0754036 0.0230906 3.266 0.001093 **
#> DiabetesPedigreeFunction 0.7513657 0.3538178 2.124 0.033704 *
#> Age -0.0025173 0.0125656 -0.200 0.841218
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 647.45 on 499 degrees of freedom
#> Residual deviance: 467.07 on 491 degrees of freedom
#> AIC: 485.07
#>
#> Number of Fisher Scoring iterations: 5
Based on the summary, it is safe to say that Glucose is affecting the model a lot more than other variables.
exp(model_all$coefficients)#> (Intercept) Pregnancies Glucose
#> 0.0001916389 1.1639737688 1.0358112546
#> BloodPressure SkinThickness Insulin
#> 0.9982858517 1.0063099063 1.0001933790
#> BMI DiabetesPedigreeFunction Age
#> 1.0783192934 2.1198931781 0.9974858247
Pregnancies: Every time a person is pregnant, the odds of that person to have diabetes is increasing for ~16%.
Glucose: For a one-unit increase in glucose level, the odds to have diabetes is increasing for ~3%.
BloodPressure: The odds value is less than 1, means, that as bloodpressure level of a patient is increasing by one unit, the odds of that patient to have diabetes is decreasing by ~0.2%.
SkinThickness: For every 1 mm increase on a patient skin thickness, their odds of having diabetes is increasing for ~0.6%.
Insulin: A one-unit increase in insulin level adds ~0.00019% to their odds of having diabetes.
BMI: For every one-unit increase on a patient BMI, their odds of having diabetes is increasing for ~0.78%.
DiabetesPedigreeFunction: The odds value is more than 1, means that the higher DiabetesPedigreeFunction level of a patient the more likely they are to have diabetes.
Age: The odds value is less than 1, means that the higher age level of a patient the less likely they are to have diabetes.
Disclaimer first, the author is not a doctor, I built these logistic regression models to study more about how classification in machine learning works, do not examine whether a person is diabetic or not based on this report.
With logistic regression, we can build a model to help doctors and medical workers alike to make a decision based on the model prediction, but it is clear that is far too dangerous to make decision based on the prediction alone, so interpretability of a model can also help to gather insights on which feature is significantly affecting the model and which is not, the odds of whether a patient is more likely to have diabetes or not based on the level of every features recorded. In comparison to blackbox models, I think that for this particular case it is better to use models that we can interpret and gather some insights from, to further investigate the symptom, rather than just relying on the machine prediction which could lead to fatal mistakes, since it involves the health and safety of a person.
Since it is not recommended to clean the test data set, I am not able to perform prediction using KNN, because of the amount of missing values on the dataset.