This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict based on diagnostic measurements whether a patient has diabetes. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Since the dataset contains only numerical predictors, I intend to build a machine learning model that predicts whether a patient has diabetes. This serves as practice in building classification models with Logistic Regression and K-Nearest Neighbor, and fulfills my assignment for the Algoritma Data Science school.
library(rsample)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(class)
library(gtools)
## 
## Attaching package: 'gtools'
## The following object is masked from 'package:rsample':
## 
##     permutations
library(cmplot)
## Warning: replacing previous import 'ggplot2::last_plot' by 'plotly::last_plot'
## when loading 'cmplot'
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
diabetes <- read.csv("data/diabetes_data.csv")
diabetes
Data description:
The Outcome variable will be our target, where 0 is categorized as negative for diabetes and 1 as positive.
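Before any preprocessing, a quick structural check can confirm that all predictors are numeric and that Outcome is stored as a 0/1 flag. A minimal sketch using dplyr's glimpse():
glimpse(diabetes)  # column types and the first few values of each variable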
diabetes <- diabetes %>% 
  mutate(Outcome = as.factor(case_when(Outcome == 1 ~ "Positive Diabetes",
                                       T ~ "Negative Diabetes")))

set.seed(129)
index <- sample(nrow(diabetes), nrow(diabetes) * 0.75)
diab_train <- diabetes[index,]
diab_test <- diabetes[-index,]

table(diab_train$Outcome)
## 
## Negative Diabetes Positive Diabetes 
##               367               209
prop.table(table(diab_train$Outcome))
## 
## Negative Diabetes Positive Diabetes 
##         0.6371528         0.3628472
Since our outcome classes are not balanced, we need to balance the data before building a model. I chose downSample() because the difference between the two class counts is not too big.
set.seed(129)
ds_diab_train <- downSample(x = diab_train %>% select(-Outcome),
y = diab_train$Outcome,
yname = "Outcome")
table(ds_diab_train$Outcome)
## 
## Negative Diabetes Positive Diabetes 
##               209               209
For comparison, I will create one Logistic Regression model using all of the available predictors and another using automatic backward stepwise selection.
model_diab <- glm(Outcome ~ ., ds_diab_train, family = "binomial")
model_diab_step <- step(glm(Outcome ~ ., ds_diab_train, family = "binomial"), direction = "backward")
## Start:  AIC=414.8
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - Insulin 1 396.82 412.82
## - SkinThickness 1 397.16 413.16
## <none> 396.80 414.80
## - BloodPressure 1 399.51 415.51
## - DiabetesPedigreeFunction 1 400.12 416.12
## - Age 1 404.01 420.01
## - Pregnancies 1 407.13 423.13
## - BMI 1 416.11 432.11
## - Glucose 1 465.64 481.64
##
## Step: AIC=412.82
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 397.38 411.38
## <none> 396.82 412.82
## - BloodPressure 1 399.51 413.51
## - DiabetesPedigreeFunction 1 400.13 414.13
## - Age 1 404.02 418.02
## - Pregnancies 1 407.16 421.16
## - BMI 1 416.22 430.22
## - Glucose 1 475.38 489.38
##
## Step: AIC=411.38
## Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction +
## Age
##
## Df Deviance AIC
## <none> 397.38 411.38
## - DiabetesPedigreeFunction 1 400.35 412.35
## - BloodPressure 1 400.50 412.50
## - Age 1 405.13 417.13
## - Pregnancies 1 407.79 419.79
## - BMI 1 416.87 428.87
## - Glucose 1 476.35 488.35
summary(model_diab)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = ds_diab_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.83497 -0.72036 0.04313 0.74668 2.83781
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4209307 0.9486984 -8.876 < 2e-16 ***
## Pregnancies 0.1408466 0.0448686 3.139 0.00169 **
## Glucose 0.0370386 0.0051567 7.183 6.84e-13 ***
## BloodPressure -0.0128447 0.0078927 -1.627 0.10365
## SkinThickness -0.0058008 0.0097002 -0.598 0.54984
## Insulin -0.0001893 0.0013216 -0.143 0.88610
## BMI 0.0802895 0.0196893 4.078 4.55e-05 ***
## DiabetesPedigreeFunction 0.7190993 0.4024713 1.787 0.07398 .
## Age 0.0373147 0.0141490 2.637 0.00836 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 579.47 on 417 degrees of freedom
## Residual deviance: 396.80 on 409 degrees of freedom
## AIC: 414.8
##
## Number of Fisher Scoring iterations: 5
summary(model_diab_step)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = ds_diab_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.85523 -0.71935 0.03976 0.74589 2.82588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.348991 0.932664 -8.952 < 2e-16 ***
## Pregnancies 0.141323 0.044855 3.151 0.00163 **
## Glucose 0.036874 0.004846 7.609 2.77e-14 ***
## BloodPressure -0.013551 0.007759 -1.746 0.08073 .
## BMI 0.075483 0.018401 4.102 4.09e-05 ***
## DiabetesPedigreeFunction 0.670956 0.396784 1.691 0.09084 .
## Age 0.038521 0.014089 2.734 0.00625 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 579.47 on 417 degrees of freedom
## Residual deviance: 397.38 on 411 degrees of freedom
## AIC: 411.38
##
## Number of Fisher Scoring iterations: 5
According to the stepwise model, the skin thickness of a patient is not a significant predictor, which is intuitive: in most cases, skin thickness does not change depending on whether someone suffers from diabetes or not. The insulin predictor, on the other hand, is directly related to diabetes, as insulin is the agent that regulates blood sugar in our body; people with diabetes do not produce insulin in the proper amount, or may even stop producing it altogether. However, the insulin predictor here is a 2-hour post-meal measurement. Laboratories typically rely on measurements taken after a longer fast (around 12 hours) for diagnosis, while the 2-hour post-meal value is more commonly used for self-testing at home to regularly control sugar intake. In this light, it is reasonable that the insulin predictor is not that significant for the target.
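To double-check that dropping Insulin and SkinThickness does not meaningfully worsen the fit, a likelihood-ratio test between the reduced (stepwise) model and the full model could be run. A minimal sketch using the two models fitted above:
# Compare the reduced (stepwise) model against the full model;
# a large p-value suggests the dropped predictors add little information.
anova(model_diab_step, model_diab, test = "Chisq")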
exp(0.670956)
## [1] 1.956106
inv.logit(0.670956)
## [1] 0.6617172
Predictors with negative values in the Estimate column increase the chance of testing negative for diabetes as their value gets higher, while predictors with positive estimates decrease that chance (i.e., increase the chance of testing positive). Taking the DiabetesPedigreeFunction coefficient as an example and converting it to odds and probability, we can see that when this predictor increases by 1, the odds of a person suffering from diabetes become about 1.96 times higher, which corresponds to a probability of about 66.17% of suffering from diabetes.
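As a quick sanity check, the same transformation can be applied to every coefficient of the stepwise model at once. A minimal sketch:
# Convert every log-odds coefficient into an odds ratio; values above 1 raise
# the odds of a positive diagnosis, values below 1 lower them.
exp(coef(model_diab_step))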
pred_diab <- predict(model_diab, diab_test, type = "response")
pred_diab_step <- predict(model_diab_step, diab_test, type = "response")
diab_test$pred_diab <- factor(ifelse(pred_diab > 0.5, "Positive Diabetes", "Negative Diabetes"))
diab_test$pred_diab_step <- factor(ifelse(pred_diab_step > 0.5, "Positive Diabetes", "Negative Diabetes"))

confusionMatrix(diab_test$pred_diab, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Diabetes Positive Diabetes
## Negative Diabetes 98 21
## Positive Diabetes 35 38
##
## Accuracy : 0.7083
## 95% CI : (0.6386, 0.7716)
## No Information Rate : 0.6927
## P-Value [Acc > NIR] : 0.35110
##
## Kappa : 0.3573
##
## Mcnemar's Test P-Value : 0.08235
##
## Sensitivity : 0.6441
## Specificity : 0.7368
## Pos Pred Value : 0.5205
## Neg Pred Value : 0.8235
## Prevalence : 0.3073
## Detection Rate : 0.1979
## Detection Prevalence : 0.3802
## Balanced Accuracy : 0.6905
##
## 'Positive' Class : Positive Diabetes
##
confusionMatrix(diab_test$pred_diab_step, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Diabetes Positive Diabetes
## Negative Diabetes 98 20
## Positive Diabetes 35 39
##
## Accuracy : 0.7135
## 95% CI : (0.644, 0.7763)
## No Information Rate : 0.6927
## P-Value [Acc > NIR] : 0.29452
##
## Kappa : 0.3716
##
## Mcnemar's Test P-Value : 0.05906
##
## Sensitivity : 0.6610
## Specificity : 0.7368
## Pos Pred Value : 0.5270
## Neg Pred Value : 0.8305
## Prevalence : 0.3073
## Detection Rate : 0.2031
## Detection Prevalence : 0.3854
## Balanced Accuracy : 0.6989
##
## 'Positive' Class : Positive Diabetes
##
In terms of model performance, the stepwise model has 0.52% higher accuracy and 1.69% higher sensitivity compared to the model that uses all predictors, so there is almost no practical difference between the two models. In a medical case, however, a false negative prediction can result in a fatal health condition or even death. For that reason, we want to make sure our model catches as many patients who actually have diabetes as possible, so we can tune the classification cutoff to increase its sensitivity (recall).
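One way to pick the cutoff is to sweep a range of thresholds and watch how sensitivity and specificity trade off. A minimal sketch using the stepwise predictions above (the cutoffs grid and cutoff_metrics helper are my own names, not part of the original analysis):
# Sweep candidate cutoffs and record sensitivity/specificity for each,
# reusing the probabilities already predicted by the stepwise model.
cutoffs <- seq(0.2, 0.8, by = 0.05)

cutoff_metrics <- sapply(cutoffs, function(k) {
  pred_class <- factor(ifelse(pred_diab_step > k, "Positive Diabetes", "Negative Diabetes"),
                       levels = levels(diab_test$Outcome))
  cm <- confusionMatrix(pred_class, diab_test$Outcome, positive = "Positive Diabetes")
  c(cutoff      = k,
    sensitivity = unname(cm$byClass["Sensitivity"]),
    specificity = unname(cm$byClass["Specificity"]))
})

t(cutoff_metrics)  # one row per cutoff, for eyeballing the trade-off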
confmat_plot(pred_diab,
             diab_test$Outcome,
             postarget = "Positive Diabetes",
             negtarget = "Negative Diabetes")
## Warning in confusionMatrix.default(predict, ref, positive = postarget): Levels
## are not in the same order for reference and data. Refactoring data to match.
The plot above helps us visualize what happens if we increase or reduce the cutoff value for our model. Since there is not much difference in performance between the stepwise model and the model with all predictors, I will only use the stepwise model for tuning.
diab_test$pred_diab_step <- factor(ifelse(pred_diab_step > 0.3, "Positive Diabetes", "Negative Diabetes"))
confusionMatrix(diab_test$pred_diab_step, diab_test$Outcome, positive = "Positive Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Diabetes Positive Diabetes
## Negative Diabetes 73 10
## Positive Diabetes 60 49
##
## Accuracy : 0.6354
## 95% CI : (0.5631, 0.7035)
## No Information Rate : 0.6927
## P-Value [Acc > NIR] : 0.9624
##
## Kappa : 0.307
##
## Mcnemar's Test P-Value : 4.724e-09
##
## Sensitivity : 0.8305
## Specificity : 0.5489
## Pos Pred Value : 0.4495
## Neg Pred Value : 0.8795
## Prevalence : 0.3073
## Detection Rate : 0.2552
## Detection Prevalence : 0.5677
## Balanced Accuracy : 0.6897
##
## 'Positive' Class : Positive Diabetes
##
With a cutoff of 0.3, we obtain a sensitivity of 83.05% with an overall accuracy of 63.54%. Below a cutoff of 0.3, we would sacrifice too much accuracy, resulting in too many false positive diagnoses.
Besides Logistic Regression, we can use the K-Nearest Neighbor method to predict which patients have diabetes. However, because K-NN relies on distances, we have to make sure that all predictors are scaled to a comparable range. By simply checking the values in our data frame, we can find out whether the data need scaling or not. In this case, the predictors sit on very different ranges, so they need to be scaled before we build the K-NN model.
diab_train_norm <- diab_train %>% mutate_if(is.numeric, scale)

diab_test_norm <- diab_test %>% 
  select(-c(pred_diab, pred_diab_step)) %>% 
  mutate_if(is.numeric, scale)
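Note that the chunk above scales the train and test sets independently. An alternative, not used here, is to learn the centering and scaling parameters on the training set only and apply them to both sets, for example with caret::preProcess. A minimal sketch under that assumption (the scaler and *_scaled names are hypothetical):
# Learn center/scale parameters from the training predictors only ...
scaler <- preProcess(diab_train %>% select(-Outcome), method = c("center", "scale"))

# ... then apply the same transformation to both train and test predictors,
# so the test set never influences the scaling parameters.
diab_train_scaled <- predict(scaler, diab_train %>% select(-Outcome))
diab_test_scaled  <- predict(scaler, diab_test %>% select(-Outcome, -pred_diab, -pred_diab_step))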
set.seed(129)
ds_diab_train_norm <- downSample(x = diab_train_norm %>% select(-Outcome),
                                 y = diab_train_norm$Outcome,
                                 yname = "Outcome")

diab_train_x <- ds_diab_train_norm %>% select(-Outcome)
diab_test_x <- diab_test_norm %>% select(-Outcome)
diab_train_y <- ds_diab_train_norm$Outcome
diab_test_y <- diab_test_norm$Outcome
To find the best model, we can look for the optimum K value for the knn() function by taking the square root of the number of rows in our training data. When the result is an even number, we reduce it by 1 to make it odd.
round(sqrt(nrow(diab_train_x)))
## [1] 20
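Since 20 is even, the rule above gives k = 19. A tiny sketch making the rule explicit (k_opt is just an illustrative name, using the same training matrix):
# Start from the square root of the number of training rows and step down
# to the nearest odd number to avoid ties when the neighbors vote.
k_opt <- round(sqrt(nrow(diab_train_x)))
if (k_opt %% 2 == 0) k_opt <- k_opt - 1
k_opt  # 19 for the 418-row downsampled training set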
pred_diab_knn <- knn(train = diab_train_x,
                     test = diab_test_x,
                     cl = diab_train_y,
                     k = 19)

confusionMatrix(pred_diab_knn, diab_test_y, "Positive Diabetes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction Negative Diabetes Positive Diabetes
## Negative Diabetes 88 18
## Positive Diabetes 45 41
##
## Accuracy : 0.6719
## 95% CI : (0.6006, 0.7378)
## No Information Rate : 0.6927
## P-Value [Acc > NIR] : 0.760793
##
## Kappa : 0.3163
##
## Mcnemar's Test P-Value : 0.001054
##
## Sensitivity : 0.6949
## Specificity : 0.6617
## Pos Pred Value : 0.4767
## Neg Pred Value : 0.8302
## Prevalence : 0.3073
## Detection Rate : 0.2135
## Detection Prevalence : 0.4479
## Balanced Accuracy : 0.6783
##
## 'Positive' Class : Positive Diabetes
##
Although the K-NN accuracy is 4.16% lower than that of the stepwise model, its sensitivity is 3.39% higher than the stepwise model before the cutoff was tuned.
Both methods can predict the target with reasonable accuracy. However, Logistic Regression can be tuned to achieve the desired performance and is interpretable. Meanwhile, the K-NN model, which often performs better than Logistic Regression, is not as flexible in this case and is not interpretable. For all of these reasons, Logistic Regression is better suited here than K-NN.