Diabetes Prediction using Logistic Regression and K-Nearest Neighbors
Overview
In this project, we try to predict whether a person has diabetes based on several diagnostic measurements. We focus on the diabetes-positive class: we want the model to capture as many of the truly diabetic cases as possible. The metric we compare is therefore recall (sensitivity), and we check which of the two algorithms, Logistic Regression or K-Nearest Neighbors, achieves the higher value. Recall is the proportion of actually positive observations that the model correctly predicts as positive.
The dataset comes from Kaggle and contains 9 attributes for 768 entries. If you are interested, the dataset for this project can be accessed here.
Library and Setup
Load required packages.
library(car)
library(caret)
library(class)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(performance)
Logistic Regression
Data Preparation
Load the dataset.
diabetes <- read.csv("data_input/diabetes2.csv")
Then we convert the target variable Outcome to a factor with descriptive labels.
diabetes <- diabetes %>%
  mutate(Outcome = factor(Outcome, levels = c(0,1), labels = c("Not Diabetes", "Diabetes")))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, ...
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
## $ Outcome <fct> Diabetes, Not Diabetes, Diabetes, Not Diab...
Pregnancies - Number of times pregnant.
Glucose - Plasma glucose concentration (glucose tolerance test).
BloodPressure - Diastolic blood pressure (mm Hg).
SkinThickness - Triceps skin fold thickness (mm).
Insulin - 2-Hour serum insulin (mu U/ml).
BMI - Body mass index (weight in kg/(height in m)^2).
DiabetesPedigreeFunction - Diabetes pedigree function.
Age - Age (years).
Outcome - Test for Diabetes.
Exploratory Data Analysis
Checking for missing values.
colSums(is.na(diabetes))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
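A count of NA values does not catch measurements that were recorded as 0. The summary of this dataset shows a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI. Treating such zeros as missing is an assumption, not something this analysis does; the sketch below only shows how they could be counted, using a small hypothetical data frame named toy.

```r
# Hypothetical data frame, for illustration only
toy <- data.frame(Glucose = c(148, 0, 183),
                  BMI     = c(33.6, 26.6, 0),
                  Insulin = c(0, 0, 94))

# Count zeros per column
zero_counts <- colSums(toy == 0)

# Assumption: a zero means "not measured", so recode it as NA
toy_na <- toy
toy_na[toy_na == 0] <- NA
na_counts <- colSums(is.na(toy_na))

zero_counts
na_counts
```

The same two calls could be applied to the predictor columns of diabetes, if one decides that zeros in those columns are placeholders.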
We need to check the class proportion of the target variable.
x <- prop.table(table(diabetes$Outcome))
b <- barplot(x,col="lightBlue", main = "Target Class Proportion Diagram")
text(x=b, y= x, labels=round(x,2), pos = 1)
The target classes are moderately imbalanced (roughly 65% Not Diabetes and 35% Diabetes), but still relatively balanced.
Cross Validation
Before we build the model, we need to split the data into a train dataset and a test dataset. We will use 80% of the data as training data and the rest as testing data.
RNGkind(sample.kind = "Rounding")
set.seed(23)
intrain <- sample(nrow(diabetes),nrow(diabetes)*.8)
diabetes_train <- diabetes[intrain,]
diabetes_test <- diabetes[-intrain,]
We need to check the class proportion of the train dataset again, to see whether it is still balanced.
prop.table(table(diabetes_train$Outcome))
##
## Not Diabetes Diabetes
## 0.6416938 0.3583062
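A purely random split only approximately preserves the class proportion. A stratified split, which samples 80% within each class, preserves it by construction. This is a different technique from the one used in this analysis; the sketch below is a minimal base R version on a hypothetical data frame named toy.

```r
set.seed(23)

# Hypothetical data: 65 negative and 35 positive cases
toy <- data.frame(id = 1:100,
                  Outcome = factor(rep(c("Not Diabetes", "Diabetes"),
                                       times = c(65, 35))))

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$Outcome)

# Sample 80% of the rows inside each class, then combine
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.8 * length(idx)))]
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

prop.table(table(toy_train$Outcome))
```

With this approach the train and test sets keep the same share of positive cases as the full data, which matters more as the classes become less balanced.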
Modelling
We will build several Logistic Regression models with Outcome as the target. Some use predictors chosen from my own understanding or estimation, and one comes from stepwise selection.
LR_diabetes_model_all <- glm(formula = Outcome ~ ., data = diabetes_train, family = "binomial")
LR_diabetes_model_none <- glm(formula = Outcome~1,data = diabetes_train,family = "binomial")
summary(LR_diabetes_model_all)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5786 -0.7259 -0.4120 0.7148 2.8001
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.1751845 0.7817661 -10.457 < 0.0000000000000002 ***
## Pregnancies 0.1362161 0.0356930 3.816 0.000135 ***
## Glucose 0.0360271 0.0041399 8.702 < 0.0000000000000002 ***
## BloodPressure -0.0152774 0.0058098 -2.630 0.008548 **
## SkinThickness 0.0037301 0.0076761 0.486 0.627013
## Insulin -0.0007789 0.0010000 -0.779 0.436016
## BMI 0.0827918 0.0170758 4.848 0.00000124 ***
## DiabetesPedigreeFunction 0.8110411 0.3436386 2.360 0.018267 *
## Age 0.0128087 0.0103071 1.243 0.213976
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 801.19 on 613 degrees of freedom
## Residual deviance: 576.21 on 605 degrees of freedom
## AIC: 594.21
##
## Number of Fisher Scoring iterations: 5
LR_diabetes_model_selected <- glm(formula = Outcome~Pregnancies+Glucose+BloodPressure+BMI+Insulin+DiabetesPedigreeFunction,data = diabetes_train,family = "binomial")
LR_diabetes_model_backward <- step(object = LR_diabetes_model_all,direction = "backward",trace = F)
After building several models, let’s compare them with each other.
compare_performance(LR_diabetes_model_all,LR_diabetes_model_backward,LR_diabetes_model_selected)
## # Comparison of Model Performance Indices
##
## Model | Type | AIC | BIC | Tjur's R2 | RMSE | Sigma | Score_log | Score_spherical | PCP | BF | p
## ------------------------------------------------------------------------------------------------------------------------------------
## LR_diabetes_model_all | glm | 594.21 | 633.99 | 0.34 | 1.07 | 0.98 | -Inf | 1.71e-03 | 0.70 | 1.00 |
## LR_diabetes_model_backward | glm | 590.48 | 617.00 | 0.34 | 1.07 | 0.98 | -Inf | 1.69e-03 | 0.70 | > 1000 | 0.517
## LR_diabetes_model_selected | glm | 591.88 | 622.82 | 0.34 | 1.07 | 0.98 | -Inf | 1.68e-03 | 0.70 | 266.04 | 0.437
To choose the best model, we can compare AIC values. AIC estimates the relative amount of “information loss” of a model: it rewards goodness of fit and penalizes the number of parameters. The smaller the AIC, the less information is lost, so among candidates fitted to the same data the model with the lowest AIC is preferred. From the result above, LR_diabetes_model_backward has the lowest AIC (590.48), so we will use it as our Logistic Regression model.
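For a logistic regression on a binary 0/1 response, AIC equals the residual deviance plus twice the number of estimated coefficients. As a small check, the sketch below recomputes the AIC values from the deviances and coefficient counts printed in the glm summaries of this article; the numbers are copied by hand, not extracted from the fitted objects.

```r
# Residual deviances copied from the two glm summaries in this article
resid_dev <- c(all = 576.21, backward = 578.48)

# Number of estimated coefficients, including the intercept
n_coef <- c(all = 9, backward = 6)

# AIC = residual deviance + 2 * number of coefficients (binary response)
aic <- resid_dev + 2 * n_coef
aic
```

The results agree with the AIC column of the comparison table: dropping three predictors raises the deviance slightly, but the smaller penalty more than compensates.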
summary(LR_diabetes_model_backward)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = "binomial", data = diabetes_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7603 -0.7365 -0.4207 0.7161 2.8613
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.860400 0.737872 -10.653 < 0.0000000000000002 ***
## Pregnancies 0.159326 0.031145 5.116 0.000000313 ***
## Glucose 0.035964 0.003791 9.487 < 0.0000000000000002 ***
## BloodPressure -0.013872 0.005618 -2.469 0.0135 *
## BMI 0.081424 0.015998 5.090 0.000000359 ***
## DiabetesPedigreeFunction 0.819290 0.336800 2.433 0.0150 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 801.19 on 613 degrees of freedom
## Residual deviance: 578.48 on 608 degrees of freedom
## AIC: 590.48
##
## Number of Fisher Scoring iterations: 5
The summary of LR_diabetes_model_backward above contains a lot of information, but we focus on Pr(>|z|) and AIC. Pr(>|z|) tells us which predictors have a statistically significant association with the target: if the value is below 0.05 (alpha), we assume the variable has a significant effect in the model. The smaller the Pr(>|z|) value, the stronger the evidence for that effect, and the stars next to each value give a quick visual guide: more stars mean a smaller p-value.
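The coefficients themselves are on the log-odds scale, which is hard to read directly. Exponentiating a coefficient gives an odds ratio per one-unit increase in the predictor. The sketch below does this for the estimates printed in the summary above; the values are typed in by hand, and in the actual session exp(coef(LR_diabetes_model_backward)) should give the same thing.

```r
# Estimates copied from the summary of LR_diabetes_model_backward
coefs <- c(Pregnancies              = 0.159326,
           Glucose                  = 0.035964,
           BloodPressure            = -0.013872,
           BMI                      = 0.081424,
           DiabetesPedigreeFunction = 0.819290)

# exp() converts a log-odds coefficient into an odds ratio
odds_ratios <- exp(coefs)
round(odds_ratios, 3)
```

An odds ratio above 1 means the odds of diabetes rise as the predictor increases. For Glucose the ratio is about 1.037, so each additional unit of glucose is associated with roughly 3.7% higher odds, holding the other predictors fixed. BloodPressure is the only predictor with a ratio below 1.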
Predicting
After choosing the best model for our dataset, we test its performance on the testing dataset that we split off above.
diabetes_test_LR <- diabetes_test
diabetes_test_LR$pred_model_backward <- predict(object = LR_diabetes_model_backward,newdata = diabetes_test_LR,type = "response")
# Use the same level order as Outcome so confusionMatrix() does not need to refactor
diabetes_test_LR$label_model_backward <- factor(ifelse(diabetes_test_LR$pred_model_backward > .5, "Diabetes", "Not Diabetes"), levels = levels(diabetes_test_LR$Outcome))
ggplot(diabetes_test_LR, aes(x=pred_model_backward)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data",x="Diabetes Probability",y="Density") +
  theme_minimal()
From the distribution of predicted probabilities above, it can be seen that most of the density lies below 0.5, so the model predicts the majority of the test observations as negative for diabetes.
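The labels above use a cutoff of 0.5. Since this analysis targets sensitivity, the cutoff is itself a choice: lowering it labels more observations as positive, which raises sensitivity at the cost of more false positives. The sketch below illustrates the effect on hypothetical probabilities and labels; prob, truth, and sensitivity_at are made-up names, not part of the analysis.

```r
# Hypothetical predicted probabilities and true labels, for illustration only
prob  <- c(0.10, 0.35, 0.45, 0.55, 0.80, 0.20, 0.40, 0.65)
truth <- c("Not Diabetes", "Diabetes", "Diabetes", "Diabetes",
           "Diabetes", "Not Diabetes", "Not Diabetes", "Not Diabetes")

# Sensitivity = TP / (TP + FN) for a given probability cutoff
sensitivity_at <- function(threshold) {
  pred <- ifelse(prob > threshold, "Diabetes", "Not Diabetes")
  tp <- sum(pred == "Diabetes" & truth == "Diabetes")
  fn <- sum(pred == "Not Diabetes" & truth == "Diabetes")
  tp / (tp + fn)
}

sens_50 <- sensitivity_at(0.5)  # 2 of 4 positives found
sens_30 <- sensitivity_at(0.3)  # 4 of 4 positives found
c(cutoff_0.5 = sens_50, cutoff_0.3 = sens_30)
```

In this toy example the lower cutoff also turns two negative cases into false positives, so any change of cutoff should be judged on both sensitivity and specificity.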
Evaluation
Using the predictions on the testing dataset above, we can evaluate our model with a confusion matrix. The metrics usually used for model evaluation are accuracy, specificity, sensitivity, and precision. In this case we focus on sensitivity: the ratio of the number of positive observations that are predicted to be positive (True Positive) to the total number of observations that are actually positive (True Positive + False Negative).
confusionMatrix_LR <- confusionMatrix(data = diabetes_test_LR$label_model_backward,reference = diabetes_test_LR$Outcome,positive = "Diabetes")
confusionMatrix_LR
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Diabetes Diabetes
## Not Diabetes 92 29
## Diabetes 14 19
##
## Accuracy : 0.7208
## 95% CI : (0.6429, 0.79)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.21814
##
## Kappa : 0.2884
##
## Mcnemar's Test P-Value : 0.03276
##
## Sensitivity : 0.3958
## Specificity : 0.8679
## Pos Pred Value : 0.5758
## Neg Pred Value : 0.7603
## Prevalence : 0.3117
## Detection Rate : 0.1234
## Detection Prevalence : 0.2143
## Balanced Accuracy : 0.6319
##
## 'Positive' Class : Diabetes
##
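The statistics in this output can be reproduced by hand from the four cells of the confusion matrix. The sketch below uses the counts printed above for the Logistic Regression model, with Diabetes as the positive class; the variable names are only for illustration.

```r
# Cell counts copied from the confusion matrix above
tn <- 92  # predicted Not Diabetes, actually Not Diabetes
fn <- 29  # predicted Not Diabetes, actually Diabetes
fp <- 14  # predicted Diabetes, actually Not Diabetes
tp <- 19  # predicted Diabetes, actually Diabetes

sensitivity <- tp / (tp + fn)                   # recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)                   # Pos Pred Value
accuracy    <- (tp + tn) / (tp + tn + fp + fn)

round(c(sensitivity = sensitivity, specificity = specificity,
        precision = precision, accuracy = accuracy), 4)
```

The rounded values match the Sensitivity, Specificity, Pos Pred Value, and Accuracy lines of the output: the model finds 19 of the 48 diabetic cases in the test set.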
Assumption test
Assumptions are conditions that should be met before we draw inferences from the model estimates. For a logistic regression model, we test for multicollinearity, which occurs when the independent variables are too highly correlated with each other. Multicollinearity inflates the variance of the coefficient estimates, which makes them unstable and their interpretation misleading.
Multicollinearity will be tested with the Variance Inflation Factor (VIF). For linear regression, it is defined as VIF = 1/Tolerance, where Tolerance = 1 - R^2 of one predictor regressed on all the others. As a common rule of thumb, VIF > 10 indicates multicollinearity among the variables.
vif(LR_diabetes_model_backward)
## Pregnancies Glucose BloodPressure
## 1.058501 1.022759 1.124327
## BMI DiabetesPedigreeFunction
## 1.114673 1.018145
From the result above, we see no sign of problematic multicollinearity: every predictor used in the model has a VIF far below 10, and all values are close to 1.
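To make the definition VIF = 1/Tolerance concrete, the sketch below computes it by hand on simulated data: each predictor is regressed on the others, and the resulting R^2 gives the tolerance. This is the textbook linear regression definition; the values that vif() reports for a glm may differ somewhat, and vif_manual and the simulated variables are hypothetical.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)  # built to be strongly correlated with x1
x3 <- rnorm(n)                        # unrelated to x1 and x2

# VIF of one predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_x1 <- vif_manual(x1, data.frame(x2, x3))
vif_x3 <- vif_manual(x3, data.frame(x1, x2))
c(x1 = vif_x1, x3 = vif_x3)
```

The correlated predictor gets a large VIF, while the independent one stays near 1, which is the pattern seen for all predictors in the model above.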
K-Nearest Neighbors
The principle of K-Nearest Neighbors (KNN) is to classify a new observation by finding the k training observations closest to it and assigning the majority class among them. Here k is the number of nearest neighbors considered.
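The idea can be sketched in a few lines of base R: compute the Euclidean distance from the new point to every training row, keep the k smallest, and take a majority vote. This is an illustrative toy version with made-up data and a hypothetical helper knn_predict_one, not the implementation used by the class package.

```r
# Classify one new point by majority vote among its k nearest neighbors
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to each row of train_x
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  nearest <- order(d)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical training data: two well-separated groups
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

pred <- knn_predict_one(train_x, train_y, new_x = c(0.5, 0.5), k = 3)
pred
```

Because the vote depends entirely on distances, a predictor measured on a large scale would dominate the result, which is why scaling matters for this method.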
Pre-Processing
Because the nearest neighbors are determined by distance, every numerical predictor should be on the same or nearly the same scale. If the scales differ widely, predictors with large ranges dominate the distance, so the variables need to be rescaled for the results to be balanced and unbiased.
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Not Diabetes:500
## Diabetes :268
##
##
##
##
From the information above, the scales of the variables differ widely, so we need to scale each variable using its mean and standard deviation from the train dataset.
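As a small illustration of this kind of scaling, the sketch below standardizes a hypothetical training matrix with scale(), then applies the training mean and standard deviation to a test row, and checks the result against the formula (x - mean) / sd. The matrices train and test are made up for the example.

```r
# Hypothetical data: two predictors on very different scales
train <- matrix(c(1, 2, 3, 4,
                  10, 20, 30, 40), ncol = 2)
test  <- matrix(c(2.5, 25), ncol = 2)

# Scale the training data and keep its parameters
train_scaled <- scale(train)
mu    <- attr(train_scaled, "scaled:center")  # column means of train
sigma <- attr(train_scaled, "scaled:scale")   # column standard deviations of train

# Scale the test data with the TRAIN parameters, not its own
test_scaled <- scale(test, center = mu, scale = sigma)

# Same result by hand: (x - mean) / sd
manual <- (test[1, ] - mu) / sigma
test_scaled
```

The test row sits exactly at the training means, so both scaled values are 0. Scaling the test data with its own mean and standard deviation would place it on a different scale from the training data.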
#Predictors
diabetes_train_x <- diabetes_train %>% select(-Outcome)
diabetes_test_x <- diabetes_test %>% select(-Outcome)
#Target
diabetes_train_y <- diabetes_train %>% select(Outcome)
diabetes_test_y <- diabetes_test %>% select(Outcome)
To keep the train and test datasets comparable after scaling, both must be scaled with the same parameters. This means scaling the train and test datasets with the mean and standard deviation of each predictor from the train dataset.
diabetes_train_x_scaled <- scale(x = diabetes_train_x)
diabetes_test_x_scaled <- scale(x = diabetes_test_x,
                                center = attr(diabetes_train_x_scaled,"scaled:center"),
                                scale = attr(diabetes_train_x_scaled,"scaled:scale"))
Next, we choose the value of k. A common rule of thumb is to take the square root of the number of observations in the train dataset.
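k <- sqrt(nrow(diabetes_train))
k
## [1] 24.77902
The square root of the 614 training rows is about 24.8. The prediction below uses k = 23, an odd value close to it; with two classes, an odd k avoids tied votes.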
Predicting
Classification with k-NN does not build a model in the usual sense: no ‘parameters’ are learned from the data. Once all the predictors are scaled and the value of k is chosen, predictions are made directly from the training data, using the arguments below.
diabetes_test_KNN <- diabetes_test
diabetes_test_KNN$label_predicted_KNN <- knn(train = diabetes_train_x_scaled,test = diabetes_test_x_scaled,cl = diabetes_train$Outcome,k = 23)
Evaluation
The predictions above are then compared with the actual labels of the test dataset. As with Logistic Regression above, the metric to focus on is sensitivity.
confusionMatrix_KNN <- confusionMatrix(data = diabetes_test_KNN$label_predicted_KNN,
reference = diabetes_test_KNN$Outcome,positive = "Diabetes")
confusionMatrix_KNN
## Confusion Matrix and Statistics
##
## Reference
## Prediction Not Diabetes Diabetes
## Not Diabetes 98 27
## Diabetes 8 21
##
## Accuracy : 0.7727
## 95% CI : (0.6984, 0.8363)
## No Information Rate : 0.6883
## P-Value [Acc > NIR] : 0.013084
##
## Kappa : 0.406
##
## Mcnemar's Test P-Value : 0.002346
##
## Sensitivity : 0.4375
## Specificity : 0.9245
## Pos Pred Value : 0.7241
## Neg Pred Value : 0.7840
## Prevalence : 0.3117
## Detection Rate : 0.1364
## Detection Prevalence : 0.1883
## Balanced Accuracy : 0.6810
##
## 'Positive' Class : Diabetes
##
Conclusion
From the evaluation of the two models above, we collect the sensitivity value of each method below.
comparison_sensitivity <- data.frame("Sensitivity Logistic Regression"=data.frame(confusionMatrix_LR$byClass)[1,],
                                     "Sensitivity K-Nearest Neighbors"=data.frame(confusionMatrix_KNN$byClass)[1,])
From the comparison above, KNN has a higher sensitivity (0.4375) than Logistic Regression (0.3958) on this test set: it detects 21 of the 48 diabetic cases, against 19 for Logistic Regression. By this metric KNN performs better, although both models still miss more than half of the diabetic cases. However, KNN has a weakness: it does not tell us which predictors have a strong influence on the prediction.
Therefore, if we also want to know how much each predictor influences the outcome, it is better to use Logistic Regression. If we care mainly about the prediction results and not about the influence of individual predictors, then KNN is a good choice.