Diabetes Prediction using Logistic Regression and K-Nearest Neighbors

Overview

In this analysis we predict whether a person has diabetes based on several diagnostic measurements. The class of interest is the positive (diabetes) outcome: we want the model to capture as many of the truly diabetic cases as possible. The metric we compare is therefore recall (sensitivity), and the goal is to see whether Logistic Regression or K-Nearest Neighbors achieves the higher value. Recall is the share of actually positive observations that the model correctly predicts as positive.
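As a minimal sketch of the recall definition above, the helper below counts true positives and false negatives from two label vectors. The function name `recall` and the toy labels are hypothetical and are not part of the analysis that follows.

```r
# Recall (sensitivity) = TP / (TP + FN), computed from two label vectors.
recall <- function(actual, predicted, positive) {
  true_positive  <- sum(actual == positive & predicted == positive)
  false_negative <- sum(actual == positive & predicted != positive)
  true_positive / (true_positive + false_negative)
}

# Hypothetical toy labels, not taken from the dataset.
actual    <- c("Diabetes", "Diabetes", "Diabetes", "Diabetes",
               "Not Diabetes", "Not Diabetes")
predicted <- c("Diabetes", "Diabetes", "Diabetes", "Not Diabetes",
               "Not Diabetes", "Diabetes")

toy_recall <- recall(actual, predicted, positive = "Diabetes")
print(toy_recall)  # 3 of the 4 actual positives are caught: 0.75
```

Note that the false positive in the last position does not affect recall; it would only lower precision and specificity.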

The dataset we will use comes from Kaggle and contains 9 attributes for 768 entries. If you are interested, the dataset for this project can be accessed here.

Library and Setup

Load required packages.

library(car)
library(caret)
library(class)
library(dplyr)
library(ggplot2)
library(tidyr)
library(tidyverse)
library(performance)

Logistic Regression

Data Preparation

Load the dataset.

diabetes <- read.csv("data_input/diabetes2.csv")

Then we change the variable type according to what it should be.

diabetes <- diabetes %>% 
  mutate(Outcome = factor(Outcome, levels = c(0,1), labels = c("Not Diabetes", "Diabetes")))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, ...
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, ...
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92,...
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, ...
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, ...
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, ...
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, ...
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30...
## $ Outcome                  <fct> Diabetes, Not Diabetes, Diabetes, Not Diab...
  • Pregnancies - Number of times pregnant.
  • Glucose - Plasma glucose concentration (glucose tolerance test).
  • BloodPressure - Diastolic blood pressure (mm Hg).
  • SkinThickness - Triceps skin fold thickness (mm).
  • Insulin - 2-Hour serum insulin (mu U/ml).
  • BMI - Body mass index (weight in kg/(height in m)^2).
  • DiabetesPedigreeFunction - Diabetes pedigree function.
  • Age - Age (years).
  • Outcome - Test for Diabetes

Exploratory Data Analysis

Checking for missing values.

colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

We need to check the class proportion of the target variable.

x <- prop.table(table(diabetes$Outcome))
b <- barplot(x,col="lightBlue", main = "Target Class Proportion Diagram")
text(x=b, y= x, labels=round(x,2), pos = 1)

The target classes are still relatively balanced: about 65% Not Diabetes (500 of 768) and 35% Diabetes (268 of 768).

Cross Validation

Before building the model, we need to split the data into a training set and a test set. We will use 80% of the data for training and the rest for testing.

RNGkind(sample.kind = "Rounding")
set.seed(23)

intrain <- sample(nrow(diabetes),nrow(diabetes)*.8)

diabetes_train <- diabetes[intrain,]
diabetes_test <- diabetes[-intrain,]

We check the class proportion again in the training set to see whether it is still balanced.

prop.table(table(diabetes_train$Outcome))
## 
## Not Diabetes     Diabetes 
##    0.6416938    0.3583062
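
A plain random split can shift the class shares slightly, as the 0.64 / 0.36 split above shows against roughly 0.65 / 0.35 in the full data. One option, which is not what this analysis does, is to sample 80% within each class so the shares stay almost identical. The sketch below assumes a toy outcome vector with the same 500 / 268 class counts; `intrain_strat` is a hypothetical name.

```r
set.seed(23)

# Hypothetical toy outcome with the same 500/268 class counts as the full data.
outcome <- factor(rep(c("Not Diabetes", "Diabetes"), times = c(500, 268)),
                  levels = c("Not Diabetes", "Diabetes"))

# Draw 80% of the row indices separately within each class.
rows_by_class <- split(seq_along(outcome), outcome)
intrain_strat <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}), use.names = FALSE)

full_share  <- prop.table(table(outcome))
train_share <- prop.table(table(outcome[intrain_strat]))
print(round(full_share, 3))
print(round(train_share, 3))
```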

Modelling

We will fit several Logistic Regression models with Outcome as the target. One uses all predictors, one uses predictors that I selected manually based on my own judgment, and one comes from backward stepwise selection. An intercept-only model is fitted as well.

LR_diabetes_model_all <- glm(formula = Outcome ~ ., data = diabetes_train, family = "binomial")
LR_diabetes_model_none <- glm(formula = Outcome~1,data = diabetes_train,family = "binomial")
summary(LR_diabetes_model_all)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5786  -0.7259  -0.4120   0.7148   2.8001  
## 
## Coefficients:
##                            Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)              -8.1751845  0.7817661 -10.457 < 0.0000000000000002 ***
## Pregnancies               0.1362161  0.0356930   3.816             0.000135 ***
## Glucose                   0.0360271  0.0041399   8.702 < 0.0000000000000002 ***
## BloodPressure            -0.0152774  0.0058098  -2.630             0.008548 ** 
## SkinThickness             0.0037301  0.0076761   0.486             0.627013    
## Insulin                  -0.0007789  0.0010000  -0.779             0.436016    
## BMI                       0.0827918  0.0170758   4.848           0.00000124 ***
## DiabetesPedigreeFunction  0.8110411  0.3436386   2.360             0.018267 *  
## Age                       0.0128087  0.0103071   1.243             0.213976    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 801.19  on 613  degrees of freedom
## Residual deviance: 576.21  on 605  degrees of freedom
## AIC: 594.21
## 
## Number of Fisher Scoring iterations: 5
LR_diabetes_model_selected <- glm(formula = Outcome~Pregnancies+Glucose+BloodPressure+BMI+Insulin+DiabetesPedigreeFunction,data = diabetes_train,family = "binomial")

LR_diabetes_model_backward <- step(object = LR_diabetes_model_all,direction = "backward",trace = F)

After fitting several models, let’s compare them.

compare_performance(LR_diabetes_model_all,LR_diabetes_model_backward,LR_diabetes_model_selected)
## # Comparison of Model Performance Indices
## 
## Model                      | Type |    AIC |    BIC | Tjur's R2 | RMSE | Sigma | Score_log | Score_spherical |  PCP |     BF |     p
## ------------------------------------------------------------------------------------------------------------------------------------
## LR_diabetes_model_all      |  glm | 594.21 | 633.99 |      0.34 | 1.07 |  0.98 |      -Inf |        1.71e-03 | 0.70 |   1.00 |      
## LR_diabetes_model_backward |  glm | 590.48 | 617.00 |      0.34 | 1.07 |  0.98 |      -Inf |        1.69e-03 | 0.70 | > 1000 | 0.517
## LR_diabetes_model_selected |  glm | 591.88 | 622.82 |      0.34 | 1.07 |  0.98 |      -Inf |        1.68e-03 | 0.70 | 266.04 | 0.437

To choose the best model, we can look at the AIC. AIC estimates the relative amount of information a model loses and penalizes extra parameters. The smaller the AIC, the less information is lost, so among models fitted to the same data the one with the lowest AIC is preferred. From the result above, LR_diabetes_model_backward has the lowest AIC (590.48), so we will use it as our Logistic Regression model.
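To make the AIC figures less opaque, here is a small check of where they come from. AIC is 2k minus twice the log-likelihood, where k is the number of estimated coefficients; for a binary (0/1) response the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus 2k. The helper name `aic_from_deviance` is hypothetical, and the deviances are copied from the two model summaries in this article.

```r
# AIC = residual deviance + 2 * (number of estimated coefficients).
# This shortcut holds here because the response is binary (0/1),
# so the residual deviance equals -2 * log-likelihood.
aic_from_deviance <- function(residual_deviance, n_coefficients) {
  residual_deviance + 2 * n_coefficients
}

# Full model: intercept + 8 predictors = 9 coefficients.
aic_all <- aic_from_deviance(576.21, 9)
# Backward model: intercept + 5 predictors = 6 coefficients.
aic_backward <- aic_from_deviance(578.48, 6)

print(c(all = aic_all, backward = aic_backward))  # 594.21 and 590.48
```

The backward model fits slightly worse (higher deviance) but uses three fewer coefficients, which is why its AIC is lower.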

summary(LR_diabetes_model_backward)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction, family = "binomial", data = diabetes_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7603  -0.7365  -0.4207   0.7161   2.8613  
## 
## Coefficients:
##                           Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)              -7.860400   0.737872 -10.653 < 0.0000000000000002 ***
## Pregnancies               0.159326   0.031145   5.116          0.000000313 ***
## Glucose                   0.035964   0.003791   9.487 < 0.0000000000000002 ***
## BloodPressure            -0.013872   0.005618  -2.469               0.0135 *  
## BMI                       0.081424   0.015998   5.090          0.000000359 ***
## DiabetesPedigreeFunction  0.819290   0.336800   2.433               0.0150 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 801.19  on 613  degrees of freedom
## Residual deviance: 578.48  on 608  degrees of freedom
## AIC: 590.48
## 
## Number of Fisher Scoring iterations: 5

The summary of LR_diabetes_model_backward above contains a lot of information, but we focus on Pr(>|z|) and AIC. Pr(>|z|) shows which predictors have a statistically significant association with the target: if the value is below 0.05 (alpha), we assume that the variable has a significant effect in the model. The smaller the Pr(>|z|) value, the stronger the evidence for the effect of that predictor, and the star symbols make this easier to read: more stars mean a smaller p-value.
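Significance says whether a predictor matters, but not by how much. A common way to read the size of a logistic regression coefficient is to exponentiate it into an odds ratio. The sketch below copies the estimates from the summary above by hand; the vector names `log_odds` and `odds_ratio` are illustrative only.

```r
# Coefficients copied from summary(LR_diabetes_model_backward).
log_odds <- c(Pregnancies              = 0.159326,
              Glucose                  = 0.035964,
              BloodPressure            = -0.013872,
              BMI                      = 0.081424,
              DiabetesPedigreeFunction = 0.819290)

# exp() turns a log-odds coefficient into an odds ratio
# for a one-unit increase in that predictor.
odds_ratio <- exp(log_odds)
print(round(odds_ratio, 3))
```

For example, the Glucose odds ratio is about 1.037, so each one-unit increase in Glucose multiplies the odds of a Diabetes outcome by roughly 1.037, holding the other predictors fixed. An odds ratio below 1, as for BloodPressure, means the odds decrease.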

Predicting

After choosing the best model for our dataset, we test its performance on the test set that we split off earlier.

diabetes_test_LR <- diabetes_test

diabetes_test_LR$pred_model_backward <- predict(object = LR_diabetes_model_backward,newdata = diabetes_test_LR,type = "response")
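# Use the same level order as Outcome so that confusionMatrix() does not have to reorder the levels.
diabetes_test_LR$label_model_backward <- factor(ifelse(diabetes_test_LR$pred_model_backward > .5, "Diabetes", "Not Diabetes"),
                                                levels = levels(diabetes_test_LR$Outcome))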
ggplot(diabetes_test_LR, aes(x=pred_model_backward)) +
  geom_density(lwd=0.5) +
  labs(title = "Distribution of Probability Prediction Data",x="Diabetes Probability",y="Density") +
  theme_minimal()

From the density plot of predicted probabilities above, it can be seen that most test observations receive a low probability of diabetes, so the majority are predicted as negative.

Evaluation

From the predictions on the test set above, we can evaluate our model using a confusion matrix. The metrics usually used for model evaluation are accuracy, specificity, sensitivity, and precision. In this case we focus on sensitivity: the ratio of positive observations that are predicted to be positive (True Positive) to the total number of observations that are actually positive (True Positive + False Negative).

confusionMatrix_LR <- confusionMatrix(data = diabetes_test_LR$label_model_backward,reference = diabetes_test_LR$Outcome,positive = "Diabetes")
confusionMatrix_LR
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Not Diabetes Diabetes
##   Not Diabetes           92       29
##   Diabetes               14       19
##                                         
##                Accuracy : 0.7208        
##                  95% CI : (0.6429, 0.79)
##     No Information Rate : 0.6883        
##     P-Value [Acc > NIR] : 0.21814       
##                                         
##                   Kappa : 0.2884        
##                                         
##  Mcnemar's Test P-Value : 0.03276       
##                                         
##             Sensitivity : 0.3958        
##             Specificity : 0.8679        
##          Pos Pred Value : 0.5758        
##          Neg Pred Value : 0.7603        
##              Prevalence : 0.3117        
##          Detection Rate : 0.1234        
##    Detection Prevalence : 0.2143        
##       Balanced Accuracy : 0.6319        
##                                         
##        'Positive' Class : Diabetes      
## 
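
The 0.5 cutoff used above is a choice, not a requirement. Because the goal is high sensitivity, one option (not applied in this analysis) is a lower cutoff, which labels more observations as positive and so trades specificity for sensitivity. The sketch below illustrates that trade-off on simulated labels and probabilities; all values and names in it are assumptions and none come from the diabetes data.

```r
set.seed(23)

# Simulated stand-ins (assumptions): true labels and predicted probabilities.
n <- 200
actual <- rbinom(n, size = 1, prob = 0.35)
# Positives tend to get higher scores, but the two groups overlap.
prob <- plogis(rnorm(n, mean = ifelse(actual == 1, 0.3, -1.2), sd = 1))

sensitivity_at <- function(threshold) {
  predicted <- as.integer(prob > threshold)
  sum(predicted == 1 & actual == 1) / sum(actual == 1)
}
specificity_at <- function(threshold) {
  predicted <- as.integer(prob > threshold)
  sum(predicted == 0 & actual == 0) / sum(actual == 0)
}

thresholds <- c(0.5, 0.4, 0.3)
threshold_table <- data.frame(
  threshold   = thresholds,
  sensitivity = sapply(thresholds, sensitivity_at),
  specificity = sapply(thresholds, specificity_at)
)
print(threshold_table)
```

Sensitivity can only stay the same or rise as the cutoff drops, while specificity can only stay the same or fall, so the right cutoff depends on how costly a missed case is compared with a false alarm.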

Assumption test

Assumptions are conditions that should be met before we draw inferences from the model estimates. A Logistic Regression model needs a multicollinearity check to make sure the resulting estimates are not misleading, since strongly correlated predictors make coefficient estimates unstable and inflate their standard errors. Multicollinearity occurs when the independent variables are too highly correlated with each other.

Multicollinearity will be tested with the Variance Inflation Factor (VIF). The VIF of a predictor is defined as VIF = 1/Tolerance, where Tolerance = 1 - R^2 and R^2 comes from regressing that predictor on all the other predictors. A VIF above 10 is a common rule of thumb for multicollinearity among the variables.

vif(LR_diabetes_model_backward)
##              Pregnancies                  Glucose            BloodPressure 
##                 1.058501                 1.022759                 1.124327 
##                      BMI DiabetesPedigreeFunction 
##                 1.114673                 1.018145

From the result above, every predictor used in the model has a VIF far below 10 (all are close to 1), so there is no sign of problematic multicollinearity among them.
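To show what the VIF definition computes, the sketch below works it out by hand: each predictor is regressed on the others and 1 / (1 - R^2) is returned. It uses the built-in mtcars data purely as a stand-in, and `manual_vif` is a hypothetical helper. For a glm, `car::vif` works from the fitted model itself, so its values need not match this linear-regression version exactly.

```r
# VIF_j = 1 / (1 - R2_j), where R2_j comes from regressing predictor j
# on all the other predictors.
manual_vif <- function(predictors) {
  sapply(names(predictors), function(name) {
    others <- setdiff(names(predictors), name)
    fit <- lm(reformulate(others, response = name), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Built-in mtcars data used only as a stand-in;
# these are not the diabetes predictors.
vif_values <- manual_vif(mtcars[, c("disp", "hp", "wt", "drat")])
print(round(vif_values, 2))
```

A value of 1 means a predictor is uncorrelated with the rest; the larger the value, the more of its variation is already explained by the other predictors.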

K-Nearest Neighbors

The principle of K-Nearest Neighbors (KNN) is to find the k training observations that are closest to the observation being classified and to assign it the majority class among those neighbors. Here k is the number of nearest neighbors considered.
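The idea can be sketched in a few lines of base R. This is a simplified illustration with Euclidean distance and a plain majority vote, not the implementation behind `class::knn`; the function `knn_one` and the toy data are hypothetical.

```r
# Classify one query point by majority vote among its k nearest training rows.
knn_one <- function(train_x, train_y, query, k) {
  # Euclidean distance from the query to every training row.
  distances <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical, already-scaled toy data with two predictors.
train_x <- rbind(c(-1.0, -0.8), c(-0.9, -1.1), c(-1.2, -0.5),
                 c( 1.0,  0.9), c( 0.8,  1.2), c( 1.1,  0.7))
train_y <- factor(c("Not Diabetes", "Not Diabetes", "Not Diabetes",
                    "Diabetes", "Diabetes", "Diabetes"))

predicted_label <- knn_one(train_x, train_y, query = c(0.9, 1.0), k = 3)
print(predicted_label)  # the three nearest rows are all "Diabetes"
```

Because the prediction depends directly on distances, a predictor measured on a large scale would dominate the vote, which is why the scaling step matters.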

Pre-Processing

Because KNN relies on distances, the numerical predictors need to be on the same or nearly the same scale. If the scales differ greatly, the variables with the largest ranges dominate the distance, so we need to equalize the scales to keep the results balanced and unbiased.

summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##          Outcome   
##  Not Diabetes:500  
##  Diabetes    :268  
##                    
##                    
##                    
## 

From the summary above, the variables are on very different scales (for example, Insulin ranges from 0 to 846 while DiabetesPedigreeFunction ranges from 0.078 to 2.42), so it is necessary to scale each variable using the mean and standard deviation computed from the train dataset.

#Predictors
diabetes_train_x <- diabetes_train %>% select(-Outcome)
diabetes_test_x <- diabetes_test %>% select(-Outcome)

#Target
diabetes_train_y <- diabetes_train %>% select(Outcome)
diabetes_test_y <- diabetes_test %>% select(Outcome)

To keep the train and test datasets comparable after scaling, both must be scaled with the same parameters. This means the test dataset is scaled with the mean and standard deviation of each predictor from the train dataset, not with its own.

diabetes_train_x_scaled <- scale(x = diabetes_train_x)
diabetes_test_x_scaled <- scale(x = diabetes_test_x,
                                center = attr(diabetes_train_x_scaled,"scaled:center"),
                                scale = attr(diabetes_train_x_scaled,"scaled:scale"))
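
To see what this does for a single predictor, here is a small worked sketch with hypothetical toy readings (not taken from the dataset). The mean and standard deviation are estimated on the training values only and then reused for the test values.

```r
# Hypothetical toy glucose readings, not taken from the dataset.
train_glucose <- c(85, 100, 120, 140, 155)
test_glucose  <- c(90, 150)

# Parameters are estimated on the training values only.
train_mean <- mean(train_glucose)
train_sd   <- sd(train_glucose)

train_scaled <- (train_glucose - train_mean) / train_sd
# The test values reuse the training parameters.
test_scaled  <- (test_glucose - train_mean) / train_sd

# scale() with explicit center and scale arguments gives the same test values.
test_scaled_via_scale <- as.numeric(
  scale(test_glucose, center = train_mean, scale = train_sd)
)
print(test_scaled)
```

Scaling the test set with its own mean and standard deviation would place the same raw reading at different scaled positions in train and test, which distorts the distances KNN relies on.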

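Then, to choose k, a common rule of thumb is the square root of the number of rows in the train dataset. With 614 training rows this gives about 24.8, and the prediction step below uses the nearby odd value k = 23. An odd k avoids tied votes between the two classes.

k <- sqrt(nrow(diabetes_train))
# k~23 is an R formula that prints as written; it notes the chosen value and does not round k.
k~23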
## k ~ 23

Predicting

Classification with k-NN does not build a model in the usual sense: no parameters are learned from the data, and the training observations themselves are used at prediction time. With all the predictors scaled and the k value chosen, predictions are made using the arguments below.

diabetes_test_KNN <- diabetes_test
diabetes_test_KNN$label_predicted_KNN <- knn(train = diabetes_train_x_scaled,test = diabetes_test_x_scaled,cl = diabetes_train$Outcome,k = 23)

Evaluation

The predictions above are then compared with the actual labels in the test dataset for evaluation. As with the Logistic Regression above, the metric to focus on is sensitivity.

confusionMatrix_KNN <- confusionMatrix(data = diabetes_test_KNN$label_predicted_KNN,
                reference = diabetes_test_KNN$Outcome,positive = "Diabetes")
confusionMatrix_KNN
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Not Diabetes Diabetes
##   Not Diabetes           98       27
##   Diabetes                8       21
##                                           
##                Accuracy : 0.7727          
##                  95% CI : (0.6984, 0.8363)
##     No Information Rate : 0.6883          
##     P-Value [Acc > NIR] : 0.013084        
##                                           
##                   Kappa : 0.406           
##                                           
##  Mcnemar's Test P-Value : 0.002346        
##                                           
##             Sensitivity : 0.4375          
##             Specificity : 0.9245          
##          Pos Pred Value : 0.7241          
##          Neg Pred Value : 0.7840          
##              Prevalence : 0.3117          
##          Detection Rate : 0.1364          
##    Detection Prevalence : 0.1883          
##       Balanced Accuracy : 0.6810          
##                                           
##        'Positive' Class : Diabetes        
## 
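
As a cross-check, the main metrics of both confusion matrices can be recomputed by hand from their four counts. The helper `metrics_from_counts` is a hypothetical name; the counts are copied from the two confusion matrices in this article, with Diabetes as the positive class.

```r
# Recompute the headline metrics from the four confusion-matrix counts.
metrics_from_counts <- function(tp, fn, fp, tn) {
  c(accuracy    = (tp + tn) / (tp + fn + fp + tn),
    sensitivity = tp / (tp + fn),
    specificity = tn / (tn + fp),
    precision   = tp / (tp + fp))
}

# Counts copied from the Logistic Regression and KNN confusion matrices.
lr_metrics  <- metrics_from_counts(tp = 19, fn = 29, fp = 14, tn = 92)
knn_metrics <- metrics_from_counts(tp = 21, fn = 27, fp = 8,  tn = 98)

print(round(rbind(logistic_regression = lr_metrics, knn = knn_metrics), 4))
```

Precision here corresponds to the Pos Pred Value line in the caret output. Both models miss more than half of the 48 actual diabetes cases in the test set, which is worth keeping in mind when reading the comparison.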

Conclusion

From the evaluation of the two models above, the sensitivity value of each method is collected below.

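# Pull the sensitivity of each model by name and keep the column labels readable.
comparison_sensitivity <- data.frame("Sensitivity Logistic Regression" = confusionMatrix_LR$byClass[["Sensitivity"]],
           "Sensitivity K-Nearest Neighbors" = confusionMatrix_KNN$byClass[["Sensitivity"]],
           check.names = FALSE)
comparison_sensitivity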

From the comparison above, KNN has a higher sensitivity (0.4375) than Logistic Regression (0.3958) on this test set, which means it captures slightly more of the actual diabetes cases (21 of 48 against 19 of 48). It also has the higher accuracy (0.7727 against 0.7208), so KNN performs better at classification here. However, KNN has a weakness: it does not tell us which parameters/predictors have a strong influence in predicting the target.

Therefore, if we also care about how much each predictor influences the outcome, it is better to use Logistic Regression. But if we care mainly about the prediction results and not about the influence of each predictor, KNN is a good choice.