The dataset used is the Pima Indians Diabetes dataset from the UCI Machine Learning Repository (originally from the National Institute of Diabetes and Digestive and Kidney Diseases). It contains 8 medical diagnostic attributes and one target variable (Outcome) for 768 female patients, 34.9% of whom (268 patients) have diabetes. The variance of Insulin is quite high in both classes. The dataset is used to predict whether a person with given medical diagnostic attributes is likely to have diabetes.
# Load libraries
library(dplyr)
library(gtools)
library(ggplot2)
library(class)
library(tidyr)
library(caret)

# Import CSV data into a data frame
diabetes <- read.csv("diabetes.csv")

# Inspect the data
glimpse(diabetes)
#> Rows: 768
#> Columns: 9
#> $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
#> $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
#> $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
#> $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
#> $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
#> $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
#> $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
#> $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
From a glimpse, some values don't make sense, for example a BMI of 0; these zeros are placeholders for missing measurements.
# Fill the zero placeholders with the mean of each corresponding column
diabetes_clean <- diabetes %>%
  mutate(Outcome = as.factor(Outcome),
         Glucose = ifelse(Glucose == 0, mean(Glucose), Glucose),
         BMI = ifelse(BMI == 0, mean(BMI), BMI),
         BloodPressure = ifelse(BloodPressure == 0, mean(BloodPressure), BloodPressure),
         SkinThickness = ifelse(SkinThickness == 0, mean(SkinThickness), SkinThickness),
         Insulin = ifelse(Insulin == 0, mean(Insulin), Insulin))
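One caveat: mean() inside mutate() is computed over the full column, zeros included, so the placeholders themselves drag the imputed value down (which is why 79.8 shows up as both the first quartile and the median of Insulin in the summary further below). A minimal sketch of an alternative that imputes with the mean of the non-zero entries only, assuming dplyr >= 1.0 for across(); impute_nonzero_mean and diabetes_alt are illustrative names:

# Alternative sketch: impute with the mean of the non-zero entries only
impute_nonzero_mean <- function(x) ifelse(x == 0, mean(x[x != 0]), x)
diabetes_alt <- diabetes %>%
  mutate(across(c(Glucose, BMI, BloodPressure, SkinThickness, Insulin),
                impute_nonzero_mean))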
diabetes_clean

# Check the target variable proportion
prop.table(table(diabetes_clean$Outcome))
#>
#> 0 1
#> 0.6510417 0.3489583
0 = not diabetic (65%), 1 = diabetic (35%). The class proportions are reasonably balanced (not severely skewed), so we can proceed without resampling.
# Check for missing values
colSums(is.na(diabetes_clean))
#> Pregnancies Glucose BloodPressure
#> 0 0 0
#> SkinThickness Insulin BMI
#> 0 0 0
#> DiabetesPedigreeFunction Age Outcome
#> 0 0 0
# Check the five-number summary (plus the mean) of each variable
summary(diabetes_clean)
#> Pregnancies Glucose BloodPressure SkinThickness
#> Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
#> 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:20.54
#> Median : 3.000 Median :117.00 Median : 72.00 Median :23.00
#> Mean : 3.845 Mean :121.68 Mean : 72.25 Mean :26.61
#> 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
#> Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
#> Insulin BMI DiabetesPedigreeFunction Age
#> Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
#> 1st Qu.: 79.8 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
#> Median : 79.8 Median :32.00 Median :0.3725 Median :29.00
#> Mean :118.7 Mean :32.45 Mean :0.4719 Mean :33.24
#> 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
#> Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
#> Outcome
#> 0:500
#> 1:268
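The effect of the imputation is visible above: 79.8, the imputed Insulin value, appears as both the first quartile and the median of that column. A quick count of how many zero placeholders each column contained in the raw data puts this in context (zeros in Pregnancies and Outcome are legitimate values, not placeholders):

# Count the zeros per column in the raw data frame
colSums(diabetes == 0)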
From this summary, the data looks clean, so we can split it into train and test sets with a 75%-25% ratio.
# Split the data: 75% of observations for training, the remaining 25% for testing
RNGkind(sample.kind = "Rounding")
set.seed(1234)
index <- sample(x = nrow(diabetes_clean),
                size = nrow(diabetes_clean) * 0.75)

# splitting
diabetes_train <- diabetes_clean[index, ]
diabetes_test <- diabetes_clean[-index, ]
diabetes_test_knn <- diabetes_clean[-index, ]

After cleaning and splitting the data, we can go ahead and build the logistic regression model.
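One aside on the split first: sample() above draws a simple random sample, so the 65%-35% class ratio is not guaranteed to carry over into both partitions. If stratification mattered, caret (already loaded) provides createDataPartition(); a minimal sketch, with index_strat, train_strat, and test_strat as illustrative names:

# Stratified 75/25 split that preserves the Outcome class proportions
set.seed(1234)
index_strat <- createDataPartition(diabetes_clean$Outcome, p = 0.75, list = FALSE)
train_strat <- diabetes_clean[index_strat, ]
test_strat <- diabetes_clean[-index_strat, ]
prop.table(table(train_strat$Outcome)) # should stay close to 0.65 / 0.35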
# Build a model with all predictors
model_all <- glm(Outcome ~ .,
                 data = diabetes_train,
                 family = "binomial")
summary(model_all)
#>
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7268 -0.7270 -0.4063 0.7182 2.4307
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -9.0444268 0.9343913 -9.679 < 2e-16 ***
#> Pregnancies 0.1225614 0.0369163 3.320 0.000900 ***
#> Glucose 0.0377215 0.0045713 8.252 < 2e-16 ***
#> BloodPressure -0.0054068 0.0099837 -0.542 0.588117
#> SkinThickness -0.0049890 0.0128468 -0.388 0.697761
#> Insulin -0.0002465 0.0011973 -0.206 0.836900
#> BMI 0.0892791 0.0211122 4.229 2.35e-05 ***
#> DiabetesPedigreeFunction 1.1539762 0.3399108 3.395 0.000686 ***
#> Age 0.0089110 0.0107678 0.828 0.407917
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 753.47 on 575 degrees of freedom
#> Residual deviance: 541.73 on 567 degrees of freedom
#> AIC: 559.73
#>
#> Number of Fisher Scoring iterations: 5
# Feature selection with stepwise regression (direction = "both")
model_step <- step(object = model_all,
                   direction = "both",
                   trace = F)
summary(model_step)
#>
#> Call:
#> glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
#> family = "binomial", data = diabetes_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.7040 -0.7197 -0.4022 0.7389 2.4426
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -9.07217 0.81761 -11.096 < 2e-16 ***
#> Pregnancies 0.13557 0.03166 4.282 1.86e-05 ***
#> Glucose 0.03786 0.00410 9.234 < 2e-16 ***
#> BMI 0.08055 0.01749 4.605 4.12e-06 ***
#> DiabetesPedigreeFunction 1.14537 0.33640 3.405 0.000662 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 753.47 on 575 degrees of freedom
#> Residual deviance: 542.75 on 571 degrees of freedom
#> AIC: 552.75
#>
#> Number of Fisher Scoring iterations: 5
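The coefficients above are on the log-odds scale; exponentiating them turns them into odds ratios, which are easier to interpret. A quick look, with the interpretation derived from the summary above:

# Coefficients are log-odds; exponentiate to read them as odds ratios
exp(coef(model_step))
# e.g. exp(0.13557) is roughly 1.15: each additional pregnancy multiplies
# the odds of diabetes by about 1.15, holding the other predictors fixed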
# Predict the first 5 observations of the test data, on the log-odds scale
predict(object = model_step,
        newdata = diabetes_test[1:5, ],
        type = "link")
#> 3 4 5 6 17
#> 1.5874580 -3.1121854 2.2071801 -1.7101047 -0.2841825
# The same 5 predictions, as probabilities
predict(object = model_step,
        newdata = diabetes_test[1:5, ],
        type = "response")
#> 3 4 5 6 17
#> 0.83025815 0.04260741 0.90089243 0.15315013 0.42942867
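The two scales are related by the logistic (inverse-logit) function, probability = 1 / (1 + exp(-log_odds)), which base R exposes as plogis(). A quick sanity check:

# Convert the log-odds predictions to probabilities by hand
log_odds <- predict(model_step, newdata = diabetes_test[1:5, ], type = "link")
plogis(log_odds)          # identical to type = "response"
1 / (1 + exp(-log_odds))  # the same transformation written out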
# Add a prediction column to the test data
diabetes_test$prediction <- predict(object = model_step,
                                    newdata = diabetes_test,
                                    type = "response")

# Label as diabetic when the predicted probability exceeds 0.5
diabetes_test$diabetic <- ifelse(test = diabetes_test$prediction > 0.5,
                                 yes = "1",
                                 no = "0")
diabetes_test <- diabetes_test %>%
  mutate(diabetic = as.factor(diabetic))
diabetes_test

confusionMatrix(data = diabetes_test$diabetic,
                reference = diabetes_test$Outcome,
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 119 27
#> 1 13 33
#>
#> Accuracy : 0.7917
#> 95% CI : (0.7273, 0.8468)
#> No Information Rate : 0.6875
#> P-Value [Acc > NIR] : 0.0008541
#>
#> Kappa : 0.4822
#>
#> Mcnemar's Test P-Value : 0.0398326
#>
#> Sensitivity : 0.5500
#> Specificity : 0.9015
#> Pos Pred Value : 0.7174
#> Neg Pred Value : 0.8151
#> Prevalence : 0.3125
#> Detection Rate : 0.1719
#> Detection Prevalence : 0.2396
#> Balanced Accuracy : 0.7258
#>
#> 'Positive' Class : 1
#>
Because we are predicting whether a person is diabetic or not, one class is more important than the other, so we are going to evaluate with either Recall or Precision rather than Accuracy alone.
# Recall / Sensitivity: TP / (TP + FN)
33 / (33 + 27)
#> [1] 0.55

# Pos Pred Value / Precision: TP / (TP + FP)
33 / (33 + 13)
#> [1] 0.7173913
We are going to focus on reducing False Negatives, because it is dangerous when a diabetic patient is falsely diagnosed as healthy and goes untreated. So we will focus on the Recall metric.
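Why 0.35 in the next step? One simple way to pick a threshold is to sweep candidates and watch how Recall and Precision trade off; lowering the threshold raises Recall at the cost of Precision. A minimal sketch over the prediction column built above (the loop and names are illustrative):

# Sweep candidate thresholds and report Recall and Precision for each
for (th in seq(0.25, 0.60, by = 0.05)) {
  pred <- factor(ifelse(diabetes_test$prediction > th, "1", "0"), levels = c("0", "1"))
  cm <- table(Prediction = pred, Reference = diabetes_test$Outcome)
  recall <- cm["1", "1"] / sum(cm[, "1"])
  precision <- cm["1", "1"] / sum(cm["1", ])
  cat(sprintf("threshold %.2f recall %.3f precision %.3f\n", th, recall, precision))
}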
# Decrease the classification threshold to 0.35
diabetes_test$diabetic_tunning <- ifelse(test = diabetes_test$prediction > 0.35,
                                         yes = "1",
                                         no = "0")
diabetes_test <- diabetes_test %>%
  mutate(diabetic_tunning = as.factor(diabetic_tunning))
diabetes_test

confusionMatrix(data = diabetes_test$diabetic_tunning,
                reference = diabetes_test$Outcome,
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 100 14
#> 1 32 46
#>
#> Accuracy : 0.7604
#> 95% CI : (0.6937, 0.8189)
#> No Information Rate : 0.6875
#> P-Value [Acc > NIR] : 0.01614
#>
#> Kappa : 0.4846
#>
#> Mcnemar's Test P-Value : 0.01219
#>
#> Sensitivity : 0.7667
#> Specificity : 0.7576
#> Pos Pred Value : 0.5897
#> Neg Pred Value : 0.8772
#> Prevalence : 0.3125
#> Detection Rate : 0.2396
#> Detection Prevalence : 0.4062
#> Balanced Accuracy : 0.7621
#>
#> 'Positive' Class : 1
#>
After tuning, the model's Recall increases to about 77% (0.7667): it now catches 77% of the truly diabetic patients, at the cost of some Precision, while overall accuracy stays around 76%, based on the few medical variables used to diagnose diabetes.
# Diabetes train data: predictors
diabetes_train_predictor <- diabetes_train %>%
  select(-Outcome)

# Diabetes train data: target
diabetes_train_target <- diabetes_train %>%
  pull(Outcome)

# Diabetes test data: predictors
diabetes_test_predictor <- diabetes_test_knn %>%
  select(-Outcome)

# Diabetes test data: target
diabetes_test_target <- diabetes_test_knn %>%
  pull(Outcome)
# Scale the train predictors (z-score standardization)
diabetes_train_predictor_scale <- diabetes_train_predictor %>%
  scale()

# Scale the test predictors using the training center and scale
diabetes_test_predictor_scale <- diabetes_test_predictor %>%
  scale(center = attr(diabetes_train_predictor_scale, "scaled:center"),
        scale = attr(diabetes_train_predictor_scale, "scaled:scale"))
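The test predictors are deliberately standardized with the training set's center and scale rather than their own, so no information from the test set leaks into the preprocessing. A quick sanity check (train columns average exactly 0 by construction; test columns should only be close to 0):

# Train columns: mean 0 by construction; test columns: near 0, since they
# were standardized with the training parameters
round(colMeans(diabetes_train_predictor_scale), 3)
round(colMeans(diabetes_test_predictor_scale), 3)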
# Find the optimum k: a common heuristic is the square root of the training size
sqrt(nrow(diabetes_train_predictor))
#> [1] 24
diabetic_pred <- knn(train = diabetes_train_predictor_scale,
                     test = diabetes_test_predictor_scale,
                     cl = diabetes_train_target,
                     k = 24)

confusionMatrix(data = diabetic_pred,
                reference = diabetes_test_target,
                positive = "1")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 112 29
#> 1 20 31
#>
#> Accuracy : 0.7448
#> 95% CI : (0.677, 0.8048)
#> No Information Rate : 0.6875
#> P-Value [Acc > NIR] : 0.04913
#>
#> Kappa : 0.3807
#>
#> Mcnemar's Test P-Value : 0.25310
#>
#> Sensitivity : 0.5167
#> Specificity : 0.8485
#> Pos Pred Value : 0.6078
#> Neg Pred Value : 0.7943
#> Prevalence : 0.3125
#> Detection Rate : 0.1615
#> Detection Prevalence : 0.2656
#> Balanced Accuracy : 0.6826
#>
#> 'Positive' Class : 1
#>
With k = 24, we get around 74% Accuracy and a Recall/Sensitivity of about 52%.
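The sqrt(n) heuristic is only a starting point, and an odd k is often preferred for a binary target since it avoids tied votes. A minimal sketch that sweeps a few k values and reports test-set Accuracy and Recall (the loop is illustrative):

# Compare a few k values on the test set, using the scaled matrices from above
for (k in c(5, 9, 15, 21, 25, 31)) {
  pred <- knn(train = diabetes_train_predictor_scale,
              test = diabetes_test_predictor_scale,
              cl = diabetes_train_target,
              k = k)
  cm <- table(Prediction = pred, Reference = diabetes_test_target)
  cat(sprintf("k = %2d accuracy %.3f recall %.3f\n",
              k, sum(diag(cm)) / sum(cm), cm["1", "1"] / sum(cm[, "1"])))
}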
Comparing the metrics of both models: after threshold tuning, the Logistic Regression model reaches around 76% accuracy with a Recall/Sensitivity of 77%, while the K-NN model with k = 24 reaches 74% accuracy with a Recall/Sensitivity of only about 52%. Because we are focusing on the Recall metric, the Logistic Regression model is the better choice here. The K-NN model could still be improved, for example by oversampling the minority class or by collecting more observations, though that would take more time.