Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Context
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Content
The datasets consists of several medical predictor variables and one target variable, Outcome. Predictor variables includes the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Acknowledgements
Smith, J.W., Everhart, J.E., Dickson, W.C., Knowler, W.C., & Johannes, R.S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care (pp. 261–265). IEEE Computer Society Press.
Load the Data Requirement
library(tidyverse)
library(caret)
library(corrplot)
library(class)
library(GGally)
library(rsample)
library(performance)
Read data
#lets import the data file and have some look
diab <- read.csv("diabetes2.csv")
head(diab,3)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
from this database we would like to try a comparison models between k-NN(Neareast Neighbour) and Logistic Regression
firstly, check the data by using glimpse & summary
glimpse(diab)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Data Pre-Processing
Before we make our own data analysis, make sure we already explore and clean the data, computer might not recognized by sentiment lexicons because it is not written properly.
diab <- diab %>%
mutate(Outcome = as.factor(Outcome))
class(diab$Outcome)
## [1] "factor"
Missing Values
Any not available values
Make sure our data has not any not applicable or not available values.
cat(
any(is.na(diab)),
any(is.null(diab))
)
## FALSE FALSE
NA or not NA
make sure our data no more missing value
colSums(is.na(diab))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
then we can process to the next step
EDA (Exploratory data analysis)
#lets check for incorrect values / outliers
ggplot(gather(diab[, -9]), aes(value)) + geom_boxplot() + facet_wrap(key ~
., scales = "free_x")
# Call Library
library(GGally)
# Call Function
ggcorr(
diab,
label = T,
label_size = 2.9,
hjust = 0.9,
layout.exp = 2
)
ggplot(diab,aes(x=DiabetesPedigreeFunction,y=Insulin)) + geom_smooth() + geom_point(aes(color=Outcome))
Logistic regression models (GLM)
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Sometimes logistic regressions are difficult to interpret; the Intellectus Statistics tool easily allows you to conduct the analysis, then in plain English interprets the output.
TEST TRAIN & TEST
RNGkind(sample.kind = "Rounding")
set.seed(777)
index <- initial_split(diab, prop = 0.8, strata = "Outcome")
diab_test <- testing(index)
diab_train <- training(index)
Check the proportion of our data frame from train and test
we should make sure our data is not imbalance, therefore we need to check the proportion of our data that we made from train and test
# re-check class imbalance
prop.table(table(diab_train$Outcome))
##
## 0 1
## 0.6514658 0.3485342
prop.table(table(diab_test$Outcome))
##
## 0 1
## 0.6493506 0.3506494
Model fitting
now we try make our variables as a predictor expect Outcome, because we will predict our Outcome variables B or M
model_diab <- glm(Outcome ~ . , diab_train, family = "binomial")
summary(model_diab) # Check Coefficiens and Pvalue
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4346 -0.7416 -0.4196 0.7530 2.4646
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.9581359 0.7671525 -10.374 < 2e-16 ***
## Pregnancies 0.1409855 0.0353711 3.986 6.72e-05 ***
## Glucose 0.0349569 0.0041426 8.438 < 2e-16 ***
## BloodPressure -0.0132597 0.0056436 -2.349 0.0188 *
## SkinThickness -0.0012581 0.0075366 -0.167 0.8674
## Insulin -0.0013918 0.0009881 -1.409 0.1590
## BMI 0.0858112 0.0165734 5.178 2.25e-07 ***
## DiabetesPedigreeFunction 0.5914459 0.3255888 1.817 0.0693 .
## Age 0.0112643 0.0100991 1.115 0.2647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 793.94 on 613 degrees of freedom
## Residual deviance: 585.92 on 605 degrees of freedom
## AIC: 603.92
##
## Number of Fisher Scoring iterations: 5
model_null <- glm(Outcome ~ 1, diab_train, family = "binomial")
Stepwise directions
model_back <- stats::step(object = model_diab, direction = "backward")
## Start: AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 585.95 601.95
## - Age 1 587.16 603.16
## - Insulin 1 587.90 603.90
## <none> 585.92 603.92
## - DiabetesPedigreeFunction 1 589.29 605.29
## - BloodPressure 1 591.54 607.54
## - Pregnancies 1 602.50 618.50
## - BMI 1 616.11 632.11
## - Glucose 1 675.39 691.39
##
## Step: AIC=601.95
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - Age 1 587.22 601.22
## <none> 585.95 601.95
## - Insulin 1 588.58 602.58
## - DiabetesPedigreeFunction 1 589.30 603.30
## - BloodPressure 1 591.90 605.90
## - Pregnancies 1 602.51 616.51
## - BMI 1 618.67 632.67
## - Glucose 1 677.17 691.17
##
## Step: AIC=601.22
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction
##
## Df Deviance AIC
## <none> 587.22 601.22
## - Insulin 1 590.29 602.29
## - DiabetesPedigreeFunction 1 590.82 602.82
## - BloodPressure 1 592.43 604.43
## - Pregnancies 1 614.74 626.74
## - BMI 1 619.12 631.12
## - Glucose 1 689.41 701.41
model_forward <- stats::step(model_diab,
scope = list(lower = model_null,
upper = model_diab),
direction = "forward")
## Start: AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
model_both <- stats::step(model_diab,
scope = list(lower = model_null,
upper = model_diab),
direction = "both")
## Start: AIC=603.92
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 585.95 601.95
## - Age 1 587.16 603.16
## - Insulin 1 587.90 603.90
## <none> 585.92 603.92
## - DiabetesPedigreeFunction 1 589.29 605.29
## - BloodPressure 1 591.54 607.54
## - Pregnancies 1 602.50 618.50
## - BMI 1 616.11 632.11
## - Glucose 1 675.39 691.39
##
## Step: AIC=601.95
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - Age 1 587.22 601.22
## <none> 585.95 601.95
## - Insulin 1 588.58 602.58
## - DiabetesPedigreeFunction 1 589.30 603.30
## + SkinThickness 1 585.92 603.92
## - BloodPressure 1 591.90 605.90
## - Pregnancies 1 602.51 616.51
## - BMI 1 618.67 632.67
## - Glucose 1 677.17 691.17
##
## Step: AIC=601.22
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction
##
## Df Deviance AIC
## <none> 587.22 601.22
## + Age 1 585.95 601.95
## - Insulin 1 590.29 602.29
## - DiabetesPedigreeFunction 1 590.82 602.82
## + SkinThickness 1 587.16 603.16
## - BloodPressure 1 592.43 604.43
## - Pregnancies 1 614.74 626.74
## - BMI 1 619.12 631.12
## - Glucose 1 689.41 701.41
Comparison Models from our directions model
compare_performance(model_back, model_forward, model_both,model_diab,metrics = c("RMSE","R2"))
## # Comparison of Model Performance Indices
##
## Name | Model | Tjur's R2 | RMSE
## -----------------------------------------
## model_back | glm | 0.309 | 0.397
## model_forward | glm | 0.310 | 0.397
## model_both | glm | 0.309 | 0.397
## model_diab | glm | 0.310 | 0.397
Model fitting after stepwise
model_diab_back <- glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
BMI + DiabetesPedigreeFunction + Age, family = "binomial",
data = diab)
summary(model_diab_back)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction + Age, family = "binomial",
## data = diab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7380 -0.7313 -0.4123 0.7276 2.8984
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.239812 0.701970 -11.738 < 2e-16 ***
## Pregnancies 0.124919 0.031972 3.907 9.34e-05 ***
## Glucose 0.033492 0.003440 9.736 < 2e-16 ***
## BloodPressure -0.013487 0.005114 -2.637 0.00836 **
## BMI 0.087676 0.014268 6.145 7.99e-10 ***
## DiabetesPedigreeFunction 0.896150 0.294862 3.039 0.00237 **
## Age 0.016325 0.009237 1.767 0.07719 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 725.46 on 761 degrees of freedom
## AIC: 739.46
##
## Number of Fisher Scoring iterations: 5
now we make create prediction outcome from our models
diab_test$pred.Outcome <- predict(
object = model_diab_back,
newdata = diab_test %>% select(
Pregnancies,
Glucose,
BloodPressure,
Insulin,
BMI,
DiabetesPedigreeFunction,
Age
),
type = "response"
)
summary(diab_test$pred.Outcome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.002298 0.129029 0.286313 0.359490 0.608122 0.955191
The distribution of probability
ggplot(diab_test, aes(x=pred.Outcome)) +
geom_density(lwd=0.5) +
labs(title = "Distribution of Probability Prediction Data") +
theme_minimal()
now substantiate the prediction outcome : - if the value counts under 0.5 that means 0, otherwise above 0.5 is legit as 1 0 : means no diabetes 1 : means yes diabetes
diab_test$pred.label <- ifelse(diab_test$pred.Outcome > 0.5, 1, 0)
make sure our target predictors as factor or a category (no diabetes or yes diabetes)
diab_test$pred.label <- as.factor(diab_test$pred.label)
double confirm our prediction data test that we have already put together
diab_test %>%
select(Outcome, pred.Outcome, pred.label) %>%
head(10)
## Outcome pred.Outcome pred.label
## 11 0 0.20762991 0
## 20 1 0.23553000 0
## 23 1 0.92745355 1
## 29 0 0.55750015 1
## 32 1 0.59675406 1
## 45 0 0.60239713 1
## 62 1 0.50152164 1
## 67 1 0.17653544 0
## 73 1 0.80274594 1
## 75 0 0.04922499 0
The result of our logistic regression Model
result_glm <- confusionMatrix(data = diab_test$pred.label,
reference = diab_test$Outcome,
positive = "1")
result_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 15
## 1 14 39
##
## Accuracy : 0.8117
## 95% CI : (0.7409, 0.8701)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 7.279e-06
##
## Kappa : 0.5847
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7222
## Specificity : 0.8600
## Pos Pred Value : 0.7358
## Neg Pred Value : 0.8515
## Prevalence : 0.3506
## Detection Rate : 0.2532
## Detection Prevalence : 0.3442
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : 1
##
kNN (k-Nearest Neighbour) model
The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near to each other.
Data Manipulation
# all prediktor variables
diab_test_x <- diab_test %>% select(-Outcome, -pred.Outcome, -pred.label)
diab_train_x <- diab_train %>% select(-Outcome)
#target
diab_test_y <- diab_test %>% select(Outcome)
diab_train_y <- diab_train %>% select(Outcome)
data predict will be scaled using z-score standarization. In addition, data test will be scaled as well using data train’s parameters (otherwise data test becomes unseen data)
Scaling Data
#Min-Max Normalization
normalize <- function(x){
return (
(x - min(x))/(max(x) - min(x))
)
}
# scalling data prediktor
diab_test_xs <- normalize(diab_test_x)
diab_train_xs <- normalize(diab_train_x)
the benefit of using k-nn is : - the model doesnt need an model
Find optium k
sqrt(nrow(diab))
## [1] 27.71281
- our target we choose : 29 (Rounding)
- k : 29
diab_pred <- knn(train = diab_train_xs,
test = diab_test_xs,
cl = diab_train_y$Outcome,
k = 29)
The result of KNN (k-Nearest Neighbour)
result_knn <- confusionMatrix(data = diab_pred,
reference = diab_test_y$Outcome,
positive = "1")
result_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 3
## 1 65 51
##
## Accuracy : 0.5584
## 95% CI : (0.4763, 0.6383)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.9921
##
## Kappa : 0.2329
##
## Mcnemar's Test P-Value : 1.389e-13
##
## Sensitivity : 0.9444
## Specificity : 0.3500
## Pos Pred Value : 0.4397
## Neg Pred Value : 0.9211
## Prevalence : 0.3506
## Detection Rate : 0.3312
## Detection Prevalence : 0.7532
## Balanced Accuracy : 0.6472
##
## 'Positive' Class : 1
##
Comparison Logistic & k-NN models
Logistic regression
result_glm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 86 15
## 1 14 39
##
## Accuracy : 0.8117
## 95% CI : (0.7409, 0.8701)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 7.279e-06
##
## Kappa : 0.5847
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7222
## Specificity : 0.8600
## Pos Pred Value : 0.7358
## Neg Pred Value : 0.8515
## Prevalence : 0.3506
## Detection Rate : 0.2532
## Detection Prevalence : 0.3442
## Balanced Accuracy : 0.7911
##
## 'Positive' Class : 1
##
KNN
result_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 35 3
## 1 65 51
##
## Accuracy : 0.5584
## 95% CI : (0.4763, 0.6383)
## No Information Rate : 0.6494
## P-Value [Acc > NIR] : 0.9921
##
## Kappa : 0.2329
##
## Mcnemar's Test P-Value : 1.389e-13
##
## Sensitivity : 0.9444
## Specificity : 0.3500
## Pos Pred Value : 0.4397
## Neg Pred Value : 0.9211
## Prevalence : 0.3506
## Detection Rate : 0.3312
## Detection Prevalence : 0.7532
## Balanced Accuracy : 0.6472
##
## 'Positive' Class : 1
##
Summary
in conclusion from models that we already created. It depends on what the doctor want to achieve, for instance, logistic model has better accuracy 81% (0.811) and if we want to maximize both the TP (True Positive) and TN (True Negative). if we spent some money to target those who are predicted as positive (diabetes) and we want our investment to be worthy, we should prioritize model with higher precision value which is logistic models. Overall, Logistic model is better than KNN.
KNN models are good only for Recalls 94% (0.944) (because it has the higest Sensivity value), and willingly to spent more money for recall