# clear-up the environment
rm(list = ls())
# chunk options
::opts_chunk$set(
knitrmessage = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
This article is made for completing the assignment for Algoritma : Machine Learning Course in Classification 1 Theia Batch 2022.
In this article, we will try to create model to predict if someone is diabetic or not based on many variables.
From CDC.gov, Diabetes is Diabetes is a chronic (long-lasting) health condition that affects how your body turns food into energy. Your body breaks down most of the food you eat into sugar (glucose) and releases it into your bloodstream. When your blood sugar goes up, it signals your pancreas to release insulin. Insulin acts like a key to let the blood sugar into your body’s cells for use as energy.
With diabetes, your body doesn’t make enough insulin or can’t use it as well as it should. When there isn’t enough insulin or cells stop responding to insulin, too much blood sugar stays in your bloodstream. Over time, that can cause serious health problems, such as heart disease, vision loss, and kidney disease.
There isn’t a cure yet for diabetes, but losing weight, eating healthy food, and being active can really help
Diabetes
There are three main types of diabetes: type 1, type 2, and gestational diabetes (diabetes while pregnant).
Type of Diabetes
The data was collected and made available by “National Institute of Diabetes and Digestive and Kidney Diseases” as part of the Pima Indians Diabetes Database. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here belong to the Pima Indian heritage (subgroup of Native Americans), and are females of ages 21 and above.
options(scipen = 99) #non active science annotation (example : e-1)
library(tidyverse) #load several important libraries
library(dplyr) #grammar of data manipulation
library(ggplot2) #plotting data visualization
library(scales) #edit scales
library(GGally) #plotting data visualization
#load dataset
<- read.csv("data-input/diabetes2.csv")
diabetes glimpse(diabetes)
#> Rows: 768
#> Columns: 9
#> $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
#> $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
#> $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
#> $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
#> $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
#> $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
#> $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
#> $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
We want to see how many unique values on each character columns.
<- c(length(unique(diabetes$Pregnancies)),
c length(unique(diabetes$Age)),
length(unique(diabetes$Outcome)))
c
#> [1] 17 52 2
Based on glimpse above, we change some column datatypes for easier analysis
<- diabetes %>%
diabetes mutate(Outcome = as.factor(Outcome))
Columns Description
Pregnancies
: Pregnancies Experienced, we assumed in
the dataset every individual is a WomanGlucose
: Level of GlucoseBloodPressure
: Blood Pressure LevelSkinThickness
: Skin Thicknes of IndividualInsulin
: Level of InsulinBMI
: Body Mass Index, Bodyweight Compared to
Height.DiabetesPedigreeFunction
: Likelihood of Diabetes based
on Family HistoryAge
: Age of IndividualOutcome
: The Result if individual is diabetes or
notCheck Missing Value
colSums(is.na(diabetes))
#> Pregnancies Glucose BloodPressure
#> 0 0 0
#> SkinThickness Insulin BMI
#> 0 0 0
#> DiabetesPedigreeFunction Age Outcome
#> 0 0 0
Check Summary of Data
summary(diabetes)
#> Pregnancies Glucose BloodPressure SkinThickness
#> Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
#> 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
#> Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
#> Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
#> 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
#> Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
#> Insulin BMI DiabetesPedigreeFunction Age
#> Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
#> 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
#> Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
#> Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
#> 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
#> Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
#> Outcome
#> 0:500
#> 1:268
#>
#>
#>
#>
We can see that the Outcome
is not balance.
Divide into two dataset, one for training and one for testing
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(rsample)
# Stratified Random Sampling
<- initial_split(data = diabetes, # data yang digunakan untuk splitting
index_n prop = 0.8, # proporsi data train
strata = Outcome) # variabel target
<- training(index_n) # fungsi training bawaan dari library rsample
diab_train <- testing(index_n) diab_test
Check Proportion of Target Variable
prop.table(table(diab_train$Outcome))
#>
#> 0 1
#> 0.6514658 0.3485342
prop.table(table(diab_test$Outcome))
#>
#> 0 1
#> 0.6493506 0.3506494
Downsampling the data train to make better model, make it 50:50
# downsampling
RNGkind(sample.kind = "Rounding")
set.seed(100)
library(caret)
<- downSample(x=diab_train %>% select(-Outcome), #hanya select variabel prediktor saja
diab_train y=diab_train$Outcome, # hanya selesct variabel target saja
yname="Outcome") # nama kolom target
prop.table(table(diab_train$Outcome))
#>
#> 0 1
#> 0.5 0.5
We try several steps to make the best model we can.
Using all predictor for the Model
<- glm(formula = Outcome ~.,
model_diab_1 data = diab_train,
family = "binomial")
summary(model_diab_1)
#>
#> Call:
#> glm(formula = Outcome ~ ., family = "binomial", data = diab_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.6571 -0.8576 -0.0265 0.8390 2.7050
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -7.758683 0.917167 -8.459 < 0.0000000000000002 ***
#> Pregnancies 0.154585 0.041463 3.728 0.000193 ***
#> Glucose 0.036257 0.004957 7.315 0.000000000000258 ***
#> BloodPressure -0.012327 0.006843 -1.801 0.071627 .
#> SkinThickness 0.005792 0.009187 0.630 0.528409
#> Insulin -0.001713 0.001210 -1.416 0.156778
#> BMI 0.093538 0.019499 4.797 0.000001609269932 ***
#> DiabetesPedigreeFunction 0.625532 0.373458 1.675 0.093940 .
#> Age 0.002779 0.011732 0.237 0.812788
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 593.33 on 427 degrees of freedom
#> Residual deviance: 442.81 on 419 degrees of freedom
#> AIC: 460.81
#>
#> Number of Fisher Scoring iterations: 5
Using Stepwise to Pick Predictor from the Intial Model
model_diab_1
# stepwise Backward
<- step(object = model_diab_1, direction = "backward", trace = F)
model_diab_2 summary(model_diab_2)
#>
#> Call:
#> glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
#> BMI + DiabetesPedigreeFunction, family = "binomial", data = diab_train)
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.59464 -0.85677 -0.00793 0.82378 2.68100
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -7.504248 0.873282 -8.593 < 0.0000000000000002 ***
#> Pregnancies 0.161202 0.036277 4.444 0.0000088466714820 ***
#> Glucose 0.034095 0.004506 7.567 0.0000000000000381 ***
#> BloodPressure -0.011195 0.006540 -1.712 0.0869 .
#> BMI 0.093157 0.018346 5.078 0.0000003817828481 ***
#> DiabetesPedigreeFunction 0.602369 0.365725 1.647 0.0995 .
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 593.33 on 427 degrees of freedom
#> Residual deviance: 444.93 on 422 degrees of freedom
#> AIC: 456.93
#>
#> Number of Fisher Scoring iterations: 4
Predict Data using our Models
$prob_diab_1<-predict(model_diab_1, type = "response", newdata = diab_test)
diab_test$prob_diab_2<-predict(model_diab_2, type = "response", newdata = diab_test) diab_test
Density Data using our Models
Using Model 1
ggplot(diab_test, aes(x=prob_diab_1)) +
geom_density(lwd=1) +
labs(title = "Distribution of Probability Prediction Data")
Using Model 2
ggplot(diab_test, aes(x=prob_diab_2)) +
geom_density(lwd=1) +
labs(title = "Distribution of Probability Prediction Data")
Based on Density Plot above, we can see that there are two peak of
population. Peak on the right considered as Diabetic, and needs more
observation to check if she has diabetes or not. Peak on the left
considered as Healthy.
We Split the model using threshold 0.7
$pred_diab_1 <- factor(ifelse(diab_test$prob_diab_1 > 0.7, "1","0"))
diab_test$pred_diab_2 <- factor(ifelse(diab_test$prob_diab_2 > 0.7, "1","0")) diab_test
library(caret)
<- confusionMatrix(diab_test$pred_diab_1, diab_test$Outcome, positive = "1")
log_conf_1 log_conf_1
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 92 24
#> 1 8 30
#>
#> Accuracy : 0.7922
#> 95% CI : (0.7195, 0.8533)
#> No Information Rate : 0.6494
#> P-Value [Acc > NIR] : 0.00008061
#>
#> Kappa : 0.5103
#>
#> Mcnemar's Test P-Value : 0.00801
#>
#> Sensitivity : 0.5556
#> Specificity : 0.9200
#> Pos Pred Value : 0.7895
#> Neg Pred Value : 0.7931
#> Prevalence : 0.3506
#> Detection Rate : 0.1948
#> Detection Prevalence : 0.2468
#> Balanced Accuracy : 0.7378
#>
#> 'Positive' Class : 1
#>
library(caret)
<- confusionMatrix(diab_test$pred_diab_2, diab_test$Outcome, positive = "1")
log_conf_2 log_conf_2
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 93 23
#> 1 7 31
#>
#> Accuracy : 0.8052
#> 95% CI : (0.7337, 0.8645)
#> No Information Rate : 0.6494
#> P-Value [Acc > NIR] : 0.00001687
#>
#> Kappa : 0.5409
#>
#> Mcnemar's Test P-Value : 0.00617
#>
#> Sensitivity : 0.5741
#> Specificity : 0.9300
#> Pos Pred Value : 0.8158
#> Neg Pred Value : 0.8017
#> Prevalence : 0.3506
#> Detection Rate : 0.2013
#> Detection Prevalence : 0.2468
#> Balanced Accuracy : 0.7520
#>
#> 'Positive' Class : 1
#>
It seems that the performance of these two models is not so different, but Model 2 prove could make better prediction
We find the best CutOff (the number we split the data as positive class and negative class) using find optimum value of Sensitivity, Specificity, Precision, Accuracy.
# create function for performance
<- function(cutoff, prob, ref, postarget, negtarget)
performa
{<- factor(ifelse(prob >= cutoff, postarget, negtarget))
predict <- caret::confusionMatrix(predict , ref, positive = postarget)
conf <- conf$overall[1]
acc <- conf$byClass[1]
rec <- conf$byClass[3]
prec <- conf$byClass[2]
spec <- t(as.matrix(c(rec , acc , prec, spec)))
mat colnames(mat) <- c("recall", "accuracy", "precicion", "specificity")
return(mat)
}
<- seq(0.01,0.80,length=100)
co <- matrix(0,100,4) result
Create Dataframe of Performance using Model 2
for(i in 1:100){
= performa(cutoff = co[i],
result[i,] prob = diab_test$prob_diab_2,
ref = diab_test$Outcome,
postarget = "1",
negtarget = "0")
}<- data_frame("Recall" = result[,1],
plot_data_2 "Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "performa", value = "value", 1:4)
Plot of Model 2 Performance
<- ggplot(data = plot_data_2, aes(x = Cutoff, y = value, col = performa)) +
plot_2 geom_line(lwd = 1) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Model Performance Compared to Model Cutoff") +
theme(legend.position = "top",
legend.title = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
plot_2
Based on business Model, we want the model to has more best
Recall
as possible because we don’t want to reduce False
Negative. Which the person is predicted healthy but she has high chance
has diabetes.
Based on the plot above, we can see that the optimum cutOFF for every performance : 0.5
Create New Prediction using Model 2
$pred_diab_3 <- factor(ifelse(diab_test$prob_diab_2 > 0.5, "1","0")) diab_test
# Odds ratio all coefficients
exp(model_diab_2$coefficients) %>%
data.frame() %>%
arrange(desc(.))
Based on model interpretation above, we can see that on this dataset people with family history of diabetes is more likely to have diabetes and it is the most significant predictor.
head(diab_train, 5)
# predictor
<- diab_train %>% select_if(is.numeric)
diab_train_x
<- diab_test %>% select_if(is.numeric)
diab_test_x
# target
<- diab_train[,"Outcome"]
diab_train_y
<- diab_test[,"Outcome"] diab_test_y
# scale predictor data
<- scale(diab_train_x)
diab_train_xs
<- scale(diab_train_x,
diab_test_xs center = attr(diab_train_xs, "scaled:center"),
scale = attr(diab_train_xs, "scaled:scale"))
Find Optimum K Number based on Number of Data
sqrt(nrow(diab_train))
#> [1] 20.68816
library(class) # package untuk fungsi `knn()`
<- knn(train = diab_train_xs,
pred_diab_knn test = diab_test_xs,
cl = diab_train_y,
k = 21)
# confusion matrix
<- confusionMatrix(data = as.factor(pred_diab_knn),
pred_knn_conf reference = diab_train_y,
positive = "1")
pred_knn_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 163 54
#> 1 51 160
#>
#> Accuracy : 0.7547
#> 95% CI : (0.7111, 0.7947)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.5093
#>
#> Mcnemar's Test P-Value : 0.8453
#>
#> Sensitivity : 0.7477
#> Specificity : 0.7617
#> Pos Pred Value : 0.7583
#> Neg Pred Value : 0.7512
#> Prevalence : 0.5000
#> Detection Rate : 0.3738
#> Detection Prevalence : 0.4930
#> Balanced Accuracy : 0.7547
#>
#> 'Positive' Class : 1
#>
Based on Confusion Matrix above, our KNN Model has slightly lower Accuracy compared to the Model 2, we will compare it further.
Create Confusion Matrix for Model 2 with Cut Off 0.5 that we prove has the best performance in every Measurement.
<- confusionMatrix(diab_test$pred_diab_3, diab_test$Outcome, positive = "1")
log_conf log_conf
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction 0 1
#> 0 78 11
#> 1 22 43
#>
#> Accuracy : 0.7857
#> 95% CI : (0.7124, 0.8477)
#> No Information Rate : 0.6494
#> P-Value [Acc > NIR] : 0.0001665
#>
#> Kappa : 0.5505
#>
#> Mcnemar's Test P-Value : 0.0817228
#>
#> Sensitivity : 0.7963
#> Specificity : 0.7800
#> Pos Pred Value : 0.6615
#> Neg Pred Value : 0.8764
#> Prevalence : 0.3506
#> Detection Rate : 0.2792
#> Detection Prevalence : 0.4221
#> Balanced Accuracy : 0.7881
#>
#> 'Positive' Class : 1
#>
<- data_frame(Accuracy = log_conf$overall[1],
eval_logit Recall = log_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = log_conf$byClass[3])
<- data_frame(Accuracy = pred_knn_conf$overall[1],
eval_knn Recall = pred_knn_conf$byClass[1],
Specificity = log_conf$byClass[2],
Precision = pred_knn_conf$byClass[3])
eval_logit
eval_knn
Based on model comparation above, it seems that Logistic Regression
perform better for our application because it has significantly higher
Recall
Value compared to the KNN Method.
Comparation between Model 1
which using all predictor
and Model 2
which using several predictor is not very
different, it maybe because the sample is too small to see the
significance the rest of the Predictors.
It is better to predict someone as Diabetic
wrong rather
than predict someone as Not Diabetic
wrong. Why? because
the treatment for Diabetic symptoms is to pursue better Lifestyle like
eating less-sugar, exercise more, and have enough sleep. This treatment
will also beneficial to someone that is Not Diabetic
For more advanced treatment, it must be take further investigation to the patient.