This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
The dataset is obtained from Kaggle.
diabetes <- read.csv("diabetes.csv")
library(ggthemes)
library(magrittr)
library(tidyverse)
library(dplyr)
library(inspectdf)
library(e1071)
library(caret)
library(doParallel)
theme_set(theme_minimal())
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, ~
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125~
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74~
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, ~
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, ~
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.~
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2~
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3~
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, ~
We have a dataset that includes 768 observations on 9 variables: 8 predictors and the target, Outcome.
We will convert the Outcome variable to a factor to make it categorical:
diabetes$Outcome %<>% as.factor()
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00
##     Insulin           BMI        DiabetesPedigreeFunction      Age
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00
##  Outcome
##  0:500
##  1:268
For some variables a value of 0 makes no sense unless it means that the actual value is missing. We will therefore replace 0 with NA in these variables: Glucose, BloodPressure, SkinThickness, Insulin, and BMI.
cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
diabetes[,cols][diabetes[,cols] == 0] <- NA
colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure
##                        0                        5                       35
##            SkinThickness                  Insulin                      BMI
##                      227                      374                       11
## DiabetesPedigreeFunction                      Age                  Outcome
##                        0                        0                        0
To fill these NA values, we need to understand how each variable is distributed against the target. We will compute the median of every predictor that has missing values, separately for each value of the target variable.
# Median of column `v` (passed as a string) per Outcome class, ignoring NAs
median_target <- function(v) {
  df <- diabetes %>%
    filter(!is.na((!!as.symbol(v))))
  df <- df %>% select((!!as.symbol(v)), Outcome) %>%
    group_by(Outcome) %>%
    summarise(median = median((!!as.symbol(v))))
  return(df)
}
# na_median <- function(v)
median_target("Glucose")
diabetes[,"Glucose"][(diabetes[,"Glucose"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 107
diabetes[,"Glucose"][(diabetes[,"Glucose"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 140
median_target("BloodPressure")
diabetes[,"BloodPressure"][(diabetes[,"BloodPressure"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 70
diabetes[,"BloodPressure"][(diabetes[,"BloodPressure"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 74.5
median_target("SkinThickness")
diabetes[,"SkinThickness"][(diabetes[,"SkinThickness"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 27
diabetes[,"SkinThickness"][(diabetes[,"SkinThickness"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 32
median_target("Insulin")
diabetes[,"Insulin"][(diabetes[,"Insulin"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 102.5
diabetes[,"Insulin"][(diabetes[,"Insulin"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 169.5
median_target("BMI")
diabetes[,"BMI"][(diabetes[,"BMI"] %>% is.na() ) & diabetes[,"Outcome"] == 0] <- 30.1
diabetes[,"BMI"][(diabetes[,"BMI"] %>% is.na() ) & diabetes[,"Outcome"] == 1] <- 34.3
Now, let us recheck the missing values.
colSums(is.na(diabetes))
##              Pregnancies                  Glucose            BloodPressure
##                        0                        0                        0
##            SkinThickness                  Insulin                      BMI
##                        0                        0                        0
## DiabetesPedigreeFunction                      Age                  Outcome
##                        0                        0                        0
There are no missing values left in the dataset.
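The medians above were copied by hand from the output of median_target(). As a minimal sketch of the same idea done in one step, the snippet below fills each NA with the median of its own Outcome group using base R's ave(). The toy data frame and the helper name impute_group_median are assumptions made for illustration; they are not part of the original analysis.

```r
# Toy data standing in for the diabetes data frame (illustrative values only)
toy <- data.frame(
  Glucose = c(100, NA, 110, 150, NA, 140),
  Outcome = factor(c(0, 0, 0, 1, 1, 1))
)

# Hypothetical helper: replace each NA with the median of its own group
impute_group_median <- function(x, group) {
  ave(x, group, FUN = function(v) {
    v[is.na(v)] <- median(v, na.rm = TRUE)
    v
  })
}

toy$Glucose <- impute_group_median(toy$Glucose, toy$Outcome)
toy
```

Because this kind of imputation uses the label, a stricter evaluation would estimate the medians on the training rows only and then apply them to the test rows.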
library(ggcorrplot)
corr <- round(cor(diabetes %>% select(-Outcome)), 1)
# TODO: add color for the target variable
ggcorrplot(corr, hc.order = TRUE, type = "lower",
           lab = TRUE)
As we can see from the graph above, no pair of predictors is highly correlated.
In evaluating the predictions we will focus on recall (sensitivity), which tells us what proportion of actual positives was identified correctly. In this setting we would rather flag a patient as diabetic by mistake (a false positive) than miss a patient who actually has diabetes (a false negative), because early detection allows earlier prevention and treatment.
diabetes$Outcome %>%
table() %>%
prop.table()
## .
## 0 1
## 0.6510417 0.3489583
The target classes are slightly imbalanced, but this should not be a problem for model training and prediction.
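If the imbalance were a concern, one option is to stratify the train/test split so that both parts keep roughly the same class ratio. The sketch below does this in base R on toy labels; the helper name stratified_index and the 65/35 toy vector are assumptions for illustration. The caret package, already loaded above, offers createDataPartition() for the same purpose.

```r
set.seed(123)  # arbitrary seed, only for reproducibility of the sketch

# Toy labels with roughly the same class mix as Outcome (illustrative data)
y <- factor(rep(c(0, 1), times = c(65, 35)))

# Hypothetical helper: sample a share p of the row indices within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = floor(p * length(idx)))]
  }), use.names = FALSE)
}

train_idx <- stratified_index(y)
prop.table(table(y[train_idx]))   # class shares in the training part
prop.table(table(y[-train_idx]))  # class shares in the held-out part
```

Both tables show the same 65/35 split as the full toy vector, which a plain random sample does not guarantee.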
RNGkind(sample.kind = "Rounding")
set.seed(123)
split_index <- sample(nrow(diabetes), nrow(diabetes)*0.80)
diabetes_train <- diabetes[split_index, ]
diabetes_test <- diabetes[-split_index, ]
model_log <- glm(Outcome ~ ., data = diabetes_train, family = "binomial")
summary(model_log)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = diabetes_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4800 -0.7138 -0.3776 0.7455 2.4514
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.068789 0.899325 -10.084 < 2e-16 ***
## Pregnancies 0.111113 0.036633 3.033 0.00242 **
## Glucose 0.030155 0.004317 6.986 2.83e-12 ***
## BloodPressure -0.002812 0.009919 -0.283 0.77682
## SkinThickness 0.037210 0.015476 2.404 0.01620 *
## Insulin 0.004644 0.001583 2.934 0.00334 **
## BMI 0.058073 0.019541 2.972 0.00296 **
## DiabetesPedigreeFunction 0.680046 0.332773 2.044 0.04100 *
## Age 0.010072 0.011007 0.915 0.36015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 790.13 on 613 degrees of freedom
## Residual deviance: 561.46 on 605 degrees of freedom
## AIC: 579.46
##
## Number of Fisher Scoring iterations: 5
# predictions on the link (log-odds) scale; not used below
diabetes_log_pred <- predict(model_log, newdata = diabetes_test)
# type = "response" returns predicted probabilities (despite the name pred_odd)
diabetes_test$pred_odd <- predict(object = model_log,
                                  newdata = diabetes_test,
                                  type = "response")
# classify as diabetic when the predicted probability exceeds 0.5
diabetes_test$pred_Label <- ifelse(test = diabetes_test$pred_odd > 0.5, yes = "1", no = "0")
# diabetes_test <- diabetes_test %>% mutate(pred_Label = as.factor(pred_Label))
confusionMatrix(data = as.factor(diabetes_test$pred_Label),
                reference = diabetes_test$Outcome,
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 87 22
## 1 10 35
##
## Accuracy : 0.7922
## 95% CI : (0.7195, 0.8533)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 1.037e-05
##
## Kappa : 0.5341
##
## Mcnemar's Test P-Value : 0.05183
##
## Sensitivity : 0.6140
## Specificity : 0.8969
## Pos Pred Value : 0.7778
## Neg Pred Value : 0.7982
## Prevalence : 0.3701
## Detection Rate : 0.2273
## Detection Prevalence : 0.2922
## Balanced Accuracy : 0.7555
##
## 'Positive' Class : 1
##
Result of the Logistic Regression model: the confusion matrix above shows an accuracy of ~79% and a sensitivity (recall) of ~61%.
model_bayes <- naiveBayes(Outcome ~ ., data = diabetes_train)
diabetes_nb_pred <- predict(model_bayes, newdata = diabetes_test)
confusionMatrix(data = diabetes_nb_pred, reference = diabetes_test$Outcome, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 20
## 1 12 37
##
## Accuracy : 0.7922
## 95% CI : (0.7195, 0.8533)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 1.037e-05
##
## Kappa : 0.5411
##
## Mcnemar's Test P-Value : 0.2159
##
## Sensitivity : 0.6491
## Specificity : 0.8763
## Pos Pred Value : 0.7551
## Neg Pred Value : 0.8095
## Prevalence : 0.3701
## Detection Rate : 0.2403
## Detection Prevalence : 0.3182
## Balanced Accuracy : 0.7627
##
## 'Positive' Class : 1
##
Result of the Naive Bayes model: the confusion matrix above shows an accuracy of ~79% and a sensitivity (recall) of ~65%.
Because all the predictors are numeric, we can proceed to building the KNN model.
RNGkind(sample.kind = "Rounding")
set.seed(123)
split_index <- sample(nrow(diabetes), nrow(diabetes)*0.80)
diabetes_train <- diabetes[split_index, ]
diabetes_test <- diabetes[-split_index, ]
Splitting the data into predictors and target variable:
# predictors
diabetes_train_x <- diabetes_train %>% select_if(is.numeric)
diabetes_test_x <- diabetes_test %>% select_if(is.numeric)
# target
diabetes_train_y <- diabetes_train %>% select(Outcome)
diabetes_test_y <- diabetes_test %>% select(Outcome)
Scaling the predictor data:
diabetes_train_xs <- diabetes_train_x %>% scale()
# scale the test set with the training set's center and spread
diabetes_test_xs <- diabetes_test_x %>%
  scale(center = attr(diabetes_train_xs, "scaled:center"),
        scale = attr(diabetes_train_xs, "scaled:scale"))
Finding the optimum K:
k <- sqrt(nrow(diabetes_train)) %>% round(digits = 0)
Because the target has an even number of classes (2), we should use an odd K so that votes cannot tie.
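The square-root rule happens to give an odd value for this training set, but that is not guaranteed in general. As a small sketch, the hypothetical helper odd_k below (not part of the original analysis) applies the same rule and bumps an even result up by one.

```r
# Hypothetical helper: round(sqrt(n)), moved up to the next odd number if even
odd_k <- function(n_train) {
  k <- round(sqrt(n_train))
  if (k %% 2 == 0) k <- k + 1
  k
}

odd_k(614)  # 614 training rows (80% of 768): sqrt is about 24.8, so k = 25
odd_k(400)  # sqrt is exactly 20, which is even, so k = 21
```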
library(class)
cl <- diabetes_train_y[,"Outcome"]
knn_pred <- knn(train = diabetes_train_xs, # training data, predictors, scaled
                test = diabetes_test_xs,   # test data, predictors, scaled
                cl = cl,                   # training data, actual labels (target)
                k = k)
confusionMatrix(data = knn_pred, # confusionMatrix needs the data as factors
                reference = diabetes_test_y[,"Outcome"],
                positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 88 15
## 1 9 42
##
## Accuracy : 0.8442
## 95% CI : (0.777, 0.8975)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 3.871e-09
##
## Kappa : 0.6583
##
## Mcnemar's Test P-Value : 0.3074
##
## Sensitivity : 0.7368
## Specificity : 0.9072
## Pos Pred Value : 0.8235
## Neg Pred Value : 0.8544
## Prevalence : 0.3701
## Detection Rate : 0.2727
## Detection Prevalence : 0.3312
## Balanced Accuracy : 0.8220
##
## 'Positive' Class : 1
##
Result of the KNN model: the confusion matrix above shows an accuracy of ~84% and a sensitivity (recall) of ~74%.
library(partykit)
library(ROCR)
model_tree <- ctree(Outcome ~ ., data=diabetes_train)
# Visualize decision tree
plot(model_tree, type = "simple")
pred_dtree <- predict(model_tree, diabetes_test_x)
confusionMatrix(pred_dtree, diabetes_test$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 80 15
## 1 17 42
##
## Accuracy : 0.7922
## 95% CI : (0.7195, 0.8533)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 1.037e-05
##
## Kappa : 0.5576
##
## Mcnemar's Test P-Value : 0.8597
##
## Sensitivity : 0.8247
## Specificity : 0.7368
## Pos Pred Value : 0.8421
## Neg Pred Value : 0.7119
## Prevalence : 0.6299
## Detection Rate : 0.5195
## Detection Prevalence : 0.6169
## Balanced Accuracy : 0.7808
##
## 'Positive' Class : 0
##
model_svm <- svm(Outcome ~ ., data = diabetes_train)
pred_svm <- predict(model_svm, diabetes_test_x)
confusionMatrix(pred_svm, diabetes_test$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 11
## 1 6 46
##
## Accuracy : 0.8896
## 95% CI : (0.8291, 0.9344)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 3.151e-13
##
## Kappa : 0.7589
##
## Mcnemar's Test P-Value : 0.332
##
## Sensitivity : 0.9381
## Specificity : 0.8070
## Pos Pred Value : 0.8922
## Neg Pred Value : 0.8846
## Prevalence : 0.6299
## Detection Rate : 0.5909
## Detection Prevalence : 0.6623
## Balanced Accuracy : 0.8726
##
## 'Positive' Class : 0
##
Result of the SVM model: the confusion matrix above shows an accuracy of ~89%. Note that the positive class here is 0, so the recall for diabetic patients (class 1) is the value reported as Specificity, ~81%.
library(randomForest)
control <- trainControl(method='repeatedcv',
number=10,
repeats=3,
search = 'random')
set.seed(100)
model_rf <- train(Outcome ~ .,
data = diabetes_train,
method = 'rf',
metric = 'Accuracy',
trControl = control)
saveRDS(model_rf, "model_rf.RDS") # save model
pred_rf <- predict(model_rf, diabetes_test_x)
confusionMatrix(pred_rf, diabetes_test$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 91 10
## 1 6 47
##
## Accuracy : 0.8961
## 95% CI : (0.8368, 0.9394)
## No Information Rate : 0.6299
## P-Value [Acc > NIR] : 6.496e-14
##
## Kappa : 0.7739
##
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.9381
## Specificity : 0.8246
## Pos Pred Value : 0.9010
## Neg Pred Value : 0.8868
## Prevalence : 0.6299
## Detection Rate : 0.5909
## Detection Prevalence : 0.6558
## Balanced Accuracy : 0.8814
##
## 'Positive' Class : 0
##
Based on the models we tested, we want a model with accuracy above 80%. For this diabetes case it is also important to minimize false negatives, because patients should be treated as early as possible, so we also consider which model has the best sensitivity for the diabetic class. Note that the decision tree, SVM, and Random Forest outputs above were computed with 0 as the positive class, so their recall for diabetic patients (class 1) is the value reported as Specificity. On that basis the best model is the Random Forest, with ~82% recall for class 1 and ~89.6% accuracy.
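To put all six models on the same footing, the comparison can be recomputed by hand from the confusion matrices printed above, always treating class 1 (diabetic) as the positive class. This is only a sketch: the counts are copied from the outputs in this document, and the data frame layout is an assumption made for illustration.

```r
# Counts copied from the confusion matrices above (class "1" = diabetic)
results <- data.frame(
  model = c("Logistic regression", "Naive Bayes", "KNN",
            "Decision tree", "SVM", "Random forest"),
  TP = c(35, 37, 42, 42, 46, 47),  # predicted 1, actually 1
  FN = c(22, 20, 15, 15, 11, 10),  # predicted 0, actually 1
  TN = c(87, 85, 88, 80, 91, 91),  # predicted 0, actually 0
  FP = c(10, 12, 9, 17, 6, 6)      # predicted 1, actually 0
)

# Recall for class 1 and overall accuracy
results$recall   <- with(results, TP / (TP + FN))
results$accuracy <- with(results, (TP + TN) / (TP + FN + TN + FP))

# Sort so the model with the highest recall for class 1 comes first
results[order(-results$recall), c("model", "recall", "accuracy")]
```

Sorted this way, the Random Forest comes first on both recall and accuracy, with the SVM close behind.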