Hi, my name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my fifth RPubs document, written as part of the Learn By Building assignment at Algoritma Data Science School.
Diabetes is among the most prevalent chronic diseases in the United
States, impacting millions of Americans each year and exerting a
significant financial burden on the economy. Diabetes is a serious
chronic disease in which individuals lose the ability to effectively
regulate levels of glucose in the blood, and can lead to reduced quality
of life and life expectancy. After different foods are broken down into
sugars during digestion, the sugars are then released into the
bloodstream. This signals the pancreas to release insulin. Insulin helps
enable cells within the body to use those sugars in the bloodstream for
energy. Diabetes is generally characterized by either the body not
making enough insulin or being unable to use the insulin that is made as
effectively as needed.
Complications like heart disease, vision loss, lower-limb amputation,
and kidney disease are associated with chronically high levels of sugar
remaining in the bloodstream for those with diabetes. While there is no
cure for diabetes, strategies like losing weight, eating healthily,
being active, and receiving medical treatments can mitigate the harms of
this disease in many patients. Early diagnosis can lead to lifestyle
changes and more effective treatment, making predictive models for
diabetes risk important tools for the public and for public health
officials.
The scale of this problem is also important to recognize. The Centers
for Disease Control and Prevention has indicated that as of 2018, 34.2
million Americans have diabetes and 88 million have prediabetes.
Furthermore, the CDC estimates that 1 in 5 diabetics, and roughly 8 in
10 prediabetics are unaware of their risk. While there are different
types of diabetes, type II diabetes is the most common form and its
prevalence varies by age, education, income, location, race, and other
social determinants of health. Much of the burden of the disease falls
on those of lower socioeconomic status as well. Diabetes also places a
massive burden on the economy, with diagnosed diabetes costing roughly
$327 billion and total costs, including undiagnosed diabetes and
prediabetes, approaching $400 billion annually.
Before we develop the analysis, let's start with the environment setup.
# clear-up the environment
rm(list = ls())
# chunk options
knitr::opts_chunk$set(
message = FALSE,
warning = FALSE,
fig.align = "center",
comment = "#>"
)
options(scipen = 9999)
After we are done with the environment setup, let's load the libraries that will be used throughout the analysis.
library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(GGally)
library(caret)
library(e1071)
library(tm)
library(ROCR)
library(partykit)
library(gridExtra)
library(inspectdf)
library(tidymodels)
library(rsample)
library(plotly)
library(rpart)
library(rattle)
library(rpart.plot)
Before we build classification models on our dataset, we need to
understand what is inside it. Let's start by reading
diabetes.csv from our folder with the
read.csv() function, converting chr columns to
Factor, saving the result into an object named
diabetes_df, and checking the data structure and data
types.
diabetes_df <- read.csv("diabetes.csv",stringsAsFactors = TRUE)
str(diabetes_df)
#> 'data.frame': 768 obs. of 9 variables:
#> $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
#> $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
#> $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
#> $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
#> $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
#> $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#> $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
#> $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
#> $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Based on the output above, we know that our data has dimensions of
768 rows and 9 columns. On the other hand, we still
need to change the int type to Factor in the
Outcome column.
diabetes_df$Outcome <- as.factor(diabetes_df$Outcome)
glimpse(diabetes_df)
#> Rows: 768
#> Columns: 9
#> $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
#> $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
#> $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
#> $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
#> $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
#> $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
#> $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
#> $ Outcome <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
Good, now every column in our dataset has the correct data type.
Next, we still need to check for any missing data and apply the necessary treatment if any is found.
anyNA(diabetes_df)
#> [1] FALSE
colSums(is.na(diabetes_df))
#> Pregnancies Glucose BloodPressure
#> 0 0 0
#> SkinThickness Insulin BMI
#> 0 0 0
#> DiabetesPedigreeFunction Age Outcome
#> 0 0 0
Based on the output above, the dataset doesn't have any missing
values, so no missing-value treatment is needed.
Next, let's look at the summary statistics of the dataset.
summary(diabetes_df)
#> Pregnancies Glucose BloodPressure SkinThickness
#> Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
#> 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
#> Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
#> Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
#> 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
#> Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
#> Insulin BMI DiabetesPedigreeFunction Age
#> Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
#> 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
#> Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
#> Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
#> 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
#> Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
#> Outcome
#> 0:500
#> 1:268
#>
#>
#>
#>
We also plot a correlation heatmap as one consideration for feature selection during model development later.
diabetes1 <- read.csv("diabetes.csv",stringsAsFactors = TRUE)
ggcorr(diabetes1, label = T, label_size = 2.9, hjust = 1, layout.exp = 3)+
labs(title = 'Diabetes Dataset Heatmap Correlation')+
theme(plot.title = element_text(hjust = 0.5))
From the correlation heatmap, we can see that no single parameter
has a dominant correlation with our output. That means we
can keep all of the input parameters for model development.
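If we want a quick numeric cross-check of the heatmap, one option (a sketch only, not part of the original workflow; the 0.75 cutoff is an arbitrary assumption) is caret::findCorrelation(), which reports the predictors it would drop because of high pairwise correlation:
# Sketch: flag predictors whose pairwise correlation exceeds an assumed 0.75 cutoff.
# An empty result (character(0)) supports keeping all input parameters.
cor_matrix <- cor(diabetes1 %>% select(-Outcome))
caret::findCorrelation(cor_matrix, cutoff = 0.75, names = TRUE)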
It is important to balance the class proportions so that the model can predict both classes well. Let's check the current proportion of our target class.
x <- inspectdf::inspect_cat(diabetes_df)
show_plot(x)
prop.table(table(diabetes_df$Outcome))
#>
#> 0 1
#> 0.6510417 0.3489583
table(diabetes_df$Outcome)
#>
#> 0 1
#> 500 268
Based on the output, we will need to adjust the class proportions later. We also need to check the distribution of each input parameter using box plots.
stacked_df <- stack(diabetes1 %>% select(-Insulin))
ggplot(stacked_df, aes(ind,values))+
geom_boxplot(aes(fill = ind))+
labs(title = 'Box Plot for Each Input Column', x = 'Column Names', y = 'Value',
fill = 'Column Names')+
theme(plot.title = element_text(hjust = 0.5))+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
Even though the data still contain some outliers, we will keep them
for the current exercise.
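For reference only, here is a minimal sketch of how the outliers could be capped if we changed our mind later; cap_outliers is a hypothetical helper (not used anywhere else in this analysis) that winsorizes each column at its Tukey fences (1.5 x IQR beyond the quartiles).
# Hypothetical helper: cap values at the Tukey fences (not applied in this exercise)
cap_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * IQR(x, na.rm = TRUE)
  pmin(pmax(x, q[1] - fence), q[2] + fence)
}
diabetes_capped <- diabetes1 %>% mutate(across(-Outcome, cap_outliers))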
It looks like our dataset still has an imbalanced class proportion, which means
we need an additional preprocessing step to balance it using
down-sampling. First, however, we need to split the dataset. The goal
is to use the training data to build a good model, while the test
data acts as unseen data for evaluating the model we have created.
We use an 80:20 train-test split for this exercise.
RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(diabetes_df), size = nrow(diabetes_df)*0.8)
diabetes_train <- diabetes_df[index , ]
diabetes_test <- diabetes_df[-index , ]
diabetes_df$Outcome %>% levels()
#> [1] "0" "1"
#downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)
diabetes_train <- downSample(x = diabetes_train %>% select(-Outcome),
y = diabetes_train$Outcome,
yname = "diabetes")
prop.table(table(diabetes_train$diabetes))
#>
#> 0 1
#> 0.5 0.5
Good, now our training data has a balanced class proportion and is ready for the next step: model development.
Our first model will be built using the Naive Bayes method. Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Despite their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering, and they require only a small amount of training data to estimate the necessary parameters.
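In formula terms (this just restates the assumption described above, using standard notation rather than anything specific to this dataset), the classifier scores each class $C_k$ by combining the class prior with the per-feature likelihoods:
$$P(C_k \mid x_1, \dots, x_n) \propto P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)$$
The predicted class is simply the $C_k$ with the largest score.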
# Model Naive Bayes
naive <- naiveBayes(formula = diabetes ~ . ,
data = diabetes_train,
laplace = 1)
# model fitting
naive_pred <- predict(naive, diabetes_test, type = "class") # for the class prediction
naive_prob <- predict(naive, diabetes_test, type = "raw") # for the probability
# result
naive_table <- select(diabetes_test, Outcome) %>%
bind_cols(diabetes_pred = naive_pred) %>%
bind_cols(diabetes_eprob = round(naive_prob[,1],4)) %>%
bind_cols(diabetes_pprob = round(naive_prob[,2],4))
# performance evaluation - confusion matrix
naive_table %>%
conf_mat(Outcome, diabetes_pred) %>%
autoplot(type = "heatmap")
# Metrics Summary
naive_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
)
# ROC
naive_roc <- data.frame(prediction = round(naive_prob[,2],4),
trueclass = as.numeric(naive_table$Outcome==1))
naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc, 'tpr','fpr'),
main = 'ROC')
abline(a = 0, b = 1, lty = 2)
# AUC
auc_ROCR_n <- performance(naive_roc, measure = 'auc')
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
#> [1] 0.8567553
model_n <- naive_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
) %>%
cbind(AUC = auc_ROCR_n)
The next model is the Decision Tree model. A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical, tree-like structure consisting of a root node, branches, internal nodes, and leaf nodes.
dtree <- rpart(formula=diabetes ~.,
data = diabetes_train,
method = 'class')
fancyRpartPlot(dtree, sub=NULL)
According to the illustration above, the depth of our decision tree
model is still acceptable: the model looks simple and is easy for
the user to explain. That's why I think we don't need to do any
pruning for now.
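For completeness, here is a sketch of how pruning could be done with rpart if the tree ever grew too deep (we do not apply this here): pick the complexity parameter with the lowest cross-validated error from the cp table and prune back to it.
# Sketch only - inspect rpart's internal cross-validation results
printcp(dtree)
# choose the cp value with the smallest cross-validated error (xerror) and prune
best_cp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree_pruned <- prune(dtree, cp = best_cp)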
# model fitting
dtree_pred <- predict(dtree, diabetes_test, type = "class")
dtree_prob <- predict(dtree, diabetes_test, type = "prob")
# result
dtree_table <- select(diabetes_test, Outcome) %>%
bind_cols(diabetes_pred = dtree_pred) %>%
bind_cols(diabetes_eprob = round(dtree_prob[,1],4)) %>%
bind_cols(diabetes_pprob = round(dtree_prob[,2],4))
# performance evaluation - confusion matrix
dtree_table %>%
conf_mat(Outcome, diabetes_pred) %>%
autoplot(type = "heatmap")
# Metrics Summary
dtree_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
)
# ROC
dtree_roc <- data.frame(prediction=round(dtree_prob[,2],4),
trueclass=as.numeric(dtree_table$Outcome==1))
dtree_roc <- ROCR::prediction(dtree_roc$prediction, dtree_roc$trueclass)
# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1, lty = 2)
# AUC
auc_ROCR_d <- performance(dtree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
#> [1] 0.8147043
model_d <- dtree_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
) %>%
cbind(AUC = auc_ROCR_d)
The last model for this exercise is Random Forest (RF). A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve predictive accuracy and control over-fitting. Each individual tree in the random forest spits out a class prediction, and the class with the most votes becomes the model's prediction.
# model building
set.seed(123)
ctrl <- trainControl(method="repeatedcv",
number=5,
repeats=5) # k-fold cross validation
forest <- train(diabetes ~ .,
data=diabetes_train,
method="rf",
trControl = ctrl)
forest
#> Random Forest
#>
#> 422 samples
#> 8 predictor
#> 2 classes: '0', '1'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times)
#> Summary of sample sizes: 338, 338, 338, 336, 338, 337, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.7293418 0.4587586
#> 5 0.7321989 0.4645222
#> 8 0.7303107 0.4607114
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 5.
From the model summary, we know that the optimal number of variables
to consider at each split (mtry) is 5. We can also check the
importance of each variable used in our random forest using
varImp.
varImp(forest)
#> rf variable importance
#>
#> Overall
#> Glucose 100.000
#> BMI 48.148
#> Age 27.150
#> DiabetesPedigreeFunction 23.523
#> BloodPressure 9.168
#> SkinThickness 3.847
#> Pregnancies 3.262
#> Insulin 0.000
When using random forests, we are not strictly required to divide our dataset into train and test data, because random forests already provide out-of-bag (OOB) estimates, which act as reliable estimates of accuracy on unseen examples. However, it is still possible to conduct a regular train-test evaluation. For example, the OOB estimate shown in the summary below was generated from our diabetes_train dataset.
plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate), col = 1:6, cex = 0.8, fill = 1:6)
forest$finalModel
#>
#> Call:
#> randomForest(x = x, y = y, mtry = param$mtry)
#> Type of random forest: classification
#> Number of trees: 500
#> No. of variables tried at each split: 5
#>
#> OOB estimate of error rate: 28.67%
#> Confusion matrix:
#> 0 1 class.error
#> 0 150 61 0.2890995
#> 1 60 151 0.2843602
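As a small cross-check (a sketch; it assumes the caret final model is a randomForest object, which the call above confirms), the same OOB figure can be read programmatically from the per-tree error matrix:
# err.rate has one row per tree; the last row holds the cumulative OOB error
# and the per-class error rates after all 500 trees.
tail(forest$finalModel$err.rate, 1)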
Let's test our random forest model on our
diabetes_test data:
# model fitting
forest_pred <- predict(forest, diabetes_test, type = "raw")
forest_prob <- predict(forest, diabetes_test, type = "prob")
# result
forest_table <- select(diabetes_test, Outcome) %>%
bind_cols(diabetes_pred = forest_pred) %>%
bind_cols(diabetes_eprob = round(forest_prob[,1],4)) %>%
bind_cols(diabetes_pprob = round(forest_prob[,2],4))
# performance evaluation - confusion matrix
forest_table %>%
conf_mat(Outcome, diabetes_pred) %>%
autoplot(type = "heatmap")
# Metrics Summary
forest_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
)
# ROC
forest_roc <- data.frame(prediction=round(forest_prob[,2],4),
trueclass=as.numeric(forest_table$Outcome==1))
forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass)
# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1, lty = 2)
# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
#> [1] 0.849159
model_f <- forest_table %>%
summarise(
accuracy = accuracy_vec(Outcome, diabetes_pred),
sensitivity = sens_vec(Outcome, diabetes_pred),
specificity = spec_vec(Outcome, diabetes_pred),
precision = precision_vec(Outcome, diabetes_pred)
) %>%
cbind(AUC = auc_ROCR_f)
Let's compile the metrics from each model we built into one table below:
rbind("Naive Bayes" = model_n,
      "Decision Tree" = model_d,
      "Random Forest" = model_f)
Overall, all three models have excellent AUC and can be considered to have robust performance.
Looking at the objective of our case, we need to focus on the precision
metric because we want to minimize false positive predictions for
diabetes. Imagine what happens when a doctor falsely diagnoses a
healthy patient as diabetic. Based on this consideration, we select
the Decision Tree model to represent our work because it has
the best precision. Even though the Decision Tree model lags slightly
behind the other models in sensitivity and AUC, the differences are
small enough to neglect. We also think the Decision Tree model gives an
additional benefit in terms of explainability: doctors can easily
explain why the model produces a given output, so patients can accept
the explanation.
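As a reminder of what the chosen metric measures (standard definitions, not something computed from this dataset), with the diabetic class treated as the positive class, precision penalizes exactly the error we want to avoid, flagging a healthy patient as diabetic, while sensitivity penalizes missed diabetic patients:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}$$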
A work by Ade Anggi N S