1. Case Description

Brief Introduction

Hi, my name is Ade Anggi Naluriawan Santoso. Nice to meet you! This is my fifth RPubs document, written as part of the Learn By Building assignment at Algoritma Data Science School.

Case Description

Diabetes is among the most prevalent chronic diseases in the United States, impacting millions of Americans each year and exerting a significant financial burden on the economy. Diabetes is a serious chronic disease in which individuals lose the ability to effectively regulate levels of glucose in the blood, and can lead to reduced quality of life and life expectancy. After different foods are broken down into sugars during digestion, the sugars are then released into the bloodstream. This signals the pancreas to release insulin. Insulin helps enable cells within the body to use those sugars in the bloodstream for energy. Diabetes is generally characterized by either the body not making enough insulin or being unable to use the insulin that is made as effectively as needed.

Complications like heart disease, vision loss, lower-limb amputation, and kidney disease are associated with chronically high levels of sugar remaining in the bloodstream for those with diabetes. While there is no cure for diabetes, strategies like losing weight, eating healthily, being active, and receiving medical treatment can mitigate the harms of this disease in many patients. Early diagnosis can lead to lifestyle changes and more effective treatment, making predictive models for diabetes risk important tools for the public and for public health officials.

The scale of this problem is also important to recognize. The Centers for Disease Control and Prevention has indicated that as of 2018, 34.2 million Americans have diabetes and 88 million have prediabetes. Furthermore, the CDC estimates that 1 in 5 diabetics and roughly 8 in 10 prediabetics are unaware of their risk. While there are different types of diabetes, type II diabetes is the most common form, and its prevalence varies by age, education, income, location, race, and other social determinants of health. Much of the burden of the disease also falls on those of lower socioeconomic status. Diabetes places a massive burden on the economy as well, with diagnosed diabetes costing roughly $327 billion and total costs, including undiagnosed diabetes and prediabetes, approaching $400 billion annually.

Attribute Information

  1. Pregnancies: Number of times pregnant
  2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  3. BloodPressure: Diastolic blood pressure (mm Hg)
  4. SkinThickness: Triceps skin fold thickness (mm)
  5. Insulin: 2-Hour serum insulin (mu U/ml)
  6. BMI: Body mass index (weight in kg / (height in m)^2)
  7. DiabetesPedigreeFunction: Diabetes pedigree function
  8. Age: Age (years)
  9. Outcome: Test for Diabetes (1: Diabetes, 0: Healthy)

Objectives

  • The choice of target variable depends on the perspective of the case you want to take
  • Data analysis and the process of selecting predictor variables (feature selection)
  • Pre-processing of the data, from data cleansing to cross-validation
  • An explanation of the model evaluation: which metric is best for evaluating the model, and why
  • A documented analysis of how to improve model performance (e.g. the tree-pruning process) and/or a comparison of the naive Bayes and tree-based models

2. Setup Environment

Before we start developing the analysis, let's set up the environment.

# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

options(scipen=9999)

After we are done with the environment setup, let's load the libraries that will be used during the analysis.

library(dplyr)
library(gtools)
library(gmodels)
library(ggplot2)
library(class)
library(tidyr)
library(GGally)
library(caret)
library(e1071)
library(tm)
library(ROCR)
library(partykit)
library(gridExtra)
library(inspectdf)
library(tidymodels)
library(rsample)
library(plotly)
library(rpart)
library(rattle)
library(rpart.plot)

3. Data Preprocessing

3.1 Data Input & Structure

Before we build a classification model on our dataset, we need to understand what is inside it. Let's start by reading diabetes.csv from our folder using the read.csv() function (with stringsAsFactors = TRUE, so any character columns are read as factors), saving it into an object named diabetes_df, and checking the data structure and data types.

diabetes_df <- read.csv("diabetes.csv",stringsAsFactors = TRUE)
str(diabetes_df)
#> 'data.frame':    768 obs. of  9 variables:
#>  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
#>  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
#>  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
#>  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
#>  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
#>  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
#>  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
#>  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
#>  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Based on the output above, we know that our data has 768 rows and 9 columns. However, we still need to change the Outcome column from integer to factor.

diabetes_df$Outcome <- as.factor(diabetes_df$Outcome)
glimpse(diabetes_df)
#> Rows: 768
#> Columns: 9
#> $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
#> $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
#> $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
#> $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
#> $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
#> $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
#> $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
#> $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
#> $ Outcome                  <fct> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

Good, now our dataset has the correct data types for each column.

3.2 Missing Data & Practical Statistics

Even though the dataset already has the correct data types, we still need to check for missing data and apply the necessary treatment if any are found.

anyNA(diabetes_df)
#> [1] FALSE
colSums(is.na(diabetes_df))
#>              Pregnancies                  Glucose            BloodPressure 
#>                        0                        0                        0 
#>            SkinThickness                  Insulin                      BMI 
#>                        0                        0                        0 
#> DiabetesPedigreeFunction                      Age                  Outcome 
#>                        0                        0                        0

Based on the output above, the dataset doesn't have any missing (NA) values, so no missing-value treatment is needed.

Next, let's look at the summary statistics of our dataset.

summary(diabetes_df)
#>   Pregnancies        Glucose      BloodPressure    SkinThickness  
#>  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
#>  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
#>  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
#>  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
#>  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
#>  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
#>     Insulin           BMI        DiabetesPedigreeFunction      Age       
#>  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
#>  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
#>  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
#>  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
#>  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
#>  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
#>  Outcome
#>  0:500  
#>  1:268  
#>         
#>         
#>         
#> 
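
Note that several columns (Glucose, BloodPressure, SkinThickness, Insulin, and BMI) have a minimum of 0 in the summary above, which is not physiologically plausible and most likely represents unrecorded measurements rather than true zeros. A quick, optional check of how many such zeros each column contains could look like the sketch below; we keep the values as they are for this exercise.

# count implausible zero values in the affected columns (optional check)
colSums(diabetes_df %>% select(Glucose, BloodPressure, SkinThickness, Insulin, BMI) == 0)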

We also plot a correlation heatmap as one consideration for feature selection during model development later.

diabetes1 <- read.csv("diabetes.csv",stringsAsFactors = TRUE)
ggcorr(diabetes1, label = T, label_size = 2.9, hjust = 1, layout.exp = 3)+
  labs(title = 'Diabetes Dataset Heatmap Correlation')+
  theme(plot.title = element_text(hjust = 0.5))

From the correlation heatmap, we can see that no input parameter has a dominant, very strong correlation with the outcome. That means we can still use all of the input parameters for model development.
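
If exact numbers are preferred over reading them off the heatmap, the correlation of each predictor with the outcome can also be computed directly from the numeric copy of the data (a small sketch, using the diabetes1 object read above):

# correlation of each predictor with the (numeric) Outcome column
round(cor(diabetes1[, -9], diabetes1$Outcome), 2)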

3.3 Exploratory Data Analysis

It is important to balance the class proportions so that the model can predict both classes well. Let's check the current proportion of our target class.

x <- inspectdf::inspect_cat(diabetes_df)
show_plot(x)

prop.table(table(diabetes_df$Outcome))
#> 
#>         0         1 
#> 0.6510417 0.3489583
table(diabetes_df$Outcome)
#> 
#>   0   1 
#> 500 268

Based on the output, we will need to adjust the class proportion later. We also check the distribution of each input variable using a boxplot.

stacked_df <- stack(diabetes1 %>% select(-Insulin))
ggplot(stacked_df, aes(ind,values))+
  geom_boxplot(aes(fill = ind))+
  labs(title = 'Box Plot for Each Input Column', x='Column Names', y='Value', 
       fill = 'Column Names')+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Even though the data still contain some outliers, we will keep them for the current exercise.
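
As an optional check (not required to proceed), we could also count how many points fall outside the usual 1.5 × IQR whiskers for each predictor; a minimal sketch:

# number of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per predictor
sapply(diabetes1 %>% select(-Outcome), function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})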

3.4 Splitting Data

It looks like our dataset still has an imbalanced class proportion, so we need an additional preprocessing step to balance the classes using down-sampling. First, however, we need to split the dataset. The goal is to use the train data to build a good model, while the test data serves as unseen data for evaluating the model we have created. We use an 80:20 train-to-test ratio for this exercise.

RNGkind(sample.kind = "Rounding")
set.seed(123)
index <- sample(x = nrow(diabetes_df), size = nrow(diabetes_df)*0.8)
diabetes_train <- diabetes_df[index , ]
diabetes_test <-  diabetes_df[-index , ]
diabetes_df$Outcome %>% levels()
#> [1] "0" "1"
#downsampling
RNGkind(sample.kind = "Rounding")
set.seed(123)

diabetes_train <- downSample(x = diabetes_train %>% select(-Outcome),
                             y = diabetes_train$Outcome,
                             yname = "diabetes")
prop.table(table(diabetes_train$diabetes))
#> 
#>   0   1 
#> 0.5 0.5

Good, now our training data has a balanced class proportion and is ready for the next step: model development.
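
As a side note, since rsample is already loaded, an equivalent stratified split could also be written as in the sketch below; the rest of the analysis keeps the base-R split and the down-sampled training set created above.

# alternative stratified 80:20 split with rsample (for reference only)
set.seed(123)
diabetes_split <- initial_split(diabetes_df, prop = 0.8, strata = Outcome)
diabetes_train_alt <- training(diabetes_split)
diabetes_test_alt <- testing(diabetes_split)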

4. Classification Model

4.1 Naive Bayes

Our first model will be built using the Naive Bayes method. Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Despite their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering, and they require only a small amount of training data to estimate the necessary parameters.
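
As a quick refresher (the general formula, not anything specific to this dataset), for a class $y$ and features $x_1, \dots, x_n$ the naive conditional-independence assumption gives

$$P(y \mid x_1, \dots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

and the predicted class is simply the value of $y$ that maximizes this quantity.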

# Model Naive Bayes
naive <- naiveBayes(formula = diabetes ~ . , 
                    data = diabetes_train,
                    laplace = 1)

# model fitting
naive_pred <- predict(naive, diabetes_test, type = "class") # for the class prediction
naive_prob <- predict(naive, diabetes_test, type = "raw") # for the probability

# result
naive_table <- select(diabetes_test, Outcome) %>%
  bind_cols(diabetes_pred = naive_pred) %>% 
  bind_cols(diabetes_eprob = round(naive_prob[,1],4)) %>% 
  bind_cols(diabetes_pprob = round(naive_prob[,2],4))

# performance evaluation - confusion matrix
naive_table %>% 
  conf_mat(Outcome, diabetes_pred) %>% 
  autoplot(type = "heatmap")

# Metrics Summary
naive_table %>% 
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  )
# ROC
naive_roc <- data.frame(prediction = round(naive_prob[,2],4),
                        trueclass = as.numeric(naive_table$Outcome==1))

naive_roc <- ROCR::prediction(naive_roc$prediction, naive_roc$trueclass)

# ROC curve
plot(performance(naive_roc, 'tpr','fpr'),
     main = 'ROC')
abline(a=0, b=1, lty=2)

# AUC
auc_ROCR_n <- performance(naive_roc, measure = 'auc')
auc_ROCR_n <- auc_ROCR_n@y.values[[1]]
auc_ROCR_n
#> [1] 0.8567553
model_n <- naive_table %>%
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  ) %>% 
  cbind(AUC = auc_ROCR_n)

4.2 Decision Tree

The next model is a Decision Tree model. A decision tree is a non-parametric supervised learning algorithm that can be used for both classification and regression tasks. It has a hierarchical tree structure consisting of a root node, branches, internal nodes, and leaf nodes.

dtree <- rpart(formula=diabetes ~.,
               data = diabetes_train,
               method = 'class')
fancyRpartPlot(dtree, sub=NULL)

According to the illustration above, the depth of our decision tree model is still acceptable: the model looks simple and is easy to explain to the user. That's why I think we don't need to do any pruning for now.
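
For completeness, if pruning did become necessary (for example if the tree grew too deep to interpret), a minimal sketch using rpart's built-in complexity-parameter table could look like this; selecting the cp with the lowest cross-validated error is one common rule, not the only one.

# inspect the cross-validated error for each complexity parameter (cp)
printcp(dtree)

# prune back to the cp with the lowest cross-validated error (xerror)
best_cp <- dtree$cptable[which.min(dtree$cptable[, "xerror"]), "CP"]
dtree_pruned <- prune(dtree, cp = best_cp)
fancyRpartPlot(dtree_pruned, sub = NULL)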

# model fitting
dtree_pred <- predict(dtree, diabetes_test, type = "class")
dtree_prob <- predict(dtree, diabetes_test, type = "prob")

# result
dtree_table <- select(diabetes_test, Outcome) %>%
  bind_cols(diabetes_pred = dtree_pred) %>% 
  bind_cols(diabetes_eprob = round(dtree_prob[,1],4)) %>% 
  bind_cols(diabetes_pprob = round(dtree_prob[,2],4))

# performance evaluation - confusion matrix
dtree_table %>% 
  conf_mat(Outcome, diabetes_pred) %>% 
  autoplot(type = "heatmap")

# Metrics Summary
dtree_table %>%
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  )
# ROC
dtree_roc <- data.frame(prediction=round(dtree_prob[,2],4),
                        trueclass=as.numeric(dtree_table$Outcome==1))

dtree_roc <- ROCR::prediction(dtree_roc$prediction, dtree_roc$trueclass) 

# ROC curve
plot(performance(dtree_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1, lty = 2)

# AUC
auc_ROCR_d <- performance(dtree_roc, measure = "auc")
auc_ROCR_d <- auc_ROCR_d@y.values[[1]]
auc_ROCR_d
#> [1] 0.8147043
model_d <- dtree_table %>%
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  ) %>% 
  cbind(AUC = auc_ROCR_d)

4.3 Random Forest

The last model for the exercise is Random Forest (RF). A random forest is a meta-estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes our model's prediction.

# model building
set.seed(123)
ctrl <- trainControl(method="repeatedcv", 
                     number=5, 
                     repeats=5) # k-fold cross validation

forest <- train(diabetes ~ ., 
                data=diabetes_train, 
                method="rf", 
                trControl = ctrl)
forest
#> Random Forest 
#> 
#> 422 samples
#>   8 predictor
#>   2 classes: '0', '1' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 5 times) 
#> Summary of sample sizes: 338, 338, 338, 336, 338, 337, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>   2     0.7293418  0.4587586
#>   5     0.7321989  0.4645222
#>   8     0.7303107  0.4607114
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 5.

From the model summary, we know that the optimal number of variables to consider at each split (mtry) is 5. We can also check the importance of each variable used in our random forest using varImp().

varImp(forest)
#> rf variable importance
#> 
#>                          Overall
#> Glucose                  100.000
#> BMI                       48.148
#> Age                       27.150
#> DiabetesPedigreeFunction  23.523
#> BloodPressure              9.168
#> SkinThickness              3.847
#> Pregnancies                3.262
#> Insulin                    0.000

When using random forests, we are not strictly required to divide our dataset into train and test data, because random forests already provide out-of-bag (OOB) estimates, which act as reliable estimates of accuracy on unseen examples. However, it is still possible to conduct a regular train-test evaluation. For example, the OOB estimate shown in the summary below was generated from our diabetes_train dataset.

plot(forest$finalModel)
legend("topright", colnames(forest$finalModel$err.rate),col=1:6,cex=0.8,fill=1:6)

forest$finalModel
#> 
#> Call:
#>  randomForest(x = x, y = y, mtry = param$mtry) 
#>                Type of random forest: classification
#>                      Number of trees: 500
#> No. of variables tried at each split: 5
#> 
#>         OOB estimate of  error rate: 28.67%
#> Confusion matrix:
#>     0   1 class.error
#> 0 150  61   0.2890995
#> 1  60 151   0.2843602
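
For reference, the OOB error printed above can also be extracted programmatically from the err.rate matrix stored in the underlying randomForest object (a small sketch):

# OOB error after the final tree (column "OOB" of the err.rate matrix)
tail(forest$finalModel$err.rate[, "OOB"], 1)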

Let's test our random forest model on diabetes_test:

# model fitting
forest_pred <- predict(forest, diabetes_test, type = "raw")
forest_prob <- predict(forest, diabetes_test, type = "prob")

# result
forest_table <- select(diabetes_test, Outcome) %>%
  bind_cols(diabetes_pred = forest_pred) %>% 
  bind_cols(diabetes_eprob = round(forest_prob[,1],4)) %>% 
  bind_cols(diabetes_pprob = round(forest_prob[,2],4))

# performance evaluation - confusion matrix 
forest_table %>% 
  conf_mat(Outcome, diabetes_pred) %>% 
  autoplot(type = "heatmap")

# Metrics Summary
forest_table %>%
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  )
# ROC
forest_roc <- data.frame(prediction=round(forest_prob[,2],4),
                      trueclass=as.numeric(forest_table$Outcome==1))

forest_roc <- ROCR::prediction(forest_roc$prediction, forest_roc$trueclass) 

# ROC curve
plot(performance(forest_roc, "tpr", "fpr"),
     main = "ROC")
abline(a = 0, b = 1, lty = 2)

# AUC
auc_ROCR_f <- performance(forest_roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
#> [1] 0.849159
model_f <- forest_table %>%
  summarise(
    accuracy = accuracy_vec(Outcome, diabetes_pred),
    sensitivity = sens_vec(Outcome, diabetes_pred),
    specificity = spec_vec(Outcome, diabetes_pred),
    precision = precision_vec(Outcome, diabetes_pred)
  ) %>% 
  cbind(AUC = auc_ROCR_f)

5. Conclusion

Let's compile all of the metrics from each model we built into one table below:

rbind("Naive Bayes" = model_n, 
      "Decision Tree" = model_d, 
      "Random Forest" = model_f)

Overall, our three models all achieve AUC values above 0.8 and can be considered to have reasonably robust performance.

Looking at the objective of our case, we focus on the precision metric because we want to minimize false positive predictions of diabetes. Imagine what happens when a doctor falsely diagnoses a healthy patient as diabetic. Based on this consideration, we select the Decision Tree model to represent our work because it has the best precision. Even though the Decision Tree model lags slightly behind the other models in sensitivity and AUC, the differences are small enough to be acceptable. We also think the Decision Tree model offers an additional benefit in terms of explainability: doctors can easily explain why the model produced a given output, so patients can accept the explanation.
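
For clarity, the two metrics being traded off here are defined as (with TP = true positives, FP = false positives, FN = false negatives)

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}$$

so minimizing the number of healthy patients flagged as diabetic (false positives) corresponds to maximizing precision.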

 

A work by Ade Anggi N S