Introduction

According to the World Health Organization (WHO), the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014, and the global prevalence of diabetes among adults over 18 years of age rose from 4.7% in 1980 to 8.5% in 2014. Diabetes is a major cause of blindness, kidney failure, heart attacks, stroke and limb amputation. However, it can be treated, and its consequences avoided or delayed, with diet, physical activity, medication, and regular screening and treatment for complications.

This analysis aims to predict diabetes from diagnostic measurements. The dataset was originally donated to the UCI Machine Learning Repository and organised by Friedrich Leisch. It contains 768 observations and 9 variables.

Read Data

diab <- read.csv("diabetes.csv")

Next, we take a glimpse of the data structure using str().

str(diab)

## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: int  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: int  72 66 64 66 40 74 50 0 70 96 ...
##  $ triceps : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ insulin : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

Variable Description

pregnant : Number of times pregnant
glucose : Plasma glucose concentration (glucose tolerance test)
pressure : Diastolic blood pressure (mm Hg)
triceps : Triceps skin fold thickness (mm)
insulin : 2-hour serum insulin (mu U/ml)
mass : Body mass index (weight in kg/(height in m)^2)
pedigree : Diabetes pedigree function
age : Age (years)
diabetes : Test for diabetes

Then we check whether any observations have missing values using colSums(is.na()).

colSums(is.na(diab))

## pregnant  glucose pressure  triceps  insulin     mass pedigree      age 
##        0        0        0        0        0        0        0        0 
## diabetes 
##        0

There are no NA values in our data frame, so we can proceed to the next step.
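One caveat worth checking: the str() output earlier shows zeros in columns such as pressure, triceps, insulin and mass, which are not plausible measurements and may stand for missing values that is.na() cannot detect. The sketch below is only an illustration under that assumption. It rebuilds a toy data frame from the first ten rows printed by str() (the name `toy` is hypothetical), counts the zeros per column, and shows one way to recode them as NA.

```r
# Toy data frame copied from the first ten values shown in the str() output
toy <- data.frame(
  glucose  = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  pressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  triceps  = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  insulin  = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count zeros per column; for these measurements a zero is suspicious
zero_counts <- colSums(toy == 0)
print(zero_counts)

# Assumption: zeros encode missing values, so recode them as NA
toy_na <- toy
toy_na[toy_na == 0] <- NA
na_counts <- colSums(is.na(toy_na))
print(na_counts)
```

On the full data the same two lines could be applied to the relevant columns of diab; whether to impute or drop those rows is a separate modelling decision that this analysis does not make.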

Basic Exploratory Data Analysis

How are diabetics distributed across age groups?

library(tidyverse) #for data wrangling

diab1 <- diab %>%
  group_by(age, diabetes) %>%
  summarise(total = n()) %>%
  ungroup()


library(ggplot2) # for plots
# one line per diabetes status; show.legend belongs outside aes()
plot_age_diab <- ggplot(data = diab1, aes(x = age, y = total)) +
  geom_line(aes(color = diabetes)) +
  geom_point(aes(color = diabetes), show.legend = FALSE) +
  theme_bw() +
  labs(title = "Diabetics based on Age",
       x = "Age",
       y = "Total")

plot_age_diab

Based on the plot above, many of the diabetics are in the 30-40 age bracket.
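The per-age counts can also be grouped into brackets to make a statement like this easier to check numerically. The following is a minimal sketch, not part of the original analysis: it uses only the first ten ages and labels printed by str(), and the bracket boundaries and the names `toy` and `bracket_table` are assumptions chosen for illustration.

```r
# First ten ages and diabetes labels as shown in the str() output
toy <- data.frame(
  age = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54),
  diabetes = factor(
    c("pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "pos"),
    levels = c("neg", "pos")
  )
)

# Assumed ten-year brackets, closed on the left: [20,30), [30,40), ...
toy$bracket <- cut(toy$age, breaks = c(20, 30, 40, 50, 60), right = FALSE)

# Cross-tabulate bracket against diabetes status
bracket_table <- table(toy$bracket, toy$diabetes)
print(bracket_table)
```

Running the same cut() and table() calls on diab would give the counts per bracket for all 768 observations.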

Data Analysis

Checking Correlation

Here we look at the correlation among the predictor variables.

library(GGally)
# inspect correlation between predictors
GGally::ggcorr(diab[,-9], hjust = 1, layout.exp = 2, label = T, label_size = 2.9)

Based on the plot above, there is no strong correlation among the predictor variables. This is an advantage for models such as Naive Bayes, which assume that the predictors are independent of each other given the class.
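The same check can be done numerically with base R's cor(), which is useful when the plot labels are hard to read. The sketch below is illustrative only: it uses four columns from the first ten rows printed by str(), so the values it produces say nothing about the full dataset, and the names `toy`, `cor_mat` and `max_abs` are hypothetical.

```r
# Four predictors, first ten rows as shown in the str() output
toy <- data.frame(
  glucose  = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  pressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0),
  age      = c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)
)

# Pearson correlation matrix between predictors
cor_mat <- cor(toy)
print(round(cor_mat, 2))

# Largest absolute correlation between two different predictors
off_diag <- cor_mat
diag(off_diag) <- NA
max_abs <- max(abs(off_diag), na.rm = TRUE)
print(max_abs)
```

Applied to diab[, -9], the value of max_abs gives a single number to compare against whatever threshold one considers a strong correlation.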

Splitting Data

In this step we create our train and test sets, with 90% of the data for training and 10% for testing. The split uses random sampling, as follows:

set.seed(100)
in_diab_train <-  sample(nrow(diab), nrow(diab)*0.9)
diab_train <- diab[in_diab_train,]
diab_test <- diab[-in_diab_train,]

dim(diab_train)

## [1] 691   9

dim(diab_test)

## [1] 77  9

# drop the target variable from the test set
toppredict_set <- diab_test[1:8]

dim(toppredict_set)

## [1] 77  8
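The split sizes follow from simple arithmetic: 768 * 0.9 is 691.2, which is not a whole number, and the output above shows that 691 rows ended up in the training set. The sketch below makes that rounding explicit with floor() and checks that the two index sets do not overlap. It works on row indices only, so it needs no data file; the names `n_obs`, `train_idx` and `test_idx` are hypothetical, and it is not guaranteed to reproduce the exact rows selected above.

```r
n_obs <- 768                      # number of observations reported by str()
set.seed(100)

# Explicitly round the training size down: floor(691.2) is 691
n_train <- floor(n_obs * 0.9)
train_idx <- sample(n_obs, n_train)

# Everything not sampled for training goes to the test set
test_idx <- setdiff(seq_len(n_obs), train_idx)

print(c(train = length(train_idx), test = length(test_idx)))
```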

Modelling

In this step, we build our classification model using several algorithms and compare the accuracy of all models. The models that will be built are Naive Bayes, Decision Tree, and Random Forest.

Naive Bayes

# creating Naive Bayes model
library(e1071) # for naive bayes
model_naive <- naiveBayes(diabetes ~., data = diab_train)

# predicting target 
preds_naive <- predict(model_naive, newdata = toppredict_set)

(conf_matrix_naive <- table(preds_naive, diab_test$diabetes))

##            
## preds_naive neg pos
##         neg  40  11
##         pos  12  14

The confusion matrix shows that Naive Bayes correctly predicts 40 negative cases, while 11 cases predicted as negative are actually positive. It also correctly predicts 14 positive cases, while 12 cases predicted as positive are actually negative. What about the accuracy? We can check it with the confusionMatrix function below:

library(caret) # for confusion matrix
confusionMatrix(conf_matrix_naive)

## Confusion Matrix and Statistics
## 
##            
## preds_naive neg pos
##         neg  40  11
##         pos  12  14
##                                           
##                Accuracy : 0.7013          
##                  95% CI : (0.5862, 0.8003)
##     No Information Rate : 0.6753          
##     P-Value [Acc > NIR] : 0.3623          
##                                           
##                   Kappa : 0.3258          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7692          
##             Specificity : 0.5600          
##          Pos Pred Value : 0.7843          
##          Neg Pred Value : 0.5385          
##              Prevalence : 0.6753          
##          Detection Rate : 0.5195          
##    Detection Prevalence : 0.6623          
##       Balanced Accuracy : 0.6646          
##                                           
##        'Positive' Class : neg             
##

The Naive Bayes output shows an accuracy of only about 70%.
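The statistics in this output can be reproduced by hand from the four cell counts, which helps when reading them. The sketch below re-enters the Naive Bayes confusion matrix as a plain matrix (the name `cm` is hypothetical) and follows the output's convention that neg is the 'Positive' class, so sensitivity here is the share of actual negatives predicted as negative.

```r
# Rows are predictions, columns are actual classes, as in the table above
cm <- matrix(c(40, 11,
               12, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("neg", "pos"),
                             actual    = c("neg", "pos")))

accuracy    <- sum(diag(cm)) / sum(cm)               # (40 + 14) / 77
sensitivity <- cm["neg", "neg"] / sum(cm[, "neg"])   # 40 / 52
specificity <- cm["pos", "pos"] / sum(cm[, "pos"])   # 14 / 25
pos_pred    <- cm["neg", "neg"] / sum(cm["neg", ])   # 40 / 51
neg_pred    <- cm["pos", "pos"] / sum(cm["pos", ])   # 14 / 26
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, pos_pred = pos_pred,
              neg_pred = neg_pred, balanced = balanced), 4))
```

Read this way, the specificity of 0.56 is the share of actual diabetic cases that the model identifies.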

Decision Tree

The second model is a decision tree.
Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes. It is one of the most commonly used tools in machine learning. We will use the partykit library, whose ctree() function builds the tree by recursive partitioning. This exploratory method identifies the most important variables related to diabetes in a hierarchical format.

library(partykit) #for decision tree

model_dt <- ctree(diabetes~., diab_train)

Plot model_dt:

plot(model_dt)

plot(model_dt, type = "simple")

From the figure above, we can see the nodes and their distribution:
- [1] is the root node.
- [2], [3], [4] and [9] are internal nodes, or branches. Internal nodes have arrows pointing both to and from them.
- [5], [6], [7], [8], [10] and [11] are leaf nodes, or leaves. Leaves only have arrows pointing to them.

The printed model below shows 6 leaves and 5 inner nodes. The first split is on glucose. For glucose <= 127 the tree then splits on age, then mass, then pregnant; for glucose > 127 it splits only on mass.

model_dt

## 
## Model formula:
## diabetes ~ pregnant + glucose + pressure + triceps + insulin + 
##     mass + pedigree + age
## 
## Fitted party:
## [1] root
## |   [2] glucose <= 127
## |   |   [3] age <= 28
## |   |   |   [4] mass <= 30.8
## |   |   |   |   [5] pregnant <= 5: neg (n = 130, err = 0.8%)
## |   |   |   |   [6] pregnant > 5: neg (n = 7, err = 14.3%)
## |   |   |   [7] mass > 30.8: neg (n = 107, err = 16.8%)
## |   |   [8] age > 28: neg (n = 190, err = 32.6%)
## |   [9] glucose > 127
## |   |   [10] mass <= 29.9: neg (n = 69, err = 31.9%)
## |   |   [11] mass > 29.9: pos (n = 188, err = 26.1%)
## 
## Number of inner nodes:    5
## Number of terminal nodes: 6
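To make the printed structure concrete, the fitted tree can be transcribed by hand into nested if/else rules. The function below is such a transcription, written only for illustration: `classify_tree` is a hypothetical name, the thresholds are copied from the output above, and it is not how partykit stores or evaluates the model.

```r
# Hand-written copy of the printed tree; returns the leaf id and its class
classify_tree <- function(glucose, age, mass, pregnant) {
  if (glucose <= 127) {
    if (age <= 28) {
      if (mass <= 30.8) {
        if (pregnant <= 5) {
          list(node = 5, class = "neg")
        } else {
          list(node = 6, class = "neg")
        }
      } else {
        list(node = 7, class = "neg")
      }
    } else {
      list(node = 8, class = "neg")
    }
  } else {
    if (mass <= 29.9) {
      list(node = 10, class = "neg")
    } else {
      list(node = 11, class = "pos")
    }
  }
}

# First two rows from the str() output
first_row  <- classify_tree(glucose = 148, age = 50, mass = 33.6, pregnant = 6)
second_row <- classify_tree(glucose = 85,  age = 31, mass = 26.6, pregnant = 1)
print(first_row)
print(second_row)
```

Written out like this, it becomes clear that only leaf [11] predicts pos: the tree labels a case as diabetic only when glucose is above 127 and mass is above 29.9.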

We can apply the model above to our test data.

predict(model_dt, head(diab_test[,-9]))

##   8  20  22  29  33  43 
## neg neg neg neg neg neg 
## Levels: neg pos

Make predictions on the test data using model_dt.

pred_dt <- predict(model_dt, diab_test[,-9])

Build the confusion matrix:

(conf_matrix_dtree <- table(pred_dt, diab_test$diabetes))

##        
## pred_dt neg pos
##     neg  44  14
##     pos   8  11

The confusion matrix shows that the decision tree correctly predicts 44 negative cases, while 14 cases predicted as negative are actually positive. It also correctly predicts 11 positive cases, while 8 cases predicted as positive are actually negative. Compared with Naive Bayes, the number of diabetic cases wrongly predicted as negative rises from 11 to 14. We first look at the predicted class probabilities for the first few test rows, then compute the accuracy with confusionMatrix:

predict(model_dt, head(diab_test[,-9]), type="prob")

##          neg         pos
## 8  0.6736842 0.326315789
## 20 0.6736842 0.326315789
## 22 0.6736842 0.326315789
## 29 0.6811594 0.318840580
## 33 0.9923077 0.007692308
## 43 0.6736842 0.326315789
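These probabilities appear to be the class proportions of the leaf each row falls into. As a worked check, the sketch below recomputes them from the leaf sizes in the printed tree. The positive counts (1, 62 and 22) are not printed anywhere; they are inferred here from the n and err values of leaves [5], [8] and [10], so treat them as an assumption.

```r
# Leaf sizes from the printed tree; n_pos inferred from err% * n (assumption)
leaves <- data.frame(
  node  = c(5, 8, 10),
  n     = c(130, 190, 69),
  n_pos = c(1, 62, 22)
)

# Probability of "pos" in a leaf is the share of positive training cases in it
leaves$prob_pos <- leaves$n_pos / leaves$n
leaves$prob_neg <- 1 - leaves$prob_pos
print(leaves)
```

Under that assumption, rows 8, 20, 22 and 43 land in leaf [8], row 29 in leaf [10], and row 33 in leaf [5].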

caret::confusionMatrix(pred_dt, diab_test[,9])

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction neg pos
##        neg  44  14
##        pos   8  11
##                                        
##                Accuracy : 0.7143       
##                  95% CI : (0.6, 0.8115)
##     No Information Rate : 0.6753       
##     P-Value [Acc > NIR] : 0.2746       
##                                        
##                   Kappa : 0.3052       
##                                        
##  Mcnemar's Test P-Value : 0.2864       
##                                        
##             Sensitivity : 0.8462       
##             Specificity : 0.4400       
##          Pos Pred Value : 0.7586       
##          Neg Pred Value : 0.5789       
##              Prevalence : 0.6753       
##          Detection Rate : 0.5714       
##    Detection Prevalence : 0.7532       
##       Balanced Accuracy : 0.6431       
##                                        
##        'Positive' Class : neg          
##

The decision tree output shows an accuracy of about 71%. So far, the decision tree is slightly better than Naive Bayes at predicting diabetes cases.

Random Forest

Random forest is another machine learning method that is often used for classification. It operates by constructing multiple decision trees and building its prediction from a summary of these trees, such as a majority vote.

library(randomForest) # for random forest
set.seed(101)
n0_var <- nearZeroVar(diab[1:7]) # indices of near-zero-variance predictors among columns 1 to 7

# drop those columns only if any were found; diab[, -integer(0)] would select no columns at all
db <- if (length(n0_var) > 0) diab[, -n0_var] else diab

library(animation)
ani.options(interval = 1, nmax = 15)

cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")

Now the model will be built using 5-fold cross-validation with 3 repeats, as follows:

set.seed(101)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # repeatedcv = repeated k-fold CV; number = number of folds (k); repeats = how many times the k-fold split is repeated

model_forest <- train(diabetes~., data=diab_train, method="rf", trControl=ctrl)

saveRDS(model_forest, file = "model_forest.rds")

model_forest <- readRDS("model_forest.rds")
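To show what 5-fold cross-validation with 3 repeats means in terms of rows, the sketch below builds the 15 train/holdout index sets by hand in base R. It is a simplified illustration, not caret's implementation: the fold assignment here is purely random, whereas caret may additionally balance the classes within folds, and the names `fold_ids` and `resamples` are hypothetical.

```r
set.seed(101)
n_obs     <- 691   # rows in the training set
k         <- 5     # folds per repeat
n_repeats <- 3     # number of independent k-fold splits

# For each repeat, shuffle a balanced vector of fold labels 1..k
fold_ids <- lapply(seq_len(n_repeats), function(r) {
  sample(rep(seq_len(k), length.out = n_obs))
})

# One resample per (fold, repeat): hold out that fold, train on the rest
resamples <- list()
for (r in seq_len(n_repeats)) {
  for (f in seq_len(k)) {
    key <- sprintf("Fold%d.Rep%d", f, r)
    resamples[[key]] <- list(
      train   = which(fold_ids[[r]] != f),
      holdout = which(fold_ids[[r]] == f)
    )
  }
}

print(length(resamples))
print(sapply(resamples, function(x) length(x$holdout)))
```

Each candidate mtry value is then scored on all 15 holdout sets and the scores are averaged, which is why the repeated scheme gives a steadier estimate than a single 5-fold split.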

Based on the result, the best mtry is 2: the accuracy with mtry = 2 is the highest of all the mtry values tried. Here mtry is the number of variables randomly sampled as candidates at each split. Plotting the model gives the following result:

plot(model_forest)

sum(predict(model_forest, diab_test[,-9])==diab_test[,9])

## [1] 59

varImp(model_forest)

## rf variable importance
## 
##          Overall
## glucose  100.000
## mass      50.154
## age       37.064
## pedigree  28.573
## pressure  10.409
## pregnant   9.540
## insulin    2.749
## triceps    0.000
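The Overall column is on a 0 to 100 scale: the least important variable (triceps) is 0 and the most important (glucose) is 100. That pattern is consistent with a min-max rescaling of the raw importance scores. The sketch below illustrates the rescaling with made-up raw values; the numbers in `raw_importance` are hypothetical and are not the forest's actual scores.

```r
# Hypothetical raw importance scores, for illustration only
raw_importance <- c(glucose = 60, mass = 35, age = 28, triceps = 10)

# Min-max rescaling to the 0-100 range
scaled <- 100 * (raw_importance - min(raw_importance)) /
  (max(raw_importance) - min(raw_importance))

print(scaled)
```

Because of this rescaling, a score of 0 for triceps does not by itself mean the variable was never used, only that it had the lowest raw score.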

Based on the result above, glucose has the highest impact on the result, while the other variables score about 50% of it or less. Next, we can see the OOB error for each class using plot():

plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),
       col = 1:3, cex = 0.8, fill = 1:3) # err.rate has three columns: OOB, neg, pos

The visualization above compares the OOB error with the error for each class of the target variable. It suggests that the error settles down from around 90 trees onwards, although we could still use more than 400 trees to try to reduce the OOB error.
Next, we can see the final model as follows:

model_forest$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 23.88%
## Confusion matrix:
##     neg pos class.error
## neg 383  65   0.1450893
## pos 100 143   0.4115226

With mtry = 2, the out-of-bag confusion matrix shows 383 negative cases classified correctly and 65 misclassified, and 143 positive cases classified correctly and 100 misclassified.
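The class.error column and the OOB error rate can both be recomputed from the four counts in that matrix. The sketch below does so, assuming that rows are the actual classes and columns the predicted ones, which is what the printed class.error values imply (65 / 448 matches the neg row). The name `oob` is hypothetical.

```r
# OOB confusion matrix from the final model output; rows assumed to be actual classes
oob <- matrix(c(383,  65,
                100, 143),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual    = c("neg", "pos"),
                              predicted = c("neg", "pos")))

# Per-class error: share of each actual class that was misclassified
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all misclassified cases over all cases
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 7))
print(round(100 * oob_error, 2))
```

The gap between the two class errors shows that the forest is considerably weaker on the pos class than the overall error rate suggests.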

predict_forest <- predict(model_forest, toppredict_set)

(conf_matrix_forest1 <- table(predict_forest, diab_test$diabetes))

##               
## predict_forest neg pos
##            neg  45  11
##            pos   7  14

The confusion matrix shows that random forest correctly predicts 45 negative cases, while 11 cases predicted as negative are actually positive. It also correctly predicts 14 positive cases, while 7 cases predicted as positive are actually negative. Both kinds of error are lower than for the decision tree. To see the accuracy, we again use confusionMatrix, this time on the random forest classification:

confusionMatrix(conf_matrix_forest1)

## Confusion Matrix and Statistics
## 
##               
## predict_forest neg pos
##            neg  45  11
##            pos   7  14
##                                           
##                Accuracy : 0.7662          
##                  95% CI : (0.6559, 0.8552)
##     No Information Rate : 0.6753          
##     P-Value [Acc > NIR] : 0.0539          
##                                           
##                   Kappa : 0.4438          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.8654          
##             Specificity : 0.5600          
##          Pos Pred Value : 0.8036          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6753          
##          Detection Rate : 0.5844          
##    Detection Prevalence : 0.7273          
##       Balanced Accuracy : 0.7127          
##                                           
##        'Positive' Class : neg             
##

With random forest, the accuracy in predicting diabetes increases to about 76.6%, which is better than the accuracy of the decision tree or Naive Bayes.

Conclusion

To compare the accuracy and sensitivity of our models and find the highest values, we can summarise them as follows:

confusionMatrix(conf_matrix_naive)

## Confusion Matrix and Statistics
## 
##            
## preds_naive neg pos
##         neg  40  11
##         pos  12  14
##                                           
##                Accuracy : 0.7013          
##                  95% CI : (0.5862, 0.8003)
##     No Information Rate : 0.6753          
##     P-Value [Acc > NIR] : 0.3623          
##                                           
##                   Kappa : 0.3258          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.7692          
##             Specificity : 0.5600          
##          Pos Pred Value : 0.7843          
##          Neg Pred Value : 0.5385          
##              Prevalence : 0.6753          
##          Detection Rate : 0.5195          
##    Detection Prevalence : 0.6623          
##       Balanced Accuracy : 0.6646          
##                                           
##        'Positive' Class : neg             
##

confusionMatrix(conf_matrix_dtree)

## Confusion Matrix and Statistics
## 
##        
## pred_dt neg pos
##     neg  44  14
##     pos   8  11
##                                        
##                Accuracy : 0.7143       
##                  95% CI : (0.6, 0.8115)
##     No Information Rate : 0.6753       
##     P-Value [Acc > NIR] : 0.2746       
##                                        
##                   Kappa : 0.3052       
##                                        
##  Mcnemar's Test P-Value : 0.2864       
##                                        
##             Sensitivity : 0.8462       
##             Specificity : 0.4400       
##          Pos Pred Value : 0.7586       
##          Neg Pred Value : 0.5789       
##              Prevalence : 0.6753       
##          Detection Rate : 0.5714       
##    Detection Prevalence : 0.7532       
##       Balanced Accuracy : 0.6431       
##                                        
##        'Positive' Class : neg          
##

confusionMatrix(conf_matrix_forest1)

## Confusion Matrix and Statistics
## 
##               
## predict_forest neg pos
##            neg  45  11
##            pos   7  14
##                                           
##                Accuracy : 0.7662          
##                  95% CI : (0.6559, 0.8552)
##     No Information Rate : 0.6753          
##     P-Value [Acc > NIR] : 0.0539          
##                                           
##                   Kappa : 0.4438          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 0.8654          
##             Specificity : 0.5600          
##          Pos Pred Value : 0.8036          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.6753          
##          Detection Rate : 0.5844          
##    Detection Prevalence : 0.7273          
##       Balanced Accuracy : 0.7127          
##                                           
##        'Positive' Class : neg             
##

Glucose is the variable with the most impact on diabetes, followed by mass, age, pedigree and pressure.
The performance of the models, especially Naive Bayes and the decision tree, can be improved by searching for a suitable cutoff value. Recall is lowered by false negatives rather than false positives, so if the aim is to avoid missing diabetic cases we need a cutoff that gives a good recall for the pos class. Note that the sensitivity reported by confusionMatrix above treats neg as the positive class.
Ranked by accuracy and recall, random forest is the best classification model, with an accuracy of 76.6% and a recall of 86.5%, compared with Naive Bayes (accuracy = 70.1%, recall = 76.9%) and the decision tree (accuracy = 71.4%, recall = 84.6%). However, we can still improve our models by tuning the cross-validation settings, such as the number of folds, especially for the random forest model.
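The idea of adjusting the cutoff can be sketched as follows. Instead of predicting pos whenever its probability is at least 0.5, we lower or raise that threshold and watch how recall and false positives move. Everything in this example is hypothetical: the probabilities and labels in `prob_pos` and `actual` are invented for illustration, and `evaluate_cutoff` is not a function from any package.

```r
# Invented probabilities of "pos" and true labels, for illustration only
prob_pos <- c(0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.30, 0.45)
actual   <- factor(c("neg", "neg", "pos", "neg", "pos", "pos", "pos", "neg"),
                   levels = c("neg", "pos"))

# Classify with a given cutoff and report recall for "pos" and false positives
evaluate_cutoff <- function(cutoff) {
  pred <- factor(ifelse(prob_pos >= cutoff, "pos", "neg"),
                 levels = c("neg", "pos"))
  tab <- table(pred, actual)
  c(cutoff    = cutoff,
    recall    = tab["pos", "pos"] / sum(tab[, "pos"]),
    false_pos = tab["pos", "neg"])
}

default_cut <- evaluate_cutoff(0.5)
lower_cut   <- evaluate_cutoff(0.3)
print(rbind(default_cut, lower_cut))
```

In this toy case, lowering the cutoff from 0.5 to 0.3 raises recall for pos from 0.5 to 1 at the cost of two more false positives. With the real models, the probabilities would come from predict() with type = "prob" for ctree or type = "raw" for naiveBayes.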

Predicting Diabetes From Diagnostic Measurement

Meinari Claudia

4/18/2020