According to the World Health Organization (WHO), the number of people with diabetes rose from 108 million in 1980 to 422 million in 2014, and the global prevalence of diabetes among adults over 18 years of age rose from 4.7% to 8.5% over the same period. Diabetes is a major cause of blindness, kidney failure, heart attacks, stroke and limb amputation. However, diabetes can be treated and its consequences avoided or delayed with diet, physical activity, medication and regular screening and treatment for complications.
This analysis aims to predict diabetes from diagnostic measurements. The dataset was originally donated to the UCI Machine Learning Repository and organised by Friedrich Leisch. It contains 768 observations and 9 variables.
We first take a glimpse of the data structure using str().
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: int 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: int 72 66 64 66 40 74 50 0 70 96 ...
## $ triceps : int 35 29 0 23 35 0 32 0 45 0 ...
## $ insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
Then we check whether any observation has missing values using colSums(is.na()).
## pregnant glucose pressure triceps insulin mass pedigree age
## 0 0 0 0 0 0 0 0
## diabetes
## 0
There are no missing values in our data frame, so we can proceed to the next step.
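The check above only counts NA values. As a minimal sketch, the first ten rows shown in the str() output can be rebuilt by hand to also count zeros. Reading a zero in columns such as pressure or mass as a missing measurement is an assumption made here, not something the source states, and the name `diab_head` is hypothetical.

```r
# First ten rows of six columns, copied from the str() output above
diab_head <- data.frame(
  pregnant = c(6, 1, 8, 1, 0, 5, 3, 10, 2, 8),
  glucose  = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  pressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  triceps  = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  insulin  = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  mass     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Same NA check as in the text
na_counts <- colSums(is.na(diab_head))

# Zeros per column; pregnant is left out because 0 is a valid count there
measure_cols <- c("glucose", "pressure", "triceps", "insulin", "mass")
zero_counts <- colSums(diab_head[, measure_cols] == 0)

print(na_counts)
print(zero_counts)
```

If the zeros are treated as missing, they could be recoded to NA before modelling; the analysis here keeps them as they are.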
library(tidyverse) #for data wrangling
diab1 <- diab %>%
group_by(age, diabetes) %>%
summarise(total = n()) %>%
ungroup()
library(ggplot2) # for plot
plot_age_diab <- ggplot(data = diab1, aes(x= age, y= total, label = total))+
geom_line(aes(color = diabetes)) +
geom_point(aes(color = diabetes), show.legend = FALSE) + # show.legend is a layer argument, not an aesthetic
theme_bw()+
labs(title = "Diabetics based on Age",
x = "Age",
y = "Total")
plot_age_diab
Based on the plot above, many of the diabetic patients are in the 30-40 age bracket.
Next, we look at the correlation among the predictor variables.
library(GGally)
# inspect correlation between predictors
GGally::ggcorr(diab[,-9], hjust = 1, layout.exp = 2, label = T, label_size = 2.9)
Based on the plot above, there is no strong correlation among the predictor variables. This is an advantage for a model such as Naive Bayes, which assumes that the predictors are independent of each other given the class.
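The same question can be answered numerically by finding the largest absolute off-diagonal entry of the correlation matrix. The sketch below uses base R only; the helper `max_offdiag_cor` and the toy data frame are hypothetical and are not the diabetes data.

```r
# Return the most strongly correlated pair of columns in a numeric data frame
max_offdiag_cor <- function(df) {
  cm <- cor(df)          # Pearson correlation matrix
  diag(cm) <- 0          # ignore each variable's correlation with itself
  idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)[1, ]
  list(
    pair  = c(rownames(cm)[idx[1]], colnames(cm)[idx[2]]),
    value = cm[idx[1], idx[2]]
  )
}

# Toy data (hypothetical): b follows a closely, c only loosely
toy <- data.frame(
  a = 1:10,
  b = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
  c = c(5, 3, 8, 1, 9, 2, 7, 4, 10, 6)
)

strongest <- max_offdiag_cor(toy)
print(strongest)
```

Applied to the predictor columns, for example `max_offdiag_cor(diab[, -9])`, this would report the single pair that deserves a closer look.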
In this step we create our train and test sets, with 90% of the data for training and 10% for testing. The split uses random sampling, as follows:
set.seed(100)
in_diab_train <- sample(nrow(diab), nrow(diab)*0.9)
diab_train <- diab[in_diab_train,]
diab_test <- diab[-in_diab_train,]
## [1] 691 9
## [1] 77 9
## [1] 77 8
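The dimensions printed above follow from the 90/10 arithmetic. As a minimal sketch that only works on row indices (so it does not need the data itself), with the hypothetical names `n_obs`, `train_idx` and `test_idx`:

```r
n_obs <- 768                                   # observations reported by str()

set.seed(100)                                  # same seed as in the split above
train_idx <- sample(n_obs, floor(n_obs * 0.9)) # 90% of the row indices, without replacement
test_idx  <- setdiff(seq_len(n_obs), train_idx) # the remaining rows form the test set

length(train_idx)
length(test_idx)
```

Because the two index sets are disjoint, no observation is used for both training and testing.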
In this step, we build our classification model using several algorithms and compare the accuracy of all models. The models to be built are Naive Bayes, Decision Tree, and Random Forest.
# creating Naive Bayes model
library(e1071) # for naive bayes
model_naive <- naiveBayes(diabetes ~., data = diab_train)
# predicting target
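# toppredict_set is not defined in the code shown; it is assumed here to be the test set without the target column (77 x 8, as in the dimensions printed above)
preds_naive <- predict(model_naive, newdata = toppredict_set)
(conf_matrix_naive <- table(preds_naive, diab_test$diabetes))
##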
## preds_naive neg pos
## neg 40 11
## pos 12 14
The confusion matrix shows that Naive Bayes correctly predicts 40 negative cases, while 11 positive cases are wrongly predicted as negative. It also correctly predicts 14 positive cases, while 12 negative cases are wrongly predicted as positive. What about the accuracy? We can check it with the confusionMatrix function below:
## Confusion Matrix and Statistics
##
##
## preds_naive neg pos
## neg 40 11
## pos 12 14
##
## Accuracy : 0.7013
## 95% CI : (0.5862, 0.8003)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.3623
##
## Kappa : 0.3258
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7692
## Specificity : 0.5600
## Pos Pred Value : 0.7843
## Neg Pred Value : 0.5385
## Prevalence : 0.6753
## Detection Rate : 0.5195
## Detection Prevalence : 0.6623
## Balanced Accuracy : 0.6646
##
## 'Positive' Class : neg
##
From the Naive Bayes output we can see that the accuracy is only about 70%, which is not significantly better than the no-information rate of 67.5% (p = 0.36).
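The headline figures in the output above can be reproduced by hand from the four cells of the confusion matrix. This is a small base-R sketch, not the caret implementation; the object name `conf_nb` is hypothetical, and neg is treated as the 'positive' class, as in the output.

```r
# Rows are predictions, columns are the actual classes (values from the table above)
conf_nb <- matrix(
  c(40, 12, 11, 14), nrow = 2,
  dimnames = list(predicted = c("neg", "pos"), actual = c("neg", "pos"))
)

accuracy    <- sum(diag(conf_nb)) / sum(conf_nb)             # correct / all cases
sensitivity <- conf_nb["neg", "neg"] / sum(conf_nb[, "neg"]) # actual neg found
specificity <- conf_nb["pos", "pos"] / sum(conf_nb[, "pos"]) # actual pos found

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

Because neg is the reference class, the specificity of 0.56 is the share of actual diabetic cases that the model detects.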
The second model that will be used is a Decision Tree.
Decision tree analysis is a classification method that uses tree-like models of decisions and their possible outcomes. This method is one of the most commonly used tools in machine learning analysis. We will use the rpart library in order to use recursive partitioning methods for decision trees. This exploratory method will identify the most important variables related to diabetes in a hierarchical format.
Plot model_dt:
From the figure above, we can see the number of nodes and their distribution, where:
- [1] is root node
- [2],[3],[4], and [9] are internal nodes or branches. Internal nodes have arrows pointing both to and from them.
- [5],[6],[7],[8],[10],[11] are leaf nodes or leaves. Leaves only have arrows pointing to them.
From the printed model below we can see that there are 6 leaves and 5 inner nodes. The first split is on glucose. For glucose <= 127 the tree then splits on age, body mass, and number of pregnancies; for glucose > 127 it splits on body mass only.
##
## Model formula:
## diabetes ~ pregnant + glucose + pressure + triceps + insulin +
## mass + pedigree + age
##
## Fitted party:
## [1] root
## | [2] glucose <= 127
## | | [3] age <= 28
## | | | [4] mass <= 30.8
## | | | | [5] pregnant <= 5: neg (n = 130, err = 0.8%)
## | | | | [6] pregnant > 5: neg (n = 7, err = 14.3%)
## | | | [7] mass > 30.8: neg (n = 107, err = 16.8%)
## | | [8] age > 28: neg (n = 190, err = 32.6%)
## | [9] glucose > 127
## | | [10] mass <= 29.9: neg (n = 69, err = 31.9%)
## | | [11] mass > 29.9: pos (n = 188, err = 26.1%)
##
## Number of inner nodes: 5
## Number of terminal nodes: 6
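Every leaf under glucose <= 127 is labelled neg, so the class predicted by the printed tree collapses to a single rule. The sketch below hand-codes that rule and applies it to the first ten glucose and mass values from the str() output; `classify_by_tree` is a hypothetical helper and not part of the fitted model object.

```r
# Class label implied by the printed tree: pos only in leaf [11]
classify_by_tree <- function(glucose, mass) {
  ifelse(glucose > 127 & mass > 29.9, "pos", "neg")
}

# First ten observations, copied from the str() output
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)
mass    <- c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)

rule_pred <- classify_by_tree(glucose, mass)
print(rule_pred)
```

The splits on age, mass and pregnant inside the low-glucose branch only change the estimated class probabilities, not the predicted label.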
We can now apply the model above to our test data.
## 8 20 22 29 33 43
## neg neg neg neg neg neg
## Levels: neg pos
Make predictions on the test data using model_dt.
Call the confusion matrix:
##
## pred_dt neg pos
## neg 44 14
## pos 8 11
The confusion matrix shows that the decision tree correctly predicts 44 negative cases, while 14 positive cases are wrongly predicted as negative. It also correctly predicts 11 positive cases, while 8 negative cases are wrongly predicted as positive. Compared with Naive Bayes, more diabetic cases are missed (14 against 11); these count as false positives in the output below because the 'Positive' class there is neg. To see the accuracy, we again use confusionMatrix, this time on the Decision Tree classification:
## neg pos
## 8 0.6736842 0.326315789
## 20 0.6736842 0.326315789
## 22 0.6736842 0.326315789
## 29 0.6811594 0.318840580
## 33 0.9923077 0.007692308
## 43 0.6736842 0.326315789
## Confusion Matrix and Statistics
##
## Reference
## Prediction neg pos
## neg 44 14
## pos 8 11
##
## Accuracy : 0.7143
## 95% CI : (0.6, 0.8115)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.2746
##
## Kappa : 0.3052
##
## Mcnemar's Test P-Value : 0.2864
##
## Sensitivity : 0.8462
## Specificity : 0.4400
## Pos Pred Value : 0.7586
## Neg Pred Value : 0.5789
## Prevalence : 0.6753
## Detection Rate : 0.5714
## Detection Prevalence : 0.7532
## Balanced Accuracy : 0.6431
##
## 'Positive' Class : neg
##
From the Decision Tree output we can see that the accuracy is 71%. So far, the Decision Tree is slightly better at predicting diabetes cases than the Naive Bayes model.
Random forest is another machine learning method that is often used in classification analysis. It operates by constructing multiple decision trees and aggregating their predictions, which for classification means taking a majority vote across the trees.
library(randomForest) # for random forest
set.seed(101)
library(caret) # for nearZeroVar() and train()
n0_var <- nearZeroVar(diab[1:7]) # indices of near-zero-variance columns among columns 1 to 7
db <- if (length(n0_var) > 0) diab[, -n0_var] else diab # drop near-zero-variance columns; keep everything when none are found
library(animation)
ani.options(interval = 1, nmax = 15)
cv.ani(main = "Demonstration of the k-fold Cross Validation", bty = "l")
Now the model is built using 5-fold cross-validation with 3 repeats, as follows:
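Before the caret call, it may help to see what repeated k-fold cross-validation does to the row indices. This is a base-R sketch of the idea only: caret builds its resamples internally and may differ in details such as stratification, and `make_repeated_folds` is a hypothetical helper.

```r
# Build `repeats` independent partitions of 1..n into k folds
make_repeated_folds <- function(n, k = 5, repeats = 3, seed = 101) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    fold_id <- sample(rep(seq_len(k), length.out = n)) # shuffled fold labels
    split(seq_len(n), fold_id)                         # held-out row indices per fold
  })
}

folds <- make_repeated_folds(n = 691)  # 691 = number of training rows

length(folds)               # number of repeats
length(folds[[1]])          # number of folds per repeat
sapply(folds[[1]], length)  # held-out size of each fold in the first repeat
```

Each of the 15 resamples holds out one fold for validation and trains on the other four; the reported accuracy for each candidate mtry is the average over these resamples.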
set.seed(101)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) # method = repeated k-fold CV, number = folds (k), repeats = how many times the whole k-fold procedure is repeated
model_forest <- train(diabetes~., data=diab_train, method="rf", trControl=ctrl)
saveRDS(model_forest, file = "model_forest.rds")
Based on the result, the best mtry is 2: the accuracy with 2 variables is the highest of all the mtry values tried. Here mtry is the number of variables randomly sampled as candidates at each split. If we plot it we get the following result:
## [1] 59
## rf variable importance
##
## Overall
## glucose 100.000
## mass 50.154
## age 37.064
## pedigree 28.573
## pressure 10.409
## pregnant 9.540
## insulin 2.749
## triceps 0.000
Based on the result above, glucose has the highest impact on the result, while every other variable has about 50% of its importance or less. Next, we can see the OOB (out-of-bag) error for every class using plot:
plot(model_forest$finalModel)
legend("topright", colnames(model_forest$finalModel$err.rate),
col = 1:3, cex = 0.8, fill = 1:3) # three curves: OOB, neg, pos
The visualization above compares the OOB error with the error for each class of the target variable. It shows that the model error has improved by around 90 trees, yet we can still use more than 400 trees to reduce the OOB error.
Next, we can see the final model as follows:
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 23.88%
## Confusion matrix:
## neg pos class.error
## neg 383 65 0.1450893
## pos 100 143 0.4115226
With mtry = 2, the out-of-bag confusion matrix shows that the model correctly predicts 383 negative cases and misclassifies 65 of them, and correctly predicts 143 positive cases and misclassifies 100 of them.
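The OOB error rate and the per-class errors in the printout follow directly from these four counts. A small base-R check, with the hypothetical object name `oob_conf`:

```r
# Rows are the actual classes, columns the OOB predictions (values from the printout)
oob_conf <- matrix(
  c(383, 100, 65, 143), nrow = 2,
  dimnames = list(actual = c("neg", "pos"), predicted = c("neg", "pos"))
)

oob_error   <- 1 - sum(diag(oob_conf)) / sum(oob_conf)  # share of all cases misclassified
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)   # misclassified share within each class

round(oob_error, 4)
round(class_error, 7)
```

The positive class is misclassified almost three times as often as the negative class, which the overall error rate alone does not show.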
##
## predict_forest neg pos
## neg 45 11
## pos 7 14
The confusion matrix shows that the random forest correctly predicts 45 negative cases, while 11 positive cases are wrongly predicted as negative. It also correctly predicts 14 positive cases, while 7 negative cases are wrongly predicted as positive. Both kinds of error are lower than for the decision tree (11 against 14, and 7 against 8). To see the accuracy, we again use confusionMatrix, this time on the Random Forest classification:
## Confusion Matrix and Statistics
##
##
## predict_forest neg pos
## neg 45 11
## pos 7 14
##
## Accuracy : 0.7662
## 95% CI : (0.6559, 0.8552)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.0539
##
## Kappa : 0.4438
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.8654
## Specificity : 0.5600
## Pos Pred Value : 0.8036
## Neg Pred Value : 0.6667
## Prevalence : 0.6753
## Detection Rate : 0.5844
## Detection Prevalence : 0.7273
## Balanced Accuracy : 0.7127
##
## 'Positive' Class : neg
##
Using random forest, the accuracy of our model in predicting diabetes increased to about 77%. This is better than the accuracy of the decision tree or Naive Bayes.
If we compare the accuracy and sensitivity of our models to find the highest values, we can summarise them as follows:
## Confusion Matrix and Statistics
##
##
## preds_naive neg pos
## neg 40 11
## pos 12 14
##
## Accuracy : 0.7013
## 95% CI : (0.5862, 0.8003)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.3623
##
## Kappa : 0.3258
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.7692
## Specificity : 0.5600
## Pos Pred Value : 0.7843
## Neg Pred Value : 0.5385
## Prevalence : 0.6753
## Detection Rate : 0.5195
## Detection Prevalence : 0.6623
## Balanced Accuracy : 0.6646
##
## 'Positive' Class : neg
##
## Confusion Matrix and Statistics
##
##
## pred_dt neg pos
## neg 44 14
## pos 8 11
##
## Accuracy : 0.7143
## 95% CI : (0.6, 0.8115)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.2746
##
## Kappa : 0.3052
##
## Mcnemar's Test P-Value : 0.2864
##
## Sensitivity : 0.8462
## Specificity : 0.4400
## Pos Pred Value : 0.7586
## Neg Pred Value : 0.5789
## Prevalence : 0.6753
## Detection Rate : 0.5714
## Detection Prevalence : 0.7532
## Balanced Accuracy : 0.6431
##
## 'Positive' Class : neg
##
## Confusion Matrix and Statistics
##
##
## predict_forest neg pos
## neg 45 11
## pos 7 14
##
## Accuracy : 0.7662
## 95% CI : (0.6559, 0.8552)
## No Information Rate : 0.6753
## P-Value [Acc > NIR] : 0.0539
##
## Kappa : 0.4438
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 0.8654
## Specificity : 0.5600
## Pos Pred Value : 0.8036
## Neg Pred Value : 0.6667
## Prevalence : 0.6753
## Detection Rate : 0.5844
## Detection Prevalence : 0.7273
## Balanced Accuracy : 0.7127
##
## 'Positive' Class : neg
##
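The three outputs above can be condensed into one table. The sketch below recomputes accuracy, sensitivity and specificity from the test-set confusion matrices with base R; the names `cms`, `summarise_cm` and `comparison` are hypothetical, and neg is treated as the 'positive' class, as in the caret output.

```r
# Rows are predictions, columns are actual classes (values from the outputs above)
dn <- list(predicted = c("neg", "pos"), actual = c("neg", "pos"))
cms <- list(
  naive_bayes   = matrix(c(40, 12, 11, 14), nrow = 2, dimnames = dn),
  decision_tree = matrix(c(44,  8, 14, 11), nrow = 2, dimnames = dn),
  random_forest = matrix(c(45,  7, 11, 14), nrow = 2, dimnames = dn)
)

# Metrics for one confusion matrix
summarise_cm <- function(cm) {
  c(
    accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["neg", "neg"] / sum(cm[, "neg"]),
    specificity = cm["pos", "pos"] / sum(cm[, "pos"])
  )
}

comparison <- round(t(sapply(cms, summarise_cm)), 4)  # one row per model
print(comparison)
```

On this test set the random forest has the highest accuracy and sensitivity, and it ties with Naive Bayes on specificity.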