Diabetes mellitus is one of the major noncommunicable diseases and has a great impact on human life today. Many nations now face a rapidly rising prevalence of diabetes among their residents [2].
According to a study by the World Health Organization (WHO), the number of people with diabetes will have risen to 552 million by 2030, meaning that one in 10 adults will have diabetes by 2030 if no serious action is taken. In 2014, the worldwide prevalence of diabetes was estimated at 9% among adults aged 18+ years [1].
In developing nations, most people with diabetes are aged between 35 and 64. WHO has already warned that diabetes will be the 7th leading cause of death in the world in 2030. In 2012, an estimated 1.5 million deaths were directly caused by diabetes. More than 80% of diabetes deaths occur in low- and middle-income countries [2]. Total deaths from diabetes are projected to rise by more than 50% in the next 10 years.
Diabetes is a leading cause of blindness, amputation and kidney failure. Lack of awareness about diabetes, combined with inadequate access to health services and essential medicines, can lead to many complications. It is a global problem with an overwhelming human, social, and economic impact, affecting around 300 million people worldwide.
By applying computational analytics to clinical big data, the massive amount of data generated in healthcare systems can be used to create medical intelligence that drives medical prediction and forecasting. Medical analytics is a new trend in medical science. Developing medical intelligence from the available clinical data can make healthcare systems more patient-centered and reduce medical costs and hospital readmissions.
The data set used in this study is the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases. This diabetes database, donated by Vincent Sigillito, is a collection of medical diagnostic reports of 768 examples from a population living near Phoenix, Arizona, USA. More information about the dataset is available at https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.
Each example has 8 attribute values and one of two possible outcomes: whether the patient tested positive for diabetes (coded as 1) or not (coded as 0).
Attribute Information:
# 1. Number of times pregnant
# 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
This data set is analysed in R using four algorithms to predict diabetes in women, followed by a comparison of model accuracy:
# 1. Logistic Regression;
# 2. Decision Tree;
# 3. Random Forest;
# 4. Support Vector Machine (SVM);
# 5. Comparison of Model Accuracy.
Preparing the DataSet:
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.3
## corrplot 0.84 loaded
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
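# Note: read.csv() defaults to header = TRUE, but this file has no header row,
# so the first record is consumed as a header and 767 of the 768 rows are loaded.
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))
head(pima) # show the first rows of the Pima data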
## Pregnant Plasma_Glucose Dias_BP Triceps_Skin Serum_Insulin BMI DPF
## 1 1 85 66 29 0 26.6 0.351
## 2 8 183 64 0 0 23.3 0.672
## 3 1 89 66 23 94 28.1 0.167
## 4 0 137 40 35 168 43.1 2.288
## 5 5 116 74 0 0 25.6 0.201
## 6 3 78 50 32 88 31.0 0.248
## Age Diabetes
## 1 31 0
## 2 32 1
## 3 21 0
## 4 33 1
## 5 30 0
## 6 26 1
str(pima) # show the structure of the data
## 'data.frame': 767 obs. of 9 variables:
## $ Pregnant : int 1 8 1 0 5 3 10 2 8 4 ...
## $ Plasma_Glucose: int 85 183 89 137 116 78 115 197 125 110 ...
## $ Dias_BP : int 66 64 66 40 74 50 0 70 96 92 ...
## $ Triceps_Skin : int 29 0 23 35 0 32 0 45 0 0 ...
## $ Serum_Insulin : int 0 0 94 168 0 88 0 543 0 0 ...
## $ BMI : num 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 37.6 ...
## $ DPF : num 0.351 0.672 0.167 2.288 0.201 ...
## $ Age : int 31 32 21 33 30 26 29 53 54 30 ...
## $ Diabetes : int 0 1 0 1 0 1 0 1 1 0 ...
We use “sapply” to count the missing (NA) values in each column.
sapply(pima, function(x) sum(is.na(x)))
## Pregnant Plasma_Glucose Dias_BP Triceps_Skin Serum_Insulin
## 0 0 0 0 0
## BMI DPF Age Diabetes
## 0 0 0 0
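The results show that there are no NA values in the data. However, several columns contain zeros that are not physiologically plausible (for example Dias_BP, Triceps_Skin, Serum_Insulin and BMI in the output above), so missing measurements appear to be coded as 0 rather than NA.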
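Counting NA values alone can miss measurements that were recorded as zero. The following is a minimal sketch, using a small hypothetical data frame `toy` built from the six rows printed by `head(pima)` above (not the full data set). Treating zeros in these columns as missing is an assumption about how the data were coded.

```r
# Hypothetical toy data: the six rows shown by head(pima) above
toy <- data.frame(
  Plasma_Glucose = c(85, 183, 89, 137, 116, 78),
  Dias_BP        = c(66, 64, 66, 40, 74, 50),
  Triceps_Skin   = c(29, 0, 23, 35, 0, 32),
  Serum_Insulin  = c(0, 0, 94, 168, 0, 88),
  BMI            = c(26.6, 23.3, 28.1, 43.1, 25.6, 31.0)
)

# Count zeros per column; a zero is not a plausible measurement here
zero_counts <- sapply(toy, function(x) sum(x == 0))
print(zero_counts)

# Assumption: recode zeros as NA so that is.na() can detect them
toy_na <- toy
toy_na[toy_na == 0] <- NA
na_counts <- sapply(toy_na, function(x) sum(is.na(x)))
print(na_counts)
```

The same `sapply()` call could be applied to the relevant columns of `pima` to see how many zeros each one contains before deciding how to handle them.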
Let’s produce the matrix of scatterplots
pairs(pima, panel = panel.smooth)
We compute the matrix of correlations between the predictor variables.
corrplot(cor(pima[, -9]), type = "lower", method = "number")
We use a training data set containing a random sample of 70% of the observations to fit a logistic regression with “Diabetes” as the response and the remaining variables as predictors.
# Preparing the DataSet
set.seed(123)
n <- nrow(pima)
train <- sample(n, trunc(0.70*n))
pima_training <- pima[train, ]
pima_testing <- pima[-train, ]
# Training The Model
glm_fm1 <- glm(Diabetes ~., data = pima_training, family = binomial)
summary(glm_fm1)
##
## Call:
## glm(formula = Diabetes ~ ., family = binomial, data = pima_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9063 -0.6925 -0.3641 0.6653 3.0684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.6463340 0.9435703 -10.223 < 2e-16 ***
## Pregnant 0.0886598 0.0404862 2.190 0.0285 *
## Plasma_Glucose 0.0370229 0.0047419 7.808 5.82e-15 ***
## Dias_BP -0.0148023 0.0062585 -2.365 0.0180 *
## Triceps_Skin -0.0123352 0.0089285 -1.382 0.1671
## Serum_Insulin 0.0004372 0.0012196 0.359 0.7200
## BMI 0.1191741 0.0199373 5.977 2.27e-09 ***
## DPF 1.5580135 0.3738715 4.167 3.08e-05 ***
## Age 0.0177930 0.0120178 1.481 0.1387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 682.80 on 535 degrees of freedom
## Residual deviance: 472.69 on 527 degrees of freedom
## AIC: 490.69
##
## Number of Fisher Scoring iterations: 5
The result shows that the variables Triceps_Skin, Serum_Insulin and Age are not statistically significant: their p-values are greater than 0.05. Therefore they will be removed.
Update to use only the significant variables
glm_fm2 <- update(glm_fm1, ~. - Triceps_Skin - Serum_Insulin - Age )
summary(glm_fm2)
##
## Call:
## glm(formula = Diabetes ~ Pregnant + Plasma_Glucose + Dias_BP +
## BMI + DPF, family = binomial, data = pima_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7812 -0.7095 -0.3644 0.6362 3.1411
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.208436 0.898444 -10.249 < 2e-16 ***
## Pregnant 0.122465 0.035005 3.498 0.000468 ***
## Plasma_Glucose 0.039263 0.004423 8.876 < 2e-16 ***
## Dias_BP -0.014911 0.005968 -2.499 0.012471 *
## BMI 0.105786 0.018141 5.831 5.51e-09 ***
## DPF 1.519865 0.364109 4.174 2.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 682.80 on 535 degrees of freedom
## Residual deviance: 477.45 on 530 degrees of freedom
## AIC: 489.45
##
## Number of Fisher Scoring iterations: 5
Now all remaining variables are statistically significant at the 0.05 level, and the AIC has dropped slightly from 490.69 to 489.45.
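To make the fitted coefficients easier to read, they can be converted to odds ratios and used to compute a predicted probability by hand. This is a minimal sketch: the coefficients are copied from the `summary(glm_fm2)` output above, while the patient values in `x` are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(glm_fm2) output above
beta <- c(
  "(Intercept)"  = -9.208436,
  Pregnant       =  0.122465,
  Plasma_Glucose =  0.039263,
  Dias_BP        = -0.014911,
  BMI            =  0.105786,
  DPF            =  1.519865
)

# Odds ratio for a one-unit increase in each predictor
odds_ratios <- exp(beta[-1])
print(round(odds_ratios, 3))

# Odds ratio for a 10-unit increase in plasma glucose
or_glucose_10 <- exp(10 * beta[["Plasma_Glucose"]])
print(or_glucose_10)

# Hypothetical patient (illustration only); the leading 1 multiplies the intercept
x <- c(1, Pregnant = 2, Plasma_Glucose = 140, Dias_BP = 70, BMI = 32, DPF = 0.5)

# Linear predictor and logistic transform: p = 1 / (1 + exp(-eta))
eta <- sum(beta * x)
p_hat <- plogis(eta)
print(p_hat)
```

Under this model, a 10-unit increase in plasma glucose multiplies the odds of a positive outcome by roughly 1.48, holding the other predictors fixed.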
Plot the new model
par(mfrow = c(2,2))
plot(glm_fm2)
Residuals vs Fitted Plot: This plot shows the residuals against the fitted values. The dotted line at y = 0 marks a perfect fit: any point on it has zero residual, points above it have positive residuals, and points below it have negative residuals. The red line is a smoothed (lowess) curve that gives an idea of the pattern in the residuals. In our case the residuals follow a curved, logarithmic-looking pattern. For a binary response this is expected, because each residual lies on one of two curves (one per outcome class), so the pattern alone does not indicate a poor fit.
Normal Q-Q Plot: The Normal Q-Q plot is used to check whether the residuals follow a normal distribution; they do if the points follow the dotted line closely. In this case the residual points follow the dotted line closely except for observation #349, so the residuals look approximately normal. Normality of residuals is an assumption of linear regression rather than of logistic regression, so this plot serves only as a rough check here.
Scale-Location Plot: The scale-location plot shows the spread of the residuals across the range of predicted values. One of the assumptions of linear regression is homoscedasticity, i.e. the variance should be reasonably equal across the range of fitted values. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range; as the residuals spread wider, the red line goes up. In our case the spread looks roughly uniform.
Residuals vs Leverage Plot: Before reading this plot we need to define influence and leverage. Influence: the influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook’s distance is a good measure of the influence of an observation. Leverage: the leverage of an observation is based on how much the observation’s values on the predictor variables differ from the means of those variables. The more leverage an observation has, the greater its potential influence.
In this plot the dotted red lines are contours of Cook’s distance, and the areas of interest are those outside the dotted lines in the top right or bottom right corner. If a point falls in one of those regions, the observation is influential: the fitted model would change noticeably if we excluded that point. Not every outlier has high leverage, and not every high-leverage point is an outlier.
In this case no point falls in those regions, so no single observation has undue influence on the logistic regression fit.
Apply the model to the testing sample
# Testing the Model
glm_probs <- predict(glm_fm2, newdata = pima_testing, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
#print("Confusion Matrix for logistic regression"); table(Predicted = glm_pred, Actual = pima_testing$Diabetes)
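# Recent versions of caret require factor inputs with the same levels
confusionMatrix(factor(glm_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1))) # Confusion Matrix for logistic regression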
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 121 36
## 1 22 52
##
## Accuracy : 0.7489
## 95% CI : (0.6878, 0.8035)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 1.996e-05
##
## Kappa : 0.4509
## Mcnemar's Test P-Value : 0.08783
##
## Sensitivity : 0.8462
## Specificity : 0.5909
## Pos Pred Value : 0.7707
## Neg Pred Value : 0.7027
## Prevalence : 0.6190
## Detection Rate : 0.5238
## Detection Prevalence : 0.6797
## Balanced Accuracy : 0.7185
##
## 'Positive' Class : 0
##
acc_glm_fit <- confusionMatrix(factor(glm_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))$overall['Accuracy']
The test error rate is 25.11%. In other words, the accuracy is 74.89%.
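The statistics in the confusion matrix output can be recomputed by hand from the four counts. This sketch copies the counts from the output above. Because caret reported class 0 as the 'Positive' class, the printed sensitivity refers to non-diabetic patients; the variable `sens_class1` below (a name introduced here for illustration) shows the corresponding rate when class 1 (diabetes) is treated as positive.

```r
# Counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class)
cm <- matrix(c(121, 22, 36, 52), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity with class 0 as positive: actual 0s predicted as 0
sens_class0 <- cm["0", "0"] / sum(cm[, "0"])

# Sensitivity with class 1 (diabetes) as positive: actual 1s predicted as 1
sens_class1 <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy,
              sens_class0 = sens_class0,
              sens_class1 = sens_class1), 4))
```

With class 1 as the positive class, the model detects about 59% of the diabetic patients in the test set, which is the value caret labels as specificity above.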
Now we present the Decision Trees algorithm
# Preparing the DataSet:
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))
pima$Diabetes <- as.factor(pima$Diabetes)
library(caret)
library(tree)
## Warning: package 'tree' was built under R version 3.4.3
library(e1071)
## Warning: package 'e1071' was built under R version 3.4.3
set.seed(1000)
intrain <- createDataPartition(y = pima$Diabetes, p = 0.7, list = FALSE)
train <- pima[intrain, ]
test <- pima[-intrain, ]
# Training The Model
treemod <- tree(Diabetes ~ ., data = train)
summary(treemod)
##
## Classification tree:
## tree(formula = Diabetes ~ ., data = train)
## Variables actually used in tree construction:
## [1] "Plasma_Glucose" "BMI" "Age" "DPF"
## [5] "Dias_BP"
## Number of terminal nodes: 14
## Residual mean deviance: 0.7801 = 408 / 523
## Misclassification error rate: 0.1695 = 91 / 537
The results show that five variables are used as internal nodes in the tree, that the tree has 14 terminal nodes, and that the training error rate is 16.95%.
Get a detailed text output.
treemod # get a detailed text output.
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 537 694.200 0 ( 0.65177 0.34823 )
## 2) Plasma_Glucose < 123.5 309 280.100 0 ( 0.83172 0.16828 )
## 4) BMI < 26.45 89 10.970 0 ( 0.98876 0.01124 ) *
## 5) BMI > 26.45 220 238.200 0 ( 0.76818 0.23182 )
## 10) Age < 30.5 129 92.740 0 ( 0.88372 0.11628 )
## 20) BMI < 45.4 124 74.290 0 ( 0.91129 0.08871 ) *
## 21) BMI > 45.4 5 5.004 1 ( 0.20000 0.80000 ) *
## 11) Age > 30.5 91 122.200 0 ( 0.60440 0.39560 )
## 22) Plasma_Glucose < 96.5 32 27.740 0 ( 0.84375 0.15625 ) *
## 23) Plasma_Glucose > 96.5 59 81.640 1 ( 0.47458 0.52542 )
## 46) DPF < 0.179 6 0.000 0 ( 1.00000 0.00000 ) *
## 47) DPF > 0.179 53 71.940 1 ( 0.41509 0.58491 ) *
## 3) Plasma_Glucose > 123.5 228 308.300 1 ( 0.40789 0.59211 )
## 6) Plasma_Glucose < 154.5 135 185.500 0 ( 0.55556 0.44444 )
## 12) BMI < 25.55 17 7.606 0 ( 0.94118 0.05882 ) *
## 13) BMI > 25.55 118 163.600 1 ( 0.50000 0.50000 )
## 26) Plasma_Glucose < 152.5 112 154.900 1 ( 0.47321 0.52679 )
## 52) DPF < 0.5275 70 95.610 0 ( 0.57143 0.42857 )
## 104) Dias_BP < 73 27 35.590 1 ( 0.37037 0.62963 ) *
## 105) Dias_BP > 73 43 52.700 0 ( 0.69767 0.30233 ) *
## 53) DPF > 0.5275 42 51.970 1 ( 0.30952 0.69048 )
## 106) Age < 29.5 18 24.060 0 ( 0.61111 0.38889 ) *
## 107) Age > 29.5 24 13.770 1 ( 0.08333 0.91667 )
## 214) DPF < 1.1005 19 0.000 1 ( 0.00000 1.00000 ) *
## 215) DPF > 1.1005 5 6.730 1 ( 0.40000 0.60000 ) *
## 27) Plasma_Glucose > 152.5 6 0.000 0 ( 1.00000 0.00000 ) *
## 7) Plasma_Glucose > 154.5 93 91.390 1 ( 0.19355 0.80645 ) *
The results display the split criterion (e.g. Plasma_Glucose < 123.5), the number of observations in that branch, the deviance, the overall prediction for the branch (0 or 1), and the fraction of observations in that branch that take on the values 0 and 1. Branches that lead to terminal nodes are indicated using asterisks.
Now we plot the tree and interpret the results.
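The deviance column can be checked by hand. For a classification tree, the deviance of a node is -2 times the sum over classes of n_k * log(p_k), where n_k is the number of observations of class k in the node. The helper `node_deviance()` below is a name introduced here for illustration; the inputs are copied from the printed tree, so small rounding differences are expected.

```r
# Deviance of a classification-tree node: -2 * sum_k n_k * log(p_k)
node_deviance <- function(n, p) {
  counts <- n * p
  keep <- counts > 0          # treat 0 * log(0) as 0 for pure nodes
  -2 * sum(counts[keep] * log(p[keep]))
}

# Root node from the output above: n = 537, yprob = (0.65177, 0.34823)
root_dev <- node_deviance(537, c(0.65177, 0.34823))
print(root_dev)

# A pure node, such as node 46 (n = 6, yprob = (1, 0)), has deviance 0
pure_dev <- node_deviance(6, c(1, 0))
print(pure_dev)
```

The value for the root node agrees with the 694.2 shown in the output, and pure nodes have zero deviance because every observation belongs to a single class.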
plot(treemod)
text(treemod, pretty = 0)
The most important indicator of “Diabetes” appears to be Plasma_Glucose, since it is the first split criterion in the tree (Plasma_Glucose < 123.5).
We predict the response on the test data and produce a confusion matrix comparing the true test labels to the predicted ones, from which we obtain the test error rate.
# Testing the Model
tree_pred <- predict(treemod, newdata = test, type = "class" )
confusionMatrix(tree_pred, test$Diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 114 33
## 1 36 47
##
## Accuracy : 0.7
## 95% CI : (0.6363, 0.7585)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 0.07181
##
## Kappa : 0.3445
## Mcnemar's Test P-Value : 0.80973
##
## Sensitivity : 0.7600
## Specificity : 0.5875
## Pos Pred Value : 0.7755
## Neg Pred Value : 0.5663
## Prevalence : 0.6522
## Detection Rate : 0.4957
## Detection Prevalence : 0.6391
## Balanced Accuracy : 0.6738
##
## 'Positive' Class : 0
##
acc_treemod <- confusionMatrix(tree_pred, test$Diabetes)$overall['Accuracy']
The test error rate is 30%. In other words, the accuracy is 70%.
Now we present the Random Forest algorithm
# Training The Model
set.seed(123)
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
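# pima_training was created before Diabetes was converted to a factor, so the
# response is numeric here and randomForest() fits a regression forest (see the
# warning below). With mtry = 8, all 8 predictors are tried at each split (bagging).
rf_pima <- randomForest(Diabetes ~., data = pima_training, mtry = 8, ntree=50, importance = TRUE)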
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
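# Testing the Model
rf_probs <- predict(rf_pima, newdata = pima_testing) # numeric scores, since the forest was fit as a regression
rf_pred <- ifelse(rf_probs > 0.5, 1, 0) # threshold the scores at 0.5 to get class labels
# Recent versions of caret require factor inputs with the same levels
confusionMatrix(factor(rf_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))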
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 124 34
## 1 19 54
##
## Accuracy : 0.7706
## 95% CI : (0.7109, 0.8232)
## No Information Rate : 0.619
## P-Value [Acc > NIR] : 6.567e-07
##
## Kappa : 0.4971
## Mcnemar's Test P-Value : 0.05447
##
## Sensitivity : 0.8671
## Specificity : 0.6136
## Pos Pred Value : 0.7848
## Neg Pred Value : 0.7397
## Prevalence : 0.6190
## Detection Rate : 0.5368
## Detection Prevalence : 0.6840
## Balanced Accuracy : 0.7404
##
## 'Positive' Class : 0
##
acc_rf_pima <- confusionMatrix(factor(rf_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))$overall['Accuracy']
The test error rate is 22.94%. In other words, the accuracy is 77.06%.
Variable importance:
importance(rf_pima)
## %IncMSE IncNodePurity
## Pregnant 4.29534581 6.689093
## Plasma_Glucose 18.06247751 40.309147
## Dias_BP 3.73430996 9.720362
## Triceps_Skin -0.09933701 3.813198
## Serum_Insulin -1.80517988 4.980444
## BMI 7.40408204 20.281781
## DPF 2.90470464 14.348029
## Age 5.71252884 10.788115
The “Plasma_Glucose” is by far the most important variable.
Let us plot the Variable Importance
par(mfrow = c(1, 2))
varImpPlot(rf_pima, type = 2, main = "Variable Importance",col = 'black')
plot(rf_pima, main = "Error vs no. of trees grown")
As we can see, the most important variables by node purity are “Plasma_Glucose”, “BMI” and “DPF”.
Now we present the Support Vector Machine (SVM) algorithm
Preparing the DataSet:
#Load the DataSet
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))
pima$Diabetes <- as.factor(pima$Diabetes)
library(e1071)
#Preparing the DataSet:
set.seed(1000)
intrain <- createDataPartition(y = pima$Diabetes, p = 0.7, list = FALSE)
train <- pima[intrain, ]
test <- pima[-intrain, ]
Choosing Parameters: Now we use the tune.svm() function (a wrapper around tune()) to do a grid search over the supplied parameter ranges (C, the cost, and gamma), using the train set. The gamma parameter ranges from 0.000001 to 0.1 and the cost parameter from 0.1 to 10, both in powers of 10. It is important to understand the influence of these two parameters, because the accuracy of an SVM model depends largely on their selection.
tuned <- tune.svm(Diabetes ~., data = train, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned) # to show the results
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## gamma cost
## 0.01 10
##
## - best performance: 0.2213138
##
## - Detailed performance results:
## gamma cost error dispersion
## 1 1e-06 0.1 0.3479385 0.06920490
## 2 1e-05 0.1 0.3479385 0.06920490
## 3 1e-04 0.1 0.3479385 0.06920490
## 4 1e-03 0.1 0.3479385 0.06920490
## 5 1e-02 0.1 0.3497904 0.06790502
## 6 1e-01 0.1 0.2678896 0.06809146
## 7 1e-06 1.0 0.3479385 0.06920490
## 8 1e-05 1.0 0.3479385 0.06920490
## 9 1e-04 1.0 0.3479385 0.06920490
## 10 1e-03 1.0 0.3497904 0.06790502
## 11 1e-02 1.0 0.2231656 0.07791206
## 12 1e-01 1.0 0.2417890 0.06023243
## 13 1e-06 10.0 0.3479385 0.06920490
## 14 1e-05 10.0 0.3479385 0.06920490
## 15 1e-04 10.0 0.3497904 0.06790502
## 16 1e-03 10.0 0.2232006 0.06636237
## 17 1e-02 10.0 0.2213138 0.07415391
## 18 1e-01 10.0 0.2641509 0.04192063
As we can see, the best parameters are cost = 10 and gamma = 0.01, with a cross-validation error of 0.2213.
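The grid searched above can be enumerated in base R to see where the 18 rows of the results table come from. This is a sketch only: `grid` and `cv` are names introduced here, and the error values in `cv` are copied from rows 16 to 18 of the table rather than recomputed.

```r
# Same ranges as passed to tune.svm() above
gamma_values <- 10^(-6:-1)
cost_values  <- 10^(-1:1)

# All (gamma, cost) combinations; gamma varies fastest, as in the table above
grid <- expand.grid(gamma = gamma_values, cost = cost_values)
print(nrow(grid))

# Pick the row with the lowest cross-validation error.
# Error values copied from rows 16-18 of the table above.
cv <- data.frame(gamma = c(1e-3, 1e-2, 1e-1),
                 cost  = 10,
                 error = c(0.2232006, 0.2213138, 0.2641509))
best <- cv[which.min(cv$error), ]
print(best)
```

Note that the errors for gamma = 0.001 and gamma = 0.01 at cost = 10 are very close, so a different cross-validation partition could plausibly select a different pair.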
Training The Model: We build an SVM model to predict “Diabetes” using cost = 10 and gamma = 0.01, the best values found by the grid search performed before.
svm_model <- svm(Diabetes ~., data = train, kernel = "radial", gamma = 0.01, cost = 10)
summary(svm_model)
##
## Call:
## svm(formula = Diabetes ~ ., data = train, kernel = "radial",
## gamma = 0.01, cost = 10)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.01
##
## Number of Support Vectors: 283
##
## ( 142 141 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
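To illustrate what gamma controls in the radial kernel used above, the kernel value K(u, v) = exp(-gamma * ||u - v||^2) can be computed directly. This is a minimal sketch: `rbf_kernel()` is a helper written here for illustration, and the two feature vectors are hypothetical, assumed to be already scaled.

```r
# Radial basis function kernel: K(u, v) = exp(-gamma * ||u - v||^2)
rbf_kernel <- function(u, v, gamma) {
  exp(-gamma * sum((u - v)^2))
}

# Two hypothetical (already scaled) feature vectors
u <- c(0.5, -1.0, 0.2)
v <- c(0.1,  0.3, 0.2)

k_small_gamma <- rbf_kernel(u, v, gamma = 0.01)  # the tuned value above
k_large_gamma <- rbf_kernel(u, v, gamma = 1)     # for comparison only
print(c(k_small_gamma, k_large_gamma))
```

A small gamma keeps the kernel value close to 1 even for distant points, which gives a smoother decision boundary; a large gamma makes the similarity fall off quickly with distance.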
Testing the Model:
svm_pred <- predict(svm_model, newdata = test)
confusionMatrix(svm_pred, test$Diabetes)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 134 37
## 1 16 43
##
## Accuracy : 0.7696
## 95% CI : (0.7097, 0.8224)
## No Information Rate : 0.6522
## P-Value [Acc > NIR] : 7.748e-05
##
## Kappa : 0.4589
## Mcnemar's Test P-Value : 0.00601
##
## Sensitivity : 0.8933
## Specificity : 0.5375
## Pos Pred Value : 0.7836
## Neg Pred Value : 0.7288
## Prevalence : 0.6522
## Detection Rate : 0.5826
## Detection Prevalence : 0.7435
## Balanced Accuracy : 0.7154
##
## 'Positive' Class : 0
##
acc_svm_model <- confusionMatrix(svm_pred, test$Diabetes)$overall['Accuracy']
The test error rate is 23.04%. In other words, the accuracy is 76.96%.
Comparing the four models (Logistic Regression, Decision Tree, Random Forest and Support Vector Machine (SVM)), we got the following results:
accuracy <- data.frame(Model=c("Logistic Regression","Decision Tree","Random Forest", "Support Vector Machine (SVM)"), Accuracy=c(acc_glm_fit, acc_treemod, acc_rf_pima, acc_svm_model ))
ggplot(accuracy,aes(x=Model,y=Accuracy)) + geom_bar(stat='identity') + theme_bw() + ggtitle('Comparison of Model Accuracy')
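To conclude, the graph shows that the Decision Tree model has the lowest accuracy (70%), while the other three models reach about 75% to 77%. However, the differences in accuracy between these four classifiers are small, and the 95% confidence intervals reported above overlap. Note also that logistic regression and random forest were evaluated on a different test split (231 rows) than the decision tree and SVM (230 rows), so the comparison is not strictly like for like.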
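One informal way to judge whether the accuracies differ is to compare their 95% confidence intervals. The sketch below recomputes exact binomial intervals with `binom.test()` from the confusion matrix counts above. Overlapping intervals are only a rough indication and not a formal test, especially since the models were evaluated on two different test splits.

```r
# Correct predictions and test-set sizes, taken from the confusion matrices above
results <- data.frame(
  model   = c("Logistic Regression", "Decision Tree", "Random Forest", "SVM"),
  correct = c(121 + 52, 114 + 47, 124 + 54, 134 + 43),
  total   = c(231, 230, 231, 230)
)

# Exact (Clopper-Pearson) 95% confidence interval for each accuracy
ci <- t(mapply(function(k, n) binom.test(k, n)$conf.int,
               results$correct, results$total))
results$accuracy <- results$correct / results$total
results$lower <- ci[, 1]
results$upper <- ci[, 2]
print(results)

# Do all four intervals share a common range?
intervals_overlap <- max(results$lower) < min(results$upper)
print(intervals_overlap)
```

A paired comparison on a single shared test set would give a sharper answer than comparing intervals from separate splits.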
[1] “WHO | Diabetes,” WHO. [Online]. Available: http://www.who.int/mediacentre/factsheets/fs312/en/. [Accessed: 03-Jan-2018].
[2] “Global status report on noncommunicable diseases 2010.” [Online]. Available: http://www.who.int/nmh/publications/ncd_report_full_en.pdf