Using Predictive Models to Classify Pima Indians Diabetes Database

1.0 INTRODUCTION

Diabetes mellitus is one of the major noncommunicable diseases and has a great impact on human life today. Many nations now face a rapidly rising prevalence of diabetes among their residents [2].

According to a study by the World Health Organization (WHO), the number of people with diabetes will have risen to 552 million by 2030, meaning that one in 10 adults will have diabetes by 2030 if no serious action is taken. In 2014, the worldwide prevalence of diabetes was estimated to be 9 % among adults aged 18+ years [1].
In developing nations, most people with diabetes are aged between 35 and 64. WHO has already warned that diabetes will be the 7th leading cause of death in the world in 2030. In 2012, an estimated 1.5 million deaths were directly caused by diabetes. More than 80 % of diabetes deaths occur in low- and middle-income countries [2]. Total deaths from diabetes are projected to rise by more than 50 % in the next 10 years.

Diabetes is a leading cause of blindness, amputation and kidney failure. Lack of awareness about diabetes, combined with inadequate access to health services and essential medicines, can lead to many complications. It is a global problem with a devastating human, social, and economic impact, affecting around 300 million people worldwide.

By applying computational analytics to clinical big data, the massive amount of data generated in healthcare systems can be used to create medical intelligence that drives medical prediction and forecasting. Medical analytics is a new trend in medical science. Developing medical intelligence from the available clinical data can make healthcare systems more patient-centered and reduce medical costs and hospital readmissions.

2. PROPOSED SYSTEM

2.1 Data set

The data set used in this study is the Pima Indians Diabetes Database of the National Institute of Diabetes and Digestive and Kidney Diseases. This diabetes database, donated by Vincent Sigillito, is a collection of medical diagnostic reports of 768 examples from a population living near Phoenix, Arizona, USA. More information about the dataset is available at https://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes.

Each example consists of 8 attribute values and one of two possible outcomes: whether the patient tested positive for diabetes (indicated by 1) or not (indicated by 0).

Attribute Information:

# 1. Number of times pregnant 
# 2. Plasma glucose concentration at 2 hours in an oral glucose tolerance test 
# 3. Diastolic blood pressure (mm Hg) 
# 4. Triceps skin fold thickness (mm) 
# 5. 2-Hour serum insulin (mu U/ml) 
# 6. Body mass index (weight in kg/(height in m)^2) 
# 7. Diabetes pedigree function 
# 8. Age (years) 
# 9. Class variable (0 or 1)

This data set is analysed in R using four algorithms for the prediction of diabetes in the women in this data set, followed by a comparison of model accuracy:

# 1. Logistic Regression;
# 2. Decision Tree;
# 3. Random Forest;
# 4. Support Vector Machine (SVM).

3.0 Pima Indians Diabetes Classification ———————————

Preparing the DataSet:

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.4.3

## corrplot 0.84 loaded

library(caret)

## Warning: package 'caret' was built under R version 3.4.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.4.3

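# Note: the raw file has no header row. With read.csv()'s default header = TRUE the
# first record is consumed as column names, so 767 of the 768 records are loaded
# (see str() below). Passing header = FALSE would keep all 768 rows.
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))

head(pima) # show the first rows of the Pima data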

##   Pregnant Plasma_Glucose Dias_BP Triceps_Skin Serum_Insulin  BMI   DPF
## 1        1             85      66           29             0 26.6 0.351
## 2        8            183      64            0             0 23.3 0.672
## 3        1             89      66           23            94 28.1 0.167
## 4        0            137      40           35           168 43.1 2.288
## 5        5            116      74            0             0 25.6 0.201
## 6        3             78      50           32            88 31.0 0.248
##   Age Diabetes
## 1  31        0
## 2  32        1
## 3  21        0
## 4  33        1
## 5  30        0
## 6  26        1

str(pima) # show the structure of the data

## 'data.frame':    767 obs. of  9 variables:
##  $ Pregnant      : int  1 8 1 0 5 3 10 2 8 4 ...
##  $ Plasma_Glucose: int  85 183 89 137 116 78 115 197 125 110 ...
##  $ Dias_BP       : int  66 64 66 40 74 50 0 70 96 92 ...
##  $ Triceps_Skin  : int  29 0 23 35 0 32 0 45 0 0 ...
##  $ Serum_Insulin : int  0 0 94 168 0 88 0 543 0 0 ...
##  $ BMI           : num  26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 37.6 ...
##  $ DPF           : num  0.351 0.672 0.167 2.288 0.201 ...
##  $ Age           : int  31 32 21 33 30 26 29 53 54 30 ...
##  $ Diabetes      : int  0 1 0 1 0 1 0 1 1 0 ...

We use “sapply” to check the number of missing values in each column.

sapply(pima, function(x) sum(is.na(x)))

##       Pregnant Plasma_Glucose        Dias_BP   Triceps_Skin  Serum_Insulin 
##              0              0              0              0              0 
##            BMI            DPF            Age       Diabetes 
##              0              0              0              0

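As the results show, there are no NA values in the data. Note, however, that several columns (for example Dias_BP, Triceps_Skin, Serum_Insulin and BMI in the str() output above) contain zeros, which are not physiologically plausible and may encode missing measurements.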
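A check for NA alone can miss values that are recorded as zero. The sketch below counts zeros per column and recodes them as NA. It runs on a small hypothetical data frame (`toy`, with made-up rows in the style of the `str()` output above), not on the real data, and treating zeros as missing is an assumption rather than something the data set documents here.

```r
# Hypothetical mini data frame; the values are illustrative only
toy <- data.frame(
  Plasma_Glucose = c(85, 183, 89, 137),
  Dias_BP        = c(66, 64, 0, 40),
  BMI            = c(26.6, 23.3, 0, 43.1)
)

# Count zeros in each column
zero_counts <- sapply(toy, function(x) sum(x == 0))
print(zero_counts)

# Recode zeros as NA (assumption: zero means "not measured" in these columns)
toy_na <- toy
toy_na[toy_na == 0] <- NA
na_counts <- sapply(toy_na, function(x) sum(is.na(x)))
print(na_counts)
```

On the real data the same `sapply()` call could be applied to `pima`, restricted to the columns where zero is not a meaningful value (a zero in Pregnant, for instance, is legitimate).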

Let’s produce the matrix of scatterplots

pairs(pima, panel = panel.smooth)

We compute the matrix of correlations between the predictor variables:

corrplot(cor(pima[, -9]), type = "lower", method = "number")

3.1 Apply Logistic Regression model—————————————–

We use a training data set containing a random sample of 70% of the observations to fit a Logistic Regression with “Diabetes” as the response and the remaining variables as predictors.

# Preparing the DataSet
set.seed(123)
n <- nrow(pima)
train <- sample(n, trunc(0.70*n))
pima_training <- pima[train, ]
pima_testing <- pima[-train, ]

# Training The Model
glm_fm1 <- glm(Diabetes ~., data = pima_training, family = binomial)
summary(glm_fm1)

## 
## Call:
## glm(formula = Diabetes ~ ., family = binomial, data = pima_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9063  -0.6925  -0.3641   0.6653   3.0684  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -9.6463340  0.9435703 -10.223  < 2e-16 ***
## Pregnant        0.0886598  0.0404862   2.190   0.0285 *  
## Plasma_Glucose  0.0370229  0.0047419   7.808 5.82e-15 ***
## Dias_BP        -0.0148023  0.0062585  -2.365   0.0180 *  
## Triceps_Skin   -0.0123352  0.0089285  -1.382   0.1671    
## Serum_Insulin   0.0004372  0.0012196   0.359   0.7200    
## BMI             0.1191741  0.0199373   5.977 2.27e-09 ***
## DPF             1.5580135  0.3738715   4.167 3.08e-05 ***
## Age             0.0177930  0.0120178   1.481   0.1387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 682.80  on 535  degrees of freedom
## Residual deviance: 472.69  on 527  degrees of freedom
## AIC: 490.69
## 
## Number of Fisher Scoring iterations: 5

The result shows that the variables Triceps_Skin, Serum_Insulin and Age are not statistically significant: their p-values are greater than 0.05. Therefore they will be removed.

Update to use only the significant variables

glm_fm2 <- update(glm_fm1, ~. - Triceps_Skin - Serum_Insulin - Age )
summary(glm_fm2)

## 
## Call:
## glm(formula = Diabetes ~ Pregnant + Plasma_Glucose + Dias_BP + 
##     BMI + DPF, family = binomial, data = pima_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7812  -0.7095  -0.3644   0.6362   3.1411  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -9.208436   0.898444 -10.249  < 2e-16 ***
## Pregnant        0.122465   0.035005   3.498 0.000468 ***
## Plasma_Glucose  0.039263   0.004423   8.876  < 2e-16 ***
## Dias_BP        -0.014911   0.005968  -2.499 0.012471 *  
## BMI             0.105786   0.018141   5.831 5.51e-09 ***
## DPF             1.519865   0.364109   4.174 2.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 682.80  on 535  degrees of freedom
## Residual deviance: 477.45  on 530  degrees of freedom
## AIC: 489.45
## 
## Number of Fisher Scoring iterations: 5

Now all remaining variables are statistically significant at the 0.05 level.
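Whether dropping the three variables costs the model much fit can also be checked with a likelihood-ratio test. The following is a minimal sketch that uses only the residual deviances and degrees of freedom printed in the two summaries above (472.69 on 527 for the full model, 477.45 on 530 for the reduced one). With the fitted objects at hand, `anova(glm_fm2, glm_fm1, test = "Chisq")` should give the same comparison.

```r
# Residual deviance and degrees of freedom copied from the summaries above
dev_full <- 472.69; df_full <- 527   # glm_fm1: all eight predictors
dev_red  <- 477.45; df_red  <- 530   # glm_fm2: three predictors removed

# Likelihood-ratio statistic: increase in deviance caused by the removal
lr_stat <- dev_red - dev_full
lr_df   <- df_red - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(statistic = lr_stat, df = lr_df, p_value = p_value))
```

The p-value comes out at roughly 0.19, so the reduced model does not fit significantly worse than the full one, which is consistent with its slightly lower AIC (489.45 versus 490.69).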

Plot the new model

par(mfrow = c(2,2))
plot(glm_fm2)

Residuals vs Fitted Plot: This plot shows the residuals against the fitted values. The dotted line at y=0 marks a zero residual: points above it have positive residuals and points below it have negative residuals. The red line is a smoothed curve that gives an idea of the pattern in the residuals. In our case the residuals follow a curved pattern. Because the response is binary, the points fall into two bands, so some structure in this plot is expected and should not by itself be read as evidence of a better or worse model.

Normal Q-Q Plot: The Normal Q-Q plot is used to check whether the residuals follow a Normal distribution. The residuals are approximately normally distributed if the points follow the dotted line closely. In this case the residual points follow the dotted line closely except for observation #349, so the residuals look approximately Normal.

Scale-Location Plot: The scale-location plot shows the spread of the residuals across the range of predicted values. One of the assumptions of regression is homoscedasticity, i.e. the variance should be reasonably equal across the predictor range. A horizontal red line is ideal and would indicate that the residuals have uniform variance across the range; as the residuals spread wider from each other, the red line goes up. In our case the data appear homoscedastic, i.e. the variance is roughly uniform.

Residuals vs Leverage Plot: Before reading this plot we must know what influence and leverage are. Influence: the influence of an observation can be thought of in terms of how much the predicted scores would change if the observation were excluded. Cook’s distance is a good measure of the influence of an observation. Leverage: the leverage of an observation is based on how much the observation’s value on the predictor variable differs from the mean of the predictor variable. The greater the leverage of an observation, the greater its potential influence.

Now let's analyse our leverage plot and draw inferences. In this plot the dotted red lines are contours of Cook’s distance, and the areas of interest are those outside the dotted lines in the top right or bottom right corner. If a point falls in that region, the observation has both high leverage and a large residual, so the fitted model would change noticeably if that point were excluded. Not every outlier has high leverage, and not every high-leverage point is an outlier.

In this case no point falls in that region, so no single observation appears to unduly influence the Logistic Regression fit.
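Cook's distance and leverage can also be inspected numerically rather than read off the plot. The sketch below uses simulated data (the variables `x` and `y` and the 4/n cut-off are assumptions for illustration; 4/n is only a common rule of thumb) and the base R functions `cooks.distance()` and `hatvalues()`.

```r
set.seed(1)

# Simulated binary outcome with one predictor (hypothetical data)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
fit <- glm(y ~ x, family = binomial)

# Influence and leverage for every observation
cd  <- cooks.distance(fit)
lev <- hatvalues(fit)

# Flag observations above a rule-of-thumb threshold of 4/n
flagged <- which(cd > 4 / length(cd))
print(flagged)

# Leverages sum to the number of estimated coefficients (here 2)
print(sum(lev))
```

The same two calls could be applied to `glm_fm2` to list the observations that the Residuals vs Leverage plot highlights.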

Apply the model to the testing sample

# Testing the Model
glm_probs <- predict(glm_fm2, newdata = pima_testing, type = "response")
glm_pred <- ifelse(glm_probs > 0.5, 1, 0)
#print("Confusion Matrix for logistic regression"); table(Predicted = glm_pred, Actual = pima_testing$Diabetes)
confusionMatrix(factor(glm_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1))) # Confusion Matrix for logistic regression (factors, as required by recent caret versions)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 121  36
##          1  22  52
##                                           
##                Accuracy : 0.7489          
##                  95% CI : (0.6878, 0.8035)
##     No Information Rate : 0.619           
##     P-Value [Acc > NIR] : 1.996e-05       
##                                           
##                   Kappa : 0.4509          
##  Mcnemar's Test P-Value : 0.08783         
##                                           
##             Sensitivity : 0.8462          
##             Specificity : 0.5909          
##          Pos Pred Value : 0.7707          
##          Neg Pred Value : 0.7027          
##              Prevalence : 0.6190          
##          Detection Rate : 0.5238          
##    Detection Prevalence : 0.6797          
##       Balanced Accuracy : 0.7185          
##                                           
##        'Positive' Class : 0               
##

acc_glm_fit <- confusionMatrix(factor(glm_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))$overall['Accuracy']

The test error rate is 25.11%. In other words, the accuracy is 74.89%.
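The headline figures can be recomputed by hand from the four cells of the confusion matrix above. This is a small sketch with the counts copied from that output; the object names are illustrative. Note that caret reports class 0 (no diabetes) as the 'Positive' class, so its "Sensitivity" refers to non-diabetic patients.

```r
# Cell counts copied from the confusion matrix above (filled column by column)
cm <- matrix(c(121, 22, 36, 52), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# With class 0 as 'Positive': sensitivity is the share of true 0s predicted as 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity is the share of true 1s (diabetic) predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Read this way, the model identifies about 59% of the diabetic patients in the test set, which is worth keeping in mind alongside the overall accuracy.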

3.2 Decision Trees —————————————–

Now we apply the Decision Tree algorithm.

# Preparing the DataSet:
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))
pima$Diabetes <- as.factor(pima$Diabetes)

library(caret)
library(tree)

## Warning: package 'tree' was built under R version 3.4.3

library(e1071)

## Warning: package 'e1071' was built under R version 3.4.3

set.seed(1000)
intrain <- createDataPartition(y = pima$Diabetes, p = 0.7, list = FALSE)
train <- pima[intrain, ]
test <- pima[-intrain, ]

# Training The Model
treemod <- tree(Diabetes ~ ., data = train)

summary(treemod)

## 
## Classification tree:
## tree(formula = Diabetes ~ ., data = train)
## Variables actually used in tree construction:
## [1] "Plasma_Glucose" "BMI"            "Age"            "DPF"           
## [5] "Dias_BP"       
## Number of terminal nodes:  14 
## Residual mean deviance:  0.7801 = 408 / 523 
## Misclassification error rate: 0.1695 = 91 / 537

The results show that five variables are used as internal nodes in the tree, that the tree has 14 terminal nodes, and that the training error rate is 16.95%.

Get a detailed text output.

treemod # get a detailed text output.

## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##   1) root 537 694.200 0 ( 0.65177 0.34823 )  
##     2) Plasma_Glucose < 123.5 309 280.100 0 ( 0.83172 0.16828 )  
##       4) BMI < 26.45 89  10.970 0 ( 0.98876 0.01124 ) *
##       5) BMI > 26.45 220 238.200 0 ( 0.76818 0.23182 )  
##        10) Age < 30.5 129  92.740 0 ( 0.88372 0.11628 )  
##          20) BMI < 45.4 124  74.290 0 ( 0.91129 0.08871 ) *
##          21) BMI > 45.4 5   5.004 1 ( 0.20000 0.80000 ) *
##        11) Age > 30.5 91 122.200 0 ( 0.60440 0.39560 )  
##          22) Plasma_Glucose < 96.5 32  27.740 0 ( 0.84375 0.15625 ) *
##          23) Plasma_Glucose > 96.5 59  81.640 1 ( 0.47458 0.52542 )  
##            46) DPF < 0.179 6   0.000 0 ( 1.00000 0.00000 ) *
##            47) DPF > 0.179 53  71.940 1 ( 0.41509 0.58491 ) *
##     3) Plasma_Glucose > 123.5 228 308.300 1 ( 0.40789 0.59211 )  
##       6) Plasma_Glucose < 154.5 135 185.500 0 ( 0.55556 0.44444 )  
##        12) BMI < 25.55 17   7.606 0 ( 0.94118 0.05882 ) *
##        13) BMI > 25.55 118 163.600 1 ( 0.50000 0.50000 )  
##          26) Plasma_Glucose < 152.5 112 154.900 1 ( 0.47321 0.52679 )  
##            52) DPF < 0.5275 70  95.610 0 ( 0.57143 0.42857 )  
##             104) Dias_BP < 73 27  35.590 1 ( 0.37037 0.62963 ) *
##             105) Dias_BP > 73 43  52.700 0 ( 0.69767 0.30233 ) *
##            53) DPF > 0.5275 42  51.970 1 ( 0.30952 0.69048 )  
##             106) Age < 29.5 18  24.060 0 ( 0.61111 0.38889 ) *
##             107) Age > 29.5 24  13.770 1 ( 0.08333 0.91667 )  
##               214) DPF < 1.1005 19   0.000 1 ( 0.00000 1.00000 ) *
##               215) DPF > 1.1005 5   6.730 1 ( 0.40000 0.60000 ) *
##          27) Plasma_Glucose > 152.5 6   0.000 0 ( 1.00000 0.00000 ) *
##       7) Plasma_Glucose > 154.5 93  91.390 1 ( 0.19355 0.80645 ) *

The results display the split criterion (e.g. Plasma_Glucose < 123.5), the number of observations in that branch, the deviance, the overall prediction for the branch (0 or 1), and the fraction of observations in that branch that take on the values 0 and 1. Branches that lead to terminal nodes are indicated using asterisks.
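The deviance column can be reproduced from the other two. For a classification tree the deviance of a node is -2 times the sum over classes of (class count × log of class proportion). The sketch below assumes that definition and uses a hypothetical helper, `node_deviance()`, fed with the n and yprob values printed above.

```r
# Hypothetical helper: deviance of one node from its size and class proportions
node_deviance <- function(n, yprob) {
  p <- yprob[yprob > 0]        # a class with proportion 0 contributes nothing
  -2 * n * sum(p * log(p))
}

# Root node: n = 537, yprob = (0.65177, 0.34823)
root_dev <- node_deviance(537, c(0.65177, 0.34823))
print(round(root_dev, 1))

# Node 46 is pure (all 6 observations in class 0), so its deviance is zero
pure_dev <- node_deviance(6, c(1, 0))
print(pure_dev)
```

The root value agrees with the 694.2 shown in the output up to rounding of the printed proportions. A pure node has zero deviance, which is why nodes such as 46 and 214 are not split further.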

Now we plot the tree and interpret the results.

plot(treemod)
text(treemod, pretty = 0)

The most important indicator of “Diabetes” appears to be Plasma_Glucose, since it defines the first split (Plasma_Glucose < 123.5).

Next we predict the response on the test data and produce a confusion matrix comparing the test labels to the predicted test labels, which gives the test error rate.

# Testing the Model
tree_pred <- predict(treemod, newdata = test, type = "class" )
confusionMatrix(tree_pred, test$Diabetes)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 114  33
##          1  36  47
##                                           
##                Accuracy : 0.7             
##                  95% CI : (0.6363, 0.7585)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 0.07181         
##                                           
##                   Kappa : 0.3445          
##  Mcnemar's Test P-Value : 0.80973         
##                                           
##             Sensitivity : 0.7600          
##             Specificity : 0.5875          
##          Pos Pred Value : 0.7755          
##          Neg Pred Value : 0.5663          
##              Prevalence : 0.6522          
##          Detection Rate : 0.4957          
##    Detection Prevalence : 0.6391          
##       Balanced Accuracy : 0.6738          
##                                           
##        'Positive' Class : 0               
##

acc_treemod <- confusionMatrix(tree_pred, test$Diabetes)$overall['Accuracy']

The test error rate is 30%. In other words, the accuracy is 70%.

3.3 Applying random forests model —————————————–

# Training The Model
set.seed(123)
library(randomForest)

## Warning: package 'randomForest' was built under R version 3.4.3

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

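# Note: Diabetes is still an integer column in pima_training, so randomForest() fits a
# regression forest (hence the warning below) and the predictions are thresholded at 0.5.
# mtry = 8 considers all 8 predictors at every split, which is equivalent to bagging.
rf_pima <- randomForest(Diabetes ~., data = pima_training, mtry = 8, ntree=50, importance = TRUE)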

## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

# Testing the Model
rf_probs <- predict(rf_pima, newdata = pima_testing)
rf_pred <- ifelse(rf_probs > 0.5, 1, 0)
confusionMatrix(factor(rf_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 124  34
##          1  19  54
##                                           
##                Accuracy : 0.7706          
##                  95% CI : (0.7109, 0.8232)
##     No Information Rate : 0.619           
##     P-Value [Acc > NIR] : 6.567e-07       
##                                           
##                   Kappa : 0.4971          
##  Mcnemar's Test P-Value : 0.05447         
##                                           
##             Sensitivity : 0.8671          
##             Specificity : 0.6136          
##          Pos Pred Value : 0.7848          
##          Neg Pred Value : 0.7397          
##              Prevalence : 0.6190          
##          Detection Rate : 0.5368          
##    Detection Prevalence : 0.6840          
##       Balanced Accuracy : 0.7404          
##                                           
##        'Positive' Class : 0               
##

acc_rf_pima <- confusionMatrix(factor(rf_pred, levels = c(0, 1)), factor(pima_testing$Diabetes, levels = c(0, 1)))$overall['Accuracy']

The test error rate is 22.94%. In other words, the accuracy is 77.06%.

Variable importance:

importance(rf_pima)

##                    %IncMSE IncNodePurity
## Pregnant        4.29534581      6.689093
## Plasma_Glucose 18.06247751     40.309147
## Dias_BP         3.73430996      9.720362
## Triceps_Skin   -0.09933701      3.813198
## Serum_Insulin  -1.80517988      4.980444
## BMI             7.40408204     20.281781
## DPF             2.90470464     14.348029
## Age             5.71252884     10.788115

“Plasma_Glucose” is by far the most important variable.

Let us plot the Variable Importance

par(mfrow = c(1, 2))
varImpPlot(rf_pima, type = 2, main = "Variable Importance",col = 'black')
plot(rf_pima, main = "Error vs no. of trees grown")

As we can see, by node purity the most important variables are “Plasma_Glucose”, “BMI” and “DPF”.

3.4 Applying Support Vector Machine - svm model ———————-

Preparing the DataSet:

#Load the DataSet
pima <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data", col.names=c("Pregnant","Plasma_Glucose","Dias_BP","Triceps_Skin","Serum_Insulin","BMI","DPF","Age","Diabetes"))
pima$Diabetes <- as.factor(pima$Diabetes)

library(e1071)

#Preparing the DataSet:
set.seed(1000)
intrain <- createDataPartition(y = pima$Diabetes, p = 0.7, list = FALSE)
train <- pima[intrain, ]
test <- pima[-intrain, ]

Choosing Parameters: Now, we will use the tune() function to do a grid search over the supplied parameter ranges (C - cost, gamma), using the train set. The gamma parameter ranges from 0.000001 to 0.1 and the cost parameter from 0.1 to 10, each in powers of 10. It’s important to understand the influence of these two parameters, because the accuracy of an SVM model is largely dependent on their selection.

tuned <- tune.svm(Diabetes ~., data = train, gamma = 10^(-6:-1), cost = 10^(-1:1))
summary(tuned) # to show the results

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  gamma cost
##   0.01   10
## 
## - best performance: 0.2213138 
## 
## - Detailed performance results:
##    gamma cost     error dispersion
## 1  1e-06  0.1 0.3479385 0.06920490
## 2  1e-05  0.1 0.3479385 0.06920490
## 3  1e-04  0.1 0.3479385 0.06920490
## 4  1e-03  0.1 0.3479385 0.06920490
## 5  1e-02  0.1 0.3497904 0.06790502
## 6  1e-01  0.1 0.2678896 0.06809146
## 7  1e-06  1.0 0.3479385 0.06920490
## 8  1e-05  1.0 0.3479385 0.06920490
## 9  1e-04  1.0 0.3479385 0.06920490
## 10 1e-03  1.0 0.3497904 0.06790502
## 11 1e-02  1.0 0.2231656 0.07791206
## 12 1e-01  1.0 0.2417890 0.06023243
## 13 1e-06 10.0 0.3479385 0.06920490
## 14 1e-05 10.0 0.3479385 0.06920490
## 15 1e-04 10.0 0.3497904 0.06790502
## 16 1e-03 10.0 0.2232006 0.06636237
## 17 1e-02 10.0 0.2213138 0.07415391
## 18 1e-01 10.0 0.2641509 0.04192063

As we can see, the results show that the best parameters are cost=10 and gamma=0.01.
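To make the grid search concrete, the sketch below rebuilds the 6 × 3 grid of candidate values with `expand.grid()`, attaches the cross-validation errors copied from the table above, and picks the row with the smallest error. It only re-reads the reported results and does not rerun the tuning.

```r
# All 18 combinations of gamma and cost; gamma varies fastest, as in the table
grid <- expand.grid(gamma = 10^(-6:-1), cost = 10^(-1:1))

# Cross-validation errors copied from the tuning output above, rows 1 to 18
grid$error <- c(
  0.3479385, 0.3479385, 0.3479385, 0.3479385, 0.3497904, 0.2678896,
  0.3479385, 0.3479385, 0.3479385, 0.3497904, 0.2231656, 0.2417890,
  0.3479385, 0.3479385, 0.3497904, 0.2232006, 0.2213138, 0.2641509
)

# Row with the lowest cross-validation error
best <- grid[which.min(grid$error), ]
print(best)
```

The errors at (gamma = 0.01, cost = 1) and (gamma = 0.001, cost = 10) are within 0.002 of the best value, far less than the reported dispersion of about 0.07, so the choice among these three settings is not clear-cut.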

Training The Model: We now build an SVM model to predict “Diabetes” using cost=10 and gamma=0.01, which were the best values according to the tune() function run above.

svm_model  <- svm(Diabetes ~., data = train, kernel = "radial", gamma = 0.01, cost = 10) 
summary(svm_model)

## 
## Call:
## svm(formula = Diabetes ~ ., data = train, kernel = "radial", 
##     gamma = 0.01, cost = 10)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.01 
## 
## Number of Support Vectors:  283
## 
##  ( 142 141 )
## 
## 
## Number of Classes:  2 
## 
## Levels: 
##  0 1

Testing the Model:

svm_pred <- predict(svm_model, newdata = test)
confusionMatrix(svm_pred, test$Diabetes)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 134  37
##          1  16  43
##                                           
##                Accuracy : 0.7696          
##                  95% CI : (0.7097, 0.8224)
##     No Information Rate : 0.6522          
##     P-Value [Acc > NIR] : 7.748e-05       
##                                           
##                   Kappa : 0.4589          
##  Mcnemar's Test P-Value : 0.00601         
##                                           
##             Sensitivity : 0.8933          
##             Specificity : 0.5375          
##          Pos Pred Value : 0.7836          
##          Neg Pred Value : 0.7288          
##              Prevalence : 0.6522          
##          Detection Rate : 0.5826          
##    Detection Prevalence : 0.7435          
##       Balanced Accuracy : 0.7154          
##                                           
##        'Positive' Class : 0               
##

acc_svm_model <- confusionMatrix(svm_pred, test$Diabetes)$overall['Accuracy']

The test error rate is 23.04%. In other words, the accuracy is 76.96%.

4.0 Comparison of Model Accuracy ——————————————–

Comparing the four models (Logistic Regression, Decision Tree, Random Forest and Support Vector Machine (SVM)), we got the following results:

accuracy <- data.frame(Model=c("Logistic Regression","Decision Tree","Random Forest", "Support Vector Machine (SVM)"), Accuracy=c(acc_glm_fit, acc_treemod, acc_rf_pima, acc_svm_model ))
ggplot(accuracy,aes(x=Model,y=Accuracy)) + geom_bar(stat='identity') + theme_bw() + ggtitle('Comparison of Model Accuracy')

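To conclude, the graph shows that the Decision Tree model has the lowest accuracy (70%). However, the differences in accuracy between these four classifiers are small, and the 95% confidence intervals reported above overlap. Note also that Logistic Regression and Random Forest were evaluated on a different train/test split than the Decision Tree and SVM, so the comparison is only indicative.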

5. REFERENCES

[1] “WHO | Diabetes,” WHO. [Online]. Available: http://www.who.int/mediacentre/factsheets/fs312/en/. [Accessed: 03-Jan-2018].

[2] “Global status report on noncommunicable diseases 2010.” [Online]. Available: http://www.who.int/nmh/publications/ncd_report_full_en.pdf