Dataset Information :

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Objective :

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

The main objective is to predict whether a given patient has diabetes.

The methods we intend to use are Binary Logistic Regression, Naive Bayes, Support Vector Machine, and K-Nearest Neighbor.

Description of variables :

Prepare the data

diab<-read.csv("D:/R studio files/diabetes.csv",header = T)
head(diab)
str(diab)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Checking whether the data contains any missing (NA) values :

colSums(is.na(diab))
##              Pregnancies                  Glucose            BloodPressure 
##                        0                        0                        0 
##            SkinThickness                  Insulin                      BMI 
##                        0                        0                        0 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

There are no missing (NA) values in our data.
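
A value of 0 is not a plausible reading for BloodPressure, SkinThickness, Insulin, or BMI, yet the str output above shows zeros in all four, so unrecorded measurements may be stored as 0 rather than NA. The sketch below counts such zeros on the first ten rows printed by str. The object names first10, zero_cols, zero_counts, and first10_na are illustrative and not part of the original analysis, and treating zeros as missing is an assumption.

```r
# First ten rows of five columns, copied from the str(diab) output above.
first10 <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Columns where a zero is treated as "not recorded" (an assumption).
zero_cols <- names(first10)

# Count zeros per column.
zero_counts <- colSums(first10[zero_cols] == 0)
print(zero_counts)

# Recode zeros to NA so that is.na() and summary() report them.
first10_na <- first10
first10_na[zero_cols] <- lapply(first10_na[zero_cols],
                                function(x) replace(x, x == 0, NA))
print(colSums(is.na(first10_na)))
```

On the full data, the same two steps could be applied to diab before modelling; whether to impute or drop the resulting NA values is a separate decision.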

Checking summary of the data :

summary(diab)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
diab1<-diab

Converting our dependent variable to a factor :

diab1$Outcome<-factor(diab1$Outcome)

EDA

diab1$Age<-cut(diab1$Age,breaks = c(18,30,50,Inf),labels = c("c1","c2","c3"))
library(ggplot2)
ggplot(diab1,aes(x=Age)) + geom_bar(aes(fill=Outcome)) +labs(x = "Age Group",y="Frequency",title = "Age Wise Distribution")

The plot suggests that the 30-50 age group (c2) has a higher chance of being diabetic than the other age groups.
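
Because cut builds right-closed intervals by default, the breaks c(18,30,50,Inf) define the groups c1 = (18,30], c2 = (30,50], and c3 = (50,Inf]. A stacked bar chart shows counts, so the share of diabetic patients per group is easier to compare with row proportions. The sketch below uses made-up ages and outcomes purely for illustration; ages, outcome, age_group, and share are hypothetical names.

```r
# Hypothetical ages and outcomes, chosen only to show the bin edges.
ages    <- c(21, 30, 31, 50, 51, 81)
outcome <- factor(c(0, 0, 1, 0, 1, 0))

# Same breaks and labels as in the analysis above.
age_group <- cut(ages, breaks = c(18, 30, 50, Inf),
                 labels = c("c1", "c2", "c3"))
print(data.frame(ages, age_group))

# Share of each outcome within every age group (each row sums to 1).
share <- prop.table(table(age_group, outcome), margin = 1)
print(round(share, 2))
```

Note that an age of exactly 30 falls in c1 and an age of exactly 50 falls in c2.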

Partitioning data in training & testing :

set.seed(100)
index<-sample(nrow(diab1),0.75*nrow(diab1))
train_diab<-diab1[index,]
test_diab<-diab1[-index,]
dim(train_diab)
## [1] 576   9
dim(test_diab)
## [1] 192   9
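
The split works because sample returns 576 distinct row positions and negative indexing selects the remaining 192 rows. The sketch below reproduces this on a stand-in data frame with the same number of rows (768) and, as an extra check that is not in the original analysis, compares the share of positive outcomes in the two parts. The 500/268 class split is inferred from the Outcome mean of 0.349 in the summary above; toy, train_toy, and test_toy are illustrative names.

```r
set.seed(100)

# Stand-in for diab1: 768 rows, 268 of them with Outcome = 1.
toy <- data.frame(id = seq_len(768),
                  Outcome = factor(rep(c(0, 1), times = c(500, 268))))

index     <- sample(nrow(toy), 0.75 * nrow(toy))  # 576 distinct row positions
train_toy <- toy[index, ]
test_toy  <- toy[-index, ]                        # every row not in index

# The two parts are disjoint.
stopifnot(length(intersect(train_toy$id, test_toy$id)) == 0)

# Share of positive outcomes in each part; a large gap would suggest
# an unlucky split.
print(c(train = mean(train_toy$Outcome == 1),
        test  = mean(test_toy$Outcome == 1)))
```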

Applying Machine Learning Algorithms :

Binary Logistic Regression

BLR<-glm(Outcome~.,data = train_diab,family = "binomial")
summary(BLR)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5460  -0.7113  -0.4381   0.7803   2.6463  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.093102   0.762263  -9.305  < 2e-16 ***
## Pregnancies               0.064940   0.038808   1.673 0.094253 .  
## Glucose                   0.031168   0.004031   7.732 1.06e-14 ***
## BloodPressure            -0.014245   0.006039  -2.359 0.018340 *  
## SkinThickness            -0.001967   0.007946  -0.248 0.804489    
## Insulin                  -0.001031   0.001001  -1.030 0.303018    
## BMI                       0.079331   0.017703   4.481 7.42e-06 ***
## DiabetesPedigreeFunction  0.794737   0.339184   2.343 0.019125 *  
## Agec2                     0.972621   0.276658   3.516 0.000439 ***
## Agec3                     0.556236   0.386472   1.439 0.150076    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 743.86  on 575  degrees of freedom
## Residual deviance: 553.32  on 566  degrees of freedom
## AIC: 573.32
## 
## Number of Fisher Scoring iterations: 5
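
Logistic regression coefficients are on the log-odds scale, so exponentiating an estimate gives the multiplicative change in the odds of diabetes for a one-unit increase in that predictor, holding the others fixed. The sketch below applies this to three estimates copied from the table above; on the fitted object, the equivalent call would be exp(coef(BLR)). The names est and odds_ratio are illustrative.

```r
# Estimates copied from the summary(BLR) output above.
est <- c(Glucose = 0.031168, BMI = 0.079331, Agec2 = 0.972621)

# Odds ratios: exp() of the log-odds coefficients.
odds_ratio <- exp(est)
print(round(odds_ratio, 3))
```

The Agec2 odds ratio compares the c2 age group with the baseline group c1, while the Glucose and BMI values are per-unit changes.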

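As the summary shows, SkinThickness and Insulin are not significant. For Age, the c2 level is significant (p = 0.0004) while c3 is not. We remove the SkinThickness, Insulin, and Age columns and partition the data again.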

col1<-c("SkinThickness","Insulin","Age")
diab1[,col1]<-list(NULL)

index<-sample(nrow(diab1),0.75*nrow(diab1))
train_diab<-diab1[index,]
test_diab<-diab1[-index,]
dim(train_diab)
## [1] 576   6
dim(test_diab)
## [1] 192   6
BLR1<-glm(Outcome~.,data = train_diab,family = "binomial")
summary(BLR1)
## 
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8133  -0.7213  -0.4379   0.7137   3.0974  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.461989   0.767443  -9.723  < 2e-16 ***
## Pregnancies               0.161791   0.032817   4.930 8.22e-07 ***
## Glucose                   0.036411   0.004075   8.935  < 2e-16 ***
## BloodPressure            -0.014606   0.005896  -2.477   0.0132 *  
## BMI                       0.066081   0.015528   4.256 2.08e-05 ***
## DiabetesPedigreeFunction  0.934784   0.333781   2.801   0.0051 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 737.35  on 575  degrees of freedom
## Residual deviance: 546.37  on 570  degrees of freedom
## AIC: 558.37
## 
## Number of Fisher Scoring iterations: 5
train_diab_BLR<-fitted(BLR1)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
pred<-prediction(train_diab_BLR,train_diab$Outcome)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
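
Each point on the ROC curve corresponds to one cutoff: predictions at or above the cutoff are labelled 1, and the true positive rate and false positive rate are computed from the resulting labels. The sketch below shows that calculation in base R on made-up scores and labels; roc_point is a hypothetical helper, not an ROCR function.

```r
# Hypothetical fitted probabilities and true labels.
scores <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1, 1, 0, 1, 0, 0, 0)

# True and false positive rate for a single cutoff.
roc_point <- function(cutoff) {
  pred <- as.integer(scores >= cutoff)
  c(cutoff = cutoff,
    tpr = sum(pred == 1 & labels == 1) / sum(labels == 1),
    fpr = sum(pred == 1 & labels == 0) / sum(labels == 0))
}

# One row per cutoff.
roc <- t(sapply(c(0.35, 0.5), roc_point))
print(roc)
```

In this toy example, lowering the cutoff from 0.5 to 0.35 raises the true positive rate while leaving the false positive rate unchanged, which is the kind of trade-off one reads off the plot when choosing a cutoff.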

library(caret)
## Loading required package: lattice
pred_BLR<-predict(BLR1,test_diab,type="response")
pred_BLR1<-ifelse(pred_BLR<0.35,0,1)
pred_BLR1<-as.factor(pred_BLR1)
test_diab$Outcome<-as.factor(test_diab$Outcome)
confusionMatrix(pred_BLR1,test_diab$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 95 21
##          1 24 52
##                                           
##                Accuracy : 0.7656          
##                  95% CI : (0.6992, 0.8236)
##     No Information Rate : 0.6198          
##     P-Value [Acc > NIR] : 1.21e-05        
##                                           
##                   Kappa : 0.5066          
##                                           
##  Mcnemar's Test P-Value : 0.7656          
##                                           
##             Sensitivity : 0.7983          
##             Specificity : 0.7123          
##          Pos Pred Value : 0.8190          
##          Neg Pred Value : 0.6842          
##              Prevalence : 0.6198          
##          Detection Rate : 0.4948          
##    Detection Prevalence : 0.6042          
##       Balanced Accuracy : 0.7553          
##                                           
##        'Positive' Class : 0               
## 

Binary Logistic Regression gives us an accuracy of 76.56%
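
caret reports 'Positive' Class : 0, so the Sensitivity above describes how well non-diabetic patients are identified, and the detection rate for diabetic patients is the value labelled Specificity. The sketch below recomputes these figures from the four cell counts in the table above; cm and the recall_* names are illustrative.

```r
# Cell counts copied from the confusion matrix above
# (matrix() fills column by column: Reference 0 first, then Reference 1).
cm <- matrix(c(95, 24, 21, 52), nrow = 2,
             dimnames = list(Prediction = c("0", "1"),
                             Reference  = c("0", "1")))

accuracy      <- sum(diag(cm)) / sum(cm)        # correct predictions / all
recall_class0 <- cm["0", "0"] / sum(cm[, "0"])  # caret's Sensitivity here
recall_class1 <- cm["1", "1"] / sum(cm[, "1"])  # caret's Specificity here

print(round(c(accuracy = accuracy,
              recall_class0 = recall_class0,
              recall_class1 = recall_class1), 4))
```

Passing positive = "1" to confusionMatrix would report sensitivity for the diabetic class directly.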

Naive Bayes Algorithm

We fit the model on the training data and measure its accuracy on the test data.

library(e1071)
## Warning: package 'e1071' was built under R version 3.6.2
NB_model<-naiveBayes(Outcome~.,data = train_diab)
NB_pred<-predict(NB_model,test_diab)
confusionMatrix(NB_pred,test_diab$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 105  32
##          1  14  41
##                                           
##                Accuracy : 0.7604          
##                  95% CI : (0.6937, 0.8189)
##     No Information Rate : 0.6198          
##     P-Value [Acc > NIR] : 2.436e-05       
##                                           
##                   Kappa : 0.4662          
##                                           
##  Mcnemar's Test P-Value : 0.01219         
##                                           
##             Sensitivity : 0.8824          
##             Specificity : 0.5616          
##          Pos Pred Value : 0.7664          
##          Neg Pred Value : 0.7455          
##              Prevalence : 0.6198          
##          Detection Rate : 0.5469          
##    Detection Prevalence : 0.7135          
##       Balanced Accuracy : 0.7220          
##                                           
##        'Positive' Class : 0               
## 

Naive Bayes gives us an accuracy of 76.04%

Support Vector Machine Algorithm

We fit the model on the training data and measure its accuracy on the test data.

SVM_model<-svm(Outcome~.,data = train_diab,kernel="linear",scale = F)
SVM_pred<-predict(SVM_model,test_diab)
SVM_pred<-as.factor(SVM_pred)
confusionMatrix(SVM_pred,test_diab$Outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 108  35
##          1  11  38
##                                           
##                Accuracy : 0.7604          
##                  95% CI : (0.6937, 0.8189)
##     No Information Rate : 0.6198          
##     P-Value [Acc > NIR] : 2.436e-05       
##                                           
##                   Kappa : 0.4572          
##                                           
##  Mcnemar's Test P-Value : 0.000696        
##                                           
##             Sensitivity : 0.9076          
##             Specificity : 0.5205          
##          Pos Pred Value : 0.7552          
##          Neg Pred Value : 0.7755          
##              Prevalence : 0.6198          
##          Detection Rate : 0.5625          
##    Detection Prevalence : 0.7448          
##       Balanced Accuracy : 0.7141          
##                                           
##        'Positive' Class : 0               
## 

SVM gives us an accuracy of 76.04%

K-Nearest Neighbor Algorithm

We fit the model on the training data and measure its accuracy on the test data.

ytrain<-diab1$Outcome[index]
ytest<-diab1$Outcome[-index]
sqrt(nrow(train_diab))
## [1] 24
library(class)
KNN_model<-knn(train_diab,test_diab,k=23,cl = ytrain)
confusionMatrix(ytest,KNN_model)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 107  12
##          1  36  37
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6826, 0.8096)
##     No Information Rate : 0.7448          
##     P-Value [Acc > NIR] : 0.4723919       
##                                           
##                   Kappa : 0.4336          
##                                           
##  Mcnemar's Test P-Value : 0.0009009       
##                                           
##             Sensitivity : 0.7483          
##             Specificity : 0.7551          
##          Pos Pred Value : 0.8992          
##          Neg Pred Value : 0.5068          
##              Prevalence : 0.7448          
##          Detection Rate : 0.5573          
##    Detection Prevalence : 0.6198          
##       Balanced Accuracy : 0.7517          
##                                           
##        'Positive' Class : 0               
## 

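KNN gives us an accuracy of 75%. Note that confusionMatrix(ytest,KNN_model) passes the true labels as the predictions and the predictions as the reference, so the table above is transposed relative to the earlier models: accuracy and Kappa are unaffected, but sensitivity, specificity, and the No Information Rate are not comparable with the values reported for the other models. In addition, train_diab and test_diab still contain the Outcome column when they are passed to knn, so the label itself is used as a predictor.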
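
The knn call above receives train_diab and test_diab with the Outcome column still included and with unscaled features, so distances are dominated by the columns with the largest numeric range. The sketch below shows one possible alternative on a small made-up data set: it drops the label from the predictors, standardises the features with the training means and standard deviations, and tabulates predictions against the reference in that order. The data values and the names train, test, features, x_train, x_test, and knn_pred are hypothetical; this is an illustration, not the analysis that produced the results above.

```r
library(class)  # provides knn(); a recommended package shipped with R

# Hypothetical toy data using two of the document's column names.
train <- data.frame(Glucose = c(85, 89, 78, 183, 197, 168),
                    BMI     = c(26.6, 28.1, 31.0, 23.3, 30.5, 38.0),
                    Outcome = factor(c(0, 0, 0, 1, 1, 1)))
test  <- data.frame(Glucose = c(90, 180),
                    BMI     = c(27.0, 33.0),
                    Outcome = factor(c(0, 1), levels = c(0, 1)))

# Keep the label out of the predictors.
features <- setdiff(names(train), "Outcome")

# Standardise using the training means and standard deviations only.
x_train <- scale(train[features])
x_test  <- scale(test[features],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Predictions first, reference second.
knn_pred <- knn(x_train, x_test, cl = train$Outcome, k = 3)
print(table(Prediction = knn_pred, Reference = test$Outcome))
```

With caret loaded, the matching call would be confusionMatrix(knn_pred, test$Outcome), which keeps the table orientation consistent with the other models.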

Conclusion :

After fitting four classification algorithms and comparing their test accuracies, we conclude that all the models achieve an accuracy between 75% and 77%. Binary Logistic Regression gives the highest accuracy, 76.56%, although the differences between models are small and lie well within the 95% confidence intervals reported above.