This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
The main objective is to predict whether a given person has diabetes or not.
The methods we intend to use are binary logistic regression, Naive Bayes, support vector machine (SVM), and k-nearest neighbours (KNN).
Prepare the data
diab<-read.csv("D:/R studio files/diabetes.csv",header = T)
head(diab)
str(diab)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Checking whether the data contains any missing (NA) values:
colSums(is.na(diab))
## Pregnancies Glucose BloodPressure
## 0 0 0
## SkinThickness Insulin BMI
## 0 0 0
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
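There are no NA values in our data. Note, however, that several columns contain zeros (for example BMI and BloodPressure in the str() output above), which are not plausible measurements and may encode missing values.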
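Because is.na() only detects explicit NA values, zeros that stand in for missing measurements go unnoticed. The following is a minimal sketch of a per-column zero count, using only the first ten rows shown in the str() output above; the data frame name demo and the choice of columns to check are our assumptions.

```r
# First ten rows, copied from the str() output above (illustration only).
demo <- data.frame(
  Glucose       = c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125),
  BloodPressure = c(72, 66, 64, 66, 40, 74, 50, 0, 70, 96),
  SkinThickness = c(35, 29, 0, 23, 35, 0, 32, 0, 45, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0, 88, 0, 543, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31, 35.3, 30.5, 0)
)

# Count zeros per column; zero is not a plausible value for these measurements.
zero_counts <- colSums(demo == 0)
print(zero_counts)
```

On the full data the same count could be obtained with colSums(diab[, names(demo)] == 0).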
Checking the summary of the data:
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
diab1<-diab
Converting our dependent variable to a factor and binning Age into three groups (c1: up to 30, c2: 31 to 50, c3: above 50):
diab1$Outcome<-factor(diab1$Outcome)
diab1$Age<-cut(diab1$Age,breaks = c(18,30,50,Inf),labels = c("c1","c2","c3"))
library(ggplot2)
ggplot(diab1,aes(x=Age)) + geom_bar(aes(fill=Outcome)) +labs(x = "Age Group",y="Frequency",title = "Age Wise Distribution")
The plot suggests that patients in the 31-50 age group (c2) are more likely to be diabetic than patients in the other age groups.
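Since a stacked bar chart shows counts, the share of diabetic patients within each age group is easier to read from a row-wise proportion table. The sketch below applies the same cut() breaks to the first ten Age and Outcome values from the str() output above; the variable names are ours, and ten rows are far too few to say anything about the full data.

```r
# First ten Age and Outcome values, copied from the str() output (illustration only).
age     <- c(50, 31, 32, 21, 33, 30, 26, 29, 53, 54)
outcome <- factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))

# Same binning as in the analysis: (18,30] -> c1, (30,50] -> c2, (50,Inf] -> c3.
age_group <- cut(age, breaks = c(18, 30, 50, Inf), labels = c("c1", "c2", "c3"))

# Row-wise proportions: share of each outcome within an age group.
share <- prop.table(table(age_group, outcome), margin = 1)
print(round(share, 2))
```

In the plot itself, geom_bar(aes(fill = Outcome), position = "fill") would display the same shares instead of counts.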
Partitioning the data into training and test sets:
set.seed(100)
index<-sample(nrow(diab1),0.75*nrow(diab1))
train_diab<-diab1[index,]
test_diab<-diab1[-index,]
dim(train_diab)
## [1] 576 9
dim(test_diab)
## [1] 192 9
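The split above draws 75% of the row numbers at random and uses the remaining rows for testing. A minimal sketch of the same idea on row numbers alone, assuming 768 rows as reported by str(); the names train_rows and test_rows are ours.

```r
set.seed(100)
n <- 768                                  # number of rows reported by str()

# Draw 75% of the row numbers without replacement for training.
train_rows <- sample(n, 0.75 * n)

# Everything not drawn goes to the test set (same effect as diab1[-index, ]).
test_rows <- setdiff(seq_len(n), train_rows)

print(c(train = length(train_rows), test = length(test_rows)))
```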
BLR<-glm(Outcome~.,data = train_diab,family = "binomial")
summary(BLR)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5460 -0.7113 -0.4381 0.7803 2.6463
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.093102 0.762263 -9.305 < 2e-16 ***
## Pregnancies 0.064940 0.038808 1.673 0.094253 .
## Glucose 0.031168 0.004031 7.732 1.06e-14 ***
## BloodPressure -0.014245 0.006039 -2.359 0.018340 *
## SkinThickness -0.001967 0.007946 -0.248 0.804489
## Insulin -0.001031 0.001001 -1.030 0.303018
## BMI 0.079331 0.017703 4.481 7.42e-06 ***
## DiabetesPedigreeFunction 0.794737 0.339184 2.343 0.019125 *
## Agec2 0.972621 0.276658 3.516 0.000439 ***
## Agec3 0.556236 0.386472 1.439 0.150076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 743.86 on 575 degrees of freedom
## Residual deviance: 553.32 on 566 degrees of freedom
## AIC: 573.32
##
## Number of Fisher Scoring iterations: 5
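The z value and Pr(>|z|) columns in the table above follow from the estimate and its standard error via a Wald test. As a sketch, the following recomputes them for a few rows, with the numbers copied from the summary output; the vector names are ours.

```r
# Estimates and standard errors copied from the summary(BLR) output above.
est <- c(Glucose = 0.031168, SkinThickness = -0.001967, Insulin = -0.001031,
         Agec2 = 0.972621, Agec3 = 0.556236)
se  <- c(Glucose = 0.004031, SkinThickness = 0.007946, Insulin = 0.001001,
         Agec2 = 0.276658, Agec3 = 0.386472)

# Wald statistic and two-sided p-value under a standard normal reference.
z <- est / se
p <- 2 * pnorm(-abs(z))

print(round(cbind(z = z, p = p), 4))
```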
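As we can see, SkinThickness and Insulin are not significant (p > 0.3), and of the two Age levels only c2 is significant (c3 has p = 0.15). We remove these three columns and partition the data again. Because sample() is called again without resetting the seed, the new split differs from the first one.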
col1<-c("SkinThickness","Insulin","Age")
diab1[,col1]<-list(NULL)
index<-sample(nrow(diab1),0.75*nrow(diab1))
train_diab<-diab1[index,]
test_diab<-diab1[-index,]
dim(train_diab)
## [1] 576 6
dim(test_diab)
## [1] 192 6
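A second call to sample() continues from the current state of the random number generator, so it does not reproduce the earlier draw unless the seed is set again. A small sketch of this behaviour, using the row counts from above; the object names are ours.

```r
set.seed(100)
first_split  <- sample(768, 576)
second_split <- sample(768, 576)    # RNG state has advanced: a different draw

set.seed(100)
repeated_split <- sample(768, 576)  # same seed again: same draw as first_split

print(identical(first_split, repeated_split))
print(identical(first_split, second_split))
```

If both models are meant to be compared on the same rows, reusing the original index (or resetting the seed before the second sample() call) would be one way to do that.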
BLR1<-glm(Outcome~.,data = train_diab,family = "binomial")
summary(BLR1)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8133 -0.7213 -0.4379 0.7137 3.0974
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.461989 0.767443 -9.723 < 2e-16 ***
## Pregnancies 0.161791 0.032817 4.930 8.22e-07 ***
## Glucose 0.036411 0.004075 8.935 < 2e-16 ***
## BloodPressure -0.014606 0.005896 -2.477 0.0132 *
## BMI 0.066081 0.015528 4.256 2.08e-05 ***
## DiabetesPedigreeFunction 0.934784 0.333781 2.801 0.0051 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 737.35 on 575 degrees of freedom
## Residual deviance: 546.37 on 570 degrees of freedom
## AIC: 558.37
##
## Number of Fisher Scoring iterations: 5
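The coefficients are on the log-odds scale, so exponentiating them gives odds ratios, and the inverse logit of the linear predictor gives a predicted probability. The sketch below uses the estimates copied from the summary(BLR1) output; the patient values are hypothetical and chosen only for illustration.

```r
# Estimates copied from the summary(BLR1) output above.
coefs <- c("(Intercept)" = -7.461989, Pregnancies = 0.161791,
           Glucose = 0.036411, BloodPressure = -0.014606,
           BMI = 0.066081, DiabetesPedigreeFunction = 0.934784)

# Odds ratio per one-unit increase in each predictor.
odds_ratios <- exp(coefs[-1])
print(round(odds_ratios, 3))

# Hypothetical patient (made-up values, not taken from the data).
patient <- c(Pregnancies = 2, Glucose = 150, BloodPressure = 70,
             BMI = 33, DiabetesPedigreeFunction = 0.5)

# Linear predictor, then inverse logit to get a probability.
linear_predictor <- coefs[["(Intercept)"]] + sum(coefs[names(patient)] * patient)
prob <- plogis(linear_predictor)
print(round(prob, 3))
```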
train_diab_BLR<-fitted(BLR1)
library(ROCR)
pred<-prediction(train_diab_BLR,train_diab$Outcome)
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,by=0.05))
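Each point on the ROC curve is the true positive rate and false positive rate obtained at one cutoff. A minimal base R sketch of that calculation follows; the scores and labels are made-up toy values, not model output, and the function name rates_at is ours. The classification rule mirrors the one used below (predict 1 unless the score is below the cutoff).

```r
# Made-up scores and true labels, for illustration only.
scores <- c(0.90, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05)
labels <- c(1, 1, 0, 1, 1, 0, 0, 1, 0, 0)

# True positive rate and false positive rate at a given cutoff.
rates_at <- function(cutoff, scores, labels) {
  predicted <- ifelse(scores < cutoff, 0, 1)
  c(tpr = sum(predicted == 1 & labels == 1) / sum(labels == 1),
    fpr = sum(predicted == 1 & labels == 0) / sum(labels == 0))
}

# One row per cutoff; lowering the cutoff raises both rates.
roc_points <- t(sapply(c(0.35, 0.50), rates_at, scores = scores, labels = labels))
rownames(roc_points) <- c("0.35", "0.50")
print(roc_points)
```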
library(caret)
pred_BLR<-predict(BLR1,test_diab,type="response")
pred_BLR1<-ifelse(pred_BLR<0.35,0,1)
pred_BLR1<-as.factor(pred_BLR1)
test_diab$Outcome<-as.factor(test_diab$Outcome)
confusionMatrix(pred_BLR1,test_diab$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 95 21
## 1 24 52
##
## Accuracy : 0.7656
## 95% CI : (0.6992, 0.8236)
## No Information Rate : 0.6198
## P-Value [Acc > NIR] : 1.21e-05
##
## Kappa : 0.5066
##
## Mcnemar's Test P-Value : 0.7656
##
## Sensitivity : 0.7983
## Specificity : 0.7123
## Pos Pred Value : 0.8190
## Neg Pred Value : 0.6842
## Prevalence : 0.6198
## Detection Rate : 0.4948
## Detection Prevalence : 0.6042
## Balanced Accuracy : 0.7553
##
## 'Positive' Class : 0
##
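Binary logistic regression gives us an accuracy of 76.56% on the test data. Note that caret reports class 0 (non-diabetic) as the 'Positive' class, so the sensitivity of 0.7983 refers to non-diabetic patients; the share of diabetic patients correctly identified is the specificity, 0.7123.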
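The headline statistics can be recomputed directly from the four cell counts of the confusion matrix. A sketch using the counts printed above, with class 0 treated as the positive class as in the caret output; the object names are ours.

```r
# Cell counts copied from the confusion matrix above (rows: prediction, columns: reference).
cm <- matrix(c(95, 24, 21, 52), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)           # correct predictions / all cases
sensitivity <- cm["0", "0"] / sum(cm[, "0"])     # class 0 correctly identified
specificity <- cm["1", "1"] / sum(cm[, "1"])     # class 1 correctly identified

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity), 4))
```

Passing positive = "1" to confusionMatrix() would report these rates with the diabetic class as the positive class instead.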
Naive Bayes: we build the model on the training data and measure its accuracy on the test data.
library(e1071)
NB_model<-naiveBayes(Outcome~.,data = train_diab)
NB_pred<-predict(NB_model,test_diab)
confusionMatrix(NB_pred,test_diab$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 105 32
## 1 14 41
##
## Accuracy : 0.7604
## 95% CI : (0.6937, 0.8189)
## No Information Rate : 0.6198
## P-Value [Acc > NIR] : 2.436e-05
##
## Kappa : 0.4662
##
## Mcnemar's Test P-Value : 0.01219
##
## Sensitivity : 0.8824
## Specificity : 0.5616
## Pos Pred Value : 0.7664
## Neg Pred Value : 0.7455
## Prevalence : 0.6198
## Detection Rate : 0.5469
## Detection Prevalence : 0.7135
## Balanced Accuracy : 0.7220
##
## 'Positive' Class : 0
##
Naive Bayes gives us an accuracy of 76.04% on the test data.
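For numeric predictors, naiveBayes() in e1071 assumes a normal distribution within each class. The sketch below works through that calculation by hand for a single predictor, using the first ten Glucose and Outcome values from the str() output; it is a simplified illustration rather than a reproduction of the fitted model, and the value x_new is hypothetical.

```r
# First ten Glucose and Outcome values, copied from the str() output (illustration only).
glucose <- c(148, 85, 183, 89, 137, 116, 78, 115, 197, 125)
outcome <- factor(c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1))

by_class <- split(glucose, outcome)
prior <- sapply(by_class, length) / length(glucose)   # class frequencies
mu    <- sapply(by_class, mean)                       # per-class mean
sigma <- sapply(by_class, sd)                         # per-class standard deviation

# Hypothetical new glucose reading.
x_new <- 150

# Bayes' rule: posterior is proportional to prior times Gaussian likelihood.
likelihood <- dnorm(x_new, mean = mu, sd = sigma)
posterior  <- prior * likelihood / sum(prior * likelihood)
print(round(posterior, 3))
```

With several predictors, the per-predictor likelihoods are multiplied together under the independence assumption that gives the method its name.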
Support Vector Machine (SVM): we build the model on the training data and measure its accuracy on the test data.
SVM_model<-svm(Outcome~.,data = train_diab,kernel="linear",scale = F)
SVM_pred<-predict(SVM_model,test_diab)
SVM_pred<-as.factor(SVM_pred)
confusionMatrix(SVM_pred,test_diab$Outcome)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 108 35
## 1 11 38
##
## Accuracy : 0.7604
## 95% CI : (0.6937, 0.8189)
## No Information Rate : 0.6198
## P-Value [Acc > NIR] : 2.436e-05
##
## Kappa : 0.4572
##
## Mcnemar's Test P-Value : 0.000696
##
## Sensitivity : 0.9076
## Specificity : 0.5205
## Pos Pred Value : 0.7552
## Neg Pred Value : 0.7755
## Prevalence : 0.6198
## Detection Rate : 0.5625
## Detection Prevalence : 0.7448
## Balanced Accuracy : 0.7141
##
## 'Positive' Class : 0
##
SVM gives us an accuracy of 76.04% on the test data.
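The model above is fitted with scale = F, so predictors with large ranges such as Glucose carry more weight than small-range ones such as DiabetesPedigreeFunction (svm() scales by default). A sketch of what standardising with training-set statistics looks like, using the first ten Glucose and BMI values from the str() output and an arbitrary 7/3 split of our own; we have not rerun the model, so the effect on accuracy here is unknown.

```r
# First ten rows from the str() output, split arbitrarily for illustration.
train_x <- data.frame(Glucose = c(148, 85, 183, 89, 137, 116, 78),
                      BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31))
test_x  <- data.frame(Glucose = c(115, 197, 125),
                      BMI     = c(35.3, 30.5, 0))

# Centre and scale the training data to mean 0 and standard deviation 1.
train_scaled <- scale(train_x)

# Apply the training means and standard deviations to the test data.
test_scaled <- scale(test_x,
                     center = attr(train_scaled, "scaled:center"),
                     scale  = attr(train_scaled, "scaled:scale"))
print(round(test_scaled, 2))
```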
K-Nearest Neighbours (KNN): we build the model on the training data and measure its accuracy on the test data.
ytrain<-diab1$Outcome[index]
ytest<-diab1$Outcome[-index]
sqrt(nrow(train_diab))
## [1] 24
library(class)
KNN_model<-knn(train_diab,test_diab,k=23,cl = ytrain)
confusionMatrix(ytest,KNN_model)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 107 12
## 1 36 37
##
## Accuracy : 0.75
## 95% CI : (0.6826, 0.8096)
## No Information Rate : 0.7448
## P-Value [Acc > NIR] : 0.4723919
##
## Kappa : 0.4336
##
## Mcnemar's Test P-Value : 0.0009009
##
## Sensitivity : 0.7483
## Specificity : 0.7551
## Pos Pred Value : 0.8992
## Neg Pred Value : 0.5068
## Prevalence : 0.7448
## Detection Rate : 0.5573
## Detection Prevalence : 0.6198
## Balanced Accuracy : 0.7517
##
## 'Positive' Class : 0
##
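KNN gives us an accuracy of 75% on the test data. Two caveats apply to this result. First, confusionMatrix() is called with ytest as the prediction and KNN_model as the reference, so the rows of the table are the actual classes and the columns are the predictions; accuracy is unaffected, but sensitivity, specificity and the no-information rate are computed against the predictions. Second, train_diab and test_diab still contain the Outcome column, so the label itself is among the features passed to knn().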
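The value k = 23 is close to the square root of the training set size, which is a common rule of thumb; taking the nearest odd number below it avoids ties in a two-class vote. The sketch below reproduces that choice and builds a feature list without the label. Whether this is how k was actually chosen is our assumption, and the knn() call is shown only as a comment because it has not been run here.

```r
n_train <- 576                       # training rows, from dim(train_diab)

# Rule of thumb: k near sqrt(n), reduced to an odd number to avoid tied votes.
k <- floor(sqrt(n_train))
if (k %% 2 == 0) k <- k - 1
print(k)

# Columns remaining in diab1 after dropping SkinThickness, Insulin and Age.
columns <- c("Pregnancies", "Glucose", "BloodPressure", "BMI",
             "DiabetesPedigreeFunction", "Outcome")

# Keep the label out of the distance computation.
feature_cols <- setdiff(columns, "Outcome")
print(feature_cols)

# Possible call (not run): predictions first, true labels second in confusionMatrix().
# KNN_model <- knn(train_diab[, feature_cols], test_diab[, feature_cols], cl = ytrain, k = k)
# confusionMatrix(KNN_model, ytest)
```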
After fitting several classification algorithms and comparing their test accuracies, we can conclude that all models reach an accuracy between 75% and 77%. Binary logistic regression is slightly ahead at 76.56%, although the differences are small relative to the 95% confidence intervals reported above.
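The four accuracies can be put side by side by recomputing them from the diagonal counts of the confusion matrices above. The sketch also computes an exact binomial interval for the logistic regression accuracy with binom.test(), which we assume is comparable to the interval printed by confusionMatrix(); the object names are ours.

```r
# Correctly classified test cases, from the diagonals of the confusion matrices above.
correct <- c(LogisticRegression = 95 + 52, NaiveBayes = 105 + 41,
             SVM = 108 + 38, KNN = 107 + 37)
n_test <- 192

accuracy <- correct / n_test
print(round(accuracy, 4))

# Exact 95% binomial interval for the logistic regression accuracy.
ci <- binom.test(correct[["LogisticRegression"]], n_test)$conf.int
print(round(ci, 4))

# Do the other accuracies fall inside that interval?
print(accuracy > ci[1] & accuracy < ci[2])
```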