Data set information: This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Objective: to diagnostically predict whether or not a patient has diabetes, i.e., to classify whether a particular person has diabetes with the help of the several independent variables included in the dataset.
Dataset description: The dataset consists of 8 independent variables and one dependent variable.
The independent variables are:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Dependent variable:
Outcome: 1 if diabetes is detected, 0 if not
Below are the algorithms we use to perform the classification:
*Binary logistic regression
*Decision Tree
*Random Forest
#read the diabetes file
diabetes<-read.csv(file.choose(),header = T)#header=T treats the first row as column names rather than as data
head(diabetes)#shows the first 6 rows of the dataset
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
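As an aside, file.choose() opens an interactive file picker, which is convenient but not reproducible. In a scripted run one might hard-code the path instead (a sketch, assuming the file is saved as diabetes.csv in the working directory):
diabetes<-read.csv("diabetes.csv",header = T)#same data, but the script can now be rerun unattended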
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
On inspecting the dataset we find that some variables, namely Glucose, BloodPressure, SkinThickness, Insulin and BMI, cannot practically be exactly zero, so we replace these zero values with NA and then fill them in using kNN imputation.
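Before doing so, it is worth counting how many impossible zeros each variable contains (a quick sketch; zero_cols is our own helper name):
zero_cols<-c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")
colSums(diabetes[zero_cols]==0)#zero counts per variable; these should match the NA counts reported below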
library(dplyr)#to use select function
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diabetes1<-select(diabetes,Glucose,BloodPressure,SkinThickness,Insulin,BMI)
head(diabetes1)
## Glucose BloodPressure SkinThickness Insulin BMI
## 1 148 72 35 0 33.6
## 2 85 66 29 0 26.6
## 3 183 64 0 0 23.3
## 4 89 66 23 94 28.1
## 5 137 40 35 168 43.1
## 6 116 74 0 0 25.6
diabetes1[diabetes1==0]<-NA#replace the impossible zeros with NA (numeric comparison; quoting '0' only worked via coercion)
summary(diabetes1)#shows the NA count for each variable
## Glucose BloodPressure SkinThickness Insulin
## Min. : 44.0 Min. : 24.00 Min. : 7.00 Min. : 14.00
## 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00 1st Qu.: 76.25
## Median :117.0 Median : 72.00 Median :29.00 Median :125.00
## Mean :121.7 Mean : 72.41 Mean :29.15 Mean :155.55
## 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00 3rd Qu.:190.00
## Max. :199.0 Max. :122.00 Max. :99.00 Max. :846.00
## NA's :5 NA's :35 NA's :227 NA's :374
## BMI
## Min. :18.20
## 1st Qu.:27.50
## Median :32.30
## Mean :32.46
## 3rd Qu.:36.60
## Max. :67.10
## NA's :11
sum(is.na(diabetes1))#total number of NAs present
## [1] 652
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
diabetes1<-kNN(diabetes1,k=sqrt(nrow(diabetes1)))#kNN imputation to fill in the NAs; k = sqrt(n) is a common rule of thumb
summary(diabetes1)
## Glucose BloodPressure SkinThickness Insulin
## Min. : 44.0 Min. : 24.00 Min. : 7.00 Min. : 14.0
## 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:23.00 1st Qu.: 85.0
## Median :117.0 Median : 72.00 Median :30.00 Median :125.0
## Mean :121.6 Mean : 72.51 Mean :29.02 Mean :142.2
## 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:33.00 3rd Qu.:168.0
## Max. :199.0 Max. :122.00 Max. :99.00 Max. :846.0
## BMI Glucose_imp BloodPressure_imp SkinThickness_imp
## Min. :18.20 Mode :logical Mode :logical Mode :logical
## 1st Qu.:27.50 FALSE:763 FALSE:733 FALSE:541
## Median :32.40 TRUE :5 TRUE :35 TRUE :227
## Mean :32.49
## 3rd Qu.:36.62
## Max. :67.10
## Insulin_imp BMI_imp
## Mode :logical Mode :logical
## FALSE:394 FALSE:757
## TRUE :374 TRUE :11
##
##
##
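The summary shows the *_imp indicator columns that kNN() appends to flag imputed cells. As an aside, they can be suppressed up front via the imp_var argument (a sketch, not what was run here; k is rounded since a neighbour count should be an integer):
diabetes1<-kNN(diabetes1,k=round(sqrt(nrow(diabetes1))),imp_var=FALSE)#no *_imp columns are appended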
diabetes1=diabetes1[,1:5]#drop the *_imp indicator columns that kNN() appended
#Replace the original variables with the imputed ones
diabetes$Glucose=diabetes1$Glucose
diabetes$BloodPressure=diabetes1$BloodPressure
diabetes$SkinThickness=diabetes1$SkinThickness
diabetes$Insulin=diabetes1$Insulin
diabetes$BMI=diabetes1$BMI
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:23.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :30.00
## Mean : 3.845 Mean :121.6 Mean : 72.51 Mean :29.02
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:33.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 85.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.0 Median :32.40 Median :0.3725 Median :29.00
## Mean :142.2 Mean :32.49 Mean :0.4719 Mean :33.24
## 3rd Qu.:168.0 3rd Qu.:36.62 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
#let's check the data types
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 76 70 96 ...
## $ SkinThickness : int 35 29 28 23 35 24 32 32 45 37 ...
## $ Insulin : int 167 71 175 94 168 110 88 132 543 146 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 36.5 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
#since we are performing classification, we convert the integer dependent variable to a factor
diabetes$Outcome=factor(diabetes$Outcome)
#Normalization is required because the independent variables have different ranges, which would otherwise affect our classification models
diabetes2<-diabetes[,-9]
normalize=function(x){
return((x-min(x))/(max(x)-min(x)))
}
diabetes_norm<-normalize(diabetes2)
diabetes[,1:8]<-diabetes_norm[,1:8]#replace the original values with the normalized ones
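Note that normalize() applied to the whole data frame computes min() and max() over all columns at once, so every variable is scaled by a single global range (which is why all the values below are so small; e.g. Glucose 148 becomes 148/846 = 0.1749). If per-column min-max scaling were intended, a sketch would be:
diabetes_norm<-as.data.frame(lapply(diabetes2,normalize))#scales each column to [0,1] by its own min and max
The rest of this report uses the global scaling shown above; since a monotone rescaling preserves the ordering within each variable, the classifiers are largely unaffected, but the normalized values are not individually interpretable.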
head(diabetes)#checking head again
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.1749409 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.1004728 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.2163121 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.1052009 0.07801418 0.02718676 0.11111111 0.03321513
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.19858156 0.05094563
## 6 0.005910165 0.1371158 0.08747045 0.02836879 0.13002364 0.03026005
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 5 0.0027044917 0.03900709 1
## 6 0.0002375887 0.03546099 0
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : num 0.00709 0.00118 0.00946 0.00118 0 ...
## $ Glucose : num 0.175 0.1 0.216 0.105 0.162 ...
## $ BloodPressure : num 0.0851 0.078 0.0757 0.078 0.0473 ...
## $ SkinThickness : num 0.0414 0.0343 0.0331 0.0272 0.0414 ...
## $ Insulin : num 0.1974 0.0839 0.2069 0.1111 0.1986 ...
## $ BMI : num 0.0397 0.0314 0.0275 0.0332 0.0509 ...
## $ DiabetesPedigreeFunction: num 0.000741 0.000415 0.000794 0.000197 0.002704 ...
## $ Age : num 0.0591 0.0366 0.0378 0.0248 0.039 ...
## $ Outcome : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
Partition the dataset into a training set and a testing set, so we can build the model on one and check on the other whether it was built correctly.
library(caret)#short for Classification And REgression Training
## Loading required package: lattice
## Loading required package: ggplot2
index<-createDataPartition(diabetes$Pregnancies,p=0.75,list = F)
# argument list=F makes the output a matrix of row indices rather than a list
train_diab<-diabetes[index,]
test_diab<-diabetes[-index,]
dim(train_diab)
## [1] 578 9
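Two refinements are worth noting (a sketch; using them would change the split, so the numbers below would shift): fixing a seed makes the partition reproducible, and stratifying on Outcome rather than Pregnancies keeps the class balance similar in both sets.
set.seed(123)#any fixed seed works; 123 is arbitrary
index<-createDataPartition(diabetes$Outcome,p=0.75,list = F)#stratify on the class variable
train_diab<-diabetes[index,]
test_diab<-diabetes[-index,]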
Performing Binary logistic regression
diab_glm_model<-glm(Outcome~.,data = train_diab,family = "binomial")
# family = "binomial" is required because the response is dichotomous
summary(diab_glm_model)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4630 -0.7433 -0.3987 0.7176 2.3133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.9726 0.9655 -9.293 < 2e-16 ***
## Pregnancies 107.4530 31.9414 3.364 0.000768 ***
## Glucose 36.4327 4.2974 8.478 < 2e-16 ***
## BloodPressure -9.1180 8.3931 -1.086 0.277315
## SkinThickness 14.8158 13.9368 1.063 0.287749
## Insulin -1.9483 1.1344 -1.717 0.085913 .
## BMI 68.8638 18.9490 3.634 0.000279 ***
## DiabetesPedigreeFunction 459.5688 286.2270 1.606 0.108360
## Age 4.9197 8.9470 0.550 0.582407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 757.53 on 577 degrees of freedom
## Residual deviance: 548.49 on 569 degrees of freedom
## AIC: 566.49
##
## Number of Fisher Scoring iterations: 5
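The coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read (a small sketch; note the predictors are normalized, so each ratio is per unit of the scaled variable):
exp(coef(diab_glm_model))#odds ratio per one-unit change in each (scaled) predictor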
#Let's check the model's predicted probabilities on the training data
train_diab$pred_prob_outcome<-fitted(diab_glm_model)#gives the probability of Outcome being 1
To classify the dependent variable (Outcome) as 1 or 0, we first need to find a threshold.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train_diab$pred_prob_outcome,train_diab$Outcome)
#compare the model's predicted probabilities with the actual outcomes
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,1,by=0.05))#mark the cutoffs from 0.1 to 1 on the ROC curve
The plot shows sensitivity (true positive rate) on the y-axis against 1 - specificity (false positive rate) on the x-axis. We select a threshold at which sensitivity and specificity are roughly balanced; in this plot that appears to be about 0.35, so we classify the results using a threshold of 0.35.
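Rather than eyeballing the plot, the cutoff can also be picked programmatically from the ROCR object, e.g. by maximizing Youden's J = sensitivity + specificity - 1 (a sketch; it should land near the 0.35 read off the plot):
cutoffs<-perf@alpha.values[[1]]#the probability thresholds ROCR evaluated
youden<-perf@y.values[[1]]-perf@x.values[[1]]#tpr - fpr at each threshold
cutoffs[which.max(youden)]#threshold with the best sensitivity/specificity trade-off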
train_diab$pred_Outcome<-ifelse(train_diab$pred_prob_outcome>0.35,1,0)
#stores the predicted class: 1 if the predicted probability exceeds 0.35, else 0
head(train_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.17494090 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.10047281 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.21631206 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.10520095 0.07801418 0.02718676 0.11111111 0.03321513
## 6 0.005910165 0.13711584 0.08747045 0.02836879 0.13002364 0.03026005
## 7 0.003546099 0.09219858 0.05910165 0.03782506 0.10401891 0.03664303
## DiabetesPedigreeFunction Age Outcome pred_prob_outcome
## 1 0.0007411348 0.05910165 1 0.72746654
## 2 0.0004148936 0.03664303 0 0.04671357
## 3 0.0007943262 0.03782506 1 0.85441601
## 4 0.0001973995 0.02482270 0 0.04575268
## 6 0.0002375887 0.03546099 0 0.16726717
## 7 0.0002931442 0.03073286 1 0.06885921
## pred_Outcome
## 1 1
## 2 0
## 3 1
## 4 0
## 6 0
## 7 0
#no need to compute accuracy, sensitivity and specificity by hand
#confusionMatrix() below reports them all at once
confusionMatrix(table(train_diab$Outcome,train_diab$pred_Outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 283 85
## 1 53 157
##
## Accuracy : 0.7612
## 95% CI : (0.7243, 0.7955)
## No Information Rate : 0.5813
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5003
##
## Mcnemar's Test P-Value : 0.008318
##
## Sensitivity : 0.8423
## Specificity : 0.6488
## Pos Pred Value : 0.7690
## Neg Pred Value : 0.7476
## Prevalence : 0.5813
## Detection Rate : 0.4896
## Detection Prevalence : 0.6367
## Balanced Accuracy : 0.7455
##
## 'Positive' Class : 0
##
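As a sanity check, the headline numbers can be reproduced from the table by hand (a sketch; the table's rows are the actual classes and its columns the predicted ones):
tab<-table(train_diab$Outcome,train_diab$pred_Outcome)
sum(diag(tab))/sum(tab)#accuracy: (283+157)/578 = 0.7612
tab["0","0"]/sum(tab[,"0"])#sensitivity of the positive class "0": 283/336 = 0.8423
tab["1","1"]/sum(tab[,"1"])#specificity: 157/242 = 0.6488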
Accuracy = 76.12%, sensitivity = 84.23% and specificity = 64.88% on the training data, which is acceptable.
Let's check the model on the test data.
test_diab$pred_prob_outcome<-predict(diab_glm_model,test_diab,type = "response")
#type = "response" makes predict() return probabilities; the default is the link (log-odds) scale
test_diab$pred_outcome<-ifelse(test_diab$pred_prob_outcome>0.35,1,0)#use the same threshold as on the training data
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome pred_prob_outcome
## 5 0.0027044917 0.03900709 1 0.8409526
## 9 0.0001867612 0.06264775 1 0.8061292
## 10 0.0002742317 0.06382979 1 0.5286710
## 12 0.0006347518 0.04018913 1 0.9163883
## 18 0.0003002364 0.03664303 1 0.2108288
## 22 0.0004586288 0.05910165 0 0.2897206
## pred_outcome
## 5 1
## 9 1
## 10 1
## 12 1
## 18 0
## 22 0
#check accuracy, sensitivity and specificity on the test data
confusionMatrix(table(test_diab$Outcome,test_diab$pred_outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 92 40
## 1 12 46
##
## Accuracy : 0.7263
## 95% CI : (0.6571, 0.7884)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 2.986e-07
##
## Kappa : 0.4317
##
## Mcnemar's Test P-Value : 0.000181
##
## Sensitivity : 0.8846
## Specificity : 0.5349
## Pos Pred Value : 0.6970
## Neg Pred Value : 0.7931
## Prevalence : 0.5474
## Detection Rate : 0.4842
## Detection Prevalence : 0.6947
## Balanced Accuracy : 0.7097
##
## 'Positive' Class : 0
##
Accuracy = 72.63%, sensitivity = 88.46% and specificity = 53.49% on the test data, which is acceptable.
To quantify the model's overall discriminating power, we compute the area under the ROC curve (AUC):
auc<-performance(pred,"auc")
auc@y.values
## [[1]]
## [1] 0.8398421
We can conclude that the model achieves a test accuracy of 72.63% (a misclassification rate of 27.37%) with an AUC of 0.8398.
Decision Tree
Before implementing the decision tree, we need to remove the extra columns we added while performing the binary logistic regression.
train_diab$pred_Outcome<-NULL
train_diab$pred_prob_outcome<-NULL
test_diab$pred_outcome<-NULL
test_diab$pred_prob_outcome<-NULL
head(train_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.17494090 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.10047281 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.21631206 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.10520095 0.07801418 0.02718676 0.11111111 0.03321513
## 6 0.005910165 0.13711584 0.08747045 0.02836879 0.13002364 0.03026005
## 7 0.003546099 0.09219858 0.05910165 0.03782506 0.10401891 0.03664303
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 6 0.0002375887 0.03546099 0
## 7 0.0002931442 0.03073286 1
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome
## 5 0.0027044917 0.03900709 1
## 9 0.0001867612 0.06264775 1
## 10 0.0002742317 0.06382979 1
## 12 0.0006347518 0.04018913 1
## 18 0.0003002364 0.03664303 1
## 22 0.0004586288 0.05910165 0
library(rpart)
library(rpart.plot)
tree_diab_model<-rpart(Outcome~.,data = train_diab)
test_diab$pred_outcome<-predict(tree_diab_model,test_diab,type = "class")
test_diab$pred_income<-NULL#a no-op: no pred_income column exists; apparently left over from another script
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome pred_outcome
## 5 0.0027044917 0.03900709 1 1
## 9 0.0001867612 0.06264775 1 1
## 10 0.0002742317 0.06382979 1 0
## 12 0.0006347518 0.04018913 1 1
## 18 0.0003002364 0.03664303 1 1
## 22 0.0004586288 0.05910165 0 0
#check accuracy, sensitivity and specificity on the test data
confusionMatrix(table(test_diab$Outcome,test_diab$pred_outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 105 27
## 1 24 34
##
## Accuracy : 0.7316
## 95% CI : (0.6626, 0.7931)
## No Information Rate : 0.6789
## P-Value [Acc > NIR] : 0.0683
##
## Kappa : 0.3762
##
## Mcnemar's Test P-Value : 0.7794
##
## Sensitivity : 0.8140
## Specificity : 0.5574
## Pos Pred Value : 0.7955
## Neg Pred Value : 0.5862
## Prevalence : 0.6789
## Detection Rate : 0.5526
## Detection Prevalence : 0.6947
## Balanced Accuracy : 0.6857
##
## 'Positive' Class : 0
##
For the decision tree algorithm, accuracy = 73.16%, sensitivity = 81.40% and specificity = 55.74% on the test data, which is acceptable.
Plot the decision tree:
rpart.plot(tree_diab_model,cex = 0.7)
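A possible next step (a sketch) is to inspect rpart's complexity table and prune the tree at the cross-validated optimum, which often improves generalization:
printcp(tree_diab_model)#the xerror column holds the cross-validated error for each cp value
best_cp<-tree_diab_model$cptable[which.min(tree_diab_model$cptable[,"xerror"]),"CP"]
pruned_tree<-prune(tree_diab_model,cp=best_cp)
rpart.plot(pruned_tree,cex = 0.7)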
Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf_diab_model<-randomForest(Outcome~.,data = diabetes)#fitted on the full dataset; the OOB error acts as an internal validation estimate
head(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.1749409 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.1004728 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.2163121 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.1052009 0.07801418 0.02718676 0.11111111 0.03321513
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.19858156 0.05094563
## 6 0.005910165 0.1371158 0.08747045 0.02836879 0.13002364 0.03026005
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 5 0.0027044917 0.03900709 1
## 6 0.0002375887 0.03546099 0
Here the out-of-bag (OOB) error gives us the misclassification rate (MCR) of the model. In this case it comes out to be 24.35%, which corresponds to an accuracy of 75.65%.
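The OOB figure quoted above comes from printing the fitted model object; the forest's variable importances are also worth a glance (a sketch):
print(rf_diab_model)#shows the OOB error estimate and the OOB confusion matrix
varImpPlot(rf_diab_model)#mean decrease in Gini for each predictor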
To check the class-wise error:
plot(rf_diab_model)
The red line represents the MCR for Outcome = 0 (person does not have diabetes), the green line the MCR for Outcome = 1 (person has diabetes), and the black line the overall MCR, i.e., the OOB error. The overall error rate is what we are interested in, and it looks reasonably good.
Accuracy: binary logistic regression: 72.63%; decision tree: 73.16%; random forest: 75.65%.
We can conclude that, among the three algorithms used to build prediction models on the diabetes data, the random forest achieves slightly higher accuracy than the decision tree and binary logistic regression, keeping in mind that its figure is an OOB estimate on the full dataset rather than a held-out test accuracy.