Data set information: This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Objective: to diagnostically predict whether or not a patient has diabetes, i.e., to classify whether a particular person has diabetes with the help of the several independent variables included in the dataset.
Dataset description: The dataset consists of 8 independent variables and one dependent variable.
The independent variables are:
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Dependent variable:
Outcome: 1 if diabetes is detected, 0 if not
Below are the algorithms we use to perform the classification:
*Binary logistic regression
*Decision Tree
*Random Forest
#read the diabetes file
diabetes<-read.csv(file.choose(),header = T)#header=T treats the first row as column names rather than as data
head(diabetes)#shows the first 6 rows of the dataset
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
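As an aside, file.choose() opens an interactive file picker, which is convenient but not reproducible. In a scripted run one might hard-code the path instead (a sketch, assuming the file is saved as diabetes.csv in the working directory):
diabetes<-read.csv("diabetes.csv",header = T)#same data, but the script can now be rerun unattended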
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
On inspecting the dataset we find that some variables, namely Glucose, BloodPressure, SkinThickness, Insulin and BMI, cannot practically be exactly zero, so we replace these zero values with NA and then fill them in using kNN imputation.
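Before doing so, it is worth counting how many impossible zeros each variable contains (a quick sketch; zero_cols is our own helper name):
zero_cols<-c("Glucose","BloodPressure","SkinThickness","Insulin","BMI")
colSums(diabetes[zero_cols]==0)#zero counts per variable; these should match the NA counts reported below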
library(dplyr)#to use select function
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
diabetes1<-select(diabetes,Glucose,BloodPressure,SkinThickness,Insulin,BMI)
head(diabetes1)
## Glucose BloodPressure SkinThickness Insulin BMI
## 1 148 72 35 0 33.6
## 2 85 66 29 0 26.6
## 3 183 64 0 0 23.3
## 4 89 66 23 94 28.1
## 5 137 40 35 168 43.1
## 6 116 74 0 0 25.6
diabetes1[diabetes1==0]<-NA#replace the impossible zeros with NA (numeric comparison; quoting '0' only worked via coercion)
summary(diabetes1)#shows the NA count for each variable
## Glucose BloodPressure SkinThickness Insulin
## Min. : 44.0 Min. : 24.00 Min. : 7.00 Min. : 14.00
## 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:22.00 1st Qu.: 76.25
## Median :117.0 Median : 72.00 Median :29.00 Median :125.00
## Mean :121.7 Mean : 72.41 Mean :29.15 Mean :155.55
## 3rd Qu.:141.0 3rd Qu.: 80.00 3rd Qu.:36.00 3rd Qu.:190.00
## Max. :199.0 Max. :122.00 Max. :99.00 Max. :846.00
## NA's :5 NA's :35 NA's :227 NA's :374
## BMI
## Min. :18.20
## 1st Qu.:27.50
## Median :32.30
## Mean :32.46
## 3rd Qu.:36.60
## Max. :67.10
## NA's :11
sum(is.na(diabetes1))#total number of NAs present
## [1] 652
library(VIM)
## Loading required package: colorspace
## Loading required package: grid
## Loading required package: data.table
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## VIM is ready to use.
## Since version 4.0.0 the GUI is in its own package VIMGUI.
##
## Please use the package to use the new (and old) GUI.
## Suggestions and bug-reports can be submitted at: https://github.com/alexkowa/VIM/issues
##
## Attaching package: 'VIM'
## The following object is masked from 'package:datasets':
##
## sleep
diabetes1<-kNN(diabetes1,k=sqrt(nrow(diabetes1)))#kNN imputation to fill in the NAs; k = sqrt(n) is a common rule of thumb
summary(diabetes1)
## Glucose BloodPressure SkinThickness Insulin
## Min. : 44.0 Min. : 24.00 Min. : 7.00 Min. : 14.0
## 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:23.00 1st Qu.: 85.0
## Median :117.0 Median : 72.00 Median :30.00 Median :125.0
## Mean :121.6 Mean : 72.51 Mean :29.02 Mean :142.2
## 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:33.00 3rd Qu.:168.0
## Max. :199.0 Max. :122.00 Max. :99.00 Max. :846.0
## BMI Glucose_imp BloodPressure_imp SkinThickness_imp
## Min. :18.20 Mode :logical Mode :logical Mode :logical
## 1st Qu.:27.50 FALSE:763 FALSE:733 FALSE:541
## Median :32.40 TRUE :5 TRUE :35 TRUE :227
## Mean :32.49
## 3rd Qu.:36.62
## Max. :67.10
## Insulin_imp BMI_imp
## Mode :logical Mode :logical
## FALSE:394 FALSE:757
## TRUE :374 TRUE :11
##
##
##
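The summary shows the *_imp indicator columns that kNN() appends to flag imputed cells. As an aside, they can be suppressed up front via the imp_var argument (a sketch, not what was run here; k is rounded since a neighbour count should be an integer):
diabetes1<-kNN(diabetes1,k=round(sqrt(nrow(diabetes1))),imp_var=FALSE)#no *_imp columns are appended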
diabetes1=diabetes1[,1:5]#drop the *_imp indicator columns that kNN() appended
#Replace the original variables with the imputed ones
diabetes$Glucose=diabetes1$Glucose
diabetes$BloodPressure=diabetes1$BloodPressure
diabetes$SkinThickness=diabetes1$SkinThickness
diabetes$Insulin=diabetes1$Insulin
diabetes$BMI=diabetes1$BMI
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.0 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 64.00 1st Qu.:23.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :30.00
## Mean : 3.845 Mean :121.6 Mean : 72.51 Mean :29.02
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:33.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.: 85.0 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :125.0 Median :32.40 Median :0.3725 Median :29.00
## Mean :142.2 Mean :32.49 Mean :0.4719 Mean :33.24
## 3rd Qu.:168.0 3rd Qu.:36.62 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
#let's check the data types
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 76 70 96 ...
## $ SkinThickness : int 35 29 28 23 35 24 32 32 45 37 ...
## $ Insulin : int 167 71 175 94 168 110 88 132 543 146 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 36.5 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
#since we are performing classification, we convert the integer dependent variable to a factor
diabetes$Outcome=factor(diabetes$Outcome)
#Normalization is required because the independent variables have different ranges, which would otherwise affect our classification models
diabetes2<-diabetes[,-9]
normalize=function(x){
return((x-min(x))/(max(x)-min(x)))
}
diabetes_norm<-normalize(diabetes2)
diabetes[,1:8]<-diabetes_norm[,1:8]#replace the original values with the normalized ones
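Note that normalize() applied to the whole data frame computes min() and max() over all columns at once, so every variable is scaled by a single global range (which is why all the values below are so small; e.g. Glucose 148 becomes 148/846 = 0.1749). If per-column min-max scaling were intended, a sketch would be:
diabetes_norm<-as.data.frame(lapply(diabetes2,normalize))#scales each column to [0,1] by its own min and max
The rest of this report uses the global scaling shown above; since a monotone rescaling preserves the ordering within each variable, the classifiers are largely unaffected, but the normalized values are not individually interpretable.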
head(diabetes)#checking head again
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.1749409 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.1004728 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.2163121 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.1052009 0.07801418 0.02718676 0.11111111 0.03321513
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.19858156 0.05094563
## 6 0.005910165 0.1371158 0.08747045 0.02836879 0.13002364 0.03026005
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 5 0.0027044917 0.03900709 1
## 6 0.0002375887 0.03546099 0
str(diabetes)
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : num 0.00709 0.00118 0.00946 0.00118 0 ...
## $ Glucose : num 0.175 0.1 0.216 0.105 0.162 ...
## $ BloodPressure : num 0.0851 0.078 0.0757 0.078 0.0473 ...
## $ SkinThickness : num 0.0414 0.0343 0.0331 0.0272 0.0414 ...
## $ Insulin : num 0.1974 0.0839 0.2069 0.1111 0.1986 ...
## $ BMI : num 0.0397 0.0314 0.0275 0.0332 0.0509 ...
## $ DiabetesPedigreeFunction: num 0.000741 0.000415 0.000794 0.000197 0.002704 ...
## $ Age : num 0.0591 0.0366 0.0378 0.0248 0.039 ...
## $ Outcome : Factor w/ 2 levels "0","1": 2 1 2 1 2 1 2 1 2 2 ...
Partition the dataset into a training set and a testing set, so we can build the model on one and check on the other whether it was built correctly.
library(caret)#short for Classification And REgression Training
## Loading required package: lattice
## Loading required package: ggplot2
index<-createDataPartition(diabetes$Pregnancies,p=0.75,list = F)
# argument list=F makes the output a matrix of row indices rather than a list
train_diab<-diabetes[index,]
test_diab<-diabetes[-index,]
dim(train_diab)
## [1] 578 9
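Two refinements are worth noting (a sketch; using them would change the split, so the numbers below would shift): fixing a seed makes the partition reproducible, and stratifying on Outcome rather than Pregnancies keeps the class balance similar in both sets.
set.seed(123)#any fixed seed works; 123 is arbitrary
index<-createDataPartition(diabetes$Outcome,p=0.75,list = F)#stratify on the class variable
train_diab<-diabetes[index,]
test_diab<-diabetes[-index,]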
Performing Binary logistic regression
diab_glm_model<-glm(Outcome~.,data = train_diab,family = "binomial")
# family = "binomial" is required because the response is dichotomous
summary(diab_glm_model)
##
## Call:
## glm(formula = Outcome ~ ., family = "binomial", data = train_diab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4630 -0.7433 -0.3987 0.7176 2.3133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.9726 0.9655 -9.293 < 2e-16 ***
## Pregnancies 107.4530 31.9414 3.364 0.000768 ***
## Glucose 36.4327 4.2974 8.478 < 2e-16 ***
## BloodPressure -9.1180 8.3931 -1.086 0.277315
## SkinThickness 14.8158 13.9368 1.063 0.287749
## Insulin -1.9483 1.1344 -1.717 0.085913 .
## BMI 68.8638 18.9490 3.634 0.000279 ***
## DiabetesPedigreeFunction 459.5688 286.2270 1.606 0.108360
## Age 4.9197 8.9470 0.550 0.582407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 757.53 on 577 degrees of freedom
## Residual deviance: 548.49 on 569 degrees of freedom
## AIC: 566.49
##
## Number of Fisher Scoring iterations: 5
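The coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to read (a small sketch; note the predictors are normalized, so each ratio is per unit of the scaled variable):
exp(coef(diab_glm_model))#odds ratio per one-unit change in each (scaled) predictor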
#Let's check the model's predicted probabilities on the training data
train_diab$pred_prob_outcome<-fitted(diab_glm_model)#gives the probability of Outcome being 1
To classify the dependent variable (Outcome) as 1 or 0, we first need to find a threshold.
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
pred<-prediction(train_diab$pred_prob_outcome,train_diab$Outcome)
#compare the model's predicted probabilities with the actual outcomes
perf<-performance(pred,"tpr","fpr")
plot(perf,colorize=T,print.cutoffs.at=seq(0.1,1,by=0.05))#mark the cutoffs from 0.1 to 1 on the ROC curve
The plot shows sensitivity (true positive rate) on the y-axis against 1 - specificity (false positive rate) on the x-axis. We select a threshold at which sensitivity and specificity are roughly balanced; in this plot that appears to be about 0.35, so we classify the results using a threshold of 0.35.
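Rather than eyeballing the plot, the cutoff can also be picked programmatically from the ROCR object, e.g. by maximizing Youden's J = sensitivity + specificity - 1 (a sketch; it should land near the 0.35 read off the plot):
cutoffs<-perf@alpha.values[[1]]#the probability thresholds ROCR evaluated
youden<-perf@y.values[[1]]-perf@x.values[[1]]#tpr - fpr at each threshold
cutoffs[which.max(youden)]#threshold with the best sensitivity/specificity trade-off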
train_diab$pred_Outcome<-ifelse(train_diab$pred_prob_outcome>0.35,1,0)
#stores the predicted class: 1 if the predicted probability exceeds 0.35, else 0
head(train_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.17494090 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.10047281 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.21631206 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.10520095 0.07801418 0.02718676 0.11111111 0.03321513
## 6 0.005910165 0.13711584 0.08747045 0.02836879 0.13002364 0.03026005
## 7 0.003546099 0.09219858 0.05910165 0.03782506 0.10401891 0.03664303
## DiabetesPedigreeFunction Age Outcome pred_prob_outcome
## 1 0.0007411348 0.05910165 1 0.72746654
## 2 0.0004148936 0.03664303 0 0.04671357
## 3 0.0007943262 0.03782506 1 0.85441601
## 4 0.0001973995 0.02482270 0 0.04575268
## 6 0.0002375887 0.03546099 0 0.16726717
## 7 0.0002931442 0.03073286 1 0.06885921
## pred_Outcome
## 1 1
## 2 0
## 3 1
## 4 0
## 6 0
## 7 0
#no need to compute accuracy, sensitivity and specificity by hand
#confusionMatrix() below reports them all at once
confusionMatrix(table(train_diab$Outcome,train_diab$pred_Outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 283 85
## 1 53 157
##
## Accuracy : 0.7612
## 95% CI : (0.7243, 0.7955)
## No Information Rate : 0.5813
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5003
##
## Mcnemar's Test P-Value : 0.008318
##
## Sensitivity : 0.8423
## Specificity : 0.6488
## Pos Pred Value : 0.7690
## Neg Pred Value : 0.7476
## Prevalence : 0.5813
## Detection Rate : 0.4896
## Detection Prevalence : 0.6367
## Balanced Accuracy : 0.7455
##
## 'Positive' Class : 0
##
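As a sanity check, the headline numbers can be reproduced from the table by hand (a sketch; the table's rows are the actual classes and its columns the predicted ones):
tab<-table(train_diab$Outcome,train_diab$pred_Outcome)
sum(diag(tab))/sum(tab)#accuracy: (283+157)/578 = 0.7612
tab["0","0"]/sum(tab[,"0"])#sensitivity of the positive class "0": 283/336 = 0.8423
tab["1","1"]/sum(tab[,"1"])#specificity: 157/242 = 0.6488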
Accuracy = 76.12%, sensitivity = 84.23% and specificity = 64.88% on the training data, which is acceptable.
Let's check the model on the test data.
test_diab$pred_prob_outcome<-predict(diab_glm_model,test_diab,type = "response")
#type = "response" makes predict() return probabilities; the default is the link (log-odds) scale
test_diab$pred_outcome<-ifelse(test_diab$pred_prob_outcome>0.35,1,0)#use the same threshold as on the training data
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome pred_prob_outcome
## 5 0.0027044917 0.03900709 1 0.8409526
## 9 0.0001867612 0.06264775 1 0.8061292
## 10 0.0002742317 0.06382979 1 0.5286710
## 12 0.0006347518 0.04018913 1 0.9163883
## 18 0.0003002364 0.03664303 1 0.2108288
## 22 0.0004586288 0.05910165 0 0.2897206
## pred_outcome
## 5 1
## 9 1
## 10 1
## 12 1
## 18 0
## 22 0
#check accuracy, sensitivity and specificity on the test data
confusionMatrix(table(test_diab$Outcome,test_diab$pred_outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 92 40
## 1 12 46
##
## Accuracy : 0.7263
## 95% CI : (0.6571, 0.7884)
## No Information Rate : 0.5474
## P-Value [Acc > NIR] : 2.986e-07
##
## Kappa : 0.4317
##
## Mcnemar's Test P-Value : 0.000181
##
## Sensitivity : 0.8846
## Specificity : 0.5349
## Pos Pred Value : 0.6970
## Neg Pred Value : 0.7931
## Prevalence : 0.5474
## Detection Rate : 0.4842
## Detection Prevalence : 0.6947
## Balanced Accuracy : 0.7097
##
## 'Positive' Class : 0
##
Accuracy = 72.63%, sensitivity = 88.46% and specificity = 53.49% on the test data, which is acceptable.
To quantify the model's overall discriminating power, we compute the area under the ROC curve (AUC):
auc<-performance(pred,"auc")
auc@y.values
## [[1]]
## [1] 0.8398421
We can conclude that the model achieves a test accuracy of 72.63% (a misclassification rate of 27.37%) with an AUC of 0.8398.
Decision Tree
Before implementing the decision tree, we need to remove the extra columns we added while performing the binary logistic regression.
train_diab$pred_Outcome<-NULL
train_diab$pred_prob_outcome<-NULL
test_diab$pred_outcome<-NULL
test_diab$pred_prob_outcome<-NULL
head(train_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.17494090 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.10047281 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.21631206 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.10520095 0.07801418 0.02718676 0.11111111 0.03321513
## 6 0.005910165 0.13711584 0.08747045 0.02836879 0.13002364 0.03026005
## 7 0.003546099 0.09219858 0.05910165 0.03782506 0.10401891 0.03664303
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 6 0.0002375887 0.03546099 0
## 7 0.0002931442 0.03073286 1
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome
## 5 0.0027044917 0.03900709 1
## 9 0.0001867612 0.06264775 1
## 10 0.0002742317 0.06382979 1
## 12 0.0006347518 0.04018913 1
## 18 0.0003002364 0.03664303 1
## 22 0.0004586288 0.05910165 0
library(rpart)
library(rpart.plot)
tree_diab_model<-rpart(Outcome~.,data = train_diab)
test_diab$pred_outcome<-predict(tree_diab_model,test_diab,type = "class")
test_diab$pred_income<-NULL#a no-op: no pred_income column exists; apparently left over from another script
head(test_diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.1985816 0.05094563
## 9 0.002364066 0.2328605 0.08274232 0.05319149 0.6418440 0.03605201
## 10 0.009456265 0.1477541 0.11347518 0.04373522 0.1725768 0.04314421
## 12 0.011820331 0.1985816 0.08747045 0.03782506 0.2482270 0.04491726
## 18 0.008274232 0.1264775 0.08747045 0.03309693 0.1359338 0.03498818
## 22 0.009456265 0.1170213 0.09929078 0.03782506 0.1241135 0.04184397
## DiabetesPedigreeFunction Age Outcome pred_outcome
## 5 0.0027044917 0.03900709 1 1
## 9 0.0001867612 0.06264775 1 1
## 10 0.0002742317 0.06382979 1 0
## 12 0.0006347518 0.04018913 1 1
## 18 0.0003002364 0.03664303 1 1
## 22 0.0004586288 0.05910165 0 0
#check accuracy, sensitivity and specificity on the test data
confusionMatrix(table(test_diab$Outcome,test_diab$pred_outcome))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 105 27
## 1 24 34
##
## Accuracy : 0.7316
## 95% CI : (0.6626, 0.7931)
## No Information Rate : 0.6789
## P-Value [Acc > NIR] : 0.0683
##
## Kappa : 0.3762
##
## Mcnemar's Test P-Value : 0.7794
##
## Sensitivity : 0.8140
## Specificity : 0.5574
## Pos Pred Value : 0.7955
## Neg Pred Value : 0.5862
## Prevalence : 0.6789
## Detection Rate : 0.5526
## Detection Prevalence : 0.6947
## Balanced Accuracy : 0.6857
##
## 'Positive' Class : 0
##
For the decision tree algorithm, accuracy = 73.16%, sensitivity = 81.40% and specificity = 55.74% on the test data, which is acceptable.
Plot the decision tree:
rpart.plot(tree_diab_model,cex = 0.7)
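A possible next step (a sketch) is to inspect rpart's complexity table and prune the tree at the cross-validated optimum, which often improves generalization:
printcp(tree_diab_model)#the xerror column holds the cross-validated error for each cp value
best_cp<-tree_diab_model$cptable[which.min(tree_diab_model$cptable[,"xerror"]),"CP"]
pruned_tree<-prune(tree_diab_model,cp=best_cp)
rpart.plot(pruned_tree,cex = 0.7)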
Random Forest
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
rf_diab_model<-randomForest(Outcome~.,data = diabetes)#fitted on the full dataset; the OOB error acts as an internal validation estimate
head(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 0.007092199 0.1749409 0.08510638 0.04137116 0.19739953 0.03971631
## 2 0.001182033 0.1004728 0.07801418 0.03427896 0.08392435 0.03144208
## 3 0.009456265 0.2163121 0.07565012 0.03309693 0.20685579 0.02754137
## 4 0.001182033 0.1052009 0.07801418 0.02718676 0.11111111 0.03321513
## 5 0.000000000 0.1619385 0.04728132 0.04137116 0.19858156 0.05094563
## 6 0.005910165 0.1371158 0.08747045 0.02836879 0.13002364 0.03026005
## DiabetesPedigreeFunction Age Outcome
## 1 0.0007411348 0.05910165 1
## 2 0.0004148936 0.03664303 0
## 3 0.0007943262 0.03782506 1
## 4 0.0001973995 0.02482270 0
## 5 0.0027044917 0.03900709 1
## 6 0.0002375887 0.03546099 0
Here the out-of-bag (OOB) error gives us the misclassification rate (MCR) of the model. In this case it comes out to be 24.35%, which corresponds to an accuracy of 75.65%.
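The OOB figure quoted above comes from printing the fitted model object; the forest's variable importances are also worth a glance (a sketch):
print(rf_diab_model)#shows the OOB error estimate and the OOB confusion matrix
varImpPlot(rf_diab_model)#mean decrease in Gini for each predictor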
To check the class-wise error:
plot(rf_diab_model)
The red line represents the MCR for Outcome = 0 (person does not have diabetes), the green line the MCR for Outcome = 1 (person has diabetes), and the black line the overall MCR, i.e., the OOB error. The overall error rate is what we are interested in, and it looks reasonably good.
Accuracy: binary logistic regression: 72.63%; decision tree: 73.16%; random forest: 75.65%.
We can conclude that, among the three algorithms used to build prediction models on the diabetes data, the random forest achieves slightly higher accuracy than the decision tree and binary logistic regression, keeping in mind that its figure is an OOB estimate on the full dataset rather than a held-out test accuracy.