Introduction
Logistic regression is a regression model for a categorical response (dependent) variable, most commonly a binary one such as True/False or 0/1. It models the probability that the response takes one of the two values as a function of the predictor variables.
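To make the link between predictors and probability concrete: the model treats the log-odds of the response as a linear function of the predictors and maps it to a probability with the logistic function. The sketch below is illustrative only; the coefficients `b0` and `b1` are made-up values, not estimates from this dataset.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only for illustration.
b0 <- -3
b1 <- 0.5

x <- c(0, 6, 12)            # example predictor values
p <- logistic(b0 + b1 * x)  # modelled probability that the response is 1
print(round(p, 3))

# Base R ships the same function as plogis().
stopifnot(isTRUE(all.equal(p, plogis(b0 + b1 * x))))
```

The probability rises monotonically with `x` when `b1` is positive and equals 0.5 exactly where the linear predictor is zero.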
Problem Definition
The objective is to predict based on diagnostic measurements whether a patient has diabetes or not.
Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Data Description
Attributes:
[, 1] Pregnancies: Number of times pregnant
[, 2] Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
[, 3] BloodPressure: Diastolic blood pressure (mm Hg)
[, 4] SkinThickness: Triceps skin fold thickness (mm)
[, 5] Insulin: 2-hour serum insulin (mu U/ml)
[, 6] BMI: Body mass index (weight in kg/(height in m)^2)
[, 7] DiabetesPedigreeFunction: Diabetes pedigree function
[, 8] Age: Age (years)
[, 9] Outcome: Class variable (0 or 1), 0 = non-diabetic and 1 = diabetic
Setup
library(tidyr)      # gather() for reshaping to long format
library(dplyr)      # summarise(), group_by(), mutate()
library(ggplot2)    # plotting
library(corrgram)
library(gridExtra)
library(Deducer)    # rocplot(); also attaches car and MASS, which mask dplyr::recode and dplyr::select
library(caret)      # confusionMatrix()
library(pscl)
Functions
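The analysis below calls the helpers `detect_na()` and `detect_outliers()`, whose definitions are not shown in this document. The sketch below is one plausible reconstruction, not the author's code. It assumes that `detect_na()` counts missing values and that `detect_outliers()` returns the values lying beyond the boxplot fences. The fence multiplier `coef` is also an assumption (1.5 is the boxplot default), so the values it flags may differ from the output printed later.

```r
# Assumed helper: number of NA values in a vector.
detect_na <- function(x) sum(is.na(x))

# Assumed helper: values lying more than coef * IQR beyond the hinges.
# boxplot.stats() comes from grDevices, which ships with base R.
detect_outliers <- function(x, coef = 1.5) {
  grDevices::boxplot.stats(x, coef = coef)$out
}

print(detect_na(c(1, NA, 3)))              # one missing value
print(detect_outliers(c(1, 2, 3, 4, 100))) # 100 lies beyond the upper fence
```

Both helpers take a single vector, so they can be applied to every column with `lapply()`, as done in the sections that follow.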
Dataset
setwd("D:\\PGDM\\Trim 4\\MachineLearning")
dfrModel <- read.csv("./Data/Diabetes_train1.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Datatypes
str(dfrModel)
## 'data.frame': 700 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
Observation
All nine columns are numeric: seven are stored as integers and two (BMI and DiabetesPedigreeFunction) as doubles.
Check for Missing Data
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation
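The dataset has no NA values. Note, however, that the zeros that show up in Glucose, BloodPressure, SkinThickness, Insulin and BMI in the summaries below are not physiologically plausible and probably encode missing measurements.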
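A count of `NA` values does not catch missing measurements that were recorded as 0. A minimal sketch of a per-column zero count follows. The small data frame `dfrToy` is made up and stands in for `dfrModel`, and the choice of columns in which 0 is treated as implausible is an assumption.

```r
# Made-up stand-in for dfrModel.
dfrToy <- data.frame(
  Glucose       = c(148, 0, 183, 89),
  BloodPressure = c(72, 66, 0, 0),
  BMI           = c(33.6, 26.6, 23.3, 0)
)

# Columns in which a value of 0 is assumed to mean "not measured".
zeroCols <- c("Glucose", "BloodPressure", "BMI")

# Number of zeros per column.
zeroCount <- sapply(dfrToy[zeroCols], function(x) sum(x == 0))
print(zeroCount)

# Optional: recode those zeros to NA so that NA checks see them.
dfrClean <- dfrToy
dfrClean[zeroCols] <- lapply(dfrClean[zeroCols],
                             function(x) replace(x, x == 0, NA))
print(colSums(is.na(dfrClean)))
```

Whether to recode, impute or keep the zeros is a modelling decision; this document keeps them.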
Summarizing data
summarise(group_by(dfrModel, Pregnancies), n())
## # A tibble: 17 × 2
## Pregnancies `n()`
## <int> <int>
## 1 0 106
## 2 1 120
## 3 2 91
## 4 3 68
## 5 4 63
## 6 5 53
## 7 6 46
## 8 7 43
## 9 8 35
## 10 9 24
## 11 10 20
## 12 11 10
## 13 12 8
## 14 13 9
## 15 14 2
## 16 15 1
## 17 17 1
summarise(group_by(dfrModel, Glucose), n())
## # A tibble: 133 × 2
## Glucose `n()`
## <int> <int>
## 1 0 5
## 2 44 1
## 3 56 1
## 4 57 2
## 5 61 1
## 6 62 1
## 7 67 1
## 8 68 3
## 9 71 4
## 10 72 1
## # ... with 123 more rows
summarise(group_by(dfrModel, BloodPressure), n())
## # A tibble: 47 × 2
## BloodPressure `n()`
## <int> <int>
## 1 0 33
## 2 24 1
## 3 30 2
## 4 38 1
## 5 40 1
## 6 44 3
## 7 46 1
## 8 48 5
## 9 50 12
## 10 52 10
## # ... with 37 more rows
summarise(group_by(dfrModel, SkinThickness), n())
## # A tibble: 51 × 2
## SkinThickness `n()`
## <int> <int>
## 1 0 209
## 2 7 2
## 3 8 2
## 4 10 5
## 5 11 6
## 6 12 7
## 7 13 10
## 8 14 6
## 9 15 14
## 10 16 6
## # ... with 41 more rows
summarise(group_by(dfrModel, Insulin), n())
## # A tibble: 176 × 2
## Insulin `n()`
## <int> <int>
## 1 0 338
## 2 14 1
## 3 15 1
## 4 18 2
## 5 23 2
## 6 25 1
## 7 29 1
## 8 32 1
## 9 36 3
## 10 37 2
## # ... with 166 more rows
summarise(group_by(dfrModel, BMI), n())
## # A tibble: 245 × 2
## BMI `n()`
## <dbl> <int>
## 1 0.0 10
## 2 18.2 3
## 3 18.4 1
## 4 19.1 1
## 5 19.3 1
## 6 19.4 1
## 7 19.5 2
## 8 19.6 3
## 9 19.9 1
## 10 20.0 1
## # ... with 235 more rows
summarise(group_by(dfrModel, DiabetesPedigreeFunction), n())
## # A tibble: 487 × 2
## DiabetesPedigreeFunction `n()`
## <dbl> <int>
## 1 0.078 1
## 2 0.084 1
## 3 0.085 2
## 4 0.088 2
## 5 0.089 1
## 6 0.092 1
## 7 0.096 1
## 8 0.100 1
## 9 0.101 1
## 10 0.102 1
## # ... with 477 more rows
summarise(group_by(dfrModel, Age), n())
## # A tibble: 52 × 2
## Age `n()`
## <int> <int>
## 1 21 59
## 2 22 63
## 3 23 36
## 4 24 43
## 5 25 46
## 6 26 29
## 7 27 29
## 8 28 32
## 9 29 29
## 10 30 19
## # ... with 42 more rows
summarise(group_by(dfrModel, Outcome), n())
## # A tibble: 2 × 2
## Outcome `n()`
## <int> <int>
## 1 0 459
## 2 1 241
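The last table gives the class balance, which sets the baseline that any classifier has to beat. A short sketch using only the two counts printed above (the variable names are illustrative):

```r
# Outcome counts copied from the table above.
outcomeCount <- c(NonDiabetic = 459, Diabetic = 241)

# Share of each class in the training data.
classShare <- outcomeCount / sum(outcomeCount)
print(round(classShare, 4))

# Accuracy obtained by always predicting the majority class.
baseline <- max(classShare)
print(round(baseline, 4))
```

On the data frame itself the same shares could be obtained with `prop.table(table(dfrModel$Outcome))`.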
Exploratory Analysis
lapply(dfrModel, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.827 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 116.5 120.5 140.2 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 63.50 72.00 68.88 80.00 122.00
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.38 32.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 36.50 79.88 126.50 846.00
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.00 32.00 31.89 36.50 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2400 0.3755 0.4760 0.6370 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.12 40.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3443 1.0000 1.0000
Histogram to check data distribution
hist(dfrModel$Pregnancies)
hist(dfrModel$Glucose)
hist(dfrModel$Age)
hist(dfrModel$BMI)
hist(dfrModel$Insulin)
Outliers Data
#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 122 0 0 0 0 0 0 0 0 0 0
## [18] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
##
## $BMI
## [1] 0.0 0.0 0.0 0.0 0.0 67.1 0.0 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
##
## $Age
## [1] 81
##
## $Outcome
## integer(0)
Display Outliers
lapply(dfrModel[1:8],FUN=display_Outliers)
## $Pregnancies
##
## $Glucose
##
## $BloodPressure
##
## $SkinThickness
##
## $Insulin
##
## $BMI
##
## $DiabetesPedigreeFunction
##
## $Age
Observation
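Outliers are present in a few features (BloodPressure, Insulin, BMI, DiabetesPedigreeFunction and Age), and many of the flagged BloodPressure and BMI values are zeros.
The outlier count is low relative to the 700 observations.
For this model we keep the outliers in the data.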
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.22774403 0.45928020 0.06019258
## SkinThickness Insulin BMI
## 0.08740524 0.14592233 0.30659734
## DiabetesPedigreeFunction Age Outcome
## 0.17053194 0.22699018 1.00000000
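The loop above can also be written without explicit iteration. The sketch below compares both forms on a small data frame `dfrToy`, built from three columns of the six rows printed by `head(dfrModel)`; the name `dfrToy` is illustrative.

```r
# Three columns of the first six training rows shown earlier.
dfrToy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

# Loop version, as in the text.
vctLoop <- numeric(0)
for (i in names(dfrToy)) {
  vctLoop <- c(vctLoop, cor(dfrToy$Outcome, dfrToy[, i]))
}
names(vctLoop) <- names(dfrToy)

# Vectorised version: one correlation per column.
vctCorr <- sapply(dfrToy, function(x) cor(dfrToy$Outcome, x))

# Equivalent: take the Outcome column of the full correlation matrix.
vctMat <- cor(dfrToy)[, "Outcome"]

print(round(vctCorr, 3))
```

All three forms return a named vector, so no separate `names()` assignment is needed for the vectorised ones.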
Data For Visualization
dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 1 Pregnancies 6
## 2 0 Pregnancies 1
## 3 1 Pregnancies 8
## 4 0 Pregnancies 1
## 5 1 Pregnancies 0
## 6 0 Pregnancies 5
ggplot(dfrGraph) + #ggplot works better with factors
geom_jitter(aes(value,Outcome, colour=variable)) +
geom_smooth(aes(value,Outcome, colour=variable), method=lm, se=FALSE) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of diabetes With Other Features")
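Find Best Multiple Logistic Model
Choose the logistic model with step(), which starts from the full model (Outcome ~ .) and adds or drops predictors stepwise to minimise the AIC.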
stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=100)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7459 -0.7335 -0.4097 0.7153 2.8945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.009215 0.713218 -11.230 < 2e-16 ***
## Pregnancies 0.157336 0.029270 5.375 7.64e-08 ***
## Glucose 0.033444 0.003517 9.509 < 2e-16 ***
## BloodPressure -0.012432 0.005246 -2.370 0.01781 *
## BMI 0.091142 0.014970 6.088 1.14e-09 ***
## DiabetesPedigreeFunction 0.885935 0.304143 2.913 0.00358 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 901.37 on 699 degrees of freedom
## Residual deviance: 659.26 on 694 degrees of freedom
## AIC: 671.26
##
## Number of Fisher Scoring iterations: 5
Observation
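The model selected by AIC is Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction.
The p-values of all retained features are less than 0.05.
The residual deviance (659.26 on 694 degrees of freedom) is much lower than the null deviance (901.37 on 699 degrees of freedom), so the five predictors together fit the data considerably better than an intercept-only model.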
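The comparison of the two deviances can be made formal with a likelihood-ratio (chi-squared) test, and the reported AIC can be checked by hand. The sketch below uses only numbers printed in the summary above. It relies on the fact that for ungrouped 0/1 responses the residual deviance equals minus twice the log-likelihood, so AIC = deviance + 2 × (number of coefficients).

```r
# Values copied from the model summary above.
nullDev  <- 901.37; nullDf  <- 699
residDev <- 659.26; residDf <- 694

# Likelihood-ratio statistic and its degrees of freedom.
lrStat <- nullDev - residDev
lrDf   <- nullDf - residDf

# Upper-tail chi-squared probability: evidence against the intercept-only model.
pValue <- pchisq(lrStat, df = lrDf, lower.tail = FALSE)
print(c(statistic = lrStat, df = lrDf))
print(pValue)

# AIC check: intercept plus five predictors gives six coefficients.
nParam <- 6
aic <- residDev + 2 * nParam
print(aic)
```

With the fitted object available, `anova(stpModel, test = "Chisq")` reports the corresponding sequential tests directly.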
Make Final Multiple Logistic Model
# make model
mgmModel <- glm(data=dfrModel, formula=Outcome~Pregnancies+Glucose+BloodPressure+BMI+DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial(link = "logit"),
## data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7459 -0.7335 -0.4097 0.7153 2.8945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.009215 0.713218 -11.230 < 2e-16 ***
## Pregnancies 0.157336 0.029270 5.375 7.64e-08 ***
## Glucose 0.033444 0.003517 9.509 < 2e-16 ***
## BloodPressure -0.012432 0.005246 -2.370 0.01781 *
## BMI 0.091142 0.014970 6.088 1.14e-09 ***
## DiabetesPedigreeFunction 0.885935 0.304143 2.913 0.00358 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 901.37 on 699 degrees of freedom
## Residual deviance: 659.26 on 694 degrees of freedom
## AIC: 671.26
##
## Number of Fisher Scoring iterations: 5
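The coefficients above are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: the multiplicative change in the odds of diabetes for a one-unit increase in a predictor, holding the others fixed. The sketch below uses the estimates printed in the summary; the vector name `est` is illustrative.

```r
# Estimates copied from the coefficient table above (intercept omitted).
est <- c(Pregnancies              =  0.157336,
         Glucose                  =  0.033444,
         BloodPressure            = -0.012432,
         BMI                      =  0.091142,
         DiabetesPedigreeFunction =  0.885935)

# Odds ratio per one-unit increase in each predictor.
oddsRatio <- exp(est)
print(round(oddsRatio, 3))

# Odds ratio for a 10-unit increase in Glucose.
glucose10 <- exp(10 * est["Glucose"])
print(round(glucose10, 3))
```

With the fitted model, `exp(coef(mgmModel))` gives the same quantities. A ratio above 1 means the odds increase with the predictor; BloodPressure is the only predictor here with a ratio below 1.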
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 404 103
## 1 55 138
##
## Accuracy : 0.7743
## 95% CI : (0.7415, 0.8048)
## No Information Rate : 0.6557
## P-Value [Acc > NIR] : 5.601e-12
##
## Kappa : 0.4753
## Mcnemar's Test P-Value : 0.0001847
##
## Sensitivity : 0.8802
## Specificity : 0.5726
## Pos Pred Value : 0.7968
## Neg Pred Value : 0.7150
## Prevalence : 0.6557
## Detection Rate : 0.5771
## Detection Prevalence : 0.7243
## Balanced Accuracy : 0.7264
##
## 'Positive' Class : 0
##
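Observation
On the training data the accuracy of the model is about 77% and the sensitivity about 88%. Because caret treats class 0 as the 'Positive' class here, the sensitivity is the share of non-diabetic patients classified correctly; the share of diabetic patients detected is the specificity, about 57%.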
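The statistics reported by `confusionMatrix()` can be reproduced from the four cell counts, which makes it explicit which class each rate refers to. A minimal sketch using the counts printed above (object names are illustrative):

```r
# Confusion matrix from the output above: rows = predicted, columns = actual.
cm <- matrix(c(404, 55, 103, 138), nrow = 2,
             dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# Share of all cases classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of actual non-diabetic cases (class 0) predicted as 0.
recallNonDiabetic <- cm["0", "0"] / sum(cm[, "0"])

# Share of actual diabetic cases (class 1) predicted as 1.
recallDiabetic <- cm["1", "1"] / sum(cm[, "1"])

print(round(c(accuracy = accuracy,
              recallNonDiabetic = recallNonDiabetic,
              recallDiabetic = recallDiabetic), 4))
```

caret's `confusionMatrix()` accepts a `positive` argument, so `confusionMatrix(cnfmtrx, positive = "1")` would report sensitivity for the diabetic class instead.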
Regression Data
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome PrdVal POutcome
## 1 0.627 50 1 0.64732497 1
## 2 0.351 31 0 0.04334372 0
## 3 0.672 32 1 0.78466856 1
## 4 0.167 21 0 0.04802554 0
## 5 2.288 33 1 0.88397222 1
## 6 0.201 30 0 0.14783877 0
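The `PrdVal` column can be reproduced by hand, which shows what `predict(..., type = "response")` computes. The sketch below applies the coefficients from the model summary to the first training row; `beta` and `row1` are illustrative names, and the result agrees with the printed value only up to the rounding of the coefficients.

```r
# Coefficients copied from the model summary.
beta <- c(Intercept                = -8.009215,
          Pregnancies              =  0.157336,
          Glucose                  =  0.033444,
          BloodPressure            = -0.012432,
          BMI                      =  0.091142,
          DiabetesPedigreeFunction =  0.885935)

# First training row, with a leading 1 for the intercept.
row1 <- c(1, 6, 148, 72, 33.6, 0.627)

# Linear predictor (log-odds), then the inverse logit.
logOdds <- sum(beta * row1)
prob    <- plogis(logOdds)
print(round(c(logOdds = logOdds, prob = prob), 4))

# Class prediction at the 0.5 threshold used in the text.
predClass <- ifelse(prob > 0.5, 1, 0)
print(predClass)
```

The probability comes out close to the 0.647 shown for row 1, and the 0.5 threshold turns it into a predicted class of 1.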
Regression Visualization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visualization
rocplot(mgmModel)
Observation
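The area under the ROC curve (AUC) is around 0.83. AUC measures how well the model ranks diabetic above non-diabetic patients across all thresholds; it is not the same quantity as accuracy.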
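The AUC has a direct interpretation: it is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative one. The sketch below computes it from ranks (the Mann-Whitney formulation). The function name `auc_rank` is hypothetical and is not part of Deducer; the first example uses the six `PrdVal` and `Outcome` values printed earlier, the second is made up.

```r
# AUC from ranks: (sum of positive ranks - n1 * (n1 + 1) / 2) / (n1 * n0).
auc_rank <- function(score, label) {
  r  <- rank(score)          # average ranks are used for ties
  n1 <- sum(label == 1)      # number of positive cases
  n0 <- sum(label == 0)      # number of negative cases
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# First six training rows: predicted probability and actual outcome.
score6 <- c(0.64732497, 0.04334372, 0.78466856,
            0.04802554, 0.88397222, 0.14783877)
label6 <- c(1, 0, 1, 0, 1, 0)
auc6 <- auc_rank(score6, label6)

# Made-up example in which one positive is ranked below one negative.
aucToy <- auc_rank(c(0.10, 0.40, 0.35, 0.80), c(0, 0, 1, 1))

print(c(auc6 = auc6, aucToy = aucToy))
```

Applied to the full training data this would be `auc_rank(prdVal, dfrModel$Outcome)`.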
Test Data
setwd("D:\\PGDM\\Trim 4\\MachineLearning")
dfrTests <- read.csv("./Data/Diabetes_test1.csv", header=T, stringsAsFactors=F)
head(dfrTests)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## DiabetesPedigreeFunction Age Outcome
## 1 0.483 26 0
## 2 0.565 49 1
## 3 0.905 52 1
## 4 0.304 41 0
## 5 0.118 27 0
## 6 0.177 28 0
Observation
Test data loaded successfully.
Predict using Test data
resVal <- predict(mgmModel, dfrTests, type="response")
prdOut <- ifelse(resVal > 0.5, 1, 0)
dfrTests <- mutate(dfrTests, Pvalue=resVal, POutcome=prdOut)
dfrTests
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## 7 10 115 0 0 0 0.0
## 8 2 127 46 21 335 34.4
## 9 9 164 78 0 0 32.8
## 10 2 93 64 32 160 38.0
## 11 3 158 64 13 387 31.2
## 12 5 126 78 27 22 29.6
## 13 10 129 62 36 0 41.2
## 14 0 134 58 20 291 26.4
## 15 3 102 74 0 0 29.5
## 16 7 187 50 33 392 33.9
## 17 3 173 78 39 185 33.8
## 18 10 94 72 18 0 23.1
## 19 1 108 60 46 178 35.5
## 20 5 97 76 27 0 35.6
## 21 4 83 86 19 0 29.3
## 22 1 114 66 36 200 38.1
## 23 1 149 68 29 127 29.3
## 24 5 117 86 30 105 39.1
## 25 1 111 94 0 0 32.8
## 26 4 112 78 40 0 39.4
## 27 1 116 78 29 180 36.1
## 28 0 141 84 26 0 32.4
## 29 2 175 88 0 0 22.9
## 30 2 92 52 0 0 30.1
## 31 3 130 78 23 79 28.4
## 32 8 120 86 0 0 28.4
## 33 2 174 88 37 120 44.5
## 34 2 106 56 27 165 29.0
## 35 2 105 75 0 0 23.3
## 36 4 95 60 32 0 35.4
## 37 0 126 86 27 120 27.4
## 38 8 65 72 23 0 32.0
## 39 2 99 60 17 160 36.6
## 40 1 102 74 0 0 39.5
## 41 11 120 80 37 150 42.3
## 42 3 102 44 20 94 30.8
## 43 1 109 58 18 116 28.5
## 44 9 140 94 0 0 32.7
## 45 13 153 88 37 140 40.6
## 46 12 100 84 33 105 30.0
## 47 1 147 94 41 0 49.3
## 48 1 81 74 41 57 46.3
## 49 3 187 70 22 200 36.4
## 50 6 162 62 0 0 24.3
## 51 4 136 70 0 0 31.2
## 52 1 121 78 39 74 39.0
## 53 3 108 62 24 0 26.0
## 54 0 181 88 44 510 43.3
## 55 8 154 78 32 0 32.4
## 56 1 128 88 39 110 36.5
## 57 7 137 90 41 0 32.0
## 58 0 123 72 0 0 36.3
## 59 1 106 76 0 0 37.5
## 60 6 190 92 0 0 35.5
## 61 2 88 58 26 16 28.4
## 62 9 170 74 31 0 44.0
## 63 9 89 62 0 0 22.5
## 64 10 101 76 48 180 32.9
## 65 2 122 70 27 0 36.8
## 66 5 121 72 23 112 26.2
## 67 1 126 60 0 0 30.1
## 68 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Pvalue POutcome
## 1 0.483 26 0 0.29749244 0
## 2 0.565 49 1 0.30189651 0
## 3 0.905 52 1 0.66026773 1
## 4 0.304 41 0 0.59821566 1
## 5 0.118 27 0 0.12424247 0
## 6 0.177 28 0 0.16798839 0
## 7 0.261 30 1 0.08638919 0
## 8 0.176 22 0 0.32567987 0
## 9 0.148 45 1 0.73934182 1
## 10 0.674 23 1 0.21092502 0
## 11 0.295 24 0 0.51407613 1
## 12 0.439 40 0 0.29079639 0
## 13 0.441 38 1 0.77789023 1
## 14 0.352 21 0 0.17788535 0
## 15 0.121 32 0 0.09535232 0
## 16 0.826 34 1 0.92731122 1
## 17 0.970 31 1 0.77187221 1
## 18 0.595 56 0 0.17441146 0
## 19 0.415 24 0 0.20058927 0
## 20 0.378 52 1 0.20689720 0
## 21 0.317 34 0 0.06169707 0
## 22 0.289 21 0 0.24393972 0
## 23 0.349 42 1 0.32422871 0
## 24 0.251 42 0 0.35601984 0
## 25 0.265 45 0 0.11066833 0
## 26 0.236 38 0 0.30922789 0
## 27 0.496 25 0 0.22927908 0
## 28 0.433 22 0 0.26869678 0
## 29 0.326 22 0 0.36358534 0
## 30 0.141 22 0 0.08349014 0
## 31 0.323 34 1 0.21677629 0
## 32 0.259 22 1 0.27121466 0
## 33 0.646 24 1 0.84008714 1
## 34 0.426 22 0 0.13882109 0
## 35 0.560 53 0 0.07617043 0
## 36 0.284 28 0 0.18685838 0
## 37 0.515 21 0 0.12888735 0
## 38 0.600 42 0 0.11674276 0
## 39 0.453 21 0 0.19903185 0
## 40 0.293 42 1 0.18229997 0
## 41 0.785 48 1 0.78431616 1
## 42 0.400 26 0 0.18073784 0
## 43 0.219 22 0 0.10565220 0
## 44 0.734 45 1 0.63437356 1
## 45 1.174 39 0 0.94265225 1
## 46 0.488 46 0 0.34198956 0
## 47 0.358 27 1 0.66958025 1
## 48 1.096 32 0 0.29483781 0
## 49 0.408 36 1 0.82137020 1
## 50 0.178 50 1 0.48861097 0
## 51 1.182 22 1 0.54713868 1
## 52 0.261 28 0 0.27109988 0
## 53 0.223 25 0 0.10633357 0
## 54 0.222 26 1 0.74900417 1
## 55 0.443 45 1 0.68474539 1
## 56 1.057 37 1 0.40085432 0
## 57 0.391 39 0 0.45464341 0
## 58 0.258 52 1 0.22206950 0
## 59 0.197 26 0 0.15986116 0
## 60 0.278 66 1 0.83579980 1
## 61 0.766 22 0 0.09926279 0
## 62 0.403 43 1 0.92687453 1
## 63 0.142 33 0 0.09877301 0
## 64 0.171 63 0 0.29885728 0
## 65 0.340 27 0 0.30378496 0
## 66 0.245 30 0 0.18756614 0
## 67 0.349 47 1 0.20895172 0
## 68 0.315 23 0 0.07162370 0
Observation
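Two columns are added to the test data: Pvalue, the predicted probability of diabetes (resVal, not a statistical p-value), and POutcome, the predicted class obtained by thresholding that probability at 0.5.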
summarise(group_by(dfrTests, Outcome), n())
## # A tibble: 2 × 2
## Outcome `n()`
## <int> <int>
## 1 0 41
## 2 1 27
Confusion Matrix of Test data
prdVal11 <- predict(mgmModel,dfrTests, type='response')
prdBln21 <- ifelse(prdVal11 > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln21, act=dfrTests$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 38 12
## 1 3 15
##
## Accuracy : 0.7794
## 95% CI : (0.6624, 0.871)
## No Information Rate : 0.6029
## P-Value [Acc > NIR] : 0.001609
##
## Kappa : 0.5115
## Mcnemar's Test P-Value : 0.038867
##
## Sensitivity : 0.9268
## Specificity : 0.5556
## Pos Pred Value : 0.7600
## Neg Pred Value : 0.8333
## Prevalence : 0.6029
## Detection Rate : 0.5588
## Detection Prevalence : 0.7353
## Balanced Accuracy : 0.7412
##
## 'Positive' Class : 0
##
Observation
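On the test data the accuracy is about 78% and the sensitivity about 93%. As before, class 0 is the 'Positive' class, so the sensitivity refers to non-diabetic patients; only about 56% of the diabetic patients (15 of 27) are detected.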
dfrTests$POutcome <- as.factor(dfrTests$POutcome)
levels(dfrTests$POutcome) <- c("Non Diabetic", "Diabetic")
dfrTests
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## 7 10 115 0 0 0 0.0
## 8 2 127 46 21 335 34.4
## 9 9 164 78 0 0 32.8
## 10 2 93 64 32 160 38.0
## 11 3 158 64 13 387 31.2
## 12 5 126 78 27 22 29.6
## 13 10 129 62 36 0 41.2
## 14 0 134 58 20 291 26.4
## 15 3 102 74 0 0 29.5
## 16 7 187 50 33 392 33.9
## 17 3 173 78 39 185 33.8
## 18 10 94 72 18 0 23.1
## 19 1 108 60 46 178 35.5
## 20 5 97 76 27 0 35.6
## 21 4 83 86 19 0 29.3
## 22 1 114 66 36 200 38.1
## 23 1 149 68 29 127 29.3
## 24 5 117 86 30 105 39.1
## 25 1 111 94 0 0 32.8
## 26 4 112 78 40 0 39.4
## 27 1 116 78 29 180 36.1
## 28 0 141 84 26 0 32.4
## 29 2 175 88 0 0 22.9
## 30 2 92 52 0 0 30.1
## 31 3 130 78 23 79 28.4
## 32 8 120 86 0 0 28.4
## 33 2 174 88 37 120 44.5
## 34 2 106 56 27 165 29.0
## 35 2 105 75 0 0 23.3
## 36 4 95 60 32 0 35.4
## 37 0 126 86 27 120 27.4
## 38 8 65 72 23 0 32.0
## 39 2 99 60 17 160 36.6
## 40 1 102 74 0 0 39.5
## 41 11 120 80 37 150 42.3
## 42 3 102 44 20 94 30.8
## 43 1 109 58 18 116 28.5
## 44 9 140 94 0 0 32.7
## 45 13 153 88 37 140 40.6
## 46 12 100 84 33 105 30.0
## 47 1 147 94 41 0 49.3
## 48 1 81 74 41 57 46.3
## 49 3 187 70 22 200 36.4
## 50 6 162 62 0 0 24.3
## 51 4 136 70 0 0 31.2
## 52 1 121 78 39 74 39.0
## 53 3 108 62 24 0 26.0
## 54 0 181 88 44 510 43.3
## 55 8 154 78 32 0 32.4
## 56 1 128 88 39 110 36.5
## 57 7 137 90 41 0 32.0
## 58 0 123 72 0 0 36.3
## 59 1 106 76 0 0 37.5
## 60 6 190 92 0 0 35.5
## 61 2 88 58 26 16 28.4
## 62 9 170 74 31 0 44.0
## 63 9 89 62 0 0 22.5
## 64 10 101 76 48 180 32.9
## 65 2 122 70 27 0 36.8
## 66 5 121 72 23 112 26.2
## 67 1 126 60 0 0 30.1
## 68 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Pvalue POutcome
## 1 0.483 26 0 0.29749244 Non Diabetic
## 2 0.565 49 1 0.30189651 Non Diabetic
## 3 0.905 52 1 0.66026773 Diabetic
## 4 0.304 41 0 0.59821566 Diabetic
## 5 0.118 27 0 0.12424247 Non Diabetic
## 6 0.177 28 0 0.16798839 Non Diabetic
## 7 0.261 30 1 0.08638919 Non Diabetic
## 8 0.176 22 0 0.32567987 Non Diabetic
## 9 0.148 45 1 0.73934182 Diabetic
## 10 0.674 23 1 0.21092502 Non Diabetic
## 11 0.295 24 0 0.51407613 Diabetic
## 12 0.439 40 0 0.29079639 Non Diabetic
## 13 0.441 38 1 0.77789023 Diabetic
## 14 0.352 21 0 0.17788535 Non Diabetic
## 15 0.121 32 0 0.09535232 Non Diabetic
## 16 0.826 34 1 0.92731122 Diabetic
## 17 0.970 31 1 0.77187221 Diabetic
## 18 0.595 56 0 0.17441146 Non Diabetic
## 19 0.415 24 0 0.20058927 Non Diabetic
## 20 0.378 52 1 0.20689720 Non Diabetic
## 21 0.317 34 0 0.06169707 Non Diabetic
## 22 0.289 21 0 0.24393972 Non Diabetic
## 23 0.349 42 1 0.32422871 Non Diabetic
## 24 0.251 42 0 0.35601984 Non Diabetic
## 25 0.265 45 0 0.11066833 Non Diabetic
## 26 0.236 38 0 0.30922789 Non Diabetic
## 27 0.496 25 0 0.22927908 Non Diabetic
## 28 0.433 22 0 0.26869678 Non Diabetic
## 29 0.326 22 0 0.36358534 Non Diabetic
## 30 0.141 22 0 0.08349014 Non Diabetic
## 31 0.323 34 1 0.21677629 Non Diabetic
## 32 0.259 22 1 0.27121466 Non Diabetic
## 33 0.646 24 1 0.84008714 Diabetic
## 34 0.426 22 0 0.13882109 Non Diabetic
## 35 0.560 53 0 0.07617043 Non Diabetic
## 36 0.284 28 0 0.18685838 Non Diabetic
## 37 0.515 21 0 0.12888735 Non Diabetic
## 38 0.600 42 0 0.11674276 Non Diabetic
## 39 0.453 21 0 0.19903185 Non Diabetic
## 40 0.293 42 1 0.18229997 Non Diabetic
## 41 0.785 48 1 0.78431616 Diabetic
## 42 0.400 26 0 0.18073784 Non Diabetic
## 43 0.219 22 0 0.10565220 Non Diabetic
## 44 0.734 45 1 0.63437356 Diabetic
## 45 1.174 39 0 0.94265225 Diabetic
## 46 0.488 46 0 0.34198956 Non Diabetic
## 47 0.358 27 1 0.66958025 Diabetic
## 48 1.096 32 0 0.29483781 Non Diabetic
## 49 0.408 36 1 0.82137020 Diabetic
## 50 0.178 50 1 0.48861097 Non Diabetic
## 51 1.182 22 1 0.54713868 Diabetic
## 52 0.261 28 0 0.27109988 Non Diabetic
## 53 0.223 25 0 0.10633357 Non Diabetic
## 54 0.222 26 1 0.74900417 Diabetic
## 55 0.443 45 1 0.68474539 Diabetic
## 56 1.057 37 1 0.40085432 Non Diabetic
## 57 0.391 39 0 0.45464341 Non Diabetic
## 58 0.258 52 1 0.22206950 Non Diabetic
## 59 0.197 26 0 0.15986116 Non Diabetic
## 60 0.278 66 1 0.83579980 Diabetic
## 61 0.766 22 0 0.09926279 Non Diabetic
## 62 0.403 43 1 0.92687453 Diabetic
## 63 0.142 33 0 0.09877301 Non Diabetic
## 64 0.171 63 0 0.29885728 Non Diabetic
## 65 0.340 27 0 0.30378496 Non Diabetic
## 66 0.245 30 0 0.18756614 Non Diabetic
## 67 0.349 47 1 0.20895172 Non Diabetic
## 68 0.315 23 0 0.07162370 Non Diabetic