Objective
The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.
Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Attributes
Patient ID: Serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
Setup
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(Deducer)
## Warning: package 'Deducer' was built under R version 3.3.3
## Loading required package: JGR
## Warning: package 'JGR' was built under R version 3.3.3
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
## Warning: package 'iplots' was built under R version 3.3.3
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
library(pscl)
## Warning: package 'pscl' was built under R version 3.3.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
Functions
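The helpers detect_na() and detect_outliers() are called in the sections that follow, but their definitions are not shown in this report. The block below is a minimal sketch of what they might look like, inferred only from the printed output (a count of NA values, and a vector of flagged values). The three-standard-deviation rule and the argument k are assumptions and may not reproduce the outlier lists shown later.

```r
# Hypothetical reconstruction: count the NA entries in a vector
detect_na <- function(inp) {
  sum(is.na(inp))
}

# Hypothetical reconstruction: return the values lying more than k standard
# deviations from the mean (the k = 3 rule is an assumption, not the author's)
detect_outliers <- function(inp, k = 3) {
  inp <- inp[!is.na(inp)]
  centre <- mean(inp)
  spread <- sd(inp)
  inp[abs(inp - centre) > k * spread]
}

# Small usage example on made-up numbers
x <- c(rep(10, 20), 11, 9, 100, NA)
detect_na(x)        # one NA
detect_outliers(x)  # only the value 100 is flagged
```

Both functions take a single vector, so they can be applied to every column with lapply(), as the report does.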
Dataset
dfrModel <- read.csv("./data/niddkd-diabetes train.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Observation The data contains no non-numeric values, so no columns need to be dropped or converted to numeric.
Missing Data
#sum(is.na(dfrModel$Age))
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation No missing data or NA records found in the Train data
Impute Data
dfrModel$Age[is.na(dfrModel$Age)] <- round(mean(dfrModel$Age[!is.na(dfrModel$Age)]),digits=0)
dfrModel$Age <- as.integer(dfrModel$Age)
detect_na(dfrModel$Age)
## [1] 0
#head(dfrModel)
Observation Because the training data contains no NA values, the imputation step replaces nothing, and detect_na() still reports 0 for Age.
Outliers Data
#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
##
## $BMI
## [1] 0.0 0.0 0.0 0.0 0.0 67.1 0.0 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
##
## $Age
## [1] 81
##
## $Outcome
## integer(0)
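Several of the values flagged above are zeros (BloodPressure, BMI), which are not physiologically plausible for a living patient. Assuming that a zero in such a column stands for an unrecorded measurement (an assumption; the source does not say so), a check like the following sketch would count them per column. The toy data frame and the column list zero_cols are illustrative and not part of the original analysis.

```r
# Toy data with the same column names as the report (values are made up)
toy <- data.frame(
  Glucose       = c(148, 85, 0, 89),
  BloodPressure = c(72, 0, 64, 0),
  BMI           = c(33.6, 26.6, 0, 28.1)
)

# Columns where zero is assumed to mean "not recorded" (assumption)
zero_cols <- c("Glucose", "BloodPressure", "BMI")

# Count the zeros in each of those columns
zero_counts <- sapply(toy[zero_cols], function(col) sum(col == 0))
zero_counts

# Optionally recode the zeros as NA so that NA-based checks can see them
toy_na <- toy
toy_na[zero_cols] <- lapply(toy_na[zero_cols],
                            function(col) replace(col, col == 0, NA))
colSums(is.na(toy_na))
```

Recoding to NA keeps the rows in the data while making the gaps visible to functions that count or impute missing values.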
Outliers Graph
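plotgraph <- function(inp, colname) {
  # Boxplot of a single column; points beyond the whiskers are potential outliers
  ggplot(data.frame(value=inp), aes(x="", y=value)) +
    geom_boxplot(fill="lightblue", color="blue") +
    labs(title=paste(colname, "Outliers"), x="", y=colname)
}
# Map() passes each column together with its name, so every plot gets its own title
Map(plotgraph, dfrModel, names(dfrModel))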
## $Pregnancies
##
## $Glucose
##
## $BloodPressure
##
## $SkinThickness
##
## $Insulin
##
## $BMI
##
## $DiabetesPedigreeFunction
##
## $Age
##
## $Outcome
Observation The boxplots above show the outliers present in each column of the dataset.
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.22802323 0.45976779 0.06135700
## SkinThickness Insulin BMI
## 0.08566126 0.14608347 0.30940050
## DiabetesPedigreeFunction Age Outcome
## 0.17257062 0.22617053 1.00000000
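The loop above can also be written without explicit iteration. As a sketch on a small data frame built from three columns of the six rows printed by head() earlier (the name toy is illustrative), sapply() applies cor() to every column and returns a named vector directly:

```r
# Three columns of the first six training rows shown by head(dfrModel)
toy <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

# Correlation of every column with Outcome, as a named vector
vctCorr <- sapply(toy, function(col) cor(toy$Outcome, as.numeric(col)))
vctCorr

# Equivalent: take the Outcome column of the full correlation matrix
cor(toy)[, "Outcome"]
```

With only six rows these coefficients are not meaningful in themselves; the point is that both forms give the same named vector as the loop.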
Data For Visualization
dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 1 Pregnancies 6
## 2 0 Pregnancies 1
## 3 1 Pregnancies 8
## 4 0 Pregnancies 1
## 5 1 Pregnancies 0
## 6 0 Pregnancies 5
Data Visualization
ggplot(dfrGraph) +
geom_jitter(aes(value,Outcome, colour=variable)) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Outcome with other features")
Observation The graphs above show how Outcome relates to each of the other features in the dataset.
Summary
lapply(dfrModel, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.827 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 116.0 120.5 140.8 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 62.50 72.00 68.85 80.00 122.00
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.43 32.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 36.50 79.89 126.00 846.00
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.00 32.00 31.87 36.50 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2400 0.3745 0.4752 0.6355 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.14 40.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3453 1.0000 1.0000
Find Best Multi Logistic Model
Choose the best logistic model by using step().
stpModel <- step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=100)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7673 -0.7310 -0.4075 0.7171 2.8851
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.061128 0.717120 -11.241 < 2e-16 ***
## Pregnancies 0.157717 0.029307 5.382 7.38e-08 ***
## Glucose 0.033371 0.003518 9.487 < 2e-16 ***
## BloodPressure -0.012371 0.005259 -2.352 0.01866 *
## BMI 0.092684 0.015071 6.150 7.76e-10 ***
## DiabetesPedigreeFunction 0.913510 0.305415 2.991 0.00278 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 899.68 on 697 degrees of freedom
## Residual deviance: 656.31 on 692 degrees of freedom
## AIC: 668.31
##
## Number of Fisher Scoring iterations: 5
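Observation step() retains five predictors: Pregnancies, Glucose, BloodPressure, BMI and DiabetesPedigreeFunction (AIC 668.31). The final model below is restricted to the three most significant of these, Outcome~Pregnancies+Glucose+BMI, which gives a simpler model at the cost of a slightly higher AIC (679.18).
Make Final Multi Logistic Model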
# make model
mgmModel <- glm(data=dfrModel, formula=Outcome ~Pregnancies+Glucose+BMI, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI, family = binomial(link = "logit"),
## data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1712 -0.7257 -0.4233 0.7626 2.8089
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.21936 0.67333 -12.207 < 2e-16 ***
## Pregnancies 0.13938 0.02812 4.958 7.13e-07 ***
## Glucose 0.03291 0.00343 9.595 < 2e-16 ***
## BMI 0.08866 0.01458 6.081 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 899.68 on 697 degrees of freedom
## Residual deviance: 671.18 on 694 degrees of freedom
## AIC: 679.18
##
## Number of Fisher Scoring iterations: 5
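The two fits can be compared through their AIC. For a binomial GLM on 0/1 responses, the AIC equals the residual deviance plus twice the number of estimated coefficients, so the values printed above can be checked by hand. This is a worked check using numbers copied from the two summaries, not part of the original analysis; the variable names are illustrative.

```r
# Residual deviance and number of coefficients (including the intercept),
# copied from the two model summaries above
dev_step  <- 656.31; k_step  <- 6   # model chosen by step()
dev_final <- 671.18; k_final <- 4   # Outcome ~ Pregnancies + Glucose + BMI

aic_step  <- dev_step  + 2 * k_step    # 668.31
aic_final <- dev_final + 2 * k_final   # 679.18
c(step = aic_step, final = aic_final)

# Likelihood-ratio test for dropping BloodPressure and DiabetesPedigreeFunction:
# the deviance difference is compared with a chi-squared distribution
lr_stat <- dev_final - dev_step
p_value <- pchisq(lr_stat, df = k_step - k_final, lower.tail = FALSE)
p_value
```

The lower AIC and the small p-value both favour the five-predictor model; the three-predictor model trades some fit for simplicity.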
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 400 105
## 1 57 136
##
## Accuracy : 0.7679
## 95% CI : (0.7348, 0.7988)
## No Information Rate : 0.6547
## P-Value [Acc > NIR] : 5.547e-11
##
## Kappa : 0.4613
## Mcnemar's Test P-Value : 0.0002219
##
## Sensitivity : 0.8753
## Specificity : 0.5643
## Pos Pred Value : 0.7921
## Neg Pred Value : 0.7047
## Prevalence : 0.6547
## Detection Rate : 0.5731
## Detection Prevalence : 0.7235
## Balanced Accuracy : 0.7198
##
## 'Positive' Class : 0
##
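The statistics above can be recomputed by hand from the four cell counts, which also makes it clear which class each rate refers to. A minimal sketch, with the counts copied from the matrix above and illustrative variable names:

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
cnf <- matrix(c(400, 57, 105, 136), nrow = 2,
              dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# Share of all patients classified correctly
accuracy <- sum(diag(cnf)) / sum(cnf)

# Share of actual non-diabetic patients (class 0) predicted as 0;
# caret reports this as Sensitivity because 0 is the 'Positive' class
recall_nondiabetic <- cnf["0", "0"] / sum(cnf[, "0"])

# Share of actual diabetic patients (class 1) predicted as 1;
# caret reports this as Specificity here
recall_diabetic <- cnf["1", "1"] / sum(cnf[, "1"])

round(c(accuracy = accuracy,
        recall_nondiabetic = recall_nondiabetic,
        recall_diabetic = recall_diabetic), 4)
```

caret's confusionMatrix() also accepts a positive argument (for example positive="1"), which reports sensitivity and the related statistics for the diabetic class instead.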
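Observation The cut-off above is 0.5: a patient with a predicted probability greater than 0.5 is considered diabetic. The model has an accuracy of 0.7679. Because caret reports 0 as the 'Positive' class, the sensitivity of 0.8753 refers to non-diabetic patients; of the 241 diabetic patients, 136 (0.5643) are correctly identified.
Regression Data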
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome PrdVal POutcome
## 1 0.627 50 1 0.61471239 1
## 2 0.351 31 0 0.05098206 0
## 3 0.672 32 1 0.72804655 1
## 4 0.167 21 0 0.06541757 0
## 5 2.288 33 1 0.52773867 1
## 6 0.201 30 0 0.19236087 0
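The PrdVal column can be reproduced by hand from the fitted coefficients: the linear predictor is passed through the inverse logit. The sketch below does this for the first training row, using the rounded coefficients from the final model summary, so the result matches the printed value only to about three decimal places. The names b, eta and p are illustrative.

```r
# Coefficients copied (rounded) from the final model summary
b <- c(intercept = -8.21936, Pregnancies = 0.13938,
       Glucose = 0.03291, BMI = 0.08866)

# First training row: Pregnancies = 6, Glucose = 148, BMI = 33.6
eta <- b["intercept"] + b["Pregnancies"] * 6 +
  b["Glucose"] * 148 + b["BMI"] * 33.6

# Inverse logit: plogis(eta) is 1 / (1 + exp(-eta))
p <- unname(plogis(eta))
p

# Apply the 0.5 cut-off used in the report
as.integer(p > 0.5)
```

The probability is close to the 0.61471239 shown for row 1, and the cut-off gives a predicted outcome of 1.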
Regression Visualization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visualization
#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)
Observation The AUC is 0.8279 and the accuracy from the confusion matrix is 0.7679. The rule of thumb applied here is that the two should differ by no more than 10%. The difference is about 6%, so the model is considered a good one.
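The AUC reported by rocplot() is the probability that a randomly chosen diabetic patient receives a higher predicted probability than a randomly chosen non-diabetic one. As a sketch of that definition (not the code Deducer uses internally), it can be computed from ranks; auc_rank is a hypothetical helper, and the example uses only the six training rows printed earlier.

```r
# AUC via the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, label) {
  r  <- rank(score)            # average ranks, so ties are handled
  n1 <- sum(label == 1)        # number of positives
  n0 <- sum(label == 0)        # number of negatives
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# PrdVal and Outcome of the six training rows shown by head(dfrPlot)
score <- c(0.61471239, 0.05098206, 0.72804655,
           0.06541757, 0.52773867, 0.19236087)
label <- c(1, 0, 1, 0, 1, 0)

auc_rank(score, label)
```

On these six rows every diabetic patient is ranked above every non-diabetic one, so the result is 1; the 0.8279 in the report comes from the full training set.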
Test Data
dfrTests <- read.csv("./data/niddkd-diabetes test.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrTests)
## S.No Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 1 2 122 76 27 200 35.9
## 2 2 6 125 78 31 0 27.6
## 3 3 1 168 88 29 0 35.0
## 4 4 2 129 0 0 0 38.5
## 5 5 4 110 76 20 100 28.4
## 6 6 6 80 80 36 0 39.8
## DiabetesPedigreeFunction Age Outcome
## 1 0.483 26 0
## 2 0.565 49 1
## 3 0.905 52 1
## 4 0.304 41 0
## 5 0.118 27 0
## 6 0.177 28 0
Observation Test data loaded successfully.
Predict
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
# label 0 as "No Diabetes" and 1 as "Diabetes" (explicit levels avoid swapping them)
prdSur <- factor(prdSur, levels=c(0, 1), labels=c("No Diabetes", "Diabetes"))
# note: Outcome=prdSur overwrites the actual Outcome column read from the test file
dfrTests <- mutate(dfrTests, Result=resVal, Outcome=prdSur)
dfrTests
## S.No Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 1 2 122 76 27 200 35.9
## 2 2 6 125 78 31 0 27.6
## 3 3 1 168 88 29 0 35.0
## 4 4 2 129 0 0 0 38.5
## 5 5 4 110 76 20 100 28.4
## 6 6 6 80 80 36 0 39.8
## 7 7 10 115 0 0 0 0.0
## 8 8 2 127 46 21 335 34.4
## 9 9 9 164 78 0 0 32.8
## 10 10 2 93 64 32 160 38.0
## 11 11 3 158 64 13 387 31.2
## 12 12 5 126 78 27 22 29.6
## 13 13 10 129 62 36 0 41.2
## 14 14 0 134 58 20 291 26.4
## 15 15 3 102 74 0 0 29.5
## 16 16 7 187 50 33 392 33.9
## 17 17 3 173 78 39 185 33.8
## 18 18 10 94 72 18 0 23.1
## 19 19 1 108 60 46 178 35.5
## 20 20 5 97 76 27 0 35.6
## 21 21 4 83 86 19 0 29.3
## 22 22 1 114 66 36 200 38.1
## 23 23 1 149 68 29 127 29.3
## 24 24 5 117 86 30 105 39.1
## 25 25 1 111 94 0 0 32.8
## 26 26 4 112 78 40 0 39.4
## 27 27 1 116 78 29 180 36.1
## 28 28 0 141 84 26 0 32.4
## 29 29 2 175 88 0 0 22.9
## 30 30 2 92 52 0 0 30.1
## 31 31 3 130 78 23 79 28.4
## 32 32 8 120 86 0 0 28.4
## 33 33 2 174 88 37 120 44.5
## 34 34 2 106 56 27 165 29.0
## 35 35 2 105 75 0 0 23.3
## 36 36 4 95 60 32 0 35.4
## 37 37 0 126 86 27 120 27.4
## 38 38 8 65 72 23 0 32.0
## 39 39 2 99 60 17 160 36.6
## 40 40 1 102 74 0 0 39.5
## 41 41 11 120 80 37 150 42.3
## 42 42 3 102 44 20 94 30.8
## 43 43 1 109 58 18 116 28.5
## 44 44 9 140 94 0 0 32.7
## 45 45 13 153 88 37 140 40.6
## 46 46 12 100 84 33 105 30.0
## 47 47 1 147 94 41 0 49.3
## 48 48 1 81 74 41 57 46.3
## 49 49 3 187 70 22 200 36.4
## 50 50 6 162 62 0 0 24.3
## 51 51 4 136 70 0 0 31.2
## 52 52 1 121 78 39 74 39.0
## 53 53 3 108 62 24 0 26.0
## 54 54 0 181 88 44 510 43.3
## 55 55 8 154 78 32 0 32.4
## 56 56 1 128 88 39 110 36.5
## 57 57 7 137 90 41 0 32.0
## 58 58 0 123 72 0 0 36.3
## 59 59 1 106 76 0 0 37.5
## 60 60 6 190 92 0 0 35.5
## 61 61 2 88 58 26 16 28.4
## 62 62 9 170 74 31 0 44.0
## 63 63 9 89 62 0 0 22.5
## 64 64 10 101 76 48 180 32.9
## 65 65 2 122 70 27 0 36.8
## 66 66 5 121 72 23 112 26.2
## 67 67 1 126 60 0 0 30.1
## 68 68 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Result
## 1 0.483 26 No Diabetes 0.32251885
## 2 0.565 49 No Diabetes 0.30537882
## 3 0.905 52 Diabetes 0.63474985
## 4 0.304 41 No Diabetes 0.43013313
## 5 0.118 27 No Diabetes 0.17896659
## 6 0.177 28 No Diabetes 0.22770716
## 7 0.261 30 No Diabetes 0.04563324
## 8 0.176 22 No Diabetes 0.32945772
## 9 0.148 45 Diabetes 0.79265655
## 10 0.674 23 No Diabetes 0.18085833
## 11 0.295 24 Diabetes 0.54124880
## 12 0.439 40 No Diabetes 0.32061105
## 13 0.441 38 Diabetes 0.74519418
## 14 0.352 21 No Diabetes 0.18720393
## 15 0.121 32 No Diabetes 0.13841280
## 16 0.826 34 Diabetes 0.87178045
## 17 0.970 31 Diabetes 0.70880669
## 18 0.595 56 No Diabetes 0.15662207
## 19 0.415 24 No Diabetes 0.20135222
## 20 0.378 52 No Diabetes 0.23621983
## 21 0.317 34 No Diabetes 0.08848698
## 22 0.289 21 No Diabetes 0.27891173
## 23 0.349 42 No Diabetes 0.35937553
## 24 0.251 42 No Diabetes 0.44894623
## 25 0.265 45 No Diabetes 0.17968222
## 26 0.236 38 No Diabetes 0.38171256
## 27 0.496 25 No Diabetes 0.25705032
## 28 0.433 22 No Diabetes 0.33049915
## 29 0.326 22 No Diabetes 0.46248430
## 30 0.141 22 No Diabetes 0.09588001
## 31 0.323 34 No Diabetes 0.26806391
## 32 0.259 22 No Diabetes 0.34599547
## 33 0.646 24 Diabetes 0.84963981
## 34 0.426 22 No Diabetes 0.13232076
## 35 0.560 53 No Diabetes 0.08174545
## 36 0.284 28 No Diabetes 0.19837830
## 37 0.515 21 No Diabetes 0.16206612
## 38 0.600 42 No Diabetes 0.10642519
## 39 0.453 21 No Diabetes 0.19198217
## 40 0.293 42 No Diabetes 0.22781121
## 41 0.785 48 Diabetes 0.73376818
## 42 0.400 26 No Diabetes 0.15273892
## 43 0.219 22 No Diabetes 0.12286610
## 44 0.734 45 Diabetes 0.63232650
## 45 1.174 39 Diabetes 0.90273907
## 46 0.488 46 No Diabetes 0.35535578
## 47 0.358 27 Diabetes 0.75570599
## 48 1.096 32 No Diabetes 0.21265017
## 49 0.408 36 Diabetes 0.82933278
## 50 0.178 50 Diabetes 0.52583462
## 51 1.182 22 No Diabetes 0.39667136
## 52 0.261 28 No Diabetes 0.34532009
## 53 0.223 25 No Diabetes 0.12549867
## 54 0.222 26 Diabetes 0.82878182
## 55 0.443 45 Diabetes 0.69783837
## 56 1.057 37 No Diabetes 0.34730098
## 57 0.391 39 Diabetes 0.52563730
## 58 0.258 52 No Diabetes 0.27836018
## 59 0.197 26 No Diabetes 0.21987859
## 60 0.278 66 Diabetes 0.88267472
## 61 0.766 22 No Diabetes 0.07403892
## 62 0.403 43 Diabetes 0.92631946
## 63 0.142 33 No Diabetes 0.11499251
## 64 0.171 63 No Diabetes 0.35793888
## 65 0.340 27 No Diabetes 0.34019466
## 66 0.245 30 No Diabetes 0.22846897
## 67 0.349 47 No Diabetes 0.22025669
## 68 0.315 23 No Diabetes 0.08917608
Confusion Matrix
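# note: predict() is called without newdata, so this matrix is again computed on the training data (dfrModel)
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)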
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 400 105
## 1 57 136
##
## Accuracy : 0.7679
## 95% CI : (0.7348, 0.7988)
## No Information Rate : 0.6547
## P-Value [Acc > NIR] : 5.547e-11
##
## Kappa : 0.4613
## Mcnemar's Test P-Value : 0.0002219
##
## Sensitivity : 0.8753
## Specificity : 0.5643
## Pos Pred Value : 0.7921
## Neg Pred Value : 0.7047
## Prevalence : 0.6547
## Detection Rate : 0.5731
## Detection Prevalence : 0.7235
## Balanced Accuracy : 0.7198
##
## 'Positive' Class : 0
##
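The matrix above comes from predict(mgmModel) without new data, so it describes the training set. To evaluate on the test set, the actual outcomes from the test file have to be kept (before the Outcome column is overwritten) and compared with the thresholded test predictions. A minimal sketch, using only the six test rows shown by head(dfrTests) and their values from the Result column; actTest, resTest, prdTest and cnfTest are illustrative names:

```r
# Actual Outcome of the first six test rows, as printed by head(dfrTests)
actTest <- c(0, 1, 1, 0, 0, 0)

# Predicted probabilities of the same rows, from the Result column
resTest <- c(0.32251885, 0.30537882, 0.63474985,
             0.43013313, 0.17896659, 0.22770716)

# Apply the same 0.5 cut-off as in the report
prdTest <- ifelse(resTest > 0.5, 1, 0)

# factor() with fixed levels keeps the table 2 x 2 even if a class is never predicted
cnfTest <- table(prd = factor(prdTest, levels = c(0, 1)),
                 act = factor(actTest, levels = c(0, 1)))
cnfTest

# Accuracy on these six rows only
sum(diag(cnfTest)) / sum(cnfTest)
```

Six rows are far too few to judge the model; the same steps applied to all 68 test rows would give the test-set accuracy.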
Summary Test Data
The main aim was to predict whether the patients in the test data have diabetes. Predictions were made with a cut-off of 0.5: a patient with a predicted probability greater than 0.5 is considered diabetic. The outcome for each patient is shown in the Predict section. The model has an accuracy of 0.7679, measured on the training data.