Problem Definition
The objective is to predict based on diagnostic measurements whether a patient has diabetes.
The niddkd-diabetes.csv is used to create a logistic model. It has 699 observations of 9 variables. The niddkd-diabetes.csv is used to find out whether the patient has diabetes or not. It has 69 observations of 9 variables.
All patients here are females at least 21 years old of Pima Indian heritage.
Data Location
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
Data Description
Patient ID: serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
library(pscl)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
Functions
Dataset
setwd("D:/R-BA/R-Scripts")
dfrModel <- read.csv("./data/niddkd-diabetes.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Observation
The dataset consists only of numeric data.
Missing Data
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation
There is no missing data in this dataset. Hence, no data imputation will be required.
Outliers Data
#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
##
## $BMI
## [1] 0.0 0.0 0.0 0.0 0.0 67.1 0.0 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
##
## $Age
## [1] 81
##
## $Outcome
## integer(0)
Observation
Outliers are present but its count is low. Hence, in this model we can work with outliers.
Outliers Graph
lapply(dfrModel, FUN=plotgraph)
## $Pregnancies
##
## $Glucose
##
## $BloodPressure
##
## $SkinThickness
##
## $Insulin
##
## $BMI
##
## $DiabetesPedigreeFunction
##
## $Age
##
## $Outcome
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.22788336 0.45937296 0.06027482
## SkinThickness Insulin BMI
## 0.08621860 0.14530972 0.30891558
## DiabetesPedigreeFunction Age Outcome
## 0.17211110 0.22650741 1.00000000
Data For Visualization
dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 1 Pregnancies 6
## 2 0 Pregnancies 1
## 3 1 Pregnancies 8
## 4 0 Pregnancies 1
## 5 1 Pregnancies 0
## 6 0 Pregnancies 5
Data Visualization
ggplot(dfrGraph) +
geom_jitter(aes(value,Outcome, colour=variable)) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Outcome With Other Features")
Observation
There is some impact of all the features with respect to the outcome.
Summary
lapply(dfrModel, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.827 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 116.0 120.5 140.5 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 63.00 72.00 68.88 80.00 122.00
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.41 32.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 37.00 79.99 127.00 846.00
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.00 32.00 31.87 36.50 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2400 0.3750 0.4754 0.6340 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.13 40.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3448 1.0000 1.0000
Best Multi Logistic Model
stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=10000)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7640 -0.7318 -0.4090 0.7146 2.8871
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.053940 0.716823 -11.236 < 2e-16 ***
## Pregnancies 0.157926 0.029314 5.387 7.15e-08 ***
## Glucose 0.033384 0.003518 9.488 < 2e-16 ***
## BloodPressure -0.012549 0.005256 -2.388 0.01696 *
## BMI 0.092717 0.015076 6.150 7.75e-10 ***
## DiabetesPedigreeFunction 0.909844 0.305380 2.979 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 657.22 on 693 degrees of freedom
## AIC: 669.22
##
## Number of Fisher Scoring iterations: 5
Observation
Best results given by Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction
The difference between the null deviance and residual deviance is greater than 70% which indicates that this model is a good fit.
Final Multi Linear Model
mgmModel <- glm(data=dfrModel, formula=Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction, family=binomial(link="logit"))
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial(link = "logit"),
## data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7640 -0.7318 -0.4090 0.7146 2.8871
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.053940 0.716823 -11.236 < 2e-16 ***
## Pregnancies 0.157926 0.029314 5.387 7.15e-08 ***
## Glucose 0.033384 0.003518 9.488 < 2e-16 ***
## BloodPressure -0.012549 0.005256 -2.388 0.01696 *
## BMI 0.092717 0.015076 6.150 7.75e-10 ***
## DiabetesPedigreeFunction 0.909844 0.305380 2.979 0.00289 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 900.53 on 698 degrees of freedom
## Residual deviance: 657.22 on 693 degrees of freedom
## AIC: 669.22
##
## Number of Fisher Scoring iterations: 5
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 403 103
## 1 55 138
##
## Accuracy : 0.774
## 95% CI : (0.7411, 0.8045)
## No Information Rate : 0.6552
## P-Value [Acc > NIR] : 5.519e-12
##
## Kappa : 0.4749
## Mcnemar's Test P-Value : 0.0001847
##
## Sensitivity : 0.8799
## Specificity : 0.5726
## Pos Pred Value : 0.7964
## Neg Pred Value : 0.7150
## Prevalence : 0.6552
## Detection Rate : 0.5765
## Detection Prevalence : 0.7239
## Balanced Accuracy : 0.7263
##
## 'Positive' Class : 0
##
Observation
The confusion matrix shows an accuracy of 77.4%.
Regression Data
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, PSurvived=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome PrdVal PSurvived
## 1 0.627 50 1 0.64944982 1
## 2 0.351 31 0 0.04306549 0
## 3 0.672 32 1 0.78368126 1
## 4 0.167 21 0 0.04761529 0
## 5 2.288 33 1 0.89046208 1
## 6 0.201 30 0 0.14628967 0
Regression Visulaization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=PSurvived)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
Test Data
setwd("D:/R-BA/R-Scripts")
dfrTests <- read.csv("./data/niddkd-diabetes-test.csv", header=T, stringsAsFactors=F)
head(dfrTests)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 4 118 70 0 0 44.5
## 2 2 122 76 27 200 35.9
## 3 6 125 78 31 0 27.6
## 4 1 168 88 29 0 35.0
## 5 2 129 0 0 0 38.5
## 6 4 110 76 20 100 28.4
## DiabetesPedigreeFunction Age Outcome
## 1 0.904 26 0
## 2 0.483 26 0
## 3 0.565 49 1
## 4 0.905 52 1
## 5 0.304 41 0
## 6 0.118 27 0
Observation
Test Data successfully created.
Predict
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
prdSur <- as.factor(prdSur)
levels(prdSur) <- c("Not Diabetic", "Diabetic")
dfrTests <- mutate(dfrTests, Result=resVal, Outcome=prdSur)
dfrTests
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 4 118 70 0 0 44.5
## 2 2 122 76 27 200 35.9
## 3 6 125 78 31 0 27.6
## 4 1 168 88 29 0 35.0
## 5 2 129 0 0 0 38.5
## 6 4 110 76 20 100 28.4
## 7 6 80 80 36 0 39.8
## 8 10 115 0 0 0 0.0
## 9 2 127 46 21 335 34.4
## 10 9 164 78 0 0 32.8
## 11 2 93 64 32 160 38.0
## 12 3 158 64 13 387 31.2
## 13 5 126 78 27 22 29.6
## 14 10 129 62 36 0 41.2
## 15 0 134 58 20 291 26.4
## 16 3 102 74 0 0 29.5
## 17 7 187 50 33 392 33.9
## 18 3 173 78 39 185 33.8
## 19 10 94 72 18 0 23.1
## 20 1 108 60 46 178 35.5
## 21 5 97 76 27 0 35.6
## 22 4 83 86 19 0 29.3
## 23 1 114 66 36 200 38.1
## 24 1 149 68 29 127 29.3
## 25 5 117 86 30 105 39.1
## 26 1 111 94 0 0 32.8
## 27 4 112 78 40 0 39.4
## 28 1 116 78 29 180 36.1
## 29 0 141 84 26 0 32.4
## 30 2 175 88 0 0 22.9
## 31 2 92 52 0 0 30.1
## 32 3 130 78 23 79 28.4
## 33 8 120 86 0 0 28.4
## 34 2 174 88 37 120 44.5
## 35 2 106 56 27 165 29.0
## 36 2 105 75 0 0 23.3
## 37 4 95 60 32 0 35.4
## 38 0 126 86 27 120 27.4
## 39 8 65 72 23 0 32.0
## 40 2 99 60 17 160 36.6
## 41 1 102 74 0 0 39.5
## 42 11 120 80 37 150 42.3
## 43 3 102 44 20 94 30.8
## 44 1 109 58 18 116 28.5
## 45 9 140 94 0 0 32.7
## 46 13 153 88 37 140 40.6
## 47 12 100 84 33 105 30.0
## 48 1 147 94 41 0 49.3
## 49 1 81 74 41 57 46.3
## 50 3 187 70 22 200 36.4
## 51 6 162 62 0 0 24.3
## 52 4 136 70 0 0 31.2
## 53 1 121 78 39 74 39.0
## 54 3 108 62 24 0 26.0
## 55 0 181 88 44 510 43.3
## 56 8 154 78 32 0 32.4
## 57 1 128 88 39 110 36.5
## 58 7 137 90 41 0 32.0
## 59 0 123 72 0 0 36.3
## 60 1 106 76 0 0 37.5
## 61 6 190 92 0 0 35.5
## 62 2 88 58 26 16 28.4
## 63 9 170 74 31 0 44.0
## 64 9 89 62 0 0 22.5
## 65 10 101 76 48 180 32.9
## 66 2 122 70 27 0 36.8
## 67 5 121 72 23 112 26.2
## 68 1 126 60 0 0 30.1
## 69 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Result
## 1 0.904 26 Diabetic 0.64267647
## 2 0.483 26 Not Diabetic 0.29921925
## 3 0.565 49 Not Diabetic 0.30170247
## 4 0.905 52 Diabetic 0.66298462
## 5 0.304 41 Diabetic 0.60218508
## 6 0.118 27 Not Diabetic 0.12311702
## 7 0.177 28 Not Diabetic 0.16960382
## 8 0.261 30 Not Diabetic 0.08331746
## 9 0.176 22 Not Diabetic 0.32606579
## 10 0.148 45 Diabetic 0.73870615
## 11 0.674 23 Not Diabetic 0.21415309
## 12 0.295 24 Diabetic 0.51311756
## 13 0.439 40 Not Diabetic 0.29050080
## 14 0.441 38 Diabetic 0.78158743
## 15 0.352 21 Not Diabetic 0.17647903
## 16 0.121 32 Not Diabetic 0.09462590
## 17 0.826 34 Diabetic 0.92833929
## 18 0.970 31 Diabetic 0.77430137
## 19 0.595 56 Not Diabetic 0.17407344
## 20 0.415 24 Not Diabetic 0.20189551
## 21 0.378 52 Not Diabetic 0.20830398
## 22 0.317 34 Not Diabetic 0.06148078
## 23 0.289 21 Not Diabetic 0.24544457
## 24 0.349 42 Not Diabetic 0.32277527
## 25 0.251 42 Not Diabetic 0.35800410
## 26 0.265 45 Not Diabetic 0.11028728
## 27 0.236 38 Not Diabetic 0.31124115
## 28 0.496 25 Not Diabetic 0.23076938
## 29 0.433 22 Not Diabetic 0.26835720
## 30 0.326 22 Not Diabetic 0.35883181
## 31 0.141 22 Not Diabetic 0.08315153
## 32 0.323 34 Not Diabetic 0.21550226
## 33 0.259 22 Not Diabetic 0.26994538
## 34 0.646 24 Diabetic 0.84290582
## 35 0.426 22 Not Diabetic 0.13874107
## 36 0.560 53 Not Diabetic 0.07556685
## 37 0.284 28 Not Diabetic 0.18798455
## 38 0.515 21 Not Diabetic 0.12810698
## 39 0.600 42 Not Diabetic 0.11802413
## 40 0.453 21 Not Diabetic 0.20093624
## 41 0.293 42 Not Diabetic 0.18383046
## 42 0.785 48 Diabetic 0.78943297
## 43 0.400 26 Not Diabetic 0.18129856
## 44 0.219 22 Not Diabetic 0.10495385
## 45 0.734 45 Diabetic 0.63672200
## 46 1.174 39 Diabetic 0.94453500
## 47 0.488 46 Not Diabetic 0.34319929
## 48 0.358 27 Diabetic 0.67446282
## 49 1.096 32 Not Diabetic 0.30352146
## 50 0.408 36 Diabetic 0.82204322
## 51 0.178 50 Not Diabetic 0.48467839
## 52 1.182 22 Diabetic 0.55174117
## 53 0.261 28 Not Diabetic 0.27249731
## 54 0.223 25 Not Diabetic 0.10534153
## 55 0.222 26 Diabetic 0.75041171
## 56 0.443 45 Diabetic 0.68542367
## 57 1.057 37 Not Diabetic 0.40580178
## 58 0.391 39 Not Diabetic 0.45471929
## 59 0.258 52 Not Diabetic 0.22254103
## 60 0.197 26 Not Diabetic 0.16044107
## 61 0.278 66 Diabetic 0.83567334
## 62 0.766 22 Not Diabetic 0.09992387
## 63 0.403 43 Diabetic 0.92825503
## 64 0.142 33 Not Diabetic 0.09760062
## 65 0.171 63 Not Diabetic 0.29928838
## 66 0.340 27 Not Diabetic 0.30525694
## 67 0.245 30 Not Diabetic 0.18598321
## 68 0.349 47 Not Diabetic 0.20844816
## 69 0.315 23 Not Diabetic 0.07145369
Observation
Thus, the test data has been tested for Outcome prediction.