Problem Definition
Here, we have a dataset containing diagnostic measurements for more than 700 females.
Based on the available variables, we need to predict whether a female is diabetic or not.
Variable Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Shows whether the person has diabetes (1) or not (0)
Setup
Loading Libraries
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(Deducer)
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
library(caret)
## Loading required package: lattice
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
Functions
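The helper functions detect_na and detect_outliers are called below but their definitions are not shown in this report. A minimal sketch of what they are assumed to do (count NAs; flag values lying more than 1.5 × IQR beyond the quartiles) is given here; the exact implementation may differ.
# Assumed sketch of the helpers used later in this report.
# detect_na(): number of missing (NA) values in a vector.
detect_na <- function(x) {
  sum(is.na(x))
}
# detect_outliers(): values lying more than 1.5 * IQR beyond the quartiles.
detect_outliers <- function(x) {
  qnt <- quantile(x, probs=c(0.25, 0.75), na.rm=TRUE)
  rng <- 1.5 * IQR(x, na.rm=TRUE)
  x[x < (qnt[1] - rng) | x > (qnt[2] + rng)]
}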
Dataset
dfrModel <- read.csv("niddkd-diabetes.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 4 146 85 27 100 28.9
## 2 2 100 66 20 90 32.9
## 3 5 139 64 35 140 28.6
## 4 13 126 90 0 0 43.4
## 5 4 129 86 20 270 35.1
## 6 1 79 75 30 0 32.0
## DiabetesPedigreeFunction Age Outcome
## 1 0.189 27 0
## 2 0.867 28 1
## 3 0.411 26 0
## 4 0.583 42 1
## 5 0.231 23 0
## 6 0.396 22 0
Observation
All columns in the dataset are numeric.
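This can be confirmed quickly by inspecting the column classes (an optional check, not part of the original chunk):
# Optional check: class of each column in the data frame.
sapply(dfrModel, class)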
Missing Data
#sum(is.na(dfrModel$Age))
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation
There is no missing (NA) data in any column.
However, several columns contain many 0 values. These zeros are questionable, because a 0 for insulin level or blood pressure does not indicate a healthy person; it is more likely that the value is missing and was recorded as 0.
In this case, however, we choose to retain the 0s and treat them as values in their own right.
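If the alternative interpretation were preferred, the zeros in the clinical columns could instead be recoded to NA before modelling. A hedged sketch of that path (the choice of columns here is illustrative, not taken from the original analysis):
# Alternative (not used here): treat zeros in selected clinical columns as NA.
zeroAsNA <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
dfrAlt <- dfrModel
dfrAlt[zeroAsNA] <- lapply(dfrAlt[zeroAsNA], function(x) replace(x, x == 0, NA))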
Detecting Outliers
#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 495 485 495 478 744 680 545 465 579 474 480 600 540 480 510
##
## $BMI
## [1] 0.0 0.0 67.1 0.0 0.0 59.4 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.329 2.137 1.731 1.600 2.420 1.699 1.698
##
## $Age
## [1] 81
##
## $Outcome
## integer(0)
Outliers Graph
plotgraph <- function(inp, na.rm=TRUE) {
  # Boxplot of a single column (against a dummy x) to visualize its outliers
  outplot <- ggplot(dfrModel, aes(x="", y=inp)) +
    geom_boxplot(aes(fill=inp), color="blue") +
    labs(title="Diabetes Outcome Outliers")
  outplot
}
lapply(dfrModel, FUN=plotgraph)
## (Boxplots produced for each column: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome; plots not reproduced here.)
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.22217638 0.47010501 0.05862803
## SkinThickness Insulin BMI
## 0.05770830 0.11357000 0.30207096
## DiabetesPedigreeFunction Age Outcome
## 0.17888267 0.22844423 1.00000000
Observation
The correlation coefficients show that Outcome correlates most strongly with Glucose (0.47), followed by BMI (0.30); the remaining predictors show only weak correlation with the probability of a person having diabetes.
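As an aside, the same correlation vector can be obtained in a single call, since all columns are numeric:
# Compact equivalent of the loop above: correlations of every column with Outcome.
cor(dfrModel)[, "Outcome"]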
dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 0 Pregnancies 4
## 2 1 Pregnancies 2
## 3 0 Pregnancies 5
## 4 1 Pregnancies 13
## 5 0 Pregnancies 4
## 6 0 Pregnancies 1
Data Visualization
ggplot(dfrGraph) +
geom_jitter(aes(value,Outcome, colour=variable)) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Outcome With Other Features")
Summary
lapply(dfrModel, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.748 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 117.0 120.6 140.0 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 62.00 72.00 69.14 80.00 122.00
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.66 32.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 40.00 79.69 128.00 744.00
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.40 32.00 32.11 36.50 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2385 0.3700 0.4664 0.6250 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 32.93 40.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3405 1.0000 1.0000
Find the Best Multiple Logistic Model
Choose the best logistic model using stepwise selection with step().
stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=1000)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## Insulin + BMI + DiabetesPedigreeFunction, family = binomial,
## data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7181 -0.7126 -0.4112 0.6955 3.0563
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.7295109 0.7642967 -11.422 < 2e-16 ***
## Pregnancies 0.1530382 0.0302963 5.051 4.39e-07 ***
## Glucose 0.0386501 0.0038904 9.935 < 2e-16 ***
## BloodPressure -0.0131962 0.0055018 -2.399 0.016461 *
## Insulin -0.0019808 0.0008939 -2.216 0.026692 *
## BMI 0.0964194 0.0156434 6.164 7.11e-10 ***
## DiabetesPedigreeFunction 1.1729491 0.3265008 3.592 0.000328 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 896.62 on 698 degrees of freedom
## Residual deviance: 640.39 on 692 degrees of freedom
## AIC: 654.39
##
## Number of Fisher Scoring iterations: 5
Observation
Stepwise selection retains Pregnancies, Glucose, BloodPressure, Insulin, BMI and DiabetesPedigreeFunction as predictors (AIC 654.39).
Make Final Multiple Logistic Model
# make model
mgmModel <- glm(data=dfrModel, formula=Outcome~Pregnancies+Glucose+BMI+DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
## family = binomial(link = "logit"), data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8971 -0.7045 -0.4203 0.6884 2.9627
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.883748 0.717353 -12.384 < 2e-16 ***
## Pregnancies 0.144692 0.029087 4.975 6.54e-07 ***
## Glucose 0.035274 0.003611 9.770 < 2e-16 ***
## BMI 0.083309 0.014804 5.628 1.83e-08 ***
## DiabetesPedigreeFunction 1.064782 0.320967 3.317 0.000909 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 896.62 on 698 degrees of freedom
## Residual deviance: 651.55 on 694 degrees of freedom
## AIC: 661.55
##
## Number of Fisher Scoring iterations: 5
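Because this is a logistic model, the coefficients are on the log-odds scale; exponentiating them gives odds ratios per one-unit increase in each predictor, which can make the effect sizes easier to read. An optional check, not part of the original workflow:
# Optional: express the fitted coefficients as odds ratios.
exp(coef(mgmModel))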
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.6, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 433 123
## 1 28 115
##
## Accuracy : 0.784
## 95% CI : (0.7516, 0.8139)
## No Information Rate : 0.6595
## P-Value [Acc > NIR] : 3.836e-13
##
## Kappa : 0.4676
## Mcnemar's Test P-Value : 2.016e-14
##
## Sensitivity : 0.9393
## Specificity : 0.4832
## Pos Pred Value : 0.7788
## Neg Pred Value : 0.8042
## Prevalence : 0.6595
## Detection Rate : 0.6195
## Detection Prevalence : 0.7954
## Balanced Accuracy : 0.7112
##
## 'Positive' Class : 0
##
Observation
The confusion matrix shows an accuracy of 78.4%, i.e. 78.4% of the predicted outcomes match the actual ones on the training data.
Thus, this model is acceptable to proceed with.
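The 0.6 probability cut-off used above is a modelling choice; a small sketch of how the training accuracy shifts at a few alternative cut-offs (using the objects created above):
# Sketch: training-set accuracy at a few alternative probability cut-offs.
sapply(c(0.4, 0.5, 0.6), function(th) {
  mean(ifelse(prdVal > th, 1, 0) == dfrModel$Outcome)
})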
Regression Data
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, PSurvived=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 4 146 85 27 100 28.9
## 2 2 100 66 20 90 32.9
## 3 5 139 64 35 140 28.6
## 4 13 126 90 0 0 43.4
## 5 4 129 86 20 270 35.1
## 6 1 79 75 30 0 32.0
## DiabetesPedigreeFunction Age Outcome PrdVal PSurvived
## 1 0.189 27 0 0.36677766 0
## 2 0.867 28 1 0.19735484 0
## 3 0.411 26 0 0.39247883 0
## 4 0.583 42 1 0.84266368 1
## 5 0.231 23 0 0.35789950 0
## 6 0.396 22 0 0.05391868 0
Regression Visualization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=Outcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visualization
#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)
Observation
Here, the AUC is 0.8433, which indicates good discriminative ability for this model.
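The AUC reported by Deducer's rocplot() can be cross-checked with the pROC package, assuming it is installed (it is not part of the library setup above):
# Optional cross-check of the AUC using pROC (assumed to be installed).
library(pROC)
pROC::auc(pROC::roc(dfrModel$Outcome, prdVal))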
Test Data
dfrTest <- read.csv("niddkd-diabetestest.csv", header=T, stringsAsFactors=F)
head(dfrTest)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Observation
The test data has the same structure as the training dataset; here too, we choose to retain the 0 values.
Predict
resVal <- predict(mgmModel, dfrTest, type="response")
prdSur <- ifelse(resVal > 0.6, "Diabetic", "Non- Diabetic")
dfrTest <- mutate(dfrTest, Result=resVal, PredictedOutcome=prdSur)
dfrTest
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## 7 3 78 50 32 88 31.0
## 8 10 115 0 0 0 35.3
## 9 2 197 70 45 543 30.5
## 10 8 125 96 0 0 0.0
## 11 4 110 92 0 0 37.6
## 12 10 168 74 0 0 38.0
## 13 10 139 80 0 0 27.1
## 14 1 189 60 23 846 30.1
## 15 5 166 72 19 175 25.8
## 16 7 100 0 0 0 30.0
## 17 0 118 84 47 230 45.8
## 18 7 107 74 0 0 29.6
## 19 1 103 30 38 83 43.3
## 20 1 115 70 30 96 34.6
## 21 3 126 88 41 235 39.3
## 22 8 99 84 0 0 35.4
## 23 7 196 90 0 0 39.8
## 24 9 119 80 35 0 29.0
## 25 11 143 94 33 146 36.6
## 26 10 125 70 26 115 31.1
## 27 7 147 76 0 0 39.4
## 28 1 97 66 15 140 23.2
## 29 13 145 82 19 110 22.2
## 30 5 117 92 0 0 34.1
## 31 5 109 75 26 0 36.0
## 32 3 158 76 36 245 31.6
## 33 3 88 58 11 54 24.8
## 34 6 92 92 0 0 19.9
## 35 10 122 78 31 0 27.6
## 36 4 103 60 33 192 24.0
## 37 11 138 76 0 0 33.2
## 38 9 102 76 37 0 32.9
## 39 2 90 68 42 0 38.2
## 40 4 111 72 47 207 37.1
## 41 3 180 64 25 70 34.0
## 42 7 133 84 0 0 40.2
## 43 7 106 92 18 0 22.7
## 44 9 171 110 24 240 45.4
## 45 7 159 64 0 0 27.4
## 46 0 180 66 39 0 42.0
## 47 1 146 56 0 0 29.7
## 48 2 71 70 27 0 28.0
## 49 7 103 66 32 0 39.1
## 50 7 105 0 0 0 0.0
## 51 1 103 80 11 82 19.4
## 52 1 101 50 15 36 24.2
## 53 5 88 66 21 23 24.4
## 54 8 176 90 34 300 33.7
## 55 7 150 66 42 342 34.7
## 56 1 73 50 10 0 23.0
## 57 7 187 68 39 304 37.7
## 58 0 100 88 60 110 46.8
## 59 0 146 82 0 0 40.5
## 60 0 105 64 41 142 41.5
## 61 2 84 0 0 0 0.0
## 62 8 133 72 0 0 32.9
## 63 5 44 62 0 0 25.0
## 64 2 141 58 34 128 25.4
## 65 7 114 66 0 0 32.8
## 66 5 99 74 27 0 29.0
## 67 0 109 88 30 0 32.5
## 68 2 109 92 0 0 42.7
## 69 1 95 66 13 38 19.6
## DiabetesPedigreeFunction Age Outcome Result PredictedOutcome
## 1 0.627 50 1 0.661904634 Diabetic
## 2 0.351 31 0 0.041052328 Non- Diabetic
## 3 0.672 32 1 0.799890729 Diabetic
## 4 0.167 21 0 0.043904187 Non- Diabetic
## 5 2.288 33 1 0.878217665 Diabetic
## 6 0.201 30 0 0.151646742 Non- Diabetic
## 7 0.248 26 1 0.054596722 Non- Diabetic
## 8 0.134 29 0 0.426339825 Non- Diabetic
## 9 0.158 53 1 0.743420504 Diabetic
## 10 0.232 54 1 0.044366217 Non- Diabetic
## 11 0.191 30 0 0.251802324 Non- Diabetic
## 12 0.537 34 1 0.902628441 Diabetic
## 13 1.441 57 0 0.778736625 Diabetic
## 14 0.398 59 1 0.702485308 Diabetic
## 15 0.587 51 1 0.615297992 Diabetic
## 16 0.484 32 1 0.209348853 Non- Diabetic
## 17 0.551 31 1 0.420884383 Non- Diabetic
## 18 0.254 31 1 0.204214578 Non- Diabetic
## 19 0.183 33 0 0.213546747 Non- Diabetic
## 20 0.529 32 1 0.225000660 Non- Diabetic
## 21 0.704 27 0 0.504632674 Non- Diabetic
## 22 0.388 50 0 0.294882396 Non- Diabetic
## 23 0.451 41 1 0.944734971 Diabetic
## 24 0.263 29 1 0.334503807 Non- Diabetic
## 25 0.254 51 1 0.744909960 Diabetic
## 26 0.205 41 1 0.445627005 Non- Diabetic
## 27 0.257 43 1 0.704835927 Diabetic
## 28 0.487 22 0 0.053853118 Non- Diabetic
## 29 0.245 57 0 0.555359459 Non- Diabetic
## 30 0.337 38 0 0.302898448 Non- Diabetic
## 31 0.546 60 0 0.324120061 Non- Diabetic
## 32 0.851 28 1 0.659788667 Diabetic
## 33 0.267 22 0 0.047644141 Non- Diabetic
## 34 0.188 28 0 0.051547100 Non- Diabetic
## 35 0.512 45 0 0.428280225 Non- Diabetic
## 36 0.966 33 0 0.161957162 Non- Diabetic
## 37 0.420 35 0 0.687569115 Diabetic
## 38 0.665 46 1 0.369444549 Non- Diabetic
## 39 0.503 27 1 0.154243926 Non- Diabetic
## 40 1.390 56 1 0.545193892 Non- Diabetic
## 41 0.271 26 0 0.735119516 Diabetic
## 42 0.696 37 0 0.713139325 Diabetic
## 43 0.235 48 0 0.120204555 Non- Diabetic
## 44 0.721 54 1 0.952588251 Diabetic
## 45 0.294 40 0 0.582590200 Non- Diabetic
## 46 1.893 25 1 0.951672685 Diabetic
## 47 0.564 29 0 0.374208877 Non- Diabetic
## 48 0.586 22 0 0.041755159 Non- Diabetic
## 49 0.344 31 1 0.351149366 Non- Diabetic
## 50 0.305 24 0 0.020993095 Non- Diabetic
## 51 0.491 22 0 0.048948177 Non- Diabetic
## 52 0.526 26 0 0.069125223 Non- Diabetic
## 53 0.342 30 0 0.065420603 Non- Diabetic
## 54 0.467 58 1 0.856537389 Diabetic
## 55 0.718 42 0 0.745654588 Diabetic
## 56 0.248 21 0 0.018273432 Non- Diabetic
## 57 0.254 41 1 0.894408532 Diabetic
## 58 0.962 31 0 0.393375186 Non- Diabetic
## 59 1.781 44 0 0.822981607 Diabetic
## 60 0.173 22 0 0.176764340 Non- Diabetic
## 61 0.304 21 0 0.004929229 Non- Diabetic
## 62 0.270 39 1 0.498405208 Non- Diabetic
## 63 0.587 36 0 0.019831160 Non- Diabetic
## 64 0.699 24 0 0.318567227 Non- Diabetic
## 65 0.258 42 1 0.301031467 Non- Diabetic
## 66 0.203 32 0 0.115471980 Non- Diabetic
## 67 0.855 38 1 0.194516372 Non- Diabetic
## 68 0.845 54 0 0.427406777 Non- Diabetic
## 69 0.334 25 0 0.032310796 Non- Diabetic
Observation
The test data has been scored with the model, and the predicted probabilities and outcomes are shown above.
Test Data Confusion Matrix
resVal <- predict(mgmModel,dfrTest, type='response')
prdSur <- ifelse(resVal > 0.6,1, 0)
cnfmtrx <- table(prd=prdSur, act=dfrTest$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 33 15
## 1 6 15
##
## Accuracy : 0.6957
## 95% CI : (0.5731, 0.8008)
## No Information Rate : 0.5652
## P-Value [Acc > NIR] : 0.01827
##
## Kappa : 0.3586
## Mcnemar's Test P-Value : 0.08086
##
## Sensitivity : 0.8462
## Specificity : 0.5000
## Pos Pred Value : 0.6875
## Neg Pred Value : 0.7143
## Prevalence : 0.5652
## Detection Rate : 0.4783
## Detection Prevalence : 0.6957
## Balanced Accuracy : 0.6731
##
## 'Positive' Class : 0
##
Observation
On the test data, the predicted outcomes match the actual outcomes with an accuracy of about 69.6%.
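The headline accuracy can also be read directly off the confusion matrix table, as a quick sanity check on the caret output:
# Sanity check: overall test accuracy = correct predictions / total test rows.
sum(diag(cnfmtrx)) / sum(cnfmtrx)   # (33 + 15) / 69 ~ 0.6957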