Problem Definition
The objective is to predict, from diagnostic measurements, whether a patient has diabetes.
Use train.csv (700 observations of 9 variables) to fit a logistic regression model.
Use test.csv (68 observations of 9 variables) to predict whether each patient has diabetes and compare the predictions with the recorded outcome.
Data Location
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.
Data Description
Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.
Attributes:
Patient ID: serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
(Assuming 0 means the patient “Does not Have Diabetes” and 1 means the patient “Has Diabetes”)
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(rJava)
library(JavaGD)
library(iplots)
## Note: On Mac OS X we strongly recommend using iplots from within JGR.
## Proceed at your own risk as iplots cannot resolve potential ev.loop deadlocks.
## 'Yes' is assumed for all dialogs as they cannot be shown without a deadlock,
## also ievent.wait() is disabled.
## More recent OS X version do not allow signle-threaded GUIs and will fail.
library(Deducer)
## Loading required package: JGR
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
library(lattice)
library(caret)
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
Functions
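The report calls detect_na() and detect_outliers() later on but does not show their definitions. A minimal sketch of what such helpers could look like follows. Both bodies are assumptions, and in particular the IQR rule and its multiplier k are illustrative choices, not necessarily the rule the report used.

```r
# Hypothetical helper: count the NA values in one column.
detect_na <- function(x) {
  sum(is.na(x))
}

# Hypothetical helper: return the values lying more than k * IQR
# outside the quartiles. k = 1.5 is the conventional Tukey fence and
# is an assumption here.
detect_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  fence <- k * (q[[2]] - q[[1]])
  x[!is.na(x) & (x < q[[1]] - fence | x > q[[2]] + fence)]
}

# Small usage example on made-up vectors.
detect_na(c(1, NA, 3))
detect_outliers(c(1, 2, 3, 4, 100))
```

Returning the outlying values themselves (rather than their positions) matches the shape of the output printed in the Outliers Data section.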
Dataset
dfrModel <- read.csv("./DATA/train.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Observation
All nine columns are numeric; there are no character or alphanumeric columns.
Logistic regression needs numeric (or factor) predictors, so no column has to be converted.
No column is dropped, since every column is either a diagnostic measurement or the outcome.
Missing Data
Count the missing (NA) values in each column.
By default glm() drops rows that contain NA values, so this check comes before any fitting.
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation
There is no missing data in any of the columns.
Impute Data
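No imputation is needed for NA values, as there are none.
However, Glucose, BloodPressure, SkinThickness, Insulin and BMI all contain 0s, and a 0 is not a physiologically plausible reading for these measurements, so these 0s most likely stand for values that were not recorded.
This model keeps these 0s as they are.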
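To see how many 0s are involved, they can be counted per column and, if desired, recoded as NA. The sketch below uses only the first six training rows printed earlier; the names dfrHead, zeroAsMissing and dfrClean are hypothetical, and the choice of columns where 0 is treated as missing is an assumption based on what the variables mean.

```r
# First six training rows as printed by head(dfrModel) in this report.
dfrHead <- data.frame(
  Pregnancies   = c(6, 1, 8, 1, 0, 5),
  Glucose       = c(148, 85, 183, 89, 137, 116),
  BloodPressure = c(72, 66, 64, 66, 40, 74),
  SkinThickness = c(35, 29, 0, 23, 35, 0),
  Insulin       = c(0, 0, 0, 94, 168, 0),
  BMI           = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6)
)

# Columns where a 0 cannot be a real measurement (an assumption;
# Pregnancies is excluded because 0 is a valid count there).
zeroAsMissing <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")

# Count the 0s per column.
zeroCount <- colSums(dfrHead[zeroAsMissing] == 0)
zeroCount

# Optional: recode those 0s as NA so later steps can treat them as missing.
dfrClean <- dfrHead
dfrClean[zeroAsMissing] <- lapply(dfrClean[zeroAsMissing],
                                  function(x) replace(x, x == 0, NA))
```

On the full training data the same two steps would apply to dfrModel in place of dfrHead.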
Outliers Data
Detecting outliers in the columns of the data.
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 122 0 0 0 0 0 0 0 0 0 0
## [18] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
##
## $BMI
## [1] 0.0 0.0 0.0 0.0 0.0 67.1 0.0 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
##
## $Age
## [1] 81
##
## $Outcome
## integer(0)
Outliers Graph
Plotting outliers on the graph.
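# Draw one box plot for the column named colName.
plotgraph <- function(colName) {
  ggplot(dfrModel, aes(x="", y=.data[[colName]])) +
    geom_boxplot(fill="lightblue", color="blue") +
    labs(title=paste("Outliers for", colName), x="", y=colName)
}
# simplify=FALSE keeps a list named by column, one plot per column
sapply(names(dfrModel), plotgraph, simplify=FALSE)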
## $Pregnancies
##
## $Glucose
##
## $BloodPressure
##
## $SkinThickness
##
## $Insulin
##
## $BMI
##
## $DiabetesPedigreeFunction
##
## $Age
##
## $Outcome
Observation
With respect to this model:
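- Outliers are present in BloodPressure, Insulin, BMI, DiabetesPedigreeFunction and Age.
- Most of the BloodPressure and BMI outliers are 0s, which are likely unrecorded values rather than true extremes.
- The outlier count is low.
- For this model we will keep the outliers.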
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel)){
cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.22774403 0.45928020 0.06019258
## SkinThickness Insulin BMI
## 0.08740524 0.14592233 0.30659734
## DiabetesPedigreeFunction Age Outcome
## 0.17053194 0.22699018 1.00000000
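The loop above can also be written without growing a vector inside a for loop. The sketch below shows two equivalent base R forms on a few columns of the first six training rows printed earlier; dfrHead, corVec and corMat are hypothetical names used only for this illustration.

```r
# A few columns of the first six training rows printed by head(dfrModel).
dfrHead <- data.frame(
  Glucose = c(148, 85, 183, 89, 137, 116),
  BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
  Age     = c(50, 31, 32, 21, 33, 30),
  Outcome = c(1, 0, 1, 0, 1, 0)
)

# Option 1: correlate every column with Outcome, one call per column.
corVec <- sapply(dfrHead, cor, y = dfrHead$Outcome)

# Option 2: compute the full correlation matrix and keep the Outcome column.
corMat <- cor(dfrHead)[, "Outcome"]

round(corVec, 3)
```

Both forms return a named numeric vector, so the separate names() assignment in the loop version is not needed.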
Data For Visualization
dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 1 Pregnancies 6
## 2 0 Pregnancies 1
## 3 1 Pregnancies 8
## 4 0 Pregnancies 1
## 5 1 Pregnancies 0
## 6 0 Pregnancies 5
Data Visualization
ggplot(dfrGraph) +
geom_jitter(aes(value,Outcome, colour=variable)) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Outcome With Other Features")
Observation
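Every feature is positively correlated with the outcome, but the strength varies: Glucose (0.46) and BMI (0.31) show the strongest relationship, while BloodPressure (0.06) and SkinThickness (0.09) show almost none.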
Summary
lapply(dfrModel, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.827 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 99.0 116.5 120.5 140.2 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 63.50 72.00 68.88 80.00 122.00
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 23.00 20.38 32.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 36.50 79.88 126.50 846.00
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 27.00 32.00 31.89 36.50 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2400 0.3755 0.4760 0.6370 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.12 40.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3443 1.0000 1.0000
Observation
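The mean of Outcome is 0.3443, so about 34% of the training patients have diabetes.
Glucose, BloodPressure, SkinThickness, Insulin and BMI all have a minimum of 0.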
Find Best Multi Logistic Model
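Choose the best logistic model using step(), which starts from the full model (Outcome ~ .) and removes predictors one at a time for as long as the AIC improves.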
stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=10000)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7459 -0.7335 -0.4097 0.7153 2.8945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.009215 0.713218 -11.230 < 2e-16 ***
## Pregnancies 0.157336 0.029270 5.375 7.64e-08 ***
## Glucose 0.033444 0.003517 9.509 < 2e-16 ***
## BloodPressure -0.012432 0.005246 -2.370 0.01781 *
## BMI 0.091142 0.014970 6.088 1.14e-09 ***
## DiabetesPedigreeFunction 0.885935 0.304143 2.913 0.00358 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 901.37 on 699 degrees of freedom
## Residual deviance: 659.26 on 694 degrees of freedom
## AIC: 671.26
##
## Number of Fisher Scoring iterations: 5
Observation
The best results are given by Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction.
The residual deviance (659.26) is 242.11 lower than the null deviance (901.37), a reduction of about 27% for 5 degrees of freedom.
A drop this large relative to the degrees of freedom means the predictors improve substantially on the intercept-only model.
Every coefficient has a p-value below 0.05 (BloodPressure is the weakest at 0.018), so all five predictors are kept.
For null deviance:
Degrees of Freedom = 700-1
= 699
(700 being the number of rows in the data)
For residual deviance:
Degrees of Freedom = 699-5
= 694
(5 being the number of predictors retained by step(), not counting the intercept)
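The deviance and degrees-of-freedom arithmetic can be checked directly from the figures in the summary output. The sketch below also runs a chi-squared (likelihood-ratio) test of the fitted model against the intercept-only model; the variable names are hypothetical and the four input numbers are copied from the output above.

```r
# Figures copied from summary(stpModel).
nullDev  <- 901.37
residDev <- 659.26
nullDf   <- 699
residDf  <- 694

# Drop in deviance and the degrees of freedom it was bought with.
devDrop <- nullDev - residDev
dfDrop  <- nullDf - residDf

# Share of the null deviance explained. For an ungrouped 0/1 response
# this ratio coincides with McFadden's pseudo R-squared.
propDrop <- devDrop / nullDev

# Likelihood-ratio test against the intercept-only model.
pValue <- pchisq(devDrop, df = dfDrop, lower.tail = FALSE)

c(devDrop = devDrop, dfDrop = dfDrop, propDrop = round(propDrop, 4))
pValue
```

With a fitted model object the same four inputs are available as null.deviance, deviance, df.null and df.residual.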
Make Final Multi Logistic Model
# make model
mgmModel <- glm(data=dfrModel, formula=Outcome ~ Pregnancies + Glucose + BloodPressure +
BMI + DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial(link = "logit"),
## data = dfrModel)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7459 -0.7335 -0.4097 0.7153 2.8945
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.009215 0.713218 -11.230 < 2e-16 ***
## Pregnancies 0.157336 0.029270 5.375 7.64e-08 ***
## Glucose 0.033444 0.003517 9.509 < 2e-16 ***
## BloodPressure -0.012432 0.005246 -2.370 0.01781 *
## BMI 0.091142 0.014970 6.088 1.14e-09 ***
## DiabetesPedigreeFunction 0.885935 0.304143 2.913 0.00358 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 901.37 on 699 degrees of freedom
## Residual deviance: 659.26 on 694 degrees of freedom
## AIC: 671.26
##
## Number of Fisher Scoring iterations: 5
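Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below does this with the estimates copied from the summary output; the names coefs and oddsRatio are hypothetical.

```r
# Estimates copied from summary(mgmModel), intercept left out.
coefs <- c(
  Pregnancies              = 0.157336,
  Glucose                  = 0.033444,
  BloodPressure            = -0.012432,
  BMI                      = 0.091142,
  DiabetesPedigreeFunction = 0.885935
)

# exp() turns a change in log-odds into a multiplicative change in odds
# for a one-unit increase in the predictor, other predictors held fixed.
oddsRatio <- exp(coefs)
round(oddsRatio, 3)
```

A value above 1 means the odds of diabetes rise with the predictor, and a value below 1 (BloodPressure here) means they fall. With the fitted object the equivalent call would be exp(coef(mgmModel)).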
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5,1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 404 103
## 1 55 138
##
## Accuracy : 0.7743
## 95% CI : (0.7415, 0.8048)
## No Information Rate : 0.6557
## P-Value [Acc > NIR] : 5.601e-12
##
## Kappa : 0.4753
## Mcnemar's Test P-Value : 0.0001847
##
## Sensitivity : 0.8802
## Specificity : 0.5726
## Pos Pred Value : 0.7968
## Neg Pred Value : 0.7150
## Prevalence : 0.6557
## Detection Rate : 0.5771
## Detection Prevalence : 0.7243
## Balanced Accuracy : 0.7264
##
## 'Positive' Class : 0
##
Observation
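The cut-off is 0.5: a predicted probability above 0.5 is classified as diabetes, and anything at or below 0.5 as no diabetes.
The model has an accuracy of 0.7743 on the training data.
caret treats 0 (no diabetes) as the 'Positive' class here, so the sensitivity of 0.8802 is the share of non-diabetic patients classified correctly, and the specificity of 0.5726 is the share of diabetic patients classified correctly.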
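Because the printed statistics use 0 as the 'Positive' class, it can help to recompute them with diabetes (1) as the positive class. The sketch below rebuilds the training confusion matrix from the counts printed above and derives the metrics by hand; the variable names are hypothetical.

```r
# Training confusion matrix, counts copied from the output above.
# Rows are predictions, columns are actual outcomes.
cnf <- matrix(c(404, 55, 103, 138), nrow = 2,
              dimnames = list(prd = c("0", "1"), act = c("0", "1")))

# Treat 1 (has diabetes) as the positive class.
tp <- cnf["1", "1"]  # predicted diabetes, has diabetes
fn <- cnf["0", "1"]  # predicted no diabetes, has diabetes
fp <- cnf["1", "0"]  # predicted diabetes, no diabetes
tn <- cnf["0", "0"]  # predicted no diabetes, no diabetes

accuracy    <- (tp + tn) / sum(cnf)
sensitivity <- tp / (tp + fn)  # share of diabetic patients detected
specificity <- tn / (tn + fp)  # share of non-diabetic patients cleared
precision   <- tp / (tp + fp)  # share of diabetes predictions that are right

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

Accuracy is unchanged, while sensitivity and specificity swap relative to the printed output. With caret, passing positive = "1" to confusionMatrix() should give the same view directly.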
Regression Data
dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome PrdVal POutcome
## 1 0.627 50 1 0.64732497 1
## 2 0.351 31 0 0.04334372 0
## 3 0.672 32 1 0.78466856 1
## 4 0.167 21 0 0.04802554 0
## 5 2.288 33 1 0.88397222 1
## 6 0.201 30 0 0.14783877 0
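The PrdVal column can be reproduced by hand from the coefficients, which makes the link between the model and the predicted probability explicit. The sketch below does this for the first training row; the coefficients are copied from the model summary (rounded to six decimals, so the result only matches to a few digits), and the variable names are hypothetical.

```r
# Coefficients copied from summary(mgmModel).
intercept <- -8.009215
coefs <- c(
  Pregnancies              = 0.157336,
  Glucose                  = 0.033444,
  BloodPressure            = -0.012432,
  BMI                      = 0.091142,
  DiabetesPedigreeFunction = 0.885935
)

# Predictor values of the first training row, in the same order.
row1 <- c(
  Pregnancies              = 6,
  Glucose                  = 148,
  BloodPressure            = 72,
  BMI                      = 33.6,
  DiabetesPedigreeFunction = 0.627
)

# Linear predictor (log-odds), then the inverse logit to get a probability.
logOdds <- intercept + sum(coefs * row1)
prob    <- plogis(logOdds)  # same as 1 / (1 + exp(-logOdds))

# Apply the 0.5 cut-off used in this report.
predicted <- ifelse(prob > 0.5, 1, 0)
c(logOdds = logOdds, prob = prob, predicted = predicted)
```

The probability agrees with the PrdVal of about 0.647 shown for row 1, and the 0.5 cut-off gives POutcome = 1.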
Regression Visualization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visualization
rocplot(mgmModel)
Observation
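AUC is 0.8378.
Accuracy as per confusion matrix is 0.7743.
The two numbers measure different things: AUC summarises how well the model ranks patients across all cut-offs, while accuracy depends on the single 0.5 cut-off.
As a rough check in this report, a gap of more than 0.10 between them is treated as a cause for concern.
Here the gap is 0.0635, which is within that limit.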
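AUC can also be computed without a plotting package, using its interpretation as the probability that a randomly chosen diabetic patient gets a higher predicted probability than a randomly chosen non-diabetic one. The sketch below implements the rank-based (Mann-Whitney) formula; auc_rank is a hypothetical helper, and the six values are the rounded PrdVal and Outcome figures of the first six training rows, so the result is not the report's AUC.

```r
# Hypothetical helper: rank-based AUC for a 0/1 outcome.
auc_rank <- function(prob, actual) {
  r  <- rank(prob)            # ties receive their average rank
  n1 <- sum(actual == 1)      # number of positives
  n0 <- sum(actual == 0)      # number of negatives
  # Mann-Whitney U for the positives, scaled to the 0..1 range.
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# First six training rows: predicted probability and actual outcome.
prob   <- c(0.647, 0.043, 0.785, 0.048, 0.884, 0.148)
actual <- c(1, 0, 1, 0, 1, 0)

aucHead <- auc_rank(prob, actual)
aucHead
```

On these six rows every diabetic patient is ranked above every non-diabetic one, so the helper returns 1. Applied to prdVal and dfrModel$Outcome it should come close to the value read from the ROC plot.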
Test Data
dfrTests <- read.csv("./Data/test.csv", header=T, stringsAsFactors=F)
head(dfrTests)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## DiabetesPedigreeFunction Age Outcome
## 1 0.483 26 0
## 2 0.565 49 1
## 3 0.905 52 1
## 4 0.304 41 0
## 5 0.118 27 0
## 6 0.177 28 0
Observation
The test data was loaded successfully; it has the same nine columns as the training data.
Predict
A predicted probability greater than 0.5 means the person is classified as having diabetes.
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
dfrTests <- mutate(dfrTests, Result=resVal, PredictedOutcome=prdSur)
dfrTests
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## 7 10 115 0 0 0 0.0
## 8 2 127 46 21 335 34.4
## 9 9 164 78 0 0 32.8
## 10 2 93 64 32 160 38.0
## 11 3 158 64 13 387 31.2
## 12 5 126 78 27 22 29.6
## 13 10 129 62 36 0 41.2
## 14 0 134 58 20 291 26.4
## 15 3 102 74 0 0 29.5
## 16 7 187 50 33 392 33.9
## 17 3 173 78 39 185 33.8
## 18 10 94 72 18 0 23.1
## 19 1 108 60 46 178 35.5
## 20 5 97 76 27 0 35.6
## 21 4 83 86 19 0 29.3
## 22 1 114 66 36 200 38.1
## 23 1 149 68 29 127 29.3
## 24 5 117 86 30 105 39.1
## 25 1 111 94 0 0 32.8
## 26 4 112 78 40 0 39.4
## 27 1 116 78 29 180 36.1
## 28 0 141 84 26 0 32.4
## 29 2 175 88 0 0 22.9
## 30 2 92 52 0 0 30.1
## 31 3 130 78 23 79 28.4
## 32 8 120 86 0 0 28.4
## 33 2 174 88 37 120 44.5
## 34 2 106 56 27 165 29.0
## 35 2 105 75 0 0 23.3
## 36 4 95 60 32 0 35.4
## 37 0 126 86 27 120 27.4
## 38 8 65 72 23 0 32.0
## 39 2 99 60 17 160 36.6
## 40 1 102 74 0 0 39.5
## 41 11 120 80 37 150 42.3
## 42 3 102 44 20 94 30.8
## 43 1 109 58 18 116 28.5
## 44 9 140 94 0 0 32.7
## 45 13 153 88 37 140 40.6
## 46 12 100 84 33 105 30.0
## 47 1 147 94 41 0 49.3
## 48 1 81 74 41 57 46.3
## 49 3 187 70 22 200 36.4
## 50 6 162 62 0 0 24.3
## 51 4 136 70 0 0 31.2
## 52 1 121 78 39 74 39.0
## 53 3 108 62 24 0 26.0
## 54 0 181 88 44 510 43.3
## 55 8 154 78 32 0 32.4
## 56 1 128 88 39 110 36.5
## 57 7 137 90 41 0 32.0
## 58 0 123 72 0 0 36.3
## 59 1 106 76 0 0 37.5
## 60 6 190 92 0 0 35.5
## 61 2 88 58 26 16 28.4
## 62 9 170 74 31 0 44.0
## 63 9 89 62 0 0 22.5
## 64 10 101 76 48 180 32.9
## 65 2 122 70 27 0 36.8
## 66 5 121 72 23 112 26.2
## 67 1 126 60 0 0 30.1
## 68 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Result PredictedOutcome
## 1 0.483 26 0 0.29749244 0
## 2 0.565 49 1 0.30189651 0
## 3 0.905 52 1 0.66026773 1
## 4 0.304 41 0 0.59821566 1
## 5 0.118 27 0 0.12424247 0
## 6 0.177 28 0 0.16798839 0
## 7 0.261 30 1 0.08638919 0
## 8 0.176 22 0 0.32567987 0
## 9 0.148 45 1 0.73934182 1
## 10 0.674 23 1 0.21092502 0
## 11 0.295 24 0 0.51407613 1
## 12 0.439 40 0 0.29079639 0
## 13 0.441 38 1 0.77789023 1
## 14 0.352 21 0 0.17788535 0
## 15 0.121 32 0 0.09535232 0
## 16 0.826 34 1 0.92731122 1
## 17 0.970 31 1 0.77187221 1
## 18 0.595 56 0 0.17441146 0
## 19 0.415 24 0 0.20058927 0
## 20 0.378 52 1 0.20689720 0
## 21 0.317 34 0 0.06169707 0
## 22 0.289 21 0 0.24393972 0
## 23 0.349 42 1 0.32422871 0
## 24 0.251 42 0 0.35601984 0
## 25 0.265 45 0 0.11066833 0
## 26 0.236 38 0 0.30922789 0
## 27 0.496 25 0 0.22927908 0
## 28 0.433 22 0 0.26869678 0
## 29 0.326 22 0 0.36358534 0
## 30 0.141 22 0 0.08349014 0
## 31 0.323 34 1 0.21677629 0
## 32 0.259 22 1 0.27121466 0
## 33 0.646 24 1 0.84008714 1
## 34 0.426 22 0 0.13882109 0
## 35 0.560 53 0 0.07617043 0
## 36 0.284 28 0 0.18685838 0
## 37 0.515 21 0 0.12888735 0
## 38 0.600 42 0 0.11674276 0
## 39 0.453 21 0 0.19903185 0
## 40 0.293 42 1 0.18229997 0
## 41 0.785 48 1 0.78431616 1
## 42 0.400 26 0 0.18073784 0
## 43 0.219 22 0 0.10565220 0
## 44 0.734 45 1 0.63437356 1
## 45 1.174 39 0 0.94265225 1
## 46 0.488 46 0 0.34198956 0
## 47 0.358 27 1 0.66958025 1
## 48 1.096 32 0 0.29483781 0
## 49 0.408 36 1 0.82137020 1
## 50 0.178 50 1 0.48861097 0
## 51 1.182 22 1 0.54713868 1
## 52 0.261 28 0 0.27109988 0
## 53 0.223 25 0 0.10633357 0
## 54 0.222 26 1 0.74900417 1
## 55 0.443 45 1 0.68474539 1
## 56 1.057 37 1 0.40085432 0
## 57 0.391 39 0 0.45464341 0
## 58 0.258 52 1 0.22206950 0
## 59 0.197 26 0 0.15986116 0
## 60 0.278 66 1 0.83579980 1
## 61 0.766 22 0 0.09926279 0
## 62 0.403 43 1 0.92687453 1
## 63 0.142 33 0 0.09877301 0
## 64 0.171 63 0 0.29885728 0
## 65 0.340 27 0 0.30378496 0
## 66 0.245 30 0 0.18756614 0
## 67 0.349 47 1 0.20895172 0
## 68 0.315 23 0 0.07162370 0
Confusion Matrix of Test Data
resVal <- predict(mgmModel,dfrTests, type='response')
prdSur <- ifelse(resVal > 0.5,1, 0)
cnfmtrx <- table(prd=prdSur, act=dfrTests$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 38 12
## 1 3 15
##
## Accuracy : 0.7794
## 95% CI : (0.6624, 0.871)
## No Information Rate : 0.6029
## P-Value [Acc > NIR] : 0.001609
##
## Kappa : 0.5115
## Mcnemar's Test P-Value : 0.038867
##
## Sensitivity : 0.9268
## Specificity : 0.5556
## Pos Pred Value : 0.7600
## Neg Pred Value : 0.8333
## Prevalence : 0.6029
## Detection Rate : 0.5588
## Detection Prevalence : 0.7353
## Balanced Accuracy : 0.7412
##
## 'Positive' Class : 0
##
Observation
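The cut-off is 0.5: a predicted probability above 0.5 is classified as diabetes, and anything at or below 0.5 as no diabetes.
The model has an accuracy of 0.7794 on the test data, close to the 0.7743 seen on the training data.
With only 68 test observations the 95% confidence interval is wide (0.6624 to 0.871).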
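The accuracy, its confidence interval and the comparison with the no-information rate can be reproduced from the four counts in the test confusion matrix. The sketch below assumes, as the printed figures suggest, that the interval is an exact binomial one and that the p-value comes from a one-sided binomial test; the variable names are hypothetical.

```r
# Counts copied from the test confusion matrix above.
correct <- 38 + 15            # predictions on the diagonal
total   <- 38 + 12 + 3 + 15   # all 68 test observations

accuracy <- correct / total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

# No-information rate: share of the larger actual class (outcome 0).
nir <- (38 + 3) / total

# One-sided test of whether accuracy exceeds the no-information rate.
pNir <- binom.test(correct, total, p = nir, alternative = "greater")$p.value

round(c(accuracy = accuracy, lower = ci[1], upper = ci[2], nir = nir), 4)
pNir
```

The interval spans roughly 0.66 to 0.87, which shows how much uncertainty remains when accuracy is estimated from 68 observations.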
Creating a table depicting predicted and actual outcome
#Changing Outcome column to a factor and giving levels for the same.
dfrTests$Outcome <- as.factor(dfrTests$Outcome)
levels(dfrTests$Outcome) <- c("Does not Have Diabetes", "Have Diabetes")
#Changing Predicted Outcome column to a factor and giving levels for the same.
dfrTests$PredictedOutcome <- as.factor(dfrTests$PredictedOutcome)
levels(dfrTests$PredictedOutcome) <- c("Does not Have Diabetes", "Have Diabetes")
dfrTests
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 2 122 76 27 200 35.9
## 2 6 125 78 31 0 27.6
## 3 1 168 88 29 0 35.0
## 4 2 129 0 0 0 38.5
## 5 4 110 76 20 100 28.4
## 6 6 80 80 36 0 39.8
## 7 10 115 0 0 0 0.0
## 8 2 127 46 21 335 34.4
## 9 9 164 78 0 0 32.8
## 10 2 93 64 32 160 38.0
## 11 3 158 64 13 387 31.2
## 12 5 126 78 27 22 29.6
## 13 10 129 62 36 0 41.2
## 14 0 134 58 20 291 26.4
## 15 3 102 74 0 0 29.5
## 16 7 187 50 33 392 33.9
## 17 3 173 78 39 185 33.8
## 18 10 94 72 18 0 23.1
## 19 1 108 60 46 178 35.5
## 20 5 97 76 27 0 35.6
## 21 4 83 86 19 0 29.3
## 22 1 114 66 36 200 38.1
## 23 1 149 68 29 127 29.3
## 24 5 117 86 30 105 39.1
## 25 1 111 94 0 0 32.8
## 26 4 112 78 40 0 39.4
## 27 1 116 78 29 180 36.1
## 28 0 141 84 26 0 32.4
## 29 2 175 88 0 0 22.9
## 30 2 92 52 0 0 30.1
## 31 3 130 78 23 79 28.4
## 32 8 120 86 0 0 28.4
## 33 2 174 88 37 120 44.5
## 34 2 106 56 27 165 29.0
## 35 2 105 75 0 0 23.3
## 36 4 95 60 32 0 35.4
## 37 0 126 86 27 120 27.4
## 38 8 65 72 23 0 32.0
## 39 2 99 60 17 160 36.6
## 40 1 102 74 0 0 39.5
## 41 11 120 80 37 150 42.3
## 42 3 102 44 20 94 30.8
## 43 1 109 58 18 116 28.5
## 44 9 140 94 0 0 32.7
## 45 13 153 88 37 140 40.6
## 46 12 100 84 33 105 30.0
## 47 1 147 94 41 0 49.3
## 48 1 81 74 41 57 46.3
## 49 3 187 70 22 200 36.4
## 50 6 162 62 0 0 24.3
## 51 4 136 70 0 0 31.2
## 52 1 121 78 39 74 39.0
## 53 3 108 62 24 0 26.0
## 54 0 181 88 44 510 43.3
## 55 8 154 78 32 0 32.4
## 56 1 128 88 39 110 36.5
## 57 7 137 90 41 0 32.0
## 58 0 123 72 0 0 36.3
## 59 1 106 76 0 0 37.5
## 60 6 190 92 0 0 35.5
## 61 2 88 58 26 16 28.4
## 62 9 170 74 31 0 44.0
## 63 9 89 62 0 0 22.5
## 64 10 101 76 48 180 32.9
## 65 2 122 70 27 0 36.8
## 66 5 121 72 23 112 26.2
## 67 1 126 60 0 0 30.1
## 68 1 93 70 31 0 30.4
## DiabetesPedigreeFunction Age Outcome Result
## 1 0.483 26 Does not Have Diabetes 0.29749244
## 2 0.565 49 Have Diabetes 0.30189651
## 3 0.905 52 Have Diabetes 0.66026773
## 4 0.304 41 Does not Have Diabetes 0.59821566
## 5 0.118 27 Does not Have Diabetes 0.12424247
## 6 0.177 28 Does not Have Diabetes 0.16798839
## 7 0.261 30 Have Diabetes 0.08638919
## 8 0.176 22 Does not Have Diabetes 0.32567987
## 9 0.148 45 Have Diabetes 0.73934182
## 10 0.674 23 Have Diabetes 0.21092502
## 11 0.295 24 Does not Have Diabetes 0.51407613
## 12 0.439 40 Does not Have Diabetes 0.29079639
## 13 0.441 38 Have Diabetes 0.77789023
## 14 0.352 21 Does not Have Diabetes 0.17788535
## 15 0.121 32 Does not Have Diabetes 0.09535232
## 16 0.826 34 Have Diabetes 0.92731122
## 17 0.970 31 Have Diabetes 0.77187221
## 18 0.595 56 Does not Have Diabetes 0.17441146
## 19 0.415 24 Does not Have Diabetes 0.20058927
## 20 0.378 52 Have Diabetes 0.20689720
## 21 0.317 34 Does not Have Diabetes 0.06169707
## 22 0.289 21 Does not Have Diabetes 0.24393972
## 23 0.349 42 Have Diabetes 0.32422871
## 24 0.251 42 Does not Have Diabetes 0.35601984
## 25 0.265 45 Does not Have Diabetes 0.11066833
## 26 0.236 38 Does not Have Diabetes 0.30922789
## 27 0.496 25 Does not Have Diabetes 0.22927908
## 28 0.433 22 Does not Have Diabetes 0.26869678
## 29 0.326 22 Does not Have Diabetes 0.36358534
## 30 0.141 22 Does not Have Diabetes 0.08349014
## 31 0.323 34 Have Diabetes 0.21677629
## 32 0.259 22 Have Diabetes 0.27121466
## 33 0.646 24 Have Diabetes 0.84008714
## 34 0.426 22 Does not Have Diabetes 0.13882109
## 35 0.560 53 Does not Have Diabetes 0.07617043
## 36 0.284 28 Does not Have Diabetes 0.18685838
## 37 0.515 21 Does not Have Diabetes 0.12888735
## 38 0.600 42 Does not Have Diabetes 0.11674276
## 39 0.453 21 Does not Have Diabetes 0.19903185
## 40 0.293 42 Have Diabetes 0.18229997
## 41 0.785 48 Have Diabetes 0.78431616
## 42 0.400 26 Does not Have Diabetes 0.18073784
## 43 0.219 22 Does not Have Diabetes 0.10565220
## 44 0.734 45 Have Diabetes 0.63437356
## 45 1.174 39 Does not Have Diabetes 0.94265225
## 46 0.488 46 Does not Have Diabetes 0.34198956
## 47 0.358 27 Have Diabetes 0.66958025
## 48 1.096 32 Does not Have Diabetes 0.29483781
## 49 0.408 36 Have Diabetes 0.82137020
## 50 0.178 50 Have Diabetes 0.48861097
## 51 1.182 22 Have Diabetes 0.54713868
## 52 0.261 28 Does not Have Diabetes 0.27109988
## 53 0.223 25 Does not Have Diabetes 0.10633357
## 54 0.222 26 Have Diabetes 0.74900417
## 55 0.443 45 Have Diabetes 0.68474539
## 56 1.057 37 Have Diabetes 0.40085432
## 57 0.391 39 Does not Have Diabetes 0.45464341
## 58 0.258 52 Have Diabetes 0.22206950
## 59 0.197 26 Does not Have Diabetes 0.15986116
## 60 0.278 66 Have Diabetes 0.83579980
## 61 0.766 22 Does not Have Diabetes 0.09926279
## 62 0.403 43 Have Diabetes 0.92687453
## 63 0.142 33 Does not Have Diabetes 0.09877301
## 64 0.171 63 Does not Have Diabetes 0.29885728
## 65 0.340 27 Does not Have Diabetes 0.30378496
## 66 0.245 30 Does not Have Diabetes 0.18756614
## 67 0.349 47 Have Diabetes 0.20895172
## 68 0.315 23 Does not Have Diabetes 0.07162370
## PredictedOutcome
## 1 Does not Have Diabetes
## 2 Does not Have Diabetes
## 3 Have Diabetes
## 4 Have Diabetes
## 5 Does not Have Diabetes
## 6 Does not Have Diabetes
## 7 Does not Have Diabetes
## 8 Does not Have Diabetes
## 9 Have Diabetes
## 10 Does not Have Diabetes
## 11 Have Diabetes
## 12 Does not Have Diabetes
## 13 Have Diabetes
## 14 Does not Have Diabetes
## 15 Does not Have Diabetes
## 16 Have Diabetes
## 17 Have Diabetes
## 18 Does not Have Diabetes
## 19 Does not Have Diabetes
## 20 Does not Have Diabetes
## 21 Does not Have Diabetes
## 22 Does not Have Diabetes
## 23 Does not Have Diabetes
## 24 Does not Have Diabetes
## 25 Does not Have Diabetes
## 26 Does not Have Diabetes
## 27 Does not Have Diabetes
## 28 Does not Have Diabetes
## 29 Does not Have Diabetes
## 30 Does not Have Diabetes
## 31 Does not Have Diabetes
## 32 Does not Have Diabetes
## 33 Have Diabetes
## 34 Does not Have Diabetes
## 35 Does not Have Diabetes
## 36 Does not Have Diabetes
## 37 Does not Have Diabetes
## 38 Does not Have Diabetes
## 39 Does not Have Diabetes
## 40 Does not Have Diabetes
## 41 Have Diabetes
## 42 Does not Have Diabetes
## 43 Does not Have Diabetes
## 44 Have Diabetes
## 45 Have Diabetes
## 46 Does not Have Diabetes
## 47 Have Diabetes
## 48 Does not Have Diabetes
## 49 Have Diabetes
## 50 Does not Have Diabetes
## 51 Have Diabetes
## 52 Does not Have Diabetes
## 53 Does not Have Diabetes
## 54 Have Diabetes
## 55 Have Diabetes
## 56 Does not Have Diabetes
## 57 Does not Have Diabetes
## 58 Does not Have Diabetes
## 59 Does not Have Diabetes
## 60 Have Diabetes
## 61 Does not Have Diabetes
## 62 Have Diabetes
## 63 Does not Have Diabetes
## 64 Does not Have Diabetes
## 65 Does not Have Diabetes
## 66 Does not Have Diabetes
## 67 Does not Have Diabetes
## 68 Does not Have Diabetes
Summary
- The objective was to predict, from diagnostic measurements, whether a patient has diabetes.
- Assumption for Outcome: 0 means a patient “Does not Have Diabetes” and 1 means a patient “Has Diabetes”.
- The entire data was divided into train (700 observations) and test (68 observations) data.
With respect to Train Data:
- The data had no missing (NA) values.
- Outliers were present in the data. However, the outlier count was low, so the model was fitted with the outliers included.
- Positive correlations were observed between the outcome and all the other features.
- The residual deviance (659.26) was about 27% lower than the null deviance (901.37), and every retained coefficient had a p-value below 0.05.
- The confusion matrix showed an accuracy of 0.7743 at a cut-off of 0.5, i.e. a predicted probability of 0.5 or less means the person is classified as not having diabetes.
- AUC is 0.8378. The gap between AUC and accuracy is 0.0635, below the 0.10 limit used in this report.
With respect to Test Data:
- Test data was used to predict whether a patient has diabetes.
- A confusion matrix was created with the same 0.5 cut-off. The model showed an accuracy of 0.7794.
- Finally, a table showing the recorded Outcome and the Predicted Outcome was produced.
###################################### END OF REPORT ######################################################