Problem Definition
Here, we have a dataset, containing diagnostic measurements of more than 700 females.
Now, based on these data variables available, we need to predict wether the female is a Diabetic or not.

Variable Description
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Shows wether the person has Diabetes (1) or Not (0)

Setup

Loading Libraries

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(Deducer)
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().
library(caret)
## Loading required package: lattice
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Functions

Dataset

dfrModel <- read.csv("niddkd-diabetes.csv", header=T, stringsAsFactors=F)
head(dfrModel)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           4     146            85            27     100 28.9
## 2           2     100            66            20      90 32.9
## 3           5     139            64            35     140 28.6
## 4          13     126            90             0       0 43.4
## 5           4     129            86            20     270 35.1
## 6           1      79            75            30       0 32.0
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.189  27       0
## 2                    0.867  28       1
## 3                    0.411  26       0
## 4                    0.583  42       1
## 5                    0.231  23       0
## 6                    0.396  22       0

Observation
The dataset contains all numeric data.

Missing Data

#sum(is.na(dfrModel$Age))
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

Observation
No missing data is there in any column.

There are many 0 values in the data in the varoius caokumns.this 0 value is not good nd cannot be accepted beacuse a 0 in insulin level or 0 in blood pressure doesnt inicate that the person is helathy.It might be the case that the value is missing or is replaced wth 0

But, In this case, we choose to retain 0, seeing it as a value itself.

Detecting Outliers

#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
## 
## $Glucose
## integer(0)
## 
## $BloodPressure
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 
## $SkinThickness
## integer(0)
## 
## $Insulin
##  [1] 495 485 495 478 744 680 545 465 579 474 480 600 540 480 510
## 
## $BMI
##  [1]  0.0  0.0 67.1  0.0  0.0 59.4  0.0  0.0  0.0  0.0
## 
## $DiabetesPedigreeFunction
## [1] 2.329 2.137 1.731 1.600 2.420 1.699 1.698
## 
## $Age
## [1] 81
## 
## $Outcome
## integer(0)

Outliers Graph

plotgraph <- function(inp, na.rm=TRUE) {
outplot <- ggplot(dfrModel, aes(x="", y=inp)) +
            geom_boxplot(aes(fill=inp), color="blue") +
            labs(title="Diabetes Outcome Outliers")
outplot
}
lapply(dfrModel, FUN=plotgraph)
## $Pregnancies

## 
## $Glucose

## 
## $BloodPressure

## 
## $SkinThickness

## 
## $Insulin

## 
## $BMI

## 
## $DiabetesPedigreeFunction

## 
## $Age

## 
## $Outcome

Correlation

vctCorr = numeric(0)
for (i in names(dfrModel)){
    cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
    vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
##              Pregnancies                  Glucose            BloodPressure 
##               0.22217638               0.47010501               0.05862803 
##            SkinThickness                  Insulin                      BMI 
##               0.05770830               0.11357000               0.30207096 
## DiabetesPedigreeFunction                      Age                  Outcome 
##               0.17888267               0.22844423               1.00000000

_ Observation _
The correlation coefficient shows High correlation of probabilty of a person getting diabetes, only with Glucose levels, followed by her BMI.

dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
##   Outcome    variable value
## 1       0 Pregnancies     4
## 2       1 Pregnancies     2
## 3       0 Pregnancies     5
## 4       1 Pregnancies    13
## 5       0 Pregnancies     4
## 6       0 Pregnancies     1

Data Visualization

ggplot(dfrGraph) +
    geom_jitter(aes(value,Outcome, colour=variable)) + 
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of Outcome With Other Features")

Summary

lapply(dfrModel, FUN=summary)
## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.748   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    99.0   117.0   120.6   140.0   199.0 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   62.00   72.00   69.14   80.00  122.00 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   23.00   20.66   32.00   99.00 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   40.00   79.69  128.00  744.00 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   27.40   32.00   32.11   36.50   67.10 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2385  0.3700  0.4664  0.6250  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   32.93   40.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3405  1.0000  1.0000

Find Best Multi Logistic Model
Choose the best logistic model by using step().

stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=1000)
summary(stpModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     Insulin + BMI + DiabetesPedigreeFunction, family = binomial, 
##     data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7181  -0.7126  -0.4112   0.6955   3.0563  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.7295109  0.7642967 -11.422  < 2e-16 ***
## Pregnancies               0.1530382  0.0302963   5.051 4.39e-07 ***
## Glucose                   0.0386501  0.0038904   9.935  < 2e-16 ***
## BloodPressure            -0.0131962  0.0055018  -2.399 0.016461 *  
## Insulin                  -0.0019808  0.0008939  -2.216 0.026692 *  
## BMI                       0.0964194  0.0156434   6.164 7.11e-10 ***
## DiabetesPedigreeFunction  1.1729491  0.3265008   3.592 0.000328 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 896.62  on 698  degrees of freedom
## Residual deviance: 640.39  on 692  degrees of freedom
## AIC: 654.39
## 
## Number of Fisher Scoring iterations: 5

Observation
Logistic Model has been created

Make Final Multi Linear Model

# make model
mgmModel <- glm(data=dfrModel, formula=Outcome~Pregnancies+Glucose+BMI+DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
##     family = binomial(link = "logit"), data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8971  -0.7045  -0.4203   0.6884   2.9627  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.883748   0.717353 -12.384  < 2e-16 ***
## Pregnancies               0.144692   0.029087   4.975 6.54e-07 ***
## Glucose                   0.035274   0.003611   9.770  < 2e-16 ***
## BMI                       0.083309   0.014804   5.628 1.83e-08 ***
## DiabetesPedigreeFunction  1.064782   0.320967   3.317 0.000909 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 896.62  on 698  degrees of freedom
## Residual deviance: 651.55  on 694  degrees of freedom
## AIC: 661.55
## 
## Number of Fisher Scoring iterations: 5

Confusion Matrix

prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.6, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd   0   1
##   0 433 123
##   1  28 115
##                                           
##                Accuracy : 0.784           
##                  95% CI : (0.7516, 0.8139)
##     No Information Rate : 0.6595          
##     P-Value [Acc > NIR] : 3.836e-13       
##                                           
##                   Kappa : 0.4676          
##  Mcnemar's Test P-Value : 2.016e-14       
##                                           
##             Sensitivity : 0.9393          
##             Specificity : 0.4832          
##          Pos Pred Value : 0.7788          
##          Neg Pred Value : 0.8042          
##              Prevalence : 0.6595          
##          Detection Rate : 0.6195          
##    Detection Prevalence : 0.7954          
##       Balanced Accuracy : 0.7112          
##                                           
##        'Positive' Class : 0               
## 

Obersvation
The confusion Matrix shows an accuracy of 78.4 % of the predicted value as that of the actual ones.
Thus , this model is good to go with.

Regression Data

dfrPlot <- mutate(dfrModel, PrdVal=prdVal, PSurvived=prdBln)
head(dfrPlot)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           4     146            85            27     100 28.9
## 2           2     100            66            20      90 32.9
## 3           5     139            64            35     140 28.6
## 4          13     126            90             0       0 43.4
## 5           4     129            86            20     270 35.1
## 6           1      79            75            30       0 32.0
##   DiabetesPedigreeFunction Age Outcome     PrdVal PSurvived
## 1                    0.189  27       0 0.36677766         0
## 2                    0.867  28       1 0.19735484         0
## 3                    0.411  26       0 0.39247883         0
## 4                    0.583  42       1 0.84266368         1
## 5                    0.231  23       0 0.35789950         0
## 6                    0.396  22       0 0.05391868         0

Regression Visulaization

#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=Outcome))  + 
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
    labs(title="Binomial Regression Curve") +
    labs(x="") +
    labs(y="")

ROC Visulaization

#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)

Observtion
Here , the AUC value is 0.8433, which is good for the model to predict the results.

Test Data

dfrTest <- read.csv("niddkd-diabetestest.csv", header=T, stringsAsFactors=F)
head(dfrTest)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Observation
Similar Data as in the train dataset , here too, we choose to retain 0 in the values.

Predict

resVal <- predict(mgmModel, dfrTest, type="response")
prdSur <- ifelse(resVal > 0.6, "Diabetic", "Non- Diabetic")
dfrTest <- mutate(dfrTest, Result=resVal, PredictedOutcome=prdSur)
dfrTest 
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1            6     148            72            35       0 33.6
## 2            1      85            66            29       0 26.6
## 3            8     183            64             0       0 23.3
## 4            1      89            66            23      94 28.1
## 5            0     137            40            35     168 43.1
## 6            5     116            74             0       0 25.6
## 7            3      78            50            32      88 31.0
## 8           10     115             0             0       0 35.3
## 9            2     197            70            45     543 30.5
## 10           8     125            96             0       0  0.0
## 11           4     110            92             0       0 37.6
## 12          10     168            74             0       0 38.0
## 13          10     139            80             0       0 27.1
## 14           1     189            60            23     846 30.1
## 15           5     166            72            19     175 25.8
## 16           7     100             0             0       0 30.0
## 17           0     118            84            47     230 45.8
## 18           7     107            74             0       0 29.6
## 19           1     103            30            38      83 43.3
## 20           1     115            70            30      96 34.6
## 21           3     126            88            41     235 39.3
## 22           8      99            84             0       0 35.4
## 23           7     196            90             0       0 39.8
## 24           9     119            80            35       0 29.0
## 25          11     143            94            33     146 36.6
## 26          10     125            70            26     115 31.1
## 27           7     147            76             0       0 39.4
## 28           1      97            66            15     140 23.2
## 29          13     145            82            19     110 22.2
## 30           5     117            92             0       0 34.1
## 31           5     109            75            26       0 36.0
## 32           3     158            76            36     245 31.6
## 33           3      88            58            11      54 24.8
## 34           6      92            92             0       0 19.9
## 35          10     122            78            31       0 27.6
## 36           4     103            60            33     192 24.0
## 37          11     138            76             0       0 33.2
## 38           9     102            76            37       0 32.9
## 39           2      90            68            42       0 38.2
## 40           4     111            72            47     207 37.1
## 41           3     180            64            25      70 34.0
## 42           7     133            84             0       0 40.2
## 43           7     106            92            18       0 22.7
## 44           9     171           110            24     240 45.4
## 45           7     159            64             0       0 27.4
## 46           0     180            66            39       0 42.0
## 47           1     146            56             0       0 29.7
## 48           2      71            70            27       0 28.0
## 49           7     103            66            32       0 39.1
## 50           7     105             0             0       0  0.0
## 51           1     103            80            11      82 19.4
## 52           1     101            50            15      36 24.2
## 53           5      88            66            21      23 24.4
## 54           8     176            90            34     300 33.7
## 55           7     150            66            42     342 34.7
## 56           1      73            50            10       0 23.0
## 57           7     187            68            39     304 37.7
## 58           0     100            88            60     110 46.8
## 59           0     146            82             0       0 40.5
## 60           0     105            64            41     142 41.5
## 61           2      84             0             0       0  0.0
## 62           8     133            72             0       0 32.9
## 63           5      44            62             0       0 25.0
## 64           2     141            58            34     128 25.4
## 65           7     114            66             0       0 32.8
## 66           5      99            74            27       0 29.0
## 67           0     109            88            30       0 32.5
## 68           2     109            92             0       0 42.7
## 69           1      95            66            13      38 19.6
##    DiabetesPedigreeFunction Age Outcome      Result PredictedOutcome
## 1                     0.627  50       1 0.661904634         Diabetic
## 2                     0.351  31       0 0.041052328    Non- Diabetic
## 3                     0.672  32       1 0.799890729         Diabetic
## 4                     0.167  21       0 0.043904187    Non- Diabetic
## 5                     2.288  33       1 0.878217665         Diabetic
## 6                     0.201  30       0 0.151646742    Non- Diabetic
## 7                     0.248  26       1 0.054596722    Non- Diabetic
## 8                     0.134  29       0 0.426339825    Non- Diabetic
## 9                     0.158  53       1 0.743420504         Diabetic
## 10                    0.232  54       1 0.044366217    Non- Diabetic
## 11                    0.191  30       0 0.251802324    Non- Diabetic
## 12                    0.537  34       1 0.902628441         Diabetic
## 13                    1.441  57       0 0.778736625         Diabetic
## 14                    0.398  59       1 0.702485308         Diabetic
## 15                    0.587  51       1 0.615297992         Diabetic
## 16                    0.484  32       1 0.209348853    Non- Diabetic
## 17                    0.551  31       1 0.420884383    Non- Diabetic
## 18                    0.254  31       1 0.204214578    Non- Diabetic
## 19                    0.183  33       0 0.213546747    Non- Diabetic
## 20                    0.529  32       1 0.225000660    Non- Diabetic
## 21                    0.704  27       0 0.504632674    Non- Diabetic
## 22                    0.388  50       0 0.294882396    Non- Diabetic
## 23                    0.451  41       1 0.944734971         Diabetic
## 24                    0.263  29       1 0.334503807    Non- Diabetic
## 25                    0.254  51       1 0.744909960         Diabetic
## 26                    0.205  41       1 0.445627005    Non- Diabetic
## 27                    0.257  43       1 0.704835927         Diabetic
## 28                    0.487  22       0 0.053853118    Non- Diabetic
## 29                    0.245  57       0 0.555359459    Non- Diabetic
## 30                    0.337  38       0 0.302898448    Non- Diabetic
## 31                    0.546  60       0 0.324120061    Non- Diabetic
## 32                    0.851  28       1 0.659788667         Diabetic
## 33                    0.267  22       0 0.047644141    Non- Diabetic
## 34                    0.188  28       0 0.051547100    Non- Diabetic
## 35                    0.512  45       0 0.428280225    Non- Diabetic
## 36                    0.966  33       0 0.161957162    Non- Diabetic
## 37                    0.420  35       0 0.687569115         Diabetic
## 38                    0.665  46       1 0.369444549    Non- Diabetic
## 39                    0.503  27       1 0.154243926    Non- Diabetic
## 40                    1.390  56       1 0.545193892    Non- Diabetic
## 41                    0.271  26       0 0.735119516         Diabetic
## 42                    0.696  37       0 0.713139325         Diabetic
## 43                    0.235  48       0 0.120204555    Non- Diabetic
## 44                    0.721  54       1 0.952588251         Diabetic
## 45                    0.294  40       0 0.582590200    Non- Diabetic
## 46                    1.893  25       1 0.951672685         Diabetic
## 47                    0.564  29       0 0.374208877    Non- Diabetic
## 48                    0.586  22       0 0.041755159    Non- Diabetic
## 49                    0.344  31       1 0.351149366    Non- Diabetic
## 50                    0.305  24       0 0.020993095    Non- Diabetic
## 51                    0.491  22       0 0.048948177    Non- Diabetic
## 52                    0.526  26       0 0.069125223    Non- Diabetic
## 53                    0.342  30       0 0.065420603    Non- Diabetic
## 54                    0.467  58       1 0.856537389         Diabetic
## 55                    0.718  42       0 0.745654588         Diabetic
## 56                    0.248  21       0 0.018273432    Non- Diabetic
## 57                    0.254  41       1 0.894408532         Diabetic
## 58                    0.962  31       0 0.393375186    Non- Diabetic
## 59                    1.781  44       0 0.822981607         Diabetic
## 60                    0.173  22       0 0.176764340    Non- Diabetic
## 61                    0.304  21       0 0.004929229    Non- Diabetic
## 62                    0.270  39       1 0.498405208    Non- Diabetic
## 63                    0.587  36       0 0.019831160    Non- Diabetic
## 64                    0.699  24       0 0.318567227    Non- Diabetic
## 65                    0.258  42       1 0.301031467    Non- Diabetic
## 66                    0.203  32       0 0.115471980    Non- Diabetic
## 67                    0.855  38       1 0.194516372    Non- Diabetic
## 68                    0.845  54       0 0.427406777    Non- Diabetic
## 69                    0.334  25       0 0.032310796    Non- Diabetic

Observation

Thus, the test data has been tested for Outcome prediction, as above.

Test Data Confusion Matrix

resVal <- predict(mgmModel,dfrTest, type='response')
prdSur <- ifelse(resVal > 0.6,1, 0)
cnfmtrx <- table(prd=prdSur, act=dfrTest$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd  0  1
##   0 33 15
##   1  6 15
##                                           
##                Accuracy : 0.6957          
##                  95% CI : (0.5731, 0.8008)
##     No Information Rate : 0.5652          
##     P-Value [Acc > NIR] : 0.01827         
##                                           
##                   Kappa : 0.3586          
##  Mcnemar's Test P-Value : 0.08086         
##                                           
##             Sensitivity : 0.8462          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.6875          
##          Neg Pred Value : 0.7143          
##              Prevalence : 0.5652          
##          Detection Rate : 0.4783          
##    Detection Prevalence : 0.6957          
##       Balanced Accuracy : 0.6731          
##                                           
##        'Positive' Class : 0               
## 

Observation
There is an accuracy of more than 69 % of the actual outcome in the predicted outcome.