Objective The objective is to predict, based on diagnostic measurements, whether a patient has diabetes.

Dataset This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Attributes
Patient ID: Serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

Setup

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.3.3
library(gridExtra) 
## Warning: package 'gridExtra' was built under R version 3.3.3
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(Deducer)
## Warning: package 'Deducer' was built under R version 3.3.3
## Loading required package: JGR
## Warning: package 'JGR' was built under R version 3.3.3
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
## Warning: package 'iplots' was built under R version 3.3.3
## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
## Warning: package 'car' was built under R version 3.3.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
library(pscl)
## Warning: package 'pscl' was built under R version 3.3.3
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

Functions
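The helper functions detect_na() and detect_outliers() are called later in this report, but their definitions are not shown. The following is a minimal sketch of what they might look like, not the original code: both bodies and the parameter k are assumptions. The IQR multiplier used for the outputs in this report is unknown, so the default of 1.5 will not necessarily reproduce them.

```r
# Hypothetical helper: count the NA entries in a vector
detect_na <- function(inp) {
    sum(is.na(inp))
}

# Hypothetical helper: return the values lying more than k * IQR
# below the first quartile or above the third quartile.
# k = 1.5 is the conventional box-plot rule (an assumed default).
detect_outliers <- function(inp, k=1.5, na.rm=TRUE) {
    qnt <- quantile(inp, probs=c(0.25, 0.75), na.rm=na.rm)
    rng <- k * IQR(inp, na.rm=na.rm)
    out <- inp < (qnt[1] - rng) | inp > (qnt[2] + rng)
    inp[which(out)]   # which() drops NA comparisons
}

detect_na(c(1, NA, 3, NA))
detect_outliers(c(1, 2, 3, 4, 5, 100))
```

Because both functions take a single vector, they can be applied to every column with lapply(dfrModel, FUN=...), as done in the sections below.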

Dataset

dfrModel <- read.csv("./data/niddkd-diabetes train.csv", header=T, stringsAsFactors=F)
head(dfrModel)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Observation The data contains no non-numeric values, so no columns have to be dropped or converted to numeric.

Missing Data

#sum(is.na(dfrModel$Age))
lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

Observation No missing data or NA records found in the Train data

Impute Data

dfrModel$Age[is.na(dfrModel$Age)] <- round(mean(dfrModel$Age[!is.na(dfrModel$Age)]),digits=0)
dfrModel$Age <- as.integer(dfrModel$Age)
detect_na(dfrModel$Age)
## [1] 0
#head(dfrModel)

Observation Since there are no NAs in the train data, the mean imputation above replaces nothing, and detect_na() still returns 0 for Age.

Outliers Data

#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
## 
## $Glucose
## integer(0)
## 
## $BloodPressure
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 
## $SkinThickness
## integer(0)
## 
## $Insulin
##  [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
## 
## $BMI
##  [1]  0.0  0.0  0.0  0.0  0.0 67.1  0.0  0.0  0.0  0.0  0.0
## 
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
## 
## $Age
## [1] 81
## 
## $Outcome
## integer(0)

Outliers Graph

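plotgraph <- function(inp, na.rm=TRUE) {
    # Box plot of a single column; points beyond the whiskers are the outliers
    dfrBox <- data.frame(value=inp)
    outplot <- ggplot(dfrBox, aes(x="", y=value)) +
                geom_boxplot(fill="lightblue", color="blue", na.rm=na.rm) +
                labs(title="Outliers", x="")
    outplot
}
# one box plot per column of the training data
lapply(dfrModel, FUN=plotgraph)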
## [Box plots, one per column: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]

Observation The above graphs show the outliers present in each column of the dataset.

Correlation

vctCorr = numeric(0)
for (i in names(dfrModel)){
    cor.result <- cor(as.numeric(dfrModel$Outcome), as.numeric(dfrModel[,i]))
    vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel)
dfrCorr
##              Pregnancies                  Glucose            BloodPressure 
##               0.22802323               0.45976779               0.06135700 
##            SkinThickness                  Insulin                      BMI 
##               0.08566126               0.14608347               0.30940050 
## DiabetesPedigreeFunction                      Age                  Outcome 
##               0.17257062               0.22617053               1.00000000
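The loop above can also be written without an explicit accumulator. The sketch below is one possible alternative, shown on a toy data frame (dfrToy and vctCorrToy are hypothetical names, with values copied from head(dfrModel)); it relies on cor() returning the full correlation matrix when given a data frame of numeric columns.

```r
# Toy rows copied from the head(dfrModel) output in this report
dfrToy <- data.frame(
    Glucose = c(148, 85, 183, 89, 137, 116),
    BMI     = c(33.6, 26.6, 23.3, 28.1, 43.1, 25.6),
    Outcome = c(1, 0, 1, 0, 1, 0)
)

# cor() on a data frame gives the correlation matrix; the "Outcome"
# column holds each variable's correlation with Outcome
vctCorrToy <- cor(dfrToy)[, "Outcome"]
vctCorrToy
```

The result is a named vector in the same shape as dfrCorr, including the trivial correlation of Outcome with itself.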

Data For Visualization

dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
##   Outcome    variable value
## 1       1 Pregnancies     6
## 2       0 Pregnancies     1
## 3       1 Pregnancies     8
## 4       0 Pregnancies     1
## 5       1 Pregnancies     0
## 6       0 Pregnancies     5

Data Visualization

ggplot(dfrGraph) +
    geom_jitter(aes(value,Outcome, colour=variable)) + 
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of Outcome with other features")

Observation The above graphs show how each of the other features in the dataset relates to the outcome.

Summary

lapply(dfrModel, FUN=summary)
## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.827   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    99.0   116.0   120.5   140.8   199.0 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   62.50   72.00   68.85   80.00  122.00 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   23.00   20.43   32.00   99.00 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   36.50   79.89  126.00  846.00 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   27.00   32.00   31.87   36.50   67.10 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2400  0.3745  0.4752  0.6355  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.14   40.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3453  1.0000  1.0000

Find Best Multi Logistic Model
Choose the best logistic model by using step().

stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=100)
summary(stpModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7673  -0.7310  -0.4075   0.7171   2.8851  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.061128   0.717120 -11.241  < 2e-16 ***
## Pregnancies               0.157717   0.029307   5.382 7.38e-08 ***
## Glucose                   0.033371   0.003518   9.487  < 2e-16 ***
## BloodPressure            -0.012371   0.005259  -2.352  0.01866 *  
## BMI                       0.092684   0.015071   6.150 7.76e-10 ***
## DiabetesPedigreeFunction  0.913510   0.305415   2.991  0.00278 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 899.68  on 697  degrees of freedom
## Residual deviance: 656.31  on 692  degrees of freedom
## AIC: 668.31
## 
## Number of Fisher Scoring iterations: 5

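Observation step() retains five predictors: Pregnancies, Glucose, BloodPressure, BMI and DiabetesPedigreeFunction (AIC 668.31). The final model uses only the three most significant of these: Outcome~Pregnancies+Glucose+BMI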

Make Final Multi Logistic Model

# make model
mgmModel <- glm(data=dfrModel, formula=Outcome ~Pregnancies+Glucose+BMI, family=binomial(link="logit"))
# print summary
summary(mgmModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI, family = binomial(link = "logit"), 
##     data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1712  -0.7257  -0.4233   0.7626   2.8089  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.21936    0.67333 -12.207  < 2e-16 ***
## Pregnancies  0.13938    0.02812   4.958 7.13e-07 ***
## Glucose      0.03291    0.00343   9.595  < 2e-16 ***
## BMI          0.08866    0.01458   6.081 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 899.68  on 697  degrees of freedom
## Residual deviance: 671.18  on 694  degrees of freedom
## AIC: 679.18
## 
## Number of Fisher Scoring iterations: 5
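The three-predictor model is nested inside the five-predictor model chosen by step(), so the two can be compared with a likelihood-ratio test on their residual deviances. The sketch below does this by hand from the numbers printed in the two summaries above; the variable names are hypothetical, and treating the deviance difference as chi-squared is the usual large-sample approximation.

```r
# Values copied from the two summary() outputs in this report
devStep  <- 656.31; dfStep  <- 692   # five-predictor model chosen by step()
devFinal <- 671.18; dfFinal <- 694   # three-predictor final model

lrStat <- devFinal - devStep         # increase in deviance from dropping two terms
lrDf   <- dfFinal - dfStep           # number of dropped terms
pValue <- pchisq(lrStat, df=lrDf, lower.tail=FALSE)

c(statistic=lrStat, df=lrDf, p.value=pValue)
```

With the fitted objects at hand, anova(mgmModel, stpModel, test="Chisq") should give the same comparison. A small p-value here would suggest that the two dropped predictors do carry some information, which is the price paid for the simpler model.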

Confusion Matrix

prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd   0   1
##   0 400 105
##   1  57 136
##                                           
##                Accuracy : 0.7679          
##                  95% CI : (0.7348, 0.7988)
##     No Information Rate : 0.6547          
##     P-Value [Acc > NIR] : 5.547e-11       
##                                           
##                   Kappa : 0.4613          
##  Mcnemar's Test P-Value : 0.0002219       
##                                           
##             Sensitivity : 0.8753          
##             Specificity : 0.5643          
##          Pos Pred Value : 0.7921          
##          Neg Pred Value : 0.7047          
##              Prevalence : 0.6547          
##          Detection Rate : 0.5731          
##    Detection Prevalence : 0.7235          
##       Balanced Accuracy : 0.7198          
##                                           
##        'Positive' Class : 0               
## 
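The main statistics reported above can be reproduced from the four cell counts alone. The sketch below rebuilds the table by hand (cnf is a hypothetical name) and treats class 0 as the positive class, as confusionMatrix() did in the output above.

```r
# Cell counts copied from the confusion matrix above
# (rows = predicted class, columns = actual class)
cnf <- matrix(c(400, 57, 105, 136), nrow=2,
              dimnames=list(prd=c("0", "1"), act=c("0", "1")))

accuracy    <- sum(diag(cnf)) / sum(cnf)
sensitivity <- cnf["0", "0"] / sum(cnf[, "0"])   # class 0 is the positive class
specificity <- cnf["1", "1"] / sum(cnf[, "1"])   # share of actual 1s predicted as 1

round(c(accuracy=accuracy, sensitivity=sensitivity, specificity=specificity), 4)
```

Since Outcome 1 is the diabetic class, the specificity here is the share of diabetic patients that the model identifies correctly.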

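Observation The cut-off used above is 0.5: a patient with a predicted probability of more than 0.5 is considered diabetic. The model has an accuracy of 0.7679 on the train data. Because the 'Positive' class reported by confusionMatrix() is 0, the sensitivity of 0.8753 refers to non-diabetic patients; only 0.5643 of the diabetic patients (136 of 241) are identified correctly.

Regression Data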

dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome     PrdVal POutcome
## 1                    0.627  50       1 0.61471239        1
## 2                    0.351  31       0 0.05098206        0
## 3                    0.672  32       1 0.72804655        1
## 4                    0.167  21       0 0.06541757        0
## 5                    2.288  33       1 0.52773867        1
## 6                    0.201  30       0 0.19236087        0
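The PrdVal column is the inverse logit of the linear predictor. As a check, the sketch below recomputes it for the first row (Pregnancies 6, Glucose 148, BMI 33.6) from the rounded coefficients printed in the model summary; because the coefficients are rounded, the result only agrees with the table to about three decimal places. The variable names are hypothetical.

```r
# Coefficients copied (rounded) from summary(mgmModel)
cfs <- c(intercept=-8.21936, Pregnancies=0.13938, Glucose=0.03291, BMI=0.08866)

# First row of the training data
row1 <- c(Pregnancies=6, Glucose=148, BMI=33.6)

# Linear predictor (log-odds), then inverse logit to get a probability
logOdds <- cfs[["intercept"]] + sum(cfs[names(row1)] * row1)
prob    <- plogis(logOdds)
prob
```

A probability above the 0.5 cut-off, as here, is classified as Outcome 1.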

Regression Visualization

#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome))  + 
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
    labs(title="Binomial Regression Curve") +
    labs(x="") +
    labs(y="")

ROC Visualization

#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)

Observation The AUC is 0.8279 and the accuracy from the confusion matrix is 0.7679. As a rule of thumb used here, the two should not differ by more than 10%. The difference is about 6 percentage points, so we consider the above model a good model.
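For reference, the AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R; auc_rank is a hypothetical helper, not part of Deducer, and it is not necessarily how rocplot() arrives at its figure.

```r
# Hypothetical helper: rank-based (Mann-Whitney) AUC
auc_rank <- function(score, label) {
    rnk  <- rank(score)            # average ranks handle tied scores
    nPos <- sum(label == 1)
    nNeg <- sum(label == 0)
    (sum(rnk[label == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Tiny example: 3 of the 4 positive/negative pairs are ordered correctly
auc_rank(c(0.1, 0.4, 0.35, 0.8), c(0, 0, 1, 1))   # 0.75
```

On the data in this report the call would be auc_rank(prdVal, dfrModel$Outcome).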

Test Data

dfrTests <- read.csv("./data/niddkd-diabetes test.csv", header=T, stringsAsFactors=F)
head(dfrTests)
##   S.No Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1    1           2     122            76            27     200 35.9
## 2    2           6     125            78            31       0 27.6
## 3    3           1     168            88            29       0 35.0
## 4    4           2     129             0             0       0 38.5
## 5    5           4     110            76            20     100 28.4
## 6    6           6      80            80            36       0 39.8
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.483  26       0
## 2                    0.565  49       1
## 3                    0.905  52       1
## 4                    0.304  41       0
## 5                    0.118  27       0
## 6                    0.177  28       0

Observation Test data successfully loaded.

Predict

resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
# label explicitly: 0 = "No Diabetes", 1 = "Diabetes"
prdSur <- factor(prdSur, levels=c(0, 1), labels=c("No Diabetes", "Diabetes"))
# note: this overwrites the actual Outcome column with the predicted label
dfrTests <- mutate(dfrTests, Result=resVal, Outcome=prdSur)
dfrTests 
##    S.No Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1     1           2     122            76            27     200 35.9
## 2     2           6     125            78            31       0 27.6
## 3     3           1     168            88            29       0 35.0
## 4     4           2     129             0             0       0 38.5
## 5     5           4     110            76            20     100 28.4
## 6     6           6      80            80            36       0 39.8
## 7     7          10     115             0             0       0  0.0
## 8     8           2     127            46            21     335 34.4
## 9     9           9     164            78             0       0 32.8
## 10   10           2      93            64            32     160 38.0
## 11   11           3     158            64            13     387 31.2
## 12   12           5     126            78            27      22 29.6
## 13   13          10     129            62            36       0 41.2
## 14   14           0     134            58            20     291 26.4
## 15   15           3     102            74             0       0 29.5
## 16   16           7     187            50            33     392 33.9
## 17   17           3     173            78            39     185 33.8
## 18   18          10      94            72            18       0 23.1
## 19   19           1     108            60            46     178 35.5
## 20   20           5      97            76            27       0 35.6
## 21   21           4      83            86            19       0 29.3
## 22   22           1     114            66            36     200 38.1
## 23   23           1     149            68            29     127 29.3
## 24   24           5     117            86            30     105 39.1
## 25   25           1     111            94             0       0 32.8
## 26   26           4     112            78            40       0 39.4
## 27   27           1     116            78            29     180 36.1
## 28   28           0     141            84            26       0 32.4
## 29   29           2     175            88             0       0 22.9
## 30   30           2      92            52             0       0 30.1
## 31   31           3     130            78            23      79 28.4
## 32   32           8     120            86             0       0 28.4
## 33   33           2     174            88            37     120 44.5
## 34   34           2     106            56            27     165 29.0
## 35   35           2     105            75             0       0 23.3
## 36   36           4      95            60            32       0 35.4
## 37   37           0     126            86            27     120 27.4
## 38   38           8      65            72            23       0 32.0
## 39   39           2      99            60            17     160 36.6
## 40   40           1     102            74             0       0 39.5
## 41   41          11     120            80            37     150 42.3
## 42   42           3     102            44            20      94 30.8
## 43   43           1     109            58            18     116 28.5
## 44   44           9     140            94             0       0 32.7
## 45   45          13     153            88            37     140 40.6
## 46   46          12     100            84            33     105 30.0
## 47   47           1     147            94            41       0 49.3
## 48   48           1      81            74            41      57 46.3
## 49   49           3     187            70            22     200 36.4
## 50   50           6     162            62             0       0 24.3
## 51   51           4     136            70             0       0 31.2
## 52   52           1     121            78            39      74 39.0
## 53   53           3     108            62            24       0 26.0
## 54   54           0     181            88            44     510 43.3
## 55   55           8     154            78            32       0 32.4
## 56   56           1     128            88            39     110 36.5
## 57   57           7     137            90            41       0 32.0
## 58   58           0     123            72             0       0 36.3
## 59   59           1     106            76             0       0 37.5
## 60   60           6     190            92             0       0 35.5
## 61   61           2      88            58            26      16 28.4
## 62   62           9     170            74            31       0 44.0
## 63   63           9      89            62             0       0 22.5
## 64   64          10     101            76            48     180 32.9
## 65   65           2     122            70            27       0 36.8
## 66   66           5     121            72            23     112 26.2
## 67   67           1     126            60             0       0 30.1
## 68   68           1      93            70            31       0 30.4
##    DiabetesPedigreeFunction Age     Outcome     Result
## 1                     0.483  26 No Diabetes 0.32251885
## 2                     0.565  49 No Diabetes 0.30537882
## 3                     0.905  52    Diabetes 0.63474985
## 4                     0.304  41 No Diabetes 0.43013313
## 5                     0.118  27 No Diabetes 0.17896659
## 6                     0.177  28 No Diabetes 0.22770716
## 7                     0.261  30 No Diabetes 0.04563324
## 8                     0.176  22 No Diabetes 0.32945772
## 9                     0.148  45    Diabetes 0.79265655
## 10                    0.674  23 No Diabetes 0.18085833
## 11                    0.295  24    Diabetes 0.54124880
## 12                    0.439  40 No Diabetes 0.32061105
## 13                    0.441  38    Diabetes 0.74519418
## 14                    0.352  21 No Diabetes 0.18720393
## 15                    0.121  32 No Diabetes 0.13841280
## 16                    0.826  34    Diabetes 0.87178045
## 17                    0.970  31    Diabetes 0.70880669
## 18                    0.595  56 No Diabetes 0.15662207
## 19                    0.415  24 No Diabetes 0.20135222
## 20                    0.378  52 No Diabetes 0.23621983
## 21                    0.317  34 No Diabetes 0.08848698
## 22                    0.289  21 No Diabetes 0.27891173
## 23                    0.349  42 No Diabetes 0.35937553
## 24                    0.251  42 No Diabetes 0.44894623
## 25                    0.265  45 No Diabetes 0.17968222
## 26                    0.236  38 No Diabetes 0.38171256
## 27                    0.496  25 No Diabetes 0.25705032
## 28                    0.433  22 No Diabetes 0.33049915
## 29                    0.326  22 No Diabetes 0.46248430
## 30                    0.141  22 No Diabetes 0.09588001
## 31                    0.323  34 No Diabetes 0.26806391
## 32                    0.259  22 No Diabetes 0.34599547
## 33                    0.646  24    Diabetes 0.84963981
## 34                    0.426  22 No Diabetes 0.13232076
## 35                    0.560  53 No Diabetes 0.08174545
## 36                    0.284  28 No Diabetes 0.19837830
## 37                    0.515  21 No Diabetes 0.16206612
## 38                    0.600  42 No Diabetes 0.10642519
## 39                    0.453  21 No Diabetes 0.19198217
## 40                    0.293  42 No Diabetes 0.22781121
## 41                    0.785  48    Diabetes 0.73376818
## 42                    0.400  26 No Diabetes 0.15273892
## 43                    0.219  22 No Diabetes 0.12286610
## 44                    0.734  45    Diabetes 0.63232650
## 45                    1.174  39    Diabetes 0.90273907
## 46                    0.488  46 No Diabetes 0.35535578
## 47                    0.358  27    Diabetes 0.75570599
## 48                    1.096  32 No Diabetes 0.21265017
## 49                    0.408  36    Diabetes 0.82933278
## 50                    0.178  50    Diabetes 0.52583462
## 51                    1.182  22 No Diabetes 0.39667136
## 52                    0.261  28 No Diabetes 0.34532009
## 53                    0.223  25 No Diabetes 0.12549867
## 54                    0.222  26    Diabetes 0.82878182
## 55                    0.443  45    Diabetes 0.69783837
## 56                    1.057  37 No Diabetes 0.34730098
## 57                    0.391  39    Diabetes 0.52563730
## 58                    0.258  52 No Diabetes 0.27836018
## 59                    0.197  26 No Diabetes 0.21987859
## 60                    0.278  66    Diabetes 0.88267472
## 61                    0.766  22 No Diabetes 0.07403892
## 62                    0.403  43    Diabetes 0.92631946
## 63                    0.142  33 No Diabetes 0.11499251
## 64                    0.171  63 No Diabetes 0.35793888
## 65                    0.340  27 No Diabetes 0.34019466
## 66                    0.245  30 No Diabetes 0.22846897
## 67                    0.349  47 No Diabetes 0.22025669
## 68                    0.315  23 No Diabetes 0.08917608

Confusion Matrix

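# NOTE: this re-runs the confusion matrix on the train data (dfrModel);
# it does not use the predictions made for dfrTests
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)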
## Confusion Matrix and Statistics
## 
##    act
## prd   0   1
##   0 400 105
##   1  57 136
##                                           
##                Accuracy : 0.7679          
##                  95% CI : (0.7348, 0.7988)
##     No Information Rate : 0.6547          
##     P-Value [Acc > NIR] : 5.547e-11       
##                                           
##                   Kappa : 0.4613          
##  Mcnemar's Test P-Value : 0.0002219       
##                                           
##             Sensitivity : 0.8753          
##             Specificity : 0.5643          
##          Pos Pred Value : 0.7921          
##          Neg Pred Value : 0.7047          
##              Prevalence : 0.6547          
##          Detection Rate : 0.5731          
##    Detection Prevalence : 0.7235          
##       Balanced Accuracy : 0.7198          
##                                           
##        'Positive' Class : 0               
## 
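The matrix above is built from dfrModel, so it describes the train data. Measuring accuracy on held-out rows needs the actual outcomes of those rows to be kept alongside the predictions. The sketch below shows the general pattern on simulated data, since the test-set results for this report are not available here; every name in it (dfrSim, dfrTrn, dfrTst, mdl, cnfTst) is hypothetical, and it uses base table() rather than caret.

```r
set.seed(42)

# Simulated data: one predictor and a binary outcome
n <- 300
dfrSim <- data.frame(x=rnorm(n))
dfrSim$Outcome <- rbinom(n, 1, plogis(1.5 * dfrSim$x))
dfrTrn <- dfrSim[1:200, ]     # rows used to fit the model
dfrTst <- dfrSim[201:300, ]   # held-out rows, actual Outcome kept

mdl <- glm(Outcome ~ x, data=dfrTrn, family=binomial)
prdTst <- ifelse(predict(mdl, newdata=dfrTst, type="response") > 0.5, 1, 0)

# Fixing the factor levels keeps the table 2 x 2 even if one class is never predicted
cnfTst <- table(prd=factor(prdTst, levels=c(0, 1)),
                act=factor(dfrTst$Outcome, levels=c(0, 1)))
cnfTst
sum(diag(cnfTst)) / sum(cnfTst)   # accuracy on the held-out rows
```

Applied to this report, the same pattern would compare the 0/1 predictions for dfrTests with the Outcome column as read from the test file, before that column is overwritten.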

Summary Test Data The main aim was to predict whether the patients in the test data have diabetes. Each patient was classified with a cut-off of 0.5: a patient with a predicted probability of more than 0.5 is considered diabetic. The predicted outcomes are shown in the Predict section. The model has an accuracy of 0.7679; the confusion matrix behind this figure is computed on the train data (dfrModel), so it is the accuracy on the train data rather than on the test data.