Problem Definition
The objective is to predict based on diagnostic measurements whether a patient has diabetes.

Use train.csv (700 observations of 9 variables) to fit a logistic regression model.
Use test.csv (68 observations of 9 variables) to predict, from the diagnostic measurements, whether each patient has diabetes.

Data Location
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases.

Data Description
Several constraints were placed on the selection of these instances from a larger database.
In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Attributes:
Patient ID: serial number for the patient (not a separate column in the CSV files used here, which hold the 9 variables below)
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
(Assuming 0 means the patient “Does not Have Diabetes” and 1 means the patient “Has Diabetes”)

Setup

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra) 
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(rJava)
library(JavaGD)
library(iplots)
## Note: On Mac OS X we strongly recommend using iplots from within JGR.
## Proceed at your own risk as iplots cannot resolve potential ev.loop deadlocks.
## 'Yes' is assumed for all dialogs as they cannot be shown without a deadlock,
## also ievent.wait() is disabled.
## More recent OS X version do not allow signle-threaded GUIs and will fail.
library(Deducer)
## Loading required package: JGR
## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().
library(lattice)
library(caret)
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

Functions
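The helper functions called later (detect_na and detect_outliers) are not listed in this write-up. The sketch below shows what they might look like. It is a hypothetical reconstruction: it assumes detect_na counts missing values and detect_outliers returns the values outside the boxplot whiskers (Tukey's 1.5 * IQR rule). The original thresholds are unknown, so this version may not reproduce the outlier lists shown further down.

```r
# Hypothetical reconstructions; the original definitions are not shown.

# Count the missing (NA) values in a vector.
detect_na <- function(inp) {
    sum(is.na(inp))
}

# Return the values lying outside the boxplot whiskers
# (assumed rule: more than 1.5 * IQR beyond the hinges).
detect_outliers <- function(inp) {
    boxplot.stats(inp)$out
}

# Small demonstration on made-up numbers
naDemo <- detect_na(c(1, NA, 3))
outDemo <- detect_outliers(c(10, 12, 11, 13, 12, 95))
naDemo
outDemo
```

Both helpers take a single vector, so they can be applied column by column with lapply(), as done in the sections below.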

Dataset

dfrModel <- read.csv("./DATA/train.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrModel)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Observation
All nine columns are numeric; there are no character (alphanumeric) columns.
No columns are dropped, as all of them are candidate inputs for the model.

Because every column is already numeric, no conversion or encoding is needed before fitting the regression.

Missing Data
Checking missing data in the columns.
This is an important step in predictive analytics.

lapply(dfrModel, FUN=detect_na)
## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

Observation
There is no missing data in any of the columns.

Impute Data
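There is no need to impute data as there are no NA values.
Several columns do contain 0s (for example Glucose, BloodPressure, SkinThickness, Insulin and BMI), which are physiologically implausible and likely stand for measurements that were not taken.
However, this model works with these 0s as they are.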
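
To see how many 0s each column holds, a count per column is enough. The sketch below uses a small made-up data frame (dfrDemo is a hypothetical stand-in for dfrModel) so that it runs on its own.

```r
# Made-up rows standing in for dfrModel
dfrDemo <- data.frame(Glucose=c(148, 0, 183),
                      BloodPressure=c(72, 66, 0),
                      BMI=c(33.6, 0, 0))

# dfrDemo == 0 gives a logical matrix; colSums() counts the TRUEs per column
zeroCount <- colSums(dfrDemo == 0)
zeroCount
```

On the real data the same call would be colSums(dfrModel == 0).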

Outliers Data
Detecting outliers in the columns of the data.

lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
## 
## $Glucose
## integer(0)
## 
## $BloodPressure
##  [1]   0   0   0   0   0   0 122   0   0   0   0   0   0   0   0   0   0
## [18]   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
## 
## $SkinThickness
## integer(0)
## 
## $Insulin
##  [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480
## 
## $BMI
##  [1]  0.0  0.0  0.0  0.0  0.0 67.1  0.0  0.0  0.0  0.0  0.0
## 
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
## 
## $Age
## [1] 81
## 
## $Outcome
## integer(0)

Outliers Graph
Plotting outliers on the graph.

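# Draw a boxplot for one column; points beyond the whiskers are the outliers.
plotgraph <- function(colName) {
    ggplot(dfrModel, aes(x="", y=.data[[colName]])) +
        geom_boxplot(fill="lightblue", color="blue") +
        labs(title=paste("Outliers for", colName), x="", y=colName)
}
# setNames() keeps the column names on the returned list of plots
lapply(setNames(names(dfrModel), names(dfrModel)), FUN=plotgraph)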
[Figures: one boxplot per column, in order: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome]
Observation
With respect to this model:
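- Outliers are present in BloodPressure, Insulin, BMI, DiabetesPedigreeFunction and Age.
- The outlier count is low (at most 34 values in any one column out of 700 rows), and most of the BloodPressure and BMI outliers are 0s.
- For this model we keep the outliers in the data.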

Correlation

# Pearson correlation of every column with Outcome (sapply keeps the column names)
dfrCorr <- sapply(dfrModel, function(col) {
    cor(as.numeric(dfrModel$Outcome), as.numeric(col))
})
dfrCorr
##              Pregnancies                  Glucose            BloodPressure 
##               0.22774403               0.45928020               0.06019258 
##            SkinThickness                  Insulin                      BMI 
##               0.08740524               0.14592233               0.30659734 
## DiabetesPedigreeFunction                      Age                  Outcome 
##               0.17053194               0.22699018               1.00000000

Data For Visualization

dfrGraph <- gather(dfrModel, variable, value, -Outcome)
head(dfrGraph)
##   Outcome    variable value
## 1       1 Pregnancies     6
## 2       0 Pregnancies     1
## 3       1 Pregnancies     8
## 4       0 Pregnancies     1
## 5       1 Pregnancies     0
## 6       0 Pregnancies     5

Data Visualization

ggplot(dfrGraph) +
    geom_jitter(aes(value,Outcome, colour=variable)) + 
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of Outcome With Other Features")

Observation
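Every feature shows some association with the outcome, but the strength varies: going by the correlations above, Glucose (0.46) and BMI (0.31) are the strongest, while BloodPressure (0.06) and SkinThickness (0.09) are the weakest.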

Summary

lapply(dfrModel, FUN=summary)
## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.827   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    99.0   116.5   120.5   140.2   199.0 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   63.50   72.00   68.88   80.00  122.00 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   23.00   20.38   32.00   99.00 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00   36.50   79.88  126.50  846.00 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   27.00   32.00   31.89   36.50   67.10 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2400  0.3755  0.4760  0.6370  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.12   40.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3443  1.0000  1.0000

Observation
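The summaries give the range and central tendency of each numeric column.
Glucose, BloodPressure, SkinThickness, Insulin and BMI all have a minimum of 0.
The mean of Outcome is 0.3443, so about 34% of the training patients have diabetes.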

Find Best Multi Logistic Model
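Choose the best logistic model by using step(), which starts from the full model (Outcome ~ .) and removes predictors as long as doing so lowers the AIC.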

stpModel=step(glm(data=dfrModel, formula=Outcome~., family=binomial), trace=0, steps=10000)
summary(stpModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction, family = binomial, data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7459  -0.7335  -0.4097   0.7153   2.8945  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.009215   0.713218 -11.230  < 2e-16 ***
## Pregnancies               0.157336   0.029270   5.375 7.64e-08 ***
## Glucose                   0.033444   0.003517   9.509  < 2e-16 ***
## BloodPressure            -0.012432   0.005246  -2.370  0.01781 *  
## BMI                       0.091142   0.014970   6.088 1.14e-09 ***
## DiabetesPedigreeFunction  0.885935   0.304143   2.913  0.00358 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 901.37  on 699  degrees of freedom
## Residual deviance: 659.26  on 694  degrees of freedom
## AIC: 671.26
## 
## Number of Fisher Scoring iterations: 5

Observation
step() selects Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction. SkinThickness, Insulin and Age are dropped.

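The drop from null deviance to residual deviance shows how much the predictors improve on an intercept-only model.
Here the deviance falls from 901.37 to 659.26, a reduction of 242.11 (about 27% of the null deviance) for 5 degrees of freedom. This is a large improvement for 5 predictors, although if the 60-70% guideline is read as a relative reduction it is not met here.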
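
The comparison between null and residual deviance can be checked numerically. The sketch below copies the four numbers from the summary() output and runs the usual chi-square (likelihood-ratio) test of the fitted model against the intercept-only model. The variable names are chosen here for illustration.

```r
# Values copied from the summary() output above
nullDev <- 901.37
residDev <- 659.26
nullDf <- 699
residDf <- 694

devDrop <- nullDev - residDev    # improvement over the intercept-only model
dfDrop <- nullDf - residDf       # number of predictors added
relDrop <- devDrop / nullDev     # share of the null deviance removed

# Chi-square test: is the drop larger than chance would give for dfDrop predictors?
pValue <- pchisq(devDrop, df=dfDrop, lower.tail=FALSE)

round(c(devDrop=devDrop, relDrop=relDrop), 4)
pValue
```

For a 0/1 response, relDrop is the same quantity as McFadden's pseudo R-squared.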

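All five coefficients have p-values below 0.05, so each can be kept in the model. Pregnancies, Glucose and BMI are significant at the 0.001 level (three stars), DiabetesPedigreeFunction at the 0.01 level (two stars) and BloodPressure at the 0.05 level (one star).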

For null deviance:
Degrees of freedom = 700 - 1 = 699
(700 being the number of rows in the data, 1 for the intercept)

For residual deviance:
Degrees of freedom = 699 - 5 = 694
(5 being the number of predictors retained by step())
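
The degrees of freedom and the AIC reported by summary() can be reproduced by hand. This is a minimal sketch with the numbers copied from the output above; it relies on the fact that for a 0/1 response the AIC equals the residual deviance plus twice the number of estimated coefficients (intercept included).

```r
nObs <- 700           # rows in train.csv
nPredictors <- 5      # predictors kept by step()
residDev <- 659.26    # residual deviance from summary()

nullDf <- nObs - 1                 # intercept-only model
residDf <- nullDf - nPredictors    # intercept plus 5 slopes

# AIC = residual deviance + 2 * (number of coefficients) for a 0/1 response
aic <- residDev + 2 * (nPredictors + 1)

c(nullDf=nullDf, residDf=residDf, aic=aic)
```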

Make Final Multi Logistic Model

# make model
mgmModel <- glm(data=dfrModel, formula=Outcome ~ Pregnancies + Glucose + BloodPressure + 
    BMI + DiabetesPedigreeFunction, family=binomial(link="logit"))

# print summary
summary(mgmModel)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction, family = binomial(link = "logit"), 
##     data = dfrModel)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7459  -0.7335  -0.4097   0.7153   2.8945  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.009215   0.713218 -11.230  < 2e-16 ***
## Pregnancies               0.157336   0.029270   5.375 7.64e-08 ***
## Glucose                   0.033444   0.003517   9.509  < 2e-16 ***
## BloodPressure            -0.012432   0.005246  -2.370  0.01781 *  
## BMI                       0.091142   0.014970   6.088 1.14e-09 ***
## DiabetesPedigreeFunction  0.885935   0.304143   2.913  0.00358 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 901.37  on 699  degrees of freedom
## Residual deviance: 659.26  on 694  degrees of freedom
## AIC: 671.26
## 
## Number of Fisher Scoring iterations: 5
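
The coefficients are on the log-odds scale, which is hard to read directly. Exponentiating them gives odds ratios: the factor by which the odds of diabetes change for a one-unit increase in a predictor, holding the others fixed. The sketch below copies the estimates from the summary above so it runs without the data.

```r
# Estimates copied from the summary() output above
coefs <- c(Pregnancies=0.157336,
           Glucose=0.033444,
           BloodPressure=-0.012432,
           BMI=0.091142,
           DiabetesPedigreeFunction=0.885935)

# exp() turns log-odds coefficients into odds ratios
oddsRatios <- exp(coefs)
round(oddsRatios, 3)
```

A value above 1 raises the odds and a value below 1 lowers them. With the fitted model in memory, exp(coef(mgmModel)) gives the same figures (plus the intercept).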

Confusion Matrix

prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5,1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd   0   1
##   0 404 103
##   1  55 138
##                                           
##                Accuracy : 0.7743          
##                  95% CI : (0.7415, 0.8048)
##     No Information Rate : 0.6557          
##     P-Value [Acc > NIR] : 5.601e-12       
##                                           
##                   Kappa : 0.4753          
##  Mcnemar's Test P-Value : 0.0001847       
##                                           
##             Sensitivity : 0.8802          
##             Specificity : 0.5726          
##          Pos Pred Value : 0.7968          
##          Neg Pred Value : 0.7150          
##              Prevalence : 0.6557          
##          Detection Rate : 0.5771          
##    Detection Prevalence : 0.7243          
##       Balanced Accuracy : 0.7264          
##                                           
##        'Positive' Class : 0               
## 

Observation
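The cut-off is 0.5: a predicted probability above 0.5 is classified as diabetes (1), anything else as no diabetes (0).
The model has an accuracy of 0.7743 on the training data.
Note that caret treats 0 as the 'Positive' class here, so the reported Sensitivity (0.8802) is the share of non-diabetic patients classified correctly. The share of diabetic patients that the model detects is the reported Specificity (0.5726, i.e. 138 of 241).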
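
Because the caret output uses 0 as the positive class, it can help to recompute the key figures with diabetes (1) as the class of interest. The sketch below rebuilds the table from the counts printed above; the variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
cnf <- matrix(c(404, 55, 103, 138), nrow=2,
              dimnames=list(prd=c("0", "1"), act=c("0", "1")))

accuracy <- sum(diag(cnf)) / sum(cnf)                    # (404 + 138) / 700
recallDiabetes <- cnf["1", "1"] / sum(cnf[, "1"])        # 138 / 241
precisionDiabetes <- cnf["1", "1"] / sum(cnf["1", ])     # 138 / 193

round(c(accuracy=accuracy,
        recall=recallDiabetes,
        precision=precisionDiabetes), 4)
```

The same view should be available from caret directly by passing positive="1" to confusionMatrix().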

Regression Data

dfrPlot <- mutate(dfrModel, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome     PrdVal POutcome
## 1                    0.627  50       1 0.64732497        1
## 2                    0.351  31       0 0.04334372        0
## 3                    0.672  32       1 0.78466856        1
## 4                    0.167  21       0 0.04802554        0
## 5                    2.288  33       1 0.88397222        1
## 6                    0.201  30       0 0.14783877        0
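
The PrdVal column can be reproduced by hand for a single row, which shows what the model does: it forms a weighted sum of the predictors (the logit) and passes it through the logistic function. The sketch below uses the coefficients from the summary and the first training row; the small difference from the table comes from the rounded coefficients.

```r
# Coefficients copied from the model summary (intercept first)
coefs <- c(Intercept=-8.009215, Pregnancies=0.157336, Glucose=0.033444,
           BloodPressure=-0.012432, BMI=0.091142,
           DiabetesPedigreeFunction=0.885935)

# First training row, with a leading 1 for the intercept
patient1 <- c(1, 6, 148, 72, 33.6, 0.627)

logit <- sum(coefs * patient1)    # linear predictor (log-odds)
prob <- plogis(logit)             # logistic function: 1 / (1 + exp(-logit))
predicted <- ifelse(prob > 0.5, 1, 0)

round(c(logit=logit, prob=prob, predicted=predicted), 4)
```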

Regression Visualization

#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome))  + 
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
    labs(title="Binomial Regression Curve") +
    labs(x="") +
    labs(y="")

ROC Visualization

rocplot(mgmModel)

Observation
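AUC is 0.8378.
Accuracy as per the confusion matrix is 0.7743.
This analysis expects the two values to be close to one another, keeping in mind that AUC summarises the ranking of patients across all cut-offs while accuracy depends on the single 0.5 cut-off.

As a rule of thumb used here, a difference of more than 10 percentage points is a cause for concern.
In this model the difference is about 6 percentage points (0.8378 - 0.7743 = 0.0635).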
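
AUC has a direct interpretation: it is the probability that a randomly chosen diabetic patient receives a higher predicted probability than a randomly chosen non-diabetic patient. The sketch below computes it from ranks in base R. aucRank is a hypothetical helper written for illustration, not part of Deducer or caret, and the demo values are made up.

```r
# Rank-based AUC: chance that a random positive case scores higher
# than a random negative case (ties count as half).
aucRank <- function(scores, labels) {
    r <- rank(scores)                 # average ranks handle ties
    nPos <- sum(labels == 1)
    nNeg <- sum(labels == 0)
    (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

# Made-up scores and labels for illustration
demoScores <- c(0.10, 0.40, 0.35, 0.80)
demoLabels <- c(0, 0, 1, 1)
aucDemo <- aucRank(demoScores, demoLabels)
aucDemo
```

With the model in memory, aucRank(prdVal, dfrModel$Outcome) should give a value comparable to the one read from the ROC plot.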

Test Data

dfrTests <- read.csv("./Data/test.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrTests)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           2     122            76            27     200 35.9
## 2           6     125            78            31       0 27.6
## 3           1     168            88            29       0 35.0
## 4           2     129             0             0       0 38.5
## 5           4     110            76            20     100 28.4
## 6           6      80            80            36       0 39.8
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.483  26       0
## 2                    0.565  49       1
## 3                    0.905  52       1
## 4                    0.304  41       0
## 5                    0.118  27       0
## 6                    0.177  28       0

Observation
Test data loaded successfully, with the same 9 columns as the training data.

Predict
A predicted probability greater than 0.5 is taken to mean the person has diabetes.

resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0) 
dfrTests <- mutate(dfrTests, Result=resVal, PredictedOutcome=prdSur)
dfrTests 
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1            2     122            76            27     200 35.9
## 2            6     125            78            31       0 27.6
## 3            1     168            88            29       0 35.0
## 4            2     129             0             0       0 38.5
## 5            4     110            76            20     100 28.4
## 6            6      80            80            36       0 39.8
## 7           10     115             0             0       0  0.0
## 8            2     127            46            21     335 34.4
## 9            9     164            78             0       0 32.8
## 10           2      93            64            32     160 38.0
## 11           3     158            64            13     387 31.2
## 12           5     126            78            27      22 29.6
## 13          10     129            62            36       0 41.2
## 14           0     134            58            20     291 26.4
## 15           3     102            74             0       0 29.5
## 16           7     187            50            33     392 33.9
## 17           3     173            78            39     185 33.8
## 18          10      94            72            18       0 23.1
## 19           1     108            60            46     178 35.5
## 20           5      97            76            27       0 35.6
## 21           4      83            86            19       0 29.3
## 22           1     114            66            36     200 38.1
## 23           1     149            68            29     127 29.3
## 24           5     117            86            30     105 39.1
## 25           1     111            94             0       0 32.8
## 26           4     112            78            40       0 39.4
## 27           1     116            78            29     180 36.1
## 28           0     141            84            26       0 32.4
## 29           2     175            88             0       0 22.9
## 30           2      92            52             0       0 30.1
## 31           3     130            78            23      79 28.4
## 32           8     120            86             0       0 28.4
## 33           2     174            88            37     120 44.5
## 34           2     106            56            27     165 29.0
## 35           2     105            75             0       0 23.3
## 36           4      95            60            32       0 35.4
## 37           0     126            86            27     120 27.4
## 38           8      65            72            23       0 32.0
## 39           2      99            60            17     160 36.6
## 40           1     102            74             0       0 39.5
## 41          11     120            80            37     150 42.3
## 42           3     102            44            20      94 30.8
## 43           1     109            58            18     116 28.5
## 44           9     140            94             0       0 32.7
## 45          13     153            88            37     140 40.6
## 46          12     100            84            33     105 30.0
## 47           1     147            94            41       0 49.3
## 48           1      81            74            41      57 46.3
## 49           3     187            70            22     200 36.4
## 50           6     162            62             0       0 24.3
## 51           4     136            70             0       0 31.2
## 52           1     121            78            39      74 39.0
## 53           3     108            62            24       0 26.0
## 54           0     181            88            44     510 43.3
## 55           8     154            78            32       0 32.4
## 56           1     128            88            39     110 36.5
## 57           7     137            90            41       0 32.0
## 58           0     123            72             0       0 36.3
## 59           1     106            76             0       0 37.5
## 60           6     190            92             0       0 35.5
## 61           2      88            58            26      16 28.4
## 62           9     170            74            31       0 44.0
## 63           9      89            62             0       0 22.5
## 64          10     101            76            48     180 32.9
## 65           2     122            70            27       0 36.8
## 66           5     121            72            23     112 26.2
## 67           1     126            60             0       0 30.1
## 68           1      93            70            31       0 30.4
##    DiabetesPedigreeFunction Age Outcome     Result PredictedOutcome
## 1                     0.483  26       0 0.29749244                0
## 2                     0.565  49       1 0.30189651                0
## 3                     0.905  52       1 0.66026773                1
## 4                     0.304  41       0 0.59821566                1
## 5                     0.118  27       0 0.12424247                0
## 6                     0.177  28       0 0.16798839                0
## 7                     0.261  30       1 0.08638919                0
## 8                     0.176  22       0 0.32567987                0
## 9                     0.148  45       1 0.73934182                1
## 10                    0.674  23       1 0.21092502                0
## 11                    0.295  24       0 0.51407613                1
## 12                    0.439  40       0 0.29079639                0
## 13                    0.441  38       1 0.77789023                1
## 14                    0.352  21       0 0.17788535                0
## 15                    0.121  32       0 0.09535232                0
## 16                    0.826  34       1 0.92731122                1
## 17                    0.970  31       1 0.77187221                1
## 18                    0.595  56       0 0.17441146                0
## 19                    0.415  24       0 0.20058927                0
## 20                    0.378  52       1 0.20689720                0
## 21                    0.317  34       0 0.06169707                0
## 22                    0.289  21       0 0.24393972                0
## 23                    0.349  42       1 0.32422871                0
## 24                    0.251  42       0 0.35601984                0
## 25                    0.265  45       0 0.11066833                0
## 26                    0.236  38       0 0.30922789                0
## 27                    0.496  25       0 0.22927908                0
## 28                    0.433  22       0 0.26869678                0
## 29                    0.326  22       0 0.36358534                0
## 30                    0.141  22       0 0.08349014                0
## 31                    0.323  34       1 0.21677629                0
## 32                    0.259  22       1 0.27121466                0
## 33                    0.646  24       1 0.84008714                1
## 34                    0.426  22       0 0.13882109                0
## 35                    0.560  53       0 0.07617043                0
## 36                    0.284  28       0 0.18685838                0
## 37                    0.515  21       0 0.12888735                0
## 38                    0.600  42       0 0.11674276                0
## 39                    0.453  21       0 0.19903185                0
## 40                    0.293  42       1 0.18229997                0
## 41                    0.785  48       1 0.78431616                1
## 42                    0.400  26       0 0.18073784                0
## 43                    0.219  22       0 0.10565220                0
## 44                    0.734  45       1 0.63437356                1
## 45                    1.174  39       0 0.94265225                1
## 46                    0.488  46       0 0.34198956                0
## 47                    0.358  27       1 0.66958025                1
## 48                    1.096  32       0 0.29483781                0
## 49                    0.408  36       1 0.82137020                1
## 50                    0.178  50       1 0.48861097                0
## 51                    1.182  22       1 0.54713868                1
## 52                    0.261  28       0 0.27109988                0
## 53                    0.223  25       0 0.10633357                0
## 54                    0.222  26       1 0.74900417                1
## 55                    0.443  45       1 0.68474539                1
## 56                    1.057  37       1 0.40085432                0
## 57                    0.391  39       0 0.45464341                0
## 58                    0.258  52       1 0.22206950                0
## 59                    0.197  26       0 0.15986116                0
## 60                    0.278  66       1 0.83579980                1
## 61                    0.766  22       0 0.09926279                0
## 62                    0.403  43       1 0.92687453                1
## 63                    0.142  33       0 0.09877301                0
## 64                    0.171  63       0 0.29885728                0
## 65                    0.340  27       0 0.30378496                0
## 66                    0.245  30       0 0.18756614                0
## 67                    0.349  47       1 0.20895172                0
## 68                    0.315  23       0 0.07162370                0

Confusion Matrix of Test Data

resVal <- predict(mgmModel,dfrTests, type='response')
prdSur <- ifelse(resVal > 0.5,1, 0)
cnfmtrx <- table(prd=prdSur, act=dfrTests$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
## 
##    act
## prd  0  1
##   0 38 12
##   1  3 15
##                                          
##                Accuracy : 0.7794         
##                  95% CI : (0.6624, 0.871)
##     No Information Rate : 0.6029         
##     P-Value [Acc > NIR] : 0.001609       
##                                          
##                   Kappa : 0.5115         
##  Mcnemar's Test P-Value : 0.038867       
##                                          
##             Sensitivity : 0.9268         
##             Specificity : 0.5556         
##          Pos Pred Value : 0.7600         
##          Neg Pred Value : 0.8333         
##              Prevalence : 0.6029         
##          Detection Rate : 0.5588         
##    Detection Prevalence : 0.7353         
##       Balanced Accuracy : 0.7412         
##                                          
##        'Positive' Class : 0              
## 

Observation
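The cut-off is 0.5: a predicted probability above 0.5 is classified as diabetes (1), anything else as no diabetes (0).
The model has an accuracy of 0.7794 on the test data, close to the 0.7743 seen on the training data. With only 68 observations the 95% confidence interval is wide (0.6624 to 0.871).
As before, caret treats 0 as the 'Positive' class, so the share of diabetic patients detected is the reported Specificity (0.5556, i.e. 15 of 27).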
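
The accuracy, its confidence interval and the comparison with the No Information Rate can be recomputed from the four counts. The sketch below uses binom.test(), which the caret documentation names as the source of the interval and of the one-sided test. The variable names are chosen here for illustration.

```r
# Counts copied from the test confusion matrix above
correct <- 38 + 15      # diagonal: correctly classified patients
total <- 68
nir <- (38 + 3) / 68    # No Information Rate: share of the larger class (actual 0)

acc <- correct / total

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

# One-sided test: is the accuracy better than always predicting class 0?
pAccGtNir <- binom.test(correct, total, p=nir, alternative="greater")$p.value

round(c(accuracy=acc, lower=ci[1], upper=ci[2]), 4)
pAccGtNir
```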

Creating a table depicting predicted and actual outcome

# Convert Outcome to a labelled factor (0 = no diabetes, 1 = diabetes).
# Naming the levels explicitly keeps the labels correct even if one class is absent.
dfrTests$Outcome <- factor(dfrTests$Outcome, levels=c(0, 1),
                           labels=c("Does not Have Diabetes", "Have Diabetes"))

# Convert PredictedOutcome to a factor with the same labels.
dfrTests$PredictedOutcome <- factor(dfrTests$PredictedOutcome, levels=c(0, 1),
                                    labels=c("Does not Have Diabetes", "Have Diabetes"))
dfrTests 
##    Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1            2     122            76            27     200 35.9
## 2            6     125            78            31       0 27.6
## 3            1     168            88            29       0 35.0
## 4            2     129             0             0       0 38.5
## 5            4     110            76            20     100 28.4
## 6            6      80            80            36       0 39.8
## 7           10     115             0             0       0  0.0
## 8            2     127            46            21     335 34.4
## 9            9     164            78             0       0 32.8
## 10           2      93            64            32     160 38.0
## 11           3     158            64            13     387 31.2
## 12           5     126            78            27      22 29.6
## 13          10     129            62            36       0 41.2
## 14           0     134            58            20     291 26.4
## 15           3     102            74             0       0 29.5
## 16           7     187            50            33     392 33.9
## 17           3     173            78            39     185 33.8
## 18          10      94            72            18       0 23.1
## 19           1     108            60            46     178 35.5
## 20           5      97            76            27       0 35.6
## 21           4      83            86            19       0 29.3
## 22           1     114            66            36     200 38.1
## 23           1     149            68            29     127 29.3
## 24           5     117            86            30     105 39.1
## 25           1     111            94             0       0 32.8
## 26           4     112            78            40       0 39.4
## 27           1     116            78            29     180 36.1
## 28           0     141            84            26       0 32.4
## 29           2     175            88             0       0 22.9
## 30           2      92            52             0       0 30.1
## 31           3     130            78            23      79 28.4
## 32           8     120            86             0       0 28.4
## 33           2     174            88            37     120 44.5
## 34           2     106            56            27     165 29.0
## 35           2     105            75             0       0 23.3
## 36           4      95            60            32       0 35.4
## 37           0     126            86            27     120 27.4
## 38           8      65            72            23       0 32.0
## 39           2      99            60            17     160 36.6
## 40           1     102            74             0       0 39.5
## 41          11     120            80            37     150 42.3
## 42           3     102            44            20      94 30.8
## 43           1     109            58            18     116 28.5
## 44           9     140            94             0       0 32.7
## 45          13     153            88            37     140 40.6
## 46          12     100            84            33     105 30.0
## 47           1     147            94            41       0 49.3
## 48           1      81            74            41      57 46.3
## 49           3     187            70            22     200 36.4
## 50           6     162            62             0       0 24.3
## 51           4     136            70             0       0 31.2
## 52           1     121            78            39      74 39.0
## 53           3     108            62            24       0 26.0
## 54           0     181            88            44     510 43.3
## 55           8     154            78            32       0 32.4
## 56           1     128            88            39     110 36.5
## 57           7     137            90            41       0 32.0
## 58           0     123            72             0       0 36.3
## 59           1     106            76             0       0 37.5
## 60           6     190            92             0       0 35.5
## 61           2      88            58            26      16 28.4
## 62           9     170            74            31       0 44.0
## 63           9      89            62             0       0 22.5
## 64          10     101            76            48     180 32.9
## 65           2     122            70            27       0 36.8
## 66           5     121            72            23     112 26.2
## 67           1     126            60             0       0 30.1
## 68           1      93            70            31       0 30.4
##    DiabetesPedigreeFunction Age                Outcome     Result
## 1                     0.483  26 Does not Have Diabetes 0.29749244
## 2                     0.565  49          Have Diabetes 0.30189651
## 3                     0.905  52          Have Diabetes 0.66026773
## 4                     0.304  41 Does not Have Diabetes 0.59821566
## 5                     0.118  27 Does not Have Diabetes 0.12424247
## 6                     0.177  28 Does not Have Diabetes 0.16798839
## 7                     0.261  30          Have Diabetes 0.08638919
## 8                     0.176  22 Does not Have Diabetes 0.32567987
## 9                     0.148  45          Have Diabetes 0.73934182
## 10                    0.674  23          Have Diabetes 0.21092502
## 11                    0.295  24 Does not Have Diabetes 0.51407613
## 12                    0.439  40 Does not Have Diabetes 0.29079639
## 13                    0.441  38          Have Diabetes 0.77789023
## 14                    0.352  21 Does not Have Diabetes 0.17788535
## 15                    0.121  32 Does not Have Diabetes 0.09535232
## 16                    0.826  34          Have Diabetes 0.92731122
## 17                    0.970  31          Have Diabetes 0.77187221
## 18                    0.595  56 Does not Have Diabetes 0.17441146
## 19                    0.415  24 Does not Have Diabetes 0.20058927
## 20                    0.378  52          Have Diabetes 0.20689720
## 21                    0.317  34 Does not Have Diabetes 0.06169707
## 22                    0.289  21 Does not Have Diabetes 0.24393972
## 23                    0.349  42          Have Diabetes 0.32422871
## 24                    0.251  42 Does not Have Diabetes 0.35601984
## 25                    0.265  45 Does not Have Diabetes 0.11066833
## 26                    0.236  38 Does not Have Diabetes 0.30922789
## 27                    0.496  25 Does not Have Diabetes 0.22927908
## 28                    0.433  22 Does not Have Diabetes 0.26869678
## 29                    0.326  22 Does not Have Diabetes 0.36358534
## 30                    0.141  22 Does not Have Diabetes 0.08349014
## 31                    0.323  34          Have Diabetes 0.21677629
## 32                    0.259  22          Have Diabetes 0.27121466
## 33                    0.646  24          Have Diabetes 0.84008714
## 34                    0.426  22 Does not Have Diabetes 0.13882109
## 35                    0.560  53 Does not Have Diabetes 0.07617043
## 36                    0.284  28 Does not Have Diabetes 0.18685838
## 37                    0.515  21 Does not Have Diabetes 0.12888735
## 38                    0.600  42 Does not Have Diabetes 0.11674276
## 39                    0.453  21 Does not Have Diabetes 0.19903185
## 40                    0.293  42          Have Diabetes 0.18229997
## 41                    0.785  48          Have Diabetes 0.78431616
## 42                    0.400  26 Does not Have Diabetes 0.18073784
## 43                    0.219  22 Does not Have Diabetes 0.10565220
## 44                    0.734  45          Have Diabetes 0.63437356
## 45                    1.174  39 Does not Have Diabetes 0.94265225
## 46                    0.488  46 Does not Have Diabetes 0.34198956
## 47                    0.358  27          Have Diabetes 0.66958025
## 48                    1.096  32 Does not Have Diabetes 0.29483781
## 49                    0.408  36          Have Diabetes 0.82137020
## 50                    0.178  50          Have Diabetes 0.48861097
## 51                    1.182  22          Have Diabetes 0.54713868
## 52                    0.261  28 Does not Have Diabetes 0.27109988
## 53                    0.223  25 Does not Have Diabetes 0.10633357
## 54                    0.222  26          Have Diabetes 0.74900417
## 55                    0.443  45          Have Diabetes 0.68474539
## 56                    1.057  37          Have Diabetes 0.40085432
## 57                    0.391  39 Does not Have Diabetes 0.45464341
## 58                    0.258  52          Have Diabetes 0.22206950
## 59                    0.197  26 Does not Have Diabetes 0.15986116
## 60                    0.278  66          Have Diabetes 0.83579980
## 61                    0.766  22 Does not Have Diabetes 0.09926279
## 62                    0.403  43          Have Diabetes 0.92687453
## 63                    0.142  33 Does not Have Diabetes 0.09877301
## 64                    0.171  63 Does not Have Diabetes 0.29885728
## 65                    0.340  27 Does not Have Diabetes 0.30378496
## 66                    0.245  30 Does not Have Diabetes 0.18756614
## 67                    0.349  47          Have Diabetes 0.20895172
## 68                    0.315  23 Does not Have Diabetes 0.07162370
##          PredictedOutcome
## 1  Does not Have Diabetes
## 2  Does not Have Diabetes
## 3           Have Diabetes
## 4           Have Diabetes
## 5  Does not Have Diabetes
## 6  Does not Have Diabetes
## 7  Does not Have Diabetes
## 8  Does not Have Diabetes
## 9           Have Diabetes
## 10 Does not Have Diabetes
## 11          Have Diabetes
## 12 Does not Have Diabetes
## 13          Have Diabetes
## 14 Does not Have Diabetes
## 15 Does not Have Diabetes
## 16          Have Diabetes
## 17          Have Diabetes
## 18 Does not Have Diabetes
## 19 Does not Have Diabetes
## 20 Does not Have Diabetes
## 21 Does not Have Diabetes
## 22 Does not Have Diabetes
## 23 Does not Have Diabetes
## 24 Does not Have Diabetes
## 25 Does not Have Diabetes
## 26 Does not Have Diabetes
## 27 Does not Have Diabetes
## 28 Does not Have Diabetes
## 29 Does not Have Diabetes
## 30 Does not Have Diabetes
## 31 Does not Have Diabetes
## 32 Does not Have Diabetes
## 33          Have Diabetes
## 34 Does not Have Diabetes
## 35 Does not Have Diabetes
## 36 Does not Have Diabetes
## 37 Does not Have Diabetes
## 38 Does not Have Diabetes
## 39 Does not Have Diabetes
## 40 Does not Have Diabetes
## 41          Have Diabetes
## 42 Does not Have Diabetes
## 43 Does not Have Diabetes
## 44          Have Diabetes
## 45          Have Diabetes
## 46 Does not Have Diabetes
## 47          Have Diabetes
## 48 Does not Have Diabetes
## 49          Have Diabetes
## 50 Does not Have Diabetes
## 51          Have Diabetes
## 52 Does not Have Diabetes
## 53 Does not Have Diabetes
## 54          Have Diabetes
## 55          Have Diabetes
## 56 Does not Have Diabetes
## 57 Does not Have Diabetes
## 58 Does not Have Diabetes
## 59 Does not Have Diabetes
## 60          Have Diabetes
## 61 Does not Have Diabetes
## 62          Have Diabetes
## 63 Does not Have Diabetes
## 64 Does not Have Diabetes
## 65 Does not Have Diabetes
## 66 Does not Have Diabetes
## 67 Does not Have Diabetes
## 68 Does not Have Diabetes

Summary
- The objective was to predict based on diagnostic measurements whether a patient has diabetes.
- Assumption for Outcome: 0 means a patient “Does not Have Diabetes” and 1 means a patient “Has Diabetes”.
- The entire data set was divided into train data (700 observations) and test data (68 observations).

With respect to Train Data:
- The data had no missing values.
- Outliers were present in the data. However, the outlier count was low, so the model was fitted with the outliers retained.
- Positive correlations were observed between outcome and the other features.
- The difference between the null and residual deviance was around 60-70%, which suggests a good model fit.
- The confusion matrix showed that the model has an accuracy of 0.7743 at a cut-off of 50%, i.e. any observation with a predicted probability below 0.5 is classified as not having diabetes.
- The AUC is 0.8378. The difference between the AUC and the accuracy from the confusion matrix is less than 10%, as needed.
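
The accuracy and AUC figures above can be illustrated with a minimal base R sketch. The vectors `prob` and `actual` are hypothetical toy values, not the report's train data, and the rank-sum formula for the AUC is one common way to compute it rather than necessarily the method used in this report.

```r
# Hypothetical toy data: predicted probabilities and true 0/1 labels.
prob   <- c(0.10, 0.35, 0.40, 0.62, 0.80, 0.55, 0.20, 0.90)
actual <- c(0,    0,    1,    1,    1,    0,    0,    1)

# Apply the 0.5 cut-off: probabilities below 0.5 are labelled 0, otherwise 1.
predicted <- ifelse(prob < 0.5, 0, 1)

# Confusion matrix (rows = actual class, columns = predicted class).
conf_mat <- table(Actual    = factor(actual,    levels = c(0, 1)),
                  Predicted = factor(predicted, levels = c(0, 1)))

# Accuracy is the share of observations on the diagonal.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a
# randomly chosen positive case is ranked above a randomly chosen negative one.
r   <- rank(prob)
n1  <- sum(actual == 1)
n0  <- sum(actual == 0)
auc <- (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(conf_mat)
cat("Accuracy:", accuracy, "\n")
cat("AUC:", auc, "\n")
```

Accuracy depends on the chosen cut-off, while the AUC summarises ranking quality over all cut-offs, which is why the two numbers need not coincide.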

With respect to Test Data:
- Test data was used to predict whether a patient has diabetes.
- A confusion matrix was created with the cut-off kept at 50%, i.e. any observation with a predicted probability below 0.5 is classified as not having diabetes. The model showed an accuracy of 0.7794.
- Finally, a table indicating the Outcome as per the data and the Predicted Outcome was shown.
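
As a small illustration of how the final table relates `Result` to `PredictedOutcome`, the sketch below re-applies the 0.5 cut-off to the first ten rows printed above. The `Result` and `Outcome` values are copied from that table; the use of `ifelse` is an assumption about how the labels were derived, not the report's confirmed code.

```r
# Predicted probabilities (Result) for test rows 1-10, copied from the table above.
result <- c(0.29749244, 0.30189651, 0.66026773, 0.59821566, 0.12424247,
            0.16798839, 0.08638919, 0.32567987, 0.73934182, 0.21092502)

# Recorded Outcome labels for the same ten rows.
outcome <- c("Does not Have Diabetes", "Have Diabetes", "Have Diabetes",
             "Does not Have Diabetes", "Does not Have Diabetes",
             "Does not Have Diabetes", "Have Diabetes",
             "Does not Have Diabetes", "Have Diabetes", "Have Diabetes")

# Assumed labelling rule: a probability below 0.5 means no diabetes.
predicted_outcome <- ifelse(result < 0.5,
                            "Does not Have Diabetes",
                            "Have Diabetes")

comparison <- data.frame(Result           = result,
                         Outcome          = outcome,
                         PredictedOutcome = predicted_outcome)
print(comparison)

# Share of these ten rows where the prediction agrees with the recorded outcome.
agreement <- mean(comparison$Outcome == comparison$PredictedOutcome)
cat("Agreement on rows 1-10:", agreement, "\n")
```

The agreement on a ten-row subset will differ from the full test accuracy of 0.7794, which is computed over all 68 observations.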

###################################### END OF REPORT ######################################################