Multi Logistic Regression Assignment 4

Problem Definition
Use Diabetes.csv to create a logistic model Use Diabetes.csv and find out the Diabetes Patient amongst the data Make the Model on train data and test the same on test data.

Data Description
A data frame with 768 observations on 9 variables
Patient ID: serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)

Assumption
In Outcome Variable of data, 1 is taken as the patients who are having Diebetes while 0 means no Diabetes.

Setup

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(corrgram)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(Deducer)

## Loading required package: JGR

## Loading required package: rJava

## Loading required package: JavaGD

## Loading required package: iplots

## 
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.

## Loading required package: car

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## 
## Note Non-JGR console detected:
##  Deducer is best used from within JGR (http://jgr.markushelbig.org/).
##  To Bring up GUI dialogs, type deducer().

library(caret)

## Loading required package: lattice

library(pscl)

## Classes and Methods for R developed in the

## Political Science Computational Laboratory

## Department of Political Science

## Stanford University

## Simon Jackman

## hurdle and zeroinfl functions by Achim Zeileis

library(mice)

## 
## Attaching package: 'mice'

## The following object is masked from 'package:tidyr':
## 
##     complete

Functions

detect_outliers <- function(inp, na.rm=TRUE) {
  i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
  i.max <- 2.5 * IQR(inp, na.rm=na.rm)
  otp <- inp
  otp[inp < (i.qnt[1] - i.max)] <- NA
  otp[inp > (i.qnt[2] + i.max)] <- NA
  inp[is.na(otp)]
}

detect_na <- function(inp) {
  sum(is.na(inp))
}

Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
            geom_boxplot(aes(fill=input), color="green") +
            labs(title="Outliers")
Plot
}

Dataset

setwd("D:/Welingkar/Trim 4/Machine Learning/Assignment/Assignment 4")
dfrModel <- read.csv("./Diabetes.csv", header=T, stringsAsFactors=F)
head(dfrModel)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Observation
There is Numeric data only
As no Alphanumeric Column so no need to do any changes here or no need to drop any column as of now.

Outliers Data

#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)

## $Pregnancies
## integer(0)
## 
## $Glucose
## integer(0)
## 
## $BloodPressure
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 
## $SkinThickness
## integer(0)
## 
## $Insulin
##  [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480 510
## 
## $BMI
##  [1]  0.0  0.0  0.0  0.0  0.0 67.1  0.0  0.0  0.0  0.0  0.0  0.0
## 
## $DiabetesPedigreeFunction
##  [1] 2.288 1.893 1.781 2.329 2.137 1.731 1.600 2.420 1.699 1.698
## 
## $Age
## integer(0)
## 
## $Outcome
## integer(0)

Outliers Graph

lapply(dfrModel, FUN=Graph_Boxplot)

## $Pregnancies

## 
## $Glucose

## 
## $BloodPressure

## 
## $SkinThickness

## 
## $Insulin

## 
## $BMI

## 
## $DiabetesPedigreeFunction

## 
## $Age

## 
## $Outcome

Observations
There are few outliers.
We will go with Outliers

Train and test Data

dim(dfrModel)

## [1] 768   9

#Sample Indexes
indexes = sample(1:nrow(dfrModel), size=0.1*nrow(dfrModel))
class(indexes)

## [1] "integer"

# Split data
dfrModel_test = dfrModel[indexes,]
dim(dfrModel_test)

## [1] 76  9

dfrModel_train = dfrModel[-indexes,]
dim(dfrModel_train)

## [1] 692   9

Observations
Data has been divided in Train and Test data
There are 768 rows in the main data
10% data is in test data while 90% data is in train data to create model.
Now 76 data records in Test data while 692 data records in Train data.

Missing Data

lapply(dfrModel_train, FUN=detect_na)

## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BloodPressure
## [1] 0
## 
## $SkinThickness
## [1] 0
## 
## $Insulin
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

Observation
There is no data in the dataset with values as NA

dfrColImpute <- colnames(dfrModel_train)[!colnames(dfrModel_train) %in% c("Pregnancies", "Outcome")]
dfr_data <- dfrModel_train[dfrColImpute] == 0
dfrModel_train[dfrColImpute][dfr_data] <- NA
head(dfrModel_train)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35      NA 33.6
## 3           8     183            64            NA      NA 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74            NA      NA 25.6
## 7           3      78            50            32      88 31.0
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
## 7                    0.248  26       1

Observations
All the data records where data value was given properly (Zero), This has been changed to “NA” except column Pregnancies and Outcome as they can be Zero.

Missing Data

#sum(is.na(dfrModel$Age))
lapply(dfrModel_train, FUN=detect_na)

## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 5
## 
## $BloodPressure
## [1] 31
## 
## $SkinThickness
## [1] 208
## 
## $Insulin
## [1] 337
## 
## $BMI
## [1] 11
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Age
## [1] 0
## 
## $Outcome
## [1] 0

Observations
Now we can see that there are many records for which data is not available.

Summary

#summary(dfrModel)
lapply(dfrModel_train, FUN=summary)

## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.762   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    56.0    99.0   117.0   121.9   141.0   199.0       5 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   24.00   64.00   72.00   72.42   80.00  122.00      31 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    7.00   21.00   29.00   29.22   36.25   99.00     208 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    14.0    76.0   120.0   153.9   190.0   846.0     337 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   18.20   27.60   32.40   32.53   36.80   67.10      11 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2387  0.3745  0.4749  0.6325  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.18   41.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3425  1.0000  1.0000

Data imputation

dfrModel_train$Glucose[is.na(dfrModel_train$Glucose)] <- median(dfrModel_train$Glucose,na.rm = T)
dfrModel_train$BloodPressure[is.na(dfrModel_train$BloodPressure)] <- median(dfrModel_train$BloodPressure,na.rm = T)
#dfrModel_train$SkinThickness[is.na(dfrModel_train$SkinThickness)] <- mean(dfrModel_train$SkinThickness,na.rm = T)
#dfrModel_train$Insulin[is.na(dfrModel_train$Insulin)] <- median(dfrModel_train$Insulin,na.rm = T)
mice_mod <- mice(dfrModel_train[, c("SkinThickness","Insulin")], method='rf')

## 
##  iter imp variable
##   1   1  SkinThickness  Insulin
##   1   2  SkinThickness  Insulin
##   1   3  SkinThickness  Insulin
##   1   4  SkinThickness  Insulin
##   1   5  SkinThickness  Insulin
##   2   1  SkinThickness  Insulin
##   2   2  SkinThickness  Insulin
##   2   3  SkinThickness  Insulin
##   2   4  SkinThickness  Insulin
##   2   5  SkinThickness  Insulin
##   3   1  SkinThickness  Insulin
##   3   2  SkinThickness  Insulin
##   3   3  SkinThickness  Insulin
##   3   4  SkinThickness  Insulin
##   3   5  SkinThickness  Insulin
##   4   1  SkinThickness  Insulin
##   4   2  SkinThickness  Insulin
##   4   3  SkinThickness  Insulin
##   4   4  SkinThickness  Insulin
##   4   5  SkinThickness  Insulin
##   5   1  SkinThickness  Insulin
##   5   2  SkinThickness  Insulin
##   5   3  SkinThickness  Insulin
##   5   4  SkinThickness  Insulin
##   5   5  SkinThickness  Insulin

mice_complete <- complete(mice_mod)
dfrModel_train$SkinThickness <- mice_complete$SkinThickness
dfrModel_train$Insulin <- mice_complete$Insulin
dfrModel_train$BMI[is.na(dfrModel_train$BMI)] <- median(dfrModel_train$BMI,na.rm = T)
head(dfrModel_train)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35     194 33.6
## 3           8     183            64            32     165 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74            23     310 25.6
## 7           3      78            50            32      88 31.0
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
## 7                    0.248  26       1

Observations
Data Imputation is done for all the column except Pregnancies and Outcome as other columns have many values as 0 which is important to be imputed.
Median and Mice Random factor is used to do the data Imputation.
For column where outliers are less, Median imputation is used while for others Mice random factor technique is used for Data Imputation.

Summary

#summary(dfrModel)
lapply(dfrModel_train, FUN=summary)

## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.762   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    56.0    99.0   117.0   121.8   141.0   199.0 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0    64.0    72.0    72.4    80.0   122.0 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   22.00   29.00   29.14   36.00   99.00 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    74.0   115.0   147.6   182.2   846.0 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.20   27.60   32.40   32.53   36.60   67.10 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2387  0.3745  0.4749  0.6325  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.18   41.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3425  1.0000  1.0000

Observations
We can see that now Median and Mean are more close before the data computation which is good for the result as Outliers will be less.

Outliers Data

#detect_outliers(dfrModel$Age)
lapply(dfrModel_train, FUN=detect_outliers)

## $Pregnancies
## integer(0)
## 
## $Glucose
## integer(0)
## 
## $BloodPressure
## [1] 122
## 
## $SkinThickness
## [1] 99
## 
## $Insulin
##  [1] 543 846 485 495 485 545 744 478 744 680 545 495 465 579 474 744 480
## [18] 600 846 540 480 540
## 
## $BMI
## [1] 67.1 59.4
## 
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
## 
## $Age
## integer(0)
## 
## $Outcome
## integer(0)

Observation
There are only few outliers so we will continue with outliers in the data.
As well as Ouliers have been reduced after data imputation which will help to get the better results.

Correlation

vctCorr = numeric(0)
for (i in names(dfrModel_train)){
    cor.result <- cor(as.numeric(dfrModel_train$Outcome), as.numeric(dfrModel_train[,i]))
    vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel_train)
dfrCorr

##              Pregnancies                  Glucose            BloodPressure 
##                0.2171568                0.4970330                0.1792162 
##            SkinThickness                  Insulin                      BMI 
##                0.1519821                0.1599135                0.3100974 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                0.1795693                0.2422029                1.0000000

Data For Visualization

dfrGraph <- gather(dfrModel_train, variable, value, -Outcome)
head(dfrGraph)

##   Outcome    variable value
## 1       1 Pregnancies     6
## 2       1 Pregnancies     8
## 3       0 Pregnancies     1
## 4       1 Pregnancies     0
## 5       0 Pregnancies     5
## 6       1 Pregnancies     3

Data Visualization

ggplot(dfrGraph) +
    geom_jitter(aes(value,Outcome, colour=variable)) + 
    facet_wrap(~variable, scales="free_x") +
    labs(title="Relation Of Diabetes [atient] With Other Features")

Observation
As per graph, There is some correlation between Outcome variable and other variables.

Summary

lapply(dfrModel_train, FUN=summary)

## $Pregnancies
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   3.000   3.762   6.000  17.000 
## 
## $Glucose
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    56.0    99.0   117.0   121.8   141.0   199.0 
## 
## $BloodPressure
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0    64.0    72.0    72.4    80.0   122.0 
## 
## $SkinThickness
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   22.00   29.00   29.14   36.00   99.00 
## 
## $Insulin
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    14.0    74.0   115.0   147.6   182.2   846.0 
## 
## $BMI
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.20   27.60   32.40   32.53   36.60   67.10 
## 
## $DiabetesPedigreeFunction
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0780  0.2387  0.3745  0.4749  0.6325  2.4200 
## 
## $Age
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   24.00   29.00   33.18   41.00   81.00 
## 
## $Outcome
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.3425  1.0000  1.0000

Observation
Mean and Median are nearly equal after doing data imputation whicih help to reduce Outliers.
Data summary is looking good we can continue with logistic model

Find Best Multi Logistic Model
Choose the best logistic model by using step().

stpModel=step(glm(data=dfrModel_train, formula=Outcome~., family=binomial), trace=0, steps=100)
summary(stpModel)

## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + 
##     Age, family = binomial, data = dfrModel_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7383  -0.7104  -0.3973   0.7008   2.4514  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.284179   0.764407 -12.146  < 2e-16 ***
## Pregnancies               0.108253   0.033331   3.248  0.00116 ** 
## Glucose                   0.035449   0.003711   9.553  < 2e-16 ***
## BMI                       0.086259   0.015421   5.594 2.22e-08 ***
## DiabetesPedigreeFunction  0.861598   0.304723   2.827  0.00469 ** 
## Age                       0.013568   0.009535   1.423  0.15471    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 889.46  on 691  degrees of freedom
## Residual deviance: 641.64  on 686  degrees of freedom
## AIC: 653.64
## 
## Number of Fisher Scoring iterations: 5

Observation
Best results given by Survived ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
The degrees of Freedom for Null Variance is 692-1=691 when only outcome is considered so only one parameter
While for Reidual Deviance is 692-5=687 As no of parameters are 5 for residual deviance.

Make Final Multi Linear Model

# make model
mgmModel <- glm(data=dfrModel_train, formula=Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)

## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
##     family = binomial(link = "logit"), data = dfrModel_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7872  -0.7169  -0.4066   0.6939   2.4351  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.99192    0.72902 -12.334  < 2e-16 ***
## Pregnancies               0.13268    0.02880   4.606 4.10e-06 ***
## Glucose                   0.03655    0.00365  10.012  < 2e-16 ***
## BMI                       0.08437    0.01537   5.490 4.03e-08 ***
## DiabetesPedigreeFunction  0.87105    0.30408   2.865  0.00418 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 889.46  on 691  degrees of freedom
## Residual deviance: 643.65  on 687  degrees of freedom
## AIC: 653.65
## 
## Number of Fisher Scoring iterations: 5

pR2 = 1 - mgmModel$deviance / mgmModel$null.deviance
pR2

## [1] 0.2763553

mgmModel_null <- glm(dfrModel_train$Outcome~1, family = binomial, data = dfrModel_train)
pR21= 1- logLik(mgmModel)/logLik(mgmModel_null)
pR21

## 'log Lik.' 0.2763553 (df=5)

library(rcompanion)
nagelkerke(mgmModel)

## $Models
##                                                                                                                           
## Model: "glm, Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, binomial(link = \"logit\"), dfrModel_train"
## Null:  "glm, Outcome ~ 1, binomial(link = \"logit\"), dfrModel_train"                                                     
## 
## $Pseudo.R.squared.for.model.vs.null
##                              Pseudo.R.squared
## McFadden                             0.276355
## Cox and Snell (ML)                   0.298974
## Nagelkerke (Cragg and Uhler)         0.413266
## 
## $Likelihood.ratio.test
##  Df.diff LogLik.diff  Chisq    p.value
##       -4      -122.9 245.81 5.2122e-52
## 
## $Number.of.observations
##           
## Model: 692
## Null:  692
## 
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
## 
## $Warnings
## [1] "None"

pR2(mgmModel)

##          llh      llhNull           G2     McFadden         r2ML 
## -321.8249793 -444.7279022  245.8058458    0.2763553    0.2989743 
##         r2CU 
##    0.4132655

Confusion Matrix

prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel_train$Outcome)
confusionMatrix(cnfmtrx)

## Confusion Matrix and Statistics
## 
##    act
## prd   0   1
##   0 403 107
##   1  52 130
##                                          
##                Accuracy : 0.7702         
##                  95% CI : (0.737, 0.8011)
##     No Information Rate : 0.6575         
##     P-Value [Acc > NIR] : 7.182e-11      
##                                          
##                   Kappa : 0.4598         
##  Mcnemar's Test P-Value : 1.848e-05      
##                                          
##             Sensitivity : 0.8857         
##             Specificity : 0.5485         
##          Pos Pred Value : 0.7902         
##          Neg Pred Value : 0.7143         
##              Prevalence : 0.6575         
##          Detection Rate : 0.5824         
##    Detection Prevalence : 0.7370         
##       Balanced Accuracy : 0.7171         
##                                          
##        'Positive' Class : 0              
##

Observations
Accuracy is good which is around 0.77

Regression Data

dfrPlot <- mutate(dfrModel_train, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)

##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35     194 33.6
## 2           8     183            64            32     165 23.3
## 3           1      89            66            23      94 28.1
## 4           0     137            40            35     168 43.1
## 5           5     116            74            23     310 25.6
## 6           3      78            50            32      88 31.0
##   DiabetesPedigreeFunction Age Outcome     PrdVal POutcome
## 1                    0.627  50       1 0.64426718        1
## 2                    0.672  32       1 0.78728773        1
## 3                    0.167  21       0 0.04350404        0
## 4                    2.288  33       1 0.83809734        1
## 5                    0.201  30       0 0.14751314        0
## 6                    0.248  26       1 0.05157047        0

Regression Visulaization

#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome))  + 
    geom_point(shape=19, colour="blue", fill="blue") +
    geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
    labs(title="Binomial Regression Curve") +
    labs(x="") +
    labs(y="")

ROC Visulaization

#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)

Observations
The closer the ROC gets to the optimal point of perfect prediction the closer the AUC gets to 1.
We can see that AUC value is around 0.84 as well as Accuracy of our model is around 0.77 which are quite equal so Model is good.

Test Data

names(dfrModel_test)[names(dfrModel_test) == "Outcome"] <- "Act_Outcome"
dfrTests <- dplyr::select(dfrModel_test, -c(BloodPressure,SkinThickness,Insulin,Age))
head(dfrTests)

##     Pregnancies Glucose  BMI DiabetesPedigreeFunction Act_Outcome
## 713          10     129 41.2                    0.441           1
## 614           6     105 32.5                    0.878           0
## 173           2      87 28.9                    0.773           0
## 452           2     134 28.9                    0.542           1
## 356           9     165 30.4                    0.302           1
## 404           9      72 31.6                    0.280           0

Data Cleaning

dfrColImpute <- colnames(dfrTests)[!colnames(dfrTests) %in% c("Pregnancies", "Act_Outcome")]
dfr_data <- dfrTests[dfrColImpute] == 0
dfrTests[dfrColImpute][dfr_data] <- NA
head(dfrTests)

##     Pregnancies Glucose  BMI DiabetesPedigreeFunction Act_Outcome
## 713          10     129 41.2                    0.441           1
## 614           6     105 32.5                    0.878           0
## 173           2      87 28.9                    0.773           0
## 452           2     134 28.9                    0.542           1
## 356           9     165 30.4                    0.302           1
## 404           9      72 31.6                    0.280           0

Missing Data

#sum(is.na(dfrModel$Age))
lapply(dfrTests, FUN=detect_na)

## $Pregnancies
## [1] 0
## 
## $Glucose
## [1] 0
## 
## $BMI
## [1] 0
## 
## $DiabetesPedigreeFunction
## [1] 0
## 
## $Act_Outcome
## [1] 0

Test Data imputation

dfrTests$Glucose[is.na(dfrTests$Glucose)] <- median(dfrTests$Glucose,na.rm = T)
dfrTests$BMI[is.na(dfrTests$BMI)] <- median(dfrTests$BMI,na.rm = T)
head(dfrTests)

##     Pregnancies Glucose  BMI DiabetesPedigreeFunction Act_Outcome
## 713          10     129 41.2                    0.441           1
## 614           6     105 32.5                    0.878           0
## 173           2      87 28.9                    0.773           0
## 452           2     134 28.9                    0.542           1
## 356           9     165 30.4                    0.302           1
## 404           9      72 31.6                    0.280           0

Observation
Test Data successfully created.
Data Imputation is not required in Test data but as we took test data from file only so need to impute data wherever is missing
In actual it is not required

Predict

resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
cnfmtrx1 <- table(prd=prdSur, act=dfrModel_test$Act_Outcome)
length(dfrModel_test$Prd_Outcome)

## [1] 0

length(dfrModel_test$Act_Outcome)

## [1] 76

confusionMatrix(cnfmtrx1)

## Confusion Matrix and Statistics
## 
##    act
## prd  0  1
##   0 43 15
##   1  2 16
##                                          
##                Accuracy : 0.7763         
##                  95% CI : (0.6662, 0.864)
##     No Information Rate : 0.5921         
##     P-Value [Acc > NIR] : 0.0005624      
##                                          
##                   Kappa : 0.5046         
##  Mcnemar's Test P-Value : 0.0036093      
##                                          
##             Sensitivity : 0.9556         
##             Specificity : 0.5161         
##          Pos Pred Value : 0.7414         
##          Neg Pred Value : 0.8889         
##              Prevalence : 0.5921         
##          Detection Rate : 0.5658         
##    Detection Prevalence : 0.7632         
##       Balanced Accuracy : 0.7358         
##                                          
##        'Positive' Class : 0              
##

Observations
Accuracy is very good, it is around 0.77

Test Data Confusion Matrix

resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
prdSur <- as.factor(prdSur)
levels(prdSur) <- c("Not Have Diabetes", "Have Diabetes")
dfrTests <- mutate(dfrTests, Result=resVal, Prd_Outcome=prdSur)
dfrTests$Act_Outcome <- as.factor(dfrTests$Act_Outcome)
levels(dfrTests$Act_Outcome) <- c("Not Have Diabetes", "Have Diabetes")
head(dfrTests)

##   Pregnancies Glucose  BMI DiabetesPedigreeFunction       Act_Outcome
## 1          10     129 41.2                    0.441     Have Diabetes
## 2           6     105 32.5                    0.878 Not Have Diabetes
## 3           2      87 28.9                    0.773 Not Have Diabetes
## 4           2     134 28.9                    0.542     Have Diabetes
## 5           9     165 30.4                    0.302     Have Diabetes
## 6           9      72 31.6                    0.280 Not Have Diabetes
##       Result       Prd_Outcome
## 1 0.71288457     Have Diabetes
## 2 0.29906740 Not Have Diabetes
## 3 0.08050658 Not Have Diabetes
## 4 0.28516116 Not Have Diabetes
## 5 0.74272945     Have Diabetes
## 6 0.09478485 Not Have Diabetes

Summary

There was total data set of 768 records of Diabetes patients, It contains their details along with they are having diabetes or not.
Data Set is divided in two part one is Train data for building a model while other one is Test data to test the model.
In train data there is 90% of data which 692 records while in Test data there is 10% of data which is 76 records.
Prediction value is kept at 0.6 as it was giving better accuracy for data.

Model:
Data Imputation is done for all the column except Pregnancies and Outcome as other columns have many values as 0 which is important to be imputed.
Median and Mice Random factor is used to do the data Imputation.
For column where outliers are less, Median imputation is used while for others Mice random factor technique is used for Data Imputation.
Model is created by all the variables taken in account but only four variables are significant enough to give good result:
Outcome~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction Accuracy of Train Data is: 0.77 AUC Value of ROC Curve: 0.84 As no much difference between Accuracy and AUC value so it is a good model.

Data Testing
After Creating the model Data is tested on test data.
Same prediction value is used to get Diabetes patients.
Accuracy of Confusion matrix is around 0.77

Data is tested successfully.

Multi Logistic Regression Assignment 4

Monesh Sharma

July 29, 2017

Summary