Problem Definition
Use Diabetes.csv to create a logistic model Use Diabetes.csv and find out the Diabetes Patient amongst the data Make the Model on train data and test the same on test data.
Data Description
A data frame with 768 observations on 9 variables
Patient ID: serial number for the patient
Pregnancies: Number of times pregnant
Glucose: Plasma glucose concentration a 2 hours in an oral glucose tolerance test
BloodPressure: Diastolic blood pressure (mm Hg)
SkinThickness: Triceps skin fold thickness (mm)
Insulin: 2-Hour serum insulin (mu U/ml)
BMI: Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction: Diabetes pedigree function
Age: Age (years)
Outcome: Class variable (0 or 1)
Assumption
In Outcome Variable of data, 1 is taken as the patients who are having Diebetes while 0 means no Diabetes.
Setup
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrgram)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(Deducer)
## Loading required package: JGR
## Loading required package: rJava
## Loading required package: JavaGD
## Loading required package: iplots
##
## Please type JGR() to launch console. Platform specific launchers (.exe and .app) can also be obtained at http://www.rforge.net/JGR/files/.
## Loading required package: car
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
##
## Note Non-JGR console detected:
## Deducer is best used from within JGR (http://jgr.markushelbig.org/).
## To Bring up GUI dialogs, type deducer().
library(caret)
## Loading required package: lattice
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
library(mice)
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
Functions
detect_outliers <- function(inp, na.rm=TRUE) {
i.qnt <- quantile(inp, probs=c(.25, .75), na.rm=na.rm)
i.max <- 2.5 * IQR(inp, na.rm=na.rm)
otp <- inp
otp[inp < (i.qnt[1] - i.max)] <- NA
otp[inp > (i.qnt[2] + i.max)] <- NA
inp[is.na(otp)]
}
detect_na <- function(inp) {
sum(is.na(inp))
}
Graph_Boxplot <- function (input, na.rm = TRUE){
Plot <- ggplot(dfrModel, aes(x="", y=input)) +
geom_boxplot(aes(fill=input), color="green") +
labs(title="Outliers")
Plot
}
Dataset
setwd("D:/Welingkar/Trim 4/Machine Learning/Assignment/Assignment 4")
dfrModel <- read.csv("./Diabetes.csv", header=T, stringsAsFactors=F)
head(dfrModel)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Observation
There is Numeric data only
As no Alphanumeric Column so no need to do any changes here or no need to drop any column as of now.
Outliers Data
#detect_outliers(dfrModel$Age)
lapply(dfrModel, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##
## $SkinThickness
## integer(0)
##
## $Insulin
## [1] 543 846 495 485 495 478 744 680 545 465 579 474 480 600 540 480 510
##
## $BMI
## [1] 0.0 0.0 0.0 0.0 0.0 67.1 0.0 0.0 0.0 0.0 0.0 0.0
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 1.600 2.420 1.699 1.698
##
## $Age
## integer(0)
##
## $Outcome
## integer(0)
Outliers Graph
lapply(dfrModel, FUN=Graph_Boxplot)
## $Pregnancies
##
## $Glucose
##
## $BloodPressure
##
## $SkinThickness
##
## $Insulin
##
## $BMI
##
## $DiabetesPedigreeFunction
##
## $Age
##
## $Outcome
Observations
There are few outliers.
We will go with Outliers
Train and test Data
dim(dfrModel)
## [1] 768 9
#Sample Indexes
indexes = sample(1:nrow(dfrModel), size=0.1*nrow(dfrModel))
class(indexes)
## [1] "integer"
# Split data
dfrModel_test = dfrModel[indexes,]
dim(dfrModel_test)
## [1] 76 9
dfrModel_train = dfrModel[-indexes,]
dim(dfrModel_train)
## [1] 692 9
Observations
Data has been divided in Train and Test data
There are 768 rows in the main data
10% data is in test data while 90% data is in train data to create model.
Now 76 data records in Test data while 692 data records in Train data.
Missing Data
lapply(dfrModel_train, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BloodPressure
## [1] 0
##
## $SkinThickness
## [1] 0
##
## $Insulin
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observation
There is no data in the dataset with values as NA
dfrColImpute <- colnames(dfrModel_train)[!colnames(dfrModel_train) %in% c("Pregnancies", "Outcome")]
dfr_data <- dfrModel_train[dfrColImpute] == 0
dfrModel_train[dfrColImpute][dfr_data] <- NA
head(dfrModel_train)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 NA 33.6
## 3 8 183 64 NA NA 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 NA NA 25.6
## 7 3 78 50 32 88 31.0
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
## 7 0.248 26 1
Observations
All the data records where data value was given properly (Zero), This has been changed to “NA” except column Pregnancies and Outcome as they can be Zero.
Missing Data
#sum(is.na(dfrModel$Age))
lapply(dfrModel_train, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 5
##
## $BloodPressure
## [1] 31
##
## $SkinThickness
## [1] 208
##
## $Insulin
## [1] 337
##
## $BMI
## [1] 11
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Age
## [1] 0
##
## $Outcome
## [1] 0
Observations
Now we can see that there are many records for which data is not available.
Summary
#summary(dfrModel)
lapply(dfrModel_train, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.762 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 56.0 99.0 117.0 121.9 141.0 199.0 5
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 24.00 64.00 72.00 72.42 80.00 122.00 31
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 7.00 21.00 29.00 29.22 36.25 99.00 208
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 14.0 76.0 120.0 153.9 190.0 846.0 337
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 18.20 27.60 32.40 32.53 36.80 67.10 11
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2387 0.3745 0.4749 0.6325 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.18 41.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3425 1.0000 1.0000
Data imputation
dfrModel_train$Glucose[is.na(dfrModel_train$Glucose)] <- median(dfrModel_train$Glucose,na.rm = T)
dfrModel_train$BloodPressure[is.na(dfrModel_train$BloodPressure)] <- median(dfrModel_train$BloodPressure,na.rm = T)
#dfrModel_train$SkinThickness[is.na(dfrModel_train$SkinThickness)] <- mean(dfrModel_train$SkinThickness,na.rm = T)
#dfrModel_train$Insulin[is.na(dfrModel_train$Insulin)] <- median(dfrModel_train$Insulin,na.rm = T)
mice_mod <- mice(dfrModel_train[, c("SkinThickness","Insulin")], method='rf')
##
## iter imp variable
## 1 1 SkinThickness Insulin
## 1 2 SkinThickness Insulin
## 1 3 SkinThickness Insulin
## 1 4 SkinThickness Insulin
## 1 5 SkinThickness Insulin
## 2 1 SkinThickness Insulin
## 2 2 SkinThickness Insulin
## 2 3 SkinThickness Insulin
## 2 4 SkinThickness Insulin
## 2 5 SkinThickness Insulin
## 3 1 SkinThickness Insulin
## 3 2 SkinThickness Insulin
## 3 3 SkinThickness Insulin
## 3 4 SkinThickness Insulin
## 3 5 SkinThickness Insulin
## 4 1 SkinThickness Insulin
## 4 2 SkinThickness Insulin
## 4 3 SkinThickness Insulin
## 4 4 SkinThickness Insulin
## 4 5 SkinThickness Insulin
## 5 1 SkinThickness Insulin
## 5 2 SkinThickness Insulin
## 5 3 SkinThickness Insulin
## 5 4 SkinThickness Insulin
## 5 5 SkinThickness Insulin
mice_complete <- complete(mice_mod)
dfrModel_train$SkinThickness <- mice_complete$SkinThickness
dfrModel_train$Insulin <- mice_complete$Insulin
dfrModel_train$BMI[is.na(dfrModel_train$BMI)] <- median(dfrModel_train$BMI,na.rm = T)
head(dfrModel_train)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 194 33.6
## 3 8 183 64 32 165 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 23 310 25.6
## 7 3 78 50 32 88 31.0
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
## 7 0.248 26 1
Observations
Data Imputation is done for all the column except Pregnancies and Outcome as other columns have many values as 0 which is important to be imputed.
Median and Mice Random factor is used to do the data Imputation.
For column where outliers are less, Median imputation is used while for others Mice random factor technique is used for Data Imputation.
Summary
#summary(dfrModel)
lapply(dfrModel_train, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.762 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.0 99.0 117.0 121.8 141.0 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 64.0 72.0 72.4 80.0 122.0
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 22.00 29.00 29.14 36.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 74.0 115.0 147.6 182.2 846.0
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.20 27.60 32.40 32.53 36.60 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2387 0.3745 0.4749 0.6325 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.18 41.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3425 1.0000 1.0000
Observations
We can see that now Median and Mean are more close before the data computation which is good for the result as Outliers will be less.
Outliers Data
#detect_outliers(dfrModel$Age)
lapply(dfrModel_train, FUN=detect_outliers)
## $Pregnancies
## integer(0)
##
## $Glucose
## integer(0)
##
## $BloodPressure
## [1] 122
##
## $SkinThickness
## [1] 99
##
## $Insulin
## [1] 543 846 485 495 485 545 744 478 744 680 545 495 465 579 474 744 480
## [18] 600 846 540 480 540
##
## $BMI
## [1] 67.1 59.4
##
## $DiabetesPedigreeFunction
## [1] 2.288 1.893 1.781 2.329 2.137 1.731 2.420 1.699 1.698
##
## $Age
## integer(0)
##
## $Outcome
## integer(0)
Observation
There are only few outliers so we will continue with outliers in the data.
As well as Ouliers have been reduced after data imputation which will help to get the better results.
Correlation
vctCorr = numeric(0)
for (i in names(dfrModel_train)){
cor.result <- cor(as.numeric(dfrModel_train$Outcome), as.numeric(dfrModel_train[,i]))
vctCorr <- c(vctCorr, cor.result)
}
dfrCorr <- vctCorr
names(dfrCorr) <- names(dfrModel_train)
dfrCorr
## Pregnancies Glucose BloodPressure
## 0.2171568 0.4970330 0.1792162
## SkinThickness Insulin BMI
## 0.1519821 0.1599135 0.3100974
## DiabetesPedigreeFunction Age Outcome
## 0.1795693 0.2422029 1.0000000
Data For Visualization
dfrGraph <- gather(dfrModel_train, variable, value, -Outcome)
head(dfrGraph)
## Outcome variable value
## 1 1 Pregnancies 6
## 2 1 Pregnancies 8
## 3 0 Pregnancies 1
## 4 1 Pregnancies 0
## 5 0 Pregnancies 5
## 6 1 Pregnancies 3
Data Visualization
ggplot(dfrGraph) +
geom_jitter(aes(value,Outcome, colour=variable)) +
facet_wrap(~variable, scales="free_x") +
labs(title="Relation Of Diabetes [atient] With Other Features")
Observation
As per graph, There is some correlation between Outcome variable and other variables.
Summary
lapply(dfrModel_train, FUN=summary)
## $Pregnancies
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 3.762 6.000 17.000
##
## $Glucose
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 56.0 99.0 117.0 121.8 141.0 199.0
##
## $BloodPressure
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.0 64.0 72.0 72.4 80.0 122.0
##
## $SkinThickness
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7.00 22.00 29.00 29.14 36.00 99.00
##
## $Insulin
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.0 74.0 115.0 147.6 182.2 846.0
##
## $BMI
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.20 27.60 32.40 32.53 36.60 67.10
##
## $DiabetesPedigreeFunction
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0780 0.2387 0.3745 0.4749 0.6325 2.4200
##
## $Age
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.00 24.00 29.00 33.18 41.00 81.00
##
## $Outcome
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3425 1.0000 1.0000
Observation
Mean and Median are nearly equal after doing data imputation whicih help to reduce Outliers.
Data summary is looking good we can continue with logistic model
Find Best Multi Logistic Model
Choose the best logistic model by using step().
stpModel=step(glm(data=dfrModel_train, formula=Outcome~., family=binomial), trace=0, steps=100)
summary(stpModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction +
## Age, family = binomial, data = dfrModel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7383 -0.7104 -0.3973 0.7008 2.4514
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.284179 0.764407 -12.146 < 2e-16 ***
## Pregnancies 0.108253 0.033331 3.248 0.00116 **
## Glucose 0.035449 0.003711 9.553 < 2e-16 ***
## BMI 0.086259 0.015421 5.594 2.22e-08 ***
## DiabetesPedigreeFunction 0.861598 0.304723 2.827 0.00469 **
## Age 0.013568 0.009535 1.423 0.15471
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.46 on 691 degrees of freedom
## Residual deviance: 641.64 on 686 degrees of freedom
## AIC: 653.64
##
## Number of Fisher Scoring iterations: 5
Observation
Best results given by Survived ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
The degrees of Freedom for Null Variance is 692-1=691 when only outcome is considered so only one parameter
While for Reidual Deviance is 692-5=687 As no of parameters are 5 for residual deviance.
Make Final Multi Linear Model
# make model
mgmModel <- glm(data=dfrModel_train, formula=Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, family=binomial(link="logit"))
# print summary
summary(mgmModel)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
## family = binomial(link = "logit"), data = dfrModel_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7872 -0.7169 -0.4066 0.6939 2.4351
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.99192 0.72902 -12.334 < 2e-16 ***
## Pregnancies 0.13268 0.02880 4.606 4.10e-06 ***
## Glucose 0.03655 0.00365 10.012 < 2e-16 ***
## BMI 0.08437 0.01537 5.490 4.03e-08 ***
## DiabetesPedigreeFunction 0.87105 0.30408 2.865 0.00418 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 889.46 on 691 degrees of freedom
## Residual deviance: 643.65 on 687 degrees of freedom
## AIC: 653.65
##
## Number of Fisher Scoring iterations: 5
pR2 = 1 - mgmModel$deviance / mgmModel$null.deviance
pR2
## [1] 0.2763553
mgmModel_null <- glm(dfrModel_train$Outcome~1, family = binomial, data = dfrModel_train)
pR21= 1- logLik(mgmModel)/logLik(mgmModel_null)
pR21
## 'log Lik.' 0.2763553 (df=5)
library(rcompanion)
nagelkerke(mgmModel)
## $Models
##
## Model: "glm, Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, binomial(link = \"logit\"), dfrModel_train"
## Null: "glm, Outcome ~ 1, binomial(link = \"logit\"), dfrModel_train"
##
## $Pseudo.R.squared.for.model.vs.null
## Pseudo.R.squared
## McFadden 0.276355
## Cox and Snell (ML) 0.298974
## Nagelkerke (Cragg and Uhler) 0.413266
##
## $Likelihood.ratio.test
## Df.diff LogLik.diff Chisq p.value
## -4 -122.9 245.81 5.2122e-52
##
## $Number.of.observations
##
## Model: 692
## Null: 692
##
## $Messages
## [1] "Note: For models fit with REML, these statistics are based on refitting with ML"
##
## $Warnings
## [1] "None"
pR2(mgmModel)
## llh llhNull G2 McFadden r2ML
## -321.8249793 -444.7279022 245.8058458 0.2763553 0.2989743
## r2CU
## 0.4132655
Confusion Matrix
prdVal <- predict(mgmModel, type='response')
prdBln <- ifelse(prdVal > 0.5, 1, 0)
cnfmtrx <- table(prd=prdBln, act=dfrModel_train$Outcome)
confusionMatrix(cnfmtrx)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 403 107
## 1 52 130
##
## Accuracy : 0.7702
## 95% CI : (0.737, 0.8011)
## No Information Rate : 0.6575
## P-Value [Acc > NIR] : 7.182e-11
##
## Kappa : 0.4598
## Mcnemar's Test P-Value : 1.848e-05
##
## Sensitivity : 0.8857
## Specificity : 0.5485
## Pos Pred Value : 0.7902
## Neg Pred Value : 0.7143
## Prevalence : 0.6575
## Detection Rate : 0.5824
## Detection Prevalence : 0.7370
## Balanced Accuracy : 0.7171
##
## 'Positive' Class : 0
##
Observations
Accuracy is good which is around 0.77
Regression Data
dfrPlot <- mutate(dfrModel_train, PrdVal=prdVal, POutcome=prdBln)
head(dfrPlot)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 194 33.6
## 2 8 183 64 32 165 23.3
## 3 1 89 66 23 94 28.1
## 4 0 137 40 35 168 43.1
## 5 5 116 74 23 310 25.6
## 6 3 78 50 32 88 31.0
## DiabetesPedigreeFunction Age Outcome PrdVal POutcome
## 1 0.627 50 1 0.64426718 1
## 2 0.672 32 1 0.78728773 1
## 3 0.167 21 0 0.04350404 0
## 4 2.288 33 1 0.83809734 1
## 5 0.201 30 0 0.14751314 0
## 6 0.248 26 1 0.05157047 0
Regression Visulaization
#dfrPlot
ggplot(dfrPlot, aes(x=PrdVal, y=POutcome)) +
geom_point(shape=19, colour="blue", fill="blue") +
geom_smooth(method="gam", formula=y~s(log(x)), se=FALSE) +
labs(title="Binomial Regression Curve") +
labs(x="") +
labs(y="")
ROC Visulaization
#rocplot(logistic.model,diag=TRUE,pred.prob.labels=FALSE,prob.label.digits=3,AUC=TRUE)
rocplot(mgmModel)
Observations
The closer the ROC gets to the optimal point of perfect prediction the closer the AUC gets to 1.
We can see that AUC value is around 0.84 as well as Accuracy of our model is around 0.77 which are quite equal so Model is good.
Test Data
names(dfrModel_test)[names(dfrModel_test) == "Outcome"] <- "Act_Outcome"
dfrTests <- dplyr::select(dfrModel_test, -c(BloodPressure,SkinThickness,Insulin,Age))
head(dfrTests)
## Pregnancies Glucose BMI DiabetesPedigreeFunction Act_Outcome
## 713 10 129 41.2 0.441 1
## 614 6 105 32.5 0.878 0
## 173 2 87 28.9 0.773 0
## 452 2 134 28.9 0.542 1
## 356 9 165 30.4 0.302 1
## 404 9 72 31.6 0.280 0
Data Cleaning
dfrColImpute <- colnames(dfrTests)[!colnames(dfrTests) %in% c("Pregnancies", "Act_Outcome")]
dfr_data <- dfrTests[dfrColImpute] == 0
dfrTests[dfrColImpute][dfr_data] <- NA
head(dfrTests)
## Pregnancies Glucose BMI DiabetesPedigreeFunction Act_Outcome
## 713 10 129 41.2 0.441 1
## 614 6 105 32.5 0.878 0
## 173 2 87 28.9 0.773 0
## 452 2 134 28.9 0.542 1
## 356 9 165 30.4 0.302 1
## 404 9 72 31.6 0.280 0
Missing Data
#sum(is.na(dfrModel$Age))
lapply(dfrTests, FUN=detect_na)
## $Pregnancies
## [1] 0
##
## $Glucose
## [1] 0
##
## $BMI
## [1] 0
##
## $DiabetesPedigreeFunction
## [1] 0
##
## $Act_Outcome
## [1] 0
Test Data imputation
dfrTests$Glucose[is.na(dfrTests$Glucose)] <- median(dfrTests$Glucose,na.rm = T)
dfrTests$BMI[is.na(dfrTests$BMI)] <- median(dfrTests$BMI,na.rm = T)
head(dfrTests)
## Pregnancies Glucose BMI DiabetesPedigreeFunction Act_Outcome
## 713 10 129 41.2 0.441 1
## 614 6 105 32.5 0.878 0
## 173 2 87 28.9 0.773 0
## 452 2 134 28.9 0.542 1
## 356 9 165 30.4 0.302 1
## 404 9 72 31.6 0.280 0
Observation
Test Data successfully created.
Data Imputation is not required in Test data but as we took test data from file only so need to impute data wherever is missing
In actual it is not required
Predict
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
cnfmtrx1 <- table(prd=prdSur, act=dfrModel_test$Act_Outcome)
length(dfrModel_test$Prd_Outcome)
## [1] 0
length(dfrModel_test$Act_Outcome)
## [1] 76
confusionMatrix(cnfmtrx1)
## Confusion Matrix and Statistics
##
## act
## prd 0 1
## 0 43 15
## 1 2 16
##
## Accuracy : 0.7763
## 95% CI : (0.6662, 0.864)
## No Information Rate : 0.5921
## P-Value [Acc > NIR] : 0.0005624
##
## Kappa : 0.5046
## Mcnemar's Test P-Value : 0.0036093
##
## Sensitivity : 0.9556
## Specificity : 0.5161
## Pos Pred Value : 0.7414
## Neg Pred Value : 0.8889
## Prevalence : 0.5921
## Detection Rate : 0.5658
## Detection Prevalence : 0.7632
## Balanced Accuracy : 0.7358
##
## 'Positive' Class : 0
##
Observations
Accuracy is very good, it is around 0.77
Test Data Confusion Matrix
resVal <- predict(mgmModel, dfrTests, type="response")
prdSur <- ifelse(resVal > 0.5, 1, 0)
prdSur <- as.factor(prdSur)
levels(prdSur) <- c("Not Have Diabetes", "Have Diabetes")
dfrTests <- mutate(dfrTests, Result=resVal, Prd_Outcome=prdSur)
dfrTests$Act_Outcome <- as.factor(dfrTests$Act_Outcome)
levels(dfrTests$Act_Outcome) <- c("Not Have Diabetes", "Have Diabetes")
head(dfrTests)
## Pregnancies Glucose BMI DiabetesPedigreeFunction Act_Outcome
## 1 10 129 41.2 0.441 Have Diabetes
## 2 6 105 32.5 0.878 Not Have Diabetes
## 3 2 87 28.9 0.773 Not Have Diabetes
## 4 2 134 28.9 0.542 Have Diabetes
## 5 9 165 30.4 0.302 Have Diabetes
## 6 9 72 31.6 0.280 Not Have Diabetes
## Result Prd_Outcome
## 1 0.71288457 Have Diabetes
## 2 0.29906740 Not Have Diabetes
## 3 0.08050658 Not Have Diabetes
## 4 0.28516116 Not Have Diabetes
## 5 0.74272945 Have Diabetes
## 6 0.09478485 Not Have Diabetes
There was total data set of 768 records of Diabetes patients, It contains their details along with they are having diabetes or not.
Data Set is divided in two part one is Train data for building a model while other one is Test data to test the model.
In train data there is 90% of data which 692 records while in Test data there is 10% of data which is 76 records.
Prediction value is kept at 0.6 as it was giving better accuracy for data.
Model:
Data Imputation is done for all the column except Pregnancies and Outcome as other columns have many values as 0 which is important to be imputed.
Median and Mice Random factor is used to do the data Imputation.
For column where outliers are less, Median imputation is used while for others Mice random factor technique is used for Data Imputation.
Model is created by all the variables taken in account but only four variables are significant enough to give good result:
Outcome~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction Accuracy of Train Data is: 0.77 AUC Value of ROC Curve: 0.84 As no much difference between Accuracy and AUC value so it is a good model.
Data Testing
After Creating the model Data is tested on test data.
Same prediction value is used to get Diabetes patients.
Accuracy of Confusion matrix is around 0.77
Data is tested successfully.