Background of study: Diabetes mellitus is a group of metabolic disorders where the blood sugar levels are higher than normal for prolonged periods of time [1]. Diabetes is caused either due to the insufficient production of insulin in the body or due to improper response of the body’s cells to Insulin. The former cause of Diabetes is also called Type 1 DM or Insulin-dependent Diabetes mellitus and the latter is known as Type 2 DM or Non-Insulin Dependent DM. Gestational Diabetes is a third type of Diabetes where women not suffering from DM develop high sugar levels during pregnancy. In the United States, 30.3 million Americans were recorded as suffering from Diabetes with 1.5 million being diagnosed with Diabetes every year. Total cost of diagnosed Diabetes in the US in 2017 was $327 billion [2]. Diabetes is especially hard on women as it can affect both the mother and their unborn children during pregnancy. Women with Diabetes have a higher likelihood at having a heart attack, miscarriages or babies born with birth defects [3].

Motivation and Goal of study: Due to increasing incidence rate of Diabetes and preDiabetes, it is a pressing issue in the health care industry to rightly identify the factors that contribute to the occurrence of Diabetes in people, more so, in Women. From secondary research, factors such as BMI, Blood Pressure, Cholesterol and Glucose levels are important factors that cause Diabetes. In Women, Pregnancy seems to be an additional factor. According to the World Health Organization, people with 2-hour post-load plasma glucose levels at least 200 mg/dl (11.1 mmol/l) at any survey examination were diagnosed with Diabetes [5]. To validate the above hypotheses, identify additional risk factors and build tools that can predict the occurrence of Diabetes, particularly in women, the Pima Indians’ Diabetes diabetesset was chosen.

About the diabetesset: The Diabetes diabetes containing information about PIMA Indian females, near Phoenix, Arizona has been under continuous study since 1965 due to the high incidence rate of Diabetes in PIMA females. The diabetesset was originally published by the National Institute of Diabetes and Digestive and Kidney Diseases, consisting of diagnostic measurements pertaining to females of age greater than 20. It contains information of 768 females, of which 268 females were diagnosed with Diabetes. Information available includes 8 variables, such as, Age, Number of Pregnancies, Glucose, Insulin, etc. More detailed description about the variables is listed in the table below. The response variable in the diabetesset is a binary classifier, Outcome, that indicates if the person was diagnosed with Diabetes or not.

library(knitr)
#library(tidyverse) #Required for analysis, visualization
library(plotly) # Required for interactive plotting of graphs
library(ggsci) #Themes packAge for the plots
# This R environment comes with all of CRAN preinstalled, as well as many other helpful packAges
# The environment is defined by the kaggle/rstats docker imAge: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packAges to load in 

library(ggplot2) # diabetes visualization
library(readr) # CSV file I/O, e.g. the read_csv function

# Input diabetes files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory


library("ggcorrplot")
library("ROCR")

# Any results you write to the current directory are saved as output.
diabetes<-read.csv("diabetes.csv") # It reads the CSV file and assigns to diabetes object

library(dplyr)
variable.type <- lapply(diabetes, class)

variable.description <- c("Number of times pregnant", 
"Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
"Diastolic blood pressure", "Triceps skin fold thickness", "2-hour serum insulin (µU/ml)",
"Body Mass Index", "Synthesis of the history of Diabetes Mellitus in relatives, generic 
relationship of those relatives to the subject", "Age of the individual", 
"Occurrence of Diabetes")
                          
variable.name <- colnames(diabetes)

diabetesdesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(diabetesdesc) <- c("Variable Name","diabetes Type","Variable Description")

kable(diabetesdesc)

Variable Name	diabetes Type	Variable Description
Pregnancies	integer	Number of times pregnant
Glucose	integer	Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure	integer	Diastolic blood pressure
SkinThickness	integer	Triceps skin fold thickness
Insulin	integer	2-hour serum insulin (µU/ml)
BMI	numeric	Body Mass Index
DiabetesPedigreeFunction	numeric	Synthesis of the history of Diabetes Mellitus in relatives, generic
relationship of those relatives to the subject
Age	integer	Age of the individual
Outcome	integer	Occurrence of Diabetes

diabetes PreProcessing At first glance, the diabetesset appeared to be clean. On deeper analysis, the diabetesset revealed many abnormal values for biological measures. Variables such as Skin Thickness and Glucose had 227 and 374 zero-values respectively. The fact that both measures cannot hold zero values indicated that the missing values in the diabetesset were represented as zero values in the diabetesset. The missing values in the diabetesset constituted to about 30% of the observations in the diabetesset. As removing these values would result in significant information loss, kNN imputation was performed to impute the missing values in the diabetes set. Only obvious wrong values in the diabetesset (zero values) were imputed. Large outliers in the variables were not handled.

# Dealing with zeros
missing_diabetes <- diabetes[,setdiff(names(diabetes), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(missing_diabetes, 2, function(x) sum(x <= 0))
features_miss <- names(missing_diabetes)[ features_miss_num > 0]

rows_miss <- apply(missing_diabetes, 1, function(x) sum(x <= 0) >= 1) 
sum(rows_miss)

## [1] 376

library(VIM)
missing_diabetes[missing_diabetes <= 0] <- NA
diabetes[, names(missing_diabetes)] <- missing_diabetes

# KNN imputation
orig_diabetes <- diabetes
colSums(is.na(diabetes))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

diabetes[,c(-8,-9)] <- kNN(diabetes[,c(-8,-9)], k = 5)

Exploratory diabetes Analysis

Response variable: Outcome

The diabetesset has 268 women that were diagnosed with Diabetes and 500 women that didn’t have Diabetes. The sample has a high occurrence (34%) of positive records of Diabetes.

diabetes$Outcome <- factor(diabetes$Outcome)

# 1. Outcome
ggplot(diabetes,aes(Outcome,fill = Outcome)) +
  geom_bar() + 
  ggtitle("Distribution of Outcome variable")

Predictor variables

Pregnancies From the figures below, it’s evident that women who have been diagnosed with Diabetes have had more pregnancies than women who were not (in this diabetesset). However, from the histogram below, there’s no clear relationship between the number of pregnancies and the occurrence of Diabetes.

# 2. Pregnancies
p1 <- ggplot(diabetes, aes(x = Outcome, y = Pregnancies,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Number of pregnancies Vs Diabetes")

p2 <- ggplot(diabetes,aes(x = Pregnancies,fill = factor(Outcome))) + 
  geom_bar(position = "Dodge") + 
  scale_x_continuous(limits = c(0,16)) +
  theme(legend.position = "bottom") +
  labs(title = "Pregnancies Vs Outcome")

gridExtra::grid.arrange(p1, p2, ncol = 2)

## Warning: Removed 1 rows containing non-finite values (stat_count).

## Warning: Removed 1 rows containing missing values (geom_bar).

Glucose

From the figures below, there’s a clear difference in the amount of Glucose present in the women who have been diagnosed with Diabetes and those who haven’t. While the density plot indicates an overlap in the levels of glucose in both categories of women, the below plots show that Glucose could be a good indicator of the response.

# 2. Pregnancies
p2 <- ggplot(diabetes, aes(x = Glucose, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Glucose", y = "Density", title = "Density plot of glucose")

p1 <- ggplot(diabetes, aes(x = Outcome, y = Glucose,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of glucose in women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Blood Pressure

From the below plots, no clear difference is seen in the two categories of women who have and don’t have Diabetes. This shows that Blood Pressure might not be a good predictor of the response variable.

# 3. Blood Pressure
p2 <- ggplot(diabetes, aes(x = BloodPressure, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Blood pressure", y = "Density", title = "Density plot of Blood pressure")

p1 <- ggplot(diabetes, aes(x = Outcome, y = BloodPressure,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of blood pressure in women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Skin Thickness

From the below plots, no clear difference can be seen in the two categories of women who have and don’t have Diabetes. This shows that Skin Thickness might not be a good predictor of the response variable.

# 4. Skin Thickness
p2 <- ggplot(diabetes, aes(x = SkinThickness, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Skin thickness", y = "Density", title = "Density plot of skin thickness")

p1 <- ggplot(diabetes, aes(x = Outcome, y = SkinThickness,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of skin thickness Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Insulin

Again, no clear difference can be observed between the categories of women in the diabetes, indicating Insulin may not be a good predictor of the response variable.

# 5. Insulin
p1 <- ggplot(diabetes, aes(x = Outcome, y = Insulin,fill = Outcome)) +
  geom_boxplot() + 
  theme(legend.position = "bottom") +
  ggtitle("Variation of Insulin content Vs Diabetes")

p2 <- ggplot(diabetes, aes(Insulin, fill = Outcome)) +
  geom_histogram(binwidth=10) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Insulin content Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

BMI

The below plots show that all the women who had Diabetes had a BMI greater than 25, which is above the normal levels. On the other hand, women who did not have Diabetes had a BMI ranging from 18 to 60.

# 6. BMI
p2 <- ggplot(diabetes, aes(BMI, fill = Outcome)) +
  geom_histogram() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of BMI of women Vs Diabetes")

p1 <- ggplot(diabetes, aes(x = Outcome, y = BMI,fill = Outcome)) +
  geom_boxplot(binwidth = 5) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of BMI of women Vs Diabetes")

## Warning: Ignoring unknown parameters: binwidth

gridExtra::grid.arrange(p1, p2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Diabetes Pedigree Function

From fig 8, no clear difference can be seen in the two categories of women who have and don’t have Diabetes. This shows that DPF might not be a good predictor of the response variable.

# 7. DPF
p1 <- ggplot(diabetes, aes(x = Outcome, y = DiabetesPedigreeFunction, fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of DPF of women Vs Diabetes")


p2 <- ggplot(diabetes, aes(DiabetesPedigreeFunction,fill = Outcome)) +
  geom_histogram() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of DPF Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Age

Again, no clear distinction is seen in the distribution of the ‘Age’ variable for women that have Diabetes versus those who don’t.

# 8. Age
p2 <- ggplot(diabetes, aes(Age, fill = Outcome)) +
  geom_histogram(binwidth = 5) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Age of women Vs Diabetes")

p1 <- ggplot(diabetes, aes(x = Outcome, y = Age,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Age of women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Bi-variate associations

From the below bivariate associations, the following inferences can be made.

No clear boundary can be drawn that separates Non-diabetic and diabetic women based on Number of Pregnancies vs Age Non-diabetic women seemed to have lower levels of Insulin and Glucose as opposed to diabetic women who recorded low to high levels of Insulin and high levels of Glucose

p1 <- ggplot(diabetes, aes(x = Age, y = Pregnancies)) +
  geom_point(aes(color=Outcome)) + 
  theme(legend.position = "bottom") +
  ggtitle("Relationship of Pregnancies with Age Vs Diabetes")

p2 <- ggplot(diabetes,aes(x=Insulin,y=Glucose))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of Insulin with Glucose Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Women who have Diabetes can be differentiated from those who don’t have based on BMI and BP values Women with low values of BMI and Skin Thickness did not have Diabetes

p1 <- ggplot(diabetes,aes(x=BMI,y=BloodPressure))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of BMI with BP Vs Diabetes")

p2 <- ggplot(diabetes,aes(x=BMI,y=SkinThickness))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of BMI with Skin Thickness Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Correlations between Predictor variables Subsequently, a correlation plot was drawn between all the numerical variables to establish the linear association between each other. As observed in the bivariate associations, Insulin and Glucose, BMI and Skin Thickness had a moderate – high linear correlation.

library(corrplot)

## corrplot 0.92 loaded

corMat = cor (diabetes[, -9])
diag (corMat) = 0 #Remove self correlations
corrplot.mixed(corMat,tl.pos = "lt")

Modeling Approach To build a predictive model that classifies women with/ without Diabetes, the following modeling approach was used.

Modeling Model Building - The modeling techniques performed on this diabetes include Logistic Regression, Classification Trees, Random Forests and SVM. The diabetes set was split into an 80-20 train-test diabetes set.

Create training and testing set

set.seed(12345)
ratio = sample(1:nrow(diabetes), size = 0.20*nrow(diabetes))
test.diabetes = diabetes[ratio,] #Test diabetesset 20% of total
train.diabetes = diabetes[-ratio,] #Train diabetesset 80% of total

Parameter tuning - The parameter(s) of each of the models built were tuned to identify the best model using cross validation

Model Performance Evaluation - Model performance was evaluated for each model using sensitivity as the criterion for the 80% training (in-sample) and 20% (out-of-sample). Sensitivity is the ability of the model to determine the true cases correctly, in this case, identify the patients with Diabetes correctly. From the below confusion matrix, Sensitivity is defined as the ratio of (True Positives) / (True Positives + False Negatives).

Modeling

Logistic Regression

A full model was built with Outcome as the response variable with the rest of the 8 predictor variables. Step-wise variable selection method was used to identify the most important variables. The final model chosen with AIC as the criterion for selection generated a logistic regression model with the lowest AIC value of 593.85 as below.

model.glm <- glm(Outcome~.,data=train.diabetes,family = binomial)
step_model <- step(model.glm)

## Start:  AIC=591.08
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - Age                       1   573.09 589.09
## - BloodPressure             1   573.16 589.16
## - Insulin                   1   573.28 589.28
## - SkinThickness             1   573.31 589.31
## <none>                          573.08 591.08
## - DiabetesPedigreeFunction  1   581.06 597.06
## - Pregnancies               1   586.95 602.95
## - BMI                       1   588.95 604.95
## - Glucose                   1   640.88 656.88
## 
## Step:  AIC=589.09
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## - BloodPressure             1   573.17 587.17
## - Insulin                   1   573.29 587.29
## - SkinThickness             1   573.31 587.31
## <none>                          573.09 589.09
## - DiabetesPedigreeFunction  1   581.09 595.09
## - BMI                       1   589.18 603.18
## - Pregnancies               1   591.57 605.57
## - Glucose                   1   642.99 656.99
## 
## Step:  AIC=587.17
## Outcome ~ Pregnancies + Glucose + SkinThickness + Insulin + BMI + 
##     DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## - Insulin                   1   573.36 585.36
## - SkinThickness             1   573.41 585.41
## <none>                          573.17 587.17
## - DiabetesPedigreeFunction  1   581.14 593.14
## - BMI                       1   590.50 602.50
## - Pregnancies               1   592.73 604.73
## - Glucose                   1   645.27 657.27
## 
## Step:  AIC=585.36
## Outcome ~ Pregnancies + Glucose + SkinThickness + BMI + DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## - SkinThickness             1   573.60 583.60
## <none>                          573.36 585.36
## - DiabetesPedigreeFunction  1   581.55 591.55
## - BMI                       1   592.25 602.25
## - Pregnancies               1   592.83 602.83
## - Glucose                   1   693.11 703.11
## 
## Step:  AIC=583.6
## Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## <none>                          573.60 583.60
## - DiabetesPedigreeFunction  1   581.83 589.83
## - Pregnancies               1   593.69 601.69
## - BMI                       1   609.34 617.34
## - Glucose                   1   693.41 701.41

Final model using Logistic Regression

model.new <- glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, family = binomial, data = train.diabetes)

summary(model.new)

## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
##     family = binomial, data = train.diabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8683  -0.7124  -0.3998   0.7388   2.1968  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.328566   0.789472 -11.816  < 2e-16 ***
## Pregnancies               0.138180   0.031431   4.396 1.10e-05 ***
## Glucose                   0.037075   0.003859   9.608  < 2e-16 ***
## BMI                       0.092051   0.016198   5.683 1.33e-08 ***
## DiabetesPedigreeFunction  0.940640   0.333118   2.824  0.00475 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 800.9  on 614  degrees of freedom
## Residual deviance: 573.6  on 610  degrees of freedom
## AIC: 583.6
## 
## Number of Fisher Scoring iterations: 5

Logistic Regression – Parameter Tuning Using the above full model, the optimal cut-off probability was identified using cross-validation. An asymmetric cost function was defined penalizing the model for wrongly identifying patients with Diabetes. Using the defined asymmetric cost function (2:1 for False Positives: False Negatives), the optimal cut-off probability was calculated. The optimal cut-off probability is the probability at which the model has the least misclassification rate.

CV to choose cut-off probability

library(boot)
# CV to choose cut-off probability
searchgrid = seq(0.01, 0.6, 0.02)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
  weight1 = 2
  weight0 = 1
  c1 = (r == 1) & (pi < pcut)  #logical vector - true if actual 1 but predict 0 (False Negative)
  c0 = (r == 0) & (pi > pcut)  #logical vector - true if actual 0 but predict 1 (False Positive)
  return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
  pcut <- result[i, 1]
  result[i, 2] <- cv.glm(data = train.diabetes, glmfit = model.new, cost = cost1, 
                         K = 3)$delta[2]
}
plot(result, ylab = "CV Cost",main = "Optimal cut-off probability identification")

Plotting the cross-validated cost with different values of pcut resulted in the above figure. 0.29 was chosen as the optimal cut-off probability with a CV cost of 0.3420.

Classification Tree A classification tree is used for modeling diabetes when the response variable is categorical. The tree splits the diabetes into two or more homogeneous sets based on the most significant differentiator in the predictor variables-value set. [7]

For this data, a classification tree was built for the binary response variable ‘Outcome’, the output of which can be seen below in Fig 14. An asymmetric cost (2:1 for False Positives: False Negatives) was used to build the tree.

library(rpart)
library(rpart.plot)
tree.model <- rpart(Outcome~., data=train.diabetes, method="class")
rpart.plot(tree.model)

Classification Tree - Parameter Tuning The complexity parameter (Cp) of the tree was tuned using a reference of the Relative error VS Cp. The complexity parameter value after which the relative error does not decrease significantly was chosen as the Cp of the final tree. From Fig 15, using the Cp value of 0.015, the decision tree was pruned. The final decision tree can be seen in Fig 16.

plotcp(tree.model)

# Optimal cut-off value
# Pruning the tree
tree.model<- rpart(Outcome~., data=train.diabetes, method="class",cp=0.015)
rpart.plot(tree.model)

Random Forest Random forests grow many decision trees by training on different selections of the diabetes and on a different selection of variables for each tree. Each tree classifies the decision variable by considering a random subset of the predictors and diabetes. The random forest, in turn, combines the decision of the individual trees (mode) to make final decision. The mtry parameter controls the choice of the number of variables selected to build each tree.

#random forest
library(randomForest)
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.diabetes, importance=TRUE)
print(rf.model)

## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train.diabetes, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 24.88%
## Confusion matrix:
##     0   1 class.error
## 0 326  70   0.1767677
## 1  83 136   0.3789954

Random Forest - Parameter Tuning The below figure plots the OOB error vs mtry,. From the figure, mtry equal to 3 gives the least OOB error.

The OOB error refers to the overall misclassification of both positive and negative classes. However, while the focus of the current prediction is to reduce the misclassification of the positive class, the confusion matrix was manually compared.

# Tune only mtry parameter using tuneRF()
set.seed(1234567)      
res <- tuneRF(x = subset(train.diabetes, select = -Outcome),
              y = train.diabetes$Outcome,
              ntreeTry = 500,
              plot = TRUE, tunecontrol = tune.control(cross = 5))

## mtry = 2  OOB error = 24.88% 
## Searching left ...
## mtry = 1     OOB error = 25.04% 
## -0.006535948 0.05 
## Searching right ...
## mtry = 4     OOB error = 25.04% 
## -0.006535948 0.05

res[res[, 2] == min(res[, 2]), 1]

## [1] 2

From the above table, as mtry at 4 gives the least value of positive class misclassification error, it was chosen as the optimal value and the random forest was rebuilt using this parameter.

#random forest
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.diabetes, importance = TRUE, mtry = 2)
rf.model

## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train.diabetes, importance = TRUE,      mtry = 2) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 24.88%
## Confusion matrix:
##     0   1 class.error
## 0 326  70   0.1767677
## 1  83 136   0.3789954

Model Performance Evaluation The performance of the models was compared across the in-sample training diabetes (80% of the diabetes) and the out-of-sample testing diabetes (20% of the diabetes). The below plots summarize the results of the model performance. The best model was chosen based on the performance in the out-of-sample diabetes and sensitivity was chosen as the decision criterion.

Logistic Regression

# Logistic Regression Performance
library(ROCR)
library(rocc)
library(ROCit)
library(pROC)
library(caTools)
library(precrec)
library(AUC)
library(verification)
par(mfrow = c(1,2))

pred.in <- predict(model.new,newdatas = train.diabetes,type = "response")
prediction.in <- ifelse(pred.in<0.29,0,1)
tab<-table(as.factor(train.diabetes$Outcome),prediction.in)
roc.plot(train.diabetes$Outcome=="1", pred.in)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8496495 3.918146e-47          NA

pred.out <- predict(model.new,newdata = test.diabetes,type = "response")
prediction.out <- ifelse(pred.out<0.29,0,1)
table(as.factor(test.diabetes$Outcome),prediction.out)

##    prediction.out
##      0  1
##   0 76 28
##   1 14 35

roc.plot(test.diabetes$Outcome=="1", pred.out)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8086735 3.898053e-10          NA

Classification Tree

# Classification Tree Performance
library(caret)
par(mfrow = c(1,2))

tree.predict.in<- predict(tree.model, train.diabetes, type = "class")
tree.pred.in<- predict(tree.model, train.diabetes, type = "prob")
confusionMatrix(train.diabetes$Outcome, tree.predict.in)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 331  65
##          1  50 169
##                                           
##                Accuracy : 0.813           
##                  95% CI : (0.7799, 0.8431)
##     No Information Rate : 0.6195          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5984          
##                                           
##  Mcnemar's Test P-Value : 0.1917          
##                                           
##             Sensitivity : 0.8688          
##             Specificity : 0.7222          
##          Pos Pred Value : 0.8359          
##          Neg Pred Value : 0.7717          
##              Prevalence : 0.6195          
##          Detection Rate : 0.5382          
##    Detection Prevalence : 0.6439          
##       Balanced Accuracy : 0.7955          
##                                           
##        'Positive' Class : 0               
##

roc.plot(train.diabetes$Outcome== "1", tree.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8381302 1.541058e-48          NA

tree.predict<- predict(tree.model, test.diabetes, type = "class")
tree.pred<- predict(tree.model, test.diabetes, type = "prob")
confusionMatrix(test.diabetes$Outcome, tree.predict)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 83 21
##          1 17 32
##                                           
##                Accuracy : 0.7516          
##                  95% CI : (0.6754, 0.8179)
##     No Information Rate : 0.6536          
##     P-Value [Acc > NIR] : 0.005891        
##                                           
##                   Kappa : 0.4416          
##                                           
##  Mcnemar's Test P-Value : 0.626496        
##                                           
##             Sensitivity : 0.8300          
##             Specificity : 0.6038          
##          Pos Pred Value : 0.7981          
##          Neg Pred Value : 0.6531          
##              Prevalence : 0.6536          
##          Detection Rate : 0.5425          
##    Detection Prevalence : 0.6797          
##       Balanced Accuracy : 0.7169          
##                                           
##        'Positive' Class : 0               
##

roc.plot(test.diabetes$Outcome== "1", tree.pred[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model     Area      p.value binorm.area
## 1 Model  1 0.764325 2.072602e-08          NA

Random Forest

Random Forest Performance

library(randomForest)
par(mfrow = c(1,2))

rf.predict.in <- predict(rf.model, train.diabetes)
rf.pred.in <- predict(rf.model, train.diabetes,type="prob")
confusionMatrix(train.diabetes$Outcome, rf.predict.in)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 396   0
##          1   0 219
##                                     
##                Accuracy : 1         
##                  95% CI : (0.994, 1)
##     No Information Rate : 0.6439    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##                                     
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.0000    
##             Specificity : 1.0000    
##          Pos Pred Value : 1.0000    
##          Neg Pred Value : 1.0000    
##              Prevalence : 0.6439    
##          Detection Rate : 0.6439    
##    Detection Prevalence : 0.6439    
##       Balanced Accuracy : 1.0000    
##                                     
##        'Positive' Class : 0         
##

roc.plot(train.diabetes$Outcome == "1", rf.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model Area      p.value binorm.area
## 1 Model  1    1 3.701072e-94          NA

rf.model<- randomForest(Outcome~., data = train.diabetes, importance=TRUE)
rf.predict.out <- predict(rf.model, test.diabetes)
rf.pred.out <- predict(rf.model, test.diabetes,type="prob")
confusionMatrix(test.diabetes$Outcome, rf.predict.out)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 88 16
##          1 24 25
##                                           
##                Accuracy : 0.7386          
##                  95% CI : (0.6615, 0.8062)
##     No Information Rate : 0.732           
##     P-Value [Acc > NIR] : 0.4693          
##                                           
##                   Kappa : 0.3724          
##                                           
##  Mcnemar's Test P-Value : 0.2684          
##                                           
##             Sensitivity : 0.7857          
##             Specificity : 0.6098          
##          Pos Pred Value : 0.8462          
##          Neg Pred Value : 0.5102          
##              Prevalence : 0.7320          
##          Detection Rate : 0.5752          
##    Detection Prevalence : 0.6797          
##       Balanced Accuracy : 0.6977          
##                                           
##        'Positive' Class : 0               
##

roc.plot(test.diabetes$Outcome== "1", rf.pred.out[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8182889 1.142747e-10          NA

Comparing all model AUCs

par(mfrow = c(1,2))

# Plot multiple ROC curves in one - Logistic, GAM, Tree and Random Forest
# Compare different models using ROC curves
# In-sample ROC
rocplot1 <- roc.plot(x = (train.diabetes$Outcome == "1"), 
                     pred = cbind(pred.in, tree.pred.in[,2], rf.pred.in[,2] ),
                     main = "ROC curve: In-sample",
                     legend = T, leg.text = c("GLM","Tree", "RForest"))

## Warning in roc.plot.default(x = (train.diabetes$Outcome == "1"), pred =
## cbind(pred.in, : Large amount of unique predictions used as thresholds. Consider
## specifying thresholds.

# Validation set ROC
rocplot2 <- roc.plot(x = (test.diabetes$Outcome == "1"), 
                     pred = cbind(pred.out, tree.pred[,2], rf.pred.out[,2]),
                     main = "ROC curve: Out-of-sample",
                     legend = T, leg.text = c("GLM", "Tree", "RForest"))

rocplot1$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8496495 3.918146e-47          NA
## 2 Model  2 0.8381302 1.541058e-48          NA
## 3 Model  3 1.0000000 3.701072e-94          NA

rocplot2$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8086735 3.898053e-10          NA
## 2 Model  2 0.7643250 2.072602e-08          NA
## 3 Model  3 0.8182889 1.142747e-10          NA

The Logistic regression model was chosen as the best model for predicting the occurrence of Diabetes in PIMA women, as it reported the highest AUC for predicting diabetes with an accuracy of over 80%.

Conclusion The PIMA Indian Women’s diabetes data was analyzed and explored in detail. The patterns identified using diabetes exploration methods were validated using the modeling techniques employed. Classification models such as Logistic Regression, Classification Trees, Random Forest were built and evaluated to identify best model to predict the occurrence of Diabetes in PIMA Indian women. From the AUC for test data, the Logistic Regression model was concluded as the best performing model.

References:

https://en.wikipedia.org/wiki/Diabetes_mellitus
http://www.Diabetes.org/Diabetes-basics/statistics/
http://www.Diabetes.org/living-with-Diabetes/treatment-and-care/women/
http://www.Diabetes.org/assets/pdfs/basics/cdc-statistics-report-2017.pdf
WHO technical report series (1985). Technical Report 727, WHO Study Group
http://www.personal.kent.edu/~mshanker/personal/Zip_files/sar_2000.pdf
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4418458/

Predicting Occurence of Diabetes in PIMA Women

Sakib Faisal

07/30/2021

Create training and testing set

Logistic Regression

Final model using Logistic Regression

CV to choose cut-off probability

Random Forest Performance