Prediction of Diabetes in PIMA Women

Background of study

Diabetes mellitus is a group of metabolic disorders where the blood sugar levels are higher than normal for prolonged periods of time [1]. Diabetes is caused either due to the insufficient production of insulin in the body or due to improper response of the body’s cells to Insulin. The former cause of Diabetes is also called Type 1 DM or Insulin-dependent Diabetes mellitus and the latter is known as Type 2 DM or Non-Insulin Dependent DM. Gestational Diabetes is a third type of Diabetes where women not suffering from DM develop high sugar levels during pregnancy. In the United States, 30.3 million Americans were recorded as suffering from Diabetes with 1.5 million being diagnosed with Diabetes every year. Total cost of diagnosed Diabetes in the US in 2017 was $327 billion [2]. Diabetes is especially hard on women as it can affect both the mother and their unborn children during pregnancy. Women with Diabetes have a higher likelihood at having a heart attack, miscarriages or babies born with birth defects [3].

Motivation and Goal of study

Due to increasing incidence rate of diabetes and prediabetes, it is a pressing issue in the health care industry to rightly identify the factors that contribute to the occurrence of Diabetes in people, more so, in Women. From secondary research, factors such as BMI, Blood Pressure, Cholesterol and Glucose levels are important factors that cause Diabetes. In Women, Pregnancy seems to be an additional factor. According to the World Health Organization, people with 2-hour post-load plasma glucose levels at least 200 mg/dl (11.1 mmol/l) at any survey examination were diagnosed with Diabetes [5]. To validate the above hypotheses, identify additional risk factors and build tools that can predict the occurrence of Diabetes, particularly in women, the Pima Indians’ Diabetes dataset was chosen.

About the Dataset

The diabetes data containing information about PIMA Indian females, near Phoenix, Arizona has been under continuous study since 1965 due to the high incidence rate of Diabetes in PIMA females. The dataset was originally published by the National Institute of Diabetes and Digestive and Kidney Diseases, consisting of diagnostic measurements pertaining to females of age greater than 20. It contains information of 768 females, of which 268 females were diagnosed with Diabetes. Information available includes 8 variables, such as, Age, Number of Pregnancies, Glucose, Insulin, etc. More detailed description about the variables is listed in the table below. The response variable in the dataset is a binary classifier, Outcome, that indicates if the person was diagnosed with Diabetes or not.

Data Description

Details about the Variables in the Dataset are provided below.

variable.type <- lapply(data, class)

variable.description <- c("Number of times pregnant", 
"Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
"Diastolic blood pressure", "Triceps skin fold thickness", "2-hour serum insulin (µU/ml)",
"Body Mass Index", "Synthesis of the history of Diabetes Mellitus in relatives, generic 
relationship of those relatives to the subject", "Age of the individual", 
"Occurrence of Diabetes")
                          
variable.name <- colnames(data)

datadesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(datadesc) <- c("Variable Name","Data Type","Variable Description")

kable(datadesc)

Variable Name	Data Type	Variable Description
Pregnancies	integer	Number of times pregnant
Glucose	integer	Plasma glucose concentration at 2 hours in an oral glucose tolerance test
BloodPressure	integer	Diastolic blood pressure
SkinThickness	integer	Triceps skin fold thickness
Insulin	integer	2-hour serum insulin (µU/ml)
BMI	numeric	Body Mass Index
DiabetesPedigreeFunction	numeric	Synthesis of the history of Diabetes Mellitus in relatives, generic
relationship of those relat	ives to the	subject
Age	integer	Age of the individual
Outcome	integer	Occurrence of Diabetes

Data Preprocessing

At first glance, the dataset appeared to be clean. On deeper analysis, the dataset revealed many abnormal values for biological measures. Variables such as Skin Thickness and Glucose had 227 and 374 zero-values respectively. The fact that both measures cannot hold zero values indicated that the missing values in the dataset were represented as zero values in the dataset. The missing values in the dataset constituted to about 30% of the observations in the dataset. As removing these values would result in significant information loss, kNN imputation was performed to impute the missing values in the data set. Only obvious wrong values in the dataset (zero values) were imputed. Large outliers in the variables were not handled.

# Dealing with zeros
missing_data <- data[,setdiff(names(data), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(missing_data, 2, function(x) sum(x <= 0))
features_miss <- names(missing_data)[ features_miss_num > 0]

rows_miss <- apply(missing_data, 1, function(x) sum(x <= 0) >= 1) 
sum(rows_miss)

## [1] 376

missing_data[missing_data <= 0] <- NA
data[, names(missing_data)] <- missing_data

# KNN imputation
orig_data <- data
colSums(is.na(data))

##              Pregnancies                  Glucose            BloodPressure 
##                        0                        5                       35 
##            SkinThickness                  Insulin                      BMI 
##                      227                      374                       11 
## DiabetesPedigreeFunction                      Age                  Outcome 
##                        0                        0                        0

data[,c(-8,-9)] <- knnImputation(data[,c(-8,-9)], k = 5)

Data Exploration

Response variable

Outcome

The dataset has 268 women that were diagnosed with Diabetes and 500 women that didn’t have Diabetes. The sample has a high occurrence (34%) of positive records of Diabetes.

data$Outcome <- factor(data$Outcome)

# 1. Outcome
ggplot(data,aes(Outcome,fill = Outcome)) +
  geom_bar() + 
  ggtitle("Distribution of Outcome variable")

Predictor variables

Pregnancies

From the figures below, it’s evident that women who have been diagnosed with Diabetes have had more pregnancies than women who were not (in this dataset). However, from the histogram below, there’s no clear relationship between the number of pregnancies and the occurrence of Diabetes.

# 2. Pregnancies
p1 <- ggplot(data, aes(x = Outcome, y = Pregnancies,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Number of pregnancies Vs Diabetes")

p2 <- ggplot(data,aes(x = Pregnancies,fill = factor(Outcome))) + 
  geom_bar(position = "Dodge") + 
  scale_x_continuous(limits = c(0,16)) +
  theme(legend.position = "bottom") +
  labs(title = "Pregnancies Vs Outcome")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Glucose

From the figures below, there’s a clear difference in the amount of Glucose present in the women who have been diagnosed with Diabetes and those who haven’t. While the density plot indicates an overlap in the levels of glucose in both categories of women, the below plots show that Glucose could be a good indicator of the response.

# 2. Pregnancies
p2 <- ggplot(data, aes(x = Glucose, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Glucose", y = "Density", title = "Density plot of glucose")

p1 <- ggplot(data, aes(x = Outcome, y = Glucose,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of glucose in women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Blood Pressure

From the below plots, no clear difference is seen in the two categories of women who have and don’t have Diabetes. This shows that Blood Pressure might not be a good predictor of the response variable.

# 3. Blood Pressure
p2 <- ggplot(data, aes(x = BloodPressure, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Blood pressure", y = "Density", title = "Density plot of Blood pressure")

p1 <- ggplot(data, aes(x = Outcome, y = BloodPressure,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of blood pressure in women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Skin Thickness

From the below plots, no clear difference can be seen in the two categories of women who have and don’t have Diabetes. This shows that Skin Thickness might not be a good predictor of the response variable.

# 4. Skin Thickness
p2 <- ggplot(data, aes(x = SkinThickness, color = Outcome, fill = Outcome)) +
  geom_density(alpha = 0.8) +
  theme(legend.position = "bottom") +
  labs(x = "Skin thickness", y = "Density", title = "Density plot of skin thickness")

p1 <- ggplot(data, aes(x = Outcome, y = SkinThickness,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of skin thickness Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Insulin

Again, no clear difference can be observed between the categories of women in the data, indicating Insulin may not be a good predictor of the response variable.

# 5. Insulin
p1 <- ggplot(data, aes(x = Outcome, y = Insulin,fill = Outcome)) +
  geom_boxplot() + 
  theme(legend.position = "bottom") +
  ggtitle("Variation of Insulin content Vs Diabetes")

p2 <- ggplot(data, aes(Insulin, fill = Outcome)) +
  geom_histogram(binwidth=10) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Insulin content Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

BMI

The below plots show that all the women who had Diabetes had a BMI greater than 25, which is above the normal levels. On the other hand, women who did not have Diabetes had a BMI ranging from 18 to 60.

# 6. BMI
p2 <- ggplot(data, aes(BMI, fill = Outcome)) +
  geom_histogram() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of BMI of women Vs Diabetes")

p1 <- ggplot(data, aes(x = Outcome, y = BMI,fill = Outcome)) +
  geom_boxplot(binwidth = 5) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of BMI of women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Diabetes Pedigree Function

From fig 8, no clear difference can be seen in the two categories of women who have and don’t have Diabetes. This shows that DPF might not be a good predictor of the response variable.

# 7. DPF
p1 <- ggplot(data, aes(x = Outcome, y = DiabetesPedigreeFunction,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of DPF of women Vs Diabetes")


p2 <- ggplot(data, aes(DiabetesPedigreeFunction,fill = Outcome)) +
  geom_histogram() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of DPF Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Age

Again, no clear distinction is seen in the distribution of the ‘Age’ variable for women that have Diabetes versus those who don’t.

# 8. Age
p2 <- ggplot(data, aes(Age, fill = Outcome)) +
  geom_histogram(binwidth = 5) +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Age of women Vs Diabetes")

p1 <- ggplot(data, aes(x = Outcome, y = Age,fill = Outcome)) +
  geom_boxplot() +
  theme(legend.position = "bottom") +
  ggtitle("Variation of Age of women Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Bi-variate associations

From the below bivariate associations, the following inferences can be made.

No clear boundary can be drawn that separates Non-diabetic and Diabetic women based on Number of Pregnancies vs Age
Non-diabetic women seemed to have lower levels of Insulin and Glucose as opposed to Diabetic women who recorded low to high levels of Insulin and high levels of Glucose

p1 <- ggplot(data, aes(x = Age, y = Pregnancies)) +
  geom_point(aes(color=Outcome)) + 
  theme(legend.position = "bottom") +
  ggtitle("Relationship of Pregnancies with Age Vs Diabetes")

p2 <- ggplot(data,aes(x=Insulin,y=Glucose))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of Insulin with Glucose Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Women who have Diabetes can be differentiated from those who don’t have based on BMI and BP values
Women with low values of BMI and Skin Thickness did not have Diabetes

p1 <- ggplot(data,aes(x=BMI,y=BloodPressure))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of BMI with BP Vs Diabetes")

p2 <- ggplot(data,aes(x=BMI,y=SkinThickness))+
  geom_point(aes(color=Outcome))+
  theme(legend.position = "bottom") +
  ggtitle("Relationship of BMI with Skin Thickness Vs Diabetes")

gridExtra::grid.arrange(p1, p2, ncol = 2)

Correlations between Predictor variables

Subsequently, a correlation plot was drawn between all the numerical variables to establish the linear association between each other. As observed in the bivariate associations, Insulin and Glucose, BMI and Skin Thickness had a moderate – high linear correlation.

corMat = cor (data[, -9])
diag (corMat) = 0 #Remove self correlations
corrplot.mixed(corMat,tl.pos = "lt")

Modeling Approach

To build a predictive model that classifies women with/ without Diabetes, the following modeling approach was used.

Modeling

Model Building - The modeling techniques performed on this data include Logistic Regression, Classification Trees, Random Forests and SVM. The data set was split into an 80-20 train-test data set.

# Create training and testing set
set.seed(12345)
ratio = sample(1:nrow(data), size = 0.20*nrow(data))
test.data = data[ratio,] #Test dataset 20% of total
train.data = data[-ratio,] #Train dataset 80% of total

Parameter tuning - The parameter(s) of each of the models built were tuned to identify the best model using cross validation

Model Performance Evaluation - Model performance was evaluated for each model using sensitivity as the criterion for the 80% training (in-sample) and 20% (out-of-sample). Sensitivity is the ability of the model to determine the true cases correctly, in this case, identify the patients with Diabetes correctly. From the below confusion matrix, Sensitivity is defined as the ratio of (True Positives) / (True Positives + False Negatives).

Modeling

Logistic Regression

A full model was built with Outcome as the response variable with the rest of the 8 predictor variables. Step-wise variable selection method was used to identify the most important variables. The final model chosen with AIC as the criterion for selection generated a logistic regression model with the lowest AIC value of 593.85 as below.

# Logistic Regression
model.glm <- glm(Outcome~.,data=train.data,family = binomial)
step_model <- step(model.glm)

## Start:  AIC=600.7
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - SkinThickness             1   582.81 598.81
## - Insulin                   1   582.97 598.97
## - BloodPressure             1   582.99 598.99
## - Age                       1   583.34 599.34
## <none>                          582.70 600.70
## - DiabetesPedigreeFunction  1   586.20 602.20
## - Pregnancies               1   595.66 611.66
## - BMI                       1   606.83 622.83
## - Glucose                   1   646.98 662.98
## 
## Step:  AIC=598.81
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI + 
##     DiabetesPedigreeFunction + Age
## 
##                            Df Deviance    AIC
## - Insulin                   1   583.09 597.09
## - BloodPressure             1   583.10 597.10
## - Age                       1   583.42 597.42
## <none>                          582.81 598.81
## - DiabetesPedigreeFunction  1   586.24 600.24
## - Pregnancies               1   595.67 609.67
## - BMI                       1   619.62 633.62
## - Glucose                   1   646.98 660.98
## 
## Step:  AIC=597.09
## Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction + 
##     Age
## 
##                            Df Deviance    AIC
## - BloodPressure             1   583.40 595.40
## - Age                       1   583.70 595.70
## <none>                          583.09 597.09
## - DiabetesPedigreeFunction  1   586.79 598.79
## - Pregnancies               1   596.06 608.06
## - BMI                       1   624.11 636.11
## - Glucose                   1   683.69 695.69
## 
## Step:  AIC=595.4
## Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction + 
##     Age
## 
##                            Df Deviance    AIC
## - Age                       1   583.85 593.85
## <none>                          583.40 595.40
## - DiabetesPedigreeFunction  1   587.13 597.13
## - Pregnancies               1   596.19 606.19
## - BMI                       1   625.92 635.92
## - Glucose                   1   683.75 693.75
## 
## Step:  AIC=593.85
## Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
## 
##                            Df Deviance    AIC
## <none>                          583.85 593.85
## - DiabetesPedigreeFunction  1   587.72 595.72
## - Pregnancies               1   605.27 613.27
## - BMI                       1   626.00 634.00
## - Glucose                   1   691.67 699.67

# Final model using Logistic Regression
model.new <- glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, family = binomial, data = train.data)

summary(model.new)

## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, 
##     family = binomial, data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6551  -0.7302  -0.4086   0.7295   2.1683  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.161189   0.783300 -11.696  < 2e-16 ***
## Pregnancies               0.138323   0.030588   4.522 6.12e-06 ***
## Glucose                   0.035024   0.003797   9.223  < 2e-16 ***
## BMI                       0.100968   0.016547   6.102 1.05e-09 ***
## DiabetesPedigreeFunction  0.642931   0.328343   1.958   0.0502 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 804.39  on 614  degrees of freedom
## Residual deviance: 583.85  on 610  degrees of freedom
## AIC: 593.85
## 
## Number of Fisher Scoring iterations: 5

Logistic Regression – Parameter Tuning

Using the above full model, the optimal cut-off probability was identified using cross-validation. An asymmetric cost function was defined penalizing the model for wrongly identifying patients with Diabetes. Using the defined asymmetric cost function (2:1 for False Positives: False Negatives), the optimal cut-off probability was calculated. The optimal cut-off probability is the probability at which the model has the least misclassification rate.

# CV to choose cut-off probability
searchgrid = seq(0.01, 0.6, 0.02)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
  weight1 = 2
  weight0 = 1
  c1 = (r == 1) & (pi < pcut)  #logical vector - true if actual 1 but predict 0 (False Negative)
  c0 = (r == 0) & (pi > pcut)  #logical vector - true if actual 0 but predict 1 (False Positive)
  return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
  pcut <- result[i, 1]
  result[i, 2] <- cv.glm(data = train.data, glmfit = model.new, cost = cost1, 
                         K = 3)$delta[2]
}
plot(result, ylab = "CV Cost",main = "Optimal cut-off probability identification")

Plotting the cross-validated cost with different values of pcut resulted in the above figure. 0.29 was chosen as the optimal cut-off probability with a CV cost of 0.3420.

Classification Tree

A classification tree is used for modeling data when the response variable is categorical. The tree splits the data into two or more homogeneous sets based on the most significant differentiator in the predictor variables-value set. [7]

For this dataset, a classification tree was built for the binary response variable ‘Outcome’, the output of which can be seen below in Fig 14. An asymmetric cost (2:1 for False Positives: False Negatives) was used to build the tree.

tree.model <- rpart(Outcome~., data=train.data, method="class")
rpart.plot(tree.model)

Classification Tree - Parameter Tuning

The complexity parameter (Cp) of the tree was tuned using a reference of the Relative error VS Cp. The complexity parameter value after which the relative error does not decrease significantly was chosen as the Cp of the final tree. From Fig 15, using the Cp value of 0.015, the decision tree was pruned. The final decision tree can be seen in Fig 16.

plotcp(tree.model)

# Optimal cut-off value
# Pruning the tree
tree.model<- rpart(Outcome~., data=train.data, method="class",cp=0.015)
rpart.plot(tree.model)

Random Forest

Random forests grow many decision trees by training on different selections of the data and on a different selection of variables for each tree. Each tree classifies the decision variable by considering a random subset of the predictors and data. The random forest, in turn, combines the decision of the individual trees (mode) to make final decision. The mtry parameter controls the choice of the number of variables selected to build each tree.

#random forest
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.data, importance=TRUE)
print(rf.model)

## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train.data, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 25.04%
## Confusion matrix:
##     0   1 class.error
## 0 331  62   0.1577608
## 1  92 130   0.4144144

Random Forest - Parameter Tuning

The below figure plots the OOB error vs mtry,. From the figure, mtry equal to 3 gives the least OOB error.

The OOB error refers to the overall misclassification of both positive and negative classes. However, while the focus of the current prediction is to reduce the misclassification of the positive class, the confusion matrix was manually compared.

# Tune only mtry parameter using tuneRF()
set.seed(1234567)      
res <- tuneRF(x = subset(train.data, select = -Outcome),
              y = train.data$Outcome,
              ntreeTry = 500,
              plot = TRUE, tunecontrol = tune.control(cross = 5))

## mtry = 2  OOB error = 26.34% 
## Searching left ...
## mtry = 1     OOB error = 25.2% 
## 0.04320988 0.05 
## Searching right ...
## mtry = 4     OOB error = 25.85% 
## 0.01851852 0.05

res[res[, 2] == min(res[, 2]), 1]

## [1] 1

From the above table, as mtry at 4 gives the least value of positive class misclassification error, it was chosen as the optimal value and the random forest was rebuilt using this parameter.

#random forest
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.data, importance = TRUE, mtry = 4)
rf.model

## 
## Call:
##  randomForest(formula = Outcome ~ ., data = train.data, importance = TRUE,      mtry = 4) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 25.2%
## Confusion matrix:
##     0   1 class.error
## 0 325  68   0.1730280
## 1  87 135   0.3918919

Support Vector Machines

SVM is a technique that identifies a hyperplane that best separates the classes by maximizing the distance between the hyperplane and the data points. SVM focuses on two aspects – maximizing the minimum margin between the hyperplane and support vectors; and minimizing the misclassification rate. By using a larger cost function (for misclassification), SVM compromises on the minimum margin while correctly classifying the response variable.

svm.model <- svm(Outcome~., data = train.data,kernel = "radial", cost = 1, gamma = 0.1,probability = TRUE)

Support Vector Machines - Parameter Tuning

The cost parameter was tuned to identify the best model that minimizes the overall misclassification error rate. The cost of 0.3 and gamma of 0.125 gave the least value of misclassification making it the optimal parameter for the SVM model. The SVM model was rebuilt using the identified parameter.

#tuning svm
svm_tune <- tune(svm, train.x = train.data[,1:8], train.y = train.data[,9], kernel = "radial",
                  ranges = list(cost = seq(0.1,5,0.1)),tunecontrol = tune.control(cross=5),
                 parms = list(loss = matrix(c(0, 2, 1, 0), nrow = 2)))
svm_tune$best.model

## 
## Call:
## best.tune(method = svm, train.x = train.data[, 1:8], train.y = train.data[, 
##     9], ranges = list(cost = seq(0.1, 5, 0.1)), tunecontrol = tune.control(cross = 5), 
##     kernel = "radial", parms = list(loss = matrix(c(0, 2, 1, 
##         0), nrow = 2)))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  0.2 
##       gamma:  0.125 
## 
## Number of Support Vectors:  398

df<-data.frame(svm_tune$performances)
plot(df$cost,df$error,xlab="Cost",ylab="Misclassification Error")

svm.model <- svm(Outcome~., data = train.data,kernel = "radial", cost = 0.2, gamma = 0.125,
                 probability = TRUE)

Model Performance Evaluation

The performance of the models was compared across the in-sample training data (80% of the data) and the out-of-sample testing data (20% of the data). The below plots summarize the results of the model performance. The best model was chosen based on the performance in the out-of-sample data and sensitivity was chosen as the decision criterion.

1. Logistic Regression

# Logistic Regression Performance
par(mfrow = c(1,2))

pred.in <- predict(model.new,newdata = train.data,type = "response")
prediction.in <- ifelse(pred.in<0.29,0,1)
tab<-table(as.factor(train.data$Outcome),prediction.in)
roc.plot(train.data$Outcome=="1", pred.in)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8388923 1.171605e-44          NA

pred.out <- predict(model.new,newdata = test.data,type = "response")
prediction.out <- ifelse(pred.out<0.29,0,1)
table(as.factor(test.data$Outcome),prediction.out)

##    prediction.out
##      0  1
##   0 71 36
##   1  6 40

roc.plot(test.data$Outcome=="1", pred.out)$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8571719 1.346698e-12          NA

2. Classification Tree

# Classification Tree Performance
par(mfrow = c(1,2))

tree.predict.in<- predict(tree.model, train.data, type = "class")
tree.pred.in<- predict(tree.model, train.data, type = "prob")
confusionMatrix(train.data$Outcome, tree.predict.in)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 338  55
##          1  53 169
##                                          
##                Accuracy : 0.8244         
##                  95% CI : (0.792, 0.8537)
##     No Information Rate : 0.6358         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6201         
##  Mcnemar's Test P-Value : 0.9233         
##                                          
##             Sensitivity : 0.8645         
##             Specificity : 0.7545         
##          Pos Pred Value : 0.8601         
##          Neg Pred Value : 0.7613         
##              Prevalence : 0.6358         
##          Detection Rate : 0.5496         
##    Detection Prevalence : 0.6390         
##       Balanced Accuracy : 0.8095         
##                                          
##        'Positive' Class : 0              
##

roc.plot(train.data$Outcome== "1", tree.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area     p.value binorm.area
## 1 Model  1 0.8542741 3.19379e-51          NA

tree.predict<- predict(tree.model, test.data, type = "class")
tree.pred<- predict(tree.model, test.data, type = "prob")
confusionMatrix(test.data$Outcome, tree.predict)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 84 23
##          1  9 37
##                                           
##                Accuracy : 0.7908          
##                  95% CI : (0.7178, 0.8523)
##     No Information Rate : 0.6078          
##     P-Value [Acc > NIR] : 1.054e-06       
##                                           
##                   Kappa : 0.5423          
##  Mcnemar's Test P-Value : 0.02156         
##                                           
##             Sensitivity : 0.9032          
##             Specificity : 0.6167          
##          Pos Pred Value : 0.7850          
##          Neg Pred Value : 0.8043          
##              Prevalence : 0.6078          
##          Detection Rate : 0.5490          
##    Detection Prevalence : 0.6993          
##       Balanced Accuracy : 0.7599          
##                                           
##        'Positive' Class : 0               
##

roc.plot(test.data$Outcome== "1", tree.pred[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8476229 1.345129e-12          NA

3. Random Forest

# Random Forest Performance
par(mfrow = c(1,2))

rf.predict.in <- predict(rf.model, train.data)
rf.pred.in <- predict(rf.model, train.data,type="prob")
confusionMatrix(train.data$Outcome, rf.predict.in)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 393   0
##          1   0 222
##                                     
##                Accuracy : 1         
##                  95% CI : (0.994, 1)
##     No Information Rate : 0.639     
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.639     
##          Detection Rate : 0.639     
##    Detection Prevalence : 0.639     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : 0         
##

roc.plot(train.data$Outcome == "1", rf.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model Area      p.value binorm.area
## 1 Model  1    1 1.036689e-94          NA

rf.model<- randomForest(Outcome~., data = train.data, importance=TRUE)
rf.predict.out <- predict(rf.model, test.data)
rf.pred.out <- predict(rf.model, test.data,type="prob")
confusionMatrix(test.data$Outcome, rf.predict.out)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 93 14
##          1 15 31
##                                           
##                Accuracy : 0.8105          
##                  95% CI : (0.7393, 0.8692)
##     No Information Rate : 0.7059          
##     P-Value [Acc > NIR] : 0.002179        
##                                           
##                   Kappa : 0.5465          
##  Mcnemar's Test P-Value : 1.000000        
##                                           
##             Sensitivity : 0.8611          
##             Specificity : 0.6889          
##          Pos Pred Value : 0.8692          
##          Neg Pred Value : 0.6739          
##              Prevalence : 0.7059          
##          Detection Rate : 0.6078          
##    Detection Prevalence : 0.6993          
##       Balanced Accuracy : 0.7750          
##                                           
##        'Positive' Class : 0               
##

roc.plot(test.data$Outcome== "1", rf.pred.out[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8465055 5.840181e-12          NA

4. Support Vector Machines

# SVM Performance
par(mfrow = c(1,2))

svm.predict.in <- predict(svm.model, train.data)
svm.pred.in <- predict(svm.model,train.data,probability = TRUE)
confusionMatrix(train.data$Outcome, svm.predict.in)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 355  38
##          1  87 135
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7627, 0.8279)
##     No Information Rate : 0.7187          
##     P-Value [Acc > NIR] : 5.589e-06       
##                                           
##                   Kappa : 0.5372          
##  Mcnemar's Test P-Value : 1.761e-05       
##                                           
##             Sensitivity : 0.8032          
##             Specificity : 0.7803          
##          Pos Pred Value : 0.9033          
##          Neg Pred Value : 0.6081          
##              Prevalence : 0.7187          
##          Detection Rate : 0.5772          
##    Detection Prevalence : 0.6390          
##       Balanced Accuracy : 0.7918          
##                                           
##        'Positive' Class : 0               
##

roc.plot(train.data$Outcome== "1", attr(svm.pred.in, "probabilities")[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8552369 7.268933e-49          NA

svm.predict <- predict(svm.model, test.data)
svm.pred <- predict(svm.model,test.data,probability = TRUE)
confusionMatrix(test.data$Outcome, svm.predict)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 98  9
##          1 20 26
##                                           
##                Accuracy : 0.8105          
##                  95% CI : (0.7393, 0.8692)
##     No Information Rate : 0.7712          
##     P-Value [Acc > NIR] : 0.14427         
##                                           
##                   Kappa : 0.5163          
##  Mcnemar's Test P-Value : 0.06332         
##                                           
##             Sensitivity : 0.8305          
##             Specificity : 0.7429          
##          Pos Pred Value : 0.9159          
##          Neg Pred Value : 0.5652          
##              Prevalence : 0.7712          
##          Detection Rate : 0.6405          
##    Detection Prevalence : 0.6993          
##       Balanced Accuracy : 0.7867          
##                                           
##        'Positive' Class : 0               
##

roc.plot(test.data$Outcome== "1", attr(svm.pred, "probabilities")[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8655018 4.153473e-13          NA

Comparing all model AUCs

par(mfrow = c(1,2))

# Plot multiple ROC curves in one - Logistic, GAM, Tree and Random Forest
# Compare different models using ROC curves
# In-sample ROC
rocplot1 <- roc.plot(x = (train.data$Outcome == "1"), 
                     pred = cbind(pred.in, tree.pred.in[,2], rf.pred.in[,2], attr(svm.pred.in, "probabilities")[,2]),
                     main = "ROC curve: In-sample",
                     legend = T, leg.text = c("GLM","Tree", "Random Forest", "SVM"))

# Validation set ROC
rocplot2 <- roc.plot(x = (test.data$Outcome == "1"), 
                     pred = cbind(pred.out, tree.pred[,2], rf.pred.out[,2], attr(svm.pred, "probabilities")[,2]),
                     main = "ROC curve: Out-of-sample",
                     legend = T, leg.text = c("GLM", "Tree", "Random Forest", "SVM"))

rocplot1$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8388923 1.171605e-44          NA
## 2 Model  2 0.8542741 3.193790e-51          NA
## 3 Model  3 1.0000000 1.036689e-94          NA
## 4 Model  4 0.8552369 7.268933e-49          NA

rocplot2$roc.vol

##      Model      Area      p.value binorm.area
## 1 Model  1 0.8571719 1.346698e-12          NA
## 2 Model  2 0.8476229 1.345129e-12          NA
## 3 Model  3 0.8465055 5.840181e-12          NA
## 4 Model  4 0.8655018 4.153473e-13          NA

Cross-Validation to determine best performing model

The above results were reported after building the models on a single split of the dataset. To eliminate the effects of randomness due to the split of the data, a 5-fold cross validation was used to determine the best model (highest cross-validated sensitivity). The below result summarizes the cross-validated sensitivity of the evaluated models.

## [1] 0.7984117

## [1] 0.7715438

## [1] 0.59497

## [1] 0.5059086

The cross validated sensitivity for the above models were respectively, 0.7984117 for Logistic Regression, 0.7715438 for Classification Tree, 0.59497 for Random Forest and 0.5059086 for SVM.

The Logistic regression model was chosen as the best model for predicting the occurrence of diabetes in PIMA women, as it reported the highest cross-validated sensitivity.

Conclusion

The PIMA Indian Women’s Database was analyzed and explored in detail. The patterns identified using Data exploration methods were validated using the modeling techniques employed. Classification models such as Logistic Regression, Classification Trees, Random Forest and SVM were built and evaluated to identify best model to predict the occurrence of Diabetes in PIMA Indian women. From the cross-validated performance measure of sensitivity, the Logistic Regression model was concluded as the best performing model.

References

https://en.wikipedia.org/wiki/Diabetes_mellitus
http://www.diabetes.org/diabetes-basics/statistics/
http://www.diabetes.org/living-with-diabetes/treatment-and-care/women/
http://www.diabetes.org/assets/pdfs/basics/cdc-statistics-report-2017.pdf
WHO technical report series (1985). Technical Report 727, WHO Study Group
http://www.personal.kent.edu/~mshanker/personal/Zip_files/sar_2000.pdf
https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/