Diabetes mellitus is a group of metabolic disorders in which blood sugar levels remain higher than normal for prolonged periods of time [1]. Diabetes arises either from insufficient production of insulin in the body or from an improper response of the body’s cells to insulin. The former is also called Type 1 DM or insulin-dependent Diabetes mellitus, and the latter is known as Type 2 DM or non-insulin-dependent DM. Gestational Diabetes is a third type, in which women without DM develop high blood sugar levels during pregnancy. In the United States, 30.3 million Americans were recorded as suffering from Diabetes, with 1.5 million diagnosed every year. The total cost of diagnosed Diabetes in the US in 2017 was $327 billion [2]. Diabetes is especially hard on women, as it can affect both the mother and her unborn child during pregnancy. Women with Diabetes have a higher likelihood of heart attacks, miscarriages, or babies born with birth defects [3].
Given the rising incidence of diabetes and prediabetes, correctly identifying the factors that contribute to the occurrence of Diabetes, especially in women, is a pressing issue in the health care industry. Secondary research suggests that BMI, blood pressure, cholesterol, and glucose levels are important contributing factors; in women, pregnancy appears to be an additional factor. According to the World Health Organization, people with a 2-hour post-load plasma glucose level of at least 200 mg/dl (11.1 mmol/l) at any survey examination are diagnosed with Diabetes [5]. To validate the above hypotheses, identify additional risk factors, and build tools that can predict the occurrence of Diabetes, particularly in women, the Pima Indians Diabetes dataset was chosen.
The diabetes data, containing information about Pima Indian females near Phoenix, Arizona, has been under continuous study since 1965 due to the high incidence rate of Diabetes in Pima females. The dataset was originally published by the National Institute of Diabetes and Digestive and Kidney Diseases and consists of diagnostic measurements for females older than 20. It contains information on 768 females, of whom 268 were diagnosed with Diabetes. The information available includes 8 predictor variables, such as Age, Number of Pregnancies, Glucose, and Insulin. A more detailed description of the variables is listed in the table below. The response variable, Outcome, is a binary indicator of whether the person was diagnosed with Diabetes.
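The analysis below was carried out in R. The report’s own setup chunk is not shown; the following sketch lists the packages its code relies on and one plausible way the data could be loaded (the file name diabetes.csv is an assumption):
# Packages assumed by the code in this report (the original setup chunk is not shown)
library(ggplot2)       # plots
library(knitr)         # kable() tables
library(tibble)        # as_data_frame()
library(DMwR)          # knnImputation()
library(corrplot)      # corrplot.mixed()
library(boot)          # cv.glm()
library(rpart)         # classification trees
library(rpart.plot)    # rpart.plot()
library(randomForest)  # randomForest(), tuneRF()
library(e1071)         # svm(), tune()
library(caret)         # confusionMatrix()
library(verification)  # roc.plot()
# Load the Pima Indians Diabetes data; the file name is an assumption
data <- read.csv("diabetes.csv")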
Details about the Variables in the Dataset are provided below.
variable.type <- sapply(data, class)
variable.description <- c("Number of times pregnant",
"Plasma glucose concentration at 2 hours in an oral glucose tolerance test",
"Diastolic blood pressure (mm Hg)", "Triceps skin fold thickness (mm)",
"2-hour serum insulin (µU/ml)", "Body Mass Index (kg/m^2)",
"Synthesis of the history of Diabetes Mellitus in relatives, and the genetic relationship of those relatives to the subject",
"Age of the individual (years)", "Occurrence of Diabetes (0 = no, 1 = yes)")
variable.name <- colnames(data)
datadesc <- as_data_frame(cbind(variable.name, variable.type, variable.description))
colnames(datadesc) <- c("Variable Name", "Data Type", "Variable Description")
kable(datadesc)
Variable Name | Data Type | Variable Description |
---|---|---|
Pregnancies | integer | Number of times pregnant |
Glucose | integer | Plasma glucose concentration at 2 hours in an oral glucose tolerance test |
BloodPressure | integer | Diastolic blood pressure (mm Hg) |
SkinThickness | integer | Triceps skin fold thickness (mm) |
Insulin | integer | 2-hour serum insulin (µU/ml) |
BMI | numeric | Body Mass Index (kg/m^2) |
DiabetesPedigreeFunction | numeric | Synthesis of the history of Diabetes Mellitus in relatives, and the genetic relationship of those relatives to the subject |
Age | integer | Age of the individual (years) |
Outcome | integer | Occurrence of Diabetes (0 = no, 1 = yes) |
At first glance, the dataset appeared to be clean. On deeper analysis, however, it revealed many abnormal values for biological measures. Variables such as Skin Thickness and Insulin had 227 and 374 zero values respectively. Since neither measure can physically be zero, the zeros evidently encode missing values. Rows containing at least one such missing value accounted for 376 of the 768 observations (nearly half). As removing these rows would result in significant information loss, kNN imputation was performed to impute the missing values. Only obviously wrong values (zeros) were imputed; large outliers in the variables were not handled.
# Dealing with zeros
missing_data <- data[,setdiff(names(data), c('Outcome', 'Pregnancies'))]
features_miss_num <- apply(missing_data, 2, function(x) sum(x <= 0))
features_miss <- names(missing_data)[ features_miss_num > 0]
rows_miss <- apply(missing_data, 1, function(x) sum(x <= 0) >= 1)
sum(rows_miss)
## [1] 376
missing_data[missing_data <= 0] <- NA
data[, names(missing_data)] <- missing_data
# KNN imputation
orig_data <- data
colSums(is.na(data))
## Pregnancies Glucose BloodPressure
## 0 5 35
## SkinThickness Insulin BMI
## 227 374 11
## DiabetesPedigreeFunction Age Outcome
## 0 0 0
# Impute every variable except Age (col 8) and Outcome (col 9) using 5 nearest neighbours
data[, c(-8, -9)] <- knnImputation(data[, c(-8, -9)], k = 5)
Outcome
The dataset has 268 women who were diagnosed with Diabetes and 500 women who were not. The sample has a high occurrence (34%) of positive Diabetes records.
data$Outcome <- factor(data$Outcome)
# 1. Outcome
ggplot(data,aes(Outcome,fill = Outcome)) +
geom_bar() +
ggtitle("Distribution of Outcome variable")
Pregnancies
From the figures below, it is evident that women diagnosed with Diabetes have had, on average, more pregnancies than women who were not (in this dataset). However, the bar chart alone shows no clear relationship between the number of pregnancies and the occurrence of Diabetes.
# 2. Pregnancies
p1 <- ggplot(data, aes(x = Outcome, y = Pregnancies,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Number of pregnancies Vs Diabetes")
p2 <- ggplot(data,aes(x = Pregnancies,fill = factor(Outcome))) +
geom_bar(position = "dodge") +
scale_x_continuous(limits = c(0,16)) +
theme(legend.position = "bottom") +
labs(title = "Pregnancies Vs Outcome")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Glucose
From the figures below, there is a clear difference in glucose levels between women who have been diagnosed with Diabetes and those who haven’t. While the density plot indicates some overlap in the glucose levels of the two groups, the separation suggests that Glucose could be a good predictor of the response.
# 3. Glucose
p2 <- ggplot(data, aes(x = Glucose, color = Outcome, fill = Outcome)) +
geom_density(alpha = 0.8) +
theme(legend.position = "bottom") +
labs(x = "Glucose", y = "Density", title = "Density plot of glucose")
p1 <- ggplot(data, aes(x = Outcome, y = Glucose,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of glucose in women Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
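One way to quantify the separation visible in these plots is to compare group medians (an illustrative check, not part of the original analysis):
tapply(data$Glucose, data$Outcome, median)  # median 2-hour plasma glucose by Outcome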
Blood Pressure
The plots below show no clear difference between women with and without Diabetes, suggesting that Blood Pressure might not be a good predictor of the response variable.
# 4. Blood Pressure
p2 <- ggplot(data, aes(x = BloodPressure, color = Outcome, fill = Outcome)) +
geom_density(alpha = 0.8) +
theme(legend.position = "bottom") +
labs(x = "Blood pressure", y = "Density", title = "Density plot of Blood pressure")
p1 <- ggplot(data, aes(x = Outcome, y = BloodPressure,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of blood pressure in women Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Skin Thickness
The plots below likewise show no clear difference between women with and without Diabetes, suggesting that Skin Thickness might not be a good predictor of the response variable.
# 5. Skin Thickness
p2 <- ggplot(data, aes(x = SkinThickness, color = Outcome, fill = Outcome)) +
geom_density(alpha = 0.8) +
theme(legend.position = "bottom") +
labs(x = "Skin thickness", y = "Density", title = "Density plot of skin thickness")
p1 <- ggplot(data, aes(x = Outcome, y = SkinThickness,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of skin thickness Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Insulin
Again, no clear difference is observed between the two groups of women in the data, indicating that Insulin may not be a good predictor of the response variable.
# 6. Insulin
p1 <- ggplot(data, aes(x = Outcome, y = Insulin,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of Insulin content Vs Diabetes")
p2 <- ggplot(data, aes(Insulin, fill = Outcome)) +
geom_histogram(binwidth=10) +
theme(legend.position = "bottom") +
ggtitle("Variation of Insulin content Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
BMI
The plots below show that all of the women who had Diabetes had a BMI greater than 25, which is above the normal range. In contrast, women who did not have Diabetes had BMIs ranging from about 18 to 60.
# 7. BMI
p2 <- ggplot(data, aes(BMI, fill = Outcome)) +
geom_histogram() +
theme(legend.position = "bottom") +
ggtitle("Variation of BMI of women Vs Diabetes")
p1 <- ggplot(data, aes(x = Outcome, y = BMI,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of BMI of women Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
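The claim that every diabetic woman in the sample has a BMI above 25 can be verified directly on the imputed data (an illustrative check):
range(data$BMI[data$Outcome == 1])  # the lower bound should exceed 25 if the claim holds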
Diabetes Pedigree Function
From the plots below, no clear difference can be seen between women with and without Diabetes, suggesting that the Diabetes Pedigree Function (DPF) might not be a good predictor of the response variable.
# 8. DPF
p1 <- ggplot(data, aes(x = Outcome, y = DiabetesPedigreeFunction,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of DPF of women Vs Diabetes")
p2 <- ggplot(data, aes(DiabetesPedigreeFunction,fill = Outcome)) +
geom_histogram() +
theme(legend.position = "bottom") +
ggtitle("Variation of DPF Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Age
Again, no clear distinction is seen in the distribution of the ‘Age’ variable between women who have Diabetes and those who don’t.
# 9. Age
p2 <- ggplot(data, aes(Age, fill = Outcome)) +
geom_histogram(binwidth = 5) +
theme(legend.position = "bottom") +
ggtitle("Variation of Age of women Vs Diabetes")
p1 <- ggplot(data, aes(x = Outcome, y = Age,fill = Outcome)) +
geom_boxplot() +
theme(legend.position = "bottom") +
ggtitle("Variation of Age of women Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Bivariate relationships among the predictors were examined using the scatterplots below.
p1 <- ggplot(data, aes(x = Age, y = Pregnancies)) +
geom_point(aes(color=Outcome)) +
theme(legend.position = "bottom") +
ggtitle("Relationship of Pregnancies with Age Vs Diabetes")
p2 <- ggplot(data,aes(x=Insulin,y=Glucose))+
geom_point(aes(color=Outcome))+
theme(legend.position = "bottom") +
ggtitle("Relationship of Insulin with Glucose Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
p1 <- ggplot(data,aes(x=BMI,y=BloodPressure))+
geom_point(aes(color=Outcome))+
theme(legend.position = "bottom") +
ggtitle("Relationship of BMI with BP Vs Diabetes")
p2 <- ggplot(data,aes(x=BMI,y=SkinThickness))+
geom_point(aes(color=Outcome))+
theme(legend.position = "bottom") +
ggtitle("Relationship of BMI with Skin Thickness Vs Diabetes")
gridExtra::grid.arrange(p1, p2, ncol = 2)
Subsequently, a correlation plot was drawn for all the numerical variables to quantify their pairwise linear associations. Consistent with the bivariate scatterplots, Insulin and Glucose, and BMI and Skin Thickness, showed moderate-to-high linear correlation.
corMat <- cor(data[, -9])  # exclude the factor Outcome (column 9)
diag(corMat) <- 0          # remove self-correlations
corrplot.mixed(corMat, tl.pos = "lt")
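The two strongest pairs called out above can be read off the matrix directly (illustrative):
corMat["Glucose", "Insulin"]    # Glucose vs 2-hour serum insulin
corMat["BMI", "SkinThickness"]  # BMI vs triceps skin fold thickness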
To build a predictive model that classifies women with or without Diabetes, the following modeling approach was used.
# Create training and testing set
set.seed(12345)
ratio = sample(1:nrow(data), size = 0.20*nrow(data))
test.data = data[ratio,] #Test dataset 20% of total
train.data = data[-ratio,] #Train dataset 80% of total
A full model was built with Outcome as the response variable and the remaining 8 variables as predictors. Stepwise variable selection with AIC as the criterion was then used to identify the most important variables. The final model generated was a logistic regression with the lowest AIC value of 593.85, as shown below.
# Logistic Regression
model.glm <- glm(Outcome~.,data=train.data,family = binomial)
step_model <- step(model.glm)
## Start: AIC=600.7
## Outcome ~ Pregnancies + Glucose + BloodPressure + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - SkinThickness 1 582.81 598.81
## - Insulin 1 582.97 598.97
## - BloodPressure 1 582.99 598.99
## - Age 1 583.34 599.34
## <none> 582.70 600.70
## - DiabetesPedigreeFunction 1 586.20 602.20
## - Pregnancies 1 595.66 611.66
## - BMI 1 606.83 622.83
## - Glucose 1 646.98 662.98
##
## Step: AIC=598.81
## Outcome ~ Pregnancies + Glucose + BloodPressure + Insulin + BMI +
## DiabetesPedigreeFunction + Age
##
## Df Deviance AIC
## - Insulin 1 583.09 597.09
## - BloodPressure 1 583.10 597.10
## - Age 1 583.42 597.42
## <none> 582.81 598.81
## - DiabetesPedigreeFunction 1 586.24 600.24
## - Pregnancies 1 595.67 609.67
## - BMI 1 619.62 633.62
## - Glucose 1 646.98 660.98
##
## Step: AIC=597.09
## Outcome ~ Pregnancies + Glucose + BloodPressure + BMI + DiabetesPedigreeFunction +
## Age
##
## Df Deviance AIC
## - BloodPressure 1 583.40 595.40
## - Age 1 583.70 595.70
## <none> 583.09 597.09
## - DiabetesPedigreeFunction 1 586.79 598.79
## - Pregnancies 1 596.06 608.06
## - BMI 1 624.11 636.11
## - Glucose 1 683.69 695.69
##
## Step: AIC=595.4
## Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction +
## Age
##
## Df Deviance AIC
## - Age 1 583.85 593.85
## <none> 583.40 595.40
## - DiabetesPedigreeFunction 1 587.13 597.13
## - Pregnancies 1 596.19 606.19
## - BMI 1 625.92 635.92
## - Glucose 1 683.75 693.75
##
## Step: AIC=593.85
## Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction
##
## Df Deviance AIC
## <none> 583.85 593.85
## - DiabetesPedigreeFunction 1 587.72 595.72
## - Pregnancies 1 605.27 613.27
## - BMI 1 626.00 634.00
## - Glucose 1 691.67 699.67
# Final model using Logistic Regression
model.new <- glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction, family = binomial, data = train.data)
summary(model.new)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
## family = binomial, data = train.data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6551 -0.7302 -0.4086 0.7295 2.1683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.161189 0.783300 -11.696 < 2e-16 ***
## Pregnancies 0.138323 0.030588 4.522 6.12e-06 ***
## Glucose 0.035024 0.003797 9.223 < 2e-16 ***
## BMI 0.100968 0.016547 6.102 1.05e-09 ***
## DiabetesPedigreeFunction 0.642931 0.328343 1.958 0.0502 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 804.39 on 614 degrees of freedom
## Residual deviance: 583.85 on 610 degrees of freedom
## AIC: 593.85
##
## Number of Fisher Scoring iterations: 5
Using the final model above, the optimal cut-off probability was identified using cross-validation. An asymmetric cost function was defined, penalizing the model more heavily for failing to identify patients who have Diabetes. Using this asymmetric cost (2:1 for False Negatives relative to False Positives), the optimal cut-off probability was calculated as the probability at which the model attains the lowest cross-validated cost.
# CV to choose cut-off probability
searchgrid = seq(0.01, 0.6, 0.02)
result = cbind(searchgrid, NA)
cost1 <- function(r, pi) {
weight1 = 2
weight0 = 1
c1 = (r == 1) & (pi < pcut) #logical vector - true if actual 1 but predict 0 (False Negative)
c0 = (r == 0) & (pi > pcut) #logical vector - true if actual 0 but predict 1 (False Positive)
return(mean(weight1 * c1 + weight0 * c0))
}
for (i in 1:length(searchgrid)) {
pcut <- result[i, 1]
result[i, 2] <- cv.glm(data = train.data, glmfit = model.new, cost = cost1,
K = 3)$delta[2]
}
plot(result, ylab = "CV Cost",main = "Optimal cut-off probability identification")
Plotting the cross-validated cost against different values of pcut produced the figure above; 0.29 was chosen as the optimal cut-off probability, with a CV cost of 0.3420.
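The optimum can also be read off programmatically from the search grid (a quick check consistent with the figure):
result[which.min(result[, 2]), ]  # cut-off with the lowest CV cost (about 0.29, cost about 0.342)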
A classification tree is used for modeling data when the response variable is categorical. The tree recursively splits the data into two or more increasingly homogeneous subsets based on the most significant splits over the predictor variables [7].
For this dataset, a classification tree was built for the binary response variable ‘Outcome’; the output can be seen below in Fig 14. An asymmetric cost (2:1 for False Negatives relative to False Positives) was used to build the tree.
tree.model <- rpart(Outcome~., data = train.data, method = "class",
                    parms = list(loss = matrix(c(0, 2, 1, 0), nrow = 2)))  # FN cost 2, FP cost 1
rpart.plot(tree.model)
The complexity parameter (Cp) of the tree was tuned by inspecting the plot of relative error against Cp. The Cp value beyond which the relative error no longer decreases significantly was chosen for the final tree. Based on Fig 15, the tree was rebuilt with a Cp value of 0.015, effectively pruning it; the final decision tree can be seen in Fig 16.
plotcp(tree.model)
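The same information can be inspected numerically via the Cp table; below is a sketch of choosing the Cp that minimizes the cross-validated error (the report instead read 0.015 off the plot):
printcp(tree.model)  # Cp table, including the cross-validated error (xerror)
# Cp value with the lowest cross-validated error:
tree.model$cptable[which.min(tree.model$cptable[, "xerror"]), "CP"]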
# Pruning the tree with the chosen Cp
tree.model <- rpart(Outcome~., data = train.data, method = "class", cp = 0.015,
                    parms = list(loss = matrix(c(0, 2, 1, 0), nrow = 2)))
rpart.plot(tree.model)
Random forests grow many decision trees, each trained on a bootstrap sample of the data and, at every split, on a random subset of the predictor variables. Each tree votes for a class, and the forest combines the individual trees’ votes (taking the mode) to make the final decision. The mtry parameter controls the number of variables considered at each split.
#random forest
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.data, importance=TRUE)
print(rf.model)
##
## Call:
## randomForest(formula = Outcome ~ ., data = train.data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 25.04%
## Confusion matrix:
## 0 1 class.error
## 0 331 62 0.1577608
## 1 92 130 0.4144144
The figure below plots the OOB error against mtry; from the tuning output, mtry equal to 1 gives the least overall OOB error.
The OOB error, however, reflects the overall misclassification of both positive and negative classes. Since the focus of the current prediction is to reduce misclassification of the positive class, the confusion matrices of the candidate mtry values were also compared manually, as sketched after the tuning output below.
# Tune only the mtry parameter using tuneRF()
set.seed(1234567)
res <- tuneRF(x = subset(train.data, select = -Outcome),
              y = train.data$Outcome,
              ntreeTry = 500,
              plot = TRUE)
## mtry = 2 OOB error = 26.34%
## Searching left ...
## mtry = 1 OOB error = 25.2%
## 0.04320988 0.05
## Searching right ...
## mtry = 4 OOB error = 25.85%
## 0.01851852 0.05
res[res[, 2] == min(res[, 2]), 1]
## [1] 1
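Because tuneRF() reports only the overall OOB error, the positive-class errors were compared manually. A sketch of how that comparison could look (this loop is an illustration, not the report’s original code):
# Compare the OOB error of the positive class (Outcome == 1) across candidate mtry values
for (m in c(1, 2, 4)) {
  set.seed(1234567)
  fit <- randomForest(Outcome ~ ., data = train.data, mtry = m, ntree = 500)
  cat("mtry =", m, "positive-class OOB error =",
      round(fit$confusion["1", "class.error"], 4), "\n")
}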
Although mtry = 1 minimizes the overall OOB error in the table above, mtry = 4 gave the least positive-class misclassification error in the manual comparison, so it was chosen as the optimal value and the random forest was rebuilt using this parameter.
#random forest
set.seed(1234567)
rf.model <- randomForest(Outcome~., data = train.data, importance = TRUE, mtry = 4)
rf.model
##
## Call:
## randomForest(formula = Outcome ~ ., data = train.data, importance = TRUE, mtry = 4)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 25.2%
## Confusion matrix:
## 0 1 class.error
## 0 325 68 0.1730280
## 1 87 135 0.3918919
SVM is a technique that identifies the hyperplane that best separates the classes by maximizing the distance between the hyperplane and the nearest data points. SVM balances two objectives: maximizing the minimum margin between the hyperplane and the support vectors, and minimizing the misclassification rate. With a larger misclassification cost, SVM sacrifices some of the margin in order to classify more of the training observations correctly.
svm.model <- svm(Outcome~., data = train.data,kernel = "radial", cost = 1, gamma = 0.1,probability = TRUE)
The cost parameter was tuned to identify the model that minimizes the overall misclassification error rate. A cost of 0.2 and a gamma of 0.125 gave the least misclassification, making them the optimal parameters for the SVM model. The SVM model was rebuilt using these parameters.
# tuning svm
# note: svm() has no 'parms' argument, so the loss matrix below is silently ignored;
# asymmetric misclassification costs in e1071 are specified via class.weights
svm_tune <- tune(svm, train.x = train.data[,1:8], train.y = train.data[,9], kernel = "radial",
                 ranges = list(cost = seq(0.1,5,0.1)), tunecontrol = tune.control(cross = 5),
                 parms = list(loss = matrix(c(0, 2, 1, 0), nrow = 2)))
svm_tune$best.model
##
## Call:
## best.tune(method = svm, train.x = train.data[, 1:8], train.y = train.data[,
## 9], ranges = list(cost = seq(0.1, 5, 0.1)), tunecontrol = tune.control(cross = 5),
## kernel = "radial", parms = list(loss = matrix(c(0, 2, 1,
## 0), nrow = 2)))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 0.2
## gamma: 0.125
##
## Number of Support Vectors: 398
df<-data.frame(svm_tune$performances)
plot(df$cost,df$error,xlab="Cost",ylab="Misclassification Error")
svm.model <- svm(Outcome~., data = train.data,kernel = "radial", cost = 0.2, gamma = 0.125,
probability = TRUE)
The performance of the models was compared on the in-sample training data (80% of the data) and the out-of-sample testing data (20% of the data). The plots below summarize the model performance. The best model was chosen based on its out-of-sample performance, with sensitivity as the decision criterion.
1. Logistic Regression
# Logistic Regression Performance
par(mfrow = c(1,2))
pred.in <- predict(model.new,newdata = train.data,type = "response")
prediction.in <- ifelse(pred.in<0.29,0,1)
tab<-table(as.factor(train.data$Outcome),prediction.in)
roc.plot(train.data$Outcome=="1", pred.in)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8388923 1.171605e-44 NA
pred.out <- predict(model.new,newdata = test.data,type = "response")
prediction.out <- ifelse(pred.out<0.29,0,1)
table(as.factor(test.data$Outcome),prediction.out)
## prediction.out
## 0 1
## 0 71 36
## 1 6 40
roc.plot(test.data$Outcome=="1", pred.out)$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8571719 1.346698e-12 NA
2. Classification Tree
# Classification Tree Performance
par(mfrow = c(1,2))
tree.predict.in<- predict(tree.model, train.data, type = "class")
tree.pred.in<- predict(tree.model, train.data, type = "prob")
confusionMatrix(train.data$Outcome, tree.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 338 55
## 1 53 169
##
## Accuracy : 0.8244
## 95% CI : (0.792, 0.8537)
## No Information Rate : 0.6358
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6201
## Mcnemar's Test P-Value : 0.9233
##
## Sensitivity : 0.8645
## Specificity : 0.7545
## Pos Pred Value : 0.8601
## Neg Pred Value : 0.7613
## Prevalence : 0.6358
## Detection Rate : 0.5496
## Detection Prevalence : 0.6390
## Balanced Accuracy : 0.8095
##
## 'Positive' Class : 0
##
roc.plot(train.data$Outcome== "1", tree.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8542741 3.19379e-51 NA
tree.predict<- predict(tree.model, test.data, type = "class")
tree.pred<- predict(tree.model, test.data, type = "prob")
confusionMatrix(test.data$Outcome, tree.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 84 23
## 1 9 37
##
## Accuracy : 0.7908
## 95% CI : (0.7178, 0.8523)
## No Information Rate : 0.6078
## P-Value [Acc > NIR] : 1.054e-06
##
## Kappa : 0.5423
## Mcnemar's Test P-Value : 0.02156
##
## Sensitivity : 0.9032
## Specificity : 0.6167
## Pos Pred Value : 0.7850
## Neg Pred Value : 0.8043
## Prevalence : 0.6078
## Detection Rate : 0.5490
## Detection Prevalence : 0.6993
## Balanced Accuracy : 0.7599
##
## 'Positive' Class : 0
##
roc.plot(test.data$Outcome== "1", tree.pred[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8476229 1.345129e-12 NA
3. Random Forest
# Random Forest Performance
par(mfrow = c(1,2))
rf.predict.in <- predict(rf.model, train.data)
rf.pred.in <- predict(rf.model, train.data,type="prob")
confusionMatrix(train.data$Outcome, rf.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 393 0
## 1 0 222
##
## Accuracy : 1
## 95% CI : (0.994, 1)
## No Information Rate : 0.639
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.639
## Detection Rate : 0.639
## Detection Prevalence : 0.639
## Balanced Accuracy : 1.000
##
## 'Positive' Class : 0
##
roc.plot(train.data$Outcome == "1", rf.pred.in[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 1 1.036689e-94 NA
# Refit the forest (note: with the default mtry) before out-of-sample evaluation
rf.model <- randomForest(Outcome~., data = train.data, importance = TRUE)
rf.predict.out <- predict(rf.model, test.data)
rf.pred.out <- predict(rf.model, test.data,type="prob")
confusionMatrix(test.data$Outcome, rf.predict.out)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 93 14
## 1 15 31
##
## Accuracy : 0.8105
## 95% CI : (0.7393, 0.8692)
## No Information Rate : 0.7059
## P-Value [Acc > NIR] : 0.002179
##
## Kappa : 0.5465
## Mcnemar's Test P-Value : 1.000000
##
## Sensitivity : 0.8611
## Specificity : 0.6889
## Pos Pred Value : 0.8692
## Neg Pred Value : 0.6739
## Prevalence : 0.7059
## Detection Rate : 0.6078
## Detection Prevalence : 0.6993
## Balanced Accuracy : 0.7750
##
## 'Positive' Class : 0
##
roc.plot(test.data$Outcome== "1", rf.pred.out[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8465055 5.840181e-12 NA
4. Support Vector Machines
# SVM Performance
par(mfrow = c(1,2))
svm.predict.in <- predict(svm.model, train.data)
svm.pred.in <- predict(svm.model,train.data,probability = TRUE)
confusionMatrix(train.data$Outcome, svm.predict.in)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 355 38
## 1 87 135
##
## Accuracy : 0.7967
## 95% CI : (0.7627, 0.8279)
## No Information Rate : 0.7187
## P-Value [Acc > NIR] : 5.589e-06
##
## Kappa : 0.5372
## Mcnemar's Test P-Value : 1.761e-05
##
## Sensitivity : 0.8032
## Specificity : 0.7803
## Pos Pred Value : 0.9033
## Neg Pred Value : 0.6081
## Prevalence : 0.7187
## Detection Rate : 0.5772
## Detection Prevalence : 0.6390
## Balanced Accuracy : 0.7918
##
## 'Positive' Class : 0
##
roc.plot(train.data$Outcome== "1", attr(svm.pred.in, "probabilities")[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8552369 7.268933e-49 NA
svm.predict <- predict(svm.model, test.data)
svm.pred <- predict(svm.model,test.data,probability = TRUE)
confusionMatrix(test.data$Outcome, svm.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 98 9
## 1 20 26
##
## Accuracy : 0.8105
## 95% CI : (0.7393, 0.8692)
## No Information Rate : 0.7712
## P-Value [Acc > NIR] : 0.14427
##
## Kappa : 0.5163
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.8305
## Specificity : 0.7429
## Pos Pred Value : 0.9159
## Neg Pred Value : 0.5652
## Prevalence : 0.7712
## Detection Rate : 0.6405
## Detection Prevalence : 0.6993
## Balanced Accuracy : 0.7867
##
## 'Positive' Class : 0
##
roc.plot(test.data$Outcome== "1", attr(svm.pred, "probabilities")[,2], ylab = "True Positive Rate", xlab = "False Positive Rate")$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8655018 4.153473e-13 NA
Comparing all model AUCs
par(mfrow = c(1,2))
# Plot multiple ROC curves in one - Logistic, GAM, Tree and Random Forest
# Compare different models using ROC curves
# In-sample ROC
rocplot1 <- roc.plot(x = (train.data$Outcome == "1"),
pred = cbind(pred.in, tree.pred.in[,2], rf.pred.in[,2], attr(svm.pred.in, "probabilities")[,2]),
main = "ROC curve: In-sample",
legend = T, leg.text = c("GLM","Tree", "Random Forest", "SVM"))
# Validation set ROC
rocplot2 <- roc.plot(x = (test.data$Outcome == "1"),
pred = cbind(pred.out, tree.pred[,2], rf.pred.out[,2], attr(svm.pred, "probabilities")[,2]),
main = "ROC curve: Out-of-sample",
legend = T, leg.text = c("GLM", "Tree", "Random Forest", "SVM"))
rocplot1$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8388923 1.171605e-44 NA
## 2 Model 2 0.8542741 3.193790e-51 NA
## 3 Model 3 1.0000000 1.036689e-94 NA
## 4 Model 4 0.8552369 7.268933e-49 NA
rocplot2$roc.vol
## Model Area p.value binorm.area
## 1 Model 1 0.8571719 1.346698e-12 NA
## 2 Model 2 0.8476229 1.345129e-12 NA
## 3 Model 3 0.8465055 5.840181e-12 NA
## 4 Model 4 0.8655018 4.153473e-13 NA
The above results were reported after building the models on a single split of the dataset. To eliminate the effect of randomness in the split, 5-fold cross-validation was used to determine the best model (highest cross-validated sensitivity). The results below summarize the cross-validated sensitivity of the evaluated models.
## [1] 0.7984117
## [1] 0.7715438
## [1] 0.59497
## [1] 0.5059086
The cross-validated sensitivities of the above models were, respectively, 0.7984117 for Logistic Regression, 0.7715438 for the Classification Tree, 0.59497 for the Random Forest, and 0.5059086 for SVM.
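The cross-validation code itself is not shown in the report; the sketch below illustrates how such a 5-fold sensitivity estimate could be computed for the logistic model, reusing the formula and the 0.29 cut-off chosen above (an illustration, not the original code):
# 5-fold cross-validated sensitivity for the logistic model (illustrative sketch)
set.seed(12345)
folds <- sample(rep(1:5, length.out = nrow(data)))
sens <- numeric(5)
for (k in 1:5) {
  fit <- glm(Outcome ~ Pregnancies + Glucose + BMI + DiabetesPedigreeFunction,
             family = binomial, data = data[folds != k, ])
  prob <- predict(fit, newdata = data[folds == k, ], type = "response")
  pred <- ifelse(prob < 0.29, 0, 1)
  actual <- data$Outcome[folds == k]
  sens[k] <- sum(pred == 1 & actual == 1) / sum(actual == 1)  # TP / (TP + FN)
}
mean(sens)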
The Logistic Regression model was chosen as the best model for predicting the occurrence of Diabetes in Pima women, as it reported the highest cross-validated sensitivity.
The Pima Indians Diabetes Database was analyzed and explored in detail. The patterns identified through data exploration were validated using the modeling techniques employed. Classification models, namely Logistic Regression, Classification Trees, Random Forest, and SVM, were built and evaluated to identify the best model for predicting the occurrence of Diabetes in Pima Indian women. Based on the cross-validated sensitivity, the Logistic Regression model was concluded to be the best performing model.