This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, BMI, insulin level, age, and so on.
Variable (column) descriptions: all the predictors are numeric. Outcome is used as a categorical variable even though its values, 1 (positive case) and 0 (negative case), are stored as numbers. Specific details on each variable are below:
- Pregnancies: number of pregnancies
- Glucose: plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: diastolic blood pressure (mm Hg)
- SkinThickness: triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: body mass index (weight in kg / (height in m)^2)
- DiabetesPedigreeFunction: diabetes pedigree function
- Age: age (years)
- Outcome: class variable (0 or 1); 268 of the 768 observations are 1, the rest are 0
Below is a summary of the dataset:
diabetes <- read.csv('diabetes.csv')
summary(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Below is a snapshot of the first few rows:
head(diabetes)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
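As noted above, Outcome is used categorically even though it is stored as 0/1. One way to make this explicit is to recode it as a labelled factor; the snippet below is only an illustration and is not applied to the data frame, since later steps (e.g. the Keras matrices) expect a numeric 0/1 column:
# Illustrative only: a two-level factor version of Outcome
outcome_factor <- factor(diabetes$Outcome, levels = c(0, 1),
                         labels = c('negative', 'positive'))
table(outcome_factor)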
It is also useful to look at a pair plot to visually inspect the relationships among the predictors:
# Colour the points by outcome: red for positive cases (1), black for negative (0)
cols <- character(nrow(diabetes))
cols[] <- 'black'
cols[diabetes$Outcome == 1] <- 'red'
pairs(diabetes, col = cols, pch = "*")
One major difference between statistical modeling and machine learning is orientation: statistics focuses on explaining and understanding the process by which the data are generated, with inference as the ultimate goal, while machine learning usually aims to predict or classify, which is one reason its models are often called black-box models. Because of this difference, machine learning algorithms offer only some degree of explainability, and how much varies by algorithm.
The choice of logistic regression here has a lot to do with its relatively high degree of explainability compared with other models. I'll use two approaches: a train/test split and cross-validation.
The train/test split approach creates two separate samples, one for training and one for testing; here we will use a 70/30 ratio. Below is the code to achieve the split:
dt <- sort(sample(nrow(diabetes), nrow(diabetes) * 0.7))  # row indices of a random 70% sample
train <- diabetes[dt, ]
test <- diabetes[-dt, ]
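Note that sample() draws a different split on each run; for a reproducible split one would typically fix the random seed (the value below is arbitrary) before running the code above:
set.seed(123)  # arbitrary seed, so the 70/30 split is reproducible across runs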
Now we fit a logistic regression model, first using all predictors on the full dataset:
glm.fit <- glm(Outcome ~ ., data = diabetes, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = diabetes)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5566 -0.7274 -0.4159 0.7267 2.9297
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.4046964 0.7166359 -11.728 < 2e-16 ***
## Pregnancies 0.1231823 0.0320776 3.840 0.000123 ***
## Glucose 0.0351637 0.0037087 9.481 < 2e-16 ***
## BloodPressure -0.0132955 0.0052336 -2.540 0.011072 *
## SkinThickness 0.0006190 0.0068994 0.090 0.928515
## Insulin -0.0011917 0.0009012 -1.322 0.186065
## BMI 0.0897010 0.0150876 5.945 2.76e-09 ***
## DiabetesPedigreeFunction 0.9451797 0.2991475 3.160 0.001580 **
## Age 0.0148690 0.0093348 1.593 0.111192
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 723.45 on 759 degrees of freedom
## AIC: 741.45
##
## Number of Fisher Scoring iterations: 5
Looking at the p-values, we can see that SkinThickness, Insulin, and Age are not significant. SkinThickness is, according to Dr Spanakis, a trivial predictor, so that makes sense. Insulin, as the pair plot above shows, is positively correlated with Glucose, which is most likely the reason for its high p-value. Age is also not significant, perhaps again because of correlation with other predictors.
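Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can make the effect sizes easier to read; a quick sketch on the full-data fit above:
# Odds ratios and Wald confidence intervals for the full-data fit; e.g. each
# additional pregnancy multiplies the odds of diabetes by exp(0.123) ~ 1.13
exp(coef(glm.fit))
exp(confint.default(glm.fit))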
I will exclude the insignificant predictors and refit the model, this time using just the train dataset:
glm.fit <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + BMI +
                 DiabetesPedigreeFunction, data = train, family = binomial)
summary(glm.fit)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure +
## BMI + DiabetesPedigreeFunction, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5319 -0.7374 -0.4032 0.6914 2.8833
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.94759 0.81802 -9.716 < 2e-16 ***
## Pregnancies 0.17850 0.03322 5.372 7.78e-08 ***
## Glucose 0.03289 0.00415 7.927 2.25e-15 ***
## BloodPressure -0.01101 0.00614 -1.793 0.07298 .
## BMI 0.08271 0.01679 4.927 8.35e-07 ***
## DiabetesPedigreeFunction 1.16614 0.36537 3.192 0.00141 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 696.65 on 536 degrees of freedom
## Residual deviance: 507.23 on 531 degrees of freedom
## AIC: 519.23
##
## Number of Fisher Scoring iterations: 5
We can now evaluate the model's performance on the train and test datasets. We'll use a threshold of 0.5: predicted probabilities above 0.5 are classified as 1, and 0.5 or below as 0:
glm.probs1 <- predict(glm.fit, type = 'response')                  # fitted probabilities on train
glm.probs2 <- predict(glm.fit, newdata = test, type = 'response')  # predicted probabilities on test
glm.pred1 <- ifelse(glm.probs1 > 0.5, '1', '0')
glm.pred2 <- ifelse(glm.probs2 > 0.5, '1', '0')
table(glm.pred1, train$Outcome)
##
## glm.pred1 0 1
## 0 307 81
## 1 41 108
table(glm.pred2, test$Outcome)
##
## glm.pred2 0 1
## 0 134 34
## 1 18 45
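The confusion tables also support per-class measures beyond raw accuracy; a small sketch for the test set, computed directly from the table above (rows are predictions, columns are the true outcomes):
cm <- table(glm.pred2, test$Outcome)
sens <- cm['1', '1'] / sum(cm[, '1'])  # sensitivity: true positives / all actual positives
spec <- cm['0', '0'] / sum(cm[, '0'])  # specificity: true negatives / all actual negatives
c(sensitivity = sens, specificity = spec)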
We can compute accuracy using the code below:
mean(glm.pred1==train$Outcome)
## [1] 0.7728119
mean(glm.pred2==test$Outcome)
## [1] 0.7748918
This gives an accuracy of roughly 77% on both the train and test datasets, which is evidence that the model does not overfit.
The cross-validation approach splits the dataset into k equal folds, then repeatedly trains on k-1 folds and tests on the held-out fold. The overall performance is the average across the k model fits.
suppressPackageStartupMessages(require(boot))
# Five repetitions of 5-fold CV; fold assignment is random, so each
# repetition gives a slightly different error estimate
glm.fitCv <- glm(Outcome ~ Pregnancies + Glucose + BloodPressure + BMI +
                   DiabetesPedigreeFunction, data = diabetes, family = binomial)
cv.error.5 <- rep(0, 5)
for (i in 1:5) {
  cv.error.5[i] <- cv.glm(diabetes, glm.fitCv, K = 5)$delta[1]
}
cv.error.5
## [1] 0.1557265 0.1567445 0.1559942 0.1555383 0.1554973
We can convert this to an accuracy-like figure with the code below, with one caveat: with the default cost function, the delta returned by cv.glm is the average squared prediction error, not a misclassification rate, so 1 minus delta is only a rough proxy for accuracy:
accuracy <- mean(1 - cv.error.5)
accuracy
## [1] 0.8440998
At roughly 84%, cross-validation appears to record better performance, although given the caveat above this figure is not directly comparable to the 77% train/test accuracy.
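For a directly comparable estimate, cv.glm accepts an explicit cost function; the misclassification cost below follows the pattern in the boot::cv.glm documentation, with the same 0.5 threshold used earlier:
# Misclassification cost: r is the observed 0/1 outcome, pi the predicted
# probability; an error is counted whenever they disagree at the 0.5 threshold
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.err <- cv.glm(diabetes, glm.fitCv, cost = cost, K = 5)$delta[1]
1 - cv.err  # cross-validated classification accuracy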
Now I am going to try a support vector machine and see what we get. I'll train on the train dataset and evaluate on the test dataset.
suppressPackageStartupMessages(require(e1071))
## Warning: package 'e1071' was built under R version 3.5.2
trainSVM <- svm(formula = Outcome ~ ., data = train, type = 'C-classification', kernel = 'polynomial')
trainSVM
##
## Call:
## svm(formula = Outcome ~ ., data = train, type = "C-classification",
## kernel = "polynomial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: polynomial
## cost: 1
## degree: 3
## gamma: 0.125
## coef.0: 0
##
## Number of Support Vectors: 304
predSVM = predict(trainSVM, newdata=test)
table(predSVM, test$Outcome)
##
## predSVM 0 1
## 0 142 51
## 1 10 28
mean(predSVM==test$Outcome)
## [1] 0.7359307
The support vector machine with the train/test split records an accuracy of about 74% on the test set.
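The polynomial kernel's cost, degree, and gamma were left at their defaults above. A possible refinement, not run in this write-up and with an arbitrary candidate grid, is e1071's built-in grid search (the response is wrapped in factor() so svm treats the task as classification):
# Grid-search cost and degree for the polynomial kernel; tune.svm uses
# 10-fold cross-validation by default, and the candidate values are illustrative
tuned <- tune.svm(factor(Outcome) ~ ., data = train,
                  kernel = 'polynomial',
                  cost = c(0.1, 1, 10), degree = 2:4)
summary(tuned)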
Lastly, I'm going to try a deep learning implementation using the Keras library.
Libraries:
suppressPackageStartupMessages(require(dplyr))
## Warning: package 'dplyr' was built under R version 3.5.2
suppressPackageStartupMessages(require(keras))
## Warning: package 'keras' was built under R version 3.5.2
suppressPackageStartupMessages(require(caret))
## Warning: package 'caret' was built under R version 3.5.2
## Warning: package 'ggplot2' was built under R version 3.5.2
source("./train_val_test.R")
Data pre-processing:
c(train, val, test) %<-% train_val_test_split(df = diabetes, train_ratio = 0.8, val_ratio = 0, test_ratio = 0.2)
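train_val_test.R is an external helper not included in this post; below is a minimal sketch of what train_val_test_split might look like, matching the call above (the real helper may differ):
# Hypothetical implementation: shuffle the rows, then cut them into
# train/validation/test pieces according to the given ratios
train_val_test_split <- function(df, train_ratio, val_ratio, test_ratio) {
  idx <- sample(nrow(df))
  n_train <- floor(train_ratio * nrow(df))
  n_val <- floor(val_ratio * nrow(df))
  list(train = df[idx[seq_len(n_train)], ],
       val = df[idx[n_train + seq_len(n_val)], ],
       test = df[idx[(n_train + n_val + 1):nrow(df)], ])
}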
X_train <- train %>%
select(-Outcome) %>%
as.matrix()
y_train <- train %>%
select(Outcome) %>%
as.matrix()
X_test <- test %>%
select(-Outcome) %>%
as.matrix()
y_test <- test %>%
select(Outcome) %>%
as.matrix()
dimnames(X_train) <- NULL
dimnames(X_test) <- NULL
X_train_scale <- X_train %>%
  scale()
# Keep the training means and SDs so the test set is scaled with the
# training statistics, avoiding information leakage from the test set
col_mean_train <- attr(X_train_scale, "scaled:center")
col_sd_train <- attr(X_train_scale, "scaled:scale")
X_test_scale <- X_test %>%
  scale(center = col_mean_train,
        scale = col_sd_train)
Model function (layers and compile stage):
create_model <- function() {
  dnn_class_model <-
    keras_model_sequential() %>%
    layer_dense(units = 50,
                activation = 'relu',
                input_shape = c(ncol(X_train_scale))) %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 50, activation = 'relu') %>%
    layer_dropout(rate = 0.4) %>%
    layer_dense(units = 1, activation = 'sigmoid') %>%
    compile(optimizer = 'adam',
            loss = 'binary_crossentropy',
            metrics = 'accuracy')
}
Model fitting:
dnn_class_model <- create_model()
history <- dnn_class_model %>%
keras::fit(x = X_train_scale,
y = y_train,
epochs = 30,
validation_split = 0.2,
batch_size = 128)
plot(history,
smooth = F)
Prediction and evaluation:
y_test_pred <- predict(object = dnn_class_model, x = X_test_scale)
y_test_pred %>% table() %>% head()
## .
## 0.00133283727336675 0.00220338138751686 0.00252745184116066
## 1 1 1
## 0.0148538555949926 0.0176389180123806 0.0240529496222734
## 1 1 1
test$outcome_pred <- y_test_pred[, 1]
test <- test %>%
mutate(outcome_pred_class = ifelse(outcome_pred < 0.5, 0, 1)) %>%
mutate(outcome_pred_class = as.factor(outcome_pred_class)) %>%
mutate(Outcome = as.factor(Outcome))
cm_tab <- caret::confusionMatrix(test$outcome_pred_class,
test$Outcome)
cm_tab
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 85 21
## 1 14 34
##
## Accuracy : 0.7727
## 95% CI : (0.6984, 0.8363)
## No Information Rate : 0.6429
## P-Value [Acc > NIR] : 0.000356
##
## Kappa : 0.4906
## Mcnemar's Test P-Value : 0.310494
##
## Sensitivity : 0.8586
## Specificity : 0.6182
## Pos Pred Value : 0.8019
## Neg Pred Value : 0.7083
## Prevalence : 0.6429
## Detection Rate : 0.5519
## Detection Prevalence : 0.6883
## Balanced Accuracy : 0.7384
##
## 'Positive' Class : 0
##
Deep learning records an accuracy of about 77%.
This dataset was downloaded from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database