1 Introduction

This dataset originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females of Pima Indian heritage who are at least 21 years old.

The dataset consists of several medical predictor variables and one target variable, Outcome. The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Variable (column) descriptions: all the predictors are numeric. Outcome is used as a categorical variable even though its values, 1 (positive case) and 0 (negative case), are ordinarily numeric types. Specific details about each column are listed below; a snippet showing how Outcome could be made an explicit factor follows the data summary.

- Pregnancies: Number of pregnancies
- Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
- BloodPressure: Diastolic blood pressure (mm Hg)
- SkinThickness: Triceps skin fold thickness (mm)
- Insulin: 2-hour serum insulin (mu U/ml)
- BMI: Body mass index (weight in kg/(height in m)^2)
- DiabetesPedigreeFunction: Diabetes pedigree function
- Age: Age (years)
- Outcome: Class variable (0 or 1); 268 of the 768 observations are 1, the rest are 0

Below is a summary of the dataset:

diabetes <- read.csv('diabetes.csv')
summary(diabetes)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
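
Since Outcome is stored as 0/1, its categorical role can be made explicit with a factor conversion. This is purely illustrative and is not run in the analysis below, which relies on glm's binomial family accepting a 0/1 response directly:

# Illustrative only: an explicit factor version of the 0/1 Outcome,
# kept in a separate variable so the original column is untouched
outcome_factor <- factor(diabetes$Outcome, levels = c(0, 1),
                         labels = c('negative', 'positive'))
table(outcome_factor)  # should show 500 negative, 268 positive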

Below is a snapshot of the first few rows:

head(diabetes)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

It is also useful to take a look at a pair plot to visually observe the relationships among the predictors; positive cases (Outcome = 1) are drawn in red and negative cases in black:

cols <- character(nrow(diabetes))
cols[] <- 'black'                      # negative cases in black
cols[diabetes$Outcome == 1] <- 'red'   # positive cases in red
pairs(diabetes, col = cols, pch = '*')

2 Logistic Regression

One major difference between statistical modeling and machine learning lies in their orientation: statistics focuses on explaining and understanding the process by which the data are generated, with an ultimate goal of inference, while machine learning usually tries to predict or classify, which is one reason its models are often called black boxes. Given this difference, machine learning algorithms offer only some degree of explainability, and that degree varies from algorithm to algorithm.

The choice of logistic regression here has a lot to do with its comparatively high degree of explainability: the model expresses the log-odds of the outcome as a linear function of the predictors, log(p/(1-p)) = b0 + b1*x1 + ... + bk*xk, so each coefficient has a direct interpretation. I'll be using two approaches here: a train/test split approach and a cross-validation approach.

2.1 Train/Test Split

The train/test split approach creates two separate samples, one for training and one for testing. Here we will use a 70/30 ratio.

Below is the code to achieve the split:

# Note: no set.seed() call precedes the sampling, so the split (and the
# exact numbers reported below) will differ from run to run.
dt= sort(sample(nrow(diabetes), nrow(diabetes)*.7))
train <- diabetes[dt,]
test <- diabetes[-dt,]

Now we fit a logistic regression model, first using all predictors on the full diabetes dataset:

glm.fit=glm(Outcome~.,data = diabetes, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Outcome ~ ., family = binomial, data = diabetes)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5566  -0.7274  -0.4159   0.7267   2.9297  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.4046964  0.7166359 -11.728  < 2e-16 ***
## Pregnancies               0.1231823  0.0320776   3.840 0.000123 ***
## Glucose                   0.0351637  0.0037087   9.481  < 2e-16 ***
## BloodPressure            -0.0132955  0.0052336  -2.540 0.011072 *  
## SkinThickness             0.0006190  0.0068994   0.090 0.928515    
## Insulin                  -0.0011917  0.0009012  -1.322 0.186065    
## BMI                       0.0897010  0.0150876   5.945 2.76e-09 ***
## DiabetesPedigreeFunction  0.9451797  0.2991475   3.160 0.001580 ** 
## Age                       0.0148690  0.0093348   1.593 0.111192    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 723.45  on 759  degrees of freedom
## AIC: 741.45
## 
## Number of Fisher Scoring iterations: 5

Looking at the p-values, we can see that SkinThickness, Insulin, and Age are not significant. SkinThickness, according to Dr Spanakis, is a trivial predictor, so that makes sense. Insulin, on the other hand, shows some positive correlation with Glucose in the pair plot above, which is most likely the reason for its high p-value. Age is not significant either, perhaps again because of correlation with another predictor such as Pregnancies.
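
Since explainability motivated the choice of logistic regression, it is also worth exponentiating the coefficients: each then becomes a multiplicative odds ratio per unit increase in its predictor. A minimal sketch on the full-data fit above, using Wald intervals from confint.default (output not shown):

# Odds ratios with Wald 95% confidence intervals; e.g. each additional
# pregnancy multiplies the odds of diabetes by exp(0.1232) ~ 1.13
exp(cbind(OddsRatio = coef(glm.fit), confint.default(glm.fit)))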

I will exclude the insignificant predictors and refit the model, this time using just the train dataset:

glm.fit=glm(Outcome~Pregnancies+Glucose+BloodPressure+BMI+DiabetesPedigreeFunction,data = train, family = binomial)
summary(glm.fit)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + BloodPressure + 
##     BMI + DiabetesPedigreeFunction, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5319  -0.7374  -0.4032   0.6914   2.8833  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -7.94759    0.81802  -9.716  < 2e-16 ***
## Pregnancies               0.17850    0.03322   5.372 7.78e-08 ***
## Glucose                   0.03289    0.00415   7.927 2.25e-15 ***
## BloodPressure            -0.01101    0.00614  -1.793  0.07298 .  
## BMI                       0.08271    0.01679   4.927 8.35e-07 ***
## DiabetesPedigreeFunction  1.16614    0.36537   3.192  0.00141 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 696.65  on 536  degrees of freedom
## Residual deviance: 507.23  on 531  degrees of freedom
## AIC: 519.23
## 
## Number of Fisher Scoring iterations: 5

We can now evaluate the model's performance on both the train and test datasets. We'll use a 0.5 threshold: predicted probabilities above 0.5 are classified as 1, and 0.5 or below as 0:

glm.probs1=predict(glm.fit, type='response')
glm.probs2=predict(glm.fit,newdata = test, type='response')
glm.pred1=ifelse(glm.probs1>0.5,'1','0')
glm.pred2=ifelse(glm.probs2>0.5,'1','0')
table(glm.pred1, train$Outcome)
##          
## glm.pred1   0   1
##         0 307  81
##         1  41 108
table(glm.pred2, test$Outcome)
##          
## glm.pred2   0   1
##         0 134  34
##         1  18  45

We can compute accuracy using the code below:

mean(glm.pred1==train$Outcome)
## [1] 0.7728119
mean(glm.pred2==test$Outcome)
## [1] 0.7748918

This gives an accuracy of roughly 77% on both the train and test datasets, which is evidence that the model does not overfit.
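
Accuracy alone can hide class-specific behaviour, so it is worth splitting the test performance by class. A quick sketch computed from the test confusion matrix above (rows are predictions, columns are actual outcomes; output not shown):

# True positive and true negative rates at the 0.5 threshold
cm <- table(glm.pred2, test$Outcome)
sensitivity <- cm['1', '1'] / sum(cm[, '1'])  # positives correctly flagged
specificity <- cm['0', '0'] / sum(cm[, '0'])  # negatives correctly flagged
c(sensitivity = sensitivity, specificity = specificity)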

2.2 Cross Validation

The cross-validation approach splits the dataset into k equal folds, then repeatedly trains on k-1 folds and tests on the held-out fold until every fold has served as the test set once. The overall performance estimate is the average over the k individual fits.

suppressPackageStartupMessages(require(boot))
cv.error.5 = rep(0,5)
# The fitted model is identical on every pass; each cv.glm call simply
# re-runs 5-fold CV with a fresh random fold assignment, so the vector
# below shows how stable the error estimate is across repetitions.
for (i in 1:5) {
  glm.fitCv=glm(Outcome~Pregnancies+Glucose+BloodPressure+BMI+DiabetesPedigreeFunction,data = diabetes, family = binomial)
  cv.error.5[i]=cv.glm(diabetes, glm.fitCv, K=5)$delta[1]
}
cv.error.5
## [1] 0.1557265 0.1567445 0.1559942 0.1555383 0.1554973

We can convert the error estimate into an accuracy figure with the code below (but see the caveat that follows):

accuracy <- mean(1 - cv.error.5)
accuracy
## [1] 0.8440998

At roughly 84%, cross-validation appears to record a better prediction accuracy. One caveat: cv.glm's default cost function is the average squared error between the 0/1 response and the fitted probabilities, not the misclassification rate, so this figure is not directly comparable with the train/test accuracies above.
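
A minimal sketch of estimating the misclassification rate instead, using the cost function given in the cv.glm help page (output not shown, since the fold assignment is random):

# cost: fraction of cases where the 0/1 response and the fitted
# probability disagree at the 0.5 threshold, i.e. the misclassification
# rate (this cost function comes from the ?cv.glm examples)
cost <- function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.err <- cv.glm(diabetes, glm.fitCv, cost = cost, K = 5)
1 - cv.err$delta[1]  # cross-validated classification accuracy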

3 Support Vector Machines

Now I am going to try Support Vector Machines and see what we get. I'll train on the train dataset and evaluate on the test dataset.

suppressPackageStartupMessages(require(e1071))
## Warning: package 'e1071' was built under R version 3.5.2
trainSVM <-  svm(formula = Outcome ~ ., data = train, type = 'C-classification', kernel = 'polynomial')
trainSVM
## 
## Call:
## svm(formula = Outcome ~ ., data = train, type = "C-classification", 
##     kernel = "polynomial")
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  polynomial 
##        cost:  1 
##      degree:  3 
##       gamma:  0.125 
##      coef.0:  0 
## 
## Number of Support Vectors:  304
predSVM = predict(trainSVM, newdata=test)

table(predSVM, test$Outcome)
##        
## predSVM   0   1
##       0 142  51
##       1  10  28
mean(predSVM==test$Outcome)
## [1] 0.7359307

With the train/test split, Support Vector Machines record an accuracy of roughly 74%.
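
The fit above uses e1071's default cost, degree, and gamma for the polynomial kernel. A hypothetical tuning sketch with tune.svm, which grid-searches parameter combinations by cross-validation (the grids below are illustrative assumptions; output not shown):

# Grid-search cost and gamma by cross-validation; wrapping Outcome in
# factor() makes tune.svm treat this as classification, not regression
tuned <- tune.svm(factor(Outcome) ~ ., data = train,
                  kernel = 'polynomial',
                  cost = c(0.1, 1, 10),
                  gamma = c(0.05, 0.125, 0.5))
summary(tuned)    # CV performance of each parameter combination
tuned$best.model  # model refit with the best combination found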

4 Deep Learning

Lastly, I'm going to try a deep learning implementation using the keras library.

Libraries:

suppressPackageStartupMessages(require(dplyr))
## Warning: package 'dplyr' was built under R version 3.5.2
suppressPackageStartupMessages(require(keras))
## Warning: package 'keras' was built under R version 3.5.2
suppressPackageStartupMessages(require(caret))
## Warning: package 'caret' was built under R version 3.5.2
## Warning: package 'ggplot2' was built under R version 3.5.2
source("./train_val_test.R")

Data pre-processing:

c(train, val, test) %<-% train_val_test_split(df = diabetes, train_ratio = 0.8, val_ratio = 0, test_ratio = 0.2)
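
train_val_test.R is not included in this report. Below is a minimal sketch of what such a helper might look like (an assumption, not the actual implementation), returning a named list that the %<-% multi-assignment operator above (re-exported by keras from zeallot) destructures:

# Hypothetical helper: shuffle the rows, then slice into three frames
train_val_test_split <- function(df, train_ratio, val_ratio, test_ratio) {
  idx <- sample(nrow(df))                  # shuffled row indices
  n_train <- floor(train_ratio * nrow(df))
  n_val <- floor(val_ratio * nrow(df))
  list(train = df[idx[seq_len(n_train)], ],
       val   = df[idx[n_train + seq_len(n_val)], ],
       test  = df[idx[-seq_len(n_train + n_val)], ])
}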

X_train <- train %>% 
  select(-Outcome) %>% 
  as.matrix()
y_train <- train %>% 
  select(Outcome) %>% 
  as.matrix()

X_test <- test %>% 
  select(-Outcome) %>% 
  as.matrix()
y_test <- test %>% 
  select(Outcome) %>% 
  as.matrix()

dimnames(X_train) <- NULL
dimnames(X_test) <- NULL

X_train_scale <- X_train %>% 
  scale()  # center and scale each column using training-set statistics

# Reuse the training means and standard deviations on the test set so
# no information from the test data leaks into the preprocessing
col_mean_train <- attr(X_train_scale, "scaled:center") 
col_sd_train <- attr(X_train_scale, "scaled:scale")

X_test_scale <- X_test %>% 
  scale(center = col_mean_train, 
        scale = col_sd_train)

Model function (layers and compile stage):

create_model <- function() {
  keras_model_sequential() %>% 
    layer_dense(units = 50, 
                activation = 'relu', 
                input_shape = c(ncol(X_train_scale))) %>% 
    layer_dropout(rate = 0.4) %>%   # dropout to curb overfitting
    layer_dense(units = 50, activation = 'relu') %>% 
    layer_dropout(rate = 0.4) %>% 
    layer_dense(units = 1, activation = 'sigmoid') %>%  # outputs a probability
    compile(optimizer = 'adam',
            loss = 'binary_crossentropy',
            metrics = 'accuracy')
}

Model fitting:

dnn_class_model <- create_model()
history <- dnn_class_model %>% 
  keras::fit(x = X_train_scale, 
             y = y_train,
             epochs = 30, 
             validation_split = 0.2,
             batch_size = 128)
plot(history, smooth = FALSE)  # training and validation curves

Prediction and evaluation:

y_test_pred <- predict(object = dnn_class_model, x = X_test_scale)
# Peek at a few predicted probabilities (each value is unique, so
# table() simply lists them, each with a count of 1)
y_test_pred %>% table() %>% head()
## .
## 0.00133283727336675 0.00220338138751686 0.00252745184116066 
##                   1                   1                   1 
##  0.0148538555949926  0.0176389180123806  0.0240529496222734 
##                   1                   1                   1
test$outcome_pred <- y_test_pred[, 1]


test <- test %>% 
  mutate(outcome_pred_class = ifelse(outcome_pred < 0.5, 0, 1)) %>%   
  mutate(outcome_pred_class = as.factor(outcome_pred_class)) %>% 
  mutate(Outcome = as.factor(Outcome))


cm_tab <- caret::confusionMatrix(test$outcome_pred_class, 
                                 test$Outcome)
cm_tab
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 85 21
##          1 14 34
##                                           
##                Accuracy : 0.7727          
##                  95% CI : (0.6984, 0.8363)
##     No Information Rate : 0.6429          
##     P-Value [Acc > NIR] : 0.000356        
##                                           
##                   Kappa : 0.4906          
##  Mcnemar's Test P-Value : 0.310494        
##                                           
##             Sensitivity : 0.8586          
##             Specificity : 0.6182          
##          Pos Pred Value : 0.8019          
##          Neg Pred Value : 0.7083          
##              Prevalence : 0.6429          
##          Detection Rate : 0.5519          
##    Detection Prevalence : 0.6883          
##       Balanced Accuracy : 0.7384          
##                                           
##        'Positive' Class : 0               
## 

Deep learning records an accuracy of roughly 77% (note that the caret output above treats 0 as the positive class, so the reported sensitivity and specificity describe the non-diabetic class).
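
As a cross-check, keras can compute the test-set loss and accuracy directly from the scaled matrices; a minimal sketch (output not shown):

# Built-in evaluation of the fitted network on the held-out test set
dnn_class_model %>% keras::evaluate(x = X_test_scale, y = y_test)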

5 Acknowledgement

This dataset was downloaded from Kaggle: https://www.kaggle.com/uciml/pima-indians-diabetes-database