Synopsis

The goal of this project is to build a logistic regression model that would predict the likelihood of diabetes. The dataset was collected and publicly shared by “National Institute of Diabetes and Digestive and Kidney Diseases”.The link to original dataset can be found here. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. We will start the analysis from basic data cleaning steps such as looking for missing values, duplicate records and identifying for outliers in each covariate. Next, we will compare the correlation between the variables themselves following with the model variable selections. In order to select the best covariates for the model we proceed selection based on the p-value, BIC plotting and backward selection method. The model with the lowest AIC number will be selected for the final model building. Next, we will do residual analysis and transformation. Lastly, model validation is performed by splitting the dataset into training and testing sample size then we measure the model accuracy in each subset. ROC is used in validating the model’s diagnostic ability. Based on our AUC we will finalize the accuracy level of our model.

Our final model is given as:

Outcome = -9.22 + 0.13* Pregnancies + 0.03* Glucose + 0.04* SkinThickness + 0.005* Insulin + 0.05* BMI + 0.80* DiabetesPedigreeFunction


image

Packages Required

Package Function
library(GGally) GGally extends ggplot2 by adding several functions to reduce the complexity
library(tidyr) For changing the layout of your data sets, to convert data into the tidy format
library(DT) For HTML display of data
library(ggplot2) For customizable graphical representation
library(dplyr) For data manipulation
library(tidyverse) Collection of R packages designed for data science that works harmoniously with other packages
library(kableExtra) To display table in a fancy way
library(corrplot) To create the correlation plot
library(readxl) To read the excel data
library(gridExtra) To lay out the plot together in same row
library(ROCR) To find performance of the model
library(leaps) To perform an exhaustive search for the best subsets of the variables in x
library(PRROC) To make PR and ROC curve
library(boot) To add bootstraping to the file

The above packages were installed and used in this project.

Data Preparation

Data Source

The dataset contains 768 rows and 9 columns. These columns’s label are listed below.

## [1] 768
## [1] 9
## [1] "Pregnancies"              "Glucose"                 
## [3] "BloodPressure"            "SkinThickness"           
## [5] "Insulin"                  "BMI"                     
## [7] "DiabetesPedigreeFunction" "Age"                     
## [9] "Outcome"

We can understand more about the structure of the dataset by using the str() function.

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

There are 8 variables are taken as indicators in the dataset. The variable Outcome is a response stated whether or not a person has diabetes by showing the result value as 0 for NO and 1 for Yes.

Data Cleaning

Checking for missing value and NULL value in the given dataset is one of the crucial steps in data cleaning

    any(is.na(data))
## [1] FALSE
    any(is.null(data))
## [1] FALSE

The results are False indicating that our dataset does not contain neither missing value nor NULL value. So we can do further analysis
Next step, we will look for dupplicated records in the dataset.

df<-unique(data)
dim(df)[1]
## [1] 768

Using unique function, we can sort out the unique records within the original dataset. The new dataset’s dimension has 768 unique rows which has the same number of records with our original dataset. Both contain 768 rows. Thus, we come to the conclusion that there is no duplicate value.
Then we perform testing on the outliers of each variable.

We can see that women in non diabetes group (outcome =0) have fewer number of pregnancies compared to those who are in the diabetes group. The distribuition of pregnant women in non diabetes group is skew to the right. The diabetic women appear to have higher glucose concentrations. The two group have quite similar blood pressure measurement. There are many outliers in Insulin from both groups, especially women with diabetes are heavily skewed to the right. The diabetic women have slightly higher BMI than the other group.The pedigree function distribution in both groups have outliers and have positive skew.The average age of women in diabetic group seem older than women in non diabetic group. Let’s take a closer look into those variables which has outliers: In reality, living organisms can’t have zero value for their Blood Pressure. We will check if there how many rows that contains 0 value in Blood Pressure.

## [1] 35

With fasting Glucose levels would not be as low as zero. Therefore zero is an invalid reading. We will check if there how many rows that contains 0 value in Glucose.

## [1] 5

For normal people, Skin Fold Thickness can’t be less than 10 mm better yet zero. We will check if there how many rows that contains 0 value.

## [1] 227

So a fasting Insulin level should never be 0, which it might be in a person with untreated Type 1. It shouldn’t go below 3. We will check if there how many rows that contains 0 value.

## [1] 374

BMI can’t be or close to 0 cause it is not reality related. We will check if there how many rows that contains 0 value.

## [1] 11


With the domain knowledge checking, we found invalid values in some columns. We conclude that the given data set is imcomplete. This leads to our group decision to use imputation to make the dataset more relevant and reasonable. In order to proceed this goal, We will replace the rows contained zero value in Blood Pressure, Glucose, Skin Fold Thickness, Insulin and BMI variables with the median value which is the midpoint of a frequency distribution of observed values. We did not choose mean because it won’t reflect the reality in observation.

##   Pregnancies        Glucose       BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   : 44.00   Min.   : 24.00   Min.   : 7.00  
##  1st Qu.: 1.000   1st Qu.: 99.75   1st Qu.: 64.00   1st Qu.:25.00  
##  Median : 3.000   Median :117.00   Median : 72.00   Median :28.00  
##  Mean   : 3.845   Mean   :121.68   Mean   : 72.39   Mean   :29.09  
##  3rd Qu.: 6.000   3rd Qu.:140.25   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.00   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   : 14.0   Min.   :18.20   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:102.5   1st Qu.:27.50   1st Qu.:0.2437           1st Qu.:24.00  
##  Median :102.5   Median :32.05   Median :0.3725           Median :29.00  
##  Mean   :141.8   Mean   :32.43   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:169.5   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000

Exploratory Data Analysis

Check correlation between the variables

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

By looking at the scatterplot mattrices, we can make an assumption that the data’s dots in Skin Thickness might have correlation with BMI. The coefficient in the correlation plot shows 0.57 between BMI and Skin Thickness which means there is a moderate positive relationship. It also shows correlation coefficient of 0.54 between Age and Pregnancies and 0.49 between Insulin and Glucose. These coefficients measure the strength and direction of a linear relationship between two variables. However,these coefficients from the scatterplot are not strong enough to assure that there are a significant relationship among the covariates. So we can do further analysis without dropping any columns.

Also, from the ggplot function we can visualize the distribution in each reagressor. Blood Pressure and BMI plotS seems to follow a normal distribution. Pregnancies, Age, Insulin and Diebetes Pedigree Function are skewed to the right.

Variable Selection and Final Model

Variables selection is done using P-value, AIC (Akaike Information Criterion), BIC(Bayesian Information Criterion)

Model based on P-value

# Selecting significant variables based on p-value
model1 = glm(Outcome ~ .-BloodPressure-Age,data,family="binomial")
summary(model1)
## 
## Call:
## glm(formula = Outcome ~ . - BloodPressure - Age, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6895  -0.6841  -0.3693   0.7120   2.4378  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.221068   0.723672 -12.742  < 2e-16 ***
## Pregnancies               0.137385   0.027719   4.956 7.18e-07 ***
## Glucose                   0.031139   0.003776   8.247  < 2e-16 ***
## SkinThickness             0.038310   0.013748   2.787  0.00533 ** 
## Insulin                   0.005314   0.001480   3.590  0.00033 ***
## BMI                       0.054954   0.017390   3.160  0.00158 ** 
## DiabetesPedigreeFunction  0.806248   0.299221   2.694  0.00705 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 690.52  on 761  degrees of freedom
## AIC: 704.52
## 
## Number of Fisher Scoring iterations: 5

Plotting BIC

plot(subset_result, scale="bic")

Based on BIC graph, we should not include BloodPressure, SkinThickness and Age in our model and proceed with building a new model.

Model based on BIC

# Model based on BIC
model2 = glm(Outcome~.-BloodPressure-SkinThickness-Age,data=data,family="binomial")
summary(model2)
## 
## Call:
## glm(formula = Outcome ~ . - BloodPressure - SkinThickness - Age, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6114  -0.6806  -0.3877   0.6942   2.4394  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.962782   0.708231 -12.655  < 2e-16 ***
## Pregnancies               0.141859   0.027632   5.134 2.84e-07 ***
## Glucose                   0.031410   0.003753   8.370  < 2e-16 ***
## Insulin                   0.005209   0.001458   3.571 0.000355 ***
## BMI                       0.080805   0.014944   5.407 6.40e-08 ***
## DiabetesPedigreeFunction  0.809910   0.297109   2.726 0.006411 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 698.58  on 762  degrees of freedom
## AIC: 710.58
## 
## Number of Fisher Scoring iterations: 5

Here we, will run backward regression to get another model.

Model after running backward selection

# Variables based on least AIC
model3 = glm(Outcome ~ Pregnancies + Glucose + SkinThickness + Insulin + BMI + 
               DiabetesPedigreeFunction,data,family="binomial")
summary(model3)
## 
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6895  -0.6841  -0.3693   0.7120   2.4378  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.221068   0.723672 -12.742  < 2e-16 ***
## Pregnancies               0.137385   0.027719   4.956 7.18e-07 ***
## Glucose                   0.031139   0.003776   8.247  < 2e-16 ***
## SkinThickness             0.038310   0.013748   2.787  0.00533 ** 
## Insulin                   0.005314   0.001480   3.590  0.00033 ***
## BMI                       0.054954   0.017390   3.160  0.00158 ** 
## DiabetesPedigreeFunction  0.806248   0.299221   2.694  0.00705 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 690.52  on 761  degrees of freedom
## AIC: 704.52
## 
## Number of Fisher Scoring iterations: 5

After running the three different models, we came up with the following AIC values:

Model 1: 704.52 Model 2: 710.58 Model 3: 704.52

Model 1 and Model 3 returned the same set of variables with least AIC. Hence, selecting only those variables as that would yield an efficient model.

Final model based on least AIC amongst the above three model.

final_model = glm(Outcome ~ .-BloodPressure-Age,data,family="binomial")
summary(final_model)
## 
## Call:
## glm(formula = Outcome ~ . - BloodPressure - Age, family = "binomial", 
##     data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6895  -0.6841  -0.3693   0.7120   2.4378  
## 
## Coefficients:
##                           Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -9.221068   0.723672 -12.742  < 2e-16 ***
## Pregnancies               0.137385   0.027719   4.956 7.18e-07 ***
## Glucose                   0.031139   0.003776   8.247  < 2e-16 ***
## SkinThickness             0.038310   0.013748   2.787  0.00533 ** 
## Insulin                   0.005314   0.001480   3.590  0.00033 ***
## BMI                       0.054954   0.017390   3.160  0.00158 ** 
## DiabetesPedigreeFunction  0.806248   0.299221   2.694  0.00705 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 690.52  on 761  degrees of freedom
## AIC: 704.52
## 
## Number of Fisher Scoring iterations: 5

As we interpret estimated coefficent of our significant covariates, we come up with the conclusion that when a person have pregnancy, increasing Glucose, Insulin, BMI,DiabetesPedigreeFunction and Skin Thickness will likely have diabetes.

The model will look like:

Outcome = -9.22 + 0.13* Pregnancies + 0.03* Glucose + 0.04* SkinThickness + 0.005* Insulin + 0.05* BMI + 0.80*DiabetesPedigreeFunction

Finding optimum P-cut with least error

# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)

What’s the best p-cut and lowest mean error we can get?

optimal.pcut = p.seq[which(mean_error==min(mean_error))]
optimal.pcut
## [1] 0.34 0.35
min(mean_error)
## [1] 0.2005208

Thus, let’s use 0.35 as new p-cut value and check the misclassification error.

actual_value=d$Outcome
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
##             pred_value
## actual_value   0   1
##            0 404  96
##            1  58 210
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2005208

The final model will look like:

Outcome = -9.22 + 0.13* Pregnancies + 0.03* Glucose + 0.04* SkinThickness + 0.005* Insulin + 0.05* BMI + 0.80*DiabetesPedigreeFunction and we use p-cut of 0.35 which gives the misclassification error rate of 0.2005.

Residual Analysis

Let’s do the residual analysis to check if our assumptions are met.

As we can see, it’s violating our assumptions of Normality and Equal variance.

Residual plot after applying SQRT transformation on independent variables.

As we can see in the above plots, the assumption of equal variance is met. Though the Normality assumption is not completely met, we still proceed further for model validation as we have huge number of observations which would compensate for non normality.

Model Validation

Splitting of data

split_data = sample(nrow(data),nrow(data)*0.80)
diab.train = data[split_data,]
diab.test =  data[-split_data,]

nrow(data)
## [1] 768
nrow(diab.train)
## [1] 614
nrow(diab.test)
## [1] 154

We are randomly splitting the data to be used as training(80%) and testing(20%) datasets.

Train a logistic regression model with all X variables

final_model <- glm(Outcome ~ sqrt(Pregnancies+Glucose+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction), family=binomial, data = diab.train)
summary(final_model)
## 
## Call:
## glm(formula = Outcome ~ sqrt(Pregnancies + Glucose + SkinThickness + 
##     Insulin + BMI + DiabetesPedigreeFunction), family = binomial, 
##     data = diab.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.6611  -0.6735  -0.4831   0.9536   2.4484  
## 
## Coefficients:
##                                                                                         Estimate
## (Intercept)                                                                            -10.40677
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction)   0.53485
##                                                                                        Std. Error
## (Intercept)                                                                               0.93914
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction)    0.05107
##                                                                                        z value
## (Intercept)                                                                             -11.08
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction)   10.47
##                                                                                        Pr(>|z|)
## (Intercept)                                                                              <2e-16
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction)   <2e-16
##                                                                                           
## (Intercept)                                                                            ***
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 782.10  on 613  degrees of freedom
## Residual deviance: 601.06  on 612  degrees of freedom
## AIC: 605.06
## 
## Number of Fisher Scoring iterations: 5

we create a logistic regression model that involved 6 out of 8 variables against the outcome.

We use type=response to predict the probability of each training observation

hist(predict(final_model,type="response"))

The distribution in probability is plotted over here. With this histogram we can see the largest probability and the smallest predicted probability.

In-sample prediction - i.e. the training data

pred.diab.train <- predict(final_model,type="response")

We can get the predicted probability for each of the values in the training dataset.

Use ROC to give a an overall measure of goodness of classification and calculate AUC as well

ROC (Receiver operating characteristic) - predicting the probability of a binary outcome.

AUC -> This the area under the ROC curve.

ROC curve plots the true positive rate(TPR) and false positive rate(FPR) that is calculated from the confusion matrix at various threshold settings. We can see that the curve is really close to the top left corner which is just a way of saying that the model’s accuracy is good.

From this ROC curve we can see that the model is better at discriminating between positives and negatives in general.From the above graph it is inferred that AUC gives an accuracy rate of approx 88%.

AUC metric ranges from 0.50 to 1.00 and more closer the AUC is to 1.00, more it indicates that the model does a good job in discriminating between the two categories which comprises our target variable - Outcome.

Out-of-Sample Prediction

Now we make our predictions on the test data

pred.diab.test <- predict(final_model,newdata = diab.test, type="response")
#Convert probabilities to values
test_tab <- table(diab.test$Outcome,pred.diab.test > 0.5)
test_tab
##    
##     FALSE TRUE
##   0    83    8
##   1    29   34
accuracy_test <- round(sum(diag(test_tab))/sum(test_tab),2)
accuracy_test
## [1] 0.76

We see that the above output gives us an accuracy rate of 73%. We can further improve the performance of this model.

Plot ROC Curve with the test data

pred2 <- prediction(pred.diab.test,diab.test$Outcome)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE,print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.2,1.7))
abline(a=0,b=1)
auc_out <- round (unlist(slot(performance(pred,"auc"),"y.values")),2)
legend(.8,.2,auc_out,title="AUC",cex= 0.5)

Now, we can see here that the curve generated by the test data is equal to the curve generated by the training data. Overfitting is a situation where the training data provides a model accuracy greater than the test data. But with our model, we achieve very good accuracy at 88%.

So we can say that the prediction is reliable enough to identify whether a person is affected by diabetes or not when its fed a set of variables.

Cross Validation

Now we are gonna perform cross validation which is another approach to manipulation of training and test data.

This is a 5 fold cross validation where the dataset is divided into 5 parts and each part serves as the test set and the rest of them serve as training set.

cv.glm(data=diab.train,glmfit = final_model,K=5)$delta[2]
## [1] 0.1533482

We get a low error rate of 14% which is good for a model.We try another variant of cross validation to make sure that the final model is perfect to predict diabetes.

Leave one out cross validation (LOOCV)

LOOCV is the type of cross validation where one record is used as test set and the rest is used as training set.

cv.glm(data = diab.train, glmfit = final_model, K = nrow(diab.train))$delta[2]
## [1] 0.1528156

The overall error estimate on the final model is 14%. Thus this model can be used to reliably predict whether a person gets affected with diabetes or not.

Model Accuracy

The Model has an accuracy of 88% on both training and testing data.