The goal of this project is to build a logistic regression model that would predict the likelihood of diabetes. The dataset was collected and publicly shared by “National Institute of Diabetes and Digestive and Kidney Diseases”.The link to original dataset can be found here. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. We will start the analysis from basic data cleaning steps such as looking for missing values, duplicate records and identifying for outliers in each covariate. Next, we will compare the correlation between the variables themselves following with the model variable selections. In order to select the best covariates for the model we proceed selection based on the p-value, BIC plotting and backward selection method. The model with the lowest AIC number will be selected for the final model building. Next, we will do residual analysis and transformation. Lastly, model validation is performed by splitting the dataset into training and testing sample size then we measure the model accuracy in each subset. ROC is used in validating the model’s diagnostic ability. Based on our AUC we will finalize the accuracy level of our model.
Our final model is given as:
| Package | Function |
|---|---|
| library(GGally) | GGally extends ggplot2 by adding several functions to reduce the complexity |
| library(tidyr) | For changing the layout of your data sets, to convert data into the tidy format |
| library(DT) | For HTML display of data |
| library(ggplot2) | For customizable graphical representation |
| library(dplyr) | For data manipulation |
| library(tidyverse) | Collection of R packages designed for data science that works harmoniously with other packages |
| library(kableExtra) | To display table in a fancy way |
| library(corrplot) | To create the correlation plot |
| library(readxl) | To read the excel data |
| library(gridExtra) | To lay out the plot together in same row |
| library(ROCR) | To find performance of the model |
| library(leaps) | To perform an exhaustive search for the best subsets of the variables in x |
| library(PRROC) | To make PR and ROC curve |
| library(boot) | To add bootstraping to the file |
The above packages were installed and used in this project.
The dataset contains 768 rows and 9 columns. These columns’s label are listed below.
## [1] 768
## [1] 9
## [1] "Pregnancies" "Glucose"
## [3] "BloodPressure" "SkinThickness"
## [5] "Insulin" "BMI"
## [7] "DiabetesPedigreeFunction" "Age"
## [9] "Outcome"
We can understand more about the structure of the dataset by using the str() function.
## 'data.frame': 768 obs. of 9 variables:
## $ Pregnancies : int 6 1 8 1 0 5 3 10 2 8 ...
## $ Glucose : int 148 85 183 89 137 116 78 115 197 125 ...
## $ BloodPressure : int 72 66 64 66 40 74 50 0 70 96 ...
## $ SkinThickness : int 35 29 0 23 35 0 32 0 45 0 ...
## $ Insulin : int 0 0 0 94 168 0 88 0 543 0 ...
## $ BMI : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
## $ DiabetesPedigreeFunction: num 0.627 0.351 0.672 0.167 2.288 ...
## $ Age : int 50 31 32 21 33 30 26 29 53 54 ...
## $ Outcome : int 1 0 1 0 1 0 1 0 1 1 ...
There are 8 variables are taken as indicators in the dataset. The variable Outcome is a response stated whether or not a person has diabetes by showing the result value as 0 for NO and 1 for Yes.
Checking for missing value and NULL value in the given dataset is one of the crucial steps in data cleaning
any(is.na(data))
## [1] FALSE
any(is.null(data))
## [1] FALSE
The results are False indicating that our dataset does not contain neither missing value nor NULL value. So we can do further analysis Next step, we will look for dupplicated records in the dataset.
df<-unique(data)
dim(df)[1]
## [1] 768
Using unique function, we can sort out the unique records within the original dataset. The new dataset’s dimension has 768 unique rows which has the same number of records with our original dataset. Both contain 768 rows. Thus, we come to the conclusion that there is no duplicate value. Then we perform testing on the outliers of each variable.
We can see that women in non diabetes group (outcome =0) have fewer number of pregnancies compared to those who are in the diabetes group. The distribuition of pregnant women in non diabetes group is skew to the right. The diabetic women appear to have higher glucose concentrations. The two group have quite similar blood pressure measurement. There are many outliers in Insulin from both groups, especially women with diabetes are heavily skewed to the right. The diabetic women have slightly higher BMI than the other group.The pedigree function distribution in both groups have outliers and have positive skew.The average age of women in diabetic group seem older than women in non diabetic group. Let’s take a closer look into those variables which has outliers: In reality, living organisms can’t have zero value for their Blood Pressure. We will check if there how many rows that contains 0 value in Blood Pressure.
## [1] 35
With fasting Glucose levels would not be as low as zero. Therefore zero is an invalid reading. We will check if there how many rows that contains 0 value in Glucose.
## [1] 5
For normal people, Skin Fold Thickness can’t be less than 10 mm better yet zero. We will check if there how many rows that contains 0 value.
## [1] 227
So a fasting Insulin level should never be 0, which it might be in a person with untreated Type 1. It shouldn’t go below 3. We will check if there how many rows that contains 0 value.
## [1] 374
BMI can’t be or close to 0 cause it is not reality related. We will check if there how many rows that contains 0 value.
## [1] 11
With the domain knowledge checking, we found invalid values in some columns. We conclude that the given data set is imcomplete. This leads to our group decision to use imputation to make the dataset more relevant and reasonable. In order to proceed this goal, We will replace the rows contained zero value in Blood Pressure, Glucose, Skin Fold Thickness, Insulin and BMI variables with the median value which is the midpoint of a frequency distribution of observed values. We did not choose mean because it won’t reflect the reality in observation.
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 44.00 Min. : 24.00 Min. : 7.00
## 1st Qu.: 1.000 1st Qu.: 99.75 1st Qu.: 64.00 1st Qu.:25.00
## Median : 3.000 Median :117.00 Median : 72.00 Median :28.00
## Mean : 3.845 Mean :121.68 Mean : 72.39 Mean :29.09
## 3rd Qu.: 6.000 3rd Qu.:140.25 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.00 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 14.0 Min. :18.20 Min. :0.0780 Min. :21.00
## 1st Qu.:102.5 1st Qu.:27.50 1st Qu.:0.2437 1st Qu.:24.00
## Median :102.5 Median :32.05 Median :0.3725 Median :29.00
## Mean :141.8 Mean :32.43 Mean :0.4719 Mean :33.24
## 3rd Qu.:169.5 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
By looking at the scatterplot mattrices, we can make an assumption that the data’s dots in Skin Thickness might have correlation with BMI. The coefficient in the correlation plot shows 0.57 between BMI and Skin Thickness which means there is a moderate positive relationship. It also shows correlation coefficient of 0.54 between Age and Pregnancies and 0.49 between Insulin and Glucose. These coefficients measure the strength and direction of a linear relationship between two variables. However,these coefficients from the scatterplot are not strong enough to assure that there are a significant relationship among the covariates. So we can do further analysis without dropping any columns.
Also, from the ggplot function we can visualize the distribution in each reagressor. Blood Pressure and BMI plotS seems to follow a normal distribution. Pregnancies, Age, Insulin and Diebetes Pedigree Function are skewed to the right.
# Selecting significant variables based on p-value
model1 = glm(Outcome ~ .-BloodPressure-Age,data,family="binomial")
summary(model1)
##
## Call:
## glm(formula = Outcome ~ . - BloodPressure - Age, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6895 -0.6841 -0.3693 0.7120 2.4378
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.221068 0.723672 -12.742 < 2e-16 ***
## Pregnancies 0.137385 0.027719 4.956 7.18e-07 ***
## Glucose 0.031139 0.003776 8.247 < 2e-16 ***
## SkinThickness 0.038310 0.013748 2.787 0.00533 **
## Insulin 0.005314 0.001480 3.590 0.00033 ***
## BMI 0.054954 0.017390 3.160 0.00158 **
## DiabetesPedigreeFunction 0.806248 0.299221 2.694 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 690.52 on 761 degrees of freedom
## AIC: 704.52
##
## Number of Fisher Scoring iterations: 5
plot(subset_result, scale="bic")
Based on BIC graph, we should not include BloodPressure, SkinThickness and Age in our model and proceed with building a new model.
# Model based on BIC
model2 = glm(Outcome~.-BloodPressure-SkinThickness-Age,data=data,family="binomial")
summary(model2)
##
## Call:
## glm(formula = Outcome ~ . - BloodPressure - SkinThickness - Age,
## family = "binomial", data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6114 -0.6806 -0.3877 0.6942 2.4394
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.962782 0.708231 -12.655 < 2e-16 ***
## Pregnancies 0.141859 0.027632 5.134 2.84e-07 ***
## Glucose 0.031410 0.003753 8.370 < 2e-16 ***
## Insulin 0.005209 0.001458 3.571 0.000355 ***
## BMI 0.080805 0.014944 5.407 6.40e-08 ***
## DiabetesPedigreeFunction 0.809910 0.297109 2.726 0.006411 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 698.58 on 762 degrees of freedom
## AIC: 710.58
##
## Number of Fisher Scoring iterations: 5
Here we, will run backward regression to get another model.
# Variables based on least AIC
model3 = glm(Outcome ~ Pregnancies + Glucose + SkinThickness + Insulin + BMI +
DiabetesPedigreeFunction,data,family="binomial")
summary(model3)
##
## Call:
## glm(formula = Outcome ~ Pregnancies + Glucose + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6895 -0.6841 -0.3693 0.7120 2.4378
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.221068 0.723672 -12.742 < 2e-16 ***
## Pregnancies 0.137385 0.027719 4.956 7.18e-07 ***
## Glucose 0.031139 0.003776 8.247 < 2e-16 ***
## SkinThickness 0.038310 0.013748 2.787 0.00533 **
## Insulin 0.005314 0.001480 3.590 0.00033 ***
## BMI 0.054954 0.017390 3.160 0.00158 **
## DiabetesPedigreeFunction 0.806248 0.299221 2.694 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 690.52 on 761 degrees of freedom
## AIC: 704.52
##
## Number of Fisher Scoring iterations: 5
After running the three different models, we came up with the following AIC values:
Model 1: 704.52 Model 2: 710.58 Model 3: 704.52
Model 1 and Model 3 returned the same set of variables with least AIC. Hence, selecting only those variables as that would yield an efficient model.
final_model = glm(Outcome ~ .-BloodPressure-Age,data,family="binomial")
summary(final_model)
##
## Call:
## glm(formula = Outcome ~ . - BloodPressure - Age, family = "binomial",
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6895 -0.6841 -0.3693 0.7120 2.4378
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.221068 0.723672 -12.742 < 2e-16 ***
## Pregnancies 0.137385 0.027719 4.956 7.18e-07 ***
## Glucose 0.031139 0.003776 8.247 < 2e-16 ***
## SkinThickness 0.038310 0.013748 2.787 0.00533 **
## Insulin 0.005314 0.001480 3.590 0.00033 ***
## BMI 0.054954 0.017390 3.160 0.00158 **
## DiabetesPedigreeFunction 0.806248 0.299221 2.694 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 993.48 on 767 degrees of freedom
## Residual deviance: 690.52 on 761 degrees of freedom
## AIC: 704.52
##
## Number of Fisher Scoring iterations: 5
As we interpret estimated coefficent of our significant covariates, we come up with the conclusion that when a person have pregnancy, increasing Glucose, Insulin, BMI,DiabetesPedigreeFunction and Skin Thickness will likely have diabetes.
The model will look like:
Outcome = -9.22 + 0.13* Pregnancies + 0.03* Glucose + 0.04* SkinThickness + 0.005* Insulin + 0.05* BMI + 0.80*DiabetesPedigreeFunction
Finding optimum P-cut with least error
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)
What’s the best p-cut and lowest mean error we can get?
optimal.pcut = p.seq[which(mean_error==min(mean_error))]
optimal.pcut
## [1] 0.34 0.35
min(mean_error)
## [1] 0.2005208
Thus, let’s use 0.35 as new p-cut value and check the misclassification error.
actual_value=d$Outcome
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## 0 404 96
## 1 58 210
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2005208
The final model will look like:
Outcome = -9.22 + 0.13* Pregnancies + 0.03* Glucose + 0.04* SkinThickness + 0.005* Insulin + 0.05* BMI + 0.80*DiabetesPedigreeFunction and we use p-cut of 0.35 which gives the misclassification error rate of 0.2005.
Let’s do the residual analysis to check if our assumptions are met.
As we can see, it’s violating our assumptions of Normality and Equal variance.
Residual plot after applying SQRT transformation on independent variables.
As we can see in the above plots, the assumption of equal variance is met. Though the Normality assumption is not completely met, we still proceed further for model validation as we have huge number of observations which would compensate for non normality.
split_data = sample(nrow(data),nrow(data)*0.80)
diab.train = data[split_data,]
diab.test = data[-split_data,]
nrow(data)
## [1] 768
nrow(diab.train)
## [1] 614
nrow(diab.test)
## [1] 154
We are randomly splitting the data to be used as training(80%) and testing(20%) datasets.
final_model <- glm(Outcome ~ sqrt(Pregnancies+Glucose+SkinThickness+Insulin+BMI+DiabetesPedigreeFunction), family=binomial, data = diab.train)
summary(final_model)
##
## Call:
## glm(formula = Outcome ~ sqrt(Pregnancies + Glucose + SkinThickness +
## Insulin + BMI + DiabetesPedigreeFunction), family = binomial,
## data = diab.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6611 -0.6735 -0.4831 0.9536 2.4484
##
## Coefficients:
## Estimate
## (Intercept) -10.40677
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) 0.53485
## Std. Error
## (Intercept) 0.93914
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) 0.05107
## z value
## (Intercept) -11.08
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) 10.47
## Pr(>|z|)
## (Intercept) <2e-16
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) <2e-16
##
## (Intercept) ***
## sqrt(Pregnancies + Glucose + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction) ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 782.10 on 613 degrees of freedom
## Residual deviance: 601.06 on 612 degrees of freedom
## AIC: 605.06
##
## Number of Fisher Scoring iterations: 5
we create a logistic regression model that involved 6 out of 8 variables against the outcome.
hist(predict(final_model,type="response"))
The distribution in probability is plotted over here. With this histogram we can see the largest probability and the smallest predicted probability.
pred.diab.train <- predict(final_model,type="response")
We can get the predicted probability for each of the values in the training dataset.
ROC (Receiver operating characteristic) - predicting the probability of a binary outcome.
AUC -> This the area under the ROC curve.
ROC curve plots the true positive rate(TPR) and false positive rate(FPR) that is calculated from the confusion matrix at various threshold settings. We can see that the curve is really close to the top left corner which is just a way of saying that the model’s accuracy is good.
From this ROC curve we can see that the model is better at discriminating between positives and negatives in general.From the above graph it is inferred that AUC gives an accuracy rate of approx 88%.
AUC metric ranges from 0.50 to 1.00 and more closer the AUC is to 1.00, more it indicates that the model does a good job in discriminating between the two categories which comprises our target variable - Outcome.
Now we make our predictions on the test data
pred.diab.test <- predict(final_model,newdata = diab.test, type="response")
#Convert probabilities to values
test_tab <- table(diab.test$Outcome,pred.diab.test > 0.5)
test_tab
##
## FALSE TRUE
## 0 83 8
## 1 29 34
accuracy_test <- round(sum(diag(test_tab))/sum(test_tab),2)
accuracy_test
## [1] 0.76
We see that the above output gives us an accuracy rate of 73%. We can further improve the performance of this model.
pred2 <- prediction(pred.diab.test,diab.test$Outcome)
perf <- performance(pred,"tpr","fpr")
plot(perf,colorize=TRUE,print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.2,1.7))
abline(a=0,b=1)
auc_out <- round (unlist(slot(performance(pred,"auc"),"y.values")),2)
legend(.8,.2,auc_out,title="AUC",cex= 0.5)
Now, we can see here that the curve generated by the test data is equal to the curve generated by the training data. Overfitting is a situation where the training data provides a model accuracy greater than the test data. But with our model, we achieve very good accuracy at 88%.
So we can say that the prediction is reliable enough to identify whether a person is affected by diabetes or not when its fed a set of variables.
Now we are gonna perform cross validation which is another approach to manipulation of training and test data.
This is a 5 fold cross validation where the dataset is divided into 5 parts and each part serves as the test set and the rest of them serve as training set.
cv.glm(data=diab.train,glmfit = final_model,K=5)$delta[2]
## [1] 0.1533482
We get a low error rate of 14% which is good for a model.We try another variant of cross validation to make sure that the final model is perfect to predict diabetes.
LOOCV is the type of cross validation where one record is used as test set and the rest is used as training set.
cv.glm(data = diab.train, glmfit = final_model, K = nrow(diab.train))$delta[2]
## [1] 0.1528156
The overall error estimate on the final model is 14%. Thus this model can be used to reliably predict whether a person gets affected with diabetes or not.
The Model has an accuracy of 88% on both training and testing data.