Introduction

Diabetes affects a large number of people today, and there is still no cure for it. There are two main types of the disease, Type 1 and Type 2. Type 2 diabetes, commonly referred to as diabetes mellitus, can be defined as a chronic condition that affects the way the body processes blood sugar (glucose). We consider Type 2 diabetes here. Research has shown that certain parameters are directly associated with its occurrence. A dataset has been built from observations of people with and without diabetes, and we use it to classify people who are at risk of developing diabetes.

Loading the required libraries

library(ggplot2)   # plotting (comparison plot)
library(ggvis)     # interactive-style scatterplots
library(corrplot)  # visualising the correlation matrix
library(caTools)   # sample.split for the train/test split
library(ROCR)      # ROC curve and performance measures
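
If any of these packages are missing, they can first be installed from CRAN. A minimal sketch, checking against the installed library before calling install.packages:

# Install any of the required packages that are not already present
pkgs <- c("ggplot2", "ggvis", "corrplot", "caTools", "ROCR")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]
if (length(missing) > 0) install.packages(missing)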

Data Loading

The observations are stored in CSV format in a file named diabetes.csv. The data is loaded into the environment; let's check how it is structured.

data = read.csv("C:/Users/crsri/Documents/Diabetes_Prediction/Data/diabetes.csv")
head(data)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0
summary(data)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
str(data)
## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...
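
The summary and structure above show zeros in Glucose, BloodPressure, SkinThickness, Insulin and BMI, where a value of 0 is not physiologically plausible and most likely marks a missing measurement. A quick exploratory check (not part of the original output) counts these zeros per column:

# Count implausible zero values per column (Pregnancies and Outcome are excluded,
# since 0 is a valid value for them)
colSums(data[, c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")] == 0)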

Correlations

The relationships between the attributes can be quantified by the correlation coefficient, either numerically or visually. This helps to identify which attributes are most strongly related to the prediction variable, Outcome.

correlations <- cor(data)
correlations
##                          Pregnancies    Glucose BloodPressure
## Pregnancies               1.00000000 0.12945867    0.14128198
## Glucose                   0.12945867 1.00000000    0.15258959
## BloodPressure             0.14128198 0.15258959    1.00000000
## SkinThickness            -0.08167177 0.05732789    0.20737054
## Insulin                  -0.07353461 0.33135711    0.08893338
## BMI                       0.01768309 0.22107107    0.28180529
## DiabetesPedigreeFunction -0.03352267 0.13733730    0.04126495
## Age                       0.54434123 0.26351432    0.23952795
## Outcome                   0.22189815 0.46658140    0.06506836
##                          SkinThickness     Insulin        BMI
## Pregnancies                -0.08167177 -0.07353461 0.01768309
## Glucose                     0.05732789  0.33135711 0.22107107
## BloodPressure               0.20737054  0.08893338 0.28180529
## SkinThickness               1.00000000  0.43678257 0.39257320
## Insulin                     0.43678257  1.00000000 0.19785906
## BMI                         0.39257320  0.19785906 1.00000000
## DiabetesPedigreeFunction    0.18392757  0.18507093 0.14064695
## Age                        -0.11397026 -0.04216295 0.03624187
## Outcome                     0.07475223  0.13054795 0.29269466
##                          DiabetesPedigreeFunction         Age    Outcome
## Pregnancies                           -0.03352267  0.54434123 0.22189815
## Glucose                                0.13733730  0.26351432 0.46658140
## BloodPressure                          0.04126495  0.23952795 0.06506836
## SkinThickness                          0.18392757 -0.11397026 0.07475223
## Insulin                                0.18507093 -0.04216295 0.13054795
## BMI                                    0.14064695  0.03624187 0.29269466
## DiabetesPedigreeFunction               1.00000000  0.03356131 0.17384407
## Age                                    0.03356131  1.00000000 0.23835598
## Outcome                                0.17384407  0.23835598 1.00000000
corrplot(correlations, method="color")
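
To see directly which attributes are most strongly related to Outcome, the corresponding column of the correlation matrix can be sorted (a small convenience step, not part of the original output):

# Correlation of each attribute with the prediction variable, strongest first
sort(correlations[, "Outcome"], decreasing = TRUE)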

Visualization

Visualizations are used to grasp the structure of the data and its relationships, i.e. how the attributes vary and how they relate to one another. This is part of exploratory data analysis (EDA).

A matrix of scatterplots is produced for this dataset.

pairs(data, col = data$Outcome + 1)  # +1 so that Outcome = 0 points are not drawn in colour 0 (the background colour)

Glucose and Insulin

Glucose and insulin are major factors in diabetes and are correlated with each other (r ≈ 0.33 in the matrix above). The plot below shows how they relate to the outcome.

data %>% ggvis(~Glucose,~Insulin,fill =~Outcome) %>% layer_points()
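
ggvis is no longer under active development; since ggplot2 is already loaded, an equivalent scatterplot can also be drawn with it (an alternative sketch, not the figure shown above):

# Glucose vs Insulin, coloured by Outcome, using ggplot2 instead of ggvis
ggplot(data, aes(x = Glucose, y = Insulin, colour = factor(Outcome))) +
  geom_point() +
  labs(colour = "Outcome") +
  theme_classic()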

BMI and DiabetesPedigreeFunction

BMI and DiabetesPedigreeFunction are plotted here.

data %>% ggvis(~BMI,~DiabetesPedigreeFunction,fill =~Outcome) %>% layer_points()

Age and Pregnancies

A value of 0 for the Pregnancies attribute is common in this dataset, which is why many points lie on the zero line in this graph.

data %>% ggvis(~Age,~Pregnancies,fill =~Outcome) %>% layer_points()

Preparing the data

The dataset is divided into two parts, training data and testing data, with a SplitRatio of 0.75: three quarters of the observations go into the training set and the remaining quarter into the testing set. The division is random and is controlled by the seed.

set.seed(88)                                            # make the random split reproducible
split <- sample.split(data$Outcome, SplitRatio = 0.75)  # stratified split on Outcome
data_train <- subset(data, split == TRUE)               # 75% of the observations
data_test <- subset(data, split == FALSE)               # remaining 25%
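
sample.split stratifies on the labels, so the proportion of diabetic and non-diabetic observations should be roughly the same in both subsets. This can be verified with a quick check (optional, not part of the original output):

# Sizes of the two subsets and the share of Outcome = 1 in each
nrow(data_train); nrow(data_test)
prop.table(table(data_train$Outcome))
prop.table(table(data_test$Outcome))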

Logistic regression

Logistic regression is used to classify whether a person will get diabetes or not. Since we are using logistic regression, we must specify family = binomial. All attributes except Pregnancies are used as predictors. Let us take a look at the summary.

model <- glm (Outcome ~ .-Pregnancies + Glucose + BloodPressure + SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + Age, data = data_train, family = binomial)
summary(model)
## 
## Call:
## glm(formula = Outcome ~ . - Pregnancies + Glucose + BloodPressure + 
##     SkinThickness + Insulin + BMI + DiabetesPedigreeFunction + 
##     Age, family = binomial, data = data_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4254  -0.7250  -0.4361   0.7487   2.9829  
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              -8.3339721  0.8159489 -10.214  < 2e-16 ***
## Glucose                   0.0382162  0.0044235   8.639  < 2e-16 ***
## BloodPressure            -0.0088309  0.0060059  -1.470   0.1415    
## SkinThickness             0.0007624  0.0081902   0.093   0.9258    
## Insulin                  -0.0017095  0.0010823  -1.580   0.1142    
## BMI                       0.0792632  0.0169318   4.681 2.85e-06 ***
## DiabetesPedigreeFunction  0.7386714  0.3332368   2.217   0.0266 *  
## Age                       0.0204344  0.0095270   2.145   0.0320 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 745.11  on 575  degrees of freedom
## Residual deviance: 552.82  on 568  degrees of freedom
## AIC: 568.82
## 
## Number of Fisher Scoring iterations: 5
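
The estimated coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are easier to interpret (an optional step, not shown in the original output):

# Odds ratios: multiplicative change in the odds of diabetes per one-unit increase
exp(coef(model))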

Prediction

The trained model is used to predict outcomes for both the testing data and the training data (the latter for checking accuracy and for the ROC curve).

predict_train <- predict(model, type = 'response')                      # probabilities on the training data
predict_test <- predict(model, newdata = data_test, type = 'response')  # probabilities on the testing data

ROC Curve

ROCRpred <- prediction(predict_train, data_train$Outcome)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
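
The area under this ROC curve (AUC) summarises the curve in a single number; with ROCR it can be extracted as follows (the resulting value is not reproduced here):

# Area under the ROC curve for the training predictions
auc <- performance(ROCRpred, measure = "auc")
as.numeric(auc@y.values)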

Comparison

By comparing the predicted values with the observed values in the test data, we can see how our machine learning algorithm performs. Predicted probabilities of 0.5 or above are mapped to 1, and the rest to 0.

# Map predicted probabilities to class labels using a 0.5 cut-off
predict_test_c <- ifelse(predict_test < 0.5, 0, 1)
compare <- data.frame(data_test$Outcome,predict_test_c)
colnames(compare) <- c("Observed Values","Predicted values")
# Jitter the 0/1 points slightly so that overlapping observations remain visible
ggplot(data = compare, aes(x = `Observed Values`, y = `Predicted values`)) +
  geom_jitter(width = 0.05, height = 0.05) + geom_abline() +
  xlab("Observed Values") + ylab("Predicted values") + theme_classic()

compare
##     Observed Values Predicted values
## 6                 0                0
## 8                 0                0
## 9                 1                1
## 10                1                0
## 11                0                0
## 14                1                1
## 16                1                0
## 32                1                1
## 33                0                0
## 37                0                0
## 38                1                0
## 41                0                1
## 44                1                1
## 45                0                1
## 46                1                1
## 49                1                0
## 56                0                0
## 60                0                0
## 62                1                0
## 70                0                0
## 73                1                1
## 77                0                0
## 78                0                0
## 84                0                0
## 86                0                0
## 89                1                1
## 91                0                0
## 93                0                0
## 94                1                0
## 95                0                0
## 102               0                0
## 103               0                0
## 105               0                0
## 110               1                0
## 111               1                1
## 114               0                0
## 124               0                0
## 128               0                0
## 130               1                0
## 142               0                0
## 143               0                0
## 150               0                0
## 153               1                1
## 163               0                0
## 164               0                0
## 168               0                0
## 182               0                0
## 192               0                0
## 195               0                0
## 199               1                0
## 201               0                0
## 204               0                0
## 209               0                0
## 216               1                1
## 219               1                0
## 225               0                0
## 227               0                0
## 228               1                1
## 236               1                1
## 239               1                1
## 240               0                0
## 244               1                0
## 256               1                0
## 262               1                1
## 264               0                1
## 272               0                0
## 280               0                0
## 281               1                1
## 283               0                0
## 285               1                0
## 291               0                0
## 292               1                0
## 299               1                0
## 304               1                1
## 312               0                0
## 315               1                0
## 323               1                0
## 326               0                0
## 327               1                0
## 341               0                0
## 342               0                0
## 343               0                0
## 344               0                0
## 346               0                0
## 350               1                0
## 356               1                1
## 357               1                0
## 358               1                1
## 363               0                0
## 364               1                1
## 367               1                0
## 374               0                0
## 379               1                1
## 381               0                0
## 382               0                0
## 388               1                0
## 391               0                0
## 392               1                1
## 395               1                1
## 396               0                0
## 408               0                0
## 414               0                0
## 417               0                0
## 419               0                0
## 422               0                0
## 424               0                0
## 431               0                0
## 432               0                0
## 433               0                0
## 436               1                1
## 437               0                1
## 439               0                0
## 448               0                0
## 449               1                0
## 450               0                0
## 451               0                0
## 453               0                0
## 456               1                1
## 463               0                0
## 466               0                0
## 473               0                0
## 478               0                0
## 493               0                0
## 498               0                0
## 500               0                1
## 504               0                0
## 508               0                0
## 509               0                0
## 513               0                0
## 531               0                0
## 532               0                0
## 533               0                0
## 536               1                1
## 538               0                0
## 542               1                0
## 543               1                0
## 548               0                0
## 550               0                1
## 562               1                1
## 563               0                0
## 567               0                0
## 573               0                0
## 577               0                0
## 580               1                1
## 583               0                0
## 585               1                0
## 586               0                0
## 592               0                0
## 599               1                1
## 606               0                0
## 608               0                0
## 610               0                0
## 623               0                1
## 625               0                0
## 627               0                0
## 636               1                0
## 639               1                0
## 640               0                0
## 652               0                0
## 655               0                0
## 663               1                1
## 664               1                1
## 665               1                0
## 671               0                1
## 673               0                0
## 675               0                0
## 680               0                0
## 681               0                0
## 691               0                0
## 694               1                1
## 695               0                0
## 700               0                1
## 703               1                1
## 711               0                0
## 714               0                0
## 719               0                0
## 721               0                0
## 724               0                0
## 728               0                0
## 736               0                0
## 740               1                0
## 741               1                1
## 744               1                1
## 746               0                0
## 747               1                1
## 749               1                1
## 753               0                0
## 757               0                0
## 759               0                0
## 760               1                1
## 764               0                0
## 766               0                0
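
The same comparison can be condensed into a confusion matrix and an overall test-set accuracy (a summary sketch; the exact numbers follow from the table above):

# Confusion matrix (rows: observed, columns: predicted) and test-set accuracy
conf_mat <- table(Observed = data_test$Outcome, Predicted = predict_test_c)
conf_mat
sum(diag(conf_mat)) / sum(conf_mat)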

Conclusion

The results could be improved by applying feature scaling and data cleaning. In this project we predict Type 2 diabetes, commonly called diabetes mellitus; identifying people at risk can help them improve their health.

Things to do in the future

Data cleaning and feature scaling should be applied to the data, and the prepared data should then be run through the logistic regression again to obtain improved results.
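
A minimal sketch of what that cleaning and scaling could look like, assuming the implausible zeros are treated as missing and imputed with the column medians before the predictors are standardised (one possible approach, not the only one):

# Treat implausible zeros as missing and impute them with the column median
data_clean <- data
zero_cols <- c("Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI")
for (col in zero_cols) {
  data_clean[[col]][data_clean[[col]] == 0] <- NA
  data_clean[[col]][is.na(data_clean[[col]])] <- median(data_clean[[col]], na.rm = TRUE)
}
# Standardise all predictors (zero mean, unit variance); Outcome stays 0/1
data_clean[, -ncol(data_clean)] <- scale(data_clean[, -ncol(data_clean)])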