Importing the dataset

Information about the dataset.

The dataset is sourced from the Vanderbilt Biostatistics home page. It contains 24 columns and 403 rows. The variables are described below.

X= index generated by R

id= id of the patients

chol= Total Cholesterol level

stab.glu= Stabilized Glucose

hdl= High Density Lipoprotein

ratio= Cholesterol/HDL ratio

glyhb= Glycosylated Hemoglobin

location= Location of patient (levels= Buckingham, Louisa)

age= Age of patient

gender= Gender of patient (levels= male, female)

height= Height of patient in inches

weight= Weight of patient in pounds

frame= Body frame of patient (values such as medium, large)

bp.1s= First Systolic Blood Pressure

bp.1d= First Diastolic Blood Pressure

bp.2s= Second Systolic Blood Pressure

bp.2d= Second Diastolic Blood Pressure

waist= Waist of Patient in inches

hip= Measurement of hip in inches

time.ppn= Postprandial time when labs were drawn, in minutes

## 'data.frame':    403 obs. of  24 variables:
##  $ X        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ id       : int  1000 1001 1002 1003 1005 1008 1011 1015 1016 1022 ...
##  $ chol     : int  203 165 228 78 249 248 195 227 177 263 ...
##  $ stab.glu : int  82 97 92 93 90 94 92 75 87 89 ...
##  $ hdl      : int  56 24 37 12 28 69 41 44 49 40 ...
##  $ ratio    : num  3.6 6.9 6.2 6.5 8.9 ...
##  $ glyhb    : num  4.31 4.44 4.64 4.63 7.72 ...
##  $ location : chr  "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
##  $ age      : int  46 29 58 67 64 34 30 37 45 55 ...
##  $ gender   : chr  "female" "female" "female" "male" ...
##  $ height   : int  62 64 61 67 68 71 69 59 69 63 ...
##  $ weight   : int  121 218 256 119 183 190 191 170 166 202 ...
##  $ frame    : chr  "medium" "large" "large" "large" ...
##  $ bp.1s    : int  118 112 190 110 138 132 161 NA 160 108 ...
##  $ bp.1d    : int  59 68 92 50 80 86 112 NA 80 72 ...
##  $ bp.2s    : int  NA NA 185 NA NA NA 161 NA 128 NA ...
##  $ bp.2d    : int  NA NA 92 NA NA NA 112 NA 86 NA ...
##  $ waist    : int  29 46 49 33 44 36 46 34 34 45 ...
##  $ hip      : int  38 48 57 38 41 42 49 39 40 50 ...
##  $ time.ppn : int  720 360 180 480 300 195 720 1020 300 240 ...
##  $ insurance: int  1 0 2 1 0 1 2 0 2 2 ...
##  $ fh       : int  0 0 0 0 0 0 1 0 1 0 ...
##  $ smoking  : int  3 2 2 3 3 1 2 2 1 2 ...
##  $ dm       : chr  "no" "no" "no" "no" ...
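A structure listing like the one above is what `str()` prints after the file is read. The following is a minimal sketch, assuming the data are stored locally as a CSV file. The file name `diabetes.csv` is hypothetical, and a three-row stand-in file is written to a temporary directory so that the sketch runs on its own.

```r
# Write a tiny stand-in CSV so the sketch is self-contained.
# In practice `path` would point at the downloaded file (file name assumed).
path <- file.path(tempdir(), "diabetes.csv")
writeLines(c(
  "id,chol,stab.glu,gender,dm",
  "1000,203,82,female,no",
  "1001,165,97,female,no",
  "1005,249,90,male,yes"
), path)

# Read the file; text columns stay as character until converted explicitly.
diabetes <- read.csv(path, stringsAsFactors = FALSE)

# Inspect the dimensions and the column types.
dim(diabetes)
str(diabetes)
```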

Cleaning the dataset

Several variables, including X, id, time.ppn and frame, have no significance as far as the prediction of diabetes is concerned. Hence these variables are dropped from the dataset.

Further, the column “ratio” is derived from chol and hdl and is therefore highly correlated with them. This column can also be removed.
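The column removal described above can be sketched as follows. The toy data frame only borrows the report's column names and its first three rows for chol, hdl and ratio; the other values are illustrative. The vector name `drop_cols` is an assumption, not taken from the original analysis.

```r
# Toy data frame using the report's column names.
diabetes <- data.frame(
  X = 1:3, id = c(1000, 1001, 1002),
  chol = c(203, 165, 228), hdl = c(56, 24, 37),
  ratio = c(3.6, 6.9, 6.2),
  frame = c("medium", "large", "large"),
  time.ppn = c(720, 360, 180)
)

# ratio equals chol / hdl up to rounding, so it adds no new information.
ratio_is_derived <- isTRUE(all.equal(diabetes$ratio,
                                     round(diabetes$chol / diabetes$hdl, 1)))

# Drop identifier, bookkeeping, and derived columns by name.
drop_cols <- c("X", "id", "time.ppn", "frame", "ratio")
diabetes <- diabetes[, !(names(diabetes) %in% drop_cols), drop = FALSE]
names(diabetes)
```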

Handling the null values and changing the variable class

The following variables are converted into factors:

  • location

  • gender

  • smoking

  • dm

  • insurance

  • fh
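One way to convert the listed columns in a single step is to apply `factor()` over them. This is a sketch on illustrative values; the name `factor_cols` is an assumption.

```r
# Toy data frame with the six columns listed above (values illustrative).
diabetes <- data.frame(
  location = c("Buckingham", "Louisa", "Buckingham"),
  gender = c("female", "male", "female"),
  smoking = c(3L, 2L, 1L),
  dm = c("no", "yes", "no"),
  insurance = c(1L, 0L, 2L),
  fh = c(0L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Convert every listed column to a factor in one pass.
factor_cols <- c("location", "gender", "smoking", "dm", "insurance", "fh")
diabetes[factor_cols] <- lapply(diabetes[factor_cols], factor)

str(diabetes)
```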

##      chol  stab.glu       hdl     glyhb  location       age    gender    height 
##         1         0         1        13         0         0         0         5 
##    weight     bp.1s     bp.1d     bp.2s     bp.2d     waist       hip insurance 
##         1         5         5       262       262         2         2         0 
##        fh   smoking        dm 
##         0         0        13

The above counts show that bp.2s and bp.2d each have a very high number of missing values (262 of 403 rows) and hence cannot be imputed reliably. These variables are removed from the analysis.
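Per-column missing counts like those above can be obtained with `colSums(is.na(...))`, and sparse columns can then be dropped by their share of missing values. The sketch below uses a four-row toy data frame. The 0.5 cut-off is an assumption chosen for illustration, not a threshold stated in the analysis.

```r
# Toy data frame with missing values (values illustrative).
diabetes <- data.frame(
  chol  = c(203, NA, 228, 78),
  bp.1s = c(118, 112, 190, NA),
  bp.2s = c(NA, NA, 185, NA),
  bp.2d = c(NA, NA, 92, NA)
)

# Count missing values per column.
na_counts <- colSums(is.na(diabetes))
na_counts

# Share of rows missing per column.
na_share <- na_counts / nrow(diabetes)

# Drop columns whose missing share exceeds the assumed cut-off of 0.5.
too_sparse <- names(na_share)[na_share > 0.5]
diabetes <- diabetes[, !(names(diabetes) %in% too_sparse), drop = FALSE]
names(diabetes)
```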

For the rest of the variables, appropriate imputation techniques are applied.

The above graph suggests that location is also insignificant, and hence we remove the column.

## 'data.frame':    403 obs. of  16 variables:
##  $ chol     : int  203 165 228 78 249 248 195 227 177 263 ...
##  $ stab.glu : int  82 97 92 93 90 94 92 75 87 89 ...
##  $ hdl      : int  56 24 37 12 28 69 41 44 49 40 ...
##  $ glyhb    : num  4.31 4.44 4.64 4.63 7.72 ...
##  $ age      : int  46 29 58 67 64 34 30 37 45 55 ...
##  $ gender   : Factor w/ 2 levels "female","male": 1 1 1 2 2 2 2 2 2 1 ...
##  $ height   : int  62 64 61 67 68 71 69 59 69 63 ...
##  $ weight   : int  121 218 256 119 183 190 191 170 166 202 ...
##  $ bp.1s    : int  118 112 190 110 138 132 161 NA 160 108 ...
##  $ bp.1d    : int  59 68 92 50 80 86 112 NA 80 72 ...
##  $ waist    : int  29 46 49 33 44 36 46 34 34 45 ...
##  $ hip      : int  38 48 57 38 41 42 49 39 40 50 ...
##  $ insurance: Factor w/ 3 levels "0","1","2": 2 1 3 2 1 2 3 1 3 3 ...
##  $ fh       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
##  $ smoking  : Factor w/ 3 levels "1","2","3": 3 2 2 3 3 1 2 2 1 2 ...
##  $ dm       : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...

The above graph suggests that, except for glyhb, the columns can be imputed with the mean. For glyhb the mode should be used, as it is heavily skewed.
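The two imputation rules can be sketched as follows. Base R has no built-in statistical mode, so `stat_mode` is a hypothetical helper written for this example. The data values are illustrative.

```r
# Toy data frame with one missing value per column (values illustrative).
diabetes <- data.frame(
  chol  = c(203, NA, 228, 78, 249),
  glyhb = c(4.31, 4.44, NA, 4.31, 7.72)
)

# Hypothetical helper: most frequent non-missing value (first one on ties).
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Mean imputation for chol.
diabetes$chol[is.na(diabetes$chol)] <- mean(diabetes$chol, na.rm = TRUE)

# Mode imputation for the skewed glyhb column.
diabetes$glyhb[is.na(diabetes$glyhb)] <- stat_mode(diabetes$glyhb)

# Confirm that no missing values remain.
colSums(is.na(diabetes))
```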

##      chol  stab.glu       hdl     glyhb       age    gender    height    weight 
##         0         0         0         0         0         0         0         0 
##     bp.1s     bp.1d     waist       hip insurance        fh   smoking        dm 
##         0         0         0         0         0         0         0        13

Applying different machine learning algorithms to train our model. The dataset is divided in a 7:3 ratio for training and testing.

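# Split the data 70/30 into training and test sets.
n <- nrow(diabetes)
n_train <- round(0.7 * n)
set.seed(123)  # fix the seed so the split is reproducible
train_indices <- sample(seq_len(n), n_train)
diabetes_train <- diabetes[train_indices, ]
diabetes_test <- diabetes[-train_indices, ]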

Logistic regression

## 
## Call:
## glm(formula = dm ~ ., family = binomial, data = diabetes_train)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -4.972e-05  -2.100e-08  -2.100e-08  -2.100e-08   5.173e-05  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.541e+02  1.550e+06   0.000    1.000
## chol         1.865e-01  3.425e+02   0.001    1.000
## stab.glu    -1.042e-01  3.640e+02   0.000    1.000
## hdl         -4.046e-01  3.839e+03   0.000    1.000
## glyhb        4.064e+01  2.204e+04   0.002    0.999
## age          1.039e-01  2.643e+03   0.000    1.000
## gendermale  -1.523e+01  6.489e+04   0.000    1.000
## height       1.615e+00  1.669e+04   0.000    1.000
## weight       2.214e-01  3.196e+03   0.000    1.000
## bp.1s        9.331e-02  9.803e+02   0.000    1.000
## bp.1d       -6.131e-01  1.173e+03  -0.001    1.000
## waist       -4.166e+00  2.057e+04   0.000    1.000
## hip          2.865e+00  8.394e+03   0.000    1.000
## insurance1   2.977e+00  8.661e+04   0.000    1.000
## insurance2   3.423e+00  6.315e+04   0.000    1.000
## fh1         -3.139e+01  7.873e+04   0.000    1.000
## smoking2    -1.490e+01  3.418e+04   0.000    1.000
## smoking3    -3.839e+00  4.672e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.5153e+02  on 274  degrees of freedom
## Residual deviance: 1.9283e-08  on 257  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 36
## 
## Number of Fisher Scoring iterations: 25
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  101   2
##        yes   1  11
##                                           
##                Accuracy : 0.9739          
##                  95% CI : (0.9257, 0.9946)
##     No Information Rate : 0.887           
##     P-Value [Acc > NIR] : 0.0006462       
##                                           
##                   Kappa : 0.8654          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.9902          
##             Specificity : 0.8462          
##          Pos Pred Value : 0.9806          
##          Neg Pred Value : 0.9167          
##              Prevalence : 0.8870          
##          Detection Rate : 0.8783          
##    Detection Prevalence : 0.8957          
##       Balanced Accuracy : 0.9182          
##                                           
##        'Positive' Class : no              
## 
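A fit and confusion matrix of the kind shown above can be sketched in base R. The assumptions here are that the data are simulated rather than the real dataset, that a 0.5 probability cut-off is used, and that `table()` stands in for whatever confusion-matrix function produced the report's output. The names `toy`, `train`, `test` and `prob_yes` are illustrative.

```r
set.seed(123)

# Simulated stand-in data: the outcome depends noisily on glyhb.
n <- 200
toy <- data.frame(
  glyhb = rnorm(n, mean = 5.5, sd = 2),
  age = round(runif(n, 20, 80))
)
p <- plogis(-6 + 1.0 * toy$glyhb)
toy$dm <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# 70/30 split, as in the report.
train_idx <- sample(seq_len(n), round(0.7 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Logistic regression on all predictors.
fit <- glm(dm ~ ., family = binomial, data = train)

# Predicted probability of "yes", turned into a class with an assumed 0.5 cut-off.
prob_yes <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))

# Confusion matrix (rows = prediction, columns = reference) and accuracy.
conf_mat <- table(Prediction = pred, Reference = test$dm)
conf_mat
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```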

RPART

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  no yes
##        no  102   0
##        yes   0  13
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9684, 1)
##     No Information Rate : 0.887      
##     P-Value [Acc > NIR] : 1.02e-06   
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.000      
##             Specificity : 1.000      
##          Pos Pred Value : 1.000      
##          Neg Pred Value : 1.000      
##              Prevalence : 0.887      
##          Detection Rate : 0.887      
##    Detection Prevalence : 0.887      
##       Balanced Accuracy : 1.000      
##                                      
##        'Positive' Class : no         
## 
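A classification tree can be fitted in the same way with the `rpart` package. This sketch again uses simulated data rather than the real dataset, and default `rpart` settings are assumed because the report does not state the ones it used.

```r
library(rpart)

set.seed(123)

# Simulated stand-in data: the outcome depends noisily on glyhb.
n <- 200
toy <- data.frame(
  glyhb = rnorm(n, mean = 5.5, sd = 2),
  age = round(runif(n, 20, 80))
)
p <- plogis(-6 + 1.0 * toy$glyhb)
toy$dm <- factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))

# 70/30 split, as in the report.
train_idx <- sample(seq_len(n), round(0.7 * n))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Classification tree with default settings (assumed).
tree_fit <- rpart(dm ~ ., data = train, method = "class")

# Predicted classes on the held-out rows.
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# Confusion matrix (rows = prediction, columns = reference) and accuracy.
tree_conf <- table(Prediction = tree_pred, Reference = test$dm)
tree_conf
tree_accuracy <- sum(diag(tree_conf)) / sum(tree_conf)
```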

Comparing Accuracy and ROC/AUC curves

##                     Accuracy     Kappa
## Decision Tree       1.000000 1.0000000
## Logistic Regression 0.973913 0.8653921
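The accuracy and kappa values in this table can be recomputed by hand from the two confusion matrices reported earlier. The sketch below copies those counts and applies the standard definitions: accuracy is the share of counts on the diagonal, and Cohen's kappa compares that share with the agreement expected by chance. The function names `accuracy` and `kappa_stat` are illustrative.

```r
# Confusion matrices copied from the outputs above
# (rows = prediction, columns = reference; matrix() fills column by column).
labels <- list(Prediction = c("no", "yes"), Reference = c("no", "yes"))
cm_logit <- matrix(c(101, 1, 2, 11), nrow = 2, dimnames = labels)
cm_tree  <- matrix(c(102, 0, 0, 13), nrow = 2, dimnames = labels)

# Accuracy: share of observations on the diagonal.
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

# Cohen's kappa: observed agreement corrected for chance agreement.
kappa_stat <- function(cm) {
  n <- sum(cm)
  p_obs <- sum(diag(cm)) / n
  p_exp <- sum(rowSums(cm) * colSums(cm)) / n^2
  (p_obs - p_exp) / (1 - p_exp)
}

results <- rbind(
  "Decision Tree" = c(Accuracy = accuracy(cm_tree), Kappa = kappa_stat(cm_tree)),
  "Logistic Regression" = c(Accuracy = accuracy(cm_logit), Kappa = kappa_stat(cm_logit))
)
results
```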

The decision tree provides the best accuracy score on the test data, and hence this model should be used for deployment.