Information about the dataset.
The dataset is sourced from the “Vanderbilt Biostatistics home”. As loaded, it contains 403 rows and 24 columns (including the row index X generated by R). The variables are described below.
X= index generated by R
id= id of the patients
chol= Total Cholesterol level
stab.glu= Stabilized Glucose
hdl= High Density Lipoprotein
ratio= Cholesterol/HDL ratio
glyhb= Glycosylated Hemoglobin
location= location of patient (levels= Buckingham, Louisa)
age= Age of patient
gender= Gender of patient (levels= male, female)
height= height of patient in inches
weight= weight of patient in pounds
frame= Body frame of patient (values include medium, large)
bp.1s= First Systolic Blood Pressure
bp.1d= First Diastolic Blood Pressure
bp.2s= Second Systolic Blood Pressure
bp.2d= Second Diastolic Blood Pressure
waist= Waist of Patient in inches
hip= Measurement of hip in inches
time.ppn= Postprandial time when labs were drawn, in minutes
## 'data.frame': 403 obs. of 24 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ id : int 1000 1001 1002 1003 1005 1008 1011 1015 1016 1022 ...
## $ chol : int 203 165 228 78 249 248 195 227 177 263 ...
## $ stab.glu : int 82 97 92 93 90 94 92 75 87 89 ...
## $ hdl : int 56 24 37 12 28 69 41 44 49 40 ...
## $ ratio : num 3.6 6.9 6.2 6.5 8.9 ...
## $ glyhb : num 4.31 4.44 4.64 4.63 7.72 ...
## $ location : chr "Buckingham" "Buckingham" "Buckingham" "Buckingham" ...
## $ age : int 46 29 58 67 64 34 30 37 45 55 ...
## $ gender : chr "female" "female" "female" "male" ...
## $ height : int 62 64 61 67 68 71 69 59 69 63 ...
## $ weight : int 121 218 256 119 183 190 191 170 166 202 ...
## $ frame : chr "medium" "large" "large" "large" ...
## $ bp.1s : int 118 112 190 110 138 132 161 NA 160 108 ...
## $ bp.1d : int 59 68 92 50 80 86 112 NA 80 72 ...
## $ bp.2s : int NA NA 185 NA NA NA 161 NA 128 NA ...
## $ bp.2d : int NA NA 92 NA NA NA 112 NA 86 NA ...
## $ waist : int 29 46 49 33 44 36 46 34 34 45 ...
## $ hip : int 38 48 57 38 41 42 49 39 40 50 ...
## $ time.ppn : int 720 360 180 480 300 195 720 1020 300 240 ...
## $ insurance: int 1 0 2 1 0 1 2 0 2 2 ...
## $ fh : int 0 0 0 0 0 0 1 0 1 0 ...
## $ smoking : int 3 2 2 3 3 1 2 2 1 2 ...
## $ dm : chr "no" "no" "no" "no" ...
Several variables, including X, id, time.ppn and frame, have no significance for predicting diabetes. Hence these variables are dropped from the dataset.
Further, the column “ratio” is derived from chol and hdl and is therefore highly correlated with them. This column can also be removed.
The following variables are converted into factors:
location
gender
smoking
dm
insurance
fh
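The column removal and factor conversion described above can be sketched as follows. This is a minimal illustration on a toy data frame whose values are made up; the helper names `drop_cols` and `factor_cols` are assumptions and are not taken from the original analysis.

```r
# Toy stand-in for the diabetes data (illustrative values, not the real rows)
diabetes <- data.frame(
  X = 1:4, id = c(1000, 1001, 1002, 1003),
  chol = c(203, 165, 228, 78), hdl = c(56, 24, 37, 12),
  ratio = c(3.6, 6.9, 6.2, 6.5),
  location = c("Buckingham", "Buckingham", "Louisa", "Louisa"),
  gender = c("female", "female", "female", "male"),
  frame = c("medium", "large", "large", "large"),
  time.ppn = c(720, 360, 180, 480),
  insurance = c(1, 0, 2, 1), fh = c(0, 0, 0, 1),
  smoking = c(3, 2, 2, 1), dm = c("no", "no", "no", "yes"),
  stringsAsFactors = FALSE
)

# Drop the identifier-like columns and the derived ratio column
drop_cols <- c("X", "id", "time.ppn", "frame", "ratio")
diabetes <- diabetes[, !(names(diabetes) %in% drop_cols)]

# Convert the categorical columns to factors
factor_cols <- c("location", "gender", "smoking", "dm", "insurance", "fh")
diabetes[factor_cols] <- lapply(diabetes[factor_cols], factor)

str(diabetes)
```

Assigning through `diabetes[factor_cols] <- lapply(...)` keeps the column order and converts all listed columns in one step.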
## chol stab.glu hdl glyhb location age gender height
## 1 0 1 13 0 0 0 5
## weight bp.1s bp.1d bp.2s bp.2d waist hip insurance
## 1 5 5 262 262 2 2 0
## fh smoking dm
## 0 0 13
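The counts above are the number of missing values in each column. bp.2s and bp.2d each have 262 missing values out of 403 rows and are dropped. For the rest of the variables, appropriate imputation techniques are applied.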
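Per-column missing-value counts of this kind can be produced with `colSums(is.na(...))`. The sketch below uses a toy data frame; the 50% cut-off for dropping a mostly empty column is an assumption made for illustration, not a rule stated in the analysis.

```r
# Toy data frame with missing values (illustrative values only)
df <- data.frame(
  chol  = c(203, NA, 228, 78),
  bp.2s = c(NA, NA, 185, NA),
  dm    = c("no", "no", NA, "no")
)

# Number of missing values per column
na_counts <- colSums(is.na(df))
print(na_counts)

# Share of missing values; columns above the (assumed) 50% cut-off are dropped
na_share <- na_counts / nrow(df)
mostly_missing <- names(na_share)[na_share > 0.5]
df <- df[, !(names(df) %in% mostly_missing), drop = FALSE]

print(names(df))
```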
The above graph suggests that location is also insignificant, and hence we remove that column.
## 'data.frame': 403 obs. of 16 variables:
## $ chol : int 203 165 228 78 249 248 195 227 177 263 ...
## $ stab.glu : int 82 97 92 93 90 94 92 75 87 89 ...
## $ hdl : int 56 24 37 12 28 69 41 44 49 40 ...
## $ glyhb : num 4.31 4.44 4.64 4.63 7.72 ...
## $ age : int 46 29 58 67 64 34 30 37 45 55 ...
## $ gender : Factor w/ 2 levels "female","male": 1 1 1 2 2 2 2 2 2 1 ...
## $ height : int 62 64 61 67 68 71 69 59 69 63 ...
## $ weight : int 121 218 256 119 183 190 191 170 166 202 ...
## $ bp.1s : int 118 112 190 110 138 132 161 NA 160 108 ...
## $ bp.1d : int 59 68 92 50 80 86 112 NA 80 72 ...
## $ waist : int 29 46 49 33 44 36 46 34 34 45 ...
## $ hip : int 38 48 57 38 41 42 49 39 40 50 ...
## $ insurance: Factor w/ 3 levels "0","1","2": 2 1 3 2 1 2 3 1 3 3 ...
## $ fh : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 2 1 2 1 ...
## $ smoking : Factor w/ 3 levels "1","2","3": 3 2 2 3 3 1 2 2 1 2 ...
## $ dm : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
The above graph suggests that, except for glyhb, the columns can be imputed with the mean. For glyhb the mode should be used, as it is heavily skewed.
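A minimal sketch of the two imputation rules, on toy values. Base R has no built-in function for the statistical mode, so `stat_mode` below is a hypothetical helper written for this example.

```r
# Hypothetical helper: most frequent non-missing value of a vector
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Toy data (illustrative values only)
df <- data.frame(
  chol  = c(203, NA, 228, 78, 249),
  glyhb = c(4.31, 4.44, NA, 4.31, 7.72)
)

# Mean imputation for a roughly symmetric column
df$chol[is.na(df$chol)] <- mean(df$chol, na.rm = TRUE)

# Mode imputation for the skewed glyhb column
df$glyhb[is.na(df$glyhb)] <- stat_mode(df$glyhb)

# No missing values should remain in the imputed columns
print(colSums(is.na(df)))
```

For a continuous variable the mode depends on exact repeated values; `median(x, na.rm = TRUE)` is a common alternative for skewed columns.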
## chol stab.glu hdl glyhb age gender height weight
## 0 0 0 0 0 0 0 0
## bp.1s bp.1d waist hip insurance fh smoking dm
## 0 0 0 0 0 0 0 13
# 70/30 train/test split with a fixed seed for reproducibility
n <- nrow(diabetes)
n_train <- round(0.7 * n)
set.seed(123)
train_indices <- sample(seq_len(n), n_train)
diabetes_train <- diabetes[train_indices, ]
diabetes_test <- diabetes[-train_indices, ]
##
## Call:
## glm(formula = dm ~ ., family = binomial, data = diabetes_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.972e-05 -2.100e-08 -2.100e-08 -2.100e-08 5.173e-05
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.541e+02 1.550e+06 0.000 1.000
## chol 1.865e-01 3.425e+02 0.001 1.000
## stab.glu -1.042e-01 3.640e+02 0.000 1.000
## hdl -4.046e-01 3.839e+03 0.000 1.000
## glyhb 4.064e+01 2.204e+04 0.002 0.999
## age 1.039e-01 2.643e+03 0.000 1.000
## gendermale -1.523e+01 6.489e+04 0.000 1.000
## height 1.615e+00 1.669e+04 0.000 1.000
## weight 2.214e-01 3.196e+03 0.000 1.000
## bp.1s 9.331e-02 9.803e+02 0.000 1.000
## bp.1d -6.131e-01 1.173e+03 -0.001 1.000
## waist -4.166e+00 2.057e+04 0.000 1.000
## hip 2.865e+00 8.394e+03 0.000 1.000
## insurance1 2.977e+00 8.661e+04 0.000 1.000
## insurance2 3.423e+00 6.315e+04 0.000 1.000
## fh1 -3.139e+01 7.873e+04 0.000 1.000
## smoking2 -1.490e+01 3.418e+04 0.000 1.000
## smoking3 -3.839e+00 4.672e+04 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.5153e+02 on 274 degrees of freedom
## Residual deviance: 1.9283e-08 on 257 degrees of freedom
## (7 observations deleted due to missingness)
## AIC: 36
##
## Number of Fisher Scoring iterations: 25
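The step from a fitted `glm` to a confusion matrix can be sketched as below. The data are synthetic, the 0.5 probability threshold is an assumption, and base R's `table` stands in for whichever function produced the confusion matrix statistics in this report (the format appears to be that of caret's `confusionMatrix`).

```r
set.seed(123)
n <- 200

# Synthetic predictor and outcome (not the real data); the random draw keeps classes overlapping
glu <- rnorm(n, mean = 100, sd = 25)
p <- plogis((glu - 110) / 10)
toy <- data.frame(
  stab.glu = glu,
  dm = factor(ifelse(runif(n) < p, "yes", "no"), levels = c("no", "yes"))
)

# Same 70/30 split pattern as in the analysis
train_idx <- sample(seq_len(n), round(0.7 * n))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

fit <- glm(dm ~ ., family = binomial, data = toy_train)

# Predicted probability of the second factor level ("yes"), thresholded at 0.5
prob_yes <- predict(fit, newdata = toy_test, type = "response")
pred <- factor(ifelse(prob_yes > 0.5, "yes", "no"), levels = c("no", "yes"))

# Rows are predictions, columns are the reference labels
conf_mat <- table(Prediction = pred, Reference = toy_test$dm)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```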
Logistic regression on the test data:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 101 2
## yes 1 11
##
## Accuracy : 0.9739
## 95% CI : (0.9257, 0.9946)
## No Information Rate : 0.887
## P-Value [Acc > NIR] : 0.0006462
##
## Kappa : 0.8654
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.9902
## Specificity : 0.8462
## Pos Pred Value : 0.9806
## Neg Pred Value : 0.9167
## Prevalence : 0.8870
## Detection Rate : 0.8783
## Detection Prevalence : 0.8957
## Balanced Accuracy : 0.9182
##
## 'Positive' Class : no
##
Decision tree on the test data:
## Confusion Matrix and Statistics
##
## Reference
## Prediction no yes
## no 102 0
## yes 0 13
##
## Accuracy : 1
## 95% CI : (0.9684, 1)
## No Information Rate : 0.887
## P-Value [Acc > NIR] : 1.02e-06
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.887
## Detection Rate : 0.887
## Detection Prevalence : 0.887
## Balanced Accuracy : 1.000
##
## 'Positive' Class : no
##
## Accuracy Kappa
## Decision Tree 1.000000 1.0000000
## Logistic Regression 0.973913 0.8653921
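As a check on the comparison table, accuracy and Cohen's kappa can be recomputed by hand from the logistic regression confusion matrix reported above (101, 2, 1, 11). The variable names in this sketch are illustrative.

```r
# Confusion matrix from the logistic regression output
# (rows = prediction, columns = reference); matrix() fills column by column
cm <- matrix(c(101, 1, 2, 11), nrow = 2,
             dimnames = list(Prediction = c("no", "yes"),
                             Reference = c("no", "yes")))

n <- sum(cm)

# Observed agreement (this is the accuracy)
p_observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
p_expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: agreement beyond chance, scaled to the maximum possible
cohen_kappa <- (p_observed - p_expected) / (1 - p_expected)

print(round(c(Accuracy = p_observed, Kappa = cohen_kappa), 4))
# Accuracy 0.9739, Kappa 0.8654, matching the Logistic Regression row
```

Kappa is lower than accuracy here because 88.7% of the test cases are "no", so a large share of the agreement would be expected by chance alone.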