There are a lot of risk factors that associate with DM. With provided dataset here I try to figure out which predictors in dataset associate with DM outcome,patient has DM or not, by creating multiple logistic regression model. Conclusion,after created models by predictors of interest (patient’s age, gender, smoking status, blood pressure, hemoglobin A1c level, BMI, cholesterol level and HDL level). I found that patient’s age, BMI and HDL level are the good prediction of DM in this dataset.

        This article is a part of Logistic Regression in R for Public Health by Imperial College London course.The study question of this article is "What are the factors that predict whether patients have diabetes ?"  

Step 0 : Import dataset into Rstudio.

dm<- read.csv("DM.csv")

Step 1 : Inspect dataset and cleaning data.

library(DT)
datatable(dm)
dim(dm)
## [1] 403  24

There are 24 variables and 403 observations.

  I will create multiple logistic regression model with patient's age, gender, smoking status, blood pressure,hemoglobin A1c, BMI, cholesterol level and  HDL level as predictors. And will remove unuseful predictors by backwards eliminations method.
head(dm[rowSums(is.na(dm))!=0,c(1:3,14:17)]  )
##   X   id chol bp.1s bp.1d bp.2s bp.2d
## 1 1 1000  203   118    59    NA    NA
## 2 2 1001  165   112    68    NA    NA
## 4 4 1003   78   110    50    NA    NA
## 5 5 1005  249   138    80    NA    NA
## 6 6 1008  248   132    86    NA    NA
## 8 8 1015  227    NA    NA    NA    NA

There are a lot of missing value, most of them are in bp.2s and bp.2d. This because patients was measured 2nd blood pressure only when they have high blood pressure in the first attempt. So I prefer use mean between 1st and 2nd attempt with ignore if there are NA both 1st and 2nd attempts.

library(dplyr)
dm<- dm%>%
   rowwise()%>%
   mutate(sys_bp = mean(c(bp.1s,bp.2s),na.rm = TRUE))

dm<- dm%>%
   rowwise()%>%
   mutate(di_bp = mean(c(bp.1d,bp.2d),na.rm = TRUE))
#Set binary and category variables.
dm$dm<- as.factor(dm$dm)
dm$gender <- as.factor(dm$gender)
dm$smoking<-as.factor(dm$smoking)

This dataset provide patients weight and height. I am interested in using BMI, so let calculate BMI.

dm<-dm %>% 
      rowwise()%>%
      mutate(BMI = (weight*0.45359237)/((height*0.0254)^2)) ##Change data to SI unit first.

step 2 : Explore interested predictors.

summary(dm$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   34.00   45.00   46.85   60.00   92.00
hist(dm$age, main = "patients age histogram", xlab = "Age", breaks = 20, col =rgb(0.5,0,0,0.5))

Age is skewed to the right.

summary(dm$BMI)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   15.20   24.13   27.80   28.79   32.24   55.79       6
summary(dm$glyhb)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.68    4.38    4.84    5.59    5.60   16.11      13
par(mfrow=c(1,2))
hist(dm$BMI, main = "patients BMI histogram", xlab = "BMI", breaks = 20, col =rgb(0,0.5,0,0.5))
hist(dm$glyhb, main = "patients HbA1c histogram", xlab = "HbA1c", breaks = 20, col =rgb(0,0.5,0,0.5))

BMI and HbA1C are skewed to the right.

summary(dm$sys_bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    90.0   121.0   136.0   136.5   146.0   250.0       5
summary(dm$di_bp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   48.00   75.00   82.00   82.98   90.00  124.00       5
par(mfrow = c(1,2))
hist(dm$sys_bp, main = "patients systolic BP histogram", xlab = "Systolic BP", breaks = 20)
hist(dm$di_bp, main = "patients diastolic BP histogram", xlab = "Diastolic BP", breaks = 20)

Systolic blood pressure is skewed, but diastolic blood pressure is approximate to normal distribution.

summary(dm$chol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    78.0   179.0   204.0   207.8   230.0   443.0       1
summary(dm$hdl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   38.00   46.00   50.45   59.00  120.00       1
par(mfrow = c(1,2))
hist(dm$chol, main = "patients cholesterol level histogram", xlab = "Cholesterol", breaks = 20)
hist(dm$hdl, main = "patients HDL level histogram", xlab = "HDL", breaks = 20)

Both cholesterol level and HDL level are skewed.

Step 3 : EDA plot.

For categories variables Cramer’s V statistic is used fo identify reletionship.

round(prop.table(table(dm$gender)),2)
## 
## female   male 
##   0.58   0.42
round(prop.table(table(dm$smoking)),2) ##smoking: 1=current, 2=never and 3=ex
## 
##    1    2    3 
## 0.30 0.55 0.15
library(gmodels)
CrossTable(dm$gender,dm$smoking)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  403 
## 
##  
##              | dm$smoking 
##    dm$gender |         1 |         2 |         3 | Row Total | 
## -------------|-----------|-----------|-----------|-----------|
##       female |        71 |       131 |        32 |       234 | 
##              |     0.008 |     0.018 |     0.149 |           | 
##              |     0.303 |     0.560 |     0.137 |     0.581 | 
##              |     0.587 |     0.587 |     0.542 |           | 
##              |     0.176 |     0.325 |     0.079 |           | 
## -------------|-----------|-----------|-----------|-----------|
##         male |        50 |        92 |        27 |       169 | 
##              |     0.011 |     0.025 |     0.206 |           | 
##              |     0.296 |     0.544 |     0.160 |     0.419 | 
##              |     0.413 |     0.413 |     0.458 |           | 
##              |     0.124 |     0.228 |     0.067 |           | 
## -------------|-----------|-----------|-----------|-----------|
## Column Total |       121 |       223 |        59 |       403 | 
##              |     0.300 |     0.553 |     0.146 |           | 
## -------------|-----------|-----------|-----------|-----------|
## 
## 
##Cramer’s V is used to calculate the correlation between nominal categorical variables. 
library(rcompanion)
cramerV(dm$gender,dm$smoking)
## Cramer V 
##  0.03213

There is weak correlation across gender and smoking status.

For continuous variables correlation matrix is used for identify relationship across all predictors

library(Hmisc)
cor<-rcorr(as.matrix(dm[,c("age","sys_bp","di_bp","glyhb","BMI","chol","hdl")]))
round(cor$r,3)
##           age sys_bp di_bp  glyhb    BMI  chol    hdl
## age     1.000  0.445 0.073  0.339 -0.008 0.233  0.038
## sys_bp  0.445  1.000 0.614  0.185  0.112 0.199  0.042
## di_bp   0.073  0.614 1.000  0.024  0.164 0.167  0.087
## glyhb   0.339  0.185 0.024  1.000  0.129 0.247 -0.149
## BMI    -0.008  0.112 0.164  0.129  1.000 0.086 -0.242
## chol    0.233  0.199 0.167  0.247  0.086 1.000  0.187
## hdl     0.038  0.042 0.087 -0.149 -0.242 0.187  1.000
round(cor$P,3)
##          age sys_bp di_bp glyhb   BMI  chol   hdl
## age       NA  0.000 0.144 0.000 0.879 0.000 0.446
## sys_bp 0.000     NA 0.000 0.000 0.026 0.000 0.409
## di_bp  0.144  0.000    NA 0.640 0.001 0.001 0.082
## glyhb  0.000  0.000 0.640    NA 0.011 0.000 0.003
## BMI    0.879  0.026 0.001 0.011    NA 0.088 0.000
## chol   0.000  0.000 0.001 0.000 0.088    NA 0.000
## hdl    0.446  0.409 0.082 0.003 0.000 0.000    NA

The correlation matrix show that there are statistical significant relationship when p-value < 0.05. So I will eliminate sys_bp,di_bp,glyhb and chol from predictors of interest to avoid multicollinearity.

Step 4 : Create multiple logistic regression model.

dm_model <- glm(data = dm,dm~age+gender+smoking+BMI+hdl, family = binomial (link = logit))
summary(dm_model)
## 
## Call:
## glm(formula = dm ~ age + gender + smoking + BMI + hdl, family = binomial(link = logit), 
##     data = dm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4864  -0.5463  -0.3842  -0.2254   2.8598  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.33306    1.21267  -4.398 1.09e-05 ***
## age          0.05859    0.01037   5.651 1.59e-08 ***
## gendermale   0.01802    0.34030   0.053   0.9578    
## smoking2    -0.07449    0.34833  -0.214   0.8307    
## smoking3    -0.06028    0.48340  -0.125   0.9008    
## BMI          0.05812    0.02451   2.372   0.0177 *  
## hdl         -0.02321    0.01048  -2.214   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 325.70  on 382  degrees of freedom
## Residual deviance: 275.88  on 376  degrees of freedom
##   (20 observations deleted due to missingness)
## AIC: 289.88
## 
## Number of Fisher Scoring iterations: 5
anova(dm_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: dm
## 
## Terms added sequentially (first to last)
## 
## 
##         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                      382     325.70              
## age      1   34.302       381     291.39 4.719e-09 ***
## gender   1    0.019       380     291.38  0.889155    
## smoking  2    0.256       378     291.12  0.879687    
## BMI      1    9.752       377     281.37  0.001791 ** 
## hdl      1    5.488       376     275.88  0.019145 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model show that gender and smoking are not statistical significant, so I will eliminate gender and smoking from predictors of interest.

dm_model2 <- glm(data = dm,dm~age+BMI+hdl, family = binomial (link = logit))
summary(dm_model2)
## 
## Call:
## glm(formula = dm ~ age + BMI + hdl, family = binomial(link = logit), 
##     data = dm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4929  -0.5507  -0.3832  -0.2235   2.8829  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.35643    1.09013  -4.914 8.94e-07 ***
## age          0.05879    0.01027   5.726 1.03e-08 ***
## BMI          0.05763    0.02295   2.512   0.0120 *  
## hdl         -0.02352    0.01005  -2.340   0.0193 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 325.70  on 382  degrees of freedom
## Residual deviance: 275.93  on 379  degrees of freedom
##   (20 observations deleted due to missingness)
## AIC: 283.93
## 
## Number of Fisher Scoring iterations: 5
anova(dm_model2, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: dm
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                   382     325.70              
## age   1   34.302       381     291.39 4.719e-09 ***
## BMI   1    9.301       380     282.09   0.00229 ** 
## hdl   1    6.166       379     275.93   0.01302 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After eliminate some predictors, model coefficient do not change too much.

round(exp(dm_model2$coefficients),3)
## (Intercept)         age         BMI         hdl 
##       0.005       1.061       1.059       0.977

The coefficients of model show that age and BMI increase OR by 6%, while HDL level decrease OR by 2%.

Step 5 : Check model fit.

library(DescTools)
library(generalhoslem)
Cstat(dm_model2)
## [1] 0.7931565
logitgof(obs = dm_model2$y,exp = fitted(dm_model2), g=10)
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  dm_model2$y, fitted(dm_model2)
## X-squared = 16.943, df = 8, p-value = 0.03071

The c-stat, power of prediction, of this model is good, but the goodness of fit is not good enough.