There are a lot of risk factors that associate with DM. With provided dataset here I try to figure out which predictors in dataset associate with DM outcome,patient has DM or not, by creating multiple logistic regression model. Conclusion,after created models by predictors of interest (patient’s age, gender, smoking status, blood pressure, hemoglobin A1c level, BMI, cholesterol level and HDL level). I found that patient’s age, BMI and HDL level are the good prediction of DM in this dataset.
This article is a part of Logistic Regression in R for Public Health by Imperial College London course.The study question of this article is "What are the factors that predict whether patients have diabetes ?"
Step 0 : Import dataset into Rstudio.
dm<- read.csv("DM.csv")
Step 1 : Inspect dataset and cleaning data.
library(DT)
datatable(dm)
dim(dm)
## [1] 403 24
There are 24 variables and 403 observations.
I will create multiple logistic regression model with patient's age, gender, smoking status, blood pressure,hemoglobin A1c, BMI, cholesterol level and HDL level as predictors. And will remove unuseful predictors by backwards eliminations method.
head(dm[rowSums(is.na(dm))!=0,c(1:3,14:17)] )
## X id chol bp.1s bp.1d bp.2s bp.2d
## 1 1 1000 203 118 59 NA NA
## 2 2 1001 165 112 68 NA NA
## 4 4 1003 78 110 50 NA NA
## 5 5 1005 249 138 80 NA NA
## 6 6 1008 248 132 86 NA NA
## 8 8 1015 227 NA NA NA NA
There are a lot of missing value, most of them are in bp.2s and bp.2d. This because patients was measured 2nd blood pressure only when they have high blood pressure in the first attempt. So I prefer use mean between 1st and 2nd attempt with ignore if there are NA both 1st and 2nd attempts.
library(dplyr)
dm<- dm%>%
rowwise()%>%
mutate(sys_bp = mean(c(bp.1s,bp.2s),na.rm = TRUE))
dm<- dm%>%
rowwise()%>%
mutate(di_bp = mean(c(bp.1d,bp.2d),na.rm = TRUE))
#Set binary and category variables.
dm$dm<- as.factor(dm$dm)
dm$gender <- as.factor(dm$gender)
dm$smoking<-as.factor(dm$smoking)
This dataset provide patients weight and height. I am interested in using BMI, so let calculate BMI.
dm<-dm %>%
rowwise()%>%
mutate(BMI = (weight*0.45359237)/((height*0.0254)^2)) ##Change data to SI unit first.
step 2 : Explore interested predictors.
summary(dm$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 34.00 45.00 46.85 60.00 92.00
hist(dm$age, main = "patients age histogram", xlab = "Age", breaks = 20, col =rgb(0.5,0,0,0.5))
Age is skewed to the right.
summary(dm$BMI)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 15.20 24.13 27.80 28.79 32.24 55.79 6
summary(dm$glyhb)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.68 4.38 4.84 5.59 5.60 16.11 13
par(mfrow=c(1,2))
hist(dm$BMI, main = "patients BMI histogram", xlab = "BMI", breaks = 20, col =rgb(0,0.5,0,0.5))
hist(dm$glyhb, main = "patients HbA1c histogram", xlab = "HbA1c", breaks = 20, col =rgb(0,0.5,0,0.5))
BMI and HbA1C are skewed to the right.
summary(dm$sys_bp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 90.0 121.0 136.0 136.5 146.0 250.0 5
summary(dm$di_bp)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 48.00 75.00 82.00 82.98 90.00 124.00 5
par(mfrow = c(1,2))
hist(dm$sys_bp, main = "patients systolic BP histogram", xlab = "Systolic BP", breaks = 20)
hist(dm$di_bp, main = "patients diastolic BP histogram", xlab = "Diastolic BP", breaks = 20)
Systolic blood pressure is skewed, but diastolic blood pressure is approximate to normal distribution.
summary(dm$chol)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 78.0 179.0 204.0 207.8 230.0 443.0 1
summary(dm$hdl)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 12.00 38.00 46.00 50.45 59.00 120.00 1
par(mfrow = c(1,2))
hist(dm$chol, main = "patients cholesterol level histogram", xlab = "Cholesterol", breaks = 20)
hist(dm$hdl, main = "patients HDL level histogram", xlab = "HDL", breaks = 20)
Both cholesterol level and HDL level are skewed.
Step 3 : EDA plot.
For categories variables Cramer’s V statistic is used fo identify reletionship.
round(prop.table(table(dm$gender)),2)
##
## female male
## 0.58 0.42
round(prop.table(table(dm$smoking)),2) ##smoking: 1=current, 2=never and 3=ex
##
## 1 2 3
## 0.30 0.55 0.15
library(gmodels)
CrossTable(dm$gender,dm$smoking)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 403
##
##
## | dm$smoking
## dm$gender | 1 | 2 | 3 | Row Total |
## -------------|-----------|-----------|-----------|-----------|
## female | 71 | 131 | 32 | 234 |
## | 0.008 | 0.018 | 0.149 | |
## | 0.303 | 0.560 | 0.137 | 0.581 |
## | 0.587 | 0.587 | 0.542 | |
## | 0.176 | 0.325 | 0.079 | |
## -------------|-----------|-----------|-----------|-----------|
## male | 50 | 92 | 27 | 169 |
## | 0.011 | 0.025 | 0.206 | |
## | 0.296 | 0.544 | 0.160 | 0.419 |
## | 0.413 | 0.413 | 0.458 | |
## | 0.124 | 0.228 | 0.067 | |
## -------------|-----------|-----------|-----------|-----------|
## Column Total | 121 | 223 | 59 | 403 |
## | 0.300 | 0.553 | 0.146 | |
## -------------|-----------|-----------|-----------|-----------|
##
##
##Cramer’s V is used to calculate the correlation between nominal categorical variables.
library(rcompanion)
cramerV(dm$gender,dm$smoking)
## Cramer V
## 0.03213
There is weak correlation across gender and smoking status.
For continuous variables correlation matrix is used for identify relationship across all predictors
library(Hmisc)
cor<-rcorr(as.matrix(dm[,c("age","sys_bp","di_bp","glyhb","BMI","chol","hdl")]))
round(cor$r,3)
## age sys_bp di_bp glyhb BMI chol hdl
## age 1.000 0.445 0.073 0.339 -0.008 0.233 0.038
## sys_bp 0.445 1.000 0.614 0.185 0.112 0.199 0.042
## di_bp 0.073 0.614 1.000 0.024 0.164 0.167 0.087
## glyhb 0.339 0.185 0.024 1.000 0.129 0.247 -0.149
## BMI -0.008 0.112 0.164 0.129 1.000 0.086 -0.242
## chol 0.233 0.199 0.167 0.247 0.086 1.000 0.187
## hdl 0.038 0.042 0.087 -0.149 -0.242 0.187 1.000
round(cor$P,3)
## age sys_bp di_bp glyhb BMI chol hdl
## age NA 0.000 0.144 0.000 0.879 0.000 0.446
## sys_bp 0.000 NA 0.000 0.000 0.026 0.000 0.409
## di_bp 0.144 0.000 NA 0.640 0.001 0.001 0.082
## glyhb 0.000 0.000 0.640 NA 0.011 0.000 0.003
## BMI 0.879 0.026 0.001 0.011 NA 0.088 0.000
## chol 0.000 0.000 0.001 0.000 0.088 NA 0.000
## hdl 0.446 0.409 0.082 0.003 0.000 0.000 NA
The correlation matrix show that there are statistical significant relationship when p-value < 0.05. So I will eliminate sys_bp,di_bp,glyhb and chol from predictors of interest to avoid multicollinearity.
Step 4 : Create multiple logistic regression model.
dm_model <- glm(data = dm,dm~age+gender+smoking+BMI+hdl, family = binomial (link = logit))
summary(dm_model)
##
## Call:
## glm(formula = dm ~ age + gender + smoking + BMI + hdl, family = binomial(link = logit),
## data = dm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4864 -0.5463 -0.3842 -0.2254 2.8598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.33306 1.21267 -4.398 1.09e-05 ***
## age 0.05859 0.01037 5.651 1.59e-08 ***
## gendermale 0.01802 0.34030 0.053 0.9578
## smoking2 -0.07449 0.34833 -0.214 0.8307
## smoking3 -0.06028 0.48340 -0.125 0.9008
## BMI 0.05812 0.02451 2.372 0.0177 *
## hdl -0.02321 0.01048 -2.214 0.0268 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 325.70 on 382 degrees of freedom
## Residual deviance: 275.88 on 376 degrees of freedom
## (20 observations deleted due to missingness)
## AIC: 289.88
##
## Number of Fisher Scoring iterations: 5
anova(dm_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: dm
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 382 325.70
## age 1 34.302 381 291.39 4.719e-09 ***
## gender 1 0.019 380 291.38 0.889155
## smoking 2 0.256 378 291.12 0.879687
## BMI 1 9.752 377 281.37 0.001791 **
## hdl 1 5.488 376 275.88 0.019145 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The model show that gender and smoking are not statistical significant, so I will eliminate gender and smoking from predictors of interest.
dm_model2 <- glm(data = dm,dm~age+BMI+hdl, family = binomial (link = logit))
summary(dm_model2)
##
## Call:
## glm(formula = dm ~ age + BMI + hdl, family = binomial(link = logit),
## data = dm)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4929 -0.5507 -0.3832 -0.2235 2.8829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.35643 1.09013 -4.914 8.94e-07 ***
## age 0.05879 0.01027 5.726 1.03e-08 ***
## BMI 0.05763 0.02295 2.512 0.0120 *
## hdl -0.02352 0.01005 -2.340 0.0193 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 325.70 on 382 degrees of freedom
## Residual deviance: 275.93 on 379 degrees of freedom
## (20 observations deleted due to missingness)
## AIC: 283.93
##
## Number of Fisher Scoring iterations: 5
anova(dm_model2, test = "Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: dm
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 382 325.70
## age 1 34.302 381 291.39 4.719e-09 ***
## BMI 1 9.301 380 282.09 0.00229 **
## hdl 1 6.166 379 275.93 0.01302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After eliminate some predictors, model coefficient do not change too much.
round(exp(dm_model2$coefficients),3)
## (Intercept) age BMI hdl
## 0.005 1.061 1.059 0.977
The coefficients of model show that age and BMI increase OR by 6%, while HDL level decrease OR by 2%.
Step 5 : Check model fit.
library(DescTools)
library(generalhoslem)
Cstat(dm_model2)
## [1] 0.7931565
logitgof(obs = dm_model2$y,exp = fitted(dm_model2), g=10)
##
## Hosmer and Lemeshow test (binary model)
##
## data: dm_model2$y, fitted(dm_model2)
## X-squared = 16.943, df = 8, p-value = 0.03071
The c-stat, power of prediction, of this model is good, but the goodness of fit is not good enough.