Ensure to download and store the data file into your PC from the link: https://github.com/winterwang/logistic-reg-CSS/raw/gh-pages/dataset/mortality.dta
Ensure you have installed R, Rstudio, and the following packages by commands below:
requiredPackages = c('haven','Epi','epiDisplay', 'tidyverse')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
library(haven)
mortality <- read_dta("C:/Users/{YOUR.USER.NAME}/Desktop/mortality.dta")
summary()
or etc. to see more details of the data.library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
with(mortality, tabpct(mfgrp, died, graph = FALSE))
##
## Original table
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## 1 1609 62 1671
## 2 996 33 1029
## 3 193 9 202
## Total 4072 133 4205
##
## Row percent
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## (97.8) (2.2) (100)
## 1 1609 62 1671
## (96.3) (3.7) (100)
## 2 996 33 1029
## (96.8) (3.2) (100)
## 3 193 9 202
## (95.5) (4.5) (100)
##
## Column percent
## died
## mfgrp 0 % 1 %
## 0 1274 (31.3) 29 (21.8)
## 1 1609 (39.5) 62 (46.6)
## 2 996 (24.5) 33 (24.8)
## 3 193 (4.7) 9 (6.8)
## Total 4072 (100) 133 (100)
The outcome variable is died
, which is coded as 1 for an individual who has died and 0 for an individual who is alive after 3 years of follow-up. Analyse this data set treating it as a cohort study with fixed follow-up time (ie., assume that all individuals were followed for the same length of time and analyse the data using odds).
To use logistic regression to examine the association between died
and vimp
, type
vimp_logm0 <- glm(died ~ vimp, data = mortality,
family = binomial(link = "logit"))
summary(vimp_logm0)
##
## Call:
## glm(formula = died ~ vimp, family = binomial(link = "logit"),
## data = mortality)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.5108 -0.2224 -0.2224 -0.2224 2.7247
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.6873 0.1028 -35.870 <2e-16 ***
## vimp 1.7167 0.1976 8.687 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1213.8 on 4297 degrees of freedom
## Residual deviance: 1154.7 on 4296 degrees of freedom
## AIC: 1158.7
##
## Number of Fisher Scoring iterations: 6
Check that the the output corresponds to that in the slides.
Logistic regression estimates are derived by starting with a guess at the parameter estimates, then using the result to compute a better guess (nearer to the maximum likelihood estimates). This is known as iteration. The log likelihood for each iteration is shown. The procedure stops when there is no further increase in the log likelihood.
library(epiDisplay)
logistic.display(vimp_logm0, decimal = 3)
##
## Logistic regression predicting died
##
## OR(95%CI) P(Wald's test) P(LR-test)
## vimp: 1 vs 0 5.566 (3.779,8.199) < 0.001 < 0.001
##
## Log-likelihood = -577.36596
## No. of observations = 4298
## AIC value = 1158.73192
The results are now shown as odds ratios. Note that:
“Visual impairment is strongly associated with risk of death (P<0.0001). Those visually impaired have nearly 6 times the odds of death compared to those with unimpaired vision (OR 5.57 95% CI: 3.78-8.20).”
with(mortality, tabpct(mfgrp, died,
percent = "row",
graph = FALSE))
##
## Row percent
## died
## mfgrp 0 1 Total
## 0 1274 29 1303
## (97.8) (2.2) (100)
## 1 1609 62 1671
## (96.3) (3.7) (100)
## 2 996 33 1029
## (96.8) (3.2) (100)
## 3 193 9 202
## (95.5) (4.5) (100)
We can then describe the risk of death over a 3 year period for those with no microfilarial infection (2.2%), <10 microfilarial load/mg infection (3.7%), etc.
library(tidyverse)
mortality <- mortality %>%
mutate(mfgrp = factor(mfgrp)) %>%
mutate(mfgrp = fct_recode(mfgrp,
"Uninfected" = "0",
"< 10" = "1",
"10-49" = "2",
"50+" = "3"))
vimp_logm1 <- glm(died ~ mfgrp, data = mortality,
family = binomial(link = "logit"))
summary(vimp_logm1)
##
## Call:
## glm(formula = died ~ mfgrp, family = binomial(link = "logit"),
## data = mortality)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.3019 -0.2750 -0.2553 -0.2122 2.7587
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.7826 0.1878 -20.143 <2e-16 ***
## mfgrp< 10 0.5264 0.2281 2.308 0.0210 *
## mfgrp10-49 0.3754 0.2580 1.455 0.1457
## mfgrp50+ 0.7172 0.3893 1.842 0.0655 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1180.4 on 4204 degrees of freedom
## Residual deviance: 1173.7 on 4201 degrees of freedom
## (93 observations deleted due to missingness)
## AIC: 1181.7
##
## Number of Fisher Scoring iterations: 6
To get the output on the odds ratio scale, type:
logistic.display(vimp_logm1, decimal = 3)
##
## Logistic regression predicting died
##
## OR(95%CI) P(Wald's test) P(LR-test)
## mfgrp: ref.=Uninfected 0.0822
## < 10 1.693 (1.083,2.647) 0.021
## 10-49 1.456 (0.878,2.414) 0.1457
## 50+ 2.049 (0.955,4.394) 0.0655
##
## Log-likelihood = -586.86507
## No. of observations = 4205
## AIC value = 1181.73014
Note that there are three odds ratios each of which refers to the same baseline group (those uninfected). The odds ratio is - 1.69 for those with microfilarial load <10 mf/mg (coded as 1) compared to those uninfected (coded as 0) - 1.46 for those microfilarial load 10-49 mf/mg (coded as 2) compared to those uninfected (0) and - 2.05 for those microfilarial load ≥50 mf/mg (coded as 3) compared to those uninfected (0).
There is weak evidence of an association between microfilarial infection (measured on 4 levels) and death (p=0.08). The risk of death among those uninfected is 2.2% which is lower than those infected. The odd ratios for those with <10, 10-49 and 50+ microfilarial /mg compared with being infected were 1.69 (95% CI: 1.08-2.64), 1.46 (0.88-2.41) and 2.05 (0.96-4.39), respectively.
mortality <- mortality %>%
mutate(agegrp = factor(agegrp)) %>%
mutate(agegrp = fct_recode(agegrp,
"15-34" = "0",
"35-54" = "1",
"55-64" = "2",
"65+" = "3"))
vimp_logm2 <- glm(died ~ vimp + agegrp,
data = mortality,
family = binomial(link = "logit"))
logistic.display(vimp_logm2, decimal = 3)
##
## Logistic regression predicting died
##
## crude OR(95%CI) adj. OR(95%CI)
## vimp: 1 vs 0 5.566 (3.779,8.199) 2.202 (1.409,3.443)
##
## agegrp: ref.=15-34
## 35-54 2.578 (1.633,4.07) 2.355 (1.484,3.737)
## 55-64 6.934 (4.058,11.847) 5.415 (3.083,9.513)
## 65+ 14.802 (8.809,24.872) 9.901 (5.541,17.692)
##
## P(Wald's test) P(LR-test)
## vimp: 1 vs 0 < 0.001 < 0.001
##
## agegrp: ref.=15-34 < 0.001
## 35-54 < 0.001
## 55-64 < 0.001
## 65+ < 0.001
##
## Log-likelihood = -545.8609
## No. of observations = 4298
## AIC value = 1101.7218
There is strong evidence against the null hypothesis of no association of age with death (p<0.001). The odds of death increase with increasing age.