By the end of this practical, you will be able to:

Follow the following steps:

  1. Ensure to download and store the data file into your PC from the link: https://github.com/winterwang/logistic-reg-CSS/raw/gh-pages/dataset/mortality.dta

  2. Ensure you have installed R, Rstudio, and the following packages by commands below:

requiredPackages = c('haven','Epi','epiDisplay', 'tidyverse')
for(p in requiredPackages){
  if(!require(p,character.only = TRUE)) install.packages(p)
  library(p,character.only = TRUE)
}
  1. If data is downloaded to your desktop, use the following command to read the data into R:
library(haven)
mortality <- read_dta("C:/Users/{YOUR.USER.NAME}/Desktop/mortality.dta")
  1. If data is successfully loaded into R, you can see the dataname, number of variables, number of observations on the upper right side of the Rstudio program:

  1. You may go ahead and explore the dataset as you wish by clicking on the name of the data, or by commands such as summary() or etc. to see more details of the data.

library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
with(mortality, tabpct(mfgrp, died, graph = FALSE))
## 
## Original table 
##        died
## mfgrp       0     1  Total
##   0      1274    29   1303
##   1      1609    62   1671
##   2       996    33   1029
##   3       193     9    202
##   Total  4072   133   4205
## 
## Row percent 
##      died
## mfgrp       0      1  Total
##     0    1274     29   1303
##        (97.8)  (2.2)  (100)
##     1    1609     62   1671
##        (96.3)  (3.7)  (100)
##     2     996     33   1029
##        (96.8)  (3.2)  (100)
##     3     193      9    202
##        (95.5)  (4.5)  (100)
## 
## Column percent 
##        died
## mfgrp       0       %    1       %
##   0      1274  (31.3)   29  (21.8)
##   1      1609  (39.5)   62  (46.6)
##   2       996  (24.5)   33  (24.8)
##   3       193   (4.7)    9   (6.8)
##   Total  4072   (100)  133   (100)
  1. The outcome variable is died, which is coded as 1 for an individual who has died and 0 for an individual who is alive after 3 years of follow-up. Analyse this data set treating it as a cohort study with fixed follow-up time (ie., assume that all individuals were followed for the same length of time and analyse the data using odds).

  2. To use logistic regression to examine the association between died and vimp, type

vimp_logm0 <- glm(died ~ vimp, data = mortality, 
                  family = binomial(link = "logit"))
summary(vimp_logm0)
## 
## Call:
## glm(formula = died ~ vimp, family = binomial(link = "logit"), 
##     data = mortality)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5108  -0.2224  -0.2224  -0.2224   2.7247  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.6873     0.1028 -35.870   <2e-16 ***
## vimp          1.7167     0.1976   8.687   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1213.8  on 4297  degrees of freedom
## Residual deviance: 1154.7  on 4296  degrees of freedom
## AIC: 1158.7
## 
## Number of Fisher Scoring iterations: 6

Check that the the output corresponds to that in the slides.

Logistic regression estimates are derived by starting with a guess at the parameter estimates, then using the result to compute a better guess (nearer to the maximum likelihood estimates). This is known as iteration. The log likelihood for each iteration is shown. The procedure stops when there is no further increase in the log likelihood.

  1. So far, all the output has been on the log scale. This was needed in order to explain how confidence intervals and Wald tests (z-tests) are derived. In fact, R allows us to derive estimates on the odds ratio scale, which is much more convenient for reporting results. Type:
library(epiDisplay)
logistic.display(vimp_logm0, decimal = 3)
## 
## Logistic regression predicting died 
##  
##               OR(95%CI)            P(Wald's test) P(LR-test)
## vimp: 1 vs 0  5.566 (3.779,8.199)  < 0.001        < 0.001   
##                                                             
## Log-likelihood = -577.36596
## No. of observations = 4298
## AIC value = 1158.73192

The results are now shown as odds ratios. Note that:

  1. Example of how results may be summarised in a sentence.

“Visual impairment is strongly associated with risk of death (P<0.0001). Those visually impaired have nearly 6 times the odds of death compared to those with unimpaired vision (OR 5.57 95% CI: 3.78-8.20).”

  1. When tabulating variables for a cohort (or cross-sectional) study, use row percentages if the explanatory variable is the row variable, column percentages if it is the column variable. For example:
with(mortality, tabpct(mfgrp, died, 
                       percent = "row", 
                       graph = FALSE))
## 
## Row percent 
##      died
## mfgrp       0      1  Total
##     0    1274     29   1303
##        (97.8)  (2.2)  (100)
##     1    1609     62   1671
##        (96.3)  (3.7)  (100)
##     2     996     33   1029
##        (96.8)  (3.2)  (100)
##     3     193      9    202
##        (95.5)  (4.5)  (100)

We can then describe the risk of death over a 3 year period for those with no microfilarial infection (2.2%), <10 microfilarial load/mg infection (3.7%), etc.

  1. As explained in the slides, we need to use indicator variables to examine the association between microfilarial load (4 levels) and death. We need to recode the varibales, type
library(tidyverse)
mortality <- mortality %>% 
  mutate(mfgrp = factor(mfgrp)) %>% 
  mutate(mfgrp = fct_recode(mfgrp, 
                            "Uninfected" = "0", 
                            "< 10"       = "1",
                            "10-49"      = "2", 
                            "50+"        = "3"))

vimp_logm1 <- glm(died ~ mfgrp, data = mortality, 
                  family = binomial(link = "logit"))
summary(vimp_logm1)
## 
## Call:
## glm(formula = died ~ mfgrp, family = binomial(link = "logit"), 
##     data = mortality)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3019  -0.2750  -0.2553  -0.2122   2.7587  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.7826     0.1878 -20.143   <2e-16 ***
## mfgrp< 10     0.5264     0.2281   2.308   0.0210 *  
## mfgrp10-49    0.3754     0.2580   1.455   0.1457    
## mfgrp50+      0.7172     0.3893   1.842   0.0655 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1180.4  on 4204  degrees of freedom
## Residual deviance: 1173.7  on 4201  degrees of freedom
##   (93 observations deleted due to missingness)
## AIC: 1181.7
## 
## Number of Fisher Scoring iterations: 6
  1. To get the output on the odds ratio scale, type:
logistic.display(vimp_logm1, decimal = 3)
## 
## Logistic regression predicting died 
##  
##                        OR(95%CI)            P(Wald's test) P(LR-test)
## mfgrp: ref.=Uninfected                                     0.0822    
##    < 10                1.693 (1.083,2.647)  0.021                    
##    10-49               1.456 (0.878,2.414)  0.1457                   
##    50+                 2.049 (0.955,4.394)  0.0655                   
##                                                                      
## Log-likelihood = -586.86507
## No. of observations = 4205
## AIC value = 1181.73014

Note that there are three odds ratios each of which refers to the same baseline group (those uninfected). The odds ratio is - 1.69 for those with microfilarial load <10 mf/mg (coded as 1) compared to those uninfected (coded as 0) - 1.46 for those microfilarial load 10-49 mf/mg (coded as 2) compared to those uninfected (0) and - 2.05 for those microfilarial load ≥50 mf/mg (coded as 3) compared to those uninfected (0).

  1. Write 2-3 sentences describing the relationship between microfilarial infection and death

There is weak evidence of an association between microfilarial infection (measured on 4 levels) and death (p=0.08). The risk of death among those uninfected is 2.2% which is lower than those infected. The odd ratios for those with <10, 10-49 and 50+ microfilarial /mg compared with being infected were 1.69 (95% CI: 1.08-2.64), 1.46 (0.88-2.41) and 2.05 (0.96-4.39), respectively.

  1. Logistic regression model with agegrp as an explanatory variable:
mortality <- mortality %>% 
  mutate(agegrp = factor(agegrp)) %>% 
  mutate(agegrp = fct_recode(agegrp, 
                            "15-34" = "0", 
                            "35-54"       = "1",
                            "55-64"      = "2", 
                            "65+"        = "3"))
vimp_logm2 <- glm(died ~ vimp + agegrp, 
                  data = mortality, 
                  family = binomial(link = "logit"))
logistic.display(vimp_logm2, decimal = 3)
## 
## Logistic regression predicting died 
##  
##                    crude OR(95%CI)        adj. OR(95%CI)       
## vimp: 1 vs 0       5.566 (3.779,8.199)    2.202 (1.409,3.443)  
##                                                                
## agegrp: ref.=15-34                                             
##    35-54           2.578 (1.633,4.07)     2.355 (1.484,3.737)  
##    55-64           6.934 (4.058,11.847)   5.415 (3.083,9.513)  
##    65+             14.802 (8.809,24.872)  9.901 (5.541,17.692) 
##                                                                
##                    P(Wald's test) P(LR-test)
## vimp: 1 vs 0       < 0.001        < 0.001   
##                                             
## agegrp: ref.=15-34                < 0.001   
##    35-54           < 0.001                  
##    55-64           < 0.001                  
##    65+             < 0.001                  
##                                             
## Log-likelihood = -545.8609
## No. of observations = 4298
## AIC value = 1101.7218

There is strong evidence against the null hypothesis of no association of age with death (p<0.001). The odds of death increase with increasing age.