## By the end of this practical, you will be able to:

• Use R to compare two or more groups in a crude analysis
• Use R to estimate the effect (odds ratio) of one exposure, controlled for the effect of a second variable, and obtain a confidence interval for this odds ratio
• Use logistic regression to assess if the effect of an exposure (e.g. visual impairment) is confounded by another variable (e.g. age)

2. Ensure you have installed R, Rstudio, and the following packages by commands below:

requiredPackages = c('haven','Epi','epiDisplay', 'tidyverse')
for(p in requiredPackages){
if(!require(p,character.only = TRUE)) install.packages(p)
library(p,character.only = TRUE)
}
library(haven)
mortality <- read_dta("C:/Users/{YOUR.USER.NAME}/Desktop/mortality.dta")
1. If data is successfully loaded into R, you can see the dataname, number of variables, number of observations on the upper right side of the Rstudio program:

1. You may go ahead and explore the dataset as you wish by clicking on the name of the data, or by commands such as summary() or etc. to see more details of the data.

library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## Loading required package: nnet
with(mortality, tabpct(mfgrp, died, graph = FALSE))
##
## Original table
##        died
## mfgrp       0     1  Total
##   0      1274    29   1303
##   1      1609    62   1671
##   2       996    33   1029
##   3       193     9    202
##   Total  4072   133   4205
##
## Row percent
##      died
## mfgrp       0      1  Total
##     0    1274     29   1303
##        (97.8)  (2.2)  (100)
##     1    1609     62   1671
##        (96.3)  (3.7)  (100)
##     2     996     33   1029
##        (96.8)  (3.2)  (100)
##     3     193      9    202
##        (95.5)  (4.5)  (100)
##
## Column percent
##        died
## mfgrp       0       %    1       %
##   0      1274  (31.3)   29  (21.8)
##   1      1609  (39.5)   62  (46.6)
##   2       996  (24.5)   33  (24.8)
##   3       193   (4.7)    9   (6.8)
##   Total  4072   (100)  133   (100)
1. The outcome variable is died, which is coded as 1 for an individual who has died and 0 for an individual who is alive after 3 years of follow-up. Analyse this data set treating it as a cohort study with fixed follow-up time (ie., assume that all individuals were followed for the same length of time and analyse the data using odds).

2. To use logistic regression to examine the association between died and vimp, type

vimp_logm0 <- glm(died ~ vimp, data = mortality,
summary(vimp_logm0)
##
## Call:
## glm(formula = died ~ vimp, family = binomial(link = "logit"),
##     data = mortality)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.5108  -0.2224  -0.2224  -0.2224   2.7247
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.6873     0.1028 -35.870   <2e-16 ***
## vimp          1.7167     0.1976   8.687   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1213.8  on 4297  degrees of freedom
## Residual deviance: 1154.7  on 4296  degrees of freedom
## AIC: 1158.7
##
## Number of Fisher Scoring iterations: 6

Check that the the output corresponds to that in the slides.

Logistic regression estimates are derived by starting with a guess at the parameter estimates, then using the result to compute a better guess (nearer to the maximum likelihood estimates). This is known as iteration. The log likelihood for each iteration is shown. The procedure stops when there is no further increase in the log likelihood.

1. So far, all the output has been on the log scale. This was needed in order to explain how confidence intervals and Wald tests (z-tests) are derived. In fact, R allows us to derive estimates on the odds ratio scale, which is much more convenient for reporting results. Type:
library(epiDisplay)
logistic.display(vimp_logm0, decimal = 3)
##
## Logistic regression predicting died
##
##               OR(95%CI)            P(Wald's test) P(LR-test)
## vimp: 1 vs 0  5.566 (3.779,8.199)  < 0.001        < 0.001
##
## Log-likelihood = -577.36596
## No. of observations = 4298
## AIC value = 1158.73192

The results are now shown as odds ratios. Note that:

• The baseline term (intercept) represents the odds of outcome in the baseline group of the variable – so in this example it is the odds of death amongst those visually unimpaired.
• The confidence interval is also derived using the standard error for the log odds ratio, as shown in the slides. Hence, the confidence interval is correct.
1. Example of how results may be summarised in a sentence.

“Visual impairment is strongly associated with risk of death (P<0.0001). Those visually impaired have nearly 6 times the odds of death compared to those with unimpaired vision (OR 5.57 95% CI: 3.78-8.20).”

1. When tabulating variables for a cohort (or cross-sectional) study, use row percentages if the explanatory variable is the row variable, column percentages if it is the column variable. For example:
with(mortality, tabpct(mfgrp, died,
percent = "row",
graph = FALSE))
##
## Row percent
##      died
## mfgrp       0      1  Total
##     0    1274     29   1303
##        (97.8)  (2.2)  (100)
##     1    1609     62   1671
##        (96.3)  (3.7)  (100)
##     2     996     33   1029
##        (96.8)  (3.2)  (100)
##     3     193      9    202
##        (95.5)  (4.5)  (100)

We can then describe the risk of death over a 3 year period for those with no microfilarial infection (2.2%), <10 microfilarial load/mg infection (3.7%), etc.

1. As explained in the slides, we need to use indicator variables to examine the association between microfilarial load (4 levels) and death. We need to recode the varibales, type
library(tidyverse)
mortality <- mortality %>%
mutate(mfgrp = factor(mfgrp)) %>%
mutate(mfgrp = fct_recode(mfgrp,
"Uninfected" = "0",
"< 10"       = "1",
"10-49"      = "2",
"50+"        = "3"))

vimp_logm1 <- glm(died ~ mfgrp, data = mortality,
summary(vimp_logm1)
##
## Call:
## glm(formula = died ~ mfgrp, family = binomial(link = "logit"),
##     data = mortality)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -0.3019  -0.2750  -0.2553  -0.2122   2.7587
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -3.7826     0.1878 -20.143   <2e-16 ***
## mfgrp< 10     0.5264     0.2281   2.308   0.0210 *
## mfgrp10-49    0.3754     0.2580   1.455   0.1457
## mfgrp50+      0.7172     0.3893   1.842   0.0655 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 1180.4  on 4204  degrees of freedom
## Residual deviance: 1173.7  on 4201  degrees of freedom
##   (93 observations deleted due to missingness)
## AIC: 1181.7
##
## Number of Fisher Scoring iterations: 6
1. To get the output on the odds ratio scale, type:
logistic.display(vimp_logm1, decimal = 3)
##
## Logistic regression predicting died
##
##                        OR(95%CI)            P(Wald's test) P(LR-test)
## mfgrp: ref.=Uninfected                                     0.0822
##    < 10                1.693 (1.083,2.647)  0.021
##    10-49               1.456 (0.878,2.414)  0.1457
##    50+                 2.049 (0.955,4.394)  0.0655
##
## Log-likelihood = -586.86507
## No. of observations = 4205
## AIC value = 1181.73014

Note that there are three odds ratios each of which refers to the same baseline group (those uninfected). The odds ratio is - 1.69 for those with microfilarial load <10 mf/mg (coded as 1) compared to those uninfected (coded as 0) - 1.46 for those microfilarial load 10-49 mf/mg (coded as 2) compared to those uninfected (0) and - 2.05 for those microfilarial load ≥50 mf/mg (coded as 3) compared to those uninfected (0).

1. Write 2-3 sentences describing the relationship between microfilarial infection and death

There is weak evidence of an association between microfilarial infection (measured on 4 levels) and death (p=0.08). The risk of death among those uninfected is 2.2% which is lower than those infected. The odd ratios for those with <10, 10-49 and 50+ microfilarial /mg compared with being infected were 1.69 (95% CI: 1.08-2.64), 1.46 (0.88-2.41) and 2.05 (0.96-4.39), respectively.

1. Logistic regression model with agegrp as an explanatory variable:
mortality <- mortality %>%
mutate(agegrp = factor(agegrp)) %>%
mutate(agegrp = fct_recode(agegrp,
"15-34" = "0",
"35-54"       = "1",
"55-64"      = "2",
"65+"        = "3"))
vimp_logm2 <- glm(died ~ vimp + agegrp,
data = mortality,
logistic.display(vimp_logm2, decimal = 3)
##
## Logistic regression predicting died
##
## vimp: 1 vs 0       5.566 (3.779,8.199)    2.202 (1.409,3.443)
##
## agegrp: ref.=15-34
##    35-54           2.578 (1.633,4.07)     2.355 (1.484,3.737)
##    55-64           6.934 (4.058,11.847)   5.415 (3.083,9.513)
##    65+             14.802 (8.809,24.872)  9.901 (5.541,17.692)
##
##                    P(Wald's test) P(LR-test)
## vimp: 1 vs 0       < 0.001        < 0.001
##
## agegrp: ref.=15-34                < 0.001
##    35-54           < 0.001
##    55-64           < 0.001
##    65+             < 0.001
##
## Log-likelihood = -545.8609
## No. of observations = 4298
## AIC value = 1101.7218

There is strong evidence against the null hypothesis of no association of age with death (p<0.001). The odds of death increase with increasing age.