Logistic Regression with Age as the Exposure

On SARI data from the Brazilian National Surveillance System

Otavio Ranzani https://github.com/oranzani (ISGlobal) https://www.isglobal.org
2021-05-11

1 Background

Over the course of 2020 the world saw an unprecedented, severe pandemic caused by the novel coronavirus SARS-CoV-2, which challenged each and every surveillance system to produce and curate data about the pandemic. Brazil has one of the biggest and most well-established public health systems in the world, in Portuguese the Sistema Único de Saúde (SUS).

Most of the data about severe acute respiratory illness (SARI) cases due to SARS-CoV-2 comes from the national influenza surveillance system, SIVEP-Gripe, which holds the records of all hospitalizations with SARI. We compiled these data and ran logistic regression models, aiming to see how age is associated with mortality.

Show code
# Load the latest extraction of the SIVEP-Gripe mortality cut
dados <- read_csv(paste0("../CSV/sivep_mortality_cut_", last.date, ".csv"))
dados <- dados %>%
    # keep plausible ages and drop records with unreported race/ethnicity
    filter(idade >= 0 & idade < 110 & !is.na(idade) & raca != "Não Informado") %>%
    # outcome: 0 = discharged (evolucao == 1), 1 = death (evolucao == 2)
    mutate(outcome = case_when(evolucao == 1 ~ 0, evolucao == 2 ~ 1))
paged_table(head(dados, 10))

For this we select only age, on a continuous scale, and the outcome, identified in the data as idade and evolucao, respectively.

2 Logistic Regression

2.1 Basics on Logistic Regression

When we have a binary outcome, as is the case for a hospitalization with SARI, being discharged or dying, we can model it with a logistic regression. The setup is a categorical outcome variable modeled as a response to an explanatory variable, age, which we first treat as continuous.

In equation we have:

\[ \log\left(\frac{P(\text{Death})}{P(\text{Discharged})}\right) = \beta_0 + \beta_1 \text{Age} \]
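
To make the link concrete, here is a minimal sketch, in plain R with made-up coefficient values (no model fitted yet), of how a log-odds value maps back to a probability:

Show code
# Hypothetical coefficients, for illustration only (not fitted values)
beta0 <- -3.1
beta1 <- 0.04
age <- c(20, 50, 80)
log_odds <- beta0 + beta1 * age  # linear predictor, on the log-odds scale
cbind(age, log_odds, odds = exp(log_odds), p_death = plogis(log_odds))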

2.2 Age as Continuous

Our first approach is to use age as it comes from the database, on a continuous scale with 1-year units. We just set up the variables the way glm expects them.

Show code
# The logit (log-odds) function, for reference; note that in the glm calls
# below, binomial(link = logit) resolves to the built-in logit link
logit <- function(x) log(x / (1 - x))

# Recode the outcome (0 = discharged, 1 = death) and rename idade to age
mortality_age_c <- dados %>%
    mutate(outcome = case_when(evolucao == 1 ~ 0, evolucao == 2 ~ 1), age = idade) %>%
    dplyr::select(outcome, age)
paged_table(mortality_age_c)

Now we can fit the model; the idea is to use age to explain the outcome of a hospitalization. The model is as follows:

Show code
model0 <- glm(outcome ~ age, data = mortality_age_c, family = binomial(link = logit))
summary(model0)

Call:
glm(formula = outcome ~ age, family = binomial(link = logit), 
    data = mortality_age_c)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7191  -0.9124  -0.6230   1.1526   2.5231  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.140757   0.008257  -380.4   <2e-16 ***
age          0.039994   0.000125   320.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1379327  on 1086794  degrees of freedom
Residual deviance: 1249571  on 1086793  degrees of freedom
AIC: 1249575

Number of Fisher Scoring iterations: 4

To interpret this model we look at the Estimate column, which says how the outcome responds to each unit of age. The value 0.04 is on the log-odds scale: it is the amount by which a one-unit increase in age raises the log odds of the outcome. To interpret it more easily, we exponentiate it.

Show code
# Take the coefs of the model
coef_m0 <- coef(model0)
coef_m0 <- exp(coef_m0)
coef_m0
(Intercept)         age 
 0.04325005  1.04080452 

So, for each one-year increase in age, we expect a 4.1% increase in the odds of dying during a hospitalization with SARI.
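
A point estimate alone hides its uncertainty; as a quick sketch, Wald 95% confidence intervals for the odds ratio can be attached to the model0 fit above:

Show code
# Wald 95% confidence intervals, exponentiated to the odds-ratio scale
round(exp(cbind(OR = coef(model0), confint.default(model0))), 4)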

2.3 Age in 10 years units, continuous

Now we group age into 10-year units, while still keeping the variable continuous. We start by setting up the data.

Show code
# Recode the outcome (0 = discharged, 1 = death) and rescale age to 10-year
# units; the +1 offset is undone later, before plotting
mortality_age_10c <- dados %>%
    mutate(outcome = case_when(evolucao == 1 ~ 0, evolucao == 2 ~ 1), age = (idade + 
        1)/10) %>%
    dplyr::select(outcome, age)
paged_table(mortality_age_10c)

The model:

Show code
model1 <- glm(outcome ~ age, data = mortality_age_10c, family = binomial(link = logit))
summary(model1)

Call:
glm(formula = outcome ~ age, family = binomial(link = logit), 
    data = mortality_age_10c)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7191  -0.9124  -0.6230   1.1526   2.5231  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.180751   0.008378  -379.7   <2e-16 ***
age          0.399940   0.001250   320.0   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1379327  on 1086794  degrees of freedom
Residual deviance: 1249571  on 1086793  degrees of freedom
AIC: 1249575

Number of Fisher Scoring iterations: 4

Again, Estimate gives the log-odds increase per unit of age. Unlike in Section 2.2, because of the transformation that put age in units of 10 years, the estimate is now 10 times the previous one.

Show code
# Take the coefs of the model
coef_m1 <- coef(model1)
coef_m1 <- exp(coef_m1)
coef_m1
(Intercept)         age 
 0.04155444  1.49173502 

We exponentiate the coefficients for easier interpretation: now we see a 49.2% increase in the odds of dying during a hospitalization with SARI per 10 years of age.
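
The two fits are consistent with each other: the 10-year odds ratio is just the 1-year odds ratio raised to the tenth power, as a quick check shows:

Show code
# The 10-year OR equals the 1-year OR to the 10th power
exp(coef(model0)["age"])^10  # 1.0408^10, approx. 1.4917
exp(coef(model1)["age"])     # approx. 1.4917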

2.4 Age in 10 years bins, categorical

Now we set up the data in 10-year bins: someone who is 12 years old, for example, is classified in the 10 to 19 years category, and so on.

Show code
labels = c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", 
    "70-79", "80-89", "90-99", "100-109", "110+")
breaks = c(0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99, 109, 150)

# Recode the outcome (0 = discharged, 1 = death) and bin age into 10-year
# categories; note that with right = TRUE the first interval is (0, 9], so
# an exact age of 0 becomes NA unless include.lowest = TRUE is added
mortality_age_10d <- dados %>%
    mutate(outcome = case_when(evolucao == 1 ~ 0, evolucao == 2 ~ 1), age = cut(idade, 
        breaks = breaks, labels = labels, right = T)) %>%
    dplyr::select(outcome, age)
paged_table(mortality_age_10d)

Before we proceed, let's take a look at the new variable; we tabulate it against the outcome to understand how it is distributed:

Show code
table(mortality_age_10d$outcome, mortality_age_10d$age)
   
       0-9  10-19  20-29  30-39  40-49  50-59  60-69  70-79  80-89
  0  48220  17191  37329  78285 107147 129608 131740 101695  59555
  1   1641   1247   4551  13157  26381  49155  83146  91278  67596
   
     90-99 100-109   110+
  0  14313     601      0
  1  19974     980      0

As we can see, more people are discharged from the hospital than die, but this inverts in the oldest bins, suggesting that the association between age and the outcome strengthens as age increases.
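
To see the inversion directly, we can look at proportions within each bin rather than raw counts; a small sketch over the same table:

Show code
# Column proportions: within each age bin, share discharged (0) and died (1)
round(prop.table(table(mortality_age_10d$outcome, mortality_age_10d$age),
                 margin = 2), 3)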

Let's fit the model:

Show code
model2 <- glm(outcome ~ age, data = mortality_age_10d, family = binomial(link = logit))
summary(model2)

Call:
glm(formula = outcome ~ age, family = binomial(link = logit), 
    data = mortality_age_10d)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.3908  -0.9892  -0.6635   1.1241   2.6130  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.38047    0.02508 -134.78   <2e-16 ***
age10-19     0.75682    0.03859   19.61   <2e-16 ***
age20-29     1.27604    0.02959   43.12   <2e-16 ***
age30-39     1.59706    0.02679   59.61   <2e-16 ***
age40-49     1.97891    0.02601   76.09   <2e-16 ***
age50-59     2.41093    0.02563   94.05   <2e-16 ***
age60-69     2.92023    0.02547  114.66   <2e-16 ***
age70-79     3.27240    0.02549  128.37   <2e-16 ***
age80-89     3.50712    0.02570  136.45   <2e-16 ***
age90-99     3.71373    0.02737  135.70   <2e-16 ***
age100-109   3.86942    0.05756   67.22   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1377490  on 1084789  degrees of freedom
Residual deviance: 1248912  on 1084779  degrees of freedom
  (2005 observations deleted due to missingness)
AIC: 1248934

Number of Fisher Scoring iterations: 5

Now we have more coefficients, one for each age bin. A first look suggests that the older you are, the greater the odds of dying when hospitalized with SARI, with a different multiple for each bin: the "10-19" bin has only slightly higher odds of dying than the reference bin, "0-9", while every older bin has odds several times greater than the reference.

To interpret them more easily, we exponentiate the coefficients to the odds-ratio scale, which we can then read as percent changes:

Show code
# Take the coefs of the model
coef_m2 <- coef(model2)
coef_m2 <- exp(coef_m2)
paged_table(data.frame(coef_m2))

Now we see the increase in odds per age bin: being in the "10-19" bin carries 113.1% higher odds of dying than the "0-9" bin, while the "50-59" bin carries odds 1014.4% greater. The main take-home message of this model is that age plays a non-linear role in determining the outcome of a hospitalization.
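
The percentages quoted above come straight from the exponentiated coefficients; a one-line sketch:

Show code
# Percent change in the odds of death relative to the "0-9" reference bin
round((exp(coef(model2)[-1]) - 1) * 100, 1)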

2.5 Age as Continuous, Quadratic Term

Now we test methods that can capture possible non-linearity on the age scale. A first approach is to add a quadratic term in age, using the data from Section 2.2.

Show code
model3 <- glm(outcome ~ age + I(age^2), data = mortality_age_c, family = binomial(link = logit))
coef_m3 <- coef(model3)
coef_m3 <- exp(coef_m3)
paged_table(data.frame(coef_m3))

After adding the quadratic term we give the fit one more degree of freedom, accounting for both age and its square. The squared term contributes close to a 0% change in the odds per unit (an odds ratio of about 1), while the linear term now corresponds to about a 6% increase per year, against the roughly 4% per year we saw in Section 2.2.
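
Note that under the quadratic model the per-year odds ratio is no longer constant: moving from age \(a\) to \(a + 1\) changes the log odds by \(\beta_1 + \beta_2(2a + 1)\). A small sketch of how this plays out at a few ages, using the model3 coefficients:

Show code
# Per-year odds ratio at a given age under the quadratic model
b <- coef(model3)
or_at_age <- function(a) exp(b["age"] + b["I(age^2)"] * (2 * a + 1))
sapply(c(20, 40, 60, 80), or_at_age)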

2.6 Age as Continuous, Restricted Cubic Spline (RCS)

We can use even more flexible terms. Here we propose a restricted cubic spline (RCS), which transforms the data in a way that can capture the non-linearity.

Show code
# rcs() comes from the rms package
model4 <- glm(outcome ~ rcs(age), data = mortality_age_c, family = binomial(link = logit))
coef_m4 <- coef(model4)
coef_m4 <- exp(coef_m4)
paged_table(data.frame(coef_m4))

The RCS gives the fit even more degrees of freedom, and these can be controlled too. Now the linear term corresponds to a 4.4% change in the odds per year of age, and the remaining coefficients (0.1%, 2.6%, and -24.6% on the same scale) belong to the non-linear spline basis terms, which do not have a direct one-at-a-time interpretation.
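
The flexibility is governed by the number of knots: rcs(age) used the rms default here (five knots, judging by the five parameters in the AIC table of Section 3), but we can request a different number; a sketch, with model4b a name introduced only for this comparison:

Show code
# A restricted cubic spline with 4 knots instead of the default
model4b <- glm(outcome ~ rcs(age, 4), data = mortality_age_c,
               family = binomial(link = logit))
AIC(model4, model4b)  # trade flexibility against fit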

2.7 Age as Continuous, Polynomials

Another way to add degrees of freedom is to enter the explanatory variable through a polynomial transformation, handing glm explicit power terms of age as predictors.

Show code
model5 <- glm(outcome ~ poly(age, degree = 2, raw = T), data = mortality_age_c, 
    family = binomial(link = logit))
coef_m5 <- exp(coef(model5))

model6 <- glm(outcome ~ poly(age, degree = 3, raw = T), data = mortality_age_c, 
    family = binomial(link = logit))
coef_m6 <- exp(coef(model6))

paged_table(data.frame(coef_m5))
Show code
paged_table(data.frame(coef_m6))

The coefficients of a polynomial are a little harder to interpret than the spline ones, but they give almost the same results. In fact, with raw = TRUE, the degree-2 polynomial is exactly the model of Section 2.5 written in another form.
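
A quick check of that equivalence on the fitted objects:

Show code
# age + I(age^2) and poly(age, 2, raw = TRUE) fit the same model
all.equal(unname(coef(model3)), unname(coef(model5)))
c(AIC(model3), AIC(model5))  # identical (see the AIC table in Section 3)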

The degree-3 polynomial has a similar output, behaving much like the RCS of Section 2.6, only with a smaller effect in the third-degree term.

2.8 Plots

Now we have a set of models to explain the data, and some speculation about them; let's see how well each one fits.

Show code
# Helper: predictions on the response (probability) scale with approximate
# 95% bands (fit +/- 2 standard errors), collapsed to one row per (outcome, age)
df_pred_model <- function(x, df_origin) {
    pred_m <- x %>%
        augment(newdata = df_origin, type.predict = "response", se_fit = T)
    pred_m %>%
        mutate(fit = .fitted, upr = .fitted + 2 * .se.fit, 
            lwr = .fitted - 2 * .se.fit, se.fit = .se.fit) %>%
        group_by(outcome, age) %>%
        summarise(fit.exp = exp(first(fit)), upr.exp = exp(first(upr)), 
            lwr.exp = exp(first(lwr)), se.fit.exp = exp(first(se.fit)), 
            fit = first(fit), upr = first(upr), lwr = first(lwr), se.fit = first(se.fit))
}

m0 <- df_pred_model(model0, mortality_age_c)
m1 <- df_pred_model(model1, mortality_age_10c)
m1$age <- (m1$age * 10) - 1  # undo the (idade + 1)/10 rescaling, back to years
m2 <- df_pred_model(model2, mortality_age_10d)
m3 <- df_pred_model(model3, mortality_age_c)
m4 <- df_pred_model(model4, mortality_age_c)
m5 <- df_pred_model(model5, mortality_age_c)
m6 <- df_pred_model(model6, mortality_age_c)

# Plot helper: one panel per model, predicted probability of death by age

plot_odds <- function(m, num_m) {
    p <- m %>%
        filter(!is.na(age)) %>%
        ggplot(aes(x = age, y = fit)) + geom_line() + geom_point(alpha = 0.15, 
        shape = 1) + geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 0.15) + 
        theme_minimal() + labs(x = "Age", y = "Predicted probability", 
        title = paste0("Model ", num_m)) + theme(legend.position = "bottom", 
        panel.spacing = unit(0.01, "lines"), axis.text.x = element_text(size = 8, 
        angle = 90), axis.text.y = element_text(size = 8), plot.title = element_text(size = 14))
    return(p)
}

p0 <- plot_odds(m0, 0)
p1 <- plot_odds(m1, 1)
p2 <- plot_odds(m2, 2)
p3 <- plot_odds(m3, 3)
p4 <- plot_odds(m4, 4)
p5 <- plot_odds(m5, 5)
p6 <- plot_odds(m6, 6)

arrange_odds <- ggarrange(p0, p1, p2, p3, p4, p5, p6)
arrange_odds

3 Comparing Models

One way to compare models follows Occam's razor, using a criterion that tells us how well a model fits the data. This criterion is the Akaike Information Criterion (AIC): it measures how well a model fits the data while penalizing models that fit too well through over-parameterization. It is basically:

\[ AIC = 2k - 2\log(L) \] where \(k\) is the number of parameters of the model and \(L\) is its likelihood.
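
We can verify this formula directly from the fitted model's log-likelihood; a sketch with model0:

Show code
# AIC by hand: 2k - 2*log(L)
k <- attr(logLik(model0), "df")  # number of estimated parameters (2 here)
ll <- as.numeric(logLik(model0))
2 * k - 2 * ll                   # should match AIC(model0)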

Show code
AIC(model0, model1, model2, model3, model4, model5, model6)
       df     AIC
model0  2 1249575
model1  2 1249575
model2 11 1248934
model3  3 1248610
model4  5 1247887
model5  3 1248610
model6  4 1248183

We choose the model with the smallest AIC value. Another way to choose a model is the Bayesian Information Criterion (BIC), which is slightly different from AIC and has the formula:

\[ BIC = k\ln(n) - 2\ln(L) \]

As before, \(k\) and \(L\) are the number of parameters in the model and the likelihood, but now the penalty also involves the number of observations, \(n\). Again we choose the model with the smallest BIC value:
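
The same check works for BIC, where the penalty grows with \(n\); again a sketch with model0:

Show code
# BIC by hand: k*log(n) - 2*log(L)
k <- attr(logLik(model0), "df")
ll <- as.numeric(logLik(model0))
n <- nobs(model0)
k * log(n) - 2 * ll  # should match BIC(model0)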

Show code
BIC(model0, model1, model2, model3, model4, model5, model6)
       df     BIC
model0  2 1249598
model1  2 1249598
model2 11 1249065
model3  3 1248645
model4  5 1247947
model5  3 1248645
model6  4 1248230

The last thing we can run with the models is a likelihood ratio test (LRT), which compares models in terms of their likelihoods. We choose a baseline model and run the anova function on each pair, which outputs the deviance difference between the models; here we express every comparison relative to model0, with age in continuous form.

Show code
lrt_m0m1 <- anova(model0, model1, test = "LRT")
# model2 is fit on fewer rows (the binning creates NAs), so a direct anova()
# comparison against model0 is skipped
# lrt_m0m2 <- anova(model0, model2, test = "LRT")
lrt_m0m3 <- anova(model0, model3, test = "LRT")
lrt_m0m4 <- anova(model0, model4, test = "LRT")
lrt_m0m5 <- anova(model0, model5, test = "LRT")
lrt_m0m6 <- anova(model0, model6, test = "LRT")

anova_tab<-bind_rows(lrt_m0m1, 
                 # lrt_m0m2, 
                 lrt_m0m3, 
                 lrt_m0m4, 
                 lrt_m0m5, 
                 lrt_m0m6)
anova_tab
Analysis of Deviance Table

Model 1: outcome ~ age
Model 2: outcome ~ age
       Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1...1    1086793    1249571                          
2...2    1086793    1249571  0     0.00              
1...3    1086793    1249571                          
2...4    1086792    1248604  1   967.06 < 2.2e-16 ***
1...5    1086793    1249571                          
2...6    1086790    1247877  3  1693.46 < 2.2e-16 ***
1...7    1086793    1249571                          
2...8    1086792    1248604  1   967.06 < 2.2e-16 ***
1...9    1086793    1249571                          
2...10   1086791    1248175  2  1395.84 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
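
The Deviance and Pr(>Chi) columns can be reproduced by hand: the LRT statistic is the drop in deviance between the two fits, referred to a chi-squared distribution with the difference in residual degrees of freedom. A sketch for model0 versus model3:

Show code
# Manual LRT for model0 vs model3
d_stat <- deviance(model0) - deviance(model3)      # approx. 967.06
d_df <- df.residual(model0) - df.residual(model3)  # 1
pchisq(d_stat, df = d_df, lower.tail = FALSE)      # matches Pr(>Chi)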

After computing each LRT, we obtain the table above; a negative Deviance would mean the model explains the data less well than the baseline, here again model0. From anova_tab, the largest deviance reduction relative to model0 comes from model4, the RCS (the comparison with model1 is exactly zero, because rescaling age to 10-year units reparameterizes the same model). One caveat is the way anova decides which models it can compare; this differs from the lrtest() function of the rms package, which we can also use to produce LRTs. With lrtest() it is the comparison with model1 that cannot be computed, since the two fits are identical.

Show code
# model0 and model1 have identical likelihoods (age merely rescaled), so
# lrtest() cannot compare them
# lrtest_m0m1 <- lrtest(model0, model1)
lrtest_m0m2 <- lrtest(model0, model2)
lrtest_m0m3 <- lrtest(model0, model3)
lrtest_m0m4 <- lrtest(model0, model4)
lrtest_m0m5 <- lrtest(model0, model5)
lrtest_m0m6 <- lrtest(model0, model6)

lrtest_tab<-bind_rows(
  # lrtest_m0m1, 
  lrtest_m0m2$stats,
  lrtest_m0m3$stats, 
  lrtest_m0m4$stats, 
  lrtest_m0m5$stats, 
  lrtest_m0m6$stats)
lrtest_tab
# A tibble: 5 x 3
  `L.R. Chisq`  d.f.     P
         <dbl> <dbl> <dbl>
1        1179.     9     0
2         967.     1     0
3        1693.     3     0
4         967.     1     0
5        1396.     2     0