On SARI data from the Brazilian National Surveillance System
Over the course of 2020 the world saw an unprecedentedly severe pandemic caused by the novel coronavirus SARS-CoV-2, which challenged every surveillance system to produce and curate data about the pandemic. Brazil has one of the largest and most well-established public health systems, in Portuguese the Sistema Único de Saúde (SUS).
Most of the data on severe acute respiratory illness (SARI) cases due to SARS-CoV-2 comes from the national influenza surveillance system, SIVEP-Gripe, which records all hospitalizations with SARI. We compiled those data and fit logistic regression models, aiming to see how age is associated with mortality.
For this we selected only age, on a continuous scale, and outcome, identified in the database by idade and evolucao, respectively.
When we have a binary outcome, as is the case for the outcome of a hospitalization by SARI, being discharged or dying, we can model it with a logistic regression model. The interpretation is that we have a categorical outcome variable modelled as a response to an explanatory variable, age, which at first we keep continuous.
In equation form we have:
\[ \log\left(\frac{P(\text{Death})}{P(\text{Discharged})}\right) = \beta_0 + \beta_1 \, \text{Age} \]
Our first approach is to use age as it comes in the database, on a continuous scale with 1-year bins. We just set up the variables the way glm() expects them.
# Defining the logit (log-odds) function
logit <- function(x) {
  log(x) - log(1 - x)
}
# Recode the outcome: 0 = Discharged (evolucao == 1), 1 = Death (evolucao == 2)
mortality_age_c <- dados %>%
  mutate(outcome = case_when(evolucao == 1 ~ 0,
                             evolucao == 2 ~ 1),
         age = idade) %>%
  dplyr::select(outcome, age)
paged_table(mortality_age_c)
Now we can fit the model; the idea is to use age to explain the outcome of the hospitalization. The model is as follows:
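A minimal sketch of the fitting call, consistent with the Call shown below and assuming the object is named model0 (the name used later in the comparisons):
# Logistic regression of outcome on age in 1-year units
model0 <- glm(outcome ~ age, family = binomial(link = logit),
    data = mortality_age_c)
summary(model0)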
Call:
glm(formula = outcome ~ age, family = binomial(link = logit),
data = mortality_age_c)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7191 -0.9124 -0.6230 1.1526 2.5231
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.140757 0.008257 -380.4 <2e-16 ***
age 0.039994 0.000125 320.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1379327 on 1086794 degrees of freedom
Residual deviance: 1249571 on 1086793 degrees of freedom
AIC: 1249575
Number of Fisher Scoring iterations: 4
To interpret this model we look at the Estimate column, which says how the outcome responds to each unit of age. The value 0.04 is on the log-odds scale: it is the amount by which a one-unit increase in age raises the log odds of the outcome. To interpret it more easily we transform it by exponentiation.
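A sketch of the back-transformation, again assuming the fit is named model0:
# Back-transform the coefficients from log odds to odds ratios
exp(coef(model0))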
(Intercept) age
0.04325005 1.04080452
So, for each one-year increase in age, we expect a 4.1% increase in the odds of dying during a hospitalization by SARI.
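To make this concrete, the estimates above can be turned into a predicted probability at a given age; a worked example using the coefficients reported in the summary:
# Predicted probability of death at age 60, from the reported
# intercept (-3.1408) and slope (0.0400); plogis() is the inverse logit
plogis(-3.140757 + 0.039994 * 60)  # about 0.32, i.e. roughly 32%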
Now we rescale age into units of 10 years, while still keeping the variable continuous. We start by setting up the data.
# Recode the outcome as before, and rescale age into units of 10 years
mortality_age_10c <- dados %>%
  mutate(outcome = case_when(evolucao == 1 ~ 0,
                             evolucao == 2 ~ 1),
         age = (idade + 1) / 10) %>%
  dplyr::select(outcome, age)
paged_table(mortality_age_10c)
The model:
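A minimal sketch of the fit, matching the Call below and assuming the object is named model1:
# Same logistic regression, age now in 10-year units
model1 <- glm(outcome ~ age, family = binomial(link = logit),
    data = mortality_age_10c)
summary(model1)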
Call:
glm(formula = outcome ~ age, family = binomial(link = logit),
data = mortality_age_10c)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7191 -0.9124 -0.6230 1.1526 2.5231
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.180751 0.008378 -379.7 <2e-16 ***
age 0.399940 0.001250 320.0 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1379327 on 1086794 degrees of freedom
Residual deviance: 1249571 on 1086793 degrees of freedom
AIC: 1249575
Number of Fisher Scoring iterations: 4
Again, Estimate gives the log-odds increase per unit of age; differently from Section 2.2, because of the transformation that put age in units of 10 years, the estimate is now 10 times greater than the previous one.
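The odds ratios below presumably come from the same back-transformation:
# Odds ratios for the 10-year-unit model
exp(coef(model1))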
(Intercept) age
0.04155444 1.49173502
We transformed the coefficients for easier interpretation: now we see a 49.2% increase in the odds of dying during a hospitalization by SARI per 10 years of age.
Now we set up the data in 10-year bins: someone who is 12 years old is classified into the 10-19 category, and so on.
labels <- c("0-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69",
            "70-79", "80-89", "90-99", "100-109", "110+")
breaks <- c(0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99, 109, 150)
# Recode the outcome as before, and cut age into 10-year bins
mortality_age_10d <- dados %>%
  mutate(outcome = case_when(evolucao == 1 ~ 0,
                             evolucao == 2 ~ 1),
         age = cut(idade, breaks = breaks, labels = labels, right = TRUE)) %>%
  dplyr::select(outcome, age)
paged_table(mortality_age_10d)
Before we proceed, let's take a look at the new variable, tabulating it to understand how it is distributed:
table(mortality_age_10d$outcome, mortality_age_10d$age)
0-9 10-19 20-29 30-39 40-49 50-59 60-69 70-79 80-89
0 48220 17191 37329 78285 107147 129608 131740 101695 59555
1 1641 1247 4551 13157 26381 49155 83146 91278 67596
90-99 100-109 110+
0 14313 601 0
1 19974 980 0
As we can see, more people are discharged from the hospital than die, but this inverts in the last bins, suggesting that the association between age and the outcome grows stronger as age increases.
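One way to make the inversion explicit is to look at column-wise proportions of the same table; a sketch:
# Share of deaths (row "1") within each age bin
round(prop.table(table(mortality_age_10d$outcome,
    mortality_age_10d$age), margin = 2), 2)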
Let's fit the model:
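A minimal sketch of the fit, assuming the object is named model2 (the name used below when we exponentiate its coefficients):
# Logistic regression with binned age as a categorical predictor;
# "0-9" is the reference level
model2 <- glm(outcome ~ age, family = binomial(link = logit),
    data = mortality_age_10d)
summary(model2)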
Call:
glm(formula = outcome ~ age, family = binomial(link = logit),
data = mortality_age_10d)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.3908 -0.9892 -0.6635 1.1241 2.6130
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.38047 0.02508 -134.78 <2e-16 ***
age10-19 0.75682 0.03859 19.61 <2e-16 ***
age20-29 1.27604 0.02959 43.12 <2e-16 ***
age30-39 1.59706 0.02679 59.61 <2e-16 ***
age40-49 1.97891 0.02601 76.09 <2e-16 ***
age50-59 2.41093 0.02563 94.05 <2e-16 ***
age60-69 2.92023 0.02547 114.66 <2e-16 ***
age70-79 3.27240 0.02549 128.37 <2e-16 ***
age80-89 3.50712 0.02570 136.45 <2e-16 ***
age90-99 3.71373 0.02737 135.70 <2e-16 ***
age100-109 3.86942 0.05756 67.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1377490 on 1084789 degrees of freedom
Residual deviance: 1248912 on 1084779 degrees of freedom
(2005 observations deleted due to missingness)
AIC: 1248934
Number of Fisher Scoring iterations: 5
Now we have more coefficients, one per age bin. A first look suggests that the odds of dying when hospitalized by SARI increase by a different amount in each bin: being in the “10-19” bin carries slightly higher odds of dying than the reference bin, “0-9”, while every older bin has odds several times greater than the reference.
To interpret them more easily, we'll transform the coefficients to the percent scale:
# Take the coefs of the model
coef_m2 <- coef(model2)
coef_m2 <- exp(coef_m2)
paged_table(data.frame(coef_m2))
Now we see the percentage increase in the odds per age bin: being in the “10-19” bin carries 113.1% higher odds of dying than being in the “0-9” bin, while being in the “50-59” bin carries odds 1014.4% greater. The main take-home message of this model is that age plays a non-linear role in determining the outcome of a hospitalization.
Now we'll test methods to capture the gaps that a single linear term leaves in the age scale. A first approach is to add a quadratic term in age (a restricted cubic spline comes in Section 2.6); we reuse the data from Section 2.2.
model3 <- glm(outcome ~ age + I(age^2), data = mortality_age_c, family = binomial(link = logit))
coef_m3 <- coef(model3)
coef_m3 <- exp(coef_m3)
paged_table(data.frame(coef_m3))
By adding the quadratic term we give the fit one more degree of freedom, now accounting for both age and its squared effect. As we see, the squared term adds roughly 0% to the odds per unit, while, differently from Section 2.2 where the effect of each year was about 4%, the linear effect is now about 6% per year.
We can add even more complex terms. Here we propose a Restricted Cubic Spline (RCS), which transforms the predictor in a way that can encompass the non-linearity.
# rcs() comes from the rms package
model4 <- glm(outcome ~ rcs(age), data = mortality_age_c, family = binomial(link = logit))
coef_m4 <- coef(model4)
coef_m4 <- exp(coef_m4)
paged_table(data.frame(coef_m4))
The RCS gives even more degrees of freedom to the fit, and their number can be controlled. Now the effect of a one-year increase on the odds of dying is 4.4% through the linear term, followed by contributions of 0.1%, 2.6%, and -24.6% from the additional terms, which rms labels with primes (age', age'', age'''); despite the notation, these are spline basis functions rather than derivatives.
Another way to add degrees of freedom is to enter the predictor as a polynomial transformation, handing glm a richer function of age while keeping the individual effects explicit.
model5 <- glm(outcome ~ poly(age, degree = 2, raw = TRUE),
    data = mortality_age_c, family = binomial(link = logit))
coef_m5 <- exp(coef(model5))
model6 <- glm(outcome ~ poly(age, degree = 3, raw = TRUE),
    data = mortality_age_c, family = binomial(link = logit))
coef_m6 <- exp(coef(model6))
paged_table(data.frame(coef_m5))
paged_table(data.frame(coef_m6))
The coefficients of a polynomial are a little harder to interpret than those of a spline transform, but they give almost the same results. The degree-2 polynomial fit gives pretty much the same results as Section 2.5.
The degree-3 polynomial has a similar output; it behaves much like the RCS of Section 2.6, only with reduced effects in the third-degree term.
Now that we have a collection of models to explain the data, and some speculation about them, let's see how well each one fits.
# Predict from a fitted model over the original data and build a data
# frame with the fit and a ~95% band (fit ± 2*SE); augment() is from broom
df_pred_model <- function(x, df_origin) {
  pred_m <- x %>%
    augment(newdata = df_origin, type.predict = "response", se_fit = TRUE)
  df_m <- pred_m %>%
    mutate(fit = .fitted,
           upr = .fitted + 2 * .se.fit,
           lwr = .fitted - 2 * .se.fit,
           se.fit = .se.fit)
  df_m <- df_m %>%
    group_by(outcome, age) %>%
    summarise(fit.exp = exp(first(fit)), upr.exp = exp(first(upr)),
              lwr.exp = exp(first(lwr)), se.fit.exp = exp(first(.se.fit)),
              fit = first(fit), upr = first(upr), lwr = first(lwr),
              se.fit = first(.se.fit))
  return(df_m)
}
m0 <- df_pred_model(model0, mortality_age_c)
m1 <- df_pred_model(model1, mortality_age_10c)
m1$age <- (m1$age * 10) - 1
m2 <- df_pred_model(model2, mortality_age_10d)
m3 <- df_pred_model(model3, mortality_age_c)
m4 <- df_pred_model(model4, mortality_age_c)
m5 <- df_pred_model(model5, mortality_age_c)
m6 <- df_pred_model(model6, mortality_age_c)
# Plot helper: fitted odds curve with uncertainty band for one model
plot_odds <- function(m, num_m) {
  p <- m %>%
    filter(!is.na(age)) %>%
    ggplot(aes(x = age, y = fit)) +
    geom_line() +
    geom_point(alpha = 0.15, shape = 1) +
    geom_ribbon(aes(ymax = upr, ymin = lwr), alpha = 0.15) +
    theme_minimal() +
    labs(x = "Age", y = "Odds Ratio", title = paste0("Model ", num_m)) +
    theme(legend.position = "bottom",
          panel.spacing = unit(0.01, "lines"),
          axis.text.x = element_text(size = 8, angle = 90),
          axis.text.y = element_text(size = 8),
          plot.title = element_text(size = 14))
  return(p)
}
p0 <- plot_odds(m0, 0)
p1 <- plot_odds(m1, 1)
p2 <- plot_odds(m2, 2)
p3 <- plot_odds(m3, 3)
p4 <- plot_odds(m4, 4)
p5 <- plot_odds(m5, 5)
p6 <- plot_odds(m6, 6)
arrange_odds <- ggarrange(p0, p1, p2, p3, p4, p5, p6)  # ggarrange() from ggpubr
arrange_odds
One way to compare models, in the spirit of Occam's razor, is through a criterion that tells us how well a model fits the data: the Akaike Information Criterion (AIC). It measures goodness of fit but has the cleverness of penalizing models that fit the data too well through over-parameterization. It is basically this:
\[ AIC = 2k - 2\log(L) \] where \(k\) is the number of parameters of the model and \(L\) is its maximized likelihood.
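As a sanity check, the formula can be reproduced by hand from the log-likelihood R stores with each fit; a sketch using model0:
# Hand-rolled AIC for model0; should match the AIC() call below
ll <- logLik(model0)
2 * attr(ll, "df") - 2 * as.numeric(ll)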
AIC(model0, model1, model2, model3, model4, model5, model6)
df AIC
model0 2 1249575
model1 2 1249575
model2 11 1248934
model3 3 1248610
model4 5 1247887
model5 3 1248610
model6 4 1248183
We choose the model with the lowest AIC value. Another way to choose a model is the Bayesian Information Criterion (BIC); it is slightly different from AIC, with the formula:
\[ BIC = k\ln(n) - 2\ln(L) \]
As before, \(k\) and \(L\) are the number of parameters in the model and the likelihood, but the penalty now also depends on the number of observations, \(n\). Again we choose the model with the lowest BIC value:
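The same hand computation works for BIC; a sketch using model0:
# Hand-rolled BIC for model0; should match the BIC() call below
ll <- logLik(model0)
attr(ll, "df") * log(nobs(model0)) - 2 * as.numeric(ll)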
BIC(model0, model1, model2, model3, model4, model5, model6)
df BIC
model0 2 1249598
model1 2 1249598
model2 11 1249065
model3 3 1248645
model4 5 1247947
model5 3 1248645
model6 4 1248230
The last thing we can run with the models is a likelihood ratio test (LRT), which compares models in terms of their likelihoods. We choose a baseline model and run the anova() function against it, which outputs the test for each pair; here we express each comparison relative to model0, with age in continuous format.
lrt_m0m1 <- anova(model0, model1, test = "LRT")
# lrt_m0m2 <- anova(model0, model2, test = "LRT")
lrt_m0m3 <- anova(model0, model3, test = "LRT")
lrt_m0m4 <- anova(model0, model4, test = "LRT")
lrt_m0m5 <- anova(model0, model5, test = "LRT")
lrt_m0m6 <- anova(model0, model6, test = "LRT")
anova_tab <- bind_rows(lrt_m0m1,
                       # lrt_m0m2,
                       lrt_m0m3,
                       lrt_m0m4,
                       lrt_m0m5,
                       lrt_m0m6)
anova_tab
Analysis of Deviance Table
Model 1: outcome ~ age
Model 2: outcome ~ age
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1...1 1086793 1249571
2...2 1086793 1249571 0 0.00
1...3 1086793 1249571
2...4 1086792 1248604 1 967.06 < 2.2e-16 ***
1...5 1086793 1249571
2...6 1086790 1247877 3 1693.46 < 2.2e-16 ***
1...7 1086793 1249571
2...8 1086792 1248604 1 967.06 < 2.2e-16 ***
1...9 1086793 1249571
2...10 1086791 1248175 2 1395.84 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After computing the LRT for each pair of models we obtain the table above. If a value in the Deviance column were negative, it would say that model has less capacity to explain the data than the baseline, here again model0. As it is, model1 shows a deviance change of zero, since it is only a rescaling of model0, while the largest gain belongs to model4, the RCS fit. One problem here is that the way anova() computes the LRT determines which models it treats as nested, and this differs from the lrtest() function of the rms package: with lrtest() we can compare model2, but now it is model1 that cannot be compared.
# lrtest_m0m1 <- lrtest(model0, model1)
lrtest_m0m2 <- lrtest(model0, model2)
lrtest_m0m3 <- lrtest(model0, model3)
lrtest_m0m4 <- lrtest(model0, model4)
lrtest_m0m5 <- lrtest(model0, model5)
lrtest_m0m6 <- lrtest(model0, model6)
lrtest_tab <- bind_rows(
  # lrtest_m0m1$stats,
  lrtest_m0m2$stats,
  lrtest_m0m3$stats,
  lrtest_m0m4$stats,
  lrtest_m0m5$stats,
  lrtest_m0m6$stats)
lrtest_tab
# A tibble: 5 x 3
`L.R. Chisq` d.f. P
<dbl> <dbl> <dbl>
1 1179. 9 0
2 967. 1 0
3 1693. 3 0
4 967. 1 0
5 1396. 2 0