# Packages used throughout these notes
library(tidyverse); library(knitr); library(kableExtra); library(broom); library(cowplot);
library(rstan); library(tidybayes); library(scales); library(DT); library(plotly)
library(lubridate)
# Global ggplot theme: classic look, light background grid, no legends
theme_set(theme_classic(base_size = 12) +
            background_grid(color.major = "grey90",
                            color.minor = "grey95",
                            minor = "xy", major = "xy") +
            theme(legend.position = "none"))
Let \(E\) be the population of Iceland and \(I_{t}\) the number of infected in Iceland at time \(t\). Then the percentage infected can be calculated as
\[ P_{t} = \frac{I_{t}}{E}. \]
For example, there are now around 1800 diagnosed cases, and if we assume an Icelandic population of 364134, we get
\[ P = \frac{1800}{364134} \approx 0.005 = 0.5\%. \]
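As a quick check in R (just the arithmetic, nothing model-related):

1800 / 364134  # approximately 0.0049, i.e. about 0.5% of the population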
In the book and lectures, Richard has taught you about linear models where the outcome of interest is binomial (0 or 1). In that case we could model the percentage infected as a function of time (in days) with
\[ \log\left(\frac{P_{t}}{1 - P_{t}}\right) = \alpha + \beta \cdot t, \]
or
\[ \mathrm{logit}(P_t) = \alpha + \beta \cdot t \]
where \(\mathrm{inv\_logit}(\alpha)\) is a measure of how many have been infected in Iceland at time \(t = 0\), and \(\beta\) is a measure of how fast the number of cases grows. If we say that \(t=0\) on March \(1^{st}\), then \(t = 10\) on March \(11^{th}\) and so on. We can predict future values of \(P_t\) by transforming back from \(\alpha + \beta \cdot t\) to \(P_t\) with the familiar formula
\[ P_t = \frac{1}{1 + e^{-(\alpha + \beta\cdot t)}} \]
But this model is too simple, because if we let \(t\) become really large, then we would expect \(P_t\) to become almost one, meaning that the whole country would become infected.
# Inverse logit: maps the linear predictor back to a percentage
inv_logit <- function(x) 1 / (1 + exp(-x))

tibble(t = seq(0, 100),
       alpha = -5,
       beta = 0.2) %>%
  mutate(percent_infected = inv_logit(alpha + beta * t),
         date = ymd("2020-03-01") + t) %>%
  ggplot(aes(date, percent_infected)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(title = "Usual logistic regression formula is too simple",
       subtitle = "100% of Iceland must be infected according to this model",
       x = "Date", y = "% Infected")
In the case of COVID-19 infections we don’t know the maximum percentage of the population that will be infected, so we have another unknown parameter: the saturation percentage, \(S\), at which Iceland will reach its maximum number of infected. It is actually very easy to incorporate this into our model; we only have to change a \(1\) into an \(S\). Thus our model looks like
\[ \log\left(\frac{P_{t}}{S - P_{t}}\right) = \alpha + \beta \cdot t. \]
Now, \(S\) is the maximum percentage of the population that will become infected (in this wave / according to this model). As with the logistic regression functions above and from the book, we can transform back with the formula
\[ P_t = \frac{S}{1 + e^{-(\alpha + \beta \cdot t)}}, \]
the only change being the extra \(S\)!
# Logistic growth: like the inverse logit but saturates at S instead of 1
inv_logit <- function(x, S) S / (1 + exp(-x))

tibble(t = seq(0, 100),
       alpha = -5,
       beta = 0.2,
       S = 0.005) %>%
  mutate(percent_infected = inv_logit(alpha + beta * t, S),
         date = ymd("2020-03-01") + t) %>%
  ggplot(aes(date, percent_infected)) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(title = "Ordinary logistic growth",
       subtitle = "Adding the S gives us a general formula for growth",
       x = "Date", y = "% Infected")
Below is an app where the points are the observed numbers of new and total cases in Iceland, and the lines are predictions from a logistic growth curve. You’ll notice that the plots are called “f” and “dfdt”. Here, f is simply the total percentage infected as given by the formula above. We are also interested in the daily number of cases. Those familiar with calculus might remember that the derivative of a function tells you how much you expect the output to increase if you increase the input by a small amount. We therefore approximate the number of daily new cases with the derivative of the logistic growth function. We are lucky, because the derivative of this particular function is easy to calculate.
For \(f(t)\) being the logistic growth function
\[ f(t) = \frac{S}{1 + e^{-(\alpha + \beta \cdot t)}}, \]
the derivative, \(\frac{d}{dt}f(t)\) is
\[ \frac{d}{dt}f = \beta \cdot f \cdot (1 - \frac{f}{S}). \]
The \(1 - \frac{f}{S}\) part makes sure that, as we get closer to the maximum, \(S\), the growth rate slows down, but when \(f\) is very low the growth rate is almost exponential.
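If you want to convince yourself of the derivative formula, here is a small numerical check (a standalone sketch using the same \(\alpha\), \(\beta\) and \(S\) as in the plots above); the analytic derivative should match a central-difference approximation of \(f\):

# Numerical check: analytic derivative vs central differences
logistic_f <- function(t, alpha = -5, beta = 0.2, S = 0.005) {
  S / (1 + exp(-(alpha + beta * t)))
}
t0 <- 30
h <- 1e-6
f_t <- logistic_f(t0)
analytic <- 0.2 * f_t * (1 - f_t / 0.005)                        # beta * f * (1 - f/S)
numerical <- (logistic_f(t0 + h) - logistic_f(t0 - h)) / (2 * h) # central difference
c(analytic = analytic, numerical = numerical)                    # the two agree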
You can play around with the values of \(\alpha\), \(\beta\) and \(S\) to try to get a better fit.
knitr::include_app("https://bgautijonsson.shinyapps.io/Logistic_Growth_App/", height = "600px")
Thinking about derivatives can be a very good way to gain intuition into mathematical modeling. Over the last few months, a very popular estimate from infectious disease epidemiology has been \(R\), the reproduction number. The basic reproduction number, \(R_0\) (pronounced R-naught), is an estimate of how many people one infected person will infect on average if everyone else is susceptible to infection. Let’s imagine that people are only sick for one day and that \(R_0 = 2\). That means that on day one there is one infection, the next day there are 2, then 4 and so on. So every day the cumulative number of infections doubles. We can write this out using derivatives. If we say that \(I(t)\) is the number of infected at time \(t\), then the derivative of \(I(t)\) with respect to \(t\) is
\[ \frac{d}{dt}I(t) = I(t). \]
That is, going from day \(t\) to day \(t+1\), the function \(I(t)\) changes by adding itself to itself (\(I(t + 1) = I(t) + I(t)\)). Above we are describing a process through its derivative, or change (here with respect to time). This is called a differential equation. There is one mathematical function that has this property: the exponential function, \(e^t\):
\[ \frac{d}{dt}e^t = e^t. \]
It is a pretty marvelous property that the derivative of \(e^t\) is also equal to \(e^t\), and it is in fact one way to define it from first principles (https://en.wikipedia.org/wiki/Characterizations_of_the_exponential_function).
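We can verify this numerically with a central difference (an illustration only):

# The slope of e^t at t = 1 should equal e^1 itself
h <- 1e-6
(exp(1 + h) - exp(1 - h)) / (2 * h)  # approximately 2.718282
exp(1)                               # 2.718282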
# Plot the exponential function: it grows at a speed equal to its current value
tibble(t = seq(0, 5, length.out = 100),
       e_t = exp(t)) %>%
  ggplot(aes(t, e_t)) +
  geom_line() +
  xlab("t") +
  ylab(latex2exp::TeX("$e^t$"))
This is why everyone has been talking about this so-called “exponential growth”. If something changes at a speed proportional to its current value, then the growth is exponential. But we saw that the growth of COVID-19 was not exponential; it reached a plateau. What if we added something to the differential equation above that slows down the growth as it becomes bigger? We want the growth to be very fast in the beginning and then we want it to slow down as it grows. The simplest solution to this is the logistic growth model
\[ \frac{d}{dt}I(t) = I(t) \cdot (1 - \frac{I(t)}{\text{Max}}). \]
When \(I(t)\) is small, \((1 - \frac{I(t)}{\text{Max}})\) is almost 1 and so the growth is almost exponential. But as \(I(t)\) gets closer to \(\text{Max}\), we see that \((1 - \frac{I(t)}{\text{Max}})\) approaches 0, and since \(I(t) \cdot 0 = 0\) the growth will stop. So the logistic growth model is actually a pretty intuitive way of expanding on exponential growth such that the growth slows down over time.
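To see this behaviour in action, here is a minimal sketch that steps the differential equation forward in time with Euler’s method (the maximum, step size and starting value are arbitrary choices for illustration):

# Solve dI/dt = I * (1 - I / Max) with simple Euler steps
Max <- 1000
step <- 0.01
times <- seq(0, 15, by = step)
I <- numeric(length(times))
I[1] <- 1  # start with one infection
for (i in seq_along(times)[-1]) {
  dIdt <- I[i - 1] * (1 - I[i - 1] / Max)
  I[i] <- I[i - 1] + dIdt * step
}
tibble(t = times, I = I) %>%
  ggplot(aes(t, I)) +
  geom_line() +
  labs(title = "Logistic growth: almost exponential at first, then a plateau at Max")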
At the moment we are only using data from Iceland. As Richard talks about in the opening of Chapter 12, we want our robot (model) to remember past experiences from other countries, except instead of cafés we are now learning about the growth of COVID-19 cases. The pandemic is going on in many other countries and it would be a shame not to use any of that data to help in estimation for Icelandic predictions. Therefore we create a multilevel / hierarchical model that estimates \(\alpha\), \(\beta\) and \(S\) parameters for every country that has an active COVID-19 pandemic going on. Another good way of thinking about this problem is as if you were predicting how tall a child will become. It’s always better to have information about the growth curves of more children, even if the goal is just to predict the height of that one child.
First we need to define what “an active COVID-19 pandemic” means. In our case we made the simple definition: a country has an active pandemic once its total number of diagnosed cases has reached \(0.002\%\) of its population, and it enters the model when it has at least 11 days of data past that threshold.
This means that a country that just crossed the \(0.002\%\) threshold today won’t be used in our modeling for 11 days. We also did not use data from China or South Korea. This is because both countries had already finished their first wave (in China it’s also very hard to talk about a “first wave” since the country is so big and so many people live there; a wave might be reaching its maximum in one part while it is just starting in another).
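In code, the filtering might look roughly like the sketch below. This is just an illustration of the definition: it assumes a data frame d with one row per country and day and columns total_cases and pop, and the real preprocessing may differ.

# Hypothetical sketch: keep countries with an "active pandemic", i.e. at least
# 0.002% of the population diagnosed and at least 11 days of data past that threshold
active <- d %>%
  group_by(country) %>%
  filter(total_cases / pop >= 0.002 / 100) %>%
  filter(n() >= 11) %>%
  ungroup()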
For each country, the \(\alpha\) parameter can be interpreted as the logit of the percentage infected at the start of the pandemic, the way we defined it above. \(\alpha\) can be either positive or negative, so we simply give each country’s \(\alpha\) a normal prior distribution with parameters that we will learn from the data. For country \(i\) we therefore get
\[ \begin{aligned} \alpha_i &\sim \mathrm{Normal}(\mu_\alpha, \sigma^2_\alpha). \end{aligned} \]
Since the total number of infected cases can’t decrease we know that the \(\beta\) parameters must be greater than zero. We therefore use a LogNormal distribution as the hierarchical priors for the \(\beta\) parameters. In country \(i\) we then get
\[ \begin{aligned} \beta_i &\sim \mathrm{LogNormal}(\mu_\beta, \sigma^2_\beta). \end{aligned} \]
We know very little about the possible values of \(\alpha\) and \(\beta\), so we give the mean parameters pretty open-ended priors. The \(\mu\) parameters are given vaguely informative prior distributions
\[ \begin{aligned} \mu_\alpha &\sim \mathrm{Normal}(0 , 3^2) \\ \mu_\beta &\sim \mathrm{Normal}(-2, 1^2) \end{aligned} \]
The standard deviations, \(\sigma_\alpha\) and \(\sigma_\beta\), are given \(\mathrm{Exponential}\) prior distributions
\[ \begin{aligned} \sigma_\alpha &\sim \mathrm{Exponential}(0.5) \\ \sigma_\beta &\sim \mathrm{Exponential}(1). \end{aligned} \]
The \(S_i\) parameters take on values in \((0, 1)\), so we thought it proper to model them with Beta distributions. By putting hierarchical priors on them we could also share information between countries on the estimated saturation points. Thus the saturation parameter for country \(i\) is sampled as
\[ S_i \sim \mathrm{Beta}(a_S, b_S). \]
We parametrise in terms of the mean, \(\mu_S\), and prior sample size, \(\kappa_S\),
\[ \begin{aligned} \mu_S &\sim \mathrm{Beta}(1, 99) \\ \kappa_S &\sim \mathrm{Exponential}(0.01). \end{aligned} \]
The prior for \(\mu_S\) is deliberately pessimistic, with a mean saturation of \(1\%\). We can then back-transform into \(a_S\) and \(b_S\) as
\[ \begin{aligned} a_S &= \mu_S \cdot \kappa_S \\ b_S &= (1 - \mu_S) \cdot \kappa_S \end{aligned} \]
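To get a feel for what this prior implies about \(S\), we can simulate from it (a quick prior simulation, separate from the fitted model):

# Draw hyperparameters, back-transform to a_S and b_S, then draw S
set.seed(1)
mu_S <- rbeta(1e4, 1, 99)
kappa_S <- rexp(1e4, 0.01)
a_S <- mu_S * kappa_S
b_S <- (1 - mu_S) * kappa_S
S <- rbeta(1e4, a_S, b_S)
quantile(S, c(0.025, 0.5, 0.975))  # most of the prior mass is at low saturations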
The model uses a negative binomial likelihood which is similar to the Poisson but allows for more flexible variance. If we use a Poisson likelihood the variance is forced to always be equal to the mean. In a negative binomial model the variance can be greater than the mean. We parametrise the negative binomial likelihood in the form of mean and overdispersion. If we write
\[ D_{i, t} \sim \mathrm{NegBin}(\mu_{i, t}, \phi_i), \] then the mean and variance of a negative binomial outcome are
\[ \mathrm{E}[D_{i, t}] = \mu_{i, t} \qquad \mathrm{Var}[D_{i, t}] = \mu_{i, t} + \phi_i \cdot \mu_{i, t}^2. \]
So \(\phi_i\) is the amount of extra variance in country \(i\). We put a hierarchical half-normal prior on \(\sqrt{\phi_i}\) as follows:
\[ \begin{aligned} \sqrt{\phi_i} &\sim \mathrm{Half-Normal}(0, \sigma_\phi) \\ \sigma_\phi &\sim \mathrm{Half-Normal}(0, 1) \end{aligned} \]
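A small simulation shows the overdispersion in action. One caveat when translating the formula to R: rnbinom() is parametrised with size, where \(\mathrm{Var} = \mu + \mu^2/\text{size}\), so size plays the role of \(1/\phi\) here.

# Poisson vs negative binomial draws with the same mean
set.seed(1)
mu <- 20
phi <- 0.5
var(rpois(1e5, mu))                        # close to mu = 20
var(rnbinom(1e5, mu = mu, size = 1 / phi)) # close to mu + phi * mu^2 = 220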
Below are estimated parameters from a model that was fitted using data up to March \(24^{th}\).
d <- read_csv("Data/COVID_Data_2020-03-24.csv")
m <- read_rds("Models/Model_2020-03-24.rds")
countries <- d %>% distinct(country, country_id)
which_iceland <- d %>% filter(country == "Iceland") %>% .$country_id %>% unique
n_countries <- max(d$country_id)
results <- tidyMCMC(m, conf.int = T, rhat = T, ess = T,
estimate.method = "median", conf.method = "quantile") %>%
mutate(par = str_match(term, "[a-zA-Z_2]+")) %>%
group_by(par) %>%
mutate(num = row_number() %>% as.numeric)
p <- results %>%
  ungroup %>%
  filter(par %in% c("beta", "alpha", "S", "phi_inv", "nu")) %>%
  mutate(par = str_replace(par, "phi_inv", "phi")) %>%
  inner_join(countries, by = c("num" = "country_id")) %>%
  mutate(par = str_to_title(par)) %>%
  mutate(plot_var = str_c(par, "_", country)) %>%
  ggplot(aes(country, estimate, ymin = conf.low, ymax = conf.high, col = country == "Iceland")) +
  geom_linerange() +
  geom_point() +
  coord_flip() +
  facet_wrap("par", scales = "free_x", ncol = 2) +
  scale_colour_manual(values = c("grey", "blue")) +
  labs(title = "Posterior medians and 95% PIs") +
  theme(axis.title = element_blank())
ggplotly(p)
Below are the hyperparameters that the model has learned and uses as priors to estimate each country’s specific coefficients. For example, we see that \(\mu_\alpha \approx -2.9\) and \(\sigma_\alpha \approx 2\), which means that each country’s \(\alpha\) parameter was estimated using a prior distribution of \(\alpha_i \sim \mathrm{Normal}(-2.9, 2^2)\). \(\mu_S \approx 0.003 = 0.3\%\) is the mean percentage of a country’s population that will be infected, with a 95% predictive interval of 0.16% to 0.75%.
results %>%
  ungroup %>%
  filter(par %in% c("mu_beta", "sigma_beta",
                    "mu_alpha", "sigma_alpha",
                    "mu_s", "kappa_s",
                    "sigma_phi_inv_sqrt")) %>%
  mutate(par = str_replace(par, "sigma_phi_inv_sqrt", "sigma_phi_sqrt")) %>%
  select(par, everything(), -num, -term, -std.error) %>%
  set_names(c("Parameter", "Median", "Lower", "Upper", "Rhat", "ESS")) %>%
  kable(digits = 4, align = c("l", "l", rep("c", ncol(.) - 2)),
        caption = "Table 3. Summary of posterior samples of hyperparameters") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("", "", "95% PI" = 2, "Convergence" = 2)) %>%
  column_spec(1, bold = T)
| Parameter | Median | Lower | Upper | Rhat | ESS |
|---|---|---|---|---|---|
| mu_beta | -1.9113 | -2.3553 | -1.6690 | 1.0083 | 614 |
| sigma_beta | 0.3928 | 0.1730 | 0.8423 | 1.0085 | 551 |
| mu_alpha | -2.8663 | -4.0690 | -1.7759 | 1.0023 | 682 |
| sigma_alpha | 2.0026 | 0.6149 | 3.1778 | 1.0083 | 377 |
| mu_s | 0.0029 | 0.0016 | 0.0075 | 1.0130 | 306 |
| kappa_s | 343.1988 | 106.2073 | 756.6843 | 1.0025 | 1791 |
| sigma_phi_sqrt | 0.5219 | 0.3822 | 0.7823 | 1.0008 | 1330 |
pars <- tidyMCMC(m, pars = c("mu_alpha", "sigma_alpha"))
out <- tidyMCMC(m, pars = "alpha")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = 1.4 * c(lower, upper)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("mu_beta", "sigma_beta"))
out <- tidyMCMC(m, pars = "log_beta")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = c(1.2 * lower, 0.6 * upper)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("a_s", "b_s"))
out <- tidyMCMC(m, pars = "S")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density)), bins = 40) +
geom_rug() +
stat_function(fun = dbeta, args = list(shape1 = pars[1, 2, drop = T], shape2 = pars[2, 2, drop = T]), lty = 2, n = 800) +
scale_x_continuous(limits = c(0, upper * 1.1)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
How do we make predictions with this model? First we need to sample the parameters from the posterior. We’ll do so in the code below.
add_daily_cases <- function(data, days = seq(0, 91), pop = 364134, ...) {
  # Expand to one row per posterior draw and day
  data <- data %>% expand_grid(days = days, pop = pop)
  alpha <- data$alpha
  beta <- data$beta
  S <- data$S
  phi <- data$phi
  # Logistic growth curve and its derivative (expected daily new cases as a proportion)
  linear <- alpha + beta * data$days
  f <- S / (1 + exp(-linear))
  dfdt <- beta * f * (1 - f / S)
  # Draw daily new case counts from the negative binomial likelihood
  data$new_cases <- rnbinom(nrow(data), mu = dfdt * pop, size = phi)
  data
}
# Draw 4000 posterior samples of Iceland's parameters
draws <- spread_draws(m,
                      alpha[country],
                      beta[country],
                      S[country],
                      phi[country],
                      n = 4000) %>%
  filter(country == which_iceland) %>%
  group_by(country) %>%
  mutate(iter = row_number()) %>%
  select(iter, alpha, beta, S, phi)
head(draws)
## # A tibble: 6 x 6
## # Groups: country [1]
## country iter alpha beta S phi
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 10 1 -4.08 0.150 0.00609 7.16
## 2 10 2 -4.93 0.176 0.00873 7.20
## 3 10 3 -4.16 0.167 0.00542 10.5
## 4 10 4 -4.06 0.170 0.00459 10.5
## 5 10 5 -4.82 0.165 0.00827 7.29
## 6 10 6 -3.70 0.162 0.00415 3.70
Now we have 4000 draws from the posterior distribution of Iceland’s estimated \(\alpha\), \(\beta\), \(S\) and \(\phi\) parameters. What we’re going to do is calculate posterior predictions for daily new cases and then sum them up to get total cases.
First we need to calculate the expected number of cases (as a percentage of the population) using \(\alpha\), \(\beta\) and \(S\), then multiply that number by the population of Iceland and sample from a negative binomial distribution with a sample of the overdispersion parameter from the posterior distribution.
\[ \begin{aligned} f(t) &= \frac{S}{1 + e^{-(\alpha + \beta \cdot t)}} \\ \frac{d}{dt} f &= \beta \cdot f \cdot (1 - \frac{f}{S}) \\ \text{New cases on day }t &\sim \mathrm{NegBinomial}(\frac{d}{dt}f \cdot 364134, \phi) \end{aligned} \]
The function add_daily_cases() will do that for us if we apply it to the data table containing our samples.
draws <- draws %>%
  add_daily_cases() %>%
  group_by(iter) %>%
  mutate(total_cases = cumsum(new_cases)) %>%
  ungroup
head(draws)
## # A tibble: 6 x 10
## country iter alpha beta S phi days pop new_cases total_cases
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 10 1 -4.08 0.150 0.00609 7.16 0 364134 0 0
## 2 10 1 -4.08 0.150 0.00609 7.16 1 364134 6 6
## 3 10 1 -4.08 0.150 0.00609 7.16 2 364134 2 8
## 4 10 1 -4.08 0.150 0.00609 7.16 3 364134 6 14
## 5 10 1 -4.08 0.150 0.00609 7.16 4 364134 7 21
## 6 10 1 -4.08 0.150 0.00609 7.16 5 364134 17 38
We now have 4000 draws from the posterior predictive distribution of new and total cases in Iceland, using data up to March \(24^{th}\). We’ll plot the median and 95% predictive intervals below.
iceland_d <- d %>% filter(country == "Iceland")
start_date <- min(iceland_d$date)
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_vline(xintercept = ymd("2020-03-24"), lty = 2) +
  facet_wrap("name", scales = "free")
How well do the predictions line up with real data? We’ll plot the number of new and total cases in Iceland up to June \(1^{st}\) below. The data are added as points.
plot_dat <- read_csv("https://docs.google.com/spreadsheets/d/1xgDhtejTtcyy6EN5dbDp5W3TeJhKFRRgm6Xk0s0YFeA/export?format=csv&gid=1788393542") %>%
  rename(new_deaths = "Dauðsföll", new_cases = Ny_Smit, total_cases = Smit_Samtals, total_deaths = Dauðsföll_Samtals) %>%
  mutate(date = ymd(Dagsetning),
         country = "Iceland",
         pop = 364134,
         continent = "Europe") %>%
  select(date, new_cases, total_cases) %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  filter(date <= ymd("2020-06-01"))
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_point(data = plot_dat, aes(y = value)) +
  geom_vline(xintercept = ymd("2020-03-24"), lty = 2) +
  facet_wrap("name", scales = "free")
That’s actually a pretty good fit, but we can always update the model with new data. Once we have developed a workflow like this with code, we can easily update the model every day by simply running the code again.
Below are estimated parameters from a model that was fitted using data up to and including April \(1^{st}\). The code is exactly the same, but we read the model and data from another file.
d <- read_csv("Data/COVID_Data_2020-04-01.csv")
m <- read_rds("Models/Model_2020-04-01.rds")
countries <- d %>% distinct(country, country_id)
which_iceland <- d %>% filter(country == "Iceland") %>% .$country_id %>% unique
n_countries <- max(d$country_id)
results <- tidyMCMC(m, conf.int = T, rhat = T, ess = T,
estimate.method = "median", conf.method = "quantile") %>%
mutate(par = str_match(term, "[a-zA-Z_2]+")) %>%
group_by(par) %>%
mutate(num = row_number() %>% as.numeric)
Since we use more than one country, we get estimates of \(\alpha\), \(\beta\), \(S\) and \(\phi\) for each country in our model.
p <- results %>%
  ungroup %>%
  filter(par %in% c("beta", "alpha", "S", "phi_inv", "nu")) %>%
  mutate(par = str_replace(par, "phi_inv", "phi")) %>%
  inner_join(countries, by = c("num" = "country_id")) %>%
  mutate(par = str_to_title(par)) %>%
  mutate(plot_var = str_c(par, "_", country)) %>%
  ggplot(aes(country, estimate, ymin = conf.low, ymax = conf.high, col = country == "Iceland")) +
  geom_linerange() +
  geom_point() +
  coord_flip() +
  facet_wrap("par", scales = "free_x", ncol = 2) +
  scale_colour_manual(values = c("grey", "blue")) +
  labs(title = "Posterior medians and 95% PIs") +
  theme(axis.title = element_blank())
ggplotly(p)
Below are the hyperparameters that the model has learned and uses as priors to estimate each country’s specific coefficients. For example, we see that \(\mu_\alpha \approx -3.9\) and \(\sigma_\alpha \approx 0.4\), which means that each country’s \(\alpha\) parameter was estimated using a prior distribution of \(\alpha_i \sim \mathrm{Normal}(-3.9, 0.4^2)\). \(\mu_S \approx 0.0048 = 0.48\%\) is the mean percentage of a country’s population that will be infected, with a 95% predictive interval of 0.31% to 0.72%.
results %>%
  ungroup %>%
  filter(par %in% c("mu_beta", "sigma_beta",
                    "mu_alpha", "sigma_alpha",
                    "mu_s", "kappa_s",
                    "sigma_phi_inv_sqrt")) %>%
  mutate(par = str_replace(par, "sigma_phi_inv_sqrt", "sigma_phi_sqrt")) %>%
  select(par, everything(), -num, -term, -std.error) %>%
  set_names(c("Parameter", "Median", "Lower", "Upper", "Rhat", "ESS")) %>%
  kable(digits = 4, align = c("l", "l", rep("c", ncol(.) - 2)),
        caption = "Table 3. Summary of posterior samples of hyperparameters") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("", "", "95% PI" = 2, "Convergence" = 2)) %>%
  column_spec(1, bold = T)
| Parameter | Median | Lower | Upper | Rhat | ESS |
|---|---|---|---|---|---|
| mu_beta | -2.4090 | -2.6445 | -2.2015 | 1.0018 | 1512 |
| sigma_beta | 0.6499 | 0.5016 | 0.8493 | 1.0020 | 2081 |
| mu_alpha | -3.8512 | -4.1198 | -3.5012 | 1.0072 | 908 |
| sigma_alpha | 0.3677 | 0.0614 | 0.7065 | 1.0081 | 766 |
| mu_s | 0.0048 | 0.0031 | 0.0072 | 1.0062 | 985 |
| kappa_s | 413.3400 | 198.6603 | 820.8333 | 1.0039 | 1653 |
| sigma_phi_sqrt | 0.6159 | 0.5001 | 0.7784 | 1.0028 | 913 |
pars <- tidyMCMC(m, pars = c("mu_alpha", "sigma_alpha"))
out <- tidyMCMC(m, pars = "alpha")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = c(lower - 0.1, upper + 0.3)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("mu_beta", "sigma_beta"))
out <- tidyMCMC(m, pars = "log_beta")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = c(1.2 * lower, 0.6 * upper)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("a_s", "b_s"))
out <- tidyMCMC(m, pars = "S")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density)), bins = 40) +
geom_rug() +
stat_function(fun = dbeta, args = list(shape1 = pars[1, 2, drop = T], shape2 = pars[2, 2, drop = T]), lty = 2, n = 800) +
scale_y_continuous(expand = expansion(mult = 0.01)) +
scale_x_continuous(limits = c(0, upper * 1.4)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
How do we make predictions with this model? First we need to sample the parameters from the posterior. We’ll do so in the code below.
add_daily_cases <- function(data, days = seq(0, 91), pop = 364134, ...) {
  # Expand to one row per posterior draw and day
  data <- data %>% expand_grid(days = days, pop = pop)
  alpha <- data$alpha
  beta <- data$beta
  S <- data$S
  phi <- data$phi
  # Logistic growth curve and its derivative (expected daily new cases as a proportion)
  linear <- alpha + beta * data$days
  f <- S / (1 + exp(-linear))
  dfdt <- beta * f * (1 - f / S)
  # Draw daily new case counts from the negative binomial likelihood
  data$new_cases <- rnbinom(nrow(data), mu = dfdt * pop, size = phi)
  data
}
# Draw 4000 posterior samples of Iceland's parameters
draws <- spread_draws(m,
                      alpha[country],
                      beta[country],
                      S[country],
                      phi[country],
                      n = 4000) %>%
  filter(country == which_iceland) %>%
  group_by(country) %>%
  mutate(iter = row_number()) %>%
  select(iter, alpha, beta, S, phi)
head(draws)
## # A tibble: 6 x 6
## # Groups: country [1]
## country iter alpha beta S phi
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 23 1 -4.01 0.143 0.00598 15.6
## 2 23 2 -3.77 0.138 0.00581 12.8
## 3 23 3 -3.76 0.156 0.00514 4.01
## 4 23 4 -4.49 0.185 0.00475 4.06
## 5 23 5 -3.96 0.162 0.00501 5.29
## 6 23 6 -4.28 0.167 0.00595 4.77
Now we have 4000 draws from the posterior distribution of Iceland’s estimated \(\alpha\), \(\beta\), \(S\) and \(\phi\) parameters. What we’re going to do is calculate posterior predictions for daily new cases and then sum them up to get total cases.
First we need to calculate the expected number of cases (as a percentage of the population) using \(\alpha\), \(\beta\) and \(S\), then multiply that number by the population of Iceland and sample from a negative binomial distribution with a sample of the overdispersion parameter from the posterior distribution.
\[ \begin{aligned} f(t) &= \frac{S}{1 + e^{-(\alpha + \beta \cdot t)}} \\ \frac{d}{dt} f &= \beta \cdot f \cdot (1 - \frac{f}{S}) \\ \text{New cases on day }t &\sim \mathrm{NegBinomial}(\frac{d}{dt}f \cdot 364134, \phi) \end{aligned} \]
The function add_daily_cases() will do that for us if we apply it to the data table containing our samples.
draws <- draws %>%
  add_daily_cases() %>%
  group_by(iter) %>%
  mutate(total_cases = cumsum(new_cases)) %>%
  ungroup
head(draws)
## # A tibble: 6 x 10
## country iter alpha beta S phi days pop new_cases total_cases
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 23 1 -4.01 0.143 0.00598 15.6 0 364134 5 5
## 2 23 1 -4.01 0.143 0.00598 15.6 1 364134 11 16
## 3 23 1 -4.01 0.143 0.00598 15.6 2 364134 8 24
## 4 23 1 -4.01 0.143 0.00598 15.6 3 364134 6 30
## 5 23 1 -4.01 0.143 0.00598 15.6 4 364134 6 36
## 6 23 1 -4.01 0.143 0.00598 15.6 5 364134 16 52
We now have 4000 draws from the posterior predictive distribution of new and total cases in Iceland, using data up to April \(1^{st}\). We’ll plot the median and 95% predictive intervals below.
iceland_d <- d %>% filter(country == "Iceland")
start_date <- min(iceland_d$date)
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_vline(xintercept = ymd("2020-04-01"), lty = 2) +
  facet_wrap("name", scales = "free")
How well do the predictions line up with real data? We’ll plot the number of new and total cases in Iceland up to June \(1^{st}\) below. The data are added as points.
plot_dat <- read_csv("https://docs.google.com/spreadsheets/d/1xgDhtejTtcyy6EN5dbDp5W3TeJhKFRRgm6Xk0s0YFeA/export?format=csv&gid=1788393542") %>%
  rename(new_deaths = "Dauðsföll", new_cases = Ny_Smit, total_cases = Smit_Samtals, total_deaths = Dauðsföll_Samtals) %>%
  mutate(date = ymd(Dagsetning),
         country = "Iceland",
         pop = 364134,
         continent = "Europe") %>%
  select(date, new_cases, total_cases) %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  filter(date <= ymd("2020-06-01"))
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_point(data = plot_dat, aes(y = value)) +
  geom_vline(xintercept = ymd("2020-04-01"), lty = 2) +
  facet_wrap("name", scales = "free")
Using more data gives us a much better fit, but of course we expected that.
Below are estimated parameters from a model that was fitted using data up to May \(4^{th}\), which was the date of the last forecast. The code is exactly the same, but we read the model and data from another file.
d <- read_csv("Data/COVID_Data_2020-05-04.csv")
m <- read_rds("Models/Model_2020-05-04.rds")
countries <- d %>% distinct(country, country_id)
which_iceland <- d %>% filter(country == "Iceland") %>% .$country_id %>% unique
n_countries <- max(d$country_id)
results <- tidyMCMC(m, conf.int = T, rhat = T, ess = T,
estimate.method = "median", conf.method = "quantile") %>%
mutate(par = str_match(term, "[a-zA-Z_2]+")) %>%
group_by(par) %>%
mutate(num = row_number() %>% as.numeric)
Since we use more than one country, we get estimates of \(\alpha\), \(\beta\), \(S\) and \(\phi\) for each country in our model.
p <- results %>%
  ungroup %>%
  filter(par %in% c("beta", "alpha", "S", "phi_inv", "nu")) %>%
  mutate(par = str_replace(par, "phi_inv", "phi")) %>%
  inner_join(countries, by = c("num" = "country_id")) %>%
  mutate(par = str_to_title(par)) %>%
  mutate(plot_var = str_c(par, "_", country)) %>%
  ggplot(aes(country, estimate, ymin = conf.low, ymax = conf.high, col = country == "Iceland")) +
  geom_linerange() +
  geom_point() +
  coord_flip() +
  facet_wrap("par", scales = "free_x", ncol = 2) +
  scale_colour_manual(values = c("grey", "blue")) +
  labs(title = "Posterior medians and 95% PIs") +
  theme(axis.title = element_blank())
ggplotly(p)
Below are the hyperparameters that the model has learned and uses as priors to estimate each country’s specific coefficients. For example, we see that \(\mu_\alpha \approx -2\) and \(\sigma_\alpha \approx 2\), which means that each country’s \(\alpha\) parameter was estimated using a prior distribution of \(\alpha_i \sim \mathrm{Normal}(-2, 2^2)\). \(\mu_S \approx 0.0016 = 0.16\%\) is the mean percentage of a country’s population that will be infected, with a 95% predictive interval of 0.13% to 0.20%.
results %>%
  ungroup %>%
  filter(par %in% c("mu_beta", "sigma_beta",
                    "mu_alpha", "sigma_alpha",
                    "mu_s", "kappa_s",
                    "sigma_phi_inv_sqrt")) %>%
  mutate(par = str_replace(par, "sigma_phi_inv_sqrt", "sigma_phi_sqrt")) %>%
  select(par, everything(), -num, -term, -std.error) %>%
  set_names(c("Parameter", "Median", "Lower", "Upper", "Rhat", "ESS")) %>%
  kable(digits = 4, align = c("l", "l", rep("c", ncol(.) - 2)),
        caption = "Table 3. Summary of posterior samples of hyperparameters") %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  add_header_above(c("", "", "95% PI" = 2, "Convergence" = 2)) %>%
  column_spec(1, bold = T)
| Parameter | Median | Lower | Upper | Rhat | ESS |
|---|---|---|---|---|---|
| mu_beta | -2.4650 | -2.5681 | -2.3660 | 1.0061 | 716 |
| sigma_beta | 0.4231 | 0.3386 | 0.5294 | 1.0040 | 800 |
| mu_alpha | -2.0485 | -2.4499 | -1.6434 | 1.0124 | 323 |
| sigma_alpha | 2.0252 | 1.7035 | 2.4234 | 1.0019 | 818 |
| mu_s | 0.0016 | 0.0013 | 0.0020 | 1.0004 | 2863 |
| kappa_s | 628.4726 | 447.2466 | 852.7344 | 1.0007 | 4080 |
| sigma_phi_sqrt | 0.8451 | 0.7426 | 0.9635 | 1.0118 | 322 |
pars <- tidyMCMC(m, pars = c("mu_alpha", "sigma_alpha"))
out <- tidyMCMC(m, pars = "alpha")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = 1.4 * c(lower, upper)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("mu_beta", "sigma_beta"))
out <- tidyMCMC(m, pars = "log_beta")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density))) +
geom_rug() +
stat_function(fun = dnorm, args = list(mean = pars[1, 2, drop = T], sd = pars[2, 2, drop = T]), lty = 2) +
scale_x_continuous(limits = c(1.2 * lower, 0.6 * upper)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
pars <- tidyMCMC(m, pars = c("a_s", "b_s"))
out <- tidyMCMC(m, pars = "S")
lower = min(out$estimate)
upper = max(out$estimate)
out %>%
ggplot(aes(estimate)) +
geom_histogram(col = "white", fill = "grey50", aes(y = stat(density)), bins = 40) +
geom_rug() +
stat_function(fun = dbeta, args = list(shape1 = pars[1, 2, drop = T], shape2 = pars[2, 2, drop = T]), lty = 2, n = 800) +
scale_x_continuous(limits = c(0, upper * 1.1)) +
theme(axis.title = element_blank()) +
labs(title = "Does the hyperprior cover the country-specific parameters?",
subtitle = "Dashed line is hyperprior, histogram and rug are country effects")
How do we make predictions with this model? First we need to sample the parameters from the posterior. We’ll do so in the code below.
add_daily_cases <- function(data, days = seq(0, 91), pop = 364134, ...) {
  # Expand to one row per posterior draw and day
  data <- data %>% expand_grid(days = days, pop = pop)
  alpha <- data$alpha
  beta <- data$beta
  S <- data$S
  phi <- data$phi
  # Logistic growth curve and its derivative (expected daily new cases as a proportion)
  linear <- alpha + beta * data$days
  f <- S / (1 + exp(-linear))
  dfdt <- beta * f * (1 - f / S)
  # Draw daily new case counts from the negative binomial likelihood
  data$new_cases <- rnbinom(nrow(data), mu = dfdt * pop, size = phi)
  data
}
# Draw 4000 posterior samples of Iceland's parameters
draws <- spread_draws(m,
                      alpha[country],
                      beta[country],
                      S[country],
                      phi[country],
                      n = 4000) %>%
  filter(country == which_iceland) %>%
  group_by(country) %>%
  mutate(iter = row_number()) %>%
  select(iter, alpha, beta, S, phi)
head(draws)
## # A tibble: 6 x 6
## # Groups: country [1]
## country iter alpha beta S phi
## <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 59 1 -4.01 0.165 0.00486 7.53
## 2 59 2 -4.30 0.174 0.00493 7.56
## 3 59 3 -4.49 0.182 0.00534 6.75
## 4 59 4 -4.21 0.174 0.00464 10.9
## 5 59 5 -4.34 0.170 0.00462 6.67
## 6 59 6 -4.39 0.177 0.00484 8.76
Now we have 4000 draws from the posterior distribution of Iceland’s estimated \(\alpha\), \(\beta\), \(S\) and \(\phi\) parameters. What we’re going to do is calculate posterior predictions for daily new cases and then sum them up to get total cases.
First we need to calculate the expected number of cases (as a percentage of the population) using \(\alpha\), \(\beta\) and \(S\), then multiply that number by the population of Iceland and sample from a negative binomial distribution with a sample of the overdispersion parameter from the posterior distribution.
\[ \begin{aligned} f(t) &= \frac{S}{1 + e^{-(\alpha + \beta \cdot t)}} \\ \frac{d}{dt} f &= \beta \cdot f \cdot (1 - \frac{f}{S}) \\ \text{New cases on day }t &\sim \mathrm{NegBinomial}(\frac{d}{dt}f \cdot 364134, \phi) \end{aligned} \]
The function add_daily_cases() will do that for us if we apply it to the data table containing our samples.
draws <- draws %>%
  add_daily_cases() %>%
  group_by(iter) %>%
  mutate(total_cases = cumsum(new_cases)) %>%
  ungroup
head(draws)
## # A tibble: 6 x 10
## country iter alpha beta S phi days pop new_cases total_cases
## <int> <int> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 59 1 -4.01 0.165 0.00486 7.53 0 364134 4 4
## 2 59 1 -4.01 0.165 0.00486 7.53 1 364134 9 13
## 3 59 1 -4.01 0.165 0.00486 7.53 2 364134 5 18
## 4 59 1 -4.01 0.165 0.00486 7.53 3 364134 8 26
## 5 59 1 -4.01 0.165 0.00486 7.53 4 364134 5 31
## 6 59 1 -4.01 0.165 0.00486 7.53 5 364134 6 37
We now have 4000 draws from the posterior predictive distribution of new and total cases in Iceland, using data up to May \(4^{th}\). We’ll plot the median and 95% predictive intervals below.
iceland_d <- d %>% filter(country == "Iceland")
start_date <- min(iceland_d$date)
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_vline(xintercept = ymd("2020-05-04"), lty = 2) +
  facet_wrap("name", scales = "free")
How well do the predictions line up with real data? We’ll plot the number of new and total cases in Iceland up to June \(1^{st}\) below. The data are added as points.
plot_dat <- read_csv("https://docs.google.com/spreadsheets/d/1xgDhtejTtcyy6EN5dbDp5W3TeJhKFRRgm6Xk0s0YFeA/export?format=csv&gid=1788393542") %>%
  rename(new_deaths = "Dauðsföll", new_cases = Ny_Smit, total_cases = Smit_Samtals, total_deaths = Dauðsföll_Samtals) %>%
  mutate(date = ymd(Dagsetning),
         country = "Iceland",
         pop = 364134,
         continent = "Europe") %>%
  select(date, new_cases, total_cases) %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  filter(date <= ymd("2020-06-01"))
draws %>%
  pivot_longer(c(new_cases, total_cases)) %>%
  group_by(date = start_date + days - 1, name) %>%
  summarise(median = median(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975)) %>%
  ggplot(aes(date, median)) +
  geom_line() +
  geom_line(aes(y = lower), lty = 2) +
  geom_line(aes(y = upper), lty = 2) +
  geom_point(data = plot_dat, aes(y = value)) +
  geom_vline(xintercept = ymd("2020-05-04"), lty = 2) +
  facet_wrap("name", scales = "free")
Using more data gives us a much better fit, but of course we expected that.
The model worked well for Iceland, but it might not work as well in other countries. One example is the United States of America, where we have seen that the model does not predict very well. Why would that be? One reason is that the USA is not really one country but a collection of semi-autonomous states. The first wave is peaking in one state while it hasn’t even started in another. The autonomy of the states also leads to drastically different measures taken by their governments, which leads to different types of growth. The logistic growth model does not use any information about government action; it is what you might call a “curve-fitting” model.
We identified a known mathematical function that has been used extensively in similar applications and used it to predict future case counts. The model does not know what COVID-19 is and it does not know about the 2-meter rule. It only knows how to find values of \(\alpha\), \(\beta\), \(S\) and \(\phi\) that make the logistic growth function fit the data well. It is important to keep in mind what you are doing, and the limits of your methods, when dealing with matters as serious as predicting COVID cases.
One important thing to keep in mind is that the model is forced to find mathematical functions where the increase and decrease of new cases are symmetrical, such that new cases reach a peak and then must descend exactly as fast as they ascended. One way to combat this, which we explored, is the Generalized Logistic Growth Model. This mathematical function introduces a new parameter, \(\nu\), that allows the growth to be non-symmetrical. If you are interested, I created notes on both models and comparisons for a Statistics Colloquium, and you can see the notes here: https://rpubs.com/bgautijonsson/GenLogErindi
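As a rough sketch of the idea (one common parametrisation; the exact form in the linked notes may differ), the \(\nu\) parameter enters the growth curve like this, with \(\nu = 1\) recovering the ordinary logistic curve:

# Generalized logistic growth: nu = 1 gives back the ordinary logistic curve
gen_logistic <- function(t, alpha, beta, S, nu) {
  S / (1 + exp(-(alpha + beta * t)))^(1 / nu)
}
crossing(t = seq(0, 100, length.out = 200), nu = c(0.3, 1, 3)) %>%
  mutate(f = gen_logistic(t, alpha = -5, beta = 0.2, S = 0.005, nu = nu)) %>%
  ggplot(aes(t, f, col = factor(nu))) +
  geom_line() +
  scale_y_continuous(labels = percent) +
  labs(x = "Days", y = "% Infected", col = "nu",
       title = "Generalized logistic growth need not be symmetrical") +
  theme(legend.position = "right")  # override the global theme so the legend shows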
The figure below compares predictions from the logistic model (blue), which we have learned about here, and the generalized logistic model (red). The points are observed data, the lines are median predictions, and the transparent ribbons are 95% prediction intervals. Each row is a country, and in each column we use only data up to and including the listed date. We can see that the logistic model worked well in Iceland, but it seems that the generalized model did better in several other countries. You can also see one decisive flaw of the logistic model: it most often believes the pandemic will end very soon, leading to predictions that are too low.
We were worried about this in Iceland and spent every day expecting the model to fail on that day, but somehow it just didn’t happen!
knitr::include_graphics("calibration_newcases.png")
It is interesting to notice how much the estimates of the hyperparameters (\(\mu_\alpha\), \(\mu_S\), etc.) changed with the addition of new data.
We haven’t yet talked about how to use these predictions for new cases to predict hospital and ICU admissions. There will be notes on that later this week.