As for a lot of other people, COVID-19 is currently very much on my mind. Thankfully, there is a lot of good information and also data available. In this document I will collect some thoughts and analyses regarding the spread of the disease. For anybody reading this, please keep in mind that I am not an epidemologist or virologist, just someone with a solid statistical education, trying to make sense – mostly for myself – out of the crisis.
Compared to the first post rpubs.com, I have made the following additions.
For now, my main data source is going to be the dataset provided by Johns Hopkins University on github.com.
As a first step, I will look at the development of confirmed cases in Germany and Italy.
## # A tibble: 100 x 3
## `Country/Region` Date Confirmed
## <chr> <date> <dbl>
## 1 Germany 2020-01-22 0
## 2 Italy 2020-01-22 0
## 3 Germany 2020-01-23 0
## 4 Italy 2020-01-23 0
## 5 Germany 2020-01-24 0
## 6 Italy 2020-01-24 0
## 7 Germany 2020-01-25 0
## 8 Italy 2020-01-25 0
## 9 Germany 2020-01-26 0
## 10 Italy 2020-01-26 0
## # ... with 90 more rows
What we see is that the early trajectory for both countries was very similar, with only a handful confirmed cases. We also see that Italy has started on an exponential growth path. Looking at the log plot, we see two important points:
So, my takeaway for now is that Germany indeed is probably going to follow the path Italy has shown us, with maybe a week of delay.
In the next step, I am estimating the number of confirmed cases for the two coountries. For this, I am using only the days where the number of infected is above 50.
## # A tibble: 6 x 5
## `Country/Region` Date Confirmed l_confirmed n_day
## <chr> <date> <dbl> <dbl> <int>
## 1 Germany 2020-03-09 1176 7.07 16
## 2 Italy 2020-03-09 9172 9.12 16
## 3 Germany 2020-03-10 1457 7.28 17
## 4 Italy 2020-03-10 10149 9.23 17
## 5 Germany 2020-03-11 1908 7.55 18
## 6 Italy 2020-03-11 12462 9.43 18
In the first step, we are using a linear regression of the whole data set. The number of confirmed cases has been transformed to natural logs. The trend is represented by \(n_{day}\). The general form of the estimation is
\[y = \beta_0 + \beta_{IT}+ \beta_{n_{day}} + \epsilon\]
lm_abs = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
lm_abs_last = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered %>%
filter(Date != max(Date)))
summary(lm_abs)
##
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77228 -0.10959 -0.00706 0.17721 0.35128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.701897 0.131199 20.59 <2e-16 ***
## `Country/Region`Italy 2.197517 0.093355 23.54 <2e-16 ***
## n_day 0.273020 0.008934 30.56 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2386 on 28 degrees of freedom
## Multiple R-squared: 0.9759, Adjusted R-squared: 0.9742
## F-statistic: 566.6 on 2 and 28 DF, p-value: < 2.2e-16
summary(lm_abs_last)
##
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered %>%
## filter(Date != max(Date)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.73615 -0.10032 0.01088 0.16380 0.33336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.62650 0.13426 19.56 <2e-16 ***
## `Country/Region`Italy 2.23678 0.09508 23.52 <2e-16 ***
## n_day 0.27977 0.00954 29.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2326 on 26 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.9742
## F-statistic: 529.8 on 2 and 26 DF, p-value: < 2.2e-16
vec_coeff = coefficients(lm_abs)
names(vec_coeff) = NULL
Now, what do the results tell us? The intercept is the starting point in the amount of cases, \(exp(\beta_0)\) = 14.9. This is a result of the chosen cut-off point at fifty (the intercept is not fifty, as Germany’s timeline does not start on day one).
The estimate for the Italy dummy variable means that on average the case load for Italy was \(exp(\beta_{IT})\) = 9 times higher than for Germany.
The estimate \(\beta_{n_{day}}\) is the interesting part, because it gets multiplied by the number of days. We therefore can use it as an approximate daily growth rate, that is the number of cases will grow with 27.3%. The adjusted \(R^2\) gives us a high confidence into the results. Furthermore, we can approximate the number of days which Italy is ahead of Germany by dividing the estimate for the dummy through the estimate for the daily growth rate: 2.2 / 0.3 = 8.
In the next step I am looking at the national growth rates.
lm_de = lm(l_confirmed ~ n_day, data = ts_filter_de)
lm_de_last = lm(l_confirmed ~ n_day, data = ts_filter_de %>%
filter(Date != max(Date)))
summary(lm_de)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.15320 -0.11788 -0.04525 0.11501 0.24787
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.49564 0.16517 15.11 3.26e-08 ***
## n_day 0.28952 0.01274 22.73 6.12e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1523 on 10 degrees of freedom
## Multiple R-squared: 0.981, Adjusted R-squared: 0.9791
## F-statistic: 516.7 on 1 and 10 DF, p-value: 6.119e-10
summary(lm_de_last)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de %>% filter(Date !=
## max(Date)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18906 -0.11377 -0.02128 0.08951 0.22559
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.40929 0.17581 13.70 2.47e-07 ***
## n_day 0.29788 0.01417 21.03 5.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1486 on 9 degrees of freedom
## Multiple R-squared: 0.98, Adjusted R-squared: 0.9778
## F-statistic: 442.1 on 1 and 9 DF, p-value: 5.838e-09
So, for Germany the estimate for the growth rate is 28.95% after 29.79% yesterday.
lm_it = lm(l_confirmed ~ n_day, data = ts_filter_it)
lm_it_last = lm(l_confirmed ~ n_day, data = ts_filter_it %>%
filter(Date != max(Date)))
summary(lm_it)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.80954 -0.08943 0.03124 0.20147 0.34714
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.93667 0.12263 40.26 < 2e-16 ***
## n_day 0.26888 0.01164 23.10 2.8e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2779 on 17 degrees of freedom
## Multiple R-squared: 0.9691, Adjusted R-squared: 0.9673
## F-statistic: 533.7 on 1 and 17 DF, p-value: 2.803e-14
summary(lm_it_last)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it %>% filter(Date !=
## max(Date)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77108 -0.11495 0.05052 0.19273 0.33130
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.89822 0.12203 40.14 < 2e-16 ***
## n_day 0.27567 0.01225 22.50 1.55e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2697 on 16 degrees of freedom
## Multiple R-squared: 0.9694, Adjusted R-squared: 0.9674
## F-statistic: 506.1 on 1 and 16 DF, p-value: 1.549e-13
And for Italy the estimate for the growth rate is 26.89% after 27.57% yesterday. Plotting shows us why the \(R^2\) is lower for Italy: In the first couply of days the growth rate was considerably higher than in the last 2 weeks.
On the basis of the full model, a very simplified prediction is possible, based on the assumption that the identified trend will continue for the next 7 days.
Date | type | Germany | Italy |
---|---|---|---|
2020-03-11 | Actual | 1,908 | 12,462 |
2020-03-11 | Prediction_Original | 2,031 | 18,284 |
2020-03-12 | Prediction_Original | 2,669 | 24,024 |
2020-03-13 | Prediction_Original | 3,506 | 31,565 |
2020-03-14 | Prediction_Original | 4,607 | 41,474 |
2020-03-15 | Prediction_Original | 6,053 | 54,494 |
2020-03-16 | Prediction_Original | 7,953 | 71,601 |
2020-03-17 | Prediction_Original | 10,450 | 94,079 |
2020-03-18 | Prediction_Original | 13,731 | 123,613 |
Compared to the analysis done yesterday, we are seeing a reduction the growth rates. In the next step I will look at the daily change in growth rates, approximated as the first differences in the log.
dts_filtered = ts_filtered %>%
select(-Confirmed)%>%
spread(`Country/Region`, l_confirmed) %>%
mutate(Germany = c(NA, diff(Germany))) %>%
mutate(Italy = c(NA, diff(Italy))) %>%
gather(Country, dl_confirmed,-n_day,-Date)
dts_filtered %>%
ggplot(aes(Date, dl_confirmed, colour = Country)) +
geom_line() +
geom_smooth(method = "lm")
cutoff_date = as.Date("2020-02-25")
dts_de = dts_filtered %>%
filter(Country == "Germany", Date > cutoff_date)
dts_it = dts_filtered %>%
filter(Country == "Italy", Date > cutoff_date)
The plot shows a marked downward trend in the growth rate, which is slightly larger for Italy than for Germany. Of course, with a low number of observed days, one has to be carfull to draw too strong conclusions. Furthermore, I assume that the first, very high growth rates are not representative anymore, and therefore select only growth rates after 25th February.
lm_diff = lm(dl_confirmed ~ Country + n_day, data = dts_filtered %>% filter(Date > cutoff_date))
summary(lm_diff)
##
## Call:
## lm(formula = dl_confirmed ~ Country + n_day, data = dts_filtered %>%
## filter(Date > cutoff_date))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.143166 -0.057504 -0.007931 0.041795 0.306350
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.468394 0.071387 6.561 1.08e-06 ***
## CountryItaly -0.073286 0.040391 -1.814 0.0827 .
## n_day -0.013762 0.004992 -2.757 0.0112 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09859 on 23 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.2805, Adjusted R-squared: 0.2179
## F-statistic: 4.483 on 2 and 23 DF, p-value: 0.0227
predict_in_d = tibble(
Date = rep(seq.Date(max(ts_filtered$Date), by = "1 day", length.out = 8),2),
n_day = rep(seq(max(ts_filtered$n_day),by = 1, length.out = 8), 2),
Country = rep(c("Germany", "Italy"), each = 8)
)
predict_val_d = predict(lm_diff, newdata = predict_in_d)
predict_out_d = predict_in_d %>%
mutate(dl_confirmed = predict_val_d,
type = "Prediction_Panel")
dts_filtered %>%
mutate(type = "Actual") %>%
bind_rows(predict_out_d) %>%
ggplot(aes(Date, dl_confirmed)) +
geom_line(aes(linetype = type, colour = Country)) +
ggtitle("Predicted growth rate")
## Warning: Removed 9 row(s) containing missing values (geom_path).
Given the current trend in the growth rate, we would expect a reduction in new cases for both countries, and for Italy even a reversal as the growth rate turns negative within a week. A problem of this analysis is that it treats both countries more or less the same, and only estimates an offset for the country dummy. It is therefore more helpful to estimate the trend individually.
lm_diff_de = lm(dl_confirmed ~ n_day, data = dts_de)
summary(lm_diff_de)
##
## Call:
## lm(formula = dl_confirmed ~ n_day, data = dts_de)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.154888 -0.106612 -0.008457 0.051748 0.303420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.50649 0.17963 2.820 0.0201 *
## n_day -0.01669 0.01343 -1.243 0.2452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1408 on 9 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.1466, Adjusted R-squared: 0.05174
## F-statistic: 1.546 on 1 and 9 DF, p-value: 0.2452
lm_diff_it = lm(dl_confirmed ~ n_day, data = dts_it)
summary(lm_diff_it)
##
## Call:
## lm(formula = dl_confirmed ~ n_day, data = dts_it)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.085053 -0.042517 -0.002443 0.035067 0.125088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.382445 0.040993 9.329 4e-07 ***
## n_day -0.012611 0.003469 -3.636 0.00302 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05804 on 13 degrees of freedom
## Multiple R-squared: 0.5042, Adjusted R-squared: 0.466
## F-statistic: 13.22 on 1 and 13 DF, p-value: 0.003019
The estimates show that the intercept is lower for Italy, but for Germany the trend shows a higher reduction. For Germany, the \(R^2\) is rather low due to the small sample size.
The plot shows the difference in the development of the growth rates depending on using the panel estimate (combined estimation plus country dummy), or the single estimate.
Using the estimates for the growth rate, we reconstruct the projection of the confirmed cases. For this, we are starting with the last actual value, and use the growth rates for future dates from the single estimation.
Date | Type | Germany | Italy |
---|---|---|---|
2020-03-11 | Actual | 1,908 | 12,462 |
2020-03-11 | Prediction_Revised | 1,790 | 11,856 |
2020-03-12 | Prediction_Revised | 2,164 | 13,676 |
2020-03-13 | Prediction_Revised | 2,571 | 15,579 |
2020-03-14 | Prediction_Revised | 3,005 | 17,523 |
2020-03-15 | Prediction_Revised | 3,454 | 19,463 |
2020-03-16 | Prediction_Revised | 3,905 | 21,347 |
2020-03-17 | Prediction_Revised | 4,341 | 23,120 |
2020-03-18 | Prediction_Revised | 4,745 | 24,726 |
The revised projection shows the sensitivity of the model to the assumption about the growth rate - with only a horizon of 1 week, the revised projection is at 24,726 instead of 123,613 for Italy, and 4,745 instead of 13,731 for Germany.