Introduction

As for a lot of other people, COVID-19 is currently very much on my mind. Thankfully, there is a lot of good information and also data available. In this document I will collect some thoughts and analyses regarding the spread of the disease. For anybody reading this, please keep in mind that I am not an epidemologist or virologist, just someone with a solid statistical education, trying to make sense – mostly for myself – out of the crisis.

Updates

Compared to the first post rpubs.com, I have made the following additions.

Getting the data

For now, my main data source is going to be the dataset provided by Johns Hopkins University on github.com.

Development of confirmed cases

As a first step, I will look at the development of confirmed cases in Germany and Italy.

## # A tibble: 100 x 3
##    `Country/Region` Date       Confirmed
##    <chr>            <date>         <dbl>
##  1 Germany          2020-01-22         0
##  2 Italy            2020-01-22         0
##  3 Germany          2020-01-23         0
##  4 Italy            2020-01-23         0
##  5 Germany          2020-01-24         0
##  6 Italy            2020-01-24         0
##  7 Germany          2020-01-25         0
##  8 Italy            2020-01-25         0
##  9 Germany          2020-01-26         0
## 10 Italy            2020-01-26         0
## # ... with 90 more rows

What we see is that the early trajectory for both countries was very similar, with only a handful confirmed cases. We also see that Italy has started on an exponential growth path. Looking at the log plot, we see two important points:

  1. Surprisingly, Italy seems to have been able to flatten the curve, at least with regards to the growth factor.
  2. We also see that Germany currently is on a parallel path to Italy, but without a first steep part of the curve.

So, my takeaway for now is that Germany indeed is probably going to follow the path Italy has shown us, with maybe a week of delay.

Estimating the number of confirmed cases

In the next step, I am estimating the number of confirmed cases for the two coountries. For this, I am using only the days where the number of infected is above 50.

## # A tibble: 6 x 5
##   `Country/Region` Date       Confirmed l_confirmed n_day
##   <chr>            <date>         <dbl>       <dbl> <int>
## 1 Germany          2020-03-09      1176        7.07    16
## 2 Italy            2020-03-09      9172        9.12    16
## 3 Germany          2020-03-10      1457        7.28    17
## 4 Italy            2020-03-10     10149        9.23    17
## 5 Germany          2020-03-11      1908        7.55    18
## 6 Italy            2020-03-11     12462        9.43    18

In the first step, we are using a linear regression of the whole data set. The number of confirmed cases has been transformed to natural logs. The trend is represented by \(n_{day}\). The general form of the estimation is

\[y = \beta_0 + \beta_{IT}+ \beta_{n_{day}} + \epsilon\]

lm_abs = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
lm_abs_last = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered %>%
                   filter(Date != max(Date)))
summary(lm_abs)
## 
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77228 -0.10959 -0.00706  0.17721  0.35128 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.701897   0.131199   20.59   <2e-16 ***
## `Country/Region`Italy 2.197517   0.093355   23.54   <2e-16 ***
## n_day                 0.273020   0.008934   30.56   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2386 on 28 degrees of freedom
## Multiple R-squared:  0.9759, Adjusted R-squared:  0.9742 
## F-statistic: 566.6 on 2 and 28 DF,  p-value: < 2.2e-16
summary(lm_abs_last)
## 
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered %>% 
##     filter(Date != max(Date)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73615 -0.10032  0.01088  0.16380  0.33336 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.62650    0.13426   19.56   <2e-16 ***
## `Country/Region`Italy  2.23678    0.09508   23.52   <2e-16 ***
## n_day                  0.27977    0.00954   29.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2326 on 26 degrees of freedom
## Multiple R-squared:  0.976,  Adjusted R-squared:  0.9742 
## F-statistic: 529.8 on 2 and 26 DF,  p-value: < 2.2e-16
vec_coeff = coefficients(lm_abs)
names(vec_coeff) = NULL

Now, what do the results tell us? The intercept is the starting point in the amount of cases, \(exp(\beta_0)\) = 14.9. This is a result of the chosen cut-off point at fifty (the intercept is not fifty, as Germany’s timeline does not start on day one).

The estimate for the Italy dummy variable means that on average the case load for Italy was \(exp(\beta_{IT})\) = 9 times higher than for Germany.

The estimate \(\beta_{n_{day}}\) is the interesting part, because it gets multiplied by the number of days. We therefore can use it as an approximate daily growth rate, that is the number of cases will grow with 27.3%. The adjusted \(R^2\) gives us a high confidence into the results. Furthermore, we can approximate the number of days which Italy is ahead of Germany by dividing the estimate for the dummy through the estimate for the daily growth rate: 2.2 / 0.3 = 8.

In the next step I am looking at the national growth rates.

lm_de = lm(l_confirmed ~  n_day, data = ts_filter_de)
lm_de_last = lm(l_confirmed ~  n_day, data = ts_filter_de %>%
                   filter(Date != max(Date)))
summary(lm_de)
## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.15320 -0.11788 -0.04525  0.11501  0.24787 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.49564    0.16517   15.11 3.26e-08 ***
## n_day        0.28952    0.01274   22.73 6.12e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1523 on 10 degrees of freedom
## Multiple R-squared:  0.981,  Adjusted R-squared:  0.9791 
## F-statistic: 516.7 on 1 and 10 DF,  p-value: 6.119e-10
summary(lm_de_last)
## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de %>% filter(Date != 
##     max(Date)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18906 -0.11377 -0.02128  0.08951  0.22559 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.40929    0.17581   13.70 2.47e-07 ***
## n_day        0.29788    0.01417   21.03 5.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1486 on 9 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9778 
## F-statistic: 442.1 on 1 and 9 DF,  p-value: 5.838e-09

So, for Germany the estimate for the growth rate is 28.95% after 29.79% yesterday.

lm_it = lm(l_confirmed ~  n_day, data = ts_filter_it)
lm_it_last = lm(l_confirmed ~  n_day, data = ts_filter_it %>%
                   filter(Date != max(Date)))
summary(lm_it)
## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.80954 -0.08943  0.03124  0.20147  0.34714 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.93667    0.12263   40.26  < 2e-16 ***
## n_day        0.26888    0.01164   23.10  2.8e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2779 on 17 degrees of freedom
## Multiple R-squared:  0.9691, Adjusted R-squared:  0.9673 
## F-statistic: 533.7 on 1 and 17 DF,  p-value: 2.803e-14
summary(lm_it_last)
## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it %>% filter(Date != 
##     max(Date)))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77108 -0.11495  0.05052  0.19273  0.33130 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.89822    0.12203   40.14  < 2e-16 ***
## n_day        0.27567    0.01225   22.50 1.55e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2697 on 16 degrees of freedom
## Multiple R-squared:  0.9694, Adjusted R-squared:  0.9674 
## F-statistic: 506.1 on 1 and 16 DF,  p-value: 1.549e-13

And for Italy the estimate for the growth rate is 26.89% after 27.57% yesterday. Plotting shows us why the \(R^2\) is lower for Italy: In the first couply of days the growth rate was considerably higher than in the last 2 weeks.

Predicting the next week

On the basis of the full model, a very simplified prediction is possible, based on the assumption that the identified trend will continue for the next 7 days.

Date type Germany Italy
2020-03-11 Actual 1,908 12,462
2020-03-11 Prediction_Original 2,031 18,284
2020-03-12 Prediction_Original 2,669 24,024
2020-03-13 Prediction_Original 3,506 31,565
2020-03-14 Prediction_Original 4,607 41,474
2020-03-15 Prediction_Original 6,053 54,494
2020-03-16 Prediction_Original 7,953 71,601
2020-03-17 Prediction_Original 10,450 94,079
2020-03-18 Prediction_Original 13,731 123,613

Identifying a possible trend in the growth rate

Compared to the analysis done yesterday, we are seeing a reduction the growth rates. In the next step I will look at the daily change in growth rates, approximated as the first differences in the log.

dts_filtered = ts_filtered %>%
  select(-Confirmed)%>%
  spread(`Country/Region`, l_confirmed) %>%
  mutate(Germany = c(NA, diff(Germany))) %>%
  mutate(Italy = c(NA, diff(Italy))) %>%
  gather(Country, dl_confirmed,-n_day,-Date)

dts_filtered  %>%
  ggplot(aes(Date, dl_confirmed, colour = Country)) + 
  geom_line() + 
  geom_smooth(method = "lm")

cutoff_date = as.Date("2020-02-25")

dts_de = dts_filtered %>%
  filter(Country == "Germany", Date > cutoff_date)

dts_it = dts_filtered %>%
  filter(Country == "Italy", Date > cutoff_date)

The plot shows a marked downward trend in the growth rate, which is slightly larger for Italy than for Germany. Of course, with a low number of observed days, one has to be carfull to draw too strong conclusions. Furthermore, I assume that the first, very high growth rates are not representative anymore, and therefore select only growth rates after 25th February.

lm_diff = lm(dl_confirmed ~ Country + n_day, data = dts_filtered %>% filter(Date > cutoff_date))
summary(lm_diff)
## 
## Call:
## lm(formula = dl_confirmed ~ Country + n_day, data = dts_filtered %>% 
##     filter(Date > cutoff_date))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.143166 -0.057504 -0.007931  0.041795  0.306350 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.468394   0.071387   6.561 1.08e-06 ***
## CountryItaly -0.073286   0.040391  -1.814   0.0827 .  
## n_day        -0.013762   0.004992  -2.757   0.0112 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09859 on 23 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.2805, Adjusted R-squared:  0.2179 
## F-statistic: 4.483 on 2 and 23 DF,  p-value: 0.0227
predict_in_d = tibble(
  Date = rep(seq.Date(max(ts_filtered$Date), by = "1 day", length.out = 8),2),
  n_day = rep(seq(max(ts_filtered$n_day),by = 1, length.out = 8), 2),
  Country = rep(c("Germany", "Italy"), each = 8)
)

predict_val_d = predict(lm_diff, newdata = predict_in_d)

predict_out_d = predict_in_d %>%
  mutate(dl_confirmed = predict_val_d,
         type = "Prediction_Panel")

dts_filtered %>%
  mutate(type = "Actual") %>%
  bind_rows(predict_out_d) %>%
  ggplot(aes(Date, dl_confirmed)) + 
  geom_line(aes(linetype = type, colour = Country)) +
  ggtitle("Predicted growth rate") 
## Warning: Removed 9 row(s) containing missing values (geom_path).

Given the current trend in the growth rate, we would expect a reduction in new cases for both countries, and for Italy even a reversal as the growth rate turns negative within a week. A problem of this analysis is that it treats both countries more or less the same, and only estimates an offset for the country dummy. It is therefore more helpful to estimate the trend individually.

lm_diff_de = lm(dl_confirmed ~ n_day, data = dts_de)
summary(lm_diff_de)
## 
## Call:
## lm(formula = dl_confirmed ~ n_day, data = dts_de)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.154888 -0.106612 -0.008457  0.051748  0.303420 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  0.50649    0.17963   2.820   0.0201 *
## n_day       -0.01669    0.01343  -1.243   0.2452  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1408 on 9 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.1466, Adjusted R-squared:  0.05174 
## F-statistic: 1.546 on 1 and 9 DF,  p-value: 0.2452
lm_diff_it = lm(dl_confirmed ~ n_day, data = dts_it)
summary(lm_diff_it)
## 
## Call:
## lm(formula = dl_confirmed ~ n_day, data = dts_it)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.085053 -0.042517 -0.002443  0.035067  0.125088 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.382445   0.040993   9.329    4e-07 ***
## n_day       -0.012611   0.003469  -3.636  0.00302 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05804 on 13 degrees of freedom
## Multiple R-squared:  0.5042, Adjusted R-squared:  0.466 
## F-statistic: 13.22 on 1 and 13 DF,  p-value: 0.003019

The estimates show that the intercept is lower for Italy, but for Germany the trend shows a higher reduction. For Germany, the \(R^2\) is rather low due to the small sample size.

The plot shows the difference in the development of the growth rates depending on using the panel estimate (combined estimation plus country dummy), or the single estimate.

Revised Prediction

Using the estimates for the growth rate, we reconstruct the projection of the confirmed cases. For this, we are starting with the last actual value, and use the growth rates for future dates from the single estimation.

Date Type Germany Italy
2020-03-11 Actual 1,908 12,462
2020-03-11 Prediction_Revised 1,790 11,856
2020-03-12 Prediction_Revised 2,164 13,676
2020-03-13 Prediction_Revised 2,571 15,579
2020-03-14 Prediction_Revised 3,005 17,523
2020-03-15 Prediction_Revised 3,454 19,463
2020-03-16 Prediction_Revised 3,905 21,347
2020-03-17 Prediction_Revised 4,341 23,120
2020-03-18 Prediction_Revised 4,745 24,726

The revised projection shows the sensitivity of the model to the assumption about the growth rate - with only a horizon of 1 week, the revised projection is at 24,726 instead of 123,613 for Italy, and 4,745 instead of 13,731 for Germany.