Some thoughts on COVID-19

Introduction

As for a lot of other people, COVID-19 is currently very much on my mind. Thankfully, there is a lot of good information and also data available. In this document I will collect some thoughts and analyses regarding the spread of the disease. For anybody reading this, please keep in mind that I am not an epidemologist or virologist, just someone with a solid statistical education, trying to make sense – mostly for myself – out of the crisis.

Getting the data

For now, my main data source is going to be the dataset provided by Johns Hopkins University on github.com.

Development of confirmed cases

As a first step, I will look at the development of confirmed cases in Germany and Italy.

## # A tibble: 96 x 3
##    `Country/Region` Date       Confirmed
##    <chr>            <date>         <dbl>
##  1 Germany          2020-01-22         0
##  2 Italy            2020-01-22         0
##  3 Germany          2020-01-23         0
##  4 Italy            2020-01-23         0
##  5 Germany          2020-01-24         0
##  6 Italy            2020-01-24         0
##  7 Germany          2020-01-25         0
##  8 Italy            2020-01-25         0
##  9 Germany          2020-01-26         0
## 10 Italy            2020-01-26         0
## # ... with 86 more rows

What we see is that the early trajectory for both countries was very similar, with only a handful confirmed cases. We also see that Italy has started on an exponential growth path. Looking at the log plot, we see two important points:

Surprisingly, Italy seems to have been able to flatten the curve, at least with regards to the growth factor.
We also see that Germany currently is on a parallel path to Italy, but without a first steep part of the curve.

So, my takeaway for now is that Germany indeed is probably going to follow the path Italy has shown us, with maybe a week of delay.

Estimating the rate of infection

In the next step, I am estimating the current rate of infection for the two coountries. For this, I am using only the days where the number of infected is above 50.

## # A tibble: 27 x 5
##    `Country/Region` Date       Confirmed l_confirmed n_day
##    <chr>            <date>         <dbl>       <dbl> <int>
##  1 Italy            2020-02-22        62        4.13     1
##  2 Italy            2020-02-23       155        5.04     2
##  3 Italy            2020-02-24       229        5.43     3
##  4 Italy            2020-02-25       322        5.77     4
##  5 Italy            2020-02-26       453        6.12     5
##  6 Italy            2020-02-27       655        6.48     6
##  7 Italy            2020-02-28       888        6.79     7
##  8 Germany          2020-02-29        79        4.37     8
##  9 Italy            2020-02-29      1128        7.03     9
## 10 Germany          2020-03-01       130        4.87    10
## # ... with 17 more rows

In the first step, we are using a linear regression of the whole data set. The number of confirmed cases has been transformed to natural logs. The trend is represented by \(n_{day}\).

lm_abs = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
summary(lm_abs)

## 
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18811 -0.15145  0.02037  0.16673  0.50910 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.11594    0.19580   15.91 2.98e-14 ***
## `Country/Region`Italy  2.03826    0.15081   13.52 1.03e-12 ***
## n_day                  0.16105    0.00935   17.22 5.16e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3615 on 24 degrees of freedom
## Multiple R-squared:  0.9398, Adjusted R-squared:  0.9347 
## F-statistic: 187.2 on 2 and 24 DF,  p-value: 2.279e-15

Now, what do the results tell us? The intercept is the starting point in the amount of cases, \(exp(3.11594)\) = 22.6. This is a result of the chosen cut-off point at fifty (the intercept is not fifty, as Germany’s timeline does not start on day one). The estimate for the Italy dummy variable means that on average the case load for Italy was \(exp(2.038)\) = 7.7 times higher than for Germany. The estimate for \(n_{day}\) is the interesting part, because it gets multiplied by the number of days. We therefore can use it as an approximate daily growth rate, that is the number of cases will grow with \(~16.1\%\). The adjusted \(R^2\) gives us a high confidence into the results. Furthermore, we can approximate the number of days which Italy is ahead of Germany by dividing the estimate for the dummy through the estimate for the daily growth rate: \(2.038 / 0.161\) = 12.7.

In the next step I am looking at the national growth rates.

lm_de = lm(l_confirmed ~  n_day, data = ts_filter_de)
summary(lm_de)

## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.181066 -0.104249 -0.001078  0.088674  0.187779 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.214693   0.133570   24.07 9.47e-09 ***
## n_day       0.155240   0.007444   20.86 2.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1352 on 8 degrees of freedom
## Multiple R-squared:  0.9819, Adjusted R-squared:  0.9797 
## F-statistic:   435 on 1 and 8 DF,  p-value: 2.93e-08

So, for Germany the current growth rate is \(~15.5\%\).

lm_it = lm(l_confirmed ~  n_day, data = ts_filter_it)
summary(lm_it)

## 
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16963 -0.19623  0.03263  0.36964  0.51604 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.13407    0.19280   26.63 4.82e-14 ***
## n_day        0.16269    0.01305   12.47 2.56e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4455 on 15 degrees of freedom
## Multiple R-squared:  0.912,  Adjusted R-squared:  0.9061 
## F-statistic: 155.4 on 1 and 15 DF,  p-value: 2.563e-09

And for Italy the current growth rate is \(~16.3\%\). Plotting shows us why the \(R^2\) is lower for Italy: In the first couply of days the growth rate was considerably higher than in the last 2 weeks.

Predicting the next week

On the basis of the full model, a very simplified prediction is possible, based on the assumption that the identified trend will continue for the next 7 days.

The reason why we are seeing an offset between last actual and first predicted value for Italy could be a first hint that the growth rate as estimated does not hold for the more recent development.