As for a lot of other people, COVID-19 is currently very much on my mind. Thankfully, there is a lot of good information and also data available. In this document I will collect some thoughts and analyses regarding the spread of the disease. For anybody reading this, please keep in mind that I am not an epidemologist or virologist, just someone with a solid statistical education, trying to make sense – mostly for myself – out of the crisis.
For now, my main data source is going to be the dataset provided by Johns Hopkins University on github.com.
As a first step, I will look at the development of confirmed cases in Germany and Italy.
## # A tibble: 96 x 3
## `Country/Region` Date Confirmed
## <chr> <date> <dbl>
## 1 Germany 2020-01-22 0
## 2 Italy 2020-01-22 0
## 3 Germany 2020-01-23 0
## 4 Italy 2020-01-23 0
## 5 Germany 2020-01-24 0
## 6 Italy 2020-01-24 0
## 7 Germany 2020-01-25 0
## 8 Italy 2020-01-25 0
## 9 Germany 2020-01-26 0
## 10 Italy 2020-01-26 0
## # ... with 86 more rows
What we see is that the early trajectory for both countries was very similar, with only a handful confirmed cases. We also see that Italy has started on an exponential growth path. Looking at the log plot, we see two important points:
So, my takeaway for now is that Germany indeed is probably going to follow the path Italy has shown us, with maybe a week of delay.
In the next step, I am estimating the current rate of infection for the two coountries. For this, I am using only the days where the number of infected is above 50.
## # A tibble: 27 x 5
## `Country/Region` Date Confirmed l_confirmed n_day
## <chr> <date> <dbl> <dbl> <int>
## 1 Italy 2020-02-22 62 4.13 1
## 2 Italy 2020-02-23 155 5.04 2
## 3 Italy 2020-02-24 229 5.43 3
## 4 Italy 2020-02-25 322 5.77 4
## 5 Italy 2020-02-26 453 6.12 5
## 6 Italy 2020-02-27 655 6.48 6
## 7 Italy 2020-02-28 888 6.79 7
## 8 Germany 2020-02-29 79 4.37 8
## 9 Italy 2020-02-29 1128 7.03 9
## 10 Germany 2020-03-01 130 4.87 10
## # ... with 17 more rows
In the first step, we are using a linear regression of the whole data set. The number of confirmed cases has been transformed to natural logs. The trend is represented by \(n_{day}\).
lm_abs = lm(l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
summary(lm_abs)
##
## Call:
## lm(formula = l_confirmed ~ `Country/Region` + n_day, data = ts_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18811 -0.15145 0.02037 0.16673 0.50910
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.11594 0.19580 15.91 2.98e-14 ***
## `Country/Region`Italy 2.03826 0.15081 13.52 1.03e-12 ***
## n_day 0.16105 0.00935 17.22 5.16e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3615 on 24 degrees of freedom
## Multiple R-squared: 0.9398, Adjusted R-squared: 0.9347
## F-statistic: 187.2 on 2 and 24 DF, p-value: 2.279e-15
Now, what do the results tell us? The intercept is the starting point in the amount of cases, \(exp(3.11594)\) = 22.6. This is a result of the chosen cut-off point at fifty (the intercept is not fifty, as Germany’s timeline does not start on day one). The estimate for the Italy dummy variable means that on average the case load for Italy was \(exp(2.038)\) = 7.7 times higher than for Germany. The estimate for \(n_{day}\) is the interesting part, because it gets multiplied by the number of days. We therefore can use it as an approximate daily growth rate, that is the number of cases will grow with \(~16.1\%\). The adjusted \(R^2\) gives us a high confidence into the results. Furthermore, we can approximate the number of days which Italy is ahead of Germany by dividing the estimate for the dummy through the estimate for the daily growth rate: \(2.038 / 0.161\) = 12.7.
In the next step I am looking at the national growth rates.
lm_de = lm(l_confirmed ~ n_day, data = ts_filter_de)
summary(lm_de)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_de)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.181066 -0.104249 -0.001078 0.088674 0.187779
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.214693 0.133570 24.07 9.47e-09 ***
## n_day 0.155240 0.007444 20.86 2.93e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1352 on 8 degrees of freedom
## Multiple R-squared: 0.9819, Adjusted R-squared: 0.9797
## F-statistic: 435 on 1 and 8 DF, p-value: 2.93e-08
So, for Germany the current growth rate is \(~15.5\%\).
lm_it = lm(l_confirmed ~ n_day, data = ts_filter_it)
summary(lm_it)
##
## Call:
## lm(formula = l_confirmed ~ n_day, data = ts_filter_it)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16963 -0.19623 0.03263 0.36964 0.51604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.13407 0.19280 26.63 4.82e-14 ***
## n_day 0.16269 0.01305 12.47 2.56e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4455 on 15 degrees of freedom
## Multiple R-squared: 0.912, Adjusted R-squared: 0.9061
## F-statistic: 155.4 on 1 and 15 DF, p-value: 2.563e-09
And for Italy the current growth rate is \(~16.3\%\). Plotting shows us why the \(R^2\) is lower for Italy: In the first couply of days the growth rate was considerably higher than in the last 2 weeks.
On the basis of the full model, a very simplified prediction is possible, based on the assumption that the identified trend will continue for the next 7 days.
The reason why we are seeing an offset between last actual and first predicted value for Italy could be a first hint that the growth rate as estimated does not hold for the more recent development.