At the time this notebook is published, COVID-19 has become an international oandemic. Although the situation has been greatly improved in China, the initial source of this disease, Europe has become the new epidemic point. With the old population and a rapid speed of transmission, Europe has now become the newcenter of this pandemic, with a lot of countries reporting more than 1000 cases per day. Particularly, in Italy, the number of deceased cases has exceeded that of China, leading to the mortality rate of this country being 3-4 times higher than the average of the remaining part of the world.
Along with Italy, France, Spain and Germany are countries with which suffer the most from the pandemic. Take France as an example. After a long time without any strong measures, France had to go through and establish many aggressive steps, including:
March 16th, 2020: Once again, President Macron appeared in front of public to annouce the 15-day lockdown at the national level in the “war” against coronavirus. This means that:
With all that being said, getting to understand, analyse and predict the situation, at this moment, is extremely crucial.
Currently, R is one of the tools of choice for outbreak epidemiologists, with a huge range of libraries on CRAN and GitHub devoted to outbreak management and analysis. This post is considered as a brief introduction to a few of the excellent packages available in the R Epidemics Consortium (RECON) suite, as well as the use of base R and tidyverse packages for data wrangling and visualization.
The main idea and flow of this notebook is based on the work done previously by Professor Tim Churches, Senior Research Fellow at the UNSW Medicine South Western Sydney Clinical School at Liverpool Hospital, and a health data scientist at the Ingham Institute for Applied Medical Research, also located at Liverpool, Sydney. The author wants to send the most sincere thank to Prof. Tim Churches for his wonderful ideas and support upon conducting the analyses.
Between 2020-03-04 and 2020-03-07 and from 2020-03-10 onwards, the total number of infected cases by region and the total number of deceased cases nationally are provided by the French National Public Health Agency (Agence National de santé publique) on the website . For the remaining part of the dataset, the numbers are taken from the following sources:
As there are certain days where data for different regions came from different sources, the deviance of the collected data from the number annouced by the French Ministry of Health is inevitable. We will briefly investigate that difference in the following parts.
As mentioned above, from 2020-03-10 onwards, data on the infected cases will be updated at the end of each day by the National Public Health Agency. With that being said, the dataset will also be updated with the same pace.
First of all, we will load this dataset and visualize the number of incident infected cases and death cases everyday, at the national level.
## [1] "The current date is: 2020-03-22"
## [1] "The last date registered in the dataset is: 2020-03-21"
Now, we will calculate the mortality rate this virus caused to the French population, and compare it with data from the rest of the world.
On 2020-03-21, the mortality rate registered in France caused by COVID-19 is at 3.887%. This is lower than both the overall mortality rate of the world (4.293%) and of China (4.0084%)
Next, we will see the total number of cases by region.
Next, as mentioned previously, there are days within the dataset when data is collected from various different sources. As a result, it can be different from what has been announced by the French Ministry of Health on respective days. Here, we will have a look at the difference between the total number of cases collected and the total number of cases declared by the Government, by day.
We can see that, the difference between the two data sources are always less than 10%, with only 3 among those days having the difference greater than 5%. As a result, this dataset is stable and reliable enough to conduct any analysis.
The sections on Japan, South Korea, Italy and Iran use the earlyR and EpiEstim packages, also published by RECON. In particular, the function estimate_R() function in earlyR estimates the reproduction number of an epidemic, given the incidence time series and the serial interval dítribution; while the overall_infectivity() function in the EpiEstim package calculates \(\lambda\) (lambda), which is a relative measure of the current “force of infection” or infectivity of an outbreak: \[ \lambda = \sum_{s=1}^{t-1} {y_{s} w (t - s)} \] where \(w()\) is the probability mass function (PMF) of the serial interval, and \(y_s\) is the incidence at time \(s\).
The resulting \(\lambda\) “force of infection” plot indicates the daily effective infectiousness (subject to public health controls), with a projection of the diminution of the force of infection if no further cases are observed. The last date of observed data is indicated by the vertical blue dashed line. New cases are shown in a cumulative manner as black dots. It is a sign that the outbreak is being brought under control if \(\lambda\), as indicated by the orange bars, is falling prior to or at the date of last observation (as indicated by the vertical blue line). Note that left of the vertical blue line the \(\lambda\) values are projections, valid only in no further cases are observed. As such, the plot is a bit confusing, but it is nonetheless useful if interpreted with this explanation in mind. The RECON packages are all open-source, and easier-to-interpret plots of \(\lambda\) could readily be constructed.
The critical parameter for these calculation is the distribution of serial intervals (SI), which is the time between the date of onset of symptoms for a case and the dates of onsets for any secondary cases that case gives rise to. Typically a discrete \(\gamma\) distribution for these serial intervals is assumed, parameterised by a mean and standard deviation, although more complex distributions are probably more realistic. See the previous post for more detailed discussion of the serial interval, and the paramount importance of line-listing data from which it can be empirically estimated.
In this post, we will incorporate this uncertainty around the serial interval distribution by specifying a distribution of SI distributions for the estimation of the instantaneous effective reproduction number \(R_{e}\). We’ll retain the mean SI estimated by Li et al. of 7.5 days, with an SD of 3.4, but let’s also allow that mean SI to vary between 2.3 and 8.4 using a truncated normal distribution with an SD of 2.0. We’ll also allow the SD of the SD to vary between 0.5 and 4.0.
For the estimation of the force of infection \(\lambda\), for the serial interval we’ll use a discrete \(\gamma\) distribution with a mean of 5.0 days and a standard deviation of 3.4.
We can easily see that the \(\lambda\) value is increasing extremely fast, with no point of stopping or slowing down. This is reflected clearly through the speed of increase in cases of France, especially in recent days.
First of all, to omit any misbehavior, we will once again start with the date when the whole nation believes to be the first date that this pandemic truly happens, 25 February, 2020.
Taking a look at the object estimate_R_national, we can see that at the date this notebook is finish (2020-03-21), the mean reproduction number is at 2.54, with 95% Confidence Interval of 1.61 - 3.42. This means that, although the overall trend is decreasing, the reproduction number value is still high enough to consider the situation at this country as serious.
To fit a log-linear regression under the form of \[ \log(y) = ax + b, \] the RECON incidence package can be used. Usually, two models will be implemented, one for the growth phase and one for the decay phase, awpEtws by a peak. For example, this can be applied for the current situation in Wuhan (or China, in general), where the disease has come to the decay phase.
However, as we acknowledge that France is still in the growth phase with the peak estimated to arrive in at least two weeks, we will only fit one curve for the growth phase.
Note: During the time this analysis is conducted, the function plot() used to visualize the incident object is misbehaving. As a result, instead of representing the incidence object and fit object on the same graph, only the fit object is displayed.
We can easily see that our model above is not actually an excellent fit, because we are including the handful of very early cases that did not appear to establish sustained chains of local transmission. We can also obseerve that France does not register any new cases for a long enough period of time, which triggers our modelb. We will exclude those days, and start again with the day when the French government considered to be the beginning of the pandemic within the country, 25 February, 2020.
We can verify and compare the efficiency of the models through the fit statistics, including R-squared, adjusted R-squared and deviance as follows:
| Dates | R2 | Adjusted R2 | Deviance |
|---|---|---|---|
| From 2020-01-24 | 0.84 | 0.84 | 32.81 |
| From 2020-02-25 | 0.91 | 0.91 | 8.36 |
It can be clearly seen that the trimmed model works better, with higher R^2 and much lower deviance.
From the that model, we can extract various parameters of our interest about the current siuation: the growth rate is 0.25 (95% Confidence Interval 0.21 - 0.28), which is equivalent to a doubling time of 2.76 days (95% Confidence 2.44 - 3.18 days). This is seriously alarming, as the speed of increase is extremely high, while the medical system in France is already overwhelming at this moment.
We can also project how many cases might be expected in the 10 days, assuming that there would be no shock to the public health system control, and the testing system functions normally. One thing worths mentioning is that our model is fitted to predict only a few days ahead, knowing that during this time interval, the pandemic has not reached its peak. First, we’ll plot our predictions on a log scale graph.
On the linear scale, the prediction looks like:
Up until now, all 20/20 regions of France, both within the metropolitane (13 regions) and overseas (7 regions). They are:
However, as the number of cases of French overseas regions are neglectable comparing to the number of cases within French metropolitane, and the incident cases are so few for any analysis, at this point, to be meaningful. As a result, we will only be considering only 13 metropolitane regions, as follows.
Assessment: The outbreak appears to be under control at this moment, with the overall trend of R decreasing and approaching 1.
Assessment: After the orignial days with high values, the effective reproduction number has been brought back to around 1. However, this is still not an overall descending trend, meaning that the values at the final days are still fluctuating wildly.
Assessment: The situation now seems to be under control with the overall trend of the effective reproduction number dropping rapidly, nearly approaching 1.
Assessment: The time interval is too short to make any firm conclusions from the analysis; however, the pandemic is controlling well and the values of R keep decresaing, appoaching very near to 1.
Grand Est: Grand Est starts off with a terrible situation. The effective reproduction is extremely high. Up until March 9th, the values are still around 10. However, within the recent days, the pandemic seems to be under control. The battle is not yet to be won, but at least the region itself is doing a great job.
Assessment: Although the overall trend is decreasing, we can easily see some fluctuations between the days. This means that, although the situation seems under control, there are potential risks that can ruin the efforts at any time.
Assessment: Ile-de-France is now the most heavily infected region within the whole country. Just like Grand Est, the region starts of with a very high value of R. However, although the overall trend is decreasing, we can see that there are still fluctuations, and the final effective reproduction number of yesterday is nowhere near 1. This means that the region has a lot to do to control this pandemic.
Assessments: Just like most of the other regions, Normandie is doing its job well. Although starting of with a high value of R, a strong overall descending trend and the values approaching 1 is promising enough for a near future possibility of controlling the outbreak well.
Assessment: This is a typical “two-wave pattern”, with the last day witnessing an upward trend. Although the overall trend is decreasing, there is no certainty in predicting the upcoming days of this region.
Assessment: The overall trend of R is decreasing. Specifically, there are two peaks out of this trend, one of which may be explained by the registration of data. The final value of R at around 2.5 is still not good enough to confirm that the region has controlled the pandemic well.
Assessment: This is the most unstable graph of the value of R of all metropolitane regions in France. There is no certain trend, and the values keep fluctuating up and down, making any analysis or prediction difficult at this moment.
Assessment: Once again, the overall trend is decreasing, which means that efforts to control the outbreak is proving effective. However, with a high value of R at around 3.0 at this moment, there is still a long way to fully control the situation.
With all the analysis that we have done previously, it seems that under strong movements by the French government, the country is following a correct pathway in the progress of putting this pandemic under control.
Most specialists in the field believe that France is the second version of Italy in this pandemic, with the delay of 7-9 days. However, personally speaking, I do not believe that the same outcome will happen to France. With early, strong and strict measures, along with a world class medical system, France will have their own way in controlling this outbreak as fast and as effective as possible.
Churches, Tim. 2020a. “COVID-19 Epidemiology with R.” https://rviews.rstudio.com/2020/03/05/covid-19-epidemiology-with-r/.
———. 2020b. “Tim Churches Health Data Science Blog: Analysing Covid-19 (2019-nCoV) Outbreak Data with R - Part 1.” https://timchurches.github.io/blog/posts/2020-02-18-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-1/.
———. 2020c. “Tim Churches Health Data Science Blog: Analysing Covid-19 (2019-nCoV) Outbreak Data with R - Part 2.” https://timchurches.github.io/blog/posts/2020-03-01-analysing-covid-19-2019-ncov-outbreak-data-with-r-part-2/.
Cori, Anne, Simon Simon Cauchemez, and Neil M. et al. Ferguson. 2019. EpiEstim: Estimate Time Varying Reproduction Numbers from Epidemic Curves. https://cran.r-project.org/web/packages/EpiEstim/index.html.
Jombart, Thibaut, Anne Cori, and Pierre Pierre Nouvellet. 2017. EarlyR: Estimation of Transmissibility in the Early Stages of a Disease Outbreak. https://cran.r-project.org/web/packages/earlyR/index.html.
Jombart, Thibaut, Zhian N. Zhian N. Kamvar, and Rich et al. FitzJohn. 2019. Incidence: Compute, Handle, Plot and Model Incidence of Dated Events. https://cran.r-project.org/web/packages/incidence/index.html.