![]()
A photo of the Ebola virus
A new EVD outbreak has been notified in the small city of Ankh, located in the Northern, rural district of the Republic of Morporkia. Public Health Morporkia (PHM) is in charge of coordinating the outbreak response, and have contracted you as a consultant in epidemics analysis to inform the response in real time.
![]()
Ankh, Republic of Morkporkia
We’ll load in various R libraries, and three documents that the Ministry of Health has provided us. First, the linelist of 12 confirmed Ebola cases. Second, a list of 11 contacts from these cases. Last, a document that tells R how to clean the messy data.
library(rio) library(ggplot2) library(dplyr) library(magrittr) library(outbreaks) library(incidence) library(epicontacts) library(linelist) library(distcrete) library(epitrix) library(earlyR) library(projections) library(tidyverse) library(sjPlot) library(jtools)
linelist_raw <- rio::import(“phm_evd_linelist_2017-10-27.xlsx”) contacts_raw <- rio::import(“phm_evd_contacts_2017-10-27.xlsx”) cleaning_rules <- rio::import(“phm_evd_cleaning_rules.xlsx”)
In the file we’ve been sent, the dates and the sex variable are not standardized. Location names are misspelled. In the contacts list, all we have is a case ID for the contact and it’s connection to the confirmed case.
## id Date of Onset Date.Report. SEX. Âge location
## 1 39e9dc 43018 43030 Female 62 Pseudopolis
## 2 664549 43024 43032 Male 28 peudopolis
## 3 B4D8AA 43025 “23/10/2017” male 54 Ankh-Morpork
## 4 51883d “18// 10/2017” 22-10-2017 male 57 PSEUDOPOLIS
## 5 947e40 43028 “2017/10/25” f 23 Ankh Morpork
## 6 9aa197 43028 “2017-10-23” f 66 AM
## 7 e4b0a2 43029 “2017-10-24” female 13 Ankh Morpork
## 8 AF0AC0 “2017-10-21” “26-10-2017” M 10 ankh morpork
## 9 185911 “2017-10-21” “26-10-2017” female 34 AM
## 10 601D2E “2017/10/22” “28-10-2017” <NA> 11 AM
## 11 605322 43030 “28 / 10 / 2017” FEMALE 23 Ankh Morpork
## 12 E399B1 23 / 10 / 2017 “28/10/2017” female 23 Ankh Morpork
## Source Case #ID case.id
## 1 51883d 185911
## 2 b4d8aa e4b0a2
## 3 39e9dc b4d8aa
## 4 39E9DC 601d2e
## 5 51883d 9AA197
## 6 39e9dc 51883d
## 7 39E9DC e399b1
## 8 b4d8aa AF0AC0
## 9 39E9DC 947e40
## 10 39E9DC 664549
## 11 39e9dc 605322
Once we run it through our cleaning rules, the data looks better.
| id | date of_onset | date report | sex | age | location |
|---|---|---|---|---|---|
| 39e9dc | 2017-10-10 | 2017-10-22 | female | 62 | pseudopolis |
| 664549 | 2017-10-16 | 2017-10-24 | male | 28 | pseudopolis |
| b4d8aa | 2017-10-17 | 2017-10-23 | male | 54 | ankh_morpork |
| 51883d | 2017-10-18 | 2017-10-22 | male | 57 | pseudopolis |
| 947e40 | 2017-10-20 | 2017-10-25 | female | 23 | ankh_morpork |
| 9aa197 | 2017-10-20 | 2017-10-23 | female | 66 | ankh_morpork |
| e4b0a2 | 2017-10-21 | 2017-10-24 | female | 13 | ankh_morpork |
| af0ac0 | 2017-10-21 | 2017-10-26 | male | 10 | ankh_morpork |
| 185911 | 2017-10-21 | 2017-10-26 | female | 34 | ankh_morpork |
| 601d2e | 2017-10-22 | 2017-10-28 | unknown | 11 | ankh_morpork |
| 605322 | 2017-10-22 | 2017-10-28 | female | 23 | ankh_morpork |
| e399b1 | 2017-10-23 | 2017-10-28 | female | 23 | ankh_morpork |
| source case_id | case id |
|---|---|
| 51883d | 185911 |
| b4d8aa | e4b0a2 |
| 39e9dc | b4d8aa |
| 39e9dc | 601d2e |
| 51883d | 9aa197 |
| 39e9dc | 51883d |
| 39e9dc | e399b1 |
| b4d8aa | af0ac0 |
| 39e9dc | 947e40 |
| 39e9dc | 664549 |
| 39e9dc | 605322 |
Next, we will connect the contacts to their cases, and map the linkages. We can see here that a few cases have spread EVD to multiple people. You can scroll over the icons to see the characteristics of these “superspreaders”. We have identified many contacts, but not all of them. There is lots of work to do.
Next we want to find the incidence of Ebola in the community. We will make a simple epi curve by date of symptom onset. It is important to note that the graph includes data through Oct. 28, there just haven’t been any reported cases from Oct. 23 to Oct. 28.
The simplest model of incidence is probably the log-linear model. This is a linear regression on log-transformed incidences. We will estimate the parameters of the model (the model fit, f) based on the outbreak’s incidence (i). When we look at the model fit, we will derive the estimates of the growth rate (r) and the doubling time of the outbreak. Here, we’ve added the model to the incidence plot.
We see the estimated daily growth rate is 0.05 (-0.03, 0.14) and that the outbreak is doubling every 12 days.
## <incidence_fit object>
##
## $model: regression of log-incidence over time
##
## $info: list containing the following items:
## $r (daily growth rate):
## [1] 0.05352107
##
## $r.conf (confidence interval):
## 2.5 % 97.5 %
## [1,] -0.0390633 0.1461054
##
## $doubling (doubling time in days):
## [1] 12.95092
##
## $doubling.conf (confidence interval):
## 2.5 % 97.5 %
## [1,] 4.744158 -17.74421
##
## $pred: data.frame of incidence predictions (8 rows, 5 columns)
We’ll now add the log-linear fit to our graph of outbreak data.
How would you interpret this result? What criticism would you make on this model?
It works fine.
The transmissibility of the disease can be assessed through the estimation of the reproduction number R, defined as the number of expected secondary cases per infected case. In the early stages of an outbreak, and assuming no immunity in the population, this quantity is also the basic reproduction number R0, i.e. R in a fully susceptible population.
The package earlyR implements a simple maximum-likelihood estimation of R, using dates of onset of symptoms and information on the serial interval distribution.
As current data are insufficient to estimate the serial interval distribution, some colleague recommends using data from a past outbreak stored in the outbreaks package, as the dataset ebola_sim_clean.
We load in this large dataset (6,000 cases, 4,000 contacts) in order to estimate the serial number from this outbreak.
The function get_pairwise can be used to extract pairwise features of contacts based on attributes of the linelist. For instance, it could be used to test for assortativity, but also to compute delays between connected cases. Here, we use it to extract the serial interval:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 9.00 11.06 15.00 99.00
Removing the missing variables, we get this five number summary of the serial interval. The mean serial interval is 11.06.
Here is the empirical distribution of the serial intervals, if we were interested.
What do you think of this distribution?
Then we’ll use the function fit_disc_gamma from the package epitrix to fit a discretised Gamma distribution to these data.
## $mu
## [1] 11.55634
##
## $cv
## [1] 0.6493035
##
## $sd
## [1] 7.50357
##
## $ll
## [1] -6939.76
##
## $converged
## [1] TRUE
##
## $distribution
## A discrete distribution
## name: gamma
## parameters:
## shape: 2.37194455151125
## scale: 4.87209431456314
si_fit contains various information about the fitted delays, including the estimated distribution in the $distribution slot.
You can compare this distribution to the empirical data in the following plot:
Would you trust this estimation of the generation time? How would you compare it to actual estimates from the West African EVD outbreak (WHO Ebola Response Team (2014) NEJM 371:1481–1495) with a mean of 15.3 days and a standard deviation 9.3 days?
The model somewhat fits the data. The serial interval we’ve estimated is lower than what WHO published in 2014, closer to 11 than 15.
Now that we have estimates of the serial interval based on a previous outbreak, we can use this information to estimate transmissibility of the disease (as measured by R0) in the current outbreak.
Using the estimates of the mean and standard deviation of the serial interval you just obtained, we’ll use the function get_R to estimate the reproduction number, specifying a maximum R of 10.
##
## /// Early estimate of reproduction number (R) //
## // class: earlyR, list
##
## // Maximum-Likelihood estimate of R ($R_ml):
## [1] 1.821822
##
##
## // $lambda:
## 0.0414587 0.05362817 0.06170817 0.06622619 0.06786168 0.06728274...
##
## // $dates:
## [1] "2017-10-11" "2017-10-12" "2017-10-13" "2017-10-14" "2017-10-15"
## [6] "2017-10-16"
## ...
##
## // $si (serial interval):
## A discrete distribution
## name: gamma
## parameters:
## shape: 2.37194455151125
## scale: 4.87209431456314
## NULL
This graph shows us the possible range of R values, with the most likely value being 1.822.
We can also graph the force of infection over time, based on our current 12 cases.
Interpret these results: what do you make of the reproduction number? What does it reflect? Based on the last part of the epicurve, some colleagues suggest that incidence is going down and the outbreak may be under control. What is your opinion on this?
R0 is significantly higher than 1, which is concerning. Incidence may be 0 for the last few days, but that is not enough to say it is going down. The force of infection graph shows us that Ebola is still transmitting quickly in the community through the last days of October. This would suggest there will be many more cases through early November.
The function project from the package projections can be used to simulate plausible epidemic trajectories by simulating daily incidence using the same branching process as the one used to estimate R0 in earlyR. All that is needed is one or several values of R0 and a serial interval distribution, stored as a distcrete object.
Here we have generated 5 random trajectories of Ebola over the next few days. In this case, we used a fixed R value. Here we are showing the random trajectories over the next four days.
##
## /// Incidence projections //
##
## // class: projections, matrix
## // 10 dates (rows); 5 simulations (columns)
##
## // first rows/columns:
## [,1] [,2] [,3] [,4] [,5]
## 2017-10-28 2 2 3 1 0
## 2017-10-29 0 1 1 2 1
## 2017-10-30 0 3 4 0 2
## 2017-10-31 0 4 6 0 1
## .
## .
## .
##
## // dates:
## [1] "2017-10-28" "2017-10-29" "2017-10-30" "2017-10-31" "2017-11-01"
## [6] "2017-11-02" "2017-11-03" "2017-11-04" "2017-11-05" "2017-11-06"
Next, we’ll generate a graph of 1,000 possible trajectories over the next 30 days. We’ll use an unfixed R value, which will produce slightly more varied (dramatic) projections.
Now we’ll combine this with the case data we already have. This is what the Ministry of Health will be most interested in. In gray we see the currently collected data, and the light blue shows the confidence intervals of the potential trajectory of cases.
Last, we’ll look just at the projection of mean number of cases per day over the next 30 days.
## 2017-10-28 2017-10-29 2017-10-30 2017-10-31 2017-11-01 2017-11-02 2017-11-03
## 0.708 0.734 0.729 0.742 0.750 0.779 0.770
## 2017-11-04 2017-11-05 2017-11-06 2017-11-07 2017-11-08 2017-11-09 2017-11-10
## 0.779 0.818 0.831 0.832 0.856 0.851 0.868
## 2017-11-11 2017-11-12 2017-11-13 2017-11-14 2017-11-15 2017-11-16 2017-11-17
## 0.888 0.919 0.910 0.926 0.929 0.936 0.931
## 2017-11-18 2017-11-19 2017-11-20 2017-11-21 2017-11-22 2017-11-23 2017-11-24
## 0.954 0.955 0.953 0.959 0.971 0.973 0.967
## 2017-11-25 2017-11-26
## 0.985 0.983
According to these results, what are the chances that more cases will appear in the near future? Is this outbreak being brought under control? Would you recommend scaling up / down the response?
There will be an average of 0.7 (and increasing) number of new cases per day for the next month. More cases will appear in the future.
In conclusion, this outbreak is not under control. It is likely in it’s initial increasing phase. I would recommend scaling up the response.