EPI 569 – Homework 4: Calculating the effective reproduction number

library(kableExtra, warn.conflicts=F, quietly=T)
library(dplyr, warn.conflicts=F, quietly=T)
library(ggplot2, warn.conflicts=F, quietly=T)
library(EpiEstim, warn.conflicts=F, quietly=T)

Background 1

A large outbreak of norovirus occurred at the Rushmore Academy boarding school last fall. Here’s what we know about the outbreak:
- On October 16, 2019, a student returning from fall break vomited in the auditorium
- Unfortunately, this vomiting incidence took place during a lecture, when the auditorium was full, and initiated the largest norovirus outbreak to ever take place at Rushmore.
- Thanks to an extensive investigation by the Rushmore Department of Public Health (RDPH) and participation among the community, all cases involved in this outbreak were identified and interviewed.
- There were cases among both students and staff, but no cases outside of Rushmore
- Despite thorough and regular cleaning of all classrooms and common areas with bleach, promoting enhanced hand hygiene measures through signage and messaging, and requiring all students and staff with gastrointestinal symptoms to remain home until at least 48 hours after symptoms resolved, the outbreak persisted until the end of November.
- Laboratory testing confirmed the cause of the outbreak: norovirus.
Your job is to analyze the dataset to evaluate the transmission dynamics of the fall 2019 noroviurs outbreak.

Background 2

The line list for the Rushmore Academy norovirus outbreak is available on Canvas (Rushmore_Noro_LineList.csv). In this line list, key information about each case in the outbreak was recorded, with each row representing a case and each column representing a variable. Variables include demographic, clinical, and epidemiologic information about each case.
Import the line list into R and answer the questions below. Make sure to also run the code that converts the onset date variable to a date format so that later code runs properly.

linelist <- read.csv("https://raw.githubusercontent.com/blopman/epi569/master/Rushmore_Noro_LineList.csv") 
# Read in the data

# Changing the onset date variable into the R date format
linelist$onset_date <- as.Date(as.character(linelist$onset_date), format="%m/%d/%Y")
#code to answer Q1#
table (linelist$lab.confirmed)

## 
##   N   Y 
## 103   6

table (linelist$hospitalized)

## 
##   N 
## 109

table(linelist$gender)

## 
##  F  M 
## 77 32

table (linelist$case_type)

## 
##            Staff Student_MPHyear1 Student_MPHyear2      Student_PhD 
##                7               67               28                7

median (linelist$age)

## [1] 24

IQR (linelist$age)

## [1] 3

tapply(linelist$age, linelist$case_type, median)

##            Staff Student_MPHyear1 Student_MPHyear2      Student_PhD 
##               56               23               25               30

tapply(linelist$age, linelist$case_type, IQR)

##            Staff Student_MPHyear1 Student_MPHyear2      Student_PhD 
##             12.5              2.0              2.0              7.0

Question 1 (0.5pt)

Start your analysis with a basic description of the outbreak by answering the following questions:
- What were the first and last illness onset dates?
ANSWER FIRST: 10/16/2019 LAST: 11/27/2019
- How many days did the outbreak last (counting the first and last illness onset dates as the first and last days of the outbreak)?
ANSWER 42 days
How many cases total cases were there?
ANSWER 109 cases
- What number and percent of cases were lab-confirmed?
ANSWER 6/109 (5.5%)
- What number and percent of cases were hospitalized? * ANSWER 0/109 (0%)
- What number and percent of cases were female?
ANSWER 77/109 (70.6%)
What number and percent of cases were students? Staff?
ANSWER STUDENTS: 102/109 (93.6%) STAFF: 7/109 (6.4%)
Of student cases, what number and percent freshman, sophmores and juniors?
ANSWER Student_MPHyear1 = 67/102 (65.7%) Student_MPHyear2 = 28 /102 (27.5%) Student_PhD =7/102 (6.9%)
What was the median age of cases overall and by “case type”? Include the interquartile ranges (IQR).
ANSWER
Median (IQR) Age
all cases: 24 (3) years
Staff: 56 (12.5) years
Student_MPHyear1: 23 (2.0) years
Student_MPHyear2: 25 (2.0) years
Student_PhD: 30 (7.0) years
Next, use the code below to plot the epidemic (epi) curve for this outbreak. Symptom onset dates will be plotted on the x-axis and case counts on the y-axis.

# Plotting the epidemic curve

epicurve <- as.data.frame(linelist %>% group_by(onset_date) %>% summarize(incidence=n()))

## `summarise()` ungrouping output (override with `.groups` argument)

## `summarise()` ungrouping output (override with `.groups` argument)
plot1 <- ggplot(data=epicurve, aes(x=onset_date, y=incidence, labs=FALSE)) + 
         geom_bar(stat="identity", fill="gray45") +
         scale_x_date(date_breaks = "2 days", date_labels = "%m/%d") +
         scale_y_continuous(breaks = seq(0,15, by = 2)) +
         theme(axis.text.x = element_text(angle = 45)) +
         labs(x = "Symptom Onset Date", y = "Number of Cases", title = "Epi Curve for Rushmore Academy Norovirus Outbreak")
plot1

Question 2 (1pt)

Examine the epidemic curve above and answer the following questions:
- How many peaks are there?
- ANSWER - THis looks like propogated waves with at least two peaks and possibly a tiny third peak that would likely be difficult to differentiate signal from the noise. The first large peak occurs around 10/23, the second one smaller occurs around 11/13 and third (ish) occurs around 11/21.
- At what time(s) do cases start to decline?
- ANSWER around the end of October (it appears to be October 25).
- Does the epidemic curve support the theory that the outbreak began with the vomiting incident in the CNR auditorium on 10/16? Briefly explain why or why not.
- ANSWER Yes. This is because there is an incubation period-sized pause in the cases after the first case, then all cases occur after the auditorium incident. Additionally, there are no retroactively discovered cases outlined in the data.
- Does there appear to be sustained person-to-person transmission after the initial vomiting incidence? Briefly explain why or why not.
- ANSWER Yes, this is because there are subsequent cases with smaller peaks.
- Why is it difficult to determine who infected whom (i.e., the infector-infected pairs) from this epidemic curve?
ANSWER - because we are only seeing the aggregated case numbers per day in an epi curve (eg the first case followed by the bulk of cases occuring a few days later). We aren’t seeing the cases at the level of contacts.

Background 3

Next, we’ll calculate effective reproduction numbers by onset date (Rt) to examine how infectiousness changes throughout the outbreak.
Because there are no clear infector-infected pairs that we can identify from this outbreak (other than perhaps the first case and cases that occurred 2-3 days later) we cannot determine the distribution of the serial interval (the time between symptom onset in primary cases and the secondary cases they generate) from these data. Therefore, to calculate (Rt), we will use the serial interval distribution for norovirus previously estimated from an outbreak in Sweden for which infector-infected pairs were known (Heijne, et al.Emerg Infect Dis. 2009). This probability distribution for the serial interval is gamma distributed with mean = 3.6 days and SD = 2.0 days.
We will use the “wallinga_teunis” function from the “EpiEstim” package to calculate R(t).
Run the code below to calculate and plot R(t) estimates and 95% confidence intervals. This code will also print the serial interval distribution used for the estimates.

# First, convert the data to a format that can be used in the EpiEstim wallinga_teunis function 
# Data must be in the following format: 
### 1 column for symptom onset dates in ascending order, including dates on which 0 cases were reported, titled "dates"
### 1 column for case counts (incidence) titled "I"
### Note: to calculate an Rt estimate for day day 1 of the outbreak, we must start our epi curve 2 days prior the first symptom onset date

epicurve2 <- epicurve %>% arrange(onset_date) %>% rename(dates = onset_date, I = incidence)

all.dates <- as.data.frame(seq(as.Date("2019-10-14"), by = "day", length.out = 45))
names(all.dates) <- "dates"

epicurve.epiestim <- merge(x=epicurve2, y=all.dates, by="dates", all="TRUE")
epicurve.epiestim <- epicurve.epiestim %>% mutate(I = ifelse(is.na(I), 0, I)) 


# Next, run the code below to estimate Rt, along with 95% confidence intervals for Rt estimates
# This requires that we specify the mean and standard deviation of the serial interval  
# An offset gamma distribution will be used for the serial interval (by default)

mean_si <- 3.6   
std_si <- 2.0

estimates <- wallinga_teunis(epicurve.epiestim$I, 
                             method="parametric_si",
                             config = list(t_start = seq(3, 45), 
                                           t_end = seq(3, 45),
                                           mean_si = mean_si, 
                                           std_si = std_si, 
                                           n_sim = 1000))


# You can examine the serial interval distribution using the code below

plot(estimates$si_distr)

# Lastly, use the code below to plot the Rt estimates and 95% CIs over the epi curve to examine trends

plot2.data <- cbind(epicurve, estimates$R$`Mean(R)`,
                    estimates$R$`Quantile.0.025(R)`, estimates$R$`Quantile.0.975(R)`)
names(plot2.data) <- c("dates", "I", "R", "lowerCI", "upperCI")

plot2 <- ggplot(data=plot2.data, aes(x=dates, y=I, labs=FALSE)) + 
  geom_bar(stat="identity", fill="gray45") +
  scale_x_date(date_breaks = "2 days", date_labels = "%m/%d") +
  scale_y_continuous(breaks = seq(0,15, by = 2)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(x = "Symptom Onset Date", 
       y = "Number of Cases",
       title = "Epi Curve for Rushmore Academy Norovirus Outbreak") +
  geom_hline(aes(yintercept=1), colour="red", linetype="dashed", size=0.5) +
  geom_errorbar(data=plot2.data, aes(ymax=upperCI, ymin=lowerCI, width=0.6),stat="identity", size=0.8, show.legend=FALSE) +
  geom_line(data=plot2.data[!is.na(plot2.data$R),],aes(x=dates, y=R), color='blue', size=0.5) +
  geom_point(data = plot2.data, aes(x=dates, y=R), size=1.2, show.legend=FALSE) 
plot2

Question 3 (1pt)

Examine the R(t) estimates and answer the following questions:
- How does infectiousness change over the course of the outbreak? ANSWER When Rt is above 1, the disease is spreading and we are in an outbreak. The infectiousness peaks early with an Rt of 4 (meaning 4 people are infected on each of the first few days of the outbreak. Then it lowers to below one for a time and peaks back up to just shy of 3. right before the second wave. It does not appear to rise above one after 11/12.
- Why does the individual with the last illness onset date (11/27) have a R(t) of 0 (95% CI: 0, 0)? ANSWER The individual with the last illness onset has an Rt of 0 because they do not infect anyone else.
- Why do cases with onset dates on days with large case counts (e.g., 10/25 with n=15) have relatively small R(t) estimates, despite there being several cases that occur after these dates? ANSWER Because the days with large case counts are the number of cases that were infected about two days prior. When there is a smaller R(t) there will be fewer cases on subsequent days.
- Why does the outbreak continue even after Rt falls below 1?
ANSWER - There is still some spread because R(t) is not zero, R(t) falling below 1 just means, on average, there are fewer than 1 case infected per person and the number of new cases will drop.

Question 4 (1pt)

Using the code below, change the mean of the serial interval distribution (keeping the standard deviation the same) and examine how the R(t) estimates change. Answer the following questions regarding the first symptom onset date (10/16):
- What happens to the R(t) estimate when you decrease the mean (e.g., to 2 days)?
- ANSWER - When the mean serial interval is reduced, the maxiumums on the R(t) estimate are reduced. For example, if the mean Serial Interval is halved, the peak R(t) is also halved.
- What happens to the R(t) estimate when you increase the mean (e.g., to 6 days)?
- ANSWER - When the mean serial interval is increased, the peak R(t) is exponentially increased.
- Briefly explain why you see these changes.
ANSWER - calculating R(t) - a measure of infectiousness - depends on the serial interval. If the mean serial interval is longer, initially, more people will be infected. If the serial interval is shorter, fewer people are being infected. The mean serial interval directly relates to the infectiousness.

mean_si <- 3.6# Change this to answer Question 4   
std_si <- 2.0

estimates <- wallinga_teunis(epicurve.epiestim$I, 
                             method="parametric_si",
                             config = list(t_start = seq(3, 45), 
                                           t_end = seq(3, 45),
                                           mean_si = mean_si, 
                                           std_si = std_si, 
                                           n_sim = 1000))

plot2.data <- cbind(epicurve, estimates$R$`Mean(R)`,
                    estimates$R$`Quantile.0.025(R)`, estimates$R$`Quantile.0.975(R)`)
names(plot2.data) <- c("dates", "I", "R", "lowerCI", "upperCI")

plot2 <- ggplot(data=plot2.data, aes(x=dates, y=I, labs=FALSE)) + 
  geom_bar(stat="identity", fill="gray45") +
  scale_x_date(date_breaks = "4 days", date_labels = "%m/%d") +
  scale_y_continuous(breaks = seq(0,15, by = 2)) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(x = "Symptom Onset Date", 
       y = "Number of Cases",
       title = "Epi Curve for Rushmore Academy Norovirus Outbreak") +
  geom_hline(aes(yintercept=1), colour="red", linetype="dashed", size=0.5) +
  geom_errorbar(data=plot2.data, aes(ymax=upperCI, ymin=lowerCI, width=0.6),stat="identity", size=0.8, show.legend=FALSE) +
  geom_line(data=plot2.data[!is.na(plot2.data$R),],aes(x=dates, y=R), color='blue', size=0.5) +
  geom_point(data = plot2.data, aes(x=dates, y=R), size=1.2, show.legend=FALSE) 
plot2

ANSWER

Question 5 (1pt)

List 3 limitations of using the Wallinga-Teunis method to estimate R(t).
ANSWER One should be careful using the W-T method to estimate R(t), because it is sensitive to the assumptions about the serial interval. 1) it assumes transmission only occurs among the cases reported in the Epi Curve. (that is missing cases impact the results); 2) It assumes that asymptomatic transmission is not occuring; 3) that all reported cases are a part of the same outbreak (not belonging to another outbreak).

Question 6 (0.5 pt)

Why would this approach not work well for estimating R(t) for a disease that is not primarily spread person-to-person (e.g., vectorborne, waterborne, or foodborne)?
ANSWER
With vectorborne, waterborne, and foodborne cases, not all cases are captured as we don’t know where the vectors are going or how far the food or water contamination could have gone. For example, we will likely not have every case from a contaminated food recall. Additionally, there may be asymptomatic carriage/ transmission of these diseases. This will limit the use of W-T in these situations.

EPI 569 – Homework 4: Calculating the effective reproduction number

Carly Adams

9/1/2020

Background 1

Background 2

Question 1 (0.5pt)

Question 2 (1pt)

Background 3

Question 3 (1pt)

Question 4 (1pt)

Question 5 (1pt)

Question 6 (0.5 pt)