Salmonella Outbreak Investigation

Context

On 26 June 1998, the St Sebastian High School in Stegen (school A), Germany, celebrated a graduation party, where 250 to 350 participants were expected. Attendants included graduates from that school, their families and friends, teachers, 12th grade students, and some graduates from a nearby school (school B).

A self-service party buffet was supplied by a commercial caterer from a the nearby city of Freiburg. Food was prepared on the day of the party and transported in a refrigerated van to the school.

The alert

On 2nd July 1998, the Freiburg local health office reported to the Robert Koch Institute (RKI) in Berlin a significant increase of cases of gastroenteritis in the municipality of Stegen following the graduation party described above. More than 100 cases were suspected among participants and some of them were so ill that they were admitted to nearby hospitals. Sick people showed symptoms like; fever, nausea, diarrhoea and vomiting lasting for several days. Stool samples were collected from several ill cases and sent to the laboratory in Freiburg for microbiological identification. Salmonella enteritidis was identified in 19 stool samples.

In response to the large number of cases associated with the dinner the Freiburg health office sent an outbreak control team to investigate the kitchen of the caterer in Freiburg. Food preparation procedures were reviewed. Samples were taken of remaining food items after the graduation ceremony and were sent to the laboratory of Freiburg University. Microbiological analyses were performed on samples of the following food items: brown chocolate mousse, caramel cream, remoulade sauce, yoghurt dill sauce, and 10 raw eggs. Unfortunately no tiramisu was left over after the dinner so no samples were tested of this specific food item.

The Freiburg health office requested help from the RKI to assess the magnitude of the outbreak and identify potential vehicle(s) and risk factors for transmission in order to better control the outbreak.

The epidemiological study

Case definition: cases were defined as any person who had attended the party at St Sebastian High School and who suffered from diarrhoea (min. 3 loose stool for 24 hours) with the onset between 27 June and 29 June 1998, or from at least three of the following symptoms: vomiting, fever over 38.5° C, nausea, abdominal pain, headache.

Students from both schools attending the party were asked through phone interviews to provide names of persons who attended the party.

Overall, 291 responded to enquiries and 103 cases were identified. The linelist analysed in this case study was built from these 291 responses.

Initial Data Processing

Importing data

We’ll read in a raw file of line level data, as well as R packages that allow us to clean and manipulate our data, calculate incidence, calculate other epi measures, and create maps.

Data cleaning

We use epitrix’s function clean_labels() to standardise the variable names. We convert the unique identifiers to character strings (character), dates of onset to actual Date objects, and sex and illness are set to categorical variables (factor), which allows us to represent the variables as numbers so that we can do modelling, but allows us to add meaningful labels to them. We use the function recode() from the dplyr package to recode sex and ill more explicitely (while keeping them as number variables). Finally we look in more depth into these variables having maximum values of 9, where we expect 0/1 - pork, salmon, and horseradish. We recode the 9s to NAs.

stegen_old <- stegen # save 'dirty data'

new_labels <- clean_labels(names(stegen)) # generate standardised labels
new_labels # check the result

##  [1] "unique_key"  "ill"         "date_onset"  "sex"         "age"        
##  [6] "tiramisu"    "tportion"    "wmousse"     "dmousse"     "mousse"     
## [11] "mportion"    "beer"        "redjelly"    "fruit_salad" "tomato"     
## [16] "mince"       "salmon"      "horseradish" "chickenwin"  "roastbeef"  
## [21] "pork"        "latitude"    "longitude"

names(stegen) <- new_labels

# Edit variables
stegen$unique_key <- as.character(stegen$unique_key)
stegen$sex <- factor(stegen$sex)
stegen$ill <- factor(stegen$ill)
stegen$date_onset <- as.Date(stegen$date_onset)

# We use the function recode() from the dplyr package to recode sex and ill more explicitely.
stegen$sex <- recode_factor(stegen$sex, "0" = "male", "1" = "female")
stegen$ill <- recode_factor(stegen$ill, "0" = "non case", "1" = "case")

# recode the 9s to NAs
stegen$pork[stegen$pork == 9] <- NA
stegen$salmon[stegen$salmon == 9] <- NA
stegen$horseradish[stegen$horseradish == 9] <- NA

Initial data exploration

What is the breakdown by gender and age?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   18.00   20.00   26.66   27.00   80.00       8

##   male female 
##    139    152

What is the difference in age by gender?

## $male
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   12.00   17.50   19.00   26.38   28.00   80.00       4 
## 
## $female
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   13.00   18.00   20.00   26.93   24.50   65.00       4

How many cases are there?

## non case     case 
##      188      103

What is the gender breakdown of the cases?

## $male
## non case     case 
##       86       53 
## 
## $female
## non case     case 
##      102       50

Initial Graphical Exploration

The summaries above may be useful for reporting purposes, but graphics are usually better for getting a feel for the data.

Quick histogram of the population by age and gender

Here, we see that the age distribution is pretty much identical between male and female.

Epidemic curve

Epicurves (counts of incident cases) can be built using the package incidence, which will compute the number of new cases given a vector of dates (here, onset) and a time interval (1 day by default). We use the function incidence() to achieve this, and then visualise the results:

## <incidence object>
## [131 cases from days 1998-06-26 to 1998-07-09]
## 
## $counts: matrix with 14 rows and 1 columns
## $n: 131 cases in total
## $dates: 14 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 14 days
## $cumulative: FALSE

If we want the number of reported illness on specific dates, we can report out the list:

##         dates counts
## 1  1998-06-26      1
## 2  1998-06-27     56
## 3  1998-06-28     51
## 4  1998-06-29     10
## 5  1998-06-30      3
## 6  1998-07-01      3
## 7  1998-07-02      3
## 8  1998-07-03      0
## 9  1998-07-04      1
## 10 1998-07-05      1
## 11 1998-07-06      1
## 12 1998-07-07      0
## 13 1998-07-08      0
## 14 1998-07-09      1

How long is this outbreak? It looks like most cases occurred over the course of 3 days, but that cases kept showing up 10 days after the peak. Is this true? Not really. Stratifying the epidemic curve by case definition will clarify the situation. We’ll also plot the cases and non-cases to get the visual representation.

## <incidence object>
## [131 cases from days 1998-06-26 to 1998-07-09]
## [2 groups: non case, case]
## 
## $counts: matrix with 14 rows and 2 columns
## $n: 131 cases in total
## $dates: 14 dates marking the left-side of bins
## $interval: 1 day
## $timespan: 14 days
## $cumulative: FALSE

##         dates non case case
## 1  1998-06-26        1    0
## 2  1998-06-27        8   48
## 3  1998-06-28        5   46
## 4  1998-06-29        2    8
## 5  1998-06-30        3    0
## 6  1998-07-01        3    0
## 7  1998-07-02        3    0
## 8  1998-07-03        0    0
## 9  1998-07-04        1    0
## 10 1998-07-05        1    0
## 11 1998-07-06        1    0
## 12 1998-07-07        0    0
## 13 1998-07-08        0    0
## 14 1998-07-09        1    0

Now we see that all the cases occurred within 3 days of the graduation ceremony. There were other cases of illness in the 10 day period, but they didn’t meet the case definition.

What is the age and gender distribution of the cases?

From these graphics, we can see that illness does not appear to depend on age or gender, which suggests that the illness was caused by one or more of the foods consumed. We will explore this with a risk ratio analysis in the next section.

The epidemiological investigation

In order to figure out which food item was most strongly associated with the illness, we need to test the Risk Ratio for all of the food items recorded. Risk ratios are normally computed on contingency tables (commonly known as 2x2 tables). In this part, we will compute contingency tables for each exposure (food item) against the defined cases, calculate their risk ratio with confidence intervals and p-values, and plot them as points and errorbars using ggplot2.

Isolating the variables to test

##  [1] "tiramisu"    "wmousse"     "dmousse"     "mousse"      "beer"       
##  [6] "redjelly"    "fruit_salad" "tomato"      "mince"       "salmon"     
## [11] "horseradish" "chickenwin"  "roastbeef"   "pork"

Calculating risk ratios

The risk ratio is defined as the ratio between the proportion of illness in one group (typically ‘exposed’) vs another (‘non-exposed’). We can get the risk ratio by using the riskratio() function from the epitools package. Here, we want to use Yates’ continuity correction and calculate Wald confidence intervals. In our case, we will choose the P-value from Fisher’s exact test of the exposure.

single_risk_ratio <- function(exposure, outcome) { # ingredients defined here
  et  <- epitable(exposure, outcome) # ingredients used here
  rr  <- riskratio(et)
  estimate <- rr$measure[2, ]
  res <- data.frame(estimate = estimate["estimate"],
                    lower    = estimate["lower"],
                    upper    = estimate["upper"],
                    p.value  = rr$p.value[2, "fisher.exact"]
  )
  return(res) # return the data frame
}

# the first part is the data, then the function to perform, then specify the outcome you're testing on
all_rr <- lapply(stegen[food], FUN = single_risk_ratio, outcome = stegen$ill)

# now reshape these results into a dataframe to view easily
all_food_df <- bind_rows(all_rr, .id = "exposure")

# now sort to make it descending by estimate (rr)
all_food_df <- arrange(all_food_df, desc(estimate))
all_food_df %>% knitr::kable()

exposure	estimate	lower	upper	p.value
tiramisu	18.3116883	8.8142022	38.042913	0.0000000
mousse	4.9689579	3.2994031	7.483336	0.0000000
dmousse	4.5010211	3.0869446	6.562862	0.0000000
wmousse	2.8472222	2.1282671	3.809049	0.0000000
fruit_salad	2.5006177	1.8867735	3.314171	0.0000000
redjelly	2.0820602	1.5559166	2.786123	0.0000044
tomato	1.2898653	0.9379601	1.773799	0.1368934
horseradish	1.2557870	0.9008457	1.750578	0.2026013
pork	1.2518519	0.9176849	1.707703	0.1708777
chickenwin	1.1617347	0.8376217	1.611261	0.4176598
mince	1.0568237	0.7571858	1.475036	0.7893882
salmon	1.0334249	0.7452531	1.433026	0.8976422
roastbeef	0.7607985	0.4129094	1.401795	0.4172932
beer	0.6767842	0.4757688	0.962730	0.0280639

Now we can use this data frame to plot the results using ggplot2:

The results are a lot clearer now: the tiramisu has by far the highest risk ratio in this outbreak, but there are a couple of things to note:

Its wide confidence interval suggests that the sample size is a bit small.
The other variables that have risk ratios significantly different from zero have overlapping confidence intervals, which may suggest that confounding factors were involved (i.e. the food items were not independent or in this case, were potentially sharing contaminated ingredients).

Spatial Mapping

To complete your report, you would like to include a very basic spatial overview of cases. Based on the postal codes of all individuals with a date of symptom onset spatial coordinates were generated providing everyone with a latitude and longitude which corresponds with their household (in relation to their postal code). The variables latitude and longitude include the longitude and latitude which we will use to plot the cases using ggplot2. Here, we want to plot the data to see if there is any spatial component to the outbreak. Since we have the coordinates for each household, it becomes straightforward to plot them with ggplot2. Here we will use points and color them based on whether or not the person was ill:

Static Map

Interactive Map

## Assuming "longitude" and "latitude" are longitude and latitude, respectively

Conclusion

In this case study, we calculated the epi curve of the cases by onset date. We saw that there was no distinction by age or sex of cases. We found that those who ate tiramisu at the buffet were 18 times more likely to get sick than those who did not, identifying this food as the most likely food that caused gastroenteritis. Lastly, we mapped cases to confirm that there is no visible spatial pattern to the cases, as we suspected. Of note, this study does not account for potential confounding factors.

Salmonella Outbreak Investigation - Stegen, Germany

Mackenzie Zendt

7/16/2020