In epidemiology, outbreak investigations often rely on case-control studies to test hypotheses about a probable source of contagion among many candidate exposures.
In this blog I walk through a tidy approach to case-control analysis, using the “nesting and many models” approach I picked up from the R for Data Science book and applying it to the results of a CDC survey collected while investigating an outbreak of E. coli O157:H7 in the United States. To begin, we need to install and load a variety of libraries before importing the raw dataset.
# installing and loading packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
rio, # import data
tidyverse, janitor, # shape data
survival, # statistics
broom, # model evaluation
kableExtra, #table output
prettydoc #knitting
)
# turn off scientific notation
options(scipen = 999)
#import dataset, clean names
ecoli_cc <- import("case_control_study_readacted.xlsx") %>%
janitor::clean_names()

Information from this study does not come to us in pristine form. There are cases (people who became ill) in the dataset who do not have a matched control (a similar person who did not become ill). Having identified the cases without controls, we can make a list of their cdcid values and filter those cases out of an updated line list (dataframe).
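Rather than eyeballing the line list, here is a quick sketch of my own for finding those unmatched cases programmatically (it assumes each matched case-control pair shares a cdcid, which we will see below with CDC047):

# sketch: cdcid values that appear only once are cases without a matched control
ecoli_cc %>%
  count(cdcid) %>%
  filter(n < 2) %>%
  pull(cdcid)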
## first we make list of cases without matched controls
unmatched_cases <- c("CDC001", "CDC006", "CDC009", "CDC012",
"CDC015", "CDC022", "CDC023", "CDC024",
"CDC037", "CDC043", "CDC045", "CDC046",
"CDC048", "CDC049", "CDC067", "CDC068")

One case has multiple controls assigned to it. We need to pluck that extra control from the data frame before we can continue.
# some cases have multiple controls assigned.
ecoli_cc %>%
select(cdcid, case, controlletter) %>%
filter(cdcid == "CDC047")

## cdcid case controlletter
## 1 CDC047 1 <NA>
## 2 CDC047 0 A
## 3 CDC047 0 B
# filter out unmatched cases
ecoli_cc_match <-
ecoli_cc %>%
filter(!(str_detect(cdcid,
paste(unmatched_cases, collapse = "|")))) %>%
# remove that one extra control at cdc047
filter(!(cdcid == "CDC047" & cdccaseid == "CDC047_B")) %>%
# create a strata number from cdcid strings
mutate(strata_num = as.numeric(stringr::str_extract_all(cdcid,"(..$)")))
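One caution with that last mutate: stringr::str_extract_all() returns a list, which as.numeric() may refuse to coerce depending on your R version. A simpler sketch of the same idea (my own variation, not the original code) uses str_extract(), which returns a plain character vector:

# sketch: extract the last two characters of cdcid and coerce them to numeric
ecoli_cc_match <- ecoli_cc_match %>%
  mutate(strata_num = as.numeric(stringr::str_extract(cdcid, "..$")))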
# count to verify the case numbers match, 36 strata
ecoli_cc_match %>%
distinct(strata_num) %>% tally()

## n
## 1 36
After verifying the 1:1 matching pattern of cases and controls in the dataset, we select the food exposure variables, renaming the more cryptic names into human-friendly identifiers. We must also recode the 99 and 3 values sprinkled throughout the exposure data; these represent missing survey responses as well as the always troubling “I’m not sure” responses received from participants.
# Select food exposure variables from the CDC questionnaire
cc_exposures <- ecoli_cc_match %>%
select(strata_num, case, strawberry,
apples, rollup, gb, rawcd, milk,
smoothie, cchip, carrot, cucumber,
raspberry, watermelon, nocdfzndes,
shopsmoothie, cantaloupe, mandarin,
grapes, bologna, hotdog, bacon) %>%
# rename variables
rename(ground_beef = gb, raw_cookie_dough = rawcd,
fruit_rollup = rollup,
chocolate_chips = cchip,
frozen_dessert = nocdfzndes,
storebought_smoothie = shopsmoothie)
# Replace 99 and 3 with NA throughout the dataframe
cc_exposures <- map_df(cc_exposures, ~ na_if(., 99))
cc_exposures <- map_df(cc_exposures, ~ na_if(., 3))
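The same recoding can be done in a single mutate() call; here is a roughly equivalent sketch assuming dplyr 1.0 or later, using across():

# sketch: recode 99 (missing) and 3 ("not sure") to NA across every column
cc_exposures <- cc_exposures %>%
  mutate(across(everything(), ~ replace(.x, .x %in% c(3, 99), NA)))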
# evaluate structure of the new dataframe
str(cc_exposures)

## tibble [72 × 22] (S3: tbl_df/tbl/data.frame)
## $ strata_num : num [1:72] 5 5 7 7 8 8 10 10 13 13 ...
## $ case : num [1:72] 1 0 1 0 1 0 1 0 1 0 ...
## $ strawberry : num [1:72] NA 1 1 0 0 0 NA 1 0 1 ...
## $ apples : num [1:72] NA 1 1 1 1 0 1 1 NA 0 ...
## $ fruit_rollup : num [1:72] 0 0 1 0 0 0 0 0 0 0 ...
## $ ground_beef : num [1:72] 1 0 1 1 1 0 0 0 1 1 ...
## $ raw_cookie_dough : num [1:72] 0 0 1 0 1 0 1 0 NA 0 ...
## $ milk : num [1:72] NA 0 NA 1 1 1 NA 0 1 0 ...
## $ smoothie : num [1:72] NA 0 NA 0 0 0 NA 1 0 1 ...
## $ chocolate_chips : num [1:72] NA 1 NA 1 1 0 NA 0 1 0 ...
## $ carrot : num [1:72] NA 1 NA 1 1 0 NA 1 0 0 ...
## $ cucumber : num [1:72] NA 1 NA 0 NA 0 NA 1 1 0 ...
## $ raspberry : num [1:72] NA 1 NA 0 NA 0 NA 0 0 0 ...
## $ watermelon : num [1:72] NA 0 NA 0 NA 0 NA 1 0 0 ...
## $ frozen_dessert : num [1:72] NA 1 NA 1 NA 1 NA 0 1 1 ...
## $ storebought_smoothie: num [1:72] NA 0 NA 0 0 0 NA 0 0 1 ...
## $ cantaloupe : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
## $ mandarin : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
## $ grapes : num [1:72] NA 1 NA 1 NA 0 NA 0 0 1 ...
## $ bologna : num [1:72] NA 0 NA 0 0 0 NA 0 1 0 ...
## $ hotdog : num [1:72] NA 0 NA 0 1 0 NA 0 1 1 ...
## $ bacon : num [1:72] NA 0 NA 0 0 0 NA 0 1 1 ...
Here we are with a cleaned dataset of food exposures for case and control survey participants. Our next step is to fit a statistical model and generate odds ratios that estimate the odds of disease given exposure to each food item. There are 20 distinct food exposures to analyze. So how can we run our models in an organized and comparable way without repeating a function 20 times and then wrestling and herding the outputs together someplace to evaluate them? The best answer I can think of is the tidyverse nest() function combined with purrr::map.
To do that, we first need to restructure the data so we can group_by the individual food exposures. Pivoting the wide exposure matrix into a long format works well.
# pivot longer to create category for each food.
cc_pivot <- cc_exposures %>%
pivot_longer(cols = 3:22, values_to = "exposure",
names_to = "food")
dim(cc_pivot)

## [1] 1440 4
After pivoting, the dataframe contains identical data but is expressed differently: each observation, per exposure and case number, is treated as a row in the long format. 72 (observations) * 20 (exposures) = 1440 rows. Now we can group_by the distinct exposures and nest() our data via those groups.
# group and nest by food category.
cc_nested <- cc_pivot %>%
group_by(food) %>%
nest()
head(cc_nested)

## # A tibble: 6 x 2
## # Groups: food [6]
## food data
## <chr> <list>
## 1 strawberry <tibble [72 × 3]>
## 2 apples <tibble [72 × 3]>
## 3 fruit_rollup <tibble [72 × 3]>
## 4 ground_beef <tibble [72 × 3]>
## 5 raw_cookie_dough <tibble [72 × 3]>
## 6 milk <tibble [72 × 3]>
After pivoting, grouping, and nesting the data, we have a dataframe-of-dataframes sort of object: each value in the cc_nested$data column is itself a tibble containing the strata, case status, and exposure values for the food item we categorized with group_by.
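For instance, to peek inside one of those nested tibbles (a quick sketch, not a required step):

# sketch: pull out the nested data for a single food exposure
cc_nested %>%
  filter(food == "strawberry") %>%
  pull(data) %>%
  pluck(1)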
The next step is to put together a function that runs a conditional logistic regression model in a way that lets us operate on all those nested tibbles at once, without repeating the process for each exposure explicitly in our code.
# function to map clogit model on dataframe.
clogit_model <- function(data){
survival::clogit(case ~ exposure +
strata(strata_num), data = data) }

Once the function is put together, we can use mutate to create a new column that maps the clogit operation over all the nested dataframes in one call. But wait, not so fast…
# map clogit onto nested column to create new row of outputs.
cc_logit <- cc_nested %>%
mutate(model = map(data, clogit_model))

## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, : Ran
## out of iterations and did not converge
…we are getting a warning from R after running the models.
Due to the structure and limited amount of exposure data, one of our 20 models was unable to converge and gives wild OR outputs that do not seem plausible (because they aren’t). There is a variation of the conditional logistic model, an “exact” method, that is unavailable in the R computing environment at this time. Having done our homework, we learn that the original analysis of this data was run in Stata, which does have an exact method available and was able to converge and return the values below. We can assemble those values to insert into our final tables.
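As an aside, the warning does not say which exposure failed to converge. One way to find out (a sketch of my own using purrr::quietly(), not part of the original workflow) is to capture the warnings emitted for each nested model; here it flags the raw cookie dough exposure:

# sketch: wrap the model function so warnings are captured instead of printed
quiet_clogit <- purrr::quietly(clogit_model)

cc_nested %>%
  mutate(fit = map(data, quiet_clogit),
         warned = map_chr(fit, ~ paste(.x$warnings, collapse = "; "))) %>%
  filter(warned != "") %>%
  pull(food)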
# outputs from an exact model for exposure that did not converge.
added_output <- data.frame(food = "raw_cookie_dough",
estimate = 41.34,
p.value = .001,
conf.low = 7.37,
conf.high = Inf)

Once we have made a new column with the 20 model outputs, one for each nested food exposure (including the erroneous one), we can extract the outputs using the broom package. Exponentiation is key to getting correct odds ratios from a logistic regression model, and confidence intervals are necessary to evaluate the strength of our estimates beyond p-value thresholds, so we are sure to include those arguments in the broom::tidy call that we map over our nested dataframes.
# summary of outputs.
cc_summary <- cc_logit %>%
mutate(outputs = map(model, broom::tidy,
# exponentiate and add confidence intervals.
exponentiate = TRUE, conf.int = TRUE))
head(cc_summary)

## # A tibble: 6 x 4
## # Groups: food [6]
## food data model outputs
## <chr> <list> <list> <list>
## 1 strawberry <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 2 apples <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 3 fruit_rollup <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 4 ground_beef <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 5 raw_cookie_dough <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 6 milk <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
With the information from our models at the ready, we unnest() the model outputs, select the relevant columns, and organize them in a table object, replacing the output of the model that failed to converge in survival::clogit() with the exact-method values from Stata.
# table of output summary of models
cc_summary %>%
unnest(outputs) %>%
# select the outputs relevant to our analysis
select(estimate, p.value, conf.low, conf.high) %>%
# round these values for legibility sake
mutate(p.value = round(p.value, 3),
estimate = round(estimate, 2),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)) %>%
# remove the row (or model) that was unable to converge
filter(food != "raw_cookie_dough") %>%
# add fixed outputs from published stata analysis
bind_rows(added_output) %>%
arrange(estimate) %>%
# create print quality table object for ease of viewing
kbl(caption = "Output of Conditional Logistic Regression Models",
col.names = c("Food Consumed",
"Estimated OR",
"p-value",
"CI-low",
"CI-high")) %>%
kable_classic_2(full_width = F) %>% column_spec(2, bold = T)

| Food Consumed | Estimated OR | p-value | CI-low | CI-high |
|---|---|---|---|---|
| frozen_dessert | 0.25 | 0.215 | 0.028 | 2.237 |
| smoothie | 0.50 | 0.423 | 0.092 | 2.730 |
| mandarin | 0.50 | 0.423 | 0.092 | 2.730 |
| bologna | 0.50 | 0.423 | 0.092 | 2.730 |
| hotdog | 0.50 | 0.327 | 0.125 | 1.999 |
| carrot | 0.67 | 0.530 | 0.188 | 2.362 |
| cucumber | 0.75 | 0.706 | 0.168 | 3.351 |
| raspberry | 1.00 | 1.000 | 0.141 | 7.099 |
| storebought_smoothie | 1.00 | 1.000 | 0.141 | 7.099 |
| grapes | 1.00 | 1.000 | 0.323 | 3.101 |
| cantaloupe | 1.33 | 0.706 | 0.298 | 5.957 |
| apples | 1.40 | 0.566 | 0.444 | 4.411 |
| fruit_rollup | 1.50 | 0.530 | 0.423 | 5.315 |
| watermelon | 1.50 | 0.657 | 0.251 | 8.977 |
| bacon | 1.67 | 0.484 | 0.398 | 6.974 |
| milk | 2.00 | 0.571 | 0.181 | 22.056 |
| chocolate_chips | 2.00 | 0.327 | 0.500 | 7.997 |
| strawberry | 2.20 | 0.144 | 0.764 | 6.332 |
| ground_beef | 3.50 | 0.118 | 0.727 | 16.848 |
| raw_cookie_dough | 41.34 | 0.001 | 7.370 | Inf |
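As a side note (my own sketch, not part of the original analysis): with 1:1 matching, the conditional OR reduces to the ratio of discordant pairs, i.e. pairs where only the case was exposed divided by pairs where only the control was exposed. That makes it easy to sanity-check any single estimate directly from cc_pivot; strata with missing exposure values are simply ignored, so the result should land close to the clogit estimate.

# sketch: matched-pairs (discordant pair) OR for one exposure, e.g. ground_beef
cc_pivot %>%
  filter(food == "ground_beef") %>%
  select(strata_num, case, exposure) %>%
  pivot_wider(names_from = case, values_from = exposure,
              names_prefix = "case_") %>%
  summarise(case_only = sum(case_1 == 1 & case_0 == 0, na.rm = TRUE),
            ctrl_only = sum(case_1 == 0 & case_0 == 1, na.rm = TRUE),
            matched_or = case_only / ctrl_only)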
It is also useful to explore the count and proportion of cases and controls who consumed each food item. Here we can filter our pivoted data on case or control status, group again by food exposure, count the exposed, and create a proportional measure.
# count and proportions of controls per food exposure
controls_tab <- cc_pivot %>%
filter(case == 0) %>%
group_by(food) %>%
summarise(controls = sum(exposure, na.rm = T)) %>%
mutate(controls_percent = round(controls / 36, 2))

Now that we have created a table for the controls, we repeat the process for cases and inner_join the two tables together. Finally, we add a similar wrapper to produce a print-quality table object.
# join controls with cases
cc_pivot %>%
filter(case == 1) %>%
group_by(food) %>%
summarise(cases = sum(exposure, na.rm = T)) %>%
mutate(cases_percent = round(cases / 36, 2)) %>%
inner_join(controls_tab, by = "food") %>%
arrange(cases_percent) %>%
kbl(caption = "Counts and Proportions of Exposed Cases and Controls",
col.names = c("Food Consumed",
"Cases (n)",
"Cases (prop.)",
"Controls (n)",
"Controls (prop.)")) %>%
kable_classic_2(full_width = F)

| Food Consumed | Cases (n) | Cases (prop.) | Controls (n) | Controls (prop.) |
|---|---|---|---|---|
| mandarin | 2 | 0.06 | 6 | 0.17 |
| smoothie | 2 | 0.06 | 5 | 0.14 |
| storebought_smoothie | 2 | 0.06 | 2 | 0.06 |
| bologna | 3 | 0.08 | 7 | 0.19 |
| raspberry | 3 | 0.08 | 5 | 0.14 |
| cantaloupe | 5 | 0.14 | 6 | 0.17 |
| watermelon | 5 | 0.14 | 6 | 0.17 |
| cucumber | 7 | 0.19 | 13 | 0.36 |
| hotdog | 7 | 0.19 | 15 | 0.42 |
| carrot | 9 | 0.25 | 17 | 0.47 |
| fruit_rollup | 9 | 0.25 | 6 | 0.17 |
| bacon | 10 | 0.28 | 10 | 0.28 |
| chocolate_chips | 10 | 0.28 | 12 | 0.33 |
| frozen_dessert | 11 | 0.31 | 26 | 0.72 |
| grapes | 12 | 0.33 | 17 | 0.47 |
| apples | 18 | 0.50 | 19 | 0.53 |
| strawberry | 18 | 0.50 | 14 | 0.39 |
| milk | 20 | 0.56 | 29 | 0.81 |
| ground_beef | 27 | 0.75 | 26 | 0.72 |
| raw_cookie_dough | 33 | 0.92 | 4 | 0.11 |
You may notice the high percentage of cases who reported eating raw cookie dough at the bottom here; this is further evidence supporting raw cookie dough as the source of the outbreak. Another way to explore the summary data is with a graphical visualization. We can plot the estimated OR values with confidence intervals fairly simply using a ggplot bar graph.
cc_summary %>%
# we must unnest and ungroup our data object
unnest(outputs) %>%
filter(food != "raw_cookie_dough") %>%
ungroup() %>%
# can't neglect the exact test measurements
add_row(food = "raw_cookie_dough", estimate = 41.34,
conf.low = 7.37, conf.high = Inf) %>%
## plotting ###
# plot with theme and titles
ggplot(aes(food, estimate)) +
geom_bar(stat = "identity", fill = "steelblue",
alpha = .6) +
# use conf values to create error bars
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
color = "green", alpha = .9) +
ylim(0, 42) +
coord_flip() +
# add a reference line at OR = 1
geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
# themes and title
theme_bw() +
theme(axis.text.x = element_text(angle = 90),
legend.position = "none") +
# title and labels
ylab("Estimated OR with Confidence Interval") +
xlab("Food Exposures") +
ggtitle("OR Estimates of Disease Given Food Exposure Status")

Even if the scale of this visualization is practically destroyed by the raw_cookie_dough estimate, adding a geom_hline at the OR = 1 intercept gives us a reference line and the ability to assess the exposures for statistical significance visually as well as numerically. Here we see that ground_beef and strawberry come close to significance, but aside from raw_cookie_dough, the confidence interval for every other exposure passes through that non-significant OR = 1 line, adding even more support to the hypothesis, and indeed the real-life finding, that raw cookie dough was the source of this E. coli outbreak.
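Since the raw_cookie_dough estimate flattens everything else, one variation worth sketching (my own tweak, not in the original post) is to plot the same estimates on a log10 scale, which keeps the OR = 1 reference line meaningful while making the smaller confidence intervals readable:

# sketch: the same OR estimates on a log10 scale
cc_summary %>%
  unnest(outputs) %>%
  filter(food != "raw_cookie_dough") %>%
  ungroup() %>%
  add_row(food = "raw_cookie_dough", estimate = 41.34,
          conf.low = 7.37, conf.high = Inf) %>%
  ggplot(aes(food, estimate)) +
  geom_point(color = "steelblue") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
  geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
  scale_y_log10() +
  coord_flip() +
  theme_bw() +
  ylab("Estimated OR with Confidence Interval (log10 scale)") +
  xlab("Food Exposures") +
  ggtitle("OR Estimates of Disease Given Food Exposure, Log Scale")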
Thank you for reading this far! You can find the raw data and code for this blog here, on my personal GitHub.