In epidemiology, outbreak investigations often rely on case-control studies to test hypotheses about a probable source of contagion among many candidate exposures.
In this blog I walk through a tidy approach to case-control analysis, using the “nesting and many models” approach I picked up from the R for Data Science book and applying it to the results of a CDC survey collected while investigating an outbreak of E. coli O157:H7 in the United States. To begin, we need to install and load a variety of libraries before importing the raw dataset.
# installing and loading packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
rio, # import data
tidyverse, janitor, # shape data
survival, # statistics
broom, # model evaluation
kableExtra, #table output
prettydoc #knitting
)
# turn off scientific notation
options(scipen = 999)
#import dataset, clean names
ecoli_cc <- import("case_control_study_readacted.xlsx") %>%
janitor::clean_names()

Information from this study does not come to us in pristine form. There are cases (people who became ill) in the dataset who do not have a matched control (a similar person who did not become ill). Having identified the cases without controls, we can make a list of their cdcid values and filter those cases out of an updated line list (dataframe).
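Rather than eyeballing the line list, here is a quick sketch of my own for finding those unmatched cases programmatically (it assumes each matched case-control pair shares a cdcid, which we will see below with CDC047):

# sketch: cdcid values that appear only once are cases without a matched control
ecoli_cc %>%
  count(cdcid) %>%
  filter(n < 2) %>%
  pull(cdcid)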
## first we make list of cases without matched controls
unmatched_cases <- c("CDC001", "CDC006", "CDC009", "CDC012",
"CDC015", "CDC022", "CDC023", "CDC024",
"CDC037", "CDC043", "CDC045", "CDC046",
"CDC048", "CDC049", "CDC067", "CDC068")

One case has multiple controls assigned to it. We need to pluck that extra control from the data frame before we can continue.
# some cases have multiple controls assigned.
ecoli_cc %>%
select(cdcid, case, controlletter) %>%
filter(cdcid == "CDC047")

## cdcid case controlletter
## 1 CDC047 1 <NA>
## 2 CDC047 0 A
## 3 CDC047 0 B
# filter out unmatched cases
ecoli_cc_match <-
ecoli_cc %>%
filter(!(str_detect(cdcid,
paste(unmatched_cases, collapse = "|")))) %>%
# remove that one extra control at cdc047
filter(!(cdcid == "CDC047" & cdccaseid == "CDC047_B")) %>%
# create a strata number from cdcid strings
mutate(strata_num = as.numeric(stringr::str_extract_all(cdcid,"(..$)")))
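One caution with that last mutate: stringr::str_extract_all() returns a list, which as.numeric() may refuse to coerce depending on your R version. A simpler sketch of the same idea (my own variation, not the original code) uses str_extract(), which returns a plain character vector:

# sketch: extract the last two characters of cdcid and coerce them to numeric
ecoli_cc_match <- ecoli_cc_match %>%
  mutate(strata_num = as.numeric(stringr::str_extract(cdcid, "..$")))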
# count to verify the case numbers match, 36 strata
ecoli_cc_match %>%
distinct(strata_num) %>% tally()

## n
## 1 36
After verifying the 1:1 matching pattern of cases and controls in the dataset, we select the food exposure variables, renaming the more cryptic names into human-friendly identifiers. We must also recode the 99 and 3 values sprinkled throughout the exposure data; these represent missing survey responses as well as the always troubling “I’m not sure” responses received from participants.
# Select food exposure variables from the CDC questionnaire
cc_exposures <- ecoli_cc_match %>%
select(strata_num, case, strawberry,
apples, rollup, gb, rawcd, milk,
smoothie, cchip, carrot, cucumber,
raspberry, watermelon, nocdfzndes,
shopsmoothie, cantaloupe, mandarin,
grapes, bologna, hotdog, bacon) %>%
# rename variables
rename(ground_beef = gb, raw_cookie_dough = rawcd,
fruit_rollup = rollup,
chocolate_chips = cchip,
frozen_dessert = nocdfzndes,
storebought_smoothie = shopsmoothie)
# Replace 99 and 3 with NA throughout the dataframe
cc_exposures <- map_df(cc_exposures, ~ na_if(., 99))
cc_exposures <- map_df(cc_exposures, ~ na_if(., 3))
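The same recoding can be done in a single mutate() call; here is a roughly equivalent sketch assuming dplyr 1.0 or later, using across():

# sketch: recode 99 (missing) and 3 ("not sure") to NA across every column
cc_exposures <- cc_exposures %>%
  mutate(across(everything(), ~ replace(.x, .x %in% c(3, 99), NA)))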
# evaluate structure of the new dataframe
str(cc_exposures)

## tibble [72 × 22] (S3: tbl_df/tbl/data.frame)
## $ strata_num : num [1:72] 5 5 7 7 8 8 10 10 13 13 ...
## $ case : num [1:72] 1 0 1 0 1 0 1 0 1 0 ...
## $ strawberry : num [1:72] NA 1 1 0 0 0 NA 1 0 1 ...
## $ apples : num [1:72] NA 1 1 1 1 0 1 1 NA 0 ...
## $ fruit_rollup : num [1:72] 0 0 1 0 0 0 0 0 0 0 ...
## $ ground_beef : num [1:72] 1 0 1 1 1 0 0 0 1 1 ...
## $ raw_cookie_dough : num [1:72] 0 0 1 0 1 0 1 0 NA 0 ...
## $ milk : num [1:72] NA 0 NA 1 1 1 NA 0 1 0 ...
## $ smoothie : num [1:72] NA 0 NA 0 0 0 NA 1 0 1 ...
## $ chocolate_chips : num [1:72] NA 1 NA 1 1 0 NA 0 1 0 ...
## $ carrot : num [1:72] NA 1 NA 1 1 0 NA 1 0 0 ...
## $ cucumber : num [1:72] NA 1 NA 0 NA 0 NA 1 1 0 ...
## $ raspberry : num [1:72] NA 1 NA 0 NA 0 NA 0 0 0 ...
## $ watermelon : num [1:72] NA 0 NA 0 NA 0 NA 1 0 0 ...
## $ frozen_dessert : num [1:72] NA 1 NA 1 NA 1 NA 0 1 1 ...
## $ storebought_smoothie: num [1:72] NA 0 NA 0 0 0 NA 0 0 1 ...
## $ cantaloupe : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
## $ mandarin : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
## $ grapes : num [1:72] NA 1 NA 1 NA 0 NA 0 0 1 ...
## $ bologna : num [1:72] NA 0 NA 0 0 0 NA 0 1 0 ...
## $ hotdog : num [1:72] NA 0 NA 0 1 0 NA 0 1 1 ...
## $ bacon : num [1:72] NA 0 NA 0 0 0 NA 0 1 1 ...
Here we are with a cleaned dataset of food exposures for case and control survey participants. Our next step is to fit a statistical model and generate odds ratios that estimate the odds of disease given exposure to each food item. There are 20 distinct food exposures to analyze. So how can we run our models in an organized and comparable way without repeating a function 20 times and then wrestling and herding the outputs together someplace to evaluate them? The best answer I can think of is the tidyverse nest() function combined with purrr::map.
To do that, we first need to restructure the data so we can group_by the individual food exposures. Pivoting the wide exposure matrix into a long format works well.
# pivot longer to create category for each food.
cc_pivot <- cc_exposures %>%
pivot_longer(cols = 3:22, values_to = "exposure",
names_to = "food")
dim(cc_pivot)

## [1] 1440 4
After pivoting, the dataframe contains identical data but is expressed differently: each observation, per exposure and case number, is treated as a row in the long format. 72 (observations) * 20 (exposures) = 1440 rows. Now we can group_by the distinct exposures and nest() our data via those groups.
# group and nest by food category.
cc_nested <- cc_pivot %>%
group_by(food) %>%
nest()
head(cc_nested)

## # A tibble: 6 x 2
## # Groups: food [6]
## food data
## <chr> <list>
## 1 strawberry <tibble [72 × 3]>
## 2 apples <tibble [72 × 3]>
## 3 fruit_rollup <tibble [72 × 3]>
## 4 ground_beef <tibble [72 × 3]>
## 5 raw_cookie_dough <tibble [72 × 3]>
## 6 milk <tibble [72 × 3]>
After pivoting, grouping, and nesting the data, we have a dataframe-of-dataframes sort of object: each value in the cc_nested$data column is itself a tibble containing the strata, case status, and exposure values for the food item we categorized with group_by.
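For instance, to peek inside one of those nested tibbles (a quick sketch, not a required step):

# sketch: pull out the nested data for a single food exposure
cc_nested %>%
  filter(food == "strawberry") %>%
  pull(data) %>%
  pluck(1)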
The next step is to put together a function that runs a conditional logistic regression model in a way that lets us operate on all those nested tibbles at once, without repeating the process for each exposure explicitly in our code.
# function to map clogit model on dataframe.
clogit_model <- function(data){
survival::clogit(case ~ exposure +
strata(strata_num), data = data) }

Once the function is put together, we can use mutate to create a new column that maps the clogit operation over all the nested dataframes in one call. But wait, not so fast…
# map clogit onto nested column to create new row of outputs.
cc_logit <- cc_nested %>%
mutate(model = map(data, clogit_model))

## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, : Ran
## out of iterations and did not converge
…we are getting a warning from R after running the models.
Due to the structure and limited amount of exposure data, one of our 20 models was unable to converge and gives wild OR outputs that do not seem plausible (because they aren’t). There is a variation of the conditional logistic model, an “exact” method, that is unavailable in the R computing environment at this time. Having done our homework, we learn that the original analysis of this data was run in Stata, which does have an exact method available and was able to converge and return the values below. We can assemble those values to insert into our final tables.
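As an aside, the warning does not say which exposure failed to converge. One way to find out (a sketch of my own using purrr::quietly(), not part of the original workflow) is to capture the warnings emitted for each nested model; here it flags the raw cookie dough exposure:

# sketch: wrap the model function so warnings are captured instead of printed
quiet_clogit <- purrr::quietly(clogit_model)

cc_nested %>%
  mutate(fit = map(data, quiet_clogit),
         warned = map_chr(fit, ~ paste(.x$warnings, collapse = "; "))) %>%
  filter(warned != "") %>%
  pull(food)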
# outputs from an exact model for exposure that did not converge.
added_output <- data.frame(food = "raw_cookie_dough",
estimate = 41.34,
p.value = .001,
conf.low = 7.37,
conf.high = Inf)

Once we have made a new column with the 20 model outputs, one for each nested food exposure (including the erroneous one), we can extract the outputs using the broom package. Exponentiation is key to getting correct odds ratios from a logistic regression model, and confidence intervals are necessary to evaluate the strength of our estimates beyond p-value thresholds, so we are sure to include those arguments in the broom::tidy call that we map over our nested dataframes.
# summary of outputs.
cc_summary <- cc_logit %>%
mutate(outputs = map(model, broom::tidy,
# exponentiate and add confidence intervals.
exponentiate = TRUE, conf.int = TRUE))
head(cc_summary)

## # A tibble: 6 x 4
## # Groups: food [6]
## food data model outputs
## <chr> <list> <list> <list>
## 1 strawberry <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 2 apples <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 3 fruit_rollup <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 4 ground_beef <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 5 raw_cookie_dough <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 6 milk <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
With the information from our models at the ready, we unnest() the model outputs, select the relevant columns, and organize them in a table object, replacing the output of the model that failed to converge in survival::clogit() with the exact-method values from Stata.
# table of output summary of models
cc_summary %>%
unnest(outputs) %>%
# select the outputs relevant to our analysis
select(estimate, p.value, conf.low, conf.high) %>%
# round these values for legibility sake
mutate(p.value = round(p.value, 3),
estimate = round(estimate, 2),
conf.low = round(conf.low, 3),
conf.high = round(conf.high, 3)) %>%
# remove the row (or model) that was unable to converge
filter(food != "raw_cookie_dough") %>%
# add fixed outputs from published stata analysis
bind_rows(added_output) %>%
arrange(estimate) %>%
# create print quality table object for ease of viewing
kbl(caption = "Output of Conditional Logistic Regression Models",
col.names = c("Food Consumed",
"Estimated OR",
"p-value",
"CI-low",
"CI-high")) %>%
kable_classic_2(full_width = F) %>% column_spec(2, bold = T)

| Food Consumed | Estimated OR | p-value | CI-low | CI-high |
|---|---|---|---|---|
| frozen_dessert | 0.25 | 0.215 | 0.028 | 2.237 |
| smoothie | 0.50 | 0.423 | 0.092 | 2.730 |
| mandarin | 0.50 | 0.423 | 0.092 | 2.730 |
| bologna | 0.50 | 0.423 | 0.092 | 2.730 |
| hotdog | 0.50 | 0.327 | 0.125 | 1.999 |
| carrot | 0.67 | 0.530 | 0.188 | 2.362 |
| cucumber | 0.75 | 0.706 | 0.168 | 3.351 |
| raspberry | 1.00 | 1.000 | 0.141 | 7.099 |
| storebought_smoothie | 1.00 | 1.000 | 0.141 | 7.099 |
| grapes | 1.00 | 1.000 | 0.323 | 3.101 |
| cantaloupe | 1.33 | 0.706 | 0.298 | 5.957 |
| apples | 1.40 | 0.566 | 0.444 | 4.411 |
| fruit_rollup | 1.50 | 0.530 | 0.423 | 5.315 |
| watermelon | 1.50 | 0.657 | 0.251 | 8.977 |
| bacon | 1.67 | 0.484 | 0.398 | 6.974 |
| milk | 2.00 | 0.571 | 0.181 | 22.056 |
| chocolate_chips | 2.00 | 0.327 | 0.500 | 7.997 |
| strawberry | 2.20 | 0.144 | 0.764 | 6.332 |
| ground_beef | 3.50 | 0.118 | 0.727 | 16.848 |
| raw_cookie_dough | 41.34 | 0.001 | 7.370 | Inf |
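As a side note (my own sketch, not part of the original analysis): with 1:1 matching, the conditional OR reduces to the ratio of discordant pairs, i.e. pairs where only the case was exposed divided by pairs where only the control was exposed. That makes it easy to sanity-check any single estimate directly from cc_pivot; strata with missing exposure values are simply ignored, so the result should land close to the clogit estimate.

# sketch: matched-pairs (discordant pair) OR for one exposure, e.g. ground_beef
cc_pivot %>%
  filter(food == "ground_beef") %>%
  select(strata_num, case, exposure) %>%
  pivot_wider(names_from = case, values_from = exposure,
              names_prefix = "case_") %>%
  summarise(case_only = sum(case_1 == 1 & case_0 == 0, na.rm = TRUE),
            ctrl_only = sum(case_1 == 0 & case_0 == 1, na.rm = TRUE),
            matched_or = case_only / ctrl_only)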
It is also useful to explore the count and proportion of cases and controls who consumed each food item. Here we can filter our pivoted data on case or control status, group again by food exposure, count the exposed, and create a proportional measure.
# count and proportions of controls per food exposure
controls_tab <- cc_pivot %>%
filter(case == 0) %>%
group_by(food) %>%
summarise(controls = sum(exposure, na.rm = T)) %>%
mutate(controls_percent = round(controls / 36, 2))

Now that we have created a table for the controls, we repeat the process for cases and inner_join the two tables together. Finally, we add a similar wrapper to produce a print-quality table object.
# join controls with cases
cc_pivot %>%
filter(case == 1) %>%
group_by(food) %>%
summarise(cases = sum(exposure, na.rm = T)) %>%
mutate(cases_percent = round(cases / 36, 2)) %>%
inner_join(controls_tab, by = "food") %>%
arrange(cases_percent) %>%
kbl(caption = "Counts and Proportions of Exposed Cases and Controls",
col.names = c("Food Consumed",
"Cases (n)",
"Cases (prop.)",
"Controls (n)",
"Controls (prop.)")) %>%
kable_classic_2(full_width = F)

| Food Consumed | Cases (n) | Cases (prop.) | Controls (n) | Controls (prop.) |
|---|---|---|---|---|
| mandarin | 2 | 0.06 | 6 | 0.17 |
| smoothie | 2 | 0.06 | 5 | 0.14 |
| storebought_smoothie | 2 | 0.06 | 2 | 0.06 |
| bologna | 3 | 0.08 | 7 | 0.19 |
| raspberry | 3 | 0.08 | 5 | 0.14 |
| cantaloupe | 5 | 0.14 | 6 | 0.17 |
| watermelon | 5 | 0.14 | 6 | 0.17 |
| cucumber | 7 | 0.19 | 13 | 0.36 |
| hotdog | 7 | 0.19 | 15 | 0.42 |
| carrot | 9 | 0.25 | 17 | 0.47 |
| fruit_rollup | 9 | 0.25 | 6 | 0.17 |
| bacon | 10 | 0.28 | 10 | 0.28 |
| chocolate_chips | 10 | 0.28 | 12 | 0.33 |
| frozen_dessert | 11 | 0.31 | 26 | 0.72 |
| grapes | 12 | 0.33 | 17 | 0.47 |
| apples | 18 | 0.50 | 19 | 0.53 |
| strawberry | 18 | 0.50 | 14 | 0.39 |
| milk | 20 | 0.56 | 29 | 0.81 |
| ground_beef | 27 | 0.75 | 26 | 0.72 |
| raw_cookie_dough | 33 | 0.92 | 4 | 0.11 |
You may notice the high percentage of cases who reported eating raw cookie dough at the bottom here; this is further evidence supporting raw cookie dough as the source of the outbreak. Another way to explore the summary data is with a graphical visualization. We can plot the estimated OR values with confidence intervals fairly simply using a ggplot bar graph.
cc_summary %>%
# we must unnest and ungroup our data object
unnest(outputs) %>%
filter(food != "raw_cookie_dough") %>%
ungroup() %>%
# can't neglect the exact test measurements
add_row(food = "raw_cookie_dough", estimate = 41.34,
conf.low = 7.37, conf.high = Inf) %>%
## plotting ###
# plot with theme and titles
ggplot(aes(food, estimate)) +
geom_bar(stat = "identity", fill = "steelblue",
alpha = .6) +
# use conf values to create error bars
geom_errorbar(aes(ymin = conf.low, ymax = conf.high),
color = "green", alpha = .9) +
ylim(0, 42) +
coord_flip() +
# add a reference line at OR = 1
geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
# themes and title
theme_bw() +
theme(axis.text.x = element_text(angle = 90),
legend.position = "none") +
# title and labels
ylab("Estimated OR with Confidence Interval") +
xlab("Food Exposures") +
ggtitle("OR Estimates of Disease Given Food Exposure Status")

Even if the scale of this visualization is practically destroyed by the raw_cookie_dough estimate, adding a geom_hline at the OR = 1 intercept gives us a reference line and the ability to assess the exposures for statistical significance visually as well as numerically. Here we see that ground_beef and strawberry come close to significance, but aside from raw_cookie_dough, the confidence interval for every other exposure passes through that non-significant OR = 1 line, adding even more support to the hypothesis, and indeed the real-life finding, that raw cookie dough was the source of this E. coli outbreak.
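Since the raw_cookie_dough estimate flattens everything else, one variation worth sketching (my own tweak, not in the original post) is to plot the same estimates on a log10 scale, which keeps the OR = 1 reference line meaningful while making the smaller confidence intervals readable:

# sketch: the same OR estimates on a log10 scale
cc_summary %>%
  unnest(outputs) %>%
  filter(food != "raw_cookie_dough") %>%
  ungroup() %>%
  add_row(food = "raw_cookie_dough", estimate = 41.34,
          conf.low = 7.37, conf.high = Inf) %>%
  ggplot(aes(food, estimate)) +
  geom_point(color = "steelblue") +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.3) +
  geom_hline(yintercept = 1, color = "purple", linetype = "dotdash") +
  scale_y_log10() +
  coord_flip() +
  theme_bw() +
  ylab("Estimated OR with Confidence Interval (log10 scale)") +
  xlab("Food Exposures") +
  ggtitle("OR Estimates of Disease Given Food Exposure, Log Scale")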
Thank you for reading this far! You can find the raw data and code for this blog here, on my personal GitHub.