Nested and Functional Programming For Case Control Analysis

Avery Richards

12/9/2021

In epidemiology, outbreak investigations often rely on case control studies to test hypotheses around a probable source of contagion among many exposures.

In this blog I walk through a tidy approach to case control analysis, using the "nesting and many models" approach I picked up from the R for Data Science book, applied to the results of a CDC survey administered while investigating an outbreak of E. coli O157:H7 in the United States. To begin, we need to install and load a handful of libraries before importing the raw dataset.
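A minimal sketch of the setup, assuming the packages used later in this post (the file name and read_csv() call are placeholders for the actual import):

```r
# Core packages for this analysis: tidyverse for wrangling,
# survival for clogit(), broom for tidying model output.
# install.packages(c("tidyverse", "survival", "broom"))  # if not already installed
library(tidyverse)
library(survival)
library(broom)

# Import the raw line list (file name is a placeholder).
raw_data <- read_csv("cdc_case_control_survey.csv")
```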

Information from this study does not come to us in pristine form. There are cases (people who became ill) observed in the dataset who do not have a matching control (a non-ill participant paired with that case). Having identified the cases without controls, we can make a list of their cdcid values and filter those cases out of an updated line list (dataframe).
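A sketch of that filtering step, assuming the line list is named raw_data and uses the cdcid and case columns shown below:

```r
# Find cdcid values that do not appear with both a case record (case == 1)
# and a control record (case == 0).
unmatched_ids <- raw_data %>%
  group_by(cdcid) %>%
  filter(n_distinct(case) < 2) %>%
  pull(cdcid)

# Drop the unmatched records from the working line list.
matched_data <- raw_data %>%
  filter(!cdcid %in% unmatched_ids)
```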

One case also has multiple controls assigned to it. We need to pluck that case from the data frame before we can continue.

##    cdcid case controlletter
## 1 CDC047    1          <NA>
## 2 CDC047    0             A
## 3 CDC047    0             B
##    n
## 1 36

After verifying the 1:1 matching pattern of cases and controls in the dataset, we select the food exposure variables, renaming the more cryptic column names to human-friendly identifiers. We must also recode the 99 and 3 values sprinkled throughout the exposure data. These represent missing data in the survey responses as well as the always troubling "I'm not sure" responses received from participants.
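A sketch of that recoding, assuming the matched line list (with the duplicate-control case already removed) is named matched_data and that the exposure columns have already been renamed as shown in the structure below:

```r
# Keep the strata, case status, and renamed food exposure columns, then
# recode 99 ("missing") and 3 ("I'm not sure") to NA across every exposure.
cc_data <- matched_data %>%
  select(strata_num, case, strawberry:bacon) %>%
  mutate(across(strawberry:bacon, ~ if_else(.x %in% c(99, 3), NA_real_, .x)))
```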

## tibble [72 × 22] (S3: tbl_df/tbl/data.frame)
##  $ strata_num          : num [1:72] 5 5 7 7 8 8 10 10 13 13 ...
##  $ case                : num [1:72] 1 0 1 0 1 0 1 0 1 0 ...
##  $ strawberry          : num [1:72] NA 1 1 0 0 0 NA 1 0 1 ...
##  $ apples              : num [1:72] NA 1 1 1 1 0 1 1 NA 0 ...
##  $ fruit_rollup        : num [1:72] 0 0 1 0 0 0 0 0 0 0 ...
##  $ ground_beef         : num [1:72] 1 0 1 1 1 0 0 0 1 1 ...
##  $ raw_cookie_dough    : num [1:72] 0 0 1 0 1 0 1 0 NA 0 ...
##  $ milk                : num [1:72] NA 0 NA 1 1 1 NA 0 1 0 ...
##  $ smoothie            : num [1:72] NA 0 NA 0 0 0 NA 1 0 1 ...
##  $ chocolate_chips     : num [1:72] NA 1 NA 1 1 0 NA 0 1 0 ...
##  $ carrot              : num [1:72] NA 1 NA 1 1 0 NA 1 0 0 ...
##  $ cucumber            : num [1:72] NA 1 NA 0 NA 0 NA 1 1 0 ...
##  $ raspberry           : num [1:72] NA 1 NA 0 NA 0 NA 0 0 0 ...
##  $ watermelon          : num [1:72] NA 0 NA 0 NA 0 NA 1 0 0 ...
##  $ frozen_dessert      : num [1:72] NA 1 NA 1 NA 1 NA 0 1 1 ...
##  $ storebought_smoothie: num [1:72] NA 0 NA 0 0 0 NA 0 0 1 ...
##  $ cantaloupe          : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
##  $ mandarin            : num [1:72] NA 0 NA 0 NA 0 NA 0 0 0 ...
##  $ grapes              : num [1:72] NA 1 NA 1 NA 0 NA 0 0 1 ...
##  $ bologna             : num [1:72] NA 0 NA 0 0 0 NA 0 1 0 ...
##  $ hotdog              : num [1:72] NA 0 NA 0 1 0 NA 0 1 1 ...
##  $ bacon               : num [1:72] NA 0 NA 0 0 0 NA 0 1 1 ...

Here we are with a cleaned dataset of food exposures for case and control survey participants. Our next step is to fit a statistical model of some sort and generate odds ratios that estimate the odds of disease given exposure to each food item. There are 20 distinct food exposures to analyze. So how can we run our models in an organized and comparable way without repeating a function 20 times, then wrestling and herding the outputs together someplace to evaluate them? The best answer I can think of is using the tidyverse nest() function with purrr::map().

To do that, we first need to restructure the data so we can group_by() the individual food exposures. Pivoting the wide exposure matrix into a long format works well.
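A sketch of the pivot, assuming the cleaned dataset is named cc_data as above (the long-format column names food and value are my own choices):

```r
# Pivot the 20 exposure columns into long format: one row per participant
# per food item, giving 72 * 20 = 1440 rows.
cc_long <- cc_data %>%
  pivot_longer(cols = -c(strata_num, case),
               names_to = "food",
               values_to = "value")

dim(cc_long)
```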

## [1] 1440    4

After pivoting, the dataframe contains identical data but is expressed differently: each participant-exposure pair is treated as a row in the long format. 72 (observations) * 20 (exposures) = 1440 rows. Now we can group_by() the distinct exposures and nest() our data via those groups.
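A sketch of that grouping and nesting, under the same naming assumptions:

```r
# Group on the food exposure and nest the remaining columns, producing one
# row per food item with a 72-row tibble tucked into the data list column.
cc_nest <- cc_long %>%
  group_by(food) %>%
  nest()

head(cc_nest)
```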

## # A tibble: 6 x 2
## # Groups:   food [6]
##   food             data             
##   <chr>            <list>           
## 1 strawberry       <tibble [72 × 3]>
## 2 apples           <tibble [72 × 3]>
## 3 fruit_rollup     <tibble [72 × 3]>
## 4 ground_beef      <tibble [72 × 3]>
## 5 raw_cookie_dough <tibble [72 × 3]>
## 6 milk             <tibble [72 × 3]>

After pivoting, grouping, and nesting up our data, we have a "dataframe of dataframes" sort of object, with each value in the cc_nest$data column being a tibble in itself that contains the strata, case status, and exposure values for the food item we categorized with group_by().

The next step is to put together a function that fits a conditional logistic regression model, so we can operate on all those nested tibbles at once without repeating the process for each exposure explicitly in our code.
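A minimal sketch of such a helper, assuming the long-format column names used above; survival::clogit() conditions on the matched pairs through the strata() term in the formula:

```r
# Fit a conditional logistic regression for one nested exposure tibble:
# case status modeled on the exposure value, conditioning on the matched
# pair via strata(strata_num).
clogit_model <- function(df) {
  survival::clogit(case ~ value + strata(strata_num), data = df)
}
```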

Once the function is put together, we can use mutate() to create a new column that map()-s the clogit operation over all the nested dataframes in one call. But wait, not so fast…
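Here is a sketch of that call, with the object names assumed above; the warning that follows comes from one of these twenty fits:

```r
# Map the helper over every nested exposure tibble, storing the fitted
# models in a new list column.
cc_nest <- cc_nest %>%
  mutate(model = map(data, clogit_model))
```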

## Warning in fitter(X, Y, istrat, offset, init, control, weights = weights, : Ran
## out of iterations and did not converge

…we are getting a warning from R after running the models.

Due to the structure and limited amount of exposure data, one of our 20 models was unable to converge and will give us wild OR outputs that do not seem plausible (because they aren't). There is a variation of the conditional logistic model, an "exact" method, that is unavailable in the R computing environment at this time. Having done our homework, we learn that the original analysis of this data used Stata, which has an exact method available that was able to converge and return estimates. We can assemble those values to insert into our final tables.

Once we have made a new column with the 20 model outputs, one for each nested food exposure (including the erroneous one), we can extract results using the broom package. Exponentiating the coefficients is key to getting odds ratios from a logistic regression model, and confidence intervals are necessary to evaluate the strength of our estimates beyond p-value thresholds, so we make sure to include those arguments in the broom::tidy() function that we map over our nested data frames.
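A sketch of that extraction; broom::tidy() on a clogit fit accepts exponentiate and conf.int arguments (object names as assumed above):

```r
# Tidy each fitted model, exponentiating coefficients into odds ratios
# and attaching confidence intervals.
cc_nest <- cc_nest %>%
  mutate(outputs = map(model, ~ tidy(.x, exponentiate = TRUE, conf.int = TRUE)))

head(cc_nest)
```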

## # A tibble: 6 x 4
## # Groups:   food [6]
##   food             data              model    outputs         
##   <chr>            <list>            <list>   <list>          
## 1 strawberry       <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 2 apples           <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 3 fruit_rollup     <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 4 ground_beef      <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 5 raw_cookie_dough <tibble [72 × 3]> <clogit> <tibble [1 × 7]>
## 6 milk             <tibble [72 × 3]> <clogit> <tibble [1 × 7]>

With the information from our models at the ready, we unnest() the model outputs, select the relevant columns, and organize them in a table object, replacing the values from the model that was unable to converge in survival::clogit() with the exact estimates from the original Stata analysis.
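A sketch of that assembly (the column selection mirrors the table below; swapping in the exact estimates for the non-converging model is not shown):

```r
# Unnest the tidied outputs and keep the columns shown in the table below.
or_table <- cc_nest %>%
  unnest(outputs) %>%
  select(food, estimate, p.value, conf.low, conf.high) %>%
  arrange(estimate)
```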

Output of Conditional Logistic Regression Models

| Food Consumed | Estimated OR | p-value | CI-low | CI-high |
|---|---|---|---|---|
| frozen_dessert | 0.25 | 0.215 | 0.028 | 2.237 |
| smoothie | 0.50 | 0.423 | 0.092 | 2.730 |
| mandarin | 0.50 | 0.423 | 0.092 | 2.730 |
| bologna | 0.50 | 0.423 | 0.092 | 2.730 |
| hotdog | 0.50 | 0.327 | 0.125 | 1.999 |
| carrot | 0.67 | 0.530 | 0.188 | 2.362 |
| cucumber | 0.75 | 0.706 | 0.168 | 3.351 |
| raspberry | 1.00 | 1.000 | 0.141 | 7.099 |
| storebought_smoothie | 1.00 | 1.000 | 0.141 | 7.099 |
| grapes | 1.00 | 1.000 | 0.323 | 3.101 |
| cantaloupe | 1.33 | 0.706 | 0.298 | 5.957 |
| apples | 1.40 | 0.566 | 0.444 | 4.411 |
| fruit_rollup | 1.50 | 0.530 | 0.423 | 5.315 |
| watermelon | 1.50 | 0.657 | 0.251 | 8.977 |
| bacon | 1.67 | 0.484 | 0.398 | 6.974 |
| milk | 2.00 | 0.571 | 0.181 | 22.056 |
| chocolate_chips | 2.00 | 0.327 | 0.500 | 7.997 |
| strawberry | 2.20 | 0.144 | 0.764 | 6.332 |
| ground_beef | 3.50 | 0.118 | 0.727 | 16.848 |
| raw_cookie_dough | 41.34 | 0.001 | 7.370 | Inf |

It is also relevant to explore the count and proportion of cases and controls who consumed each food item. Here we can filter our pivoted data on case or control status, group again on the food exposure, count the exposed, and create a proportional measure.
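A sketch of the control side of that summary, under the same naming assumptions (each food group contains the 36 controls, so dividing the exposed count by n() gives the proportion):

```r
# Counts and proportions of exposed controls, per food item.
control_tab <- cc_long %>%
  filter(case == 0) %>%
  group_by(food) %>%
  summarise(controls_n    = sum(value, na.rm = TRUE),
            controls_prop = round(sum(value, na.rm = TRUE) / n(), 2))
```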

Now that we have created a table for the controls, we repeat the process for the cases and inner_join() the two tables together. Finally, we add a similar wrapper to produce a print-quality table object.
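A sketch of that join, assuming a cases_tab built the same way with filter(case == 1):

```r
# Combine the case and control summaries into one table, matched on food.
exposure_counts <- cases_tab %>%
  inner_join(control_tab, by = "food") %>%
  arrange(cases_n)
```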

Counts and Proportions of Exposed Cases and Controls

| Food Consumed | Cases (n) | Cases (prop.) | Controls (n) | Controls (prop.) |
|---|---|---|---|---|
| mandarin | 2 | 0.06 | 6 | 0.17 |
| smoothie | 2 | 0.06 | 5 | 0.14 |
| storebought_smoothie | 2 | 0.06 | 2 | 0.06 |
| bologna | 3 | 0.08 | 7 | 0.19 |
| raspberry | 3 | 0.08 | 5 | 0.14 |
| cantaloupe | 5 | 0.14 | 6 | 0.17 |
| watermelon | 5 | 0.14 | 6 | 0.17 |
| cucumber | 7 | 0.19 | 13 | 0.36 |
| hotdog | 7 | 0.19 | 15 | 0.42 |
| carrot | 9 | 0.25 | 17 | 0.47 |
| fruit_rollup | 9 | 0.25 | 6 | 0.17 |
| bacon | 10 | 0.28 | 10 | 0.28 |
| chocolate_chips | 10 | 0.28 | 12 | 0.33 |
| frozen_dessert | 11 | 0.31 | 26 | 0.72 |
| grapes | 12 | 0.33 | 17 | 0.47 |
| apples | 18 | 0.50 | 19 | 0.53 |
| strawberry | 18 | 0.50 | 14 | 0.39 |
| milk | 20 | 0.56 | 29 | 0.81 |
| ground_beef | 27 | 0.75 | 26 | 0.72 |
| raw_cookie_dough | 33 | 0.92 | 4 | 0.11 |

You may notice the high proportion of cases who reported eating raw cookie dough at the bottom of the table. This further supports raw cookie dough as the source of the outbreak. Another way to explore the summary data is graphically: we can visualize the estimated OR values with confidence intervals fairly simply using a ggplot bar graph.
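A sketch of that plot, assuming the or_table assembled above (geom_col() for the bars, geom_errorbar() for the intervals, and the OR = 1 reference line discussed next):

```r
# Bar graph of estimated odds ratios with confidence intervals,
# plus a dashed reference line at OR = 1.
or_table %>%
  ggplot(aes(x = reorder(food, estimate), y = estimate)) +
  geom_col() +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.2) +
  geom_hline(yintercept = 1, linetype = "dashed") +
  coord_flip() +
  labs(x = "Food exposure", y = "Estimated odds ratio")
```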

Even if the scale of this visualization is practically destroyed by the raw_cookie_dough estimate, adding a geom_hline at the OR = 1 intercept gives us a reference line and the ability to assess the exposures for statistical significance visually as well as numerically. Here we see that ground_beef and strawberry are close to significance, but the confidence interval for every other exposure crosses that non-significant OR = 1 line, adding even more support to the hypothesis, and indeed the real-life finding, that raw cookie dough was the source of this E. coli outbreak.

Thank you for reading this far! You can find the raw data and code for this blog here, on my personal GitHub.