In this tutorial, I share some functions and tricks for manipulating follow-up data for analysis with two by two contingency tables, using some mock data. Follow-up data are typically collected in cohort studies or randomised controlled trials.
You can download all files and data related to this tutorial from my GitHub repository.
Throughout the tutorial, function calls are denoted with a pair of parentheses at the end. For example, sum(). Packages are marked within a pair of curly brackets. For example, {dplyr}. The code can be found in gray boxes. The white boxes with each row preceded by ## are the outputs of the code.
Prerequisite: I assume you are familiar with the basics of R and the {dplyr} package. If you want to learn more about these, Chapters 3, 4, and 5 of the online book R for Data Science by Garrett Grolemund and Hadley Wickham are helpful.
We will use five packages in this tutorial. The code below loads the necessary packages, installing any that are not already on the system.
pkgs <- c(
"dplyr",
"tidyr",
"tibble",
"epiR",
"broom"
)
# Install any package that is not already on the system, then load them all
invisible(sapply(pkgs, function(.) if (!requireNamespace(., quietly = TRUE)) install.packages(.)))
invisible(sapply(pkgs, library, character.only = TRUE))
In this section, I will discuss a few functions that will come in handy in later parts of the tutorial. Some of these might feel out of place for now, but they will make sense in the end. I promise!
The first function is cumsum(). It takes a numeric vector and returns its cumulative sum. This is straightforward.
x <- c(1, 3, 6, 2, 2)
cumsum(x)
## [1] 1 4 10 12 14
cumsum(c(0, 1, 0, 0, 1, 0, 1))
## [1] 0 1 1 1 2 2 3
In the second example, did you notice how cumsum() preserves the sequence of changes between 0 and 1? We will come back to this when we work with the example datasets.
In x defined above, let's say we find the value 6 and select the elements that come after it, inclusive. That is, we will keep 6, 2, 2 and remove 1, 3. For a single vector, we can do this using the position indices.
x
## [1] 1 3 6 2 2
x[c(3:5)]
## [1] 6 2 2
But this approach does not scale: for multiple vectors, we would have to look up the position of 6 in each one and type in the index numbers by hand. We need a different approach.
cumany() comes to the rescue. It takes a logical vector and returns a logical vector that is TRUE for the first TRUE element in the input and every element after it. We can then use the output to select elements from the vector.
x
## [1] 1 3 6 2 2
x == 6
## [1] FALSE FALSE TRUE FALSE FALSE
cumany(x == 6)
## [1] FALSE FALSE TRUE TRUE TRUE
x[cumany(x == 6)]
## [1] 6 2 2
y <- 1:10
y == 6
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
cumany(y == 6)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
y[cumany(y == 6)]
## [1] 6 7 8 9 10
Quiz: What if we want the opposite? That is, we want to keep the elements before the first TRUE.
# Hint: use "!" (NOT operator)
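If you want to check your answer, here is one possible solution (a sketch using the x defined above):
x[!cumany(x == 6)]
# should return: 1 3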
Sometimes, we want to know when a result changed from negative to positive or vice versa. To do that, we compare each test result with the one next to it. How do we do this in R? One way is to use lead() and lag() from {dplyr}.
lag() shifts the elements of a vector toward a higher index number by one position. Note that the now-empty first position is filled with NA, and the last of the shifted elements is dropped because lag() keeps the same length as the input vector.
x <- c(0, 0, 1, 0, 0, 1)
x
## [1] 0 0 1 0 0 1
lag(x)
## [1] NA 0 0 1 0 0
By looking for TRUE in x != lag(x), we can see that the value of x changed at the third, fourth, and sixth positions.
x
## [1] 0 0 1 0 0 1
x != lag(x)
## [1] NA FALSE TRUE TRUE FALSE TRUE
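As a quick preview of how lag() and cumsum() will be combined later in the tutorial, here is a minimal sketch. Supplying a default for the first element avoids the leading NA (which would otherwise make every cumulative sum NA), and the cumulative sum then numbers the runs of identical values. The default of -1 is arbitrary, just a value that cannot appear in x.
cumsum(x != lag(x, default = -1))
# should return: 1 1 2 3 3 4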
lead() does the opposite of lag() – it shifts the elements to a lower index number by one position.
x
## [1] 0 0 1 0 0 1
lead(x)
## [1] 0 1 0 0 1 NA
x != lead(x)
## [1] FALSE TRUE TRUE FALSE TRUE NA
By default, lead() and lag() shift the elements by one position. We can change this by supplying a positive integer to the n argument. Also, the empty positions are filled with NA by default after shifting; we can change that with the default argument.
x
## [1] 0 0 1 0 0 1
lag(x)
## [1] NA 0 0 1 0 0
lag(x, n = 3)
## [1] NA NA NA 0 0 1
lag(x, n = 3, default = 999)
## [1] 999 999 999 0 0 1
lead(x)
## [1] 0 1 0 0 1 NA
lead(x, n = 3)
## [1] 0 0 1 NA NA NA
lead(x, n = 3, default = 999)
## [1] 0 0 1 999 999 999
We will work with two mock datasets, read in below with {readr} and {here}.
dat_ue <- readr::read_csv(here::here("data", "derived_data", "unequal_interval.csv"))
dat_e <- readr::read_csv(here::here("data", "derived_data", "equal_interval.csv"))
Both datasets have the same columns:
id: Study ID
exposed: Exposure status (0: Not exposed, 1: Exposed)
d: Follow-up date
fu: Follow-up instance
result: Test result (neg: Negative, pos: Positive)
The main difference between the two datasets is that, in one dataset, the dates of successive follow-up instances are equally spaced one day apart, while in the other they are not. Here is what the datasets look like.
Dataset 1: Unequal follow-up interval
head(dat_ue)
## # A tibble: 6 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2012-03-03 1 neg
## 2 1 0 2012-03-13 2 pos
## 3 1 0 2012-04-01 3 pos
## 4 1 0 2012-04-16 4 pos
## 5 1 0 2012-04-27 5 neg
## 6 1 0 2012-05-25 6 neg
Dataset 2: Equal follow-up interval
head(dat_e)
## # A tibble: 6 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2011-12-26 1 neg
## 2 1 0 2011-12-27 2 pos
## 3 1 0 2011-12-28 3 pos
## 4 1 0 2011-12-29 4 pos
## 5 1 0 2011-12-30 5 neg
## 6 1 0 2011-12-31 6 neg
Let's start with the first dataset (dat_ue). Our objective is to calculate the incidence rate among participants with a negative result at baseline.
First, let's see how many participants there are in the dataset.
# Number of unique IDs
dat_ue %>%
distinct(id) %>%
nrow()
## [1] 15
Let's check how many participants (if any) had a positive result at baseline.
dat_ue %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 2 x 2
## result n
## <chr> <int>
## 1 neg 11
## 2 pos 4
Four were positive at baseline. Since they are ineligible, we need to remove them. We can do this in two ways. The first approach is to identify the ineligible participants and then filter them out of the dataset using the %in% operator.
first_pos <- dat_ue %>%
filter(fu == 1 & result == "pos")
dat_ue_1 <- dat_ue %>%
filter(!id %in% first_pos$id)
Let’s check if it’s done correctly.
dat_ue_1 %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 1 x 2
## result n
## <chr> <int>
## 1 neg 11
The second approach is to use cumany() with group_by(). group_by(id) groups the rows by participant. Recall that cumany() returns TRUE from the first TRUE element onwards. Within each participant, the condition fu == 1 & result == "pos" is TRUE on the first row for anyone who was positive at baseline, so cumany() marks all of that participant's rows TRUE (and every row FALSE for everyone else). Filtering on TRUE would keep exactly the participants we want to drop, which is the opposite of what we want, so we flip the logical vector with the NOT ("!") operator.
dat_ue_1 <- dat_ue %>%
group_by(id) %>%
filter(!cumany(fu == 1 & result == "pos")) %>%
ungroup()
Let’s check if it’s done correctly.
dat_ue_1 %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 1 x 2
## result n
## <chr> <int>
## 1 neg 11
Both approaches are similar in terms of the number of lines and code complexity. Personally, I find the second approach more elegant in that subsequent operations can be chained together seamlessly, without needing a separate object to hold the ineligible participants.
Now, let’s count the positive cases among eligible participants.
dat_ue_1 %>%
filter(result == "pos") %>%
distinct(id) %>%
nrow()
## [1] 7
Among the 11 eligible participants, 7 had at least one positive result during the follow-up period. Since we only want to compute the incidence rate up to the first positive result, we will identify each participant's first positive result and remove the subsequent rows.
The code below reuses the cumsum() trick from earlier. flag1 holds the lagged result (with an "impossible" default so that the first comparison is always TRUE), flag2 flags the rows where the result changed from the previous row, and cumsum(flag2) numbers the runs of identical consecutive results within each participant (the logical vector is coerced to integers, FALSE to 0 and TRUE to 1, before the cumulative sum is taken). Because the baseline positives have been removed, the first run (flag3 == 1) is the initial stretch of negative results and the second run (flag3 == 2), if any, begins with the first positive result. We keep the whole first run plus only the first row (n == 1) of the second run.
dat_ue_2 <- dat_ue_1 %>%
group_by(id) %>%
mutate(flag1 = lag(result, default = "impossible"),
flag2 = result != flag1,
flag3 = cumsum(flag2)) %>%
group_by(id, flag3) %>%
mutate(n = row_number()) %>%
filter(flag3 == 1 | (flag3 == 2 & n == 1)) %>%
ungroup()
head(dat_ue_2)
## # A tibble: 6 x 9
## id exposed d fu result flag1 flag2 flag3 n
## <dbl> <dbl> <date> <dbl> <chr> <chr> <lgl> <int> <int>
## 1 1 0 2012-03-03 1 neg impossible TRUE 1 1
## 2 1 0 2012-03-13 2 pos neg TRUE 2 1
## 3 2 0 2012-02-14 1 neg impossible TRUE 1 1
## 4 2 0 2012-02-24 2 pos neg TRUE 2 1
## 5 4 1 2012-02-29 1 neg impossible TRUE 1 1
## 6 4 1 2012-03-28 2 neg neg FALSE 1 2
Notice that the follow-up of participants 1 and 2 now ends with their first positive result.
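As an aside, the same truncation can also be written more compactly by applying cumsum() directly to result == "pos". This is only a sketch of an alternative, not the approach used above; it relies on the baseline positives having already been removed:
dat_ue_1 %>%
  group_by(id) %>%
  # keep rows where no positive result has occurred on an earlier row,
  # i.e. everything up to and including the first positive result
  filter(lag(cumsum(result == "pos"), default = 0L) == 0) %>%
  ungroup()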
To compute person-days at risk, we first need each participant's follow-up start and end dates, obtained with min() and max() on the date column. The expression result == "pos" converts the results to 0/1, and taking max() of it within each participant gives 1 if the participant ever tested positive and 0 otherwise. At this point, exposure status, result, and the start and end dates are identical across the rows of each participant, so we drop the duplicated rows with distinct(). Finally, we compute person-days at risk as the difference between the end and start dates using difftime(). This produces a difftime-class column, which we convert to a double so it can be used in later summary computations.
dat_ue_3 <- dat_ue_2 %>%
group_by(id) %>%
mutate(start = min(d),
end = max(d),
result = max(result == "pos")) %>%
ungroup() %>%
distinct(id, exposed, start, end, result) %>%
mutate(persondays = as.double(difftime(end, start, units = "days")))
dat_ue_3
## # A tibble: 11 x 6
## id exposed result start end persondays
## <dbl> <dbl> <int> <date> <date> <dbl>
## 1 1 0 1 2012-03-03 2012-03-13 10
## 2 2 0 1 2012-02-14 2012-02-24 10
## 3 4 1 0 2012-02-29 2012-05-01 62
## 4 6 1 1 2012-02-08 2012-05-05 87
## 5 7 0 1 2012-02-20 2012-03-17 26
## 6 9 0 0 2012-01-19 2012-03-01 42
## 7 10 0 0 2012-01-22 2012-06-15 145
## 8 11 0 0 2012-01-24 2012-05-17 114
## 9 12 0 1 2012-01-13 2012-03-10 57
## 10 13 1 1 2012-01-02 2012-02-20 49
## 11 14 0 1 2012-03-10 2012-05-25 76
At this point, the dataset only contains one row per participant.
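As a quick check (a sketch; n_distinct() comes from {dplyr}), the number of rows should equal the number of unique IDs:
nrow(dat_ue_3) == n_distinct(dat_ue_3$id)
# should return TRUE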
To compute incidence rates by exposure status, we use epi.2by2() from {epiR}. Take a look at the help file of the function (type ?epi.2by2 in the console). The first argument, dat, accepts a table-class object containing the cell frequencies of a two by two contingency table. Here is what our table should look like.
dat <- as.table(matrix(c(5, 480, 2, 198), nrow = 2, byrow = TRUE))
class(dat)
## [1] "table"
dim(dat)
## [1] 2 2
dat
## A B
## A 5 480
## B 2 198
The table has dimensions 2 by 2, with the outcome in columns and the exposure in rows. It was created by converting a matrix of four numbers into a table.
Now, let’s create a similar contingency table using summarise() from {dplyr}.
tab_df <- dat_ue_3 %>%
group_by(exposed) %>%
summarise(pos = sum(result),
fu = sum(persondays))
tab_df
## # A tibble: 2 x 3
## exposed pos fu
## <dbl> <int> <dbl>
## 1 0 5 480
## 2 1 2 198
This contingency table seems similar to the one above. But there are two problems. Let’s check the class and dimensions of tab_df.
dim(tab_df)
## [1] 2 3
class(tab_df)
## [1] "tbl_df" "tbl" "data.frame"
First, the dimension is 2 by 3, not 2 by 2, because exposed is stored as a column. Second, summarise() returns a data frame, not a matrix/table. We convert the exposed column to row names using column_to_rownames("exposed"); that solves the first problem.
For the second problem, it turns out that a data frame can be converted to a matrix using as.matrix(), and the matrix can then be converted to a table using as.table(). Voilà! We get the table that epi.2by2() expects.
tab <- tab_df %>%
column_to_rownames("exposed") %>% # Convert a column to row names
as.matrix() %>% # Convert to matrix
as.table() # Convert to a table
class(tab)
## [1] "table"
dim(tab)
## [1] 2 2
tab
## pos fu
## 0 5 480
## 1 2 198
Then, we feed tab to epi.2by2(). We tell the function that this is a cohort study with follow-up time (method = "cohort.time"), that rates should be expressed per 100 units of time (days in our example), and that the outcome is displayed as the columns of the table. The default 95% confidence level applies.
tab_epi <- epi.2by2(dat = tab,
method = "cohort.time",
units = 100,
outcome = "as.columns")
tab_epi
## Outcome + Time at risk Inc rate *
## Exposed + 5 480 1.04
## Exposed - 2 198 1.01
## Total 7 678 1.03
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc rate ratio 1.03 (0.17, 10.83)
## Attrib rate * 0.03 (-1.64, 1.70)
## Attrib rate in population * 0.02 (-1.57, 1.62)
## Attrib fraction in exposed (%) 3.03 (-492.31, 90.77)
## Attrib fraction in population (%) 2.16 (-71.53, 70.53)
## -------------------------------------------------------------------
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 units of population time at risk
The output shows that the incidence rates among the exposed and the unexposed were 1.04 and 1.01 per 100 person-days, respectively. The incidence rate ratio was 1.03 (95% CI: 0.17, 10.83). If you want to extract the point estimates and CIs as a data frame, use tidy() from {broom}.
tab_epi_tidy <- tidy(tab_epi)
tab_epi_tidy
## # A tibble: 5 x 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 IRR.strata.wald 1.03 0.169 10.8
## 2 ARate.strata.wald 0.0316 -1.64 1.70
## 3 PARate.strata.wald 0.0223 -1.57 1.62
## 4 AFRate.strata.wald 0.0303 -4.92 0.908
## 5 PAFRate.strata.wald 0.0216 -0.715 0.705
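For example, to pull out just the incidence rate ratio (using the term name shown in the output above):
tab_epi_tidy %>%
  filter(term == "IRR.strata.wald")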
I will close this section by demonstrating that the above steps can be chained together easily.
tab_epi <- dat_ue_3 %>%
group_by(exposed) %>%
summarise(pos = sum(result),
fu = sum(persondays)) %>%
column_to_rownames("exposed") %>%
as.matrix() %>%
as.table() %>%
epi.2by2(method = "cohort.time",
units = 100,
outcome = "as.columns")
Sometimes, we want to compute person-time for repeated positive results. Let’s look at how it can be done.
First, let's go back to the data frame dat_ue_1, from which we have already removed the ineligible participants. Here is what it looks like for the first two participants.
## # A tibble: 16 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2012-03-03 1 neg
## 2 1 0 2012-03-13 2 pos
## 3 1 0 2012-04-01 3 pos
## 4 1 0 2012-04-16 4 pos
## 5 1 0 2012-04-27 5 neg
## 6 1 0 2012-05-25 6 neg
## 7 1 0 2012-06-11 7 neg
## 8 1 0 2012-06-18 8 neg
## 9 2 0 2012-02-14 1 neg
## 10 2 0 2012-02-24 2 pos
## 11 2 0 2012-03-01 3 neg
## 12 2 0 2012-03-24 4 pos
## 13 2 0 2012-04-17 5 pos
## 14 2 0 2012-05-08 6 neg
## 15 2 0 2012-06-06 7 neg
## 16 2 0 2012-06-17 8 neg
To get ready for computing person-days at risk, we need to turn dat_ue_1 into this:
## # A tibble: 5 x 7
## # Groups: id [2]
## id exposed d1 d2 from_res to_res duration
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-04-27 2012-06-18 neg neg 52
## 3 2 0 2012-02-14 2012-02-24 neg pos 10
## 4 2 0 2012-03-01 2012-03-24 neg pos 23
## 5 2 0 2012-05-08 2012-06-17 neg neg 40
We do it in a series of steps. These are a little bit tricky. Please pay attention to the changes in each step. You may need to view the dataset after executing each step.
A useful trick: one feature of RStudio I find useful is the ability to view a dataset by clicking on the data frame name in a script file while holding the Control key.
Flag the rows where the result changed from the previous row by comparing result with lag(result), using a default so the first row always counts as a change. This produces a logical vector, which is then passed to cumsum() to number the sequences of identical results.
s1 <- dat_ue_1 %>%
group_by(id) %>%
mutate(
flag = lag(result, default = "missing"),
flag = result != flag,
seq = cumsum(flag)
)
head(s1, 16)
## # A tibble: 16 x 7
## # Groups: id [2]
## id exposed d fu result flag seq
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int>
## 1 1 0 2012-03-03 1 neg TRUE 1
## 2 1 0 2012-03-13 2 pos TRUE 2
## 3 1 0 2012-04-01 3 pos FALSE 2
## 4 1 0 2012-04-16 4 pos FALSE 2
## 5 1 0 2012-04-27 5 neg TRUE 3
## 6 1 0 2012-05-25 6 neg FALSE 3
## 7 1 0 2012-06-11 7 neg FALSE 3
## 8 1 0 2012-06-18 8 neg FALSE 3
## 9 2 0 2012-02-14 1 neg TRUE 1
## 10 2 0 2012-02-24 2 pos TRUE 2
## 11 2 0 2012-03-01 3 neg TRUE 3
## 12 2 0 2012-03-24 4 pos TRUE 4
## 13 2 0 2012-04-17 5 pos FALSE 4
## 14 2 0 2012-05-08 6 neg TRUE 5
## 15 2 0 2012-06-06 7 neg FALSE 5
## 16 2 0 2012-06-17 8 neg FALSE 5
Group by id and seq and take the earliest date in each sequence using min(). This notes the start date of the sequence, which will also serve as the end date of the previous sequence.
s2 <- s1 %>%
group_by(id, seq) %>%
mutate(d1 = min(d))
head(s2, 16)
## # A tibble: 16 x 8
## # Groups: id, seq [8]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-01 3 pos FALSE 2 2012-03-13
## 4 1 0 2012-04-16 4 pos FALSE 2 2012-03-13
## 5 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 6 1 0 2012-05-25 6 neg FALSE 3 2012-04-27
## 7 1 0 2012-06-11 7 neg FALSE 3 2012-04-27
## 8 1 0 2012-06-18 8 neg FALSE 3 2012-04-27
## 9 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 10 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 11 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 12 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 13 2 0 2012-04-17 5 pos FALSE 4 2012-03-24
## 14 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 15 2 0 2012-06-06 7 neg FALSE 5 2012-05-08
## 16 2 0 2012-06-17 8 neg FALSE 5 2012-05-08
For each participant's last follow-up instance, replace d1 with the date of that last visit, so the final sequence also records its end date.
s3 <- s2 %>%
group_by(id) %>%
mutate(d1 = if_else(fu == max(fu), d, d1))
head(s3, 16)
## # A tibble: 16 x 8
## # Groups: id [2]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-01 3 pos FALSE 2 2012-03-13
## 4 1 0 2012-04-16 4 pos FALSE 2 2012-03-13
## 5 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 6 1 0 2012-05-25 6 neg FALSE 3 2012-04-27
## 7 1 0 2012-06-11 7 neg FALSE 3 2012-04-27
## 8 1 0 2012-06-18 8 neg FALSE 3 2012-06-18
## 9 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 10 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 11 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 12 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 13 2 0 2012-04-17 5 pos FALSE 4 2012-03-24
## 14 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 15 2 0 2012-06-06 7 neg FALSE 5 2012-05-08
## 16 2 0 2012-06-17 8 neg FALSE 5 2012-06-17
Within each participant, rows belonging to the same sequence now share the same seq and d1 values. Remove the duplicated rows without dropping the other columns. Note .keep_all = TRUE in distinct(), as distinct() only keeps the columns specified in the call by default.
# Reminder: Grouped by id
s4 <- s3 %>%
distinct(id, seq, d1, .keep_all = TRUE)
head(s4, 10)
## # A tibble: 10 x 8
## # Groups: id [2]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17
Copy result into from_res and create a new column, to_res, in which the result column is pushed up by one position for each participant using lead().
# Reminder: Grouped by id
s5 <- s4 %>%
mutate(from_res = result,
to_res = lead(result))
head(s5, 10)
## # A tibble: 10 x 10
## # Groups: id [2]
## id exposed d fu result flag seq d1 from_res to_res
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date> <chr> <chr>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03 neg pos
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13 pos neg
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27 neg neg
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18 neg <NA>
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14 neg pos
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24 pos neg
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01 neg pos
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24 pos neg
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08 neg neg
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17 neg <NA>
Create d2 by pushing d1 one row up with lead(). The follow-up time in each result sequence is then the difference between d2 and d1.
# Grouped by id
s6 <- s5 %>%
mutate(d2 = lead(d1),
persondays = as.double(difftime(d2, d1, units = "days")))
head(s6, 10)
## # A tibble: 10 x 12
## # Groups: id [2]
## id exposed d fu result flag seq d1 from_res to_res
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date> <chr> <chr>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03 neg pos
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13 pos neg
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27 neg neg
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18 neg <NA>
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14 neg pos
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24 pos neg
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01 neg pos
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24 pos neg
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08 neg neg
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17 neg <NA>
## # ... with 2 more variables: d2 <date>, persondays <dbl>
Remove the last row of each participant, which now has missing to_res, d2, and persondays, using slice(-n()). Then, keep only the necessary columns.
# Reminder: Grouped by id
s7 <- s6 %>%
slice(-n()) %>%
select(id, exposed, d1, d2, from_res, to_res, persondays)
head(s7, 8)
## # A tibble: 8 x 7
## # Groups: id [2]
## id exposed d1 d2 from_res to_res persondays
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-03-13 2012-04-27 pos neg 45
## 3 1 0 2012-04-27 2012-06-18 neg neg 52
## 4 2 0 2012-02-14 2012-02-24 neg pos 10
## 5 2 0 2012-02-24 2012-03-01 pos neg 6
## 6 2 0 2012-03-01 2012-03-24 neg pos 23
## 7 2 0 2012-03-24 2012-05-08 pos neg 45
## 8 2 0 2012-05-08 2012-06-17 neg neg 40
Finally, remove the intervals that run from a positive to a negative result, so that only intervals starting from a negative result (the at-risk person-time) remain, and ungroup.
# Reminder: Grouped by id
s8 <- s7 %>%
filter(!(from_res == "pos" & to_res == "neg")) %>%
ungroup()
head(s8, 5)
## # A tibble: 5 x 7
## id exposed d1 d2 from_res to_res persondays
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-04-27 2012-06-18 neg neg 52
## 3 2 0 2012-02-14 2012-02-24 neg pos 10
## 4 2 0 2012-03-01 2012-03-24 neg pos 23
## 5 2 0 2012-05-08 2012-06-17 neg neg 40
It is easier to understand these steps if they are put together in a single chain. Here it is.
s8 <- dat_ue_1 %>%
group_by(id) %>%
mutate(
flag = lag(result, default = "missing"),
flag = result != flag,
seq = cumsum(flag)
) %>%
group_by(id, seq) %>%
mutate(d1 = min(d)) %>%
group_by(id) %>%
mutate(d1 = if_else(fu == max(fu), d, d1)) %>%
distinct(id, seq, d1, .keep_all = TRUE) %>%
mutate(from_res = result,
to_res = lead(result)) %>%
mutate(d2 = lead(d1),
persondays = as.double(difftime(d2, d1, units = "days"))) %>%
slice(-n()) %>%
select(id, exposed, d1, d2, from_res, to_res, persondays) %>%
filter(!(from_res == "pos" & to_res == "neg")) %>%
ungroup()
Quiz: Compute incidence rate ratio and 95% CI using s8.
#
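If you want to check your answer, here is a sketch of one possible solution. It assumes we treat intervals ending in a positive result as events and the sum of persondays as the time at risk:
s8 %>%
  group_by(exposed) %>%
  summarise(pos = sum(to_res == "pos"),  # events: intervals ending in a positive result
            fu = sum(persondays)) %>%    # person-days at risk
  column_to_rownames("exposed") %>%
  as.matrix() %>%
  as.table() %>%
  epi.2by2(method = "cohort.time", units = 100, outcome = "as.columns")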
The operations described above can also be used to manage datasets with equal follow-up intervals. These data are more structured than those with unequal intervals, which means that the follow-up dates and results can easily be pivoted to one column per follow-up instance and one row per participant. This makes filtering rows by dates and results much easier.
dat_e_1 <- dat_e %>%
pivot_wider(names_from = fu,
values_from = c(d, result)) %>%
# Remove patients with a positive result at the first follow-up
filter(result_1 != "pos")
head(dat_e_1)
## # A tibble: 6 x 22
## id exposed d_1 d_2 d_3 d_4 d_5
## <dbl> <dbl> <date> <date> <date> <date> <date>
## 1 1 0 2011-12-26 2011-12-27 2011-12-28 2011-12-29 2011-12-30
## 2 2 0 2012-02-15 2012-02-16 2012-02-17 2012-02-18 2012-02-19
## 3 4 1 2012-02-05 2012-02-06 2012-02-07 2012-02-08 2012-02-09
## 4 6 1 2011-12-30 2011-12-31 2012-01-01 2012-01-02 2012-01-03
## 5 7 0 2012-02-26 2012-02-27 2012-02-28 2012-02-29 2012-03-01
## 6 9 0 2012-03-02 2012-03-03 2012-03-04 2012-03-05 2012-03-06
## # ... with 15 more variables: d_6 <date>, d_7 <date>, d_8 <date>, d_9 <date>,
## # d_10 <date>, result_1 <chr>, result_2 <chr>, result_3 <chr>,
## # result_4 <chr>, result_5 <chr>, result_6 <chr>, result_7 <chr>,
## # result_8 <chr>, result_9 <chr>, result_10 <chr>
Note that there is now one row per participant and that the follow-up dates and results are structured in columns, with one column per follow-up instance.
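For instance, here is a sketch (assuming this wide layout) of how one might keep only the participants who ever tested positive, scanning the result columns with across():
dat_e_1 %>%
  filter(rowSums(across(starts_with("result_"), ~ .x == "pos")) > 0)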
In this tutorial, we focused on data management and also covered some basic epidemiological analysis of time-to-event data. We learned some useful functions and used them to work through examples with mock datasets. Real-life problems are, of course, different and more complex, and chances are there is no single function, or set of functions, perfectly tailored to your needs. So it is important to understand the logic behind each step, so that you can use these basic building blocks to tackle more complex problems.
Session info (for reproducibility)
xfun::session_info()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Locale:
## LC_COLLATE=English_United States.1252
## LC_CTYPE=English_United States.1252
## LC_MONETARY=English_United States.1252
## LC_NUMERIC=C
## LC_TIME=English_United States.1252
##
## Package version:
## assertthat_0.2.1 backports_1.1.8 base64enc_0.1.3 BH_1.72.0.3
## BiasedUrn_1.07 broom_0.7.0 cli_2.0.2 clipr_0.7.0
## compiler_4.0.2 cpp11_0.2.1 crayon_1.3.4 digest_0.6.25
## dplyr_1.0.1 ellipsis_0.3.1 epiR_1.0-15 evaluate_0.14
## fansi_0.4.1 generics_0.0.2 glue_1.4.1 graphics_4.0.2
## grDevices_4.0.2 grid_4.0.2 here_0.1 highr_0.8
## hms_0.5.3 htmltools_0.5.0 jsonlite_1.7.0 knitr_1.29
## lattice_0.20-41 lifecycle_0.2.0 magrittr_1.5 markdown_1.1
## Matrix_1.2-18 methods_4.0.2 mime_0.9 pillar_1.4.6
## pkgconfig_2.0.3 purrr_0.3.4 R6_2.4.1 Rcpp_1.0.5
## readr_1.3.1 rlang_0.4.7 rmarkdown_2.3 rprojroot_1.3-2
## splines_4.0.2 stats_4.0.2 stringi_1.4.6 stringr_1.4.0
## survival_3.1-12 tibble_3.0.3 tidyr_1.1.1 tidyselect_1.1.0
## tinytex_0.25 tools_4.0.2 utf8_1.1.4 utils_4.0.2
## vctrs_0.3.2 xfun_0.16 yaml_2.2.1