In this tutorial, I share some functions and tricks for manipulating follow-up data for analysis with two by two contingency tables, using some mock data. Follow-up data are typically collected in cohort studies or randomised controlled trials.
You can download all files and data related to this tutorial from my GitHub repository.
Throughout the tutorial, function calls are denoted with a pair of parentheses at the end. For example, sum(). Packages are marked within a pair of curly brackets. For example, {dplyr}. The code can be found in gray boxes. The white boxes with each row preceded by ## are the outputs of the code.
Prerequisite: I assume you are familiar with the basics of R and the {dplyr} package. If you want to learn more about these, Chapters 3, 4, and 5 of the online book R for Data Science by Garrett Grolemund and Hadley Wickham are helpful.
We will use five packages in this tutorial. The code below loads the necessary packages, installing any that are not already on the system.
pkgs <- c(
"dplyr",
"tidyr",
"tibble",
"epiR",
"broom"
)
# Install any package that is not already on the system, then load them all
invisible(sapply(pkgs, function(.) if (!requireNamespace(., quietly = TRUE)) install.packages(.)))
invisible(sapply(pkgs, library, character.only = TRUE))
In this section, I will discuss a few functions that will come in handy in later parts of the tutorial. Some of these might feel out of place for now, but they will make sense in the end. I promise!
The first function is cumsum(). It takes a numeric vector and returns its cumulative sum. This is straightforward.
x <- c(1, 3, 6, 2, 2)
cumsum(x)
## [1] 1 4 10 12 14
cumsum(c(0, 1, 0, 0, 1, 0, 1))
## [1] 0 1 1 1 2 2 3
In the second example, did you notice how cumsum() preserves the sequence of changes between 0 and 1? We will come back to this when we work with the example datasets.
In x defined above, let's say we find the value 6 and select the elements that come after it, inclusive. That is, we will keep 6, 2, 2 and remove 1, 3. For a single vector, we can do this using the position indices.
x
## [1] 1 3 6 2 2
x[c(3:5)]
## [1] 6 2 2
But this approach does not scale: for multiple vectors, we would have to look up the position of 6 in each one and type in the index numbers by hand. We need a different approach.
cumany() comes to the rescue. It takes a logical vector and returns a logical vector that is TRUE for the first TRUE element in the input and every element after it. We can then use the output to select elements from the vector.
x
## [1] 1 3 6 2 2
x == 6
## [1] FALSE FALSE TRUE FALSE FALSE
cumany(x == 6)
## [1] FALSE FALSE TRUE TRUE TRUE
x[cumany(x == 6)]
## [1] 6 2 2
y <- 1:10
y == 6
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
cumany(y == 6)
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
y[cumany(y == 6)]
## [1] 6 7 8 9 10
Quiz: What if we want the opposite? That is, we want to keep the elements before the first TRUE.
# Hint: use "!" (NOT operator)
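If you want to check your answer, here is one possible solution (a sketch using the x defined above):
x[!cumany(x == 6)]
# should return: 1 3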
Sometimes, we want to know when a result changed from negative to positive or vice versa. To do that, we compare each test result with the one next to it. How do we do this in R? One way is to use lead() and lag() from {dplyr}.
lag() shifts the elements of a vector toward a higher index number by one position. Note that the now-empty first position is filled with NA, and the last of the shifted elements is dropped because lag() keeps the same length as the input vector.
x <- c(0, 0, 1, 0, 0, 1)
x
## [1] 0 0 1 0 0 1
lag(x)
## [1] NA 0 0 1 0 0
By looking for TRUE in x != lag(x), we can see that the value of x changed at the third, fourth, and sixth positions.
x
## [1] 0 0 1 0 0 1
x != lag(x)
## [1] NA FALSE TRUE TRUE FALSE TRUE
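As a quick preview of how lag() and cumsum() will be combined later in the tutorial, here is a minimal sketch. Supplying a default for the first element avoids the leading NA (which would otherwise make every cumulative sum NA), and the cumulative sum then numbers the runs of identical values. The default of -1 is arbitrary, just a value that cannot appear in x.
cumsum(x != lag(x, default = -1))
# should return: 1 1 2 3 3 4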
lead() does the opposite of lag() – it shifts the elements to a lower index number by one position.
x
## [1] 0 0 1 0 0 1
lead(x)
## [1] 0 1 0 0 1 NA
x != lead(x)
## [1] FALSE TRUE TRUE FALSE TRUE NA
By default, lead() and lag() shift the elements by one position. We can change this by supplying a positive integer to the n argument. Also, the empty positions are filled with NA by default after shifting; we can change that with the default argument.
x
## [1] 0 0 1 0 0 1
lag(x)
## [1] NA 0 0 1 0 0
lag(x, n = 3)
## [1] NA NA NA 0 0 1
lag(x, n = 3, default = 999)
## [1] 999 999 999 0 0 1
lead(x)
## [1] 0 1 0 0 1 NA
lead(x, n = 3)
## [1] 0 0 1 NA NA NA
lead(x, n = 3, default = 999)
## [1] 0 0 1 999 999 999
We will work with two mock datasets, read in below with {readr} and {here}.
dat_ue <- readr::read_csv(here::here("data", "derived_data", "unequal_interval.csv"))
dat_e <- readr::read_csv(here::here("data", "derived_data", "equal_interval.csv"))
Both datasets have the same columns:
id: Study ID
exposed: Exposure status (0: Not exposed, 1: Exposed)
d: Follow-up date
fu: Follow-up instance
result: Test result (neg: Negative, pos: Positive)
The main difference between the two datasets is that, in one dataset, the dates of successive follow-up instances are equally spaced one day apart, while in the other they are not. Here is what the datasets look like.
Dataset 1: Unequal follow-up interval
head(dat_ue)
## # A tibble: 6 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2012-03-03 1 neg
## 2 1 0 2012-03-13 2 pos
## 3 1 0 2012-04-01 3 pos
## 4 1 0 2012-04-16 4 pos
## 5 1 0 2012-04-27 5 neg
## 6 1 0 2012-05-25 6 neg
Dataset 2: Equal follow-up interval
head(dat_e)
## # A tibble: 6 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2011-12-26 1 neg
## 2 1 0 2011-12-27 2 pos
## 3 1 0 2011-12-28 3 pos
## 4 1 0 2011-12-29 4 pos
## 5 1 0 2011-12-30 5 neg
## 6 1 0 2011-12-31 6 neg
Let's start with the first dataset (dat_ue). Our objective is to calculate the incidence rate among participants with a negative result at baseline.
First, let's see how many participants there are in the dataset.
# Number of unique IDs
dat_ue %>%
distinct(id) %>%
nrow()
## [1] 15
Let's check how many participants (if any) had a positive result at baseline.
dat_ue %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 2 x 2
## result n
## <chr> <int>
## 1 neg 11
## 2 pos 4
Four were positive at baseline. Since they are ineligible, we need to remove them. We can do this in two ways. The first approach is to identify the ineligible participants and then filter them out of the dataset using the %in% operator.
first_pos <- dat_ue %>%
filter(fu == 1 & result == "pos")
dat_ue_1 <- dat_ue %>%
filter(!id %in% first_pos$id)
Let’s check if it’s done correctly.
dat_ue_1 %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 1 x 2
## result n
## <chr> <int>
## 1 neg 11
The second approach is to use cumany() with group_by(). group_by(id) groups the rows by participant. Recall that cumany() returns TRUE from the first TRUE element onwards. Within each participant, the condition fu == 1 & result == "pos" is TRUE on the first row for anyone who was positive at baseline, so cumany() marks all of that participant's rows TRUE (and every row FALSE for everyone else). Filtering on TRUE would keep exactly the participants we want to drop, which is the opposite of what we want, so we flip the logical vector with the NOT ("!") operator.
dat_ue_1 <- dat_ue %>%
group_by(id) %>%
filter(!cumany(fu == 1 & result == "pos")) %>%
ungroup()
Let’s check if it’s done correctly.
dat_ue_1 %>%
filter(fu == 1) %>%
count(result)
## # A tibble: 1 x 2
## result n
## <chr> <int>
## 1 neg 11
Both approaches are similar in terms of the number of lines and code complexity. Personally, I find the second approach more elegant in that subsequent operations can be chained together seamlessly, without needing a separate object to hold the ineligible participants.
Now, let’s count the positive cases among eligible participants.
dat_ue_1 %>%
filter(result == "pos") %>%
distinct(id) %>%
nrow()
## [1] 7
Among the 11 eligible participants, 7 had at least one positive result during the follow-up period. Since we only want to compute the incidence rate up to the first positive result, we will identify each participant's first positive result and remove the subsequent rows.
The code below reuses the cumsum() trick from earlier. flag1 holds the lagged result (with an "impossible" default so that the first comparison is always TRUE), flag2 flags the rows where the result changed from the previous row, and cumsum(flag2) numbers the runs of identical consecutive results within each participant (the logical vector is coerced to integers, FALSE to 0 and TRUE to 1, before the cumulative sum is taken). Because the baseline positives have been removed, the first run (flag3 == 1) is the initial stretch of negative results and the second run (flag3 == 2), if any, begins with the first positive result. We keep the whole first run plus only the first row (n == 1) of the second run.
dat_ue_2 <- dat_ue_1 %>%
group_by(id) %>%
mutate(flag1 = lag(result, default = "impossible"),
flag2 = result != flag1,
flag3 = cumsum(flag2)) %>%
group_by(id, flag3) %>%
mutate(n = row_number()) %>%
filter(flag3 == 1 | (flag3 == 2 & n == 1)) %>%
ungroup()
head(dat_ue_2)
## # A tibble: 6 x 9
## id exposed d fu result flag1 flag2 flag3 n
## <dbl> <dbl> <date> <dbl> <chr> <chr> <lgl> <int> <int>
## 1 1 0 2012-03-03 1 neg impossible TRUE 1 1
## 2 1 0 2012-03-13 2 pos neg TRUE 2 1
## 3 2 0 2012-02-14 1 neg impossible TRUE 1 1
## 4 2 0 2012-02-24 2 pos neg TRUE 2 1
## 5 4 1 2012-02-29 1 neg impossible TRUE 1 1
## 6 4 1 2012-03-28 2 neg neg FALSE 1 2
Notice that the follow-up of participants 1 and 2 now ends with their first positive result.
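As an aside, the same truncation can also be written more compactly by applying cumsum() directly to result == "pos". This is only a sketch of an alternative, not the approach used above; it relies on the baseline positives having already been removed:
dat_ue_1 %>%
  group_by(id) %>%
  # keep rows where no positive result has occurred on an earlier row,
  # i.e. everything up to and including the first positive result
  filter(lag(cumsum(result == "pos"), default = 0L) == 0) %>%
  ungroup()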
To compute person-days at risk, we first need each participant's follow-up start and end dates, obtained with min() and max() on the date column. The expression result == "pos" converts the results to 0/1, and taking max() of it within each participant gives 1 if the participant ever tested positive and 0 otherwise. At this point, exposure status, result, and the start and end dates are identical across the rows of each participant, so we drop the duplicated rows with distinct(). Finally, we compute person-days at risk as the difference between the end and start dates using difftime(). This produces a difftime-class column, which we convert to a double so it can be used in later summary computations.
dat_ue_3 <- dat_ue_2 %>%
group_by(id) %>%
mutate(start = min(d),
end = max(d),
result = max(result == "pos")) %>%
ungroup() %>%
distinct(id, exposed, start, end, result) %>%
mutate(persondays = as.double(difftime(end, start, units = "days")))
dat_ue_3
## # A tibble: 11 x 6
## id exposed result start end persondays
## <dbl> <dbl> <int> <date> <date> <dbl>
## 1 1 0 1 2012-03-03 2012-03-13 10
## 2 2 0 1 2012-02-14 2012-02-24 10
## 3 4 1 0 2012-02-29 2012-05-01 62
## 4 6 1 1 2012-02-08 2012-05-05 87
## 5 7 0 1 2012-02-20 2012-03-17 26
## 6 9 0 0 2012-01-19 2012-03-01 42
## 7 10 0 0 2012-01-22 2012-06-15 145
## 8 11 0 0 2012-01-24 2012-05-17 114
## 9 12 0 1 2012-01-13 2012-03-10 57
## 10 13 1 1 2012-01-02 2012-02-20 49
## 11 14 0 1 2012-03-10 2012-05-25 76
At this point, the dataset only contains one row per participant.
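As a quick check (a sketch; n_distinct() comes from {dplyr}), the number of rows should equal the number of unique IDs:
nrow(dat_ue_3) == n_distinct(dat_ue_3$id)
# should return TRUE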
To compute incidence rates by exposure status, we use epi.2by2() from {epiR}. Take a look at the help file of the function (type ?epi.2by2 in the console). The first argument, dat, accepts a table-class object containing the cell frequencies of a two by two contingency table. Here is what our table should look like.
dat <- as.table(matrix(c(5, 480, 2, 198), nrow = 2, byrow = TRUE))
class(dat)
## [1] "table"
dim(dat)
## [1] 2 2
dat
## A B
## A 5 480
## B 2 198
The table has dimensions 2 by 2, with the outcome in columns and the exposure in rows. It was created by converting a matrix of four numbers into a table.
Now, let’s create a similar contingency table using summarise() from {dplyr}.
tab_df <- dat_ue_3 %>%
group_by(exposed) %>%
summarise(pos = sum(result),
fu = sum(persondays))
tab_df
## # A tibble: 2 x 3
## exposed pos fu
## <dbl> <int> <dbl>
## 1 0 5 480
## 2 1 2 198
This contingency table seems similar to the one above. But there are two problems. Let’s check the class and dimensions of tab_df.
dim(tab_df)
## [1] 2 3
class(tab_df)
## [1] "tbl_df" "tbl" "data.frame"
First, the dimension is 2 by 3, not 2 by 2, because exposed is stored as a column. Second, summarise() returns a data frame, not a matrix/table. We convert the exposed column to row names using column_to_rownames("exposed"); that solves the first problem.
For the second problem, it turns out that a data frame can be converted to a matrix using as.matrix(), and the matrix can then be converted to a table using as.table(). Voilà! We get the table that epi.2by2() expects.
tab <- tab_df %>%
column_to_rownames("exposed") %>% # Convert a column to row names
as.matrix() %>% # Convert to matrix
as.table() # Convert to a table
class(tab)
## [1] "table"
dim(tab)
## [1] 2 2
tab
## pos fu
## 0 5 480
## 1 2 198
Then, we feed tab to epi.2by2(). We tell the function that this is a cohort study with follow-up time (method = "cohort.time"), that rates should be expressed per 100 units of time (days in our example), and that the outcome is displayed as the columns of the table. The default 95% confidence level applies.
tab_epi <- epi.2by2(dat = tab,
method = "cohort.time",
units = 100,
outcome = "as.columns")
tab_epi
## Outcome + Time at risk Inc rate *
## Exposed + 5 480 1.04
## Exposed - 2 198 1.01
## Total 7 678 1.03
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc rate ratio 1.03 (0.17, 10.83)
## Attrib rate * 0.03 (-1.64, 1.70)
## Attrib rate in population * 0.02 (-1.57, 1.62)
## Attrib fraction in exposed (%) 3.03 (-492.31, 90.77)
## Attrib fraction in population (%) 2.16 (-71.53, 70.53)
## -------------------------------------------------------------------
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 units of population time at risk
The output shows that the incidence rates among the exposed and the unexposed were 1.04 and 1.01 per 100 person-days, respectively. The incidence rate ratio was 1.03 (95% CI: 0.17, 10.83). If you want to extract the point estimates and CIs as a data frame, use tidy() from {broom}.
tab_epi_tidy <- tidy(tab_epi)
tab_epi_tidy
## # A tibble: 5 x 4
## term estimate conf.low conf.high
## <chr> <dbl> <dbl> <dbl>
## 1 IRR.strata.wald 1.03 0.169 10.8
## 2 ARate.strata.wald 0.0316 -1.64 1.70
## 3 PARate.strata.wald 0.0223 -1.57 1.62
## 4 AFRate.strata.wald 0.0303 -4.92 0.908
## 5 PAFRate.strata.wald 0.0216 -0.715 0.705
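For example, to pull out just the incidence rate ratio (using the term name shown in the output above):
tab_epi_tidy %>%
  filter(term == "IRR.strata.wald")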
I will close this section by demonstrating that the above steps can be chained together easily.
tab_epi <- dat_ue_3 %>%
group_by(exposed) %>%
summarise(pos = sum(result),
fu = sum(persondays)) %>%
column_to_rownames("exposed") %>%
as.matrix() %>%
as.table() %>%
epi.2by2(method = "cohort.time",
units = 100,
outcome = "as.columns")
Sometimes, we want to compute person-time for repeated positive results. Let’s look at how it can be done.
First, let's go back to the data frame dat_ue_1, from which we have already removed the ineligible participants. Here is what it looks like for the first two participants.
## # A tibble: 16 x 5
## id exposed d fu result
## <dbl> <dbl> <date> <dbl> <chr>
## 1 1 0 2012-03-03 1 neg
## 2 1 0 2012-03-13 2 pos
## 3 1 0 2012-04-01 3 pos
## 4 1 0 2012-04-16 4 pos
## 5 1 0 2012-04-27 5 neg
## 6 1 0 2012-05-25 6 neg
## 7 1 0 2012-06-11 7 neg
## 8 1 0 2012-06-18 8 neg
## 9 2 0 2012-02-14 1 neg
## 10 2 0 2012-02-24 2 pos
## 11 2 0 2012-03-01 3 neg
## 12 2 0 2012-03-24 4 pos
## 13 2 0 2012-04-17 5 pos
## 14 2 0 2012-05-08 6 neg
## 15 2 0 2012-06-06 7 neg
## 16 2 0 2012-06-17 8 neg
To get ready for computing person-days at risk, we need to turn dat_ue_1 into this:
## # A tibble: 5 x 7
## # Groups: id [2]
## id exposed d1 d2 from_res to_res duration
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-04-27 2012-06-18 neg neg 52
## 3 2 0 2012-02-14 2012-02-24 neg pos 10
## 4 2 0 2012-03-01 2012-03-24 neg pos 23
## 5 2 0 2012-05-08 2012-06-17 neg neg 40
We do it in a series of steps. These are a little bit tricky. Please pay attention to the changes in each step. You may need to view the dataset after executing each step.
A useful trick: one feature of RStudio I find useful is the ability to view a dataset by clicking on the data frame name in a script file while holding the Control key.
Flag the rows where the result changed from the previous row by comparing result with lag(result), using a default so the first row always counts as a change. This produces a logical vector, which is then passed to cumsum() to number the sequences of identical results.
s1 <- dat_ue_1 %>%
group_by(id) %>%
mutate(
flag = lag(result, default = "missing"),
flag = result != flag,
seq = cumsum(flag)
)
head(s1, 16)
## # A tibble: 16 x 7
## # Groups: id [2]
## id exposed d fu result flag seq
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int>
## 1 1 0 2012-03-03 1 neg TRUE 1
## 2 1 0 2012-03-13 2 pos TRUE 2
## 3 1 0 2012-04-01 3 pos FALSE 2
## 4 1 0 2012-04-16 4 pos FALSE 2
## 5 1 0 2012-04-27 5 neg TRUE 3
## 6 1 0 2012-05-25 6 neg FALSE 3
## 7 1 0 2012-06-11 7 neg FALSE 3
## 8 1 0 2012-06-18 8 neg FALSE 3
## 9 2 0 2012-02-14 1 neg TRUE 1
## 10 2 0 2012-02-24 2 pos TRUE 2
## 11 2 0 2012-03-01 3 neg TRUE 3
## 12 2 0 2012-03-24 4 pos TRUE 4
## 13 2 0 2012-04-17 5 pos FALSE 4
## 14 2 0 2012-05-08 6 neg TRUE 5
## 15 2 0 2012-06-06 7 neg FALSE 5
## 16 2 0 2012-06-17 8 neg FALSE 5
Group by id and seq and take the earliest date in each sequence using min(). This notes the start date of the sequence, which will also serve as the end date of the previous sequence.
s2 <- s1 %>%
group_by(id, seq) %>%
mutate(d1 = min(d))
head(s2, 16)
## # A tibble: 16 x 8
## # Groups: id, seq [8]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-01 3 pos FALSE 2 2012-03-13
## 4 1 0 2012-04-16 4 pos FALSE 2 2012-03-13
## 5 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 6 1 0 2012-05-25 6 neg FALSE 3 2012-04-27
## 7 1 0 2012-06-11 7 neg FALSE 3 2012-04-27
## 8 1 0 2012-06-18 8 neg FALSE 3 2012-04-27
## 9 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 10 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 11 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 12 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 13 2 0 2012-04-17 5 pos FALSE 4 2012-03-24
## 14 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 15 2 0 2012-06-06 7 neg FALSE 5 2012-05-08
## 16 2 0 2012-06-17 8 neg FALSE 5 2012-05-08
For each participant's last follow-up instance, replace d1 with the date of that last visit, so the final sequence also records its end date.
s3 <- s2 %>%
group_by(id) %>%
mutate(d1 = if_else(fu == max(fu), d, d1))
head(s3, 16)
## # A tibble: 16 x 8
## # Groups: id [2]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-01 3 pos FALSE 2 2012-03-13
## 4 1 0 2012-04-16 4 pos FALSE 2 2012-03-13
## 5 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 6 1 0 2012-05-25 6 neg FALSE 3 2012-04-27
## 7 1 0 2012-06-11 7 neg FALSE 3 2012-04-27
## 8 1 0 2012-06-18 8 neg FALSE 3 2012-06-18
## 9 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 10 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 11 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 12 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 13 2 0 2012-04-17 5 pos FALSE 4 2012-03-24
## 14 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 15 2 0 2012-06-06 7 neg FALSE 5 2012-05-08
## 16 2 0 2012-06-17 8 neg FALSE 5 2012-06-17
Within each participant, rows belonging to the same sequence now share the same seq and d1 values. Remove the duplicated rows without dropping the other columns. Note .keep_all = TRUE in distinct(), as distinct() only keeps the columns specified in the call by default.
# Reminder: Grouped by id
s4 <- s3 %>%
distinct(id, seq, d1, .keep_all = TRUE)
head(s4, 10)
## # A tibble: 10 x 8
## # Groups: id [2]
## id exposed d fu result flag seq d1
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17
Copy result into from_res and create a new column, to_res, in which the result column is pushed up by one position for each participant using lead().
# Reminder: Grouped by id
s5 <- s4 %>%
mutate(from_res = result,
to_res = lead(result))
head(s5, 10)
## # A tibble: 10 x 10
## # Groups: id [2]
## id exposed d fu result flag seq d1 from_res to_res
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date> <chr> <chr>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03 neg pos
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13 pos neg
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27 neg neg
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18 neg <NA>
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14 neg pos
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24 pos neg
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01 neg pos
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24 pos neg
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08 neg neg
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17 neg <NA>
Create d2 by pushing d1 one row up with lead(). The follow-up time in each result sequence is then the difference between d2 and d1.
# Grouped by id
s6 <- s5 %>%
mutate(d2 = lead(d1),
persondays = as.double(difftime(d2, d1, units = "days")))
head(s6, 10)
## # A tibble: 10 x 12
## # Groups: id [2]
## id exposed d fu result flag seq d1 from_res to_res
## <dbl> <dbl> <date> <dbl> <chr> <lgl> <int> <date> <chr> <chr>
## 1 1 0 2012-03-03 1 neg TRUE 1 2012-03-03 neg pos
## 2 1 0 2012-03-13 2 pos TRUE 2 2012-03-13 pos neg
## 3 1 0 2012-04-27 5 neg TRUE 3 2012-04-27 neg neg
## 4 1 0 2012-06-18 8 neg FALSE 3 2012-06-18 neg <NA>
## 5 2 0 2012-02-14 1 neg TRUE 1 2012-02-14 neg pos
## 6 2 0 2012-02-24 2 pos TRUE 2 2012-02-24 pos neg
## 7 2 0 2012-03-01 3 neg TRUE 3 2012-03-01 neg pos
## 8 2 0 2012-03-24 4 pos TRUE 4 2012-03-24 pos neg
## 9 2 0 2012-05-08 6 neg TRUE 5 2012-05-08 neg neg
## 10 2 0 2012-06-17 8 neg FALSE 5 2012-06-17 neg <NA>
## # ... with 2 more variables: d2 <date>, persondays <dbl>
Remove the last row of each participant, which now has missing to_res, d2, and persondays, using slice(-n()). Then, keep only the necessary columns.
# Reminder: Grouped by id
s7 <- s6 %>%
slice(-n()) %>%
select(id, exposed, d1, d2, from_res, to_res, persondays)
head(s7, 8)
## # A tibble: 8 x 7
## # Groups: id [2]
## id exposed d1 d2 from_res to_res persondays
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-03-13 2012-04-27 pos neg 45
## 3 1 0 2012-04-27 2012-06-18 neg neg 52
## 4 2 0 2012-02-14 2012-02-24 neg pos 10
## 5 2 0 2012-02-24 2012-03-01 pos neg 6
## 6 2 0 2012-03-01 2012-03-24 neg pos 23
## 7 2 0 2012-03-24 2012-05-08 pos neg 45
## 8 2 0 2012-05-08 2012-06-17 neg neg 40
Finally, remove the intervals that run from a positive to a negative result, so that only intervals starting from a negative result (the at-risk person-time) remain, and ungroup.
# Reminder: Grouped by id
s8 <- s7 %>%
filter(!(from_res == "pos" & to_res == "neg")) %>%
ungroup()
head(s8, 5)
## # A tibble: 5 x 7
## id exposed d1 d2 from_res to_res persondays
## <dbl> <dbl> <date> <date> <chr> <chr> <dbl>
## 1 1 0 2012-03-03 2012-03-13 neg pos 10
## 2 1 0 2012-04-27 2012-06-18 neg neg 52
## 3 2 0 2012-02-14 2012-02-24 neg pos 10
## 4 2 0 2012-03-01 2012-03-24 neg pos 23
## 5 2 0 2012-05-08 2012-06-17 neg neg 40
It is easier to understand these steps if they are put together in a single chain. Here it is.
s8 <- dat_ue_1 %>%
group_by(id) %>%
mutate(
flag = lag(result, default = "missing"),
flag = result != flag,
seq = cumsum(flag)
) %>%
group_by(id, seq) %>%
mutate(d1 = min(d)) %>%
group_by(id) %>%
mutate(d1 = if_else(fu == max(fu), d, d1)) %>%
distinct(id, seq, d1, .keep_all = TRUE) %>%
mutate(from_res = result,
to_res = lead(result)) %>%
mutate(d2 = lead(d1),
persondays = as.double(difftime(d2, d1, units = "days"))) %>%
slice(-n()) %>%
select(id, exposed, d1, d2, from_res, to_res, persondays) %>%
filter(!(from_res == "pos" & to_res == "neg")) %>%
ungroup()
Quiz: Compute incidence rate ratio and 95% CI using s8.
#
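If you want to check your answer, here is a sketch of one possible solution. It assumes we treat intervals ending in a positive result as events and the sum of persondays as the time at risk:
s8 %>%
  group_by(exposed) %>%
  summarise(pos = sum(to_res == "pos"),  # events: intervals ending in a positive result
            fu = sum(persondays)) %>%    # person-days at risk
  column_to_rownames("exposed") %>%
  as.matrix() %>%
  as.table() %>%
  epi.2by2(method = "cohort.time", units = 100, outcome = "as.columns")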
The operations described above can also be used to manage datasets with equal follow-up intervals. These data are more structured than those with unequal intervals, which means that the follow-up dates and results can easily be pivoted to one column per follow-up instance and one row per participant. This makes filtering rows by dates and results much easier.
dat_e_1 <- dat_e %>%
pivot_wider(names_from = fu,
values_from = c(d, result)) %>%
# Remove patients with a positive result at the first follow-up
filter(result_1 != "pos")
head(dat_e_1)
## # A tibble: 6 x 22
## id exposed d_1 d_2 d_3 d_4 d_5
## <dbl> <dbl> <date> <date> <date> <date> <date>
## 1 1 0 2011-12-26 2011-12-27 2011-12-28 2011-12-29 2011-12-30
## 2 2 0 2012-02-15 2012-02-16 2012-02-17 2012-02-18 2012-02-19
## 3 4 1 2012-02-05 2012-02-06 2012-02-07 2012-02-08 2012-02-09
## 4 6 1 2011-12-30 2011-12-31 2012-01-01 2012-01-02 2012-01-03
## 5 7 0 2012-02-26 2012-02-27 2012-02-28 2012-02-29 2012-03-01
## 6 9 0 2012-03-02 2012-03-03 2012-03-04 2012-03-05 2012-03-06
## # ... with 15 more variables: d_6 <date>, d_7 <date>, d_8 <date>, d_9 <date>,
## # d_10 <date>, result_1 <chr>, result_2 <chr>, result_3 <chr>,
## # result_4 <chr>, result_5 <chr>, result_6 <chr>, result_7 <chr>,
## # result_8 <chr>, result_9 <chr>, result_10 <chr>
Note that there is now one row per participant and that the follow-up dates and results are structured in columns, with one column per follow-up instance.
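For instance, here is a sketch (assuming this wide layout) of how one might keep only the participants who ever tested positive, scanning the result columns with across():
dat_e_1 %>%
  filter(rowSums(across(starts_with("result_"), ~ .x == "pos")) > 0)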
In this tutorial, we focused on data management and also covered some basic epidemiological analysis of time-to-event data. We learned some useful functions and used them to work through examples with mock datasets. Real-life problems are, of course, different and more complex, and chances are there is no single function, or set of functions, perfectly tailored to your needs. So it is important to understand the logic behind each step, so that you can use these basic building blocks to tackle more complex problems.
Session info (for reproducibility)
xfun::session_info()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
##
## Locale:
## LC_COLLATE=English_United States.1252
## LC_CTYPE=English_United States.1252
## LC_MONETARY=English_United States.1252
## LC_NUMERIC=C
## LC_TIME=English_United States.1252
##
## Package version:
## assertthat_0.2.1 backports_1.1.8 base64enc_0.1.3 BH_1.72.0.3
## BiasedUrn_1.07 broom_0.7.0 cli_2.0.2 clipr_0.7.0
## compiler_4.0.2 cpp11_0.2.1 crayon_1.3.4 digest_0.6.25
## dplyr_1.0.1 ellipsis_0.3.1 epiR_1.0-15 evaluate_0.14
## fansi_0.4.1 generics_0.0.2 glue_1.4.1 graphics_4.0.2
## grDevices_4.0.2 grid_4.0.2 here_0.1 highr_0.8
## hms_0.5.3 htmltools_0.5.0 jsonlite_1.7.0 knitr_1.29
## lattice_0.20-41 lifecycle_0.2.0 magrittr_1.5 markdown_1.1
## Matrix_1.2-18 methods_4.0.2 mime_0.9 pillar_1.4.6
## pkgconfig_2.0.3 purrr_0.3.4 R6_2.4.1 Rcpp_1.0.5
## readr_1.3.1 rlang_0.4.7 rmarkdown_2.3 rprojroot_1.3-2
## splines_4.0.2 stats_4.0.2 stringi_1.4.6 stringr_1.4.0
## survival_3.1-12 tibble_3.0.3 tidyr_1.1.1 tidyselect_1.1.0
## tinytex_0.25 tools_4.0.2 utf8_1.1.4 utils_4.0.2
## vctrs_0.3.2 xfun_0.16 yaml_2.2.1