The following outlines an experimental approach to the analysis of contact tracing data, leveraging network analysis methods to observe connections between sources and targets of infectious disease.
The example here uses sample contact tracing data, generously provided by the authors of the Epidemiologist R Handbook, and explores relationships between sources and targets during a COVID-19 outbreak. Sources are people with illness, and targets are the people they have had recent, proximal contact with. By documenting these source and target connections we are able to create an interactive map of acyclic exposure situations, and hopefully learn something from it.
# install and load packages
if (!require("pacman")) install.packages("pacman")
library(pacman)
pacman::p_load(
  tidyverse,   # wrangling
  visNetwork,  # network visualization
  prettydoc)   # knitting
# import data from url
relationships <- read_rds(
  url("https://github.com/appliedepi/epiRhandbook_eng/blob/master/data/godata/relationships_clean.rds?raw=true")) %>%
  # select "to" and "from" variables
  select(source_visualid, target_visualid) %>%
  # reduce id strings for legibility
  mutate(source_visualid = str_remove_all(source_visualid, "(-2020)"),
         target_visualid = str_remove_all(target_visualid, "(-2020)"))
This dataset is one of four used in a wider contact investigation exercise found here. As an example of wrangling data for network analysis, I focus on the relationships dataset used in that chapter, and on two of its variables in particular: source_visualid and target_visualid. These two columns document the frequency and direction of connections between people, as observed through a contact tracing procedure.
It may not seem obvious, but each observation in the relationships data represents a contact event between a source and a target. To see that, we can count events in both the source and target variables.
## # A tibble: 23 x 2
## source_visualid n
## <chr> <int>
## 1 <NA> 17
## 2 CASE-0001 13
## 3 CASE-0002 5
## 4 CASE-0005 5
## 5 CASE-0013 5
## 6 CASE-0004 4
## 7 CASE-0018 4
## 8 CASE-0023 4
## 9 CASE-0034 4
## 10 CASE-0006 3
## # … with 13 more rows
## # A tibble: 84 x 2
## target_visualid n
## <chr> <int>
## 1 <NA> 4
## 2 CONTACT-0046 3
## 3 CONTACT-0056 3
## 4 CASE-0006 2
## 5 CASE-0008 2
## 6 CASE-0009 2
## 7 CONTACT-0015 2
## 8 CONTACT-0027 2
## 9 CONTACT-0028 2
## 10 CONTACT-0029 2
## # … with 74 more rows
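The code behind these two tables is not shown. Counts of this kind can be obtained with dplyr's count(); what follows is a minimal sketch on a small made-up tibble, where toy_relationships and its ids are hypothetical stand-ins rather than the handbook data.

```r
library(dplyr)
library(tibble)

# hypothetical toy data: one row per contact event
toy_relationships <- tibble(
  source_visualid = c("CASE-01", "CASE-01", NA, "CASE-02"),
  target_visualid = c("CONTACT-01", "CONTACT-02", "CASE-02", "CONTACT-01")
)

# number of events per source, most frequent first
source_counts <- toy_relationships %>%
  count(source_visualid, sort = TRUE)

# number of events per target, most frequent first
target_counts <- toy_relationships %>%
  count(target_visualid, sort = TRUE)

source_counts
target_counts
```

Missing ids are counted as their own group, which is why an NA row shows up in both tables.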
The two main ingredients of a network graph are nodes and edges. Nodes can be thought of as units of observation, and edges are the threads that connect nodes together. To make nodes, we fork our target and source variables into separate objects that will become the distinct units, in this case the unique people included in the contact tracing investigation.
# establish nodes for cases, "sources" of contagion
source_nodes <- relationships %>%
  distinct(source_visualid) %>%
  rename(label = source_visualid)
# establish contacts nodes, or "targets"
target_nodes <- relationships %>%
  distinct(target_visualid) %>%
  rename(label = target_visualid)
# join both into tibble, matching on the shared label column
ct_nodes <- full_join(source_nodes,
                      target_nodes,
                      by = "label") %>%
  # make unique id number for each node
  rowid_to_column("id") %>%
  # remove missing values
  na.omit()
head(ct_nodes)
## # A tibble: 6 x 2
## id label
## <int> <chr>
## 1 1 CASE-0016
## 2 3 CASE-0045
## 3 4 CASE-0004
## 4 5 CASE-0010
## 5 6 CASE-0034
## 6 7 CASE-0037
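Note that the ids in this output jump from 1 to 3. This follows from the order of operations: rowid_to_column() numbers every row, including the one with a missing label, and na.omit() then drops that row together with its id. The sketch below reproduces the effect on made-up data (all ids here are hypothetical) and shows one possible way to get consecutive ids, if that is preferred. The gap is harmless for plotting, since the ids only need to be unique.

```r
library(dplyr)
library(tibble)

# hypothetical toy data with one missing source
toy_relationships <- tibble(
  source_visualid = c("CASE-01", NA, "CASE-02", "CASE-01"),
  target_visualid = c("CONTACT-01", "CASE-02", "CONTACT-02", "CONTACT-03")
)

source_nodes <- toy_relationships %>%
  distinct(source_visualid) %>%
  rename(label = source_visualid)

target_nodes <- toy_relationships %>%
  distinct(target_visualid) %>%
  rename(label = target_visualid)

# same order as in the post: number the rows first, then drop missing labels
nodes_with_gap <- full_join(source_nodes, target_nodes, by = "label") %>%
  rowid_to_column("id") %>%
  na.omit()

# alternative: drop missing labels first, then number the rows
nodes_consecutive <- full_join(source_nodes, target_nodes, by = "label") %>%
  filter(!is.na(label)) %>%
  rowid_to_column("id")

nodes_with_gap$id     # id 2 belonged to the NA row and is gone
nodes_consecutive$id  # ids run without a gap
```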
Once we have a list of the distinct nodes, each with a unique id value, we can join ct_nodes to a new fork of the relationships data that will make our graph edges. The id variable is joined in twice: once matched on source_visualid and renamed from, and once matched on target_visualid and renamed to.
# create "edges", lines between nodes
ct_edges <- relationships %>%
  select(target_visualid, source_visualid) %>%
  # join the sources and rename those ids as from
  left_join(ct_nodes, by = c("source_visualid" = "label")) %>%
  rename(from = id) %>%
  # then join the targets and rename those ids as to
  left_join(ct_nodes, by = c("target_visualid" = "label")) %>%
  rename(to = id)
head(ct_edges)
## # A tibble: 6 x 4
## target_visualid source_visualid from to
## <chr> <chr> <int> <int>
## 1 CONTACT-0027 CASE-0016 1 24
## 2 CASE-0014 <NA> NA 17
## 3 CASE-0031 <NA> NA 18
## 4 CASE-0021 CASE-0045 3 25
## 5 CONTACT-0020 CASE-0004 4 26
## 6 CONTACT-0038 CASE-0010 5 27
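Rows with a missing source end up with NA in from, because missing labels were removed from ct_nodes and so have no id to match. The double join can be traced by hand on a tiny example. This is a sketch only: toy_nodes and toy_relationships are hypothetical, and the final filter is an optional choice rather than part of the workflow above.

```r
library(dplyr)
library(tibble)

# hypothetical node list, with a gap in the ids as in ct_nodes
toy_nodes <- tibble(
  id    = c(1L, 3L, 4L),
  label = c("CASE-01", "CASE-02", "CONTACT-01")
)

# hypothetical contact events; the second has an unknown source
toy_relationships <- tibble(
  source_visualid = c("CASE-01", NA, "CASE-02"),
  target_visualid = c("CONTACT-01", "CASE-02", "CONTACT-01")
)

toy_edges <- toy_relationships %>%
  # look up the id of each source and call it "from"
  left_join(toy_nodes, by = c("source_visualid" = "label")) %>%
  rename(from = id) %>%
  # look up the id of each target and call it "to"
  left_join(toy_nodes, by = c("target_visualid" = "label")) %>%
  rename(to = id)

toy_edges

# optional: keep only edges where both endpoints are known
complete_edges <- toy_edges %>%
  filter(!is.na(from), !is.na(to))
```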
The ct_edges object now contains all the observable connections between distinct nodes, with directionality given by the to and from variables. With them, we can put together a fairly nice interactive graph using the visNetwork package. We can also add a few basic options, like a node id selection dropdown.
# visualise
visNetwork(ct_nodes, ct_edges) %>%
  visNodes(shadow = list(enabled = TRUE,
                         size = 10)) %>%
  # options for edges
  visEdges(arrows = "middle",
           width = 2,
           hoverWidth = 20) %>%
  # other options
  visOptions(nodesIdSelection = TRUE)
This interactive object is a visualization of the connections from sources (cases) to targets (contacts) in the relationships dataset. I strongly suggest zooming in and out of the graph in your browser; clicking and dragging the nodes helps make sense of how the shapes fit together. The graph can also zoom out very easily, so just reload this page if it gets lost in whitespace.
Can we learn anything about the outbreak from this network graph? The visualization is interesting to look at and play around with, but in many ways it is a beginning, an exploratory view into what network analysis can offer the field of contact tracing.
If you like what you see, or have specific questions or feedback, feel free to email me directly: avery.richards@berkeley.edu