Setup

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

Data source: This workshop uses data from the following manuscript, in press with AJPH:

“Healthy Davis Together: Creating a Model for Community Control of COVID-19”

Brad H. Pollock, Charlotte L. Bergheimer, Thomas S. Nesbitt, Tod Stoltz, MBA3, Sheri R. Belafsky, Kenneth C. Burtis, Kelly M. Carey5, Miriam Nuño

#load packages

#these packages have already been installed in this cloud environment. 
#On your personal computer you need to install them before loading them using:
#install.package("tidyverse")
library(tidyverse) #GGPLOT is a component of the tidyverse framework
library(ggthemes) #additional themes for ggplots
library(ggsci) #Scientific Journal and Sci-Fi Themed Color Palettes
library(ggbeeswarm) # scatter plot in beeswarm shape
library(readr) # read in CSV files
library(lubridate) # great package for working with dates
library(kableExtra) # nice looking tables
library(betareg)  #Beta regression



### Helper function for plotting bimonthly date labels
bimonthly <- function(x) {
  x_range <- range(x, na.rm = TRUE)
  date_range <- c(
    floor_date(x_range[1], "month"),
    ceiling_date(x_range[2], "month")
  )
  monthly <- seq(date_range[1], date_range[2], by = "1 month")
}


#load data
#update file directory to where you saved the csv
d0 <- read_csv("C:/Users/kingl/Downloads/HDT_data_04.19.22.csv")

Data cleaning

# Populations numbers
# 2020 census 
# California - Yolo = 39538223 - 220882 = 
#Yolo county population - Davis
# Davis = town population 70,000 + campus (39,000) = 109,000
# campus population = 39,074
location <- c("California", "Yolo county", "Davis")
population <-  c(39317341, 149553, 109000)
pop <- data.frame(location, population)


d <- d0 %>% 
  select(-pCA_ave7, -pDA_ave7, -pYO_ave7) %>%
  #Check class of date variable and change to date 
  mutate(ResultDate = as.Date(ResultDate, "%m/%d/%y")) %>%
  #make the data set long so we can do some plotting and grouped analysis
  pivot_longer(cols = -c(ResultDate), names_to = "label") %>%
  separate(label, c("location", "type"), sep = c("_")) %>%
  pivot_wider(names_from = c("type"), values_from = c("value")) %>%
  #Nice label values for location
  mutate(location = factor(location, levels = c("ca", "yolo",  "davis"), labels = c("California", "Yolo county", "Davis"))) %>%
  #create variables for year, month and week
  mutate(year = year(ResultDate),
         month = month(ResultDate), 
         week = week(ResultDate)) %>%
  merge(., pop, by = c("location"))

TASK 1

Create a table of the total number of cases by location and year and the yearly period prevalence

#Your code here

Plotting cases over time

Introducing GGplot

There two most popular tools for creating plots in R:

    1. base, which come with all R installations
    1. ggplot2, a stand-alone package.

Why use ggplot?

  • More elegant & compact code than with base graphics
  • Very powerful for exploratory data analysis
  • Follows a grammar, just like any language.
  • It defines basic components that make up a sentence. In this case, the grammar defines components in a plot.
  • Grammar of graphics originally coined by Lee Wilkinson

GGplot “Grammar”

The general call for ggplot2 looks like this:

ggplot(data=___, aes(x=___, y=___)) + 
geom_xxx()+
geom_yyy()

The grammar involves some basic components:

  1. Data: a data.frame
  2. Aesthetics: How your data are represented visually, aka “mapping”. Which variables are shown on x, y axes, as well as color, size, shape, etc.
  3. Geometry: The geometric objects in a plot. points, lines, polygons, etc.

The key to understanding ggplot2 is thinking about a figure in layers: just like you might do in an image editing program like Photoshop, Illustrator, or Inkscape.

Let’s look at an example barplot:

ggplot(data = d, aes(x = ResultDate, y = pos)) +
  geom_bar(stat="identity") 

So the first thing we do is call the ggplot function. This function lets R know that we’re creating a new plot, and any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

We’ve passed in two arguments to ggplot. First, we tell ggplot what data we want to show on our figure, d

For the second argument we passed in the aes function, which tells ggplot how variables in the data map to aesthetic properties of the figure, in this case the x and y locations. Here we told ggplot we want to plot the ResultDate column of the data frame on the x-axis, and the pos column on the y-axis.

By itself, the call to ggplot isn’t enough to draw a figure:

ggplot(data = d, aes(x = ResultDate, y = pos))

We need to tell ggplot how we want to visually represent the data, which we do by adding a new geom layer. In our example, we used geom_bar, which tells ggplot we want to visually represent the relationship between x and y as a a barplot. stat = “identity” means we want to leave the data as is. Whereas stat = “count” would sum up number of the observations

ggplot(data = d, aes(x = ResultDate, y = pos)) +
  geom_bar(stat="identity") 

There are definitely some improvements we can make to improve this visualization:

1. Facet wrap over location with a free x axis

We can split visualizations over several plots by adding a layer of facet panels.

ggplot(data = d, aes(x = ResultDate, y = pos)) +
  geom_bar(stat="identity") + 
  facet_wrap(~location, scales = "free_y", ncol=1)

2. Add axis labels and a title

Labels are their own layers in ggplot.

Let’s add the following labels to our ggplot: x = “Date” y= “Positive cases” title = “Covid-19 surveillance in California, Yolo county and Davis”

ggplot(data = d, aes(x = ResultDate, y = pos)) +
  geom_bar(stat="identity") + 
  facet_wrap(~location, scales = "free_y", ncol=1) + 
  labs(x = "Date", y = "Positive cases", title = "Covid-19 surveillance in California, Yolo county and Davis")

3. Change the colors

Option 1: outside aes

You can manually change the color inside specific layers like this:

note that when you specify a color by name i.e. red, blue, green, but must be included within quotation marks

ggplot(data = d, aes(x = ResultDate, y = pos)) +
  geom_bar(stat="identity", fill = "blue") + 
  facet_wrap(~location, scales = "free_y", ncol=1) + 
  labs(x = "Date", y = "Positive cases", title = "Covid-19 surveillance in California, Yolo county and Davis") +
  theme_minimal()

Option 2: inside aes

In the previous examples we used the aes function to tell the gggplot about the x and y data we wanted to plot. Another aesthetic property we can modify is the color of the barplot using the color (changes the outside lines) or the fill (changes the inside fill) aesthetic.

TASK 2

Modify the above plot to include a fill aesthetic (aes) by the variable location and the theme of your choice

There are many different options for color scales. The package “ggsci” (Scientific Journal and Sci-Fi Themed Color Palettes for ggplot2) has some great options.

Here are a few options. More options in this link (https://cran.r-project.org/web/packages/ggsci/vignettes/ggsci.html)

New England Journal of Medicine: scale_color_nejm()

The Lancet: scale_color_lancet()

Star Trek: scale_color_startrek()

We can change the theme (Background color, font, axis lines) of the plot to our liking. There are many options for ggplot themes including additional packages such as ggthemes. You can add the theme to the plot as an additional line.

Themes built in with ggplot: theme_bw() theme_minimal()

Themes included with the add-on package ggthemes theme_economist() theme_wsj() theme_stata()

type ?ggthemes into your console for more options or google ggthemes

ggplot(data = d, aes(x = ResultDate, y = pos, fill = location)) +
  geom_bar(stat="identity") + 
  facet_wrap(~location, scales = "free_y", ncol=1) + 
  labs(x = "Date", y = "Positive cases", title = "Covid-19 surveillance in California, Yolo county and Davis") +
  theme_minimal() + 
  scale_fill_nejm()

Saving your plot

use ggsave to save your plot. The default is to save in your working directly but you can change the filepath in the function call.

#Save as a jpeg
ggsave("CovidCases2021.2022.jpg", height=7, width=9, units="in")

#save as a pdf

Plotting period prevalence over time

By month

dsum1.a <- d %>%
  mutate(monthYr = floor_date(ResultDate, "month")) %>%
  group_by(location, monthYr) %>%
  summarize(periodprev = (sum(pos)/population))



ggplot(dsum1.a, aes(x=monthYr, y = periodprev, color = location, group = location)) + 
         geom_point() + 
  geom_line() + 
  theme_bw() + 
  scale_color_nejm() + 
  scale_y_continuous(labels = scales::percent) + 
  labs(x="Month", y = "Monthly period prevalence")

By week

dsum1.b <- d %>%
  mutate(weekYr = floor_date(ResultDate, "week")) %>%
  group_by(location, weekYr) %>%
  summarize(periodprev = (sum(pos)/population))


ggplot(dsum1.b, aes(x=weekYr, y = periodprev, color = location, group = location)) + 
         geom_point() + 
  geom_line() + 
  theme_bw() + 
  scale_color_nejm() + 
  scale_y_continuous(labels = scales::percent) + 
  labs(x="Week", y = "Weekly period prevalence")

Testing and test positivity over time

Figure of test positivity by test date

dsum2.a <- d %>%
  mutate(weekYr = floor_date(ResultDate, "week")) %>%
  group_by(location, weekYr) %>%
  summarize(TestPos = sum(pos)/sum(tests))


ggplot(dsum2.a, aes(x=weekYr, y = TestPos, group=location, color=location)) + 
  geom_point(size=.2) + 
  geom_line() + 
      scale_x_date(expan=expansion(mult = 0, add = 0),  breaks = bimonthly, date_labels = "%b-%y")+
    theme_bw()+
  labs(x = "Month of Test Result", y="Test Positivity Proportion") +
    theme(text = element_text(size=10),
          axis.text.x = element_text(size=10,angle = 30, hjust = 1,vjust=1),
          axis.text.y = element_text(size=10),
          axis.line = element_line(colour = "black"),
          panel.grid.minor = element_line(colour = "white"),
          panel.border = element_blank())

Figure 2 from HDT paper

Beta regression plot - Figure 2 from “Healthy Davis Together: Creating a Model for Community Control of COVID-19” uses 7 day moving averages

TASK 3

Create a figure of tests per capita per month and location

# Your code here

TASK 4

Recreate Figure 2 above using your choice of color, theme, font type, point shape, size, etc. Include the code and output here and also save your figure as a .jpg and submit with the assignment. Use your initials and last 4 digits of your cell in the caption (to keep it anonymous). We will have a friendly competition and Brad and Miriam will select the winner - there will be a prize :).

#Your code here

TASK 5

How would the test positivity compare if everyone in Davis and Yolo county the population was tested at the same rate as California? Assume the number of positive tests remains the same. Create a figure of test positivity over time with a standardized number of tests per capita. Include code below:

#Your code here

BONUS

The sensitivity of saliva versus NP swabs was 86.5% from a recent meta-analysis: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191978/

How do the weekly test positivity and weekly period prevalence figures change when you account for the sensitivity of saliva-based tests among samples collected in Davis for HDT?

#Your code here