Coronavirus case data: a year in review

Background

We are approaching one full year since the COVID-19 pandemic shut down the United States. We have come to a sort of hybrid state with many people back in school or at work but still meeting remotely, social distancing, and wearing masks as vaccines are slowly distributed. Many are wondering how we got here. Back in the late spring of 2020, the strictest lockdowns were being lifted, but the future of the pandemic was unclear. The University of Minnesota Center for Infectious Disease Research and Policy (CIDRAP) predicted three possible scenarios for the future of the pandemic depending on regional responses and the behavior of the virus itself.

The COVID-19 pandemic in the United States has been well-documented both in terms of cases and response. Now, nearly one year since it began, I want to use this data to determine which of CIDRAP’s predictions most closely matches what has occurred and will continue to occur for the next several months. Not only will I compare the shape of the COVID-19 cases over time curves to the predicted scenarios, but I will also overlay the COVID-19 cases over time curves with data on government policy, holidays, and public events to see how these factors have influenced the number of COVID-19 cases. I want to answer how the government and the public’s responses to the COVID-19 pandemic have affected its course.

Materials and Methods

COVID-19 case data are available on CRAN and GitHub in the ‘coronavirus’ package. The dataset comes as a tidy format R package or in CSV form and contains information on COVID-19 cases reported by nation-state/province. The data is pulled from the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) and is updated daily, which can be retrieved easily with the command ‘update_dataset()’.

Policy data, human activity, vaccination rates, and other similar data are available on GitHub from Our World in Data. The dataset is mostly in Python but is also available in XSLX, CSV, and JSON form, all of which are updated daily. COVID-19 case data is also included, pulled from the JHU CSSE. For the purposes of this study, the dataset was downloaded once in CSV format on March 12, 2021.

Libraries

The following packages and datasets must be loaded

library(tidyverse)
library(coronavirus)
library(fmsb)

coronavirus::update_dataset(silence=TRUE) # Retrieve the most current data for coronavirus cases from coronavirus dev package

knitr::opts_chunk$set(cache=TRUE) # cache the results for quick compiling

Data

Retrieve ‘coronavirus’ data from the coronavirus package. Import Our World in Data dataset from CSV. The CSV was viewed in Excel to determine and set column types.

data(coronavirus)
owid = read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv",col_names=TRUE,col_types="ccc?nnnnnnnnnnnnnnnnnnnnnnnnnnnnncnnnnnnnnnnnnnnnnnnnnnnnnn",skip_empty_rows = FALSE)

Line graph

Wrangle the data

Filter ‘coronavirus’ data to United States confirmed cases

coronavirus <- filter(coronavirus,country=="US"&type=="confirmed")

Change units to thousands of cases for simpler y-axes

coronavirus$cases <- coronavirus$cases/1000

Filter ‘owid’ data to United States

owid_us <- filter(owid,location=="United States")

Add ‘height’ column to determine height of bars for bar graph to act as heat map

max(coronavirus$cases)

## [1] 300.473

owid_us$height <- 301

Define milestones (holidays and events)

memorial_day <- as.Date("2020-05-25")
christmas <- as.Date("2020-12-25")
thanksgiving <- as.Date("2020-11-26")
labor_day <- as.Date("2020-09-07")
election_day <- as.Date("2020-11-03")
first_lockdown <- as.Date("2020-03-13") # AJMC covid timeline
george_floyd <- as.Date("2020-05-25")
capitol_stormed <- as.Date("2021-01-06")
inauguration <- as.Date("2021-01-20")
holidays <- data.frame(Date=c(memorial_day,labor_day,election_day,thanksgiving,christmas,first_lockdown,george_floyd,capitol_stormed,inauguration),Event=c("Memorial Day","Labor Day","Election Day","Thanksgiving","Christmas","Trump declares National Emergency","George Floyd killed","Capitol building stormed","Biden inaugurated"))

Make sure text won’t overlap

baseline = min(coronavirus$cases)
delta = 0.05 * diff(range(coronavirus$cases))
holidays$ymin = baseline
holidays$timelapse = c(diff(holidays$Date),Inf)
holidays$bump = holidays$timelapse < 4*370 # ~4 years
offsets <- rle(holidays$bump)
holidays$offset <- unlist(mapply(function(l,v) {if(v){(l:1)+1}else{rep(1,l)}}, l=offsets$lengths, v=offsets$values, USE.NAMES=FALSE))
holidays$ymax <- holidays$ymin + holidays$offset * delta

Make the plot

Define base line graph of Cases over Time

base_line <- geom_line(data=coronavirus,aes(date,cases,alpha=0.8)) # base map

Define heat map of Government Stringency Index

heat_stringency <- geom_bar(data=owid_us,stat="identity",width=1,aes(date,height,fill=owid_us$stringency_index)) # stringency index heat map

Plot line graph over heat map and add milestones

ggplot()+heat_stringency+base_line+ # heat map of stringency index + line of cases
  theme_classic()+ # remove gray background and grid lines
  scale_fill_viridis_c("Stringency Index",begin=0.4,end=1,option="magma")+ # scale of heat map
  labs(x="Date",y="Cases (thousands)",title="Confirmed Cases of COVID-19 in U.S.",caption=("Source: JHSU CCSE and Our World in Data."))+ # labels
  scale_x_date(date_breaks = "1 month",date_labels="%b %y",limits=c(as.Date("2020-03-01"),as.Date("2021-03-01")))+ # change x axis scale
  scale_y_continuous(limits=c(0,301))+ # change y axis scale
  geom_segment(data = holidays, mapping=aes(x=Date, y=ymin, xend=Date, yend=ymax)) + # add milestone line segments
  geom_point(data = holidays, mapping=aes(x=Date,y=ymax), size=1) + # add milestone points
  geom_text(data = holidays, mapping=aes(x=Date, y=ymax, label=Event), hjust=-0.1, vjust=0.1, size=3)+ # add milestone labels
  guides(alpha=FALSE) # remove default legend for "alpha"

Compare to Italy and Japan

Wrangle the data

Filter owid dataset for U.S., Italy, and Japan, and calculate cases per capita

owid_us2 <- filter(owid,location=="United States"&date>=as.Date("2020-03-01")) #filter for country and after March 1
owid_us2$total_cases_per_capita <- owid_us2$total_cases/owid_us2$population # calculate cases per capita as cases/population
owid_it <- filter(owid,location=="Italy"&date>=as.Date("2020-03-01")) #filter for country and after March 1
owid_it$total_cases_per_capita <- owid_it$total_cases/owid_it$population # calculate cases per capita as cases/population
owid_jap <- filter(owid,location=="Japan"&date>=as.Date("2020-03-01")) #filter for country and after March 1
owid_jap$total_cases_per_capita <- owid_jap$total_cases/owid_jap$population # calculate cases per capita as cases/population

Set bar graph heights as one third of the top of the graph

owid_us2$height <- max(owid_us2$total_cases_per_capita)/3 # set bar graph height to  one third of top of graph
owid_it$height <- max(owid_us2$total_cases_per_capita)/3
owid_jap$height <- max(owid_us2$total_cases_per_capita)/3

Combine three country datasets into one graph for easy plotting

owid_us2$data <- 'United States'
owid_it$data <- 'Italy'
owid_jap$data <- 'Japan'
owid_all2 <- rbind.data.frame(owid_us2,owid_it,owid_jap)

Repeat the above steps with log of total cases per capita to more easily view peaks at lower numbers

owid_us3$total_cases_per_capita <- log(owid_us3$total_cases/owid_us3$population) # example

Make the plot

Plot U.S., Italy, and Japan case per capita data over time overlaid on heat maps of stringency index and label each heat map.

ggplot(owid_all2)+
  theme_classic()+
  geom_bar(stat="identity",position="stack",width=1,aes(date,height,fill=stringency_index))+
  scale_fill_viridis_c("Stringency Index",begin=0.4,end=1,option="magma",aesthetics = "fill")+
  geom_line(aes(date,total_cases_per_capita,color=data))+
  scale_color_viridis_d("Country",begin=0,end=0.7,option="E",aesthetics = "color")+
  labs(x="Date",y="Total Cases per capita",title="COVID-19 Cases in U.S., Italy, and Japan",caption=("Source: JHSU CCSE and Our World in Data."))+
  scale_x_date(date_breaks = "1 month",date_labels="%b %y",limits=c(as.Date("2020-03-01"),as.Date("2021-03-01")))+
  scale_y_continuous(limits=c(0,max(owid_us2$total_cases_per_capita)))+
  annotate("text",x=as.Date("2020-06-01"),y=0.05,label="Japan\n\n\n\n\n\nItaly\n\n\n\n\nUnited States")

Repeat the above graph on a log scale.

ggplot(owid_all3)+
  theme_classic()+
  geom_bar(stat="identity",position="stack",width=1,aes(date,height,fill=stringency_index))+
  scale_fill_viridis_c("Stringency Index",begin=0.4,end=1,option="magma",aesthetics = "fill")+
  geom_line(aes(date,total_cases_per_capita,color=data))+
  scale_color_viridis_d("Country",begin=0,end=0.7,option="E",aesthetics = "color")+
  labs(x="Date",y="log of Total Cases per capita",title="COVID-19 Cases in U.S., Italy, and Japan (log)",caption=("Source: JHSU CCSE and Our World in Data."))+
  scale_x_date(date_breaks = "1 month",date_labels="%b %y",limits=c(as.Date("2020-03-01"),as.Date("2021-03-01")))+
  scale_y_continuous(limits=c(min(owid_us3$total_cases_per_capita),0))+
  annotate("text",x=as.Date("2020-06-01"),y=-7.5,label="United States\n\n\n\n\nItaly\n\n\n\nJapan")

Result

This plot compares the number of confirmed COVID-19 cases in the U.S. to levels of policy stringency, called the stringency index, and milestones in the year like events and holidays.

A strong correlation cannot be drawn between number of cases and stringency index. During some periods of high stringency, cases seem to somewhat stall, but increases and decreases in stringency have no consistent effect on the number of cases.

Some correlation is seen between important holidays in November and December, but this is not seen for other, somewhat less significant holidays.

Some small periodicity is also seen on a weekly basis of peaks and valleys, indicating either peaks in testing or peaks in data aggregation around the work week.

The first multi-heat/line plot indicates the total number of COVID-19 cases per capita in the United States, Italy, and Japan over the last year. Italy and Japan were hit by COVID-19 earlier than the U.S. but also took more extreme measures to prevent its spread, which is not necessarily shown here (pre-March 2020).

Spikes in the increase of COVID-19 cases can be seen in the fall to winter months in both Italy and the United States, but the scale of the graph makes it difficult to determine seasonal changes in Japan.

The second multi-heat/line plot indicates the total number of COVID-19 cases per capita in the United States, Italy, and Japan over the last year on a logarithmic scale. Italy and Japan were hit by COVID-19 earlier than the U.S. but also took more extreme measures to prevent its spread, which the logarithmic scale shows more clearly, though some measures are not necessarily shown here (pre-March 2020).

In Italy and Japan, trends in case data seem to more closely correlate with the stringency of government policy than in the U.S. Periods of high stringency occur in reaction to rising case numbers and are followed by periods of stagnating case numbers; the opposite is seen for periods of low stringency.

Conclusion

The graphs indicate some level of success by government policy in slowing the spread of COVID-19. While it is hard to predict what COVID-19 will look like in the later months of 2021 and 2022, COVID-19 case data in the U.S. most closely resembles CIDRAP’s predicted Scenario 2, which expected large fall peaks. With the widespread distribution of multiple vaccines, hopefully future peaks will be as minimal as possible.

Coronavirus case data: a year in review

Final Project

Catherine Walker

3/16/2021

Background

Materials and Methods

Libraries

Data

Line graph

Wrangle the data

Make the plot

Compare to Italy and Japan

Wrangle the data

Make the plot

Result

Conclusion

References