COVID-19 Exercise
The purpose of this exercise is to practice working with “real-world” data from the COVID-19 pandemic. The data is public and supplied by Johns Hopkins University (JHU). A couple of resources to check out before we get started:
General housekeeping items
Let’s begin by loading libraries and clearing the environment (note that you may need to install scales, zoo, and janitor if you have not already):
library(tidyverse)
library(lubridate)
library(zoo)
library(scales)
library(janitor)
rm(list = ls())
Download and import the COVID-19 cases and deaths datasets
First, we are going to download and import the COVID-19 cases and deaths time series datasets from GitHub (where JHU makes the data available). These datasets contain cumulative daily case and death counts by country (and, for some countries, by province).
covid19_cases <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv', name_repair = 'universal')
covid19_deaths <- read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv', name_repair = 'universal')
Clean and join the cases and deaths datasets
Notice that the data is well structured, but a little messy. There is a row for each country-province and a column for each date. Some countries (e.g., Canada and China) are reported by province, while other countries are reported in aggregate. First, let’s remove variables we aren’t going to use and “collapse” the data so that there is one row for each country. The cases and deaths datasets share the same structure and formatting, so we can follow the same procedure to clean each one.
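As a minimal sketch of the collapsing step, the toy example below uses made-up numbers and placeholder date columns (`day1`, `day2`) rather than the real JHU column names. It shows how grouping by country and summing turns province-level rows into one row per country:

```r
library(dplyr)

# Toy data (made-up numbers): Canada is split by province, France is not.
# day1 and day2 are placeholder names standing in for the real date columns.
toy_wide <- tibble(
  Province.State = c("Alberta", "Ontario", NA),
  Country.Region = c("Canada", "Canada", "France"),
  Lat  = c(53.9, 51.3, 46.2),
  Long = c(-116.6, -85.3, 2.2),
  day1 = c(1, 2, 10),
  day2 = c(3, 5, 12)
)

toy_country <- toy_wide %>%
  select(-Province.State, -Lat, -Long) %>%  # drop columns we do not need
  group_by(Country.Region) %>%              # one group per country
  summarize(across(everything(), sum))      # add up provinces within each country

toy_country
```

The two Canadian rows are summed into a single Canada row, while France passes through unchanged.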
covid_cases_wide <- covid19_cases %>%
  select(-Province.State, -Lat, -Long) %>%
  group_by(Country.Region) %>%
  summarize(across(everything(), sum))
covid_deaths_wide <- covid19_deaths %>%
  select(-Province.State, -Lat, -Long) %>%
  group_by(Country.Region) %>%
  summarize(across(everything(), sum))
That is a little better. However, the data is still too wide. To “tidy” it up, we want a row for each observation (i.e., country and date). First, we need to pivot the data longer, then we are going to format and create some variables.
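To illustrate the reshaping on a small scale, here is a hedged sketch with made-up cumulative counts. The column name `country` and the three date columns are assumptions for the example only. It pivots the date columns into rows, parses the dates, and differences the cumulative counts within each country to get daily new cases:

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy wide data (made-up cumulative counts), one column per date
toy_wide <- tibble(
  country   = c("A", "B"),
  `1/22/20` = c(1, 0),
  `1/23/20` = c(4, 2),
  `1/24/20` = c(9, 2)
)

toy_long <- toy_wide %>%
  # one row per country-date instead of one column per date
  pivot_longer(cols = !country, names_to = "date", values_to = "cases") %>%
  mutate(date = mdy(date)) %>%               # parse month/day/year text into a Date
  arrange(country, date) %>%
  group_by(country) %>%
  mutate(new_cases = cases - lag(cases)) %>% # daily change; NA on each country's first day
  ungroup()

toy_long
```

Grouping before `lag()` matters here: without it, the first day of country B would be differenced against the last day of country A.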
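covid_cases_long <- covid_cases_wide %>%
  pivot_longer(cols = !Country.Region, names_to = 'date', values_to = 'cases') %>%
  clean_names() %>%
  mutate(date = mdy(date)) %>%
  arrange(country_region, date) %>%
  group_by(country_region) %>%
  # daily new cases from the cumulative counts (NA on each country's first day)
  mutate(new_cases = cases - lag(cases)) %>%
  # trailing 7-day mean; rollapply() is used instead of rollmean() because the zoo
  # documentation notes that rollmean() does not handle inputs containing NAs
  mutate(new_cases_7day = rollapply(new_cases, width = 7, FUN = mean, align = 'right', fill = NA)) %>%
  rename(case_date = date) %>%
  ungroup()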
covid_deaths_long <- covid_deaths_wide %>%
  pivot_longer(cols = !Country.Region, names_to = 'date', values_to = 'deaths') %>%
  clean_names() %>%
  mutate(date = mdy(date)) %>%
  arrange(country_region, date) %>%
  group_by(country_region) %>%
  # daily new deaths from the cumulative counts (NA on each country's first day)
  mutate(new_deaths = deaths - lag(deaths)) %>%
  # trailing 7-day mean; rollapply() tolerates the leading NA (see the zoo documentation)
  mutate(new_deaths_7day = rollapply(new_deaths, width = 7, FUN = mean, align = 'right', fill = NA)) %>%
  # the date 21 rows (days) earlier, used as the join key below
  mutate(lag_death_date = lag(date, n = 21)) %>%
  rename(death_date = date) %>%
  ungroup()
Now that the cases and deaths datasets are clean, let’s join them. It turns out that deaths lag cases by about 3 weeks (21 days) on average. Here, we are going to merge the deaths dataset using the lagged death date so that cases and deaths line up with a 3-week lag. That is, cases will be matched with deaths 21 days later.
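The lagged join can be easier to follow on a tiny example. The sketch below uses made-up numbers and a 2-day lag instead of 21 days (both are assumptions chosen to keep the output short). Each case date is matched with the deaths recorded two days later:

```r
library(dplyr)

# Toy daily series (made-up numbers), five consecutive days each
toy_cases <- tibble(
  case_date = as.Date("2020-06-01") + 0:4,
  new_cases = c(100, 200, 300, 400, 500)
)
toy_deaths <- tibble(
  death_date = as.Date("2020-06-01") + 0:4,
  new_deaths = c(1, 2, 3, 4, 5)
) %>%
  # date two rows earlier; NA for the first two rows
  mutate(lag_death_date = lag(death_date, n = 2))

# Match each case date with the death row whose lagged date equals it
toy_joined <- toy_cases %>%
  inner_join(toy_deaths, by = c("case_date" = "lag_death_date"))

toy_joined
```

Only three rows survive the inner join, because the last two case dates have no deaths two days later in the toy data. Note that `lag()` shifts by rows, so it equals a shift in days only when the dates are consecutive with no gaps.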
covid_cases_deaths <- covid_cases_long %>%
  inner_join(covid_deaths_long, by = c('country_region', 'case_date' = 'lag_death_date')) %>%
  mutate(death_rate = new_deaths_7day / new_cases_7day)
Filter the data and plot
Next, let’s isolate observations in the US after June 1, 2020 (roughly when testing became widely available). Then, we will reshape for plotting purposes:
trend_cases <- covid_cases_deaths %>%
  filter(country_region == 'US', case_date > mdy("06012020")) %>%
  select(new_cases_7day, new_deaths_7day, death_rate, case_date, country_region) %>%
  pivot_longer(cols = !c(country_region, case_date), names_to = 'type', values_to = 'amount')
Finally, let’s plot!
ggplot(trend_cases, aes(x = case_date, y = amount, color = type)) +
  geom_line() +
  facet_wrap(~type, nrow = 3, scales = "free_y") +
  labs(title = 'Trends in COVID data', y = 'Amount', x = 'Date') +
  scale_x_date(date_breaks = "6 month") +
  scale_y_continuous(labels = comma)
There are a lot of directions you can go exploring these datasets. Alter and tinker with the code above to explore new insights.
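One possible starting point, sketched under assumptions: wrap the filter-and-reshape step in a small helper so that other countries and start dates are easy to try. The function name `prep_trend` and its arguments are hypothetical, and the toy data below is made up so the example runs on its own:

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: keep one country after a start date, then reshape for plotting
prep_trend <- function(data, country_name, start_date) {
  data %>%
    filter(country_region == country_name, case_date > as.Date(start_date)) %>%
    select(country_region, case_date, new_cases_7day, new_deaths_7day, death_rate) %>%
    pivot_longer(cols = !c(country_region, case_date),
                 names_to = "type", values_to = "amount")
}

# Toy stand-in for the joined dataset (made-up numbers)
toy <- tibble(
  country_region  = c("US", "US", "Italy", "Italy"),
  case_date       = as.Date(c("2020-05-31", "2020-06-02", "2020-05-31", "2020-06-02")),
  new_cases_7day  = c(100, 200, 50, 40),
  new_deaths_7day = c(5, 4, 2, 1),
  death_rate      = new_deaths_7day / new_cases_7day
)

italy_trend <- prep_trend(toy, "Italy", "2020-06-01")
italy_trend
```

With the real data, the same helper could be called on the joined dataset and its result passed to the plotting code in place of `trend_cases`.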