Purpose

In this exploratory data analysis, we will be investigating the spread of Coronavirus (COVID-19) over time in the U.S. using the Johns Hopkins COVID-19 Dataset. The JHU folks have made an awesome dashboard which displays the geographical distribution of the COVID-19 across the globe. However, this dashboard provides just a single static snapshot of the current disease burden and does not illustrate how the distribution of cases has developed over time.

In this example, we will demonstrate how we used the JHU COVID-19 dataset to create the dynamic data visualization below which illustrates the temporal evolution of confirmed COVID-19 cases in the U.S.

Link to original post

Load Libraries

library(tictoc)
library(pryr)
library(sf)
library(maps)
library(gganimate)
library(magick)
library(janitor)
library(tidyverse)

Static Map of U.S.

First, we will need a plot/map of the continential U.S. The maps package will provide us with the shape file data which can be easily plotted using ggplot2 and sf package in R.

Reference: Blog post on how to make a map of the U.S. states

states <- st_as_sf(maps::map(database = "state",plot=F,fill=T)) 

states %>%
  ggplot()+
  geom_sf()

Get and Clean JHU COVID-19 Time Series Dataset

Next, we need to pull in time series data from JHU website. We’ll pull down the wide formatted data directly from the website and reformat to long using pivot_longer.

Also here we will:

  1. filter cases for U.S. only

  2. reformat dates (which were formerly column names)

  3. assign different sizes for each point to be included on map plot based on the total number of confirmed cases for the given date

jhu_confirmed_cases_wide <- read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv") %>%
  clean_names() %>%
    filter(country_region == "US")

jhu_confirmed_cases_county_level <- jhu_confirmed_cases_wide %>%
  filter(str_detect(province_state,",|Princess")&str_detect(province_state,"County|Princess|D.C.")) %>% # need to add filter statement to select only county-level and cruise ship data
  pivot_longer(matches("x"),
               names_to="dates",
               values_to = "case_count")%>%
  group_by(province_state) %>%
  mutate(date_count = row_number()) %>%
  ungroup() %>%
  mutate(date_str=str_replace_all(dates,"x","")) %>%
  mutate(date_str=str_replace_all(date_str,"_","/")) %>%
  mutate(date_fmt = as.Date(date_str,format="%m/%d/%y")) %>%
  filter(date_fmt<=as.Date("2020-03-09")) %>% # only collect data before 3/10/20 in this dataframe - will be using state-level summary data after this date
  mutate(size=case_when(case_count < 10 ~ 1,
                        case_count < 50 ~ 2,
                        case_count < 100 ~ 3,
                        case_count < 200 ~ 4,
                        T  ~ 5))


jhu_confirmed_cases_state_level <- jhu_confirmed_cases_wide %>%
  filter(!str_detect(province_state,",")) %>% # need to add filter statement to select only state-level and cruise ship data - avaiable on and after 3/10/2020
  pivot_longer(matches("x"),
               names_to="dates",
               values_to = "case_count")%>%
  group_by(province_state) %>%
  mutate(date_count = row_number()) %>%
  ungroup() %>%
  mutate(date_str=str_replace_all(dates,"x","")) %>%
  mutate(date_str=str_replace_all(date_str,"_","/")) %>%
  mutate(date_fmt = as.Date(date_str,format="%m/%d/%y")) %>%
  filter(date_fmt>=as.Date("2020-03-10")) %>% # only collect data from 3/10/20 and onward in this dataframe - will be using state-level summary data 
  mutate(size=case_when(case_count < 10 ~ 1,
                        case_count < 50 ~ 2,
                        case_count < 100 ~ 3,
                        case_count < 200 ~ 4,
                        T  ~ 5))

jhu_confirmed_cases <- jhu_confirmed_cases_county_level %>%
  bind_rows(jhu_confirmed_cases_state_level) %>%
  mutate(scaled_case_count = (case_count/(max(case_count))))

Augment JHU Time Series Dataset

Next, we will augment the JHU Dataset so our gganimations will have variable speeds when stepping through different dates. We want the steps in the gganimation to be faster earlier (when there wasn’t much growth in the number of confirmed cases) and slower later (when there was more rapid growth in the number of confirmed cases ) in the time series data.

Step 1: Define new dataframe which defines how long each date should be

In this example, we will make case counts from different dates have a variable number of frames:

  1. 1/22/2020 through 3/3/2020: 1 frame
  2. 3/4/2020 through 3/11/2020: 8 frames
  3. 3/12/2020 onward: 16 frames
date_augment_table <- jhu_confirmed_cases %>% 
  distinct(date_fmt) %>%
  arrange(date_fmt) %>%
  mutate(timesteps_amp = case_when(date_fmt < as.Date("2020-03-04") ~ 1,
                                   date_fmt < as.Date("2020-03-12") ~ 8,
                                   T ~ 16),
         timestep_cumsum = cumsum(timesteps_amp)) %>%
  mutate(timestep_amp_lag = lag(timesteps_amp,1),
         timestep_amp_lag = ifelse(is.na(timestep_amp_lag),0,timestep_amp_lag),
         timestep_cumsum_lag = cumsum(timestep_amp_lag))

transition_points <- date_augment_table %>%
  filter(timesteps_amp!=timestep_amp_lag) %>%
  left_join(date_augment_table %>% 
              count(timesteps_amp,name="number_of_days"),
            by="timesteps_amp")

first_transition_frame = transition_points$timestep_cumsum_lag[2]
first_transition_days_passed = transition_points$number_of_days[1]

second_transition_frame = transition_points$timestep_cumsum_lag[3]
second_transition_days_passed = sum(transition_points$number_of_days[1:2])

Step 2: Join new dataframe which defines the length of time step for each date to JHU dataframe

Here we join the new dataframe we created in the last step and expand our original dataframe using the uncount function. We also had to create new “calculated dates” to ensure that the displayed dates on resulting gifs would increment appropriately when stepping through the slower data frames.

jhu_confirmed_cases_time_augmentation <- jhu_confirmed_cases %>%
  left_join(date_augment_table,
            by="date_fmt") %>%
  mutate(timesteps_amp_2 = timesteps_amp) %>%
  uncount(timesteps_amp_2) %>%
  group_by(date_fmt,province_state) %>%
  mutate(timestep = row_number()) %>%
  ungroup() %>%
  mutate(timestep_aug = timestep_cumsum_lag+timestep-1) %>%
  arrange(province_state,timestep_aug) %>%
  mutate(calc_date = as.Date(case_when(timestep_aug <= first_transition_frame  ~ as.Date("2020-01-22") + trunc(timestep_aug) - 1,
                                       timestep_aug <= second_transition_frame ~ as.Date("2020-01-22") + first_transition_days_passed + (trunc((timestep_aug-first_transition_frame)/8)),
                                       T ~ as.Date("2020-01-22") + second_transition_days_passed + (trunc((timestep_aug-second_transition_frame)/16))),origin="1970-01-01"))

Create animated map to show geographic distribution of cases

Finally, using the ggplot2 package, we will take the static map we developed in the first step and overlay the JHU confirmed cases data. We will first start by making a static plot. In this step, we will also filter out any cases which were documented outside of the 48 continential states (sorry Alaska, Hawaii, and other U.S. territories) and cases that were not specifically assigned to a particular state (i.e. some cases from the Diamond Princess Cruise).

Step 1: Make Static Plot of U.S. States w/ Points

max_case_count = max(jhu_confirmed_cases$case_count)
max_size_point =   (max_case_count/10000) * 100

map_plot_aug <- ggplot()+
  geom_sf(data=states,
          fill="white")+
  geom_point(data=jhu_confirmed_cases_time_augmentation %>%
               filter(long>-140) %>%
               filter(lat>25) %>%
               filter(!str_detect(province_state,"Unassigned")) %>%
               filter(case_count>0),
             mapping=aes(x=long,
                 y=lat,
                 group=province_state,
                 size=scaled_case_count),
             fill="red",
             alpha=0.35,
             shape=21)+
  scale_size_continuous(breaks = function(x) c(10,
                                               100,
                                   350,
                                   700)/max_case_count,
    labels= function(x) round_half_up(x*max_case_count),
                        range=c(3,3+1.75*max_size_point))+
  scale_x_continuous(expand=expand_scale(mult=c(0.05,0.75)))+
  labs(title="Distribution of U.S. Confirmed COVID-19 Cases",
       subtitle="All Cases from All Dates Shown at Same Time",
       x="",
       y="",
       size="# of Confirmed\nCases") +
  coord_sf(xlim=c(-125,-63))+
  theme_bw()+
  theme(legend.position = c(0.90,0.3),
        title=element_text(size=20),
        legend.title = element_text(size=14),
        legend.background = element_rect(color="black",fill="NA"),
        legend.text = element_text(size=12))

map_plot_aug 

Step 2: Add gganimate functions to plot

Next, we’ll add the gganimate function transition_time to allow us to step through the different time frames of our JHU confirmed cases dataframe.

map_plot_w_animate_aug <- map_plot_aug +
  labs(title = "Distribution of U.S. Confirmed COVID-19 Cases",
  subtitle = "Date: {as.Date(case_when(frame_time<=first_transition_frame ~ as.Date('2020-01-22') + trunc(frame_time) - 1,
  frame_time<=second_transition_frame ~ as.Date('2020-01-22') + first_transition_days_passed + (trunc((frame_time-first_transition_frame)/8)),
  T ~ as.Date('2020-01-22') + second_transition_days_passed + (trunc((frame_time-second_transition_frame)/16))),
  origin='1970-01-01')}")+
  transition_time(timestep_aug)+
  ease_aes()

Step 3: Create gganimation

Lastly, we’ll use the gganimate function animate to compile the .gif image for the U.S. map which steps through confirmed cases based on date.

map_overtime_plot <- animate(map_plot_w_animate_aug,
        duration = 10,
        fps = 10,
        width = 650,
        height = 400, 
        renderer = gifski_renderer())

map_overtime_plot

Line Graph Over Time

In addition to the map, which provides dynamic visualization of how confirmed cases evolved over time in a geospatial sense, we will also want to make a dynamic line graph which illustrates how the case count increases with time.

Aside: See link here for a very similar dynamic line graph made by Patrick Rotzetter

Step 1: Build Dataframe to Summarize Case Count For Each day

First, we will start by making a dataframe which summarises the total case count across all areas for each date.

sum_ov_time_df <- jhu_confirmed_cases_time_augmentation %>%
  group_by(date_fmt,timestep_aug,calc_date) %>%
  summarise(case_total = sum(case_count,na.rm=T)) %>%
  ungroup()

Step 2: Build Static Line Plot of Cases Over Time

Next, we will make a static line graph using the above dataframe.

plot_line_ov_time <-sum_ov_time_df %>%
  ggplot(aes(x=date_fmt,
             y=case_total,
             color=timestep_aug))+
  geom_line(size=3)+
  geom_label(aes(label=case_total),
             size=12,
             fontface="bold")+
  scale_y_continuous(breaks = seq(0,10000,400),
                     expand = expand_scale(mult=c(0.07,0.11)))+
  scale_x_date(date_breaks = "1 week",
               labels= scales::date_format("%m/%d"),
               expand = expand_scale(mult=c(0.05,0.105)),
               limits=c(min(sum_ov_time_df$date_fmt),max(sum_ov_time_df$date_fmt)))+
  scale_color_gradient(low="black",high="red")+
  labs(title="# of U.S. Confirmed COVID-19 Cases Overall",
       x="",
       y="# of Confirmed Cases") +
  guides(color=F)+
  theme_bw()+
  theme(legend.position = c(0.92,0.3),
        axis.text.x =  element_text(size=20,
                                    face="bold"),
        axis.text.y =  element_text(size=20,
                                    face="bold"),
        title=element_text(size=20))
plot_line_ov_time

Step 3: Add gganimate functions to static plot

Then we will animate the static plot using the gganimate function transition_reveal

plot_line_ov_time_w_animate <- plot_line_ov_time + 
  labs(subtitle = "Date: {as.Date(case_when(frame_along<=first_transition_frame ~ as.Date('2020-01-22') + trunc(frame_along) - 1,
  frame_along<=second_transition_frame ~ as.Date('2020-01-22') + first_transition_days_passed + (trunc((frame_along-first_transition_frame)/8)),
  T ~ as.Date('2020-01-22') + second_transition_days_passed + (trunc((frame_along-second_transition_frame)/16))),
  origin='1970-01-01')}")+
  transition_reveal(timestep_aug)+
  shadow_mark()

Step 4: Create gganimation

Lastly, we’ll use the gganimate function animate to compile the .gif image for our dynamic line graph which steps through confirmed cases based on date.

line_overtime_gif <- animate(plot_line_ov_time_w_animate,
        duration = 10,
        fps = 10,
        width = 600,
        height = 400, 
        renderer = gifski_renderer())
line_overtime_gif

Over Time Line Plot by State

In addition to the line graph summarizing overall U.S. confirmed cases of COVID-19, we’ll present another line graph which breaks cases down by state.

Step 1: Re-aggregate JHU data at the state level

In this step, we will re-aggregate the county level data (which was primarily presented prior to 3/10/20 by the JHU folks) into state-level data.

state_df <- tibble(state_abb = state.abb,
                   state_fullname = state.name) %>%
  bind_rows(tibble(state_abb = c("D.C.","Grand Princess","Diamond Princess"),
            state_fullname = c("District of Columbia","Grand Princess","Diamond Princess")))

jhu_confirmed_cases_county_level_states <- jhu_confirmed_cases_county_level %>%
  mutate(state_abb = substr(province_state,str_locate(province_state,", ")+2,nchar(province_state))) %>%
  mutate(state_abb = ifelse(is.na(state_abb),province_state,state_abb)) %>%
  left_join(state_df,
            by="state_abb")

jhu_confirmed_cases_state_level_only <- jhu_confirmed_cases_state_level %>%
  mutate(state_fullname = province_state) %>%
  left_join(state_df,
            by="state_fullname") %>%
  select(names(jhu_confirmed_cases_county_level_states))

jhu_confirmed_cases_state_level_comb <- jhu_confirmed_cases_county_level_states %>%
  bind_rows(jhu_confirmed_cases_state_level_only)

jhu_confirmed_cases_state_level_comb_sum_by_state <- jhu_confirmed_cases_state_level_comb %>%
  group_by(state_fullname,state_abb,date_fmt) %>%
  summarise(case_total = sum(case_count,na.rm=T)) %>%
  ungroup() %>%
  filter(!str_detect(state_fullname,"Princess"))

Step 2: Augment JHU Dataset

Next, we will again augment the JHU Dataset so our gganimations will have variable speeds when stepping through different dates. This is done in a similar fashion to what was described above.

jhu_confirmed_cases_state_level_aug <- jhu_confirmed_cases_state_level_comb_sum_by_state %>%
  left_join(date_augment_table,
            by="date_fmt") %>%
  mutate(timesteps_amp_2 = timesteps_amp) %>%
  uncount(timesteps_amp_2) %>%
  group_by(date_fmt,state_fullname,state_abb) %>%
  mutate(timestep = row_number()) %>%
  ungroup() %>%
  mutate(timestep_aug = timestep_cumsum_lag+timestep-1) %>%
  arrange(state_fullname,timestep_aug) %>%
  mutate(calc_date = as.Date(case_when(timestep_aug <= first_transition_frame  ~ as.Date("2020-01-22") + trunc(timestep_aug) - 1,
                                       timestep_aug <= second_transition_frame ~ as.Date("2020-01-22") + first_transition_days_passed + (trunc((timestep_aug-first_transition_frame)/8)),
                                       T ~ as.Date("2020-01-22") + second_transition_days_passed + (trunc((timestep_aug-second_transition_frame)/16))),origin="1970-01-01")) %>%
  mutate(scaled_case_total = ((case_total-1)/(max(case_total)-1)))

Step 3: Make Static Line Plot Over Time by State

Next, we will make a static line graph using the above dataframe.

A few new features of this graph:

  1. All states will have their own distinct colored line

  2. Plot will not start showing lines until 2/24/2024 (since there is not much happening in terms of case growth before this time)

  3. The lines will have a case count and state label added to them after the respective states have at least 100 cases.

Inspiration for this line graph game from the following two twitter posts:

  1. Patrick Rotzetter’s post - as already mentioned above
  2. Matt Cowgill’s post gganimation of John Burn-Murdoch’s post graph
state_ov_time_line_plot <- jhu_confirmed_cases_state_level_aug %>%
  ggplot(aes(x=calc_date,
             y=case_total,
             color=state_fullname))+
  geom_line(size=1)+
  geom_segment(
    data=jhu_confirmed_cases_state_level_aug %>%
               filter(case_total>100),
               mapping=aes(x=calc_date,
                           xend=max(calc_date)+1,
                           y=case_total,
                           yend = case_total), linetype=2) +
  geom_label(data=jhu_confirmed_cases_state_level_aug %>%
               filter(case_total>100),
             mapping=aes(x=calc_date,
                 y=case_total,
                 color=state_fullname,
                 label=case_total,
                 size=scaled_case_total),
             fontface="bold")+
    geom_text(data=jhu_confirmed_cases_state_level_aug %>%
               filter(case_total>100),
             mapping=aes(x=max(calc_date),
                 y=case_total,
                 color=state_fullname,
                 label=state_abb),
             size=6,
             hjust=-1,
             fontface="bold")+
  scale_size_continuous(range=c(4,6))+
  scale_y_continuous(breaks = seq(0,10000,100),
                     expand = expand_scale(mult=c(0.07,0.11)))+
  scale_x_date(date_breaks = "4 days",
               labels= scales::date_format("%m/%d"),
               expand = expand_scale(mult=c(0.05,0.01)),
               limits=c(as.Date("2020-02-24"),max(sum_ov_time_df$date_fmt+2.5)))+
  #scale_color_gradient(low="black",high="red")+
  labs(title="# of U.S. Confirmed COVID-19 Cases by State",
       x="",
       y="# of Confirmed Cases") +
  guides(color=F,
         size=F)+
  theme_bw()+
  theme(legend.position = c(0.92,0.3),
        axis.text.x =  element_text(size=20,
                                    face="bold",
                                    #angle=60,hjust=1
                                    ),
        axis.text.y =  element_text(size=20,
                                    face="bold"),
        title=element_text(size=20))

state_ov_time_line_plot

Step 4: Add gganimate functions to static plot

ggan_state_ov_time_line_plot <- state_ov_time_line_plot +
  labs(subtitle = "Date: {as.Date(case_when(frame_along<=first_transition_frame ~ as.Date('2020-01-22') + trunc(frame_along) - 1,
  frame_along<=second_transition_frame ~ as.Date('2020-01-22') + first_transition_days_passed + (trunc((frame_along-first_transition_frame)/8)),
  T ~ as.Date('2020-01-22') + second_transition_days_passed + (trunc((frame_along-second_transition_frame)/16))),
  origin='1970-01-01')}")+
  transition_reveal(timestep_aug)+
  shadow_mark()

Step 5: Create gganimation

line_overtime_gif_by_state <- animate(ggan_state_ov_time_line_plot,
        duration = 10,
        fps = 10,
        width = 600,
        height = 400, 
        renderer = gifski_renderer())

line_overtime_gif_by_state

Building Combined GIF of Three Plots in Sync

In the final step, we’ll stitch our three .gif files together using functions from the magick package. The code here was adapted from this example by Thomas Lin Pedersen - one of the co-creaters of gganimate package.

a_mgif <- image_read(line_overtime_gif)
b_mgif <- image_read(line_overtime_gif_by_state)
c_mgif <- image_read(map_overtime_plot)
white_space <- image_blank(width=275,height=400)

new_gif_top_row <- image_append(c(a_mgif[1], b_mgif[1]))
new_gif_bottom_row <- image_append(c(white_space,c_mgif[1],white_space))
new_gif_top_and_bottom_syn <- image_append(c(new_gif_top_row,new_gif_bottom_row),
                                           stack=T)

for(i in 2:100){
  new_gif_top_row <- image_append(c(a_mgif[i], b_mgif[i]))
  new_gif_bottom_row <- image_append(c(white_space,c_mgif[i],white_space))
  new_gif_top_and_bottom_i <- image_append(c(new_gif_top_row,new_gif_bottom_row),
                                           stack=T)
  new_gif_top_and_bottom_syn <- c(new_gif_top_and_bottom_syn, new_gif_top_and_bottom_i)

}
new_gif_top_and_bottom_syn

A Few Final Notes

  1. The data displayed in these visualizations represent only confirmed cases of COVID-19 virus. It does not reflect active (i.e. those infected, but not tested) or recovered cases.

  2. On 3/10/2020, Johns Hopkins Novel Coronavirus (COVID-19) Cases dataset changed the reporting of the cases from primarily the county-level to primarily the state-level - which explains the abrupt transition in distribution of dots on 3/9/2020 to 3/10/2020.

Acknowledgements

Data Source

A very special thank you to the Johns Hopkins University Center for Systems Science and Engineering team for publishing their dataset. Thank you for aggregating and sharing these data and for your work tracking the spread of the disease across the world. Be sure to check out their awesome dashboard

R Package Creators

Many thanks to the creators of the many amazing packages we used in this example. These packages/folks include, but are not limited to:

  1. The tidyverse suite of packages (readr, dplyr, tidyr, ggplot2) - Hadley Wickham + R Studio team
  2. gganimate - Thomas Lin Pedersen + David Robinson
  3. sf - Edzer Pebesma
  4. maps - Alex Deckmyn + colleagues
  5. magick - Jeroen Ooms
  6. janitor - Sam Firke

  1. Case Western Reserve University School of Medicine, @rdwinkelman

  2. The Ohio State University, @cwaltz38