The Ohio Department of Health (ODH) recently published a dataset of all Ohio COVID-19 cases by county of residence. The dataset is updated daily at 2 pm EST. ODH also provides a very useful Tableau dashboard of the data on their website.
The purpose of this document is to outline the process it took to develop the following three animated visualizations of Ohio counties’ COVID-19 cases using the ODH COVID-19 dataset.
library(janitor)
library(gganimate)
library(ggrepel)
library(magick)
library(sf)
library(tidyverse)
date <- paste0(str_replace_all(Sys.Date(),"-","_"),"_")
First, we need a map of Ohio counties. We can get these data using the map function from maps package.
The map function provides us an object of class map. We will want to convert this to an sf object using the st_as_sf function from the sf package. This will allow us to more easily manipulate the Ohio map data (e.g. filtering for only Ohio counties, calculating new columns, and joining the map dataset with other datasets [e.g. the ODH COVID-19 dataset]).
In this step, we will also be calculating a centroid of each county using the st_centroid function from the sf package. We will need this centroid location so we can overlay the daily increases in cases as text when we build our map visualization.
ohio_county_map_df <-st_as_sf(maps::map("county",plot=F,fill=T)) %>%
filter(str_detect(ID,"ohio,")) %>%
separate(ID,into=c("state","county"),sep=",") %>%
mutate_if(is.character,funs(tools::toTitleCase(.))) %>%
mutate(centroid = st_centroid(geom),
lat_long = map(.x=centroid, ~st_coordinates(.x) %>% as_tibble(.))) %>%
unnest(lat_long) %>%
rename(centroid_lat=X,
centroid_long=Y)
Now we will quickly check to make sure we can generate a plot of Ohio counties.
ohio_county_map_df %>%
ggplot()+
geom_sf()
Here we will pull the ODH COVID dataset down directly from their website.
ohio_data_raw <- read_csv("https://coronavirus.ohio.gov/static/COVIDSummaryData.csv") %>%
clean_names() %>%
filter(!str_detect(county,"Total"))
Here we will take a quick glance at the dataset. There are a few important things to note:
The data appear to contain the number of cases/hospitalizations/deaths in each county on a given date + also provides info on the sex and age range of each new case.
Some counties are missing rows for certain dates - presumably there were no new cases in these counties on those dates
Some counties have more than one row for the same date - since each sex and age_range has its own row even if its on the same date
The variable onset_date is of class chr and not of class Date
ohio_data_raw %>%
arrange(county,onset_date)
## # A tibble: 2,033 x 8
## county sex age_range onset_date date_of_death case_count death_count
## <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 Allen Male 60-69 3/15/2020 <NA> 1 0
## 2 Allen Fema… 70-79 3/21/2020 <NA> 1 0
## 3 Allen Male 50-59 3/21/2020 <NA> 1 0
## 4 Allen Fema… 70-79 3/23/2020 <NA> 2 0
## 5 Allen Male 70-79 3/23/2020 <NA> 1 0
## 6 Allen Fema… 40-49 3/25/2020 <NA> 1 0
## 7 Allen Male 30-39 3/25/2020 <NA> 1 0
## 8 Allen Fema… 60-69 3/26/2020 <NA> 1 0
## 9 Allen Male 80+ 3/26/2020 <NA> 1 0
## 10 Ashla… Fema… 20-29 3/15/2020 <NA> 1 0
## # … with 2,023 more rows, and 1 more variable: hospitalized_count <dbl>
These charactersitics are important to note because they will influence how we will go about cleaning our data. For our final cleaned dataset we will need the following:
One row for every county on every date (from earliest to latest date available) which has the daily and cumulative cases/hospitalizations/deaths for that particular date
The variable of onset_date to be of class Date
We will clean the data in two Parts:
Nothing exciting here. Pretty self-explanatory.
ohio_data_cleaned_part_1 <- ohio_data_raw %>%
mutate(onset_date = as.Date(onset_date,format="%m/%d/%Y")) %>%
filter(!is.na(onset_date)) %>%
filter(onset_date<=as.Date("2020-03-31"))
First we will create a table which has a new row for each date for every county.
entire_date_range_df <- tibble(onset_date = seq(min(ohio_data_cleaned_part_1$onset_date),
max(ohio_data_cleaned_part_1$onset_date),
1)) %>%
crossing(county=unique(ohio_county_map_df$county)) %>%
arrange(county,onset_date)
entire_date_range_df %>% head()
## # A tibble: 6 x 2
## onset_date county
## <date> <chr>
## 1 2020-02-16 Adams
## 2 2020-02-17 Adams
## 3 2020-02-18 Adams
## 4 2020-02-19 Adams
## 5 2020-02-20 Adams
## 6 2020-02-21 Adams
Next, we will join our this new dataframe with our dataframe from Part 1 using the left_join function from dplyr. We will also arrange the data by so all rows are ordered by the county and then by date (both in ascending order).
ohio_data_cleaned_part_2a <- entire_date_range_df %>%
left_join(ohio_data_cleaned_part_1,
by=c("county","onset_date"))%>%
arrange(county,onset_date)
Lastly, we’ll resummarize the data in two steps:
First, we will group the data by county and onset_date to calculate the total number of cases/hospitalizations/deaths associated with each date. Note - it is very important to include the argument na.rm = T when summing up the case counts for each county + date since there are a lot of missing values.
Next, we will group the data by county to calculate the cumulative total of cases/hospitalizations/deaths for each date.
ohio_data_cleaned_part_2b <- ohio_data_cleaned_part_2a %>%
group_by(county,onset_date) %>%
summarize(case_count_daily = sum(case_count,na.rm=T),
hospitalized_count_daily = sum(hospitalized_count,na.rm=T),
death_count_daily = sum(death_count,na.rm=T)) %>%
ungroup() %>%
group_by(county) %>%
mutate(case_count_total = cumsum(case_count_daily),
hospitalized_count_total = cumsum(hospitalized_count_daily),
death_count_total = cumsum(death_count_daily)) %>%
ungroup()
Now we will do a quick check to show that each county has 54 rows which correspond to the daily and cumulative cases/hospitalzations/deaths for each date.
ohio_data_cleaned_part_2b %>%
count(county) %>%
head()
## # A tibble: 6 x 2
## county n
## <chr> <int>
## 1 Adams 45
## 2 Allen 45
## 3 Ashland 45
## 4 Ashtabula 45
## 5 Athens 45
## 6 Auglaize 45
We will join the Ohio COVID data to the Ohio counties’ map data using the left_join function from dplyr.
In this step, we will also use the hue_pal function from scales package to assign a single color to each county to help us have consistent color scheme across different visualizations.
ohio_data_summarized <- ohio_data_cleaned_part_2b %>%
left_join(ohio_county_map_df %>%
mutate(color_pal = scales::hue_pal()(nrow(.))),
by="county")
We will also include a counter with total cumulative cases across all counties for a given date. Thus, we will group the data by date and count the total cases for each date.
total_by_date <- ohio_data_summarized %>%
group_by(onset_date) %>%
summarise(case_count_total=sum(case_count_total)) %>%
ungroup()
For the bar plot, we only want to display the the top 10 counties with most cases for each date. Below is the code we used to quickly filter out only the 10 counties with the most cases for each date.
ohio_data_summarized_top_10 <-ohio_data_summarized %>%
arrange(onset_date,desc(case_count_total)) %>%
group_by(onset_date) %>%
mutate(rank=row_number()) %>%
ungroup() %>%
filter(rank<=10)
All visualizations will be completed in 3 steps:
We will not go over the details of the code used to generate each visualization - but encourage those interested to copy and experiment with the code themselves.
Heat maps are helpful for visualizing the geospatial distribution of cases.
We first saw a similar heat map of Ohio COVID cases posted by Anthony Reinhard, so credit to him for the original idea. Below is our attempt to recreate his work.
p_ohio_county_map<- ohio_data_summarized %>%
mutate(case_count_total=ifelse(case_count_total==0,
NA,
case_count_total)) %>%
ggplot()+
geom_sf(mapping=aes(geometry=geom,
fill=case_count_total),
color="black")+
geom_text(data=ohio_data_summarized %>%
filter(case_count_daily!=0) %>%
filter(onset_date!=max(onset_date)) %>%
mutate(county_date = paste0(county,onset_date)),
aes(x=centroid_lat,
y=centroid_long,
label=paste0("+",case_count_daily),
group=county_date))+
scale_fill_gradient(low="#fff2e5",
high="red",
na.value = "white")+
theme_bw()+
theme(title = element_text(size=14,face="bold"),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank())+
labs(title="Ohio Counties: # of Confirmed Cases",
subtitle='Date: 3/31/20',
x="",
y="",
fill="Total\nCases")
p_ohio_county_map
ggan_p_ohio_county_map <- p_ohio_county_map +
labs(title="Ohio Counties: # of Confirmed Cases",
subtitle='Date: {format(frame_time,format="%m/%d/%y")} Total Cases: {total_by_date$case_count_total[as.numeric(frame_time-min(total_by_date$onset_date))+1]}',
x="",
y="",
fill="Total\nCases")+
transition_time(onset_date)+
enter_fade()+
enter_drift(x_mod=0,y_mod=0.05)+
exit_fade()
Here is a quick explanation of the parameters we are providing to theanimate function to create the gif:
duration = 15 secs: the gif will last for 15 seconds
fps = 9: there will be 9 frames per second (fps), so 9 frames/sec * 15 seconds = 135 frames total
end_pause = 27: the last frame will display 27 times (27 frames / 9 frames/sec = 3 second pause at the end)
height = 400 and width = 500: the resulting gif will be 400 by 500 pixels
an_p_ohio_county_map <- animate(ggan_p_ohio_county_map,
duration=15,
fps=9,
end_pause = 27,
height=400,
width=500)
an_p_ohio_county_map
Line plots can show us how the data are trending over time for different groups of data - in this case we will be comparing counties.
We first saw a similar line graph in Patrick Rotzetter’s post. We also benefitted from reviewing the associated code on his github. Below is our attempt to repurpose his code for our dataset.
p_county_line_plot <- ohio_data_summarized %>%
ggplot(aes(x=onset_date,
y=case_count_total,
color=color_pal,
group=county))+
geom_line()+
geom_segment(data=ohio_data_summarized %>%
filter(case_count_total>=75),
mapping=aes(y=case_count_total,
yend=case_count_total,
x=onset_date,
color=color_pal),
xend=max(ohio_data_summarized$onset_date)+0.5,
linetype=2
)+
geom_text_repel(data=ohio_data_summarized %>%
filter(case_count_total>=75),
mapping=aes(y=case_count_total,
color=color_pal,
label=county),
x=max(ohio_data_summarized$onset_date)+0.5,
hjust=0,
fontface="bold",
nudge_y = 5,
direction="y")+
labs(y="# of Confirmed Cases",
x="Date")+
guides(color=F)+
scale_color_identity()+
scale_y_continuous(breaks=seq(0,1000,100))+
scale_x_date(expand=expand_scale(mult=c(0,0.13)),
breaks="1 week",
labels = function(x) format(x,format="%m/%d"))+
theme_bw()+
theme(axis.text = element_text(size=12,face="bold"),
title = element_text(size=14,face="bold"),
panel.grid.minor = element_blank())
p_county_line_plot
ggan_p_county_line_plot <- p_county_line_plot+
labs(title = 'Ohio Counties: # of Confirmed Cases',
subtitle = 'Date: {format(frame_along,format="%m/%d/%y")}')+
transition_reveal(onset_date)
Here is a quick explanation of the parameters we are providing to the animate function to create the gif:
duration = 15 secs: the gif will last for 15 seconds
fps = 9: there will be 9 frames per second (fps), so 9 frames/sec * 15 seconds = 135 frames total
end_pause = 27: the last frame will display 27 times (27 frames / 9 frames/sec = 3 second pause at the end)
height = 400 and width = 500: the resulting gif will be 400 by 500 pixels
an_p_county_line_plot <- animate(ggan_p_county_line_plot,
duration=15,
fps=9,
end_pause=27,
height=400,
width=500)
an_p_county_line_plot
Although much of what can be seen in a bar plot can be visualized using the above line plot, bar plots are useful for displaying total counts of data at one point in time.
We first saw a similar graph on a medium post, and benefitted tremendously from the code they shared in the post. Below is our attempt to repurpose their code for our dataset.
p_bar_counties_cases <- ohio_data_summarized_top_10 %>%
ggplot(aes(x=rank,y=case_count_total,fill=color_pal))+
geom_col()+
geom_text(aes(y=case_count_total,
group=county,
color=color_pal,
label=as.character(case_count_total)),
fontface="bold",
size=4,
show.legend = F,
hjust=0)+
geom_text(aes(group=county,
color=color_pal,
label=county),
show.legend = F,
fontface="bold",
size=4,
y=0,
hjust=1.1)+
scale_x_reverse(expand=expand_scale(mult=0))+
scale_y_continuous(expand=expand_scale(mult=c(0.2,0.06)),
breaks=seq(0,10000,100))+
scale_fill_identity()+
scale_color_identity()+
coord_flip()+
guides(fill=F)+
labs(y="# of Confirmed Cases")+
theme_bw()+
theme(axis.text = element_text(size=12,face="bold"),
title = element_text(size=14,face="bold"),
axis.text.y = element_blank(),
axis.ticks.y = element_blank(),
axis.title.y = element_blank(),
panel.border = element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank())
p_bar_counties_cases
ggan_p_bar_counties_cases<- p_bar_counties_cases +
transition_time(onset_date) +
labs(title = 'Top 10 Ohio Counties with Most Cases',
subtitle = 'Date: {format(frame_time,format="%m/%d/%y")}')
Here is a quick explanation of the parameters we are providing to the animate function to create the gif:
duration = 15 secs: the gif will last for 15 seconds
fps = 9: there will be 9 frames per second (fps), so 9 frames/sec * 15 seconds = 135 frames total
end_pause = 27: the last frame will display 27 times (27 frames / 9 frames/sec = 3 second pause at the end)
height = 400 and width = 500: the resulting gif will be 400 by 500 pixels
an_p_bar_counties_cases <- animate(ggan_p_bar_counties_cases,
duration=15,
fps=9,
end_pause = 27,
height=400,
width=500)
an_p_bar_counties_cases
Now that we have created our three individual .gif files we can merge them together using the magick package in R.
However, in our experience, this step was the biggest headache of all. Although we are not experts on the magick package and it does provide some very useful functionality for combining gganimate graphics, we found these functions were pretty unreliable. This step frequently led to endlessly running functions (literally > 10 hours with no results) or cause the R session to crash. Not sure if it is the computer we are using or how we are creating the files, but it was not a pleasant experience.
If you want to just combine two .gif files, we’d highly recommend using a gif combining website ( link here ) as it seems to be much more straightforward and much less headache.
However, if you really want to combine more than 2 .gif files, you are welcome to try your luck with the code below.
a_gif <- image_read(an_p_ohio_county_map)
b_gif <- image_read(an_p_county_line_plot)
c_gif <- image_read(an_p_bar_counties_cases)
new_gif_row <- image_append(c(a_gif[1],b_gif[1], c_gif[1]))
for(i in 2:135){
new_gif_row_i <- image_append(c(a_gif[i], b_gif[i], c_gif[i]))
new_gif_row <- c(new_gif_row, new_gif_row_i)
}
new_gif_row