1 Introduction
In this document, we will use R and the package ggplot2 to produce multiple graphs of the cumulative death per million people (this helps to account for countries/states’ sizes) both in the US and in the world. We use an online dataset from Johns Hopkins posted on datasource. This data is then scraped directly in R so every time the code that generated this document is run, we get the most updated dataset. If you have never used R, every journey needs to starts from somewhere, yours might as well start here. If you didn’t know, R is an open-source statistical programming language which has awesome features like produce this web page entirely for example.
1.1 Data Cleaning
1.1.1 US
Below are codes that will help making the US data ready for ggplot2 i.e makes it tidy. The most important package to make all this possible is tidyverse which contains ggplot2, magrittr and other important data manipulation packages like dplyr which is very powerful and many other packages. haven allows us to import data from STATA, SASA, SPSS, etc. Read more about it here tidyverse which was created by Hadley Wickham. You will see in almost every line of code the symbol %>% which is maybe the most important operator in all of R for wrangling data, it is referred to as the pipe operator from the magrittr package which helps combine/chain multiple operations/functions together (see appendix for a rigorous treatment).
library(tidyverse)
library(haven)
library(classInt)
#-----------------#
poplink_US <- "https://github.com/partha-deb/COVID-19-outcomes/blob/master/Data/statepopulation.dta?raw=true"
pop_US <- read_dta(poplink_us)
covidlink_US <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_US.csv"
covid_US0 <- read.csv(covidlink_US)The code below cleans the US data, and makes it in the appropriate format for ggplot2
ll <- 1
maxdays <- 50 # how far in term of days you want to see ahead
ref_date <- "3/20/20"
gtitle <- "COVID-19 summary by US States"
ylabel <- "Cumulative deaths per million people"
covid_US <- covid_US0 %>%
select(c(county="admin2",
state= "province_state"),
everything()) %>%
gather(13:ncol(covid_US0) , # getting the data long
key="date",
value="death") %>%
mutate(date=gsub("x","",date)) %>% # R puts 'x' in front of all numerical columns
mutate(date=as.Date(as.character(date),"%m.%d.%y")) %>%
group_by(state, date) %>% # Collapsing and summing deaths
summarize(deaths = sum(death)) %>%
left_join(pop_US, by="state") %>%
mutate(deathspc = (deaths / population) * 1000000, # The number, 1,000,000 , is arbitrary but allows easier interpretation
t0 = (deathspc > ll) ) %>%
filter(t0==TRUE) %>%
mutate(days = as.vector(unlist(lapply(table(state), seq_len))))1.1.2 World
For the world data, the following code would allow you to have access to it (no need to download it).
poplink_world <- "https://github.com/partha-deb/COVID-19-outcomes/blob/master/Data/worldpopulation.dta?raw=true"
pop_world <- read_dta(poplinK_world)
covidlink_world <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv"
covid_world <- read.csv(covidlink_world)The code below gets the world data ready for graphics (very similar to the US data).
colnames(covid_world) <- tolower(colnames(covid_world)) #Column names start with capital letters.
covid_world <- covid_world %>%
select(c(country="country.region",
provincestate="province.state"),
everything()) %>%
gather(5:ncol(covid_world) , # getting the data long
key="date",
value="death") %>%
mutate(date=gsub("x","",date)) %>% # R put 'x' in front of all dates
mutate(date=as.Date(as.character(date),"%m.%d.%y")) %>%
group_by(country, date) %>% # Collapsing and summing deaths
summarize(deaths = sum(death)) %>%
left_join(pop_world, by="country") %>%
mutate(deathspc = (deaths / population) * 1000,
t0 = (deathspc > ll) ) %>%
filter(t0==TRUE) %>%
mutate(days = as.vector(unlist(lapply(table(country), seq_len))))We use the following theme for all graphics. By doing this, it allows us to have shorter codes for graphs and a uniform theme across all graphics.
theme_element <- theme(axis.text=element_text(size=10),
strip.text=element_text(size=12),
axis.title = element_text(size=12),
legend.text=element_text(size=11),
legend.key.size = unit(0.5, "cm"),
legend.title=element_blank(),
legend.position="bottom",
strip.background=element_rect(fill="white"),
plot.title = element_blank())1.2 US level
At the US level we plot the cumulative death per million people for only a small sample of states as shown below. You can always substitute states or even add more (adding more won’t be pleasing for obvious graphical reasons)
US_states <- c("New York","Georgia","Texas","California",
"Michigan","Louisiana","New Jersey","Massachusetts",
"Connecticut")Here is a time series plot by the US states listed above:
covid_US %>%
filter(state %in% US_states ) %>%
ggplot(aes(x=days,
y=deathspc,
color= state))+
geom_line()+
geom_point()+
theme_bw()+
labs(title=gtitle)+
ylab(ylabel)+
xlab(paste0("Days since ", ll,"/million deaths"))+
theme_element+
xlim(c(0,maxdays))+
scale_color_manual(values=c("deepskyblue4","lightblue2",
"burlywood4","forestgreen",
"deeppink","red3","yellow3",
"orange3","purple2"))The graph below is a panel of states which may be a more convenient way to compare states. Note that you can order the state to have them side by side as you wish by turning the state column into a factor and specifying the levels in your prefered order (this is not done in the code).
covid_US %>%
filter(state %in% US_states ) %>%
ggplot(aes(x=days,y=deathspc, color=state))+
geom_line()+
geom_point()+
theme_bw()+
labs(title=gtitle)+
ylab(ylabel)+
xlab(paste0("Days since ", ll,"/million deaths"))+
theme_element+
scale_color_manual(values=c("deepskyblue4","lightblue2",
"burlywood4","forestgreen",
"deeppink","red3","yellow3",
"orange3","purple"))+
facet_wrap(~state)+
xlim(c(0,maxdays))+
theme(legend.position = "none")1.3 World graphs
We focus only on a subset of countries as we did for US states, and the same applies here, you can substitude or add countries to this list (making sure to match the way the country is written in the data)
world_countries <-c("US","France","Belgium",
"Norway","Switzerland",
"United Kingdom","Germany",
"Italy","Spain") covid_world %>%
filter(country %in% world_countries ) %>%
ggplot(aes(x=days,y=deathspc, color= country))+
geom_line()+
geom_point()+
theme_bw()+
labs(title="COVID-19 summary")+
ylab(ylabel)+
xlab(paste0("Days since ", ll,"/million deaths"))+
theme_element+
scale_color_manual(values=c("deepskyblue4","lightblue2",
"burlywood4","forestgreen",
"deeppink","red3","yellow3",
"orange3","purple2"))+
xlim(c(0,maxdays)) Here create a panel by state and reproduce the same graphs. This helps for a better comparison of countries. As above, the code is omitted for the same reason.
covid_world %>%
filter(country %in% world_countries ) %>%
ggplot(aes(x=days,y=deathspc, color=country))+
geom_line()+
geom_point()+
theme_bw()+
labs(title="COVID-19 summary")+
ylab(ylabel)+
xlab(paste0("Days since ", ll,"/million deaths"))+
scale_color_manual(values=c("deepskyblue4","lightblue2",
"burlywood4","forestgreen",
"deeppink","red3","yellow3",
"orange3","purple"))+
xlim(c(0,maxdays))+
theme_element+
facet_wrap(~country)+
theme(legend.position = "none")1.3.1 State specific graphs
If you want to make graphs for individual states, you can use the following code to automate that. Similarly for individual countries, replace US_states with world_countries where necessary.
for(i in US_states){
print (i)
individual_state <- covid_US %>%
filter(state== i) %>%
ggplot(aes(x=days,y=deathspc))+
geom_line()+
geom_point()+
theme_bw()+
labs(title=paste(i,"COVID-19 summary", sep=" "))+
ylab(ylabel)+
xlab(paste0("Days since ", ll,"/million deaths"))+
theme_element+
xlim(c(0,maxdays))+
ggsave(paste0(i,"_COVID19_graph.tiff"),
width=10,
height=10)
}2 US Map
In this section, we produce a thematic map of the cummulative death per million people for US states using the package tigris which grabs shapefiles from US Census. We use as projection Conus Albers or EPSG:5070 (a map projection is to compensate for the distortion of displaying a portion of a sphere on a flat surface; Robinson is another type of projection for instance). We only map the 48 contiguous states and DC for aesthetic purposes. We’ll need to use the following packages to be able to produce this map (you will need to install them to reproduce the map). For more on thematic maps in R, visit this page for a cookbook approach maps in R
2.1 Data cleaning
The following gets the shapefile from Census for state borders, etc. and cleans the data to make our map.
us_geo <- tigris::states(class = "sf", cb = TRUE)
US_pop <- pop_US %>%
select(c(state,population))
date <- gsub("\\.","/", colnames(covid_US0)[ncol(covid_US0)])
date <- gsub("x","",date) # This saves the last date
contiguous_states <- us_geo %>%
left_join(
covid_US0 %>%
rename(latest_date=ncol(covid_US0),
state="province_state") %>%
group_by(state) %>%
summarize(deaths_latest = sum(latest_date)
) %>%
left_join(US_pop, by="state") %>%
mutate(deathspc=(deaths_latest/population)*1000000),
by=c("NAME"="state")) %>%
filter(as.numeric(GEOID) < 60 &
STUSPS != "AK" &
STUSPS != "HI") # Only keeping the 48 contigious states plus DCBelow is the US map of cumulative death per million people as of 5/6/20 which is the date when the data was updated by Johns Hopkins and from the time this document was compiled. As we can observe, NY is by far the worst state in regards to the deaths (no surprises here).
title1 <- "Cumulative deaths per million people in the US on "
subtitle <- "COVID-19 deaths for contiguous states only"
label1 <- paste0("Source: Johns Hopkins,\nhttps://github.com/CSSEGISandData/COVID-19/; ",date," by Florian Mudekereza")
legend1 <- "Deaths per\nmillion people"
#---------------------------#
contiguous_states %>%
ggplot() +
geom_sf(aes(fill = deathspc),
color = "white", size=1) +
coord_sf(crs = 5070, datum = NA) + # for Robinson projection replace 5070 by "+proj=robin"
scale_fill_viridis(legend1,
option = "viridis", # other options: magma, inferno, plasma, etc.
direction = 1) +
labs(title = paste0("Cumulative total Deaths on ",date),
subtitle = subtitle)+
theme_classic()+
theme(legend.position="right",
plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
axis.title = element_blank())+
annotate("text", x = -100, y = 22,
label = label1, size = 2.5,
family = "Times", colour = "black")For a better scale, we will use the classInt package to explicitly determine the breaks. Since NY is an outlier, we need to separate from the others by maping it individually as an additional layer.
ggplot() +
geom_sf(aes(fill = deathspc_cat),data=cont,
color = "white", size=1) +
geom_sf(data=contiguous_states %>% # Separating NY and mapping it individually
filter(NAME=="New York"),
aes(fill=paste0(round(deathspc,1)," NY")),
color = "white", size=1)+
scale_fill_manual(legend1,
values=c("khaki1","orangered4",
"tan1","chocolate1",
"orangered3","black"))+
coord_sf(crs = 5070, datum = NA) + # for Robinson projection replace "+proj=robin" for 5070
theme_classic()+
labs(title = paste0(title1,date),
subtitle = subtitle)+
theme(legend.position="bottom",
plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
axis.title = element_blank())+
annotate("text", x = -100, y = 22,
label = label1, size = 2.5,
family = "Times", colour = "black")2.2 Comparison
To compare 5/6/20 death levels to say 03/20/2020, we need to put those maps side-by-side. Below is the code that will low us to wrangle the data to make it possible.
wrangled_data <- us_geo %>%
left_join(
covid_US_1 %>%
gather(c(march_20_2020,last_day),
key="dates",value="deaths") %>%
group_by(state, dates) %>%
summarize(deaths_latest = sum(deaths)) %>%
left_join(US_pop, by="state") %>%
mutate(deathspc=(deaths_latest/population)*1000000),
by=c("NAME"="state")
) %>%
filter(as.numeric(GEOID) < 60 &
STUSPS != "AK" &
STUSPS != "HI") %>%
mutate(dates=factor(dates,
levels=c("march_20_2020","last_day"),
labels=c(ref_date,date)))Below is our side-by-side comparison map between the death levels on 03/20/2020 and 5/6/20 :
breaks_qt1 <- classIntervals(
c(min(wrangled_data$deathspc)-num1,
wrangled_data$deathspc),
method="complete",n = 5,
style = "quantile"
)
ggplot() +
geom_sf(data=wrangled_data %>%
mutate(deathspc_cat = cut(deathspc,breaks_qt1$brks)),
aes(fill = as.factor(deathspc_cat)),
color = "white", size=1) +
geom_sf(data=wrangled_data %>% # Separating NY and mapping it individually
filter(NAME=="New York" & dates==ref_date),
aes(fill=paste0(round(deathspc,1)," NY on ",ref_date)),
color = "white", size=1)+
geom_sf(data=wrangled_data %>% # Separating NY and mapping it individually
filter(NAME=="New York" & dates==date),
aes(fill=paste0(round(deathspc,1)," NY on ",date)),
color = "white", size=1)+
coord_sf(crs = 5070, datum = NA) + # for Robinson projection replace "+proj=robin" for 5070
scale_fill_manual(legend1,
values=c("darkgray","khaki","tan1",
"orangered3","orangered4",
"black","tan1"))+
theme_classic()+
labs(title = paste0(title1,date),
subtitle = subtitle,
caption=label1)+
theme(legend.position="bottom",
plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5),
plot.caption = element_text(hjust=0.5,size=7),
axis.title = element_blank())+
facet_wrap(~dates)2.3 US County Level Map
We’ll look at actual death counts in the US by county. For this we’ll need to use the package urbnmapr which grabs shapefiles the Census bureau as well similar to tigris.
library(urbnmapr)
spatial_data <- get_urbn_map(map = "counties", sf = TRUE) %>%
mutate(county_name=gsub(" County","",county_name),
county_fips=as.numeric(sub("^0","",county_fips)))%>%
left_join(
covid_US0 %>%
rename(lastday=ncol(covid_US0)),
by = c("county_fips"="fips")
) %>%
mutate(lastday=case_when(lastday==0~num1,
TRUE ~ (lastday/population)*10000)) %>%
filter(state_abbv != "AK" & state_abbv != "HI")
breaks_qt2 <- classIntervals(
c(min(spatial_data$lastday)-num1,
spatial_data$lastday),
method="complete",n = 3,
style = "quantile"
)Anything in grey just means no reported deaths. As we can see, there are several counties with no reported deaths as of 5/6/20.
spatial_data <- spatial_data %>%
mutate(lastday_cat = cut(lastday,breaks_qt2$brks))
spatial_data %>%
ggplot() +
geom_sf(aes(fill=as.factor(lastday_cat)),
color= "white",
size = 0.25) +
coord_sf(crs = 5070, datum = NA) +
scale_fill_manual("deaths per \n10,000 people",
values=c("darkgray","khaki","tan1",
"orangered3","orangered4",
"tan1","black"))+
labs(fill = "")+
theme_classic()+
labs(title = paste0(title1,date),
subtitle = "3,142 counties exluding all county-equivalents")+
theme(legend.position="right",
plot.title = element_text(hjust=0.5),
plot.subtitle = element_text(hjust=0.5, size=8),
axis.title = element_blank())+
annotate("text", x = -100, y = 22,
label = label1, size = 2.5,
family = "Times", colour = "black")2.4 NY by county
Here we get a closer look at NY as of 5/6/20. We can observe that not all counties are as bad as you would think. For an interactive map we’ll use the package mapview. You can request your api here from the Census bureau to access shapefiles. This is an interactive map where you could track reported death by counties from 1/22/20 to 5/6/20 by clicking a specific county polygon (you can change the map view by clicking the square icon at the top-right corner).
#census_api_key(" ") # Put your api inside the " " and uncomment
NY_shape <-
get_acs(geography = "county",
variables = "B01003_001", # population column in census dataset
state = "NY",
geometry = TRUE)## Getting data from the 2014-2018 5-year ACS
library(mapview)
covid_NY <- spatial_data %>%
filter(state_abbv == "NY") %>%
st_as_sf(coords = c("long_", "lat"),
crs = st_crs(NY_shape)) %>%
rename(deaths="lastday_cat",
state="province_state",
county="admin2") %>%
mutate(deaths=as.factor(deaths))
colnames(covid_NY) <- gsub("x","",colnames(covid_NY))
colnames(covid_NY) <- gsub("\\.","/",colnames(covid_NY))
mapviewOptions(
vector.palette = colorRampPalette(c("darkgray",
"khaki",
"tan1")),
na.color = "magenta",
layers.control.pos = "topright"
)
mapview(
covid_NY,
zcol = "deaths",
homebutton = T
)3 Reference
Wickham, Hadley. ggplot2: Elegant Graphics for Data Analysis. Springer New York, 2009.
Wickham, Hadley. Advanced R. ProtoView, vol. 1, no. 49, 2014, pp. ProtoView, Vol.1(49).
Healy, Kieran. Data Visualization: A Practical Introduction. Princeton, 2019.
Rahlf, Thomas. Data Visualisation with R: 111 Examples. 2nd ed., 2019.
4 Appendix
Let \(\Big\{f_{i}\Big\}_{i=1}^{n}\) be a finite sequence of functions. Now suppose that: \[f_{1}: A_{1} \rightarrow A_{2}, \hspace{0.1in} f_{2}: A_{2} \rightarrow A_{3}, \hspace{0.1in} ...\hspace{0.1in} f_{n-1}:A_{n-1} \rightarrow A_{n} ,\hspace{0.1in} f_{n}: A_{n} \rightarrow B\] As we can see, it would be nice if there was a new function that maps from \(A_{1}\) directly to \(B\): \[\Phi: A_{1} \rightarrow B\] But we know from basic functions in math that we can define a new function called a composition such that: \[\begin{align*} \Phi &= f_{n} \circ f_{n-1} \circ ... \circ f_{2} \circ f_{1}\\\\ \Phi(x) &= f_{n}(f_{n-1}(...f_{2}(f_{1}(x))...) \end{align*}\]
By this definition, we can conclude that \(\Phi\) is simply the pipe operator %>% from earlier. To put this concretely, suppose we want to apply two functions (\(n = 2\) here) to the dataset/dataframe \(x\) in the following order:
We want to create a new column, \(c\), in \(x\) from existing columns \(a\) and \(b\) (by say, \(c=a+b\) ), \(f_{1}\)
we would like to graph the density of \(c\), \(f_{2}\)
Instead of applying these functions individually to \(x\), we want to compose them into one function that will graph the density of \(c\) directly by incorporating the intermediary step of creating \(c\), and %>% does that for us in R, so in this example: %>%\((x)= (f_{2} \circ f_{1})(x) = f_{2}(f_{1}(x))\), where \(x\) is an R dataframe.
In this document, as you realized, I was composing 5-6 functions together using %>% to get more compact and simpler codes (for more on this see Function composition (computer science)).