At the onset of the COVID-19 pandemic, the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) published a dashboard with a map of the spread of SARS-CoV-2 that made the rounds on the internet.
Curious about the data source for the JHU CSSE map, I found their GitHub repository and started this simple data viz project, enriching the data with a dataset of country populations I cobbled together with internet searches and WHO data.
The pre-processed dataset comprises 358,598 rows and 4 columns. Each single-status dataset (Confirmed or Fatal) has one row per country per day, so its length is the number of days times the number of countries.
Today (2022-07-11) there are 901 days and 199 countries in the data (901 × 199 = 179,299 rows per status, 358,598 for both statuses), before removing the small and seasonal populations of ships, Antarctica, the Olympics, and the Holy See.
Since the project focuses on countries, the latitude, longitude, and sub-national province.state fields were discarded.
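
Dropping the sub-national fields implies that countries reported by province or state end up with one row per country and date. The sketch below shows one way that roll-up could look in base R. It is an illustration only: the toy data frame `raw`, its column names, and the use of `aggregate()` are assumptions, not the project's actual pre-processing code.

```r
# Toy long-format data (hypothetical values): country "A" is reported
# by province, country "B" only at the national level.
raw <- data.frame(
  Province.State   = c("North", "South", NA, "North", "South", NA),
  Country          = c("A", "A", "B", "A", "A", "B"),
  Lat              = c(1.0, 2.0, 3.0, 1.0, 2.0, 3.0),
  Long             = c(10.0, 20.0, 30.0, 10.0, 20.0, 30.0),
  Date             = as.Date(c("2022-07-09", "2022-07-09", "2022-07-09",
                               "2022-07-10", "2022-07-10", "2022-07-10")),
  Cumulative_Count = c(10, 5, 7, 12, 9, 8)
)

# Sum the provincial counts per country and date. Only the columns named
# in the formula are kept, so Lat, Long, and Province.State drop out.
by_country <- aggregate(Cumulative_Count ~ Country + Date, data = raw, FUN = sum)

print(by_country)
```

Summing is appropriate here only because the counts are additive across provinces; latitude and longitude have no meaningful country-level sum, which is one reason to discard them rather than aggregate them.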
I maintain a static data set of countries and their population. This data was cobbled together with internet searches and World Health Organization data. I use countries’ population counts to calculate three columns from the original count:
- Cumulative_Count (original): tracks the cumulative count of cases for a given status (Confirmed or Fatal), country, and date
- Cumulative_PctPopulation (calculated): percentage of the population the cumulative count represents [cumulative count * 100 / population]
- NewCases (calculated): daily count of new cases [cumulative count - cumulative count previous day]
- NewCases_per10K (calculated): daily count of new cases per 10,000 people to facilitate comparisons [new cases * 10,000 / population]

I also added two new fields for the Shiny app to help compare countries in the same geographical area or of similar population size:

- Continent: coded in alphabetical order (1 - Africa, 2 - Asia, 3 - Europe, 4 - North America, 5 - Oceania, 6 - South America); this field oversimplifies countries spanning multiple continents (e.g., in eastern Europe)
- PopulationCategory (calculated): coded in descending size (1 - 100 M +, 2 - 10 M to 100 M, 3 - 1 M to 10 M, 4 - less than 1 M)

The top rows of the enriched data set for Brazil and the US are:

| | Continent | PopulationCategory | Country | Status | Date | Cumulative_Count | Cumulative_PctPopulation | NewCases | NewCases_per10K |
|---|---|---|---|---|---|---|---|---|---|
| 41447 | 6 | 1 | Brazil | Confirmed | 2022-07-10 | 32896464 | 15.84204 | 21963 | 1.0576780 |
| 41448 | 6 | 1 | Brazil | Confirmed | 2022-07-09 | 32874501 | 15.83146 | 43657 | 2.1024016 |
| 41449 | 6 | 1 | Brazil | Confirmed | 2022-07-08 | 32830844 | 15.81044 | 71114 | 3.4246556 |
| 322559 | 4 | 1 | US | Confirmed | 2022-07-10 | 88594594 | 27.49848 | 21068 | 0.6539202 |
| 322560 | 4 | 1 | US | Confirmed | 2022-07-09 | 88573526 | 27.49194 | 24925 | 0.7736359 |
| 322561 | 4 | 1 | US | Confirmed | 2022-07-08 | 88548601 | 27.48420 | 167012 | 5.1838103 |
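
The calculated columns in the table follow directly from the bracketed formulas given earlier. Here is a minimal base R sketch on toy data. The countries "A" and "B", their counts, and their populations are invented for illustration, and the handling of values that fall exactly on a category boundary is an assumption.

```r
# Toy data (hypothetical): two countries, three days each, oldest day first
toy <- data.frame(
  Country          = rep(c("A", "B"), each = 3),
  Status           = "Confirmed",
  Date             = rep(as.Date("2022-07-08") + 0:2, times = 2),
  Cumulative_Count = c(1000, 1250, 1400, 50, 50, 80),
  Population       = rep(c(2e6, 5e5), each = 3)
)

# Percentage of the population the cumulative count represents
toy$Cumulative_PctPopulation <- toy$Cumulative_Count * 100 / toy$Population

# Daily new cases: difference from the previous day, computed separately
# for each country and status. The first day has no previous day, so it is NA.
toy$NewCases <- ave(
  toy$Cumulative_Count, toy$Country, toy$Status,
  FUN = function(x) c(NA, diff(x))
)

# New cases per 10,000 people
toy$NewCases_per10K <- toy$NewCases * 10000 / toy$Population

# Population category: 1 (100 M +), 2 (10 M to 100 M), 3 (1 M to 10 M),
# 4 (less than 1 M). findInterval() counts how many thresholds are <= x.
toy$PopulationCategory <- 4L - findInterval(toy$Population, c(1e6, 1e7, 1e8))

print(toy)
```

The rows must be sorted by date within each country and status before taking differences; otherwise `diff()` compares the wrong pairs of days.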
Note: a few data quality issues in the original counts produce negative values for new cases and the other calculated fields. I set those negative values to zero for plotting but left the original cumulative counts unchanged, so a few sudden drops remain visible in what should otherwise be a non-decreasing cumulative count. I did not investigate each case to determine whether it was a data entry error or a valid correction.
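
As a rough illustration of that note, the sketch below zeroes negative daily values for plotting while leaving the cumulative series untouched. The numbers are made up, and `pmax()` is just one convenient way to do the clamping, not necessarily how the project does it.

```r
# Hypothetical cumulative series with a downward revision on day 4
cumulative <- c(100, 120, 150, 140, 160)

# Raw daily differences; the revision shows up as a negative value
new_cases_raw <- c(NA, diff(cumulative))

# Zero the negative values for plotting only; `cumulative` is not modified
new_cases_plot <- pmax(new_cases_raw, 0)

print(data.frame(cumulative, new_cases_raw, new_cases_plot))
```

Keeping the raw differences alongside the clamped ones makes it easy to count how many days were affected, for example with `sum(new_cases_raw < 0, na.rm = TRUE)`.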
| Status | Total |
|---|---|
| Confirmed | 555,455,891 |
| Fatal | 6,351,193 |
For barplots of the last day’s top countries and time series of the last 30 days, I’ve created an app.
Interact with the Shiny app here.
Fork or clone this GitHub repository with all files and code, including the code for the Shiny app.
# load libraries
install_packages <- function(package){
  newpackage <- package[!(package %in% installed.packages()[, "Package"])]
  if (length(newpackage)) {
    suppressMessages(install.packages(newpackage, dependencies = TRUE))
  }
  sapply(package, require, character.only = TRUE)
}
suppressPackageStartupMessages(
  install_packages(
    # list of packages
    c("kableExtra", "tidyverse")
  )
)
# read in preprocessed data
prep_data <- paste0("./data/", gsub("-", "", Sys.Date()), "_data.rds")
dfm <- readRDS(prep_data)
enriched_data <- paste0("./data/", gsub("-", "", Sys.Date()), "_enriched.rds")
merged <- readRDS(enriched_data)
# calculate number of countries and number of days in the time series
Ncountries <- length(unique(dfm$Country))
Ndays <- length(unique(dfm$Date))
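# values embedded inline in the text above
nrow(dfm)   # number of rows
ncol(dfm)   # number of columns
Sys.Date()  # date of the report
Ndays       # number of days in the time series
Ncountries  # number of countries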
# top rows for Brazil and US
brazil_us <- rbind(head(merged[merged$Country == "Brazil", ], 3),
                   head(merged[merged$Country == "US", ], 3))
kable(x = brazil_us, table.attr = "style='width:100%;'") %>%
  kable_classic(full_width = TRUE, position = "center")
# subset to last date and calculate world totals
current_data <- data.frame(
  merged %>%
    select(Country, Status, Date, Cumulative_Count) %>%
    filter(Date == max(merged$Date)) %>%
    arrange(Status, desc(Cumulative_Count))
)
world_totals <- data.frame(
  current_data %>%
    group_by(Status) %>%
    summarise(Total = sum(Cumulative_Count))
)
# format = "d" keeps large totals out of scientific notation
world_totals$Total <- formatC(world_totals$Total, format = "d", big.mark = ",")
kable(world_totals) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
# uncomment to run, creates Rcode file with R code, set documentation = 1 to avoid text commentary
#library(knitr)
#options(knitr.purl.inline = TRUE)
#purl("CoronavirusDataAnalysis.Rmd", output = "Rcode.R", documentation = 1)