Coronavirus Data Analysis

At the onset of the COVID-19 pandemic, the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) published a dashboard mapping the spread of SARS-CoV-2, which circulated widely on the internet.

Curious about the data source for the JHU CSSE map, I found their GitHub repository and started this simple data visualization project, enriching the data with a dataset of country populations that I compiled from internet searches and WHO data.

BigBangData

Data Pre-Processing: brief description of data pre-processing and cleanup
Data Wrangling and Enrichment: adding population data and calculated fields
Data Visualization: see the Shiny app for barplots and time series
Forecasting: in progress…
Code Appendix: code behind this RPubs page; see GitHub for reproducibility

Data Pre-Processing

The pre-processed dataset has 358,598 rows and 4 columns. It holds one single-status dataset per status (Confirmed and Fatal), each with one row per country per day, so the length of each single-status dataset is the number of days times the number of countries.

As of today (2022-07-11), the data span 901 days and 199 countries, before removing entries with small or seasonal populations: ships, Antarctica, the Olympics, and the Holy See.
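
The row count follows from the dimensions quoted above. A quick check in R, using only the figures in the text and assuming the two statuses are Confirmed and Fatal:

```r
# Figures quoted in the text
n_days      <- 901
n_countries <- 199
n_statuses  <- 2   # assumption: Confirmed and Fatal

# One row per country per day in each single-status dataset
rows_per_status <- n_days * n_countries

# All statuses combined
total_rows <- rows_per_status * n_statuses

rows_per_status  # 179299
total_rows       # 358598
```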

Because the project focuses on countries, I discarded the latitude, longitude, and sub-national province.state fields.
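
As a rough illustration of this cleanup step, the sketch below drops the three fields and excludes non-country entries from a toy data frame. The column names (`Province.State`, `Lat`, `Long`, `Count`), the entry labels, and all values are hypothetical; the actual pre-processing code lives in the GitHub repository and may differ, for example in how sub-national rows are aggregated to the country level.

```r
# Toy stand-in for the raw data; column names and values are hypothetical
raw <- data.frame(
  Province.State = c("", "", "", ""),
  Country        = c("Brazil", "Holy See", "Antarctica", "US"),
  Lat            = c(-14.2, 41.9, -71.9, 40.0),
  Long           = c(-51.9, 12.5, 23.3, -100.0),
  Count          = c(10L, 1L, 0L, 20L),
  stringsAsFactors = FALSE
)

# Entries to exclude; the labels here are assumed, not taken from the source
not_countries <- c("Holy See", "Antarctica")

# Keep every column except the geographic and sub-national fields
keep_cols <- setdiff(names(raw), c("Province.State", "Lat", "Long"))

# Drop the excluded entries and the unwanted columns in one step
countries <- raw[!(raw$Country %in% not_countries), keep_cols]
countries
```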

Data Wrangling and Enrichment

I maintain a static dataset of countries and their populations, compiled from internet searches and World Health Organization data. I use each country's population to calculate three columns from the original cumulative count:

Cumulative_Count (original): cumulative count of cases for a given status (Confirmed or Fatal), country, and date
Cumulative_PctPopulation (calculated): percentage of the population that the cumulative count represents [cumulative count * 100 / population]
NewCases (calculated): daily count of new cases [cumulative count - cumulative count of the previous day]
NewCases_per10K (calculated): daily count of new cases per 10,000 people, to facilitate comparisons [new cases * 10,000 / population]
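
The three formulas above can be sketched in base R on a toy series for a single country and status. The population and counts are made up for illustration, and treating the first day's new cases as `NA` (no previous day to subtract) is an assumption rather than the project's documented behavior.

```r
# Toy cumulative series, oldest date first (all values are made up)
population <- 50000
toy <- data.frame(
  Date             = as.Date("2022-07-08") + 0:2,
  Cumulative_Count = c(1000, 1250, 1300)
)

# Percentage of the population: cumulative count * 100 / population
toy$Cumulative_PctPopulation <- toy$Cumulative_Count * 100 / population

# New cases: cumulative count minus the previous day's cumulative count;
# the first day has no previous day, so it is left as NA (assumption)
toy$NewCases <- c(NA, diff(toy$Cumulative_Count))

# New cases per 10,000 people: new cases * 10,000 / population
toy$NewCases_per10K <- toy$NewCases * 10000 / population

toy
```

With real data the same steps would run once per country and status, after sorting each group by date.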

I also added two new fields for the shiny app to help compare countries in the same geographical area or of similar population size:

Continent: coded in alphabetical order (1 - Africa, 2 - Asia, 3 - Europe, 4 - North America, 5 - Oceania, 6 - South America); this field oversimplifies countries that span multiple continents (e.g., some countries in eastern Europe)
PopulationCategory: (calculated) coded in descending size (1 - 100 M +, 2 - 10 M to 100 M, 3 - 1 M to 10 M, 4 - less than 1 M)
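
One way to derive the PopulationCategory codes is with `cut()`. The helper name `population_category` is hypothetical, and the handling of exact boundaries (a population of exactly 1 M, 10 M, or 100 M falls in the larger category) is an assumption, since the text does not specify it.

```r
# Hypothetical helper: map a population to the four size codes in the text
population_category <- function(population) {
  # Ascending breaks; right = FALSE makes each interval [lower, upper)
  bins <- cut(population,
              breaks = c(0, 1e6, 1e7, 1e8, Inf),
              labels = FALSE, right = FALSE)
  # cut() numbers bins from smallest (1) to largest (4); the text codes
  # the largest populations as 1, so reverse the order
  5L - bins
}

# One example population per category, smallest first
population_category(c(5e5, 5e6, 5e7, 5e8))  # 4 3 2 1
```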

The top rows of the enriched dataset for Brazil and the US, most recent date first, are:

	Continent	PopulationCategory	Country	Status	Date	Cumulative_Count	Cumulative_PctPopulation	NewCases	NewCases_per10K
41447	6	1	Brazil	Confirmed	2022-07-10	32896464	15.84204	21963	1.0576780
41448	6	1	Brazil	Confirmed	2022-07-09	32874501	15.83146	43657	2.1024016
41449	6	1	Brazil	Confirmed	2022-07-08	32830844	15.81044	71114	3.4246556
322559	4	1	US	Confirmed	2022-07-10	88594594	27.49848	21068	0.6539202
322560	4	1	US	Confirmed	2022-07-09	88573526	27.49194	24925	0.7736359
322561	4	1	US	Confirmed	2022-07-08	88548601	27.48420	167012	5.1838103

Note: the original counts contain a few data quality issues that produce negative values for new cases and the other calculated fields. I set those negative values to zero for plotting but left the original cumulative counts unchanged, so a few series show sudden drops in what should otherwise be a non-decreasing cumulative count. I did not investigate each case individually to determine whether it was a data entry error or a valid correction.
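
A minimal sketch of this zeroing step, on a made-up cumulative series with one downward revision. Using `pmax()` is one straightforward way to do it and is not necessarily how the project code does it.

```r
# Toy cumulative series with a downward revision on day 4 (values made up)
cumulative <- c(100, 120, 150, 140, 160)

# Raw daily differences: the revision shows up as a negative "new cases" value
new_cases_raw <- c(NA, diff(cumulative))

# Zero the negatives for plotting; NA on the first day stays NA
new_cases <- pmax(new_cases_raw, 0)

new_cases_raw  # NA 20 30 -10 20
new_cases      # NA 20 30   0 20
```

The cumulative series itself is left untouched, which is why the drop on day 4 would still be visible in a plot of cumulative counts.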

Exploratory Data Analysis

Total Counts

Status	Total
Confirmed	555,455,891
Fatal	6,351,193

Latest Country Statistics

For barplots of the last day’s top countries and time series of the last 30 days, I’ve created an app.

Interact with the shiny app here.

Forecasting

IN PROGRESS…

Code Appendix

Fork or clone this GitHub repository with all files and code, including the code for the Shiny app.

# install any missing packages, then load them all
install_packages <- function(package) {
    # packages in the list that are not yet installed
    newpackage <- package[!(package %in% installed.packages()[, "Package"])]
    if (length(newpackage)) {
        suppressMessages(install.packages(newpackage, dependencies = TRUE))
    }
    # attach each package; returns a named logical vector of load results
    sapply(package, require, character.only = TRUE)
}

suppressPackageStartupMessages(
    install_packages(
        # list of packages
        c("kableExtra", "tidyverse")
    )
)

# read in preprocessed data
prep_data <- paste0("./data/", gsub("-", "", Sys.Date()), "_data.rds")
dfm <- readRDS(prep_data)

enriched_data <- paste0("./data/", gsub("-", "", Sys.Date()), "_enriched.rds")
merged <- readRDS(enriched_data)

# calculate number of countries and number of days in the time series
Ncountries <- length(unique(dfm$Country))
Ndays <- length(unique(dfm$Date))

# values embedded inline in the text above
nrow(dfm)    # number of rows
ncol(dfm)    # number of columns
Sys.Date()   # date of the report
Ndays
Ncountries

# top rows for Brazil and US
brazil_us <- rbind(head(merged[merged$Country == "Brazil", ], 3), 
                   head(merged[merged$Country == "US", ], 3))

kable(x = brazil_us, table.attr = "style='width:100%;'") %>%
    kable_classic(full_width = TRUE, position = "center")

# subset to last date and calculate world totals
current_data <- data.frame(
    merged %>%
    select(Country, Status, Date, Cumulative_Count) %>%
    filter(Date == max(merged$Date)) %>%
    arrange(Status, desc(Cumulative_Count))
)


world_totals <- data.frame(
    current_data %>%
    group_by(Status) %>%
    summarise('Total'= sum(Cumulative_Count))
)

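# format = "d" prints whole numbers and avoids scientific notation for large totals
world_totals$Total <- formatC(world_totals$Total, format = "d", big.mark = ",")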

kable(world_totals) %>%
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

# uncomment to extract the R code from the Rmd into Rcode.R;
# documentation = 1 keeps the chunk headers but drops the text commentary
#library(knitr)
#options(knitr.purl.inline = TRUE)
#purl("CoronavirusDataAnalysis.Rmd", output = "Rcode.R", documentation = 1)

Marcelo Sanches

07/11/2022
