1. Introduction

In December 2019, a novel coronavirus, the cause of COVID-19, was identified in Wuhan, China. By March 11, 2020, the World Health Organization (WHO) had declared a global pandemic. This crisis underscored an urgent need for robust data tracking to monitor the virus's spread and impact. The Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) responded by compiling a near real-time record of confirmed cases, deaths, and recoveries. This dataset quickly became a cornerstone for understanding the pandemic's evolution until its final update on March 10, 2023, and it now stands as a valuable historical record. This R project uses this rich dataset to analyze patterns and trends in the COVID-19 pandemic at the global level.

2. Research Questions

This report aims to answer the following questions:

  1. When did the COVID-19 outbreak become a global pandemic?
  2. What is the overall trend in cases and deaths, globally?
  3. Where did COVID-19 have the biggest impact?
  4. How did COVID-19 affect the financial markets?

In addition, we will explore a counterfactual scenario to understand the potential trajectory of COVID-19 in the absence of public health interventions. Modeling such a scenario is inherently challenging, as the interventions themselves shape the data we observe. However, by utilizing observations from the early days of the pandemic, we can attempt a predictive modeling exercise. This involves projecting forward under the assumption that the trends observed during the initial phase would have continued without significant alterations due to public health responses.

3. Environment Setup

First, we import the packages needed for the analysis.

knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)   # core data wrangling (dplyr, tidyr, purrr, readr, stringr, ggplot2, lubridate)
library(lubridate)   # date handling (attached by tidyverse >= 2.0; listed explicitly for clarity)
library(ggplot2)     # plotting (also attached by tidyverse)
library(plotly)      # interactive plots
library(dplyr)       # data manipulation (also attached by tidyverse)
library(knitr)       # report rendering
library(scales)      # axis label formatting
library(zoo)         # rolling averages
library(gridExtra)   # arranging multiple plots
library(grid)        # low-level graphics (textGrob, gpar)
library(quantmod)    # downloading financial market data
library(forecast)    # ARIMA modeling and forecasting
library(maps)        # world map polygons
library(DT)          # interactive data tables

4. Load Data

The JHU CSSE COVID-19 dataset provides valuable time series data on the progression of the pandemic. We will download the global time series tables for confirmed cases and deaths directly from their GitHub repository. Unfortunately, this dataset has historically had inconsistencies in the reporting of recovered cases, particularly early in the pandemic and for certain regions, so we will exclude that subset of the data from our analysis.

url_base <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/"
filenames <- c("time_series_covid19_confirmed_global.csv", "time_series_covid19_deaths_global.csv")

data_raw_list <- purrr::map(str_c(url_base, filenames), read_csv)
names(data_raw_list) <- str_remove(filenames, fixed(".csv"))

cases_raw <- data_raw_list$time_series_covid19_confirmed_global
deaths_raw <- data_raw_list$time_series_covid19_deaths_global

5. Data Cleaning & Transformation

The JHU CSSE time series data requires several transformations before analysis. The raw data is in a wide format, where each date is represented as a separate column, and once reshaped, the dates are stored as character strings rather than a proper date type. Thankfully, the cases and deaths tables share the same structure, which makes it easy to join them into a single data frame.

Step 1: Inspecting Raw Data

First, let’s take a look at the raw data to understand its structure.

head(cases_raw)

Step 2: Load and Clean Population Data

We’ll start by loading population data, which will be used later to calculate per capita metrics.

population_url_base  <- "https://github.com/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/"
population_url <- paste0(population_url_base, "UID_ISO_FIPS_LookUp_Table.csv?raw=true")
population <- read_csv(population_url)

population_cleaned <- population %>%
  select(Country_Region, Population) %>%
  distinct(Country_Region, .keep_all = TRUE) %>%
  mutate(Population = if_else(is.na(Population), 0, Population))

Step 3: Define Helper Function for Data Cleaning

Next, we’ll define a helper function to clean and reshape the COVID data from wide to long format.

clean_covid_data <- function(df, value_name) {
  df %>%
    select(-`Province/State`, -Lat, -Long) %>%
    group_by(`Country/Region`) %>%
    summarize(across(everything(), ~ sum(.x, na.rm = TRUE)), .groups = "drop") %>%
    pivot_longer(
      cols = -`Country/Region`,
      names_to = "Date",
      values_to = value_name
    ) %>%
    mutate(
      Date = mdy(Date),
      `Country/Region` = as.factor(`Country/Region`)
    ) %>%
    arrange(Date)
}

Step 4: Apply Function & Join Data

We’ll apply the cleaning function to both cases and deaths data, and then join them into a single data frame.

covid <- list(cases_raw, deaths_raw) %>%
  map2(c("Cases", "Deaths"), clean_covid_data) %>%
  reduce(left_join, by = c("Country/Region", "Date"))

Step 5: Calculate Daily Changes

Now, we’ll calculate new daily cases and deaths, ensuring there are no negative values.

covid <- covid %>%
  group_by(`Country/Region`) %>%
  mutate(
    New_Cases = pmax(Cases - lag(Cases, default = 0), 0, na.rm = TRUE),
    New_Deaths = pmax(Deaths - lag(Deaths, default = 0), 0, na.rm = TRUE)
  ) %>%
  ungroup()

Step 6: Merge Population Data & Calculate Per Capita Metrics

Finally, we’ll merge the population data and compute incidence and mortality rates.

covid <- covid %>%
  left_join(population_cleaned, by = c("Country/Region" = "Country_Region")) %>%
  filter(Population > 0) %>%
  mutate(
    Incidence_Rate = (Cases / Population) * 100000,
    Mortality_Rate = (Deaths / Population) * 100000
  )

Step 7: Summarize Transformed Data

We can now summarize our transformed dataset and get some high-level insights.

start_date <- min(covid$Date, na.rm = TRUE)
end_date <- max(covid$Date, na.rm = TRUE)

summary_data <- covid %>%
  filter(Date == end_date) %>%
  summarize(
    total_cases = sum(Cases, na.rm = TRUE),
    total_deaths = sum(Deaths, na.rm = TRUE)
  ) %>%
  mutate(
    start_date = start_date,
    end_date = end_date,
    num_countries = n_distinct(covid$`Country/Region`),
    num_days = n_distinct(covid$Date)
  )

print(summary_data)
## # A tibble: 1 × 6
##   total_cases total_deaths start_date end_date   num_countries num_days
##         <dbl>        <dbl> <date>     <date>             <int>    <int>
## 1   676568017      6881787 2020-01-22 2023-03-09           196     1143

6. Visualizations

Question 1: When did the COVID-19 outbreak become a global pandemic?

In February 2020, the outbreak was largely confined to China. That changed in mid-March. As previously mentioned, the WHO declared COVID-19 a pandemic on March 11th, and this is clearly reflected in the data. If we visualize the first few weeks of the outbreak, we can pinpoint the exact date (March 15th) when the total number of cases outside China overtook the cases inside China, signaling that COVID-19 was now a global pandemic; we verify this date programmatically after the plot below.

china_data <- covid %>%
  filter(`Country/Region` == "China") %>%
  group_by(Date) %>%
  summarise(China = sum(Cases, na.rm = TRUE))

world_data <- covid %>%
  filter(`Country/Region` != "China") %>%
  group_by(Date) %>%
  summarise(Rest_of_World = sum(Cases, na.rm = TRUE))

combined_data <- full_join(china_data, world_data, by = "Date") %>%
  filter(Date >= min(covid$Date) & Date <= as.Date("2020-03-20"))

pandemic_plot <- ggplot(combined_data) +
  geom_line(aes(x = Date, y = China, color = "China"), linewidth = 1) +
  geom_line(aes(x = Date, y = Rest_of_World, color = "Rest of World"), linewidth = 1) +
  labs(title = "COVID-19: From Epidemic to Pandemic",
       x = "Date",
       y = "Cumulative Cases",
       color = "Region") +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  geom_vline(xintercept = as.numeric(as.Date("2020-01-30")), 
             color = "purple", 
             linetype = "dotted", 
             size = 0.3) +
  geom_vline(xintercept = as.numeric(as.Date("2020-02-13")), 
             color = "purple", 
             linetype = "dotted", 
             size = 0.3) +
  geom_vline(xintercept = as.numeric(as.Date("2020-03-11")), 
             color = "purple", 
             linetype = "dotted", 
             size = 0.3) +
  annotate("text", x = as.Date("2020-01-30"), 
           y = max(combined_data$China, combined_data$Rest_of_World, na.rm = TRUE) * 0.95,
           label = "WHO Declares\nEmergency", size = 3, angle = 90, vjust = -0.5, hjust = 1) +
  annotate("text", x = as.Date("2020-02-13"), 
           y = max(combined_data$China, combined_data$Rest_of_World, na.rm = TRUE) * 0.95,
           label = "China Reports\nSurge", size = 3, angle = 90, vjust = -0.5, hjust = 1) +
  annotate("text", x = as.Date("2020-03-11"), 
           y = max(combined_data$China, combined_data$Rest_of_World, na.rm = TRUE) * 0.95,
           label = "WHO Declares\nPandemic", size = 3, angle = 90, vjust = -0.5, hjust = 1)

ggplotly(pandemic_plot)
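
To verify the crossover date cited above, we can look up the first day on which cases outside China exceeded cases inside China. This is a minimal check using the combined_data frame built in the chunk above (it only covers the plotted window ending March 20th):

# First date on which cases outside China exceeded cases inside China
crossover_date <- combined_data %>%
  filter(Rest_of_World > China) %>%
  slice_min(Date, n = 1) %>%
  pull(Date)

crossover_date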

Question 2: What is the overall trend in cases and deaths, globally?

The global trend in COVID-19 cases and deaths reveals a complex and dynamic progression of the pandemic. Initially, both cases and deaths grew exponentially; over time, growth transitioned to a slower rate of increase, as evidenced by the flattening of the curves. This change likely reflects the impact of public health measures, increased population immunity, and widespread vaccination efforts. The wide gap between the number of cases and the number of deaths reflects the virus's relatively low case fatality rate, and death counts stabilized earlier and more markedly than case counts, suggesting improvements in treatment protocols and the protective effects of vaccines.
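
As a rough check on that claim, we can compute a naive global case fatality ratio by dividing cumulative deaths by cumulative cases on the final reporting date. This is only a crude estimate: it ignores reporting lags and under-ascertainment of cases.

# Naive global case fatality ratio at the end of the series
covid %>%
  filter(Date == max(Date)) %>%
  summarize(CFR = sum(Deaths, na.rm = TRUE) / sum(Cases, na.rm = TRUE))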

# Summarize global data
global_data <- covid %>%
  group_by(Date) %>%
  summarize(
    Cases = sum(Cases, na.rm = TRUE),
    Deaths = sum(Deaths, na.rm = TRUE)
  ) %>%
  pivot_longer(cols = c(Cases, Deaths), names_to = "Metric", values_to = "Count")

# Confirmed cases with linear scale
plot_cases_linear <- ggplot(global_data %>% filter(Metric == "Cases"), aes(x = Date, y = Count, color = Metric)) +
  geom_line(color = "firebrick") +
  labs(
    subtitle = "Cases with Linear Scale",
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

# Cases with logarithmic scale
plot_cases_log <- ggplot(global_data %>% filter(Metric == "Cases"), aes(x = Date, y = Count, color = Metric)) +
  geom_line(color = "firebrick") +
  labs(
    subtitle = "Cases with Logarithmic Scale",
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_log10(labels = label_number(scale_cut = cut_short_scale()))

# Deaths with linear scale
plot_deaths_linear <- ggplot(global_data %>% filter(Metric == "Deaths"), aes(x = Date, y = Count, color = Metric)) +
  geom_line(color = "steelblue") +
  labs(
    subtitle = "Deaths with Linear Scale",
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale()))

# Deaths with logarithmic scale
plot_deaths_log <- ggplot(global_data %>% filter(Metric == "Deaths"), aes(x = Date, y = Count, color = Metric)) +
  geom_line(color = "steelblue") +
  labs(
    subtitle = "Deaths with Logarithmic Scale",
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_y_log10(labels = label_number(scale_cut = cut_short_scale()))

# Arrange plots in a grid
grid.arrange(
  plot_cases_linear, plot_cases_log, plot_deaths_linear, plot_deaths_log,
  ncol = 2,
  top = textGrob("Global Trend of Confirmed COVID-19 Cases & Deaths", gp = gpar(fontsize = 14, fontface = "bold"))
)

The most substantial growth in both cases and deaths occurred during 2020-2021, with a marked deceleration in subsequent years. This trend provides insights into the pandemic’s progression and the impact of interventions. However, it is important to consider potential data limitations due to under-reporting or regional differences in testing and reporting practices.
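
To quantify that deceleration, we can total the new cases and deaths by calendar year, a simple aggregation of the New_Cases and New_Deaths columns computed earlier:

# Global new cases and deaths by calendar year
covid %>%
  mutate(Year = year(Date)) %>%
  group_by(Year) %>%
  summarize(
    New_Cases = sum(New_Cases, na.rm = TRUE),
    New_Deaths = sum(New_Deaths, na.rm = TRUE)
  )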

# Summarize weekly data: aggregate to global daily totals, smooth with a
# 7-day rolling average, then average within each week
weekly_data <- covid %>%
  group_by(Date) %>%
  summarize(New_Cases = sum(New_Cases, na.rm = TRUE),
            New_Deaths = sum(New_Deaths, na.rm = TRUE)) %>%
  mutate(Cases_7Day_Avg = rollmean(New_Cases, k = 7, fill = NA, align = "right"),
         Deaths_7Day_Avg = rollmean(New_Deaths, k = 7, fill = NA, align = "right")) %>%
  group_by(Week = floor_date(Date, unit = "week")) %>%
  summarize(Cases_7Day_Avg = mean(Cases_7Day_Avg, na.rm = TRUE),
            Deaths_7Day_Avg = mean(Deaths_7Day_Avg, na.rm = TRUE))

# Plot weekly new cases
plot_cases_avg <- ggplot(weekly_data, aes(x = Week, y = Cases_7Day_Avg)) +
  geom_line(color = "orchid3", size = 1) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Confirmed Cases",
    x = "Week",
    y = "Count"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))

# Plot weekly new deaths
plot_deaths_avg <- ggplot(weekly_data, aes(x = Week, y = Deaths_7Day_Avg)) +
  geom_line(color = "steelblue", size = 1) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "black", linetype = "dashed") +
  labs(
    title = "Confirmed Deaths",
    x = "Week",
    y = "Count"
  ) +
  theme_minimal() +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale()))

grid.arrange(
  plot_cases_avg, plot_deaths_avg,
  ncol = 2,
  top = textGrob("7-Day Rolling Averages", gp = gpar(fontsize = 14, fontface = "bold"))
)

The analysis of weekly data further underscores the dynamic nature of the pandemic, with several distinct peaks indicating waves of increased transmission. The initial rise in cases was followed by a series of surges, each reaching higher levels than the previous one, suggesting periods of intensified spread possibly due to new variants or changes in public health measures. The most pronounced peaks occurred around late 2021 and early 2022, highlighting times of heightened transmission. Following these peaks, there was a noticeable decline, suggesting effective interventions or increased immunity in the population. However, the persistence of smaller peaks indicates ongoing transmission and the potential for future surges.
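
We can locate those peaks directly by pulling the weeks with the highest rolling averages from weekly_data. This is a quick lookup rather than a formal wave-detection method:

# Weeks with the highest 7-day average of new cases
weekly_data %>%
  slice_max(Cases_7Day_Avg, n = 5)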

Overall, the data reflects the impact of various factors such as public health responses, vaccination efforts, and viral mutations, with periods of rapid increase followed by declines. These observations provide valuable insights into the pandemic’s progression and the effectiveness of interventions, while also highlighting the need for continued vigilance and adaptation to changing circumstances.

Question 3: Where did COVID-19 have the biggest impact?

This question can have different answers depending on how you look at the data. If you look at the sheer total number of cases in each country, then many of the largest countries in the world dominate the top ten. Specifically, the United States has more than double the total cases compared to India.

Total Cases by Country

Let’s visualize the top 10 countries with the highest total cases.

top_ten_infected <- covid %>%
  group_by(`Country/Region`) %>%
  summarise(Total_Cases = max(Cases, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  top_n(10, Total_Cases)

infected_plot <- ggplot(top_ten_infected,
                        aes(y = reorder(`Country/Region`, Total_Cases),
                            x = Total_Cases,
                            fill = `Country/Region`)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Most Infected Countries", y = "Country", x = "Total Cases") +
  theme_minimal() +
  theme(axis.text.y = element_text(hjust = 1), legend.position = "none") +
  scale_fill_brewer(palette = "Set3") +
  scale_x_continuous(labels = label_number(scale_cut = cut_short_scale())) 

ggplotly(infected_plot)
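
As a quick sanity check on the "more than double" claim, we can take the ratio of the two largest entries in the top_ten_infected table from the chunk above:

# Ratio of US total cases to India total cases
us_cases <- top_ten_infected %>% filter(`Country/Region` == "US") %>% pull(Total_Cases)
india_cases <- top_ten_infected %>% filter(`Country/Region` == "India") %>% pull(Total_Cases)
us_cases / india_cases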

Incidence Rate Analysis

Next, we’ll examine the incidence rate, which is the number of cases per 100k population. This metric highlights smaller countries that lead in terms of cases relative to their population size.

# Identify top 10 countries by maximum incidence rate
top_ten_countries_incidence <- covid %>%
  group_by(`Country/Region`) %>%
  summarise(Max_Incidence_Rate = max(`Incidence_Rate`, na.rm = TRUE)) %>%
  arrange(desc(Max_Incidence_Rate)) %>%
  slice(1:10)

# Calculate average incidence rate for top 10 countries by incidence rate
top_ten_countries_incidence <- top_ten_countries_incidence %>%
  left_join(
    covid %>%
      group_by(`Country/Region`) %>%
      summarise(Avg_Incidence_Rate = mean(`Incidence_Rate`, na.rm = TRUE)),
    by = "Country/Region"
  )

# Display the table
datatable(top_ten_countries_incidence, options = list(pageLength = 10))

# Identify top 10 countries by total recorded cases
top_ten_countries_cases <- covid %>%
  group_by(`Country/Region`) %>%
  summarise(Total_Cases = max(Cases, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  slice(1:10) %>%
  pull(`Country/Region`)

# Calculate average incidence rate over time for top 10 countries by incidence rate
top_ten_data_incidence <- covid %>%
  filter(`Country/Region` %in% top_ten_countries_incidence$`Country/Region`) %>%
  group_by(Date) %>%
  summarise(Avg_Incidence_Rate_Top10_Incidence = mean(`Incidence_Rate`, na.rm = TRUE))

# Calculate average incidence rate over time for top 10 countries by total cases
top_ten_data_cases <- covid %>%
  filter(`Country/Region` %in% top_ten_countries_cases) %>%
  group_by(Date) %>%
  summarise(Avg_Incidence_Rate_Top10_Cases = mean(`Incidence_Rate`, na.rm = TRUE))

# Calculate average incidence rate over time for the rest of the countries
rest_data <- covid %>%
  filter(!`Country/Region` %in% c(top_ten_countries_incidence$`Country/Region`, top_ten_countries_cases)) %>%
  group_by(Date) %>%
  summarise(Avg_Incidence_Rate_Rest = mean(`Incidence_Rate`, na.rm = TRUE))

# Combine datasets
combined_data <- top_ten_data_incidence %>%
  left_join(top_ten_data_cases, by = "Date") %>%
  left_join(rest_data, by = "Date")

# Create visualization
incidence_line_plot <- ggplot(combined_data, aes(x = Date)) +
  geom_line(aes(y = Avg_Incidence_Rate_Top10_Incidence, color = "Top 10 by Incidence Rate"), linewidth = 1) +
  geom_line(aes(y = Avg_Incidence_Rate_Top10_Cases, color = "Top 10 by Total Cases"), linewidth = 1, linetype = "dotdash") +
  geom_line(aes(y = Avg_Incidence_Rate_Rest, color = "Rest of the World"), linewidth = 1, linetype = "dashed") +
  labs(title = "Average Incidence Rate Over Time", x = "Date", y = "Average Incidence Rate (Cases per 100k)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels = scales::label_number(scale_cut = scales::cut_short_scale())) +
  scale_color_manual(values = c("Top 10 by Incidence Rate" = "steelblue2",
                                "Top 10 by Total Cases" = "chartreuse4",
                                "Rest of the World" = "firebrick3" )) +
  guides(color = guide_legend(title = "Legend"))

ggplotly(incidence_line_plot)

Total Deaths by Country

Let’s explore the total number of deaths, where the United States again dominates, followed by other large nations like Brazil, India, and Russia.

country_deaths <- covid %>%
  group_by(region = `Country/Region`) %>%
  summarize(
    total_deaths = max(Deaths, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  filter(!is.na(total_deaths))

world_map_deaths <- map_data("world") %>%
  mutate(region = case_when(region == "USA" ~ "US", TRUE ~ as.character(region))) %>%
  left_join(country_deaths, by = "region")

country_coords <- world_map_deaths %>%
  filter(!is.na(total_deaths)) %>%
  group_by(region) %>%
  summarize(
    Long = mean(long, na.rm = TRUE),
    Lat = mean(lat, na.rm = TRUE),
    total_deaths = first(total_deaths),
    .groups = "drop"
  ) %>%
  filter(total_deaths > -Inf)

top_ten_countries <- country_coords %>%
  arrange(desc(total_deaths)) %>%
  slice(1:10)

bubble_map <- ggplot() +
  geom_polygon(data = map_data("world"),
               aes(x = long, y = lat, group = group),
               fill = "grey90",
               color = "white",
               linewidth = 0.1) +
  geom_point(
    data = top_ten_countries,
    aes(x = Long, y = Lat, size = total_deaths),
    shape = 21,
    fill = NA,
    color = "black",
    stroke = 1
  ) +
  geom_point(
    data = country_coords,
    aes(x = Long, y = Lat, size = total_deaths, color = total_deaths,
        text = paste(region, "<br>Total Deaths:", scales::comma(total_deaths))),
    alpha = 0.7
  ) +
  scale_size_continuous(range = c(1, 15),
                        name = "Total Deaths",
                        labels = scales::comma,
                        guide = "none") +
    scale_color_gradient2(low = "yellow", mid = "blue", high = "red",
                          name = "Total Deaths", labels = scales::comma) +
  theme_minimal() +
  labs(title = "Total COVID-19 Deaths per Country", x = "", y = "") +
  coord_fixed(1.3)

ggplotly(bubble_map, tooltip = "text")

Mortality Analysis

Finally, we’ll analyze mortality rates, which are crucial for understanding the burden of COVID-19. Differences in mortality can be influenced by demographics, healthcare system characteristics, and other factors. We’ll visualize the mortality rate across countries, highlighting those with more than 100 deaths per 100k population.

mortality <- covid %>%
  group_by(`Country/Region`) %>%
  summarise(
    Total_Deaths = max(Deaths, na.rm = TRUE),
    Population = max(Population, na.rm = TRUE)
  ) %>%
  mutate(Mortality_Rate = (Total_Deaths / Population) * 100000) %>%
  filter(!is.nan(Mortality_Rate) & !is.infinite(Mortality_Rate) & Population > 0 & Total_Deaths > 0)

top_10_deaths <- mortality %>%
  arrange(desc(Total_Deaths)) %>%
  head(10)

annotation_pop <- max(mortality$Population) * 0.8

mortality_plot <- ggplot(mortality,
                         aes(x = Population,
                             y = Total_Deaths,
                             text = `Country/Region`)) +
  geom_point(color = "orange", alpha = 0.7) +
  geom_point(data = top_10_deaths,
             aes(x = Population, y = Total_Deaths),
             color = "darkorange", shape = 21,
             size = 3, stroke = 1, fill = NA) +
  geom_abline(intercept = log10(0.01) - 5, slope = 1, color = "grey", alpha = 0.4) +
  geom_abline(intercept = log10(0.1) - 5, slope = 1, color = "grey", alpha = 0.4) +
  geom_abline(intercept = log10(1) - 5, slope = 1, color = "grey", alpha = 0.4) +
  geom_abline(intercept = log10(10) - 5, slope = 1, color = "grey", alpha = 0.4) +
  geom_abline(intercept = log10(100) - 5, slope = 1, color = "grey", alpha = 0.4) +
  geom_abline(intercept = log10(1000) - 5, slope = 1, color = "grey", alpha = 0.4) +
  scale_x_log10(labels = label_number(scale_cut = cut_short_scale())) +
  scale_y_log10(labels = label_number(scale_cut = cut_short_scale())) +
  labs(
    title = "Worldwide Mortality",
    x = "Population",
    y = "Deaths"
  ) +
  theme_minimal() +
  theme(
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.line = element_line(color = "black"),
    axis.ticks = element_line(color = "black"),
    axis.text.x = element_text(color = "black"),
    axis.text.y = element_text(color = "black"),
    axis.title.x = element_text(color = "black"),
    axis.title.y = element_text(color = "black"),
    legend.position = "none"
  ) +
  annotate("text", x = annotation_pop, y = (0.01 / 100000) * annotation_pop,
           label = "0.01", color = "black", size = 3, angle = 45, hjust = 0) +
  annotate("text", x = annotation_pop, y = (0.1 / 100000) * annotation_pop,
           label = "0.1", color = "black", size = 3, angle = 45, hjust = 0) +
  annotate("text", x = annotation_pop, y = (1 / 100000) * annotation_pop,
           label = "1", color = "black", size = 3, angle = 45, hjust = 0) +
  annotate("text", x = annotation_pop, y = (10 / 100000) * annotation_pop,
           label = "10", color = "black", size = 3, angle = 45, hjust = 0) +
  annotate("text", x = annotation_pop, y = (100 / 100000) * annotation_pop,
           label = "100", color = "black", size = 3, angle = 45, hjust = 0) +
  annotate("text", x = annotation_pop, y = (1000 / 100000) * annotation_pop,
           label = "1000 per 100k", color = "black", size = 3, angle = 45, hjust = 0)

ggplotly(mortality_plot)

This analysis reveals a large number of countries reporting more than 100 deaths per 100k population, underscoring the significant impact of COVID-19 globally.
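
To make that statement concrete, we can count the countries in the mortality table whose cumulative mortality rate exceeds 100 deaths per 100k:

# Number of countries with more than 100 deaths per 100k population
mortality %>%
  filter(Mortality_Rate > 100) %>%
  nrow()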

Question 4: What impact did COVID-19 have on financial markets?

The market reaction during the first five months of the pandemic was significant. The S&P 500 experienced a sharp decline around the time the WHO declared COVID-19 a global pandemic, with index returns remaining below their starting point until July. Simultaneously, the VIX (Volatility Index) surged as the market fell, reflecting heightened investor fear. The nearly mirror-image plots for the two indexes suggest a strong negative linear correlation during this period: as volatility increased, the market decreased, and vice versa.

Market Data Preparation

First, we’ll download and prepare the financial data for analysis.

# Set options to display numbers in regular decimal format
options(scipen = 999)

# Define date range
start_date <- "2020-01-01"
end_date <- "2020-07-31"

# Download financial data
getSymbols("^GSPC", from = start_date, to = end_date, adjust = TRUE)
## [1] "GSPC"
getSymbols("^VIX", from = start_date, to = end_date)
## [1] "VIX"
# Calculate daily returns for S&P 500 and VIX
sp500_returns <- dailyReturn(Cl(GSPC))
vix_returns <- dailyReturn(Cl(VIX))

# Convert xts objects to data frames
sp500_returns_df <- data.frame(Date = index(sp500_returns), daily.returns = coredata(sp500_returns))
vix_returns_df <- data.frame(Date = index(vix_returns), daily.returns = coredata(vix_returns))

# Merge markets data with covid data
market_data <- sp500_returns_df %>%
  left_join(vix_returns_df, by = "Date", suffix = c(".SP500", ".VIX")) %>%
  rename(`S&P 500` = daily.returns.SP500, VIX = daily.returns.VIX) %>%
  fill(`S&P 500`, VIX, .direction = "down")

Cumulative Returns Calculation

Next, we’ll calculate the cumulative returns for both indexes to visualize their performance over time.

# Calculate cumulative returns for indexes
market_data$Cumulative_SP500 <- cumprod(1 + market_data$`S&P 500`) - 1
market_data$Cumulative_VIX <- cumprod(1 + market_data$VIX) - 1

# Find dates of highest and lowest CUMULATIVE S&P 500 returns
max_return_date <- market_data$Date[which.max(market_data$Cumulative_SP500)]
min_return_date <- market_data$Date[which.min(market_data$Cumulative_SP500)]

# Calculate the range of both variables for scaling
range_sp500 <- range(market_data$Cumulative_SP500, na.rm = TRUE)
range_vix <- range(market_data$Cumulative_VIX, na.rm = TRUE)

# Define a transformation function for the secondary axis
scale_factor <- diff(range_sp500) / diff(range_vix)

# Find first return for S&P 500
first_cumulative_sp500 <- market_data$Cumulative_SP500[1]

Visualization of Market Crash

We’ll plot the cumulative returns of the S&P 500 and VIX to illustrate the market crash and investor sentiment.

ggplot(market_data, aes(x = Date)) +
  # annotate() draws a single shaded rectangle; putting constants inside
  # geom_rect(aes(...)) would draw one overlapping rectangle per row,
  # compounding the alpha
  annotate("rect",
           xmin = as.Date(min_return_date),
           xmax = as.Date(max_return_date),
           ymin = -Inf, ymax = Inf,
           fill = "lightgrey", alpha = 0.3) +
  geom_line(aes(y = Cumulative_SP500, color = "S&P 500"),
            linewidth = 1) +
  geom_line(aes(y = Cumulative_VIX * scale_factor, color = "VIX"),
            linewidth = 1) +
  geom_vline(xintercept = as.numeric(as.Date(max_return_date)),
             linetype = "solid", color = "blueviolet", linewidth = 1) +
  geom_vline(xintercept = as.numeric(as.Date(min_return_date)),
             linetype = "solid", color = "blueviolet", linewidth = 1) +
  geom_hline(yintercept = first_cumulative_sp500,
             linetype = "solid", linewidth = 0.5) +
  scale_y_continuous(
    name = "Cumulative Return S&P 500",
    sec.axis = sec_axis(~ . / scale_factor, name = "Cumulative VIX (Scaled)")
  ) +
  scale_color_manual(name = "Index",
                     values = c("S&P 500" = "steelblue", "VIX" = "firebrick")) +
  labs(title = "COVID-19 Crash",
       x = "Date") +
  theme_minimal() +
  theme(
    axis.title.y.left = element_text(),
    axis.title.y.right = element_text()
  )

Correlation Analysis

Finally, we’ll calculate and visualize the correlation between the S&P 500 and VIX returns to quantify their relationship.

correlation <- cor(market_data$`S&P 500`, market_data$VIX, use = "complete.obs")

ggplot(market_data, aes(x = `S&P 500`, y = VIX)) +
  geom_point(color = "steelblue", alpha = 0.6) +
  geom_smooth(method = "lm", formula = "y ~ x", color = "firebrick", se = FALSE) +
  labs(title = "S&P 500 vs. VIX during the COVID-19 Outbreak",
       subtitle = paste("Correlation Coefficient:", round(correlation, 2)),
       x = "S&P 500 Daily Returns",
       y = "VIX Daily Returns") +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

A correlation coefficient of -0.73 indicates a strong negative correlation between the S&P 500 and VIX during the COVID-19 outbreak. This suggests that as market volatility increased, stock prices tended to decrease, highlighting the inverse relationship between these two variables.

7. Counterfactual Time Series Forecasting

Data Preparation

# Seed control for reproducibility
set.seed(123)

# Define cutoff date
cutoff_date <- ymd("2020-12-31")

# Filter data up to cutoff date
early_pandemic_data_new_cases <- covid %>%
  filter(Date <= cutoff_date)

# Aggregate daily new cases globally up to the cutoff date
global_early_new_cases <- early_pandemic_data_new_cases %>%
  group_by(Date) %>%
  summarise(Total_New_Cases_Early = sum(New_Cases, na.rm = TRUE)) %>%
  ungroup()

# Aggregate actual daily new cases globally
global_actual_new_cases <- covid %>%
  group_by(Date) %>%
  summarise(Total_New_Cases_Actual = sum(New_Cases, na.rm = TRUE)) %>%
  ungroup()

Time Series Modeling

We create a time series object and fit an ARIMA model to forecast future values based on early pandemic data.

# Create a time series object for early global new cases
start_date_new_cases <- min(global_early_new_cases$Date)
global_new_cases_ts_early <- ts(global_early_new_cases$Total_New_Cases_Early,
                                start = decimal_date(start_date_new_cases),
                                frequency = 365.25)

# Fit an ARIMA model to daily new cases
fit_new_cases_arima_early <- auto.arima(global_new_cases_ts_early)
summary(fit_new_cases_arima_early)
## Series: global_new_cases_ts_early 
## ARIMA(0,1,5) with drift 
## 
## Coefficients:
##           ma1      ma2      ma3     ma4     ma5     drift
##       -0.6736  -0.1469  -0.1635  0.0402  0.2638  1899.827
## s.e.   0.0532   0.0672   0.0630  0.0862  0.0648  1023.915
## 
## sigma^2 = 3519175155:  log likelihood = -4266.69
## AIC=8547.39   AICc=8547.72   BIC=8574.27
## 
## Training set error measures:
##                   ME     RMSE      MAE       MPE     MAPE MASE       ACF1
## Training set 11.3939 58717.73 24817.06 -75.17429 82.55502  NaN 0.02564603

Forecasting

We define the forecast horizon and generate predictions using the ARIMA model.

# Determine final date in actual data
final_date_actual_new_cases <- max(global_actual_new_cases$Date)

# Define number of days to forecast beyond final actual date
forecast_beyond <- 365

# Calculate total forecast horizon relative to cutoff date
total_forecast_length <- as.integer(final_date_actual_new_cases - cutoff_date + forecast_beyond)

# Forecast daily new cases for entire horizon
forecast_new_cases_arima_no_response <- forecast(fit_new_cases_arima_early,
                                                   h = total_forecast_length)

Visualization

# Create data frame for forecast with confidence intervals
forecast_new_cases_df <- as.data.frame(forecast_new_cases_arima_no_response) %>%
  mutate(Date = seq(cutoff_date + days(1), by = "day", length.out = n())) %>%
  rename(Projected_New_Cases = `Point Forecast`,
         Lo80 = `Lo 80`,
         Hi80 = `Hi 80`,
         Lo95 = `Lo 95`,
         Hi95 = `Hi 95`) %>%
  select(Date, Projected_New_Cases, Lo80, Hi80, Lo95, Hi95)

# Create a full date range for plot
full_date_range <- seq(min(global_actual_new_cases$Date),
                       max(forecast_new_cases_df$Date), by = "day")
plot_dates <- tibble(Date = full_date_range)

# Join actual cases to full date range
plot_data_actual <- plot_dates %>%
  left_join(global_actual_new_cases, by = "Date")

# Join projected cases and confidence intervals to the full date range
plot_data_extended <- plot_data_actual %>%
  left_join(forecast_new_cases_df, by = "Date")

# Create plot
ggplot(plot_data_extended, aes(x = Date)) +
  geom_ribbon(aes(ymin = Lo95, ymax = Hi95, fill = "95% Conf"), alpha = 0.2) +
  geom_ribbon(aes(ymin = Lo80, ymax = Hi80, fill = "80% Conf"), alpha = 0.4) +
  geom_line(aes(y = Total_New_Cases_Actual, color = "New Cases"), linewidth = 1) +
  geom_line(aes(y = Projected_New_Cases, color = "Projected Cases"), linewidth = 1, linetype = "dashed") +
  scale_color_manual(values = c("New Cases" = "steelblue2", "Projected Cases" = "darkorchid")) +
  labs(title = "Actual vs. Projected Cases (ARIMA, Extended Forecast)",
       x = "Date",
       y = "New Cases",
       color = "Legend") +
  scale_y_continuous(labels = label_number(scale_cut = cut_short_scale())) +
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y") +
  theme_minimal() +
  theme(legend.position = "top") +
  theme(plot.margin = margin(5, 15, 5, 5, "mm"))

The projected global daily new COVID-19 cases, along with the associated 80% and 95% confidence intervals, diverged significantly from the actual data after mid-2021. While the ARIMA model captured the initial upward trend to some extent, it failed to predict the pronounced and volatile waves of infection observed in reality. The projection suggests a smoother, more linear increase in daily cases, in stark contrast with the peaks and troughs of the actual data. Furthermore, the widening confidence intervals highlight the increasing uncertainty of longer-term predictions based solely on early pandemic dynamics. This discrepancy underscores the substantial impact of factors absent from the training data, such as public health interventions and the emergence of new variants, which fundamentally altered the pandemic's progression and introduced complexities beyond the model's initial assumptions.
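
One way to quantify that divergence is to compare cumulative actual and projected new cases over the window where both series exist. This is a rough summary only, since it uses the point forecasts and ignores the wide intervals shown above:

# Cumulative actual vs. projected new cases over the overlapping window
plot_data_extended %>%
  filter(!is.na(Total_New_Cases_Actual), !is.na(Projected_New_Cases)) %>%
  summarize(
    Actual_Total = sum(Total_New_Cases_Actual),
    Projected_Total = sum(Projected_New_Cases)
  )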

8. Conclusions

The analysis of the COVID-19 pandemic using the JHU CSSE dataset provides valuable insights into the global progression and impact of the virus. The data reveals the rapid transition from a localized outbreak to a global pandemic, highlighting the importance of timely public health interventions. The visualization of trends in cases and deaths underscores the dynamic nature of the pandemic, with distinct waves of transmission influenced by factors such as public health measures, vaccination efforts, and viral mutations. The examination of financial markets illustrates the significant economic impact of the pandemic, characterized by heightened volatility and investor fear. Additionally, the counterfactual forecasting exercise demonstrates the challenges of predicting pandemic trajectories without considering the influence of interventions and emerging variants. Overall, this report emphasizes the critical role of data-driven analysis in understanding and responding to global health crises.

9. Addressing Potential Biases

While the JHU CSSE dataset is a comprehensive resource for analyzing the COVID-19 pandemic, it is important to acknowledge potential biases and limitations inherent in the data. Reporting inconsistencies, particularly in the early stages of the pandemic, may affect the accuracy of case and death counts. Variations in testing capacity and reporting practices across regions can lead to under-reporting or misrepresentation of the true extent of the pandemic. Additionally, the exclusion of recovered cases from the analysis may limit the understanding of recovery trends. The counterfactual forecasting model, based solely on early pandemic data, does not account for subsequent interventions or the emergence of new variants, which significantly altered the pandemic’s trajectory. These factors highlight the need for cautious interpretation of the results and underscore the importance of considering multiple data sources and methodologies to obtain a comprehensive understanding of the pandemic’s impact.

10. Session Info

sessionInfo()
## R version 4.4.3 (2025-02-28 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] DT_0.33         maps_3.4.2.1    forecast_8.23.0 quantmod_0.4.26
##  [5] TTR_0.24.4      xts_0.14.1      gridExtra_2.3   zoo_1.8-14     
##  [9] scales_1.3.0    knitr_1.50      plotly_4.10.4   lubridate_1.9.4
## [13] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.4    
## [17] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.2  
## [21] tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6       xfun_0.51          bslib_0.9.0        htmlwidgets_1.6.4 
##  [5] lattice_0.22-6     tzdb_0.5.0         crosstalk_1.2.1    quadprog_1.5-8    
##  [9] vctrs_0.6.5        tools_4.4.3        generics_0.1.3     curl_6.2.2        
## [13] parallel_4.4.3     pkgconfig_2.0.3    Matrix_1.7-2       data.table_1.17.0 
## [17] RColorBrewer_1.1-3 lifecycle_1.0.4    farver_2.1.2       compiler_4.4.3    
## [21] munsell_0.5.1      htmltools_0.5.8.1  sass_0.4.9         yaml_2.3.10       
## [25] lazyeval_0.2.2     crayon_1.5.3       pillar_1.10.1      jquerylib_0.1.4   
## [29] cachem_1.1.0       nlme_3.1-168       fracdiff_1.5-3     tidyselect_1.2.1  
## [33] digest_0.6.37      stringi_1.8.7      splines_4.4.3      labeling_0.4.3    
## [37] tseries_0.10-58    fastmap_1.2.0      colorspace_2.1-1   cli_3.6.4         
## [41] magrittr_2.0.3     withr_3.0.2        bit64_4.6.0-1      timechange_0.3.0  
## [45] rmarkdown_2.29     httr_1.4.7         bit_4.6.0          nnet_7.3-20       
## [49] timeDate_4041.110  hms_1.1.3          urca_1.3-4         evaluate_1.0.3    
## [53] lmtest_0.9-40      viridisLite_0.4.2  mgcv_1.9-1         rlang_1.1.5       
## [57] Rcpp_1.0.14        glue_1.8.0         vroom_1.6.5        rstudioapi_0.17.1 
## [61] jsonlite_2.0.0     R6_2.6.1