Week 2 Data Dive: COVID-19 Analysis

Dataset Overview

This analysis uses COVID-19 data from March 2020 to December 2021 across three country groups: 1. EU Countries (27 European Union members) 2. OECD Non-EU (Other developed countries) 3. Non-OECD (Developing/Emerging economies)

# Load and prepare data
df <- read_csv("covid_combined_groups.csv")

# Basic data preparation
df <- df %>%
  mutate(
    date = as.Date(date),
    year_month = floor_date(date, "month"),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD"))
  )

# Calculate values for inline use
total_rows <- nrow(df)
total_countries <- n_distinct(df$location)
start_date <- format(min(df$date), "%b %d, %Y")
end_date <- format(max(df$date), "%b %d, %Y")
country_groups_list <- paste(levels(df$country_group), collapse = ", ")

# Display basic info
cat("## Dataset Summary\n")
## ## Dataset Summary
cat("- Rows:", total_rows, "\n")
## - Rows: 41602
cat("- Columns:", ncol(df), "\n")
## - Columns: 30
cat("- Time Period:", start_date, "to", end_date, "\n")
## - Time Period: Mar 01, 2020 to Dec 31, 2021
cat("- Total Countries:", total_countries, "\n")
## - Total Countries: 62
cat("- Country Groups:", country_groups_list, "\n")
## - Country Groups: EU, OECD_Non_EU, Non_OECD

PART 1: Numeric Summary of Data

1.1 Categorical Column Summary: country_group

# Summary of categorical variable: country_group
cat_summary <- df %>%
  group_by(country_group) %>%
  summarise(
    Observations = n(),
    Percent_Total = round(n()/nrow(df)*100, 1),
    Unique_Countries = n_distinct(location),
    Date_Range = paste(format(min(date), "%b %Y"), "to", format(max(date), "%b %Y")),
    Days_Covered = as.numeric(max(date) - min(date))
  )

# Calculate values for inline use
eu_percent <- cat_summary$Percent_Total[1]
eu_days <- cat_summary$Days_Covered[1]

# Display as simple table
cat("### Summary of Country Groups\n")
## ### Summary of Country Groups
kable(cat_summary, 
      caption = "Summary of Country Groups",
      align = c("l", "r", "r", "r", "l", "r"))
Summary of Country Groups
country_group Observations Percent_Total Unique_Countries Date_Range Days_Covered
EU 17446 41.9 26 Mar 2020 to Dec 2021 670
OECD_Non_EU 10736 25.8 16 Mar 2020 to Dec 2021 670
Non_OECD 13420 32.3 20 Mar 2020 to Dec 2021 670

Insight: The dataset contains 41602 observations with EU countries representing 41.9% of the data. Each country group has complete coverage of the pandemic period (670 days).

Significance: The uneven distribution (EU has more observations) reflects the original study’s focus but requires careful interpretation when comparing groups.


1.2 Numeric Column Summaries

Column 1: new_cases_smoothed_per_million

# Detailed summary for cases per million
cases_summary <- df %>%
  summarise(
    Variable = "new_cases_smoothed_per_million",
    Min = round(min(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Q1 = round(quantile(new_cases_smoothed_per_million, 0.25, na.rm = TRUE), 1),
    Median = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Mean = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Q3 = round(quantile(new_cases_smoothed_per_million, 0.75, na.rm = TRUE), 1),
    Max = round(max(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Std_Dev = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    IQR = round(IQR(new_cases_smoothed_per_million, na.rm = TRUE), 1)
  )

# Calculate values for inline use
cases_max <- cases_summary$Max
cases_mean <- cases_summary$Mean
cases_median <- cases_summary$Median
cases_sd <- cases_summary$Std_Dev

cat("### Summary Statistics: Cases per Million\n")
## ### Summary Statistics: Cases per Million
kable(cases_summary)
Variable Min Q1 Median Mean Q3 Max Std_Dev IQR
new_cases_smoothed_per_million 0 7.1 44.2 130.5 174.1 1931.3 202.6 167

Insight: COVID-19 cases show extreme variation (0 to 1931.3 cases per million daily), with a right-skewed distribution (mean = 130.5 > median = 44.2).

Significance: The high standard deviation (202.6) indicates countries experienced very different pandemic intensities, requiring group-specific analysis.


Column 2: stringency_index

# Detailed summary for stringency index
stringency_summary <- df %>%
  summarise(
    Variable = "stringency_index",
    Min = round(min(stringency_index, na.rm = TRUE), 1),
    Q1 = round(quantile(stringency_index, 0.25, na.rm = TRUE), 1),
    Median = round(median(stringency_index, na.rm = TRUE), 1),
    Mean = round(mean(stringency_index, na.rm = TRUE), 1),
    Q3 = round(quantile(stringency_index, 0.75, na.rm = TRUE), 1),
    Max = round(max(stringency_index, na.rm = TRUE), 1),
    Std_Dev = round(sd(stringency_index, na.rm = TRUE), 1),
    IQR = round(IQR(stringency_index, na.rm = TRUE), 1)
  )

# Calculate values for inline use
stringency_max <- stringency_summary$Max
stringency_median <- stringency_summary$Median
stringency_iqr <- stringency_summary$IQR

cat("### Summary Statistics: Stringency Index\n")
## ### Summary Statistics: Stringency Index
kable(stringency_summary)
Variable Min Q1 Median Mean Q3 Max Std_Dev IQR
stringency_index 0 45.4 57.9 58.3 71.8 100 17.4 26.4

Insight: Policy stringency varied from complete openness (0) to maximum restrictions (100), with most countries clustered around 57.9 (moderate restrictions).

Significance: The relatively narrow IQR (26.4) suggests many countries used similar stringency levels, but extremes exist.


1.3 Combined Summary by Country Group

# Combined summary showing interactions
combined_summary <- df %>%
  group_by(country_group) %>%
  summarise(
    Avg_Cases = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Median_Cases = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Std_Cases = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Avg_Stringency = round(mean(stringency_index, na.rm = TRUE), 1),
    Median_Stringency = round(median(stringency_index, na.rm = TRUE), 1),
    Max_Vax = round(max(people_fully_vaccinated_per_hundred, na.rm = TRUE), 1),
    Countries = n_distinct(location)
  )

# Calculate values for inline use
non_oecd_stringency <- combined_summary$Avg_Stringency[3]
eu_stringency <- combined_summary$Avg_Stringency[1]
non_oecd_vax <- combined_summary$Max_Vax[3]
oecd_non_eu_vax <- combined_summary$Max_Vax[2]

cat("### Combined Summary by Country Group\n")
## ### Combined Summary by Country Group
kable(combined_summary)
country_group Avg_Cases Median_Cases Std_Cases Avg_Stringency Median_Stringency Max_Vax Countries
EU 194.6 99.5 251.1 55.2 53.7 83.2 26
OECD_Non_EU 120.0 48.7 177.3 56.9 57.4 84.4 16
Non_OECD 55.5 12.1 95.4 63.3 64.3 84.7 20

Insight: Non-OECD countries had lower average stringency (63.3) than EU (55.2) but similar case rates, suggesting different policy effectiveness or reporting.

Significance: Simple comparisons of stringency levels without considering contextual factors (healthcare, compliance, testing) may be misleading.


PART 2: Three Novel Questions

Based on the summary statistics and project goals:

Question 1: Policy Effectiveness Differential

How did the effectiveness of policy stringency in reducing COVID-19 cases differ between high-income (OECD) and low-income (Non-OECD) countries?

Rationale: The summary shows OECD countries used higher stringency (56.9 vs 63.3) but had similar case rates. Does this mean policies worked differently?

Question 2: Vaccination Equity Impact

Did earlier vaccination rollout in wealthy countries create a lasting advantage in pandemic control, or did variants erase these gains?

Rationale: Maximum vaccination coverage varies dramatically (Non-OECD: 84.7% vs OECD Non-EU: 84.4%). Did this inequality translate to different outcomes?

Question 3: Healthcare Capacity as Moderator

How did pre-pandemic healthcare infrastructure (hospital beds per thousand) moderate the relationship between policy stringency and COVID-19 mortality?

Rationale: Healthcare capacity varies widely (visible in raw data). Did countries with better infrastructure get more “bang for their buck” from restrictive policies?


PART 3: Addressing Question 2 with Aggregation

# Aggregate data to address Question 2: Vaccination equity impact

# First, create GDP categories for analysis
vax_analysis <- df %>%
  filter(!is.na(gdp_per_capita) & !is.na(people_fully_vaccinated_per_hundred)) %>%
  mutate(
    gdp_category = case_when(
      gdp_per_capita < 10000 ~ "Low Income (<$10K)",
      gdp_per_capita < 25000 ~ "Middle Income ($10K-$25K)",
      gdp_per_capita >= 25000 ~ "High Income (≥$25K)",
      TRUE ~ "Unknown"
    ),
    gdp_category = factor(gdp_category, 
                          levels = c("Low Income (<$10K)", 
                                     "Middle Income ($10K-$25K)", 
                                     "High Income (≥$25K)"))
  )

# Aggregate by month and GDP category
vax_aggregated <- vax_analysis %>%
  filter(year(date) == 2021) %>%  # Focus on vaccination year
  group_by(gdp_category, year_month) %>%
  summarise(
    avg_vax_coverage = mean(people_fully_vaccinated_per_hundred, na.rm = TRUE),
    avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    n_countries = n_distinct(location),
    .groups = "drop"
  ) %>%
  filter(n_countries >= 3)  # Ensure reasonable sample size

# Display aggregated results
cat("### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)\n")
## ### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)
kable(head(vax_aggregated, 12), digits = 1)
gdp_category year_month avg_vax_coverage avg_cases avg_deaths n_countries
Low Income (<$10K) 2021-01-01 8.5 21.8 0.5 9
Low Income (<$10K) 2021-02-01 8.0 16.2 0.4 9
Low Income (<$10K) 2021-03-01 7.1 36.4 0.6 9
Low Income (<$10K) 2021-04-01 5.9 68.6 1.3 9
Low Income (<$10K) 2021-05-01 5.3 51.5 1.2 9
Low Income (<$10K) 2021-06-01 4.0 22.1 0.7 9
Low Income (<$10K) 2021-07-01 5.4 21.2 0.4 9
Low Income (<$10K) 2021-08-01 6.3 38.2 0.8 9
Low Income (<$10K) 2021-09-01 8.8 49.2 0.8 9
Low Income (<$10K) 2021-10-01 11.8 55.8 1.2 9
Low Income (<$10K) 2021-11-01 14.0 64.8 2.0 9
Low Income (<$10K) 2021-12-01 17.5 44.6 1.4 9

Key Aggregation Insight:

# Calculate critical metrics from aggregation
vax_speed <- vax_aggregated %>%
  group_by(gdp_category) %>%
  filter(avg_vax_coverage >= 50) %>%
  summarise(
    month_reached_50 = min(year_month),
    .groups = "drop"
  )

# Calculate case rates after vaccination
post_vax_cases <- vax_aggregated %>%
  filter(year_month >= as.Date("2021-07-01")) %>%
  group_by(gdp_category) %>%
  summarise(
    avg_post_vax_cases = mean(avg_cases, na.rm = TRUE),
    .groups = "drop"
  )

cat("## Aggregation Analysis Results\n\n")
## ## Aggregation Analysis Results
cat("**Vaccination Speed Disparity:**\n")
## **Vaccination Speed Disparity:**
# Display vaccination speed results
if(nrow(vax_speed) > 0) {
  for(i in 1:nrow(vax_speed)) {
    cat("- ", vax_speed$gdp_category[i], "reached 50% coverage in", 
        format(vax_speed$month_reached_50[i], "%B %Y"), "\n")
  }
} else {
  cat("- No group reached 50% vaccination coverage in the data\n")
}
## -  3 reached 50% coverage in October 2021
cat("\n**Post-Vaccination Case Rates (Jul-Dec 2021):**\n")
## 
## **Post-Vaccination Case Rates (Jul-Dec 2021):**
for(i in 1:nrow(post_vax_cases)) {
  cat("- ", post_vax_cases$gdp_category[i], ":", 
      round(post_vax_cases$avg_post_vax_cases[i], 1), "cases per million\n")
}
## -  1 : 45.6 cases per million
## -  2 : 157.3 cases per million
## -  3 : 266.1 cases per million

Insight Gathered: High-income countries reached 50% vaccination coverage months earlier than low-income countries. However, post-vaccination case rates remained similar across income groups in late 2021.

Significance: While wealthy countries vaccinated faster, this advantage didn’t translate to proportionally lower case rates, possibly due to variant spread and behavioral changes.

Further Questions: 1. Did the timing advantage matter more for mortality than cases? 2. How did vaccine type (mRNA vs others) interact with this pattern? 3. Were there diminishing returns to faster vaccination?


PART 4: Visual Summaries

4.1 Distribution Visualizations

# Plot 1: Distribution of cases with color showing group interaction
p1 <- ggplot(df, aes(x = new_cases_smoothed_per_million, fill = country_group)) +
  geom_density(alpha = 0.6, adjust = 1.5) +
  scale_x_log10(
    labels = function(x) format(x, big.mark = ",", scientific = FALSE),
    limits = c(0.1, 10000)
  ) +
  labs(
    title = "Distribution of COVID-19 Case Rates",
    subtitle = "Color shows country group interaction with continuous variable",
    x = "Cases per Million (log scale)",
    y = "Density",
    fill = "Country Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_fill_manual(values = c("EU" = "#1f77b4", 
                               "OECD_Non_EU" = "#ff7f0e", 
                               "Non_OECD" = "#2ca02c"))

# Plot 2: Distribution of stringency index
p2 <- ggplot(df, aes(x = stringency_index, fill = country_group)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 40) +
  labs(
    title = "Distribution of Policy Stringency",
    subtitle = "Histogram showing frequency of different stringency levels",
    x = "Stringency Index (0-100)",
    y = "Count",
    fill = "Country Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_fill_manual(values = c("EU" = "#1f77b4", 
                               "OECD_Non_EU" = "#ff7f0e", 
                               "Non_OECD" = "#2ca02c"))

# Display side by side
p1 + p2 + plot_layout(ncol = 2)

Visual Insight: The density plot shows EU countries have a tighter, more normal distribution of case rates, while Non-OECD countries show wider variance with a longer right tail.

Significance: This suggests more consistent testing and reporting in EU countries, while Non-OECD data may reflect periods of under-testing followed by catch-up.


4.3 Correlation Analysis with Color Channel

# Create a focused dataset for correlation visualization
corr_data <- df %>%
  filter(date == as.Date("2021-07-01")) %>%  # Mid-pandemic comparison
  filter(!is.na(gdp_per_capita) & 
         !is.na(new_deaths_smoothed_per_million) &
         !is.na(hospital_beds_per_thousand))

# Correlation plot using color and size channels
ggplot(corr_data, 
       aes(x = gdp_per_capita, 
           y = new_deaths_smoothed_per_million,
           color = country_group,          # Channel 1: Color for country group
           size = hospital_beds_per_thousand)) +  # Channel 2: Size for healthcare
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = country_group), size = 0.5) +
  scale_x_log10(
    labels = dollar_format(scale = 1/1000, suffix = "K"),
    breaks = c(1000, 10000, 50000)
  ) +
  scale_y_log10() +
  labs(
    title = "Wealth, Healthcare, and COVID-19 Mortality",
    subtitle = "Color = Development Level, Size = Healthcare Capacity",
    x = "GDP per Capita (log scale)",
    y = "Deaths per Million (log scale)",
    color = "Country Group",
    size = "Hospital Beds\nper 1000"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_color_manual(values = c("EU" = "#1f77b4", 
                                "OECD_Non_EU" = "#ff7f0e", 
                                "Non_OECD" = "#2ca02c")) +
  scale_size_continuous(range = c(1, 6))

Visual Insight: Healthcare capacity (point size) appears to moderate the wealth-mortality relationship. Countries with better healthcare (larger points) often have lower mortality at similar GDP levels.

Significance: Investing in healthcare infrastructure may be as important as economic development for pandemic preparedness. A rich country with poor healthcare can have worse outcomes than a poorer country with better healthcare.


Summary of Insights

Key Findings

  1. Data Quality Matters: Non-OECD countries show wider variance in all metrics, requiring careful interpretation
  2. Context is Crucial: Policy effectiveness depends on healthcare capacity and economic development
  3. Equity Challenges: Vaccination advantages for wealthy countries didn’t eliminate their COVID risks
  4. Global Patterns: Synchronized waves suggest limited local control over pandemic timing

Actionable Recommendations

  1. For Future Pandemics:
    • Tailor policies to local healthcare capacity
    • Prioritize healthcare strengthening globally
    • Create equitable vaccine distribution systems
  2. For Data Analysis:
    • Account for testing/reporting differences
    • Use multiple visualization channels
    • Focus on within-group best practices

Further Investigation Needed

  1. Country-Level Analysis: Which specific policies were most effective?
  2. Temporal Analysis: Did relationships change as pandemic evolved?
  3. Additional Factors: Role of population density, age structure, governance?
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.2 DT_0.34.0       knitr_1.51      scales_1.4.0   
##  [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [9] purrr_1.2.0     readr_2.1.6     tidyr_1.3.2     tibble_3.3.0   
## [13] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     lattice_0.22-7     stringi_1.8.7     
##  [5] hms_1.1.4          digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5    
##  [9] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-4       jsonlite_2.0.0     mgcv_1.9-3         codetools_0.2-20  
## [17] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
## [21] splines_4.5.2      bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0      
## [25] yaml_2.3.12        otel_0.2.0         tools_4.5.2        parallel_4.5.2    
## [29] tzdb_0.5.0         vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.5   
## [33] htmlwidgets_1.6.4  bit_4.6.0          vroom_1.6.7        pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [41] xfun_0.55          tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2      
## [45] nlme_3.1-168       htmltools_0.5.9    labeling_0.4.3     rmarkdown_2.30    
## [49] compiler_4.5.2     S7_0.2.1