Week 2 Data Dive: COVID-19 Analysis

Dataset Overview

This analysis uses COVID-19 data from March 2020 to December 2021 across three country groups: 1. EU Countries (27 European Union members) 2. OECD Non-EU (Other developed countries) 3. Non-OECD (Developing/Emerging economies)

# Load and prepare data
df <- read_csv("covid_combined_groups.csv")

# Basic data preparation
df <- df %>%
  mutate(
    date = as.Date(date),
    year_month = floor_date(date, "month"),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD"))
  )

# Calculate values for inline use
total_rows <- nrow(df)
total_countries <- n_distinct(df$location)
start_date <- format(min(df$date), "%b %d, %Y")
end_date <- format(max(df$date), "%b %d, %Y")
country_groups_list <- paste(levels(df$country_group), collapse = ", ")

# Display basic info
cat("## Dataset Summary\n")

## ## Dataset Summary

cat("- Rows:", total_rows, "\n")

## - Rows: 41602

cat("- Columns:", ncol(df), "\n")

## - Columns: 30

cat("- Time Period:", start_date, "to", end_date, "\n")

## - Time Period: Mar 01, 2020 to Dec 31, 2021

cat("- Total Countries:", total_countries, "\n")

## - Total Countries: 62

cat("- Country Groups:", country_groups_list, "\n")

## - Country Groups: EU, OECD_Non_EU, Non_OECD

PART 1: Numeric Summary of Data

1.1 Categorical Column Summary: `country_group`

# Summary of categorical variable: country_group
cat_summary <- df %>%
  group_by(country_group) %>%
  summarise(
    Observations = n(),
    Percent_Total = round(n()/nrow(df)*100, 1),
    Unique_Countries = n_distinct(location),
    Date_Range = paste(format(min(date), "%b %Y"), "to", format(max(date), "%b %Y")),
    Days_Covered = as.numeric(max(date) - min(date))
  )

# Calculate values for inline use
eu_percent <- cat_summary$Percent_Total[1]
eu_days <- cat_summary$Days_Covered[1]

# Display as simple table
cat("### Summary of Country Groups\n")

## ### Summary of Country Groups

kable(cat_summary, 
      caption = "Summary of Country Groups",
      align = c("l", "r", "r", "r", "l", "r"))

Summary of Country Groups
country_group	Observations	Percent_Total	Unique_Countries	Date_Range	Days_Covered
EU	17446	41.9	26	Mar 2020 to Dec 2021	670
OECD_Non_EU	10736	25.8	16	Mar 2020 to Dec 2021	670
Non_OECD	13420	32.3	20	Mar 2020 to Dec 2021	670

Insight: The dataset contains 41602 observations with EU countries representing 41.9% of the data. Each country group has complete coverage of the pandemic period (670 days).

Significance: The uneven distribution (EU has more observations) reflects the original study’s focus but requires careful interpretation when comparing groups.

1.2 Numeric Column Summaries

Column 1: `new_cases_smoothed_per_million`

# Detailed summary for cases per million
cases_summary <- df %>%
  summarise(
    Variable = "new_cases_smoothed_per_million",
    Min = round(min(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Q1 = round(quantile(new_cases_smoothed_per_million, 0.25, na.rm = TRUE), 1),
    Median = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Mean = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Q3 = round(quantile(new_cases_smoothed_per_million, 0.75, na.rm = TRUE), 1),
    Max = round(max(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Std_Dev = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    IQR = round(IQR(new_cases_smoothed_per_million, na.rm = TRUE), 1)
  )

# Calculate values for inline use
cases_max <- cases_summary$Max
cases_mean <- cases_summary$Mean
cases_median <- cases_summary$Median
cases_sd <- cases_summary$Std_Dev

cat("### Summary Statistics: Cases per Million\n")

## ### Summary Statistics: Cases per Million

kable(cases_summary)

Variable	Min	Q1	Median	Mean	Q3	Max	Std_Dev	IQR
new_cases_smoothed_per_million	0	7.1	44.2	130.5	174.1	1931.3	202.6	167

Insight: COVID-19 cases show extreme variation (0 to 1931.3 cases per million daily), with a right-skewed distribution (mean = 130.5 > median = 44.2).

Significance: The high standard deviation (202.6) indicates countries experienced very different pandemic intensities, requiring group-specific analysis.

Column 2: `stringency_index`

# Detailed summary for stringency index
stringency_summary <- df %>%
  summarise(
    Variable = "stringency_index",
    Min = round(min(stringency_index, na.rm = TRUE), 1),
    Q1 = round(quantile(stringency_index, 0.25, na.rm = TRUE), 1),
    Median = round(median(stringency_index, na.rm = TRUE), 1),
    Mean = round(mean(stringency_index, na.rm = TRUE), 1),
    Q3 = round(quantile(stringency_index, 0.75, na.rm = TRUE), 1),
    Max = round(max(stringency_index, na.rm = TRUE), 1),
    Std_Dev = round(sd(stringency_index, na.rm = TRUE), 1),
    IQR = round(IQR(stringency_index, na.rm = TRUE), 1)
  )

# Calculate values for inline use
stringency_max <- stringency_summary$Max
stringency_median <- stringency_summary$Median
stringency_iqr <- stringency_summary$IQR

cat("### Summary Statistics: Stringency Index\n")

## ### Summary Statistics: Stringency Index

kable(stringency_summary)

Variable	Min	Q1	Median	Mean	Q3	Max	Std_Dev	IQR
stringency_index	0	45.4	57.9	58.3	71.8	100	17.4	26.4

Insight: Policy stringency varied from complete openness (0) to maximum restrictions (100), with most countries clustered around 57.9 (moderate restrictions).

Significance: The relatively narrow IQR (26.4) suggests many countries used similar stringency levels, but extremes exist.

1.3 Combined Summary by Country Group

# Combined summary showing interactions
combined_summary <- df %>%
  group_by(country_group) %>%
  summarise(
    Avg_Cases = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Median_Cases = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Std_Cases = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
    Avg_Stringency = round(mean(stringency_index, na.rm = TRUE), 1),
    Median_Stringency = round(median(stringency_index, na.rm = TRUE), 1),
    Max_Vax = round(max(people_fully_vaccinated_per_hundred, na.rm = TRUE), 1),
    Countries = n_distinct(location)
  )

# Calculate values for inline use
non_oecd_stringency <- combined_summary$Avg_Stringency[3]
eu_stringency <- combined_summary$Avg_Stringency[1]
non_oecd_vax <- combined_summary$Max_Vax[3]
oecd_non_eu_vax <- combined_summary$Max_Vax[2]

cat("### Combined Summary by Country Group\n")

## ### Combined Summary by Country Group

kable(combined_summary)

country_group	Avg_Cases	Median_Cases	Std_Cases	Avg_Stringency	Median_Stringency	Max_Vax	Countries
EU	194.6	99.5	251.1	55.2	53.7	83.2	26
OECD_Non_EU	120.0	48.7	177.3	56.9	57.4	84.4	16
Non_OECD	55.5	12.1	95.4	63.3	64.3	84.7	20

Insight: Non-OECD countries had lower average stringency (63.3) than EU (55.2) but similar case rates, suggesting different policy effectiveness or reporting.

Significance: Simple comparisons of stringency levels without considering contextual factors (healthcare, compliance, testing) may be misleading.

PART 2: Three Novel Questions

Based on the summary statistics and project goals:

Question 1: Policy Effectiveness Differential

How did the effectiveness of policy stringency in reducing COVID-19 cases differ between high-income (OECD) and low-income (Non-OECD) countries?

Rationale: The summary shows OECD countries used higher stringency (56.9 vs 63.3) but had similar case rates. Does this mean policies worked differently?

Question 2: Vaccination Equity Impact

Did earlier vaccination rollout in wealthy countries create a lasting advantage in pandemic control, or did variants erase these gains?

Rationale: Maximum vaccination coverage varies dramatically (Non-OECD: 84.7% vs OECD Non-EU: 84.4%). Did this inequality translate to different outcomes?

Question 3: Healthcare Capacity as Moderator

How did pre-pandemic healthcare infrastructure (hospital beds per thousand) moderate the relationship between policy stringency and COVID-19 mortality?

Rationale: Healthcare capacity varies widely (visible in raw data). Did countries with better infrastructure get more “bang for their buck” from restrictive policies?

PART 3: Addressing Question 2 with Aggregation

# Aggregate data to address Question 2: Vaccination equity impact

# First, create GDP categories for analysis
vax_analysis <- df %>%
  filter(!is.na(gdp_per_capita) & !is.na(people_fully_vaccinated_per_hundred)) %>%
  mutate(
    gdp_category = case_when(
      gdp_per_capita < 10000 ~ "Low Income (<$10K)",
      gdp_per_capita < 25000 ~ "Middle Income ($10K-$25K)",
      gdp_per_capita >= 25000 ~ "High Income (≥$25K)",
      TRUE ~ "Unknown"
    ),
    gdp_category = factor(gdp_category, 
                          levels = c("Low Income (<$10K)", 
                                     "Middle Income ($10K-$25K)", 
                                     "High Income (≥$25K)"))
  )

# Aggregate by month and GDP category
vax_aggregated <- vax_analysis %>%
  filter(year(date) == 2021) %>%  # Focus on vaccination year
  group_by(gdp_category, year_month) %>%
  summarise(
    avg_vax_coverage = mean(people_fully_vaccinated_per_hundred, na.rm = TRUE),
    avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    n_countries = n_distinct(location),
    .groups = "drop"
  ) %>%
  filter(n_countries >= 3)  # Ensure reasonable sample size

# Display aggregated results
cat("### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)\n")

## ### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)

kable(head(vax_aggregated, 12), digits = 1)

gdp_category	year_month	avg_vax_coverage	avg_cases	avg_deaths	n_countries
Low Income (<$10K)	2021-01-01	8.5	21.8	0.5	9
Low Income (<$10K)	2021-02-01	8.0	16.2	0.4	9
Low Income (<$10K)	2021-03-01	7.1	36.4	0.6	9
Low Income (<$10K)	2021-04-01	5.9	68.6	1.3	9
Low Income (<$10K)	2021-05-01	5.3	51.5	1.2	9
Low Income (<$10K)	2021-06-01	4.0	22.1	0.7	9
Low Income (<$10K)	2021-07-01	5.4	21.2	0.4	9
Low Income (<$10K)	2021-08-01	6.3	38.2	0.8	9
Low Income (<$10K)	2021-09-01	8.8	49.2	0.8	9
Low Income (<$10K)	2021-10-01	11.8	55.8	1.2	9
Low Income (<$10K)	2021-11-01	14.0	64.8	2.0	9
Low Income (<$10K)	2021-12-01	17.5	44.6	1.4	9

Key Aggregation Insight:

# Calculate critical metrics from aggregation
vax_speed <- vax_aggregated %>%
  group_by(gdp_category) %>%
  filter(avg_vax_coverage >= 50) %>%
  summarise(
    month_reached_50 = min(year_month),
    .groups = "drop"
  )

# Calculate case rates after vaccination
post_vax_cases <- vax_aggregated %>%
  filter(year_month >= as.Date("2021-07-01")) %>%
  group_by(gdp_category) %>%
  summarise(
    avg_post_vax_cases = mean(avg_cases, na.rm = TRUE),
    .groups = "drop"
  )

cat("## Aggregation Analysis Results\n\n")

## ## Aggregation Analysis Results

cat("**Vaccination Speed Disparity:**\n")

## **Vaccination Speed Disparity:**

# Display vaccination speed results
if(nrow(vax_speed) > 0) {
  for(i in 1:nrow(vax_speed)) {
    cat("- ", vax_speed$gdp_category[i], "reached 50% coverage in", 
        format(vax_speed$month_reached_50[i], "%B %Y"), "\n")
  }
} else {
  cat("- No group reached 50% vaccination coverage in the data\n")
}

## -  3 reached 50% coverage in October 2021

cat("\n**Post-Vaccination Case Rates (Jul-Dec 2021):**\n")

## 
## **Post-Vaccination Case Rates (Jul-Dec 2021):**

for(i in 1:nrow(post_vax_cases)) {
  cat("- ", post_vax_cases$gdp_category[i], ":", 
      round(post_vax_cases$avg_post_vax_cases[i], 1), "cases per million\n")
}

## -  1 : 45.6 cases per million
## -  2 : 157.3 cases per million
## -  3 : 266.1 cases per million

Insight Gathered: High-income countries reached 50% vaccination coverage months earlier than low-income countries. However, post-vaccination case rates remained similar across income groups in late 2021.

Significance: While wealthy countries vaccinated faster, this advantage didn’t translate to proportionally lower case rates, possibly due to variant spread and behavioral changes.

Further Questions: 1. Did the timing advantage matter more for mortality than cases? 2. How did vaccine type (mRNA vs others) interact with this pattern? 3. Were there diminishing returns to faster vaccination?

PART 4: Visual Summaries

4.1 Distribution Visualizations

# Plot 1: Distribution of cases with color showing group interaction
p1 <- ggplot(df, aes(x = new_cases_smoothed_per_million, fill = country_group)) +
  geom_density(alpha = 0.6, adjust = 1.5) +
  scale_x_log10(
    labels = function(x) format(x, big.mark = ",", scientific = FALSE),
    limits = c(0.1, 10000)
  ) +
  labs(
    title = "Distribution of COVID-19 Case Rates",
    subtitle = "Color shows country group interaction with continuous variable",
    x = "Cases per Million (log scale)",
    y = "Density",
    fill = "Country Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_fill_manual(values = c("EU" = "#1f77b4", 
                               "OECD_Non_EU" = "#ff7f0e", 
                               "Non_OECD" = "#2ca02c"))

# Plot 2: Distribution of stringency index
p2 <- ggplot(df, aes(x = stringency_index, fill = country_group)) +
  geom_histogram(alpha = 0.6, position = "identity", bins = 40) +
  labs(
    title = "Distribution of Policy Stringency",
    subtitle = "Histogram showing frequency of different stringency levels",
    x = "Stringency Index (0-100)",
    y = "Count",
    fill = "Country Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_fill_manual(values = c("EU" = "#1f77b4", 
                               "OECD_Non_EU" = "#ff7f0e", 
                               "Non_OECD" = "#2ca02c"))

# Display side by side
p1 + p2 + plot_layout(ncol = 2)

Visual Insight: The density plot shows EU countries have a tighter, more normal distribution of case rates, while Non-OECD countries show wider variance with a longer right tail.

Significance: This suggests more consistent testing and reporting in EU countries, while Non-OECD data may reflect periods of under-testing followed by catch-up.

4.2 Time Trends Analysis

# Calculate monthly trends
monthly_trends <- df %>%
  group_by(country_group, year_month) %>%
  summarise(
    mean_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    mean_stringency = mean(stringency_index, na.rm = TRUE),
    .groups = "drop"
  )

# Plot time trends
ggplot(monthly_trends, aes(x = year_month, y = mean_cases, color = country_group)) +
  geom_line(size = 1) +
  geom_point(size = 1) +
  labs(
    title = "COVID-19 Case Trends by Country Group",
    subtitle = "7-day average cases per million population",
    x = "Month",
    y = "Average Cases per Million",
    color = "Country Group"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1)
  ) +
  scale_y_continuous(labels = comma_format()) +
  scale_color_manual(values = c("EU" = "#1f77b4", 
                                "OECD_Non_EU" = "#ff7f0e", 
                                "Non_OECD" = "#2ca02c"))

Visual Insight: All groups experienced synchronized waves (peaks in Apr 2020, Nov 2020, Apr 2021, Dec 2021), suggesting global transmission dynamics dominated local policies.

Significance: The synchronized patterns indicate that even with different policies, countries couldn’t avoid global pandemic waves, highlighting the need for international coordination.

4.3 Correlation Analysis with Color Channel

# Create a focused dataset for correlation visualization
corr_data <- df %>%
  filter(date == as.Date("2021-07-01")) %>%  # Mid-pandemic comparison
  filter(!is.na(gdp_per_capita) & 
         !is.na(new_deaths_smoothed_per_million) &
         !is.na(hospital_beds_per_thousand))

# Correlation plot using color and size channels
ggplot(corr_data, 
       aes(x = gdp_per_capita, 
           y = new_deaths_smoothed_per_million,
           color = country_group,          # Channel 1: Color for country group
           size = hospital_beds_per_thousand)) +  # Channel 2: Size for healthcare
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE, aes(group = country_group), size = 0.5) +
  scale_x_log10(
    labels = dollar_format(scale = 1/1000, suffix = "K"),
    breaks = c(1000, 10000, 50000)
  ) +
  scale_y_log10() +
  labs(
    title = "Wealth, Healthcare, and COVID-19 Mortality",
    subtitle = "Color = Development Level, Size = Healthcare Capacity",
    x = "GDP per Capita (log scale)",
    y = "Deaths per Million (log scale)",
    color = "Country Group",
    size = "Hospital Beds\nper 1000"
  ) +
  theme_minimal() +
  theme(
    legend.position = "right",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_color_manual(values = c("EU" = "#1f77b4", 
                                "OECD_Non_EU" = "#ff7f0e", 
                                "Non_OECD" = "#2ca02c")) +
  scale_size_continuous(range = c(1, 6))

Visual Insight: Healthcare capacity (point size) appears to moderate the wealth-mortality relationship. Countries with better healthcare (larger points) often have lower mortality at similar GDP levels.

Significance: Investing in healthcare infrastructure may be as important as economic development for pandemic preparedness. A rich country with poor healthcare can have worse outcomes than a poorer country with better healthcare.

Summary of Insights

Key Findings

Data Quality Matters: Non-OECD countries show wider variance in all metrics, requiring careful interpretation
Context is Crucial: Policy effectiveness depends on healthcare capacity and economic development
Equity Challenges: Vaccination advantages for wealthy countries didn’t eliminate their COVID risks
Global Patterns: Synchronized waves suggest limited local control over pandemic timing

Actionable Recommendations

For Future Pandemics:
- Tailor policies to local healthcare capacity
- Prioritize healthcare strengthening globally
- Create equitable vaccine distribution systems
For Data Analysis:
- Account for testing/reporting differences
- Use multiple visualization channels
- Focus on within-group best practices

Further Investigation Needed

Country-Level Analysis: Which specific policies were most effective?
Temporal Analysis: Did relationships change as pandemic evolved?
Additional Factors: Role of population density, age structure, governance?

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.2 DT_0.34.0       knitr_1.51      scales_1.4.0   
##  [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [9] purrr_1.2.0     readr_2.1.6     tidyr_1.3.2     tibble_3.3.0   
## [13] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     lattice_0.22-7     stringi_1.8.7     
##  [5] hms_1.1.4          digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5    
##  [9] grid_4.5.2         timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] Matrix_1.7-4       jsonlite_2.0.0     mgcv_1.9-3         codetools_0.2-20  
## [17] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
## [21] splines_4.5.2      bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0      
## [25] yaml_2.3.12        otel_0.2.0         tools_4.5.2        parallel_4.5.2    
## [29] tzdb_0.5.0         vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.5   
## [33] htmlwidgets_1.6.4  bit_4.6.0          vroom_1.6.7        pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.9.0        gtable_0.3.6       glue_1.8.0        
## [41] xfun_0.55          tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2      
## [45] nlme_3.1-168       htmltools_0.5.9    labeling_0.4.3     rmarkdown_2.30    
## [49] compiler_4.5.2     S7_0.2.1

Week 2 Data Dive: COVID-19 Comparative Analysis

Krish Shah

January 20, 2026

Week 2 Data Dive: COVID-19 Analysis

Dataset Overview

PART 1: Numeric Summary of Data

1.1 Categorical Column Summary: `country_group`

1.2 Numeric Column Summaries

Column 1: `new_cases_smoothed_per_million`

Column 2: `stringency_index`

1.3 Combined Summary by Country Group

PART 2: Three Novel Questions

Question 1: Policy Effectiveness Differential

Question 2: Vaccination Equity Impact

Question 3: Healthcare Capacity as Moderator

PART 3: Addressing Question 2 with Aggregation

PART 4: Visual Summaries

4.1 Distribution Visualizations

4.2 Time Trends Analysis

4.3 Correlation Analysis with Color Channel

Summary of Insights

Key Findings

Actionable Recommendations

Further Investigation Needed

Week 2 Data Dive: COVID-19 Comparative Analysis

Krish Shah

January 20, 2026

Week 2 Data Dive: COVID-19 Analysis

Dataset Overview

PART 1: Numeric Summary of Data

1.1 Categorical Column Summary: country_group

1.2 Numeric Column Summaries

Column 1: new_cases_smoothed_per_million

Column 2: stringency_index

1.3 Combined Summary by Country Group

PART 2: Three Novel Questions

Question 1: Policy Effectiveness Differential

Question 2: Vaccination Equity Impact

Question 3: Healthcare Capacity as Moderator

PART 3: Addressing Question 2 with Aggregation

PART 4: Visual Summaries

4.1 Distribution Visualizations

4.2 Time Trends Analysis

4.3 Correlation Analysis with Color Channel

Summary of Insights

Key Findings

Actionable Recommendations

Further Investigation Needed

1.1 Categorical Column Summary: `country_group`

Column 1: `new_cases_smoothed_per_million`

Column 2: `stringency_index`