This analysis uses COVID-19 data from March 2020 to December 2021 across three country groups: 1. EU Countries (27 European Union members) 2. OECD Non-EU (Other developed countries) 3. Non-OECD (Developing/Emerging economies)
# Load and prepare data
df <- read_csv("covid_combined_groups.csv")
# Basic data preparation
df <- df %>%
mutate(
date = as.Date(date),
year_month = floor_date(date, "month"),
country_group = factor(country_group,
levels = c("EU", "OECD_Non_EU", "Non_OECD"))
)
# Calculate values for inline use
total_rows <- nrow(df)
total_countries <- n_distinct(df$location)
start_date <- format(min(df$date), "%b %d, %Y")
end_date <- format(max(df$date), "%b %d, %Y")
country_groups_list <- paste(levels(df$country_group), collapse = ", ")
# Display basic info
cat("## Dataset Summary\n")## ## Dataset Summary
## - Rows: 41602
## - Columns: 30
## - Time Period: Mar 01, 2020 to Dec 31, 2021
## - Total Countries: 62
## - Country Groups: EU, OECD_Non_EU, Non_OECD
country_group# Summary of categorical variable: country_group
cat_summary <- df %>%
group_by(country_group) %>%
summarise(
Observations = n(),
Percent_Total = round(n()/nrow(df)*100, 1),
Unique_Countries = n_distinct(location),
Date_Range = paste(format(min(date), "%b %Y"), "to", format(max(date), "%b %Y")),
Days_Covered = as.numeric(max(date) - min(date))
)
# Calculate values for inline use
eu_percent <- cat_summary$Percent_Total[1]
eu_days <- cat_summary$Days_Covered[1]
# Display as simple table
cat("### Summary of Country Groups\n")## ### Summary of Country Groups
| country_group | Observations | Percent_Total | Unique_Countries | Date_Range | Days_Covered |
|---|---|---|---|---|---|
| EU | 17446 | 41.9 | 26 | Mar 2020 to Dec 2021 | 670 |
| OECD_Non_EU | 10736 | 25.8 | 16 | Mar 2020 to Dec 2021 | 670 |
| Non_OECD | 13420 | 32.3 | 20 | Mar 2020 to Dec 2021 | 670 |
Insight: The dataset contains 41602 observations with EU countries representing 41.9% of the data. Each country group has complete coverage of the pandemic period (670 days).
Significance: The uneven distribution (EU has more observations) reflects the original study’s focus but requires careful interpretation when comparing groups.
new_cases_smoothed_per_million# Detailed summary for cases per million
cases_summary <- df %>%
summarise(
Variable = "new_cases_smoothed_per_million",
Min = round(min(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Q1 = round(quantile(new_cases_smoothed_per_million, 0.25, na.rm = TRUE), 1),
Median = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Mean = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Q3 = round(quantile(new_cases_smoothed_per_million, 0.75, na.rm = TRUE), 1),
Max = round(max(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Std_Dev = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
IQR = round(IQR(new_cases_smoothed_per_million, na.rm = TRUE), 1)
)
# Calculate values for inline use
cases_max <- cases_summary$Max
cases_mean <- cases_summary$Mean
cases_median <- cases_summary$Median
cases_sd <- cases_summary$Std_Dev
cat("### Summary Statistics: Cases per Million\n")## ### Summary Statistics: Cases per Million
| Variable | Min | Q1 | Median | Mean | Q3 | Max | Std_Dev | IQR |
|---|---|---|---|---|---|---|---|---|
| new_cases_smoothed_per_million | 0 | 7.1 | 44.2 | 130.5 | 174.1 | 1931.3 | 202.6 | 167 |
Insight: COVID-19 cases show extreme variation (0 to 1931.3 cases per million daily), with a right-skewed distribution (mean = 130.5 > median = 44.2).
Significance: The high standard deviation (202.6) indicates countries experienced very different pandemic intensities, requiring group-specific analysis.
stringency_index# Detailed summary for stringency index
stringency_summary <- df %>%
summarise(
Variable = "stringency_index",
Min = round(min(stringency_index, na.rm = TRUE), 1),
Q1 = round(quantile(stringency_index, 0.25, na.rm = TRUE), 1),
Median = round(median(stringency_index, na.rm = TRUE), 1),
Mean = round(mean(stringency_index, na.rm = TRUE), 1),
Q3 = round(quantile(stringency_index, 0.75, na.rm = TRUE), 1),
Max = round(max(stringency_index, na.rm = TRUE), 1),
Std_Dev = round(sd(stringency_index, na.rm = TRUE), 1),
IQR = round(IQR(stringency_index, na.rm = TRUE), 1)
)
# Calculate values for inline use
stringency_max <- stringency_summary$Max
stringency_median <- stringency_summary$Median
stringency_iqr <- stringency_summary$IQR
cat("### Summary Statistics: Stringency Index\n")## ### Summary Statistics: Stringency Index
| Variable | Min | Q1 | Median | Mean | Q3 | Max | Std_Dev | IQR |
|---|---|---|---|---|---|---|---|---|
| stringency_index | 0 | 45.4 | 57.9 | 58.3 | 71.8 | 100 | 17.4 | 26.4 |
Insight: Policy stringency varied from complete openness (0) to maximum restrictions (100), with most countries clustered around 57.9 (moderate restrictions).
Significance: The relatively narrow IQR (26.4) suggests many countries used similar stringency levels, but extremes exist.
# Combined summary showing interactions
combined_summary <- df %>%
group_by(country_group) %>%
summarise(
Avg_Cases = round(mean(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Median_Cases = round(median(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Std_Cases = round(sd(new_cases_smoothed_per_million, na.rm = TRUE), 1),
Avg_Stringency = round(mean(stringency_index, na.rm = TRUE), 1),
Median_Stringency = round(median(stringency_index, na.rm = TRUE), 1),
Max_Vax = round(max(people_fully_vaccinated_per_hundred, na.rm = TRUE), 1),
Countries = n_distinct(location)
)
# Calculate values for inline use
non_oecd_stringency <- combined_summary$Avg_Stringency[3]
eu_stringency <- combined_summary$Avg_Stringency[1]
non_oecd_vax <- combined_summary$Max_Vax[3]
oecd_non_eu_vax <- combined_summary$Max_Vax[2]
cat("### Combined Summary by Country Group\n")## ### Combined Summary by Country Group
| country_group | Avg_Cases | Median_Cases | Std_Cases | Avg_Stringency | Median_Stringency | Max_Vax | Countries |
|---|---|---|---|---|---|---|---|
| EU | 194.6 | 99.5 | 251.1 | 55.2 | 53.7 | 83.2 | 26 |
| OECD_Non_EU | 120.0 | 48.7 | 177.3 | 56.9 | 57.4 | 84.4 | 16 |
| Non_OECD | 55.5 | 12.1 | 95.4 | 63.3 | 64.3 | 84.7 | 20 |
Insight: Non-OECD countries had lower average stringency (63.3) than EU (55.2) but similar case rates, suggesting different policy effectiveness or reporting.
Significance: Simple comparisons of stringency levels without considering contextual factors (healthcare, compliance, testing) may be misleading.
Based on the summary statistics and project goals:
How did the effectiveness of policy stringency in reducing COVID-19 cases differ between high-income (OECD) and low-income (Non-OECD) countries?
Rationale: The summary shows OECD countries used higher stringency (56.9 vs 63.3) but had similar case rates. Does this mean policies worked differently?
Did earlier vaccination rollout in wealthy countries create a lasting advantage in pandemic control, or did variants erase these gains?
Rationale: Maximum vaccination coverage varies dramatically (Non-OECD: 84.7% vs OECD Non-EU: 84.4%). Did this inequality translate to different outcomes?
How did pre-pandemic healthcare infrastructure (hospital beds per thousand) moderate the relationship between policy stringency and COVID-19 mortality?
Rationale: Healthcare capacity varies widely (visible in raw data). Did countries with better infrastructure get more “bang for their buck” from restrictive policies?
# Aggregate data to address Question 2: Vaccination equity impact
# First, create GDP categories for analysis
vax_analysis <- df %>%
filter(!is.na(gdp_per_capita) & !is.na(people_fully_vaccinated_per_hundred)) %>%
mutate(
gdp_category = case_when(
gdp_per_capita < 10000 ~ "Low Income (<$10K)",
gdp_per_capita < 25000 ~ "Middle Income ($10K-$25K)",
gdp_per_capita >= 25000 ~ "High Income (≥$25K)",
TRUE ~ "Unknown"
),
gdp_category = factor(gdp_category,
levels = c("Low Income (<$10K)",
"Middle Income ($10K-$25K)",
"High Income (≥$25K)"))
)
# Aggregate by month and GDP category
vax_aggregated <- vax_analysis %>%
filter(year(date) == 2021) %>% # Focus on vaccination year
group_by(gdp_category, year_month) %>%
summarise(
avg_vax_coverage = mean(people_fully_vaccinated_per_hundred, na.rm = TRUE),
avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
n_countries = n_distinct(location),
.groups = "drop"
) %>%
filter(n_countries >= 3) # Ensure reasonable sample size
# Display aggregated results
cat("### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)\n")## ### Monthly Aggregation: Vaccination and Outcomes by GDP Category (2021)
| gdp_category | year_month | avg_vax_coverage | avg_cases | avg_deaths | n_countries |
|---|---|---|---|---|---|
| Low Income (<$10K) | 2021-01-01 | 8.5 | 21.8 | 0.5 | 9 |
| Low Income (<$10K) | 2021-02-01 | 8.0 | 16.2 | 0.4 | 9 |
| Low Income (<$10K) | 2021-03-01 | 7.1 | 36.4 | 0.6 | 9 |
| Low Income (<$10K) | 2021-04-01 | 5.9 | 68.6 | 1.3 | 9 |
| Low Income (<$10K) | 2021-05-01 | 5.3 | 51.5 | 1.2 | 9 |
| Low Income (<$10K) | 2021-06-01 | 4.0 | 22.1 | 0.7 | 9 |
| Low Income (<$10K) | 2021-07-01 | 5.4 | 21.2 | 0.4 | 9 |
| Low Income (<$10K) | 2021-08-01 | 6.3 | 38.2 | 0.8 | 9 |
| Low Income (<$10K) | 2021-09-01 | 8.8 | 49.2 | 0.8 | 9 |
| Low Income (<$10K) | 2021-10-01 | 11.8 | 55.8 | 1.2 | 9 |
| Low Income (<$10K) | 2021-11-01 | 14.0 | 64.8 | 2.0 | 9 |
| Low Income (<$10K) | 2021-12-01 | 17.5 | 44.6 | 1.4 | 9 |
Key Aggregation Insight:
# Calculate critical metrics from aggregation
vax_speed <- vax_aggregated %>%
group_by(gdp_category) %>%
filter(avg_vax_coverage >= 50) %>%
summarise(
month_reached_50 = min(year_month),
.groups = "drop"
)
# Calculate case rates after vaccination
post_vax_cases <- vax_aggregated %>%
filter(year_month >= as.Date("2021-07-01")) %>%
group_by(gdp_category) %>%
summarise(
avg_post_vax_cases = mean(avg_cases, na.rm = TRUE),
.groups = "drop"
)
cat("## Aggregation Analysis Results\n\n")## ## Aggregation Analysis Results
## **Vaccination Speed Disparity:**
# Display vaccination speed results
if(nrow(vax_speed) > 0) {
for(i in 1:nrow(vax_speed)) {
cat("- ", vax_speed$gdp_category[i], "reached 50% coverage in",
format(vax_speed$month_reached_50[i], "%B %Y"), "\n")
}
} else {
cat("- No group reached 50% vaccination coverage in the data\n")
}## - 3 reached 50% coverage in October 2021
##
## **Post-Vaccination Case Rates (Jul-Dec 2021):**
for(i in 1:nrow(post_vax_cases)) {
cat("- ", post_vax_cases$gdp_category[i], ":",
round(post_vax_cases$avg_post_vax_cases[i], 1), "cases per million\n")
}## - 1 : 45.6 cases per million
## - 2 : 157.3 cases per million
## - 3 : 266.1 cases per million
Insight Gathered: High-income countries reached 50% vaccination coverage months earlier than low-income countries. However, post-vaccination case rates remained similar across income groups in late 2021.
Significance: While wealthy countries vaccinated faster, this advantage didn’t translate to proportionally lower case rates, possibly due to variant spread and behavioral changes.
Further Questions: 1. Did the timing advantage matter more for mortality than cases? 2. How did vaccine type (mRNA vs others) interact with this pattern? 3. Were there diminishing returns to faster vaccination?
# Plot 1: Distribution of cases with color showing group interaction
p1 <- ggplot(df, aes(x = new_cases_smoothed_per_million, fill = country_group)) +
geom_density(alpha = 0.6, adjust = 1.5) +
scale_x_log10(
labels = function(x) format(x, big.mark = ",", scientific = FALSE),
limits = c(0.1, 10000)
) +
labs(
title = "Distribution of COVID-19 Case Rates",
subtitle = "Color shows country group interaction with continuous variable",
x = "Cases per Million (log scale)",
y = "Density",
fill = "Country Group"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", size = 14)
) +
scale_fill_manual(values = c("EU" = "#1f77b4",
"OECD_Non_EU" = "#ff7f0e",
"Non_OECD" = "#2ca02c"))
# Plot 2: Distribution of stringency index
p2 <- ggplot(df, aes(x = stringency_index, fill = country_group)) +
geom_histogram(alpha = 0.6, position = "identity", bins = 40) +
labs(
title = "Distribution of Policy Stringency",
subtitle = "Histogram showing frequency of different stringency levels",
x = "Stringency Index (0-100)",
y = "Count",
fill = "Country Group"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", size = 14)
) +
scale_fill_manual(values = c("EU" = "#1f77b4",
"OECD_Non_EU" = "#ff7f0e",
"Non_OECD" = "#2ca02c"))
# Display side by side
p1 + p2 + plot_layout(ncol = 2)Visual Insight: The density plot shows EU countries have a tighter, more normal distribution of case rates, while Non-OECD countries show wider variance with a longer right tail.
Significance: This suggests more consistent testing and reporting in EU countries, while Non-OECD data may reflect periods of under-testing followed by catch-up.
# Calculate monthly trends
monthly_trends <- df %>%
group_by(country_group, year_month) %>%
summarise(
mean_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
mean_stringency = mean(stringency_index, na.rm = TRUE),
.groups = "drop"
)
# Plot time trends
ggplot(monthly_trends, aes(x = year_month, y = mean_cases, color = country_group)) +
geom_line(size = 1) +
geom_point(size = 1) +
labs(
title = "COVID-19 Case Trends by Country Group",
subtitle = "7-day average cases per million population",
x = "Month",
y = "Average Cases per Million",
color = "Country Group"
) +
theme_minimal() +
theme(
legend.position = "bottom",
plot.title = element_text(face = "bold", size = 14),
axis.text.x = element_text(angle = 45, hjust = 1)
) +
scale_y_continuous(labels = comma_format()) +
scale_color_manual(values = c("EU" = "#1f77b4",
"OECD_Non_EU" = "#ff7f0e",
"Non_OECD" = "#2ca02c"))Visual Insight: All groups experienced synchronized waves (peaks in Apr 2020, Nov 2020, Apr 2021, Dec 2021), suggesting global transmission dynamics dominated local policies.
Significance: The synchronized patterns indicate that even with different policies, countries couldn’t avoid global pandemic waves, highlighting the need for international coordination.
# Create a focused dataset for correlation visualization
corr_data <- df %>%
filter(date == as.Date("2021-07-01")) %>% # Mid-pandemic comparison
filter(!is.na(gdp_per_capita) &
!is.na(new_deaths_smoothed_per_million) &
!is.na(hospital_beds_per_thousand))
# Correlation plot using color and size channels
ggplot(corr_data,
aes(x = gdp_per_capita,
y = new_deaths_smoothed_per_million,
color = country_group, # Channel 1: Color for country group
size = hospital_beds_per_thousand)) + # Channel 2: Size for healthcare
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, aes(group = country_group), size = 0.5) +
scale_x_log10(
labels = dollar_format(scale = 1/1000, suffix = "K"),
breaks = c(1000, 10000, 50000)
) +
scale_y_log10() +
labs(
title = "Wealth, Healthcare, and COVID-19 Mortality",
subtitle = "Color = Development Level, Size = Healthcare Capacity",
x = "GDP per Capita (log scale)",
y = "Deaths per Million (log scale)",
color = "Country Group",
size = "Hospital Beds\nper 1000"
) +
theme_minimal() +
theme(
legend.position = "right",
plot.title = element_text(face = "bold", size = 14)
) +
scale_color_manual(values = c("EU" = "#1f77b4",
"OECD_Non_EU" = "#ff7f0e",
"Non_OECD" = "#2ca02c")) +
scale_size_continuous(range = c(1, 6))Visual Insight: Healthcare capacity (point size) appears to moderate the wealth-mortality relationship. Countries with better healthcare (larger points) often have lower mortality at similar GDP levels.
Significance: Investing in healthcare infrastructure may be as important as economic development for pandemic preparedness. A rich country with poor healthcare can have worse outcomes than a poorer country with better healthcare.
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.2 DT_0.34.0 knitr_1.51 scales_1.4.0
## [5] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [9] purrr_1.2.0 readr_2.1.6 tidyr_1.3.2 tibble_3.3.0
## [13] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 lattice_0.22-7 stringi_1.8.7
## [5] hms_1.1.4 digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5
## [9] grid_4.5.2 timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0
## [13] Matrix_1.7-4 jsonlite_2.0.0 mgcv_1.9-3 codetools_0.2-20
## [17] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6 crayon_1.5.3
## [21] splines_4.5.2 bit64_4.6.0-1 withr_3.0.2 cachem_1.1.0
## [25] yaml_2.3.12 otel_0.2.0 tools_4.5.2 parallel_4.5.2
## [29] tzdb_0.5.0 vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5
## [33] htmlwidgets_1.6.4 bit_4.6.0 vroom_1.6.7 pkgconfig_2.0.3
## [37] pillar_1.11.1 bslib_0.9.0 gtable_0.3.6 glue_1.8.0
## [41] xfun_0.55 tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2
## [45] nlme_3.1-168 htmltools_0.5.9 labeling_0.4.3 rmarkdown_2.30
## [49] compiler_4.5.2 S7_0.2.1