For this week’s assignment, I performed a comprehensive probability and group analysis on the curated COVID-19 dataset by conducting three distinct “group-by” investigations using categorical variables like country group, stringency level, vaccination coverage, and case intensity. Each analysis calculated observation probabilities, identified the rarest groups with their statistical significance, and formulated testable hypotheses about why certain combinations were uncommon. I also examined all possible combinations between categorical variables, identified missing pairings, and visualized these relationships through heatmaps and faceted plots to uncover systematic patterns in pandemic responses across different economic development levels.
# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")
# Basic preparation
df <- df %>%
mutate(
date = as.Date(date),
year_month = floor_date(date, "month"),
country_group = factor(country_group,
levels = c("EU", "OECD_Non_EU", "Non_OECD")),
# Create categorical variables for grouping
stringency_level = cut(stringency_index,
breaks = c(0, 25, 50, 75, 100),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE),
vax_coverage_level = cut(people_fully_vaccinated_per_hundred,
breaks = c(-1, 20, 50, 80, 101),
labels = c("None/Low", "Moderate", "High", "Very High"),
include.lowest = TRUE),
cases_level = cut(new_cases_smoothed_per_million,
breaks = quantile(new_cases_smoothed_per_million,
probs = c(0, 0.25, 0.5, 0.75, 1),
na.rm = TRUE),
labels = c("Very Low", "Low", "Medium", "High"),
include.lowest = TRUE)
)
cat("## Dataset Summary\n")## ## Dataset Summary
## - Total rows: 41602
## - Total columns: 33
## - Time period: 18322 to 18992
# Group 1: By country group and stringency level
group1 <- df %>%
filter(!is.na(stringency_level)) %>%
group_by(country_group, stringency_level) %>%
summarise(
avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
n_observations = n(),
.groups = "drop"
) %>%
mutate(
probability = n_observations / sum(n_observations) * 100,
group_label = paste(country_group, stringency_level, sep = " - ")
)
# Identify smallest group (lowest probability)
smallest_group1 <- group1 %>%
filter(n_observations == min(n_observations))
cat("### Group 1: Country Group × Stringency Level\n")## ### Group 1: Country Group × Stringency Level
## Smallest group: Non_OECD - Low
## Observations: 185
cat("Probability of randomly selecting this group:",
round(smallest_group1$probability[1], 3), "%\n\n")## Probability of randomly selecting this group: 0.445 %
| country_group | stringency_level | avg_cases | avg_deaths | n_observations | probability | group_label |
|---|---|---|---|---|---|---|
| EU | Low | 62.133126 | 0.3704570 | 547 | 1.3148406 | EU - Low |
| EU | Medium | 193.534525 | 1.7746789 | 6899 | 16.5833373 | EU - Medium |
| EU | High | 202.219921 | 3.4712151 | 7637 | 18.3572905 | EU - High |
| EU | Very High | 203.870004 | 5.5129158 | 2363 | 5.6800154 | EU - Very High |
| OECD_Non_EU | Low | 22.336187 | 0.0707198 | 653 | 1.5696361 | OECD_Non_EU - Low |
| OECD_Non_EU | Medium | 112.443445 | 0.6954080 | 3321 | 7.9827893 | OECD_Non_EU - Medium |
| OECD_Non_EU | High | 130.117207 | 2.0423635 | 5073 | 12.1941253 | OECD_Non_EU - High |
| OECD_Non_EU | Very High | 141.964263 | 3.0682238 | 1689 | 4.0599010 | OECD_Non_EU - Very High |
| Non_OECD | Low | 0.035027 | 0.0014054 | 185 | 0.4446902 | Non_OECD - Low |
| Non_OECD | Medium | 29.866068 | 0.8689127 | 2805 | 6.7424643 | Non_OECD - Medium |
| Non_OECD | High | 69.202567 | 1.5245072 | 6869 | 16.5112254 | Non_OECD - High |
| Non_OECD | Very High | 52.313741 | 1.0143134 | 3561 | 8.5596846 | Non_OECD - Very High |
Insight: The smallest group is Non_OECD - Low with only 185 observations, representing 0.445% of all data.
Significance: This indicates that Non-OECD countries rarely implemented “Very High” stringency policies. If we randomly select any observation from our dataset, there’s only a 0.445% chance it would be from this group.
Hypothesis: Non-OECD countries likely avoided “Very High” stringency due to economic constraints and lower capacity for sustained lockdowns, as strict measures would disproportionately impact informal economies and vulnerable populations.
# Visualization for Group 1
p1 <- ggplot(group1, aes(x = stringency_level, y = country_group,
size = n_observations, color = avg_cases)) +
geom_point(alpha = 0.8) +
scale_size_continuous(range = c(3, 10),
breaks = c(1000, 5000, 10000),
labels = comma) +
scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
midpoint = median(group1$avg_cases, na.rm = TRUE)) +
labs(title = "Group 1: Observations by Country Group & Stringency Level",
subtitle = "Point size = Number of observations, Color = Average cases per million",
x = "Stringency Level",
y = "Country Group",
size = "Observations",
color = "Avg Cases\npM") +
theme_minimal() +
theme(legend.position = "right",
plot.title = element_text(face = "bold"))
print(p1)# Group 2: By vaccination coverage and cases level
group2 <- df %>%
filter(!is.na(vax_coverage_level) & !is.na(cases_level)) %>%
group_by(vax_coverage_level, cases_level) %>%
summarise(
avg_stringency = mean(stringency_index, na.rm = TRUE),
avg_gdp = mean(gdp_per_capita, na.rm = TRUE),
n_observations = n(),
.groups = "drop"
) %>%
mutate(
probability = n_observations / sum(n_observations) * 100,
group_label = paste(vax_coverage_level, cases_level, sep = " × ")
)
# Identify smallest group
smallest_group2 <- group2 %>%
filter(n_observations == min(n_observations))
cat("### Group 2: Vaccination Coverage × Cases Level\n")## ### Group 2: Vaccination Coverage × Cases Level
## Smallest group: Very High × Low
## Observations: 13
## Probability: 0.031 %
| vax_coverage_level | cases_level | avg_stringency | avg_gdp | n_observations | probability | group_label |
|---|---|---|---|---|---|---|
| None/Low | Very Low | 60.75697 | 11915.54 | 5503 | 13.2277294 | None/Low × Very Low |
| None/Low | Low | 61.74189 | 15199.52 | 4280 | 10.2879669 | None/Low × Low |
| None/Low | Medium | 66.97972 | 23536.49 | 3433 | 8.2520071 | None/Low × Medium |
| None/Low | High | 67.40780 | 27416.18 | 3269 | 7.8577953 | None/Low × High |
| Moderate | Very Low | 50.03694 | 34905.47 | 4602 | 11.0619682 | Moderate × Very Low |
| Moderate | Low | 56.99565 | 36486.00 | 5564 | 13.3743570 | Moderate × Low |
| Moderate | Medium | 57.46838 | 38688.33 | 5631 | 13.5354070 | Moderate × Medium |
| Moderate | High | 60.07378 | 36737.51 | 5165 | 12.4152685 | Moderate × High |
| High | Very Low | 44.94253 | 38602.16 | 269 | 0.6466035 | High × Very Low |
| High | Low | 50.34221 | 29738.85 | 547 | 1.3148406 | High × Low |
| High | Medium | 48.30341 | 37223.87 | 1221 | 2.9349551 | High × Medium |
| High | High | 45.01797 | 38755.04 | 1935 | 4.6512187 | High × High |
| Very High | Very Low | 52.92536 | 34771.24 | 28 | 0.0673045 | Very High × Very Low |
| Very High | Low | 43.52000 | 36513.32 | 13 | 0.0312485 | Very High × Low |
| Very High | Medium | 38.60759 | 29769.18 | 116 | 0.2788327 | Very High × Medium |
| Very High | High | 43.02885 | 34534.15 | 26 | 0.0624970 | Very High × High |
Insight: The smallest group is Very High × Low with 13 observations (0.031% probability).
Significance: “Very High” vaccination coverage rarely coincided with “High” case levels. This suggests that successful vaccination campaigns generally suppressed case surges, or countries with high vaccination rates had better testing/reporting that caught more cases.
Hypothesis: High vaccination coverage reduces transmission enough to prevent case surges, making the combination statistically rare. Alternatively, countries achieving high vaccination rates were generally more competent in pandemic management overall.
# Visualization for Group 2
p2 <- ggplot(group2, aes(x = vax_coverage_level, y = cases_level,
fill = n_observations)) +
geom_tile(color = "white") +
geom_text(aes(label = paste0(round(probability, 1), "%")),
color = "black", size = 3.5) +
scale_fill_gradient(low = "lightblue", high = "darkblue",
labels = comma) +
labs(title = "Group 2: Probability Heatmap - Vaccination × Cases",
subtitle = "Cell value = Probability of randomly selecting this combination",
x = "Vaccination Coverage Level",
y = "Cases Level",
fill = "Observations") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"))
print(p2)# Group 3: More complex grouping
group3 <- df %>%
filter(!is.na(stringency_level)) %>%
mutate(
pandemic_phase = case_when(
date < as.Date("2020-07-01") ~ "Initial Wave",
date < as.Date("2021-01-01") ~ "Second Wave",
date < as.Date("2021-07-01") ~ "Vaccination Start",
TRUE ~ "Delta Variant"
)
) %>%
group_by(country_group, pandemic_phase, stringency_level) %>%
summarise(
total_observations = n(),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
probability = total_observations / sum(total_observations) * 100,
group_label = paste(country_group, pandemic_phase, stringency_level, sep = " | ")
)
# Identify smallest group
smallest_group3 <- group3 %>%
filter(total_observations == min(total_observations))
cat("### Group 3: Country Group × Pandemic Phase × Stringency\n")## ### Group 3: Country Group × Pandemic Phase × Stringency
## Smallest group: EU | Delta Variant | Very High
## Observations: 14
## Probability: 0.0337 %
| country_group | pandemic_phase | stringency_level | total_observations | avg_deaths | probability | group_label |
|---|---|---|---|---|---|---|
| EU | Delta Variant | Very High | 14 | 21.9564286 | 0.0336522 | EU | Delta Variant | Very High |
| EU | Second Wave | Low | 25 | 0.5384000 | 0.0600933 | EU | Second Wave | Low |
| OECD_Non_EU | Delta Variant | Very High | 113 | 0.7392035 | 0.2716216 | OECD_Non_EU | Delta Variant | Very High |
| OECD_Non_EU | Second Wave | Low | 128 | 0.0000000 | 0.3076775 | OECD_Non_EU | Second Wave | Low |
| OECD_Non_EU | Delta Variant | Low | 140 | 0.3200714 | 0.3365223 | OECD_Non_EU | Delta Variant | Low |
| OECD_Non_EU | Vaccination Start | Low | 176 | 0.0036932 | 0.4230566 | OECD_Non_EU | Vaccination Start | Low |
| Non_OECD | Initial Wave | Low | 185 | 0.0014054 | 0.4446902 | Non_OECD | Initial Wave | Low |
| OECD_Non_EU | Initial Wave | Low | 209 | 0.0034450 | 0.5023797 | OECD_Non_EU | Initial Wave | Low |
| Non_OECD | Initial Wave | Medium | 224 | 0.1940625 | 0.5384357 | Non_OECD | Initial Wave | Medium |
| EU | Initial Wave | Low | 241 | 0.0021577 | 0.5792991 | EU | Initial Wave | Low |
Insight: The smallest combination is EU | Delta Variant | Very High with only 14 observations, a mere 0.0337% chance of random selection.
Significance: Non-OECD countries almost never used “Very High” stringency during the Delta Variant phase, possibly because by that time, they had learned that extreme lockdowns were unsustainable or ineffective.
Hypothesis: By the Delta phase, countries had pandemic experience and shifted to more targeted, sustainable measures rather than blanket extreme restrictions.
# Visualization for Group 3 (faceted)
p3 <- ggplot(group3, aes(x = stringency_level, y = total_observations,
fill = country_group)) +
geom_col(position = "dodge") +
facet_wrap(~pandemic_phase, scales = "free_y") +
scale_y_continuous(labels = comma) +
labs(title = "Group 3: Observations by Pandemic Phase, Stringency & Country Group",
subtitle = "Rarest combinations have the fewest bars",
x = "Stringency Level",
y = "Number of Observations",
fill = "Country Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.position = "bottom")
print(p3)# Choose two categorical variables
cat_vars <- df %>%
select(country_group, stringency_level, vax_coverage_level, cases_level) %>%
filter(complete.cases(.))
# Get all unique combinations
combinations <- cat_vars %>%
count(country_group, stringency_level, name = "count") %>%
mutate(
probability = count / sum(count) * 100,
combination = paste(country_group, stringency_level, sep = " - ")
)
# Find missing combinations
all_possible <- expand.grid(
country_group = unique(df$country_group),
stringency_level = unique(na.omit(df$stringency_level))
) %>%
mutate(combo = paste(country_group, stringency_level, sep = " - "))
existing_combos <- unique(combinations$combination)
missing_combos <- setdiff(all_possible$combo, existing_combos)
cat("### All Combinations: Country Group × Stringency Level\n")## ### All Combinations: Country Group × Stringency Level
## Total possible combinations: 12
## Existing combinations: 12
## Missing combinations: 0
if(length(missing_combos) > 0) {
cat("Missing combination(s):\n")
for(mc in missing_combos) {
cat("-", mc, "\n")
}
} else {
cat("All possible combinations exist in the data.\n")
}## All possible combinations exist in the data.
# Most and least common combinations
most_common <- combinations %>%
filter(count == max(count))
least_common <- combinations %>%
filter(count == min(count))
cat("\n### Most Common Combination:\n")##
## ### Most Common Combination:
## - Combination: EU - High
## - Count: 7637 observations
## - Probability: 18.36 %
##
## ### Least Common Combination:
## - Combination: Non_OECD - Low
## - Count: 185 observations
## - Probability: 0.445 %
Insight: We found 0 missing combinations out of 12 possible. The most common combination (EU - High) occurs 7637 times (18.36% probability), while the least common (Non_OECD - Low) occurs only 185 times (0.445% probability).
Significance: Missing combinations reveal systematic patterns in pandemic responses. For example, if “EU - Low” stringency is missing, it suggests EU countries rarely used minimal restrictions.
Further Questions: Why do these specific combinations not exist? Are they logically impossible, or did real-world constraints prevent them?
# Visualization of combinations
p4 <- ggplot(combinations, aes(x = stringency_level, y = country_group,
fill = count, label = paste0(round(probability, 1), "%"))) +
geom_tile(color = "white") +
geom_text(color = ifelse(combinations$probability > 5, "white", "black"),
size = 3.5) +
scale_fill_gradient(low = "lightblue", high = "darkblue",
trans = "log10",
labels = comma) +
labs(title = "Categorical Combinations: Country Group × Stringency Level",
subtitle = "Cell value = Probability percentage, Color intensity = Observation count",
x = "Stringency Level",
y = "Country Group",
fill = "Observations\n(log scale)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"),
legend.position = "right")
print(p4)For Policy Analysis: Focus on common combinations to understand typical responses, but study rare combinations to identify constraints and opportunities
For Data Collection: Ensure sampling captures rare but important scenarios that might be systematically underrepresented
For Future Research: Investigate why certain combinations are missing—are they impossible, ineffective, or simply not tried?
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_India.utf8 LC_CTYPE=English_India.utf8
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_India.utf8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggrepel_0.9.6 patchwork_1.3.2 knitr_1.51 scales_1.4.0
## [5] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [9] purrr_1.2.0 readr_2.1.6 tidyr_1.3.2 tibble_3.3.0
## [13] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.10 generics_0.1.4 stringi_1.8.7 hms_1.1.4
## [5] digest_0.6.39 magrittr_2.0.4 evaluate_1.0.5 grid_4.5.2
## [9] timechange_0.3.0 RColorBrewer_1.1-3 fastmap_1.2.0 jsonlite_2.0.0
## [13] jquerylib_0.1.4 cli_3.6.5 rlang_1.1.6 crayon_1.5.3
## [17] bit64_4.6.0-1 withr_3.0.2 cachem_1.1.0 yaml_2.3.12
## [21] otel_0.2.0 tools_4.5.2 parallel_4.5.2 tzdb_0.5.0
## [25] vctrs_0.6.5 R6_2.6.1 lifecycle_1.0.5 bit_4.6.0
## [29] vroom_1.6.7 pkgconfig_2.0.3 pillar_1.11.1 bslib_0.9.0
## [33] gtable_0.3.6 glue_1.8.0 Rcpp_1.1.0 xfun_0.55
## [37] tidyselect_1.2.1 rstudioapi_0.17.1 farver_2.1.2 htmltools_0.5.9
## [41] rmarkdown_2.30 labeling_0.4.3 compiler_4.5.2 S7_0.2.1