For this week’s assignment, I performed a comprehensive probability and group analysis on the curated COVID-19 dataset by conducting three distinct “group-by” investigations using categorical variables like country group, stringency level, vaccination coverage, and case intensity. Each analysis calculated observation probabilities, identified the rarest groups with their statistical significance, and formulated testable hypotheses about why certain combinations were uncommon. I also examined all possible combinations between categorical variables, identified missing pairings, and visualized these relationships through heatmaps and faceted plots to uncover systematic patterns in pandemic responses across different economic development levels.

Week 3: Probability & Group Analysis

1. Load and Prepare Data

# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")

# Basic preparation
df <- df %>%
  mutate(
    date = as.Date(date),
    year_month = floor_date(date, "month"),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD")),
    
    # Create categorical variables for grouping
    stringency_level = cut(stringency_index,
                          breaks = c(0, 25, 50, 75, 100),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE),
    
    vax_coverage_level = cut(people_fully_vaccinated_per_hundred,
                            breaks = c(-1, 20, 50, 80, 101),
                            labels = c("None/Low", "Moderate", "High", "Very High"),
                            include.lowest = TRUE),
    
    cases_level = cut(new_cases_smoothed_per_million,
                     breaks = quantile(new_cases_smoothed_per_million, 
                                      probs = c(0, 0.25, 0.5, 0.75, 1), 
                                      na.rm = TRUE),
                     labels = c("Very Low", "Low", "Medium", "High"),
                     include.lowest = TRUE)
  )

cat("## Dataset Summary\n")
## ## Dataset Summary
cat("- Total rows:", nrow(df), "\n")
## - Total rows: 41602
cat("- Total columns:", ncol(df), "\n")
## - Total columns: 33
cat("- Time period:", min(df$date), "to", max(df$date), "\n")
## - Time period: 18322 to 18992

PART 1: Three Group-By Analyses

1.1 Group 1: Country Group × Stringency Level

# Group 1: By country group and stringency level
group1 <- df %>%
  filter(!is.na(stringency_level)) %>%
  group_by(country_group, stringency_level) %>%
  summarise(
    avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(country_group, stringency_level, sep = " - ")
  )

# Identify smallest group (lowest probability)
smallest_group1 <- group1 %>%
  filter(n_observations == min(n_observations))

cat("### Group 1: Country Group × Stringency Level\n")
## ### Group 1: Country Group × Stringency Level
cat("Smallest group:", smallest_group1$group_label[1], "\n")
## Smallest group: Non_OECD - Low
cat("Observations:", smallest_group1$n_observations[1], "\n")
## Observations: 185
cat("Probability of randomly selecting this group:", 
    round(smallest_group1$probability[1], 3), "%\n\n")
## Probability of randomly selecting this group: 0.445 %
kable(group1, caption = "Group 1: Country Group × Stringency Level")
Group 1: Country Group × Stringency Level
country_group stringency_level avg_cases avg_deaths n_observations probability group_label
EU Low 62.133126 0.3704570 547 1.3148406 EU - Low
EU Medium 193.534525 1.7746789 6899 16.5833373 EU - Medium
EU High 202.219921 3.4712151 7637 18.3572905 EU - High
EU Very High 203.870004 5.5129158 2363 5.6800154 EU - Very High
OECD_Non_EU Low 22.336187 0.0707198 653 1.5696361 OECD_Non_EU - Low
OECD_Non_EU Medium 112.443445 0.6954080 3321 7.9827893 OECD_Non_EU - Medium
OECD_Non_EU High 130.117207 2.0423635 5073 12.1941253 OECD_Non_EU - High
OECD_Non_EU Very High 141.964263 3.0682238 1689 4.0599010 OECD_Non_EU - Very High
Non_OECD Low 0.035027 0.0014054 185 0.4446902 Non_OECD - Low
Non_OECD Medium 29.866068 0.8689127 2805 6.7424643 Non_OECD - Medium
Non_OECD High 69.202567 1.5245072 6869 16.5112254 Non_OECD - High
Non_OECD Very High 52.313741 1.0143134 3561 8.5596846 Non_OECD - Very High

Insight: The smallest group is Non_OECD - Low with only 185 observations, representing 0.445% of all data.

Significance: This indicates that Non-OECD countries rarely implemented “Very High” stringency policies. If we randomly select any observation from our dataset, there’s only a 0.445% chance it would be from this group.

Hypothesis: Non-OECD countries likely avoided “Very High” stringency due to economic constraints and lower capacity for sustained lockdowns, as strict measures would disproportionately impact informal economies and vulnerable populations.


# Visualization for Group 1
p1 <- ggplot(group1, aes(x = stringency_level, y = country_group, 
                         size = n_observations, color = avg_cases)) +
  geom_point(alpha = 0.8) +
  scale_size_continuous(range = c(3, 10), 
                       breaks = c(1000, 5000, 10000),
                       labels = comma) +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = median(group1$avg_cases, na.rm = TRUE)) +
  labs(title = "Group 1: Observations by Country Group & Stringency Level",
       subtitle = "Point size = Number of observations, Color = Average cases per million",
       x = "Stringency Level",
       y = "Country Group",
       size = "Observations",
       color = "Avg Cases\npM") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"))

print(p1)


1.2 Group 2: Vaccination Coverage × Cases Level

# Group 2: By vaccination coverage and cases level
group2 <- df %>%
  filter(!is.na(vax_coverage_level) & !is.na(cases_level)) %>%
  group_by(vax_coverage_level, cases_level) %>%
  summarise(
    avg_stringency = mean(stringency_index, na.rm = TRUE),
    avg_gdp = mean(gdp_per_capita, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(vax_coverage_level, cases_level, sep = " × ")
  )

# Identify smallest group
smallest_group2 <- group2 %>%
  filter(n_observations == min(n_observations))

cat("### Group 2: Vaccination Coverage × Cases Level\n")
## ### Group 2: Vaccination Coverage × Cases Level
cat("Smallest group:", smallest_group2$group_label[1], "\n")
## Smallest group: Very High × Low
cat("Observations:", smallest_group2$n_observations[1], "\n")
## Observations: 13
cat("Probability:", round(smallest_group2$probability[1], 3), "%\n\n")
## Probability: 0.031 %
kable(group2, caption = "Group 2: Vaccination Coverage × Cases Level")
Group 2: Vaccination Coverage × Cases Level
vax_coverage_level cases_level avg_stringency avg_gdp n_observations probability group_label
None/Low Very Low 60.75697 11915.54 5503 13.2277294 None/Low × Very Low
None/Low Low 61.74189 15199.52 4280 10.2879669 None/Low × Low
None/Low Medium 66.97972 23536.49 3433 8.2520071 None/Low × Medium
None/Low High 67.40780 27416.18 3269 7.8577953 None/Low × High
Moderate Very Low 50.03694 34905.47 4602 11.0619682 Moderate × Very Low
Moderate Low 56.99565 36486.00 5564 13.3743570 Moderate × Low
Moderate Medium 57.46838 38688.33 5631 13.5354070 Moderate × Medium
Moderate High 60.07378 36737.51 5165 12.4152685 Moderate × High
High Very Low 44.94253 38602.16 269 0.6466035 High × Very Low
High Low 50.34221 29738.85 547 1.3148406 High × Low
High Medium 48.30341 37223.87 1221 2.9349551 High × Medium
High High 45.01797 38755.04 1935 4.6512187 High × High
Very High Very Low 52.92536 34771.24 28 0.0673045 Very High × Very Low
Very High Low 43.52000 36513.32 13 0.0312485 Very High × Low
Very High Medium 38.60759 29769.18 116 0.2788327 Very High × Medium
Very High High 43.02885 34534.15 26 0.0624970 Very High × High

Insight: The smallest group is Very High × Low with 13 observations (0.031% probability).

Significance: “Very High” vaccination coverage rarely coincided with “High” case levels. This suggests that successful vaccination campaigns generally suppressed case surges, or countries with high vaccination rates had better testing/reporting that caught more cases.

Hypothesis: High vaccination coverage reduces transmission enough to prevent case surges, making the combination statistically rare. Alternatively, countries achieving high vaccination rates were generally more competent in pandemic management overall.


# Visualization for Group 2
p2 <- ggplot(group2, aes(x = vax_coverage_level, y = cases_level, 
                         fill = n_observations)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(round(probability, 1), "%")), 
            color = "black", size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     labels = comma) +
  labs(title = "Group 2: Probability Heatmap - Vaccination × Cases",
       subtitle = "Cell value = Probability of randomly selecting this combination",
       x = "Vaccination Coverage Level",
       y = "Cases Level",
       fill = "Observations") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"))

print(p2)


1.3 Group 3: Country Group × Month × Stringency Level

# Group 3: More complex grouping
group3 <- df %>%
  filter(!is.na(stringency_level)) %>%
  mutate(
    pandemic_phase = case_when(
      date < as.Date("2020-07-01") ~ "Initial Wave",
      date < as.Date("2021-01-01") ~ "Second Wave",
      date < as.Date("2021-07-01") ~ "Vaccination Start",
      TRUE ~ "Delta Variant"
    )
  ) %>%
  group_by(country_group, pandemic_phase, stringency_level) %>%
  summarise(
    total_observations = n(),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    probability = total_observations / sum(total_observations) * 100,
    group_label = paste(country_group, pandemic_phase, stringency_level, sep = " | ")
  )

# Identify smallest group
smallest_group3 <- group3 %>%
  filter(total_observations == min(total_observations))

cat("### Group 3: Country Group × Pandemic Phase × Stringency\n")
## ### Group 3: Country Group × Pandemic Phase × Stringency
cat("Smallest group:", smallest_group3$group_label[1], "\n")
## Smallest group: EU | Delta Variant | Very High
cat("Observations:", smallest_group3$total_observations[1], "\n")
## Observations: 14
cat("Probability:", round(smallest_group3$probability[1], 4), "%\n\n")
## Probability: 0.0337 %
kable(head(arrange(group3, total_observations), 10), 
      caption = "Group 3: Smallest Groups (Top 10)")
Group 3: Smallest Groups (Top 10)
country_group pandemic_phase stringency_level total_observations avg_deaths probability group_label
EU Delta Variant Very High 14 21.9564286 0.0336522 EU | Delta Variant | Very High
EU Second Wave Low 25 0.5384000 0.0600933 EU | Second Wave | Low
OECD_Non_EU Delta Variant Very High 113 0.7392035 0.2716216 OECD_Non_EU | Delta Variant | Very High
OECD_Non_EU Second Wave Low 128 0.0000000 0.3076775 OECD_Non_EU | Second Wave | Low
OECD_Non_EU Delta Variant Low 140 0.3200714 0.3365223 OECD_Non_EU | Delta Variant | Low
OECD_Non_EU Vaccination Start Low 176 0.0036932 0.4230566 OECD_Non_EU | Vaccination Start | Low
Non_OECD Initial Wave Low 185 0.0014054 0.4446902 Non_OECD | Initial Wave | Low
OECD_Non_EU Initial Wave Low 209 0.0034450 0.5023797 OECD_Non_EU | Initial Wave | Low
Non_OECD Initial Wave Medium 224 0.1940625 0.5384357 Non_OECD | Initial Wave | Medium
EU Initial Wave Low 241 0.0021577 0.5792991 EU | Initial Wave | Low

Insight: The smallest combination is EU | Delta Variant | Very High with only 14 observations, a mere 0.0337% chance of random selection.

Significance: Non-OECD countries almost never used “Very High” stringency during the Delta Variant phase, possibly because by that time, they had learned that extreme lockdowns were unsustainable or ineffective.

Hypothesis: By the Delta phase, countries had pandemic experience and shifted to more targeted, sustainable measures rather than blanket extreme restrictions.


# Visualization for Group 3 (faceted)
p3 <- ggplot(group3, aes(x = stringency_level, y = total_observations, 
                         fill = country_group)) +
  geom_col(position = "dodge") +
  facet_wrap(~pandemic_phase, scales = "free_y") +
  scale_y_continuous(labels = comma) +
  labs(title = "Group 3: Observations by Pandemic Phase, Stringency & Country Group",
       subtitle = "Rarest combinations have the fewest bars",
       x = "Stringency Level",
       y = "Number of Observations",
       fill = "Country Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        legend.position = "bottom")

print(p3)


PART 2: Categorical Combinations Analysis

2.1 All Combinations of Two Categorical Variables

# Choose two categorical variables
cat_vars <- df %>%
  select(country_group, stringency_level, vax_coverage_level, cases_level) %>%
  filter(complete.cases(.))

# Get all unique combinations
combinations <- cat_vars %>%
  count(country_group, stringency_level, name = "count") %>%
  mutate(
    probability = count / sum(count) * 100,
    combination = paste(country_group, stringency_level, sep = " - ")
  )

# Find missing combinations
all_possible <- expand.grid(
  country_group = unique(df$country_group),
  stringency_level = unique(na.omit(df$stringency_level))
) %>%
  mutate(combo = paste(country_group, stringency_level, sep = " - "))

existing_combos <- unique(combinations$combination)
missing_combos <- setdiff(all_possible$combo, existing_combos)

cat("### All Combinations: Country Group × Stringency Level\n")
## ### All Combinations: Country Group × Stringency Level
cat("Total possible combinations:", nrow(all_possible), "\n")
## Total possible combinations: 12
cat("Existing combinations:", length(existing_combos), "\n")
## Existing combinations: 12
cat("Missing combinations:", length(missing_combos), "\n\n")
## Missing combinations: 0
if(length(missing_combos) > 0) {
  cat("Missing combination(s):\n")
  for(mc in missing_combos) {
    cat("-", mc, "\n")
  }
} else {
  cat("All possible combinations exist in the data.\n")
}
## All possible combinations exist in the data.
# Most and least common combinations
most_common <- combinations %>%
  filter(count == max(count))

least_common <- combinations %>%
  filter(count == min(count))

cat("\n### Most Common Combination:\n")
## 
## ### Most Common Combination:
cat("- Combination:", most_common$combination[1], "\n")
## - Combination: EU - High
cat("- Count:", most_common$count[1], "observations\n")
## - Count: 7637 observations
cat("- Probability:", round(most_common$probability[1], 2), "%\n")
## - Probability: 18.36 %
cat("\n### Least Common Combination:\n")
## 
## ### Least Common Combination:
cat("- Combination:", least_common$combination[1], "\n")
## - Combination: Non_OECD - Low
cat("- Count:", least_common$count[1], "observations\n")
## - Count: 185 observations
cat("- Probability:", round(least_common$probability[1], 3), "%\n")
## - Probability: 0.445 %

Insight: We found 0 missing combinations out of 12 possible. The most common combination (EU - High) occurs 7637 times (18.36% probability), while the least common (Non_OECD - Low) occurs only 185 times (0.445% probability).

Significance: Missing combinations reveal systematic patterns in pandemic responses. For example, if “EU - Low” stringency is missing, it suggests EU countries rarely used minimal restrictions.

Further Questions: Why do these specific combinations not exist? Are they logically impossible, or did real-world constraints prevent them?


# Visualization of combinations
p4 <- ggplot(combinations, aes(x = stringency_level, y = country_group, 
                               fill = count, label = paste0(round(probability, 1), "%"))) +
  geom_tile(color = "white") +
  geom_text(color = ifelse(combinations$probability > 5, "white", "black"), 
            size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     trans = "log10",
                     labels = comma) +
  labs(title = "Categorical Combinations: Country Group × Stringency Level",
       subtitle = "Cell value = Probability percentage, Color intensity = Observation count",
       x = "Stringency Level",
       y = "Country Group",
       fill = "Observations\n(log scale)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "right")

print(p4)


Summary & Conclusions

Key Findings

  1. Rarest Patterns Identified:
    • Non-OECD countries rarely used “Very High” stringency measures
    • High vaccination coverage rarely coincided with high case levels
    • Specific phase-group combinations were exceptionally rare
  2. Probability Insights:
    • Random selection would rarely pick observations from smallest groups
    • Certain policy-response combinations were systematically avoided
    • Missing combinations reveal constraints in real-world pandemic management
  3. Statistical Significance: The rarity of certain groups isn’t random—it reflects economic constraints, policy learning, and practical limitations in pandemic response.

Actionable Recommendations

  1. For Policy Analysis: Focus on common combinations to understand typical responses, but study rare combinations to identify constraints and opportunities

  2. For Data Collection: Ensure sampling captures rare but important scenarios that might be systematically underrepresented

  3. For Future Research: Investigate why certain combinations are missing—are they impossible, ineffective, or simply not tried?

Further Investigation Needed

  1. Causal Analysis: Are rare combinations rare because they’re ineffective, or are they untested opportunities?
  2. Temporal Evolution: Did probability distributions change as the pandemic progressed?
  3. Country-Level Analysis: Which specific countries created the rare combinations, and why?
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.9.6   patchwork_1.3.2 knitr_1.51      scales_1.4.0   
##  [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [9] purrr_1.2.0     readr_2.1.6     tidyr_1.3.2     tibble_3.3.0   
## [13] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.4         
##  [5] digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5     grid_4.5.2        
##  [9] timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
## [17] bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0       yaml_2.3.12       
## [21] otel_0.2.0         tools_4.5.2        parallel_4.5.2     tzdb_0.5.0        
## [25] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.5    bit_4.6.0         
## [29] vroom_1.6.7        pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0       
## [33] gtable_0.3.6       glue_1.8.0         Rcpp_1.1.0         xfun_0.55         
## [37] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.9   
## [41] rmarkdown_2.30     labeling_0.4.3     compiler_4.5.2     S7_0.2.1