For this week’s assignment, I performed a comprehensive probability and group analysis on the curated COVID-19 dataset by conducting three distinct “group-by” investigations using categorical variables like country group, stringency level, vaccination coverage, and case intensity. Each analysis calculated observation probabilities, identified the rarest groups with their statistical significance, and formulated testable hypotheses about why certain combinations were uncommon. I also examined all possible combinations between categorical variables, identified missing pairings, and visualized these relationships through heatmaps and faceted plots to uncover systematic patterns in pandemic responses across different economic development levels.

Week 3: Probability & Group Analysis

1. Load and Prepare Data

# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")

# Basic preparation
df <- df %>%
  mutate(
    date = as.Date(date),
    year_month = floor_date(date, "month"),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD")),
    
    # Create categorical variables for grouping
    stringency_level = cut(stringency_index,
                          breaks = c(0, 25, 50, 75, 100),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE),
    
    vax_coverage_level = cut(people_fully_vaccinated_per_hundred,
                            breaks = c(-1, 20, 50, 80, 101),
                            labels = c("None/Low", "Moderate", "High", "Very High"),
                            include.lowest = TRUE),
    
    cases_level = cut(new_cases_smoothed_per_million,
                     breaks = quantile(new_cases_smoothed_per_million, 
                                      probs = c(0, 0.25, 0.5, 0.75, 1), 
                                      na.rm = TRUE),
                     labels = c("Very Low", "Low", "Medium", "High"),
                     include.lowest = TRUE)
  )

cat("## Dataset Summary\n")

## ## Dataset Summary

cat("- Total rows:", nrow(df), "\n")

## - Total rows: 41602

cat("- Total columns:", ncol(df), "\n")

## - Total columns: 33

cat("- Time period:", min(df$date), "to", max(df$date), "\n")

## - Time period: 18322 to 18992

PART 1: Three Group-By Analyses

1.1 Group 1: Country Group × Stringency Level

# Group 1: By country group and stringency level
group1 <- df %>%
  filter(!is.na(stringency_level)) %>%
  group_by(country_group, stringency_level) %>%
  summarise(
    avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(country_group, stringency_level, sep = " - ")
  )

# Identify smallest group (lowest probability)
smallest_group1 <- group1 %>%
  filter(n_observations == min(n_observations))

cat("### Group 1: Country Group × Stringency Level\n")

## ### Group 1: Country Group × Stringency Level

cat("Smallest group:", smallest_group1$group_label[1], "\n")

## Smallest group: Non_OECD - Low

cat("Observations:", smallest_group1$n_observations[1], "\n")

## Observations: 185

cat("Probability of randomly selecting this group:", 
    round(smallest_group1$probability[1], 3), "%\n\n")

## Probability of randomly selecting this group: 0.445 %

kable(group1, caption = "Group 1: Country Group × Stringency Level")

Group 1: Country Group × Stringency Level
country_group	stringency_level	avg_cases	avg_deaths	n_observations	probability	group_label
EU	Low	62.133126	0.3704570	547	1.3148406	EU - Low
EU	Medium	193.534525	1.7746789	6899	16.5833373	EU - Medium
EU	High	202.219921	3.4712151	7637	18.3572905	EU - High
EU	Very High	203.870004	5.5129158	2363	5.6800154	EU - Very High
OECD_Non_EU	Low	22.336187	0.0707198	653	1.5696361	OECD_Non_EU - Low
OECD_Non_EU	Medium	112.443445	0.6954080	3321	7.9827893	OECD_Non_EU - Medium
OECD_Non_EU	High	130.117207	2.0423635	5073	12.1941253	OECD_Non_EU - High
OECD_Non_EU	Very High	141.964263	3.0682238	1689	4.0599010	OECD_Non_EU - Very High
Non_OECD	Low	0.035027	0.0014054	185	0.4446902	Non_OECD - Low
Non_OECD	Medium	29.866068	0.8689127	2805	6.7424643	Non_OECD - Medium
Non_OECD	High	69.202567	1.5245072	6869	16.5112254	Non_OECD - High
Non_OECD	Very High	52.313741	1.0143134	3561	8.5596846	Non_OECD - Very High

Insight: The smallest group is Non_OECD - Low with only 185 observations, representing 0.445% of all data.

Significance: This indicates that Non-OECD countries rarely implemented “Very High” stringency policies. If we randomly select any observation from our dataset, there’s only a 0.445% chance it would be from this group.

Hypothesis: Non-OECD countries likely avoided “Very High” stringency due to economic constraints and lower capacity for sustained lockdowns, as strict measures would disproportionately impact informal economies and vulnerable populations.

# Visualization for Group 1
p1 <- ggplot(group1, aes(x = stringency_level, y = country_group, 
                         size = n_observations, color = avg_cases)) +
  geom_point(alpha = 0.8) +
  scale_size_continuous(range = c(3, 10), 
                       breaks = c(1000, 5000, 10000),
                       labels = comma) +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = median(group1$avg_cases, na.rm = TRUE)) +
  labs(title = "Group 1: Observations by Country Group & Stringency Level",
       subtitle = "Point size = Number of observations, Color = Average cases per million",
       x = "Stringency Level",
       y = "Country Group",
       size = "Observations",
       color = "Avg Cases\npM") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"))

print(p1)

1.2 Group 2: Vaccination Coverage × Cases Level

# Group 2: By vaccination coverage and cases level
group2 <- df %>%
  filter(!is.na(vax_coverage_level) & !is.na(cases_level)) %>%
  group_by(vax_coverage_level, cases_level) %>%
  summarise(
    avg_stringency = mean(stringency_index, na.rm = TRUE),
    avg_gdp = mean(gdp_per_capita, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(vax_coverage_level, cases_level, sep = " × ")
  )

# Identify smallest group
smallest_group2 <- group2 %>%
  filter(n_observations == min(n_observations))

cat("### Group 2: Vaccination Coverage × Cases Level\n")

## ### Group 2: Vaccination Coverage × Cases Level

cat("Smallest group:", smallest_group2$group_label[1], "\n")

## Smallest group: Very High × Low

cat("Observations:", smallest_group2$n_observations[1], "\n")

## Observations: 13

cat("Probability:", round(smallest_group2$probability[1], 3), "%\n\n")

## Probability: 0.031 %

kable(group2, caption = "Group 2: Vaccination Coverage × Cases Level")

Group 2: Vaccination Coverage × Cases Level
vax_coverage_level	cases_level	avg_stringency	avg_gdp	n_observations	probability	group_label
None/Low	Very Low	60.75697	11915.54	5503	13.2277294	None/Low × Very Low
None/Low	Low	61.74189	15199.52	4280	10.2879669	None/Low × Low
None/Low	Medium	66.97972	23536.49	3433	8.2520071	None/Low × Medium
None/Low	High	67.40780	27416.18	3269	7.8577953	None/Low × High
Moderate	Very Low	50.03694	34905.47	4602	11.0619682	Moderate × Very Low
Moderate	Low	56.99565	36486.00	5564	13.3743570	Moderate × Low
Moderate	Medium	57.46838	38688.33	5631	13.5354070	Moderate × Medium
Moderate	High	60.07378	36737.51	5165	12.4152685	Moderate × High
High	Very Low	44.94253	38602.16	269	0.6466035	High × Very Low
High	Low	50.34221	29738.85	547	1.3148406	High × Low
High	Medium	48.30341	37223.87	1221	2.9349551	High × Medium
High	High	45.01797	38755.04	1935	4.6512187	High × High
Very High	Very Low	52.92536	34771.24	28	0.0673045	Very High × Very Low
Very High	Low	43.52000	36513.32	13	0.0312485	Very High × Low
Very High	Medium	38.60759	29769.18	116	0.2788327	Very High × Medium
Very High	High	43.02885	34534.15	26	0.0624970	Very High × High

Insight: The smallest group is Very High × Low with 13 observations (0.031% probability).

Significance: “Very High” vaccination coverage rarely coincided with “High” case levels. This suggests that successful vaccination campaigns generally suppressed case surges, or countries with high vaccination rates had better testing/reporting that caught more cases.

Hypothesis: High vaccination coverage reduces transmission enough to prevent case surges, making the combination statistically rare. Alternatively, countries achieving high vaccination rates were generally more competent in pandemic management overall.

# Visualization for Group 2
p2 <- ggplot(group2, aes(x = vax_coverage_level, y = cases_level, 
                         fill = n_observations)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(round(probability, 1), "%")), 
            color = "black", size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     labels = comma) +
  labs(title = "Group 2: Probability Heatmap - Vaccination × Cases",
       subtitle = "Cell value = Probability of randomly selecting this combination",
       x = "Vaccination Coverage Level",
       y = "Cases Level",
       fill = "Observations") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"))

print(p2)

1.3 Group 3: Country Group × Month × Stringency Level

# Group 3: More complex grouping
group3 <- df %>%
  filter(!is.na(stringency_level)) %>%
  mutate(
    pandemic_phase = case_when(
      date < as.Date("2020-07-01") ~ "Initial Wave",
      date < as.Date("2021-01-01") ~ "Second Wave",
      date < as.Date("2021-07-01") ~ "Vaccination Start",
      TRUE ~ "Delta Variant"
    )
  ) %>%
  group_by(country_group, pandemic_phase, stringency_level) %>%
  summarise(
    total_observations = n(),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    probability = total_observations / sum(total_observations) * 100,
    group_label = paste(country_group, pandemic_phase, stringency_level, sep = " | ")
  )

# Identify smallest group
smallest_group3 <- group3 %>%
  filter(total_observations == min(total_observations))

cat("### Group 3: Country Group × Pandemic Phase × Stringency\n")

## ### Group 3: Country Group × Pandemic Phase × Stringency

cat("Smallest group:", smallest_group3$group_label[1], "\n")

## Smallest group: EU | Delta Variant | Very High

cat("Observations:", smallest_group3$total_observations[1], "\n")

## Observations: 14

cat("Probability:", round(smallest_group3$probability[1], 4), "%\n\n")

## Probability: 0.0337 %

kable(head(arrange(group3, total_observations), 10), 
      caption = "Group 3: Smallest Groups (Top 10)")

Group 3: Smallest Groups (Top 10)
country_group	pandemic_phase	stringency_level	total_observations	avg_deaths	probability	group_label
EU	Delta Variant	Very High	14	21.9564286	0.0336522	EU \| Delta Variant \| Very High
EU	Second Wave	Low	25	0.5384000	0.0600933	EU \| Second Wave \| Low
OECD_Non_EU	Delta Variant	Very High	113	0.7392035	0.2716216	OECD_Non_EU \| Delta Variant \| Very High
OECD_Non_EU	Second Wave	Low	128	0.0000000	0.3076775	OECD_Non_EU \| Second Wave \| Low
OECD_Non_EU	Delta Variant	Low	140	0.3200714	0.3365223	OECD_Non_EU \| Delta Variant \| Low
OECD_Non_EU	Vaccination Start	Low	176	0.0036932	0.4230566	OECD_Non_EU \| Vaccination Start \| Low
Non_OECD	Initial Wave	Low	185	0.0014054	0.4446902	Non_OECD \| Initial Wave \| Low
OECD_Non_EU	Initial Wave	Low	209	0.0034450	0.5023797	OECD_Non_EU \| Initial Wave \| Low
Non_OECD	Initial Wave	Medium	224	0.1940625	0.5384357	Non_OECD \| Initial Wave \| Medium
EU	Initial Wave	Low	241	0.0021577	0.5792991	EU \| Initial Wave \| Low

Insight: The smallest combination is EU | Delta Variant | Very High with only 14 observations, a mere 0.0337% chance of random selection.

Significance: Non-OECD countries almost never used “Very High” stringency during the Delta Variant phase, possibly because by that time, they had learned that extreme lockdowns were unsustainable or ineffective.

Hypothesis: By the Delta phase, countries had pandemic experience and shifted to more targeted, sustainable measures rather than blanket extreme restrictions.

# Visualization for Group 3 (faceted)
p3 <- ggplot(group3, aes(x = stringency_level, y = total_observations, 
                         fill = country_group)) +
  geom_col(position = "dodge") +
  facet_wrap(~pandemic_phase, scales = "free_y") +
  scale_y_continuous(labels = comma) +
  labs(title = "Group 3: Observations by Pandemic Phase, Stringency & Country Group",
       subtitle = "Rarest combinations have the fewest bars",
       x = "Stringency Level",
       y = "Number of Observations",
       fill = "Country Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        legend.position = "bottom")

print(p3)

PART 2: Categorical Combinations Analysis

2.1 All Combinations of Two Categorical Variables

# Choose two categorical variables
cat_vars <- df %>%
  select(country_group, stringency_level, vax_coverage_level, cases_level) %>%
  filter(complete.cases(.))

# Get all unique combinations
combinations <- cat_vars %>%
  count(country_group, stringency_level, name = "count") %>%
  mutate(
    probability = count / sum(count) * 100,
    combination = paste(country_group, stringency_level, sep = " - ")
  )

# Find missing combinations
all_possible <- expand.grid(
  country_group = unique(df$country_group),
  stringency_level = unique(na.omit(df$stringency_level))
) %>%
  mutate(combo = paste(country_group, stringency_level, sep = " - "))

existing_combos <- unique(combinations$combination)
missing_combos <- setdiff(all_possible$combo, existing_combos)

cat("### All Combinations: Country Group × Stringency Level\n")

## ### All Combinations: Country Group × Stringency Level

cat("Total possible combinations:", nrow(all_possible), "\n")

## Total possible combinations: 12

cat("Existing combinations:", length(existing_combos), "\n")

## Existing combinations: 12

cat("Missing combinations:", length(missing_combos), "\n\n")

## Missing combinations: 0

if(length(missing_combos) > 0) {
  cat("Missing combination(s):\n")
  for(mc in missing_combos) {
    cat("-", mc, "\n")
  }
} else {
  cat("All possible combinations exist in the data.\n")
}

## All possible combinations exist in the data.

# Most and least common combinations
most_common <- combinations %>%
  filter(count == max(count))

least_common <- combinations %>%
  filter(count == min(count))

cat("\n### Most Common Combination:\n")

## 
## ### Most Common Combination:

cat("- Combination:", most_common$combination[1], "\n")

## - Combination: EU - High

cat("- Count:", most_common$count[1], "observations\n")

## - Count: 7637 observations

cat("- Probability:", round(most_common$probability[1], 2), "%\n")

## - Probability: 18.36 %

cat("\n### Least Common Combination:\n")

## 
## ### Least Common Combination:

cat("- Combination:", least_common$combination[1], "\n")

## - Combination: Non_OECD - Low

cat("- Count:", least_common$count[1], "observations\n")

## - Count: 185 observations

cat("- Probability:", round(least_common$probability[1], 3), "%\n")

## - Probability: 0.445 %

Insight: We found 0 missing combinations out of 12 possible. The most common combination (EU - High) occurs 7637 times (18.36% probability), while the least common (Non_OECD - Low) occurs only 185 times (0.445% probability).

Significance: Missing combinations reveal systematic patterns in pandemic responses. For example, if “EU - Low” stringency is missing, it suggests EU countries rarely used minimal restrictions.

Further Questions: Why do these specific combinations not exist? Are they logically impossible, or did real-world constraints prevent them?

# Visualization of combinations
p4 <- ggplot(combinations, aes(x = stringency_level, y = country_group, 
                               fill = count, label = paste0(round(probability, 1), "%"))) +
  geom_tile(color = "white") +
  geom_text(color = ifelse(combinations$probability > 5, "white", "black"), 
            size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     trans = "log10",
                     labels = comma) +
  labs(title = "Categorical Combinations: Country Group × Stringency Level",
       subtitle = "Cell value = Probability percentage, Color intensity = Observation count",
       x = "Stringency Level",
       y = "Country Group",
       fill = "Observations\n(log scale)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "right")

print(p4)

Summary & Conclusions

Key Findings

Rarest Patterns Identified:
- Non-OECD countries rarely used “Very High” stringency measures
- High vaccination coverage rarely coincided with high case levels
- Specific phase-group combinations were exceptionally rare
Probability Insights:
- Random selection would rarely pick observations from smallest groups
- Certain policy-response combinations were systematically avoided
- Missing combinations reveal constraints in real-world pandemic management
Statistical Significance: The rarity of certain groups isn’t random—it reflects economic constraints, policy learning, and practical limitations in pandemic response.

Actionable Recommendations

For Policy Analysis: Focus on common combinations to understand typical responses, but study rare combinations to identify constraints and opportunities
For Data Collection: Ensure sampling captures rare but important scenarios that might be systematically underrepresented
For Future Research: Investigate why certain combinations are missing—are they impossible, ineffective, or simply not tried?

Further Investigation Needed

Causal Analysis: Are rare combinations rare because they’re ineffective, or are they untested opportunities?
Temporal Evolution: Did probability distributions change as the pandemic progressed?
Country-Level Analysis: Which specific countries created the rare combinations, and why?

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_India.utf8  LC_CTYPE=English_India.utf8   
## [3] LC_MONETARY=English_India.utf8 LC_NUMERIC=C                  
## [5] LC_TIME=English_India.utf8    
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] ggrepel_0.9.6   patchwork_1.3.2 knitr_1.51      scales_1.4.0   
##  [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [9] purrr_1.2.0     readr_2.1.6     tidyr_1.3.2     tibble_3.3.0   
## [13] ggplot2_4.0.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     stringi_1.8.7      hms_1.1.4         
##  [5] digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5     grid_4.5.2        
##  [9] timechange_0.3.0   RColorBrewer_1.1-3 fastmap_1.2.0      jsonlite_2.0.0    
## [13] jquerylib_0.1.4    cli_3.6.5          rlang_1.1.6        crayon_1.5.3      
## [17] bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0       yaml_2.3.12       
## [21] otel_0.2.0         tools_4.5.2        parallel_4.5.2     tzdb_0.5.0        
## [25] vctrs_0.6.5        R6_2.6.1           lifecycle_1.0.5    bit_4.6.0         
## [29] vroom_1.6.7        pkgconfig_2.0.3    pillar_1.11.1      bslib_0.9.0       
## [33] gtable_0.3.6       glue_1.8.0         Rcpp_1.1.0         xfun_0.55         
## [37] tidyselect_1.2.1   rstudioapi_0.17.1  farver_2.1.2       htmltools_0.5.9   
## [41] rmarkdown_2.30     labeling_0.4.3     compiler_4.5.2     S7_0.2.1

Week 3 Data Dive: Probability & Group Analysis of COVID-19 Outcomes

Krish Shah

January 27, 2026