For this week’s assignment, I performed a comprehensive probability and group analysis on the curated COVID-19 dataset by conducting three distinct “group-by” investigations using categorical variables like country group, stringency level, vaccination coverage, and case intensity. Each analysis calculated observation probabilities, identified the rarest groups with their statistical significance, and formulated testable hypotheses about why certain combinations were uncommon. I also examined all possible combinations between categorical variables, identified missing pairings, and visualized these relationships through heatmaps and faceted plots to uncover systematic patterns in pandemic responses across different economic development levels.

Week 3: Probability & Group Analysis

1. Load and Prepare Data

# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")

# Basic preparation
df <- df %>%
  mutate(
    date = as.Date(date),
    year_month = floor_date(date, "month"),
    country_group = factor(country_group, 
                           levels = c("EU", "OECD_Non_EU", "Non_OECD")),
    
    # Create categorical variables for grouping
    stringency_level = cut(stringency_index,
                          breaks = c(0, 25, 50, 75, 100),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE),
    
    vax_coverage_level = cut(people_fully_vaccinated_per_hundred,
                            breaks = c(-1, 20, 50, 80, 101),
                            labels = c("None/Low", "Moderate", "High", "Very High"),
                            include.lowest = TRUE),
    
    cases_level = cut(new_cases_smoothed_per_million,
                     breaks = quantile(new_cases_smoothed_per_million, 
                                      probs = c(0, 0.25, 0.5, 0.75, 1), 
                                      na.rm = TRUE),
                     labels = c("Very Low", "Low", "Medium", "High"),
                     include.lowest = TRUE)
  )

cat("## Dataset Summary\n")

## ## Dataset Summary

cat("- Total rows:", nrow(df), "\n")

## - Total rows: 41602

cat("- Total columns:", ncol(df), "\n")

## - Total columns: 33

cat("- Time period:", format(min(df$date), "%B %d, %Y"), "to", format(max(df$date), "%B %d, %Y"), "
")

## - Time period: March 01, 2020 to December 31, 2021

PART 1: Three Group-By Analyses

1.1 Group 1: Country Group × Stringency Level

# Group 1: By country group and stringency level
group1 <- df %>%
  filter(!is.na(stringency_level)) %>%
  group_by(country_group, stringency_level) %>%
  summarise(
    avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(country_group, stringency_level, sep = " - ")
  )

# Identify smallest group (lowest probability)
smallest_group1 <- group1 %>%
  filter(n_observations == min(n_observations))

cat("### Group 1: Country Group × Stringency Level\n")

## ### Group 1: Country Group × Stringency Level

cat("Smallest group:", smallest_group1$group_label[1], "\n")

## Smallest group: Non_OECD - Low

cat("Observations:", smallest_group1$n_observations[1], "\n")

## Observations: 185

cat("Probability of randomly selecting this group:", 
    round(smallest_group1$probability[1], 3), "%\n\n")

## Probability of randomly selecting this group: 0.445 %

kable(group1, caption = "Group 1: Country Group × Stringency Level")

Group 1: Country Group × Stringency Level
country_group	stringency_level	avg_cases	avg_deaths	n_observations	probability	group_label
EU	Low	62.133126	0.3704570	547	1.3148406	EU - Low
EU	Medium	193.534525	1.7746789	6899	16.5833373	EU - Medium
EU	High	202.219921	3.4712151	7637	18.3572905	EU - High
EU	Very High	203.870004	5.5129158	2363	5.6800154	EU - Very High
OECD_Non_EU	Low	22.336187	0.0707198	653	1.5696361	OECD_Non_EU - Low
OECD_Non_EU	Medium	112.443445	0.6954080	3321	7.9827893	OECD_Non_EU - Medium
OECD_Non_EU	High	130.117207	2.0423635	5073	12.1941253	OECD_Non_EU - High
OECD_Non_EU	Very High	141.964263	3.0682238	1689	4.0599010	OECD_Non_EU - Very High
Non_OECD	Low	0.035027	0.0014054	185	0.4446902	Non_OECD - Low
Non_OECD	Medium	29.866068	0.8689127	2805	6.7424643	Non_OECD - Medium
Non_OECD	High	69.202567	1.5245072	6869	16.5112254	Non_OECD - High
Non_OECD	Very High	52.313741	1.0143134	3561	8.5596846	Non_OECD - Very High

Insight: The smallest group is Non_OECD - Low with only 185 observations, representing 0.445% of all data.

Significance: This indicates that the rarest observed combination in the data is Non_OECD - Low, not an assumed policy extreme. In practical terms, this means that the least common pairing reflects what countries actually did least often in the dataset, based on observed counts rather than expectations.

Hypothesis: The rarity of Non_OECD - Low may reflect structural or context-specific constraints faced by these countries (e.g., economic capacity, governance, or timing of the pandemic), making this particular policy–country combination uncommon in practice.

# Visualization for Group 1
p1 <- ggplot(group1, aes(x = stringency_level, y = country_group, 
                         size = n_observations, color = avg_cases)) +
  geom_point(alpha = 0.8) +
  scale_size_continuous(range = c(3, 10), 
                       breaks = c(1000, 5000, 10000),
                       labels = comma) +
  scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
                       midpoint = median(group1$avg_cases, na.rm = TRUE)) +
  labs(title = "Group 1: Observations by Country Group & Stringency Level",
       subtitle = "Point size = Number of observations, Color = Average cases per million",
       x = "Stringency Level",
       y = "Country Group",
       size = "Observations",
       color = "Avg Cases\npM") +
  theme_minimal() +
  theme(legend.position = "right",
        plot.title = element_text(face = "bold"))

print(p1)

1.2 Group 2: Vaccination Coverage × Cases Level

# Group 2: By vaccination coverage and cases level
group2 <- df %>%
  filter(!is.na(vax_coverage_level) & !is.na(cases_level)) %>%
  group_by(vax_coverage_level, cases_level) %>%
  summarise(
    avg_stringency = mean(stringency_index, na.rm = TRUE),
    avg_gdp = mean(gdp_per_capita, na.rm = TRUE),
    n_observations = n(),
    .groups = "drop"
  ) %>%
  mutate(
    probability = n_observations / sum(n_observations) * 100,
    group_label = paste(vax_coverage_level, cases_level, sep = " × ")
  )

# Identify smallest group
smallest_group2 <- group2 %>%
  filter(n_observations == min(n_observations))

cat("### Group 2: Vaccination Coverage × Cases Level\n")

## ### Group 2: Vaccination Coverage × Cases Level

cat("Smallest group:", smallest_group2$group_label[1], "\n")

## Smallest group: Very High × Low

cat("Observations:", smallest_group2$n_observations[1], "\n")

## Observations: 13

cat("Probability:", round(smallest_group2$probability[1], 3), "%\n\n")

## Probability: 0.031 %

kable(group2, caption = "Group 2: Vaccination Coverage × Cases Level")

Group 2: Vaccination Coverage × Cases Level
vax_coverage_level	cases_level	avg_stringency	avg_gdp	n_observations	probability	group_label
None/Low	Very Low	60.75697	11915.54	5503	13.2277294	None/Low × Very Low
None/Low	Low	61.74189	15199.52	4280	10.2879669	None/Low × Low
None/Low	Medium	66.97972	23536.49	3433	8.2520071	None/Low × Medium
None/Low	High	67.40780	27416.18	3269	7.8577953	None/Low × High
Moderate	Very Low	50.03694	34905.47	4602	11.0619682	Moderate × Very Low
Moderate	Low	56.99565	36486.00	5564	13.3743570	Moderate × Low
Moderate	Medium	57.46838	38688.33	5631	13.5354070	Moderate × Medium
Moderate	High	60.07378	36737.51	5165	12.4152685	Moderate × High
High	Very Low	44.94253	38602.16	269	0.6466035	High × Very Low
High	Low	50.34221	29738.85	547	1.3148406	High × Low
High	Medium	48.30341	37223.87	1221	2.9349551	High × Medium
High	High	45.01797	38755.04	1935	4.6512187	High × High
Very High	Very Low	52.92536	34771.24	28	0.0673045	Very High × Very Low
Very High	Low	43.52000	36513.32	13	0.0312485	Very High × Low
Very High	Medium	38.60759	29769.18	116	0.2788327	Very High × Medium
Very High	High	43.02885	34534.15	26	0.0624970	Very High × High

Insight: The smallest group is Very High × Low with 13 observations (0.031% probability).

Significance: This indicates that the rarest observed combination in this grouping is Very High vaccination coverage with Low case levels. In practical terms, this suggests that periods where vaccination coverage was very high while case levels were simultaneously low were relatively uncommon in the dataset. Rather than reflecting a policy extreme, this likely reflects the timing of vaccination rollouts and infection waves across countries.

Hypothesis: The rarity of Very High × Low may reflect the fact that vaccination coverage reached very high levels later in the pandemic, often during or following waves of infection. As a result, there were fewer observed periods where vaccination rates were extremely high while case intensity was simultaneously in the lower quantiles.

# Visualization for Group 2
p2 <- ggplot(group2, aes(x = vax_coverage_level, y = cases_level, 
                         fill = n_observations)) +
  geom_tile(color = "white") +
  geom_text(aes(label = paste0(round(probability, 1), "%")), 
            color = "black", size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     labels = comma) +
  labs(title = "Group 2: Probability Heatmap - Vaccination × Cases",
       subtitle = "Cell value = Probability of randomly selecting this combination",
       x = "Vaccination Coverage Level",
       y = "Cases Level",
       fill = "Observations") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"))

print(p2)

1.3 Group 3: Country Group × Month × Stringency Level

# Group 3: More complex grouping
group3 <- df %>%
  filter(!is.na(stringency_level)) %>%
  mutate(
    pandemic_phase = case_when(
      date < as.Date("2020-07-01") ~ "Initial Wave",
      date < as.Date("2021-01-01") ~ "Second Wave",
      date < as.Date("2021-07-01") ~ "Vaccination Start",
      TRUE ~ "Delta Variant"
    )
  ) %>%
  group_by(country_group, pandemic_phase, stringency_level) %>%
  summarise(
    total_observations = n(),
    avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    probability = total_observations / sum(total_observations) * 100,
    group_label = paste(country_group, pandemic_phase, stringency_level, sep = " | ")
  )

# Identify smallest group
smallest_group3 <- group3 %>%
  filter(total_observations == min(total_observations))

cat("### Group 3: Country Group × Pandemic Phase × Stringency\n")

## ### Group 3: Country Group × Pandemic Phase × Stringency

cat("Smallest group:", smallest_group3$group_label[1], "\n")

## Smallest group: EU | Delta Variant | Very High

cat("Observations:", smallest_group3$total_observations[1], "\n")

## Observations: 14

cat("Probability:", round(smallest_group3$probability[1], 4), "%\n\n")

## Probability: 0.0337 %

kable(head(arrange(group3, total_observations), 10), 
      caption = "Group 3: Smallest Groups (Top 10)")

Group 3: Smallest Groups (Top 10)
country_group	pandemic_phase	stringency_level	total_observations	avg_deaths	probability	group_label
EU	Delta Variant	Very High	14	21.9564286	0.0336522	EU \| Delta Variant \| Very High
EU	Second Wave	Low	25	0.5384000	0.0600933	EU \| Second Wave \| Low
OECD_Non_EU	Delta Variant	Very High	113	0.7392035	0.2716216	OECD_Non_EU \| Delta Variant \| Very High
OECD_Non_EU	Second Wave	Low	128	0.0000000	0.3076775	OECD_Non_EU \| Second Wave \| Low
OECD_Non_EU	Delta Variant	Low	140	0.3200714	0.3365223	OECD_Non_EU \| Delta Variant \| Low
OECD_Non_EU	Vaccination Start	Low	176	0.0036932	0.4230566	OECD_Non_EU \| Vaccination Start \| Low
Non_OECD	Initial Wave	Low	185	0.0014054	0.4446902	Non_OECD \| Initial Wave \| Low
OECD_Non_EU	Initial Wave	Low	209	0.0034450	0.5023797	OECD_Non_EU \| Initial Wave \| Low
Non_OECD	Initial Wave	Medium	224	0.1940625	0.5384357	Non_OECD \| Initial Wave \| Medium
EU	Initial Wave	Low	241	0.0021577	0.5792991	EU \| Initial Wave \| Low

Insight: The smallest combination is EU | Delta Variant | Very High with only 14 observations, a mere 0.0337% chance of random selection.

Significance: This indicates that the rarest observed three-way combination is EU countries during the Delta Variant phase under Very High stringency levels. This suggests that EU countries may not have frequently maintained the highest level of stringency during this specific phase, at least relative to other group–phase–policy combinations in the dataset.

Hypothesis: The rarity of EU | Delta Variant | Very High may reflect evolving policy strategies during the Delta phase, where EU countries may have relied more on vaccination coverage and targeted interventions rather than consistently maintaining the strictest stringency levels. This would result in fewer observations in the “Very High” category during that phase.

# Visualization for Group 3 (faceted)
p3 <- ggplot(group3, aes(x = stringency_level, y = total_observations, 
                         fill = country_group)) +
  geom_col(position = "dodge") +
  facet_wrap(~pandemic_phase, scales = "free_y") +
  scale_y_continuous(labels = comma) +
  labs(title = "Group 3: Observations by Pandemic Phase, Stringency & Country Group",
       subtitle = "Rarest combinations have the fewest bars",
       x = "Stringency Level",
       y = "Number of Observations",
       fill = "Country Group") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(face = "bold"),
        legend.position = "bottom")

print(p3)

PART 2: Categorical Combinations Analysis

2.1 All Combinations of Two Categorical Variables

# Choose two categorical variables
cat_vars <- df %>%
  select(country_group, stringency_level, vax_coverage_level, cases_level) %>%
  filter(complete.cases(.))

# Get all unique combinations
combinations <- cat_vars %>%
  count(country_group, stringency_level, name = "count") %>%
  mutate(
    probability = count / sum(count) * 100,
    combination = paste(country_group, stringency_level, sep = " - ")
  )

# Find missing combinations
all_possible <- expand.grid(
  country_group = unique(df$country_group),
  stringency_level = unique(na.omit(df$stringency_level))
) %>%
  mutate(combo = paste(country_group, stringency_level, sep = " - "))

existing_combos <- unique(combinations$combination)
missing_combos <- setdiff(all_possible$combo, existing_combos)

cat("### All Combinations: Country Group × Stringency Level\n")

## ### All Combinations: Country Group × Stringency Level

cat("Total possible combinations:", nrow(all_possible), "\n")

## Total possible combinations: 12

cat("Existing combinations:", length(existing_combos), "\n")

## Existing combinations: 12

cat("Missing combinations:", length(missing_combos), "\n\n")

## Missing combinations: 0

if(length(missing_combos) > 0) {
  cat("Missing combination(s):\n")
  for(mc in missing_combos) {
    cat("-", mc, "\n")
  }
} else {
  cat("All possible combinations exist in the data.\n")
}

## All possible combinations exist in the data.

# Most and least common combinations
most_common <- combinations %>%
  filter(count == max(count))

least_common <- combinations %>%
  filter(count == min(count))

cat("\n### Most Common Combination:\n")

## 
## ### Most Common Combination:

cat("- Combination:", most_common$combination[1], "\n")

## - Combination: EU - High

cat("- Count:", most_common$count[1], "observations\n")

## - Count: 7637 observations

cat("- Probability:", round(most_common$probability[1], 2), "%\n")

## - Probability: 18.36 %

cat("\n### Least Common Combination:\n")

## 
## ### Least Common Combination:

cat("- Combination:", least_common$combination[1], "\n")

## - Combination: Non_OECD - Low

cat("- Count:", least_common$count[1], "observations\n")

## - Count: 185 observations

cat("- Probability:", round(least_common$probability[1], 3), "%\n")

## - Probability: 0.445 %

Insight: We found 0 missing combinations out of 12 possible. The most common combination (EU - High) occurs 7637 times (18.36% probability), while the least common (Non_OECD - Low) occurs only 185 times (0.445% probability).

Significance: Missing combinations reveal systematic patterns in pandemic responses. For example, if “EU - Low” stringency is missing, it suggests EU countries rarely used minimal restrictions.

Further Questions: Why do these specific combinations not exist? Are they logically impossible, or did real-world constraints prevent them?

# Visualization of combinations
p4 <- ggplot(combinations, aes(x = stringency_level, y = country_group, 
                               fill = count, label = paste0(round(probability, 1), "%"))) +
  geom_tile(color = "white") +
  geom_text(color = ifelse(combinations$probability > 5, "white", "black"), 
            size = 3.5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue",
                     trans = "log10",
                     labels = comma) +
  labs(title = "Categorical Combinations: Country Group × Stringency Level",
       subtitle = "Cell value = Probability percentage, Color intensity = Observation count",
       x = "Stringency Level",
       y = "Country Group",
       fill = "Observations\n(log scale)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold"),
        legend.position = "right")

print(p4)

Summary & Conclusions

Key Findings

Rarest Patterns Identified:
- Non-OECD countries rarely used “Very High” stringency measures
- High vaccination coverage rarely coincided with high case levels
- Specific phase-group combinations were exceptionally rare
Probability Insights:
- Random selection would rarely pick observations from smallest groups
- Certain policy-response combinations were systematically avoided
- Missing combinations reveal constraints in real-world pandemic management
Statistical Significance: The rarity of certain groups isn’t random—it reflects economic constraints, policy learning, and practical limitations in pandemic response.

Week 3 Data Dive: Probability & Group Analysis of COVID-19 Outcomes

Krish Shah

February 16, 2026