For this week’s assignment, I performed a comprehensive probability and group analysis on the curated COVID-19 dataset by conducting three distinct “group-by” investigations using categorical variables like country group, stringency level, vaccination coverage, and case intensity. Each analysis calculated observation probabilities, identified the rarest groups with their statistical significance, and formulated testable hypotheses about why certain combinations were uncommon. I also examined all possible combinations between categorical variables, identified missing pairings, and visualized these relationships through heatmaps and faceted plots to uncover systematic patterns in pandemic responses across different economic development levels.
# Load the curated dataset from Week 2
df <- read_csv("covid_combined_groups.csv")
# Basic preparation
df <- df %>%
mutate(
date = as.Date(date),
year_month = floor_date(date, "month"),
country_group = factor(country_group,
levels = c("EU", "OECD_Non_EU", "Non_OECD")),
# Create categorical variables for grouping
stringency_level = cut(stringency_index,
breaks = c(0, 25, 50, 75, 100),
labels = c("Low", "Medium", "High", "Very High"),
include.lowest = TRUE),
vax_coverage_level = cut(people_fully_vaccinated_per_hundred,
breaks = c(-1, 20, 50, 80, 101),
labels = c("None/Low", "Moderate", "High", "Very High"),
include.lowest = TRUE),
cases_level = cut(new_cases_smoothed_per_million,
breaks = quantile(new_cases_smoothed_per_million,
probs = c(0, 0.25, 0.5, 0.75, 1),
na.rm = TRUE),
labels = c("Very Low", "Low", "Medium", "High"),
include.lowest = TRUE)
)
cat("## Dataset Summary\n")## ## Dataset Summary
## - Total rows: 41602
## - Total columns: 33
cat("- Time period:", format(min(df$date), "%B %d, %Y"), "to", format(max(df$date), "%B %d, %Y"), "
")## - Time period: March 01, 2020 to December 31, 2021
# Group 1: By country group and stringency level
group1 <- df %>%
filter(!is.na(stringency_level)) %>%
group_by(country_group, stringency_level) %>%
summarise(
avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
n_observations = n(),
.groups = "drop"
) %>%
mutate(
probability = n_observations / sum(n_observations) * 100,
group_label = paste(country_group, stringency_level, sep = " - ")
)
# Identify smallest group (lowest probability)
smallest_group1 <- group1 %>%
filter(n_observations == min(n_observations))
cat("### Group 1: Country Group × Stringency Level\n")## ### Group 1: Country Group × Stringency Level
## Smallest group: Non_OECD - Low
## Observations: 185
cat("Probability of randomly selecting this group:",
round(smallest_group1$probability[1], 3), "%\n\n")## Probability of randomly selecting this group: 0.445 %
| country_group | stringency_level | avg_cases | avg_deaths | n_observations | probability | group_label |
|---|---|---|---|---|---|---|
| EU | Low | 62.133126 | 0.3704570 | 547 | 1.3148406 | EU - Low |
| EU | Medium | 193.534525 | 1.7746789 | 6899 | 16.5833373 | EU - Medium |
| EU | High | 202.219921 | 3.4712151 | 7637 | 18.3572905 | EU - High |
| EU | Very High | 203.870004 | 5.5129158 | 2363 | 5.6800154 | EU - Very High |
| OECD_Non_EU | Low | 22.336187 | 0.0707198 | 653 | 1.5696361 | OECD_Non_EU - Low |
| OECD_Non_EU | Medium | 112.443445 | 0.6954080 | 3321 | 7.9827893 | OECD_Non_EU - Medium |
| OECD_Non_EU | High | 130.117207 | 2.0423635 | 5073 | 12.1941253 | OECD_Non_EU - High |
| OECD_Non_EU | Very High | 141.964263 | 3.0682238 | 1689 | 4.0599010 | OECD_Non_EU - Very High |
| Non_OECD | Low | 0.035027 | 0.0014054 | 185 | 0.4446902 | Non_OECD - Low |
| Non_OECD | Medium | 29.866068 | 0.8689127 | 2805 | 6.7424643 | Non_OECD - Medium |
| Non_OECD | High | 69.202567 | 1.5245072 | 6869 | 16.5112254 | Non_OECD - High |
| Non_OECD | Very High | 52.313741 | 1.0143134 | 3561 | 8.5596846 | Non_OECD - Very High |
Insight: The smallest group is Non_OECD - Low with only 185 observations, representing 0.445% of all data.
Significance: This indicates that the rarest observed combination in the data is Non_OECD - Low, not an assumed policy extreme. In practical terms, this means that the least common pairing reflects what countries actually did least often in the dataset, based on observed counts rather than expectations.
Hypothesis: The rarity of Non_OECD - Low may reflect structural or context-specific constraints faced by these countries (e.g., economic capacity, governance, or timing of the pandemic), making this particular policy–country combination uncommon in practice.
# Visualization for Group 1
p1 <- ggplot(group1, aes(x = stringency_level, y = country_group,
size = n_observations, color = avg_cases)) +
geom_point(alpha = 0.8) +
scale_size_continuous(range = c(3, 10),
breaks = c(1000, 5000, 10000),
labels = comma) +
scale_color_gradient2(low = "blue", mid = "yellow", high = "red",
midpoint = median(group1$avg_cases, na.rm = TRUE)) +
labs(title = "Group 1: Observations by Country Group & Stringency Level",
subtitle = "Point size = Number of observations, Color = Average cases per million",
x = "Stringency Level",
y = "Country Group",
size = "Observations",
color = "Avg Cases\npM") +
theme_minimal() +
theme(legend.position = "right",
plot.title = element_text(face = "bold"))
print(p1)# Group 2: By vaccination coverage and cases level
group2 <- df %>%
filter(!is.na(vax_coverage_level) & !is.na(cases_level)) %>%
group_by(vax_coverage_level, cases_level) %>%
summarise(
avg_stringency = mean(stringency_index, na.rm = TRUE),
avg_gdp = mean(gdp_per_capita, na.rm = TRUE),
n_observations = n(),
.groups = "drop"
) %>%
mutate(
probability = n_observations / sum(n_observations) * 100,
group_label = paste(vax_coverage_level, cases_level, sep = " × ")
)
# Identify smallest group
smallest_group2 <- group2 %>%
filter(n_observations == min(n_observations))
cat("### Group 2: Vaccination Coverage × Cases Level\n")## ### Group 2: Vaccination Coverage × Cases Level
## Smallest group: Very High × Low
## Observations: 13
## Probability: 0.031 %
| vax_coverage_level | cases_level | avg_stringency | avg_gdp | n_observations | probability | group_label |
|---|---|---|---|---|---|---|
| None/Low | Very Low | 60.75697 | 11915.54 | 5503 | 13.2277294 | None/Low × Very Low |
| None/Low | Low | 61.74189 | 15199.52 | 4280 | 10.2879669 | None/Low × Low |
| None/Low | Medium | 66.97972 | 23536.49 | 3433 | 8.2520071 | None/Low × Medium |
| None/Low | High | 67.40780 | 27416.18 | 3269 | 7.8577953 | None/Low × High |
| Moderate | Very Low | 50.03694 | 34905.47 | 4602 | 11.0619682 | Moderate × Very Low |
| Moderate | Low | 56.99565 | 36486.00 | 5564 | 13.3743570 | Moderate × Low |
| Moderate | Medium | 57.46838 | 38688.33 | 5631 | 13.5354070 | Moderate × Medium |
| Moderate | High | 60.07378 | 36737.51 | 5165 | 12.4152685 | Moderate × High |
| High | Very Low | 44.94253 | 38602.16 | 269 | 0.6466035 | High × Very Low |
| High | Low | 50.34221 | 29738.85 | 547 | 1.3148406 | High × Low |
| High | Medium | 48.30341 | 37223.87 | 1221 | 2.9349551 | High × Medium |
| High | High | 45.01797 | 38755.04 | 1935 | 4.6512187 | High × High |
| Very High | Very Low | 52.92536 | 34771.24 | 28 | 0.0673045 | Very High × Very Low |
| Very High | Low | 43.52000 | 36513.32 | 13 | 0.0312485 | Very High × Low |
| Very High | Medium | 38.60759 | 29769.18 | 116 | 0.2788327 | Very High × Medium |
| Very High | High | 43.02885 | 34534.15 | 26 | 0.0624970 | Very High × High |
Insight: The smallest group is Very High × Low with 13 observations (0.031% probability).
Significance: This indicates that the rarest observed combination in this grouping is Very High vaccination coverage with Low case levels. In practical terms, this suggests that periods where vaccination coverage was very high while case levels were simultaneously low were relatively uncommon in the dataset. Rather than reflecting a policy extreme, this likely reflects the timing of vaccination rollouts and infection waves across countries.
Hypothesis: The rarity of Very High × Low may reflect the fact that vaccination coverage reached very high levels later in the pandemic, often during or following waves of infection. As a result, there were fewer observed periods where vaccination rates were extremely high while case intensity was simultaneously in the lower quantiles.
# Visualization for Group 2
p2 <- ggplot(group2, aes(x = vax_coverage_level, y = cases_level,
fill = n_observations)) +
geom_tile(color = "white") +
geom_text(aes(label = paste0(round(probability, 1), "%")),
color = "black", size = 3.5) +
scale_fill_gradient(low = "lightblue", high = "darkblue",
labels = comma) +
labs(title = "Group 2: Probability Heatmap - Vaccination × Cases",
subtitle = "Cell value = Probability of randomly selecting this combination",
x = "Vaccination Coverage Level",
y = "Cases Level",
fill = "Observations") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"))
print(p2)# Group 3: More complex grouping
group3 <- df %>%
filter(!is.na(stringency_level)) %>%
mutate(
pandemic_phase = case_when(
date < as.Date("2020-07-01") ~ "Initial Wave",
date < as.Date("2021-01-01") ~ "Second Wave",
date < as.Date("2021-07-01") ~ "Vaccination Start",
TRUE ~ "Delta Variant"
)
) %>%
group_by(country_group, pandemic_phase, stringency_level) %>%
summarise(
total_observations = n(),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
.groups = "drop"
) %>%
mutate(
probability = total_observations / sum(total_observations) * 100,
group_label = paste(country_group, pandemic_phase, stringency_level, sep = " | ")
)
# Identify smallest group
smallest_group3 <- group3 %>%
filter(total_observations == min(total_observations))
cat("### Group 3: Country Group × Pandemic Phase × Stringency\n")## ### Group 3: Country Group × Pandemic Phase × Stringency
## Smallest group: EU | Delta Variant | Very High
## Observations: 14
## Probability: 0.0337 %
| country_group | pandemic_phase | stringency_level | total_observations | avg_deaths | probability | group_label |
|---|---|---|---|---|---|---|
| EU | Delta Variant | Very High | 14 | 21.9564286 | 0.0336522 | EU | Delta Variant | Very High |
| EU | Second Wave | Low | 25 | 0.5384000 | 0.0600933 | EU | Second Wave | Low |
| OECD_Non_EU | Delta Variant | Very High | 113 | 0.7392035 | 0.2716216 | OECD_Non_EU | Delta Variant | Very High |
| OECD_Non_EU | Second Wave | Low | 128 | 0.0000000 | 0.3076775 | OECD_Non_EU | Second Wave | Low |
| OECD_Non_EU | Delta Variant | Low | 140 | 0.3200714 | 0.3365223 | OECD_Non_EU | Delta Variant | Low |
| OECD_Non_EU | Vaccination Start | Low | 176 | 0.0036932 | 0.4230566 | OECD_Non_EU | Vaccination Start | Low |
| Non_OECD | Initial Wave | Low | 185 | 0.0014054 | 0.4446902 | Non_OECD | Initial Wave | Low |
| OECD_Non_EU | Initial Wave | Low | 209 | 0.0034450 | 0.5023797 | OECD_Non_EU | Initial Wave | Low |
| Non_OECD | Initial Wave | Medium | 224 | 0.1940625 | 0.5384357 | Non_OECD | Initial Wave | Medium |
| EU | Initial Wave | Low | 241 | 0.0021577 | 0.5792991 | EU | Initial Wave | Low |
Insight: The smallest combination is EU | Delta Variant | Very High with only 14 observations, a mere 0.0337% chance of random selection.
Significance: This indicates that the rarest observed three-way combination is EU countries during the Delta Variant phase under Very High stringency levels. This suggests that EU countries may not have frequently maintained the highest level of stringency during this specific phase, at least relative to other group–phase–policy combinations in the dataset.
Hypothesis: The rarity of EU | Delta Variant | Very High may reflect evolving policy strategies during the Delta phase, where EU countries may have relied more on vaccination coverage and targeted interventions rather than consistently maintaining the strictest stringency levels. This would result in fewer observations in the “Very High” category during that phase.
# Visualization for Group 3 (faceted)
p3 <- ggplot(group3, aes(x = stringency_level, y = total_observations,
fill = country_group)) +
geom_col(position = "dodge") +
facet_wrap(~pandemic_phase, scales = "free_y") +
scale_y_continuous(labels = comma) +
labs(title = "Group 3: Observations by Pandemic Phase, Stringency & Country Group",
subtitle = "Rarest combinations have the fewest bars",
x = "Stringency Level",
y = "Number of Observations",
fill = "Country Group") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
plot.title = element_text(face = "bold"),
legend.position = "bottom")
print(p3)# Choose two categorical variables
cat_vars <- df %>%
select(country_group, stringency_level, vax_coverage_level, cases_level) %>%
filter(complete.cases(.))
# Get all unique combinations
combinations <- cat_vars %>%
count(country_group, stringency_level, name = "count") %>%
mutate(
probability = count / sum(count) * 100,
combination = paste(country_group, stringency_level, sep = " - ")
)
# Find missing combinations
all_possible <- expand.grid(
country_group = unique(df$country_group),
stringency_level = unique(na.omit(df$stringency_level))
) %>%
mutate(combo = paste(country_group, stringency_level, sep = " - "))
existing_combos <- unique(combinations$combination)
missing_combos <- setdiff(all_possible$combo, existing_combos)
cat("### All Combinations: Country Group × Stringency Level\n")## ### All Combinations: Country Group × Stringency Level
## Total possible combinations: 12
## Existing combinations: 12
## Missing combinations: 0
if(length(missing_combos) > 0) {
cat("Missing combination(s):\n")
for(mc in missing_combos) {
cat("-", mc, "\n")
}
} else {
cat("All possible combinations exist in the data.\n")
}## All possible combinations exist in the data.
# Most and least common combinations
most_common <- combinations %>%
filter(count == max(count))
least_common <- combinations %>%
filter(count == min(count))
cat("\n### Most Common Combination:\n")##
## ### Most Common Combination:
## - Combination: EU - High
## - Count: 7637 observations
## - Probability: 18.36 %
##
## ### Least Common Combination:
## - Combination: Non_OECD - Low
## - Count: 185 observations
## - Probability: 0.445 %
Insight: We found 0 missing combinations out of 12 possible. The most common combination (EU - High) occurs 7637 times (18.36% probability), while the least common (Non_OECD - Low) occurs only 185 times (0.445% probability).
Significance: Missing combinations reveal systematic patterns in pandemic responses. For example, if “EU - Low” stringency is missing, it suggests EU countries rarely used minimal restrictions.
Further Questions: Why do these specific combinations not exist? Are they logically impossible, or did real-world constraints prevent them?
# Visualization of combinations
p4 <- ggplot(combinations, aes(x = stringency_level, y = country_group,
fill = count, label = paste0(round(probability, 1), "%"))) +
geom_tile(color = "white") +
geom_text(color = ifelse(combinations$probability > 5, "white", "black"),
size = 3.5) +
scale_fill_gradient(low = "lightblue", high = "darkblue",
trans = "log10",
labels = comma) +
labs(title = "Categorical Combinations: Country Group × Stringency Level",
subtitle = "Cell value = Probability percentage, Color intensity = Observation count",
x = "Stringency Level",
y = "Country Group",
fill = "Observations\n(log scale)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold"),
legend.position = "right")
print(p4)