This report turns the COVID dataset into a clear story for a public-health audience. Rather than just listing statistics, the goal is to explain what changed over time, which groups carried the highest burden, and which factors seem most useful for decision making.
The central story is simple:
covid <- read.csv("covid_combined_groups.csv")
covid$date <- as.Date(covid$date)
if (!"year_month" %in% names(covid)) {
covid$year_month <- format(covid$date, "%Y-%m")
}
covid <- covid %>%
mutate(
year_month_date = as.Date(paste0(year_month, "-01"))
)
core_cols <- c(
"new_cases_smoothed_per_million",
"new_deaths_smoothed_per_million",
"stringency_index",
"reproduction_rate",
"vax_coverage",
"case_fatality_rate",
"median_age",
"country_group"
)
covid_model <- covid %>%
select(all_of(core_cols)) %>%
drop_na()
monthly_story <- covid %>%
group_by(year_month_date) %>%
summarise(
avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
avg_stringency = mean(stringency_index, na.rm = TRUE),
avg_vax = mean(vax_coverage, na.rm = TRUE),
avg_cfr = mean(case_fatality_rate, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(year_month_date)
summary(monthly_story)## year_month_date avg_cases avg_deaths avg_stringency
## Min. :2020-03-01 Min. : 8.936 Min. :0.2841 Min. :46.16
## 1st Qu.:2020-08-08 1st Qu.: 41.615 1st Qu.:1.0855 1st Qu.:50.99
## Median :2021-01-16 Median :138.032 Median :1.8577 Median :57.75
## Mean :2021-01-15 Mean :130.510 Mean :2.0832 Mean :58.30
## 3rd Qu.:2021-06-23 3rd Qu.:185.168 3rd Qu.:3.0557 3rd Qu.:63.69
## Max. :2021-12-01 Max. :354.838 Max. :4.5739 Max. :80.11
## avg_vax avg_cfr
## Min. : 0.0000 Min. :0.01339
## 1st Qu.: 0.0000 1st Qu.:0.01853
## Median : 0.6324 Median :0.02127
## Mean :10.0270 Mean :0.02789
## 3rd Qu.:18.3425 3rd Qu.:0.02620
## Max. :40.1228 Max. :0.07976
I convert date into a real Date object so the time axis
is usable in plots. I also create year_month_date so the
data can be summarized into monthly trends. That makes the report easier
to read because it reduces day-to-day noise and highlights the larger
pattern.
monthly_long <- monthly_story %>%
pivot_longer(
cols = c(avg_cases, avg_deaths, avg_stringency, avg_vax),
names_to = "series",
values_to = "value"
)
series_labels <- c(
avg_cases = "Cases per million",
avg_deaths = "Deaths per million",
avg_stringency = "Stringency index",
avg_vax = "Vaccination coverage per hundred"
)
ggplot(monthly_long, aes(x = year_month_date, y = value)) +
geom_line(linewidth = 0.7) +
facet_wrap(~series, scales = "free_y", labeller = as_labeller(series_labels)) +
labs(
title = "Monthly Pandemic Story",
x = "Month",
y = "Monthly average"
) +
theme_minimal()Answer: The data tells a story of repeated COVID waves rather than a smooth one-way change. Cases and deaths rise and fall across time, while vaccination coverage increases gradually and stringency changes with the pandemic pressure. That means the pandemic response should be interpreted as a sequence of phases, not a single static period.
The practical message is that policy and vaccination need to be read together. When cases rise, governments respond more aggressively; when vaccination coverage improves, severity should fall over time.
group_summary <- covid %>%
group_by(country_group) %>%
summarise(
avg_cases = mean(new_cases_smoothed_per_million, na.rm = TRUE),
avg_deaths = mean(new_deaths_smoothed_per_million, na.rm = TRUE),
avg_vax = mean(vax_coverage, na.rm = TRUE),
avg_cfr = mean(case_fatality_rate, na.rm = TRUE),
avg_stringency = mean(stringency_index, na.rm = TRUE),
n = n(),
.groups = "drop"
) %>%
arrange(desc(avg_cfr))
kable(group_summary, digits = 2, caption = "Average COVID indicators by country group")| country_group | avg_cases | avg_deaths | avg_vax | avg_cfr | avg_stringency | n |
|---|---|---|---|---|---|---|
| EU | 194.62 | 2.98 | 11.51 | 0.03 | 55.22 | 17446 |
| Non_OECD | 55.55 | 1.23 | 4.85 | 0.03 | 63.29 | 13420 |
| OECD_Non_EU | 119.96 | 1.67 | 14.19 | 0.02 | 56.90 | 10736 |
ggplot(group_summary, aes(x = reorder(country_group, avg_cfr), y = avg_cfr)) +
geom_col() +
coord_flip() +
labs(
title = "Average Case Fatality Rate by Country Group",
x = "Country group",
y = "Average case fatality rate"
) +
theme_minimal()The chart shows a clear difference in average case fatality rates across country groups. EU countries have the highest fatality rate at around 3%, followed by Non-OECD countries at roughly 2.8%. OECD countries outside the EU have the lowest fatality rate at approximately 2.3%.
This indicates that OECD countries outside the EU were more effective at reducing deaths relative to cases. In contrast, EU countries experienced higher mortality among confirmed cases despite similar economic status. Non-OECD countries fall in between, suggesting that factors beyond income level—such as healthcare system response or population structure—play a significant role in outcomes.
Answer: This section identifies which groups appear to carry the heaviest severity burden. The group with the highest average case fatality rate should be treated as the highest-priority audience for stronger intervention, because it combines the largest consequences with the greatest need for support.
The exact ranking comes directly from the table above. The important interpretation is that country groups are not equal in risk, so a one-size-fits-all response is not the best policy choice.
model_data <- covid %>%
select(case_fatality_rate, vax_coverage, stringency_index, reproduction_rate, median_age) %>%
drop_na()
cfr_model <- lm(case_fatality_rate ~ vax_coverage + stringency_index + reproduction_rate + median_age,
data = model_data)
summary(cfr_model)##
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage + stringency_index +
## reproduction_rate + median_age, data = model_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.06065 -0.01943 -0.01088 0.00296 2.24908
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.680e-02 2.618e-03 14.058 < 2e-16 ***
## vax_coverage -1.739e-04 1.641e-05 -10.599 < 2e-16 ***
## stringency_index 2.782e-04 2.095e-05 13.276 < 2e-16 ***
## reproduction_rate -3.205e-02 1.085e-03 -29.533 < 2e-16 ***
## median_age 2.937e-04 4.552e-05 6.452 1.12e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06594 on 39574 degrees of freedom
## Multiple R-squared: 0.03284, Adjusted R-squared: 0.03274
## F-statistic: 335.9 on 4 and 39574 DF, p-value: < 2.2e-16
coef_table <- tidy(cfr_model, conf.int = TRUE)
kable(coef_table, digits = 4, caption = "Linear model for case fatality rate")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 0.0368 | 0.0026 | 14.0582 | 0 | 0.0317 | 0.0419 |
| vax_coverage | -0.0002 | 0.0000 | -10.5990 | 0 | -0.0002 | -0.0001 |
| stringency_index | 0.0003 | 0.0000 | 13.2757 | 0 | 0.0002 | 0.0003 |
| reproduction_rate | -0.0321 | 0.0011 | -29.5333 | 0 | -0.0342 | -0.0299 |
| median_age | 0.0003 | 0.0000 | 6.4517 | 0 | 0.0002 | 0.0004 |
ggplot(model_data, aes(x = vax_coverage, y = case_fatality_rate)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Vaccination Coverage and Case Fatality Rate",
x = "Vaccination coverage per hundred",
y = "Case fatality rate"
) +
theme_minimal()The scatter plot shows a strong downward pattern where case fatality rates decrease as vaccination coverage increases. Countries with very low vaccination coverage display wide variation and include the highest fatality rates, with some extreme values above 2%.
As vaccination coverage rises above 40–50 per hundred people, fatality rates become tightly clustered near zero and extreme values disappear. At the highest vaccination levels (around 70–80), fatality rates are consistently minimal.
This demonstrates that higher vaccination coverage directly corresponds to lower death rates from COVID-19, with the most stable and lowest fatality outcomes observed in highly vaccinated populations.
beta_vax <- coef(cfr_model)["vax_coverage"]
p_vax <- summary(cfr_model)$coefficients["vax_coverage", "Pr(>|t|)"]Answer: The coefficient for vaccination coverage is -2^{-4} with a p-value of 3.27^{-26}. If the coefficient is negative, then higher vaccination coverage is associated with lower case fatality, which supports the public-health expectation that vaccination reduces severe outcomes.
Even if the exact estimate is small, the direction matters because it turns a broad policy idea into a measurable relationship. This is the strongest part of the analysis for an actionable message: increasing vaccine coverage is a practical lever for reducing severity.
cor_cases_stringency <- cor(monthly_story$avg_cases, monthly_story$avg_stringency, use = "complete.obs")
cor_cases_stringency## [1] -0.3128258
ggplot(monthly_story, aes(x = avg_stringency, y = avg_cases)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(
title = "Stringency and Case Burden",
subtitle = paste0("Correlation = ", round(cor_cases_stringency, 3)),
x = "Average stringency index",
y = "Average cases per million"
) +
theme_minimal()The plot shows a clear negative relationship between government stringency and average cases per million, supported by a correlation of -0.313. As the stringency index increases, the number of cases per million steadily decreases.
Countries with lower stringency levels (around 45–55) show higher and more variable case counts, including some of the highest observed values. In contrast, countries with higher stringency levels (above 65) consistently report lower case numbers.
This confirms that stricter policy measures are associated with reduced transmission, leading to a lower overall case burden.
Answer: This section tests whether policy reaction lines up with the severity of the pandemic. A positive relationship would suggest that stringency increases when cases are high, which is consistent with reactive public-health policy. A weak relationship would suggest that policy is not tightly synchronized with the case burden.
Either way, the interpretation is useful because it shows whether governments responded in step with the pandemic or whether there was a delay.
The draft conclusion is that the pandemic story in this dataset is driven by changing waves, uneven burden across country groups, and a meaningful relationship between vaccination coverage and severity. That makes vaccination the clearest actionable lever in the data.
A practical recommendation is to prioritize vaccination efforts and targeted support in the country groups with the highest case fatality and weakest coverage. Policy stringency still matters, but it should be paired with prevention rather than used as the only response.
The clearest public-health message is simple: protect the highest-risk groups first, maintain vaccination coverage, and use policy quickly when case waves rise.