Introduction

Audience

This report is written for a public health policy team that needs a clear, non-technical summary of how COVID-19 outcomes differ across countries and what factors are most closely tied to those differences. The intended audience does not need advanced statistics knowledge, so each section starts with a plain-language explanation and then adds the supporting analysis.

Main objective

The main objective of this analysis is to test whether country-level vaccination coverage, demographics, economic conditions, and policy stringency are associated with COVID-19 case fatality rate. That central problem is examined through three related questions:

  • Do fatality rates differ across continents?
  • Is vaccination coverage related to lower fatality rates?
  • Which country-level variables help explain vaccination success and new case burden?

Data source and variables

The analysis uses the Our World in Data (OWID) COVID-19 dataset, which compiles global COVID-19 information from official public health sources such as national health agencies, the WHO, and other government reporting systems. For this project, the data was cleaned and organized into country-date observations, making it suitable for trend analysis and cross-country comparison. OWID is widely used in academic and policy research, which makes it a credible source for this analysis.

Variable            What it means                       Why it matters
case_fatality_rate  Share of cases that ended in death  Main outcome for severity
vax_coverage        Vaccination coverage level          Main outcome for vaccine progress
continent           Broad geographic region             Used for group comparison
median_age          Median age of the population        Captures population structure
gdp_per_capita      Economic capacity per person        Captures resource differences
stringency_index    How strict policy responses were    Captures response intensity
reproduction_rate   Transmission pressure               Captures outbreak momentum

How to read this report

Read the EDA to see the patterns in the data, the hypothesis test to see whether group differences are statistically meaningful, and the regression sections to see which factors remain associated with the outcome after controlling for other variables. If you only want the main conclusions, read the TL;DR at the end of the report first.

What is in this report

This report follows the same structure as the weekly assignments, but with stronger interpretation and more direct recommendations:

  • initial exploratory data analysis with visualizations
  • clear assumptions and limitations
  • hypothesis tests where they make sense
  • regression models where they make sense
  • interpretation of the results in context
  • final conclusions and recommendations

Initial EDA

1. Average case fatality rate over time

ggplot(monthly_trend, aes(x = year_month, y = mean_cfr)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.5) +
  geom_vline(xintercept = as.Date(c("2020-03-01", "2021-06-01", "2021-11-01")),
             linetype = "dashed", alpha = 0.5) +
  annotate("text", x = as.Date("2020-03-01"), y = plot_cfr_max * 0.98,
           label = "Early pandemic waves", angle = 90, vjust = -0.4, size = 3) +
  annotate("text", x = as.Date("2021-06-01"), y = plot_cfr_max * 0.98,
           label = "Delta becomes dominant", angle = 90, vjust = -0.4, size = 3) +
  annotate("text", x = as.Date("2021-11-01"), y = plot_cfr_max * 0.98,
           label = "Omicron emerges", angle = 90, vjust = -0.4, size = 3) +
  labs(
    title = "Average COVID-19 Case Fatality Rate Over Time",
    subtitle = "Annotations highlight major pandemic waves that help explain the peaks and valleys",
    x = "Year-Month",
    y = "Mean Case Fatality Rate"
  ) +
  theme_minimal()

Interpretation

This figure gives a time-based view of how fatality rates changed across the full data set. The annotated peaks and valleys help connect the trend to major pandemic phases, such as the Delta wave in mid-2021 and the emergence of Omicron later in 2021. That context makes the graph easier to interpret because the line is not just moving randomly; it reflects changes in variants, treatment, testing, and vaccination over time.
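The report does not show how monthly_trend was built. The following is a minimal sketch of one way to produce a table with the same two columns, assuming the cleaned data has a Date column called date (a hypothetical name) next to case_fatality_rate. The toy values are made up for illustration and are not results from the report.

```r
# Toy stand-in for the cleaned data; `date` is an assumed column name.
toy <- data.frame(
  date = as.Date(c("2021-06-03", "2021-06-20", "2021-07-05", "2021-07-18")),
  case_fatality_rate = c(0.020, 0.030, 0.010, NA)
)

# Collapse each date to the first day of its month so the x-axis stays a Date,
# which matches the as.Date() values passed to geom_vline() in the plot.
toy$year_month <- as.Date(format(toy$date, "%Y-%m-01"))

# Mean CFR per month. The formula interface drops rows with missing values.
monthly_trend <- aggregate(
  case_fatality_rate ~ year_month,
  data = toy,
  FUN = mean
)
names(monthly_trend)[2] <- "mean_cfr"

# One plausible definition of the label height used by annotate() (assumption).
plot_cfr_max <- max(monthly_trend$mean_cfr)

monthly_trend
```

Keeping year_month as a Date rather than a character string is what lets the dashed reference lines land at the right positions on the axis.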

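Why this matters

The trend shows that timing matters. A single average over the whole period would hide the shifts between pandemic phases, so the later comparisons should be read with the time dimension in mind.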

2. Case fatality rate by continent

p_continent_full <- ggplot(covid_model, aes(x = continent, y = case_fatality_rate, fill = continent)) +
  geom_boxplot(alpha = 0.75) +
  labs(
    title = "Case Fatality Rate by Continent",
    subtitle = "Full distribution view",
    x = "Continent",
    y = "Case Fatality Rate"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p_continent_zoom <- ggplot(covid_model, aes(x = continent, y = case_fatality_rate, fill = continent)) +
  geom_boxplot(alpha = 0.75) +
  coord_cartesian(ylim = c(0, plot_cfr_q95)) +
  labs(
    title = "Case Fatality Rate by Continent",
    subtitle = "Zoomed-in view to make the box ranges easier to compare",
    x = "Continent",
    y = "Case Fatality Rate"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

p_continent_full + p_continent_zoom + plot_layout(ncol = 2)

Interpretation

This boxplot compares the distribution of fatality rates across continents. The zoomed-in view makes the box ranges and median lines easier to compare, especially when a few outliers stretch the full scale. Together, the two panels show both the overall spread and the more readable center of each regional distribution.
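The zoom limit plot_cfr_q95 is not defined in the code shown. A minimal sketch, assuming it is the 95th percentile of case_fatality_rate, is given here with made-up values that include one extreme outlier similar to the one that stretches the full-scale panel.

```r
# Made-up fatality rates: 99 ordinary values plus one extreme outlier.
cfr <- c(seq(0.001, 0.060, length.out = 99), 2.25)

# Assumed definition of the zoom limit: the 95th percentile of the outcome.
plot_cfr_q95 <- unname(quantile(cfr, probs = 0.95, na.rm = TRUE))

# The limit sits near the bulk of the data, far below the outlier.
c(q95 = plot_cfr_q95, max = max(cfr))
```

Passing this limit to coord_cartesian() only changes the visible window. The boxplot statistics are still computed from all observations, which is not the case when limits are set through ylim(), because that drops out-of-range points before the boxes are drawn.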

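Why this matters

This plot motivates the ANOVA section, which tests whether the continent-level differences visible here are larger than random variation would produce.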

3. Vaccination coverage versus case fatality rate

ggplot(scatter_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(plot_vax_q05, plot_vax_q95), ylim = c(plot_cfr_q05, plot_cfr_q95)) +
  labs(
    title = "Vaccination Coverage and Case Fatality Rate",
    subtitle = "Zoomed-in view to make the slope easier to see",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

Interpretation

This scatterplot shows whether higher vaccination coverage tends to be associated with lower fatality rates. Each dot represents one country-date observation, and the fitted line gives a quick summary of the overall trend. The zoomed-in axis makes the slope easier to read, which is important because the relationship is subtle in the full-scale version.

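Why this matters

This plot motivates the linear regression section, which checks whether the relationship between vaccination coverage and fatality rate holds after accounting for age, income, and policy stringency.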

4. Vaccination coverage versus median age

ggplot(scatter_data, aes(x = median_age, y = vax_coverage)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Vaccination Coverage and Median Age",
    subtitle = "Population age structure appears to matter for vaccination uptake",
    x = "Median Age",
    y = "Vaccination Coverage"
  ) +
  theme_minimal()

Interpretation

This graph examines whether countries with older populations tended to have higher vaccination coverage. That relationship is important because age structure can affect vaccine priority, public risk perception, and the ability to reach large coverage levels. If the line trends upward, then median age is a plausible predictor of vaccination success.

Why this matters

This plot helps motivate the logistic regression section. It also shows a relationship that a general audience can understand quickly without needing statistical background.


Part 1: ANOVA - Does fatality rate differ by continent?

Why this analysis matters

A continent-level comparison gives a simple way to test whether fatality rates differ across broad geographic regions. That is useful because a client audience often wants a high-level answer first. If the regions are meaningfully different, then the rest of the report should not treat the world as one uniform population.

Hypotheses

Null hypothesis (H0): Mean case fatality rate is the same across continents.

Alternative hypothesis (H1): At least one continent has a different mean case fatality rate.

Model

summary(anova_fit)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## continent       5   0.61 0.12283   27.41 <2e-16 ***
## Residuals   39573 177.31 0.00448                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc comparisons

TukeyHSD(anova_fit)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = case_fatality_rate ~ continent, data = anova_data)
## 
## $continent
##                                     diff           lwr          upr     p adj
## Asia-Africa                 -0.010170982 -0.0141393942 -0.006202570 0.0000000
## Europe-Africa               -0.005321778 -0.0089669780 -0.001676578 0.0004553
## North America-Africa         0.001072502 -0.0039615256  0.006106529 0.9905634
## Oceania-Africa              -0.018652966 -0.0249545485 -0.012351383 0.0000000
## South America-Africa        -0.006497737 -0.0115463951 -0.001449078 0.0033414
## Europe-Asia                  0.004849204  0.0024208304  0.007277577 0.0000002
## North America-Asia           0.011243484  0.0070066278  0.015480340 0.0000000
## Oceania-Asia                -0.008481984 -0.0141670046 -0.002796964 0.0003058
## South America-Asia           0.003673245 -0.0005809841  0.007927475 0.1359942
## North America-Europe         0.006394280  0.0024585294  0.010330031 0.0000537
## Oceania-Europe              -0.013331188 -0.0187954941 -0.007866882 0.0000000
## South America-Europe        -0.001175958 -0.0051304057  0.002778489 0.9585259
## Oceania-North America       -0.019725468 -0.0261994614 -0.013251474 0.0000000
## South America-North America -0.007570238 -0.0128325196 -0.002307957 0.0005906
## South America-Oceania        0.012155229  0.0056698525  0.018640606 0.0000014

Interpretation

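The ANOVA test checks whether the differences seen in the boxplot are large enough to be unlikely under random variation alone. Here the F statistic is 27.41 on 5 and 39,573 degrees of freedom, with a p-value below 2e-16, so we reject the null hypothesis and conclude that mean fatality rates are not equal across all continents. The differences are small in absolute terms, though: the continent term accounts for about 0.61 of a total sum of squares of roughly 177.9, or about 0.3 percent of the variation, and the largest pairwise gap in means is about 0.02. In practical terms, geography is associated with detectable differences in COVID severity or reporting patterns, but most of the variation lies within continents rather than between them.

The post-hoc Tukey test then identifies which continent pairs differ from one another. That matters because a significant ANOVA result only tells us that at least one group differs; it does not say exactly which ones. Twelve of the fifteen pairs differ at the 0.05 level after adjustment. The three that do not are North America versus Africa, South America versus Asia, and South America versus Europe. Oceania has the lowest mean fatality rate in every comparison, while North America and Africa have the highest.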
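The ANOVA table reports sums of squares as well as a p-value, so the share of variation tied to continent can be checked directly. The sketch below copies the numbers from the summary(anova_fit) output and computes eta-squared, a common effect-size measure that the report itself does not use, along with the upper-tail F probability. The variable names are illustrative.

```r
# Values copied from the summary(anova_fit) output above.
df_between <- 5
df_within  <- 39573
ms_between <- 0.12283
ss_within  <- 177.31
f_value    <- 27.41

# Between-group sum of squares is the mean square times its degrees of freedom.
ss_between <- ms_between * df_between

# Eta-squared: share of total variation associated with continent.
eta_squared <- ss_between / (ss_between + ss_within)

# Upper-tail probability of the F distribution gives the ANOVA p-value.
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(eta_squared, 4)
signif(p_value, 3)
```

Eta-squared comes out at roughly 0.0035, which is why a very small p-value here goes together with small practical differences: with nearly 40,000 observations, even modest gaps between group means are detected.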

Why this matters

This section gives the audience a clear yes/no answer to a simple question. It also shows that regional differences, if present, are worth taking seriously in later modeling and recommendations.


Part 2: Linear Regression - What predicts case fatality rate?

Why this analysis matters

The ANOVA gives a broad comparison, but it does not explain the drivers behind fatality rates. A multiple linear regression model helps answer a more practical question: when vaccination, demographics, and economic conditions are considered together, which factors are most closely associated with case fatality rate?

Variables used

  • Response variable: case_fatality_rate
  • Predictors: vax_coverage, median_age, gdp_per_capita, stringency_index

These predictors were selected because they represent vaccination progress, age structure, economic capacity, and policy response.

Model fit

summary(lm_fit)
## 
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage + median_age + 
##     gdp_per_capita + stringency_index, data = lm_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04686 -0.01995 -0.01151  0.00183  2.25340 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.038e-03  2.327e-03  -2.595  0.00946 ** 
## vax_coverage     -1.508e-04  1.651e-05  -9.136  < 2e-16 ***
## median_age        7.986e-04  5.438e-05  14.687  < 2e-16 ***
## gdp_per_capita   -4.104e-07  2.262e-08 -18.147  < 2e-16 ***
## stringency_index  3.052e-04  2.108e-05  14.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06639 on 39574 degrees of freedom
## Multiple R-squared:  0.01968,    Adjusted R-squared:  0.01958 
## F-statistic: 198.6 on 4 and 39574 DF,  p-value: < 2.2e-16

Visual support

ggplot(lm_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  coord_cartesian(xlim = c(plot_vax_q05, plot_vax_q95), ylim = c(plot_cfr_q05, plot_cfr_q95)) +
  labs(
    title = "Higher Vaccination Coverage Is Associated With Lower Fatality Rates",
    subtitle = "Zoomed-in view to make the negative slope easier to see",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

This plot gives a direct visual check of the regression relationship. The goal here is not just to see whether the line slopes up or down, but also to see whether the points are roughly centered around a straight-line pattern. The zoomed-in axis makes the negative slope easier to read, which addresses the fact that the full-scale graph can look nearly flat.

Why this matters

This plot is the simplest way to explain the key regression idea to a general audience. It shows the main relationship without requiring anyone to read coefficient tables first.

Model diagnostics and coefficient summary

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))
knitr::kable(
  lm_tidy,
  digits = 4,
  caption = "Linear Regression Coefficients With Confidence Intervals"
)

Linear Regression Coefficients With Confidence Intervals

term              estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)       -6e-03    0.0023     -2.5950    0.0095   -0.0106   -0.0015
vax_coverage      -2e-04    0.0000     -9.1359    0.0000   -0.0002   -0.0001
median_age        8e-04     0.0001     14.6866    0.0000   0.0007    0.0009
gdp_per_capita    0e+00     0.0000     -18.1468   0.0000   0.0000    0.0000
stringency_index  3e-04     0.0000     14.4798    0.0000   0.0003    0.0003

Note: the table is rounded to four digits, so gdp_per_capita appears as zero. The unrounded estimate in the summary(lm_fit) output is -4.104e-07, which is small per unit of GDP but not zero.

lm_vif
##     vax_coverage       median_age   gdp_per_capita stringency_index 
##         1.097356         1.474312         1.464795         1.125965

Interpretation

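  • The regression table shows the direction and size of each relationship while controlling for the other variables.
  • The coefficient for vax_coverage is negative (-1.508e-04), so higher vaccination coverage is associated with lower fatality rates.
  • The coefficient for median_age is positive (7.986e-04), which suggests that older populations experience higher fatality rates. That fits the expectation that older adults are at greater risk of severe COVID outcomes.
  • gdp_per_capita and stringency_index capture economic capacity and policy response, and both are statistically significant. Higher GDP per capita is associated with lower fatality, while higher stringency is associated with higher fatality. The stringency result should not be read as evidence that strict policies raise fatality; one plausible explanation is that governments tightened restrictions when outbreaks were already severe. Including both variables also helps reduce omitted variable bias.
  • The variance inflation factors are all below 1.5, which indicates little multicollinearity among the predictors.
  • The adjusted R-squared is 0.0196, so the full model explains about 2 percent of the variation in fatality rate. The coefficients are statistically significant largely because the sample is large (39,579 observations), not because the model predicts fatality well.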

Plain-English coefficient interpretation

A 1-unit increase in a predictor is associated with a change in fatality rate equal to that predictor's coefficient, holding the other predictors constant. For a non-technical reader, the most important thing is the direction of the association. The vaccination coefficient is negative, which means more vaccination is associated with lower fatality. The median-age coefficient is positive, which means older populations tend to have higher fatality rates.
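Because a 1-unit change is tiny for most of these predictors, it can help to scale the coefficients to larger steps. The sketch below copies the estimates from summary(lm_fit) and multiplies each by an illustrative step size. The step sizes are assumptions chosen for readability and do not come from the report, and the results are in the same units as case_fatality_rate.

```r
# Coefficients copied from the summary(lm_fit) output above.
coefs <- c(
  vax_coverage     = -1.508e-04,
  median_age       =  7.986e-04,
  gdp_per_capita   = -4.104e-07,
  stringency_index =  3.052e-04
)

# Illustrative step sizes (assumptions, not taken from the report).
steps <- c(
  vax_coverage     = 10,     # 10 units of vaccination coverage
  median_age       = 5,      # 5 years of median age
  gdp_per_capita   = 10000,  # 10,000 units of GDP per capita
  stringency_index = 10      # 10 index points
)

# Associated change in fatality rate for each step, other predictors held fixed.
change_in_cfr <- coefs * steps[names(coefs)]
round(change_in_cfr, 4)
```

Under these assumed steps, every associated change is well below 0.01 in absolute value, which is consistent with the low share of variation the model explains.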

Why this matters

This is the core analytic section of the report. It turns the visual pattern into a controlled comparison and identifies which factors still matter after accounting for the others.


Part 3: Logistic Regression - What predicts high vaccination coverage?

Why this analysis matters

The previous model focused on fatality rates. This section flips the question and asks what helps explain whether a country reaches high vaccination coverage. That is useful because vaccination success is itself an important policy outcome.

Creating the binary outcome

table(logit_data$high_vax)
## 
##     0     1 
## 35441  4138

We define high vaccination coverage as greater than 50 percent. That cutoff is the midpoint of the coverage scale, not the midpoint of the data and not a claim that 50 percent is enough to stop COVID-19 spread. Only 4,138 of the 39,579 observations (about 10.5 percent) exceed it, so the two classes are heavily imbalanced. The purpose of the threshold is to split the sample into lower- and higher-coverage groups so the logistic model can estimate which factors are associated with crossing it.
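The code that creates high_vax is not shown. A minimal sketch of the rule described above, assuming vax_coverage is on a 0 to 100 scale (as the dashed line at 50 in the next plot suggests), looks like this. The toy values are made up to show how the boundary is handled.

```r
# Toy coverage values on an assumed 0-100 scale.
logit_toy <- data.frame(vax_coverage = c(12.5, 49.9, 50.0, 50.1, 83.0))

# "Greater than 50 percent": exactly 50 is not counted as high coverage.
logit_toy$high_vax <- as.integer(logit_toy$vax_coverage > 50)

table(logit_toy$high_vax)

# Share of observations in the high-coverage class, using the counts
# from table(logit_data$high_vax) in the report.
high_share <- 4138 / (35441 + 4138)
round(high_share, 3)
```

The class share matters for the later accuracy check, because a model that always predicts the majority class would already be right for most observations.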

Visual support

ggplot(logit_data, aes(x = median_age, y = vax_coverage, color = factor(high_vax))) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 50, linetype = "dashed") +
  labs(
    title = "Median Age and Vaccination Coverage",
    subtitle = "The dashed line marks the 50 percent threshold used to define the binary outcome",
    x = "Median Age",
    y = "Vaccination Coverage",
    color = "High Vax"
  ) +
  theme_minimal()

Interpretation

This scatterplot shows why median age is a plausible predictor. Countries with higher vaccination coverage are concentrated at higher median ages, while lower vaccination coverage appears more common across younger populations. That pattern suggests that demographic structure may be tied to vaccine uptake.

Why this matters

This plot helps a non-technical reader understand why the logistic model was built in the first place. It also connects the model to a policy question: which countries are more likely to cross a meaningful vaccination threshold?

Model

summary(logit_fit)
## 
## Call:
## glm(formula = high_vax ~ reproduction_rate + stringency_index + 
##     median_age, family = binomial, data = logit_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.836460   0.143363  -5.835 5.39e-09 ***
## reproduction_rate -0.213001   0.055685  -3.825 0.000131 ***
## stringency_index  -0.049699   0.001151 -43.174  < 2e-16 ***
## median_age         0.040139   0.002725  14.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26515  on 39578  degrees of freedom
## Residual deviance: 23808  on 39575  degrees of freedom
## AIC: 23816
## 
## Number of Fisher Scoring iterations: 6

Odds ratios and coefficient table

knitr::kable(
  logit_tidy,
  digits = 4,
  caption = "Logistic Regression Odds Ratios and Confidence Intervals"
)
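
Logistic Regression Odds Ratios and Confidence Intervals

term               estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)        0.4332    0.1434     -5.8346    0e+00    0.3268    0.5733
reproduction_rate  0.8082    0.0557     -3.8251    1e-04    0.7241    0.9008
stringency_index   0.9515    0.0012     -43.1736   0e+00    0.9494    0.9537
median_age         1.0410    0.0027     14.7273    0e+00    1.0354    1.0466

Note: estimate and the confidence limits are odds ratios, while std.error and statistic are on the log-odds scale. P-values shown as 0e+00 are below 0.00005 after rounding.
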
logit_or
##       (Intercept) reproduction_rate  stringency_index        median_age 
##         0.4332415         0.8081556         0.9515157         1.0409555

Interpretation

  • Logistic regression is used here because the response variable is binary.
  • The coefficients are interpreted on the log-odds scale, but the odds ratios are easier to explain to a client audience. An odds ratio above 1 means the predictor is associated with higher odds of reaching high vaccination coverage, while an odds ratio below 1 means the predictor is associated with lower odds.
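  • median_age has an odds ratio of 1.041, so each additional year of median age is associated with about 4.1 percent higher odds of high vaccination coverage, holding the other predictors constant.
  • reproduction_rate has an odds ratio of 0.808, so a one-unit increase in transmission pressure is associated with about 19 percent lower odds of high vaccination coverage.
  • stringency_index has an odds ratio of 0.952, so each additional index point is associated with about 4.8 percent lower odds of high vaccination coverage.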

This model is useful because it turns a continuous policy question into a simple decision-oriented outcome: which countries are more likely to cross the 50 percent vaccination threshold?
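The odds ratios come from exponentiating the log-odds coefficients. The sketch below copies the estimates from summary(logit_fit) and reproduces that step. The 10-year comparison at the end is an added illustration rather than something the report computes.

```r
# Log-odds coefficients copied from the summary(logit_fit) output above.
log_odds <- c(
  `(Intercept)`     = -0.836460,
  reproduction_rate = -0.213001,
  stringency_index  = -0.049699,
  median_age        =  0.040139
)

# Exponentiating moves each coefficient from the log-odds scale to an odds ratio.
odds_ratio <- exp(log_odds)

# Percent change in the odds for a one-unit increase in each predictor.
pct_change_in_odds <- (odds_ratio - 1) * 100

# Effects compound multiplicatively, so a 10-year gap in median age
# uses exp(10 * b), not 10 * exp(b).
or_median_age_10y <- unname(exp(10 * log_odds["median_age"]))

round(odds_ratio, 4)
round(pct_change_in_odds, 1)
round(or_median_age_10y, 2)
```

Reading the effect per 10 years rather than per year can be easier for a policy audience, since differences in median age between countries are usually much larger than one year.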

Why this matters

This section translates the vaccination story into a simple yes/no outcome that is easy to communicate to decision-makers. It also gives the report a second model that checks whether the same broad patterns still appear when the question is reframed.

Model accuracy

logit_cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 35391  4138
##          1    50     0
##                                           
##                Accuracy : 0.8942          
##                  95% CI : (0.8911, 0.8972)
##     No Information Rate : 0.8954          
##     P-Value [Acc > NIR] : 0.7968          
##                                           
##                   Kappa : -0.0025         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9986          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8953          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8954          
##          Detection Rate : 0.8942          
##    Detection Prevalence : 0.9987          
##       Balanced Accuracy : 0.4993          
##                                           
##        'Positive' Class : 0               
## 

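The confusion matrix checks whether the model classifies observations correctly. This matters because a model can have statistically significant coefficients but still perform poorly in prediction, and that is the case here. The model assigns class 1 (high coverage) to only 50 observations, and all 50 are wrong. Because the 'positive' class in this output is 0, the sensitivity of 0.9986 describes how well the model recovers low-coverage observations, while the specificity of 0.0000 means it identifies none of the 4,138 high-coverage observations. Overall accuracy (0.8942) is slightly below the no-information rate (0.8954) that would be obtained by always predicting low coverage, and kappa is essentially zero.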
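The headline statistics can be recomputed by hand from the four cell counts, which makes it easier to see what each one measures. The sketch below rebuilds the 2x2 table from the output above and treats class 0 as the positive class, as the output does. The object names are illustrative.

```r
# Cell counts copied from the confusion matrix above (rows = prediction,
# columns = reference). matrix() fills column by column.
cm <- matrix(
  c(35391, 50, 4138, 0),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Share of all observations classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy from always predicting the most common reference class.
no_information_rate <- max(colSums(cm)) / sum(cm)

# With class 0 as "positive": sensitivity is the share of true 0s predicted
# as 0, and specificity is the share of true 1s predicted as 1.
sensitivity <- cm["0", "0"] / sum(cm[, "0"])
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(accuracy = accuracy, nir = no_information_rate,
        sensitivity = sensitivity, specificity = specificity), 4)
```

Comparing accuracy with the no-information rate is the quickest check for an imbalanced outcome: accuracy near 0.89 looks strong in isolation but adds nothing when about 90 percent of observations are in one class.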

Why this matters

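A model is more convincing when it does not only look significant on paper but also classifies well. This one does not beat the baseline of always predicting low coverage, so its coefficients are best read as descriptions of association rather than as a tool for predicting which countries will reach high coverage.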


Conclusions and Recommendations

Key findings

  • The report shows that COVID-19 outcomes are not evenly distributed across the world.
  • The EDA suggests that case fatality rate changes over time and differs across continents. That means geography and timing both matter, so a single global average would hide important patterns.
  • The ANOVA finds statistically significant differences in mean fatality rate across continents (F = 27.41, p < 2e-16), although continent accounts for well under 1 percent of the variation. Continent-level structure should still be part of any public health interpretation, but it is far from the whole story.
  • The linear regression model gives a more nuanced view by estimating how vaccination coverage, age structure, economic capacity, and policy stringency are related to fatality rate at the same time. All four predictors are statistically significant, but the model explains only about 2 percent of the variation, so the associations are consistent in direction yet weak in predictive terms.
  • The logistic regression section shows which factors are associated with high vaccination coverage: older median age goes with higher odds, while higher reproduction rate and higher stringency go with lower odds. The model does not classify high-coverage observations better than a baseline that always predicts low coverage, so it describes associations rather than providing a usable classifier.

Final recommendation

A reasonable policy recommendation is to prioritize vaccination rollout and maintain targeted public health responses in countries with older populations, higher transmission pressure, and weaker economic capacity.

The model results should be used as guidance rather than as proof of causation, but they are still valuable because they identify which factors are repeatedly associated with worse outcomes and lower vaccination success.

TL;DR - Executive Summary

What was the goal?

This project analyzes global COVID-19 data to identify which country-level factors are most closely associated with case fatality rate and vaccination coverage. The goal is to turn a large and messy public-health dataset into a small set of practical insights that a non-technical audience can use.

What did we find?

  • COVID-19 case fatality rates vary across regions and over time, but the size of those differences is not the same everywhere.
  • The ANOVA finds statistically significant differences in mean fatality rate across continents. The p-value is 8.41 × 10^{-28}, far below the 0.05 threshold, although continent explains well under 1 percent of the variation.
  • Vaccination coverage has a negative and statistically significant coefficient in the linear model after controlling for age, income, and policy stringency. Median age, GDP per capita, and stringency are also statistically significant.
  • The regression model explains 2.0% of the variation in fatality rate, which means the measured factors capture only a small share of the differences across country-date observations.
  • Median age is also important: it is associated with higher fatality rates and with higher odds of high vaccination coverage, which suggests that population structure is tied to both severity and vaccination patterns.
  • The logistic model shows higher odds of passing the 50 percent vaccination threshold with older median age, and lower odds with higher reproduction rate and higher stringency. The confusion matrix shows that the model does not identify high-coverage observations, with accuracy (0.8942) just below the no-information rate (0.8954).

What does this mean in plain English?

  • Geography alone does not explain everything.
  • Vaccination coverage is one of the clearest levers linked to better outcomes.
  • Older populations behave differently from younger ones, so a one-size-fits-all policy is not a good fit.
  • A small number of measurable factors show consistent associations with outcomes, but they explain only about 2 percent of the variation in fatality rate, so targeted action should draw on other evidence as well.

What should be done?

  • Prioritize vaccination outreach where coverage is still low.
  • Pay special attention to countries with older populations, because those places are more likely to show different outcome patterns.
  • Use data-driven targeting instead of assuming that all regions need the same response.
  • Keep updating the analysis as new data become available, because COVID patterns changed over time.

What are the limitations?

  • This is an observational data set, so the results show association, not proof of cause and effect.
  • Some important factors, such as healthcare quality or local policy enforcement, are not fully measured here.
  • A binary cutoff for high vaccination coverage is useful for interpretation, but it simplifies a continuous outcome.

Bottom line

The most useful public-health strategy is to focus on measurable factors that move outcomes in a better direction, especially vaccination coverage and population structure. The report shows where those relationships are strongest and gives a practical basis for decision-making.


Final takeaway

The strongest message from the report is that a small number of measurable factors, especially vaccination coverage and demographic structure, are consistently associated with differences in COVID outcomes, even though they explain only a small share of the total variation. That makes targeted, data-driven action a more useful starting point than broad assumptions.