Introduction

Audience

This report is written for a public health policy team that needs a clear, non-technical summary of how COVID-19 outcomes differ across countries and what factors are most closely tied to those differences. The intended audience does not need advanced statistics knowledge, so each section starts with a plain-language explanation and then adds the supporting analysis.

Main objective

The main objective of this analysis is to identify which country-level factors are associated with COVID-19 case fatality rate and vaccination coverage, and to use those relationships to support practical public health recommendations.

The analysis asks three related questions:

Do fatality rates differ across continents?
Is vaccination coverage related to lower fatality rates?
Which country-level variables help explain vaccination success and new case burden?

Data source and variables

Variable	What it means	Why it matters
`case_fatality_rate`	Share of cases that ended in death	Main outcome for severity
`vax_coverage`	Vaccination coverage level	Main outcome for vaccine progress
`continent`	Broad geographic region	Used for group comparison
`median_age`	Median age of the population	Captures population structure
`gdp_per_capita`	Economic capacity per person	Captures resource differences
`stringency_index`	How strict policy responses were	Captures response intensity
`reproduction_rate`	Transmission pressure	Captures outbreak momentum

How to read this report

Read the EDA to see the patterns in the data, the hypothesis test to see whether group differences are statistically meaningful, and the regression sections to see which factors matter after controlling for other variables. Read the TL;DR first if you only want the main conclusions.

What is in this report

This report follows the same structure as the weekly assignments, but with stronger interpretation and more direct recommendations:

initial exploratory data analysis with visualizations
clear assumptions and limitations
hypothesis tests where they make sense
regression models where they make sense
interpretation of the results in context
final conclusions and recommendations

Data Preparation

glimpse(covid_model)

## Rows: 39,579
## Columns: 30
## $ iso_code                            <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent                           <chr> "Europe", "Europe", "Europe", "Eur…
## $ location                            <chr> "Austria", "Austria", "Austria", "…
## $ date                                <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million      <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million             <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index                    <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate                   <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred      <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred       <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand          <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy                     <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate               <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence                 <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita                      <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density                  <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age                          <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older                       <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index             <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population                          <int> 8939617, 8939617, 8939617, 8939617…
## $ country_group                       <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year                                <int> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month                               <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month                          <date> 2020-03-01, 2020-03-01, 2020-03-0…
## $ case_fatality_rate                  <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start                    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …

summary(covid_model %>% select(case_fatality_rate, vax_coverage, median_age, gdp_per_capita, stringency_index, reproduction_rate))

##  case_fatality_rate  vax_coverage     median_age    gdp_per_capita 
##  Min.   :0.000000   Min.   : 0.00   Min.   :18.10   Min.   : 1730  
##  1st Qu.:0.005885   1st Qu.: 0.00   1st Qu.:31.90   1st Qu.:17336  
##  Median :0.014787   Median : 0.00   Median :39.70   Median :29481  
##  Mean   :0.027773   Mean   :10.54   Mean   :37.48   Mean   :30159  
##  3rd Qu.:0.030259   3rd Qu.: 5.79   3rd Qu.:43.20   3rd Qu.:42659  
##  Max.   :2.281690   Max.   :84.68   Max.   :48.20   Max.   :94278  
##  stringency_index reproduction_rate
##  Min.   :  0.00   Min.   :0.110    
##  1st Qu.: 45.65   1st Qu.:0.890    
##  Median : 58.04   Median :1.040    
##  Mean   : 58.47   Mean   :1.075    
##  3rd Qu.: 71.76   3rd Qu.:1.220    
##  Max.   :100.00   Max.   :4.650

The data set contains repeated country-level observations over time, so each row represents a country-date combination rather than a single independent country snapshot.

That matters because the same country appears more than once across months, which means the observations are useful for trend analysis, but they are not fully independent in the same way a one-row-per-country data set would be.

For that reason, the report focuses on patterns, associations, and model fit rather than claiming direct causation.
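One way to see the effect of repeated observations is to collapse the data to a single row per country and compare row counts. The sketch below uses a small invented data frame (`toy`) in place of `covid_model`; the values and the choice of a simple per-country mean are assumptions for illustration, not the method used in this report.

```r
# Toy stand-in for covid_model; all values are invented for illustration.
toy <- data.frame(
  location = c("A", "A", "A", "B", "B", "C"),
  continent = c("Europe", "Europe", "Europe", "Asia", "Asia", "Africa"),
  case_fatality_rate = c(0.010, 0.020, 0.030, 0.040, 0.060, 0.050)
)

# Collapse repeated country-date rows into one mean value per country.
country_means <- aggregate(
  case_fatality_rate ~ location + continent,
  data = toy,
  FUN = mean
)

n_country_dates <- nrow(toy)            # rows before collapsing
n_countries     <- nrow(country_means)  # rows after collapsing
```

The gap between `n_country_dates` and `n_countries` shows how much of the apparent sample size comes from the same countries being counted repeatedly.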

Assumptions and interpretation risks

The analysis assumes the key variables are measured consistently enough across countries to make comparisons meaningful. That is reasonable for this project because the data set already combines many countries and dates into a common structure.
Missing values are handled with listwise deletion inside each model-specific data frame. That keeps the code simple and transparent, but it also means the exact sample changes from section to section.
continent and country_group are broad grouping variables. That is useful for summarizing large-scale patterns, but it can hide important differences inside each region or category.
The study is observational. Because the data are not from a randomized experiment, the results should be interpreted as associations, not direct proof that one variable causes another.
Any cutoff used to define a binary outcome, such as high vaccination coverage, makes the story easier to read but reduces detail. To limit that risk, the report explains the cutoff clearly and treats it as a practical classification choice rather than a magical boundary.
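The listwise deletion described above can be sketched as follows. The data frame `toy`, the variable lists, and the names `lm_sketch` and `logit_sketch` are hypothetical; the point is only that dropping incomplete rows separately for each model can leave different rows in each model's data.

```r
# Toy data with missing values in different columns (invented values).
toy <- data.frame(
  case_fatality_rate = c(0.01, 0.02, NA, 0.04),
  vax_coverage       = c(0, 10, 20, NA),
  median_age         = c(30, 35, 40, 45),
  reproduction_rate  = c(1.1, NA, 0.9, 1.0)
)

# Hypothetical variable lists for two different models.
lm_vars    <- c("case_fatality_rate", "vax_coverage", "median_age")
logit_vars <- c("vax_coverage", "median_age", "reproduction_rate")

# Keep only rows that are complete for the variables each model uses.
lm_sketch    <- toy[complete.cases(toy[, lm_vars]), lm_vars]
logit_sketch <- toy[complete.cases(toy[, logit_vars]), logit_vars]

rownames(lm_sketch)     # original rows kept for the linear model
rownames(logit_sketch)  # original rows kept for the logistic model
```

Both sketches keep two rows, but not the same two, which is why sample composition can shift between sections.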

Initial EDA

1. Average case fatality rate over time

ggplot(monthly_trend, aes(x = year_month, y = mean_cfr)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.5) +
  labs(
    title = "Average COVID-19 Case Fatality Rate Over Time",
    subtitle = "Each point shows the average across all country-date observations for that month",
    x = "Year-Month",
    y = "Mean Case Fatality Rate"
  ) +
  theme_minimal()

Interpretation

This figure gives a time-based view of how fatality rates changed across the full data set. If the line rises or falls sharply, that suggests the overall severity pattern changed across the pandemic period. That kind of movement is expected in COVID data because treatment, testing, vaccination, and variants all changed over time.

Why this matters

Time is one of the most important hidden structures in COVID data. Starting with a time trend helps the reader see that the data are not static and that later statistical models need to account for changes over time.
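The plot above relies on a `monthly_trend` table with a `mean_cfr` column, whose construction is not shown in this report. A minimal sketch of how such a table could be built is given below; the toy data and the name `monthly_trend_sketch` are assumptions.

```r
# Toy country-date rows (invented values).
toy <- data.frame(
  date = as.Date(c("2020-03-01", "2020-03-15", "2020-04-02", "2020-04-20")),
  case_fatality_rate = c(0.00, 0.02, 0.03, 0.05)
)

# Label each row with the first day of its month, like the year_month column.
toy$year_month <- as.Date(format(toy$date, "%Y-%m-01"))

# Average the fatality rate across all rows in each month.
monthly_trend_sketch <- aggregate(
  case_fatality_rate ~ year_month,
  data = toy,
  FUN = mean
)
names(monthly_trend_sketch)[2] <- "mean_cfr"
```

Because every country-date row gets equal weight, countries with more rows in a month influence that month's average more.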

2. Case fatality rate by continent

ggplot(covid_model, aes(x = continent, y = case_fatality_rate, fill = continent)) +
  geom_boxplot(alpha = 0.75) +
  labs(
    title = "Case Fatality Rate by Continent",
    subtitle = "The boxplot shows the spread, center, and outliers within each region",
    x = "Continent",
    y = "Case Fatality Rate"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation

This boxplot compares the distribution of fatality rates across continents. It helps answer the question of whether some regions appear to have systematically higher or lower severity levels. Large gaps between the boxes would suggest meaningful regional differences, while heavy overlap would suggest the regions are more similar than they first appear.

Why this matters

A broad regional comparison gives the audience a simple first answer before moving into more detailed modeling. It also helps justify the ANOVA test later in the report.

3. Vaccination coverage versus case fatality rate

ggplot(scatter_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Vaccination Coverage and Case Fatality Rate",
    subtitle = "The fitted line summarizes the overall direction of the relationship",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

Interpretation

This scatterplot shows whether higher vaccination coverage tends to be associated with lower fatality rates. Each dot represents one country-date observation, and the fitted straight line gives a quick summary of the overall trend. A downward slope would mean that observations with more vaccination coverage generally also have lower fatality rates.

Why this matters

This is one of the most important plots in the report because it previews the central public-health story. It also gives an intuitive first check before the regression model formally controls for other variables.

4. Vaccination coverage versus median age

ggplot(scatter_data, aes(x = median_age, y = vax_coverage)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Vaccination Coverage and Median Age",
    subtitle = "Population age structure appears to matter for vaccination uptake",
    x = "Median Age",
    y = "Vaccination Coverage"
  ) +
  theme_minimal()

Interpretation

This graph examines whether countries with older populations tended to have higher vaccination coverage. That relationship is important because age structure can affect vaccine priority, public risk perception, and the ability to reach large coverage levels. If the line trends upward, then median age is a plausible predictor of vaccination success.

Why this matters

This plot helps motivate the logistic regression section. It also shows a relationship that a general audience can understand quickly without needing statistical background.

Part 1: ANOVA - Does fatality rate differ by continent?

Why this analysis matters

A continent-level comparison gives a simple way to test whether fatality rates differ across broad geographic regions. That is useful because a client audience often wants a high-level answer first. If the regions are meaningfully different, then the rest of the report should not treat the world as one uniform population.

Hypotheses

Null hypothesis (H0): Mean case fatality rate is the same across continents.

Alternative hypothesis (H1): At least one continent has a different mean case fatality rate.

Model

summary(anova_fit)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## continent       5   0.61 0.12283   27.41 <2e-16 ***
## Residuals   39573 177.31 0.00448                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc comparisons

TukeyHSD(anova_fit)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = case_fatality_rate ~ continent, data = anova_data)
## 
## $continent
##                                     diff           lwr          upr     p adj
## Asia-Africa                 -0.010170982 -0.0141393942 -0.006202570 0.0000000
## Europe-Africa               -0.005321778 -0.0089669780 -0.001676578 0.0004553
## North America-Africa         0.001072502 -0.0039615256  0.006106529 0.9905634
## Oceania-Africa              -0.018652966 -0.0249545485 -0.012351383 0.0000000
## South America-Africa        -0.006497737 -0.0115463951 -0.001449078 0.0033414
## Europe-Asia                  0.004849204  0.0024208304  0.007277577 0.0000002
## North America-Asia           0.011243484  0.0070066278  0.015480340 0.0000000
## Oceania-Asia                -0.008481984 -0.0141670046 -0.002796964 0.0003058
## South America-Asia           0.003673245 -0.0005809841  0.007927475 0.1359942
## North America-Europe         0.006394280  0.0024585294  0.010330031 0.0000537
## Oceania-Europe              -0.013331188 -0.0187954941 -0.007866882 0.0000000
## South America-Europe        -0.001175958 -0.0051304057  0.002778489 0.9585259
## Oceania-North America       -0.019725468 -0.0261994614 -0.013251474 0.0000000
## South America-North America -0.007570238 -0.0128325196 -0.002307957 0.0005906
## South America-Oceania        0.012155229  0.0056698525  0.018640606 0.0000014

Interpretation

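The ANOVA test checks whether the differences seen in the boxplot are large enough to be unlikely under random variation alone. Here the F statistic is 27.41 with a p-value below 2e-16, far below the 0.05 threshold, so we reject the null hypothesis and conclude that mean fatality rates are not equal across all continents. In practical terms, that means geography is associated with differences in COVID severity or reporting patterns. Because the rows are repeated country-date observations rather than independent samples, the p-value likely overstates the strength of the evidence.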

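The post-hoc Tukey test then helps identify which continent pairs differ from one another. That matters because a significant ANOVA result only tells us that at least one group differs; it does not say exactly which ones. In this output, 12 of the 15 pairs differ at the 0.05 level. The exceptions are North America-Africa, South America-Asia, and South America-Europe. Oceania has a significantly lower mean fatality rate than every other continent.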
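Statistical significance does not say how large the continent effect is. As a rough check, the share of variation attributable to continent (often called eta-squared) can be computed from the rounded sums of squares in the ANOVA table above. This calculation is an added illustration and was not part of the original analysis.

```r
# Values copied from the ANOVA table (already rounded in the printed output).
ss_continent <- 0.61
ss_residual  <- 177.31
ms_continent <- 0.12283
ms_residual  <- 0.00448

# The F statistic is the ratio of the two mean squares.
f_check <- ms_continent / ms_residual

# Eta-squared: between-continent variation as a share of total variation.
eta_sq <- ss_continent / (ss_continent + ss_residual)

round(f_check, 2)
round(eta_sq, 4)
```

Under these rounded inputs, continent accounts for well under 1% of the total variation in fatality rate, so the differences are statistically detectable but small relative to the spread within each continent.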

Why this matters

This section gives the audience a clear yes/no answer to a simple question. It also shows that regional differences, if present, are worth taking seriously in later modeling and recommendations.

Part 2: Linear Regression - What predicts case fatality rate?

Why this analysis matters

The ANOVA gives a broad comparison, but it does not explain the drivers behind fatality rates. A multiple linear regression model helps answer a more practical question: when vaccination, demographics, and economic conditions are considered together, which factors are most closely associated with case fatality rate?

Variables used

Response variable: case_fatality_rate
Predictors: vax_coverage, median_age, gdp_per_capita, stringency_index

These predictors were selected because they represent vaccination progress, age structure, economic capacity, and policy response.

Model fit

summary(lm_fit)

## 
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage + median_age + 
##     gdp_per_capita + stringency_index, data = lm_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04686 -0.01995 -0.01151  0.00183  2.25340 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.038e-03  2.327e-03  -2.595  0.00946 ** 
## vax_coverage     -1.508e-04  1.651e-05  -9.136  < 2e-16 ***
## median_age        7.986e-04  5.438e-05  14.687  < 2e-16 ***
## gdp_per_capita   -4.104e-07  2.262e-08 -18.147  < 2e-16 ***
## stringency_index  3.052e-04  2.108e-05  14.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06639 on 39574 degrees of freedom
## Multiple R-squared:  0.01968,    Adjusted R-squared:  0.01958 
## F-statistic: 198.6 on 4 and 39574 DF,  p-value: < 2.2e-16

Visual support

ggplot(lm_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Higher Vaccination Coverage Is Associated With Lower Fatality Rates",
    subtitle = "The line summarizes the negative trend in the data",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

This plot gives a direct visual check of the regression relationship. The goal here is not just to see whether the line slopes up or down, but also to see whether the points are roughly centered around a straight-line pattern. If the cloud is extremely curved or highly uneven, then a simple linear model may not be the best choice.

Why this matters

This plot is the simplest way to explain the key regression idea to a general audience. It shows the main relationship without requiring anyone to read coefficient tables first.

Model diagnostics and coefficient summary

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

knitr::kable(
  lm_tidy,
  digits = 4,
  caption = "Linear Regression Coefficients With Confidence Intervals"
)

Linear Regression Coefficients With Confidence Intervals
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-6e-03	0.0023	-2.5950	0.0095	-0.0106	-0.0015
vax_coverage	-2e-04	0.0000	-9.1359	0.0000	-0.0002	-0.0001
median_age	8e-04	0.0001	14.6866	0.0000	0.0007	0.0009
gdp_per_capita	0e+00	0.0000	-18.1468	0.0000	0.0000	0.0000
stringency_index	3e-04	0.0000	14.4798	0.0000	0.0003	0.0003

lm_vif

##     vax_coverage       median_age   gdp_per_capita stringency_index 
##         1.097356         1.474312         1.464795         1.125965

Interpretation

The regression table shows the direction and size of each relationship while controlling for the other variables.
A negative coefficient for vax_coverage means that higher vaccination coverage is associated with lower fatality rates.
A positive coefficient for median_age suggests that older populations experience higher fatality rates, which fits the expectation that older adults are at greater risk of severe COVID outcomes.
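gdp_per_capita and stringency_index capture economic capacity and policy response, and both are statistically significant here. The negative gdp_per_capita coefficient means richer countries tend to have lower fatality rates. The positive stringency_index coefficient means stricter policy periods coincide with higher fatality rates, which may reflect governments tightening rules when outbreaks are severe rather than restrictions raising fatality.
The adjusted R-squared of 0.0196 tells us how much of the variation in fatality rate is explained by the full model: about 2%. The coefficients are statistically significant largely because the sample is very large, but most of the variation in fatality rate is left unexplained.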
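The adjusted R-squared in the model summary can be reproduced from the multiple R-squared, the number of observations, and the number of predictors. The sketch below uses the values printed above; the row count is taken from the residual degrees of freedom (39574) plus the five estimated coefficients.

```r
# Values read from the summary(lm_fit) output.
r_squared <- 0.01968
n <- 39579   # residual df (39574) + 5 estimated coefficients
p <- 4       # number of predictors, excluding the intercept

# Standard adjustment: penalize R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

round(adj_r_squared, 5)
```

With this many rows and only four predictors the adjustment is tiny, which is why the two R-squared values in the output are nearly identical.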

Plain-English coefficient interpretation

A 1-unit increase in each predictor changes fatality rate by the amount shown in the coefficient table, holding the other predictors constant. For a non-technical reader, the most important thing is the direction of the effect. If the vaccination coefficient is negative, that means more vaccination is associated with lower fatality. If the median-age coefficient is positive, that means older populations tend to have higher fatality rates.
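To make the coefficient sizes concrete, the sketch below multiplies each estimate from the model summary by an illustrative step size. The step sizes (10 units of coverage, 5 years of median age, 10,000 units of GDP per capita, 10 stringency points) are assumptions chosen for readability, not values from the report.

```r
# Estimates copied from the summary(lm_fit) coefficient table.
coefs <- c(
  vax_coverage     = -1.508e-04,
  median_age       =  7.986e-04,
  gdp_per_capita   = -4.104e-07,
  stringency_index =  3.052e-04
)

# Illustrative step sizes (assumed for this example).
steps <- c(
  vax_coverage     = 10,
  median_age       = 5,
  gdp_per_capita   = 10000,
  stringency_index = 10
)

# Expected change in case fatality rate, holding other predictors constant.
change <- coefs * steps[names(coefs)]

# Express each change relative to the mean fatality rate in the summary table.
mean_cfr <- 0.027773
relative_change <- change / mean_cfr

round(change, 6)
round(relative_change, 3)
```

Scaling the coefficients this way makes it easier to compare predictors measured in very different units, such as GDP per capita and median age.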

Why this matters

This is the core analytic section of the report. It turns the visual pattern into a controlled comparison and identifies which factors still matter after accounting for the others.

Part 3: Logistic Regression - What predicts high vaccination coverage?

Why this analysis matters

The previous model focused on fatality rates. This section flips the question and asks what helps explain whether a country reaches high vaccination coverage. That is useful because vaccination success is itself an important policy outcome.

Creating the binary outcome

table(logit_data$high_vax)

## 
##     0     1 
## 35441  4138

We define high vaccination coverage as greater than 50 percent. That cutoff is acceptable for this assignment because it creates a clear and easy-to-interpret binary outcome. It also gives the logistic model a meaningful distinction between countries that have reached majority coverage and those that have not.
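The code that creates `high_vax` is not shown in this report. A minimal sketch of the cutoff described above, applied to invented values, might look like the following; it also computes the share of high-coverage rows from the counts in the table output.

```r
# Toy coverage values (invented), including one exactly at the cutoff.
toy <- data.frame(vax_coverage = c(0, 12.5, 50, 50.1, 84.68))

# High coverage means strictly greater than 50 percent.
toy$high_vax <- as.integer(toy$vax_coverage > 50)

# Counts copied from the table(logit_data$high_vax) output.
counts <- c(`0` = 35441, `1` = 4138)
share_high <- counts[["1"]] / sum(counts)

round(share_high, 4)
```

Only about one in ten observations falls in the high-coverage class, so the two classes are quite imbalanced, which matters when judging classification accuracy later.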

Visual support

ggplot(logit_data, aes(x = median_age, y = vax_coverage, color = factor(high_vax))) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Median Age and Vaccination Coverage",
    subtitle = "Countries with older populations are more likely to exceed the 50 percent threshold",
    x = "Median Age",
    y = "Vaccination Coverage",
    color = "High Vax"
  ) +
  theme_minimal()

Interpretation

This scatterplot shows why median age is a plausible predictor. Countries with higher vaccination coverage are concentrated at higher median ages, while lower vaccination coverage appears more common across younger populations. That pattern suggests that demographic structure may be tied to vaccine uptake.

Why this matters

This plot helps a non-technical reader understand why the logistic model was built in the first place. It also connects the model to a policy question: which countries are more likely to cross a meaningful vaccination threshold?

Model

summary(logit_fit)

## 
## Call:
## glm(formula = high_vax ~ reproduction_rate + stringency_index + 
##     median_age, family = binomial, data = logit_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.836460   0.143363  -5.835 5.39e-09 ***
## reproduction_rate -0.213001   0.055685  -3.825 0.000131 ***
## stringency_index  -0.049699   0.001151 -43.174  < 2e-16 ***
## median_age         0.040139   0.002725  14.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26515  on 39578  degrees of freedom
## Residual deviance: 23808  on 39575  degrees of freedom
## AIC: 23816
## 
## Number of Fisher Scoring iterations: 6

Odds ratios and coefficient table

knitr::kable(
  logit_tidy,
  digits = 4,
  caption = "Logistic Regression Odds Ratios and Confidence Intervals"
)

Logistic Regression Odds Ratios and Confidence Intervals
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	0.4332	0.1434	-5.8346	0e+00	0.3268	0.5733
reproduction_rate	0.8082	0.0557	-3.8251	1e-04	0.7241	0.9008
stringency_index	0.9515	0.0012	-43.1736	0e+00	0.9494	0.9537
median_age	1.0410	0.0027	14.7273	0e+00	1.0354	1.0466

logit_or

##       (Intercept) reproduction_rate  stringency_index        median_age 
##         0.4332415         0.8081556         0.9515157         1.0409555

Interpretation

Logistic regression is used here because the response variable is binary.
The coefficients are interpreted on the log-odds scale, but the odds ratios are easier to explain to a client audience. An odds ratio above 1 means the predictor is associated with higher odds of reaching high vaccination coverage, while an odds ratio below 1 means the predictor is associated with lower odds.
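For example, median_age has an odds ratio of about 1.04, so each additional year of median age is associated with roughly 4% higher odds of high vaccination coverage, holding the other predictors constant.
reproduction_rate has an odds ratio of about 0.81, so a one-unit increase in transmission pressure is associated with roughly 19% lower odds of high vaccination coverage.
stringency_index has an odds ratio of about 0.95, so each additional point of policy stringency is associated with roughly 5% lower odds of high vaccination coverage.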

This model is useful because it turns a continuous policy question into a simple decision-oriented outcome: which countries are more likely to cross the 50 percent vaccination threshold?
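The odds ratios reported above are the exponentiated log-odds coefficients from the model summary. The sketch below reproduces that conversion from the printed estimates and expresses each odds ratio as a percent change in odds. The variable names are illustrative.

```r
# Log-odds estimates copied from the summary(logit_fit) output.
log_odds <- c(
  `(Intercept)`     = -0.836460,
  reproduction_rate = -0.213001,
  stringency_index  = -0.049699,
  median_age        =  0.040139
)

# Exponentiating a log-odds coefficient gives the odds ratio.
odds_ratio <- exp(log_odds)

# Percent change in odds for a one-unit increase in each predictor.
pct_change_in_odds <- (odds_ratio - 1) * 100

round(odds_ratio, 4)
round(pct_change_in_odds[-1], 1)  # intercept dropped: it is not a per-unit effect
```

These are changes in odds, not in probability, so a 4% increase in odds does not mean a 4 percentage point increase in the chance of high coverage.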

Why this matters

This section translates the vaccination story into a simple yes/no outcome that is easy to communicate to decision-makers. It also gives the report a second model that checks whether the same broad patterns still appear when the question is reframed.

Model accuracy

logit_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 35391  4138
##          1    50     0
##                                           
##                Accuracy : 0.8942          
##                  95% CI : (0.8911, 0.8972)
##     No Information Rate : 0.8954          
##     P-Value [Acc > NIR] : 0.7968          
##                                           
##                   Kappa : -0.0025         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9986          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8953          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8954          
##          Detection Rate : 0.8942          
##    Detection Prevalence : 0.9987          
##       Balanced Accuracy : 0.4993          
##                                           
##        'Positive' Class : 0               
##

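The confusion matrix checks whether the model classifies country-date observations correctly. This matters because a model can have statistically significant coefficients but still perform poorly in prediction. That is the case here: accuracy (0.8942) is slightly below the no-information rate (0.8954), and the model predicts high coverage for only 50 observations, none of them correctly. Note that the output treats 0 as the positive class, so the reported sensitivity of 0.9986 describes low-coverage observations, while the specificity of 0.0000 means no high-coverage observation was identified.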
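The headline metrics can be recomputed by hand from the four counts in the confusion matrix, which makes it easier to see where each number comes from. This is a sketch using only the printed counts; the object names are illustrative.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = reference).
cm <- matrix(
  c(35391, 50, 4138, 0),
  nrow = 2,
  dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1"))
)

# Share of all observations classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy from always predicting the larger class (the no-information rate).
no_info_rate <- max(colSums(cm)) / sum(cm)

# Share of true low-coverage rows (class 0) predicted as 0.
recall_low <- cm["0", "0"] / sum(cm[, "0"])

# Share of true high-coverage rows (class 1) predicted as 1.
recall_high <- cm["1", "1"] / sum(cm[, "1"])

balanced_accuracy <- (recall_low + recall_high) / 2

round(c(accuracy, no_info_rate, recall_low, recall_high, balanced_accuracy), 4)
```

A balanced accuracy near 0.5 is what a classifier achieves when it effectively ignores one class, which is the pattern these counts show.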

Why this matters

A model is more convincing when it not only looks significant on paper but also classifies well. By that standard, this logistic model is better read as a description of associations than as a tool for predicting which observations reach high coverage.

Conclusions and Recommendations

Key findings

The report shows that COVID-19 outcomes are not evenly distributed across the world.
The EDA suggests that case fatality rate changes over time and differs across continents. That means geography and timing both matter, so a single global average would hide important patterns.
The ANOVA section shows that those regional differences are statistically significant (F = 27.41, p < 2e-16), so continent-level structure should be part of any public health interpretation.
The linear regression model gives a more nuanced view by estimating how vaccination coverage, age structure, economic capacity, and policy stringency are related to fatality rate at the same time. This is useful because it separates the broad visual patterns into a more controlled statistical comparison.
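The logistic regression section shows which factors are associated with high vaccination coverage: older median age raises the odds, while higher policy stringency and a higher reproduction rate lower them. That is helpful for a client audience because it translates the question into a simple yes/no outcome, although the model describes associations better than it classifies individual observations.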

Final recommendation

A reasonable policy recommendation is to prioritize vaccination rollout and maintain targeted public health responses in countries with older populations, higher transmission pressure, and weaker economic capacity.

The model results should be used as guidance rather than as proof of causation, but they are still valuable because they identify which factors are repeatedly associated with worse outcomes and lower vaccination success.

TL;DR - Executive Summary

What was the goal?

This project analyzes global COVID-19 data to identify which country-level factors are most closely associated with case fatality rate and vaccination coverage. The goal is to turn a large and messy public-health dataset into a small set of practical insights that a non-technical audience can use.

What did we find?

COVID-19 case fatality rates vary across regions and over time, but the size of those differences is not the same everywhere.
The ANOVA test finds statistically significant differences in mean fatality rate across continents: the p-value is about 8.41 × 10^-28, far below the 0.05 threshold.
Vaccination coverage has a negative and statistically significant coefficient in the linear model, so higher coverage is associated with lower fatality after controlling for age, income, and policy stringency.
The regression model explains only about 2.0% of the variation in fatality rate, so the predictors are reliably associated with fatality but leave most of the differences across country-date observations unexplained.
Median age is also important, which suggests that country population structure is tied to both severity and vaccination patterns.
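The logistic model shows that the odds of passing the 50 percent vaccination threshold are higher with older median age (odds ratio about 1.04) and lower with higher policy stringency (about 0.95) and higher reproduction rate (about 0.81). The confusion matrix shows that the model does not correctly identify any high-coverage observations, so it is not a reliable classifier.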

What does this mean in plain English?

Geography alone does not explain everything.
Vaccination coverage is one of the clearest levers linked to better outcomes.
Older populations behave differently from younger ones, so a one-size-fits-all policy is not a good fit.
A small number of measurable factors show consistent associations with outcomes, although they explain only a small share of the overall variation, so targeted action is more practical than broad assumptions.

What should be done?

Prioritize vaccination outreach where coverage is still low.
Pay special attention to countries with older populations, because older median age is associated with higher fatality rates in the linear model.
Use data-driven targeting instead of assuming that all regions need the same response.
Keep updating the analysis as new data become available, because COVID patterns changed over time.

What are the limitations?

This is an observational data set, so the results show association, not proof of cause and effect.
Some important factors, such as healthcare quality or local policy enforcement, are not fully measured here.
A binary cutoff for high vaccination coverage is useful for interpretation, but it simplifies a continuous outcome.

Bottom line

The most useful public-health strategy is to focus on measurable factors that move outcomes in a better direction, especially vaccination coverage and population structure. The report shows where those relationships are strongest and gives a practical basis for decision-making.

Final takeaway

The strongest message from the report is that a small number of measurable factors, especially vaccination coverage and demographic structure, are consistently associated with differences in COVID outcomes, even though the models leave most of the variation unexplained. That makes targeted, data-driven action more useful than broad assumptions.

COVID-19 Global Data Dive: Severity, Vaccination, and Policy Factors

Krish Shah

April 21, 2026
