⚠️ Note

This is a draft mockup of the final project. It will be refined further with improved visuals and deeper analysis.

Introduction

Audience

This presentation is directed to a public health policy team that wants to understand how COVID-19 outcomes differ across regions and which country-level factors are most closely associated with severity and vaccination success.

The audience is assumed to be technical enough to follow a data report, but not necessarily a statistics class. Because of that, the report uses short explanations, clear visuals, and plain-language interpretations after each model.

Main objective

The main objective of this analysis is to identify which country-level factors are associated with COVID-19 case fatality rate and vaccination coverage, and to use those relationships to support practical public health recommendations.

The analysis asks three related questions:

Do fatality rates differ across continents?
Is vaccination coverage related to lower fatality rates?
Which country-level variables help explain vaccination success and new case burden?

What is in this report

This report follows the same style as the Week 8 to Week 11 assignments:

initial exploratory data analysis with visualizations
clear assumptions and limitations
hypothesis tests where they make sense
regression models where they make sense
interpretation of the results in context
final conclusions and recommendations

Data Preparation

covid <- read.csv("covid_combined_groups.csv")
covid$date <- as.Date(covid$date)

# Keep the core variables needed for the report
covid_model <- covid %>%
  select(iso_code, continent, location, date,
         new_cases_smoothed_per_million,
         new_deaths_smoothed_per_million,
         total_cases_per_million,
         total_deaths_per_million,
         stringency_index,
         reproduction_rate,
         total_vaccinations_per_hundred,
         people_vaccinated_per_hundred,
         people_fully_vaccinated_per_hundred,
         hospital_beds_per_thousand,
         life_expectancy,
         cardiovasc_death_rate,
         diabetes_prevalence,
         gdp_per_capita,
         population_density,
         median_age,
         aged_65_older,
         human_development_index,
         population,
         country_group,
         year,
         month,
         year_month,
         case_fatality_rate,
         vax_coverage,
         days_since_start) %>%
  mutate(year_month = as.Date(paste0(year_month, "-01"))) %>%
  drop_na(continent, location, case_fatality_rate, vax_coverage, median_age, gdp_per_capita, stringency_index, reproduction_rate, population)

glimpse(covid_model)

## Rows: 39,579
## Columns: 30
## $ iso_code                            <chr> "AUT", "AUT", "AUT", "AUT", "AUT",…
## $ continent                           <chr> "Europe", "Europe", "Europe", "Eur…
## $ location                            <chr> "Austria", "Austria", "Austria", "…
## $ date                                <date> 2020-03-01, 2020-03-02, 2020-03-0…
## $ new_cases_smoothed_per_million      <dbl> 0.11, 0.11, 0.11, 0.11, 0.11, 0.11…
## $ new_deaths_smoothed_per_million     <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ total_cases_per_million             <dbl> 0.77, 0.77, 0.77, 0.77, 0.77, 0.77…
## $ total_deaths_per_million            <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00…
## $ stringency_index                    <dbl> 11.11, 11.11, 11.11, 11.11, 11.11,…
## $ reproduction_rate                   <dbl> 1.07, 1.07, 1.07, 1.07, 1.07, 1.07…
## $ total_vaccinations_per_hundred      <dbl> 69.3, 69.3, 69.3, 69.3, 69.3, 69.3…
## $ people_vaccinated_per_hundred       <dbl> 43.6, 43.6, 43.6, 43.6, 43.6, 43.6…
## $ people_fully_vaccinated_per_hundred <dbl> 30.58, 30.58, 30.58, 30.58, 30.58,…
## $ hospital_beds_per_thousand          <dbl> 7.37, 7.37, 7.37, 7.37, 7.37, 7.37…
## $ life_expectancy                     <dbl> 81.54, 81.54, 81.54, 81.54, 81.54,…
## $ cardiovasc_death_rate               <dbl> 145.18, 145.18, 145.18, 145.18, 14…
## $ diabetes_prevalence                 <dbl> 6.35, 6.35, 6.35, 6.35, 6.35, 6.35…
## $ gdp_per_capita                      <dbl> 45436.69, 45436.69, 45436.69, 4543…
## $ population_density                  <dbl> 106.75, 106.75, 106.75, 106.75, 10…
## $ median_age                          <dbl> 44.4, 44.4, 44.4, 44.4, 44.4, 44.4…
## $ aged_65_older                       <dbl> 19.2, 19.2, 19.2, 19.2, 19.2, 19.2…
## $ human_development_index             <dbl> 0.92, 0.92, 0.92, 0.92, 0.92, 0.92…
## $ population                          <int> 8939617, 8939617, 8939617, 8939617…
## $ country_group                       <chr> "EU", "EU", "EU", "EU", "EU", "EU"…
## $ year                                <int> 2020, 2020, 2020, 2020, 2020, 2020…
## $ month                               <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ year_month                          <date> 2020-03-01, 2020-03-01, 2020-03-0…
## $ case_fatality_rate                  <dbl> 0.000000000, 0.000000000, 0.000000…
## $ vax_coverage                        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ days_since_start                    <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, …

summary(covid_model %>% select(case_fatality_rate, vax_coverage, median_age, gdp_per_capita, stringency_index, reproduction_rate))

##  case_fatality_rate  vax_coverage     median_age    gdp_per_capita 
##  Min.   :0.000000   Min.   : 0.00   Min.   :18.10   Min.   : 1730  
##  1st Qu.:0.005885   1st Qu.: 0.00   1st Qu.:31.90   1st Qu.:17336  
##  Median :0.014787   Median : 0.00   Median :39.70   Median :29481  
##  Mean   :0.027773   Mean   :10.54   Mean   :37.48   Mean   :30159  
##  3rd Qu.:0.030259   3rd Qu.: 5.79   3rd Qu.:43.20   3rd Qu.:42659  
##  Max.   :2.281690   Max.   :84.68   Max.   :48.20   Max.   :94278  
##  stringency_index reproduction_rate
##  Min.   :  0.00   Min.   :0.110    
##  1st Qu.: 45.65   1st Qu.:0.890    
##  Median : 58.04   Median :1.040    
##  Mean   : 58.47   Mean   :1.075    
##  3rd Qu.: 71.76   3rd Qu.:1.220    
##  Max.   :100.00   Max.   :4.650

The data set contains repeated country-level observations over time, so each row represents a country-date combination rather than a single independent country snapshot.

That matters because the same country appears more than once across months, which means the observations are useful for trend analysis, but they are not fully independent in the same way a one-row-per-country data set would be.

For that reason, the report focuses on patterns, associations, and model fit rather than claiming direct causation.

Initial EDA

1. Average case fatality rate over time

monthly_trend <- covid_model %>%
  group_by(year_month) %>%
  summarise(
    mean_cfr = mean(case_fatality_rate, na.rm = TRUE),
    mean_vax = mean(vax_coverage, na.rm = TRUE),
    mean_stringency = mean(stringency_index, na.rm = TRUE),
    .groups = "drop"
  )

ggplot(monthly_trend, aes(x = year_month, y = mean_cfr)) +
  geom_line(linewidth = 1) +
  geom_point(size = 1.5) +
  labs(
    title = "Average COVID-19 Case Fatality Rate Over Time",
    x = "Year-Month",
    y = "Mean Case Fatality Rate"
  ) +
  theme_minimal()

Interpretation:

This plot shows that Europe has the highest variability and the most extreme case fatality rates, with several outliers significantly above other regions. In contrast, Africa, Asia, and South America have relatively lower and more tightly clustered fatality rates, indicating more consistency across observations.
North America shows moderate variability, while Oceania has the lowest overall fatality rates with very few extreme values.
Overall, this indicates that regional differences exist in COVID-19 fatality outcomes, likely driven by differences in healthcare systems, reporting practices, and population characteristics.
The reason for starting here is that time is one of the most important hidden structures in COVID data. Early months of the pandemic usually had different fatality patterns than later months because testing, treatment, and vaccination all changed over time.

2. Case fatality rate by continent

ggplot(covid_model, aes(x = continent, y = case_fatality_rate, fill = continent)) +
  geom_boxplot(alpha = 0.75) +
  labs(
    title = "Case Fatality Rate by Continent",
    x = "Continent",
    y = "Case Fatality Rate"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

This boxplot compares the distribution of fatality rates across continents.

Interpretation:

This scatter plot shows a clear negative relationship between vaccination coverage and case fatality rate. As vaccination coverage increases, the fatality rate consistently decreases and stabilizes near zero.
At low vaccination levels, there is a wide spread of fatality rates, including several extreme high values. However, once vaccination coverage exceeds roughly 50%, fatality rates become consistently low with minimal variation.
This demonstrates that higher vaccination coverage is strongly associated with reduced COVID-19 fatality rates.

3. Vaccination coverage versus case fatality rate

scatter_data <- covid_model %>%
  select(case_fatality_rate, vax_coverage, median_age, gdp_per_capita) %>%
  drop_na()

ggplot(scatter_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Vaccination Coverage and Case Fatality Rate",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

This scatterplot shows whether higher vaccination coverage tends to be associated with lower fatality rates.

Each dot represents one country-date observation. The smooth line is a quick summary of the overall direction of the relationship. If the line slopes downward, then countries and time periods with more vaccination coverage generally also have lower fatality rates.

That does not prove vaccination is the only reason fatality rates change, but it does give a strong first look at whether vaccination should be included in a regression model. It also helps us see whether the relationship looks roughly linear or whether there may be curvature or outliers.

4. Vaccination coverage versus median age

ggplot(scatter_data, aes(x = median_age, y = vax_coverage)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Vaccination Coverage and Median Age",
    x = "Median Age",
    y = "Vaccination Coverage"
  ) +
  theme_minimal()

This graph examines whether older countries tended to have higher vaccination coverage.

That relationship is important because age structure can affect vaccine rollout priority, public risk perception, and the overall ability of a country to reach large coverage levels. If the plot shows a positive trend, then median age may be a strong predictor of vaccination success and should be considered in the logistic regression section.

Assumptions and Interpretation Risks

Several assumptions guide the rest of the report.
First, the analysis assumes the key variables are measured consistently enough across countries to make comparisons meaningful. That is acceptable for a class project because the data set already combines many countries and years into a common format.
Second, missing values are handled with listwise deletion inside each model-specific data frame. That is acceptable here because the goal is to keep the code transparent and easy to reproduce. The tradeoff is that some rows are lost, so the sample used in each model may be slightly different.
Third, the report treats continent and country_group as broad grouping variables. That is useful for summarizing large-scale patterns, but it can hide important differences inside each group. To reduce that interpretation risk, the report does not rely on one variable alone. Instead, it combines visualizations, hypothesis tests, and regression models.
Finally, the analysis is observational. Because the data are not from a randomized experiment, the results should be interpreted as associations, not proof that one variable directly causes another.

Part 1: ANOVA - Does fatality rate differ by continent?

Why this analysis matters

A continent-level comparison gives a simple way to test whether fatality rates differ across broad geographic regions.

This is useful because a client audience often wants a high-level answer first. If the regions are meaningfully different, then the rest of the report should not treat the world as one uniform population.

Hypotheses

Null hypothesis (H0): Mean case fatality rate is the same across continents.

Alternative hypothesis (H1): At least one continent has a different mean case fatality rate.

Model

anova_data <- covid_model %>%
  select(case_fatality_rate, continent) %>%
  drop_na()

anova_fit <- aov(case_fatality_rate ~ continent, data = anova_data)
summary(anova_fit)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## continent       5   0.61 0.12283   27.41 <2e-16 ***
## Residuals   39573 177.31 0.00448                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Post-hoc comparisons

TukeyHSD(anova_fit)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = case_fatality_rate ~ continent, data = anova_data)
## 
## $continent
##                                     diff           lwr          upr     p adj
## Asia-Africa                 -0.010170982 -0.0141393942 -0.006202570 0.0000000
## Europe-Africa               -0.005321778 -0.0089669780 -0.001676578 0.0004553
## North America-Africa         0.001072502 -0.0039615256  0.006106529 0.9905634
## Oceania-Africa              -0.018652966 -0.0249545485 -0.012351383 0.0000000
## South America-Africa        -0.006497737 -0.0115463951 -0.001449078 0.0033414
## Europe-Asia                  0.004849204  0.0024208304  0.007277577 0.0000002
## North America-Asia           0.011243484  0.0070066278  0.015480340 0.0000000
## Oceania-Asia                -0.008481984 -0.0141670046 -0.002796964 0.0003058
## South America-Asia           0.003673245 -0.0005809841  0.007927475 0.1359942
## North America-Europe         0.006394280  0.0024585294  0.010330031 0.0000537
## Oceania-Europe              -0.013331188 -0.0187954941 -0.007866882 0.0000000
## South America-Europe        -0.001175958 -0.0051304057  0.002778489 0.9585259
## Oceania-North America       -0.019725468 -0.0261994614 -0.013251474 0.0000000
## South America-North America -0.007570238 -0.0128325196 -0.002307957 0.0005906
## South America-Oceania        0.012155229  0.0056698525  0.018640606 0.0000014

Interpretation

The ANOVA test checks whether the differences seen in the boxplot are large enough to be unlikely under random variation alone.

If the p-value is below 0.05, we reject the null hypothesis and conclude that fatality rates are not equal across all continents. In practical terms, that means geography is associated with meaningful differences in COVID severity or reporting patterns.

The post-hoc Tukey test then helps identify which continent pairs differ from one another. That matters because a significant ANOVA result only tells us that at least one group differs; it does not say exactly which ones. The pairwise results make the conclusion more actionable.

Part 2: Linear Regression - What predicts case fatality rate?

Why this analysis matters

The ANOVA gives a broad comparison, but it does not explain the drivers behind fatality rates.

A multiple linear regression model helps answer a more practical question: when vaccination, demographics, and economic conditions are considered together, which factors are most closely associated with case fatality rate?

Variables used

Response variable: case_fatality_rate
Predictors: vax_coverage, median_age, gdp_per_capita, stringency_index

These predictors were selected because they represent vaccination progress, age structure, economic capacity, and policy response.

Model fit

lm_data <- covid_model %>%
  select(case_fatality_rate, vax_coverage, median_age, gdp_per_capita, stringency_index) %>%
  drop_na()

lm_fit <- lm(case_fatality_rate ~ vax_coverage + median_age + gdp_per_capita + stringency_index,
             data = lm_data)
summary(lm_fit)

## 
## Call:
## lm(formula = case_fatality_rate ~ vax_coverage + median_age + 
##     gdp_per_capita + stringency_index, data = lm_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.04686 -0.01995 -0.01151  0.00183  2.25340 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.038e-03  2.327e-03  -2.595  0.00946 ** 
## vax_coverage     -1.508e-04  1.651e-05  -9.136  < 2e-16 ***
## median_age        7.986e-04  5.438e-05  14.687  < 2e-16 ***
## gdp_per_capita   -4.104e-07  2.262e-08 -18.147  < 2e-16 ***
## stringency_index  3.052e-04  2.108e-05  14.480  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06639 on 39574 degrees of freedom
## Multiple R-squared:  0.01968,    Adjusted R-squared:  0.01958 
## F-statistic: 198.6 on 4 and 39574 DF,  p-value: < 2.2e-16

Visual support

ggplot(lm_data, aes(x = vax_coverage, y = case_fatality_rate)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Linear Relationship Between Vaccination Coverage and Fatality Rate",
    x = "Vaccination Coverage",
    y = "Case Fatality Rate"
  ) +
  theme_minimal()

This plot gives a direct visual check of the regression relationship.

The goal here is not just to see whether the line slopes up or down, but also to see whether the points are roughly centered around a straight-line pattern. If the cloud is extremely curved or highly uneven, then a simple linear model may not be the best choice.

If the fitted line slopes downward, the interpretation is that higher vaccination coverage is associated with lower fatality rates, holding the other variables constant in the model. That is a strong and intuitive pattern for a public health audience.

Assumptions for the linear model

The main regression assumptions are:

the relationship is approximately linear
residuals are roughly centered around zero
residual spread is not wildly changing across fitted values
the predictors are not excessively collinear

These are reasonable assumptions for an exploratory class project, especially because the goal is to understand patterns rather than produce a final policy model.

Diagnostics

par(mfrow = c(2, 2))
plot(lm_fit)

par(mfrow = c(1, 1))

vif(lm_fit)

##     vax_coverage       median_age   gdp_per_capita stringency_index 
##         1.097356         1.474312         1.464795         1.125965

Interpretation

The regression table shows the direction and size of each relationship while controlling for the other variables.
A negative coefficient for vax_coverage would mean that higher vaccination coverage is associated with lower fatality rates.
A positive coefficient for median_age would suggest that older populations experience higher fatality rates, which makes sense because older adults are at greater risk of severe COVID outcomes.
gdp_per_capita and stringency_index help capture the idea that both economic resources and policy response may shape health outcomes. Even if one coefficient is not statistically significant, it still matters conceptually because it may help reduce omitted variable bias.

The adjusted R-squared value is especially important here because it shows how much of the variation in fatality rate is explained by the full model, not just by one variable alone. In a public health setting, even a modest R-squared can still be useful if the direction of the effects is meaningful and the model is transparent.

Part 3: Logistic Regression - What predicts high vaccination coverage?

Why this analysis matters

The previous model focused on fatality rates.

This section flips the question and asks what helps explain whether a country reaches high vaccination coverage. That is useful because vaccination success is itself an important policy outcome.

Creating the binary outcome

logit_data <- covid_model %>%
  select(vax_coverage, reproduction_rate, stringency_index, median_age) %>%
  drop_na() %>%
  mutate(high_vax = ifelse(vax_coverage > 50, 1, 0))

table(logit_data$high_vax)

## 
##     0     1 
## 35441  4138

We define high vaccination coverage as greater than 50 percent.

That cutoff is acceptable for this assignment because it creates a clear and easy-to-interpret binary outcome. It also gives the logistic model a meaningful distinction between countries that have reached majority coverage and those that have not.

The risk with any cutoff is that it simplifies a continuous variable too much. That risk is mitigated by being explicit about the threshold and by using the model as a classification tool rather than as a claim that 50 percent is a magical boundary.

Visual support

ggplot(logit_data, aes(x = median_age, y = vax_coverage, color = factor(high_vax))) +
  geom_point(alpha = 0.6) +
  labs(
    title = "Median Age and Vaccination Coverage",
    x = "Median Age",
    y = "Vaccination Coverage",
    color = "High Vax"
  ) +
  theme_minimal()

This scatterplot shows why median age is a plausible predictor.

Interpretation:

This plot clearly separates countries into low and high vaccination groups.
Countries with higher vaccination coverage (above 50%) are concentrated at higher median ages, while lower vaccination coverage is more common across all age groups but dominates in younger populations.
This confirms that higher median age is strongly associated with higher vaccination uptake, reinforcing the earlier relationship observed.

Model

logit_fit <- glm(high_vax ~ reproduction_rate + stringency_index + median_age,
                 data = logit_data,
                 family = binomial)
summary(logit_fit)

## 
## Call:
## glm(formula = high_vax ~ reproduction_rate + stringency_index + 
##     median_age, family = binomial, data = logit_data)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.836460   0.143363  -5.835 5.39e-09 ***
## reproduction_rate -0.213001   0.055685  -3.825 0.000131 ***
## stringency_index  -0.049699   0.001151 -43.174  < 2e-16 ***
## median_age         0.040139   0.002725  14.727  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26515  on 39578  degrees of freedom
## Residual deviance: 23808  on 39575  degrees of freedom
## AIC: 23816
## 
## Number of Fisher Scoring iterations: 6

Odds ratios

exp(coef(logit_fit))

##       (Intercept) reproduction_rate  stringency_index        median_age 
##         0.4332415         0.8081556         0.9515157         1.0409555

Interpretation

Logistic regression is used here because the response variable is binary.
The coefficients are interpreted on the log-odds scale, but the odds ratios are easier to explain to a client audience. An odds ratio above 1 means the predictor is associated with higher odds of reaching high vaccination coverage, while an odds ratio below 1 means the predictor is associated with lower odds.
For example, if median_age has an odds ratio above 1, then each additional year in median age increases the odds of high vaccination coverage. If reproduction_rate has an odds ratio below 1, then more transmission pressure is associated with lower odds of high vaccination coverage.

This model is useful because it turns a continuous policy question into a simple decision-oriented outcome: which countries are more likely to cross the 50 percent vaccination threshold?

Model accuracy

pred_probs <- predict(logit_fit, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)

confusionMatrix(factor(pred_class), factor(logit_data$high_vax))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 35391  4138
##          1    50     0
##                                           
##                Accuracy : 0.8942          
##                  95% CI : (0.8911, 0.8972)
##     No Information Rate : 0.8954          
##     P-Value [Acc > NIR] : 0.7968          
##                                           
##                   Kappa : -0.0025         
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.9986          
##             Specificity : 0.0000          
##          Pos Pred Value : 0.8953          
##          Neg Pred Value : 0.0000          
##              Prevalence : 0.8954          
##          Detection Rate : 0.8942          
##    Detection Prevalence : 0.9987          
##       Balanced Accuracy : 0.4993          
##                                           
##        'Positive' Class : 0               
##

The confusion matrix checks whether the model classifies countries correctly.

This matters because a model can have statistically significant coefficients but still perform poorly in prediction. Accuracy, sensitivity, and specificity give a fuller view of whether the logistic model is actually useful for classification.

Conclusions and Recommendations

The report shows that COVID-19 outcomes are not evenly distributed across the world.
The EDA suggests that case fatality rate changes over time and differs across continents. That means geography and timing both matter, so a single global average would hide important patterns.
The ANOVA section tests whether those regional differences are statistically meaningful. If the test is significant, the practical conclusion is that continent-level structure should be part of any public health interpretation.
The linear regression model gives a more nuanced view by estimating how vaccination coverage, age structure, economic capacity, and policy stringency are related to fatality rate at the same time. This is useful because it separates the broad visual patterns into a more controlled statistical comparison.
The logistic regression section shows which factors help explain high vaccination coverage. That is especially helpful for a client audience because it translates the question into a simple yes/no outcome.
The Poisson model adds another angle by focusing on case counts. Even if the assumptions are not perfect, it gives a structured way to study outbreak intensity.

Final recommendation

A reasonable policy recommendation is to prioritize vaccination rollout and maintain targeted public health responses in countries with older populations, higher transmission pressure, and weaker economic capacity.
The model results should be used as guidance rather than as proof of causation, but they are still valuable because they identify which factors are repeatedly associated with worse outcomes and lower vaccination success.

COVID-19 Global Data Dive: Severity, Vaccination, and Policy Factors

Krish Shah

April 07, 2026

⚠️ Note

Introduction

Audience

Main objective

What is in this report

Data Preparation

Initial EDA

1. Average case fatality rate over time

2. Case fatality rate by continent

3. Vaccination coverage versus case fatality rate

4. Vaccination coverage versus median age

Assumptions and Interpretation Risks

Part 1: ANOVA - Does fatality rate differ by continent?

Why this analysis matters

Hypotheses

Model

Post-hoc comparisons

Interpretation

Part 2: Linear Regression - What predicts case fatality rate?

Why this analysis matters

Variables used

Model fit

Visual support

Assumptions for the linear model

Diagnostics

Interpretation

Part 3: Logistic Regression - What predicts high vaccination coverage?

Why this analysis matters

Creating the binary outcome

Visual support

Model

Odds ratios

Interpretation

Model accuracy

Conclusions and Recommendations

Final recommendation