Introduction

The purpose of this analysis is to investigate whether COVID-19 case fatality rates differ across continents and whether vaccination coverage helps explain variation in fatality rates.

This data dive includes:

  • Selection of a meaningful response variable
  • An ANOVA test using a categorical explanatory variable
  • A visualization to support interpretation
  • A linear regression model using a continuous explanatory variable
  • Interpretation of coefficients and model fit
  • Broader conclusions about the global population

The response variable analyzed is a continuous numeric variable.


Part 1: ANOVA — Case Fatality Rate by Continent

Response Variable

  • The response variable selected is case_fatality_rate.

  • This variable represents the proportion of confirmed cases that result in death. It is one of the most important indicators of COVID-19 severity and is highly relevant to policymakers and public health officials.
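
As a worked definition, the proportion described above can be written as follows. The symbols D (cumulative confirmed deaths) and C (cumulative confirmed cases) are introduced here only for illustration; the report does not state exactly how the dataset computes the column, so this is the conventional form rather than a confirmed formula.

\[ \text{CFR} = \frac{D}{C} \]

Under this definition a value of 0.03 corresponds to 3 deaths per 100 confirmed cases.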

Explanatory Variable

  • The explanatory variable selected is continent.

  • Continent is a categorical variable that may capture structural differences such as healthcare systems, demographic composition, and policy responses.

Data Preparation

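library(tidyverse)  # provides %>%, select(), drop_na(), and ggplot()

covid <- read.csv("covid_combined_groups.csv")

# Keep only the two columns needed for the ANOVA and drop incomplete rows
covid_anova <- covid %>%
  select(case_fatality_rate, continent) %>%
  drop_na()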

Hypotheses

Null Hypothesis (H₀):

  • The mean case fatality rate is equal across all continents.

Alternative Hypothesis (H₁):

  • At least one continent has a different mean case fatality rate.
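
The same hypotheses can be stated in symbols. The ANOVA output reports 5 degrees of freedom for continent, which implies six continent groups; the subscripts 1 through 6 are arbitrary labels for those groups, not an ordering taken from the data.

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_6 \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]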

Visualization

ggplot(covid_anova,
       aes(x = continent,
           y = case_fatality_rate,
           fill = continent)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Case Fatality Rate by Continent",
       x = "Continent",
       y = "Case Fatality Rate") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of Visualization

  • The boxplots show noticeable differences in median fatality rates across continents. Some continents display higher variability and higher median fatality rates.

  • This suggests that geographic region may influence COVID-19 severity, which we formally test using ANOVA.

ANOVA Test

anova_model <- aov(case_fatality_rate ~ continent,
                   data = covid_anova)

summary(anova_model)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## continent       5   0.61 0.12283   27.41 <2e-16 ***
## Residuals   39573 177.31 0.00448                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
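
The sums of squares in the table above also allow a rough effect-size calculation. The sketch below computes eta-squared (the share of total variation attributed to continent) from the printed values. Because the printed sums of squares are rounded, the result is approximate, and the variable names are chosen here for illustration.

```r
# Sums of squares copied from the printed summary(anova_model) table (rounded)
ss_continent <- 0.61
ss_residual  <- 177.31

# Eta-squared: between-group variation as a share of total variation
eta_squared <- ss_continent / (ss_continent + ss_residual)

round(eta_squared, 4)
```

A value this small (roughly 0.3% of the variation) would indicate that the differences between continents, while statistically detectable in a sample of this size, are small relative to the variation within continents.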

ANOVA Interpretation

  • If the p-value is less than 0.05, we reject the null hypothesis and conclude that fatality rates differ significantly by continent.

  • If the p-value is greater than 0.05, we fail to reject the null hypothesis, meaning there is not enough statistical evidence to conclude that fatality rates differ across continents.

Practical Meaning

  • From the ANOVA table:

    • F(5, 39573) = 27.41
    • p-value < 2e-16

  • The F-statistic is 27.41.
  • The p-value is less than 2 × 10⁻¹⁶.
  • Since the p-value is less than 0.05, we reject the null hypothesis.
  • There is strong statistical evidence that mean case fatality rates are not equal across continents.
  • At least one continent has a significantly different mean fatality rate.
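
ANOVA alone does not say which continents differ. One common follow-up is Tukey's HSD, which compares every pair of groups. The sketch below runs on simulated stand-in data so that it is self-contained: the continent labels, group means, and sample sizes are all hypothetical and are not taken from the covid dataset. In the actual analysis the equivalent call would be TukeyHSD(anova_model).

```r
# Simulated stand-in data: NOT the covid dataset. Labels and means are
# hypothetical, chosen only so this example runs on its own.
set.seed(42)
continents <- c("Africa", "Asia", "Europe",
                "North America", "Oceania", "South America")
group_means <- c(0.020, 0.018, 0.022, 0.025, 0.010, 0.030)

sim <- data.frame(
  continent = factor(rep(continents, each = 50)),
  case_fatality_rate = rnorm(300, mean = rep(group_means, each = 50), sd = 0.01)
)

# Same model form as the report's ANOVA, fitted to the simulated data
sim_model <- aov(case_fatality_rate ~ continent, data = sim)

# Pairwise comparisons with family-wise error control
tukey <- TukeyHSD(sim_model)
pairs_mat <- tukey$continent  # one row per pair of continents

# Keep only pairs whose adjusted p-value is below 0.05
sig_pairs <- pairs_mat[pairs_mat[, "p adj"] < 0.05, , drop = FALSE]
print(sig_pairs)
```

With six groups there are 15 pairwise comparisons, and the adjusted p-values account for testing all of them at once.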


Part 2: Linear Regression — Vaccination Coverage and Fatality Rate

Continuous Explanatory Variable

  • The continuous explanatory variable selected is people_fully_vaccinated_per_hundred.

  • Vaccination coverage is expected to reduce severe outcomes, so we anticipate a negative linear relationship with case fatality rate.

Data Preparation

covid_reg <- covid %>%
  select(case_fatality_rate,
         people_fully_vaccinated_per_hundred) %>%
  drop_na()

Visualization

ggplot(covid_reg,
       aes(x = people_fully_vaccinated_per_hundred,
           y = case_fatality_rate)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Fatality Rate vs Vaccination Coverage",
       x = "People Fully Vaccinated per Hundred",
       y = "Case Fatality Rate") +
  theme_minimal()

Interpretation of Visualization

  • The scatterplot shows a generally linear pattern. As vaccination coverage increases, fatality rates tend to decrease.

  • This supports fitting a linear regression model.

Linear Regression Model

lm_model <- lm(case_fatality_rate ~ people_fully_vaccinated_per_hundred,
               data = covid_reg)

summary(lm_model)
## 
## Call:
## lm(formula = case_fatality_rate ~ people_fully_vaccinated_per_hundred, 
##     data = covid_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03263 -0.02073 -0.01276  0.00157  2.25341 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          3.263e-02  5.881e-04   55.48   <2e-16 ***
## people_fully_vaccinated_per_hundred -1.868e-04  1.855e-05  -10.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06696 on 39577 degrees of freedom
## Multiple R-squared:  0.002557,   Adjusted R-squared:  0.002532 
## F-statistic: 101.4 on 1 and 39577 DF,  p-value: < 2.2e-16

  • Although vaccination coverage is statistically significantly associated with fatality rate, the very small R-squared value suggests that vaccination alone explains only a small portion of the variation in fatality rates. Other factors likely play a more substantial role.

Interpretation of Coefficients

The regression equation takes the form:

\[ \hat{y} = \beta_0 + \beta_1 x \]

β₀ (Intercept):
Represents the predicted fatality rate when vaccination coverage is 0%.

β₁ (Slope):
Represents the change in predicted fatality rate for each 1 percentage point increase in vaccination coverage.

  • If β₁ is negative and statistically significant, this indicates that higher vaccination coverage is associated with lower fatality rates.

    • Intercept = 0.03263
    • Slope = -0.0001868
    • R² = 0.002557

Intercept (β₀ = 0.03263)

  • When vaccination coverage is 0%, the predicted case fatality rate is 0.03263.
  • This means the model predicts a fatality rate of 3.263% at 0% vaccination coverage.

Context meaning:

If a population had no vaccinated individuals, the expected fatality rate would be about 3.26% according to this model.

Slope (β₁ = -0.0001868)

  • For every 1 percentage point increase in vaccination coverage, the predicted case fatality rate decreases by 0.0001868.
  • That equals a decrease of 0.01868 percentage points in fatality rate per 1 percentage point increase in vaccination coverage.

Context meaning:

  • If vaccination coverage increases by 10 percentage points, the predicted fatality rate decreases by 10 × 0.0001868 = 0.001868.

  • That equals a reduction of 0.1868 percentage points.

  • The negative slope and p-value < 2e-16 mean that vaccination coverage has a statistically significant negative association with fatality rate.
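
The arithmetic above can be reproduced directly from the fitted coefficients. The sketch below hard-codes the intercept and slope from the regression output and evaluates the fitted line at a few coverage levels. The helper name predict_cfr and the chosen coverage values are illustrative; with the fitted model object available, predict(lm_model, newdata = ...) would serve the same purpose.

```r
# Coefficients copied from the summary(lm_model) output
b0 <- 0.03263      # intercept
b1 <- -0.0001868   # slope per 1 percentage point of vaccination coverage

# Hypothetical helper: fitted line evaluated at a given coverage level
predict_cfr <- function(coverage) b0 + b1 * coverage

coverage <- c(0, 10, 50, 100)
predictions <- data.frame(coverage = coverage,
                          predicted_cfr = predict_cfr(coverage))
print(predictions)
```

Moving from 0 to 100 on the coverage scale lowers the predicted fatality rate from 0.03263 to 0.01395 under this model, which is a linear extrapolation and not a causal estimate.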

Model Fit

  • The R-squared value represents the proportion of variation in case fatality rate explained by vaccination coverage.

  • A small R-squared suggests that other variables, such as median age, hospital capacity, or GDP per capita, may also influence fatality rates.

  • R-squared (R² = 0.002557):

    • Vaccination coverage explains about 0.26% of the variation in case fatality rates.
    • The remaining 99.7% or so is left unexplained by this model and reflects other factors and random variation.
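
In a regression with a single predictor, R-squared equals the squared Pearson correlation, and the overall F-statistic equals the square of the slope's t-statistic. The sketch below uses those identities as a quick consistency check on the reported output. The inputs are the rounded values printed by summary(lm_model), so small discrepancies are expected, and the variable names are illustrative.

```r
# Values copied from the summary(lm_model) output (rounded as printed)
r_squared <- 0.002557
df_resid  <- 39577
t_slope   <- -10.07

# Correlation implied by R-squared; the sign follows the negative slope
r <- -sqrt(r_squared)

# F-statistic implied by R-squared for a one-predictor model
f_from_r2 <- (r_squared / (1 - r_squared)) * df_resid

# F-statistic implied by the slope's t-statistic
f_from_t <- t_slope^2

round(c(r = r, f_from_r2 = f_from_r2, f_from_t = f_from_t), 4)
```

Both routes land close to the reported F-statistic of 101.4, and the implied correlation of about -0.05 shows how weak the linear association is despite its statistical significance.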


Overall Insights

From this analysis:

  • Mean case fatality rates differ significantly across continents (F(5, 39573) = 27.41, p < 2e-16).
  • Vaccination coverage is associated with lower fatality rates, but it explains only about 0.26% of the variation.
  • Geographic and healthcare factors likely interact in complex ways.

Conclusion

  • This data dive demonstrates how ANOVA and linear regression can provide meaningful insight into global public health outcomes.

  • There is statistical evidence that both geography and vaccination coverage are associated with COVID-19 case fatality rates worldwide, although vaccination coverage alone explains very little of the variation.


TODO Requirement:

  1. Select a continuous response variable that is most valuable in context.
    • Selected Response Variable: case_fatality_rate
      • ✔ Continuous numeric variable
      • ✔ Measures severity of COVID outcomes
      • ✔ Highly relevant to policymakers and public health officials
      • ✔ Represents the core outcome of interest in this dataset
  2. Select a categorical explanatory variable that may influence the response.
    • Selected Categorical Variable: continent
      • ✔ Categorical variable
      • ✔ Contains fewer than 10 categories (no consolidation required)
      • ✔ Likely captures structural differences such as healthcare systems, demographics, and public policy across regions
  3. Devise null and alternative hypotheses for ANOVA.
    • H0: The mean case_fatality_rate is equal across all continents.
    • H1: At least one continent has a different mean case_fatality_rate.
    • ANOVA is used to test whether differences in means are statistically significant.
  4. Interpret ANOVA results in context.
    • If p-value < 0.05: Reject H0 → Fatality rates differ significantly by continent.
    • If p-value ≥ 0.05: Fail to reject H0 → Not enough evidence to conclude differences exist.
    • Practical Meaning: Geographic region may influence COVID severity due to healthcare, demographic, or policy differences.
  5. Select a continuous explanatory variable that may influence the response.
    • Selected Continuous Variable: people_fully_vaccinated_per_hundred
      • ✔ Continuous numeric variable
      • ✔ Non-binary
      • ✔ Expected to have a roughly linear relationship with fatality rate
  6. Build a linear regression model using the selected continuous predictor.
    • Model: case_fatality_rate ~ people_fully_vaccinated_per_hundred
      • ✔ Single predictor model
      • ✔ Relationship appears approximately linear from scatterplot
  7. Interpret regression coefficients in context.
    • Intercept (β0): Predicted fatality rate when vaccination coverage = 0%.
    • Slope (β1): Change in predicted fatality rate for each 1 percentage point increase in vaccination coverage.
    • A negative significant slope suggests higher vaccination coverage is associated with lower fatality rates.
  8. Evaluate model fit.
    • R-squared indicates how much variation in fatality rate is explained by vaccination coverage.
    • A small R-squared suggests other variables likely also influence fatality rates.