Introduction

The purpose of this analysis is to investigate whether COVID-19 case fatality rates differ across continents and whether vaccination coverage helps explain variation in fatality rates.

This data dive includes:

- Selection of a meaningful response variable
- An ANOVA test using a categorical explanatory variable
- A visualization to support interpretation
- A linear regression model using a continuous explanatory variable
- Interpretation of coefficients and model fit
- Broader conclusions about the global population

The response variable analyzed, case_fatality_rate, is a continuous numeric variable.

Part 1: ANOVA — Case Fatality Rate by Continent

Response Variable

The response variable selected is case_fatality_rate.
This variable represents the proportion of confirmed cases that result in death. It is one of the most important indicators of COVID-19 severity and is highly relevant to policymakers and public health officials.

Explanatory Variable

The explanatory variable selected is continent.
Continent is a categorical variable that may capture structural differences such as healthcare systems, demographic composition, and policy responses.

Data Preparation

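# Assumption: the tidyverse was loaded in a setup chunk that is not shown.
# It provides %>% and select() (dplyr), drop_na() (tidyr), and ggplot2.
library(tidyverse)

covid <- read.csv("covid_combined_groups.csv")

# Keep only the response and the grouping variable, then drop incomplete rows
covid_anova <- covid %>%
  select(case_fatality_rate, continent) %>%
  drop_na()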
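To make the effect of the preparation step concrete, here is a minimal base R sketch of the same idea: keep two columns, then drop any row with a missing value in either. The toy data frame and its values are hypothetical and are not taken from covid_combined_groups.csv.

```r
# Hypothetical toy data; the values are made up for illustration only.
toy <- data.frame(
  case_fatality_rate = c(0.021, NA, 0.035, 0.012),
  continent          = c("Asia", "Europe", NA, "Africa"),
  other_column       = 1:4
)

# Keep the two analysis columns, then drop rows with a missing value in either.
keep_cols <- c("case_fatality_rate", "continent")
toy_anova <- toy[complete.cases(toy[, keep_cols]), keep_cols]

print(toy_anova)
```

Only the rows with both values present survive, which is why the number of observations used in a model can be smaller than the number of rows in the raw file.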

Hypotheses

Null Hypothesis (H₀):

The mean case fatality rate is equal across all continents.

Alternative Hypothesis (H₁):

At least one continent has a different mean case fatality rate.
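
In symbols, writing \(\mu_i\) for the mean case fatality rate of continent \(i\), the hypotheses can be stated as follows. The count of six groups is an inference from the 5 degrees of freedom reported for continent in the ANOVA table, since the continent names are not listed in this report.

\[ H_0: \mu_1 = \mu_2 = \cdots = \mu_6 \qquad H_1: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]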

Visualization

ggplot(covid_anova,
       aes(x = continent,
           y = case_fatality_rate,
           fill = continent)) +
  geom_boxplot(alpha = 0.7) +
  labs(title = "Case Fatality Rate by Continent",
       x = "Continent",
       y = "Case Fatality Rate") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation of Visualization

The boxplots show noticeable differences in median fatality rates across continents, and some continents display both higher medians and greater variability.
This suggests that case fatality rate is associated with geographic region, which we formally test using ANOVA.

ANOVA Test

anova_model <- aov(case_fatality_rate ~ continent,
                   data = covid_anova)

summary(anova_model)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## continent       5   0.61 0.12283   27.41 <2e-16 ***
## Residuals   39573 177.31 0.00448                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA Interpretation

If the p-value is less than 0.05, we reject the null hypothesis and conclude that fatality rates differ significantly by continent.
If the p-value is greater than 0.05, we fail to reject the null hypothesis, meaning there is not enough statistical evidence to conclude that fatality rates differ across continents.
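
Rejecting H₀ only says that some group means differ; it does not say which ones. One common follow-up is Tukey's HSD, available in base R as TukeyHSD(). The sketch below runs it on simulated data with three hypothetical groups (A, B, C), not on the COVID data, so the numbers it prints say nothing about actual continents.

```r
# Simulated data: groups A and B share a mean, group C has a higher mean.
set.seed(1)
sim <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 50)),
  y     = c(rnorm(50, mean = 0.02, sd = 0.01),
            rnorm(50, mean = 0.02, sd = 0.01),
            rnorm(50, mean = 0.04, sd = 0.01))
)

# One-way ANOVA, the same kind of model as aov(case_fatality_rate ~ continent)
fit <- aov(y ~ group, data = sim)
p_overall <- summary(fit)[[1]][["Pr(>F)"]][1]

# Pairwise comparisons with family-wise error control
tukey <- TukeyHSD(fit)
print(tukey$group)
```

Each row of the printed matrix is one pair of groups, with the estimated difference in means, a confidence interval, and an adjusted p-value.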

Practical Meaning

From the ANOVA table:
- F(5, 39573) = 27.41
- p-value < 2e-16
- Since the p-value is less than 0.05, we reject the null hypothesis.
- There is strong statistical evidence that mean case fatality rates are not equal across continents.
- At least one continent has a mean fatality rate that differs significantly from the others.
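
As a rough check, the quantities in the ANOVA table can be recomputed from one another. The sketch below uses only the rounded numbers printed in the table, so small discrepancies are expected. It also computes eta-squared (between-group sum of squares over total sum of squares) as an effect-size measure, which the original output does not report.

```r
# Values copied from the printed ANOVA table (already rounded).
df_between <- 5
df_within  <- 39573
ss_between <- 0.61
ss_within  <- 177.31
ms_between <- 0.12283
ms_within  <- 0.00448

# F is the ratio of the two mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Share of total variation attributable to continent
eta_sq <- ss_between / (ss_between + ss_within)

# Implied number of groups and observations
n_groups <- df_between + 1
n_obs    <- df_between + df_within + 1

print(c(F = f_stat, p = p_value, eta_sq = eta_sq))
```

Eta-squared comes out to roughly 0.0034, so continent accounts for well under 1% of the variation in case fatality rate. With nearly 40,000 observations, even small differences in means produce a very small p-value.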

Part 2: Linear Regression — Vaccination Coverage and Fatality Rate

Continuous Explanatory Variable

The continuous explanatory variable selected is people_fully_vaccinated_per_hundred.
Vaccination coverage is expected to reduce severe outcomes, so we anticipate a negative linear relationship with case fatality rate.

Data Preparation

covid_reg <- covid %>%
  select(case_fatality_rate,
         people_fully_vaccinated_per_hundred) %>%
  drop_na()

Visualization

ggplot(covid_reg,
       aes(x = people_fully_vaccinated_per_hundred,
           y = case_fatality_rate)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Fatality Rate vs Vaccination Coverage",
       x = "People Fully Vaccinated per Hundred",
       y = "Case Fatality Rate") +
  theme_minimal()

Interpretation of Visualization

The scatterplot shows a generally linear pattern. As vaccination coverage increases, fatality rates tend to decrease.
This supports fitting a linear regression model.

Linear Regression Model

lm_model <- lm(case_fatality_rate ~ people_fully_vaccinated_per_hundred,
               data = covid_reg)

summary(lm_model)

## 
## Call:
## lm(formula = case_fatality_rate ~ people_fully_vaccinated_per_hundred, 
##     data = covid_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.03263 -0.02073 -0.01276  0.00157  2.25341 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          3.263e-02  5.881e-04   55.48   <2e-16 ***
## people_fully_vaccinated_per_hundred -1.868e-04  1.855e-05  -10.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06696 on 39577 degrees of freedom
## Multiple R-squared:  0.002557,   Adjusted R-squared:  0.002532 
## F-statistic: 101.4 on 1 and 39577 DF,  p-value: < 2.2e-16
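
The summary reports the slope and its standard error but no confidence interval. A minimal sketch of a 95% interval, built only from the rounded values printed above, follows. It assumes the usual regression conditions hold, including independent observations; the report does not describe how the rows are structured, so the interval may be too narrow if rows are repeated measurements of the same countries.

```r
# Values copied from the printed coefficient table (already rounded).
estimate  <- -1.868e-04
std_error <- 1.855e-05
df_resid  <- 39577

# t statistic: estimate divided by its standard error
t_value <- estimate / std_error

# 95% confidence interval: estimate +/- critical t * standard error
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * std_error

print(t_value)
print(ci)
```

With the fitted model object available, confint(lm_model) returns the same interval at full precision.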

Although vaccination coverage is statistically significantly associated with fatality rate, the very small R-squared value suggests that vaccination alone explains only a small portion of the variation in fatality rates. Other factors likely play a more substantial role.

Interpretation of Coefficients

The regression equation takes the form:

\[ \hat{y} = \beta_0 + \beta_1 x \]

β₀ (Intercept):
Represents the predicted fatality rate when vaccination coverage is 0%.

β₁ (Slope):
Represents the change in predicted fatality rate for each 1 percentage point increase in vaccination coverage.

If β₁ is negative and statistically significant, this indicates that higher vaccination coverage is associated with lower fatality rates.

From the model output:
- Intercept = 0.03263
- Slope = -0.0001868
- R² = 0.002557
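
Substituting the estimated coefficients into the general form gives the fitted line, where \(x\) is people_fully_vaccinated_per_hundred and \(\hat{y}\) is the predicted case fatality rate expressed as a proportion:

\[ \hat{y} = 0.03263 - 0.0001868\,x \]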

Intercept (β₀ = 0.03263)

- When vaccination coverage is 0%, the predicted case fatality rate is 0.03263.
- This means the model predicts a fatality rate of 3.263% at 0% vaccination coverage.

Context meaning:

If a population had no vaccinated individuals, the expected fatality rate would be about 3.26% according to this model.

Slope (β₁ = -0.0001868)

- For every 1 percentage point increase in vaccination coverage, the predicted case fatality rate decreases by 0.0001868.
- That equals a decrease of 0.01868 percentage points in fatality rate per 1 percentage point increase in vaccination coverage.

Context meaning:

If vaccination coverage increases by 10 percentage points, the predicted fatality rate decreases by 10 × 0.0001868 = 0.001868, a reduction of 0.1868 percentage points.

The negative slope and p-value < 2e-16 mean that vaccination coverage has a statistically significant negative association with fatality rate.
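
The same arithmetic can be carried out for several coverage levels at once. This sketch hard-codes the rounded coefficients from the model summary; predict_cfr is a hypothetical helper name, and the coverage levels are chosen for illustration only.

```r
# Rounded coefficients from the model summary
b0 <- 0.03263
b1 <- -0.0001868

# Hypothetical helper: predicted case fatality rate (as a proportion)
# for a given number of fully vaccinated people per hundred
predict_cfr <- function(x) b0 + b1 * x

coverage  <- c(0, 10, 50, 100)
predicted <- predict_cfr(coverage)

pred_table <- data.frame(
  coverage      = coverage,
  predicted     = predicted,
  predicted_pct = 100 * predicted
)
print(pred_table)
```

Even at 100 per hundred, the line predicts a fatality rate of about 1.4%, compared with about 3.3% at zero coverage. With the fitted object available, predict(lm_model, newdata = ...) would give the same predictions at full precision.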

Model Fit

The R-squared value represents the proportion of variation in case fatality rate explained by vaccination coverage.
A small R-squared suggests that other variables, such as median age, hospital capacity, or GDP per capita, may also influence fatality rates.

R-squared (R² = 0.002557)

- Vaccination coverage explains 0.2557% of the variation in case fatality rates.
- Over 99.7% of the variation is left unexplained by this model and may reflect other factors or random variation.
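
For a regression with a single predictor, the reported statistics are tied together by two identities: the F-statistic equals the squared t-statistic of the slope, and R² = F / (F + residual degrees of freedom). The sketch below checks both using the rounded values from the summary, so agreement is only approximate.

```r
# Values copied from the printed model summary (already rounded).
f_stat    <- 101.4
df_resid  <- 39577
t_slope   <- -10.07
r_squared <- 0.002557

# Identity 1: F equals t squared when there is one predictor
t_squared <- t_slope^2

# Identity 2: R-squared recovered from F and the residual degrees of freedom
r_squared_from_f <- f_stat / (f_stat + df_resid)

# Pearson correlation implied by R-squared; negative because the slope is negative
r <- -sqrt(r_squared)

print(c(t_squared = t_squared, r_squared_from_f = r_squared_from_f, r = r))
```

The implied correlation is about -0.05, a very weak linear relationship despite the extremely small p-value.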

Overall Insights

From this analysis:

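- There are statistically significant differences in mean fatality rates across continents (F(5, 39573) = 27.41, p < 2e-16).
- Higher vaccination coverage is associated with lower fatality rates, but it explains only about 0.26% of the variation (R² = 0.002557).
- Geographic and healthcare factors may interact in ways that these single-predictor models do not capture.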

Conclusion

This data dive demonstrates how ANOVA and linear regression can provide meaningful insight into global public health outcomes.
There is statistical evidence that both continent and vaccination coverage are associated with COVID-19 case fatality rates worldwide, although vaccination coverage alone explains very little of the variation.

TODO Requirement:

Select a continuous response variable that is most valuable in context.

Selected Response Variable: case_fatality_rate
- ✔ Continuous numeric variable
- ✔ Measures severity of COVID outcomes
- ✔ Highly relevant to policymakers and public health officials
- ✔ Represents the core outcome of interest in this dataset

Select a categorical explanatory variable that may influence the response.

Selected Categorical Variable: continent
- ✔ Categorical variable
- ✔ Contains fewer than 10 categories (no consolidation required)
- ✔ Likely captures structural differences such as healthcare systems, demographics, and public policy across regions

Devise null and alternative hypotheses for ANOVA.

H0: The mean case_fatality_rate is equal across all continents.
H1: At least one continent has a different mean case_fatality_rate.
ANOVA is used to test whether differences in means are statistically significant.

Interpret ANOVA results in context.
- If p-value < 0.05: Reject H0 → Fatality rates differ significantly by continent.
- If p-value ≥ 0.05: Fail to reject H0 → Not enough evidence to conclude differences exist.
- Practical Meaning: Geographic region may influence COVID severity due to healthcare, demographic, or policy differences.

Select a continuous explanatory variable that may influence the response.

Selected Continuous Variable: people_fully_vaccinated_per_hundred
- ✔ Continuous numeric variable
- ✔ Non-binary
- ✔ Expected to have a roughly linear relationship with fatality rate

Build a linear regression model using the selected continuous predictor.

Model: case_fatality_rate ~ people_fully_vaccinated_per_hundred
- ✔ Single predictor model
- ✔ Relationship appears approximately linear from scatterplot

Interpret regression coefficients in context.
- Intercept (β0): Predicted fatality rate when vaccination coverage = 0%.
- Slope (β1): Change in predicted fatality rate for each 1 percentage point increase in vaccination coverage.
- A negative significant slope suggests higher vaccination coverage is associated with lower fatality rates.

Evaluate model fit.
- R-squared indicates how much variation in fatality rate is explained by vaccination coverage.
- A small R-squared suggests other variables likely also influence fatality rates.

Week 8 Data Dive: ANOVA and Regression Modeling

Krish Shah

March 03, 2026
