Introduction

The purpose of this analysis is to investigate relationships between COVID-19 cases, deaths, and vaccination coverage using quantitative methods.

For each section:

  • Original numeric variables are examined together with a mutated (created) variable
  • A visualization is constructed
  • The relationship is evaluated critically
  • A correlation coefficient is computed and interpreted
  • A 95% confidence interval is constructed for the mean of the variable of interest
  • Conclusions are drawn about the broader global population

All variables analyzed are continuous numeric variables.


Pair 1: New Cases (Explanatory) & New Deaths (Response)

Data Preparation

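library(dplyr)    # %>%, select(), mutate()
library(tidyr)    # drop_na()
library(ggplot2)  # ggplot()

covid <- read.csv("covid_combined_groups.csv")

# Keep the three analysis variables and drop rows with missing values
covid_clean <- covid %>%
  select(new_cases_smoothed_per_million,
         new_deaths_smoothed_per_million,
         people_fully_vaccinated_per_hundred) %>%
  drop_na()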

Mutated Variable

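covid_clean <- covid_clean %>%
  mutate(death_to_case_ratio =
           # Undefined (NA) when there are no cases, avoiding Inf/NaN
           if_else(new_cases_smoothed_per_million > 0,
                   new_deaths_smoothed_per_million /
                     new_cases_smoothed_per_million,
                   NA_real_))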

Why This Variable Matters

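The death_to_case_ratio expresses deaths relative to reported infections. Whereas death rates rise with case rates, the ratio describes mortality burden at a given level of infection. It is undefined when the case rate is zero. It is reported for context; the visualization and correlation below pair the two original variables.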
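
Because the ratio divides by the case rate, rows with zero cases need care: R returns Inf or NaN for such divisions rather than raising an error. The sketch below illustrates this on a small made-up data frame. The values are hypothetical and not taken from covid_combined_groups.csv, and base R is used so the example runs without extra packages.

```r
# Hypothetical toy values (not from covid_combined_groups.csv)
toy <- data.frame(
  new_cases_smoothed_per_million  = c(100, 50, 0, 0),
  new_deaths_smoothed_per_million = c(2, 0.5, 0.1, 0)
)

# Naive ratio: x / 0 gives Inf and 0 / 0 gives NaN
toy$ratio_naive <- toy$new_deaths_smoothed_per_million /
  toy$new_cases_smoothed_per_million

# Guarded ratio: undefined (NA) wherever there are no cases
toy$ratio_safe <- ifelse(toy$new_cases_smoothed_per_million > 0,
                         toy$ratio_naive,
                         NA_real_)

print(toy)

# The naive mean is not finite; the guarded mean skips undefined rows
mean_naive <- mean(toy$ratio_naive)
mean_safe  <- mean(toy$ratio_safe, na.rm = TRUE)
```

Treating zero-case rows as missing is one reasonable choice among several. Any summary of the ratio should state which convention it uses.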


Visualization

ggplot(covid_clean,
       aes(x = new_cases_smoothed_per_million,
           y = new_deaths_smoothed_per_million)) +
  geom_point(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "lm", se = TRUE, color = "darkred") +
  labs(title = "Relationship Between New COVID Cases and Deaths",
       subtitle = "Smoothed values per million population",
       x = "New Cases per Million",
       y = "New Deaths per Million") +
  theme_minimal()

Interpretation of the Visualization

The scatterplot shows a positive, roughly linear association between new cases and new deaths.

Key observations:

  • As cases increase, deaths tend to increase.
  • The relationship appears approximately linear.
  • The spread of points widens at higher case values, suggesting possible heteroscedasticity.
  • A few extreme case values may represent outliers, likely corresponding to major outbreak periods.

This pattern is epidemiologically plausible: deaths follow from infections, though healthcare capacity, demographic differences, and the lag between infection and death introduce variability.
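
The widening spread noted above can be checked numerically as well as by eye. The following is a minimal sketch on simulated data, not the COVID dataset: the variable names cases and deaths and all parameter values are assumptions chosen so that the noise grows with the case rate.

```r
set.seed(1)  # simulated data, for illustration only

# Hypothetical case and death rates; the noise grows with the case rate
cases  <- runif(500, min = 0, max = 1000)
deaths <- 0.01 * cases + rnorm(500, mean = 0, sd = 0.1 + 0.002 * cases)

fit <- lm(deaths ~ cases)
res <- resid(fit)

# Compare residual spread in the lower and upper halves of the case rate.
# plot(fitted(fit), res) would show the same pattern as a funnel shape.
low_half <- cases <= median(cases)
sd_low   <- sd(res[low_half])
sd_high  <- sd(res[!low_half])

c(sd_low = sd_low, sd_high = sd_high)
```

A markedly larger sd_high would suggest heteroscedasticity. The correlation remains a valid descriptive summary in that case, but standard errors from an ordinary linear fit should be treated with caution.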


Correlation Analysis

cor_test_1 <- cor.test(covid_clean$new_cases_smoothed_per_million,
                       covid_clean$new_deaths_smoothed_per_million)

cor_test_1
## 
##  Pearson's product-moment correlation
## 
## data:  covid_clean$new_cases_smoothed_per_million and covid_clean$new_deaths_smoothed_per_million
## t = 153.51, df = 41600, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5951768 0.6074461
## sample estimates:
##       cor 
## 0.6013469

Interpretation of Correlation

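The Pearson correlation coefficient quantifies the strength and direction of the linear relationship. Here r = 0.601, with a 95% confidence interval of (0.595, 0.607) and p < 2.2e-16.

  • The positive sign confirms that higher case rates tend to accompany higher death rates.
  • The magnitude indicates a moderate association: r² ≈ 0.36, so a linear fit on cases accounts for roughly 36% of the variance in deaths.
  • The remaining variability points to other influencing factors (age structure, healthcare quality, vaccination rates).

The very small p-value largely reflects the sample size (n = 41,602). It shows that the correlation is distinguishable from zero, not that it is strong. Nor does the correlation establish that deaths rise in proportion to cases.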
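
The figures in the cor.test() output can be reproduced by hand, which clarifies where each number comes from. This sketch assumes the interval is built with Fisher's z transform, as the R documentation for cor.test describes for Pearson's method. The inputs are copied from the output above.

```r
# Values copied from the cor.test() output above
r  <- 0.6013469
df <- 41600
n  <- df + 2            # Pearson's test uses df = n - 2

# t statistic for H0: rho = 0
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% interval via Fisher's z transform, standard error 1 / sqrt(n - 3)
z    <- atanh(r)
half <- qnorm(0.975) / sqrt(n - 3)
ci   <- tanh(c(z - half, z + half))

# Share of variance in deaths linearly associated with cases
r_squared <- r^2

round(c(t = t_stat, lower = ci[1], upper = ci[2], r2 = r_squared), 4)
```

The results should agree with the printed statistic and interval up to rounding. The r_squared value is the basis for describing the association as moderate.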


95% Confidence Interval for Mean Deaths

ci_deaths <- t.test(covid_clean$new_deaths_smoothed_per_million)
ci_deaths
## 
##  One Sample t-test
## 
## data:  covid_clean$new_deaths_smoothed_per_million
## t = 123.45, df = 41601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.043921 2.109869
## sample estimates:
## mean of x 
##  2.076895

Population-Level Interpretation

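The 95% confidence interval is:

(2.044, 2.110)

This means:

We are 95% confident that the true mean of new COVID deaths per million people, across the countries and time periods the data represent, lies between 2.044 and 2.110.

Important interpretation elements:

  • The point estimate, 2.077 deaths per million, is the midpoint of the interval and the best estimate of the mean.
  • The interval is narrow because the sample is large (n = 41,602). Its width reflects the precision of the estimated mean, not the spread of mortality across countries; the standard deviation or a histogram describes that spread.
  • The t interval assumes independent observations. Smoothed values recorded repeatedly for the same countries are likely correlated, so the true uncertainty is probably larger than the interval suggests.

This inference extends beyond the observed rows only to the extent that they are representative of the broader global population.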
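
For reference, the interval that t.test() reports is the sample mean plus or minus a t quantile times the standard error. The sketch below checks this on simulated data. The vector x is hypothetical, drawn from an exponential distribution only to mimic a right-skewed rate.

```r
set.seed(42)  # simulated values, for illustration only

# Hypothetical right-skewed "deaths per million" sample
x <- rexp(200, rate = 0.5)

n      <- length(x)
se     <- sd(x) / sqrt(n)              # standard error of the mean
margin <- qt(0.975, df = n - 1) * se   # half-width of the 95% interval
manual <- mean(x) + c(-1, 1) * margin

# Same interval as reported by t.test()
builtin <- as.numeric(t.test(x)$conf.int)

rbind(manual, builtin)
```

Because the standard error divides by sqrt(n), the interval narrows as the sample grows even when the spread of the data stays the same.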


Pair 2: Vaccination Coverage & Vaccination Gap

Mutated Variable

covid_clean <- covid_clean %>%
  mutate(vax_gap = 100 - people_fully_vaccinated_per_hundred)

Meaning of the Mutated Variable

vax_gap represents the remaining percentage of a population that is not fully vaccinated.

It carries the same information as coverage, reframed as the distance from complete vaccination coverage.
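
Reading vax_gap as a remaining percentage only makes sense when coverage lies between 0 and 100. Whether the dataset contains values outside that range is not stated here, so a quick sanity check may be worthwhile. The coverage values below are hypothetical.

```r
# Hypothetical coverage values, including one above 100
coverage <- c(0, 12.5, 64.3, 99.9, 104.2)
vax_gap  <- 100 - coverage

# Flag gaps that fall outside the interpretable 0-100 range
out_of_range <- vax_gap < 0 | vax_gap > 100
data.frame(coverage, vax_gap, out_of_range)

n_flagged <- sum(out_of_range)
```

Applied to the real column, a non-zero n_flagged would call for inspecting those rows before interpreting the gap as a percentage.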


Visualization

ggplot(covid_clean,
       aes(x = people_fully_vaccinated_per_hundred,
           y = vax_gap)) +
  geom_point(alpha = 0.3, color = "darkgreen") +
  geom_smooth(method = "lm", se = TRUE, color = "black") +
  labs(title = "Vaccination Coverage vs Remaining Vaccination Gap",
       x = "Fully Vaccinated per Hundred",
       y = "Vaccination Gap (%)") +
  theme_minimal()

Interpretation of the Visualization

The scatterplot shows an exactly straight negative linear relationship.

This is expected because:

vax_gap = 100 − people_fully_vaccinated_per_hundred

Therefore:

  • As vaccination increases, the gap decreases by the same amount.
  • The relationship is mathematically deterministic.
  • The regression line fits the data exactly, and its confidence band has effectively zero width.

Because vax_gap is computed directly from the coverage column, no point can deviate from the line beyond floating-point rounding. The plot therefore verifies the construction of the variable rather than revealing an empirical pattern.
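
The deterministic nature of this relationship can be demonstrated on any set of coverage values. The following sketch uses simulated numbers, not the dataset, and checks that the fitted line recovers the defining formula.

```r
set.seed(7)  # simulated coverage values, for illustration only

coverage <- runif(100, min = 0, max = 100)
vax_gap  <- 100 - coverage

# Correlation of x with 100 - x is -1 up to floating-point error
r <- cor(coverage, vax_gap)

# The least-squares line recovers the defining formula
fit <- lm(vax_gap ~ coverage)
coef(fit)                          # intercept close to 100, slope close to -1
max_resid <- max(abs(resid(fit)))  # effectively zero
```

The same result holds whatever values coverage takes, which is why the correlation here says nothing about the data themselves.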


Correlation Analysis

cor_test_2 <- cor.test(covid_clean$people_fully_vaccinated_per_hundred,
                       covid_clean$vax_gap)

cor_test_2
## 
##  Pearson's product-moment correlation
## 
## data:  covid_clean$people_fully_vaccinated_per_hundred and covid_clean$vax_gap
## t = -9678578008, df = 41600, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -1 -1
## sample estimates:
## cor 
##  -1

Interpretation

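Because one variable is a linear transformation of the other with a negative slope, the correlation is exactly -1, as the output shows.

A correlation of -1 confirms:

  • A perfectly inverse linear relationship
  • Mathematical consistency
  • Proper construction of the mutated variable

The test statistic and p-value carry no empirical information here. For r = -1 the t statistic is unbounded, and the very large finite value printed most likely reflects floating-point rounding.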

95% Confidence Interval for Vaccination Coverage

ci_vax <- t.test(covid_clean$people_fully_vaccinated_per_hundred)
ci_vax
## 
##  One Sample t-test
## 
## data:  covid_clean$people_fully_vaccinated_per_hundred
## t = 286.89, df = 41601, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  25.22555 25.57260
## sample estimates:
## mean of x 
##  25.39907

Population-Level Interpretation

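The 95% confidence interval estimates the true mean number of fully vaccinated people per hundred across the countries and time periods the data represent.

The interval is:

(25.23, 25.57)

This implies:

We are 95% confident that the true mean vaccination coverage lies between 25.23 and 25.57 per hundred.

The upper bound is far below 100, confirming that, averaged over the observations, vaccination remained incomplete.
The interval is narrow because the sample is large (n = 41,602). Its width measures the precision of the mean and says nothing about inequality in vaccine distribution across countries, which would require a measure of spread such as the standard deviation.

The t interval assumes independent observations, which repeated measurements over time are unlikely to satisfy, so the stated precision is probably optimistic. The inference applies beyond the observed sample only insofar as the rows are representative of the broader global population.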
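
The width of a confidence interval and the spread of the underlying data are distinct quantities, and both can be recovered from the printed output. The sketch below backs out the standard error and the implied standard deviation from the rounded figures above, so the results are approximate.

```r
# Values copied from the t.test() output above
mean_x <- 25.39907
t_stat <- 286.89
df     <- 41601
n      <- df + 1

# t = mean / se under H0: mu = 0, so the standard error can be backed out
se <- mean_x / t_stat

# Rebuild the 95% interval from the mean and standard error
ci <- mean_x + c(-1, 1) * qt(0.975, df) * se

# Implied standard deviation of the observations themselves
sd_x <- se * sqrt(n)

round(c(se = se, lower = ci[1], upper = ci[2], sd = sd_x), 3)
```

The standard error is small because n is large, while sd_x describes how much individual observations vary. Questions about unequal vaccine distribution concern the latter.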


Overall Conclusions

This analysis demonstrates:

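  1. A statistically significant, moderate positive association between COVID cases and deaths (r ≈ 0.60).
  2. A mathematically deterministic inverse relationship between vaccination coverage and vaccination gap (r = -1 by construction).
  3. Evidence of variability and potential outliers in global mortality data.
  4. Population-level estimates of mean deaths (about 2.08 per million) and mean vaccination coverage (about 25.4 per hundred), with confidence intervals that are narrow because of the large sample.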