Introduction
Research Question: What factors predict the percentage of people fully vaccinated against COVID-19 across different countries?
This project uses the COVID-19 vaccination dataset published by Our World in Data (OWID). The dataset provides country-level information on vaccination coverage along with demographic, economic, and policy-related indicators. Each row represents a single location, either a country or an OWID aggregate such as a continent, and the primary outcome variable is people_fully_vaccinated_per_hundred, which measures the percentage of a location's population that has completed the full COVID-19 vaccination series.
The dataset contains numerous variables related to vaccination progress, economic development, population structure, and public health policy. I chose this topic because COVID-19 vaccination rates varied substantially across countries, and examining the factors associated with higher vaccination coverage can help explain global inequalities in health outcomes and resource access.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.1
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Warning: package 'car' was built under R version 4.5.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.5.2
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
# Load local CSV file
#setwd("~/Data 101/Final project")
vaccinations <- read.csv("owid-covid-latest.csv")
# Inspect the data
head(vaccinations)
## iso_code continent location last_updated_date total_cases new_cases
## 1 AFG Asia Afghanistan 8/4/2024 235214 0
## 2 OWID_AFR Africa 8/4/2024 13145380 36
## 3 ALB Europe Albania 8/4/2024 335047 0
## 4 DZA Africa Algeria 8/4/2024 272139 18
## 5 ASM Oceania American Samoa 8/4/2024 8359 0
## 6 AND Europe Andorra 8/4/2024 48015 0
## new_cases_smoothed total_deaths new_deaths new_deaths_smoothed
## 1 0.000 7998 0 0
## 2 5.143 259117 0 0
## 3 0.000 3605 0 0
## 4 2.571 6881 0 0
## 5 0.000 34 0 0
## 6 0.000 159 0 0
## total_cases_per_million new_cases_per_million new_cases_smoothed_per_million
## 1 5796.468 0.000 0.000
## 2 9088.877 0.025 0.004
## 3 118491.020 0.000 0.000
## 4 5984.050 0.396 0.057
## 5 172831.600 0.000 0.000
## 6 602280.440 0.000 0.000
## total_deaths_per_million new_deaths_per_million
## 1 197.098 0
## 2 179.157 0
## 3 1274.926 0
## 4 151.306 0
## 5 702.988 0
## 6 1994.431 0
## new_deaths_smoothed_per_million reproduction_rate icu_patients
## 1 0 NA NA
## 2 0 NA NA
## 3 0 NA NA
## 4 0 NA NA
## 5 0 NA NA
## 6 0 NA NA
## icu_patients_per_million hosp_patients hosp_patients_per_million
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## weekly_icu_admissions weekly_icu_admissions_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## weekly_hosp_admissions weekly_hosp_admissions_per_million total_tests
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## new_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## total_vaccinations people_vaccinated people_fully_vaccinated total_boosters
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## total_boosters_per_hundred new_vaccinations_smoothed_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## stringency_index population_density median_age aged_65_older aged_70_older
## 1 NA 54.422 18.6 2.581 1.337
## 2 NA NA NA NA NA
## 3 NA 104.871 38.0 13.188 8.643
## 4 NA 17.348 29.1 6.211 3.857
## 5 NA 278.205 NA NA NA
## 6 NA 163.755 NA NA NA
## gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence
## 1 1803.987 NA 597.029 9.59
## 2 NA NA NA NA
## 3 11803.431 1.1 304.195 10.08
## 4 13913.839 0.5 278.364 6.73
## 5 NA NA 283.750 NA
## 6 NA NA 109.135 7.97
## female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NA NA 37.746 0.50
## 2 NA NA NA NA
## 3 7.1 51.2 NA 2.89
## 4 0.7 30.4 83.741 1.90
## 5 NA NA NA NA
## 6 29.0 37.8 NA NA
## life_expectancy human_development_index population
## 1 64.83 0.511 41128772
## 2 NA NA 1426736614
## 3 78.57 0.795 2842318
## 4 76.88 0.748 44903228
## 5 73.74 NA 44295
## 6 83.73 0.868 79843
## excess_mortality_cumulative_absolute excess_mortality_cumulative
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## excess_mortality excess_mortality_cumulative_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
str(vaccinations)
## 'data.frame': 247 obs. of 67 variables:
## $ iso_code : chr "AFG" "OWID_AFR" "ALB" "DZA" ...
## $ continent : chr "Asia" "" "Europe" "Africa" ...
## $ location : chr "Afghanistan" "Africa" "Albania" "Algeria" ...
## $ last_updated_date : chr "8/4/2024" "8/4/2024" "8/4/2024" "8/4/2024" ...
## $ total_cases : int 235214 13145380 335047 272139 8359 48015 107481 3904 9106 10101218 ...
## $ new_cases : int 0 36 0 18 0 0 0 0 0 54 ...
## $ new_cases_smoothed : num 0 5.14 0 2.57 0 ...
## $ total_deaths : int 7998 259117 3605 6881 34 159 1937 12 146 130663 ...
## $ new_deaths : int 0 0 0 0 0 0 0 0 0 1 ...
## $ new_deaths_smoothed : num 0 0 0 0 0 0 0 0 0 0.143 ...
## $ total_cases_per_million : num 5796 9089 118491 5984 172832 ...
## $ new_cases_per_million : num 0 0.025 0 0.396 0 ...
## $ new_cases_smoothed_per_million : num 0 0.004 0 0.057 0 0 0 0 0 0.17 ...
## $ total_deaths_per_million : num 197 179 1275 151 703 ...
## $ new_deaths_per_million : num 0 0 0 0 0 0 0 0 0 0.022 ...
## $ new_deaths_smoothed_per_million : num 0 0 0 0 0 0 0 0 0 0.003 ...
## $ reproduction_rate : logi NA NA NA NA NA NA ...
## $ icu_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : int NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : int NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : logi NA NA NA NA NA NA ...
## $ new_tests : logi NA NA NA NA NA NA ...
## $ total_tests_per_thousand : logi NA NA NA NA NA NA ...
## $ new_tests_per_thousand : logi NA NA NA NA NA NA ...
## $ new_tests_smoothed : logi NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : logi NA NA NA NA NA NA ...
## $ positive_rate : logi NA NA NA NA NA NA ...
## $ tests_per_case : logi NA NA NA NA NA NA ...
## $ tests_units : logi NA NA NA NA NA NA ...
## $ total_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : int NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : int NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: int NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : logi NA NA NA NA NA NA ...
## $ population_density : num 54.4 NA 104.9 17.3 278.2 ...
## $ median_age : num 18.6 NA 38 29.1 NA NA 16.8 NA 32.1 31.9 ...
## $ aged_65_older : num 2.58 NA 13.19 6.21 NA ...
## $ aged_70_older : num 1.34 NA 8.64 3.86 NA ...
## $ gdp_per_capita : num 1804 NA 11803 13914 NA ...
## $ extreme_poverty : num NA NA 1.1 0.5 NA NA NA NA NA 0.6 ...
## $ cardiovasc_death_rate : num 597 NA 304 278 284 ...
## $ diabetes_prevalence : num 9.59 NA 10.08 6.73 NA ...
## $ female_smokers : num NA NA 7.1 0.7 NA 29 NA NA NA 16.2 ...
## $ male_smokers : num NA NA 51.2 30.4 NA 37.8 NA NA NA 27.7 ...
## $ handwashing_facilities : num 37.7 NA NA 83.7 NA ...
## $ hospital_beds_per_thousand : num 0.5 NA 2.89 1.9 NA NA NA NA 3.8 5 ...
## $ life_expectancy : num 64.8 NA 78.6 76.9 73.7 ...
## $ human_development_index : num 0.511 NA 0.795 0.748 NA 0.868 0.581 NA 0.778 0.845 ...
## $ population : num 4.11e+07 1.43e+09 2.84e+06 4.49e+07 4.43e+04 ...
## $ excess_mortality_cumulative_absolute : logi NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : logi NA NA NA NA NA NA ...
## $ excess_mortality : logi NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : logi NA NA NA NA NA NA ...
Data Analysis
To prepare the data for analysis, I first restricted the dataset to country-level observations by removing OWID aggregate rows (such as continents, income groups, and world totals), which are identified by iso_code values beginning with "OWID". Next, I selected variables relevant to the research question: vaccination coverage, economic development, demographics, and policy stringency. In this snapshot, stringency_index is missing for every row, so it could not be used as a predictor. Missing values in the remaining variables were handled by dropping incomplete cases, which leaves only 6 countries in the final regression dataset. Finally, GDP per capita was log-transformed to reduce skewness and better satisfy linear regression assumptions. These steps make the retained observations comparable across countries, but the very small sample limits what multiple linear regression can show.
covid_countries <- vaccinations |>
  filter(!startsWith(iso_code, "OWID"))
head(covid_countries)
## iso_code continent location last_updated_date total_cases new_cases
## 1 AFG Asia Afghanistan 8/4/2024 235214 0
## 2 ALB Europe Albania 8/4/2024 335047 0
## 3 DZA Africa Algeria 8/4/2024 272139 18
## 4 ASM Oceania American Samoa 8/4/2024 8359 0
## 5 AND Europe Andorra 8/4/2024 48015 0
## 6 AGO Africa Angola 8/4/2024 107481 0
## new_cases_smoothed total_deaths new_deaths new_deaths_smoothed
## 1 0.000 7998 0 0
## 2 0.000 3605 0 0
## 3 2.571 6881 0 0
## 4 0.000 34 0 0
## 5 0.000 159 0 0
## 6 0.000 1937 0 0
## total_cases_per_million new_cases_per_million new_cases_smoothed_per_million
## 1 5796.468 0.000 0.000
## 2 118491.020 0.000 0.000
## 3 5984.050 0.396 0.057
## 4 172831.600 0.000 0.000
## 5 602280.440 0.000 0.000
## 6 3016.162 0.000 0.000
## total_deaths_per_million new_deaths_per_million
## 1 197.098 0
## 2 1274.926 0
## 3 151.306 0
## 4 702.988 0
## 5 1994.431 0
## 6 54.357 0
## new_deaths_smoothed_per_million reproduction_rate icu_patients
## 1 0 NA NA
## 2 0 NA NA
## 3 0 NA NA
## 4 0 NA NA
## 5 0 NA NA
## 6 0 NA NA
## icu_patients_per_million hosp_patients hosp_patients_per_million
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## weekly_icu_admissions weekly_icu_admissions_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## weekly_hosp_admissions weekly_hosp_admissions_per_million total_tests
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## new_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## total_vaccinations people_vaccinated people_fully_vaccinated total_boosters
## 1 NA NA NA NA
## 2 NA NA NA NA
## 3 NA NA NA NA
## 4 NA NA NA NA
## 5 NA NA NA NA
## 6 NA NA NA NA
## new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred
## 1 NA NA NA
## 2 NA NA NA
## 3 NA NA NA
## 4 NA NA NA
## 5 NA NA NA
## 6 NA NA NA
## people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## total_boosters_per_hundred new_vaccinations_smoothed_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## stringency_index population_density median_age aged_65_older aged_70_older
## 1 NA 54.422 18.6 2.581 1.337
## 2 NA 104.871 38.0 13.188 8.643
## 3 NA 17.348 29.1 6.211 3.857
## 4 NA 278.205 NA NA NA
## 5 NA 163.755 NA NA NA
## 6 NA 23.890 16.8 2.405 1.362
## gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence
## 1 1803.987 NA 597.029 9.59
## 2 11803.431 1.1 304.195 10.08
## 3 13913.839 0.5 278.364 6.73
## 4 NA NA 283.750 NA
## 5 NA NA 109.135 7.97
## 6 5819.495 NA 276.045 3.94
## female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand
## 1 NA NA 37.746 0.50
## 2 7.1 51.2 NA 2.89
## 3 0.7 30.4 83.741 1.90
## 4 NA NA NA NA
## 5 29.0 37.8 NA NA
## 6 NA NA 26.664 NA
## life_expectancy human_development_index population
## 1 64.83 0.511 41128772
## 2 78.57 0.795 2842318
## 3 76.88 0.748 44903228
## 4 73.74 NA 44295
## 5 83.73 0.868 79843
## 6 61.15 0.581 35588996
## excess_mortality_cumulative_absolute excess_mortality_cumulative
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
## excess_mortality excess_mortality_cumulative_per_million
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
analysis_data <- covid_countries |>
  select(
    location,
    iso_code,
    people_fully_vaccinated_per_hundred,
    gdp_per_capita,
    human_development_index,
    median_age,
    population_density,
    stringency_index
  )
head(analysis_data)
## location iso_code people_fully_vaccinated_per_hundred gdp_per_capita
## 1 Afghanistan AFG NA 1803.987
## 2 Albania ALB NA 11803.431
## 3 Algeria DZA NA 13913.839
## 4 American Samoa ASM NA NA
## 5 Andorra AND NA NA
## 6 Angola AGO NA 5819.495
## human_development_index median_age population_density stringency_index
## 1 0.511 18.6 54.422 NA
## 2 0.795 38.0 104.871 NA
## 3 0.748 29.1 17.348 NA
## 4 NA NA 278.205 NA
## 5 0.868 NA 163.755 NA
## 6 0.581 16.8 23.890 NA
final_data <- analysis_data |>
  mutate(
    log_gdp_per_capita = log(gdp_per_capita)
  ) |>
  drop_na(
    people_fully_vaccinated_per_hundred,
    log_gdp_per_capita,
    human_development_index,
    median_age,
    population_density
  )
dim(final_data)
## [1] 6 9
summary(final_data)
## location iso_code people_fully_vaccinated_per_hundred
## Length:6 Length:6 Min. :65.06
## Class :character Class :character 1st Qu.:66.08
## Mode :character Mode :character Median :67.79
## Mean :73.06
## 3rd Qu.:77.98
## Max. :90.85
## gdp_per_capita human_development_index median_age population_density
## Min. : 6427 Min. :0.6450 Min. :28.20 Min. : 31.03
## 1st Qu.:27476 1st Qu.:0.8280 1st Qu.:33.10 1st Qu.: 57.91
## Median :29503 Median :0.8870 Median :43.00 Median : 116.72
## Mean :30150 Mean :0.8463 Mean :38.73 Mean :1299.96
## 3rd Qu.:31835 3rd Qu.:0.8980 3rd Qu.:43.45 3rd Qu.: 372.11
## Max. :56055 Max. :0.9490 Max. :44.80 Max. :7039.71
## stringency_index log_gdp_per_capita
## Mode:logical Min. : 8.768
## NA's:6 1st Qu.:10.220
## Median :10.292
## Mean :10.146
## 3rd Qu.:10.367
## Max. :10.934
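The complete-case step above removes most countries, so it can help to count missing values per column before dropping rows. The following is a minimal sketch on a made-up data frame; the column names `vaccinated`, `gdp`, and `stringency` and all values are hypothetical stand-ins, not figures from the OWID file.

```r
# Toy data frame standing in for analysis_data; the values are invented
# for illustration and are NOT taken from the OWID file.
toy <- data.frame(
  location   = c("A", "B", "C", "D"),
  vaccinated = c(NA, 70.2, NA, 65.1),
  gdp        = c(1800, NA, 12000, 30000),
  stringency = c(NA, NA, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that would survive a complete-case filter on the model variables
model_vars <- c("vaccinated", "gdp")
n_complete <- sum(complete.cases(toy[, model_vars]))
print(n_complete)
```

Running the same two checks on analysis_data would show which variable drives the drop to 6 rows.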
Statistical Analysis Method: Multiple Linear Regression
Because the response variable people_fully_vaccinated_per_hundred is continuous, multiple linear regression is an appropriate method for examining how several country-level predictors jointly explain variation in full COVID-19 vaccination coverage. This model allows us to estimate the association between vaccination rates and factors such as economic development, human development, and population characteristics while holding other variables constant. Government policy stringency was originally intended as a predictor, but stringency_index is entirely missing in this dataset and is therefore not included in the model.
vaccination_model <- lm(
  people_fully_vaccinated_per_hundred ~
    log_gdp_per_capita +
    human_development_index +
    median_age +
    population_density,
  data = final_data
)
summary(vaccination_model)
##
## Call:
## lm(formula = people_fully_vaccinated_per_hundred ~ log_gdp_per_capita +
## human_development_index + median_age + population_density,
## data = final_data)
##
## Residuals:
## 1 2 3 4 5 6
## -1.21003 0.71462 0.02123 -0.07444 0.48709 0.06153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.464e+02 6.839e+01 -2.140 0.278
## log_gdp_per_capita 4.631e+01 1.627e+01 2.846 0.215
## human_development_index -3.165e+02 1.481e+02 -2.138 0.279
## median_age 3.835e-01 7.552e-01 0.508 0.701
## population_density 1.992e-03 4.507e-04 4.420 0.142
##
## Residual standard error: 1.491 on 1 degrees of freedom
## Multiple R-squared: 0.996, Adjusted R-squared: 0.98
## F-statistic: 62.39 on 4 and 1 DF, p-value: 0.09464
Interpretation
The multiple linear regression model examined how economic development, human development, demographic structure, and population density relate to the percentage of people fully vaccinated against COVID-19 across countries. The model explains a very large proportion of the variation in vaccination coverage in the analytic sample, with an R-squared of 0.996 and an adjusted R-squared of 0.98. However, the analytic sample contains only 6 countries and the model estimates 5 coefficients, so a near-perfect fit is expected almost by construction and should not be read as evidence of strong explanatory power.
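The reported adjusted R-squared can be checked by hand from the sample size and the number of predictors. This short sketch uses only values shown in the output above (6 rows in final_data, 4 predictors, R-squared of 0.996); the variable names are chosen here for illustration.

```r
n  <- 6      # countries in final_data, from dim(final_data)
k  <- 4      # predictors in vaccination_model
r2 <- 0.996  # Multiple R-squared reported by summary()

# Residual degrees of freedom: observations minus estimated coefficients
resid_df <- n - k - 1

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df

print(resid_df)
print(round(adj_r2, 2))
```

The single residual degree of freedom matches the "on 1 degrees of freedom" line in the model summary.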
Holding other variables constant, log GDP per capita has a positive estimated coefficient, suggesting that countries with higher economic capacity tend to have higher percentages of fully vaccinated individuals. This aligns with expectations, as wealthier countries generally have greater access to vaccines, stronger healthcare infrastructure, and more efficient distribution systems. Human Development Index (HDI) shows a negative estimated coefficient in this model, which may reflect overlap with GDP per capita and other predictors rather than a true negative relationship, suggesting potential multicollinearity among development-related variables.
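The multicollinearity suspected above can be quantified with variance inflation factors (VIFs). Below is a minimal base-R sketch on simulated data, where `x2` is deliberately built as a near copy of `x1`; the helper `vif_by_hand` and the toy variables are hypothetical and are not part of the original analysis. For the fitted model, `car::vif(vaccination_model)` computes the same quantity.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # nearly a copy of x1, so highly collinear
x3 <- rnorm(n)                 # unrelated to x1 and x2
toy <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the other predictors and
# compute 1 / (1 - R^2) from that auxiliary regression.
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    aux <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

vifs <- vif_by_hand(toy)
print(round(vifs, 1))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign that a coefficient's sign and size are unstable.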
Median age has a positive coefficient, which in this sample points to higher full vaccination rates in countries with older populations and is consistent with prioritization of older adults in vaccination campaigns. Population density also shows a positive association with full vaccination coverage, suggesting that more densely populated countries may have achieved higher vaccination rates, possibly due to greater perceived risk of transmission or more centralized healthcare delivery. Neither estimate is statistically distinguishable from zero, so both readings are tentative.
The overall model F-test is significant only at the 10% level (p ≈ 0.095), and no individual predictor is statistically significant at the conventional 5% level. With 6 observations and 5 estimated coefficients, the model has a single residual degree of freedom, so standard errors are large and individual effects cannot be separated precisely. Strong correlations among the development and demographic variables likely compound the problem.
Regression Assumptions and Diagnostics
Linearity
crPlots(vaccination_model)
Independence of Observations
Because each row is a different country, independence of observations is a reasonable assumption for this cross-sectional dataset.
plot(resid(vaccination_model), type = "b",
     main = "Residuals vs Order",
     ylab = "Residuals")
abline(h = 0, lty = 2)
Homoscedasticity, Normality, and Influential Points
par(mfrow = c(2, 2))
plot(vaccination_model)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
par(mfrow = c(1, 1))
Interpretation
Residuals vs Order
The residuals-versus-order plot shows no clear systematic trend over the observation order, with residuals fluctuating around zero. This suggests that the assumption of independence of observations is reasonably satisfied. Because each observation represents a distinct country, independence is also supported by the cross-sectional structure of the data.
Residuals vs Fitted
The Residuals vs Fitted plot shows residuals scattered around zero, though with visible structure due to the very small analytic sample size. While some curvature appears, this pattern is likely driven by overfitting rather than a strong violation of linearity. Overall, the linear form is acceptable given the exploratory nature of the analysis.
Normal Q–Q Plot
The Normal Q–Q plot shows noticeable deviations from the reference line, particularly in the tails. This suggests departures from normality in the residuals. However, with only six observations, a Q–Q plot cannot reliably distinguish non-normal residuals from sampling noise, so the normality assumption can be neither confirmed nor rejected here, and the reported p-values should be interpreted cautiously.
Scale–Location Plot
The Scale–Location plot indicates uneven spread of residuals across fitted values, suggesting potential heteroscedasticity. Given the limited number of observations, this pattern is difficult to assess reliably and is likely influenced by sample size constraints rather than systematic variance instability.
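For the influential-points part of the diagnostics, leverage is worth checking directly when the sample is this small. The sketch below fits a model of the same shape (6 observations, 4 predictors plus an intercept) to simulated noise; the data and variable names are made up for illustration. It relies on the fact that hat values always sum to the number of estimated coefficients, so with 5 coefficients and 6 rows the average leverage is 5/6.

```r
set.seed(42)
n <- 6
# Simulated data with the same dimensions as the analytic sample;
# these values are random noise, not the OWID data.
toy <- data.frame(
  y  = rnorm(n),
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n),
  x4 = rnorm(n)
)

fit <- lm(y ~ x1 + x2 + x3 + x4, data = toy)

# Leverage (hat value) of each observation
h <- hatvalues(fit)
print(round(h, 3))
print(sum(h))   # equals the number of estimated coefficients
print(mean(h))  # average leverage = coefficients / observations
```

With average leverage this close to 1, every country strongly pulls the fitted surface toward its own value, which is why the residuals are so small. Applying `hatvalues(vaccination_model)` would show the same pattern for the actual fit.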
Conclusion and Future Directions
This project examined country-level predictors of COVID-19 full vaccination rates using multiple linear regression. The analysis explored how economic development, human development, demographic structure, and population density relate to the percentage of people fully vaccinated across countries. The regression model demonstrates that development-related factors are strongly associated with vaccination coverage, highlighting the role of structural and socioeconomic capacity in shaping public health outcomes.
At the same time, the analysis pointed to likely multicollinearity among development indicators and a very small effective sample size (6 countries) due to missing data across countries. As a result, individual coefficient estimates and statistical significance should be interpreted cautiously. The exceptionally high R-squared likely reflects overfitting rather than true explanatory power, underscoring the limitations of cross-sectional country-level data when multiple correlated predictors are included.
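The overfitting concern can be illustrated with a small simulation: fit 4 predictors to 6 observations when every variable is pure random noise and look at the resulting R-squared values. This is a sketch under that assumption, with made-up variable names. For independent normal noise, the expected R-squared is k / (n - 1), which is 4/5 = 0.8 here.

```r
set.seed(123)
n <- 6  # observations, matching the analytic sample size
k <- 4  # predictors, matching the fitted model

# Repeatedly fit a regression where y and all predictors are unrelated noise
sim_r2 <- replicate(2000, {
  d <- as.data.frame(matrix(rnorm(n * (k + 1)), nrow = n))
  names(d) <- c("y", paste0("x", 1:k))
  summary(lm(y ~ ., data = d))$r.squared
})

print(mean(sim_r2))         # average R-squared under pure noise
print(mean(sim_r2 > 0.95))  # share of noise-only fits with R-squared above 0.95
```

Because high R-squared values arise routinely with no real relationship at this sample size, the 0.996 reported above carries little evidential weight on its own.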
Future research could strengthen this analysis by using longitudinal (panel) data to increase the number of observations, allowing for more stable estimates and better assessment of temporal dynamics. Additional improvements could include reducing predictor redundancy, incorporating regional fixed effects, or adding healthcare system capacity measures. Despite its limitations, this analysis provides
References
Our World in Data. (n.d.). Coronavirus (COVID-19) vaccinations. https://ourworldindata.org/covid-vaccinations
R Core Team. (n.d.). R: A language and environment for statistical computing. R Foundation for Statistical Computing.
Fox, J., & Weisberg, S. (n.d.). car package documentation.