Article
Article Link
The article I chose is from Our World in Data, and it discusses a wide range of topics related to Coronavirus (COVID-19) Vaccinations. I chose this article because it is presently relevant and has many very interesting visualizations that I would like to attempt to recreate and build on top of. The key purpose/idea of the article is twofold:
- Display the high-level statistics on vaccines administered throughout the world
- Highlight the disparity in vaccines administered by country and potentially other demographic details.
For example, the most prominent KPIs that the article presents are that 62.7% of the world population has received at least one dose of a COVID-19 vaccine, that 10.7 billion doses have been administered globally, and that 22.83 million are now administered each day. The last key statistic is that only 12.3% of people in low-income countries have received at least one dose. I agree with the premise of the article: this disparity is apparent and grounded in the data. The data visualizations in the article do a great job of displaying this fact, and in my presentation I will try to build on the article’s work.
Data
Kaggle Link | GitHub Link
My dataset is sourced from Kaggle. I am confident that the dataset is appropriate for my article because the Kaggle repository is sourced and updated weekly from a GitHub repository that was used for the article. As noted at the bottom of the article, the data was collected from a variety of global health organizations and national governments. It appears that the main sources of data are the World Health Organization (WHO) and national ministries of health.
Although the broader set of articles uses all of the variables in the dataset, my article specifically focuses on vaccination-related variables. For example, the article uses country name, date, and total/new vaccinations and boosters, among others.
Load Data + Import Libraries
Libraries: ggplot2, timereg, plotly, dplyr, tidyr, DT
Load Data
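# stringsAsFactors = TRUE reproduces the factor columns shown below on R >= 4.0
covid <- read.csv('owid-covid-data.csv', stringsAsFactors = TRUE)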
Dimensions
## Dimensions: 160934 67
## # of Columns: 67
## Columns: iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million
Data Structure
## 'data.frame': 160934 obs. of 67 variables:
## $ iso_code : Factor w/ 238 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ continent : Factor w/ 7 levels "","Africa","Asia",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ location : Factor w/ 238 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : Factor w/ 772 levels "2020-01-01","2020-01-02",..: 55 56 57 58 59 60 61 62 63 64 ...
## $ total_cases : num 5 5 5 5 5 5 5 5 5 5 ...
## $ new_cases : num 5 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed : num NA NA NA NA NA 0.714 0.714 0 0 0 ...
## $ total_deaths : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_cases_per_million : num 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 ...
## $ new_cases_per_million : num 0.126 0 0 0 0 0 0 0 0 0 ...
## $ new_cases_smoothed_per_million : num NA NA NA NA NA 0.018 0.018 0 0 0 ...
## $ total_deaths_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_deaths_smoothed_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ reproduction_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients : num NA NA NA NA NA NA NA NA NA NA ...
## $ icu_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients : num NA NA NA NA NA NA NA NA NA NA ...
## $ hosp_patients_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_icu_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions : num NA NA NA NA NA NA NA NA NA NA ...
## $ weekly_hosp_admissions_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_tests_smoothed_per_thousand : num NA NA NA NA NA NA NA NA NA NA ...
## $ positive_rate : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_per_case : num NA NA NA NA NA NA NA NA NA NA ...
## $ tests_units : Factor w/ 5 levels "","people tested",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ total_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_vaccinations_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ people_fully_vaccinated_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ total_boosters_per_hundred : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_vaccinations_smoothed_per_million : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed : num NA NA NA NA NA NA NA NA NA NA ...
## $ new_people_vaccinated_smoothed_per_hundred: num NA NA NA NA NA NA NA NA NA NA ...
## $ stringency_index : num 8.33 8.33 8.33 8.33 8.33 ...
## $ population : num 39835428 39835428 39835428 39835428 39835428 ...
## $ population_density : num 54.4 54.4 54.4 54.4 54.4 ...
## $ median_age : num 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
## $ aged_65_older : num 2.58 2.58 2.58 2.58 2.58 ...
## $ aged_70_older : num 1.34 1.34 1.34 1.34 1.34 ...
## $ gdp_per_capita : num 1804 1804 1804 1804 1804 ...
## $ extreme_poverty : num NA NA NA NA NA NA NA NA NA NA ...
## $ cardiovasc_death_rate : num 597 597 597 597 597 ...
## $ diabetes_prevalence : num 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
## $ female_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ male_smokers : num NA NA NA NA NA NA NA NA NA NA ...
## $ handwashing_facilities : num 37.7 37.7 37.7 37.7 37.7 ...
## $ hospital_beds_per_thousand : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ life_expectancy : num 64.8 64.8 64.8 64.8 64.8 ...
## $ human_development_index : num 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
## $ excess_mortality_cumulative_absolute : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality : num NA NA NA NA NA NA NA NA NA NA ...
## $ excess_mortality_cumulative_per_million : num NA NA NA NA NA NA NA NA NA NA ...
Numerical Summaries
## total_cases total_deaths total_tests total_vaccinations
## Min. : 1 Min. : 1 Min. : 0 Min. :0.000e+00
## 1st Qu.: 1819 1st Qu.: 75 1st Qu.: 346961 1st Qu.:5.434e+05
## Median : 23970 Median : 740 Median : 1757715 Median :4.387e+06
## Mean : 2355755 Mean : 55691 Mean : 15993388 Mean :1.595e+08
## 3rd Qu.: 278559 3rd Qu.: 6986 3rd Qu.: 8113634 3rd Qu.:2.746e+07
## Max. :405961201 Max. :5789567 Max. :787820796 Max. :1.032e+10
## NA's :2888 NA's :20521 NA's :93606 NA's :118159
## population human_development_index gdp_per_capita extreme_poverty
## Min. :4.700e+01 Min. :0.394 Min. : 661.2 Min. : 0.10
## 1st Qu.:1.172e+06 1st Qu.:0.602 1st Qu.: 4466.5 1st Qu.: 0.60
## Median :8.478e+06 Median :0.743 Median : 12951.8 Median : 2.20
## Mean :1.478e+08 Mean :0.726 Mean : 19655.8 Mean :13.56
## 3rd Qu.:3.393e+07 3rd Qu.:0.845 3rd Qu.: 27936.9 3rd Qu.:21.20
## Max. :7.875e+09 Max. :0.957 Max. :116935.6 Max. :77.60
## NA's :1052 NA's :28993 NA's :26816 NA's :72559
Data Validation
Correct Data Types
After exploring the structure and descriptive summary of the dataset, it appears that all variables are of the correct data type; specifically, variables like country and continent are factor variables and most of the remaining are numeric. The only variable that would potentially need to be changed is the date column. I decided to keep one column as a factor, and create a second, date-type column using the as.Date() function.
## # of Numeric Columns: 62
## # of Character Columns: 5
covid$dt <- as.Date(covid$date)
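The type counts and the date conversion above can be illustrated on a small stand-in data frame. This is only a sketch: the `toy` data frame and its values are hypothetical, and the report's own counting code is not shown in the source.

```r
# Toy stand-in for the real data (hypothetical values, real column names).
toy <- data.frame(
  location = factor(c("Afghanistan", "Afghanistan", "Albania")),
  date = factor(c("2020-02-24", "2020-02-25", "2020-02-25")),
  total_cases = c(5, 5, 2)
)

# Count columns by type; factors are not counted as numeric.
n_numeric <- sum(sapply(toy, is.numeric))
n_factor <- sum(sapply(toy, is.factor))
cat("# of Numeric Columns:", n_numeric, "\n")
cat("# of Factor Columns:", n_factor, "\n")

# as.Date() converts a factor through its labels, so ISO "YYYY-MM-DD"
# strings parse without an explicit format argument.
toy$dt <- as.Date(toy$date)
date_range <- range(toy$dt)
print(date_range)
```

Keeping the original `date` column alongside `dt` mirrors the choice described above: the factor stays available for grouping, while the Date column supports comparisons such as `max()`.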
Valid Ranges
Date Ranges
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2020-01-01" "2020-09-02" "2021-03-02" "2021-02-25" "2021-08-22" "2022-02-10"
Missing Values Viz
Missing Values Stats
Although it is slightly disconcerting that there are almost 5 million missing data points (~43% of the entire dataset), this does make sense considering that the dataset includes records beginning on January 1st, 2020. Therefore, especially for columns such as total vaccinations and total boosters, over 100,000 missing values is expected, considering that vaccines were not administered until late 2020 and boosters later still. To fix this, I think it is reasonable to simply fill these null values with 0.
Another important note is that there appear to be ~87k blank, but non-NA, values. The only two columns where this is an issue are the continent and test units columns; I dropped all rows where continent is blank, as these appear to be aggregate entries (such as world, continent, and income-group totals) rather than individual countries.
## Total Missing Values: 4726363
## % Missing of Entire Dataset: 43.19
NULLs by Column
## [,1]
## iso_code 0
## continent 0
## location 0
## date 0
## total_cases 2888
## new_cases 2918
## new_cases_smoothed 4069
## total_deaths 20521
## new_deaths 20347
## new_deaths_smoothed 20477
## total_cases_per_million 3623
## new_cases_per_million 3653
## new_cases_smoothed_per_million 4799
## total_deaths_per_million 21243
## new_deaths_per_million 21069
## new_deaths_smoothed_per_million 21199
## reproduction_rate 39281
## icu_patients 138265
## icu_patients_per_million 138265
## hosp_patients 137494
## hosp_patients_per_million 137494
## weekly_icu_admissions 155770
## weekly_icu_admissions_per_million 155770
## weekly_hosp_admissions 150458
## weekly_hosp_admissions_per_million 150458
## new_tests 94793
## total_tests 93606
## total_tests_per_thousand 93606
## new_tests_per_thousand 94793
## new_tests_smoothed 79374
## new_tests_smoothed_per_thousand 79374
## positive_rate 84612
## tests_per_case 85172
## tests_units 0
## total_vaccinations 118159
## people_vaccinated 120203
## people_fully_vaccinated 122896
## total_boosters 145780
## new_vaccinations 125511
## new_vaccinations_smoothed 81658
## total_vaccinations_per_hundred 118159
## people_vaccinated_per_hundred 120203
## people_fully_vaccinated_per_hundred 122896
## total_boosters_per_hundred 145780
## new_vaccinations_smoothed_per_million 81658
## new_people_vaccinated_smoothed 82851
## new_people_vaccinated_smoothed_per_hundred 82851
## stringency_index 34847
## population 1052
## population_density 17691
## median_age 27438
## aged_65_older 28886
## aged_70_older 28154
## gdp_per_capita 26816
## extreme_poverty 72559
## cardiovasc_death_rate 28468
## diabetes_prevalence 21527
## female_smokers 58207
## male_smokers 59681
## handwashing_facilities 94577
## hospital_beds_per_thousand 41145
## life_expectancy 10718
## human_development_index 28993
## excess_mortality_cumulative_absolute 155405
## excess_mortality_cumulative 155405
## excess_mortality 155393
## excess_mortality_cumulative_per_million 155405
## dt 0
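The missing-value totals and per-column counts above can be computed along the following lines. This is a minimal sketch on a hypothetical `toy` data frame, not the report's own code, which the source does not show.

```r
# Toy stand-in with a few NA cells (hypothetical values).
toy <- data.frame(
  location = c("A", "A", "B", "B"),
  total_cases = c(1, 2, NA, 4),
  total_vaccinations = c(NA, NA, NA, 10),
  stringsAsFactors = TRUE
)

# is.na() on a data frame returns a logical matrix, so sum() counts NA cells.
total_missing <- sum(is.na(toy))

# Share of all cells (rows x columns) that are missing, as a percentage.
pct_missing <- round(100 * total_missing / (nrow(toy) * ncol(toy)), 2)

# Per-column NA counts, analogous to the "NULLs by Column" listing.
na_by_column <- colSums(is.na(toy))

cat("Total Missing Values:", total_missing, "\n")
cat("% Missing of Entire Dataset:", pct_missing, "\n")
print(na_by_column)
```

The percentage is taken over every cell in the data frame, which is consistent with the reported figure: 4,726,363 missing cells out of 160,934 rows times 68 columns (including `dt`) is about 43.19%.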
covid[is.na(covid)] <- 0
cat("Total Blanks: ", sum(covid[, -length(names(covid))] == ""))
## Total Blanks: 86798
covid <- covid[covid$continent != "", ]
## Total Missing Values: 0
Duplicate Rows
After a quick look at the unique rows in the dataset, it is evident that there are no duplicate rows.
## Number of Unique Rows: 151277
## Number of Total Rows: 151277
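One way to run this kind of duplicate check is to compare the unique row count with the total row count. The sketch below uses a hypothetical `toy` data frame with one deliberately repeated row, so that the two counts differ.

```r
# Toy stand-in; the last row repeats the first (hypothetical values).
toy <- data.frame(
  location = c("A", "A", "B", "A"),
  date = c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"),
  total_cases = c(1, 2, 1, 1)
)

n_unique <- nrow(unique(toy))    # rows left after dropping exact repeats
n_total <- nrow(toy)             # all rows
n_dupes <- sum(duplicated(toy))  # rows that repeat an earlier row

cat("Number of Unique Rows:", n_unique, "\n")
cat("Number of Total Rows:", n_total, "\n")
```

When the two counts match, as they do in the output above, `sum(duplicated(...))` is 0 and no rows need to be removed.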
Most Recent Data
most_recent_date <- max(covid$dt)
most_recent <- covid[covid$dt == most_recent_date,]
cat("Dimensions:", dim(most_recent))
## Dimensions: 215 68
Plots
Vaccines by Country
My first graph uses ISO country codes to plot a world map, with the color scale corresponding to the total number of vaccinations per 100 people; I decided to use the “per 100” variable for my analysis to avoid misleading disparities that are solely due to population differences (e.g., countries like China and India that have massive populations). Regardless of any other variable, it is apparent that there is a drastic disparity in total vaccinations throughout the world. Countries traditionally considered more developed, like China, Canada, the US, and Italy, have very high numbers, whereas less developed nations are significantly lower.
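A map of this kind could be built with plotly's choropleth trace, which matches ISO-3 country codes to country shapes by default. The following is a sketch only: the `toy` data frame and its values are made up, and the color scale and title are assumptions rather than the settings used for the figure in this report.

```r
library(plotly)

# Toy stand-in with made-up values; the real plot would use `most_recent`.
toy <- data.frame(
  iso_code = c("USA", "CAN", "NGA"),
  location = c("United States", "Canada", "Nigeria"),
  total_vaccinations_per_hundred = c(160, 200, 10)
)

# Choropleth trace: `locations` takes ISO-3 codes, `z` drives the color scale.
fig <- plot_ly(
  toy,
  type = "choropleth",
  locations = ~iso_code,
  z = ~total_vaccinations_per_hundred,
  text = ~location,
  colorscale = "Blues"
)
fig <- layout(fig, title = "Total vaccinations per 100 people (toy data)")
```

Printing `fig` in an interactive session or a knitted document renders the map; countries missing from the data frame are simply left uncolored.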
Vaccines x GDP per Capita
My second graph uses LOESS (local regression) to highlight the positive relationship between a country’s GDP per capita and total vaccinations per 100. This is not a perfect trend, however, since there are countries with very high GDP per capita but very low vaccination counts.
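To make the smoothing step concrete, here is a minimal base-R sketch of a LOESS fit using `stats::loess()`. The data are simulated and the `span` value is an assumption; the plotting code behind the figure is not shown in the source, so this illustrates the technique rather than reproducing the graph.

```r
set.seed(1)

# Simulated stand-in: vaccinations rise with GDP per capita, plus noise.
gdp <- seq(1000, 60000, length.out = 50)
vax <- 40 + 150 * (1 - exp(-gdp / 20000)) + rnorm(50, sd = 10)
toy <- data.frame(gdp_per_capita = gdp, total_vaccinations_per_hundred = vax)

# Local regression: each fitted value comes from a weighted fit to nearby
# points; `span` sets the fraction of the data used in each local fit.
fit <- loess(total_vaccinations_per_hundred ~ gdp_per_capita,
             data = toy, span = 0.75)

# Evaluate the smooth curve on a small grid inside the observed range.
grid <- data.frame(gdp_per_capita = seq(5000, 50000, length.out = 5))
smoothed <- predict(fit, newdata = grid)
print(round(smoothed, 1))
```

A smaller `span` follows the points more closely, and a larger one gives a smoother curve. Because the fit is local, a handful of high-GDP, low-vaccination countries pull the curve down only in their own neighborhood.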

Vaccines by GDP and HDI
My third graph is a 3D Plotly graph that builds on the previous finding by showing that the Human Development Index has a similar positive relationship with total vaccinations. In other words, a country with a high HDI value typically has a higher number of vaccinations per 100. For reference, per Wikipedia, the HDI is a statistical composite index of life expectancy, education, and per capita income indicators, which are used to rank countries into four tiers of human development. Additionally, the graph is colored by continent to show that there is little distinct clustering by continent. The most noticeable cluster is the orange dots, corresponding to African countries, which was also depicted in the first Plotly graph.
Vaccines by Income and Development Levels
The next set of graphs are all very similar in that they are boxplots that use facet wrap on either GDP per Capita (“Income Groups”) or Human Development Index (“HDI Groups”). It is immediately apparent that the number of vaccinations per 100 is significantly higher in the groups with higher GDP per Capita (“Upper Income”) and higher HDI values (“Developed”). It is important to note that I used the log of total vaccinations per 100 as the y axis for each of these boxplots in order to best display the trend and disparity between groups. Furthermore, the “II” labeled graphs add 1 to the total vaccinations variable to avoid taking log(0), which evaluates to -Inf in R and is dropped from the plot; I decided to include all graphs because the “I” labeled graphs somewhat hide the trend due to the large number of dropped values.
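The difference between the “I” and “II” versions comes down to how zeros behave under the log transform. A small sketch with made-up values:

```r
# Made-up vaccinations-per-100 values, including zeros such as those
# produced by filling missing values with 0.
vax_per_100 <- c(0, 0, 12.5, 80, 210)

# Plain log: zeros become -Inf, which plotting functions treat as non-finite.
log_plain <- log(vax_per_100)

# Shifted log: adding 1 first maps zero to log(1) = 0, so every point is kept.
log_shift <- log(vax_per_100 + 1)

n_dropped <- sum(!is.finite(log_plain))
cat("Non-finite values under plain log:", n_dropped, "\n")
print(round(log_shift, 2))
```

Base R's `log1p(x)` computes the same `log(x + 1)` transform. The shift keeps zero-valued countries on the plot, at the cost of slightly compressing the smallest non-zero values.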
Quantile Cuts
most_recent$income <- qcut(
  most_recent$gdp_per_capita,
  cuts=3,
  labels=c("Low Income", "Middle Income", "Upper Income")
)
most_recent$hdi <- qcut(
  most_recent$human_development_index,
  cuts=4,
  labels=c("Underdeveloped", "Mid-Low Dev.", "Mid-High Dev.", "Developed")
)
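For readers without the `timereg` package, the same idea of quantile-based grouping can be sketched in base R with `quantile()` and `cut()`. This is an analogous approach under stated assumptions, not a description of how `qcut()` works internally: the `gdp` values are made up, and the handling of values that fall exactly on a boundary may differ from `qcut()`.

```r
# Made-up GDP per capita values for nine hypothetical countries.
gdp <- c(700, 1800, 4500, 9000, 13000, 20000, 28000, 45000, 117000)

# Tercile boundaries: the 0%, 33.3%, 66.7%, and 100% quantiles.
breaks <- quantile(gdp, probs = seq(0, 1, length.out = 4), na.rm = TRUE)

# Assign each value to a labeled group; include.lowest keeps the minimum.
income <- cut(
  gdp,
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("Low Income", "Middle Income", "Upper Income")
)

print(table(income))
```

Because the boundaries are quantiles of the data itself, each group holds roughly the same number of countries, which is why the labels describe relative rather than absolute income levels.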
Income Groups I

Income Groups II

HDI Groups I

HDI Groups II

Conclusion
After exploring the dataset via multiple visualization techniques, it is evident that there is a noticeable disparity in the total number of vaccinations administered throughout the world. Specifically, lower-income and less developed countries (as determined by GDP per Capita and the Human Development Index) have significantly fewer vaccinations per 100 people. These findings agree with the article’s argument and the sad truth that vaccinations remain a luxury largely enjoyed by higher-income countries. Beyond the scope of the article and dataset, these findings make sense considering that higher-income countries not only have more established medical supply chains, but also that many of these more developed countries were actively involved in the development of the vaccines.
In terms of limitations and future work, I would have loved to dive into more country-specific details, like the number of hospitals or medical professionals in the country (to name a few); this serves as both a limitation of the current dataset and potential for future work given another similar dataset. Additionally, among the 67 columns in the dataset, I only analyzed roughly 10, which leaves the door open for an incredible amount of data visualization and analysis. For example, I could have completed a very similar analysis for the number of deaths, cases, tests, hospitalizations, etc. Lastly, I largely decided to focus on the most recent statistics for each of the countries, rather than looking at these numbers over time; this substantially cuts into the potential of this massive dataset. That being said, I decided to do so because the total vaccination count is cumulative and the most recent statistics are essentially the most important, especially considering the number will only be increasing.