1 Article

Article Link

The article I chose is from Our World in Data, and it discusses a wide range of topics related to Coronavirus (COVID-19) Vaccinations. I chose this article because it is presently relevant and has many very interesting visualizations that I would like to attempt to recreate and build on top of. The key purpose/idea of the article is twofold:

  1. Display the high-level statistics on vaccines administered throughout the world
  2. Highlight the disparity in vaccines administered by country and potentially other demographic details.

For example, the most prominent KPIs that the article presents are that 62.7% of the world population has received at least one dose of a COVID-19 vaccine, that 10.7 billion doses have been administered globally, and that 22.83 million doses are now administered each day. The last key statistic is that only 12.3% of people in low-income countries have received at least one dose. I agree with the premise of the article: this disparity is apparent and grounded in fact. The data visualizations in the article do a great job of displaying it, and in my presentation I will try to build on the article’s work.

2 Data

Kaggle Link | GitHub Link

My dataset is sourced from Kaggle. I am confident that the dataset is appropriate for my article because the Kaggle repository is sourced and updated weekly from the GitHub repository that was used for the article. As noted at the bottom of the article, the data was collected from a variety of global health organizations and national governments. It appears that the main sources of data are the World Health Organization (WHO) and national Ministries of Health.

Although the broader set of articles uses all of the variables in the dataset, my article specifically focuses on vaccination-related variables. For example, the article uses country name, date, total/new vaccinations and boosters, among others.

2.1 Load Data + Import Libraries

Libraries: ggplot2, timereg, plotly, dplyr, tidyr, DT

Load Data

covid <- read.csv('owid-covid-data.csv')

Dimensions

## Dimensions: 160934 67
## # of Columns: 67
## Columns: iso_code continent location date total_cases new_cases new_cases_smoothed total_deaths new_deaths new_deaths_smoothed total_cases_per_million new_cases_per_million new_cases_smoothed_per_million total_deaths_per_million new_deaths_per_million new_deaths_smoothed_per_million reproduction_rate icu_patients icu_patients_per_million hosp_patients hosp_patients_per_million weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions weekly_hosp_admissions_per_million new_tests total_tests total_tests_per_thousand new_tests_per_thousand new_tests_smoothed new_tests_smoothed_per_thousand positive_rate tests_per_case tests_units total_vaccinations people_vaccinated people_fully_vaccinated total_boosters new_vaccinations new_vaccinations_smoothed total_vaccinations_per_hundred people_vaccinated_per_hundred people_fully_vaccinated_per_hundred total_boosters_per_hundred new_vaccinations_smoothed_per_million new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred stringency_index population population_density median_age aged_65_older aged_70_older gdp_per_capita extreme_poverty cardiovasc_death_rate diabetes_prevalence female_smokers male_smokers handwashing_facilities hospital_beds_per_thousand life_expectancy human_development_index excess_mortality_cumulative_absolute excess_mortality_cumulative excess_mortality excess_mortality_cumulative_per_million

2.2 Data Preview

2.3 Data Structure

## 'data.frame':    160934 obs. of  67 variables:
##  $ iso_code                                  : Factor w/ 238 levels "ABW","AFG","AGO",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ continent                                 : Factor w/ 7 levels "","Africa","Asia",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ location                                  : Factor w/ 238 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ date                                      : Factor w/ 772 levels "2020-01-01","2020-01-02",..: 55 56 57 58 59 60 61 62 63 64 ...
##  $ total_cases                               : num  5 5 5 5 5 5 5 5 5 5 ...
##  $ new_cases                                 : num  5 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed                        : num  NA NA NA NA NA 0.714 0.714 0 0 0 ...
##  $ total_deaths                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths                                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_smoothed                       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_cases_per_million                   : num  0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 ...
##  $ new_cases_per_million                     : num  0.126 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed_per_million            : num  NA NA NA NA NA 0.018 0.018 0 0 0 ...
##  $ total_deaths_per_million                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_per_million                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_smoothed_per_million           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ reproduction_rate                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients_per_million                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients                             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients_per_million                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions                     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions_per_million         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions_per_million        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests                                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests                               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand                  : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_per_thousand                    : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed_per_thousand           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ positive_rate                             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_per_case                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_units                               : Factor w/ 5 levels "","people tested",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ total_vaccinations                        : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated                         : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated                   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_vaccinations_per_hundred            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated_per_hundred             : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated_per_hundred       : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters_per_hundred                : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed_per_million     : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed_per_hundred: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ stringency_index                          : num  8.33 8.33 8.33 8.33 8.33 ...
##  $ population                                : num  39835428 39835428 39835428 39835428 39835428 ...
##  $ population_density                        : num  54.4 54.4 54.4 54.4 54.4 ...
##  $ median_age                                : num  18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
##  $ aged_65_older                             : num  2.58 2.58 2.58 2.58 2.58 ...
##  $ aged_70_older                             : num  1.34 1.34 1.34 1.34 1.34 ...
##  $ gdp_per_capita                            : num  1804 1804 1804 1804 1804 ...
##  $ extreme_poverty                           : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ cardiovasc_death_rate                     : num  597 597 597 597 597 ...
##  $ diabetes_prevalence                       : num  9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
##  $ female_smokers                            : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ male_smokers                              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ handwashing_facilities                    : num  37.7 37.7 37.7 37.7 37.7 ...
##  $ hospital_beds_per_thousand                : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ life_expectancy                           : num  64.8 64.8 64.8 64.8 64.8 ...
##  $ human_development_index                   : num  0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
##  $ excess_mortality_cumulative_absolute      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative               : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality                          : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative_per_million   : num  NA NA NA NA NA NA NA NA NA NA ...

2.4 Numerical Summaries

##   total_cases         total_deaths      total_tests        total_vaccinations 
##  Min.   :        1   Min.   :      1   Min.   :        0   Min.   :0.000e+00  
##  1st Qu.:     1819   1st Qu.:     75   1st Qu.:   346961   1st Qu.:5.434e+05  
##  Median :    23970   Median :    740   Median :  1757715   Median :4.387e+06  
##  Mean   :  2355755   Mean   :  55691   Mean   : 15993388   Mean   :1.595e+08  
##  3rd Qu.:   278559   3rd Qu.:   6986   3rd Qu.:  8113634   3rd Qu.:2.746e+07  
##  Max.   :405961201   Max.   :5789567   Max.   :787820796   Max.   :1.032e+10  
##  NA's   :2888        NA's   :20521     NA's   :93606       NA's   :118159     
##    population        human_development_index gdp_per_capita     extreme_poverty
##  Min.   :4.700e+01   Min.   :0.394           Min.   :   661.2   Min.   : 0.10  
##  1st Qu.:1.172e+06   1st Qu.:0.602           1st Qu.:  4466.5   1st Qu.: 0.60  
##  Median :8.478e+06   Median :0.743           Median : 12951.8   Median : 2.20  
##  Mean   :1.478e+08   Mean   :0.726           Mean   : 19655.8   Mean   :13.56  
##  3rd Qu.:3.393e+07   3rd Qu.:0.845           3rd Qu.: 27936.9   3rd Qu.:21.20  
##  Max.   :7.875e+09   Max.   :0.957           Max.   :116935.6   Max.   :77.60  
##  NA's   :1052        NA's   :28993           NA's   :26816      NA's   :72559

3 Data Validation

3.1 Correct Data Types

After exploring the structure and descriptive summary of the dataset, it appears that all variables are of the correct data type; specifically, variables like country and continent are factor variables and most of the remaining variables are numeric. The only variable that potentially needs to be changed is the date column. I decided to keep the original column as a factor and create a second, date-type column using the as.Date() function.

## # of Numeric Columns:  62
## # of Character Columns:  5
covid$dt <- as.Date(covid$date)
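
As a minimal sketch of this conversion on a toy factor (the `toy_date` vector below is made up for illustration and is not part of the dataset): `as.Date()` converts a factor to character first, so ISO-formatted strings like those in the `date` column should parse without an explicit format argument.

```r
# Toy factor mimicking the `date` column (values are illustrative only)
toy_date <- factor(c("2020-01-01", "2021-03-02", "2022-02-10"))

# as.Date() parses YYYY-MM-DD strings without an explicit format
toy_dt <- as.Date(toy_date)

class(toy_dt)        # "Date"
range(toy_dt)        # earliest and latest dates
diff(range(toy_dt))  # time span covered, in days
```

Keeping the original factor column alongside the new Date column, as done above, leaves the raw values available for grouping while the Date column supports range checks and time-based plots.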

3.2 Valid Ranges

Date Ranges

##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2020-01-01" "2020-09-02" "2021-03-02" "2021-02-25" "2021-08-22" "2022-02-10"

3.3 Missing Values Viz

src

3.4 Missing Values Stats

Although it is slightly disconcerting that there are roughly 4.7 million missing data points (~43% of the entire dataset), this does make sense considering the dataset includes records beginning on January 1st, 2020. Therefore, especially for columns such as total vaccinations and total boosters, over 100,000 missing values is expected, considering vaccines were not administered until late 2020 and boosters not until well into 2021. In order to fix this, I think that it is reasonable to simply fill these null values with 0.

Another important note is that there appear to be ~87k blank, but non-NA, values. The only two columns where this is an issue are the continent and test units columns; I dropped all rows where continent is blank, as these appear to be aggregate entries (such as World, individual continents, and income groups) rather than individual countries.

## Total Missing Values: 4726363
## % Missing of Entire Dataset: 43.19
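
Totals like the ones above, and per-column counts, can be computed along the following lines. This is a sketch on a made-up `toy` data frame rather than the exact chunk used in this report; the percentage is taken over all cells (rows times columns).

```r
# Toy data frame with a few NAs (values are illustrative only)
toy <- data.frame(
  location = c("A", "A", "B", "B"),
  total_cases = c(1, 2, NA, 4),
  total_vaccinations = c(NA, NA, NA, 10)
)

total_missing <- sum(is.na(toy))                     # NAs across all cells
pct_missing <- 100 * total_missing / prod(dim(toy))  # share of all cells
na_by_column <- colSums(is.na(toy))                  # NAs per column

cat("Total Missing Values:", total_missing, "\n")
cat("% Missing of Entire Dataset:", round(pct_missing, 2), "\n")
print(na_by_column)
```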

NULLs by Column

##                                            [,1]  
## iso_code                                   0     
## continent                                  0     
## location                                   0     
## date                                       0     
## total_cases                                2888  
## new_cases                                  2918  
## new_cases_smoothed                         4069  
## total_deaths                               20521 
## new_deaths                                 20347 
## new_deaths_smoothed                        20477 
## total_cases_per_million                    3623  
## new_cases_per_million                      3653  
## new_cases_smoothed_per_million             4799  
## total_deaths_per_million                   21243 
## new_deaths_per_million                     21069 
## new_deaths_smoothed_per_million            21199 
## reproduction_rate                          39281 
## icu_patients                               138265
## icu_patients_per_million                   138265
## hosp_patients                              137494
## hosp_patients_per_million                  137494
## weekly_icu_admissions                      155770
## weekly_icu_admissions_per_million          155770
## weekly_hosp_admissions                     150458
## weekly_hosp_admissions_per_million         150458
## new_tests                                  94793 
## total_tests                                93606 
## total_tests_per_thousand                   93606 
## new_tests_per_thousand                     94793 
## new_tests_smoothed                         79374 
## new_tests_smoothed_per_thousand            79374 
## positive_rate                              84612 
## tests_per_case                             85172 
## tests_units                                0     
## total_vaccinations                         118159
## people_vaccinated                          120203
## people_fully_vaccinated                    122896
## total_boosters                             145780
## new_vaccinations                           125511
## new_vaccinations_smoothed                  81658 
## total_vaccinations_per_hundred             118159
## people_vaccinated_per_hundred              120203
## people_fully_vaccinated_per_hundred        122896
## total_boosters_per_hundred                 145780
## new_vaccinations_smoothed_per_million      81658 
## new_people_vaccinated_smoothed             82851 
## new_people_vaccinated_smoothed_per_hundred 82851 
## stringency_index                           34847 
## population                                 1052  
## population_density                         17691 
## median_age                                 27438 
## aged_65_older                              28886 
## aged_70_older                              28154 
## gdp_per_capita                             26816 
## extreme_poverty                            72559 
## cardiovasc_death_rate                      28468 
## diabetes_prevalence                        21527 
## female_smokers                             58207 
## male_smokers                               59681 
## handwashing_facilities                     94577 
## hospital_beds_per_thousand                 41145 
## life_expectancy                            10718 
## human_development_index                    28993 
## excess_mortality_cumulative_absolute       155405
## excess_mortality_cumulative                155405
## excess_mortality                           155393
## excess_mortality_cumulative_per_million    155405
## dt                                         0
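# Replace every NA in the data frame with 0
covid[is.na(covid)] <- 0
# Count empty strings in every column except the last one (dt, the Date column)
cat("Total Blanks: ", sum(covid[, -ncol(covid)] == ""))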
## Total Blanks:  86798
# Drop rows where continent is blank
covid <- covid[covid$continent != "", ]
## Total Missing Values: 0

3.5 Duplicate Rows

After a quick look at the unique rows in the dataset, it is evident that there are no duplicate rows.

## Number of Unique Rows: 151277
## Number of Total Rows: 151277
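
One way to run this kind of check is sketched below on a made-up `toy` data frame that deliberately contains one duplicated row, so the counts differ; on the real dataset the two counts match.

```r
# Toy data frame; rows 3 and 4 are identical on purpose (illustrative only)
toy <- data.frame(
  location = c("A", "A", "B", "B"),
  date = c("2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01"),
  total_cases = c(1, 2, 3, 3)
)

n_unique <- nrow(unique(toy))     # rows left after removing exact duplicates
n_total <- nrow(toy)              # all rows
n_dupes <- sum(duplicated(toy))   # rows that repeat an earlier row

cat("Number of Unique Rows:", n_unique, "\n")
cat("Number of Total Rows:", n_total, "\n")
cat("Number of Duplicate Rows:", n_dupes, "\n")
```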

3.6 Most Recent Data

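# Latest date present anywhere in the dataset
most_recent_date <- max(covid$dt)
# Keep only the rows reported on that date (one per location)
most_recent <- covid[covid$dt == most_recent_date,]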
cat("Dimensions:", dim(most_recent))
## Dimensions: 215 68

4 Plots

4.1 Vaccines by Country

My first graph uses ISO country codes to plot a world map, with the color scale corresponding to the total number of vaccinations per 100 people; I decided to use the “per 100” variable for my analysis to avoid misleading disparities due solely to population differences (e.g., countries like China and India that have massive populations). Regardless of any other variable, it is apparent that there is a drastic disparity in total vaccinations throughout the world. Countries traditionally considered more developed, like China, Canada, the US, Italy, etc., have very high numbers, whereas less developed nations are significantly lower.

4.2 Vaccines x GDP per Capita

My second graph uses LOESS (local regression) to highlight the positive relationship between a country’s GDP per capita and total vaccinations per 100. The trend is not perfect, however: there are countries with very high GDP per capita but very low vaccination counts.
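
For readers unfamiliar with local regression, here is a minimal base R sketch of fitting and drawing a smoothed curve. The data are simulated (the `gdp` and `vax` vectors and the shape of the relationship are made up for illustration), and the `span` value is an arbitrary choice, not necessarily the one behind the graph above.

```r
set.seed(1)

# Simulated data: made-up GDP per capita and vaccinations per 100
gdp <- seq(1000, 60000, length.out = 50)
vax <- 40 + 150 * (1 - exp(-gdp / 15000)) + rnorm(50, sd = 10)
toy <- data.frame(gdp = gdp, vax = vax)

# Fit a local regression; span sets the share of points used in each local fit
fit <- loess(vax ~ gdp, data = toy, span = 0.75)

# Evaluate the smoothed curve on a grid inside the observed range
grid <- data.frame(gdp = seq(2000, 59000, length.out = 100))
grid$vax_hat <- predict(fit, newdata = grid)

# Scatter plot with the smoothed curve on top
plot(toy$gdp, toy$vax,
     xlab = "GDP per capita", ylab = "Total vaccinations per 100")
lines(grid$gdp, grid$vax_hat, lwd = 2)
```

In ggplot2, a comparable curve can be added to a scatter plot with `geom_smooth(method = "loess")`.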

4.3 Vaccines by GDP and HDI

My third graph is a 3D Plotly graph that builds on the previous finding by showing that the Human Development Index has a similar positive relationship with total vaccinations. In other words, a country with a high HDI value typically has a higher number of vaccinations per 100. For reference, the HDI is a statistical composite index of life expectancy, education, and per capita income indicators, which are used to rank countries into four tiers of human development (per Wikipedia). Additionally, the graph is colored by continent to show that there is little distinct clustering by continent. The most noticeable cluster is the orange dots, corresponding to African countries, a pattern that was also visible in the first Plotly graph.

4.4 Vaccines by Income and Development Levels

The next set of graphs are all very similar in that they are boxplots that use facet wrap on either GDP per Capita (“Income Groups”) or Human Development Index (“HDI Groups”). It is immediately apparent that the number of vaccinations per 100 is significantly higher in the groups with higher GDP per Capita (“Upper Income”) and higher HDI values (“Developed”). It is important to note that I used the log of total vaccinations per 100 as the y axis for each of these boxplots in order to best display the trend and disparity between groups. Furthermore, the “II” labeled graphs add 1 to the total vaccinations variable to avoid log(0), which evaluates to -Inf in R and cannot be drawn; I decided to include all graphs because the “I” labeled graphs somewhat hide the trend due to the large number of zero values that drop out.
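
The effect of adding 1 before taking the log can be seen in a few lines. The `vax_per_100` vector below is made up for illustration.

```r
# Made-up vaccinations-per-100 values, including zeros
vax_per_100 <- c(0, 0, 12.5, 80, 210)

log_raw <- log(vax_per_100)          # zeros become -Inf
log_shifted <- log(vax_per_100 + 1)  # zeros become 0

print(log_raw)
print(log_shifted)

# Number of values that a plot of the raw log would have to drop
sum(!is.finite(log_raw))

# log(x + 1) is the same transformation as the built-in log1p(x)
all.equal(log_shifted, log1p(vax_per_100))
```

The shift keeps every country in the plot at the cost of slightly compressing small values, which is why both versions of each boxplot are shown.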

4.4.1 Quantile Cuts

most_recent$income <- qcut(
  most_recent$gdp_per_capita, 
  cuts=3,
  labels=c("Low Income", "Middle Income", "Upper Income")
)
most_recent$hdi <- qcut(
  most_recent$human_development_index, 
  cuts=4,
  labels=c("Underdeveloped", "Mid-Low Dev.", "Mid-High Dev.", "Developed")
)
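
To make the binning explicit, here is a base R sketch of quantile-based cuts with custom labels. It assumes the cut points are equally spaced quantiles; `quantile_cut()` is a hypothetical helper written for this example, not the `qcut()` function called above, and the `gdp` values are made up.

```r
# Hypothetical helper: split x into `cuts` groups at equally spaced quantiles
quantile_cut <- function(x, cuts, labels) {
  probs <- seq(0, 1, length.out = cuts + 1)       # e.g. 0, 1/3, 2/3, 1
  breaks <- quantile(x, probs, na.rm = TRUE)      # quantile boundaries
  cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
}

# Made-up GDP per capita values
gdp <- c(700, 1800, 4500, 9000, 13000, 20000, 28000, 45000, 117000)

income <- quantile_cut(
  gdp,
  cuts = 3,
  labels = c("Low Income", "Middle Income", "Upper Income")
)

table(income)  # three groups of three values each
```

One caveat with this approach: `cut()` raises an error when the breaks are not unique, which can happen if many values are identical (for example, many zeros after filling missing values with 0).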

4.4.2 Income Groups I

4.4.3 Income Groups II

4.4.4 HDI Groups I

4.4.5 HDI Groups II

5 Conclusion

After exploring the dataset via multiple visualization techniques, it is evident that there is a noticeable disparity in the total number of vaccinations administered throughout the world. Specifically, lower income and more underdeveloped countries (as determined by GDP per Capita and the Human Development Index) have significantly fewer vaccinations per 100 people. These findings agree with the article’s argument and the sad truth that vaccinations are a luxury that higher income countries can afford. Beyond the scope of the article and dataset, these findings make sense considering that higher income countries not only have more established medical supply chains, but also that many of these more developed countries were actively involved in the development of the vaccines.

In terms of limitations and future work, I would have loved to dive into more country-specific details, like the number of hospitals or medical professionals in the country (to name a few); this serves as both a limitation of the current dataset and potential for future work given another similar dataset. Additionally, among the 67 columns in the dataset, I only analyzed roughly 10, which leaves the door open for an incredible amount of data visualization and analysis. For example, I could have completed a very similar analysis for the number of deaths, cases, tests, hospitalizations, etc. Lastly, I largely decided to focus on the most recent statistics for each of the countries, rather than looking at these numbers over time; this substantially cuts into the potential of this massive dataset. That being said, I decided to do so because the total vaccination count is cumulative and the most recent statistics are essentially the most important, especially considering the number will only be increasing.