Final Project: Data Exploration

Author

Joe Aumuller

Final Project Idea: Effects of COVID-19 on Vaccine Preventable Diseases

Project Aims

Aim 1: Characterize the temporal and geospatial distribution of VPD cases and immunizations compared with COVID-19

Research question: Have the COVID-19 pandemic and its associated vaccination campaigns cannibalized resources for, and progress toward, previously set targets for vaccine-preventable diseases (VPDs)?

Hypothesis: The onset of COVID-19 cases led to increased COVID-19 surveillance and, eventually, increased COVID-19 immunization, while surveilled cases and immunizations for VPDs decreased.

Aim 2: Highlight linkages between the COVID-19 pandemic, the resulting economic shock, the new COVID-19 vaccine, and the prices of VPD vaccines

Research question: Have the economic shock of COVID-19 and the R&D focus on the COVID-19 vaccine distorted the prices of VPD vaccines?

Hypothesis: While countries' GDP growth and GNI slowed or shrank during the pandemic, the prices of VPD vaccines increased, making them relatively more expensive.

Aim 3: Determine the cost vs. benefit of added investment into VPD vaccines and immunization coverage

Research question: Is added investment in VPD vaccines and immunization coverage justified when its benefit is weighed against the cost of VPDs?

Hypothesis: The cost of VPDs far exceeds the cost of investing in VPD vaccines and coverage programs.

Data Exploration

Loading in libraries

Code
library(tidyverse)
library(lubridate)
library(purrr)
library(janitor)
library(here)
library(readxl)
library(ggridges)
library(stringr)

library(rnaturalearth)
library(countrycode)
library(wbstats)

library("data.table")

setwd("C:/Users/joeau/OneDrive - Johns Hopkins/SAIS/Year 2/Sustainable Finance/final_project/")

Loading in the data

Code
vax_explore <- read_csv("03_data_processed/vpd_covid_project.csv") |> 
  janitor::clean_names() 

#don't need x1, drop variable
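
The comment above flags x1 as unneeded. A minimal sketch of dropping it follows, using a small made-up tibble (`vax_toy`, a hypothetical stand-in) instead of the real file. That x1 is a leftover row index from writing the CSV is an assumption, not something the data confirms.

```r
library(dplyr)
library(tibble)

# Toy stand-in for vax_explore; values are illustrative only
vax_toy <- tibble(
  x1    = c(1, 2, 3),
  iso3c = c("USA", "BRA", "FRA"),
  year  = c(2020, 2020, 2020)
)

# any_of() drops x1 when it exists and is a no-op when it is already gone,
# so the chunk can be re-run safely
vax_toy_clean <- vax_toy |>
  select(-any_of("x1"))

names(vax_toy_clean)
```

The same `select(-any_of("x1"))` step could be appended to the `read_csv()` pipeline once the column is confirmed to carry no information.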

Structure & Summary Statistics

Counting observations by year shows that the data set spans a 34-year period (1990-2023), but the yearly totals are inconsistent, which points to substantial missing data: a year has as few as 236 observations (2023) and as many as 8,459.

Code
#Year range and observations, fewest observations first

vax_explore |> 
  count(year) |> 
  arrange(n)
# A tibble: 34 × 2
    year     n
   <dbl> <int>
 1  2023   236
 2  2022   337
 3  1990  2437
 4  1991  2469
 5  1994  2536
 6  1998  2536
 7  1992  2539
 8  1993  2552
 9  1996  2600
10  1995  2604
# … with 24 more rows
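
The year range and the spread of yearly counts can also be computed directly rather than read off the printed table. The sketch below assumes only a `year` column and runs on a made-up tibble. `vax_toy`, `n_obs`, and the summary column names are hypothetical.

```r
library(dplyr)
library(tibble)

# Toy stand-in for vax_explore; only the year column matters here
vax_toy <- tibble(
  year = c(1990, 1990, 1990, 2022, 2022, 2023)
)

# One row per year with its number of observations
year_counts <- vax_toy |>
  count(year, name = "n_obs")

# Collapse to a single row describing the span and the spread of counts
year_summary <- year_counts |>
  summarise(
    first_year = min(year),
    last_year  = max(year),
    n_years    = n(),
    min_obs    = min(n_obs),
    max_obs    = max(n_obs)
  )

year_summary
```

Run against `vax_explore`, the same summary would give the figures quoted in the text without depending on how many rows the tibble prints.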
Code
#Unique country names

vax_explore |> 
  distinct(country) |> 
  View()

vax_explore |> 
  filter(year == 2020) |> 
  arrange(desc(cases))
# A tibble: 7,474 × 27
       x1 iso3c country    year disease disea…¹  cases units    ir antig antig…²
    <dbl> <chr> <chr>     <dbl> <chr>   <chr>    <dbl> <chr> <dbl> <chr> <chr>  
 1 138161 USA   United S…  2020 COVID19 Corona… 1.96e7 <NA>     NA <NA>  <NA>   
 2 137393 BRA   Brazil     2020 COVID19 Corona… 7.56e6 <NA>     NA <NA>  <NA>   
 3 137993 RUS   Russia     2020 COVID19 Corona… 3.16e6 <NA>     NA <NA>  <NA>   
 4 137550 FRA   France     2020 COVID19 Corona… 2.56e6 <NA>     NA <NA>  <NA>   
 5 137566 GBR   United K…  2020 COVID19 Corona… 2.56e6 <NA>     NA <NA>  <NA>   
 6 138133 TUR   Turkey     2020 COVID19 Corona… 2.19e6 <NA>     NA <NA>  <NA>   
 7 137686 ITA   Italy      2020 COVID19 Corona… 2.08e6 <NA>     NA <NA>  <NA>   
 8 137526 ESP   Spain      2020 COVID19 Corona… 1.96e6 <NA>     NA <NA>  <NA>   
 9 137489 DEU   Germany    2020 COVID19 Corona… 1.73e6 <NA>     NA <NA>  <NA>   
10 137953 POL   Poland     2020 COVID19 Corona… 1.30e6 <NA>     NA <NA>  <NA>   
# … with 7,464 more rows, 16 more variables: coverage_cat <chr>,
#   cov_cat_desc <chr>, targ_number <dbl>, doses <dbl>, coverage <dbl>,
#   vpd <chr>, gdp <dbl>, gdp_ppp <dbl>, gni <dbl>, gni_pp <dbl>,
#   longitude <dbl>, latitude <dbl>, region_iso3c <chr>, region <chr>,
#   income_level <chr>, flag <chr>, and abbreviated variable names
#   ¹​disease_desc, ²​antig_desc

Looking at Correlations
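
One possible starting point for this section, sketched under assumptions: pairwise correlations between a few numeric columns that appear in the data (`cases`, `coverage`, `gdp`). The values below are made up, and `vax_toy` and `cor_matrix` are hypothetical names. `use = "pairwise.complete.obs"` is chosen because the real data has many missing values.

```r
library(dplyr)
library(tibble)

# Toy stand-in for vax_explore with some missing values
vax_toy <- tibble(
  cases    = c(120, 80, NA, 40, 10),
  coverage = c(55, 70, 82, 90, 97),
  gdp      = c(1.2, 2.5, 3.1, NA, 6.0)
)

# Pearson correlations, using every pair of rows where both values are present
cor_matrix <- vax_toy |>
  select(cases, coverage, gdp) |>
  cor(use = "pairwise.complete.obs")

round(cor_matrix, 2)
```

With pairwise deletion each cell can rest on a different subset of rows, so the matrix is best read as exploratory.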

Looking at data distribution

Code
#Distribution of VPD & COVID cases

#Rank observations into quintiles (1 = lowest cases, 5 = highest)
vax_explore |> 
  filter(vpd == "VPD",
         !is.na(region)) |> 
  mutate(vpd_rank = ntile(cases, 5)) |> 
  ggplot(aes(vpd_rank, cases/1000)) +
  geom_col(na.rm = TRUE) +
  facet_grid(vpd ~ region) +
  labs(x = "Level of cases Low to High (1-5)",
       y = "Country cases (thousands)") +
  theme_bw()

vax_explore |> 
  filter(vpd == "COVID",
         !is.na(region)) |> 
  mutate(vpd_rank = ntile(cases, 5)) |> 
  ggplot(aes(vpd_rank, cases/1000)) +
  geom_col(na.rm = TRUE) +
  facet_grid(vpd ~ region) +
  labs(x = "Level of cases Low to High (1-5)",
       y = "Country cases (thousands)") +
  theme_bw()

(a) Vaccine Preventable Diseases

(b) COVID-19

Figure 1: Disease case distributions by severity & region.
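
The x-axis of Figure 1 comes from `ntile()`, which splits observations into equally sized rank groups. A small illustration on a made-up vector (`cases_toy` and `rank_toy` are hypothetical names) shows how the groups are assigned and how missing values are handled:

```r
library(dplyr)

# Ten made-up case counts plus one missing value
cases_toy <- c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100, NA)

# Five groups: the two smallest values get rank 1, the two largest rank 5;
# the missing value stays NA
rank_toy <- ntile(cases_toy, 5)

rank_toy
# 1 1 2 2 3 3 4 4 5 5 NA
```

Because `geom_col()` stacks all rows that share an x value, each bar in the figure is the sum of cases across the observations in that rank group, not a count of countries.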

Looking at data over time

Code
#vax_explore |> 
  #filter(year == 2020) |> 
  #gather(metric, value, civil_liverties, political_rights) |>  
  #mutate(metric = str_to_title(str_replace_all(metric, "_"," ")),
         #region_name = fct_reorder(region_name, value)) |> 
  #group_by(year, region_name, metric) |> 
  #summarize(avg_rating = mean(value)) |> 
  #ggplot(aes(year, avg_rating, color = region_name)) +
  #geom_line() +
  #facet_wrap(~ metric) +
  #expand_limits(y = 0) +
  #scale_color_discrete(guide = guide_legend(reverse = TRUE)) +
  #labs(x = "Year",
       #y = "World Freedom Index rating",
       #title = "World Freedom Index rating over time",
       #color = "Region")
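
The commented-out template above refers to columns from a different data set. A rough sketch of how it might be adapted to this project is given below, assuming the `year`, `region`, `vpd`, and `cases` columns seen earlier. The tibble is made up for illustration, and `vax_toy`, `cases_by_year`, and `p` are hypothetical names.

```r
library(dplyr)
library(tibble)
library(ggplot2)

# Toy stand-in for vax_explore: two regions, two disease groups, four years
vax_toy <- tibble(
  year   = rep(c(2018, 2019, 2020, 2021), times = 4),
  region = rep(c("Region A", "Region B"), each = 4, times = 2),
  vpd    = rep(c("VPD", "COVID"), each = 8),
  cases  = c(500, 480, 300, 260, 900, 870, 600, 550,
             0, 0, 4000, 6000, 0, 0, 7000, 9000)
)

# Total cases per year, region, and disease group
cases_by_year <- vax_toy |>
  group_by(year, region, vpd) |>
  summarise(total_cases = sum(cases, na.rm = TRUE), .groups = "drop")

# One line per region, one panel per disease group
p <- ggplot(cases_by_year, aes(year, total_cases, color = region)) +
  geom_line() +
  facet_wrap(~ vpd, scales = "free_y") +
  expand_limits(y = 0) +
  labs(x = "Year",
       y = "Total reported cases",
       color = "Region") +
  theme_bw()

p
```

Free y scales are used here because COVID-19 counts are likely to dwarf VPD counts; a shared scale would be the better choice if the two panels are meant to be compared in absolute terms.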