1 Problem Description

The purpose of this report is to investigate the effects of vaccines on two key aspects of the pandemic:

    1. slowing down the spread and
    2. reducing the severity of illness from Covid-19

The analysis uses the many-models approach: the same model is estimated separately for every entity worldwide, and the results are then summarized to give a comprehensive picture of vaccine effectiveness. The report contains four major parts:

  • Data: Data source, cleaning, imputation of missing values, and data transformation.
  • Description: The progress of vaccine roll-out in the world.
  • Analysis 1: The effect of vaccines on Daily New Cases
  • Analysis 2: The effect of vaccines on Daily New Deaths

Because the dataset used in this report is updated daily (please see the next section), all the graphs and analysis results in this RMarkdown report are designed to update accordingly when knitted. In addition to this report, you can also explore the project through this interactive dashboard, which is also updated daily.

2 Prepare the Data

One of the best datasets to use for this purpose is the daily-updated Covid-19 dataset offered by Our World in Data.

2.1 Data Overview

There are 65 variables in the data, which can be roughly divided into the following seven categories.

  1. Identity
    • Location (Entity)
    • Date
    • Continent
    • iso_code
  2. Vaccines
    • Total vaccinations: Total vaccine doses given. (original & per hundred)
    • People vaccinated: Number of people who have received at least one dose (original & per hundred)
    • People fully vaccinated: Number of people who have received two doses (original & per hundred)
    • New vaccinations (original, smoothed, smoothed per million)
  3. Cases
    • New & total cases (original & smoothed)
    • New & total cases per million (original & smoothed)
  4. Deaths
    • New & total deaths (original & smoothed)
    • New & total deaths per million (original & smoothed)
  5. Tests
    • New & total tests (original & smoothed)
    • New & total tests per million (original & smoothed)
    • Test positive rate (& test per case)
    • Tests unit (people or samples tested)
  6. Other Key Indicators of COVID-19
    • Reproduction rate
    • ICU_patients (original & per million; daily & weekly)
    • Hospitalized_patients (original & per million; daily & weekly)
  7. Demographics and Other Control Variables
    • Population & population density
    • Median age; proportion of population >=65 & >=70; life expectancy
    • GDP per capita; proportion under extreme poverty
    • Stringency index
    • Cardiovascular death rate
    • Diabetes prevalence
    • Female & male smokers
    • Hand-washing facilities
    • Hospital beds per thousand
    • Human development index

Some observations in the data describe collective entities, such as each continent and the whole world; these were dropped. Entities with a population of 1 million or fewer were also excluded.
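
The filtering rule above can be sketched as follows. This is a minimal illustration on made-up rows, not the report's own code: the column names mirror the Identity and Demographics variables listed above, and the rule that collective entities are the rows without a continent is an assumption made for the sketch.

```r
# Toy stand-in for the full data; all values are made up.
covid <- data.frame(
  location   = c("World", "Europe", "Atlantis", "Smallland", "Bigland"),
  continent  = c(NA, NA, "Europe", "Asia", "Asia"),
  population = c(7.8e9, 7.5e8, 5.0e6, 1.0e6, 2.0e7),
  stringsAsFactors = FALSE
)

# Assumption: collective entities are the rows with no continent recorded.
is_collective <- is.na(covid$continent)

# Keep individual entities with a population strictly above 1 million.
entities <- covid[!is_collective & covid$population > 1e6, ]
print(entities$location)
```

Note that an entity with exactly 1 million people is excluded, matching the "equal to or less than" wording.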

2.2 Data Problems Discussed

Although the data is in a tidy format, two issues still need to be addressed before the analysis.

  1. The fluctuations of daily case/death numbers: The number of new cases for an entity understandably changes daily. But some of this variation is due to adjustments or corrections, such as dropping cases that were counted more than once or adding past cases that were missed, and some is due to the day of the week, such as differences in testing and reporting between weekends and weekdays. It is therefore helpful to compute a moving average (e.g., a centered 7-day moving average is the average over a day and the 3 days on either side of it), so that the daily case numbers give a more reliable representation of the Covid-19 situation in each entity. This can be done easily in R; please see 2.4.4.

  2. The vaccine numbers: There is a large number of missing values for the key variable in this project, the vaccine data. The plot below shows the missing values in a) people vaccinated (VaccinatedP100) and b) total number of vaccination doses given (VaccinationP100) during the last 5 days.

Vaccine data is more complete for Total Vaccine Doses than for People Vaccinated. Therefore, the former will be used in this analysis.

In the following sections, I will first discuss the strategies that can be used to work with these missing values, then implement the strategies chosen.

2.3 Possible Strategies

2.3.1 Handling of missing values: For entities with no vaccine data at all

If an entity has no vaccine data at all up to the date of this report, it is likely that vaccine rollout has not started there. The missing values will therefore be replaced by 0. (Please see section 2.4.1 for a brief description of these entities, and Appendix 1 for a more detailed discussion and analysis.)

2.3.2 Handling of missing values: For entities with some vaccine data

These entities have started their vaccine rollout, but did not update the numbers daily. Therefore, the gaps between known numbers need to be filled. The following two steps will be taken to impute these values:

  1. Missing before vaccine rollout: For each entity, identify the first day with reported vaccine numbers. Any observations before that first day will be dropped.

  2. Missing after vaccine rollout: To fill in the gaps between reported vaccine numbers, at least three possible strategies can be used:

    • To impute missing values by carrying the last available value forward: For entities that update their vaccine numbers infrequently, this imputation method could seriously underestimate the actual numbers, because cumulative doses keep rising while the imputed value stays flat. For example, today’s vaccine number could be imputed with the number from a week ago.

    • To impute missing values with moving average: This strategy can combine the information from before and after a specific date if a centered average is used. But the result could be biased toward the end where more values are available.

    • To impute missing values with an algorithm that can a) follow the increasing trend of total vaccine numbers across the gaps and b) return reasonably accurate results even when the gaps are relatively large. After considering the alternatives, I decided to use the na_interpolation function in the imputeTS package for this purpose. Its performance is demonstrated in section 2.4.3.
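
To make the difference between the first and third strategies concrete, here is a small base-R sketch on made-up numbers. It is not the report's own code: `locf` and `linear_fill` are hypothetical helper names, and `approx` is used as a stand-in for linear interpolation, which is the default option of `na_interpolation` in imputeTS.

```r
# Cumulative doses per hundred with reporting gaps (NA = no report); made-up values.
doses <- c(2, NA, NA, NA, 10, NA, 14)

# Strategy 1: carry the last observed value forward.
locf <- function(x) {
  idx <- cumsum(!is.na(x))   # position of the latest non-missing value so far
  idx[idx == 0] <- NA        # leading gaps have nothing to carry forward
  x[!is.na(x)][idx]
}

# Strategy 3 (stand-in): linear interpolation between the known points.
linear_fill <- function(x) {
  known <- which(!is.na(x))
  approx(known, x[known], xout = seq_along(x), rule = 2)$y
}

print(locf(doses))         # flat inside each gap
print(linear_fill(doses))  # rises steadily inside each gap
```

On this toy series the carried-forward value on day 4 is still 2, while interpolation gives 8, which illustrates the underestimation concern raised for the first strategy.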

2.4 Implement the Strategy Chosen

2.4.1 Entities with no vaccination data to date

Up to the date of this report, the number of entities in the data with no vaccine numbers reported at all is 1. It is reasonable to infer that vaccine rollout has not started in such entities. Their vaccine numbers are all replaced with 0. Analysis was conducted to see whether these entities were less motivated (i.e., have fewer cases) or lack the resources (i.e., have lower GDP per capita) to administer Covid-19 vaccines. Please see Appendix 1 for details.
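
A minimal sketch of this zero-fill rule, assuming a long table with one row per entity and date. The table and its column names are made up for illustration and are not taken from the report's code.

```r
# Entity "A" reports at least once; entity "B" never reports.
vacc <- data.frame(
  location = c("A", "A", "B", "B"),
  doses_per_hundred = c(NA, 3.5, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE for entities whose vaccine column is missing on every date.
no_data <- tapply(is.na(vacc$doses_per_hundred), vacc$location, all)
no_data_entities <- names(no_data)[no_data]

# Replace missing values with 0 only for those entities.
vacc$doses_per_hundred[vacc$location %in% no_data_entities] <- 0
print(vacc)
```

The gap in entity "A" is deliberately left as NA here, since gaps in entities that have started reporting are handled by interpolation rather than by zero-filling.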

2.4.2 Drop observations before the 1st day of vaccine rollout

The starting date of vaccine rollout differs from entity to entity. The following table ranks the entities by their Vaccine Start Date.

For each entity, only observations from the 1st day of known vaccine values were included in the analysis.
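
One way to express this per-entity cutoff in base R is sketched below. The data frame and column names are illustrative assumptions, not the report's own objects.

```r
# Two made-up entities with different first reporting days.
d <- data.frame(
  location = rep(c("A", "B"), each = 4),
  date = rep(as.Date("2021-01-01") + 0:3, times = 2),
  doses_per_hundred = c(NA, 1, NA, 3, NA, NA, 2, NA),
  stringsAsFactors = FALSE
)
d <- d[order(d$location, d$date), ]

# Running count of reported values within each entity; a row is kept once
# the count is above zero, i.e. on or after the first reported value.
reported_so_far <- ave(as.numeric(!is.na(d$doses_per_hundred)),
                       d$location, FUN = cumsum)
rolled_out <- d[reported_so_far > 0, ]
print(rolled_out)
```

Gaps after the first report (such as entity "A" on 2021-01-03) stay in the data so they can be imputed in the next step.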

2.4.3 Impute missing values between the 1st and last day of known vaccine numbers

As said, the na_interpolation method will be used to fill in the gaps between known vaccine numbers. One of the biggest concerns about missing vaccine numbers in this data is the large proportion of missing values for some entities. The plot below shows the imputed values for the five entities with the most missing values to the date of this report. Please note that the imputation always ends at the last known vaccine number for each entity, which may be one single dot at the end of the line and is therefore hard to see.

2.4.4 Smooth fluctuations in cases and deaths

As discussed in 2.2, a centered 7-day moving average (i.e., the average over a day and the 3 days on either side of it) is calculated to smooth out the fluctuations in new cases per capita and new deaths per capita.
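
As an illustration of the smoothing step, a centered 7-day moving average can be computed in base R with `stats::filter`. This is a sketch on made-up counts and is not necessarily how the report computes it.

```r
# Made-up daily new cases, including a reporting dip (0) followed by a catch-up spike (30).
new_cases <- c(10, 12, 0, 30, 14, 16, 9, 11, 13, 15)

# Equal weights of 1/7 over a 7-day window; sides = 2 centers the window,
# so each value is the mean of that day and the 3 days before and after it.
ma7 <- as.numeric(stats::filter(new_cases, rep(1 / 7, 7), sides = 2))
print(round(ma7, 2))
```

The first and last three values are NA because a full centered window is not available there. The call is written as `stats::filter` to avoid a clash with `dplyr::filter` when the tidyverse is loaded.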

As of the date of this report, 1 entity has no case values and 1 entity has no death values for any date between its 1st day of vaccine rollout and its last day with available vaccine numbers. These missing values are replaced with 0 (see also 2.4.1 for the handling of vaccine numbers in a similar situation).

2.5 Overview of the Data Prepared for Modeling

2.5.1 Main Variables

After all the imputations, the remaining missing values in the main variables of the project are as follows.

2.5.2 Other Relevant Variables

Some other variables are also potentially relevant to this report, as discussed below.

  • Test numbers will not be included in the model because 1) there are too many missing values and 2) the missingness is likely to be systematic rather than random, in that entities with fewer resources are more likely to be missing these numbers.

  • Reproduction Rate is not as complete as Daily New Cases (after imputation), but can still be used for reference. In the regression model estimating the effect of Vaccination on the Spread of Covid-19, Daily New Cases will be used as the dependent measure.

  • ICU Patients and Hospitalized Patients are potentially helpful indicators for severity of symptoms. However, we lack data on these two variables. Therefore, Deaths will be adopted as the dependent measure in the model estimating the impact of Vaccination on severity of symptoms.

  • Population info is relatively adequate. This variable will be used in some of the visualizations.

Other relevant variables that do not vary within each entity, such as GDP, median age, and population density, are not included because this analysis uses the many-models method to estimate the effect of Vaccination. Specifically, the same regression model is estimated separately for each entity, and the results are then summarized in the report.

3 Description: Vaccination in the World

3.1 Vaccination Rates across Continents

Before exploring the effect of vaccines, let’s first look at vaccine roll-out progress worldwide. Each point represents an entity and is sized by its population. The points are grouped and colored by continent. Europe has the highest median vaccine rate (number of doses given per 100 people), but there are also entities in Europe where vaccine rates are quite low.

3.2 Entities with the Highest Vaccine Rates (Top20)

Then let’s zoom in to look at entities with the highest vaccine rate. Again, these entities are colored by continents.

4 Analysis: Vaccination and the Spread of Covid-19

4.1 Visualization: Vaccine Rate and New Cases (A recent date)

Before running the regression models estimating how the increasing vaccine rate affects the spread of Covid-19 within each entity, let’s review the situation of all entities on one recent day.

First, vaccination rate and Daily New Cases. A nonparametric smoothing line using the LOESS method is added here to facilitate understanding.
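
For readers unfamiliar with LOESS, the sketch below fits such a smoothing line with base R's `loess` on synthetic data. The variable names and the simulated relationship are made up purely to have something to smooth and say nothing about the real data; in a ggplot2 chart the equivalent layer would be `geom_smooth(method = "loess")`.

```r
set.seed(1)

# Synthetic one-day snapshot: one row per entity (values are simulated).
snapshot <- data.frame(vacc_rate = seq(5, 150, length.out = 40))
snapshot$new_cases <- 200 - snapshot$vacc_rate + rnorm(40, sd = 15)

# Local regression: each fitted value comes from a weighted fit to nearby
# points; span controls how much of the data counts as "nearby".
fit <- loess(new_cases ~ vacc_rate, data = snapshot, span = 0.75)

# Evaluate the smooth curve on a regular grid inside the observed range.
grid <- data.frame(vacc_rate = seq(5, 150, by = 5))
smooth <- predict(fit, newdata = grid)
print(round(head(smooth), 1))
```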

4.2 Visualization: Vaccine Rate and Reproduction Rate (A recent date)

Let’s also take a look at vaccination rate and Reproduction Rate. Again, a LOESS smoothing line is added here to facilitate understanding.

4.3 Modeling: Vaccine Rate and New Cases (Many Models)

Now let’s run the regression models estimating Daily New Cases using Vaccination Rate as the explanatory variable. We will run the model within each and every entity, then summarize the results.

Before showing the results, two issues need to be explained about the model:

  1. Because it takes about 2 weeks for the body to build immunity against the virus after being fully vaccinated (CDC, 2021), a 14-day lagged vaccination rate will be used in the model. That is to say, for each entity, the vaccine rate from two weeks ago is used to estimate new cases today.

  2. To ensure that the model is not distorted by case surges that happened when an entity’s vaccine rate was still too low to have any effect, the analysis only uses data from the point when the 14-day lagged vaccine rate has reached 100 doses per hundred people. This is a very low bar to set, considering that the desirable vaccine rate for herd immunity against Covid-19, although unknown, is likely to be much higher (WHO, 2020).
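
The two modeling choices above, combined with the many-models idea, can be sketched as follows. Everything here is illustrative: the data are simulated, and `lag_n`, `make_entity`, and the column names are hypothetical names introduced for this sketch rather than the report's own code.

```r
# Shift a daily series back by n days (first n values become NA).
lag_n <- function(x, n) c(rep(NA, n), head(x, -n))

set.seed(42)
# Simulate one entity whose cases depend linearly on the 14-day lagged rate.
make_entity <- function(name, slope) {
  vacc <- seq(60, 180, length.out = 120)
  data.frame(location = name, day = 1:120, vacc_per_hundred = vacc,
             new_cases = 500 + slope * lag_n(vacc, 14) + rnorm(120, sd = 10),
             stringsAsFactors = FALSE)
}
panel <- rbind(make_entity("A", -2), make_entity("B", 0.5))

# Many models: the same regression, fitted separately within each entity,
# using only days where the lagged rate has reached 100 doses per hundred.
models <- lapply(split(panel, panel$location), function(d) {
  d$vacc_lag14 <- lag_n(d$vacc_per_hundred, 14)
  d <- d[!is.na(d$vacc_lag14) & d$vacc_lag14 >= 100, ]
  lm(new_cases ~ vacc_lag14, data = d)
})

slopes <- sapply(models, function(m) coef(m)[["vacc_lag14"]])
print(round(slopes, 2))
```

A position-based lag like `lag_n` assumes one row per consecutive day within each entity, which holds after the imputation steps in section 2.4.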

Below is a summary of parameters from all the entities analyzed. The majority (the specific numbers are given later) return a model with a good fit (defined as an adjusted R-squared whose associated p-value is below 0.05). A zero line is added for reference. Coefficients smaller than 0 indicate a negative association between vaccine rate and daily new cases; that is to say, an increase in vaccine rate predicts a decrease in daily new cases.

Now let’s explore the coefficients among models where Vaccine Rate is estimated to have a significant influence on New Cases. They are grouped and colored by continent for your reference. Again, negative coefficients indicate that an increase in vaccine rate is associated with fewer daily new cases.

In summary, among all 53 models estimated, 48 resulted in a good fit. Among these good-fit models, 30 returned a significantly negative coefficient for the explanatory variable, indicating the effectiveness of vaccines in slowing down the spread.
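
Counts like these can be produced by tabulating a few statistics per fitted model. The sketch below shows one way to do it on three simulated entities; it assumes that "good fit" means an overall F-test p-value below 0.05, which is my reading of the definition above rather than something the report states explicitly. `model_summary` is a hypothetical helper name.

```r
set.seed(7)
x <- seq(100, 160, length.out = 60)
toy <- list(
  A = data.frame(x = x, y = 400 - 2 * x + rnorm(60, sd = 10)),  # negative slope
  B = data.frame(x = x, y = 100 + 1 * x + rnorm(60, sd = 10)),  # positive slope
  C = data.frame(x = x, y = 200 + rnorm(60, sd = 10))           # no relationship
)
fits <- lapply(toy, function(d) lm(y ~ x, data = d))

# One row of summary statistics per model.
model_summary <- function(m) {
  s <- summary(m)
  f <- s$fstatistic
  data.frame(
    adj_r2  = s$adj.r.squared,
    model_p = pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE),
    slope   = s$coefficients[2, "Estimate"],
    slope_p = s$coefficients[2, "Pr(>|t|)"]
  )
}
tab <- do.call(rbind, lapply(fits, model_summary))

tab$good_fit     <- tab$model_p < 0.05
tab$sig_negative <- tab$good_fit & tab$slope_p < 0.05 & tab$slope < 0
print(c(models = nrow(tab), good_fit = sum(tab$good_fit),
        sig_negative = sum(tab$sig_negative)))
```

With a single explanatory variable, the overall F-test and the t-test on the slope give the same p-value, so in this setting a good-fit model is also one with a significant coefficient; the sign of the slope then separates the negative from the positive associations.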

4.4 Visualization: Entities where the vaccines were estimated to be the MOST effective in slowing down spread

Now, let us zoom in on the five entities where vaccine rate was estimated to be the most effective in slowing down the spread of Covid-19 according to these regression models, that is, the same increase in vaccine per hundred results in the greatest predicted decrease in daily new cases. Please note that case surges before an entity reached 100 doses per hundred were not considered in the estimation model.

Hover over any lines (including the gray ones) to get more information.

4.5 Visualization: Entities where the vaccines were estimated to be the LEAST effective in slowing down spread

Now, let us zoom in on the five entities where vaccine rate was estimated to be the least effective in slowing down the spread of Covid-19 according to these regression models. Please note that case surges before an entity reached 100 doses per hundred were not considered in the estimation model.

Hover over any lines (including the gray ones) to get more information.

5 Analysis: Vaccination and the Severity of Covid-19 Symptoms

5.1 Visualization: Vaccination and New Deaths (A Recent Date)

Before running the regression models estimating how the increasing vaccine rate affects the severity of illness from Covid-19 within each entity, let’s review the situation of all entities on one recent date. Below is a graph showing vaccine rate and Daily New Deaths. A nonparametric smoothing line using the LOESS method is added here to facilitate understanding.

5.2 Model: Vaccination and New Deaths (Many Models)

Now let’s run the regression models estimating Daily New Deaths using Vaccination Rate as the explanatory variable. We will run the model within each and every entity, then summarize the results.

Before showing the results, two issues need to be explained about the models:

  1. In addition to the 2 weeks needed for the body to build immunity after vaccination (CDC, 2021), the vaccine number needs to be lagged further in the model on Deaths (compared with the model on Cases) to account for the time between testing positive and death. Since the median time between symptom onset and ICU admission is around 10 days (Baud, Qi, Nielsen-Saines, Musso, Pomar & Favre, 2020), a 24-day (14 + 10) lagged vaccination rate is used in the model to estimate deaths. That is to say, for each entity, the vaccine rate from 24 days ago is used to estimate new deaths today.

  2. Same as discussed in section 4.3, the models are trained using data after an entity has had at least 100 doses per hundred people, so as to exclude possible early surges before the vaccine can take any effect.
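
The 24-day alignment can also be done by date rather than by row position, which stays correct even if some dates are absent. The sketch below uses two small made-up tables; the object and column names are assumptions for illustration only.

```r
# Total lag: 14 days to build immunity plus about 10 days from onset to ICU.
lag_days <- 14 + 10

dates <- as.Date("2021-03-01") + 0:29
vacc   <- data.frame(date = dates, vacc_per_hundred = 100 + 0:29)
deaths <- data.frame(date = dates, new_deaths = 30:1)

# Move each vaccine observation forward to the day it is assumed to matter for,
# then join it to the deaths recorded on that day.
vacc_shifted <- data.frame(date = vacc$date + lag_days,
                           vacc_lag24 = vacc$vacc_per_hundred)
model_data <- merge(deaths, vacc_shifted, by = "date")
print(model_data)
```

Only the dates present in both tables survive the join, so the first 24 days of deaths, for which no lagged vaccine rate exists, drop out automatically.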

Below is a summary of coefficients from all the entities analyzed. The majority (the specific numbers are given later) return a model with a good fit (defined as an adjusted R-squared whose associated p-value is below 0.05). A zero line is added for reference. Coefficients smaller than 0 indicate a negative association between vaccine rate and daily new deaths; that is to say, an increase in vaccine rate predicts a decrease in daily new deaths.

Now let’s explore the coefficients among models where Vaccine Rate is estimated to have a significant influence on New Deaths. They are grouped and colored by continent for your reference. Again, negative coefficients indicate that an increase in vaccine rate is associated with fewer daily new deaths.

In summary, among all 49 models estimated, 42 resulted in a good fit. Among these good-fit models, 19 returned a significantly negative coefficient for the explanatory variable, indicating the effectiveness of vaccines in reducing deaths.

5.3 Visualization: Entities where the vaccines were estimated to be the MOST effective in reducing deaths

Now, let us zoom in on the five entities where vaccine rate was estimated to be the most effective in reducing Covid-19 related deaths, that is, where the same increase in vaccine doses per hundred results in the greatest predicted decrease in daily new deaths. Please note that daily deaths before an entity reached 100 doses per hundred were not considered in the estimation model.

Hover over any lines (including the gray ones) to get more information.

5.4 Visualization: Entities where the vaccines were estimated to be the LEAST effective in reducing deaths

Now, let us zoom in on the five entities where vaccine rate was estimated to be the least effective in reducing Covid-19 deaths. Again, please note that deaths before an entity reached 100 doses per hundred were not considered in the estimation model.

Hover over any lines (including the gray ones) to get more information.

6 Discussion and Limitations

6.1 Tradeoffs in Automatically Updated Graphs and Analysis

One interesting challenge in this project is that the data is updated daily. Therefore, I tried to incorporate as much real-time report as possible to ensure that the graphs and models can stay updated with the data.

For example, when demonstrating the performance of the imputation algorithm among the five entities that have the most missing values, these five entities were chosen based on the most recent data and could therefore differ from day to day. The advantage is of course the ability to keep up to date, but the challenge is to offer useful comments and explanations that remain relevant to whichever entities are shown. In this particular example, after much consideration, I decided to add an explanation that is general enough to apply to whichever entities are shown, and is also likely to facilitate understanding when the graph looks tricky: “Please note that the imputation always ends at the last known vaccine number for each entity, which may be one single dot at the end of the line and is therefore hard to see.”

Another interesting challenge that comes with such a real-time report is how best to report the results. Because the analysis results are likely to change when the data is updated, the results are reported in numbers, tables and graphs rather than as qualitative statements about whether vaccines are effective (please see 4.3-4.5 & 5.2-5.4).

6.2 The Many-Models versus One-Model Approach

In this analysis, the many models approach was adopted to estimate the effectiveness of vaccines. Specifically, we explored how the increasing vaccine rate within each entity affects its daily new cases and new deaths. The advantage of the approach is that many other factors that could potentially affect our dependent measures, such as population density, age structure, GDP, hospital beds per capita, are all naturally controlled for because they are constant within each entity.

In contrast, if entity is treated as one of the explanatory variables, then we can explore not only the effects of vaccine rate, but also the effects of many control variables. For example, we could construct a model that estimates mortality rate with a set of explanatory variables, such as age structure (e.g., the proportion of aged 70 and above), healthcare resources (e.g., hospital beds per thousand) and infrastructure (e.g., hand-washing facilities), and then check whether adding vaccine rate significantly enhances the model’s predictive capability. One challenge for this approach is the lack of data on relevant control variables. Furthermore, these missing values are likely to be systematic (e.g., a result of lack of resources) rather than random. If more data can be found or become available in the future, it will be interesting and helpful to analyze such a model.

Another issue I would like to explore further beyond this project is diagnostics with the many models approach. When running one linear regression model, we can check the assumptions with diagnostic plots, such as the residuals versus fitted values plot and the Q-Q plot. But what about when we run many models? Moreover, because the analysis will be redeployed with the daily updated Covid-19 data, it is not practical to check all the diagnostic plots for all the models after each update. It is important to note that this is a limitation of the current project.

6.3 A Note on Time Series Models

Last but not least, it is also worth noting that this project did not seek to fit a time series model. A major technical problem I am facing with time series models is how to not only a) capture the exponential dynamic of contagious disease transmission, but also b) incorporate our key variable, vaccine rate, as a covariate. Although exponential smoothing models are proposed to be suitable in modelling COVID-19 cases and deaths (Petropoulos, Makridakis and Stylianou, 2020), I am not yet aware of any R forecasting package that can incorporate a covariate term into such a model. (Please correct me if this information is outdated, also please see this article by professor Rob Hyndman for a more in-depth discussion of this very interesting challenge).

7 About the Author

This markdown file is developed by Mena WANG (Twitter, GitHub, LinkedIn) based on data from ourworldindata as one of the assignments for the course Data Driven Decision-Making at Monash University. I would like to thank the lead educator Prof Dianne Cook and the course mentor Jiaying Wu for their constructive comments and suggestions.

8 References

Baud, D., Qi, X., Nielsen-Saines, K., Musso, D., Pomar, L., & Favre, G. (2020). Real estimates of mortality following COVID-19 infection. The Lancet Infectious Diseases, 20(7), 773.

CDC (2021, March 9). Understanding How COVID-19 Vaccines Work. https://www.cdc.gov/coronavirus/2019-ncov/vaccines/different-vaccines/how-they-work.html

Goldstein, J. R., & Lee, R. D. (2020). Demographic perspectives on the mortality of COVID-19 and other epidemics. Proceedings of the National Academy of Sciences, 117(36), 22035-22041.

Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1–25. https://www.jstatsoft.org/v40/i03/.

Moritz S, Bartz-Beielstein T (2017). “imputeTS: Time Series Missing Value Imputation in R.” The R Journal, 9(1), 207–218. doi: 10.32614/RJ-2017-009.

Petropoulos, F., Makridakis, S., & Stylianou, N. (2020). COVID-19: Forecasting confirmed cases and deaths with a simple time series model. International Journal of Forecasting.

Sievert, C. (2020). Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

South, A. (2011). rworldmap: A New R package for Mapping Global Data. The R Journal, 3(1): 35-43.

Stevenson, J. W. (2012). Operations Management (11th Edition). McGraw-Hill.

Tierney, N. (2017). visdat: Visualising Whole Data Frames. Journal of Open Source Software, 2(16), 355. doi:10.21105/joss.00355.

WHO (2020, December 31). Coronavirus disease (COVID-19): Herd immunity, lockdowns and COVID-19. https://www.who.int/news-room/q-a-detail/herd-immunity-lockdowns-and-covid-19

Wickham et al., (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Xie Y (2021). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.33, https://yihui.org/knitr/.

Xie Y (2015). Dynamic Documents with R and knitr, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963, https://yihui.org/knitr/.

Xie Y (2014). “knitr: A Comprehensive Tool for Reproducible Research in R.” In Stodden V, Leisch F, Peng RD (eds.), Implementing Reproducible Computational Research. Chapman and Hall/CRC. ISBN 978-1466561595, http://www.crcpress.com/product/isbn/9781466561595.

Zeileis A, Grothendieck G (2005). “zoo: S3 Infrastructure for Regular and Irregular Time Series.” Journal of Statistical Software, 14(6), 1–27. doi: 10.18637/jss.v014.i06.

9 Appendix

9.1 Appendix 1: Explore Entities with No Vaccine Numbers to Date

At least two possible factors may contribute to the complete absence of vaccine data in these entities:

  1. Necessity: Compared to entities with fewer cases, entities with more cases are more likely to be motivated to administer vaccine rollout and to report the progress closely.

  2. Ability: An entity needs to have enough resources in order to administer vaccine rollout and record relevant data.

Note: As more and more countries start their vaccine roll-out, this appendix may no longer be necessary or suitable. But if we still would like to conduct the relevant analysis, a time capsule may be helpful, that is, we can choose a point of time in the past and compare vaccine and non-vaccine entities back then. The following graphs were made based on data from Oct 31, 2021, when we thankfully only have one entity left with no vaccine data. Let’s save this historical moment in the time capsule for now. In the future, we can roll the time back further for better analysis.

9.1.1 The Necessity Hypothesis: Vaccine and Total Cases

We have identified 1 entity that has no vaccine numbers at all to date. Let’s look at the case numbers to check the necessity hypothesis. Because entities differ in how frequently vaccine data is updated, a comparison within the last five days is shown to give a more comprehensive picture.

Consistent with the above discussion, entities with no vaccine data have far fewer confirmed cases per capita. This result, however, may need to be viewed in light of these entities’ ability to perform tests to identify cases if there are any, which brings us to the next section.

9.1.2 The Ability Hypothesis: Vaccines and GDP

Also consistent with the above discussion, entities with no vaccine data have lower GDP per capita. It is possible that lack of resources may also contribute to fewer tests, hence the lower case numbers.

9.1.3 Location of the Entities

Finally, we can also locate these entities on the map.


9.2 Appendix 2: Missing Values in Potential Control Variables