Covid-19 Analysis: Effectiveness of Covid-19 Policies

Abstract

The primary objective of this analysis is to determine whether public health policies have a significant and varying association with the number of positive Covid-19 cases, using data obtained from two sources. Analysis of variance (ANOVA) is used to assess the effect of public health policies on the number of positive Covid-19 cases while accounting for differences between regions. The number of Covid-19 cases recorded worldwide is used as the response variable, and the categorical region and policy factors are used as predictors. Based on the results, all five public health policies (face covering, vaccination, internal movement, stay-at-home, and international travel regulations) are statistically significant in explaining the number of recorded Covid-19 cases globally. Most of the pairwise interactions between these factors are also statistically significant. The association identified between the number of positive Covid-19 cases and the significant factors may be indirect, because transmission of the Covid-19 virus may depend on other factors that were excluded from this analysis and that could explain the relationships observed.

Introduction

Despite the technological innovations available today, a significant portion of the world’s population suffered from the unexpectedly devastating effects of the Covid-19 virus. Various government interventions and public health regulations were implemented to suppress the rapid spread of the virus and to decrease the mortality rate associated with it. As the spread of Covid-19 gradually declines around the world and a sufficient amount of data has been collected, it is important to improve our understanding of how effective the enforced public health policies were in limiting transmission. Experts often rely on the interplay of several public health regulations to slow the spread of a virus. However, identifying the specific policies that are correlated with virus transmission is a challenging and complex task, and results often vary between research studies. The results of this analysis may contribute in a small way to determining which measures should be implemented during a pandemic. The primary goal of this analysis is to investigate the association between the number of detected Covid-19 cases and a diverse set of public health policies. The observations from this analysis can be used for hypothesis generation and further investigation, since this project focuses on a limited number of factors and does not include other covariates that might be correlated with the features identified as statistically significant. The data sets used for this analysis were collected from two sources, the World Health Organization (WHO) and Our World in Data (OWID).
These data sets contain the features country, region, date, new Covid-19 cases, cumulative cases, and public health policies including vaccination, face covering, internal movement, international travel, and stay-at-home orders. Correlations between these factors are expected, and appropriate methods are used to verify them. Based on previous studies on this subject, it can be hypothesized that a positive correlation between Covid-19 transmission and the public health policies exists. This analysis project is primarily focused on answering the following questions.

Primary Question of Interest:

Are public health policies useful in determining the number of positive Covid-19 cases?

Secondary Question of Interest:

Which of these public health policies are most and least effective?

Background

Since the Covid-19 virus can be transmitted from one individual to another regardless of gender or health, and most public health policies focus on limiting social contact, it is important to identify whether the policies implemented by public health experts are effective. According to a study published in BMC Public Health, public health measures are effective in decreasing Covid-19 transmission (Ayouni et al., 2021). One of the public health policies included in this investigation is face covering. Face masks have been shown to filter out microscopic particles. However, the effectiveness of wearing a face mask in decreasing the transmission of Covid-19 specifically has not been extensively explored. One recent study found that the use of face coverings indoors is associated with a lower number of individuals testing positive for Covid-19 (Andrejko et al., 2022). Some public health regulations focus mainly on reducing social contact. These regulations include stay-at-home orders, reduction of internal mobility, and international travel restrictions. Studies have shown that minimizing social contact by issuing stay-at-home orders is associated with a decline in Covid-19 transmission (Fowler et al., 2021). Similarly, travel restrictions have been shown to mitigate the rapid spread of the Covid-19 virus (Kwok et al., 2021). If stay-at-home orders are in place and international or domestic travel is restricted, then internal movement is also expected to decline. Therefore, internal movement restrictions should have a similar effect on the spread of the Covid-19 virus.

This project explores the data collected from two sources in order to answer the questions of interest. Since any individual can be affected by the Covid-19 virus, this project targets the general population worldwide, regardless of existing health conditions. The total number of individuals in each country who tested positive for Covid-19 from January 2020 to January 2022 is used as the response variable in the data modeling section. The data set containing the daily recorded new Covid-19 cases from 237 countries around the world was obtained from the World Health Organization (WHO). The categorical covariates chosen are a diverse set of public health policies, and the data containing these policies were collected from Our World in Data (OWID). Specific information about the sources is provided in the Acknowledgement section at the end of this report. The final cleaned and aggregated data set, which excludes missing records, is displayed in the Data section below.

Data

The data used for this analysis were collected from the sources listed in the Acknowledgement section of this report. The supplementary data sets were collected from OWID and contain the features that are useful for answering the primary and secondary questions of interest. The final data set used for this analysis, displayed in Table 3 below, is aggregated by month for a total of 180 countries from 2020 to 2022. Its variables are defined in Tables 1 and 2 below. The maximum policy level in each month is retained, since policy levels do not change daily. Aggregating the data into monthly records also makes the analysis more efficient. A lagged variable for the total number of Covid-19 cases per month is also generated, to account for the fact that a policy change may not produce a measurable effect immediately.
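The aggregation described above can be sketched in base R as follows. This is an illustrative sketch rather than the report's own code: the `daily` data frame and its values are made up, only one policy column is shown, and the direction of the lag (pairing each month's policy with the following month's cases) is an assumption, since the report does not state it.

```r
# Hypothetical daily records for one country; column names mirror Table 1,
# but the values are invented for illustration.
daily <- data.frame(
  Country    = "A",
  Date       = as.Date("2020-03-30") + 0:3,   # spans March and April 2020
  New_cases  = c(5, 7, 11, 13),
  Face_cover = c(1, 2, 2, 3)
)
daily$Year  <- format(daily$Date, "%Y")
daily$Month <- format(daily$Date, "%m")

# Sum the cases and keep the maximum policy level within each country-month.
cases  <- aggregate(New_cases ~ Country + Year + Month, data = daily, FUN = sum)
policy <- aggregate(Face_cover ~ Country + Year + Month, data = daily, FUN = max)
monthly <- merge(cases, policy, by = c("Country", "Year", "Month"))
monthly <- monthly[order(monthly$Country, monthly$Year, monthly$Month), ]

# Assumed lag direction: the following month's cases within the same country
# (NA for the last month of each country).
monthly$Lag_cases <- ave(monthly$New_cases, monthly$Country,
                         FUN = function(x) c(x[-1], NA))
print(monthly)
```

With these toy values, March has 12 cases at a maximum Face_cover level of 2, April has 24 cases at level 3, and the lagged value for March is April's total.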

Table 1: Variable Definitions

Variable	Definition	Levels (Values)
Country	The 180 countries worldwide with records
New_cases	Total number of new positive Covid-19 cases from 1/3/20 through 2/16/22
Face_cover	Policies on the use of face coverings outside-of-the-home	5 (0,1,2,3, and 4)
Vaccination	Policies on vaccination availability for different groups	6 (0,1,2,3,4, and 5)
Internal_movement	Policies on restrictions on internal movement/travel between regions and cities	3 (0,1, and 2)
Stay_home	Policies on stay-at-home requirements or household lockdowns	4 (0,1,2, and 3)
International_travel	Policies on restrictions on international travel controls	5 (0,1,2,3, and 4)

Table 2: Level Definitions

Levels	Face_cover	Vaccination	Internal_movement	Stay_home	International_travel
0	No policy	No availability	No measures	No measures	No measures
1	Recommended	Availability for one of the following: key workers, clinically vulnerable groups, or elderly groups	Recommended movement restriction	Recommended not to leave the house	Screening
2	Required in some specified shared/public spaces outside the home with other people present, or some situations when social distancing not possible	Availability for two of the following: key workers, clinically vulnerable groups, or elderly groups	Restrict movement	Required to not leave the house with exceptions for daily exercise, grocery shopping, and ‘essential’ trips	Quarantine from high-risk regions
3	Required in all shared or public spaces outside the home with other people present or all situations when social distancing not possible	Availability for key workers, clinically vulnerable groups, and elderly groups	–	Required to not leave the house with minimal exceptions (e.g. allowed to leave only once every few days, or only one person can leave at a time, etc.)	Ban on high-risk regions
4	Required outside the home at all times regardless of location or presence of other people	Availability for all three plus partial additional availability (select broad groups/ages)	–	–	Total border closure
5	–	Universal availability	–	–	–

Table 3: Monthly Data

Descriptive Analysis

The following sections show the results of further data exploration using numerical and graphical methods.

Summary Statistics

The following chart displays the general summary statistics of the monthly data used for this analysis project. The five categorical policy variables do not all have the same number of factor levels. The data set covers a total of 180 countries from six WHO regions. The original aggregated data was cleaned to exclude all missing values. The bar graph produced for each variable in the summary shows the shape of that variable's distribution. The New_cases variable appears to be heavily right-skewed, so a variable transformation needs to be performed. The categorical variables appear to be roughly normally distributed across their levels, with the exception of Vaccination, which is slightly right-tailed. The second column in the chart shows the computed mean, standard deviation, minimum, median, maximum, interquartile range (IQR), and coefficient of variation (CV) for each quantitative variable. The variable Year shows that the observations are records from three different years (2020 to 2022).

Data Frame Summary

monthly_data

Dimensions: 4641 x 10
Duplicates: 0

Variable	Stats / Values	Freqs (% of Valid)	Valid	Missing
Country [character]	1. Albania	26 (0.6%)	4641 (100.0%)	0 (0.0%)
	2. Algeria	26 (0.6%)
	3. Andorra	26 (0.6%)
	4. Angola	26 (0.6%)
	5. Australia	26 (0.6%)
	6. Austria	26 (0.6%)
	7. Bahrain	26 (0.6%)
	8. Bangladesh	26 (0.6%)
	9. Belarus	26 (0.6%)
	10. Belgium	26 (0.6%)
	[ 170 others ]	4381 (94.4%)
WHO_region [character]	1. AFRO	1083 (23.3%)	4641 (100.0%)	0 (0.0%)
	2. AMRO	870 (18.7%)
	3. EMRO	541 (11.7%)
	4. EURO	1398 (30.1%)
	5. SEARO	231 (5.0%)
	6. WPRO	518 (11.2%)
Year [character]	1. 2020	2160 (46.5%)	4641 (100.0%)	0 (0.0%)
	2. 2021	2159 (46.5%)
	3. 2022	322 (6.9%)
Month [character]	1. 01	539 (11.6%)	4641 (100.0%)	0 (0.0%)
	2. 02	503 (10.8%)
	3. 03	360 (7.8%)
	4. 04	360 (7.8%)
	5. 05	360 (7.8%)
	6. 06	360 (7.8%)
	7. 08	360 (7.8%)
	8. 09	360 (7.8%)
	9. 10	360 (7.8%)
	10. 11	360 (7.8%)
	[ 2 others ]	719 (15.5%)
New_cases [integer]	Mean (sd) : 86102.5 (488758.6)	3354 distinct values	4641 (100.0%)	0 (0.0%)
	min ≤ med ≤ max: 0 ≤ 3053 ≤ 20257043
	IQR (CV) : 27325 (5.7)
Face_cover [factor]	1. 0	793 (17.1%)	4641 (100.0%)	0 (0.0%)
	2. 1	212 (4.6%)
	3. 2	727 (15.7%)
	4. 3	1925 (41.5%)
	5. 4	984 (21.2%)
Vaccination [factor]	1. 0	2343 (50.5%)	4641 (100.0%)	0 (0.0%)
	2. 1	138 (3.0%)
	3. 2	227 (4.9%)
	4. 3	362 (7.8%)
	5. 4	490 (10.6%)
	6. 5	1081 (23.3%)
Internal_movement [factor]	1. 0	2090 (45.0%)	4641 (100.0%)	0 (0.0%)
	2. 1	699 (15.1%)
	3. 2	1852 (39.9%)
Stay_home [factor]	1. 0	1580 (34.0%)	4641 (100.0%)	0 (0.0%)
	2. 1	1064 (22.9%)
	3. 2	1737 (37.4%)
	4. 3	260 (5.6%)
International_travel [factor]	1. 0	244 (5.3%)	4641 (100.0%)	0 (0.0%)
	2. 1	773 (16.7%)
	3. 2	1011 (21.8%)
	4. 3	1432 (30.9%)
	5. 4	1181 (25.4%)

Generated by summarytools 1.0.1 (R version 4.2.3)
2023-04-10

The summary statistics of monthly new Covid-19 cases from 2020 to 2022 for each country are shown in Table 4 below. The United States, India, and Brazil have the highest total and average number of positive Covid-19 cases during this time period. Meanwhile, Turkmenistan, Tonga, and Vanuatu have the lowest number of Covid-19 cases in this data. However, the records for these three countries may be inaccurate, given their extremely low case counts compared to other countries. This information is also shown in Figure 1 below.

Table 4: Summary statistics by Country

Data Visualization

The interactive bubble chart displayed below (Figure 1) shows how the total number of new cases per year for each country in the data changes from 2020 to 2022. The purpose of this project is to show how these Covid-19 case counts are associated with different public health policies and restrictions. Therefore, it is important to visualize the number of Covid-19 cases, which is used in the data modeling section as the response variable. In this chart, it is evident that the United States, India, and Brazil have the highest number of Covid-19 cases in 2020 and 2021. The United States remains the country with the highest number of positive cases in 2022, and France surpasses India as the country with the second highest number of cases. It is also clear that the total number of cases for most countries increased substantially in 2021 compared to 2020. Additionally, the countries with the highest numbers of Covid-19 cases belong to different regions. It is important to add that this analysis does not consider the possible effect of each individual country on the number of Covid-19 cases. To keep the methods used in this analysis efficient, the variable region is used as a between-subjects factor with a small number of groups.

Figure 1: Covid-19 Total New Cases by Country and Year

The heatmap of the correlation matrix for the different regions and policy factors is displayed in Figure 2 below. In general, the different levels of the factors are not correlated with each other. The highest correlation appears to be between Internal_movement2 and Stay_home2, which is reasonable because stricter stay-at-home orders should result in fewer people outside and less mobility in a particular area.

Figure 2: Correlations of Policy Levels

The histograms below show the frequency of the different regions and levels of policies and restrictions using the data aggregated by year and month (Table 3). They show that these factors have unequal numbers of observations across their levels, which results in an unbalanced design.

Figure 3: Histogram of Categorical Region and Policy Variables

The figure below (Figure 4) shows boxplots of total monthly Covid-19 cases in the data (Table 3) under the different regions and policy factors. Since the means (red dots) within each plot do not follow a horizontal trend, this suggests that the regions and policies shown have an impact on the number of cases. The mean number of cases varies across all factor levels within region and each policy variable. The boxplots of Lag_cases do not differ noticeably from the boxplots of New_cases. Therefore, it is not necessary to use Lag_cases in this case.

Figure 4: Boxplot of Covid-19 and Policies

Exploratory Analysis

In order to further explore the data used for this analysis, the following method is used.

Multiple correspondence analysis (MCA)

The goal in this section is to identify any similarities between the categorical predictors that are used in the data modeling section. To do this, a method designed for summarizing and visualizing data with more than one categorical variable is used. Multiple correspondence analysis (MCA) is similar to principal component analysis (PCA) but more useful when the data is categorical. Using this method, the variables that contribute the most to explaining the variation in the data can be identified, and variables that contribute little can be eliminated to avoid overfitting. To improve efficiency, the data used in this section is aggregated by year: the cases for each country are summed within each year from 2020 to 2022, and the maximum policy level imposed during each period is retained. In terms of MCA, the categorical region and policy variables are the active variables, which are the variables passed into the MCA() function. The different countries are the active individuals, which represent the observations or rows included in the MCA. The number of Covid-19 cases is used as the supplementary variable, which is not included in the MCA but whose coordinates can be predicted from the active variables and active individuals. The histogram of the active variables (policies) in Figure 3 in the previous section does not show any categories with extremely small frequency. The lowest predictor level frequency is 138, for Vaccination level 1. This value should be large enough not to cause any distortion in the MCA.

To visualize the proportion of variance explained by each MCA dimension, the function fviz_screeplot() is used to generate the scree plot shown in Figure 5 below. This scree plot can be interpreted in the same way as a PCA scree plot for principal components. The scree plot does not show any sharp drop or “elbow”. Therefore, it can be inferred that no dominating directions exist in the data. Hence, all the dimensions can be retained for this analysis.

Figure 5: MCA Scree plot
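The percentages plotted in an MCA scree plot can also be computed by hand. The following base R sketch is not the report's MCA() call: it builds the indicator (one-hot) matrix for a made-up data frame `toy` and runs a correspondence analysis of that matrix with svd(). The data, the seed, and the choice of three variables are assumptions for illustration only.

```r
# Toy categorical data (invented); each column plays the role of an active variable.
set.seed(1)
toy <- data.frame(
  Stay_home         = factor(sample(0:3, 60, replace = TRUE)),
  Internal_movement = factor(sample(0:2, 60, replace = TRUE)),
  Face_cover        = factor(sample(0:4, 60, replace = TRUE))
)

# Indicator matrix: one 0/1 column per category of every variable.
Z <- do.call(cbind, lapply(toy, function(f) model.matrix(~ f - 1)))
Q <- ncol(toy)   # number of active variables
J <- ncol(Z)     # total number of categories

# Correspondence analysis of Z: SVD of the standardized residuals.
P  <- Z / sum(Z)
r  <- rowSums(P)
cm <- colSums(P)
S  <- (P - r %o% cm) / sqrt(r %o% cm)
eig <- svd(S)$d^2
eig <- eig[eig > 1e-12]          # drop numerically zero dimensions

# Percentage of total inertia per dimension (the bar heights of a scree plot).
pct <- 100 * eig / sum(eig)
print(round(pct, 1))
```

For an indicator matrix the total inertia equals J/Q - 1, so no single dimension can account for a very large share when the categories are spread fairly evenly; a flat scree plot without an elbow is then the expected picture.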

The function fviz_mca_biplot() is used to draw the MCA biplot of individuals and variable categories (levels) in Figure 6 below. This plot shows the overall pattern within the data. Observations (rows or individuals) are represented by blue points, and columns (active variable categories) are the triangles, color coded by their contribution to the MCA dimensions. In this plot, the distance between row points or between column points measures their similarity: points that are close together have similar profiles. For example, Stay_home_1, Internal_movement_1, and International_travel_3 are fairly close to each other and contribute highly to MCA dimension 1 (x-axis). Therefore, it can be concluded that they have similar profiles. These three variable categories also contribute to MCA dimension 2 (y-axis).

Figure 6: MCA Biplot

The correlation between variables and MCA principal dimensions can also be assessed using the plot of MCA variables in Figure 7 below. Using this plot, we can identify which of the variables contribute the most to MCA dimensions 1 (x-axis) and 2 (y-axis). In this plot, the squared correlations between the MCA dimensions and the variables are used as coordinates. It is apparent that the variables Stay_home and Internal_movement are the most correlated with MCA dimension 1. These two variables are also the most correlated with MCA dimension 2.

Figure 7: MCA Variables

Inferential Analysis

Since the variables included in the data are categorical and the questions of interest for this analysis are primarily focused on the effects of the different levels in these predictors on the total number of positive Covid-19 cases worldwide, using analysis of variance (ANOVA) can help determine if the dependent variable (Covid-19 cases) changes according to the level of the independent variables (policies).

Variable Transformation

This section examines whether New_cases should be transformed to remove its heavy right skew. This variable is used as the response for the ANOVA model in the following section. Therefore, it is crucial to determine whether a model fitted to it violates the normality assumption. Figure 8 below shows the diagnostic plot for the full model. Although the residuals appear to have constant variance, the normal Q-Q plot shows that the residuals are heavily right-tailed, suggesting that a variable transformation is needed.

Figure 8: Diagnostic Plot

Since the monthly dataset contains a number of observations where New_cases is zero, these observations are excluded to avoid problems with the log transformation, which is undefined at zero. Figure 9 below shows that zero is the most frequent value of New_cases in the monthly data. The histogram of the log of New_cases in Figure 9 shows that New_cases is approximately normally distributed after the log transformation. Thus, the ANOVA model can be fitted using the log of New_cases with nonzero values. After removing the rows with 0 new cases recorded, the number of observations in the dataset decreases from 4641 to 4164, a reduction of 477 rows (about 10%).

Figure 9: Log transformation of response variable

The full model fitted to the monthly dataset, excluding observations where New_cases is 0 and using the log of New_cases as the response variable, does not severely violate the ANOVA model assumptions, as shown in its diagnostic plot below (Figure 10). Additionally, the Box-Cox plot in Figure 11 shows that the full model with the log of New_cases as the response does not need any further transformation, since \(\lambda\) is approximately 1.

Figure 10: Diagnostic Plot

Figure 11: Box-Cox Plot
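A Box-Cox check of this kind can be sketched with boxcox() from the MASS package, which is assumed to be installed. The example below is illustrative only: the data frame `d`, the single factor `g`, and the simulated response are made up, and the response is generated to be already roughly normal so that no further transformation should be indicated.

```r
library(MASS)  # assumed to be available; provides boxcox()

set.seed(7)
d <- data.frame(g = factor(sample(0:3, 400, replace = TRUE)))
# Simulated, strictly positive response that is roughly normal within groups.
d$y <- 8 + 0.5 * as.integer(as.character(d$g)) + rnorm(400)

# Profile log-likelihood of the Box-Cox parameter over a grid of lambda values.
bc <- boxcox(y ~ g, data = d, lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]   # grid value with the highest likelihood
print(lambda_hat)
```

A maximizing value near 1 corresponds to leaving the response as it is, while a value near 0 corresponds to a log transformation. Note that boxcox() requires a strictly positive response.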

Fixed-effect ANOVA Model

For this analysis, the proposed fixed-effect ANOVA model with multiple factors, written in factor-effect form below, uses the five categorical policy variables Face_cover (\(\alpha\)), Vaccination (\(\beta\)), Internal_movement (\(\gamma\)), Stay_home (\(\delta\)), and International_travel (\(\zeta\)), together with WHO_region (\(\eta\)), to predict the log of New_cases.

\[Y_{ijk\ell m n o} = \mu_{\cdot\cdot }+ \alpha_i + \beta_j + \gamma_k+ \delta_{\ell} + \zeta_m + \eta_n +\epsilon_{ijk\ell m n o}\] where

\(o = 1, 2, \cdots, n_{ijk\ell m n}\), where \(n_{ijk\ell m n}\) is the number of observations in cell (\(i,j,k,\ell,m, n\)), that is, in one combination of the levels of the categorical predictors

\(i = 0,1, \cdots, 4\)

\(j = 0,1, \cdots, 5\)

\(k = 0, 1, 2\)

\(\ell = 0, 1, \cdots, 3\)

\(m = 0, 1, \cdots, 4\)

\(n = 1, 2, \cdots, 6\)

\(\epsilon_{ijk\ell m n o}\) (error terms) are i.i.d. \(N(0,\sigma^2)\)

\(Y_{ijk\ell m n o}\) = \(o^{th}\) observation from the \(i^{th}\) level of Face_cover, \(j^{th}\) level of Vaccination, \(k^{th}\) level of Internal_movement, \(\ell ^{th}\) level of Stay_home, \(m ^{th}\) level of International_travel policies, and \(n ^{th}\) group of WHO_region.

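\(\mu_{\cdot\cdot } = \frac{1}{5 \cdot 6 \cdot 3 \cdot 4 \cdot 5 \cdot 6} \sum_{i=0}^{4} \sum_{j=0}^{5} \sum_{k=0}^{2} \sum_{\ell=0}^{3} \sum_{m=0}^{4} \sum_{n=1}^{6} \mu_{ijk\ell m n}\) (overall mean across all populations, where \(\mu_{ijk\ell m n}\) is the mean of cell (\(i,j,k,\ell,m,n\)))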

\(\alpha_i = \mu_{i\cdot} - \mu_{\cdot\cdot}\) = factor effect of Face_cover with \(\sum_{i=0}^{4} \alpha_i = 0\) constraint

\(\beta_j = \mu_{j\cdot} - \mu_{\cdot\cdot}\) = factor effect of Vaccination with \(\sum_{j=0}^{5} \beta_j =0\) constraint

\(\gamma_k = \mu_{k\cdot} - \mu_{\cdot\cdot}\) = factor effect of Internal_movement with \(\sum_{k=0}^{2} \gamma_k =0\) constraint

\(\delta_{\ell} = \mu_{\ell\cdot} - \mu_{\cdot\cdot}\) = factor effect of Stay_home with \(\sum_{\ell=0}^{3} \delta_{\ell} =0\) constraint

\(\zeta_m = \mu_{m\cdot} - \mu_{\cdot\cdot}\) = factor effect of International_travel with \(\sum_{m=0}^{4} \zeta_m =0\) constraint

\(\eta_n = \mu_{n\cdot} - \mu_{\cdot\cdot}\) = factor effect of WHO_region with \(\sum_{n=1}^{6} \eta_n =0\) constraint

Hypothesis

Because this analysis is focused on testing the effects of five independent variables, as well as their interactions, on an outcome measure, it is important to state the hypotheses being tested. For the main effects, which are the effects of the different levels of a single independent variable, the null hypothesis (\(H_o\)) is that the means of the different levels of a given independent variable do not differ from each other, while the alternative hypothesis (\(H_a\)) is that at least some of these means differ, as follows.

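\[H_o : \mu_1 =\mu_2 = \cdots = \mu_r \ \ \ vs. \ \ H_a: \ not \ all \ \mu_i \ are \ equal\] Equivalently, \(H_o\) states that all factor effects of the given factor are zero. The decision rule for this test is defined as follows, where \(r\) is the number of levels of the factor, \(n_T\) is the total number of observations, and \(F(1-\alpha;r-1,n_T - r)\) is the \((1-\alpha)100\)th percentile of the appropriate F distribution. In the multi-factor models fitted below, the second degrees of freedom are the residual degrees of freedom shown in the output.

If \(F^* \le F(1-\alpha;r-1,n_T - r)\), then conclude \(H_o\).
If \(F^* > F(1-\alpha;r-1,n_T - r)\), then conclude \(H_a\).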
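As a worked instance of this decision rule, the sketch below applies it to Face_cover using the F value and residual degrees of freedom reported in the summary output of the additive model later in this section. The variable names are illustrative, and the use of the residual degrees of freedom as the second parameter is the standard choice for a multi-factor model rather than something stated in the rule above.

```r
alpha <- 0.01
r     <- 5        # levels of Face_cover (0 to 4), so r - 1 = 4
df2   <- 4140     # residual degrees of freedom reported for the additive model
F_obs <- 227.09   # F value reported for Face_cover in the additive model

# Critical value F(1 - alpha; r - 1, df2) and the matching upper-tail p-value.
F_crit  <- qf(1 - alpha, df1 = r - 1, df2 = df2)
p_value <- pf(F_obs, df1 = r - 1, df2 = df2, lower.tail = FALSE)

# Decision rule: conclude H_a when the observed F exceeds the critical value.
reject_H0 <- F_obs > F_crit
print(c(F_crit = F_crit, reject_H0 = reject_H0))
```

The critical value is a little above 3, far below the observed statistic, so the rule concludes \(H_a\) for this factor.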

Choosing a significance level \(\alpha=0.01\), the main effect of each factor is statistically significant based on the output of the two models specified below: every main-effect p-value is far below 0.01, with most reported as less than 2e-16 and the largest (Stay_home in the additive model) equal to 1.25e-08. The null hypothesis stated above can therefore be rejected for every factor. It can also be concluded that there are significant differences in the impact of the levels of each factor on New_cases. It is important to note that this analysis focuses only on the effects of region and policies, without considering the possible effects of the variable Country. Since fitting the model with the variable Country is more complex and takes longer to complete, choosing the variable WHO_region as the between-subjects factor is more efficient.

Summary output of the full additive ANOVA model without interactions:

##                        Df Sum Sq Mean Sq F value   Pr(>F)    
## Face_cover              4   4890  1222.5  227.09  < 2e-16 ***
## Vaccination             5   2098   419.6   77.94  < 2e-16 ***
## Internal_movement       2   2285  1142.5  212.24  < 2e-16 ***
## Stay_home               3    215    71.5   13.29 1.25e-08 ***
## International_travel    4   1444   360.9   67.05  < 2e-16 ***
## WHO_region              5   3346   669.2  124.31  < 2e-16 ***
## Residuals            4140  22286     5.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
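A summary table of this form can be produced with base R's aov(). The sketch below uses simulated data and only three of the six factors; the data frame `sim`, its values, and the seed are assumptions for illustration and are not the report's data or code.

```r
set.seed(42)
n <- 300
sim <- data.frame(
  Face_cover = factor(sample(0:4, n, replace = TRUE)),
  Stay_home  = factor(sample(0:3, n, replace = TRUE)),
  WHO_region = factor(sample(c("AFRO", "AMRO", "EMRO", "EURO", "SEARO", "WPRO"),
                             n, replace = TRUE)),
  New_cases  = rpois(n, lambda = 200)   # placeholder monthly counts
)

# Drop zero counts before taking logs, as described in the transformation section.
sim <- sim[sim$New_cases > 0, ]

# Additive fixed-effect model on the log scale, without interactions.
fit <- aov(log(New_cases) ~ Face_cover + Stay_home + WHO_region, data = sim)
tab <- summary(fit)[[1]]
print(tab)
```

summary() on an aov fit reports sequential (Type I) sums of squares, so in an unbalanced design the F value of each factor depends on the order in which the factors enter the formula; only the last term is adjusted for all the others.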

The ANOVA table of the full additive ANOVA model without interactions is displayed below and shows that all of the factors are statistically significant at significance level \(\alpha=0.05\). The generalized eta-squared (ges), which measures effect size, also shows that the factors Face_cover, Vaccination, and WHO_region have the largest effect sizes of the 6 factors included in the model.

ANOVA Table (Type II test) for full additive ANOVA model without interactions:

## ANOVA Table (type II tests)
## 
##                 Effect DFn  DFd       F         p p<.05   ges
## 1           Face_cover   4 4140 125.463 3.10e-101     * 0.108
## 2          Vaccination   5 4140  72.941  2.08e-73     * 0.081
## 3    Internal_movement   2 4140 111.410  7.45e-48     * 0.051
## 4            Stay_home   3 4140  13.196  1.42e-08     * 0.009
## 5 International_travel   4 4140  23.625  2.42e-19     * 0.022
## 6           WHO_region   5 4140 124.311 5.99e-123     * 0.131
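The ges column can be recovered from the F statistics and degrees of freedom in the table above. The sketch below assumes that, for a purely between-subjects design with no factors declared as observed, generalized eta-squared reduces to SS_effect / (SS_effect + SS_error); the data frame `anova_tab` simply re-enters the reported values.

```r
# Values copied from the Type II ANOVA table of the additive model.
anova_tab <- data.frame(
  Effect  = c("Face_cover", "Vaccination", "Internal_movement",
              "Stay_home", "International_travel", "WHO_region"),
  DFn     = c(4, 5, 2, 3, 4, 5),
  DFd     = 4140,
  F_value = c(125.463, 72.941, 111.410, 13.196, 23.625, 124.311)
)

# Since F = (SS_effect / DFn) / (SS_error / DFd), the ratio SS_effect / SS_error
# equals F * DFn / DFd, which gives ges = F * DFn / (F * DFn + DFd).
anova_tab$ges <- with(anova_tab, F_value * DFn / (F_value * DFn + DFd))
print(round(anova_tab$ges, 3))
```

Under that assumption the computed values round to the ges column reported in the table.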

It can be further tested whether the interactions between the categorical predictors have a significant impact on the response variable New_cases. The summary output of the full ANOVA model shown below indicates that all five policy factors and WHO_region remain statistically significant at significance level \(\alpha\)=0.001. Most of the two-way interactions are also statistically significant at \(\alpha=0.01\), with small p-values providing strong evidence against the null hypothesis that the interactions between the predictors have no effect on the response. The only exception is the interaction between Stay_home and International_travel (p = 0.117). Because the output identifies the significant interactions directly, additional tests are not needed to select them. Therefore, the final ANOVA model is additive with all two-way interactions except Stay_home:International_travel.

Summary output of the full ANOVA model with interactions:

##                                          Df Sum Sq Mean Sq F value   Pr(>F)    
## Face_cover                                4   4890  1222.5 267.914  < 2e-16 ***
## Vaccination                               5   2098   419.6  91.952  < 2e-16 ***
## Internal_movement                         2   2285  1142.5 250.389  < 2e-16 ***
## Stay_home                                 3    215    71.5  15.675 3.93e-10 ***
## International_travel                      4   1444   360.9  79.104  < 2e-16 ***
## WHO_region                                5   3346   669.2 146.655  < 2e-16 ***
## Face_cover:Vaccination                   17    499    29.4   6.437 2.71e-15 ***
## Face_cover:Internal_movement              8    268    33.5   7.342 9.82e-10 ***
## Face_cover:Stay_home                     12    267    22.3   4.879 4.82e-08 ***
## Face_cover:International_travel          15    316    21.1   4.614 7.54e-09 ***
## Face_cover:WHO_region                    20    744    37.2   8.154  < 2e-16 ***
## Vaccination:Internal_movement            10    154    15.4   3.370 0.000217 ***
## Vaccination:Stay_home                    15    179    12.0   2.621 0.000602 ***
## Vaccination:International_travel         18    207    11.5   2.518 0.000394 ***
## Vaccination:WHO_region                   25    316    12.6   2.768 5.79e-06 ***
## Internal_movement:Stay_home               6    100    16.7   3.667 0.001235 ** 
## Internal_movement:International_travel    8    181    22.7   4.968 3.84e-06 ***
## Internal_movement:WHO_region             10    399    39.9   8.739 2.61e-14 ***
## Stay_home:International_travel           11     76     6.9   1.519 0.117393    
## Stay_home:WHO_region                     15    207    13.8   3.019 7.39e-05 ***
## International_travel:WHO_region          20    440    22.0   4.824 7.88e-12 ***
## Residuals                              3930  17933     4.6                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To compare the variation explained by two models, a full model that is additive with the significant interactions and a reduced model that is only additive, an F-test for nested models is used. The table below shows that the full model provides a significantly better fit to the data than the reduced model (F = 4.7276 on 199 and 3941 degrees of freedom, p < 2.2e-16).

ANOVA Table of full and reduced model comparison:

## Analysis of Variance Table
## 
## Model 1: New_cases ~ Face_cover + Vaccination + Internal_movement + Stay_home + 
##     International_travel + WHO_region
## Model 2: New_cases ~ Face_cover + Vaccination + Internal_movement + Stay_home + 
##     International_travel + WHO_region + Face_cover:Vaccination + 
##     Face_cover:Internal_movement + Face_cover:Stay_home + Face_cover:International_travel + 
##     Face_cover:WHO_region + Vaccination:Internal_movement + Vaccination:Stay_home + 
##     Vaccination:International_travel + Vaccination:WHO_region + 
##     Internal_movement:Stay_home + Internal_movement:International_travel + 
##     Internal_movement:WHO_region + Stay_home:WHO_region + International_travel:WHO_region
##   Res.Df   RSS  Df Sum of Sq      F    Pr(>F)    
## 1   4140 22286                                   
## 2   3941 17991 199    4294.9 4.7276 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
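
The F-statistic in this comparison can be reproduced by hand from the residual sums of squares. The following is a minimal sketch, not the report's own code: it fits two nested models to R's built-in warpbreaks data (a stand-in, since the Covid-19 data are not reproduced here), checks a hand-computed F against anova(), and then applies the same formula to the rounded values printed in the table above. The helper name nested_f is hypothetical.

```r
# Nested-model F-test: F = [(RSS_reduced - RSS_full) / (df_reduced - df_full)] / (RSS_full / df_full)
nested_f <- function(rss_reduced, df_reduced, rss_full, df_full) {
  df_num <- df_reduced - df_full
  f <- ((rss_reduced - rss_full) / df_num) / (rss_full / df_full)
  c(F = f, p = pf(f, df_num, df_full, lower.tail = FALSE))
}

# Stand-in example on built-in data (assumption: not the report's data)
reduced <- lm(breaks ~ wool + tension, data = warpbreaks)  # additive only
full    <- lm(breaks ~ wool * tension, data = warpbreaks)  # adds the interaction
cmp     <- anova(reduced, full)
by_hand <- nested_f(cmp$RSS[1], cmp$Res.Df[1], cmp$RSS[2], cmp$Res.Df[2])

# Same formula applied to the rounded values reported in the table above
report <- nested_f(rss_reduced = 22286, df_reduced = 4140,
                   rss_full = 17991, df_full = 3941)

print(cmp)
print(by_hand)
print(report)
```

Applied to the rounded table values, the formula gives an F of about 4.73, in line with the reported 4.7276; the small difference comes from rounding of the RSS values.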

The ANOVA table (Type II tests) for the full model, which contains the additive main effects and the significant interactions, is displayed below. All six main effects and all but two of the interactions (Face_cover:Internal_movement and Vaccination:Internal_movement) are statistically significant at significance level \(\alpha=0.05\), which means that they have a significant effect on the number of Covid-19 cases recorded. Based on the generalized eta squared effect size (ges), WHO_region, Face_cover, and Vaccination continue to show the largest effect sizes of the 6 factors included in this model. The interactions with the largest effect sizes are Face_cover:WHO_region, International_travel:WHO_region, and Vaccination:WHO_region. Interactions between the policies and the region therefore tend to have the largest effect sizes among the interaction terms.

ANOVA Table (Type II test) for full additive ANOVA model with significant interactions:

## ANOVA Table (type II tests)
## 
##                                    Effect DFn  DFd       F         p p<.05
## 1                              Face_cover   4 3941 111.247  5.64e-90     *
## 2                             Vaccination   5 3941  63.796  3.38e-64     *
## 3                       Internal_movement   2 3941  98.650  1.57e-42     *
## 4                               Stay_home   3 3941  12.125  6.75e-08     *
## 5                    International_travel   4 3941  25.371  8.82e-21     *
## 6                              WHO_region   5 3941 123.743 5.31e-122     *
## 7                  Face_cover:Vaccination  17 3941   2.707  1.83e-04     *
## 8            Face_cover:Internal_movement   8 3941   1.267  2.56e-01      
## 9                    Face_cover:Stay_home  12 3941   4.214  1.23e-06     *
## 10        Face_cover:International_travel  15 3941   3.659  2.09e-06     *
## 11                  Face_cover:WHO_region  20 3941   5.546  2.28e-14     *
## 12          Vaccination:Internal_movement  10 3941   1.538  1.19e-01      
## 13                  Vaccination:Stay_home  15 3941   2.076  9.00e-03     *
## 14       Vaccination:International_travel  18 3941   2.629  2.02e-04     *
## 15                 Vaccination:WHO_region  25 3941   2.456  7.33e-05     *
## 16            Internal_movement:Stay_home   6 3941   3.706  1.00e-03     *
## 17 Internal_movement:International_travel   8 3941   2.382  1.50e-02     *
## 18           Internal_movement:WHO_region  10 3941   5.771  1.14e-08     *
## 19                   Stay_home:WHO_region  15 3941   3.060  5.92e-05     *
## 20        International_travel:WHO_region  20 3941   4.898  4.34e-12     *
##      ges
## 1  0.101
## 2  0.075
## 3  0.048
## 4  0.009
## 5  0.025
## 6  0.136
## 7  0.012
## 8  0.003
## 9  0.013
## 10 0.014
## 11 0.027
## 12 0.004
## 13 0.008
## 14 0.012
## 15 0.015
## 16 0.006
## 17 0.005
## 18 0.014
## 19 0.012
## 20 0.024
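
As a consistency check on the effect sizes, the sketch below recomputes ges for the six main effects from the F-statistics and degrees of freedom in the table above. It assumes that, for this fixed-effects model, the reported effect size reduces to SS_effect / (SS_effect + SS_residual), which can be rewritten as F * DFn / (F * DFn + DFd). The reported values are consistent with that assumption, but the identity is not stated in the report itself.

```r
# F, DFn and DFd copied from the Type II ANOVA table above (main effects only)
main_effects <- data.frame(
  effect = c("Face_cover", "Vaccination", "Internal_movement",
             "Stay_home", "International_travel", "WHO_region"),
  F   = c(111.247, 63.796, 98.650, 12.125, 25.371, 123.743),
  DFn = c(4, 5, 2, 3, 4, 5),
  DFd = 3941
)

# Assumed identity: ges = SS_effect / (SS_effect + SS_residual)
#                       = F * DFn / (F * DFn + DFd)
main_effects$ges <- with(main_effects, F * DFn / (F * DFn + DFd))

# Rank the factors from largest to smallest effect size
main_effects <- main_effects[order(-main_effects$ges), ]
print(transform(main_effects, ges = round(ges, 3)))
```

The ranking places WHO_region first, followed by Face_cover and Vaccination, with Stay_home last.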

Mean Pair-Wise Comparisons

Figure 12 below shows that the means within each factor vary between levels or groups. Tukey’s HSD (honestly significant difference) method is used to perform the pairwise comparisons of means (\(\mu_i -\mu_{i'}\)). The simultaneous coverage of the Tukey intervals is exactly \(1-\alpha\) for a balanced design; for this study it is at least \(1-\alpha\) because the design is unbalanced. The following tables show the pairwise mean comparisons within each factor using the Tukey HSD method. The p-values in each table appear to be extremely small, so the pairwise differences are significant, which supports the observation drawn from Figure 12.

Figure 12: Main Effect Plot (95% Confidence Interval)
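
The pairwise comparison procedure described above can be sketched with base R's TukeyHSD(). This is an illustration only: it uses the built-in PlantGrowth data with three rows removed to mimic an unbalanced design, not the Covid-19 data, and the object names are hypothetical. According to its documentation, TukeyHSD() includes a sample-size adjustment intended for mildly unbalanced designs.

```r
# Stand-in data (assumption): drop three control rows so group sizes differ
pg <- PlantGrowth[-(1:3), ]
print(table(pg$group))

# One-way ANOVA followed by Tukey's HSD pairwise comparisons of group means
fit   <- aov(weight ~ group, data = pg)
tukey <- TukeyHSD(fit, which = "group", conf.level = 0.95)

# Each row is one pair of levels: estimated difference, simultaneous
# 95% confidence bounds, and the adjusted p-value
print(tukey)

# Keep only the pairs whose adjusted p-value is below 0.05
sig <- tukey$group[tukey$group[, "p adj"] < 0.05, , drop = FALSE]
print(sig)
```

For a model with several factors, the same call can be repeated with a different `which` argument for each factor of interest.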

Sensitivity Analysis

Each of the assumptions stated for the ANOVA model can be tested to verify whether the chosen model is reliable. The assumptions associated with the ANOVA model used for this analysis are listed below.

The error terms are independent.
The error terms are normally distributed.
The data does not have any significant outliers.
The variances of the error terms are equal.
The variances of the sampled populations are equal.

The Normal Q-Q plot shown below (Figure 13) can be used to determine whether the residuals follow a normal distribution. Since the points lie roughly along the diagonal dashed line, it can be inferred that the error terms \(\epsilon_{ijk\ell mn}\) follow a normal distribution without significant outliers. Furthermore, the red line in the Residuals vs Fitted plot is roughly horizontal, which indicates no systematic pattern in the residuals and is consistent with the assumption that the error terms \(\epsilon_{ijk\ell mn}\) are independent. Based on the Residuals vs Leverage plot, there appear to be no significant outliers that need to be removed from the data.

Figure 13: Model Diagnostic Plots
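
Diagnostic plots of the kind described above can be drawn directly from a fitted model object. The sketch below is a hedged illustration on the built-in warpbreaks data, not the report's own model; it requests the Residuals vs Fitted, Normal Q-Q, and Residuals vs Leverage panels from plot().

```r
# Stand-in two-factor model with interaction (assumption: not the report's data)
fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# Draw three of the standard diagnostic panels side by side:
#   which = 1: Residuals vs Fitted
#   which = 2: Normal Q-Q
#   which = 5: Residuals vs Leverage (for a balanced design such as this one,
#              R plots residuals against factor levels instead)
op <- par(mfrow = c(1, 3))
plot(fit, which = c(1, 2, 5))
par(op)

# The residuals and fitted values behind the plots are also available directly
res <- residuals(fit)
fv  <- fitted(fit)
print(summary(res))
```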

Levene's test is used to test the homogeneity of variance and generates the results shown below. The null hypothesis states that the variances across the populations are equal (\(H_0: \sigma_{1}^{2} = \sigma_{2}^{2} = \ldots = \sigma_{r}^{2}\)) and the alternative hypothesis is \(H_a: \sigma_{i}^{2} \ne \sigma_{j}^{2}\) for at least one pair \(i \ne j\). For this test, the \(F\)-statistic is computed for \(H_0: \mathbb{E}[d_{1\cdot}]=\mathbb{E}[d_{2\cdot}] = \cdots =\mathbb{E}[d_{r\cdot}]\), where \(d_{ij}\) is the absolute deviation of an observation from its group median, and \(H_0\) is rejected if \(F^*>F(1-\alpha; r-1, n_T-r)\) at significance level \(\alpha\). The Levene test is performed using the leveneTest() function. At significance level \(\alpha=0.01\), the p-values for five of the six factors are statistically significant, which indicates non-homogeneous variance across their respective levels; the exception is Internal_movement (p = 0.031), which is significant at \(\alpha=0.05\) but not at \(\alpha=0.01\). This indicates that the equal variance assumption stated for this model is violated. Therefore, further investigation is needed to assess the source of the non-homogeneous variance seen in this result. However, for the model diagnostics portion of this report we rely on the results obtained from the diagnostic plots above.
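
The rejection rule \(F^*>F(1-\alpha; r-1, n_T-r)\) can be checked against the F values and degrees of freedom reported in the Levene outputs of this section. The sketch below is a worked check under that rule, with the numbers copied from those outputs; the data frame and column names are hypothetical.

```r
# F values and degrees of freedom copied from the Levene outputs in this section
levene <- data.frame(
  term = c("Face_cover", "Vaccination", "Internal_movement",
           "Stay_home", "International_travel", "WHO_region"),
  F    = c(14.717, 5.9015, 3.4685, 8.9645, 8.2576, 36.072),
  df1  = c(4, 5, 2, 3, 4, 5),                     # r - 1
  df2  = c(4159, 4158, 4161, 4160, 4159, 4158)    # n_T - r
)

# Critical values F(1 - alpha; r - 1, n_T - r) at two significance levels
levene$crit_01 <- qf(0.99, levene$df1, levene$df2)
levene$crit_05 <- qf(0.95, levene$df1, levene$df2)

# Decision rule: reject H0 (equal variances) when F exceeds the critical value
levene$reject_01 <- levene$F > levene$crit_01
levene$reject_05 <- levene$F > levene$crit_05

# p-values recomputed from the F distribution for comparison with the outputs
levene$p <- pf(levene$F, levene$df1, levene$df2, lower.tail = FALSE)
print(levene)
```

Under this rule, equal variance is rejected for every factor at \(\alpha=0.05\), while Internal_movement falls below the critical value at \(\alpha=0.01\).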

Face covering:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)    
## group    4  14.717 6.08e-12 ***
##       4159                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Vaccination:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    5  5.9015 1.923e-05 ***
##       4158                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Internal movement:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value  Pr(>F)  
## group    2  3.4685 0.03125 *
##       4161                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Stay-at-home:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    3  8.9645 6.452e-06 ***
##       4160                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

International travel:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    4  8.2576 1.248e-06 ***
##       4159                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

WHO region:

## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    5  36.072 < 2.2e-16 ***
##       4158                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
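
Levene's test with center = median, as in the outputs above, amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The base-R sketch below illustrates that construction on the built-in PlantGrowth data rather than the Covid-19 data; levene_by_hand is a hypothetical helper, and car::leveneTest() applied to the same data is expected to return the same F and p.

```r
# Median-centered Levene test written out step by step
levene_by_hand <- function(y, g) {
  g <- factor(g)
  # d_ij: absolute deviation of each observation from its group median
  d <- abs(y - ave(y, g, FUN = median))
  # One-way ANOVA on the deviations tests E[d_1.] = ... = E[d_r.]
  tab <- anova(lm(d ~ g))
  c(F   = tab[["F value"]][1],
    p   = tab[["Pr(>F)"]][1],
    df1 = tab[["Df"]][1],   # r - 1
    df2 = tab[["Df"]][2])   # n_T - r
}

# Stand-in data (assumption): 3 groups of 10 plants each
res <- levene_by_hand(PlantGrowth$weight, PlantGrowth$group)
print(res)
```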

The table below shows the possible outliers in the data. However, the diagnostic plots above show that they do not have a significant effect on the model assumptions, so they can be retained.
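
One common way to list candidate outliers is to flag observations with large standardized residuals or large Cook's distance. The report does not state which rule produced its table, so the sketch below is only an illustration: the thresholds (|standardized residual| > 3 and Cook's distance > 4/n) are assumptions, and the model is fitted to the built-in warpbreaks data.

```r
# Stand-in model (assumption: not the report's data or model)
fit <- aov(breaks ~ wool * tension, data = warpbreaks)

# Per-observation influence measures
influence_tbl <- data.frame(
  row       = seq_len(nrow(warpbreaks)),
  std_resid = rstandard(fit),        # standardized residuals
  cooks_d   = cooks.distance(fit)    # Cook's distance
)

# Illustrative thresholds (assumed, not taken from the report)
n <- nrow(influence_tbl)
flagged <- influence_tbl[abs(influence_tbl$std_resid) > 3 |
                           influence_tbl$cooks_d > 4 / n, ]
print(flagged)
```

Flagged rows are candidates for closer inspection rather than automatic removal, which matches the decision above to retain them.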

Discussion

The primary objective of this analysis is to identify whether public health policies have a significant impact on the number of positive Covid-19 cases using data obtained from various sources. Using the analysis of variance (ANOVA) method to assess the effect of public health policies on Covid-19 cases across different regions, we can conclude that all of the public health policies have a significant association with the number of positive Covid-19 cases in different regions around the world. To answer the secondary question of interest stated in the introduction section, face covering and vaccination policies have the largest effect sizes of the 5 policies included in the model, while the stay-at-home order has the smallest. It is important to clarify that the identified association between the number of positive Covid-19 cases and the significant factors may be classified as an indirect association, because successful transmission of the Covid-19 virus may depend on other factors that are excluded from this analysis and that could explain the significant relationships observed.

Further investigation may be necessary to verify the results obtained from the methods used in this analysis. The relationships between the variables in the data used for this project can be explored using other methods, such as a mixed-effects ANOVA with multiple factors that treats Country as a random factor and holds the policy factors fixed. The results of the fixed-effects ANOVA suggest that all public health policies and most of their interactions have a significant effect on the number of positive Covid-19 cases across different regions worldwide. This may be the reason why the current number of positive cases is generally declining. However, it is important to note that there were periods during the pandemic when these factors could be considered ineffective because of the increase in positive cases recorded.

Although government policies are highly useful in identifying potential causes of increased virus transmission, it is important not to rely on them alone. Other factors may also have a significant impact and could override the effects of public health policies. Therefore, it is crucial to take a comprehensive approach in identifying and addressing the factors that contribute to the spread of the virus. Overall, the results of this analysis suggest that government interventions are necessary during a pandemic.

Acknowledgement

https://statisticsglobe.com/aggregate-daily-data-to-month-year-intervals-in-r

https://www.r-bloggers.com/2018/01/bitcoin-world-map-bubbles/

https://www.machinelearningplus.com/machine-learning/feature-selection/

https://github.com/bbc/bbplot

https://rawgit.com/valentinitnelav/valentinitnelav.github.io/master/assets/2018-08-25-PCA-interactive-biplot/PCA-interactive-biplot.html

https://rpubs.com/Saskia/520216

http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials/

https://machinelearningmastery.com/feature-selection-with-categorical-data/

https://services.google.com/fh/files/misc/exploratory_data_analysis_for_feature_selection_in_machine_learning.pdf

Data Sources:

Covid-19

Other

Reference

https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-11111-1

https://www.cdc.gov/mmwr/volumes/71/wr/mm7106e1.htm

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248849

https://bmcpublichealth.biomedcentral.com/articles/10.1186/s12889-021-11889-0

Appendix

Github Repository

Session information

sessionInfo()

## R version 4.2.3 (2023-03-15 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22621)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] car_3.1-2          carData_3.0-5      gridExtra_2.3      gplots_3.1.3      
##  [5] rstatix_0.7.2      factoextra_1.0.7   FactoMineR_2.8     hrbrthemes_0.8.0  
##  [9] ggcorrplot_0.1.4   plotly_4.10.1      ggplot2_3.4.2      summarytools_1.0.1
## [13] DT_0.27            dplyr_1.1.1        zoo_1.8-11         kableExtra_1.3.4  
## 
## loaded via a namespace (and not attached):
##   [1] colorspace_2.1-0        ggsignif_0.6.4          pryr_0.1.6             
##   [4] ellipsis_0.3.2          estimability_1.4.1      base64enc_0.1-3        
##   [7] rstudioapi_0.14         httpcode_0.3.0          ggpubr_0.6.0           
##  [10] farver_2.1.1            ggrepel_0.9.3           fansi_1.0.4            
##  [13] mvtnorm_1.1-3           lubridate_1.9.2         xml2_1.3.3             
##  [16] codetools_0.2-19        leaps_3.1               extrafont_0.19         
##  [19] cachem_1.0.7            knitr_1.42              jsonlite_1.8.4         
##  [22] broom_1.0.4             Rttf2pt1_1.3.12         cluster_2.1.4          
##  [25] shiny_1.7.4             compiler_4.2.3          httr_1.4.5             
##  [28] emmeans_1.8.5           backports_1.4.1         fastmap_1.1.1          
##  [31] lazyeval_0.2.2          cli_3.6.0               later_1.3.0            
##  [34] htmltools_0.5.5         tools_4.2.3             gtable_0.3.3           
##  [37] glue_1.6.2              reshape2_1.4.4          Rcpp_1.0.10            
##  [40] jquerylib_0.1.4         fontquiver_0.2.1        vctrs_0.6.1            
##  [43] crul_1.3                svglite_2.1.1           extrafontdb_1.0        
##  [46] crosstalk_1.2.0         xfun_0.38               stringr_1.5.0          
##  [49] rvest_1.0.3             timechange_0.2.0        mime_0.12              
##  [52] lifecycle_1.0.3         gtools_3.9.4            MASS_7.3-58.2          
##  [55] scales_1.2.1            promises_1.2.0.1        fontLiberation_0.1.0   
##  [58] RColorBrewer_1.1-3      yaml_2.3.7              curl_5.0.0             
##  [61] pander_0.6.5            gdtools_0.3.3           sass_0.4.5             
##  [64] stringi_1.7.12          fontBitstreamVera_0.1.1 highr_0.10             
##  [67] checkmate_2.1.0         caTools_1.18.2          bitops_1.0-7           
##  [70] rlang_1.1.0             pkgconfig_2.0.3         systemfonts_1.0.4      
##  [73] matrixStats_0.63.0      evaluate_0.20           lattice_0.20-45        
##  [76] purrr_1.0.1             rapportools_1.1         htmlwidgets_1.6.2      
##  [79] labeling_0.4.2          tidyselect_1.2.0        plyr_1.8.8             
##  [82] magrittr_2.0.3          bookdown_0.33           R6_2.5.1               
##  [85] magick_2.7.4            generics_0.1.3          multcompView_0.1-8     
##  [88] pillar_1.9.0            withr_2.5.0             abind_1.4-5            
##  [91] scatterplot3d_0.3-43    tibble_3.2.1            crayon_1.5.2           
##  [94] gfonts_0.2.0            KernSmooth_2.23-20      utf8_1.2.3             
##  [97] rmarkdown_2.21          grid_4.2.3              data.table_1.14.8      
## [100] rmdformats_1.0.4        digest_0.6.31           flashClust_1.01-2      
## [103] webshot_0.5.4           xtable_1.8-4            tidyr_1.3.0            
## [106] httpuv_1.6.9            munsell_0.5.0           viridisLite_0.4.1      
## [109] bslib_0.4.2             tcltk_4.2.3