Abstract
The COVID-19 pandemic has spread worldwide since December 2019. This mini-project has four main tasks: visualizing COVID-19 data with PCA, and modeling the number of deaths caused by COVID-19 with Quasi-Poisson Regression, Quasi-Binomial Regression, and a Linear Mixed Model, respectively. The PCA result suggests three optimal clusters for the COVID-19 data. The Quasi-Poisson Regression analysis suggests six latent factors to describe the number of deaths. The Quasi-Binomial Regression analysis yields two contradictory results. The Linear Mixed Model suggests two positive fixed effects and three negative fixed effects that explain the variation in the number of deaths caused by COVID-19.

Introduction
COVID-19 is a disease caused by a new coronavirus that has spread worldwide from Wuhan, China, since early December 2019. As of May 27, 2020, the COVID-19 pandemic had infected approximately 6 million people and caused about 368 thousand deaths globally. Free access to basic COVID-19 statistics is provided by several government health websites and non-profit organizations, such as Wikipedia, the World Health Organization, and the European Centre for Disease Prevention and Control (see Figure 1, which is downloaded from https://ourworldindata.org/coronavirus-data).
Current research focuses on simulating the spread of COVID-19 with mathematical models. For example, Basnarkov (2020), Kevrekidis et al. (2020), Lopez and Rodo (2020), and Wangping et al. (2020) simulate the relationships among the rates of change of the susceptible, exposed, infected, and recovered populations with the SEIR model in Figure 2 (Hethcote (2000)). One major drawback is that the basic SEIR mathematical model does not include latent factor effects such as smoking, age group, pre-existing medical conditions, handwashing facilities, and the number of hospital beds.
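For readers unfamiliar with the compartmental structure, the basic SEIR dynamics can be sketched as a forward-Euler simulation. This is a minimal illustration and not the implementation used in the cited papers; the parameter names beta (transmission rate), sigma (incubation rate), and gamma (recovery rate), as well as all numeric values, are assumptions chosen only for illustration.

```python
def seir_step(s, e, i, r, beta, sigma, gamma, dt):
    """Advance the SEIR compartments by one forward-Euler step of size dt."""
    n = s + e + i + r                      # total population (stays constant)
    new_exposed = beta * s * i / n * dt    # flow S -> E
    new_infected = sigma * e * dt          # flow E -> I
    new_recovered = gamma * i * dt         # flow I -> R
    return (s - new_exposed,
            e + new_exposed - new_infected,
            i + new_infected - new_recovered,
            r + new_recovered)


def simulate_seir(s0, e0, i0, r0, beta, sigma, gamma, days, dt=0.1):
    """Return the compartment sizes (S, E, I, R) after `days` days."""
    state = (s0, e0, i0, r0)
    for _ in range(int(round(days / dt))):
        state = seir_step(*state, beta, sigma, gamma, dt)
    return state


# Hypothetical parameter values, NOT estimates for COVID-19.
final_state = simulate_seir(s0=9990.0, e0=0.0, i0=10.0, r0=0.0,
                            beta=0.5, sigma=0.2, gamma=0.1, days=60)
print([round(x, 1) for x in final_state])
```

Note that none of the four equations has a term for covariates such as age structure or hospital capacity, which is the limitation discussed above.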
From a scientific perspective, a critical challenge in modeling COVID-19 is that the basic SEIR mathematical model cannot capture some latent factor effects. This mini-project therefore proposes a linear statistical modeling approach to study the latent factor effects of COVID-19 with PCA, generalized linear models (GLM), and a linear mixed model (LMM).
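The four main tasks in this mini-project are as follows: (I) visualizing the COVID-19 data with PCA; (II) modeling the number of deaths caused by COVID-19 with Quasi-Poisson Regression; (III) modeling it with Quasi-Binomial Regression; and (IV) modeling it with a Linear Mixed Model.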
In this mini-project, the COVID-19 dataset in Table 1 was downloaded from https://ourworldindata.org/coronavirus-data on May 27, 2020. Among the variables in Table 2, only Locations and Regions are categorical.
The main task in this section is to visualize the COVID-19 data with PCA. Figure 3 indicates that the first principal component explains 34.5% of the variability in the COVID-19 data, whereas the second principal component explains 17.2%. Together, the first two components explain about 51.7% of the variability in the COVID-19 data.
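The percentages above are proportions of explained variance, that is, each eigenvalue of the PCA divided by the sum of all eigenvalues. A minimal sketch of that arithmetic follows; the eigenvalues in the second example are hypothetical and are not taken from the COVID-19 data.

```python
def explained_variance_ratio(eigenvalues):
    """Share of total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def cumulative(values):
    """Running totals of a sequence."""
    out, running = [], 0.0
    for v in values:
        running += v
        out.append(running)
    return out


# Percentages reported in Figure 3 for the first two components.
reported = [34.5, 17.2]
print(round(cumulative(reported)[-1], 1))  # 51.7

# Hypothetical eigenvalues, for illustration only.
ratios = explained_variance_ratio([2.0, 1.0, 1.0])
print(ratios)  # [0.5, 0.25, 0.25]
```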
Figures 4, 5, and 6 and Table 4 indicate that the top six contributing variables in the first principal component dimension are aged_65_older, F_smokers, GDP_per_capita, hospital_beds_per_100k, total_deaths_pm, and total_cases_pm. The top six contributing variables in the second principal component dimension are total_cases_pm, M_smokers, cvd_death_rate, hospital_beds_per_100k, total_deaths_pm, and aged_65_older. Therefore, total_cases_pm, total_deaths_pm, and aged_65_older are among the top contributing variables shared by both major principal component axes. This information is very useful for the linear statistical modeling in sections II, III, and IV.
Figure 7 is the PCA biplot of the variables in Table 2, grouped by geographic region. It shows visually that the variability of the COVID-19 data forms dissimilar cluster structures across geographic regions.
Figure 8 is the PCA biplot of the variables in Table 2, grouped by “Having More than 50% Hand-washing Facilities”. It shows visually that the group with more than 50% hand-washing facilities is considerably smaller than the group with less than 50% hand-washing facilities.
Figure 9 is the PCA biplot of the variables in Table 2, grouped by “Having More than 50% Smoking Patients”. It shows visually that the groups of smoking and non-smoking COVID-19 patients are roughly the same size.
Figure 10 shows the three optimal clusters of the COVID-19 data obtained with PCA. Table 5 lists the key latent factors for each cluster. Each optimal cluster is defined by key components that are shared within the group; between groups, the key components differ.
PCA is a useful tool for visualizing the key components in each principal component dimension. PCA shows that the first two principal components together account for over 50% of the variability in the COVID-19 data. The variables total_cases_pm, total_deaths_pm, and aged_65_older are among the top contributing variables shared by both major principal component axes. In addition, PCA suggests three optimal clusters for the COVID-19 data, and each cluster shares common key components within the group. This information is very useful for the linear statistical modeling in sections II, III, and IV.
This section aims to model the count data of deaths caused by COVID-19 with Poisson regression. The response variable total_death is count data and is assumed to follow a Poisson distribution. The explanatory variables include total_case (the number of people infected), median_age, aged_65_older, gdp_per_capita, population_density, diabetes_prevalence, female_smokers, male_smokers, handwashing_facilities, and hospital_beds_per_100k. The Poisson regression uses a natural log link function to model the multiplicative effects of the explanatory variables listed in Table 2.
\[\log(\mu_i)=\beta_0+\beta_1 x_{i1}+\beta_2 x_{i2}+\ldots+\beta_k x_{ik}=x_{i}^{T}\beta\]
where \(\mu_i = E(Y_i)\) is the expected number of deaths for observation i.
Hypothesis Testing:
\(\text{Null model: } \beta_i = 0\)
If the parameter estimate \(\beta_{i} = 0\), explanatory variable i has no effect on the response total_deaths caused by COVID-19.
If the parameter estimate \(\beta_{i} > 0\), a one-unit increase in explanatory variable i multiplies the expected total_deaths caused by COVID-19 by \(\exp(\beta_{i}) > 1\), that is, it increases the expected number of deaths.
If the parameter estimate \(\beta_{i} < 0\), a one-unit increase in explanatory variable i multiplies the expected total_deaths caused by COVID-19 by \(\exp(\beta_{i}) < 1\), that is, it decreases the expected number of deaths.
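The three cases above follow from exponentiating a coefficient under the log link. The short sketch below shows how a coefficient translates into a multiplicative effect on the expected count; the coefficient values are hypothetical and are not estimates from Table 6.

```python
import math


def rate_ratio(beta):
    """Multiplicative change in the expected count per one-unit increase."""
    return math.exp(beta)


def percent_change(beta):
    """The same effect expressed as a percentage change in the expected count."""
    return (math.exp(beta) - 1.0) * 100.0


# Hypothetical coefficients: no effect, positive effect, negative effect.
for beta in (0.0, 0.1, -0.1):
    print(beta, round(rate_ratio(beta), 3), round(percent_change(beta), 1))
```

Because the effects are multiplicative, a coefficient of 0.1 and a coefficient of -0.1 cancel exactly: their rate ratios multiply to 1.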
Figure 11 is a correlation plot of the latent variables in Table 2. It shows three important points:
Firstly, handwashing_facilities and hospital_beds_per_100k have a very low correlation with the pre-existing medical conditions of cardiovascular disease and diabetes prevalence.
Secondly, male smokers have a positive correlation with the pre-existing medical conditions of cardiovascular disease and diabetes prevalence, whereas female smokers have a negative correlation with them.
Thirdly, the median age has a positive correlation with diabetes prevalence but a negative correlation with cardiovascular disease.
The Poisson distribution has the strong property that its mean and variance are equal. If the variance is smaller (or greater) than the mean, the data are said to be underdispersed (or overdispersed) relative to the Poisson model. Overdispersion is more serious than underdispersion from a statistical modeling perspective. It arises when the explanatory factors are not sufficient to explain all of the random variation, which leads to poor goodness-of-fit and model misspecification. Table 6 shows the results of the Poisson regression analysis.
Model_01 is the GLM with a Poisson family and log link, whereas Model_02 is the GLM with a quasi-Poisson family and log link.
Model_01 shows the problem of overdispersion, as the ratio Residual deviance / Degrees of freedom = 4992.2/123 = 40.59 is far greater than the nominal dispersion parameter of 1.
Model_02 shows no problem of overdispersion, as the ratio Residual deviance / Degrees of freedom = 5171.5/126 = 41.04 is less than its estimated dispersion parameter of 51.78.
Since Model_02 has no problem of overdispersion, Model_02 is a better model than Model_01.
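The deviance-to-degrees-of-freedom ratios quoted for Model_01 and Model_02 can be reproduced directly. The sketch below also includes a Pearson-based dispersion estimate, which is the usual way a quasi-Poisson dispersion parameter is estimated; the function names and the small observed and fitted vectors are hypothetical and serve only to show the formula.

```python
def dispersion_ratio(residual_deviance, residual_df):
    """Residual deviance divided by residual degrees of freedom."""
    return residual_deviance / residual_df


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    residual_df = len(observed) - n_params
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / residual_df


# Values quoted in the text for Model_01 and Model_02.
model_01 = dispersion_ratio(4992.2, 123)
model_02 = dispersion_ratio(5171.5, 126)
print(round(model_01, 2), round(model_02, 2))  # 40.59 41.04

# Hypothetical toy data: two observations, one fitted parameter.
print(pearson_dispersion([3, 1], [2.0, 2.0], n_params=1))  # 1.0
```

A ratio near 1 is what a Poisson model assumes; values far above 1, as for Model_01, indicate overdispersion.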
Figure 12 shows the diagnostic plots used to check the residual fit and the log-normal assumption for Model 02. The normal QQ plot indicates that some outliers deviate from the log-normal assumption. From a practical statistical and econometric perspective, it is not necessary to delete the outliers to improve the goodness-of-fit and explanatory power, because deleting outliers can lead to overfitting and hence lower predictive power. Balancing predictive power against explanatory power is part of the art of scientific modeling and depends on the modeling objectives.
Table 9 in Appendix A shows the ANOVA test, which uses a Pearson chi-square statistic to test the goodness-of-fit of Model 02. The results in Table 9 show that the factors total_cases_pm, aged_65_older, diabetes_prevalence, cvd_death_rate, F_smokers, and hospital_beds_per_100k are highly statistically significant.
Figure 13 suggests that the total number of deaths per million population increases exponentially with the total number of cases (people infected) per million population, the share of the population aged over 65, and the share of female smokers. However, the total number of deaths per million population decreases exponentially with diabetes prevalence, the death rate of cardiovascular diseases, and the total number of hospital beds per 100k. Whether a factor is exponentially increasing or decreasing, its effect on the number of deaths from COVID-19 is multiplicative rather than additive.
Model 02 is the Quasi-Poisson Regression, which is a better model than the Poisson Regression Model 01. Model 02 shows no problem with overdispersion. The Quasi-Poisson Regression analysis finds six significant latent factors that are sufficient to describe the random variations in the total number of deaths caused by COVID-19. These six statistically significant explanatory factors are total_cases_pm, aged_65_older, diabetes_prevalence, cvd_death_rate, F_smokers, and hospital_beds_per_100k.
It is well known that overdispersion cannot occur when the dependent variable is a single Bernoulli trial. The binomial distribution, however, is the aggregate of multiple Bernoulli trials, so overdispersion can and does occur in binomial regression, as in the Poisson regression analysis of the previous section. The COVID-19 data are highly right-skewed for non-zero values and may contain excess zero counts, as shown in the scatter matrix plot in Figure 11. To avoid the overdispersion problem, I propose to model the count data of deaths and survivals caused by COVID-19 with Quasi-Binomial Regression rather than Binomial Regression:
\[\log\left(\frac{\pi}{1-\pi}\right) = X'\beta\]
where \(\pi\) is the probability of death, bounded within the interval [0,1]. The odds \(\frac{\pi}{1-\pi}\), the ratio of the total number of deaths to the total number of survivals for COVID-19, measure the association between exposure to the latent factors and the response outcome. As in the previous section on Quasi-Poisson Regression, the Quasi-Binomial Regression (or logit regression) uses the same latent factors listed in Table 2. Unlike Quasi-Poisson Regression, Quasi-Binomial Regression models the log of these odds, so the response Y passed to the GLM function in R must be constructed as the following two-column matrix:
\[\mathrm{Y = [count\_data(deaths),\ count\_data(survivals)]}\]
where the total number of survivals is defined as the total number of cases minus the total number of deaths.
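The relationship between the two-column response, the odds, and the logit link can be sketched with a few lines of arithmetic. The helper names and the counts below are hypothetical and are not taken from the COVID-19 dataset.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def inv_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def binomial_response(total_cases, total_deaths):
    """Two-column response: (deaths, survivals), survivals = cases - deaths."""
    return (total_deaths, total_cases - total_deaths)


# Hypothetical counts for a single country.
deaths, survivals = binomial_response(total_cases=1000, total_deaths=50)
pi_hat = deaths / (deaths + survivals)   # observed death probability
odds = deaths / survivals                # deaths relative to survivals
print(deaths, survivals, pi_hat, round(logit(pi_hat), 3))
```

Exponentiating the logit recovers the odds, which is why coefficients in this model are read as multiplicative effects on the odds of death rather than on the death count itself.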
Table 7 shows the results of the Quasi-Binomial Regression analysis for the total number of deaths caused by COVID-19.
Figure 14 shows the diagnostic plots used to check the residual fit and the log-odds assumption for Model 03. The normal QQ plot indicates significant outliers that the Quasi-Binomial Regression still cannot capture. However, the goodness-of-fit ANOVA test in Table 10 (see Appendix A) shows that population density, the share of the population aged over 65, and the pre-existing conditions of diabetes prevalence and cardiovascular disease are highly statistically significant for the model fit.
Figure 15 shows the effect plots of Model 03. The total number of deaths per million population increases exponentially with the share of the population aged over 65, but decreases exponentially with diabetes prevalence, cardiovascular diseases, and population density. The inverse relationship between the total number of deaths per million population and population density (as in the Asian regions of Macau, Hong Kong, and Singapore) is a contradictory but interesting finding. From a medical science perspective, this inverse relationship contradicts the traditional view that a higher population density leads to a higher total number of infected cases or deaths in any coronavirus pandemic. One possible explanation is the influence of external geographical control factors, such as how effectively local citizens wear medical face masks, how effectively social distancing is implemented, and whether a 14-day quarantine policy is in place. These external control factors vary with the local situation and local government policy.
The Quasi-Binomial Regression analysis yields two contradictory results.
The first contradiction is the inverse relationship between the total number of deaths per million population and population density. This finding contradicts the traditional scientific understanding of coronavirus pandemics: "the higher the population density, the higher the total number of infected cases or deaths for any coronavirus outbreak." This contradiction may be attributed to external geographical control factors such as a social distancing policy or a local 14-day quarantine policy.
The second contradiction is that the latent factors of female smokers and hospital beds per 100k are not statistically significant in Model 03, although they are significant in Model 02. One possible reason is that geographical location might contribute a significant random effect on female smokers and hospital beds per 100k.
In the final section, I will introduce a linear mixed model (LMM) to model the total number of deaths caused by COVID-19 with random effects and fixed effects.
Simply speaking, the linear mixed model (LMM) is a linear regression model that combines fixed effects and random effects. One advantage of an LMM is that it does not require the observations to be independent: the fixed effects are modeled conditionally on the random effects, so observations that share a random effect may be correlated. Whether a variable is treated as random or fixed in the LMM is part of the art of scientific modeling.
In this section, the main task is to fit a linear mixed model (LMM) to the COVID-19 dataset. The response is the total number of deaths per million population. The fixed effects are chosen from the continuous variables in Table 2, while the random effect is set to the categorical variable of geographic region. My interest is to find out whether the total numbers of deaths in different regions share common fixed effects across the random effect of geographic region. Figure 16 shows the variation of the variables in the COVID-19 dataset (in Table 2), clustered by region.
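A model of this kind can be written in the generic LMM form below, with region as the grouping factor. This is a sketch of the standard formulation rather than the exact specification of Model 04; the symbols \(b_r\), \(z_{rj}\), \(D\), and \(\sigma^2\) are notation introduced here for illustration.

\[y_{rj} = x_{rj}^{T}\beta + z_{rj}^{T} b_r + \varepsilon_{rj}, \qquad b_r \sim N(0, D), \qquad \varepsilon_{rj} \sim N(0, \sigma^2)\]

Here \(y_{rj}\) is the total number of deaths per million population for country \(j\) in region \(r\), \(\beta\) collects the fixed effects shared by all regions, and \(b_r\) collects the region-specific random effects, which are assumed to be independent of the errors \(\varepsilon_{rj}\).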
Since residual, local influence, and leverage analyses violate the underlying assumptions of the Gaussian linear mixed model, the lmer function (from the lme4 package in R) does not provide any diagnostic checks. Instead, the only possibility is to generate a QQ plot to check the normality assumption for Model 04 (see Figure 17 in Appendix A). To check the statistical significance of the estimated fixed-effect coefficients in the LMM, the lmer function in R uses the Satterthwaite approximation to conduct a two-sided t test. For details, see Satterthwaite (1946).
Table 8 shows the optimal result of the LMM analysis for COVID-19. The results indicate that:
Each regional cluster has five significant fixed effects (total_cases_pm, diabetes_prevalence, aged_65_older, hospital_beds_per_100k, and population_density) that commonly contribute to the total number of deaths.
Only two significant random effects (diabetes_prevalence and hospital_beds_per_100k) commonly contribute to the total number of deaths across the clusters.
The effects plot for Model 04 in Figure 18 indicates that
total_cases_pm and aged_65_older show positive fixed effects on the total number of deaths, whereas
diabetes_prevalence, hospital_beds_per_100k, and population_density show negative fixed effects on the total number of deaths.
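The COVID-19 pandemic has spread worldwide since December 2019. Simulations of the COVID-19 pandemic with SEIR mathematical models cannot capture some latent factors. In view of that, I carried out four main tasks: visualizing the COVID-19 data with PCA, and modeling the number of deaths caused by COVID-19 with Quasi-Poisson Regression, Quasi-Binomial Regression, and a Linear Mixed Model.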
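The overall conclusions are as follows:
With PCA, total_cases_pm, total_deaths_pm, and aged_65_older are among the top contributing variables shared by both major principal component axes. PCA also suggests three optimal clusters for the COVID-19 data. Each cluster shares common key components within the group.
With the Quasi-Poisson Regression analysis, total_cases_pm, aged_65_older, diabetes_prevalence, cvd_death_rate, F_smokers, and hospital_beds_per_100k are found to be sufficient to describe the random variations in the total number of deaths caused by COVID-19.
With the Quasi-Binomial Regression analysis, two contradictory results are found: an inverse relationship between the total number of deaths per million population and population density, and the loss of statistical significance of female smokers and hospital beds per 100k relative to Model 02.
With the Linear Mixed Model, total_cases_pm and aged_65_older show positive fixed effects on the total number of deaths, whereas diabetes_prevalence, hospital_beds_per_100k, and population_density show negative fixed effects.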
Basnarkov, Lasko. 2020. “Epidemic Spreading Model of COVID-19,” May. http://arxiv.org/abs/2005.11815v1.
Hethcote, Herbert W. 2000. “The Mathematics of Infectious Diseases.” SIAM Rev. 42 (4): 599–653. https://doi.org/10.1137/s0036144500371907.
Kevrekidis, P. G., J. Cuevas-Maraver, Y. Drossinos, Z. Rapti, and G. A. Kevrekidis. 2020. “Spatial Modeling of COVID-19: Greece and Andalusia as Case Examples,” May. http://arxiv.org/abs/2005.04527v2.
Lopez, Leonardo R, and Xavier Rodo. 2020. “A Modified SEIR Model to Predict the COVID-19 Outbreak in Spain and Italy: Simulating Control Scenarios and Multi-Scale Epidemics.” medRxiv. https://doi.org/10.1101/2020.03.27.20045005.
Satterthwaite, F. E. 1946. “An Approximate Distribution of Estimates of Variance Components.” Biometrics Bulletin 2 (6): 110–14. https://www.jstor.org/stable/3002019.
Wangping, Jia, Han Ke, Song Yang, Cao Wenzhe, Wang Shengshu, Yang Shanshan, Wang Jianwei, et al. 2020. “Extended SIR Prediction of the Epidemics Trend of COVID-19 in Italy and Compared with Hunan, China.” Frontiers in Medicine 7: 169. https://www.frontiersin.org/article/10.3389/fmed.2020.00169.