Load the packages and set up the file output parameters.
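
The setup chunk itself is not echoed here; a minimal sketch, assuming the packages used later in this report, would be:

library(rstanarm)     # Bayesian GLMs via Stan (stan_glm)
library(bayesplot)    # posterior plots such as mcmc_areas()
library(ggplot2)      # general plotting
library(DataExplorer) # quick bar plots of the categorical columns
library(shinystan)    # browser-based MCMC diagnostics
library(knitr)        # kable() tables
# The file output parameters are set with knitr::opts_chunk$set(),
# echoed near the end of this document.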

Introduction

The novel coronavirus SARS-CoV-2, the virus that causes the disease COVID-19, emerged in late 2019 in Wuhan, China. The virus transmits person-to-person and causes symptoms including fever, cough, and difficulty breathing, and in severe cases organ failure and death. Due to the rapid transmission and outbreak of the coronavirus, with no known cure, the World Health Organization declared a pandemic on March 11, 2020. Because COVID-19 is a respiratory illness, it is believed to affect individuals in high-risk categories more severely; these categories include, but are not limited to, people with diabetes, asthma, or autoimmune diseases, and the elderly. In response to the rapidly spreading virus, the United States has implemented social distancing to the maximum extent possible, so many employees are working remotely and schools are conducting instruction online. Data on the virus is still evolving as more people are diagnosed. The aim of this Bayesian analysis is to examine the relationship between patient outcome, age, and comorbidities using coronavirus patient data from Korea.

At the time of this work I could not find a sufficiently detailed data set on patients in the United States, but I was able to find a detailed data set on COVID-19 patients in South Korea. I was also unable to find associated comorbidity data for these patients, so those details were synthetically created for the data set as described below.

Bayesian Analysis

This report details the estimation of a generalized linear model (GLM) for binary (Bernoulli) and binomial response variables using the rstanarm package in R. Our data set is composed of predominantly categorical variables, with one exception, age. Each of the categorical variables has just two levels, excepting the geographic origin of the patient.

We perform our modeling with the rstanarm package in R, using RStudio as our integrated development environment (IDE). The rstanarm package runs Stan via the rstan package, the R interface to Stan. All of these are open source software packages that are free to use.

Stan is a platform for statistical modeling and high-performance statistical computation.

The rstanarm package provides the stan_glm function to estimate GLMs for binary response variables, which is what we need for our data, where we model the survival outcome for patients diagnosed with COVID-19.

The documentation for rstanarm explains the four steps in a Bayesian analysis as follows:

  1. Specify a joint distribution for the outcome(s) and all the unknowns, which typically takes the form of a marginal prior distribution for the unknowns multiplied by a likelihood for the outcome(s) conditional on the unknown variables. This joint distribution is proportional to the posterior distribution of the unknown variables conditional on observed data.

  2. Draw from the posterior distribution using Markov chain Monte Carlo (MCMC).

  3. Evaluate how well the model fits the data and possibly revise the model.

  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects the outcomes.

GLMs are commonly used in a frequentist setting but gain added value when developed within a Bayesian framework. The Bayesian approach uses a full probability model that accounts for both the uncertainty about our parameter(s) of interest, \(\theta\), and the uncertainty about the value of our dependent variable \(y\) conditional on those unknown parameter(s) \(\theta\).

With regression models, we have our dependent outcome variable \(y\) and our independent predictor variable(s) \(x\). With the Bayesian approach we want to update our beliefs about the parameters \(\theta\), which we define as the coefficients in our model, using the data \(x\) and \(y\). We describe the relationship between what we know before and after observing the data using Bayes' theorem,

\[p(\theta|y) = \frac{p(y|\theta)p(\theta)}{p(y)}\]

which tells us that the posterior distribution \(p(\theta|y)\) equals the right-hand side of the equation shown above. The denominator \(p(y)\) is the marginal likelihood: the likelihood averaged over all the model parameters, weighted by their prior probabilities. It does not depend on the parameter(s) \(\theta\), and therefore we can ignore it and use the proportionality relationship \(p(\theta|y) \propto p(y|\theta)p(\theta)\).

In regression models we include the predictor variables as

\[p(\theta|y, x) \propto p(y|\theta, x)p(\theta|x)\]

which illustrates the conditionality on \(x\).

The mathematical building blocks we need to understand and use when doing Bayesian inference are:

The Likelihood

The likelihood, \(p(y|\theta, x)\), is the joint probability of the data \(y\), viewed as a function of the parameter \(\theta\) given the observed values of \(x\). For each outcome variable \(y_i\), \(p(y_i|\theta, x_i)\) describes the generating process that produces that observation \(y_i\). If the observations are conditionally independent given the parameters, the likelihood is equal to the product of the individual likelihood contributions from each of our data points. It is noteworthy that the likelihood does not integrate to one over \(\theta\) and is therefore not a probability density function in \(\theta\).

A binomial GLM can be written for one observation \(y\) as a conditionally binomial probability mass function

\[{n \choose y} p^y(1-p)^{n-y}\]

where \(n\) is the known number of trials, \(p = g^{-1}(\nu)\) is the probability of success, and \(\nu = \alpha + x^T\beta\) is the linear predictor. Let \(N\) be the sample size; then the likelihood of the entire sample is the product of the \(N\) individual likelihood contributions.

For a binomial model the link function \(g\) maps between the unit interval (the support of \(p\)) and the real line \(\mathbb{R}\). When the inverse link function \(g^{-1}(\nu)\) is applied to a linear predictor \(\nu\) in \(\mathbb{R}\), we get a valid probability between \(0\) and \(1\).

With binomial GLMs, the two most commonly used link functions are the logit and probit functions. The logit function is the logarithm of the odds, \(\log\left(\frac{p}{1-p}\right)\), where \(p\) is the probability. This function maps probability values from \((0,1)\) to \((-\infty, +\infty)\). The probit function is the quantile function of the standard normal distribution, i.e., the inverse of the standard normal CDF \(\Phi(z)\), written \(\Phi^{-1}(z)\).
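
As a quick illustration of both links, using base R's built-in distribution functions:

# Logit link: maps probabilities in (0, 1) to the real line, and back.
p <- 0.8
(nu <- qlogis(p))   # logit(0.8) = log(0.8 / 0.2), approximately 1.386
plogis(nu)          # the inverse logit recovers 0.8

# Probit link: the standard normal quantile function; its inverse is the CDF.
qnorm(p)            # probit(0.8), approximately 0.842
pnorm(qnorm(p))     # recovers 0.8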

The Prior

The prior, \(p(\theta)\), is a probability distribution that describes our uncertainty about our parameter of interest, \(\theta\), before we observe the data \(y\). We generally have some level of prior information about the parameters \(\theta\) before observing the data \(y\). In our GLM for a binary outcome, the data model is binomial; our prior beliefs about the intercept and coefficients are expressed through distributions such as the normal.

To specify the prior we need to specify a prior \(f(\alpha)\) for the intercept and \(f(\beta)\) for the vector of regression coefficients. With stan_glm, we set these using the prior_intercept and prior arguments. The stan_glm function supports a variety of prior distributions.
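
For illustration, the priors could be set explicitly like this (a sketch; fit, mydata, and the predictors y, x1, x2 are hypothetical placeholders, and the scale values are illustrative, not the ones used later):

# prior           -> prior on the regression coefficients beta
# prior_intercept -> prior on the intercept alpha
fit <- stan_glm(y ~ x1 + x2,
                data = mydata,               # hypothetical data frame
                family = binomial(link = "logit"),
                prior = normal(0, 2.5),
                prior_intercept = normal(0, 5))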

The Posterior

The posterior is proportional to the product of the likelihood and the prior; it is the joint probability distribution of all parameters \(\theta\) and summarizes the updated and accumulated knowledge about the parameters after we observe \(y\).

With independent prior distributions, the joint posterior distribution for \(\alpha\) and \(\beta\) is proportional to the product of the priors and the \(N\) likelihood contributions:

\[f(\alpha, \beta|y, X) \propto f(\alpha) \times \prod_{k=1}^K f(\beta_k) \times \prod_{i=1}^N g^{-1}(\nu_i)^{y_i}(1-g^{-1}(\nu_i))^{n_i-y_i}\]

This is the mathematics of the posterior distribution that stan_glm draws from when running MCMC.

MCMC Sampling of the Posterior

Markov chain Monte Carlo (MCMC) constructs a Markov chain to perform Monte Carlo approximation of the posterior. In practice the prior usually isn't conjugate, so the posterior can't be derived in closed form, which is where this method is preferred. MCMC sampling begins by fixing a starting parameter value; the sampler then proposes a "jump" to another position, centered around the current parameter value with some proposal standard deviation. Markov chains are, by definition, the basis for these "jumps", or transitions from one state to another. The sampler then decides whether the jump was a good one by comparing the posterior density at the proposed value to the density at the current value, accepting or rejecting the proposal accordingly. Continuing this procedure yields samples from the posterior. The primary function this project uses from the rstanarm package is stan_glm; rather than performing maximum likelihood estimation of generalized linear models, stan_glm performs full Bayesian estimation via the MCMC sampling detailed above. The Bayesian model adds priors (independent by default) on the coefficients of the GLM.
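
To make the accept/reject logic concrete, below is a toy random-walk Metropolis sketch for a single parameter mu (illustrative only; Stan's actual sampler is the far more sophisticated Hamiltonian Monte Carlo, not this simple algorithm):

# Toy random-walk Metropolis sampler for the mean of normal data.
# This illustrates the "jump and compare" logic described above; it is
# not what Stan does internally.
set.seed(1)
y_sim <- rnorm(50, mean = 2, sd = 1)        # simulated data

log_post <- function(mu) {
  sum(dnorm(y_sim, mu, 1, log = TRUE)) +    # log-likelihood
    dnorm(mu, 0, 10, log = TRUE)            # vague normal prior on mu
}

n_iter <- 5000
mu <- numeric(n_iter)
mu[1] <- 0                                  # fixed starting value
for (i in 2:n_iter) {
  proposal <- rnorm(1, mu[i - 1], 0.5)      # "jump" centered at current mu
  # Accept with probability min(1, posterior ratio); otherwise stay put
  if (log(runif(1)) < log_post(proposal) - log_post(mu[i - 1])) {
    mu[i] <- proposal
  } else {
    mu[i] <- mu[i - 1]
  }
}
mean(mu[-(1:1000)])                         # posterior mean after burn-in, near 2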

The Mathematics of Generalized Linear Models and Why a Bayesian Framework Helps

We are going to build a generalized linear model using COVID-19 patient data to model the outcome for patients diagnosed with the virus. The dependent variable is binary, with outcomes either deceased or released from the hospital, so we will use a logistic generalized linear model via the R package rstanarm.

To understand how a Bayesian framework enhances this model, a brief discussion of the mathematics underlying a generalized linear model is in order:

Generalized linear models (GLMs) relate dependent variables, known as response variables, to independent variables, known as explanatory variables. Despite being called linear models, they do not assume a linear relationship between the dependent and independent variables, but rather a linear relationship between the transformed response, in terms of the link function, and the explanatory variables. Some examples of GLMs include linear regression, ANOVA, Poisson regression, and log-linear models. GLMs have three components:

  1. The random component: a response variable whose distribution is a member of the exponential family of distributions, e.g. Normal, Binomial, or Poisson.

  2. The systematic component: a linear predictor formed from the explanatory variables.

  3. The link function \(g\), which specifies the link between the random and systematic components, such that \(E(Y|X) = \mu = g^{-1}(\beta X)\).

Our response variable \(Y\) is the state of a patient, which can be released or deceased. This is a binary variable, meaning it can take on only two values with nothing in between. Since our data has a binary response variable, the model that will best fit our data is a binary logistic model. For a binary logistic model, the explanatory variables can be continuous or categorical. Our explanatory variables are sex (male, female), age (from 0 to 104), age range (kids, teens, 20's, …, 90 up), province (geographic origin of the patient), and the presence of comorbidities including hypertension, obesity, chronic metabolic syndrome, chronic lung disease, and cardiovascular disease. As you can see, all of our data excepting age is categorical or has been made categorical. Our desired response variable is state: whether the patient survived or died given the attributes we model against.

\(Y \sim Binomial(n, \pi)\), where \(\pi \in (0,1)\) is the probability of success, or in our case the probability of being released. We want to estimate \(\pi\) as a function of a linear combination of the independent variables. The set of explanatory variables \(X_i\), \(i = 1, 2, \dots, k\), enters linearly in the parameters,

\[\beta X = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k.\]

The link function for this model is called the logit,

\[\text{logit}(\pi) = \log \left( \frac{\pi}{1 - \pi} \right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k.\]

This function models the log of the odds as a linear function of the explanatory variables. It maps the possible values of the probability, \((0, 1)\), to the real numbers.

The mean function is

\[\mu_{Y|X} = \frac{1}{1 + e^{-\beta X}} = \frac{e^{\beta X}}{1 + e^{\beta X}}\]

where \(\beta X\) is the outcome of our linear combination.

How to Use rstanarm

Because MCMC sampling is resource intensive, it is very helpful to use more than one core, assuming your computer has a multicore processor. To enable the use of multiple cores, run the code block below.

# To run on multiple cores
options(mc.cores=parallel::detectCores())

Bring in the data set

We are using a Korea COVID-19 dataset, posted on Kaggle at https://www.kaggle.com/kimjihoo/coronavirusdataset#PatientInfo.csv. It contains 2243 patients and provides information on sex, age, geographic location, date of disease confirmation, and the outcome of the case, i.e., whether or not the patient survived. There are some additional data columns, but they have a very large amount of missing or irrelevant information, so they are disregarded for now.
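
The import chunk is not echoed here; a minimal sketch, assuming the modified file described in the next subsection ("KoreaPatientInfo_Synthetic.csv", also referenced in the ShinyStan section) is in the working directory, would be:

# Read the patient-level data (file name assumed; adjust the path as needed)
patients <- read.csv("KoreaPatientInfo_Synthetic.csv", stringsAsFactors = FALSE)
dim(patients)   # expect 2243 rows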

Create Synthetic Data

(Caveat Emptor)

I modified the data set in its .csv format to include comorbidity information for the patients. Using data from the US Centers for Disease Control and Prevention detailing patient comorbidities in .csv format, I created synthetic data by extrapolating the incidence levels by age group and by sex to modify the .csv files used in our modeling efforts below. This lets us add information on the proportion of patients presenting with other conditions that may have a bearing on the outcome of their specific situation. Unfortunately, we were unable to find a data set where the comorbidities match an actual patient and the patient outcome; therefore our model output is extremely questionable in terms of fit and predictive ability.

A similar process was used to fill in missing data in other columns. The proportion of missing data per column was below 5% in all cases.

Examine the dataset

Add additional transformations of the dataset

We can transform age to a numeric data type and bin the ages into named 10-year brackets.
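
A sketch of that transformation, assuming the raw age column from the file read in above:

# Transform age to numeric, then bin into named 10-year brackets
patients$age <- as.numeric(patients$age)
patients$age_range <- cut(patients$age,
                          breaks = c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, Inf),
                          labels = c("kids", "teens", "20s", "30s", "40s",
                                     "50s", "60s", "70s", "80s", "90 up"),
                          right = FALSE)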

We are only concerned with certain columns so let’s create a dataframe with just the columns we want to use in our model.

The columns of interest are: sex, age_range, province, state, hypertension, obesity, chronic metabolic syndrome, chronic lung disease, and cardiovascular disease. We can change these later if we decide to model a different set of independent variables or if we gather or create more information.

Verify results with head() and str()
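
A sketch of the subsetting and verification steps (column names assumed from the synthetic file, matching the model formula used later; numeric age is kept alongside age_range since the model below uses it):

# Keep only the modeling columns; df1 is the data frame used below
keep <- c("sex", "age", "age_range", "province", "state",
          "hypertension", "obesity", "chronic.metabolic.syndrome",
          "chronic.lung.disease", "cardiovascular.disease")
df1 <- patients[, keep]

head(df1)   # first six rows
str(df1)    # column types and sample values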

Visualization of the Dataset

Because our dataset consists of categorical data excepting patient age, our visualization options are limited. Using the DataExplorer package we can, however, look at a few relationships.

Below is a set of bar plots for all of the columns in our dataframe showing the counts for each response in each attribute.
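
With DataExplorer this is a one-liner, for example:

# Bar plots of the counts for every discrete column in df1
plot_bar(df1)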

Our primary concern is whether the hospitalized COVID-19 patient survived, so we can model the state of the patient against the age of the patient.

Bar plots with counts of COVID-19 cases per age group, deaths per age group, etc.

# Counts of cases per age group, colored by patient state.
# Note: par(mfrow) has no effect on ggplot2 output, so each plot is
# printed on its own.
p5 <- ggplot(data = df1, mapping = aes(x = age_range, fill = state)) +
  geom_bar()
plot(p5)

# Aggregate counts by sex/state; aggregate() names the count column x.
# We can adjust the way we want to aggregate this.
binning_data <- aggregate(df1$age_range,
                          by = df1[c('sex', 'state')],
                          length)
#print(binning_data)
plot(binning_data)

# Plot the binned counts by sex
p6 <- ggplot(data = binning_data,
             mapping = aes(x = sex, y = x)) +
  geom_point(color = "red")
plot(p6)

# Counts per age group by obesity status (left commented out for now)
# p7 <- ggplot(data = df1, mapping = aes(x = age_range, fill = obesity)) +
#   geom_bar()
# plot(p7)

# Visual on counts of deceased vs released
p8 <- ggplot(binning_data, aes(state, x)) +
  geom_bar(stat = "identity", width = 0.7, fill = "steelblue") +
  theme_minimal()
plot(p8)

# Alternate take on binning: counts per age range bucket and state
binning_data2 <- aggregate(df1$state,
                           by = df1[c('age_range', 'state')],
                           length)
#print(binning_data2)
plot(binning_data2)

# Plot new binned data: counts per age range bucket and state
p9 <- ggplot(binning_data2, aes(age_range, x)) +
  geom_bar(stat = "identity", width = 0.7, fill = "steelblue") +
  theme_minimal()
plot(p9)

Build the Bayesian Generalized Linear Model Using the rstanarm package

The model, fit1, regresses the state of the patient on sex, age, and the set of comorbidities. The model aims to explain the final state of the patient; i.e., was the patient released from the hospital or did they die from the effects of COVID-19, based on their particular set of predictive characteristics?

# Normal prior on the regression coefficients (rstanarm's default scale)
prior <- normal()
fit1 <- stan_glm(state ~ sex + 
                   age + 
                   hypertension + 
                   obesity + 
                   chronic.metabolic.syndrome + 
                   chronic.lung.disease + 
                   cardiovascular.disease, 
                 data = df1,
                 family = binomial(link = "logit"),
                 prior = prior)

# Extract the posterior draws as a matrix (one column per parameter)
posterior <- as.matrix(fit1)

plot_title <- ggtitle("Posterior Distributions", "with medians and 90% intervals")

# mcmc_areas() is provided by the bayesplot package
mcmc_areas(posterior, prob = 0.9) + plot_title

# We can build competing models in this code block, e.g. a probit link
# with a Student-t prior on the coefficients:
# fit2 <- stan_glm(state ~ sex + age_range, data = df1,
#                  family = binomial(link = "probit"), prior = student_t(4))
# etc.

Summary of the model output

At the top of the output, we have some general information on the model: family, formula, algorithm, number of chains, and posterior sample size (the number of post-warmup draws). Next, estimates are displayed for each parameter in terms of mean, standard deviation, and quantile intervals. This is followed by a summary of mean_PPD, the sample average of the posterior predictive distribution of the outcome variable.
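
The output described here can be reproduced with the summary method for the fitted model object:

# Full summary: estimates, uncertainty intervals, and MCMC diagnostics
summary(fit1, digits = 2)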

Output table generated in ShinyStan

NOTE: To view this table as output you need to have the ShinyStanTable.csv file in your file path. This is a .csv file that was generated from the web browser output from ShinyStan and saved to the file directory independent of ShinyStan.

shinystantable <- read.csv("ShinyStanTable.csv")
kable(shinystantable, caption = "Shiny Stan Output Table")
Shiny Stan Output Table
Parameter                       n_eff  Rhat    mean  MCSE   sd    2.5%     25%     50%     75%   97.5%
(Intercept)                     3,682     1     8.7     0  0.9     7.0     8.1     8.6     9.2    10.4
sexmale                         5,158     1    -0.9     0  0.4    -1.6    -1.1    -0.9    -0.6    -0.2
age                             3,543     1    -0.1     0  0.0    -0.1    -0.1    -0.1    -0.1     0.0
hypertensionyes                 4,704     1     0.8     0  0.4     0.1     0.6     0.8     1.1     1.7
obesityyes                      5,198     1     0.2     0  0.4    -0.5     0.0     0.2     0.4     1.0
chronic.metabolic.syndromeyes   5,226     1    -0.1     0  0.4    -0.8    -0.3     0.0     0.2     0.6
chronic.lung.diseaseyes         4,947     1    -0.6     0  0.4    -1.3    -0.9    -0.6    -0.4     0.1
cardiovascular.diseaseyes       4,867     1    -0.1     0  0.4    -0.9    -0.4    -0.1     0.2     0.7
mean_PPD                        4,416     1     1.0     0  0.0     1.0     1.0     1.0     1.0     1.0
log-posterior                   1,687     1  -161.3     0  2.0  -166.0  -162.4  -161.0  -159.8  -158.3

stan_glm returns the posterior distribution of the parameters, which describes the uncertainty about the unknown parameter values.


To generate medians and Bayesian uncertainty intervals we can use coef and the posterior_interval function. The uncertainty intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. To compute \(90\%\) intervals, a call like the following can be used (a sketch based on the rstanarm interface):
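
# Posterior medians of the coefficients
round(coef(fit1), 2)

# 90% credible intervals computed from the posterior draws
round(posterior_interval(fit1, prob = 0.9), 2)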

We can view a plot of the posterior credible intervals from the table output above.
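
For example, using rstanarm's default plot method for fitted models (a sketch; the prob argument sets the interval width):

# Interval plot of the coefficient posteriors
plot(fit1, prob = 0.9)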

Launch ShinyStan to View Detailed Model and Data Analytics in Web Browser:

Important Note: ShinyStan will open a separate browser window with important analytic information that cannot be captured in a knitted .pdf or html file. To view the ShinyStan output, make sure the data set “KoreaPatientInfo_Synthetic.csv” is in your file path and that the launch_shinystan(fit1, ppd = FALSE) command in the code block below is uncommented. Then run the entire program; when it reaches this code block, a browser window will open that you can navigate through via the various windows and drop-down commands. The window remains active only while the program is running (you can see a spinning “wheel” in the output of the launch_shinystan code block).

ShinyStan is a browser based diagnostic and explanatory tool that provides immediate, informative, customizable visual and numerical summaries of the model parameters and the convergence diagnostics for MCMC simulations. The launch_shinystan() code in the code block below will open a web browser tab with a broad set of display and analytics information to gain better understanding of the model.

# launch_shinystan(fit1, ppd = FALSE)

# NOTE: In order to knit file, this command must be commented out!
# Thus, in order to view shinystan html browser page, command must be uncommented. 

The ShinyStan web browser page front end will look like the image below, with several tabs to view the many diagnostic outputs.

# Adjust this path to wherever the screenshot is saved on your machine
knitr::include_graphics("C:/Users/pchri/Chris Schmidt/Towson Stuff/Bayesian Statistics Math 687/ProjectFiles/ShinyStanCoverPage1.png")

knitr::opts_chunk$set(echo = TRUE, fig.align = "left")

Summary

This has been an exploration of how Bayesian statistics enhances statistical modeling. In this course we have learned that the key difference between frequentist and Bayesian methods lies in the treatment of the parameters and the response variables. In the frequentist approach, the parameters \(\beta_0, \beta_1, \dots\) are fixed and the response \(y\) is random, while the Bayesian approach makes the opposite assumptions. Therefore, in Bayesian inference, because we assume the parameters are random, we assign a distribution to the parameters, called the prior, and any parameters used to define this prior are called hyperparameters.

With Bayesian GLMs, the big advantage is that we get credible intervals for our output, allowing us to claim, with a direct probability interpretation unavailable in a frequentist approach, that our parameters or predictions lie in a specific range.

In our work, we had hoped to find a dataset with patient-level data about the underlying conditions of patients hospitalized for symptomatic COVID-19. Unfortunately, that data is not publicly available yet, which we find a bit unusual, as it seems likely that patient-specific data is collected at the hospital level and aggregated for distribution to governmental public health authorities. Why it is not being made public for use by data analysts is a good question.

Our workaround for this circumstance was to create our own synthetic data using public information on incidence rates of specific comorbidities for COVID-19 patients, available at the CDC website. We took these age-specific incidence rates and randomly assigned comorbidities to our patients.

This is problematic for several obvious reasons. First, we know that in our data the survival rate was very high and the proportion of hospitalized patients with at least one comorbidity was also very high; therefore, a patient who died probably had at least one underlying comorbidity. By randomly assigning comorbidities, there is a strong chance we failed to assign a comorbidity to a deceased patient, which undermines our efforts at modeling accuracy.

Another concern is that all the comorbidity attributes in our data set are binary, where for some of them, such as obesity and hypertension, it would be better to have an actual measurement, such as a body mass index or blood pressure reading, respectively.

That said, we were able to build a model, generate output, perform some analysis, and produce visual representations that gave us a better understanding of the Bayesian approach to statistical inference, even if we are not comfortable with the ‘value’ of our model output.
