As researchers, we collect data to test hypotheses. This requires statistics. Two main statistical frameworks are available to us: frequentist and Bayesian. What is the difference between the two, and does it matter? Let us find out through a simple example.
Let us suppose that we rob 20 students (10 women, 10 men) to find out how much money they have in their pockets. We’re interested in testing the hypothesis that the amount of money carried by a student differs between genders.
Because it is unethical to rob students, instead we simulate data. This allows us to set the ‘true’ average amount of money that women and men carry in their pockets.
In statistical terms, we refer here to women and men as distinct ‘populations’. We set the ‘population’ mean at \(6.00\)$ for women, and \(5.40\)$ for men, with a standard deviation of \(1\)$ each. This suggests that students are poor overall regardless of gender, but we already knew that. Again, the hypothesis of interest here is whether there is a difference between women and men.
The R code to simulate our student robbing experiment data is shown below.
#### data simulation
# number of students per gender
n <- 10
# to ensure consistent random draws
set.seed(1234)
# simulate random data from normal distribution
men <- rnorm(n, mean = 5.4, sd = 1)
women <- rnorm(n, mean = 6, sd = 1)
# generate factor for gender
gender <- gl(2, n, labels = c('men', 'women'))
# create data frame to store data
rob <- data.frame(gender, money = c(men, women))
With these simulated data we expect women to carry about \(0.60 = 6.00 - 5.40\)$ more in their pockets than men. But let’s pretend we don’t know what the population means are. We want to know whether the difference in sample means is ‘statistically significant’.
A boxplot suggests that women might be slightly richer than men. However, we need statistical tests to ascertain this.
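The boxplot itself is not reproduced here, but it can be drawn with a one-liner (a minimal sketch; the axis label is our own choice):
# boxplot of the simulated data, by gender
boxplot(money ~ gender, data = rob, ylab = 'money in pockets ($)')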
The frequentist approach is by far the most widely used one. For our data, we could use a \(t\)-test, or a one-way ANOVA. Here we will use one-way ANOVA.
A one-way ANOVA can be coded in R the following way:
mod1 <- lm(money ~ gender, data = rob)
summary(mod1)
##
## Call:
## lm(formula = money ~ gender, data = rob)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9625 -0.6733 -0.1864 0.6985 2.5340
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.0168 0.3264 15.370 8.57e-12 ***
## genderwomen 0.8650 0.4616 1.874 0.0773 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.032 on 18 degrees of freedom
## Multiple R-squared: 0.1632, Adjusted R-squared: 0.1167
## F-statistic: 3.511 on 1 and 18 DF, p-value: 0.07728
The summary output tells us many things, but let’s focus on the most important information.
The (Intercept) term is actually the sample mean for men, the first level of our gender factor (by default, factor levels are stored in alphabetical order).
The genderwomen term represents the difference between women and men, which can be checked via the code below. This is the term of interest to us.
(diff.women.men <- round(mean(women) - mean(men), 4) )
## [1] 0.865
(diff.women.men == round(coef(mod1)[2], 4) )
## genderwomen
## TRUE
The \(P\)-value associated with the term genderwomen is \(0.0773\). Tradition dictates that terms with \(P>0.05\) are deemed not ‘statistically significant’. Hence according to this test, our data do not allow us to conclude that women carry more money than men in their pockets. This is disappointing.
In our case, our inability to detect a difference is due to the low sample size: we know that the data for men and women were drawn from distributions whose expected values differ (though not by much).
But what exactly is a \(P\)-value? Remember that our hypothesis (\(H\)) of interest was that the amount of money a student carries in her/his pockets differs between genders.
As \(P\) stands for ‘probability’, intuitively we might think that the \(P\)-value directly informs us about the likelihood that our hypothesis is true. For example, it would be tempting to conclude that we are \((1 - 0.0773) \times 100 = 92.3\%\) certain that our hypothesis is true. Unfortunately, the \(P\)-value does not allow us to make such probabilistic statements about our hypothesis.
The logic underlying the frequentist approach of hypothesis testing is a bit twisted. Rather than directly testing our hypothesis \(H\), it focuses on testing the ‘null hypothesis’ (\(H_0\)) that there are no differences between means (i.e. no difference between women and men). If we find sufficient evidence against \(H_0\) (e.g. if the \(P\)-value is \(\leq 0.05\)), then we can safely reject \(H_0\). By default, rejecting \(H_0\) leads us to accept \(H\).
In other words, the \(P\)-value is the probability of observing data as extreme as ours, or more extreme, if \(H_0\) is true. The \(P\)-value must be interpreted like this: if we were to repeat this experiment a very large number of times (hence the name ‘frequentist’) and if there were no differences between women and men, then in \(7.73\%\) of cases we would observe differences between women and men as large as, or larger than, those we actually observed. This is a convoluted statement, and it does not even inform us about \(H\).
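To make this repeated-sampling logic concrete, here is a small Monte Carlo sketch (not part of the analysis itself; the seed, number of simulations and common mean are arbitrary choices): we simulate many experiments in which \(H_0\) is true, and count how often the absolute difference in sample means is at least as large as the one we observed. Because the sketch uses the true \(\sigma = 1\) from our simulation, whereas the \(t\)-test must estimate \(\sigma\) from the data, the result will not match the \(P\)-value of \(0.0773\) exactly.
# Monte Carlo sketch of the repeated-sampling logic behind the P-value
set.seed(42)
n.sim <- 10000
null.diffs <- replicate(n.sim, {
  # H0 is true: both genders share one mean (its value is arbitrary here)
  m <- rnorm(n, mean = 5.7, sd = 1)
  w <- rnorm(n, mean = 5.7, sd = 1)
  mean(w) - mean(m)
})
# proportion of simulated differences at least as extreme as the observed one
mean(abs(null.diffs) >= abs(diff.women.men))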
One problem with the use of \(P\)-values in frequentist approaches is that it often leads to dichotomous statistical decisions that can seem a bit arbitrary: we either reject \(H_0\) if \(P \leq 0.05\), or cannot reject it and move on. There is a growing trend toward reporting effect sizes (i.e. the strengths of effects) as well as \(P\)-values, which helps interpretation. However, this does not solve the problem that the \(P\)-value tells us basically nothing about our hypothesis of interest, \(H\).
To summarise in more mathematical terms, we were interested in \(P(H | data)\); that is, the probability that our hypothesis is true, given the data we collected. However, the frequentist approach instead gave us \(P(data | H_0)\); that is, the probability of observing data as extreme as ours or more extreme, given that \(H_0\) is true. This is very different.
Next we present Bayesian analysis, which allows us to estimate \(P(H | data)\).
Given that Bayesian analysis allows us to directly estimate \(P(H | data)\), why is it not more widespread? One reason is that Bayesian analysis can be computationally intensive, although with modern computers this is becoming less of an issue. Another reason is that Bayesian analysis is more demanding: it requires us to explicitly specify the statistical model in order to perform the analysis. We come back to that later.
To run Bayesian analyses, we will use JAGS (which stands for ‘Just Another Gibbs Sampler’). JAGS is cross-platform and can be downloaded for free from the JAGS website.
Once downloaded, simply install it on your computer.
To interact with JAGS via R, we then need to install the R2jags package.
install.packages('R2jags')
And then load it.
library(R2jags)
## Loading required package: rjags
## Loading required package: coda
## Linked to JAGS 4.2.0
## Loaded modules: basemod,bugs
##
## Attaching package: 'R2jags'
## The following object is masked from 'package:coda':
##
## traceplot
To code a Bayesian analysis, we must first describe our statistical model. Note that this is the same model that ANOVA is based on, although the lm code shielded us a bit from the mathematical details.
For a Bayesian analysis it is useful to write down the formulas to make sure we translate them well into code.
\[y_i \sim N(\mu_i, \sigma^2)\] \[\mu_i = \beta_1 + \beta_2 \times Gender_i\] This last line can be written in matrix notation, using matrix multiplication: \[\mu_i = \mathbf{X}_i \, \boldsymbol{\beta}\]
where \(y_i\) is the \(i^{th}\) observation of the response variable \(\mathbf{y}\), which is normally distributed (the \(N\) stands for the normal or Gaussian distribution) with mean \(\mu_i\) and constant variance \(\sigma^2\) (\(\sigma\) is the standard deviation), and \(\mathbf{X}_i\) is the \(i^{th}\) row of the design matrix \(\mathbf{X}\) defined below.
Sometimes the following (equivalent, in this case) notation is used
\[y_i = \mu_i + \epsilon_i\] \[\epsilon_i \sim N(0, \sigma^2)\]
which means that \(y_i\) is equal to \(\mu_i\) (the expected value) plus some random noise around that expected value. This random noise is assumed to be normally distributed around \(0\) with constant variance \(\sigma^2\). This explains why in ANOVA and regression we need to inspect the residuals for model validation: the model assumes that residuals are centered around \(0\) and have constant variance (homoscedasticity).
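Such residual checks for mod1 could look like this in R (a quick sketch; the specific plots are our own choice):
# residuals vs fitted values: points should scatter evenly around 0
plot(fitted(mod1), resid(mod1), xlab = 'fitted values', ylab = 'residuals')
abline(h = 0, lty = 2)
# quantile-quantile plot: points should fall near the line if residuals are normal
qqnorm(resid(mod1))
qqline(resid(mod1))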
As ANOVA is a linear model (the same as regression, but using categorical predictors instead of continuous ones), \(\mu_i\) is modelled using two regression parameters: \(\beta_1\) (the intercept) and \(\beta_2\) (the slope). \(Gender_i\) is our predictor variable (telling us the gender of student \(i\)), which in the case of a categorical variable is represented by dummy variables.
The dummy variables of predictors are stored in matrix \(\mathbf{X}\) and the regression parameters are stored in the vector \(\boldsymbol{\beta}\).
A categorical factor with two levels (here, women and men) can be converted into two dummy variables of \(0\)’s and \(1\)’s; however, because these two variables are perfectly correlated with each other and provide redundant information (if we know that student \(i\) is not a woman, then we know he’s a man), we keep only one and drop the other. Therefore, \(\mathbf{X}\) will contain a column of \(1\)’s (used to estimate the intercept) and a single dummy variable representing one of the genders.
In R, matrix \(\mathbf{X}\) can be obtained using the model.matrix function:
(X <- model.matrix(~ gender, data = rob) )
## (Intercept) genderwomen
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## 7 1 0
## 8 1 0
## 9 1 0
## 10 1 0
## 11 1 1
## 12 1 1
## 13 1 1
## 14 1 1
## 15 1 1
## 16 1 1
## 17 1 1
## 18 1 1
## 19 1 1
## 20 1 1
## attr(,"assign")
## [1] 0 1
## attr(,"contrasts")
## attr(,"contrasts")$gender
## [1] "contr.treatment"
As can be seen, X contains a column of \(1\)’s and a single dummy variable of \(0\)’s and \(1\)’s called genderwomen, telling us that row \(i\) with a value of \(1\) is a woman. The fact that gendermen was dropped and not genderwomen is arbitrary and simply due to the order of the levels in R. Indeed, levels are stored in alphabetical order by default.
Then we simply define K (the number of columns in X, i.e. the number of regression parameters to estimate) and N (the total number of observations, or students in this case). These two numbers will be needed to specify the model in JAGS.
K <- ncol(X) # number of regression parameters
N <- nrow(rob) # number of observations
In this case we will have two regression parameters (i.e. two columns in \(\mathbf{X}\)), corresponding to \(\beta_1\) (the intercept) and \(\beta_2\) (the slope). Note that these are exactly the same two terms that were reported in the lm output of ANOVA above. Because of the way \(\mathbf{X}\) is constructed, \(\beta_2\) estimates the difference between men and women. Therefore it is the parameter we’re interested in.
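As a quick sanity check (a sketch, not part of the original analysis), we can verify that the matrix product of \(\mathbf{X}\) and the estimated \(\boldsymbol{\beta}\) from mod1 reproduces the fitted values of the linear model:
# X %*% beta gives the expected value mu for each student
beta.hat <- coef(mod1)
all.equal(as.numeric(X %*% beta.hat), as.numeric(fitted(mod1))) # should be TRUE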
Next, we store these objects into a list that will be exported later to JAGS. We also add Y, which is our response variable (i.e. amount of money in pockets).
jags.data <- list(Y = rob$money,
X = X,
N = N,
K = K)
The next chunk of code is more complicated, as it specifies the Bayesian model to be run in JAGS. As in R, JAGS model files can contain comments preceded by #; these lines are ignored by JAGS.
The sink() and cat() functions simply export the text into a .txt file that JAGS can use. We could write that .txt file directly in a text editor, but writing it in R keeps all the code in the same place, which is tidier.
sink("bayes.txt")
cat("
model {
#1A. Diffuse priors for parameters
for (i in 1:K) {beta[i] ~ dnorm(0, 0.0001)}
# first parameter is mean, second (tau) is precision (tau = 1/variance)
# corresponds to normal distribution with mean 0 and huge variance (10000)
#1B. Diffuse prior for sigma
sigma ~ dunif(0.001, 100)
# weak prior, any value between 0.001 and 100 is equally likely
tau <- 1 / (sigma * sigma)
# tau = precision and variance = sigma^2 (sigma = standard deviation)
#2. Likelihood function
for (i in 1:N) {
Y[i] ~ dnorm(mu[i], tau)
mu[i] <- inprod(beta[], X[i,])
# matrix product of the betas and predictors X
}
}
",fill = TRUE)
sink()
The model description file requires some explanation. Let’s first go to the section #2. Likelihood function. This section is the direct translation of our statistical model into JAGS code.
The likelihood function is enclosed in a for loop over all \(i\) observations. We use dnorm because \(y_i\) is normally distributed. The first argument is the mean \(\mu_i\), and the second argument is the precision parameter \(\tau\), which is the inverse of the variance \(\sigma^2\). This is confusing, but it is how JAGS parameterises the normal distribution, so we just have to deal with it. The inprod function computes the inner product of the regression parameters (the \(\beta\)’s) and row \(i\) of the predictor matrix \(\mathbf{X}\).
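For readers more comfortable with R than with JAGS, inprod(beta[], X[i, ]) computes the same quantity as the following R expressions (an illustrative sketch only):
# what inprod(beta[], X[i, ]) computes, in R terms (for observation i = 1)
beta.hat <- coef(mod1)
i <- 1
sum(beta.hat * X[i, ])          # inner product, element by element
as.numeric(X[i, ] %*% beta.hat) # the same thing as a matrix product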
Now we need to address the first part of the code: the priors. Bayesian analysis requires us to specify prior distributions of the parameters to estimate to calculate the posterior distributions of those parameters, given the data. This means that if we do have prior information about the distribution of the parameters, we can include this information here. The posterior distributions depend on the prior distributions.
In this case, we will specify ‘weak’ or ‘diffuse’ priors because we assume we have little to no prior knowledge about the parameters. As such, we say that the regression parameters \(\beta_1\) and \(\beta_2\) are normally distributed with mean \(0\) and a huge variance \(\sigma^2 = 10^4\) (remember that \(\tau = 1/\sigma^2\)).
We also have to estimate \(\sigma\) (the standard deviation of the residuals), and as a prior distribution of \(\sigma\) we use a uniform distribution suggesting that any value between \(0.001\) and \(100\) is equally likely. This is basically saying that we have no idea what \(\sigma\) should be.
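To see just how flat these priors are, we can plot them (an illustrative sketch only; the plotting ranges are arbitrary choices):
# the normal prior for the betas is nearly flat over any plausible range
curve(dnorm(x, mean = 0, sd = sqrt(10000)), from = -300, to = 300,
      xlab = 'beta', ylab = 'density')
# the uniform prior for sigma gives equal density to every value in its range
curve(dunif(x, min = 0.001, max = 100), from = 0, to = 100,
      xlab = 'sigma', ylab = 'density')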
JAGS also requires initial values for the parameters. These are supplied via a function that returns a list of parameters with their starting values. In this case, the initial values of the regression coefficients (the \(\beta\)’s) are drawn from a normal distribution, whereas the initial value for \(\sigma\) is drawn from a uniform distribution.
inits <- function () {
list(beta = rnorm(K, 0, 0.01),
sigma = runif(1, 0.1, 5) )
}
Then we must define the parameters we want to save by listing their names in a vector.
params <- c("beta", "sigma")
Finally, we can now run the model! To do so, we call jags directly from R, with the arguments listed below. This exports everything to JAGS, and the results are stored in the object jags.mod1.
jags.mod1 <- jags(data = jags.data,
inits = inits,
parameters.to.save = params,
model.file = "bayes.txt",
n.thin = 10,
n.chains = 3,
n.burnin = 500,
n.iter = 10000)
## module glm loaded
Assuming that everything went smoothly, results can be examined using the print function:
print(jags.mod1, digits = 3)
## Inference for Bugs model at "bayes.txt", fit using jags,
## 3 chains, each with 10000 iterations (first 500 discarded), n.thin = 10
## n.sims = 2850 iterations saved
## mu.vect sd.vect 2.5% 25% 50% 75% 97.5% Rhat n.eff
## beta[1] 5.019 0.344 4.344 4.794 5.024 5.244 5.701 1.001 2800
## beta[2] 0.854 0.495 -0.125 0.528 0.863 1.173 1.813 1.001 2800
## sigma 1.115 0.210 0.801 0.966 1.084 1.227 1.629 1.001 2800
## deviance 59.300 2.844 56.163 57.289 58.565 60.416 66.676 1.001 2800
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
##
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 4.0 and DIC = 63.3
## DIC is an estimate of expected predictive error (lower deviance is better).
This shows us a lot of information, but for this exercise we will focus on only some of it. This table gives us the mean values of \(\beta_1\) (the intercept), \(\beta_2\) (the parameter we’re interested in, corresponding to the difference between women and men), and \(\sigma\) (the residual standard deviation).
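Before trusting these numbers, it is good practice to confirm that the chains converged. The Rhat values of \(1.001\) above already suggest convergence; trace plots give a visual check (a sketch using R2jags’s traceplot function):
# trace plots of the three chains: they should look like well-mixed
# 'hairy caterpillars' with no trend, indicating convergence
traceplot(jags.mod1)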
We now want to save the estimated values of \(\beta_2\). We extract them from the jags.mod1 object via the following code, and store the values into a vector called beta2.
sims <- jags.mod1$BUGSoutput$sims.matrix
beta2 <- sims[,2] # column 2 holds the posterior draws of beta[2]
The histogram below shows the posterior distribution of \(\beta_2\). A red line is shown at \(0\). As can be seen, the majority of the values of \(\beta_2\) are greater than 0.
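The histogram was produced with code along these lines (a sketch; the plotting parameters are our own choices):
# posterior distribution of beta2, with a red line at 0
hist(beta2, breaks = 30, xlab = 'beta2', main = '')
abline(v = 0, col = 'red', lwd = 2)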
We can now calculate the proportion of estimated values of \(\beta_2\) that are greater than \(0\) using the following code:
(mean(beta2 > 0) )
## [1] 0.9624561
Finally! This gives us \(P(H | data)\); that is, the probability that women carry more money in their pockets than men, given the data. Based on the observed data (and our prior knowledge of the parameters) we can say that we’re \(96.2\%\) sure that this population of women has more money in their pockets than the population of men. Such a statement is much more meaningful than what we could say using the \(P\)-value from the frequentist approach.
To be fair, our initial hypothesis did not specify whether men or women were expected to have more money, but simply that there was a difference between the two. As such, we should use a more conservative ‘two-tailed’ test. In practice, we can check whether the \(95\%\) credible interval of the posterior distribution of \(\beta_2\) includes zero. In this particular case (see the print(jags.mod1...) output above), it does. Therefore, at the \(5\%\) ‘significance level’, \(\beta_2\) does not differ significantly from \(0\). Still, the ability to make direct probabilistic statements about the parameters themselves (because we have their full posterior distributions) is a clear advantage over the frequentist approach.
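In R, the \(95\%\) credible interval can be extracted directly from the posterior draws; up to Monte Carlo error, this reproduces the 2.5% and 97.5% columns of the print output above (a sketch):
# 95% credible interval of beta2: does it include zero?
quantile(beta2, probs = c(0.025, 0.975))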
So, should we use frequentist or Bayesian analyses? Well, both have their merits. Frequentist approaches are typically much less computationally intensive and are available in a very wide range of software packages. The ability to rapidly fit a model, and to evaluate competing models, is really useful.
Bayesian analyses are more flexible and powerful, but more time-consuming (both from a coding and computational perspective). They also require a higher degree of statistical fluency, which makes them less readily accessible.
The fact that Bayesian analyses require us to specify prior distributions of parameters (which affect the posterior estimates) has led some people to criticise these methods as inherently ‘subjective’. However, using weak or diffuse priors makes this criticism less relevant. On the other hand, the ability to explicitly incorporate prior knowledge into models can be seen as an advantage.
From a more philosophical perspective, Bayesian analyses are arguably more satisfying because they allow us to estimate \(P(H | data)\). However, with weak or diffuse priors the parameter estimates obtained from a Bayesian model are more or less the same as those from frequentist methods, so from a pragmatic standpoint both approaches give similar answers. This was the case here, as can be seen by comparing the outputs of the two analyses.
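The comparison can be made explicit with a couple of lines (a sketch, using the sims matrix extracted earlier):
# side-by-side comparison of frequentist and Bayesian point estimates
post.means <- colMeans(sims[, c('beta[1]', 'beta[2]')])
cbind(frequentist = coef(mod1), bayesian = post.means)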
For simple problems where we want a quick and dirty answer, we don’t need a sledgehammer to crack a nut, and frequentist methods might be all we need. For certain complex hierarchical models that are difficult to fit using frequentist methods, Bayesian analyses will be better. Choice is good.