Andrew Ellis
2019-04-26
brms
Have you ever had this problem?
Summary: Two groups of people took an IQ test.
Group 1 (N1 = 47) consumed a “smart drug”, and Group 2 (N2 = 42) was a control group that consumed a placebo (Kruschke 2013).
The group means, standard deviations and standard errors are:
Group | mean | sd | se |
---|---|---|---|
Placebo | 100.36 | 2.52 | 0.39 |
SmartDrug | 101.91 | 6.02 | 0.88 |
It is obvious that the data contain several ‘outliers’.
We can perform a two-sample t-test, or a Welch test:
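For reference, both tests can be run in R like this (a minimal sketch: the name of the data frame, smartdrug, is an assumption; the variables IQ and Group match the output below):
# One-sided tests of the hypothesis that the smart drug increases IQ
# (Placebo is the first factor level, so alternative = "less" means Placebo < SmartDrug)
t.test(IQ ~ Group, data = smartdrug, var.equal = TRUE, alternative = "less")
t.test(IQ ~ Group, data = smartdrug, alternative = "less")  # Welch test (the default)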
##
## Two Sample t-test
##
## data: IQ by Group
## t = -1.5587, df = 87, p-value = 0.06135
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1037991
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
##
## Welch Two Sample t-test
##
## data: IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.05488
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.04532157
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
Problem: neither test yields a significant result. What do we do?
The following is a commonly encountered problem: you would like to quantify evidence for the null hypothesis.
Imagine you have gone to the trouble of running a replication experiment in which you measure Openness to Experience
scores for two groups of students - while filling out the personality questionnaire, both groups rotated a kitchen roll with their hands; one group clockwise, the other group counterclockwise (Wagenmakers et al. 2015).
library(tidyverse)

# Read the kitchen roll data; keep the participant ID, the rotation condition,
# and the mean Openness (NEO) score
kitchenrolls <- read_csv("data/KitchenRolls.csv") %>%
  select(ParticipantNumber, Rotation, NEO = mean_NEO) %>%
  mutate_at(vars(ParticipantNumber, Rotation), ~as_factor(.))
We can compute means, standard deviations and standard errors:
kitchenrolls %>%
  group_by(Rotation) %>%
  summarise(N = n(),
            mean = mean(NEO),
            sd = sd(NEO),
            se = sd(NEO)/sqrt(n())) %>%
  mutate_if(is.numeric, ~round(., 3))
Rotation | N | mean | sd | se |
---|---|---|---|---|
counter | 54 | 0.713 | 0.473 | 0.064 |
clock | 48 | 0.641 | 0.496 | 0.072 |
library(tidybayes)  # for theme_tidybayes()

kitchenrolls %>%
  ggplot(aes(x = Rotation, y = NEO, fill = Rotation)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  scale_fill_viridis_d() +
  theme_tidybayes()
The hypothesis was that turning a kitchen roll in the clockwise direction should increase Openness to Experience.
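The one-sided Welch test reported below can be obtained like this (with the factor levels ordered counter, clock, alternative = "less" corresponds to higher scores in the clockwise group):
# Welch test (var.equal = FALSE is the default); one-sided because the
# replication hypothesis predicts higher NEO scores after clockwise rotation
t.test(NEO ~ Rotation, data = kitchenrolls, alternative = "less")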
##
## Welch Two Sample t-test
##
## data: NEO by Rotation
## t = 0.75149, df = 97.315, p-value = 0.7729
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.2321921
## sample estimates:
## mean in group counter mean in group clock
## 0.712963 0.640625
The original result is not replicated. If anything, the effect goes in the other direction. We would now like to quantify the evidence in favour of the null hypothesis that rotating a kitchen roll has no effect.
How can we do this?
Statement by the American Statistical Association (ASA) about p-values (Wasserstein and Lazar 2016):
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true (we would actually like to know this), or the probability that the data were produced by chance.
Greenland et al. (2016) provide a good discussion of common misinterpretations of p values and confidence intervals.
Cumming (2014): We need to shift from reliance on NHST to estimation and other techniques.
Kruschke and Liddell (2018): Bayesian methods are better suited for this, for both hypothesis testing and parameter estimation.
According to Gigerenzer (2004) and Gigerenzer (2018), we need to stop relying on NHST (mindless statistics), but instead learn to use a whole statistical toolkit.
Many reviewers now demand Bayes factors (because a BF can provide evidence for/against hypotheses).
However: Bayesian data analysis is not limited to calculating Bayes factors.
🤗
more intuitive (uncertainty) and based on probability theory
provide evidence for/against hypotheses
more flexible: robust models and cognitive process models (Lee and Wagenmakers 2014)
can include prior knowledge
better for multilevel models (Gelman and Hill 2006)
😧
require computing power
setting priors requires familiarity with probability distributions
ongoing discussion about parameter estimation vs. hypothesis testing. See e.g. here and here.
Why should we care about Bayesian statistics?
Support for various hypotheses, including the null hypothesis; therefore, we can use non-significant results.
Dienes (2014); Wagenmakers et al. (2018); Wagenmakers, Morey, and Lee (2016) provide very useful discussions of the advantages offered by going Bayesian.
It is important to distinguish between parameter estimation and hypothesis testing (Wagenmakers et al. 2018).
In Bayesian parameter estimation, we focus on a single model and obtain the posterior distribution of its parameters via Bayes rule:
$$p(\theta \mid y) = \frac{p(\theta) \cdot p(y \mid \theta)}{p(y)}$$
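As a concrete illustration, the posterior distribution for the group difference in the kitchen roll data can be estimated with a single brms model. This is only a minimal sketch: the priors are left at the brms defaults, which one would normally want to specify explicitly.
library(brms)
# One model: estimate the difference in Openness between the rotation groups
fit <- brm(NEO ~ Rotation, data = kitchenrolls, family = gaussian())
summary(fit)  # posterior means, SDs and credible intervals for all parameters
plot(fit)     # marginal posterior densities and trace plots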
Bayesians cannot test precise hypotheses using confidence intervals. In classical statistics one frequently sees testing done by forming a confidence region for the parameter, and then rejecting a null value of the parameter if it does not lie in the confidence region. This is simply wrong if done in a Bayesian formulation (and if the null value of the parameter is believable as a hypothesis).
Bayesian hypothesis testing is model comparison, in which we compare the ability of two or more competing models to predict data.
$$p(M_1 \mid y) \propto P(y \mid M_1)\, p(M_1), \qquad p(M_2 \mid y) \propto P(y \mid M_2)\, p(M_2)$$
When the goal is hypothesis testing, Bayesians need to go beyond the posterior distribution. To answer the question “To what extent do the data support the presence of a correlation?” one needs to compare two models.
Let’s have another look at Bayes rule (including the dependency of the parameters θ on the model M):
$$p(\theta \mid y, M) = \frac{p(y \mid \theta, M)\, p(\theta \mid M)}{p(y \mid M)}$$
where M refers to a specific model. The marginal likelihood p(y|M) now gives the probability of the data, averaged over all possible parameter values under model M.
The marginal likelihood p(y|M) is usually neglected when looking at a single model, but becomes important when comparing models.
Writing out the marginal likelihood p(y|M):
$$p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta$$
we see that this is averaged over all possible values of θ that the model will allow.
The priors on θ are important.
A complex model makes many different predictions; the problem with making many predictions is that most of them will turn out to be false.
The complexity of a model depends, among other things, on the number of parameters and on the width of the parameter priors. When the priors are broad (uninformative), those parts of the parameter space where the likelihood is high are assigned low prior probability. Intuitively, if one hedges one’s bets over many possible parameter values, one has to assign low probability to the parameter values that make good predictions.
All of this means that more complex models have a comparatively lower marginal likelihood.
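In practice, the marginal likelihood can rarely be computed analytically. With brms it can be approximated by bridge sampling; the following is a minimal sketch in which the model formulas and the prior are illustrative choices (recent brms versions use save_pars = save_pars(all = TRUE); older versions used save_all_pars = TRUE):
library(brms)
# Group-difference model and intercept-only null model. A proper prior on the
# effect is needed, because the prior enters the marginal likelihood directly.
fit1 <- brm(NEO ~ Rotation, data = kitchenrolls,
            prior = prior(normal(0, 1), class = b),
            save_pars = save_pars(all = TRUE))
fit0 <- brm(NEO ~ 1, data = kitchenrolls,
            save_pars = save_pars(all = TRUE))
bridge_sampler(fit1)  # estimate of log p(y | M1)
bridge_sampler(fit0)  # estimate of log p(y | M0)
The ratio of these two marginal likelihoods is exactly the Bayes factor introduced below; brms also provides bayes_factor(fit1, fit0) as a convenience wrapper.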
We can also write Bayes rule applied to a comparison between models (marginalized over all parameters within the model):
$$p(M_1 \mid y) = \frac{P(y \mid M_1)\, p(M_1)}{p(y)}$$
and
$$p(M_2 \mid y) = \frac{P(y \mid M_2)\, p(M_2)}{p(y)}$$
This tells us that for model $M_m$, the posterior probability of the model is proportional to the marginal likelihood times the prior probability of the model.
Now, one is usually less interested in absolute evidence than in relative evidence; we want to compare the predictive performance of one model over another.
To do this, we simply form the ratio of the model probabilities:
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{P(y \mid M_1)\, p(M_1) \,/\, p(y)}{P(y \mid M_2)\, p(M_2) \,/\, p(y)}$$
The term p(y) cancels out, giving us:
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{P(y \mid M_1)}{P(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)}$$
The first term on the right-hand side is the ratio of marginal likelihoods:
$$\frac{P(y \mid M_1)}{P(y \mid M_2)}$$
This is the Bayes factor, and it can be interpreted as the change from prior odds to posterior odds that is indicated by the data.
If we consider the prior odds to be 1, i.e. we do not favour one model over another a priori, then we are only interested in the Bayes factor. We write this as:
$$\mathrm{BF}_{12} = \frac{P(y \mid M_1)}{P(y \mid M_2)}$$
Here, BF12 indicates the extent to which the data support model M1 over model M2.
As an example, if we obtain BF12 = 5, this means that the data are 5 times more likely to have occurred under model 1 than under model 2. Conversely, if BF12 = 0.2, then the data are 5 times more likely to have occurred under model 2.
We usually perform model comparisons between a null hypothesis H0 and an alternative hypothesis H1. The terms “model” and “hypothesis” are used synonymously.
In JASP, we will see Bayes factors reported as either
$$\mathrm{BF}_{10} = \frac{P(y \mid H_1)}{P(y \mid H_0)}$$
which indicates a BF for an undirected alternative H1 versus the null, or
$$\mathrm{BF}_{+0} = \frac{P(y \mid H_+)}{P(y \mid H_0)}$$
which indicates a BF for a directed alternative H+ versus H0.
If we want a BF for the null H0, we can simply take the inverse of BF10:
$$\mathrm{BF}_{01} = \frac{1}{\mathrm{BF}_{10}}$$
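For a two-sample comparison such as the kitchen roll data, these Bayes factors can be computed directly, for example with the BayesFactor package (a sketch assuming its default Cauchy prior on the standardized effect size; JASP’s Bayesian t-test is built on the same package):
library(BayesFactor)
# BF10: undirected alternative versus the null (default Cauchy prior)
bf10 <- ttestBF(formula = NEO ~ Rotation, data = as.data.frame(kitchenrolls))
bf10
# BF01: evidence for the null is simply the inverse
1 / bf10
# BF+0: directional alternative (counter - clock < 0, i.e. higher scores after clockwise rotation)
ttestBF(formula = NEO ~ Rotation, data = as.data.frame(kitchenrolls),
        nullInterval = c(-Inf, 0))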
A classification scheme for the size of the Bayes factor is sometimes used to label the strength of evidence, although it is rather unnecessary.
According to Gelman et al. (2014), Bayesian data analysis is performed in three steps:
Set up a probability model (a joint probability distribution for the observed quantities (y, x) and the latent quantities θ).
Condition on the observed data: calculate the posterior distribution $p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta)$.
Evaluate the model and the implications of the posterior distribution.
This fits very well with the iterative process described by Blei (2014).
In fact, we can describe a Bayesian workflow as an iterative cycle of building a model, computing the posterior, and criticizing the model. This highlights the distinction between posterior evaluation (estimation) of a model and model comparison (hypothesis testing).
Open notebook: 01-intro-bayesian-statistics.Rmd
Open notebook: 02-jasp-case-studies.Rmd
Open notebook: 03-brms-case-studies.Rmd
Blei, David M. 2014. “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application 1 (1): 203–32. https://doi.org/10.1146/annurev-statistics-022513-115657.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
Dienes, Zoltan. 2014. “Using Bayes to Get the Most Out of Non-Significant Results.” Frontiers in Psychology 5. https://doi.org/10.3389/fpsyg.2014.00781.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
———. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50. https://doi.org/10.1007/s10654-016-0149-3.
Kruschke, John K. 2013. “Bayesian Estimation Supersedes the T Test.” Journal of Experimental Psychology: General 142 (2): 573–603. https://doi.org/10.1037/a0029146.
Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. 1st ed. Cambridge ; New York: Cambridge University Press. https://doi.org/10.1017/CBO9781139087759.
Wagenmakers, Eric-Jan, Titia F. Beek, Mark Rotteveel, Alex Gierholz, Dora Matzke, Helen Steingroever, Alexander Ly, et al. 2015. “Turning the Hands of Time Again: A Purely Confirmatory Replication Study and a Bayesian Analysis.” Frontiers in Psychology 6 (April). https://doi.org/10.3389/fpsyg.2015.00494.
Wagenmakers, Eric-Jan, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Jonathon Love, Ravi Selker, et al. 2018. “Bayesian Inference for Psychology. Part I: Theoretical Advantages and Practical Ramifications.” Psychonomic Bulletin & Review 25 (1): 35–57. https://doi.org/10.3758/s13423-017-1343-3.
Wagenmakers, Eric-Jan, Richard D. Morey, and Michael D. Lee. 2016. “Bayesian Benefits for the Pragmatic Researcher.” Current Directions in Psychological Science 25 (3): 169–76. https://doi.org/10.1177/0963721416643289.