Andrew Ellis
2019-04-26
brms
Have you ever had this problem?
Summary: Two groups of people took an IQ test.
Group 1 (N1 = 47) consumed a “smart drug”, and Group 2 (N2 = 42) was a control group that consumed a placebo (Kruschke 2013).
The group means, standard deviations and standard errors are:
Group | mean | sd | se |
---|---|---|---|
Placebo | 100.36 | 2.52 | 0.39 |
SmartDrug | 101.91 | 6.02 | 0.88 |
It is obvious that the data contain several ‘outliers’.
We can perform a two-sample t-test, or a Welch test:
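For reference, both tests can be run in R like this (a minimal sketch: the name of the data frame, smartdrug, is an assumption; the variables IQ and Group match the output below):
# One-sided tests of the hypothesis that the smart drug increases IQ
# (Placebo is the first factor level, so alternative = "less" means Placebo < SmartDrug)
t.test(IQ ~ Group, data = smartdrug, var.equal = TRUE, alternative = "less")
t.test(IQ ~ Group, data = smartdrug, alternative = "less")  # Welch test (the default)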
##
## Two Sample t-test
##
## data: IQ by Group
## t = -1.5587, df = 87, p-value = 0.06135
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1037991
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
##
## Welch Two Sample t-test
##
## data: IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.05488
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.04532157
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
Problem: neither test yields a significant result. What do we do?
The following is a commonly encountered problem: you would like to quantify evidence for the null hypothesis.
Imagine you have gone to the trouble of running a replication experiment in which you measure Openness to Experience
scores for two groups of students - while filling out the personality questionnaire, both groups rotated a kitchen roll with their hands; one group clockwise, the other group counterclockwise (Wagenmakers et al. 2015).
library(tidyverse)

# Read the kitchen roll data; keep the participant ID, the rotation condition,
# and the mean Openness (NEO) score
kitchenrolls <- read_csv("data/KitchenRolls.csv") %>%
  select(ParticipantNumber, Rotation, NEO = mean_NEO) %>%
  mutate_at(vars(ParticipantNumber, Rotation), ~as_factor(.))
We can compute means, standard deviations and standard errors:
kitchenrolls %>%
  group_by(Rotation) %>%
  summarise(N = n(),
            mean = mean(NEO),
            sd = sd(NEO),
            se = sd(NEO)/sqrt(n())) %>%
  mutate_if(is.numeric, ~round(., 3))
Rotation | N | mean | sd | se |
---|---|---|---|---|
counter | 54 | 0.713 | 0.473 | 0.064 |
clock | 48 | 0.641 | 0.496 | 0.072 |
library(tidybayes)  # for theme_tidybayes()

kitchenrolls %>%
  ggplot(aes(x = Rotation, y = NEO, fill = Rotation)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  scale_fill_viridis_d() +
  theme_tidybayes()
The hypothesis was that turning a kitchen roll in the clockwise direction should increase Openness to Experience.
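The one-sided Welch test reported below can be obtained like this (with the factor levels ordered counter, clock, alternative = "less" corresponds to higher scores in the clockwise group):
# Welch test (var.equal = FALSE is the default); one-sided because the
# replication hypothesis predicts higher NEO scores after clockwise rotation
t.test(NEO ~ Rotation, data = kitchenrolls, alternative = "less")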
##
## Welch Two Sample t-test
##
## data: NEO by Rotation
## t = 0.75149, df = 97.315, p-value = 0.7729
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.2321921
## sample estimates:
## mean in group counter mean in group clock
## 0.712963 0.640625
The original result is not replicated. If anything, the effect goes in the other direction. We would now like to quantify the evidence in favour of the null hypothesis that rotating a kitchen roll has no effect.
How can we do this?
Statement by the American Statistical Association (ASA) about p-values (Wasserstein and Lazar 2016):
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true (we would actually like to know this), or the probability that the data were produced by chance.
Greenland et al. (2016) provide a good discussion of common misinterpretations of p values and confidence intervals.
Cumming (2014): We need to shift from reliance on NHST to estimation and other techniques.
Kruschke and Liddell (2018): Bayesian methods are better suited for this, for both hypothesis testing and parameter estimation.
According to Gigerenzer (2004) and Gigerenzer (2018), we need to stop relying on NHST (mindless statistics), but instead learn to use a whole statistical toolkit.
Many reviewers now demand Bayes factors (because a BF can provide evidence for/against hypotheses).
However: Bayesian data analysis is not limited to calculating Bayes factors.
🤗
more intuitive (uncertainty) and based on probability theory
provide evidence for/against hypotheses
more flexible: robust models and cognitive process models (Lee and Wagenmakers 2014)
can include prior knowledge
better for multilevel models (Gelman and Hill 2006)
😧
require computing power
setting priors requires familiarity with probability distributions
ongoing discussion about parameter estimation vs. hypothesis testing. See e.g. here and here.
Why should we care about Bayesian statistics?
Support for various hypotheses, including the null hypothesis; therefore, we can use non-significant results.
Dienes (2014); Wagenmakers et al. (2018); Wagenmakers, Morey, and Lee (2016) provide very useful discussions of the advantages offered by going Bayesian.
It is important to distinguish between parameter estimation and hypothesis testing (Wagenmakers et al. 2018).
In Bayesian parameter estimation, we focus on a single model and obtain the posterior distribution of its parameters via Bayes rule:
$$p(\theta \mid y) = \frac{p(\theta) \cdot p(y \mid \theta)}{p(y)}$$
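As a concrete illustration, the posterior distribution for the group difference in the kitchen roll data can be estimated with a single brms model. This is only a minimal sketch: the priors are left at the brms defaults, which one would normally want to specify explicitly.
library(brms)
# One model: estimate the difference in Openness between the rotation groups
fit <- brm(NEO ~ Rotation, data = kitchenrolls, family = gaussian())
summary(fit)  # posterior means, SDs and credible intervals for all parameters
plot(fit)     # marginal posterior densities and trace plots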
Bayesians cannot test precise hypotheses using confidence intervals. In classical statistics one frequently sees testing done by forming a confidence region for the parameter, and then rejecting a null value of the parameter if it does not lie in the confidence region. This is simply wrong if done in a Bayesian formulation (and if the null value of the parameter is believable as a hypothesis).
Bayesian hypothesis testing is model comparison, in which we compare the ability of two or more competing models to predict data.
$$p(M_1 \mid y) \propto P(y \mid M_1)\, p(M_1), \qquad p(M_2 \mid y) \propto P(y \mid M_2)\, p(M_2)$$
When the goal is hypothesis testing, Bayesians need to go beyond the posterior distribution. To answer the question “To what extent do the data support the presence of a correlation?” one needs to compare two models.
Let’s have another look at Bayes rule (including the dependency of the parameters θ on the model M):
$$p(\theta \mid y, M) = \frac{p(y \mid \theta, M)\, p(\theta \mid M)}{p(y \mid M)}$$
where M refers to a specific model. The marginal likelihood p(y|M) now gives the probability of the data, averaged over all possible parameter values under model M.
The marginal likelihood p(y|M) is usually neglected when looking at a single model, but becomes important when comparing models.
Writing out the marginal likelihood p(y|M):
$$p(y \mid M) = \int p(y \mid \theta, M)\, p(\theta \mid M)\, d\theta$$
we see that this is averaged over all possible values of θ that the model will allow.
The priors on θ are important.
A complex model makes many different predictions; the problem with making many predictions is that most of them will turn out to be false.
The complexity of a model depends, among other things, on the number of parameters and on the width of the parameter priors. When the priors are broad (uninformative), those parts of the parameter space where the likelihood is high are assigned low prior probability. Intuitively, if one hedges one’s bets over many possible parameter values, one has to assign low probability to the parameter values that make good predictions.
All of this means that more complex models have a comparatively lower marginal likelihood.
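In practice, the marginal likelihood can rarely be computed analytically. With brms it can be approximated by bridge sampling; the following is a minimal sketch in which the model formulas and the prior are illustrative choices (recent brms versions use save_pars = save_pars(all = TRUE); older versions used save_all_pars = TRUE):
library(brms)
# Group-difference model and intercept-only null model. A proper prior on the
# effect is needed, because the prior enters the marginal likelihood directly.
fit1 <- brm(NEO ~ Rotation, data = kitchenrolls,
            prior = prior(normal(0, 1), class = b),
            save_pars = save_pars(all = TRUE))
fit0 <- brm(NEO ~ 1, data = kitchenrolls,
            save_pars = save_pars(all = TRUE))
bridge_sampler(fit1)  # estimate of log p(y | M1)
bridge_sampler(fit0)  # estimate of log p(y | M0)
The ratio of these two marginal likelihoods is exactly the Bayes factor introduced below; brms also provides bayes_factor(fit1, fit0) as a convenience wrapper.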
We can also write Bayes rule applied to a comparison between models (marginalized over all parameters within the model):
$$p(M_1 \mid y) = \frac{P(y \mid M_1)\, p(M_1)}{p(y)}$$
and
$$p(M_2 \mid y) = \frac{P(y \mid M_2)\, p(M_2)}{p(y)}$$
This tells us that for model $M_m$, the posterior probability of the model is proportional to the marginal likelihood times the prior probability of the model.
Now, one is usually less interested in absolute evidence than in relative evidence; we want to compare the predictive performance of one model over another.
To do this, we simply form the ratio of the model probabilities:
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{P(y \mid M_1)\, p(M_1) \,/\, p(y)}{P(y \mid M_2)\, p(M_2) \,/\, p(y)}$$
The term p(y) cancels out, giving us:
$$\frac{p(M_1 \mid y)}{p(M_2 \mid y)} = \frac{P(y \mid M_1)}{P(y \mid M_2)} \times \frac{p(M_1)}{p(M_2)}$$
The first term on the right-hand side is the ratio of marginal likelihoods:
$$\frac{P(y \mid M_1)}{P(y \mid M_2)}$$
This is the Bayes factor, and it can be interpreted as the change from prior odds to posterior odds that is indicated by the data.
If we consider the prior odds to be 1, i.e. we do not favour one model over another a priori, then we are only interested in the Bayes factor. We write this as:
$$\mathrm{BF}_{12} = \frac{P(y \mid M_1)}{P(y \mid M_2)}$$
Here, BF12 indicates the extent to which the data support model M1 over model M2.
As an example, if we obtain BF12 = 5, this means that the data are 5 times more likely to have occurred under model 1 than under model 2. Conversely, if BF12 = 0.2, then the data are 5 times more likely to have occurred under model 2.
We usually perform model comparisons between a null hypothesis H0 and an alternative hypothesis H1. The terms “model” and “hypothesis” are used synonymously.
In JASP, we will see Bayes factors reported as either
$$\mathrm{BF}_{10} = \frac{P(y \mid H_1)}{P(y \mid H_0)}$$
which indicates a BF for an undirected alternative H1 versus the null, or
$$\mathrm{BF}_{+0} = \frac{P(y \mid H_+)}{P(y \mid H_0)}$$
which indicates a BF for a directed alternative H+ versus H0.
If we want a BF for the null H0, we can simply take the inverse of BF10:
$$\mathrm{BF}_{01} = \frac{1}{\mathrm{BF}_{10}}$$
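For a two-sample comparison such as the kitchen roll data, these Bayes factors can be computed directly, for example with the BayesFactor package (a sketch assuming its default Cauchy prior on the standardized effect size; JASP’s Bayesian t-test is built on the same package):
library(BayesFactor)
# BF10: undirected alternative versus the null (default Cauchy prior)
bf10 <- ttestBF(formula = NEO ~ Rotation, data = as.data.frame(kitchenrolls))
bf10
# BF01: evidence for the null is simply the inverse
1 / bf10
# BF+0: directional alternative (counter - clock < 0, i.e. higher scores after clockwise rotation)
ttestBF(formula = NEO ~ Rotation, data = as.data.frame(kitchenrolls),
        nullInterval = c(-Inf, 0))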
A classification scheme for the size of the Bayes factor is sometimes used to label the strength of evidence, although it is rather unnecessary.
According to Gelman et al. (2014), Bayesian data analysis is performed in three steps:
Set up a probability model (a joint probability distribution for the observed quantities (y, x) and the latent quantities θ).
Condition on the observed data: calculate the posterior distribution $p(\theta \mid y) \propto p(y \mid \theta) \cdot p(\theta)$.
Evaluate the model and the implications of the posterior distribution.
This fits very well with the iterative process described by Blei (2014).
In fact, we can describe a Bayesian workflow as an iterative cycle of building a model, computing the posterior, and criticizing the model. This highlights the distinction between posterior evaluation (estimation) of a model and model comparison (hypothesis testing).
Open notebook: 01-intro-bayesian-statistics.Rmd
Open notebook: 02-jasp-case-studies.Rmd
Open notebook: 03-brms-case-studies.Rmd
Blei, David M. 2014. “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application 1 (1): 203–32. https://doi.org/10.1146/annurev-statistics-022513-115657.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
Dienes, Zoltan. 2014. “Using Bayes to Get the Most Out of Non-Significant Results.” Frontiers in Psychology 5. https://doi.org/10.3389/fpsyg.2014.00781.
Gelman, Andrew, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. 2014. Bayesian Data Analysis. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
———. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50. https://doi.org/10.1007/s10654-016-0149-3.
Kruschke, John K. 2013. “Bayesian Estimation Supersedes the T Test.” Journal of Experimental Psychology: General 142 (2): 573–603. https://doi.org/10.1037/a0029146.
Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. 1st ed. Cambridge ; New York: Cambridge University Press. https://doi.org/10.1017/CBO9781139087759.
Wagenmakers, Eric-Jan, Titia F. Beek, Mark Rotteveel, Alex Gierholz, Dora Matzke, Helen Steingroever, Alexander Ly, et al. 2015. “Turning the Hands of Time Again: A Purely Confirmatory Replication Study and a Bayesian Analysis.” Frontiers in Psychology 6 (April). https://doi.org/10.3389/fpsyg.2015.00494.
Wagenmakers, Eric-Jan, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Jonathon Love, Ravi Selker, et al. 2018. “Bayesian Inference for Psychology. Part I: Theoretical Advantages and Practical Ramifications.” Psychonomic Bulletin & Review 25 (1): 35–57. https://doi.org/10.3758/s13423-017-1343-3.
Wagenmakers, Eric-Jan, Richard D. Morey, and Michael D. Lee. 2016. “Bayesian Benefits for the Pragmatic Researcher.” Current Directions in Psychological Science 25 (3): 169–76. https://doi.org/10.1177/0963721416643289.