Hypothesis Testing With Bayes Factors

August 18, 2015

Outline

The problem: stating evidence for an invariance
Where null hypothesis significance testing fails us
Where Bayes factor could save us
Examples: coin flips, t-test, ANOVA
Some criticisms

The problem

According to your theory, the effect of Factor A is expected to be invariant with respect to Factor B

The effect of Factor A should not be different across levels of Factor B

How can we provide statistical evidence for this invariance?

Traditional approach

Fit a statistical model (i.e., ANOVA) to data where Factor A and Factor B were manipulated
Is the Factor A X Factor B interaction significant?
Calculate p value – what proportion of values will be as or more extreme as what we observed, under the null hypothesis?
Reject or fail to reject null hypothesis based on p value (i.e., reject if p < .05)

When p > .05

Fail to reject null hypothesis of no difference
Can't "accept" the null (even if we want to!)
State that the evidence for an effect is inconclusive
If you make a big deal of a null effect, chances are a reviewer will suggest your study was underpowered to find an interaction

The absence of an effect is not itself evidence of absence

Bayes factor

Alternative to NHST

Directly compare the likelihood of the data under null and non-null models by computing Bayes factor

To what extent should we revise our beliefs regarding null and alternative models after observing data?

Probability

Frequentist (Classical)

Probability (i.e., a p value) is the proportion of expected outcomes under repeated tests

Bayesian

Probability quantifies our subjective belief about the likelihood of an outcome, and is rationally updated given new data

Bayesian updating

Bayes rule in simple form:

\[ Pr ( H | D ) = \frac{ Pr( D | H ) Pr ( H ) }{ Pr ( D ) } \]

\(Pr(D|H)\) is the likelihood of the data under a hypothesis

\(Pr(H)\) is the prior probability of the hypothesis

\(Pr(D)\) is the probability of the data

Bayes factor in model comparison

Bayes factor compares the marginal likelihood of the data under competing models

\[ \frac{ Pr( M_0 | D )}{ Pr( M_1 | D )} = \frac{ Pr( D | M_0) }{ Pr (D | M_1 ) } \times \frac{ Pr(M_0) } { Pr ( M_1 )} \]

Are the data more probable under the null model than a specified alternative?

Likelihood ratio test

When \(Pr(D|M)\) is computed by maximizing the likelihood, we can compare two nested models by comparing the likelihood ratio

\(-2 * ln Pr(D|M_1)/Pr(D|M_2)\)

However, there is no direct mechanism that can favor a null (less complex) model, as the log-likelihood of the more complex of the nested models will always be as least as high as the reduced model
Bayes factor offers a clear advantage by eliminating this limitation

A coin example

We want to determine the likelihood of flipping heads using a coin. Prior to flipping the coin 10 times, we may believe that all outcomes are equally likely.

\(Y | \theta \sim \text{Binomial}(\theta, N)\)

\(\theta \sim \text{Uniform} (0,1)\)

A coin example

\(\theta \sim \text{Uniform} (0,1)\)

A coin example

Another person may believe the coin to be fairly weighted, and is likely to land heads and tails evenly.

\(Y \sim \text{Binomial}(.5, N)\)

A coin example

Let's say we observe 6 heads out of 10 flips.

A coin example

\(Y = 6\)

\(Pr(D|M_1) = .09\)

\(Pr(D|M_2) = .21\)

\(BF_{12} = .44\)

\(BF_{21} = 2.26\)

The data are more than twice as likely under the 'fair' model as under the uniform model.

Bayes factor

Kass & Raftery (1995) recommendations

BF	Strength of evidence
< 1	Favors other model
1 - 3	Not worth a bare mention
3 - 20	Substantial
20 - 150	Strong
> 150	Very strong

Bayes factor

Bayes factor informs us how to revise our prior beliefs
People often hold different prior opinions regarding the truth of hypotheses (i.e., ESP, global warming)
However, we can agree on the impact that data have on our resulting beliefs

Choosing prior distributions

Subjectivity in choosing priors is a concern, as priors have a direct influence on Bayes factor (and posterior beliefs)
We often don't have point estimates for effects in to-be-conducted experiments
Complexity of this problem has prevented many from adopting Bayesian methods

BayesFactor R Package

BayesFactor (Morey et al., 2015)

Rouder et al. (2009, 2012) have developed "default" priors for use in a wide range of applications in a convenient R package
Priors are placed on effect size (unaffected by measurement scale) and variance
Warning: not to be used blindly

t-test

A t-test can indicate whether two independently sampled means are significantly different from one another. Under the null, the two means are assumed to be equal.

\(H_0\): \(\bar{Y_1} - \bar{Y_2} = 0\)

\(H_A\): \(\bar{Y_1} - \bar{Y_2} \neq 0\)

\[ t = \frac{ \bar{Y_1} - \bar{Y_2} } { s_p \sqrt { 1/N_1 + 1/N_2 } } \]

Compare observed \(t\) against \(t\) with \(df = N_1 + N_2 - 2\)

t-test using BayesFactor

Rouder et al. (2009; PB&R)

Assume no effect

\[ M_0: \delta = 0 \]

Null model (\(M_0\))

Jeffrey's (noninformative) prior on variance \(Pr(\sigma^2) = 1 / \sigma^2\)

Alternative model (\(M_1\))

Prior on effect size where:

\[ \delta = \frac{ \mu } { \sigma^2 } \]

\[ M_1: \delta \text{ ~ Cauchy} \]

Cauchy distribution

Two sample t-test

When weaving, does yarn "A" break more than yarn "B"?

plot(breaks ~ factor(wool), warpbreaks)

Two sample t-test

Does yarn "A" break more than yarn "B"?

t.test(breaks ~ wool, data = warpbreaks, var.eq=TRUE)

## 
##  Two Sample t-test
## 
## data:  breaks by wool
## t = 1.6335, df = 52, p-value = 0.1084
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.319679 12.875235
## sample estimates:
## mean in group A mean in group B 
##        31.03704        25.25926

Bayesian t-test using BayesFactor

ttestBF(formula = breaks ~ wool, data = warpbreaks)

## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.8184672 ±0.01%
## 
## Against denominator:
##   Null, mu1-mu2 = 0 
## ---
## Bayes factor type: BFindepSample, JZS

Bayesian t-test using BayesFactor

Plot code from E-J Wagenmakers

Cauchy scale parameter (r)

Bayesian t-test using BayesFactor

Plot code from E-J Wagenmakers

Assessing evidence: p-value vs. BF

Rouder et al. (2012) p. 360

ANOVAs

Bayes factor for ANOVA designs

Rouder et al. (2012; JMP)

Bayesian implementation of the hierarchical ANOVA model (Gelman, 2005)

\[ Y_{ijk} = \mu + \sigma[A_i + B_j + (AB)ij] + \epsilon_{ijk} \]

Cauchy priors are placed on effect size for each term (A, B, AB)

Error terms are assumed to be independent for each effect

Bayes factor for ANOVA designs

Prior structure

Fixed effect \(\alpha\)

\(\alpha* | g \sim \text{Normal}(0, g_{\alpha-1},gI_{\alpha-1})\)

\(g \sim \text{Inverse-}\chi^2(1)\)

Resulting effect-size distributions are multivariate Cauchy
Separate \(g\)-prior for each factor

Bayes factor for ANOVA designs

Effect size used is \(R^2\)
Null model includes only random effects (i.e., subjects or items)
Each of the nested linear models in the ANOVA design is compared to the null model
Nested models can be directly compared by taking the ratio of the Bayes factor

An example: Abbott & Staub (In Press, JML)

E-Z Reader 10 (Reichle, Warren, & McConnell, 2009) asserts that word identification and integration occur in discrete serial stages
Word identification time is a function of frequency and cloze probability
Integration of word n can affect eye movements if it lags behind word n+1 identification or it fails outright (with probability P(F))

Abbott & Staub (In Press, JML)

The man noticed the journal was missing from his desk. (HF, plausible)
The man noticed the stapler was missing from his desk. (LF, plausible)
The man angered the journal by placing it in the drawer. (HF, implausible)
The man angered the stapler by placing it in the drawer. (LF, implausible)

E-Z Reader's null predictions

Frequency, but not plausibility, will affect word skipping
Plausibility, but not frequency, will affect regression rates
Frequency and plausibility will have additive (critically, not interactive) effects across all temporal measures (i.e., gaze duration)

First fixation duration

Target Word Duration Measures

First fixation duration

ANOVA by subjects

Variable	F(1,111)	p
Frequency	35.14	< .001
Plausibility	7.52	< .01
Freq X Plaus	.33	.57

First fixation duration

Bayesian ANOVA

ffd.aov.bf <- anovaBF(ffdR3 ~ Freq*Plaus + Subj, data=fpl.tarffd.byss,
whichRandom="Subj", rscaleFixed=.5, iterations=100000)
ffd.aov.bf

## Bayes factor analysis
## --------------
## [1] Freq + Subj                      : 237815.4 ±3.3%
## [2] Plaus + Subj                     : 4.578714 ±0.48%
## [3] Freq + Plaus + Subj              : 1473768  ±0.62%
## [4] Freq + Plaus + Freq:Plaus + Subj : 244388.3 ±0.86%
## 
## Against denominator:
##   ffdR3 ~ Subj 
## ---
## Bayes factor type: BFlinearModel, JZS

First fixation duration

Bayesian ANOVA

plot(ffd.aov.bf)

First fixation duration

Bayesian ANOVA

What is the evidence for the additive effects model over the interactive effects model?

ffd.aov.bf[3] / ffd.aov.bf[4]

## Bayes factor analysis
## --------------
## [1] Freq + Plaus + Subj : 6.030438 ±1.06%
## 
## Against denominator:
##   ffdR3 ~ Freq + Plaus + Freq:Plaus + Subj 
## ---
## Bayes factor type: BFlinearModel, JZS

First fixation duration

Bayesian ANOVA

We can also sample from the posterior distribution of the parameters

ffd.chain <- data.frame(posterior(ffd.aov.bf[4], iterations=10000,
columnFilter="^Subj$"))
head(ffd.chain[,1:5])

##         mu   Freq.HF  Freq.LF Plaus.IM  Plaus.PL
## 1 245.8085 -7.556494 7.556494 3.956670 -3.956670
## 2 243.9439 -8.633186 8.633186 5.928889 -5.928889
## 3 242.8874 -9.918842 9.918842 5.867772 -5.867772
## 4 250.2694 -5.541311 5.541311 4.786933 -4.786933
## 5 243.7995 -9.204271 9.204271 4.652317 -4.652317
## 6 249.9436 -7.153691 7.153691 2.312630 -2.312630

First fixation duration

Bayesian ANOVA

Prior to posterior odds

We can compute posterior odds using the BF object. For example, we can consider all models to be equally likely before observing the data (we have done this implicitly already).

prior.odds = newPriorOdds(ffd.aov.bf, type = "equal")
prior.odds

## Prior odds
## --------------
## [1] Freq + Subj                      : 1
## [2] Plaus + Subj                     : 1
## [3] Freq + Plaus + Subj              : 1
## [4] Freq + Plaus + Freq:Plaus + Subj : 1
## 
## Against denominator:
##   ffdR3 ~ Subj 
## ---
## Model type: BFlinearModel, JZS

Prior to posterior odds

Now compute the posterior odds

post.odds = prior.odds * ffd.aov.bf
post.odds

## Posterior odds
## --------------
## [1] Freq + Subj                      : 237815.4 ±3.3%
## [2] Plaus + Subj                     : 4.578714 ±0.48%
## [3] Freq + Plaus + Subj              : 1473768  ±0.62%
## [4] Freq + Plaus + Freq:Plaus + Subj : 244388.3 ±0.86%
## 
## Against denominator:
##   ffdR3 ~ Subj 
## ---
## Model type: BFlinearModel, JZS

Prior to posterior odds

And we can convert these to probabilities

post.prob = as.BFprobability(post.odds)
post.prob

## Posterior probabilities
## --------------
## [1] Freq + Subj                      : 0.1215839    ±NA%
## [2] Plaus + Subj                     : 2.340883e-06 ±NA%
## [3] Freq + Plaus + Subj              : 0.7534689    ±NA%
## [4] Freq + Plaus + Freq:Plaus + Subj : 0.1249443    ±NA%
## [5] Subj                             : 5.112533e-07 ±NA%
## 
## Normalized probability:  1  
## ---
## Model type: BFlinearModel, JZS

Mixed effects models are also possible - just pass additional factors to "whichRandom". NAs must be omitted.

fpl.nona <- subset(fpl, !is.na(fpl$ffdR3))
ffd.mixed.bf <- anovaBF(ffdR3 ~ Freq*Plaus + subj + item, data=fpl.nona,
whichRandom=c("subj", "item"), rscaleFixed=.5)
ffd.mixed.bf

## Bayes factor analysis
## --------------
## [1] Freq + item + subj                      : 4641683  ±1.03%
## [2] Plaus + item + subj                     : 7.145093 ±1.92%
## [3] Freq + Plaus + item + subj              : 32615663 ±2.11%
## [4] Freq + Plaus + Freq:Plaus + item + subj : 1723331  ±2.03%
## 
## Against denominator:
##   ffdR3 ~ item + subj 
## ---
## Bayes factor type: BFlinearModel, JZS

Mixed effects models

plot(ffd.mixed.bf)

End notes

Criticism of Bayes factor

Is Bayes factor prejudiced against small effects? Uri Simonsohn
Krushke: Estimation and precision are the goal (i.e., posterior predictive checks)
Gelman: Avoid model selection in Bayesian inference

Small effects

Rouder et al. (2009) p. 233

\(\delta = .2\)

Conclusions

Model comparison using Bayes factors is an attractive alternative to NHST, in particular when null effects are of interest
The information gained from Bayesian analysis is rich
Recent developments by Rouder, Morey, and colleagues make Bayesian analysis digestable (and available to the masses using JASP)
What are we waiting for?