August 18, 2015

## Outline

• The problem: stating evidence for an invariance

• Where null hypothesis significance testing fails us

• Where Bayes factor could save us

• Examples: coin flips, t-test, ANOVA

• Some criticisms

## The problem

According to your theory, the effect of Factor A is expected to be invariant with respect to Factor B

The effect of Factor A should not be different across levels of Factor B

How can we provide statistical evidence for this invariance?

1. Fit a statistical model (e.g., ANOVA) to data where Factor A and Factor B were manipulated

2. Is the Factor A X Factor B interaction significant?

3. Calculate p value: under the null hypothesis, what proportion of values would be at least as extreme as the one we observed?

4. Reject or fail to reject null hypothesis based on p value (e.g., reject if p < .05)
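The four steps can be sketched in R. This is only an illustration, not the analysis from any study discussed here: it assumes the built-in `warpbreaks` data, with `wool` and `tension` standing in for Factor A and Factor B.

```r
# Assumption: warpbreaks (shipped with R) stands in for a two-factor design;
# wool plays the role of Factor A and tension the role of Factor B.
fit <- lm(breaks ~ wool * tension, data = warpbreaks)

# Steps 2-3: the ANOVA table holds the F test and p value for each term.
tab <- anova(fit)
print(tab)

# Step 4: compare the interaction p value with the conventional .05 cutoff.
p_interaction <- tab["wool:tension", "Pr(>F)"]
reject_null <- p_interaction < .05
```

Note that `reject_null` can only ever say "reject" or "fail to reject"; nothing in this procedure quantifies support for the null.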

## When p > .05

• Fail to reject null hypothesis of no difference

• Can't "accept" the null (even if we want to!)

• State that the evidence for an effect is inconclusive

• If you make a big deal of a null effect, chances are a reviewer will suggest your study was underpowered to find an interaction

• Absence of evidence for an effect is not itself evidence of absence
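How plausible the "underpowered" objection is can be gauged with a power calculation. A rough sketch using `power.t.test()` from base R; the design (27 observations per group) and the true effect (0.5 standard deviations) are illustrative assumptions, and a two-group comparison is used here as a simpler stand-in for an interaction test.

```r
# Hypothetical design: 27 observations per group, true difference of 0.5 SD.
pw <- power.t.test(n = 27, delta = 0.5, sd = 1, sig.level = .05,
                   type = "two.sample", alternative = "two.sided")

# Probability of obtaining p < .05 if the assumed effect is real.
pw$power
```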

## Alternative to NHST

Directly compare the likelihood of the data under null and non-null models by computing Bayes factor

To what extent should we revise our beliefs regarding null and alternative models after observing data?

## Probability

Frequentist (Classical)

Probability (e.g., a p value) is the long-run proportion of outcomes expected under repeated tests

Bayesian

Probability quantifies our subjective belief about the likelihood of an outcome, and is rationally updated given new data

## Bayesian updating

Bayes rule in simple form:

$Pr ( H | D ) = \frac{ Pr( D | H ) Pr ( H ) }{ Pr ( D ) }$

$$Pr(D|H)$$ is the likelihood of the data under a hypothesis

$$Pr(H)$$ is the prior probability of the hypothesis

$$Pr(D)$$ is the probability of the data
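A small numeric sketch of the rule in R. The two hypotheses, their equal prior probabilities, and the data (6 heads in 10 flips) are made up for illustration.

```r
# Two hypothetical hypotheses about a coin's probability of heads.
theta <- c(fair = 0.5, biased = 0.8)
prior <- c(fair = 0.5, biased = 0.5)   # Pr(H): equal prior belief

# Pr(D | H): probability of 6 heads in 10 flips under each hypothesis.
likelihood <- dbinom(6, size = 10, prob = theta)

# Pr(D): total probability of the data, summed over both hypotheses.
evidence <- sum(likelihood * prior)

# Pr(H | D): posterior probability of each hypothesis.
posterior <- likelihood * prior / evidence
posterior
```

Because $$Pr(D)$$ is the same for every hypothesis, it only rescales the products of likelihood and prior so that the posterior probabilities sum to 1.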

## Bayes factor in model comparison

Bayes factor compares the marginal likelihood of the data under competing models

$\frac{ Pr( M_0 | D )}{ Pr( M_1 | D )} = \frac{ Pr( D | M_0) }{ Pr (D | M_1 ) } \times \frac{ Pr(M_0) } { Pr ( M_1 )}$

Are the data more probable under the null model than a specified alternative?

## Likelihood ratio test

• When $$Pr(D|M)$$ is computed by maximizing the likelihood, we can compare two nested models by comparing the likelihood ratio

$$-2 \ln \frac{Pr(D|M_1)}{Pr(D|M_2)}$$

• However, there is no direct mechanism that can favor a null (less complex) model, as the log-likelihood of the more complex of the nested models will always be at least as high as that of the reduced model

• Bayes factor offers a clear advantage by eliminating this limitation
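The limitation can be seen in a short R sketch. It assumes the built-in `warpbreaks` data and treats $$M_1$$ as the reduced model and $$M_2$$ as the full model; both choices are illustrative.

```r
# Assumption: warpbreaks (shipped with R) is used purely for illustration.
reduced <- lm(breaks ~ wool + tension, data = warpbreaks)  # no interaction
full    <- lm(breaks ~ wool * tension, data = warpbreaks)  # adds interaction

ll_reduced <- as.numeric(logLik(reduced))
ll_full    <- as.numeric(logLik(full))

# Likelihood ratio statistic: -2 * ln(L_reduced / L_full), never negative.
lr_stat <- -2 * (ll_reduced - ll_full)

# Degrees of freedom: the number of extra parameters in the full model.
df_diff <- attr(logLik(full), "df") - attr(logLik(reduced), "df")
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)
```

Since `ll_full` can never fall below `ll_reduced`, the statistic can show that the extra terms help, but it cannot show that the reduced model is the better account.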

## A coin example

We want to determine the probability of flipping heads with a coin. Before flipping the coin 10 times, we may believe that all values of this probability ($$\theta$$) are equally likely.

$$Y | \theta \sim \text{Binomial}(\theta, N)$$

$$\theta \sim \text{Uniform} (0,1)$$

## A coin example

Another person may believe the coin to be fairly weighted, and is likely to land heads and tails evenly.

$$Y \sim \text{Binomial}(.5, N)$$

## A coin example

Let's say we observe 6 heads out of 10 flips.

## A coin example

$$Y = 6$$

$$Pr(D|M_1) = .09$$

$$Pr(D|M_2) = .21$$

$$BF_{12} = .44$$

$$BF_{21} = 2.26$$

The data are more than twice as likely under the 'fair' model as under the uniform model.
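These values can be reproduced in base R. A minimal sketch: the marginal likelihood under the uniform model averages the binomial likelihood over the prior, while the fair model needs only a single binomial probability. Variable names are illustrative.

```r
heads <- 6
flips <- 10

# M1: theta ~ Uniform(0, 1). Average the binomial likelihood over the prior.
m1 <- integrate(function(theta) dbinom(heads, flips, theta) * dunif(theta),
                lower = 0, upper = 1)$value

# M2: theta fixed at .5, so the marginal likelihood is one binomial term.
m2 <- dbinom(heads, flips, 0.5)

bf_12 <- m1 / m2   # evidence for the uniform model over the fair model
bf_21 <- m2 / m1   # evidence for the fair model over the uniform model
round(c(m1 = m1, m2 = m2, bf_12 = bf_12, bf_21 = bf_21), 2)
```

The uniform model spreads its predictions over every possible number of heads, so it assigns less probability to any single outcome than the fair model assigns to outcomes near 5 out of 10.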

## Bayes factor

Kass & Raftery (1995) recommendations

| BF | Strength of evidence |
|---|---|
| < 1 | Favors other model |
| 1 - 3 | Not worth more than a bare mention |
| 3 - 20 | Positive |
| 20 - 150 | Strong |
| > 150 | Very strong |

## Bayes factor

• Bayes factor informs us how to revise our prior beliefs

• People often hold different prior opinions regarding the truth of hypotheses (e.g., ESP, global warming)

• However, we can agree on the impact that data have on our resulting beliefs

## Choosing prior distributions

• Subjectivity in choosing priors is a concern, as priors have a direct influence on Bayes factor (and posterior beliefs)

• We often don't have point estimates for effects in to-be-conducted experiments

• Complexity of this problem has prevented many from adopting Bayesian methods

## BayesFactor (Morey et al., 2015)

• Rouder et al. (2009, 2012) developed "default" priors for a wide range of applications; they are implemented in a convenient R package

• Priors are placed on effect size (unaffected by measurement scale) and variance

• Warning: not to be used blindly

## t-test

A t-test can indicate whether two independently sampled means are significantly different from one another. Under the null, the two means are assumed to be equal.

$$H_0$$: $$\mu_1 - \mu_2 = 0$$

$$H_A$$: $$\mu_1 - \mu_2 \neq 0$$

$t = \frac{ \bar{Y_1} - \bar{Y_2} } { s_p \sqrt { 1/N_1 + 1/N_2 } }$

Compare the observed $$t$$ against the $$t$$ distribution with $$df = N_1 + N_2 - 2$$
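The statistic can be computed by hand in R. A sketch assuming the built-in `warpbreaks` data, comparing breaks for wool A and wool B; the variable names are illustrative.

```r
# Assumption: warpbreaks (shipped with R), breaks for wool A vs. wool B.
y1 <- warpbreaks$breaks[warpbreaks$wool == "A"]
y2 <- warpbreaks$breaks[warpbreaks$wool == "B"]
n1 <- length(y1)
n2 <- length(y2)

# Pooled standard deviation s_p, weighting each variance by its df.
dof <- n1 + n2 - 2
s_p <- sqrt(((n1 - 1) * var(y1) + (n2 - 1) * var(y2)) / dof)

t_obs <- (mean(y1) - mean(y2)) / (s_p * sqrt(1 / n1 + 1 / n2))

# Two-sided p value from the t distribution with dof degrees of freedom.
p_value <- 2 * pt(abs(t_obs), df = dof, lower.tail = FALSE)
```

The result should agree with `t.test(..., var.equal = TRUE)` on the same data.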

## t-test using BayesFactor

### Rouder et al. (2009; PB&R)

Null model ($$M_0$$)

Assume no effect

$M_0: \delta = 0$

Jeffreys (noninformative) prior on variance $$Pr(\sigma^2) \propto 1 / \sigma^2$$

Alternative model ($$M_1$$)

Prior on effect size where:

$\delta = \frac{ \mu } { \sigma }$

$M_1: \delta \sim \text{Cauchy}$

## Two sample t-test

When weaving, does yarn "A" break more than yarn "B"?

plot(breaks ~ factor(wool), warpbreaks)

## Two sample t-test

Does yarn "A" break more than yarn "B"?

t.test(breaks ~ wool, data = warpbreaks, var.equal = TRUE)
##
##  Two Sample t-test
##
## data:  breaks by wool
## t = 1.6335, df = 52, p-value = 0.1084
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.319679 12.875235
## sample estimates:
## mean in group A mean in group B
##        31.03704        25.25926

## Bayesian t-test using BayesFactor

ttestBF(formula = breaks ~ wool, data = warpbreaks)
## Bayes factor analysis
## --------------
## [1] Alt., r=0.707 : 0.8184672 ±0.01%
##
## Against denominator:
##   Null, mu1-mu2 = 0
## ---
## Bayes factor type: BFindepSample, JZS
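The reported Bayes factor can be approximated in base R by one-dimensional numerical integration. This sketch follows the integral form given by Rouder et al. (2009); folding the scale `r` into the integrand and all variable names are assumptions made here, and this is not necessarily how the package computes it internally.

```r
# Inputs taken from the classical two-sample test on warpbreaks.
tt    <- t.test(breaks ~ wool, data = warpbreaks, var.equal = TRUE)
t_obs <- unname(tt$statistic)
nu    <- unname(tt$parameter)            # degrees of freedom
n1    <- sum(warpbreaks$wool == "A")
n2    <- sum(warpbreaks$wool == "B")
n_eff <- n1 * n2 / (n1 + n2)             # effective sample size
r     <- sqrt(2) / 2                     # Cauchy scale, i.e. r = 0.707

# Marginal likelihood (up to a shared constant) under the null, delta = 0.
m0 <- (1 + t_obs^2 / nu)^(-(nu + 1) / 2)

# Under the alternative, integrate over g, where delta | g ~ Normal(0, r^2 g)
# and g ~ Inverse-chi-square(1). Computed on the log scale for stability.
integrand <- function(g) {
  k <- 1 + n_eff * r^2 * g
  exp(-0.5 * log(k) -
        (nu + 1) / 2 * log1p(t_obs^2 / (k * nu)) -
        0.5 * log(2 * pi) - 1.5 * log(g) - 1 / (2 * g))
}
m1 <- integrate(integrand, lower = 0, upper = Inf)$value

bf_10 <- m1 / m0   # alternative over null
```

`bf_10` should land close to the value of about 0.82 reported above, up to numerical integration error.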

## Bayesian t-test using BayesFactor

Plot code from E-J Wagenmakers

## Assessing evidence: p-value vs. BF

Rouder et al. (2012) p. 360

## Bayes factor for ANOVA designs

### Rouder et al. (2012; JMP)

Bayesian implementation of the hierarchical ANOVA model (Gelman, 2005)

$Y_{ijk} = \mu + \sigma[A_i + B_j + (AB)_{ij}] + \epsilon_{ijk}$

Cauchy priors are placed on effect size for each term (A, B, AB)

Error terms are assumed to be independent for each effect

## Bayes factor for ANOVA designs

### Prior structure

Fixed effect $$\alpha$$

$$\alpha^* | g \sim \text{Normal}(0, gI_{a-1})$$, where $$a$$ is the number of levels of the factor

$$g \sim \text{Inverse-}\chi^2(1)$$

• Resulting effect-size distributions are multivariate Cauchy

• Separate $$g$$-prior for each factor
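The link between the $$g$$-prior and the Cauchy distribution can be checked by simulation. A sketch for a single effect (the univariate case), with an arbitrary seed and number of draws.

```r
set.seed(1)
n_draws <- 1e5

# g ~ Inverse-chi-square(1): the reciprocal of a chi-square(1) draw.
g <- 1 / rchisq(n_draws, df = 1)

# alpha | g ~ Normal(0, g), for one effect.
alpha <- rnorm(n_draws, mean = 0, sd = sqrt(g))

# Marginally, alpha should follow a standard Cauchy distribution,
# so the simulated quartiles should sit near the theoretical ones.
probs <- c(.25, .5, .75)
rbind(simulated = quantile(alpha, probs), cauchy = qcauchy(probs))
```

Quantiles are compared rather than means or variances because the Cauchy distribution has neither.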

## Bayes factor for ANOVA designs

• Effect size used is $$R^2$$

• Null model includes only random effects (i.e., subjects or items)

• Each of the nested linear models in the ANOVA design is compared to the null model

• Nested models can be directly compared by taking the ratio of their Bayes factors, because the shared null model cancels

## An example: Abbott & Staub (In Press, JML)

• E-Z Reader 10 (Reichle, Warren, & McConnell, 2009) asserts that word identification and integration occur in discrete serial stages

• Word identification time is a function of frequency and cloze probability

• Integration of word n can affect eye movements if it lags behind word n+1 identification or it fails outright (with probability P(F))

## Abbott & Staub (In Press, JML)

1. The man noticed the journal was missing from his desk. (HF, plausible)

2. The man noticed the stapler was missing from his desk. (LF, plausible)

3. The man angered the journal by placing it in the drawer. (HF, implausible)

4. The man angered the stapler by placing it in the drawer. (LF, implausible)

• Frequency, but not plausibility, will affect word skipping

• Plausibility, but not frequency, will affect regression rates

• Frequency and plausibility will have additive (critically, not interactive) effects across all temporal measures (e.g., gaze duration)

## First fixation duration

### ANOVA by subjects

| Variable | F(1,111) | p |
|---|---|---|
| Frequency | 35.14 | < .001 |
| Plausibility | 7.52 | < .01 |
| Freq X Plaus | .33 | .57 |
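The p values follow from the F statistics and their degrees of freedom. A quick check in base R, using only the numbers in the table; the vector names are illustrative.

```r
# F statistics from the by-subjects ANOVA, each on (1, 111) df.
f_values <- c(Frequency = 35.14, Plausibility = 7.52, "Freq X Plaus" = .33)

# Upper-tail probability: how often F is at least this large under the null.
p_values <- pf(f_values, df1 = 1, df2 = 111, lower.tail = FALSE)
round(p_values, 3)
```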

## First fixation duration

### Bayesian ANOVA

ffd.aov.bf <- anovaBF(ffdR3 ~ Freq*Plaus + Subj, data=fpl.tarffd.byss,
whichRandom="Subj", rscaleFixed=.5, iterations=100000)
ffd.aov.bf
## Bayes factor analysis
## --------------
## [1] Freq + Subj                      : 237815.4 ±3.3%
## [2] Plaus + Subj                     : 4.578714 ±0.48%
## [3] Freq + Plaus + Subj              : 1473768  ±0.62%
## [4] Freq + Plaus + Freq:Plaus + Subj : 244388.3 ±0.86%
##
## Against denominator:
##   ffdR3 ~ Subj
## ---
## Bayes factor type: BFlinearModel, JZS

## First fixation duration

### Bayesian ANOVA

plot(ffd.aov.bf)

## First fixation duration

### Bayesian ANOVA

What is the evidence for the additive effects model over the interactive effects model?

ffd.aov.bf[3] / ffd.aov.bf[4]
## Bayes factor analysis
## --------------
## [1] Freq + Plaus + Subj : 6.030438 ±1.06%
##
## Against denominator:
##   ffdR3 ~ Freq + Plaus + Freq:Plaus + Subj
## ---
## Bayes factor type: BFlinearModel, JZS
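Dividing two Bayes factor objects is ordinary arithmetic on values that share a denominator. A by-hand check in base R, with the two Bayes factors copied from the earlier `anovaBF` output; the variable names are illustrative.

```r
# Bayes factors against the shared null model (ffdR3 ~ Subj).
bf_additive    <- 1473768    # Freq + Plaus + Subj
bf_interactive <- 244388.3   # Freq + Plaus + Freq:Plaus + Subj

# The null model's marginal likelihood cancels in the ratio, leaving the
# additive model compared directly against the interactive model.
bf_add_vs_int <- bf_additive / bf_interactive
bf_add_vs_int
```

The small difference from the package output in the last decimals comes from rounding in the printed inputs.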

## First fixation duration

### Bayesian ANOVA

We can also sample from the posterior distribution of the parameters

ffd.chain <- data.frame(posterior(ffd.aov.bf[4], iterations=10000,
columnFilter="^Subj$"))
head(ffd.chain[,1:5])
##         mu   Freq.HF  Freq.LF Plaus.IM  Plaus.PL
## 1 245.8085 -7.556494 7.556494 3.956670 -3.956670
## 2 243.9439 -8.633186 8.633186 5.928889 -5.928889
## 3 242.8874 -9.918842 9.918842 5.867772 -5.867772
## 4 250.2694 -5.541311 5.541311 4.786933 -4.786933
## 5 243.7995 -9.204271 9.204271 4.652317 -4.652317
## 6 249.9436 -7.153691 7.153691 2.312630 -2.312630

## First fixation duration

### Bayesian ANOVA

## Prior to posterior odds

We can compute posterior odds using the BF object. For example, we can consider all models to be equally likely before observing the data (we have done this implicitly already).

prior.odds = newPriorOdds(ffd.aov.bf, type = "equal")
prior.odds
## Prior odds
## --------------
## [1] Freq + Subj                      : 1
## [2] Plaus + Subj                     : 1
## [3] Freq + Plaus + Subj              : 1
## [4] Freq + Plaus + Freq:Plaus + Subj : 1
##
## Against denominator:
##   ffdR3 ~ Subj
## ---
## Model type: BFlinearModel, JZS

## Prior to posterior odds

Now compute the posterior odds

post.odds = prior.odds * ffd.aov.bf
post.odds
## Posterior odds
## --------------
## [1] Freq + Subj                      : 237815.4 ±3.3%
## [2] Plaus + Subj                     : 4.578714 ±0.48%
## [3] Freq + Plaus + Subj              : 1473768  ±0.62%
## [4] Freq + Plaus + Freq:Plaus + Subj : 244388.3 ±0.86%
##
## Against denominator:
##   ffdR3 ~ Subj
## ---
## Model type: BFlinearModel, JZS

## Prior to posterior odds

And we can convert these to probabilities

post.prob = as.BFprobability(post.odds)
post.prob
## Posterior probabilities
## --------------
## [1] Freq + Subj                      : 0.1215839    ±NA%
## [2] Plaus + Subj                     : 2.340883e-06 ±NA%
## [3] Freq + Plaus + Subj              : 0.7534689    ±NA%
## [4] Freq + Plaus + Freq:Plaus + Subj : 0.1249443    ±NA%
## [5] Subj                             : 5.112533e-07 ±NA%
##
## Normalized probability: 1
## ---
## Model type: BFlinearModel, JZS

## More

Mixed effects models are also possible - just pass additional factors to "whichRandom". NAs must be omitted.

fpl.nona <- subset(fpl, !is.na(fpl$ffdR3))
ffd.mixed.bf <- anovaBF(ffdR3 ~ Freq*Plaus + subj + item, data=fpl.nona,
whichRandom=c("subj", "item"), rscaleFixed=.5)
ffd.mixed.bf
## Bayes factor analysis
## --------------
## [1] Freq + item + subj                      : 4641683  ±1.03%
## [2] Plaus + item + subj                     : 7.145093 ±1.92%
## [3] Freq + Plaus + item + subj              : 32615663 ±2.11%
## [4] Freq + Plaus + Freq:Plaus + item + subj : 1723331  ±2.03%
##
## Against denominator:
##   ffdR3 ~ item + subj
## ---
## Bayes factor type: BFlinearModel, JZS

## Mixed effects models

plot(ffd.mixed.bf)

## Small effects

### Rouder et al. (2009) p. 233

$$\delta = .2$$