The purpose of this talk is to familiarize the audience with Bayesian thinking and to demonstrate that with current software, Bayesian analysis is no longer restricted to a small number of expert statisticians.
Summary: Two groups of people took an IQ test.
Group 1 (\(N_1=47\)) consumed a “smart drug”, and Group 2 (\(N_2=42\)) was a control group that consumed a placebo (Kruschke, 2013).
The American Statistical Association (ASA) released a statement about p-values (Wasserstein and Lazar, 2016). Among its principles are:
Cumming (2014) claims that “we need to shift from reliance on NHST to estimation and other preferred techniques”.
Kruschke and Liddell (2018) argue that Bayesian methods are better suited to achieve this, for both hypothesis testing and parameter estimation.
According to Gigerenzer (2018), we need to stop relying on NHST and instead learn to use a statistical toolkit.
Many reviewers now demand Bayes factors (because a BF can provide evidence for/against hypotheses).
Bayesian data analysis is not limited to calculating Bayes factors.
more intuitive (quantification of uncertainty)
able to provide evidence for/against hypotheses
cognitive process models (Lee and Wagenmakers, 2013)
robust models
can include prior knowledge
better for multilevel models (Gelman and Hill, 2006)
based on probability theory (Bayes' theorem)
We will have a brief look at the theoretical background, then dive straight into a practical example.
Parameters are random variables. These are drawn from probability distributions, which reflect our uncertainty about the parameters.
The prior distribution is updated with the likelihood (data) to obtain a posterior distribution.
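Formally, this is Bayes' theorem applied to a parameter vector \(\theta\) and data \(y\); the denominator (the marginal likelihood) is a normalizing constant: \[ p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)} \propto p(y \mid \theta) \, p(\theta) \]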
To make this more concrete, we return to the IQ example.
Our research hypothesis: the smart-drug group has higher IQ scores than the placebo group. (The null hypothesis is that there is no difference between groups.)
We obtained a p-value of 0.055 (Welch's t-test).
Even after removing outliers, p is > 0.05
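In R, Welch's test is the default behaviour of t.test(). A minimal sketch, assuming the TwoGroupIQ data frame used with brms below:
# Welch's t-test (unequal variances are the default in R)
t.test(IQ ~ Group, data = TwoGroupIQ)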
Not in this case. The data provide little evidence either for or against our hypothesis.
The maximally attainable Bayes factor in favour of our hypothesis is \(\sim 1.8\).
A further benefit of going Bayesian is the ability to flexibly specify our generative model. We can make the model robust against outliers.
A t-test is really just a general linear model (assuming equal variances) \[ Y = \alpha + \beta X + \epsilon\] \[ \epsilon \sim N(0, \sigma^2) \]
where \(X\) is an indicator variable.
which can be read as:
\[ IQ = Placebo + \beta \cdot SmartDrug + \epsilon\] \[ \epsilon \sim N(0, \sigma^2) \]
The \(\beta\) parameter therefore represents the difference between groups.
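This equivalence is easy to verify in R. A minimal sketch (again assuming the TwoGroupIQ data frame): the coefficient for the group indicator in the linear model equals the mean difference tested by the equal-variance t-test, with the same t statistic.
# Equal-variance t-test and its linear-model equivalent
t.test(IQ ~ Group, data = TwoGroupIQ, var.equal = TRUE)
summary(lm(IQ ~ Group, data = TwoGroupIQ))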
Linear regression model as a probabilistic model
The graph on the right shows the data-generating process (dependencies among the random variables).
\(\alpha\) is our expectation for the placebo group, \(\beta\) is our expectation for the difference in means, and \(\sigma\) is the standard deviation of the outcome.
Probabilistic programming languages, such as Stan, let us write this model down directly:
data {
  int<lower=0> N;       // number of observations
  vector[N] x;          // group indicator (0 = placebo, 1 = smart drug)
  vector[N] y;          // IQ scores
}
parameters {
  real alpha;           // mean of the placebo group
  real beta;            // difference between group means
  real<lower=0> sigma;  // residual standard deviation
}
model {
  // priors
  sigma ~ cauchy(0, 2.5);
  alpha ~ normal(100, 15);
  beta ~ normal(0, 10);
  // likelihood
  y ~ normal(alpha + beta * x, sigma);
}
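In R, the same model can be fit with brms, which generates the Stan code for us. A minimal sketch (the object name fit_eqvar is illustrative; the priors mirror the Stan model above):
library(brms)

fit_eqvar <- brm(IQ ~ Group,
                 family = gaussian,
                 data = TwoGroupIQ,
                 prior = c(set_prior("normal(100, 15)", class = "Intercept"),
                           set_prior("normal(0, 10)", class = "b"),
                           set_prior("cauchy(0, 2.5)", class = "sigma")),
                 cores = parallel::detectCores())
summary(fit_eqvar)
This produces output like the following: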
Family: gaussian
Links: mu = identity; sigma = identity
Formula: IQ ~ Group
Data: TwoGroupIQ (Number of observations: 89)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept 100.36 0.72 98.91 101.75 4083 1.00
GroupSmartDrug 1.56 1.00 -0.37 3.56 4039 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma 4.76 0.36 4.12 5.53 3504 1.00
We obtain the marginal posterior distribution of the regression parameter \(\beta\).
We can draw samples from the posterior predictive distribution. The model predicts equal variances and no outliers.
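In brms, such a check is one line; a sketch, assuming the fitted object fit_eqvar from above:
# Overlay densities of replicated data sets on the observed data
pp_check(fit_eqvar, type = "dens_overlay")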
The Gaussian distribution is sensitive to outliers. We can model our data as being generated by a t distribution instead.
# Robust model: Student-t likelihood, separate group means (no intercept),
# and a group-specific residual scale
fit_robust <- brm(bf(IQ ~ 0 + Group, sigma ~ Group),
                  family = student,
                  data = TwoGroupIQ,
                  prior = c(set_prior("normal(100, 10)", class = "b"),
                            set_prior("cauchy(0, 1)", class = "b", dpar = "sigma"),
                            set_prior("exponential(1.0/29)", class = "nu")),
                  cores = parallel::detectCores(),
                  file = here::here("models/fitiq-robust"))
Family: student
Links: mu = identity; sigma = log; nu = identity
Formula: IQ ~ 0 + Group
sigma ~ Group
Data: TwoGroupIQ (Number of observations: 89)
Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
total post-warmup samples = 4000
Population-Level Effects:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma_Intercept 0.01 0.19 -0.37 0.38 4343 1.00
GroupPlacebo 100.52 0.21 100.12 100.93 4312 1.00
GroupSmartDrug 101.55 0.36 100.85 102.24 4262 1.00
sigma_GroupSmartDrug 0.61 0.25 0.10 1.11 3800 1.00
Family Specific Parameters:
Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
nu 1.74 0.45 1.10 2.81 3554 1.00
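A one-sided test of the group difference can be run directly on the fitted model with brms' hypothesis() function. A minimal sketch (the exact hypothesis string is an assumption, reconstructed from the truncated output below):
# Is the smart-drug group mean larger than the placebo mean?
hypothesis(fit_robust, "GroupSmartDrug - GroupPlacebo > 0")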
## Hypothesis Tests for class b:
## Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
## 1 (GroupSmartDrug)-... > 0 1.03 0.41 0.34 Inf 101.56
## Post.Prob Star
## 1 0.99 *
## ---
## '*': The expected value under the hypothesis lies outside the 95%-CI.
## Posterior probabilities of point hypotheses assume equal prior probabilities.
[Posterior predictive plots: equal-variance model vs. robust model]
We can compute Bayes factors using two different methods:
Savage-Dickey density ratio
Bridge sampling (used to obtain the marginal likelihood)
Comparison: Model with grouping variable vs model without (restricted).
For a more detailed description of Bayes factors, see here.
Bayes factors are difficult to compute, often only possible for certain restricted models.
Savage-Dickey density ratio:
## [1] 2.935822
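Both methods are available through brms. A minimal sketch (fit_full and fit_restricted are illustrative names for the models with and without the grouping variable; the point hypothesis requires fitting with sample_prior = TRUE, and bridge sampling requires save_all_pars = TRUE):
# Savage-Dickey: Evid.Ratio is the density ratio for the point hypothesis;
# its inverse is the Bayes factor in favour of an effect
hypothesis(fit_robust, "GroupSmartDrug - GroupPlacebo = 0")

# Bridge sampling: Bayes factor from the two marginal likelihoods
bayes_factor(fit_full, fit_restricted)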
We noted that the normal model does not describe the data well, so we modelled the outcome \(y\) as being drawn from a t distribution, which accounts for outliers.
Posterior predictive check: the data are well described by this model.
Parameter estimation: the parameter representing the group difference is positive, with a 95% credible interval of [0.23, 1.85].
The Bayes factor in favour of our hypothesis lies between 2.94 and 3.49, depending on the method used.
This analysis is exploratory.
We did not remove outliers.
We did not test more subjects.
Computing a BF requires carefully specifying prior distributions, and these need to be reported.
Read Statistical Rethinking and install.packages("brms")
Statistical Rethinking: an introduction to applied Bayesian data analysis, with a book website and lectures on YouTube.
brms: R package for Bayesian generalized multivariate non-linear multilevel models using Stan
Blog post by Matti Vuorre on how to perform a Bayesian t-test.
Blog post by A. Solomon Kurz on robust regression.
Blog post by Andrew Heiss: Various ways to measure the difference in means in two groups.
Cumming, G. (2014). “The New Statistics: Why and How”. In: Psychological Science 25.1, pp. 7-29. ISSN: 0956-7976. DOI: 10.1177/0956797613504966.
Gelman, A. and J. Hill (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. ISBN: 978-0-511-79094-2. DOI: 10.1017/CBO9780511790942.
Gigerenzer, G. (2018). “Statistical Rituals: The Replication Delusion and How We Got There”. In: Advances in Methods and Practices in Psychological Science 1.2, pp. 198-218. ISSN: 2515-2459. DOI: 10.1177/2515245918771329.
Greenland, S., S. J. Senn, K. J. Rothman, et al. (2016). “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations”. In: European Journal of Epidemiology 31.4, pp. 337-350. ISSN: 1573-7284. DOI: 10.1007/s10654-016-0149-3.
Kruschke, J. K. (2013). “Bayesian Estimation Supersedes the t Test”. In: Journal of Experimental Psychology: General 142.2, pp. 573-603. ISSN: 1939-2222. DOI: 10.1037/a0029146.
Kruschke, J. K. and T. M. Liddell (2018). “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective”. In: Psychonomic Bulletin & Review 25.1, pp. 178-206. ISSN: 1531-5320. DOI: 10.3758/s13423-016-1221-4.
Lee, M. D. and E. Wagenmakers (2013). Bayesian Cognitive Modeling: A Practical Course. Cambridge: Cambridge University Press. ISBN: 978-1-107-01845-7.
Wasserstein, R. L. and N. A. Lazar (2016). “The ASA’s Statement on p-Values: Context, Process, and Purpose”. In: The American Statistician 70.2, pp. 129-133. ISSN: 0003-1305. DOI: 10.1080/00031305.2016.1154108.