Andrew Ellis
2019-02-20
Introduction
Bayesian methods
Bayesian analysis workflow
Probabilistic model how-to
Multilevel models: partial pooling vs. no pooling
Model zoo: signal detection, attentional lapses
Summary: Two groups of people took an IQ test.
Group 1 (\(N_1=47\)) consumed a “smart drug”, and Group 2 (\(N_2=42\)) was a control group that consumed a placebo (Kruschke 2013).
The group means are:
| Group | mean |
|---|---|
| Placebo | 100.36 |
| SmartDrug | 101.91 |
It is obvious that the data contain several ‘outliers’.
We can perform a two-sample t-test, or a Welch test:
##
## Two Sample t-test
##
## data: IQ by Group
## t = -1.5587, df = 87, p-value = 0.06135
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1037991
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
##
## Welch Two Sample t-test
##
## data: IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.05488
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.04532157
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
Problem: Neither yields a significant result. What do we do?
Statement by the American Statistical Association (ASA) about p-values (Lazar 2016):
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true (we would actually like to know this), or the probability that the data were produced by chance.
Greenland et al. (2016) provide a good discussion of common misinterpretations of p values and confidence intervals.
Cumming (2014): We need to shift from reliance on NHST to estimation and other techniques.
Kruschke and Liddell (2018): Bayesian methods are better suited for this, for both hypothesis testing and parameter estimation.
According to Gigerenzer (2004) and Gigerenzer (2018), we need to stop relying on NHST (mindless statistics), but instead learn to use a whole statistical toolkit.
Many reviewers now demand Bayes factors (because a BF can provide evidence for/against hypotheses).
However: Bayesian data analysis is not limited to calculating Bayes factors.
🤗
more intuitive quantification of uncertainty, based on probability theory
provide evidence for/against hypotheses
more flexible: robust models and cognitive process models (Lee and Wagenmakers 2014)
can include prior knowledge
better for multilevel models (Gelman and Hill 2006)
😧
The essential characteristic of Bayesian methods is their use of probability for quantifying uncertainty in inferences based on data (Gelman 2014).
This facilitates a common-sense interpretation of statistical conclusions.
The figure shows a frequentist confidence interval (OLS model) and a Bayesian credible interval. Which one is easier to interpret?
Parameters are treated as random variables, in contrast to the frequentist point of view, in which they are fixed quantities.
These are drawn from (prior) probability distributions, which reflect our uncertainty about the parameters.
The prior distribution is updated with the likelihood (data) to obtain a posterior distribution.
Requires some familiarity with probability distributions.
According to Gelman (2014):
Set up a probability model (a joint probability distribution for the observed quantities (\(y\), \(x\)) and the latent quantities \(\theta\)).
Condition on the observed data: calculate the posterior distribution \(p(\theta | y) \propto p(y | \theta) \cdot p(\theta)\).
Blei (2014) describes data analysis as an iterative process.
He calls this Box’s loop, a modern interpretation of the perspective of the statistician George Box (Box 1976).
Let’s return to the IQ example.
Our hypothesis: the ‘smart drug’ group have higher IQ scores than the placebo group.
We obtained a p-value of 0.055 (Welch test)
(Even after removing outliers, p is > 0.05. This approach is tricky.)
We might want to compute a Bayes factor. Can we at least quantify evidence for the null hypothesis?
Not in this case. The data provide little evidence either for or against our hypothesis.
The maximally attainable Bayes factor in favour of our hypothesis is \(\sim 1.8\).
Bayes factors are not all we can do.
A benefit of going Bayesian is the ability to flexibly specify our generative model. We can make the model robust against outliers.
Let’s have another look at Bayes rule, making the dependency of the parameters \(\mathbf{\theta}\) on the model \(\mathcal{M}\) explicit:
\[ p(\theta | y, \mathcal{M}) = \frac{p(y|\theta, \mathcal{M}) p(\theta | \mathcal{M})}{p(y | \mathcal{M})}\]
where \(\mathcal{M}\) refers to a specific model. The marginal likelihood \(p(y | \mathcal{M})\) now gives the probability of the data, averaged over all possible parameter values under model \(\mathcal{M}\).
Writing out the marginal likelihood \(p(y | \mathcal{M})\): \[ p(y | \mathcal{M}) = \int{p(y | \theta, \mathcal{M}) p(\theta|\mathcal{M})d\theta}\]
we see that this is averaged over all possible values of \(\theta\) that the model will allow.
The important thing to consider here is that the model evidence will depend on what kind of predictions a model can make. This gives us a measure of complexity – a complex model is a model that can make many predictions.
The problem with making many predictions is that most of these predictions will turn out to be false.
The complexity of a model will depend, among other things, on its parameters’ prior distributions.
When a parameter’s priors are broad (uninformative), prior probability is spread thinly across the parameter space, so the regions where the likelihood is high receive little prior mass. Intuitively, by hedging one’s bets, one assigns low probability to the parameter values that make good predictions.
All this means that more complex models have comparatively lower marginal likelihoods.
Therefore, when we compare models, and we prefer models with higher marginal likelihood, we are using Ockham’s razor in a principled manner.
We can also write Bayes rule applied to a comparison between models (marginalized over all parameters within the model):
\[ p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)}\]
and
\[ p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)}\]
This tells us that for a model \(\mathcal{M}_m\), the posterior probability of the model is proportional to the marginal likelihood times the prior probability of the model.
Now, one is usually less interested in absolute evidence than in relative evidence; we want to compare the predictive performance of one model over another.
To do this, we simply form the ratio of the model probabilities:
\[ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1) / p(y)}{P(y | \mathcal{M}_2) p(\mathcal{M}_2) / p(y)}\]
The term \(p(y)\) cancels out, giving us: \[ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}\]
The ratio \(\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}\) is called the prior odds, and the ratio \(\frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)}\) is therefore the posterior odds.
We are particularly interested in the ratio of the marginal likelihoods:
\[\frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}\]
This is the Bayes factor, and it can be interpreted as the change from prior odds to posterior odds that is indicated by the data.
If we consider the prior odds to be \(1\), i.e. we do not favour one model over another a priori, then we are only interested in the Bayes factor. We write this as:
\[ BF_{12} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}\]
Here, \(BF_{12}\) indicates the extent to which the data support model \(\mathcal{M}_1\) over model \(\mathcal{M}_2\).
As an example, if we obtain a \(BF_{12} = 5\), this means that the data are 5 times more likely to have occurred under model 1 than under model 2. Conversely, if \(BF_{12} = 0.2\), then the data are 5 times more likely to have occurred under model 2.
The following classification is sometimes used, although it is unnecessary.
In order to go Bayesian, we first formulate our t-test as a GLM.
Assuming equal variances, we can write
\[ Y = \alpha + \beta X + \epsilon\] \[ \epsilon \sim N(0, \sigma^2) \]
where \(X\) indicates group membership.
## [1] "Placebo" "SmartDrug"
Using R’s formula notation, we can write this as
IQ ~ Group
(The outcome IQ is modeled as a function of Group.)
More formally:
\[ IQ = \alpha + \beta \cdot SmartDrug + \epsilon\]
\[\epsilon \sim N(0, \sigma^2) \]
Here, \(\alpha\) is the expected IQ in the placebo group, and \(SmartDrug\) is a dummy variable indicating membership in the smart drug group.
The \(\beta\) parameter therefore represents the difference between groups.
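The output below comes from an ordinary least squares fit; a call along these lines reproduces it:
fit_ols <- lm(IQ ~ Group, data = TwoGroupIQ)
summary(fit_ols)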
##
## Call:
## lm(formula = IQ ~ Group, data = TwoGroupIQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9149 -0.9149 0.0851 1.0851 22.0851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.3571 0.7263 138.184 <2e-16 ***
## GroupSmartDrug 1.5578 0.9994 1.559 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared: 0.02717, Adjusted R-squared: 0.01599
## F-statistic: 2.43 on 1 and 87 DF, p-value: 0.1227
For a Bayesian model, we need prior distributions on all parameters.
The outcome \(y\) can be modelled as a normally distributed random variable. The normal distribution has two parameters: the mean \(\mu\) and the standard deviation \(\sigma\).
Linear regression model as a probabilistic model
The graphical model shows the data-generating process (dependencies among the random variables).
\(\alpha\) is our expectation for the placebo group, \(\beta\) is our expectation for the difference in means, and \(\sigma\) is the standard deviation of the outcome.
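To see which parameters brms expects priors for, and what its default priors are, we can use get_prior(). A quick sketch:
library(brms)
get_prior(IQ ~ Group, data = TwoGroupIQ)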
Probabilistic programming languages:
WinBUGS / JAGS
Stan: Turing complete probabilistic programming language
PyMC3: Probabilistic programming in Python
Turing: Probabilistic programming language with an intuitive modelling interface.
R interface / GUI:
We can translate the graphical model into a Stan program (this is what brms does).
shaded nodes are observed (conditioned upon)
\(\mu\) is a deterministic node (double border). In the Stan program, we omit this
\(y\) depends on \(\mu\), \(\sigma\)
Whilst this is very powerful, and pretty easy for simple models, it gets tedious for complex models.
Fortunately, brms simplifies this process, by allowing us to specify a model using lme4 style formula notation.
brms then generates Stan code, and provides several convenience functions for accessing the marginal posterior distributions.
brms allows specification of non-linear models. A simple linear regression looks like this:
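A sketch of the brm() call that could produce the summary below, relying on brms defaults (four chains, 2000 iterations, default priors):
fit_eqvar <- brm(IQ ~ Group,
                 family = gaussian(),
                 data = TwoGroupIQ)
summary(fit_eqvar)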
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: IQ ~ Group
## Data: TwoGroupIQ (Number of observations: 89)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 100.35 0.72 98.92 101.74 3865 1.00
## GroupSmartDrug 1.57 0.99 -0.37 3.51 3793 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 4.76 0.36 4.13 5.53 3109 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brms sets default priors for us.
Marginal posterior distribution of the regression parameter \(\beta\):
fit_eqvar %>%
gather_draws(b_GroupSmartDrug) %>%
ggplot(aes(y = .variable, x = .value)) +
geom_halfeyeh(fill = "Steelblue4") +
geom_vline(xintercept = 0, color = "white", linetype = 1, size = 1) +
ylab("") +
xlab("Estimated difference") +
theme_classic()
We can plot the predicted marginal means and the predictive bands alongside the data.
grid <- TwoGroupIQ %>%
data_grid(Group)
fits_IQ <- grid %>%
add_fitted_draws(fit_eqvar)
preds_IQ <- grid %>%
add_predicted_draws(fit_eqvar)
pp_eqvar <- TwoGroupIQ %>%
ggplot(aes(x = IQ, y = Group)) +
geom_halfeyeh(aes(x = .value),
relative_scale = 0.7,
position = position_nudge(y = 0.1),
data = fits_IQ,
.width = c(.66, .95, 0.99)) +
stat_intervalh(aes(x = .prediction),
data = preds_IQ,
.width = c(.66, .95, 0.99)) +
scale_x_continuous(limits = c(75, 125)) +
geom_point(data = TwoGroupIQ) +
scale_color_brewer() +
labs(title = "Equal variance model predictions")
pp_eqvar
The normal equal variance model does not predict the data well. We can use a t distribution instead of a normal distribution to make the model more robust.
The t distribution has three parameters: location, (positive) scale, and a (positive) normality parameter \(\nu\). As \(\nu\) increases, the t distribution approaches a normal distribution.
We now have an extra parameter, for which we need a prior. A recommended distribution is the gamma distribution:
fit_robust_1 <- brm(bf(IQ ~ 0 + Group, sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)", class = "b"),
set_prior("cauchy(0, 1)", class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)", class = "nu")),
cores = parallel::detectCores(),
file = here::here("models/fit_iq-robust_1"))## Family: student
## Links: mu = identity; sigma = log; nu = identity
## Formula: IQ ~ 0 + Group
## sigma ~ Group
## Data: TwoGroupIQ (Number of observations: 89)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_Intercept 0.02 0.19 -0.36 0.40 4165 1.00
## GroupPlacebo 100.53 0.21 100.12 100.94 4451 1.00
## GroupSmartDrug 101.55 0.36 100.83 102.29 4053 1.00
## sigma_GroupSmartDrug 0.61 0.25 0.12 1.10 3973 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## nu 1.84 0.49 1.14 2.94 3310 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
fit_robust_1 %>%
spread_draws(b_GroupSmartDrug, b_GroupPlacebo) %>%
mutate(`SmartDrug - Placebo` = b_GroupSmartDrug - b_GroupPlacebo) %>%
gather(variable, difference, `SmartDrug - Placebo`) %>%
ggplot(aes(y = variable, x = difference)) +
geom_halfeyeh(fill = "Steelblue4") +
geom_vline(xintercept = 0, color = "white", linetype = 1, size = 1) +
ylab("") +
xlab("Estimated difference") +
theme_tidybayes()
We can use the loo package to carry out Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO).
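The comparison printed below can be obtained by passing both fitted models to loo(), along these lines:
loo(fit_eqvar, fit_robust_1)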
## LOOIC SE
## fit_eqvar 539.50 39.67
## fit_robust_1 437.11 27.72
## fit_eqvar - fit_robust_1 102.39 24.71
For the robust model, we can obtain a Bayes factor using either the Savage-Dickey density ratio, or a recently implemented approach called bridge sampling.
fit_robust_bf <- brm(bf(IQ ~ 0 + intercept + Group,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b", coef = "intercept"),
set_prior("cauchy(0, 0.707)",
class = "b", coef = "GroupSmartDrug"),
set_prior("cauchy(0, 1)",
class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
sample_prior = TRUE,
file = here::here("models/fit_iq-robust_bf"))## [1] 2.669161
fit_robust_bridge <- brm(bf(IQ ~ 0 + intercept + Group,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b", coef = "intercept"),
set_prior("cauchy(0, 0.707)",
class = "b",
coef = "GroupSmartDrug"),
set_prior("cauchy(0, 1)",
class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
save_all_pars = TRUE,
file = here::here("models/fit_iq-robust_bridge"))We have to specify a null model (leaving out the Group variable):
fit_robust_bridge_null <- brm(bf(IQ ~ 0 + intercept,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b",
coef = "intercept"),
set_prior("cauchy(0, 1)",
class = "b",
dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
save_all_pars = TRUE,
file = here::here("models/fit_iq-robust_bridge-null"))## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## [1] 3.22826
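The Bayes factor above can be computed via bridge sampling with bayes_factor(), for example:
BF_bridge <- bayes_factor(fit_robust_bridge, fit_robust_bridge_null)  # marginal likelihood ratio
BF_bridge
We can also compare the OLS estimate of the group difference with the estimate from the robust Bayesian model: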
fit_ols <- lm(IQ ~ Group,
data = TwoGroupIQ)
d <- broom::tidy(fit_ols, conf.int = TRUE) %>%
mutate(
low = estimate - std.error,
high = estimate + std.error
)
linear_results <- fit_ols %>%
broom::tidy(conf.int = TRUE) %>%
filter(term == "GroupSmartDrug") %>%
mutate(model = "OLS")
bayes_results <- fit_robust_1 %>%
spread_draws(b_GroupSmartDrug, b_GroupPlacebo) %>%
mutate(GroupSmartDrug = b_GroupSmartDrug - b_GroupPlacebo) %>%
median_qi(GroupSmartDrug) %>%
gather(term, estimate, GroupSmartDrug) %>%
to_broom_names() %>%
mutate(model = "Robust Bayesian model")
bind_rows(linear_results, bayes_results) %>%
mutate(term = term) %>%
ggplot(aes(y = term,
x = estimate,
xmin = conf.low,
xmax = conf.high,
color = model)) +
geom_pointrangeh(position = position_dodgev(height = .3)) +
geom_vline(xintercept = 0, linetype = 3,
color = "grey", alpha = 0.8) +
scale_color_viridis_d(option = "B", direction = -1,
begin = 1/3, end = 2/3) +
scale_y_discrete(breaks = NULL) +
ylab("Comparison: SmartDrug - Placebo")This demonstrates that we can either treat an effect that depends on a grouping variable as varying or non-varying. Gelman has written extensively about this.
Varying (random) parameters have an associated variance while population-level (fixed) parameters do not.
If we treat the condition variable as varying, then the estimates will be shrunk towards the overall mean, if the data provide evidence for this. Such a model is also known as a partial pooling model.
set.seed(5)
n_obs <- 30
n_condition <- 5
ABC <- tibble(condition = rep(LETTERS[1:n_condition], n_obs),
response = rnorm(n_obs * n_condition,
mean = c(-2, -1, 0, 1, 2),
sd = 0.5))
knitr::kable(head(ABC))
| condition | response |
|---|---|
| A | -2.4204277 |
| B | -0.3078203 |
| C | -0.6277459 |
| D | 1.0350714 |
| E | 2.8557204 |
| A | -2.3014540 |
ABC %>%
ggplot(aes(y = condition, x = response, color = condition)) +
geom_point() +
scale_color_viridis_d(option = "B", begin = 1/3, end = 2/3)
fit_abc_np <- brm(response ~ 0 + condition,
data = ABC,
# control = list(adapt_delta = .99),
prior = c(prior(normal(0, 10), class = b),
prior(student_t(3, 0, 1), class = sigma)),
file = here::here("models/fit_abc_no-pooling"))
summary(fit_abc_np)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: response ~ 0 + condition
## Data: ABC (Number of observations: 150)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## conditionA -1.95 0.10 -2.15 -1.76 5375 1.00
## conditionB -1.11 0.09 -1.29 -0.93 5449 1.00
## conditionC 0.07 0.09 -0.11 0.26 5613 1.00
## conditionD 1.00 0.10 0.80 1.20 4698 1.00
## conditionE 1.97 0.10 1.78 2.15 4198 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.53 0.03 0.47 0.59 4678 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
fit_abc_pp <- brm(response ~ 1 + (1|condition),
data = ABC,
# control = list(adapt_delta = .99),
prior = c(prior(normal(0, 1), class = Intercept),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma)),
file = here::here("models/fit_abc_partial-pooling"))
summary(fit_abc_pp)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: response ~ 1 + (1 | condition)
## Data: ABC (Number of observations: 150)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~condition (Number of levels: 5)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 1.58 0.51 0.88 2.83 872 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -0.03 0.58 -1.22 1.13 984 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.53 0.03 0.47 0.59 2596 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## LOOIC SE
## fit_abc_pp 237.36 14.76
## fit_abc_np 237.45 14.90
## fit_abc_pp - fit_abc_np -0.09 0.34
pp <- fit_abc_pp$data %>%
data_grid(condition) %>%
add_fitted_draws(fit_abc_pp) %>%
do(data_frame(.value = quantile(.$.value, ppoints(100)))) %>%
mutate(model = "partial pooling") %>%
bind_rows()
np <- fit_abc_np$data %>%
data_grid(condition) %>%
add_fitted_draws(fit_abc_np) %>%
do(data_frame(.value = quantile(.$.value, ppoints(100)))) %>%
mutate(model = "no pooling")
pp_np <- bind_rows(pp, np)
pp_np_summary <- pp_np %>%
group_by(condition) %>%
summarise(mean = mean(.value))
pp_np %>%
ggplot(aes(x = .value, fill = condition)) +
# geom_dotplot(binwidth = .04) +
geom_density() +
# scale_color_viridis_d(option = "A", end = 4/7) +
geom_vline(aes(xintercept = mean), data = pp_np_summary,
color = "white", linetype = 1, size = 1,
alpha = 0.8) +
scale_fill_viridis_d(option = "A", end = 6/7) +
facet_wrap(~model, nrow = 2) +
scale_y_continuous(breaks = NULL)
We have the following binary data from a line bisection task. Each of 6 subjects performed the task at two time points. We expect that performance decreases from t1 to t2.
A common approach is to compute d’ for each subject in each session, as a measure of performance.
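For reference, d’ and the response criterion c are computed from the hit rate (HR) and false-alarm rate (FAR) via the inverse normal CDF \(\Phi^{-1}\):
\[ d' = \Phi^{-1}(HR) - \Phi^{-1}(FAR), \qquad c = -\tfrac{1}{2}\left(\Phi^{-1}(HR) + \Phi^{-1}(FAR)\right)\]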
| subject | measurement | cr | fa | hit | miss | dprime | c |
|---|---|---|---|---|---|---|---|
| ALPE94 | 1 | 21 | 19 | 27 | 13 | 0.516 | -0.196 |
| ALPE94 | 2 | 25 | 15 | 33 | 7 | 1.253 | -0.308 |
| ANCH96 | 1 | 26 | 14 | 26 | 14 | 0.771 | 0.000 |
| ANCH96 | 2 | 29 | 11 | 32 | 8 | 1.439 | -0.122 |
| BRMA88 | 1 | 33 | 7 | 22 | 18 | 1.060 | 0.404 |
| BRMA88 | 2 | 31 | 9 | 29 | 11 | 1.353 | 0.079 |
| BRMA96 | 1 | 31 | 9 | 26 | 14 | 1.141 | 0.185 |
| BRMA96 | 2 | 30 | 10 | 22 | 18 | 0.800 | 0.274 |
| REER95 | 1 | 25 | 15 | 31 | 9 | 1.074 | -0.218 |
| REER95 | 2 | 25 | 15 | 31 | 9 | 1.074 | -0.218 |
| URST96 | 1 | 30 | 10 | 27 | 13 | 1.128 | 0.110 |
| URST96 | 2 | 28 | 12 | 30 | 10 | 1.199 | -0.075 |
DeCarlo (1998) and Knoblauch and Maloney (2012) describe how d’ can be computed using a type of generalized linear model (probit regression).
Here, we implement a Bayesian multilevel probit regression model.
In order to obtain d’ as a parameter, we use an effect coding for the stimulus:
PilotData %>%
select(subject, measurement, offset, offsetnum, sayright) %>%
head() %>%
knitr::kable()
| subject | measurement | offset | offsetnum | sayright |
|---|---|---|---|---|
| ALPE94 | 1 | left | -0.5 | 0 |
| ALPE94 | 1 | right | 0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 0 |
| ALPE94 | 1 | right | 0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 0 |
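With the stimulus coded as \(\pm 0.5\) (offsetnum), the probit model \(P(\mathrm{sayright}) = \Phi(\beta_0 + \beta_1 \cdot \mathrm{offsetnum})\) gives \(\Phi^{-1}(HR) = \beta_0 + \tfrac{1}{2}\beta_1\) and \(\Phi^{-1}(FAR) = \beta_0 - \tfrac{1}{2}\beta_1\), so that
\[ d' = \beta_1, \qquad c = -\beta_0\]
That is, the slope on offsetnum is d’, and the negative intercept is the criterion (cf. DeCarlo 1998). In the model below, these are estimated separately for each measurement, with subject-level variation.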
fit_dprime_c <- brm(sayright ~ 0 + measurement + measurement:offsetnum +
(0 + measurement + measurement:offsetnum | subject),
family = bernoulli("probit"),
data = PilotData,
control = list(adapt_delta = .99),
prior = c(prior(normal(1, 1), class = b),
prior(student_t(3, 0, 1), class = sd)),
inits = 0,
file = here::here("models/fit_dprime-c"))To check whether the model captures the important features of the data, we generate 500 data sets from the posterior predictive distribution.
Next, we calculate d’ for each dataset, and then plot the distribution of d’. We overlay this with the d’ computed from the actual data.
preds_dprime_c <- fit_dprime_c$data %>%
select(subject, measurement, offsetnum) %>%
add_predicted_draws(model = fit_dprime_c, n = 500)
sdt_pred <- preds_dprime_c %>%
group_by(subject, measurement, .draw) %>%
mutate(type = "hit",
type = ifelse(offsetnum > 0 & .prediction == 0, "miss", type),
type = ifelse(offsetnum < 0 & .prediction == 0, "cr", type),
type = ifelse(offsetnum < 0 & .prediction == 1, "fa", type)) %>%
ungroup()
sdt_pred <- sdt_pred %>%
group_by(subject, measurement, .draw, type) %>%
summarise(count = n()) %>%
spread(type, count)
sdt_pred <- sdt_pred %>%
mutate(zhr = qnorm(hit / (hit+miss)),
zfa = qnorm(fa / (fa+cr)),
dprime = zhr-zfa,
crit = -zfa,
c = -0.5*(zhr+zfa),
beta = exp(0.5 * (zfa^2 - zhr^2))) %>%
mutate_if(is.numeric, round, 3)
sdt_pred %>%
group_by(subject, measurement) %>%
median_qi(dprime, .width = c(.80, .95), na.rm = TRUE) %>%
ggplot(aes(y = dprime, x = measurement)) +
geom_pointinterval(fatten_point = 2,
size_range = c(0.6, 2),
color = "black") +
geom_point(aes(y = dprime, x = measurement),
data = sdt,
shape = 21, color = "black",
fill = "pink", stroke = 1, size = 3) +
geom_hline(yintercept = 0, linetype = 3, color = "grey", alpha = 0.9) +
facet_wrap(~subject)
coefs <- purrr::array_tree(coef(object = fit_dprime_c,
summary = FALSE)$subject[,,],
margin = 3)
f <- function(x) {
x <- as_tibble(x)
x <- gather(x, key = subject, value = dprime)
x
}
coefs <- coefs %>% map_df(f, .id = "measurement") %>%
filter(str_detect(measurement, 'offsetnum')) %>%
mutate(subject = as_factor(subject),
measurement = str_extract(measurement, "(\\d)+"),
measurement = as_factor(measurement))
dprimes <- coefs %>%
group_by(subject, measurement) %>%
median_qi(.width = c(.95, .66))
compare_d_primes <- sdt %>%
mutate(point_estimate = dprime) %>%
select(subject, measurement, point_estimate) %>%
left_join(dprimes)
p_dprime <- coefs %>%
ggplot(aes(y = measurement, x = dprime)) +
geom_vline(xintercept = 0, linetype = 2, color = "grey",
size = 1, alpha = 0.9) +
geom_halfeyeh(fill = "steelblue4", alpha = 0.4) +
geom_point(aes(x = point_estimate),
data = compare_d_primes,
color = 'red') +
facet_wrap(~subject) +
theme_tidybayes()
For comparison, the red data points represent the d’ values computed from the data.
We could extend this by implementing a non-linear model that can account for attentional lapses and guessing (Kuss, Jäkel, and Wichmann 2005). This is equivalent to a binomial mixture model, in which subjects lapse with probability \(\lambda\) and pay attention with probability \(1-\lambda\). If they lapse, they have to guess, otherwise they respond as in the previous model.
BF <- bf(sayright ~ guess + (1-guess-lapse) * Phi(eta),
eta ~ 0 + measurement:offsetnum +
(0 + measurement:offsetnum |s| subject),
guess ~ 1,
lapse ~ (1 |s| subject),
family = bernoulli(link="identity"),
nl = TRUE)
prior_guessing <- c(
prior(normal(1, 1), class = "b", nlpar = "eta"),
prior(beta(1, 1), nlpar = "lapse", lb = 0, ub = 1),
prior(beta(1, 1), nlpar = "guess", lb = 0, ub = 1))
fit_dprime_guessing <- brm(BF,
data = PilotData,
control = list(adapt_delta = .99),
inits = 0,
prior = prior_guessing,
file = here::here("models/fit_dprime-guessing"))The model can also be implemented as a mixture:
mix <- mixture(bernoulli("probit"), bernoulli("probit"))
p_mix <- c(
prior(normal(0, 0.1), class = Intercept, dpar = mu1),
prior(normal(0, 1), class = "b", dpar = mu2))
fit_mix <- brm(bf(y ~ 1, mu1 ~ 1, mu2 ~ x),
data = PilotData,
family = mix,
prior = p_mix)
## LOOIC SE
## fit_dprime_guessing 1178.93 23.91
## fit_dprime_c 1179.04 24.90
## fit_dprime_guessing - fit_dprime_c -0.11 4.15
A model comparison shows that the lapse model’s predictive accuracy is no better than that of the model without lapses and guessing.
Finally, we can predict data for new subjects: since we have a multilevel model, each subject’s parameters are drawn from population-level distributions, and we have an estimate of the between-subject variance.
preds_new_subjects <- fit_dprime_c$data %>%
data_grid(subject = str_c("subject_", 1:6),
measurement,
offsetnum) %>%
add_predicted_draws(model = fit_dprime_c,
allow_new_levels = TRUE,
n = 100)
sdt_pred_new <- preds_new_subjects %>%
group_by(subject, measurement, .draw) %>%
mutate(type = "hit",
type = ifelse(offsetnum > 0 & .prediction == 0, "miss", type),
type = ifelse(offsetnum < 0 & .prediction == 0, "cr", type),
type = ifelse(offsetnum < 0 & .prediction == 1, "fa", type)) %>%
ungroup()
sdt_pred_new <- sdt_pred_new %>%
group_by(subject, measurement, type) %>%
summarise(count = n()) %>%
spread(type, count)
sdt_pred_new <- sdt_pred_new %>%
mutate(zhr = qnorm(hit / (hit+miss)),
zfa = qnorm(fa / (fa+cr)),
hit_rate = hit / (hit+miss),
fa_rate = fa / (fa+cr),
dprime = zhr-zfa,
crit = -zfa,
c = -0.5*(zhr+zfa),
beta = exp(0.5 * (zfa^2 - zhr^2))) %>%
mutate_if(is.numeric, round, 3)
sdt_pred_new %>%
group_by(subject, measurement) %>%
median_qi(dprime, .width = c(.80, .95), na.rm = TRUE) %>%
ggplot(aes(y = dprime, x = measurement)) +
geom_pointinterval(fatten_point = 1,
size_range = c(0.6, 2),
color = "black") +
geom_hline(yintercept = 0, linetype = 3, color = "grey", alpha = 0.9) +
facet_wrap(~subject)
Statistical Rethinking (an introduction to applied Bayesian data analysis with lots of example code): book website and lectures on YouTube
brms: R package for Bayesian generalized multivariate non-linear multilevel models using Stan
Blog post by Jonas Lindeloev on how to compute Bayes factors using various methods.
Blog post by Matti Vuorre on how to perform a Bayesian t-test using brms
Blog post by A. Solomon Kurz on how to perform robust regression.
Blog post by Andrew Heiss: Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups.
Blei, David M. 2014. “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application 1 (1): 203–32. https://doi.org/10.1146/annurev-statistics-022513-115657.
Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99. https://doi.org/10.1080/01621459.1976.10480949.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
DeCarlo, Lawrence T. 1998. “Signal Detection Theory and Generalized Linear Models.” Psychological Methods 3 (2): 186–205.
Gelman, Andrew. 2014. Bayesian Data Analysis. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
———. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50. https://doi.org/10.1007/s10654-016-0149-3.
Knoblauch, Kenneth, and Laurence T. Maloney. 2012. Modeling Psychophysical Data in R. Use R! New York: Springer-Verlag.
Kruschke, John K. 2013. “Bayesian Estimation Supersedes the T Test.” Journal of Experimental Psychology: General 142 (2): 573–603. https://doi.org/10.1037/a0029146.
Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Kuss, Malte, Frank Jäkel, and Felix A. Wichmann. 2005. “Bayesian Inference for Psychometric Functions.” Journal of Vision 5 (5): 8–8. https://doi.org/10.1167/5.5.8.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. 1st ed. Cambridge ; New York: Cambridge University Press. https://doi.org/10.1017/CBO9781139087759.