Andrew Ellis
2019-02-20
Introduction
Bayesian methods
Bayesian analysis workflow
Probabilistic model how-to
Multilevel models: partial pooling vs. no pooling
Model zoo: signal detection, attentional lapses
Summary: Two groups of people took an IQ test.
Group 1 (\(N_1=47\)) consumed a “smart drug”, and Group 2 (\(N_2=42\)) was a control group that consumed a placebo (Kruschke 2013).
The group means are:
| Group | mean |
|---|---|
| Placebo | 100.36 |
| SmartDrug | 101.91 |
It is obvious that the data contain several ‘outliers’.
We can perform a two-sample t-test, or a Welch test:
##
## Two Sample t-test
##
## data: IQ by Group
## t = -1.5587, df = 87, p-value = 0.06135
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.1037991
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
##
## Welch Two Sample t-test
##
## data: IQ by Group
## t = -1.6222, df = 63.039, p-value = 0.05488
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 0.04532157
## sample estimates:
## mean in group Placebo mean in group SmartDrug
## 100.3571 101.9149
Problem: Neither yields a significant result. What do we do?
Statement by the American Statistical Association (ASA) about p-values (Lazar 2016):
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true (we would actually like to know this), or the probability that the data were produced by chance.
Greenland et al. (2016) provide a good discussion of common misinterpretations of p values and confidence intervals.
Cumming (2014): We need to shift from reliance on NHST to estimation and other techniques.
Kruschke and Liddell (2018): Bayesian methods are better suited for this, for both hypothesis testing and parameter estimation.
According to Gigerenzer (2004) and Gigerenzer (2018), we need to stop relying on NHST (mindless statistics), but instead learn to use a whole statistical toolkit.
Many reviewers now demand Bayes factors (because a BF can provide evidence for/against hypotheses).
However: Bayesian data analysis is not limited to calculating Bayes factors.
🤗
more intuitive quantification of uncertainty, based on probability theory
provide evidence for/against hypotheses
more flexible: robust models and cognitive process models (Lee and Wagenmakers 2014)
can include prior knowledge
better for multilevel models (Gelman and Hill 2006)
😧
The essential characteristic of Bayesian methods is their use of probability for quantifying uncertainty in inferences based on data (Gelman 2014).
This facilitates a common-sense interpretation of statistical conclusions.
The figure shows a frequentist confidence interval (OLS model) and a Bayesian credible interval. Which one is easier to interpret?
Parameters are treated as random variables, in contrast to the frequentist point of view, in which they are fixed quantities.
These are drawn from (prior) probability distributions, which reflect our uncertainty about the parameters.
The prior distribution is updated with the likelihood (data) to obtain a posterior distribution.
Requires some familiarity with probability distributions.
According to Gelman (2014):
Set up a probability model (a joint probability distribution for the observed quantities (\(y\), \(x\)) and the latent quantities \(\theta\)).
Condition on the observed data: calculate the posterior distribution \(p(\theta | y) \propto p(y | \theta) \cdot p(\theta)\).
Blei (2014) describes data analysis as an iterative process.
He calls this Box’s loop, a modern interpretation of the perspective of the statistician George Box (Box 1976).
Let’s return to the IQ example.
Our hypothesis: the ‘smart drug’ group have higher IQ scores than the placebo group.
We obtained a p-value of 0.055 (Welch test)
(Even after removing outliers, p is > 0.05. This approach is tricky.)
We might want to compute a Bayes factor. Can we at least quantify evidence for the null hypothesis?
Not in this case. The data provide little evidence either for or against our hypothesis.
The maximally attainable Bayes factor in favour of our hypothesis is \(\sim 1.8\).
Bayes factors are not all we can do.
A benefit of going Bayesian is the ability to flexibly specify our generative model. We can make the model robust against outliers.
Let’s have another look at Bayes rule, making the dependency of the parameters \(\mathbf{\theta}\) on the model \(\mathcal{M}\) explicit:
\[ p(\theta | y, \mathcal{M}) = \frac{p(y|\theta, \mathcal{M}) p(\theta | \mathcal{M})}{p(y | \mathcal{M})}\]
where \(\mathcal{M}\) refers to a specific model. The marginal likelihood \(p(y | \mathcal{M})\) now gives the probability of the data, averaged over all possible parameter values under model \(\mathcal{M}\).
Writing out the marginal likelihood \(p(y | \mathcal{M})\): \[ p(y | \mathcal{M}) = \int{p(y | \theta, \mathcal{M}) p(\theta|\mathcal{M})d\theta}\]
we see that this is averaged over all possible values of \(\theta\) that the model will allow.
The important thing to consider here is that the model evidence will depend on what kind of predictions a model can make. This gives us a measure of complexity – a complex model is a model that can make many predictions.
The problem with making many predictions is that most of these predictions will turn out to be false.
The complexity of a model will depend, among other things, on its parameters’ prior distributions.
When a parameter’s priors are broad (uninformative), prior probability is spread thinly across the parameter space, so the regions where the likelihood is high receive little prior mass. Intuitively, by hedging one’s bets, one assigns low probability to the parameter values that make good predictions.
All this means that more complex models have comparatively lower marginal likelihoods.
Therefore, when we compare models, and we prefer models with higher marginal likelihood, we are using Ockham’s razor in a principled manner.
We can also write Bayes rule applied to a comparison between models (marginalized over all parameters within the model):
\[ p(\mathcal{M}_1 | y) = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{p(y)}\]
and
\[ p(\mathcal{M}_2 | y) = \frac{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}{p(y)}\]
This tells us that for a model \(\mathcal{M}_m\), the posterior probability of the model is proportional to the marginal likelihood times the prior probability of the model.
Now, one is usually less interested in absolute evidence than in relative evidence; we want to compare the predictive performance of one model over another.
To do this, we simply form the ratio of the model probabilities:
\[ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1) / p(y)}{P(y | \mathcal{M}_2) p(\mathcal{M}_2) / p(y)}\]
The term \(p(y)\) cancels out, giving us: \[ \frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)} = \frac{P(y | \mathcal{M}_1) p(\mathcal{M}_1)}{P(y | \mathcal{M}_2) p(\mathcal{M}_2)}\]
The ratio \(\frac{p(\mathcal{M}_1)}{p(\mathcal{M}_2)}\) is called the prior odds, and the ratio \(\frac{p(\mathcal{M}_1 | y)}{p(\mathcal{M}_2 | y)}\) is therefore the posterior odds.
We are particularly interested in the ratio of the marginal likelihoods:
\[\frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}\]
This is the Bayes factor, and it can be interpreted as the change from prior odds to posterior odds that is indicated by the data.
If we consider the prior odds to be \(1\), i.e. we do not favour one model over another a priori, then we are only interested in the Bayes factor. We write this as:
\[ BF_{12} = \frac{P(y | \mathcal{M}_1)}{P(y | \mathcal{M}_2)}\]
Here, \(BF_{12}\) indicates the extent to which the data support model \(\mathcal{M}_1\) over model \(\mathcal{M}_2\).
As an example, if we obtain a \(BF_{12} = 5\), this means that the data are 5 times more likely to have occurred under model 1 than under model 2. Conversely, if \(BF_{12} = 0.2\), then the data are 5 times more likely to have occurred under model 2.
The following classification is sometimes used, although it is unnecessary.
In order to go Bayesian, we first formulate our t-test as a GLM.
Assuming equal variances, we can write
\[ Y = \alpha + \beta X + \epsilon\] \[ \epsilon \sim N(0, \sigma^2) \]
where \(X\) indicates group membership.
## [1] "Placebo" "SmartDrug"
Using R’s formula notation, we can write this as
IQ ~ Group
(The outcome IQ is modeled as a function of Group.)
More formally:
\[ IQ = \alpha + \beta \cdot SmartDrug + \epsilon\]
\[\epsilon \sim N(0, \sigma^2) \]
Here, \(\alpha\) is the expected IQ in the placebo group, and \(SmartDrug\) is a dummy variable indicating membership in the smart drug group.
The \(\beta\) parameter therefore represents the difference between groups.
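The output below comes from an ordinary least squares fit; a call along these lines reproduces it:
fit_ols <- lm(IQ ~ Group, data = TwoGroupIQ)
summary(fit_ols)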
##
## Call:
## lm(formula = IQ ~ Group, data = TwoGroupIQ)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.9149 -0.9149 0.0851 1.0851 22.0851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 100.3571 0.7263 138.184 <2e-16 ***
## GroupSmartDrug 1.5578 0.9994 1.559 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.707 on 87 degrees of freedom
## Multiple R-squared: 0.02717, Adjusted R-squared: 0.01599
## F-statistic: 2.43 on 1 and 87 DF, p-value: 0.1227
For a Bayesian model, we need prior distributions on all parameters.
The outcome \(y\) can be modelled as a normally distributed random variable. The normal distribution has two parameters: the mean \(\mu\) and the standard deviation \(\sigma\).
Linear regression model as a probabilistic model
The graphical model shows the data-generating process (dependencies among the random variables).
\(\alpha\) is our expectation for the placebo group, \(\beta\) is our expectation for the difference in means, and \(\sigma\) is the standard deviation of the outcome.
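To see which parameters brms expects priors for, and what its default priors are, we can use get_prior(). A quick sketch:
library(brms)
get_prior(IQ ~ Group, data = TwoGroupIQ)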
Probabilistic programming languages:
WinBUGS / JAGS
Stan: Turing complete probabilistic programming language
PyMC3: Probabilistic programming in Python
Turing: Probabilistic programming language with an intuitive modelling interface.
R interface / GUI:
We can translate the graphical model into a Stan program (this is what brms does).
shaded nodes are observed (conditioned upon)
\(\mu\) is a deterministic node (double border). In the Stan program, we omit this
\(y\) depends on \(\mu\), \(\sigma\)
Whilst this is very powerful, and pretty easy for simple models, it gets tedious for complex models.
Fortunately, brms simplifies this process, by allowing us to specify a model using lme4 style formula notation.
brms then generates Stan code, and provides several convenience functions for accessing the marginal posterior distributions.
brms allows specification of non-linear models. A simple linear regression looks like this:
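A sketch of the brm() call that could produce the summary below, relying on brms defaults (four chains, 2000 iterations, default priors):
fit_eqvar <- brm(IQ ~ Group,
                 family = gaussian(),
                 data = TwoGroupIQ)
summary(fit_eqvar)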
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: IQ ~ Group
## Data: TwoGroupIQ (Number of observations: 89)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept 100.35 0.72 98.92 101.74 3865 1.00
## GroupSmartDrug 1.57 0.99 -0.37 3.51 3793 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 4.76 0.36 4.13 5.53 3109 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
brms sets default priors for us.
Marginal posterior distribution of the regression parameter \(\beta\):
fit_eqvar %>%
gather_draws(b_GroupSmartDrug) %>%
ggplot(aes(y = .variable, x = .value)) +
geom_halfeyeh(fill = "Steelblue4") +
geom_vline(xintercept = 0, color = "white", linetype = 1, size = 1) +
ylab("") +
xlab("Estimated difference") +
theme_classic()
We can plot the predicted marginal means and the predictive bands alongside the data.
grid <- TwoGroupIQ %>%
data_grid(Group)
fits_IQ <- grid %>%
add_fitted_draws(fit_eqvar)
preds_IQ <- grid %>%
add_predicted_draws(fit_eqvar)
pp_eqvar <- TwoGroupIQ %>%
ggplot(aes(x = IQ, y = Group)) +
geom_halfeyeh(aes(x = .value),
relative_scale = 0.7,
position = position_nudge(y = 0.1),
data = fits_IQ,
.width = c(.66, .95, 0.99)) +
stat_intervalh(aes(x = .prediction),
data = preds_IQ,
.width = c(.66, .95, 0.99)) +
scale_x_continuous(limits = c(75, 125)) +
geom_point(data = TwoGroupIQ) +
scale_color_brewer() +
labs(title = "Equal variance model predictions")
pp_eqvar
The normal equal variance model does not predict the data well. We can use a t distribution instead of a normal distribution to make the model more robust.
The t distribution has three parameters: location, (positive) scale, and a (positive) normality parameter \(\nu\). As \(\nu\) increases, the t distribution approaches a normal distribution.
We now have an extra parameter, for which we need a prior. A recommended distribution is the gamma distribution:
fit_robust_1 <- brm(bf(IQ ~ 0 + Group, sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)", class = "b"),
set_prior("cauchy(0, 1)", class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)", class = "nu")),
cores = parallel::detectCores(),
file = here::here("models/fit_iq-robust_1"))## Family: student
## Links: mu = identity; sigma = log; nu = identity
## Formula: IQ ~ 0 + Group
## sigma ~ Group
## Data: TwoGroupIQ (Number of observations: 89)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma_Intercept 0.02 0.19 -0.36 0.40 4165 1.00
## GroupPlacebo 100.53 0.21 100.12 100.94 4451 1.00
## GroupSmartDrug 101.55 0.36 100.83 102.29 4053 1.00
## sigma_GroupSmartDrug 0.61 0.25 0.12 1.10 3973 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## nu 1.84 0.49 1.14 2.94 3310 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
fit_robust_1 %>%
spread_draws(b_GroupSmartDrug, b_GroupPlacebo) %>%
mutate(`SmartDrug - Placebo` = b_GroupSmartDrug - b_GroupPlacebo) %>%
gather(variable, difference, `SmartDrug - Placebo`) %>%
ggplot(aes(y = variable, x = difference)) +
geom_halfeyeh(fill = "Steelblue4") +
geom_vline(xintercept = 0, color = "white", linetype = 1, size = 1) +
ylab("") +
xlab("Estimated difference") +
theme_tidybayes()
We can use the loo package to carry out Pareto smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO).
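The comparison printed below can be obtained by passing both fitted models to loo(), along these lines:
loo(fit_eqvar, fit_robust_1)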
## LOOIC SE
## fit_eqvar 539.50 39.67
## fit_robust_1 437.11 27.72
## fit_eqvar - fit_robust_1 102.39 24.71
For the robust model, we can obtain a Bayes factor using either the Savage-Dickey density ratio, or a recently implemented approach called bridge sampling.
fit_robust_bf <- brm(bf(IQ ~ 0 + intercept + Group,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b", coef = "intercept"),
set_prior("cauchy(0, 0.707)",
class = "b", coef = "GroupSmartDrug"),
set_prior("cauchy(0, 1)",
class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
sample_prior = TRUE,
file = here::here("models/fit_iq-robust_bf"))## [1] 2.669161
fit_robust_bridge <- brm(bf(IQ ~ 0 + intercept + Group,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b", coef = "intercept"),
set_prior("cauchy(0, 0.707)",
class = "b",
coef = "GroupSmartDrug"),
set_prior("cauchy(0, 1)",
class = "b", dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
save_all_pars = TRUE,
file = here::here("models/fit_iq-robust_bridge"))We have to specify a null model (leaving out the Group variable):
fit_robust_bridge_null <- brm(bf(IQ ~ 0 + intercept,
sigma ~ Group),
family = student,
data = TwoGroupIQ,
prior = c(set_prior("normal(100, 10)",
class = "b",
coef = "intercept"),
set_prior("cauchy(0, 1)",
class = "b",
dpar = "sigma"),
set_prior("gamma(2, 0.1)",
class = "nu")),
cores = parallel::detectCores(),
save_all_pars = TRUE,
file = here::here("models/fit_iq-robust_bridge-null"))## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## Iteration: 1
## Iteration: 2
## Iteration: 3
## Iteration: 4
## Iteration: 5
## [1] 3.22826
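The Bayes factor above can be computed via bridge sampling with bayes_factor(), for example:
BF_bridge <- bayes_factor(fit_robust_bridge, fit_robust_bridge_null)  # marginal likelihood ratio
BF_bridge
We can also compare the OLS estimate of the group difference with the estimate from the robust Bayesian model: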
fit_ols <- lm(IQ ~ Group,
data = TwoGroupIQ)
d <- broom::tidy(fit_ols, conf.int = TRUE) %>%
mutate(
low = estimate - std.error,
high = estimate + std.error
)
linear_results <- fit_ols %>%
broom::tidy(conf.int = TRUE) %>%
filter(term == "GroupSmartDrug") %>%
mutate(model = "OLS")
bayes_results <- fit_robust_1 %>%
spread_draws(b_GroupSmartDrug, b_GroupPlacebo) %>%
mutate(GroupSmartDrug = b_GroupSmartDrug - b_GroupPlacebo) %>%
median_qi(GroupSmartDrug) %>%
gather(term, estimate, GroupSmartDrug) %>%
to_broom_names() %>%
mutate(model = "Robust Bayesian model")
bind_rows(linear_results, bayes_results) %>%
mutate(term = term) %>%
ggplot(aes(y = term,
x = estimate,
xmin = conf.low,
xmax = conf.high,
color = model)) +
geom_pointrangeh(position = position_dodgev(height = .3)) +
geom_vline(xintercept = 0, linetype = 3,
color = "grey", alpha = 0.8) +
scale_color_viridis_d(option = "B", direction = -1,
begin = 1/3, end = 2/3) +
scale_y_discrete(breaks = NULL) +
ylab("Comparison: SmartDrug - Placebo")This demonstrates that we can either treat an effect that depends on a grouping variable as varying or non-varying. Gelman has written extensively about this.
Varying (random) parameters have an associated variance while population-level (fixed) parameters do not.
If we treat the condition variable as varying, then the estimates will be shrunk towards the overall mean, if the data provide evidence for this. Such a model is also known as a partial pooling model.
set.seed(5)
n_obs <- 30
n_condition <- 5
ABC <- tibble(condition = rep(LETTERS[1:n_condition], n_obs),
response = rnorm(n_obs * n_condition,
mean = c(-2, -1, 0, 1, 2),
sd = 0.5))
knitr::kable(head(ABC))
| condition | response |
|---|---|
| A | -2.4204277 |
| B | -0.3078203 |
| C | -0.6277459 |
| D | 1.0350714 |
| E | 2.8557204 |
| A | -2.3014540 |
ABC %>%
ggplot(aes(y = condition, x = response, color = condition)) +
geom_point() +
scale_color_viridis_d(option = "B", begin = 1/3, end = 2/3)
fit_abc_np <- brm(response ~ 0 + condition,
data = ABC,
# control = list(adapt_delta = .99),
prior = c(prior(normal(0, 10), class = b),
prior(student_t(3, 0, 1), class = sigma)),
file = here::here("models/fit_abc_no-pooling"))
summary(fit_abc_np)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: response ~ 0 + condition
## Data: ABC (Number of observations: 150)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## conditionA -1.95 0.10 -2.15 -1.76 5375 1.00
## conditionB -1.11 0.09 -1.29 -0.93 5449 1.00
## conditionC 0.07 0.09 -0.11 0.26 5613 1.00
## conditionD 1.00 0.10 0.80 1.20 4698 1.00
## conditionE 1.97 0.10 1.78 2.15 4198 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.53 0.03 0.47 0.59 4678 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
fit_abc_pp <- brm(response ~ 1 + (1|condition),
data = ABC,
# control = list(adapt_delta = .99),
prior = c(prior(normal(0, 1), class = Intercept),
prior(student_t(3, 0, 1), class = sd),
prior(student_t(3, 0, 1), class = sigma)),
file = here::here("models/fit_abc_partial-pooling"))
summary(fit_abc_pp)
## Family: gaussian
## Links: mu = identity; sigma = identity
## Formula: response ~ 1 + (1 | condition)
## Data: ABC (Number of observations: 150)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~condition (Number of levels: 5)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 1.58 0.51 0.88 2.83 872 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -0.03 0.58 -1.22 1.13 984 1.00
##
## Family Specific Parameters:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sigma 0.53 0.03 0.47 0.59 2596 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
## LOOIC SE
## fit_abc_pp 237.36 14.76
## fit_abc_np 237.45 14.90
## fit_abc_pp - fit_abc_np -0.09 0.34
pp <- fit_abc_pp$data %>%
data_grid(condition) %>%
add_fitted_draws(fit_abc_pp) %>%
do(data_frame(.value = quantile(.$.value, ppoints(100)))) %>%
mutate(model = "partial pooling") %>%
bind_rows()
np <- fit_abc_np$data %>%
data_grid(condition) %>%
add_fitted_draws(fit_abc_np) %>%
do(data_frame(.value = quantile(.$.value, ppoints(100)))) %>%
mutate(model = "no pooling")
pp_np <- bind_rows(pp, np)
pp_np_summary <- pp_np %>%
group_by(condition) %>%
summarise(mean = mean(.value))
pp_np %>%
ggplot(aes(x = .value, fill = condition)) +
# geom_dotplot(binwidth = .04) +
geom_density() +
# scale_color_viridis_d(option = "A", end = 4/7) +
geom_vline(aes(xintercept = mean), data = pp_np_summary,
color = "white", linetype = 1, size = 1,
alpha = 0.8) +
scale_fill_viridis_d(option = "A", end = 6/7) +
facet_wrap(~model, nrow = 2) +
scale_y_continuous(breaks = NULL)
We have the following binary data from a line bisection task. Each of 6 subjects performed the task at two time points. We expect that performance decreases from t1 to t2.
A common approach is to compute d’ for each subject in each session, as a measure of performance.
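For reference, d’ and the response criterion c are computed from the hit rate (HR) and false-alarm rate (FAR) via the inverse normal CDF \(\Phi^{-1}\):
\[ d' = \Phi^{-1}(HR) - \Phi^{-1}(FAR), \qquad c = -\tfrac{1}{2}\left(\Phi^{-1}(HR) + \Phi^{-1}(FAR)\right)\]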
| subject | measurement | cr | fa | hit | miss | dprime | c |
|---|---|---|---|---|---|---|---|
| ALPE94 | 1 | 21 | 19 | 27 | 13 | 0.516 | -0.196 |
| ALPE94 | 2 | 25 | 15 | 33 | 7 | 1.253 | -0.308 |
| ANCH96 | 1 | 26 | 14 | 26 | 14 | 0.771 | 0.000 |
| ANCH96 | 2 | 29 | 11 | 32 | 8 | 1.439 | -0.122 |
| BRMA88 | 1 | 33 | 7 | 22 | 18 | 1.060 | 0.404 |
| BRMA88 | 2 | 31 | 9 | 29 | 11 | 1.353 | 0.079 |
| BRMA96 | 1 | 31 | 9 | 26 | 14 | 1.141 | 0.185 |
| BRMA96 | 2 | 30 | 10 | 22 | 18 | 0.800 | 0.274 |
| REER95 | 1 | 25 | 15 | 31 | 9 | 1.074 | -0.218 |
| REER95 | 2 | 25 | 15 | 31 | 9 | 1.074 | -0.218 |
| URST96 | 1 | 30 | 10 | 27 | 13 | 1.128 | 0.110 |
| URST96 | 2 | 28 | 12 | 30 | 10 | 1.199 | -0.075 |
DeCarlo (1998) and Knoblauch and Maloney (2012) describe how d’ can be computed using a type of generalized linear model (probit regression).
Here, we implement a Bayesian multilevel probit regression model.
In order to obtain d’ as a parameter, we use an effect coding for the stimulus:
PilotData %>%
select(subject, measurement, offset, offsetnum, sayright) %>%
head() %>%
knitr::kable()
| subject | measurement | offset | offsetnum | sayright |
|---|---|---|---|---|
| ALPE94 | 1 | left | -0.5 | 0 |
| ALPE94 | 1 | right | 0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 0 |
| ALPE94 | 1 | right | 0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 1 |
| ALPE94 | 1 | left | -0.5 | 0 |
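With the stimulus coded as \(\pm 0.5\) (offsetnum), the probit model \(P(\mathrm{sayright}) = \Phi(\beta_0 + \beta_1 \cdot \mathrm{offsetnum})\) gives \(\Phi^{-1}(HR) = \beta_0 + \tfrac{1}{2}\beta_1\) and \(\Phi^{-1}(FAR) = \beta_0 - \tfrac{1}{2}\beta_1\), so that
\[ d' = \beta_1, \qquad c = -\beta_0\]
That is, the slope on offsetnum is d’, and the negative intercept is the criterion (cf. DeCarlo 1998). In the model below, these are estimated separately for each measurement, with subject-level variation.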
fit_dprime_c <- brm(sayright ~ 0 + measurement + measurement:offsetnum +
(0 + measurement + measurement:offsetnum | subject),
family = bernoulli("probit"),
data = PilotData,
control = list(adapt_delta = .99),
prior = c(prior(normal(1, 1), class = b),
prior(student_t(3, 0, 1), class = sd)),
inits = 0,
file = here::here("models/fit_dprime-c"))To check whether the model captures the important features of the data, we generate 500 data sets from the posterior predictive distribution.
Next, we calculate d’ for each dataset, and then plot the distribution of d’. We overlay this with the d’ computed from the actual data.
preds_dprime_c <- fit_dprime_c$data %>%
select(subject, measurement, offsetnum) %>%
add_predicted_draws(model = fit_dprime_c, n = 500)
sdt_pred <- preds_dprime_c %>%
group_by(subject, measurement, .draw) %>%
mutate(type = "hit",
type = ifelse(offsetnum > 0 & .prediction == 0, "miss", type),
type = ifelse(offsetnum < 0 & .prediction == 0, "cr", type),
type = ifelse(offsetnum < 0 & .prediction == 1, "fa", type)) %>%
ungroup()
sdt_pred <- sdt_pred %>%
group_by(subject, measurement, .draw, type) %>%
summarise(count = n()) %>%
spread(type, count)
sdt_pred <- sdt_pred %>%
mutate(zhr = qnorm(hit / (hit+miss)),
zfa = qnorm(fa / (fa+cr)),
dprime = zhr-zfa,
crit = -zfa,
c = -0.5*(zhr+zfa),
beta = exp(0.5 * (zfa^2 - zhr^2))) %>%
mutate_if(is.numeric, round, 3)
sdt_pred %>%
group_by(subject, measurement) %>%
median_qi(dprime, .width = c(.80, .95), na.rm = TRUE) %>%
ggplot(aes(y = dprime, x = measurement)) +
geom_pointinterval(fatten_point = 2,
size_range = c(0.6, 2),
color = "black") +
geom_point(aes(y = dprime, x = measurement),
data = sdt,
shape = 21, color = "black",
fill = "pink", stroke = 1, size = 3) +
geom_hline(yintercept = 0, linetype = 3, color = "grey", alpha = 0.9) +
facet_wrap(~subject)
coefs <- purrr::array_tree(coef(object = fit_dprime_c,
summary = FALSE)$subject[,,],
margin = 3)
f <- function(x) {
x <- as_tibble(x)
x <- gather(x, key = subject, value = dprime)
x
}
coefs <- coefs %>% map_df(f, .id = "measurement") %>%
filter(str_detect(measurement, 'offsetnum')) %>%
mutate(subject = as_factor(subject),
measurement = str_extract(measurement, "(\\d)+"),
measurement = as_factor(measurement))
dprimes <- coefs %>%
group_by(subject, measurement) %>%
median_qi(.width = c(.95, .66))
compare_d_primes <- sdt %>%
mutate(point_estimate = dprime) %>%
select(subject, measurement, point_estimate) %>%
left_join(dprimes)
p_dprime <- coefs %>%
ggplot(aes(y = measurement, x = dprime)) +
geom_vline(xintercept = 0, linetype = 2, color = "grey",
size = 1, alpha = 0.9) +
geom_halfeyeh(fill = "steelblue4", alpha = 0.4) +
geom_point(aes(x = point_estimate),
data = compare_d_primes,
color = 'red') +
facet_wrap(~subject) +
theme_tidybayes()
For comparison, the red data points represent the d’ values computed from the data.
We could extend this by implementing a non-linear model that can account for attentional lapses and guessing (Kuss, Jäkel, and Wichmann 2005). This is equivalent to a binomial mixture model, in which subjects lapse with probability \(\lambda\) and pay attention with probability \(1-\lambda\). If they lapse, they have to guess, otherwise they respond as in the previous model.
BF <- bf(sayright ~ guess + (1-guess-lapse) * Phi(eta),
eta ~ 0 + measurement:offsetnum +
(0 + measurement:offsetnum |s| subject),
guess ~ 1,
lapse ~ (1 |s| subject),
family = bernoulli(link="identity"),
nl = TRUE)
prior_guessing <- c(
prior(normal(1, 1), class = "b", nlpar = "eta"),
prior(beta(1, 1), nlpar = "lapse", lb = 0, ub = 1),
prior(beta(1, 1), nlpar = "guess", lb = 0, ub = 1))
fit_dprime_guessing <- brm(BF,
data = PilotData,
control = list(adapt_delta = .99),
inits = 0,
prior = prior_guessing,
file = here::here("models/fit_dprime-guessing"))The model can also be implemented as a mixture:
mix <- mixture(bernoulli("probit"), bernoulli("probit"))
p_mix <- c(
prior(normal(0, 0.1), class = Intercept, dpar = mu1),
prior(normal(0, 1), class = "b", dpar = mu2))
fit_mix <- brm(bf(y ~ 1, mu1 ~ 1, mu2 ~ x),
data = PilotData,
family = mix,
prior = p_mix)
## LOOIC SE
## fit_dprime_guessing 1178.93 23.91
## fit_dprime_c 1179.04 24.90
## fit_dprime_guessing - fit_dprime_c -0.11 4.15
A model comparison shows that the lapse model’s predictive accuracy is no better than that of the model without lapses and guessing.
Finally, we can predict data for new subjects: since we have a multilevel model, each subject’s parameters are drawn from population-level distributions, and we have an estimate of the between-subject variance.
preds_new_subjects <- fit_dprime_c$data %>%
data_grid(subject = str_c("subject_", 1:6),
measurement,
offsetnum) %>%
add_predicted_draws(model = fit_dprime_c,
allow_new_levels = TRUE,
n = 100)
sdt_pred_new <- preds_new_subjects %>%
group_by(subject, measurement, .draw) %>%
mutate(type = "hit",
type = ifelse(offsetnum > 0 & .prediction == 0, "miss", type),
type = ifelse(offsetnum < 0 & .prediction == 0, "cr", type),
type = ifelse(offsetnum < 0 & .prediction == 1, "fa", type)) %>%
ungroup()
sdt_pred_new <- sdt_pred_new %>%
group_by(subject, measurement, type) %>%
summarise(count = n()) %>%
spread(type, count)
sdt_pred_new <- sdt_pred_new %>%
mutate(zhr = qnorm(hit / (hit+miss)),
zfa = qnorm(fa / (fa+cr)),
hit_rate = hit / (hit+miss),
fa_rate = fa / (fa+cr),
dprime = zhr-zfa,
crit = -zfa,
c = -0.5*(zhr+zfa),
beta = exp(0.5 * (zfa^2 - zhr^2))) %>%
mutate_if(is.numeric, round, 3)
sdt_pred_new %>%
group_by(subject, measurement) %>%
median_qi(dprime, .width = c(.80, .95), na.rm = TRUE) %>%
ggplot(aes(y = dprime, x = measurement)) +
geom_pointinterval(fatten_point = 1,
size_range = c(0.6, 2),
color = "black") +
geom_hline(yintercept = 0, linetype = 3, color = "grey", alpha = 0.9) +
facet_wrap(~subject)
Statistical Rethinking (an introduction to applied Bayesian data analysis with lots of example code): book website and lectures on YouTube
brms: R package for Bayesian generalized multivariate non-linear multilevel models using Stan
Blog post by Jonas Lindeloev on how to compute Bayes factors using various methods.
Blog post by Matti Vuorre on how to perform a Bayesian t-test using brms
Blog post by A. Solomon Kurz on how to perform robust regression.
Blog post by Andrew Heiss: Half a dozen frequentist and Bayesian ways to measure the difference in means in two groups.
Blei, David M. 2014. “Build, Compute, Critique, Repeat: Data Analysis with Latent Variable Models.” Annual Review of Statistics and Its Application 1 (1): 203–32. https://doi.org/10.1146/annurev-statistics-022513-115657.
Box, George E. P. 1976. “Science and Statistics.” Journal of the American Statistical Association 71 (356): 791–99. https://doi.org/10.1080/01621459.1976.10480949.
Cumming, Geoff. 2014. “The New Statistics: Why and How.” Psychological Science 25 (1): 7–29. https://doi.org/10.1177/0956797613504966.
DeCarlo, Lawrence T. 1998. “Signal Detection Theory and Generalized Linear Models.” Psychological Methods 3 (2): 186–205.
Gelman, Andrew. 2014. Bayesian Data Analysis. Third edition. Chapman & Hall/CRC Texts in Statistical Science. Boca Raton: CRC Press.
Gelman, Andrew, and Jennifer Hill. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press. https://doi.org/10.1017/CBO9780511790942.
Gigerenzer, Gerd. 2004. “Mindless Statistics.” The Journal of Socio-Economics 33 (5): 587–606. https://doi.org/10.1016/j.socec.2004.09.033.
———. 2018. “Statistical Rituals: The Replication Delusion and How We Got There.” Advances in Methods and Practices in Psychological Science 1 (2): 198–218. https://doi.org/10.1177/2515245918771329.
Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, P Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50. https://doi.org/10.1007/s10654-016-0149-3.
Knoblauch, Kenneth, and Laurence T. Maloney. 2012. Modeling Psychophysical Data in R. Use R! New York: Springer-Verlag.
Kruschke, John K. 2013. “Bayesian Estimation Supersedes the T Test.” Journal of Experimental Psychology: General 142 (2): 573–603. https://doi.org/10.1037/a0029146.
Kruschke, John K., and Torrin M. Liddell. 2018. “The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective.” Psychonomic Bulletin & Review 25 (1): 178–206. https://doi.org/10.3758/s13423-016-1221-4.
Kuss, Malte, Frank Jäkel, and Felix A. Wichmann. 2005. “Bayesian Inference for Psychometric Functions.” Journal of Vision 5 (5): 8–8. https://doi.org/10.1167/5.5.8.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA’s Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.
Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. 1st ed. Cambridge ; New York: Cambridge University Press. https://doi.org/10.1017/CBO9781139087759.