Probit model
The most common model for binary responses is the logistic
model. The probit
model is related, but uses a different link function. The probit
function is the quantile function associated with the normal distribution. The quantile function is also the inverse cumulative distribution function:
\[probit(x) = \Phi^{-1}(x) \]
and looks like this:
For reference, this is the cumulative distribution function:
As a GLM (with a single predictor \(X\)):
\[ P(y = 1 | X) = \Phi(\alpha + \beta \cdot X) \] or more conventionally:
\[ probit(P(y = 1 | X)) = \alpha + \beta \cdot X \]
or \[ \Phi^{-1}(P(y = 1 | X)) = \alpha + \beta \cdot X \]
Generally:
\[ g(E[P(y = 1 | X)]) = \alpha + \beta \cdot X \]
where \(g()\) is the link function.
Summary:
- \(\Phi\) is the inverse link function: this maps a number on the real line to a probability, and is therefore a cumulative distribution function.
- \(\Phi^{-1}\) is the link function: this maps a probability to a real, and is therefore a quantile function.
Signal detection theory
We consider an experiment in which a person has to classify a stimulus into one of two possible categories:
- new / old
- left / right
- yes / no
We can neglect the underlying task, as the math is the same. In the general case, say we present two stimulus categories A
and B
, that vary along some dimension. The task of the subject in our experiment is to perform a binary classification with the response options A
and B
. The subject’s performance can be summarised in a classification table, with four possible outcomes:
- hit: Stimulus is
A
, subject responds A
- miss: Stimulus is
A
, subject responds B
- false alarm: Stimulus is
B
, subject responds A
- correct rejection: Stimulus is
B
, subject responds B
Given the stimulus, the subject has two response options. Therefore, we consider only the A
responses when the stimulus is A
(hits) or B
(false alarms).
- The SDT model assumes that on each trial \(i\), a person’s information about a stimulus can be modeled as a random variable \(X_i\).
- This is drawn from one of two possible distributions, which (in equal variance SDT) differ only in their location, but not their scale ( we assume that \(\sigma = 1\)).
Example: familiarity. When the subject is shown an image, this evokes a feeling of ’familiarity`. This is a latent construct.
SDT does not require any particular distributions, but in practise, Gaussians are often chosen.
Thought experiment: you are a subject in a memory experiment. You were previously shown a number of images, and now you are presented with a mixture of old and new items, and have to say whether you have previously seen the test image.
This can be formulated as the following statistical problem:
You are given a random variable \(X\), i.e. a draw from a normal distribution with a known standard deviation. You also know that distribution can have either of two known means, you just don’t know which one. The two distributions differ only in their mean, and the difference in means is called d'
.
You are asked to say which distribution that \(X\) is most likely to have come from. This is a decision, so you need some sort of decision rule. In this case you can choose a criterion, and compare \(X\) to this.
You will produce four types of responses: you will either correctly classifiy the presented stimulus, or its internal representation \(X\), as either old
or new
. You will do this correctly (hits
/ correct rejections
), or you will produce a missclassification (false alarms
/ misses
).
From your behavioural data, the number of hits
and false alarms
, we want to estimate your hit rate and false alarm rate, and then compute your internal (latent) quantities that guided your behaviour.
The internal signal evoked by old and new items is often shown like this:
New items produce less familiarity than old items, but the internal representation is noisy.
In order to classify the presented stimulus, based on the evoked familiarity (decision variable), we need a decision rule:
A simple rule is to compare the signal with a criterion \(c\). If the signal \(X > c\), then respond old
(“I have previously seen it”), otherwise respond new
(“haven’t seen it before”).
Signal detection parameters
The commonly known SDT parameters are \(d'\) and and \(c\).
- d’: distance between distributions
\[ d' = c - \phi^{-1}(1-p_{H}) = \phi^{-1}(p_{H}) - \phi^{-1}(p_{FA}) \] which can also be written as: \[ d' = \phi^{-1}(P(y = 1 | old)) - \phi^{-1}(P(y = 1 | new)) \]
The expression for \(d'\)
- requires estimating probabilities conditional on the identity of a signal
- requires taking the difference on a transformed (probit) scale
- this is equivalent to a contrast between levels of a factor with two levels as the linear predictor for a response in a GLM
- c: decision criterion \[ c = \phi^{-1}(1-p_{FA}) = -\phi^{-1}(p_{FA}) \]
Better: distance to optimal decision boundary \[ c' = -\frac{1}{2} \left[\phi^{-1}(p_{H}) + \phi^{-1}(p_{FA})\right] \]
What we are doing here is estimating the hit rate and false alarm rate from observed hits and false alarms using maximum likelihood estimation, and then computing d’ and c from these estimated probabilities.
We can also write this the other way round:
When the stimulus is new
, we will produce false alarms with probability:
\[ p_{FA} = P(y = 1 | X = 0) = 1 - \Phi(c) \]
When the stimulus is old
, we will produce hits with probability: \[ p_{H} = P(y = 1 | X=1) = 1 - \Phi(c-d') \]
- We can write this in one equation:
\[ P(y = 1 | X = x) = 1 - \Phi(c-d'X) = \Phi(-c + d'X) \] where \(X\) is an indicator variable, i.e. takes the value 1
for old
and 0
for new
.
This produces the probability of giving an old
response, given the stimulus. If the stimulus is old
, this is the probability of a hit
, if the stimulus is new
, this is the probability of a false alarm
.
Compare the signal detection model
\[ P(y = 1 | X) = \Phi(-c + d'X) \]
to a GLM: \[ P(y = 1 | X) = \Phi(\alpha + \beta \cdot X) \]
We can estimate SDT parameters c
and d'
using a probit GLM, if we use dummy coding (or effect coding) for a two-level stimulus factor.
The intercept provides an estimate of the normal quantile of the false alarm rate.
The stimulus parameter (\(\beta\) or \(d'\)) provides an estimate of the difference between hit and false alarm rates on the probit scale.
Memory experiment: single subject
Let’s look at an example (borrowing heavily from this blog post).
The data are from a recognition memory experiment:
First we classify each response as hit, miss, correct rejection (cr) or false alarm (fa):
And then count the number of hits, etc.
Next, we estimate the hit and false alarm rates, based on the observed number of hits and false alarms, and compute \(d'\), \(c\) and \(bias\) using the formulae given above.
For simplicity, we first look at the data from subject 53
only:
A (standard) GLM will give us
- an intercept: this corresponds to
-c
- a parameter for the indicator
isold
: this corresponds to d'
##
## Call:
## glm(formula = sayold ~ isold, family = binomial(link = "probit"),
## data = subno53)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2322 -0.9734 -0.9734 1.1236 1.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3124 0.1752 -1.783 0.0746 .
## isold 0.3925 0.2534 1.549 0.1214
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 135.22 on 98 degrees of freedom
## AIC: 139.22
##
## Number of Fisher Scoring iterations: 4
We can use Bayesian inference for the same model:
## Family: bernoulli
## Links: mu = probit
## Formula: sayold ~ isold
## Data: subno53 (Number of observations: 100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -0.32 0.18 -0.68 0.02 3318 1.00
## isold 0.40 0.25 -0.10 0.88 3432 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
But what if we would rather obtain an estimate for the bias
, rather than c
?
We can use a special effect coding: the stimulus item is coded as 1/2
if it is old
, and -1/2
if it is new
. We are then either adding \(\frac{1}{2}d'\) to the intercept or or subtracting \(-\frac{1}{2}d'\) from the intercept, depending on the stimulus, and the intercept will give us the negative bias (difference between actual and optimal decision boundary).
\[ P(y = 1 | X) = \Phi(-c + d'X) \]
Using a standard GLM:
##
## Call:
## glm(formula = sayold ~ item, family = binomial(link = "probit"),
## data = subno53)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2322 -0.9734 -0.9734 1.1236 1.3961
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1162 0.1267 -0.917 0.359
## item 0.3925 0.2534 1.549 0.121
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 137.63 on 99 degrees of freedom
## Residual deviance: 135.22 on 98 degrees of freedom
## AIC: 139.22
##
## Number of Fisher Scoring iterations: 4
Or a Bayesian GLM:
## Family: bernoulli
## Links: mu = probit
## Formula: sayold ~ item
## Data: filter(confcontr, subno == 53) (Number of observations: 100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -0.12 0.13 -0.36 0.14 3708 1.00
## item 0.40 0.25 -0.08 0.89 3574 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Another trick is to use a non-linear model in brms
:
m53_nl <- bf(sayold ~ Phi(dprime*item - c),
dprime ~ 1, c ~ 1,
nl = TRUE)
Priors <- c(prior(normal(.5, 1.5), nlpar = "dprime"),
prior(normal(0, 1.5), nlpar = "c"))
fit_53_nl <- brm(m53_nl,
family = bernoulli(link="identity"),
data = subno53,
prior = Priors,
control = list(adapt_delta = .99),
cores = parallel::detectCores(),
file = here::here("models/sdtmodel-1_2"))
## Family: bernoulli
## Links: mu = identity
## Formula: sayold ~ Phi(dprime * item - c)
## dprime ~ 1
## c ~ 1
## Data: subno53 (Number of observations: 100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## dprime_Intercept 0.39 0.25 -0.08 0.86 2170 1.00
## c_Intercept 0.11 0.13 -0.13 0.35 2356 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Memory experiment: multilevel model
Traditionally, we would summarise the manually calculated subject-specific point estimates of \(d'\) and \(c\) with their sample means and standard deviations.
Alternatively, we can estimate both population and group level parameters in a multilevel model. Note that I am using effect
coding (1/2
and -1/2
) instead of dummy
coding.
fitglmm <- brm(sayold ~ 1 + item + (1 + item | subno),
family = bernoulli(link = "probit"),
data = confcontr,
prior = c(prior(student_t(3, 0, 1), class = sd),
prior(lkj(1), class = cor)),
cores = parallel::detectCores(),
file = here::here("models/sdtmodel-2_1"))
## Family: bernoulli
## Links: mu = probit
## Formula: sayold ~ 1 + item + (1 + item | subno)
## Data: confcontr (Number of observations: 3100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~subno (Number of levels: 31)
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept) 0.21 0.04 0.13 0.30 1953 1.00
## sd(item) 0.40 0.09 0.25 0.59 1999 1.00
## cor(Intercept,item) 0.11 0.26 -0.39 0.59 1791 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept -0.13 0.04 -0.22 -0.04 1703 1.00
## item 1.06 0.09 0.89 1.23 2697 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
The following figure shows the population-level parameters (also know to frequentists as fixed effects):
In the next figure, we can see a visualization of the shrinkage of individual, or group level, parameters (only mean is shown) towards to overall mean in the partial pooling model, as compared to the maximum likelihood (no pooling) estimates.
coefs <- coef(fitglmm)
crit <- coefs[[1]][,,1][,1] %>%
enframe() %>%
select(subno = name,
crit = value) %>%
mutate(crit = -crit)
dprime <- coefs[[1]][,,2][,1] %>%
enframe() %>%
select(subno = name,
dprime = value)
estimates <- crit %>%
left_join(dprime) %>%
mutate(model = "partial pooling",
subno = as_factor(subno)) %>%
select(subno, c = crit, dprime, model)
sdt %>%
select(subno, c = bias, dprime, model) %>%
bind_rows(estimates) %>%
mutate(model = as_factor(model)) %>%
ggplot(aes(x=dprime, y = c, color = model)) +
geom_hline(yintercept = 0,
linetype = "dotted",
alpha = 0.6) +
geom_line(aes(group = subno),
color = "grey60", alpha = 0.2,
linetype = 1) +
geom_point(size = 2) +
scale_color_viridis_d()
Of course, we can use a non-linear model here as well, in order to get a direct estimate of the bias:
glmm2 <- bf(sayold ~ Phi(dprime*item - c),
dprime ~ 1 + (1 |s| subno),
c ~ 1 + (1 |s| subno),
nl = TRUE)
Priors <- c(prior(normal(0, 3), nlpar = "dprime", lb = 0),
prior(normal(0, 3), nlpar = "c"),
prior(student_t(10, 0, 1), class = "sd", nlpar = "dprime"),
prior(student_t(10, 0, 1), class = "sd", nlpar = "c"),
prior(lkj(4), class = "cor"))
fitglmm2 <- brm(glmm2,
family = bernoulli(link = "identity"),
data = confcontr,
prior = Priors,
control = list(adapt_delta = .99),
cores = parallel::detectCores(),
inits = 0,
file = here::here("models/sdtmodel-2_2"))
## Family: bernoulli
## Links: mu = identity
## Formula: sayold ~ Phi(dprime * item - c)
## dprime ~ 1 + (1 | s | subno)
## c ~ 1 + (1 | s | subno)
## Data: confcontr (Number of observations: 3100)
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
## total post-warmup samples = 4000
##
## Group-Level Effects:
## ~subno (Number of levels: 31)
## Estimate Est.Error l-95% CI u-95% CI
## sd(dprime_Intercept) 0.40 0.08 0.25 0.57
## sd(c_Intercept) 0.21 0.04 0.14 0.30
## cor(dprime_Intercept,c_Intercept) -0.07 0.21 -0.47 0.36
## Eff.Sample Rhat
## sd(dprime_Intercept) 1561 1.00
## sd(c_Intercept) 1690 1.00
## cor(dprime_Intercept,c_Intercept) 1307 1.00
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## dprime_Intercept 1.06 0.09 0.89 1.23 1458 1.00
## c_Intercept 0.13 0.05 0.04 0.22 1321 1.00
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Posterior predictive checks
And finally, because this is always a good idea, we can perform some posterior predictive checks to make sure that the data are well-described by the model.
We simulate 500 data sets from the posterior predictive distribution, and then compute empirical estimates of \(c\) and \(d'\).
preds_fitglmm <- fitglmm$data %>%
select(subno, item) %>%
add_predicted_draws(model = fitglmm, n = 500)
sdt_pred <- preds_fitglmm %>%
group_by(subno, .draw) %>%
mutate(type = "hit",
type = ifelse(item > 0 & .prediction == 0, "miss", type),
type = ifelse(item < 0 & .prediction == 0, "cr", type),
type = ifelse(item < 0 & .prediction == 1, "fa", type)) %>%
ungroup()
sdt_pred <- sdt_pred %>%
group_by(subno, .draw, type) %>%
summarise(count = n()) %>%
spread(type, count)
sdt_pred <- sdt_pred %>%
mutate(zhr = qnorm(hit / (hit+miss)),
zfa = qnorm(fa / (fa+cr)),
dprime = zhr-zfa,
crit = -zfa,
c = -0.5*(zhr+zfa),
beta = exp(0.5 * (zfa^2 - zhr^2))) %>%
mutate_if(is.numeric, round, 3)
sdt_pred %>%
group_by(subno) %>%
median_qi(dprime, .width = c(.80, .95), na.rm = TRUE) %>%
ggplot(aes(y = dprime, x = subno)) +
geom_pointinterval(fatten_point = 2,
size_range = c(0.6, 2),
color = "black") +
geom_point(aes(y = dprime, x = subno),
data = sdt,
shape = 21, color = "black",
fill = "pink", stroke = 1, size = 3) +
geom_hline(yintercept = 0, linetype = 3, color = "grey", alpha = 0.9) +
theme_tidybayes()