Response functions and their discontents

Most judgements of quality are not strictly bounded – no matter how good that hamburger was, it probably wasn’t the best possible hamburger. However, asking people express judgements on infinite scales has proven to be impractical if not dangerous, which is why Yelp for instance permits judgements to be indicated only on a finite scale consisting of the integers from zero to five. Accordingly, it is useful to model such responses as the result of some linear gaussian system in an unbounded space of “internal judgements”, which is then mapped onto the restricted response domain via some nonlinear function \(g\), which I will call the response function.

Thus far, we have been modeling our legal judgements using a particular and particularly inflexible choice of response function:

\[ g(x) = \begin{cases} U & \text{if}\quad x \geq U \\ L & \text{if}\quad x \leq L \\ x & \text{otherwise} \end{cases}. \] Obviously, this is a linear function rectified above and below at \(U\) and \(L\) respectively, which are the extrema of the response scale. Note that in the current model, the trial-by-trial noise is internal to the response function – that is, if the linear predictor for a given observation \(i\) is \(\eta_i=X_i' \theta\), then the response is given by

\[ y_i = g(\eta_i + \delta_i) \\ \delta_i \sim \mathcal{N}(0,\sigma_{\delta}^2) \]

However, posterior predictive checks reveal major qualitative discrepancies between the observed distribution of responses and the expected response distribution under the linear model:

rypred <- function(samps,type="linear") {
  transfunc2 <- function(x,sigma)
    pnorm(x,0,sigma)
  
  nsamps <- nrow(samps$theta)
  nobs <- ncol(samps$theta)
  ypred <- array(dim=c(nsamps,nobs))
  
  if (type=="internalnoise") {
    for (i in 1:nsamps)
      ypred[i,] <- sapply(samps$theta[i,]+rnorm(nobs),transfunc2,samps$sigma[i])*100
    ypred <- round(ypred)
  } else if (type=="linear") {
    for (i in 1:nsamps)
      ypred[i,] <- samps$theta[i,]+rnorm(nobs)*samps$sigma[i]
    ypred[ypred<0] <- 0
    ypred[ypred>100] <- 100
  } else if (type %in% c("doublenoise","externalnoise")) {
    for (i in 1:nsamps)
      ypred[i,] <- sapply(samps$theta[i,]+rnorm(nobs)*(type=="externalnoise"),transferfunc,
                          samps$w_trans[i,],samps$l_trans[i,],samps$s_trans[i,])*100 + 
        rnorm(nobs,sd=samps$sigma[i])
    ypred[ypred<0] <- 0
    ypred[ypred>100] <- 100
  }
  
  return(ypred)
}
yppars1 <- c("sigma","theta")
yppars2 <- c("w_trans","l_trans","s_trans")

ppdat <- rbind(data.table(rating=dat$rating, model="empirical"),
               data.table(rating=as.vector(rypred(extract(fit_linear,pars=yppars1),"linear")),model="linear"),
               data.table(rating=as.vector(rypred(extract(fit_2sigma_1k,pars=c(yppars1,yppars2)),"doublenoise")),model="response & judgement noise"),
               data.table(rating=as.vector(rypred(extract(fit_1sigma_2k,pars=c(yppars1,yppars2)),"externalnoise")),model="response noise"),
               data.table(rating=as.vector(rypred(extract(fit_1sigma_1k_cdflik,pars=yppars1),"internalnoise")),model="judgement noise"))
ggplot(ppdat[model %in% c("empirical","linear")],aes(x=rating,y=stat(density))) + geom_histogram(bins=101) + facet_wrap(facets = "model", ncol = 1)

The rectified-linear model overestimates the relative density of responses in the middle of the scale as well as responses at the extrema of the scale. This produces, rather than a bowl, a bell with two large columns on either side.

Is this a problem with the additive model of evidence integration, or a problem with rectified-linear response function not capturing how people percieve and use the response scale? Perhaps we should consider a broader class of response functions, but how to pick the function? A response function must be bounded, smooth, and monotonic, but beyond these criteria I for one have no particular hypothesis regarding the shape of the response function.

We can take a… semi-parametric? approach by specifying the response function as the cummulative distribution of a mixture of Gaussian distributions, scaled appropriately to match the support of the response scale:

\[ g(x) =(U-L)\sum_{i=1}^K \pi_k \Phi \left ( \frac{x-\mu_k}{\sigma_k} \right ) + L \] A practical difficulty is that any smooth CDF with unbounded support will assymptotically approach \(0\) and \(1\) but never reach them, while our participants have no difficulty reaching the extrema of the response scale. We can fudge this by adding (rectified) noise to the response scale:

\[ \gamma_i = g(\eta_i + \delta_i) \\ y^*_i \sim \mathcal{N}(\gamma_i,\sigma^2_\epsilon) \\ \delta_i \sim \mathcal{N}(0,1) \\ \] \[ y_i = \begin{cases} U & \text{if}\quad y_i^* \geq U \\ L & \text{if}\quad y_i^* \leq L \\ y_i^* & \text{otherwise} \end{cases}\\ \]

Intuitively, \(\sigma_\epsilon\) can be interpreted as noise in the response selection itself while the \(\delta\)s can be interpreted as noise in the underlying judgement itself (the effective scale of the \(\delta\)s is implied by the scale of the mixture model underlying the response function). While \(\sigma_\epsilon > 0\) is obligatory in the presence of responses on the extrema, note that we can set \(\delta_i = 0\) for all \(i\) without inducing any explicit contradiction of the data.

Accordingly we will begin by evaluating models of varying complexity, as defined by the number of mixture components, and with versus without noise on the judgement scale. First let’s examine their predictive ability, as measure via the WAIC approximation to leave-one-out cross validation:

baseline <- waic(extract_log_lik(fit_linear))
wdat <- rbind(compare(baseline,waic(extract_log_lik(fit_1sigma_1k))),
              compare(baseline,waic(extract_log_lik(fit_1sigma_2k))),
              compare(baseline,waic(extract_log_lik(fit_2sigma_1k))),
              compare(baseline,waic(extract_log_lik(fit_2sigma_2k)))) %>% as.data.table()
wdat$noise <- c("response","response","response & judgement","response & judgement")
wdat$mixture <- c("one component","two components","one component","two components")
ggplot(wdat,aes(y=elpd_diff,x=noise,color=mixture)) + 
  geom_pointrange(aes(ymin=elpd_diff-2*se,ymax=elpd_diff+2*se),position=position_dodge(width=0.25))

The zero point is the rectified-linear model, up is better, and the scale is log probability. It seems that we don’t need more than a single component at all – so long as we have noise on the judgement scale.

Second, let’s look the predicted response distributions under the best-fitting models of the different noise classes:

ggplot(ppdat[model %in% c("response noise","response & judgement noise")],aes(x=rating,y=stat(density))) + geom_histogram(bins=101) + facet_wrap(facets = "model", ncol = 1)

While clearly an improvement, rather than a “bowl” these models appear to be shifting and stretching the “bell”. The issue here may be not the response function but the choice of censored rather than truncated responses; intuitively, under censorship, you can’t move the “bowl” towards the edge without massivley inflating the mass at the extreme value. Furthermore, it makes more sense to consider response variability as truncated, since responses outside the support of the slider are simply not registered.

We therefore consider the following modifiaction to the previous model:

\[ y_i \sim \mathcal{N}_{[L,U]}(\gamma_i,\sigma_e^2) \]

However, upon attempting to fit this model, our good friend Stan complains that a majority of attempted samples meet their grisly end at the evaluation of a divergent log probability. Diagnostics (not here pictured) suggested that these divergences arose from attempts to drive down \(\sigma_\epsilon\) to very small values.

If the model wishes \(\sigma_\epsilon\) to be so small, perhaps we can simply get rid of it altogether? But alas, we need it to place probability mass on the extrema! Unless, of course, we reconstitute the likelihood like so:

\[ y_i = \begin{cases} U & \text{if}\quad g(\eta_i + \delta_i) > U-D/2 \\ L & \text{if}\quad g(\eta_i + \delta_i) < L+D/2 \\ r & \text{if}\quad r-D/2 < g(\eta_i + \delta_i) < r+D/2 \end{cases}, \\ r \in \{L+D,L+2D, \ldots,U-D\} \]

where \(D\) is the distance between consecutive “ticks” on the scale. In other words, we recognize the true discreteness of the response scale and simply round the output of the response function to the nearest point on the scale. Though perhaps slightly hacky, this model is defensible in that the response we collect really is discrete while the motor action by which the response is generated really is continuous (well, if anything is really continuous). We can also think of it as a cutpoint model for ordinal responses where the cutpoints are generated by some smooth function rather than inferred independently; this nicely connects slider responses to the wider world of survey response models.

Because we’ve changed the data type of \(y\), we can’t directly compare WAIC and such between this model and the previous ones. However, we can do the predictive check:

ggplot(ppdat[model %in% c("empirical","judgement noise")],aes(x=rating,y=stat(density))) + geom_histogram(bins=101) + facet_wrap(facets = "model", ncol = 1)

At long last, we have achieved bowl-hood!

Beyond making a bowl, this last formulation allows us to integrate out the \(\delta_i\) to tremendous computational savings. However, that integration requires that we find \(g^{-1}\) for every possible response on the scale. For \(K=1\) this is trivial, but for more components there is no closed-form solution and the problem must be solved numerically. Based on similar models I’ve seen on the Stan mailing list, I think this should be doable with a relatively little performance penalty.

Finally, it may occur to the reader that, if the whole point is to make a bowl, why not just model the response function with a beta distribution? Which is a very good question, and the answer is maybe I should have. However, normal-mixture CDF function can do things other than bowls if it is called upon to do so, and is also faster by several orders of magnitude than the same model with the beta CDF plugged in for \(g\). In fact, the beta version was so slow that I didn’t even bother letting it finish. The source of the performance penalty to the beta response function appears to be the cost of evaluating the gradients rather than difficulty generating trajectories.