In its location-scale parameterization the GEV has CDF

\[\begin{equation} G(y) = \text{exp}\left\{-\left[1 + \xi\left(\frac{y-\mu}{\sigma}\right)\right]^{-1/\xi}\right\}. \end{equation}\]

The support of the distribution depends on all three parameters, and when fitting the model every observation must fall inside it:

\[\begin{equation} \tag{1} \{ y : 1 + \xi(y-\mu)/\sigma > 0 \} \end{equation}\]

where \(-\infty < \mu < \infty\), \(\sigma > 0\), and \(-\infty < \xi < \infty\).

When fitting the GEV with standard boilerplate MCMC (for example, a Metropolis-Hastings algorithm), we can enforce this condition by simply setting the log-likelihood equal to \(-\infty\) whenever the sampler proposes a value outside the support.
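As a quick illustration, a minimal hand-rolled sketch of such a log-likelihood (not taken from any package; the name gev_loglik is purely illustrative) might look like:

# Sketch of a GEV log-likelihood with a hard support check, returning -Inf
# whenever the proposed (mu, sigma, xi) puts any observation outside the support
gev_loglik <- function(y, mu, sigma, xi) {
  if (sigma <= 0) return(-Inf)
  t <- 1 + xi * (y - mu) / sigma
  if (any(t <= 0)) return(-Inf)          # proposal outside the support
  if (abs(xi) < 1e-10) {                 # Gumbel limit as xi -> 0
    z <- (y - mu) / sigma
    return(sum(-log(sigma) - z - exp(-z)))
  }
  sum(-log(sigma) - (1 + 1/xi) * log(t) - t^(-1/xi))
}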

However, once we look at fitting the GEV with more complex methods, the parameter-dependent support becomes a real issue.

In this vignette we look at the gradient-based Hamiltonian Monte Carlo (HMC) sampler implemented in Stan. Stan needs parameter bounds declared up front when the model is defined, so we can’t simply enforce the support condition inside the likelihood. Furthermore, the transformation from Stan’s unconstrained parameter space to the constrained space must be continuously differentiable, so the bounds can’t take the form of conditional statements on other parameter values (no if-then statements).

This is typically handled (see Aki Vehtari’s work with the Generalized Pareto distribution) by forcing the burden of support onto a single parameter such that the bounds for this one parameter are a continuously differentiable function of the other parameters (and, in the case of the GEV, the maximum and minimum data points as well). In the GEV distribution it makes sense to have the \(\xi\) parameter be the one that accommodates the support.

How the sign of \(\xi\) controls the support of the GEV distribution

If \(\xi\) has bounds that depend on the other two parameters, the sign of the quantity \((y-\mu)/\sigma\) determines whether the support condition imposes a lower or an upper bound on \(\xi\). Assuming all \(y\) are positive, we can consider a toy scenario where \(\mu\) is guaranteed to fall between min(\(y\)) and max(\(y\)) (a small numerical illustration follows the two cases below):

If \((y-\mu)/\sigma > 0\), the support condition can only be violated when \(\xi < 0\).
Then the observation we care about is \(y_{max}\) (it gives the tightest constraint) and \(\xi\) needs a lower bound such that \[\xi > \frac{-1}{(y_{max}-\mu)/\sigma}.\]
If \((y-\mu)/\sigma < 0\), the support condition can only be violated when \(\xi > 0\).
Then the observation we care about is \(y_{min}\) and \(\xi\) needs an upper bound such that \[\xi < \frac{-1}{(y_{min}-\mu)/\sigma}.\]
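Here is the promised numerical illustration of the two cases, with made-up toy values (all names below are illustrative and assume \(\mu\) lies inside the range of the data):

y_toy <- c(1, 2, 4, 6)                          # toy data, all positive
mu_toy <- 3; sigma_toy <- 1.5                   # mu_toy lies between min(y_toy) and max(y_toy)
lower <- -1/((max(y_toy) - mu_toy)/sigma_toy)   # bound tied to y_max: -0.5
upper <- -1/((min(y_toy) - mu_toy)/sigma_toy)   # bound tied to y_min:  0.75
xi_toy <- -0.2                                  # any xi inside (lower, upper) ...
all(1 + xi_toy*(y_toy - mu_toy)/sigma_toy > 0)  # ... keeps every point in the support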

This works because we can trap \(\xi\) between two bounds, one always dependent on \(y_{max}\) and the other always dependent on \(y_{min}\). But this logic fails when applied to the GEV distribution in its standard location-scale parameterization since \(\mu\) can fall outside the range of the data. This means we cannot definitively tie \(y_{max}\) to the lower bound and \(y_{min}\) to the upper bound, respectively, since we would need to know the value of \(\mu\) ahead of time to know which order statistic we care about in the context of Eqn (1).

We can demonstrate the implications of this with a case study using the \(\texttt{brms}\) implementation of the GEV.

Problems enforcing support with the location-scale parameterization

The \(\texttt{brms}\) package transforms an unconstrained parameter \(\kappa\) to \(\xi\) using the link function

\[\begin{equation} \tag{2} \xi = \left(\frac{1}{1 + e^{-\kappa}}\right)^{\gamma(y,\mu,\sigma)} + \text{min}\left\{ \frac{-1}{\text{min}\left(\frac{y-\mu}{\sigma}\right)} , \frac{-1}{\text{max}\left(\frac{y-\mu}{\sigma}\right)}\right\} \end{equation}\]

where

\[\begin{equation} \gamma(y,\mu,\sigma) = \text{max}\left\{ \frac{-1}{\text{min}\left(\frac{y-\mu}{\sigma}\right)} , \frac{-1}{\text{max}\left(\frac{y-\mu}{\sigma}\right)}\right\} - \text{min}\left\{ \frac{-1}{\text{min}\left(\frac{y-\mu}{\sigma}\right)} , \frac{-1}{\text{max}\left(\frac{y-\mu}{\sigma}\right)}\right\}, \end{equation}\]

i.e. the width of the interval spanned by the two candidate bounds.

Problems arise when \(\mu\) falls outside the range of the data, so that both candidate bounds used for the shift in Eqn (2) become positive (or both become negative).

We simulate 50 data points from the GEV distribution where the parameters are chosen to mimic a typical flood event dataset at a small catchment and the number of data points is a typical flood record length.

set.seed(0)
mu <- 3.25
sigma <- 1.41
xi <- -0.17

n <- 50

gev_data <- list(N = n,
                 y = extRemes::revd(n,mu,sigma,xi,type="GEV"))

c(min(gev_data$y),max(gev_data$y))
## [1] 0.8644518 6.8028075

After initialization, the sampler proposes the following parameter values:

t_mu    <- 14.0945   # mu greater than max(data)
t_sigma <- 2.33266
t_k     <- -2.61045

With these values, under the shifted logit transformation in Eqn (2), the resulting \(\xi\) can never be smaller than

bnds <- c(-1/min((gev_data$y-t_mu)/t_sigma),-1/max((gev_data$y-t_mu)/t_sigma))
min(bnds)
## [1] 0.1763153

But we would need \(\xi\) smaller than this value to enforce support:

The \(\xi\) value produced by the transform and these parameter values (purple star in above plot) is clearly outside the support:

t_xi <- (1/(1+exp(-t_k)))^(max(bnds)-min(bnds)) + min(bnds)

# check support of the GEV 
min(1 + t_xi*(gev_data$y-t_mu)/t_sigma) > 0
## [1] FALSE

This results in a \(\texttt{NaN}\) value when the log-likelihood is evaluated.
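We can see the \(\texttt{NaN}\) arise by evaluating the GEV log-density by hand at the proposed values (a quick sketch reusing t_mu, t_sigma, and t_xi from above):

z <- 1 + t_xi*(gev_data$y - t_mu)/t_sigma
# z is negative for some observations, so log(z) and z^(-1/t_xi) are NaN
# and the summed log-likelihood is NaN as well
sum(-log(t_sigma) - (1 + 1/t_xi)*log(z) - z^(-1/t_xi))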

Having this \(\texttt{NaN}\) value is an issue because we are essentially defining an area of our parameter space to have zero likelihood without reflecting this in the parameter bounds. This can cause fitting issues (see the case study section of this vignette).

We can’t fix this problem by constraining \(\mu\) to lie within the range of the data because there is no principled reason it should. Instead, the trick is to reparameterize the GEV so that we can make claims about one of the parameters that let us reliably tie the order statistics \(y_{min}\) and \(y_{max}\) to the bounds on \(\xi\).

Enforcing support via a quantile-based reparameterization of the GEV

The relationship between the location parameter, \(\mu\), and the median, \(\eta\), is given by \[\begin{equation} \eta = \begin{cases} \mu + \sigma \frac{\text{log}(2)^{-\xi}-1}{\xi} & \text{if}\ \xi \neq 0 \\ \mu - \sigma\,\text{log}\left(\text{log}(2)\right) & \text{if}\ \xi = 0. \end{cases} \label{eqn:medianrelation} \end{equation}\]
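As a quick sanity check, we can verify this relation numerically with the simulation parameters from above (assuming the quantile function \(\texttt{extRemes::qevd}\) takes the same argument order as \(\texttt{revd}\)):

eta_check <- mu + sigma*(log(2)^(-xi) - 1)/xi     # median via the relation above
c(eta_check, extRemes::qevd(0.5, mu, sigma, xi))  # the two should agree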

We also reparameterize the scale parameter to reduce dependency:

\[\begin{equation} \beta = \text{log}\left(\frac{\sigma}{\eta}\right). \label{eqn:beta} \end{equation}\]

Under this reparameterization the support condition then becomes

\[\begin{equation} \text{min}\left\{1 + \xi \left( \frac{y-\eta + \eta e^{\beta} \left(\frac{\text{log}(2)^{-\xi}-1}{\xi}\right) }{\eta e^{\beta}} \right)\right\} > 0. \end{equation}\]
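To make the reparameterization concrete, here is a small sketch that maps an illustrative \((\eta, \beta, \xi)\) triple back to \((\mu, \sigma)\) and confirms that the two forms of the support condition agree (the eta and beta values below are illustrative only, not estimates):

eta    <- 3.0                  # illustrative median
beta   <- log(1.4/eta)         # illustrative beta = log(sigma/eta)
xi_new <- -0.17

sigma_new <- eta*exp(beta)
mu_new    <- eta - sigma_new*(log(2)^(-xi_new) - 1)/xi_new

# the support condition evaluates identically in both parameterizations
min(1 + xi_new*(gev_data$y - mu_new)/sigma_new)
min(1 + xi_new*(gev_data$y - eta + eta*exp(beta)*(log(2)^(-xi_new) - 1)/xi_new)/
      (eta*exp(beta)))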

Ignoring the \(\texttt{min()}\) for the time being, setting the \(1 + \xi(...)\) quantity equal to zero and solving for \(\xi\) gives a solution in terms of the Lambert-W function (also known as the product log function):

\[\begin{equation} \tag{3} \xi = \frac{W_{c_1}\left( \frac{e^{\beta}\eta \text{log}(\text{log}(2))}{\eta - y} \right)}{\text{log}(\text{log}(2))}, \hspace{0.5cm} \eta \neq y, \,\, e^{\beta}\eta \neq 0. \end{equation}\]

The Lambert-W function has infinitely many branches, most of which take complex values. Luckily for us, however, the \(\xi\) values we are interested in (-0.5 to 0.5) correspond entirely to the principal branch \(W_0\), one of the two branches that return real values:

With our new parameterization it is reasonable to assume

\[\begin{equation} \tag{4} 0 \leq y_{min} < \eta < y_{max} \end{equation}\]

since the median flood for a catchment should not exceed the maximum flood observed or be less than the minimum observed flood. Knowing this, and observing that the principal branch of the Lambert-W function is strictly increasing, the root in Eqn (3) is monotone in \(y\) on each side of \(\eta\): observations above the median yield lower bounds on \(\xi\), the tightest of which comes from \(y_{max}\), while observations below the median yield upper bounds, the tightest of which comes from \(y_{min}\). This gives us the bounds

\[\begin{equation} \tag{5} \frac{W_{0}\left( \frac{e^{\beta}\eta \text{log}(\text{log}(2))}{\eta - y_{max}} \right)}{\text{log}(\text{log}(2))} < \xi < \frac{W_{0}\left( \frac{e^{\beta}\eta \text{log}(\text{log}(2))}{\eta - y_{min}} \right)}{\text{log}(\text{log}(2))}. \end{equation}\]
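These bounds are straightforward to compute numerically. Below is a sketch using lambertW0() from the lamW package (one of several R implementations of the principal branch \(W_0\)), with the illustrative eta and beta values from above and the simulated data:

a <- log(log(2))     # note that log(log(2)) is negative
xi_lower <- lamW::lambertW0(exp(beta)*eta*a/(eta - max(gev_data$y)))/a
xi_upper <- lamW::lambertW0(exp(beta)*eta*a/(eta - min(gev_data$y)))/a
c(xi_lower, xi_upper)
# any xi strictly inside (xi_lower, xi_upper) keeps every observation in the support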

Since the Lambert-W function is real-valued only for arguments \(\geq -1/e\), we need an additional bound on the \(\beta\) parameter. The argument of the Lambert-W function in the lower bound is always positive, so we only need to worry about the argument in the upper bound, which depends on \(y_{min}\), when constructing a bound for \(\beta\). Setting

\[\begin{equation} \frac{e^{\beta}\eta \text{log}(\text{log}(2))}{\eta - y_{min}} \geq -\frac{1}{e} \end{equation}\]

we find that \(\beta\) has an upper bound given by

\[\begin{equation} \tag{6} \beta < \text{log}\left( \frac{y_{min}-\eta}{\eta \text{log}(\text{log}(2))} \right) - 1. \end{equation}\]
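A quick check of this bound with the same illustrative values from above:

beta_max <- log((min(gev_data$y) - eta)/(eta*log(log(2)))) - 1   # Eqn (6)
beta < beta_max                                                  # TRUE for the illustrative beta
# and the Lambert-W argument in the upper bound then stays above -1/e:
exp(beta)*eta*log(log(2))/(eta - min(gev_data$y)) >= -exp(-1)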

We can plot the bounds as a function of \(\beta\) for a chosen \(\eta\):

where the green shading in the above figure indicates a prior for \(\xi\) over -0.5 to 0.5 and the grey dashed line is the upper limit for \(\beta\).

So, in sum, this construction requires four components, combined into a small checking sketch after the list below:

  1. constraints on \(\eta\) and \(y_{min},y_{max}\) given in Eqn (4)
  2. constraints on \(\xi\) given in Eqn (5)
  3. constraints on \(\beta\) given in Eqn (6)
  4. and a prior on the \(\xi\) parameter such that it can only take values within the interval \((-0.5,0.5)\).
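Putting the pieces together, a compact helper that checks whether a proposed \((\eta, \beta, \xi)\) triple satisfies all four components for a given dataset could look like this (a sketch only, not the Stan implementation; the name gev_constraints_ok is hypothetical and the lamW package is assumed again):

gev_constraints_ok <- function(eta, beta, xi, y) {
  a <- log(log(2))
  if (!(min(y) >= 0 && eta > min(y) && eta < max(y))) return(FALSE)  # Eqn (4)
  if (!(beta < log((min(y) - eta)/(eta*a)) - 1)) return(FALSE)       # Eqn (6)
  xi_lower <- lamW::lambertW0(exp(beta)*eta*a/(eta - max(y)))/a      # Eqn (5)
  xi_upper <- lamW::lambertW0(exp(beta)*eta*a/(eta - min(y)))/a
  xi > xi_lower && xi < xi_upper && xi > -0.5 && xi < 0.5            # Eqn (5) + prior range
}

gev_constraints_ok(eta, beta, xi_new, gev_data$y)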

Case study: comparing estimates across support-enforcing and non-support-enforcing models

I originally thought the support enforcement was an issue because it could cause our sampler to converge to an incorrect parameter estimate without indicating anything was wrong. This can technically happen, but only if the support is violated in specific ways during sampling and not just during the warmup phase of the HMC. Unlike other MCMC samplers such as Metropolis-Hastings, warmup is not simply early sampling: what Stan does during warmup is quite different from what it does after warmup. It seems that the sampler hitting values outside the support after warmup and failing to adjust properly occurs only in particularly pathological cases, or when a rejection step is enforced in the likelihood function. Although this behavior came up during testing, neither of these cases applies to our particular problem: we know ahead of time that all our data points will be positive, we have pre-determined priors for our parameters, and we don’t write the model with a rejection step in the likelihood.

Instead, it appears that for our particular application any issues with the support present as more typical symptoms: divergences and initialization problems. This is good in the sense that we can have faith in the results if the model converges, but bad in the sense that our support problem now hides among many other model-fitting issues.

For example, we can create 12 simulated datasets that mimic the 12 gauging stations from our first paper. We fit three different models to these datasets: (1) the support-enforcing reparameterized median model with Lambert-W bounds, (2) the same model but with the bounds artificially widened by +0.25, and (3) the same model with the bounds widened by +0.5 (see figure; the orange shading represents the region of possible outside-of-support values).

Both the +0.25 and the +0.5 models are considerably slower during the warmup phase than the support-enforcing model and spend time proposing values outside the support. The +0.25 scenario results in 1 divergent transition after warmup; the +0.5 scenario results in 4. This is still a small portion of the total chains (each of the 12 simulated datasets was run for 3 chains), and every simulated dataset managed to fit at least 2 out of 3 chains. In every case where things converged, the parameter estimates from the non-support-enforcing models match those of the support-enforcing model:

So while we might have problems fitting the model, it seems some sloppiness in the bounds doesn’t throw the sampler off entirely. This would be great if there were a numerical reason behind it that we could leverage when constructing a link function, but I’m worried that this is more of a “skating by the seat of our pants” situation, and that things would start to break down once we added more complexity to the model. It is also problematic that any model-fitting problems that might arise would be a product of an intentionally misspecified model rather than a more solvable problem (such as needing better priors or a non-centered parameterization).

Liesel investigation

Liesel uses the standard location-scale parameterization of the GEV, with the TensorFlow Python library as a backend. Hannes ran a simple GEV problem for me, and while it converged to the correct parameters, we do think it went outside the support during sampling (as evidenced by invalid transition kernels during the IWLS sampling). So we suspect the support is handled by rejecting samples rather than by building the constraints in from the beginning. This may or may not have the same issues as Stan. We more or less end up back at our original (difficult) question: what do gradient-based methods do when part of the parameter space has zero probability?
