Monarch SSA: derive candidate distributions from expert parameter estimates

Summary of RFA from Kristen Voorhies:
Parameter A
Parameter C

Summary of RFA from Kristen Voorhies:

For several parameters of monarch population models, expert opinion provides a “most likely” value, upper and lower bounds, and a subjective measure of confidence in that bound. Kristen wants to derive an underlying distribution that satisfies these measures but from which she can then generate random draws to evaluation population model uncertainty.

She provided values for three parameters of current interest:

Parameter A: low = -1.17; most likely = -0.59; high = -0.04; 76% confidence
Parameter B: low=-0.65; most likely = -0.18; high= -0.05; 80% confidence
Parameter C: low=-1.87; most likely= -0.29; high = -0.01; 93% confidence

My understanding is that ‘confidence’ implies that proportion of a distribution’s density that falls between the lower and upper bounds, with some measure of the distribution’s location (i.e., mean, median) corresponding to the “most likely” value.

Parameter A

The values for the first case suggest a normal distribution may be pretty reasonable assuming symmetry around the mean value. There may well be other distributions with a location near -0.59 and similar upper and lower bounds that fit the bill but that are asymmetrical.

# Specify known information
mean <- -0.59
lower <- -1.17
upper <- -0.04

# Desired proportion ('confidence') between upper and lower bounds
desired_bw_bounds <- 0.76

In this case, bounds are fairly symmetrical around a “most likely” value. With normal distributions, ~ 68% of data fall within 1 sd of mean, so the difference between the upper/lower and mean give a good upper limit on the standard deviation in the search for a potentially “matching” normal distribution.

upper - mean; mean - lower # ~ 0.6

## [1] 0.55

## [1] 0.58

# Specify standard deviations to search across
sds <- seq(from = 0.4, to = 0.6, by = 0.001)

# Create empty vector to accept calculations of % of data between lower and upper bounds
bw_bounds <- rep(NA, length(sds))

# Loop through SDs and calculate % data between given bounds
for (i in seq_along(sds)) {
  # First, calculate proportion of data less than upper bound
  # assuming normal distribution centered on mean and using
  # this iteration's standard deviation
  hi_p <- pnorm(upper, mean = mean, sd = sds[i])
  # Now same for lower bound
  lo_p <- pnorm(lower, mean = mean, sd = sds[i])
  # Calculate percent data between upper and lower bounds
  bw_bounds[i] <- hi_p - lo_p
}

# Plot it
plot(sds, bw_bounds)
# Add line showing desired amount of data within bounds
abline(h = desired_bw_bounds, lty = "dashed", col = "blue") # So, ~ sd = 0.48?

# Let's find the actual estimate
# Which standard deviation, in combination with the given mean,
# and assuming a normal distribution, is closest to desired?
diffs <- abs(bw_bounds - desired_bw_bounds)
plot(sds, diffs)

sds[which.min(diffs)]

## [1] 0.481

# So, closest fit of normal distribution with mean = -0.59
# is standard deviation = 0.481
x <- seq(-3.5, 3, length.out = 500)
plot(x, dnorm(x, mean = -0.59, sd = 0.481), type = "l")
# Add lower & upper bounds
abline(v = c(lower, upper), lty = "dashed", col = "blue")

# About 75.96% of data between bounds
bw_bounds[which.min(diffs)]

## [1] 0.7596311

Parameter C

Fitting a distribution with only three “known” quantiles is not necessarily a good idea - more data and a clear purpose are imperative (see here also). Thus I suggested that Kristen should seek more data, even having her expert go so far as to draw a distribution for the parameter to get an idea of its shape, hard bounds on the value, the heaviness of the tails, etc…

But, based on the discussion at the first link above, there are statistical approaches to “fit” distributions with only three quantiles. In Kristen’s case, she doesn’t know the quantiles that the upper and lower bounds correspond to, but can make the assumption that the “confidence” describes the distribution quantiles in a confidence interval-like manner. For example, she could assume the following percentiles: -1.87 (3.5th percentile), -0.29 (50th percentile; median), and -0.01 (96.5th percentile).

Log-normal

The current values suggest a skewed distibution, so we can try a log-normal distribution, although we have to reverse the sign of values since the log-normal is defined only for non-negative numbers. That is, we flip the above values to their corresponding positive numbers, essentially making the lower bound the upper bound (and vice versa): 0.01 (3.5th percentile), 0.29 (50th percentile; median), and 1.87 (96.5th percentile). Of course a log-normal distribution assumes that this parameter is bound by zero, which may not be a reasonable assumption.

lower <- -1.87 # quantile = 0.035
mean <- -0.29  # quantile = 0.5; treat as median
upper <- -0.01 # quantile = 0.965

quantiles <- c(lower, mean, upper) # Notice still on original negative values

# Define function that compares various log-normal fits with the 'observed' values
fit_ln <- function(ln_parms) {
  # Use -1 * quantiles to turn into positive numbers for log-normal
  # Calculating (and minimizing using `optim` function below) the sum of squared differences
  sum(abs(-quantiles - qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]))^2)
}

# Reaches a solution, although some iterations fail...
# Parameters are stored in test_soln$par
test_soln <- optim(c(1,1), fit_ln)

## Warning in qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]): NaNs
## produced

## Warning in qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]): NaNs
## produced

## Warning in qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]): NaNs
## produced

## Warning in qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]): NaNs
## produced

## Warning in qlnorm(c(0.965, 0.5, 0.035), ln_parms[1], ln_parms[2]): NaNs
## produced

x <- seq(-2.5, 0, 0.01) # Range of values of interest

# Plot it.  Notice multiplication of x with -1 to "flip" from original scale
plot(x, dlnorm(-x, meanlog = test_soln$par[1], 
               sdlog = test_soln$par[2]), type="l", col="red")
# Add "known" values use to estimate distribution
points(quantiles, dlnorm(-quantiles,  meanlog = test_soln$par[1], 
                         sdlog = test_soln$par[2]))

# How good is fit?
# "Flipped" estimated log-normal fit is fairly close to "known" paramters
# Median
-qlnorm(0.5, meanlog = test_soln$par[1], sdlog = test_soln$par[2])

## [1] -0.2804112

# Lower
-qlnorm(0.965, meanlog = test_soln$par[1], sdlog = test_soln$par[2])

## [1] -1.869551

# Upper
-qlnorm(0.035, meanlog = test_soln$par[1], sdlog = test_soln$par[2])

## [1] -0.04205846

# Amount of data between lower and upper bounds?
# About 96.4%
plnorm(-lower, meanlog = test_soln$par[1], sdlog = test_soln$par[2]) -
  plnorm(-upper, meanlog = test_soln$par[1], sdlog = test_soln$par[2])

## [1] 0.964291

These results suggest a “flipped” log-normal may not be too bad, pending review by the expert. For example, if the “most-likely value” of -0.29 really should be the most probable value, then this distribution is clearly a poor fit…

Revisiting Parameter A

Finally, note that we could have used the same strategy for the assumed normal case (Parameter A) above, rather than the search over a vector of possible standard deviations:

# Specify known information
# 76% confidence suggests lower and upper bounds are 12th and 88th percentiles, respectively
mean <- -0.59 # quantile = 0.5
lower <- -1.17 # quantile = 0.12
upper <- -0.04 # quantile = 0.88

quantiles <- c(lower, mean, upper) # Notice still on original negative values

# Define function that compares various normal fits with the 'observed' values
fit_norm <- function(norm_parms) {
  # Calculating (and minimizing using `optim` function below) the sum of squared differences
  sum(abs(quantiles - qnorm(c(0.12, 0.5, 0.88), norm_parms[1], norm_parms[2]))^2)
}

# Parameter estimates are stored in test_soln$par
test_soln <- optim(c(1,1), fit_norm)

# Optimization suggests "closest" fit as we've defined it is
# normal distribution with mean:
test_soln$par[1]

## [1] -0.5999458

# and standard deviation:
test_soln$par[2]

## [1] 0.4808353

# Plot it.  
x <- seq(-2, 1, 0.001) # Range of values of interest
plot(x, dnorm(x, mean = test_soln$par[1], 
               sd = test_soln$par[2]), type="l", col="red")
# Add "known" values use to estimate distribution
points(quantiles, dnorm(quantiles,  mean = test_soln$par[1], 
                         sd = test_soln$par[2]))

# Amount of data between lower and upper bounds?
# About 76%
pnorm(upper, mean = test_soln$par[1], sd = test_soln$par[2]) -
  pnorm(lower, mean = test_soln$par[1], sd = test_soln$par[2])

## [1] 0.7599948