1. Results

We introduced a more flexible QDF model (Double-Delta) to allow for the scenario where the ratio between peak and daily floods depends on the return period. However, it is challenging to know when this extra flexibility is merited, as the hydrological characteristics that give rise to this behavior are not easily connected to catchment properties; i.e., it is not always apparent when Javelle is preferred over Double-Delta or vice versa. We circumvent this problem with a reversible jump MCMC sampler, which performs model selection and parameter estimation simultaneously and yields a Bayesian posterior density estimate for the mixture model generated from Javelle and Double-Delta.
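To illustrate the mechanics only, the sketch below runs a reversible jump sampler on a deliberately simplified stand-in problem: two nested Gaussian models play the role of Javelle and Double-Delta, the added parameter is proposed from its prior so the acceptance ratio reduces to a likelihood ratio, and all likelihoods, priors, and tuning constants are invented for the example. It is not the sampler used in this study; it only shows how within-model and between-model updates interleave and how a posterior model weight falls out of the model-indicator trace.

```r
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1.5)   # toy data standing in for flood maxima

# Two nested toy models: model 1 fixes the extra parameter (sd = 1), model 2 adds it.
# These Gaussian likelihoods are placeholders, not the QDF likelihood.
loglik1 <- function(mu)        sum(dnorm(y, mu, 1,     log = TRUE))
loglik2 <- function(mu, sigma) sum(dnorm(y, mu, sigma, log = TRUE))

n_iter <- 20000
model <- 1L; mu <- 0; sigma <- 1
model_trace <- integer(n_iter)

for (i in seq_len(n_iter)) {
  # Within-model random-walk update of mu (flat prior, so the MH ratio is a
  # likelihood ratio); sigma gets its own update when model 2 is active.
  mu_prop <- mu + rnorm(1, 0, 0.3)
  ll_cur  <- if (model == 1L) loglik1(mu)      else loglik2(mu, sigma)
  ll_prop <- if (model == 1L) loglik1(mu_prop) else loglik2(mu_prop, sigma)
  if (log(runif(1)) < ll_prop - ll_cur) mu <- mu_prop
  if (model == 2L) {
    sigma_prop <- abs(sigma + rnorm(1, 0, 0.2))   # reflected (symmetric) proposal
    lr <- loglik2(mu, sigma_prop) + dexp(sigma_prop, 1, log = TRUE) -
          loglik2(mu, sigma)      - dexp(sigma,      1, log = TRUE)
    if (log(runif(1)) < lr) sigma <- sigma_prop
  }

  # Between-model (reversible jump) move: propose adding or dropping sigma.
  # Drawing the new sigma from its Exp(1) prior makes prior and proposal density
  # cancel, so the acceptance ratio is a likelihood ratio (Jacobian = 1,
  # equal model priors assumed).
  if (model == 1L) {
    sigma_new <- rexp(1, 1)
    if (log(runif(1)) < loglik2(mu, sigma_new) - loglik1(mu)) {
      model <- 2L; sigma <- sigma_new
    }
  } else {
    if (log(runif(1)) < loglik1(mu) - loglik2(mu, sigma)) {
      model <- 1L; sigma <- 1
    }
  }
  model_trace[i] <- model
}

mean(model_trace == 2L)   # posterior weight on the larger model (cf. Table 1)
```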

Thus we evaluate three models: Javelle, Double-Delta, and the mixture of Javelle and Double-Delta (RJD), where the weights on the respective components are determined by the reversible jump sampler.

We first assess how well the models capture flood behavior for within-sample durations at a variety of catchments. Then we evaluate which of the models is most effective at predicting out-of-sample durations, specifically short (< 24 hour) durations predicted from data at 24 hour and longer durations.

Within-sample behavior for d = 1, 24, 36, 48, 72 hours

The models were fit to flood data from the 1, 24, 36, 48, and 72 hour durations, and quantile estimates from the three models were computed from the posterior distributions to produce the return level plots shown in Figures 1 & 2. These figures display the within-sample growth curves (median quantile estimates) along with the associated data points for two selected stations, Gravå and Dyrdalsvatn.
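For orientation, the sketch below shows how a single growth curve of this kind can be evaluated, assuming a Javelle-type scaling in which the instantaneous GEV quantile is damped by a factor 1/(1 + d/Δ); the functional form and all parameter values here are illustrative assumptions rather than the fitted models, and in practice the calculation would be repeated over posterior draws and summarized by the median.

```r
# GEV return level at the instantaneous (reference) duration; assumes xi != 0.
gev_rl <- function(rp, mu, sigma, xi) {
  mu + sigma / xi * ((-log(1 - 1 / rp))^(-xi) - 1)
}

# Assumed Javelle-type QDF scaling: damp the instantaneous quantile by 1/(1 + d/delta).
qdf_rl <- function(rp, d, mu, sigma, xi, delta) {
  gev_rl(rp, mu, sigma, xi) / (1 + d / delta)
}

rp        <- c(2, 5, 10, 20, 50, 100)   # return periods (years)
durations <- c(1, 24, 36, 48, 72)       # durations (hours)

# One illustrative growth curve per duration (columns), rows indexed by rp
sapply(durations, function(d)
  qdf_rl(rp, d, mu = 100, sigma = 30, xi = 0.1, delta = 20))
```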

The median quantile curve from the RJD mixture model always falls in between the median quantile curves from Javelle and Double-Delta, with Double-Delta reporting higher return level estimates than Javelle.

The weight assigned to Double-Delta in the RJD model reflects the extent to which the model needs to adjust to scaling behavior in the observed durations. For example, at Gravå the flood events at the 1-hour duration are substantially different from those at the 24hr+ durations, so the extra flexibility afforded by Double-Delta is merited for this set of input durations. At Dyrdalsvatn the 1-hour and 24hr+ durations are more similar, so much less weight is given to Double-Delta, as Javelle is sufficient to capture the characteristics of the data.

The weights in the RJD model range from about 7% Double-Delta (Sjodalsvatn) up to 62% Double-Delta (Gryta). See Table 1 for a full summary of the model weights in the RJD model. The variation in the weights indicates the ability of the RJD model to adjust to the presence or absence of this scaling behavior. Most importantly, in situations where the data do not exhibit this characteristic the RJD model does not add the second Delta unnecessarily but instead recovers something very close to the original QDF model.

Table 1. Posterior weight on the Double-Delta component in the RJD model.

Station        Double-Delta weight
dyrdalsvatn    0.118
elgtjern       0.1606
etna           0.1062
grava          0.5064
grosettjern    0.1805
gryta          0.6164
hugdalbru      0.1478
manndalenbru   0.1278
oyungen        0.1313
roykenes       0.3342
sjodalsvatn    0.0749
viksvatn       0.0813

To assess the fit of the three models we compare the 1-hour, 24-hour, and 72-hour median quantiles to median quantiles generated by fitting a GEV distribution to each duration individually. This is not a strict model-to-model comparison, as the QDF model is developed for a different application than the individual GEV fit (namely extension to unobserved durations and ungauged locations), but it provides a useful visual cue as to the performance of the QDF model.
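For reference, an individual per-duration fit of this kind can be produced along the following lines; the evd package, the simulated annual-maximum series, and the 50-year level are assumptions for illustration, and the individual GEV fits used in the comparison may be estimated differently.

```r
library(evd)   # provides fgev() and qgev(); one of several GEV-fitting options

set.seed(1)
# Hypothetical annual-maximum series for three durations (stand-ins for station data)
annual_max <- list(d1  = rgev(30, loc = 110, scale = 30, shape = 0.1),
                   d24 = rgev(30, loc = 100, scale = 25, shape = 0.1),
                   d72 = rgev(30, loc = 90,  scale = 20, shape = 0.1))

# Maximum-likelihood GEV fit per duration, then e.g. the 50-year return level
sapply(annual_max, function(x) {
  fit <- fgev(x)
  qgev(1 - 1/50,
       loc   = fit$estimate["loc"],
       scale = fit$estimate["scale"],
       shape = fit$estimate["shape"])
})
```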

The individual GEV fit falls within the 90% credible band for most stations and durations. The RJD model provides the most consistent match to the GEV fit for both the smallest (1-hour) and largest (72-hour) durations. This is especially evident in some of the smaller catchments with a higher weight on Double-Delta in the RJD model (see Gryta, Gravå, Røykenes). Here the RJD model better captures the individual GEV fit at the 1-hour duration while Javelle does not. Notably, the RJD model for these stations also better captures the 72-hour duration GEV fit in situations where Double-Delta does not.

Since a QDF model must fit to a range of durations it runs the risk of underestimating the return level for instantaneous or short-duration events while simultaneously overestimating the return level for longer-duration events. As reflected in Figures 1, 2, & 3 (grid plot) this underestimation of the instantaneous flood is especially prevalent in Javelle while the overestimation of longer duration events is more prevalent in Double-Delta. This behavior is mitigated in the more flexible RJD model. This extra flexibility provides return level estimates closer to those obtained by individual GEV fits.

Predicting out-of-sample durations

Application of the QDF model to data series with daily (or longer) time steps, with the aim of estimating instantaneous or sub-daily return levels, is a realistic, and challenging, scenario. To assess this ability for the Javelle, Double-Delta, and RJD models we fit to two different sets of 24hr+ data: four durations between 24 and 72 hours (24, 36, 48, 72) and six durations between 24 and 120 hours (24, 36, 48, 72, 96, 120). The output from these two fits was then used to predict return levels at the 1-hour and 12-hour durations. Fitting to two different sets of durations allows for assessment of model sensitivity to the input durations.

As a measure of out-of-sample prediction performance we compute the quantile score, given by

\[\begin{equation} s_Q\left(F,y | \tau \right) = \left(y-F^{-1}(\tau)\right)\left(\tau - \mathbb{1}\{y \leq F^{-1}(\tau)\}\right) \end{equation}\]

where \(F\) is the predictive distribution, \(y\) is the realized observation, and \(\tau\) is the probability level at which the score is evaluated. Under this formulation \(F^{-1}(\tau)\) is the predicted return level at the probability level (return period) of interest.
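The score translates directly into code; the numbers in the example call below are arbitrary.

```r
# Quantile score: y is the realized observation, q_pred = F^{-1}(tau) the
# predicted return level, tau the probability level; lower scores are better.
quantile_score <- function(y, q_pred, tau) {
  (y - q_pred) * (tau - as.numeric(y <= q_pred))
}

quantile_score(y = 250, q_pred = 230, tau = 1 - 1/50)   # e.g. the 50-year level
```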

To compare quantile scores from different models we use a permutation test, as proposed in Thorarinsdottir et al. (2019). Under such a test we compute the difference in quantile scores for two models \(M^1\) and \(M^2\) as

\[\begin{equation} c = \frac{1}{N} \sum_{i=1}^N \left(s_Q(M^1_i)-s_Q(M^2_i)\right) \end{equation}\]

where \(N\) is the number of observed data points. If \(c < 0\), model \(M^1\) performs better than model \(M^2\) as measured by the quantile score, and vice versa. The permutation test is then based on resampling copies of \(c\) with \(M^1\) and \(M^2\) randomly swapped. Under the null hypothesis that \(M^1\) and \(M^2\) perform equally well, \(c\) cannot be distinguished from the resampled copies. This is formalized as a statistical test by considering which quantile \(c\) occupies in the set of resampled values; a p-value below 0.05 (or, symmetrically, above 0.95) indicates that the performance of \(M^1\) and \(M^2\) differs significantly.
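A minimal sketch of this test, assuming paired vectors of quantile scores for the two models evaluated at the same observations; the number of resamples and the toy scores in the example call are arbitrary choices.

```r
# Swapping M1 and M2 for an observation flips the sign of its score difference,
# so permutation copies of c are obtained by random sign flips of d.
perm_test <- function(s1, s2, n_perm = 5000) {
  d      <- s1 - s2
  c_obs  <- mean(d)
  c_perm <- replicate(n_perm,
                      mean(d * sample(c(-1, 1), length(d), replace = TRUE)))
  mean(c_perm < c_obs)   # small (< 0.05): M1 better; large (> 0.95): M2 better
}

set.seed(1)
perm_test(s1 = rexp(100, rate = 2), s2 = rexp(100, rate = 1))   # toy scores
```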

Results of the permutation test

The 5, 10, 20 and 50 year quantiles were used in the prediction analysis. The tables below show the p-values for three sets of model comparisons: Double-Delta against Javelle, Double-Delta against RJD, and RJD against Javelle. A low p-value (< 0.05) means the model listed first in the column name is significantly better than the model it was compared to. A high p-value (> 0.95) means the reverse is true; the model listed second in the column name is the winner.

Durations 24-72 hours:

Station DDoverJ DDoverRJ RJoverJ
dyrdalsvatn 0.4206 0.4694 0.2092
elgtjern 0.0512 0.0648 0.0322
etna 0.3230 0.3858 0.0862
grava 0.0000 0.0000 0.0000
grosettjern 0.0112 0.0120 0.0114
gryta 0.0002 0.0004 0.0000
hugdalbru 0.0002 0.0004 1.0000
manndalenbru 0.0650 0.0718 0.4120
oyungen 0.6920 0.6818 0.7146
roykenes 0.9962 0.9992 0.9804
sjodalsvatn 0.2850 0.2848 0.6248
viksvatn 0.7928 0.8248 0.5528

Durations 24-120 hours:

Station DDoverJ DDoverRJ RJoverJ
dyrdalsvatn 0.0002 0.0000 0.0090
elgtjern 0.0372 0.0352 0.0332
etna 0.3974 0.4296 0.2322
grava 0.0000 0.0000 0.0000
grosettjern 0.0022 0.0058 0.0060
gryta 0.0000 0.0000 0.0000
hugdalbru 0.0000 0.0000 0.0000
manndalenbru 0.0150 0.0114 0.9956
oyungen 0.8560 0.8680 0.8196
roykenes 0.7644 0.6308 0.7712
sjodalsvatn 0.2622 0.2642 0.2332
viksvatn 0.9662 0.9626 0.9482

We start by looking at the column “DDoverJ”. Double-Delta is better than Javelle for 5 of the stations in the 24-72 hour duration set and 7 of the stations in the 24-120 hour set. For the remaining stations in each set, the permutation test finds no significant difference between the models, with the exception of Røykenes in the 24-72 hour set and Viksvatn in the 24-120 hour set, which both prefer Javelle. The far-right column of both tables, “RJoverJ”, shows that the RJD model mostly parallels the results of Double-Delta, with the notable exceptions of Hugdal Bru in the 24-72 hour set and Manndalen Bru in the 24-120 hour set, which both prefer Double-Delta to Javelle but prefer Javelle to RJD, meaning the RJD model is ranked third out of the three models at these stations.

While the median quantile of the RJD mixture model always falls between those of Javelle and Double-Delta, the predictive performance of RJD is not guaranteed to follow the same ordering, as the quantile score (our measure of predictive performance) is not a linear function of the predicted quantile.
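A toy calculation (numbers invented, not taken from the results above) makes this concrete: an intermediate predicted quantile can score better than both of the quantiles it lies between.

```r
sq <- function(y, q, tau) (y - q) * (tau - as.numeric(y <= q))
sq(y = 10, q = c(8, 14, 11), tau = 0.9)
# 1.8 0.4 0.1 -- the middle quantile (11) attains the lowest (best) score
```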

Evaluation of duration sensitivity in prediction

Two stations showed no significant preference for Double-Delta over Javelle in the 24-72 hour data set but did show a significant preference for Double-Delta in the 24-120 hour data set. We plot the predictions for these two stations, with the individual GEV fit overlaid in black, in the figures below:

The return levels predicted from the 24-120 hour set are consistently lower than those predicted from the 24-72 hour set, as the model must strike a balance across a larger number of longer durations. Predictions for short durations are not improved by feeding more sets of dependent data into the model. The improvement in significance between the 24-72 hour set and the 24-120 hour set is therefore due to changes in the relative performance of the models, not to improvements in the predictive estimates themselves.

Comparing predicted durations to individual GEV fits

An open question is whether the quality of these predictions should also be assessed against individual GEV fits for the predicted durations.