1. Results

Where this Rmarkdown is different from the last one…

Model behavior - same as before, but uses the median curve instead of the curve made of the medians.

Assessing model performance - whole section is new. Uses integrated quantile distance (IQD) to compare models.

Predicting out-of-sample durations - uses IQD instead of quantile score/permutation test.

Duration sensitivity - adds in results from fitting with only 2 durations, so we compare (24,36) (24,36,48,72) (24,36,48,72,96,120) and moves this to the “model behavior” section at the start.

Model behavior for d = 1, 24, 36, 48, 72 hours

The models were fit to flood data from the 1, 24, 36, 48 and 72 hour durations and quantile estimates from the three models were computed from the posterior distributions to produce the return level plots shown in Figures 1 & 2. These figures display the within-sample growth curves (median quantile estimates) along with the associated data points for two select stations, Gravå and Dyrsdalvatn.

The median quantile curve from the RJD mixture model always falls in between the median quantile curves from Javelle and Double-Delta, with Double-Delta reporting higher return level estimates than Javelle.

The weight assigned to Double-Delta in the RJD model reflects the extent to which the model needs to adjust to scaling behavior in observed durations. For example, for Gravå, the flood events from the 1-hour duration are substantially different from the flood events from the 24hr+ durations so the extra flexibility afforded by the Double-Delta model is merited for this range of durations fed into the model. For Dyrdalsvatn the 1 hour and 24hr+ durations are more similar so much less weight is afforded to Double Delta as Javelle is sufficient to capture the characteristics of the data.

The weights in the RJD model range from about 7% Double-Delta (Sjodalsvatn) up to 62% Double-Delta (Gryta). See Table 1 for a full summary of the model weights in the RJD model. The variation in the weights indicates the ability of the RJD model to adjust to the presence or absence of this scaling behavior. Most importantly, in situations where the data do not exhibit this characteristic the RJD model does not add the second Delta unnecessarily but instead recovers something very close to the original QDF model.

percDD	lnxID
0.118	dyrdalsvatn
0.1606	elgtjern
0.1062	etna
0.5064	grava
0.1805	grosettjern
0.6164	gryta
0.1478	hugdalbru
0.1278	manndalenbru
0.1313	oyungen
0.3342	roykenes
0.0749	sjodalsvatn
0.0813	viksvatn

Duration sensitivity in model fit

Need to make plots for this section, but I think the story here is pretty simple:

The model did not converge when fit with just 2 durations (24,36).

When fit with extra durations (24,36,48,72,96,120) the confidence intervals were shrunk to the point they did not capture the individual GEV anymore and the median return level estimates were biased downward.

Therefore we should always fit with the minimum durations needed to ensure convergence and be careful with our duration selection.

And my thought is that putting this in the “model introduction” section saves us the need to report both sets of results (24-72 vs 24-120) in the prediction section.

Assessing model performance at in-sample durations

The integrated quadratic distance (IQD) measures distributional similarity by integrating the squared difference between two distribution functions. We compare return levels from the QDF model to return levels from obtained by indivually fitting GEV distributions to each duration. This gives us a score we can compare over stations and durations. The lower the IQD score, the better the distributional similarity.

iqd_score

##            lnxID  d type       iqd
##   1: dyrdalsvatn  1   DD 0.1855624
##   2: dyrdalsvatn  1    J 0.3380716
##   3: dyrdalsvatn  1  RJD 0.3319539
##   4: dyrdalsvatn 24   DD 0.3504414
##   5: dyrdalsvatn 24    J 0.6040347
##  ---                              
## 176:    viksvatn 48    J 2.7515477
## 177:    viksvatn 48  RJD 2.6926155
## 178:    viksvatn 72   DD 1.5900795
## 179:    viksvatn 72    J 0.3677967
## 180:    viksvatn 72  RJD 0.4222754

We can average this score across durations, models, or stations, and any combination of these. If we average across models (that is, group scores from DD, RJD, and J), the combined score answers questions about the overall fit of a QDF model. If we average across stations or durations while keeping models separate, the score lets us answer questions about the comparative performance of DD, RJD, and J.

Overall QDF model performance

TLDR; QDF models do better at fitting the 24hr+ durations. QDF models provide a better fit to some stations than others. On the three stations with the worst fit (Etna, Røykenes, Hugdal Bru), the 1 and 72 hour durations are responsible, and all models struggle to fit these exterior durations (as opposed to one model being substantially worse than the others and skewing the average).

Things we should discuss on Thursday: Is there a way to decide how “good” the IQD score would have to be to use in practice?

Question: Are there specific in-sample durations the QDF models are better at fitting?

iqd_score[,.(score=mean(iqd)),by=list(d)]

##     d    score
## 1:  1 5.411743
## 2: 24 3.007829
## 3: 36 1.916123
## 4: 48 1.380638
## 5: 72 2.022237

Question: Are there specific stations the QDF models are better at fitting?

st <- iqd_score[,.(score=mean(iqd)),by=list(lnxID)]
st[order(score)]

##            lnxID     score
##  1:  grosettjern 0.1821960
##  2:  dyrdalsvatn 0.2357753
##  3:     elgtjern 0.5281901
##  4:        gryta 0.5627981
##  5:        grava 0.6622019
##  6:  sjodalsvatn 0.8059851
##  7: manndalenbru 2.6089137
##  8:      oyungen 3.0054170
##  9:     viksvatn 4.5735940
## 10:         etna 5.2144725
## 11:     roykenes 5.3792959
## 12:    hugdalbru 9.2137305

Interestingly, there seems to be a split between the IQD scores for each station: six stations that have IQD < 1 and six that have IQD > 2. This boundary is pretty arbitrary, as I don’t think there’s any meaning associated with IQD scores above or below 1.

There seems to be two ways to obtain a “bad” QDF fit (defined as IQD > 2). In the worst stations (Etna, Røykenes, Hugdal Bru) the poor fit of the QDF models can be attributed to a extremely poor fit in the 1 hour or 72 hour durations (or both). In all other cases where the QDF models have trouble (Manndalen Bru, Øyungen, Viksvatn) the poor fit seems to be more evenly dispersed across durations.

If we look at Etna, Røykenes, Hugdal Bru:

std <- iqd_score[,.(score=mean(iqd)),by=list(lnxID,d)]
Etna <- std[lnxID=="etna"]
Røykenes <- std[lnxID=="roykenes"]
Hugdal_Bru <- std[lnxID=="hugdalbru"];

Etna

##    lnxID  d      score
## 1:  etna  1 13.8480331
## 2:  etna 24  6.2544173
## 3:  etna 36  3.8859167
## 4:  etna 48  1.3807631
## 5:  etna 72  0.7032323

Røykenes

##       lnxID  d     score
## 1: roykenes  1 10.732239
## 2: roykenes 24  8.874263
## 3: roykenes 36  2.669564
## 4: roykenes 48  1.403041
## 5: roykenes 72  3.217373

Hugdal_Bru

##        lnxID  d     score
## 1: hugdalbru  1 25.811634
## 2: hugdalbru 24  3.077204
## 3: hugdalbru 36  2.603988
## 4: hugdalbru 48  3.764654
## 5: hugdalbru 72 10.811173

Versus, for example, Viksvatn:

Viksvatn <- std[lnxID=="viksvatn"]
Viksvatn

##       lnxID  d     score
## 1: viksvatn  1 7.7003214
## 2: viksvatn 24 6.9690441
## 3: viksvatn 36 5.1412837
## 4: viksvatn 48 2.2639367
## 5: viksvatn 72 0.7933839

These results hold across the models, if we split the above tables up by model; that is, not just one model is contributing to the bad fit at 1 and 72 for Etna, Røykenes, and Hugdal Bru–they all are.

Comparing QDF models

TLDR; Double-Delta has the lowest IQD score when averaged across all durations and stations. However, when we look at model performance split by duration and station we see that although Double-Delta “wins” and performs the best on average, there are specific durations and stations where it loses big. This often occurs in small catchments where we suspect scaling behavior; in these scenarios Double-Delta overfits the shorter durations at the expense of the longer ones. RJD may provide a more consistent alternative to Double-Delta while still outperforming Javelle.

Things we should discuss on Thursday: The overall “best” model for a catchment is influenced by which durations we feed into the model. For example, if there is a substantial difference between the <24hr and 24hr+ durations, such as in Gravå and Gryta, the winning model is a product of how many in-sample durations fall on either side of this boundary. In our case, we have 4 24hr+ durations, so Javelle comes out the winner for those catchments according to the IQD score. It would be good to talk abotu how to fit this into the discussion section of the paper.

Question: which model has the lowest average IQD score?

lowest_iqd <- iqd_score[,.(score=mean(iqd)),by=type]
lowest_iqd[order(score)]

##    type    score
## 1:   DD 2.104151
## 2:  RJD 2.964654
## 3:    J 3.174337

Question: where is each model winning compared to the others?

The bigger the IQD score, the greater the disagreement between the QDF model and the individual GEV fit. To compare how good one model fit was relative to the others, I looked at its percentage contribution to the total IQD score when summed across models. A greater contribution to the total score means the model did comparatively worse.

The pie-chart plot below shows the percent contribution from DD, J and RJD. The plot is repeated three times, each time with a different model highlighted. The stations are arranged in order of easiest to fit with QDF models to hardest to fit with QDF models. The model with the lowest IQD score, when averaged across all durations for that station, is indicated as a label to the right of the station name. So “Grosettjern - RJD - 0.18” means that row belongs to station Grosettjern, which has winning model RJD, and an average IQD score across all three QDF models of 0.18.

I’ll send a side-by-side version of this plot in an email, it’s a lot easier to compare.

If a model has a larger area in the pie chart, it means it did relatively worse when compared tot he other two models.

Double-Delta has the ability to outperform the other two models, contributing next to nothing to the overall score (see the 36 hour duration on Øyungen or Røykenes, or the 72 hour duration for Manndalen Bru and Etna). However, there are also cases where it is substantially worse than the other two models and Double-Delta makes up almost the entire pie (see the 1 hour duration on Øyungen, or the 24hr+ durations on Sjodalsvatn, Gravå and Gryta).

If we want to avoid “whole-pie” cases RJD is the next best (still ranking above Javelle when averaged across stations and durations, but with none of these whole pie scenarios).

This reasoning does not consider the magnitude of the IQD score, only where the models are performing. That is–DD is winning on the stations QDF models have a hard time with (it wins every station with an IQD > 2). This is why it does so well in the overall ranking: DD’s bad whole-pie cases are on stations that already have a low overall IQD score. This is also something to consider…

Predicting out-of-sample durations

TLDR; We find that DD performs best when predicting short durations from long as measured by the IQD. I still need to run the permutation test results on this, but I expect these results will be in line with the quantile score + permutation test we did before.

Things we should discuss on Thursday: What is the best way to assess prediction – quantile score like we did before or IQD score? Need to consider in-sample vs out of sample durations too…

This section compares predicted return levels at short durations (1 and 12 hours) to individually fit GEV distributions at 1 and 12 hours. Here we use the IQD score as a measure of fit again.

Question: which model is the best at predicting short durations from long?

bestmodel <- iqdscoreST[,.(score=mean(iqd)),by=type]
bestmodel

##    type    score
## 1:   DD 6.781660
## 2:    J 7.685661
## 3:  RJD 7.527059

Question: which stations are the hardest to predict at?

hardstations <- iqdscoreST[,.(score=mean(iqd)),by=list(lnxID)]
hardstations

##            lnxID       score
##  1:  dyrdalsvatn  0.61177580
##  2:     elgtjern  0.06506148
##  3:         etna 14.19311706
##  4:        grava  1.55983676
##  5:  grosettjern  0.73835698
##  6:        gryta  1.82276925
##  7:    hugdalbru 42.79613654
##  8: manndalenbru  3.48357398
##  9:      oyungen  8.31073510
## 10:     roykenes  6.11007226
## 11:  sjodalsvatn  0.83500495
## 12:     viksvatn  7.45107757

So Hugdal Bru is clearly an unusally hard station to predict short durations from long (this could be nice to talk about in the discussion). If we look at the return level plot for Hugdal Bru vs Elgtjern:

Here the dashed lines are the individual GEV fits.

The bands are given by the middle 75% curve envelope as opposed to taking the pointwise uncertainties at every return period. The flare at the end of the envelope is product of the log transform on the x axis.

Question: which model is preferred at each station?

tfST <- iqdscoreST[,.(score=mean(iqd)),by=list(lnxID,type)]
wtfST <- tfST[tfST[, .I[score == min(score)], by=lnxID]$V1]
setnames(wtfST,"type", "w")
wtfST

##            lnxID   w        score
##  1:  dyrdalsvatn  DD  0.158869059
##  2:     elgtjern RJD  0.005195154
##  3:         etna  DD 11.500570748
##  4:        grava  DD  1.346266677
##  5:  grosettjern  DD  0.654790200
##  6:        gryta  DD  1.577695443
##  7:    hugdalbru  DD 36.821487867
##  8: manndalenbru  DD  1.733382615
##  9:      oyungen   J  3.242727825
## 10:     roykenes  DD  2.227923693
## 11:  sjodalsvatn   J  0.027518475
## 12:     viksvatn  DD  5.601915097

To do: run permutation test. Split by model the IQD scores are:

iqdscoreST[,.(score=mean(iqd)),by=list(lnxID,type)]

##            lnxID type        score
##  1:  dyrdalsvatn   DD  0.158869059
##  2:  dyrdalsvatn    J  0.901368770
##  3:  dyrdalsvatn  RJD  0.775089558
##  4:     elgtjern   DD  0.177942141
##  5:     elgtjern    J  0.012047134
##  6:     elgtjern  RJD  0.005195154
##  7:         etna   DD 11.500570748
##  8:         etna    J 15.867377816
##  9:         etna  RJD 15.211402619
## 10:        grava   DD  1.346266677
## 11:        grava    J  1.714789922
## 12:        grava  RJD  1.618453684
## 13:  grosettjern   DD  0.654790200
## 14:  grosettjern    J  0.791844952
## 15:  grosettjern  RJD  0.768435802
## 16:        gryta   DD  1.577695443
## 17:        gryta    J  1.997331106
## 18:        gryta  RJD  1.893281190
## 19:    hugdalbru   DD 36.821487867
## 20:    hugdalbru    J 45.755733049
## 21:    hugdalbru  RJD 45.811188715
## 22: manndalenbru   DD  1.733382615
## 23: manndalenbru    J  4.392955662
## 24: manndalenbru  RJD  4.324383675
## 25:      oyungen   DD 17.149940846
## 26:      oyungen    J  3.242727825
## 27:      oyungen  RJD  4.539536622
## 28:     roykenes   DD  2.227923693
## 29:     roykenes    J  8.993507803
## 30:     roykenes  RJD  7.108785287
## 31:  sjodalsvatn   DD  2.429135550
## 32:  sjodalsvatn    J  0.027518475
## 33:  sjodalsvatn  RJD  0.048360823
## 34:     viksvatn   DD  5.601915097
## 35:     viksvatn    J  8.530724288
## 36:     viksvatn  RJD  8.220593321
##            lnxID type        score

Results summary, local QDF model

D. Barna

2/22/2022