| iso3 | Number of adm2 pcodes all 0 |
|---|---|
| NGA | 10 |
| SDN | 1 |
WIP
1 Pre-Processing
1.1 Admins with all 0 and NA Values:
NA Values- there are 364 cases ofNA SFEDvalues inmean. These are all contained to NGA a MOZ. See open issueAll 0 values: There are 11 pcodes that have all 0’s for yearly max for entire historical record0 Variance: just 1 pcode with 0 variance
Since this numbers are pretty small it seems reasonable to remove them in post-processing and include a meta-data table
We can easily just take these out and put in a meta-data table since it’s pretty small. Postgres PROD needs to e updated so that we can label these and the rest of data w/ names
| iso3 | pcode |
|---|---|
| NGA | NG003002 |
| NGA | NG003004 |
| NGA | NG003008 |
| NGA | NG003014 |
| NGA | NG003019 |
| NGA | NG013004 |
| NGA | NG017009 |
| NGA | NG030012 |
| NGA | NG033005 |
| NGA | NG033020 |
| SDN | SD13028 |
On top of that there is 1 admin 2 that has 0 variance in SFED values and can be removed
| value |
|---|
| NGA - NG003009 |
| 0.0001714286 |
2 RP Calcs & Issues
Here we see SFED values for each RP from LP3 Method. We do a % of admin units that do reach impossible SFED values of greater than 100%
A zoom in on the range a bit more
3 How does the data fit
Why the bad fits?
Below we calculate a bunch of error metrics and settle on relative root mean squared error (RRMSE) for visualization. For the RMMSE calculations we use the lmoments lp3 method instead of usgs for convenience in extracting the error. To deal w/the issue of 0s we add some random noise to the data for this method at 1e-8 decimal place while ensuring no values below 0.
Below is the top 10 best fits and worst 10 fits. Just to give you an idea that some data fits excellently while other data fits atrociously
Belwo we see the same plot RP vs RV by admin and colored by the how well the fit as measured by RRMSE
So where do the fits get really bad?
We see the a spike in RMMSE when admins have between 11 & 21 0’s (out of 25)
Same thing log scaled. Interestingly there some data w/ 0 zero values that have very poor fits.
if we color the plots by whether the admin falls in this category 9-22 (i expanded the range) 0’s we see it does explain a lot of the bad fits, but not all.
After running pre-processing steps (remove admins with all 0s, remove admins with NAs, remove admins with 0 Variance) here is the breakdown:
| iso3 | SFED Too High | SFED Good (<=100%) | # Insufficient Variance |
|---|---|---|---|
| BFA | 7 (15.56%) | 38 (84.44%) | 0 |
| CAF | 38 (52.78%) | 34 (47.22%) | 0 |
| CMR | 9 (15.52%) | 47 (81.03%) | 2 |
| ETH | 7 (7.61%) | 85 (92.39%) | 0 |
| MLI | 8 (15.09%) | 45 (84.91%) | 0 |
| MOZ | 34 (21.52%) | 124 (78.48%) | 0 |
| NER | 51 (76.12%) | 16 (23.88%) | 0 |
| NGA | 247 (32.93%) | 493 (65.73%) | 10 |
| SDN | 84 (44.68%) | 104 (55.32%) | 0 |
| SOM | 15 (20.27%) | 59 (79.73%) | 0 |
| SSD | 8 (10.13%) | 71 (89.87%) | 0 |
| TCD | 16 (22.86%) | 54 (77.14%) | 0 |
3.1 How much does thresholding help?
However, How much of this will be taken care of when we implement an upper return period threshold. As values above that will not be reflected in our data set. So we reallly only have to concern ourselves with the admin 2 values represented by the red points
The table below show’s how the overall results do improve significantly once we implement a threshold of 100 year RP. However, the % admins still with issue (SFED exceeding 100% before 100 year RP) still seem quite significant
| iso3 | SFED Too High | SFED Good (<=100%) | # Insufficient Variance |
|---|---|---|---|
| BFA | 3 (6.67%) | 42 (93.33%) | 0 |
| CAF | 38 (52.78%) | 34 (47.22%) | 0 |
| CMR | 9 (15.52%) | 47 (81.03%) | 2 |
| ETH | 5 (5.43%) | 87 (94.57%) | 0 |
| MLI | 8 (15.09%) | 45 (84.91%) | 0 |
| MOZ | 26 (16.46%) | 132 (83.54%) | 0 |
| NER | 33 (49.25%) | 34 (50.75%) | 0 |
| NGA | 181 (24.13%) | 559 (74.53%) | 10 |
| SDN | 52 (27.66%) | 136 (72.34%) | 0 |
| SOM | 11 (14.86%) | 63 (85.14%) | 0 |
| SSD | 8 (10.13%) | 71 (89.87%) | 0 |
| TCD | 8 (11.43%) | 62 (88.57%) | 0 |
3.2 Use Admin 1- Pooled regional analysis
Pooled regional analysis to borrow information:
Results below show much better fitting at admin 1 level
If we subset admin 1 to just the ones containing problematic admin 2’s we still see generally much better fitting
From our admin 1 fits for admin 1's containing the problematic admin 2's we see only 1 admin 1 unit that breaches 100% SFED before the 100 year RP. This is good news.
Heres a breakdown of all admin 1’s filtered to HRP countries so they match adm2 stats better. Not all of these necessarily matter because it’s really just the problematic adm2’s where we need to borrow this information
| iso3 | SFED Too High | SFED Valid |
|---|---|---|
| BFA | 0 (0) | 13 (100.00%) |
| CAF | 3 (17.65%) | 14 (82.35%) |
| CMR | 0 (0) | 10 (100.00%) |
| ETH | 2 (15.38%) | 11 (84.62%) |
| MLI | 2 (20.00%) | 8 (80.00%) |
| MOZ | 1 (9.09%) | 10 (90.91%) |
| NER | 2 (25.00%) | 6 (75.00%) |
| NGA | 0 (0) | 37 (100.00%) |
| SDN | 0 (0) | 19 (100.00%) |
| SOM | 0 (0) | 18 (100.00%) |
| SSD | 1 (9.09%) | 10 (90.91%) |
| TCD | 5 (21.74%) | 18 (78.26%) |
4 Additional Ideas
can experiment with the ideas below
4.1 Solution 1 - Pooled regional analysis
- Use admin 1 fits for admin 2 data that exceeds 100%. Implementation still a WIP. All admin 1 fits are fine except some admins in Ethiopia
4.2 Solution 2
modified from PR comment
- For each admin 2 fit the LP3
- Set a general upper threshold of 100 year RP for all admin 2s. Any vale above is set as
Inf - For each admin 2 that reaches >= 100% SFED before 100 year RP we find the exact RP that cooresponds to the 100% SFED value(in most cases that is still very high)
- For those problematic admins we set any value over this RP (per admin) to
Inf
TBD: try implementing
4.3 Solution 3 - Mix in Empirical
- Something like the above, but mixing in empirical RP values for problematic admin units.
- Seems like could work, but need to think of implications and how to implement - obviously for the problematic admin units we could only get a maximum of 1 in 26 year RP (and could set any value above to
Inf)
4.4 Solution 4 - Beta Distribution
Let’s try beta distribution as proposed in this GH PR. Here is the function and implementation
Code
rv_beta <- function(x, return_period) {
# Fit Beta distribution
fit <- fitdistrplus::fitdist(x, "beta", method = "mge")
# Calculate exceedance probabilities
exceedance_probs <- 1 - (1 / return_period)
# Calculate return values
qbeta(exceedance_probs,
shape1 = fit$estimate["shape1"],
shape2 = fit$estimate["shape2"]
)
}
df_beta_rps <- df_valid |>
group_by(
iso3,
pcode
) |>
reframe(
RP = rps_to_calc,
RV = rv_beta(
x = value,
return_period = rps_to_calc
)
)and resulting plot… bounding seemed to work well, but values appear problematic. We know from empirical data that admins in SSD reach ~60% flood fraction in empirical 26-year record. Therefore do we really believe we reached return periods than thousands of years in the last couple of years in SSD?
Code
df_beta_rps |>
ggplot(
aes(x= RP,y= RV,group = pcode)
)+
geom_line(alpha=0.2)+
scale_y_continuous(labels =scales::label_percent())+
facet_wrap(~iso3, scales= "free_y")+
labs(
title = "Beta Distribution RPs vs RVs",
subtitle = "Admin 2 level Analysis"
)+ theme(
axis.text.x = element_text(angle = 90,size=8),
axis.text.y = element_text(size=8)
)4.5 Solution 4 - LP3 w/ Logit Transform
Code
box::use(moments)
logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))
lp3_params_bounded <- function(x, imputation_method = "lowest") {
# comment out `pmin()` call because i want to see if there are issues not skip them
# x <- pmin(pmax(x, 1e-6), 1 - 1e-6) # Prevent logit issues at 0 or 1
# Apply logit transformation
x_logit <- logit(x)
list(
mu = mean(x_logit),
sd = sd(x_logit),
g = moments$skewness(x_logit)
)
}
rv_lp3_bounded <- function(x,
return_period,
method = "usgs",
imputation_method = "lowest",
params = NULL) {
if (imputation_method == "lowest") {
x[x == 0] <- min(x[x != 0])
}
if (method == "usgs") {
# Use bounded parameters if not provided
if (is.null(params)) {
params <- lp3_params_bounded(x)
}
g <- params$g
mu <- params$mu
stdev <- params$sd
# Calculate exceedance probabilities and normal quantiles
rp_exceedance <- 1 / return_period
q_rp_norm <- qnorm(1 - rp_exceedance, mean = 0, sd = 1) # Normal quantiles
# Skewness adjustment
k_rp <- (2 / g) * (((q_rp_norm - (g / 6)) * (g / 6) + 1)^3 - 1)
# Fitted values for return periods (logit space)
y_fit_rp <- mu + k_rp * stdev
# Transform back to original scale using inverse logit
ret <- inv_logit(y_fit_rp)
# comment out `pmin()` call for now because i want to see if there are issues
# not skip them -- however could be good for prod
# ret <- pmin(ret, 1) # Ensure no values exceed 100%
} else {
stop("Only 'usgs' method is supported in this bounded version.")
}
return(ret)
}
df_lp3_logit_rps <- df_valid |>
group_by(
iso3,
pcode
) |>
reframe(
RP = rps_to_calc,
RV = rv_lp3_bounded(
x = value,
return_period = rps_to_calc
)
)Below we see te LP3 Plots after the logit transform. For each country I’ve highlighted the 3 admins that reached the greated SFED flood fraction based on all 26 years of data. If you hover over those highlighted lines you can see the admin name and SFED value that was reached
TBD
4.6 Additional Method Worth Trying
- If using the LP3 distribution, my imputation method of filling 0’s with next lowest value is probably not appropriate for applying at scale. Could look into treating the data as non-detect - left-censored data.
- My guess is that this will not fix a huge amount of the issues with SFED exceeding 100, but could fix some and overall still improve the method.
5 Appendix
5.1 Check % 0’s
Below we see that when there are more 0s in the historical record there is a higher chance that you cannot even fit the LP3 at all.
I think explained by: LP3 requires a log transform, therefore 0 values are not allowed. As part of this methdo I replace 0 with lowest non-zero value. However you also cannot have a variance=0. Therefore, if you have 26 0’s obviously it won’t work, but if you hav 25 0’s it fills those in with the next lowest value which would create a variance of 0 and also fail. This could keep happening even with lower numbers of 0’s even if
5.2 Minor blocking issues
- Need to figure out what to do w/
NULL/NAvalues in our internal floodscan postgres DB - Polygon table needs to be updated in postgres db so that we can label admin pcodes with readable names in final dataset.