RP Methods (Full DB) - Tabular Data Admin 2

WIP

1 Pre-Processing

1.1 Admins with all `0` and `NA` Values:

NA Values - there are 364 cases of NA SFED values in mean. These are all contained to NGA a MOZ. See open issue
All 0 values: There are 11 pcodes that have all 0’s for yearly max for entire historical record
0 Variance: just 1 pcode with 0 variance

Since this numbers are pretty small it seems reasonable to remove them in post-processing and include a meta-data table

iso3	Number of adm2 pcodes all 0
NGA	10
SDN	1

We can easily just take these out and put in a meta-data table since it’s pretty small. Postgres PROD needs to e updated so that we can label these and the rest of data w/ names

iso3	pcode
NGA	NG003002
NGA	NG003004
NGA	NG003008
NGA	NG003014
NGA	NG003019
NGA	NG013004
NGA	NG017009
NGA	NG030012
NGA	NG033005
NGA	NG033020
SDN	SD13028

On top of that there is 1 admin 2 that has 0 variance in SFED values and can be removed

value
NGA - NG003009
0.0001714286

2 RP Calcs & Issues

Here we see SFED values for each RP from LP3 Method. We do a % of admin units that do reach impossible SFED values of greater than 100%

A zoom in on the range a bit more

3 How does the data fit

Why the bad fits?

Below we calculate a bunch of error metrics and settle on relative root mean squared error (RRMSE) for visualization. For the RMMSE calculations we use the lmoments lp3 method instead of usgs for convenience in extracting the error. To deal w/the issue of 0s we add some random noise to the data for this method at 1e-8 decimal place while ensuring no values below 0.

Below is the top 10 best fits and worst 10 fits. Just to give you an idea that some data fits excellently while other data fits atrociously

Belwo we see the same plot RP vs RV by admin and colored by the how well the fit as measured by RRMSE

So where do the fits get really bad?

We see the a spike in RMMSE when admins have between 11 & 21 0’s (out of 25)

Same thing log scaled. Interestingly there some data w/ 0 zero values that have very poor fits.

if we color the plots by whether the admin falls in this category 9-22 (i expanded the range) 0’s we see it does explain a lot of the bad fits, but not all.

After running pre-processing steps (remove admins with all 0s, remove admins with NAs, remove admins with 0 Variance) here is the breakdown:

iso3	SFED Too High	SFED Good (<=100%)	# Insufficient Variance
BFA	7 (15.56%)	38 (84.44%)	0
CAF	38 (52.78%)	34 (47.22%)	0
CMR	9 (15.52%)	47 (81.03%)	2
ETH	7 (7.61%)	85 (92.39%)	0
MLI	8 (15.09%)	45 (84.91%)	0
MOZ	34 (21.52%)	124 (78.48%)	0
NER	51 (76.12%)	16 (23.88%)	0
NGA	247 (32.93%)	493 (65.73%)	10
SDN	84 (44.68%)	104 (55.32%)	0
SOM	15 (20.27%)	59 (79.73%)	0
SSD	8 (10.13%)	71 (89.87%)	0
TCD	16 (22.86%)	54 (77.14%)	0

3.1 How much does thresholding help?

However, How much of this will be taken care of when we implement an upper return period threshold. As values above that will not be reflected in our data set. So we reallly only have to concern ourselves with the admin 2 values represented by the red points

The table below show’s how the overall results do improve significantly once we implement a threshold of 100 year RP. However, the % admins still with issue (SFED exceeding 100% before 100 year RP) still seem quite significant

iso3	SFED Too High	SFED Good (<=100%)	# Insufficient Variance
BFA	3 (6.67%)	42 (93.33%)	0
CAF	38 (52.78%)	34 (47.22%)	0
CMR	9 (15.52%)	47 (81.03%)	2
ETH	5 (5.43%)	87 (94.57%)	0
MLI	8 (15.09%)	45 (84.91%)	0
MOZ	26 (16.46%)	132 (83.54%)	0
NER	33 (49.25%)	34 (50.75%)	0
NGA	181 (24.13%)	559 (74.53%)	10
SDN	52 (27.66%)	136 (72.34%)	0
SOM	11 (14.86%)	63 (85.14%)	0
SSD	8 (10.13%)	71 (89.87%)	0
TCD	8 (11.43%)	62 (88.57%)	0

3.2 Use Admin 1- Pooled regional analysis

Pooled regional analysis to borrow information:

Results below show much better fitting at admin 1 level

If we subset admin 1 to just the ones containing problematic admin 2’s we still see generally much better fitting

From our admin 1 fits for admin 1's containing the problematic admin 2's we see only 1 admin 1 unit that breaches 100% SFED before the 100 year RP. This is good news.

Heres a breakdown of all admin 1’s filtered to HRP countries so they match adm2 stats better. Not all of these necessarily matter because it’s really just the problematic adm2’s where we need to borrow this information

iso3	SFED Too High	SFED Valid
BFA	0 (0)	13 (100.00%)
CAF	3 (17.65%)	14 (82.35%)
CMR	0 (0)	10 (100.00%)
ETH	2 (15.38%)	11 (84.62%)
MLI	2 (20.00%)	8 (80.00%)
MOZ	1 (9.09%)	10 (90.91%)
NER	2 (25.00%)	6 (75.00%)
NGA	0 (0)	37 (100.00%)
SDN	0 (0)	19 (100.00%)
SOM	0 (0)	18 (100.00%)
SSD	1 (9.09%)	10 (90.91%)
TCD	5 (21.74%)	18 (78.26%)

4 Additional Ideas

can experiment with the ideas below

4.1 Solution 1 - Pooled regional analysis

Use admin 1 fits for admin 2 data that exceeds 100%. Implementation still a WIP. All admin 1 fits are fine except some admins in Ethiopia

4.2 Solution 2

modified from PR comment

For each admin 2 fit the LP3
Set a general upper threshold of 100 year RP for all admin 2s. Any vale above is set as Inf
For each admin 2 that reaches >= 100% SFED before 100 year RP we find the exact RP that cooresponds to the 100% SFED value(in most cases that is still very high)
For those problematic admins we set any value over this RP (per admin) to Inf

TBD: try implementing

4.3 Solution 3 - Mix in Empirical

Something like the above, but mixing in empirical RP values for problematic admin units.
Seems like could work, but need to think of implications and how to implement - obviously for the problematic admin units we could only get a maximum of 1 in 26 year RP (and could set any value above to Inf)

4.4 Solution 4 - Beta Distribution

Let’s try beta distribution as proposed in this GH PR. Here is the function and implementation

Code

rv_beta <- function(x, return_period) {
  # Fit Beta distribution
  fit <- fitdistrplus::fitdist(x, "beta", method = "mge")
  
  # Calculate exceedance probabilities
  exceedance_probs <- 1 - (1 / return_period)
  
  # Calculate return values
  qbeta(exceedance_probs, 
        shape1 = fit$estimate["shape1"], 
        shape2 = fit$estimate["shape2"]
  )
  
}

df_beta_rps <- df_valid |> 
  group_by(
    iso3,
    pcode
  ) |> 
  reframe(
    RP = rps_to_calc,
    RV = rv_beta(
      x = value, 
      return_period = rps_to_calc
    )
  )

and resulting plot… bounding seemed to work well, but values appear problematic. We know from empirical data that admins in SSD reach ~60% flood fraction in empirical 26-year record. Therefore do we really believe we reached return periods than thousands of years in the last couple of years in SSD?

Code

df_beta_rps |> 
  ggplot(
    aes(x= RP,y= RV,group = pcode)
  )+
  geom_line(alpha=0.2)+
  scale_y_continuous(labels =scales::label_percent())+
  facet_wrap(~iso3, scales= "free_y")+
  labs(
    title = "Beta Distribution RPs vs RVs",
    subtitle = "Admin 2 level Analysis"
  )+ theme(
    axis.text.x = element_text(angle = 90,size=8),
    axis.text.y = element_text(size=8)
  )

4.5 Solution 4 - LP3 w/ Logit Transform

Code

box::use(moments)

logit <- function(p) log(p / (1 - p))
inv_logit <- function(x) exp(x) / (1 + exp(x))
lp3_params_bounded <- function(x, imputation_method = "lowest") {
  
  # comment out `pmin()` call because i want to see if there are issues not skip them
  # x <- pmin(pmax(x, 1e-6), 1 - 1e-6)  # Prevent logit issues at 0 or 1
  
  # Apply logit transformation
  x_logit <- logit(x)
  
  list(
    mu = mean(x_logit),
    sd = sd(x_logit),
    g = moments$skewness(x_logit)
  )
}

rv_lp3_bounded <- function(x,
                           return_period,
                           method = "usgs",
                           imputation_method = "lowest",
                           params = NULL) {
  if (imputation_method == "lowest") {
    x[x == 0] <- min(x[x != 0])
  }
  
  if (method == "usgs") {
    # Use bounded parameters if not provided
    if (is.null(params)) {
      params <- lp3_params_bounded(x)
    }
    
    g <- params$g
    mu <- params$mu
    stdev <- params$sd
    
    # Calculate exceedance probabilities and normal quantiles
    rp_exceedance <- 1 / return_period
    q_rp_norm <- qnorm(1 - rp_exceedance, mean = 0, sd = 1)  # Normal quantiles
    
    # Skewness adjustment
    k_rp <- (2 / g) * (((q_rp_norm - (g / 6)) * (g / 6) + 1)^3 - 1)
    
    # Fitted values for return periods (logit space)
    y_fit_rp <- mu + k_rp * stdev
    
    # Transform back to original scale using inverse logit
    ret <- inv_logit(y_fit_rp)  
    
    # comment  out `pmin()` call for now because i want to see if there are issues 
    # not skip them -- however could be good  for prod
    # ret <- pmin(ret, 1)  # Ensure no values exceed 100%
    
  } else {
    stop("Only 'usgs' method is supported in this bounded version.")
  }
  
  return(ret)
}


df_lp3_logit_rps <- df_valid |> 
  group_by(
    iso3,
    pcode
  ) |> 
  reframe(
    RP = rps_to_calc,
    RV = rv_lp3_bounded(
      x = value, 
      return_period = rps_to_calc
    )
  )

Below we see te LP3 Plots after the logit transform. For each country I’ve highlighted the 3 admins that reached the greated SFED flood fraction based on all 26 years of data. If you hover over those highlighted lines you can see the admin name and SFED value that was reached

TBD

4.6 Additional Method Worth Trying

If using the LP3 distribution, my imputation method of filling 0’s with next lowest value is probably not appropriate for applying at scale. Could look into treating the data as non-detect - left-censored data.
My guess is that this will not fix a huge amount of the issues with SFED exceeding 100, but could fix some and overall still improve the method.

5 Appendix

5.1 Check % 0’s

Below we see that when there are more 0s in the historical record there is a higher chance that you cannot even fit the LP3 at all.

I think explained by: LP3 requires a log transform, therefore 0 values are not allowed. As part of this methdo I replace 0 with lowest non-zero value. However you also cannot have a variance=0. Therefore, if you have 26 0’s obviously it won’t work, but if you hav 25 0’s it fills those in with the next lowest value which would create a variance of 0 and also fail. This could keep happening even with lower numbers of 0’s even if

5.2 Minor blocking issues

Need to figure out what to do w/ NULL/NA values in our internal floodscan postgres DB
Polygon table needs to be updated in postgres db so that we can label admin pcodes with readable names in final dataset.

1 Pre-Processing

1.1 Admins with all 0 and NA Values:

2 RP Calcs & Issues

3 How does the data fit

3.1 How much does thresholding help?

3.2 Use Admin 1- Pooled regional analysis

4 Additional Ideas

4.1 Solution 1 - Pooled regional analysis

4.2 Solution 2

4.3 Solution 3 - Mix in Empirical

4.4 Solution 4 - Beta Distribution

4.5 Solution 4 - LP3 w/ Logit Transform

4.6 Additional Method Worth Trying

5 Appendix

5.1 Check % 0’s

5.2 Minor blocking issues

1.1 Admins with all `0` and `NA` Values: