FloodScan SFED Data Simplification

Exploratory Smoothing & Examples

Intro

We are currently discussing a FloodScan product to make available on HDX. This notebook displays and explores the pros & cons of different options.

There are several decisions to be made when preparing the files, including:

  1. Threshold/filter value: It seems beneficial to apply a value threshold to the pixels to reduce overall noise in the data set. Previous analysis showed that very low SFED values were still useful in identifying flood events, therefore in this notebook we explore a very low threshold of 1%. All values below it will be set to 0.
  2. How/if to smooth historical average Day-of-Year (DOY) values. In this notebook we explore smoothing the historical average with a 30 day centered window (+/- 15 days), which appears to be a suitable window.
  3. What years to include in the historical baseline. For the purpose of this notebook the baseline is calculated from 1998 (start of FloodScan) to 2020. Setting the end year specifically to 2020 is somewhat arbitrary, but it helps mute the effect of any irregular years over the last 4 years on anomaly calculations.

The processing of the data is all done in data-raw/fs_baselines.R; this document reads the processed final and intermediate files for visualization and communication. A rough sketch of the processing steps is shown below.
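To make decisions 1 & 2 concrete, here is a rough, illustrative sketch of what the thresholding and circular 30-day smoothing could look like on a day-of-year SpatRaster. This is not the implementation from data-raw/fs_baselines.R; r_doy_example and the helper smooth_doy() are hypothetical names used only for illustration.

Code
box::use(terra[...])

SFED_THRESHOLD <- 0.01 # same 1% value used throughout the notebook

# hypothetical helper: centered rolling mean over DOY layers (+/- 15 days),
# wrapping the window around the ends of the year
smooth_doy <- function(r, half_window = 15) {
  n <- nlyr(r)
  out <- rast(lapply(seq_len(n), \(doy) {
    idx <- ((doy - half_window):(doy + half_window) - 1) %% n + 1
    mean(r[[idx]])
  }))
  names(out) <- names(r)
  out
}

# decision 2: smooth the hypothetical DOY baseline raster
r_doy_example_smoothed <- smooth_doy(r_doy_example)

# decision 1: low-pass filter, values below 1% SFED set to 0
r_doy_example_smoothed_thresh <- ifel(
  r_doy_example_smoothed < SFED_THRESHOLD,
  0,
  r_doy_example_smoothed
)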

Code
box::use(terra[...])
box::use(sf[...])
box::use(tidyterra[...])
box::use(dplyr[...])
box::use(stringr[...])
box::use(lubridate[...])
box::use(purrr[...])
box::use(ggplot2[...])
box::use(forcats[...])
box::use(tidyr[...])
box::use(readr[...])
box::use(gghdx[...])
box::use(rnaturalearth)
box::use(extract = exactextractr)
box::use(AzureStor[...])

box::use(paths=../R/path_utils)
box::use(../R/utils) # funcs to download fieldmaps admin boundaries
sf_use_s2(FALSE)

box::use(../src/utils/blob)

gghdx()
Sys.setenv(AZURE_SAS = Sys.getenv("DSCI_AZ_SAS_DEV"))
Sys.setenv(AZURE_STORAGE_ACCOUNT = Sys.getenv("DSCI_AZ_STORAGE_ACCOUNT"))

# averages will be calculated from historical data up to this date
END_DATE_BASELINE <- as.Date("2020-12-31")
SFED_THRESHOLD <- 0.01

bc <- blob$load_containers()
gc <- bc$GLOBAL_CONT
pc <- bc$PROJECTS_CONT
Code
fps <- paths$load_paths()

r_smoothed_thresh <- rast(fps$FVP_DOY_SMOOTHED30D_THRESH_COG)
r_doy_thresh <- rast(fps$FVP_DOY_THRESH_COG)

r_smoothed <- rast(fps$FVP_DOY_NO_THRESH_SMOOTHED30d)
r_doy <- rast(fps$FVP_DOY_NO_THRESH)

gdf <- rnaturalearth$ne_countries(
  country = c("Nigeria", "Somalia"),
  returnclass = "sf"
) |> 
  select(country = name)
Code
lr_historical <- list(
  r_doy,
  r_smoothed,
  r_doy_thresh,
  r_smoothed_thresh
)

l_historical_types <-  list(
  "unsmoothed",
  "smoothed",
  "unsmoothed (thresholded)",
  "smoothed (thresholded)"
)

lr_historical <- set_names(
  lr_historical,
  l_historical_types
)

ldf_zonal <- imap(
  lr_historical,
  \(rtmp,label_tmp){
    extract$exact_extract(
      rtmp,
      gdf,
      fun = "mean",
      append_cols = "country",
      progress = FALSE
    ) |> 
      pivot_longer(-country) |> 
      separate(name, into = c("stat","doy"), sep = "\\.") |> 
      mutate(
        doy = parse_number(doy),
        type = label_tmp
      )
  }
)


df_no_thresholds <- bind_rows(
  ldf_zonal$unsmoothed,
  ldf_zonal$smoothed
)

df_thresholds <- bind_rows(
  ldf_zonal$`unsmoothed (thresholded)`,
  ldf_zonal$`smoothed (thresholded)`
)

Visualize Smoothing

It’s a bit tricky to visualize the exact effect of smoothing on individual pixels because there is a lot of noise if you randomly sample pixels (i.e., many pixels never flood). Therefore, let’s look at the zonal averages of two “random” countries: Somalia and Nigeria.

In this first plot below we’ve calculated the average SFED value per DOY (i.e., 365 days) at the pixel level for a historical baseline of 1998-2020. From that we calculated the 30 day rolling average (centered, +/- 15 days), also at the pixel level, before taking the zonal means.
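For reference, a minimal sketch of that baseline step (again, the real computation lives in data-raw/fs_baselines.R) could look like the following, assuming a hypothetical daily SFED stack r_daily_example whose layer names are dates:

Code
# hypothetical daily stack with layer names like "1998-01-01" ... "2024-06-30"
dates <- as.Date(names(r_daily_example))

# decision 3: restrict the baseline to 1998-2020
r_baseline <- r_daily_example[[which(dates <= END_DATE_BASELINE)]]

# per-pixel mean SFED for each day of year
r_doy_example <- tapp(
  r_baseline,
  index = yday(as.Date(names(r_baseline))),
  fun = "mean",
  na.rm = TRUE
)

The 30-day rolling mean from the intro sketch would then be applied to this DOY raster before taking the zonal means plotted below.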

Code
df_no_thresholds |> 
  ggplot(
    aes(
      x=doy , y= value, color = as.factor(type)
    )
  ) +
  geom_point(size=1)+
  geom_line()+
  facet_wrap(~country,nrow = 2, scales="free")+
  labs(
    title = "Zonal mean of historical average SFED by day of Year (DOY)",
    subtitle = "Smoothed vs unsmooothed",
    caption = "Smoothed version is 30d rolling mean"
  )+
  theme(
    legend.title = element_blank()
  )

The processing for the second plot followed the same process, except that a low-pass filter with a threshold of 1% SFED was applied to remove noise. The smoothed raster was made by first taking the 30d rolling mean on the un-thresholded raster and then applying the 1% threshold to the smoothed result. The same low-pass filter was applied to the non-smoothed raster, and both results were plotted again after taking zonal means.
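In other words, reusing the hypothetical names from the sketches above, the order of operations is roughly:

Code
# smooth first, then apply the 1% low-pass filter
r_tmp_smoothed <- smooth_doy(r_doy_example)
r_smoothed_thresh_example <- ifel(r_tmp_smoothed < SFED_THRESHOLD, 0, r_tmp_smoothed)

# thresholded-only version: no smoothing
r_doy_thresh_example <- ifel(r_doy_example < SFED_THRESHOLD, 0, r_doy_example)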

Code
df_thresholds |> 
  ggplot(
    
    aes(
      x=doy , y= value, color = as.factor(type)
    )
  )+
  geom_point(size=1)+
  geom_line()+
  facet_wrap(~country,nrow = 2, scales="free")+
  labs(
    title = "Zonal mean of historical average SFED by day of Year (DOY)",
    subtitle = "Thresholded - Smoothed vs unsmooothed",
    caption = "Smoothed version is 30d rolling mean"
  )+
  theme(
    legend.title = element_blank()
  )

Examples of analyses enabled:

Here are some quick ideas of the different types of analyses that could easily be done with the data package. These are all experimental and just first attempts.

Real-time monitoring

I will extend the time range beyond that of the individual 90d file we plan to share in order to provide more context, as I don’t know of specific flood events off the top of my head to look for in the last 90 days. However, I do know that there was quite a bit of flooding reported in Somalia at the end of 2023 & early 2024. Therefore, let’s extend the range to the last 365 days and draw some plots. We can pretend we were monitoring over the past 365 days and see if any of the plots/visuals look useful.

Code
# since we already had nga & som loaded, let's filter to just one
gdf_aoi <- gdf |> 
  filter(
    country == "Somalia"
  )
# open up recent data raster
r_365d <- rast(paths$vp(fps$FP_LAST365D_THRESH))

# open up smoothed baseline
r_historical <- rast(paths$vp(fps$FP_DOY_SMOOTHED30D_THRESH_COG))

# crop for simplification
lr_crop <- map(
  list(
    "RECENT_365" = r_365d,
    "HISTORICAL" = r_historical
  ), \(r_tmp){
    crop(r_tmp, gdf_aoi)
  }
)

# anomaly: each recent date minus the historical DOY baseline for the matching day of year
lr_diff <- names(lr_crop$RECENT_365) |> 
  map(
    \(dt_tmp_str){
      dt_tmp <- as.Date(dt_tmp_str)
      doy_tmp <- yday(dt_tmp)
      r_diff_tmp <- lr_crop$RECENT_365[[dt_tmp_str ]] - lr_crop$HISTORICAL[[doy_tmp]]
      r_diff_tmp
    }
  )

r_diff <-  rast(lr_diff)

gdf_som_adm2 <- utils$download_fieldmaps_sf(iso3 = "som",layer = "som_adm2")


# zonal mean anomaly per Somalia admin 2 district
df_zonal_anom_som_adm2 <- extract$exact_extract(
  r_diff,
  gdf_som_adm2$som_adm2,
  fun = "mean",
  append_cols = c("ADM2_EN", "ADM2_PCODE", "ADM1_EN", "ADM1_PCODE"),
  progress = FALSE
) |> 
  pivot_longer(-starts_with("ADM")) |> 
  separate(name, into = c("stat", "date"), sep = "\\.") |> 
  mutate(
    date = as.Date(date)
  )

# districts with the 3 highest peak anomalies over the year
top3_anoms <- df_zonal_anom_som_adm2 |> 
  group_by(ADM2_EN, ADM2_PCODE) |> 
  summarise(
    value = max(value),
    .groups = "drop"
  ) |> 
  slice_max(order_by = value, n = 3)

# RColorBrewer::brewer.pal(5,"Blues")
pal <- c("#08519C", "#3182BD", "#6BAED6", "grey")

df_zonal_anom_som_adm2 <- df_zonal_anom_som_adm2 |> 
  mutate(
    line_color = if_else(ADM2_EN %in% top3_anoms$ADM2_EN, ADM2_EN, "Other"),
    line_color = fct_relevel(line_color, top3_anoms$ADM2_EN, "Other")
  )
 
 
 p_sfed_365_somalia <- df_zonal_anom_som_adm2 |> 
   ggplot(
     aes(x= date,
         y= value,
         color = line_color,
         group=ADM2_EN,
         alpha= line_color)
   )+
   # geom_point(size=1, alpha=0.4)+
   geom_line() + 
   scale_alpha_manual(values = c(1,1,1,0.4))+
   scale_color_manual(values = pal)+
   scale_x_date(date_breaks = "2 week",date_labels = "%e %b")+
   labs(
     y= "SFED Anomaly (absolute)",
     title = "SFED Anomaly Somalia Admin 2 - Last 365 Days"
   )+
   theme(
     axis.text.x = element_text(angle = 90),
     legend.title = element_blank()
   )
  
   
 
 gdf_som_adm2 <- gdf_som_adm2$som_adm2 |> 
   mutate(
     top3_sfed = ADM2_EN %in% top3_anoms$ADM2_EN
   )
 
m_high_districts <-  ggplot() +
   geom_sf(
     data = gdf_som_adm2,
     aes(fill = top3_sfed)
   )+
   scale_fill_manual(values = c("white","tomato"))+
   labs(
     title = "Districts with highest SFED anomaly in last year"
   )+
   theme(
     legend.position = "none",
     legend.title = element_blank()
   )
 
 # facet_wrap(~ADM1_EN)


dates_dec_mid_jan <- as.character(seq(
  as.Date("2023-12-01"),
  as.Date("2024-01-15"),
  by = "day"
))

r_mean_composite <- mean(r_diff[[dates_dec_mid_jan]] )
p_map_raster_dec_jan <- ggplot()+
  geom_spatraster(
    data= r_mean_composite
  )+
  scale_fill_viridis_b()+
  geom_sf(
    data = gdf_som_adm2, fill =NA, color ="white"
  )+
  labs(
    title = "Mean flooding anomaly composite",
    subtitle =  "01 Dec 2023 - 15 Jan 2024"
  )

# box::use(patchwork[...])
# p_sfed_365_somalia+
#   m_high_districts+
#   p_map_raster_dec_jan+ 
#   plot_layout(ncol = 1)

I think it is pretty clear from this first plot that we see large positive SFED flood anomalies in Dec '23/Jan '24, which aligns with field reporting.

Code
p_sfed_365_somalia

Here we’ve plotted the top 3 districts with the highest SFED anomaly in the last year.

Code
m_high_districts

And finally we create and plot an average composite raster of the positively anomalous period (Dec '23 to mid-Jan '24) to get a qualitative view of the areas with the most flooding.

Code
p_map_raster_dec_jan

Discussion

Even from this rough analysis, it’s clear that real-time analysis of SFED anomalies has the potential to benefit monitoring systems at various scales (sub-national, national, regional, and global). The purpose of this document is not to thoroughly investigate a methodology, but to explain the data sources that we are considering processing and putting on HDX, and to demonstrate a few basic families of analyses that can easily be done with those products.

Admin level aggregations: We are considering what admin level aggregations of this data could be included in the data package. Since flooding can occur at a large scale within uninhabited areas of a country, it seems prudent to caution against admin 0 (national) level aggregations without integrating exposure metrics/data into the analysis. One potential option would be to integrate a population raster or building footprints, similar to the flood risk analysis done to support the Somalia 2023 HRP. A pure SFED anomaly analysis could be reasonable at lower levels such as admin 2/3 (similar to what was done above) or on a custom/common grid, so that people could contextualize the risk indicated by any anomalous values.
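As a rough sketch of that exposure idea (not something implemented here), a population-weighted zonal mean of the anomaly composite could be calculated with exactextractr. r_pop_example is a hypothetical population raster (e.g. WorldPop), and the resampling step is just one possible way to align it to the SFED grid:

Code
# hypothetical population raster, warped onto the SFED anomaly grid
r_pop_example <- rast("som_pop.tif") |> 
  resample(r_mean_composite, method = "sum")

# population-weighted mean anomaly per district for the Dec '23 - mid Jan '24 composite
df_pop_weighted_anom <- extract$exact_extract(
  r_mean_composite,
  gdf_som_adm2,
  fun = "weighted_mean",
  weights = r_pop_example,
  append_cols = c("ADM2_EN", "ADM2_PCODE"),
  progress = FALSE
)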

Appendix

Code
# Below we map out the anomaly for the most recent days at the raster pixel level.

# since we already had nga & som loaded, let's filter to just one
gdf_aoi <- gdf |> 
  filter(
    country == "Somalia"
  )
# open up recent data raster
r_90d <- rast(paths$vp(fps$FP_LAST90D))

# open up smoothed baseline
r_historical <- rast(paths$vp(fps$FP_DOY_SMOOTHED30D_THRESH_COG))

# crop for simplification
lr_crop <- map(
  list(
    "RECENT_90" = r_90d,
    "HISTORICAL" = r_historical
  ), \(r_tmp){
    crop(r_tmp, gdf_aoi)
  }
)


dates_monitor <- sort(as.Date(names(lr_crop$RECENT_90)),decreasing = T)[1:7]
doys_monitor <- yday(dates_monitor)

r_recent_monitor <- lr_crop$RECENT_90[[as.character(dates_monitor)]]
r_historical_monitor <- lr_crop$HISTORICAL[[doys_monitor]]

r_diff_anomaly <- r_recent_monitor - r_historical_monitor
r_pct_anomaly <- r_diff_anomaly / r_historical_monitor
Code
ggplot() +
  geom_spatraster(data=r_diff_anomaly[[1:5]])+
  scale_fill_viridis_b()+
  facet_wrap(~lyr)+
  geom_sf(data = gdf_aoi , fill =NA, color ="white")+
  labs(
    title = "Rainfall anomaly past 3 days"
  )+
  theme(
    legend.text = element_text(angle =90 )
  )

ggplot() +
  geom_spatraster(data=r_pct_anomaly[[1:5]])+
  scale_fill_viridis_b()+
  facet_wrap(~lyr)+
  geom_sf(data = gdf_aoi , fill =NA, color ="white")+
  labs(
    title = "Rainfall anomaly past 3 days"
  )+
  theme(
    legend.text = element_text(angle =90 )
  )
Code
# We can also take zonal stats of the AOI and see the zonal level anomalies.

# let's just quickly change the historical DOYs to dates so we can align them on one graph
dates_from_doy <- as.Date("2024-01-01")+(as.numeric(names(r_historical_monitor))-1)
set.names(r_historical_monitor,dates_from_doy)

df_zonal_recent_monitor <- 
  imap(
    list(
      "RECENT" = r_recent_monitor,
      "HISTORICAL" = r_historical_monitor,
      "DIFF_ANOM" = r_diff_anomaly
    ),
    \(r_tmp,label_tmp){
      extract$exact_extract(
        r_tmp,
        gdf_aoi,
        fun = "mean",
        append_cols = "country",
        progress = FALSE
      ) |> 
        pivot_longer(-country) |> 
        separate(name, into = c("stat","date"), sep = "\\.") |> 
        mutate(
          # doy = parse_number(doy),
          type = label_tmp,
          facet_split = if_else(type %in% c("RECENT","HISTORICAL"), "SFED Avg", "SFED Anomaly (absolute)")
        )
    }
  ) |> 
  list_rbind()

df_zonal_recent_monitor |> 
  ggplot(
    aes(x= date , y= value, color =type, group=type)
  )+
  geom_point()+
  geom_line()+
  facet_wrap(~facet_split,scales = "free",nrow = 2)
Code
gdf_africa <- rnaturalearth$ne_countries(continent= "Africa") |> 
  st_make_valid() |> 
  summarise()



gdf_grid_africa <- st_make_grid(
  gdf_africa,
  # cell size = units::as_units(50000,"metre"), 
  cellsize = 5, # degrees
  square = FALSE
) |> 
  st_as_sf() |> 
  mutate(
    grid_id = row_number()
  )

gdf_grid_africa <- gdf_grid_africa[gdf_africa,]

# leaflet() |> 
#   addTiles() |> 
#   addPolygons(
#     data= gdf_grid_africa
#   )

df_grid <- extract$exact_extract(
  r_365d,
  gdf_grid_africa,
  fun = "mean",
  append_cols = "grid_id"
)


df_grid_long <- df_grid |> 
  pivot_longer(-grid_id) |> 
  separate(name, into = c("stat","date"), sep = "\\.") 

df_grid_long |> 
  mutate(
    doy = lubridate::yday(date)
  ) |> 
  left_join(
    # country-level smoothed & thresholded baseline computed earlier in this doc
    ldf_zonal$`smoothed (thresholded)`,
    by = "doy",
    suffix = c("_grid", "_baseline")
  )
#   ggplot(
#     aes(
#       x=grid_id, y= value
#     )
#   )+
#   geom_point()+
#   geom_line()+
#   labs(
#     title = "Historical average SFED by grid cell"
# )
#   
Code
# unsmoothed zonal means, with and without the 1% threshold (from ldf_zonal above)
bind_rows(
  ldf_zonal$`unsmoothed (thresholded)`,
  ldf_zonal$unsmoothed
) |> 
  ggplot(
    
    aes(
      x=doy , y= value, color = as.factor(type)
    )
  )+
  geom_point(size=1)+
  geom_line()+
  facet_wrap(~country,nrow = 2, scales="free")+
  labs(
    title = "Historical average SFED by day of Year"
  ) +
  theme(
    legend.title = element_blank()
  )


# smoothed zonal means, with and without the 1% threshold (from ldf_zonal above)
bind_rows(
  ldf_zonal$smoothed,
  ldf_zonal$`smoothed (thresholded)`
) |> 
  ggplot(
    
    aes(
      x=doy , y= value, color = as.factor(type)
    )
  )+
  geom_point(size=1)+
  geom_line()+
  facet_wrap(~country,nrow = 2, scales="free")+
  labs(
    title = "Historical average SFED by day of Year"
  )+
  theme(
    legend.title = element_blank()
  )
Code
set.seed(1234)
# rnd_sample <- st_sample(r_doy_smoothed, n = 10)
# sample_smooth <- terra::spatSample(x= r_doy_smoothed,size=10,xy=T)
sample_smooth <- terra::spatSample(x= r_smoothed,size=10,xy=T)


pts <- vect(sample_smooth[,c("x","y")], geom =c("x","y")) 

sample_non_smooth<- terra::extract(x= r_doy, y= pts)
Code
df_samples <- bind_rows(
  sample_smooth |> 
    select(-x, -y) |> # drop sample coordinates so only DOY layer values remain
    mutate(
      type = "smoothed",
      row_id = row_number()
    ),
  sample_non_smooth |> 
    select(-ID) |> # drop terra::extract()'s ID column
    mutate(
      type = "non-smoothed",
      row_id = row_number()
    )
)


df_samples |> 
  pivot_longer(
    cols = -c("row_id","type")  
  ) |> 
  ggplot(
    
    aes(
      x=name , y= value, color = as.factor(row_id)
    )
  )+
  geom_point()+
  geom_line()+
  facet_wrap(~type)