Spatial resampling for Tidy Tuesday

Geographic data is special when it comes to, well, basically everything.

suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))

data("lsl", package = "spDataLarge")
landslides <- as_tibble(lsl)
landslides
## # A tibble: 350 x 8
##          x        y lslpts slope    cplan     cprof  elev log10_carea
##      <dbl>    <dbl> <fct>  <dbl>    <dbl>     <dbl> <dbl>       <dbl>
##  1 715078. 9558647. FALSE   37.4  0.0205   0.00866  2477.        2.61
##  2 713748. 9558047. FALSE   41.7 -0.0241   0.00679  2486.        3.07
##  3 712508. 9558887. FALSE   20.0  0.0390   0.0147   2142.        2.29
##  4 713998. 9558187. FALSE   45.8 -0.00632  0.00435  2391.        3.83
##  5 714308. 9557307. FALSE   41.7  0.0423  -0.0202   2570.        2.70
##  6 713488. 9558117. FALSE   52.9  0.0323   0.00703  2418.        2.48
##  7 714948. 9558347. FALSE   51.9  0.0399   0.000791 2546.        3.15
##  8 714678. 9560357. FALSE   38.5  0.0164   0.0299   1932.        3.26
##  9 714368. 9560287. FALSE   24.1 -0.0188  -0.00956  2059.        3.20
## 10 712528. 9559217. FALSE   50.5  0.0142   0.0151   1973.        2.60
## # ... with 340 more rows

How are these landslides (plus not landslides) distributed in this area?

theme_set(theme_light())

ggplot(landslides, aes(x, y)) +
               stat_summary_hex(aes(z = elev), alpha = 0.6, bins = 12) +
               geom_point(aes(color = lslpts), alpha = 0.7) +
               coord_fixed() +
               scale_fill_viridis_c() +
               scale_color_manual(values = c("gray90", "midnightblue")) +
               labs(fill = "Elevation", color = "Landslides?")

Create spatial resamples

Use a resampling strategy that accounts for that autocorrelation.

library(spatialsample)

set.seed(123)
good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 10)

set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds
## #  5-fold cross-validation using stratification 
## # A tibble: 5 x 2
##   splits           id   
##   <list>           <chr>
## 1 <split [280/70]> Fold1
## 2 <split [280/70]> Fold2
## 3 <split [280/70]> Fold3
## 4 <split [280/70]> Fold4
## 5 <split [280/70]> Fold5
plot_splits <- function(split) {
  p <- bind_rows(
    analysis(split) %>%
      mutate(analysis = "Analysis"),
    assessment(split) %>%
      mutate(analysis = "Assessment")
  ) %>%
    ggplot(aes(x, y, color = analysis)) +
    geom_point(size = 1.5, alpha = 0.8) +
    coord_fixed() +
    labs(color = NULL)
  print(p)
}
walk(good_folds$splits, plot_splits)

walk(bad_folds$splits, plot_splits)

Fit and evaluate model

glm_spec <- logistic_reg()
lsl_form <- lslpts ~ slope + cplan + cprof + elev + log10_carea

lsl_wf <- workflow(lsl_form, glm_spec)

doParallel::registerDoParallel(cores = 2)

set.seed(2021)
regular_rs <- fit_resamples(lsl_wf, bad_folds)

set.seed(2021)
spatial_rs <- fit_resamples(lsl_wf, good_folds)
collect_metrics(regular_rs)
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.737     5  0.0173 Preprocessor1_Model1
## 2 roc_auc  binary     0.808     5  0.0201 Preprocessor1_Model1
collect_metrics(spatial_rs)
## # A tibble: 2 x 6
##   .metric  .estimator  mean     n std_err .config             
##   <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
## 1 accuracy binary     0.698    10  0.0392 Preprocessor1_Model1
## 2 roc_auc  binary     0.796    10  0.0286 Preprocessor1_Model1

If we use the “regular” resampling, we get a more optimistc estimate of performance which would fool us into thinking our model would perform better than it really could. The lower performance estimate using spatial resampling is more accurate because of the autocorrelation of this geographic data; observations near each other are more alike than observations far apart. With geographic data, it’s important to use an appropriate model evaluation strategy!