Spatial resampling for Tidy Tuesday
Geographic data is special when it comes to, well, basically everything.
suppressMessages(library(tidyverse))
suppressMessages(library(tidymodels))
data("lsl", package = "spDataLarge")
landslides <- as_tibble(lsl)
landslides
## # A tibble: 350 x 8
## x y lslpts slope cplan cprof elev log10_carea
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 715078. 9558647. FALSE 37.4 0.0205 0.00866 2477. 2.61
## 2 713748. 9558047. FALSE 41.7 -0.0241 0.00679 2486. 3.07
## 3 712508. 9558887. FALSE 20.0 0.0390 0.0147 2142. 2.29
## 4 713998. 9558187. FALSE 45.8 -0.00632 0.00435 2391. 3.83
## 5 714308. 9557307. FALSE 41.7 0.0423 -0.0202 2570. 2.70
## 6 713488. 9558117. FALSE 52.9 0.0323 0.00703 2418. 2.48
## 7 714948. 9558347. FALSE 51.9 0.0399 0.000791 2546. 3.15
## 8 714678. 9560357. FALSE 38.5 0.0164 0.0299 1932. 3.26
## 9 714368. 9560287. FALSE 24.1 -0.0188 -0.00956 2059. 3.20
## 10 712528. 9559217. FALSE 50.5 0.0142 0.0151 1973. 2.60
## # ... with 340 more rows
How are these landslides (plus not landslides) distributed in this area?
theme_set(theme_light())
ggplot(landslides, aes(x, y)) +
stat_summary_hex(aes(z = elev), alpha = 0.6, bins = 12) +
geom_point(aes(color = lslpts), alpha = 0.7) +
coord_fixed() +
scale_fill_viridis_c() +
scale_color_manual(values = c("gray90", "midnightblue")) +
labs(fill = "Elevation", color = "Landslides?")
Use a resampling strategy that accounts for that autocorrelation.
library(spatialsample)
set.seed(123)
good_folds <- spatial_clustering_cv(landslides, coords = c("x", "y"), v = 10)
set.seed(234)
bad_folds <- vfold_cv(landslides, v = 5, strata = lslpts)
bad_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 x 2
## splits id
## <list> <chr>
## 1 <split [280/70]> Fold1
## 2 <split [280/70]> Fold2
## 3 <split [280/70]> Fold3
## 4 <split [280/70]> Fold4
## 5 <split [280/70]> Fold5
plot_splits <- function(split) {
p <- bind_rows(
analysis(split) %>%
mutate(analysis = "Analysis"),
assessment(split) %>%
mutate(analysis = "Assessment")
) %>%
ggplot(aes(x, y, color = analysis)) +
geom_point(size = 1.5, alpha = 0.8) +
coord_fixed() +
labs(color = NULL)
print(p)
}
walk(good_folds$splits, plot_splits)
walk(bad_folds$splits, plot_splits)
glm_spec <- logistic_reg()
lsl_form <- lslpts ~ slope + cplan + cprof + elev + log10_carea
lsl_wf <- workflow(lsl_form, glm_spec)
doParallel::registerDoParallel(cores = 2)
set.seed(2021)
regular_rs <- fit_resamples(lsl_wf, bad_folds)
set.seed(2021)
spatial_rs <- fit_resamples(lsl_wf, good_folds)
collect_metrics(regular_rs)
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.737 5 0.0173 Preprocessor1_Model1
## 2 roc_auc binary 0.808 5 0.0201 Preprocessor1_Model1
collect_metrics(spatial_rs)
## # A tibble: 2 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.698 10 0.0392 Preprocessor1_Model1
## 2 roc_auc binary 0.796 10 0.0286 Preprocessor1_Model1
If we use the “regular” resampling, we get a more optimistc estimate of performance which would fool us into thinking our model would perform better than it really could. The lower performance estimate using spatial resampling is more accurate because of the autocorrelation of this geographic data; observations near each other are more alike than observations far apart. With geographic data, it’s important to use an appropriate model evaluation strategy!