1 Repated Spatial Sampling

In the previous blog post I showed the general use case of tidy-modeling approach in multi-class classification problem using Random Forest() algorithm Click Here. The package spaitlsample() allows to cluster spatial observations into spatial groups based on kmeans() clustering. However, I don’t think there is a package that allows to perform other clustering (e.g. hierarchical clustering) and repeated-cross validation in spatial data.

Here I wrote a function repated_spatial_clustering_cv similar to spatial_clustering_cv function from spatial sample() package. The function can have following input:

Take data frame with longitude (X) and latitude (Y) as variables
Take simple feature sf point data type
Take spatial Points or SpatialPointDataFrame objects from sp packages
Applies partitioning clustering i.e. kmeans, or Hierarchical clustering using hclust() function from stats package

2 Explore data

Here, my goal is to predict appropriate Land Use and Land Cover Type

2.1 Load package and data

library(tidymodels)
library(sf)

# read training shapefile data as tdata
tdata<- sf::st_read("C:\\Users\\suved\\Documents\\train.shp")

## Reading layer `train' from data source `C:\Users\suved\Documents\train.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1922 features and 43 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 390919.6 ymin: 3119625 xmax: 460847.2 ymax: 3168895
## Projected CRS: NAD83 / UTM zone 14N

This time, I am using my data set which I digitized using heads-up digitization method. It has several features generated based on Red, Green, Blue, Near infrared bands, which were then used to derive principal components axes (bands), grey-level co-occurances matrices in first principal component axis, several spectral indices along with geometry features of objects were also derived from multi-resolution segmentation. This is beyond the scope of this tutorial.

# head of training data

tdata[1:5, 1:10]

## Simple feature collection with 5 features and 10 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: 392210.5 ymin: 3119625 xmax: 397434.5 ymax: 3119731
## Projected CRS: NAD83 / UTM zone 14N
##   Class_name      ASYM     BDRI    COMPT     DENS       MASM     MAVG     MAXD
## 1          5 0.7314657 1.884956 1.914278 1.958027 0.06211513 1.866592 3.359076
## 2          5 0.7789210 2.593156 2.364084 1.703363 0.06226670 2.062140 3.498545
## 3          5 0.5165840 2.447257 1.946576 2.038557 0.06229337 1.836628 3.201589
## 4          4 0.3670303 2.425428 2.441741 2.074016 0.09680321 1.316488 3.043043
## 5          4 0.4476154 1.558442 1.458198 2.201853 0.10974729 1.121580 3.096424
##       MBLU      MDIS                 geometry
## 1 66.63188 1.6617853 POINT (392210.5 3119625)
## 2 63.39825 1.9870977 POINT (395942.6 3119661)
## 3 69.88172 1.6976533   POINT (393647 3119704)
## 4 90.20121 1.1538607 POINT (397385.1 3119726)
## 5 92.37401 0.9766353 POINT (397434.5 3119731)

unique(tdata$Class_name)

## [1] 5 4 2 6 7 8 1 9

ggplot()+
  #geom_sf(data = tdata)+
  geom_sf(data = tdata, aes(colour = factor(Class_name)), size = 1.5, alpha = 0.7)+
  scale_colour_viridis_d(option = "magma")+
  labs(color = "Land Cover Class")+
  theme_bw(12)

3 Create repeated spatial resamples

3.1 Partitioning clustering

set.seed(1318)

srs<- repeated_spatial_clustering_cv(data = tdata,v = 5,repeats = 5,
                            coords = NULL,spatial = TRUE,clust_method = "kmeans",dist_clust = NULL)

srs

## # A tibble: 25 x 3
##    splits             id      id2  
##    <list>             <chr>   <chr>
##  1 <split [1503/419]> Repeat1 Fold1
##  2 <split [1632/290]> Repeat1 Fold2
##  3 <split [1573/349]> Repeat1 Fold3
##  4 <split [1353/569]> Repeat1 Fold4
##  5 <split [1627/295]> Repeat1 Fold5
##  6 <split [1331/591]> Repeat2 Fold1
##  7 <split [1648/274]> Repeat2 Fold2
##  8 <split [1482/440]> Repeat2 Fold3
##  9 <split [1654/268]> Repeat2 Fold4
## 10 <split [1573/349]> Repeat2 Fold5
## # ... with 15 more rows

When I was working on my PhD projects I discovered few approaches to spatial data resampling especially when they have spatial structures. Consequently, I have been working on these approaches(which, I will be discussing about them in the later tutorial). The split object srs has list of spatial_clustering_split similar to that of objects from *spatialsample** package.

spatial repeated samples srs object has now resamples based on 10-fold with 5-repeats.

plot_splits <- function(split) {
  p <- analysis(split) %>%
    mutate(analysis = "Training") %>%
    bind_rows(assessment(split) %>%
                mutate(analysis = "Testing")) %>%
    ggplot(aes(X, Y, color = analysis)) +
    geom_point(alpha = 0.7, size = 2) +
    coord_fixed()+
    #facet_wrap(~analysis,scales = "free")+
    labs(color = "Train/Test")+
    scale_color_viridis_d(direction = -1)+
    xlab("Longitude (m)")+
    ylab("Latitidue (m)")
   # transition_reveal(analysis)
  print(p)
}

# plot_splits function is inspired from one of the screen cast of Julia Silge
# https://juliasilge.com/blog/map-challenge/

walk(srs$splits, plot_splits)

The clustering of observations using partitioning algorithm kmeans is clear, unlike regular random sampling. The training and testing data shows clustering of nearby objects. The regular clustering can be obtained in the above command defining strata = Class_name, where Class_name is the land use land cover class in integer

How about the Hierarchical Clustering? lets look at this. For hierarchical clustering, the algorithms uses geometry features and returns the split objects as we’ve seen before.

3.2 Hierarchical clustering

hclust function from stats package, has the following arguments:

d : a dissimmilary structure as produced by dist. Here dist function is dist(xy_geometry, method = “euclidean” ).

method : the agglomeration method to be used.

This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC). hclust(d, method = “complete”, members = NULL). In the repeated_spatial_clustering_cv, method (in hclust) == dist_clust

Further details about hclust can be invoved by using ?stats::hclust in R console.

Now lets look at example of Hierarcical clustering

set.seed(1318)

srs_hw2<- repeated_spatial_clustering_cv(data = tdata,v = 5,repeats = 5,
                            coords = NULL,spatial = TRUE,clust_method = "hclust",dist_clust = "ward.D2")

# average method (UPGMA)
srs_havg<- repeated_spatial_clustering_cv(data = tdata, v = 5, repeats = 5,
                                          coords = NULL,spatial = TRUE,clust_method = "hclust",
                                            dist_clust = "average")

Similarly other agglomertion method can be used in dist_clust arguments. Now, lets plots these two methods

Ward.D2 Method

# Lets re-use plot_splits function

purrr::walk(srs_hw2$splits, plot_splits)

Average Method

# Lets re-use plot_splits function

purrr::walk(srs_havg$splits, plot_splits)

I will discuss the classification using repeated-spatial-clustering in the next post.

sessioninfo::session_info()

## - Session info ---------------------------------------------------------------
##  setting  value
##  version  R version 4.1.2 (2021-11-01)
##  os       Windows 10 x64 (build 19042)
##  system   x86_64, mingw32
##  ui       RTerm
##  language (EN)
##  collate  English_United States.1252
##  ctype    English_United States.1252
##  tz       America/Chicago
##  date     2022-02-23
##  pandoc   2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
## 
## - Packages -------------------------------------------------------------------
##  package      * version    date (UTC) lib source
##  assertthat     0.2.1      2019-03-21 [1] CRAN (R 4.1.0)
##  backports      1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
##  broom        * 0.7.12     2022-01-28 [1] CRAN (R 4.1.2)
##  bslib          0.3.1      2021-10-06 [1] CRAN (R 4.1.1)
##  class          7.3-20     2022-01-13 [1] CRAN (R 4.1.2)
##  classInt       0.4-3      2020-04-07 [1] CRAN (R 4.1.0)
##  cli            3.2.0      2022-02-14 [1] CRAN (R 4.1.2)
##  codetools      0.2-18     2020-11-04 [2] CRAN (R 4.1.2)
##  colorspace     2.0-2      2021-06-24 [1] CRAN (R 4.1.1)
##  crayon         1.5.0      2022-02-14 [1] CRAN (R 4.1.2)
##  DBI            1.1.2      2021-12-20 [1] CRAN (R 4.1.2)
##  dials        * 0.1.0      2022-01-31 [1] CRAN (R 4.1.2)
##  DiceDesign     1.9        2021-02-13 [1] CRAN (R 4.1.1)
##  digest         0.6.29     2021-12-01 [1] CRAN (R 4.1.2)
##  dplyr        * 1.0.8      2022-02-08 [1] CRAN (R 4.1.2)
##  e1071          1.7-9      2021-09-16 [1] CRAN (R 4.1.0)
##  ellipsis       0.3.2      2021-04-29 [1] CRAN (R 4.1.0)
##  evaluate       0.15       2022-02-18 [1] CRAN (R 4.1.2)
##  fansi          1.0.2      2022-01-14 [1] CRAN (R 4.1.2)
##  farver         2.1.0      2021-02-28 [1] CRAN (R 4.1.0)
##  fastmap        1.1.0      2021-01-25 [1] CRAN (R 4.1.0)
##  foreach        1.5.2      2022-02-02 [1] CRAN (R 4.1.2)
##  furrr          0.2.3      2021-06-25 [1] CRAN (R 4.1.1)
##  future         1.24.0     2022-02-19 [1] CRAN (R 4.1.2)
##  future.apply   1.8.1      2021-08-10 [1] CRAN (R 4.1.1)
##  generics       0.1.2      2022-01-31 [1] CRAN (R 4.1.2)
##  ggplot2      * 3.3.5      2021-06-25 [1] CRAN (R 4.1.0)
##  gifski         1.4.3-1    2021-05-02 [1] CRAN (R 4.1.2)
##  globals        0.14.0     2020-11-22 [1] CRAN (R 4.1.0)
##  glue           1.6.1      2022-01-22 [1] CRAN (R 4.1.2)
##  gower          1.0.0      2022-02-03 [1] CRAN (R 4.1.2)
##  GPfit          1.0-8      2019-02-08 [1] CRAN (R 4.1.1)
##  gtable         0.3.0      2019-03-25 [1] CRAN (R 4.1.0)
##  hardhat        0.2.0      2022-01-24 [1] CRAN (R 4.1.2)
##  highr          0.9        2021-04-16 [1] CRAN (R 4.1.0)
##  htmltools      0.5.2      2021-08-25 [1] CRAN (R 4.1.1)
##  infer        * 1.0.0      2021-08-13 [1] CRAN (R 4.1.1)
##  ipred          0.9-12     2021-09-15 [1] CRAN (R 4.1.1)
##  iterators      1.0.14     2022-02-05 [1] CRAN (R 4.1.2)
##  jquerylib      0.1.4      2021-04-26 [1] CRAN (R 4.1.1)
##  jsonlite       1.8.0      2022-02-22 [1] CRAN (R 4.1.2)
##  KernSmooth     2.23-20    2021-05-03 [2] CRAN (R 4.1.2)
##  knitr          1.37       2021-12-16 [1] CRAN (R 4.1.2)
##  labeling       0.4.2      2020-10-20 [1] CRAN (R 4.1.0)
##  lattice        0.20-45    2021-09-22 [2] CRAN (R 4.1.2)
##  lava           1.6.10     2021-09-02 [1] CRAN (R 4.1.1)
##  lhs            1.1.4      2022-02-20 [1] CRAN (R 4.1.2)
##  lifecycle      1.0.1      2021-09-24 [1] CRAN (R 4.1.0)
##  listenv        0.8.0      2019-12-05 [1] CRAN (R 4.1.1)
##  lubridate      1.8.0      2021-10-07 [1] CRAN (R 4.1.1)
##  magrittr       2.0.2      2022-01-26 [1] CRAN (R 4.1.2)
##  MASS           7.3-55     2022-01-13 [1] CRAN (R 4.1.2)
##  Matrix         1.4-0      2021-12-08 [1] CRAN (R 4.1.2)
##  modeldata    * 0.1.1      2021-07-14 [1] CRAN (R 4.1.1)
##  munsell        0.5.0      2018-06-12 [1] CRAN (R 4.1.0)
##  nnet           7.3-17     2022-01-13 [1] CRAN (R 4.1.2)
##  parallelly     1.30.0     2021-12-17 [1] CRAN (R 4.1.2)
##  parsnip      * 0.1.7      2021-07-21 [1] CRAN (R 4.1.1)
##  pillar         1.7.0      2022-02-01 [1] CRAN (R 4.1.2)
##  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 4.1.0)
##  plyr           1.8.6      2020-03-03 [1] CRAN (R 4.1.0)
##  pROC           1.18.0     2021-09-03 [1] CRAN (R 4.1.1)
##  prodlim        2019.11.13 2019-11-17 [1] CRAN (R 4.1.0)
##  proxy          0.4-26     2021-06-07 [1] CRAN (R 4.1.0)
##  purrr        * 0.3.4      2020-04-17 [1] CRAN (R 4.1.0)
##  R6             2.5.1      2021-08-19 [1] CRAN (R 4.1.1)
##  Rcpp           1.0.8      2022-01-13 [1] CRAN (R 4.1.2)
##  recipes      * 0.2.0      2022-02-18 [1] CRAN (R 4.1.2)
##  rlang          1.0.1      2022-02-03 [1] CRAN (R 4.1.2)
##  rmarkdown      2.11       2021-09-14 [1] CRAN (R 4.1.1)
##  rpart          4.1.16     2022-01-24 [1] CRAN (R 4.1.2)
##  rsample      * 0.1.1      2021-11-08 [1] CRAN (R 4.1.1)
##  rstudioapi     0.13       2020-11-12 [1] CRAN (R 4.1.0)
##  sass           0.4.0      2021-05-12 [1] CRAN (R 4.1.1)
##  scales       * 1.1.1      2020-05-11 [1] CRAN (R 4.1.0)
##  sessioninfo    1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
##  sf           * 1.0-6      2022-02-04 [1] CRAN (R 4.1.2)
##  sp           * 1.4-6      2021-11-14 [1] CRAN (R 4.1.2)
##  stringi        1.7.6      2021-11-29 [1] CRAN (R 4.1.2)
##  stringr        1.4.0      2019-02-10 [1] CRAN (R 4.1.0)
##  survival       3.2-13     2021-08-24 [2] CRAN (R 4.1.2)
##  tibble       * 3.1.6      2021-11-07 [1] CRAN (R 4.1.2)
##  tidymodels   * 0.1.4      2021-10-01 [1] CRAN (R 4.1.1)
##  tidyr        * 1.2.0      2022-02-01 [1] CRAN (R 4.1.2)
##  tidyselect     1.1.2      2022-02-21 [1] CRAN (R 4.1.2)
##  timeDate       3043.102   2018-02-21 [1] CRAN (R 4.1.0)
##  tune         * 0.1.6      2021-07-21 [1] CRAN (R 4.1.1)
##  units          0.8-0      2022-02-05 [1] CRAN (R 4.1.2)
##  utf8           1.2.2      2021-07-24 [1] CRAN (R 4.1.1)
##  vctrs          0.3.8      2021-04-29 [1] CRAN (R 4.1.0)
##  viridisLite    0.4.0      2021-04-13 [1] CRAN (R 4.1.0)
##  withr          2.4.3      2021-11-30 [1] CRAN (R 4.1.2)
##  workflows    * 0.2.4      2021-10-12 [1] CRAN (R 4.1.1)
##  workflowsets * 0.1.0      2021-07-22 [1] CRAN (R 4.1.1)
##  xfun           0.29       2021-12-14 [1] CRAN (R 4.1.2)
##  yaml           2.3.4      2022-02-17 [1] CRAN (R 4.1.2)
##  yardstick    * 0.0.9      2021-11-22 [1] CRAN (R 4.1.2)
## 
##  [1] C:/Users/suved/Documents/R/win-library/4.1
##  [2] C:/Program Files/R/R-4.1.2/library
## 
## ------------------------------------------------------------------------------

Sampling of Spatial Data in Tidymodeling Framework: Repeated Spatial Cross Validation

Mukti Subedi