In the previous blog post I showed the general use case of tidy-modeling approach in multi-class classification problem using Random Forest() algorithm Click Here. The package spaitlsample() allows to cluster spatial observations into spatial groups based on kmeans() clustering. However, I don’t think there is a package that allows to perform other clustering (e.g. hierarchical clustering) and repeated-cross validation in spatial data.
Here I wrote a function repated_spatial_clustering_cv similar to spatial_clustering_cv function from spatial sample() package. The function can have following input:
Take data frame with longitude (X) and latitude (Y) as variables
Take simple feature sf point data type
Take spatial Points or SpatialPointDataFrame objects from sp packages
Applies partitioning clustering i.e. kmeans, or Hierarchical clustering using hclust() function from stats package
Here, my goal is to predict appropriate Land Use and Land Cover Type
library(tidymodels)
library(sf)
# read training shapefile data as tdata
tdata<- sf::st_read("C:\\Users\\suved\\Documents\\train.shp")
## Reading layer `train' from data source `C:\Users\suved\Documents\train.shp' using driver `ESRI Shapefile'
## Simple feature collection with 1922 features and 43 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 390919.6 ymin: 3119625 xmax: 460847.2 ymax: 3168895
## Projected CRS: NAD83 / UTM zone 14N
This time, I am using my data set which I digitized using heads-up digitization method. It has several features generated based on Red, Green, Blue, Near infrared bands, which were then used to derive principal components axes (bands), grey-level co-occurances matrices in first principal component axis, several spectral indices along with geometry features of objects were also derived from multi-resolution segmentation. This is beyond the scope of this tutorial.
# head of training data
tdata[1:5, 1:10]
## Simple feature collection with 5 features and 10 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 392210.5 ymin: 3119625 xmax: 397434.5 ymax: 3119731
## Projected CRS: NAD83 / UTM zone 14N
## Class_name ASYM BDRI COMPT DENS MASM MAVG MAXD
## 1 5 0.7314657 1.884956 1.914278 1.958027 0.06211513 1.866592 3.359076
## 2 5 0.7789210 2.593156 2.364084 1.703363 0.06226670 2.062140 3.498545
## 3 5 0.5165840 2.447257 1.946576 2.038557 0.06229337 1.836628 3.201589
## 4 4 0.3670303 2.425428 2.441741 2.074016 0.09680321 1.316488 3.043043
## 5 4 0.4476154 1.558442 1.458198 2.201853 0.10974729 1.121580 3.096424
## MBLU MDIS geometry
## 1 66.63188 1.6617853 POINT (392210.5 3119625)
## 2 63.39825 1.9870977 POINT (395942.6 3119661)
## 3 69.88172 1.6976533 POINT (393647 3119704)
## 4 90.20121 1.1538607 POINT (397385.1 3119726)
## 5 92.37401 0.9766353 POINT (397434.5 3119731)
unique(tdata$Class_name)
## [1] 5 4 2 6 7 8 1 9
ggplot()+
#geom_sf(data = tdata)+
geom_sf(data = tdata, aes(colour = factor(Class_name)), size = 1.5, alpha = 0.7)+
scale_colour_viridis_d(option = "magma")+
labs(color = "Land Cover Class")+
theme_bw(12)
set.seed(1318)
srs<- repeated_spatial_clustering_cv(data = tdata,v = 5,repeats = 5,
coords = NULL,spatial = TRUE,clust_method = "kmeans",dist_clust = NULL)
srs
## # A tibble: 25 x 3
## splits id id2
## <list> <chr> <chr>
## 1 <split [1503/419]> Repeat1 Fold1
## 2 <split [1632/290]> Repeat1 Fold2
## 3 <split [1573/349]> Repeat1 Fold3
## 4 <split [1353/569]> Repeat1 Fold4
## 5 <split [1627/295]> Repeat1 Fold5
## 6 <split [1331/591]> Repeat2 Fold1
## 7 <split [1648/274]> Repeat2 Fold2
## 8 <split [1482/440]> Repeat2 Fold3
## 9 <split [1654/268]> Repeat2 Fold4
## 10 <split [1573/349]> Repeat2 Fold5
## # ... with 15 more rows
When I was working on my PhD projects I discovered few approaches to spatial data resampling especially when they have spatial structures. Consequently, I have been working on these approaches(which, I will be discussing about them in the later tutorial). The split object srs has list of spatial_clustering_split similar to that of objects from *spatialsample** package.
spatial repeated samples srs object has now resamples based on 10-fold with 5-repeats.
plot_splits <- function(split) {
p <- analysis(split) %>%
mutate(analysis = "Training") %>%
bind_rows(assessment(split) %>%
mutate(analysis = "Testing")) %>%
ggplot(aes(X, Y, color = analysis)) +
geom_point(alpha = 0.7, size = 2) +
coord_fixed()+
#facet_wrap(~analysis,scales = "free")+
labs(color = "Train/Test")+
scale_color_viridis_d(direction = -1)+
xlab("Longitude (m)")+
ylab("Latitidue (m)")
# transition_reveal(analysis)
print(p)
}
# plot_splits function is inspired from one of the screen cast of Julia Silge
# https://juliasilge.com/blog/map-challenge/
walk(srs$splits, plot_splits)
The clustering of observations using partitioning algorithm kmeans is clear, unlike regular random sampling. The training and testing data shows clustering of nearby objects. The regular clustering can be obtained in the above command defining strata = Class_name, where Class_name is the land use land cover class in integer
How about the Hierarchical Clustering? lets look at this. For hierarchical clustering, the algorithms uses geometry features and returns the split objects as we’ve seen before.
hclust function from stats package, has the following arguments:
d : a dissimmilary structure as produced by dist. Here dist function is dist(xy_geometry, method = “euclidean” ).
method : the agglomeration method to be used.
This should be one of “ward.D”, “ward.D2”, “single”, “complete”, “average” (= UPGMA), “mcquitty” (= WPGMA), “median” (= WPGMC) or “centroid” (= UPGMC). hclust(d, method = “complete”, members = NULL). In the repeated_spatial_clustering_cv, method (in hclust) == dist_clust
Further details about hclust can be invoved by using ?stats::hclust in R console.
Now lets look at example of Hierarcical clustering
set.seed(1318)
srs_hw2<- repeated_spatial_clustering_cv(data = tdata,v = 5,repeats = 5,
coords = NULL,spatial = TRUE,clust_method = "hclust",dist_clust = "ward.D2")
# average method (UPGMA)
srs_havg<- repeated_spatial_clustering_cv(data = tdata, v = 5, repeats = 5,
coords = NULL,spatial = TRUE,clust_method = "hclust",
dist_clust = "average")
Similarly other agglomertion method can be used in dist_clust arguments. Now, lets plots these two methods
Ward.D2 Method
# Lets re-use plot_splits function
purrr::walk(srs_hw2$splits, plot_splits)
Average Method
# Lets re-use plot_splits function
purrr::walk(srs_havg$splits, plot_splits)
I will discuss the classification using repeated-spatial-clustering in the next post.
sessioninfo::session_info()
## - Session info ---------------------------------------------------------------
## setting value
## version R version 4.1.2 (2021-11-01)
## os Windows 10 x64 (build 19042)
## system x86_64, mingw32
## ui RTerm
## language (EN)
## collate English_United States.1252
## ctype English_United States.1252
## tz America/Chicago
## date 2022-02-23
## pandoc 2.14.0.3 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
##
## - Packages -------------------------------------------------------------------
## package * version date (UTC) lib source
## assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
## backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.2)
## broom * 0.7.12 2022-01-28 [1] CRAN (R 4.1.2)
## bslib 0.3.1 2021-10-06 [1] CRAN (R 4.1.1)
## class 7.3-20 2022-01-13 [1] CRAN (R 4.1.2)
## classInt 0.4-3 2020-04-07 [1] CRAN (R 4.1.0)
## cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.2)
## colorspace 2.0-2 2021-06-24 [1] CRAN (R 4.1.1)
## crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.2)
## DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.2)
## dials * 0.1.0 2022-01-31 [1] CRAN (R 4.1.2)
## DiceDesign 1.9 2021-02-13 [1] CRAN (R 4.1.1)
## digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
## dplyr * 1.0.8 2022-02-08 [1] CRAN (R 4.1.2)
## e1071 1.7-9 2021-09-16 [1] CRAN (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0)
## evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
## fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
## farver 2.1.0 2021-02-28 [1] CRAN (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0)
## foreach 1.5.2 2022-02-02 [1] CRAN (R 4.1.2)
## furrr 0.2.3 2021-06-25 [1] CRAN (R 4.1.1)
## future 1.24.0 2022-02-19 [1] CRAN (R 4.1.2)
## future.apply 1.8.1 2021-08-10 [1] CRAN (R 4.1.1)
## generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
## ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.0)
## gifski 1.4.3-1 2021-05-02 [1] CRAN (R 4.1.2)
## globals 0.14.0 2020-11-22 [1] CRAN (R 4.1.0)
## glue 1.6.1 2022-01-22 [1] CRAN (R 4.1.2)
## gower 1.0.0 2022-02-03 [1] CRAN (R 4.1.2)
## GPfit 1.0-8 2019-02-08 [1] CRAN (R 4.1.1)
## gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.0)
## hardhat 0.2.0 2022-01-24 [1] CRAN (R 4.1.2)
## highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
## htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1)
## infer * 1.0.0 2021-08-13 [1] CRAN (R 4.1.1)
## ipred 0.9-12 2021-09-15 [1] CRAN (R 4.1.1)
## iterators 1.0.14 2022-02-05 [1] CRAN (R 4.1.2)
## jquerylib 0.1.4 2021-04-26 [1] CRAN (R 4.1.1)
## jsonlite 1.8.0 2022-02-22 [1] CRAN (R 4.1.2)
## KernSmooth 2.23-20 2021-05-03 [2] CRAN (R 4.1.2)
## knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
## labeling 0.4.2 2020-10-20 [1] CRAN (R 4.1.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
## lava 1.6.10 2021-09-02 [1] CRAN (R 4.1.1)
## lhs 1.1.4 2022-02-20 [1] CRAN (R 4.1.2)
## lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] CRAN (R 4.1.1)
## lubridate 1.8.0 2021-10-07 [1] CRAN (R 4.1.1)
## magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
## MASS 7.3-55 2022-01-13 [1] CRAN (R 4.1.2)
## Matrix 1.4-0 2021-12-08 [1] CRAN (R 4.1.2)
## modeldata * 0.1.1 2021-07-14 [1] CRAN (R 4.1.1)
## munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.0)
## nnet 7.3-17 2022-01-13 [1] CRAN (R 4.1.2)
## parallelly 1.30.0 2021-12-17 [1] CRAN (R 4.1.2)
## parsnip * 0.1.7 2021-07-21 [1] CRAN (R 4.1.1)
## pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
## pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0)
## plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.0)
## pROC 1.18.0 2021-09-03 [1] CRAN (R 4.1.1)
## prodlim 2019.11.13 2019-11-17 [1] CRAN (R 4.1.0)
## proxy 0.4-26 2021-06-07 [1] CRAN (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] CRAN (R 4.1.0)
## R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.1)
## Rcpp 1.0.8 2022-01-13 [1] CRAN (R 4.1.2)
## recipes * 0.2.0 2022-02-18 [1] CRAN (R 4.1.2)
## rlang 1.0.1 2022-02-03 [1] CRAN (R 4.1.2)
## rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1)
## rpart 4.1.16 2022-01-24 [1] CRAN (R 4.1.2)
## rsample * 0.1.1 2021-11-08 [1] CRAN (R 4.1.1)
## rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] CRAN (R 4.1.1)
## scales * 1.1.1 2020-05-11 [1] CRAN (R 4.1.0)
## sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
## sf * 1.0-6 2022-02-04 [1] CRAN (R 4.1.2)
## sp * 1.4-6 2021-11-14 [1] CRAN (R 4.1.2)
## stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
## stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
## survival 3.2-13 2021-08-24 [2] CRAN (R 4.1.2)
## tibble * 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
## tidymodels * 0.1.4 2021-10-01 [1] CRAN (R 4.1.1)
## tidyr * 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
## tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2)
## timeDate 3043.102 2018-02-21 [1] CRAN (R 4.1.0)
## tune * 0.1.6 2021-07-21 [1] CRAN (R 4.1.1)
## units 0.8-0 2022-02-05 [1] CRAN (R 4.1.2)
## utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.1)
## vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0)
## viridisLite 0.4.0 2021-04-13 [1] CRAN (R 4.1.0)
## withr 2.4.3 2021-11-30 [1] CRAN (R 4.1.2)
## workflows * 0.2.4 2021-10-12 [1] CRAN (R 4.1.1)
## workflowsets * 0.1.0 2021-07-22 [1] CRAN (R 4.1.1)
## xfun 0.29 2021-12-14 [1] CRAN (R 4.1.2)
## yaml 2.3.4 2022-02-17 [1] CRAN (R 4.1.2)
## yardstick * 0.0.9 2021-11-22 [1] CRAN (R 4.1.2)
##
## [1] C:/Users/suved/Documents/R/win-library/4.1
## [2] C:/Program Files/R/R-4.1.2/library
##
## ------------------------------------------------------------------------------