Testing for spatial clustering of groups

Context
Packages
Data transformation
Two functions
Doing the test
With your data
Parallelising the randomization

Context

Imagine that you have a data frame that looks like this:

> tsclust_output
      province group
1      Dak Lak     A
2      Gia Lai     A
3      Kon Tum     A
4      Bac Kan     A
5   Quang Ninh     A
6    Hai Duong     B
7      Can Tho     B
8  Ho Chi Minh     B
9      Lao Cai     B
10 Thai Nguyen     B
11   Thai Binh     C
12    Hung Yen     C
13  Kien Giang     C
14    Lai Chau     C
15   Soc Trang     C

with provinces names and the name of the group to which each of these provinces belong too. What you want to do is to test whether these groups are spatially clustered. One way of doing this is to compute the mean within-group pairwise distance and to test whether the observed value of this statistic is significantly smaller than what you’d expect by chance. The null distribution of the statistic will be generated by randomization.

Packages

We need the following packages, install them if not installed.

> library(dplyr)     # mutate(), select()
> library(geosphere) # distm()
> library(magrittr)  # %>%, %<>%
> library(rgeos)     # gCentroid()
> library(sp)        # coordinates()

We also need the gadmVN package that contains the shapefile of the provinces of Vietnam and that you can install from Github:

> devtools::install_github("choisy/gadmVN")

Data transformation

First thing you need to do is to replace the provinces names by the coordinates of their barycenters. For that, you need to retrieve the shapefile of the provinces of Vietnam with the gadm function of the gadmVN package:

> vn <- gadmVN::gadm()

Then, you can use this shapefile to replace the provinces names with the coordinates of their centroids. The “tidyverse” way of doing it is:

> tsclust_output %<>% 
+   mutate(centroid    = lapply(province, function(x) gCentroid(subset(vn, province == x))),
+          coordinates = lapply(centroid, coordinates),
+          longitude   = lapply(coordinates, `[`, 1, "x"),
+          latitude    = lapply(coordinates, `[`, 1, "y")) %>% 
+   select(longitude, latitude, group)

And we get this:

> tsclust_output
   longitude latitude group
1   108.2134 12.82257     A
2   108.2417 13.79778     A
3   107.8769 14.64836     A
4   105.8271 22.26174     A
5   107.2725 21.25197     A
6   106.3615 20.93105     B
7   105.5297 10.11474     B
8   106.6973 10.74447     B
9   104.1128 22.36508     B
10  105.8221 21.69374     B
11  106.4001 20.50215     C
12  106.0615 20.81519     C
13  104.9407 9.999109     C
14  103.1866 22.31707     C
15  105.9281 9.556157     C

Two functions

You need a function that takes a data frame of longitudes and latitudes as an input and returns a vector of pairwise distances as an output, still in the “tidyverse” style:

> pairwise_distance <- function(df) {
+   df %>% 
+     `[`(, c("longitude", "latitude")) %>%
+     unlist() %>% 
+     matrix(, 2) %>% 
+     geosphere::distm() %>% 
+     {.[lower.tri(.)]}
+ }

You can use this function to compute the value of your statistic of interest, with is the mean within-group pairwise distance:

> compute_statistic <- function(df) {
+   df %>%
+     split(df$group) %>%
+     lapply(pairwise_distance) %>%
+     unlist() %>%
+     mean()
+ }

Doing the test

You can use the above functions to perform your test. First, calculate the value of your statistic on your data:

> (stat_val_data <- compute_statistic(tsclust_output))
[1] 759862.7

Then, to compute the significativity of this statistic, you need to generate the distribution of this statistic under a null hypothesis of absence of spatial clustering. For that you do randomization on tsclust_output and apply the function stat_val_data multiple times:

> stat_null_distr <-
+   replicate(1000, mutate(tsclust_output, group = sample(group)), FALSE) %>%
+   sapply(compute_statistic)

You can now plot the distribution, together with the observed value of the statistic:

> hist(stat_null_distr, 100, col = "grey", main = NA, xlab = "statistic value", ylab = "frequency")
> abline(v = stat_val_data, col = "red", lwd = 2)

And you can compute the probability that the observed value of the statistic is significantly lower (meaning groups clustered in space) than what you’d expect by chance:

> sum(stat_null_distr < stat_val_data) / length(stat_null_distr)
[1] 0.676

With your data

Loading your data:

> load("Result_Kmeans.Rdata")

Putting a bit in shape:

> names(result_k_means) <- c("longitude", "latitude", "group")
> result_k_means <- na.exclude(result_k_means)
> str(result_k_means)
'data.frame':   267 obs. of  3 variables:
 $ longitude: num  107 101 104 106 104 ...
 $ latitude : num  14.8 20.3 18.5 14.8 20.3 ...
 $ group    : num  2 2 1 1 2 1 2 3 2 2 ...
 - attr(*, "na.action")= 'exclude' Named int  268 269 270 271 272 273 274 275 276 277 ...
  ..- attr(*, "names")= chr  "268" "269" "270" "271" ...
> unique(result_k_means$group)
[1] 2 1 3

Exporting, before sending to the cluster:

> saveRDS(result_k_means, "result_k_means.rds")

Parallelising the randomization

Below is a script that parallelizes the code, using the package parallel. The use of the function parSapply allows to perform a parallized version of replicate function that we use above.

# number of repetitions:
nb <- 10000

# loading the parallel package:
library(parallel) # detectCores(), makeCluster(), clusterEvalQ(),
                  # clusterExport(), parSapply(), stopCluster()

# loading the required data:
result_k_means <- readRDS("result_k_means.rds")

# defining the function we need in the master workspace:
pairwise_distance <- function(df) {
  df %>%
    `[`(, c("longitude", "latitude")) %>%
    unlist() %>%
    matrix(, 2) %>%
    geosphere::distm() %>%
    {.[lower.tri(.)]}
}

# and same for the other function:
compute_statistic <- function(df) {
  df %>%
    split(df$group) %>%
    lapply(pairwise_distance) %>%
    unlist() %>%
    mean()
}

# create a cluster, with all the available cores but one:
cl <- makeCluster(detectCores() - 1)

# loading the needed packages on each core:
clusterEvalQ(cl, library(dplyr, geosphere))

# sending needed object from master workspace to each core's workspace:
clusterExport(cl, c("result_k_means", "pairwise_distance", "compute_statistic"))

# parallelized replication:
stat_null_distr <- parSapply(cl, 1:nb,
                             function(x) compute_statistic(mutate(result_k_means, group = sample(group))))

# stopping the cluster:
stopCluster(cl)

# saving the data to disk:
saveRDS(stat_null_distr, "output.rds")

This script can be saved into a text file named spatial_cluster_test.R which can be run from the shell:

$ R CMD BATCH spatial_cluster_test.R

The same, but sending the job in the background at the same time:

$ R CMD BATCH spatial_cluster_test.R &

The same, but additionally prepending the command by nohup to prevent killing the process upon log out:

$ nohup R CMD BATCH spatial_cluster_test.R &

After running for 100 minutes on 19 CPU, this code could perform 1,000,000 randomizations on the result_k_means data frame that contains 267 provinces.

> output <- readRDS("output.rds")

The observed statistic is

> (observed <- compute_statistic(result_k_means))
[1] 1327288

Then the figure:

> hist(output, 100, col = "grey", main = NA, xlab = "statistic value", ylab = "frequency")

We can see that the observed statistic is very very very much smaller than what expected from chance:

> sum(output < observed) / length(output)
[1] 0

Which means that p < 1e-6: highly significant!