Imagine that you have a data frame that looks like this:
> tsclust_output
province group
1 Dak Lak A
2 Gia Lai A
3 Kon Tum A
4 Bac Kan A
5 Quang Ninh A
6 Hai Duong B
7 Can Tho B
8 Ho Chi Minh B
9 Lao Cai B
10 Thai Nguyen B
11 Thai Binh C
12 Hung Yen C
13 Kien Giang C
14 Lai Chau C
15 Soc Trang C
with provinces names and the name of the group to which each of these provinces belong too. What you want to do is to test whether these groups are spatially clustered. One way of doing this is to compute the mean within-group pairwise distance and to test whether the observed value of this statistic is significantly smaller than what you’d expect by chance. The null distribution of the statistic will be generated by randomization.
We need the following packages, install them if not installed.
> library(dplyr) # mutate(), select()
> library(geosphere) # distm()
> library(magrittr) # %>%, %<>%
> library(rgeos) # gCentroid()
> library(sp) # coordinates()
We also need the gadmVN package that contains the shapefile of the provinces of Vietnam and that you can install from Github:
> devtools::install_github("choisy/gadmVN")
First thing you need to do is to replace the provinces names by the coordinates of their barycenters. For that, you need to retrieve the shapefile of the provinces of Vietnam with the gadm function of the gadmVN package:
> vn <- gadmVN::gadm()
Then, you can use this shapefile to replace the provinces names with the coordinates of their centroids. The “tidyverse” way of doing it is:
> tsclust_output %<>%
+ mutate(centroid = lapply(province, function(x) gCentroid(subset(vn, province == x))),
+ coordinates = lapply(centroid, coordinates),
+ longitude = lapply(coordinates, `[`, 1, "x"),
+ latitude = lapply(coordinates, `[`, 1, "y")) %>%
+ select(longitude, latitude, group)
And we get this:
> tsclust_output
longitude latitude group
1 108.2134 12.82257 A
2 108.2417 13.79778 A
3 107.8769 14.64836 A
4 105.8271 22.26174 A
5 107.2725 21.25197 A
6 106.3615 20.93105 B
7 105.5297 10.11474 B
8 106.6973 10.74447 B
9 104.1128 22.36508 B
10 105.8221 21.69374 B
11 106.4001 20.50215 C
12 106.0615 20.81519 C
13 104.9407 9.999109 C
14 103.1866 22.31707 C
15 105.9281 9.556157 C
You need a function that takes a data frame of longitudes and latitudes as an input and returns a vector of pairwise distances as an output, still in the “tidyverse” style:
> pairwise_distance <- function(df) {
+ df %>%
+ `[`(, c("longitude", "latitude")) %>%
+ unlist() %>%
+ matrix(, 2) %>%
+ geosphere::distm() %>%
+ {.[lower.tri(.)]}
+ }
You can use this function to compute the value of your statistic of interest, with is the mean within-group pairwise distance:
> compute_statistic <- function(df) {
+ df %>%
+ split(df$group) %>%
+ lapply(pairwise_distance) %>%
+ unlist() %>%
+ mean()
+ }
You can use the above functions to perform your test. First, calculate the value of your statistic on your data:
> (stat_val_data <- compute_statistic(tsclust_output))
[1] 759862.7
Then, to compute the significativity of this statistic, you need to generate the distribution of this statistic under a null hypothesis of absence of spatial clustering. For that you do randomization on tsclust_output and apply the function stat_val_data multiple times:
> stat_null_distr <-
+ replicate(1000, mutate(tsclust_output, group = sample(group)), FALSE) %>%
+ sapply(compute_statistic)
You can now plot the distribution, together with the observed value of the statistic:
> hist(stat_null_distr, 100, col = "grey", main = NA, xlab = "statistic value", ylab = "frequency")
> abline(v = stat_val_data, col = "red", lwd = 2)
And you can compute the probability that the observed value of the statistic is significantly lower (meaning groups clustered in space) than what you’d expect by chance:
> sum(stat_null_distr < stat_val_data) / length(stat_null_distr)
[1] 0.676
Loading your data:
> load("Result_Kmeans.Rdata")
Putting a bit in shape:
> names(result_k_means) <- c("longitude", "latitude", "group")
> result_k_means <- na.exclude(result_k_means)
> str(result_k_means)
'data.frame': 267 obs. of 3 variables:
$ longitude: num 107 101 104 106 104 ...
$ latitude : num 14.8 20.3 18.5 14.8 20.3 ...
$ group : num 2 2 1 1 2 1 2 3 2 2 ...
- attr(*, "na.action")= 'exclude' Named int 268 269 270 271 272 273 274 275 276 277 ...
..- attr(*, "names")= chr "268" "269" "270" "271" ...
> unique(result_k_means$group)
[1] 2 1 3
Exporting, before sending to the cluster:
> saveRDS(result_k_means, "result_k_means.rds")
Below is a script that parallelizes the code, using the package parallel. The use of the function parSapply allows to perform a parallized version of replicate function that we use above.
# number of repetitions:
nb <- 10000
# loading the parallel package:
library(parallel) # detectCores(), makeCluster(), clusterEvalQ(),
# clusterExport(), parSapply(), stopCluster()
# loading the required data:
result_k_means <- readRDS("result_k_means.rds")
# defining the function we need in the master workspace:
pairwise_distance <- function(df) {
df %>%
`[`(, c("longitude", "latitude")) %>%
unlist() %>%
matrix(, 2) %>%
geosphere::distm() %>%
{.[lower.tri(.)]}
}
# and same for the other function:
compute_statistic <- function(df) {
df %>%
split(df$group) %>%
lapply(pairwise_distance) %>%
unlist() %>%
mean()
}
# create a cluster, with all the available cores but one:
cl <- makeCluster(detectCores() - 1)
# loading the needed packages on each core:
clusterEvalQ(cl, library(dplyr, geosphere))
# sending needed object from master workspace to each core's workspace:
clusterExport(cl, c("result_k_means", "pairwise_distance", "compute_statistic"))
# parallelized replication:
stat_null_distr <- parSapply(cl, 1:nb,
function(x) compute_statistic(mutate(result_k_means, group = sample(group))))
# stopping the cluster:
stopCluster(cl)
# saving the data to disk:
saveRDS(stat_null_distr, "output.rds")
This script can be saved into a text file named spatial_cluster_test.R which can be run from the shell:
$ R CMD BATCH spatial_cluster_test.R
The same, but sending the job in the background at the same time:
$ R CMD BATCH spatial_cluster_test.R &
The same, but additionally prepending the command by nohup to prevent killing the process upon log out:
$ nohup R CMD BATCH spatial_cluster_test.R &
After running for 100 minutes on 19 CPU, this code could perform 1,000,000 randomizations on the result_k_means data frame that contains 267 provinces.
> output <- readRDS("output.rds")
The observed statistic is
> (observed <- compute_statistic(result_k_means))
[1] 1327288
Then the figure:
> hist(output, 100, col = "grey", main = NA, xlab = "statistic value", ylab = "frequency")
We can see that the observed statistic is very very very much smaller than what expected from chance:
> sum(output < observed) / length(output)
[1] 0
Which means that p < 1e-6: highly significant!