We load the saved list of ppp objects that were stored in ppp_store2 which consists of ppp objects split by region, then by room type, including one that covers all the different room types.
load("ppp_store2.RData")
# Set up empty list objects to save Ktestcsr results in
Ktestcsr1 <- list() # results of for-loop function
Ktestcsr2 <- list() # results of functionals
Ktestcsr3 <- list() # results of parallelisation
The above sets out empty list objects to store the envelope simulations for plotting later.
# Run envelope function for different room types for each region
# Function removes $all from the list and only runs the envelope if the number of points for that room type is non-zero.
runKregion1 <- function(ppp_region) {
set.seed(123)
reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", " ", .) %>% word(., -1))
temp_ppp = ppp_region[-1]
for (i in seq_along(temp_ppp)) {
var_name = names(temp_ppp)[i]
if (temp_ppp[[i]][["n"]] >0) {
Ktestcsr1[[reg_name]][[var_name]] <<- envelope(temp_ppp[[i]], Kest, nsim=99, rank=1, global=TRUE)
}
}
}
The above function uses a for loop to run the K-tests for a region, split by the different room types. The seed is set at the beginning of the function for reproducibility. It removes the first element ($all) from the region, and runs the envelope() function if there are non-zero points for that room type. It then saves the results into another list, split by region and room type for plotting access later. The envelope() function was run using the ‘best’ correction, with 99 simulations.
# Run envelope function for different room types for each region using purr
# Function removes $all from the list and only runs the envelope if the number of points for that room type is non-zero.
# helper function to be used in Functionals
runenv <- function(x) {
envelope(x, Kest, nsim=99, rank=1, global=TRUE)
}
runKregion2 <- function(ppp_region) {
set.seed(123)
reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", " ", .) %>% word(., -1))
# create temporary variable removing 'all' from the region list and any room types that are zero
temp_ppp = ppp_region %>% discard(., names(ppp_region)=="all") %>% discard(., map(., "n") ==0)
Ktestcsr2[[reg_name]] <<- map(temp_ppp, runenv)
}
The above code shows the same output with functional programming using functions such as map() and discard() from the purrr package that comes with tidyverse. purrr enhances R’s FP toolkit by providing a complete and consistent set of tools for working with functions and vectors. This provides more flexibility - it allows us to filter the lists in place before performing other functions on it. This means that it could apply to more general cases of any list of list with a subset as it only removes subtypes that have $all in the list. The seed is set at the beginning of the function for reproducibility. We create a runenv() function with the same specified parameters for the K-test (best correction, 99 simulations). We then use the map() function to run elements of the filtered ppp (temp_ppp) through the runenv() function.
What is parallelisation and why do we need it?
Performing Spatial Point Pattern Analysis (SPPA) can be computationally intensive for larger data sets, or data with non-uniform observation windows. It took us 60 hours to run the various Central region K-tests sequentially. There is a need for parallelisation for spatial analysis of larger datasets but there are no in-built parallelization methods in spatstat, and users will need to utilise other packages to parallelise operations.
Parallelisation is the idea that we split the computation work across multiple cores (or processors) simultaneously (i.e. run in parallel) to reduce the computation time. E.g. if we run a computation in 100 seconds on 1 processor, we should get the same computation done in (100/n) seconds using n processors. However, this is generally not seen in practice as there are overheads when running in parallel, such as spinning up the clusters and ensuring that each cluster has the required packages and data required to run the computation.
Parallelisation is also dependent on OS – functions such as mclapply()or other mc* functions from the parallel package (installed with base R) works for Mac and Linux, but not for Windows due to the use of forks. We use the doParallel and foreach packages to run this parallelisation. The following code is OS agnostic, and can be run on multi-core computers (laptop or desktop).
Users can also use the doAzureParallel package 1 to run the parallelisation on the cloud using Microsoft Azure within the foreach package. There are also other options for parallelisation on the cloud with other service providers such as Google Cloud Platform, DataBricks, AWS, which will not be discussed here.
Firstly we need to set up clusters for parallelisation. We use detectCores() to see how many cores there are in the laptop/desktop that can be used. Typically we use 1 or 2 less cores than the total number of cores to cater for other systems running in the background (e.g. OS). We then create clusters via ‘sockets’ using makeCluster() and use registerDoParallel() to register the parallel backend with the foreach package.
# Set up clusters
# Set up number of clusters (6 in this case - 2 less than max no. of cores)
nclus <- detectCores() - 2
# Make the clusters and register doParallel with the number of clusters
cl <- makeCluster(nclus)
registerDoParallel(cl)
In our case, the parallelisation is done by splitting the number of simulations between the different cores, then combining the results back together. We used 99 simulations in our sequential K-test simulations and therefore we will declare the number of simulations (100) to be split evenly between the cores.
While we have set up our clusters, they are a blank slate and we need to load objects, functions and packages that they need to run the computation. We only export the minimum required to the clusters as each object will take up memory in each of the clusters / processes. We use clusterExport() to export the data to the clusters and clusterEvalQ() to export the required libraries to each of the cores.
# Set up minimum total number of simulations for parallel function to split evenly between them (may be slightly more than 100)
nsims = 100
# Distribute the necessary R Objects
clusterExport(cl, c('ppp_store2', 'nsims','nclus'))
clusterEvalQ(cl, library(tidyverse, spatstat))
## [[1]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
##
## [[2]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
##
## [[3]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
##
## [[4]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
##
## [[5]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
##
## [[6]]
## [1] "forcats" "stringr" "dplyr" "purrr" "readr" "tidyr"
## [7] "tibble" "ggplot2" "tidyverse" "stats" "graphics" "grDevices"
## [13] "utils" "datasets" "methods" "base"
We now set up the functions that we will be using to parallelise the operations.
rep_ppp For each ppp object we will need to replicate it n number of times for the n cores that we set up in the cluster (nclus). We set simplify to FALSE so that it returns a list of ppp objects to iterate over, otherwise it returns a matrix of owins, and not a list of ppp objects.
runKobjpar This function is the core of the parallel operations. It takes the argument of the ppp object, runs the rep_ppp function to replicate it. Then for each of the replicated ppp object, we run an envelope of simulations using the spatstat functions envelope, using the passed function Kest into the envelope. We also set the number of simulations to be equally distributed, rounded up to the nearest number using the ceiling() function. Hence we will get a minimum of 100 simulations but sometimes more depending on the number of simulations and cores used. In this case we will have 102 (17*6) simulations in total. We need to ensure that savefuns is set to true so that each output is saved as an envelope object, otherwise we will not be able to pool them together later.
Generally, foreach with %do% is used to execute an R expression repeatedly, and return the results in some data structure or object, which is a list by default. It takes the form of for(iterator) %do% {this function}. The parallelisation happens in the foreach package with the %dopar% operator.
poolpar This is just a simple function to pool the envelopes from the outputs from the parallel job into a single envelope.
runKregionpar This function brings the 3 helper functions above together to run the parallel tests for all the room types in a region. Similar to the other 2 functions, we remove the segments with ‘all’, and any room types that do not exist in that region. We then run a foreach along the temporary ppp object created, set a seed for reproducibility across the 6 cores, and run the parallelisation for that room type in that region. Results are then pooled and saved to the Ktestcsr3 store in the form Ktestcsr3$region$room_type similar to the other 2 function outputs.
# Set up functions to be used in the parallelisation
# Function to replicate the ppp object for each core in use
rep_ppp <- function(pppobj) {
replicate(nclus, pppobj, simplify=FALSE)
}
# Function to run an envelope simulation for 1 ppp object. This results in a list of envelopes.
runKobjpar <- function(ppp_obj) {
ppp_par <- rep_ppp(ppp_obj)
foreach(i=1:length(ppp_par)) %dopar% {
set.seed(123)
spatstat::envelope(ppp_par[[i]], spatstat::Kest, nsim=ceiling(nsims/nclus), savefuns = TRUE, rank=1)
}
}
# Function to pool envelope simulation results into 1 envelope
poolpar <- function(x) {
do.call(pool, x)
}
runKregionpar <- function(ppp_region) {
reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", " ", .) %>% word(., -1))
# create temporary variable removing 'all' from the region list and any room types that are zero
temp_ppp = ppp_region %>% discard(., names(ppp_region)=="all") %>% discard(., map(., "n") ==0)
foreach(i=1:length(temp_ppp)) %do% {
tempname <- names(temp_ppp)[i]
envtemp[[tempname]] <- runKobjpar(temp_ppp[[i]])
}
Ktestcsr3[[reg_name]] <<- map(envtemp, poolpar)
}
# Run selected region
ptm <- proc.time()
runKregionpar(ppp_store2$west)
proc.time() - ptm
The code above shows how we run the envelope simulation for 1 region for all room types using our runKregionpar() function. The next code chunk shows the code to run the envelope simulation for all regions in the list.
ptm <- proc.time()
map(ppp_store2 %>% discard(names(ppp_store2)=="central"), runKregionpar)
proc.time() - ptm
Once we are done with all the parallelisation, we stop the clusters to clean up. Quitting R will also stop the clusters. This is not run at this point in time as we will be benchmarking the performance of each function.
# Stop Cluster (Not run)
# stopCluster(cl)
We need to replicate the functions a few times to get a better idea of how well each of the functions do. We use the package microbenchmark, which is a simple wrapper over the in-built system.time() function to show some statistics (e.g. minimum, median, mean, maximum time) about the replication. The default setting is 100 replications, but we will use 10 replications for our purposes here. We omit the central region from the benchmarking as it would take too long to run. We save the outputs to an RData file.
ktestsummary <- microbenchmark("forloop_east" = {runKregion1(ppp_store2$east)},
"FP_east" = {runKregion2(ppp_store2$east)},
"forloop_north" = {runKregion1(ppp_store2$north)},
"FP_north" = {runKregion2(ppp_store2$north)},
"forloop_northeast" = {runKregion1(ppp_store2$northeast)},
"FP_northeast" = {runKregion2(ppp_store2$northeast)},
"forloop_west" = {runKregion1(ppp_store2$west)},
"FP_west" = {runKregion2(ppp_store2$west)},
times=10)
save(ktestsummary, file="ktestsummary.Rdata")
We compare the performance of each run of K-test summary using the code above.
## Unit: seconds
## expr min lq mean median uq max
## forloop_east 657.5529 657.8737 658.0693 657.9953 658.4222 658.5415
## FP_east 657.4600 657.8131 658.0608 658.0347 658.2637 658.6969
## forloop_north 129.9498 130.5036 130.8029 130.6666 130.8750 132.5745
## FP_north 130.4490 130.6724 131.0544 130.8669 131.0351 133.1058
## forloop_northeast 235.7780 236.3265 236.6908 236.7673 237.0119 237.5746
## FP_northeast 235.9460 236.6460 236.7181 236.7831 236.8456 237.2116
## forloop_west 1015.6882 1021.1990 1021.0794 1021.6655 1021.8177 1022.4998
## FP_west 1021.0301 1021.2345 1021.5568 1021.3584 1021.5640 1023.5938
## neval cld
## 10 c
## 10 c
## 10 a
## 10 a
## 10 b
## 10 b
## 10 d
## 10 d
The dataframe above shows the benchmarks of the K-tests. As we can see there are no discernible differences in the time it takes to run either for-loop or the map functionals as the items are not vectorized, and are fairly discrete.
ktestparsummary <- microbenchmark("par_east" = {runKregionpar(ppp_store2$east)},
"par_west" = {runKregionpar(ppp_store2$west)},
"par_northeast" = {runKregionpar(ppp_store2$northeast)},
"par_north" = {runKregionpar(ppp_store2$north)},
times=10)
save(ktestparsummary, file="ktestparsummary.Rdata")
## Unit: seconds
## expr min lq mean median uq max
## par_east 149.31041 149.52222 149.72687 149.60881 149.86778 150.45822
## par_west 231.49913 231.73338 232.79897 231.95738 232.60633 237.07334
## par_northeast 53.80125 53.89456 54.09699 53.98189 54.24705 54.91905
## par_north 30.73207 30.82947 30.98298 30.92442 31.17145 31.29125
## neval cld
## 10 c
## 10 d
## 10 b
## 10 a
Here we can see that parallelisation is about 4 times faster than using either the for-loop or functionals. There are some overheads from parallelising the operations as discussed earlier, therefore the gains are not linear (i.e. not 6 times as we used 6 cores). The most gains came from running the central region, which took close to 5 times faster than doing it sequentially (12 hours vs 60 hours).
The table below shows all results.
combi_summary <- rbind(ktestsummary, ktestparsummary)
combi_summary
## Unit: seconds
## expr min lq mean median uq
## forloop_east 657.55289 657.87370 658.06934 657.99534 658.42216
## FP_east 657.45996 657.81314 658.06080 658.03469 658.26366
## forloop_north 129.94983 130.50356 130.80289 130.66662 130.87502
## FP_north 130.44896 130.67240 131.05440 130.86691 131.03515
## forloop_northeast 235.77805 236.32647 236.69081 236.76735 237.01186
## FP_northeast 235.94604 236.64599 236.71812 236.78305 236.84562
## forloop_west 1015.68819 1021.19902 1021.07937 1021.66548 1021.81768
## FP_west 1021.03011 1021.23455 1021.55685 1021.35839 1021.56403
## par_east 149.31041 149.52222 149.72687 149.60881 149.86778
## par_west 231.49913 231.73338 232.79897 231.95738 232.60633
## par_northeast 53.80125 53.89456 54.09699 53.98189 54.24705
## par_north 30.73207 30.82947 30.98298 30.92442 31.17145
## max neval cld
## 658.54152 10 g
## 658.69688 10 g
## 132.57446 10 c
## 133.10577 10 c
## 237.57462 10 f
## 237.21159 10 f
## 1022.49983 10 h
## 1023.59381 10 h
## 150.45822 10 d
## 237.07334 10 e
## 54.91905 10 b
## 31.29125 10 a
ptm <- proc.time()
runKregionpar(ppp_store2$central)
centralreg_time <- proc.time() - ptm
The code above runs the central region once to get the results added to Ktestcsr3.
#Stop clusters
stopCluster(cl)
We will need to plot the various envelope simulations for each region and each room type for analysis. As such, we create a function to reduce repetition and errors in code. We try two different methods - using a for loop and functionals using purrr. As this is a plotting function and is not computationally intensive, we do not need to parallelise the operation.
# Function to plot K-test CSR envelope of room types for each region using for i function
plotsubtypes1 <- function(ktestlist) {
# n = ceiling(length(ktestlist)/2)
par(mfrow=c(2,2))
for (i in seq_along(ktestlist)) {
plot(ktestlist[[i]], .-r~r, xlab="d", ylab="K(d)-r", main= paste(deparse(substitute(ktestlist)), names(ktestlist[i])))
}
}
The above code shows a for-loop function to plot results of the K-tests in a 2 x 2 grid. This can be changed depending on the number of sub-plots required.
# Function to plot K-test CSR envelope of room types for each region using mapply
plotsubtypes2 <- function(ktestlist, y) {
# n = ceiling(length(ktestlist)/2)
par(mfrow=c(2,2))
plotelement <- function(element, y) {
plot(element, .-r~r, xlab="d", ylab="K(d)-r", main=y)
}
mapply(plotelement, ktestlist, names(ktestlist))
}
The above code shows a functional programming function to plot results of the K-tests in a 2 x 2 grid. This can be changed depending on the number of sub-plots required. It creates a nested function of plotting an element in the main function and applies the plotelement function to the various elements of the K-test list.
plotsummary <- microbenchmark("forloop_central" = {plotsubtypes1(Ktestcsr3$central)},
"FP_central" = {plotsubtypes2(Ktestcsr3$central)},
"forloop_east" = {plotsubtypes1(Ktestcsr3$east)},
"FP_east" = {plotsubtypes2(Ktestcsr3$east)},
"forloop_north" = {plotsubtypes1(Ktestcsr3$north)},
"FP_north" = {plotsubtypes2(Ktestcsr3$north)},
"forloop_northeast" = {plotsubtypes1(Ktestcsr3$northeast)},
"FP_northeast" = {plotsubtypes2(Ktestcsr3$northeast)},
"forloop_west" = {plotsubtypes1(Ktestcsr3$west)},
"FP_west" = {plotsubtypes2(Ktestcsr3$west)},
times=50)
save(plotsummary, file="plotsummary.Rdata")
We compare the performance of plotting functions that we created using the code above. We run the test 50 times and save the results into an Rdata file. As this is a plotting function we can include the Central region envelope simulations in this benchmarking. Outputs are also hidden to reduce memory load on the computer.
## Unit: milliseconds
## expr min lq mean median uq max
## forloop_central 771.4117 822.4540 860.7381 856.2249 896.4332 973.2023
## FP_central 765.5472 844.4345 865.1919 863.0116 888.4946 972.4347
## forloop_east 920.7912 968.5479 1004.3051 997.3271 1033.8671 1116.6188
## FP_east 915.1282 962.3995 991.4031 988.2210 1011.1485 1093.6055
## forloop_north 584.4186 627.9901 657.9179 647.0735 679.7101 790.5383
## FP_north 586.1007 639.5039 661.4123 659.5357 679.9676 734.7740
## forloop_northeast 918.4740 956.2872 984.0844 978.8845 1017.5146 1059.7060
## FP_northeast 906.8459 954.9426 984.3593 990.8192 1013.5406 1097.5013
## forloop_west 910.1647 966.4879 1001.0232 995.7588 1018.2669 1361.6902
## FP_west 911.3448 959.3465 1005.4864 993.3930 1031.9202 1374.0439
## neval cld
## 50 b
## 50 b
## 50 c
## 50 c
## 50 a
## 50 a
## 50 c
## 50 c
## 50 c
## 50 c
From the table above, we can see that the for-loop performed better (mean and median) for the central, north, northeast regions, but worse for the east and west regions. However, the difference is not significant to warrant choosing one over the other method. In terms of customization, the for-loop is better as we are able to customise the title of the graphs; however, the functionals are easier to read and understand.
We save the results of the Ktestcsrs for use in the main SPPA analysis. We will use Ktestcsr3 as it has the Central region envelope simulations.
# Saving results of K test for each region
save(Ktestcsr1, file="Ktestcsr1.RData")
save(Ktestcsr2, file="Ktestcsr2.RData")
save(Ktestcsr3, file="Ktestcsr3.RData")