1 Appendix 1: SPPA K-Function Tests by Subregions and Room Type

1.1 Loading saved objects

We load the saved list of ppp objects stored in ppp_store2, which is split by region and then by room type, with each region also containing an $all element that combines all the room types.

load("ppp_store2.RData")

1.2 Functions for running K-tests

# Set up empty list objects to save Ktestcsr results in
Ktestcsr1 <- list() # results of for-loop function
Ktestcsr2 <- list() # results of functionals
Ktestcsr3 <- list() # results of parallelisation

The above sets up empty list objects to store the envelope simulation results for plotting later.

1.2.1 Using for loops

# Run envelope function for different room types for each region
# Function removes $all from the list and only runs the envelope if the number of points for that room type is non-zero.

runKregion1 <- function(ppp_region) {
  set.seed(123)
  # extract the region name (e.g. "west") from the argument expression (e.g. ppp_store2$west)
  reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", "  ", .) %>% word(., -1))
  # drop the first element ($all) so only the individual room types are tested
  temp_ppp = ppp_region[-1]
  for (i in seq_along(temp_ppp)) {
    var_name = names(temp_ppp)[i]
    if (temp_ppp[[i]][["n"]] > 0) {
      Ktestcsr1[[reg_name]][[var_name]] <<- envelope(temp_ppp[[i]], Kest, nsim=99, rank=1, global=TRUE)
    }
  }
}

The above function uses a for loop to run the K-tests for a region, split by room type. The seed is set at the beginning of the function for reproducibility. The function removes the first element ($all) from the region and runs envelope() only if that room type has a non-zero number of points. The results are saved into another list, indexed by region and room type, for plotting later. envelope() is run with 99 simulations using the 'best' edge correction (the default), with rank=1 and global=TRUE to produce a global envelope.
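
As a usage sketch, the function is called on one region at a time (region and room-type names follow ppp_store2) and its results are then available in Ktestcsr1:

# Run the K-tests for one region; results are stored as Ktestcsr1$<region>$<room type>
runKregion1(ppp_store2$east)
names(Ktestcsr1$east)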

1.2.2 Using purrr

# Run envelope function for different room types for each region using purrr
# Function removes $all from the list and only runs the envelope if the number of points for that room type is non-zero.

# helper function to be used in functionals
runenv <- function(x) {
  envelope(x, Kest, nsim=99, rank=1, global=TRUE)
}

runKregion2 <- function(ppp_region) {
  set.seed(123)
  # extract the region name (e.g. "west") from the argument expression (e.g. ppp_store2$west)
  reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", "  ", .) %>% word(., -1))
  # create temporary variable removing 'all' from the region list and any room types that are zero
  temp_ppp = ppp_region %>% discard(., names(ppp_region)=="all") %>% discard(., map(., "n") ==0)
  Ktestcsr2[[reg_name]] <<- map(temp_ppp, runenv)
}

The above code produces the same output using functional programming, with functions such as map() and discard() from the purrr package (part of the tidyverse). purrr enhances R's functional programming toolkit by providing a complete and consistent set of tools for working with functions and vectors. This approach is more flexible: the list can be filtered in place before other functions are applied, so the same code generalises to any nested list of ppp objects that contains an $all element and possibly empty room types. The seed is set at the beginning of the function for reproducibility. We define a runenv() helper with the same K-test parameters as before (99 simulations, 'best' correction, global envelope) and then use map() to pass each element of the filtered list (temp_ppp) through runenv().
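
To illustrate the filtering step on its own, here is a minimal sketch using a hypothetical toy list (not the real ppp data; tidyverse/purrr assumed to be loaded, as elsewhere in this appendix):

# Toy list mimicking one region: an $all element plus room types with point counts n
toy <- list(all = list(n = 5), kitchen = list(n = 0), bedroom = list(n = 3))
toy %>%
  discard(., names(toy) == "all") %>%  # drop the combined $all element
  discard(., map(., "n") == 0)         # drop room types with zero points; only $bedroom remains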

1.3 Parallelisation

What is parallelisation and why do we need it?

Performing Spatial Point Pattern Analysis (SPPA) can be computationally intensive for larger data sets, or for data with non-uniform observation windows; it took us 60 hours to run the various Central region K-tests sequentially. There is therefore a need for parallelisation when analysing larger spatial datasets, but spatstat has no built-in parallelisation, so users need to turn to other packages to parallelise its operations.

Parallelisation means splitting the computation across multiple cores (or processors) so that the pieces run simultaneously (in parallel), reducing the total computation time. For example, a computation that takes 100 seconds on 1 processor should ideally take 100/n seconds on n processors. In practice this is rarely achieved, because running in parallel incurs overheads such as spinning up the clusters and ensuring that each worker has the packages and data it needs.

Parallelisation also depends on the OS: functions such as mclapply() and the other mc* functions from the parallel package (installed with base R) work on macOS and Linux but not on Windows, because they rely on forking. We therefore use the doParallel and foreach packages. The following code is OS agnostic and can be run on any multi-core computer (laptop or desktop).
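
As a minimal, OS-agnostic sketch of the pattern used below (make a socket cluster, register it with doParallel, then iterate with foreach and %dopar%):

library(doParallel)
cl <- makeCluster(2)                          # small socket cluster with 2 workers
registerDoParallel(cl)
foreach(i = 1:4, .combine = c) %dopar% i^2    # returns 1 4 9 16, computed across the workers
stopCluster(cl)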

Users can also use the doAzureParallel package to run the parallelisation on the cloud with Microsoft Azure, within the foreach framework. There are other options for parallelisation in the cloud with providers such as Google Cloud Platform, Databricks and AWS, which are not discussed here.

1.3.1 Setting up

Firstly we need to set up a cluster for parallelisation. We use detectCores() to see how many cores are available on the laptop/desktop. Typically we use one or two fewer cores than the total to leave headroom for other processes running in the background (e.g. the OS). We then create the cluster of worker processes via 'sockets' using makeCluster() and use registerDoParallel() to register the parallel backend with the foreach package.

# Set up clusters
# Load doParallel (if not already loaded); it also loads the foreach and parallel packages it builds on
library(doParallel)

# Set up number of clusters (6 in this case - 2 less than max no. of cores)
nclus <- detectCores() - 2

# Make the clusters and register doParallel with the number of clusters
cl <- makeCluster(nclus)
registerDoParallel(cl)

In our case, the parallelisation works by splitting the simulations between the different cores and then combining the results. We used 99 simulations in the sequential K-tests, so here we declare a total of 100 simulations to be split as evenly as possible between the cores.
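
A quick check of the arithmetic with six cores:

ceiling(100 / 6)      # 17 simulations per worker
6 * ceiling(100 / 6)  # 102 simulations pooled in total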

While we have set up our cluster, the workers start as a blank slate, so we need to give them the objects, functions and packages they need for the computation. We export only the minimum required, as each object takes up memory in every worker process. We use clusterExport() to copy the data to the workers and clusterEvalQ() to load the required packages on each of them.

# Set up minimum total number of simulations for parallel function to split evenly between them (may be slightly more than 100)
nsims = 100

# Distribute the necessary R objects to the workers
clusterExport(cl, c('ppp_store2', 'nsims', 'nclus'))
# Load the required packages on each worker (library() attaches one package per call)
clusterEvalQ(cl, { library(tidyverse); library(spatstat) })
## (clusterEvalQ() returns, for each of the six workers, the vector of packages now attached on that worker; output omitted for brevity)

1.3.2 Parallelisation Functions

We now set up the functions that we will be using to parallelise the operations.

rep_ppp For each ppp object we need to replicate it once for each of the cores set up in the cluster (nclus). We set simplify to FALSE so that replicate() returns a list of ppp objects to iterate over; otherwise it simplifies the result into a matrix of the ppp object's components rather than a list of ppp objects.

runKobjpar This function is the core of the parallel operations. It takes a ppp object and calls rep_ppp() to replicate it. For each replicated ppp object it then runs a set of envelope simulations using spatstat's envelope(), passing Kest as the summary function. The number of simulations per worker is the total divided by the number of cores, rounded up with ceiling(), so we get at least 100 simulations in total but sometimes more, depending on the number of simulations and cores; in this case we get 102 (17 * 6) simulations in total. savefuns must be set to TRUE so that each output stores the simulated summary functions as an envelope object, otherwise the results cannot be pooled together later.

Generally, foreach with %do% executes an R expression repeatedly and returns the results in some data structure or object, a list by default. It takes the form foreach(iterator) %do% { expression }. The parallelisation happens when the foreach package's %dopar% operator is used instead.

poolpar This is a simple helper that pools the envelopes returned by the parallel workers into a single envelope.

runKregionpar This function brings the three helper functions above together to run the parallel tests for all the room types in a region. As in the other two region functions, we remove the $all element and any room types that have no points in that region. We then run a foreach loop over the remaining ppp objects, calling runKobjpar() for each room type and collecting the lists of per-worker envelopes that foreach returns (each worker sets its own fixed seed inside runKobjpar, so the runs are reproducible without the workers duplicating each other's simulations). The per-worker envelopes for each room type are then pooled and saved into the Ktestcsr3 store in the form Ktestcsr3$region$room_type, matching the output of the other two functions.

# Set up functions to be used in the parallelisation

# Function to replicate the ppp object for each core in use
rep_ppp <- function(pppobj) {
  replicate(nclus, pppobj, simplify=FALSE)
}

# Function to run an envelope simulation for 1 ppp object.  This results in a list of envelopes.
runKobjpar <- function(ppp_obj) {
  ppp_par <- rep_ppp(ppp_obj)
  foreach(i=1:length(ppp_par)) %dopar% {
    # use a different (but fixed) seed on each worker so the runs are reproducible
    # without every worker generating the same set of simulated patterns
    set.seed(123 + i)
    spatstat::envelope(ppp_par[[i]], spatstat::Kest, nsim=ceiling(nsims/nclus), savefuns = TRUE, rank=1)
  }
}

# Function to pool envelope simulation results into 1 envelope
poolpar <- function(x) {
  do.call(pool, x)
}

runKregionpar <- function(ppp_region) {
  # extract the region name (e.g. "west") from the argument expression (e.g. ppp_store2$west)
  reg_name = paste(deparse(substitute(ppp_region)) %>% gsub("\\$", "  ", .) %>% word(., -1))
  # create temporary variable removing 'all' from the region list and any room types that are zero
  temp_ppp = ppp_region %>% discard(., names(ppp_region)=="all") %>% discard(., map(., "n") ==0)
  # run the parallel envelope simulations for each room type, collecting foreach's results
  envtemp <- foreach(i=1:length(temp_ppp)) %do% {
    runKobjpar(temp_ppp[[i]])
  }
  names(envtemp) <- names(temp_ppp)
  # pool the per-core envelopes for each room type and store as Ktestcsr3$region$room_type
  Ktestcsr3[[reg_name]] <<- map(envtemp, poolpar)
}

1.3.3 Running the parallel operations

# Run selected region
ptm <- proc.time()
runKregionpar(ppp_store2$west)
proc.time() - ptm

The code above shows how we run the envelope simulations for all room types in one region using our runKregionpar() function. The next code chunk runs the envelope simulations for every region in the list except Central, which we run separately after the benchmarking.

ptm <- proc.time()
map(ppp_store2 %>% discard(names(ppp_store2)=="central"), runKregionpar)
proc.time() - ptm

1.3.4 Stopping the cluster

Once we are done with all the parallelisation, we stop the cluster to clean up (quitting R will also stop it). We do not run this yet, as the cluster is still needed to benchmark the performance of each function.

# Stop Cluster (Not run)
# stopCluster(cl)

1.4 Comparison of computation performance

We need to run each function several times to get a better idea of how well it performs. We use the microbenchmark package, which times an expression over repeated runs and reports summary statistics (e.g. minimum, median, mean and maximum time). The default is 100 replications, but we use 10 replications for our purposes here. We omit the Central region from the benchmarking as it would take too long to run. We save the outputs to an RData file.

1.4.1 Benchmarking for-loop and functionals using purrr

ktestsummary <- microbenchmark("forloop_east" = {runKregion1(ppp_store2$east)},
                              "FP_east" = {runKregion2(ppp_store2$east)},
                              "forloop_north" = {runKregion1(ppp_store2$north)},
                              "FP_north" = {runKregion2(ppp_store2$north)},
                              "forloop_northeast" = {runKregion1(ppp_store2$northeast)},
                              "FP_northeast" = {runKregion2(ppp_store2$northeast)},
                              "forloop_west" = {runKregion1(ppp_store2$west)},
                              "FP_west" = {runKregion2(ppp_store2$west)},
                              times=10)

save(ktestsummary, file="ktestsummary.Rdata")

We compare the performance of the for-loop and functional K-test implementations for each region using the code above.

## Unit: seconds
##               expr       min        lq      mean    median        uq       max
##       forloop_east  657.5529  657.8737  658.0693  657.9953  658.4222  658.5415
##            FP_east  657.4600  657.8131  658.0608  658.0347  658.2637  658.6969
##      forloop_north  129.9498  130.5036  130.8029  130.6666  130.8750  132.5745
##           FP_north  130.4490  130.6724  131.0544  130.8669  131.0351  133.1058
##  forloop_northeast  235.7780  236.3265  236.6908  236.7673  237.0119  237.5746
##       FP_northeast  235.9460  236.6460  236.7181  236.7831  236.8456  237.2116
##       forloop_west 1015.6882 1021.1990 1021.0794 1021.6655 1021.8177 1022.4998
##            FP_west 1021.0301 1021.2345 1021.5568 1021.3584 1021.5640 1023.5938
##  neval  cld
##     10   c 
##     10   c 
##     10 a   
##     10 a   
##     10  b  
##     10  b  
##     10    d
##     10    d

The table above shows the benchmarks of the K-tests. As we can see, there is no discernible difference in run time between the for loop and the map() functionals: both simply iterate over a small number of discrete ppp objects (there is nothing to vectorise), so virtually all of the time is spent inside envelope() itself.

1.4.2 Benchmarking parallelisation

ktestparsummary <- microbenchmark("par_east" = {runKregionpar(ppp_store2$east)},
                                  "par_west" = {runKregionpar(ppp_store2$west)},
                                  "par_northeast" = {runKregionpar(ppp_store2$northeast)},
                                  "par_north" = {runKregionpar(ppp_store2$north)},
                                  times=10)

save(ktestparsummary, file="ktestparsummary.Rdata")
## Unit: seconds
##           expr       min        lq      mean    median        uq       max
##       par_east 149.31041 149.52222 149.72687 149.60881 149.86778 150.45822
##       par_west 231.49913 231.73338 232.79897 231.95738 232.60633 237.07334
##  par_northeast  53.80125  53.89456  54.09699  53.98189  54.24705  54.91905
##      par_north  30.73207  30.82947  30.98298  30.92442  31.17145  31.29125
##  neval  cld
##     10   c 
##     10    d
##     10  b  
##     10 a

Here we can see that parallelisation is roughly 4 times faster than either the for loop or the functionals. Because of the overheads of parallelising the operations discussed earlier, the gains are not linear (i.e. not 6 times, even though we used 6 cores). The largest gain came from the Central region, which ran close to 5 times faster in parallel than sequentially (about 12 hours versus 60 hours).
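
As a rough check, dividing the sequential medians by the parallel medians from the two benchmark tables gives speedups of roughly 4.2 to 4.4 times:

# Approximate speedups: sequential (for-loop) median / parallel median, values from the tables above
round(c(east = 658.0 / 149.6, north = 130.7 / 30.9,
        northeast = 236.8 / 54.0, west = 1021.7 / 232.0), 1)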

The table below shows all results.

combi_summary <- rbind(ktestsummary, ktestparsummary)
combi_summary
## Unit: seconds
##               expr        min         lq       mean     median         uq
##       forloop_east  657.55289  657.87370  658.06934  657.99534  658.42216
##            FP_east  657.45996  657.81314  658.06080  658.03469  658.26366
##      forloop_north  129.94983  130.50356  130.80289  130.66662  130.87502
##           FP_north  130.44896  130.67240  131.05440  130.86691  131.03515
##  forloop_northeast  235.77805  236.32647  236.69081  236.76735  237.01186
##       FP_northeast  235.94604  236.64599  236.71812  236.78305  236.84562
##       forloop_west 1015.68819 1021.19902 1021.07937 1021.66548 1021.81768
##            FP_west 1021.03011 1021.23455 1021.55685 1021.35839 1021.56403
##           par_east  149.31041  149.52222  149.72687  149.60881  149.86778
##           par_west  231.49913  231.73338  232.79897  231.95738  232.60633
##      par_northeast   53.80125   53.89456   54.09699   53.98189   54.24705
##          par_north   30.73207   30.82947   30.98298   30.92442   31.17145
##         max neval      cld
##   658.54152    10       g 
##   658.69688    10       g 
##   132.57446    10   c     
##   133.10577    10   c     
##   237.57462    10      f  
##   237.21159    10      f  
##  1022.49983    10        h
##  1023.59381    10        h
##   150.45822    10    d    
##   237.07334    10     e   
##    54.91905    10  b      
##    31.29125    10 a

1.4.3 Running Central region

ptm <- proc.time()
runKregionpar(ppp_store2$central)
centralreg_time <- proc.time() - ptm

The code above runs the Central region once so that its results are added to Ktestcsr3.

#Stop clusters
stopCluster(cl)

1.5 Plotting functions

We will need to plot the envelope simulations for each region and each room type for analysis, so we create a plotting function to reduce repetition and errors in the code. We try two different methods: a for loop and a functional (mapply). As plotting is not computationally intensive, we do not need to parallelise the operation.

1.5.1 Using for loops

# Function to plot K-test CSR envelope of room types for each region using for i function
plotsubtypes1 <- function(ktestlist) {
  # n = ceiling(length(ktestlist)/2)
  par(mfrow=c(2,2))
  for (i in seq_along(ktestlist)) {
    plot(ktestlist[[i]], .-r~r, xlab="d", ylab="K(d)-r", main= paste(deparse(substitute(ktestlist)), names(ktestlist[i])))
  }
}

The above code shows a for-loop function to plot results of the K-tests in a 2 x 2 grid. This can be changed depending on the number of sub-plots required.
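
For example, a variant that sizes the grid from the number of room types (the helper name plotsubtypes1b is purely illustrative) could look like this:

# Variant of plotsubtypes1 with a grid sized to the number of envelopes (two columns)
plotsubtypes1b <- function(ktestlist) {
  n <- ceiling(length(ktestlist) / 2)
  par(mfrow = c(n, 2))
  for (i in seq_along(ktestlist)) {
    plot(ktestlist[[i]], .-r~r, xlab = "d", ylab = "K(d)-r",
         main = paste(deparse(substitute(ktestlist)), names(ktestlist[i])))
  }
}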

1.5.2 Using functionals

# Function to plot K-test CSR envelope of room types for each region using mapply
plotsubtypes2 <- function(ktestlist) {
  # n = ceiling(length(ktestlist)/2)
  par(mfrow=c(2,2))
  plotelement <- function(element, y) {
    plot(element, .-r~r, xlab="d", ylab="K(d)-r", main=y)
  }
  mapply(plotelement, ktestlist, names(ktestlist))
}

The above code shows a functional version that plots the K-test results in the same 2 x 2 grid, which can likewise be changed depending on the number of sub-plots required. It uses base R's mapply(): a nested helper, plotelement(), is defined inside the main function and applied to each element of the K-test list together with its name, which is used as the plot title. A purrr equivalent is sketched below.
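
The sketch uses iwalk(), which passes each element and its name to a function; plotsubtypes3 is a hypothetical name and this version is not part of the benchmark in the next section:

# Hypothetical purrr variant: iwalk() passes each envelope and its name to the plotting helper
plotsubtypes3 <- function(ktestlist) {
  par(mfrow = c(2, 2))
  iwalk(ktestlist, function(element, nm) {
    plot(element, .-r~r, xlab = "d", ylab = "K(d)-r", main = nm)
  })
}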

1.5.3 Comparison of computation performance

plotsummary <- microbenchmark("forloop_central" = {plotsubtypes1(Ktestcsr3$central)},
                              "FP_central" = {plotsubtypes2(Ktestcsr3$central)},
                              "forloop_east" = {plotsubtypes1(Ktestcsr3$east)},
                              "FP_east" = {plotsubtypes2(Ktestcsr3$east)},
                              "forloop_north" = {plotsubtypes1(Ktestcsr3$north)},
                              "FP_north" = {plotsubtypes2(Ktestcsr3$north)},
                              "forloop_northeast" = {plotsubtypes1(Ktestcsr3$northeast)},
                              "FP_northeast" = {plotsubtypes2(Ktestcsr3$northeast)},
                              "forloop_west" = {plotsubtypes1(Ktestcsr3$west)},
                              "FP_west" = {plotsubtypes2(Ktestcsr3$west)},
                              times=50)

save(plotsummary, file="plotsummary.Rdata")

We compare the performance of the two plotting functions using the code above, running each test 50 times and saving the results to an RData file. As plotting is quick, we can include the Central region envelope simulations in this benchmark. The plot outputs themselves are suppressed to reduce the memory load on the computer.

## Unit: milliseconds
##               expr      min       lq      mean   median        uq       max
##    forloop_central 771.4117 822.4540  860.7381 856.2249  896.4332  973.2023
##         FP_central 765.5472 844.4345  865.1919 863.0116  888.4946  972.4347
##       forloop_east 920.7912 968.5479 1004.3051 997.3271 1033.8671 1116.6188
##            FP_east 915.1282 962.3995  991.4031 988.2210 1011.1485 1093.6055
##      forloop_north 584.4186 627.9901  657.9179 647.0735  679.7101  790.5383
##           FP_north 586.1007 639.5039  661.4123 659.5357  679.9676  734.7740
##  forloop_northeast 918.4740 956.2872  984.0844 978.8845 1017.5146 1059.7060
##       FP_northeast 906.8459 954.9426  984.3593 990.8192 1013.5406 1097.5013
##       forloop_west 910.1647 966.4879 1001.0232 995.7588 1018.2669 1361.6902
##            FP_west 911.3448 959.3465 1005.4864 993.3930 1031.9202 1374.0439
##  neval cld
##     50  b 
##     50  b 
##     50   c
##     50   c
##     50 a  
##     50 a  
##     50   c
##     50   c
##     50   c
##     50   c

From the table above, the for loop performed slightly better (mean and median) for the Central, North and Northeast regions and slightly worse for the East region, with mixed results for the West region (better mean, worse median). The differences are too small to warrant choosing one method over the other. In terms of customisation, the for-loop version has a slight edge as it can include the name of the K-test list in the plot titles; the functional version, however, is arguably easier to read and understand.

1.6 Saving results

We save the results of the Ktestcsr lists for use in the main SPPA analysis. We will use Ktestcsr3, as it includes the Central region envelope simulations.

# Saving results of K test for each region
save(Ktestcsr1, file="Ktestcsr1.RData")
save(Ktestcsr2, file="Ktestcsr2.RData")
save(Ktestcsr3, file="Ktestcsr3.RData")