1 Objective of the function

There are two objectives for this function, first to reduce the memory usage (RAM), when transforming a rasterStack into a dataframe, on the other hand we want not to reduce the speed or if possibly to increase the speed of the function.

First lets load the packages we will need:

# For spaital manipulation
library(raster)
# For benchmarking speed and memory
library(bench)
# To parallelize operations
library(doParallel)
# To parallelize operations
library(foreach)
# For combination and looping
library(purrr)
# Data wranggling
library(dplyr)
library(data.table)

1.1 Tiling

The main way to reduce the RAM usage is instead of processing one big raster is to transform it into tiled rasters, for that I developed the following function that you can include in you code:

SplitRas <- function(Raster, ppside, nclus = 1) {
    TempRast <- paste0(getwd(), "/Temp")
    h <- ceiling(ncol(Raster)/ppside)
    v <- ceiling(nrow(Raster)/ppside)
    agg <- aggregate(Raster, fact = c(h, v))
    agg[] <- 1:ncell(agg)
    agg_poly <- rasterToPolygons(agg)
    names(agg_poly) <- "polis"
    r_list <- list()
    if (nclus == 1) {
        for (i in 1:ncell(agg)) {
            dir.create(TempRast)
            rasterOptions(tmpdir = TempRast)
            e1 <- extent(agg_poly[agg_poly$polis == i, ])
            r_list[[i]] <- crop(Raster, e1)
            if ((freq(r_list[[i]], value = NA)/ncell(r_list[[i]])) != 
                1) {
                writeRaster(r_list[[i]], filename = paste("SplitRas", 
                  i, sep = ""), format = "GTiff", datatype = "FLT4S", 
                  overwrite = TRUE)
            }
            unlink(TempRast, recursive = T, force = T)
        }
    } else if (nclus != 1) {
        cl <- parallel::makeCluster(nclus)
        doParallel::registerDoParallel(cl)
        r_list <- foreach(i = 1:ncell(agg), .packages = c("raster")) %dopar% 
            {
                dir.create(TempRast)
                rasterOptions(tmpdir = TempRast)
                e1 <- extent(agg_poly[agg_poly$polis == i, ])
                Temp <- crop(Raster, e1)
                if ((raster::freq(Temp, value = NA)/ncell(Temp)) != 
                  1) {
                  writeRaster(Temp, filename = paste("SplitRas", 
                    i, sep = ""), format = "GTiff", datatype = "FLT4S", 
                    overwrite = TRUE)
                }
                unlink(TempRast, recursive = T, force = T)
                Temp
            }
        parallel::stopCluster(cl)
    }
}

This function has 3 arguments:

Raster: The stack you want to divide into tiles
ppside: The number of horizontal and vertical tiles you will end up, the final number of tiles will be ppside*ppside, so if ppside is 3, you will have 9 tiles
nclus: The number of clusters you will use for the parallelizing and speeding up your processing.

At the end of this function you will have ppside*ppside number of tiles, saved in your working directory all called SplitRasN.tif where N is the number of the tile. Just as an example we will use the bioclimatic variables available in the raster package:

Bios <- getData("worldclim", var = "bio", res = 10)

Now I will just plot mean temperature as seen in figure 1.1, but we can see that we can divide this in diferent number of tiles as shown in figure 1.2, this will generate different performances in speed and or ram usage. As we will see afterwards

Figure 1.1: Mean temperature

Figure 1.2: Mean temerature shown with different number of tiles in red

1.2 Transformation from raster to tiles and then from tiles to dataframe

so first we will use SplitRas function to get the 16 tiles using 4 cores:

SplitRas(Raster = Bios, ppside = 4, nclus = 4)

This will return the following files: SplitRas1.tif, SplitRas10.tif, SplitRas11.tif, SplitRas12.tif, SplitRas13.tif, SplitRas14.tif, SplitRas15.tif, SplitRas16.tif, SplitRas2.tif, SplitRas3.tif, SplitRas4.tif, SplitRas5.tif, SplitRas6.tif, SplitRas7.tif, SplitRas8.tif, SplitRas9.tif

In order to get this tiles into one dataframe with all the non-NA cells we need a list of the tiles, which we get with:

Files <- list.files(pattern = "SplitRas", full.names = T)

Which we can use then in the following function:

SplitsToDataFrame <- function(Splits, ncores = 1) {
    TempRast <- paste0(getwd(), "/Temp")
    if (ncores == 1) {
        Temps <- list()
        for (i in 1:length(Splits)) {
            dir.create(TempRast)
            rasterOptions(tmpdir = TempRast)
            Temp <- stack(Splits[i])
            Temp <- as.data.frame(Temp, row.names = NULL, col.names = NULL, 
                xy = TRUE)
            colnames(Temp)[3:ncol(Temp)] <- paste0("Var", 1:length(3:ncol(Temp)))
            Temps[[i]] <- Temp[complete.cases(Temp), ]
            gc()
            unlink(TempRast, recursive = T, force = T)
            message(i)
        }
        Temps <- data.table::rbindlist(Temps)
    } else if (ncores > 1) {
        cl <- parallel::makeCluster(ncores)
        doParallel::registerDoParallel(cl)
        Temps <- foreach(i = 1:length(Splits), .packages = c("raster", 
            "data.table")) %dopar% {
            dir.create(TempRast)
            rasterOptions(tmpdir = TempRast)
            Temp <- stack(Splits[i])
            Temp <- as.data.frame(Temp, row.names = NULL, col.names = NULL, 
                xy = TRUE)
            colnames(Temp)[3:ncol(Temp)] <- paste0("Var", 1:length(3:ncol(Temp)))
            gc()
            unlink(TempRast, recursive = T, force = T)
            Temp[complete.cases(Temp), ]
        }
        Temps <- data.table::rbindlist(Temps)
        parallel::stopCluster(cl)
    }
    return(Temps)
}

Where Splits is a character vectors with the paths to the tiles, and ncores is the number of cores used for parallelization. This will result in the Data frame with the non empty cells.

DF <- SplitsToDataFrame(Splits = Files, ncores = 4)

The first 20 rows of results can be seen in table 1.1

Table 1.1: The first 20 observations of the resulting data frame
x	y	Var1	Var2	Var3	Var4	Var5	Var6	Var7	Var8	Var9	Var10	Var11	Var12	Var13	Var14	Var15	Var16	Var17	Var18	Var19
-90.75	81.92	-209	64	13	15866	62	-428	490	-16	-384	15	-384	70	15	2	78	40	6	35	6
-90.58	81.92	-210	63	12	15891	62	-428	490	-16	-385	15	-385	70	15	2	78	40	6	35	6
-90.42	81.92	-210	63	12	15899	61	-429	490	-17	-385	14	-385	69	14	2	75	39	6	34	6
-90.25	81.92	-212	62	12	15896	59	-430	489	-19	-387	12	-387	73	15	2	67	40	6	36	6
-90.08	81.92	-212	61	12	15911	59	-429	488	-19	-387	13	-387	74	15	2	68	41	6	36	6
-91.75	81.75	-212	63	13	15646	55	-429	484	-21	-385	9	-385	76	16	2	72	43	6	38	6
-91.58	81.75	-215	62	12	15639	52	-431	483	-24	-388	6	-388	80	17	2	72	44	6	39	6
-91.42	81.75	-218	60	12	15605	48	-434	482	-28	-391	2	-391	85	17	3	64	46	9	40	9
-91.25	81.75	-217	60	12	15659	50	-433	483	-26	-390	4	-390	84	17	3	63	45	9	40	9
-91.08	81.75	-215	61	12	15706	52	-432	484	-24	-389	6	-389	78	16	2	71	43	6	38	6
-90.92	81.75	-220	60	12	15692	47	-435	482	-29	-393	2	-393	87	17	3	64	46	9	41	9
-90.75	81.75	-219	59	12	15716	48	-435	483	-28	-393	3	-393	86	17	3	64	46	9	41	9
-90.58	81.75	-218	60	12	15759	50	-434	484	-27	-392	4	-392	85	17	3	64	46	9	40	9
-90.42	81.75	-217	60	12	15785	52	-433	485	-25	-391	6	-391	83	16	3	69	44	9	39	9
-90.25	81.75	-217	60	12	15818	52	-433	485	-25	-392	6	-392	85	16	3	62	45	9	39	9
-90.08	81.75	-219	59	12	15842	50	-435	485	-26	-394	5	-394	88	17	3	63	46	9	41	9
-91.92	81.58	-209	64	13	15649	59	-427	486	-18	-382	12	-382	72	16	2	71	42	6	36	6
-91.75	81.58	-214	62	12	15582	52	-430	482	-24	-387	6	-387	78	17	2	76	45	6	39	6
-91.58	81.58	-221	59	12	15518	44	-435	479	-31	-393	-2	-393	90	18	3	69	49	9	43	9
-91.42	81.58	-227	57	11	15481	37	-439	476	-38	-398	-8	-398	101	20	3	66	54	10	48	10

If you need to change the number of tiles, before doing so, it is recommended that you do:

file.remove(Files)

##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE

1.3 Memory benchmarking

And then there is the parallel profiling here

Now lets check the parallel profiling

“Optimization of raster to data frame function”

Derek Corcoran

16/05, 2021

1 Objective of the function

1.1 Tiling

1.2 Transformation from raster to tiles and then from tiles to dataframe

1.3 Memory benchmarking