Testing the effect of tile size on query speed

The files needed to run these tests can be downloaded from here:

https://dl.dropboxusercontent.com/u/2703650/Courses/geostats/geostats.zip

Set the working directory to where you placed the files.

setwd("/home/duncan/Dropbox/Public/Courses/geostats")
rm(list = ls())
library(RODBC)
library(rgdal)
## Loading required package: sp
## rgdal: version: 0.8-10, (SVN revision 478) Geospatial Data Abstraction
## Library extensions to R successfully loaded Loaded GDAL runtime: GDAL
## 1.9.2, released 2012/10/08 Path to GDAL shared files: /usr/share/gdal/1.9
## Loaded PROJ.4 runtime: Rel. 4.8.0, 6 March 2012, [PJ_VERSION: 480] Path to
## PROJ.4 shared files: (autodetected)
library(raster)

Create a temporary database on the fly in R

When using PostGIS as a desktop GIS from R I quite often create a temporary database for the session. This avoids clutter when the only purpose of the database is to provide some functions for data processing.

system("dropdb temp")  #If already exists
system("createdb temp")

In order to use an ODBC connection to the temp database you need to add an entry for it to odbc.ini.

sudo gedit /etc/odbc.ini

Paste the following text into the file and then save it (this driver path is for 32-bit machines).

[temp]
Driver = /usr/lib/i386-linux-gnu/odbc/psqlodbcw.so
Database = temp
Servername = localhost
Username = postgres
Password = postgres
Protocol = 8.2.5
ReadOnly = 0
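On a 64-bit Debian or Ubuntu machine the psqlODBC driver usually lives under /usr/lib/x86_64-linux-gnu instead; this alternative Driver line is an educated guess based on the standard package layout, so check where psqlodbcw.so actually sits on your system:

Driver = /usr/lib/x86_64-linux-gnu/odbc/psqlodbcw.so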

Now we can connect to the database from R.

con <- odbcConnect("temp")
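If the connection fails odbcConnect issues a warning and returns -1 rather than an RODBC handle, so a minimal sanity check along these lines can save some head scratching later:

if (!inherits(con, "RODBC")) stop("Could not connect to the temp database")
odbcGetInfo(con)  # driver and server details for the open channel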

Add the PostGIS and PL/R extensions

odbcQuery(con, "CREATE EXTENSION POSTGIS")
## [1] 1
odbcQuery(con, "CREATE EXTENSION POSTGIS_topology")
## [1] 1
odbcQuery(con, "CREATE EXTENSION PLR")
## [1] 1
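As a quick check that both extensions really are available, each exposes a version function (assuming I have the PL/R helper's name right):

sqlQuery(con, "SELECT postgis_version();")
sqlQuery(con, "SELECT plr_version();")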

Load the shapefiles

This is quick for smallish shapefiles. If the files are larger and form part of a regular workflow it would clearly be better to set up a permanent database and load them once.

com <- "shp2pgsql -d -s 4326 -I shapefiles/oaks.shp oaks| psql -d temp;"
system(com)
com <- "shp2pgsql -d -W LATIN1 -s 4326 -I shapefiles/mex_ecoregions.shp eco| psql -d temp;"
system(com)
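A quick row count is enough to confirm that both tables loaded:

sqlQuery(con, "SELECT count(*) FROM oaks;")
sqlQuery(con, "SELECT count(*) FROM eco;")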

Returning a subset of the data to R

We will now pull two spatial objects back into R so that we know they are exactly equivalent to those that will be used in the subqueries within PostGIS. One convenient way of doing this is to first set up a “getquery” function as was shown in the geostats tutorial. This uses a temporary view and rgdal to return any query as a spatial object.

getquery <- function(query) {
    # Wrap the query in a temporary view so that readOGR sees it as a layer
    query <- paste("create view temp_view as ", query, sep = "")
    odbcQuery(con, query)
    # Read the view back as a spatial object through the OGR PostgreSQL driver
    result <- readOGR("PG:dbname=temp", "temp_view")
    odbcQuery(con, "drop view temp_view")
    return(result)
}

polys <- getquery("select * from eco limit 200")
## OGR data source with driver: PostgreSQL 
## Source: "PG:dbname=temp", layer: "temp_view"
## with 200 features and 3 fields
## Feature type: wkbMultiPolygon with 2 dimensions
pnts <- getquery("select * from oaks limit 200")
## OGR data source with driver: PostgreSQL 
## Source: "PG:dbname=temp", layer: "temp_view"
## with 200 features and 4 fields
## Feature type: wkbPoint with 2 dimensions
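A quick plot gives a visual check that the two layers overlay as expected:

plot(polys, border = "grey")     # the first 200 ecoregion polygons
plot(pnts, add = TRUE, pch = 3)  # the first 200 oak records on top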

Timing a point overlay in R

If the raster layer is small enough to fit into memory R will easily beat PostGIS for speed on a point on raster overlay. The raster layer can be placed in memory by turning it into a brick.

r <- raster("/home/duncan/geostats/elevation.tif")
rbrick <- brick(r)
system.time(extract(rbrick, pnts))
##    user  system elapsed 
##   0.008   0.000   0.008

So, that is going to be hard to beat.
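The speed comes from the values being held in RAM; the raster package's inMemory() function can confirm that the brick, unlike the original layer, was loaded into memory:

inMemory(r)       # FALSE: the RasterLayer just points at the file on disk
inMemory(rbrick)  # TRUE if brick() managed to pull the values into RAM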

A fairer comparison, appropriate for large rasters that could not fit in memory, is to measure the time R takes while keeping the data on disk.

RTimePoints <- system.time(extract(r, pnts))[3]
RTimePoints
## elapsed 
##   0.147

Timing a polygon overlay in R

This is much slower, so to give R any chance at all we will use the brick.

RTimePolys <- system.time(extract(rbrick, polys))[3]
RTimePolys
## [1] 50.91

This should be easy to beat.

Add some PL/R functions for aggregation

One of the very nice elements of using PostGIS for polygon on raster overlays is that any R function can be run on the results in order to summarise them with PL/R, before sending them back to the R session.

query <- "CREATE OR REPLACE FUNCTION rmedian (float[]) RETURNS float AS 'x<-arg1   x<-as.numeric(as.character(x))   x<-na.omit(x)  median(x)'  LANGUAGE 'plr' STRICT;"
odbcQuery(con, query)
## [1] 1


query <- "CREATE OR REPLACE FUNCTION rmean (float[]) RETURNS float AS 'x<-arg1   x<-as.numeric(as.character(x))    x<-na.omit(x)  mean(x)'  LANGUAGE 'plr' STRICT;"
odbcQuery(con, query)
## [1] 1
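Before timing anything it is worth a quick sanity check that the aggregation behaves as intended; here the NULL element should be stripped by na.omit:

sqlQuery(con, "SELECT rmedian(ARRAY[1,2,3,4,NULL]::float8[]);")
# should return 2.5 once the NULL element is dropped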

Timed overlays when the data are held outside the database

Now run through some timed overlays. For this exercise we can load the raster layer several times, changing the tile size each time. Note that the -R flag is used. This results in the file being registered in the database while the data remain outside it.

tm_point <- numeric()
tm_poly <- numeric()

tiles <- c(20, 50, 100, 200, 400, 800, 1200)
for (i in tiles) {
    # Reload the raster with i x i tiles; -R registers the file in the
    # database but leaves the pixel data outside, in the original file
    command <- paste("raster2pgsql -s 4326 -d  -M -R /home/duncan/geostats/elevation.tif -F -t ", 
        i, "x", i, " elev|psql -d temp", sep = "")
    system(command)

    # Point on raster overlay: elevation at the first 100 oak records
    query <- "select o.gid,genus,species,st_value(rast,geom) from elev, (select * from oaks limit 100) o where st_intersects(geom,rast)"
    s <- system.time(d <- sqlQuery(con, query))
    tm_point <- c(tm_point, s[3])

    # Polygon on raster overlay: clip to the first 100 ecoregions and
    # summarise the pixel values with the PL/R median function
    query <- "select gid,rmedian((st_dumpvalues(st_union(st_clip(rast,geom)))).valarray) FROM elev,(select * from eco limit 100) foo WHERE ST_Intersects(rast, geom) GROUP BY gid order by gid"

    s <- system.time(d <- sqlQuery(con, query))
    tm_poly <- c(tm_poly, s[3])

}

Results

TITLE <- paste("Polygon overlay time \n", "R time = ", round(RTimePolys, 2), 
    sep = "")
plot(tiles, tm_poly, type = "b", main = TITLE, xlab = "Tile size (pixels)", 
    ylab = "Time (seconds)")

[Plot: polygon overlay time against tile size, data outside the database]

TITLE <- paste("Point overlay time \n", "R time = ", round(RTimePoints, 2), 
    sep = "")
plot(tiles, tm_point, type = "b", main = TITLE, xlab = "Tile size (pixels)", 
    ylab = "Time (seconds)")
abline(h = RTimePoints, lty = 2, col = 2)

[Plot: point overlay time against tile size, data outside the database; the dashed line marks the R time]

So, PostGIS clearly wins by around an order of magnitude on the polygon overlays. PostGIS can also just beat R for a point on raster overlay from a file, provided an optimal tile size (between 200 and 400 pixels in this case) is used. This optimum may be sensitive to the file size and other factors.

Timed overlays when the data are held within the database

Intuitively I would have thought that queries would run faster when the data values are loaded into the database. However, this is not the case, apparently because the TOAST system used for data storage carries a significant overhead for data retrieval.
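Whether a given raster is in-db or out-db can be checked from its band metadata; in PostGIS 2.x the composite returned by ST_BandMetaData includes an isoutdb field:

sqlQuery(con, "SELECT (ST_BandMetaData(rast)).isoutdb FROM elev LIMIT 1;")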



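The timing vectors are about to be reset, so to compare the two runs side by side afterwards it is worth keeping copies of the out-of-db results under new names first, for example:

tm_point_outdb <- tm_point
tm_poly_outdb <- tm_poly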
tm_point <- numeric()
tm_poly <- numeric()


for (i in tiles) {
    # Same as before, but without the -R flag: this time the pixel data
    # are stored inside the database
    command <- paste("raster2pgsql -s 4326 -d  -M  /home/duncan/geostats/elevation.tif -F -t ", 
        i, "x", i, " elev|psql -d temp", sep = "")
    system(command)

    query <- "select o.gid,genus,species,st_value(rast,geom) from elev, (select * from oaks limit 100) o where st_intersects(geom,rast)"
    s <- system.time(d <- sqlQuery(con, query))
    tm_point <- c(tm_point, s[3])

    query <- "select gid,rmedian((st_dumpvalues(st_union(st_clip(rast,geom)))).valarray) FROM elev,(select * from eco limit 100) foo WHERE ST_Intersects(rast, geom) GROUP BY gid order by gid"

    s <- system.time(d <- sqlQuery(con, query))
    tm_poly <- c(tm_poly, s[3])

}

Results

TITLE <- paste("Polygon overlay time \n", "R time = ", round(RTimePolys, 2), 
    sep = "")
plot(tiles, tm_poly, type = "b", main = TITLE, xlab = "Tile size (pixels)", 
    ylab = "Time (seconds)")

[Plot: polygon overlay time against tile size, data inside the database]

TITLE <- paste("Point overlay time \n", "R time = ", round(RTimePoints, 2), 
    sep = "")
plot(tiles, tm_point, type = "b", main = TITLE, xlab = "Tile size (pixels)", 
    ylab = "Time (seconds)")
abline(h = RTimePoints, lty = 2, col = 2)

[Plot: point overlay time against tile size, data inside the database; the dashed line marks the R time]

With the data held inside the database, polygon on raster overlays take up to twice as long. Point on raster overlays are not only much slower, but the time is an almost linearly declining function of tile size.

Clean up

odbcClose(con)
system("dropdb temp")