What is a raster?
Is it an image, a surface, a matrix, an array, a map, pixellated coverage, a discretization of the plane, a georeferenced grid? Which of these is most accurate?
What a raster is depends a lot on what you are doing with the data, and what you need from it. All of these descriptions might make sense.
The raster package in R has a very strong cell-based index abstraction that is extremely useful for dealing with gridded data. This abstraction is powerful, but in its current form it is not easily separated from the high-level functionality of the raster package. Grasping this distinction is not really a priority for most users or developers, but I contend that much efficiency of computation and understanding is missed because of it. The high-level tools save users from needing to know these details, but in doing so they have wrapped up the lower levels too tightly.
This document aims to explain why that abstraction is very useful, and worth the effort to separate it from the specific workflows enabled by the raster package.
Rasters are used in many places in R, and their support has been very good from the earliest versions of R. There have been many somewhat obscure improvements in how they can be handled, and the topic is now very much obscured by the complicated history of their existence in R and their use in very many packages in different ways. The possibilities for what a raster is and what it can be are not a finished story. The earliest version of a “raster” in R had nothing to do with the raster or sp packages, and pre-dates the support in grid for rasterImage (used heavily by ggplot2 and raster) - this early support is the graphics::image function, which ironically enough actually drew very many little rectangular polygons in a slow but effective way.
A fairly straightforward use of the raster package is to extract values from cells (a.k.a. pixels). These values might be data, like a unit-measure quantity such as temperature in Celsius stored as a floating point number, or a more abstract depiction of data converted to a relative scale, such as grey-scale or Red-Green-Blue (RGB, or BGR). Put the colour-model variety aside for now; it adds another level of complication to the story that we don’t need yet.
Here is a set of files containing sea surface temperature (SST). This won’t work on your local system, but it’s a set of files somewhere that R can see. (Our system is set up to provide very easy access to these kinds of files - to download these exact files or to emulate our system see the section below.)
files <- raadtools::sstfiles()$file[13051:13062]
print(basename(files))
## [1] "avhrr-only-v2.20170525.nc" "avhrr-only-v2.20170526.nc"
## [3] "avhrr-only-v2.20170527.nc" "avhrr-only-v2.20170528.nc"
## [5] "avhrr-only-v2.20170529.nc" "avhrr-only-v2.20170530.nc"
## [7] "avhrr-only-v2.20170531.nc" "avhrr-only-v2.20170601.nc"
## [9] "avhrr-only-v2.20170602.nc" "avhrr-only-v2.20170603.nc"
## [11] "avhrr-only-v2.20170604.nc" "avhrr-only-v2.20170605.nc"
A typical requirement of these data is to extract values from local sub-regions and summarize them. A sub-region might be an extent (a bounding box in the coordinate system of the data), a set of points, a set of lines, or a set of polygons. Each of these types of sub-region is a query of the underlying gridded data.
(For now we ignore cases of varying query-in-time, continuous interpretation of cell values, and weighted or partial cell-coverage - these are specializations that build on lower level functionality of the simpler cases).
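To make these query types concrete, here is a minimal sketch using standard raster verbs - the points are invented for illustration, and the polygon query is shown only in outline since we have no polygon layer yet:

library(raster)
r <- raster(files[1])
## extent query: crop to a bounding box
sub <- crop(r, extent(130, 150, -50, -38))
## point query: extract cell values at (hypothetical) coordinates
pts <- cbind(lon = c(140, 147), lat = c(-42, -44))
extract(r, pts)
## polygon query: summarize the cells falling in each (hypothetical) polygon layer 'polys'
## extract(r, polys, fun = mean, na.rm = TRUE)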
I’ve chosen two cases to illustrate: one with a single extent, and another with two simple polygonal regions. To check the context in which these queries will be built, first interrogate the data for the basic metadata. (I’m using ‘readAll’ simply to hide the full file path; it’s unnecessary in the usual case.)
library(raster)
## Loading required package: sp
readAll(raster(files[1]))
## Warning in .varName(nc, varname, warn = warn): varname used is: sst
## If that is not correct, you can set it to one of: sst, anom, err, ice
## class : RasterLayer
## dimensions : 720, 1440, 1036800 (nrow, ncol, ncell)
## resolution : 0.25, 0.25 (x, y)
## extent : 0, 360, -90, 90 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : Daily.sea.surface.temperature
## values : -1.8, 32.14 (min, max)
## z-value : 2017-05-25
## zvar : sst
## level : 1
In this case the interpretation of the underlying grid logic is sound, though in many cases this process needs some investigation and slight re-interpretation (to account for cell-centre vs. cell-edge alignment, an incomplete coordinate reference system (CRS), or other incomplete or inaccurate interpretations). There’s no way to automate every possible case; these tools exist in a complex environment and not every situation can be handled without human oversight. Other workflows might need to restructure the data into an Atlantic view with longitudes in -180, 180, but this Pacific view with longitudes in 0, 360 is fairly common, and sometimes required - neither case is inherently better or worse.
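If the Atlantic view is genuinely needed, raster provides the rotate function for exactly this 0, 360 to -180, 180 restructure. A quick sketch - note this rewrites the grid, so it is the kind of restructuring commitment discussed above:

rdata_atlantic <- rotate(raster(files[1]))
extent(rdata_atlantic)  ## now -180, 180, -90, 90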
What is an extent? For raster it is a set of four numbers - xmin, xmax, ymin and ymax - and there are simple idioms for creating these. An Extent is a formal class that many raster operations formally understand, but at root it’s just these four numbers. The magic of R methods for formal classes provides all the convenient behaviour to allow this to act like a rectangular polygon.
Let’s build one, here using longitude latitude values local to my part of the world.
library(raster)
ex <- extent(c(xmin = 130, xmax = 150.2, ymin = -50, ymax = -38.4))
We use this extent to conveniently home in on just a small region of the global raster data. Here I really don’t care about any interpretation of the data; it’s just an illustration of the mechanics available to us. (Once we have mastery of these mechanics, and hopefully access to the best possible techniques, we can get back to the actual work of interpreting and using these data and honing our visualizations and analyses.)
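As an aside, the formal-class convenience mentioned above means an Extent behaves like other spatial objects where it makes sense; for example, it coerces directly to a rectangular polygon:

p <- as(ex, "SpatialPolygons")  ## a one-polygon layer with these four corners
plot(p)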
This is very nice, but note how the extent we asked for is not applied to the data we end up seeing in the plot.
rdata <- raster(files[1])
## Warning in .varName(nc, varname, warn = warn): varname used is: sst
## If that is not correct, you can set it to one of: sst, anom, err, ice
plot(ex) ## the extent sets the plot context first
plot(rdata, add = TRUE)
plot(ex, add = TRUE) ## put the extent layer on top
We haven’t actually cropped the data at all; it has been visually truncated by the graphics device, and it’s clear that the scale shows a range of data much wider than we see in this window. So it’s accurate, given that the data is global, but maybe not what we wanted.
Let’s crop the data.
rdata_local <- crop(rdata, ex)
print(rdata_local)
## class       : RasterLayer
## dimensions : 46, 81, 3726 (nrow, ncol, ncell)
## resolution : 0.25, 0.25 (x, y)
## extent : 130, 150.25, -50, -38.5 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : Daily.sea.surface.temperature
## values : 6.41, 19.04 (min, max)
## time : 2017-05-25
We now have a copy of the sub-region we were after, and it’s time to notice another key feature. The extent we cropped with is not the extent of the result, because the data itself has an underlying grain and alignment (here a grain/resolution of 0.25 in both directions, with cells aligned at 0.25 spacing from 0). Alignment is just as important as resolution, since the grid could have started aligned somewhere like 0.1, not necessarily matching the resolution in whole steps relative to 0.
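Both properties can be queried directly: res gives the grain, and origin gives the offset of the grid lines from (0, 0):

res(rdata_local)     ## 0.25 0.25
origin(rdata_local)  ## 0 0 - this grid is aligned exactly to zero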
We can see the difference if we crop with snap = "out". It’s still not the same extent we asked for, but it’s more inclusive (and I wish it were the default). Note that we cannot crop a raster to an exact extent if it’s not already aligned, but that’s the nature of gridded data. We would have to remodel and restructure the data to get a new grain and alignment; that’s a perfectly valid workflow, but it’s a relatively big commitment since it modifies the data and commits us to a new structure, so it involves interpretation in all aspects of the work.
crop(rdata, ex, snap = "out")
## class       : RasterLayer
## dimensions : 47, 81, 3807 (nrow, ncol, ncell)
## resolution : 0.25, 0.25 (x, y)
## extent : 130, 150.25, -50, -38.25 (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## data source : in memory
## names : Daily.sea.surface.temperature
## values : 6.41, 19.04 (min, max)
## time : 2017-05-25
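For completeness, the remodel-and-restructure route mentioned above is essentially what raster::resample offers. A minimal sketch, where the target grid is an assumption of this example - a hypothetical finer grain, aligned to the snap = "out" extent so the resolution divides it evenly:

target <- raster(ext = extent(130, 150.25, -50, -38.25), resolution = 0.125,
                 crs = projection(rdata))
rdata_remodelled <- resample(rdata, target, method = "bilinear")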
I needed to be explicit about exact alignment for extents, and so wrote a positive, outward buffer function to ensure the input extent is widened and aligned to a particular step size (from zero). Now see that our originally arbitrary extent aligns exactly with the resulting extent from the SST data set - this is exactly what snap = "out" does internally, in the context of the particular grid being used.
print(ex)
## class : Extent
## xmin : 130
## xmax : 150.2
## ymin : -50
## ymax : -38.4
library(spex)
print(spex::buffer_extent(ex, 0.25))
## class : Extent
## xmin : 130
## xmax : 150.25
## ymin : -50
## ymax : -38.25
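Under the assumptions here (alignment from zero), the same arithmetic is easy to write in base R. This is a sketch of the calculation buffer_extent performs, not its actual implementation:

align_out <- function(e, step) {
  ## widen each edge outward to the nearest multiple of step
  extent(floor(xmin(e) / step) * step, ceiling(xmax(e) / step) * step,
         floor(ymin(e) / step) * step, ceiling(ymax(e) / step) * step)
}
align_out(ex, 0.25)  ## same result as buffer_extent(ex, 0.25)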
Another useful conversion is to deal with the grid in purely index space, though this can be confusing when the grid is referenced from the top down. Here the index_extent function provides a new extent object that details where in the grid the geographic extent (the second argument) lies.
ind_ext <- tabularaster::index_extent(rdata, ex)
We might use this very explicitly with
## flip the rows (raster stores top-down) and transpose for image()
rawmat <- t(as.matrix(rdata)[nrow(rdata):1, ])
## the seqs are cell boundaries, hence length dim + 1
image(seq(0, nrow(rawmat)), seq(0, ncol(rawmat)), rawmat, useRaster = TRUE)
plot(ind_ext, add = TRUE)
That gives us a way to convert “geographic queries” into the “raw index” queries required by the data structure; this is more or less what raster::crop must do internally, and similar logic is used by tidync and GDAL itself.
We should now see that we can extract that exact region of data without having to do any of the maths ourselves. (The maths isn’t “hard”, it’s just very easy to get wrong and/or get confused by, mostly because of the orientation of the grid - see how this grid is Pacific-centred, and also that we had to transpose and flip it in order to make an image correctly.)
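raster also exposes this geography-to-index mapping piecewise, with colFromX and rowFromY, which make a useful cross-check on the index extent (remember that rows count from the top):

colFromX(rdata, c(xmin(ex), xmax(ex)))  ## column range of our query
rowFromY(rdata, c(ymax(ex), ymin(ex)))  ## row range, top row first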
The NetCDF library expects queries to come in the form of start and count values, which we can generate from the index extent.
## derive the NetCDF start/count query from the index extent
start <- c(xmin(ind_ext), ymin(ind_ext))
count <- c(xmax(ind_ext) - xmin(ind_ext),
           ymax(ind_ext) - ymin(ind_ext))
nccon <- ncdf4::nc_open(files[1])
## this won't work, because there are two degenerate dimensions - one is unlimited "time" and the
## other is a singleton to record that this is only "surface" values
#winmat <- ncdf4::ncvar_get(nccon, "sst", start = start, count = count)
winmat <- ncdf4::ncvar_get(nccon, "sst", start = c(start, 1, 1), count = c(count, 1, 1))
image(winmat)
ncdf4::nc_close(nccon)  ## tidy up the open connection
Notice how this plot is mapped into the interval 0, 1 for x and y, even though above we recorded where in the overall parent grid we belonged. These things conspire to make all of this quite confusing.
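To put the window back into geographic context we can hand coordinate vectors to image, built from cell boundaries at the 0.25 grain. This sketch assumes the window’s lower-left corner is the aligned (130, -50) corner derived above:

xs <- 130 + 0.25 * (0:nrow(winmat))  ## longitude cell boundaries
ys <- -50 + 0.25 * (0:ncol(winmat))  ## latitude cell boundaries
image(xs, ys, winmat, useRaster = TRUE)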
There is a further exploration of these confusing relationships here:
https://hypertidy.github.io/ffraster/articles/dimension-order.html
## Option 1: direct ftp download
## Option 2: bowerbird setup