library(sp)
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
library(raster)
library(ggplot2)

1. Introduction

There are two fundamental geographic data models: vector and raster.

The vector data model represents the world using points, lines and polygons. These have discrete, well-defined borders, meaning that vector datasets usually have a high level of precision.

The raster data model divides the surface up into cells of constant size. Raster datasets are used for storing remote sensing data as well as in web-mapping. Rasters aggregate spatially specific features to a given resolution.

Vector data

The main vector data types are points, lines and polygons. In all cases, the geometry of these data structures consists of sets of coordinate pairs (x, y). Points are the simplest case: each point has one coordinate pair, and n associated variables. It is also possible to combine several points into a multi-point structure, with a single attribute record. For example, all the coffee shops in a town could be considered as a single geometry.

The geometry of lines is a just a little bit more complex. First note that the term ‘line’ refers to a set of one or more polylines (connected series of line segments). For example, in spatial analysis, a river and all its tributaries could be considered as a single ‘line’ (but they could also also be several lines, perhaps one for each tributary river). Lines are represented as ordered sets of coordinates (nodes). The actual line segments can be computed by connecting the points. Thus, the representation of a line is very similar to that of a multi-point structure. The main difference is that the ordering of the points is important, because we need to know which points should be connected. A network (e.g. a road or river network), is a special type of lines geometry where there is additional information about things like flow, connectivity, direction, and distance.

A polygon refers to a set of closed polylines. The geometry is very similar to that of lines, but to close a polygon the last coordinate pair coincides with the first pair. A complication with polygons is that they can have holes (for example, an island inside a lake). Also, valid polygons do not self-intersect. Again, multiple polygons can be considered as a single geometry. For example, Indonesia consists of many islands. Each island can be represented by a single polygon, but together then can be represent a single (multi-) polygon representing the entire country.

Raster data

Raster data is commonly used to represent spatially continuous phenomena such as elevation. A raster divides the world into a grid of equally sized rectangles (referred to as cells or, in the context of satellite remote sensing, pixels) that all have one or more values (or missing values) for the variables of interest. A raster cell value should normally represent the average (or majority) value for the area it covers.

In contrast to vector data, in raster data the geometry is not explicitly stored as coordinates. It is implicitly set by knowing the spatial extent and the number or rows and columns in which the area is divided. From the extent and number of rows and columns, the size of the raster cells (spatial resolution) can be computed.

Continuous surface data are sometimes stored as triangulated irregular networks (TINs); these are not discussed here.

Simple representation of spatial data

The basic data types in R are numbers, characters, logical (TRUE or FALSE) and factor values. Values of a single type can be combined in vectors and matrices, and variables of multiple types can be combined into a data.frame. We can represent (only very) basic spatial data with these data types. Let’s say we have the location (represented by longitude and latitude) of ten weather stations (named A to J) and their annual precipitation.

In the example below we make a very simple map. Note that a map is special type of plot (like a scatter plot, barplot, etc.). A map is a plot of geospatial data that also has labels and other graphical objects such as a scale bar or legend. The spatial data itself should not be referred to as a map.

name <- LETTERS[1:10]
longitude <- c(-116.7, -120.4, -116.7, -113.5, -115.5,
               -120.8, -119.5, -113.7, -113.7, -110.7)
latitude <- c(45.3, 42.6, 38.9, 42.1, 35.7, 38.9,
              36.2, 39, 41.6, 36.9)
stations <- cbind(longitude, latitude)
class(stations)
## [1] "matrix"
head(stations)
##      longitude latitude
## [1,]    -116.7     45.3
## [2,]    -120.4     42.6
## [3,]    -116.7     38.9
## [4,]    -113.5     42.1
## [5,]    -115.5     35.7
## [6,]    -120.8     38.9
# Simulated rainfall data
set.seed(0)
precip <- (runif(length(latitude))*10)^3
class(precip)
## [1] "numeric"
head(precip)
## [1] 721.003613  18.716993  51.530302 187.988119 749.127376   8.203534

A map of point locations is not that different from a basic x-y scatter plot. Here I make a plot (a map in this case) that shows the location of the weather stations, and the size of the dots is proportional to the amount of precipitation. The point size is set with argument cex.

psize <- 1 + precip/500
plot(stations, cex=psize, pch=20, col='blue', main='Precipitation')

# add names to plot
text(stations, name, pos=4)

# add a legend
breaks <- c(100, 500, 1000, 2000)
legend("topright", legend=breaks, pch=20, pt.cex=psize, col='blue', bg='gray')

We can add multiple sets of points to the plot, and even draw lines and polygons:

lon <- c(-116.8, -114.2, -112.9, -111.9, -114.2, -115.4, -117.7)
lat <- c(41.3, 42.9, 42.4, 39.8, 37.6, 38.3, 37.6)
x <- cbind(lon, lat)

plot(stations, main='Precipitation')

polygon(x, col='gray', border='light gray')
lines(stations, lwd=1, col='green')
points(x, cex=2, pch=20, col='dark gray')
points(stations, cex=psize, pch=20, col='blue', main='Precipitation')

The above illustrates how numeric vectors representing locations can be used to draw simple maps. It also shows how points can (and typically are) represented by pairs of numbers, and a line and a polygons by a number of these points. Polygons is that they are “closed”, i.e. the first point coincides with the last point, but the polygon function took care of that for us.

There are cases where a simple approach like this may suffice and you may come across this in older R code or packages. Likewise, raster data could be represented by a matrix or higher-order array. Particularly when only dealing with point data such an approach may be practical. For example, a spatial data set representing points and attributes could be made by combining geometry and attributes in a single ’data.frame`.

wst <- data.frame(longitude, latitude, name, precip)
wst

However, wst is a data.frame and R does not automatically understand the special meaning of the first two columns, or to what coordinate reference system it refers (longitude/latitude, or perhaps UTM zone 17S, or ….?).

Moreover, it is non-trivial to do some basic spatial operations. For example, the polygon drawn on the map above might represent a state, and a next question might be which of the 10 stations fall within that polygon. And how about any other operation on spatial data, including reading from and writing data to files? To answer such questions there are a number of R packages that define new spatial data types and that can be used. The most important packages that define such spatial data structures are sp, sf and raster.

2. Vector data in R

The sp package

Package sp supports spatial data analysis in R. sp defines a set of classes to represent spatial data. A class defines a particular data type. The data.frame is an example of a class. Any particular data.frame you create is an object (instantiation) of that class.

The main reason for defining classes is to create a standard representation of a particular data type to make it easier to write functions (also known as ‘methods’) for them. In fact, the sp package does not provide many functions to modify or analyze spatial data; but the classes it defines are used in more than 100 other R packages that provide specific functionality.

Package sp introduces a number of classes with names that start with Spatial. For vector data, the basic types are the SpatialPoints, SpatialLines, and SpatialPolygons. These classes only represent geometries. To also store attributes, classes are available with these names plus DataFrame, for example, SpatialPolygonsDataFrame and SpatialPointsDataFrame. When referring to any object with a name that starts with Spatial, it is common to write Spatial. When referring to a SpatialPolygons or SpatialPolygonsDataFrame object it is common to write SpatialPolygons. The Spatial classes (and their use) are described in detail by Bivand, Pebesma and Gómez-Rubio.

It is possible to create Spatial* objects from scratch with R code. That can be very useful to create small self contained example to illustrate something, for example to ask a question about how to do a particular operation without needing to give access to the real data you are using. But in real life you will read these from a file or database, for example from a “shapefile”.

Spatial Points

longitude <- c(-116.7, -120.4, -116.7, -113.5, -115.5, -120.8, -119.5, -113.7, -113.7, -110.7)
latitude <- c(45.3, 42.6, 38.9, 42.1, 35.7, 38.9, 36.2, 39, 41.6, 36.9)
lonlat <- cbind(longitude, latitude)
class(lonlat)
## [1] "matrix"
lonlat
##       longitude latitude
##  [1,]    -116.7     45.3
##  [2,]    -120.4     42.6
##  [3,]    -116.7     38.9
##  [4,]    -113.5     42.1
##  [5,]    -115.5     35.7
##  [6,]    -120.8     38.9
##  [7,]    -119.5     36.2
##  [8,]    -113.7     39.0
##  [9,]    -113.7     41.6
## [10,]    -110.7     36.9

Now create a SpatialPoints object

library(sp)
pts <- SpatialPoints(lonlat)

Let’s check what kind of object pts is.

class (pts)
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"

And what is inside of it

showDefault(pts)
## An object of class "SpatialPoints"
## Slot "coords":
##       longitude latitude
##  [1,]    -116.7     45.3
##  [2,]    -120.4     42.6
##  [3,]    -116.7     38.9
##  [4,]    -113.5     42.1
##  [5,]    -115.5     35.7
##  [6,]    -120.8     38.9
##  [7,]    -119.5     36.2
##  [8,]    -113.7     39.0
##  [9,]    -113.7     41.6
## [10,]    -110.7     36.9
## 
## Slot "bbox":
##              min    max
## longitude -120.8 -110.7
## latitude    35.7   45.3
## 
## Slot "proj4string":
## CRS arguments: NA

So we see that the object has the coordinates we supplied, but also a bbox. This is a ‘bounding box’, or the ‘spatial extent’ that was computed from the coordinates. There is also a “proj4string”. This stores the coordinate reference system (“crs”, discussed in more detail later). We did not provide the crs so it is unknown (NA). That is not good, so let’s recreate the object, and now provide a crs.

crdref <- CRS('+proj=longlat +datum=WGS84')
pts <- SpatialPoints(lonlat, proj4string=crdref)
pts
## class       : SpatialPoints 
## features    : 10 
## extent      : -120.8, -110.7, 35.7, 45.3  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0

We can use the SpatialPoints object to create a SpatialPointsDataFrame object. First we need a data.frame with the same number of rows as there are geometries.

# Generate random precipitation values, same quantity as points
precipvalue <- runif(nrow(lonlat), min=0, max=100)
df <- data.frame(ID=1:nrow(lonlat), precip=precipvalue)
class(df)
## [1] "data.frame"
df

Combine the SpatialPoints with the data.frame.

ptsdf <- SpatialPointsDataFrame(pts, data=df)
ptsdf
## class       : SpatialPointsDataFrame 
## features    : 10 
## extent      : -120.8, -110.7, 35.7, 45.3  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 
## variables   : 2
## names       : ID,           precip 
## min values  :  1, 6.17862704675645 
## max values  : 10, 99.1906094830483
## class       : SpatialPointsDataFrame
## features    : 10
## extent      : -120.8, -110.7, 35.7, 45.3  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
## variables   : 2
## names       : ID,           precip
## min values  :  1, 6.17862704675645
## max values  : 10, 99.1906094830483

To see what is inside:

str(ptsdf)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  10 obs. of  2 variables:
##   .. ..$ ID    : int [1:10] 1 2 3 4 5 6 7 8 9 10
##   .. ..$ precip: num [1:10] 6.18 20.6 17.66 68.7 38.41 ...
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:10, 1:2] -117 -120 -117 -114 -116 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:2] "longitude" "latitude"
##   ..@ bbox       : num [1:2, 1:2] -120.8 35.7 -110.7 45.3
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "longitude" "latitude"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. ..@ projargs: chr "+proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0"
showDefault(ptsdf)
## An object of class "SpatialPointsDataFrame"
## Slot "data":
##    ID    precip
## 1   1  6.178627
## 2   2 20.597457
## 3   3 17.655675
## 4   4 68.702285
## 5   5 38.410372
## 6   6 76.984142
## 7   7 49.769924
## 8   8 71.761851
## 9   9 99.190609
## 10 10 38.003518
## 
## Slot "coords.nrs":
## numeric(0)
## 
## Slot "coords":
##       longitude latitude
##  [1,]    -116.7     45.3
##  [2,]    -120.4     42.6
##  [3,]