Spatial Joins and Aggregation

Author

Dr Richard Timmemran

Abstract

This document explains a standard approach to performing a spatial join between two geographic datasets in R. The process depends on the sf package. Additionally, aggregation is also explained following the application of a spatial join.

1 Spatial joins

You are faced with two geographical data sets that can fall within the same map space. Your instincts tell you that it is a good idea to merge (join) these data sets together but you do not have a mutual field. Typically this scenario occurs when you want points to inherit the properties of polygons (and vice versa). What do you do? The answer is that you create a spatial join. This is easily achieved using the st_join(...) function in R. Here’s how it works using the Birth Weight data set featured in various formative assignments.

1.1 Essential packages

For this process, the packages we’ll need are as follows:

if(!require(sf)) install.packages("sf") & require(sf)
if(!require(tmap)) install.packages("tmap") & require(tmap)

1.2 Data

For the sake of demonstration, I have uploaded the birth weight data to my GitHub page, alongside the Shapefiles for Welsh regions it falls within. They can be loaded using the following code:

BIRTHS_DATA <- read.csv("https://raw.githubusercontent.com/Richtea84/I2Q-files/main/BIRTHS_NO68.csv")
POLY_REGIONS <- read_sf("/vsicurl/https://github.com/Richtea84/I2Q-files/raw/main/WALES_clipped/WALES_clipped.shp")

Upon initial inspection, we find that BIRTHS_DATA has 4536 and the POLY_REGIONS, 8. Already, we can see that using the merge(...) function will result in a mismatch error. Let’s perform the spatial join.

1.3 Converting the CSV file into a mappable format

To convert the CSV file into mappable data, we use the st_as_sf(...) function. Here’s the process:

BIRTHS_MAP <- st_as_sf(BIRTHS_DATA, coords = c("LONGITUDE","LATITUDE"))

Here, the Longitude and Latitude fields are used to match native projections for the POLY_REGIONS data (Word Geodetic System 1984, aka WGS 84). However, we still need to tell R what the coordinate reference system is. We do this by using the st_crs(...) function in the following way:

st_crs(BIRTHS_MAP) <- 4326

The number 4326 is the European Petroleum Survey Group’s (EPSG) reference number for the WGS 84 projection. In the past we had to insert the full PROJ.4 string. Older packages like sp still use this convention. For reference, you can retrieve the value +proj=longlat +datum=WGS84 +no_defs +type=crs in the following way:

st_crs(4326)$proj4string
[1] "+proj=longlat +datum=WGS84 +no_defs"

In any case, sf readily projects your data from the EPSG vector value.

To ensure that both your data sets are using the same projection, you can use the st_crs(...) function in the following way:

st_crs(BIRTHS_MAP) <- st_crs(POLY_REGIONS)

We can produce a quick map to see whether the data sets are aligned (Figure 1).

tm_shape(POLY_REGIONS)+
  tm_polygons()+
  tm_shape(BIRTHS_MAP)+
  tm_dots()+
  tm_compass()+
  tm_scale_bar()

Figure 1: Examining whether our points and polygons are aligned (intersections).

1.4 The joining process

With the geometric properties of data sets set to the same projection, you are now able to perform the join. This is achieved using the st_join(...) function. If you want your points to inherit the properties of the polygons, specify the polygons as your x = points and the y = as your polygons:

PTS_MODIFIED <- st_join(x = BIRTHS_MAP, y = POLY_REGIONS)

# verify
nrow(PTS_MODIFIED)
[1] 4536

Under some circumstances, you may find it useful for the polygons to inherit the properties of the points. To achieve, you’ll simply reverse the join. For the data featured in this tutorial, this will result in duplications of polygons. For now, we’ll leave this process to one side and carry on with the PTS_MODIFIED object.

We can verify our join by preparing another simple map (Figure 2).

# 
tm_shape(PTS_MODIFIED)+
  tm_dots("name")

Figure 2: Early examination of the spatial join.

2 Aggregation

Now that we’ve joined the spatial data, we can apply the aggregate(...) function to extract further insights from the data based on the properties from the polygons. For example, we can examine the number of recorded births per region (see code below and the result in Table 1).

# Create a count field
PTS_MODIFIED$Count = 1

# Aggregate (presented as a table)

knitr::kable(
  aggregate(Count ~ name, data = PTS_MODIFIED, FUN = sum)
  )
name Count
Caerphilly 58
Cardiff Central 925
Cardiff North 1330
Cardiff South and Penarth 717
Cardiff West 861
Newport West 56
Vale of Glamorgan 586
Table 1: Aggregation based on frequency of observations for each welsh region

The region with the most recorded births is Cardiff North.

If we wanted to calculate the average birth weight by region, the aggregate(...) function can be applied in the following way (results shown in Table 2).

knitr::kable(
  aggregate(Bweight ~ name, data = PTS_MODIFIED, FUN = mean)
  )
name Bweight
Caerphilly 3.333276
Cardiff Central 3.262681
Cardiff North 3.375639
Cardiff South and Penarth 3.397782
Cardiff West 3.241963
Newport West 3.456250
Vale of Glamorgan 3.146450
Table 2: Aggregation showing the average birth weight for each Welsh region

The region with the heaviest births is Newport West.