Using spatial overlap to reapportion data from one geography to another

The problem: To reapportion data from one geography (set of zones) to another, based upon the extent of the spatial overlap between the 'old' and 'new' spatial units.

The motivation: When faced with this challenge I surfed the cybershpere looking for a suitable entry-level primer and drew a blank. So, here it is….

In the worked example that follows, the Census 2001 population count for wards in Liverpool (UK) are reapportioned from an 'old' geography (2001 Census wards) to a 'new' geography (2011 Census wards). However, the solution outlined can be applied equally well to any other problem involving the reapportioning of data from one set of polygons (zones) to another (e.g. from Census Tracts to School Districts), simply by treating the original geography as the 'old' geography and the target geography as the 'new' geography.

Key techniques covered:

Disclaimer: A more sophisticated approach would probably involve the use of the over() function and array processing of the resulting matrix. The solution offered here focusses instead upon simplicity. It also has the possible side-benefit of being more efficient, since computations are restricted to the observed overlaps/intersects, rather than having to deal with all possible overlaps (old areas x new areas) as is the case with a matrix-based approach.

Data copyright: Census data sourced from Office for National Statistics and licensed under the Open Government Licence v.1.0. Map boundaries contain Ordnance Survey data © Crown copyright and database right 2013

1. Open relevant libraries

The rgdal library is used to read in the 'old' and 'new' geographies from .shp files supplied by the Office for National Statistics.

The rgeos library is used overlay the two geographies and calcualte the degree to which they overlap.

library(rgdal)
## Warning: package 'rgdal' was built under R version 3.0.1
## Loading required package: sp
## Warning: package 'sp' was built under R version 3.0.1
## rgdal: version: 0.8-9, (SVN revision 470) Geospatial Data Abstraction
## Library extensions to R successfully loaded Loaded GDAL runtime: GDAL
## 1.9.2, released 2012/10/08 Path to GDAL shared files: C:/Program
## Files/R/R-3.0.0/library/rgdal/gdal GDAL does not use iconv for recoding
## strings. Loaded PROJ.4 runtime: Rel. 4.7.1, 23 September 2009,
## [PJ_VERSION: 470] Path to PROJ.4 shared files: C:/Program
## Files/R/R-3.0.0/library/rgdal/proj
library(rgeos)
## Warning: package 'rgeos' was built under R version 3.0.1
## rgeos version: 0.2-17, (SVN revision 392) GEOS runtime version:
## 3.3.8-CAPI-1.7.8 Polygon checking: TRUE

2. Read in old and new geographies

old.geog <- readOGR("Liverpool CAS Wards 2001", "england_caswa_2001")
## OGR data source with driver: ESRI Shapefile 
## Source: "Liverpool CAS Wards 2001", layer: "england_caswa_2001"
## with 33 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions
new.geog <- readOGR("Liverpool Census Merged Wards 2011", "england_cmwd_2011Polygon")
## OGR data source with driver: ESRI Shapefile 
## Source: "Liverpool Census Merged Wards 2011", layer: "england_cmwd_2011Polygon"
## with 30 features and 4 fields
## Feature type: wkbPolygon with 2 dimensions

Note that old.geog and new.geog are both SpatialPolygon (sp) objects.

In an sp object, sp@polygons stores the spatial boundaries and sp@data stores the associated spatial data (areal counts or rates).

3. Add the data to be reapportioned to old.geog

First, read in 2001 population count for Liverpool 2001 Census wards….

ward.data <- read.csv("Liverpool CAS Ward population 2001.csv", header = TRUE)
colnames(ward.data)[c(1, 3)] <- c("ons_label", "pop_2001")
head(ward.data, 1)
##   ons_label  Zone.Name pop_2001
## 1    00BYFA Abercromby    11473

…then add them to the old.geog SpatialPolygon.

head(old.geog@data, 1)
##   gid ons_label       name  label
## 0 545    00BYGC St. Mary's 04BYGC
old.geog@data <- data.frame(old.geog@data, pop_2001 = ward.data[match(old.geog@data[, 
    "ons_label"], ward.data[, "ons_label"]), "pop_2001"])
head(old.geog@data, 1)
##   gid ons_label       name  label pop_2001
## 0 545    00BYGC St. Mary's 04BYGC    12484

Note that the approach above uses the match() function to index the data entries in ward.data to the pre-existing data entries in old.geog@data via the unique ward identifier ons_label.

This approach ensures that the unique area IDs used internally within the SpatialPolygon object old.geog to link old.geog@data and old.geog@polygons are retained. The initially tempting and apparently simpler alternative of using the merge() function does not!

4. Visualise the old and new geographies

The overlaping geographies we are dealing with are visualised below, with the 2001 census geography (coloured areas) overlain with the 2011 census geography (white borders)

my_colours <- c("red", "green", "grey", "blue")
area_colour <- seq(1:4)
plot(old.geog, col = my_colours[area_colour], border = NA)
plot(new.geog, border = "white", add = TRUE)

plot of chunk unnamed-chunk-5

5. Find out how much of each old.area is located in each new.area

First, we can find the set of areas that represent the intersections (overlaps) between the old and new geographies using gIntersection() function from the rgeos library…

# Find the overlaps...
overlap.geog <- gIntersection(new.geog, old.geog, byid = TRUE)
# ...then visualise them
plot(overlap.geog, col = my_colours[area_colour], border = NA)
plot(new.geog, border = "white", add = TRUE)

plot of chunk unnamed-chunk-6

Then we can find the size (spatial extent / surface area) of each geographical intersection (overlap) using the rgeos function gArea()

overlap.area <- gArea(overlap.geog, byid = TRUE)

We can also find number of overlaps (geographical intersections) we are dealing with:

n.overlaps <- length(overlap.area)
n.overlaps
## [1] 155

A litte more tricky is idenfying the IDs of the old and new areas associated with each overlap, as the unique name automatically created by gArea() for each overlap areas identified = new.id + space + old.id. For example the overlap between old.id “0” and *new.id" “23” would be named “0 23”.

# NOTE: the type and order of calls to R functions below ensures that
# resulting dataframe retains IDs in character format; and the area of
# each overlap in numeric format
tmp <- as.character(scan(text = names(overlap.area)))
tmp <- matrix(tmp, nrow = n.overlaps, ncol = 2, byrow = TRUE, dimnames = list(c(NULL), 
    c("new.id", "old.id")))
tmp <- as.data.frame(tmp, stringsAsFactors = FALSE)
overlaps <- cbind(tmp, overlap.area)
rownames(overlaps) <- NULL

The resulting dataframe, overlaps records, for each overlap, the associated old.id, new.id and overlap.area:

head(overlaps, 1)
##   new.id old.id overlap.area
## 1      0      0         9.26
str(overlaps)
## 'data.frame':    155 obs. of  3 variables:
##  $ new.id      : chr  "0" "0" "0" "0" ...
##  $ old.id      : chr  "0" "2" "3" "23" ...
##  $ overlap.area: num  9.26 1.53 5.91e+06 4.38e+05 1.76e-02 ...

6. Find the proportion of each old.area falling in each new.area

To do this, we first need to work out the size of each old.area:

tmp <- gArea(old.geog, byid = TRUE)
old.area <- data.frame(old.id = names(tmp), old.area = tmp, stringsAsFactors = FALSE)
head(old.area, 1)
##   old.id old.area
## 0      0 22188838

Then we need to ADD the size of each old area to the relevant overlap in the dataframe overlap.area:

head(overlaps)
##   new.id old.id overlap.area
## 1      0      0    9.260e+00
## 2      0      2    1.530e+00
## 3      0      3    5.907e+06
## 4      0     23    4.380e+05
## 5      1     13    1.757e-02
## 6      1     15    2.945e+00
overlaps <- data.frame(overlaps, old.area = old.area[match(overlaps[, "old.id"], 
    old.area[, "old.id"]), "old.area"])
head(overlaps)
##   new.id old.id overlap.area old.area
## 1      0      0    9.260e+00 22188838
## 2      0      2    1.530e+00  5996177
## 3      0      3    5.907e+06  7683494
## 4      0     23    4.380e+05  3744167
## 5      1     13    1.757e-02  2460337
## 6      1     15    2.945e+00  1595174

Before, at last, calculating the proportion of each old area that overlaps each new area:

overlaps$old.area.share <- overlaps$overlap.area/overlaps$old.area

7. Overlay validation

If a zone from the old geography lies fully within the new geography, then the reapportionment of the old zone to zones in the new geography should be fully accounted for. In other words,the sum of its old.area.shares should add to 1. We can check this for each spatial unit in the old geography as follows:

# Find sum of old.area.share for each old.id
tmp <- aggregate(old.area.share ~ old.id, data = overlaps, sum)
colnames(tmp) <- c("old.id", "sum.old.area.shares")
# Sort into same order as spatial units are stored in old.geog@data
old.area.shares <- tmp[match(rownames(old.geog@data), tmp[, "old.id"]), ]
old.area.shares
##    old.id sum.old.area.shares
## 1       0              1.0000
## 2       1              1.0000
## 13      2              1.0000
## 24      3              1.0000
## 28      4              1.0000
## 29      5              0.9999
## 30      6              0.9998
## 31      7              1.0000
## 32      8              1.0000
## 33      9              1.0000
## 3      10              1.0000
## 4      11              1.0000
## 5      12              1.0000
## 6      13              1.0000
## 7      14              1.0000
## 8      15              1.0000
## 9      16              1.0000
## 10     17              1.0000
## 11     18              1.0000
## 12     19              1.0000
## 14     20              1.0000
## 15     21              1.0000
## 16     22              1.0000
## 17     23              1.0000
## 18     24              1.0000
## 19     25              0.9982
## 20     26              1.0000
## 21     27              1.0000
## 22     28              1.0000
## 23     29              1.0000
## 25     30              1.0000
## 26     31              1.0000
## 27     32              1.0000

As can be seen from the above, not all old areas have been fully reallocated to the new geography (sum of old.area.share /= 1). Further investigation shows that all of the problem areas border the city of Liverpool; and that in all cases area was 'lost'.

# Classify areas into three categories: 'lost area'; 'kept area'; 'gained
# area' [where 'correctly accounted for' are areas with
# sum(old.area.shares) in the range <1-precision> to <1+precision> ]
library("classInt")
## Warning: package 'classInt' was built under R version 3.0.1
## Loading required package: class
## Loading required package: e1071
## Warning: package 'e1071' was built under R version 3.0.1
tmp <- old.area.shares$sum.old.area.shares
precision <- 1e-07
breaks <- classIntervals(tmp, n = 3, style = "fixed", fixedBreaks = c(min(tmp), 
    1 - precision, 1 + precision, max(tmp, 1 + precision) + precision))
breaks <- breaks$brks
breaks
## [1] 0.9982 1.0000 1.0000 1.0000

# Visualise the outcome [red = 'lost area'; white = 'kept area'; blue =
# 'gained area']
my_colours <- c("Red", "White", "Blue")
plot(old.geog, col = my_colours[findInterval(tmp, breaks)])

plot of chunk unnamed-chunk-15

The reason for this pattern is that the 2001 and 2011 versions of the digitised boundary for the City of Liverpool differ; not as a result of real boundary changes, but as a result of minor imprecisions (differences) in the two digitisations. The outcome is that:

8. Rescaling the proportion of each old.area falling in each new.area

If a spatial unit in old.geog is known NOT to extend beyond the area covered by new.geog, then possible errors introduced by imprecision in digitisation can be addressed by rescaling its old.area.shares to sum to 1.

# Add the sum.old.area.shares specific to each *old.id* to the relevant
# overlap areas in the overlaps dataframe
overlaps <- data.frame(overlaps, sum.old.area.shares = old.area.shares[match(overlaps[, 
    "old.id"], old.area.shares[, "old.id"]), "sum.old.area.shares"])

# Rescale the old.area.share for each overlap by 1/sum.old.area.shares
overlaps$old.area.share.r <- overlaps$old.area.share * (1/overlaps$sum.old.area.shares)

# Show that this rescaling has worked (i.e. that sum.old.area.shares.r = 1
# for all old.ids)
aggregate(old.area.share.r ~ old.id, data = overlaps, sum)
##    old.id old.area.share.r
## 1       0                1
## 2       1                1
## 3      10                1
## 4      11                1
## 5      12                1
## 6      13                1
## 7      14                1
## 8      15                1
## 9      16                1
## 10     17                1
## 11     18                1
## 12     19                1
## 13      2                1
## 14     20                1
## 15     21                1
## 16     22                1
## 17     23                1
## 18     24                1
## 19     25                1
## 20     26                1
## 21     27                1
## 22     28                1
## 23     29                1
## 24      3                1
## 25     30                1
## 26     31                1
## 27     32                1
## 28      4                1
## 29      5                1
## 30      6                1
## 31      7                1
## 32      8                1
## 33      9                1

9. Reapportion data from the old to the new geography using the caculated overlaps

And now, at last, data can be reapportioned from the old to the new geography using the calculated share of each zone from the old geography located in each zone in the new geography.

new.geog@data$pop_2001 <- 0
for (i in 1:n.overlaps) {

    new.geog@data[overlaps$new.id[i], "pop_2001"] <- new.geog@data[overlaps$new.id[i], 
        "pop_2001"] + (overlaps$old.area.share.r[i] * old.geog@data[overlaps$old.id[i], 
        "pop_2001"])
}

If the new and old geographies cover same total area, then the sum of the sum of data reapportioned across the new geography should equal the sum of the data across the old geography. This is easy to check:

sum(old.geog@data$pop)
## [1] 439473
sum(new.geog@data$pop)
## [1] 439473
print(c("diff = ", round((sum(old.geog@data$pop) - sum(new.geog@data$pop)), 
    2)))
## [1] "diff = " "0"

10. Visualise the reapportioned data

Finally, having got all this way, it would be nice to visualise the distribution of the data across the old and new geographies, wouldn't it?

# ==== Visualise 2001 population using 2001 geography super-imposed with
# 2011 boundaries ====
breaks <- classIntervals(old.geog@data$pop_2001, n = 5, style = "fisher")
breaks <- breaks$brks
library("RColorBrewer")
my_colours <- brewer.pal(6, "YlOrRd")
plot(old.geog, col = my_colours[findInterval(old.geog@data$pop_2001, breaks)], 
    border = NA)
plot(new.geog, border = "grey", add = TRUE)
library("maptools")
## Warning: package 'maptools' was built under R version 3.0.1
## Loading required package: foreign
## Loading required package: grid
## Loading required package: lattice
## Checking rgeos availability: TRUE
legend(x = 331000, y = 386096, legend = leglabs(round(breaks), between = " to "), 
    fill = my_colours, bty = "n", cex = 0.7)

plot of chunk unnamed-chunk-19


# ==== Visualise 2001 population reapportioned to 2011 Geography,
# super-imposed with 2001 boundaries ===
breaks <- classIntervals(new.geog@data$pop_2001, n = 5, style = "fisher")
breaks <- breaks$brks
library("RColorBrewer")
my_colours <- brewer.pal(6, "YlOrRd")
plot(new.geog, col = my_colours[findInterval(new.geog@data$pop_2001, breaks)], 
    border = NA)
plot(old.geog, border = "grey", add = TRUE)
legend(x = 331000, y = 386096, legend = leglabs(round(breaks), between = " to "), 
    fill = my_colours, bty = "n", cex = 0.7)

plot of chunk unnamed-chunk-19