Introduction

This tutorial explores the use of hexograms to produce better maps of area-based data and population distributions. Hexograms are a cross between hexagonal binning, tile maps and cartograms, and aim to redress both the problem of ‘invisibility’ prevalent in conventional maps and the problem of distortion caused by cartograms. Examples of hexograms, and the R functions to produce them, are presented. In principle those functions will work with other data (specifically, spatial polygons and spatial polygon data frames). However, this is a proof of concept rather than fully developed code, and it is not optimised for speed.

To get started

Set your working directory
# Yours will differ from what is written below
setwd("C:/Users/profr/Downloads")
Install the required libraries
reqd <- c("cartogram", "sp", "fMultivar", "RANN","rgdal","rgeos","GISTools",
          "RCurl")
pkgs <- rownames(installed.packages())
need <- reqd[!reqd %in% pkgs]
if(length(need)) install.packages(need)
Download and load a boundary file
download.file("https://github.com/profrichharris/Rhexogram/blob/master/las.zip?raw=true", "las.zip", mode="wb")
unzip("las.zip")
map <- rgdal::readOGR("las.shp", "las")

Illustrating the problem

A perennial problem when mapping neighbourhood data is the variable size of the areas. Typically this leaves the map dominated by the largest areas - which often contain the fewest people - whereas the attributes of the smallest areas become too small to discern. An example is shown below: a choropleth map of the log of the average house price of local authorities in England in 2016. The highest prices are in London - the boundary of which is shown - but where in the capital is most expensive, and how prices vary within it, is not clear.

require(GISTools)
z <- log(map$MeanPrice)
shades <- auto.shading(z, cutter = rangeCuts, n = 4, cols = brewer.pal(4, "YlGnBu"))
par(mai=c(0,0,0.5,0))
choropleth(map, z, shading = shades, border = "white")
outline <- rgeos::gBuffer(map)
plot(outline, add = T)
plot(rgeos::gBuffer(map[map$RGN == "London",]), add=T)
choro.legend(52473, 622458, shades, between = "to <")
title("Mean house price (log £s)", cex.main = 0.9)

A ‘solution’ has been to use cartograms that resize the areas in accordance with their population size or, alternatively, in proportion to the attribute of interest, here the log of the average house price. Both approaches are shown below and were produced using the cartogram package for R. The concern - which is not a criticism of the package but a comment on cartograms more generally - is that they trade invisibility for distortion: London is now much larger but at the price of geographic distortion that renders parts of the rest of the map illegible. If the purpose of the map is to convey location as well as measurement then neither the choropleth nor the cartograms are especially successful. Both suffer from a problem of misrepresentation.
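For reference, a sketch of how those two cartograms might be generated with the cartogram package. Note that the population field name ("Population") is an assumption for illustration only; substitute whatever population column your boundary file actually contains.

# A sketch only: "Population" is a hypothetical field name, not a column of
# the downloaded boundary file; replace it with your own population variable
map$logprice <- log(map$MeanPrice)
popcarto <- cartogram::cartogram(map, "Population", maxSizeError = 1.1)
attcarto <- cartogram::cartogram(map, "logprice", maxSizeError = 1.1)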

Balanced cartograms

Harris et al. (2017b) describe the problem of geographic distortion as ‘the curse of the cartogram’. The problem arises not from the method per se but from the choice of scaling variable used to reapportion the areas. Exchanging one positively skewed variable (the area of each local authority) for another positively skewed variable (the house prices) cannot solve the visualisation issue. The problem would, of course, be even worse if the actual prices were used instead of their log.

What can? Harris et al. (2017a) suggest addressing the specific problem head on. For the original map, the problem is that some of the areas are too small to be seen, so the obvious solution is to make them bigger. Doing this does not require them to be re-scaled against what is really a fairly arbitrary value like population size; only that they are made big enough to be seen, which here means an area of about 0.02 square inches. So, assuming we are producing a map to fit into a graphics window with a height of 5 inches, a process to rescale the map is:

siu <- 0.02 # the smallest interpretable unit (square inches)
height <- 5 # height of the graphics window (inches)
bb <- sp::bbox(map)
width <- (bb[1,2] - bb[1,1]) / (bb[2,2] - bb[2,1]) * height # width preserving the aspect ratio
bbA <- (bb[1,2] - bb[1,1]) * (bb[2,2] - bb[2,1]) # area of the bounding box (map units)
mapA <- rgeos::gArea(map) # total area of the map (map units)
minA <- (siu * bbA) / (height * width) # the smallest interpretable area in map units
map$scaleby <- rgeos::gArea(map, byid = TRUE)
map$scaleby[map$scaleby < minA] <- minA # raise any area below the threshold up to it
# The following will take a little while to run
balcarto <- cartogram::cartogram(map, "scaleby", maxSizeError = 1.1, prepare = "none")

This leads to what Harris et al. describe as a balanced cartogram because it better balances invisibility and distortion.

par(mai=c(0,0,0.5,0))
par(mfrow=c(1,2))
choropleth(balcarto, z, shading = shades, border = "white")
plot(rgeos::gBuffer(balcarto), add=T)
plot(rgeos::gBuffer(balcarto[balcarto$RGN == "London",]), add=T)
plot(outline, lty = "dotted", add = T)
choro.legend(52473, 622458, shades, between = "to <")
title("Balanced cartogram of mean house price (log £s)", cex.main = 0.9)
choropleth(map, z, shading = shades, border = "white")
plot(outline, add = T)
plot(rgeos::gBuffer(map[map$RGN == "London",]), add=T)
title("Original choropleth map", cex.main = 0.9)

If an error is defined as the percentage amount by which the original map and the cartogram do not overlap, then the gain from the balanced cartogram is clear: for the population cartogram the error is 13.9 per cent, for the attribute-based cartogram it is 12 per cent, and for the balanced cartogram it is roughly halved, to 6.2 per cent. More precisely, the error may be defined as,

\[ \epsilon = 100-50A_{{x}\cap{y}}\left(\frac{1}{A_{x}} + \frac{1}{A_{y}}\right) \] where \(A_{x}\) is the area of the original map, \(A_{y}\) is the area of the cartogram, and \(A_{{x}\cap{y}}\) is the area of their geographical intersection.
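A sketch of how that error might be calculated for the balanced cartogram, assuming the formula is applied to the dissolved outlines of the two maps as wholes (a per-area version would instead intersect each local authority with its cartogram counterpart):

# A sketch: overlap error between the original map and the balanced cartogram
Ax <- rgeos::gArea(outline) # area of the original map
carto_outline <- rgeos::gBuffer(balcarto) # dissolved outline of the cartogram
Ay <- rgeos::gArea(carto_outline)
Axy <- rgeos::gArea(rgeos::gIntersection(outline, carto_outline))
100 - 50 * Axy * (1 / Ax + 1 / Ay)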

Another way to gauge the amount of distortion is to consider the average displacement of the area centroids from the original map to their new locations on the cartogram. This is 26.04km for the population cartogram, 25.18km for the attribute cartogram and 14.19km for the balanced cartogram.
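A sketch of that calculation for the balanced cartogram, assuming the coordinates are in metres (as for the British National Grid) and that the features appear in the same order in both objects:

# Mean displacement (km) of the area centroids from the map to the cartogram
d <- sqrt(rowSums((sp::coordinates(balcarto) - sp::coordinates(map))^2))
mean(d) / 1000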

On these two criteria, it is the balanced cartogram that offers the best representation of the house price geography.

Hexagonal binning

Hexagonal binning has become popular as a method of data visualisation and can be used to map the density of point occurrences, as in this map where the hexagon area encodes the number of Walmart stores that fall into each bin and the colour represents the median age. Although an attractive map, applying this idea to the house price data raises two problems: finding room on the map to draw the hexagons for the smallest local authorities, especially in London, and the fact that we are seeking to map area (lattice) data, not the density of points.

Hexograms

The idea behind a hexogram is the same as for the balanced cartogram except that the minimum area is that which allows each area to be represented by its own hexagon in a process of hexagonal binning, as in the map below.

The hexagons are produced using the hexagonal binning function in R’s fMultivar package and are based on the centroids of each polygon. For the original map, many of the local authorities will fall into the same hexagonal bin. For example, if the default settings are used (with 30 bins) then just under half the 326 authorities will share a hexagon with at least one other:

hbin <- fMultivar::hexBinning(coordinates(map))
table(hbin$z)
## 
##   1   2   3   4   6   7  11 
## 170  43  10   4   1   1   1

There are two possible solutions to this. The first is to increase the number of bins until there are no longer any clashes. This occurs with 118 bins. Whilst this entails the least geographic distortion, the hexagons are then too small to be legible:
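If you want to check that figure for yourself, a sketch that simply increases the number of bins until no occupied hexagon contains more than one centroid is below; because clashes do not disappear monotonically as the bins increase, the first clash-free value found this way may not be exactly 118.

# Increase the number of bins until every occupied hexagon holds one centroid
xy <- sp::coordinates(map)
bins <- 30
repeat {
  hbin <- fMultivar::hexBinning(xy, bins = bins)
  if (max(hbin$z) == 1) break
  bins <- bins + 1
}
bins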

The second possibility is to increase the size of the areas where the clashes are occurring. This is controlled by the number of bins: the fewer the bins, the greater the amount of geographic distortion that will be created. The following map was set with a target of 25 bins (raised to 29 during its production) and consequently has more distortion.

However, that same map can be used to produce a non-tessellating tile map of the data, which, when combined with the original map, produces a quite attractive representation of the house price geography:

Creating a hexogram in R

The first step is to source the required functions
script <- RCurl::getURL("https://raw.githubusercontent.com/profrichharris/Rhexogram/master/functions.R")
eval(parse(text = script))
The second is to decide on the number of bins

… remembering that fewer bins will create more of a tile map but also more geographic distortion, and too many bins will result in hexograms that are hard to see. There is no correct number - it depends on the purpose - but in making a judgment it is helpful to gauge the expected number of conflicts (places without a unique hexagon) for each number of bins as this will suggest how many of the places will have to be enlarged to create the final map.

par(mai=c(1,1,0.3,0))
binN(map)

Because there is no right answer, some trial-and-error is anticipated. Here we will try using 29 bins and also 48. The former is expected to create the most geographic distortion and to take longer to produce. Be warned: generating the cartogram takes time and requires some patience. The first hexogram took 5 minutes to produce on my laptop; the second took 3.

hex29 <- hexogram(map, 29)
hex48 <- hexogram(map, 48)

The process works as follows. To begin, an initial cartogram is created that increases the size of the smallest areas in the map. Hexagonal binning is then applied to the centroids of the cartogram, identifying areas that conflict, i.e. share a hexagon. To address these conflicts, an attempt is made to move the centroids to different parts of the conflicted areas to see if that will separate them. If not, the areas are enlarged some more to see if that works. The process iterates through these procedures but, as a last resort, will attempt a third approach: increasing the number of bins.

Having produced the hexograms, we can map them

Assuming the code was followed at the beginning of the tutorial to produce the initial choropleth map then the hexogram can be mapped as follows:

par(mai=c(0,0,0.5,0))
par(mfrow=c(2,2))
carto <- hex29[[1]]
hexo <- hex29[[2]]
outline29 <- rgeos::gBuffer(carto)
choropleth(carto, z, shading = shades, border = "white")
plot(outline29, add=T)
choropleth(hexo, z, shading = shades, border = "black", add = T)
choro.legend(52473, 622458, shades, between = "to <")
title("Hexogram with 29 bins")
choropleth(carto, z, shading = shades, border = "white")
plot(outline29, add=T)
title("Underlying cartogram")
carto <- hex48[[1]]
hexo <- hex48[[2]]
outline48 <- rgeos::gBuffer(carto)
choropleth(carto, z, shading = shades, border = "white")
plot(outline48, add=T)
choropleth(hexo, z, shading = shades, border = "black", add = T)
title("Hexogram with 48 bins")
choropleth(carto, z, shading = shades, border = "white")
plot(outline48, add=T)
title("Underlying cartogram")

Overlaying the hexograms upon the original map hides some of the distortion to the study region’s boundary, but it is not ideal because the geographic displacement of the hexagons remains, and is worse in the first case:

par(mai=c(0,0,0.5,0))
par(mfrow=c(1,2))
hexo <- hex29[[2]]
plot(map, col="light grey", border="light grey")
choropleth(hexo, z, shading = shades, border = "black", add = T)
plot(outline29, lty = "dotted", add = T)
title("Overlaid upon the original map")
hexo <- hex48[[2]]
plot(map, col="light grey", border="light grey")
choropleth(hexo, z, shading = shades, border = "black", add = T)
plot(outline48, lty = "dotted", add = T)

However, with some re-arrangement of the hexagons, which returns those that are less clustered and towards the edge of the map back to their original positions, a non-tessellating tile map can be produced:

par(mai=c(0,0,0.5,0))
par(mfrow=c(1,2))
newhex <- reallocate(map, hex29)
choropleth(map, z, shading = shades, border = "white")
choropleth(newhex, z, shading = shades, border = "black", add = T)
choro.legend(52473, 622458, shades, between = "to <")
title("Tile maps combined with the original map")
newhex <- reallocate(map, hex48)
choropleth(map, z, shading = shades, border = "white")
choropleth(newhex, z, shading = shades, border = "black", add = T)

Further examples

Mapping the 2015 UK General Election results

Harris et al. (2017a) take the example of the 2015 UK General Election results as a motivation for the balanced cartogram. Their example is shown below.



source: Harris et al. (2017a)


The image below shows the same data but now as a hexogram.

Mapping residential geographies in London

A more demanding example is shown below. It involves mapping the ethnic composition of the residential population of each of the 4,662 Lower Level Super Output Areas (LSOAs) in London, using the 2011 Census data.

Taking the percentage of each neighbourhood that is Bangladeshi, Indian or Pakistani as an example, the standard choropleth map has the usual problem of invisibility:

By comparison, the cartogram underlying the hexogram (which took a little under 3 hours to generate!) cuts down on some of the parts of the map that are obscured by ‘black ink’ …

… and the hexogram offers a slightly different representation.

A hexogram for the residential geography of the Black African, Black Caribbean and Black Other groups is displayed below.

Finally, the following map shows the local index of dissimilarity values comparing the share of the White British population per neighbourhood with the share of the combined Bangladeshi, Black, Indian and Pakistani populations. These are shown as z-values, where the more purple the shading, the greater the share of the White British population relative to the rest, and the more orange the shading, the greater the share of the combined Bangladeshi, Black, Indian and Pakistani populations relative to the White British. Some of the patterns of ‘segregation’ - more closely explored by Johnston et al. 2016, amongst others - are evident.

Conclusion

The aim of this tutorial has been to show how, with more judicious use of the scaling parameter, more refined cartograms for neighbourhood data can be produced that offer less misrepresentation and better balance the problems of invisibility and distortion. In particular, hexograms have been presented as a hybrid of cartograms and tile maps, with the idea that these might produce better (and more attractive) visualisations of geographically distributed populations. If nothing else, it kept me busy as Christmas approaches.

Richard Harris, December 16, 2017

View licence deed

The research was funded under the ESRC’s Urban Big Data Centre, ES/L011921/1