A common task in geographical information systems (GIS) is analysing and mapping sociodemographic/socioeconomic data by areas. This is called choropleth mapping (see http://www.datavizcatalogue.com/methods/choropleth.html). Zones of analysis are normally defined by administrative/electoral boundaries (or aggregations thereof) but they can also be bespoke, project-specific areas that do not conform to official geographical units.

In this post I show the R code used to produce choropleth maps of aggregated UK small-area deprivation data (adjusted UK Index of Multiple Deprivation). This analysis was conducted as part of an on-going research project aimed at producing a regional agricultural ‘resilience index’ for England and Wales, and it provided us with an indicator of the wider regional socioeconomic environment in which the farms in our study are located.

I’m sure the code is a bit hacky in places and have no doubt that it could be made clearer and more efficient - I welcome any suggestions you might have on how I can improve it!

R

R is a computer programming language developed for statistical computing and graphics https://www.r-project.org/about.html). R is an alternative to traditional statistical software packages such as SPSS, Stata and SAS in that it is an open source (https://en.wikipedia.org/wiki/Open-source_model) programming and software environment. Open source tools have many advantages over proprietary software, particularly for university students and researchers. Not least, open source means there is no direct financial cost associated with acquiring the tools, so anyone can download, use, copy, and change the code without paying.

As a researcher, working with open source tools means that published work can be reproduced more easily, as anyone can access the code and software used to conduct the research. Practical issues of cost and licencing issues aside, the philosophy of open source as a means of fostering collaborative, transparent and reproducible research is one which is spreading rapidly, and R is emerging as a popular open source tool for data science. Open source is therefore both “free as in speech and free as in beer.”

Originally designed as a tool for ‘traditional’ non-spatial statistics, it is only relatively recently that R code ‘packages’ have been developed for managing, mapping and analysing geographical information. The capabilities of R as a GIS are now being realised by more and more researchers and data scientists who, probably like me, have been attracted by the prospect of leveraging R’s powerful and flexible open source data processing environment for conducting spatial analysis.

For all my R work I use the excellent (and of course, open source!) RStudio integrated development environment (IDE) for writing and managing the code (https://www.rstudio.com/).

Data

Three main sources of data were used in this project:

Adjusted UK Indices of Multiple Deprivation (IMD) – In the UK, deprivation measures for individual countries (England, Scotland, Wales and Northern Ireland) are not directly comparable. This poses a problem when, like us, one needs to examine relative deprivation between sub-regions of the UK. Thankfully a new dataset has been produced by Abel et al (2016) who devised a method of adjusting the four different national measures of deprivation so that they could be meaningfully compared between across different parts of the UK. Abel et al’s paper can be found here: http://bmjopen.bmj.com/content/6/11/e012750.full. The adjusted IMD dataset from the research is available for download from the University of Bristol: https://data.bris.ac.uk/data/dataset/1ef3q32gybk001v77c1ifmty7x.

UK Lower Super Output Areas – The adjusted IMD is calculated for small areas called Lower Super Output Areas (LSOAs) which have an average population of 1500. This is non-spatial data that needs to be joined to spatial LSOA boundary (polygon) data for mapping and analysis in a GIS. Abel et al’s (2016) adjusted IMD requires pairing with the 2011 LSOA spatial boundary data, which can be downloaded from the UK Data Service: https://census.ukdataservice.ac.uk/use-data/guides/boundary-data

County Boundaries - Spatial data describing the boundaries of the study areas (counties and unitary authorities in England and Wales) in this project was based on Ordnance Survey BoundaryLine data: https://www.ordnancesurvey.co.uk/opendatadownload/products.html

Step 1 – Join the adjusted IMD data to the spatial LSOA data

Import the required code libraries (I use a mix of R code packages and functions - this can no doubt be streamlined):

library(rgdal)
library(rgeos)
library(tidyverse)
library(stringr)
library(raster)
library(spatialEco)
library(plyr)
library(dplyr)

Import the adjusted IMD data

#import adjusted IMD UK text file
df_imd <- as_tibble(read.delim("uk_imd_scores_data.txt"))

The table below shows the first 6 rows of the IMD data - there are 34,753 records (one for each LSOA) in total



Each row in the IMD describes attributes for a single LSOA. These include the original IMD scores and quintiles and the adjusted UK measures.

The following code manipulates the raw IMD table to remove any unwanted columns and formats the data ready for joining to LSOA polygons. I have tried to explain what’s happening using code comments.

#import adjusted IMD UK text file
df_imd <- as_tibble(read.delim("uk_imd_scores_data.txt"))
#remove unwanted columns to keep only 'uk_imd_england_score' and 'uk_imd_england_quintile'
df_imd <- df_imd[-c(3:6, 9:14)]
#remove text from 'uk_imd_england_quintile' column for analysis
df_imd$uk_imd_england_quintile <- 
  str_replace(df_imd$uk_imd_england_quintile, "1 - least deprived" ,"1")
df_imd$uk_imd_england_quintile <- 
  str_replace(df_imd$uk_imd_england_quintile, "5 - most deprived" ,"5")
#change uk_imd_england_quintile' column to numeric for analysis
df_imd$uk_imd_england_quintile <- as.integer(df_imd$uk_imd_england_quintile)
#shorten column names to avoid any problems when exporting to ESRI shapefile format
colnames(df_imd)[which(names(df_imd) == "uk_imd_england_score")] <- "IMD_SCORE"
colnames(df_imd)[which(names(df_imd) == "uk_imd_england_quintile")] <- "IMD_QUIN"
#change column name of LSOA code to match the LSOA code column in the LSOA polygon data
colnames(df_imd)[which(names(df_imd) == "area_code")] <- "LSOA11CD"

Next we import the spatial LSOA boundary data:

#import LSOA 2011 polygon shapefile
poly_lsoa <- raster::shapefile("LSOA_2011")

Examining the attributes of the LSOA data (here shown in QGIS), we can see that that it shares a common LSOA code attribute with the IMD data, meaning that the two sets of data can be joined.

LSOA polygons for Cheltenham

LSOA polygons for Cheltenham

So, let’s join the two sets of data:

#import LSOA 2011 polygon shapefile
poly_lsoa <- raster::shapefile("LSOA_2011")
#join LSOA polys and IMD 
poly_lsoa_imd <- merge(poly_lsoa, df_imd, by = "LSOA11CD")
#write poly_lsoa as shapefile for future use. rgdal::writeOGR was not...
#exporting all of the attributes to I reverted to raster::shapefile for data import and export
raster::shapefile(poly_lsoa_imd, 'poly_lsoa_imd', overwrite=TRUE)

Now we have the IMD data joined to our LSOA spatial polygons, we can map it (note I’ve used QGIS for the map outputs - posts on mapping with R to follow!)

LSOA polygons for Cheltenham showing relative deprivation mapped using adjusted UK IMD quintiles - deprivation increases from light to dark

LSOA polygons for Cheltenham showing relative deprivation mapped using adjusted UK IMD quintiles - deprivation increases from light to dark

Step 2 - Aggregate the IMD data to higher-level geographies

In our project we’re interested in examining the relative differences in deprivation between counties and unitary authorities, so we need to find a way of aggregating the IMD data to these higher-level geographies. There are two things to consider: 1) how to identify the LSOAs within each area of interest (county/UA); 2) how to conduct a meaningful analysis across larger areas with ranked IMD data.

The most straightforward approach would be to perform a further attribute join on the LSOA-IMD data using a LSOA to county look-up table (e.g.http://webarchive.nationalarchives.gov.uk/20160105160709/http://www.ons.gov.uk/ons/guide-method/geography/products/census/lookup/index.html). However, for various reasons in our project some of the counties and UAs have been manipulated and altered and so our areas are slightly different to the official local authority look-up tables.

We needed a spatial approach to selecting the LSOAs within our study areas, but using the LSOA polygons threw errors because the LSOA boundaries were not perfectly matched to some of the county boundaries. The approach taken was to calculate the centroids (geometric centre) of our LSOAs and use these points (with IMD data attached) to conduct the aggregated IMD analysis at local authority level.

The R code for doing this:

#generate centroids of the LSOA polygons 
pts <- rgeos::gCentroid(poly_lsoa_imd, byid=TRUE)
#perform spatial overlay of the centroids and LSOA polygons - now with IMD attribute data attached
pts_lsoa <- sp::over(pts, poly_lsoa_imd)
#combine 'pts' coordinate pairs with the attributes from the overlay result
res <- cbind(coordinates(pts), pts_lsoa)
#get the x,y coords from res data frame
xy <- res[,c(1,2)]
#convert to SpatialPointsDataFrame using the British National Grid Coordinate System
BNG = "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +datum=OSGB36 +units=m +no_defs+ellps=airy +towgs84=446.448,-125.157,542.060,0.1502,0.2470,0.8421,-20.4894"
pts_lsoa_imd <- SpatialPointsDataFrame(coords = xy,data = res, proj4string = CRS(BNG))
#write pts_lsoa_imd as shapefile for future use
raster::shapefile(pts_lsoa_imd, "pts_lsoa_imd", overwrite=TRUE)

An example of the output is shown below - LSOA polygons have been converted to points with the IMD attribute information data retained. Note: caution must be used with this approach the centroid of a highly irregular polygon can fall outside its own polygon boundary.

LSOA centroids for pembrokshire with IMD attribute data attached

LSOA centroids for pembrokshire with IMD attribute data attached



The next step is to perform another spatial overlay using the LSOA points and the study area polygons to attached the attribute information of the counties to the points.

#import local authorities polygon shapefile
poly_counties <- raster::shapefile("counties")
#set CRS for both point and poly layers before runing point.in.poly
#CRS were both subtly different (but geometrically identical) British National Grid CRS strings, but an error 
#was being thrown suggesting CRSs didn't match (though they did)- this step solved the issue
proj4string(pts_lsoa_imd) <- BNG
proj4string(poly_counties) <- BNG
#overlay the 'pts_lsoa_imd' iand 'poly_counties' to join county attributes to points 'pts_lsoa_imd'
pts_lsoa_imd_county <- spatialEco::point.in.poly(pts_lsoa_imd, poly_counties)
#export 'pts_lsoa_imd_county' SpatialPointsDataFrame as a shapefile for future use
raster::shapefile(pts_lsoa_imd_county, "pts_lsoa_imd_county", overwrite=TRUE)

Each LSOA-IMD point now has the name the local authority within which they fall attached as an attribute. The final section of code then manipulates the points data table by grouping the data by county name and adding columns for the: a) total number of LSOAs in each county; b) number of LSOAs in each IMD quintile and; c) the proportion of LSOAs in each county that are in the highest (most deprived) quintiles. The data table is then merged with a spatial data layer of county polygons.

#convert the points into a  tibble
tib <- as_tibble(pts_lsoa_imd_county)
#remove unwanted columns
tib <- dplyr::select(tib, -c(x, y,  LSOA11CD, LSOA11NM, ctry, IMD_SCORE, x.1, y.1))
#put a '0' to ensure columns are of numeric data type
tib[c("LSOA_COUNT", "Q_1_COUNT", "Q_2_COUNT", "Q_3_COUNT", "Q_4_COUNT", "Q_5_COUNT")] <-0
#group points by local authority
#add columns of total LSOA count, and count of number of LSOAs in each IMD quintile
tib2 <- tib %>%
dplyr::group_by(County_UA) %>%
  dplyr::summarise(
        LSOA_COUNT = n(), 
        Q_1_COUNT = length(which(IMD_QUIN == 1)),
        Q_2_COUNT = length(which(IMD_QUIN == 2)),
        Q_3_COUNT = length(which(IMD_QUIN == 3)),
        Q_4_COUNT = length(which(IMD_QUIN == 4)),
        Q_5_COUNT = length(which(IMD_QUIN == 5)))

#merge tibble with the spatial polygon counties data
df_final <- merge(poly_counties, tib2, by = 'County_UA')

#calculate the proportion of LSOAs in the highest (most deprived) quintile in each local authority
df_final$PROP_Q5 <- df_final$Q_5_COUNT/df_final$LSOA_COUNT * 100
#calculate the proportion of LSOAs in the upper two quintiles in each local authority
df_final$COUNT_Q4_5 <- df_final$Q_5_COUNT + df_final$Q_4_COUNT
df_final$PROP_Q4_5 <- df_final$COUNT_Q4_5/df_final$LSOA_COUNT * 100

#export data as shapefile
raster::shapefile(df_final, "Counties_Deprivation", overwrite = TRUE)





Step 3 - Map the data

The final aggregated deprivation data for each county can then be mapped. Note that raw IMD scores are only relative, ranked measures and so cannot be meaningfully averaged across aggregated areas. Only results for counties included in our agricultural resilience research project are shown below.