Going Postal: Building link tables between postcodes and weather station locations

A word from our

Objective

To take a table of all the postcodes in the UK, and work out which of 31 weather stations is the closest to supply weather data to them

Workflow Breakdown

String manipulation and data reduction

Objective

To take lat-lon data for every postcode in the UK and create an average lat-lon value for the postcodes per sector

lat-lon stands for ‘Latitude’ and ‘Longitude’, which refer to the 2 dimensional co-ordinate system for mapping a location on the surface of the earth.

Some libs and data

data.table makes data objects with SQLish behaviors
stringr handles string manipulation tasks

library(data.table)
library(stringr)

ukpostcodes <- data.table::fread("./Raw_Data/ukpostcodes.csv",
                                 verbose = FALSE,
                                 showProgress = FALSE)
ukpostcodes[, .N] # Number of rows in the table

## [1] 1812402

ukpostcodes[1:3] # First three rows

##    id postcode latitude longitude
## 1:  1 AB10 1XG 57.14417 -2.114848
## 2:  2 AB10 6RN 57.13788 -2.121487
## 3:  3 AB10 7JB 57.12427 -2.127190

Some data

stringr::word() separates the postcodes on the space, returning each part, with no leading or trailing space
substr() takes the first character of inward and attaches it to the end of outward to make the ‘sector’ level

ukpostcodes[, outward := stringr::word(postcode, 1)] 
ukpostcodes[, inward := stringr::word(postcode, 2)]
ukpostcodes[, sector := paste(outward, substr(inward, 1, 1))]
ukpostcodes[1:3]

##    id postcode latitude longitude outward inward sector
## 1:  1 AB10 1XG 57.14417 -2.114848    AB10    1XG AB10 1
## 2:  2 AB10 6RN 57.13788 -2.121487    AB10    6RN AB10 6
## 3:  3 AB10 7JB 57.12427 -2.127190    AB10    7JB AB10 7

:= assigns a new column in the data.table

Some munging

Some sectors are known to not be ‘geographical’ and are effectively proxies, so they should be removed

! is equivalent to ‘NOT’
%in% is a keyword for data.tables which searches for every instance of the criteria in the relevant column

nonGeo <- c("AB99", "BT58", "CA99", "CM92", "CM98", "CR44", "CR90", "GIR", 
            "IM99", "IV99", "JE5", "M61", "ME99", "N1C", "N81", "NR99", 
            "NW26", "PA80", "PE99", "RH77", "SL60", "SO97", "SW95", "SY99", 
            "WD99", "WF90")
ukpostcodesSub <- ukpostcodes[!outward %in% nonGeo]

Sector ‘Average’ location

sectorPostcodes <- ukpostcodesSub[, .(sector_latitude = mean(latitude), 
                                      sector_longitude = mean(longitude)), 
                                  by = sector] # by is a catagorical grouping
sectorPostcodes[, .N]

## [1] 11201

sectorPostcodes[1:3]

##    sector sector_latitude sector_longitude
## 1: AB10 1        57.14435        -2.106600
## 2: AB10 6        57.13620        -2.120432
## 3: AB10 7        57.12520        -2.131890

Using the average location of each sector is a data reduction reduction exercise. Decreasing 1811720 records to only 11201. It would be computationally more challenging to operate on the original data set length.

Calculating the distance

Objective

To calculate the distance between each postcode sector and each weather station and identify which weather station is the closest to each sector

Some more libs and data

geosphere contains functions to calculate distance matrices

library(data.table)
library(geosphere)

postcodesSector <- data.table::fread("./Processed_Data/Postcodes by sector.csv", verbose = FALSE, showProgress = FALSE)
stationLocation <- data.table::fread("./Raw_Data/stationLocations.csv", verbose = FALSE, showProgress = FALSE)
setnames(stationLocation, old = c("lat", "lon"), new = c("landstat_lat", "landstat_lon"))
stationLocation[1:3]

landstat_id landstat_name_wmo landstat_lat landstat_lon 1: A Inverness 57.53 -4.05 2: B Aberdeen 57.20 -2.21 3: C Glasgow 55.91 -4.53

Calculating the distance matrix

distm() calculates a distance matrix from two tables with latitutde and longitude

The Vicenty Ellipse method is the most accurate, and computationally intensive, however it can be completed in around a minute on reasonable domestic hardware.

postcodeDistances <- 
  geosphere::distm(x = postcodesSector[,.(sector_longitude,sector_latitude)], 
                   y = stationLocation[,.(landstat_lon,landstat_lat)], 
                   fun=distVincentyEllipsoid)
head(postcodeDistances, n=1)

##          [,1]     [,2]   [,3]     [,4]   [,5]   [,6]     [,7]     [,8]
## [1,] 124660.6 8805.122 202792 153409.2 252219 236952 361035.5 460331.8
##          [,9]    [,10]  [,11]    [,12]  [,13]    [,14]    [,15]    [,16]
## [1,] 441543.8 446008.2 423539 421661.8 463486 372308.7 379416.7 486027.8
##         [,17]    [,18]    [,19]  [,20]    [,21]    [,22]    [,23]  [,24]
## [1,] 519922.4 557098.7 593165.6 632265 626804.2 644138.7 678321.4 639603
##         [,25]  [,26]    [,27]    [,28]    [,29]    [,30]    [,31]
## [1,] 682755.4 767698 683676.9 519358.4 413613.7 521583.4 696646.6

In postcodeDistances, each column is a different station and each row is a different postcode

Selecting minimum distance

melt() converts wide data structure to long

postcodeDistances_melt <- data.table(melt(postcodeDistances))
setnames(postcodeDistances_melt, old = c("Var1", "Var2", "value"), 
         new = c("postcodeRow", "stationRowMelt", "Distance"))
postcodeDistances_melt[1:2]

##    postcodeRow stationRowMelt Distance
## 1:           1              1 124660.6
## 2:           2              1 124208.4

which.min() returns location of minimum distance

postcodeStation_link <- 
  postcodeDistances_melt[, .(stationRow=which.min(Distance)), 
                         by = "postcodeRow"]
postcodeStation_link[, .N]

## [1] 11201

Index by .I

.I is a data.table argument which is equivalent to seq_len(nrow(x)).

postcodesSector[, postcodeRow := .I]
stationLocation[, stationRow := .I]

This effectively gives us a counter for each row of stationLocation which directly correlates with the columns of postcodeDistances, and therefore also with stationRow in postcodesSector.

Join by on

So we can now merge the information from stationLocation onto postcodesSector using stationRow as an index.

on= specifies key to join on

postcodeStation <- postcodesSector[postcodeStation_link, on ="postcodeRow"]
postcodeStation <- stationLocation[postcodeStation, on="stationRow"]
postcodeStation[1:3]

##    landstat_id landstat_name_wmo landstat_lat landstat_lon stationRow
## 1:           B          Aberdeen         57.2        -2.21          2
## 2:           B          Aberdeen         57.2        -2.21          2
## 3:           B          Aberdeen         57.2        -2.21          2
##    sector sector_latitude sector_longitude postcodeRow
## 1: AB10 1        57.14435        -2.106600           1
## 2: AB10 6        57.13620        -2.120432           2
## 3: AB10 7        57.12520        -2.131890           3

Mapping the groupings

Objective

Plot each sector on a map of the UK and identify which weather station it belongs to

Moar libs and data

Use ggmap to plot data onto maps

library(ggplot2)
library(ggmap)
library(data.table)

postcodeStation_match <- fread("./Processed_Data/Sector by nearest station.csv")
stationLocation <- fread("./Raw_Data/stationLocations.csv")

Calculating the geographical bounding box

If we know what the geographical extent of the data is…

myLocation <- c(postcodeStation_match[, round(min(sector_longitude)) - 0.5],
                postcodeStation_match[, round(min(sector_latitude)) - 0.5],
                postcodeStation_match[, round(max(sector_longitude)) + 0.5],
                postcodeStation_match[, round(max(sector_latitude)) + 0.5])

Source map tiles

… we can use ggmap functions to pull map tiles from various services

myMap <- get_map(location=myLocation, 
                 source="stamen", 
                 zoom = 5, 
                 maptype = "toner-lite", 
                 messaging = FALSE)

## Map from URL : http://tile.stamen.com/toner-lite/5/15/9.png

## Map from URL : http://tile.stamen.com/toner-lite/5/16/9.png

## Map from URL : http://tile.stamen.com/toner-lite/5/15/10.png

## Map from URL : http://tile.stamen.com/toner-lite/5/16/10.png

## Map from URL : http://tile.stamen.com/toner-lite/5/15/11.png

## Map from URL : http://tile.stamen.com/toner-lite/5/16/11.png

Plot points - code

ggmap syntax is based on “The Grammar of Graphics”, just like ggplot2

Plot a layer with geom_something()
Add layers with +

pointMap <- ggmap(myMap) +
  geom_point(data = postcodeStation_match,
             aes(x = sector_longitude, 
                 y = sector_latitude, 
                 colour = landstat_name_wmo)) +
  geom_point(data = stationLocation,
             aes(x = lon, 
                 y = lat), 
             colour = "black") +
  theme(legend.position = "right")

Plot points - map

Boundary polygons

chull() calculates outer boundary of each group

postcodeStation_chull <- 
  postcodeStation_match[, .SD[chull(sector_latitude, sector_longitude)], 
                        by = landstat_name_wmo]
postcodeStation_chull[1:3]

##    landstat_name_wmo landstat_id landstat_lat landstat_lon stationRow
## 1:     Aberdeen/Dyce        3091         57.2        -2.21          2
## 2:     Aberdeen/Dyce        3091         57.2        -2.21          2
## 3:     Aberdeen/Dyce        3091         57.2        -2.21          2
##    sector sector_latitude sector_longitude postcodeRow
## 1: AB56 5        57.64728        -2.947899       10108
## 2: AB35 5        57.03842        -3.149083       10158
## 3:  DD8 5        56.66796        -3.042808        8629

postcodeStation_chull[, .N]

## [1] 447

Plot boundary - code

geom_polygon() is a layer that plots area rather than points

polyMap <- ggmap(myMap) +
  geom_polygon(data = postcodeStation_chull,
             aes(x = sector_longitude, 
                 y = sector_latitude, 
                 colour = landstat_name_wmo, 
                 fill = landstat_name_wmo),
             alpha = 0.5)+
  geom_point(data = stationLocation,
             aes(x = lon, 
                 y = lat), 
             colour = "black") +
  theme(legend.position = "right")

Plot boundary - map

Conclusions

Workflow Solution

Feedback

Me

David Parr

Strategic Analytics Manager, RUMM

David.Parr@RUMM.co.uk

Twitter: @Biomimicron

Github: DaveRGP

Going Postal: Building link tables between postcodes and weather station locations

A word from our

Sponsors

Task

Objective

Workflow Breakdown

String manipulation and data reduction

Objective

Some libs and data

Some data

Some munging

Sector ‘Average’ location

Calculating the distance

Objective

Some more libs and data

Calculating the distance matrix

Selecting minimum distance

Index by .I

Join by on

Mapping the groupings

Objective

Moar libs and data

Calculating the geographical bounding box

Source map tiles

Plot points - code

Plot points - map

Boundary polygons

Plot boundary - code

Plot boundary - map

Conclusions

Workflow Solution

Feedback

Me