Going Postal: Building link tables between postcodes and weather station locations

A word from our

Sponsors

sponsors

Task

Objective

To take a table of all the postcodes in the UK, and work out which of 31 weather stations is the closest to supply weather data to them

Workflow Breakdown

Workflow Breakdown

String manipulation and data reduction

Objective

To take lat-lon data for every postcode in the UK and create an average lat-lon value for the postcodes per sector

lat-lon stands for ‘Latitude’ and ‘Longitude’, which refer to the 2 dimensional co-ordinate system for mapping a location on the surface of the earth.

Some libs and data

  • data.table makes data objects with SQLish behaviors
  • stringr handles string manipulation tasks
library(data.table)
library(stringr)

ukpostcodes <- data.table::fread("./Raw_Data/ukpostcodes.csv",
                                 verbose = FALSE,
                                 showProgress = FALSE)
ukpostcodes[, .N] # Number of rows in the table
## [1] 1812402
ukpostcodes[1:3] # First three rows
##    id postcode latitude longitude
## 1:  1 AB10 1XG 57.14417 -2.114848
## 2:  2 AB10 6RN 57.13788 -2.121487
## 3:  3 AB10 7JB 57.12427 -2.127190

Some data

  • stringr::word() separates the postcodes on the space, returning each part, with no leading or trailing space
  • substr() takes the first character of inward and attaches it to the end of outward to make the ‘sector’ level
ukpostcodes[, outward := stringr::word(postcode, 1)] 
ukpostcodes[, inward := stringr::word(postcode, 2)]
ukpostcodes[, sector := paste(outward, substr(inward, 1, 1))]
ukpostcodes[1:3]
##    id postcode latitude longitude outward inward sector
## 1:  1 AB10 1XG 57.14417 -2.114848    AB10    1XG AB10 1
## 2:  2 AB10 6RN 57.13788 -2.121487    AB10    6RN AB10 6
## 3:  3 AB10 7JB 57.12427 -2.127190    AB10    7JB AB10 7

:= assigns a new column in the data.table

Some munging

Some sectors are known to not be ‘geographical’ and are effectively proxies, so they should be removed

  • ! is equivalent to ‘NOT’
  • %in% is a keyword for data.tables which searches for every instance of the criteria in the relevant column
nonGeo <- c("AB99", "BT58", "CA99", "CM92", "CM98", "CR44", "CR90", "GIR", 
            "IM99", "IV99", "JE5", "M61", "ME99", "N1C", "N81", "NR99", 
            "NW26", "PA80", "PE99", "RH77", "SL60", "SO97", "SW95", "SY99", 
            "WD99", "WF90")
ukpostcodesSub <- ukpostcodes[!outward %in% nonGeo]

Sector ‘Average’ location

sectorPostcodes <- ukpostcodesSub[, .(sector_latitude = mean(latitude), 
                                      sector_longitude = mean(longitude)), 
                                  by = sector] # by is a catagorical grouping
sectorPostcodes[, .N]
## [1] 11201
sectorPostcodes[1:3]
##    sector sector_latitude sector_longitude
## 1: AB10 1        57.14435        -2.106600
## 2: AB10 6        57.13620        -2.120432
## 3: AB10 7        57.12520        -2.131890

Using the average location of each sector is a data reduction reduction exercise. Decreasing 1811720 records to only 11201. It would be computationally more challenging to operate on the original data set length.

Calculating the distance

Objective

To calculate the distance between each postcode sector and each weather station and identify which weather station is the closest to each sector

Some more libs and data

  • geosphere contains functions to calculate distance matrices
library(data.table)
library(geosphere)

postcodesSector <- data.table::fread("./Processed_Data/Postcodes by sector.csv", verbose = FALSE, showProgress = FALSE)
stationLocation <- data.table::fread("./Raw_Data/stationLocations.csv", verbose = FALSE, showProgress = FALSE)
setnames(stationLocation, old = c("lat", "lon"), new = c("landstat_lat", "landstat_lon"))
stationLocation[1:3]

landstat_id landstat_name_wmo landstat_lat landstat_lon 1: A Inverness 57.53 -4.05 2: B Aberdeen 57.20 -2.21 3: C Glasgow 55.91 -4.53

Calculating the distance matrix

  • distm() calculates a distance matrix from two tables with latitutde and longitude

The Vicenty Ellipse method is the most accurate, and computationally intensive, however it can be completed in around a minute on reasonable domestic hardware.

postcodeDistances <- 
  geosphere::distm(x = postcodesSector[,.(sector_longitude,sector_latitude)], 
                   y = stationLocation[,.(landstat_lon,landstat_lat)], 
                   fun=distVincentyEllipsoid)
head(postcodeDistances, n=1)
##          [,1]     [,2]   [,3]     [,4]   [,5]   [,6]     [,7]     [,8]
## [1,] 124660.6 8805.122 202792 153409.2 252219 236952 361035.5 460331.8
##          [,9]    [,10]  [,11]    [,12]  [,13]    [,14]    [,15]    [,16]
## [1,] 441543.8 446008.2 423539 421661.8 463486 372308.7 379416.7 486027.8
##         [,17]    [,18]    [,19]  [,20]    [,21]    [,22]    [,23]  [,24]
## [1,] 519922.4 557098.7 593165.6 632265 626804.2 644138.7 678321.4 639603
##         [,25]  [,26]    [,27]    [,28]    [,29]    [,30]    [,31]
## [1,] 682755.4 767698 683676.9 519358.4 413613.7 521583.4 696646.6

In postcodeDistances, each column is a different station and each row is a different postcode

Selecting minimum distance

  • melt() converts wide data structure to long
postcodeDistances_melt <- data.table(melt(postcodeDistances))
setnames(postcodeDistances_melt, old = c("Var1", "Var2", "value"), 
         new = c("postcodeRow", "stationRowMelt", "Distance"))
postcodeDistances_melt[1:2]
##    postcodeRow stationRowMelt Distance
## 1:           1              1 124660.6
## 2:           2              1 124208.4
  • which.min() returns location of minimum distance
postcodeStation_link <- 
  postcodeDistances_melt[, .(stationRow=which.min(Distance)), 
                         by = "postcodeRow"]
postcodeStation_link[, .N]
## [1] 11201

Index by .I

  • .I is a data.table argument which is equivalent to seq_len(nrow(x)).
postcodesSector[, postcodeRow := .I]
stationLocation[, stationRow := .I]

This effectively gives us a counter for each row of stationLocation which directly correlates with the columns of postcodeDistances, and therefore also with stationRow in postcodesSector.

Join by on

So we can now merge the information from stationLocation onto postcodesSector using stationRow as an index.

  • on= specifies key to join on
postcodeStation <- postcodesSector[postcodeStation_link, on ="postcodeRow"]
postcodeStation <- stationLocation[postcodeStation, on="stationRow"]
postcodeStation[1:3]
##    landstat_id landstat_name_wmo landstat_lat landstat_lon stationRow
## 1:           B          Aberdeen         57.2        -2.21          2
## 2:           B          Aberdeen         57.2        -2.21          2
## 3:           B          Aberdeen         57.2        -2.21          2
##    sector sector_latitude sector_longitude postcodeRow
## 1: AB10 1        57.14435        -2.106600           1
## 2: AB10 6        57.13620        -2.120432           2
## 3: AB10 7        57.12520        -2.131890           3

Mapping the groupings

Objective

Plot each sector on a map of the UK and identify which weather station it belongs to

Moar libs and data

  • Use ggmap to plot data onto maps
library(ggplot2)
library(ggmap)
library(data.table)

postcodeStation_match <- fread("./Processed_Data/Sector by nearest station.csv")
stationLocation <- fread("./Raw_Data/stationLocations.csv")

Calculating the geographical bounding box

If we know what the geographical extent of the data is…

myLocation <- c(postcodeStation_match[, round(min(sector_longitude)) - 0.5],
                postcodeStation_match[, round(min(sector_latitude)) - 0.5],
                postcodeStation_match[, round(max(sector_longitude)) + 0.5],
                postcodeStation_match[, round(max(sector_latitude)) + 0.5])

Source map tiles

… we can use ggmap functions to pull map tiles from various services

myMap <- get_map(location=myLocation, 
                 source="stamen", 
                 zoom = 5, 
                 maptype = "toner-lite", 
                 messaging = FALSE)
## Map from URL : http://tile.stamen.com/toner-lite/5/15/9.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/9.png
## Map from URL : http://tile.stamen.com/toner-lite/5/15/10.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/10.png
## Map from URL : http://tile.stamen.com/toner-lite/5/15/11.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/11.png

Plot points - code

ggmap syntax is based on “The Grammar of Graphics”, just like ggplot2

  • Plot a layer with geom_something()
  • Add layers with +
pointMap <- ggmap(myMap) +
  geom_point(data = postcodeStation_match,
             aes(x = sector_longitude, 
                 y = sector_latitude, 
                 colour = landstat_name_wmo)) +
  geom_point(data = stationLocation,
             aes(x = lon, 
                 y = lat), 
             colour = "black") +
  theme(legend.position = "right")

Plot points - map

Boundary polygons

  • chull() calculates outer boundary of each group
postcodeStation_chull <- 
  postcodeStation_match[, .SD[chull(sector_latitude, sector_longitude)], 
                        by = landstat_name_wmo]
postcodeStation_chull[1:3]
##    landstat_name_wmo landstat_id landstat_lat landstat_lon stationRow
## 1:     Aberdeen/Dyce        3091         57.2        -2.21          2
## 2:     Aberdeen/Dyce        3091         57.2        -2.21          2
## 3:     Aberdeen/Dyce        3091         57.2        -2.21          2
##    sector sector_latitude sector_longitude postcodeRow
## 1: AB56 5        57.64728        -2.947899       10108
## 2: AB35 5        57.03842        -3.149083       10158
## 3:  DD8 5        56.66796        -3.042808        8629
postcodeStation_chull[, .N]
## [1] 447

Plot boundary - code

  • geom_polygon() is a layer that plots area rather than points
polyMap <- ggmap(myMap) +
  geom_polygon(data = postcodeStation_chull,
             aes(x = sector_longitude, 
                 y = sector_latitude, 
                 colour = landstat_name_wmo, 
                 fill = landstat_name_wmo),
             alpha = 0.5)+
  geom_point(data = stationLocation,
             aes(x = lon, 
                 y = lat), 
             colour = "black") +
  theme(legend.position = "right")

Plot boundary - map

Conclusions

Workflow Solution

Workflow Solution

Feedback

Feedback

Me

David Parr

Strategic Analytics Manager, RUMM

David.Parr@RUMM.co.uk

Twitter: @Biomimicron

Github: DaveRGP