To take a table of all the postcodes in the UK, and work out which of 31 weather stations is the closest to supply weather data to them
To take lat-lon data for every postcode in the UK and create an average lat-lon value for the postcodes per sector
lat-lon stands for ‘Latitude’ and ‘Longitude’, which refer to the 2 dimensional co-ordinate system for mapping a location on the surface of the earth.
data.table makes data objects with SQLish behaviorsstringr handles string manipulation taskslibrary(data.table)
library(stringr)
ukpostcodes <- data.table::fread("./Raw_Data/ukpostcodes.csv",
verbose = FALSE,
showProgress = FALSE)
ukpostcodes[, .N] # Number of rows in the table## [1] 1812402
ukpostcodes[1:3] # First three rows## id postcode latitude longitude
## 1: 1 AB10 1XG 57.14417 -2.114848
## 2: 2 AB10 6RN 57.13788 -2.121487
## 3: 3 AB10 7JB 57.12427 -2.127190
stringr::word() separates the postcodes on the space, returning each part, with no leading or trailing spacesubstr() takes the first character of inward and attaches it to the end of outward to make the ‘sector’ levelukpostcodes[, outward := stringr::word(postcode, 1)]
ukpostcodes[, inward := stringr::word(postcode, 2)]
ukpostcodes[, sector := paste(outward, substr(inward, 1, 1))]
ukpostcodes[1:3]## id postcode latitude longitude outward inward sector
## 1: 1 AB10 1XG 57.14417 -2.114848 AB10 1XG AB10 1
## 2: 2 AB10 6RN 57.13788 -2.121487 AB10 6RN AB10 6
## 3: 3 AB10 7JB 57.12427 -2.127190 AB10 7JB AB10 7
:= assigns a new column in the data.table
Some sectors are known to not be ‘geographical’ and are effectively proxies, so they should be removed
! is equivalent to ‘NOT’%in% is a keyword for data.tables which searches for every instance of the criteria in the relevant columnnonGeo <- c("AB99", "BT58", "CA99", "CM92", "CM98", "CR44", "CR90", "GIR",
"IM99", "IV99", "JE5", "M61", "ME99", "N1C", "N81", "NR99",
"NW26", "PA80", "PE99", "RH77", "SL60", "SO97", "SW95", "SY99",
"WD99", "WF90")
ukpostcodesSub <- ukpostcodes[!outward %in% nonGeo]sectorPostcodes <- ukpostcodesSub[, .(sector_latitude = mean(latitude),
sector_longitude = mean(longitude)),
by = sector] # by is a catagorical grouping
sectorPostcodes[, .N]## [1] 11201
sectorPostcodes[1:3]## sector sector_latitude sector_longitude
## 1: AB10 1 57.14435 -2.106600
## 2: AB10 6 57.13620 -2.120432
## 3: AB10 7 57.12520 -2.131890
Using the average location of each sector is a data reduction reduction exercise. Decreasing 1811720 records to only 11201. It would be computationally more challenging to operate on the original data set length.
To calculate the distance between each postcode sector and each weather station and identify which weather station is the closest to each sector
geosphere contains functions to calculate distance matriceslibrary(data.table)
library(geosphere)
postcodesSector <- data.table::fread("./Processed_Data/Postcodes by sector.csv", verbose = FALSE, showProgress = FALSE)
stationLocation <- data.table::fread("./Raw_Data/stationLocations.csv", verbose = FALSE, showProgress = FALSE)
setnames(stationLocation, old = c("lat", "lon"), new = c("landstat_lat", "landstat_lon"))
stationLocation[1:3]landstat_id landstat_name_wmo landstat_lat landstat_lon 1: A Inverness 57.53 -4.05 2: B Aberdeen 57.20 -2.21 3: C Glasgow 55.91 -4.53
distm() calculates a distance matrix from two tables with latitutde and longitude The Vicenty Ellipse method is the most accurate, and computationally intensive, however it can be completed in around a minute on reasonable domestic hardware.
postcodeDistances <-
geosphere::distm(x = postcodesSector[,.(sector_longitude,sector_latitude)],
y = stationLocation[,.(landstat_lon,landstat_lat)],
fun=distVincentyEllipsoid)
head(postcodeDistances, n=1)## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 124660.6 8805.122 202792 153409.2 252219 236952 361035.5 460331.8
## [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
## [1,] 441543.8 446008.2 423539 421661.8 463486 372308.7 379416.7 486027.8
## [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## [1,] 519922.4 557098.7 593165.6 632265 626804.2 644138.7 678321.4 639603
## [,25] [,26] [,27] [,28] [,29] [,30] [,31]
## [1,] 682755.4 767698 683676.9 519358.4 413613.7 521583.4 696646.6
In postcodeDistances, each column is a different station and each row is a different postcode
melt() converts wide data structure to longpostcodeDistances_melt <- data.table(melt(postcodeDistances))
setnames(postcodeDistances_melt, old = c("Var1", "Var2", "value"),
new = c("postcodeRow", "stationRowMelt", "Distance"))
postcodeDistances_melt[1:2]## postcodeRow stationRowMelt Distance
## 1: 1 1 124660.6
## 2: 2 1 124208.4
which.min() returns location of minimum distancepostcodeStation_link <-
postcodeDistances_melt[, .(stationRow=which.min(Distance)),
by = "postcodeRow"]
postcodeStation_link[, .N]## [1] 11201
.I is a data.table argument which is equivalent to seq_len(nrow(x)).postcodesSector[, postcodeRow := .I]
stationLocation[, stationRow := .I]This effectively gives us a counter for each row of stationLocation which directly correlates with the columns of postcodeDistances, and therefore also with stationRow in postcodesSector.
So we can now merge the information from stationLocation onto postcodesSector using stationRow as an index.
on= specifies key to join onpostcodeStation <- postcodesSector[postcodeStation_link, on ="postcodeRow"]
postcodeStation <- stationLocation[postcodeStation, on="stationRow"]
postcodeStation[1:3]## landstat_id landstat_name_wmo landstat_lat landstat_lon stationRow
## 1: B Aberdeen 57.2 -2.21 2
## 2: B Aberdeen 57.2 -2.21 2
## 3: B Aberdeen 57.2 -2.21 2
## sector sector_latitude sector_longitude postcodeRow
## 1: AB10 1 57.14435 -2.106600 1
## 2: AB10 6 57.13620 -2.120432 2
## 3: AB10 7 57.12520 -2.131890 3
Plot each sector on a map of the UK and identify which weather station it belongs to
ggmap to plot data onto mapslibrary(ggplot2)
library(ggmap)
library(data.table)
postcodeStation_match <- fread("./Processed_Data/Sector by nearest station.csv")
stationLocation <- fread("./Raw_Data/stationLocations.csv")If we know what the geographical extent of the data is…
myLocation <- c(postcodeStation_match[, round(min(sector_longitude)) - 0.5],
postcodeStation_match[, round(min(sector_latitude)) - 0.5],
postcodeStation_match[, round(max(sector_longitude)) + 0.5],
postcodeStation_match[, round(max(sector_latitude)) + 0.5])… we can use ggmap functions to pull map tiles from various services
myMap <- get_map(location=myLocation,
source="stamen",
zoom = 5,
maptype = "toner-lite",
messaging = FALSE)## Map from URL : http://tile.stamen.com/toner-lite/5/15/9.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/9.png
## Map from URL : http://tile.stamen.com/toner-lite/5/15/10.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/10.png
## Map from URL : http://tile.stamen.com/toner-lite/5/15/11.png
## Map from URL : http://tile.stamen.com/toner-lite/5/16/11.png
ggmap syntax is based on “The Grammar of Graphics”, just like ggplot2
geom_something()+pointMap <- ggmap(myMap) +
geom_point(data = postcodeStation_match,
aes(x = sector_longitude,
y = sector_latitude,
colour = landstat_name_wmo)) +
geom_point(data = stationLocation,
aes(x = lon,
y = lat),
colour = "black") +
theme(legend.position = "right")chull() calculates outer boundary of each grouppostcodeStation_chull <-
postcodeStation_match[, .SD[chull(sector_latitude, sector_longitude)],
by = landstat_name_wmo]
postcodeStation_chull[1:3]## landstat_name_wmo landstat_id landstat_lat landstat_lon stationRow
## 1: Aberdeen/Dyce 3091 57.2 -2.21 2
## 2: Aberdeen/Dyce 3091 57.2 -2.21 2
## 3: Aberdeen/Dyce 3091 57.2 -2.21 2
## sector sector_latitude sector_longitude postcodeRow
## 1: AB56 5 57.64728 -2.947899 10108
## 2: AB35 5 57.03842 -3.149083 10158
## 3: DD8 5 56.66796 -3.042808 8629
postcodeStation_chull[, .N]## [1] 447
geom_polygon() is a layer that plots area rather than pointspolyMap <- ggmap(myMap) +
geom_polygon(data = postcodeStation_chull,
aes(x = sector_longitude,
y = sector_latitude,
colour = landstat_name_wmo,
fill = landstat_name_wmo),
alpha = 0.5)+
geom_point(data = stationLocation,
aes(x = lon,
y = lat),
colour = "black") +
theme(legend.position = "right")David Parr
Strategic Analytics Manager, RUMM
Twitter: @Biomimicron
Github: DaveRGP