This script shows how to create a SQLite database from the UK's Office for National Statistics Postcode Directory (ONSPD). This is useful because it allows postcodes to be mapped to the various administrative areas of the UK, so that statistics (such as deprivation measures and population projections) can then be matched to those areas.
This is something I have to do every so often. I used to download the zipped CSV file in two parts and then load both into an Access database. This was a pain: you had to deal with big files, set the data types manually and do all manner of other fiddling. It always seemed to go wrong five times before it would work.
So this time I decided to do it in a more sensible fashion: write an R script that loads everything into a SQLite database, which I can simply re-run whenever the ONS updates the data. It also adds all the indexes I need, which is useful!
I haven't done it in this example, but you could point this script directly at the ONS website to handle downloading the files too.
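Something along these lines would do it; a minimal sketch, assuming a placeholder URL (check the geoportal for the real download link for the current edition):
# Download the zipped ONSPD directly from the ONS geoportal
# NB: this URL is a hypothetical placeholder, not a real/stable link
onspd_url <- "http://geoportal.statistics.gov.uk/path/to/ONSPD_MAY_2017_UK.zip"
download.file(onspd_url, destfile = "data\\ONSPD_MAY_2017_UK.zip", mode = "wb")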
Onward!
We use a couple of packages for handling SQLite databases in R, and get Hadley Wickham's helper packages up and running too.
The ONS is annoyingly inconsistent with file formats: sometimes you get a CSV, other times you get Excel workbooks (.xls or .xlsx), so we need Hadley's readxl package as well.
library("RSQLite") # To create/manipulate the SQLite database
library("tidyverse")
library("readxl")
theme_set(theme_bw()) # Set the ggplot theme
I've already downloaded a CSV file from the ONS geoportal which contains every postcode in the country. I have left it zipped for now as it's over 1 GB in size unzipped. You can get it from here: http://geoportal.statistics.gov.uk/datasets?q=ONS%20Postcode%20Directory%20(ONSPD)&sort=name
We'll start by unzipping it into a temporary directory and then loading it straight into a SQLite database. Putting it into a database will make things easier, and a lot faster once I add indexes later on.
First we unzip it all into a temp directory.
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
file <- "data\\ONSPD_MAY_2017_UK.zip"
# Unzip the ONS Postcode Data file
unzip(file, exdir = tmpdir )
# Create a path pointing at the unzipped csv file
ONSPD_path <- paste0(tmpdir,"\\ONSPD_MAY_2017_UK.csv")
Now we load the data into a SQLite database via R. A good tutorial on loading data into SQLite using R can be found here: https://www.r-bloggers.com/r-and-sqlite-part-1/
I show three different methods below, but currently my favoured method is to load the data using the sqldf package's read.csv.sql function.
# First, create a SQLite database
# db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
# Now load the data into our SQLite database
# This method from RSQLite leaves quotes in the data (grr)
# dbWriteTable(conn = db_connection,
#              name = "ONS_PD",
#              value = ONSPD_path,
#              row.names = FALSE,
#              header = TRUE,
#              overwrite = TRUE
#              )
# ----------------------------------------------------------------------------------------------------
# This is the sqldf package approach
# Note the filter argument is used to remove double quotes in the csv file
# This requires Rtools to be installed on Windows (it provides tr.exe)
library("sqldf")
sqldf("attach 'ons_lkp_db' as new")
## data frame with 0 columns and 0 rows
db_connection <- dbConnect(SQLite(), dbname="ons_lkp_db")
read.csv.sql(ONSPD_path,
             sql = "CREATE TABLE ONS_PD AS SELECT * FROM file",
             dbname = "ons_lkp_db",
             filter = "tr.exe -d '^\"'"
             )
## data frame with 0 columns and 0 rows
# ----------------------------------------------------------------------------------------------------
# A third option is to load the data into R first (which means you have to be sure you have enough RAM)
# First import the csv into memory
# ONS_PD_import <- read.csv(ONSPD_path, stringsAsFactors=F)
# Then push it to the database
# dbWriteTable(conn = db_connection,
#              name = "ONS_PD",
#              value = ONS_PD_import,
#              row.names = FALSE,
#              header = TRUE,
#              overwrite = TRUE
#              )
#
# rm(ONS_PD_import)
# Check it loaded correctly
dbListTables(db_connection)
## [1] "ONS_PD"
dbListFields(db_connection, "ONS_PD")
## [1] "pcd" "pcd2" "pcds" "dointr" "doterm"
## [6] "oscty" "oslaua" "osward" "usertype" "oseast1m"
## [11] "osnrth1m" "osgrdind" "oshlthau" "hro" "ctry"
## [16] "gor" "streg" "pcon" "eer" "teclec"
## [21] "ttwa" "pct" "nuts" "psed" "cened"
## [26] "edind" "oshaprev" "lea" "oldha" "wardc91"
## [31] "wardo91" "ward98" "statsward" "oa01" "casward"
## [36] "park" "lsoa01" "msoa01" "ur01ind" "oac01"
## [41] "oldpct" "oa11" "lsoa11" "msoa11" "parish"
## [46] "wz11" "ccg" "bua11" "buasd11" "ru11ind"
## [51] "oac11" "lat" "long" "lep1" "lep2"
## [56] "pfa" "imd"
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds from ONS_PD LIMIT 5")
## pcd pcd2 pcds
## 1 AB1 0AA AB1 0AA AB1 0AA
## 2 AB1 0AB AB1 0AB AB1 0AB
## 3 AB1 0AD AB1 0AD AB1 0AD
## 4 AB1 0AE AB1 0AE AB1 0AE
## 5 AB1 0AF AB1 0AF AB1 0AF
Postcodes appear in a few different formats within the table, and in real-life datasets you generally see variation in the number of spaces. I tend to create a field with the spaces removed, so that you don't have to waste time checking how many spaces your postcodes contain. I'm going to create this spaceless postcode field using SQL.
# First create an empty field
dbGetQuery(db_connection,"ALTER TABLE ONS_PD ADD COLUMN pcd_spaceless NVARCHAR(15)")
## data frame with 0 columns and 0 rows
# Then fill it with the pcd field, with spaces replaced by a zero-length string
dbGetQuery(db_connection,"UPDATE ONS_PD SET pcd_spaceless = REPLACE(pcd,' ','')")
## data frame with 0 columns and 0 rows
# Eyeball it to check it worked
dbGetQuery(db_connection,"SELECT pcd, pcd2, pcds, pcd_spaceless from ONS_PD LIMIT 5")
## pcd pcd2 pcds pcd_spaceless
## 1 AB1 0AA AB1 0AA AB1 0AA AB10AA
## 2 AB1 0AB AB1 0AB AB1 0AB AB10AB
## 3 AB1 0AD AB1 0AD AB1 0AD AB10AD
## 4 AB1 0AE AB1 0AE AB1 0AE AB10AE
## 5 AB1 0AF AB1 0AF AB1 0AF AB10AF
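With the spaceless field in place, matching postcodes with erratic spacing is just a case of stripping the spaces from the incoming value first. For example (the messy postcode here is made up purely for illustration):
# Strip all spaces from an incoming postcode before matching on pcd_spaceless
messy_postcode <- "AB1   0AA"  # hypothetical input with erratic spacing
dbGetQuery(db_connection,
           paste0("SELECT pcd, oslaua, lsoa11 FROM ONS_PD WHERE pcd_spaceless = '",
                  gsub(" ", "", messy_postcode, fixed = TRUE), "'"))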
It's also a good idea to add a date column to record which edition of the ONSPD this database is built from.
# First create an empty field
dbGetQuery(db_connection,"ALTER TABLE ONS_PD ADD COLUMN table_edition DATE")
## data frame with 0 columns and 0 rows
# Then fill it with the edition date of this ONSPD release
dbGetQuery(db_connection,"UPDATE ONS_PD SET table_edition = date('2017-05-01')")
## data frame with 0 columns and 0 rows
# Eyeball it to check it worked
dbGetQuery(db_connection,"SELECT pcd, table_edition from ONS_PD LIMIT 5")
## pcd table_edition
## 1 AB1 0AA 2017-05-01
## 2 AB1 0AB 2017-05-01
## 3 AB1 0AD 2017-05-01
## 4 AB1 0AE 2017-05-01
## 5 AB1 0AF 2017-05-01
We'll want to index a few fields in the database to ensure that any joins we perform are fast. This comes at the expense of the time spent creating each index and the disk space needed to store it (plus longer write times, but we won't really be writing to this table again).
# Create an index on the pcd_spaceless field
dbGetQuery(db_connection,
           "CREATE INDEX index_pcd_spaceless ON ONS_PD(pcd_spaceless)")
## data frame with 0 columns and 0 rows
# Create an index on the lsoa11 field
dbGetQuery(db_connection,
           "CREATE INDEX index_lsoa11 ON ONS_PD(lsoa11)")
## data frame with 0 columns and 0 rows
# Create an index on the ccg field
dbGetQuery(db_connection,
           "CREATE INDEX index_ccg ON ONS_PD(ccg)")
## data frame with 0 columns and 0 rows
Now it's simple to make an LSOA11-to-CCG lookup table, which we'll use as a base for adding population data. This will allow easy assignment of population numbers to our data.
dbGetQuery(db_connection,
           "CREATE TABLE LSOA11_LKP AS
            SELECT lsoa11, ccg FROM ONS_PD GROUP BY lsoa11, ccg")
## data frame with 0 columns and 0 rows
# Check the new table looks ok
dbGetQuery(db_connection, "SELECT * FROM LSOA11_LKP LIMIT 20")
## lsoa11 ccg
## 1
## 2 95AA01S1 ZC020
## 3 95AA01S2 ZC020
## 4 95AA01S3 ZC020
## 5 95AA02W1 ZC020
## 6 95AA03W1 ZC020
## 7 95AA04W1 ZC020
## 8 95AA05W1 ZC020
## 9 95AA06S1 ZC020
## 10 95AA06S2 ZC020
## 11 95AA07W1 ZC020
## 12 95AA08W1 ZC020
## 13 95AA09W1 ZC020
## 14 95AA10W1 ZC020
## 15 95AA11S1 ZC020
## 16 95AA11S2 ZC020
## 17 95AA12W1 ZC020
## 18 95AA13S1 ZC020
## 19 95AA13S2 ZC020
## 20 95AA14W1 ZC020
# Create an index on the lsoa11 field
dbGetQuery(db_connection,
           "CREATE INDEX index_lsoa112 ON LSOA11_LKP(lsoa11)")
## data frame with 0 columns and 0 rows
# Create an index on the ccg field
dbGetQuery(db_connection,
           "CREATE INDEX index_ccg2 ON LSOA11_LKP(ccg)")
## data frame with 0 columns and 0 rows
Now I'll repeat what we did earlier to load the ONS population estimates. We have to take a slightly different approach because this dataset is released as an Excel workbook, which is where the readxl package comes in.
This data is available from: https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/lowersuperoutputareamidyearpopulationestimates
# Create a temporary directory
tmpdir <- tempdir()
# Set the file name
# The obtuse filename stands for "mid-2015 LSOA single year of age estimates"
file <- "data\\mid2015lsoasyoaestimates.zip"
# Unzip the population estimates file
unzip(file, exdir = tmpdir )
# Create a path pointing at the unzipped xls file
pop_estimates_path <- paste0(tmpdir,"\\SAPE18DT1-mid-2015-lsoa-syoa-estimates.xls")
# Now load the individual worksheets of interest
male_pop <- read_xls(pop_estimates_path, sheet="Mid-2015 Males", skip=3)
female_pop <- read_xls(pop_estimates_path, sheet="Mid-2015 Females", skip=3)
# It's necessary to clean these up and get them into a single 'tidy' dataframe
# First add a sex column and join the two together into a single dataframe
male_pop$sex <- "Male"
female_pop$sex <- "Female"
pop_estimates <- rbind(male_pop,female_pop)
# Now get rid of any fields we don't need
pop_estimates$`Area Names` <- NULL
pop_estimates$`All Ages` <- NULL
# Rename the code fields
colnames(pop_estimates)[1:2] <- c("lsoa11cd","lsoa11nm")
# Get rid of some rows we don't want which relate to higher order geographies
pop_estimates <- pop_estimates %>%
  filter(!is.na(lsoa11nm))
# Finally, we can pivot the data to make it loooong instead of wiiiide
pop_estimates <- pop_estimates %>%
  gather(age, population, 3:93) %>%
  mutate(age = as.numeric(age))
# And then upload it into our database
dbWriteTable(conn = db_connection,
             name = "POP_ESTIMATES",
             value = pop_estimates,
             row.names = FALSE)
# Index the lsoa column too
dbGetQuery(db_connection,
           "CREATE INDEX index_lsoa113 ON POP_ESTIMATES(lsoa11cd)")
## data frame with 0 columns and 0 rows
# Eyeball it to check everything looks ok
dbGetQuery(db_connection,"SELECT * from POP_ESTIMATES LIMIT 20")
## lsoa11cd lsoa11nm sex age population
## 1 E01020634 County Durham 001A Male 0 15
## 2 E01020635 County Durham 001B Male 0 3
## 3 E01020636 County Durham 001C Male 0 13
## 4 E01020654 County Durham 001D Male 0 9
## 5 E01020676 County Durham 001E Male 0 9
## 6 E01020613 County Durham 002A Male 0 6
## 7 E01020614 County Durham 002B Male 0 3
## 8 E01020622 County Durham 002C Male 0 7
## 9 E01020623 County Durham 002D Male 0 3
## 10 E01020627 County Durham 003A Male 0 5
## 11 E01020628 County Durham 003B Male 0 10
## 12 E01020629 County Durham 003C Male 0 10
## 13 E01020655 County Durham 003D Male 0 7
## 14 E01020656 County Durham 003E Male 0 3
## 15 E01020657 County Durham 003F Male 0 14
## 16 E01020661 County Durham 004A Male 0 20
## 17 E01020662 County Durham 004B Male 0 9
## 18 E01020663 County Durham 004C Male 0 12
## 19 E01020674 County Durham 004D Male 0 5
## 20 E01020678 County Durham 004E Male 0 6
Whilst it's very useful to have single year of age data, a broad adult/child categorisation is normally sufficient. Therefore I'll create a table with this information calculated and ready to go.
# Use dplyr to categorise, aggregate and pivot the data
pop_estimates_adult_child <- pop_estimates %>%
  mutate(age_cat = ifelse(age < 16, "less_than_16", "plus_16")) %>%
  group_by(lsoa11cd, lsoa11nm, age_cat) %>%
  summarise(population = sum(population)) %>%
  spread(key = age_cat, value = population)
# Upload it into our database
dbWriteTable(conn = db_connection,
             name = "POP_ESTIMATES_ADULT_CHILD",
             value = pop_estimates_adult_child,
             row.names = FALSE)
# Index the lsoa column
dbGetQuery(db_connection,
           "CREATE INDEX index_lsoa114 ON POP_ESTIMATES_ADULT_CHILD(lsoa11cd)")
## data frame with 0 columns and 0 rows
# Remove dataframes no longer needed from memory
rm(female_pop)
rm(male_pop)
rm(pop_estimates)
Now I’ll just run a quick test SQL query to show that things are working.
ccg_pops <- dbGetQuery(db_connection,"
  SELECT
    ccg,
    SUM(less_than_16) AS 'less_than_16',
    SUM(plus_16) AS 'plus_16',
    SUM(less_than_16 + plus_16) AS 'total_pop'
  FROM
    POP_ESTIMATES_ADULT_CHILD AS 'POP'
  LEFT OUTER JOIN
    LSOA11_LKP
  ON
    POP.lsoa11cd = LSOA11_LKP.lsoa11
  GROUP BY
    ccg
")
# A plot of the distribution of CCG populations
ggplot(ccg_pops, aes(total_pop)) + geom_histogram()
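One bit of housekeeping I haven't shown above: when you're finished with the database, it's good practice to close the connection using DBI's dbDisconnect():
# Close the connection to the SQLite database when finished
dbDisconnect(db_connection)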
I now have a SQLite database in my project's working directory which I can pick up and use whenever I have a query that involves ONS area codes or postcodes.
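Picking it up again from a future session is just a case of reconnecting, along these lines:
# Reconnect to the database in a later session
library("RSQLite")
db_connection <- dbConnect(SQLite(), dbname = "ons_lkp_db")
dbGetQuery(db_connection, "SELECT pcd, lsoa11, ccg FROM ONS_PD LIMIT 5")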
I’ll add to this in the future to include loading LSOA deprivation scores.
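In the meantime, note that the ONS_PD table already carries an imd field (the Index of Multiple Deprivation rank), so a very rough first look is possible now. Bear in mind the IMD is ranked separately for each UK country, so averaging across the whole table like this is only illustrative:
# Rough illustration only: mean IMD rank of the postcodes in each CCG
dbGetQuery(db_connection,
           "SELECT ccg, AVG(imd) AS mean_imd_rank FROM ONS_PD GROUP BY ccg LIMIT 5")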
Fin!