This guide is intended to provide some advice to those wishing to use the 2016 Census of Population SAPS data with R.
The Census web pages at the Central Statistics Office website are to be found here. The Small Area Population Statistics are to be found here; look down the page for the section entitled ‘CSV Downloads’. If you want a copy of all the data (that is for all 11 sets of spatial units) then a zipfile is available here. We recommend a complete data download - then the files can be read directly from your computer’s hard drive rather than over the network. A glossary which describes the contents of each column is available as an Excel spreadsheet here; it also gives the column name for the 2011 data in cases where this has been changed for the 2016 release.
Boundary files are also available from the same download page at a variety of levels of generalisation (none, 20m, 50m, and 100m). The ungeneralised files are rather large, and for most practical purposes the 20m generalisation is adequate. The Enumeration Districts are the same set of 3409 that are used in the 2011 data release, but the Small Areas have been modified into 18641 polygons (some of the 2011 areas have been split, and other amalgamated). Clicking on a link takes you to a page at data.gov.ie from where you can download the data in a variety of formats. I have downloaded the data as shapefiles.
Note that the boundary data are stored in WGS84 geographical (latitude/longitude) coordinates. Details about converting these to another projection are detailed later on in these notes. However you’ll find the following EPSG codes useful:
| EPSG Code | Coordinate System |
|---|---|
| 4326 | WGS84 |
| 29903 | Irish National Grid (TM75) |
| 2157 | Irish National Grid (IRENET95) |
On my laptop I have stored the downloaded data files in a folder named SAPS2016_Complete. You can get a list of the files in a folder using the dir() function:
dir("./SAPS2016_Complete")
[1] "SAPS2016_CTY31.csv" "SAPS2016_DC2013.csv"
[3] "SAPS2016_ED3409.csv" "SAPS2016_GAEL.csv"
[5] "SAPS2016_LEA2014.csv" "SAPS2016_LPT2012.csv"
[7] "SAPS2016_MD2014.csv" "SAPS2016_NUTS3.csv"
[9] "SAPS2016_PROV.csv" "SAPS2016_SA2017.csv"
[11] "SAPS2016_ST2016.csv" "SAPS2016_STATE_TOTALS.csv"
We’ll start with the ED level dataset. To read the CSV file, determine its dimensions and look at the first few columns we use:
ED_SAPS <- read.csv("./SAPS2016_Complete/SAPS2016_ED3409.csv",stringsAsFactors=FALSE)
print(dim(ED_SAPS))
[1] 3409 807
There are 3409 rows of data, one for each ED, and 807 items of data in each row.
print(ED_SAPS[1:5,1:4])
GUID GEOGID GEOGDESC T1_1AGE0M
1 2AE19629185813A3E055000000000001 ED3409_01001 Carlow Urban 21
2 2AE196291A5913A3E055000000000001 ED3409_01002 Graigue Urban 6
3 2AE19629186413A3E055000000000001 ED3409_01003 Clonmore 2
4 2AE19629187F13A3E055000000000001 ED3409_01004 Hacketstown 9
5 2AE19629188713A3E055000000000001 ED3409_01005 Haroldstown 2
The ‘top left’ corner of the data reveals that the first three columns contain identification information.
| Name | Description |
|---|---|
| GUID | The OSi GUID - used to link to the boundary data |
| GEOGID | A modified version of the CSO GEOGID from the 2011 data |
| GEOGDESC | ED Name |
| T1_1AGE0M | Count of males aged 0-4 years. |
There is an issue with these data that affects the R user in particular.
Embedded Thousands Separators
Notice that those values which have embedded thousands separators are stored as character strings in the CSV files. When these are read by read.csv() they are converted as character variables in the data frame.
print(table(unlist(lapply(ED_SAPS,class))))
character integer
466 341
There should be no more than three columns which are character variables. The following function will identify those columns stored as characters, remove the commas, and convert them to integers.
convertCh <- function(x) {
ChCols <- which(unlist(lapply(x,class)) == 'character')
if (length(ChCols) > 3) {
for(i in 4:length(ChCols)) {
Col <- ChCols[i]
x[,Col] <- as.integer(gsub(",","",x[,Col]))
}
}
return(x)
}
So, to deal with the ED data:
ED_SAPS <- convertCh(ED_SAPS)
I have downloaded the boundaries for the 2016 EDs as a shapefile which is located in my Boundaries folder. The readOGR() function from the rgdal library will read these into a Spatial Polygons Data Frame.
library(rgdal)
ED_BND <- readOGR("./Boundaries",
"Electoral_Divisions__CSO_Generalised_20M",
stringsAsFactors=FALSE)
OGR data source with driver: ESRI Shapefile
Source: "./Boundaries", layer: "Electoral_Divisions__CSO_Generalised_20M"
with 3409 features
It has 15 fields
Integer64 fields read as strings: OBJECTID
print(proj4string(ED_BND))
[1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"
There are a few things to note.
First, there are two underscores between Divisions and CSO in the layer name. Second, the coordinates are in WGS84 - they are geographic coordinates in decimal degrees. You can check this by examining the east-west and north-south limits: the bounding box.
print(bbox(ED_BND))
min max
x -10.66265 -5.996385
y 51.41990 55.435136
If we wish to determine population density, or to create a cartogram, then we’ll the coordinates projected into a Cartesian coordinate system. The current national system is Irish Transverse Mercator. To re-project the coordinates we’ll need to specify a new coordinate system, and then use spTransform() to undertake the conversion. The spTransform() function is from the sp library which is loaded when rgdal os loaded.
ED_ITM <- spTransform(ED_BND,CRS("+init=epsg:2157"))
print(proj4string(ED_ITM))
[1] "+init=epsg:2157 +proj=tmerc +lat_0=53.5 +lon_0=-8 +k=0.99982 +x_0=600000 +y_0=750000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"
And a check on the bounding box confirms the conversion.
print(bbox(ED_ITM))
min max
x 417489.4 734475.1
y 519663.7 965634.0
This requires a table join. Are the SAPS data in the same order as the polygons in the boundary file? We know that the SAPS data has a GUID field; what is to be found in the data from the shapefile?
print(head(ED_ITM@data))
OBJECTID ED_ID ED_ENGLISH ED_GAEILGE COUNTY
0 3001 247006 AUGHWILLIAM <NA> WEXFORD
1 3002 267020 BLACKROCK-CARYSFORT <NA> DUBLIN
2 3003 97028 COOLAGHMORE An Chuailleach Mhór KILKENNY
3 3004 57129 RATHMULLAN <NA> DONEGAL
4 3005 77005 ARDAGH <NA> KERRY
5 3006 177001 AGHABOG Achadh Bog MONAGHAN
CONTAE PROVINCE CENTROID_X CENTROID_Y
0 Loch Garman Leinster 695180.287713613 616289.230043841
1 Baile Ãtha Cliath Leinster 721024.524290722 727734.553337762
2 Cill Chainnigh Leinster 640258.501531936 637693.505219255
3 Dún na nGall Ulster 628313.935273137 929172.005661047
4 Ciarraà Munster 483502.992050644 635898.045774622
5 Muineachán Ulster 659479.788431437 821863.816083927
GUID_ CSOED_3409 OSIED_3441
0 2AE196291BD413A3E055000000000001 14095 247006
1 2AE196291CBE13A3E055000000000001 05009 267020
2 2AE1962923CC13A3E055000000000001 07006 97028
3 2AE19629209713A3E055000000000001 33125 57129
4 2AE19629239213A3E055000000000001 19093 77005
5 2AE196291A6313A3E055000000000001 34034 177001
CSOED_34_1 Shape__Are Shape__Len
0 Aughwilliam 0.0020651031 0.27564590
1 Blackrock-Carysfort 0.0001761057 0.05950107
2 Coolaghmore 0.0034492240 0.31131458
3 Rathmullan 0.0025358525 0.28670939
4 Ardagh 0.0020464396 0.29314214
5 Aghabog 0.0029906620 0.30410749
The corresponding GUIDs are in a field name GUID_. The GEOGIDs are also present, but will require some manipulation to get them into something comparable with the GEOGIDs in the SAPS data. WE can re-order the SAPS data frame to correspond with the order of the polygons in the spatial data.
ED_reorder <- match(ED_ITM$GUID_,ED_SAPS$GUID)
print(head(ED_reorder))
[1] 1150 316 471 3315 1901 3373
So, row 1150 in the SAPS data corresponds with polygon 1 in the spatial data, row 316 with polygon 2, row 471 with polygon 3, and so on. To re-order the data, and check that the rows correspond:
ED_SAPS <- ED_SAPS[ED_reorder,]
print(ED_SAPS[1:5,1:3])
GUID GEOGID GEOGDESC
1150 2AE196291BD413A3E055000000000001 ED3409_14095 Aughwilliam
316 2AE196291CBE13A3E055000000000001 ED3409_05009 Blackrock-Carysfort
471 2AE1962923CC13A3E055000000000001 ED3409_07006 Coolaghmore
3315 2AE19629209713A3E055000000000001 ED3409_33125 Rathmullan
1901 2AE19629239213A3E055000000000001 ED3409_19093 Ardagh
print(ED_ITM@data[1:5,c(10,11,3)])
GUID_ CSOED_3409 ED_ENGLISH
0 2AE196291BD413A3E055000000000001 14095 AUGHWILLIAM
1 2AE196291CBE13A3E055000000000001 05009 BLACKROCK-CARYSFORT
2 2AE1962923CC13A3E055000000000001 07006 COOLAGHMORE
3 2AE19629209713A3E055000000000001 33125 RATHMULLAN
4 2AE19629239213A3E055000000000001 19093 ARDAGH
The names and GUIDs match. We have re-ordered the SAPS data correctly.
As an example, we can now compute the population density. The gArea() function from the rgeos library will return the area of each polygon in metres2. The log of the population density will reveal whether there are any discontinuities in the right hand tail.
library(rgeos)
To create a histogram of the ratio of persons per hectare, with a density estimate, we first extract the areas:
Areas <- gArea(ED_ITM,byid=TRUE)/10000
print(summary(Areas))
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.363 1343.000 1958.000 2061.000 2618.000 16320.000
LogDen <- log(ED_SAPS$T1_1AGETT / Areas)
hist(LogDen,freq=FALSE,n=100,border=NA,col="lightgrey",xlab="Log Persons/Ha",main="")
Den <- density(LogDen)
lines(Den$x,Den$y,col="red")
The local minimum between the two modes is at which(Den$y == min(Den$y[Den$x > 1 & Den$x < 2])) which corresponds to 4.0185987 persons per hectare. Above this value we can think of a ‘more urban’ and below it as ‘more rural’.
Version 1.0