Introduction

This guide is intended to provide some advice to those wishing to use the 2016 Census of Population SAPS data with R.

The Census web pages at the Central Statistics Office website are to be found here. The Small Area Population Statistics are to be found here; look down the page for the section entitled ‘CSV Downloads’. If you want a copy of all the data (that is for all 11 sets of spatial units) then a zipfile is available here. We recommend a complete data download - then the files can be read directly from your computer’s hard drive rather than over the network. A glossary which describes the contents of each column is available as an Excel spreadsheet here; it also gives the column name for the 2011 data in cases where this has been changed for the 2016 release.

Boundary files are also available from the same download page at a variety of levels of generalisation (none, 20m, 50m, and 100m). The ungeneralised files are rather large, and for most practical purposes the 20m generalisation is adequate. The Enumeration Districts are the same set of 3409 that are used in the 2011 data release, but the Small Areas have been modified into 18641 polygons (some of the 2011 areas have been split, and other amalgamated). Clicking on a link takes you to a page at data.gov.ie from where you can download the data in a variety of formats. I have downloaded the data as shapefiles.

Note that the boundary data are stored in WGS84 geographical (latitude/longitude) coordinates. Details about converting these to another projection are detailed later on in these notes. However you’ll find the following EPSG codes useful:

EPSG Code Coordinate System
4326 WGS84
29903 Irish National Grid (TM75)
2157 Irish National Grid (IRENET95)

Reading the data

On my laptop I have stored the downloaded data files in a folder named SAPS2016_Complete. You can get a list of the files in a folder using the dir() function:

dir("./SAPS2016_Complete")
 [1] "SAPS2016_CTY31.csv"        "SAPS2016_DC2013.csv"      
 [3] "SAPS2016_ED3409.csv"       "SAPS2016_GAEL.csv"        
 [5] "SAPS2016_LEA2014.csv"      "SAPS2016_LPT2012.csv"     
 [7] "SAPS2016_MD2014.csv"       "SAPS2016_NUTS3.csv"       
 [9] "SAPS2016_PROV.csv"         "SAPS2016_SA2017.csv"      
[11] "SAPS2016_ST2016.csv"       "SAPS2016_STATE_TOTALS.csv"

We’ll start with the ED level dataset. To read the CSV file, determine its dimensions and look at the first few columns we use:

ED_SAPS <- read.csv("./SAPS2016_Complete/SAPS2016_ED3409.csv",stringsAsFactors=FALSE)
print(dim(ED_SAPS))
[1] 3409  807

There are 3409 rows of data, one for each ED, and 807 items of data in each row.

print(ED_SAPS[1:5,1:4])
                              GUID       GEOGID      GEOGDESC T1_1AGE0M
1 2AE19629185813A3E055000000000001 ED3409_01001  Carlow Urban        21
2 2AE196291A5913A3E055000000000001 ED3409_01002 Graigue Urban         6
3 2AE19629186413A3E055000000000001 ED3409_01003      Clonmore         2
4 2AE19629187F13A3E055000000000001 ED3409_01004   Hacketstown         9
5 2AE19629188713A3E055000000000001 ED3409_01005   Haroldstown         2

The ‘top left’ corner of the data reveals that the first three columns contain identification information.

Name Description
GUID The OSi GUID - used to link to the boundary data
GEOGID A modified version of the CSO GEOGID from the 2011 data
GEOGDESC ED Name
T1_1AGE0M Count of males aged 0-4 years.

There is an issue with these data that affects the R user in particular.

Embedded Thousands Separators

Embedded Thousands Separators

Notice that those values which have embedded thousands separators are stored as character strings in the CSV files. When these are read by read.csv() they are converted as character variables in the data frame.

print(table(unlist(lapply(ED_SAPS,class))))

character   integer 
      466       341 

There should be no more than three columns which are character variables. The following function will identify those columns stored as characters, remove the commas, and convert them to integers.

convertCh <- function(x) {
   ChCols <- which(unlist(lapply(x,class)) == 'character')
   if (length(ChCols) > 3) {
      for(i in 4:length(ChCols)) {
         Col <- ChCols[i]
         x[,Col] <- as.integer(gsub(",","",x[,Col]))
      }
   }
   return(x)
}

So, to deal with the ED data:

ED_SAPS <- convertCh(ED_SAPS)

Boundary data

I have downloaded the boundaries for the 2016 EDs as a shapefile which is located in my Boundaries folder. The readOGR() function from the rgdal library will read these into a Spatial Polygons Data Frame.

library(rgdal)
ED_BND <- readOGR("./Boundaries",
                  "Electoral_Divisions__CSO_Generalised_20M",
                  stringsAsFactors=FALSE)
OGR data source with driver: ESRI Shapefile 
Source: "./Boundaries", layer: "Electoral_Divisions__CSO_Generalised_20M"
with 3409 features
It has 15 fields
Integer64 fields read as strings:  OBJECTID 
print(proj4string(ED_BND))
[1] "+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0"

There are a few things to note.

First, there are two underscores between Divisions and CSO in the layer name. Second, the coordinates are in WGS84 - they are geographic coordinates in decimal degrees. You can check this by examining the east-west and north-south limits: the bounding box.

print(bbox(ED_BND))
        min       max
x -10.66265 -5.996385
y  51.41990 55.435136

If we wish to determine population density, or to create a cartogram, then we’ll the coordinates projected into a Cartesian coordinate system. The current national system is Irish Transverse Mercator. To re-project the coordinates we’ll need to specify a new coordinate system, and then use spTransform() to undertake the conversion. The spTransform() function is from the sp library which is loaded when rgdal os loaded.

ED_ITM <- spTransform(ED_BND,CRS("+init=epsg:2157"))
print(proj4string(ED_ITM))
[1] "+init=epsg:2157 +proj=tmerc +lat_0=53.5 +lon_0=-8 +k=0.99982 +x_0=600000 +y_0=750000 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

And a check on the bounding box confirms the conversion.

print(bbox(ED_ITM))
       min      max
x 417489.4 734475.1
y 519663.7 965634.0

Joining the boundary and census data

This requires a table join. Are the SAPS data in the same order as the polygons in the boundary file? We know that the SAPS data has a GUID field; what is to be found in the data from the shapefile?

print(head(ED_ITM@data))
  OBJECTID  ED_ID          ED_ENGLISH           ED_GAEILGE   COUNTY
0     3001 247006         AUGHWILLIAM                 <NA>  WEXFORD
1     3002 267020 BLACKROCK-CARYSFORT                 <NA>   DUBLIN
2     3003  97028         COOLAGHMORE An Chuailleach Mhór KILKENNY
3     3004  57129          RATHMULLAN                 <NA>  DONEGAL
4     3005  77005              ARDAGH                 <NA>    KERRY
5     3006 177001             AGHABOG           Achadh Bog MONAGHAN
              CONTAE PROVINCE       CENTROID_X       CENTROID_Y
0        Loch Garman Leinster 695180.287713613 616289.230043841
1 Baile Átha Cliath Leinster 721024.524290722 727734.553337762
2     Cill Chainnigh Leinster 640258.501531936 637693.505219255
3      Dún na nGall   Ulster 628313.935273137 929172.005661047
4           Ciarraí  Munster 483502.992050644 635898.045774622
5        Muineachán   Ulster 659479.788431437 821863.816083927
                             GUID_ CSOED_3409 OSIED_3441
0 2AE196291BD413A3E055000000000001      14095     247006
1 2AE196291CBE13A3E055000000000001      05009     267020
2 2AE1962923CC13A3E055000000000001      07006      97028
3 2AE19629209713A3E055000000000001      33125      57129
4 2AE19629239213A3E055000000000001      19093      77005
5 2AE196291A6313A3E055000000000001      34034     177001
           CSOED_34_1   Shape__Are Shape__Len
0         Aughwilliam 0.0020651031 0.27564590
1 Blackrock-Carysfort 0.0001761057 0.05950107
2         Coolaghmore 0.0034492240 0.31131458
3          Rathmullan 0.0025358525 0.28670939
4              Ardagh 0.0020464396 0.29314214
5             Aghabog 0.0029906620 0.30410749

The corresponding GUIDs are in a field name GUID_. The GEOGIDs are also present, but will require some manipulation to get them into something comparable with the GEOGIDs in the SAPS data. WE can re-order the SAPS data frame to correspond with the order of the polygons in the spatial data.

ED_reorder <- match(ED_ITM$GUID_,ED_SAPS$GUID)
print(head(ED_reorder))
[1] 1150  316  471 3315 1901 3373

So, row 1150 in the SAPS data corresponds with polygon 1 in the spatial data, row 316 with polygon 2, row 471 with polygon 3, and so on. To re-order the data, and check that the rows correspond:

ED_SAPS <- ED_SAPS[ED_reorder,]
print(ED_SAPS[1:5,1:3])
                                 GUID       GEOGID            GEOGDESC
1150 2AE196291BD413A3E055000000000001 ED3409_14095         Aughwilliam
316  2AE196291CBE13A3E055000000000001 ED3409_05009 Blackrock-Carysfort
471  2AE1962923CC13A3E055000000000001 ED3409_07006         Coolaghmore
3315 2AE19629209713A3E055000000000001 ED3409_33125          Rathmullan
1901 2AE19629239213A3E055000000000001 ED3409_19093              Ardagh
print(ED_ITM@data[1:5,c(10,11,3)])
                             GUID_ CSOED_3409          ED_ENGLISH
0 2AE196291BD413A3E055000000000001      14095         AUGHWILLIAM
1 2AE196291CBE13A3E055000000000001      05009 BLACKROCK-CARYSFORT
2 2AE1962923CC13A3E055000000000001      07006         COOLAGHMORE
3 2AE19629209713A3E055000000000001      33125          RATHMULLAN
4 2AE19629239213A3E055000000000001      19093              ARDAGH

The names and GUIDs match. We have re-ordered the SAPS data correctly.

Example: population density

As an example, we can now compute the population density. The gArea() function from the rgeos library will return the area of each polygon in metres2. The log of the population density will reveal whether there are any discontinuities in the right hand tail.

library(rgeos)

To create a histogram of the ratio of persons per hectare, with a density estimate, we first extract the areas:

Areas <- gArea(ED_ITM,byid=TRUE)/10000
print(summary(Areas))
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
    4.363  1343.000  1958.000  2061.000  2618.000 16320.000 
LogDen <- log(ED_SAPS$T1_1AGETT / Areas)
hist(LogDen,freq=FALSE,n=100,border=NA,col="lightgrey",xlab="Log Persons/Ha",main="")
Den <- density(LogDen)
lines(Den$x,Den$y,col="red")

The local minimum between the two modes is at which(Den$y == min(Den$y[Den$x > 1 & Den$x < 2])) which corresponds to 4.0185987 persons per hectare. Above this value we can think of a ‘more urban’ and below it as ‘more rural’.


Version 1.0