Introduction

This note describes how to use the An Post GeoDirectory database to examine spatial variations in the distances between buildings in Ireland. The version of GeoDirectory that is used is for the first quarter of 2016, so there might be some variation in the results if other versions are employed. This note covers (i) reading GeoDirectory tables into R (ii) creating the area codes to link these data to the digital boundary data available the CSO website and (iii) some suitable visualisations.

GeoDirectory is the main data base for postal addresses1 in the Republic of Ireland. It is supplied in a number of forms, one of which is a Microsoft Access database. Access is a relational database management system (RDBMS), and in GeoDirectory there is a complex series of relationships among the indexed tables to speed searching and keep the total database size manageable. The data can be read using the RODBC library functions in the 32 bit version of R. These do not work with the 64 bit version. When using RStudio, hold the Ctrl key down while the RStudio icon is double clicked, and then select the 32 bit version.

If Access is not running on your machine, download and install the appropriate 32 or 64 bit runtime libraries from Microsoft.

Getting started

Several R libraries are required to run the examples. These are:

Library Description
RODBC R ODBC access functions Access, Excel &c
sp Spatial functions - basic mapping routines
rgdal Interface to the gdal library - readOGR()
RANN Interface to Arya and Mount’s ANN library
RColorBrewer Useful library of colour palettes.
classInt Class interval routines
library(RODBC)
library(sp)
library(rgdal)
library(RANN)
library(RColorBrewer)
library(classInt)

First steps

The first step is to connect to the database. There are a number of system tables which deal with the internal organisation of the data - such as the relationships between the tables - as well as the tables which contain the data in which are of interest. The RODBC functions allow SQL2 queries of the database to be made. The odbcCOnnectAccess2007() function will make the initial connection to the database. The object id contains the information about the connection and is used in subsequent RODBC function calls. The GeoDirectory file is best located on the hard drive of your computer.

id <- odbcConnectAccess2007("c:/GeoDirectory/NUIMA_ACCESS07_M_Q116.accdb")

Many of the tables in GeoDirectory can be loaded in their entirety. This is not recommended for the BUILDINGS or various ADDRESS tables on account of their size - there are about 2 million records in the BULDINGS table alone. We’ll see how to query those below. Use the sqlFetch() function to load an entire table from the database into an R object. Then load the boundary data for the Counties and EDs as shown below.

COUNTIES <- sqlFetch(id,'COUNTIES')
EDS <- sqlFetch(id,'EDS')
CTbnd <- readOGR(dsn="C:/GeoDirectory",layer="Census2011_Admin_Counties_generalised20m",
                 stringsAsFactors=FALSE)
EDbnd  <- readOGR(dsn="C:/GeoDirectory",layer="Census2011_Electoral_Divisions_generalised20m",
                 stringsAsFactors=FALSE)

Area codes

It would be useful to know which codes in GeoDirectory correspond to the codes used in the Census of Population data and associated boundary files which can be downloaded from the Central Statistics Office website3.

Electoral Divisions

Examine the various tables that have been loaded - these include attribute table for the Spatial Polygons Data Frame (EDbnd@data) and the GeoDirectory EDS table.

head(EDS)
   ED_ID COUNTY_ID CHANGED_DATE         NAME ED_UID CS_ID CSO_ED_ID
1 128035         3   2004-12-16   SINGLAND A   4270    NA     20036
2 128036         3   2004-12-16   SINGLAND B   4255    NA     20037
3 128037         3   2015-10-05 St. LAURENCE   4186    NA     20033
4 137001        17   2004-09-30    ABBEYLARA   2483    NA      9013
5 137002        17   2004-09-30      AGHABOY    275    NA      9037
6 137003        17   2004-09-30      AGHARRA    339    NA      9003
head(EDbnd@data[,7:11])
  COUNTY      COUNTYNAME CSOED  OSIED       EDNAME
0     34 Monaghan County 34066 177065  Shanmullagh
1     34 Monaghan County 34067 177066      Sheskin
2     34 Monaghan County 34068 177068     Tedavnet
3     34 Monaghan County 34069 177069     Tehallan
4     34 Monaghan County 34070 177070  Tullycorbet
5     01   Carlow County 01001 017011 Carlow Urban

While ED_ID is the GeoDirectory internal ED code, the external CSO code is in the column CSO_ED_ID, and not only has it been converted to numeric, it has also lost leading zeroes to pad it to five digits. This problem is easily managed with an ifelse statement.

EDS$CSOED <- ifelse(nchar(EDS$CSO_ED_ID) == 4, paste("0",EDS$CSO_ED_ID,sep=""),
                     EDS$CSO_ED_ID)
head(EDS$CSOED)
[1] "20036" "20037" "20033" "09013" "09037" "09003"
print(c(length(EDS$CSOED),length(EDbnd$CSOED)))
[1] 3441 3409

The ifelse(test,yes,no) function is useful to do the adjustment. Notice that the GeoDirectory ED list is longer than the CSO boundary file list. This because some of the CSO EDs have been amalgamated to preserve confidentiality in EDs with small populations. Examination of the codes reveals that these are in adjacent pairs of EDs. The single ED codes in EDS can be replaced with the appropriate combined ED code. The clue is to find the codes in the SPDF which contain a /.

amalg <- grep("/",EDbnd$CSOED)
head(amalg)
[1]  628  704  911 1009 1279 1284
head(EDbnd$CSOED[(amalg)])
[1] "08045/08046" "09024/09035" "12043/12034" "13054/13085" "16014/16008"
[6] "16020/16017"
N <- length(amalg)
for(i in 1:N){
  Index <- amalg[i]
  amalg.code <- EDbnd$CSOED[Index]
  ED1 <- substr(amalg.code,1,5)
  ED2 <- substr(amalg.code,7,11)
  EDS$CSOED[grep(ED1,EDS$CSOED)] <- amalg.code 
  EDS$CSOED[grep(ED2,EDS$CSOED)] <- amalg.code
}
print(c(ED1,ED2))
[1] "32086" "32087"
EDS[EDS$CSOED == amalg.code,]
     ED_ID COUNTY_ID CHANGED_DATE      NAME ED_UID CS_ID CSO_ED_ID
2751 27058        24   2004-09-30 KILLINAGH   2540    NA     32086
2778 27085        24   2004-09-30   TEEBANE    257    NA     32087
           CSOED
2751 32086/32087
2778 32086/32087

The last couple of R instructions print the values of ED1 and ED2 in the final iteration of the loop, and the corresponding lines from the EDS data frame.

Counties

The county codes dealt with in this section.

head(COUNTIES[order(COUNTIES$COUNTY_ID),])
   COUNTY_ID     NAME IRISH_NAME IRISH_VERIFICATION
3          1     CORK   CORCAIGH                  N
4          2    KERRY    CIARRAÍ                  N
21         3 LIMERICK  LUIMNEACH                  N
1          4    CLARE    AN CLÁR                  N
20         5     MAYO   MAIGH EO                  N
24         6   GALWAY   GAILLIMH                  N
head(CTbnd@data[order(CTbnd$COUNTY),7:8])
  COUNTY             COUNTYNAME
2     01          Carlow County
3     02            Dublin City
4     03           South Dublin
5     04                 Fingal
6     05 Dún Laoghaire-Rathdown
7     06         Kildare County

In GeoDirectory COUNTIES table the four counties of Dublin have been amalgamated into a single entity, as have North and South Tipperary. There are also 36 buildings where the County assignment is UNKNOWN. However, all is not lost. The CSO county codes are also in the first two characters of the ED codes that were created in the last section, so a CSO county code can be extracted from them.

EDS$CSO_CTY <- as.character(substr(EDS$CSOED,1,2))

Querying the BUILDINGS table

The sqlQuery() function used to make the SQL query to read data from the BUILDINGS table - the text of the query itself is stored in a character object. A example query for the verified and valid entries in Maynooth would take the form:

query101 <- "SELECT *
             FROM   BUILDINGS
             WHERE  VERIFIED='Y' AND INVALID='N' AND ED_ID = 87066;"

Note that the ID for Maynooth Electoral Division is 87066.

There general form of the SELECT statement is:

SELECT columns FROM database WHERE conditions.

The wildcard character is used as a shorthand for all columns. The selection may be made from several tables which are related through a common column. The SQL structures can become quite complex, and they can be slower than the simple query in the example above.

Inter Building Spacing in Ireland

Buildings appear to be more widely dispersed in rural areas, so might there be some observable urban/rural disparities. What would the average spacing be in the different counties of Ireland?

How can building spacing be measured? The dist() function is possible candidate, and find the smallest distance in each row that isn’t in the leading diagonal. However, memory limitations preclude doing this for the whole country - there are 11,343,084 inter building distances in Maynooth alone!

The nn2() function from the RANN library implements a k-dtree structure to find nearest neighbours in astonishingly short time. this can be used to find the 2-NNs, since the first nearest neighbour to each location is the location itself.

The query is for the whole of Ireland. Select a subset of of the columns from the table and note its size.

query103 <- "SELECT ED_ID,BUILDING_USE,COUNTY_ID,EAST,NORTH
             FROM   BUILDINGS
             WHERE  VERIFIED='Y' AND INVALID='N';"      
Buildings <- sqlQuery(id,query103)
dim(Buildings)
[1] 1927810       5

It takes a relatively short time to load the 1927810 records from the database into an R data frame which is called Buildings.

nn.result <- nn2(Buildings[,4:5],k=2)
summary(nn.result$nn.dists[,2])
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    0.000    6.352   10.970   35.960   39.690 3821.000

The spatial coordinates are in columns 4 and 5 of Buildings. The output from nn2() includes a matrix of the distances. Each row corresponds to the same row in Buildings, and the 2-NN distances are in the second column.

To find the 2-NNs for the 1.9m records takes about 10 seconds… running on 2.20GHz AMD Athlon X3 400e processor - not a very fast machine. Notice that the mean distance is much larger than the median value - while 50% of the buildings are no more than ~11m from their closest neighbour, the other 50% can be spaced up to 3.82km apart. A histogram of the distribution would benefit in clarity if the log of the distances is used.

hist(log(nn.result$nn.dists[,2]),n=100,xlab="Log Distance",ylim=c(0,1),
   prob=T,col="wheat",border=NA,main="Inter Building Distances")
lines(density(log(nn.result$nn.dists[,2])),lty=2)
abline(v=2.9)

There is local minimum at about 18m in the national distribution of the logged distances. There are two very different processes taking place in determining building spacing.

Building spacing at county levels

Next compute the distributions of building spacing by GeoDirectory county. The results are intriguing - lower distances suggest more ‘urban’ counties.

z <- tapply(nn.result$nn.dists[,2],Buildings$COUNTY_ID,median)
names(z) <- COUNTIES$NAME[match(names(z),COUNTIES$COUNTY_ID)]
print(sort(z))
   DUBLIN     LOUTH   KILDARE WATERFORD   WICKLOW      CORK  LIMERICK 
 6.360626  8.970132  9.493475 10.210164 10.546259 11.070579 11.781031 
    MEATH    CARLOW     LAOIS WESTMEATH    OFFALY  KILKENNY   WEXFORD 
13.222114 14.023352 16.408678 18.126705 20.383572 22.066979 22.487202 
TIPPERARY     CLARE     SLIGO    GALWAY     KERRY   DONEGAL  LONGFORD 
22.717242 23.155126 23.415073 25.785536 28.988280 29.427376 29.532745 
     MAYO     CAVAN  MONAGHAN ROSCOMMON   LEITRIM   UNKNOWN 
35.656821 35.679713 38.028959 40.067047 43.209592 64.315593 

It should not surprise that Dublin appears to be the most ‘urban’. There’s a step in the distribution of sorted values which sets off the most ‘rural’ counties of Cavan, Mayo, Monaghan, Roscommon and Leitrim. A small number of properties have no reliable county coding, but are likely to be very rural.

summary(nn.result$nn.dists[Buildings$COUNTY_ID == 100,2])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  4.375  35.980  64.320  91.680  90.100 636.400 

The evidence would suggest that the buildings in the UNKNOWN group are rural.

Next, repeat the by county median distance analysis, but this time using the CSO counties.The EDlkp variable provides the record numbers in the EDS table that match those in the Buildings table - based on the ED_ID. It’s faster to extract a column from the table as a vector and re-order it with EDlkp than to index the records of the EDS table directly.

EDlkp <- match(Buildings$ED_ID,EDS$ED_ID)
z <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],median)
names(z) <- CTbnd$COUNTYNAME[match(names(z),CTbnd$COUNTY)]
print(sort(z),digits=3)
           Dublin City           South Dublin              Cork City 
                  6.00                   6.25                   6.38 
                Fingal          Limerick City         Waterford City 
                  6.66                   6.69                   6.78 
           Galway City Dún Laoghaire-Rathdown           Louth County 
                  6.90                   7.67                   8.97 
        Kildare County         Wicklow County           Meath County 
                  9.49                  10.55                  13.22 
         Carlow County           Laois County       Westmeath County 
                 14.02                  16.41                  18.13 
           Cork County        South Tipperary          Offaly County 
                 18.64                  20.11                  20.38 
       Kilkenny County         Wexford County           Clare County 
                 22.07                  22.49                  23.16 
      Waterford County           Sligo County        North Tipperary 
                 23.26                  23.42                  25.78 
       Limerick County           Kerry County         Donegal County 
                 27.08                  28.99                  29.43 
       Longford County          Galway County            Mayo County 
                 29.53                  35.46                  35.66 
          Cavan County        Monaghan County       Roscommon County 
                 35.68                  38.03                  40.07 
        Leitrim County 
                 43.21 

The most rural would appear to be Galway, Mayo, Cavan,Monaghan, Roscommon and Leitrim. There’s a group of ‘slightly less rural’ counties in Tipperary North, Limerick, Kerry, Longford, and Donegal. The most urban are the four counties which comprise Dublin, and not surprisingly, the cities of Cork, Limerick, Waterford and Galway. There are other analyses which are possible, for example the variances of the county distributions.

zv <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],var)
names(zv) <- CTbnd$COUNTYNAME[match(names(zv),CTbnd$COUNTY)]
print(sort(zv),digits=3)
           Dublin City              Cork City          Limerick City 
                  73.9                   76.7                   92.9 
Dún Laoghaire-Rathdown            Galway City           South Dublin 
                 170.0                  185.0                  219.3 
        Waterford City                 Fingal           Louth County 
                 227.3                  491.0                 1475.2 
        Kildare County           Meath County         Wexford County 
                2320.1                 2989.5                 3206.4 
        Wicklow County         Donegal County          Carlow County 
                3677.1                 3920.4                 4499.9 
          Kerry County       Westmeath County          Galway County 
                4609.4                 4706.2                 4710.2 
       Limerick County            Cork County        Longford County 
                4742.7                 4931.3                 5058.7 
         Offaly County       Waterford County        Kilkenny County 
                5428.1                 5430.3                 5521.7 
          Sligo County           Laois County           Clare County 
                5547.4                 5548.8                 5727.1 
           Mayo County        Monaghan County        South Tipperary 
                5802.9                 5828.4                 5872.4 
          Cavan County        North Tipperary       Roscommon County 
                5975.8                 6202.1                 6371.3 
        Leitrim County 
                7274.4 
plot(z,zv,xlab="Mean Spacing",ylab="Variance of Spacing",pch=16,cex=0.5)

It is hardly surprising that there strong relationship between the mean spacing the the variance of the spacing at county level. In the cities the buildings are consistently close together, but in more rural areas there is a greater spread of distances - there are small settlements which each county which form clusters of buildings.

Compare the within and between county distributions using a multiple boxplot.

z <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],median)
Colours <- rep('white',34)
Colours[c(2:6,10,15)] <- 'green'
boxplot(log(nn.result$nn.dists[,2]+0.001)~EDS$CSO_CTY[EDlkp],at=rank(z),
        pars=list(outcol="darkgrey",outcex=0.5,outpch=16),
        col=Colours,xlab="County",ylab="log(distance)")

The first eight boxes are for the cities (including the four counties of Dublin). Whilst the medians rise gently, the variance increases noticeably once there is a mixture of urban and rural areas. Those boxes shaded green represent the Greater Dublin area. Noticeable too is the much lower incidence of outliers in the non-city counties.

It would be helpful to have the county names as well.

boxplot(log(nn.result$nn.dists[,2]+0.001)~EDS$CSO_CTY[EDlkp],at=rank(z),
        pars=list(outcol="darkgrey",outcex=0.5,outpch=16),
        col=Colours,xlab="",ylab="log(distance)",xaxt="n")
CTlkp <- match(names(z),CTbnd$COUNTY)
CtyNames <- gsub(" County","",CTbnd$COUNTYNAME[CTlkp])
axis(1,at=rank(z),labels=CtyNames,las=2,cex.axis=0.5)

The gsub() call removes the string " County" from the count name strings.

Mapping at ED level

It would be instructive to map the spatial variation in the building space by ED. To reveal the patterns in the cities more clearly the results can be plotted on a cartogram. The cartograms for the EDs and County SPDFs are created thus:

library(getcartr)
to.carto <- carto.transform(EDbnd,log(EDbnd$Total2011),res=1024)
CTcart <- to.carto(CTbnd)
EDcart <- to.carto(EDbnd)

The log2 of the population count is chosen as the mass variable for the cartogram - the distortion in the resulting cartogram isn’t quite as extreme as when the raw counts are used.

The to create the map, the following are required: (i) median values of the building spacing by ED, using the CSO codes, (ii) matching these to the corresponding codes in the Spatial Polygons Data Frame and (iii) determining class breaks to assign shading. The county boundaries, transformed to log(population) space, to aid interpretation.

z <- tapply(nn.result$nn.dists[,2],EDS$CSOED[EDlkp],median)
EDlkp2 <- match(EDcart$CSOED,names(z))
Var <- z[EDlkp2]
nClass <- 7
Palette <- rev(brewer.pal(nClass,"Oranges"))
Classes <- classIntervals(Var,nClass,"quantile",dataPrecision = 1)
Colours <- findColours(Classes,Palette)
plot(EDcart,col=Colours,border=NA)
plot(CTcart,border="black",add=T)
legend("topleft", 
       legend=names(attr(Colours, "table")), 
       fill=attr(Colours, "palette"), 
       cex=0.5, bty="n")
title("Building Spacing by ED")
box()

A very distinct spatial pattern - the class intervals are based on 7-tiles. Cities and rural towns stand out very clearly.

Finally

It is regarded as good practice to close the ODBC connection.

close(id)

  1. https://www.geodirectory.ie/

  2. There’s a useful guide to SQL in Gennick J, 2006, SQL Pocket Guide, Sebastopol CA: O’Reilly

  3. The 2011 Census data are available at http://www.cso.ie/en/census/; the SAPS downloads are at Option 2 on http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/ and the digital boundary files are available at http://www.cso.ie/en/census/census2011boundaryfiles/