This note describes how to use the An Post GeoDirectory database to examine spatial variations in the distances between buildings in Ireland. The version of GeoDirectory that is used is for the first quarter of 2016, so there might be some variation in the results if other versions are employed. This note covers (i) reading GeoDirectory tables into R (ii) creating the area codes to link these data to the digital boundary data available the CSO website and (iii) some suitable visualisations.
GeoDirectory is the main data base for postal addresses1 in the Republic of Ireland. It is supplied in a number of forms, one of which is a Microsoft Access database. Access is a relational database management system (RDBMS), and in GeoDirectory there is a complex series of relationships among the indexed tables to speed searching and keep the total database size manageable. The data can be read using the RODBC library functions in the 32 bit version of R. These do not work with the 64 bit version. When using RStudio, hold the Ctrl key down while the RStudio icon is double clicked, and then select the 32 bit version.
If Access is not running on your machine, download and install the appropriate 32 or 64 bit runtime libraries from Microsoft.
Several R libraries are required to run the examples. These are:
Library | Description |
---|---|
RODBC | R ODBC access functions Access, Excel &c |
sp | Spatial functions - basic mapping routines |
rgdal | Interface to the gdal library - readOGR() |
RANN | Interface to Arya and Mount’s ANN library |
RColorBrewer | Useful library of colour palettes. |
classInt | Class interval routines |
library(RODBC)
library(sp)
library(rgdal)
library(RANN)
library(RColorBrewer)
library(classInt)
The first step is to connect to the database. There are a number of system tables which deal with the internal organisation of the data - such as the relationships between the tables - as well as the tables which contain the data in which are of interest. The RODBC functions allow SQL2 queries of the database to be made. The odbcCOnnectAccess2007() function will make the initial connection to the database. The object id contains the information about the connection and is used in subsequent RODBC function calls. The GeoDirectory file is best located on the hard drive of your computer.
id <- odbcConnectAccess2007("c:/GeoDirectory/NUIMA_ACCESS07_M_Q116.accdb")
Many of the tables in GeoDirectory can be loaded in their entirety. This is not recommended for the BUILDINGS or various ADDRESS tables on account of their size - there are about 2 million records in the BULDINGS table alone. We’ll see how to query those below. Use the sqlFetch() function to load an entire table from the database into an R object. Then load the boundary data for the Counties and EDs as shown below.
COUNTIES <- sqlFetch(id,'COUNTIES')
EDS <- sqlFetch(id,'EDS')
CTbnd <- readOGR(dsn="C:/GeoDirectory",layer="Census2011_Admin_Counties_generalised20m",
stringsAsFactors=FALSE)
EDbnd <- readOGR(dsn="C:/GeoDirectory",layer="Census2011_Electoral_Divisions_generalised20m",
stringsAsFactors=FALSE)
It would be useful to know which codes in GeoDirectory correspond to the codes used in the Census of Population data and associated boundary files which can be downloaded from the Central Statistics Office website3.
Examine the various tables that have been loaded - these include attribute table for the Spatial Polygons Data Frame (EDbnd@data) and the GeoDirectory EDS table.
head(EDS)
ED_ID COUNTY_ID CHANGED_DATE NAME ED_UID CS_ID CSO_ED_ID
1 128035 3 2004-12-16 SINGLAND A 4270 NA 20036
2 128036 3 2004-12-16 SINGLAND B 4255 NA 20037
3 128037 3 2015-10-05 St. LAURENCE 4186 NA 20033
4 137001 17 2004-09-30 ABBEYLARA 2483 NA 9013
5 137002 17 2004-09-30 AGHABOY 275 NA 9037
6 137003 17 2004-09-30 AGHARRA 339 NA 9003
head(EDbnd@data[,7:11])
COUNTY COUNTYNAME CSOED OSIED EDNAME
0 34 Monaghan County 34066 177065 Shanmullagh
1 34 Monaghan County 34067 177066 Sheskin
2 34 Monaghan County 34068 177068 Tedavnet
3 34 Monaghan County 34069 177069 Tehallan
4 34 Monaghan County 34070 177070 Tullycorbet
5 01 Carlow County 01001 017011 Carlow Urban
While ED_ID
is the GeoDirectory internal ED code, the external CSO code is in the column CSO_ED_ID
, and not only has it been converted to numeric, it has also lost leading zeroes to pad it to five digits. This problem is easily managed with an ifelse
statement.
EDS$CSOED <- ifelse(nchar(EDS$CSO_ED_ID) == 4, paste("0",EDS$CSO_ED_ID,sep=""),
EDS$CSO_ED_ID)
head(EDS$CSOED)
[1] "20036" "20037" "20033" "09013" "09037" "09003"
print(c(length(EDS$CSOED),length(EDbnd$CSOED)))
[1] 3441 3409
The ifelse(test,yes,no) function is useful to do the adjustment. Notice that the GeoDirectory ED list is longer than the CSO boundary file list. This because some of the CSO EDs have been amalgamated to preserve confidentiality in EDs with small populations. Examination of the codes reveals that these are in adjacent pairs of EDs. The single ED codes in EDS can be replaced with the appropriate combined ED code. The clue is to find the codes in the SPDF which contain a /.
amalg <- grep("/",EDbnd$CSOED)
head(amalg)
[1] 628 704 911 1009 1279 1284
head(EDbnd$CSOED[(amalg)])
[1] "08045/08046" "09024/09035" "12043/12034" "13054/13085" "16014/16008"
[6] "16020/16017"
N <- length(amalg)
for(i in 1:N){
Index <- amalg[i]
amalg.code <- EDbnd$CSOED[Index]
ED1 <- substr(amalg.code,1,5)
ED2 <- substr(amalg.code,7,11)
EDS$CSOED[grep(ED1,EDS$CSOED)] <- amalg.code
EDS$CSOED[grep(ED2,EDS$CSOED)] <- amalg.code
}
print(c(ED1,ED2))
[1] "32086" "32087"
EDS[EDS$CSOED == amalg.code,]
ED_ID COUNTY_ID CHANGED_DATE NAME ED_UID CS_ID CSO_ED_ID
2751 27058 24 2004-09-30 KILLINAGH 2540 NA 32086
2778 27085 24 2004-09-30 TEEBANE 257 NA 32087
CSOED
2751 32086/32087
2778 32086/32087
The last couple of R instructions print the values of ED1
and ED2
in the final iteration of the loop, and the corresponding lines from the EDS data frame.
The county codes dealt with in this section.
head(COUNTIES[order(COUNTIES$COUNTY_ID),])
COUNTY_ID NAME IRISH_NAME IRISH_VERIFICATION
3 1 CORK CORCAIGH N
4 2 KERRY CIARRAÍ N
21 3 LIMERICK LUIMNEACH N
1 4 CLARE AN CLÁR N
20 5 MAYO MAIGH EO N
24 6 GALWAY GAILLIMH N
head(CTbnd@data[order(CTbnd$COUNTY),7:8])
COUNTY COUNTYNAME
2 01 Carlow County
3 02 Dublin City
4 03 South Dublin
5 04 Fingal
6 05 Dún Laoghaire-Rathdown
7 06 Kildare County
In GeoDirectory COUNTIES table the four counties of Dublin have been amalgamated into a single entity, as have North and South Tipperary. There are also 36 buildings where the County assignment is UNKNOWN. However, all is not lost. The CSO county codes are also in the first two characters of the ED codes that were created in the last section, so a CSO county code can be extracted from them.
EDS$CSO_CTY <- as.character(substr(EDS$CSOED,1,2))
The sqlQuery() function used to make the SQL query to read data from the BUILDINGS table - the text of the query itself is stored in a character object. A example query for the verified and valid entries in Maynooth would take the form:
query101 <- "SELECT *
FROM BUILDINGS
WHERE VERIFIED='Y' AND INVALID='N' AND ED_ID = 87066;"
Note that the ID for Maynooth Electoral Division is 87066.
There general form of the SELECT statement is:
SELECT columns FROM database WHERE conditions.
The wildcard character is used as a shorthand for all columns. The selection may be made from several tables which are related through a common column. The SQL structures can become quite complex, and they can be slower than the simple query in the example above.
Buildings appear to be more widely dispersed in rural areas, so might there be some observable urban/rural disparities. What would the average spacing be in the different counties of Ireland?
How can building spacing be measured? The dist() function is possible candidate, and find the smallest distance in each row that isn’t in the leading diagonal. However, memory limitations preclude doing this for the whole country - there are 11,343,084 inter building distances in Maynooth alone!
The nn2() function from the RANN library implements a k-dtree structure to find nearest neighbours in astonishingly short time. this can be used to find the 2-NNs, since the first nearest neighbour to each location is the location itself.
The query is for the whole of Ireland. Select a subset of of the columns from the table and note its size.
query103 <- "SELECT ED_ID,BUILDING_USE,COUNTY_ID,EAST,NORTH
FROM BUILDINGS
WHERE VERIFIED='Y' AND INVALID='N';"
Buildings <- sqlQuery(id,query103)
dim(Buildings)
[1] 1927810 5
It takes a relatively short time to load the 1927810 records from the database into an R data frame which is called Buildings.
nn.result <- nn2(Buildings[,4:5],k=2)
summary(nn.result$nn.dists[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 6.352 10.970 35.960 39.690 3821.000
The spatial coordinates are in columns 4 and 5 of Buildings. The output from nn2() includes a matrix of the distances. Each row corresponds to the same row in Buildings, and the 2-NN distances are in the second column.
To find the 2-NNs for the 1.9m records takes about 10 seconds… running on 2.20GHz AMD Athlon X3 400e processor - not a very fast machine. Notice that the mean distance is much larger than the median value - while 50% of the buildings are no more than ~11m from their closest neighbour, the other 50% can be spaced up to 3.82km apart. A histogram of the distribution would benefit in clarity if the log of the distances is used.
hist(log(nn.result$nn.dists[,2]),n=100,xlab="Log Distance",ylim=c(0,1),
prob=T,col="wheat",border=NA,main="Inter Building Distances")
lines(density(log(nn.result$nn.dists[,2])),lty=2)
abline(v=2.9)
There is local minimum at about 18m in the national distribution of the logged distances. There are two very different processes taking place in determining building spacing.
Next compute the distributions of building spacing by GeoDirectory county. The results are intriguing - lower distances suggest more ‘urban’ counties.
z <- tapply(nn.result$nn.dists[,2],Buildings$COUNTY_ID,median)
names(z) <- COUNTIES$NAME[match(names(z),COUNTIES$COUNTY_ID)]
print(sort(z))
DUBLIN LOUTH KILDARE WATERFORD WICKLOW CORK LIMERICK
6.360626 8.970132 9.493475 10.210164 10.546259 11.070579 11.781031
MEATH CARLOW LAOIS WESTMEATH OFFALY KILKENNY WEXFORD
13.222114 14.023352 16.408678 18.126705 20.383572 22.066979 22.487202
TIPPERARY CLARE SLIGO GALWAY KERRY DONEGAL LONGFORD
22.717242 23.155126 23.415073 25.785536 28.988280 29.427376 29.532745
MAYO CAVAN MONAGHAN ROSCOMMON LEITRIM UNKNOWN
35.656821 35.679713 38.028959 40.067047 43.209592 64.315593
It should not surprise that Dublin appears to be the most ‘urban’. There’s a step in the distribution of sorted values which sets off the most ‘rural’ counties of Cavan, Mayo, Monaghan, Roscommon and Leitrim. A small number of properties have no reliable county coding, but are likely to be very rural.
summary(nn.result$nn.dists[Buildings$COUNTY_ID == 100,2])
Min. 1st Qu. Median Mean 3rd Qu. Max.
4.375 35.980 64.320 91.680 90.100 636.400
The evidence would suggest that the buildings in the UNKNOWN group are rural.
Next, repeat the by county median distance analysis, but this time using the CSO counties.The EDlkp variable provides the record numbers in the EDS table that match those in the Buildings table - based on the ED_ID. It’s faster to extract a column from the table as a vector and re-order it with EDlkp than to index the records of the EDS table directly.
EDlkp <- match(Buildings$ED_ID,EDS$ED_ID)
z <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],median)
names(z) <- CTbnd$COUNTYNAME[match(names(z),CTbnd$COUNTY)]
print(sort(z),digits=3)
Dublin City South Dublin Cork City
6.00 6.25 6.38
Fingal Limerick City Waterford City
6.66 6.69 6.78
Galway City Dún Laoghaire-Rathdown Louth County
6.90 7.67 8.97
Kildare County Wicklow County Meath County
9.49 10.55 13.22
Carlow County Laois County Westmeath County
14.02 16.41 18.13
Cork County South Tipperary Offaly County
18.64 20.11 20.38
Kilkenny County Wexford County Clare County
22.07 22.49 23.16
Waterford County Sligo County North Tipperary
23.26 23.42 25.78
Limerick County Kerry County Donegal County
27.08 28.99 29.43
Longford County Galway County Mayo County
29.53 35.46 35.66
Cavan County Monaghan County Roscommon County
35.68 38.03 40.07
Leitrim County
43.21
The most rural would appear to be Galway, Mayo, Cavan,Monaghan, Roscommon and Leitrim. There’s a group of ‘slightly less rural’ counties in Tipperary North, Limerick, Kerry, Longford, and Donegal. The most urban are the four counties which comprise Dublin, and not surprisingly, the cities of Cork, Limerick, Waterford and Galway. There are other analyses which are possible, for example the variances of the county distributions.
zv <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],var)
names(zv) <- CTbnd$COUNTYNAME[match(names(zv),CTbnd$COUNTY)]
print(sort(zv),digits=3)
Dublin City Cork City Limerick City
73.9 76.7 92.9
Dún Laoghaire-Rathdown Galway City South Dublin
170.0 185.0 219.3
Waterford City Fingal Louth County
227.3 491.0 1475.2
Kildare County Meath County Wexford County
2320.1 2989.5 3206.4
Wicklow County Donegal County Carlow County
3677.1 3920.4 4499.9
Kerry County Westmeath County Galway County
4609.4 4706.2 4710.2
Limerick County Cork County Longford County
4742.7 4931.3 5058.7
Offaly County Waterford County Kilkenny County
5428.1 5430.3 5521.7
Sligo County Laois County Clare County
5547.4 5548.8 5727.1
Mayo County Monaghan County South Tipperary
5802.9 5828.4 5872.4
Cavan County North Tipperary Roscommon County
5975.8 6202.1 6371.3
Leitrim County
7274.4
plot(z,zv,xlab="Mean Spacing",ylab="Variance of Spacing",pch=16,cex=0.5)
It is hardly surprising that there strong relationship between the mean spacing the the variance of the spacing at county level. In the cities the buildings are consistently close together, but in more rural areas there is a greater spread of distances - there are small settlements which each county which form clusters of buildings.
Compare the within and between county distributions using a multiple boxplot.
z <- tapply(nn.result$nn.dists[,2],EDS$CSO_CTY[EDlkp],median)
Colours <- rep('white',34)
Colours[c(2:6,10,15)] <- 'green'
boxplot(log(nn.result$nn.dists[,2]+0.001)~EDS$CSO_CTY[EDlkp],at=rank(z),
pars=list(outcol="darkgrey",outcex=0.5,outpch=16),
col=Colours,xlab="County",ylab="log(distance)")
The first eight boxes are for the cities (including the four counties of Dublin). Whilst the medians rise gently, the variance increases noticeably once there is a mixture of urban and rural areas. Those boxes shaded green represent the Greater Dublin area. Noticeable too is the much lower incidence of outliers in the non-city counties.
It would be helpful to have the county names as well.
boxplot(log(nn.result$nn.dists[,2]+0.001)~EDS$CSO_CTY[EDlkp],at=rank(z),
pars=list(outcol="darkgrey",outcex=0.5,outpch=16),
col=Colours,xlab="",ylab="log(distance)",xaxt="n")
CTlkp <- match(names(z),CTbnd$COUNTY)
CtyNames <- gsub(" County","",CTbnd$COUNTYNAME[CTlkp])
axis(1,at=rank(z),labels=CtyNames,las=2,cex.axis=0.5)
The gsub() call removes the string " County" from the count name strings.
It would be instructive to map the spatial variation in the building space by ED. To reveal the patterns in the cities more clearly the results can be plotted on a cartogram. The cartograms for the EDs and County SPDFs are created thus:
library(getcartr)
to.carto <- carto.transform(EDbnd,log(EDbnd$Total2011),res=1024)
CTcart <- to.carto(CTbnd)
EDcart <- to.carto(EDbnd)
The log2 of the population count is chosen as the mass variable for the cartogram - the distortion in the resulting cartogram isn’t quite as extreme as when the raw counts are used.
The to create the map, the following are required: (i) median values of the building spacing by ED, using the CSO codes, (ii) matching these to the corresponding codes in the Spatial Polygons Data Frame and (iii) determining class breaks to assign shading. The county boundaries, transformed to log(population) space, to aid interpretation.
z <- tapply(nn.result$nn.dists[,2],EDS$CSOED[EDlkp],median)
EDlkp2 <- match(EDcart$CSOED,names(z))
Var <- z[EDlkp2]
nClass <- 7
Palette <- rev(brewer.pal(nClass,"Oranges"))
Classes <- classIntervals(Var,nClass,"quantile",dataPrecision = 1)
Colours <- findColours(Classes,Palette)
plot(EDcart,col=Colours,border=NA)
plot(CTcart,border="black",add=T)
legend("topleft",
legend=names(attr(Colours, "table")),
fill=attr(Colours, "palette"),
cex=0.5, bty="n")
title("Building Spacing by ED")
box()
A very distinct spatial pattern - the class intervals are based on 7-tiles. Cities and rural towns stand out very clearly.
It is regarded as good practice to close the ODBC connection.
close(id)
There’s a useful guide to SQL in Gennick J, 2006, SQL Pocket Guide, Sebastopol CA: O’Reilly↩
The 2011 Census data are available at http://www.cso.ie/en/census/; the SAPS downloads are at Option 2 on http://www.cso.ie/en/census/census2011smallareapopulationstatisticssaps/ and the digital boundary files are available at http://www.cso.ie/en/census/census2011boundaryfiles/↩