Burcin Yazgi Walsh, Chris Brunsdon, Martin Charlton

NCG, MU SSI, Maynooth University

The Dublin Geodemographic Tool follows the Ireland Geodemographic classification method but focuses on Dublin for deeper insight. For the Dublin Geodemographic Tool, Small Areas are used as the spatial units, and the 2016 Census of Population results supply the socio-economic components.

Geodemographic classification is a tool that groups spatial units based on socio-economic characteristics. The classification is derived from similarities found in the data through the analysis of 33 variables under the themes of age and marital status; ethnicity; housing; socio-economic group; education; commuting; occupation type; motor car availability; and internet access.

All these geodemographic tools are based on the study of the 2011 Ireland Census data by Chris Brunsdon, Jan Rigby and Martin Charlton. For detailed information on that study, see https://rpubs.com/chrisbrunsdon/14998.

The original census data used in this analysis can be found at http://www.cso.ie/en/census/census2016reports/census2016smallareapopulationstatistics/.

CreateVariables <- function(SA2016, na.rm=TRUE) {
  # Derive percentage-based variables from the raw SAPS 2016 counts.
  # attach() lets the census column names be referenced directly below;
  # the data frame is detached again before the function returns.
  attach(SA2016)

  # Age structure: share of the population in each age band
  Age0_4    <- 100 * ( T1_1AGE0T + T1_1AGE1T + T1_1AGE2T + T1_1AGE3T + T1_1AGE4T ) / T1_1AGETT
  Age5_14   <- 100 * ( T1_1AGE5T + T1_1AGE6T + T1_1AGE7T + T1_1AGE8T + T1_1AGE9T +
                         T1_1AGE10T + T1_1AGE11T + T1_1AGE12T + T1_1AGE13T + T1_1AGE14T) / T1_1AGETT
  Age25_44  <- 100 * ( T1_1AGE25_29T + T1_1AGE30_34T + T1_1AGE35_39T + T1_1AGE40_44T ) / T1_1AGETT
  Age45_64  <- 100 * ( T1_1AGE45_49T + T1_1AGE50_54T + T1_1AGE55_59T + T1_1AGE60_64T ) / T1_1AGETT
  Age65over <- 100 * ( T1_1AGE65_69T + T1_1AGE70_74T + T1_1AGE75_79T + T1_1AGE80_84T + T1_1AGEGE_85T ) / T1_1AGETT

  # Nationality and birthplace ('not stated' removed from the denominator)
  EU_National           <- 100 * (T2_1UKN + T2_1PLN + T2_1LTN + T2_1EUN) / (T2_1TN - T2_1NSN)
  ROW_National          <- 100 * (T2_1RWN) / (T2_1TN - T2_1NSN)
  Born_outside_Ireland  <- 100 * (T2_1TBP - T2_1IEBP) / (T2_1TN - T2_1NSN)

  # Household composition
  Separated            <- 100 * (T1_2SEPT + T1_2DIVT) / T1_2T
  SinglePerson         <- 100 * T5_2_1PP / T5_2_TP
  Pensioner            <- 100 * T4_5RP / T4_5TP
  LoneParent           <- 100 * (T4_3FOPFCT + T4_3FOPMCT) / T4_5TF
  DINK                 <- 100 * T4_5PFF / T4_5TF
  NonDependentKids     <- 100 * T4_4AGE_GE20F / T4_4TF

  # Housing tenure, type and amenities
  RentPublic         <- 100 * T6_3_RLAH / (T6_3_TH - T6_3_NSH)
  RentPrivate        <- 100 * T6_3_RPLH / (T6_3_TH - T6_3_NSH)

  Flats              <- 100 * T6_1_FA_H / (T6_1_TH - T6_1_NS_H)
  NoCenHeat          <- 100 * T6_5_NCH / (T6_5_T - T6_5_NS)
  RoomsHH            <- (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH ) / (T6_4_TH - T6_4_NSH)
  PeopleRoom         <- T1_1AGETT / (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH + T6_4_NSH)
  SepticTank         <- 100 * T6_7_IST / (T6_7_T - T6_7_NS)

  # Socio-economic indicators
  HEQual              <-  100 * ((T10_4_ODNDT + T10_4_HDPQT + T10_4_PDT + T10_4_DT) / (T10_4_TT - T10_4_NST))
  Employed            <-  100 * T8_1_WT / T8_1_TT
  TwoCars             <-  100 * (T15_1_2C + T15_1_3C + T15_1_GE4C) / (T15_1_TC - T15_1_NSC)
  JTWPublic           <-  100 * (T11_1_BUW + T11_1_TDLW) / (T11_1_TW - T11_1_NSW)
  HomeWork            <-  100 * T9_2_PH / T9_2_PT
  LLTI                <-  100 * (T12_3_BT + T12_3_VBT) / (T12_3_TT - T12_3_NST)
  UnpaidCare          <-  100 * T12_2_T / T1_1AGETT

  # Economic status and industry of employment
  Students           <- 100 * T8_1_ST / T8_1_TT
  Unemployed         <- 100 * T8_1_ULGUPJT / T8_1_TT

  EconInactFam       <- 100 * T8_1_LAHFT / T8_1_TT
  Agric              <- 100 * T14_1_AFFT / T14_1_TT
  Construction       <- 100 * T14_1_BCT / T14_1_TT
  Manufacturing      <- 100 * T14_1_MIT / T14_1_TT
  Commerce           <- 100 * T14_1_CTT / T14_1_TT
  Transport          <- 100 * T14_1_TCT / T14_1_TT
  Public             <- 100 * T14_1_PAT / T14_1_TT
  Professional       <- 100 * T14_1_PST / T14_1_TT

  # Internet access
  Broadband          <- 100 * T15_3_B / (T15_3_B + T15_3_OTH)
  Internet           <- 100 * (T15_3_B + T15_3_OTH) / (T15_3_T - T15_3_NS)

  # Keep the Small Area identifier (drop the first 7 characters of GEOGID)
  Place <- data.frame(substring(GEOGID, 8), stringsAsFactors=FALSE)

  detach(SA2016)

  # Assemble the derived variables into themed blocks
  colnames(Place)[1] <- 'GEOGID'
  Demographic <- data.frame(Age0_4, Age5_14, Age25_44, Age45_64, Age65over, EU_National, ROW_National, Born_outside_Ireland)
  HouseholdComposition <- data.frame(Separated, SinglePerson, Pensioner, LoneParent, DINK, NonDependentKids)
  Housing <- data.frame(RentPublic, RentPrivate, Flats, NoCenHeat, RoomsHH, PeopleRoom, SepticTank)
  SocioEconomic <- data.frame(HEQual, Employed, TwoCars, JTWPublic, HomeWork, LLTI, UnpaidCare)
  Employment <- data.frame(Students, Unemployed, EconInactFam, Agric, Construction, Manufacturing, Commerce, Transport, Public, Professional)
  Misc  <- data.frame(Internet, Broadband)

  DerivedData <- data.frame(Place, Demographic, HouseholdComposition, Housing, SocioEconomic, Employment, Misc)
  # Optionally replace NA/NaN values (e.g. from 0/0 denominators) with 0
  if (na.rm) DerivedData[which(is.na(DerivedData),arr.ind=T)] <- 0

  DerivedData
}

This section of the code organises the data that we are going to use throughout the analysis. The original data set comprises the 40 variables that were used for the Ireland Geodemographics Tool. For the Dublin Geodemographic Tool, some further analysis is carried out to select the variables, as described in the following step.

Some of the Small Areas in Dublin were excluded from the study, as including their data distorted the analysis and the results.

SADatad<-read.csv(file="SAPS2016_SA2017_dublin.csv", header = TRUE, sep=",", stringsAsFactors=FALSE)
SAVarsd<-CreateVariables(SADatad)
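If the exclusions are not already reflected in the Dublin CSV, they can be applied before deriving the variables. A minimal sketch, with excluded_ids as a hypothetical placeholder vector (the actual list of excluded Small Areas is not given in this document):

# Hypothetical placeholder GEOGIDs; substitute the real excluded Small Areas
excluded_ids <- c("268001001", "268001002")
SADatad <- SADatad[!(substring(SADatad$GEOGID, 8) %in% excluded_ids), ]
SAVarsd <- CreateVariables(SADatad)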

It is advisable to check the derivation of the variables to be included in the analysis beforehand. To understand the effect of each variable in this study, Sparse Principal Component Analysis is applied. The first four sparse components explain about 81% of the variation. Inspecting the loadings of these four components shows that some variables have zero loadings there. However, examining those same variables across all components shows that, although small, they may still carry some effect on the analysis. Therefore, only the variables with no significant effect anywhere are excluded from the data set. In the Dublin case, seven such variables (number of rooms per household; number of people per room; agriculture; construction; central heating; unpaid care; bad or very bad health condition, LLTI) were excluded from the study.

library(sparsepca)
# Robust sparse PCA on all 40 derived variables (column 1 is the GEOGID)
SPCA <- robspca(SAVarsd[, -1])
summary(SPCA)
##                             PC1     PC2     PC3     PC4     PC5     PC6
## Explained variance     2741.962 982.396 326.132 270.215 155.184 128.437
## Standard deviations      52.364  31.343  18.059  16.438  12.457  11.333
## Proportion of variance    0.519   0.186   0.062   0.051   0.029   0.024
## Cumulative proportion     0.519   0.704   0.766   0.817   0.847   0.871
##                           PC7    PC8    PC9   PC10   PC11   PC12   PC13
## Explained variance     94.209 80.807 58.812 54.452 49.328 44.621 40.240
## Standard deviations     9.706  8.989  7.669  7.379  7.023  6.680  6.344
## Proportion of variance  0.018  0.015  0.011  0.010  0.009  0.008  0.008
## Cumulative proportion   0.889  0.904  0.915  0.925  0.935  0.943  0.951
##                          PC14   PC15   PC16   PC17   PC18   PC19   PC20
## Explained variance     37.279 24.182 23.020 20.071 16.309 13.415 12.305
## Standard deviations     6.106  4.918  4.798  4.480  4.038  3.663  3.508
## Proportion of variance  0.007  0.005  0.004  0.004  0.003  0.003  0.002
## Cumulative proportion   0.958  0.962  0.967  0.971  0.974  0.976  0.979
##                          PC21  PC22  PC23  PC24  PC25  PC26  PC27  PC28
## Explained variance     10.814 8.972 8.809 7.449 6.802 6.341 5.646 3.946
## Standard deviations     3.289 2.995 2.968 2.729 2.608 2.518 2.376 1.986
## Proportion of variance  0.002 0.002 0.002 0.001 0.001 0.001 0.001 0.001
## Cumulative proportion   0.981 0.982 0.984 0.985 0.987 0.988 0.989 0.990
##                         PC29  PC30  PC31  PC32  PC33  PC34  PC35  PC36
## Explained variance     3.472 3.284 2.753 2.312 2.094 1.701 1.154 0.863
## Standard deviations    1.863 1.812 1.659 1.520 1.447 1.304 1.074 0.929
## Proportion of variance 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000
## Cumulative proportion  0.990 0.991 0.991 0.992 0.992 0.993 0.993 0.993
##                         PC37  PC38  PC39  PC40
## Explained variance     0.656 0.474 0.079 0.047
## Standard deviations    0.810 0.688 0.281 0.216
## Proportion of variance 0.000 0.000 0.000 0.000
## Cumulative proportion  0.993 0.993 0.993 0.993
# print(SPCA$loadings[1, ])  # inspect one variable's loadings across all components
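To see which variables carry little weight in the leading sparse components, the loadings matrix can be inspected directly. A minimal sketch, assuming robspca stores its loadings as a variables-by-components matrix whose rows follow the column order of SAVarsd[, -1]:

# Label the loadings with the variable names for readability
rownames(SPCA$loadings) <- colnames(SAVarsd)[-1]

# Largest absolute loading of each variable across the first four components;
# variables at the bottom of this ranking are candidates for exclusion
max_load <- apply(abs(SPCA$loadings[, 1:4]), 1, max)
sort(max_load)[1:10]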

# Standard PCA on the remaining 33 variables; the index vector drops GEOGID
# (column 1) and the seven excluded variables
PCA <- princomp(SAVarsd[, -c(1,19,20,21,28,29,33,34)], cor = T, scores = T)
PCA
## Call:
## princomp(x = SAVarsd[, -c(1, 19, 20, 21, 28, 29, 33, 34)], cor = T, 
##     scores = T)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 3.0745573 2.2527611 1.8996760 1.2933929 1.1924958 1.1086183 1.0695380 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 1.0303587 0.9377233 0.9092222 0.8775804 0.8754159 0.8138007 0.7515720 
##   Comp.15   Comp.16   Comp.17   Comp.18   Comp.19   Comp.20   Comp.21 
## 0.7127757 0.6677847 0.6064130 0.5853017 0.5598842 0.5233276 0.5055322 
##   Comp.22   Comp.23   Comp.24   Comp.25   Comp.26   Comp.27   Comp.28 
## 0.4745023 0.4545730 0.4181935 0.3956592 0.3747180 0.3644353 0.3234127 
##   Comp.29   Comp.30   Comp.31   Comp.32   Comp.33 
## 0.2768951 0.2649412 0.1911217 0.1529239 0.0960739 
## 
##  33  variables and  4880 observations.

For comparison purposes, and to give users a choice of clustering alternatives, two different algorithms are applied: k-means cluster analysis and the Partitioning Around Medoids (PAM) algorithm. As a first step, however, we run the principal component analysis. Having obtained the proportion of variance explained by each component, we then examine the cumulative proportion.

cumsum(PCA$sdev^2/sum(PCA$sdev^2))
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 0.2864516 0.4402374 0.5495941 0.6002869 0.6433793 0.6806227 0.7152867 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 0.7474576 0.7741038 0.7991549 0.8224927 0.8457155 0.8657843 0.8829013 
##   Comp.15   Comp.16   Comp.17   Comp.18   Comp.19   Comp.20   Comp.21 
## 0.8982967 0.9118100 0.9229535 0.9333347 0.9428338 0.9511329 0.9588772 
##   Comp.22   Comp.23   Comp.24   Comp.25   Comp.26   Comp.27   Comp.28 
## 0.9657000 0.9719617 0.9772613 0.9820051 0.9862601 0.9902847 0.9934543 
##   Comp.29   Comp.30   Comp.31   Comp.32   Comp.33 
## 0.9957777 0.9979047 0.9990116 0.9997203 1.0000000

From the cumulative variance explained by the components, it is noticeable that the first 11 components together explain about 82% of the variance. We use this number of components for the k-means cluster analysis. Note that the number of principal components retained should be updated based on your own analysis. Plotting the smallest cluster size against the total number of clusters is also useful when deciding on the number of clusters.

smallest.clus <- wss <- rep(0, 100)
set.seed(218833)  # Reproducible outcome

# For each candidate number of clusters, record the total within-cluster
# sum of squares and the size of the smallest cluster
for (i in 1:100) {
  clus <- kmeans(PCA$scores[, 1:11], i, iter.max = 100, nstart = 20)
  wss[i] <- clus$tot.withinss
  smallest.clus[i] <- min(clus$size)
}

plot(1:100, wss[1:100], type = "h", main = "Cluster Scree Plot", xlab = "Number of Clusters", 
     ylab = "Within Cluster Sum of Squares")

plot(1:100, smallest.clus[1:100], type = "h", main = "Smallest Cluster Plot", 
     xlab = "Number of Clusters", ylab = "Smallest Cluster Size")
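Beyond these two plots, other diagnostics can inform the choice of cluster count. One common option is the average silhouette width; the sketch below is not part of the original workflow, and it is somewhat slow because it computes all pairwise distances between Small Areas:

library(cluster)

avg_sil <- rep(0, 12)
d <- dist(PCA$scores[, 1:11])  # pairwise distances, reused for each k
for (k in 2:12) {
  clus <- kmeans(PCA$scores[, 1:11], k, iter.max = 100, nstart = 20)
  sil <- silhouette(clus$cluster, d)
  avg_sil[k] <- mean(sil[, "sil_width"])
}
plot(2:12, avg_sil[2:12], type = "b", xlab = "Number of Clusters",
     ylab = "Average Silhouette Width")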

In the Dublin case, once the number of clusters exceeds 6, very small clusters become more common. Some further investigation is needed before settling on the best number of clusters, and there are several different approaches to this. In the Dublin case, after some further tests, it was decided to base the classification on 7 groups. Using 7 clusters, we can recalculate the classification. Note that the number of clusters (7 here) should be updated based on your own analysis.

SAclusd <- kmeans(PCA$scores[, 1:11], 7, iter.max = 100, nstart = 20)
SAclustersd <- SAclusd$cluster

The following code creates z-scores for each cluster. These values can be used to identify the distinguishing characteristics of each variable within the clusters.

library(plyr)
# Cluster-wise means of each variable; the same column index vector works here
# because the cluster label occupies column 1 in place of GEOGID
mean_by_cluster <- ddply(SAVarsd, .(SAclustersd), numcolwise(mean))[, -c(1,19,20,21,28,29,33,34)]

# Compute the column-wise means for *all* observations
mean_by_col <- apply(SAVarsd[, -c(1,19,20,21,28,29,33,34)], 2, mean)

# Compute the column-wise *sample* sd's for *all* observations
sd_by_col <- apply(SAVarsd[, -c(1,19,20,21,28,29,33,34)], 2, sd)

# Create the z-scores via the 'scale' function
z_scoresd <- scale(mean_by_cluster, center = mean_by_col, scale = sd_by_col)
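To read off the profile of a single cluster, its row of z-scores can be sorted; for example (a usage sketch):

# Variables whose cluster mean sits far from the overall mean stand out
round(sort(z_scoresd[1, ]), 2)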

The heatmap function is one visual representation for investigating the varying characteristics of the clusters.

library(RColorBrewer)
heatmap(t(z_scoresd),
        scale = 'none',
        col=brewer.pal(6,'BrBG'),
        breaks=c(-1e10,-2,-1,0,1,2,+1e10),
        xlab='Cluster Number',
        add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))

We can also check the distribution of cluster sizes.

barplot(sort(table(SAclustersd)), las = 3)

As an alternative clustering technique, the PAM algorithm is applied to the classification. PAM is preferred over k-means here since it is more robust to noise and outliers, and it is also less likely to form very small clusters. From this point, the previous steps are repeated.

library(cluster)
# pamonce = 2 enables PAM's build/swap speed-ups (see ?pam)
PAMclusd <- pam(x = PCA$scores[, 1:11], k = 7, pamonce = 2)
PAMclustersd <- PAMclusd$clustering
mean_by_cluster2 <- ddply(SAVarsd, .(PAMclustersd), numcolwise(mean))[, -c(1,19,20,21,28,29,33,34)]
z_scores2d <- scale(mean_by_cluster2, center = mean_by_col, scale = sd_by_col)
heatmap(t(z_scores2d),
        scale = 'none',
        col=brewer.pal(7,'BrBG'),
        breaks=c(-1e10,-2,-1,-0.5,0.5,1,2,+1e10),
        margins = c(3,8),
        xlab='Cluster Number',
        add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))

barplot(sort(table(PAMclustersd)), las = 3)
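To see how far the two partitions agree, a quick check (a usage sketch; cluster labels are arbitrary, so rows and columns need not align one-to-one) is to cross-tabulate the assignments:

# Counts spread across a row or column indicate areas that the two
# algorithms place differently
table(kmeans = SAclustersd, pam = PAMclustersd)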

Mapping Diagnostics

There are many possible ways to map your clusters. Some of the alternatives are covered in this section.

# tmap option: 'n' is a previously saved spatial object holding the Dublin
# Small Area boundaries with the cluster labels attached
library(tmap)
load("n.RData")
tm_shape(n) +
  tm_fill("PAMclustersd", title = "clusters", legend.show = FALSE, palette = "Blues")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

# To add small-multiple facets, one panel per cluster:
tm_shape(n) +
  tm_fill("PAMclustersd", title = "clusters", legend.show = FALSE, palette = "Blues") +
  tm_facets("PAMclustersd")
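The object n loaded above is assumed to already carry the PAMclustersd column. If you are starting instead from a plain Small Area boundary layer, the cluster labels can be joined on GEOGID first. A minimal sketch, with the layer and file name assumed (hypothetical):

library(sf)
library(dplyr)

# Hypothetical boundary file; its GEOGID codes must match those kept by
# CreateVariables (the prefix-stripped identifiers)
sa_bounds <- st_read("dublin_small_areas.shp")
clusters <- data.frame(GEOGID = SAVarsd$GEOGID,
                       PAMclustersd = factor(PAMclustersd))
n <- left_join(sa_bounds, clusters, by = "GEOGID")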

Another alternative display style is a Leaflet-Shiny integration. This is a useful tool when you want some interactivity for your map and data.

You can see the live application through this link: Leaflet-Shiny example.
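A minimal leaflet sketch of the non-Shiny core (assuming the sf object n from above with PAMclustersd as a factor; the Shiny wrapper is omitted):

library(leaflet)

# Assumes n is in WGS84 (EPSG:4326), which leaflet requires
pal <- colorFactor("Blues", domain = n$PAMclustersd)
leaflet(n) %>%
  addTiles() %>%
  addPolygons(fillColor = ~pal(PAMclustersd), fillOpacity = 0.7,
              weight = 0.5, color = "white") %>%
  addLegend(pal = pal, values = ~PAMclustersd, title = "Cluster")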

References

(1) Dublin Dashboard Tools Page, for more information (https://www.dublindashboard.ie/pages/index).

(2) 2011 Ireland Population Census Geodemographics Study (https://rpubs.com/chrisbrunsdon/14998).

(3) 2016 Ireland Population Census Geodemographics Study (https://rpubs.com/burcinwalsh/343141).