Burcin Yazgi Walsh

NCG, MU SSI, Maynooth University

The Cork Geodemographic Tool is following the Ireland Geodemographic classification method but focusing on Cork for a better insight. For the Cork Geodemographic Tool, Small Areas are selected as spatial units and 2016 Census of Population results are used for socio-economic components.

Geodemographic classification is a tool that groups spatial units based on socio-economic characteristics. The classifications are the result of similarities found in data through the analysis of 35 different variables under the themes of age and marital status; ethnicity; housing; socio-economic group; education; commuting; occupation type; motor car availability and internet access.

All these geodemographic tools are based on the study done for 2011 Ireland Census Data by Chris Brunsdon, Jan Rigby and Martin Charlton. For detailed information for that study you can check this link (https://rpubs.com/chrisbrunsdon/14998).

The original census data used in this analyis can be found through the link below (http://www.cso.ie/en/census/census2016reports/census2016smallareapopulationstatistics/).

CreateVariables <- function(SA2016,na.rm=TRUE) {
  attach(SA2016)
  Age0_4    <- 100 * ( T1_1AGE0T + T1_1AGE1T + T1_1AGE2T + T1_1AGE3T + T1_1AGE4T ) / T1_1AGETT
  Age5_14   <- 100 * ( T1_1AGE5T + T1_1AGE6T + T1_1AGE7T + T1_1AGE8T + T1_1AGE9T +
                         T1_1AGE10T + T1_1AGE11T + T1_1AGE12T + T1_1AGE13T + T1_1AGE14T) / T1_1AGETT
  Age25_44  <- 100 * ( T1_1AGE25_29T + T1_1AGE30_34T + T1_1AGE35_39T + T1_1AGE40_44T ) / T1_1AGETT
  Age45_64  <- 100 * ( T1_1AGE45_49T + T1_1AGE50_54T + T1_1AGE55_59T + T1_1AGE60_64T ) / T1_1AGETT
  Age65over <- 100 * ( T1_1AGE65_69T + T1_1AGE70_74T + T1_1AGE75_79T + T1_1AGE80_84T + T1_1AGEGE_85T ) / T1_1AGETT
  
  EU_National           <- 100 * (T2_1UKN + T2_1PLN + T2_1LTN + T2_1EUN) / (T2_1TN - T2_1NSN)
  ROW_National          <- 100 * (T2_1RWN) / (T2_1TN - T2_1NSN)
  Born_outside_Ireland  <- 100 * (T2_1TBP - T2_1IEBP) / (T2_1TN - T2_1NSN)
  
  Separated            <- 100 * (T1_2SEPT + T1_2DIVT) / T1_2T
  SinglePerson         <- 100 * T5_2_1PP / T5_2_TP
  Pensioner            <- 100 * T4_5RP / T4_5TP
  LoneParent           <- 100 * (T4_3FOPFCT + T4_3FOPMCT) / T4_5TF
  DINK                 <- 100 * T4_5PFF / T4_5TF
  NonDependentKids     <- 100 * T4_4AGE_GE20F / T4_4TF
  
  RentPublic         <- 100 * T6_3_RLAH / (T6_3_TH - T6_3_NSH)
  RentPrivate        <- 100 * T6_3_RPLH / (T6_3_TH - T6_3_NSH)
  
  Flats              <- 100 * T6_1_FA_H / (T6_1_TH - T6_1_NS_H)
  NoCenHeat          <- 100 * T6_5_NCH / (T6_5_T - T6_5_NS)
  RoomsHH            <- (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH +   8*T6_4_GE8RH ) / (T6_4_TH - T6_4_NSH)
  PeopleRoom         <- T1_1AGETT / (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH + T6_4_NSH)
  SepticTank         <- 100 * T6_7_IST / (T6_7_T - T6_7_NS)
  
  HEQual              <-  100 * ((T10_4_ODNDT + T10_4_HDPQT + T10_4_PDT + T10_4_DT) / (T10_4_TT - T10_4_NST)) 
  Employed            <-  100 * T8_1_WT / T8_1_TT
  TwoCars             <-  100 * (T15_1_2C + T15_1_3C + T15_1_GE4C) / (T15_1_TC - T15_1_NSC)
  JTWPublic           <-  100 * (T11_1_BUW + T11_1_TDLW) / (T11_1_TW - T11_1_NSW)
  HomeWork            <-  100 * T9_2_PH / T9_2_PT
  LLTI                <-  100 * (T12_3_BT + T12_3_VBT) / (T12_3_TT - T12_3_NST)
  UnpaidCare          <-  100 * T12_2_T / T1_1AGETT
  
  Students           <- 100 * T8_1_ST / T8_1_TT
  Unemployed         <- 100 * T8_1_ULGUPJT / T8_1_TT
  
  EconInactFam       <- 100 * T8_1_LAHFT / T8_1_TT
  Agric              <- 100 * T14_1_AFFT / T14_1_TT 
  Construction       <- 100 * T14_1_BCT / T14_1_TT 
  Manufacturing      <- 100 * T14_1_MIT / T14_1_TT 
  Commerce           <- 100 * T14_1_CTT / T14_1_TT 
  Transport          <- 100 * T14_1_TCT / T14_1_TT 
  Public             <- 100 * T14_1_PAT / T14_1_TT 
  Professional       <- 100 * T14_1_PST / T14_1_TT 
  
  Broadband          <- 100 * T15_3_B / (T15_3_B + T15_3_OTH) 
  Internet           <- 100 *(T15_3_B + T15_3_OTH) / (T15_3_T - T15_3_NS) 
  
  Place <- data.frame(substring(GEOGID, 8),stringsAsFactors=FALSE)
  
  detach(SA2016)
  
  colnames(Place)[1] <- 'GEOGID'
  Demographic <- data.frame(Age0_4, Age5_14, Age25_44, Age45_64, Age65over, EU_National, ROW_National, Born_outside_Ireland)
  HouseholdComposition <- data.frame(Separated, SinglePerson, Pensioner, LoneParent, DINK, NonDependentKids)
  Housing <- data.frame(RentPublic, RentPrivate, Flats, NoCenHeat, RoomsHH, PeopleRoom, SepticTank)
  SocioEconomic <- data.frame(HEQual, Employed, TwoCars, JTWPublic, HomeWork, LLTI, UnpaidCare)
  Employment <- data.frame(Students, Unemployed, EconInactFam, Agric, Construction, Manufacturing, Commerce, Transport, Public, Professional)
  Misc  <- data.frame(Internet, Broadband)
  
  DerivedData <- data.frame(Place, Demographic, HouseholdComposition, Housing, SocioEconomic, Employment, Misc)
  if (na.rm) DerivedData[which(is.na(DerivedData),arr.ind=T)] <- 0                            
  
  DerivedData
}

This section of the code is to organise the data that we are going to use through the analysis. The original data set is based on 40 different variables that were used for the Ireland Geodemographics Tool. For Cork Geodemographic Tool some further analysis is done to explore the selected variables which will be described in the following step.

SADatac<-read.csv(file="SAPS2016_SA2017_cork.csv", header = TRUE, sep=",", stringsAsFactors=FALSE)
SAVarsc<-CreateVariables(SADatac)

It is advised to check the derivation of the variables that will be included in the analysis beforehand. In order to understand the effect of each variable in this study, Sparse Principal Component Analysis is applied. The first four PCs were able to explain the 80 % of the variation. When we check the loadings for these four PCAs, we can notice that some of the variables have ‘0’ value. On the other hand, if you explore the effect of these ‘0’ valued variables in all PCAs, it is noticeable that even though it is small they may still have some effect on the analysis. Therefore only the variables without any significant effect are excluded from the dataset. In Cork case, five of these variables (Number of rooms per household; Number of people per room; Central heating; Unpaid care; Health condition including bad and very bad - LLTI) were excluded from the study.

library(sparsepca)
SPCA <- robspca(SAVarsc[, -1])
summary(SPCA)
##                             PC1     PC2     PC3     PC4     PC5     PC6
## Explained variance     2078.609 916.069 626.408 269.996 158.896 108.931
## Standard deviations      45.592  30.267  25.028  16.432  12.605  10.437
## Proportion of variance    0.431   0.190   0.130   0.056   0.033   0.023
## Cumulative proportion     0.431   0.621   0.750   0.806   0.839   0.862
##                           PC7    PC8    PC9   PC10   PC11   PC12   PC13
## Explained variance     85.574 74.329 63.510 52.067 41.392 39.336 33.876
## Standard deviations     9.251  8.621  7.969  7.216  6.434  6.272  5.820
## Proportion of variance  0.018  0.015  0.013  0.011  0.009  0.008  0.007
## Cumulative proportion   0.880  0.895  0.908  0.919  0.927  0.936  0.943
##                          PC14   PC15   PC16   PC17   PC18   PC19   PC20
## Explained variance     30.132 27.202 21.302 20.202 19.586 18.836 14.615
## Standard deviations     5.489  5.216  4.615  4.495  4.426  4.340  3.823
## Proportion of variance  0.006  0.006  0.004  0.004  0.004  0.004  0.003
## Cumulative proportion   0.949  0.955  0.959  0.963  0.967  0.971  0.974
##                          PC21   PC22  PC23  PC24  PC25  PC26  PC27  PC28
## Explained variance     12.156 11.181 9.124 9.049 8.335 7.886 6.729 6.475
## Standard deviations     3.487  3.344 3.021 3.008 2.887 2.808 2.594 2.545
## Proportion of variance  0.003  0.002 0.002 0.002 0.002 0.002 0.001 0.001
## Cumulative proportion   0.977  0.979 0.981 0.983 0.984 0.986 0.987 0.989
##                         PC29  PC30  PC31  PC32  PC33  PC34  PC35  PC36
## Explained variance     4.279 3.652 3.456 2.771 2.458 2.155 1.464 1.107
## Standard deviations    2.069 1.911 1.859 1.665 1.568 1.468 1.210 1.052
## Proportion of variance 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000
## Cumulative proportion  0.990 0.990 0.991 0.992 0.992 0.993 0.993 0.993
##                         PC37  PC38  PC39  PC40
## Explained variance     1.058 0.476 0.064 0.005
## Standard deviations    1.029 0.690 0.253 0.073
## Proportion of variance 0.000 0.000 0.000 0.000
## Cumulative proportion  0.993 0.994 0.994 0.994
#print(SPCA$loadings[1,]) - to check the loading related to the variable

PCA <- princomp(SAVarsc[, -c(1,19,20,21,28,29)], cor = T, scores = T)
PCA
## Call:
## princomp(x = SAVarsc[, -c(1, 19, 20, 21, 28, 29)], cor = T, scores = T)
## 
## Standard deviations:
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 2.8881754 2.4430835 1.8066290 1.6976664 1.3183415 1.0746999 1.0583865 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 1.0123510 1.0024692 0.9496385 0.9024610 0.8719464 0.8415815 0.7575714 
##   Comp.15   Comp.16   Comp.17   Comp.18   Comp.19   Comp.20   Comp.21 
## 0.7174186 0.7047429 0.6798725 0.6359779 0.5900237 0.5665010 0.5407355 
##   Comp.22   Comp.23   Comp.24   Comp.25   Comp.26   Comp.27   Comp.28 
## 0.5052826 0.4942481 0.4827786 0.4565467 0.4354216 0.4130546 0.3795681 
##   Comp.29   Comp.30   Comp.31   Comp.32   Comp.33   Comp.34   Comp.35 
## 0.3520145 0.3311900 0.2924479 0.2639456 0.1905629 0.1444487 0.1061215 
## 
##  35  variables and  2186 observations.

For comparison purposes, as well as giving more options to users for different clustering alternatives, two different algorithms will be applied. One of them is k-means cluster analysis and the other one is the Partitioning Around Medoids (PAM) algorithm. As a first step though, we would be running the principal component analysis. After learning the proportion of the each component’s varince explanation, we will be investigating the cumulative effect.

cumsum(PCA$sdev^2/sum(PCA$sdev^2))
##    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7 
## 0.2383302 0.4088633 0.5021178 0.5844627 0.6341205 0.6671199 0.6991251 
##    Comp.8    Comp.9   Comp.10   Comp.11   Comp.12   Comp.13   Comp.14 
## 0.7284067 0.7571194 0.7828855 0.8061551 0.8278777 0.8481136 0.8645112 
##   Comp.15   Comp.16   Comp.17   Comp.18   Comp.19   Comp.20   Comp.21 
## 0.8792166 0.8934070 0.9066134 0.9181697 0.9281162 0.9372854 0.9456396 
##   Comp.22   Comp.23   Comp.24   Comp.25   Comp.26   Comp.27   Comp.28 
## 0.9529342 0.9599136 0.9665729 0.9725282 0.9779451 0.9828198 0.9869361 
##   Comp.29   Comp.30   Comp.31   Comp.32   Comp.33   Comp.34   Comp.35 
## 0.9904765 0.9936104 0.9960540 0.9980445 0.9990821 0.9996782 1.0000000

As a result of the cumulative amount of the variance explained by the components, it is noticable that the first 11 components can explain the 80 % of the variance in total. We will be using this number for k-means cluster analysis. Note that the number of principal component considered should be updated based on your own analysis. It is useful to plot smallest cluster size against the total number of clusters for a better idea on the decision of cluster size.

smallest.clus <- wss <- rep(0, 100)
set.seed(218833)  # Reproducible outcome

for (i in 1:100) {
  clus <- kmeans(PCA$scores[, 1:11], i, iter.max = 100, nstart = 20)
  wss[i] <- clus$tot.withinss
  smallest.clus[i] <- min(clus$size)
}

plot(1:100, wss[1:100], type = "h", main = "Cluster Scree Plot", xlab = "Number of Clusters", 
     ylab = "Within Cluster Sum of Squares")

plot(1:100, smallest.clus[1:100], type = "h", main = "Smallest Cluster Plot", 
     xlab = "Number of Clusters", ylab = "Smallest Cluster Size")

In this case, when the number of the clusters is more than 6, it looks like very small clusters are being more common. Some further investigation is needed before the decision of best number of clusters and there are several different approaches on this. In this case, after some more tests, it is decided that classifications will be based on 7 different groups. Based on this info, using the cluster number 7, we can recalculate the classifications. Note that the number of the cluster ‘i’ should be updated based on your own analysis.

SAclusc <- kmeans(PCA$scores[, 1:11], 7, iter.max = 100, nstart = 20)
SAclustersc <- SAclusc$cluster

The following code creates z-scores for each cluster. These values can be used to identify the changing characteristics of each variables within the clusters.

library(plyr)
mean_by_cluster <- ddply(SAVarsc, .(SAclustersc), numcolwise(mean))[, -c(1,19,20,21,28,29)]

# Compute the column-wise means for *all* observations
mean_by_col <- apply(SAVarsc[, -c(1,19,20,21,28,29)], 2, mean)

# Compute the column-wise *sample* sd's for *all* observations
sd_by_col <- apply(SAVarsc[, -c(1,19,20,21,28,29)], 2, sd)

# Create the z-scores via the 'scale' function
z_scoresc <- scale(mean_by_cluster, center = mean_by_col, scale = sd_by_col)

The heatmap function is one of the visual representation to investigate the varying characteristic of the clusters.

library(RColorBrewer)
heatmap(t(z_scoresc),
        scale = 'none',
        col=brewer.pal(6,'BrBG'),
        breaks=c(-1e10,-2,-1,0,1,2,+1e10),
        xlab='Cluster Number',
        add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))

We can also check the distribution of the size of each cluster group.

barplot(sort(table(SAclustersc)), las = 3)

An alternative clustering technique PAM algorithm is applied for the classifications. As a result, PAM is preferred over k-means since PAM is more robust to noise and outliers, compared to k-means. Also, PAM algorithm is less likely to form very small clusters. After this stage, previous steps will be repeated.

library(cluster)
PAMclusc <- pam(x = PCA$scores[, 1:11], k = 7, pamonce = 2)
PAMclustersc <- PAMclusc$clustering
mean_by_cluster2 <- ddply(SAVarsc, .(PAMclustersc), numcolwise(mean))[, -c(1,19,20,21,28,29)]
z_scores2c <- scale(mean_by_cluster2, center = mean_by_col, scale = sd_by_col)
heatmap(t(z_scores2c),
        scale = 'none',
        col=brewer.pal(7,'BrBG'),
        breaks=c(-1e10,-2,-1,-0.5,0.5,1,2,+1e10),
        margins = c(3,8),
        xlab='Cluster Number',
        add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))

barplot(sort(table(PAMclustersc)), las = 3)

Mapping Diagnostics

There are many possible ways to map your clusters. Some of the alternatives will be covered in this section.

#tmap option
library(tmap)
load("n_cork.RData")
tm_shape(n_cork)+
  tm_fill("PAMclustersc", title="clusters", legend.show=FALSE, palette = "Blues")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0

#if you want to add facets as well
tm_shape(n_cork)+
  tm_fill("PAMclustersc", title="clusters", legend.show=FALSE, palette = "Blues")+
  tm_facets("PAMclustersc")

Another alternative display style is Leaflet-Shiny integration. This is a useful tool when you want some interactivity for your map and data. Here are some images from the output.

You can see the live application through this link
Leaflet-Shiny example







References

\(1\): For more information - Cork Dashboard Tools Page (https://www.corkdashboard.ie/pages/index).

\(2\): 2011 Ireland Population Census Geodemographics Study (https://rpubs.com/chrisbrunsdon/14998).

\(3\): 2016 Ireland Population Census Geodemographics Study (https://rpubs.com/burcinwalsh/343141).