Burcin Yazgi Walsh
NCG, MU SSI, Maynooth University
The Cork Geodemographic Tool is following the Ireland Geodemographic classification method but focusing on Cork for a better insight. For the Cork Geodemographic Tool, Small Areas are selected as spatial units and 2016 Census of Population results are used for socio-economic components.
Geodemographic classification is a tool that groups spatial units based on socio-economic characteristics. The classifications are the result of similarities found in data through the analysis of 35 different variables under the themes of age and marital status; ethnicity; housing; socio-economic group; education; commuting; occupation type; motor car availability and internet access.
All these geodemographic tools are based on the study done for 2011 Ireland Census Data by Chris Brunsdon, Jan Rigby and Martin Charlton. For detailed information for that study you can check this link (https://rpubs.com/chrisbrunsdon/14998).
The original census data used in this analyis can be found through the link below (http://www.cso.ie/en/census/census2016reports/census2016smallareapopulationstatistics/).
CreateVariables <- function(SA2016,na.rm=TRUE) {
attach(SA2016)
Age0_4 <- 100 * ( T1_1AGE0T + T1_1AGE1T + T1_1AGE2T + T1_1AGE3T + T1_1AGE4T ) / T1_1AGETT
Age5_14 <- 100 * ( T1_1AGE5T + T1_1AGE6T + T1_1AGE7T + T1_1AGE8T + T1_1AGE9T +
T1_1AGE10T + T1_1AGE11T + T1_1AGE12T + T1_1AGE13T + T1_1AGE14T) / T1_1AGETT
Age25_44 <- 100 * ( T1_1AGE25_29T + T1_1AGE30_34T + T1_1AGE35_39T + T1_1AGE40_44T ) / T1_1AGETT
Age45_64 <- 100 * ( T1_1AGE45_49T + T1_1AGE50_54T + T1_1AGE55_59T + T1_1AGE60_64T ) / T1_1AGETT
Age65over <- 100 * ( T1_1AGE65_69T + T1_1AGE70_74T + T1_1AGE75_79T + T1_1AGE80_84T + T1_1AGEGE_85T ) / T1_1AGETT
EU_National <- 100 * (T2_1UKN + T2_1PLN + T2_1LTN + T2_1EUN) / (T2_1TN - T2_1NSN)
ROW_National <- 100 * (T2_1RWN) / (T2_1TN - T2_1NSN)
Born_outside_Ireland <- 100 * (T2_1TBP - T2_1IEBP) / (T2_1TN - T2_1NSN)
Separated <- 100 * (T1_2SEPT + T1_2DIVT) / T1_2T
SinglePerson <- 100 * T5_2_1PP / T5_2_TP
Pensioner <- 100 * T4_5RP / T4_5TP
LoneParent <- 100 * (T4_3FOPFCT + T4_3FOPMCT) / T4_5TF
DINK <- 100 * T4_5PFF / T4_5TF
NonDependentKids <- 100 * T4_4AGE_GE20F / T4_4TF
RentPublic <- 100 * T6_3_RLAH / (T6_3_TH - T6_3_NSH)
RentPrivate <- 100 * T6_3_RPLH / (T6_3_TH - T6_3_NSH)
Flats <- 100 * T6_1_FA_H / (T6_1_TH - T6_1_NS_H)
NoCenHeat <- 100 * T6_5_NCH / (T6_5_T - T6_5_NS)
RoomsHH <- (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH ) / (T6_4_TH - T6_4_NSH)
PeopleRoom <- T1_1AGETT / (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH + T6_4_NSH)
SepticTank <- 100 * T6_7_IST / (T6_7_T - T6_7_NS)
HEQual <- 100 * ((T10_4_ODNDT + T10_4_HDPQT + T10_4_PDT + T10_4_DT) / (T10_4_TT - T10_4_NST))
Employed <- 100 * T8_1_WT / T8_1_TT
TwoCars <- 100 * (T15_1_2C + T15_1_3C + T15_1_GE4C) / (T15_1_TC - T15_1_NSC)
JTWPublic <- 100 * (T11_1_BUW + T11_1_TDLW) / (T11_1_TW - T11_1_NSW)
HomeWork <- 100 * T9_2_PH / T9_2_PT
LLTI <- 100 * (T12_3_BT + T12_3_VBT) / (T12_3_TT - T12_3_NST)
UnpaidCare <- 100 * T12_2_T / T1_1AGETT
Students <- 100 * T8_1_ST / T8_1_TT
Unemployed <- 100 * T8_1_ULGUPJT / T8_1_TT
EconInactFam <- 100 * T8_1_LAHFT / T8_1_TT
Agric <- 100 * T14_1_AFFT / T14_1_TT
Construction <- 100 * T14_1_BCT / T14_1_TT
Manufacturing <- 100 * T14_1_MIT / T14_1_TT
Commerce <- 100 * T14_1_CTT / T14_1_TT
Transport <- 100 * T14_1_TCT / T14_1_TT
Public <- 100 * T14_1_PAT / T14_1_TT
Professional <- 100 * T14_1_PST / T14_1_TT
Broadband <- 100 * T15_3_B / (T15_3_B + T15_3_OTH)
Internet <- 100 *(T15_3_B + T15_3_OTH) / (T15_3_T - T15_3_NS)
Place <- data.frame(substring(GEOGID, 8),stringsAsFactors=FALSE)
detach(SA2016)
colnames(Place)[1] <- 'GEOGID'
Demographic <- data.frame(Age0_4, Age5_14, Age25_44, Age45_64, Age65over, EU_National, ROW_National, Born_outside_Ireland)
HouseholdComposition <- data.frame(Separated, SinglePerson, Pensioner, LoneParent, DINK, NonDependentKids)
Housing <- data.frame(RentPublic, RentPrivate, Flats, NoCenHeat, RoomsHH, PeopleRoom, SepticTank)
SocioEconomic <- data.frame(HEQual, Employed, TwoCars, JTWPublic, HomeWork, LLTI, UnpaidCare)
Employment <- data.frame(Students, Unemployed, EconInactFam, Agric, Construction, Manufacturing, Commerce, Transport, Public, Professional)
Misc <- data.frame(Internet, Broadband)
DerivedData <- data.frame(Place, Demographic, HouseholdComposition, Housing, SocioEconomic, Employment, Misc)
if (na.rm) DerivedData[which(is.na(DerivedData),arr.ind=T)] <- 0
DerivedData
}
This section of the code is to organise the data that we are going to use through the analysis. The original data set is based on 40 different variables that were used for the Ireland Geodemographics Tool. For Cork Geodemographic Tool some further analysis is done to explore the selected variables which will be described in the following step.
SADatac<-read.csv(file="SAPS2016_SA2017_cork.csv", header = TRUE, sep=",", stringsAsFactors=FALSE)
SAVarsc<-CreateVariables(SADatac)
It is advised to check the derivation of the variables that will be included in the analysis beforehand. In order to understand the effect of each variable in this study, Sparse Principal Component Analysis is applied. The first four PCs were able to explain the 80 % of the variation. When we check the loadings for these four PCAs, we can notice that some of the variables have ‘0’ value. On the other hand, if you explore the effect of these ‘0’ valued variables in all PCAs, it is noticeable that even though it is small they may still have some effect on the analysis. Therefore only the variables without any significant effect are excluded from the dataset. In Cork case, five of these variables (Number of rooms per household; Number of people per room; Central heating; Unpaid care; Health condition including bad and very bad - LLTI) were excluded from the study.
library(sparsepca)
SPCA <- robspca(SAVarsc[, -1])
summary(SPCA)
## PC1 PC2 PC3 PC4 PC5 PC6
## Explained variance 2078.609 916.069 626.408 269.996 158.896 108.931
## Standard deviations 45.592 30.267 25.028 16.432 12.605 10.437
## Proportion of variance 0.431 0.190 0.130 0.056 0.033 0.023
## Cumulative proportion 0.431 0.621 0.750 0.806 0.839 0.862
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Explained variance 85.574 74.329 63.510 52.067 41.392 39.336 33.876
## Standard deviations 9.251 8.621 7.969 7.216 6.434 6.272 5.820
## Proportion of variance 0.018 0.015 0.013 0.011 0.009 0.008 0.007
## Cumulative proportion 0.880 0.895 0.908 0.919 0.927 0.936 0.943
## PC14 PC15 PC16 PC17 PC18 PC19 PC20
## Explained variance 30.132 27.202 21.302 20.202 19.586 18.836 14.615
## Standard deviations 5.489 5.216 4.615 4.495 4.426 4.340 3.823
## Proportion of variance 0.006 0.006 0.004 0.004 0.004 0.004 0.003
## Cumulative proportion 0.949 0.955 0.959 0.963 0.967 0.971 0.974
## PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Explained variance 12.156 11.181 9.124 9.049 8.335 7.886 6.729 6.475
## Standard deviations 3.487 3.344 3.021 3.008 2.887 2.808 2.594 2.545
## Proportion of variance 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001
## Cumulative proportion 0.977 0.979 0.981 0.983 0.984 0.986 0.987 0.989
## PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36
## Explained variance 4.279 3.652 3.456 2.771 2.458 2.155 1.464 1.107
## Standard deviations 2.069 1.911 1.859 1.665 1.568 1.468 1.210 1.052
## Proportion of variance 0.001 0.001 0.001 0.001 0.001 0.000 0.000 0.000
## Cumulative proportion 0.990 0.990 0.991 0.992 0.992 0.993 0.993 0.993
## PC37 PC38 PC39 PC40
## Explained variance 1.058 0.476 0.064 0.005
## Standard deviations 1.029 0.690 0.253 0.073
## Proportion of variance 0.000 0.000 0.000 0.000
## Cumulative proportion 0.993 0.994 0.994 0.994
#print(SPCA$loadings[1,]) - to check the loading related to the variable
PCA <- princomp(SAVarsc[, -c(1,19,20,21,28,29)], cor = T, scores = T)
PCA
## Call:
## princomp(x = SAVarsc[, -c(1, 19, 20, 21, 28, 29)], cor = T, scores = T)
##
## Standard deviations:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 2.8881754 2.4430835 1.8066290 1.6976664 1.3183415 1.0746999 1.0583865
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## 1.0123510 1.0024692 0.9496385 0.9024610 0.8719464 0.8415815 0.7575714
## Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## 0.7174186 0.7047429 0.6798725 0.6359779 0.5900237 0.5665010 0.5407355
## Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28
## 0.5052826 0.4942481 0.4827786 0.4565467 0.4354216 0.4130546 0.3795681
## Comp.29 Comp.30 Comp.31 Comp.32 Comp.33 Comp.34 Comp.35
## 0.3520145 0.3311900 0.2924479 0.2639456 0.1905629 0.1444487 0.1061215
##
## 35 variables and 2186 observations.
For comparison purposes, as well as giving more options to users for different clustering alternatives, two different algorithms will be applied. One of them is k-means cluster analysis and the other one is the Partitioning Around Medoids (PAM) algorithm. As a first step though, we would be running the principal component analysis. After learning the proportion of the each component’s varince explanation, we will be investigating the cumulative effect.
cumsum(PCA$sdev^2/sum(PCA$sdev^2))
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.2383302 0.4088633 0.5021178 0.5844627 0.6341205 0.6671199 0.6991251
## Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
## 0.7284067 0.7571194 0.7828855 0.8061551 0.8278777 0.8481136 0.8645112
## Comp.15 Comp.16 Comp.17 Comp.18 Comp.19 Comp.20 Comp.21
## 0.8792166 0.8934070 0.9066134 0.9181697 0.9281162 0.9372854 0.9456396
## Comp.22 Comp.23 Comp.24 Comp.25 Comp.26 Comp.27 Comp.28
## 0.9529342 0.9599136 0.9665729 0.9725282 0.9779451 0.9828198 0.9869361
## Comp.29 Comp.30 Comp.31 Comp.32 Comp.33 Comp.34 Comp.35
## 0.9904765 0.9936104 0.9960540 0.9980445 0.9990821 0.9996782 1.0000000
As a result of the cumulative amount of the variance explained by the components, it is noticable that the first 11 components can explain the 80 % of the variance in total. We will be using this number for k-means cluster analysis. Note that the number of principal component considered should be updated based on your own analysis. It is useful to plot smallest cluster size against the total number of clusters for a better idea on the decision of cluster size.
smallest.clus <- wss <- rep(0, 100)
set.seed(218833) # Reproducible outcome
for (i in 1:100) {
clus <- kmeans(PCA$scores[, 1:11], i, iter.max = 100, nstart = 20)
wss[i] <- clus$tot.withinss
smallest.clus[i] <- min(clus$size)
}
plot(1:100, wss[1:100], type = "h", main = "Cluster Scree Plot", xlab = "Number of Clusters",
ylab = "Within Cluster Sum of Squares")
plot(1:100, smallest.clus[1:100], type = "h", main = "Smallest Cluster Plot",
xlab = "Number of Clusters", ylab = "Smallest Cluster Size")
In this case, when the number of the clusters is more than 6, it looks like very small clusters are being more common. Some further investigation is needed before the decision of best number of clusters and there are several different approaches on this. In this case, after some more tests, it is decided that classifications will be based on 7 different groups. Based on this info, using the cluster number 7, we can recalculate the classifications. Note that the number of the cluster ‘i’ should be updated based on your own analysis.
SAclusc <- kmeans(PCA$scores[, 1:11], 7, iter.max = 100, nstart = 20)
SAclustersc <- SAclusc$cluster
The following code creates z-scores for each cluster. These values can be used to identify the changing characteristics of each variables within the clusters.
library(plyr)
mean_by_cluster <- ddply(SAVarsc, .(SAclustersc), numcolwise(mean))[, -c(1,19,20,21,28,29)]
# Compute the column-wise means for *all* observations
mean_by_col <- apply(SAVarsc[, -c(1,19,20,21,28,29)], 2, mean)
# Compute the column-wise *sample* sd's for *all* observations
sd_by_col <- apply(SAVarsc[, -c(1,19,20,21,28,29)], 2, sd)
# Create the z-scores via the 'scale' function
z_scoresc <- scale(mean_by_cluster, center = mean_by_col, scale = sd_by_col)
The heatmap function is one of the visual representation to investigate the varying characteristic of the clusters.
library(RColorBrewer)
heatmap(t(z_scoresc),
scale = 'none',
col=brewer.pal(6,'BrBG'),
breaks=c(-1e10,-2,-1,0,1,2,+1e10),
xlab='Cluster Number',
add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))
We can also check the distribution of the size of each cluster group.
barplot(sort(table(SAclustersc)), las = 3)
An alternative clustering technique PAM algorithm is applied for the classifications. As a result, PAM is preferred over k-means since PAM is more robust to noise and outliers, compared to k-means. Also, PAM algorithm is less likely to form very small clusters. After this stage, previous steps will be repeated.
library(cluster)
PAMclusc <- pam(x = PCA$scores[, 1:11], k = 7, pamonce = 2)
PAMclustersc <- PAMclusc$clustering
mean_by_cluster2 <- ddply(SAVarsc, .(PAMclustersc), numcolwise(mean))[, -c(1,19,20,21,28,29)]
z_scores2c <- scale(mean_by_cluster2, center = mean_by_col, scale = sd_by_col)
heatmap(t(z_scores2c),
scale = 'none',
col=brewer.pal(7,'BrBG'),
breaks=c(-1e10,-2,-1,-0.5,0.5,1,2,+1e10),
margins = c(3,8),
xlab='Cluster Number',
add.expr=abline(h=(0:40)+0.5,v=(0:6)+0.5,col='white'))
barplot(sort(table(PAMclustersc)), las = 3)
Mapping Diagnostics
There are many possible ways to map your clusters. Some of the alternatives will be covered in this section.
#tmap option
library(tmap)
load("n_cork.RData")
tm_shape(n_cork)+
tm_fill("PAMclustersc", title="clusters", legend.show=FALSE, palette = "Blues")
## Linking to GEOS 3.7.2, GDAL 2.4.2, PROJ 5.2.0
#if you want to add facets as well
tm_shape(n_cork)+
tm_fill("PAMclustersc", title="clusters", legend.show=FALSE, palette = "Blues")+
tm_facets("PAMclustersc")
Another alternative display style is Leaflet-Shiny integration. This is a useful tool when you want some interactivity for your map and data. Here are some images from the output.
You can see the live application through this link
Leaflet-Shiny example
References
\(1\): For more information - Cork Dashboard Tools Page (https://www.corkdashboard.ie/pages/index).
\(2\): 2011 Ireland Population Census Geodemographics Study (https://rpubs.com/chrisbrunsdon/14998).
\(3\): 2016 Ireland Population Census Geodemographics Study (https://rpubs.com/burcinwalsh/343141).