Burcin Yazgi Walsh, Chris Brunsdon
NCG, MU SSI, Maynooth University
After working with hard clustering techniques at both the national and the city level, we wanted to try soft clustering techniques for geodemographic studies to see whether they offer an improvement. Fuzzy approaches can be more robust to noise in the data, and they give the opportunity to represent geographical areas in more detail than a strict single-group classification allows.
This study follows the same principles as our Dublin Geodemographics analysis (1) for the initial steps. For the general background and for the selection and elimination of variables, see https://rpubs.com/burcinwalsh/620510. In the Dublin case, seven of the forty variables used at the national level (2, 3) had no significant effect and were excluded from the study: number of rooms per household; number of people per room; agriculture; construction; central heating; unpaid care; and health condition, bad or very bad (LLTI).
The code below derives the variables from the latest (2016) Ireland Population Census dataset.
library(e1071)
library(advclust)
library(tmap)
CreateVariables <- function(SA2016, na.rm=TRUE) {
attach(SA2016)
Age0_4 <- 100 * ( T1_1AGE0T + T1_1AGE1T + T1_1AGE2T + T1_1AGE3T + T1_1AGE4T ) / T1_1AGETT
Age5_14 <- 100 * ( T1_1AGE5T + T1_1AGE6T + T1_1AGE7T + T1_1AGE8T + T1_1AGE9T +
T1_1AGE10T + T1_1AGE11T + T1_1AGE12T + T1_1AGE13T + T1_1AGE14T) / T1_1AGETT
Age25_44 <- 100 * ( T1_1AGE25_29T + T1_1AGE30_34T + T1_1AGE35_39T + T1_1AGE40_44T ) / T1_1AGETT
Age45_64 <- 100 * ( T1_1AGE45_49T + T1_1AGE50_54T + T1_1AGE55_59T + T1_1AGE60_64T ) / T1_1AGETT
Age65over <- 100 * ( T1_1AGE65_69T + T1_1AGE70_74T + T1_1AGE75_79T + T1_1AGE80_84T + T1_1AGEGE_85T ) / T1_1AGETT
EU_National <- 100 * (T2_1UKN + T2_1PLN + T2_1LTN + T2_1EUN) / (T2_1TN - T2_1NSN)
ROW_National <- 100 * (T2_1RWN) / (T2_1TN - T2_1NSN)
Born_outside_Ireland <- 100 * (T2_1TBP - T2_1IEBP) / (T2_1TN - T2_1NSN)
Separated <- 100 * (T1_2SEPT + T1_2DIVT) / T1_2T
SinglePerson <- 100 * T5_2_1PP / T5_2_TP
Pensioner <- 100 * T4_5RP / T4_5TP
LoneParent <- 100 * (T4_3FOPFCT + T4_3FOPMCT) / T4_5TF
DINK <- 100 * T4_5PFF / T4_5TF
NonDependentKids <- 100 * T4_4AGE_GE20F / T4_4TF
RentPublic <- 100 * T6_3_RLAH / (T6_3_TH - T6_3_NSH)
RentPrivate <- 100 * T6_3_RPLH / (T6_3_TH - T6_3_NSH)
Flats <- 100 * T6_1_FA_H / (T6_1_TH - T6_1_NS_H)
NoCenHeat <- 100 * T6_5_NCH / (T6_5_T - T6_5_NS)
RoomsHH <- (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH ) / (T6_4_TH - T6_4_NSH)
PeopleRoom <- T1_1AGETT / (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH + T6_4_NSH)
SepticTank <- 100 * T6_7_IST / (T6_7_T - T6_7_NS)
HEQual <- 100 * ((T10_4_ODNDT + T10_4_HDPQT + T10_4_PDT + T10_4_DT) / (T10_4_TT - T10_4_NST)) # educ to degree or higher
Employed <- 100 * T8_1_WT / T8_1_TT
TwoCars <- 100 * (T15_1_2C + T15_1_3C + T15_1_GE4C) / (T15_1_TC - T15_1_NSC)
JTWPublic <- 100 * (T11_1_BUW + T11_1_TDLW) / (T11_1_TW - T11_1_NSW)
HomeWork <- 100 * T9_2_PH / T9_2_PT
LLTI <- 100 * (T12_3_BT + T12_3_VBT) / (T12_3_TT - T12_3_NST)
UnpaidCare <- 100 * T12_2_T / T1_1AGETT
Students <- 100 * T8_1_ST / T8_1_TT
Unemployed <- 100 * T8_1_ULGUPJT / T8_1_TT
EconInactFam <- 100 * T8_1_LAHFT / T8_1_TT
Agric <- 100 * T14_1_AFFT / T14_1_TT
Construction <- 100 * T14_1_BCT / T14_1_TT
Manufacturing <- 100 * T14_1_MIT / T14_1_TT
Commerce <- 100 * T14_1_CTT / T14_1_TT
Transport <- 100 * T14_1_TCT / T14_1_TT
Public <- 100 * T14_1_PAT / T14_1_TT
Professional <- 100 * T14_1_PST / T14_1_TT
#MISC
Broadband <- 100 * T15_3_B / (T15_3_B + T15_3_OTH) # Internet connected HH with Broadband
Internet <- 100 *(T15_3_B + T15_3_OTH) / (T15_3_T - T15_3_NS) # Households with Internet
Place <- data.frame(substring(GEOGID, 8), stringsAsFactors=FALSE)
detach(SA2016)
### Bringing it all together
colnames(Place)[1] <- 'GEOGID'
Demographic <- data.frame(Age0_4, Age5_14, Age25_44, Age45_64, Age65over, EU_National, ROW_National, Born_outside_Ireland)
HouseholdComposition <- data.frame(Separated, SinglePerson, Pensioner, LoneParent, DINK, NonDependentKids)
Housing <- data.frame(RentPublic, RentPrivate, Flats, NoCenHeat, RoomsHH, PeopleRoom, SepticTank)
SocioEconomic <- data.frame(HEQual, Employed, TwoCars, JTWPublic, HomeWork, LLTI, UnpaidCare)
Employment <- data.frame(Students, Unemployed, EconInactFam, Agric, Construction, Manufacturing, Commerce, Transport, Public, Professional)
Misc <- data.frame(Internet, Broadband)
DerivedData <- data.frame(Place, Demographic, HouseholdComposition, Housing, SocioEconomic, Employment, Misc)
# Replace NA/NaN values (e.g. 0/0 in areas with an empty denominator) with 0
if (na.rm) DerivedData[is.na(DerivedData)] <- 0
DerivedData
}
SADatad <- read.csv(file="SAPS2016_SA2017_dublin.csv", header = TRUE, sep=",", stringsAsFactors=FALSE)
SAVarsd <- CreateVariables(SADatad)
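The clustering calls further down drop columns by position (`-c(1,19,20,21,28,29,33,34)`). Given the column order produced by `CreateVariables`, those positions correspond to `GEOGID` plus the seven excluded variables. A minimal sketch of making the same selection by name, which is less fragile if the column order ever changes, could look like this. The data frame `toy` is a made-up stand-in for `SAVarsd`, and `excluded` and `cluster_input` are illustrative names that do not appear in the original code.

```r
# Toy stand-in for SAVarsd with a handful of the derived columns (values are made up)
toy <- data.frame(
  GEOGID    = c("A1", "A2", "A3"),
  Age0_4    = c(6.1, 7.3, 5.2),
  NoCenHeat = c(1.0, 0.5, 2.1),
  RoomsHH   = c(5.2, 4.8, 6.0),
  Flats     = c(12.0, 40.5, 3.3),
  stringsAsFactors = FALSE
)

# Identifier plus the seven variables excluded in the Dublin case
excluded <- c("GEOGID", "NoCenHeat", "RoomsHH", "PeopleRoom",
              "LLTI", "UnpaidCare", "Agric", "Construction")

# Keep every column whose name is not in the exclusion list
cluster_input <- toy[, !(names(toy) %in% excluded), drop = FALSE]
names(cluster_input)
```

With the real data, `toy` would simply be replaced by `SAVarsd`.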
One of the most common soft clustering techniques is fuzzy c-means (FCM). With the FCM algorithm, each spatial unit receives a membership value between 0 and 1 for every group. This makes it possible to examine the spatial units in more depth and avoids assigning each unit to a single category, which is exactly what happens in hard clustering.
FCM relies on an exponent that controls how fuzzy the outcomes are. This exponent, ‘m’, is also referred to as the fuzziness parameter. There is no single optimal value for ‘m’; it is generally advised that good results can be obtained when ‘m’ is between 1.5 and 3.0.
To evaluate the fuzziness parameter and the number of clusters in a meaningful way, validation tools are useful in fuzzy applications. Some of the more widely used validation measures are the Partition Coefficient (PC) and Classification Entropy (CE) (Bezdek, 1981), the Xie-Beni index (XB) (Xie and Beni, 1991) and the Separation index (S) (Bensaid et al., 1996). Each measure works in a different way; for the best fit, PC should preferably be at its highest while CE, XB and S are at their lowest. It is important to note that the generally accepted approach is to look for a range of ‘good’ estimates rather than a single best value.
The code below runs varying values of ‘m’ (from 1.1 to 2.0; a larger ‘m’ means a higher level of fuzziness) for different numbers of clusters ‘K’ (from 3 to 7 in this case) and prints the validation indices for each run.
Note that you have to replace ‘SAVarsd’ with your own data, and that this step takes some time.
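To make the role of ‘m’ more concrete, here is a small base R sketch of the standard FCM membership formula for a single observation, given its distances to the cluster centres. The function name `fcm_membership` and the distances are illustrative assumptions; this is not how `fuzzy.CM` is called, only the textbook update rule it is built around.

```r
# Fuzzy c-means membership of one observation, given its distances to K cluster centres:
# u_k = 1 / sum_j (d_k / d_j)^(2 / (m - 1))
fcm_membership <- function(d, m) {
  stopifnot(m > 1, all(d > 0))
  w <- d^(-2 / (m - 1))   # closer centres get larger weights
  w / sum(w)              # normalise so the memberships sum to 1
}

d <- c(1, 2, 4)   # illustrative distances to three cluster centres

round(fcm_membership(d, m = 1.5), 3)   # 0.938 0.059 0.004: close to a hard assignment
round(fcm_membership(d, m = 3.0), 3)   # 0.571 0.286 0.143: memberships spread out more
```

The same distances give a nearly crisp result for a small ‘m’ and a much more even spread for a large one, which is why the choice of ‘m’ matters for the interpretation of the clusters.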
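As an illustration of what two of these measures capture, the sketch below computes PC and CE directly from a membership matrix using their textbook definitions. The function names and the three example matrices are assumptions made for this illustration; in the actual analysis the indices come from `validation.index` in advclust. XB and S are left out because they also need the data and the cluster centres.

```r
# Partition Coefficient: mean of squared memberships; 1/K when fully fuzzy, 1 when crisp
partition_coefficient <- function(U) sum(U^2) / nrow(U)

# Classification Entropy: 0 for a crisp partition, log(K) for a fully fuzzy one
classification_entropy <- function(U) {
  terms <- ifelse(U > 0, U * log(U), 0)   # treat 0 * log(0) as 0
  -sum(terms) / nrow(U)
}

# Rows are spatial units, columns are clusters (made-up values)
crisp <- rbind(c(1, 0, 0), c(0, 1, 0))
fuzzy <- rbind(c(1, 1, 1) / 3, c(1, 1, 1) / 3)
mixed <- rbind(c(0.8, 0.1, 0.1), c(0.4, 0.4, 0.2))

examples <- list(crisp = crisp, mixed = mixed, fuzzy = fuzzy)
sapply(examples, partition_coefficient)
sapply(examples, classification_entropy)
```

PC falls and CE rises as the memberships become more evenly spread, which matches the rule of thumb of looking for a high PC together with a low CE.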
# for(m in c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)) {
# for (K in 3:7) {
# results <- fuzzy.CM(SAVarsd[, -c(1,19,20,21,28,29,33,34)], K = K, m = m, max.iteration = 10000, RandomNumber = 3388)
# validation.index(results) -> val
# print(paste("m=", m))
# print(paste("K=", K))
# print(paste("PC", PC(val)))
# print(paste("CE", CE(val)))
# print(paste("XB", XB(val)))
# print(paste("S", S(val)))
# }
# }
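Printed output is hard to compare across fifty runs, so one option is to collect the indices in a data frame instead. The sketch below shows that pattern on a reduced grid. To keep it self-contained and quick it uses the built-in `iris` data and a compact base R fuzzy c-means, `fcm_sketch`, which is a hypothetical helper written for this illustration and not the advclust implementation; only PC is recorded.

```r
# Compact fuzzy c-means for illustration only (not the advclust implementation)
fcm_sketch <- function(X, K, m, iters = 50, seed = 3388) {
  set.seed(seed)
  X <- as.matrix(X)
  U <- matrix(runif(nrow(X) * K), nrow(X), K)
  U <- U / rowSums(U)                       # random start, each row sums to 1
  for (i in seq_len(iters)) {
    W <- U^m
    centres <- (t(W) %*% X) / colSums(W)    # membership-weighted cluster means
    # Squared Euclidean distance of every row to every centre (n x K matrix)
    D <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, centres[k, ])^2))
    D[D < 1e-12] <- 1e-12                   # guard against zero distances
    P <- D^(-1 / (m - 1))
    U <- P / rowSums(P)                     # standard FCM membership update
  }
  U
}

X <- scale(iris[, 1:4])                         # small built-in data set
grid <- expand.grid(m = c(1.5, 2.0), K = 3:4)   # reduced grid to keep the run short

# One Partition Coefficient value per (m, K) combination
grid$PC <- mapply(function(m, K) {
  U <- fcm_sketch(X, K = K, m = m)
  sum(U^2) / nrow(U)
}, grid$m, grid$K)

grid[order(-grid$PC), ]   # highest PC first
```

With the real data, the body of the `mapply` function would call `fuzzy.CM` and `validation.index` as in the loop above, and further columns could hold CE, XB and S.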
Once the fuzziness parameter and the number of clusters have been chosen, based on the validation indices and some further examination, the fuzzy clustering algorithm can be run. In this study the number of clusters is set to 4 and the fuzziness parameter to ‘1.5’. I prefer to save the output as a .csv file as well for further use. The result of this run is a set of membership values across the 4 clusters for each spatial unit.
# resultsd51 <- fuzzy.CM(SAVarsd[, -c(1,19,20,21,28,29,33,34)], K = 4, m= 1.5, max.iteration= 10000, RandomNumber = 3388)
# df51 <- data.frame(resultsd51@member)
# write.csv(df51, file = "df51.csv")
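For a first impression of such an output, it can help to look at which cluster has the largest membership in each area and how large that membership is. The sketch below does this for a made-up membership matrix; `member` and `summary_df` are illustrative names, and the column names `X1` to `X4` simply mirror those used in the mapping code later on.

```r
# Made-up memberships for three small areas across 4 clusters (each row sums to 1)
member <- rbind(
  c(0.80, 0.10, 0.05, 0.05),
  c(0.35, 0.33, 0.22, 0.10),
  c(0.10, 0.15, 0.25, 0.50)
)
colnames(member) <- paste0("X", 1:4)

summary_df <- data.frame(
  dominant_cluster = max.col(member, ties.method = "first"),  # column with the largest membership
  max_membership   = apply(member, 1, max)                    # size of that largest membership
)
summary_df
```

The second area shows why a hard assignment can mislead: its largest membership is only 0.35, barely ahead of the next cluster.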
Visualising fuzzy outcomes is rather difficult. For our study we were mainly interested in finding the ‘geographies of vagueness’. After spending some time exploring the outcome, I was able to formalise, in a way, the membership values of the spatial units. This formalisation will change with each study’s outcomes; the details below are only meant to give you an idea.
# uncertain <- ((s_map_d@data$X1>0.2 & s_map_d@data$X3>0.2 & s_map_d@data$X4>0.2) | (s_map_d@data$X2>0.2 & s_map_d@data$X3>0.2 & s_map_d@data$X4>0.2) |
# (s_map_d@data$X1>0.2 & s_map_d@data$X2>0.2 & s_map_d@data$X3>0.2) |
# (s_map_d@data$X1>0.2 & s_map_d@data$X2>0.2 & s_map_d@data$X4>0.2))
# s_map_d@data$fuzziness[uncertain] = "Fuzziness - degree 2"
#
# shared <- ((s_map_d@data$X1>0.3 & s_map_d@data$X3>0.3) | (s_map_d@data$X2>0.3 & s_map_d@data$X3>0.3) | (s_map_d@data$X1>0.3 & s_map_d@data$X4>0.3) | (s_map_d@data$X3>0.3 & s_map_d@data$X4>0.3) | (s_map_d@data$X2>0.3 & s_map_d@data$X4>0.3))
# s_map_d@data$fuzziness[shared] = "Fuzziness - degree 1"
#
# dominant <- (s_map_d@data$X1>0.75 | s_map_d@data$X2>0.75 | s_map_d@data$X3>0.75 | s_map_d@data$X4>0.75)
# s_map_d@data$fuzziness[dominant] = "Dominance - level 1"
#
# s_map_d@data$fuzziness[!(shared | dominant | uncertain)] = "Dominance - level 2"
#
# tm_shape(s_map_d) + tm_fill("fuzziness", palette = 'YlOrBr')
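The rules above can also be expressed by counting how many memberships exceed each threshold, which avoids writing out every combination of columns. The sketch below is an assumption-laden variant, not the exact rule set used for the map: `classify_fuzziness` is a hypothetical helper, the matrix `U` is made up, and counting treats every pair of clusters alike, whereas the code above does not list the `X1`/`X2` pair under "Fuzziness - degree 1".

```r
# Classify each row of a membership matrix by how its memberships are distributed.
# Labels are assigned in the same order as above, so later rules override earlier ones.
classify_fuzziness <- function(U) {
  n_over_02 <- rowSums(U > 0.2)   # number of clusters with membership above 0.2
  n_over_03 <- rowSums(U > 0.3)   # number of clusters with membership above 0.3
  label <- rep("Dominance - level 2", nrow(U))              # default: no rule applies
  label[n_over_02 >= 3] <- "Fuzziness - degree 2"           # spread over three or more clusters
  label[n_over_03 >= 2] <- "Fuzziness - degree 1"           # shared between two clusters
  label[apply(U, 1, max) > 0.75] <- "Dominance - level 1"   # one clearly dominant cluster
  label
}

# Made-up memberships for four areas across 4 clusters
U <- rbind(
  c(0.80, 0.10, 0.05, 0.05),
  c(0.45, 0.05, 0.40, 0.10),
  c(0.29, 0.28, 0.27, 0.16),
  c(0.60, 0.18, 0.12, 0.10)
)
classify_fuzziness(U)
# "Dominance - level 1" "Fuzziness - degree 1" "Fuzziness - degree 2" "Dominance - level 2"
```

The thresholds 0.2, 0.3 and 0.75 are taken from the code above; as noted there, they would need to be revisited for a different study.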
Here is an image of the output map. Although more than half of the small areas are dominantly characterized by specific features (shown in yellow), there are still some ‘geographies of vagueness’ in Dublin County, which is an important outcome of this study. The darker areas on the map (orange and brown) should be the focus of attention. The outskirts of Dublin, as well as some peripheral areas of the inner city, need close inspection before any strategic decision on public service allocation or any public policy implementation.
References
(1): Dublin Geodemographics (https://rpubs.com/burcinwalsh/620510).
(2): Ireland Census 2016: A Classification of Small Areas (https://rpubs.com/burcinwalsh/343141).
(3): 2011 Ireland Population Census Geodemographics Study (https://rpubs.com/chrisbrunsdon/14998).
(4): For more information - Dublin Dashboard Query Page (https://www.dublindashboard.ie/queries/geodemos).
(5): Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press.
(6): Xie XL and Beni GA (1991) Validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 841–846.
(7): Bensaid AM, Hall LO, Bezdek JC, Clarke LP, Silbiger ML, Arrington JA and Murtagh RF (1996) Validity-guided (re)clustering with applications to image segmentation. IEEE Transactions on Fuzzy Systems, 4, 112–123.