Burcin Yazgi Walsh, Chris Brunsdon

NCG, MU SSI, Maynooth University

After working with hard clustering techniques at both the national and the city level, we wanted to try soft clustering techniques for geodemographic studies to see whether they offer an improvement. Fuzzy approaches can be more robust to noise in the data, and they give the opportunity to represent geographical areas in more detail than a strict single-group classification allows.

The study follows the same principles as our Dublin Geodemographics analysis (1) for the initial steps. For more information on the general background and on the selection and elimination of variables, you can visit the link (https://rpubs.com/burcinwalsh/620510). In the Dublin case, seven of the forty variables used at the national level (2, 3) had no significant effect and were excluded from the study: Number of rooms per household; Number of people per room; Agriculture; Construction; Central heating; Unpaid care; and Health condition (bad and very bad, LLTI).

The code below derives the variables from Ireland's latest (2016) Population Census dataset.

library(e1071)
library(advclust)
library(tmap)

CreateVariables <- function(SA2016, na.rm=TRUE) {
  attach(SA2016)
  
  Age0_4    <- 100 * ( T1_1AGE0T + T1_1AGE1T + T1_1AGE2T + T1_1AGE3T + T1_1AGE4T ) / T1_1AGETT
  Age5_14   <- 100 * ( T1_1AGE5T + T1_1AGE6T + T1_1AGE7T + T1_1AGE8T + T1_1AGE9T +
                         T1_1AGE10T + T1_1AGE11T + T1_1AGE12T + T1_1AGE13T + T1_1AGE14T) / T1_1AGETT
  Age25_44  <- 100 * ( T1_1AGE25_29T + T1_1AGE30_34T + T1_1AGE35_39T + T1_1AGE40_44T ) / T1_1AGETT
  Age45_64  <- 100 * ( T1_1AGE45_49T + T1_1AGE50_54T + T1_1AGE55_59T + T1_1AGE60_64T ) / T1_1AGETT
  Age65over <- 100 * ( T1_1AGE65_69T + T1_1AGE70_74T + T1_1AGE75_79T + T1_1AGE80_84T + T1_1AGEGE_85T ) / T1_1AGETT
  
  EU_National           <- 100 * (T2_1UKN + T2_1PLN + T2_1LTN + T2_1EUN) / (T2_1TN - T2_1NSN)
  ROW_National          <- 100 * (T2_1RWN) / (T2_1TN - T2_1NSN)
  Born_outside_Ireland  <- 100 * (T2_1TBP - T2_1IEBP) / (T2_1TN - T2_1NSN)
  
  Separated            <- 100 * (T1_2SEPT + T1_2DIVT) / T1_2T
  SinglePerson         <- 100 * T5_2_1PP / T5_2_TP
  Pensioner            <- 100 * T4_5RP / T4_5TP
  LoneParent           <- 100 * (T4_3FOPFCT + T4_3FOPMCT) / T4_5TF
  DINK                 <- 100 * T4_5PFF / T4_5TF
  NonDependentKids     <- 100 * T4_4AGE_GE20F / T4_4TF
  
  RentPublic         <- 100 * T6_3_RLAH / (T6_3_TH - T6_3_NSH)
  RentPrivate        <- 100 * T6_3_RPLH / (T6_3_TH - T6_3_NSH)
  
  Flats              <- 100 * T6_1_FA_H / (T6_1_TH - T6_1_NS_H)
  NoCenHeat          <- 100 * T6_5_NCH / (T6_5_T - T6_5_NS)
  RoomsHH            <- (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH ) / (T6_4_TH - T6_4_NSH)
  PeopleRoom         <- T1_1AGETT / (T6_4_1RH + 2*T6_4_2RH + 3*T6_4_3RH + 4*T6_4_4RH + 5*T6_4_5RH + 6*T6_4_6RH + 7*T6_4_7RH + 8*T6_4_GE8RH + T6_4_NSH)
  SepticTank         <- 100 * T6_7_IST / (T6_7_T - T6_7_NS)
  
  HEQual              <-  100 * ((T10_4_ODNDT + T10_4_HDPQT + T10_4_PDT + T10_4_DT) / (T10_4_TT - T10_4_NST)) # educ to degree or higher
  Employed            <-  100 * T8_1_WT / T8_1_TT
  TwoCars             <-  100 * (T15_1_2C + T15_1_3C + T15_1_GE4C) / (T15_1_TC - T15_1_NSC)
  JTWPublic           <-  100 * (T11_1_BUW + T11_1_TDLW) / (T11_1_TW - T11_1_NSW)
  HomeWork            <-  100 * T9_2_PH / T9_2_PT
  LLTI                <-  100 * (T12_3_BT + T12_3_VBT) / (T12_3_TT - T12_3_NST)
  UnpaidCare          <-  100 * T12_2_T / T1_1AGETT
  
  Students           <- 100 * T8_1_ST / T8_1_TT
  Unemployed         <- 100 * T8_1_ULGUPJT / T8_1_TT
  
  EconInactFam       <- 100 * T8_1_LAHFT / T8_1_TT
  Agric              <- 100 * T14_1_AFFT / T14_1_TT 
  Construction       <- 100 * T14_1_BCT / T14_1_TT 
  Manufacturing      <- 100 * T14_1_MIT / T14_1_TT 
  Commerce           <- 100 * T14_1_CTT / T14_1_TT 
  Transport          <- 100 * T14_1_TCT / T14_1_TT 
  Public             <- 100 * T14_1_PAT / T14_1_TT 
  Professional       <- 100 * T14_1_PST / T14_1_TT 
  
  #MISC
  Broadband          <- 100 * T15_3_B / (T15_3_B + T15_3_OTH) # Internet connected HH with Broadband
  Internet           <- 100 *(T15_3_B + T15_3_OTH) / (T15_3_T - T15_3_NS) # Households with Internet
  
  Place <- data.frame(substring(GEOGID, 8), stringsAsFactors=FALSE)
  detach(SA2016)
  
  ### Bringing it all together
  colnames(Place)[1] <- 'GEOGID'
  Demographic <- data.frame(Age0_4, Age5_14, Age25_44, Age45_64, Age65over, EU_National, ROW_National, Born_outside_Ireland)
  HouseholdComposition <- data.frame(Separated, SinglePerson, Pensioner, LoneParent, DINK, NonDependentKids)
  Housing <- data.frame(RentPublic, RentPrivate, Flats, NoCenHeat, RoomsHH, PeopleRoom, SepticTank)
  SocioEconomic <- data.frame(HEQual, Employed, TwoCars, JTWPublic, HomeWork, LLTI, UnpaidCare)
  Employment <- data.frame(Students, Unemployed, EconInactFam, Agric, Construction, Manufacturing, Commerce, Transport, Public, Professional)
  Misc <- data.frame(Internet, Broadband)
  
  DerivedData <- data.frame(Place, Demographic, HouseholdComposition, Housing, SocioEconomic, Employment, Misc)
  # Replace NA/NaN cells (e.g. from 0/0 in areas with a zero denominator) with 0
  if (na.rm) DerivedData[is.na(DerivedData)] <- 0
  
  DerivedData
}

SADatad <- read.csv(file="SAPS2016_SA2017_dublin.csv", header = TRUE, sep=",", stringsAsFactors=FALSE)
SAVarsd <- CreateVariables(SADatad)
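Most of the variables above follow one pattern: a count divided by a total, with the 'not stated' responses removed from the denominator where the census reports them. Below is a minimal sketch of that pattern on made-up data, using `with()` in place of `attach()`. The helper `pct()` and the toy column names are assumptions for illustration only; they are not part of the census files or of the function above.

```r
# Hypothetical helper: percentage of `count` in `total`,
# excluding 'not stated' responses from the denominator.
pct <- function(count, total, not_stated = 0) {
  100 * count / (total - not_stated)
}

# Toy data: three small areas, the third with no households at all.
toy <- data.frame(
  rented     = c(20, 5, 0),
  total      = c(110, 50, 0),
  not_stated = c(10, 0, 0)
)

# with() evaluates the expression inside the data frame without attaching it.
toy$RentShare <- with(toy, pct(rented, total, not_stated))

# 0 / 0 gives NaN, which is.na() also catches; set those cells to 0,
# as CreateVariables does when na.rm = TRUE.
toy$RentShare[is.na(toy$RentShare)] <- 0

print(toy$RentShare)
```

The third area shows why the NA replacement matters: without it, an empty area would carry a NaN into the clustering step.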

One common soft clustering technique is fuzzy c-means (FCM). With the FCM algorithm, each spatial unit is represented by a membership value between 0 and 1 for every group. This gives the opportunity to examine the make-up of each spatial unit further, and it avoids assigning each spatial unit to only one category, which is exactly what happens in hard clustering.

FCM relies on an exponent that controls how fuzzy the outcomes are. This exponent, ‘m’, is also referred to as the fuzziness parameter. There is no single optimal value for ‘m’; it is generally advised that good results can be obtained when ‘m’ is between 1.5 and 3.0.
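To give a feel for what ‘m’ does, here is a small sketch of the standard FCM membership formula, where the membership of an observation in cluster k is proportional to its distance to that cluster centre raised to the power -2/(m-1). The function name `fcm_membership()` and the distances are made up for illustration; this is not how the packages used here are implemented internally.

```r
# Standard FCM membership update for one observation,
# given its distances to each cluster centre.
fcm_membership <- function(d, m) {
  stopifnot(m > 1, all(d > 0))
  w <- d^(-2 / (m - 1))   # closer centres get larger weights
  w / sum(w)              # normalise so memberships sum to 1
}

# Toy observation: distances to three cluster centres (made-up values).
d <- c(1, 2, 4)

u_crisp <- fcm_membership(d, m = 1.1)  # close to a hard assignment
u_mid   <- fcm_membership(d, m = 1.5)
u_fuzzy <- fcm_membership(d, m = 3.0)  # memberships spread across clusters

print(round(rbind(u_crisp, u_mid, u_fuzzy), 3))
```

As ‘m’ approaches 1 the nearest cluster takes almost all of the membership, and as ‘m’ grows the memberships move towards an even split.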

To evaluate the fuzziness parameter and the number of clusters in a meaningful way, validation tools are useful in fuzzy applications. Some of the more widely used validation measures are the Partition Coefficient (PC) and Classification Entropy (CE) (Bezdek, 1981), the Xie-Beni index (XB) (Xie and Beni, 1991) and the Separation index (S) (Bensaid et al., 1996). Each validation measure works in a different way: for the best fit, PC should be at its highest, whereas CE, XB and S should be at their lowest. It is important to note that the generally accepted approach is to look for a range of ‘good’ estimates rather than a single best value.
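The two simplest of these measures can be computed directly from a membership matrix. The sketch below uses Bezdek's definitions, with PC as the mean of the squared memberships per observation and CE as the mean entropy per observation (natural logarithm assumed). The function names and toy matrices are illustrative assumptions, and the values reported by a package may differ if it uses another log base or scaling.

```r
# Partition Coefficient: ranges from 1/K (fully vague) to 1 (fully crisp).
partition_coefficient <- function(U) {
  sum(U^2) / nrow(U)
}

# Classification Entropy: ranges from 0 (fully crisp) to log(K) (fully vague).
classification_entropy <- function(U) {
  terms <- ifelse(U > 0, U * log(U), 0)  # treat 0 * log(0) as 0
  -sum(terms) / nrow(U)
}

# Toy membership matrices: 2 observations x 4 clusters.
crisp <- rbind(c(1, 0, 0, 0), c(0, 1, 0, 0))
vague <- matrix(0.25, nrow = 2, ncol = 4)

print(c(PC = partition_coefficient(crisp), CE = classification_entropy(crisp)))
print(c(PC = partition_coefficient(vague), CE = classification_entropy(vague)))
```

The two toy matrices mark the extremes, which is why a high PC and a low CE are read as signs of a clearer partition.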

The code below runs FCM for values of ‘m’ from 1.1 to 2.0 (a larger ‘m’ means a higher level of fuzziness) and for cluster numbers ‘K’ from 3 to 7, and prints the validation indices for each run.

Note that you have to replace ‘SAVarsd’ with your own data, and that this step takes some time.

# # Columns dropped: 1 = GEOGID, plus the seven excluded variables
# # (19 NoCenHeat, 20 RoomsHH, 21 PeopleRoom, 28 LLTI, 29 UnpaidCare, 33 Agric, 34 Construction)
# for (m in c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)) {
#   for (K in 3:7) {
#     results <- fuzzy.CM(SAVarsd[, -c(1,19,20,21,28,29,33,34)], K = K, m = m, max.iteration = 10000, RandomNumber = 3388)
#     val <- validation.index(results)
#     print(paste("m=", m))
#     print(paste("K=", K))
#     print(paste("PC", PC(val)))
#     print(paste("CE", CE(val)))
#     print(paste("XB", XB(val)))
#     print(paste("S", S(val)))
#   }
# }
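Printed output becomes hard to compare once there are fifty runs, so it can help to collect the indices in a table instead. The sketch below shows the idea on made-up data with a smaller grid. `simple_fcm()` is a deliberately minimal base R fuzzy c-means written for this illustration; it is a stand-in for `fuzzy.CM`, not a reproduction of it, and only the Partition Coefficient is computed.

```r
# Minimal fuzzy c-means in base R (illustration only, fixed number of iterations).
simple_fcm <- function(X, K, m, iters = 100, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  U <- matrix(runif(nrow(X) * K), nrow(X), K)
  U <- U / rowSums(U)                        # random start, rows sum to 1
  for (it in seq_len(iters)) {
    W <- U^m
    centres <- t(W) %*% X / colSums(W)       # weighted means, one row per cluster
    # Squared Euclidean distance from every observation to every centre
    D2 <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, centres[k, ])^2))
    D2 <- pmax(D2, 1e-12)                    # guard against division by zero
    W2 <- D2^(-1 / (m - 1))                  # same as distance^(-2 / (m - 1))
    U <- W2 / rowSums(W2)
  }
  U
}

# Toy data: two well-separated groups of 30 points each.
set.seed(42)
X <- rbind(matrix(rnorm(60, mean = 0), 30, 2),
           matrix(rnorm(60, mean = 5), 30, 2))

# One row per (m, K) combination, with the index stored alongside.
grid <- expand.grid(m = c(1.5, 2.0), K = 2:3)
grid$PC <- mapply(function(m, K) {
  U <- simple_fcm(X, K = K, m = m)
  sum(U^2) / nrow(U)                         # Partition Coefficient
}, grid$m, grid$K)

print(grid)
```

With the results in a data frame, ranges of ‘good’ values can be read off by sorting or plotting rather than by scanning the console.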

Once the fuzziness parameter and the number of clusters have been chosen, based on the validation indices and some further examination, the fuzzy clustering algorithm can be run. In this study the number of clusters is set to 4 and the fuzziness parameter to 1.5. I prefer to save the output as a .csv file as well for further use. The result of this run is a set of membership values across the 4 clusters for each spatial unit.

# resultsd51 <- fuzzy.CM(SAVarsd[, -c(1,19,20,21,28,29,33,34)], K = 4, m= 1.5, max.iteration= 10000, RandomNumber = 3388) 
# df51 <- data.frame(resultsd51@member)
# write.csv(df51, file = "df51.csv")
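Before mapping, it can be worth checking the saved memberships and looking at the strongest cluster for each area. The following is a minimal sketch on a made-up membership table; the column names X1 to X4 mirror those used in the mapping code further down, and the names `top_cluster` and `top_value` are assumptions for illustration.

```r
# Toy membership table: 3 small areas x 4 clusters (made-up values).
member <- data.frame(
  X1 = c(0.80, 0.40, 0.25),
  X2 = c(0.10, 0.35, 0.25),
  X3 = c(0.05, 0.15, 0.25),
  X4 = c(0.05, 0.10, 0.25)
)
cols <- c("X1", "X2", "X3", "X4")

# Each area's memberships should sum to 1.
stopifnot(all(abs(rowSums(member[, cols]) - 1) < 1e-8))

# Strongest cluster per area (the first one wins a tie) and its membership value.
member$top_cluster <- max.col(as.matrix(member[, cols]), ties.method = "first")
member$top_value   <- apply(member[, cols], 1, max)

print(member)
```

The third area illustrates what a hard label hides: it is assigned to cluster 1 even though it belongs equally to all four.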

It is rather difficult to find a good way to visualise fuzzy outcomes. For our study we were most interested in finding the ‘geographies of vagueness’. After spending some time exploring the outcome, I was able to formalise, in a way, the membership values of the different spatial units. This formalisation will change with each study's outcomes; the details below are just to give you an idea. In the code, ‘s_map_d’ is the spatial object for the small areas, with the membership values held in columns X1 to X4 of its attribute table.

# uncertain <- ((s_map_d@data$X1>0.2 & s_map_d@data$X3>0.2 & s_map_d@data$X4>0.2) | (s_map_d@data$X2>0.2 & s_map_d@data$X3>0.2 & s_map_d@data$X4>0.2) | 
# (s_map_d@data$X1>0.2 & s_map_d@data$X2>0.2 & s_map_d@data$X3>0.2) | 
# (s_map_d@data$X1>0.2 & s_map_d@data$X2>0.2 & s_map_d@data$X4>0.2))
# s_map_d@data$fuzziness[uncertain] = "Fuzziness - degree 2"
# 
# shared <- ((s_map_d@data$X1>0.3 & s_map_d@data$X3>0.3) | (s_map_d@data$X2>0.3 & s_map_d@data$X3>0.3) | (s_map_d@data$X1>0.3 & s_map_d@data$X4>0.3) | (s_map_d@data$X3>0.3 & s_map_d@data$X4>0.3) | (s_map_d@data$X2>0.3 & s_map_d@data$X4>0.3))
# s_map_d@data$fuzziness[shared] = "Fuzziness - degree 1"
# 
# dominant <- (s_map_d@data$X1>0.75 | s_map_d@data$X2>0.75 | s_map_d@data$X3>0.75 | s_map_d@data$X4>0.75)
# s_map_d@data$fuzziness[dominant] = "Dominance - level 1"
# 
# s_map_d@data$fuzziness[!(shared | dominant | uncertain)] = "Dominance - level 2"
# 
# tm_shape(s_map_d) + tm_fill("fuzziness", palette = 'YlOrBr')
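The same rules can be tried on a plain membership matrix without any spatial data. The sketch below is one possible generalisation, assuming the thresholds above (0.75, 0.3 and 0.2) and the same order of assignment; the function `classify_fuzziness()` and the toy values are made up for illustration. It counts memberships above each threshold, so it treats every pair of clusters alike, whereas the code above lists five of the six possible pairs (X1 with X2 is absent), and results can differ for areas split between those two clusters.

```r
classify_fuzziness <- function(U, hi = 0.75, pair = 0.3, triple = 0.2) {
  U <- as.matrix(U)
  uncertain <- rowSums(U > triple) >= 3   # three or more clusters above 0.2
  shared    <- rowSums(U > pair) >= 2     # two or more clusters above 0.3
  dominant  <- rowSums(U > hi) >= 1       # one cluster above 0.75

  # Same order as above: later assignments overwrite earlier ones.
  out <- rep("Dominance - level 2", nrow(U))
  out[uncertain] <- "Fuzziness - degree 2"
  out[shared]    <- "Fuzziness - degree 1"
  out[dominant]  <- "Dominance - level 1"
  out
}

# Toy memberships: 4 areas x 4 clusters (made-up values, rows sum to 1).
U <- rbind(
  c(0.80, 0.10, 0.05, 0.05),  # one clearly dominant cluster
  c(0.45, 0.10, 0.40, 0.05),  # shared between clusters 1 and 3
  c(0.28, 0.27, 0.25, 0.20),  # spread over three clusters
  c(0.60, 0.20, 0.10, 0.10)   # leading cluster, but below 0.75
)

print(classify_fuzziness(U))
```

Changing the thresholds in the function call is a quick way to see how sensitive the four categories are before committing to a map.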

Here is an image of the output map. Even though more than half of the small areas are dominantly characterised by specific features (shown in yellow), there are still some ‘geographies of vagueness’ in Dublin County, which is an important outcome of this study. The darker (orange and brown) areas on the map should be the focus of attention. The outskirts of Dublin, as well as some peripheral areas of the inner city, need close inspection before any strategic decision on public service allocation or public policy implementation.



References

(1): Dublin Geodemographics (https://rpubs.com/burcinwalsh/620510).

(2): Ireland Census 2016: A Classification of Small Areas (https://rpubs.com/burcinwalsh/343141).

(3): 2011 Ireland Population Census Geodemographics Study (https://rpubs.com/chrisbrunsdon/14998).

(4): For more information - Dublin Dashboard Query Page (https://www.dublindashboard.ie/queries/geodemos).

(5): Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Plenum Press.

(6): Xie XL and Beni GA (1991) Validity measure for fuzzy clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13, 841–846.

(7): Bensaid AM, Hall LO, Bezdek JC, Clarke LP, Silbiger ML, Arrington JA and Murtagh RF (1996) Validity-guided (re)clustering with applications to image segmentation. IEEE Transactions on Fuzzy Systems, 4, 112–123.