Delineating Social Area Analysis using Geographic Segmentation Approach

1 Introduction

1.1 Singapore, a Liveable City?

Social area analysis is the study of socioeconomic characteristics of a population. In his paper, Shevky looked into indicators such as social rank, urbanisation and ethnic groups segregation.

Here, Singapore is the area of study and we are interested to find out how we can make Singapore a more Liveable City. In the Economist Intelligence Unit’s (EIU) 2019 Global Liveability Index, Singapore is ranked 40th amongst the world’s Liveable Cities. Cities were ranked in terms of stability, healthcare, culture and environment, education and infrastructure.

Singapore is a unique destination because it is a country and a city at the same time. Her limited land means that population density is high. To reduce congestion in the Central Business District (CBD) area, the Government is developing office clusters in the heartlands, such as the Tampines and Jurong planning area, where residents can live, work and play.

1.2 Variables

In addition to the variables required, we will create indicators to understand Singapore’s liveability at the Planning Subzone level. Several Planning Subzones (e.g. Ang Mo Kio Town Centre, Cheng San etc.) make up a Planning Area (e.g. Ang Mo Kio).

1.2.1 Demographic Variables

Population density is the ratio of number of people to the land area. We will be using Singapore’s resident (citizens and permanent residents) population as the numerator since Singstat’s data does not include non-residents.

Old-age support ratio is defined as the ratio of elderly (65 years and over) to economically active population (25 to 64 years old). Although the Department of Statistics define the economically active population to be 20 to 64 years old, we will be using 25 to 64 years old instead. This is because a large proportion of Singapore’s 20 to 24 years old are serving National Service, or studying in Institutes of Higher Learning.

The old-age support ratio has been declining over the decades. This means that a fewer number of economically active population is supporting each elderly and could indicate a strain on economic resources.

1.2.2 Level of Living Variables

Number of shopping amenities per 100 residents indicates the density of shopping facilities to the population. These shopping spaces provide basic necessities and material goods and are thus indicators of the standard of living of a population.

Number of financial services per 100 residents indicates the density of money changes, ATMs and other financial amenities to the resident population. These financial services enable residents to engage in financial transactions.

Number of businesses per 100 residents indicates the density of business entities to the resident population. These business entities provide employment to the population and having a low density in the heartlands reveal the potential for growth of office clusters in line with Singapore’s office decentralization strategy.

Number of private residences per 100 residents indicates the density of upmarket residential areas to the resident population. These upmarket areas provide more choices for Singapore’s growing rich population and makes it attractive for expats to live to Singapore.

Number of government services per 100 residents indicates the density of government organizations to the resident population. These government organizations, such as Town Council, serve the population and ensure that Singapore can continue to grow vibrantly.

1.3 Methodology

We will be using three clustering techniques:

Hierarchical clustering
K-means clustering
Spatially-constrained

Hierarchical clustering is a common method of grouping the most similar clusters together. In particular, the Agglomerative Clustering is a bottom-up approach that combines the two most homogeneous clusters into a bigger cluster at each step. Dissimilarity can be measured by the Ward’s method which minimizes total within-cluster variance.

Next, we will use the K-means clustering to group the subzones. A reason for using another clustering method is that both hierarchical clustering and K-means clustering have their shortfalls. On one hand, the K-means algorithm requires pre-specification of the number of clusters and is sensitive to the starting points. In other words, varying the starting point could result in different clusters each time. On the other hand, a disadvantage of the hierarchical clustering is that the clusters are fixed and new observations do not change the cluster groups. K-means make up for this shortfall.

The spatially-constrained clustering technique will be used in a separate analysis. It is based on pruning a minimum spanning tree which is constructed from the contiguity structure of the spatial units.

2 Data Import and Preparation

In our analysis we need these data

URA Master Plan 2014 Planning Subzone boundary (source: data.gov.sg
Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 (source: SingStat
Govt_Embassy, Private residential, Shopping, Business and Financial geospatial datasets

2.1 Importing required libraries

tidyverse for data wrangling
sf for data importation and geospatial data wrangling
tmap, ggplot and ggpubr for data visualization
ggplot for data visualization
corrplot for correlation plot
spdep for spatial handling
cluster and heatmaply for clustering analysis

A new library used is the factoextra, which enables us to create visualization for our clustering analysis. Another new library, gridExtra enables us to plot multiple plots on the same page.

packages = c('spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'tidyverse','psych', 'factoextra')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}
library(gridExtra)

2.2 Importing aspatial data as Dataframe

population <- read_csv("data/aspatial/respopagesextod2011to2019.csv")
head(population)

## # A tibble: 6 x 7
##   PA        SZ               AG    Sex   TOD                           Pop  Time
##   <chr>     <chr>            <chr> <chr> <chr>                       <dbl> <dbl>
## 1 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 1- and 2-Room Flats         0  2011
## 2 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 3-Room Flats               10  2011
## 3 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 4-Room Flats               30  2011
## 4 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HDB 5-Room and Executive F~    50  2011
## 5 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males HUDC Flats (excluding thos~     0  2011
## 6 Ang Mo K~ Ang Mo Kio Town~ 0_to~ Males Landed Properties               0  2011

2.3 Aspatial Data wrangling

2019 population data required:

Economy active population (i.e. age 25-64)
Young group (i.e. age 0-24)
Aged group (i.e 65 and above)
Population density
HDB1-2RM dwellers
HDB3-4RM dwellers
HDB5RM-Ec dweller
Condominium and apartment dwellers
Landed property dwellers

List the age groups

unique(population$AG)

##  [1] "0_to_4"      "5_to_9"      "10_to_14"    "15_to_19"    "20_to_24"   
##  [6] "25_to_29"    "30_to_34"    "35_to_39"    "40_to_44"    "45_to_49"   
## [11] "50_to_54"    "55_to_59"    "60_to_64"    "65_to_69"    "70_to_74"   
## [16] "75_to_79"    "80_to_84"    "85_to_89"    "90_and_over"

The grouping of the age groups will be direct. We will simply add the various age groups up to get the mega age groups as required.

List the type of dwelling and its corresponding population

This allows us to understand which types of dwelling there are and how to treat these variables into our desired groups.

population2019 <- population %>%
  filter(Time == 2019)
aggregate(population2019$Pop, by=list(Category=population2019$TOD), FUN=sum)

##                                  Category       x
## 1       Condominiums and Other Apartments  590110
## 2                 HDB 1- and 2-Room Flats  175050
## 3                        HDB 3-Room Flats  582250
## 4                        HDB 4-Room Flats 1340580
## 5          HDB 5-Room and Executive Flats 1065050
## 6 HUDC Flats (excluding those privatised)       0
## 7                       Landed Properties  249980
## 8                                  Others   30400

Several of these dwelling types may seem foreign. Here are the definitions from SingStat:

Condominiums and Other Apartments - Condominium (except Executive Condominium)/ Executive Condominium/ Other Apartments nec (not elsewhere classified)
HDB 1- and 2-Room Flats - Includes HDB studio apartments
HUDC Flats - Housing and Urban Development Corporation (HUDC) flats, except for those privatised, were managed by HDB until 2017.
Landed Properties - Bungalow/ Detached House/ Semi-Detached House/ Terrace House
Others - Less common housings such as Shophouse/ Other Housing Units nec

Given that the available data is vague on what “Others” TOD is, we will drop it. Then, we will be grouping the TOD as follows

Indicators required	Corresponding data from population dataframe
HDB1-2RM dwellers	HDB 1- and 2-Room Flats
HDB3-4RM dwellers	HDB 3-Room Flats + HDB 4-Room Flats
HDB5RM-Ec dweller	HDB 5-Room and Executive Flats
Condominium and apartment dwellers	Condominiums and Other Apartments
Landed property dwellers	Landed Properties

2.3.1 Filter out rows with >0 population and convert PA and SZ to uppercase

population2019 <- population2019 %>%
  filter(Pop > 0) %>%
  mutate_at(.vars = vars(PA, SZ), .funs = funs(toupper))

2.3.2 Get Population Age Group data

Steps done:

Spread the age groups (key) and population (value) to get a new column for each age group
Group the age groups into the bigger required age groups
Select the necessary columns
Replace NA values with 0
Group the data by Subzones, summing the YOUNG, ECONACTIVE and AGED columns

age_df <- population2019 %>%
  spread(AG, Pop) %>%
  mutate(YOUNG = `0_to_4`+`5_to_9`+`10_to_14`+`15_to_19`+`20_to_24`) %>%
  mutate(ECONACTIVE = `25_to_29`+`30_to_34`+`35_to_39`+`40_to_44`+`45_to_49`+
           `50_to_54`+`55_to_59`+`60_to_64`) %>%
  mutate(AGED = `65_to_69`+`70_to_74`+`75_to_79`+`80_to_84`+`85_to_89`+`90_and_over`) %>%
  select(`SZ`, `YOUNG`, `ECONACTIVE`, `AGED`)
age_df[is.na(age_df)] <- 0
age_df <- age_df %>% 
  group_by(`SZ`) %>%
  summarize(`YOUNG`=sum(`YOUNG`), `ECONACTIVE`=sum(`ECONACTIVE`), `AGED`=sum(`AGED`))
head(age_df)

## # A tibble: 6 x 4
##   SZ              YOUNG ECONACTIVE  AGED
##   <chr>           <dbl>      <dbl> <dbl>
## 1 ADMIRALTY        4370       8380   890
## 2 ALEXANDRA HILL   2880       7560  2960
## 3 ALEXANDRA NORTH   580       1420     0
## 4 ALJUNIED         8220      24330  7460
## 5 ANAK BUKIT       5850      12330  3400
## 6 ANCHORVALE      15250      27740  3180

2.3.3 Get Population by Type of Dwelling data

Same steps are done as the Age Group data for the Type of Dwelling feature, except that we are filtering out the ‘Others’ properties.

tod_df <- population2019 %>%
  filter(!(TOD == c('HUDC Flats (excluding those privatised)','Others'))) %>%
  spread(TOD, Pop) %>%
  mutate(HDB12RM = `HDB 1- and 2-Room Flats`) %>%
  mutate(HDB34RM = `HDB 3-Room Flats`+`HDB 4-Room Flats`) %>%
  mutate(HDB5RMEC = `HDB 5-Room and Executive Flats`) %>%
  mutate(CONDO = `Condominiums and Other Apartments`) %>%
  mutate(LANDED = `Landed Properties`) %>%
  select(`SZ`, `HDB12RM`, `HDB34RM`, `HDB5RMEC`, `CONDO`, `LANDED`)
tod_df[is.na(tod_df)] <- 0
tod_df <- tod_df %>% 
  group_by(`SZ`) %>%
  summarize(`HDB12RM`=sum(`HDB12RM`), `HDB34RM`=sum(`HDB34RM`), `HDB5RMEC`=sum(`HDB5RMEC`),
            `CONDO`=sum(`CONDO`), `LANDED`=sum(`LANDED`))
head(tod_df)

## # A tibble: 6 x 6
##   SZ              HDB12RM HDB34RM HDB5RMEC CONDO LANDED
##   <chr>             <dbl>   <dbl>    <dbl> <dbl>  <dbl>
## 1 ADMIRALTY          1140    6580     6220     0      0
## 2 ALEXANDRA HILL     3910    6310     3120     0      0
## 3 ALEXANDRA NORTH       0       0        0  2120      0
## 4 ALJUNIED           2120   18020     6400 11930    460
## 5 ANAK BUKIT          160      60     3170  9020   7600
## 6 ANCHORVALE         1680   17190    22650  5000      0

2.4 Importing geospatial data as sf’s Simple Features Dataframe

Singapore Planning Subzone Boundary

sf_mpsz <- st_read(dsn = "data/geospatial", 
                   layer = "MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

# head(sf_mpsz)

Business/industry buildings

sf_business <- st_read(dsn = "data/geospatial", 
                   layer = "Business")

## Reading layer `Business' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS:            4326

# head(sf_business)

Financial services

sf_financial <- st_read(dsn = "data/geospatial", 
                   layer = "Financial")

## Reading layer `Financial' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS:            4326

# head(sf_financial)

Government services including embassy

sf_govtembassy <- st_read(dsn = "data/geospatial", 
                   layer = "Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS:            4326

# head(sf_govtembassy)

Private residential buildings

sf_privresidential <- st_read(dsn = "data/geospatial", 
                   layer = "Private residential")

## Reading layer `Private residential' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS:            4326

# head(sf_privresidential)

Shopping amenities

sf_shopping <- st_read(dsn = "data/geospatial", 
                   layer = "Shopping")

## Reading layer `Shopping' from data source `C:\Users\weich\Downloads\SMU2020\IS415\IS415_Take-home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS:            4326

# head(sf_shopping)

2.5 Geospatial data wrangling

2.5.1 Identify duplicated entries according to POI_ID and remove them

Some entries have the same POI_ID but have spelling differences (e.g. MINISTRY OF EDUCATION STELLAR CTR vs MINISTRY OF EDUCATION STELLAR CENTRE) or use an abbreviation (e.g. MEWER vs MINISTRY OF ENVIRONMENT AND WATER RESOURCES). We want to remove the duplicated entries.

Inspect the duplicated entries

The duplicated() function has a fromLast argument that returns the first till the second last duplicated entries but not the last if fromLast=TRUE. It returns the second till the last duplicated entries but not the first if fromLast=FALSE. Therefore, we use the OR (“|”) operator to get all the duplicated entries.
Encompass the code in View() to view the whole list in a separate tab.
Replace the dataframe name to inspect the other simple features dataframe

View(sf_govtembassy[duplicated(sf_govtembassy$POI_ID) | duplicated(sf_govtembassy$POI_ID, fromLast=TRUE), ])

Remove the duplicated entries

sf_business <- sf_business[!duplicated(sf_business$POI_ID),]
sf_financial <- sf_financial[!duplicated(sf_financial$POI_ID),]
sf_govtembassy <- sf_govtembassy[!duplicated(sf_govtembassy$POI_ID),]
sf_privresidential <- sf_privresidential[!duplicated(sf_privresidential$POI_ID),]
sf_shopping <- sf_shopping[!duplicated(sf_shopping$POI_ID),]

2.5.2 Identify the Planning Subzone that the services are located in

Create function that assigns 3414 Coordinate Reference System and identifies which Planning Subzone the services are located. Function returns a Simple Feature Dataframe.

sf_point_in_poly <- function(point, epsg, polygon){
  point <- st_transform(point, epsg)
  point <- st_join(point, polygon, join=st_intersects) %>%
    select(`POI_ID`,`POI_NAME`,`SUBZONE_N`)
  return(point)
}
sf_mpsz <- st_transform(sf_mpsz, 3414)
sf_business <- sf_point_in_poly(sf_business, 3414, sf_mpsz)
sf_financial <- sf_point_in_poly(sf_financial, 3414, sf_mpsz)
sf_govtembassy <- sf_point_in_poly(sf_govtembassy, 3414, sf_mpsz)
sf_privresidential <- sf_point_in_poly(sf_privresidential, 3414, sf_mpsz)
sf_shopping <- sf_point_in_poly(sf_shopping, 3414, sf_mpsz)

2.6 Create the new variables

We will be creating the variables mentioned in the Variables section.

Population density (POPDENSE) - divide resident population by area (SHAPE_Area)
Old-age support ratio (OLDS_RATIO) - divide elderly population (AGED) by economically active population (ECONACTIVE)
Number of shopping amenities per 100 residents (SHOPPER100) - divide the number of shopping amenities by every 100 people
Number of financial services per 100 residents (FINPER100) - divide the number of financial services by every 100 people
Number of businesses per 100 residents (BIZPER100) - divide the number of businesses by every 100 people
Number of private residences per 100 residents (PRIVRESPER100) - divide the number of private residences by every 100 people
Number of government services per 100 residents (GOVTPER100) - divide the number of government services by every 100 people

2.6.1 Creating population density variable

Steps:

Sum population data (population2019) by Subzones
Select relevant columns from sf_mpsz for easy reference
Use left_join to combine dataframes by the Subzone name
Calculate population density in km^2

pop_df <- population2019 %>%
  group_by(`SZ`) %>%
  summarize(`POP`=sum(`Pop`))
sf_mpsz <- sf_mpsz %>%
  select(`SUBZONE_N`, `PLN_AREA_N`, `SHAPE_Area`)
sf_mpsz_select <- left_join(sf_mpsz, pop_df, by = c("SUBZONE_N" = "SZ")) %>%
  mutate_at(vars(`POP`), ~replace(., is.na(.), 0)) %>%
  mutate(POPDENSE = `POP` / `SHAPE_Area` * 1e6)

2.6.2 Creating old-age support ratio variable

Add observations to mpsz dataframe
Create new variable by dividing AGED by ECONACTIVE

sf_mpsz_select <- left_join(sf_mpsz_select, age_df, by = c("SUBZONE_N" = "SZ")) %>%
  mutate(OLDS_RATIO = `AGED` / `ECONACTIVE`)

2.6.3 Creating number of shopping amenities per 100 residents variable

Steps:

Group the observations by Subzones
Add observations to mpsz dataframe

shop_df <- data.frame(table(sf_shopping$SUBZONE_N)) %>%
  rename(`SUBZONE_N` = `Var1`, `shop_count` = `Freq`)
sf_mpsz_select <- left_join(sf_mpsz_select, shop_df) %>%
  mutate(SHOPPER100 = `shop_count` / `POP` * 100)

2.6.4 Creating number of financial services per 100 residents variable

fin_df <- data.frame(table(sf_financial$SUBZONE_N)) %>%
  rename(`SUBZONE_N` = `Var1`, `fin_count` = `Freq`)
sf_mpsz_select <- left_join(sf_mpsz_select, fin_df) %>%
  mutate(FINPER100 = `fin_count` / `POP` * 100)

2.6.5 Creating number of businesses per 100 residents variable

biz_df <- data.frame(table(sf_shopping$SUBZONE_N)) %>%
  rename(`SUBZONE_N` = `Var1`, `biz_count` = `Freq`)
sf_mpsz_select <- left_join(sf_mpsz_select, biz_df) %>%
  mutate(BIZPER100 = `biz_count` / `POP` * 100)

2.6.6 Creating number of private residences per 100 residents variable

privres_df <- data.frame(table(sf_privresidential$SUBZONE_N)) %>%
  rename(`SUBZONE_N` = `Var1`, `privres_count` = `Freq`)
sf_mpsz_select <- left_join(sf_mpsz_select, privres_df) %>%
  mutate(PRIVRESPER100 = `privres_count` / `POP` * 100)

2.6.7 Creating number of government services per 100 residents variable

Steps are same as above except that we add an additional step before that:

Remove embassies from the dataframe

to_delete <- "CONSULATE|EMBASSY|HIGH COMMISSION"
sf_govtembassy <- sf_govtembassy[- grep(to_delete, sf_govtembassy$POI_NAME),]
govt_df <- data.frame(table(sf_govtembassy$SUBZONE_N)) %>%
  rename(`SUBZONE_N` = `Var1`, `govt_count` = `Freq`)
sf_mpsz_select <- left_join(sf_mpsz_select, govt_df) %>%
  mutate(GOVTPER100 = `govt_count` / `POP` * 100)

2.6.8 Add type of dwelling data to aggregated data

sf_mpsz_select <- left_join(sf_mpsz_select, tod_df, by = c("SUBZONE_N" = "SZ"))

2.6.9 Replace NA values with 0

sf_mpsz_select[is.na(sf_mpsz_select)] <- 0

2.6.10 Filter out rows without resident population

Since our analysis has to do with the socioeconomic classes, we are only interested in the Subzones with resident population.

sf_mpsz_select <- sf_mpsz_select %>%
  filter(`POP` > 0)

3 Exploratory Data Analysis

3.1 EDA using statistical graphics

First, we will plot the required indicators in a histogram chart.

YOUNG_hist <- ggplot(data=sf_mpsz_select, aes(x=`YOUNG`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
ECONACTIVE_hist <- ggplot(data=sf_mpsz_select, aes(x=`ECONACTIVE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
AGED_hist <- ggplot(data=sf_mpsz_select, aes(x=`AGED`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
HDB12RM_hist <- ggplot(data=sf_mpsz_select, aes(x=`HDB12RM`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
HDB34RM_hist <- ggplot(data=sf_mpsz_select, aes(x=`HDB34RM`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
HDB5RMEC_hist <- ggplot(data=sf_mpsz_select, aes(x=`HDB5RMEC`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
CONDO_hist <- ggplot(data=sf_mpsz_select, aes(x=`CONDO`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
LANDED_hist <- ggplot(data=sf_mpsz_select, aes(x=`LANDED`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(YOUNG_hist, ECONACTIVE_hist, AGED_hist, HDB12RM_hist, HDB34RM_hist,
          HDB5RMEC_hist, CONDO_hist, LANDED_hist,
          ncol = 4, 
          nrow = 2)

From the histograms above, we can tell that the data is highly skewed. Though the horizontal axis is in the ’000 range, the charts vary between 5000 and 80000. Hence, the range is not the same.

POPDENSE_hist <- ggplot(data=sf_mpsz_select, aes(x=`POPDENSE`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
OLDS_RATIO_hist <- ggplot(data=sf_mpsz_select, aes(x=`OLDS_RATIO`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
SHOPPER100_hist <- ggplot(data=sf_mpsz_select, aes(x=`SHOPPER100`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
FINPER100_hist <- ggplot(data=sf_mpsz_select, aes(x=`FINPER100`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
BIZPER100_hist <- ggplot(data=sf_mpsz_select, aes(x=`BIZPER100`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
PRIVRESPER100_hist <- ggplot(data=sf_mpsz_select, aes(x=`PRIVRESPER100`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
GOVTPER100_hist <- ggplot(data=sf_mpsz_select, aes(x=`GOVTPER100`)) +
  geom_histogram(bins=20, color="black", fill="light blue")
ggarrange(POPDENSE_hist, OLDS_RATIO_hist, SHOPPER100_hist, FINPER100_hist, BIZPER100_hist,
          PRIVRESPER100_hist, GOVTPER100_hist,
          ncol = 4, 
          nrow = 2)

From the histograms above, we can tell that the scale is vastly different from the variables above. This means that we have to perform data standardisation before our clustering analysis.

3.2 EDA using choropleth map

From the above histograms, we can plot the variables that seem to have a similar distribution on the choropleth map.

m1 <- tm_shape(sf_mpsz_select) + 
  tm_fill(col = "YOUNG",
          n = 5,
          style = "jenks", 
          title = "Young population") + 
  tm_borders(alpha = 0.5) 
m2 <- tm_shape(sf_mpsz_select) + 
  tm_fill(col = "ECONACTIVE",
          n = 5,
          style = "jenks",
          title = "Economically-active pop.") + 
  tm_borders(alpha = 0.5) 
m3 <- tm_shape(sf_mpsz_select) + 
  tm_fill(col = "AGED",
          n = 5,
          style = "jenks", 
          title = "Elderly population") + 
  tm_borders(alpha = 0.5) 
m4 <- tm_shape(sf_mpsz_select) + 
  tm_fill(col = "CONDO",
          n = 5,
          style = "jenks",
          title = "Condo") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(m1, m2, m3, m4,
             asp=NA, nrow=2, ncol=2)

The subzones have rather similar pattern across the four plots. This could suggest high correlation amongst these variables. For a more mathematical approach, we can plot the correlation plot.

3.3 Correlation Analysis

The Correlation Analysis allows us to understand the bivariate relationship between variables. It includes the direction (positive/negative) of the relationship as well as how strongly they are correlated (the higher the correlation coefficient, the stronger the correlation). If variables are strongly correlated (correlation coefficient >= 0.85), we have to remove the variable that is more correlated with other variables.

to_corr <- sf_mpsz_select %>%
  select(`SUBZONE_N`,`YOUNG`,`ECONACTIVE`,`AGED`,`HDB12RM`,`HDB34RM`,`HDB5RMEC`,`CONDO`,`LANDED`,
         `POPDENSE`,`OLDS_RATIO`,`SHOPPER100`,`FINPER100`,`BIZPER100`,`PRIVRESPER100`,`GOVTPER100`) %>%
  st_set_geometry(NULL)
to_corr <- do.call(data.frame,lapply(to_corr, function(x) replace(x, is.infinite(x),0)))
cluster_vars.cor = cor(to_corr[,2:16])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

Variable selection between YOUNG, ECONACTIVE, HDB34RM, HDB5RMEC

From the plot above, we can tell that the YOUNG population is highly correlated with three variables:

ECONACTIVE population (0.98),
HDB34RM (0.85), and
HDB5RMEC (0.91).

The ECONACTIVE population is highly correlated with four variables:

YOUNG population (0.98),
AGED population (0.87),
HDB34RM (0.9), and
HDB5RMEC (0.89).

Therefore, between the YOUNG and ECONACTIVE variable, we will retain the YOUNG variable.

The HDB34RM variable is highly correlated with three variables:

YOUNG (0.9),
ECONACTIVE (0.85), and
AGED (0.9).

The HDB5RMEC variable is highly correlated with two variables:

YOUNG (0.91), and
ECONACTIVE (0.89).

Therefore, between the HDB34RM and HDB5RMEC variable, we will retain the HDB5RMEC variable.

Now, we have the YOUNG and HDB5RMEC variable being highly correlated with each other. Since YOUNG is more correlated with AGED (0.78) than HDB5RMEC is correlated with AGED (0.65), we will remove the YOUNG variable.

Variable selection between YOUNG, ECONACTIVE, HDB34RM, HDB5RMEC

The SHOPPER100 variable is highly correlated with three variables:

FINPER100 (0.88),
BIZPER100 (1), and
PRIVRESPER100 (0.9).

Since SHOPPER100 and BIZPER100 have perfect correlation, we will retain either (e.g. SHOPPER100).

The FINPER100 variable is highly correlated with two variables:

SHOPPER100 (0.88), and
BIZPER100 (0.88).

The PRIVRESPER100 variable is highly correlated with two variables:

SHOPPER100 (0.9), and
BIZPER100 (0.9).

Therefore, between SHOPPER100, FINPER100, and PRIVRESPER100, we will retain FINPER100.

Variables selected

In sum, these are the variables to include in the clustering analysis: AGED, HDB12RM, HDB5RMEC, CONDO, LANDED, POPDENSE, OLDS_RATIO, FINPER100, GOVTPER100.

4 Hierarchy Cluster Analysis

Once again, the Hierarchical Clustering technique groups the most homogeneous observations together. We can use this to find the most similar clusters and understand what it suggests about Singapore’s liveability.

4.1 Extrating clustering variables

cluster_vars <- to_corr %>%
  select(`SUBZONE_N`,`AGED`, `HDB12RM`, `HDB5RMEC`, `CONDO`, `LANDED`, `POPDENSE`, `OLDS_RATIO`, `FINPER100`, `GOVTPER100`)
row.names(cluster_vars) <- cluster_vars$SUBZONE_N
cluster_vars <- select(cluster_vars, c(2:10))
head(cluster_vars)

##                AGED HDB12RM HDB5RMEC CONDO LANDED   POPDENSE OLDS_RATIO
## PEARL'S HILL   1700    4520        0   420      0 11843.1719  0.5279503
## BOAT QUAY         0       0        0    50      0   435.3031  0.0000000
## HENDERSON HILL 2790    3930     1270   190      0 22471.1973  0.3780488
## REDHILL        1190    1970     3240  1930      0 27695.3656  0.1900958
## ALEXANDRA HILL 2960    3910     3120     0      0 13373.7225  0.3915344
## BUKIT HO SWEE  3330    3700     1900   540      0 26788.3659  0.4007220
##                 FINPER100 GOVTPER100
## PEARL'S HILL   0.36199095 0.01508296
## BOAT QUAY      2.85714286 1.42857143
## HENDERSON HILL 0.02989537 0.00000000
## REDHILL        0.11183597 0.00000000
## ALEXANDRA HILL 0.10159652 0.02902758
## BUKIT HO SWEE  0.04059540 0.01353180

4.2 Data Standardisation with Z-score Standardisation

As mentioned in the EDA, the scale of the data is highly varied and standardisation is necessary. We will use Z-score standardisation which rescales data to have mean=0 and standard deviation=1. It is preferred over normalization as it can deal with outliers.

cluster_vars.z <- scale(cluster_vars)
describe(cluster_vars.z)

##            vars   n mean sd median trimmed  mad   min   max range skew kurtosis
## AGED          1 234    0  1  -0.26   -0.17 0.81 -0.83  6.40  7.23 2.36     9.34
## HDB12RM       2 234    0  1  -0.65   -0.22 0.00 -0.65  3.42  4.07 1.69     2.12
## HDB5RMEC      3 234    0  1  -0.36   -0.22 0.36 -0.60  5.77  6.37 2.80     9.44
## CONDO         4 234    0  1  -0.40   -0.18 0.62 -0.82  4.64  5.46 1.58     2.67
## LANDED        5 234    0  1  -0.44   -0.24 0.00 -0.44  7.23  7.67 3.93    18.93
## POPDENSE      6 234    0  1  -0.23   -0.08 1.18 -1.21  2.55  3.75 0.55    -0.86
## OLDS_RATIO    7 234    0  1  -0.08   -0.05 1.24 -1.32  2.62  3.94 0.21    -0.97
## FINPER100     8 234    0  1  -0.21   -0.19 0.01 -0.22 10.14 10.36 7.26    58.58
## GOVTPER100    9 234    0  1  -0.19   -0.18 0.00 -0.19 10.96 11.15 7.89    71.05
##              se
## AGED       0.07
## HDB12RM    0.07
## HDB5RMEC   0.07
## CONDO      0.07
## LANDED     0.07
## POPDENSE   0.07
## OLDS_RATIO 0.07
## FINPER100  0.07
## GOVTPER100 0.07

r <- ggplot(data=cluster_vars, aes(x= `AGED`)) +
  geom_histogram(bins=20, color="black", fill="light blue") +
  ggtitle("Before Standardisation")
r2 <- ggplot(data=cluster_vars, aes(x= `OLDS_RATIO`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

cluster_vars.z_df <- as.data.frame(cluster_vars.z)
z <- ggplot(data=cluster_vars.z_df, aes(x=`AGED`)) +
  geom_histogram(bins=20, color="black", fill="light blue") +
  ggtitle("Z-score Standardisation")
z2 <- ggplot(data=cluster_vars.z_df, aes(x=`OLDS_RATIO`)) +
  geom_histogram(bins=20, color="black", fill="light blue")

ggarrange(r, z, r2, z2,
          ncol = 2, nrow = 2)

As seen from the above chart, the AGED and OLDS_RATIO are of very different scales before standardisation. After Z-score standardisation, they have a similar range.

4.3 Selecting the optimal clustering algorithm

The agnes() function enables us to get the agglomerative coefficient, which measures the strength of clustering. Thus, we will use a function that computes the agglomerative coefficient for each clustering method (the closer the value to 1, the stronger the clustering structure).

Steps:

Define methods to assess
Create function to compute coefficient
Find the method that gives the strongest clustering structure

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(cluster_vars.z, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8896250 0.8561701 0.9350264 0.9662691

Ward gives the strongest clustering structure. Therefore, we will be using the Ward’s method in our hierarchical clustering.

4.4 Computing hierarchical clustering

Steps:

Compute proximity matrix. Distance measures include euclidean, maximum, manhattan etc.
Compute hierarchical clustering using hclust() which is an agglomerative clustering approach

proxmat <- dist(cluster_vars.z, method = 'euclidean')
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.6)

4.5 Determining the optimal number of clusters

The fviz_nbclust() function enables us to find the optimal number of clusters with different methods. We will use FUNcluster=hcut for hierarchical clustering.

4.5.1 Elbow Method

The Elbow Method plots the total within sum of squares (i.e. intraclass variance) against the number of clusters, k. A low TWSS means that the class is more compact. We get the optimal number of clusters by finding a bend.

fviz_nbclust(cluster_vars.z, hcut, method = "wss")

This Elbow Method plot does not give a clearcut bend for us to determine the optimal number of clusters. We could turn to the Average Silhouette Method.

4.5.2 Average Silhouette Method

The Average Silhouette Method measures how well the observation lies in the cluster. It plots the average sihouette width against the number of clusters, k. We pick the k that is maximised.

fviz_nbclust(cluster_vars.z, hcut, method = "silhouette")

The optimal number of clusters would be k=6 where the average sihouette width is maximised.

4.6 Visualizing the dendrogram with the optimal number of clusters

Use rect.hclust() and specify the optimal number of clusters, k=6.

plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 6, border = 2:3)

4.8 Visualizing the clustering in a scatter plot

Steps:

Obtain the cluster which the observation belongs to by cutree() function and specifying the optimal number of clusters, k=6
Add the cluster number to the dataframe
Use fviz_cluster() function to visualize the clustering in a scatter plot

sub_grp <- cutree(hclust_ward, k = 6)
ncluster_vars.z_df <- cluster_vars.z_df %>%
  mutate(cluster = sub_grp) 
fviz_cluster(list(data = cluster_vars.z, cluster = sub_grp), geom = "point")

We can see that the clusters have some overlapping dimensions portions on the scatter plot. This reveals that the indicators may not be significant enough to differentiate between the clusters. We can use the interactive cluster heatmap to find out the strength of the variables.

4.9 Plotting interactive cluster heatmap using heatmaply()

The heatmaply() function allows us to interact with the heat map by hovering our mouse over the plot. The darker the color of the bars, the higher the value of the variables.

heatmaply(cluster_vars.z,
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 6,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Singapore by Liveability Indicators",
          xlab = "Liveability Indicators",
          ylab = "Singapore Planning Subzones"
          )

The heatmap does not seem too intuitive for us to understand how the variables affect the clustering. However, we could gather several insights:

The purple cluster has a high number of landed properties.
The light blue cluster has a high number of condominums and a high number of HDB 5 Room Flats and Executive Condominiums.
The cyan cluster has a high ageing population and a high number of HDB 1-2 Room Flats.
The pink cluster has a high number of financial and government services per 100 people.

4.10 Mapping the clusters formed

We can use qtm() to do a simple map of the hierarhical clustering to understand the geographic distribution.

sf_ncluster_vars.z_df <- cbind(ncluster_vars.z_df, sf_mpsz_select$geometry)
sf_ncluster_vars.z_df <- st_as_sf(sf_ncluster_vars.z_df)
qtm(sf_ncluster_vars.z_df, "cluster")

From the map above, we can gather that the subzones in a particular cluster are geographically scattered.

Cluster 1 stretches from the northeast to the central region.
Cluster 2 is mostly at the outskirts of Singapore, possibly the industrial areas with low resident population.
Cluster 5 is located at the central region and boasts some old towns.
Cluster 6 is an upmarket region.

4.11 Significant variables analysis

4.11.1 Summarizing observations by cluster mean

The heatmap showed us that the indicators were not very telling across all the subzones in the cluster. For instance, some subzones in cluster 3 have a low number of financial and government services per 100 people while others in this cluster have a high number. Therefore, we can summarize them and take the mean to compare a single number across the indicators.

ncluster_vars.z_df %>%
  group_by(`cluster`) %>%
  summarize_all(mean)

## # A tibble: 6 x 10
##   cluster   AGED HDB12RM HDB5RMEC  CONDO LANDED POPDENSE OLDS_RATIO FINPER100
##     <int>  <dbl>   <dbl>    <dbl>  <dbl>  <dbl>    <dbl>      <dbl>     <dbl>
## 1       1  0.792   2.15     0.266  0.458 -0.301    0.815     0.740   -0.198  
## 2       2 -0.727  -0.510   -0.463 -0.313 -0.267   -0.755    -0.901   -0.00986
## 3       3  0.941   0.174    1.44   0.448 -0.188    1.24      0.0611  -0.210  
## 4       4 -0.833  -0.648   -0.605 -0.781 -0.432   -1.17     -1.32     3.96   
## 5       5  0.129   0.110   -0.283 -0.588 -0.394    0.460     1.09    -0.169  
## 6       6  0.182  -0.339   -0.408  0.951  1.95    -0.592     0.759   -0.200  
## # ... with 1 more variable: GOVTPER100 <dbl>

Analysis of cluster mean will be done below after finding out the significant variables.

4.11.2 One-way ANOVA test for comparing means between multiple groups

The Analysis of Variance (ANOVA) test allows us to decipher whether the differences between means are statistically significant.

Hypothesis:

H0: The difference between means are the same.
H1: The difference between means are not the same for at least one group.

Confidence interval: 95%

We reject the null hypothesis if the p-value is less than the alpha value of 0.05.

res.aov <- aov(cluster ~ ., data = ncluster_vars.z_df)
summary(res.aov)

##              Df Sum Sq Mean Sq F value   Pr(>F)    
## AGED          1  11.07   11.07  10.359 0.001480 ** 
## HDB12RM       1  63.96   63.96  59.852 3.46e-13 ***
## HDB5RMEC      1  36.97   36.97  34.595 1.46e-08 ***
## CONDO         1   0.06    0.06   0.056 0.813087    
## LANDED        1  68.00   68.00  63.631 7.68e-14 ***
## POPDENSE      1  30.22   30.22  28.282 2.53e-07 ***
## OLDS_RATIO    1 138.07  138.07 129.197  < 2e-16 ***
## FINPER100     1  14.58   14.58  13.642 0.000278 ***
## GOVTPER100    1  11.55   11.55  10.811 0.001171 ** 
## Residuals   224 239.39    1.07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA test, we find out that eight out of 9 variables are statistically significant at the 95% confidence interval. We can reject H0 for these variables to conclude that the difference between the means are not the same.

4.11.3 Analysis of cluster mean

Cluster 1

Cluster 1 has a high ageing population and high old-age support ratio. The number of HDB 1-2 Room Flats is significantly higher than the other clusters as well. This could signify that it is a less prosperous zone. However, it also has a high population density.

Since AGED is highly correlated with ECONACTIVE, the size of the economically-active population is high in this cluster. Yet, the number of financial services and hence business entities and shopping amenities per 100 residents is low in this cluster. The Government could turn this cluster into a live-work-play area to decentralize office locations. Given that some of these subzones are located far from the central business district, it could be a viable alternative to explore.

Cluster 3

Cluster 3 has the highest ageing population. However, its old-age support ratio is significantly lower than that of cluster 1. This may create a strain on the economically-active population. Therefore, the Government could pay more attention to this cluster and bring in more elderly-friendly amenities for the elderly to age gracefully.

Cluster 6

Cluster 6 has a high number of condominiums and landed properties. Its population density, however, is significantly low. This may signify an upmarket region where the Government can attract expats to live in the region. Some of these subzones are located far from the Central Business District, which means that upmarket facilities could be built in these subzones to attract this population with high earnings power.

Cluster 4

Cluster 4 has a high number of financial and government services per 100 residents. Its population density, however, is significantly low. The Government could look into the accessibility of this area since residents who work here do not live close to their working location.

Cluster 5

Cluster 5 has a high old-age support ratio although its elderly population is not high. This may mean that there is little cause for concern here with regards to the elderly population.

Cluster 2

Cluster 2 scores low on all indicators. It may not be the most important region to look into at the moment.

5 Spatially Constrained Clustering - SKATER approach

The Spatially Constrained Clustering technique groups makes the clusters more geographically compact. It takes into consideration the distance between the clusters.

5.1 Converting into SpatialPolygonsDataFrame

Since the SKATER function only supports sp object, we have to convert the sf object into a sp object.

sp_cluster_vars.z <- as_Spatial(sf_ncluster_vars.z_df[,1:9])

5.2 Computing Neighbour List

cluster_vars.nb <- poly2nb(sp_cluster_vars.z)
summary(cluster_vars.nb)

## Neighbour list object:
## Number of regions: 234 
## Number of nonzero links: 1198 
## Percentage nonzero weights: 2.187888 
## Average number of links: 5.119658 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 
##  4 10 25 36 69 41 33 11  5 
## 4 least connected regions:
## 13 204 222 233 with 1 link
## 5 most connected regions:
## 59 66 71 121 131 with 9 links

Plotting neighbours list

The coordinates of the sp object gets the centroid position of the subzones and will be used as the nodes in the plot.

plot(sp_cluster_vars.z, border=grey(.5))
plot(cluster_vars.nb, coordinates(sp_cluster_vars.z), col="blue", add=TRUE)

5.3 Computing minimum spanning tree

5.3.1 Calculating edge costs

The nbcosts() function calculates the distance between the nodes. This is the pairwise dissimilarity between the subzone and its neighbours.

Then, we convert is to a weights object with nb2listw() function.

lcosts <- nbcosts(cluster_vars.nb, cluster_vars.z)
cluster_vars.w <- nb2listw(cluster_vars.nb, lcosts, style="B")
summary(cluster_vars.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 234 
## Number of nonzero links: 1198 
## Percentage nonzero weights: 2.187888 
## Average number of links: 5.119658 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 
##  4 10 25 36 69 41 33 11  5 
## 4 least connected regions:
## 13 204 222 233 with 1 link
## 5 most connected regions:
## 59 66 71 121 131 with 9 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn     S0       S1       S2
## B 234 54756 3885.8 34394.62 349721.3

5.3.2 Computing minimum spanning tree

The mstree() function calculates the minimum spanning tree.

cluster_vars.mst <- mstree(cluster_vars.w)

Checking class and dimensions

class(cluster_vars.mst)

## [1] "mst"    "matrix"

dim(cluster_vars.mst)

## [1] 233   3

5.4 Computing spatially constrained clusters using SKATER method

clust <- skater(cluster_vars.mst[,1:2], cluster_vars.z, 5)
ccs <- clust$groups
table(ccs)

## ccs
##   1   2   3   4   5   6 
## 217   5   4   2   5   1

This gives us the number of observations in each cluster. Cluster 1 encompasses majority of the subzones.

Plotting the pruned tree

plot(sp_cluster_vars.z, border=gray(.5))
plot(clust, coordinates(sp_cluster_vars.z), cex.lab=.7,
     groups.colors=c("red", "green", "blue", "brown", "pink", "orange"), cex.circles=0.005, add=TRUE)

5.5 Visualising the clusters in choropleth map

groups_mat <- as.matrix(clust$groups)
sf_spatialcluster <- cbind(sf_ncluster_vars.z_df, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(sf_spatialcluster, "SP_CLUSTER")

5.6 Visualising the cluster means

5.6.1 Obtaining the cluster means

sf_spatialcluster %>%
  group_by(`SP_CLUSTER`) %>%
  summarize_all(mean)

## Simple feature collection with 6 features and 11 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: 6351.521 ymin: 24495.28 xmax: 45539.4 ymax: 49597.42
## CRS:            EPSG:3414
## # A tibble: 6 x 12
##   SP_CLUSTER   AGED HDB12RM HDB5RMEC   CONDO  LANDED POPDENSE OLDS_RATIO
##   <fct>       <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>      <dbl>
## 1 1          -0.129 -0.0683  -0.0955 -0.0858 -0.0674  -0.0291    -0.0153
## 2 2           3.02   0.762    1.81    2.29    2.37     0.290      0.669 
## 3 3           0.914  0.389    3.10    0.986  -0.384    1.79      -0.298 
## 4 4          -0.833 -0.648   -0.605  -0.741  -0.435   -1.16      -1.32  
## 5 5           2.33   2.28     0.221   1.11    1.12     0.245      1.03  
## 6 6          -0.833 -0.648   -0.605  -0.821  -0.435   -1.21      -1.32  
## # ... with 4 more variables: FINPER100 <dbl>, GOVTPER100 <dbl>, cluster <dbl>,
## #   geometry <GEOMETRY [m]>

5.6.2 Visualising the cluster means with ggplot()

p1 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=AGED)) + 
  geom_boxplot() + 
  coord_flip()
p2 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=HDB12RM)) + 
  geom_boxplot() + 
  coord_flip()
p3 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=HDB5RMEC)) + 
  geom_boxplot() + 
  coord_flip()
p4 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=CONDO)) + 
  geom_boxplot() + 
  coord_flip()
p5 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=LANDED)) + 
  geom_boxplot() + 
  coord_flip()
p6 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=POPDENSE)) + 
  geom_boxplot() + 
  coord_flip()
p7 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=OLDS_RATIO)) + 
  geom_boxplot() + 
  coord_flip()
p8 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=FINPER100)) + 
  geom_boxplot() + 
  coord_flip()
p9 <- ggplot(sf_spatialcluster, aes(x=SP_CLUSTER, y=GOVTPER100)) + 
  geom_boxplot() + 
  coord_flip()

ggarrange(p1,p2,p3,p4,p5,p6,p7,p8,p9,
          ncol = 3, nrow = 3)

5.6.3 SKATER Cluster results

Cluster 2

Cluster 2 (yellow, e.g. Frankel, Simei) seems to be a mixed of rich and poor neighbourhoods with a mid-number of HDB 1-2 Room Flats, but high number of landed and condominium properties. Since population density is somewhat high and the number of financial services per 100 people is low, more can be done in this area to serve the people. Since FINPER100 is highly correlated with BIZPER100, there is potential to turn this cluster into a work-play-live area in line with the Government’s aim of decentralizing office spaces.

In addition, since ageing population and old-age support ratio is high, elderly-friendly amenities need to be in place to serve this group.

Cluster 3

Cluster 3 (purple, e.g. Sengkang Town Centre, Punggol Field) is high in population density and the number of HDB 5 Room Flats and Executive Condominiums. Since ageing population and old-age support ratio is mid level, it suggests that subzones in this cluster are new estates which seeks to attract the new generation families. Since the number of financial services per 100 people is low, the number of business entities per 100 people is low as well. In line with the Government’s strategy to develop the northeast region into bustling new towns, the Government could build innovative business clusters which are attractive to the younger economically-active population.

Cluster 4

Cluster 4 (pink, e.g. Boulevard, Somerset) is low on all indicators except for the number of financial and government services per 100 people. As the premium shopping belt of Singapore, it is high in tourist traffic. Expats could be living in this cluster. Since shopping amenities and business entities are high in number in this region, it could rank high in liveability for expats.

Cluster 6

Cluster 6 (orange, e.g. Lim Chu Kang) is low on all indicators except for having the highest number of government services per 100 people. This could be due to the low population.

Cluster 5

Cluster 5 (blue, e.g. Geylang East, Aljunied) has a high elderly population, as well as high old-age support ratio. Since population density is high and financial services is low, more can be done to bring these financial services closer to the elderly people. As some elderly may be lowly educated and not financially literate, an offline approach is needed to serve this population.

Cluster 1

Cluster 1 (green) scores in the mid range for all indicators. Perhaps this cluster could be isolated for further analysis.

6 K-Means Cluster Analysis

6.1 Plotting K-Means from k=4 to k=7

Since the optimal number of clusters obtained by hierarchical clustering is 6, we will plot the K-Means on a scatter plot from k=4 to k=7.

k4 <- kmeans(cluster_vars.z, centers = 4, nstart = 25)
k5 <- kmeans(cluster_vars.z, centers = 5, nstart = 25)
k6 <- kmeans(cluster_vars.z, centers = 6, nstart = 25)
k7 <- kmeans(cluster_vars.z, centers = 7, nstart = 25)

# plots to compare
p1 <- fviz_cluster(k4, geom = "point", data = cluster_vars.z) + ggtitle("k = 4")
p2 <- fviz_cluster(k5, geom = "point",  data = cluster_vars.z) + ggtitle("k = 5")
p3 <- fviz_cluster(k6, geom = "point",  data = cluster_vars.z) + ggtitle("k = 6")
p4 <- fviz_cluster(k7, geom = "point",  data = cluster_vars.z) + ggtitle("k = 7")

grid.arrange(p1, p2, p3, p4, nrow = 2)

We can see that as k increases, the portion of overlaps increases. Again, we can use a more mathematical approach to get the optimal number of clusters.

6.2 Determining the optimal number of clusters

We will use the Average Sihouette Method since it is more clearcut than the Elbow Method.

fviz_nbclust(cluster_vars.z, kmeans, method = "silhouette")

The optimal number of clusters is 3.

6.3 Visualizing the optimal number of clusters

k3 <- kmeans(cluster_vars.z, centers = 3, nstart = 25)
fviz_cluster(k3, geom = "point", data = cluster_vars.z) + ggtitle("k = 3")

Now, we see that the optimal number of clusters, k=3 gives a clear separation between the observations.

6.4 Interpreting the K-Means clustering results

6.4.1 Summarizing observations by cluster mean

Again, summarizing the results would give us a better measure of how the indicators fair amongst the clusters.

kmeans_cluster <- cluster_vars.z_df %>%
  mutate(cluster = k3$cluster) %>%
  group_by(`cluster`) %>%
  summarize_all(mean)
kmeans_cluster

## # A tibble: 3 x 10
##   cluster   AGED HDB12RM HDB5RMEC   CONDO  LANDED POPDENSE OLDS_RATIO FINPER100
##     <int>  <dbl>   <dbl>    <dbl>   <dbl>   <dbl>    <dbl>      <dbl>     <dbl>
## 1       1 -0.568  -0.548   -0.475 -0.0827  0.0932   -0.669     -0.383   -0.0579
## 2       2 -0.833  -0.648   -0.605 -0.781  -0.432    -1.17      -1.32     3.96  
## 3       3  0.794   0.755    0.657  0.162  -0.0900    0.947      0.588   -0.205 
## # ... with 1 more variable: GOVTPER100 <dbl>

6.4.2 Mapping the clusters formed

sf_kmeans_cluster <- cbind(kmeans_cluster, sf_mpsz_select$geometry)
sf_kmeans_cluster <- st_as_sf(sf_kmeans_cluster)
qtm(sf_kmeans_cluster, "cluster")

6.4.3 Analysis of K-Means cluster results

All three clusters are geographically scattered.

Cluster 2

Cluster 2 is high on the number of financial and government services per 100 resident. Since population density is not high, this may mean an oversupply of such services and relocation could be explored.

Cluster 3

Cluster 3 is high in ageing population and population density. However, the number of financial and government services per 100 residents is not high. This may mean an undersupply of such services and relocation could be explored. Thus, this area seems like a residential area and not a working area. Should the Government explore turning this cluster into a live-work-play area, the Government would need to build more amenities to serve this group. In particular, elderly-friendly services need to be present.

Cluster 1

Cluster 1 scores low on all indicators. This may mean that it is not an immediate source for concern.

7 Conclusion

The clustering techniques have allowed us to understand the social spaces in Singapore. In particular, we wanted to understand how Singapore could become a more liveable city by analysing several demographic and level of living indicators.

Since all clustering techniques have given varied conclusions, the technique to use depend on the Government strategy. For instance, the socially constrained cluster technique is useful for a macro planning level. However, the other clustering methods can allow the Government to understand the Subzone differences of a Planning Area and create a Planning Area strategy that incorporates these subzones.

7.1 Limitations

The indicators used are limited in scope and does not give us a full picture on whether Singapore is a Liveable City. As mentioned earlier, the liveability includes the culture of a place, and even how economically vibrant it is. However, these indicators are not available at the Planning Subzone level.

Delineating Social Area Analysis using Geographic Segmentation Approach

Ong Wei Ching

5/22/2020

1 Introduction

1.1 Singapore, a Liveable City?

1.2 Variables

1.2.1 Demographic Variables

1.2.2 Level of Living Variables

1.3 Methodology

2 Data Import and Preparation

2.1 Importing required libraries

2.2 Importing aspatial data as Dataframe

2.3 Aspatial Data wrangling

List the age groups

List the type of dwelling and its corresponding population

2.3.1 Filter out rows with >0 population and convert PA and SZ to uppercase

2.3.2 Get Population Age Group data

2.3.3 Get Population by Type of Dwelling data

2.4 Importing geospatial data as sf’s Simple Features Dataframe

Singapore Planning Subzone Boundary

Business/industry buildings

Financial services

Government services including embassy

Private residential buildings

Shopping amenities

2.5 Geospatial data wrangling

2.5.1 Identify duplicated entries according to POI_ID and remove them

Inspect the duplicated entries

Remove the duplicated entries

2.5.2 Identify the Planning Subzone that the services are located in

2.6 Create the new variables

2.6.1 Creating population density variable

2.6.2 Creating old-age support ratio variable

2.6.3 Creating number of shopping amenities per 100 residents variable

2.6.4 Creating number of financial services per 100 residents variable

2.6.5 Creating number of businesses per 100 residents variable

2.6.6 Creating number of private residences per 100 residents variable

2.6.7 Creating number of government services per 100 residents variable

2.6.8 Add type of dwelling data to aggregated data

2.6.9 Replace NA values with 0

2.6.10 Filter out rows without resident population

3 Exploratory Data Analysis

3.1 EDA using statistical graphics

3.2 EDA using choropleth map

3.3 Correlation Analysis

Variable selection between YOUNG, ECONACTIVE, HDB34RM, HDB5RMEC

Variable selection between YOUNG, ECONACTIVE, HDB34RM, HDB5RMEC

Variables selected

4 Hierarchy Cluster Analysis

4.1 Extrating clustering variables

4.2 Data Standardisation with Z-score Standardisation

4.3 Selecting the optimal clustering algorithm

4.4 Computing hierarchical clustering

4.5 Determining the optimal number of clusters

4.5.1 Elbow Method

4.5.2 Average Silhouette Method

4.6 Visualizing the dendrogram with the optimal number of clusters

4.8 Visualizing the clustering in a scatter plot

4.9 Plotting interactive cluster heatmap using heatmaply()

4.10 Mapping the clusters formed

4.11 Significant variables analysis

4.11.1 Summarizing observations by cluster mean

4.11.2 One-way ANOVA test for comparing means between multiple groups

4.11.3 Analysis of cluster mean

5 Spatially Constrained Clustering - SKATER approach

5.1 Converting into SpatialPolygonsDataFrame

5.2 Computing Neighbour List

Plotting neighbours list

5.3 Computing minimum spanning tree

5.3.1 Calculating edge costs

5.3.2 Computing minimum spanning tree

5.4 Computing spatially constrained clusters using SKATER method

Plotting the pruned tree

5.5 Visualising the clusters in choropleth map

5.6 Visualising the cluster means

5.6.1 Obtaining the cluster means

5.6.2 Visualising the cluster means with ggplot()

5.6.3 SKATER Cluster results

6 K-Means Cluster Analysis

6.1 Plotting K-Means from k=4 to k=7