1 Introduction

Social area analysis (SAA), which is a generic name for a number of methods employing census and other data to classify small areas into similar socioeconomic groups, is an approach which quantifies data in a useful fashion and has important applications in regional development, urban planning, medical and health services research. The potentialities of the approach for Liveable City planning and Smart City preparedness, however, have yet to be explored.

In this exercise, we will segment Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data extracted from Singapore Department of Statistics and urban functions extracted from the geospatial data provided.

2 Getting Started

2.1 The Data

The following data sets will be used for this study:

5 Geospatial Data Sets: Govt_Embassy, Private residential, Shopping, Business and Financial in ESRI shapefile format.
Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 in CSV format taken from Singstat
URA Master Plan 2014 Planning Subzone boundary in ESRI shapefile format taken from URA.

2.2 Installing & Loading R Packages

packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'raster', 'formattable')

for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}

3 Importing Data

3.1 Importing Aspatial Data

The csv file containing Singapore residents’ data will be imported using read_csv(). For the purpose of the analysis, we will only be using data from 2019. The subzone names are changed to capital letters for ease of data handling further on.

population <- read_csv("data/aspatial/respopagesextod2011to2019.csv") %>%
  filter(Time==2019) %>%
  mutate_at(.vars = vars(PA, SZ), .funs = funs(toupper))

## Parsed with column specification:
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

Checking for Na/Invalid Values:

population[rowSums(is.na(population)) != 0,]

## # A tibble: 0 x 7
## # ... with 7 variables: PA <chr>, SZ <chr>, AG <chr>, Sex <chr>, TOD <chr>,
## #   Pop <dbl>, Time <dbl>

The data set is valid.

3.2 Importing Geospatial Data

Importing Government Embassies, Business, Shopping, Financial and Residential data using st_read():

government <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS:            4326

business <- st_read(dsn = "data/geospatial", layer = "Business")

## Reading layer `Business' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS:            4326

shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")

## Reading layer `Shopping' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS:            4326

financial <- st_read(dsn = "data/geospatial", layer = "Financial")

## Reading layer `Financial' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS:            4326

residential <- st_read(dsn = "data/geospatial", layer = "Private residential")

## Reading layer `Private residential' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS:            4326

Importing Planning Area subzones data using st_read():

subzones <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `F:\IS415\Assignments\IS415_Take-Home_Ex03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

Checking subzone data set for NA/Invalid values:

test <- st_is_valid(subzones)
length(which(test==FALSE))

## [1] 9

There are 9 subzone polygons that are invalid as observed from the st_is_valid test. Hence, we need to make these polygons valid.

Making polygons valid:

subzones <- st_make_valid(subzones)
test <- st_is_valid(subzones)
length(which(test==FALSE))

## [1] 0

The 9 subzones are now valid.

3.3 Transforming CRS of Geospatial Data

Checking CRS of Geospatial Data:

crs(government)

## [1] "+proj=longlat +datum=WGS84 +no_defs "

crs(subzones)

## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "

The CRS of the Geospatial data and the subzone data is not the same hence we will transform the crs of the geospatial data to match the crs of the subzones.

Transforming CRS:

government_crs <- st_transform(government, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")

business_crs <- st_transform(business, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")

shopping_crs <- st_transform(shopping, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")

financial_crs <- st_transform(financial, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")

residential_crs <- st_transform(residential, crs = "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs")

Rechecking to confirm if the CRS are the same:

crs(government_crs)

## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "

crs(subzones)

## [1] "+proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs "

The crs projections are now the same.

4 Data Wrangling

4.1 Extracting Indicators from Aspatial Data

Extracting Economy Active Population:

i_economy_active <- population %>% 
  filter(AG == "25_to_29" | AG == "30_to_34" | AG == "35_to_39" | AG == "40_to_44" | AG == "45_to_49" | AG == "50_to_54" | AG == "55_to_59" | AG == "60_to_64") %>%
  group_by(SZ = SZ) %>%
  summarise(economy_active = sum(Pop)) %>% 
  ungroup()

Extracting Young Group:

i_young <- population %>% 
  filter(AG == "0_to_4" | AG == "5_to_9" | AG == "10_to_14" | AG == "15_to_19" | AG == "20_to_24" ) %>%
  group_by(SZ = SZ) %>%
  summarise(young = sum(Pop)) %>%
  ungroup()

Extracting Aged Group:

i_aged <- population %>% 
  filter(AG == "65_to_69" | AG == "70_to_74" | AG == "75_to_79" | AG == "80_to_84" | AG == "85_to_89" | AG == "90_and_over") %>%
  group_by(SZ = SZ) %>%
  summarise(aged = sum(Pop)) %>%
  ungroup()

Calculating Population Density :

popsum_by_sz <- population %>%
  group_by(SZ) %>%
  summarise(pop_sum = sum(Pop)) 

sg_popsum <- left_join(popsum_by_sz, subzones, by = c("SZ" = "SUBZONE_N"))

## Warning: Column `SZ`/`SUBZONE_N` joining character vector and factor, coercing
## into character vector

density <- sg_popsum %>%
  group_by(SZ) %>%
  transmute(density = pop_sum/(SHAPE_Area/1000000)) %>%
  ungroup()

Getting HDB1-2RM Dwellers:

i_hdb_1to2 <- population %>%
  filter(TOD == "HDB 1- and 2-Room Flats") %>%
  group_by(SZ = SZ) %>%
  summarise(hdb_1to2 = sum(Pop)) %>%
  ungroup()

Getting HDB3-4RM Dwellers:

i_hdb_3to4 <- population %>%
  filter(TOD == "HDB 3-Room Flats" | TOD == "HDB 4-Room Flats") %>%
  group_by(SZ = SZ) %>%
  summarise(hdb_3to4 = sum(Pop)) %>%
  ungroup()

Getting HDB5RM-Ec Dwellers:

HUDC flats that are not privatised are considered to be executive condominiums according to https://www.propertyguru.com.sg/property-guides/executive-condo-vs-private-condo-10371 hence they will be included for this indicator.

i_hdb_5toec <- population %>%
  filter(TOD == "HDB 5-Room and Executive Flats" | TOD == "HUDC Flats (excluding those privatised)") %>%
  group_by(SZ = SZ) %>%
  summarise(hdb_5toec = sum(Pop)) %>%
  ungroup()

Getting Condominium & Apartment Dwellers:

i_condo <- population %>%
  filter(TOD == "Condominiums and Other Apartments") %>%
  group_by(SZ = SZ) %>%
  summarise(condo = sum(Pop)) %>%
  ungroup()

Getting Landed Property Dwellers:

i_landed <- population %>%
  filter(TOD == "Landed Properties") %>%
  group_by(SZ = SZ) %>%
  summarise(landed = sum(Pop)) %>%
  ungroup()

4.2 Combining Social Demographics

Combining the 9 social demographic indicators by using the common variable ‘SZ’:

join1 <- left_join(density, i_young, by = c('SZ' = 'SZ'))
join2 <- left_join(join1, i_economy_active, by = c('SZ'='SZ'))          
join3 <- left_join(join2, i_aged, by = c('SZ'='SZ'))                 
join4 <- left_join(join3, i_hdb_1to2, by = c('SZ'='SZ'))                 
join5 <- left_join(join4, i_hdb_3to4, by = c('SZ'='SZ'))                 
join6 <- left_join(join5, i_hdb_5toec, by = c('SZ'='SZ'))             
join7 <- left_join(join6, i_condo, by = c('SZ'='SZ'))    

socialdemographics <- left_join(join7, i_landed, by = c('SZ' = 'SZ'))

4.3 Joining Geospatial Data with Subzones

sg_government <- st_join(government_crs, subzones)
sg_business <- st_join(business_crs, subzones)
sg_shopping <- st_join(shopping_crs, subzones)
sg_financial <- st_join(financial_crs, subzones)
sg_residential <- st_join(residential_crs, subzones)

4.4 Extracting Urban Functions

Extracting Government Buildings:

sg_government <- st_set_geometry(sg_government,NULL)

u_government <- sg_government %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(government = length(POI_NAME))

Extracting Businesses:

Upon examining the dataset, it can be seen that the facility type for businesses is 5000. Hence, we will filter out businesses with FAC_TYPE equal to 5000.

sg_business <- st_set_geometry(sg_business,NULL)

u_business <- sg_business %>%
  filter(FAC_TYPE == 5000) %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(business = length(POI_NAME))

Extracting Industries:

Upon examining the dataset, it can be seen that the facility type for industries is 9991. Hence, we will filter out industries with FAC_TYPE equal to 9991.

u_industry <- sg_business %>%
  filter(FAC_TYPE == 9991) %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(industry = length(POI_NAME))

Extracting Shopping Establishments:

sg_shopping <- st_set_geometry(sg_shopping,NULL)

u_shopping <- sg_shopping %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(shopping = length(POI_NAME))

Extracting Financial Establishments:

sg_financial <- st_set_geometry(sg_financial,NULL)

u_financial <- sg_financial %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(financial = length(POI_NAME))

Extracting Upmarket Residential Areas:

sg_residential <- st_set_geometry(sg_residential,NULL)

u_residential <- sg_residential %>%
  group_by(SUBZONE_N = SUBZONE_N) %>%
  summarise(upmarket_residential = length(POI_NAME))

4.5 Combining Urban Functions

Combining the 6 urban functions using the common variable ‘SUBZONE_N’:

combine1 <- left_join(u_financial, u_business, by = c('SUBZONE_N' = 'SUBZONE_N'))
combine2 <- left_join(combine1, u_industry, by = c('SUBZONE_N'='SUBZONE_N'))            
combine3 <- left_join(combine2, u_shopping, by = c('SUBZONE_N'='SUBZONE_N'))  
combine4 <- left_join(combine3, u_residential, by = c('SUBZONE_N'='SUBZONE_N'))               
urbanfunctions <- left_join(combine4, u_government, by = c('SUBZONE_N' = 'SUBZONE_N'))

4.6 Combining Social Demographics with Urban Functions

Combining all the cluster variables together:

sd_uf <- left_join(socialdemographics, urbanfunctions, by = c('SZ' = 'SUBZONE_N'))

## Warning: Column `SZ`/`SUBZONE_N` joining character vector and factor, coercing
## into character vector

Replacing NA values with 0:

sd_uf[is.na(sd_uf)] <- 0

Viewing summary statistics of sd_uf dataframe:

summary(sd_uf)

##       SZ               density          young       economy_active 
##  Length:323         Min.   :    0   Min.   :    0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Mode  :character   Median : 5910   Median : 1170   Median : 2840  
##                     Mean   :10737   Mean   : 3296   Mean   : 7386  
##                     3rd Qu.:19928   3rd Qu.: 4365   3rd Qu.:10285  
##                     Max.   :46058   Max.   :34260   Max.   :79780  
##       aged          hdb_1to2       hdb_3to4       hdb_5toec         condo      
##  Min.   :    0   Min.   :   0   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.:    0   1st Qu.:   0   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Median :  730   Median :   0   Median :    0   Median :    0   Median :  230  
##  Mean   : 1806   Mean   : 542   Mean   : 5953   Mean   : 3297   Mean   : 1827  
##  3rd Qu.: 3000   3rd Qu.: 605   3rd Qu.: 9705   3rd Qu.: 3660   3rd Qu.: 2835  
##  Max.   :18860   Max.   :4700   Max.   :75000   Max.   :47960   Max.   :16770  
##      landed          financial         business         industry     
##  Min.   :    0.0   Min.   :  0.00   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:    0.0   1st Qu.:  1.00   1st Qu.:  0.00   1st Qu.:0.0000  
##  Median :    0.0   Median :  5.00   Median :  1.00   Median :0.0000  
##  Mean   :  773.9   Mean   : 10.28   Mean   : 12.43   Mean   :0.2632  
##  3rd Qu.:  400.0   3rd Qu.: 13.00   3rd Qu.:  8.00   3rd Qu.:0.0000  
##  Max.   :18820.0   Max.   :134.00   Max.   :254.00   Max.   :8.0000  
##     shopping      upmarket_residential   government    
##  Min.   : 0.000   Min.   :  0.00       Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.:  0.00       1st Qu.: 0.000  
##  Median : 0.000   Median :  3.00       Median : 0.000  
##  Mean   : 1.573   Mean   : 10.54       Mean   : 1.294  
##  3rd Qu.: 1.000   3rd Qu.: 10.50       3rd Qu.: 1.000  
##  Max.   :31.000   Max.   :217.00       Max.   :19.000

From the summary, we can see that the range of values for each variable differs signicantly compared to other variables.

5 Exploratory Data Analysis (EDA)

5.1 EDA using statistical Graphs

Plotting histograms to display the distribution of the social demographic indicators as well as the urban functions for subzones:

hist_density <- ggplot(data=sd_uf, 
             aes(x= `density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_young <- ggplot(data=sd_uf, 
             aes(x= `young`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_economy_active<- ggplot(data=sd_uf, 
             aes(x= `economy_active`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_aged<- ggplot(data=sd_uf, 
             aes(x= `aged`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_hdb1to2<- ggplot(data=sd_uf, 
             aes(x= `hdb_1to2`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_hdb3to4<- ggplot(data=sd_uf, 
             aes(x= `hdb_3to4`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_hdb5toec<- ggplot(data=sd_uf, 
             aes(x= `hdb_5toec`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_condo<- ggplot(data=sd_uf, 
             aes(x= `condo`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_landed<- ggplot(data=sd_uf, 
             aes(x= `landed`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_government<- ggplot(data=sd_uf, 
             aes(x= `government`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_business<- ggplot(data=sd_uf, 
             aes(x= `business`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_industry<- ggplot(data=sd_uf, 
             aes(x= `industry`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_shopping<- ggplot(data=sd_uf, 
             aes(x= `shopping`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_financial<- ggplot(data=sd_uf, 
             aes(x= `financial`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

hist_upmarket_residential<- ggplot(data=sd_uf, 
             aes(x= `upmarket_residential`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(hist_density, hist_young, hist_economy_active, hist_aged, hist_hdb1to2, hist_hdb3to4, hist_hdb5toec, hist_condo, hist_landed, hist_government, hist_business, hist_industry, hist_shopping, hist_financial, hist_upmarket_residential,
          ncol = 3, 
          nrow = 5)

From the histograms above, it is observed that the data is right skewed. The data will need to be standardised subsequently.

5.2 Correlation Analysis

Plotting correlation plot:

cluster_vars.cor = cor(sd_uf[,2:16])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black")

From the correlation plot, we can see that the 3 age groups (young, economy active and aged) and hdb housing (3-4 rooms and 5-ec) are highly correlated to one another. This suggests that only one out of these 5 variables will be required for cluster analysis. Due to the strong postive correlation, we will only be used economy active age group.

6 Hierarchy Cluster Analysis

6.1 Extracting Clustering Variables

Excluding unwanted variables:

Amongst the 5 highly correlated variables, we will only include economy active people.

variables <- sd_uf %>%
  dplyr::select("SZ", "density", "economy_active", "hdb_1to2", "condo", "landed", "government", "business", "industry", "shopping", "financial", "upmarket_residential")

Changing the rows to subzone names instead of row numbers:

row.names(variables) <- variables$SZ

## Warning: Setting row names on a tibble is deprecated.

Removing the subzone names column:

as_data_frame(variables)

## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## # A tibble: 323 x 12
##    SZ    density economy_active hdb_1to2 condo landed government business
##    <chr>   <dbl>          <dbl>    <dbl> <dbl>  <dbl>      <int>    <int>
##  1 ADMI~  11184.           8380     1140     0      0          0        0
##  2 AIRP~      0               0        0     0      0          0        0
##  3 ALEX~  13374.           7560     3910     0      0          7       39
##  4 ALEX~   7218.           1420        0  2120      0          0        0
##  5 ALJU~  13581.          24330     2120 11930    460          2       62
##  6 ANAK~   8037.          12490      160  9020   7600          1       13
##  7 ANCH~  31092.          27740     1680  5000      0          0        0
##  8 ANG ~  15432.           2790        0  1930      0          1        1
##  9 ANSON      0               0        0     0      0          3       15
## 10 BALE~  17004.          19320     1790  9610    570          6       23
## # ... with 313 more rows, and 4 more variables: industry <int>, shopping <int>,
## #   financial <int>, upmarket_residential <int>

cluster_variables <- dplyr::select(variables, c(2:12))
summary(cluster_variables)

##     density      economy_active     hdb_1to2        condo      
##  Min.   :    0   Min.   :    0   Min.   :   0   Min.   :    0  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:   0   1st Qu.:    0  
##  Median : 5910   Median : 2840   Median :   0   Median :  230  
##  Mean   :10737   Mean   : 7386   Mean   : 542   Mean   : 1827  
##  3rd Qu.:19928   3rd Qu.:10285   3rd Qu.: 605   3rd Qu.: 2835  
##  Max.   :46058   Max.   :79780   Max.   :4700   Max.   :16770  
##      landed          government        business         industry     
##  Min.   :    0.0   Min.   : 0.000   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:  0.00   1st Qu.:0.0000  
##  Median :    0.0   Median : 0.000   Median :  1.00   Median :0.0000  
##  Mean   :  773.9   Mean   : 1.294   Mean   : 12.43   Mean   :0.2632  
##  3rd Qu.:  400.0   3rd Qu.: 1.000   3rd Qu.:  8.00   3rd Qu.:0.0000  
##  Max.   :18820.0   Max.   :19.000   Max.   :254.00   Max.   :8.0000  
##     shopping        financial      upmarket_residential
##  Min.   : 0.000   Min.   :  0.00   Min.   :  0.00      
##  1st Qu.: 0.000   1st Qu.:  1.00   1st Qu.:  0.00      
##  Median : 0.000   Median :  5.00   Median :  3.00      
##  Mean   : 1.573   Mean   : 10.28   Mean   : 10.54      
##  3rd Qu.: 1.000   3rd Qu.: 13.00   3rd Qu.: 10.50      
##  Max.   :31.000   Max.   :134.00   Max.   :217.00

6.2 Data Standardisation

The values range of the variables are very different. In order to prevent the cluster analysis from being biased, we will standardise the variables before proceeding with cluster analysis.

We will use min-max standardisation as the variables are not normally distributed as observed from the EDA previously.

cluster_variables.std <- heatmaply::normalize(cluster_variables)
summary(cluster_variables.std)

##     density       economy_active       hdb_1to2          condo        
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.1283   Median :0.03560   Median :0.0000   Median :0.01371  
##  Mean   :0.2331   Mean   :0.09258   Mean   :0.1153   Mean   :0.10894  
##  3rd Qu.:0.4327   3rd Qu.:0.12892   3rd Qu.:0.1287   3rd Qu.:0.16905  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##      landed          government         business           industry      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.003937   Median :0.00000  
##  Mean   :0.04112   Mean   :0.06811   Mean   :0.048951   Mean   :0.03289  
##  3rd Qu.:0.02125   3rd Qu.:0.05263   3rd Qu.:0.031496   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##     shopping         financial        upmarket_residential
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000     
##  1st Qu.:0.00000   1st Qu.:0.007463   1st Qu.:0.00000     
##  Median :0.00000   Median :0.037313   Median :0.01382     
##  Mean   :0.05073   Mean   :0.076706   Mean   :0.04859     
##  3rd Qu.:0.03226   3rd Qu.:0.097015   3rd Qu.:0.04839     
##  Max.   :1.00000   Max.   :1.000000   Max.   :1.00000

Comparing distribution of variables before and after standardisation:

We will use density as an example to do the comparison.

r <- ggplot(data= sd_uf,
             aes(x= `density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

df_cluster_variables.std <- as.data.frame(cluster_variables.std)

s <- ggplot(data=df_cluster_variables.std, 
       aes(x=`density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

ggarrange(r, s,
          ncol = 2,
          nrow = 1)

This shows us that the overall distribution of population density has not been altered significantly hence the distribution for the rest of the variables should not differ greatly too.

6.3 Computing Proximity Matrix

Calculating proximity matrix using default euclidean method:

proximity_matrix <- dist(cluster_variables.std, method = 'euclidean')

6.4 Computing Hierarchical Clustering

6.4.1 Selecting Optimal Clustering Algorithm

Getting aggomerative coefficient of all the the hierarchical clustering algorithms by using the agnes() function of the cluster package.

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

ac <- function(x) {
  agnes(cluster_variables.std, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.9002887 0.8142469 0.9236774 0.9769280

From the output, we can observe that the Ward’s method provides the strongest clustering structure as it has the highest agglomerative coefficient amongst all the methods assessed. From https://stat.ethz.ch/R-manual/R-devel/library/stats/html/hclust.html we note that the method “ward” in agnes() corresponds to “ward.D2” in hclust. Hence, the Ward’s method will be used to analyse the variables.

6.4.2 Computing Hierarchical Clustering using Ward’s D2 Method

hcluster_wards <- hclust(proximity_matrix, method = 'ward.D2')

Plotting clustering dendogram:

plot(hcluster_wards, cex = 0.6)

Determining number of clusters using K elbow method:

mydata <- cluster_variables.std
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
  for (i in 1:11) wss[i] <- sum(kmeans(mydata,
                                       centers=i)$withinss)
plot(1:11, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

As seen from the plot, 2 and 5 have elbows. 5 clusters will be most suitable for the analysis.

Plotting the dendogram with highlighted clusters:

plot(hcluster_wards, cex = 0.6)
rect.hclust(hcluster_wards, k = 5, border = 2:5)

As we move up the tree, observations that are similar to one another are combined into branches. The higher the height of the fusion, the less similar the observations are.

6.5 Visually-Driven Hierarchical Clustering Analysis

6.5.1 Transforming the variables into a data matrix:

Converting the dataframe into a matrix to plot a heatmap displaying the clustering:

matrix_cluster_variables <- data.matrix(cluster_variables.std)

6.5.2 Plotting Interactive Cluster Heatmap

heatmaply(matrix_cluster_variables,
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D2",
          seriate = "OLO",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Shan State by ICT indicators",
          xlab = "Cluster Variables",
          ylab = "Subzones of Singapore"
          )

6.5.3 Mapping Clusters Formed

We will be retaining 5 clusters to be mapped upon observation of the dendogram.

groups <- as.factor(cutree(hcluster_wards, k = 5))

Plotting choropleth map with the clusters:

subzone_clusters <- cbind(subzones, as.matrix(groups)) %>% rename(`CLUSTER` = `as.matrix.groups.`)

qtm(subzone_clusters, "CLUSTER")

From the chloropleth, it is revealed that the clusters are very fragmented. The different clusters are scattered across Singapore. This signifies the major constrain faced when the hierarchical cluster analysis method is used.

7 Spatially Constrained Clustering using SKATER Approach

7.1 Converting Subzones Data into SpatialPolygonsDataFrame

The SKATER function requires the spatial data to be in ‘sp’ format.

subzones_sp <- as_Spatial(subzones)

7.2 Computing Neighbour List

Computing neighbours list will from the polygon list:

subzones.nb <- poly2nb(subzones_sp)
summary(subzones.nb)

## Neighbour list object:
## Number of regions: 323 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.853751 
## Average number of links: 5.987616 
## 5 regions with no links:
## 17 18 19 295 302
## Link number distribution:
## 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  5  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links

Plotting neighbours list:

plot(subzones_sp, border=grey(0.5))
plot(subzones.nb, coordinates(subzones_sp), col = "red", add = TRUE)

There are some clusters without neighbours.

7.3 Computing Minimum Spanning Tree

7.3.1 Removing Subzones without Neighbours

There are 5 regions with no links hence we will remove them as we will not be able to calculate edge costs without any neighbours.

sub_nb_subzones <- subset(subzones.nb, subset = card(subzones.nb) > 0)
summary(sub_nb_subzones)

## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links

The subzones without neighbours have been removed.

7.3.2 Calculating Edge costs

The cost of each edge is computed as it is the distance between the nodes.

lcosts <- nbcosts(sub_nb_subzones, cluster_variables.std)

Incorporating costs as weights objects:

subzones.w <- nb2listw(sub_nb_subzones, lcosts, style = "B")
summary(subzones.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 16 234 with 1 link
## 1 most connected region:
## 313 with 17 links
## 
## Weights style: B 
## Weights constants summary:
##     n     nn       S0       S1       S2
## B 318 101124 1266.783 2190.305 25762.91

7.3.3 Computing Minimum Spanning Tree

The minimum spanning tree is computing by using the mstree() function from spdep package:

subzones.mst <- mstree(subzones.w)

Checking the class and dimension of the MST computed:

class(subzones.mst)

## [1] "mst"    "matrix"

dim(subzones.mst)

## [1] 317   3

It is a matrix and the dimension is 318.

Plotting the MST to view the observation numbers:

plot(subzones_sp, border=gray(0.5))
plot.mst(subzones.mst, coordinates(subzones_sp), col = "red", cex.lab = 0.7, cex.circles = 0.005, add = TRUE)

7.4 Computing Spatially Constrained Clusters (SKATER Method)

Computing Clusters:

clusters <- skater(subzones.mst[,1:2], cluster_variables.std, 4)

Plotting pruned tree that shows the clusters:

plot(subzones_sp, border = gray(0.5))
plot(clusters, coordinates(subzones_sp), cex.lab = 0.7,      groups.colors = c("red", "blue", "green", "brown", "pink"), cex.circles = 0.005, add = TRUE)

7.5 Visualising Clusters & the Means of their Variabls

Converting the clusters into a matrix:

groups_matrix <- as.matrix(clusters$groups)

Creating function to include subzones with no neighbours:

insert_row <- function(existingDF, newrow, r) {
  existingDF[seq(r+1,nrow(existingDF)+1),] <- existingDF[seq(r,nrow(existingDF)),]
  existingDF[r,] <- newrow
  existingDF
}

Implementing function to add in rows for the subzones without neighbours that were removed previously:

The value added for each of this subzone will be 0 to indicate that they have no neighbours.

The subzones that were removed were (17,18,19,295 and 302) hence their rows will be appended respectively.

existingDF <- as.data.frame(groups_matrix)
newrow <- 0
r <- 17
existingDF <- insert_row(existingDF, newrow , r)

newrow <- 0
r <- 18

existingDF <- insert_row(existingDF, newrow , r)

newrow <- 0
r <- 19
existingDF <- insert_row(existingDF, newrow , r)

newrow <- 0
r <- 295
existingDF <- insert_row(existingDF, newrow , r)

newrow <- 0
r <- 302

existingDF <- insert_row(existingDF, newrow , r)

groups_matrix <- as.matrix(existingDF)

Computing Mean Values of Variables for Clusters

dataset <- cbind(cluster_variables.std, as.factor(groups_matrix)) %>%
  rename(`SP_CLUSTER` = `as.factor(groups_matrix)`)

mean_values <- dataset %>%
  group_by(SP_CLUSTER) %>%
  summarise_all(funs(mean))

mean_vals <- data.frame(t(mean_values[-1]))
colnames(mean_vals) <- mean_values[, 1]
names(mean_vals)[1] <- "Cluster 0"
names(mean_vals)[2] <- "Cluster 1"
names(mean_vals)[3] <- "Cluster 2"
names(mean_vals)[4] <- "Cluster 3"
names(mean_vals)[5] <- "Cluster 4"
names(mean_vals)[6] <- "Cluster 5"

mean_vals <- mean_vals %>%
  dplyr::select(`Cluster 1`, `Cluster 2`, `Cluster 3`, `Cluster 4`, `Cluster 5`)

Viewing as a table:

	Cluster 1	Cluster 2	Cluster 3	Cluster 4	Cluster 5
density	0.22332473	0.13495013	0.29323263	0.26552265	0.25944264
economy_active	0.08786714	0.10049511	0.11238861	0.05569483	0.13948358
hdb_1to2	0.10570247	0.07500000	0.14727273	0.37943262	0.17787234
condo	0.10921035	0.07871199	0.11968342	0.02862254	0.12176506
landed	0.04520129	0.01554198	0.03267317	0.00000000	0.02412327
government	0.07138291	0.01973684	0.05454545	0.35087719	0.02105263
business	0.05017693	0.08661417	0.03493200	0.12992126	0.06771654
industry	0.02935223	0.14062500	0.02954545	0.12500000	0.05000000
shopping	0.05472117	0.03629032	0.04105572	0.02150538	0.01290323
financial	0.08042782	0.04570896	0.06309362	0.19402985	0.04328358
upmarket_residential	0.05362040	0.01152074	0.03544198	0.01689708	0.02396313

Visualising the clusters obtained through SKATER method:

subzones_sf_clusters <- cbind(subzone_clusters, as.factor(groups_matrix)) %>%
  rename(`SP_CLUSTER` = `as.factor.groups_matrix.`)

tmap_mode("view")

## tmap mode set to interactive viewing

tm_shape(subzones_sf_clusters)+
  tm_borders(alpha = 0.5)+
  tm_fill(col = "SP_CLUSTER", alpha = 0.9, id = "SUBZONE_N")

7.6 Analysing Results

Cluster 0 occupies the green regions of the map (e.g North-Eastern Islands, Sudong, Semakau, etc). These are areas with no neighbours and least social activity is found here.
Cluster 1 occupies the majority of Singapore, covering most areas except Northern Singapore. Cluster 1 has the highest proportion of landed properties, government buildings such as embassies, shopping establishments such as malls and upmarket residential areas. Cluster 1 has the least number of industial areas. This shows us that government buildings and amenties such as shopping malls are widely spread around Singapore for ease of accessibility.
Cluster 2 is located closer to the North of Singapore. Cluster 2 is dominated by industrial areas and the least population density. It also has the lowest proportion of government areas and housing such as 1-2 bedroom hdb or upmarket residential areas. It can be inferred that Cluster 2 represents an industrial area with less people residing.
Cluster 3 is located towards the Northern most part of Singapore and surrounds Cluster 2. Cluster 3 has the highest population density and the lowest proportion of business areas. This is a densely populated residential area with many housing estates.
Cluster 4 occupies a small region around Boat Quay, Clarke Quay and Raffles Place. It has the highest proportion of business and financial buildings. It has the least number of condominium and landed properties. Cluster 4 represents the Central Business District of Singapore that is dominated by businesses and banks hence there are less residential areas here.
Cluster 5 lies towards the North-East region of Singapore at Punggol. It has the highest proportion of economy active people staying there and the lowest proportion of financial buildings. The reason why many economy active people are staying there may be due to Punggol/ Sengkang being developing residential towns with many new HDBs being built there. Many economy active people will choose to buy a new flat at the Punggol region.
It can be inferred that the East, Central, South and West regions of Singapore are more prominent social areas as most of the urban functions are found in these regions.

Geospatial Analytics and Applications Assignment 3

Harpreet Kaur

21/05/2020