Overview

In this exercise, I will segment Singapore at the planning subzone level into homogeneous socioeconomic areas by combining geodemographic data extracted from Singapore Department of Statistics and urban functions extracted from the geospatial data provided.

Importing packages

packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'ggplot2','dplyr','factoextra','NbClust','normalr','reshape2')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}

## Loading required package: rgdal

## Loading required package: sp

## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/vanes/Documents/R/win-library/3.6/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/vanes/Documents/R/win-library/3.6/rgdal/proj
##  Linking to sp version: 1.4-1

## Loading required package: spdep

## Loading required package: spData

## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`

## Loading required package: sf

## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1

## Loading required package: ClustGeo

## Loading required package: tmap

## Loading required package: ggpubr

## Loading required package: ggplot2

## Loading required package: cluster

## Loading required package: heatmaply

## Loading required package: plotly

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

## Loading required package: viridis

## Loading required package: viridisLite

## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus

## 
## ======================
## Welcome to heatmaply version 1.1.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================

## Loading required package: corrplot

## corrplot 0.84 loaded

## Loading required package: psych

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

## Loading required package: tidyverse

## -- Attaching packages -------------

## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ----------------------
## x psych::%+%()    masks ggplot2::%+%()
## x psych::alpha()  masks ggplot2::alpha()
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()

## Loading required package: factoextra

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

## Loading required package: NbClust

## Loading required package: normalr

## Loading required package: reshape2

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

Importing geospatial data

mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21

test <- st_is_valid(mpsz)
length(which(test==FALSE))

## [1] 9

mpsz <- st_make_valid(mpsz)

Updating CRS information

mpsz <- st_transform(mpsz, 3414)
st_crs(mpsz)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

view extent and plot

st_bbox(mpsz)

##      xmin      ymin      xmax      ymax 
##  2667.538 15748.721 56396.440 50256.334

Selecting polygons to be removed

For the analysis, we will focus on mainland Singapore as we need the nearest neighbour weights in our analysis.

mpsz <-mpsz[!mpsz$SUBZONE_N=="CENTRAL WATER CATCHMENT",] 
mpsz <-mpsz[!mpsz$PLN_AREA_N=="WESTERN ISLANDS",] 
mpsz <-mpsz[!mpsz$PLN_AREA_N=="SOUTHERN ISLANDS",] 
mpsz <-mpsz[!mpsz$PLN_AREA_N=="NORTH-EASTERN ISLANDS",] 
mpsz <-mpsz[!mpsz$SUBZONE_N=="PULAU SELETAR",] 
mpsz <-mpsz[!mpsz$SUBZONE_N=="WESTERN WATER CATCHMENT",] 
mpsz <-mpsz[!mpsz$SUBZONE_N=="TUAS VIEW EXTENSION",]

plot(mpsz$geometry)

Inspect Project string

mpsz.spdf <- as_Spatial(mpsz)
proj4string(mpsz.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Business Data

business <- st_read(dsn = "data/geospatial", layer = "Business")

## Reading layer `Business' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## geographic CRS: WGS 84

business <- business %>%
  filter(FAC_TYPE == 5000)

Updating CRS information

business<- st_transform(business, 3414)
st_crs(business)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

Business.spdf <- as_Spatial(business)
proj4string(Business.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$business <- lengths(st_intersects(mpsz, business))

industry Data

industry <- st_read(dsn = "data/geospatial", layer = "Business")

## Reading layer `Business' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## geographic CRS: WGS 84

industry <- industry %>%
  filter(FAC_TYPE == 9991)

Updating CRS information

industry<- st_transform(industry, 3414)
st_crs(industry)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

industry.spdf <- as_Spatial(industry)
proj4string(industry.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$industry <- lengths(st_intersects(mpsz, industry))

Financial Data

financial <- st_read(dsn = "data/geospatial", layer = "financial")

## Reading layer `financial' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## geographic CRS: WGS 84

Updating CRS information

financial<- st_transform(financial, 3414)
st_crs(financial)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

financial.spdf <- as_Spatial(financial)
proj4string(financial.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$financial <- lengths(st_intersects(mpsz, financial))

Government Data

Government <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## geographic CRS: WGS 84

Updating CRS information

Government<- st_transform(Government, 3414)
st_crs(Government)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

Government.spdf <- as_Spatial(Government)
proj4string(Government.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$Government <- lengths(st_intersects(mpsz, Government))

residential Data

residential <- st_read(dsn = "data/geospatial", layer = "Private residential")

## Reading layer `Private residential' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## geographic CRS: WGS 84

Updating CRS information

residential<- st_transform(residential, 3414)
st_crs(residential)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

residential.spdf <- as_Spatial(residential)
proj4string(residential.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$residential <- lengths(st_intersects(mpsz, residential))

shopping Data

shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")

## Reading layer `Shopping' from data source `C:\IS415\take home exercise\take home exercise 3\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## geographic CRS: WGS 84

Updating CRS information

shopping<- st_transform(shopping, 3414)
st_crs(shopping)

## Coordinate Reference System:
##   User input: EPSG:3414 
##   wkt:
## PROJCRS["SVY21 / Singapore TM",
##     BASEGEOGCRS["SVY21",
##         DATUM["SVY21",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["degree",0.0174532925199433]],
##         ID["EPSG",4757]],
##     CONVERSION["Singapore Transverse Mercator",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["northing (N)",north,
##             ORDER[1],
##             LENGTHUNIT["metre",1]],
##         AXIS["easting (E)",east,
##             ORDER[2],
##             LENGTHUNIT["metre",1]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["Singapore"],
##         BBOX[1.13,103.59,1.47,104.07]],
##     ID["EPSG",3414]]

Inspect project string

shopping.spdf <- as_Spatial(shopping)
proj4string(shopping.spdf)

## [1] "+proj=tmerc +lat_0=1.36666666666667 +lon_0=103.833333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs"

Aggregating with mpsz Dataframe

mpsz$shopping <- lengths(st_intersects(mpsz, shopping))

Importing the aspatial data

sgpop <- read_csv("data/aspatial/sgpop.csv")

## Parsed with column specification:
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

summary(sgpop)

##       PA                 SZ                 AG                Sex           
##  Length:883728      Length:883728      Length:883728      Length:883728     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##      TOD                 Pop               Time     
##  Length:883728      Min.   :   0.00   Min.   :2011  
##  Class :character   1st Qu.:   0.00   1st Qu.:2013  
##  Mode  :character   Median :   0.00   Median :2015  
##                     Mean   :  39.83   Mean   :2015  
##                     3rd Qu.:  10.00   3rd Qu.:2017  
##                     Max.   :2860.00   Max.   :2019

According to the summary , there are no missing data and NA values have been checked.

Filtering data - Extract indicators

Age Group There are 3 age groups (Economy active population (i.e. age 25-64),Young group (i.e. age 0-24) and Aged group (i.e 65 and above))

sgpop_ag <- sgpop %>%
  filter(Time == 2019) %>%
  spread(AG, Pop) %>%
  mutate(young = `0_to_4`+`5_to_9`+`10_to_14`+
  `15_to_19`+`20_to_24`) %>%
  mutate(economy_active = `25_to_29`+`30_to_34`+`35_to_39`+`40_to_44`+`45_to_49`+`50_to_54`+`55_to_59`+`60_to_64`) %>%
  mutate(aged=`65_to_69`+`70_to_74`+`75_to_79`+ `80_to_84`+`85_to_89`+`90_and_over`) %>%
  mutate_at(.vars = vars(SZ), .funs = funs(toupper)) %>%
  select(`SZ`, `young`, `economy_active`, `aged`) %>%
  group_by(`SZ`) %>%
  summarise(young=sum(young), economy_active=sum(economy_active), aged=sum(aged), TOTAL=sum(young, economy_active, aged))

## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.

Dwelling Type There are 5 dwelling types (HDB1-2RM dwellers, HDB3-4RM dwellers, HDB5RM-Ec dweller,Condominium and apartment dwellers, and Landed property dwellers)

sgpop_dwelling <- sgpop %>%
  filter(Time == 2019) %>%
  spread(TOD, Pop) %>%
  rename(`HDB12rm` = `HDB 1- and 2-Room Flats`) %>%
  rename(`HDB5rmec` = `HDB 5-Room and Executive Flats`) %>%
  rename(`condo` = `Condominiums and Other Apartments`) %>%
  rename(`landed` = `Landed Properties`) %>%
  mutate(`HDB34rm` = `HDB 3-Room Flats` + `HDB 4-Room Flats`) %>%
  mutate_at(.vars = vars(SZ), .funs = funs(toupper)) %>%
  select(`SZ`, `HDB12rm`, `HDB5rmec`, `condo`, `landed`, `HDB34rm`) %>%
  group_by(`SZ`) %>%
  summarise_at(c("HDB12rm", "HDB5rmec", "condo", "landed", "HDB34rm"), sum)

Merging data

merged <- left_join(mpsz , sgpop_ag, by = c("SUBZONE_N" = "SZ"))

## Warning: Column `SUBZONE_N`/`SZ` joining factor and character vector, coercing
## into character vector

merged <- left_join(merged, sgpop_dwelling, c("SUBZONE_N" = "SZ"))
merged$Density <- (merged$TOTAL / merged$SHAPE_Area)

Chloropleth mapping

Looking at the distribution according to age group

pop_density.map <- tm_shape(merged) + 
  tm_fill(col = "Density",
          n = 5,
          style = "jenks", 
          title = "Population Density") + 
  tm_borders(alpha = 0.5) 

young.map <- tm_shape(merged) + 
  tm_fill(col = "young",
          n = 5,
          style = "jenks",
          title = "Young Residents ") + 
  tm_borders(alpha = 0.5) 

ec.map <- tm_shape(merged) + 
  tm_fill(col = "economy_active",
          n = 5,
          style = "jenks",
          title = "Economy Active Residents ") + 
  tm_borders(alpha = 0.5) 

aged.map <- tm_shape(merged) + 
  tm_fill(col = "aged",
          n = 5,
          style = "jenks",
          title = "Aged Residents ") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(pop_density.map, young.map, ec.map, aged.map)

HDB12rm.map <- tm_shape(merged) + 
  tm_fill(col = "HDB12rm",
          n = 5,
          style = "jenks",
          title = "Number of Residents living in HDB12rm") + 
  tm_borders(alpha = 0.5) 

HDB34rm.map <- tm_shape(merged) + 
  tm_fill(col = "HDB34rm",
          n = 5,
          style = "jenks",
          title = "Number of Residents living in HDB34rm") + 
  tm_borders(alpha = 0.5) 

HDB5rmec.map <- tm_shape(merged) + 
  tm_fill(col = "HDB5rmec",
          n = 5,
          style = "jenks",
          title = "Number of Residents living in HDB5rmec") + 
  tm_borders(alpha = 0.5) 

condo.map <- tm_shape(merged) + 
  tm_fill(col = "condo",
          n = 5,
          style = "jenks",
          title = "Number of Residents living in condo") + 
  tm_borders(alpha = 0.5) 

landed.map <- tm_shape(merged) + 
  tm_fill(col = "landed",
          n = 5,
          style = "jenks",
          title = "Number of Residents living in landed") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(pop_density.map, HDB12rm.map, HDB34rm.map, HDB5rmec.map, condo.map, landed.map)

pop_density.map <- tm_shape(merged) + 
  tm_fill(col = "Density",
          n = 5,
          style = "jenks",
          title = "Number of Business") + 
  tm_borders(alpha = 0.5)


business.map <- tm_shape(merged) + 
  tm_fill(col = "business",
          n = 5,
          style = "jenks",
          title = "Number of Business") + 
  tm_borders(alpha = 0.5) 

industry.map <- tm_shape(merged) + 
  tm_fill(col = "industry",
          n = 5,
          style = "jenks",
          title = "Number of Industry") + 
  tm_borders(alpha = 0.5) 

financial.map <- tm_shape(merged) + 
  tm_fill(col = "financial",
          n = 5,
          style = "jenks",
          title = "Number of Financial") + 
  tm_borders(alpha = 0.5) 

govt.map <- tm_shape(merged) + 
  tm_fill(col = "Government",
          n = 5,
          style = "jenks",
          title = "Number of Govt") + 
  tm_borders(alpha = 0.5) 

pr.map <- tm_shape(merged) + 
  tm_fill(col = "residential",
          n = 5,
          style = "jenks",
          title = "Number of Private Residents") + 
  tm_borders(alpha = 0.5) 

shopping.map <- tm_shape(merged) + 
  tm_fill(col = "shopping",
          n = 5,
          style = "jenks",
          title = "Number of Shopping") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(pop_density.map, business.map, industry.map, financial.map, govt.map, pr.map, shopping.map,ncol=3, nrow = 3)

## Legend labels were too wide. The labels have been resized to 0.55, 0.55, 0.55, 0.55, 0.55. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.

From the plots, the value ranges are very different which may lead to biased results. This is due to the unit of measurement whcih is the number of people.n general, the subzones with relatively higher total number of households will also have higher number of people from different age group, dwellings and urban functions.Hence , we will derive new variables.

Derive new variables with dpylr

There may common derivation from the total resident count in each subzone. As such, we should instead use Ratios instead of raw count in order to properly determine proportion of Youths, Economically Active and Aged.

derived <- merged %>%
  mutate(YOUNG_RATIO = ifelse((young == 0), 0, (young/TOTAL))) %>%
  mutate(ACTIVE_RATIO = ifelse((economy_active == 0), 0, (economy_active/TOTAL))) %>%
  mutate(AGED_RATIO = ifelse((aged == 0), 0, (aged/TOTAL)))%>%
  mutate(HDB1_2RM_RATIO = ifelse((HDB12rm == 0), 0, (HDB12rm/TOTAL)))%>%
  mutate(HDB34RM_RATIO = ifelse((HDB34rm == 0), 0, (HDB34rm/TOTAL)))%>%
  
  mutate(HDB5RM_EF_RATIO = ifelse((HDB5rmec == 0), 0, (HDB5rmec/TOTAL)))%>%
  mutate(CONDO_APART_RATIO = ifelse((condo == 0), 0, (condo/TOTAL)))%>%
  mutate(LANDED_RATIO = ifelse((landed== 0), 0, (landed/TOTAL)))

derived <- derived %>% mutate_if(is.numeric, ~replace_na(., 0))

Exploratory Data Analysis (EDA)

EDA using statistical graphics

Histograms are useful to identify the overall distribution of the data values (i.e. left skew, right skew or normal distribution)

young <- ggplot(data=derived, 
             aes(x= `young`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

economy_active <- ggplot(data=derived, 
             aes(x= `economy_active`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

aged <- ggplot(data=derived, 
             aes(x= `aged`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

HDB12rm <- ggplot(data=derived, 
             aes(x= `HDB12rm`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

HDB34rm <- ggplot(data=derived, 
             aes(x= `HDB34rm`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

HDB5rmec <- ggplot(data=derived, 
             aes(x= `HDB5rmec`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

condo <- ggplot(data=derived, 
             aes(x= `condo`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

landed <- ggplot(data=derived, 
             aes(x= `landed`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

pop_density <- ggplot(data=derived, 
             aes(x= `Density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
biz <- ggplot(data=derived, 
             aes(x= `business`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ind <- ggplot(data=derived, 
             aes(x= `industry`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

fin <- ggplot(data=derived, 
             aes(x= `financial`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

gov <- ggplot(data=derived, 
             aes(x= `Government`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

pri <- ggplot(data=derived, 
             aes(x= `residential`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

shop <- ggplot(data=derived, 
             aes(x= `shopping`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(young, economy_active, aged, HDB12rm, HDB34rm, HDB5rmec, condo, landed, pop_density,biz, ind, fin, gov, pri, shop, ncol = 5, nrow = 3)

From the plots , we can see that the distribution are very skewed to the right .We cannot assume normal distribution.

To get a better analysis to identify the relationships between the various factors , I will know conduct a correlation analysis to better understand the relationships between the respective factors.

Correlation analysis

Selecting indicators

indicators <- derived %>%
  
  select("economy_active", "young", "aged", "Density", "HDB12rm", "HDB34rm",  "HDB5rmec", "condo", "landed", "business", "industry", "financial", "Government", "residential", "shopping")

cluster_vars.cor <- cor(as.data.frame(indicators)[,0:15])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black", tl.cex = 1)

We will be setting a threshold of >0.75 as it may lead to issues when running multi-regression analysis. Further discussion can be found here https://www.researchgate.net/post/How_can_I_avoid_multicollinearity

From the plot, there are a few significant observations to point out.

Firstly, the young has a high correlation with aged,economy active , HDB3-4 room and HDB5 room.

Secondly , there is a high correlation between HDB3-4 rooms across all the age groups, all above 0.75 which may indicate that majority of residents stay in HDB3-4 rooms .

Lastly , there is high correlation between economy active with aged,young, HDB3-4 room and HDB5 room.

Hence , will not include one indicator out of the pair - namely economy active, young and HDB3-4rm as they have the highest occurances of high correlation from their pairs in our cluster analysis.

indicators <-  subset(indicators, select = -c(`economy_active`,`HDB34rm`,`young`))

derived <- indicators %>% select(0:12)
derived <- st_set_geometry(derived, NULL)
derived.cor = cor(derived)
corrplot.mixed(derived.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black",
                number.cex = 0.5)

Now, all our values fall below the 0.75 correlation coefficient.

Hierarchy Cluster Analysis

cluster_vars <- indicators %>%
  st_set_geometry(NULL) 
  
head(cluster_vars,10)

##    aged      Density HDB12rm HDB5rmec condo landed business industry financial
## 1     0 0.0000000000       0        0     0      0        0        0         3
## 2  2110 0.0118431719    4520        0   420      0        6        0        25
## 3    20 0.0004353031       0        0    50      0       40        0         2
## 4  3260 0.0224711973    3930     1270   190      0        0        0         4
## 5  1630 0.0276953656    1970     3240  1930      0        2        0        12
## 6  3310 0.0133737225    3910     3120     0      0       39        1        15
## 7  3590 0.0267883659    3700     1900   540      0        6        0         6
## 8    10 0.0002067649       0        0    10      0       12        0        19
## 9   560 0.0040837310       0        0  2970   1430       16        0         4
## 10   50 0.0004591192       0        0   290      0        0        0         2
##    Government residential shopping
## 1           0           0        0
## 2           1           6        3
## 3           2           1        1
## 4           0           5        0
## 5           0           6        1
## 6           7           4        1
## 7           4          11        1
## 8           4           6       10
## 9           0          56        0
## 10          0           1        1

Rename rows to subzone name instead of index numbers

row.names(cluster_vars) <- cluster_vars$"SUBZONE_N"
head(cluster_vars,10)

##    aged      Density HDB12rm HDB5rmec condo landed business industry financial
## 1     0 0.0000000000       0        0     0      0        0        0         3
## 2  2110 0.0118431719    4520        0   420      0        6        0        25
## 3    20 0.0004353031       0        0    50      0       40        0         2
## 4  3260 0.0224711973    3930     1270   190      0        0        0         4
## 5  1630 0.0276953656    1970     3240  1930      0        2        0        12
## 6  3310 0.0133737225    3910     3120     0      0       39        1        15
## 7  3590 0.0267883659    3700     1900   540      0        6        0         6
## 8    10 0.0002067649       0        0    10      0       12        0        19
## 9   560 0.0040837310       0        0  2970   1430       16        0         4
## 10   50 0.0004591192       0        0   290      0        0        0         2
##    Government residential shopping
## 1           0           0        0
## 2           1           6        3
## 3           2           1        1
## 4           0           5        0
## 5           0           6        1
## 6           7           4        1
## 7           4          11        1
## 8           4           6       10
## 9           0          56        0
## 10          0           1        1

Notice that the row number has been replaced into the Subzone name.

derived_vars <- select(cluster_vars, 1:12)
head(derived_vars, 10)

##    aged      Density HDB12rm HDB5rmec condo landed business industry financial
## 1     0 0.0000000000       0        0     0      0        0        0         3
## 2  2110 0.0118431719    4520        0   420      0        6        0        25
## 3    20 0.0004353031       0        0    50      0       40        0         2
## 4  3260 0.0224711973    3930     1270   190      0        0        0         4
## 5  1630 0.0276953656    1970     3240  1930      0        2        0        12
## 6  3310 0.0133737225    3910     3120     0      0       39        1        15
## 7  3590 0.0267883659    3700     1900   540      0        6        0         6
## 8    10 0.0002067649       0        0    10      0       12        0        19
## 9   560 0.0040837310       0        0  2970   1430       16        0         4
## 10   50 0.0004591192       0        0   290      0        0        0         2
##    Government residential shopping
## 1           0           0        0
## 2           1           6        3
## 3           2           1        1
## 4           0           5        0
## 5           0           6        1
## 6           7           4        1
## 7           4          11        1
## 8           4           6       10
## 9           0          56        0
## 10          0           1        1

Deleted SUBZONE_N field

Data Standardisation

normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method. The method is used due to the skewness of our histrogram plots.

derived_vars.std  <- normalize(derived_vars)
derived_vars <- as.data.frame(derived_vars.std)
describe(derived_vars)

##             vars   n mean   sd median trimmed  mad min max range skew kurtosis
## aged           1 313 0.10 0.13   0.04    0.07 0.06   0   1     1 2.36     9.20
## Density        2 313 0.24 0.27   0.13    0.20 0.20   0   1     1 0.90    -0.43
## HDB12rm        3 313 0.12 0.22   0.00    0.06 0.00   0   1     1 2.13     4.01
## HDB5rmec       4 313 0.07 0.14   0.00    0.04 0.00   0   1     1 3.27    13.05
## condo          5 313 0.11 0.17   0.02    0.08 0.02   0   1     1 1.95     4.12
## landed         6 313 0.04 0.12   0.00    0.01 0.00   0   1     1 4.58    25.95
## business       7 313 0.07 0.15   0.01    0.03 0.01   0   1     1 3.55    13.43
## industry       8 313 0.04 0.14   0.00    0.01 0.00   0   1     1 4.51    22.88
## financial      9 313 0.08 0.12   0.04    0.05 0.06   0   1     1 3.73    19.92
## Government    10 313 0.07 0.15   0.00    0.04 0.00   0   1     1 3.27    12.46
## residential   11 313 0.05 0.11   0.02    0.03 0.03   0   1     1 4.63    28.16
## shopping      12 313 0.05 0.12   0.00    0.02 0.00   0   1     1 4.62    27.76
##               se
## aged        0.01
## Density     0.02
## HDB12rm     0.01
## HDB5rmec    0.01
## condo       0.01
## landed      0.01
## business    0.01
## industry    0.01
## financial   0.01
## Government  0.01
## residential 0.01
## shopping    0.01

Plotting normalised histrograms

a <- ggplot(data=derived_vars, 
       aes(x=`aged`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")



b <- ggplot(data=derived_vars, 
       aes(x=`Density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")


c <- ggplot(data=derived_vars, 
       aes(x=`HDB12rm`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")


d <- ggplot(data=derived_vars, 
       aes(x=`HDB5rmec`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

e <- ggplot(data=derived_vars, 
       aes(x=`condo`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")


f <- ggplot(data=derived_vars, 
       aes(x=`landed`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

g <- ggplot(data=derived_vars, 
       aes(x=`business`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")


h <- ggplot(data=derived_vars, 
       aes(x=`industry`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

i <- ggplot(data=derived_vars, 
       aes(x=`financial`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

j <- ggplot(data=derived_vars, 
       aes(x=`Government`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

k <- ggplot(data=derived_vars, 
       aes(x=`residential`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

l <- ggplot(data=derived_vars, 
       aes(x=`shopping`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue") +
  ggtitle("Min-Max Standardisation")

ggarrange( a,b,c,d,e,f,g,h,i,j,k,l,ncol = 3, nrow = 3)

## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Transforming dataframe into matrix

derived.mat <- data.matrix(derived_vars)

Computing proximity matrix

The code chunk below is used to compute the proximity matrix using euclidean method.

proxmat <- dist(derived.mat, method = 'euclidean')

Selecting the optimal clustering algorithm

One of the challenge in performing hierarchical clustering is to identify stronger clustering structures. The issue can be solved by using use agnes() function of cluster package. It functions like hclus(), however, with the agnes() function you can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).

The code chunk below will be used to compute the agglomerative coefficients of all hierarchical clustering algorithms.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(derived_vars, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8881110 0.8164315 0.9198903 0.9746125

With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed. Hence, in the subsequent analysis, only Ward’s method will be used.

Computing hierarchical clustering

The code chunk below performs hierarchical cluster analysis using ward.D method. The hierarchical clustering output is stored in an object of class hclust which describes the tree produced by the clustering process.

hclust_ward <- hclust(proxmat, method = 'ward.D')

hclust_ward

## 
## Call:
## hclust(d = proxmat, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : euclidean 
## Number of objects: 313

plot(hclust_ward, cex = 0.6)

Selecting optimal number of clusters

Duda-Hart Stopping rule to decide how many clusters to stop at. he Dunn’s index is the ratio between the minimum inter-cluster distances to the maximum intra-cluster diameter. The diameter of a cluster is the distance between its two furthermost points. In order to have well separated and compact clusters you should aim for a higher Dunn’s index.

library(NbClust)
duda <- NbClust(derived_vars, distance = "euclidean", method = "ward.D", index = "duda")

duda$Best.nc

## Number_clusters     Value_Index 
##          9.0000          0.8587

In this case , the optimal number of clusters is 9.

Interpret dendogram

plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 9, border = 2:5)

Visually-driven hierarchical clustering analysis

With heatmaply, we are able to build both highly interactive cluster heatmap or static cluster heatmap.

Plotting interactive cluster heatmap using heatmaply()

heatmaply(normalize(derived.mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Subzones by Residents indicators",
          xlab = "Residents Indicators",
          ylab = "Subzones"
          )

Mapping the clusters formed

groups <- as.factor(cutree(hclust_ward, k=9))

The output is called groups. It is a list object.

In order to visualise the clusters, the groups object need to be appended onto socio simple feature object.

The code chunk below form the join in three steps: 1. the groups list object will be converted into a matrix; 2. cbnd() is used to append groups matrix onto socio to produce an output simple feature object called sg_cluster; and 3. rename of dplyr package is used to rename as.matrix.groups field as CLUSTER.

derived_cluster <- cbind(merged, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

qtm(derived_cluster, "CLUSTER")

The choropleth map above reveals the clusters are very fragmented. The is one of the major limitation when non-spatial clustering algorithm such as hierarchical cluster analysis method is used.

Spatially Constrained Clustering - SKATER approach

Converting into SpatialPolygonsDataFrame

Merging_sp <- as_Spatial(merged)

Computing neighbour list

Merging.nb <- poly2nb(Merging_sp)
summary(Merging.nb)

## Neighbour list object:
## Number of regions: 313 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.900601 
## Average number of links: 5.948882 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 
##  2  4 12 32 79 80 50 32 16  3  3 
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links

Plotting connections

plot(Merging_sp, border=grey(.5), main= "Nearest Neighbor Connections")
plot(Merging.nb, coordinates(Merging_sp), col="blue", add=TRUE)

Computing lcost using nbcosts() function

nbcosts() of spdep package is used to compute the cost of each edge. It is the distance between it nodes

lcosts <- nbcosts(Merging.nb, derived_vars)

convert the neighbor list to a list weights object by specifying the just computed lcosts as the weights.

GWeighted.w <- nb2listw(Merging.nb, lcosts, style="B") 
summary(GWeighted.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 313 
## Number of nonzero links: 1862 
## Percentage nonzero weights: 1.900601 
## Average number of links: 5.948882 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 
##  2  4 12 32 79 80 50 32 16  3  3 
## 2 least connected regions:
## 228 289 with 1 link
## 3 most connected regions:
## 43 259 294 with 11 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn       S0       S1       S2
## B 313 97969 1146.904 1890.428 22053.24

Computing minimum spanning tree again with weighted connections

GWeighted.mst <- mstree(GWeighted.w)
class(GWeighted.mst)

## [1] "mst"    "matrix"

checking dimensions

dim(GWeighted.mst)

## [1] 312   3

The dimension is 312 and not 313 because the minimum spanning tree consists on n-1 edges (links) in order to traverse all the nodes.

We can display the content of population.mst by using head() as shown in the code chunk below

head(GWeighted.mst)

##      [,1] [,2]       [,3]
## [1,]  222  264 0.14255355
## [2,]  264  255 0.04545455
## [3,]  255  306 0.00974026
## [4,]  255  304 0.04754130
## [5,]  304  279 0.01393051
## [6,]  279  274 0.01623377

Plotting connections

plot(Merging_sp, border=gray(.5), main="Weighted Nearest Neighbor Connections")
plot.mst(GWeighted.mst, coordinates(Merging_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

Computing spatially constrained clusters using SKATER method

clust <- skater(GWeighted.mst[,1:2], derived_vars, 8)

The skater() takes three mandatory arguments: - the first two columns of the MST matrix (i.e. not the cost), - the data matrix (to update the costs as units are being grouped), and - the number of cuts. Note: It is set to one less than the number of clusters. So, the value specified is not the number of clusters, but the number of cuts in the graph, one less than the number of custers.

The result of the skater() is an object of class skater. We can examine its contents by using the code chunk below.

str(clust)

## List of 8
##  $ groups      : num [1:313] 1 4 1 4 4 4 4 1 1 1 ...
##  $ edges.groups:List of 9
##   ..$ :List of 3
##   .. ..$ node: num [1:212] 142 172 98 109 141 152 137 165 52 85 ...
##   .. ..$ edge: num [1:211, 1:3] 172 109 98 152 137 52 141 165 275 85 ...
##   .. ..$ ssw : num 79.3
##   ..$ :List of 3
##   .. ..$ node: num [1:13] 220 196 267 256 266 257 260 202 254 261 ...
##   .. ..$ edge: num [1:12, 1:3] 220 267 196 267 256 196 202 266 257 196 ...
##   .. ..$ ssw : num 5.31
##   ..$ :List of 3
##   .. ..$ node: num [1:12] 231 201 209 253 213 217 218 200 258 216 ...
##   .. ..$ edge: num [1:11, 1:3] 201 253 218 201 218 200 217 231 213 209 ...
##   .. ..$ ssw : num 3.25
##   ..$ :List of 3
##   .. ..$ node: num [1:17] 95 5 4 7 23 44 35 59 113 2 ...
##   .. ..$ edge: num [1:16, 1:3] 4 7 44 95 35 95 23 4 59 5 ...
##   .. ..$ ssw : num 5.54
##   ..$ :List of 3
##   .. ..$ node: num [1:11] 168 144 186 171 140 166 187 169 146 177 ...
##   .. ..$ edge: num [1:10, 1:3] 187 166 169 186 171 144 140 169 168 168 ...
##   .. ..$ ssw : num 4.35
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 112 164 111 193 157 97
##   .. ..$ edge: num [1:5, 1:3] 111 164 164 112 112 193 111 157 97 164 ...
##   .. ..$ ssw : num 3.11
##   ..$ :List of 3
##   .. ..$ node: num [1:9] 296 286 288 292 284 280 269 283 278
##   .. ..$ edge: num [1:8, 1:3] 284 269 292 296 286 288 280 288 283 278 ...
##   .. ..$ ssw : num 3.65
##   ..$ :List of 3
##   .. ..$ node: num [1:5] 208 176 102 183 161
##   .. ..$ edge: num [1:4, 1:3] 176 102 208 208 183 ...
##   .. ..$ ssw : num 3.86
##   ..$ :List of 3
##   .. ..$ node: num [1:28] 155 153 133 103 151 139 188 104 132 184 ...
##   .. ..$ edge: num [1:27, 1:3] 104 103 184 132 133 155 225 184 131 151 ...
##   .. ..$ ssw : num 14.8
##  $ not.prune   : NULL
##  $ candidates  : int [1:9] 1 2 3 4 5 6 7 8 9
##  $ ssto        : num 154
##  $ ssw         : num [1:9] 154 149 145 140 136 ...
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:313] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

ccs <- clust$groups
table(ccs)

## ccs
##   1   2   3   4   5   6   7   8   9 
## 212  13  12  17  11   6   9   5  28

Visualising the clusters in choropleth map

groups_mat <- as.matrix(clust$groups)
  
spatialcluster <- cbind(derived_cluster, as.factor(groups_mat)) %>%
    rename(`SP_CLUSTER`=`as.factor.groups_mat.`)

qtm(spatialcluster, "SP_CLUSTER")

Analysing clusters

demo <- derived_cluster %>% select(CLUSTER, aged,economy_active,young,HDB12rm,HDB34rm,HDB5rmec,condo,landed) %>% st_set_geometry(NULL)

demo <- melt(demo, id.vars='CLUSTER')
colnames(demo)[3]="count"

demo2 <- derived_cluster %>% select(CLUSTER, industry, business, financial, Government, shopping, residential) %>% st_set_geometry(NULL)

demo2 <- melt(demo2, id.vars='CLUSTER')
colnames(demo2)[3]="count"

ggplot(demo, aes(x=CLUSTER, y=count, fill= variable)) +
    geom_bar(stat='identity', position='dodge')

ggplot(demo2, aes(x=CLUSTER, y=count, fill= variable)) +
    geom_bar(stat='identity', position='dodge')

Cluster 1

Cluster 1 is located all around Singapore. The most common age group is economy active and the most common dwelling is HDB 3-4 room flat.This cluster appeared to be a business area with some HDB residential.

Cluster 2

Cluster 2 is located in the north-east of Singapore.It has a signifiantly large number of residents and HDB flats compred to the other clusters. Cluster 2 is most likely a residential area.It has the most financial function, this may imply that residents need financial help such as banks near their place of residence.

Cluster 3

Cluster 3 is located in the west of Singapore. There are more HDB3-4 room flats that residents and the urban functions across all types are relatively low in numbers. This may imply that this cluster is a relatively secluded area which is not accessible and convenient to various places in Singapore.

Cluster 4

Cluster 4 is located in the South of Singapore. There is a significant number of private residential,landed and economy active. This cluster appears to be where the rich reside .

Cluster 5

Cluster 5 is located in the West of Singapore. There is a significant number of businesses .This cluster is likely to be a business hub . The majority of age group is economy active which may be due to the business location.

Cluster 6

Cluster 6 is located in the east of Singapore. There is a high number of residents as well as HDB3-4 rooms and 5 room flats. The urban functions are relatively low in numbers. This implied that this cluster is a residential area with majority of residents staying in HDB flats.

Cluster 7

Cluster 7 is located in the North of singapore. The type of dwelling in this cluster is condo and private residences. There is a significant number of financial functions. This may imply that the cluster has more affluent residents and require access to financial services due to their affluence.

Cluster 8

Cluster 8 is located in the east of Singapore. There appears to be no HDB flats and there is an insignificant number of residents. However, there is a significant number of business functions. This cluster is likely to primarily serve as a business center.

Cluster 9

Cluster 9 is located in the central/east of Singapore. There is a high number of private residences. The majority of the age groups are economy active and there are relatively high numbers of condo,landed, hdb3-4 and 5 rooms. This implies that cluster 9 is a residential area with residents being more affluent.This may be due to the location of this cluster as the economy active have greater access to the central area which is near the central business district and are able to afford it.

Comparing analysis methods

plot1<- qtm(derived_cluster, "CLUSTER")
plot2<- qtm(spatialcluster, "SP_CLUSTER")

tmap_arrange(plot1,plot2,nrow=1)

As we can see from the plots ,there is a significant difference between the hierarical clusters and spatially constrained clusters. We can see that the spatially constraint clusters are over generalised . This is due to the minimum spanning tree for the observations which reduces the search space. Hence , there is a huge proportion of Cluster 1 . With this , SKATER approach is not very useful in delineating social area analysis for the Singapore context. We can increase the number of cluster to account for the decentralisation concept of Singapore’s subzones planning .

Decentralisation concept: The Urban Redevelopment Authority (URA) launched plans for Singapore’s decentralisation in 1991. The URA planned to ease congestion, relieve pressure on infrastructure and bring work closer to home.

Other ways of determining number of clusters

Reference : https://uc-r.github.io/kmeans_clustering

Average silhoutte method

The average silhouette approach measures the quality of a clustering. That is, it determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering. The optimal number of clusters k is the one that maximizes the average silhouette over a range of possible values for k.

silhouette_plot_k <- fviz_nbclust(derived_vars, FUN = hcut, method = "silhouette")
silhouette_plot_k

2 Clusters are the most optimal. The disadvantage of the average silhouette methods is that, they measure a global clustering characteristic only. A more sophisticated method is to use the gap statistic which provides a statistical procedure to formalize the silhouette heuristic in order to estimate the optimal number of clusters.

Gap Statistics method

The gap statistic compares the total intracluster variation for different values of k with their expected values under null reference distribution of the data (i.e. a distribution with no obvious clustering).

set.seed(123)
gap_stat <- clusGap(derived_vars, FUN = kmeans, nstart = 25,
                    K.max = 10, B = 50)
# Print the result
print(gap_stat, method = "firstmax")

## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = derived_vars, FUNcluster = kmeans, K.max = 10, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 10
##           logW   E.logW       gap      SE.sim
##  [1,] 3.997851 4.822174 0.8243233 0.008591342
##  [2,] 3.777467 4.731015 0.9535481 0.009138057
##  [3,] 3.711580 4.677792 0.9662123 0.008871390
##  [4,] 3.641259 4.635057 0.9937978 0.008851455
##  [5,] 3.584247 4.602497 1.0182494 0.008912095
##  [6,] 3.517186 4.574343 1.0571576 0.009094427
##  [7,] 3.482773 4.551177 1.0684039 0.008555101
##  [8,] 3.424516 4.529981 1.1054646 0.008234703
##  [9,] 3.386282 4.511697 1.1254156 0.008448587
## [10,] 3.361194 4.494451 1.1332573 0.008402354

fviz_gap_stat(gap_stat)

From this method , we see that 9 clusters is ideal. We can perform the final analysis and extract the results using 9 clusters.

set.seed(123)
final <- kmeans(derived_vars, 9, nstart = 25)
print(final)

## K-means clustering with 9 clusters of sizes 23, 9, 31, 33, 125, 53, 18, 11, 10
## 
## Cluster means:
##          aged    Density    HDB12rm    HDB5rmec      condo      landed
## 1 0.273272166 0.51572519 0.76086957 0.164493962 0.17212413 0.040890819
## 2 0.064628255 0.07393757 0.04444444 0.046126402 0.01828662 0.032117133
## 3 0.265727089 0.74184329 0.23246397 0.360226264 0.24315694 0.019574235
## 4 0.076850156 0.17256189 0.01540941 0.020515834 0.32722575 0.117621486
## 5 0.012992577 0.04139163 0.00160000 0.005646372 0.02305069 0.018142402
## 6 0.153724564 0.48458638 0.17579285 0.084114907 0.06774226 0.009243478
## 7 0.008925415 0.01245296 0.00000000 0.002108238 0.01424501 0.020102728
## 8 0.006507279 0.03226227 0.00000000 0.000000000 0.04830054 0.006424500
## 9 0.327624602 0.26325306 0.21127660 0.096267723 0.53965414 0.433953241
##      business    industry  financial Government residential    shopping
## 1 0.022021457 0.010869565 0.16807268 0.08924485 0.068523342 0.089761571
## 2 0.395743146 0.722222222 0.06799337 0.08187135 0.021505376 0.014336918
## 3 0.005969837 0.004032258 0.09508907 0.03565365 0.037758287 0.043704475
## 4 0.010428965 0.003787879 0.04681140 0.05263158 0.142996788 0.031280547
## 5 0.036000000 0.024000000 0.04059701 0.05052632 0.019428571 0.037161290
## 6 0.012864494 0.021226415 0.06983948 0.05064548 0.027562821 0.032258065
## 7 0.517676768 0.097222222 0.01077944 0.01754386 0.007680492 0.003584229
## 8 0.070247934 0.000000000 0.39077341 0.66507177 0.064935065 0.392961877
## 9 0.070779221 0.087500000 0.21865672 0.08947368 0.411059908 0.093548387
## 
## Clustering vector:
##   [1] 5 1 5 1 6 1 1 5 4 5 5 5 5 4 5 2 5 5 5 6 8 8 6 6 5 5 8 6 5 2 6 5 8 5 6 5 3
##  [38] 5 5 8 5 5 8 6 5 5 5 8 5 4 5 5 8 5 5 6 8 6 6 5 6 5 4 4 6 5 5 4 6 5 4 6 5 7
##  [75] 5 5 8 5 5 5 5 4 2 4 4 4 7 7 4 5 5 5 5 1 6 5 1 5 6 4 5 9 3 9 2 7 7 2 7 4 1
## [112] 4 6 5 5 5 5 5 6 1 5 5 5 6 4 6 5 4 5 4 2 5 6 7 6 1 7 5 5 3 5 6 2 6 5 3 4 5
## [149] 5 4 7 7 6 6 6 5 9 5 8 5 9 6 4 9 5 1 6 6 3 4 3 7 5 6 6 3 1 7 4 4 5 5 1 2 5
## [186] 6 3 4 5 5 1 6 1 1 5 6 5 3 5 3 3 3 5 6 9 5 2 9 3 3 1 1 6 4 5 3 6 3 4 6 5 5
## [223] 7 7 6 4 4 5 5 9 6 3 9 6 1 9 5 5 4 3 5 6 6 6 5 5 6 3 5 5 5 4 3 3 5 1 3 3 5
## [260] 1 1 5 5 5 5 3 3 5 3 4 5 5 5 5 5 3 6 1 5 3 5 7 3 3 5 6 5 6 5 5 7 6 5 5 5 6
## [297] 3 1 4 5 6 5 5 5 7 5 5 5 5 5 5 5 7
## 
## Within cluster sum of squares by cluster:
## [1] 5.6524177 1.8994887 4.2464903 2.9965965 5.0226775 4.1910653 0.9591366
## [8] 3.1844051 3.0706945
##  (between_SS / total_SS =  67.6 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

fviz_cluster(final, data = derived_vars)

Take_Home_Ex03

Vanessa Su

5/23/2020

Overview

Importing packages

Importing geospatial data

Updating CRS information

view extent and plot

Selecting polygons to be removed

Inspect Project string

Business Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

industry Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

Financial Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

Government Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

residential Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

shopping Data

Updating CRS information

Inspect project string

Aggregating with mpsz Dataframe

Importing the aspatial data

Filtering data - Extract indicators

Merging data

Chloropleth mapping

Derive new variables with dpylr

Exploratory Data Analysis (EDA)

EDA using statistical graphics

Correlation analysis

Selecting indicators

Hierarchy Cluster Analysis

Data Standardisation

Plotting normalised histrograms

Transforming dataframe into matrix

Computing proximity matrix

Selecting the optimal clustering algorithm

Computing hierarchical clustering

Selecting optimal number of clusters

Visually-driven hierarchical clustering analysis

Plotting interactive cluster heatmap using heatmaply()

Mapping the clusters formed

Spatially Constrained Clustering - SKATER approach

Converting into SpatialPolygonsDataFrame

Computing neighbour list

Plotting connections

Computing lcost using nbcosts() function

Plotting connections

Computing spatially constrained clusters using SKATER method

Visualising the clusters in choropleth map

Analysing clusters

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

Cluster 7

Cluster 8

Cluster 9

Comparing analysis methods

Other ways of determining number of clusters

Average silhoutte method

Gap Statistics method