1. Introduction

Objective

In this exercise, we will be segmenting Singapore at the planning subzone level into homogeneous socioeconomic areas, by combining geodemographic data extracted from Singapore Department of Statistics, and urban functions extracted from geospatial data.

From the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 provide by Singapore Department of Statistics, we extract the following indicators for 2019:
- Economy active population (i.e. age 25-64)
- Young group (i.e. age 0-24)
- Aged group (i.t. 65 and above)
- Population density
- HDB 1-2 RM dwellers
- HDB 3-4 RM dwellers
- HDB 5RM-Ec dwellers
- Condominium and apartment dwellers
- Landed property dwellers

We are provided with several geosptail data, and from there, we extract the following urban functions:
- Government includding embassy
- Business
- Industry
- Shopping
- Residential
- Government

2. Getting Started

The analytical question

Social area analysis (SAA), which is a generic name for a number of methods employing census and other data to classify small areas into similar socioeconomic groups, is an approach which quantifies data in a useful fashion and has important applications in regional development, urban planning, medical and health services research. The potentialities of the approach for Liveable City planning and Smart City preparedness, however, have yet to be explored.

The data

We are provided with five geospatial data sets: Govt_Embassy, Private residential, Shopping, Business, and Financial. These data sets are in ESRI shapefile format. In addition, we downloaded the following data from the relevant government portals: - Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 (https://www.singstat.gov.sg/),
- URA Master Plan 2014 Planning Subzone boundary (www.data.gov.sg)

Before we begin, a side note would be that all the interactive map will be static in this RPubs as the file is too large, and unable to be published on RPubs.

Installing and loading R packages

packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'tidyverse', 'psych', 'reshape2', 'maptools', 'leaflet', 'stringr', 'factoextra', 'NbClust')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/R/R-4.0.0/library/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/R/R-4.0.0/library/rgdal/proj
##  Linking to sp version: 1.4-1
## Loading required package: spdep
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: ClustGeo
## Loading required package: tmap
## Loading required package: ggpubr
## Loading required package: ggplot2
## Loading required package: cluster
## Loading required package: heatmaply
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 method overwritten by 'seriation':
##   method         from 
##   reorder.hclust gclus
## 
## ======================
## Welcome to heatmaply version 1.1.0
## 
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
## 
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: tidyverse
## -- Attaching packages -------------------------------- tidyverse 1.3.0 --
## v tibble  3.0.1     v dplyr   0.8.5
## v tidyr   1.0.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.4
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
## Loading required package: reshape2
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Loading required package: maptools
## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()
## Loading required package: leaflet
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: NbClust

3. Data Import and Preparation

3.1 Import subzone data

subzone <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_NO_SEA_PL")
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `C:\Users\amoss\OneDrive\Documents\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21

Check CRS of subzone

st_crs(subzone)
## Coordinate Reference System:
##   User input: SVY21 
##   wkt:
## PROJCRS["SVY21",
##     BASEGEOGCRS["SVY21[WGS84]",
##         DATUM["World Geodetic System 1984",
##             ELLIPSOID["WGS 84",6378137,298.257223563,
##                 LENGTHUNIT["metre",1]],
##             ID["EPSG",6326]],
##         PRIMEM["Greenwich",0,
##             ANGLEUNIT["Degree",0.0174532925199433]]],
##     CONVERSION["unnamed",
##         METHOD["Transverse Mercator",
##             ID["EPSG",9807]],
##         PARAMETER["Latitude of natural origin",1.36666666666667,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8801]],
##         PARAMETER["Longitude of natural origin",103.833333333333,
##             ANGLEUNIT["Degree",0.0174532925199433],
##             ID["EPSG",8802]],
##         PARAMETER["Scale factor at natural origin",1,
##             SCALEUNIT["unity",1],
##             ID["EPSG",8805]],
##         PARAMETER["False easting",28001.642,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8806]],
##         PARAMETER["False northing",38744.572,
##             LENGTHUNIT["metre",1],
##             ID["EPSG",8807]]],
##     CS[Cartesian,2],
##         AXIS["(E)",east,
##             ORDER[1],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]],
##         AXIS["(N)",north,
##             ORDER[2],
##             LENGTHUNIT["metre",1,
##                 ID["EPSG",9001]]]]

Notice that subzone do not have any CRS and is in SYV21. Hence we need to set a CRS for subzone and transform it into WGS84.

subzone <- st_as_sf(subzone,3414)
subzone <- st_transform(subzone, 4326)

3.2 Import aspatial data

respond <- read_csv("data/aspatial/respopagesextod2011to2019.csv")
## Parsed with column specification:
## cols(
##   PA = col_character(),
##   SZ = col_character(),
##   AG = col_character(),
##   Sex = col_character(),
##   TOD = col_character(),
##   Pop = col_double(),
##   Time = col_double()
## )

We filter out only the year 2019 In addition, we select only the few indicators that we are interested in, namely, subzone, age group, type of dwelling, and population

year <- respond %>%
  filter(Time == 2019) %>%
  select(SZ,AG,TOD,Pop)

To have common variables with the geosptial data, we rename SZ to SUBZONE_N

names(year)[names(year) == "SZ"] <- "SUBZONE_N"

For consistency, we capitalise all the observations under SUBZONE_N

year$SUBZONE_N = toupper(year$SUBZONE_N)

3.2.1 Splitting age group into 3 different groups

We will rename the age groups into 3 different groups: Young group, Economy active population, and Aged group ##### Young group

year$AG <- replace(year$AG, year$AG=="0_to_4", "Young group")
year$AG <- replace(year$AG, year$AG=="5_to_9", "Young group")
year$AG <- replace(year$AG, year$AG=="10_to_14", "Young group")
year$AG <- replace(year$AG, year$AG=="15_to_19", "Young group")
year$AG <- replace(year$AG, year$AG=="20_to_24", "Young group")
Economy active population
year$AG <- replace(year$AG, year$AG=="25_to_29", "Economy active population")
year$AG <- replace(year$AG, year$AG=="30_to_34", "Economy active population")
year$AG <- replace(year$AG, year$AG=="35_to_39", "Economy active population")
year$AG <- replace(year$AG, year$AG=="40_to_44", "Economy active population")
year$AG <- replace(year$AG, year$AG=="45_to_49", "Economy active population")
year$AG <- replace(year$AG, year$AG=="50_to_54", "Economy active population")
year$AG <- replace(year$AG, year$AG=="55_to_59", "Economy active population")
year$AG <- replace(year$AG, year$AG=="60_to_64", "Economy active population")
Aged group
year$AG <- replace(year$AG, year$AG=="65_to_69", "Aged group")
year$AG <- replace(year$AG, year$AG=="70_to_74", "Aged group")
year$AG <- replace(year$AG, year$AG=="75_to_79", "Aged group")
year$AG <- replace(year$AG, year$AG=="80_to_84", "Aged group")
year$AG <- replace(year$AG, year$AG=="85_to_89", "Aged group")
year$AG <- replace(year$AG, year$AG=="90_and_over", "Aged group")

3.2.2 Calculate population density

Next, we will need to select variables that can be used to calculate population density Population density is calculated by taking population divided by area In this case, the area will be the variable SHAPE_Area

density <- inner_join(year, subzone)%>%
  dplyr::select(SHAPE_Area, Pop, SUBZONE_N)
## Joining, by = "SUBZONE_N"

Next, we will use mutate() to get the population density

pop_density <- density %>%
  mutate(pop_density=Pop/SHAPE_Area)%>%
  dplyr::select(SUBZONE_N, pop_density)

3.3 Import geospatial data (urban functions)

business <- readOGR(dsn= "data/geospatial", layer="Business")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Business"
## with 6550 features
## It has 5 fields
## Integer64 fields read as strings:  POI_ID
shopping  <- readOGR(dsn= "data/geospatial", layer="Shopping")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Shopping"
## with 511 features
## It has 5 fields
## Integer64 fields read as strings:  POI_ID
financial <- readOGR(dsn= "data/geospatial", layer="Financial")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Financial"
## with 3320 features
## It has 29 fields
## Integer64 fields read as strings:  LINK_ID POI_ID CHAIN_ID VANCITY_ID
residential <- readOGR(dsn= "data/geospatial", layer="Private residential")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Private residential"
## with 3604 features
## It has 5 fields
## Integer64 fields read as strings:  POI_ID
govt <- readOGR(dsn="data/geospatial", layer="Govt_Embassy")
## OGR data source with driver: ESRI Shapefile 
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Govt_Embassy"
## with 443 features
## It has 5 fields
## Integer64 fields read as strings:  POI_ID

For easier identification, we rename variable names of the urban functions

names(business)[names(business) == "POI_NAME"] <- "BUSINESS"
names(shopping)[names(shopping) == "POI_NAME"] <- "SHOPPING"
names(financial)[names(financial) == "POI_NAME"] <- "FINANCIAL"
names(residential)[names(residential) == "POI_NAME"] <- "RESIDENTIAL"
names(govt)[names(govt) == "POI_NAME"] <- "GOVT"

3.3.1 Combining Dataframe

Use st_as_sf() to set the different urban functions into shapefile

business = st_as_sf(business)
shopping = st_as_sf(shopping)
financial = st_as_sf(financial)
residential = st_as_sf(residential)
govt = st_as_sf(govt)
Subzone dataframe

Check for NA values

any(is.na(subzone))
## [1] FALSE

Plot subzone

plot(st_geometry(subzone), main="SG Subzone")

Business dataframe

Check for NA values

any(is.na(business))
## [1] TRUE

Plot Business

plot(st_geometry(subzone), main="BUSINESS")
plot((business), add=TRUE, col="blue")
## Warning in plot.sf((business), add = TRUE, col = "blue"): ignoring all but the
## first attribute

Shopping dataframe

Check for NA values

any(is.na(shopping))
## [1] TRUE

Plot Shopping

plot(st_geometry(subzone), main="SHOPPING")
plot((shopping), add=TRUE, col="blue")
## Warning in plot.sf((shopping), add = TRUE, col = "blue"): ignoring all but the
## first attribute

Financial dataframe

Check for NA values

any(is.na(financial))
## [1] TRUE

Plot Financial

plot(st_geometry(subzone), main="FINANCIAL")
plot((financial), add=TRUE, col="blue")
## Warning in plot.sf((financial), add = TRUE, col = "blue"): ignoring all but the
## first attribute

Residential dataframe

Check for NA values

any(is.na(residential))
## [1] TRUE

Plot Residential

plot(st_geometry(subzone), main="RESIDENTIAL")
plot((residential), add=TRUE, col="blue")
## Warning in plot.sf((residential), add = TRUE, col = "blue"): ignoring all but
## the first attribute

Government dataframe

Check for NA values

any(is.na(govt))
## [1] TRUE

Plot government

plot(st_geometry(subzone), main="GOVERNMENT")
plot((govt), add=TRUE, col="blue")
## Warning in plot.sf((govt), add = TRUE, col = "blue"): ignoring all but the first
## attribute

3.3.2 Extract industry from business

We need to extract a new df for industry We identifed that FAC_TYPE 5000 is business, 9991 is industry after tally with its POI_NAME (renamed to BUSINESS). So we filter out 9991 and create a df named industry

industry <- business%>%
  filter(business$FAC_TYPE==9991)
names(industry)[names(industry) == "BUSINESS"] <- "INDUSTRY"

Check for NA values

any(is.na(industry))
## [1] TRUE

Plot Industry

plot(st_geometry(subzone), main="INDUSTRY")
plot((industry), add=TRUE, col="blue")
## Warning in plot.sf((industry), add = TRUE, col = "blue"): ignoring all but the
## first attribute

3.3.3 Remove the industry part in business

business <- business %>%
  filter(!business$FAC_TYPE==9991)

Recheck for NA values

any(is.na(business))
## [1] TRUE

Plot Business again

plot(st_geometry(subzone), main="BUSINESS")
plot((business), add=TRUE, col="blue")
## Warning in plot.sf((business), add = TRUE, col = "blue"): ignoring all but the
## first attribute

Check if there is NA values in the geometry column as we need it to perform our analysis using the geometry

sum(is.na(business$geometry))
## [1] 0
sum(is.na(shopping$geometry))
## [1] 0
sum(is.na(financial$geometry))
## [1] 0
sum(is.na(industry$geometry))
## [1] 0
sum(is.na(residential$geometry))
## [1] 0
sum(is.na(govt$geometry))
## [1] 0

Plot a plots with all the urban functions

plot(st_geometry(subzone), main= "Urban Functions in Singapore")
plot(business, col="blue", add=TRUE)
## Warning in plot.sf(business, col = "blue", add = TRUE): ignoring all but the
## first attribute
plot(industry, col="green", add=TRUE)
## Warning in plot.sf(industry, col = "green", add = TRUE): ignoring all but the
## first attribute
plot(shopping, col="purple", add=TRUE)
## Warning in plot.sf(shopping, col = "purple", add = TRUE): ignoring all but the
## first attribute
plot(financial, col="yellow", add=TRUE)
## Warning in plot.sf(financial, col = "yellow", add = TRUE): ignoring all but the
## first attribute
plot(residential, col="orange", add=TRUE)
## Warning in plot.sf(residential, col = "orange", add = TRUE): ignoring all but
## the first attribute
plot(govt, col="red", add=TRUE)
## Warning in plot.sf(govt, col = "red", add = TRUE): ignoring all but the first
## attribute

3.3.4 Checking CRS

st_crs(subzone)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
st_crs(business)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
st_crs(shopping)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
st_crs(financial)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
st_crs(residential)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
st_crs(govt)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]
st_crs(industry)
## Coordinate Reference System:
##   User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 
##   wkt:
## BOUNDCRS[
##     SOURCECRS[
##         GEOGCRS["unknown",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]],
##                 ID["EPSG",6326]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433],
##                 ID["EPSG",8901]],
##             CS[ellipsoidal,2],
##                 AXIS["longitude",east,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]],
##                 AXIS["latitude",north,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433,
##                         ID["EPSG",9122]]]]],
##     TARGETCRS[
##         GEOGCRS["WGS 84",
##             DATUM["World Geodetic System 1984",
##                 ELLIPSOID["WGS 84",6378137,298.257223563,
##                     LENGTHUNIT["metre",1]]],
##             PRIMEM["Greenwich",0,
##                 ANGLEUNIT["degree",0.0174532925199433]],
##             CS[ellipsoidal,2],
##                 AXIS["latitude",north,
##                     ORDER[1],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##                 AXIS["longitude",east,
##                     ORDER[2],
##                     ANGLEUNIT["degree",0.0174532925199433]],
##             ID["EPSG",4326]]],
##     ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
##         METHOD["Geocentric translations (geog2D domain)",
##             ID["EPSG",9603]],
##         PARAMETER["X-axis translation",0,
##             ID["EPSG",8605]],
##         PARAMETER["Y-axis translation",0,
##             ID["EPSG",8606]],
##         PARAMETER["Z-axis translation",0,
##             ID["EPSG",8607]]]]

3.4 Using st_within()

We use st_within() to combine the subzone geospatial data in. We are interested in SUBZONE_N as it tells us the location of the area We then use dplyr function to select variables that are required for our analysis

business <- bind_cols(
  business,
  subzone[as.numeric(st_within(business, subzone)),]) %>% 
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,BUSINESS,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
business_count <- business %>% drop_na(BUSINESS) %>% group_by(SUBZONE_N) %>% count(BUSINESS)

business_count <- distinct(business_count)

business_count <- business_count %>% 
  group_by(SUBZONE_N) %>% 
  summarize(n=sum(n))

colnames(business_count)[2]="Business"
industry <- bind_cols(
  industry,
  subzone[as.numeric(st_within(industry, subzone)),]) %>% 
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,INDUSTRY,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
industry_count <- industry %>% drop_na(INDUSTRY) %>% group_by(SUBZONE_N) %>% count(INDUSTRY)

industry_count <- distinct(industry_count)

industry_count <- industry_count %>% 
  group_by(SUBZONE_N) %>% 
  summarize(n=sum(n))

colnames(industry_count)[2]="Industry"
shopping <- bind_cols(
  shopping,
  subzone[as.numeric(st_within(shopping, subzone)),]) %>% 
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,SHOPPING,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
shopping_count <- shopping %>% drop_na(SHOPPING) %>% group_by(SUBZONE_N) %>% count(SHOPPING) 

shopping_count <- distinct(shopping_count)

shopping_count <- shopping_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))

colnames(shopping_count)[2]="Shopping"
financial <- bind_cols(
  financial,
  subzone[as.numeric(st_within(financial, subzone)),])%>% 
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,FINANCIAL,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
financial_count <- financial %>% drop_na(FINANCIAL) %>% group_by(SUBZONE_N) %>% count(FINANCIAL)

financial_count <- distinct(financial_count)

financial_count <- financial_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))

colnames(financial_count)[2]="Financial"
residential <- bind_cols(
  residential,
  subzone[as.numeric(st_within(residential, subzone)),])%>%
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,RESIDENTIAL,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
residential_count <- residential %>% drop_na(RESIDENTIAL) %>% group_by(SUBZONE_N) %>% count(RESIDENTIAL)

residential_count <- distinct(residential_count)

residential_count <- residential_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))

colnames(residential_count)[2]="Residential"
govt <- bind_cols(
  govt,
  subzone[as.numeric(st_within(govt, subzone)),])%>% 
  st_set_geometry(NULL)%>%
  dplyr::select(OBJECTID,GOVT,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
govt_count <- govt %>% drop_na(GOVT) %>% group_by(SUBZONE_N) %>% count(GOVT) 

govt_count <- distinct(govt_count)

govt_count <- govt_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))

colnames(govt_count)[2]="Government"

3.5 Cast data to variable

This function splits the Type of Dwelling into its individual types, as well as the age groups into individual types. New variables are created.

3.5.1 Cast housing variables (TOD)

cast_housing <- dcast(year, SUBZONE_N~TOD, sum)
## Using Pop as value column: use value.var to override.
Combine HDB 3 Room and HDB 4 room together
cast_housing <- cast_housing %>%
  mutate(`HDB 3-Room Flats`+`HDB 4-Room Flats`)
names(cast_housing)[names(cast_housing) == "`HDB 3-Room Flats` + `HDB 4-Room Flats`"] <- "THREE_FOUR_RM"
Rename variables
cast_housing <- cast_housing%>%
  rename(`CONDO` =`Condominiums and Other Apartments`,
         `ONE_TWO_RM`=`HDB 1- and 2-Room Flats`,
         `FIVE_RM`=`HDB 5-Room and Executive Flats`,
         `LANDED`=`Landed Properties`) 
cast_housing <- cast_housing%>%
  dplyr::select(SUBZONE_N, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)

3.5.2 Cast age variables

cast_age <- dcast(year, SUBZONE_N~AG, sum)
## Using Pop as value column: use value.var to override.

3.6 Combine all the datas together using left_join()

year4 <- left_join(cast_housing, cast_age)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, business_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, shopping_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, financial_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, residential_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, industry_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, govt_count)
## Joining, by = "SUBZONE_N"

3.6.1 Add population column

This column will be needed for deriving new variables, as well as calculating population density

year4_pop <- year %>% 
  group_by(SUBZONE_N) %>% 
  summarise(Pop=sum(Pop))

year4 <- left_join(year4,year4_pop)
## Joining, by = "SUBZONE_N"

3.6.2 Change NA values to 0

year4$`Business`[is.na(year4$`Business`)]<-0
year4$`Financial`[is.na(year4$`Financial`)]<-0
year4$`Shopping`[is.na(year4$`Shopping`)]<-0
year4$`Residential`[is.na(year4$`Residential`)]<-0
year4$`Government`[is.na(year4$`Government`)]<-0
year4$`Industry`[is.na(year4$`Industry`)]<-0

Check for NA values again

summary(year4)
##   SUBZONE_N             CONDO         ONE_TWO_RM   THREE_FOUR_RM  
##  Length:323         Min.   :    0   Min.   :   0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:   0   1st Qu.:    0  
##  Mode  :character   Median :  230   Median :   0   Median :    0  
##                     Mean   : 1827   Mean   : 542   Mean   : 5953  
##                     3rd Qu.: 2835   3rd Qu.: 605   3rd Qu.: 9705  
##                     Max.   :16770   Max.   :4700   Max.   :75000  
##     FIVE_RM          LANDED          Aged group    Economy active population
##  Min.   :    0   Min.   :    0.0   Min.   :    0   Min.   :    0            
##  1st Qu.:    0   1st Qu.:    0.0   1st Qu.:    0   1st Qu.:    0            
##  Median :    0   Median :    0.0   Median :  730   Median : 2840            
##  Mean   : 3297   Mean   :  773.9   Mean   : 1806   Mean   : 7386            
##  3rd Qu.: 3660   3rd Qu.:  400.0   3rd Qu.: 3000   3rd Qu.:10285            
##  Max.   :47960   Max.   :18820.0   Max.   :18860   Max.   :79780            
##   Young group       Business         Shopping        Financial     
##  Min.   :    0   Min.   :  0.00   Min.   : 0.000   Min.   :  0.00  
##  1st Qu.:    0   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  1.00  
##  Median : 1170   Median :  2.00   Median : 0.000   Median :  5.00  
##  Mean   : 3296   Mean   : 19.94   Mean   : 1.582   Mean   : 10.28  
##  3rd Qu.: 4365   3rd Qu.: 14.00   3rd Qu.: 1.000   3rd Qu.: 13.00  
##  Max.   :34260   Max.   :308.00   Max.   :31.000   Max.   :134.00  
##   Residential        Industry        Government          Pop        
##  Min.   :  0.00   Min.   :0.0000   Min.   : 0.000   Min.   :     0  
##  1st Qu.:  0.00   1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:     0  
##  Median :  4.00   Median :0.0000   Median : 0.000   Median :  4890  
##  Mean   : 11.16   Mean   :0.3406   Mean   : 1.372   Mean   : 12487  
##  3rd Qu.: 11.00   3rd Qu.:0.0000   3rd Qu.: 1.000   3rd Qu.: 17065  
##  Max.   :217.00   Max.   :8.0000   Max.   :19.000   Max.   :132900

3.7 Deriving new variables using dplyr

The unit of measurement of the different variables are different. For instance, the unit of measurement for age group is Population, as using the values of the different age groups will be based on the number ot Population. On the other hand, the unit of measurement for TOD is SHAPE_Area. In general, subzones with relatively higher number of population will also have higher number of age group living in that area. Similarly, subzones with relatively larger area will have higher number of TOD available in that area.

3.7.1 We will first need to add in the SHAPE_Area variable for our analysis

subzone_area <- subzone%>%
  st_set_geometry(NULL)%>%
  dplyr::select(SUBZONE_N, SHAPE_Area)
year4 <- left_join(subzone_area, year4)
## Joining, by = "SUBZONE_N"

We mutate to get new variables Population density is derived from dividing SHAPE_Area by Population

year4_derived <- year4 %>%
  mutate(`CONDO_PR` = `CONDO`/`SHAPE_Area`*1000) %>%
  mutate(`ONE_TWO_RM_PR` = `ONE_TWO_RM`/`SHAPE_Area`*1000) %>%
  mutate(`THREE_FOUR_RM_PR` = `THREE_FOUR_RM`/`SHAPE_Area`*1000) %>%
  mutate(`FIVE_RM_PR` = `FIVE_RM`/`SHAPE_Area`*1000) %>%
  mutate(`LANDED_PR` = `LANDED`/`SHAPE_Area`*1000)%>%
  mutate(`Aged_group_PR` = `Aged group`/`Pop`*1000)%>%
  mutate(`Economy_active_PR` = `Economy active population`/`Pop`*1000)%>%
  mutate(`Young_group_PR` = `Young group`/`Pop`*1000)%>%
  mutate(`pop_density_PR`=`Pop`/`SHAPE_Area`*1000)%>%
  rename(`Business_PR` =`Business`,
         `Shopping_PR`=`Shopping`,
         `Industry_PR`=`Industry`, 
         `Financial_PR`=`Financial`,
         `Government_PR`=`Government`,
         `Residential_PR`=`Residential`) 

Replace NaN values to 0

year4_derived$Aged_group_PR <- replace(year4_derived$Aged_group_PR, year4_derived$Aged_group_PR=="NaN", 0)
year4_derived$Economy_active_PR <- replace(year4_derived$Economy_active_PR, year4_derived$Economy_active_PR=="NaN", 0)
year4_derived$Young_group_PR <- replace(year4_derived$Young_group_PR, year4_derived$Young_group_PR=="NaN", 0)

4. Exploratory Data Analysis (EDA)

4.1 EDA using statistical graphics

We can plot the distribution of the variables (i.e. number of condos available in an area) by using appropriate Exploratory Data Analysis (EDA), as shown in the code chunk below. We will be using ggplo() function to plot histograms, which is useful to identify the overall distribution of the data values (eg whether they are left skew, right skew, or normal distribution)

Eg plot with CONDO
ggplot(data=year4_derived, aes(x=`CONDO`))+
  geom_histogram(bins=20, color="black", fill="light blue")

Boxplot can be used to detect outliers

ggplot(data=year4_derived, aes(x=`CONDO`)) +
  geom_boxplot(color="black", fill="light blue")

Next, we will also be plotting the distribution of the newly derived variables (i.e. Condo penetration rate)

ggplot(data=year4_derived, aes(x=`CONDO_PR`))+
  geom_histogram(bins=20, color="black", fill="light blue")

And also use boxplot to detect outliers

ggplot(data=year4_derived, aes(x=`CONDO_PR`)) +
  geom_boxplot(color="black", fill="light blue")


We observed that the data is less skewed for the penetration rate variable.

4.1.1 We will plot multiple histograms to reveal the distribution of the selected variables

We will first create the individual histograms, before using ggarrange() function to group these histograms together

condo <- ggplot(data=year4_derived, 
             aes(x= `CONDO_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
                 
one_two_rm <- ggplot(data=year4_derived, 
             aes(x= `ONE_TWO_RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

three_four_rm <- ggplot(data=year4_derived, 
             aes(x= `THREE_FOUR_RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

five_rm <- ggplot(data=year4_derived,
             aes(x= `FIVE_RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

landed <- ggplot(data=year4_derived, 
             aes(x= `LANDED_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

aged <- ggplot(data=year4_derived,
             aes(x= `Aged_group_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

active <- ggplot(data=year4_derived, 
             aes(x= `Economy_active_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

young <- ggplot(data=year4_derived, 
             aes(x= `Young_group_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

business <- ggplot(data=year4_derived, 
             aes(x= `Business_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

shopping <- ggplot(data=year4_derived, 
             aes(x= `Shopping_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

financial <- ggplot(data=year4_derived, 
             aes(x= `Financial_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

industry <- ggplot(data=year4_derived, 
             aes(x= `Industry_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

residential <- ggplot(data=year4_derived, 
             aes(x= `Residential_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

government <- ggplot(data=year4_derived, 
             aes(x= `Government_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

pop_density <- ggplot(data=year4_derived, 
             aes(x= `pop_density_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(condo, one_two_rm, three_four_rm, five_rm, landed, aged, active, young, business, shopping, residential, industry, government, financial, pop_density, ncol = 4, nrow = 4)


From these histograms, we can see that the histograms are all mainly skewed, with some variables having a less skewed observation, such as Young_group_PR, which resembles more of a normal distribution.

4.2 EDA using choropleth map

4.2.1 Joining geospatial data with aspatial data

Before we can prepare the choropleth map, we will need the geospatial data (subzone), as well as the aspatial data (year4_derived). We will need to combine these two datas into one.

We will reimport the geospatial subzone data

subzone <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_NO_SEA_PL")
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `C:\Users\amoss\OneDrive\Documents\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS:  SVY21

When doing the SKATER approach later, there will be 5 regions without neighbours. We will remove these 5 regions for consistency. ##### Filter out regions with no neighbours

subzone <- filter(subzone, OBJECTID != 36) 
subzone <- filter(subzone, OBJECTID != 37) 
subzone <- filter(subzone, OBJECTID != 38) 
subzone <- filter(subzone, OBJECTID != 276) 
subzone <- filter(subzone, OBJECTID != 298)

Then, making use of left_join() function of dplyr package, we will combine both datas together. The unique identifier used to join both data objects is SUBZONE_N, and SHAPE_Area

subzone_year <- left_join(subzone, year4_derived)
## Joining, by = c("SUBZONE_N", "SHAPE_Area")

We will then select the variables that are necessary for our analysis.

subzone_year <- subzone_year%>%
  dplyr::select("SUBZONE_N", "CONDO", "ONE_TWO_RM", "THREE_FOUR_RM", "FIVE_RM", "LANDED", "Aged group", "Economy active population", "Young group","CONDO_PR", "ONE_TWO_RM_PR", "THREE_FOUR_RM_PR", "FIVE_RM_PR", "LANDED_PR", "Aged_group_PR", "Economy_active_PR", "Young_group_PR", "Business_PR", "Shopping_PR", "Industry_PR", "Financial_PR", "Residential_PR", "Government_PR", "pop_density_PR", "SHAPE_Area", "Pop")

4.2.2 Preparing a choropleth map

To have a quick look at the distribution of Condo penetration rate of subzone, we will need to plot a choropleth map. We will use the qtm() function of tmap package

qtm(subzone_year, "CONDO_PR")
## Warning: The shape subzone_year is invalid. See sf::st_is_valid

In order to reveal the distribution shown in the choropleth map above are bias to the denominator that we divided them by previously, we will need to plot both choropleth maps in.

4.2.3 Plotting by age group

We previously identified that age groups are bias to the population, so we will plot the age group choropleth together with the population choropleth.

pop.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Pop",
          n=5,
          style = "jenks", 
          title = "Population") + 
  tm_borders(alpha = 0.5) 

aged.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Aged group",
          n=5,
          style = "jenks",
          title = "Aged") + 
  tm_borders(alpha = 0.5) 

active.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Economy active population",
          n=5,
          style = "jenks",
          title = "Active") + 
  tm_borders(alpha = 0.5) 

young.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Young group",
          n=5,
          style = "jenks",
          title = "Young") + 
  tm_borders(alpha = 0.5) 

tmap_arrange(pop.map, aged.map, active.map, young.map,
             asp=NA, ncol=2, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid


We identified from the choropleth maps above that subzones with relatively larger population also show larger number of a certain aged group. For example, there are darker chunks in all 4 choropleth maps in the east area, which shows that there is higher number of population in the east area, resulting in higher proportion of Aged, Active, and Young group in the east area too.

4.2.4 Plotting against Housing types

We have also previously identified that housing types are bias to the SHAPE_Area, so we will plot the housing choropleth together with the SHAPE_Area choropleth.

shape_area.map <- tm_shape(subzone_year) + 
  tm_fill(col = "SHAPE_Area",
          n = 5,
          style = "jenks", 
          title = "Shape area") + 
  tm_borders(alpha = 0.5) 

condo.map <- tm_shape(subzone_year) + 
  tm_fill(col = "CONDO",
          n = 5,
          style = "jenks", 
          title = "Condo") + 
  tm_borders(alpha = 0.5) 

one_two_rm.map <- tm_shape(subzone_year) + 
  tm_fill(col = "ONE_TWO_RM",
          n = 5,
          style = "jenks",
          title = "1 2 room") + 
  tm_borders(alpha = 0.5) 

three_four_rm.map <- tm_shape(subzone_year) + 
  tm_fill(col = "THREE_FOUR_RM",
          n = 5,
          style = "jenks",
          title = "3 room") + 
  tm_borders(alpha = 0.5) 

five_rm.map <- tm_shape(subzone_year) + 
  tm_fill(col = "FIVE_RM",
          n = 5,
          style = "jenks",
          title = "5 room") + 
  tm_borders(alpha = 0.5)

landed.map <- tm_shape(subzone_year) + 
  tm_fill(col = "LANDED",
          n = 5,
          style = "jenks",
          title = "landed") + 
  tm_borders(alpha = 0.5)

tmap_arrange(shape_area.map, condo.map, one_two_rm.map, three_four_rm.map, five_rm.map, landed.map,
             asp=1, ncol=3, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Legend labels were too wide. The labels have been resized to 0.39, 0.34, 0.32, 0.31, 0.31. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.55, 0.55, 0.55, 0.50. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.62, 0.55, 0.55, 0.55. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.50, 0.47, 0.47, 0.47. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.55, 0.50, 0.47, 0.47. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.62, 0.55, 0.55, 0.50. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.


We can observed from the choropleth map above that there are certain areas that are dark shaded in the Shape area choropleth map, but are light shaded in the housing choropleth map. This could largely be due to these areas being water catchment areas, or islands that do not have any population at all. Not taking these areas into consideration,
We identified from the choropleth maps above that subzones with relatively larger shape area also show larger number of a certain housing group.

4.2.5 Plotting against urban functions

business.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Business_PR",
          n = 5,
          style = "jenks", 
          title = "Business") + 
  tm_borders(alpha = 0.5) 

shopping.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Shopping_PR",
          n = 5,
          style = "jenks",
          title = "Shopping") + 
  tm_borders(alpha = 0.5) 

financial.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Financial_PR",
          n = 5,
          style = "jenks",
          title = "Financial") + 
  tm_borders(alpha = 0.5) 

industry.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Industry_PR",
          n = 5,
          style = "jenks",
          title = "Industry") + 
  tm_borders(alpha = 0.5)

residential.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Residential_PR",
          n = 5,
          style = "jenks",
          title = "Residential") + 
  tm_borders(alpha = 0.5)

government.map <- tm_shape(subzone_year) + 
  tm_fill(col = "Government_PR",
          n = 5,
          style = "jenks",
          title = "Government") + 
  tm_borders(alpha = 0.5)

tmap_arrange(business.map, shopping.map, financial.map, industry.map, residential.map, government.map,
             asp=1, ncol=3, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid

## Warning: The shape subzone_year is invalid. See sf::st_is_valid


Business, as an urban function, has a higher concentration as compared to other urban functions in certain subzones as observed.

4.2.6 Plotting against population density

tm_shape(subzone_year)+
  tm_fill(col="pop_density_PR",
          n=5,
          style="jenks",
          title="Population density")+
  tm_borders(alpha=0.5)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid


Population density of certain subzones are higher, which means that more people stay in that subzone per square kilometre.

4.3 Correlation Analysis

Before we are able to perform any cluster analysis, it is important for us to ensure that the input variables are not highly correlatied.
To do so, we will use the corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.

cluster_vars.cor = cor(year4_derived[, c(11:16, 18:26)])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
         number.cex=0.4,
               tl.col = "black")


The correlation plot above shows that THREE_FOUR_RM and pop_density_PR are highly correlated, with a correlation value of 0.93. This suggests that only one of them should be used in the cluster analysis instead of both. Comparing the correlation value of these two variables among the other variables, we realise that pop_density_PR is slightly higher correlated than THREE_FOUR_RM, hence we remove pop_density_PR to prevent collinearity issues.

4.3.1 Correlation Analysis after removing Economy active population and FOUR_RM

We attempt to plot the correlation analysis again after removing pop_density_PR

cluster_vars.cor = cor(year4_derived[, c(11:16, 18:25)])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
         number.cex=0.4,
               tl.col = "black")


The correlation plot above shows a more evenly spread out correlation analysis between the variables. We can now proceed with our cluster analysis.

5. Hierarchy Cluster Analysis

5.1 Extracting clustering variables

The code chunk below will be used to extract the clustering variables from the subzone_year simple feather object into data.frame We do not include pop_density because it is highly correlated with THREE_FOUR_RM as mentioned above.

cluster_vars <- subzone_year %>%
  st_set_geometry(NULL)%>%
  select("SUBZONE_N","CONDO_PR", "ONE_TWO_RM_PR", "THREE_FOUR_RM_PR", "FIVE_RM_PR", "LANDED_PR", "Aged_group_PR", "Economy_active_PR", "Young_group_PR", "Business_PR", "Shopping_PR", "Financial_PR", "Residential_PR", "Industry_PR", "Government_PR")
head(cluster_vars,10)
##              SUBZONE_N   CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## 1        PEOPLE'S PARK 3.32830732     0.0000000        0.0000000  0.0000000
## 2          BUKIT MERAH 0.00000000     0.0000000        0.4371873  2.2345130
## 3            CHINATOWN 0.51087945     1.7710488       12.2270482  3.4058630
## 4              PHILLIP 0.00000000     0.0000000        0.0000000  0.0000000
## 5        RAFFLES PLACE 0.00000000     0.0000000        0.0000000  0.0000000
## 6         CHINA SQUARE 0.07518405     0.0000000        9.6235583  0.4511043
## 7          TIONG BAHRU 1.53974007     0.9595482       20.3513471  6.0920151
## 8     BAYFRONT SUBZONE 0.00000000     0.0000000        0.0000000  0.0000000
## 9  TIONG BAHRU STATION 3.73445397     4.6181797       22.1786655 13.4554372
## 10       CLIFFORD PIER 0.00000000     0.0000000        0.0000000  0.0000000
##    LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR Business_PR
## 1          0      193.5484          741.9355       64.51613           2
## 2          0      236.3636          536.3636      227.27273          16
## 3          0      194.2379          591.0781      214.68401          46
## 4          0        0.0000            0.0000        0.00000          11
## 5          0        0.0000            0.0000        0.00000          48
## 6          0      325.9259          570.3704      103.70370          14
## 7          0      192.6324          587.8741      219.49348           0
## 8          0        0.0000            0.0000        0.00000           1
## 9          0      180.8166          585.2236      233.95982           1
## 10         0        0.0000            0.0000        0.00000           7
##    Shopping_PR Financial_PR Residential_PR Industry_PR Government_PR
## 1            1           12              1           0             0
## 2            0           16              1           4             4
## 3            8           36             11           0             7
## 4            0            9              0           0             3
## 5            9          126              2           0            12
## 6            7           23              1           0             2
## 7            0            9              6           0             0
## 8            2           17              0           0             0
## 9            2           20             12           0             0
## 10           1            2              0           0             2

Next, we will change the rows by SUBZONE_N instead of row numbers

row.names(cluster_vars) <- cluster_vars$"SUBZONE_N"
head(cluster_vars,10)
##                               SUBZONE_N   CONDO_PR ONE_TWO_RM_PR
## PEOPLE'S PARK             PEOPLE'S PARK 3.32830732     0.0000000
## BUKIT MERAH                 BUKIT MERAH 0.00000000     0.0000000
## CHINATOWN                     CHINATOWN 0.51087945     1.7710488
## PHILLIP                         PHILLIP 0.00000000     0.0000000
## RAFFLES PLACE             RAFFLES PLACE 0.00000000     0.0000000
## CHINA SQUARE               CHINA SQUARE 0.07518405     0.0000000
## TIONG BAHRU                 TIONG BAHRU 1.53974007     0.9595482
## BAYFRONT SUBZONE       BAYFRONT SUBZONE 0.00000000     0.0000000
## TIONG BAHRU STATION TIONG BAHRU STATION 3.73445397     4.6181797
## CLIFFORD PIER             CLIFFORD PIER 0.00000000     0.0000000
##                     THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## PEOPLE'S PARK              0.0000000  0.0000000         0      193.5484
## BUKIT MERAH                0.4371873  2.2345130         0      236.3636
## CHINATOWN                 12.2270482  3.4058630         0      194.2379
## PHILLIP                    0.0000000  0.0000000         0        0.0000
## RAFFLES PLACE              0.0000000  0.0000000         0        0.0000
## CHINA SQUARE               9.6235583  0.4511043         0      325.9259
## TIONG BAHRU               20.3513471  6.0920151         0      192.6324
## BAYFRONT SUBZONE           0.0000000  0.0000000         0        0.0000
## TIONG BAHRU STATION       22.1786655 13.4554372         0      180.8166
## CLIFFORD PIER              0.0000000  0.0000000         0        0.0000
##                     Economy_active_PR Young_group_PR Business_PR Shopping_PR
## PEOPLE'S PARK                741.9355       64.51613           2           1
## BUKIT MERAH                  536.3636      227.27273          16           0
## CHINATOWN                    591.0781      214.68401          46           8
## PHILLIP                        0.0000        0.00000          11           0
## RAFFLES PLACE                  0.0000        0.00000          48           9
## CHINA SQUARE                 570.3704      103.70370          14           7
## TIONG BAHRU                  587.8741      219.49348           0           0
## BAYFRONT SUBZONE               0.0000        0.00000           1           2
## TIONG BAHRU STATION          585.2236      233.95982           1           2
## CLIFFORD PIER                  0.0000        0.00000           7           1
##                     Financial_PR Residential_PR Industry_PR Government_PR
## PEOPLE'S PARK                 12              1           0             0
## BUKIT MERAH                   16              1           4             4
## CHINATOWN                     36             11           0             7
## PHILLIP                        9              0           0             3
## RAFFLES PLACE                126              2           0            12
## CHINA SQUARE                  23              1           0             2
## TIONG BAHRU                    9              6           0             0
## BAYFRONT SUBZONE              17              0           0             0
## TIONG BAHRU STATION           20             12           0             0
## CLIFFORD PIER                  2              0           0             2

Then, remove the column SUBZONE_N

subzone_count <- select(cluster_vars, c(2:15))
head(subzone_count, 10)
##                       CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## PEOPLE'S PARK       3.32830732     0.0000000        0.0000000  0.0000000
## BUKIT MERAH         0.00000000     0.0000000        0.4371873  2.2345130
## CHINATOWN           0.51087945     1.7710488       12.2270482  3.4058630
## PHILLIP             0.00000000     0.0000000        0.0000000  0.0000000
## RAFFLES PLACE       0.00000000     0.0000000        0.0000000  0.0000000
## CHINA SQUARE        0.07518405     0.0000000        9.6235583  0.4511043
## TIONG BAHRU         1.53974007     0.9595482       20.3513471  6.0920151
## BAYFRONT SUBZONE    0.00000000     0.0000000        0.0000000  0.0000000
## TIONG BAHRU STATION 3.73445397     4.6181797       22.1786655 13.4554372
## CLIFFORD PIER       0.00000000     0.0000000        0.0000000  0.0000000
##                     LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR
## PEOPLE'S PARK               0      193.5484          741.9355       64.51613
## BUKIT MERAH                 0      236.3636          536.3636      227.27273
## CHINATOWN                   0      194.2379          591.0781      214.68401
## PHILLIP                     0        0.0000            0.0000        0.00000
## RAFFLES PLACE               0        0.0000            0.0000        0.00000
## CHINA SQUARE                0      325.9259          570.3704      103.70370
## TIONG BAHRU                 0      192.6324          587.8741      219.49348
## BAYFRONT SUBZONE            0        0.0000            0.0000        0.00000
## TIONG BAHRU STATION         0      180.8166          585.2236      233.95982
## CLIFFORD PIER               0        0.0000            0.0000        0.00000
##                     Business_PR Shopping_PR Financial_PR Residential_PR
## PEOPLE'S PARK                 2           1           12              1
## BUKIT MERAH                  16           0           16              1
## CHINATOWN                    46           8           36             11
## PHILLIP                      11           0            9              0
## RAFFLES PLACE                48           9          126              2
## CHINA SQUARE                 14           7           23              1
## TIONG BAHRU                   0           0            9              6
## BAYFRONT SUBZONE              1           2           17              0
## TIONG BAHRU STATION           1           2           20             12
## CLIFFORD PIER                 7           1            2              0
##                     Industry_PR Government_PR
## PEOPLE'S PARK                 0             0
## BUKIT MERAH                   4             4
## CHINATOWN                     0             7
## PHILLIP                       0             3
## RAFFLES PLACE                 0            12
## CHINA SQUARE                  0             2
## TIONG BAHRU                   0             0
## BAYFRONT SUBZONE              0             0
## TIONG BAHRU STATION           0             0
## CLIFFORD PIER                 0             2
summary(subzone_count)
##     CONDO_PR      ONE_TWO_RM_PR     THREE_FOUR_RM_PR   FIVE_RM_PR    
##  Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 0.352   Median : 0.0000   Median : 0.000   Median : 0.000  
##  Mean   : 1.591   Mean   : 0.6140   Mean   : 5.441   Mean   : 2.718  
##  3rd Qu.: 2.720   3rd Qu.: 0.6363   3rd Qu.:10.486   3rd Qu.: 3.807  
##  Max.   :11.693   Max.   :10.7964   Max.   :29.630   Max.   :27.554  
##    LANDED_PR      Aged_group_PR   Economy_active_PR Young_group_PR 
##  Min.   :0.0000   Min.   :  0.0   Min.   :  0.0     Min.   :  0.0  
##  1st Qu.:0.0000   1st Qu.:  0.0   1st Qu.:  0.0     1st Qu.:  0.0  
##  Median :0.0000   Median :126.2   Median :576.0     Median :226.1  
##  Mean   :0.4658   Mean   :115.1   Mean   :438.6     Mean   :182.1  
##  3rd Qu.:0.2881   3rd Qu.:184.3   3rd Qu.:599.8     3rd Qu.:272.7  
##  Max.   :7.2889   Max.   :947.4   Max.   :900.0     Max.   :558.8  
##   Business_PR      Shopping_PR      Financial_PR    Residential_PR  
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  1.00   1st Qu.:  1.00  
##  Median :  2.00   Median : 0.000   Median :  5.00   Median :  4.00  
##  Mean   : 20.25   Mean   : 1.607   Mean   : 10.44   Mean   : 11.33  
##  3rd Qu.: 14.00   3rd Qu.: 1.000   3rd Qu.: 13.00   3rd Qu.: 11.00  
##  Max.   :308.00   Max.   :31.000   Max.   :134.00   Max.   :217.00  
##   Industry_PR     Government_PR   
##  Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.: 0.000  
##  Median :0.0000   Median : 0.000  
##  Mean   :0.3459   Mean   : 1.393  
##  3rd Qu.:0.0000   3rd Qu.: 1.000  
##  Max.   :8.0000   Max.   :19.000

There is no NA values present.
We will need to standardise as the units of measurements are different.
Since many of our variables are not normally distributed from the histograms above, we will need to use Min-Max standardisation method.

5.2 Data Standardisation

In general, multiple variables will be used in cluster analysis, and it is not unusual that their values range are different. So to avoid biasness of the cluster analysis result with large values, we will need to standardise the input variables before performing cluster analysis.

5.2.1 Min-Max standardisation

We will use normalize() function of the heatmaply package to standardise the clustering variables by using Min-Max method.

subzone_count.std <- normalize(subzone_count)
summary(subzone_count.std)
##     CONDO_PR      ONE_TWO_RM_PR     THREE_FOUR_RM_PR   FIVE_RM_PR     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.0301   Median :0.00000   Median :0.0000   Median :0.00000  
##  Mean   :0.1361   Mean   :0.05687   Mean   :0.1836   Mean   :0.09863  
##  3rd Qu.:0.2326   3rd Qu.:0.05894   3rd Qu.:0.3539   3rd Qu.:0.13818  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000  
##    LANDED_PR       Aged_group_PR    Economy_active_PR Young_group_PR  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000    Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000    1st Qu.:0.0000  
##  Median :0.00000   Median :0.1332   Median :0.6400    Median :0.4046  
##  Mean   :0.06390   Mean   :0.1215   Mean   :0.4874    Mean   :0.3259  
##  3rd Qu.:0.03952   3rd Qu.:0.1946   3rd Qu.:0.6664    3rd Qu.:0.4880  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000    Max.   :1.0000  
##   Business_PR        Shopping_PR       Financial_PR      Residential_PR    
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.000000   Min.   :0.000000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.007463   1st Qu.:0.004608  
##  Median :0.006494   Median :0.00000   Median :0.037313   Median :0.018433  
##  Mean   :0.065752   Mean   :0.05184   Mean   :0.077912   Mean   :0.052227  
##  3rd Qu.:0.045455   3rd Qu.:0.03226   3rd Qu.:0.097015   3rd Qu.:0.050691  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000000  
##   Industry_PR      Government_PR    
##  Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000  
##  Mean   :0.04324   Mean   :0.07332  
##  3rd Qu.:0.00000   3rd Qu.:0.05263  
##  Max.   :1.00000   Max.   :1.00000


Notice that the values range of the Min-Max standardised clustering variables are 0-1 now.

5.2.2 Z-score standardisation

We do not have to do Z-score standardisation in this case as the datas are not normally distributed (they are skewed as mentioned previously )

5.2.3 Visualising the standardised clustering variables

By age and housing
Aged <- ggplot(data=year4_derived,
            aes(x=`Aged_group_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Aged_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Aged_group_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max ")

Young <- ggplot(data=year4_derived,
            aes(x=`Young_group_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Young_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Young_group_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max ")

Condo <- ggplot(data=year4_derived,
            aes(x=`CONDO_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Condo_minmax <- ggplot(data=subzone_s_df,
            aes(x=`CONDO_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

One_two_room <- ggplot(data=year4_derived,
            aes(x=`ONE_TWO_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
One_two_room_minmax <- ggplot(data=subzone_s_df,
            aes(x=`ONE_TWO_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Three_four_room <- ggplot(data=year4_derived,
            aes(x=`THREE_FOUR_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Three_four_room_minmax <- ggplot(data=subzone_s_df,
            aes(x=`THREE_FOUR_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Five_room <- ggplot(data=year4_derived,
            aes(x=`FIVE_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Five_room_minmax <- ggplot(data=subzone_s_df,
            aes(x=`FIVE_RM_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Landed <- ggplot(data=year4_derived,
            aes(x=`LANDED_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Landed_minmax <- ggplot(data=subzone_s_df,
            aes(x=`LANDED_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

ggarrange(Aged, Aged_minmax, Young, Young_minmax, Condo, Condo_minmax, One_two_room, One_two_room_minmax, Three_four_room, Three_four_room_minmax, Five_room, Five_room_minmax, Landed, Landed_minmax,
          ncol=4,
          nrow=4)

By urban functions,
Business <- ggplot(data=year4_derived,
            aes(x=`Business_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Business_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Business_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Shopping <- ggplot(data=year4_derived,
            aes(x=`Shopping_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Shopping_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Shopping_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Financial <- ggplot(data=year4_derived,
            aes(x=`Financial_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Financial_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Financial_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Industry <- ggplot(data=year4_derived,
            aes(x=`Industry_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Industry_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Industry_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Residential <- ggplot(data=year4_derived,
            aes(x=`Residential_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Residential_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Residential_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

Government <- ggplot(data=year4_derived,
            aes(x=`Government_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")

subzone_s_df <- as.data.frame(subzone_count.std)
Government_minmax <- ggplot(data=subzone_s_df,
            aes(x=`Government_PR`))+
  geom_histogram(bins=20,
                 color="black",
                 fill="light blue")+
  ggtitle("Min-Max")

ggarrange(Business, Business_minmax,Shopping, Shopping_minmax, Financial, Financial_minmax, Industry, Industry_minmax, Residential, Residential_minmax, Government, Government_minmax,
          ncol=4,
          nrow=3)

5.3 Computing proximity matrix

We will make use of the default euclidean proximity matrix

proxmat <- dist(subzone_count, method="euclidean")

5.4 Computing hierarchical clustering

We will perform hierarchical cluster analysis using ward.D method

hclust_ward <- hclust(proxmat, method="ward.D")

We can then plot the tree by using plot() function

plot(hclust_ward, cex=0.6)

5.4.1 Selecting optimal number of clusters

fviz_nbclust(subzone_count.std, FUN=hcut, method="silhouette")

As seen above, 2 has an elbow. However, having 2 clusters does not seem suitable since it is found to have more than 2 clusters while doing the analyhsis with 2 clusters. Hence, we make use of the NbClust function to determine the number of clusters availabe.

duda <- NbClust(subzone_count.std, diss=NULL, distance="euclidean", min.nc=1, max.nc=15, method="ward.D", index="duda", alphaBeale=0.1)
duda$Best.nc
## Number_clusters     Value_Index 
##          8.0000          0.8636


From this, we can tell that there are a total of 8 clusters. We will use 8 clusters for our subsequent analysis.

5.5 Selecting the optimal clustering algorithm

We will need to identify stronger clustering structures. This can be done by using agnes() function of cluster package

m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")

ac <- function(x){
  agnes(subzone_count, method=x)$ac
}

map_dbl(m,ac)
##   average    single  complete      ward 
## 0.9727232 0.9696775 0.9758965 0.9960726


With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed (0.9960726). Hence, in the subsequent analysis, only Ward’s method will be used.

5.5.1 Interpreting the dendograms

In the dendogram displayed above, each leaf corresponds to one observations. There are many leaves as there are many observations. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fushion, the less similar the observations are. Note that, conclusions about the proximity of two observations can be drawn only based on the height where branches containing these two observations first are fused. We cannot use the proximity of two observations along with the horizontal axis as criteria of their similarity.

We will now draw the dendogram with a border around the selected clusters by using rect.hclust() of R stats. The argument border is used to specify the border colors for the rectangles

plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 8, border = 2:5)

5.6 Visually driven hierarchical clustering analysis

With heatmaply, we are able to build both highly interactive cluster heatmap or static cluster heatmap

5.6.1 Transforming the data frame into a matrix

subzone_count_mat <- data.matrix(subzone_count)

5.6.2 Plotting interactive cluster heatmap using heatmaply()

heatmaply(normalize(subzone_count_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 8,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of subzones by Housing indicators",
          xlab = "Indicators",
          ylab = "subzone"
          )

5.7 Mapping the clusters formed

groups <- as.factor(cutree(hclust_ward, k=8))
subzone_year_cluster <- cbind(subzone_year, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

Next, qtm() of tmap package will be used to plot the choropleth map showing the cluster formed

qtm(subzone_year_cluster, "CLUSTER")
## Warning: The shape subzone_year_cluster is invalid. See sf::st_is_valid


The choropleth map above reveals the clusters are very fragmented. This is one major limitation when non spatial clustering algorithm such as hierarchical cluster analysis method is used

6. Interpreting the clusters

6.1 Plotting business with clusters

tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Business") +
  tm_bubbles(col = "Business_PR",
             size = "Business_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

6.1.2 Plotting all urban functions together for easy comparison

business_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Business",
            scale=0.5) +
  tm_bubbles(col = "Business_PR",
             size = "Business_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

shopping_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Shopping",
            scale=0.5) +
  tm_bubbles(col = "Shopping_PR",
             size = "Shopping_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

industry_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Industry",
            scale=0.5) +
  tm_bubbles(col = "Industry_PR",
             size = "Industry_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

financial_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Financial",
            scale=0.5) +
  tm_bubbles(col = "Financial_PR",
             size = "Financial_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

residential_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Residential",
            scale=0.5) +
  tm_bubbles(col = "Residential_PR",
             size = "Residential_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

govt_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
  tm_fill(col="CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Government",
            scale=0.5) +
  tm_bubbles(col = "Government_PR",
             size = "Government_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

tmap_arrange(business_cluster, shopping_cluster, industry_cluster, financial_cluster, residential_cluster, govt_cluster,
             asp=0, 
          ncol=2,
          nrow=3)

6.2 Plotting number of urban functions in clusters (in bar graph)

business <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Business_PR)) + 
  geom_bar(stat = "identity")

shopping <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Shopping_PR)) + 
  geom_bar(stat = "identity")

industry <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Industry_PR)) + 
  geom_bar(stat = "identity")

financial <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Financial_PR)) + 
  geom_bar(stat = "identity")

residential <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Residential_PR)) + 
  geom_bar(stat = "identity")

government <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Government_PR)) + 
  geom_bar(stat = "identity")

ggarrange(business, shopping, industry, financial, residential, government, ncol=2, nrow=3)


Based on the bar charts and choropleth maps plotted above, we can make the conclusion that different urban functions has the highest number of different clusters. For example, business urban function has the most of cluster 7, whereas residential urban function has the most of cluster 4. Shopping, Industry, Financial, and Government cluster have a few clusters with high proprtion, with the remaining clusters having low proportion .

6.3 Plotting bar charts

6.3.1 To compare urban functions across clusters

dataframe <- subzone_year_cluster %>%
  select(CLUSTER, Business_PR, Shopping_PR, Financial_PR, Industry_PR, Government_PR, Residential_PR)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   CLUSTER    variable count
## 1       1 Business_PR     2
## 2       2 Business_PR    16
## 3       2 Business_PR    46
## 4       3 Business_PR    11
## 5       3 Business_PR    48
## 6       2 Business_PR    14
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")

From the cluster bar chart above, we can conclude that:
- Cluster 1 has all urban functions in small numbers, but business urban function has the highest proportion among the other urban functions
- Cluster 2 has a high proportion of financial urban function
- Cluster 3 is similar to cluster 2
- Cluster 4 has a high proportion of residential and business urban function
- Cluster 5 has small numbers of urban functions too, with financial and business having almost the same amount
- Cluster 6 has high number of business urban function
- Cluster 7 has an abnormally high number of business urban function
- Cluster 8 also has an abnormally high number of business urban function, with the other urban functions being almost close to 0

6.3.2 To compare type of dwelling across clusters

dataframe <- subzone_year_cluster%>%
  select(CLUSTER, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   CLUSTER variable count
## 1       1    CONDO   310
## 2       2    CONDO     0
## 3       2    CONDO   300
## 4       3    CONDO     0
## 5       3    CONDO     0
## 6       2    CONDO    10
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")


It can be concluded that clusters 3,7 & 8 do not have any residential types, and cluster 1 has close to 0 residential types. In addition, for the remaining clusters, they all have the highest proportion of three_four_room.

6.3.3 To compare age group across clusters

dataframe <- subzone_year_cluster%>%
  select(CLUSTER, Aged.group, Economy.active.population, Young.group)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   CLUSTER   variable count
## 1       1 Aged.group    60
## 2       2 Aged.group   260
## 3       2 Aged.group  2090
## 4       3 Aged.group     0
## 5       3 Aged.group     0
## 6       2 Aged.group   440
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")


It can be observed that clusters 3, 7 & 8 do not have any age groups as they are not residential areas. Also, cluster 2 shows the highest proportion of economy active popution. The remaining cluster’s highest number of proportion also belongs to economy active population.

7. Spatially Constrained Clustering - SKATER approach

7.1 Converting into SpatialPolygonsDataFrame

We will first need to convert subzone into SpatialPolygonDataFrame, as SKATER function only support sp objects such as SpatialPolygonDataFrame.

subzone_year_sp <- as_Spatial(subzone_year)

7.2 Computing Neighbour List

subzone.nb <- poly2nb(subzone_year_sp)
summary(subzone.nb)
## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 35 195 with 1 link
## 1 most connected region:
## 292 with 17 links

We can plot the neighbours list on subzone by using the code chunk below. Since we now can plot the community area boundaries as well, we plot this graph on top of the subzone map.
The first plot command gives the boundaries. This is followed by the plot of the neighbour list object, with coordinates applied to the original SpatialPolygonDataFrame to extract the centroids of the polygons.
These are used as the nodes for the graphic representation. We set color to blue, and specify add=TRUE, to ensure that the network is plotted on top of the boundary.

plot(subzone_year_sp, border=grey(.5))
plot(subzone.nb, coordinates(subzone_year_sp), col="blue", add=TRUE)

7.3 Computing minimum spanning tree

7.3.1 Calculating edge costs

Edge cost is the distance between its nodes. This function compute the distance using data.frame with observations vector in each node

lcosts <- nbcosts(subzone.nb, subzone_count)

For each observation, this gives the pairwise dissimilarity between its values on the five variables and the values for the neighbouring observation (from the neighbour list). Basically, this is the notion of a generalised weight for a spatial weights matrix.
Next, we will incorporate these costs into a weights object in the same was as we did in the calculation of inverse of distance weights.
To achieve this, we make use of nb2listw() of spdep package, and specify style as B to make sure the cost values are not row-standardised.

subzone.w <- nb2listw(subzone.nb, lcosts, style="B")
summary(subzone.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## 35 195 with 1 link
## 1 most connected region:
## 292 with 17 links
## 
## Weights style: B 
## Weights constants summary:
##     n     nn       S0        S1         S2
## B 318 101124 495346.3 545934159 5088278851

Computing minimum spanning tree

subzone.mst <- mstree(subzone.w)

After computing the minimum spanning tree, we can check its class and dimension by using class() and dim() respectively.

class(subzone.mst)
## [1] "mst"    "matrix"
dim(subzone.mst)
## [1] 317   3


The dimenstion is now 317 as the minimum spanning tree consists of n-1 edges in order to traverse all the nodes

We can display the content of subzone.mst by using head() function

head(subzone.mst)
##      [,1] [,2]      [,3]
## [1,]  316  255  9.972409
## [2,]  316  310 19.719934
## [3,]  316  258 21.528379
## [4,]  310  244 21.838932
## [5,]  316  312 22.756519
## [6,]  244  309 34.952037

The plot method for the minimum spanning tree include a way to show the observation numbers of the nodes in addition to the edge. As before, we plot this together with the subzone boundaries. We observe how the initial neighbour list is simplified to just one edge connecting with each of the nodes, while passing through all the nodes

plot(subzone_year_sp, border=gray(.5))
plot.mst(subzone.mst, coordinates(subzone_year_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

7.4 Computing spatially constrained clusters using SKATER method

clust5 <- skater(subzone.mst[,1:2], subzone_count,7)
str(clust5)
## List of 8
##  $ groups      : num [1:318] 1 1 1 3 3 1 1 3 1 3 ...
##  $ edges.groups:List of 8
##   ..$ :List of 3
##   .. ..$ node: num [1:246] 167 252 205 179 135 16 243 12 3 54 ...
##   .. ..$ edge: num [1:245, 1:3] 252 205 49 179 135 16 243 12 3 54 ...
##   .. ..$ ssw : num 31928
##   ..$ :List of 3
##   .. ..$ node: num [1:23] 100 78 134 87 180 237 112 84 63 113 ...
##   .. ..$ edge: num [1:22, 1:3] 87 237 112 180 87 63 130 96 84 113 ...
##   .. ..$ ssw : num 1797
##   ..$ :List of 3
##   .. ..$ node: num [1:14] 44 72 32 11 8 73 39 122 37 5 ...
##   .. ..$ edge: num [1:13, 1:3] 8 11 73 122 32 72 11 44 8 37 ...
##   .. ..$ ssw : num 380
##   ..$ :List of 3
##   .. ..$ node: num [1:13] 292 273 297 256 285 300 280 283 307 279 ...
##   .. ..$ edge: num [1:12, 1:3] 256 285 273 307 283 273 292 297 280 300 ...
##   .. ..$ ssw : num 673
##   ..$ :List of 3
##   .. ..$ node: num [1:11] 189 226 218 315 217 220 291 196 228 161 ...
##   .. ..$ edge: num [1:10, 1:3] 189 226 315 217 220 217 291 217 218 226 ...
##   .. ..$ ssw : num 330
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 290 311 301 270 246 264
##   .. ..$ edge: num [1:5, 1:3] 301 290 270 311 290 270 246 264 301 311 ...
##   .. ..$ ssw : num 45.8
##   ..$ :List of 3
##   .. ..$ node: num [1:3] 296 299 274
##   .. ..$ edge: num [1:2, 1:3] 296 296 299 274 0 0
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num [1:2] 239 86
##   .. ..$ edge: num [1, 1:3] 239 86 37.5
##   .. ..$ ssw : num 37.5
##  $ not.prune   : NULL
##  $ candidates  : int [1:8] 1 2 3 4 5 6 7 8
##  $ ssto        : num 90176
##  $ ssw         : num [1:8] 90176 74943 64183 53825 44549 ...
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:318] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

To check the cluster assignment,

ccs5 <- clust5$groups
ccs5
##   [1] 1 1 1 3 3 1 1 3 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 2 1 3
##  [38] 3 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 3 3 1
##  [75] 1 1 2 2 1 1 1 1 1 2 1 8 2 1 1 1 1 1 1 2 2 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1
## [112] 2 2 1 1 1 2 1 1 3 1 3 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 1 1 1 1 2 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## [186] 1 1 1 5 1 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 1 5 1 1
## [223] 1 1 1 5 1 5 1 1 1 1 1 2 1 1 2 2 8 1 1 1 1 1 1 6 1 1 1 1 1 1 1 1 1 4 1 1 1
## [260] 1 1 1 1 6 1 1 1 1 1 6 1 4 4 7 1 1 1 1 4 4 1 1 4 1 4 1 1 1 1 6 5 4 1 1 1 7
## [297] 4 1 7 4 6 1 4 5 1 4 4 1 1 1 6 1 1 1 5 1 1 1

To find out how many observations are in each cluster,

table(ccs5)
## ccs5
##   1   2   3   4   5   6   7   8 
## 246  23  14  13  11   6   3   2

We also plot the pruned tree that shows different clusters on top of the subzone area

plot(subzone_year_sp, border=gray(.5))
plot(clust5, coordinates(subzone_year_sp), cex.lab=.7,
     groups.colors=c("red","green","blue", "brown", "pink"), cex.circles=0.005, add=TRUE)
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter

7.5 Visualising the clusters in choropleth map

We now plot the newly derived clusters by using SKATER method

groups_mat <- as.matrix(clust5$groups)
subzone_spatialcluster <- cbind(subzone_year_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(subzone_spatialcluster, "SP_CLUSTER")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


Based on the choropleth map above, we can observe that:
1. Cluster 1 is majority
2. Cluster 2 is located at the west of the map
3. Cluster 3 is located at the south of the map
4. Cluseter 4 is located at the central of the map
5. Cluster 5 is located at the east of the map

7.6 Putting it on interactive mode

tm_shape(subzone_spatialcluster)+
  tm_borders(alpha=0.5)+
  tm_fill(col="SP_CLUSTER", alpha=0.9, id="SUBZONE_N")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

8. Interpreting the clusters (SKATER method)

8.1 Plotting business with clusters

tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Business") +
  tm_bubbles(col = "Business_PR",
             size = "Business_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

8.1.2 Plotting all urban functions together for easy comparison

business_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Business",
            scale=0.5) +
  tm_bubbles(col = "Business_PR",
             size = "Business_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

shopping_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Shopping",
            scale=0.5) +
  tm_bubbles(col = "Shopping_PR",
             size = "Shopping_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

industry_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Industry",
            scale=0.5) +
  tm_bubbles(col = "Industry_PR",
             size = "Industry_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

financial_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Financial",
            scale=0.5) +
  tm_bubbles(col = "Financial_PR",
             size = "Financial_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

residential_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Residential",
            scale=0.5) +
  tm_bubbles(col = "Residential_PR",
             size = "Residential_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

govt_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
  tm_fill(col="SP_CLUSTER") +
  tm_borders(lwd = 0.1,  alpha = 1) +
  tm_layout(title = "Government",
            scale=0.5) +
  tm_bubbles(col = "Government_PR",
             size = "Government_PR",
             border.col = "black",
             border.lwd = 1,
             alpha = 0.7)

tmap_arrange(business_cluster, shopping_cluster, industry_cluster, financial_cluster, residential_cluster, govt_cluster,
             asp=0, 
          ncol=2,
          nrow=3)

8.2 Plotting number of urban functions in clusters (in bar graph)

business <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Business_PR)) + 
  geom_bar(stat = "identity")

shopping <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Shopping_PR)) + 
  geom_bar(stat = "identity")

industry <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Industry_PR)) + 
  geom_bar(stat = "identity")

financial <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Financial_PR)) + 
  geom_bar(stat = "identity")

residential <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Residential_PR)) + 
  geom_bar(stat = "identity")

government <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Government_PR)) + 
  geom_bar(stat = "identity")

ggarrange(business, shopping, industry, financial, residential, government, ncol=2, nrow=3)

Based on the bar charts and choropleth maps plotted above, we can make the conclusion that majority of the cluster among the urban functions is cluster 1, with a high percentage of cluster 1 in each of the urban functions. For example, for the Residential_PR group, almost all of the data are under cluster 1. This is supported by the choropleth map, with a large proportion of the area on the map being cluster 1 (green colour)
We can also observe that the Business_PR group has the widest variety of clusters, with cluster 2 being almsot as high as cluster 1. There is also a slight number of cluster 3, 4, and 5.

8.3 Plotting bar charts

8.3.1 To compare urban functions across clusters

dataframe <- subzone_spatialcluster %>%
  select(SP_CLUSTER, Business_PR, Shopping_PR, Financial_PR, Industry_PR, Government_PR, Residential_PR)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   SP_CLUSTER    variable count
## 1          1 Business_PR     2
## 2          1 Business_PR    16
## 3          1 Business_PR    46
## 4          3 Business_PR    11
## 5          3 Business_PR    48
## 6          1 Business_PR    14
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")


From this, we can conclude that:
- Cluster 1 has a high proportion of business and residential urban functions
- Cluster 2 has an abnormally high number of business proportion
- Cluster 3 has slightly lower number of urban function proportion as compared to clusters 1 and 2, but highest proportion of financial
- Cluster 4 has an abnormally high number of business proportion
- Cluster 5 also has an abnormally high number of business proportion
- Cluster 6 has very few number of urban functions, but business proportion is still the most
- Cluster 7 has no urban functions at all
- Cluster 8 has high proportion of business urban function

8.3.2 Plotting bar chart to compare type of dwelling across clusters

dataframe <- subzone_spatialcluster %>%
  select(SP_CLUSTER, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   SP_CLUSTER variable count
## 1          1    CONDO   310
## 2          1    CONDO     0
## 3          1    CONDO   300
## 4          3    CONDO     0
## 5          3    CONDO     0
## 6          1    CONDO    10
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")


It can be observed that only cluster 1 are residential areas, with three_four_rm being the highest in count.

8.3.3 Plotting bar chart to compare type of age group across clusters

dataframe <- subzone_spatialcluster%>%
  select(SP_CLUSTER, Aged.group, Economy.active.population, Young.group)%>%
  st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
##   SP_CLUSTER   variable count
## 1          1 Aged.group    60
## 2          1 Aged.group   260
## 3          1 Aged.group  2090
## 4          3 Aged.group     0
## 5          3 Aged.group     0
## 6          1 Aged.group   440
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
  geom_bar(stat="identity", position="dodge")


Similarly, only cluster 1 has age groups, mainly because they are residential areas. The higher proportion is economy active population.

9. Bonus

9.1 Analyse statistics by socio-economic indicators

Add pop_density into year4 data

subzone_spatialcluster <- subzone_spatialcluster %>%
  mutate(pop_density=Pop/SHAPE_Area)

9.1.1 pop_density group

ggplot(data=subzone_spatialcluster, aes(x=`pop_density`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="pop_density",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="pop_density")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


By looking at the map, we can see that cluster 1 generally have high population density, with the exception of Lim Chu Kang area, and Changi area. However, I did not remove these areas as even though these areas have population = 0, this does not mean that urban functions = 0. For example, it could be due to industrial area having alot of industries, but with nobody staying there.
In cluster 4, there are no population density plots on in since it is the central water catchment area.
One of the highly populated cluster 1 includes Tiong Bahru, Bukit Ho Swee, Rivervale, and Sengkang Town Centre.

9.1.2 Young group

ggplot(data=subzone_spatialcluster, aes(x=`Young.group`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Young.group",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Young.group")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


By looking at the map, we can see that cluster 1 generally have higher number of Young group across the subzones.

9.1.3 Aged group

ggplot(data=subzone_spatialcluster, aes(x=`Aged.group`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Aged.group",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Aged.group")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


Similar to Young group, by looking at the map, we can see that cluster 1 generally have higher number of Aged group across the subzones.

9.1.4 Active group

ggplot(data=subzone_spatialcluster, aes(x=`Economy.active.population`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Economy.active.population",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Economy.active.population")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


Similar to Young group, by looking at the map, we can see that cluster 1 generally have higher number of Active group across the subzones.

9.1.5 Condo group

ggplot(data=subzone_spatialcluster, aes(x=`CONDO`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="CONDO",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="CONDO")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.1.6 HDB 1,2 Room group

ggplot(data=subzone_spatialcluster, aes(x=`ONE_TWO_RM`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="ONE_TWO_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="ONE_TWO_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.1.7 HDB 3,4 Room group

ggplot(data=subzone_spatialcluster, aes(x=`THREE_FOUR_RM`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="THREE_FOUR_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="THREE_FOUR_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.1.8 FIVE Room group

ggplot(data=subzone_spatialcluster, aes(x=`FIVE_RM`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="FIVE_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="FIVE_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.1.9 Landed group

ggplot(data=subzone_spatialcluster, aes(x=`LANDED`))+
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows=vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="LANDED",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="LANDED")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.2 Analyse by cluster

9.2.1 Filter by cluster

cluster1 <- subzone_spatialcluster %>%
  filter(SP_CLUSTER==1)

cluster2 <- subzone_spatialcluster %>%
  filter(SP_CLUSTER==2)

cluster3 <- subzone_spatialcluster %>%
  filter(SP_CLUSTER==3)

cluster4 <- subzone_spatialcluster %>%
  filter(SP_CLUSTER==4)

cluster5 <- subzone_spatialcluster %>%
  filter(SP_CLUSTER==5)
summary(cluster1)
##   SUBZONE_N             CONDO         ONE_TWO_RM     THREE_FOUR_RM  
##  Length:246         Min.   :    0   Min.   :   0.0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:   0.0   1st Qu.:    0  
##  Mode  :character   Median : 1205   Median :   0.0   Median : 3605  
##                     Mean   : 2399   Mean   : 711.6   Mean   : 7816  
##                     3rd Qu.: 3945   3rd Qu.:1217.5   3rd Qu.:12650  
##                     Max.   :16770   Max.   :4700.0   Max.   :75000  
##                                                                     
##     FIVE_RM          LANDED        Aged.group    Economy.active.population
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :    0            
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:  280   1st Qu.: 1222            
##  Median : 1315   Median :    0   Median : 1725   Median : 5955            
##  Mean   : 4329   Mean   : 1016   Mean   : 2371   Mean   : 9698            
##  3rd Qu.: 5305   3rd Qu.:  760   3rd Qu.: 3458   3rd Qu.:15242            
##  Max.   :47960   Max.   :18820   Max.   :18860   Max.   :79780            
##                                                                           
##   Young.group         CONDO_PR      ONE_TWO_RM_PR     THREE_FOUR_RM_PR
##  Min.   :    0.0   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
##  1st Qu.:  522.5   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 0.000  
##  Median : 2455.0   Median : 1.232   Median : 0.0000   Median : 4.314  
##  Mean   : 4327.0   Mean   : 2.057   Mean   : 0.7937   Mean   : 7.034  
##  3rd Qu.: 6437.5   3rd Qu.: 3.527   3rd Qu.: 0.9636   3rd Qu.:13.501  
##  Max.   :34260.0   Max.   :11.693   Max.   :10.7964   Max.   :29.630  
##                                                                       
##    FIVE_RM_PR       LANDED_PR      Aged_group_PR   Economy_active_PR
##  Min.   : 0.000   Min.   :0.0000   Min.   :  0.0   Min.   :  0.0    
##  1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:100.1   1st Qu.:566.5    
##  Median : 1.374   Median :0.0000   Median :149.4   Median :588.2    
##  Mean   : 3.513   Mean   :0.6021   Mean   :148.8   Mean   :567.0    
##  3rd Qu.: 5.149   3rd Qu.:0.5942   3rd Qu.:193.5   3rd Qu.:605.3    
##  Max.   :27.554   Max.   :7.2889   Max.   :947.4   Max.   :900.0    
##                                                                     
##  Young_group_PR   Business_PR      Shopping_PR      Industry_PR    
##  Min.   :  0.0   Min.   :  0.00   Min.   : 0.000   Min.   :0.0000  
##  1st Qu.:211.5   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:0.0000  
##  Median :248.2   Median :  2.00   Median : 1.000   Median :0.0000  
##  Mean   :235.4   Mean   : 11.15   Mean   : 1.817   Mean   :0.2358  
##  3rd Qu.:285.1   3rd Qu.:  8.00   3rd Qu.: 2.000   3rd Qu.:0.0000  
##  Max.   :558.8   Max.   :230.00   Max.   :31.000   Max.   :7.0000  
##                                                                    
##   Financial_PR    Residential_PR   Government_PR    pop_density_PR  
##  Min.   :  0.00   Min.   :  0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:  3.00   1st Qu.:  3.00   1st Qu.: 0.000   1st Qu.: 3.138  
##  Median :  7.00   Median :  6.00   Median : 0.000   Median :10.774  
##  Mean   : 12.13   Mean   : 14.57   Mean   : 1.533   Mean   :14.097  
##  3rd Qu.: 15.00   3rd Qu.: 14.00   3rd Qu.: 2.000   3rd Qu.:23.387  
##  Max.   :134.00   Max.   :217.00   Max.   :19.000   Max.   :46.058  
##                                                                     
##    SHAPE_Area            Pop           CLUSTER            SP_CLUSTER 
##  Min.   :   49626   Min.   :     0   Length:246         1      :246  
##  1st Qu.:  547736   1st Qu.:  2030   Class :character   2      :  0  
##  Median : 1049955   Median : 10640   Mode  :character   3      :  0  
##  Mean   : 1855093   Mean   : 16396                      4      :  0  
##  3rd Qu.: 1795660   3rd Qu.: 25515                      5      :  0  
##  Max.   :69748299   Max.   :132900                      6      :  0  
##                                                         (Other):  0  
##           geometry    pop_density      
##  MULTIPOLYGON :246   Min.   :0.000000  
##  epsg:NA      :  0   1st Qu.:0.003138  
##  +proj=tmer...:  0   Median :0.010774  
##                      Mean   :0.014097  
##                      3rd Qu.:0.023387  
##                      Max.   :0.046058  
## 
summary(cluster2)
##   SUBZONE_N             CONDO     ONE_TWO_RM THREE_FOUR_RM    FIVE_RM 
##  Length:23          Min.   :0   Min.   :0    Min.   :0     Min.   :0  
##  Class :character   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0  
##  Mode  :character   Median :0   Median :0    Median :0     Median :0  
##                     Mean   :0   Mean   :0    Mean   :0     Mean   :0  
##                     3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0  
##                     Max.   :0   Max.   :0    Max.   :0     Max.   :0  
##                                                                       
##      LANDED    Aged.group Economy.active.population  Young.group    CONDO_PR
##  Min.   :0   Min.   :0    Min.   :0                 Min.   :0    Min.   :0  
##  1st Qu.:0   1st Qu.:0    1st Qu.:0                 1st Qu.:0    1st Qu.:0  
##  Median :0   Median :0    Median :0                 Median :0    Median :0  
##  Mean   :0   Mean   :0    Mean   :0                 Mean   :0    Mean   :0  
##  3rd Qu.:0   3rd Qu.:0    3rd Qu.:0                 3rd Qu.:0    3rd Qu.:0  
##  Max.   :0   Max.   :0    Max.   :0                 Max.   :0    Max.   :0  
##                                                                             
##  ONE_TWO_RM_PR THREE_FOUR_RM_PR   FIVE_RM_PR   LANDED_PR Aged_group_PR
##  Min.   :0     Min.   :0        Min.   :0    Min.   :0   Min.   :0    
##  1st Qu.:0     1st Qu.:0        1st Qu.:0    1st Qu.:0   1st Qu.:0    
##  Median :0     Median :0        Median :0    Median :0   Median :0    
##  Mean   :0     Mean   :0        Mean   :0    Mean   :0   Mean   :0    
##  3rd Qu.:0     3rd Qu.:0        3rd Qu.:0    3rd Qu.:0   3rd Qu.:0    
##  Max.   :0     Max.   :0        Max.   :0    Max.   :0   Max.   :0    
##                                                                       
##  Economy_active_PR Young_group_PR  Business_PR     Shopping_PR     
##  Min.   :0         Min.   :0      Min.   :  0.0   Min.   :0.00000  
##  1st Qu.:0         1st Qu.:0      1st Qu.: 34.5   1st Qu.:0.00000  
##  Median :0         Median :0      Median : 83.0   Median :0.00000  
##  Mean   :0         Mean   :0      Mean   :110.3   Mean   :0.08696  
##  3rd Qu.:0         3rd Qu.:0      3rd Qu.:181.0   3rd Qu.:0.00000  
##  Max.   :0         Max.   :0      Max.   :308.0   Max.   :1.00000  
##                                                                    
##   Industry_PR     Financial_PR Residential_PR   Government_PR    pop_density_PR
##  Min.   :0.000   Min.   :0.0   Min.   :0.0000   Min.   :0.0000   Min.   :0     
##  1st Qu.:0.000   1st Qu.:0.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0     
##  Median :0.000   Median :0.0   Median :0.0000   Median :0.0000   Median :0     
##  Mean   :0.913   Mean   :1.0   Mean   :0.2609   Mean   :0.1739   Mean   :0     
##  3rd Qu.:0.500   3rd Qu.:1.5   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.:0     
##  Max.   :8.000   Max.   :5.0   Max.   :4.0000   Max.   :1.0000   Max.   :0     
##                                                                                
##    SHAPE_Area            Pop      CLUSTER            SP_CLUSTER
##  Min.   : 1287950   Min.   :0   Length:23          2      :23  
##  1st Qu.: 1991662   1st Qu.:0   Class :character   1      : 0  
##  Median : 2349378   Median :0   Mode  :character   3      : 0  
##  Mean   : 5077986   Mean   :0                      4      : 0  
##  3rd Qu.: 3195702   3rd Qu.:0                      5      : 0  
##  Max.   :36707721   Max.   :0                      6      : 0  
##                                                    (Other): 0  
##           geometry   pop_density
##  MULTIPOLYGON :23   Min.   :0   
##  epsg:NA      : 0   1st Qu.:0   
##  +proj=tmer...: 0   Median :0   
##                     Mean   :0   
##                     3rd Qu.:0   
##                     Max.   :0   
## 
summary(cluster3)
##   SUBZONE_N             CONDO     ONE_TWO_RM THREE_FOUR_RM    FIVE_RM 
##  Length:14          Min.   :0   Min.   :0    Min.   :0     Min.   :0  
##  Class :character   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0  
##  Mode  :character   Median :0   Median :0    Median :0     Median :0  
##                     Mean   :0   Mean   :0    Mean   :0     Mean   :0  
##                     3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0  
##                     Max.   :0   Max.   :0    Max.   :0     Max.   :0  
##                                                                       
##      LANDED    Aged.group Economy.active.population  Young.group    CONDO_PR
##  Min.   :0   Min.   :0    Min.   :0                 Min.   :0    Min.   :0  
##  1st Qu.:0   1st Qu.:0    1st Qu.:0                 1st Qu.:0    1st Qu.:0  
##  Median :0   Median :0    Median :0                 Median :0    Median :0  
##  Mean   :0   Mean   :0    Mean   :0                 Mean   :0    Mean   :0  
##  3rd Qu.:0   3rd Qu.:0    3rd Qu.:0                 3rd Qu.:0    3rd Qu.:0  
##  Max.   :0   Max.   :0    Max.   :0                 Max.   :0    Max.   :0  
##                                                                             
##  ONE_TWO_RM_PR THREE_FOUR_RM_PR   FIVE_RM_PR   LANDED_PR Aged_group_PR
##  Min.   :0     Min.   :0        Min.   :0    Min.   :0   Min.   :0    
##  1st Qu.:0     1st Qu.:0        1st Qu.:0    1st Qu.:0   1st Qu.:0    
##  Median :0     Median :0        Median :0    Median :0   Median :0    
##  Mean   :0     Mean   :0        Mean   :0    Mean   :0   Mean   :0    
##  3rd Qu.:0     3rd Qu.:0        3rd Qu.:0    3rd Qu.:0   3rd Qu.:0    
##  Max.   :0     Max.   :0        Max.   :0    Max.   :0   Max.   :0    
##                                                                       
##  Economy_active_PR Young_group_PR  Business_PR    Shopping_PR    
##  Min.   :0         Min.   :0      Min.   : 0.0   Min.   : 0.000  
##  1st Qu.:0         1st Qu.:0      1st Qu.: 1.0   1st Qu.: 0.000  
##  Median :0         Median :0      Median : 6.5   Median : 1.000  
##  Mean   :0         Mean   :0      Mean   :10.5   Mean   : 3.786  
##  3rd Qu.:0         3rd Qu.:0      3rd Qu.:14.0   3rd Qu.: 1.750  
##  Max.   :0         Max.   :0      Max.   :48.0   Max.   :21.000  
##                                                                  
##   Industry_PR      Financial_PR   Residential_PR   Government_PR   
##  Min.   :0.0000   Min.   :  0.0   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:  0.5   1st Qu.:0.0000   1st Qu.: 0.000  
##  Median :0.0000   Median :  5.0   Median :0.0000   Median : 1.500  
##  Mean   :0.3571   Mean   : 19.5   Mean   :0.7143   Mean   : 3.571  
##  3rd Qu.:0.0000   3rd Qu.: 15.5   3rd Qu.:1.5000   3rd Qu.: 3.000  
##  Max.   :5.0000   Max.   :126.0   Max.   :3.0000   Max.   :17.000  
##                                                                    
##  pop_density_PR   SHAPE_Area           Pop      CLUSTER            SP_CLUSTER
##  Min.   :0      Min.   :  39438   Min.   :0   Length:14          3      :14  
##  1st Qu.:0      1st Qu.: 232866   1st Qu.:0   Class :character   1      : 0  
##  Median :0      Median : 651174   Median :0   Mode  :character   2      : 0  
##  Mean   :0      Mean   : 940623   Mean   :0                      4      : 0  
##  3rd Qu.:0      3rd Qu.:1474579   3rd Qu.:0                      5      : 0  
##  Max.   :0      Max.   :3449641   Max.   :0                      6      : 0  
##                                                                  (Other): 0  
##           geometry   pop_density
##  MULTIPOLYGON :14   Min.   :0   
##  epsg:NA      : 0   1st Qu.:0   
##  +proj=tmer...: 0   Median :0   
##                     Mean   :0   
##                     3rd Qu.:0   
##                     Max.   :0   
## 
summary(cluster4)
##   SUBZONE_N             CONDO     ONE_TWO_RM THREE_FOUR_RM    FIVE_RM 
##  Length:13          Min.   :0   Min.   :0    Min.   :0     Min.   :0  
##  Class :character   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0  
##  Mode  :character   Median :0   Median :0    Median :0     Median :0  
##                     Mean   :0   Mean   :0    Mean   :0     Mean   :0  
##                     3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0  
##                     Max.   :0   Max.   :0    Max.   :0     Max.   :0  
##                                                                       
##      LANDED    Aged.group Economy.active.population  Young.group    CONDO_PR
##  Min.   :0   Min.   :0    Min.   :0                 Min.   :0    Min.   :0  
##  1st Qu.:0   1st Qu.:0    1st Qu.:0                 1st Qu.:0    1st Qu.:0  
##  Median :0   Median :0    Median :0                 Median :0    Median :0  
##  Mean   :0   Mean   :0    Mean   :0                 Mean   :0    Mean   :0  
##  3rd Qu.:0   3rd Qu.:0    3rd Qu.:0                 3rd Qu.:0    3rd Qu.:0  
##  Max.   :0   Max.   :0    Max.   :0                 Max.   :0    Max.   :0  
##                                                                             
##  ONE_TWO_RM_PR THREE_FOUR_RM_PR   FIVE_RM_PR   LANDED_PR Aged_group_PR
##  Min.   :0     Min.   :0        Min.   :0    Min.   :0   Min.   :0    
##  1st Qu.:0     1st Qu.:0        1st Qu.:0    1st Qu.:0   1st Qu.:0    
##  Median :0     Median :0        Median :0    Median :0   Median :0    
##  Mean   :0     Mean   :0        Mean   :0    Mean   :0   Mean   :0    
##  3rd Qu.:0     3rd Qu.:0        3rd Qu.:0    3rd Qu.:0   3rd Qu.:0    
##  Max.   :0     Max.   :0        Max.   :0    Max.   :0   Max.   :0    
##                                                                       
##  Economy_active_PR Young_group_PR  Business_PR      Shopping_PR    
##  Min.   :0         Min.   :0      Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:0         1st Qu.:0      1st Qu.:  3.00   1st Qu.:0.0000  
##  Median :0         Median :0      Median : 22.00   Median :0.0000  
##  Mean   :0         Mean   :0      Mean   : 50.46   Mean   :0.6923  
##  3rd Qu.:0         3rd Qu.:0      3rd Qu.:102.00   3rd Qu.:0.0000  
##  Max.   :0         Max.   :0      Max.   :173.00   Max.   :8.0000  
##                                                                    
##   Industry_PR      Financial_PR    Residential_PR   Government_PR   
##  Min.   :0.0000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median : 0.000   Median :0.0000   Median :0.0000  
##  Mean   :0.8462   Mean   : 2.385   Mean   :0.1538   Mean   :0.6923  
##  3rd Qu.:1.0000   3rd Qu.: 2.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :21.000   Max.   :1.0000   Max.   :6.0000  
##                                                                     
##  pop_density_PR   SHAPE_Area            Pop      CLUSTER            SP_CLUSTER
##  Min.   :0      Min.   :  595652   Min.   :0   Length:13          4      :13  
##  1st Qu.:0      1st Qu.: 1635808   1st Qu.:0   Class :character   1      : 0  
##  Median :0      Median : 2241387   Median :0   Mode  :character   2      : 0  
##  Mean   :0      Mean   : 5497730   Mean   :0                      3      : 0  
##  3rd Qu.:0      3rd Qu.: 4387133   3rd Qu.:0                      5      : 0  
##  Max.   :0      Max.   :37147854   Max.   :0                      6      : 0  
##                                                                   (Other): 0  
##           geometry   pop_density
##  MULTIPOLYGON :13   Min.   :0   
##  epsg:NA      : 0   1st Qu.:0   
##  +proj=tmer...: 0   Median :0   
##                     Mean   :0   
##                     3rd Qu.:0   
##                     Max.   :0   
## 
summary(cluster5)
##   SUBZONE_N             CONDO     ONE_TWO_RM THREE_FOUR_RM    FIVE_RM 
##  Length:11          Min.   :0   Min.   :0    Min.   :0     Min.   :0  
##  Class :character   1st Qu.:0   1st Qu.:0    1st Qu.:0     1st Qu.:0  
##  Mode  :character   Median :0   Median :0    Median :0     Median :0  
##                     Mean   :0   Mean   :0    Mean   :0     Mean   :0  
##                     3rd Qu.:0   3rd Qu.:0    3rd Qu.:0     3rd Qu.:0  
##                     Max.   :0   Max.   :0    Max.   :0     Max.   :0  
##                                                                       
##      LANDED    Aged.group Economy.active.population  Young.group    CONDO_PR
##  Min.   :0   Min.   :0    Min.   :0                 Min.   :0    Min.   :0  
##  1st Qu.:0   1st Qu.:0    1st Qu.:0                 1st Qu.:0    1st Qu.:0  
##  Median :0   Median :0    Median :0                 Median :0    Median :0  
##  Mean   :0   Mean   :0    Mean   :0                 Mean   :0    Mean   :0  
##  3rd Qu.:0   3rd Qu.:0    3rd Qu.:0                 3rd Qu.:0    3rd Qu.:0  
##  Max.   :0   Max.   :0    Max.   :0                 Max.   :0    Max.   :0  
##                                                                             
##  ONE_TWO_RM_PR THREE_FOUR_RM_PR   FIVE_RM_PR   LANDED_PR Aged_group_PR
##  Min.   :0     Min.   :0        Min.   :0    Min.   :0   Min.   :0    
##  1st Qu.:0     1st Qu.:0        1st Qu.:0    1st Qu.:0   1st Qu.:0    
##  Median :0     Median :0        Median :0    Median :0   Median :0    
##  Mean   :0     Mean   :0        Mean   :0    Mean   :0   Mean   :0    
##  3rd Qu.:0     3rd Qu.:0        3rd Qu.:0    3rd Qu.:0   3rd Qu.:0    
##  Max.   :0     Max.   :0        Max.   :0    Max.   :0   Max.   :0    
##                                                                       
##  Economy_active_PR Young_group_PR  Business_PR      Shopping_PR
##  Min.   :0         Min.   :0      Min.   :  0.00   Min.   :0   
##  1st Qu.:0         1st Qu.:0      1st Qu.:  0.00   1st Qu.:0   
##  Median :0         Median :0      Median :  0.00   Median :0   
##  Mean   :0         Mean   :0      Mean   : 19.09   Mean   :0   
##  3rd Qu.:0         3rd Qu.:0      3rd Qu.:  7.00   3rd Qu.:0   
##  Max.   :0         Max.   :0      Max.   :184.00   Max.   :0   
##                                                                
##   Industry_PR      Financial_PR    Residential_PR Government_PR   
##  Min.   :0.0000   Min.   :0.0000   Min.   :0      Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0      1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0      Median :0.0000  
##  Mean   :0.5455   Mean   :0.1818   Mean   :0      Mean   :0.1818  
##  3rd Qu.:0.5000   3rd Qu.:0.0000   3rd Qu.:0      3rd Qu.:0.0000  
##  Max.   :3.0000   Max.   :2.0000   Max.   :0      Max.   :2.0000  
##                                                                   
##  pop_density_PR   SHAPE_Area           Pop      CLUSTER            SP_CLUSTER
##  Min.   :0      Min.   : 494504   Min.   :0   Length:11          5      :11  
##  1st Qu.:0      1st Qu.:1062835   1st Qu.:0   Class :character   1      : 0  
##  Median :0      Median :1287387   Median :0   Mode  :character   2      : 0  
##  Mean   :0      Mean   :2155523   Mean   :0                      3      : 0  
##  3rd Qu.:0      3rd Qu.:2116806   3rd Qu.:0                      4      : 0  
##  Max.   :0      Max.   :7034806   Max.   :0                      6      : 0  
##                                                                  (Other): 0  
##           geometry   pop_density
##  MULTIPOLYGON :11   Min.   :0   
##  epsg:NA      : 0   1st Qu.:0   
##  +proj=tmer...: 0   Median :0   
##                     Mean   :0   
##                     3rd Qu.:0   
##                     Max.   :0   
## 

9.3 Analyse statistics by urban functions

9.3.1 Business

ggplot(data=subzone_spatialcluster, aes(x=`Business_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Business_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Business_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid


The most number of business functions can be found in cluster 2, with a significant amount found in cluster 1

9.3.2 Shopping

ggplot(data=subzone_spatialcluster, aes(x=`Shopping_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Shopping_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Shopping_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.3.3 Industry

ggplot(data=subzone_spatialcluster, aes(x=`Industry_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Industry_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Industry_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.3.4 Financial

ggplot(data=subzone_spatialcluster, aes(x=`Financial_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Financial_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Financial_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.3.5 Industry

ggplot(data=subzone_spatialcluster, aes(x=`Industry_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Industry_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Industry_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.3.6 Government

ggplot(data=subzone_spatialcluster, aes(x=`Government_PR`)) +
  geom_histogram(bins=20, color="black", fill="light blue")+
  facet_grid(rows = vars(SP_CLUSTER))

tm_shape(subzone_spatialcluster) +
  tm_fill(col = "SP_CLUSTER",alpha=0.7) +
  tm_borders(lwd=1,lty="solid") +
  tm_bubbles(col="Government_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Government_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.4 Visualising Clusters

We will first convert the clusters into matrix form

groups_matrix <- as.matrix(clust5$groups)

9.4.1 Computing Mean Values of Variables of Clusters

clusters <- cbind(subzone_count.std, as.factor(groups_matrix))%>%
  rename(`SP_CLUSTER`=`as.factor(groups_matrix)`)

mean <- clusters%>%
  group_by(SP_CLUSTER)%>%
  summarise_all(funs(mean))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
mean_value <- data.frame(t(mean[-1]))
colnames(mean_value) <- mean[,1]
names(mean_value)[1] <- "Cluster1"
names(mean_value)[2] <- "Cluster2"
names(mean_value)[3] <- "Cluster3"
names(mean_value)[4] <- "Cluster4"
names(mean_value)[5] <- "Cluster5"
names(mean_value)[6] <- "Cluster6"
names(mean_value)[7] <- "Cluster7"
names(mean_value)[8] <- "Cluster8"

head(mean_value)
##                    Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6
## CONDO_PR         0.17588867        0        0        0        0        0
## ONE_TWO_RM_PR    0.07351903        0        0        0        0        0
## THREE_FOUR_RM_PR 0.23737864        0        0        0        0        0
## FIVE_RM_PR       0.12749622        0        0        0        0        0
## LANDED_PR        0.08260174        0        0        0        0        0
## Aged_group_PR    0.15703702        0        0        0        0        0
##                  Cluster7 Cluster8
## CONDO_PR                0        0
## ONE_TWO_RM_PR           0        0
## THREE_FOUR_RM_PR        0        0
## FIVE_RM_PR              0        0
## LANDED_PR               0        0
## Aged_group_PR           0        0

9.4.2 Visualising the clusters obtained through SKATER method

tm_shape(subzone_spatialcluster)+
  tm_borders(alpha = 0.5)+
  tm_fill(col = "SP_CLUSTER", alpha = 0.9, id = "SUBZONE_N")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid

9.5 Analysing Results

From the interactive map plotted above, we can see that there is a high proportion of Cluster 1, which is also supported by the mean_value table above, with most of the values coming from Cluster 1.
1. Cluster 1 occupies the majority of Singapore, which includes most area except the northern part and west part of Singapore. It includes area such as Bukit Batok, Ghim Moh, Clark Quay, Kovan, and even, Changi Airport. From our earlier analysis, we have identified that Cluster 1 consists of residential areas. This explains the high proportion of Cluster 1 in the Singapore map.
Our analysis above also shows that cluster 1 has high numbers of Residential and Business urban functions, and the least number of industrial. This could be due to buildings and amenities built in these areas that are highly requented by people, and they are all widely spread across Singapore for easy accessibility of these ammenties and residential areas.
Cluster 2 is located at the west of Singapore. These areas include Tuas View Extension, Gul Circle, and Jurong Island and Bukom. Cluster 2 has a high proportion of business urban functions, as it may be where offices are held at. Since there are no residentials, the buildings there are mainly business offices.
Cluster 3 is located at the south of Singapore, and areas include Marina South and Marina East. There is a high proportion of Financial urban function in Cluster 3. This could be due to the building of the casino in Singapore, and there are banks available to allow people for easy changing or transfer of money.
Cluster 4 is the central water catchment, which has no residentials. Instead, it too has a high proportion of businesses.
Cluster 5 is the north-east part of Singapore, which includes areas such as Paya Lebar West.
Cluster 6 is the north part of Singapore, which includes Seletar Aerospace Park. There are few urban functions there, and no residential area.
Cluster 7 is a small island at the corner of Singapore, with no urban functions, and no residential area there. There are also no sign of population there.
Lastly, Cluster 8 is a small area at Toh Tuck with high proportion of business urban function.

In conclusion, we can see that business urban function has a huge proportion across the Singapore map, as businesses are the ones that bring in the revenue for Singapore economy. In addition, we can see that places with residential areas tend to have higher population, with economy active population being the largest group in Singapore. These could be due to the economy active population being the one who runs the businesses in Singapore. In addition, places with high residential area are known to have higher amount of ammenities functions, such as Financial, or Shopping, and fewer amount of Industrial. Instead, Industrial will have more in areas that are less populated, such as Tuas.