In this exercise, we will be segmenting Singapore at the planning subzone level into homogeneous socioeconomic areas, by combining geodemographic data extracted from Singapore Department of Statistics, and urban functions extracted from geospatial data.
From the Singapore Residents by Planning Area Subzone, Age Group, Sex and Type of Dwelling, June 2011-2019 provide by Singapore Department of Statistics, we extract the following indicators for 2019:
- Economy active population (i.e. age 25-64)
- Young group (i.e. age 0-24)
- Aged group (i.t. 65 and above)
- Population density
- HDB 1-2 RM dwellers
- HDB 3-4 RM dwellers
- HDB 5RM-Ec dwellers
- Condominium and apartment dwellers
- Landed property dwellers
We are provided with several geosptail data, and from there, we extract the following urban functions:
- Government includding embassy
- Business
- Industry
- Shopping
- Residential
- Government
Social area analysis (SAA), which is a generic name for a number of methods employing census and other data to classify small areas into similar socioeconomic groups, is an approach which quantifies data in a useful fashion and has important applications in regional development, urban planning, medical and health services research. The potentialities of the approach for Liveable City planning and Smart City preparedness, however, have yet to be explored.
We are provided with five geospatial data sets: Govt_Embassy, Private residential, Shopping, Business, and Financial. These data sets are in ESRI shapefile format. In addition, we downloaded the following data from the relevant government portals: - Singapore Residents by Planning AreaSubzone, Age Group, Sex and Type of Dwelling, June 2011-2019 (https://www.singstat.gov.sg/),
- URA Master Plan 2014 Planning Subzone boundary (www.data.gov.sg)
Before we begin, a side note would be that all the interactive map will be static in this RPubs as the file is too large, and unable to be published on RPubs.
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'tidyverse', 'psych', 'reshape2', 'maptools', 'leaflet', 'stringr', 'factoextra', 'NbClust')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
## Loading required package: rgdal
## Loading required package: sp
## rgdal: version: 1.4-8, (SVN revision 845)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
## Path to GDAL shared files: C:/R/R-4.0.0/library/rgdal/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
## Path to PROJ.4 shared files: C:/R/R-4.0.0/library/rgdal/proj
## Linking to sp version: 1.4-1
## Loading required package: spdep
## Loading required package: spData
## To access larger datasets in this package, install the spDataLarge
## package with: `install.packages('spDataLarge',
## repos='https://nowosad.github.io/drat/', type='source')`
## Loading required package: sf
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
## Loading required package: ClustGeo
## Loading required package: tmap
## Loading required package: ggpubr
## Loading required package: ggplot2
## Loading required package: cluster
## Loading required package: heatmaply
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
## Loading required package: viridis
## Loading required package: viridisLite
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## ======================
## Welcome to heatmaply version 1.1.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
## Loading required package: corrplot
## corrplot 0.84 loaded
## Loading required package: tidyverse
## -- Attaching packages -------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.0.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: psych
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
## Loading required package: reshape2
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
## Loading required package: maptools
## Checking rgeos availability: FALSE
## Note: when rgeos is not available, polygon geometry computations in maptools depend on gpclib,
## which has a restricted licence. It is disabled by default;
## to enable gpclib, type gpclibPermit()
## Loading required package: leaflet
## Loading required package: factoextra
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
## Loading required package: NbClust
subzone <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_NO_SEA_PL")
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `C:\Users\amoss\OneDrive\Documents\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS: SVY21
Check CRS of subzone
st_crs(subzone)
## Coordinate Reference System:
## User input: SVY21
## wkt:
## PROJCRS["SVY21",
## BASEGEOGCRS["SVY21[WGS84]",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["Degree",0.0174532925199433]]],
## CONVERSION["unnamed",
## METHOD["Transverse Mercator",
## ID["EPSG",9807]],
## PARAMETER["Latitude of natural origin",1.36666666666667,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8801]],
## PARAMETER["Longitude of natural origin",103.833333333333,
## ANGLEUNIT["Degree",0.0174532925199433],
## ID["EPSG",8802]],
## PARAMETER["Scale factor at natural origin",1,
## SCALEUNIT["unity",1],
## ID["EPSG",8805]],
## PARAMETER["False easting",28001.642,
## LENGTHUNIT["metre",1],
## ID["EPSG",8806]],
## PARAMETER["False northing",38744.572,
## LENGTHUNIT["metre",1],
## ID["EPSG",8807]]],
## CS[Cartesian,2],
## AXIS["(E)",east,
## ORDER[1],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]],
## AXIS["(N)",north,
## ORDER[2],
## LENGTHUNIT["metre",1,
## ID["EPSG",9001]]]]
Notice that subzone do not have any CRS and is in SYV21. Hence we need to set a CRS for subzone and transform it into WGS84.
subzone <- st_as_sf(subzone,3414)
subzone <- st_transform(subzone, 4326)
respond <- read_csv("data/aspatial/respopagesextod2011to2019.csv")
## Parsed with column specification:
## cols(
## PA = col_character(),
## SZ = col_character(),
## AG = col_character(),
## Sex = col_character(),
## TOD = col_character(),
## Pop = col_double(),
## Time = col_double()
## )
We filter out only the year 2019 In addition, we select only the few indicators that we are interested in, namely, subzone, age group, type of dwelling, and population
year <- respond %>%
filter(Time == 2019) %>%
select(SZ,AG,TOD,Pop)
To have common variables with the geosptial data, we rename SZ to SUBZONE_N
names(year)[names(year) == "SZ"] <- "SUBZONE_N"
For consistency, we capitalise all the observations under SUBZONE_N
year$SUBZONE_N = toupper(year$SUBZONE_N)
We will rename the age groups into 3 different groups: Young group, Economy active population, and Aged group ##### Young group
year$AG <- replace(year$AG, year$AG=="0_to_4", "Young group")
year$AG <- replace(year$AG, year$AG=="5_to_9", "Young group")
year$AG <- replace(year$AG, year$AG=="10_to_14", "Young group")
year$AG <- replace(year$AG, year$AG=="15_to_19", "Young group")
year$AG <- replace(year$AG, year$AG=="20_to_24", "Young group")
year$AG <- replace(year$AG, year$AG=="25_to_29", "Economy active population")
year$AG <- replace(year$AG, year$AG=="30_to_34", "Economy active population")
year$AG <- replace(year$AG, year$AG=="35_to_39", "Economy active population")
year$AG <- replace(year$AG, year$AG=="40_to_44", "Economy active population")
year$AG <- replace(year$AG, year$AG=="45_to_49", "Economy active population")
year$AG <- replace(year$AG, year$AG=="50_to_54", "Economy active population")
year$AG <- replace(year$AG, year$AG=="55_to_59", "Economy active population")
year$AG <- replace(year$AG, year$AG=="60_to_64", "Economy active population")
year$AG <- replace(year$AG, year$AG=="65_to_69", "Aged group")
year$AG <- replace(year$AG, year$AG=="70_to_74", "Aged group")
year$AG <- replace(year$AG, year$AG=="75_to_79", "Aged group")
year$AG <- replace(year$AG, year$AG=="80_to_84", "Aged group")
year$AG <- replace(year$AG, year$AG=="85_to_89", "Aged group")
year$AG <- replace(year$AG, year$AG=="90_and_over", "Aged group")
Next, we will need to select variables that can be used to calculate population density Population density is calculated by taking population divided by area In this case, the area will be the variable SHAPE_Area
density <- inner_join(year, subzone)%>%
dplyr::select(SHAPE_Area, Pop, SUBZONE_N)
## Joining, by = "SUBZONE_N"
Next, we will use mutate() to get the population density
pop_density <- density %>%
mutate(pop_density=Pop/SHAPE_Area)%>%
dplyr::select(SUBZONE_N, pop_density)
business <- readOGR(dsn= "data/geospatial", layer="Business")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Business"
## with 6550 features
## It has 5 fields
## Integer64 fields read as strings: POI_ID
shopping <- readOGR(dsn= "data/geospatial", layer="Shopping")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Shopping"
## with 511 features
## It has 5 fields
## Integer64 fields read as strings: POI_ID
financial <- readOGR(dsn= "data/geospatial", layer="Financial")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Financial"
## with 3320 features
## It has 29 fields
## Integer64 fields read as strings: LINK_ID POI_ID CHAIN_ID VANCITY_ID
residential <- readOGR(dsn= "data/geospatial", layer="Private residential")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Private residential"
## with 3604 features
## It has 5 fields
## Integer64 fields read as strings: POI_ID
govt <- readOGR(dsn="data/geospatial", layer="Govt_Embassy")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\amoss\OneDrive\Documents\data\geospatial", layer: "Govt_Embassy"
## with 443 features
## It has 5 fields
## Integer64 fields read as strings: POI_ID
For easier identification, we rename variable names of the urban functions
names(business)[names(business) == "POI_NAME"] <- "BUSINESS"
names(shopping)[names(shopping) == "POI_NAME"] <- "SHOPPING"
names(financial)[names(financial) == "POI_NAME"] <- "FINANCIAL"
names(residential)[names(residential) == "POI_NAME"] <- "RESIDENTIAL"
names(govt)[names(govt) == "POI_NAME"] <- "GOVT"
Use st_as_sf() to set the different urban functions into shapefile
business = st_as_sf(business)
shopping = st_as_sf(shopping)
financial = st_as_sf(financial)
residential = st_as_sf(residential)
govt = st_as_sf(govt)
Check for NA values
any(is.na(subzone))
## [1] FALSE
Plot subzone
plot(st_geometry(subzone), main="SG Subzone")
Check for NA values
any(is.na(business))
## [1] TRUE
Plot Business
plot(st_geometry(subzone), main="BUSINESS")
plot((business), add=TRUE, col="blue")
## Warning in plot.sf((business), add = TRUE, col = "blue"): ignoring all but the
## first attribute
Check for NA values
any(is.na(shopping))
## [1] TRUE
Plot Shopping
plot(st_geometry(subzone), main="SHOPPING")
plot((shopping), add=TRUE, col="blue")
## Warning in plot.sf((shopping), add = TRUE, col = "blue"): ignoring all but the
## first attribute
Check for NA values
any(is.na(financial))
## [1] TRUE
Plot Financial
plot(st_geometry(subzone), main="FINANCIAL")
plot((financial), add=TRUE, col="blue")
## Warning in plot.sf((financial), add = TRUE, col = "blue"): ignoring all but the
## first attribute
Check for NA values
any(is.na(residential))
## [1] TRUE
Plot Residential
plot(st_geometry(subzone), main="RESIDENTIAL")
plot((residential), add=TRUE, col="blue")
## Warning in plot.sf((residential), add = TRUE, col = "blue"): ignoring all but
## the first attribute
Check for NA values
any(is.na(govt))
## [1] TRUE
Plot government
plot(st_geometry(subzone), main="GOVERNMENT")
plot((govt), add=TRUE, col="blue")
## Warning in plot.sf((govt), add = TRUE, col = "blue"): ignoring all but the first
## attribute
We need to extract a new df for industry We identifed that FAC_TYPE 5000 is business, 9991 is industry after tally with its POI_NAME (renamed to BUSINESS). So we filter out 9991 and create a df named industry
industry <- business%>%
filter(business$FAC_TYPE==9991)
names(industry)[names(industry) == "BUSINESS"] <- "INDUSTRY"
Check for NA values
any(is.na(industry))
## [1] TRUE
Plot Industry
plot(st_geometry(subzone), main="INDUSTRY")
plot((industry), add=TRUE, col="blue")
## Warning in plot.sf((industry), add = TRUE, col = "blue"): ignoring all but the
## first attribute
business <- business %>%
filter(!business$FAC_TYPE==9991)
Recheck for NA values
any(is.na(business))
## [1] TRUE
Plot Business again
plot(st_geometry(subzone), main="BUSINESS")
plot((business), add=TRUE, col="blue")
## Warning in plot.sf((business), add = TRUE, col = "blue"): ignoring all but the
## first attribute
Check if there is NA values in the geometry column as we need it to perform our analysis using the geometry
sum(is.na(business$geometry))
## [1] 0
sum(is.na(shopping$geometry))
## [1] 0
sum(is.na(financial$geometry))
## [1] 0
sum(is.na(industry$geometry))
## [1] 0
sum(is.na(residential$geometry))
## [1] 0
sum(is.na(govt$geometry))
## [1] 0
Plot a plots with all the urban functions
plot(st_geometry(subzone), main= "Urban Functions in Singapore")
plot(business, col="blue", add=TRUE)
## Warning in plot.sf(business, col = "blue", add = TRUE): ignoring all but the
## first attribute
plot(industry, col="green", add=TRUE)
## Warning in plot.sf(industry, col = "green", add = TRUE): ignoring all but the
## first attribute
plot(shopping, col="purple", add=TRUE)
## Warning in plot.sf(shopping, col = "purple", add = TRUE): ignoring all but the
## first attribute
plot(financial, col="yellow", add=TRUE)
## Warning in plot.sf(financial, col = "yellow", add = TRUE): ignoring all but the
## first attribute
plot(residential, col="orange", add=TRUE)
## Warning in plot.sf(residential, col = "orange", add = TRUE): ignoring all but
## the first attribute
plot(govt, col="red", add=TRUE)
## Warning in plot.sf(govt, col = "red", add = TRUE): ignoring all but the first
## attribute
st_crs(subzone)
## Coordinate Reference System:
## User input: EPSG:4326
## wkt:
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["geodetic latitude (Lat)",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["geodetic longitude (Lon)",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## USAGE[
## SCOPE["unknown"],
## AREA["World"],
## BBOX[-90,-180,90,180]],
## ID["EPSG",4326]]
st_crs(business)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
st_crs(shopping)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
st_crs(financial)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
st_crs(residential)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
st_crs(govt)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
st_crs(industry)
## Coordinate Reference System:
## User input: +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0
## wkt:
## BOUNDCRS[
## SOURCECRS[
## GEOGCRS["unknown",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]],
## ID["EPSG",6326]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8901]],
## CS[ellipsoidal,2],
## AXIS["longitude",east,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]],
## AXIS["latitude",north,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433,
## ID["EPSG",9122]]]]],
## TARGETCRS[
## GEOGCRS["WGS 84",
## DATUM["World Geodetic System 1984",
## ELLIPSOID["WGS 84",6378137,298.257223563,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4326]]],
## ABRIDGEDTRANSFORMATION["Transformation from unknown to WGS84",
## METHOD["Geocentric translations (geog2D domain)",
## ID["EPSG",9603]],
## PARAMETER["X-axis translation",0,
## ID["EPSG",8605]],
## PARAMETER["Y-axis translation",0,
## ID["EPSG",8606]],
## PARAMETER["Z-axis translation",0,
## ID["EPSG",8607]]]]
We use st_within() to combine the subzone geospatial data in. We are interested in SUBZONE_N as it tells us the location of the area We then use dplyr function to select variables that are required for our analysis
business <- bind_cols(
business,
subzone[as.numeric(st_within(business, subzone)),]) %>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,BUSINESS,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
business_count <- business %>% drop_na(BUSINESS) %>% group_by(SUBZONE_N) %>% count(BUSINESS)
business_count <- distinct(business_count)
business_count <- business_count %>%
group_by(SUBZONE_N) %>%
summarize(n=sum(n))
colnames(business_count)[2]="Business"
industry <- bind_cols(
industry,
subzone[as.numeric(st_within(industry, subzone)),]) %>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,INDUSTRY,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
industry_count <- industry %>% drop_na(INDUSTRY) %>% group_by(SUBZONE_N) %>% count(INDUSTRY)
industry_count <- distinct(industry_count)
industry_count <- industry_count %>%
group_by(SUBZONE_N) %>%
summarize(n=sum(n))
colnames(industry_count)[2]="Industry"
shopping <- bind_cols(
shopping,
subzone[as.numeric(st_within(shopping, subzone)),]) %>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,SHOPPING,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
shopping_count <- shopping %>% drop_na(SHOPPING) %>% group_by(SUBZONE_N) %>% count(SHOPPING)
shopping_count <- distinct(shopping_count)
shopping_count <- shopping_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))
colnames(shopping_count)[2]="Shopping"
financial <- bind_cols(
financial,
subzone[as.numeric(st_within(financial, subzone)),])%>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,FINANCIAL,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
financial_count <- financial %>% drop_na(FINANCIAL) %>% group_by(SUBZONE_N) %>% count(FINANCIAL)
financial_count <- distinct(financial_count)
financial_count <- financial_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))
colnames(financial_count)[2]="Financial"
residential <- bind_cols(
residential,
subzone[as.numeric(st_within(residential, subzone)),])%>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,RESIDENTIAL,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
residential_count <- residential %>% drop_na(RESIDENTIAL) %>% group_by(SUBZONE_N) %>% count(RESIDENTIAL)
residential_count <- distinct(residential_count)
residential_count <- residential_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))
colnames(residential_count)[2]="Residential"
govt <- bind_cols(
govt,
subzone[as.numeric(st_within(govt, subzone)),])%>%
st_set_geometry(NULL)%>%
dplyr::select(OBJECTID,GOVT,SUBZONE_N,SHAPE_Area)
## although coordinates are longitude/latitude, st_within assumes that they are planar
govt_count <- govt %>% drop_na(GOVT) %>% group_by(SUBZONE_N) %>% count(GOVT)
govt_count <- distinct(govt_count)
govt_count <- govt_count %>% group_by(SUBZONE_N) %>% summarize(n=sum(n))
colnames(govt_count)[2]="Government"
This function splits the Type of Dwelling into its individual types, as well as the age groups into individual types. New variables are created.
cast_housing <- dcast(year, SUBZONE_N~TOD, sum)
## Using Pop as value column: use value.var to override.
cast_housing <- cast_housing %>%
mutate(`HDB 3-Room Flats`+`HDB 4-Room Flats`)
names(cast_housing)[names(cast_housing) == "`HDB 3-Room Flats` + `HDB 4-Room Flats`"] <- "THREE_FOUR_RM"
cast_housing <- cast_housing%>%
rename(`CONDO` =`Condominiums and Other Apartments`,
`ONE_TWO_RM`=`HDB 1- and 2-Room Flats`,
`FIVE_RM`=`HDB 5-Room and Executive Flats`,
`LANDED`=`Landed Properties`)
cast_housing <- cast_housing%>%
dplyr::select(SUBZONE_N, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)
cast_age <- dcast(year, SUBZONE_N~AG, sum)
## Using Pop as value column: use value.var to override.
year4 <- left_join(cast_housing, cast_age)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, business_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, shopping_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, financial_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, residential_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, industry_count)
## Joining, by = "SUBZONE_N"
year4 <- left_join(year4, govt_count)
## Joining, by = "SUBZONE_N"
This column will be needed for deriving new variables, as well as calculating population density
year4_pop <- year %>%
group_by(SUBZONE_N) %>%
summarise(Pop=sum(Pop))
year4 <- left_join(year4,year4_pop)
## Joining, by = "SUBZONE_N"
year4$`Business`[is.na(year4$`Business`)]<-0
year4$`Financial`[is.na(year4$`Financial`)]<-0
year4$`Shopping`[is.na(year4$`Shopping`)]<-0
year4$`Residential`[is.na(year4$`Residential`)]<-0
year4$`Government`[is.na(year4$`Government`)]<-0
year4$`Industry`[is.na(year4$`Industry`)]<-0
Check for NA values again
summary(year4)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM
## Length:323 Min. : 0 Min. : 0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Mode :character Median : 230 Median : 0 Median : 0
## Mean : 1827 Mean : 542 Mean : 5953
## 3rd Qu.: 2835 3rd Qu.: 605 3rd Qu.: 9705
## Max. :16770 Max. :4700 Max. :75000
## FIVE_RM LANDED Aged group Economy active population
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0.0 Median : 730 Median : 2840
## Mean : 3297 Mean : 773.9 Mean : 1806 Mean : 7386
## 3rd Qu.: 3660 3rd Qu.: 400.0 3rd Qu.: 3000 3rd Qu.:10285
## Max. :47960 Max. :18820.0 Max. :18860 Max. :79780
## Young group Business Shopping Financial
## Min. : 0 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 0 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 1.00
## Median : 1170 Median : 2.00 Median : 0.000 Median : 5.00
## Mean : 3296 Mean : 19.94 Mean : 1.582 Mean : 10.28
## 3rd Qu.: 4365 3rd Qu.: 14.00 3rd Qu.: 1.000 3rd Qu.: 13.00
## Max. :34260 Max. :308.00 Max. :31.000 Max. :134.00
## Residential Industry Government Pop
## Min. : 0.00 Min. :0.0000 Min. : 0.000 Min. : 0
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.: 0
## Median : 4.00 Median :0.0000 Median : 0.000 Median : 4890
## Mean : 11.16 Mean :0.3406 Mean : 1.372 Mean : 12487
## 3rd Qu.: 11.00 3rd Qu.:0.0000 3rd Qu.: 1.000 3rd Qu.: 17065
## Max. :217.00 Max. :8.0000 Max. :19.000 Max. :132900
The unit of measurement of the different variables are different. For instance, the unit of measurement for age group is Population, as using the values of the different age groups will be based on the number ot Population. On the other hand, the unit of measurement for TOD is SHAPE_Area. In general, subzones with relatively higher number of population will also have higher number of age group living in that area. Similarly, subzones with relatively larger area will have higher number of TOD available in that area.
subzone_area <- subzone%>%
st_set_geometry(NULL)%>%
dplyr::select(SUBZONE_N, SHAPE_Area)
year4 <- left_join(subzone_area, year4)
## Joining, by = "SUBZONE_N"
We mutate to get new variables Population density is derived from dividing SHAPE_Area by Population
year4_derived <- year4 %>%
mutate(`CONDO_PR` = `CONDO`/`SHAPE_Area`*1000) %>%
mutate(`ONE_TWO_RM_PR` = `ONE_TWO_RM`/`SHAPE_Area`*1000) %>%
mutate(`THREE_FOUR_RM_PR` = `THREE_FOUR_RM`/`SHAPE_Area`*1000) %>%
mutate(`FIVE_RM_PR` = `FIVE_RM`/`SHAPE_Area`*1000) %>%
mutate(`LANDED_PR` = `LANDED`/`SHAPE_Area`*1000)%>%
mutate(`Aged_group_PR` = `Aged group`/`Pop`*1000)%>%
mutate(`Economy_active_PR` = `Economy active population`/`Pop`*1000)%>%
mutate(`Young_group_PR` = `Young group`/`Pop`*1000)%>%
mutate(`pop_density_PR`=`Pop`/`SHAPE_Area`*1000)%>%
rename(`Business_PR` =`Business`,
`Shopping_PR`=`Shopping`,
`Industry_PR`=`Industry`,
`Financial_PR`=`Financial`,
`Government_PR`=`Government`,
`Residential_PR`=`Residential`)
Replace NaN values to 0
year4_derived$Aged_group_PR <- replace(year4_derived$Aged_group_PR, year4_derived$Aged_group_PR=="NaN", 0)
year4_derived$Economy_active_PR <- replace(year4_derived$Economy_active_PR, year4_derived$Economy_active_PR=="NaN", 0)
year4_derived$Young_group_PR <- replace(year4_derived$Young_group_PR, year4_derived$Young_group_PR=="NaN", 0)
We can plot the distribution of the variables (i.e. number of condos available in an area) by using appropriate Exploratory Data Analysis (EDA), as shown in the code chunk below. We will be using ggplo() function to plot histograms, which is useful to identify the overall distribution of the data values (eg whether they are left skew, right skew, or normal distribution)
ggplot(data=year4_derived, aes(x=`CONDO`))+
geom_histogram(bins=20, color="black", fill="light blue")
Boxplot can be used to detect outliers
ggplot(data=year4_derived, aes(x=`CONDO`)) +
geom_boxplot(color="black", fill="light blue")
Next, we will also be plotting the distribution of the newly derived variables (i.e. Condo penetration rate)
ggplot(data=year4_derived, aes(x=`CONDO_PR`))+
geom_histogram(bins=20, color="black", fill="light blue")
And also use boxplot to detect outliers
ggplot(data=year4_derived, aes(x=`CONDO_PR`)) +
geom_boxplot(color="black", fill="light blue")
We observed that the data is less skewed for the penetration rate variable.
We will first create the individual histograms, before using ggarrange() function to group these histograms together
condo <- ggplot(data=year4_derived,
aes(x= `CONDO_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
one_two_rm <- ggplot(data=year4_derived,
aes(x= `ONE_TWO_RM_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
three_four_rm <- ggplot(data=year4_derived,
aes(x= `THREE_FOUR_RM_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
five_rm <- ggplot(data=year4_derived,
aes(x= `FIVE_RM_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
landed <- ggplot(data=year4_derived,
aes(x= `LANDED_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
aged <- ggplot(data=year4_derived,
aes(x= `Aged_group_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
active <- ggplot(data=year4_derived,
aes(x= `Economy_active_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
young <- ggplot(data=year4_derived,
aes(x= `Young_group_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
business <- ggplot(data=year4_derived,
aes(x= `Business_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
shopping <- ggplot(data=year4_derived,
aes(x= `Shopping_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
financial <- ggplot(data=year4_derived,
aes(x= `Financial_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
industry <- ggplot(data=year4_derived,
aes(x= `Industry_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
residential <- ggplot(data=year4_derived,
aes(x= `Residential_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
government <- ggplot(data=year4_derived,
aes(x= `Government_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
pop_density <- ggplot(data=year4_derived,
aes(x= `pop_density_PR`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(condo, one_two_rm, three_four_rm, five_rm, landed, aged, active, young, business, shopping, residential, industry, government, financial, pop_density, ncol = 4, nrow = 4)
From these histograms, we can see that the histograms are all mainly skewed, with some variables having a less skewed observation, such as Young_group_PR, which resembles more of a normal distribution.
Before we can prepare the choropleth map, we will need the geospatial data (subzone), as well as the aspatial data (year4_derived). We will need to combine these two datas into one.
We will reimport the geospatial subzone data
subzone <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_NO_SEA_PL")
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `C:\Users\amoss\OneDrive\Documents\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## projected CRS: SVY21
When doing the SKATER approach later, there will be 5 regions without neighbours. We will remove these 5 regions for consistency. ##### Filter out regions with no neighbours
subzone <- filter(subzone, OBJECTID != 36)
subzone <- filter(subzone, OBJECTID != 37)
subzone <- filter(subzone, OBJECTID != 38)
subzone <- filter(subzone, OBJECTID != 276)
subzone <- filter(subzone, OBJECTID != 298)
Then, making use of left_join() function of dplyr package, we will combine both datas together. The unique identifier used to join both data objects is SUBZONE_N, and SHAPE_Area
subzone_year <- left_join(subzone, year4_derived)
## Joining, by = c("SUBZONE_N", "SHAPE_Area")
We will then select the variables that are necessary for our analysis.
subzone_year <- subzone_year%>%
dplyr::select("SUBZONE_N", "CONDO", "ONE_TWO_RM", "THREE_FOUR_RM", "FIVE_RM", "LANDED", "Aged group", "Economy active population", "Young group","CONDO_PR", "ONE_TWO_RM_PR", "THREE_FOUR_RM_PR", "FIVE_RM_PR", "LANDED_PR", "Aged_group_PR", "Economy_active_PR", "Young_group_PR", "Business_PR", "Shopping_PR", "Industry_PR", "Financial_PR", "Residential_PR", "Government_PR", "pop_density_PR", "SHAPE_Area", "Pop")
To have a quick look at the distribution of Condo penetration rate of subzone, we will need to plot a choropleth map. We will use the qtm() function of tmap package
qtm(subzone_year, "CONDO_PR")
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
In order to reveal the distribution shown in the choropleth map above are bias to the denominator that we divided them by previously, we will need to plot both choropleth maps in.
We previously identified that age groups are bias to the population, so we will plot the age group choropleth together with the population choropleth.
pop.map <- tm_shape(subzone_year) +
tm_fill(col = "Pop",
n=5,
style = "jenks",
title = "Population") +
tm_borders(alpha = 0.5)
aged.map <- tm_shape(subzone_year) +
tm_fill(col = "Aged group",
n=5,
style = "jenks",
title = "Aged") +
tm_borders(alpha = 0.5)
active.map <- tm_shape(subzone_year) +
tm_fill(col = "Economy active population",
n=5,
style = "jenks",
title = "Active") +
tm_borders(alpha = 0.5)
young.map <- tm_shape(subzone_year) +
tm_fill(col = "Young group",
n=5,
style = "jenks",
title = "Young") +
tm_borders(alpha = 0.5)
tmap_arrange(pop.map, aged.map, active.map, young.map,
asp=NA, ncol=2, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
We identified from the choropleth maps above that subzones with relatively larger population also show larger number of a certain aged group. For example, there are darker chunks in all 4 choropleth maps in the east area, which shows that there is higher number of population in the east area, resulting in higher proportion of Aged, Active, and Young group in the east area too.
We have also previously identified that housing types are bias to the SHAPE_Area, so we will plot the housing choropleth together with the SHAPE_Area choropleth.
shape_area.map <- tm_shape(subzone_year) +
tm_fill(col = "SHAPE_Area",
n = 5,
style = "jenks",
title = "Shape area") +
tm_borders(alpha = 0.5)
condo.map <- tm_shape(subzone_year) +
tm_fill(col = "CONDO",
n = 5,
style = "jenks",
title = "Condo") +
tm_borders(alpha = 0.5)
one_two_rm.map <- tm_shape(subzone_year) +
tm_fill(col = "ONE_TWO_RM",
n = 5,
style = "jenks",
title = "1 2 room") +
tm_borders(alpha = 0.5)
three_four_rm.map <- tm_shape(subzone_year) +
tm_fill(col = "THREE_FOUR_RM",
n = 5,
style = "jenks",
title = "3 room") +
tm_borders(alpha = 0.5)
five_rm.map <- tm_shape(subzone_year) +
tm_fill(col = "FIVE_RM",
n = 5,
style = "jenks",
title = "5 room") +
tm_borders(alpha = 0.5)
landed.map <- tm_shape(subzone_year) +
tm_fill(col = "LANDED",
n = 5,
style = "jenks",
title = "landed") +
tm_borders(alpha = 0.5)
tmap_arrange(shape_area.map, condo.map, one_two_rm.map, three_four_rm.map, five_rm.map, landed.map,
asp=1, ncol=3, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Legend labels were too wide. The labels have been resized to 0.39, 0.34, 0.32, 0.31, 0.31. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.55, 0.55, 0.55, 0.50. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.62, 0.55, 0.55, 0.55. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.50, 0.47, 0.47, 0.47. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.55, 0.50, 0.47, 0.47. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Some legend labels were too wide. These labels have been resized to 0.62, 0.55, 0.55, 0.50. Increase legend.width (argument of tm_layout) to make the legend wider and therefore the labels larger.
We can observed from the choropleth map above that there are certain areas that are dark shaded in the Shape area choropleth map, but are light shaded in the housing choropleth map. This could largely be due to these areas being water catchment areas, or islands that do not have any population at all. Not taking these areas into consideration,
We identified from the choropleth maps above that subzones with relatively larger shape area also show larger number of a certain housing group.
business.map <- tm_shape(subzone_year) +
tm_fill(col = "Business_PR",
n = 5,
style = "jenks",
title = "Business") +
tm_borders(alpha = 0.5)
shopping.map <- tm_shape(subzone_year) +
tm_fill(col = "Shopping_PR",
n = 5,
style = "jenks",
title = "Shopping") +
tm_borders(alpha = 0.5)
financial.map <- tm_shape(subzone_year) +
tm_fill(col = "Financial_PR",
n = 5,
style = "jenks",
title = "Financial") +
tm_borders(alpha = 0.5)
industry.map <- tm_shape(subzone_year) +
tm_fill(col = "Industry_PR",
n = 5,
style = "jenks",
title = "Industry") +
tm_borders(alpha = 0.5)
residential.map <- tm_shape(subzone_year) +
tm_fill(col = "Residential_PR",
n = 5,
style = "jenks",
title = "Residential") +
tm_borders(alpha = 0.5)
government.map <- tm_shape(subzone_year) +
tm_fill(col = "Government_PR",
n = 5,
style = "jenks",
title = "Government") +
tm_borders(alpha = 0.5)
tmap_arrange(business.map, shopping.map, financial.map, industry.map, residential.map, government.map,
asp=1, ncol=3, nrow=2)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
Business, as an urban function, has a higher concentration as compared to other urban functions in certain subzones as observed.
tm_shape(subzone_year)+
tm_fill(col="pop_density_PR",
n=5,
style="jenks",
title="Population density")+
tm_borders(alpha=0.5)
## Warning: The shape subzone_year is invalid. See sf::st_is_valid
Population density of certain subzones are higher, which means that more people stay in that subzone per square kilometre.
Before we are able to perform any cluster analysis, it is important for us to ensure that the input variables are not highly correlatied.
To do so, we will use the corrplot.mixed() function of corrplot package to visualise and analyse the correlation of the input variables.
cluster_vars.cor = cor(year4_derived[, c(11:16, 18:26)])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
number.cex=0.4,
tl.col = "black")
The correlation plot above shows that THREE_FOUR_RM and pop_density_PR are highly correlated, with a correlation value of 0.93. This suggests that only one of them should be used in the cluster analysis instead of both. Comparing the correlation value of these two variables among the other variables, we realise that pop_density_PR is slightly higher correlated than THREE_FOUR_RM, hence we remove pop_density_PR to prevent collinearity issues.
We attempt to plot the correlation analysis again after removing pop_density_PR
cluster_vars.cor = cor(year4_derived[, c(11:16, 18:25)])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
number.cex=0.4,
tl.col = "black")
The correlation plot above shows a more evenly spread out correlation analysis between the variables. We can now proceed with our cluster analysis.
The code chunk below will be used to extract the clustering variables from the subzone_year simple feather object into data.frame We do not include pop_density because it is highly correlated with THREE_FOUR_RM as mentioned above.
cluster_vars <- subzone_year %>%
st_set_geometry(NULL)%>%
select("SUBZONE_N","CONDO_PR", "ONE_TWO_RM_PR", "THREE_FOUR_RM_PR", "FIVE_RM_PR", "LANDED_PR", "Aged_group_PR", "Economy_active_PR", "Young_group_PR", "Business_PR", "Shopping_PR", "Financial_PR", "Residential_PR", "Industry_PR", "Government_PR")
head(cluster_vars,10)
## SUBZONE_N CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## 1 PEOPLE'S PARK 3.32830732 0.0000000 0.0000000 0.0000000
## 2 BUKIT MERAH 0.00000000 0.0000000 0.4371873 2.2345130
## 3 CHINATOWN 0.51087945 1.7710488 12.2270482 3.4058630
## 4 PHILLIP 0.00000000 0.0000000 0.0000000 0.0000000
## 5 RAFFLES PLACE 0.00000000 0.0000000 0.0000000 0.0000000
## 6 CHINA SQUARE 0.07518405 0.0000000 9.6235583 0.4511043
## 7 TIONG BAHRU 1.53974007 0.9595482 20.3513471 6.0920151
## 8 BAYFRONT SUBZONE 0.00000000 0.0000000 0.0000000 0.0000000
## 9 TIONG BAHRU STATION 3.73445397 4.6181797 22.1786655 13.4554372
## 10 CLIFFORD PIER 0.00000000 0.0000000 0.0000000 0.0000000
## LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR Business_PR
## 1 0 193.5484 741.9355 64.51613 2
## 2 0 236.3636 536.3636 227.27273 16
## 3 0 194.2379 591.0781 214.68401 46
## 4 0 0.0000 0.0000 0.00000 11
## 5 0 0.0000 0.0000 0.00000 48
## 6 0 325.9259 570.3704 103.70370 14
## 7 0 192.6324 587.8741 219.49348 0
## 8 0 0.0000 0.0000 0.00000 1
## 9 0 180.8166 585.2236 233.95982 1
## 10 0 0.0000 0.0000 0.00000 7
## Shopping_PR Financial_PR Residential_PR Industry_PR Government_PR
## 1 1 12 1 0 0
## 2 0 16 1 4 4
## 3 8 36 11 0 7
## 4 0 9 0 0 3
## 5 9 126 2 0 12
## 6 7 23 1 0 2
## 7 0 9 6 0 0
## 8 2 17 0 0 0
## 9 2 20 12 0 0
## 10 1 2 0 0 2
Next, we will change the rows by SUBZONE_N instead of row numbers
row.names(cluster_vars) <- cluster_vars$"SUBZONE_N"
head(cluster_vars,10)
## SUBZONE_N CONDO_PR ONE_TWO_RM_PR
## PEOPLE'S PARK PEOPLE'S PARK 3.32830732 0.0000000
## BUKIT MERAH BUKIT MERAH 0.00000000 0.0000000
## CHINATOWN CHINATOWN 0.51087945 1.7710488
## PHILLIP PHILLIP 0.00000000 0.0000000
## RAFFLES PLACE RAFFLES PLACE 0.00000000 0.0000000
## CHINA SQUARE CHINA SQUARE 0.07518405 0.0000000
## TIONG BAHRU TIONG BAHRU 1.53974007 0.9595482
## BAYFRONT SUBZONE BAYFRONT SUBZONE 0.00000000 0.0000000
## TIONG BAHRU STATION TIONG BAHRU STATION 3.73445397 4.6181797
## CLIFFORD PIER CLIFFORD PIER 0.00000000 0.0000000
## THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## PEOPLE'S PARK 0.0000000 0.0000000 0 193.5484
## BUKIT MERAH 0.4371873 2.2345130 0 236.3636
## CHINATOWN 12.2270482 3.4058630 0 194.2379
## PHILLIP 0.0000000 0.0000000 0 0.0000
## RAFFLES PLACE 0.0000000 0.0000000 0 0.0000
## CHINA SQUARE 9.6235583 0.4511043 0 325.9259
## TIONG BAHRU 20.3513471 6.0920151 0 192.6324
## BAYFRONT SUBZONE 0.0000000 0.0000000 0 0.0000
## TIONG BAHRU STATION 22.1786655 13.4554372 0 180.8166
## CLIFFORD PIER 0.0000000 0.0000000 0 0.0000
## Economy_active_PR Young_group_PR Business_PR Shopping_PR
## PEOPLE'S PARK 741.9355 64.51613 2 1
## BUKIT MERAH 536.3636 227.27273 16 0
## CHINATOWN 591.0781 214.68401 46 8
## PHILLIP 0.0000 0.00000 11 0
## RAFFLES PLACE 0.0000 0.00000 48 9
## CHINA SQUARE 570.3704 103.70370 14 7
## TIONG BAHRU 587.8741 219.49348 0 0
## BAYFRONT SUBZONE 0.0000 0.00000 1 2
## TIONG BAHRU STATION 585.2236 233.95982 1 2
## CLIFFORD PIER 0.0000 0.00000 7 1
## Financial_PR Residential_PR Industry_PR Government_PR
## PEOPLE'S PARK 12 1 0 0
## BUKIT MERAH 16 1 4 4
## CHINATOWN 36 11 0 7
## PHILLIP 9 0 0 3
## RAFFLES PLACE 126 2 0 12
## CHINA SQUARE 23 1 0 2
## TIONG BAHRU 9 6 0 0
## BAYFRONT SUBZONE 17 0 0 0
## TIONG BAHRU STATION 20 12 0 0
## CLIFFORD PIER 2 0 0 2
Then, remove the column SUBZONE_N
subzone_count <- select(cluster_vars, c(2:15))
head(subzone_count, 10)
## CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## PEOPLE'S PARK 3.32830732 0.0000000 0.0000000 0.0000000
## BUKIT MERAH 0.00000000 0.0000000 0.4371873 2.2345130
## CHINATOWN 0.51087945 1.7710488 12.2270482 3.4058630
## PHILLIP 0.00000000 0.0000000 0.0000000 0.0000000
## RAFFLES PLACE 0.00000000 0.0000000 0.0000000 0.0000000
## CHINA SQUARE 0.07518405 0.0000000 9.6235583 0.4511043
## TIONG BAHRU 1.53974007 0.9595482 20.3513471 6.0920151
## BAYFRONT SUBZONE 0.00000000 0.0000000 0.0000000 0.0000000
## TIONG BAHRU STATION 3.73445397 4.6181797 22.1786655 13.4554372
## CLIFFORD PIER 0.00000000 0.0000000 0.0000000 0.0000000
## LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR
## PEOPLE'S PARK 0 193.5484 741.9355 64.51613
## BUKIT MERAH 0 236.3636 536.3636 227.27273
## CHINATOWN 0 194.2379 591.0781 214.68401
## PHILLIP 0 0.0000 0.0000 0.00000
## RAFFLES PLACE 0 0.0000 0.0000 0.00000
## CHINA SQUARE 0 325.9259 570.3704 103.70370
## TIONG BAHRU 0 192.6324 587.8741 219.49348
## BAYFRONT SUBZONE 0 0.0000 0.0000 0.00000
## TIONG BAHRU STATION 0 180.8166 585.2236 233.95982
## CLIFFORD PIER 0 0.0000 0.0000 0.00000
## Business_PR Shopping_PR Financial_PR Residential_PR
## PEOPLE'S PARK 2 1 12 1
## BUKIT MERAH 16 0 16 1
## CHINATOWN 46 8 36 11
## PHILLIP 11 0 9 0
## RAFFLES PLACE 48 9 126 2
## CHINA SQUARE 14 7 23 1
## TIONG BAHRU 0 0 9 6
## BAYFRONT SUBZONE 1 2 17 0
## TIONG BAHRU STATION 1 2 20 12
## CLIFFORD PIER 7 1 2 0
## Industry_PR Government_PR
## PEOPLE'S PARK 0 0
## BUKIT MERAH 4 4
## CHINATOWN 0 7
## PHILLIP 0 3
## RAFFLES PLACE 0 12
## CHINA SQUARE 0 2
## TIONG BAHRU 0 0
## BAYFRONT SUBZONE 0 0
## TIONG BAHRU STATION 0 0
## CLIFFORD PIER 0 2
summary(subzone_count)
## CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## Min. : 0.000 Min. : 0.0000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.: 0.000
## Median : 0.352 Median : 0.0000 Median : 0.000 Median : 0.000
## Mean : 1.591 Mean : 0.6140 Mean : 5.441 Mean : 2.718
## 3rd Qu.: 2.720 3rd Qu.: 0.6363 3rd Qu.:10.486 3rd Qu.: 3.807
## Max. :11.693 Max. :10.7964 Max. :29.630 Max. :27.554
## LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR
## Min. :0.0000 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.0
## Median :0.0000 Median :126.2 Median :576.0 Median :226.1
## Mean :0.4658 Mean :115.1 Mean :438.6 Mean :182.1
## 3rd Qu.:0.2881 3rd Qu.:184.3 3rd Qu.:599.8 3rd Qu.:272.7
## Max. :7.2889 Max. :947.4 Max. :900.0 Max. :558.8
## Business_PR Shopping_PR Financial_PR Residential_PR
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 1.00 1st Qu.: 1.00
## Median : 2.00 Median : 0.000 Median : 5.00 Median : 4.00
## Mean : 20.25 Mean : 1.607 Mean : 10.44 Mean : 11.33
## 3rd Qu.: 14.00 3rd Qu.: 1.000 3rd Qu.: 13.00 3rd Qu.: 11.00
## Max. :308.00 Max. :31.000 Max. :134.00 Max. :217.00
## Industry_PR Government_PR
## Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 0.000
## Mean :0.3459 Mean : 1.393
## 3rd Qu.:0.0000 3rd Qu.: 1.000
## Max. :8.0000 Max. :19.000
There is no NA values present.
We will need to standardise as the units of measurements are different.
Since many of our variables are not normally distributed from the histograms above, we will need to use Min-Max standardisation method.
In general, multiple variables will be used in cluster analysis, and it is not unusual that their values range are different. So to avoid biasness of the cluster analysis result with large values, we will need to standardise the input variables before performing cluster analysis.
We will use normalize() function of the heatmaply package to standardise the clustering variables by using Min-Max method.
subzone_count.std <- normalize(subzone_count)
summary(subzone_count.std)
## CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR
## Min. :0.0000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0301 Median :0.00000 Median :0.0000 Median :0.00000
## Mean :0.1361 Mean :0.05687 Mean :0.1836 Mean :0.09863
## 3rd Qu.:0.2326 3rd Qu.:0.05894 3rd Qu.:0.3539 3rd Qu.:0.13818
## Max. :1.0000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## LANDED_PR Aged_group_PR Economy_active_PR Young_group_PR
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :0.1332 Median :0.6400 Median :0.4046
## Mean :0.06390 Mean :0.1215 Mean :0.4874 Mean :0.3259
## 3rd Qu.:0.03952 3rd Qu.:0.1946 3rd Qu.:0.6664 3rd Qu.:0.4880
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## Business_PR Shopping_PR Financial_PR Residential_PR
## Min. :0.000000 Min. :0.00000 Min. :0.000000 Min. :0.000000
## 1st Qu.:0.000000 1st Qu.:0.00000 1st Qu.:0.007463 1st Qu.:0.004608
## Median :0.006494 Median :0.00000 Median :0.037313 Median :0.018433
## Mean :0.065752 Mean :0.05184 Mean :0.077912 Mean :0.052227
## 3rd Qu.:0.045455 3rd Qu.:0.03226 3rd Qu.:0.097015 3rd Qu.:0.050691
## Max. :1.000000 Max. :1.00000 Max. :1.000000 Max. :1.000000
## Industry_PR Government_PR
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.00000 Median :0.00000
## Mean :0.04324 Mean :0.07332
## 3rd Qu.:0.00000 3rd Qu.:0.05263
## Max. :1.00000 Max. :1.00000
Notice that the values range of the Min-Max standardised clustering variables are 0-1 now.
We do not have to do Z-score standardisation in this case as the datas are not normally distributed (they are skewed as mentioned previously )
Aged <- ggplot(data=year4_derived,
aes(x=`Aged_group_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Aged_minmax <- ggplot(data=subzone_s_df,
aes(x=`Aged_group_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max ")
Young <- ggplot(data=year4_derived,
aes(x=`Young_group_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Young_minmax <- ggplot(data=subzone_s_df,
aes(x=`Young_group_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max ")
Condo <- ggplot(data=year4_derived,
aes(x=`CONDO_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Condo_minmax <- ggplot(data=subzone_s_df,
aes(x=`CONDO_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
One_two_room <- ggplot(data=year4_derived,
aes(x=`ONE_TWO_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
One_two_room_minmax <- ggplot(data=subzone_s_df,
aes(x=`ONE_TWO_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Three_four_room <- ggplot(data=year4_derived,
aes(x=`THREE_FOUR_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Three_four_room_minmax <- ggplot(data=subzone_s_df,
aes(x=`THREE_FOUR_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Five_room <- ggplot(data=year4_derived,
aes(x=`FIVE_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Five_room_minmax <- ggplot(data=subzone_s_df,
aes(x=`FIVE_RM_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Landed <- ggplot(data=year4_derived,
aes(x=`LANDED_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Landed_minmax <- ggplot(data=subzone_s_df,
aes(x=`LANDED_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
ggarrange(Aged, Aged_minmax, Young, Young_minmax, Condo, Condo_minmax, One_two_room, One_two_room_minmax, Three_four_room, Three_four_room_minmax, Five_room, Five_room_minmax, Landed, Landed_minmax,
ncol=4,
nrow=4)
Business <- ggplot(data=year4_derived,
aes(x=`Business_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Business_minmax <- ggplot(data=subzone_s_df,
aes(x=`Business_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Shopping <- ggplot(data=year4_derived,
aes(x=`Shopping_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Shopping_minmax <- ggplot(data=subzone_s_df,
aes(x=`Shopping_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Financial <- ggplot(data=year4_derived,
aes(x=`Financial_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Financial_minmax <- ggplot(data=subzone_s_df,
aes(x=`Financial_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Industry <- ggplot(data=year4_derived,
aes(x=`Industry_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Industry_minmax <- ggplot(data=subzone_s_df,
aes(x=`Industry_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Residential <- ggplot(data=year4_derived,
aes(x=`Residential_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Residential_minmax <- ggplot(data=subzone_s_df,
aes(x=`Residential_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
Government <- ggplot(data=year4_derived,
aes(x=`Government_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")
subzone_s_df <- as.data.frame(subzone_count.std)
Government_minmax <- ggplot(data=subzone_s_df,
aes(x=`Government_PR`))+
geom_histogram(bins=20,
color="black",
fill="light blue")+
ggtitle("Min-Max")
ggarrange(Business, Business_minmax,Shopping, Shopping_minmax, Financial, Financial_minmax, Industry, Industry_minmax, Residential, Residential_minmax, Government, Government_minmax,
ncol=4,
nrow=3)
We will make use of the default euclidean proximity matrix
proxmat <- dist(subzone_count, method="euclidean")
We will perform hierarchical cluster analysis using ward.D method
hclust_ward <- hclust(proxmat, method="ward.D")
We can then plot the tree by using plot() function
plot(hclust_ward, cex=0.6)
fviz_nbclust(subzone_count.std, FUN=hcut, method="silhouette")
As seen above, 2 has an elbow. However, having 2 clusters does not seem suitable since it is found to have more than 2 clusters while doing the analyhsis with 2 clusters. Hence, we make use of the NbClust function to determine the number of clusters availabe.
duda <- NbClust(subzone_count.std, diss=NULL, distance="euclidean", min.nc=1, max.nc=15, method="ward.D", index="duda", alphaBeale=0.1)
duda$Best.nc
## Number_clusters Value_Index
## 8.0000 0.8636
From this, we can tell that there are a total of 8 clusters. We will use 8 clusters for our subsequent analysis.
We will need to identify stronger clustering structures. This can be done by using agnes() function of cluster package
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac <- function(x){
agnes(subzone_count, method=x)$ac
}
map_dbl(m,ac)
## average single complete ward
## 0.9727232 0.9696775 0.9758965 0.9960726
With reference to the output above, we can see that Ward’s method provides the strongest clustering structure among the four methods assessed (0.9960726). Hence, in the subsequent analysis, only Ward’s method will be used.
In the dendogram displayed above, each leaf corresponds to one observations. There are many leaves as there are many observations. As we move up the tree, observations that are similar to each other are combined into branches, which are themselves fused at a higher height.
The height of the fusion, provided on the vertical axis, indicates the (dis)similarity between two observations. The higher the height of the fushion, the less similar the observations are. Note that, conclusions about the proximity of two observations can be drawn only based on the height where branches containing these two observations first are fused. We cannot use the proximity of two observations along with the horizontal axis as criteria of their similarity.
We will now draw the dendogram with a border around the selected clusters by using rect.hclust() of R stats. The argument border is used to specify the border colors for the rectangles
plot(hclust_ward, cex = 0.6)
rect.hclust(hclust_ward, k = 8, border = 2:5)
With heatmaply, we are able to build both highly interactive cluster heatmap or static cluster heatmap
subzone_count_mat <- data.matrix(subzone_count)
heatmaply(normalize(subzone_count_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 8,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of subzones by Housing indicators",
xlab = "Indicators",
ylab = "subzone"
)
groups <- as.factor(cutree(hclust_ward, k=8))
subzone_year_cluster <- cbind(subzone_year, as.matrix(groups)) %>%
rename(`CLUSTER`=`as.matrix.groups.`)
Next, qtm() of tmap package will be used to plot the choropleth map showing the cluster formed
qtm(subzone_year_cluster, "CLUSTER")
## Warning: The shape subzone_year_cluster is invalid. See sf::st_is_valid
The choropleth map above reveals the clusters are very fragmented. This is one major limitation when non spatial clustering algorithm such as hierarchical cluster analysis method is used
tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Business") +
tm_bubbles(col = "Business_PR",
size = "Business_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
business_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Business",
scale=0.5) +
tm_bubbles(col = "Business_PR",
size = "Business_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
shopping_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Shopping",
scale=0.5) +
tm_bubbles(col = "Shopping_PR",
size = "Shopping_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
industry_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Industry",
scale=0.5) +
tm_bubbles(col = "Industry_PR",
size = "Industry_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
financial_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Financial",
scale=0.5) +
tm_bubbles(col = "Financial_PR",
size = "Financial_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
residential_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Residential",
scale=0.5) +
tm_bubbles(col = "Residential_PR",
size = "Residential_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
govt_cluster <- tm_shape(st_make_valid(subzone_year_cluster)) +
tm_fill(col="CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Government",
scale=0.5) +
tm_bubbles(col = "Government_PR",
size = "Government_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
tmap_arrange(business_cluster, shopping_cluster, industry_cluster, financial_cluster, residential_cluster, govt_cluster,
asp=0,
ncol=2,
nrow=3)
business <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Business_PR)) +
geom_bar(stat = "identity")
shopping <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Shopping_PR)) +
geom_bar(stat = "identity")
industry <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Industry_PR)) +
geom_bar(stat = "identity")
financial <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Financial_PR)) +
geom_bar(stat = "identity")
residential <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Residential_PR)) +
geom_bar(stat = "identity")
government <- ggplot(subzone_year_cluster, aes(x=CLUSTER, y=Government_PR)) +
geom_bar(stat = "identity")
ggarrange(business, shopping, industry, financial, residential, government, ncol=2, nrow=3)
Based on the bar charts and choropleth maps plotted above, we can make the conclusion that different urban functions has the highest number of different clusters. For example, business urban function has the most of cluster 7, whereas residential urban function has the most of cluster 4. Shopping, Industry, Financial, and Government cluster have a few clusters with high proprtion, with the remaining clusters having low proportion .
dataframe <- subzone_year_cluster %>%
select(CLUSTER, Business_PR, Shopping_PR, Financial_PR, Industry_PR, Government_PR, Residential_PR)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## CLUSTER variable count
## 1 1 Business_PR 2
## 2 2 Business_PR 16
## 3 2 Business_PR 46
## 4 3 Business_PR 11
## 5 3 Business_PR 48
## 6 2 Business_PR 14
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
From the cluster bar chart above, we can conclude that:
- Cluster 1 has all urban functions in small numbers, but business urban function has the highest proportion among the other urban functions
- Cluster 2 has a high proportion of financial urban function
- Cluster 3 is similar to cluster 2
- Cluster 4 has a high proportion of residential and business urban function
- Cluster 5 has small numbers of urban functions too, with financial and business having almost the same amount
- Cluster 6 has high number of business urban function
- Cluster 7 has an abnormally high number of business urban function
- Cluster 8 also has an abnormally high number of business urban function, with the other urban functions being almost close to 0
dataframe <- subzone_year_cluster%>%
select(CLUSTER, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## CLUSTER variable count
## 1 1 CONDO 310
## 2 2 CONDO 0
## 3 2 CONDO 300
## 4 3 CONDO 0
## 5 3 CONDO 0
## 6 2 CONDO 10
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
It can be concluded that clusters 3,7 & 8 do not have any residential types, and cluster 1 has close to 0 residential types. In addition, for the remaining clusters, they all have the highest proportion of three_four_room.
dataframe <- subzone_year_cluster%>%
select(CLUSTER, Aged.group, Economy.active.population, Young.group)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## CLUSTER variable count
## 1 1 Aged.group 60
## 2 2 Aged.group 260
## 3 2 Aged.group 2090
## 4 3 Aged.group 0
## 5 3 Aged.group 0
## 6 2 Aged.group 440
ggplot(dataframe2, aes(x=CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
It can be observed that clusters 3, 7 & 8 do not have any age groups as they are not residential areas. Also, cluster 2 shows the highest proportion of economy active popution. The remaining cluster’s highest number of proportion also belongs to economy active population.
We will first need to convert subzone into SpatialPolygonDataFrame, as SKATER function only support sp objects such as SpatialPolygonDataFrame.
subzone_year_sp <- as_Spatial(subzone_year)
subzone.nb <- poly2nb(subzone_year_sp)
summary(subzone.nb)
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 35 195 with 1 link
## 1 most connected region:
## 292 with 17 links
We can plot the neighbours list on subzone by using the code chunk below. Since we now can plot the community area boundaries as well, we plot this graph on top of the subzone map.
The first plot command gives the boundaries. This is followed by the plot of the neighbour list object, with coordinates applied to the original SpatialPolygonDataFrame to extract the centroids of the polygons.
These are used as the nodes for the graphic representation. We set color to blue, and specify add=TRUE, to ensure that the network is plotted on top of the boundary.
plot(subzone_year_sp, border=grey(.5))
plot(subzone.nb, coordinates(subzone_year_sp), col="blue", add=TRUE)
Edge cost is the distance between its nodes. This function compute the distance using data.frame with observations vector in each node
lcosts <- nbcosts(subzone.nb, subzone_count)
For each observation, this gives the pairwise dissimilarity between its values on the five variables and the values for the neighbouring observation (from the neighbour list). Basically, this is the notion of a generalised weight for a spatial weights matrix.
Next, we will incorporate these costs into a weights object in the same was as we did in the calculation of inverse of distance weights.
To achieve this, we make use of nb2listw() of spdep package, and specify style as B to make sure the cost values are not row-standardised.
subzone.w <- nb2listw(subzone.nb, lcosts, style="B")
summary(subzone.w)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## 35 195 with 1 link
## 1 most connected region:
## 292 with 17 links
##
## Weights style: B
## Weights constants summary:
## n nn S0 S1 S2
## B 318 101124 495346.3 545934159 5088278851
subzone.mst <- mstree(subzone.w)
After computing the minimum spanning tree, we can check its class and dimension by using class() and dim() respectively.
class(subzone.mst)
## [1] "mst" "matrix"
dim(subzone.mst)
## [1] 317 3
The dimenstion is now 317 as the minimum spanning tree consists of n-1 edges in order to traverse all the nodes
We can display the content of subzone.mst by using head() function
head(subzone.mst)
## [,1] [,2] [,3]
## [1,] 316 255 9.972409
## [2,] 316 310 19.719934
## [3,] 316 258 21.528379
## [4,] 310 244 21.838932
## [5,] 316 312 22.756519
## [6,] 244 309 34.952037
The plot method for the minimum spanning tree include a way to show the observation numbers of the nodes in addition to the edge. As before, we plot this together with the subzone boundaries. We observe how the initial neighbour list is simplified to just one edge connecting with each of the nodes, while passing through all the nodes
plot(subzone_year_sp, border=gray(.5))
plot.mst(subzone.mst, coordinates(subzone_year_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)
clust5 <- skater(subzone.mst[,1:2], subzone_count,7)
str(clust5)
## List of 8
## $ groups : num [1:318] 1 1 1 3 3 1 1 3 1 3 ...
## $ edges.groups:List of 8
## ..$ :List of 3
## .. ..$ node: num [1:246] 167 252 205 179 135 16 243 12 3 54 ...
## .. ..$ edge: num [1:245, 1:3] 252 205 49 179 135 16 243 12 3 54 ...
## .. ..$ ssw : num 31928
## ..$ :List of 3
## .. ..$ node: num [1:23] 100 78 134 87 180 237 112 84 63 113 ...
## .. ..$ edge: num [1:22, 1:3] 87 237 112 180 87 63 130 96 84 113 ...
## .. ..$ ssw : num 1797
## ..$ :List of 3
## .. ..$ node: num [1:14] 44 72 32 11 8 73 39 122 37 5 ...
## .. ..$ edge: num [1:13, 1:3] 8 11 73 122 32 72 11 44 8 37 ...
## .. ..$ ssw : num 380
## ..$ :List of 3
## .. ..$ node: num [1:13] 292 273 297 256 285 300 280 283 307 279 ...
## .. ..$ edge: num [1:12, 1:3] 256 285 273 307 283 273 292 297 280 300 ...
## .. ..$ ssw : num 673
## ..$ :List of 3
## .. ..$ node: num [1:11] 189 226 218 315 217 220 291 196 228 161 ...
## .. ..$ edge: num [1:10, 1:3] 189 226 315 217 220 217 291 217 218 226 ...
## .. ..$ ssw : num 330
## ..$ :List of 3
## .. ..$ node: num [1:6] 290 311 301 270 246 264
## .. ..$ edge: num [1:5, 1:3] 301 290 270 311 290 270 246 264 301 311 ...
## .. ..$ ssw : num 45.8
## ..$ :List of 3
## .. ..$ node: num [1:3] 296 299 274
## .. ..$ edge: num [1:2, 1:3] 296 296 299 274 0 0
## .. ..$ ssw : num 0
## ..$ :List of 3
## .. ..$ node: num [1:2] 239 86
## .. ..$ edge: num [1, 1:3] 239 86 37.5
## .. ..$ ssw : num 37.5
## $ not.prune : NULL
## $ candidates : int [1:8] 1 2 3 4 5 6 7 8
## $ ssto : num 90176
## $ ssw : num [1:8] 90176 74943 64183 53825 44549 ...
## $ crit : num [1:2] 1 Inf
## $ vec.crit : num [1:318] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "skater"
To check the cluster assignment,
ccs5 <- clust5$groups
ccs5
## [1] 1 1 1 3 3 1 1 3 1 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 2 1 3
## [38] 3 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 3 3 1
## [75] 1 1 2 2 1 1 1 1 1 2 1 8 2 1 1 1 1 1 1 2 2 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1
## [112] 2 2 1 1 1 2 1 1 3 1 3 1 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [149] 1 1 1 1 1 1 2 1 1 1 1 2 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## [186] 1 1 1 5 1 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 5 1 5 1 1
## [223] 1 1 1 5 1 5 1 1 1 1 1 2 1 1 2 2 8 1 1 1 1 1 1 6 1 1 1 1 1 1 1 1 1 4 1 1 1
## [260] 1 1 1 1 6 1 1 1 1 1 6 1 4 4 7 1 1 1 1 4 4 1 1 4 1 4 1 1 1 1 6 5 4 1 1 1 7
## [297] 4 1 7 4 6 1 4 5 1 4 4 1 1 1 6 1 1 1 5 1 1 1
To find out how many observations are in each cluster,
table(ccs5)
## ccs5
## 1 2 3 4 5 6 7 8
## 246 23 14 13 11 6 3 2
We also plot the pruned tree that shows different clusters on top of the subzone area
plot(subzone_year_sp, border=gray(.5))
plot(clust5, coordinates(subzone_year_sp), cex.lab=.7,
groups.colors=c("red","green","blue", "brown", "pink"), cex.circles=0.005, add=TRUE)
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
## Warning in segments(coords[id1, 1], coords[id1, 2], coords[id2, 1],
## coords[id2, : "add" is not a graphical parameter
We now plot the newly derived clusters by using SKATER method
groups_mat <- as.matrix(clust5$groups)
subzone_spatialcluster <- cbind(subzone_year_cluster, as.factor(groups_mat)) %>%
rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(subzone_spatialcluster, "SP_CLUSTER")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
Based on the choropleth map above, we can observe that:
1. Cluster 1 is majority
2. Cluster 2 is located at the west of the map
3. Cluster 3 is located at the south of the map
4. Cluseter 4 is located at the central of the map
5. Cluster 5 is located at the east of the map
tm_shape(subzone_spatialcluster)+
tm_borders(alpha=0.5)+
tm_fill(col="SP_CLUSTER", alpha=0.9, id="SUBZONE_N")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Business") +
tm_bubbles(col = "Business_PR",
size = "Business_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
business_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Business",
scale=0.5) +
tm_bubbles(col = "Business_PR",
size = "Business_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
shopping_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Shopping",
scale=0.5) +
tm_bubbles(col = "Shopping_PR",
size = "Shopping_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
industry_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Industry",
scale=0.5) +
tm_bubbles(col = "Industry_PR",
size = "Industry_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
financial_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Financial",
scale=0.5) +
tm_bubbles(col = "Financial_PR",
size = "Financial_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
residential_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Residential",
scale=0.5) +
tm_bubbles(col = "Residential_PR",
size = "Residential_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
govt_cluster <- tm_shape(st_make_valid(subzone_spatialcluster)) +
tm_fill(col="SP_CLUSTER") +
tm_borders(lwd = 0.1, alpha = 1) +
tm_layout(title = "Government",
scale=0.5) +
tm_bubbles(col = "Government_PR",
size = "Government_PR",
border.col = "black",
border.lwd = 1,
alpha = 0.7)
tmap_arrange(business_cluster, shopping_cluster, industry_cluster, financial_cluster, residential_cluster, govt_cluster,
asp=0,
ncol=2,
nrow=3)
business <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Business_PR)) +
geom_bar(stat = "identity")
shopping <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Shopping_PR)) +
geom_bar(stat = "identity")
industry <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Industry_PR)) +
geom_bar(stat = "identity")
financial <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Financial_PR)) +
geom_bar(stat = "identity")
residential <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Residential_PR)) +
geom_bar(stat = "identity")
government <- ggplot(subzone_spatialcluster, aes(x=SP_CLUSTER, y=Government_PR)) +
geom_bar(stat = "identity")
ggarrange(business, shopping, industry, financial, residential, government, ncol=2, nrow=3)
Based on the bar charts and choropleth maps plotted above, we can make the conclusion that majority of the cluster among the urban functions is cluster 1, with a high percentage of cluster 1 in each of the urban functions. For example, for the Residential_PR group, almost all of the data are under cluster 1. This is supported by the choropleth map, with a large proportion of the area on the map being cluster 1 (green colour)
We can also observe that the Business_PR group has the widest variety of clusters, with cluster 2 being almsot as high as cluster 1. There is also a slight number of cluster 3, 4, and 5.
dataframe <- subzone_spatialcluster %>%
select(SP_CLUSTER, Business_PR, Shopping_PR, Financial_PR, Industry_PR, Government_PR, Residential_PR)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## SP_CLUSTER variable count
## 1 1 Business_PR 2
## 2 1 Business_PR 16
## 3 1 Business_PR 46
## 4 3 Business_PR 11
## 5 3 Business_PR 48
## 6 1 Business_PR 14
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
From this, we can conclude that:
- Cluster 1 has a high proportion of business and residential urban functions
- Cluster 2 has an abnormally high number of business proportion
- Cluster 3 has slightly lower number of urban function proportion as compared to clusters 1 and 2, but highest proportion of financial
- Cluster 4 has an abnormally high number of business proportion
- Cluster 5 also has an abnormally high number of business proportion
- Cluster 6 has very few number of urban functions, but business proportion is still the most
- Cluster 7 has no urban functions at all
- Cluster 8 has high proportion of business urban function
dataframe <- subzone_spatialcluster %>%
select(SP_CLUSTER, CONDO, ONE_TWO_RM, THREE_FOUR_RM, FIVE_RM, LANDED)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## SP_CLUSTER variable count
## 1 1 CONDO 310
## 2 1 CONDO 0
## 3 1 CONDO 300
## 4 3 CONDO 0
## 5 3 CONDO 0
## 6 1 CONDO 10
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
It can be observed that only cluster 1 are residential areas, with three_four_rm being the highest in count.
dataframe <- subzone_spatialcluster%>%
select(SP_CLUSTER, Aged.group, Economy.active.population, Young.group)%>%
st_set_geometry(NULL)
dataframe2 <- melt(dataframe, id.vars="SP_CLUSTER")
colnames(dataframe2)[3]="count"
head(dataframe2)
## SP_CLUSTER variable count
## 1 1 Aged.group 60
## 2 1 Aged.group 260
## 3 1 Aged.group 2090
## 4 3 Aged.group 0
## 5 3 Aged.group 0
## 6 1 Aged.group 440
ggplot(dataframe2, aes(x=SP_CLUSTER, y=count, fill=variable))+
geom_bar(stat="identity", position="dodge")
Similarly, only cluster 1 has age groups, mainly because they are residential areas. The higher proportion is economy active population.
Add pop_density into year4 data
subzone_spatialcluster <- subzone_spatialcluster %>%
mutate(pop_density=Pop/SHAPE_Area)
ggplot(data=subzone_spatialcluster, aes(x=`pop_density`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="pop_density",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="pop_density")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
By looking at the map, we can see that cluster 1 generally have high population density, with the exception of Lim Chu Kang area, and Changi area. However, I did not remove these areas as even though these areas have population = 0, this does not mean that urban functions = 0. For example, it could be due to industrial area having alot of industries, but with nobody staying there.
In cluster 4, there are no population density plots on in since it is the central water catchment area.
One of the highly populated cluster 1 includes Tiong Bahru, Bukit Ho Swee, Rivervale, and Sengkang Town Centre.
ggplot(data=subzone_spatialcluster, aes(x=`Young.group`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Young.group",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Young.group")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
By looking at the map, we can see that cluster 1 generally have higher number of Young group across the subzones.
ggplot(data=subzone_spatialcluster, aes(x=`Aged.group`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Aged.group",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Aged.group")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
Similar to Young group, by looking at the map, we can see that cluster 1 generally have higher number of Aged group across the subzones.
ggplot(data=subzone_spatialcluster, aes(x=`Economy.active.population`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Economy.active.population",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="Economy.active.population")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
Similar to Young group, by looking at the map, we can see that cluster 1 generally have higher number of Active group across the subzones.
ggplot(data=subzone_spatialcluster, aes(x=`CONDO`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="CONDO",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="CONDO")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`ONE_TWO_RM`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="ONE_TWO_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="ONE_TWO_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`THREE_FOUR_RM`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="THREE_FOUR_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="THREE_FOUR_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`FIVE_RM`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="FIVE_RM",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="FIVE_RM")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`LANDED`))+
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows=vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="LANDED",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "quantile",size="LANDED")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
cluster1 <- subzone_spatialcluster %>%
filter(SP_CLUSTER==1)
cluster2 <- subzone_spatialcluster %>%
filter(SP_CLUSTER==2)
cluster3 <- subzone_spatialcluster %>%
filter(SP_CLUSTER==3)
cluster4 <- subzone_spatialcluster %>%
filter(SP_CLUSTER==4)
cluster5 <- subzone_spatialcluster %>%
filter(SP_CLUSTER==5)
summary(cluster1)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM
## Length:246 Min. : 0 Min. : 0.0 Min. : 0
## Class :character 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0
## Mode :character Median : 1205 Median : 0.0 Median : 3605
## Mean : 2399 Mean : 711.6 Mean : 7816
## 3rd Qu.: 3945 3rd Qu.:1217.5 3rd Qu.:12650
## Max. :16770 Max. :4700.0 Max. :75000
##
## FIVE_RM LANDED Aged.group Economy.active.population
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 280 1st Qu.: 1222
## Median : 1315 Median : 0 Median : 1725 Median : 5955
## Mean : 4329 Mean : 1016 Mean : 2371 Mean : 9698
## 3rd Qu.: 5305 3rd Qu.: 760 3rd Qu.: 3458 3rd Qu.:15242
## Max. :47960 Max. :18820 Max. :18860 Max. :79780
##
## Young.group CONDO_PR ONE_TWO_RM_PR THREE_FOUR_RM_PR
## Min. : 0.0 Min. : 0.000 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 522.5 1st Qu.: 0.000 1st Qu.: 0.0000 1st Qu.: 0.000
## Median : 2455.0 Median : 1.232 Median : 0.0000 Median : 4.314
## Mean : 4327.0 Mean : 2.057 Mean : 0.7937 Mean : 7.034
## 3rd Qu.: 6437.5 3rd Qu.: 3.527 3rd Qu.: 0.9636 3rd Qu.:13.501
## Max. :34260.0 Max. :11.693 Max. :10.7964 Max. :29.630
##
## FIVE_RM_PR LANDED_PR Aged_group_PR Economy_active_PR
## Min. : 0.000 Min. :0.0000 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:100.1 1st Qu.:566.5
## Median : 1.374 Median :0.0000 Median :149.4 Median :588.2
## Mean : 3.513 Mean :0.6021 Mean :148.8 Mean :567.0
## 3rd Qu.: 5.149 3rd Qu.:0.5942 3rd Qu.:193.5 3rd Qu.:605.3
## Max. :27.554 Max. :7.2889 Max. :947.4 Max. :900.0
##
## Young_group_PR Business_PR Shopping_PR Industry_PR
## Min. : 0.0 Min. : 0.00 Min. : 0.000 Min. :0.0000
## 1st Qu.:211.5 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.:0.0000
## Median :248.2 Median : 2.00 Median : 1.000 Median :0.0000
## Mean :235.4 Mean : 11.15 Mean : 1.817 Mean :0.2358
## 3rd Qu.:285.1 3rd Qu.: 8.00 3rd Qu.: 2.000 3rd Qu.:0.0000
## Max. :558.8 Max. :230.00 Max. :31.000 Max. :7.0000
##
## Financial_PR Residential_PR Government_PR pop_density_PR
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 3.00 1st Qu.: 3.00 1st Qu.: 0.000 1st Qu.: 3.138
## Median : 7.00 Median : 6.00 Median : 0.000 Median :10.774
## Mean : 12.13 Mean : 14.57 Mean : 1.533 Mean :14.097
## 3rd Qu.: 15.00 3rd Qu.: 14.00 3rd Qu.: 2.000 3rd Qu.:23.387
## Max. :134.00 Max. :217.00 Max. :19.000 Max. :46.058
##
## SHAPE_Area Pop CLUSTER SP_CLUSTER
## Min. : 49626 Min. : 0 Length:246 1 :246
## 1st Qu.: 547736 1st Qu.: 2030 Class :character 2 : 0
## Median : 1049955 Median : 10640 Mode :character 3 : 0
## Mean : 1855093 Mean : 16396 4 : 0
## 3rd Qu.: 1795660 3rd Qu.: 25515 5 : 0
## Max. :69748299 Max. :132900 6 : 0
## (Other): 0
## geometry pop_density
## MULTIPOLYGON :246 Min. :0.000000
## epsg:NA : 0 1st Qu.:0.003138
## +proj=tmer...: 0 Median :0.010774
## Mean :0.014097
## 3rd Qu.:0.023387
## Max. :0.046058
##
summary(cluster2)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM FIVE_RM
## Length:23 Min. :0 Min. :0 Min. :0 Min. :0
## Class :character 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Mode :character Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
## LANDED Aged.group Economy.active.population Young.group CONDO_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## Economy_active_PR Young_group_PR Business_PR Shopping_PR
## Min. :0 Min. :0 Min. : 0.0 Min. :0.00000
## 1st Qu.:0 1st Qu.:0 1st Qu.: 34.5 1st Qu.:0.00000
## Median :0 Median :0 Median : 83.0 Median :0.00000
## Mean :0 Mean :0 Mean :110.3 Mean :0.08696
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:181.0 3rd Qu.:0.00000
## Max. :0 Max. :0 Max. :308.0 Max. :1.00000
##
## Industry_PR Financial_PR Residential_PR Government_PR pop_density_PR
## Min. :0.000 Min. :0.0 Min. :0.0000 Min. :0.0000 Min. :0
## 1st Qu.:0.000 1st Qu.:0.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0
## Median :0.000 Median :0.0 Median :0.0000 Median :0.0000 Median :0
## Mean :0.913 Mean :1.0 Mean :0.2609 Mean :0.1739 Mean :0
## 3rd Qu.:0.500 3rd Qu.:1.5 3rd Qu.:0.0000 3rd Qu.:0.0000 3rd Qu.:0
## Max. :8.000 Max. :5.0 Max. :4.0000 Max. :1.0000 Max. :0
##
## SHAPE_Area Pop CLUSTER SP_CLUSTER
## Min. : 1287950 Min. :0 Length:23 2 :23
## 1st Qu.: 1991662 1st Qu.:0 Class :character 1 : 0
## Median : 2349378 Median :0 Mode :character 3 : 0
## Mean : 5077986 Mean :0 4 : 0
## 3rd Qu.: 3195702 3rd Qu.:0 5 : 0
## Max. :36707721 Max. :0 6 : 0
## (Other): 0
## geometry pop_density
## MULTIPOLYGON :23 Min. :0
## epsg:NA : 0 1st Qu.:0
## +proj=tmer...: 0 Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
summary(cluster3)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM FIVE_RM
## Length:14 Min. :0 Min. :0 Min. :0 Min. :0
## Class :character 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Mode :character Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
## LANDED Aged.group Economy.active.population Young.group CONDO_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## Economy_active_PR Young_group_PR Business_PR Shopping_PR
## Min. :0 Min. :0 Min. : 0.0 Min. : 0.000
## 1st Qu.:0 1st Qu.:0 1st Qu.: 1.0 1st Qu.: 0.000
## Median :0 Median :0 Median : 6.5 Median : 1.000
## Mean :0 Mean :0 Mean :10.5 Mean : 3.786
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:14.0 3rd Qu.: 1.750
## Max. :0 Max. :0 Max. :48.0 Max. :21.000
##
## Industry_PR Financial_PR Residential_PR Government_PR
## Min. :0.0000 Min. : 0.0 Min. :0.0000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 0.5 1st Qu.:0.0000 1st Qu.: 0.000
## Median :0.0000 Median : 5.0 Median :0.0000 Median : 1.500
## Mean :0.3571 Mean : 19.5 Mean :0.7143 Mean : 3.571
## 3rd Qu.:0.0000 3rd Qu.: 15.5 3rd Qu.:1.5000 3rd Qu.: 3.000
## Max. :5.0000 Max. :126.0 Max. :3.0000 Max. :17.000
##
## pop_density_PR SHAPE_Area Pop CLUSTER SP_CLUSTER
## Min. :0 Min. : 39438 Min. :0 Length:14 3 :14
## 1st Qu.:0 1st Qu.: 232866 1st Qu.:0 Class :character 1 : 0
## Median :0 Median : 651174 Median :0 Mode :character 2 : 0
## Mean :0 Mean : 940623 Mean :0 4 : 0
## 3rd Qu.:0 3rd Qu.:1474579 3rd Qu.:0 5 : 0
## Max. :0 Max. :3449641 Max. :0 6 : 0
## (Other): 0
## geometry pop_density
## MULTIPOLYGON :14 Min. :0
## epsg:NA : 0 1st Qu.:0
## +proj=tmer...: 0 Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
summary(cluster4)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM FIVE_RM
## Length:13 Min. :0 Min. :0 Min. :0 Min. :0
## Class :character 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Mode :character Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
## LANDED Aged.group Economy.active.population Young.group CONDO_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## Economy_active_PR Young_group_PR Business_PR Shopping_PR
## Min. :0 Min. :0 Min. : 0.00 Min. :0.0000
## 1st Qu.:0 1st Qu.:0 1st Qu.: 3.00 1st Qu.:0.0000
## Median :0 Median :0 Median : 22.00 Median :0.0000
## Mean :0 Mean :0 Mean : 50.46 Mean :0.6923
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:102.00 3rd Qu.:0.0000
## Max. :0 Max. :0 Max. :173.00 Max. :8.0000
##
## Industry_PR Financial_PR Residential_PR Government_PR
## Min. :0.0000 Min. : 0.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median : 0.000 Median :0.0000 Median :0.0000
## Mean :0.8462 Mean : 2.385 Mean :0.1538 Mean :0.6923
## 3rd Qu.:1.0000 3rd Qu.: 2.000 3rd Qu.:0.0000 3rd Qu.:1.0000
## Max. :3.0000 Max. :21.000 Max. :1.0000 Max. :6.0000
##
## pop_density_PR SHAPE_Area Pop CLUSTER SP_CLUSTER
## Min. :0 Min. : 595652 Min. :0 Length:13 4 :13
## 1st Qu.:0 1st Qu.: 1635808 1st Qu.:0 Class :character 1 : 0
## Median :0 Median : 2241387 Median :0 Mode :character 2 : 0
## Mean :0 Mean : 5497730 Mean :0 3 : 0
## 3rd Qu.:0 3rd Qu.: 4387133 3rd Qu.:0 5 : 0
## Max. :0 Max. :37147854 Max. :0 6 : 0
## (Other): 0
## geometry pop_density
## MULTIPOLYGON :13 Min. :0
## epsg:NA : 0 1st Qu.:0
## +proj=tmer...: 0 Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
summary(cluster5)
## SUBZONE_N CONDO ONE_TWO_RM THREE_FOUR_RM FIVE_RM
## Length:11 Min. :0 Min. :0 Min. :0 Min. :0
## Class :character 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Mode :character Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0
##
## LANDED Aged.group Economy.active.population Young.group CONDO_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## ONE_TWO_RM_PR THREE_FOUR_RM_PR FIVE_RM_PR LANDED_PR Aged_group_PR
## Min. :0 Min. :0 Min. :0 Min. :0 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0 Median :0 Median :0 Median :0 Median :0
## Mean :0 Mean :0 Mean :0 Mean :0 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :0 Max. :0 Max. :0 Max. :0
##
## Economy_active_PR Young_group_PR Business_PR Shopping_PR
## Min. :0 Min. :0 Min. : 0.00 Min. :0
## 1st Qu.:0 1st Qu.:0 1st Qu.: 0.00 1st Qu.:0
## Median :0 Median :0 Median : 0.00 Median :0
## Mean :0 Mean :0 Mean : 19.09 Mean :0
## 3rd Qu.:0 3rd Qu.:0 3rd Qu.: 7.00 3rd Qu.:0
## Max. :0 Max. :0 Max. :184.00 Max. :0
##
## Industry_PR Financial_PR Residential_PR Government_PR
## Min. :0.0000 Min. :0.0000 Min. :0 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0 1st Qu.:0.0000
## Median :0.0000 Median :0.0000 Median :0 Median :0.0000
## Mean :0.5455 Mean :0.1818 Mean :0 Mean :0.1818
## 3rd Qu.:0.5000 3rd Qu.:0.0000 3rd Qu.:0 3rd Qu.:0.0000
## Max. :3.0000 Max. :2.0000 Max. :0 Max. :2.0000
##
## pop_density_PR SHAPE_Area Pop CLUSTER SP_CLUSTER
## Min. :0 Min. : 494504 Min. :0 Length:11 5 :11
## 1st Qu.:0 1st Qu.:1062835 1st Qu.:0 Class :character 1 : 0
## Median :0 Median :1287387 Median :0 Mode :character 2 : 0
## Mean :0 Mean :2155523 Mean :0 3 : 0
## 3rd Qu.:0 3rd Qu.:2116806 3rd Qu.:0 4 : 0
## Max. :0 Max. :7034806 Max. :0 6 : 0
## (Other): 0
## geometry pop_density
## MULTIPOLYGON :11 Min. :0
## epsg:NA : 0 1st Qu.:0
## +proj=tmer...: 0 Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
ggplot(data=subzone_spatialcluster, aes(x=`Business_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Business_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Business_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
The most number of business functions can be found in cluster 2, with a significant amount found in cluster 1
ggplot(data=subzone_spatialcluster, aes(x=`Shopping_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Shopping_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Shopping_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`Industry_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Industry_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Industry_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`Financial_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Financial_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Financial_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`Industry_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Industry_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Industry_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
ggplot(data=subzone_spatialcluster, aes(x=`Government_PR`)) +
geom_histogram(bins=20, color="black", fill="light blue")+
facet_grid(rows = vars(SP_CLUSTER))
tm_shape(subzone_spatialcluster) +
tm_fill(col = "SP_CLUSTER",alpha=0.7) +
tm_borders(lwd=1,lty="solid") +
tm_bubbles(col="Government_PR",alpha=0.8, border.lwd = 0, border.alpha = 0.2,palette="YlOrRd",style = "jenks",size="Government_PR",size.lim=c(1))
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
We will first convert the clusters into matrix form
groups_matrix <- as.matrix(clust5$groups)
clusters <- cbind(subzone_count.std, as.factor(groups_matrix))%>%
rename(`SP_CLUSTER`=`as.factor(groups_matrix)`)
mean <- clusters%>%
group_by(SP_CLUSTER)%>%
summarise_all(funs(mean))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
mean_value <- data.frame(t(mean[-1]))
colnames(mean_value) <- mean[,1]
names(mean_value)[1] <- "Cluster1"
names(mean_value)[2] <- "Cluster2"
names(mean_value)[3] <- "Cluster3"
names(mean_value)[4] <- "Cluster4"
names(mean_value)[5] <- "Cluster5"
names(mean_value)[6] <- "Cluster6"
names(mean_value)[7] <- "Cluster7"
names(mean_value)[8] <- "Cluster8"
head(mean_value)
## Cluster1 Cluster2 Cluster3 Cluster4 Cluster5 Cluster6
## CONDO_PR 0.17588867 0 0 0 0 0
## ONE_TWO_RM_PR 0.07351903 0 0 0 0 0
## THREE_FOUR_RM_PR 0.23737864 0 0 0 0 0
## FIVE_RM_PR 0.12749622 0 0 0 0 0
## LANDED_PR 0.08260174 0 0 0 0 0
## Aged_group_PR 0.15703702 0 0 0 0 0
## Cluster7 Cluster8
## CONDO_PR 0 0
## ONE_TWO_RM_PR 0 0
## THREE_FOUR_RM_PR 0 0
## FIVE_RM_PR 0 0
## LANDED_PR 0 0
## Aged_group_PR 0 0
tm_shape(subzone_spatialcluster)+
tm_borders(alpha = 0.5)+
tm_fill(col = "SP_CLUSTER", alpha = 0.9, id = "SUBZONE_N")
## Warning: The shape subzone_spatialcluster is invalid. See sf::st_is_valid
From the interactive map plotted above, we can see that there is a high proportion of Cluster 1, which is also supported by the mean_value table above, with most of the values coming from Cluster 1.
1. Cluster 1 occupies the majority of Singapore, which includes most area except the northern part and west part of Singapore. It includes area such as Bukit Batok, Ghim Moh, Clark Quay, Kovan, and even, Changi Airport. From our earlier analysis, we have identified that Cluster 1 consists of residential areas. This explains the high proportion of Cluster 1 in the Singapore map.
Our analysis above also shows that cluster 1 has high numbers of Residential and Business urban functions, and the least number of industrial. This could be due to buildings and amenities built in these areas that are highly requented by people, and they are all widely spread across Singapore for easy accessibility of these ammenties and residential areas.
Cluster 2 is located at the west of Singapore. These areas include Tuas View Extension, Gul Circle, and Jurong Island and Bukom. Cluster 2 has a high proportion of business urban functions, as it may be where offices are held at. Since there are no residentials, the buildings there are mainly business offices.
Cluster 3 is located at the south of Singapore, and areas include Marina South and Marina East. There is a high proportion of Financial urban function in Cluster 3. This could be due to the building of the casino in Singapore, and there are banks available to allow people for easy changing or transfer of money.
Cluster 4 is the central water catchment, which has no residentials. Instead, it too has a high proportion of businesses.
Cluster 5 is the north-east part of Singapore, which includes areas such as Paya Lebar West.
Cluster 6 is the north part of Singapore, which includes Seletar Aerospace Park. There are few urban functions there, and no residential area.
Cluster 7 is a small island at the corner of Singapore, with no urban functions, and no residential area there. There are also no sign of population there.
Lastly, Cluster 8 is a small area at Toh Tuck with high proportion of business urban function.