Social area analysis will be performed on the subzones of Singapore to examine their socio-economic differences and to classify them into relatively homogeneous groups.
We begin by importing the R packages that the upcoming sections rely on. Here is a brief description of the packages used:
* The tidyverse package will be used heavily for data wrangling and for cleaning our data sets.
* The rgdal, spdep, and sf packages will be used for spatial data manipulation and analysis.
* The corrplot and tmap packages will be used for visualisation purposes.
* ClustGeo, heatmaply, and psych will be used to perform statistical analysis on spatial data.
packages = c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse', 'factoextra', 'NbClust', 'FactoMineR', 'knitr', 'tmaptools')
for (p in packages){
  if(!require(p, character.only = TRUE)){
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
The code in this section imports all the required datasets, both spatial and aspatial.
The data below is taken from www.data.gov.sg, the official government website for Singapore’s public data. The URL for the dataset is as follows:
https://data.gov.sg/dataset/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019
residentData <- read_csv("data/aspatial/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")
Several geospatial datasets are imported in this section. The function st_read is used so that each dataset is imported as an sf (simple features) object.
mpsz <- st_read(dsn="data/geospatial/master-plan-2014-subzone-boundary-no-sea-shp", layer="MP14_SUBZONE_NO_SEA_PL")
## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial/master-plan-2014-subzone-boundary-no-sea-shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string: +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
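The above code imports the subzone boundary of Singapore. As seen in the output, the layer carries a proj4string but no EPSG code, and its coordinates are already expressed in metres. The projection parameters are those of SVY21, Singapore's national projected coordinate system, whose EPSG code is 3414. Hence, we assign the EPSG code 3414 with st_set_crs, which labels the data without changing any coordinates. The st_transform call that follows leaves the coordinates as they are, because the data is already in EPSG 3414 at that point.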
mpsz <- st_set_crs(mpsz,3414)
mpsz3414 <- st_transform(mpsz,3414)
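The difference between assigning and transforming a CRS can be sketched on a single point. This is an illustration only: the objects pt_wgs84, pt_svy21, pt_unlabelled and pt_labelled are hypothetical and not part of the analysis.

```r
library(sf)

# Hypothetical point given as longitude/latitude (EPSG:4326, decimal degrees).
pt_wgs84 <- st_sfc(st_point(c(103.8333, 1.3667)), crs = 4326)

# st_transform() reprojects: the coordinates are recomputed in SVY21 metres.
pt_svy21 <- st_transform(pt_wgs84, 3414)
st_coordinates(pt_svy21)

# st_set_crs() only attaches a label: the numbers themselves are untouched.
pt_unlabelled <- st_sfc(st_point(c(28001.642, 38744.572)))  # no CRS attached
pt_labelled <- st_set_crs(pt_unlabelled, 3414)
st_coordinates(pt_labelled)
```

Assigning a CRS is therefore only appropriate when the coordinates are already in that system, as is the case for the subzone layer; data stored in another system has to be transformed.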
Next, we will import geospatial data for all the important urban functions in Singapore.
business <- st_read(dsn="data/geospatial", layer="Business")
## Reading layer `Business' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS: 4326
financial <- st_read(dsn="data/geospatial", layer="Financial")
## Reading layer `Financial' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS: 4326
govt <- st_read(dsn="data/geospatial", layer="Govt_Embassy")
## Reading layer `Govt_Embassy' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS: 4326
private <- st_read(dsn="data/geospatial", layer="Private residential")
## Reading layer `Private residential' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS: 4326
shopping <- st_read(dsn="data/geospatial", layer="Shopping")
## Reading layer `Shopping' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS: 4326
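The point locations of the various urban functions are imported above. As seen in the output, all of them use EPSG 4326 (WGS 84), a geographic coordinate system whose coordinates are longitude and latitude in decimal degrees, not metres. The subzone boundaries use EPSG 3414, the projected coordinate system for Singapore. Hence, we transform each layer into EPSG 3414 so that all layers share the same coordinate system, with coordinates in metres.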
business3414 <- st_transform(business,3414)
financial3414 <- st_transform(financial,3414)
govt3414 <- st_transform(govt,3414)
private3414 <- st_transform(private,3414)
shopping3414 <- st_transform(shopping,3414)
business3414
## Simple feature collection with 6550 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
## CRS: EPSG:3414
## First 10 features:
## POI_ID SEQ_NUM FAC_TYPE POI_NAME
## 1 1101180209 1 5000 JOHN CHEN
## 2 1101180210 1 5000 TROPICAL INDUSTRIAL BUILDING
## 3 1101180211 1 5000 LIAN CHEONG INDUSTRIAL BUILDING
## 4 1101180212 1 5000 MALAYSIA GARMENT MANUFACTURERS
## 5 1101180213 1 5000 UNIGOLD
## 6 1192316144 1 5000 NUS UNIVERSITY HALL
## 7 1144317654 1 5000 SUITES AT BUKIT TIMAH
## 8 1103507488 1 5000 TIONG HUAT
## 9 1001052867 1 5000 LEE CHOON GUAN TIMBER MERCHANT
## 10 1001052868 1 5000 WEIGHT BRIDGE SERVICE
## ST_NAME geometry
## 1 LITTLE RD POINT (33818.36 35620.16)
## 2 LITTLE RD POINT (33770.51 35610.2)
## 3 LITTLE RD POINT (33779.41 35612.41)
## 4 <NA> POINT (33802.78 35598.04)
## 5 LITTLE RD POINT (33835.06 35623.47)
## 6 LOWER KENT RIDGE RD POINT (21813.48 31063.37)
## 7 JALAN JURONG KECHIL POINT (21375.11 35831.37)
## 8 KALLANG PUDDING RD POINT (33088.33 34439.2)
## 9 PENJURU RD POINT (17103.73 33407.71)
## 10 PENJURU RD POINT (17178.3 33503.9)
financial3414
## Simple feature collection with 3320 features and 29 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
## CRS: EPSG:3414
## First 10 features:
## LINK_ID POI_ID SEQ_NUM FAC_TYPE POI_NAME POI_LANGCD
## 1 1170624361 1132324230 1 3578 UOB ENG
## 2 1112103842 1132315471 1 3578 POSB ENG
## 3 1112103842 1132315472 1 3578 UOB ENG
## 4 1112103842 1132315473 1 3578 OCBC ENG
## 5 864687596 1100784924 1 3578 OCBC ENG
## 6 902073032 1132324170 1 6000 MAYBANK ENG
## 7 778516217 1141424387 1 6000 ADPOST MONEYCHANGER ENG
## 8 880495939 1096910285 1 3578 UOB ENG
## 9 866996334 1096910292 1 3578 OCBC ENG
## 10 880495939 1096910286 1 3578 CITIBANK ENG
## POI_NMTYPE POI_ST_NUM ST_NUM_FUL ST_NFUL_LC ST_NAME ST_LANGCD
## 1 B 201 <NA> <NA> YISHUN AVE 2 ENG
## 2 B 375 <NA> <NA> COMMONWEALTH AVE ENG
## 3 B 375 <NA> <NA> COMMONWEALTH AVE ENG
## 4 B 375 <NA> <NA> COMMONWEALTH AVE ENG
## 5 B <NA> <NA> <NA> JURONG WEST ST 51 ENG
## 6 B 707 <NA> <NA> EAST COAST RD ENG
## 7 B 163 <NA> <NA> TANGLIN RD ENG
## 8 B <NA> <NA> <NA> <NA> <NA>
## 9 B 11 <NA> <NA> ARTS LINK ENG
## 10 B <NA> <NA> <NA> <NA> <NA>
## POI_ST_SD ACC_TYPE PH_NUMBER CHAIN_ID NAT_IMPORT PRIVATE IN_VICIN
## 1 L <NA> <NA> 6919 N N N
## 2 R <NA> <NA> 6918 N N N
## 3 R <NA> <NA> 6919 N N N
## 4 R <NA> <NA> 6920 N N N
## 5 R <NA> <NA> 6920 N N N
## 6 L <NA> 18006292266 3657 N N N
## 7 R <NA> 67330779 0 N N N
## 8 R <NA> <NA> 6919 N N N
## 9 R <NA> <NA> 6920 N N N
## 10 R <NA> <NA> 1165 N N N
## NUM_PARENT NUM_CHILD PERCFRREF VANCITY_ID
## 1 0 0 NA 0
## 2 0 0 NA 0
## 3 0 0 NA 0
## 4 0 0 NA 0
## 5 0 0 60 0
## 6 0 0 NA 0
## 7 1 0 50 0
## 8 0 0 20 0
## 9 0 0 NA 0
## 10 0 0 20 0
## ACT_ADDR
## 1 <NA>
## 2 <NA>
## 3 <NA>
## 4 <NA>
## 5 501 JURONG WEST STREET 51 SINGAPORE 640501
## 6 <NA>
## 7 <NA>
## 8 <NA>
## 9 <NA>
## 10 <NA>
## ACT_LANGCD ACT_ST_NAM ACT_ST_NUM ACT_ADMIN ACT_POSTAL
## 1 <NA> <NA> <NA> <NA> <NA>
## 2 <NA> <NA> <NA> <NA> <NA>
## 3 <NA> <NA> <NA> <NA> <NA>
## 4 <NA> <NA> <NA> <NA> <NA>
## 5 ENG JURONG WEST STREET 51 501 SINGAPORE 640501
## 6 <NA> <NA> <NA> <NA> <NA>
## 7 <NA> <NA> <NA> <NA> <NA>
## 8 <NA> <NA> <NA> <NA> <NA>
## 9 <NA> <NA> <NA> <NA> <NA>
## 10 <NA> <NA> <NA> <NA> <NA>
## geometry
## 1 POINT (27966.77 44304.65)
## 2 POINT (24163.96 31606.25)
## 3 POINT (24163.96 31606.25)
## 4 POINT (24163.96 31606.25)
## 5 POINT (15270.94 36919.65)
## 6 POINT (37917.26 32698.88)
## 7 POINT (26981.85 31956.75)
## 8 POINT (21205.83 30939.54)
## 9 POINT (21159.08 30673.06)
## 10 POINT (21205.83 30939.54)
govt3414
## Simple feature collection with 443 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 5177.756 ymin: 25745.76 xmax: 45262.14 ymax: 48805.09
## CRS: EPSG:3414
## First 10 features:
## POI_ID SEQ_NUM FAC_TYPE POI_NAME ST_NAME
## 1 1141424380 1 9993 CONSULATE SAN MARINO CHURCH ST
## 2 1141424404 1 9993 EMBASSY LAOS GOLDHILL PLZ
## 3 1141424402 1 9993 CONSULATE BELIZE CECIL ST
## 4 1141424338 1 9993 GENERAL CONSULATE OMAN <NA>
## 5 1192460871 1 9525 MND TOWER BLOCK MAXWELL RD
## 6 1192460819 1 9525 MND AUDITORIUM & FUNCTION HALL MAXWELL RD
## 7 1192460843 1 9525 AICARE LINK @ MAXWELL MAXWELL RD
## 8 1192460783 1 9525 HARMONY IN DIVERSITY GALLERY MAXWELL RD
## 9 1192460750 1 9525 FAMILY SUPPORT DIVISION MSF MAXWELL RD
## 10 1194224304 1 9525 LTA BEDOK CAMPUS CHAI CHEE ST
## geometry
## 1 POINT (29790.84 29540.69)
## 2 POINT (29086.35 33403.07)
## 3 POINT (29780.83 29302.96)
## 4 POINT (30723.45 31361.87)
## 5 POINT (29363.48 29016.57)
## 6 POINT (29352.36 29032.05)
## 7 POINT (29352.36 29032.05)
## 8 POINT (29352.36 29032.05)
## 9 POINT (29352.36 29032.05)
## 10 POINT (37470.93 34345.33)
private3414
## Simple feature collection with 3604 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 5316.959 ymin: 24675.4 xmax: 43760.83 ymax: 48378.23
## CRS: EPSG:3414
## First 10 features:
## POI_ID SEQ_NUM FAC_TYPE POI_NAME ST_NAME
## 1 1132324282 1 9590 MARINA BAY SERVICED APARTMENTS MARINA BLVD
## 2 1132106212 1 9590 SIN MING VILLE JALAN TODAK
## 3 1202668778 1 9590 GREENTOPS @ SIMS PLACE <NA>
## 4 1099690099 1 9590 MOUNTBATTEN DAKOTA CRESCENT DAKOTA CRES
## 5 995195128 1 9590 SINGA COURT JALAN SINGA
## 6 1176000954 1 9590 FORESQUE RESIDENCES PETIR RD
## 7 1100738877 1 9590 TIONG BAHRU COURT JALAN MEMBINA
## 8 935999454 1 9590 BIRMINGHAM MANSIONS THOMSON RD
## 9 935999453 1 9590 THOMSON EURO-ASIA THOMSON RD
## 10 1069807806 1 9590 STRATFORD COURT BEDOK RIA CRES
## geometry
## 1 POINT (30144.75 29293.01)
## 2 POINT (28238.32 37300.83)
## 3 POINT (33158.46 33189.71)
## 4 POINT (34253.58 32295.18)
## 5 POINT (36358.02 34731.2)
## 6 POINT (21556.59 39011.5)
## 7 POINT (27313.49 29646.84)
## 8 POINT (29236.59 33304.66)
## 9 POINT (29222.12 33348.89)
## 10 POINT (41169.09 34457.16)
shopping3414
## Simple feature collection with 511 features and 5 fields
## geometry type: POINT
## dimension: XY
## bbox: xmin: 10824.78 ymin: 25599.8 xmax: 42586.69 ymax: 48346.17
## CRS: EPSG:3414
## First 10 features:
## POI_ID SEQ_NUM FAC_TYPE POI_NAME
## 1 1132106213 1 6512 SIN MING CENTRE
## 2 801758392 1 6512 THE ADELPHI
## 3 842821452 1 6512 BOON LAY SHOPPING CENTRE
## 4 1193779191 1 6512 KATONG SQUARE
## 5 801758399 1 6512 SIM LIM SQUARE
## 6 1001450091 1 6512 PEOPLE'S PARK COMPLEX
## 7 1069767253 1 6512 UNITED SQUARE GOLDHILL PLAZA ENTRANCE
## 8 1069767253 2 6512 UNITED SQUARE GOLDHILL PLZ ENTRANCE
## 9 1039562724 1 6512 THE FORUM
## 10 1039562723 1 6512 WATERFRONT
## ST_NAME geometry
## 1 SIN MING RD POINT (28293.96 37316.31)
## 2 COLEMAN ST POINT (30020.1 30404.29)
## 3 BOON LAY PL POINT (14574.25 36539.3)
## 4 EAST COAST RD POINT (35876.21 31925.9)
## 5 ROCHOR CANAL RD POINT (30225.98 31749.98)
## 6 PARK RD POINT (29076.35 29667.85)
## 7 <NA> POINT (29099.71 33301.34)
## 8 <NA> POINT (29099.71 33301.34)
## 9 <NA> POINT (26574.5 26528.63)
## 10 <NA> POINT (26574.5 26528.63)
The above output shows that all the sf tables containing key urban features have been converted to EPSG 3414, which is the Singapore standard. All layers now share the same projected coordinate system.
The business dataset can be further separated into business and industry using FAC_TYPE. Industry (FAC_TYPE 9991) will include all manufacturing and other primary and secondary businesses, whereas Business (FAC_TYPE 5000) will include all the tertiary businesses.
industry3414 <- business3414 %>%
  filter(FAC_TYPE==9991)
business3414 <- business3414 %>%
  filter(FAC_TYPE==5000)
summary(industry3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :9991
## 1st Qu.:1.100e+09 1st Qu.:1.000 1st Qu.:9991
## Median :1.104e+09 Median :1.000 Median :9991
## Mean :1.075e+09 Mean :1.136 Mean :9991
## 3rd Qu.:1.139e+09 3rd Qu.:1.000 3rd Qu.:9991
## Max. :1.203e+09 Max. :2.000 Max. :9991
##
## POI_NAME ST_NAME
## TUAS TERRACE COMPLEX : 3 INTERNATIONAL BUSINESS PARK: 7
## JTC TERRACE FACTORIES TUAS S ST 5: 2 TUAS AVE 13 : 5
## TUAS BAY INDUSTRIAL CENTRE : 2 TUAS SOUTH ST 5 : 5
## TUAS ROAD TERRACE FACTORY : 2 HENDERSON RD : 3
## 115A, 115B COMMONWEALTH DRIVE : 1 TUAS RD : 3
## 512,514 CHAI CHEE LANE : 1 (Other) :74
## (Other) :99 NA's :13
## geometry
## POINT :110
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
The resident data is inspected below using the summary function, which shows the class of each column and the distribution of the numeric columns.
summary(residentData)
## planning_area subzone age_group sex
## Length:883728 Length:883728 Length:883728 Length:883728
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## type_of_dwelling resident_count year
## Length:883728 Min. : 0.00 Min. :2011
## Class :character 1st Qu.: 0.00 1st Qu.:2013
## Mode :character Median : 0.00 Median :2015
## Mean : 39.83 Mean :2015
## 3rd Qu.: 10.00 3rd Qu.:2017
## Max. :2860.00 Max. :2019
As seen above, all columns except for resident_count and year have the class character. As the median of resident_count is 0 and the third quartile (10) is well below the mean (39.83), at least half of the rows have a resident count of 0. Each row is one combination of subzone, age group, sex, type of dwelling and year, so a zero in a row does not by itself mean that the subzone has no residents. Some subzones do have no population at all: many are uninhabited (e.g. Central Catchment Area, Western Catchment Area), and subzones such as Changi Bay contain key transportation facilities of Singapore. Secondly, the data in this table covers the years 2011 to 2019. As we will be performing analysis on the latest (2019) data, we will remove the data for all the other years (2011-2018).
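A minimal sketch of the year filter described above, followed by a per-subzone total, is shown on a toy table. The table resident_toy and the output column residents are hypothetical; only the column names subzone, year and resident_count come from the summary above.

```r
library(dplyr)

# Hypothetical miniature of residentData with the same column names.
resident_toy <- tibble(
  subzone = c("A", "A", "A", "B", "B"),
  year = c(2018, 2019, 2019, 2019, 2019),
  resident_count = c(10, 20, 30, 0, 0)
)

# Keep only 2019, then add up the rows belonging to each subzone.
resident_2019 <- resident_toy %>%
  filter(year == 2019) %>%
  group_by(subzone) %>%
  summarise(residents = sum(resident_count))

resident_2019
```

Summing within each subzone is what separates a subzone with no residents at all (B in the toy data) from one that merely has some empty age, sex and dwelling combinations.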
summary(business3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :5000
## 1st Qu.:9.967e+08 1st Qu.:1.000 1st Qu.:5000
## Median :1.084e+09 Median :1.000 Median :5000
## Mean :9.919e+08 Mean :1.019 Mean :5000
## 3rd Qu.:1.108e+09 3rd Qu.:1.000 3rd Qu.:5000
## Max. :1.204e+09 Max. :3.000 Max. :5000
##
## POI_NAME ST_NAME
## CAMBRIDGE INDUSTRIAL TRUST: 8 TAGORE LN : 82
## DHL : 6 JOO KOON CIR : 80
## NATIONAL OILWELL VARCO : 6 GUL CIR : 62
## ST MICROELECTRONICS : 6 KAKI BUKIT PL : 53
## CWT : 5 KAKI BUKIT IND TER: 52
## HALLIBURTON : 5 (Other) :5845
## (Other) :6404 NA's : 266
## geometry
## POINT :6440
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
From the above summary, we can see that there are 266 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset such that it contains distinct businesses. As each business is identified by its POI_ID, we keep only the first row for each POI_ID in order to remove any duplicated data.
business3414_cleaned <- business3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(business3414_cleaned)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1 Min. :5000
## 1st Qu.:9.967e+08 1st Qu.:1 1st Qu.:5000
## Median :1.084e+09 Median :1 Median :5000
## Mean :9.930e+08 Mean :1 Mean :5000
## 3rd Qu.:1.108e+09 3rd Qu.:1 3rd Qu.:5000
## Max. :1.204e+09 Max. :1 Max. :5000
##
## POI_NAME ST_NAME
## CAMBRIDGE INDUSTRIAL TRUST: 8 TAGORE LN : 80
## DHL : 6 JOO KOON CIR : 79
## NATIONAL OILWELL VARCO : 6 GUL CIR : 62
## ST MICROELECTRONICS : 6 KAKI BUKIT IND TER: 51
## CWT : 5 KAKI BUKIT PL : 51
## HALLIBURTON : 5 (Other) :5744
## (Other) :6284 NA's : 253
## geometry
## POINT :6320
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
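The effect of the de-duplication can be sketched on a toy table. The table poi_toy and its values are hypothetical; distinct() with .keep_all = TRUE keeps the first row found for each POI_ID, which is consistent with the maximum SEQ_NUM falling from 3 to 1 in the summaries above.

```r
library(dplyr)

# Hypothetical points of interest; POI_ID 101 appears twice (two entrances).
poi_toy <- tibble(
  POI_ID = c(101, 101, 102, 103),
  SEQ_NUM = c(1, 2, 1, 1),
  POI_NAME = c("MALL ENTRANCE A", "MALL ENTRANCE B", "BANK", "FACTORY")
)

# Keep the first row for each POI_ID and retain all other columns.
poi_unique <- poi_toy %>%
  distinct(POI_ID, .keep_all = TRUE)

# Number of duplicate rows that were dropped.
nrow(poi_toy) - nrow(poi_unique)
```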
As the data is now clean, we will create a new table which assigns a subzone to each business based on its location and counts the businesses in each subzone. Before that, we create a new variable which consists only of the subzone name and geometry, which makes it easier to perform relational joins and to assign subzones.
mpsz3414_2 <- mpsz3414 %>%
  rename("subzone"=SUBZONE_N)%>%
  select(subzone,geometry)
business_by_subzone <- st_intersection(mpsz3414_2,business3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Businesses=n())
summary(business_by_subzone)
## subzone Businesses geometry
## ALEXANDRA HILL : 1 Min. : 1.00 MULTIPOINT :174
## ALEXANDRA NORTH : 1 1st Qu.: 2.00 POINT : 42
## ALJUNIED : 1 Median : 7.00 epsg:3414 : 0
## ANAK BUKIT : 1 Mean : 29.26 +proj=tmer...: 0
## ANG MO KIO TOWN CENTRE: 1 3rd Qu.: 29.00
## ANSON : 1 Max. :303.00
## (Other) :210
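The summary above lists 216 subzones, while the subzone layer has 323 features: a subzone that contains no point does not appear in the result of st_intersection. The sketch below reproduces this on toy data and shows one possible way to keep zero counts. The helper square() and the objects zones and pois are hypothetical.

```r
library(sf)
library(dplyr)

# Hypothetical helper: a square polygon with its lower-left corner at (x0, y0).
square <- function(x0, y0, size = 10) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + size, y0), c(x0 + size, y0 + size),
    c(x0, y0 + size), c(x0, y0)
  )))
}

# Three toy "subzones" and three toy points; NORTH contains no point.
zones <- st_sf(
  subzone = c("WEST", "EAST", "NORTH"),
  geometry = st_sfc(square(0, 0), square(10, 0), square(0, 10), crs = 3414)
)
pois <- st_sf(
  POI_ID = 1:3,
  geometry = st_sfc(st_point(c(2, 2)), st_point(c(5, 5)), st_point(c(15, 5)),
                    crs = 3414)
)

# Same pattern as in the text: zones without points vanish from the counts.
counts_nonzero <- st_intersection(zones, pois) %>%
  st_drop_geometry() %>%
  count(subzone, name = "Businesses")

# Alternative that keeps every zone, with 0 where no point falls inside.
zones$Businesses <- lengths(st_intersects(zones, pois))
```

Whether zero-count subzones should be kept depends on the later steps. If the counts are joined back to the full subzone layer, the missing subzones become NA and have to be replaced by 0 at that stage.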
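As we have eliminated duplicates, we will now check whether any geometry is empty. st_is_empty tests each geometry individually, whereas is_empty would only report whether the geometry column as a whole has length zero.
any(st_is_empty(business_by_subzone$geometry))
## [1] FALSE
As seen above, no geometry is empty. Hence, this dataset is ready for use.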
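A length-based check and a per-geometry check can give different answers. The sketch below uses a hypothetical geometry column, geoms, that holds one valid point and one empty point.

```r
library(sf)

# Hypothetical geometry column: one ordinary point and one empty point.
geoms <- st_sfc(st_point(c(1, 2)), st_point(), crs = 3414)

# Length-based test: FALSE as soon as the column holds at least one element.
length(geoms) == 0

# Per-geometry test: flags the second element as empty.
st_is_empty(geoms)
any(st_is_empty(geoms))
```

A length-based test such as is_empty() from the tidyverse only confirms that the column is not of length zero, so any(st_is_empty()) is the safer way to confirm that every subzone received a usable geometry.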
summary(industry3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :9991
## 1st Qu.:1.100e+09 1st Qu.:1.000 1st Qu.:9991
## Median :1.104e+09 Median :1.000 Median :9991
## Mean :1.075e+09 Mean :1.136 Mean :9991
## 3rd Qu.:1.139e+09 3rd Qu.:1.000 3rd Qu.:9991
## Max. :1.203e+09 Max. :2.000 Max. :9991
##
## POI_NAME ST_NAME
## TUAS TERRACE COMPLEX : 3 INTERNATIONAL BUSINESS PARK: 7
## JTC TERRACE FACTORIES TUAS S ST 5: 2 TUAS AVE 13 : 5
## TUAS BAY INDUSTRIAL CENTRE : 2 TUAS SOUTH ST 5 : 5
## TUAS ROAD TERRACE FACTORY : 2 HENDERSON RD : 3
## 115A, 115B COMMONWEALTH DRIVE : 1 TUAS RD : 3
## 512,514 CHAI CHEE LANE : 1 (Other) :74
## (Other) :99 NA's :13
## geometry
## POINT :110
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
As with the business data above, we keep one row per POI_ID to remove duplicated values, because each industry has a unique POI_ID.
industry3414_cleaned <- industry3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(industry3414_cleaned)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1 Min. :9991
## 1st Qu.:1.100e+09 1st Qu.:1 1st Qu.:9991
## Median :1.104e+09 Median :1 Median :9991
## Mean :1.072e+09 Mean :1 Mean :9991
## 3rd Qu.:1.139e+09 3rd Qu.:1 3rd Qu.:9991
## Max. :1.203e+09 Max. :1 Max. :9991
##
## POI_NAME ST_NAME
## 115A, 115B COMMONWEALTH DRIVE: 1 INTERNATIONAL BUSINESS PARK: 4
## 512,514 CHAI CHEE LANE : 1 TUAS AVE 13 : 3
## AIRPORT LOGISTICS PARK : 1 TUAS SOUTH ST 5 : 3
## ANG MO KIO INDUSTRIAL PARK 1 : 1 HENDERSON RD : 2
## ANG MO KIO INDUSTRIAL PARK 2 : 1 PASIR RIS IND DR 1 : 2
## ANG MO KIO INDUSTRIAL PARK 3 : 1 (Other) :69
## (Other) :89 NA's :12
## geometry
## POINT :95
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
We will now assign a subzone to each industry using the st_intersection function and count the industries in each subzone.
industry_by_subzone <- st_intersection(mpsz3414_2,industry3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Industries=n())
summary(industry_by_subzone)
## subzone Industries geometry
## ALEXANDRA HILL : 1 Min. :1.000 MULTIPOINT :21
## ALJUNIED : 1 1st Qu.:1.000 POINT :28
## BRADDELL : 1 Median :1.000 epsg:3414 : 0
## BUKIT BATOK SOUTH: 1 Mean :1.939 +proj=tmer...: 0
## BUKIT MERAH : 1 3rd Qu.:2.000
## CHANGI AIRPORT : 1 Max. :5.000
## (Other) :43
As we have eliminated duplicates, we will now check whether any geometry is empty, using st_is_empty to test each geometry individually.
any(st_is_empty(industry_by_subzone$geometry))
## [1] FALSE
As seen above, no geometry is empty. Hence, this dataset is ready for use.
summary(shopping3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :6512
## 1st Qu.:9.656e+08 1st Qu.:1.000 1st Qu.:6512
## Median :1.070e+09 Median :1.000 Median :6512
## Mean :8.934e+08 Mean :1.108 Mean :6512
## 3rd Qu.:1.104e+09 3rd Qu.:1.000 3rd Qu.:6512
## Max. :1.204e+09 Max. :3.000 Max. :6512
##
## POI_NAME ST_NAME
## BUKIT BATOK WEST SHOPPING CENTRE: 2 ORCHARD RD : 37
## CHANGE ALLEY : 2 BUKIT TIMAH RD: 7
## FARMART CENTRE : 2 SCOTTS RD : 7
## FORTUNE CENTRE : 2 BEACH RD : 6
## HARBOUR FRONT CENTRE ENTRANCE : 2 BENCOOLEN ST : 5
## NEW WORLD CENTRE : 2 (Other) :347
## (Other) :499 NA's :102
## geometry
## POINT :511
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
From the above summary, we can see that there are 102 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset such that it contains distinct shopping infrastructure in order to avoid repetitions. As each shopping infrastructure is identified by its POI_ID, we keep only the first row for each POI_ID in order to remove any duplicated data.
shopping3414_cleaned <- shopping3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(shopping3414_cleaned)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1 Min. :6512
## 1st Qu.:9.360e+08 1st Qu.:1 1st Qu.:6512
## Median :1.070e+09 Median :1 Median :6512
## Mean :8.787e+08 Mean :1 Mean :6512
## 3rd Qu.:1.105e+09 3rd Qu.:1 3rd Qu.:6512
## Max. :1.204e+09 Max. :1 Max. :6512
##
## POI_NAME ST_NAME
## BUKIT BATOK WEST SHOPPING CENTRE: 2 ORCHARD RD : 31
## CHANGE ALLEY : 2 BUKIT TIMAH RD: 7
## FARMART CENTRE : 2 BEACH RD : 6
## FORTUNE CENTRE : 2 SCOTTS RD : 6
## NEW WORLD CENTRE : 2 EAST COAST RD : 5
## SULTAN PLAZA : 2 (Other) :325
## (Other) :446 NA's : 78
## geometry
## POINT :458
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
As the data is now clean, we will create a new table which assigns a subzone to each shopping infrastructure based on its location and counts them in each subzone.
shopping_by_subzone <- st_intersection(mpsz3414_2,shopping3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Shopping_Infrastructures=n())
summary(shopping_by_subzone)
## subzone Shopping_Infrastructures geometry
## ADMIRALTY : 1 Min. : 1.000 MULTIPOINT :77
## ALEXANDRA HILL : 1 1st Qu.: 1.000 POINT :70
## ALJUNIED : 1 Median : 2.000 epsg:3414 : 0
## ANAK BUKIT : 1 Mean : 3.116 +proj=tmer...: 0
## ANG MO KIO TOWN CENTRE: 1 3rd Qu.: 3.500
## ANSON : 1 Max. :27.000
## (Other) :141
As we have eliminated duplicates, we will now check whether any geometry is empty, using st_is_empty to test each geometry individually.
any(st_is_empty(shopping_by_subzone$geometry))
## [1] FALSE
As seen above, no geometry is empty. Hence, this dataset is ready for use.
summary(govt3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :9525
## 1st Qu.:1.010e+09 1st Qu.:1.000 1st Qu.:9525
## Median :1.058e+09 Median :1.000 Median :9525
## Mean :1.006e+09 Mean :1.111 Mean :9651
## 3rd Qu.:1.113e+09 3rd Qu.:1.000 3rd Qu.:9993
## Max. :1.203e+09 Max. :2.000 Max. :9993
##
## POI_NAME ST_NAME geometry
## ANG MO KIO TOWN COUNCIL : 5 MAXWELL RD: 16 POINT :443
## SEMBAWANG-NEE SOON TOWN COUNCIL: 3 THOMSON RD: 12 epsg:3414 : 0
## ALJUNIED HOUGANG TOWN COUNCIL : 2 ORCHARD RD: 11 +proj=tmer...: 0
## ALJUNIED TOWN COUNCIL : 2 COLLEGE RD: 10
## BISHAN-TOA PAYOH TOWN COUNCIL : 2 SCOTTS RD : 8
## CENTRAL PROVIDENT FUND BOARD : 2 (Other) :358
## (Other) :427 NA's : 28
From the above summary, we can see that there are 28 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset such that it contains distinct government institutions in order to avoid repetitions. As each government institution is identified by its POI_ID, we keep only the first row for each POI_ID in order to remove any duplicated data.
govt3414_cleaned <- govt3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(govt3414_cleaned)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1 Min. :9525
## 1st Qu.:1.010e+09 1st Qu.:1 1st Qu.:9525
## Median :1.058e+09 Median :1 Median :9525
## Mean :1.002e+09 Mean :1 Mean :9662
## 3rd Qu.:1.112e+09 3rd Qu.:1 3rd Qu.:9993
## Max. :1.203e+09 Max. :1 Max. :9993
##
## POI_NAME ST_NAME
## ANG MO KIO TOWN COUNCIL : 5 MAXWELL RD : 16
## SEMBAWANG-NEE SOON TOWN COUNCIL: 3 COLLEGE RD : 10
## ALJUNIED HOUGANG TOWN COUNCIL : 2 ORCHARD RD : 10
## ALJUNIED TOWN COUNCIL : 2 THOMSON RD : 10
## BISHAN-TOA PAYOH TOWN COUNCIL : 2 NORTH BRIDGE RD: 7
## CENTRAL PROVIDENT FUND BOARD : 2 (Other) :316
## (Other) :378 NA's : 25
## geometry
## POINT :394
## epsg:3414 : 0
## +proj=tmer...: 0
##
##
##
##
As the data is now clean, we will create a new table which assigns a subzone to each government institution based on its location and counts them in each subzone.
govt_by_subzone <- st_intersection(mpsz3414_2,govt3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Govt_institutions=n())
summary(govt_by_subzone)
## subzone Govt_institutions geometry
## ALEXANDRA HILL : 1 Min. : 1.000 MULTIPOINT :67
## ALJUNIED : 1 1st Qu.: 1.000 POINT :66
## ANAK BUKIT : 1 Median : 2.000 epsg:3414 : 0
## ANG MO KIO TOWN CENTRE: 1 Mean : 2.962 +proj=tmer...: 0
## ANSON : 1 3rd Qu.: 3.000
## BALESTIER : 1 Max. :17.000
## (Other) :127
As we have eliminated duplicates, we will now check whether any geometry is empty, using st_is_empty to test each geometry individually.
any(st_is_empty(govt_by_subzone$geometry))
## [1] FALSE
As seen above, no geometry is empty. Hence, this dataset is ready for use.
summary(financial3414)
## LINK_ID POI_ID SEQ_NUM FAC_TYPE
## Min. :1.161e+08 Min. :3.644e+07 Min. :1.000 Min. :3578
## 1st Qu.:8.594e+08 1st Qu.:1.097e+09 1st Qu.:1.000 1st Qu.:3578
## Median :9.140e+08 Median :1.113e+09 Median :1.000 Median :3578
## Mean :9.092e+08 Mean :1.088e+09 Mean :1.008 Mean :4397
## 3rd Qu.:1.046e+09 3rd Qu.:1.132e+09 3rd Qu.:1.000 3rd Qu.:6000
## Max. :1.224e+09 Max. :1.204e+09 Max. :2.000 Max. :6000
##
## POI_NAME POI_LANGCD POI_NMTYPE POI_ST_NUM ST_NUM_FUL ST_NFUL_LC
## OCBC :788 ENG:3320 B:3293 1 : 212 29A : 1 ENG : 5
## UOB :577 J: 27 10 : 76 333A: 1 NA's:3315
## POSB :564 2 : 53 77B : 1
## DBS :282 11 : 50 7A : 1
## CITIBANK:153 304 : 50 8A : 1
## MAYBANK : 51 (Other):2004 NA's:3315
## (Other) :905 NA's : 875
## ST_NAME ST_LANGCD POI_ST_SD ACC_TYPE PH_NUMBER
## ORCHARD RD : 156 ENG :2926 L:1652 NA's:3320 63396666 : 52
## BEACH RD : 44 NA's: 394 N: 24 63272265 : 16
## NORTH BRIDGE RD : 39 R:1644 18002222121: 15
## COLLYER QUAY : 38 18004383333: 13
## NEW UPP CHANGI RD: 35 18001111111: 11
## (Other) :2614 (Other) : 539
## NA's : 394 NA's :2674
## CHAIN_ID NAT_IMPORT PRIVATE IN_VICIN NUM_PARENT
## Min. : 0 N:3320 N:3320 N:3320 Min. :0.0000
## 1st Qu.: 2526 1st Qu.:0.0000
## Median : 6918 Median :0.0000
## Mean : 5121 Mean :0.3807
## 3rd Qu.: 6920 3rd Qu.:1.0000
## Max. :24982 Max. :2.0000
##
## NUM_CHILD PERCFRREF VANCITY_ID
## Min. :0.0000000 Min. : 1.00 Min. :0
## 1st Qu.:0.0000000 1st Qu.:30.00 1st Qu.:0
## Median :0.0000000 Median :50.00 Median :0
## Mean :0.0003012 Mean :46.87 Mean :0
## 3rd Qu.:0.0000000 3rd Qu.:60.00 3rd Qu.:0
## Max. :1.0000000 Max. :99.00 Max. :0
## NA's :1339
## ACT_ADDR
## 1 KIM SENG PROMENADE SINGAPORE 237994: 7
## 3 TEMASEK BOULEVARD SINGAPORE 038983: 7
## 530 LORONG 6 TOA PAYOH SINGAPORE 310530: 7
## 2 JURONG EAST ST 21 SINGAPORE 609601: 6
## 3D RIVER VALLEY ROAD SINGAPORE 179023: 6
## (Other) : 243
## NA's :3044
## ACT_LANGCD ACT_ST_NAM ACT_ST_NUM ACT_ADMIN
## ENG : 276 DUNEARN ROAD : 7 1 : 20 INGAPORE : 1
## NA's:3044 KIM SENG PROMENADE: 7 3 : 11 SINGAPORE: 275
## LORONG 6 TOA PAYOH: 7 2 : 10 NA's :3044
## PAYA LEBAR ROAD : 7 50 : 8
## TEMASEK BOULEVARD : 7 530 : 7
## (Other) : 241 (Other): 220
## NA's :3044 NA's :3044
## ACT_POSTAL geometry
## 038983 : 7 POINT :3320
## 237994 : 7 epsg:3414 : 0
## 310530 : 7 +proj=tmer...: 0
## 609601 : 7
## 179023 : 6
## (Other): 242
## NA's :3044
Several variables in this dataset contain NA values. However, as our end goal is to find the number of financial institutions present in a subzone, we only need distinct locations. Each distinct location of a financial institution has a distinct POI_ID, so we keep only the first row for each POI_ID.
financial3414_cleaned <- financial3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(financial3414_cleaned)
## LINK_ID POI_ID SEQ_NUM FAC_TYPE
## Min. :1.161e+08 Min. :3.644e+07 Min. :1 Min. :3578
## 1st Qu.:8.594e+08 1st Qu.:1.097e+09 1st Qu.:1 1st Qu.:3578
## Median :9.140e+08 Median :1.113e+09 Median :1 Median :3578
## Mean :9.099e+08 Mean :1.088e+09 Mean :1 Mean :4384
## 3rd Qu.:1.046e+09 3rd Qu.:1.132e+09 3rd Qu.:1 3rd Qu.:6000
## Max. :1.224e+09 Max. :1.204e+09 Max. :1 Max. :6000
##
## POI_NAME POI_LANGCD POI_NMTYPE POI_ST_NUM ST_NUM_FUL ST_NFUL_LC
## OCBC :788 ENG:3293 B:3293 1 : 209 29A : 1 ENG : 5
## UOB :577 J: 0 10 : 74 333A: 1 NA's:3288
## POSB :564 2 : 53 77B : 1
## DBS :282 11 : 49 7A : 1
## CITIBANK:153 304 : 49 8A : 1
## MAYBANK : 51 (Other):1986 NA's:3288
## (Other) :878 NA's : 873
## ST_NAME ST_LANGCD POI_ST_SD ACC_TYPE PH_NUMBER
## ORCHARD RD : 154 ENG :2900 L:1638 NA's:3293 63396666 : 52
## BEACH RD : 44 NA's: 393 N: 24 63272265 : 16
## COLLYER QUAY : 37 R:1631 18002222121: 15
## NORTH BRIDGE RD : 37 18004383333: 13
## NEW UPP CHANGI RD: 35 18001111111: 11
## (Other) :2593 (Other) : 531
## NA's : 393 NA's :2655
## CHAIN_ID NAT_IMPORT PRIVATE IN_VICIN NUM_PARENT
## Min. : 0 N:3293 N:3293 N:3293 Min. :0.0000
## 1st Qu.: 2529 1st Qu.:0.0000
## Median : 6918 Median :0.0000
## Mean : 5160 Mean :0.3799
## 3rd Qu.: 6920 3rd Qu.:1.0000
## Max. :24982 Max. :2.0000
##
## NUM_CHILD PERCFRREF VANCITY_ID
## Min. :0.0000000 Min. : 1.00 Min. :0
## 1st Qu.:0.0000000 1st Qu.:30.00 1st Qu.:0
## Median :0.0000000 Median :50.00 Median :0
## Mean :0.0003037 Mean :46.92 Mean :0
## 3rd Qu.:0.0000000 3rd Qu.:60.00 3rd Qu.:0
## Max. :1.0000000 Max. :99.00 Max. :0
## NA's :1327
## ACT_ADDR
## 1 KIM SENG PROMENADE SINGAPORE 237994: 7
## 3 TEMASEK BOULEVARD SINGAPORE 038983: 7
## 530 LORONG 6 TOA PAYOH SINGAPORE 310530: 7
## 2 JURONG EAST ST 21 SINGAPORE 609601: 6
## 3D RIVER VALLEY ROAD SINGAPORE 179023: 6
## (Other) : 242
## NA's :3018
## ACT_LANGCD ACT_ST_NAM ACT_ST_NUM ACT_ADMIN
## ENG : 275 DUNEARN ROAD : 7 1 : 20 INGAPORE : 1
## NA's:3018 KIM SENG PROMENADE: 7 3 : 11 SINGAPORE: 274
## LORONG 6 TOA PAYOH: 7 2 : 10 NA's :3018
## TEMASEK BOULEVARD : 7 50 : 8
## JURONG EAST ST 21 : 6 530 : 7
## (Other) : 241 (Other): 219
## NA's :3018 NA's :3018
## ACT_POSTAL geometry
## 038983 : 7 POINT :3293
## 237994 : 7 epsg:3414 : 0
## 310530 : 7 +proj=tmer...: 0
## 609601 : 7
## 179023 : 6
## (Other): 241
## NA's :3018
We will now assign a subzone to each financial institution using the st_intersection function and count the institutions in each subzone.
financial_by_subzone <- st_intersection(mpsz3414_2,financial3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Financials=n())
summary(financial_by_subzone)
## subzone Financials geometry
## ADMIRALTY : 1 Min. : 1.00 MULTIPOINT :223
## ALEXANDRA HILL : 1 1st Qu.: 3.25 POINT : 27
## ALJUNIED : 1 Median : 8.00 epsg:3414 : 0
## ANAK BUKIT : 1 Mean : 13.17 +proj=tmer...: 0
## ANCHORVALE : 1 3rd Qu.: 16.00
## ANG MO KIO TOWN CENTRE: 1 Max. :132.00
## (Other) :244
As we have eliminated duplicates, we will now check whether any geometry is empty, using st_is_empty to test each geometry individually.
any(st_is_empty(financial_by_subzone$geometry))
## [1] FALSE
As seen above, no geometry is empty. Hence, this dataset is ready for use.
summary(private3414)
## POI_ID SEQ_NUM FAC_TYPE
## Min. :3.644e+07 Min. :1.000 Min. :9590
## 1st Qu.:9.968e+08 1st Qu.:1.000 1st Qu.:9590
## Median :1.070e+09 Median :1.000 Median :9590
## Mean :1.052e+09 Mean :1.007 Mean :9590
## 3rd Qu.:1.105e+09 3rd Qu.:1.000 3rd Qu.:9590
## Max. :1.204e+09 Max. :2.000 Max. :9590
##
## POI_NAME ST_NAME geometry
## BLISSFUL VIEW : 3 PASIR PANJANG RD : 45 POINT :3604
## CLEMENTI PARK : 3 UPP EAST COAST RD: 29 epsg:3414 : 0
## COMPASSVALE VIEW : 3 LOR K TELOK KURAU: 26 +proj=tmer...: 0
## KING'S MANSION : 3 BUKIT TIMAH RD : 24
## MIDPOINT PROPERTIES : 3 EAST COAST RD : 23
## NEE SOON CENTRAL ESTATE: 3 (Other) :3412
## (Other) :3586 NA's : 45
Several variables in this dataset contain NA values. However, as our end goal is to find the number of upmarket residential locations present in a subzone, we will count distinct locations by keeping one row per POI_ID, as each distinct private property location has a distinct POI_ID.
private3414_cleaned <- private3414 %>%
distinct_at(vars(POI_ID),.keep_all = TRUE)
summary(private3414_cleaned)
## POI_ID SEQ_NUM FAC_TYPE POI_NAME
## Min. :3.644e+07 Min. :1 Min. :9590 BLISSFUL VIEW : 3
## 1st Qu.:9.968e+08 1st Qu.:1 1st Qu.:9590 CLEMENTI PARK : 3
## Median :1.070e+09 Median :1 Median :9590 COMPASSVALE VIEW : 3
## Mean :1.052e+09 Mean :1 Mean :9590 KING'S MANSION : 3
## 3rd Qu.:1.105e+09 3rd Qu.:1 3rd Qu.:9590 MIDPOINT PROPERTIES : 3
## Max. :1.204e+09 Max. :1 Max. :9590 NEE SOON CENTRAL ESTATE: 3
## (Other) :3562
## ST_NAME geometry
## PASIR PANJANG RD : 45 POINT :3580
## UPP EAST COAST RD: 28 epsg:3414 : 0
## LOR K TELOK KURAU: 26 +proj=tmer...: 0
## BUKIT TIMAH RD : 24
## EAST COAST RD : 23
## (Other) :3391
## NA's : 43
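The deduplication step above keeps the first row for each POI_ID. A minimal base-R sketch of the same idea on a toy table follows; the data frame `poi` and its values are made up for illustration and are not taken from the real dataset.

```r
# Toy points-of-interest table; the values are illustrative only
poi <- data.frame(
  POI_ID   = c(101, 102, 102, 103, 101),
  POI_NAME = c("A", "B", "B (repeat)", "C", "A (repeat)"),
  stringsAsFactors = FALSE
)

# Keep the first occurrence of each POI_ID and drop later repeats
poi_unique <- poi[!duplicated(poi$POI_ID), ]

nrow(poi)         # number of rows before deduplication
nrow(poi_unique)  # number of rows after deduplication
```

Comparing the two row counts is a quick way to see how many repeated IDs were dropped.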
We will now assign a subzone to each private property location using the st_intersection method.
private_by_subzone <- st_intersection(mpsz3414_2,private3414_cleaned) %>%
group_by(subzone) %>%
summarise(Private_properties=n())
summary(private_by_subzone)
## subzone Private_properties geometry
## ADMIRALTY : 1 Min. : 1.00 MULTIPOINT :213
## ALEXANDRA HILL : 1 1st Qu.: 3.00 POINT : 26
## ALEXANDRA NORTH: 1 Median : 7.00 epsg:3414 : 0
## ALJUNIED : 1 Mean : 14.98 +proj=tmer...: 0
## ANAK BUKIT : 1 3rd Qu.: 14.50
## ANCHORVALE : 1 Max. :215.00
## (Other) :233
As we have eliminated duplicates, we will now check whether any location value is empty.
is_empty(private_by_subzone$geometry)
## [1] FALSE
As seen above, there are no empty values. Hence, we have thoroughly cleaned this dataset.
sum(complete.cases(residentData))
## [1] 883728
sum(!complete.cases(residentData))
## [1] 0
As seen above, none of the 883728 observations contains an NA value.
mpsz3414 <- mpsz3414%>%rename("subzone"=SUBZONE_N)
mpsz3414_1 <- mpsz3414 %>%
select(subzone,SHAPE_Area, geometry)%>%
mutate(SHAPE_Area=SHAPE_Area/1000000)
one <- residentData %>%
spread(age_group, resident_count) %>%
mutate(YOUNG=rowSums(.[6:9])+rowSums(.[15])) %>%
mutate(ACTIVE=rowSums(.[10:14])+rowSums(.[16:18])) %>%
mutate(AGED=rowSums(.[19:24])) %>%
select(subzone,type_of_dwelling,YOUNG,ACTIVE,AGED) %>%
group_by(subzone) %>%
summarise(YOUNG = sum(YOUNG), AGED= sum(AGED), ACTIVE = sum(ACTIVE))%>%
mutate(TOTAL=YOUNG+AGED+ACTIVE)
one$subzone <- toupper(one$subzone)
one <- left_join(one,mpsz3414_1)
one <- one %>%
mutate(DENSITY=TOTAL/SHAPE_Area)
two <- residentData %>%
spread(type_of_dwelling,resident_count)
names(two)<-str_replace_all(names(two), c(" " = "_" , "-" = "" ))
colnames(two)[11] <- "HUDC_Flats"
three <- two %>%
group_by(subzone) %>%
summarise(Condominiums_and_Other_Apartments=sum(Condominiums_and_Other_Apartments),
HDB_1_and_2Room_Flats=sum(HDB_1_and_2Room_Flats),
HDB_3Room_Flats=sum(HDB_3Room_Flats),
HDB_4Room_Flats=sum(HDB_4Room_Flats),
HDB_5Room_and_Executive_Flats= sum(HDB_5Room_and_Executive_Flats),
HUDC_Flats = sum(HUDC_Flats),
Landed_Properties = sum(Landed_Properties),
Others = sum(Others)) %>%
mutate(HDB_3_and_4Room_Flats=HDB_3Room_Flats+HDB_4Room_Flats) %>%
select(subzone,HDB_1_and_2Room_Flats,HDB_3_and_4Room_Flats,HDB_5Room_and_Executive_Flats,Condominiums_and_Other_Apartments,Landed_Properties)
three$subzone <- toupper(three$subzone)
First, we will create a base table which has the subzone name and geometry.
data_by_subzones <- mpsz3414 %>%
select(OBJECTID, subzone,geometry)
We will now convert all the sf tables into data.frame objects by removing their geometry columns. This will allow us to perform relational joins.
st_geometry(private_by_subzone) <- NULL
st_geometry(shopping_by_subzone) <- NULL
st_geometry(business_by_subzone) <- NULL
st_geometry(industry_by_subzone) <- NULL
st_geometry(govt_by_subzone) <- NULL
st_geometry(financial_by_subzone) <- NULL
one$geometry <- NULL
Now we will join all the urban function counts to this table.
data_by_subzones <- left_join(data_by_subzones,private_by_subzone)
data_by_subzones <- left_join(data_by_subzones,shopping_by_subzone)
data_by_subzones <- left_join(data_by_subzones,business_by_subzone)
data_by_subzones <- left_join(data_by_subzones,industry_by_subzone)
data_by_subzones <- left_join(data_by_subzones,govt_by_subzone)
data_by_subzones <- left_join(data_by_subzones,financial_by_subzone)
Before joining the demographic data, we will examine the data using the summary functions.
summary(data_by_subzones)
## OBJECTID subzone Private_properties
## Min. : 1.0 ADMIRALTY : 1 Min. : 1.00
## 1st Qu.: 81.5 AIRPORT ROAD : 1 1st Qu.: 3.00
## Median :162.0 ALEXANDRA HILL : 1 Median : 7.00
## Mean :162.0 ALEXANDRA NORTH: 1 Mean : 14.98
## 3rd Qu.:242.5 ALJUNIED : 1 3rd Qu.: 14.50
## Max. :323.0 ANAK BUKIT : 1 Max. :215.00
## (Other) :317 NA's :84
## Shopping_Infrastructures Businesses Industries Govt_institutions
## Min. : 1.000 Min. : 1.00 Min. :1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.00 1st Qu.:1.000 1st Qu.: 1.000
## Median : 2.000 Median : 7.00 Median :1.000 Median : 2.000
## Mean : 3.116 Mean : 29.26 Mean :1.939 Mean : 2.962
## 3rd Qu.: 3.500 3rd Qu.: 29.00 3rd Qu.:2.000 3rd Qu.: 3.000
## Max. :27.000 Max. :303.00 Max. :5.000 Max. :17.000
## NA's :176 NA's :107 NA's :274 NA's :190
## Financials geometry
## Min. : 1.00 MULTIPOLYGON :323
## 1st Qu.: 3.25 epsg:3414 : 0
## Median : 8.00 +proj=tmer...: 0
## Mean : 13.17
## 3rd Qu.: 16.00
## Max. :132.00
## NA's :73
As seen in the above output, almost all the urban function variables have NA values. This is because many subzones do not contain some of the urban functions at all. To make the data accurate, we will replace the NA values with 0. Note that we already performed an NA check while cleaning the individual dataset for each urban function, hence the NA values have arisen only from the relational joins.
data_by_subzones[is.na(data_by_subzones)]=0
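To illustrate why the joins introduce NA values and why replacing them with 0 is reasonable, here is a small base-R sketch. The tables `subzones` and `counts` are hypothetical stand-ins, and `merge(..., all.x = TRUE)` is used here as the base-R counterpart of a left join.

```r
# Base table of all subzones and a count table that covers only some of them
# (names and counts are illustrative, not taken from the real data)
subzones <- data.frame(subzone = c("A", "B", "C"), stringsAsFactors = FALSE)
counts   <- data.frame(subzone = c("A", "C"), Financials = c(7, 2),
                       stringsAsFactors = FALSE)

# Left join: every subzone is kept; subzones without a match get NA
joined <- merge(subzones, counts, by = "subzone", all.x = TRUE)

# A missing count means the subzone has no such facility, so NA becomes 0
joined$Financials[is.na(joined$Financials)] <- 0
joined
```

Subzone "B" has no row in `counts`, so it is the only one that receives NA from the join and 0 after the replacement.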
Joining demographic data
data_by_subzones <- left_join(data_by_subzones,one)
data_by_subzones <- left_join(data_by_subzones,three)
Examining the data
summary(data_by_subzones)
## OBJECTID subzone Private_properties Shopping_Infrastructures
## Min. : 1.0 Length:323 Min. : 0.00 Min. : 0.000
## 1st Qu.: 81.5 Class :character 1st Qu.: 0.00 1st Qu.: 0.000
## Median :162.0 Mode :character Median : 4.00 Median : 0.000
## Mean :162.0 Mean : 11.08 Mean : 1.418
## 3rd Qu.:242.5 3rd Qu.: 11.00 3rd Qu.: 1.000
## Max. :323.0 Max. :215.00 Max. :27.000
## Businesses Industries Govt_institutions Financials
## Min. : 0.00 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 1.0
## Median : 2.00 Median :0.0000 Median : 0.00 Median : 5.0
## Mean : 19.57 Mean :0.2941 Mean : 1.22 Mean : 10.2
## 3rd Qu.: 13.50 3rd Qu.:0.0000 3rd Qu.: 1.00 3rd Qu.: 13.0
## Max. :303.00 Max. :5.0000 Max. :17.00 Max. :132.0
## YOUNG AGED ACTIVE TOTAL
## Min. : 0 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0 1st Qu.: 0
## Median : 10740 Median : 4440 Median : 22420 Median : 36430
## Mean : 30969 Mean : 12916 Mean : 65096 Mean : 108981
## 3rd Qu.: 40065 3rd Qu.: 20880 3rd Qu.: 91505 3rd Qu.: 150475
## Max. :360610 Max. :129850 Max. :741760 Max. :1232220
## SHAPE_Area DENSITY HDB_1_and_2Room_Flats
## Min. : 0.03944 Min. : 0 Min. : 0
## 1st Qu.: 0.62826 1st Qu.: 0 1st Qu.: 0
## Median : 1.22989 Median : 41420 Median : 0
## Mean : 2.42088 Mean : 94944 Mean : 4323
## 3rd Qu.: 2.10648 3rd Qu.:176119 3rd Qu.: 3385
## Max. :69.74830 Max. :435403 Max. :48330
## HDB_3_and_4Room_Flats HDB_5Room_and_Executive_Flats
## Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0
## Mean : 53600 Mean : 29951
## 3rd Qu.: 86575 3rd Qu.: 31230
## Max. :709850 Max. :448060
## Condominiums_and_Other_Apartments Landed_Properties geometry
## Min. : 0 Min. : 0 MULTIPOLYGON :323
## 1st Qu.: 0 1st Qu.: 0 epsg:3414 : 0
## Median : 1510 Median : 0 +proj=tmer...: 0
## Mean : 13123 Mean : 6989
## 3rd Qu.: 18440 3rd Qu.: 3745
## Max. :144470 Max. :172520
As we do not require the subzone area, the object ID, or the total population for the analysis, we will remove them. The subzone names are moved to the row names.
data_by_subzones$SHAPE_Area = NULL
data_by_subzones$OBJECTID = NULL
data_by_subzones$TOTAL = NULL
rownames(data_by_subzones) <- data_by_subzones$subzone
data_by_subzones$subzone <- NULL
From the above summary, we have 15 variables attached to every subzone for analysis. Before we perform hierarchical cluster analysis, we will perform univariate analysis in order to understand the scale and spread of the data for each of the 15 variables.
However, before we start analysing each variable, we will first examine the subzones.
tm_shape(data_by_subzones)+
tm_polygons()+
tm_borders()
As seen above, all the subzones of Singapore are included. To continue with the socioeconomic analysis, we will look at a few subzones specifically to see whether any features are present in them. These subzones include water catchment areas, which predominantly consist of water bodies and forests, and islands which are disconnected from mainland Singapore. The code below removes five such island subzones from the analysis.
data_by_subzones = data_by_subzones[ !(row.names(data_by_subzones) %in% c("SUDONG","SEMAKAU", "SOUTHERN GROUP","NORTH-EASTERN ISLANDS","PULAU SELETAR")), ]
The code chunk below defines helper functions that make histograms and box plots, so that we do not have to keep repeating the code.
plot_data <- function(maindata,attribute){
return(ggplot(data=maindata,
aes_string(x= attribute)) +
geom_histogram(bins=20,
color="black",
fill="light blue"))
}
box_plot <- function(maindata,attribute){
return(ggplot(data=maindata, aes_string(x=attribute)) +
geom_boxplot(color="black", fill="light blue"))
}
Each plot is stored in its own variable by the code below.
private_plot <- plot_data(data_by_subzones,"Private_properties")
shopping_plot <- plot_data(data_by_subzones,"Shopping_Infrastructures")
business_plot <- plot_data(data_by_subzones,"Businesses")
industry_plot <- plot_data(data_by_subzones,"Industries")
govt_plot <- plot_data(data_by_subzones,"Govt_institutions")
financial_plot <- plot_data(data_by_subzones,"Financials")
young_plot <- plot_data(data_by_subzones,"YOUNG")
aged_plot <- plot_data(data_by_subzones,"AGED")
active_plot <- plot_data(data_by_subzones,"ACTIVE")
density_plot <- plot_data(data_by_subzones,"DENSITY")
HDB1_2_plot <- plot_data(data_by_subzones,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(data_by_subzones,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(data_by_subzones,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(data_by_subzones,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(data_by_subzones,"Landed_Properties")
To visualise the graphs, we arrange them in a grid and plot them.
ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
ncol = 3,
nrow = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
As seen above, all the variables are right-skewed (most values sit near zero, with a long tail towards high values) and have widely varying scales. Before deciding whether we need to standardise the data, we will plot the data using box-and-whisker plots in order to identify the outliers.
private_plot <- box_plot(data_by_subzones,"Private_properties")
shopping_plot <- box_plot(data_by_subzones,"Shopping_Infrastructures")
business_plot <- box_plot(data_by_subzones,"Businesses")
industry_plot <- box_plot(data_by_subzones,"Industries")
govt_plot <- box_plot(data_by_subzones,"Govt_institutions")
financial_plot <- box_plot(data_by_subzones,"Financials")
young_plot <- box_plot(data_by_subzones,"YOUNG")
aged_plot <- box_plot(data_by_subzones,"AGED")
active_plot <- box_plot(data_by_subzones,"ACTIVE")
density_plot <- box_plot(data_by_subzones,"DENSITY")
HDB1_2_plot <- box_plot(data_by_subzones,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- box_plot(data_by_subzones,"HDB_3_and_4Room_Flats")
HDB5_plot <- box_plot(data_by_subzones,"HDB_5Room_and_Executive_Flats")
condo_plot <- box_plot(data_by_subzones,"Condominiums_and_Other_Apartments")
landed_plot <- box_plot(data_by_subzones,"Landed_Properties")
ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
ncol = 3,
nrow = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
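The skew and outliers visible in these plots can also be checked numerically. The following is a minimal base-R sketch on a made-up vector; `x` is illustrative and is not one of the study variables.

```r
# Illustrative count variable with many small values and a few large ones
x <- c(0, 0, 0, 1, 1, 2, 3, 4, 7, 11, 40, 215)

# A mean well above the median indicates a long tail towards high values
tail_to_the_right <- mean(x) > median(x)
tail_to_the_right

# Values beyond 1.5 * IQR from the hinges, i.e. the points a box plot
# draws individually as outliers
outliers <- boxplot.stats(x)$out
outliers
```

`boxplot.stats()` applies the same 1.5 * IQR rule that `geom_boxplot()` uses by default, so it offers a quick way to list the flagged points without reading them off a plot.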
Most of the variables are right-skewed and contain multiple outliers. To perform accurate hierarchical cluster analysis, we will normalise the data using the min-max method. This method is preferred over z-scores because none of the histograms resembles a normal distribution.
Standardising the data requires our current data to be converted from an sf object to a plain data.frame. The code below preserves the spatial properties by first copying the data into a new variable, data_by_subzones_sf.
data_by_subzones_sf <- data_by_subzones
st_geometry(data_by_subzones) <- NULL
The code below standardises the data using the min-max method, which scales the data from 0 to 1.
data_by_subzones.std <- normalize(data_by_subzones)
summary(data_by_subzones.std)
## Private_properties Shopping_Infrastructures Businesses
## Min. :0.000000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.004651 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.018605 Median :0.00000 Median :0.006601
## Mean :0.052362 Mean :0.05334 Mean :0.065591
## 3rd Qu.:0.051163 3rd Qu.:0.03704 3rd Qu.:0.046205
## Max. :1.000000 Max. :1.00000 Max. :1.000000
## Industries Govt_institutions Financials YOUNG
## Min. :0.00000 Min. :0.00000 Min. :0.000000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.007576 1st Qu.:0.00000
## Median :0.00000 Median :0.00000 Median :0.037879 Median :0.03369
## Mean :0.05975 Mean :0.07288 Mean :0.078450 Mean :0.08723
## 3rd Qu.:0.00000 3rd Qu.:0.05882 3rd Qu.:0.098485 3rd Qu.:0.11727
## Max. :1.00000 Max. :1.00000 Max. :1.000000 Max. :1.00000
## AGED ACTIVE DENSITY HDB_1_and_2Room_Flats
## Min. :0.00000 Min. :0.00000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.03808 Median :0.03238 Median :0.1080 Median :0.00000
## Mean :0.10104 Mean :0.08914 Mean :0.2215 Mean :0.09085
## 3rd Qu.:0.16290 3rd Qu.:0.12559 3rd Qu.:0.4063 3rd Qu.:0.07273
## Max. :1.00000 Max. :1.00000 Max. :1.0000 Max. :1.00000
## HDB_3_and_4Room_Flats HDB_5Room_and_Executive_Flats
## Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0000 1st Qu.:0.00000
## Median :0.0000 Median :0.00000
## Mean :0.0767 Mean :0.06790
## 3rd Qu.:0.1259 3rd Qu.:0.07235
## Max. :1.0000 Max. :1.00000
## Condominiums_and_Other_Apartments Landed_Properties
## Min. :0.00000 Min. :0.00000
## 1st Qu.:0.00000 1st Qu.:0.00000
## Median :0.01135 Median :0.00000
## Mean :0.09226 Mean :0.04115
## 3rd Qu.:0.13207 3rd Qu.:0.02385
## Max. :1.00000 Max. :1.00000
As seen in the above summary, all the variables are scaled: each has a minimum value of 0 and a maximum value of 1.
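For reference, min-max scaling maps each value x to (x - min) / (max - min). The sketch below is a hand-written version for illustration only, not the implementation inside heatmaply; the helper `min_max` and the toy data frame are assumptions introduced here.

```r
# Min-max scaling: (x - min) / (max - min), mapping a column onto [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Two illustrative columns on very different scales
toy <- data.frame(Financials = c(0, 5, 13, 132),
                  DENSITY    = c(0, 41420, 176119, 435403))

# Scale each column independently
toy_std <- as.data.frame(lapply(toy, min_max))

# Every column now spans 0 to 1
sapply(toy_std, range)
```

Note that this simple helper divides by zero for a constant column, so such columns would need to be handled separately.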
In order to perform hierarchical cluster analysis, we need to ensure that our variables are not highly correlated. We would prefer a mixture of high, low, and moderate values across the variables so that our clusters are well differentiated, and highly correlated variables can hinder the cluster analysis. To examine the correlation, we will draw a correlation plot which shows the correlation coefficient for each pair of variables.
cluster_vars.cor = cor(data_by_subzones.std[,1:15])
corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "black",
tl.cex=0.5,
number.cex=0.8)
The above matrix shows the correlation coefficient for every pair of variables. We are interested in the pairs with a high correlation coefficient. If a pair of variables is highly correlated, we will eliminate one variable of the pair from our cluster analysis. The variable to be retained will be chosen for its practical usefulness or actionability. We will adopt a threshold of 0.80 to classify a pair of variables as highly correlated. We will broadly classify our variables into two categories, urban functions and the remaining population and dwelling variables, and then perform variable elimination within each.
In the first category (urban functions), no pair of variables is highly correlated, i.e. no pair has a correlation coefficient above 0.80.
In the second category, there are various variables which have correlation coefficient higher than 0.80. They are as follows:
| Var1 | Var2 | Correlation |
|---|---|---|
| YOUNG | AGED | 0.85 |
| YOUNG | ACTIVE | 0.99 |
| YOUNG | HDB 3_4 ROOM | 0.91 |
| YOUNG | HDB_5_EXEC | 0.92 |
| AGED | ACTIVE | 0.91 |
| AGED | HDB 3_4 ROOM | 0.90 |
| ACTIVE | HDB 3_4 ROOM | 0.95 |
| ACTIVE | HDB_5_EXEC | 0.88 |
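
Pairs like those in the table above can also be extracted programmatically rather than read off the plot. The following is a minimal base-R sketch on simulated data; the data frame `toy` and its columns are made up, and only the 0.80 threshold comes from the analysis.

```r
# Simulated data: b is built to track a closely, c is independent noise
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.1), c = rnorm(100))

cor_mat <- cor(toy)

# Look only at the upper triangle so that each pair is reported once
idx <- which(abs(cor_mat) > 0.80 & upper.tri(cor_mat), arr.ind = TRUE)

high_pairs <- data.frame(
  Var1        = rownames(cor_mat)[idx[, "row"]],
  Var2        = colnames(cor_mat)[idx[, "col"]],
  Correlation = round(cor_mat[idx], 2)
)
high_pairs
```

Using `upper.tri()` also excludes the diagonal, so a variable is never reported as correlated with itself.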
From the above results, we are going to eliminate the variables AGED and ACTIVE. This is because both of these variables are highly correlated with YOUNG and with HDB 3- and 4-room flats. To further understand the relationships in the data, we will perform principal component analysis (PCA). This technique reduces dimensionality and increases interpretability while minimising information loss. It does so by creating new uncorrelated variables that successively maximise variance.
res.pca <- PCA(data_by_subzones.std[,1:12], graph = FALSE)
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 80))
summary(res.pca)
##
## Call:
## PCA(X = data_by_subzones.std[, 1:12], graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 4.939 2.170 1.492 0.937 0.760 0.549 0.470
## % of var. 41.161 18.081 12.429 7.804 6.329 4.572 3.921
## Cumulative % of var. 41.161 59.243 71.672 79.476 85.806 90.378 94.299
## Dim.8 Dim.9 Dim.10 Dim.11 Dim.12
## Variance 0.307 0.215 0.112 0.048 0.002
## % of var. 2.557 1.795 0.933 0.403 0.013
## Cumulative % of var. 96.856 98.651 99.584 99.987 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2 ctr cos2
## PEOPLE'S PARK | 1.721 | -1.426 0.129 0.687 | -0.068 0.001 0.002
## BUKIT MERAH | 3.668 | -1.575 0.158 0.184 | 0.535 0.042 0.021
## CHINATOWN | 2.871 | 0.935 0.056 0.106 | 2.338 0.792 0.663
## PHILLIP | 1.969 | -1.643 0.172 0.696 | 0.294 0.013 0.022
## RAFFLES PLACE | 8.434 | -0.072 0.000 0.000 | 7.295 7.712 0.748
## CHINA SQUARE | 2.421 | -0.763 0.037 0.099 | 1.734 0.436 0.513
## TIONG BAHRU | 1.600 | 0.512 0.017 0.102 | -0.809 0.095 0.256
## BAYFRONT SUBZONE | 1.934 | -1.488 0.141 0.592 | 0.339 0.017 0.031
## TIONG BAHRU STATION | 3.396 | 2.094 0.279 0.380 | -0.344 0.017 0.010
## CLIFFORD PIER | 1.901 | -1.695 0.183 0.795 | 0.047 0.000 0.001
## Dim.3 ctr cos2
## PEOPLE'S PARK | -0.679 0.097 0.156 |
## BUKIT MERAH | 2.062 0.896 0.316 |
## CHINATOWN | 0.192 0.008 0.004 |
## PHILLIP | -0.505 0.054 0.066 |
## RAFFLES PLACE | 0.589 0.073 0.005 |
## CHINA SQUARE | -0.474 0.047 0.038 |
## TIONG BAHRU | -0.630 0.084 0.155 |
## BAYFRONT SUBZONE | -0.654 0.090 0.114 |
## TIONG BAHRU STATION | -0.652 0.090 0.037 |
## CLIFFORD PIER | -0.611 0.079 0.103 |
##
## Variables (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## Private_properties | 0.278 1.565 0.077 | 0.297 4.053 0.088 | 0.004
## Shopping_Infrastructures | 0.188 0.716 0.035 | 0.859 34.009 0.738 | -0.016
## Businesses | -0.254 1.305 0.064 | -0.014 0.010 0.000 | 0.829
## Industries | -0.127 0.324 0.016 | -0.066 0.200 0.004 | 0.863
## Govt_institutions | 0.036 0.026 0.001 | 0.777 27.842 0.604 | 0.044
## Financials | 0.422 3.602 0.178 | 0.790 28.790 0.625 | 0.081
## YOUNG | 0.927 17.396 0.859 | -0.135 0.835 0.018 | 0.083
## AGED | 0.950 18.256 0.902 | -0.050 0.116 0.003 | 0.103
## ACTIVE | 0.959 18.624 0.920 | -0.119 0.655 0.014 | 0.093
## DENSITY | 0.811 13.326 0.658 | -0.237 2.579 0.056 | -0.108
## ctr cos2
## Private_properties 0.001 0.000 |
## Shopping_Infrastructures 0.016 0.000 |
## Businesses 46.098 0.688 |
## Industries 49.942 0.745 |
## Govt_institutions 0.130 0.002 |
## Financials 0.442 0.007 |
## YOUNG 0.461 0.007 |
## AGED 0.706 0.011 |
## ACTIVE 0.582 0.009 |
## DENSITY 0.789 0.012 |
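The percentage and cumulative percentage rows of the eigenvalue table above follow directly from the eigenvalues. As a small worked check in base R, using the values copied from that table (the variable names are chosen here for illustration):

```r
# Eigenvalues copied from the PCA summary above
eig <- c(4.939, 2.170, 1.492, 0.937, 0.760, 0.549,
         0.470, 0.307, 0.215, 0.112, 0.048, 0.002)

pct_var <- 100 * eig / sum(eig)   # share of variance per component
cum_var <- cumsum(pct_var)        # running total across components

round(cum_var[1:5], 1)

# Smallest number of components whose cumulative share reaches 80%
n_comp <- which(cum_var >= 80)[1]
n_comp
```

Because the table reports eigenvalues rounded to three decimals, the recomputed percentages can differ from the printed ones in the last digit.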
From the results, the first five principal components capture about 86% of the variability in the variables (the first four capture about 79%). The first principal component is the direction in space along which the variance is greatest, after which the variance explained decreases with each successive component. To understand which variables contribute to each principal component, we will plot the contribution of the variables to each component.
# Extract the results for variables
var <- get_pca_var(res.pca)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)
# Contributions of variables to PC3, PC4 and PC5
fviz_contrib(res.pca, choice = "var", axes = 3, top = 10)
fviz_contrib(res.pca, choice = "var", axes = 4, top = 10)
fviz_contrib(res.pca, choice = "var", axes = 5, top = 10)
# Colour the variables by their contributions to the principal axes
fviz_pca_var(res.pca, col.var="contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
) + theme_minimal() + ggtitle("Variables - PCA")
The first principal component is strongly correlated with five of the original variables: it increases with increasing ACTIVE, AGED, YOUNG, DENSITY, and HDB_3_and_4Room_Flats scores. This suggests that these five variables vary together; if one increases, the others tend to increase as well. Hence, in line with our correlation analysis, we will eliminate ACTIVE, AGED, and HDB_3_and_4Room_Flats, as their information is already captured by the other variables, i.e. YOUNG and DENSITY.
In principal components 2 to 5, only one or two variables contribute substantially to the variation, and these variables were not found to be highly correlated in our correlation analysis. Hence, we will retain all of them.
cluster_vars.std <- data_by_subzones.std %>%
select("Private_properties", "Shopping_Infrastructures","Businesses","Industries" ,"Govt_institutions", "Financials", "ACTIVE","HDB_1_and_2Room_Flats", "HDB_5Room_and_Executive_Flats", "DENSITY", "Condominiums_and_Other_Apartments" , "Landed_Properties")
In order to perform clustering, we first need to define a proximity matrix. The proximity matrix holds a measure of dissimilarity between every pair of subzones. The dissimilarity will be measured by Euclidean distance, which is the straight-line distance between two points in the space of the clustering variables.
### Calculating the proximity matrix
proxmat <- dist(cluster_vars.std, method = 'euclidean')
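To make the distance calculation concrete, here is a small base-R sketch on a made-up matrix. The three rows stand in for subzones and the two columns for standardised variables; none of the values come from the real data.

```r
# Three illustrative "subzones" described by two standardised variables
toy <- matrix(c(0.0, 0.0,
                0.3, 0.4,
                1.0, 1.0),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("v1", "v2")))

# Pairwise straight-line distances between rows (observations)
d <- dist(toy, method = "euclidean")
d

# The A-B entry matches the formula sqrt(sum((a - b)^2))
ab_by_hand <- sqrt(sum((toy["A", ] - toy["B", ])^2))
ab_by_hand
```

`dist()` works on rows, so each entry of the result describes a pair of subzones rather than a pair of variables.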
The hierarchical clustering algorithm will separate the subzones into clusters based on their measure of similarity. Clustering will allow us to group subzones based on their socioeconomic characteristics. The analysis seeks a set of groups that both minimises within-group variation and maximises between-group variation, which ensures that we get distinct clusters. We will be using agglomerative hierarchical clustering, which is a bottom-up approach: each subzone starts as its own cluster, and clusters are iteratively merged until all subzones belong to one big cluster. There are various methods to merge these clusters. They are:
(1) Using average distance between two clusters
(2) Calculating the maximum distance between the points of the two clusters, i.e. using the distance between the two furthest points
(3) Calculating the minimum distance between the points of the two clusters, i.e. using the distance between the two closest points
(4) Using Ward’s method which merges two clusters in order to reduce within cluster variance
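The four merge criteria listed above can be compared on a tiny example. The sketch below uses two made-up one-dimensional clusters; `cluster_a`, `cluster_b`, and the helper `wss` are illustrative names, and the Ward cost is computed directly as the increase in within-cluster sum of squares rather than through any clustering library.

```r
# Two small illustrative clusters of points on a line
cluster_a <- c(1, 2)
cluster_b <- c(5, 9)

# All pairwise distances between a point in A and a point in B
pair_d <- abs(outer(cluster_a, cluster_b, "-"))

linkage <- c(average  = mean(pair_d),  # (1) average of all pairwise distances
             complete = max(pair_d),   # (2) distance between the furthest pair
             single   = min(pair_d))   # (3) distance between the closest pair
linkage

# (4) Ward: increase in within-cluster sum of squares if A and B are merged
wss <- function(x) sum((x - mean(x))^2)
ward_cost <- wss(c(cluster_a, cluster_b)) - wss(cluster_a) - wss(cluster_b)
ward_cost
```

At each step the algorithm merges the pair of clusters with the smallest value of whichever criterion is in use.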
In order to decide the best method for our case study, we will calculate the agglomerative coefficient, which measures the amount of clustering structure found. The method with the highest coefficient will be chosen.
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
ac <- function(x) {
agnes(cluster_vars.std, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.8961083 0.8223974 0.9230676 0.9754843
From the above output, it is evident that Ward’s method has the highest agglomerative coefficient, 0.975. Ward’s method is also preferred for this analysis because it minimises the pooled within-group sum of squares.
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.5)
As we have 318 subzones, the names of the subzones are not legible. However, that is not important right now, as we can visualise the clusters on a map later. The most important thing to read from the dendrogram is the height at which clusters are merged. If we look at the 2nd merge from the top, it is evident that there is a significant difference in height between the first two merges. However, there is not much difference in height between the third and fourth merges. This may indicate that the difference between our clusters might not be significant. We will examine this by dividing the dendrogram into different clusters and analysing the clusters using their means and standard deviations.
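Dividing a dendrogram into clusters and summarising each cluster can be sketched in base R as follows. The data are simulated, and `k = 2` is an assumption chosen to suit this toy data; it is not a recommendation for the subzone data.

```r
# Illustrative data: two well-separated groups in two standardised variables
set.seed(42)
toy <- data.frame(
  v1 = c(rnorm(10, mean = 0.2, sd = 0.05), rnorm(10, mean = 0.8, sd = 0.05)),
  v2 = c(rnorm(10, mean = 0.1, sd = 0.05), rnorm(10, mean = 0.9, sd = 0.05))
)

hc <- hclust(dist(toy), method = "ward.D")

# Cut the dendrogram into k groups; k = 2 is an assumption for this toy data
groups <- cutree(hc, k = 2)
table(groups)

# Per-cluster mean and standard deviation of each variable
cluster_means <- aggregate(toy, by = list(cluster = groups), FUN = mean)
cluster_sds   <- aggregate(toy, by = list(cluster = groups), FUN = sd)
cluster_means
cluster_sds
```

Clusters whose means differ by much more than their standard deviations are well separated on that variable; clusters whose means lie within a standard deviation of each other are not.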
This raises an important question of determining the number of clusters we need to split into.
There are various indices which estimate the number of clusters into which we should split the data. However, each index determines the number of clusters from quantities such as standard deviation, mean, covariance, etc., giving a different weight to each of these components. In order to get an aggregated result, we will use the NbClust() function from the NbClust library, which provides 30 indices for determining the number of clusters and proposes the best clustering scheme from the results obtained by varying the number of clusters, the distance measure, and the clustering method.
NbClust(data = cluster_vars.std, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = "ward.D", index = "all", alphaBeale = 0.05)
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 9 proposed 2 as the best number of clusters
## * 2 proposed 3 as the best number of clusters
## * 5 proposed 4 as the best number of clusters
## * 4 proposed 7 as the best number of clusters
## * 3 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 12.9659 122.7496 28.3866 7.0875 912.5539 10485639.24 42.0575 66.2066
## 3 0.1832 80.8249 52.8436 4.7288 1204.1994 9429213.26 39.5372 60.7494
## 4 1.7284 80.2815 34.7138 7.7734 1580.9506 5126507.33 25.9030 52.0223
## 5 0.6739 75.3055 47.0608 9.9170 1887.5373 3054459.95 20.9926 46.8436
## 6 1.6459 78.4630 32.3751 14.9311 2254.9997 1385004.75 13.3250 40.7210
## 7 7.4687 77.3180 10.2581 17.2964 2558.8884 724973.76 11.3338 36.8928
## 8 0.1465 69.6991 29.9237 16.8726 2647.0389 717659.57 10.8632 35.7147
## 9 1.2214 70.3863 26.1096 19.9447 2946.3683 354346.07 9.0213 32.5707
## 10 2.2503 70.5243 14.3309 22.5930 3157.2467 225395.22 7.3340 30.0330
## 11 1.7842 67.6379 9.8179 23.2318 3320.3402 163302.33 6.6652 28.6978
## 12 0.3384 64.1384 20.6391 23.1862 3458.0682 126029.89 6.3056 27.8085
## 13 2.1344 64.2682 11.6995 25.1841 3632.3530 85501.95 5.3496 26.0513
## 14 1.1839 62.2952 10.3160 25.6776 3742.0240 70237.16 4.9035 25.0890
## 15 0.4275 60.3462 19.9851 26.0082 3849.6212 57484.03 4.5407 24.2655
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 28.7443 1.8928 0.2467 1.5025 0.3501 0.8761 30.5466 1.1495 0.2137
## 3 31.1718 2.0628 0.2058 2.2907 0.2765 0.7372 27.0984 2.8738 0.2524
## 4 35.3081 2.4088 0.1836 1.8282 0.3026 0.7663 29.8854 2.4650 0.2785
## 5 39.1740 2.6752 0.1638 1.7282 0.3174 0.6374 22.7562 4.5322 0.2673
## 6 44.0604 3.0774 0.2043 1.3714 0.3354 0.6950 35.1103 3.5395 0.2728
## 7 52.4524 3.3967 0.2103 1.2874 0.3379 0.7813 38.6377 2.2698 0.2661
## 8 54.9915 3.5087 0.2192 1.4428 0.1856 0.5858 24.0355 5.6077 0.2530
## 9 58.2626 3.8474 0.2086 1.4037 0.2086 0.5470 14.9045 6.4056 0.2450
## 10 60.7132 4.1725 0.2053 1.3549 0.2186 0.8149 16.1230 1.8286 0.2448
## 11 63.6033 4.3667 0.1856 1.5095 0.1918 0.7239 24.0295 3.0659 0.2352
## 12 65.3967 4.5063 0.1775 1.4736 0.1985 0.6035 10.5140 5.0503 0.2284
## 13 69.3474 4.8103 0.1727 1.4318 0.2080 0.7126 16.1326 3.2131 0.2242
## 14 73.8113 4.9948 0.1643 1.4075 0.2061 0.3957 29.0122 11.8454 0.2170
## 15 75.3594 5.1643 0.1617 1.3566 0.2115 0.3657 34.6881 13.4884 0.2111
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 33.1033 0.4216 0.5832 0.4891 0.0732 0.0203 12.6076 0.3812 0.9413
## 3 20.2498 0.4564 -0.1643 1.0560 0.0278 0.0237 15.5367 0.3571 0.8957
## 4 13.0056 0.5210 -0.0342 1.0669 0.0278 0.0264 15.6966 0.3350 1.0203
## 5 9.3687 0.5647 -0.1147 1.0907 0.0278 0.0303 14.8211 0.3198 0.9336
## 6 6.7868 0.5867 -0.0056 1.0744 0.0366 0.0314 13.5959 0.3045 0.9242
## 7 5.2704 0.6068 -4.4268 1.0701 0.0403 0.0352 13.1561 0.2929 0.8518
## 8 4.4643 0.4393 -0.0060 2.1271 0.0403 0.0366 19.0603 0.2808 0.7645
## 9 3.6190 0.4498 -0.0670 2.1162 0.0403 0.0378 18.3985 0.2685 0.7016
## 10 3.0033 0.4547 0.2763 2.0984 0.0404 0.0389 18.5205 0.2611 0.6738
## 11 2.6089 0.4451 0.5511 2.3430 0.0404 0.0404 18.3223 0.2537 0.6373
## 12 2.3174 0.4274 -0.0226 2.6066 0.0404 0.0410 18.5608 0.2472 0.5969
## 13 2.0039 0.4318 0.2063 2.5736 0.0404 0.0412 17.9178 0.2408 0.5761
## 14 1.7921 0.4282 0.1611 2.6543 0.0404 0.0421 17.6560 0.2353 0.5425
## 15 1.6177 0.4277 -0.0094 2.6695 0.0404 0.0426 17.0168 0.2294 0.5003
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.8615 34.7229 0.3147
## 3 0.8041 18.5143 0.0007
## 4 0.8208 21.3960 0.0035
## 5 0.7523 13.1707 0.0000
## 6 0.8076 19.0542 0.0000
## 7 0.8403 26.2208 0.0075
## 8 0.7367 12.1519 0.0000
## 9 0.6649 9.0730 0.0000
## 10 0.7993 17.8276 0.0399
## 11 0.7905 16.6986 0.0003
## 12 0.6496 8.6302 0.0000
## 13 0.7523 13.1707 0.0002
## 14 0.6717 9.2879 0.0000
## 15 0.6780 9.4987 0.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 2.0000 2.0000 3.000 15.0000 4.0000 4 4.0000
## Value_Index 12.9659 122.7496 24.457 26.0082 376.7512 2230659 13.6342
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 4.0000 7.000 7.0000 15.0000 7.0000 2.0000 2.0000
## Value_Index 3.5484 8.392 -0.2073 0.1617 1.2874 0.3501 0.8761
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 4.0000 3.0000 7.0000 1 2.0000
## Value_Index 30.5466 1.1495 0.2785 12.8535 0.6068 NA 0.4891
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 2.0000 0 2.0000 0 15.0000
## Value_Index 0.0732 0 12.6076 0 0.5003
##
## $Best.partition
## PEOPLE'S PARK BUKIT MERAH
## 1 1
## CHINATOWN PHILLIP
## 2 1
## RAFFLES PLACE CHINA SQUARE
## 1 1
## TIONG BAHRU BAYFRONT SUBZONE
## 2 1
## TIONG BAHRU STATION CLIFFORD PIER
## 2 1
## MARINA SOUTH PEARL'S HILL
## 1 2
## BOAT QUAY HENDERSON HILL
## 1 2
## REDHILL ALEXANDRA HILL
## 2 2
## BUKIT HO SWEE CLARKE QUAY
## 2 1
## TELOK BLANGAH RISE TANJONG PAGAR
## 2 1
## EVERTON PARK TELOK BLANGAH WAY
## 1 2
## MAXWELL CECIL
## 1 1
## KAMPONG TIONG BAHRU TELOK BLANGAH DRIVE
## 2 1
## PASIR PANJANG 2 PASIR PANJANG 1
## 1 1
## QUEENSWAY KENT RIDGE
## 1 1
## ALEXANDRA NORTH MARINA EAST
## 1 1
## INSTITUTION HILL ROBERTSON QUAY
## 1 1
## JURONG ISLAND AND BUKOM SENTOSA
## 1 1
## CITY TERMINALS ANSON
## 1 1
## STRAITS VIEW MARITIME SQUARE
## 1 1
## CENTRAL SUBZONE SINGAPORE GENERAL HOSPITAL
## 1 1
## DEPOT ROAD EAST COAST
## 1 1
## NATIONAL UNIVERSITY OF S'PORE ONE TREE HILL
## 1 1
## ROCHOR CANAL CRAWFORD
## 1 2
## MARGARET DRIVE TANGLIN
## 1 1
## MARINE PARADE TANGLIN HALT
## 2 2
## MACKENZIE SUNGEI ROAD
## 1 2
## ONE NORTH TANJONG RHU
## 1 1
## MOUNTBATTEN COMMONWEALTH
## 1 2
## DOVER BOULEVARD
## 1 1
## ISTANA NEGARA LITTLE INDIA
## 1 1
## GUL BASIN RIDOUT
## 1 1
## CAIRNHILL CLEMENTI WEST
## 1 2
## TUAS VIEW EXTENSION MONK'S HILL
## 1 1
## SIGLAP CLEMENTI WOODS
## 1 1
## FORT CANNING MARINA EAST (MP)
## 1 1
## MARINA CENTRE SOMERSET
## 1 1
## BENCOOLEN CHATSWORTH
## 1 1
## PIONEER SECTOR PENJURU CRESCENT
## 1 1
## ORANGE GROVE KAMPONG BUGIS
## 1 1
## KAMPONG GLAM SELEGIE
## 1 1
## MOUNT EMILY JOO KOON
## 1 1
## KALLANG WAY INTERNATIONAL BUSINESS PARK
## 1 1
## TUKANG CORONATION ROAD
## 1 1
## KEMBANGAN KATONG
## 1 1
## HOLLAND DRIVE FARRER PARK
## 2 1
## NEWTON CIRCUS JURONG PORT
## 1 1
## SAMULUN SHIPYARD
## 1 1
## GHIM MOH LAVENDER
## 2 2
## GOODWOOD PARK PANDAN
## 1 1
## SINGAPORE POLYTECHNIC CLEMENTI CENTRAL
## 1 1
## KAMPONG JAVA BOON KENG
## 1 2
## KALLANG BAHRU ULU PANDAN
## 1 1
## FARRER COURT TUAS VIEW
## 1 1
## NASSIM WEST COAST
## 1 1
## BAYSHORE BENOI SECTOR
## 1 1
## GUL CIRCLE ALJUNIED
## 1 2
## TYERSALL MOULMEIN
## 1 1
## LIU FANG FRANKEL
## 1 1
## CLEMENTI NORTH BRAS BASAH
## 2 1
## OXLEY CITY HALL
## 1 1
## MEI CHIN LEONIE HILL
## 2 1
## PORT DHOBY GHAUT
## 1 1
## BUGIS VICTORIA
## 1 1
## PATERSON TUAS BAY
## 1 1
## LEEDON PARK GEYLANG EAST
## 1 2
## TEBAN GARDENS JURONG RIVER
## 1 1
## GEYLANG BAHRU FABER
## 2 1
## MALCOLM BEDOK SOUTH
## 1 1
## TAMPINES EAST KAKI BUKIT
## 2 1
## YUHUA EAST BUKIT BATOK SOUTH
## 2 1
## JURONG WEST CENTRAL BEDOK RESERVOIR
## 2 1
## ANAK BUKIT SWISS CLUB
## 1 1
## XILIN SIMEI
## 1 1
## BOON LAY PLACE BUKIT BATOK EAST
## 2 2
## BUKIT BATOK WEST BUKIT BATOK CENTRAL
## 2 2
## UPPER PAYA LEBAR TAI SENG
## 2 1
## TENGEH YUHUA WEST
## 1 2
## YUNNAN LORONG CHUAN
## 2 1
## HONG KAH TUAS PROMENADE
## 2 1
## AIRPORT ROAD SERANGOON CENTRAL
## 1 2
## BISHAN EAST TAMPINES WEST
## 2 2
## BRICKWORKS DUNEARN
## 1 1
## SUNSET WAY MACPHERSON
## 1 2
## KIM KEAT BEDOK NORTH
## 2 2
## TOA PAYOH CENTRAL JURONG GATEWAY
## 2 1
## HOLLAND ROAD KAMPONG UBI
## 1 1
## SENNETT POTONG PASIR
## 1 2
## BENDEMEER BALESTIER
## 2 2
## JOO SENG CHIN BEE
## 1 1
## LORONG 8 TOA PAYOH TOH GUAN
## 2 2
## BRADDELL BIDADARI
## 2 1
## WOODLEIGH TAMAN JURONG
## 1 2
## LAKESIDE TOA PAYOH WEST
## 1 1
## DEFU INDUSTRIAL PARK GUILIN
## 1 1
## MARYMOUNT WENYA
## 2 1
## NATURE RESERVE HILLVIEW
## 1 1
## CHANGI BAY PAYA LEBAR EAST
## 1 1
## UPPER THOMSON HONG KAH NORTH
## 1 2
## TOWNSVILLE KOVAN
## 2 1
## CHONG BOON SHANGRI-LA
## 2 2
## SERANGOON GARDEN HOUGANG CENTRAL
## 1 1
## LOYANG EAST DAIRY FARM
## 1 1
## PASIR RIS DRIVE TAMPINES NORTH
## 2 1
## CHENG SAN ANG MO KIO TOWN CENTRE
## 2 1
## KEBUN BAHRU SERANGOON NORTH IND ESTATE
## 2 1
## TENGAH SERANGOON NORTH
## 1 2
## PASIR RIS CENTRAL GOMBAK
## 2 1
## PLAB PAYA LEBAR NORTH
## 1 1
## HOUGANG EAST LORONG HALUS
## 2 1
## KANGKAR SEMBAWANG HILLS
## 2 1
## JELEBU KEAT HONG
## 2 2
## HOUGANG WEST PAYA LEBAR WEST
## 2 1
## BANGKIT LORONG HALUS NORTH
## 2 1
## PENG SIANG PASIR RIS WEST
## 2 2
## YIO CHU KANG WEST TRAFALGAR
## 2 2
## TECK WHYE TUAS NORTH
## 2 1
## PEI CHUN BOON TECK
## 2 2
## KIAN TECK SAFTI
## 1 1
## TOH TUCK MOUNT PLEASANT
## 1 1
## HILLCREST SAUJANA
## 1 2
## SELETAR HILLS COMPASSVALE
## 1 2
## YIO CHU KANG EAST YIO CHU KANG
## 1 1
## LOYANG WEST TAGORE
## 1 1
## LORONG AH SOO FLORA DRIVE
## 2 1
## CHOA CHU KANG CENTRAL CHANGI WEST
## 2 1
## FAJAR SENJA
## 2 2
## WATERWAY EAST GALI BATU
## 2 1
## SPRINGLEAF PUNGGOL TOWN CENTRE
## 1 1
## NEE SOON LOWER SELETAR
## 1 1
## NORTHSHORE MANDAI ESTATE
## 1 1
## YISHUN CENTRAL PULAU PUNGGOL TIMOR
## 1 1
## TURF CLUB WOODLANDS SOUTH
## 1 2
## WOODGROVE YISHUN EAST
## 2 2
## WESTERN WATER CATCHMENT PULAU PUNGGOL BARAT
## 1 1
## YISHUN WEST WOODLANDS REGIONAL CENTRE
## 2 1
## MANDAI EAST SIMPANG SOUTH
## 1 1
## NORTHLAND MIDVIEW
## 2 2
## WOODLANDS WEST SEMBAWANG SPRINGS
## 2 1
## KRANJI RESERVOIR VIEW
## 1 1
## WOODLANDS EAST SEMBAWANG CENTRAL
## 2 2
## GREENWOOD PARK SEMBAWANG EAST
## 1 1
## SENOKO WEST PASIR RIS PARK
## 1 1
## CHOA CHU KANG NORTH RIVERVALE
## 2 2
## CHANGI AIRPORT YIO CHU KANG NORTH
## 1 1
## PUNGGOL CANAL CENTRAL WATER CATCHMENT
## 1 1
## SELETAR ADMIRALTY
## 1 1
## LIM CHU KANG SIMPANG NORTH
## 1 1
## SENOKO SOUTH SEMBAWANG NORTH
## 1 2
## TANJONG IRAU PANG SUA
## 1 1
## SELETAR AEROSPACE PARK KHATIB
## 1 1
## MANDAI WEST CONEY ISLAND
## 1 1
## YISHUN SOUTH THE WHARVES
## 2 1
## SENOKO NORTH CHANGI POINT
## 1 1
## SENGKANG TOWN CENTRE ANCHORVALE
## 2 2
## SENGKANG WEST FERNVALE
## 1 2
## PUNGGOL FIELD YEW TEE
## 2 2
## PASIR RIS WAFER FAB PARK MATILDA
## 1 2
## NORTH COAST SEMBAWANG STRAITS
## 1 1
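As seen in $Best.nc above, nine indices favour 2 clusters and five favour 4, which is the next most common suggestion. We are going to go ahead with four clusters, which gives a more detailed segmentation than a two-way split, and divide the dendrogram accordingly.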
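One way to check how the indices voted is to tally the Number_clusters row of $Best.nc. The sketch below is only illustrative: the values are transcribed by hand from the output above, and Hubert and Dindex are left out because they are graphical methods reported as 0.

```r
# Number of clusters recommended by each index, copied from $Best.nc above.
best_k <- c(KL = 2, CH = 2, Hartigan = 3, CCC = 15, Scott = 4, Marriot = 4,
            TrCovW = 4, TraceW = 4, Friedman = 7, Rubin = 7, Cindex = 15,
            DB = 7, Silhouette = 2, Duda = 2, PseudoT2 = 2, Beale = 2,
            Ratkowsky = 4, Ball = 3, PtBiserial = 7, Frey = 1, McClain = 2,
            Dunn = 2, SDindex = 2, SDbw = 15)

# Count how many indices recommend each number of clusters.
votes <- table(best_k)
votes
```

With a live NbClust result, the same tally could be taken directly from the `Number_clusters` row of its `Best.nc` element.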
hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.5)
rect.hclust(hclust_ward, k = 4, border = "red")
hclust_ward
##
## Call:
## hclust(d = proxmat, method = "ward.D")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 318
As seen in the above output, the dendrogram is divided into four clusters, marked by the red boxes. As there are many subzones, the subzone names are not legible on the dendrogram, so we will analyse the clusters by visualising them on a map instead. Before we conduct the final analysis, we will also plot a heatmap to see how each variable contributes to the clusters.
A heatmap is a useful tool for understanding how the clusters are formed, as it shows each variable individually. Because there are so many subzones, the static heatmap is hard to read. It is interactive, however, so it can be zoomed and exact values can be read off where needed.
heatmap <- data.matrix(cluster_vars.std)
heatmaply(heatmap,
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Blues,
k_row = 4,
margins = c(NA,200,60,NA),
fontsize_row = 3,
fontsize_col = 5,
main="Geographic Segmentation of Singapore Subzones by Demographic and Urban Indicators",
xlab = "Demographic and Urban Indicators",
ylab = "Subzones of Singapore"
)
We will analyse each cluster from the heatmap, after we plot the map representing the clusters. This will allow analysis to be more coherent.
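The cluster labels used for the map come from cutting the tree at a chosen number of groups with cutree(). A minimal sketch on toy data (the matrix `toy` and the `zone_` names are hypothetical, not the subzone table) shows the shape of the result:

```r
set.seed(42)                        # toy data, not the subzone table
toy <- matrix(rnorm(60), ncol = 3)  # 20 observations, 3 variables
rownames(toy) <- paste0("zone_", 1:20)

# Same distance and linkage settings as in the analysis.
toy_tree <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# One integer label per observation, named by row.
toy_groups <- cutree(toy_tree, k = 4)
table(toy_groups)
```

Because the labels come back in the same order as the rows of the input, they can be attached to the original table as a new column.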
tmap_mode("plot")
groups <- as.factor(cutree(hclust_ward, k=4))
data_by_subzones_sf$CLUSTER <- groups
tm_shape(data_by_subzones_sf)+
tm_polygons("CLUSTER",
palette="Set3")
The four clusters are clearly visible in the map above. To analyse them, we will plot the mean value of each socio-economic factor for every cluster and compare them. These plots will be used in tandem with the heatmap plotted in section 6.5.
A bar chart will be plotted for each variable in order to find the similarities and differences between the clusters.
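The values behind each plot are simply per-cluster means. A small sketch with base R's aggregate() on a toy data frame (the column names follow the analysis, the numbers are made up for illustration):

```r
# Toy standardised values for four subzones in two clusters.
toy <- data.frame(
  DENSITY = c(0.1, 0.3, 0.8, 1.0),
  AGED    = c(0.2, 0.4, 0.6, 0.8),
  CLUSTER = factor(c(1, 1, 2, 2))
)

# Mean of each variable within each cluster.
cluster_means <- aggregate(toy[, c("DENSITY", "AGED")],
                           by = list(CLUSTER = toy$CLUSTER),
                           FUN = mean)
cluster_means
```

Selecting only the numeric columns before aggregating avoids taking the mean of the factor column itself.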
data_by_subzones.std$CLUSTER <- groups
aggregate <- aggregate(data_by_subzones.std,by= list(data_by_subzones.std$CLUSTER),FUN = "mean")
aggregate$CLUSTER <- NULL
aggregate <- aggregate %>%
rename("CLUSTER"=Group.1)
plot_data <- function(maindata,attribute){
return(ggplot(maindata, aes_string(x="CLUSTER",y=attribute, fill = "CLUSTER")) +
geom_bar(stat="identity", position = "dodge",size=0.5) +
theme(legend.position = 'none')+
scale_fill_brewer(palette = "Set3"))}
private_plot <- plot_data(aggregate,"Private_properties")
shopping_plot <- plot_data(aggregate,"Shopping_Infrastructures")
business_plot <- plot_data(aggregate,"Businesses")
industry_plot <- plot_data(aggregate,"Industries")
govt_plot <- plot_data(aggregate,"Govt_institutions")
financial_plot <- plot_data(aggregate,"Financials")
young_plot <- plot_data(aggregate,"YOUNG")
aged_plot <- plot_data(aggregate,"AGED")
active_plot <- plot_data(aggregate,"ACTIVE")
density_plot <- plot_data(aggregate,"DENSITY")
HDB1_2_plot <- plot_data(aggregate,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(aggregate,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(aggregate,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(aggregate,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(aggregate,"Landed_Properties")
To visualise the graphs, we arrange them in a grid and plot them.
ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
ncol = 3,
nrow = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
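The printed list has three elements because ggarrange() spreads the plots over several pages when there are more plots than grid cells. The arithmetic, as a small sketch:

```r
n_plots <- 15            # plots passed to ggarrange() above
cells_per_page <- 3 * 2  # ncol * nrow
n_pages <- ceiling(n_plots / cells_per_page)
n_pages
```

A grid with at least fifteen cells, for example ncol = 3 and nrow = 5, would place all the plots on a single page.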
Cluster One (shown in green) is the largest of the four clusters. Its most distinctive feature is that it does not dominate any of the urban functions or demographic measures. On the map, it appears in all four regions of Singapore. One of the reasons this cluster scores low on the socio-economic factors is that it includes areas such as the Central Water Catchment, the Western Water Catchment, and Changi Bay, next to Changi Airport. The water catchment areas consist mainly of forests and water bodies, and are therefore very low on urban functions. As both the density and the age-group values of this cluster are very low, the cluster suggests that the economic properties of a region go hand in hand with its demographic properties. These regions have room for development to attract people to live or work there. They are the more “open” regions of Singapore, i.e. they contain fewer buildings and less commercial infrastructure, and have more open land, forests, and parks. They play an important role in making Singapore a green city and in maintaining its environment. Notably, this cluster also extends into the central region, in subzones such as Tanglin and Tanjong Rhu. These areas have a low population density, but are known as upmarket areas of Singapore as they are very open and have very few buildings. Many other subzones in this cluster have a very low population because they have fewer and shorter buildings. The financials bar chart shows that a considerable amount of financial infrastructure is present in this cluster. This suggests one more inference: subzones in this cluster that do not yet have residential infrastructure could be developed into upmarket residential areas.
Cluster Two (shown in yellow) dominates two urban functions: Businesses and Industries. Industries refers to industrial parks, manufacturing facilities, etc., whereas Businesses refers to the “tertiary sector”. One of the most interesting insights from this cluster is that, even though its subzones are located in all four regions of Singapore, they occur in “mini-clusters”: most of them adjoin other subzones that belong to the same cluster. Industries require a lot of raw materials and transportation resources, so it is advantageous for them to be located close together. Singapore is one of the first countries to adopt the concept of “Industrial Parks”, which are large areas that contain manufacturing and industrial facilities. As these regions are not well suited to other social activities, they should be developed in a way that suits the requirements of businesses and industries, such as providing truck and trailer parking. Population density and the number of households are extremely low in these areas. Industrial facilities often emit harmful pollutants, which create an unhealthy living environment. This explains why this cluster has the least residential infrastructure of all the clusters for every type of dwelling.
Cluster Three (shown in purple) dominates all the demographic factors. It has the highest population density and the highest values for all three age groups. These are the densely populated residential areas of Singapore, and they can be found in the western and eastern regions. Even though these areas are built for residential purposes, Condominiums, Landed Properties, and Private Properties are less common in this cluster than in Cluster Four. We can therefore infer that the distribution of HDB flats and that of condominiums and landed properties follow an inverse spatial relationship. Financial infrastructure is heavily required in such regions, because banking facilities are used by the whole population and should therefore be concentrated in residential areas. The second most required amenity is shopping facilities, which are essential in any densely populated region. The data also suggests that Singapore has taken a bi-modal approach by spatially segregating businesses and residential areas. As there are few businesses and industries in these regions, most of the population presumably travels out of them to work, and hence public transport facilities should be readily available.
Cluster Four (shown in red) dominates most of the urban functions, having the most private properties, shopping infrastructure, government institutions, and financial infrastructure. It also has the highest proportion of landed properties and condominiums. These subzones are where most public service facilities are located, and also where the wealthiest segment of the population prefers to live, as housing there consists predominantly of landed properties and condominiums. It can be inferred that these are the most developed regions of Singapore.
The hierarchical clustering above did not take the spatial relationships between subzones into account. In this section, we will perform spatially constrained clustering using the SKATER approach.
First, we will convert our sf data frame to sp format. This is because the SKATER workflow below takes a spatial data frame (sp) object as its input.
data_by_subzones_sf$CLUSTER = NULL
data_by_subzones_sp <- as_Spatial(data_by_subzones_sf)
From the sp object, we will now create a neighbour list. All subzones that adjoin a given subzone are considered to be its neighbours.
data.nb <- poly2nb(data_by_subzones_sp)
summary(data.nb)
## Neighbour list object:
## Number of regions: 318
## Number of nonzero links: 1934
## Percentage nonzero weights: 1.912503
## Average number of links: 6.081761
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 14 17
## 2 6 10 26 77 87 51 34 16 3 3 1 1 1
## 2 least connected regions:
## JURONG ISLAND AND BUKOM CHANGI BAY with 1 link
## 1 most connected region:
## CENTRAL WATER CATCHMENT with 17 links
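The headline figures in this summary follow directly from the link-number distribution. As a check, the sketch below recomputes them from the distribution transcribed by hand from the output above:

```r
# Link-number distribution copied from summary(data.nb):
# names are the number of links, values the number of subzones.
link_dist <- c(`1` = 2, `2` = 6, `3` = 10, `4` = 26, `5` = 77, `6` = 87,
               `7` = 51, `8` = 34, `9` = 16, `10` = 3, `11` = 3,
               `12` = 1, `14` = 1, `17` = 1)

n_regions   <- sum(link_dist)
n_links     <- sum(as.numeric(names(link_dist)) * link_dist)
avg_links   <- n_links / n_regions
pct_nonzero <- 100 * n_links / n_regions^2

c(regions = n_regions, links = n_links,
  average = avg_links, percent = pct_nonzero)
```

The percentage of nonzero weights is the number of links divided by the number of cells in the full 318 x 318 weights matrix.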
The neighbours can be plotted with the code below. Note that each vertex represents the centroid of the subzone.
plot(data_by_subzones_sp, border=grey(.5))
plot(data.nb, coordinates(data_by_subzones_sp), col="blue", add=TRUE)
The neighbour list is a graph in which each subzone is a vertex and every edge indicates a connection between two subzones. We will now calculate the cost of each edge with the nbcosts() function.
data_by_subzones.std$CLUSTER = NULL
lcosts <- nbcosts(data.nb, data_by_subzones.std)
We can now examine what lcosts looks like.
head(lcosts)
## [[1]]
## [1] 0.5841519 1.0070996 0.9461811
##
## [[2]]
## [1] 1.2014312 1.1631365 0.9765882 1.0070794 1.0734969 0.9572217 0.6871655
##
## [[3]]
## [1] 0.5841519 0.9077098 0.3670866 0.8199890 0.4936605 0.5091575 0.9407277
## [8] 0.6625817 0.5993047 0.9351744
##
## [[4]]
## [1] 1.0209524 0.3716535 0.1654916
##
## [[5]]
## [1] 0.9077098 1.0209524 0.9324593 1.0811211 1.0997412 0.5814547 0.8590983
## [8] 0.7925918
##
## [[6]]
## [1] 0.3670866 0.3716535 0.9324593 0.9887064 0.3677505 0.2848824
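Each cost is the distance between the attribute vectors of two neighbouring subzones (Euclidean is the default method in nbcosts()). The base-R sketch below reproduces that idea on a hypothetical four-region example; the matrix `x` and the neighbour list `nb` are made up for illustration and are not spdep objects.

```r
# Toy attribute table: 4 regions, 2 variables.
x <- matrix(c(0, 0,
              3, 4,
              0, 1,
              6, 8), ncol = 2, byrow = TRUE)

# Toy neighbour list: element i holds the indices of region i's neighbours.
nb <- list(c(2L, 3L), c(1L, 4L), 1L, 2L)

# Euclidean distance between region i and each of its neighbours.
costs <- lapply(seq_along(nb), function(i) {
  sapply(nb[[i]], function(j) sqrt(sum((x[i, ] - x[j, ])^2)))
})
costs
```

Each edge appears twice with the same cost, once from either end, which matches the output above where 0.5841519 is listed under both region 1 and region 3.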
As we have a prepared dataset with values representing the demographics and urban functions of each subzone, we will convert the graph to a weighted graph in which each edge weight measures the dissimilarity between two neighbouring subzones across all the variables.
data.w <- nb2listw(data.nb, lcosts, style="B")
glimpse(data.w)
## List of 3
## $ style : chr "B"
## $ neighbours:List of 318
## ..$ : int [1:3] 3 12 42
## ..$ : int [1:7] 9 14 15 16 22 25 43
## ..$ : int [1:10] 1 5 6 12 20 21 23 24 38 42
## ..$ : int [1:3] 5 6 13
## ..$ : int [1:8] 3 4 6 10 13 24 41 122
## ..$ : int [1:6] 3 4 5 12 13 18
## ..$ : int [1:5] 9 12 17 25 42
## ..$ : int [1:5] 10 11 39 41 73
## ..$ : int [1:5] 2 7 14 17 25
## ..$ : int [1:5] 5 8 41 73 122
## ..$ : int [1:5] 8 32 39 41 73
## ..$ : int [1:9] 1 3 6 7 13 17 18 34 42
## ..$ : int [1:6] 4 5 6 12 18 122
## ..$ : int [1:6] 2 9 15 17 25 31
## ..$ : int [1:5] 2 14 16 31 49
## ..$ : int [1:8] 2 15 29 30 31 43 49 123
## ..$ : int [1:7] 7 9 12 14 31 34 76
## ..$ : int [1:6] 6 12 13 34 71 122
## ..$ : int [1:5] 21 22 25 37 40
## ..$ : int [1:5] 3 23 24 38 41
## ..$ : int [1:6] 3 19 25 37 38 42
## ..$ : int [1:6] 2 19 25 26 40 43
## ..$ : int [1:3] 3 20 24
## ..$ : int [1:5] 3 5 20 23 41
## ..$ : int [1:9] 2 7 9 14 19 21 22 37 42
## ..$ : int [1:5] 22 27 30 40 43
## ..$ : int [1:5] 26 28 30 40 125
## ..$ : int [1:6] 27 30 45 66 70 125
## ..$ : int [1:5] 16 30 43 55 123
## ..$ : int [1:8] 16 26 27 28 29 43 45 55
## ..$ : int [1:6] 14 15 16 17 49 76
## ..$ : int [1:5] 11 48 56 72 73
## ..$ : int [1:4] 34 71 121 124
## ..$ : int [1:7] 12 17 18 33 71 76 124
## ..$ : int 94
## ..$ : int [1:2] 37 40
## ..$ : int [1:8] 19 21 25 36 38 39 40 41
## ..$ : int [1:5] 3 20 21 37 41
## ..$ : int [1:4] 8 11 37 41
## ..$ : int [1:7] 19 22 26 27 36 37 125
## ..$ : int [1:9] 5 8 10 11 20 24 37 38 39
## ..$ : int [1:6] 1 3 7 12 21 25
## ..$ : int [1:6] 2 16 22 26 29 30
## ..$ : int [1:4] 51 57 69 72
## ..$ : int [1:5] 28 30 55 59 70
## ..$ : int [1:5] 50 60 76 124 129
## ..$ : int [1:6] 48 54 81 98 127 128
## ..$ : int [1:9] 32 47 56 73 80 81 98 122 127
## ..$ : int [1:9] 15 16 31 52 58 64 76 123 131
## ..$ : int [1:5] 46 60 64 76 109
## ..$ : int [1:6] 44 57 69 72 90 118
## ..$ : int [1:5] 49 55 58 91 123
## ..$ : int [1:5] 61 82 83 92 103
## ..$ : int [1:5] 47 62 75 98 128
## ..$ : int [1:8] 29 30 45 52 59 91 97 123
## ..$ : int [1:6] 32 48 57 72 80 114
## ..$ : int [1:7] 44 51 56 72 90 114 132
## ..$ : int [1:5] 49 52 91 123 131
## ..$ : int [1:5] 45 55 70 97 101
## ..$ : int [1:9] 46 50 65 74 79 99 109 124 129
## ..$ : int [1:6] 53 68 74 83 103 126
## ..$ : int [1:6] 54 75 82 92 98 128
## ..$ : int [1:4] 77 96 112 113
## ..$ : int [1:7] 49 50 76 107 109 115 131
## ..$ : int [1:5] 60 68 74 93 99
## ..$ : int [1:6] 28 70 100 102 110 125
## ..$ : int [1:2] 108 160
## ..$ : int [1:6] 61 65 74 93 103 116
## ..$ : int [1:4] 44 51 111 118
## ..$ : int [1:7] 28 45 59 66 101 102 119
## ..$ : int [1:7] 18 33 34 120 121 122 126
## ..$ : int [1:5] 32 44 51 56 57
## ..$ : int [1:6] 8 10 11 32 48 122
## ..$ : int [1:8] 60 61 65 68 121 124 126 129
## ..$ : int [1:7] 54 62 82 92 120 126 128
## ..$ : int [1:9] 17 31 34 46 49 50 64 124 129
## ..$ : int [1:3] 63 113 130
## ..$ : int [1:4] 94 100 133 134
## ..$ : int [1:3] 60 99 109
## ..$ : int [1:6] 48 56 98 104 105 114
## ..$ : int [1:4] 47 48 122 127
## ..$ : int [1:6] 53 62 75 83 92 126
## ..$ : int [1:4] 53 61 82 126
## ..$ : int [1:9] 87 112 113 117 155 234 237 238 269
## ..$ : int [1:6] 104 114 135 168 175 179
## ..$ : int [1:7] 110 133 136 172 182 187 239
## ..$ : int [1:7] 84 112 117 134 180 186 237
## ..$ : int [1:6] 107 115 131 146 173 241
## ..$ : int [1:6] 118 132 140 144 170 174
## ..$ : int [1:4] 51 57 118 132
## ..$ : int [1:6] 52 55 58 97 106 131
## ..$ : int [1:6] 53 62 75 82 98 103
## ..$ : int [1:4] 65 68 99 116
## ..$ : int [1:5] 35 78 95 117 134
## ..$ : int [1:4] 94 96 112 117
## ..$ : int [1:4] 63 95 112 117
## ..$ : int [1:5] 55 59 91 101 106
## ..$ : int [1:9] 47 48 54 62 80 92 103 105 177
## ..$ : int [1:7] 60 65 79 93 109 116 137
## .. [list output truncated]
## ..- attr(*, "class")= chr "nb"
## ..- attr(*, "region.id")= chr [1:318] "PEOPLE'S PARK" "BUKIT MERAH" "CHINATOWN" "PHILLIP" ...
## ..- attr(*, "call")= language poly2nb(pl = data_by_subzones_sp)
## ..- attr(*, "type")= chr "queen"
## ..- attr(*, "sym")= logi TRUE
## $ weights :List of 318
## ..$ : num [1:3] 0.584 1.007 0.946
## ..$ : num [1:7] 1.201 1.163 0.977 1.007 1.073 ...
## ..$ : num [1:10] 0.584 0.908 0.367 0.82 0.494 ...
## ..$ : num [1:3] 1.021 0.372 0.165
## ..$ : num [1:8] 0.908 1.021 0.932 1.081 1.1 ...
## ..$ : num [1:6] 0.367 0.372 0.932 0.989 0.368 ...
## ..$ : num [1:5] 0.541 0.966 0.762 0.376 1.106
## ..$ : num [1:5] 0.169 0.129 0.149 0.2 0.698
## ..$ : num [1:5] 1.201 0.541 0.743 0.62 0.441
## ..$ : num [1:5] 1.081 0.169 0.309 0.703 1.11
## ..$ : num [1:5] 0.1294 0.0227 0.023 0.3156 0.7902
## ..$ : num [1:9] 1.007 0.82 0.989 0.966 1.039 ...
## ..$ : num [1:6] 0.165 1.1 0.368 1.039 0.409 ...
## ..$ : num [1:6] 1.163 0.743 0.497 0.249 0.449 ...
## ..$ : num [1:5] 0.977 0.497 0.682 0.72 0.538
## ..$ : num [1:8] 1.007 0.682 1.016 0.98 0.999 ...
## ..$ : num [1:7] 0.762 0.62 0.441 0.249 1.003 ...
## ..$ : num [1:6] 0.285 1.067 0.409 0.231 0.415 ...
## ..$ : num [1:5] 0.606 0.356 0.266 1.177 0.67
## ..$ : num [1:5] 0.494 0.587 0.356 0.323 0.383
## ..$ : num [1:6] 0.509 0.606 0.704 1.182 0.493 ...
## ..$ : num [1:6] 1.073 0.356 0.198 0.589 0.946 ...
## ..$ : num [1:3] 0.941 0.587 0.549
## ..$ : num [1:5] 0.663 0.581 0.356 0.549 0.562
## ..$ : num [1:9] 0.957 0.376 0.441 0.449 0.266 ...
## ..$ : num [1:5] 0.589 0.545 0.483 0.534 0.378
## ..$ : num [1:5] 0.545 0.457 0.341 0.447 0.418
## ..$ : num [1:6] 0.457 0.317 0.388 0.534 0.481 ...
## ..$ : num [1:5] 1.016 0.17 0.256 0.505 0.669
## ..$ : num [1:8] 0.98 0.483 0.341 0.317 0.17 ...
## ..$ : num [1:6] 0.958 0.72 0.999 1.003 0.271 ...
## ..$ : num [1:5] 0.0227 1.0886 0.5164 0 0.8009
## ..$ : num [1:4] 0.369 0.333 0.219 0.166
## ..$ : num [1:7] 1.027 1.03 0.231 0.369 0.471 ...
## ..$ : num 0.0891
## ..$ : num [1:2] 1.041 0.104
## ..$ : num [1:8] 1.18 1.18 1.25 1.04 1.02 ...
## ..$ : num [1:5] 0.599 0.323 0.493 1.017 0.295
## ..$ : num [1:4] 0.149 0.023 1.001 0.335
## ..$ : num [1:7] 0.67 0.946 0.534 0.447 0.104 ...
## ..$ : num [1:9] 0.859 0.2 0.309 0.316 0.383 ...
## ..$ : num [1:6] 0.946 0.935 1.106 1.359 0.537 ...
## ..$ : num [1:6] 0.687 0.946 0.686 0.378 0.256 ...
## ..$ : num [1:4] 1.2113 0.4937 0.4667 0.0696
## ..$ : num [1:5] 0.388 0.186 0.452 0.363 0.541
## ..$ : num [1:5] 0.323 1.615 0.405 0.143 0.179
## ..$ : num [1:6] 1.085 0.452 0.058 0.809 0.624 ...
## ..$ : num [1:9] 1.089 1.085 0.984 1.013 1.081 ...
## ..$ : num [1:9] 0.538 0.89 0.271 0.418 0.431 ...
## ..$ : num [1:5] 0.323 1.34 0.239 0.436 0.751
## ..$ : num [1:6] 1.21 1.05 1.04 1.22 1 ...
## ..$ : num [1:5] 0.418 0.826 0.128 0.309 0.212
## ..$ : num [1:5] 0.0389 0.2759 0.2349 0.2843 0.429
## ..$ : num [1:5] 0.452 0.225 0.297 0.609 0.457
## ..$ : num [1:8] 0.505 0.467 0.452 0.826 0.583 ...
## ..$ : num [1:6] 0.516 0.984 0.34 0.516 0.501 ...
## ..$ : num [1:7] 0.494 1.052 0.34 0.503 0.302 ...
## ..$ : num [1:5] 0.431 0.128 0.36 0.259 0.71
## ..$ : num [1:5] 0.363 0.583 0.236 0.467 0.329
## ..$ : num [1:9] 1.615 1.34 1.557 0.632 1.582 ...
## ..$ : num [1:6] 0.0389 0.1405 1.1192 0.2171 0.4218 ...
## ..$ : num [1:6] 0.225 0.149 0.213 0.124 0.644 ...
## ..$ : num [1:4] 0.0904 0.2243 0.1089 0.6568
## ..$ : num [1:7] 0.426 0.239 0.363 0.357 0.679 ...
## ..$ : num [1:5] 1.557 0.23 1.102 0.376 0.218
## ..$ : num [1:6] 0.534 0.324 0.622 0.271 0.424 ...
## ..$ : num [1:2] 1.3 0.264
## ..$ : num [1:6] 0.141 0.23 1.116 0.317 0.337 ...
## ..$ : num [1:4] 0.4667 1.0443 0.0995 1.5138
## ..$ : num [1:7] 0.481 0.541 0.236 0.324 0.502 ...
## ..$ : num [1:7] 0.415 0.333 0.471 0.253 0.397 ...
## ..$ : num [1:5] 0 0.0696 1.2212 0.5164 0.5028
## ..$ : num [1:6] 0.698 0.703 0.79 0.801 1.013 ...
## ..$ : num [1:8] 0.632 1.119 1.102 1.116 1.125 ...
## ..$ : num [1:7] 0.297 0.149 0.13 0.163 0.216 ...
## ..$ : num [1:9] 1.038 0.477 0.432 0.405 0.548 ...
## ..$ : num [1:3] 0.0904 0.5679 1.1449
## ..$ : num [1:4] 0.264 0.178 0.498 0.126
## ..$ : num [1:3] 1.5816 0.0919 0.7371
## ..$ : num [1:6] 1.081 0.501 0.797 0.621 0.167 ...
## ..$ : num [1:4] 0.058 1.057 1.198 0.588
## ..$ : num [1:6] 0.276 0.213 0.13 0.331 0.253 ...
## ..$ : num [1:4] 0.235 0.217 0.331 0.355
## ..$ : num [1:9] 0.529 0.589 0.421 0.738 0.822 ...
## ..$ : num [1:6] 0.926 1.341 0.821 1.393 0.718 ...
## ..$ : num [1:7] 0.868 0.884 0.843 0.918 0.973 ...
## ..$ : num [1:7] 0.5288 0.1095 0.2938 0.1475 0.0891 ...
## ..$ : num [1:6] 0.312 0.43 0.228 0.26 0.283 ...
## ..$ : num [1:6] 1.342 0.643 1.29 0.888 1.076 ...
## ..$ : num [1:4] 1 0.302 1.152 0.752
## ..$ : num [1:6] 0.309 0.782 0.36 0.349 0.758 ...
## ..$ : num [1:6] 0.284 0.124 0.163 0.253 0.686 ...
## ..$ : num [1:4] 0.376 0.317 0.261 1.018
## ..$ : num [1:5] 0.0891 0.264 0.0165 0.0489 0.1449
## ..$ : num [1:4] 0.0165 0.2058 0.165 0.0547
## ..$ : num [1:4] 0.224 0.206 0.289 0.204
## ..$ : num [1:5] 0.829 0.467 0.349 0.686 0.74
## ..$ : num [1:9] 0.809 0.459 0.609 0.644 0.797 ...
## ..$ : num [1:7] 1.5594 0.2177 0.0919 0.261 0.7474 ...
## .. [list output truncated]
## ..- attr(*, "mode")= chr "general"
## ..- attr(*, "glist")= chr [1:532] "list(c(0.584151939139242, 1.00709958433393, 0.946181057476831" "), c(1.20143119850206, 1.16313653026439, 0.976588240759307, 1.00707936495716, " "1.07349688787413, 0.957221684874255, 0.687165519601056), c(0.584151939139242, " "0.907709803954232, 0.367086587583076, 0.819989047758644, 0.49366050313275, " ...
## ..- attr(*, "glistsym")= logi TRUE
## .. ..- attr(*, "d")= num 0
## ..- attr(*, "B")= logi TRUE
## - attr(*, "class")= chr [1:2] "listw" "nb"
## - attr(*, "region.id")= chr [1:318] "PEOPLE'S PARK" "BUKIT MERAH" "CHINATOWN" "PHILLIP" ...
## - attr(*, "call")= language nb2listw(neighbours = data.nb, glist = lcosts, style = "B")
From the neighbour list summary, we can see that the average number of links is about 6, i.e. each subzone is connected to six other subzones on average. Jurong Island and Bukom, together with Changi Bay, are the least connected subzones, with only one link each. We did not remove Jurong Island from our data frame as it is a major hub for manufacturing as well as oil and gas production facilities.
In order to perform SKATER cluster analysis, we will find the minimum spanning tree of our weighted graph. The minimum spanning tree connects all the vertices together, without any cycles and with the minimum possible total edge weight. It is calculated with the mstree() function.
data.mst <- mstree(data.w)
We can examine the nature of the output:
class(data.mst)
## [1] "mst" "matrix"
dim(data.mst)
## [1] 317 3
The matrix has 317 rows because a spanning tree over N = 318 nodes consists of N - 1 edges. The three columns hold the two end nodes of each edge and its cost.
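To make the N - 1 property concrete, here is a minimal base-R sketch of Prim's algorithm, one standard way of building a minimum spanning tree. It is an illustration under stated assumptions, not spdep's implementation: the function name `prim_mst`, the five-node cost matrix, and the use of `Inf` for "not neighbours" are all hypothetical, and the graph is assumed to be connected.

```r
# Prim's algorithm on a symmetric cost matrix (Inf = not neighbours).
prim_mst <- function(cost) {
  n <- nrow(cost)
  in_tree <- c(TRUE, rep(FALSE, n - 1))  # start from node 1
  edges <- matrix(NA_real_, nrow = n - 1, ncol = 3,
                  dimnames = list(NULL, c("from", "to", "cost")))
  for (k in seq_len(n - 1)) {
    # Costs of all edges that leave the current tree.
    cand <- cost[in_tree, !in_tree, drop = FALSE]
    idx  <- which(cand == min(cand), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to   <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to, cost[from, to])
    in_tree[to] <- TRUE                  # grow the tree by one node
  }
  edges
}

# Toy graph with 5 nodes and 7 weighted edges (from, to, cost).
e <- rbind(c(1, 2, 1), c(1, 3, 4), c(2, 3, 2), c(2, 4, 5),
           c(3, 4, 3), c(4, 5, 1.5), c(3, 5, 6))
cost <- matrix(Inf, 5, 5)
cost[e[, 1:2]] <- e[, 3]
cost[e[, 2:1]] <- e[, 3]  # make the matrix symmetric

toy_mst <- prim_mst(cost)
toy_mst
dim(toy_mst)
```

Five nodes give a tree with four rows and three columns, just as 318 subzones give 317 rows.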
We can visualise the spanning tree by plotting it.
plot(data_by_subzones_sp, border=gray(.5))
plot.mst(data.mst, coordinates(data_by_subzones_sp),
col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE,label.areas = NULL)
Note that the number of edges has been reduced. This is because the graph is now acyclic: only the N - 1 edges of the spanning tree remain.
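SKATER uses this tree to form clusters: removing one edge from a spanning tree splits it into two connected groups, and the edge to remove is chosen so that the resulting groups are as homogeneous as possible. The following is a simplified base-R sketch of a single cut on a hypothetical six-node tree. The helper names (`ssw`, `component`, `best_cut`) and the toy values are illustrative assumptions, and spdep's skater() is more general than this.

```r
# Within-group sum of squared deviations from the group mean.
ssw <- function(x, idx) {
  g <- x[idx, , drop = FALSE]
  sum(sweep(g, 2, colMeans(g))^2)
}

# Nodes reachable from `start` along the given edges (two-column matrix).
component <- function(edges, start) {
  seen <- start
  repeat {
    hit   <- edges[, 1] %in% seen | edges[, 2] %in% seen
    grown <- union(seen, as.vector(edges[hit, , drop = FALSE]))
    if (length(grown) == length(seen)) return(sort(seen))
    seen <- grown
  }
}

# One cut: drop the tree edge whose removal gives the smallest
# total within-group sum of squares.
best_cut <- function(edges, x) {
  split_at <- function(k) component(edges[-k, , drop = FALSE], edges[k, 1])
  totals <- sapply(seq_len(nrow(edges)), function(k) {
    a <- split_at(k)
    b <- setdiff(seq_len(nrow(x)), a)
    ssw(x, a) + ssw(x, b)
  })
  k <- which.min(totals)
  a <- split_at(k)
  list(cut = edges[k, ],
       groups = ifelse(seq_len(nrow(x)) %in% a, 1L, 2L))
}

# Toy tree: a path 1-2-3-4-5-6 with one attribute per node.
toy_edges <- cbind(1:5, 2:6)
toy_x <- matrix(c(0, 0.1, 0.2, 5, 5.1, 5.2), ncol = 1)

res <- best_cut(toy_edges, toy_x)
res$cut
res$groups
```

Repeating the cut on the resulting groups is what turns a number of cuts into that number plus one clusters.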
7.2 Computing the clusters
The code below computes clusters using the SKATER method. In hierarchical clustering, we used 4 clusters. However, as the SKATER method imposes spatial constraints, we will split the data into 6 clusters in order to avoid extremely large clusters. The third argument of skater() is the number of cuts, so 5 cuts give 6 clusters.
clust <- skater(data.mst[,1:2], data_by_subzones.std, 5)
The output of the above code is a skater object. We can examine it with the code below.
str(clust)
## List of 8
## $ groups : num [1:318] 1 1 1 1 1 1 3 1 3 1 ...
## $ edges.groups:List of 6
## ..$ :List of 3
## .. ..$ node: num [1:258] 142 240 57 188 210 245 90 317 163 87 ...
## .. ..$ edge: num [1:257, 1:3] 240 188 57 210 245 163 317 90 160 87 ...
## .. ..$ ssw : num 115
## ..$ :List of 3
## .. ..$ node: num [1:12] 224 233 254 242 287 253 227 314 223 229 ...
## .. ..$ edge: num [1:11, 1:3] 287 224 224 227 233 287 224 253 254 233 ...
## .. ..$ ssw : num 3.21
## ..$ :List of 3
## .. ..$ node: num [1:17] 49 15 14 17 25 123 52 91 12 9 ...
## .. ..$ edge: num [1:16, 1:3] 14 17 123 49 25 25 49 52 25 14 ...
## .. ..$ ssw : num 5.7
## ..$ :List of 3
## .. ..$ node: num [1:6] 148 164 138 139 170 118
## .. ..$ edge: num [1:5, 1:3] 148 164 138 148 164 164 139 118 138 170 ...
## .. ..$ ssw : num 5.25
## ..$ :List of 3
## .. ..$ node: num [1:14] 316 310 312 309 232 219 244 288 225 313 ...
## .. ..$ edge: num [1:13, 1:3] 310 316 312 309 219 316 232 309 244 219 ...
## .. ..$ ssw : num 5.52
## ..$ :List of 3
## .. ..$ node: num [1:11] 182 151 141 156 159 186 149 157 152 143 ...
## .. ..$ edge: num [1:10, 1:3] 149 156 157 141 186 159 151 152 182 151 ...
## .. ..$ ssw : num 4.66
## $ not.prune : NULL
## $ candidates : int [1:6] 1 2 3 4 5 6
## $ ssto : num 163
## $ ssw : num [1:6] 163 158 153 149 144 ...
## $ crit : num [1:2] 1 Inf
## $ vec.crit : num [1:318] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "class")= chr "skater"
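The lengths of the `node` vectors in this output give the size of each group. A small sketch, with the six counts transcribed by hand from the str() output above, shows how the subzones are distributed:

```r
# Number of nodes per group, copied from the `node` lengths in str(clust).
sizes <- c(`1` = 258, `2` = 12, `3` = 17, `4` = 6, `5` = 14, `6` = 11)

sum(sizes)                          # total number of subzones
round(100 * sizes / sum(sizes), 1)  # share of subzones per group, in percent
```

On the live object, `table(clust$groups)` should give the same counts.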
The data has been split into 6 groups, indicating 6 clusters. The entry for each group holds its nodes, its edges, and its within-group sum of squares (ssw). We can find out how the clusters have been assigned with the code below.
clusters <- clust$groups
clusters
## [1] 1 1 1 1 1 1 3 1 3 1 1 3 1 3 3 3 3 1 3 1 1 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 4 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 6 1 6 1 1 1 1 4
## [149] 6 6 6 6 1 1 1 6 6 1 6 1 1 1 1 4 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1
## [186] 6 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 5 1 1 1 1 5 1 5 1
## [223] 2 2 5 1 2 1 2 1 1 5 2 1 1 1 1 1 1 1 1 2 1 5 1 1 1 1 1 1 2 1 2 2 5 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 5 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 5 5 1 5 5 2 1 5 1 1
This is a vector that contains the cluster number of each subzone. As in the previous section, we will attach it to our table and map it.
groups_mat <- as.matrix(clust$groups)
data_by_subzones.std$SP_CLUSTER <- as.factor(groups_mat)
st_geometry(data_by_subzones.std)<-data_by_subzones_sf$geometry
qtm(data_by_subzones.std, "SP_CLUSTER")
st_geometry(data_by_subzones.std) <- NULL
data_by_subzones.std$CLUSTER <- NULL
aggregate2 <- aggregate(data_by_subzones.std,by= list(data_by_subzones.std$SP_CLUSTER),FUN = "mean")
aggregate2$SP_CLUSTER <- NULL
aggregate2 <- aggregate2 %>%
rename("SP_CLUSTER"=Group.1)
plot_data <- function(maindata,attribute){
return(ggplot(maindata, aes_string(x="SP_CLUSTER",y=attribute, fill = "SP_CLUSTER")) +
geom_bar(stat="identity", position = "dodge",size=0.5) +
theme(legend.position = 'none')+
scale_fill_brewer(palette = "Set3"))}
private_plot <- plot_data(aggregate2,"Private_properties")
shopping_plot <- plot_data(aggregate2,"Shopping_Infrastructures")
business_plot <- plot_data(aggregate2,"Businesses")
industry_plot <- plot_data(aggregate2,"Industries")
govt_plot <- plot_data(aggregate2,"Govt_institutions")
financial_plot <- plot_data(aggregate2,"Financials")
young_plot <- plot_data(aggregate2,"YOUNG")
aged_plot <- plot_data(aggregate2,"AGED")
active_plot <- plot_data(aggregate2,"ACTIVE")
density_plot <- plot_data(aggregate2,"DENSITY")
HDB1_2_plot <- plot_data(aggregate2,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(aggregate2,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(aggregate2,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(aggregate2,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(aggregate2,"Landed_Properties")
To visualise the graphs, we arrange them in a grid and plot them.
ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
ncol = 3,
nrow = 2)
## $`1`
##
## $`2`
##
## $`3`
##
## attr(,"class")
## [1] "list" "ggarrange"
Cluster One is the largest and is characterised by businesses and industries. Cluster Two consists of residential areas. Cluster Three is similar to Cluster Two and consists of residential infrastructure; however, it also contains government institutions, as it is located in the central area. Cluster Four dominates in private properties, financial infrastructure, and government institutions, and is located in the eastern region of Singapore. Cluster Five is located in the north-east and dominates in HDB 5-room flats, indicating that residents there enjoy bigger homes. Cluster Six is a highly dense residential area located in western Singapore.
Hierarchical clustering is the better approach for social area analysis here, as Singapore's residential areas, industrial parks, and government facilities are spread across the whole island. The SKATER approach only groups subzones that are in close spatial proximity. To obtain better findings from this approach, we would need to increase the number of clusters.