Objective of the report

Social area analysis will be performed on the subzones of Singapore to examine their socio-economic differences and to classify them into relatively homogeneous groups.

1. Importing all the required packages

To get started, we import the R packages required for the analysis in the upcoming sections. Here is a brief description of the packages used:
* The tidyverse package will be used heavily for data wrangling and for cleaning our datasets.
* The rgdal, spdep, and sf packages will be used for spatial data manipulation and analysis.
* The corrplot and tmap packages will be used for visualisation purposes.
* ClustGeo, heatmaply, and psych will be used to perform statistical analysis on spatial data.

# Packages required for the analysis
packages <- c('rgdal', 'spdep', 'ClustGeo', 'tmap', 'sf', 'ggpubr', 'cluster',
              'heatmaply', 'corrplot', 'psych', 'tidyverse', 'factoextra',
              'NbClust', 'FactoMineR', 'knitr', 'tmaptools')
# Install each package only if it is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

2. Importing all datasets

Through the code in this section, we will be importing all the required datasets. These involve both spatial and aspatial data.

2.1 Importing aspatial data

The data below is taken from www.data.gov.sg, the official government website for Singapore’s public data. The URL for the dataset is as follows:
https://data.gov.sg/dataset/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019

residentData <- read_csv("data/aspatial/singapore-residents-by-subzone-and-type-of-dwelling-2011-2019/planning-area-subzone-age-group-sex-and-type-of-dwelling-june-2011-2019.csv")

2.2 Importing geospatial data

Several datasets are imported in this section. The function st_read is used so that each geospatial dataset is imported as an sf (simple features) object.

mpsz <- st_read(dsn="data/geospatial/master-plan-2014-subzone-boundary-no-sea-shp", layer="MP14_SUBZONE_NO_SEA_PL")

## Reading layer `MP14_SUBZONE_NO_SEA_PL' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial/master-plan-2014-subzone-boundary-no-sea-shp' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

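The above code imports the subzone boundaries of Singapore. As seen in the output, the layer is described only by a proj4string (a transverse Mercator projection with coordinates in metres) rather than by an EPSG code. Hence, we assign EPSG code 3414 (SVY21), the projected coordinate system used for Singapore. Note that st_set_crs only relabels the CRS without reprojecting the coordinates, so the st_transform call that follows leaves them unchanged and simply gives the object a name consistent with the other layers.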

mpsz <- st_set_crs(mpsz,3414)
mpsz3414 <- st_transform(mpsz,3414)
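The difference between assigning a CRS and transforming to one can be seen on a single point. The following is a minimal sketch using a made-up longitude/latitude coordinate (not taken from the report's data); the object names are hypothetical.

```r
library(sf)

# Toy point in longitude/latitude (EPSG:4326); the coordinate is illustrative only.
pt_wgs84 <- st_sfc(st_point(c(103.85, 1.29)), crs = 4326)

# st_transform() reprojects: the coordinates become SVY21 eastings/northings in metres.
pt_svy21 <- st_transform(pt_wgs84, 3414)

# st_set_crs() only changes the CRS label; the numbers stay exactly as they were.
# sf warns about this, so the warning is suppressed here on purpose.
pt_relabelled <- suppressWarnings(st_set_crs(pt_wgs84, 3414))

st_coordinates(pt_svy21)       # values in metres
st_coordinates(pt_relabelled)  # still 103.85, 1.29
```

Relabelling is only appropriate when the coordinates are already expressed in the target system, which is the situation for the subzone layer above.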

Next, we will import geospatial data for all the important urban functions in Singapore.

business <- st_read(dsn="data/geospatial", layer="Business")

## Reading layer `Business' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## CRS:            4326

financial <- st_read(dsn="data/geospatial", layer="Financial")

## Reading layer `Financial' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## CRS:            4326

govt <- st_read(dsn="data/geospatial", layer="Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## CRS:            4326

private <- st_read(dsn="data/geospatial", layer="Private residential")

## Reading layer `Private residential' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## CRS:            4326

shopping <- st_read(dsn="data/geospatial", layer="Shopping")

## Reading layer `Shopping' from data source `/Users/Amey/Desktop/Y2S2/IS415/assign3/data/geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## CRS:            4326

The point locations of the various urban functions are imported above. As seen in the output, all of them use EPSG 4326 (WGS 84), a geographic coordinate system expressed in decimal degrees rather than metres. Singapore's projected coordinate system is EPSG 3414. Hence, to ensure that the data is projected accurately, we will transform the data into EPSG 3414.

Transforming all geospatial data into EPSG 3414

business3414 <- st_transform(business,3414)
financial3414 <- st_transform(financial,3414)
govt3414 <- st_transform(govt,3414)
private3414 <- st_transform(private,3414)
shopping3414 <- st_transform(shopping,3414)
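When several layers are transformed, it can be convenient to do so in one pass and confirm that they all ended up in the same CRS. The sketch below uses two toy point layers standing in for the imported ones; the names toy_business, toy_shopping, and layers_3414 are hypothetical.

```r
library(sf)

# Two toy point layers in EPSG:4326 standing in for the imported layers.
toy_business <- st_sf(id = 1, geometry = st_sfc(st_point(c(103.80, 1.30)), crs = 4326))
toy_shopping <- st_sf(id = 1, geometry = st_sfc(st_point(c(103.90, 1.35)), crs = 4326))

# Transform every layer in one pass and keep the results in a named list.
layers_3414 <- lapply(list(business = toy_business, shopping = toy_shopping),
                      st_transform, crs = 3414)

# TRUE for each layer whose CRS now equals EPSG:3414.
same_crs <- vapply(layers_3414, function(x) st_crs(x) == st_crs(3414), logical(1))
same_crs
```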

Checking the data

business3414

## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 3669.148 ymin: 25408.41 xmax: 47034.83 ymax: 50148.54
## CRS:            EPSG:3414
## First 10 features:
##        POI_ID SEQ_NUM FAC_TYPE                        POI_NAME
## 1  1101180209       1     5000                       JOHN CHEN
## 2  1101180210       1     5000    TROPICAL INDUSTRIAL BUILDING
## 3  1101180211       1     5000 LIAN CHEONG INDUSTRIAL BUILDING
## 4  1101180212       1     5000  MALAYSIA GARMENT MANUFACTURERS
## 5  1101180213       1     5000                         UNIGOLD
## 6  1192316144       1     5000             NUS UNIVERSITY HALL
## 7  1144317654       1     5000           SUITES AT BUKIT TIMAH
## 8  1103507488       1     5000                      TIONG HUAT
## 9  1001052867       1     5000  LEE CHOON GUAN TIMBER MERCHANT
## 10 1001052868       1     5000           WEIGHT BRIDGE SERVICE
##                ST_NAME                  geometry
## 1            LITTLE RD POINT (33818.36 35620.16)
## 2            LITTLE RD  POINT (33770.51 35610.2)
## 3            LITTLE RD POINT (33779.41 35612.41)
## 4                 <NA> POINT (33802.78 35598.04)
## 5            LITTLE RD POINT (33835.06 35623.47)
## 6  LOWER KENT RIDGE RD POINT (21813.48 31063.37)
## 7  JALAN JURONG KECHIL POINT (21375.11 35831.37)
## 8   KALLANG PUDDING RD  POINT (33088.33 34439.2)
## 9           PENJURU RD POINT (17103.73 33407.71)
## 10          PENJURU RD   POINT (17178.3 33503.9)

financial3414

## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 4881.527 ymin: 25171.88 xmax: 46526.16 ymax: 49338.02
## CRS:            EPSG:3414
## First 10 features:
##       LINK_ID     POI_ID SEQ_NUM FAC_TYPE            POI_NAME POI_LANGCD
## 1  1170624361 1132324230       1     3578                 UOB        ENG
## 2  1112103842 1132315471       1     3578                POSB        ENG
## 3  1112103842 1132315472       1     3578                 UOB        ENG
## 4  1112103842 1132315473       1     3578                OCBC        ENG
## 5   864687596 1100784924       1     3578                OCBC        ENG
## 6   902073032 1132324170       1     6000             MAYBANK        ENG
## 7   778516217 1141424387       1     6000 ADPOST MONEYCHANGER        ENG
## 8   880495939 1096910285       1     3578                 UOB        ENG
## 9   866996334 1096910292       1     3578                OCBC        ENG
## 10  880495939 1096910286       1     3578            CITIBANK        ENG
##    POI_NMTYPE POI_ST_NUM ST_NUM_FUL ST_NFUL_LC           ST_NAME ST_LANGCD
## 1           B        201       <NA>       <NA>      YISHUN AVE 2       ENG
## 2           B        375       <NA>       <NA>  COMMONWEALTH AVE       ENG
## 3           B        375       <NA>       <NA>  COMMONWEALTH AVE       ENG
## 4           B        375       <NA>       <NA>  COMMONWEALTH AVE       ENG
## 5           B       <NA>       <NA>       <NA> JURONG WEST ST 51       ENG
## 6           B        707       <NA>       <NA>     EAST COAST RD       ENG
## 7           B        163       <NA>       <NA>        TANGLIN RD       ENG
## 8           B       <NA>       <NA>       <NA>              <NA>      <NA>
## 9           B         11       <NA>       <NA>         ARTS LINK       ENG
## 10          B       <NA>       <NA>       <NA>              <NA>      <NA>
##    POI_ST_SD ACC_TYPE   PH_NUMBER CHAIN_ID NAT_IMPORT PRIVATE IN_VICIN
## 1          L     <NA>        <NA>     6919          N       N        N
## 2          R     <NA>        <NA>     6918          N       N        N
## 3          R     <NA>        <NA>     6919          N       N        N
## 4          R     <NA>        <NA>     6920          N       N        N
## 5          R     <NA>        <NA>     6920          N       N        N
## 6          L     <NA> 18006292266     3657          N       N        N
## 7          R     <NA>    67330779        0          N       N        N
## 8          R     <NA>        <NA>     6919          N       N        N
## 9          R     <NA>        <NA>     6920          N       N        N
## 10         R     <NA>        <NA>     1165          N       N        N
##    NUM_PARENT NUM_CHILD PERCFRREF VANCITY_ID
## 1           0         0        NA          0
## 2           0         0        NA          0
## 3           0         0        NA          0
## 4           0         0        NA          0
## 5           0         0        60          0
## 6           0         0        NA          0
## 7           1         0        50          0
## 8           0         0        20          0
## 9           0         0        NA          0
## 10          0         0        20          0
##                                                              ACT_ADDR
## 1                                                                <NA>
## 2                                                                <NA>
## 3                                                                <NA>
## 4                                                                <NA>
## 5  501 JURONG WEST STREET 51                         SINGAPORE 640501
## 6                                                                <NA>
## 7                                                                <NA>
## 8                                                                <NA>
## 9                                                                <NA>
## 10                                                               <NA>
##    ACT_LANGCD            ACT_ST_NAM ACT_ST_NUM ACT_ADMIN ACT_POSTAL
## 1        <NA>                  <NA>       <NA>      <NA>       <NA>
## 2        <NA>                  <NA>       <NA>      <NA>       <NA>
## 3        <NA>                  <NA>       <NA>      <NA>       <NA>
## 4        <NA>                  <NA>       <NA>      <NA>       <NA>
## 5         ENG JURONG WEST STREET 51        501 SINGAPORE     640501
## 6        <NA>                  <NA>       <NA>      <NA>       <NA>
## 7        <NA>                  <NA>       <NA>      <NA>       <NA>
## 8        <NA>                  <NA>       <NA>      <NA>       <NA>
## 9        <NA>                  <NA>       <NA>      <NA>       <NA>
## 10       <NA>                  <NA>       <NA>      <NA>       <NA>
##                     geometry
## 1  POINT (27966.77 44304.65)
## 2  POINT (24163.96 31606.25)
## 3  POINT (24163.96 31606.25)
## 4  POINT (24163.96 31606.25)
## 5  POINT (15270.94 36919.65)
## 6  POINT (37917.26 32698.88)
## 7  POINT (26981.85 31956.75)
## 8  POINT (21205.83 30939.54)
## 9  POINT (21159.08 30673.06)
## 10 POINT (21205.83 30939.54)

govt3414

## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 5177.756 ymin: 25745.76 xmax: 45262.14 ymax: 48805.09
## CRS:            EPSG:3414
## First 10 features:
##        POI_ID SEQ_NUM FAC_TYPE                       POI_NAME      ST_NAME
## 1  1141424380       1     9993           CONSULATE SAN MARINO    CHURCH ST
## 2  1141424404       1     9993                   EMBASSY LAOS GOLDHILL PLZ
## 3  1141424402       1     9993               CONSULATE BELIZE     CECIL ST
## 4  1141424338       1     9993         GENERAL CONSULATE OMAN         <NA>
## 5  1192460871       1     9525                MND TOWER BLOCK   MAXWELL RD
## 6  1192460819       1     9525 MND AUDITORIUM & FUNCTION HALL   MAXWELL RD
## 7  1192460843       1     9525          AICARE LINK @ MAXWELL   MAXWELL RD
## 8  1192460783       1     9525   HARMONY IN DIVERSITY GALLERY   MAXWELL RD
## 9  1192460750       1     9525    FAMILY SUPPORT DIVISION MSF   MAXWELL RD
## 10 1194224304       1     9525               LTA BEDOK CAMPUS CHAI CHEE ST
##                     geometry
## 1  POINT (29790.84 29540.69)
## 2  POINT (29086.35 33403.07)
## 3  POINT (29780.83 29302.96)
## 4  POINT (30723.45 31361.87)
## 5  POINT (29363.48 29016.57)
## 6  POINT (29352.36 29032.05)
## 7  POINT (29352.36 29032.05)
## 8  POINT (29352.36 29032.05)
## 9  POINT (29352.36 29032.05)
## 10 POINT (37470.93 34345.33)

private3414

## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 5316.959 ymin: 24675.4 xmax: 43760.83 ymax: 48378.23
## CRS:            EPSG:3414
## First 10 features:
##        POI_ID SEQ_NUM FAC_TYPE                       POI_NAME        ST_NAME
## 1  1132324282       1     9590 MARINA BAY SERVICED APARTMENTS    MARINA BLVD
## 2  1132106212       1     9590                 SIN MING VILLE    JALAN TODAK
## 3  1202668778       1     9590         GREENTOPS @ SIMS PLACE           <NA>
## 4  1099690099       1     9590    MOUNTBATTEN DAKOTA CRESCENT    DAKOTA CRES
## 5   995195128       1     9590                    SINGA COURT    JALAN SINGA
## 6  1176000954       1     9590            FORESQUE RESIDENCES       PETIR RD
## 7  1100738877       1     9590              TIONG BAHRU COURT  JALAN MEMBINA
## 8   935999454       1     9590            BIRMINGHAM MANSIONS     THOMSON RD
## 9   935999453       1     9590              THOMSON EURO-ASIA     THOMSON RD
## 10 1069807806       1     9590                STRATFORD COURT BEDOK RIA CRES
##                     geometry
## 1  POINT (30144.75 29293.01)
## 2  POINT (28238.32 37300.83)
## 3  POINT (33158.46 33189.71)
## 4  POINT (34253.58 32295.18)
## 5   POINT (36358.02 34731.2)
## 6   POINT (21556.59 39011.5)
## 7  POINT (27313.49 29646.84)
## 8  POINT (29236.59 33304.66)
## 9  POINT (29222.12 33348.89)
## 10 POINT (41169.09 34457.16)

shopping3414

## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 10824.78 ymin: 25599.8 xmax: 42586.69 ymax: 48346.17
## CRS:            EPSG:3414
## First 10 features:
##        POI_ID SEQ_NUM FAC_TYPE                              POI_NAME
## 1  1132106213       1     6512                       SIN MING CENTRE
## 2   801758392       1     6512                           THE ADELPHI
## 3   842821452       1     6512              BOON LAY SHOPPING CENTRE
## 4  1193779191       1     6512                         KATONG SQUARE
## 5   801758399       1     6512                        SIM LIM SQUARE
## 6  1001450091       1     6512                 PEOPLE'S PARK COMPLEX
## 7  1069767253       1     6512 UNITED SQUARE GOLDHILL PLAZA ENTRANCE
## 8  1069767253       2     6512   UNITED SQUARE GOLDHILL PLZ ENTRANCE
## 9  1039562724       1     6512                             THE FORUM
## 10 1039562723       1     6512                            WATERFRONT
##            ST_NAME                  geometry
## 1      SIN MING RD POINT (28293.96 37316.31)
## 2       COLEMAN ST  POINT (30020.1 30404.29)
## 3      BOON LAY PL  POINT (14574.25 36539.3)
## 4    EAST COAST RD  POINT (35876.21 31925.9)
## 5  ROCHOR CANAL RD POINT (30225.98 31749.98)
## 6          PARK RD POINT (29076.35 29667.85)
## 7             <NA> POINT (29099.71 33301.34)
## 8             <NA> POINT (29099.71 33301.34)
## 9             <NA>  POINT (26574.5 26528.63)
## 10            <NA>  POINT (26574.5 26528.63)

The above output shows that all the sf data frames containing key urban features have been converted to EPSG 3414, which is the Singapore standard. This will allow our data to be projected accurately.

The business dataset can be further separated into business and industry. Industry (FAC_TYPE 9991) will include manufacturing and other primary and secondary businesses, whereas Business (FAC_TYPE 5000) will include the tertiary businesses.

industry3414 <- business3414 %>%
  filter(FAC_TYPE==9991)
business3414 <- business3414 %>%
  filter(FAC_TYPE==5000)
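Before splitting on FAC_TYPE, it can help to tabulate the codes so that no feature silently falls outside both filters. This is a minimal sketch on a toy table (the rows are invented; only the column names follow the report). In the report's own output the two subsets hold 110 and 6440 features, which together account for all 6550 imported features.

```r
library(dplyr)

# Toy stand-in for the business layer; the values are illustrative only.
toy_business <- tibble(
  POI_ID   = c(1, 2, 3, 4, 5),
  FAC_TYPE = c(5000, 5000, 9991, 5000, 9991)
)

# One row per facility code with the number of features carrying it.
fac_counts <- count(toy_business, FAC_TYPE)
fac_counts

# The two filters together should cover every row.
n_industry <- nrow(filter(toy_business, FAC_TYPE == 9991))
n_business <- nrow(filter(toy_business, FAC_TYPE == 5000))
n_industry + n_business == nrow(toy_business)
```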

summary(industry3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :9991  
##  1st Qu.:1.100e+09   1st Qu.:1.000   1st Qu.:9991  
##  Median :1.104e+09   Median :1.000   Median :9991  
##  Mean   :1.075e+09   Mean   :1.136   Mean   :9991  
##  3rd Qu.:1.139e+09   3rd Qu.:1.000   3rd Qu.:9991  
##  Max.   :1.203e+09   Max.   :2.000   Max.   :9991  
##                                                    
##                               POI_NAME                         ST_NAME  
##  TUAS TERRACE COMPLEX             : 3   INTERNATIONAL BUSINESS PARK: 7  
##  JTC TERRACE FACTORIES TUAS S ST 5: 2   TUAS AVE 13                : 5  
##  TUAS BAY INDUSTRIAL CENTRE       : 2   TUAS SOUTH ST 5            : 5  
##  TUAS ROAD TERRACE FACTORY        : 2   HENDERSON RD               : 3  
##  115A, 115B COMMONWEALTH DRIVE    : 1   TUAS RD                    : 3  
##  512,514 CHAI CHEE LANE           : 1   (Other)                    :74  
##  (Other)                          :99   NA's                       :13  
##           geometry  
##  POINT        :110  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##

3. Data Inspection

3.1 Examining population demographics

The resident data is inspected below using the summary function which allows us to see the data class for each column and its distribution.

summary(residentData)

##  planning_area        subzone           age_group             sex           
##  Length:883728      Length:883728      Length:883728      Length:883728     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##  type_of_dwelling   resident_count         year     
##  Length:883728      Min.   :   0.00   Min.   :2011  
##  Class :character   1st Qu.:   0.00   1st Qu.:2013  
##  Mode  :character   Median :   0.00   Median :2015  
##                     Mean   :  39.83   Mean   :2015  
##                     3rd Qu.:  10.00   3rd Qu.:2017  
##                     Max.   :2860.00   Max.   :2019

As seen above, all columns except for resident_count and year have the class character. As the median resident count is 0 and the third quartile (10) is below the mean (39.83), at least half of the rows record no residents and the distribution is strongly right-skewed. Note that each row is a combination of subzone, age group, sex, type of dwelling, and year, so a zero row does not by itself mean that the whole subzone is unpopulated. Many of the zeros do, however, come from subzones that are uninhabited (e.g. Central Catchment Area, Western Catchment Area) or that contain key transportation facilities of Singapore (e.g. Changi Bay) and hence have no resident population. Secondly, the data in this table covers the years 2011 to 2019. As we will be performing analysis on the latest (2019) data, we will remove the data for all other years (2011-2018).
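Restricting the table to 2019 and summing per subzone can be sketched as follows. The toy table reuses the report's column names, but its rows and the names toy_residents and residents_2019 are made up for illustration.

```r
library(dplyr)

# Toy stand-in for residentData; the values are illustrative only.
toy_residents <- tibble(
  subzone        = c("A", "A", "A", "B", "B"),
  age_group      = c("0_to_4", "5_to_9", "0_to_4", "0_to_4", "5_to_9"),
  resident_count = c(10, 0, 7, 0, 0),
  year           = c(2019, 2019, 2018, 2019, 2019)
)

# Keep only 2019, then add up the residents of each subzone.
residents_2019 <- toy_residents %>%
  filter(year == 2019) %>%
  group_by(subzone) %>%
  summarise(total_residents = sum(resident_count))

residents_2019
```

Aggregating first makes it possible to tell subzones with no residents at all (B in the toy data) apart from subzones that merely contain some zero rows (A).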

3.2 Examining urban functions

3.2.1 Businesses

summary(business3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :5000  
##  1st Qu.:9.967e+08   1st Qu.:1.000   1st Qu.:5000  
##  Median :1.084e+09   Median :1.000   Median :5000  
##  Mean   :9.919e+08   Mean   :1.019   Mean   :5000  
##  3rd Qu.:1.108e+09   3rd Qu.:1.000   3rd Qu.:5000  
##  Max.   :1.204e+09   Max.   :3.000   Max.   :5000  
##                                                    
##                        POI_NAME                  ST_NAME    
##  CAMBRIDGE INDUSTRIAL TRUST:   8   TAGORE LN         :  82  
##  DHL                       :   6   JOO KOON CIR      :  80  
##  NATIONAL OILWELL VARCO    :   6   GUL CIR           :  62  
##  ST MICROELECTRONICS       :   6   KAKI BUKIT PL     :  53  
##  CWT                       :   5   KAKI BUKIT IND TER:  52  
##  HALLIBURTON               :   5   (Other)           :5845  
##  (Other)                   :6404   NA's              : 266  
##           geometry   
##  POINT        :6440  
##  epsg:3414    :   0  
##  +proj=tmer...:   0  
##                      
##                      
##                      
##

From the above summary, we can see that there are 266 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset such that it contains distinct businesses. As each business is identified by its POI_ID, we will keep only the first row for each POI_ID in order to remove any duplicated data.

business3414_cleaned <- business3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)

summary(business3414_cleaned)

##      POI_ID             SEQ_NUM     FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1   Min.   :5000  
##  1st Qu.:9.967e+08   1st Qu.:1   1st Qu.:5000  
##  Median :1.084e+09   Median :1   Median :5000  
##  Mean   :9.930e+08   Mean   :1   Mean   :5000  
##  3rd Qu.:1.108e+09   3rd Qu.:1   3rd Qu.:5000  
##  Max.   :1.204e+09   Max.   :1   Max.   :5000  
##                                                
##                        POI_NAME                  ST_NAME    
##  CAMBRIDGE INDUSTRIAL TRUST:   8   TAGORE LN         :  80  
##  DHL                       :   6   JOO KOON CIR      :  79  
##  NATIONAL OILWELL VARCO    :   6   GUL CIR           :  62  
##  ST MICROELECTRONICS       :   6   KAKI BUKIT IND TER:  51  
##  CWT                       :   5   KAKI BUKIT PL     :  51  
##  HALLIBURTON               :   5   (Other)           :5744  
##  (Other)                   :6284   NA's              : 253  
##           geometry   
##  POINT        :6320  
##  epsg:3414    :   0  
##  +proj=tmer...:   0  
##                      
##                      
##                      
##
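To see how many rows a POI_ID-based deduplication removes (the summaries above show the business layer going from 6440 to 6320 features), the repeated identifiers can be counted first. This is a minimal sketch on a toy table; it uses dplyr's distinct() rather than distinct_at(), and the object names are hypothetical.

```r
library(dplyr)

# Toy table in which POI_ID 101 and 103 appear more than once.
toy_poi <- tibble(
  POI_ID  = c(101, 101, 102, 103, 103, 103),
  SEQ_NUM = c(1, 2, 1, 1, 2, 3)
)

# duplicated() flags every repeat after the first occurrence of an ID.
n_removed <- sum(duplicated(toy_poi$POI_ID))

# Keep the first row of each POI_ID together with its other columns.
toy_cleaned <- distinct(toy_poi, POI_ID, .keep_all = TRUE)

n_removed
toy_cleaned
```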

As the data is now clean, we will create a new table which has the subzone name for each business based on its location. Before that, we will create a new object which consists only of the subzone name and geometry, which makes it easier to perform relational joins and to assign subzones.

mpsz3414_2 <- mpsz3414 %>%
  rename("subzone"=SUBZONE_N)%>%
  select(subzone,geometry)

business_by_subzone <- st_intersection(mpsz3414_2,business3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Businesses=n())

summary(business_by_subzone)

##                    subzone      Businesses              geometry  
##  ALEXANDRA HILL        :  1   Min.   :  1.00   MULTIPOINT   :174  
##  ALEXANDRA NORTH       :  1   1st Qu.:  2.00   POINT        : 42  
##  ALJUNIED              :  1   Median :  7.00   epsg:3414    :  0  
##  ANAK BUKIT            :  1   Mean   : 29.26   +proj=tmer...:  0  
##  ANG MO KIO TOWN CENTRE:  1   3rd Qu.: 29.00                      
##  ANSON                 :  1   Max.   :303.00                      
##  (Other)               :210
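The summary above lists 216 subzones (174 MULTIPOINT and 42 POINT geometries), because st_intersection followed by group_by only returns subzones that contain at least one point. If subzones with a count of zero should be kept as well, one possible alternative is to count the points per polygon directly. The sketch below uses toy squares and points with invented coordinates; it is not the report's method.

```r
library(sf)

# Helper that builds a unit square with its lower-left corner at (x0, y0).
square <- function(x0, y0) {
  st_polygon(list(rbind(c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1),
                        c(x0, y0 + 1), c(x0, y0))))
}

# Three toy "subzones" and three toy points; subzone C contains no point.
toy_zones <- st_sf(
  subzone  = c("A", "B", "C"),
  geometry = st_sfc(square(0, 0), square(2, 0), square(4, 0), crs = 3414)
)
toy_points <- st_sf(
  POI_ID   = 1:3,
  geometry = st_sfc(st_point(c(0.5, 0.5)), st_point(c(0.2, 0.8)),
                    st_point(c(2.5, 0.5)), crs = 3414)
)

# st_intersects() returns, for each polygon, the indices of the points inside it;
# lengths() turns that list into a per-polygon count, including zeros.
toy_zones$Businesses <- lengths(st_intersects(toy_zones, toy_points))
toy_zones
```

This keeps the polygon geometry and one row per subzone, which can be convenient when the counts are later mapped.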

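As we have eliminated duplicates, we will now check whether any geometry is empty. st_is_empty tests each feature individually, whereas is_empty only tests whether the whole column has length zero.

any(st_is_empty(business_by_subzone$geometry))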

## [1] FALSE

As seen above, there are no empty values. Hence, we have thoroughly cleaned this dataset.
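When checking for empty geometries, it helps to distinguish a length-based test from a per-feature one. The following minimal sketch uses a toy geometry column (not the report's data) that contains one valid point and one empty point.

```r
library(sf)

# Toy geometry column: one ordinary point and one empty point.
toy_geom <- st_sfc(st_point(c(1, 2)), st_point(), crs = 3414)

# A length-based check only asks whether the column has zero elements.
length(toy_geom) == 0

# st_is_empty() inspects every feature; any() collapses it to a single answer.
empty_flags <- st_is_empty(toy_geom)
empty_flags
any(empty_flags)
```

A length-based check such as is_empty() from the tidyverse would report this column as not empty even though one of its features has no coordinates.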

3.2.2 Industries

summary(industry3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :9991  
##  1st Qu.:1.100e+09   1st Qu.:1.000   1st Qu.:9991  
##  Median :1.104e+09   Median :1.000   Median :9991  
##  Mean   :1.075e+09   Mean   :1.136   Mean   :9991  
##  3rd Qu.:1.139e+09   3rd Qu.:1.000   3rd Qu.:9991  
##  Max.   :1.203e+09   Max.   :2.000   Max.   :9991  
##                                                    
##                               POI_NAME                         ST_NAME  
##  TUAS TERRACE COMPLEX             : 3   INTERNATIONAL BUSINESS PARK: 7  
##  JTC TERRACE FACTORIES TUAS S ST 5: 2   TUAS AVE 13                : 5  
##  TUAS BAY INDUSTRIAL CENTRE       : 2   TUAS SOUTH ST 5            : 5  
##  TUAS ROAD TERRACE FACTORY        : 2   HENDERSON RD               : 3  
##  115A, 115B COMMONWEALTH DRIVE    : 1   TUAS RD                    : 3  
##  512,514 CHAI CHEE LANE           : 1   (Other)                    :74  
##  (Other)                          :99   NA's                       :13  
##           geometry  
##  POINT        :110  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##

As in the methodology used above, we keep only the first row for each POI_ID in order to remove duplicated values, because each industry is identified by a unique POI_ID.

industry3414_cleaned <- industry3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)

summary(industry3414_cleaned)

##      POI_ID             SEQ_NUM     FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1   Min.   :9991  
##  1st Qu.:1.100e+09   1st Qu.:1   1st Qu.:9991  
##  Median :1.104e+09   Median :1   Median :9991  
##  Mean   :1.072e+09   Mean   :1   Mean   :9991  
##  3rd Qu.:1.139e+09   3rd Qu.:1   3rd Qu.:9991  
##  Max.   :1.203e+09   Max.   :1   Max.   :9991  
##                                                
##                           POI_NAME                         ST_NAME  
##  115A, 115B COMMONWEALTH DRIVE: 1   INTERNATIONAL BUSINESS PARK: 4  
##  512,514 CHAI CHEE LANE       : 1   TUAS AVE 13                : 3  
##  AIRPORT LOGISTICS PARK       : 1   TUAS SOUTH ST 5            : 3  
##  ANG MO KIO INDUSTRIAL PARK 1 : 1   HENDERSON RD               : 2  
##  ANG MO KIO INDUSTRIAL PARK 2 : 1   PASIR RIS IND DR 1         : 2  
##  ANG MO KIO INDUSTRIAL PARK 3 : 1   (Other)                    :69  
##  (Other)                      :89   NA's                       :12  
##           geometry 
##  POINT        :95  
##  epsg:3414    : 0  
##  +proj=tmer...: 0  
##                    
##                    
##                    
##

We will now assign a subzone to each industry using st_intersection.

industry_by_subzone <- st_intersection(mpsz3414_2,industry3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Industries=n())

summary(industry_by_subzone)

##               subzone     Industries             geometry 
##  ALEXANDRA HILL   : 1   Min.   :1.000   MULTIPOINT   :21  
##  ALJUNIED         : 1   1st Qu.:1.000   POINT        :28  
##  BRADDELL         : 1   Median :1.000   epsg:3414    : 0  
##  BUKIT BATOK SOUTH: 1   Mean   :1.939   +proj=tmer...: 0  
##  BUKIT MERAH      : 1   3rd Qu.:2.000                     
##  CHANGI AIRPORT   : 1   Max.   :5.000                     
##  (Other)          :43

As we have eliminated duplicates, we will now check whether any geometry is empty, using st_is_empty to test each feature individually.

any(st_is_empty(industry_by_subzone$geometry))

## [1] FALSE

As seen above, there are no empty values. Hence, we have thoroughly cleaned this dataset.

3.2.3 Shopping infrastructure

summary(shopping3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :6512  
##  1st Qu.:9.656e+08   1st Qu.:1.000   1st Qu.:6512  
##  Median :1.070e+09   Median :1.000   Median :6512  
##  Mean   :8.934e+08   Mean   :1.108   Mean   :6512  
##  3rd Qu.:1.104e+09   3rd Qu.:1.000   3rd Qu.:6512  
##  Max.   :1.204e+09   Max.   :3.000   Max.   :6512  
##                                                    
##                              POI_NAME             ST_NAME   
##  BUKIT BATOK WEST SHOPPING CENTRE:  2   ORCHARD RD    : 37  
##  CHANGE ALLEY                    :  2   BUKIT TIMAH RD:  7  
##  FARMART CENTRE                  :  2   SCOTTS RD     :  7  
##  FORTUNE CENTRE                  :  2   BEACH RD      :  6  
##  HARBOUR FRONT CENTRE ENTRANCE   :  2   BENCOOLEN ST  :  5  
##  NEW WORLD CENTRE                :  2   (Other)       :347  
##  (Other)                         :499   NA's          :102  
##           geometry  
##  POINT        :511  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##

From the above summary, we can see that there are 102 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset such that it contains distinct shopping infrastructure in order to avoid repetitions. As each shopping infrastructure is identified by its POI_ID, we will keep only the first row for each POI_ID in order to remove any duplicated data.

shopping3414_cleaned <- shopping3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)

summary(shopping3414_cleaned)

##      POI_ID             SEQ_NUM     FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1   Min.   :6512  
##  1st Qu.:9.360e+08   1st Qu.:1   1st Qu.:6512  
##  Median :1.070e+09   Median :1   Median :6512  
##  Mean   :8.787e+08   Mean   :1   Mean   :6512  
##  3rd Qu.:1.105e+09   3rd Qu.:1   3rd Qu.:6512  
##  Max.   :1.204e+09   Max.   :1   Max.   :6512  
##                                                
##                              POI_NAME             ST_NAME   
##  BUKIT BATOK WEST SHOPPING CENTRE:  2   ORCHARD RD    : 31  
##  CHANGE ALLEY                    :  2   BUKIT TIMAH RD:  7  
##  FARMART CENTRE                  :  2   BEACH RD      :  6  
##  FORTUNE CENTRE                  :  2   SCOTTS RD     :  6  
##  NEW WORLD CENTRE                :  2   EAST COAST RD :  5  
##  SULTAN PLAZA                    :  2   (Other)       :325  
##  (Other)                         :446   NA's          : 78  
##           geometry  
##  POINT        :458  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##

As the data is now clean, we will create a new table which has the subzone name for each shopping infrastructure based on its location.

shopping_by_subzone <- st_intersection(mpsz3414_2,shopping3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Shopping_Infrastructures=n())

summary(shopping_by_subzone)

##                    subzone    Shopping_Infrastructures          geometry 
##  ADMIRALTY             :  1   Min.   : 1.000           MULTIPOINT   :77  
##  ALEXANDRA HILL        :  1   1st Qu.: 1.000           POINT        :70  
##  ALJUNIED              :  1   Median : 2.000           epsg:3414    : 0  
##  ANAK BUKIT            :  1   Mean   : 3.116           +proj=tmer...: 0  
##  ANG MO KIO TOWN CENTRE:  1   3rd Qu.: 3.500                             
##  ANSON                 :  1   Max.   :27.000                             
##  (Other)               :141
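
The st_intersection approach above returns only the subzones that contain at least one point. As an alternative sketch (not the method used in this report), counts that include zeros could be obtained with st_intersects. The objects zones and pts and the helper square are hypothetical toy data invented for illustration.

```r
library(sf)

# Hypothetical toy data: two unit squares standing in for subzones.
square <- function(x0) {
  st_polygon(list(rbind(c(x0, 0), c(x0 + 1, 0), c(x0 + 1, 1), c(x0, 1), c(x0, 0))))
}
zones <- st_sf(
  subzone  = c("A", "B"),
  geometry = st_sfc(square(0), square(2), crs = 3414)
)

# Three points, all falling inside zone A; zone B stays empty.
pts <- st_sfc(
  st_point(c(0.2, 0.2)), st_point(c(0.5, 0.5)), st_point(c(0.8, 0.3)),
  crs = 3414
)

# st_intersects() returns, for each polygon, the indices of the points
# it contains; lengths() turns that list into a count per polygon.
zones$n_points <- lengths(st_intersects(zones, pts))
print(st_drop_geometry(zones))
```

With this form, a subzone without any facility gets a count of 0 directly instead of being absent from the table.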

Having eliminated duplicates, we will now check that the geometry column is not empty.

is_empty(shopping_by_subzone$geometry)

## [1] FALSE

As seen above, the geometry column is not empty, so this dataset is ready for the later joins.
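
Note that is_empty() only reports whether an object has length zero. Checking each feature individually is a different test. A minimal sketch on hypothetical toy geometries (geoms is invented for illustration), using st_is_empty() from sf, might look like this:

```r
library(sf)

# Hypothetical toy geometry column: one valid point and one empty point.
geoms <- st_sfc(st_point(c(1, 1)), st_point(), crs = 3414)

# Length-based check: FALSE because the column holds two features.
column_is_empty <- length(geoms) == 0

# Feature-level check: TRUE for every feature with no coordinates.
feature_is_empty <- st_is_empty(geoms)

print(column_is_empty)
print(feature_is_empty)
print(any(feature_is_empty))
```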

3.2.4 Government Institutions

summary(govt3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :9525  
##  1st Qu.:1.010e+09   1st Qu.:1.000   1st Qu.:9525  
##  Median :1.058e+09   Median :1.000   Median :9525  
##  Mean   :1.006e+09   Mean   :1.111   Mean   :9651  
##  3rd Qu.:1.113e+09   3rd Qu.:1.000   3rd Qu.:9993  
##  Max.   :1.203e+09   Max.   :2.000   Max.   :9993  
##                                                    
##                             POI_NAME         ST_NAME             geometry  
##  ANG MO KIO TOWN COUNCIL        :  5   MAXWELL RD: 16   POINT        :443  
##  SEMBAWANG-NEE SOON TOWN COUNCIL:  3   THOMSON RD: 12   epsg:3414    :  0  
##  ALJUNIED HOUGANG TOWN COUNCIL  :  2   ORCHARD RD: 11   +proj=tmer...:  0  
##  ALJUNIED TOWN COUNCIL          :  2   COLLEGE RD: 10                      
##  BISHAN-TOA PAYOH TOWN COUNCIL  :  2   SCOTTS RD :  8                      
##  CENTRAL PROVIDENT FUND BOARD   :  2   (Other)   :358                      
##  (Other)                        :427   NA's      : 28

From the above summary, we can see that there are 28 NA values in ST_NAME. However, ST_NAME is not our variable of interest. We need to prepare the dataset so that each government institution appears only once. As each government institution is identified by its POI_ID, we will keep one row per POI_ID in order to remove any duplicated data.

govt3414_cleaned <- govt3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE)

summary(govt3414_cleaned)

##      POI_ID             SEQ_NUM     FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1   Min.   :9525  
##  1st Qu.:1.010e+09   1st Qu.:1   1st Qu.:9525  
##  Median :1.058e+09   Median :1   Median :9525  
##  Mean   :1.002e+09   Mean   :1   Mean   :9662  
##  3rd Qu.:1.112e+09   3rd Qu.:1   3rd Qu.:9993  
##  Max.   :1.203e+09   Max.   :1   Max.   :9993  
##                                                
##                             POI_NAME              ST_NAME   
##  ANG MO KIO TOWN COUNCIL        :  5   MAXWELL RD     : 16  
##  SEMBAWANG-NEE SOON TOWN COUNCIL:  3   COLLEGE RD     : 10  
##  ALJUNIED HOUGANG TOWN COUNCIL  :  2   ORCHARD RD     : 10  
##  ALJUNIED TOWN COUNCIL          :  2   THOMSON RD     : 10  
##  BISHAN-TOA PAYOH TOWN COUNCIL  :  2   NORTH BRIDGE RD:  7  
##  CENTRAL PROVIDENT FUND BOARD   :  2   (Other)        :316  
##  (Other)                        :378   NA's           : 25  
##           geometry  
##  POINT        :394  
##  epsg:3414    :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##
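
distinct_at() still works but has been superseded in recent dplyr releases; the same de-duplication can be written with distinct(). A small sketch on a hypothetical table (poi and its values are invented for illustration):

```r
library(dplyr)

# Hypothetical table with one repeated POI_ID (SEQ_NUM 2 is the repeat).
poi <- tibble(
  POI_ID   = c(101, 101, 202, 303),
  SEQ_NUM  = c(1, 2, 1, 1),
  POI_NAME = c("TOWN COUNCIL A", "TOWN COUNCIL A", "BOARD B", "AGENCY C")
)

# Keep the first row for every POI_ID and retain the other columns.
poi_cleaned <- poi %>%
  distinct(POI_ID, .keep_all = TRUE)

# Confirm that no POI_ID is repeated after cleaning.
stopifnot(!any(duplicated(poi_cleaned$POI_ID)))
print(poi_cleaned)
```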

As the data is now clean, we will create a new table that counts the government institutions in each subzone, assigning every point to a subzone based on its location.

govt_by_subzone <- st_intersection(mpsz3414_2,govt3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Govt_institutions=n())

summary(govt_by_subzone)

##                    subzone    Govt_institutions          geometry 
##  ALEXANDRA HILL        :  1   Min.   : 1.000    MULTIPOINT   :67  
##  ALJUNIED              :  1   1st Qu.: 1.000    POINT        :66  
##  ANAK BUKIT            :  1   Median : 2.000    epsg:3414    : 0  
##  ANG MO KIO TOWN CENTRE:  1   Mean   : 2.962    +proj=tmer...: 0  
##  ANSON                 :  1   3rd Qu.: 3.000                      
##  BALESTIER             :  1   Max.   :17.000                      
##  (Other)               :127

Having eliminated duplicates, we will now check that the geometry column is not empty.

is_empty(govt_by_subzone$geometry)

## [1] FALSE

As seen above, the geometry column is not empty, so this dataset is ready for the later joins.

3.2.5 Financial institutions

summary(financial3414)

##     LINK_ID              POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :1.161e+08   Min.   :3.644e+07   Min.   :1.000   Min.   :3578  
##  1st Qu.:8.594e+08   1st Qu.:1.097e+09   1st Qu.:1.000   1st Qu.:3578  
##  Median :9.140e+08   Median :1.113e+09   Median :1.000   Median :3578  
##  Mean   :9.092e+08   Mean   :1.088e+09   Mean   :1.008   Mean   :4397  
##  3rd Qu.:1.046e+09   3rd Qu.:1.132e+09   3rd Qu.:1.000   3rd Qu.:6000  
##  Max.   :1.224e+09   Max.   :1.204e+09   Max.   :2.000   Max.   :6000  
##                                                                        
##      POI_NAME   POI_LANGCD POI_NMTYPE   POI_ST_NUM   ST_NUM_FUL  ST_NFUL_LC 
##  OCBC    :788   ENG:3320   B:3293     1      : 212   29A :   1   ENG :   5  
##  UOB     :577              J:  27     10     :  76   333A:   1   NA's:3315  
##  POSB    :564                         2      :  53   77B :   1              
##  DBS     :282                         11     :  50   7A  :   1              
##  CITIBANK:153                         304    :  50   8A  :   1              
##  MAYBANK : 51                         (Other):2004   NA's:3315              
##  (Other) :905                         NA's   : 875                          
##               ST_NAME     ST_LANGCD   POI_ST_SD ACC_TYPE          PH_NUMBER   
##  ORCHARD RD       : 156   ENG :2926   L:1652    NA's:3320   63396666   :  52  
##  BEACH RD         :  44   NA's: 394   N:  24                63272265   :  16  
##  NORTH BRIDGE RD  :  39               R:1644                18002222121:  15  
##  COLLYER QUAY     :  38                                     18004383333:  13  
##  NEW UPP CHANGI RD:  35                                     18001111111:  11  
##  (Other)          :2614                                     (Other)    : 539  
##  NA's             : 394                                     NA's       :2674  
##     CHAIN_ID     NAT_IMPORT PRIVATE  IN_VICIN   NUM_PARENT    
##  Min.   :    0   N:3320     N:3320   N:3320   Min.   :0.0000  
##  1st Qu.: 2526                                1st Qu.:0.0000  
##  Median : 6918                                Median :0.0000  
##  Mean   : 5121                                Mean   :0.3807  
##  3rd Qu.: 6920                                3rd Qu.:1.0000  
##  Max.   :24982                                Max.   :2.0000  
##                                                               
##    NUM_CHILD           PERCFRREF       VANCITY_ID
##  Min.   :0.0000000   Min.   : 1.00   Min.   :0   
##  1st Qu.:0.0000000   1st Qu.:30.00   1st Qu.:0   
##  Median :0.0000000   Median :50.00   Median :0   
##  Mean   :0.0003012   Mean   :46.87   Mean   :0   
##  3rd Qu.:0.0000000   3rd Qu.:60.00   3rd Qu.:0   
##  Max.   :1.0000000   Max.   :99.00   Max.   :0   
##                      NA's   :1339                
##                                                                ACT_ADDR   
##  1 KIM SENG PROMENADE                              SINGAPORE 237994:   7  
##  3 TEMASEK BOULEVARD                               SINGAPORE 038983:   7  
##  530 LORONG 6 TOA PAYOH                            SINGAPORE 310530:   7  
##  2 JURONG EAST ST 21                               SINGAPORE 609601:   6  
##  3D RIVER VALLEY ROAD                              SINGAPORE 179023:   6  
##  (Other)                                                           : 243  
##  NA's                                                              :3044  
##  ACT_LANGCD               ACT_ST_NAM     ACT_ST_NUM       ACT_ADMIN   
##  ENG : 276   DUNEARN ROAD      :   7   1      :  20   INGAPORE :   1  
##  NA's:3044   KIM SENG PROMENADE:   7   3      :  11   SINGAPORE: 275  
##              LORONG 6 TOA PAYOH:   7   2      :  10   NA's     :3044  
##              PAYA LEBAR ROAD   :   7   50     :   8                   
##              TEMASEK BOULEVARD :   7   530    :   7                   
##              (Other)           : 241   (Other): 220                   
##              NA's              :3044   NA's   :3044                   
##    ACT_POSTAL            geometry   
##  038983 :   7   POINT        :3320  
##  237994 :   7   epsg:3414    :   0  
##  310530 :   7   +proj=tmer...:   0  
##  609601 :   7                       
##  179023 :   6                       
##  (Other): 242                       
##  NA's   :3044

Many variables in this dataset contain NA values. However, as our end goal is to find the number of financial institutions present in each subzone, we only need the distinct locations. Each distinct location of a financial institution has a distinct POI_ID, so we will keep one row per POI_ID.

financial3414_cleaned <- financial3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE) 

summary(financial3414_cleaned)

##     LINK_ID              POI_ID             SEQ_NUM     FAC_TYPE   
##  Min.   :1.161e+08   Min.   :3.644e+07   Min.   :1   Min.   :3578  
##  1st Qu.:8.594e+08   1st Qu.:1.097e+09   1st Qu.:1   1st Qu.:3578  
##  Median :9.140e+08   Median :1.113e+09   Median :1   Median :3578  
##  Mean   :9.099e+08   Mean   :1.088e+09   Mean   :1   Mean   :4384  
##  3rd Qu.:1.046e+09   3rd Qu.:1.132e+09   3rd Qu.:1   3rd Qu.:6000  
##  Max.   :1.224e+09   Max.   :1.204e+09   Max.   :1   Max.   :6000  
##                                                                    
##      POI_NAME   POI_LANGCD POI_NMTYPE   POI_ST_NUM   ST_NUM_FUL  ST_NFUL_LC 
##  OCBC    :788   ENG:3293   B:3293     1      : 209   29A :   1   ENG :   5  
##  UOB     :577              J:   0     10     :  74   333A:   1   NA's:3288  
##  POSB    :564                         2      :  53   77B :   1              
##  DBS     :282                         11     :  49   7A  :   1              
##  CITIBANK:153                         304    :  49   8A  :   1              
##  MAYBANK : 51                         (Other):1986   NA's:3288              
##  (Other) :878                         NA's   : 873                          
##               ST_NAME     ST_LANGCD   POI_ST_SD ACC_TYPE          PH_NUMBER   
##  ORCHARD RD       : 154   ENG :2900   L:1638    NA's:3293   63396666   :  52  
##  BEACH RD         :  44   NA's: 393   N:  24                63272265   :  16  
##  COLLYER QUAY     :  37               R:1631                18002222121:  15  
##  NORTH BRIDGE RD  :  37                                     18004383333:  13  
##  NEW UPP CHANGI RD:  35                                     18001111111:  11  
##  (Other)          :2593                                     (Other)    : 531  
##  NA's             : 393                                     NA's       :2655  
##     CHAIN_ID     NAT_IMPORT PRIVATE  IN_VICIN   NUM_PARENT    
##  Min.   :    0   N:3293     N:3293   N:3293   Min.   :0.0000  
##  1st Qu.: 2529                                1st Qu.:0.0000  
##  Median : 6918                                Median :0.0000  
##  Mean   : 5160                                Mean   :0.3799  
##  3rd Qu.: 6920                                3rd Qu.:1.0000  
##  Max.   :24982                                Max.   :2.0000  
##                                                               
##    NUM_CHILD           PERCFRREF       VANCITY_ID
##  Min.   :0.0000000   Min.   : 1.00   Min.   :0   
##  1st Qu.:0.0000000   1st Qu.:30.00   1st Qu.:0   
##  Median :0.0000000   Median :50.00   Median :0   
##  Mean   :0.0003037   Mean   :46.92   Mean   :0   
##  3rd Qu.:0.0000000   3rd Qu.:60.00   3rd Qu.:0   
##  Max.   :1.0000000   Max.   :99.00   Max.   :0   
##                      NA's   :1327                
##                                                                ACT_ADDR   
##  1 KIM SENG PROMENADE                              SINGAPORE 237994:   7  
##  3 TEMASEK BOULEVARD                               SINGAPORE 038983:   7  
##  530 LORONG 6 TOA PAYOH                            SINGAPORE 310530:   7  
##  2 JURONG EAST ST 21                               SINGAPORE 609601:   6  
##  3D RIVER VALLEY ROAD                              SINGAPORE 179023:   6  
##  (Other)                                                           : 242  
##  NA's                                                              :3018  
##  ACT_LANGCD               ACT_ST_NAM     ACT_ST_NUM       ACT_ADMIN   
##  ENG : 275   DUNEARN ROAD      :   7   1      :  20   INGAPORE :   1  
##  NA's:3018   KIM SENG PROMENADE:   7   3      :  11   SINGAPORE: 274  
##              LORONG 6 TOA PAYOH:   7   2      :  10   NA's     :3018  
##              TEMASEK BOULEVARD :   7   50     :   8                   
##              JURONG EAST ST 21 :   6   530    :   7                   
##              (Other)           : 241   (Other): 219                   
##              NA's              :3018   NA's   :3018                   
##    ACT_POSTAL            geometry   
##  038983 :   7   POINT        :3293  
##  237994 :   7   epsg:3414    :   0  
##  310530 :   7   +proj=tmer...:   0  
##  609601 :   7                       
##  179023 :   6                       
##  (Other): 241                       
##  NA's   :3018

We will now assign a subzone to each financial institution using the st_intersection method and count the institutions in each subzone.

financial_by_subzone <- st_intersection(mpsz3414_2,financial3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Financials=n())

summary(financial_by_subzone)

##                    subzone      Financials              geometry  
##  ADMIRALTY             :  1   Min.   :  1.00   MULTIPOINT   :223  
##  ALEXANDRA HILL        :  1   1st Qu.:  3.25   POINT        : 27  
##  ALJUNIED              :  1   Median :  8.00   epsg:3414    :  0  
##  ANAK BUKIT            :  1   Mean   : 13.17   +proj=tmer...:  0  
##  ANCHORVALE            :  1   3rd Qu.: 16.00                      
##  ANG MO KIO TOWN CENTRE:  1   Max.   :132.00                      
##  (Other)               :244

Having eliminated duplicates, we will now check that the geometry column is not empty.

is_empty(financial_by_subzone$geometry)

## [1] FALSE

As seen above, the geometry column is not empty, so this dataset is ready for the later joins.

3.2.6 Upmarket residential area

summary(private3414)

##      POI_ID             SEQ_NUM         FAC_TYPE   
##  Min.   :3.644e+07   Min.   :1.000   Min.   :9590  
##  1st Qu.:9.968e+08   1st Qu.:1.000   1st Qu.:9590  
##  Median :1.070e+09   Median :1.000   Median :9590  
##  Mean   :1.052e+09   Mean   :1.007   Mean   :9590  
##  3rd Qu.:1.105e+09   3rd Qu.:1.000   3rd Qu.:9590  
##  Max.   :1.204e+09   Max.   :2.000   Max.   :9590  
##                                                    
##                     POI_NAME                 ST_NAME              geometry   
##  BLISSFUL VIEW          :   3   PASIR PANJANG RD :  45   POINT        :3604  
##  CLEMENTI PARK          :   3   UPP EAST COAST RD:  29   epsg:3414    :   0  
##  COMPASSVALE VIEW       :   3   LOR K TELOK KURAU:  26   +proj=tmer...:   0  
##  KING'S MANSION         :   3   BUKIT TIMAH RD   :  24                       
##  MIDPOINT PROPERTIES    :   3   EAST COAST RD    :  23                       
##  NEE SOON CENTRAL ESTATE:   3   (Other)          :3412                       
##  (Other)                :3586   NA's             :  45

In this dataset only ST_NAME contains NA values, and it is not a variable of interest. As our end goal is to find the number of upmarket residential locations present in each subzone, we only need the distinct locations. Each distinct location of a private property has a distinct POI_ID, so we will keep one row per POI_ID.

private3414_cleaned <- private3414 %>%
  distinct_at(vars(POI_ID),.keep_all = TRUE) 

summary(private3414_cleaned)

##      POI_ID             SEQ_NUM     FAC_TYPE                       POI_NAME   
##  Min.   :3.644e+07   Min.   :1   Min.   :9590   BLISSFUL VIEW          :   3  
##  1st Qu.:9.968e+08   1st Qu.:1   1st Qu.:9590   CLEMENTI PARK          :   3  
##  Median :1.070e+09   Median :1   Median :9590   COMPASSVALE VIEW       :   3  
##  Mean   :1.052e+09   Mean   :1   Mean   :9590   KING'S MANSION         :   3  
##  3rd Qu.:1.105e+09   3rd Qu.:1   3rd Qu.:9590   MIDPOINT PROPERTIES    :   3  
##  Max.   :1.204e+09   Max.   :1   Max.   :9590   NEE SOON CENTRAL ESTATE:   3  
##                                                 (Other)                :3562  
##               ST_NAME              geometry   
##  PASIR PANJANG RD :  45   POINT        :3580  
##  UPP EAST COAST RD:  28   epsg:3414    :   0  
##  LOR K TELOK KURAU:  26   +proj=tmer...:   0  
##  BUKIT TIMAH RD   :  24                       
##  EAST COAST RD    :  23                       
##  (Other)          :3391                       
##  NA's             :  43

We will now assign a subzone to each private property location using the st_intersection method and count the properties in each subzone.

private_by_subzone <- st_intersection(mpsz3414_2,private3414_cleaned) %>%
  group_by(subzone) %>%
  summarise(Private_properties=n()) 

summary(private_by_subzone)

##             subzone    Private_properties          geometry  
##  ADMIRALTY      :  1   Min.   :  1.00     MULTIPOINT   :213  
##  ALEXANDRA HILL :  1   1st Qu.:  3.00     POINT        : 26  
##  ALEXANDRA NORTH:  1   Median :  7.00     epsg:3414    :  0  
##  ALJUNIED       :  1   Mean   : 14.98     +proj=tmer...:  0  
##  ANAK BUKIT     :  1   3rd Qu.: 14.50                        
##  ANCHORVALE     :  1   Max.   :215.00                        
##  (Other)        :233

Duplicates were already removed with distinct_at above, so we will now only check that the geometry column is not empty.

is_empty(private_by_subzone$geometry)

## [1] FALSE

As seen above, the geometry column is not empty, so this dataset is ready for the later joins.

3.3 Identifying missing values

sum(complete.cases(residentData))

## [1] 883728

sum(!complete.cases(residentData))

## [1] 0

As seen above, none of the 883728 observations contains an NA value.
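
complete.cases() reports on whole rows. When some rows do turn out to be incomplete, a per-column count shows where the gaps are. A minimal sketch on a hypothetical data frame (toy and its values are invented for illustration):

```r
# Hypothetical data frame with one missing value in resident_count.
toy <- data.frame(
  subzone        = c("A", "B", "C"),
  resident_count = c(10, NA, 30)
)

# Number of rows with and without any NA.
n_complete   <- sum(complete.cases(toy))
n_incomplete <- sum(!complete.cases(toy))

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))

print(c(complete = n_complete, incomplete = n_incomplete))
print(na_per_column)
```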

4. Transforming data

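Preparing the subzone layer (renaming SUBZONE_N to subzone and converting SHAPE_Area to square kilometres) so that the demographic data can be joined to it in sf format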

mpsz3414 <- mpsz3414%>%rename("subzone"=SUBZONE_N)
mpsz3414_1 <- mpsz3414 %>%
  select(subzone,SHAPE_Area, geometry)%>%
  mutate(SHAPE_Area=SHAPE_Area/1000000)

4.1 Data Wrangling for demographics

one <- residentData %>%
  spread(age_group, resident_count) %>%
  mutate(YOUNG=rowSums(.[6:9])+rowSums(.[15])) %>%
  mutate(ACTIVE=rowSums(.[10:14])+rowSums(.[16:18])) %>%
  mutate(AGED=rowSums(.[19:24])) %>%
  select(subzone,type_of_dwelling,YOUNG,ACTIVE,AGED) %>%
  group_by(subzone) %>%
  summarise(YOUNG = sum(YOUNG), AGED= sum(AGED), ACTIVE = sum(ACTIVE))%>%
  mutate(TOTAL=YOUNG+AGED+ACTIVE)

one$subzone <- toupper(one$subzone)
one <- left_join(one,mpsz3414_1)
one <- one %>%
  mutate(DENSITY=TOTAL/SHAPE_Area)


two <- residentData %>%
  spread(type_of_dwelling,resident_count)

names(two)<-str_replace_all(names(two), c(" " = "_" , "-" = "" ))
colnames(two)[11] <- "HUDC_Flats"
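
The column positions used above (.[6:9], .[15] and so on) depend on how spread() happens to order the age-group columns, so they are easy to break. A hedged alternative is to classify the age groups by name while the data is still in long format. The sketch below uses a hypothetical miniature table; the age-group labels and the cut-offs in young_groups and aged_groups are assumptions for illustration and should be replaced by the labels and cut-offs actually used with residentData.

```r
library(dplyr)

# Hypothetical long-format rows for a single subzone.
toy <- tibble(
  subzone        = "A",
  age_group      = c("0_to_4", "5_to_9", "25_to_29", "50_to_54", "65_to_69"),
  resident_count = c(10, 20, 30, 40, 50)
)

# Assumed label sets; anything not listed is treated as ACTIVE.
young_groups <- c("0_to_4", "5_to_9", "10_to_14", "15_to_19", "20_to_24")
aged_groups  <- c("65_to_69", "70_to_74", "75_to_79", "80_to_84",
                  "85_to_89", "90_and_over")

by_age <- toy %>%
  mutate(band = case_when(
    age_group %in% young_groups ~ "YOUNG",
    age_group %in% aged_groups  ~ "AGED",
    TRUE                        ~ "ACTIVE"
  )) %>%
  group_by(subzone) %>%
  summarise(
    YOUNG  = sum(resident_count[band == "YOUNG"]),
    ACTIVE = sum(resident_count[band == "ACTIVE"]),
    AGED   = sum(resident_count[band == "AGED"])
  ) %>%
  mutate(TOTAL = YOUNG + ACTIVE + AGED)

print(by_age)
```

The source file covers 2011 to 2019, so if the table has a year column (an assumption here), filtering to one year before summing would keep the totals from adding all years together.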

4.2 Data Wrangling for urban functions

three <- two %>%
  group_by(subzone) %>%
  summarise(Condominiums_and_Other_Apartments=sum(Condominiums_and_Other_Apartments),
            HDB_1_and_2Room_Flats=sum(HDB_1_and_2Room_Flats),
            HDB_3Room_Flats=sum(HDB_3Room_Flats),
            HDB_4Room_Flats=sum(HDB_4Room_Flats),
            HDB_5Room_and_Executive_Flats= sum(HDB_5Room_and_Executive_Flats),
            HUDC_Flats = sum(HUDC_Flats),
            Landed_Properties = sum(Landed_Properties),
            Others = sum(Others)) %>%
  mutate(HDB_3_and_4Room_Flats=HDB_3Room_Flats+HDB_4Room_Flats) %>%
  select(subzone,HDB_1_and_2Room_Flats,HDB_3_and_4Room_Flats,HDB_5Room_and_Executive_Flats,Condominiums_and_Other_Apartments,Landed_Properties)

three$subzone <- toupper(three$subzone)

4.3 Combining all the data into one table

First, we will create a base table which has the subzone name and geometry.

data_by_subzones <- mpsz3414 %>%
  select(OBJECTID, subzone,geometry)

We will now convert all the sf tables into plain data.frame objects by dropping their geometry columns. This allows us to perform relational joins on the subzone name.

st_geometry(private_by_subzone) <- NULL
st_geometry(shopping_by_subzone) <- NULL
st_geometry(business_by_subzone) <- NULL
st_geometry(industry_by_subzone) <- NULL
st_geometry(govt_by_subzone) <- NULL
st_geometry(financial_by_subzone) <- NULL
one$geometry <- NULL

Now we will join all the urban function counts to this table.

data_by_subzones <- left_join(data_by_subzones,private_by_subzone)
data_by_subzones <- left_join(data_by_subzones,shopping_by_subzone)
data_by_subzones <- left_join(data_by_subzones,business_by_subzone)
data_by_subzones <- left_join(data_by_subzones,industry_by_subzone)
data_by_subzones <- left_join(data_by_subzones,govt_by_subzone)
data_by_subzones <- left_join(data_by_subzones,financial_by_subzone)

Before joining the demographic data, we will examine the data using the summary function.

summary(data_by_subzones)

##     OBJECTID                subzone    Private_properties
##  Min.   :  1.0   ADMIRALTY      :  1   Min.   :  1.00    
##  1st Qu.: 81.5   AIRPORT ROAD   :  1   1st Qu.:  3.00    
##  Median :162.0   ALEXANDRA HILL :  1   Median :  7.00    
##  Mean   :162.0   ALEXANDRA NORTH:  1   Mean   : 14.98    
##  3rd Qu.:242.5   ALJUNIED       :  1   3rd Qu.: 14.50    
##  Max.   :323.0   ANAK BUKIT     :  1   Max.   :215.00    
##                  (Other)        :317   NA's   :84        
##  Shopping_Infrastructures   Businesses       Industries    Govt_institutions
##  Min.   : 1.000           Min.   :  1.00   Min.   :1.000   Min.   : 1.000   
##  1st Qu.: 1.000           1st Qu.:  2.00   1st Qu.:1.000   1st Qu.: 1.000   
##  Median : 2.000           Median :  7.00   Median :1.000   Median : 2.000   
##  Mean   : 3.116           Mean   : 29.26   Mean   :1.939   Mean   : 2.962   
##  3rd Qu.: 3.500           3rd Qu.: 29.00   3rd Qu.:2.000   3rd Qu.: 3.000   
##  Max.   :27.000           Max.   :303.00   Max.   :5.000   Max.   :17.000   
##  NA's   :176              NA's   :107      NA's   :274     NA's   :190      
##    Financials              geometry  
##  Min.   :  1.00   MULTIPOLYGON :323  
##  1st Qu.:  3.25   epsg:3414    :  0  
##  Median :  8.00   +proj=tmer...:  0  
##  Mean   : 13.17                      
##  3rd Qu.: 16.00                      
##  Max.   :132.00                      
##  NA's   :73

As seen in the above output, every urban function variable has NA values. This is because many subzones do not contain a given urban function at all, so they have no matching row in the join. To make the data accurate, we will replace the NA values with 0. Note that we had already checked each urban function dataset during cleaning, hence these NA values have arisen only from the relational joins.

data_by_subzones[is.na(data_by_subzones)]=0
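
The joins above match on the shared subzone column implicitly, and the NA replacement touches every column. A slightly more explicit sketch, on hypothetical toy tables (base_tbl and counts are invented names), states the join key and fills only the numeric count column:

```r
library(dplyr)

# Hypothetical base table of subzones and a count table missing zone "C".
base_tbl <- tibble(subzone = c("A", "B", "C"))
counts   <- tibble(subzone = c("A", "B"), Financials = c(5, 12))

joined <- base_tbl %>%
  left_join(counts, by = "subzone") %>%
  # A subzone absent from `counts` has no such facility, so NA means 0.
  mutate(across(where(is.numeric), ~ replace(.x, is.na(.x), 0)))

print(joined)
```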

Joining demographic data

data_by_subzones <- left_join(data_by_subzones,one)
data_by_subzones <- left_join(data_by_subzones,three)

Examining the data

summary(data_by_subzones)

##     OBJECTID       subzone          Private_properties Shopping_Infrastructures
##  Min.   :  1.0   Length:323         Min.   :  0.00     Min.   : 0.000          
##  1st Qu.: 81.5   Class :character   1st Qu.:  0.00     1st Qu.: 0.000          
##  Median :162.0   Mode  :character   Median :  4.00     Median : 0.000          
##  Mean   :162.0                      Mean   : 11.08     Mean   : 1.418          
##  3rd Qu.:242.5                      3rd Qu.: 11.00     3rd Qu.: 1.000          
##  Max.   :323.0                      Max.   :215.00     Max.   :27.000          
##    Businesses       Industries     Govt_institutions   Financials   
##  Min.   :  0.00   Min.   :0.0000   Min.   : 0.00     Min.   :  0.0  
##  1st Qu.:  0.00   1st Qu.:0.0000   1st Qu.: 0.00     1st Qu.:  1.0  
##  Median :  2.00   Median :0.0000   Median : 0.00     Median :  5.0  
##  Mean   : 19.57   Mean   :0.2941   Mean   : 1.22     Mean   : 10.2  
##  3rd Qu.: 13.50   3rd Qu.:0.0000   3rd Qu.: 1.00     3rd Qu.: 13.0  
##  Max.   :303.00   Max.   :5.0000   Max.   :17.00     Max.   :132.0  
##      YOUNG             AGED            ACTIVE           TOTAL        
##  Min.   :     0   Min.   :     0   Min.   :     0   Min.   :      0  
##  1st Qu.:     0   1st Qu.:     0   1st Qu.:     0   1st Qu.:      0  
##  Median : 10740   Median :  4440   Median : 22420   Median :  36430  
##  Mean   : 30969   Mean   : 12916   Mean   : 65096   Mean   : 108981  
##  3rd Qu.: 40065   3rd Qu.: 20880   3rd Qu.: 91505   3rd Qu.: 150475  
##  Max.   :360610   Max.   :129850   Max.   :741760   Max.   :1232220  
##    SHAPE_Area          DENSITY       HDB_1_and_2Room_Flats
##  Min.   : 0.03944   Min.   :     0   Min.   :    0        
##  1st Qu.: 0.62826   1st Qu.:     0   1st Qu.:    0        
##  Median : 1.22989   Median : 41420   Median :    0        
##  Mean   : 2.42088   Mean   : 94944   Mean   : 4323        
##  3rd Qu.: 2.10648   3rd Qu.:176119   3rd Qu.: 3385        
##  Max.   :69.74830   Max.   :435403   Max.   :48330        
##  HDB_3_and_4Room_Flats HDB_5Room_and_Executive_Flats
##  Min.   :     0        Min.   :     0               
##  1st Qu.:     0        1st Qu.:     0               
##  Median :     0        Median :     0               
##  Mean   : 53600        Mean   : 29951               
##  3rd Qu.: 86575        3rd Qu.: 31230               
##  Max.   :709850        Max.   :448060               
##  Condominiums_and_Other_Apartments Landed_Properties          geometry  
##  Min.   :     0                    Min.   :     0    MULTIPOLYGON :323  
##  1st Qu.:     0                    1st Qu.:     0    epsg:3414    :  0  
##  Median :  1510                    Median :     0    +proj=tmer...:  0  
##  Mean   : 13123                    Mean   :  6989                       
##  3rd Qu.: 18440                    3rd Qu.:  3745                       
##  Max.   :144470                    Max.   :172520

As we no longer require the subzone area, the object ID, or the total population, we will remove them. The subzone names are moved to the row names.

data_by_subzones$SHAPE_Area = NULL
data_by_subzones$OBJECTID = NULL
data_by_subzones$TOTAL = NULL
rownames(data_by_subzones) <- data_by_subzones$subzone
data_by_subzones$subzone <- NULL

From the above summary, we have 15 variables attached to every subzone for analysis. Before we perform hierarchical cluster analysis, we will perform univariate analysis in order to understand the scale and spread of the data for each of the 15 variables.

Before we start the analysis of each variable, however, we will first examine the subzones.

tm_shape(data_by_subzones)+
  tm_polygons()+
  tm_borders()

As seen above, all the subzones of Singapore are included. Before continuing with the socio-economic analysis, we will look at a few subzones specifically to see whether any features are present in them. These subzones include water catchment areas, which predominantly consist of water bodies and forests, and islands which are disconnected from mainland Singapore.

Excluding the island subzones from the analysis

data_by_subzones = data_by_subzones[ !(row.names(data_by_subzones) %in% c("SUDONG","SEMAKAU", "SOUTHERN GROUP","NORTH-EASTERN ISLANDS","PULAU SELETAR")), ]

5. Performing Univariate Analysis

5.1 Understanding data through histograms

The code chunk below defines one function for histograms and one for box plots so that we don't have to keep repeating the plotting code.

# Histogram of one variable; `attribute` is the column name as a string.
plot_data <- function(maindata,attribute){
  return(ggplot(data=maindata, 
             aes_string(x= attribute)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue"))
}

# Box plot of one variable; `attribute` is the column name as a string.
box_plot <- function(maindata,attribute){
  return(ggplot(data=maindata, aes_string(x=attribute)) +
  geom_boxplot(color="black", fill="light blue"))
}

The code below stores the histogram for each variable in its own object.

private_plot <- plot_data(data_by_subzones,"Private_properties")
shopping_plot <- plot_data(data_by_subzones,"Shopping_Infrastructures")
business_plot <- plot_data(data_by_subzones,"Businesses")
industry_plot <- plot_data(data_by_subzones,"Industries")
govt_plot <- plot_data(data_by_subzones,"Govt_institutions")
financial_plot <- plot_data(data_by_subzones,"Financials")
young_plot <- plot_data(data_by_subzones,"YOUNG")
aged_plot <- plot_data(data_by_subzones,"AGED")
active_plot <- plot_data(data_by_subzones,"ACTIVE")
density_plot <- plot_data(data_by_subzones,"DENSITY")
HDB1_2_plot <- plot_data(data_by_subzones,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(data_by_subzones,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(data_by_subzones,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(data_by_subzones,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(data_by_subzones,"Landed_Properties")
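
Writing one assignment per variable makes it easy to pair a plot name with the wrong column. As a sketch of an alternative (not the approach used in this report), the plots can be built in a loop over the column names. The toy data frame and the helper name plot_hist are hypothetical, and .data[[...]] is used in place of aes_string(), which newer ggplot2 versions mark as deprecated.

```r
library(ggplot2)

# Hypothetical stand-in for data_by_subzones with two numeric columns.
toy <- data.frame(
  Financials         = c(0, 1, 5, 13, 132),
  Private_properties = c(0, 0, 4, 11, 215)
)

# Histogram helper; `attribute` is a column name given as a string.
plot_hist <- function(maindata, attribute) {
  ggplot(maindata, aes(x = .data[[attribute]])) +
    geom_histogram(bins = 20, color = "black", fill = "light blue") +
    labs(x = attribute)
}

# One plot per column, stored in a list named after the columns.
hist_plots <- lapply(names(toy), function(v) plot_hist(toy, v))
names(hist_plots) <- names(toy)
```

The resulting list can be passed to ggarrange() through its plotlist argument instead of listing every plot by hand.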

To visualise the graphs, we arrange them in a grid and plot them.

ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
          young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
          ncol = 3, 
          nrow = 2)

[Figure: histograms of the 15 variables, printed by ggarrange as three pages of up to 3 x 2 panels]

As seen above, all the variables are right-skewed, with most subzones near zero and a long tail of large values, and their scales vary widely. Before deciding whether we need to standardise the data, we will draw box-and-whisker plots in order to identify the outliers.

5.2 Understanding data through box plots

private_plot <- box_plot(data_by_subzones,"Private_properties")
shopping_plot <- box_plot(data_by_subzones,"Shopping_Infrastructures")
business_plot <- box_plot(data_by_subzones,"Businesses")
industry_plot <- box_plot(data_by_subzones,"Industries")
govt_plot <- box_plot(data_by_subzones,"Govt_institutions")
financial_plot <- box_plot(data_by_subzones,"Financials")
young_plot <- box_plot(data_by_subzones,"YOUNG")
aged_plot <- box_plot(data_by_subzones,"AGED")
active_plot <- box_plot(data_by_subzones,"ACTIVE")
density_plot <- box_plot(data_by_subzones,"DENSITY")
HDB1_2_plot <- box_plot(data_by_subzones,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- box_plot(data_by_subzones,"HDB_3_and_4Room_Flats")
HDB5_plot <- box_plot(data_by_subzones,"HDB_5Room_and_Executive_Flats")
condo_plot <- box_plot(data_by_subzones,"Condominiums_and_Other_Apartments")
landed_plot <- box_plot(data_by_subzones,"Landed_Properties")

ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
          young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
          ncol = 3, 
          nrow = 2)

[Figure: box plots of the 15 variables, printed by ggarrange as three pages of up to 3 x 2 panels]

Most of the variables are right-skewed and contain multiple outliers. Because the variables are also measured on widely different scales, we will normalise them with the min-max method before running the hierarchical cluster analysis. This method is preferred here over z-scores because none of the histograms resembles a normal distribution.

6. Multivariate Analysis

6.1 Standardisation of data

Standardising the data requires our current table to be converted from an sf object to a plain data.frame. The code below first keeps a spatial copy in a new variable, data_by_subzones_sf, and then drops the geometry from data_by_subzones.

data_by_subzones_sf <- data_by_subzones
st_geometry(data_by_subzones) <- NULL

The code below standardises the data using the min-max method, which scales the data from 0 to 1.

data_by_subzones.std <- normalize(data_by_subzones)
summary(data_by_subzones.std)

##  Private_properties Shopping_Infrastructures   Businesses      
##  Min.   :0.000000   Min.   :0.00000          Min.   :0.000000  
##  1st Qu.:0.004651   1st Qu.:0.00000          1st Qu.:0.000000  
##  Median :0.018605   Median :0.00000          Median :0.006601  
##  Mean   :0.052362   Mean   :0.05334          Mean   :0.065591  
##  3rd Qu.:0.051163   3rd Qu.:0.03704          3rd Qu.:0.046205  
##  Max.   :1.000000   Max.   :1.00000          Max.   :1.000000  
##    Industries      Govt_institutions   Financials           YOUNG        
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.007576   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.037879   Median :0.03369  
##  Mean   :0.05975   Mean   :0.07288   Mean   :0.078450   Mean   :0.08723  
##  3rd Qu.:0.00000   3rd Qu.:0.05882   3rd Qu.:0.098485   3rd Qu.:0.11727  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##       AGED             ACTIVE           DENSITY       HDB_1_and_2Room_Flats
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.0000   Min.   :0.00000      
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000      
##  Median :0.03808   Median :0.03238   Median :0.1080   Median :0.00000      
##  Mean   :0.10104   Mean   :0.08914   Mean   :0.2215   Mean   :0.09085      
##  3rd Qu.:0.16290   3rd Qu.:0.12559   3rd Qu.:0.4063   3rd Qu.:0.07273      
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.0000   Max.   :1.00000      
##  HDB_3_and_4Room_Flats HDB_5Room_and_Executive_Flats
##  Min.   :0.0000        Min.   :0.00000              
##  1st Qu.:0.0000        1st Qu.:0.00000              
##  Median :0.0000        Median :0.00000              
##  Mean   :0.0767        Mean   :0.06790              
##  3rd Qu.:0.1259        3rd Qu.:0.07235              
##  Max.   :1.0000        Max.   :1.00000              
##  Condominiums_and_Other_Apartments Landed_Properties
##  Min.   :0.00000                   Min.   :0.00000  
##  1st Qu.:0.00000                   1st Qu.:0.00000  
##  Median :0.01135                   Median :0.00000  
##  Mean   :0.09226                   Mean   :0.04115  
##  3rd Qu.:0.13207                   3rd Qu.:0.02385  
##  Max.   :1.00000                   Max.   :1.00000

As seen in the above summary, all the variables are scaled: each has a minimum value of 0 and a maximum value of 1.
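
For reference, min-max scaling maps each value x to (x - min) / (max - min). The sketch below reproduces that idea in base R on a made-up data frame. The names min_max and toy are illustrative and not part of the original analysis, and the sketch makes no claim about how normalize() is implemented internally.

```r
# Min-max scaling sketch: rescale each numeric column to the range [0, 1].
# `min_max` and `toy` are hypothetical names used only for illustration.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has zero range; return zeros to avoid dividing by zero.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

toy <- data.frame(
  shops   = c(0, 2, 10, 4),
  density = c(100, 2500, 40000, 100)
)

toy_std <- as.data.frame(lapply(toy, min_max))
print(summary(toy_std))
```

Because a single extreme value defines the maximum, the remaining values are compressed towards 0, which is consistent with the low medians in the summary above.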

6.2 Correlation plot

In order to perform hierarchical cluster analysis, we need to ensure that our variables are not highly correlated. We would prefer to have a mixture of high, low, and moderate values across the different variables so that our clusters are well differentiated. Highly correlated variables carry largely the same information, so they can hinder the cluster analysis. To examine the correlation, we will draw a correlation plot which shows the correlation coefficient for each pair of variables.

cluster_vars.cor = cor(data_by_subzones.std[,1:15])
corrplot.mixed(cluster_vars.cor,
               lower = "ellipse",
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black",
               tl.cex = 0.5,
               number.cex = 0.8)

The above matrix shows the correlation coefficient for every pair of variables. We are now interested in the pairs which have a high correlation coefficient. If a pair of variables is highly correlated, we will eliminate one of the variables in the pair from our cluster analysis. The variable to be retained will be chosen for its practical usefulness or actionability. We will adopt a threshold of 0.80 to classify a pair of variables as highly correlated. We will broadly classify our variables into two categories and then perform variable elimination. The categories are:

Urban functions: Private properties, Shopping Infrastructure, Businesses, Industries, Government Institutions, Financials.
Population demographic: Young, Active, Aged, Density, HDB_1_and_2Room_Flats, HDB_3_and_4Room_Flats, HDB_5Room_and_Executive_Flats, Condominiums_and_Other_Apartments, Landed_Properties

In the first category (Urban functions), no pair of variables is highly correlated, i.e. no pair has a correlation coefficient above 0.80.

In the second category, several pairs of variables have a correlation coefficient higher than 0.80. They are as follows:

Var1	Var2	Correlation
YOUNG	AGED	0.85
YOUNG	ACTIVE	0.99
YOUNG	HDB 3_4 ROOM	0.91
YOUNG	HDB_5_EXEC	0.92
AGED	ACTIVE	0.91
AGED	HDB 3_4 ROOM	0.90
ACTIVE	HDB 3_4 ROOM	0.95
ACTIVE	HDB_5_EXEC	0.88
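
A table like the one above can also be produced programmatically instead of being read off the plot. The following is a minimal sketch on simulated data. The helper high_cor_pairs and the data frame toy are hypothetical names, not objects from this report.

```r
# Sketch: list variable pairs whose absolute correlation exceeds a threshold.
# `high_cor_pairs` and `toy` are hypothetical names; the data are simulated.
high_cor_pairs <- function(df, threshold = 0.80) {
  cm <- cor(df)
  # Keep only the upper triangle so each pair is reported once.
  idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)
  data.frame(
    Var1 = rownames(cm)[idx[, "row"]],
    Var2 = colnames(cm)[idx[, "col"]],
    Correlation = round(cm[idx], 2),
    stringsAsFactors = FALSE
  )
}

set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),  # nearly a copy of `a`, so highly correlated
  c = rnorm(200)                 # independent noise
)

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

Applied to the standardised data, a call of this form would be expected to return the pairs shown in the table, under the assumption that the same 0.80 threshold is used.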

Based on the pairs listed in the table, we are going to eliminate the variables AGED and ACTIVE. This is because both of these variables are highly correlated with YOUNG and with HDB 3 and 4 room flats. To further understand the relationships in the data, we will perform principal component analysis (PCA). This technique reduces the dimensionality of the data and increases interpretability while minimizing information loss. It does so by creating new uncorrelated variables that successively maximize variance.

6.3 Performing PCA

res.pca <- PCA(data_by_subzones.std[,1:12],  graph = FALSE)
fviz_screeplot(res.pca, addlabels = TRUE, ylim = c(0, 80))

summary(res.pca)

## 
## Call:
## PCA(X = data_by_subzones.std[, 1:12], graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
## Variance               4.939   2.170   1.492   0.937   0.760   0.549   0.470
## % of var.             41.161  18.081  12.429   7.804   6.329   4.572   3.921
## Cumulative % of var.  41.161  59.243  71.672  79.476  85.806  90.378  94.299
##                        Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.307   0.215   0.112   0.048   0.002
## % of var.              2.557   1.795   0.933   0.403   0.013
## Cumulative % of var.  96.856  98.651  99.584  99.987 100.000
## 
## Individuals (the 10 first)
##                              Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2
## PEOPLE'S PARK            |  1.721 | -1.426  0.129  0.687 | -0.068  0.001  0.002
## BUKIT MERAH              |  3.668 | -1.575  0.158  0.184 |  0.535  0.042  0.021
## CHINATOWN                |  2.871 |  0.935  0.056  0.106 |  2.338  0.792  0.663
## PHILLIP                  |  1.969 | -1.643  0.172  0.696 |  0.294  0.013  0.022
## RAFFLES PLACE            |  8.434 | -0.072  0.000  0.000 |  7.295  7.712  0.748
## CHINA SQUARE             |  2.421 | -0.763  0.037  0.099 |  1.734  0.436  0.513
## TIONG BAHRU              |  1.600 |  0.512  0.017  0.102 | -0.809  0.095  0.256
## BAYFRONT SUBZONE         |  1.934 | -1.488  0.141  0.592 |  0.339  0.017  0.031
## TIONG BAHRU STATION      |  3.396 |  2.094  0.279  0.380 | -0.344  0.017  0.010
## CLIFFORD PIER            |  1.901 | -1.695  0.183  0.795 |  0.047  0.000  0.001
##                             Dim.3    ctr   cos2  
## PEOPLE'S PARK            | -0.679  0.097  0.156 |
## BUKIT MERAH              |  2.062  0.896  0.316 |
## CHINATOWN                |  0.192  0.008  0.004 |
## PHILLIP                  | -0.505  0.054  0.066 |
## RAFFLES PLACE            |  0.589  0.073  0.005 |
## CHINA SQUARE             | -0.474  0.047  0.038 |
## TIONG BAHRU              | -0.630  0.084  0.155 |
## BAYFRONT SUBZONE         | -0.654  0.090  0.114 |
## TIONG BAHRU STATION      | -0.652  0.090  0.037 |
## CLIFFORD PIER            | -0.611  0.079  0.103 |
## 
## Variables (the 10 first)
##                             Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## Private_properties       |  0.278  1.565  0.077 |  0.297  4.053  0.088 |  0.004
## Shopping_Infrastructures |  0.188  0.716  0.035 |  0.859 34.009  0.738 | -0.016
## Businesses               | -0.254  1.305  0.064 | -0.014  0.010  0.000 |  0.829
## Industries               | -0.127  0.324  0.016 | -0.066  0.200  0.004 |  0.863
## Govt_institutions        |  0.036  0.026  0.001 |  0.777 27.842  0.604 |  0.044
## Financials               |  0.422  3.602  0.178 |  0.790 28.790  0.625 |  0.081
## YOUNG                    |  0.927 17.396  0.859 | -0.135  0.835  0.018 |  0.083
## AGED                     |  0.950 18.256  0.902 | -0.050  0.116  0.003 |  0.103
## ACTIVE                   |  0.959 18.624  0.920 | -0.119  0.655  0.014 |  0.093
## DENSITY                  |  0.811 13.326  0.658 | -0.237  2.579  0.056 | -0.108
##                             ctr   cos2  
## Private_properties        0.001  0.000 |
## Shopping_Infrastructures  0.016  0.000 |
## Businesses               46.098  0.688 |
## Industries               49.942  0.745 |
## Govt_institutions         0.130  0.002 |
## Financials                0.442  0.007 |
## YOUNG                     0.461  0.007 |
## AGED                      0.706  0.011 |
## ACTIVE                    0.582  0.009 |
## DENSITY                   0.789  0.012 |

From the results, we can see that the first five principal components capture about 86% of the variability in the variables (85.8% cumulative), while the first four capture just under 80%. The first principal component is the direction in space along which the variance is largest, and each subsequent component explains less variance than the one before. To understand which variables contribute to each principal component, we will plot the contribution of the different variables to each component.

# Extract the results for variables
var <- get_pca_var(res.pca)
# Contributions of variables to PC1
fviz_contrib(res.pca, choice = "var", axes = 1, top = 10)

# Contributions of variables to PC2
fviz_contrib(res.pca, choice = "var", axes = 2, top = 10)

# Contributions of variables to PC3
fviz_contrib(res.pca, choice = "var", axes = 3, top = 10)

# Contributions of variables to PC4
fviz_contrib(res.pca, choice = "var", axes = 4, top = 10)

# Contributions of variables to PC5
fviz_contrib(res.pca, choice = "var", axes = 5, top = 10)

# Colour variables by their contributions to the principal axes
fviz_pca_var(res.pca, col.var="contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
             ) + theme_minimal() + ggtitle("Variables - PCA")
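
To make the quantities in these plots concrete, the sketch below computes the explained variance and the variable contributions by hand. It uses base R's prcomp() on simulated data rather than the PCA() call above, and toy is a hypothetical data set. In the summary above, the ctr column equals cos2 divided by the eigenvalue, times 100. For example, YOUNG on Dim.1 gives 0.859 / 4.939, or about 17.4%. For a standardised PCA this is the same as the squared loading.

```r
# Sketch: variance explained and variable contributions from a PCA.
# Uses prcomp() on simulated data; `toy` is a hypothetical data set.
set.seed(42)
base_signal <- rnorm(100)
toy <- data.frame(
  young  = base_signal + rnorm(100, sd = 0.3),
  active = base_signal + rnorm(100, sd = 0.3),
  shops  = rnorm(100),
  firms  = rnorm(100)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eig <- pca$sdev^2
pct_var <- 100 * eig / sum(eig)
cum_pct_var <- cumsum(pct_var)

# Contribution (%) of each variable to each component: squared loadings,
# which sum to 100 within every component because loadings have unit length.
contrib <- 100 * pca$rotation^2

print(round(rbind(pct_var, cum_pct_var), 1))
print(round(contrib, 1))
```

In this toy example the two correlated variables dominate the first component, which mirrors the pattern seen for the population variables on Dim.1.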

Interpretation

The first principal component is strongly correlated with five of the original variables. It increases with increasing ACTIVE, AGED, YOUNG, DENSITY, and HDB_3_and_4Room_Flats scores. This suggests that these five variables vary together: if one increases, the remaining ones tend to increase as well. Hence, in line with our correlation analysis, we will eliminate Active, Aged, and HDB_3_4Room, as their information has already been captured by the other variables, i.e. Young and Density.

Principal components 2 to 5 each have only one or two variables that contribute substantially to their variation, and those variables were not found to be highly correlated in our correlation analysis. Hence, we will retain all of those variables.

Choosing cluster variables

cluster_vars.std <- data_by_subzones.std %>%
  select("Private_properties", "Shopping_Infrastructures", "Businesses",
         "Industries", "Govt_institutions", "Financials", "ACTIVE",
         "HDB_1_and_2Room_Flats", "HDB_5Room_and_Executive_Flats", "DENSITY",
         "Condominiums_and_Other_Apartments", "Landed_Properties")

In order to perform clustering, we first need to define a proximity matrix. The proximity matrix contains a measure of dissimilarity between each subzone and every other subzone, computed across the clustering variables. The measure will be the Euclidean distance, which is the straight-line distance between two points.

Calculating the proximity matrix

proxmat <- dist(cluster_vars.std, method = 'euclidean')
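
As a small illustration of what dist() returns, the sketch below computes Euclidean distances between the rows of a made-up matrix and checks one entry by hand. The matrix toy and its row labels are hypothetical.

```r
# Sketch: Euclidean distances between observations (rows), not variables.
# `toy` is a hypothetical matrix with three "subzones" and two variables.
toy <- matrix(
  c(0.0, 0.0,
    0.3, 0.4,
    1.0, 1.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("var1", "var2"))
)

d <- dist(toy, method = "euclidean")
print(d)

# Manual check for the pair (A, B): sqrt(0.3^2 + 0.4^2) = 0.5
d_ab <- sqrt(sum((toy["A", ] - toy["B", ])^2))
```

The result holds one distance per pair of rows, so n observations give n(n-1)/2 entries.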

6.4 Computing Hierarchical clustering

A hierarchical clustering algorithm will separate the subzones into different clusters based on their measure of similarity. Clustering will allow us to group subzones based on their socioeconomic characteristics. The analysis seeks to identify a set of groups which both minimize within-group variation and maximize between-group variation, so that the clusters are distinct. We will be using agglomerative hierarchical clustering, which is a bottom-up approach: each subzone starts as its own cluster, and the closest clusters are iteratively merged until all subzones belong to one big cluster. There are various methods to measure the distance between clusters when merging them. They are:
(1) Using average distance between two clusters
(2) Calculating the maximum distance between the points of the two clusters, i.e. using the distance between the two furthest points
(3) Calculating the minimum distance between the points of the two clusters, i.e. using the distance between the two closest points
(4) Using Ward’s method which merges two clusters in order to reduce within cluster variance
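
The four merging rules listed above can be compared on a small simulated example. The sketch below is illustrative only. The points in toy are made up, and Ward's method is requested here as "ward.D2", which is an assumption about the variant and differs from the "ward.D" call used later in this report.

```r
# Sketch: the four linkage rules applied to the same simulated points.
set.seed(7)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2)
)
d <- dist(toy)

# Ward's method is requested as "ward.D2" here (an assumption, see lead-in).
methods <- c(average = "average", complete = "complete",
             single = "single", ward = "ward.D2")
trees <- lapply(methods, function(m) hclust(d, method = m))

# Height of the final merge under each rule.
top_height <- sapply(trees, function(tr) max(tr$height))
print(round(top_height, 3))

# With two well-separated groups, all four rules recover the same split at k = 2.
splits <- sapply(trees, cutree, k = 2)
print(all(splits == splits[, 1]))
```

On data this cleanly separated the rules agree on the partition and differ only in the merge heights. On real data they can produce quite different trees.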

Choosing the most optimal method

In order to decide on the most suitable algorithm for our case study, we will calculate the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest a stronger clustering structure). The method with the highest coefficient will be chosen.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(cluster_vars.std, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8961083 0.8223974 0.9230676 0.9754843

From the above output, it is evident that Ward’s method has the highest agglomerative coefficient, 0.975. Ward’s method is also preferred for this analysis because it minimizes the pooled within-group sum of squares.
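
One detail worth keeping in mind when moving from agnes() to hclust(): the hclust() documentation notes that agnes(*, method = "ward") corresponds to hclust(*, "ward.D2"), while "ward.D" implements Ward's criterion only when it is given squared distances. The sketch below illustrates that relationship on simulated data. The object toy is hypothetical, and nothing here changes the results reported in this document.

```r
# Sketch: relation between hclust()'s "ward.D" and "ward.D2" on simulated data.
set.seed(123)
toy <- matrix(runif(60), ncol = 3)   # 20 hypothetical observations
d <- dist(toy, method = "euclidean")

hc_d2    <- hclust(d,   method = "ward.D2")  # Ward's criterion on distances
hc_d_sq  <- hclust(d^2, method = "ward.D")   # "ward.D" fed squared distances
hc_d_raw <- hclust(d,   method = "ward.D")   # "ward.D" fed unsquared distances

# The first two trees agree once the heights are put on the same scale.
same_heights <- isTRUE(all.equal(hc_d2$height, sqrt(hc_d_sq$height)))
same_merges  <- identical(hc_d2$merge, hc_d_sq$merge)
print(c(same_heights = same_heights, same_merges = same_merges))

# The third tree can differ; compare the partitions at, say, k = 4.
print(table(ward.D2 = cutree(hc_d2, 4), ward.D = cutree(hc_d_raw, 4)))
```

Whether the two variants lead to different clusters for the subzone data is not something this sketch can show. It only demonstrates that the choice of method string matters in principle.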

Plotting hierarchical clustering dendrogram

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.5)

As we have 318 subzones, the names of the subzones are not legible. However, that is not important right now, as we can visualise the clusters on a map later. The most important thing to read from the dendrogram is the height at which clusters are merged. Looking down from the top, there is a large difference in height between the first two merges. However, there is not much difference in height between the third and fourth merges. This may indicate that the difference between our clusters is not significant. We will examine this by dividing the dendrogram into different clusters and analysing the clusters using mean and standard deviation.
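
The merge heights described above can also be read directly from the hclust object instead of from the plot. The following is a minimal sketch on simulated data. The objects toy, hc, and groups are hypothetical stand-ins for the subzone data, the fitted tree, and the cluster labels.

```r
# Sketch: inspect merge heights and cut a tree into k groups (simulated data).
set.seed(2024)
toy <- rbind(
  matrix(rnorm(30, mean = 0, sd = 0.4), ncol = 2),
  matrix(rnorm(30, mean = 4, sd = 0.4), ncol = 2)
)
colnames(toy) <- c("var1", "var2")
hc <- hclust(dist(toy), method = "ward.D")

# Heights of the last five merges, from the top of the tree downwards.
# A large drop between consecutive heights suggests a natural place to cut.
top_heights <- rev(tail(hc$height, 5))
print(round(top_heights, 2))
print(round(-diff(top_heights), 2))

# Cut into k groups and summarise each group by mean and standard deviation.
groups <- cutree(hc, k = 2)
print(aggregate(as.data.frame(toy), by = list(cluster = groups), FUN = mean))
print(aggregate(as.data.frame(toy), by = list(cluster = groups), FUN = sd))
```

The same pattern, with cutree() on the fitted tree followed by per-cluster means and standard deviations, is one way to carry out the comparison described above.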

This raises an important question of determining the number of clusters we need to split into.

Determining the number of clusters

There are various indices which give an estimate of the number of clusters into which the data should be split. However, each index bases its estimate on factors such as standard deviation, mean, and covariance, giving a different weight to each of these components. In order to get an aggregated result, we will use the NbClust() function from the NbClust library, which provides 30 indices for determining the number of clusters and proposes the best clustering scheme from the results obtained by varying the number of clusters, distance measures, and clustering methods.
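
To make one such index concrete, the sketch below computes the Calinski-Harabasz (CH) index by hand: the between-cluster sum of squares divided by (k - 1), over the within-cluster sum of squares divided by (n - k). The helper ch_index and the simulated toy data are hypothetical. The helper is not a function from NbClust and is not claimed to match NbClust's implementation in every detail.

```r
# Sketch: Calinski-Harabasz (CH) index for k = 2..6 on simulated data.
# `ch_index` is a hypothetical helper, not a function from NbClust.
ch_index <- function(x, groups) {
  x <- as.matrix(x)
  n <- nrow(x)
  k <- length(unique(groups))
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  # Within-cluster sum of squares: squared deviations from each cluster centroid.
  within_ss <- sum(sapply(split(seq_len(n), groups), function(idx) {
    sum(scale(x[idx, , drop = FALSE], center = TRUE, scale = FALSE)^2)
  }))
  between_ss <- total_ss - within_ss
  (between_ss / (k - 1)) / (within_ss / (n - k))
}

set.seed(99)
# Three simulated groups centred near (0, 0), (5, 5), and (0, 5).
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = c(0, 5), sd = 0.5), ncol = 2, byrow = TRUE)
)
hc <- hclust(dist(toy), method = "ward.D2")

ch <- sapply(2:6, function(k) ch_index(toy, cutree(hc, k)))
names(ch) <- 2:6
print(round(ch, 1))
best_k <- as.integer(names(ch)[which.max(ch)])
print(best_k)
```

The number of clusters with the largest CH value is the one this index proposes. The NbClust() call that follows evaluates this and the other indices on the actual data.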

NbClust(data = cluster_vars.std, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = "ward.D", index = "all", alphaBeale = 0.05)

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 9 proposed 2 as the best number of clusters 
## * 2 proposed 3 as the best number of clusters 
## * 5 proposed 4 as the best number of clusters 
## * 4 proposed 7 as the best number of clusters 
## * 3 proposed 15 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

## $All.index
##         KL       CH Hartigan     CCC     Scott     Marriot  TrCovW  TraceW
## 2  12.9659 122.7496  28.3866  7.0875  912.5539 10485639.24 42.0575 66.2066
## 3   0.1832  80.8249  52.8436  4.7288 1204.1994  9429213.26 39.5372 60.7494
## 4   1.7284  80.2815  34.7138  7.7734 1580.9506  5126507.33 25.9030 52.0223
## 5   0.6739  75.3055  47.0608  9.9170 1887.5373  3054459.95 20.9926 46.8436
## 6   1.6459  78.4630  32.3751 14.9311 2254.9997  1385004.75 13.3250 40.7210
## 7   7.4687  77.3180  10.2581 17.2964 2558.8884   724973.76 11.3338 36.8928
## 8   0.1465  69.6991  29.9237 16.8726 2647.0389   717659.57 10.8632 35.7147
## 9   1.2214  70.3863  26.1096 19.9447 2946.3683   354346.07  9.0213 32.5707
## 10  2.2503  70.5243  14.3309 22.5930 3157.2467   225395.22  7.3340 30.0330
## 11  1.7842  67.6379   9.8179 23.2318 3320.3402   163302.33  6.6652 28.6978
## 12  0.3384  64.1384  20.6391 23.1862 3458.0682   126029.89  6.3056 27.8085
## 13  2.1344  64.2682  11.6995 25.1841 3632.3530    85501.95  5.3496 26.0513
## 14  1.1839  62.2952  10.3160 25.6776 3742.0240    70237.16  4.9035 25.0890
## 15  0.4275  60.3462  19.9851 26.0082 3849.6212    57484.03  4.5407 24.2655
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2   28.7443 1.8928 0.2467 1.5025     0.3501 0.8761  30.5466  1.1495    0.2137
## 3   31.1718 2.0628 0.2058 2.2907     0.2765 0.7372  27.0984  2.8738    0.2524
## 4   35.3081 2.4088 0.1836 1.8282     0.3026 0.7663  29.8854  2.4650    0.2785
## 5   39.1740 2.6752 0.1638 1.7282     0.3174 0.6374  22.7562  4.5322    0.2673
## 6   44.0604 3.0774 0.2043 1.3714     0.3354 0.6950  35.1103  3.5395    0.2728
## 7   52.4524 3.3967 0.2103 1.2874     0.3379 0.7813  38.6377  2.2698    0.2661
## 8   54.9915 3.5087 0.2192 1.4428     0.1856 0.5858  24.0355  5.6077    0.2530
## 9   58.2626 3.8474 0.2086 1.4037     0.2086 0.5470  14.9045  6.4056    0.2450
## 10  60.7132 4.1725 0.2053 1.3549     0.2186 0.8149  16.1230  1.8286    0.2448
## 11  63.6033 4.3667 0.1856 1.5095     0.1918 0.7239  24.0295  3.0659    0.2352
## 12  65.3967 4.5063 0.1775 1.4736     0.1985 0.6035  10.5140  5.0503    0.2284
## 13  69.3474 4.8103 0.1727 1.4318     0.2080 0.7126  16.1326  3.2131    0.2242
## 14  73.8113 4.9948 0.1643 1.4075     0.2061 0.3957  29.0122 11.8454    0.2170
## 15  75.3594 5.1643 0.1617 1.3566     0.2115 0.3657  34.6881 13.4884    0.2111
##       Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  33.1033     0.4216  0.5832  0.4891 0.0732 0.0203 12.6076 0.3812 0.9413
## 3  20.2498     0.4564 -0.1643  1.0560 0.0278 0.0237 15.5367 0.3571 0.8957
## 4  13.0056     0.5210 -0.0342  1.0669 0.0278 0.0264 15.6966 0.3350 1.0203
## 5   9.3687     0.5647 -0.1147  1.0907 0.0278 0.0303 14.8211 0.3198 0.9336
## 6   6.7868     0.5867 -0.0056  1.0744 0.0366 0.0314 13.5959 0.3045 0.9242
## 7   5.2704     0.6068 -4.4268  1.0701 0.0403 0.0352 13.1561 0.2929 0.8518
## 8   4.4643     0.4393 -0.0060  2.1271 0.0403 0.0366 19.0603 0.2808 0.7645
## 9   3.6190     0.4498 -0.0670  2.1162 0.0403 0.0378 18.3985 0.2685 0.7016
## 10  3.0033     0.4547  0.2763  2.0984 0.0404 0.0389 18.5205 0.2611 0.6738
## 11  2.6089     0.4451  0.5511  2.3430 0.0404 0.0404 18.3223 0.2537 0.6373
## 12  2.3174     0.4274 -0.0226  2.6066 0.0404 0.0410 18.5608 0.2472 0.5969
## 13  2.0039     0.4318  0.2063  2.5736 0.0404 0.0412 17.9178 0.2408 0.5761
## 14  1.7921     0.4282  0.1611  2.6543 0.0404 0.0421 17.6560 0.2353 0.5425
## 15  1.6177     0.4277 -0.0094  2.6695 0.0404 0.0426 17.0168 0.2294 0.5003
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.8615            34.7229       0.3147
## 3          0.8041            18.5143       0.0007
## 4          0.8208            21.3960       0.0035
## 5          0.7523            13.1707       0.0000
## 6          0.8076            19.0542       0.0000
## 7          0.8403            26.2208       0.0075
## 8          0.7367            12.1519       0.0000
## 9          0.6649             9.0730       0.0000
## 10         0.7993            17.8276       0.0399
## 11         0.7905            16.6986       0.0003
## 12         0.6496             8.6302       0.0000
## 13         0.7523            13.1707       0.0002
## 14         0.6717             9.2879       0.0000
## 15         0.6780             9.4987       0.0000
## 
## $Best.nc
##                      KL       CH Hartigan     CCC    Scott Marriot  TrCovW
## Number_clusters  2.0000   2.0000    3.000 15.0000   4.0000       4  4.0000
## Value_Index     12.9659 122.7496   24.457 26.0082 376.7512 2230659 13.6342
##                 TraceW Friedman   Rubin  Cindex     DB Silhouette   Duda
## Number_clusters 4.0000    7.000  7.0000 15.0000 7.0000     2.0000 2.0000
## Value_Index     3.5484    8.392 -0.2073  0.1617 1.2874     0.3501 0.8761
##                 PseudoT2  Beale Ratkowsky    Ball PtBiserial Frey McClain
## Number_clusters   2.0000 2.0000    4.0000  3.0000     7.0000    1  2.0000
## Value_Index      30.5466 1.1495    0.2785 12.8535     0.6068   NA  0.4891
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 2.0000      0  2.0000      0 15.0000
## Value_Index     0.0732      0 12.6076      0  0.5003
## 
## $Best.partition
##                 PEOPLE'S PARK                   BUKIT MERAH 
##                             1                             1 
##                     CHINATOWN                       PHILLIP 
##                             2                             1 
##                 RAFFLES PLACE                  CHINA SQUARE 
##                             1                             1 
##                   TIONG BAHRU              BAYFRONT SUBZONE 
##                             2                             1 
##           TIONG BAHRU STATION                 CLIFFORD PIER 
##                             2                             1 
##                  MARINA SOUTH                  PEARL'S HILL 
##                             1                             2 
##                     BOAT QUAY                HENDERSON HILL 
##                             1                             2 
##                       REDHILL                ALEXANDRA HILL 
##                             2                             2 
##                 BUKIT HO SWEE                   CLARKE QUAY 
##                             2                             1 
##            TELOK BLANGAH RISE                 TANJONG PAGAR 
##                             2                             1 
##                  EVERTON PARK             TELOK BLANGAH WAY 
##                             1                             2 
##                       MAXWELL                         CECIL 
##                             1                             1 
##           KAMPONG TIONG BAHRU           TELOK BLANGAH DRIVE 
##                             2                             1 
##               PASIR PANJANG 2               PASIR PANJANG 1 
##                             1                             1 
##                     QUEENSWAY                    KENT RIDGE 
##                             1                             1 
##               ALEXANDRA NORTH                   MARINA EAST 
##                             1                             1 
##              INSTITUTION HILL                ROBERTSON QUAY 
##                             1                             1 
##       JURONG ISLAND AND BUKOM                       SENTOSA 
##                             1                             1 
##                CITY TERMINALS                         ANSON 
##                             1                             1 
##                  STRAITS VIEW               MARITIME SQUARE 
##                             1                             1 
##               CENTRAL SUBZONE    SINGAPORE GENERAL HOSPITAL 
##                             1                             1 
##                    DEPOT ROAD                    EAST COAST 
##                             1                             1 
## NATIONAL UNIVERSITY OF S'PORE                 ONE TREE HILL 
##                             1                             1 
##                  ROCHOR CANAL                      CRAWFORD 
##                             1                             2 
##                MARGARET DRIVE                       TANGLIN 
##                             1                             1 
##                 MARINE PARADE                  TANGLIN HALT 
##                             2                             2 
##                     MACKENZIE                   SUNGEI ROAD 
##                             1                             2 
##                     ONE NORTH                   TANJONG RHU 
##                             1                             1 
##                   MOUNTBATTEN                  COMMONWEALTH 
##                             1                             2 
##                         DOVER                     BOULEVARD 
##                             1                             1 
##                 ISTANA NEGARA                  LITTLE INDIA 
##                             1                             1 
##                     GUL BASIN                        RIDOUT 
##                             1                             1 
##                     CAIRNHILL                 CLEMENTI WEST 
##                             1                             2 
##           TUAS VIEW EXTENSION                   MONK'S HILL 
##                             1                             1 
##                        SIGLAP                CLEMENTI WOODS 
##                             1                             1 
##                  FORT CANNING              MARINA EAST (MP) 
##                             1                             1 
##                 MARINA CENTRE                      SOMERSET 
##                             1                             1 
##                     BENCOOLEN                    CHATSWORTH 
##                             1                             1 
##                PIONEER SECTOR              PENJURU CRESCENT 
##                             1                             1 
##                  ORANGE GROVE                 KAMPONG BUGIS 
##                             1                             1 
##                  KAMPONG GLAM                       SELEGIE 
##                             1                             1 
##                   MOUNT EMILY                      JOO KOON 
##                             1                             1 
##                   KALLANG WAY   INTERNATIONAL BUSINESS PARK 
##                             1                             1 
##                        TUKANG               CORONATION ROAD 
##                             1                             1 
##                     KEMBANGAN                        KATONG 
##                             1                             1 
##                 HOLLAND DRIVE                   FARRER PARK 
##                             2                             1 
##                 NEWTON CIRCUS                   JURONG PORT 
##                             1                             1 
##                       SAMULUN                      SHIPYARD 
##                             1                             1 
##                      GHIM MOH                      LAVENDER 
##                             2                             2 
##                 GOODWOOD PARK                        PANDAN 
##                             1                             1 
##         SINGAPORE POLYTECHNIC              CLEMENTI CENTRAL 
##                             1                             1 
##                  KAMPONG JAVA                     BOON KENG 
##                             1                             2 
##                 KALLANG BAHRU                    ULU PANDAN 
##                             1                             1 
##                  FARRER COURT                     TUAS VIEW 
##                             1                             1 
##                        NASSIM                    WEST COAST 
##                             1                             1 
##                      BAYSHORE                  BENOI SECTOR 
##                             1                             1 
##                    GUL CIRCLE                      ALJUNIED 
##                             1                             2 
##                      TYERSALL                      MOULMEIN 
##                             1                             1 
##                      LIU FANG                       FRANKEL 
##                             1                             1 
##                CLEMENTI NORTH                    BRAS BASAH 
##                             2                             1 
##                         OXLEY                     CITY HALL 
##                             1                             1 
##                      MEI CHIN                   LEONIE HILL 
##                             2                             1 
##                          PORT                   DHOBY GHAUT 
##                             1                             1 
##                         BUGIS                      VICTORIA 
##                             1                             1 
##                      PATERSON                      TUAS BAY 
##                             1                             1 
##                   LEEDON PARK                  GEYLANG EAST 
##                             1                             2 
##                 TEBAN GARDENS                  JURONG RIVER 
##                             1                             1 
##                 GEYLANG BAHRU                         FABER 
##                             2                             1 
##                       MALCOLM                   BEDOK SOUTH 
##                             1                             1 
##                 TAMPINES EAST                    KAKI BUKIT 
##                             2                             1 
##                    YUHUA EAST             BUKIT BATOK SOUTH 
##                             2                             1 
##           JURONG WEST CENTRAL               BEDOK RESERVOIR 
##                             2                             1 
##                    ANAK BUKIT                    SWISS CLUB 
##                             1                             1 
##                         XILIN                         SIMEI 
##                             1                             1 
##                BOON LAY PLACE              BUKIT BATOK EAST 
##                             2                             2 
##              BUKIT BATOK WEST           BUKIT BATOK CENTRAL 
##                             2                             2 
##              UPPER PAYA LEBAR                      TAI SENG 
##                             2                             1 
##                        TENGEH                    YUHUA WEST 
##                             1                             2 
##                        YUNNAN                  LORONG CHUAN 
##                             2                             1 
##                      HONG KAH                TUAS PROMENADE 
##                             2                             1 
##                  AIRPORT ROAD             SERANGOON CENTRAL 
##                             1                             2 
##                   BISHAN EAST                 TAMPINES WEST 
##                             2                             2 
##                    BRICKWORKS                       DUNEARN 
##                             1                             1 
##                    SUNSET WAY                    MACPHERSON 
##                             1                             2 
##                      KIM KEAT                   BEDOK NORTH 
##                             2                             2 
##             TOA PAYOH CENTRAL                JURONG GATEWAY 
##                             2                             1 
##                  HOLLAND ROAD                   KAMPONG UBI 
##                             1                             1 
##                       SENNETT                  POTONG PASIR 
##                             1                             2 
##                     BENDEMEER                     BALESTIER 
##                             2                             2 
##                      JOO SENG                      CHIN BEE 
##                             1                             1 
##            LORONG 8 TOA PAYOH                      TOH GUAN 
##                             2                             2 
##                      BRADDELL                      BIDADARI 
##                             2                             1 
##                     WOODLEIGH                  TAMAN JURONG 
##                             1                             2 
##                      LAKESIDE                TOA PAYOH WEST 
##                             1                             1 
##          DEFU INDUSTRIAL PARK                        GUILIN 
##                             1                             1 
##                     MARYMOUNT                         WENYA 
##                             2                             1 
##                NATURE RESERVE                      HILLVIEW 
##                             1                             1 
##                    CHANGI BAY               PAYA LEBAR EAST 
##                             1                             1 
##                 UPPER THOMSON                HONG KAH NORTH 
##                             1                             2 
##                    TOWNSVILLE                         KOVAN 
##                             2                             1 
##                    CHONG BOON                    SHANGRI-LA 
##                             2                             2 
##              SERANGOON GARDEN               HOUGANG CENTRAL 
##                             1                             1 
##                   LOYANG EAST                    DAIRY FARM 
##                             1                             1 
##               PASIR RIS DRIVE                TAMPINES NORTH 
##                             2                             1 
##                     CHENG SAN        ANG MO KIO TOWN CENTRE 
##                             2                             1 
##                   KEBUN BAHRU    SERANGOON NORTH IND ESTATE 
##                             2                             1 
##                        TENGAH               SERANGOON NORTH 
##                             1                             2 
##             PASIR RIS CENTRAL                        GOMBAK 
##                             2                             1 
##                          PLAB              PAYA LEBAR NORTH 
##                             1                             1 
##                  HOUGANG EAST                  LORONG HALUS 
##                             2                             1 
##                       KANGKAR               SEMBAWANG HILLS 
##                             2                             1 
##                        JELEBU                     KEAT HONG 
##                             2                             2 
##                  HOUGANG WEST               PAYA LEBAR WEST 
##                             2                             1 
##                       BANGKIT            LORONG HALUS NORTH 
##                             2                             1 
##                    PENG SIANG                PASIR RIS WEST 
##                             2                             2 
##             YIO CHU KANG WEST                     TRAFALGAR 
##                             2                             2 
##                     TECK WHYE                    TUAS NORTH 
##                             2                             1 
##                      PEI CHUN                     BOON TECK 
##                             2                             2 
##                     KIAN TECK                         SAFTI 
##                             1                             1 
##                      TOH TUCK                MOUNT PLEASANT 
##                             1                             1 
##                     HILLCREST                       SAUJANA 
##                             1                             2 
##                 SELETAR HILLS                   COMPASSVALE 
##                             1                             2 
##             YIO CHU KANG EAST                  YIO CHU KANG 
##                             1                             1 
##                   LOYANG WEST                        TAGORE 
##                             1                             1 
##                 LORONG AH SOO                   FLORA DRIVE 
##                             2                             1 
##         CHOA CHU KANG CENTRAL                   CHANGI WEST 
##                             2                             1 
##                         FAJAR                         SENJA 
##                             2                             2 
##                 WATERWAY EAST                     GALI BATU 
##                             2                             1 
##                    SPRINGLEAF           PUNGGOL TOWN CENTRE 
##                             1                             1 
##                      NEE SOON                 LOWER SELETAR 
##                             1                             1 
##                    NORTHSHORE                 MANDAI ESTATE 
##                             1                             1 
##                YISHUN CENTRAL           PULAU PUNGGOL TIMOR 
##                             1                             1 
##                     TURF CLUB               WOODLANDS SOUTH 
##                             1                             2 
##                     WOODGROVE                   YISHUN EAST 
##                             2                             2 
##       WESTERN WATER CATCHMENT           PULAU PUNGGOL BARAT 
##                             1                             1 
##                   YISHUN WEST     WOODLANDS REGIONAL CENTRE 
##                             2                             1 
##                   MANDAI EAST                 SIMPANG SOUTH 
##                             1                             1 
##                     NORTHLAND                       MIDVIEW 
##                             2                             2 
##                WOODLANDS WEST             SEMBAWANG SPRINGS 
##                             2                             1 
##                        KRANJI                RESERVOIR VIEW 
##                             1                             1 
##                WOODLANDS EAST             SEMBAWANG CENTRAL 
##                             2                             2 
##                GREENWOOD PARK                SEMBAWANG EAST 
##                             1                             1 
##                   SENOKO WEST                PASIR RIS PARK 
##                             1                             1 
##           CHOA CHU KANG NORTH                     RIVERVALE 
##                             2                             2 
##                CHANGI AIRPORT            YIO CHU KANG NORTH 
##                             1                             1 
##                 PUNGGOL CANAL       CENTRAL WATER CATCHMENT 
##                             1                             1 
##                       SELETAR                     ADMIRALTY 
##                             1                             1 
##                  LIM CHU KANG                 SIMPANG NORTH 
##                             1                             1 
##                  SENOKO SOUTH               SEMBAWANG NORTH 
##                             1                             2 
##                  TANJONG IRAU                      PANG SUA 
##                             1                             1 
##        SELETAR AEROSPACE PARK                        KHATIB 
##                             1                             1 
##                   MANDAI WEST                  CONEY ISLAND 
##                             1                             1 
##                  YISHUN SOUTH                   THE WHARVES 
##                             2                             1 
##                  SENOKO NORTH                  CHANGI POINT 
##                             1                             1 
##          SENGKANG TOWN CENTRE                    ANCHORVALE 
##                             2                             2 
##                 SENGKANG WEST                      FERNVALE 
##                             1                             2 
##                 PUNGGOL FIELD                       YEW TEE 
##                             2                             2 
##      PASIR RIS WAFER FAB PARK                       MATILDA 
##                             1                             2 
##                   NORTH COAST             SEMBAWANG STRAITS 
##                             1                             1

As seen above, most indices proposed 4 as the optimal number of clusters. Hence, we will cut the dendrogram into four clusters.

hclust_ward <- hclust(proxmat, method = 'ward.D')
plot(hclust_ward, cex = 0.5)
rect.hclust(hclust_ward, k = 4, border = "red")

hclust_ward

## 
## Call:
## hclust(d = proxmat, method = "ward.D")
## 
## Cluster method   : ward.D 
## Distance         : euclidean 
## Number of objects: 318

In the plot above, the dendrogram is divided into four clusters, marked by the red boxes. Because there are many subzones, the subzone names on the dendrogram are not legible, so we will analyse the clusters by visualising them on a map. Before the final analysis, we will also plot a heatmap to see how each variable contributes to the clusters.
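As a quick numerical check of a dendrogram cut, here is a minimal sketch on made-up data. The matrix `toy` and its row names are hypothetical stand-ins for the report's standardised variables. It shows how cutree() turns the tree into cluster labels and how table() gives the size of each cluster.

```r
# Toy data: four well-separated groups of 10 points each (hypothetical values).
set.seed(42)
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 6, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 9, sd = 0.3), ncol = 2)
)
rownames(toy) <- paste0("zone_", seq_len(nrow(toy)))

# Same recipe as in the report: Euclidean distances, Ward linkage.
toy_dist <- dist(toy, method = "euclidean")
toy_hc <- hclust(toy_dist, method = "ward.D")

# Cut the tree into four clusters and count the members of each.
toy_groups <- cutree(toy_hc, k = 4)
toy_sizes <- table(toy_groups)
print(toy_sizes)
```

On the report's own objects, the equivalent check would be table(cutree(hclust_ward, k = 4)), which lists how many subzones fall into each of the four clusters.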

6.5 Heatmap

The heatmap helps us understand how the clusters are formed by showing each variable individually. Because there are so many subzones, the static heatmap is not very clear. However, it is interactive, so exact values can be read off and regions can be zoomed into if needed.

heatmap <- data.matrix(cluster_vars.std)

heatmaply(heatmap,
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 4,
          margins = c(NA,200,60,NA),
          fontsize_row = 3,
          fontsize_col = 5,
          main="Geographic Segmentation of Singapore by Demographic and Urban Indicators",
          xlab = "Demographic and Urban Indicators",
          ylab = "Subzones of Singapore"
          )

We will analyse each cluster in the heatmap after plotting the map of the clusters, so that the analysis is more coherent.

6.6 Map for visualising clusters

tmap_mode("plot")
groups <- as.factor(cutree(hclust_ward, k=4))
data_by_subzones_sf$CLUSTER <- groups
tm_shape(data_by_subzones_sf)+
  tm_polygons("CLUSTER",
              palette="Set3")

The four clusters are clearly visible in the map above. To analyse them, we will plot the mean value of each socio-economic factor for every cluster and compare them. These plots will be used in tandem with the heatmap plotted in section 6.5.

6.7 Cluster Analysis

A bar chart of the cluster means will be plotted for each variable in order to perform cluster analysis and find the similarities and differences between the clusters.

data_by_subzones.std$CLUSTER <- groups
aggregate <- aggregate(data_by_subzones.std,by= list(data_by_subzones.std$CLUSTER),FUN = "mean")
aggregate$CLUSTER <- NULL
aggregate <- aggregate %>%
  rename("CLUSTER"=Group.1)
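The aggregation step can be illustrated on a small made-up table. This is only a sketch: `toy_std` and `toy_cluster` are hypothetical, not the report's data. It computes the per-cluster mean of each variable, and names the grouping column directly in the `by` list so that no rename is needed afterwards.

```r
# Hypothetical standardised values for five subzones and two variables.
toy_std <- data.frame(
  DENSITY = c(1, 3, 10, 14, 5),
  AGED    = c(2, 4, 6, 10, 0)
)

# Hypothetical cluster labels, kept outside the data frame so that
# aggregate() does not try to take the mean of a factor column.
toy_cluster <- factor(c(1, 1, 2, 2, 1))

# One row per cluster, one column of means per variable.
toy_means <- aggregate(toy_std, by = list(CLUSTER = toy_cluster), FUN = mean)
print(toy_means)
```

Each row of `toy_means` is a cluster profile. These are the values that the bar charts compare across clusters.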

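plot_data <- function(maindata,attribute){
  # Bar chart of the per-cluster mean of one attribute, coloured by cluster.
  # Uses the data frame passed in as maindata rather than a global object.
  return(ggplot(maindata, aes_string(x="CLUSTER",y=attribute, fill = "CLUSTER")) + 
   geom_bar(stat="identity", position = "dodge",size=0.5) + 
    theme(legend.position = 'none')+
   scale_fill_brewer(palette = "Set3"))}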
  
private_plot <- plot_data(aggregate,"Private_properties")
shopping_plot <- plot_data(aggregate,"Shopping_Infrastructures")
business_plot <- plot_data(aggregate,"Businesses")
industry_plot <- plot_data(aggregate,"Industries")
govt_plot <- plot_data(aggregate,"Govt_institutions")
financial_plot <- plot_data(aggregate,"Financials")
young_plot <- plot_data(aggregate,"YOUNG")
aged_plot <- plot_data(aggregate,"AGED")
active_plot <- plot_data(aggregate,"ACTIVE")
density_plot <- plot_data(aggregate,"DENSITY")
HDB1_2_plot <- plot_data(aggregate,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(aggregate,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(aggregate,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(aggregate,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(aggregate,"Landed_Properties")

To visualise the charts, we arrange them in a grid and plot them.

ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
          young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
          ncol = 3, 
          nrow = 2)

## $`1`

## 
## $`2`

## 
## $`3`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Interpretation of Cluster One

Cluster One (shown in green) is the largest of the four clusters. Its most distinctive feature is that it does not dominate in any of the urban functions or social demographics. On the map, this cluster is spread across all four regions of Singapore. One reason it scores low on socio-economic factors is that it includes areas such as the Central Water Catchment, the Western Water Catchment, and Changi Bay, which includes Changi Airport. The water catchment areas mainly comprise forests and water bodies, and are therefore very low on urban functions. As both the density and the age-group populations of this cluster are very low, it suggests that the economic properties of a region go hand in hand with its demographic properties. These regions have room for development to attract people to live or work there. They are the more “open regions” of Singapore, i.e. they contain fewer buildings and less commercial infrastructure, and have more open land, forests, and parks. They play an important role in keeping Singapore a green city and maintaining its environment.

Notably, this cluster also extends into the central region, in subzones such as Tanglin and Tanjong Rhu. These subzones have a low population density but are known as upscale areas of Singapore, as they are very open and have very few buildings. Many other subzones in this cluster also have a very low population, owing to fewer and shorter buildings. As the Financials bar chart shows, a considerable amount of financial infrastructure is present in this cluster. This leads to one more inference: these regions could be developed into some of the more upscale areas of Singapore where they do not already have residential infrastructure.

Interpretation of Cluster Two

Cluster Two (shown in yellow) dominates two urban functions: Businesses and Industries. Industries refer to industrial parks, manufacturing facilities, etc., whereas businesses refer to the “tertiary sector”. One of the most interesting insights from this cluster is that, even though its subzones are located in all four regions of Singapore, they occur in groups of “mini-clusters”, as most of them adjoin other subzones of the same cluster. Industries require large amounts of raw materials and transport resources, so it is advantageous for them to form clusters together. Singapore is one of the very first countries to adopt the concept of “Industrial Parks”, which are large areas that contain manufacturing and industrial facilities. As these regions are not suitable for other social activities, they should be developed in a way that suits the requirements of businesses and industries, such as providing parking for trucks and trailers. Population density and the number of households are extremely low in these areas. Industries often emit harmful and toxic pollutants, which create an unhealthy living environment. This leads to very little residential development in these regions: they have the least residential infrastructure of every dwelling type compared with the other clusters.

Interpretation of Cluster Three

Cluster Three (in purple) dominates all the demographic factors. It has the highest population density and the highest population in all three age groups. These are the densely populated residential areas of Singapore, found in the western and eastern regions. Even though these areas are built for residential purposes, condominiums, landed properties, and private properties are less common in this cluster than in Cluster Four. We can therefore infer that the distributions of HDB flats and of condominiums and landed properties follow an inverse spatial relationship. It is evident that financial infrastructure is heavily required in such regions: banking facilities are used by the entire population, and so should be concentrated in residential areas. The second most required amenity is shopping facilities, an essential requirement for any densely populated region. The data also suggests that Singapore has taken a bi-modal approach by spatially segregating business and residential areas. As there are few businesses and industries in these regions, most of the population must commute to work from them, and hence public transport facilities should be readily available.

Interpretation of Cluster Four

Cluster Four (in red) dominates most of the urban functions, with the most private properties, shopping infrastructure, government institutions, and financial infrastructure. It also has the highest proportion of landed properties and condominiums. These subzones mark where most public service facilities are located and where the wealthiest segment of the population prefers to live, as they predominantly consist of landed properties and condominiums. It can be inferred that these are the most developed regions of Singapore.

7. Spatially Constrained Clustering - SKATER

7.1 Data preparation

The hierarchical clustering above did not take the spatial relationships between subzones into account. In this section, we will perform spatially constrained clustering using the SKATER approach.

Firstly, we will convert our sf dataframe to sp format. This is because the SKATER clustering function requires a spatial dataframe object as its input.

data_by_subzones_sf$CLUSTER = NULL
data_by_subzones_sp <- as_Spatial(data_by_subzones_sf)

Computing Neighbour List

From the sp object, we will now create a neighbour list. All subzones adjoining a given subzone are considered to be its neighbours.

data.nb <- poly2nb(data_by_subzones_sp)
summary(data.nb)

## Neighbour list object:
## Number of regions: 318 
## Number of nonzero links: 1934 
## Percentage nonzero weights: 1.912503 
## Average number of links: 6.081761 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 14 17 
##  2  6 10 26 77 87 51 34 16  3  3  1  1  1 
## 2 least connected regions:
## JURONG ISLAND AND BUKOM CHANGI BAY with 1 link
## 1 most connected region:
## CENTRAL WATER CATCHMENT with 17 links
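The headline figures in this summary can be reproduced from the link number distribution alone. The sketch below simply retypes the two rows of that table (the variable names are ours, not spdep's) and recomputes the number of regions, the number of nonzero links, the average number of links, and the percentage of nonzero weights.

```r
# Link number distribution copied from the summary above:
# link_counts[i] links are held by n_subzones_with[i] subzones.
link_counts     <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 14, 17)
n_subzones_with <- c(2, 6, 10, 26, 77, 87, 51, 34, 16, 3, 3, 1, 1, 1)

n_regions <- sum(n_subzones_with)                # 318 subzones
n_links   <- sum(link_counts * n_subzones_with)  # 1934 nonzero links

# Average neighbours per subzone, and the share of the 318 x 318
# weights matrix that is nonzero.
avg_links   <- n_links / n_regions
pct_nonzero <- 100 * n_links / n_regions^2

print(c(regions = n_regions, links = n_links,
        average = avg_links, percent = pct_nonzero))
```

Each adjacency is counted twice, once from each side, which is why the number of nonzero links is even.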

The neighbours can be plotted with the code below. Note that each vertex represents the centroid of the subzone.

plot(data_by_subzones_sp, border=grey(.5))
plot(data.nb, coordinates(data_by_subzones_sp), col="blue", add=TRUE)

The neighbour list is a graph in which each subzone is a vertex and every edge indicates a connection between two adjoining subzones. We will now calculate the cost of each edge using the nbcosts() function.

data_by_subzones.std$CLUSTER = NULL
lcosts <- nbcosts(data.nb, data_by_subzones.std)

We can now examine what lcosts looks like.

head(lcosts)

## [[1]]
## [1] 0.5841519 1.0070996 0.9461811
## 
## [[2]]
## [1] 1.2014312 1.1631365 0.9765882 1.0070794 1.0734969 0.9572217 0.6871655
## 
## [[3]]
##  [1] 0.5841519 0.9077098 0.3670866 0.8199890 0.4936605 0.5091575 0.9407277
##  [8] 0.6625817 0.5993047 0.9351744
## 
## [[4]]
## [1] 1.0209524 0.3716535 0.1654916
## 
## [[5]]
## [1] 0.9077098 1.0209524 0.9324593 1.0811211 1.0997412 0.5814547 0.8590983
## [8] 0.7925918
## 
## [[6]]
## [1] 0.3670866 0.3716535 0.9324593 0.9887064 0.3677505 0.2848824
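To make the structure of this output concrete, the following base R sketch computes edge costs by hand for a three-node toy graph. It assumes that the cost of an edge is the Euclidean distance between the attribute vectors of its two end nodes, which we understand to be the default of nbcosts(). `toy_attrs` and `toy_nb` are hypothetical, and this is not spdep's implementation.

```r
# Hypothetical attribute vectors for three areas (one row per area).
toy_attrs <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# Hypothetical neighbour list: a-b and b-c are adjacent, a-c are not.
toy_nb <- list(c(2L), c(1L, 3L), c(2L))

# For every area i, the cost to each neighbour j is the Euclidean
# distance between row i and row j of the attribute table.
edge_costs <- lapply(seq_along(toy_nb), function(i) {
  vapply(
    toy_nb[[i]],
    function(j) sqrt(sum((toy_attrs[i, ] - toy_attrs[j, ])^2)),
    numeric(1)
  )
})
print(edge_costs)
```

As in the output above, each list element has one cost per neighbour, and every cost appears twice: the first value of element 1 (0.5841519) reappears as the first value of element 3, because the two subzones are each other's neighbours.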

As we have a prepared dataset with the demographic and urban-function values for each subzone, we will convert the graph to a weighted graph in which each edge weight measures the dissimilarity between two neighbouring subzones across all the variables.

data.w <- nb2listw(data.nb, lcosts, style="B")
glimpse(data.w)

## List of 3
##  $ style     : chr "B"
##  $ neighbours:List of 318
##   ..$ : int [1:3] 3 12 42
##   ..$ : int [1:7] 9 14 15 16 22 25 43
##   ..$ : int [1:10] 1 5 6 12 20 21 23 24 38 42
##   ..$ : int [1:3] 5 6 13
##   ..$ : int [1:8] 3 4 6 10 13 24 41 122
##   ..$ : int [1:6] 3 4 5 12 13 18
##   ..$ : int [1:5] 9 12 17 25 42
##   ..$ : int [1:5] 10 11 39 41 73
##   ..$ : int [1:5] 2 7 14 17 25
##   ..$ : int [1:5] 5 8 41 73 122
##   ..$ : int [1:5] 8 32 39 41 73
##   ..$ : int [1:9] 1 3 6 7 13 17 18 34 42
##   ..$ : int [1:6] 4 5 6 12 18 122
##   ..$ : int [1:6] 2 9 15 17 25 31
##   ..$ : int [1:5] 2 14 16 31 49
##   ..$ : int [1:8] 2 15 29 30 31 43 49 123
##   ..$ : int [1:7] 7 9 12 14 31 34 76
##   ..$ : int [1:6] 6 12 13 34 71 122
##   ..$ : int [1:5] 21 22 25 37 40
##   ..$ : int [1:5] 3 23 24 38 41
##   ..$ : int [1:6] 3 19 25 37 38 42
##   ..$ : int [1:6] 2 19 25 26 40 43
##   ..$ : int [1:3] 3 20 24
##   ..$ : int [1:5] 3 5 20 23 41
##   ..$ : int [1:9] 2 7 9 14 19 21 22 37 42
##   ..$ : int [1:5] 22 27 30 40 43
##   ..$ : int [1:5] 26 28 30 40 125
##   ..$ : int [1:6] 27 30 45 66 70 125
##   ..$ : int [1:5] 16 30 43 55 123
##   ..$ : int [1:8] 16 26 27 28 29 43 45 55
##   ..$ : int [1:6] 14 15 16 17 49 76
##   ..$ : int [1:5] 11 48 56 72 73
##   ..$ : int [1:4] 34 71 121 124
##   ..$ : int [1:7] 12 17 18 33 71 76 124
##   ..$ : int 94
##   ..$ : int [1:2] 37 40
##   ..$ : int [1:8] 19 21 25 36 38 39 40 41
##   ..$ : int [1:5] 3 20 21 37 41
##   ..$ : int [1:4] 8 11 37 41
##   ..$ : int [1:7] 19 22 26 27 36 37 125
##   ..$ : int [1:9] 5 8 10 11 20 24 37 38 39
##   ..$ : int [1:6] 1 3 7 12 21 25
##   ..$ : int [1:6] 2 16 22 26 29 30
##   ..$ : int [1:4] 51 57 69 72
##   ..$ : int [1:5] 28 30 55 59 70
##   ..$ : int [1:5] 50 60 76 124 129
##   ..$ : int [1:6] 48 54 81 98 127 128
##   ..$ : int [1:9] 32 47 56 73 80 81 98 122 127
##   ..$ : int [1:9] 15 16 31 52 58 64 76 123 131
##   ..$ : int [1:5] 46 60 64 76 109
##   ..$ : int [1:6] 44 57 69 72 90 118
##   ..$ : int [1:5] 49 55 58 91 123
##   ..$ : int [1:5] 61 82 83 92 103
##   ..$ : int [1:5] 47 62 75 98 128
##   ..$ : int [1:8] 29 30 45 52 59 91 97 123
##   ..$ : int [1:6] 32 48 57 72 80 114
##   ..$ : int [1:7] 44 51 56 72 90 114 132
##   ..$ : int [1:5] 49 52 91 123 131
##   ..$ : int [1:5] 45 55 70 97 101
##   ..$ : int [1:9] 46 50 65 74 79 99 109 124 129
##   ..$ : int [1:6] 53 68 74 83 103 126
##   ..$ : int [1:6] 54 75 82 92 98 128
##   ..$ : int [1:4] 77 96 112 113
##   ..$ : int [1:7] 49 50 76 107 109 115 131
##   ..$ : int [1:5] 60 68 74 93 99
##   ..$ : int [1:6] 28 70 100 102 110 125
##   ..$ : int [1:2] 108 160
##   ..$ : int [1:6] 61 65 74 93 103 116
##   ..$ : int [1:4] 44 51 111 118
##   ..$ : int [1:7] 28 45 59 66 101 102 119
##   ..$ : int [1:7] 18 33 34 120 121 122 126
##   ..$ : int [1:5] 32 44 51 56 57
##   ..$ : int [1:6] 8 10 11 32 48 122
##   ..$ : int [1:8] 60 61 65 68 121 124 126 129
##   ..$ : int [1:7] 54 62 82 92 120 126 128
##   ..$ : int [1:9] 17 31 34 46 49 50 64 124 129
##   ..$ : int [1:3] 63 113 130
##   ..$ : int [1:4] 94 100 133 134
##   ..$ : int [1:3] 60 99 109
##   ..$ : int [1:6] 48 56 98 104 105 114
##   ..$ : int [1:4] 47 48 122 127
##   ..$ : int [1:6] 53 62 75 83 92 126
##   ..$ : int [1:4] 53 61 82 126
##   ..$ : int [1:9] 87 112 113 117 155 234 237 238 269
##   ..$ : int [1:6] 104 114 135 168 175 179
##   ..$ : int [1:7] 110 133 136 172 182 187 239
##   ..$ : int [1:7] 84 112 117 134 180 186 237
##   ..$ : int [1:6] 107 115 131 146 173 241
##   ..$ : int [1:6] 118 132 140 144 170 174
##   ..$ : int [1:4] 51 57 118 132
##   ..$ : int [1:6] 52 55 58 97 106 131
##   ..$ : int [1:6] 53 62 75 82 98 103
##   ..$ : int [1:4] 65 68 99 116
##   ..$ : int [1:5] 35 78 95 117 134
##   ..$ : int [1:4] 94 96 112 117
##   ..$ : int [1:4] 63 95 112 117
##   ..$ : int [1:5] 55 59 91 101 106
##   ..$ : int [1:9] 47 48 54 62 80 92 103 105 177
##   ..$ : int [1:7] 60 65 79 93 109 116 137
##   .. [list output truncated]
##   ..- attr(*, "class")= chr "nb"
##   ..- attr(*, "region.id")= chr [1:318] "PEOPLE'S PARK" "BUKIT MERAH" "CHINATOWN" "PHILLIP" ...
##   ..- attr(*, "call")= language poly2nb(pl = data_by_subzones_sp)
##   ..- attr(*, "type")= chr "queen"
##   ..- attr(*, "sym")= logi TRUE
##  $ weights   :List of 318
##   ..$ : num [1:3] 0.584 1.007 0.946
##   ..$ : num [1:7] 1.201 1.163 0.977 1.007 1.073 ...
##   ..$ : num [1:10] 0.584 0.908 0.367 0.82 0.494 ...
##   ..$ : num [1:3] 1.021 0.372 0.165
##   ..$ : num [1:8] 0.908 1.021 0.932 1.081 1.1 ...
##   ..$ : num [1:6] 0.367 0.372 0.932 0.989 0.368 ...
##   ..$ : num [1:5] 0.541 0.966 0.762 0.376 1.106
##   ..$ : num [1:5] 0.169 0.129 0.149 0.2 0.698
##   ..$ : num [1:5] 1.201 0.541 0.743 0.62 0.441
##   ..$ : num [1:5] 1.081 0.169 0.309 0.703 1.11
##   ..$ : num [1:5] 0.1294 0.0227 0.023 0.3156 0.7902
##   ..$ : num [1:9] 1.007 0.82 0.989 0.966 1.039 ...
##   ..$ : num [1:6] 0.165 1.1 0.368 1.039 0.409 ...
##   ..$ : num [1:6] 1.163 0.743 0.497 0.249 0.449 ...
##   ..$ : num [1:5] 0.977 0.497 0.682 0.72 0.538
##   ..$ : num [1:8] 1.007 0.682 1.016 0.98 0.999 ...
##   ..$ : num [1:7] 0.762 0.62 0.441 0.249 1.003 ...
##   ..$ : num [1:6] 0.285 1.067 0.409 0.231 0.415 ...
##   ..$ : num [1:5] 0.606 0.356 0.266 1.177 0.67
##   ..$ : num [1:5] 0.494 0.587 0.356 0.323 0.383
##   ..$ : num [1:6] 0.509 0.606 0.704 1.182 0.493 ...
##   ..$ : num [1:6] 1.073 0.356 0.198 0.589 0.946 ...
##   ..$ : num [1:3] 0.941 0.587 0.549
##   ..$ : num [1:5] 0.663 0.581 0.356 0.549 0.562
##   ..$ : num [1:9] 0.957 0.376 0.441 0.449 0.266 ...
##   ..$ : num [1:5] 0.589 0.545 0.483 0.534 0.378
##   ..$ : num [1:5] 0.545 0.457 0.341 0.447 0.418
##   ..$ : num [1:6] 0.457 0.317 0.388 0.534 0.481 ...
##   ..$ : num [1:5] 1.016 0.17 0.256 0.505 0.669
##   ..$ : num [1:8] 0.98 0.483 0.341 0.317 0.17 ...
##   ..$ : num [1:6] 0.958 0.72 0.999 1.003 0.271 ...
##   ..$ : num [1:5] 0.0227 1.0886 0.5164 0 0.8009
##   ..$ : num [1:4] 0.369 0.333 0.219 0.166
##   ..$ : num [1:7] 1.027 1.03 0.231 0.369 0.471 ...
##   ..$ : num 0.0891
##   ..$ : num [1:2] 1.041 0.104
##   ..$ : num [1:8] 1.18 1.18 1.25 1.04 1.02 ...
##   ..$ : num [1:5] 0.599 0.323 0.493 1.017 0.295
##   ..$ : num [1:4] 0.149 0.023 1.001 0.335
##   ..$ : num [1:7] 0.67 0.946 0.534 0.447 0.104 ...
##   ..$ : num [1:9] 0.859 0.2 0.309 0.316 0.383 ...
##   ..$ : num [1:6] 0.946 0.935 1.106 1.359 0.537 ...
##   ..$ : num [1:6] 0.687 0.946 0.686 0.378 0.256 ...
##   ..$ : num [1:4] 1.2113 0.4937 0.4667 0.0696
##   ..$ : num [1:5] 0.388 0.186 0.452 0.363 0.541
##   ..$ : num [1:5] 0.323 1.615 0.405 0.143 0.179
##   ..$ : num [1:6] 1.085 0.452 0.058 0.809 0.624 ...
##   ..$ : num [1:9] 1.089 1.085 0.984 1.013 1.081 ...
##   ..$ : num [1:9] 0.538 0.89 0.271 0.418 0.431 ...
##   ..$ : num [1:5] 0.323 1.34 0.239 0.436 0.751
##   ..$ : num [1:6] 1.21 1.05 1.04 1.22 1 ...
##   ..$ : num [1:5] 0.418 0.826 0.128 0.309 0.212
##   ..$ : num [1:5] 0.0389 0.2759 0.2349 0.2843 0.429
##   ..$ : num [1:5] 0.452 0.225 0.297 0.609 0.457
##   ..$ : num [1:8] 0.505 0.467 0.452 0.826 0.583 ...
##   ..$ : num [1:6] 0.516 0.984 0.34 0.516 0.501 ...
##   ..$ : num [1:7] 0.494 1.052 0.34 0.503 0.302 ...
##   ..$ : num [1:5] 0.431 0.128 0.36 0.259 0.71
##   ..$ : num [1:5] 0.363 0.583 0.236 0.467 0.329
##   ..$ : num [1:9] 1.615 1.34 1.557 0.632 1.582 ...
##   ..$ : num [1:6] 0.0389 0.1405 1.1192 0.2171 0.4218 ...
##   ..$ : num [1:6] 0.225 0.149 0.213 0.124 0.644 ...
##   ..$ : num [1:4] 0.0904 0.2243 0.1089 0.6568
##   ..$ : num [1:7] 0.426 0.239 0.363 0.357 0.679 ...
##   ..$ : num [1:5] 1.557 0.23 1.102 0.376 0.218
##   ..$ : num [1:6] 0.534 0.324 0.622 0.271 0.424 ...
##   ..$ : num [1:2] 1.3 0.264
##   ..$ : num [1:6] 0.141 0.23 1.116 0.317 0.337 ...
##   ..$ : num [1:4] 0.4667 1.0443 0.0995 1.5138
##   ..$ : num [1:7] 0.481 0.541 0.236 0.324 0.502 ...
##   ..$ : num [1:7] 0.415 0.333 0.471 0.253 0.397 ...
##   ..$ : num [1:5] 0 0.0696 1.2212 0.5164 0.5028
##   ..$ : num [1:6] 0.698 0.703 0.79 0.801 1.013 ...
##   ..$ : num [1:8] 0.632 1.119 1.102 1.116 1.125 ...
##   ..$ : num [1:7] 0.297 0.149 0.13 0.163 0.216 ...
##   ..$ : num [1:9] 1.038 0.477 0.432 0.405 0.548 ...
##   ..$ : num [1:3] 0.0904 0.5679 1.1449
##   ..$ : num [1:4] 0.264 0.178 0.498 0.126
##   ..$ : num [1:3] 1.5816 0.0919 0.7371
##   ..$ : num [1:6] 1.081 0.501 0.797 0.621 0.167 ...
##   ..$ : num [1:4] 0.058 1.057 1.198 0.588
##   ..$ : num [1:6] 0.276 0.213 0.13 0.331 0.253 ...
##   ..$ : num [1:4] 0.235 0.217 0.331 0.355
##   ..$ : num [1:9] 0.529 0.589 0.421 0.738 0.822 ...
##   ..$ : num [1:6] 0.926 1.341 0.821 1.393 0.718 ...
##   ..$ : num [1:7] 0.868 0.884 0.843 0.918 0.973 ...
##   ..$ : num [1:7] 0.5288 0.1095 0.2938 0.1475 0.0891 ...
##   ..$ : num [1:6] 0.312 0.43 0.228 0.26 0.283 ...
##   ..$ : num [1:6] 1.342 0.643 1.29 0.888 1.076 ...
##   ..$ : num [1:4] 1 0.302 1.152 0.752
##   ..$ : num [1:6] 0.309 0.782 0.36 0.349 0.758 ...
##   ..$ : num [1:6] 0.284 0.124 0.163 0.253 0.686 ...
##   ..$ : num [1:4] 0.376 0.317 0.261 1.018
##   ..$ : num [1:5] 0.0891 0.264 0.0165 0.0489 0.1449
##   ..$ : num [1:4] 0.0165 0.2058 0.165 0.0547
##   ..$ : num [1:4] 0.224 0.206 0.289 0.204
##   ..$ : num [1:5] 0.829 0.467 0.349 0.686 0.74
##   ..$ : num [1:9] 0.809 0.459 0.609 0.644 0.797 ...
##   ..$ : num [1:7] 1.5594 0.2177 0.0919 0.261 0.7474 ...
##   .. [list output truncated]
##   ..- attr(*, "mode")= chr "general"
##   ..- attr(*, "glist")= chr [1:532] "list(c(0.584151939139242, 1.00709958433393, 0.946181057476831" "), c(1.20143119850206, 1.16313653026439, 0.976588240759307, 1.00707936495716, " "1.07349688787413, 0.957221684874255, 0.687165519601056), c(0.584151939139242, " "0.907709803954232, 0.367086587583076, 0.819989047758644, 0.49366050313275, " ...
##   ..- attr(*, "glistsym")= logi TRUE
##   .. ..- attr(*, "d")= num 0
##   ..- attr(*, "B")= logi TRUE
##  - attr(*, "class")= chr [1:2] "listw" "nb"
##  - attr(*, "region.id")= chr [1:318] "PEOPLE'S PARK" "BUKIT MERAH" "CHINATOWN" "PHILLIP" ...
##  - attr(*, "call")= language nb2listw(neighbours = data.nb, glist = lcosts, style = "B")

From the neighbour list summary shown earlier, we can see that the average number of links is about 6, i.e. each subzone is connected to six other subzones on average. Jurong Island and Bukom is an island subzone with only one link. We did not remove it from our dataframe, as it is a major hub for manufacturing as well as oil and gas production facilities.

In order to perform SKATER cluster analysis, we will find the minimum spanning tree of our weighted graph. The minimum spanning tree connects all the vertices together, without any cycles and with the minimum possible total edge weight. It is calculated using the mstree() function.

data.mst <- mstree(data.w)

We can examine the class and dimensions of the output.

class(data.mst)

## [1] "mst"    "matrix"

dim(data.mst)

## [1] 317   3

The output has 317 rows because a spanning tree over N nodes consists of (N-1) edges, and there are 318 subzones. Each row holds the two end nodes of an edge and the cost of that edge.

We can visualise the spanning tree by plotting it.

plot(data_by_subzones_sp, border=gray(.5))
plot.mst(data.mst, coordinates(data_by_subzones_sp), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE,label.areas = NULL)

Note that the number of edges has been reduced. This is because the graph is now acyclic: only the (N-1) edges of the minimum spanning tree remain.
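The reduction in edges can be reproduced on a small toy graph. The sketch below is only an illustration: the 4 x 4 grid, the random values and all object names starting with toy_ are assumptions and are not part of the analysis. It builds cost-weighted spatial weights in the same way as the nb2listw() call shown above and compares the number of links before and after mstree().

```r
library(spdep)

set.seed(1)  # toy values only; they carry no meaning

# Hypothetical stand-in for the subzones: a 4 x 4 grid with rook contiguity
toy_nb <- cell2nb(4, 4)
toy_data <- data.frame(a = rnorm(16), b = rnorm(16))

# Edge cost = dissimilarity between the attribute values of neighbouring cells
toy_costs <- nbcosts(toy_nb, toy_data)
toy_w <- nb2listw(toy_nb, glist = toy_costs, style = "B")

# Minimum spanning tree of the weighted neighbour graph
toy_mst <- mstree(toy_w)

n_links_before <- sum(card(toy_nb)) / 2  # undirected links in the full graph
n_links_after  <- nrow(toy_mst)          # edges kept by the spanning tree
c(before = n_links_before, after = n_links_after)
```

With 16 nodes, the tree keeps 15 of the 24 links in the grid, following the same (N-1) rule that gives 317 edges for the 318 subzones.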

7.2 Computing the clusters

The code below computes clusters using the SKATER method. In hierarchical clustering, we found that 4 was the optimum number of clusters. However, as the SKATER method employs spatial constraints, we will split the data into 6 clusters in order to avoid extremely big clusters. The third argument of skater() is the number of cuts made in the spanning tree, so 5 cuts produce 6 clusters.

clust <- skater(data.mst[,1:2], data_by_subzones.std, 5)
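The relationship between the number of cuts and the number of clusters can be checked on a toy example. This is a minimal sketch on an assumed 4 x 4 grid with random values (all toy_ names are hypothetical); it only shows that pruning the tree with ncuts = 5 leaves 6 groups, and that the ssw element has one entry for the uncut tree plus one per cut.

```r
library(spdep)

set.seed(1)  # toy values only; they carry no meaning

# Hypothetical 4 x 4 grid standing in for the subzones
toy_nb <- cell2nb(4, 4)
toy_data <- data.frame(a = rnorm(16), b = rnorm(16))

# Cost-weighted graph and its minimum spanning tree
toy_costs <- nbcosts(toy_nb, toy_data)
toy_w <- nb2listw(toy_nb, glist = toy_costs, style = "B")
toy_mst <- mstree(toy_w)

# Prune the tree with 5 cuts, as in the report
toy_clust <- skater(toy_mst[, 1:2], toy_data, ncuts = 5)

n_groups <- length(unique(toy_clust$groups))  # number of clusters formed
n_steps  <- length(toy_clust$ssw)             # uncut tree + one entry per cut
c(groups = n_groups, steps = n_steps)
```

To obtain k clusters, ncuts is therefore set to k - 1.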

The output of the above code is a skater object. We can examine it with the code below.

str(clust)

## List of 8
##  $ groups      : num [1:318] 1 1 1 1 1 1 3 1 3 1 ...
##  $ edges.groups:List of 6
##   ..$ :List of 3
##   .. ..$ node: num [1:258] 142 240 57 188 210 245 90 317 163 87 ...
##   .. ..$ edge: num [1:257, 1:3] 240 188 57 210 245 163 317 90 160 87 ...
##   .. ..$ ssw : num 115
##   ..$ :List of 3
##   .. ..$ node: num [1:12] 224 233 254 242 287 253 227 314 223 229 ...
##   .. ..$ edge: num [1:11, 1:3] 287 224 224 227 233 287 224 253 254 233 ...
##   .. ..$ ssw : num 3.21
##   ..$ :List of 3
##   .. ..$ node: num [1:17] 49 15 14 17 25 123 52 91 12 9 ...
##   .. ..$ edge: num [1:16, 1:3] 14 17 123 49 25 25 49 52 25 14 ...
##   .. ..$ ssw : num 5.7
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 148 164 138 139 170 118
##   .. ..$ edge: num [1:5, 1:3] 148 164 138 148 164 164 139 118 138 170 ...
##   .. ..$ ssw : num 5.25
##   ..$ :List of 3
##   .. ..$ node: num [1:14] 316 310 312 309 232 219 244 288 225 313 ...
##   .. ..$ edge: num [1:13, 1:3] 310 316 312 309 219 316 232 309 244 219 ...
##   .. ..$ ssw : num 5.52
##   ..$ :List of 3
##   .. ..$ node: num [1:11] 182 151 141 156 159 186 149 157 152 143 ...
##   .. ..$ edge: num [1:10, 1:3] 149 156 157 141 186 159 151 152 182 151 ...
##   .. ..$ ssw : num 4.66
##  $ not.prune   : NULL
##  $ candidates  : int [1:6] 1 2 3 4 5 6
##  $ ssto        : num 163
##  $ ssw         : num [1:6] 163 158 153 149 144 ...
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:318] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

The data has been split up into 6 parts, indicating 6 clusters. Each part consists of the nodes, the edges with their costs, and the within-cluster sum of dissimilarities (ssw). We can find out how the clusters have been assigned from the code below.

clusters <- clust$groups
clusters

##   [1] 1 1 1 1 1 1 3 1 3 1 1 3 1 3 3 3 3 1 3 1 1 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1 1 1 1 4 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 4 4 1 6 1 6 1 1 1 1 4
## [149] 6 6 6 6 1 1 1 6 6 1 6 1 1 1 1 4 1 1 1 1 1 4 1 1 1 1 1 1 1 1 1 1 1 6 1 1 1
## [186] 6 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 5 1 1 1 1 1 1 1 1 1 5 1 1 1 1 5 1 5 1
## [223] 2 2 5 1 2 1 2 1 1 5 2 1 1 1 1 1 1 1 1 2 1 5 1 1 1 1 1 1 2 1 2 2 5 1 1 1 1
## [260] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 5 1 1 1 1 1 1 1 1
## [297] 1 1 1 1 1 1 1 1 1 1 1 1 5 5 1 5 5 2 1 5 1 1

This is a vector which contains the cluster number for each subzone. As in the previous section, we will attach it to the data table and map it.
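Before mapping, it can help to check how evenly the subzones are spread across the clusters. The short sketch below uses the node counts reported for each group in the str(clust) output above; the object names sizes and share are introduced here for illustration only. On the real object, table(clust$groups) would give the same counts.

```r
# Number of nodes per cluster, copied from the str(clust) output above
sizes <- c("1" = 258, "2" = 12, "3" = 17, "4" = 6, "5" = 14, "6" = 11)

sum(sizes)  # 318, one entry per subzone

# Percentage of subzones that fall into each cluster
share <- round(100 * sizes / sum(sizes), 1)
share
```

Cluster one holds roughly four fifths of all subzones, which is worth keeping in mind when the clusters are interpreted.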

7.3 Visualising the clusters

groups_mat <- as.matrix(clust$groups)
data_by_subzones.std$SP_CLUSTER <- as.factor(groups_mat)
st_geometry(data_by_subzones.std)<-data_by_subzones_sf$geometry
qtm(data_by_subzones.std, "SP_CLUSTER")

7.4 Cluster Analysis

st_geometry(data_by_subzones.std) <- NULL
data_by_subzones.std$CLUSTER <- NULL
aggregate2 <- aggregate(data_by_subzones.std,by= list(data_by_subzones.std$SP_CLUSTER),FUN = "mean")
aggregate2$SP_CLUSTER <- NULL
aggregate2 <- aggregate2 %>%
  rename("SP_CLUSTER"=Group.1)

plot_data <- function(maindata,attribute){
  return(ggplot(maindata, aes_string(x="SP_CLUSTER",y=attribute, fill = "SP_CLUSTER")) + 
   geom_bar(stat="identity", position = "dodge",size=0.5) + 
    theme(legend.position = 'none')+
   scale_fill_brewer(palette = "Set3"))}
  
private_plot <- plot_data(aggregate2,"Private_properties")
shopping_plot <- plot_data(aggregate2,"Shopping_Infrastructures")
business_plot <- plot_data(aggregate2,"Businesses")
industry_plot <- plot_data(aggregate2,"Industries")
govt_plot <- plot_data(aggregate2,"Govt_institutions")
financial_plot <- plot_data(aggregate2,"Financials")
young_plot <- plot_data(aggregate2,"YOUNG")
aged_plot <- plot_data(aggregate2,"AGED")
active_plot <- plot_data(aggregate2,"ACTIVE")
density_plot <- plot_data(aggregate2,"DENSITY")
HDB1_2_plot <- plot_data(aggregate2,"HDB_1_and_2Room_Flats")
HDB3_4_plot <- plot_data(aggregate2,"HDB_3_and_4Room_Flats")
HDB5_plot <- plot_data(aggregate2,"HDB_5Room_and_Executive_Flats")
condo_plot <- plot_data(aggregate2,"Condominiums_and_Other_Apartments")
landed_plot <- plot_data(aggregate2,"Landed_Properties")
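The aggregation step used above can be illustrated on a small table. This is only a sketch: the column names are borrowed from the report, but the values and the toy and cluster_means objects are made up. It uses the formula interface of aggregate(), which keeps the grouping column under its own name, so no renaming of Group.1 is needed.

```r
# Hypothetical standardised values for five subzones in three clusters
toy <- data.frame(
  SP_CLUSTER = factor(c(1, 1, 2, 2, 3)),
  DENSITY    = c(0.2, 0.4, 1.0, 2.0, -0.5),
  AGED       = c(1.0, 3.0, 0.0, -1.0, 0.5)
)

# Mean of every remaining column within each cluster
cluster_means <- aggregate(. ~ SP_CLUSTER, data = toy, FUN = mean)
cluster_means
```

Each row of cluster_means is one cluster, which is the shape that plot_data() expects for its bar charts.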

To visualise the graphs, we arrange them in a grid and plot them.

ggarrange(private_plot, shopping_plot, business_plot, industry_plot, govt_plot, financial_plot,
          young_plot, aged_plot, active_plot, density_plot, HDB1_2_plot, HDB3_4_plot, HDB5_plot, condo_plot, landed_plot,
          ncol = 3, 
          nrow = 2)

## $`1`

## 
## $`2`

## 
## $`3`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

Interpretation

Cluster one is the biggest and consists of businesses and industries. Cluster two consists of residential areas. Cluster three is similar to cluster two and consists of residential infrastructure. However, it also contains government institutions, as it is located in the central area. Cluster four contains all the private properties, financial infrastructure, and government institutions, and is located in the eastern region of Singapore. Cluster five is located in the north-east and dominates in HDB 5-room flats, indicating that the residential population there enjoys bigger homes. Cluster six is a highly dense residential area located in western Singapore.

8. Conclusion

Hierarchical clustering is a better approach for socio-economic area analysis here, as residential areas, industrial parks, and government facilities are spread all around Singapore. The SKATER approach only groups subzones that are in close spatial proximity. To get better findings from this approach, we would need to increase the number of clusters.

assignment3

Amey Rathi

5/21/2020
