1. Aspatial/Geospatial Data Wrangling

1.1 The data

Aspatial

respopagesextod2011to2019.csv

Geospatial

MP14_SUBZONE_WEB_PL
Business (Provided)
Financial (Provided)
Govt_Embassy (Provided)
Private residentail (Provided)
Shopping (Provided)

1.2 Installing all the relevant packages

packages = c('rgdal', 'spdep', 'ClustGeo',  'tmap', 'sf', 'ggpubr', 'cluster', 'heatmaply', 'corrplot', 'psych', 'tidyverse','GISTools', 'tibble','ggplot2',"plotly",'factoextra')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
    }
  library(p,character.only = T)
}

2. Data Import and Prepatation

2.1 Importing geospatial data into R environment

Importing MP14_SUBZONE_WEB_PL layer and named is as mpsz

mpsz <- st_read(dsn = "data/geospatial", layer = "MP14_SUBZONE_WEB_PL")

## Reading layer `MP14_SUBZONE_WEB_PL' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 323 features and 15 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs

View the summary of mpsz

summary(mpsz)

##     OBJECTID       SUBZONE_NO               SUBZONE_N     SUBZONE_C   CA_IND 
##  Min.   :  1.0   Min.   : 1.000   ADMIRALTY      :  1   AMSZ01 :  1   N:274  
##  1st Qu.: 81.5   1st Qu.: 2.000   AIRPORT ROAD   :  1   AMSZ02 :  1   Y: 49  
##  Median :162.0   Median : 4.000   ALEXANDRA HILL :  1   AMSZ03 :  1          
##  Mean   :162.0   Mean   : 4.625   ALEXANDRA NORTH:  1   AMSZ04 :  1          
##  3rd Qu.:242.5   3rd Qu.: 6.500   ALJUNIED       :  1   AMSZ05 :  1          
##  Max.   :323.0   Max.   :17.000   ANAK BUKIT     :  1   AMSZ06 :  1          
##                                   (Other)        :317   (Other):317          
##          PLN_AREA_N    PLN_AREA_C               REGION_N   REGION_C 
##  BUKIT MERAH  : 17   BM     : 17   CENTRAL REGION   :134   CR :134  
##  QUEENSTOWN   : 15   QT     : 15   EAST REGION      : 30   ER : 30  
##  ANG MO KIO   : 12   AM     : 12   NORTH-EAST REGION: 48   NER: 48  
##  DOWNTOWN CORE: 12   DT     : 12   NORTH REGION     : 41   NR : 41  
##  TOA PAYOH    : 12   TP     : 12   WEST REGION      : 70   WR : 70  
##  HOUGANG      : 10   HG     : 10                                    
##  (Other)      :245   (Other):245                                    
##              INC_CRC      FMEL_UPD_D             X_ADDR          Y_ADDR     
##  00F5E30B5C9B7AD8:  1   Min.   :2014-12-05   Min.   : 5093   Min.   :19579  
##  013B509B8EDF15BE:  1   1st Qu.:2014-12-05   1st Qu.:21864   1st Qu.:31776  
##  01A4287FB060A0A6:  1   Median :2014-12-05   Median :28465   Median :35113  
##  029BD940F4455194:  1   Mean   :2014-12-05   Mean   :27257   Mean   :36106  
##  0524461C92F35D94:  1   3rd Qu.:2014-12-05   3rd Qu.:31674   3rd Qu.:39869  
##  05FD555397CBEE7A:  1   Max.   :2014-12-05   Max.   :50425   Max.   :49553  
##  (Other)         :317                                                       
##    SHAPE_Leng        SHAPE_Area                geometry  
##  Min.   :  871.5   Min.   :   39438   MULTIPOLYGON :323  
##  1st Qu.: 3709.6   1st Qu.:  628261   epsg:NA      :  0  
##  Median : 5211.9   Median : 1229894   +proj=tmer...:  0  
##  Mean   : 6524.4   Mean   : 2420882                      
##  3rd Qu.: 6942.6   3rd Qu.: 2106483                      
##  Max.   :68083.9   Max.   :69748299                      
##

Based on the content of the requirement, we can remove those irrelevant column by using subset() function and in this case we will only keeping SUBZONE_N, SHAPE_Area and geometry for future analysis.

mpsz <- subset(mpsz, select = -c(1:2,4:14))
mpsz

## Simple feature collection with 323 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=1.366666666666667 +lon_0=103.8333333333333 +k=1 +x_0=28001.642 +y_0=38744.572 +datum=WGS84 +units=m +no_defs
## First 10 features:
##          SUBZONE_N SHAPE_Area                       geometry
## 1     MARINA SOUTH  1630379.3 MULTIPOLYGON (((31495.56 30...
## 2     PEARL'S HILL   559816.2 MULTIPOLYGON (((29092.28 30...
## 3        BOAT QUAY   160807.5 MULTIPOLYGON (((29932.33 29...
## 4   HENDERSON HILL   595428.9 MULTIPOLYGON (((27131.28 30...
## 5          REDHILL   387429.4 MULTIPOLYGON (((26451.03 30...
## 6   ALEXANDRA HILL  1030378.8 MULTIPOLYGON (((25899.7 297...
## 7    BUKIT HO SWEE   551732.0 MULTIPOLYGON (((27746.95 30...
## 8      CLARKE QUAY   290184.7 MULTIPOLYGON (((29351.26 29...
## 9  PASIR PANJANG 1  1084792.3 MULTIPOLYGON (((20996.49 30...
## 10       QUEENSWAY   631644.3 MULTIPOLYGON (((24472.11 29...

Perform a class() function check to see the class of the mpsz.

class(mpsz)

## [1] "sf"         "data.frame"

Converting mpsz into SpatialPolygonsDataFrame

mpsz_sp <- as(mpsz,"Spatial")

Importing all urban functions layer and named it respectively.

embassy <- st_read(dsn = "data/geospatial", layer = "Govt_Embassy")

## Reading layer `Govt_Embassy' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 443 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6282 ymin: 1.24911 xmax: 103.9884 ymax: 1.45765
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

business <- st_read(dsn = "data/geospatial", layer = "Business")

## Reading layer `Business' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 6550 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6147 ymin: 1.24605 xmax: 104.0044 ymax: 1.4698
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

financial <- st_read(dsn = "data/geospatial", layer = "Financial")

## Reading layer `Financial' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3320 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6256 ymin: 1.24392 xmax: 103.9998 ymax: 1.46247
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

residential <- st_read(dsn = "data/geospatial", layer = "Private residential")

## Reading layer `Private residential' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 3604 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6295 ymin: 1.23943 xmax: 103.9749 ymax: 1.45379
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

shopping <- st_read(dsn = "data/geospatial", layer = "Shopping")

## Reading layer `Shopping' from data source `C:\IS415 Geospatial Analytics and Applications\take-home3\Take_home_Exe03\data\geospatial' using driver `ESRI Shapefile'
## Simple feature collection with 511 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.679 ymin: 1.24779 xmax: 103.9644 ymax: 1.4535
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs

Using head() function to view the details of the embassy,business,financial, residential and shopping.

head(embassy)

## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.8431 ymin: 1.27869 xmax: 103.8578 ymax: 1.31836
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       POI_ID SEQ_NUM FAC_TYPE                       POI_NAME      ST_NAME
## 1 1141424380       1     9993           CONSULATE SAN MARINO    CHURCH ST
## 2 1141424404       1     9993                   EMBASSY LAOS GOLDHILL PLZ
## 3 1141424402       1     9993               CONSULATE BELIZE     CECIL ST
## 4 1141424338       1     9993         GENERAL CONSULATE OMAN         <NA>
## 5 1192460871       1     9525                MND TOWER BLOCK   MAXWELL RD
## 6 1192460819       1     9525 MND AUDITORIUM & FUNCTION HALL   MAXWELL RD
##                   geometry
## 1 POINT (103.8494 1.28343)
## 2 POINT (103.8431 1.31836)
## 3 POINT (103.8493 1.28128)
## 4  POINT (103.8578 1.2999)
## 5 POINT (103.8456 1.27869)
## 6 POINT (103.8455 1.27883)

head(business)

## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.7777 ymin: 1.2972 xmax: 103.8858 ymax: 1.33844
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       POI_ID SEQ_NUM FAC_TYPE                        POI_NAME
## 1 1101180209       1     5000                       JOHN CHEN
## 2 1101180210       1     5000    TROPICAL INDUSTRIAL BUILDING
## 3 1101180211       1     5000 LIAN CHEONG INDUSTRIAL BUILDING
## 4 1101180212       1     5000  MALAYSIA GARMENT MANUFACTURERS
## 5 1101180213       1     5000                         UNIGOLD
## 6 1192316144       1     5000             NUS UNIVERSITY HALL
##               ST_NAME                 geometry
## 1           LITTLE RD POINT (103.8856 1.33841)
## 2           LITTLE RD POINT (103.8852 1.33832)
## 3           LITTLE RD POINT (103.8852 1.33834)
## 4                <NA> POINT (103.8855 1.33821)
## 5           LITTLE RD POINT (103.8858 1.33844)
## 6 LOWER KENT RIDGE RD  POINT (103.7777 1.2972)

head(financial)

## Simple feature collection with 6 features and 29 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.7189 ymin: 1.30211 xmax: 103.9224 ymax: 1.41695
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##      LINK_ID     POI_ID SEQ_NUM FAC_TYPE POI_NAME POI_LANGCD POI_NMTYPE
## 1 1170624361 1132324230       1     3578      UOB        ENG          B
## 2 1112103842 1132315471       1     3578     POSB        ENG          B
## 3 1112103842 1132315472       1     3578      UOB        ENG          B
## 4 1112103842 1132315473       1     3578     OCBC        ENG          B
## 5  864687596 1100784924       1     3578     OCBC        ENG          B
## 6  902073032 1132324170       1     6000  MAYBANK        ENG          B
##   POI_ST_NUM ST_NUM_FUL ST_NFUL_LC           ST_NAME ST_LANGCD POI_ST_SD
## 1        201       <NA>       <NA>      YISHUN AVE 2       ENG         L
## 2        375       <NA>       <NA>  COMMONWEALTH AVE       ENG         R
## 3        375       <NA>       <NA>  COMMONWEALTH AVE       ENG         R
## 4        375       <NA>       <NA>  COMMONWEALTH AVE       ENG         R
## 5       <NA>       <NA>       <NA> JURONG WEST ST 51       ENG         R
## 6        707       <NA>       <NA>     EAST COAST RD       ENG         L
##   ACC_TYPE   PH_NUMBER CHAIN_ID NAT_IMPORT PRIVATE IN_VICIN NUM_PARENT
## 1     <NA>        <NA>     6919          N       N        N          0
## 2     <NA>        <NA>     6918          N       N        N          0
## 3     <NA>        <NA>     6919          N       N        N          0
## 4     <NA>        <NA>     6920          N       N        N          0
## 5     <NA>        <NA>     6920          N       N        N          0
## 6     <NA> 18006292266     3657          N       N        N          0
##   NUM_CHILD PERCFRREF VANCITY_ID
## 1         0        NA          0
## 2         0        NA          0
## 3         0        NA          0
## 4         0        NA          0
## 5         0        60          0
## 6         0        NA          0
##                                                             ACT_ADDR ACT_LANGCD
## 1                                                               <NA>       <NA>
## 2                                                               <NA>       <NA>
## 3                                                               <NA>       <NA>
## 4                                                               <NA>       <NA>
## 5 501 JURONG WEST STREET 51                         SINGAPORE 640501        ENG
## 6                                                               <NA>       <NA>
##              ACT_ST_NAM ACT_ST_NUM ACT_ADMIN ACT_POSTAL
## 1                  <NA>       <NA>      <NA>       <NA>
## 2                  <NA>       <NA>      <NA>       <NA>
## 3                  <NA>       <NA>      <NA>       <NA>
## 4                  <NA>       <NA>      <NA>       <NA>
## 5 JURONG WEST STREET 51        501 SINGAPORE     640501
## 6                  <NA>       <NA>      <NA>       <NA>
##                   geometry
## 1  POINT (103.833 1.41695)
## 2 POINT (103.7989 1.30211)
## 3 POINT (103.7989 1.30211)
## 4 POINT (103.7989 1.30211)
## 5 POINT (103.7189 1.35016)
## 6 POINT (103.9224 1.31199)

head(residential)

## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.7754 ymin: 1.28119 xmax: 103.9084 ymax: 1.36908
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       POI_ID SEQ_NUM FAC_TYPE                       POI_NAME     ST_NAME
## 1 1132324282       1     9590 MARINA BAY SERVICED APARTMENTS MARINA BLVD
## 2 1132106212       1     9590                 SIN MING VILLE JALAN TODAK
## 3 1202668778       1     9590         GREENTOPS @ SIMS PLACE        <NA>
## 4 1099690099       1     9590    MOUNTBATTEN DAKOTA CRESCENT DAKOTA CRES
## 5  995195128       1     9590                    SINGA COURT JALAN SINGA
## 6 1176000954       1     9590            FORESQUE RESIDENCES    PETIR RD
##                   geometry
## 1 POINT (103.8526 1.28119)
## 2 POINT (103.8355 1.35361)
## 3 POINT (103.8797 1.31643)
## 4 POINT (103.8895 1.30834)
## 5 POINT (103.9084 1.33037)
## 6 POINT (103.7754 1.36908)

head(shopping)

## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.7127 ymin: 1.28458 xmax: 103.9041 ymax: 1.35375
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       POI_ID SEQ_NUM FAC_TYPE                 POI_NAME         ST_NAME
## 1 1132106213       1     6512          SIN MING CENTRE     SIN MING RD
## 2  801758392       1     6512              THE ADELPHI      COLEMAN ST
## 3  842821452       1     6512 BOON LAY SHOPPING CENTRE     BOON LAY PL
## 4 1193779191       1     6512            KATONG SQUARE   EAST COAST RD
## 5  801758399       1     6512           SIM LIM SQUARE ROCHOR CANAL RD
## 6 1001450091       1     6512    PEOPLE'S PARK COMPLEX         PARK RD
##                   geometry
## 1  POINT (103.836 1.35375)
## 2 POINT (103.8515 1.29124)
## 3 POINT (103.7127 1.34672)
## 4   POINT (103.9041 1.305)
## 5 POINT (103.8533 1.30341)
## 6  POINT (103.843 1.28458)

Using duplicated function to check any duplicate rows, since the POI_ID is an unique identifier, therefore we will perform a check on POI_ID instead of other variables.
In this data set, geometry is a requirement for our data analysis. Therefore, we will perform a NA check on geometry column. If there is any NA value in the geometry column, it will return TRUE, else return FALSE.

any(duplicated(embassy$POI_ID))

## [1] TRUE

any(is.na(embassy$geometry))

## [1] FALSE

any(duplicated(embassy$POI_ID))

## [1] TRUE

any(is.na(embassy$geometry))

## [1] FALSE

any(duplicated(business$POI_ID))

## [1] TRUE

any(is.na(business$geometry))

## [1] FALSE

any(duplicated(financial$POI_ID))

## [1] TRUE

any(is.na(financial$geometry))

## [1] FALSE

any(duplicated(residential$POI_ID))

## [1] TRUE

any(is.na(residential$geometry))

## [1] FALSE

any(duplicated(shopping$POI_ID))

## [1] TRUE

any(is.na(shopping$geometry))

## [1] FALSE

After performing the check on Business, there are 2 different FAC_TYPE which we can further split the data into an additional urban function called “Industry”. Based on the description of the business, we can identify that “FAC_TYPE” 9991 can be further classify as Industry and “FAC_TYPE” 5000 can be remain as Business.

industry <- business %>%
  filter(FAC_TYPE == "9991")
business <- business %>%
  filter(FAC_TYPE == "5000")

head(industry)

## Simple feature collection with 6 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 103.6242 ymin: 1.29325 xmax: 103.9632 ymax: 1.33283
## epsg (SRID):    4326
## proj4string:    +proj=longlat +datum=WGS84 +no_defs
##       POI_ID SEQ_NUM FAC_TYPE
## 1 1110491789       1     9991
## 2 1099992474       1     9991
## 3 1099992477       1     9991
## 4 1099992477       2     9991
## 5 1100464367       1     9991
## 6 1100464367       2     9991
##                                                       POI_NAME
## 1                                 KAKI BUKIT INDUSTRIAL ESTATE
## 2                            TERRACE FACTORIES TUAS SOUTH ST 5
## 3              TERRACE FACTORIES TUAS SOUTH ST 5 EAST ENTRANCE
## 4                            JTC TERRACE FACTORIES TUAS S ST 5
## 5 CHANGI BUSINESS PARK CHANGI BUSINESS PARK CENTRAL 1 ENTRANCE
## 6                               CHANGI BUSINESS PARK CENTRAL 1
##                          ST_NAME                 geometry
## 1               KAKI BUKIT AVE 1 POINT (103.9042 1.33269)
## 2                TUAS SOUTH ST 5 POINT (103.6242 1.29325)
## 3                TUAS SOUTH ST 5 POINT (103.6251 1.29333)
## 4                TUAS SOUTH ST 5 POINT (103.6251 1.29333)
## 5 CHANGI BUSINESS PARK CENTRAL 1 POINT (103.9633 1.33283)
## 6 CHANGI BUSINESS PARK CENTRAL 1 POINT (103.9633 1.33283)

any(duplicated(industry$POI_ID))

## [1] TRUE

any(is.na(industry$geometry))

## [1] FALSE

In the above code chunk, we identified that there are duplicated information, one way to filter them is by their SEQ_NUM. In this case, we will be keep those row with SEQ_NUM equal to 1.

embassy <- embassy %>%
  filter(SEQ_NUM == 1)
business <- business %>%
  filter(SEQ_NUM == 1)
financial <- financial %>%
  filter(SEQ_NUM == 1)
residential <- residential %>%
  filter(SEQ_NUM == 1)
shopping <- shopping %>%
  filter(SEQ_NUM == 1)
industry <- industry %>%
  filter(SEQ_NUM == 1)

As we can see the details of each urban fucntions from the code chunk of the view function, we identified that the CRS is not in SV21. Therefore, we will use st_transform function to transform the CRS to SV21 (+init=epsg:3414).

embassy <- st_transform(embassy,  CRS("+init=epsg:3414"))
business <- st_transform(business,  CRS("+init=epsg:3414"))
financial <- st_transform(financial,  CRS("+init=epsg:3414"))
residential <- st_transform(residential,  CRS("+init=epsg:3414"))
shopping <- st_transform(shopping,  CRS("+init=epsg:3414"))
industry <- st_transform(industry,  CRS("+init=epsg:3414"))

As we can see the class of each urban function is in sf format.

class(embassy)

## [1] "sf"         "data.frame"

class(business)

## [1] "sf"         "data.frame"

class(financial)

## [1] "sf"         "data.frame"

class(residential)

## [1] "sf"         "data.frame"

class(shopping)

## [1] "sf"         "data.frame"

class(industry)

## [1] "sf"         "data.frame"

Converting all the urban functions into Formal class SpatialPoints format.

embassy_sp <- as(embassy, "Spatial")
embassy_sp <- as(embassy_sp, "SpatialPoints")
business_sp <- as(business, "Spatial")
business_sp<- as(business_sp, "SpatialPoints")
financial_sp <- as(financial, "Spatial")
financial_sp<- as(financial_sp, "SpatialPoints")
residential_sp <- as(residential, "Spatial")
residential_sp<- as(residential_sp, "SpatialPoints")
shopping_sp <- as(shopping, "Spatial")
shopping_sp<- as(shopping_sp, "SpatialPoints")
industry_sp <- as(industry, "Spatial")
industry_sp<- as(industry_sp, "SpatialPoints")

Using tmap function to display the location of each point on the Subzone layer map in order to have an rough idea the location of each point. By plotting the map, we are also can have a rough idea which island or subzone do not have any data points. So in the further we can remove them.

 Business_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(business_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Business Outlet", title.size = 0.4)

 Embassy_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(embassy_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Embassy Outlet", title.size = 0.4)
 
  Industry_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(industry_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Industry Outlet", title.size = 0.4)
  
  Residential_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(residential_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Residential Outlet", title.size = 0.4)
   
  Shopping_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(shopping_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Shopping Outlet", title.size = 0.4)
  
  Financial_tm <- tm_shape(mpsz_sp) + 
  tm_polygons(col = "lightgrey") +
  tm_shape(financial_sp)+
  tm_compass()+
  tm_bubbles(size = 0.1,alpha = 0.5, col = "brown")+
  tm_layout(bg.color = "skyblue",title = "Location of Financial Outlet", title.size = 0.4)
  
  
  tmap_arrange(Business_tm,Embassy_tm,Industry_tm,Residential_tm,Shopping_tm,Financial_tm)

Using poly.counts function to count the number of points fall within each polygon.

Business <- poly.counts(business_sp,mpsz_sp)
Embassy <- poly.counts(embassy_sp,mpsz_sp)
Industry <- poly.counts(industry_sp,mpsz_sp)
Residential <- poly.counts(residential_sp,mpsz_sp)
Shopping <- poly.counts(shopping_sp,mpsz_sp)
Financial <-poly.counts(financial_sp,mpsz_sp)

After identifying, the number of points fall within each polygon, we will be using cbind function to combine the point data sets with the mpsz data set to make it into a single data set instead of 7 individual data sets.

mpsz_join <- cbind(mpsz,Business,Embassy,Industry,Residential,Shopping,Financial)

summary(mpsz_join)

##            SUBZONE_N     SHAPE_Area          Business         Embassy     
##  ADMIRALTY      :  1   Min.   :   39438   Min.   :  0.00   Min.   : 0.00  
##  AIRPORT ROAD   :  1   1st Qu.:  628261   1st Qu.:  0.00   1st Qu.: 0.00  
##  ALEXANDRA HILL :  1   Median : 1229894   Median :  2.00   Median : 0.00  
##  ALEXANDRA NORTH:  1   Mean   : 2420882   Mean   : 19.57   Mean   : 1.22  
##  ALJUNIED       :  1   3rd Qu.: 2106483   3rd Qu.: 13.50   3rd Qu.: 1.00  
##  ANAK BUKIT     :  1   Max.   :69748299   Max.   :303.00   Max.   :17.00  
##  (Other)        :317                                                      
##     Industry       Residential        Shopping        Financial    
##  Min.   :0.0000   Min.   :  0.00   Min.   : 0.000   Min.   :  0.0  
##  1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  1.0  
##  Median :0.0000   Median :  4.00   Median : 0.000   Median :  5.0  
##  Mean   :0.2941   Mean   : 11.08   Mean   : 1.418   Mean   : 10.2  
##  3rd Qu.:0.0000   3rd Qu.: 11.00   3rd Qu.: 1.000   3rd Qu.: 13.0  
##  Max.   :5.0000   Max.   :215.00   Max.   :27.000   Max.   :132.0  
##                                                                    
##           geometry  
##  MULTIPOLYGON :323  
##  epsg:NA      :  0  
##  +proj=tmer...:  0  
##                     
##                     
##                     
##

Perform a class check on mpsz_join after joining all the points data set to see is that any changes to the class.

class(mpsz_join)

## [1] "sf"         "data.frame"

2.2 Importing aspatial data into R environment

Importing the respopagesextod2011to2019 and named it as pop.

pop <- read_csv("data/aspatial/respopagesextod2011to2019.csv")

3. Exploratory Data Analysis (EDA)

Using filter() function, group_by() function, spread() function and mutate() function to combine the data set according to the requirement.

pop2019 <- pop %>%
  filter(Time == 2019) %>%
  group_by(SZ,TOD)%>%
  summarise(Pop = sum(Pop)) %>%
  ungroup() %>%

spread(TOD,Pop)%>%
  mutate(`HDB1-2RM` = `HDB 1- and 2-Room Flats`) %>%
  mutate(`HDB3-4RM` = `HDB 3-Room Flats` + `HDB 4-Room Flats`) %>%
  mutate(`HDB5RM-Ec` = `HDB 5-Room and Executive Flats`) %>%
  mutate(`Cond_Apt` = `Condominiums and Other Apartments`) %>%
  mutate(`Landed` = `Landed Properties`) %>%
  mutate(total_dwellers = `HDB1-2RM`+`HDB3-4RM`+`HDB5RM-Ec`+`Cond_Apt`+`Landed`)
pop2019 <- subset(pop2019, select = -c(2:9))  
  
pop2019_1 <- pop %>%
  filter(Time == 2019) %>%
  group_by(SZ,AG)%>%
  summarise(Pop = sum(Pop)) %>%
  ungroup() %>%
  
spread(AG,Pop)%>%
  mutate(Economy = `25_to_29` +`30_to_34` +`35_to_39`+`40_to_44`+`45_to_49`+`50_to_54`+`55_to_59`+`60_to_64`) %>%
    mutate(Young = `0_to_4`+`5_to_9`+`10_to_14`+`15_to_19`+`20_to_24`) %>%
    mutate(Aged = `65_to_69`+`70_to_74`+`75_to_79`+`80_to_84`+`85_to_89`+`90_and_over`) %>%
  mutate(total = Economy + Young+ Aged)
pop2019_1 <- subset(pop2019_1, select = -c(2:20))  

pop2019 <- left_join(pop2019_1,pop2019)
pop2019 <- pop2019 %>%
  mutate_at(.vars = vars(SZ), .funs = funs(toupper))

Using summary function to see the detail of pop2019

summary(pop2019)

##       SZ               Economy          Young            Aged      
##  Length:323         Min.   :    0   Min.   :    0   Min.   :    0  
##  Class :character   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Mode  :character   Median : 2840   Median : 1170   Median :  730  
##                     Mean   : 7386   Mean   : 3296   Mean   : 1806  
##                     3rd Qu.:10285   3rd Qu.: 4365   3rd Qu.: 3000  
##                     Max.   :79780   Max.   :34260   Max.   :18860  
##      total           HDB1-2RM       HDB3-4RM       HDB5RM-Ec    
##  Min.   :     0   Min.   :   0   Min.   :    0   Min.   :    0  
##  1st Qu.:     0   1st Qu.:   0   1st Qu.:    0   1st Qu.:    0  
##  Median :  4890   Median :   0   Median :    0   Median :    0  
##  Mean   : 12487   Mean   : 542   Mean   : 5953   Mean   : 3297  
##  3rd Qu.: 17065   3rd Qu.: 605   3rd Qu.: 9705   3rd Qu.: 3660  
##  Max.   :132900   Max.   :4700   Max.   :75000   Max.   :47960  
##     Cond_Apt         Landed        total_dwellers  
##  Min.   :    0   Min.   :    0.0   Min.   :     0  
##  1st Qu.:    0   1st Qu.:    0.0   1st Qu.:     0  
##  Median :  230   Median :    0.0   Median :  4880  
##  Mean   : 1827   Mean   :  773.9   Mean   : 12393  
##  3rd Qu.: 2835   3rd Qu.:  400.0   3rd Qu.: 17035  
##  Max.   :16770   Max.   :18820.0   Max.   :132480

Performing a class check on pop2019 data set

class(pop2019)

## [1] "tbl_df"     "tbl"        "data.frame"

using left_join function to combine 2 data set into a single data set and perform a NA check to make sure there is no missing fill.

final <- left_join(mpsz_join,pop2019, by = c( "SUBZONE_N" = "SZ"))
any(is.na(final))

## [1] FALSE

Perform a class check on final

class(final)

## [1] "sf"         "data.frame"

The unit of measurement of the values are number of total Population within each polygon for Economy, Young and Aged. Using these values directly will be bais by the underlying total number of population within each polygon.

The unit of measurement of the values are number of total dwellers within each polygon for HDB1-2RM_PR, HDB3-4RM_PR, HDB5RM-Ec_PR, Cond_Apt_PR and Landed_PR. Using these values directly will be bais by the underlying total number of dwellers within each polygon.

In order to overcome this problem, we will derive the penetration rate of each ICT variable by using the code chunk below.

final_derived <- final %>%
  mutate(Pop_density = `total`/`SHAPE_Area`) %>%
  mutate(Economy_PR = `Economy`/`total`) %>%
  mutate(Young_PR = `Young`/`total`) %>%
  mutate(Aged_PR = `Aged`/`total`) %>%
  mutate(`HDB1-2RM_PR` = `HDB1-2RM`/`total_dwellers`) %>%
  mutate(`HDB3-4RM_PR` = `HDB3-4RM`/`total_dwellers`) %>%
  mutate(`HDB5RM-Ec_PR` = `HDB5RM-Ec`/`total_dwellers`) %>%
  mutate(`Cond_Apt_PR` = `Cond_Apt`/`total_dwellers`) %>%
  mutate(`Landed_PR` = `Landed`/`total_dwellers`)

To view the summary of final_derived

summary(final_derived)

##   SUBZONE_N           SHAPE_Area          Business         Embassy     
##  Length:323         Min.   :   39438   Min.   :  0.00   Min.   : 0.00  
##  Class :character   1st Qu.:  628261   1st Qu.:  0.00   1st Qu.: 0.00  
##  Mode  :character   Median : 1229894   Median :  2.00   Median : 0.00  
##                     Mean   : 2420882   Mean   : 19.57   Mean   : 1.22  
##                     3rd Qu.: 2106483   3rd Qu.: 13.50   3rd Qu.: 1.00  
##                     Max.   :69748299   Max.   :303.00   Max.   :17.00  
##                                                                        
##     Industry       Residential        Shopping        Financial    
##  Min.   :0.0000   Min.   :  0.00   Min.   : 0.000   Min.   :  0.0  
##  1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  1.0  
##  Median :0.0000   Median :  4.00   Median : 0.000   Median :  5.0  
##  Mean   :0.2941   Mean   : 11.08   Mean   : 1.418   Mean   : 10.2  
##  3rd Qu.:0.0000   3rd Qu.: 11.00   3rd Qu.: 1.000   3rd Qu.: 13.0  
##  Max.   :5.0000   Max.   :215.00   Max.   :27.000   Max.   :132.0  
##                                                                    
##     Economy          Young            Aged           total       
##  Min.   :    0   Min.   :    0   Min.   :    0   Min.   :     0  
##  1st Qu.:    0   1st Qu.:    0   1st Qu.:    0   1st Qu.:     0  
##  Median : 2840   Median : 1170   Median :  730   Median :  4890  
##  Mean   : 7386   Mean   : 3296   Mean   : 1806   Mean   : 12487  
##  3rd Qu.:10285   3rd Qu.: 4365   3rd Qu.: 3000   3rd Qu.: 17065  
##  Max.   :79780   Max.   :34260   Max.   :18860   Max.   :132900  
##                                                                  
##     HDB1-2RM       HDB3-4RM       HDB5RM-Ec        Cond_Apt    
##  Min.   :   0   Min.   :    0   Min.   :    0   Min.   :    0  
##  1st Qu.:   0   1st Qu.:    0   1st Qu.:    0   1st Qu.:    0  
##  Median :   0   Median :    0   Median :    0   Median :  230  
##  Mean   : 542   Mean   : 5953   Mean   : 3297   Mean   : 1827  
##  3rd Qu.: 605   3rd Qu.: 9705   3rd Qu.: 3660   3rd Qu.: 2835  
##  Max.   :4700   Max.   :75000   Max.   :47960   Max.   :16770  
##                                                                
##      Landed        total_dwellers            geometry    Pop_density     
##  Min.   :    0.0   Min.   :     0   MULTIPOLYGON :323   Min.   :0.00000  
##  1st Qu.:    0.0   1st Qu.:     0   epsg:NA      :  0   1st Qu.:0.00000  
##  Median :    0.0   Median :  4880   +proj=tmer...:  0   Median :0.00591  
##  Mean   :  773.9   Mean   : 12393                       Mean   :0.01074  
##  3rd Qu.:  400.0   3rd Qu.: 17035                       3rd Qu.:0.01993  
##  Max.   :18820.0   Max.   :132480                       Max.   :0.04606  
##                                                                          
##    Economy_PR         Young_PR         Aged_PR        HDB1-2RM_PR     
##  Min.   :0.05263   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.57064   1st Qu.:0.2146   1st Qu.:0.1075   1st Qu.:0.00000  
##  Median :0.59023   Median :0.2531   Median :0.1523   Median :0.00000  
##  Mean   :0.59611   Mean   :0.2475   Mean   :0.1564   Mean   :0.04037  
##  3rd Qu.:0.60549   3rd Qu.:0.2862   3rd Qu.:0.1942   3rd Qu.:0.04794  
##  Max.   :0.90000   Max.   :0.5588   Max.   :0.9474   Max.   :0.71293  
##  NA's   :89        NA's   :89       NA's   :89       NA's   :95       
##   HDB3-4RM_PR      HDB5RM-Ec_PR     Cond_Apt_PR        Landed_PR     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.02675   1st Qu.:0.0000  
##  Median :0.4027   Median :0.1441   Median :0.14571   Median :0.0000  
##  Mean   :0.3550   Mean   :0.1641   Mean   :0.30746   Mean   :0.1330  
##  3rd Qu.:0.6067   3rd Qu.:0.2594   3rd Qu.:0.49172   3rd Qu.:0.1452  
##  Max.   :0.9482   Max.   :0.8364   Max.   :1.00000   Max.   :1.0000  
##  NA's   :95       NA's   :95       NA's   :95        NA's   :95

As we can see there are some colums with NA, however this NA is slighly different as when we open up the data set it shows NAn. The Nan is due the calculation. Therefore, we need to use is.nan() function to change to 0.

final_derived$Economy_PR[is.nan(final_derived$Economy_PR)] <- 0
final_derived$Young_PR[is.nan(final_derived$Young_PR)] <- 0
final_derived$Aged_PR[is.nan(final_derived$Aged_PR)] <- 0
final_derived$`HDB1-2RM_PR`[is.nan(final_derived$`HDB1-2RM_PR`)] <- 0
final_derived$`HDB3-4RM_PR`[is.nan(final_derived$`HDB3-4RM_PR`)] <- 0
final_derived$`HDB5RM-Ec_PR`[is.nan(final_derived$`HDB5RM-Ec_PR`)] <- 0
final_derived$Cond_Apt_PR[is.nan(final_derived$Cond_Apt_PR)] <- 0
final_derived$Landed_PR[is.nan(final_derived$Landed_PR)] <- 0

3.2 EDA using statistical graphics

Using ggplot function to plot the graph to see the distribution.

  Economy_PR <- ggplot(data=final_derived, 
             aes(x= `Economy_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Young_PR <- ggplot(data=final_derived, 
             aes(x= `Young_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Aged_PR <- ggplot(data=final_derived, 
             aes(x= `Aged_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Pop_density <- ggplot(data=final_derived, 
             aes(x= `Pop_density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `HDB1-2RM_PR` <- ggplot(data=final_derived, 
             aes(x= `HDB1-2RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `HDB3-4RM_PR` <- ggplot(data=final_derived, 
             aes(x= `HDB3-4RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

`HDB5RM-Ec_PR` <- ggplot(data=final_derived, 
             aes(x= `HDB5RM-Ec_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Cond_Apt_PR <- ggplot(data=final_derived, 
             aes(x= `Cond_Apt_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Landed_PR <- ggplot(data=final_derived, 
             aes(x= `Landed_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Business` <- ggplot(data=final_derived, 
             aes(x= `Business`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Embassy` <- ggplot(data=final_derived, 
             aes(x= `Embassy`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Financial` <- ggplot(data=final_derived, 
             aes(x= `Financial`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Industry` <- ggplot(data=final_derived, 
             aes(x= `Industry`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Residential` <- ggplot(data=final_derived, 
             aes(x= `Residential`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Shopping` <- ggplot(data=final_derived, 
             aes(x= `Shopping`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(Economy_PR, Young_PR, Aged_PR, Pop_density, `HDB1-2RM_PR`, `HDB3-4RM_PR`, `HDB5RM-Ec_PR`,Cond_Apt_PR,Landed_PR, Embassy,Business, Financial,Industry,Residential,Shopping,
          ncol = 3, 
          nrow = 5)

When we place the order differently, we can see the class turns up differently. The purpose of switch the position of the variable is to make the data set into data.frame format for cluster_vars.cor input as it only take in data set in data.frame format.

final1 <- left_join(pop2019,mpsz_join,by = c( "SZ" = "SUBZONE_N"))

class(final1)

## [1] "tbl_df"     "tbl"        "data.frame"

Will repeat as above steps

  final_derived1 <- final1 %>%
  mutate(Pop_density = `total`/`SHAPE_Area`) %>%
  mutate(Economy_PR = `Economy`/`total`) %>%
  mutate(Young_PR = `Young`/`total`) %>%
  mutate(Aged_PR = `Aged`/`total`) %>%
  mutate(`HDB1-2RM_PR` = `HDB1-2RM`/`total_dwellers`) %>%
  mutate(`HDB3-4RM_PR` = `HDB3-4RM`/`total_dwellers`) %>%
  mutate(`HDB5RM-Ec_PR` = `HDB5RM-Ec`/`total_dwellers`) %>%
  mutate(`Cond_Apt_PR` = `Cond_Apt`/`total_dwellers`) %>%
  mutate(`Landed_PR` = `Landed`/`total_dwellers`)

final_derived1$Economy_PR[is.nan(final_derived1$Economy_PR)] <- 0
final_derived1$Young_PR[is.nan(final_derived1$Young_PR)] <- 0
final_derived1$Aged_PR[is.nan(final_derived1$Aged_PR)] <- 0
final_derived1$`HDB1-2RM_PR`[is.nan(final_derived1$`HDB1-2RM_PR`)] <- 0
final_derived1$`HDB3-4RM_PR`[is.nan(final_derived1$`HDB3-4RM_PR`)] <- 0
final_derived1$`HDB5RM-Ec_PR`[is.nan(final_derived1$`HDB5RM-Ec_PR`)] <- 0
final_derived1$Cond_Apt_PR[is.nan(final_derived1$Cond_Apt_PR)] <- 0
final_derived1$Landed_PR[is.nan(final_derived1$Landed_PR)] <- 0

Using colnames() function to view the name of each columns.

colnames(final_derived1)

##  [1] "SZ"             "Economy"        "Young"          "Aged"          
##  [5] "total"          "HDB1-2RM"       "HDB3-4RM"       "HDB5RM-Ec"     
##  [9] "Cond_Apt"       "Landed"         "total_dwellers" "SHAPE_Area"    
## [13] "Business"       "Embassy"        "Industry"       "Residential"   
## [17] "Shopping"       "Financial"      "geometry"       "Pop_density"   
## [21] "Economy_PR"     "Young_PR"       "Aged_PR"        "HDB1-2RM_PR"   
## [25] "HDB3-4RM_PR"    "HDB5RM-Ec_PR"   "Cond_Apt_PR"    "Landed_PR"

To perform a switch column function to help in our further inputs. As we are putting all the relavent columns near to each other.

final_derived1 <- final_derived1[, c(1,2,3, 4,5,7,8,9,10,11,12,13,20,21,22,23,6,24,25,26,27,28,13,14,15,16,17,18)]
colnames(final_derived1)

##  [1] "SZ"             "Economy"        "Young"          "Aged"          
##  [5] "total"          "HDB3-4RM"       "HDB5RM-Ec"      "Cond_Apt"      
##  [9] "Landed"         "total_dwellers" "SHAPE_Area"     "Business"      
## [13] "Pop_density"    "Economy_PR"     "Young_PR"       "Aged_PR"       
## [17] "HDB1-2RM"       "HDB1-2RM_PR"    "HDB3-4RM_PR"    "HDB5RM-Ec_PR"  
## [21] "Cond_Apt_PR"    "Landed_PR"      "Business"       "Embassy"       
## [25] "Industry"       "Residential"    "Shopping"       "Financial"

To see the different of the class.

class(final_derived)

## [1] "sf"         "data.frame"

class(final_derived1)

## [1] "tbl_df"     "tbl"        "data.frame"

3.3 Correlation Analysis

Before we perform cluster analysis, it is important for us to endure that the input variables use are not highly correlated.To visualise and analyse the correlation of the input variables.If we input final_derived as our input value is will show “Error in cor(final_derived[, 14:28]) : ‘x’ must be numeric”

cluster_vars.cor = cor(final_derived1[,14:28])
corrplot.mixed(cluster_vars.cor,
         lower = "ellipse", 
               upper = "number",
               tl.pos = "lt",
               diag = "l",
               tl.col = "black",
              number.cex = .5)

#final_derived <- st_as_sf(final_derived)

4 Hierarchy Cluster Analysis

4.1 Extrating clustering variables

Since we identify Young_PR and Economy_PR are highly correlated. Therefore, we have to remove one of them. in this case we will be removing Young_PR from our input. However, in order to perform the selection we have to use st_set_geometry to drop the geomoetry columns.

cluster_vars <- final_derived %>%
  st_set_geometry(NULL)
cluster_vars <- dplyr :: select(cluster_vars, "SUBZONE_N","Economy_PR", "Aged_PR", "Pop_density", "HDB1-2RM_PR", "HDB3-4RM_PR","HDB5RM-Ec_PR","Cond_Apt_PR", "Landed_PR", "Business", "Embassy", "Financial", "Industry","Residential" ,"Shopping")

To view the summary of cluster_vars.

summary(cluster_vars)

##   SUBZONE_N           Economy_PR        Aged_PR        Pop_density     
##  Length:323         Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Mode  :character   Median :0.5753   Median :0.1240   Median :0.00591  
##                     Mean   :0.4319   Mean   :0.1133   Mean   :0.01074  
##                     3rd Qu.:0.5991   3rd Qu.:0.1822   3rd Qu.:0.01993  
##                     Max.   :0.9000   Max.   :0.9474   Max.   :0.04606  
##   HDB1-2RM_PR       HDB3-4RM_PR      HDB5RM-Ec_PR     Cond_Apt_PR     
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.0000   Median :0.0000   Median :0.04593  
##  Mean   :0.02850   Mean   :0.2506   Mean   :0.1159   Mean   :0.21703  
##  3rd Qu.:0.02899   3rd Qu.:0.5043   3rd Qu.:0.2072   3rd Qu.:0.30034  
##  Max.   :0.71293   Max.   :0.9481   Max.   :0.8364   Max.   :1.00000  
##    Landed_PR          Business         Embassy        Financial    
##  Min.   :0.00000   Min.   :  0.00   Min.   : 0.00   Min.   :  0.0  
##  1st Qu.:0.00000   1st Qu.:  0.00   1st Qu.: 0.00   1st Qu.:  1.0  
##  Median :0.00000   Median :  2.00   Median : 0.00   Median :  5.0  
##  Mean   :0.09390   Mean   : 19.57   Mean   : 1.22   Mean   : 10.2  
##  3rd Qu.:0.03887   3rd Qu.: 13.50   3rd Qu.: 1.00   3rd Qu.: 13.0  
##  Max.   :1.00000   Max.   :303.00   Max.   :17.00   Max.   :132.0  
##     Industry       Residential        Shopping     
##  Min.   :0.0000   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000  
##  Median :0.0000   Median :  4.00   Median : 0.000  
##  Mean   :0.2941   Mean   : 11.08   Mean   : 1.418  
##  3rd Qu.:0.0000   3rd Qu.: 11.00   3rd Qu.: 1.000  
##  Max.   :5.0000   Max.   :215.00   Max.   :27.000

Now, we are creating a new columns is called overview, the purpose of having this columns to sum up all the relevant columns together and to identify which are the subzone polygon with 0 value.

cluster_vars <- cluster_vars %>%
  group_by(SUBZONE_N) %>%
  mutate(overview = `Economy_PR` + `Aged_PR`+`Pop_density`+ `HDB1-2RM_PR`+ `HDB3-4RM_PR`+`HDB5RM-Ec_PR`+`Cond_Apt_PR`+ `Landed_PR`+ `Business`+ `Embassy`+ `Financial`+ `Industry`+`Residential`+`Shopping`)

After creating a new columns, the overview is added to the last columns.

colnames(cluster_vars)

##  [1] "SUBZONE_N"    "Economy_PR"   "Aged_PR"      "Pop_density"  "HDB1-2RM_PR" 
##  [6] "HDB3-4RM_PR"  "HDB5RM-Ec_PR" "Cond_Apt_PR"  "Landed_PR"    "Business"    
## [11] "Embassy"      "Financial"    "Industry"     "Residential"  "Shopping"    
## [16] "overview"

However, to help us save time we can re-arrange it next to SUBZONE_N.

cluster_vars <- cluster_vars[, c(16,1, 2, 3, 4,5,6,7,8,9,10,11,12,13,14,15,16)]
colnames(cluster_vars)

##  [1] "overview"     "SUBZONE_N"    "Economy_PR"   "Aged_PR"      "Pop_density" 
##  [6] "HDB1-2RM_PR"  "HDB3-4RM_PR"  "HDB5RM-Ec_PR" "Cond_Apt_PR"  "Landed_PR"   
## [11] "Business"     "Embassy"      "Financial"    "Industry"     "Residential" 
## [16] "Shopping"     "overview"

Since there are SUBZONEs with 0 value, in order to prevent any inaccurate analysis due to the polygons with 0 value. ASs there is no information for those polygons, therefore, we will be removing them from our analysis.

cluster_vars <- cluster_vars[-c(13,17,18,19,44,179,191,203,234,252,256,270,274,286,295,298,301,302,303,304,312,318,319),]

Not only removing from the cluster_vars data set, we also need to remove from mpsz. As we will be combining both data set together in the later stage. When the number of rows are different, we are unable to perform join function.

mpsz <- mpsz[-c(13,17,18,19,44,179,191,203,234,252,256,270,274,286,295,298,301,302,303,304,312,318,319),]

We will name the new data set as mpsz_sp1 which is in SpatialPolygon format.

mpsz_sp1 <- as(mpsz,"Spatial")

Removing the overview columns, since we will not be using it anymore.

cluster_vars <- subset(cluster_vars, select = -c(overview))

In the code chunk below, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method.

4.2 Data Standardisation

4.2.1 Min-Max standardisation

cluster_vars_SZ <- normalize(cluster_vars)

Next, we need to change the rows by SUBZONE name instead of row number by using the code chunk below

row.names(cluster_vars) <- cluster_vars$"SUBZONE_N"

Notice that the row number has been replaced into the SUBZONE name.

cluster_vars <- as.data.frame(cluster_vars)

Now, we will delete the SUBZONE_N field by using the code chunk below.

final_output <- dplyr::select(cluster_vars, c(2:15))
summary(final_output,10)

##    Economy_PR        Aged_PR         Pop_density         HDB1-2RM_PR     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.0000000   Min.   :0.00000  
##  1st Qu.:0.5363   1st Qu.:0.03755   1st Qu.:0.0002178   1st Qu.:0.00000  
##  Median :0.5804   Median :0.13316   Median :0.0070916   Median :0.00000  
##  Mean   :0.4650   Mean   :0.12199   Mean   :0.0115599   Mean   :0.03068  
##  3rd Qu.:0.6011   3rd Qu.:0.18705   3rd Qu.:0.0216070   3rd Qu.:0.03263  
##  Max.   :0.9000   Max.   :0.94737   Max.   :0.0460579   Max.   :0.71293  
##   HDB3-4RM_PR       HDB5RM-Ec_PR     Cond_Apt_PR        Landed_PR      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.03132   Median :0.0000   Median :0.06175   Median :0.00000  
##  Mean   :0.26982   Mean   :0.1247   Mean   :0.23367   Mean   :0.10110  
##  3rd Qu.:0.53305   3rd Qu.:0.2242   3rd Qu.:0.31995   3rd Qu.:0.05445  
##  Max.   :0.94815   Max.   :0.8364   Max.   :1.00000   Max.   :1.00000  
##     Business         Embassy         Financial         Industry     
##  Min.   :  0.00   Min.   : 0.000   Min.   :  0.00   Min.   :0.0000  
##  1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:  2.00   1st Qu.:0.0000  
##  Median :  3.00   Median : 0.000   Median :  5.50   Median :0.0000  
##  Mean   : 21.07   Mean   : 1.313   Mean   : 10.98   Mean   :0.3167  
##  3rd Qu.: 15.00   3rd Qu.: 1.000   3rd Qu.: 14.00   3rd Qu.:0.0000  
##  Max.   :303.00   Max.   :17.000   Max.   :132.00   Max.   :5.0000  
##   Residential        Shopping     
##  Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  1.00   1st Qu.: 0.000  
##  Median :  5.00   Median : 0.000  
##  Mean   : 11.93   Mean   : 1.527  
##  3rd Qu.: 11.25   3rd Qu.: 2.000  
##  Max.   :215.00   Max.   :27.000

In the code chunk below, normalize() of heatmaply package is used to stadardisation the clustering variables by using Min-Max method.
The summary() is then used to display the summary statistics of the standardised clustering variables.

final_output.std <- normalize(final_output)
summary(final_output.std)

##    Economy_PR        Aged_PR         Pop_density        HDB1-2RM_PR     
##  Min.   :0.0000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.5959   1st Qu.:0.03964   1st Qu.:0.004729   1st Qu.:0.00000  
##  Median :0.6449   Median :0.14056   Median :0.153971   Median :0.00000  
##  Mean   :0.5166   Mean   :0.12877   Mean   :0.250986   Mean   :0.04304  
##  3rd Qu.:0.6679   3rd Qu.:0.19744   3rd Qu.:0.469127   3rd Qu.:0.04577  
##  Max.   :1.0000   Max.   :1.00000   Max.   :1.000000   Max.   :1.00000  
##   HDB3-4RM_PR       HDB5RM-Ec_PR     Cond_Apt_PR        Landed_PR      
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.03303   Median :0.0000   Median :0.06175   Median :0.00000  
##  Mean   :0.28458   Mean   :0.1491   Mean   :0.23367   Mean   :0.10110  
##  3rd Qu.:0.56220   3rd Qu.:0.2681   3rd Qu.:0.31995   3rd Qu.:0.05445  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##     Business           Embassy          Financial          Industry      
##  Min.   :0.000000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.01515   1st Qu.:0.00000  
##  Median :0.009901   Median :0.00000   Median :0.04167   Median :0.00000  
##  Mean   :0.069527   Mean   :0.07725   Mean   :0.08316   Mean   :0.06333  
##  3rd Qu.:0.049505   3rd Qu.:0.05882   3rd Qu.:0.10606   3rd Qu.:0.00000  
##  Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##   Residential          Shopping      
##  Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.004651   1st Qu.:0.00000  
##  Median :0.023256   Median :0.00000  
##  Mean   :0.055504   Mean   :0.05654  
##  3rd Qu.:0.052326   3rd Qu.:0.07407  
##  Max.   :1.000000   Max.   :1.00000

After normalization, all the value are fall within 0 to 1 range.

4.2.2 Visualising the standardised clustering variables

ggplot function will be helping us to display each graph.

  Economy_PR <- ggplot(data=final_output.std, 
             aes(x= `Economy_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Aged_PR <- ggplot(data=final_output.std, 
             aes(x= `Aged_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Pop_density <- ggplot(data=final_output.std, 
             aes(x= `Pop_density`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `HDB1-2RM_PR` <- ggplot(data=final_output.std, 
             aes(x= `HDB1-2RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `HDB3-4RM_PR` <- ggplot(data=final_output.std, 
             aes(x= `HDB3-4RM_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

`HDB5RM-Ec_PR` <- ggplot(data=final_output.std, 
             aes(x= `HDB5RM-Ec_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Cond_Apt_PR <- ggplot(data=final_output.std, 
             aes(x= `Cond_Apt_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  Landed_PR <- ggplot(data=final_output.std, 
             aes(x= `Landed_PR`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Business` <- ggplot(data=final_output.std, 
             aes(x= `Business`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Embassy` <- ggplot(data=final_output.std, 
             aes(x= `Embassy`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

  `Financial` <- ggplot(data=final_output.std, 
             aes(x= `Financial`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Industry` <- ggplot(data=final_output.std, 
             aes(x= `Industry`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Residential` <- ggplot(data=final_output.std, 
             aes(x= `Residential`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")
  
  `Shopping` <- ggplot(data=final_output.std, 
             aes(x= `Shopping`)) +
  geom_histogram(bins=20, 
                 color="black", 
                 fill="light blue")

ggarrange(Economy_PR, Young_PR, Aged_PR, Pop_density, `HDB1-2RM_PR`, `HDB3-4RM_PR`, `HDB5RM-Ec_PR`,Cond_Apt_PR,Landed_PR, Embassy,Business, Financial,Industry,Residential,Shopping,
          ncol = 3, 
          nrow = 5)

4.3 Computing proximity matrix

To compute the proximity matrix using euclidean method.

proxmat <- dist(final_output.std,method = 'euclidean')

To list the content of proxmat for visual inspection.

4.4 Selecting the optimal clustering algorithm

To compute the agglomerative coefficients of all hierarchical clustering algorithms.

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

ac <- function(x) {
  agnes(final_output.std, method = x)$ac
}

map_dbl(m, ac)

##   average    single  complete      ward 
## 0.8711732 0.8126779 0.9100738 0.9798199

4.5 Computing hierarchical clustering

Since ward gives the highest value, therefore in the following code chunk will be using ward instead of others.

Performs hierarchical cluster analysis using ward.D method. The hierarchical clustering output is stored in an object of class hclust which describes the tree produced by the clustering process.

hclust_ward <- hclust(proxmat, method = 'ward.D')

Plot the tree by using plot() of R Graphics as shown in the code chunk below.

plot(hclust_ward, cex = 0.1, )

4.6 Computing number of clusters

In order to identify the number of clustering, we will be using fviz_nbclust() function to identify the optimal number of clusters. 10 clusters will be used in the suquence code chunk.

set.seed(123)
fviz_nbclust(final_output.std, hcut, nstart = 25,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

4.6.1 Interpreting the dendrograms

To draw the dendrogram with a border around the selected clusters by using rect.hclust() of R stats.

plot(hclust_ward, cex = 0.1)
rect.hclust(hclust_ward, k = 10, border = 2:5)

4.7 Visually-driven hierarchical clustering analysis

4.7.1 Transforming the data frame into a matrix

To transform final_output.std data frame into a data matrix.

final_output_mat <- data.matrix(final_output.std)

4.7.2 Plotting interactive cluster heatmap using heatmaply()

To build an interactive cluster heatmap.

heatmaply(normalize(final_output_mat),
          Colv=NA,
          dist_method = "euclidean",
          hclust_method = "ward.D",
          seriate = "OLO",
          colors = Blues,
          k_row = 10,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="Geographic Segmentation of Singapore Subzone by geodemographic and urban functions indicators",
          xlab = "Geodemographic and Urban Functions Indicators",
          ylab = "Planning Subzone of Singapore"
          )

4.8 Mapping the clusters formed

To derive a 10-cluster model.

groups <- as.factor(cutree(hclust_ward, k=10))

The code chunk below form the join in three steps:

the groups list object will be converted into a matrix;
cbind() is used to append groups matrix onto mpsz to produce an output simple feature object called sg_sf_cluster; and
rename of dplyr package is used to rename as.matrix.groups field as CLUSTER.

sg_sf_cluster <- cbind(mpsz, as.matrix(groups)) %>%
  rename(`CLUSTER`=`as.matrix.groups.`)

To plot the choropleth map showing the cluster formed.

tm_shape(sg_sf_cluster)+
  tm_fill(col = "CLUSTER")+
  tm_borders(col = "black") +
  tm_layout(outer.bg.color = "lightblue", legend.text.size = 0.55, title = "Hierarchical Clustering", title.size = 1)

4.9 Analysis Result

Using st_set_geometry to drop the geometry and named it as sg_sf_cluster_all

sg_sf_cluster_all <- sg_sf_cluster %>%
  st_set_geometry(NULL)

To perform a left_join with cluster_vars_SZ and the cluster_vars_SZ already in standardised format.

sg_sf_cluster_all <- left_join(sg_sf_cluster_all, cluster_vars_SZ)

Remane all the columns names

sg_sf_cluster_all <- sg_sf_cluster_all %>% 
  rename("Cl_Economy_PR"= "Economy_PR","Cl_Aged_PR" = "Aged_PR","Cl_Pop_density" = "Pop_density", "Cl_HDB1-2RM_PR"= "HDB1-2RM_PR" , "Cl_HDB3-4RM_PR" = "HDB3-4RM_PR","Cl_HDB5RM-Ec_PR" = "HDB5RM-Ec_PR","Cl_Cond_Apt_PR" = "Cond_Apt_PR", "Cl_Landed_PR" = "Landed_PR", "Cl_Business" = "Business", "Cl_Embassy" ="Embassy" , "Cl_Financial" = "Financial", "Cl_Industry" = "Industry","Cl_Residential" = "Residential" ,"Cl_Shopping" = "Shopping" )

Using pivot_longer() function to pivot the data set.

sg_sf_cluster_all <- sg_sf_cluster_all %>%
  pivot_longer( cols = starts_with("Cl_"),
    names_to = "variable",
    values_to = "value"
  )

Using facet_wrap to display the data sets in geom_col.

ggplot(data = sg_sf_cluster_all, aes(y = variable, x = value )) + 
  geom_col() +
  facet_wrap( ~ CLUSTER, nrow = 3,ncol = 4, scales = "free_x") +
  theme(strip.text.x = element_text(size = 9,face="bold"), axis.text.y = element_text(size = 5,face="bold"), axis.text.x = element_text(size = 7,face="bold"),panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",size = 2, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + 
  ggtitle("Detail of Geodemographic and Urban Functions Indicators for each cluster")

Using facet_wrap to display the data sets in geom_boxplot.

ggplot(data = sg_sf_cluster_all, aes(y=variable, x=value), fill = variable) + 
             geom_boxplot()+
  facet_wrap( ~ CLUSTER, nrow = 3,ncol = 4) +
  theme(strip.text.x = element_text(size = 12), axis.text.y = element_text(size = 5,face="bold"), axis.text.x = element_text(size = 7,face="bold"),panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",size = 2, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + 
  ggtitle("Detail of Geodemographic and Urban Functions Indicators for each cluster")

4.10 Interpretation

Cluster 1: Cluster 1 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business is the main player in this cluster.

Cluster 2: Cluster 2 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that Economy group and HDB3-4RM are the main player in this cluster.

Cluster 3: Cluster 3 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that econmoy and cond_apt are the main player in this cluster.

Cluster 4: Cluster 4 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that HDB3-4RM, Population density and economy are the main player in this cluster.

Cluster 5: Cluster 5 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that population density, economy, HDB5RM-Ec and HDB3-4Rm are the main player in this cluster.

Cluster 6: Cluster 6 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that economy and cond_apt are the main player in this cluster.

Cluster 7: Cluster 7 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business and industry are the main player in this cluster.

Cluster 8: Cluster 8 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that shapping, embassy and financial are the main player in this cluster.

Cluster 9: Cluster 9 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that economy and industry are the main player in this cluster.

Cluster 10: Cluster 10 is represented by green colour and by looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that landed and economy group are the main player in this cluster.

5. Spatially Constrained Clustering - SKATER approach

5.1 Computing Neighbour List

To compute the neighbours list from polygon list.

mpsz.nb <- poly2nb(mpsz_sp1)
summary(mpsz.nb)

## Neighbour list object:
## Number of regions: 300 
## Number of nonzero links: 1758 
## Percentage nonzero weights: 1.953333 
## Average number of links: 5.86 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 15 
##  1  5 13 34 79 74 47 31 12  1  1  1  1 
## 1 least connected region:
## 16 with 1 link
## 1 most connected region:
## 313 with 15 links

We can plot the neighbours list on mpsz_sp1

plot(mpsz_sp1, border=grey(.5))
plot(mpsz.nb, coordinates(mpsz_sp1), col="blue", add=TRUE)

5.2 Computing minimum spanning tree

5.2.1 Calculating edge costs

To compute the cost of each edge.

lcosts <- nbcosts(mpsz.nb, final_output)

For each observation, this gives the pairwise dissimilarity between its values on the five variables and the values for the neighbouring observation (from the neighbour list). Basically, this is the notion of a generalised weight for a spatial weights matrix.

Next, We will incorporate these costs into a weights object in the same way as we did in the calculation of inverse of distance weights. In other words, we convert the neighbour list to a list weights object by specifying the just computed lcosts as the weights.

Note that we specify the style as B to make sure the cost values are not row-standardised.

mpsz.w <- nb2listw(mpsz.nb, lcosts, style="B")
summary(mpsz.w)

## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 300 
## Number of nonzero links: 1758 
## Percentage nonzero weights: 1.953333 
## Average number of links: 5.86 
## Link number distribution:
## 
##  1  2  3  4  5  6  7  8  9 10 11 12 15 
##  1  5 13 34 79 74 47 31 12  1  1  1  1 
## 1 least connected region:
## 16 with 1 link
## 1 most connected region:
## 313 with 15 links
## 
## Weights style: B 
## Weights constants summary:
##     n    nn      S0       S1        S2
## B 300 90000 75606.3 14121307 137048363

5.2.2 Computing minimum spanning tree

The minimum spanning tree is computed by mean of the mstree()

mpsz.mst <- mstree(mpsz.w)

To check its class and dimension

class(mpsz.mst)

## [1] "mst"    "matrix"

dim(mpsz.mst)

## [1] 299   3

Display the content of mpsz.mst

head(mpsz.mst)

##      [,1] [,2]      [,3]
## [1,]   81   74 27.073973
## [2,]   81  119 30.166206
## [3,]  119  161  5.291503
## [4,]  161  117  5.099020
## [5,]  117  118  5.000000
## [6,]  117   15 27.000000

plot(mpsz_sp1, border=gray(.5))
plot.mst(mpsz.mst, coordinates(mpsz_sp1), 
     col="blue", cex.lab=0.7, cex.circles=0.005, add=TRUE)

5.3 Computing spatially constrained clusters using SKATER method

To compute the spatially constrained cluster

clust10 <- skater(mpsz.mst[,1:2], final_output, 9)

Examine its contents

str(clust10)

## List of 8
##  $ groups      : num [1:300] 4 4 4 4 4 4 4 4 4 4 ...
##  $ edges.groups:List of 10
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 81 88 110 99 75 74
##   .. ..$ edge: num [1:5, 1:3] 99 88 81 81 110 75 110 88 74 99 ...
##   .. ..$ ssw : num 106
##   ..$ :List of 3
##   .. ..$ node: num [1:7] 153 138 107 89 106 84 135
##   .. ..$ edge: num [1:6, 1:3] 153 107 138 106 89 89 138 89 107 84 ...
##   .. ..$ ssw : num 258
##   ..$ :List of 3
##   .. ..$ node: num [1:5] 132 178 152 144 108
##   .. ..$ edge: num [1:4, 1:3] 132 178 152 132 178 ...
##   .. ..$ ssw : num 223
##   ..$ :List of 3
##   .. ..$ node: num [1:261] 54 49 57 32 48 23 213 18 102 52 ...
##   .. ..$ edge: num [1:260, 1:3] 57 32 48 23 18 213 102 220 52 17 ...
##   .. ..$ ssw : num 5259
##   ..$ :List of 3
##   .. ..$ node: num [1:4] 113 165 162 158
##   .. ..$ edge: num [1:3, 1:3] 165 113 165 162 165 ...
##   .. ..$ ssw : num 206
##   ..$ :List of 3
##   .. ..$ node: num [1:2] 283 300
##   .. ..$ edge: num [1, 1:3] 283 300 42.1
##   .. ..$ ssw : num 42.1
##   ..$ :List of 3
##   .. ..$ node: num 221
##   .. ..$ edge: num[0 , 1:3] 
##   .. ..$ ssw : num 0
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 272 275 173 294 70 282
##   .. ..$ edge: num [1:5, 1:3] 272 272 275 173 272 275 282 294 70 173 ...
##   .. ..$ ssw : num 213
##   ..$ :List of 3
##   .. ..$ node: num [1:2] 160 105
##   .. ..$ edge: num [1, 1:3] 160 105 23.7
##   .. ..$ ssw : num 23.7
##   ..$ :List of 3
##   .. ..$ node: num [1:6] 53 43 40 95 121 128
##   .. ..$ edge: num [1:5, 1:3] 121 95 40 53 43 128 121 95 43 40 ...
##   .. ..$ ssw : num 135
##  $ not.prune   : NULL
##  $ candidates  : int [1:10] 1 2 3 4 5 6 7 8 9 10
##  $ ssto        : num 11218
##  $ ssw         : num [1:10] 11218 9371 8666 8195 7723 ...
##  $ crit        : num [1:2] 1 Inf
##  $ vec.crit    : num [1:300] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "class")= chr "skater"

To check the cluster assignment

ccs10 <- clust10$groups
ccs10

##   [1]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
##  [26]  4  4  4  4  4  4  4  4  4  4  4  4  4  4 10  4  4 10  4  4  4  4  4  4  4
##  [51]  4  4 10  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  8  4  4  4  1  1
##  [76]  4  4  4  4  4  1  4  4  2  4  4  4  1  2  4  4  4  4  4 10  4  4  4  1  4
## [101]  4  4  4  4  9  2  2  3  4  1  4  4  5  4  4  4  4  4  4  4 10  4  4  4  4
## [126]  4  4 10  4  4  4  3  4  4  2  4  4  2  4  4  4  4  4  3  4  4  4  4  4  4
## [151]  4  3  2  4  4  4  4  5  4  9  4  5  4  4  5  4  4  4  4  4  4  4  8  4  4
## [176]  4  4  3  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [201]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  7  4  4  4  4
## [226]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
## [251]  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  8  4  4  8
## [276]  4  4  4  4  4  4  8  6  4  4  4  4  4  4  4  4  4  4  8  4  4  4  4  4  6

We can find out how many observations are in each cluster by means of the table command. Parenthetially, we can also find this as the dimension of each vector in the lists contained in edges.groups.

table(ccs10)

## ccs10
##   1   2   3   4   5   6   7   8   9  10 
##   6   7   5 261   4   2   1   6   2   6

Plot the pruned tree that shows the ten clusters on top of the subzone area.

plot(mpsz_sp1, border=gray(.5))
plot(clust10, coordinates(mpsz_sp1), cex.lab=.7,
     groups.colors=c("red","green","blue", "brown", "pink","yellow","orange","purple","grey","black","coral"), cex.circles=0.005, add=TRUE)

5.4 Visualising the clusters in choropleth map

groups_mat <- as.matrix(clust10$groups)
mpsz_sf_spatialcluster <- cbind(sg_sf_cluster, as.factor(groups_mat)) %>%
  rename(`SP_CLUSTER`=`as.factor.groups_mat.`)

tm_shape(mpsz_sf_spatialcluster)+
  tm_fill(col = "SP_CLUSTER")+
  tm_borders(col = "black") +
  tm_layout(outer.bg.color = "lightblue", legend.text.size = 0.55, title = "Spatially Constrained Clustering", title.size = 1)

5.5 Analysis Result

Using st_set_geometry to drop the geometry and named it as mpsz_sf_spatialcluster_all

mpsz_sf_spatialcluster_all <- mpsz_sf_spatialcluster %>%
  st_set_geometry(NULL)

To perform a left_join with cluster_vars_SZ and the cluster_vars_SZ already in standardised format.

mpsz_sf_spatialcluster_all <- left_join(mpsz_sf_spatialcluster_all, cluster_vars_SZ)

Remane all the columns names

mpsz_sf_spatialcluster_all <- mpsz_sf_spatialcluster_all %>% 
  rename("Cl_Economy_PR"= "Economy_PR","Cl_Aged_PR" = "Aged_PR","Cl_Pop_density" = "Pop_density", "Cl_HDB1-2RM_PR"= "HDB1-2RM_PR" , "Cl_HDB3-4RM_PR" = "HDB3-4RM_PR","Cl_HDB5RM-Ec_PR" = "HDB5RM-Ec_PR","Cl_Cond_Apt_PR" = "Cond_Apt_PR", "Cl_Landed_PR" = "Landed_PR", "Cl_Business" = "Business", "Cl_Embassy" ="Embassy" , "Cl_Financial" = "Financial", "Cl_Industry" = "Industry","Cl_Residential" = "Residential" ,"Cl_Shopping" = "Shopping" )

Using pivot_longer() function to pivot the data set.

mpsz_sf_spatialcluster_all <- mpsz_sf_spatialcluster_all %>%
  pivot_longer( cols = starts_with("Cl_"),
    names_to = "variable",
    values_to = "value"
  )

Using facet_wrap to display the data sets in geom_col.

ggplot(data = mpsz_sf_spatialcluster_all, aes(y = variable, x = value )) + 
  geom_col() +
  facet_wrap( ~ SP_CLUSTER, nrow = 3,ncol = 4, scales = "free_x") +
  theme(strip.text.x = element_text(size = 9,face="bold"), axis.text.y = element_text(size = 5,face="bold"), axis.text.x = element_text(size = 7,face="bold"),panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",size = 2, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + 
  ggtitle("Detail of Geodemographic and Urban Functions Indicators for each cluster")

Using facet_wrap to display the data sets in geom_boxplot.

ggplot(data = mpsz_sf_spatialcluster_all, aes(y=variable, x=value), fill = variable) + 
             geom_boxplot()+
  facet_wrap( ~ SP_CLUSTER, nrow = 3,ncol = 4) +
  theme(strip.text.x = element_text(size = 12), axis.text.y = element_text(size = 5,face="bold"), axis.text.x = element_text(size = 7,face="bold"),panel.background = element_rect(fill = "#BFD5E3", colour = "#6D9EC1",size = 2, linetype = "solid"),
  panel.grid.major = element_line(size = 0.5, linetype = 'solid',
                                colour = "white"), 
  panel.grid.minor = element_line(size = 0.25, linetype = 'solid',
                                colour = "white")) + 
  ggtitle("Detail of Geodemographic and Urban Functions Indicators for each cluster")

5.6 Interpretation

Cluster 1: Cluster 1 is represented by green colour and as we can see the location of cluster 1 is located around Jurong West area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business and industry are the main player in this cluster and the rests are close to 0.

Cluster 2: Cluster 2 is represented by yellow colour and as we can see the location of cluster 3 is located around Jurong East and Clementi areas. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business is the main player in this cluster and the rests are close to 0.

Cluster 3: Cluster 3 is represented by purple colour and as we can see the location of cluster 3 is located around CBD and town areas. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business, economy group and indsutry are the main player in this cluster. However, it is not only limit to those mentioned groups, but the rests is not as signification as the mentioned threes

Cluster 4: Cluster 4 is represented by red colour and as we can see the location of cluster 4 almost entire Singapore. As we know Singapore is very good at geodemographic and urban function planning. Therefore, we are not surprise that when most of the subzone are fall within the same cluster. However, in this cluster there some singificate definately more others such as Economy group, HDB3-4RM,popualtion density and cond_Apt.

Cluster 5: Cluster 5 is represented by blue colour and as we can see the location of cluster 4 is located around Kallang area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that Residential, Economy group and Cond_apt are the main player in this cluster.

Cluster 6: Cluster 6 is represented by orange colour and as we can see the location of cluster 5 Woodlands area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that Business and Industry are the main player in this cluster.

Cluster 7: Cluster 7 is represented by green colour and as we can see the location of cluster 7 is located around Downtown East area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that Aged group and Business are the main player in this cluster.

Cluster 8: Cluster 8 is represented by pink colour and as we can see the location of cluster 8 is located Choa Chu Kang and Tuas areas. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that business and Economy group are the main player in this cluster.

Cluster 9: Cluster 9 is represented by blue colour and as we can see the location of cluster 9 is located around Tangling area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that Residential, Economy group and Embassy are the main player in this cluster.

Cluster 10: Cluster 10 is represented by blue colour and as we can see the location of cluster 10 town area. By looking at detail of Geodemographic and Urban Functions Indicators for each cluster, we can see that shopping, HDB3-4RM, financial, Residential, Economy group and Embassy are the main player in this cluster.

IS415 Take Home Exe03

Lu Zhimao

5/21/2020

1. Aspatial/Geospatial Data Wrangling

1.1 The data

Aspatial

Geospatial

1.2 Installing all the relevant packages

2. Data Import and Prepatation

2.1 Importing geospatial data into R environment

2.2 Importing aspatial data into R environment

3. Exploratory Data Analysis (EDA)

3.2 EDA using statistical graphics

3.3 Correlation Analysis

4 Hierarchy Cluster Analysis

4.1 Extrating clustering variables

4.2 Data Standardisation

4.2.1 Min-Max standardisation

4.2.2 Visualising the standardised clustering variables

4.3 Computing proximity matrix

4.4 Selecting the optimal clustering algorithm

4.5 Computing hierarchical clustering

4.6 Computing number of clusters

4.6.1 Interpreting the dendrograms

4.7 Visually-driven hierarchical clustering analysis

4.7.1 Transforming the data frame into a matrix

4.7.2 Plotting interactive cluster heatmap using heatmaply()

4.8 Mapping the clusters formed

4.9 Analysis Result

4.10 Interpretation

5. Spatially Constrained Clustering - SKATER approach

5.1 Computing Neighbour List

5.2 Computing minimum spanning tree

5.2.1 Calculating edge costs

5.2.2 Computing minimum spanning tree

5.3 Computing spatially constrained clusters using SKATER method

5.4 Visualising the clusters in choropleth map

5.5 Analysis Result

5.6 Interpretation